Pull cgroup fixes from Tejun Heo:
- cpuset fixes:
- Partition invalidation could return CPUs still in use by sibling
partitions, producing overlapping effective_cpus
- cpuset_can_attach() over-reserved DL bandwidth on moves that
stayed within the same root domain
- Pending DL migration state leaked into later attaches when a
later can_attach() check failed
- Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can
allocate from any node and exit quickly
- dmem: propagate -ENOMEM instead of spinning forever when the fallback
pool allocation also fails
- selftests/cgroup: percpu test error-path leak, bogus numeric
comparison of cpuset strings, and a zero-length read() that silently
passed OOM-kill tests
* tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Return only actually allocated CPUs during partition invalidation
selftests/cgroup: Fix error path leaks in test_percpu_basic
cgroup/cpuset: Reserve DL bandwidth only for root-domain moves
cgroup/cpuset: Reset DL migration state on can_attach() failure
selftests/cgroup: Fix string comparison in write_test
selftests/cgroup: Fix cg_read_strcmp() empty string comparison
cgroup/dmem: Return -ENOMEM on failed pool preallocation
cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
In update_parent_effective_cpumask() with partcmd_invalidate, the CPUs
to return to the parent are computed as:
adding = cpumask_and(tmp->addmask, xcpus, parent->effective_xcpus);
where xcpus = user_xcpus(cs) which returns cs->exclusive_cpus (if set)
or cs->cpus_allowed. When exclusive_cpus is not set, user_xcpus(cs) can
contain CPUs that were never actually granted to the partition due to
sibling exclusion in compute_excpus(). Consequently, the invalidation
may return CPUs to the parent that remain in use by sibling partitions,
causing overlapping effective_cpus and triggering the
WARN_ON_ONCE(1) in generate_sched_domains().
Use cs->effective_xcpus instead, which reflects the CPUs actually
granted to this partition.
Reproducer (on a 4-CPU machine):
cd /sys/fs/cgroup
mkdir a1 b1
# a1 becomes partition root with CPUs 0-1
echo "0-1" > a1/cpuset.cpus
echo "root" > a1/cpuset.cpus.partition
# b1 becomes partition root with CPUs 1-2, but sibling exclusion
# reduces its effective_xcpus to CPU 2 only
echo "1-2" > b1/cpuset.cpus
echo "root" > b1/cpuset.cpus.partition
# b1 changes cpus_allowed to 0-1 -> partition invalidation
echo "0-1" > b1/cpuset.cpus
# Expected: CPUs 2-3 (only CPU 2 returned from b1)
# Actual: CPUs 1-3 (CPU 0-1 returned, overlapping with a1)
cat cpuset.cpus.effective
dmesg will also show a WARNING from generate_sched_domains() reporting
overlapping partition root effective_cpus.
Fixes: 2a3602030d ("cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict")
Cc: stable@vger.kernel.org # v7.0+
Signed-off-by: sunshaojie <sunshaojie@kylinos.cn>
Tested-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
cpuset_can_attach() currently adds the bandwidth of all migrating
SCHED_DEADLINE tasks to sum_migrate_dl_bw. If the source and destination
cpuset effective CPU masks do not overlap, the whole sum is then
reserved in the destination root domain.
set_cpus_allowed_dl(), however, subtracts bandwidth from the source
root domain only when the affinity change really moves the task between
root domains. A DL task can move between cpusets that are still in the
same root domain, so including that task in sum_migrate_dl_bw can reserve
destination bandwidth without a matching source-side subtraction.
Share the root-domain move test with set_cpus_allowed_dl(). Keep
nr_migrate_dl_tasks counting all migrating deadline tasks for cpuset DL
task accounting, but add to sum_migrate_dl_bw only for tasks that need a
root-domain bandwidth move. Keep using the destination cpuset effective
CPU mask and leave the broader can_attach()/attach() transaction model
unchanged.
Fixes: 2ef269ef1a ("cgroup/cpuset: Free DL BW in case can_attach() fails")
Cc: stable@vger.kernel.org # v6.10+
Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
Reviewed-by: Waiman Long <longman@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
cpuset_can_attach() accumulates temporary SCHED_DEADLINE migration
state in the destination cpuset while walking the taskset.
If a later task_can_attach() or security_task_setscheduler() check
fails, cgroup_migrate_execute() treats cpuset as the failing subsystem
and does not call cpuset_cancel_attach() for it. The partially
accumulated state is then left behind and can be consumed by a later
attach, corrupting cpuset DL task accounting and pending DL bandwidth
accounting.
Reset the pending DL migration state from the common error exit when
ret is non-zero. Successful can_attach() keeps the state for
cpuset_attach() or cpuset_cancel_attach().
Fixes: 2ef269ef1a ("cgroup/cpuset: Free DL BW in case can_attach() fails")
Cc: stable@vger.kernel.org # v6.10+
Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Waiman Long <longman@redhat.com>
get_cg_pool_unlocked() handles allocation failures under dmemcg_lock by
dropping the lock, preallocating a pool with GFP_KERNEL, and retrying the
locked lookup and creation path.
If the fallback allocation fails too, pool remains NULL. Since the loop
condition is while (!pool), the function can keep retrying instead of
propagating the allocation failure to the caller.
Set pool to ERR_PTR(-ENOMEM) when the fallback allocation fails so the
loop exits through the existing common return path. The callers already
handle ERR_PTR() from get_cg_pool_unlocked(), so this restores the
expected error path.
Fixes: b168ed458d ("kernel/cgroup: Add "dmem" memory accounting cgroup")
Cc: stable@vger.kernel.org # v6.14+
Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
Since prepare_alloc_pages() unconditionally adds __GFP_HARDWALL for the
fast path when cpusets are enabled, the __GFP_HARDWALL check in
cpuset_current_node_allowed() causes the PF_EXITING escape path to be
skipped on the first allocation attempt. This makes it unreachable in
the common case, so dying tasks can get stuck in direct reclaim or even
trigger OOM while trying to exit, despite being allowed to allocate from
any node.
Move the PF_EXITING check before __GFP_HARDWALL so that dying tasks
can allocate memory from any node to exit quickly, even when cpusets
are enabled.
Also update the function comment to reflect the actual behavior of
prepare_alloc_pages() and the corrected check ordering.
Signed-off-by: Chen Wandun <chenwandun@lixiang.com>
Acked-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull cgroup fixes from Tejun Heo:
- During v6.19, cgroup task unlink was moved from do_exit() to after the
final task switch to satisfy a controller invariant. That left the kernel
seeing tasks past exit_signals() longer than userspace expected, and
several v7.0 follow-ups tried to bridge the gap by making rmdir wait for
the kernel side. None held up.
The latest is an A-A deadlock when rmdir is invoked by the reaper of
zombies whose pidns teardown the rmdir itself is waiting on, which
points at the synchronizing approach being fundamentally wrong.
Take a different approach: drop the wait, leave rmdir's user-visible
side returning as soon as cgroup.procs is empty, and defer the css
percpu_ref kill that drives ->css_offline() until the cgroup is fully
depopulated.
Tagged for stable. Somewhat invasive but contained. The hope is that
fixing forward sticks. If not, the fallback is to revert the entire
chain and rework on the development branch.
Note that this doesn't plug a pre-existing analogous race in
cgroup_apply_control_disable() (controller disable via
subtree_control). Not a regression. The development branch will do
the more invasive restructuring needed for that.
- Documentation update for cgroup-v1 charge-commit section that still
referenced functions removed when the memcg hugetlb try-commit-cancel
protocol was retired.
* tag 'cgroup-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
docs: cgroup-v1: Update charge-commit section
cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
Pull sched_ext fixes from Tejun Heo:
- Fix idle CPU selection returning prev_cpu outside the task's cpus_ptr
when the BPF caller's allowed mask was wider. Stable backport.
- Two opposite-direction gaps in scx_task_iter's cgroup-scoped mode
versus the global mode:
- Tasks past exit_signals() are filtered by the cgroup walk but kept
by global. Sub-scheduler enable abort leaked __scx_init_task()
state. Add a CSS_TASK_ITER_WITH_DEAD flag to cgroup's task
iterator (scx_task_iter is its only user) and use it.
- Tasks past sched_ext_dead() are still returned, tripping
WARN_ON_ONCE() in callers or making them touch torn-down state.
Mark and skip under the per-task rq lock.
* tag 'sched_ext-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: idle: Recheck prev_cpu after narrowing allowed mask
sched_ext: Skip past-sched_ext_dead() tasks in scx_task_iter_next_locked()
cgroup, sched_ext: Include exiting tasks in cgroup iter
a72f73c4dd ("cgroup: Don't expose dead tasks in cgroup") made
css_task_iter_advance() skip exiting tasks so cgroup.procs stays consistent
with waitpid() visibility. Unfortunately, this broke scx_task_iter.
scx_task_iter walks either scx_tasks (global) or a cgroup subtree via
css_task_iter() and the two modes are expected to cover the same set of
tasks. After the above change the cgroup-scoped mode silently skips tasks
past exit_signals() that are still on scx_tasks.
scx_sub_enable_workfn()'s abort path is one of the symptoms: an exiting
SCX_TASK_SUB_INIT task can race past the cgroup iter leaking
__scx_init_task() state. Other iterations share the same gap.
Add CSS_TASK_ITER_WITH_DEAD to opt out of the skip and use it from
scx_task_iter().
Fixes: b0e4c2f8a0 ("sched_ext: Implement cgroup subtree iteration for scx_task_iter")
Reported-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
A chain of commits going back to v7.0 reworked rmdir to satisfy the
controller invariant that a subsystem's ->css_offline() must not run while
tasks are still doing kernel-side work in the cgroup.
[1] d245698d72 ("cgroup: Defer task cgroup unlink until after the task is done switching out")
[2] a72f73c4dd ("cgroup: Don't expose dead tasks in cgroup")
[3] 1b164b876c ("cgroup: Wait for dying tasks to leave on rmdir")
[4] 4c56a8ac68 ("cgroup: Fix cgroup_drain_dying() testing the wrong condition")
[5] 13e786b64b ("cgroup: Increment nr_dying_subsys_* from rmdir context")
[1] moved task cset unlink from do_exit() to finish_task_switch() so a
task's cset link drops only after the task has fully stopped scheduling.
That made tasks past exit_signals() linger on cset->tasks until their final
context switch, which led to a series of problems as what userspace expected
to see after rmdir diverged from what the kernel needs to wait for. [2]-[5]
tried to bridge that divergence: [2] filtered the exiting tasks from
cgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4]
fixed the wait's condition; [5] made nr_dying_subsys_* visible
synchronously.
The cgroup_drain_dying() wait in [3] turned out to be a dead end. When the
rmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g.
host PID 1 systemd reaping orphan pids that were re-parented to it during
the same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those
pids to free, the pids can't free because PID 1 is the reaper and it's stuck
in rmdir, and the system A-A deadlocks. No internal lock ordering breaks
this; the wait itself is the bug.
The css killing side that drove the original reorder, however, can be made
cleanly asynchronous: ->css_offline() is already async, run from
css_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to
make that chain start only after all tasks have left the cgroup. rmdir's
user-visible side then returns as soon as cgroup.procs and friends are
empty, while ->css_offline() still runs only after the cgroup is fully
drained.
Verified by the original reproducer (pidns teardown + zombie reaper, runs
under vng) which hangs vanilla and succeeds here, and by per-commit
deterministic repros for [2], [3], [4], [5] with a boot parameter that
widens the post-exit_signals() window so each state is reliably reachable.
Some stress tests on top of that.
cgroup_apply_control_disable() has the same shape of pre-existing race:
when a controller is disabled via subtree_control, kill_css() ran
synchronously while tasks past exit_signals() could still be linked to
the cgroup's csets, and ->css_offline() could fire before they drained.
This patch preserves the existing synchronous behavior at that call site
(kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch
will defer kill_css_finish() there using a per-css trigger.
This seems like the right approach and I don't see problems with it. The
changes are somewhat invasive but not excessively so, so backporting to
-stable should be okay. If something does turn out to be wrong, the fallback
is to revert the entire chain ([1]-[5]) and rework in the development branch
instead.
v2: Pin cgrp across the deferred destroy work with explicit
cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1
wasn't actually broken (ordered cgroup_offline_wq + queue_work order
in cgroup_task_dead() saved it) but the explicit ref removes the
dependency on those non-obvious invariants. Also note the
pre-existing cgroup_apply_control_disable() race in the description;
a follow-up will defer kill_css_finish() there.
Fixes: 1b164b876c ("cgroup: Wait for dying tasks to leave on rmdir")
Cc: stable@vger.kernel.org # v7.0+
Reported-and-tested-by: Martin Pitt <martin@piware.de>
Link: https://lore.kernel.org/all/afHNg2VX2jy9bW7y@piware.de/
Link: https://lore.kernel.org/all/35e0670adb4abeab13da2c321582af9f@kernel.org/
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Pull cgroup fixes from Tejun Heo:
- Fix UAF race in psi pressure_write() against cgroup file release by
extending cgroup_mutex coverage and ordering of->priv access after
cgroup_kn_lock_live()
- Fix integer overflow in rdmacg_try_charge() when usage equals INT_MAX
by performing the increment in s64
- Fix asymmetric DL bandwidth accounting on cpuset attach rollback by
recording the CPU used by dl_bw_alloc() so cancel_attach() returns
the reservation to the same root domain
- Fix nr_dying_subsys_* race that briefly showed 0 in cgroup.stat after
rmdir by incrementing from kill_css() instead of offline_css()
- Typo fix in cgroup-v2 documentation
* tag 'cgroup-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
docs: cgroup: fix typo 'protetion' -> 'protection'
cgroup: Increment nr_dying_subsys_* from rmdir context
cgroup/cpuset: record DL BW alloc CPU for attach rollback
cgroup/rdma: fix integer overflow in rdmacg_try_charge()
sched/psi: fix race between file release and pressure write
Incrementing nr_dying_subsys_* in offline_css(), which is executed by
cgroup_offline_wq worker, leads to a race where user can see the value
to be 0 if he reads cgroup.stat after calling rmdir and before the worker
executes. This makes the user wrongly expect resources released by the
removed cgroup to be available for a new assignment.
Increment nr_dying_subsys_* from kill_css(), which is called from the
cgroup_rmdir() context.
Fixes: ab03125268 ("cgroup: Show # of subsystem CSSes in cgroup.stat")
Signed-off-by: Petr Malat <oss@malat.biz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull more MM updates from Andrew Morton:
- "Eliminate Dying Memory Cgroup" (Qi Zheng and Muchun Song)
Address the longstanding "dying memcg problem". A situation wherein a
no-longer-used memory control group will hang around for an extended
period pointlessly consuming memory
- "fix unexpected type conversions and potential overflows" (Qi Zheng)
Fix a couple of potential 32-bit/64-bit issues which were identified
during review of the "Eliminate Dying Memory Cgroup" series
- "kho: history: track previous kernel version and kexec boot count"
(Breno Leitao)
Use Kexec Handover (KHO) to pass the previous kernel's version string
and the number of kexec reboots since the last cold boot to the next
kernel, and print it at boot time
- "liveupdate: prevent double preservation" (Pasha Tatashin)
Teach LUO to avoid managing the same file across different active
sessions
- "liveupdate: Fix module unloading and unregister API" (Pasha
Tatashin)
Address an issue with how LUO handles module reference counting and
unregistration during module unloading
- "zswap pool per-CPU acomp_ctx simplifications" (Kanchana Sridhar)
Simplify and clean up the zswap crypto compression handling and
improve the lifecycle management of zswap pool's per-CPU acomp_ctx
resources
- "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race"
(SeongJae Park)
Address unlikely but possible leaks and deadlocks in damon_call() and
damon_walk()
- "mm/damon/core: validate damos_quota_goal->nid" (SeongJae Park)
Fix a couple of root-only wild pointer dereferences
- "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race"
(SeongJae Park)
Update the DAMON documentation to warn operators about potential
races which can occur if the commit_inputs parameter is altered at
the wrong time
- "Minor hmm_test fixes and cleanups" (Alistair Popple)
Bugfixes and a cleanup for the HMM kernel selftests
- "Modify memfd_luo code" (Chenghao Duan)
Cleanups, simplifications and speedups to the memfd_lou code
- "mm, kvm: allow uffd support in guest_memfd" (Mike Rapoport)
Support for userfaultfd in guest_memfd
- "selftests/mm: skip several tests when thp is not available" (Chunyu
Hu)
Fix several issues in the selftests code which were causing breakage
when the tests were run on CONFIG_THP=n kernels
- "mm/mprotect: micro-optimization work" (Pedro Falcato)
A couple of nice speedups for mprotect()
- "MAINTAINERS: update KHO and LIVE UPDATE entries" (Pratyush Yadav)
Document upcoming changes in the maintenance of KHO, LUO, memfd_luo,
kexec, crash, kdump and probably other kexec-based things - they are
being moved out of mm.git and into a new git tree
* tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (121 commits)
MAINTAINERS: add page cache reviewer
mm/vmscan: avoid false-positive -Wuninitialized warning
MAINTAINERS: update Dave's kdump reviewer email address
MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE
MAINTAINERS: drop include/linux/kho/abi/ from KHO
MAINTAINERS: update KHO and LIVE UPDATE maintainers
MAINTAINERS: update kexec/kdump maintainers entries
mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd()
selftests: mm: skip charge_reserved_hugetlb without killall
userfaultfd: allow registration of ranges below mmap_min_addr
mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update
mm/hugetlb: fix early boot crash on parameters without '=' separator
zram: reject unrecognized type= values in recompress_store()
docs: proc: document ProtectionKey in smaps
mm/mprotect: special-case small folios when applying permissions
mm/mprotect: move softleaf code out of the main function
mm: remove '!root_reclaim' checking in should_abort_scan()
mm/sparse: fix comment for section map alignment
mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete()
selftests/mm: transhuge_stress: skip the test when thp not available
...
cpuset_can_attach() allocates DL bandwidth only when migrating
deadline tasks to a disjoint CPU mask, but cpuset_cancel_attach()
rolls back based only on nr_migrate_dl_tasks. This makes the DL
bandwidth alloc/free paths asymmetric: rollback can call dl_bw_free()
even when no dl_bw_alloc() was done.
Rollback also needs to undo the reservation against the same CPU/root
domain that was charged. Record the CPU used by dl_bw_alloc() and use
that state in cpuset_cancel_attach(). If no allocation happened,
dl_bw_cpu stays at -1 and rollback skips dl_bw_free(). If allocation
did happen, bandwidth is returned to the same CPU/root domain.
Successful attach paths are unchanged. This only fixes failed attach
rollback accounting.
Fixes: 2ef269ef1a ("cgroup/cpuset: Free DL BW in case can_attach() fails")
Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The expression `rpool->resources[index].usage + 1` is computed in int
arithmetic before being assigned to s64 variable `new`. When usage equals
INT_MAX (the default "max" value), the addition overflows to INT_MIN.
This negative value then passes the `new > max` check incorrectly,
allowing a charge that should be rejected and corrupting usage to
negative.
Fix by casting usage to s64 before the addition so the arithmetic is
done in 64-bit.
Fixes: 39d3e7584a ("rdmacg: Added rdma cgroup controller")
Signed-off-by: cuitao <cuitao@kylinos.cn>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
A potential race condition exists between pressure write and cgroup file
release regarding the priv member of struct kernfs_open_file, which
triggers the uaf reported in [1].
Consider the following scenario involving execution on two separate CPUs:
CPU0 CPU1
==== ====
vfs_rmdir()
kernfs_iop_rmdir()
cgroup_rmdir()
cgroup_kn_lock_live()
cgroup_destroy_locked()
cgroup_addrm_files()
cgroup_rm_file()
kernfs_remove_by_name()
kernfs_remove_by_name_ns()
vfs_write() __kernfs_remove()
new_sync_write() kernfs_drain()
kernfs_fop_write_iter() kernfs_drain_open_files()
cgroup_file_write() kernfs_release_file()
pressure_write() cgroup_file_release()
ctx = of->priv;
kfree(ctx);
of->priv = NULL;
cgroup_kn_unlock()
cgroup_kn_lock_live()
cgroup_get(cgrp)
cgroup_kn_unlock()
if (ctx->psi.trigger) // here, trigger uaf for ctx, that is of->priv
The cgroup_rmdir() is protected by the cgroup_mutex, it also safeguards
the memory deallocation of of->priv performed within cgroup_file_release().
However, the operations involving of->priv executed within pressure_write()
are not entirely covered by the protection of cgroup_mutex. Consequently,
if the code in pressure_write(), specifically the section handling the
ctx variable executes after cgroup_file_release() has completed, a uaf
vulnerability involving of->priv is triggered.
Therefore, the issue can be resolved by extending the scope of the
cgroup_mutex lock within pressure_write() to encompass all code paths
involving of->priv, thereby properly synchronizing the race condition
occurring between cgroup_file_release() and pressure_write().
And, if an live kn lock can be successfully acquired while executing
the pressure write operation, it indicates that the cgroup deletion
process has not yet reached its final stage; consequently, the priv
pointer within open_file cannot be NULL. Therefore, the operation to
retrieve the ctx value must be moved to a point *after* the live kn
lock has been successfully acquired.
In another situation, specifically after entering cgroup_kn_lock_live()
but before acquiring cgroup_mutex, there exists a different class of
race condition:
CPU0: write memory.pressure CPU1: write cgroup.pressure=0
=========================== =============================
kernfs_fop_write_iter()
kernfs_get_active_of(of)
pressure_write()
cgroup_kn_lock_live(memory.pressure)
cgroup_tryget(cgrp)
kernfs_break_active_protection(kn)
... blocks on cgroup_mutex
cgroup_pressure_write()
cgroup_kn_lock_live(cgroup.pressure)
cgroup_file_show(memory.pressure, false)
kernfs_show(false)
kernfs_drain_open_files()
cgroup_file_release(of)
kfree(ctx)
of->priv = NULL
cgroup_kn_unlock()
... acquires cgroup_mutex
ctx = of->priv; // may now be NULL
if (ctx->psi.trigger) // NULL dereference
Consequently, there is a possibility that of->priv is NULL, the pressure
write needs to check for this.
Now that the scope of the cgroup_mutex has been expanded, the original
explicit cgroup_get/put operations are no longer necessary, this is
because acquiring/releasing the live kn lock inherently executes a
cgroup get/put operation.
[1]
BUG: KASAN: slab-use-after-free in pressure_write+0xa4/0x210 kernel/cgroup/cgroup.c:4011
Call Trace:
pressure_write+0xa4/0x210 kernel/cgroup/cgroup.c:4011
cgroup_file_write+0x36f/0x790 kernel/cgroup/cgroup.c:4311
kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:352
Allocated by task 9352:
cgroup_file_open+0x90/0x3a0 kernel/cgroup/cgroup.c:4256
kernfs_fop_open+0x9eb/0xcb0 fs/kernfs/file.c:724
do_dentry_open+0x83d/0x13e0 fs/open.c:949
Freed by task 9353:
cgroup_file_release+0xd6/0x100 kernel/cgroup/cgroup.c:4283
kernfs_release_file fs/kernfs/file.c:764 [inline]
kernfs_drain_open_files+0x392/0x720 fs/kernfs/file.c:834
kernfs_drain+0x470/0x600 fs/kernfs/dir.c:525
Fixes: 0e94682b73 ("psi: introduce psi monitor")
Reported-by: syzbot+33e571025d88efd1312c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=33e571025d88efd1312c
Tested-by: syzbot+33e571025d88efd1312c@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull cgroup updates from Tejun Heo:
- cgroup_file_notify() locking converted from a global lock to
per-cgroup_file spinlock with a lockless fast-path when no
notification is needed
- Misc changes including exposing cgroup helpers for sched_ext and
minor fixes
* tag 'cgroup-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/rdma: fix swapped arguments in pr_warn() format string
cgroup/dmem: remove region parameter from dmemcg_parse_limit
cgroup: replace global cgroup_file_kn_lock with per-cgroup_file lock
cgroup: add lockless fast-path checks to cgroup_file_notify()
cgroup: reduce cgroup_file_kn_lock hold time in cgroup_file_notify()
cgroup: Expose some cgroup helpers
The format string says "device %p ... rdma cgroup %p" but the arguments
were passed as (cg, device), printing them in the wrong order.
Signed-off-by: cuitao <cuitao@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
When a CPU hot removal causes a v1 cpuset to lose all its CPUs, the
cpuset hotplug handler will schedule a work function to migrate tasks
in that cpuset with no CPU to its ancestor to enable those tasks to
continue running.
If a strict security policy is in place, however, the task migration
may fail when security_task_setscheduler() call in cpuset_can_attach()
returns a -EACCES error. That will mean that those tasks will have
no CPU to run on. The system administrators will have to explicitly
intervene to either add CPUs to that cpuset or move the tasks elsewhere
if they are aware of it.
This problem was found by a reported test failure in the LTP's
cpuset_hotplug_test.sh. Fix this problem by treating this special case as
an exception to skip the setsched security check in cpuset_can_attach()
when a v1 cpuset with tasks have no CPU left.
With that patch applied, the cpuset_hotplug_test.sh test can be run
successfully without failure.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Centralize the check required to run security_task_setscheduler() in
the task iteration loop of cpuset_can_attach() outside of the loop as
it has no dependency on the characteristics of the tasks themselves.
There is no functional change.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup_drain_dying() was using cgroup_is_populated() to test whether there are
dying tasks to wait for. cgroup_is_populated() tests nr_populated_csets,
nr_populated_domain_children and nr_populated_threaded_children, but
cgroup_drain_dying() only needs to care about this cgroup's own tasks - whether
there are children is cgroup_destroy_locked()'s concern.
This caused hangs during shutdown. When systemd tried to rmdir a cgroup that had
no direct tasks but had a populated child, cgroup_drain_dying() would enter its
wait loop because cgroup_is_populated() was true from
nr_populated_domain_children. The task iterator found nothing to wait for, yet
the populated state never cleared because it was driven by live tasks in the
child cgroup.
Fix it by using cgroup_has_tasks() which only tests nr_populated_csets.
v3: Fix cgroup_is_populated() -> cgroup_has_tasks() (Sebastian).
v2: https://lore.kernel.org/r/20260323200205.1063629-1-tj@kernel.org
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Fixes: 1b164b876c ("cgroup: Wait for dying tasks to leave on rmdir")
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
a72f73c4dd ("cgroup: Don't expose dead tasks in cgroup") hid PF_EXITING
tasks from cgroup.procs so that systemd doesn't see tasks that have already
been reaped via waitpid(). However, the populated counter (nr_populated_csets)
is only decremented when the task later passes through cgroup_task_dead() in
finish_task_switch(). This means cgroup.procs can appear empty while the
cgroup is still populated, causing rmdir to fail with -EBUSY.
Fix this by making cgroup_rmdir() wait for dying tasks to fully leave. If the
cgroup is populated but all remaining tasks have PF_EXITING set (the task
iterator returns none due to the existing filter), wait for a kick from
cgroup_task_dead() and retry. The wait is brief as tasks are removed from the
cgroup's css_set between PF_EXITING assertion in do_exit() and
cgroup_task_dead() in finish_task_switch().
v2: cgroup_is_populated() true to false transition happens under css_set_lock
not cgroup_mutex, so retest under css_set_lock before sleeping to avoid
missed wakeups (Sebastian).
Fixes: a72f73c4dd ("cgroup: Don't expose dead tasks in cgroup")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202603222104.2c81684e-lkp@intel.com
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: cgroups@vger.kernel.org
dmemcg_parse_limit does not use the region parameter. Remove it.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Replace the global cgroup_file_kn_lock with a per-cgroup_file spinlock
to eliminate cross-cgroup contention as it is not really protecting
data shared between different cgroups.
The lock is initialized in cgroup_add_file() alongside timer_setup().
No lock acquisition is needed during initialization since the cgroup
directory is being populated under cgroup_mutex and no concurrent
accessors exist at that point.
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add lockless checks before acquiring cgroup_file_kn_lock:
1. READ_ONCE(cfile->kn) NULL check to skip torn-down files.
2. READ_ONCE(cfile->notified_at) rate-limit check to skip when
within the notification interval. If within the interval, arm
the deferred timer via timer_reduce() and confirm it is pending
before returning -- if the timer fired in between, fall through
to the lock path so the notification is not lost.
Both checks have safe error directions -- a stale read can only
cause unnecessary lock acquisition, never a missed notification.
The critical section is simplified to just taking a kernfs_get()
reference and updating notified_at.
Annotate cfile->kn and cfile->notified_at write sites with
WRITE_ONCE() to pair with the lockless readers.
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup_file_notify() calls kernfs_notify() while holding the global
cgroup_file_kn_lock. kernfs_notify() does non-trivial work including
wake_up_interruptible() and acquisition of a second global spinlock
(kernfs_notify_lock), inflating the hold time.
Take a kernfs_get() reference under the lock and call kernfs_notify()
after dropping it, following the pattern from cgroup_file_show().
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
Once a task exits it has its state set to TASK_DEAD and then it is
removed from the cgroup it belonged to. The last step happens on the task
gets out of its last schedule() invocation and is delayed on PREEMPT_RT
due to locking constraints.
As a result it is possible to receive a pid via waitpid() of a task
which is still listed in cgroup.procs for the cgroup it belonged
to. This is something that systemd does not expect and as a result it
waits for its exit until a time out occurs.
This can also be reproduced on !PREEMPT_RT kernel with a significant
delay in do_exit() after exit_notify().
Hide the task from the output which have PF_EXITING set which is done
before the parent is notified. Keeping zombies with live threads
shouldn't break anything (suggested by Tejun).
Reported-by: Bert Karwatzki <spasswolf@web.de>
Closes: https://lore.kernel.org/all/20260219164648.3014-1-spasswolf@web.de/
Tested-by: Bert Karwatzki <spasswolf@web.de>
Fixes: 9311e6c29b ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT")
Cc: stable@vger.kernel.org # v6.19+
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Besides deferring the call to housekeeping_update(), commit 6df415aa46
("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug
to workqueue") also defers the rebuild_sched_domains() call to
the workqueue. So a new offline CPU may still be in a sched domain
or new online CPU not showing up in the sched domains for a short
transition period. That could be a problem in some corner cases and
can be the cause of a reported test failure[1]. Fix it by calling
rebuild_sched_domains_cpuslocked() directly in hotplug as before. If
isolated partition invalidation or recreation is being done, the
housekeeping_update() call to update the housekeeping cpumasks will
still be deferred to a workqueue.
In commit 3bfe479671 ("cgroup/cpuset: Move
housekeeping_update()/rebuild_sched_domains() together"),
housekeeping_update() is called before rebuild_sched_domains() because
it needs to access the HK_TYPE_DOMAIN housekeeping cpumask. That is now
changed to use the static HK_TYPE_DOMAIN_BOOT cpumask as HK_TYPE_DOMAIN
cpumask is now changeable at run time. As a result, we can move the
rebuild_sched_domains() call before housekeeping_update() with
the slight advantage that it will be done in the same cpus_read_lock
critical section without the possibility of interference by a concurrent
cpu hot add/remove operation.
As it doesn't make sense to acquire cpuset_mutex/cpuset_top_mutex after
calling housekeeping_update() and immediately release them again, move
the cpuset_full_unlock() operation inside update_hk_sched_domains()
and rename it to cpuset_update_sd_hk_unlock() to signify that it will
release the full set of locks.
[1] https://lore.kernel.org/lkml/1a89aceb-48db-4edd-a730-b445e41221fe@nvidia.com
Fixes: 6df415aa46 ("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue")
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Expose the following through cgroup.h:
- cgroup_on_dfl()
- cgroup_is_dead()
- cgroup_for_each_live_child()
- cgroup_for_each_live_descendant_pre()
- cgroup_for_each_live_descendant_post()
Until now, these didn't need to be exposed because controllers only cared
about the css hierarchy. The planned sched_ext hierarchical scheduler
support will be based on the default cgroup hierarchy, which is in line
with the existing BPF cgroup support, and thus needs these exposed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull cgroup fixes from Tejun Heo:
- Fix circular locking dependency in cpuset partition code by
deferring housekeeping_update() calls to a workqueue instead
of calling them directly under cpus_read_lock
- Fix null-ptr-deref in rebuild_sched_domains_cpuslocked() when
generate_sched_domains() returns NULL due to kmalloc failure
- Fix incorrect cpuset behavior for effective_xcpus in
partition_xcpus_del() and cpuset_update_tasks_cpumask()
in update_cpumasks_hier()
- Fix race between task migration and cgroup iteration
* tag 'cgroup-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: fix null-ptr-deref in rebuild_sched_domains_cpuslocked
cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together
kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command
cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed
cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier()
cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del()
cgroup: fix race between task migration and iteration
The current cpuset partition code is able to dynamically update
the sched domains of a running system and the corresponding
HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
"isolcpus=domain,..." boot command line feature at run time.
The housekeeping cpumask update requires flushing a number of different
workqueues which may not be safe with cpus_read_lock() held as the
workqueue flushing code may acquire cpus_read_lock() or acquiring locks
which have locking dependency with cpus_read_lock() down the chain. Below
is an example of such circular locking problem.
======================================================
WARNING: possible circular locking dependency detected
6.18.0-test+ #2 Tainted: G S
------------------------------------------------------
test_cpuset_prs/10971 is trying to acquire lock:
ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180
but task is already holding lock:
ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (cpuset_mutex){+.+.}-{4:4}:
-> #3 (cpu_hotplug_lock){++++}-{0:0}:
-> #2 (rtnl_mutex){+.+.}-{4:4}:
-> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
-> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
Chain exists of:
(wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
5 locks held by test_cpuset_prs/10971:
#0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
#1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
#2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
#3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
#4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
Call Trace:
<TASK>
:
touch_wq_lockdep_map+0x93/0x180
__flush_workqueue+0x111/0x10b0
housekeeping_update+0x12d/0x2d0
update_parent_effective_cpumask+0x595/0x2440
update_prstate+0x89d/0xce0
cpuset_partition_write+0xc5/0x130
cgroup_file_write+0x1a5/0x680
kernfs_fop_write_iter+0x3df/0x5f0
vfs_write+0x525/0xfd0
ksys_write+0xf9/0x1d0
do_syscall_64+0x95/0x520
entry_SYSCALL_64_after_hwframe+0x76/0x7e
To avoid such a circular locking dependency problem, we have to
call housekeeping_update() without holding the cpus_read_lock() and
cpuset_mutex. The current set of wq's flushed by housekeeping_update()
may not have work functions that call cpus_read_lock() directly,
but we are likely to extend the list of wq's that are flushed in the
future. Moreover, the current set of work functions may hold locks that
may have cpu_hotplug_lock down the dependency chain.
So housekeeping_update() is now called after releasing cpus_read_lock
and cpuset_mutex at the end of a cpuset operation. These two locks are
then re-acquired later before calling rebuild_sched_domains_locked().
To enable mutual exclusion between the housekeeping_update() call and
other cpuset control file write actions, a new top level cpuset_top_mutex
is introduced. This new mutex will be acquired first to allow sharing
variables used by both code paths. However, cpuset update from CPU
hotplug can still happen in parallel with the housekeeping_update()
call, though that should be rare in production environment.
As cpus_read_lock() is now no longer held when
tmigr_isolated_exclude_cpumask() is called, it needs to acquire it
directly.
The lockdep_is_cpuset_held() is also updated to return true if either
cpuset_top_mutex or cpuset_mutex is held.
Fixes: 03ff735101 ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The cpuset_handle_hotplug() may need to invoke housekeeping_update(),
for instance, when an isolated partition is invalidated because its
last active CPU has been put offline.
As we are going to enable dynamic update to the nozh_full housekeeping
cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
allowing the CPU hotplug path to call into housekeeping_update() directly
from update_isolation_cpumasks() will likely cause deadlock. So we
have to defer any call to housekeeping_update() after the CPU hotplug
operation has finished. This is now done via the workqueue where
the update_hk_sched_domains() function will be invoked via the
hk_sd_workfn().
An concurrent cpuset control file write may have executed the required
update_hk_sched_domains() function before the work function is called. So
the work function call may become a no-op when it is invoked.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
With the latest changes in sched/isolation.c, rebuild_sched_domains*()
requires the HK_TYPE_DOMAIN housekeeping cpumask to be properly
updated first, if needed, before the sched domains can be
rebuilt. So the two naturally fit together. Do that by creating a new
update_hk_sched_domains() helper to house both actions.
The name of the isolated_cpus_updating flag to control the
call to housekeeping_update() is now outdated. So change it to
update_housekeeping to better reflect its purpose. Also move the call
to update_hk_sched_domains() to the end of cpuset and hotplug operations
before releasing the cpuset_mutex.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
As cpuset is updating HK_TYPE_DOMAIN housekeeping mask when there is
a change in the set of isolated CPUs, making this change is now more
costly than before. Right now, the isolated_cpus_updating flag can be
set even if there is no real change in isolated_cpus. Put in additional
checks to make sure that isolated_cpus_updating is set only if there
is a real change in isolated_cpus.
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Clarify the locking rules associated with file level internal variables
inside the cpuset code. There is no functional change.
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit e2ffe502ba ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2")
incorrectly changed the 2nd parameter of cpuset_update_tasks_cpumask()
from tmp->new_cpus to cp->effective_cpus. This second parameter is just
a temporary cpumask for internal use. The cpuset_update_tasks_cpumask()
function was originally called update_tasks_cpumask() before commit
381b53c3b5 ("cgroup/cpuset: rename functions shared between v1
and v2").
This mistake can incorrectly change the effective_cpus of the
cpuset when it is the top_cpuset or in arm64 architecture where
task_cpu_possible_mask() may differ from cpu_possible_mask. So far
top_cpuset hasn't been passed to update_cpumasks_hier() yet, but arm64
arch can still be impacted. Fix it by reverting the incorrect change.
Fixes: e2ffe502ba ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The effective_xcpus of a cpuset can contain offline CPUs. In
partition_xcpus_del(), the xcpus parameter is incorrectly used as
a temporary cpumask to mask out offline CPUs. As xcpus can be the
effective_xcpus of a cpuset, this can result in unexpected changes
in that cpumask. Fix this problem by not making any changes to the
xcpus parameter.
Fixes: 11e5f407b6 ("cgroup/cpuset: Keep track of CPUs in isolated partitions")
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.
As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
Pull more MM updates from Andrew Morton:
- "mm/vmscan: fix demotion targets checks in reclaim/demotion" fixes a
couple of issues in the demotion code - pages were failed demotion
and were finding themselves demoted into disallowed nodes (Bing Jiao)
- "Remove XA_ZERO from error recovery of dup_mmap()" fixes a rare
mapledtree race and performs a number of cleanups (Liam Howlett)
- "mm: add bitmap VMA flag helpers and convert all mmap_prepare to use
them" implements a lot of cleanups following on from the conversion
of the VMA flags into a bitmap (Lorenzo Stoakes)
- "support batch checking of references and unmapping for large folios"
implements batching to greatly improve the performance of reclaiming
clean file-backed large folios (Baolin Wang)
- "selftests/mm: add memory failure selftests" does as claimed (Miaohe
Lin)
* tag 'mm-stable-2026-02-18-19-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (36 commits)
mm/page_alloc: clear page->private in free_pages_prepare()
selftests/mm: add memory failure dirty pagecache test
selftests/mm: add memory failure clean pagecache test
selftests/mm: add memory failure anonymous page test
mm: rmap: support batched unmapping for file large folios
arm64: mm: implement the architecture-specific clear_flush_young_ptes()
arm64: mm: support batch clearing of the young flag for large folios
arm64: mm: factor out the address and ptep alignment into a new helper
mm: rmap: support batched checks of the references for large folios
tools/testing/vma: add VMA userland tests for VMA flag functions
tools/testing/vma: separate out vma_internal.h into logical headers
tools/testing/vma: separate VMA userland tests into separate files
mm: make vm_area_desc utilise vma_flags_t only
mm: update all remaining mmap_prepare users to use vma_flags_t
mm: update shmem_[kernel]_file_*() functions to use vma_flags_t
mm: update secretmem to use VMA flags on mmap_prepare
mm: update hugetlbfs to use VMA flags on mmap_prepare
mm: add basic VMA flag operation helper functions
tools: bitmap: add missing bitmap_[subset(), andnot()]
mm: add mk_vma_flags() bitmap flag macro helper
...
Patch series "mm/vmscan: fix demotion targets checks in reclaim/demotion",
v9.
This patch series addresses two issues in demote_folio_list(),
can_demote(), and next_demotion_node() in reclaim/demotion.
1. demote_folio_list() and can_demote() do not correctly check
demotion target against cpuset.mems_effective, which will cause (a)
pages to be demoted to not-allowed nodes and (b) pages fail demotion
even if the system still has allowed demotion nodes.
Patch 1 fixes this bug by updating cpuset_node_allowed() and
mem_cgroup_node_allowed() to return effective_mems, allowing directly
logic-and operation against demotion targets.
2. next_demotion_node() returns a preferred demotion target, but it
does not check the node against allowed nodes.
Patch 2 ensures that next_demotion_node() filters against the allowed
node mask and selects the closest demotion target to the source node.
This patch (of 2):
Fix two bugs in demote_folio_list() and can_demote() due to incorrect
demotion target checks against cpuset.mems_effective in reclaim/demotion.
Commit 7d709f49ba ("vmscan,cgroup: apply mems_effective to reclaim")
introduces the cpuset.mems_effective check and applies it to can_demote().
However:
1. It does not apply this check in demote_folio_list(), which leads
to situations where pages are demoted to nodes that are
explicitly excluded from the task's cpuset.mems.
2. It checks only the nodes in the immediate next demotion hierarchy
and does not check all allowed demotion targets in can_demote().
This can cause pages to never be demoted if the nodes in the next
demotion hierarchy are not set in mems_effective.
These bugs break resource isolation provided by cpuset.mems. This is
visible from userspace because pages can either fail to be demoted
entirely or are demoted to nodes that are not allowed in multi-tier memory
systems.
To address these bugs, update cpuset_node_allowed() and
mem_cgroup_node_allowed() to return effective_mems, allowing directly
logic-and operation against demotion targets. Also update can_demote()
and demote_folio_list() accordingly.
Bug 1 reproduction:
Assume a system with 4 nodes, where nodes 0-1 are top-tier and
nodes 2-3 are far-tier memory. All nodes have equal capacity.
Test script:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
mkdir /sys/fs/cgroup/test
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
swapoff -a
# Expectation: Should respect node 0-2 limit.
# Observation: Node 3 shows significant allocation (MemFree drops)
stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1
Bug 2 reproduction:
Assume a system with 6 nodes, where nodes 0-2 are top-tier,
node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes.
All nodes have equal capacity.
Test script:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
mkdir /sys/fs/cgroup/test
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
swapoff -a
# Expectation: Pages are demoted to Nodes 4-5
# Observation: No pages are demoted before oom.
stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2
Link: https://lkml.kernel.org/r/20260114205305.2869796-1-bingjiao@google.com
Link: https://lkml.kernel.org/r/20260114205305.2869796-2-bingjiao@google.com
Fixes: 7d709f49ba ("vmscan,cgroup: apply mems_effective to reclaim")
Signed-off-by: Bing Jiao <bingjiao@google.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull MM updates from Andrew Morton:
- "powerpc/64s: do not re-activate batched TLB flush" makes
arch_{enter|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev)
It adds a generic enter/leave layer and switches architectures to use
it. Various hacks were removed in the process.
- "zram: introduce compressed data writeback" implements data
compression for zram writeback (Richard Chang and Sergey Senozhatsky)
- "mm: folio_zero_user: clear page ranges" adds clearing of contiguous
page ranges for hugepages. Large improvements during demand faulting
are demonstrated (David Hildenbrand)
- "memcg cleanups" tidies up some memcg code (Chen Ridong)
- "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos
stats" improves DAMOS stat's provided information, deterministic
control, and readability (SeongJae Park)
- "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few
issues in the hugetlb cgroup charging selftests (Li Wang)
- "Fix va_high_addr_switch.sh test failure - again" addresses several
issues in the va_high_addr_switch test (Chunyu Hu)
- "mm/damon/tests/core-kunit: extend existing test scenarios" improves
the KUnit test coverage for DAMON (Shu Anzai)
- "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a
glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to
transiently return -EAGAIN (Shivank Garg)
- "arch, mm: consolidate hugetlb early reservation" reworks and
consolidates a pile of straggly code related to reservation of
hugetlb memory from bootmem and creation of CMA areas for hugetlb
(Mike Rapoport)
- "mm: clean up anon_vma implementation" cleans up the anon_vma
implementation in various ways (Lorenzo Stoakes)
- "tweaks for __alloc_pages_slowpath()" does a little streamlining of
the page allocator's slowpath code (Vlastimil Babka)
- "memcg: separate private and public ID namespaces" cleans up the
memcg ID code and prevents the internal-only private IDs from being
exposed to userspace (Shakeel Butt)
- "mm: hugetlb: allocate frozen gigantic folio" cleans up the
allocation of frozen folios and avoids some atomic refcount
operations (Kefeng Wang)
- "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement
of memory betewwn the active and inactive LRUs and adds auto-tuning
of the ratio-based quotas and of monitoring intervals (SeongJae Park)
- "Support page table check on PowerPC" makes
CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan)
- "nodemask: align nodes_and{,not} with underlying bitmap ops" makes
nodes_and() and nodes_andnot() propagate the return values from the
underlying bit operations, enabling some cleanup in calling code
(Yury Norov)
- "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up
some DAMON internal interfaces (SeongJae Park)
- "mm/khugepaged: cleanups and scan limit fix" does some cleanup work
in khupaged and fixes a scan limit accounting issue (Shivank Garg)
- "mm: balloon infrastructure cleanups" goes to town on the balloon
infrastructure and its page migration function. Mainly cleanups, also
some locking simplification (David Hildenbrand)
- "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds
additional tracepoints to the page reclaim code (Jiayuan Chen)
- "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is
part of Marco's kernel-wide migration from the legacy workqueue APIs
over to the preferred unbound workqueues (Marco Crivellari)
- "Various mm kselftests improvements/fixes" provides various unrelated
improvements/fixes for the mm kselftests (Kevin Brodsky)
- "mm: accelerate gigantic folio allocation" greatly speeds up gigantic
folio allocation, mainly by avoiding unnecessary work in
pfn_range_valid_contig() (Kefeng Wang)
- "selftests/damon: improve leak detection and wss estimation
reliability" improves the reliability of two of the DAMON selftests
(SeongJae Park)
- "mm/damon: cleanup kdamond, damon_call(), damos filter and
DAMON_MIN_REGION" does some cleanup work in the core DAMON code
(SeongJae Park)
- "Docs/mm/damon: update intro, modules, maintainer profile, and misc"
performs maintenance work on the DAMON documentation (SeongJae Park)
- "mm: add and use vma_assert_stabilised() helper" refactors and cleans
up the core VMA code. The main aim here is to be able to use the mmap
write lock's lockdep state to perform various assertions regarding
the locking which the VMA code requires (Lorenzo Stoakes)
- "mm, swap: swap table phase II: unify swapin use" removes some old
swap code (swap cache bypassing and swap synchronization) which
wasn't working very well. Various other cleanups and simplifications
were made. The end result is a 20% speedup in one benchmark (Kairui
Song)
- "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM
available on 64-bit alpha, loongarch, mips, parisc, and um. Various
cleanups were performed along the way (Qi Zheng)
* tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits)
mm/memory: handle non-split locks correctly in zap_empty_pte_table()
mm: move pte table reclaim code to memory.c
mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE
mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config
um: mm: enable MMU_GATHER_RCU_TABLE_FREE
parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE
mips: mm: enable MMU_GATHER_RCU_TABLE_FREE
LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE
alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE
mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h
mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles
zsmalloc: make common caches global
mm: add SPDX id lines to some mm source files
mm/zswap: use %pe to print error pointers
mm/vmscan: use %pe to print error pointers
mm/readahead: fix typo in comment
mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file()
mm: refactor vma_map_pages to use vm_insert_pages
mm/damon: unify address range representation with damon_addr_range
mm/cma: replace snprintf with strscpy in cma_new_area
...
When a task is migrated out of a css_set, cgroup_migrate_add_task()
first moves it from cset->tasks to cset->mg_tasks via:
list_move_tail(&task->cg_list, &cset->mg_tasks);
If a css_task_iter currently has it->task_pos pointing to this task,
css_set_move_task() calls css_task_iter_skip() to keep the iterator
valid. However, since the task has already been moved to ->mg_tasks,
the iterator is advanced relative to the mg_tasks list instead of the
original tasks list. As a result, remaining tasks on cset->tasks, as
well as tasks queued on cset->mg_tasks, can be skipped by iteration.
Fix this by calling css_set_skip_task_iters() before unlinking
task->cg_list from cset->tasks. This advances all active iterators to
the next task on cset->tasks, so iteration continues correctly even
when a task is concurrently being migrated.
This race is hard to hit in practice without instrumentation, but it
can be reproduced by artificially slowing down cgroup_procs_show().
For example, on an Android device a temporary
/sys/kernel/cgroup/cgroup_test knob can be added to inject a delay
into cgroup_procs_show(), and then:
1) Spawn three long-running tasks (PIDs 101, 102, 103).
2) Create a test cgroup and move the tasks into it.
3) Enable a large delay via /sys/kernel/cgroup/cgroup_test.
4) In one shell, read cgroup.procs from the test cgroup.
5) Within the delay window, in another shell migrate PID 102 by
writing it to a different cgroup.procs file.
Under this setup, cgroup.procs can intermittently show only PID 101
while skipping PID 103. Once the migration completes, reading the
file again shows all tasks as expected.
Note that this change does not allow removing the existing
css_set_skip_task_iters() call in css_set_move_task(). The new call
in cgroup_migrate_add_task() only handles iterators that are racing
with migration while the task is still on cset->tasks. Iterators may
also start after the task has been moved to cset->mg_tasks. If we
dropped css_set_skip_task_iters() from css_set_move_task(), such
iterators could keep task_pos pointing to a migrating task, causing
css_task_iter_advance() to malfunction on the destination css_set,
up to and including crashes or infinite loops.
The race window between migration and iteration is very small, and
css_task_iter is not on a hot path. In the worst case, when an
iterator is positioned on the first thread of the migrating process,
cgroup_migrate_add_task() may have to skip multiple tasks via
css_set_skip_task_iters(). However, this only happens when migration
and iteration actually race, so the performance impact is negligible
compared to the correctness fix provided here.
Fixes: b636fd38dc ("cgroup: Implement css_task_iter_skip()")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Qingye Zhao <zhaoqingye@honor.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull cgroup updates from Tejun Heo:
- cpuset changes:
- Continue separating v1 and v2 implementations by moving more
v1-specific logic into cpuset-v1.c
- Improve partition handling. Sibling partitions are no longer
invalidated on cpuset.cpus conflict, cpuset.cpus changes no longer
fail in v2, and effective_xcpus computation is made consistent
- Fix partition effective CPUs overlap that caused a warning on
cpuset removal when sibling partitions shared CPUs
- Increase the maximum cgroup subsystem count from 16 to 32 to
accommodate future subsystem additions
- Misc cleanups and selftest improvements including switching to
css_is_online() helper, removing dead code and stale documentation
references, using lockdep_assert_cpuset_lock_held() consistently, and
adding polling helpers for asynchronously updated cgroup statistics
* tag 'cgroup-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
cpuset: fix overlap of partition effective CPUs
cgroup: increase maximum subsystem count from 16 to 32
cgroup: Remove stale cpu.rt.max reference from documentation
cpuset: replace direct lockdep_assert_held() with lockdep_assert_cpuset_lock_held()
cgroup/cpuset: Move the v1 empty cpus/mems check to cpuset1_validate_change()
cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict
cgroup/cpuset: Don't fail cpuset.cpus change in v2
cgroup/cpuset: Consistently compute effective_xcpus in update_cpumasks_hier()
cgroup/cpuset: Streamline rm_siblings_excl_cpus()
cpuset: remove dead code in cpuset-v1.c
cpuset: remove v1-specific code from generate_sched_domains
cpuset: separate generate_sched_domains for v1 and v2
cpuset: move update_domain_attr_tree to cpuset_v1.c
cpuset: add cpuset1_init helper for v1 initialization
cpuset: add cpuset1_online_css helper for v1-specific operations
cpuset: add lockdep_assert_cpuset_lock_held helper
cpuset: Remove unnecessary checks in rebuild_sched_domains_locked
cgroup: switch to css_is_online() helper
selftests: cgroup: Replace sleep with cg_read_key_long_poll() for waiting on nr_dying_descendants
selftests: cgroup: make test_memcg_sock robust against delayed sock stats
...
Pull kthread updates from Frederic Weisbecker:
"The kthread code provides an infrastructure which manages the
preferred affinity of unbound kthreads (node or custom cpumask)
against housekeeping (CPU isolation) constraints and CPU hotplug
events.
One crucial missing piece is the handling of cpuset: when an isolated
partition is created, deleted, or its CPUs updated, all the unbound
kthreads in the top cpuset become indifferently affine to _all_ the
non-isolated CPUs, possibly breaking their preferred affinity along
the way.
Solve this with performing the kthreads affinity update from cpuset to
the kthreads consolidated relevant code instead so that preferred
affinities are honoured and applied against the updated cpuset
isolated partitions.
The dispatch of the new isolated cpumasks to timers, workqueues and
kthreads is performed by housekeeping, as per the nice Tejun's
suggestion.
As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
from boot defined domain isolation (through isolcpus=) and cpuset
isolated partitions. Housekeeping cpumasks are now modifiable with a
specific RCU based synchronization. A big step toward making
nohz_full= also mutable through cpuset in the future"
* tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits)
doc: Add housekeeping documentation
kthread: Document kthread_affine_preferred()
kthread: Comment on the purpose and placement of kthread_affine_node() call
kthread: Honour kthreads preferred affinity after cpuset changes
sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN
sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
kthread: Include kthreadd to the managed affinity list
kthread: Include unbound kthreads in the managed affinity list
kthread: Refine naming of affinity related fields
PCI: Remove superfluous HK_TYPE_WQ check
sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
cpuset: Remove cpuset_cpu_is_isolated()
timers/migration: Remove superfluous cpuset isolation test
cpuset: Propagate cpuset isolation update to timers through housekeeping
cpuset: Propagate cpuset isolation update to workqueue through housekeeping
PCI: Flush PCI probe workqueue on cpuset isolated partition change
sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
sched/isolation: Flush memcg workqueues on cpuset isolated partition change
cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
...
When cpuset isolated partitions get updated, unbound kthreads get
indifferently affine to all non isolated CPUs, regardless of their
individual affinity preferences.
For example kswapd is a per-node kthread that prefers to be affine to
the node it refers to. Whenever an isolated partition is created,
updated or deleted, kswapd's node affinity is going to be broken if any
CPU in the related node is not isolated because kswapd will be affine
globally.
Fix this with letting the consolidated kthread managed affinity code do
the affinity update on behalf of cpuset.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org