In [1] it was reported that the acct(2) system call can be used to
trigger NULL deref in cases where it is set to write to a file that
triggers an internal lookup. This can e.g., happen when pointing acc(2)
to /sys/power/resume. At the point the where the write to this file
happens the calling task has already exited and called exit_fs(). A
lookup will thus trigger a NULL-deref when accessing current->fs.
Reorganize the code so that the the final write happens from the
workqueue but with the caller's credentials. This preserves the
(strange) permission model and has almost no regression risk.
This api should stop to exist though.
Link: https://lore.kernel.org/r/20250127091811.3183623-1-quzicheng@huawei.com [1]
Link: https://lore.kernel.org/r/20250211-work-acct-v1-1-1c16aecab8b3@kernel.org
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Zicheng Qu <quzicheng@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull to receive f3f08c3acf ("sched_ext: Fix incorrect assumption about
migration disabled tasks in task_can_run_on_remote_rq()") which conflicts
with 26176116d9 ("sched_ext: Count SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE in
the right spot") in for-6.15.
While fixing migration disabled task handling, 3296682157 ("sched_ext: Fix
migration disabled handling in targeted dispatches") assumed that a
migration disabled task's ->cpus_ptr would only have the pinned CPU. While
this is eventually true for migration disabled tasks that are switched out,
->cpus_ptr update is performed by migrate_disable_switch() which is called
right before context_switch() in __scheduler(). However, the task is
enqueued earlier during pick_next_task() via put_prev_task_scx(), so there
is a race window where another CPU can see the task on a DSQ.
If the CPU tries to dispatch the migration disabled task while in that
window, task_allowed_on_cpu() will succeed and task_can_run_on_remote_rq()
will subsequently trigger SCHED_WARN(is_migration_disabled()).
WARNING: CPU: 8 PID: 1837 at kernel/sched/ext.c:2466 task_can_run_on_remote_rq+0x12e/0x140
Sched_ext: layered (enabled+all), task: runnable_at=-10ms
RIP: 0010:task_can_run_on_remote_rq+0x12e/0x140
...
<TASK>
consume_dispatch_q+0xab/0x220
scx_bpf_dsq_move_to_local+0x58/0xd0
bpf_prog_84dd17b0654b6cf0_layered_dispatch+0x290/0x1cfa
bpf__sched_ext_ops_dispatch+0x4b/0xab
balance_one+0x1fe/0x3b0
balance_scx+0x61/0x1d0
prev_balance+0x46/0xc0
__pick_next_task+0x73/0x1c0
__schedule+0x206/0x1730
schedule+0x3a/0x160
__do_sys_sched_yield+0xe/0x20
do_syscall_64+0xbb/0x1e0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fix it by converting the SCHED_WARN() back to a regular failure path. Also,
perform the migration disabled test before task_allowed_on_cpu() test so
that BPF schedulers which fail to handle migration disabled tasks can be
noticed easily.
While at it, adjust scx_ops_error() message for !task_allowed_on_cpu() case
for brevity and consistency.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3296682157 ("sched_ext: Fix migration disabled handling in targeted dispatches")
Acked-by: Andrea Righi <arighi@nvidia.com>
Reported-by: Jake Hillion <jakehillion@meta.com>
Depending on CONFIG_HAVE_ARCH_SECCOMP_FILTER, __secure_computing(NULL)
will crash or not. This is not consistent/safe, especially considering
that after the previous change __secure_computing(sd) is always called
with sd == NULL.
Fortunately, if CONFIG_HAVE_ARCH_SECCOMP_FILTER=n, __secure_computing()
has no callers, these architectures use secure_computing_strict(). Yet
it make sense make __secure_computing(NULL) safe in this case.
Note also that with this change we can unexport secure_computing_strict()
and change the current callers to use __secure_computing(NULL).
Fixes: 8cf8dfceeb ("seccomp: Stub for !HAVE_ARCH_SECCOMP_FILTER")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20250128150307.GA15325@redhat.com
Signed-off-by: Kees Cook <kees@kernel.org>
per-CPU cpumasks are dominantly accessed from their own local CPUs,
so allocate them node-local to improve performance.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Count the number of times a migration disabled task is automatically
dispatched to its local DSQ.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE wasn't quite right in two aspects:
- It counted both migration disabled and offline events.
- It didn't count events from scx_bpf_dsq_move() path.
Fix it by moving the counting into task_can_run_on_remote_rq() which is
shared by both paths and can distinguish the different rejection conditions.
The argument @trigger_error is renamed to @enforce as it now does more than
just triggering error.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Pull to receive:
- 2fa0fbeb69 ("sched_ext: Implement auto local dispatching of migration disabled tasks")
- 3296682157 ("sched_ext: Fix migration disabled handling in targeted dispatches")
as planned for-6.15 changes depend on them (e.g. adding event counter for
implicit migration disabled task handling).
A dispatch operation that can target a specific local DSQ -
scx_bpf_dsq_move_to_local() or scx_bpf_dsq_move() - checks whether the task
can be migrated to the target CPU using task_can_run_on_remote_rq(). If the
task can't be migrated to the targeted CPU, it is bounced through a global
DSQ.
task_can_run_on_remote_rq() assumes that the task is on a CPU that's
different from the targeted CPU but the callers doesn't uphold the
assumption and may call the function when the task is already on the target
CPU. When such task has migration disabled, task_can_run_on_remote_rq() ends
up returning %false incorrectly unnecessarily bouncing the task to a global
DSQ.
Fix it by updating the callers to only call task_can_run_on_remote_rq() when
the task is on a different CPU than the target CPU. As this is a bit subtle,
for clarity and documentation:
- Make task_can_run_on_remote_rq() trigger SCHED_WARN_ON() if the task is on
the same CPU as the target CPU.
- is_migration_disabled() test in task_can_run_on_remote_rq() cannot trigger
if the task is on a different CPU than the target CPU as the preceding
task_allowed_on_cpu() test should fail beforehand. Convert the test into
SCHED_WARN_ON().
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Fixes: 0366017e09 ("sched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq()")
Cc: stable@vger.kernel.org # v6.12+
Migration disabled tasks are special and pinned to their previous CPUs. They
tripped up some unsuspecting BPF schedulers as their ->nr_cpus_allowed may
not agree with the bits set in ->cpus_ptr. Make it easier for BPF schedulers
by automatically dispatching them to the pinned local DSQs by default. If a
BPF scheduler wants to handle migration disabled tasks explicitly, it can
set SCX_OPS_ENQ_MIGRATION_DISABLED.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Pull seccomp fix from Kees Cook:
"This is really a work-around for x86_64 having grown a syscall to
implement uretprobe, which has caused problems since v6.11.
This may change in the future, but for now, this fixes the unintended
seccomp filtering when uretprobe switched away from traps, and does so
with something that should be easy to backport.
- Allow uretprobe on x86_64 to avoid behavioral complications (Eyal
Birger)"
* tag 'seccomp-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
selftests/seccomp: validate uretprobe syscall passes through seccomp
seccomp: passthrough uretprobe systemcall without filtering
Pull ftrace fix from Steven Rostedt:
"Function graph fix of notrace functions.
When the function graph tracer was restructured to use the global
section of the meta data in the shadow stack, the bit logic was
changed. There's a TRACE_GRAPH_NOTRACE_BIT that is the bit number in
the mask that tells if the function graph tracer is currently in the
"notrace" mode. The TRACE_GRAPH_NOTRACE is the mask with that bit set.
But when the code we restructured, the TRACE_GRAPH_NOTRACE_BIT was
used when it should have been the TRACE_GRAPH_NOTRACE mask. This made
notrace not work properly"
* tag 'ftrace-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
fgraph: Fix set_graph_notrace with setting TRACE_GRAPH_NOTRACE_BIT
Pull timer fixes from Ingo Molnar:
"Fix a PREEMPT_RT bug in the clocksource verification code that caused
false positive warnings.
Also fix a timer migration setup bug when new CPUs are added"
* tag 'timers-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/migration: Fix off-by-one root mis-connection
clocksource: Use migrate_disable() to avoid calling get_random_u32() in atomic context
Pull scheduler fixes from Ingo Molnar:
"Fix a cfs_rq->h_nr_runnable accounting bug that trips up a defensive
SCHED_WARN_ON() on certain workloads. The bug is believed to be
(accidentally) self-correcting, hence no behavioral side effects are
expected.
Also print se.slice in debug output, since this value can now be set
via the syscall ABI and can be useful to track"
* tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/debug: Provide slice length for fair tasks
sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
Pull locking fix from Ingo Molnar:
"Fix a dangling pointer bug in the futex code used by the uring code.
It isn't causing problems at the moment due to uring ABI limitations
leaving it essentially unused in current usages, but is a good idea to
fix nevertheless"
* tag 'locking-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Pass in task to futex_queue()
The code was restructured where the function graph notrace code, that
would not trace a function and all its children is done by setting a
NOTRACE flag when the function that is not to be traced is hit.
There's a TRACE_GRAPH_NOTRACE_BIT which defines the bit in the flags and a
TRACE_GRAPH_NOTRACE which is the mask with that bit set. But the
restructuring used TRACE_GRAPH_NOTRACE_BIT when it should have used
TRACE_GRAPH_NOTRACE.
For example:
# cd /sys/kernel/tracing
# echo set_track_prepare stack_trace_save > set_graph_notrace
# echo function_graph > current_tracer
# cat trace
[..]
0) | __slab_free() {
0) | free_to_partial_list() {
0) | arch_stack_walk() {
0) | __unwind_start() {
0) 0.501 us | get_stack_info();
Where a non filter trace looks like:
# echo > set_graph_notrace
# cat trace
0) | free_to_partial_list() {
0) | set_track_prepare() {
0) | stack_trace_save() {
0) | arch_stack_walk() {
0) | __unwind_start() {
Where the filter should look like:
# cat trace
0) | free_to_partial_list() {
0) | _raw_spin_lock_irqsave() {
0) 0.350 us | preempt_count_add();
0) 0.351 us | do_raw_spin_lock();
0) 2.440 us | }
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250208001511.535be150@batman.local.home
Fixes: b84214890a ("function_graph: Move graph notrace bit to shadow stack global var")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
bpf_arena_alloc_pages() and bpf_arena_free_pages() work with the
bpf_arena pointers [1], which is indicated by the __arena macro in the
kernel source code:
#define __arena __attribute__((address_space(1)))
However currently this information is absent from the debug data in
the vmlinux binary. As a consequence, bpf_arena_* kfuncs declarations
in vmlinux.h (produced by bpftool) do not match prototypes expected by
the BPF programs attempting to use these functions.
Introduce a set of kfunc flags to mark relevant types as bpf_arena
pointers. The flags then can be detected by pahole when generating BTF
from vmlinux's DWARF, allowing it to emit corresponding BTF type tags
for the marked kfuncs.
With recently proposed BTF extension [2], these type tags will be
processed by bpftool when dumping vmlinux.h, and corresponding
compiler attributes will be added to the declarations.
[1] https://lwn.net/Articles/961594/
[2] https://lore.kernel.org/bpf/20250130201239.1429648-1-ihor.solodrai@linux.dev/
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20250206003148.2308659-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Refactor get_constant_map_key() to disambiguate the constant key
value from potential error values. In the case that the key is
negative, it could be confused for an error.
It's not currently an issue, as the verifier seems to track s32 spills
as u32. So even if the program wrongly uses a negative value for an
arraymap key, the verifier just thinks it's an impossibly high value
which gets correctly discarded.
Refactor anyways to make things cleaner and prevent potential future
issues.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/dfe144259ae7cfc98aa63e1b388a14869a10632a.1738689872.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Previously, we were trying to extract constant map keys for all
bpf_map_lookup_elem(), regardless of map type. This is an issue if the
map has a u64 key and the value is very high, as it can be interpreted
as a negative signed value. This in turn is treated as an error value by
check_func_arg() which causes a valid program to be incorrectly
rejected.
Fix by only extracting constant map keys for relevant maps. This fix
works because nullness elision is only allowed for {PERCPU_}ARRAY maps,
and keys for these are within u32 range. See next commit for an example
via selftest.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Reported-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/aa868b642b026ff87ba6105ea151bc8693b35932.1738689872.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a core event, SCX_EV_ENQ_SLICE_DFL, which represents how many
tasks have been enqueued (or pick_task-ed or select_cpu-ed) with
a default time slice (SCX_SLICE_DFL).
Scheduling a task with SCX_SLICE_DFL unintentionally would be a source
of latency spikes because SCX_SLICE_DFL is relatively long (20 msec).
Thus, soaring the SCX_EV_ENQ_SLICE_DFL value would be a sign of BPF
scheduler bugs, causing latency spikes, especially when ops.select_cpu()
is provided.
__scx_add_event() is used since the caller holds an rq lock or p->pi_lock,
so the preemption has already been disabled.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The CPU usage time is the time when user, system or both are using the CPU.
Steal time is the time when CPU is waiting to be run by the Hypervisor. It
should not be added to the CPU usage time, hence removing it from the
usage_usec entry.
Fixes: 936f2a70f2 ("cgroup: add cpu.stat file to root cgroup")
Acked-by: Axel Busch <axel.busch@ibm.com>
Acked-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Muhammad Adeel <muhammad.adeel@ibm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
After patch1~14 is applied, all sysctls of vm_table
would be moved. So, delete vm_table.
Signed-off-by: Kaixiong Yu <yukaixiong@huawei.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
When CONFIG_SUPERH and CONFIG_VSYSCALL are defined,
vdso_enabled belongs to arch/sh/kernel/vsyscall/vsyscall.c.
So, move it into its own file. To avoid failure when registering
the vdso_table, move the call to register_sysctl_init() into
its own fs_initcall().
Signed-off-by: Kaixiong Yu <yukaixiong@huawei.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
When CONFIG_X86_32 is defined and CONFIG_UML is not defined,
vdso_enabled belongs to arch/x86/entry/vdso/vdso32-setup.c.
So, move it into its own file.
Before this patch, vdso_enabled was allowed to be set to
a value exceeding 1 on x86_32 architecture. After this patch is
applied, vdso_enabled is not permitted to set the value more than 1.
It does not matter, because according to the function load_vdso32(),
only vdso_enabled is set to 1, VDSO would be enabled. Other values
all mean "disabled". The same limitation could be seen in the
function vdso32_setup().
Signed-off-by: Kaixiong Yu <yukaixiong@huawei.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
The sysctl_vfs_cache_pressure belongs to fs/dcache.c, move it to
fs/dcache.c from kernel/sysctl.c. As a part of fs/dcache.c cleaning,
sysctl_vfs_cache_pressure is changed to a static variable, and change
the inline-type function vfs_pressure_ratio() to out-of-inline type,
export vfs_pressure_ratio() with EXPORT_SYMBOL_GPL to be used by other
files. Move the unneeded include(linux/dcache.h).
Signed-off-by: Kaixiong Yu <yukaixiong@huawei.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
The sysctl_drop_caches to fs/drop_caches.c, move it to
fs/drop_caches.c from /kernel/sysctl.c. And remove the
useless extern variable declaration from include/linux/mm.h
Signed-off-by: Kaixiong Yu <yukaixiong@huawei.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
The dirtytime_expire_interval belongs to fs/fs-writeback.c, move it to
fs/fs-writeback.c from /kernel/sysctl.c. And remove the useless extern
variable declaration and the function declaration from
include/linux/writeback.h
Signed-off-by: Kaixiong Yu <yukaixiong@huawei.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
The sysctl_nr_trim_pages belongs to nommu.c, move it to mm/nommu.c
from /kernel/sysctl.c. And remove the useless extern variable declaration
from include/linux/mm.h
Signed-off-by: Kaixiong Yu <yukaixiong@huawei.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
The dac_mmap_min_addr belongs to min_addr.c, move it to
min_addr.c from /kernel/sysctl.c. In the previous Linux kernel
boot process, sysctl_init_bases needs to be executed before
init_mmap_min_addr, So, register_sysctl_init should be executed
before update_mmap_min_addr in init_mmap_min_addr. And according
to the compilation condition in security/Makefile:
obj-$(CONFIG_MMU) += min_addr.o
if CONFIG_MMU is not defined, min_addr.c would not be included in the
compilation process. So, drop the CONFIG_MMU check.
Signed-off-by: Kaixiong Yu <yukaixiong@huawei.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
This moves all util related sysctls to mm/util.c, as part of the
kernel/sysctl.c cleaning, also removes redundant external
variable declarations and function declarations.
Signed-off-by: Kaixiong Yu <yukaixiong@huawei.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
This moves vm_swappiness and zone_reclaim_mode to mm/vmscan.c,
as part of the kernel/sysctl.c cleaning, also moves some external
variable declarations and function declarations from include/linux/swap.h
into mm/internal.h.
Signed-off-by: Kaixiong Yu <yukaixiong@huawei.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
The page-cluster belongs to mm/swap.c, move it to mm/swap.c .
Removes the redundant external variable declaration and unneeded
include(linux/swap.h).
Signed-off-by: Kaixiong Yu <yukaixiong@huawei.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
This moves all vmstat related sysctls to its own file, removes useless
extern variable declarations, and do some related clean-ups. To avoid
compiler warnings when CONFIG_PROC_FS is not defined, add the macro
definition CONFIG_PROC_FS ahead CONFIG_NUMA in vmstat.c.
Signed-off-by: Kaixiong Yu <yukaixiong@huawei.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
As the clone side already executes pid allocation with only pidmap_lock
held, issuing free_pid() while still holding tasklist_lock exacerbates
total hold time of the latter.
More things may show up later which require initial clean up with the
lock held and allow finishing without it. For that reason a struct to
collect such work is added instead of merely passing the pid array.
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250206164415.450051-5-mjguzik@gmail.com
Acked-by: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
It predates the git history and most probably it was never needed. It
doesn't really hurt, but it looks confusing because its purpose is not
clear at all.
release_task(p) is called when this task has already passed exit_notify()
so signal_pending(p) == T shouldn't make any difference.
And even _if_ there were a subtle reason to clear TIF_SIGPENDING after
exit_notify(), this clear_tsk_thread_flag() can't help anyway. If the
exiting task is a group leader or if it is ptraced, release_task() will
be likely called when this task has already done its last schedule() from
do_task_dead().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250206152334.GB14620@redhat.com
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
A task can block a signal, accumulate up to RLIMIT_SIGPENDING sigqueues,
and exit. In this case __exit_signal()->flush_sigqueue() called with irqs
disabled can trigger a hard lockup, see
https://lore.kernel.org/all/20190322114917.GC28876@redhat.com/
Fortunately, after the recent posixtimer changes sys_timer_delete() paths
no longer try to clear SIGQUEUE_PREALLOC and/or free tmr->sigq, and after
the exiting task passes __exit_signal() lock_task_sighand() can't succeed
and pid_task(tmr->it_pid) will return NULL.
This means that after __exit_signal(tsk) nobody can play with tsk->pending
or (if group_dead) with tsk->signal->shared_pending, so release_task() can
safely call flush_sigqueue() after write_unlock_irq(&tasklist_lock).
TODO:
- we can probably shift posix_cpu_timers_exit() as well
- do_sigaction() can hit the similar problem
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250206152314.GA14620@redhat.com
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Before attaching a new root to the old root, the children counter of the
new root is checked to verify that only the upcoming CPU's top group have
been connected to it. However since the recently added commit b729cc1ec2
("timers/migration: Fix another race between hotplug and idle entry/exit")
this check is not valid anymore because the old root is pre-accounted
as a child to the new root. Therefore after connecting the upcoming
CPU's top group to the new root, the children count to be expected must
be 2 and not 1 anymore.
This omission results in the old root to not be connected to the new
root. Then eventually the system may run with more than one top level,
which defeats the purpose of a single idle migrator.
Also the old root is pre-accounted but not connected upon the new root
creation. But it can be connected to the new root later on. Therefore
the old root may be accounted twice to the new root. The propagation of
such overcommit can end up creating a double final top-level root with a
groupmask incorrectly initialized. Although harmless given that the final
top level roots will never have a parent to walk up to, this oddity
opportunistically reported the core issue:
WARNING: CPU: 8 PID: 0 at kernel/time/timer_migration.c:543 tmigr_requires_handle_remote
CPU: 8 UID: 0 PID: 0 Comm: swapper/8
RIP: 0010:tmigr_requires_handle_remote
Call Trace:
<IRQ>
? tmigr_requires_handle_remote
? hrtimer_run_queues
update_process_times
tick_periodic
tick_handle_periodic
__sysvec_apic_timer_interrupt
sysvec_apic_timer_interrupt
</IRQ>
Fix the problem by taking the old root into account in the children count
of the new root so the connection is not omitted.
Also warn when more than one top level group exists to better detect
similar issues in the future.
Fixes: b729cc1ec2 ("timers/migration: Fix another race between hotplug and idle entry/exit")
Reported-by: Matt Fleming <mfleming@cloudflare.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250205160220.39467-1-frederic@kernel.org