linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-21 04:24:58 -04:00

Author	SHA1	Message	Date
Linus Torvalds	9dd1835ecd	Merge tag 'dma-mapping-6.17-2025-09-09' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fix from Marek Szyprowski: - one more fix for DMA API debugging infrastructure (Baochen Qiang) * tag 'dma-mapping-6.17-2025-09-09' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-debug: don't enforce dma mapping check on noncoherent allocations	2025-09-09 11:03:04 -07:00
Jiri Wiesner	b9aa93aa51	clocksource: Print durations for sync check unconditionally A typical set of messages that gets printed as a result of the clocksource watchdog finding the TSC unstable usually does not contain messages indicating CPUs being ahead of or behind the CPU from which the check is carried out. That fact suggests that the TSC does not experience time skew between CPUs (if the clocksource.verify_n_cpus parameter is set to a negative value) but quantitative information is missing. The cs_nsec_max value printed by the "CPU %d check durations" message actually provides a worst case estimate of the time skew. If all CPUs have been checked, the cs_nsec_max value multiplied by 2 is the maximum possible time skew between the TSCs of any two CPUs on the system. The worst case estimate is derived from two boundary cases: 1. No time is consumed to execute instructions between csnow_begin and csnow_mid while all the cs_nsec_max time is consumed by the code between csnow_mid and csnow_end. In this case, the maximum undetectable time skew of a CPU being ahead would be cs_nsec_max. 2. All the cs_nsec_max time is consumed to execute instructions between csnow_begin and csnow_mid while no time is consumed by the code between csnow_mid and csnow_end. In this case, the maximum undetectable time skew of a CPU being behind would be cs_nsec_max. The worst case estimate assumes a system experiencing a corner case consisting of the two boundary cases. Always print the "CPU %d check durations" message so that the maximum possible time skew measured by the TSC sync check can be compared to the time skew measured by the clocksource watchdog. Signed-off-by: Jiri Wiesner <jwiesner@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/all/aIuXXfdITXdI0lLp@incl	2025-09-09 14:08:19 +02:00
Xiongfeng Wang	e895f8e291	hrtimers: Unconditionally update target CPU base after offline timer migration When testing softirq based hrtimers on an ARM32 board, with high resolution mode and NOHZ inactive, softirq based hrtimers fail to expire after being moved away from an offline CPU: CPU0 CPU1 hrtimer_start(..., HRTIMER_MODE_SOFT); cpu_down(CPU1) ... hrtimers_cpu_dying() // Migrate timers to CPU0 smp_call_function_single(CPU0, retrigger_next_event); retrigger_next_event() if (!highres && !nohz) return; As retrigger_next_event() is a NOOP when both high resolution timers and NOHZ are inactive CPU0's hrtimer_cpu_base::softirq_expires_next is not updated and the migrated softirq timers never expire unless there is a softirq based hrtimer queued on CPU0 later. Fix this by removing the hrtimer_hres_active() and tick_nohz_active() check in retrigger_next_event(), which enforces a full update of the CPU base. As this is not a fast path the extra cost does not matter. [ tglx: Massaged change log ] Fixes: `5c0930ccaa` ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Co-developed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250805081025.54235-1-wangxiongfeng2@huawei.com	2025-09-09 14:05:16 +02:00
Bibo Mao	fe2a449a45	tick: Do not set device to detached state in tick_shutdown() tick_shutdown() sets the state of the clockevent device to detached first and then invokes clockevents_exchange_device(), which in turn invokes clockevents_switch_state(). But clockevents_switch_state() returns without invoking the device shutdown callback as the device is already in detached state. As a consequence the timer device is not shut down when a CPU goes offline. tick_shutdown() does this because it was originally invoked on an online CPU and not on the outgoing CPU. It therefore could not access the clockevent device of the already offlined CPU and just set the state. Since commit `3b1596a21f` tick_shutdown() is called on the outgoing CPU, so the hardware device can be accessed. Remove the state set before calling clockevents_exchange_device(), so that the subsequent clockevents_switch_state() handles the state transition and invokes the shutdown callback of the clockevent device. [ tglx: Massaged change log ] Fixes: `3b1596a21f` ("clockevents: Shutdown and unregister current clockevents at CPUHP_AP_TICK_DYING") Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250906064952.3749122-2-maobibo@loongson.cn	2025-09-09 13:39:00 +02:00
Thomas Weißschuh	3c3af563b3	hrtimer: Reorder branches in hrtimer_clockid_to_base() Align the ordering to the one used for hrtimer_bases. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-9-3ae822e5bfbd@linutronix.de	2025-09-09 12:27:18 +02:00
Thomas Weißschuh	009eb5da29	hrtimer: Remove hrtimer_clock_base::get_time The get_time() callbacks always need to match the base's clockid. Instead of maintaining that association twice in hrtimer_bases, use a helper. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-8-3ae822e5bfbd@linutronix.de	2025-09-09 12:27:18 +02:00
Thomas Weißschuh	b68b7f3e9b	sched/core: Avoid direct access to hrtimer clockbase The field timer->base->get_time is a private implementation detail and should not be accessed outside of the hrtimer core. Switch to the equivalent helper. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-3-3ae822e5bfbd@linutronix.de	2025-09-09 12:27:18 +02:00
Thomas Weißschuh	5f531fe9cb	timers/itimer: Avoid direct access to hrtimer clockbase The field timer->base->get_time is a private implementation detail and should not be accessed outside of the hrtimer core. Switch to the equivalent helper. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-2-3ae822e5bfbd@linutronix.de	2025-09-09 12:27:17 +02:00
Thomas Weißschuh	24fb08dcc4	posix-timers: Avoid direct access to hrtimer clockbase The field timer->base->get_time is a private implementation detail and should not be accessed outside of the hrtimer core. Switch to the equivalent helpers. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-1-3ae822e5bfbd@linutronix.de	2025-09-09 12:27:17 +02:00
Pu Lehui	cd4453c5e9	tracing: Silence warning when chunk allocation fails in trace_pid_write Syzkaller triggers a fault injection warning: WARNING: CPU: 1 PID: 12326 at tracepoint_add_func+0xbfc/0xeb0 Modules linked in: CPU: 1 UID: 0 PID: 12326 Comm: syz.6.10325 Tainted: G U 6.14.0-rc5-syzkaller #0 Tainted: [U]=USER Hardware name: Google Compute Engine/Google Compute Engine RIP: 0010:tracepoint_add_func+0xbfc/0xeb0 kernel/tracepoint.c:294 Code: 09 fe ff 90 0f 0b 90 0f b6 74 24 43 31 ff 41 bc ea ff ff ff RSP: 0018:ffffc9000414fb48 EFLAGS: 00010283 RAX: 00000000000012a1 RBX: ffffffff8e240ae0 RCX: ffffc90014b78000 RDX: 0000000000080000 RSI: ffffffff81bbd78b RDI: 0000000000000001 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: ffffffffffffffef R13: 0000000000000000 R14: dffffc0000000000 R15: ffffffff81c264f0 FS: 00007f27217f66c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2e80dff8 CR3: 00000000268f8000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> tracepoint_probe_register_prio+0xc0/0x110 kernel/tracepoint.c:464 register_trace_prio_sched_switch include/trace/events/sched.h:222 [inline] register_pid_events kernel/trace/trace_events.c:2354 [inline] event_pid_write.isra.0+0x439/0x7a0 kernel/trace/trace_events.c:2425 vfs_write+0x24c/0x1150 fs/read_write.c:677 ksys_write+0x12b/0x250 fs/read_write.c:731 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f We can reproduce the warning by following the steps below: 1. echo 8 >> set_event_notrace_pid. Let tr->filtered_pids own one pid and register sched_switch tracepoint. 2. echo ' ' >> set_event_pid, and perform fault injection during chunk allocation of trace_pid_list_alloc. Let pid_list with no pid and assign to tr->filtered_pids. 3. echo ' ' >> set_event_pid. Let pid_list be NULL and assign to tr->filtered_pids. 4. echo 9 >> set_event_pid, will trigger the double register sched_switch tracepoint warning. The reason is that syzkaller injects a fault into the chunk allocation in trace_pid_list_alloc, causing a failure in trace_pid_list_set, which may trigger double register of the same tracepoint. This only occurs when the system is about to crash, but to suppress this warning, let's add failure handling logic to trace_pid_list_set. Link: https://lore.kernel.org/20250908024658.2390398-1-pulehui@huaweicloud.com Fixes: `8d6e90983a` ("tracing: Create a sparse bitmask for pid filtering") Reported-by: syzbot+161412ccaeff20ce4dde@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67cb890e.050a0220.d8275.022e.GAE@google.com Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-09-08 14:56:43 -04:00
Marco Crivellari	a857210b10	bpf: WQ_PERCPU added to alloc_workqueue users Currently if a user enqueues a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt-in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This patch adds a new WQ_PERCPU flag to explicitly request the use of the per-CPU behavior. Both flags coexist for one release cycle to allow callers to transition their calls. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. All existing users have been updated accordingly. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://lore.kernel.org/r/20250905085309.94596-4-marco.crivellari@suse.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-08 10:04:37 -07:00
Marco Crivellari	0409819a00	bpf: replace use of system_unbound_wq with system_dfl_wq Currently if a user enqueues a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required. Adding system_dfl_wq to encourage its use when unbound work should be used. queue_work() / queue_delayed_work() / mod_delayed_work() will now use the new unbound wq: if the user still uses the old wq, a warning will be printed along with a wq redirect to the new one. The old system_unbound_wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://lore.kernel.org/r/20250905085309.94596-3-marco.crivellari@suse.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-08 10:04:37 -07:00
Marco Crivellari	34f86083a4	bpf: replace use of system_wq with system_percpu_wq Currently if a user enqueues a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. system_wq is a per-CPU workqueue, yet nothing in its name tells about that CPU affinity constraint, which is very often not required by users. Make it clear by adding a system_percpu_wq. queue_work() / queue_delayed_work() / mod_delayed_work() will now use the new per-cpu wq: if the user still sticks to the old name, a warning will be printed along with a wq redirect to the new one. This patch adds the new system_percpu_wq except for the mm, fs and net subsystems, which are handled in separate patches. The old wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://lore.kernel.org/r/20250905085309.94596-2-marco.crivellari@suse.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-08 10:04:37 -07:00
Linus Torvalds	6ab41fca2e	Merge tag 'timers-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix a severe slowdown regression in the timer vDSO code related to the while() loop in __iter_div_u64_rem(), when the AUX-clock is enabled" * tag 'timers-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: vdso/vsyscall: Avoid slow division loop in auxiliary clock update	2025-09-07 08:29:44 -07:00
Linus Torvalds	b7369eb731	Merge tag 'locking-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Ingo Molnar: "Fix an 'allocation from atomic context' regression in the futex vmalloc variant" * tag 'locking-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Move futex_hash_free() back to __mmput()	2025-09-07 08:26:28 -07:00
Linus Torvalds	6a8a34a56a	Merge tag 'perf-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf event fix from Ingo Molnar: "Fix regression where PERF_EVENT_IOC_REFRESH counters miss a PMU-stop" * tag 'perf-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix the POLL_HUP delivery breakage	2025-09-07 08:24:20 -07:00
Wang Liang	c1628c00c4	tracing/osnoise: Fix null-ptr-deref in bitmap_parselist() A crash was observed with the following output: BUG: kernel NULL pointer dereference, address: 0000000000000010 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 2 UID: 0 PID: 92 Comm: osnoise_cpus Not tainted 6.17.0-rc4-00201-gd69eb204c255 #138 PREEMPT(voluntary) RIP: 0010:bitmap_parselist+0x53/0x3e0 Call Trace: <TASK> osnoise_cpus_write+0x7a/0x190 vfs_write+0xf8/0x410 ? do_sys_openat2+0x88/0xd0 ksys_write+0x60/0xd0 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> This issue can be reproduced by the code below: fd=open("/sys/kernel/debug/tracing/osnoise/cpus", O_WRONLY); write(fd, "0-2", 0); When the user passes 'count=0' to osnoise_cpus_write(), kmalloc() will return ZERO_SIZE_PTR (16) and cpulist_parse() treats it as a normal value, which triggers the null pointer dereference. Add a check for the parameter 'count'. Cc: <mhiramat@kernel.org> Cc: <mathieu.desnoyers@efficios.com> Cc: <tglozar@redhat.com> Link: https://lore.kernel.org/20250906035610.3880282-1-wangliang74@huawei.com Fixes: `17f89102fe` ("tracing/osnoise: Allow arbitrarily long CPU string") Signed-off-by: Wang Liang <wangliang74@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-09-06 12:12:38 -04:00
Guenter Roeck	ab1396af75	trace/fgraph: Fix error handling Commit `edede7a6dc` ("trace/fgraph: Fix the warning caused by missing unregister notifier") added a call to unregister the PM notifier if register_ftrace_graph() failed. It does so unconditionally. However, the PM notifier is only registered with the first call to register_ftrace_graph(). If the first registration was successful and a subsequent registration failed, the notifier is now unregistered even if ftrace graphs are still registered. Fix the problem by only unregistering the PM notifier during error handling if there are no active fgraph registrations. Fixes: `edede7a6dc` ("trace/fgraph: Fix the warning caused by missing unregister notifier") Closes: https://lore.kernel.org/all/63b0ba5a-a928-438e-84f9-93028dd72e54@roeck-us.net/ Cc: Ye Weihua <yeweihua4@huawei.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250906050618.2634078-1-linux@roeck-us.net Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-09-06 12:12:38 -04:00
Linus Torvalds	730c1451fb	Merge tag 'audit-pr-20250905' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit fix from Paul Moore: "A single small audit patch to fix a potential out-of-bounds read caused by a negative array index when comparing paths" * tag 'audit-pr-20250905' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: fix out-of-bounds read in audit_compare_dname_path()	2025-09-05 12:35:25 -07:00
Marco Crivellari	a2be943b46	workqueue: replace use of system_wq with system_percpu_wq Currently if a user enqueues a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. system_wq is a per-CPU workqueue, yet nothing in its name tells about that CPU affinity constraint, which is very often not required by users. Make it clear by adding a system_percpu_wq. queue_work() / queue_delayed_work() / mod_delayed_work() will now use the new per-cpu wq: if the user still sticks to the old name, a warning will be printed along with a wq redirect to the new one. This patch adds the new system_percpu_wq except for the mm, fs and net subsystems, which are handled in separate patches. The old wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-05 07:20:00 -10:00
Marco Crivellari	f6cfa602d2	workqueue: replace use of system_unbound_wq with system_dfl_wq Currently if a user enqueues a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required. Adding system_dfl_wq to encourage its use when unbound work should be used. queue_work() / queue_delayed_work() / mod_delayed_work() will now use the new unbound wq: if the user still uses the old wq, a warning will be printed along with a wq redirect to the new one. The old system_unbound_wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-05 07:19:09 -10:00
Tejun Heo	4a3e62dfa7	cgroup: Merge branch 'for-6.17-fixes' into for-6.18 Pull for-6.17-fixes to receive `79f919a89c` ("cgroup: split cgroup_destroy_wq into 3 workqueues") to resolve its conflict with `7fa33aa3b0` ("cgroup: WQ_PERCPU added to alloc_workqueue users"). The latter adds WQ_PERCPU when creating cgroup_destroy_wq and the former splits the workqueue into three. Resolve by applying WQ_PERCPU to the three split workqueues. Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-05 07:08:26 -10:00
Marco Crivellari	7fa33aa3b0	cgroup: WQ_PERCPU added to alloc_workqueue users Currently if a user enqueues a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt-in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This patch adds a new WQ_PERCPU flag to explicitly request the use of the per-CPU behavior. Both flags coexist for one release cycle to allow callers to transition their calls. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. All existing users have been updated accordingly. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-05 06:40:25 -10:00
Marco Crivellari	d6256771d1	cgroup: replace use of system_wq with system_percpu_wq Currently if a user enqueues a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. system_wq is a per-CPU workqueue, yet nothing in its name tells about that CPU affinity constraint, which is very often not required by users. Make it clear by adding a system_percpu_wq. queue_work() / queue_delayed_work() / mod_delayed_work() will now use the new per-cpu wq: if the user still sticks to the old name, a warning will be printed along with a wq redirect to the new one. This patch adds the new system_percpu_wq except for the mm, fs and net subsystems, which are handled in separate patches. The old wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-05 06:40:12 -10:00
Tejun Heo	222f83d5ab	cgroup: Remove unused local variables from cgroup_procs_write_finish() `d8b269e009` ("cgroup: Remove unused cgroup_subsys::post_attach") made $ss and $ssid unused but didn't drop them leading to compilation warnings. Drop them. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Chuyi Zhou <zhouchuyi@bytedance.com>	2025-09-04 11:23:43 -10:00
Jakub Kicinski	5ef04a7b06	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.17-rc5). No conflicts. Adjacent changes: include/net/sock.h `c51613fa27` ("net: add sk->sk_drop_counters") `5d6b58c932` ("net: lockless sock_i_ino()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-04 13:33:00 -07:00
Andrea Righi	47d9f82128	sched_ext: Fix NULL dereference in scx_bpf_cpu_rq() warning When printing the deprecation warning for scx_bpf_cpu_rq(), we may hit a NULL pointer dereference if the kfunc is called before a BPF scheduler is fully attached, for example, when invoked from a BPF timer or during ops.init(): [ 50.752775] BUG: kernel NULL pointer dereference, address: 0000000000000331 ... [ 50.764205] RIP: 0010:scx_bpf_cpu_rq+0x30/0xa0 ... [ 50.787661] Call Trace: [ 50.788398] <TASK> [ 50.789061] bpf_prog_08f7fd2dcb187aaf_wakeup_timerfn+0x75/0x1a8 [ 50.792477] bpf_timer_cb+0x7e/0x140 [ 50.796003] hrtimer_run_softirq+0x91/0xe0 [ 50.796952] handle_softirqs+0xce/0x3c0 [ 50.799087] run_ksoftirqd+0x3e/0x70 [ 50.800197] smpboot_thread_fn+0x133/0x290 [ 50.802320] kthread+0x115/0x220 [ 50.804984] ret_from_fork+0x17a/0x1d0 [ 50.806920] ret_from_fork_asm+0x1a/0x30 [ 50.807799] </TASK> Fix this by only printing the warning once the scheduler is fully registered. Fixes: `5c48d88fe0` ("sched_ext: deprecation warn for scx_bpf_cpu_rq()") Cc: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-04 10:27:16 -10:00
Al Viro	b28f9eba12	change the calling conventions for vfs_parse_fs_string() Absolute majority of callers are passing the 4th argument equal to strlen() of the 3rd one. Drop the v_size argument, add vfs_parse_fs_qstr() for the cases that want independent length. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-04 15:20:51 -04:00
Samuel Wu	56a232d93c	PM: sleep: Make pm_wakeup_clear() call more clear Move pm_wakeup_clear() to the same location as other functions that do bookkeeping prior to suspend_prepare(). Since calling pm_wakeup_clear() is a prerequisite to setting up for suspend and enabling functionalities of suspend (like aborting during suspend), moving pm_wakeup_clear() higher up the call stack makes its intent more clear and obvious that it is called prior to suspend_prepare(). After this change, there is a slightly larger window when abort events can be registered, but otherwise suspend functionality is the same. Suggested-by: Saravana Kannan <saravanak@google.com> Signed-off-by: Samuel Wu <wusamuel@google.com> Link: https://patch.msgid.link/20250821004237.2712312-2-wusamuel@google.com Reviewed-by: Saravana Kannan <saravanak@google.com> [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-09-04 21:05:14 +02:00
Sebastian Andrzej Siewior	ad7c7f4b9c	workqueue: Provide a handshake for canceling BH workers While a BH work item is canceled, the core code spins until it determines that the item completed. On PREEMPT_RT the spinning relies on a lock in local_bh_disable() to avoid a live lock if the canceling thread has higher priority than the BH-worker and preempts it. This lock ensures that the BH-worker makes progress by PI-boosting it. This lock in local_bh_disable() is a central per-CPU BKL and about to be removed. To provide the required synchronisation add a per pool lock. The lock is acquired by the bh_worker at the beginning while the individual callbacks are invoked. To enforce progress in case of interruption, __flush_work() needs to acquire the lock. This will flush all BH-work items assigned to that pool. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-04 07:28:33 -10:00
Chuyi Zhou	d8b269e009	cgroup: Remove unused cgroup_subsys::post_attach cgroup_subsys::post_attach callback was introduced in commit `5cf1cacb49` ("cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback") and only cpuset would use this callback to wait for the mm migration to complete at the end of __cgroup_procs_write(). Since the previous patch defers the flush operation until returning to userspace, no one uses this callback now. Remove this callback from cgroup_subsys. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-04 07:25:20 -10:00
Chuyi Zhou	3514309e03	cpuset: Defer flushing of the cpuset_migrate_mm_wq to task_work Now in cpuset_attach(), we need to synchronously wait for flush_workqueue to complete. The execution time of flushing cpuset_migrate_mm_wq depends on the amount of mm migration initiated by cpusets at that time. When the cpuset.mems of a cgroup occupying a large amount of memory is modified, it may trigger extensive mm migration, causing cpuset_attach() to block on flush_workqueue for an extended period. This could be dangerous because cpuset_attach() is within the critical section of cgroup_mutex, which may ultimately cause all cgroup-related operations in the system to be blocked. This patch attempts to defer the flush_workqueue() operation until returning to userspace using the task_work which is originally proposed by tejun[1], so that flush happens after cgroup_mutex is dropped. That way we maintain the operation synchronicity while avoiding bothering anyone else. [1]: https://lore.kernel.org/cgroups/ZgMFPMjZRZCsq9Q-@slm.duckdns.org/T/#m117f606fa24f66f0823a60f211b36f24bd9e1883 Originally-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-04 07:22:38 -10:00
Chuyi Zhou	c0fb16ef88	cpuset: Don't always flush cpuset_migrate_mm_wq in cpuset_write_resmask It is unnecessary to always wait for the flush operation of cpuset_migrate_mm_wq to complete in cpuset_write_resmask, as modifying cpuset.cpus or cpuset.exclusive does not trigger mm migrations. The flush_workqueue can be executed only when cpuset.mems is modified. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-04 07:15:30 -10:00
Zqiang	cda2b2d647	workqueue: Remove rcu_read_lock/unlock() in wq_watchdog_timer_fn() wq_watchdog_timer_fn() is executed in softirq context, which is already an RCU read critical section, so this commit removes rcu_read_lock/unlock() in wq_watchdog_timer_fn(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-04 06:18:00 -10:00
Zqiang	fd5081f4ef	workqueue: Remove redundant rcu_read_lock/unlock() in workqueue_congested() The preempt_disable/enable() pair already forms an RCU read critical section, so this commit removes rcu_read_lock/unlock() in workqueue_congested(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-04 06:17:52 -10:00
Rong Tao	19559e8441	bpf: add bpf_strcasecmp kfunc The bpf_strcasecmp() function behaves like bpf_strcmp() except that it ignores the case of the characters. Signed-off-by: Rong Tao <rongtao@cestc.cn> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Viktor Malik <vmalik@redhat.com> Link: https://lore.kernel.org/r/tencent_292BD3682A628581AA904996D8E59F4ACD06@qq.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-04 09:00:57 -07:00
Eric Dumazet	2aef21a6a6	audit: init ab->skb_list earlier in audit_buffer_alloc() syzbot found a bug in audit_buffer_alloc() if nlmsg_new() returns NULL. We need to initialize ab->skb_list before calling audit_buffer_free() which will use both the skb_list spinlock and list pointers. Fixes: `eb59d494ee` ("audit: add record for multiple task security contexts") Reported-by: syzbot+bb185b018a51f8d91fd2@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/68b93e3c.a00a0220.eb3d.0000.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Eric Paris <eparis@redhat.com> Cc: audit@vger.kernel.org Signed-off-by: Paul Moore <paul@paul-moore.com>	2025-09-04 11:06:33 -04:00
Thomas Weißschuh	ea1a1fa919	time: Build generic update_vsyscall() only with generic time vDSO The generic vDSO can be used without the time-related functionality. In that case the generic update_vsyscall() from kernel/time/vsyscall.c should not be built. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250826-vdso-cleanups-v1-5-d9b65750e49f@linutronix.de	2025-09-04 11:23:50 +02:00
Gatien Chevallier	96c88268b7	time: export timespec64_add_safe() symbol Export the timespec64_add_safe() symbol so that this function can be used in modules that perform time-related computation. Signed-off-by: Gatien Chevallier <gatien.chevallier@foss.st.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20250901-relative_flex_pps-v4-1-b874971dfe85@foss.st.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-03 16:51:08 -07:00
Christian Loehle	5c48d88fe0	sched_ext: deprecation warn for scx_bpf_cpu_rq() scx_bpf_cpu_rq() works on an unlocked rq which generally isn't safe. For the common use-cases scx_bpf_locked_rq() and scx_bpf_cpu_curr() work, so add a deprecation warning to scx_bpf_cpu_rq() so it can eventually be removed. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-03 11:51:57 -10:00
Christian Loehle	20b158094a	sched_ext: Introduce scx_bpf_cpu_curr() Provide scx_bpf_cpu_curr() as a way for scx schedulers to check the curr task of a remote rq without assuming its lock is held. Many scx schedulers make use of scx_bpf_cpu_rq() to check a remote curr (e.g. to see if it should be preempted). This is problematic because scx_bpf_cpu_rq() provides access to all fields of struct rq, most of which aren't safe to use without holding the associated rq lock. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-03 11:50:42 -10:00
Christian Loehle	e0ca169638	sched_ext: Introduce scx_bpf_locked_rq() Most fields in scx_bpf_cpu_rq() assume that its rq_lock is held. Furthermore they become meaningless without rq lock, too. Make a safer version of scx_bpf_cpu_rq() that only returns a rq if we hold rq lock of that rq. Also mark the new scx_bpf_locked_rq() as returning NULL as scx_bpf_cpu_rq() should've been too. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-03 11:50:36 -10:00
Tejun Heo	a5bd6ba30b	sched_ext: Use cgroup_lock/unlock() to synchronize against cgroup operations SCX hooks into CPU cgroup controller operations and read-locks scx_cgroup_rwsem to exclude them while enabling and disabling schedulers. While this works, it's unnecessarily complicated given that cgroup_[un]lock() are available and thus the cgroup operations can be locked out that way. Drop scx_cgroup_rwsem locking from the tg on/offline and cgroup [can_]attach operations. Instead, grab cgroup_lock() from scx_cgroup_lock(). Drop scx_cgroup_finish_attach() which is no longer necessary. Drop the now unnecessary rcu locking and css ref bumping in scx_cgroup_init() and scx_cgroup_exit(). As scx_cgroup_set_weight/bandwidth() paths aren't protected by cgroup_lock(), rename scx_cgroup_rwsem to scx_cgroup_ops_rwsem and retain the locking there. This is overall simpler and will also allow enable/disable paths to synchronize against cgroup changes independent of the CPU controller. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2025-09-03 11:36:07 -10:00
Tejun Heo	bcb7c23056	sched_ext: Put event_stats_cpu in struct scx_sched_pcpu scx_sched.event_stats_cpu is the percpu counters that are used to track stats. Introduce struct scx_sched_pcpu and move the counters inside. This will ease adding more per-cpu fields. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2025-09-03 11:33:28 -10:00
Tejun Heo	0c2b8356e4	sched_ext: Move internal type and accessor definitions to ext_internal.h There currently isn't a place to place SCX-internal types and accessors to be shared between ext.c and ext_idle.c. Create kernel/sched/ext_internal.h and move internal type and accessor definitions there. This trims ext.c a bit and makes future additions easier. Pure code reorganization. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2025-09-03 11:33:28 -10:00
Tejun Heo	4a1d9d73aa	sched_ext: Keep bypass on between enable failure and scx_disable_workfn() scx_enable() turns on the bypass mode while enable is in progress. If enabling fails, it turns off the bypass mode and then triggers scx_error(). scx_error() will trigger scx_disable_workfn() which will turn on the bypass mode again and unload the failed scheduler. This moves the system out of bypass mode between the enable error path and the disable path, which is unnecessary and can be brittle - e.g. the thread running scx_enable() may already be on the failed scheduler and can be switched out before it triggers scx_error() leading to a stall. The watchdog would eventually kick in, so the situation isn't critical but is still suboptimal. There is nothing to be gained by turning off the bypass mode between scx_enable() failure and scx_disable_workfn(). Keep bypass on. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2025-09-03 11:33:28 -10:00
Tejun Heo	b7975c4869	sched_ext: Make explicit scx_task_iter_relock() calls unnecessary During tasks iteration, the locks can be dropped using scx_task_iter_unlock() to perform e.g. sleepable allocations. Afterwards, scx_task_iter_relock() has to be called prior to other iteration operations, which is error-prone. This can be easily automated by tracking whether scx_tasks_lock is held in scx_task_iter and re-acquiring when necessary. It already tracks whether the task's rq is locked after all. - Add scx_task_iter->list_locked which remembers whether scx_tasks_lock is held. - Rename scx_task_iter->locked to scx_task_iter->locked_task to better distinguish it from ->list_locked. - Replace scx_task_iter_relock() with __scx_task_iter_maybe_relock() which is automatically called by scx_task_iter_next() and scx_task_iter_stop(). - Drop explicit scx_task_iter_relock() calls. The resulting behavior should be equivalent. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2025-09-03 11:33:28 -10:00
Stanislav Fort	4540f1d23e	audit: fix out-of-bounds read in audit_compare_dname_path() When a watch on dir=/ is combined with an fsnotify event for a single-character name directly under / (e.g., creating /a), an out-of-bounds read can occur in audit_compare_dname_path(). The helper parent_len() returns 1 for "/". In audit_compare_dname_path(), when parentlen equals the full path length (1), the code sets p = path + 1 and pathlen = 1 - 1 = 0. The subsequent loop then dereferences p[pathlen - 1] (i.e., p[-1]), causing an out-of-bounds read. Fix this by adding a pathlen > 0 check to the while loop condition to prevent the out-of-bounds access. Cc: stable@vger.kernel.org Fixes: `e92eebb0d6` ("audit: fix suffixed '/' filename matching") Reported-by: Stanislav Fort <disclosure@aisle.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Stanislav Fort <stanislav.fort@aisle.com> [PM: subject tweak, sign-off email fixes] Signed-off-by: Paul Moore <paul@paul-moore.com>	2025-09-03 16:46:23 -04:00
Waiman Long	e117ff1129	cgroup/cpuset: Prevent NULL pointer access in free_tmpmasks() Commit `5806b3d051` ("cpuset: decouple tmpmasks and cpumasks freeing in cgroup") separates out the freeing of tmpmasks into a new free_tmpmasks() helper but removes the NULL pointer check in the process. Unfortunately a NULL pointer can be passed to free_tmpmasks() in cpuset_handle_hotplug() if cpuset v1 is active. This can cause a segmentation fault and crash the kernel. Fix that by adding the NULL pointer check to free_tmpmasks(). Fixes: `5806b3d051` ("cpuset: decouple tmpmasks and cpumasks freeing in cgroup") Reported-by: Ashay Jaiswal <quic_ashayj@quicinc.com> Closes: https://lore.kernel.org/lkml/20250902-cpuset-free-on-condition-v1-1-f46ffab53eac@quicinc.com/ Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-03 08:40:11 -10:00
Christian Loehle	5ebf512f33	sched: Fix sched_numa_find_nth_cpu() if mask offline sched_numa_find_nth_cpu() uses a bsearch to look for the 'closest' CPU in sched_domains_numa_masks and given cpus mask. However they might not intersect if all CPUs in the cpus mask are offline. bsearch will return NULL in that case; bail out instead of dereferencing a bogus pointer. The previous behaviour led to this bug when using maxcpus=4 on an rk3399 (LLLLbb) (i.e. booting with all big CPUs offline): [ 1.422922] Unable to handle kernel paging request at virtual address ffffff8000000000 [ 1.423635] Mem abort info: [ 1.423889] ESR = 0x0000000096000006 [ 1.424227] EC = 0x25: DABT (current EL), IL = 32 bits [ 1.424715] SET = 0, FnV = 0 [ 1.424995] EA = 0, S1PTW = 0 [ 1.425279] FSC = 0x06: level 2 translation fault [ 1.425735] Data abort info: [ 1.425998] ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000 [ 1.426499] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 1.426952] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 1.427428] swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000004a9f000 [ 1.428038] [ffffff8000000000] pgd=18000000f7fff403, p4d=18000000f7fff403, pud=18000000f7fff403, pmd=0000000000000000 [ 1.429014] Internal error: Oops: 0000000096000006 [#1] SMP [ 1.429525] Modules linked in: [ 1.429813] CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.17.0-rc4-dirty #343 PREEMPT [ 1.430559] Hardware name: Pine64 RockPro64 v2.1 (DT) [ 1.431012] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 1.431634] pc : sched_numa_find_nth_cpu+0x2a0/0x488 [ 1.432094] lr : sched_numa_find_nth_cpu+0x284/0x488 [ 1.432543] sp : ffffffc084e1b960 [ 1.432843] x29: ffffffc084e1b960 x28: ffffff80078a8800 x27: ffffffc0846eb1d0 [ 1.433495] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 [ 1.434144] x23: 0000000000000000 x22: fffffffffff7f093 x21: ffffffc081de6378 [ 1.434792] x20: 0000000000000000 x19: 0000000ffff7f093 x18: 00000000ffffffff [ 1.435441] x17: 3030303866666666 x16: 66663d736b73616d x15: ffffffc104e1b5b7 [ 1.436091] x14: 0000000000000000 x13: ffffffc084712860 x12: 0000000000000372 [ 1.436739] x11: 0000000000000126 x10: ffffffc08476a860 x9 : ffffffc084712860 [ 1.437389] x8 : 00000000ffffefff x7 : ffffffc08476a860 x6 : 0000000000000000 [ 1.438036] x5 : 000000000000bff4 x4 : 0000000000000000 x3 : 0000000000000000 [ 1.438683] x2 : 0000000000000000 x1 : ffffffc0846eb000 x0 : ffffff8000407b68 [ 1.439332] Call trace: [ 1.439559] sched_numa_find_nth_cpu+0x2a0/0x488 (P) [ 1.440016] smp_call_function_any+0xc8/0xd0 [ 1.440416] armv8_pmu_init+0x58/0x27c [ 1.440770] armv8_cortex_a72_pmu_init+0x20/0x2c [ 1.441199] arm_pmu_device_probe+0x1e4/0x5e8 [ 1.441603] armv8_pmu_device_probe+0x1c/0x28 [ 1.442007] platform_probe+0x5c/0xac [ 1.442347] really_probe+0xbc/0x298 [ 1.442683] __driver_probe_device+0x78/0x12c [ 1.443087] driver_probe_device+0xdc/0x160 [ 1.443475] __driver_attach+0x94/0x19c [ 1.443833] bus_for_each_dev+0x74/0xd4 [ 1.444190] driver_attach+0x24/0x30 [ 1.444525] bus_add_driver+0xe4/0x208 [ 1.444874] driver_register+0x60/0x128 [ 1.445233] __platform_driver_register+0x24/0x30 [ 1.445662] armv8_pmu_driver_init+0x28/0x4c [ 1.446059] do_one_initcall+0x44/0x25c [ 1.446416] kernel_init_freeable+0x1dc/0x3bc [ 1.446820] kernel_init+0x20/0x1d8 [ 1.447151] ret_from_fork+0x10/0x20 [ 1.447493] Code: 90022e21 f000e5f5 910de2b5 2a1703e2 (f8767803) [ 1.448040] ---[ end trace 0000000000000000 ]--- [ 1.448483] note: swapper/0[1] exited with preempt_count 1 [ 1.449047] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b [ 1.449741] SMP: stopping secondary CPUs [ 1.450105] Kernel Offset: disabled [ 1.450419] CPU features: 0x000000,00080000,20002001,0400421b [ 1.450935] Memory Limit: none [ 1.451217] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]--- Yury: with the fix, the function returns cpu == nr_cpu_ids, and later in smp_call_function_any -> smp_call_function_single -> generic_exec_single we test the cpu for '>= nr_cpu_ids' and return -ENXIO. So everything is handled correctly. Fixes: `cd7f55359c` ("sched: add sched_numa_find_nth_cpu()") Cc: stable@vger.kernel.org Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>	2025-09-03 12:20:06 -04:00
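
The pattern described in the last commit above (a search that can legitimately find nothing, a sentinel return value, and a caller-side range check) can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: `NR_IDS`, `cmp_int()`, `find_id()` and `run_on()` are hypothetical names, `NR_IDS` stands in for `nr_cpu_ids`, and the C library `bsearch()` stands in for the kernel's own helper.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_IDS 8 /* hypothetical stand-in for nr_cpu_ids */

/* Three-way comparison of the search key against one array element. */
static int cmp_int(const void *key, const void *elem)
{
	int k = *(const int *)key;
	int e = *(const int *)elem;

	return (k > e) - (k < e);
}

/* Return the matching id, or the NR_IDS sentinel when nothing matches. */
static int find_id(const int *sorted, size_t n, int wanted)
{
	const int *hit = bsearch(&wanted, sorted, n, sizeof(*sorted), cmp_int);

	if (!hit) /* bail out instead of dereferencing a NULL result */
		return NR_IDS;
	return *hit;
}

/* Caller-side range check, analogous to the '>= nr_cpu_ids' test. */
static int run_on(int id)
{
	if (id >= NR_IDS)
		return -ENXIO;
	return 0;
}
```

With a sorted array holding 0..3, `find_id()` returns 2 for a lookup of 2 and `NR_IDS` for a lookup of 5, and `run_on()` turns that sentinel into `-ENXIO` rather than letting an invalid id propagate.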


49605 Commits