linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-16 09:02:21 -04:00

Author	SHA1	Message	Date
Lucas De Marchi	663385f915	module: Simplify warning on positive returns from module_init() It should now be rare to trigger this warning - it doesn't need to be so verbose. Make it follow the usual style in the module loading code. For the same reason, drop the dump_stack(). Suggested-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Lucas De Marchi <demarchi@kernel.org> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>	2026-04-04 00:04:48 +00:00
Lucas De Marchi	743f8cae54	module: Override -EEXIST module return The -EEXIST errno is reserved by the module loading functionality. When userspace calls [f]init_module(), it expects a -EEXIST to mean that the module is already loaded in the kernel. If module_init() returns it, that is not true anymore. Override the error when returning to userspace: it doesn't make sense to change potentially long error propagation call chains just because it's will end up as the return of module_init(). Closes: https://lore.kernel.org/all/aKLzsAX14ybEjHfJ@orbyte.nwl.cc/ Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: Phil Sutter <phil@nwl.cc> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Lucas De Marchi <demarchi@kernel.org> [Sami: Fixed a typo.] Signed-off-by: Sami Tolvanen <samitolvanen@google.com>	2026-04-04 00:04:42 +00:00
Linus Torvalds	631919fb12	Merge tag 'sched_ext-for-7.0-rc6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: "These are late but both fix subtle yet critical problems and the blast radius is limited strictly to sched_ext. - Fix stale direct dispatch state in ddsp_dsq_id which can cause spurious warnings in mark_direct_dispatch() on task wakeup - Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCU configs which can lead to incorrectly dispatching migration- disabled tasks to remote CPUs" * tag 'sched_ext-for-7.0-rc6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix stale direct dispatch state in ddsp_dsq_id sched_ext: Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCU	2026-04-03 12:05:06 -07:00
Tejun Heo	744ab12a5b	Merge branch 'for-7.0-fixes' into for-7.1 Conflict in kernel/sched/ext.c between: `7e0ffb72de` ("sched_ext: Fix stale direct dispatch state in ddsp_dsq_id") which clears ddsp state at individual call sites instead of dispatch_enqueue(), and sub-sched related code reorg and API updates on for-7.1. Resolved by applying the ddsp fix with for-7.1's signatures. Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-03 07:48:28 -10:00
Bartosz Golaszewski	9617b5b62c	kernel: ksysfs: initialize kernel_kobj earlier Software nodes depend on kernel_kobj which is initialized pretty late into the boot process - as a core_initcall(). Ahead of moving the software node initialization to driver_init() we must first make kernel_kobj available before it. Make ksysfs_init() visible in a new header - ksysfs.h - and call it in do_basic_setup() right before driver_init(). Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com> Link: https://patch.msgid.link/20260402-nokia770-gpio-swnodes-v5-1-d730db3dd299@oss.qualcomm.com Signed-off-by: Danilo Krummrich <dakr@kernel.org>	2026-04-03 19:39:52 +02:00
Andrea Righi	7e0ffb72de	sched_ext: Fix stale direct dispatch state in ddsp_dsq_id @p->scx.ddsp_dsq_id can be left set (non-SCX_DSQ_INVALID) triggering a spurious warning in mark_direct_dispatch() when the next wakeup's ops.select_cpu() calls scx_bpf_dsq_insert(), such as: WARNING: kernel/sched/ext.c:1273 at scx_dsq_insert_commit+0xcd/0x140 The root cause is that ddsp_dsq_id was only cleared in dispatch_enqueue(), which is not reached in all paths that consume or cancel a direct dispatch verdict. Fix it by clearing it at the right places: - direct_dispatch(): cache the direct dispatch state in local variables and clear it before dispatch_enqueue() on the synchronous path. For the deferred path, the direct dispatch state must remain set until process_ddsp_deferred_locals() consumes them. - process_ddsp_deferred_locals(): cache the dispatch state in local variables and clear it before calling dispatch_to_local_dsq(), which may migrate the task to another rq. - do_enqueue_task(): clear the dispatch state on the enqueue path (local/global/bypass fallbacks), where the direct dispatch verdict is ignored. - dequeue_task_scx(): clear the dispatch state after dispatch_dequeue() to handle both the deferred dispatch cancellation and the holding_cpu race, covering all cases where a pending direct dispatch is cancelled. - scx_disable_task(): clear the direct dispatch state when transitioning a task out of the current scheduler. Waking tasks may have had the direct dispatch state set by the outgoing scheduler's ops.select_cpu() and then been queued on a wake_list via ttwu_queue_wakelist(), when SCX_OPS_ALLOW_QUEUED_WAKEUP is set. Such tasks are not on the runqueue and are not iterated by scx_bypass(), so their direct dispatch state won't be cleared. Without this clear, any subsequent SCX scheduler that tries to direct dispatch the task will trigger the WARN_ON_ONCE() in mark_direct_dispatch(). Fixes: `5b26f7b920` ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches") Cc: stable@vger.kernel.org # v6.12+ Cc: Daniel Hodges <hodgesd@meta.com> Cc: Patrick Somaru <patsomaru@meta.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-03 07:14:49 -10:00
Linus Torvalds	1270605fd2	Merge tag 'pm-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix a potential NULL pointer dereference in the energy model netlink interface and a potential double free in an error path in the common cpufreq governor management code: - Fix a NULL pointer dereference in the energy model netlink interface that may occur if a given perf domain ID is not recognized (Changwoo Min) - Avoid double free in the cpufreq_dbs_governor_init() error path when kobject_init_and_add() fails (Guangshuo Li)" * tag 'pm-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: governor: fix double free in cpufreq_dbs_governor_init() error path PM: EM: Fix NULL pointer dereference when perf domain ID is not found	2026-04-03 09:56:32 -07:00
Alexei Starovoitov	1a1cadbd5d	bpf: Add helper and kfunc stack access size resolution The static stack liveness analysis needs to know how many bytes a helper or kfunc accesses through a stack pointer argument, so it can precisely mark the affected stack slots as stack 'def' or 'use'. Add bpf_helper_stack_access_bytes() and bpf_kfunc_stack_access_bytes() which resolve the access size for a given call argument. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-7-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:44 -07:00
Alexei Starovoitov	19dbb13474	bpf: Move verifier helpers to header Move several helpers to header as preparation for the subsequent stack liveness patches. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-6-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:41 -07:00
Alexei Starovoitov	f1606dd0ac	bpf: Add bpf_compute_const_regs() and bpf_prune_dead_branches() passes Add two passes before the main verifier pass: bpf_compute_const_regs() is a forward dataflow analysis that tracks register values in R0-R9 across the program using fixed-point iteration in reverse postorder. Each register is tracked with a six-state lattice: UNVISITED -> CONST(val) / MAP_PTR(map_index) / MAP_VALUE(map_index, offset) / SUBPROG(num) -> UNKNOWN At merge points, if two paths produce the same state and value for a register, it stays; otherwise it becomes UNKNOWN. The analysis handles: - MOV, ADD, SUB, AND with immediate or register operands - LD_IMM64 for plain constants, map FDs, map values, and subprogs - LDX from read-only maps: constant-folds the load by reading the map value directly via bpf_map_direct_read() Results that fit in 32 bits are stored per-instruction in insn_aux_data and bitmasks. bpf_prune_dead_branches() uses the computed constants to evaluate conditional branches. When both operands of a conditional jump are known constants, the branch outcome is determined statically and the instruction is rewritten to an unconditional jump. The CFG postorder is then recomputed to reflect new control flow. This eliminates dead edges so that subsequent liveness analysis doesn't propagate through dead code. Also add runtime sanity check to validate that precomputed constants match the verifier's tracked state. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-5-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:36 -07:00
Alexei Starovoitov	e6898ec751	bpf: Sort subprogs in topological order after check_cfg() Add a pass that sorts subprogs in topological order so that iterating subprog_topo_order[] walks leaf subprogs first, then their callers. This is computed as a DFS post-order traversal of the CFG. The pass runs after check_cfg() to ensure the CFG has been validated before traversing and after postorder has been computed to avoid walking dead code. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-3-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:30 -07:00
Alexei Starovoitov	503d21ef8e	bpf: Do register range validation early Instead of checking src/dst range multiple times during the main verifier pass do them once. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-2-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:26 -07:00
Alexei Starovoitov	891a05ccba	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf 7.0-rc6+ Cross-merge BPF and other fixes after downstream PR. Minor conflict in kernel/bpf/verifier.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:14:13 -07:00
Vincent Guittot	059258b0d4	sched/fair: Prevent negative lag increase during delayed dequeue Delayed dequeue feature aims to reduce the negative lag of a dequeued task while sleeping but it can happens that newly enqueued tasks will move backward the avg vruntime and increase its negative lag. When the delayed dequeued task wakes up, it has more neg lag compared to being dequeued immediately or to other tasks that have been dequeued just before theses new enqueues. Ensure that the negative lag of a delayed dequeued task doesn't increase during its delayed dequeued phase while waiting for its neg lag to diseappear. Similarly, we remove any positive lag that the delayed dequeued task could have gain during thsi period. Short slice tasks are particularly impacted in overloaded system. Test on snapdragon rb5: hackbench -T -p -l 16000000 -g 2 1> /dev/null & cyclictest -t 1 -i 2777 -D 333 --policy=fair --mlock -h 20000 -q The scheduling latency of cyclictest is: tip/sched/core tip/sched/core +this patch cyclictest slice (ms) (default)2.8 8 8 hackbench slice (ms) (default)2.8 20 20 Total Samples \| 115632 119733 119806 Average (us) \| 364 64(-82%) 61(- 5%) Median (P50) (us) \| 60 56(- 7%) 56( 0%) 90th Percentile (us) \| 1166 62(-95%) 62( 0%) 99th Percentile (us) \| 4192 73(-98%) 72(- 1%) 99.9th Percentile (us) \| 8528 2707(-68%) 1300(-52%) Maximum (us) \| 17735 14273(-20%) 13525(- 5%) Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260331162352.551501-1-vincent.guittot@linaro.org	2026-04-03 14:23:41 +02:00
Vincent Guittot	2d4cc371ba	sched/fair: Use sched_energy_enabled() Use helper sched_energy_enabled() everywhere we want to test if EAS is enabled instead of mixing sched_energy_enabled() and direct call to static_branch_unlikely(). No functional change Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260327132013.2800517-1-vincent.guittot@linaro.org	2026-04-03 14:23:41 +02:00
John Stultz	b049b81bdf	sched: Handle blocked-waiter migration (and return migration) Add logic to handle migrating a blocked waiter to a remote cpu where the lock owner is runnable. Additionally, as the blocked task may not be able to run on the remote cpu, add logic to handle return migration once the waiting task is given the mutex. Because tasks may get migrated to where they cannot run, also modify the scheduling classes to avoid sched class migrations on mutex blocked tasks, leaving find_proxy_task() and related logic to do the migrations and return migrations. This was split out from the larger proxy patch, and significantly reworked. Credits for the original patch go to: Peter Zijlstra (Intel) <peterz@infradead.org> Juri Lelli <juri.lelli@redhat.com> Valentin Schneider <valentin.schneider@arm.com> Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260324191337.1841376-11-jstultz@google.com	2026-04-03 14:23:41 +02:00
John Stultz	dec9554dc0	sched: Move attach_one_task and attach_task helpers to sched.h The fair scheduler locally introduced attach_one_task() and attach_task() helpers, but these could be generically useful so move this code to sched.h so we can use them elsewhere. One minor tweak made to utilize guard(rq_lock)(rq) to simplifiy the function. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-10-jstultz@google.com	2026-04-03 14:23:40 +02:00
John Stultz	48fda62de6	sched: Add logic to zap balance callbacks if we pick again With proxy-exec, a task is selected to run via pick_next_task(), and then if it is a mutex blocked task, we call find_proxy_task() to find a runnable owner. If the runnable owner is on another cpu, we will need to migrate the selected donor task away, after which we will pick_again can call pick_next_task() to choose something else. However, in the first call to pick_next_task(), we may have had a balance_callback setup by the class scheduler. After we pick again, its possible pick_next_task_fair() will be called which calls sched_balance_newidle() and sched_balance_rq(). This will throw a warning: [ 8.796467] rq->balance_callback && rq->balance_callback != &balance_push_callback [ 8.796467] WARNING: CPU: 32 PID: 458 at kernel/sched/sched.h:1750 sched_balance_rq+0xe92/0x1250 ... [ 8.796467] Call Trace: [ 8.796467] <TASK> [ 8.796467] ? __warn.cold+0xb2/0x14e [ 8.796467] ? sched_balance_rq+0xe92/0x1250 [ 8.796467] ? report_bug+0x107/0x1a0 [ 8.796467] ? handle_bug+0x54/0x90 [ 8.796467] ? exc_invalid_op+0x17/0x70 [ 8.796467] ? asm_exc_invalid_op+0x1a/0x20 [ 8.796467] ? sched_balance_rq+0xe92/0x1250 [ 8.796467] sched_balance_newidle+0x295/0x820 [ 8.796467] pick_next_task_fair+0x51/0x3f0 [ 8.796467] __schedule+0x23a/0x14b0 [ 8.796467] ? lock_release+0x16d/0x2e0 [ 8.796467] schedule+0x3d/0x150 [ 8.796467] worker_thread+0xb5/0x350 [ 8.796467] ? __pfx_worker_thread+0x10/0x10 [ 8.796467] kthread+0xee/0x120 [ 8.796467] ? __pfx_kthread+0x10/0x10 [ 8.796467] ret_from_fork+0x31/0x50 [ 8.796467] ? __pfx_kthread+0x10/0x10 [ 8.796467] ret_from_fork_asm+0x1a/0x30 [ 8.796467] </TASK> This is because if a RT task was originally picked, it will setup the rq->balance_callback with push_rt_tasks() via set_next_task_rt(). Once the task is migrated away and we pick again, we haven't processed any balance callbacks, so rq->balance_callback is not in the same state as it was the first time pick_next_task was called. To handle this, add a zap_balance_callbacks() helper function which cleans up the balance callbacks without running them. This should be ok, as we are effectively undoing the state set in the first call to pick_next_task(), and when we pick again, the new callback can be configured for the donor task actually selected. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-9-jstultz@google.com	2026-04-03 14:23:40 +02:00
John Stultz	f9530b3183	sched: Add assert_balance_callbacks_empty helper With proxy-exec utilizing pick-again logic, we can end up having balance callbacks set by the preivous pick_next_task() call left on the list. So pull the warning out into a helper function, and make sure we check it when we pick again. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-8-jstultz@google.com	2026-04-03 14:23:40 +02:00
John Stultz	2d76226698	sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration As we add functionality to proxy execution, we may migrate a donor task to a runqueue where it can't run due to cpu affinity. Thus, we must be careful to ensure we return-migrate the task back to a cpu in its cpumask when it becomes unblocked. Peter helpfully provided the following example with pictures: "Suppose we have a ww_mutex cycle: ,-+-* Mutex-1 <-. Task-A ---' \| \| ,-- Task-B `-> Mutex-2 -+-' Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and where Task-B holds Mutex-2 and tries to acquire Mutex-1. Then the blocked_on->owner chain will go in circles. Task-A -> Mutex-2 ^ \| \| v Mutex-1 <- Task-B We need two things: - find_proxy_task() to stop iterating the circle; - the woken task to 'unblock' and run, such that it can back-off and re-try the transaction. Now, the current code [without this patch] does: __clear_task_blocked_on(); wake_q_add(); And surely clearing ->blocked_on is sufficient to break the cycle. Suppose it is Task-B that is made to back-off, then we have: Task-A -> Mutex-2 -> Task-B (no further blocked_on) and it would attempt to run Task-B. Or worse, it could directly pick Task-B and run it, without ever getting into find_proxy_task(). Now, here is a problem because Task-B might not be runnable on the CPU it is currently on; and because !task_is_blocked() we don't get into the proxy paths, so nobody is going to fix this up. Ideally we would have dequeued Task-B alongside of clearing ->blocked_on, but alas, [the lock ordering prevents us from getting the task_rq_lock() and] spoils things." Thus we need more than just a binary concept of the task being blocked on a mutex or not. So allow setting blocked_on to PROXY_WAKING as a special value which specifies the task is no longer blocked, but needs to be evaluated for return migration before* it can be run. This will then be used in a later patch to handle proxy return-migration. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-7-jstultz@google.com	2026-04-03 14:23:40 +02:00
John Stultz	56f4b24267	sched: Fix modifying donor->blocked on without proper locking Introduce an action enum in find_proxy_task() which allows us to handle work needed to be done outside the mutex.wait_lock and task.blocked_lock guard scopes. This ensures proper locking when we clear the donor's blocked_on pointer in proxy_deactivate(), and the switch statement will be useful as we add more cases to handle later in this series. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-6-jstultz@google.com	2026-04-03 14:23:39 +02:00
John Stultz	fa4a1ff8ab	locking: Add task::blocked_lock to serialize blocked_on state So far, we have been able to utilize the mutex::wait_lock for serializing the blocked_on state, but when we move to proxying across runqueues, we will need to add more state and a way to serialize changes to this state in contexts where we don't hold the mutex::wait_lock. So introduce the task::blocked_lock, which nests under the mutex::wait_lock in the locking order, and rework the locking to use it. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-5-jstultz@google.com	2026-04-03 14:23:39 +02:00
John Stultz	f4fe6be82e	sched: Fix potentially missing balancing with Proxy Exec K Prateek pointed out that with Proxy Exec, we may have cases where we context switch in __schedule(), while the donor remains the same. This could cause balancing issues, since the put_prev_set_next() logic short-cuts if (prev == next). With proxy-exec prev is the previous donor, and next is the next donor. Should the donor remain the same, but different tasks are picked to actually run, the shortcut will have avoided enqueuing the sched class balance callback. So, if we are context switching, add logic to catch the same-donor case, and trigger the put_prev/set_next calls to ensure the balance callbacks get enqueued. Closes: https://lore.kernel.org/lkml/20ea3670-c30a-433b-a07f-c4ff98ae2379@amd.com/ Reported-by: K Prateek Nayak <kprateek.nayak@amd.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260324191337.1841376-4-jstultz@google.com	2026-04-03 14:23:39 +02:00
John Stultz	37341ec573	sched: Minimise repeated sched_proxy_exec() checking Peter noted: Compilers are really bad (as in they utterly refuse) optimizing (even when marked with __pure) the static branch things, and will happily emit multiple identical in a row. So pull out the one obvious sched_proxy_exec() branch in __schedule() and remove some of the 'implicit' ones in that path. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-3-jstultz@google.com	2026-04-03 14:23:38 +02:00
John Stultz	e0ca8991b2	sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr() With proxy-execution, the scheduler selects the donor, but for blocked donors, we end up running the lock owner. This caused some complexity, because the class schedulers make sure to remove the task they pick from their pushable task lists, which prevents the donor from being migrated, but there wasn't then anything to prevent rq->curr from being migrated if rq->curr != rq->donor. This was sort of hacked around by calling proxy_tag_curr() on the rq->curr task if we were running something other then the donor. proxy_tag_curr() did a dequeue/enqueue pair on the rq->curr task, allowing the class schedulers to remove it from their pushable list. The dequeue/enqueue pair was wasteful, and additonally K Prateek highlighted that we didn't properly undo things when we stopped proxying, leaving the lock owner off the pushable list. After some alternative approaches were considered, Peter suggested just having the RT/DL classes just avoid migrating when task_on_cpu(). So rework pick_next_pushable_dl_task() and the rt pick_next_pushable_task() functions so that they skip over the first pushable task if it is on_cpu. Then just drop all of the proxy_tag_curr() logic. Fixes: `be39617e38` ("sched: Fix proxy/current (push,pull)ability") Closes: https://lore.kernel.org/lkml/e735cae0-2cc9-4bae-b761-fcb082ed3e94@amd.com/ Reported-by: K Prateek Nayak <kprateek.nayak@amd.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260324191337.1841376-2-jstultz@google.com	2026-04-03 14:23:38 +02:00
Coiby Xu	03738dd159	crash_dump/dm-crypt: don't print in arch-specific code Patch series "kdump: Enable LUKS-encrypted dump target support in ARM64 and PowerPC", v5. CONFIG_CRASH_DM_CRYPT has been introduced to support LUKS-encrypted device dump target by addressing two challenges [1], - Kdump kernel may not be able to decrypt the LUKS partition. For some machines, a system administrator may not have a chance to enter the password to decrypt the device in kdump initramfs after the 1st kernel crashes - LUKS2 by default use the memory-hard Argon2 key derivation function which is quite memory-consuming compared to the limited memory reserved for kdump. To also enable this feature for ARM64 and PowerPC, we need to add a device tree property dmcryptkeys [2] as similar to elfcorehdr to pass the memory address of the stored info of dm-crypt keys to the kdump kernel. This patch (of 3): When the vmcore dumping target is not a LUKS-encrypted target, it's expected that there is no dm-crypt key thus no need to return -ENOENT. Also print more logs in crash_load_dm_crypt_keys. The benefit is arch-specific code can be more succinct. Link: https://lkml.kernel.org/r/20260225060347.718905-1-coxu@redhat.com Link: https://lkml.kernel.org/r/20260225060347.718905-2-coxu@redhat.com Link: https://lore.kernel.org/all/20250502011246.99238-1-coxu@redhat.com/ [1] Link: https://github.com/devicetree-org/dt-schema/pull/181 [2] Signed-off-by: Coiby Xu <coxu@redhat.com> Suggested-by: Will Deacon <will@kernel.org> Acked-by: Baoquan He <bhe@redhat.com> Cc: Arnaud Lefebvre <arnaud.lefebvre@clever-cloud.com> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Dave Young <dyoung@redhat.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Thomas Staudt <tstaudt@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-02 23:36:24 -07:00
Linus Torvalds	7b9e74c5a4	Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix register equivalence for pointers to packet (Alexei Starovoitov) - Fix incorrect pruning due to atomic fetch precision tracking (Daniel Borkmann) - Fix grace period wait for bpf_link-ed tracepoints (Kumar Kartikeya Dwivedi) - Fix use-after-free of sockmap's sk->sk_socket (Kuniyuki Iwashima) - Reject direct access to nullable PTR_TO_BUF pointers (Qi Tang) - Reject sleepable kprobe_multi programs at attach time (Varun R Mallya) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Add more precision tracking tests for atomics bpf: Fix incorrect pruning due to atomic fetch precision tracking bpf: Reject sleepable kprobe_multi programs at attach time bpf: reject direct access to nullable PTR_TO_BUF pointers bpf: sockmap: Fix use-after-free of sk->sk_socket in sk_psock_verdict_data_ready(). bpf: Fix grace period wait for tracepoint bpf_link bpf: Fix regsafe() for pointers to packet	2026-04-02 18:59:56 -07:00
Harishankar Vishwanathan	b254c6d816	bpf: Simulate branches to prune based on range violations This patch fixes the invariant violations that can happen after we refine ranges & tnum based on an incorrectly-detected branch condition. For example, the branch is always true, but we miss it in is_branch_taken; we then refine based on the branch being false and end up with incoherent ranges (e.g. umax < umin). To avoid this, we can simulate the refinement on both branches. More specifically, this patch simulates both branches taken using regs_refine_cond_op and reg_bounds_sync. If the resulting register states are ill-formed on one of the branches, is_branch_taken can mark that branch as "never taken". On a more formal note, we can deduce a branch is not taken when regs_refine_cond_op or reg_bounds_sync returns an ill-formed state because the branch operators are sound (verified with Agni [1]). Soundness means that the verifier is guaranteed to produce sound outputs on the taken branches. On the non-taken branch (explored because of imprecision in the bounds), the verifier is free to produce any output. We use ill-formedness as a signal that the branch is dead and prune that branch. This patch moves the refinement logic for both branches from reg_set_min_max to their own function, simulate_both_branches_taken, which is called from is_scalar_branch_taken. As a result, reg_set_min_max now only runs sanity checks and has been renamed to reg_bounds_sanity_check_branches to reflect that. We have had five patches fixing specific cases of invariant violations in the past, all added with selftests: - commit `fbc7aef517` ("bpf: Fix u32/s32 bounds when ranges cross min/max boundary") - commit `efc11a6678` ("bpf: Improve bounds when tnum has a single possible value") - commit `f41345f47f` ("bpf: Use tnums for JEQ/JNE is_branch_taken logic") - commit `00bf8d0c6c` ("bpf: Improve bounds when s64 crosses sign boundary") - commit `6279846b9b` ("bpf: Forget ranges when refining tnum after JSET") To confirm that this patch addresses all invariant violations, we have also reverted those five commits and verified that their related selftests don't cause any invariant violation warnings anymore. Those selftests still fail but only because of misdetected branches or less-precise bounds than expected. This demonstrates that the current patch is enough to avoid the invariant violation warning AND that the previous five patches are still useful to improve branch detection. In addition to the selftests, this change was also tested with the Cilium complexity test suite: all programs were successfully loaded and it didn't change the number of processed instructions. Link: https://github.com/bpfverif/agni [1] Reported-by: syzbot+c950cc277150935cc0b5@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c950cc277150935cc0b5 Co-developed-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/a166b54a3cbbbdbcdf8a87f53045f1097176218b.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 18:23:25 -07:00
Harishankar Vishwanathan	a2a14e874b	bpf: Exit early if reg_bounds_sync gets invalid inputs In the subsequent commit, to prune dead branches we will rely on detecting ill-formed ranges using range_bounds_violations() (e.g., umin > umax) after refining register bounds using regs_refine_cond_op(). However, reg_bounds_sync() can sometimes "repair" ill-formed bounds, potentially masking a violation that was produced by regs_refine_cond_op(). This commit modifies reg_bounds_sync() to exit early if an invariant violation is already present in the input. This ensures ill-formed reg_states remain ill-formed after reg_bounds_sync(), allowing simulate_both_branches_taken() to correctly identify dead branches with a single check to range_bounds_violation(). Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/73127d628841c59cb7423d6bdcd204bf90bcdc80.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 18:23:25 -07:00
Paul Chaignon	ec1d77cb0e	bpf: Use bpf_verifier_env buffers for reg_set_min_max In a subsequent patch, the regs_refine_cond_op and reg_bounds_sync functions will be called in is_branch_taken instead of reg_set_min_max, to simulate each branch's outcome. Since they will run before we branch out, these two functions will need to work on temporary registers for the two branches. This refactoring patch prepares for that change, by introducing the temporary registers on bpf_verifier_env and using them in reg_set_min_max. This change also allows us to save one fake_reg slot as we don't need to allocate an additional temporary buffer in case of a BPF_K condition. Finally, you may notice that this patch removes the check for "false_reg1 == false_reg2" in reg_set_min_max. That check was introduced in commit `d43ad9da80` ("bpf: Skip bounds adjustment for conditional jumps on same scalar register") to avoid an invariant violation. Given that "env->false_reg1 == env->false_reg2" doesn't make sense and invariant violations are addressed in a subsequent commit, this patch just removes the check. Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Co-developed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/260b0270052944a420e1c56e6a92df4d43cadf03.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 18:23:25 -07:00
Harishankar Vishwanathan	a1311b94ef	bpf: Refactor reg_bounds_sanity_check This commit refactors reg_bounds_sanity_check to factor out the logic that performs the sanity check from the logic that does the reporting. Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/198ec3e69343e2c46dc9cbe2b1bc9be9ae2df5bd.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 18:23:24 -07:00
Michael Kelley	fd7400cfcb	genirq/chip: Invoke add_interrupt_randomness() in handle_percpu_devid_irq() handle_percpu_devid_irq() is a version of handle_percpu_irq() but with the addition of a pointer to a per-CPU devid. However, handle_percpu_irq() invokes add_interrupt_randomness(), while handle_percpu_devid_irq() currently does not. Add the missing add_interrupt_randomness(), as it is needed when per-CPU interrupts with devid's are used in VMs for interrupts from the hypervisor. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260402202400.1707-2-mhklkml@zohomail.com	2026-04-02 23:03:29 +02:00
Changwoo Min	0c4a59df37	sched_ext: Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCU Since commit `8e4f0b1ebc` ("bpf: use rcu_read_lock_dont_migrate() for trampoline.c"), the BPF prolog (__bpf_prog_enter) calls migrate_disable() only when CONFIG_PREEMPT_RCU is enabled, via rcu_read_lock_dont_migrate(). Without CONFIG_PREEMPT_RCU, the prolog never touches migration_disabled, so migration_disabled == 1 always means the task is truly migration-disabled regardless of whether it is the current task. The old unconditional p == current check was a false negative in this case, potentially allowing a migration-disabled task to be dispatched to a remote CPU and triggering scx_error in task_can_run_on_remote_rq(). Only apply the p == current disambiguation when CONFIG_PREEMPT_RCU is enabled, where the ambiguity with the BPF prolog still exists. Fixes: `8e4f0b1ebc` ("bpf: use rcu_read_lock_dont_migrate() for trampoline.c") Cc: stable@vger.kernel.org # v6.18+ Link: https://lore.kernel.org/lkml/20250821090609.42508-8-dongml2@chinatelecom.cn/ Signed-off-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-02 09:26:55 -10:00
Samuele Mariotti	b905ee77d5	sched_ext: Fix missing warning in scx_set_task_state() default case In scx_set_task_state(), the default case was setting the warn flag, but then returning immediately. This is problematic because the only purpose of the warn flag is to trigger WARN_ONCE, but the early return prevented it from ever firing, leaving invalid task states undetected and untraced. To fix this, a WARN_ONCE call is now added directly in the default case. The fix addresses two aspects: - Guarantees the invalid task states are properly logged and traced. - Provides a distinct warning message ("sched_ext: Invalid task state") specifically for states outside the defined scx_task_state enum values, making it easier to distinguish from other transition warnings. This ensures proper detection and reporting of invalid states. Signed-off-by: Samuele Mariotti <smariotti@disroot.org> Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-02 09:22:03 -10:00
Steven Rostedt	3515572dd0	tracing: Allow backup to save persistent ring buffer before it starts When the persistent ring buffer was first introduced, it did not make sense to start tracing for it on the kernel command line. That's because if there was a crash, the start of events would invalidate the events from the previous boot that had the crash. But now that there's a "backup" instance that can take a snapshot of the persistent ring buffer when boot starts, it is possible to have the persistent ring buffer start events at boot up and not lose the old events. Update the code where the boot events start after all boot time instances are created. This will allow the backup instance to copy the persistent ring buffer from the previous boot, and allow the persistent ring buffer to start tracing new events for the current boot. reserve_mem=100M:12M:trace trace_instance=boot_mapped^@trace,sched trace_instance=backup=boot_mapped The above will create a boot_mapped persistent ring buffer and enabled the scheduler events. If there's a crash, a "backup" instance will be created holding the events of the persistent ring buffer from the previous boot, while the persistent ring buffer will once again start tracing scheduler events of the current boot. Now the user doesn't have to remember to start the persistent ring buffer. It will always have the events started at each boot. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260331163924.6ccb3896@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-04-02 13:29:08 -04:00
Masami Hiramatsu (Google)	eca33fdab4	tracing: Remove the backup instance automatically after read Since the backup instance is readonly, after reading all data via pipe, no data is left on the instance. Thus it can be removed safely after closing all files. This also removes it if user resets the ring buffer manually via 'trace' file. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/177502547711.1311542.12572973358010839400.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-04-02 13:22:30 -04:00
Masami Hiramatsu (Google)	2c79da099a	tracing: Make the backup instance non-reusable Since there is no reason to reuse the backup instance, make it readonly (but erasable). Note that only backup instances are readonly, because other trace instances will be empty unless it is writable. Only backup instances have copy entries from the original. With this change, most of the trace control files are removed from the backup instance, including eventfs enable/filter etc. # find /sys/kernel/tracing/instances/backup/events/ \| wc -l 4093 # find /sys/kernel/tracing/instances/boot_map/events/ \| wc -l 9573 Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/177502546939.1311542.1826814401724828930.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-04-02 13:20:38 -04:00
Vincent Donnefort	20ad8b0888	ring-buffer: Enforce read ordering of trace_buffer cpumask and buffers On CPU hotplug, if it is the first time a trace_buffer sees a CPU, a ring_buffer_per_cpu will be allocated and its corresponding bit toggled in the cpumask. Many readers check this cpumask to know if they can safely read the ring_buffer_per_cpu but they are doing so without memory ordering and may observe the cpumask bit set while having NULL buffer pointer. Enforce the memory read ordering by sending an IPI to all online CPUs. The hotplug path is a slow-path anyway and it saves us from adding read barriers in numerous call sites. Link: https://patch.msgid.link/20260401053659.3458961-1-vdonnefort@google.com Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-04-02 13:19:09 -04:00
Daniel Borkmann	179ee84a89	bpf: Fix incorrect pruning due to atomic fetch precision tracking When backtrack_insn encounters a BPF_STX instruction with BPF_ATOMIC and BPF_FETCH, the src register (or r0 for BPF_CMPXCHG) also acts as a destination, thus receiving the old value from the memory location. The current backtracking logic does not account for this. It treats atomic fetch operations the same as regular stores where the src register is only an input. This leads the backtrack_insn to fail to propagate precision to the stack location, which is then not marked as precise! Later, the verifier's path pruning can incorrectly consider two states equivalent when they differ in terms of stack state. Meaning, two branches can be treated as equivalent and thus get pruned when they should not be seen as such. Fix it as follows: Extend the BPF_LDX handling in backtrack_insn to also cover atomic fetch operations via is_atomic_fetch_insn() helper. When the fetch dst register is being tracked for precision, clear it, and propagate precision over to the stack slot. For non-stack memory, the precision walk stops at the atomic instruction, same as regular BPF_LDX. This covers all fetch variants. Before: 0: (b7) r1 = 8 ; R1=8 1: (7b) (u64 )(r10 -8) = r1 ; R1=8 R10=fp0 fp-8=8 2: (b7) r2 = 0 ; R2=0 3: (db) r2 = atomic64_fetch_add((u64 )(r10 -8), r2) ; R2=8 R10=fp0 fp-8=mmmmmmmm 4: (bf) r3 = r10 ; R3=fp0 R10=fp0 5: (0f) r3 += r2 mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r2 stack= before 4: (bf) r3 = r10 mark_precise: frame0: regs=r2 stack= before 3: (db) r2 = atomic64_fetch_add((u64 )(r10 -8), r2) mark_precise: frame0: regs=r2 stack= before 2: (b7) r2 = 0 6: R2=8 R3=fp8 6: (b7) r0 = 0 ; R0=0 7: (95) exit After: 0: (b7) r1 = 8 ; R1=8 1: (7b) (u64 )(r10 -8) = r1 ; R1=8 R10=fp0 fp-8=8 2: (b7) r2 = 0 ; R2=0 3: (db) r2 = atomic64_fetch_add((u64 )(r10 -8), r2) ; R2=8 R10=fp0 fp-8=mmmmmmmm 4: (bf) r3 = r10 ; R3=fp0 R10=fp0 5: (0f) r3 += r2 mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r2 stack= before 4: (bf) r3 = r10 mark_precise: frame0: regs=r2 stack= before 3: (db) r2 = atomic64_fetch_add((u64 )(r10 -8), r2) mark_precise: frame0: regs= stack=-8 before 2: (b7) r2 = 0 mark_precise: frame0: regs= stack=-8 before 1: (7b) (u64 )(r10 -8) = r1 mark_precise: frame0: regs=r1 stack= before 0: (b7) r1 = 8 6: R2=8 R3=fp8 6: (b7) r0 = 0 ; R0=0 7: (95) exit Fixes: `5ffa25502b` ("bpf: Add instructions for atomic_[cmp]xchg") Fixes: `5ca419f286` ("bpf: Add BPF_FETCH field / create atomic_fetch_add instruction") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260331222020.401848-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:57:59 -07:00
Varun R Mallya	eb7024bfcc	bpf: Reject sleepable kprobe_multi programs at attach time kprobe.multi programs run in atomic/RCU context and cannot sleep. However, bpf_kprobe_multi_link_attach() did not validate whether the program being attached had the sleepable flag set, allowing sleepable helpers such as bpf_copy_from_user() to be invoked from a non-sleepable context. This causes a "sleeping function called from invalid context" splat: BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo preempt_count: 1, expected: 0 RCU nest depth: 2, expected: 0 Fix this by rejecting sleepable programs early in bpf_kprobe_multi_link_attach(), before any further processing. Fixes: `0dcac27254` ("bpf: Add multi kprobe link") Signed-off-by: Varun R Mallya <varunrmallya@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Leon Hwang <leon.hwang@linux.dev> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260401191126.440683-1-varunrmallya@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:48:46 -07:00
Qi Tang	b0db1accbc	bpf: reject direct access to nullable PTR_TO_BUF pointers check_mem_access() matches PTR_TO_BUF via base_type() which strips PTR_MAYBE_NULL, allowing direct dereference without a null check. Map iterator ctx->key and ctx->value are PTR_TO_BUF \| PTR_MAYBE_NULL. On stop callbacks these are NULL, causing a kernel NULL dereference. Add a type_may_be_null() guard to the PTR_TO_BUF branch, matching the existing PTR_TO_BTF_ID pattern. Fixes: `20b2aff4bc` ("bpf: Introduce MEM_RDONLY flag") Signed-off-by: Qi Tang <tpluszz77@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260402092923.38357-2-tpluszz77@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:47:13 -07:00
Mykyta Yatsenko	cc878b4144	bpf: Migrate dynptr file to kmalloc_nolock Replace bpf_mem_alloc/bpf_mem_free with kmalloc_nolock/kfree_nolock for bpf_dynptr_file_impl, continuing the migration away from bpf_mem_alloc now that kmalloc can be used from NMI context. freader_cleanup() runs before kfree_nolock() while the dynptr still holds exclusive access, so plain kfree_nolock() is safe — no concurrent readers can access the object. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260330-kmalloc_special-v2-2-c90403f92ff0@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:31:42 -07:00
Mykyta Yatsenko	90f51ebff2	bpf: Migrate bpf_task_work to kmalloc_nolock Replace bpf_mem_alloc/bpf_mem_free with kmalloc_nolock/kfree_rcu for bpf_task_work_ctx. Replace guard(rcu_tasks_trace)() with guard(rcu)() in bpf_task_work_irq(). The function only accesses ctx struct members (not map values), so tasks trace protection is not needed - regular RCU is sufficient since ctx is freed via kfree_rcu. The guard in bpf_task_work_callback() remains as tasks trace since it accesses map values from process context. Sleepable BPF programs hold rcu_read_lock_trace but not regular rcu_read_lock. Since kfree_rcu waits for a regular RCU grace period, the ctx memory can be freed while a sleepable program is still running. Add scoped_guard(rcu) around the pointer read and refcount tryget in bpf_task_work_acquire_ctx to close this race window. Since kfree_rcu uses call_rcu internally which is not safe from NMI context, defer destruction via irq_work when irqs are disabled. For the lost-cmpxchg path the ctx was never published, so kfree_nolock is safe. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260330-kmalloc_special-v2-1-c90403f92ff0@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:31:42 -07:00
K Prateek Nayak	c03791085a	cpufreq: Pass the policy to cpufreq_driver->adjust_perf() cpufreq_cpu_get() can sleep on PREEMPT_RT in presence of concurrent writer(s), however amd-pstate depends on fetching the cpudata via the policy's driver data which necessitates grabbing the reference. Since schedutil governor can call "cpufreq_driver->update_perf()" during sched_tick/enqueue/dequeue with rq_lock held and IRQs disabled, fetching the policy object using the cpufreq_cpu_get() helper in the scheduler fast-path leads to "BUG: scheduling while atomic" on PREEMPT_RT [1]. Pass the cached cpufreq policy object in sg_policy to the update_perf() instead of just the CPU. The CPU can be inferred using "policy->cpu". The lifetime of cpufreq_policy object outlasts that of the governor and the cpufreq driver (allocated when the CPU is onlined and only reclaimed when the CPU is offlined / the CPU device is removed) which makes it safe to be referenced throughout the governor's lifetime. Closes:https://lore.kernel.org/all/20250731092316.3191-1-spasswolf@web.de/ [1] Fixes: `1d215f0319` ("cpufreq: amd-pstate: Add fast switch function for AMD P-State") Reported-by: Bert Karwatzki <spasswolf@web.de> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Gary Guo <gary@garyguo.net> # Rust Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Reviewed-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com> Link: https://lore.kernel.org/r/20260316081849.19368-3-kprateek.nayak@amd.com Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>	2026-04-02 11:30:24 -05:00
Leon Hwang	611fe4b79a	bpf: Fix abuse of kprobe_write_ctx via freplace uprobe programs are allowed to modify struct pt_regs. Since the actual program type of uprobe is KPROBE, it can be abused to modify struct pt_regs via kprobe+freplace when the kprobe attaches to kernel functions. For example, SEC("?kprobe") int kprobe(struct pt_regs regs) { return 0; } SEC("?freplace") int freplace_kprobe(struct pt_regs regs) { regs->di = 0; return 0; } freplace_kprobe prog will attach to kprobe prog. kprobe prog will attach to a kernel function. Without this patch, when the kernel function runs, its first arg will always be set as 0 via the freplace_kprobe prog. To fix the abuse of kprobe_write_ctx=true via kprobe+freplace, disallow attaching freplace programs on kprobe programs with different kprobe_write_ctx values. Fixes: `7384893d97` ("bpf: Allow uprobe program to change context registers") Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260331145353.87606-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:29:49 -07:00
Vincent Donnefort	ce47b798ed	tracing: Non-consuming read for trace remotes with an offline CPU When a trace_buffer is created while a CPU is offline, this CPU is cleared from the trace_buffer CPU mask, preventing the creation of a non-consuming iterator (ring_buffer_iter). For trace remotes, it means the iterator fails to be allocated (-ENOMEM) even though there are available ring buffers in the trace_buffer. For non-consuming reads of trace remotes, skip missing ring_buffer_iter to allow reading the available ring buffers. Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://patch.msgid.link/20260401045100.3394299-2-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-04-02 14:16:09 +01:00
Ingo Molnar	9853914c08	Merge branch 'sched/urgent' into sched/core, to resolve conflicts The following fix in sched/urgent: `e08d007f9d` ("sched/debug: Fix avg_vruntime() usage") is in conflict with this pending commit in sched/core: `4823725d9d` ("sched/fair: Increase weight bits for avg_vruntime") Both modify the same variable definition and initialization blocks, resolve it by merging the two. Conflicts: kernel/sched/debug.c Signed-off-by: Ingo Molnar <mingo@kernel.org>	2026-04-02 15:04:09 +02:00
Peter Zijlstra	e08d007f9d	sched/debug: Fix avg_vruntime() usage John reported that stress-ng-yield could make his machine unhappy and managed to bisect it to commit `b3d99f43c7` ("sched/fair: Fix zero_vruntime tracking"). The commit in question changes avg_vruntime() from a function that is a pure reader, to a function that updates variables. This turns an unlocked sched/debug usage of this function from a minor mistake into a data corruptor. Fixes: `af4cf40470` ("sched/fair: Add cfs_rq::avg_vruntime") Fixes: `b3d99f43c7` ("sched/fair: Fix zero_vruntime tracking") Reported-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260401132355.196370805@infradead.org	2026-04-02 13:42:43 +02:00
Peter Zijlstra	1319ea5752	sched/fair: Fix zero_vruntime tracking fix John reported that stress-ng-yield could make his machine unhappy and managed to bisect it to commit `b3d99f43c7` ("sched/fair: Fix zero_vruntime tracking"). The combination of yield and that commit was specific enough to hypothesize the following scenario: Suppose we have 2 runnable tasks, both doing yield. Then one will be eligible and one will not be, because the average position must be in between these two entities. Therefore, the runnable task will be eligible, and be promoted a full slice (all the tasks do is yield after all). This causes it to jump over the other task and now the other task is eligible and current is no longer. So we schedule. Since we are runnable, there is no {de,en}queue. All we have is the __{en,de}queue_entity() from {put_prev,set_next}_task(). But per the fingered commit, those two no longer move zero_vruntime. All that moves zero_vruntime are tick and full {de,en}queue. This means, that if the two tasks playing leapfrog can reach the critical speed to reach the overflow point inside one tick's worth of time, we're up a creek. Additionally, when multiple cgroups are involved, there is no guarantee the tick will in fact hit every cgroup in a timely manner. Statistically speaking it will, but that same statistics does not rule out the possibility of one cgroup not getting a tick for a significant amount of time -- however unlikely. Therefore, just like with the yield() case, force an update at the end of every slice. This ensures the update is never more than a single slice behind and the whole thing is within 2 lag bounds as per the comment on entity_key(). Fixes: `b3d99f43c7` ("sched/fair: Fix zero_vruntime tracking") Reported-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260401132355.081530332@infradead.org	2026-04-02 13:42:43 +02:00
Jiri Pirko	f0548044a0	dma-mapping: introduce DMA_ATTR_CC_SHARED for shared memory Current CC designs don't place a vIOMMU in front of untrusted devices. Instead, the DMA API forces all untrusted device DMA through swiotlb bounce buffers (is_swiotlb_force_bounce()) which copies data into shared memory on behalf of the device. When a caller has already arranged for the memory to be shared via set_memory_decrypted(), the DMA API needs to know so it can map directly using the unencrypted physical address rather than bounce buffering. Following the pattern of DMA_ATTR_MMIO, add DMA_ATTR_CC_SHARED for this purpose. Like the MMIO case, only the caller knows what kind of memory it has and must inform the DMA API for it to work correctly. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Sumit Semwal <sumit.semwal@linaro.org> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260325192352.437608-2-jiri@resnulli.us	2026-04-02 07:29:33 +02:00

... 5 6 7 8 9 ...

51814 Commits