linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-19 00:01:01 -04:00

Author	SHA1	Message	Date
Tejun Heo	49d78adf95	sched_ext: Drop spurious warning on kick during scheduler disable kick_cpus_irq_workfn() warns when scx_kick_syncs is NULL, but this can legitimately happen when a BPF timer or other kick source races with free_kick_syncs() during scheduler disable. Drop the pr_warn_once() and add a comment explaining the race. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>	2026-04-10 16:38:25 -10:00
Tejun Heo	e719e17d99	sched_ext: Warn on task-based SCX op recursion The kf_tasks[] design assumes task-based SCX ops don't nest - if they did, kf_tasks[0] would get clobbered. The old scx_kf_allow() WARN_ONCE caught invalid nesting via kf_mask, but that machinery is gone now. Add a WARN_ON_ONCE(current->scx.kf_tasks[0]) at the top of each SCX_CALL_OP_TASK*() macro. Checking kf_tasks[0] alone is sufficient: all three variants (SCX_CALL_OP_TASK, SCX_CALL_OP_TASK_RET, SCX_CALL_OP_2TASKS_RET) write to kf_tasks[0], so a non-NULL value at entry to any of the three means re-entry from somewhere in the family. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
Tejun Heo	979a98b6e9	sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok() The "kf_allowed" framing on this helper comes from the old runtime scx_kf_allowed() gate, which has been removed. Rename it to describe what it actually does in the new model. Pure rename, no functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
Cheng-Yang Chou	7cd9a5d7d4	sched_ext: Remove runtime kfunc mask enforcement Now that scx_kfunc_context_filter enforces context-sensitive kfunc restrictions at BPF load time, the per-task runtime enforcement via scx_kf_mask is redundant. Remove it entirely: - Delete enum scx_kf_mask, the kf_mask field on sched_ext_entity, and the scx_kf_allow()/scx_kf_disallow()/scx_kf_allowed() helpers along with the higher_bits()/highest_bit() helpers they used. - Strip the @mask parameter (and the BUILD_BUG_ON checks) from the SCX_CALL_OP[_RET]/SCX_CALL_OP_TASK[_RET]/SCX_CALL_OP_2TASKS_RET macros and update every call site. Reflow call sites that were wrapped only to fit the old 5-arg form and now collapse onto a single line under ~100 cols. - Remove the in-kfunc scx_kf_allowed() runtime checks from scx_dsq_insert_preamble(), scx_dsq_move(), scx_bpf_dispatch_nr_slots(), scx_bpf_dispatch_cancel(), scx_bpf_dsq_move_to_local___v2(), scx_bpf_sub_dispatch(), scx_bpf_reenqueue_local(), and the per-call guard inside select_cpu_from_kfunc(). scx_bpf_task_cgroup() and scx_kf_allowed_on_arg_tasks() were already cleaned up in the "drop redundant rq-locked check" patch. scx_kf_allowed_if_unlocked() was rewritten in the preceding "decouple" patch. No further changes to those helpers here. Co-developed-by: Juntong Deng <juntong.deng@outlook.com> Signed-off-by: Juntong Deng <juntong.deng@outlook.com> Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
Tejun Heo	d1d3c1c6ae	sched_ext: Add verifier-time kfunc context filter Move enforcement of SCX context-sensitive kfunc restrictions from per-task runtime kf_mask checks to BPF verifier-time filtering, using the BPF core's struct_ops context information. A shared .filter callback is attached to each context-sensitive BTF set and consults a per-op allow table (scx_kf_allow_flags[]) indexed by SCX ops member offset. Disallowed calls are now rejected at program load time instead of at runtime. The old model split reachability across two places: each SCX_CALL_OP*() set bits naming its op context, and each kfunc's scx_kf_allowed() check OR'd together the bits it accepted. A kfunc was callable when those two masks overlapped. The new model transposes the result to the caller side - each op's allow flags directly list the kfunc groups it may call. The old bit assignments were: Call-site bits: ops.select_cpu = ENQUEUE | SELECT_CPU; ops.enqueue = ENQUEUE; ops.dispatch = DISPATCH; ops.cpu_release = CPU_RELEASE. Kfunc-group accepted bits: enqueue group = ENQUEUE | DISPATCH; select_cpu group = SELECT_CPU | ENQUEUE; dispatch group = DISPATCH; cpu_release group = CPU_RELEASE. Intersecting them yields the reachability now expressed directly by scx_kf_allow_flags[]: ops.select_cpu -> SELECT_CPU | ENQUEUE; ops.enqueue -> SELECT_CPU | ENQUEUE; ops.dispatch -> ENQUEUE | DISPATCH; ops.cpu_release -> CPU_RELEASE. Unlocked ops carried no kf_mask bits and reached only unlocked kfuncs; that maps directly to UNLOCKED in the new table. Equivalence was checked by walking every (op, kfunc-group) combination across SCX ops, SYSCALL, and non-SCX struct_ops callers against the old scx_kf_allowed() runtime checks. With two intended exceptions (see below), all combinations reach the same verdict; disallowed calls are now caught at load time instead of firing scx_error() at runtime. 
scx_bpf_dsq_move_set_slice() and scx_bpf_dsq_move_set_vtime() are exceptions: they have no runtime check at all, but the new filter rejects them from ops outside dispatch/unlocked. The affected cases are nonsensical - the values these setters store are only read by scx_bpf_dsq_move{,_vtime}(), which is itself restricted to dispatch/unlocked, so a setter call from anywhere else was already dead code. Runtime scx_kf_mask enforcement is left in place by this patch and removed in a follow-up. Original-patch-by: Juntong Deng <juntong.deng@outlook.com> Original-patch-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
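The intersection described in d1d3c1c6ae can be reproduced in a few lines of userspace C. This is only a sketch: the bit values, the `KF_*`/`OP_*` names, and `derive_allow_flags()` are hypothetical stand-ins, and only the mask contents are taken from the commit message. An op reaches a kfunc group when the bits its call site set overlap the bits that group accepted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values; only the names follow the commit message. */
enum kf_bit {
	KF_SELECT_CPU	= 1 << 0,
	KF_ENQUEUE	= 1 << 1,
	KF_DISPATCH	= 1 << 2,
	KF_CPU_RELEASE	= 1 << 3,
};

enum op_id { OP_SELECT_CPU, OP_ENQUEUE, OP_DISPATCH, OP_CPU_RELEASE, NR_OPS };

/* Old model, call-site side: bits each op set before invoking BPF. */
static const unsigned int op_call_bits[NR_OPS] = {
	[OP_SELECT_CPU]		= KF_ENQUEUE | KF_SELECT_CPU,
	[OP_ENQUEUE]		= KF_ENQUEUE,
	[OP_DISPATCH]		= KF_DISPATCH,
	[OP_CPU_RELEASE]	= KF_CPU_RELEASE,
};

/* Old model, kfunc side: bits each kfunc group accepted. */
struct kf_group {
	unsigned int flag;	/* flag naming the group in the new table */
	unsigned int accepted;	/* call-site bits the group accepted */
};

static const struct kf_group kf_groups[] = {
	{ KF_ENQUEUE,		KF_ENQUEUE | KF_DISPATCH },
	{ KF_SELECT_CPU,	KF_SELECT_CPU | KF_ENQUEUE },
	{ KF_DISPATCH,		KF_DISPATCH },
	{ KF_CPU_RELEASE,	KF_CPU_RELEASE },
};

/* New model: list, per op, every group whose accepted bits overlap. */
static unsigned int derive_allow_flags(enum op_id op)
{
	unsigned int allowed = 0;

	for (size_t i = 0; i < sizeof(kf_groups) / sizeof(kf_groups[0]); i++)
		if (op_call_bits[op] & kf_groups[i].accepted)
			allowed |= kf_groups[i].flag;
	return allowed;
}

/* Compare the derived flags with the table quoted in the commit message. */
static void check_against_commit_table(void)
{
	assert(derive_allow_flags(OP_SELECT_CPU) == (KF_SELECT_CPU | KF_ENQUEUE));
	assert(derive_allow_flags(OP_ENQUEUE) == (KF_SELECT_CPU | KF_ENQUEUE));
	assert(derive_allow_flags(OP_DISPATCH) == (KF_ENQUEUE | KF_DISPATCH));
	assert(derive_allow_flags(OP_CPU_RELEASE) == KF_CPU_RELEASE);
}
```

Calling `check_against_commit_table()` confirms that the derived per-op flags match the four `->` mappings listed in the message.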
Tejun Heo	2193af26a1	sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup() scx_kf_allowed_on_arg_tasks() runs both an scx_kf_allowed(__SCX_KF_RQ_LOCKED) mask check and a kf_tasks[] check. After the preceding call-site fixes, every SCX_CALL_OP_TASK*() invocation has kf_mask & __SCX_KF_RQ_LOCKED non-zero, so the mask check is redundant whenever the kf_tasks[] check passes. Drop it and simplify the helper to take only @sch and @p. Fold the locking guarantee into the SCX_CALL_OP_TASK() comment block, which scx_bpf_task_cgroup() now points to. No functional change. Extracted from a larger verifier-time kfunc context filter patch originally written by Juntong Deng. Original-patch-by: Juntong Deng <juntong.deng@outlook.com> Cc: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
Tejun Heo	0022b32850	sched_ext: Decouple kfunc unlocked-context check from kf_mask scx_kf_allowed_if_unlocked() uses !current->scx.kf_mask as a proxy for "no SCX-tracked lock held". kf_mask is removed in a follow-up patch, so its two callers - select_cpu_from_kfunc() and scx_dsq_move() - need another basis. Add a new bool scx_rq.in_select_cpu, set across the SCX_CALL_OP_TASK_RET that invokes ops.select_cpu(), to capture the one case where SCX itself holds no lock but try_to_wake_up() holds @p's pi_lock. Together with scx_locked_rq(), it expresses the same accepted-context set. select_cpu_from_kfunc() needs a runtime test because it has to take different locking paths depending on context. Open-code as a three-way branch. The unlocked branch takes raw_spin_lock_irqsave(&p->pi_lock) directly - pi_lock alone is enough for the fields the kfunc reads, and is lighter than task_rq_lock(). scx_dsq_move() doesn't really need a runtime test - its accepted contexts could be enforced at verifier load time. But since the runtime state is already there and using it keeps the upcoming load-time filter simpler, just write it the same way: (scx_locked_rq() || in_select_cpu) && !kf_allowed(DISPATCH). scx_kf_allowed_if_unlocked() is deleted with the conversions. No semantic change. v2: s/No functional change/No semantic change/ - the unlocked path now acquires pi_lock instead of the heavier task_rq_lock() (Andrea Righi). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
Tejun Heo	b470e37c1f	sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking sched_move_task() invokes ops.cgroup_move() inside task_rq_lock(tsk), so @p's rq lock is held. The SCX_CALL_OP_TASK invocation mislabels this: - kf_mask = SCX_KF_UNLOCKED (== 0), claiming no lock is held. - rq = NULL, so update_locked_rq() doesn't run and scx_locked_rq() returns NULL. Switch to SCX_KF_REST and pass task_rq(p), matching ops.set_cpumask() from set_cpus_allowed_scx(). Three effects: - scx_bpf_task_cgroup() becomes callable (was rejected by scx_kf_allowed(__SCX_KF_RQ_LOCKED)). Safe; rq lock is held. - scx_bpf_dsq_move() is now rejected (was allowed via the unlocked branch). Calling it while holding an unrelated task's rq lock is risky; rejection is correct. - scx_bpf_select_cpu_*() previously took the unlocked branch in select_cpu_from_kfunc() and called task_rq_lock(p, &rf), which would deadlock against the already-held pi_lock. Now it takes the locked-rq branch and is rejected with -EPERM via the existing kf_allowed(SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE) check. Latent deadlock fix. No in-tree scheduler is known to call any of these from ops.cgroup_move(). v2: Add Fixes: tag (Andrea Righi). Fixes: `18853ba782` ("sched_ext: Track currently locked rq") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
Tejun Heo	9fb457074f	sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask The SCX_CALL_OP_TASK call site passes rq=NULL incorrectly, leaving scx_locked_rq() unset. Pass task_rq(p) instead so update_locked_rq() reflects reality. v2: Add Fixes: tag (Andrea Righi). Fixes: `18853ba782` ("sched_ext: Track currently locked rq") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
Tejun Heo	a37e134317	sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked select_cpu_from_kfunc() has an extra scx_kf_allowed_if_unlocked() branch that accepts calls from unlocked contexts and takes task_rq_lock() itself - a "callable from unlocked" property encoded in the kfunc body rather than in set membership. That's fine while the runtime check is the authoritative gate, but the upcoming verifier-time filter uses set membership as the source of truth and needs it to reflect every context the kfunc may be called from. Add the three select_cpu kfuncs to scx_kfunc_ids_unlocked so their full set of callable contexts is captured by set membership. This follows the existing dual-set convention used by scx_bpf_dsq_move{,_vtime} and scx_bpf_dsq_move_set_{slice,vtime}, which are members of both scx_kfunc_ids_dispatch and scx_kfunc_ids_unlocked. While at it, add brief comments on each duplicate BTF_ID_FLAGS block (including the pre-existing dsq_move ones) explaining the dual membership. No runtime behavior change: the runtime check in select_cpu_from_kfunc() remains the authoritative gate until it is removed along with the rest of the scx_kf_mask enforcement in a follow-up. v2: Clarify dispatch-set comment to name scx_bpf_dsq_move*() explicitly so it doesn't appear to cover scx_bpf_sub_dispatch() (Andrea Righi). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
Tejun Heo	9b5501d3c9	sched_ext: Drop TRACING access to select_cpu kfuncs The select_cpu kfuncs - scx_bpf_select_cpu_dfl(), scx_bpf_select_cpu_and() and __scx_bpf_select_cpu_and() - take task_rq_lock() internally. Exposing them via scx_kfunc_set_idle to BPF_PROG_TYPE_TRACING is unsafe: arbitrary tracing contexts (kprobes, tracepoints, fentry, LSM) may run with @p's pi_lock state unknown. Move them out of scx_kfunc_ids_idle into a new scx_kfunc_ids_select_cpu set registered only for STRUCT_OPS and SYSCALL. Extracted from a larger verifier-time kfunc context filter patch originally written by Juntong Deng. Original-patch-by: Juntong Deng <juntong.deng@outlook.com> Cc: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
Tejun Heo	744ab12a5b	Merge branch 'for-7.0-fixes' into for-7.1 Conflict in kernel/sched/ext.c between: `7e0ffb72de` ("sched_ext: Fix stale direct dispatch state in ddsp_dsq_id") which clears ddsp state at individual call sites instead of dispatch_enqueue(), and sub-sched related code reorg and API updates on for-7.1. Resolved by applying the ddsp fix with for-7.1's signatures. Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-03 07:48:28 -10:00
Andrea Righi	7e0ffb72de	sched_ext: Fix stale direct dispatch state in ddsp_dsq_id @p->scx.ddsp_dsq_id can be left set (non-SCX_DSQ_INVALID) triggering a spurious warning in mark_direct_dispatch() when the next wakeup's ops.select_cpu() calls scx_bpf_dsq_insert(), such as: WARNING: kernel/sched/ext.c:1273 at scx_dsq_insert_commit+0xcd/0x140 The root cause is that ddsp_dsq_id was only cleared in dispatch_enqueue(), which is not reached in all paths that consume or cancel a direct dispatch verdict. Fix it by clearing it at the right places: - direct_dispatch(): cache the direct dispatch state in local variables and clear it before dispatch_enqueue() on the synchronous path. For the deferred path, the direct dispatch state must remain set until process_ddsp_deferred_locals() consumes them. - process_ddsp_deferred_locals(): cache the dispatch state in local variables and clear it before calling dispatch_to_local_dsq(), which may migrate the task to another rq. - do_enqueue_task(): clear the dispatch state on the enqueue path (local/global/bypass fallbacks), where the direct dispatch verdict is ignored. - dequeue_task_scx(): clear the dispatch state after dispatch_dequeue() to handle both the deferred dispatch cancellation and the holding_cpu race, covering all cases where a pending direct dispatch is cancelled. - scx_disable_task(): clear the direct dispatch state when transitioning a task out of the current scheduler. Waking tasks may have had the direct dispatch state set by the outgoing scheduler's ops.select_cpu() and then been queued on a wake_list via ttwu_queue_wakelist(), when SCX_OPS_ALLOW_QUEUED_WAKEUP is set. Such tasks are not on the runqueue and are not iterated by scx_bypass(), so their direct dispatch state won't be cleared. Without this clear, any subsequent SCX scheduler that tries to direct dispatch the task will trigger the WARN_ON_ONCE() in mark_direct_dispatch(). 
Fixes: `5b26f7b920` ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches") Cc: stable@vger.kernel.org # v6.12+ Cc: Daniel Hodges <hodgesd@meta.com> Cc: Patrick Somaru <patsomaru@meta.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-03 07:14:49 -10:00
Changwoo Min	0c4a59df37	sched_ext: Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCU Since commit `8e4f0b1ebc` ("bpf: use rcu_read_lock_dont_migrate() for trampoline.c"), the BPF prolog (__bpf_prog_enter) calls migrate_disable() only when CONFIG_PREEMPT_RCU is enabled, via rcu_read_lock_dont_migrate(). Without CONFIG_PREEMPT_RCU, the prolog never touches migration_disabled, so migration_disabled == 1 always means the task is truly migration-disabled regardless of whether it is the current task. The old unconditional p == current check was a false negative in this case, potentially allowing a migration-disabled task to be dispatched to a remote CPU and triggering scx_error in task_can_run_on_remote_rq(). Only apply the p == current disambiguation when CONFIG_PREEMPT_RCU is enabled, where the ambiguity with the BPF prolog still exists. Fixes: `8e4f0b1ebc` ("bpf: use rcu_read_lock_dont_migrate() for trampoline.c") Cc: stable@vger.kernel.org # v6.18+ Link: https://lore.kernel.org/lkml/20250821090609.42508-8-dongml2@chinatelecom.cn/ Signed-off-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-02 09:26:55 -10:00
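The decision described in 0c4a59df37 can be modeled as a small pure function. This is a userspace sketch, not the kernel helper: `bpf_migration_disabled()` and its parameters are hypothetical, with `preempt_rcu` standing in for `CONFIG_PREEMPT_RCU`. The handling of counts above 1 is an assumption, since the commit message only discusses the case where the count is exactly 1.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the check. migration_disabled is the task's
 * nesting count and is_current says whether the task is the one running
 * the BPF program.
 */
static bool bpf_migration_disabled(unsigned int migration_disabled,
				   bool is_current, bool preempt_rcu)
{
	if (migration_disabled == 1) {
		/*
		 * With PREEMPT_RCU the BPF prolog itself may have taken the
		 * count to 1 on the current task, so only a non-current task
		 * is known to be migration-disabled.
		 */
		if (preempt_rcu)
			return !is_current;
		/* Without PREEMPT_RCU the prolog never touches the count. */
		return true;
	}
	/* Assumption: any deeper nesting means truly migration-disabled. */
	return migration_disabled > 1;
}

/* The false negative the commit fixes: current task, count 1, no PREEMPT_RCU. */
static void check_fixed_case(void)
{
	assert(bpf_migration_disabled(1, true, false));
	assert(!bpf_migration_disabled(1, true, true));
}
```

The first assertion in `check_fixed_case()` is the case the old unconditional `p == current` test got wrong.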
Samuele Mariotti	b905ee77d5	sched_ext: Fix missing warning in scx_set_task_state() default case In scx_set_task_state(), the default case was setting the warn flag, but then returning immediately. This is problematic because the only purpose of the warn flag is to trigger WARN_ONCE, but the early return prevented it from ever firing, leaving invalid task states undetected and untraced. To fix this, a WARN_ONCE call is now added directly in the default case. The fix addresses two aspects: - Guarantees the invalid task states are properly logged and traced. - Provides a distinct warning message ("sched_ext: Invalid task state") specifically for states outside the defined scx_task_state enum values, making it easier to distinguish from other transition warnings. This ensures proper detection and reporting of invalid states. Signed-off-by: Samuele Mariotti <smariotti@disroot.org> Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-02 09:22:03 -10:00
Tejun Heo	94555ca6d0	Merge branch 'for-7.0-fixes' into for-7.1 Conflict in kernel/sched/ext.c init_sched_ext_class() between: `415cb193bb` ("sched_ext: Fix SCX_KICK_WAIT deadlock by deferring wait to balance callback") which adds cpus_to_sync cpumask allocation, and: `84b1a0ea0b` ("sched_ext: Implement scx_bpf_dsq_reenq() for user DSQs") `8c1b9453fd` ("sched_ext: Convert deferred_reenq_locals from llist to regular list") which add deferred_reenq init code at the same location. Both are independent additions. Include both. Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-30 09:02:05 -10:00
Tejun Heo	415cb193bb	sched_ext: Fix SCX_KICK_WAIT deadlock by deferring wait to balance callback SCX_KICK_WAIT busy-waits in kick_cpus_irq_workfn() using smp_cond_load_acquire() until the target CPU's kick_sync advances. Because the irq_work runs in hardirq context, the waiting CPU cannot reschedule and its own kick_sync never advances. If multiple CPUs form a wait cycle, all CPUs deadlock. Replace the busy-wait in kick_cpus_irq_workfn() with resched_curr() to force the CPU through do_pick_task_scx(), which queues a balance callback to perform the wait. The balance callback drops the rq lock and enables IRQs following the sched_core_balance() pattern, so the CPU can process IPIs while waiting. The local CPU's kick_sync is advanced on entry to do_pick_task_scx() and continuously during the wait, ensuring any CPU that starts waiting for us sees the advancement and cannot form cyclic dependencies. Fixes: `90e55164da` ("sched_ext: Implement SCX_KICK_WAIT") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Christian Loehle <christian.loehle@arm.com> Link: https://lore.kernel.org/r/20260316100249.1651641-1-christian.loehle@arm.com Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Christian Loehle <christian.loehle@arm.com>	2026-03-30 08:37:27 -10:00
Cheng-Yang Chou	238aa43f0b	sched_ext: Document why built-in DSQs are unsupported sources in scx_bpf_dsq_move_to_local() Add a comment explaining the design intent behind rejecting built-in DSQs (%SCX_DSQ_GLOBAL and %SCX_DSQ_LOCAL*) as sources. Local DSQs support reenqueueing but the BPF scheduler cannot directly iterate or move tasks from them. %SCX_DSQ_GLOBAL is similar but also doesn't support reenqueueing because it maps to multiple per-node DSQs, making the scope difficult to define. Also annotate @dsq_id to make clear it must be a user-created DSQ. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-27 07:33:08 -10:00
Cheng-Yang Chou	3eb8f02291	sched_ext: Fix missing SCX_EV_SUB_BYPASS_DISPATCH aggregation in scx_read_events() `025b1bd419` introduced SCX_EV_SUB_BYPASS_DISPATCH to track scheduling of bypassed descendant tasks, and correctly increments it per-CPU and displays it in sysfs and dump output. However, scx_read_events() which aggregates per-CPU counters into a summary was not updated to include this event, causing it to always read as zero in sysfs, in debug dumps, and via the scx_bpf_events() kfunc. Add the missing scx_agg_event() call for SCX_EV_SUB_BYPASS_DISPATCH. Fixes: `025b1bd419` ("sched_ext: Implement hierarchical bypass mode") Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-25 16:32:01 -10:00
Cheng-Yang Chou	3d6379196d	sched_ext: Fix missing return after scx_error() in scx_dsq_move() When scx_bpf_dsq_move[_vtime]() is called on a task that belongs to a different scheduler, scx_error() is invoked to flag the violation. scx_error() schedules an asynchronous scheduler teardown via irq_work and returns immediately, so execution falls through and the DSQ move proceeds on a cross-scheduler task regardless, potentially corrupting DSQ state. Add the missing return false so the function exits right after reporting the error, consistent with the other early-exit checks in the same function (e.g. scx_vet_enq_flags() failure at the top). Fixes: `bb4d9fd551` ("sched_ext: scx_dsq_move() should validate the task belongs to the right scheduler") Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-25 16:31:53 -10:00
Zqiang	60d4b17e88	sched_ext: Choose the right sch->ops.name to output in the print_scx_info() print_scx_info() always outputs the scx_root structure's ops.name, but on kernels built with CONFIG_EXT_SUB_SCHED=y, a task may be attached to a sub scx_sched structure. Use scx_task_sched_rcu() to get the correct scx_sched structure for printing ops.name, and drop the state check. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-24 18:21:15 -10:00
Cheng-Yang Chou	4624211bc6	sched_ext: Fix invalid kobj cast in scx_uevent() When scx_alloc_and_add_sched() creates the sub-scheduler kset, it sets sch->kobj as the parent. Because sch->kobj.kset points to scx_kset, registering this sub-kset triggers a KOBJ_ADD uevent. The uevent walk finds scx_kset and calls scx_uevent() with the sub-kset's kobject. scx_uevent() unconditionally uses container_of() to cast the incoming kobject to struct scx_sched, producing a wild pointer when the kobject belongs to the kset itself rather than a scheduler instance. Accessing sch->ops.name through this pointer causes a KASAN slab-out-of-bounds read: BUG: KASAN: slab-out-of-bounds in string+0x3b6/0x4c0 Read of size 1 at addr ffff888004d04348 by task scx_enable_help/748 Call Trace: string+0x3b6/0x4c0 vsnprintf+0x3ec/0x1550 add_uevent_var+0x160/0x3a0 scx_uevent+0x22/0x30 kobject_uevent_env+0x5dc/0x1730 kset_register+0x192/0x280 scx_alloc_and_add_sched+0x130d/0x1c60 ... Fix this by checking the kobject's ktype against scx_ktype before performing the cast, and returning 0 for non-matching kobjects. Tested with vng and scx_qmap without triggering any KASAN errors. Fixes: `ebeca1f930` ("sched_ext: Introduce cgroup sub-sched support") Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-23 07:52:13 -10:00
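The pattern behind 4624211bc6, checking an object's type tag before casting with `container_of()`, can be sketched in userspace C. Every type and name here (`struct ktype`, `struct kobj`, `struct sched`, `sched_name_of()`) is a simplified, hypothetical stand-in for the kernel's kobject machinery, and the `container_of()` macro is a minimal reimplementation without the kernel's type checks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal userspace version of the kernel macro (no type checking). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct ktype { const char *label; };
struct kobj { const struct ktype *ktype; };

/* Hypothetical scheduler object embedding a kobj. */
struct sched {
	const char *name;
	struct kobj kobj;
};

static const struct ktype sched_ktype = { "sched" };
static const struct ktype kset_ktype = { "kset" };

/*
 * Return the scheduler name, or NULL when @kobj is not embedded in a
 * struct sched. Without the ktype test, container_of() on a foreign
 * kobj would produce a wild pointer.
 */
static const char *sched_name_of(struct kobj *kobj)
{
	if (kobj->ktype != &sched_ktype)
		return NULL;
	return container_of(kobj, struct sched, kobj)->name;
}

static void check_ktype_guard(void)
{
	struct sched s = { "example", { &sched_ktype } };
	struct kobj foreign = { &kset_ktype };

	assert(strcmp(sched_name_of(&s.kobj), "example") == 0);
	assert(sched_name_of(&foreign) == NULL);
}
```

In the commit, the equivalent guard compares the kobject's ktype against `scx_ktype` and returns 0 for non-matching kobjects.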
Tejun Heo	76edc2761a	sched_ext: Use irq_work_queue_on() in schedule_deferred() schedule_deferred() uses irq_work_queue() which always queues on the calling CPU. The deferred work can run from any CPU correctly, and the _locked() path already processes remote rqs from the calling CPU. However, when falling through to the irq_work path, queuing on the target CPU is preferable as the work can run sooner via IPI delivery rather than waiting for the calling CPU to re-enable IRQs. Currently, only reenqueue operations use this path - either BPF-initiated reenqueue targeting a remote rq, or IMMED reenqueue when the target CPU is busy running userspace (not in balance or wakeup, so the _locked() fast paths aren't available). Use irq_work_queue_on() to target the owning CPU. This improves IMMED reenqueue latency when tasks are dispatched to remote local DSQs. Testing on a 24-CPU AMD Ryzen 3900X with scx_qmap -I -F 50 (ALWAYS_ENQ_IMMED, every 50th enqueue forced to prev_cpu's local DSQ) under heavy mixed load (2x CPU oversubscription, yield and context-switch pressure, SCHED_FIFO bursts, periodic fork storms, mixed nice levels, C-states disabled), measuring local DSQ residence time (insert to remove) over 5 x 120s runs (~1.2M tasks per set): >128us outliers: 71 -> 39 (-45%) >256us outliers: 59 -> 36 (-39%) Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>	2026-03-22 14:05:25 -10:00
Andrea Righi	63f500c32a	sched_ext: Guard cpu_smt_mask() with CONFIG_SCHED_SMT Wrap cpu_smt_mask() usage with CONFIG_SCHED_SMT to avoid build failures on kernels built without SMT support. Fixes: `2197cecdb0` ("sched_ext: idle: Prioritize idle SMT sibling") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202603221422.XIueJOE9-lkp@intel.com/ Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-22 10:02:49 -10:00
Cheng-Yang Chou	e73b1d7210	sched_ext: Fix build errors and unused label warning in non-cgroup configs When building with SCHED_CLASS_EXT=y but CGROUPS=n, clang reports errors for undeclared cgroup_put() and cgroup_get() calls, and a warning for the unused err_stop_helper label. EXT_SUB_SCHED is def_bool y depending only on SCHED_CLASS_EXT, but it fundamentally requires cgroups (cgroup_path, cgroup_get, cgroup_put, cgroup_id, etc.). Add the missing CGROUPS dependency to EXT_SUB_SCHED in init/Kconfig. Guard cgroup_put() and cgroup_get() in the common paths with: #if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED) Guard the err_stop_helper label with #ifdef CONFIG_EXT_SUB_SCHED since all gotos targeting it are inside that same ifdef block. Tested with both CGROUPS enabled and disabled. Fixes: `ebeca1f930` ("sched_ext: Introduce cgroup sub-sched support") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202603210903.IrKhPd6k-lkp@intel.com/ Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-22 10:02:11 -10:00
Cheng-Yang Chou	db08b1940f	sched_ext: Fix inconsistent NUMA node lookup in scx_select_cpu_dfl() In the WAKE_SYNC path of scx_select_cpu_dfl(), waker_node was computed with cpu_to_node(), while node (for prev_cpu) was computed with scx_cpu_node_if_enabled(). When scx_builtin_idle_per_node is disabled, idle_cpumask(waker_node) is called with a real node ID even though per-node idle tracking is disabled, resulting in undefined behavior. Fix by using scx_cpu_node_if_enabled() for waker_node as well, ensuring both variables are computed consistently. Fixes: `48849271e6` ("sched_ext: idle: Per-node idle cpumasks") Cc: stable@vger.kernel.org # v6.15+ Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-21 14:22:37 -10:00
Andrea Righi	2197cecdb0	sched_ext: idle: Prioritize idle SMT sibling In the default built-in idle CPU selection policy, when @prev_cpu is busy and no fully idle core is available, try to place the task on its SMT sibling if that sibling is idle, before searching any other idle CPU in the same LLC. Migration to the sibling is cheap and keeps the task on the same core, preserving L1 cache and reducing wakeup latency. On large SMT systems this appears to consistently boost throughput by roughly 2-3% on CPU-bound workloads (running a number of tasks equal to the number of SMT cores). Cc: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-21 08:31:16 -10:00
zhidao su	2e5e5b3738	sched_ext: Fix typos in comments Fix five typos across three files: - kernel/sched/ext.c: 'monotically' -> 'monotonically' (line 55) - kernel/sched/ext.c: 'used by to check' -> 'used to check' (line 56) - kernel/sched/ext.c: 'hardlockdup' -> 'hardlockup' (line 3881) - kernel/sched/ext_idle.c: 'don't perfectly overlaps' -> 'don't perfectly overlap' (line 371) - tools/sched_ext/scx_flatcg.bpf.c: 'shaer' -> 'share' (line 21) Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-17 07:46:36 -10:00
Cheng-Yang Chou	2008fb2573	sched_ext: Fix slab-out-of-bounds in scx_alloc_and_add_sched() ancestors[] is a flexible array member that needs level + 1 slots to hold all ancestors including self (indices 0..level), but kzalloc_flex() only allocates `level` slots: sch = kzalloc_flex(sch, ancestors, level); ... sch->ancestors[level] = sch; /* one past the end */ For the root scheduler (level = 0), zero slots are allocated and ancestors[0] is written immediately past the end of the object. KASAN reports: BUG: KASAN: slab-out-of-bounds in scx_alloc_and_add_sched+0x1c17/0x1d10 Write of size 8 at addr ffff888066b56538 by task scx_enable_help/667 The buggy address is located 0 bytes to the right of allocated 1336-byte region [ffff888066b56000, ffff888066b56538) Fix by passing level + 1 to kzalloc_flex(). Tested with vng + scx_lavd, KASAN no longer triggers. Fixes: `ebeca1f930` ("sched_ext: Introduce cgroup sub-sched support") Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-16 07:55:50 -10:00
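The sizing rule from 2008fb2573 (indices 0..level need level + 1 slots) can be illustrated with a userspace flexible array member. This sketch uses `calloc()` in place of the kernel allocator, and `struct sched` and `sched_alloc()` are hypothetical simplifications of the structures named in the commit.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the scheduler object. */
struct sched {
	int level;			/* 0 for the root */
	struct sched *ancestors[];	/* indices 0..level, self at [level] */
};

/* Allocate a scheduler below @parent (NULL for the root). */
static struct sched *sched_alloc(const struct sched *parent)
{
	int level = parent ? parent->level + 1 : 0;
	/* Slots 0..level inclusive, so level + 1 entries, never just level. */
	size_t nr_slots = (size_t)level + 1;
	struct sched *sch;

	sch = calloc(1, sizeof(*sch) + nr_slots * sizeof(sch->ancestors[0]));
	if (!sch)
		return NULL;

	sch->level = level;
	for (int i = 0; i < level; i++)
		sch->ancestors[i] = parent->ancestors[i];
	sch->ancestors[level] = sch;	/* in bounds thanks to the + 1 */
	return sch;
}

static void check_root_has_self_slot(void)
{
	struct sched *root = sched_alloc(NULL);

	assert(root && root->level == 0);
	assert(root->ancestors[0] == root);
	free(root);
}
```

With only `level` slots, the root case would allocate zero entries and the final store would land just past the object, which is the write KASAN reported.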
Tejun Heo	618a9db015	sched_ext: Use kobject_put() for kobject_init_and_add() failure in scx_alloc_and_add_sched() kobject_init_and_add() failure requires kobject_put() for proper cleanup, but the error paths were using kfree(sch) possibly leaking the kobject name. The kset_create_and_add() failure was already using kobject_put() correctly. Switch the kobject_init_and_add() error paths to use kobject_put(). As the release path puts the cgroup ref, make scx_alloc_and_add_sched() always consume @cgrp via a new err_put_cgrp label at the bottom of the error chain and update scx_sub_enable_workfn() accordingly. Fixes: `17108735b4` ("sched_ext: Use dynamic allocation for scx_sched") Reported-by: David Carlier <devnexen@gmail.com> Link: https://lore.kernel.org/r/20260314134457.46216-1-devnexen@gmail.com Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-15 23:27:04 -10:00
Tejun Heo	0c66b0da00	sched_ext: Fix cgroup double-put on sub-sched abort path The abort path in scx_sub_enable_workfn() fell through to out_put_cgrp, double-putting the cgroup ref already owned by sch->cgrp. It also skipped kthread_flush_work() needed to flush the disable path. Relocate the abort block above err_unlock_and_disable so it falls through to err_disable. Fixes: `337ec00b1d` ("sched_ext: Implement cgroup sub-sched enabling and disabling") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-15 23:26:35 -10:00
Cheng-Yang Chou	e36bc38ebf	sched_ext: Fix uninitialized ret in scx_alloc_and_add_sched() Under CONFIG_EXT_SUB_SCHED, the kzalloc() and kstrdup() failure paths jump to err_stop_helper without first setting ret. The function then returns ERR_PTR(ret) with ret uninitialized, which can produce ERR_PTR(0) (NULL), causing the caller's IS_ERR() check to pass and leading to a NULL pointer dereference. Set ret = -ENOMEM before each goto to fix the error path. Fixes: `ebeca1f930` ("sched_ext: Introduce cgroup sub-sched support") Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-13 23:00:53 -10:00
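Why `ERR_PTR(0)` slips past `IS_ERR()`, as described in e36bc38ebf, can be shown with a userspace reimplementation of the two helpers. The definitions below are modeled on the kernel's error-pointer convention (the top 4095 address values encode errnos) but are written for this sketch. `caller_detects_failure()` is hypothetical, and a `ret` of 0 stands in for the uninitialized value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace sketch of the kernel helpers, not the kernel definitions. */
static void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static int IS_ERR(const void *ptr)
{
	/* Only the top MAX_ERRNO address values count as errors. */
	return (uintptr_t)ptr >= (uintptr_t)(-MAX_ERRNO);
}

/* Hypothetical caller: does it notice that the allocation path failed? */
static int caller_detects_failure(long ret)
{
	void *sch = ERR_PTR(ret);

	return IS_ERR(sch);
}

static void check_err_ptr_zero(void)
{
	assert(caller_detects_failure(-ENOMEM));	/* ret set before goto */
	assert(!caller_detects_failure(0));		/* ret left at 0: missed */
	assert(ERR_PTR(0) == NULL);			/* and later dereferenced */
}
```

Setting `ret = -ENOMEM` before each `goto` moves the failure path from the second assertion's case to the first.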
Tejun Heo	238eba8c21	sched_ext: Use schedule_deferred_locked() in schedule_dsq_reenq() schedule_dsq_reenq() always uses schedule_deferred() which falls back to irq_work. However, callers like schedule_reenq_local() already hold the target rq lock, and scx_bpf_dsq_reenq() may hold it via the ops callback. Add a locked_rq parameter so schedule_dsq_reenq() can use schedule_deferred_locked() when the target rq is already held. The locked variant can use cheaper paths (balance callbacks, wakeup hooks) instead of always bouncing through irq_work. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-13 09:43:23 -10:00
Tejun Heo	3229ac4a5e	sched_ext: Add SCX_OPS_ALWAYS_ENQ_IMMED ops flag SCX_ENQ_IMMED makes enqueue to local DSQs succeed only if the task can start running immediately. Otherwise, the task is re-enqueued through ops.enqueue(). This provides tighter control but requires specifying the flag on every insertion. Add SCX_OPS_ALWAYS_ENQ_IMMED ops flag. When set, SCX_ENQ_IMMED is automatically applied to all local DSQ enqueues including through scx_bpf_dsq_move_to_local(). scx_qmap is updated with -I option to test the feature and -F option for IMMED stress testing which forces every Nth enqueue to a busy local DSQ. v2: - Cover scx_bpf_dsq_move_to_local() path (now has enq_flags via ___v2). - scx_qmap: Remove sched_switch and cpu_release handlers (superseded by kernel-side wakeup_preempt_scx()). Add -F for IMMED stress testing. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-13 09:43:23 -10:00
Tejun Heo	860683763e	sched_ext: Add enq_flags to scx_bpf_dsq_move_to_local() scx_bpf_dsq_move_to_local() moves a task from a non-local DSQ to the current CPU's local DSQ. This is an indirect way of dispatching to a local DSQ and should support enq_flags like direct dispatches do - e.g. SCX_ENQ_HEAD for head-of-queue insertion and SCX_ENQ_IMMED for immediate execution guarantees. Add scx_bpf_dsq_move_to_local___v2() with an enq_flags parameter. The original becomes a v1 compat wrapper passing 0. The compat macro is updated to a three-level chain: v2 (7.1+) -> v1 (current) -> scx_bpf_consume (pre-rename). All in-tree BPF schedulers are updated to pass 0. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-13 09:43:23 -10:00
Tejun Heo	da32a2986e	sched_ext: Plumb enq_flags through the consume path Add enq_flags parameter to consume_dispatch_q() and consume_remote_task(), passing it through to move_{local,remote}_task_to_local_dsq(). All callers pass 0. No functional change. This prepares for SCX_ENQ_IMMED support on the consume path. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-13 09:43:23 -10:00
Tejun Heo	98d709cba3	sched_ext: Implement SCX_ENQ_IMMED Add SCX_ENQ_IMMED enqueue flag for local DSQ insertions. Once a task is dispatched with IMMED, it either gets on the CPU immediately and stays on it, or gets reenqueued back to the BPF scheduler. It will never linger on a local DSQ behind other tasks or on a CPU taken by a higher-priority class. rq_is_open() uses rq->next_class to determine whether the rq is available, and wakeup_preempt_scx() triggers reenqueue when a higher-priority class task arrives. These capture all higher class preemptions. Combined with reenqueue points in the dispatch path, all cases where an IMMED task would not execute immediately are covered. SCX_TASK_IMMED persists in p->scx.flags until the next fresh enqueue, so the guarantee survives SAVE/RESTORE cycles. If preempted while running, put_prev_task_scx() reenqueues through ops.enqueue() with SCX_TASK_REENQ_PREEMPTED instead of silently placing the task back on the local DSQ. This enables tighter scheduling latency control by preventing tasks from piling up on local DSQs. It also enables opportunistic CPU sharing across sub-schedulers - without this, a sub-scheduler can stuff the local DSQ of a shared CPU, making it difficult for others to use. v2: - Rewrite is_curr_done() as rq_is_open() using rq->next_class and implement wakeup_preempt_scx() to achieve complete coverage of all cases where IMMED tasks could get stranded. - Track IMMED persistently in p->scx.flags and reenqueue preempted-while-running tasks through ops.enqueue(). - Bound deferred reenq cycles (SCX_REENQ_LOCAL_MAX_REPEAT). - Misc renames, documentation. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-13 09:43:22 -10:00
Tejun Heo	b5b38761b4	sched_ext: Add scx_vet_enq_flags() and plumb dsq_id into preamble Add scx_vet_enq_flags() stub and call it from scx_dsq_insert_preamble() and scx_dsq_move(). Pass dsq_id into preamble so the vetting function can validate flag and DSQ combinations. No functional change. This prepares for SCX_ENQ_IMMED which will populate the vetting function. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-13 09:43:22 -10:00
Tejun Heo	f1c1dd9cc1	sched_ext: Split task_should_reenq() into local and user variants Split task_should_reenq() into local_task_should_reenq() and user_task_should_reenq(). The local variant takes reenq_flags by pointer. No functional change. This prepares for SCX_ENQ_IMMED which will add IMMED-specific logic to the local variant. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-13 09:43:22 -10:00
Tejun Heo	6b4576b097	sched_ext: Reject sub-sched attachment to a disabled parent scx_claim_exit() propagates exits to descendants under scx_sched_lock. A sub-sched being attached concurrently could be missed if it links after the propagation. Check the parent's exit_kind in scx_link_sched() under scx_sched_lock to interlock against scx_claim_exit() - either the parent sees the child in its iteration or the child sees the parent's non-NONE exit_kind and fails attachment. Fixes: `ebeca1f930` ("sched_ext: Introduce cgroup sub-sched support") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-10 07:12:21 -10:00
Tejun Heo	6b36c4c293	sched_ext: Fix scx_sched_lock / rq lock ordering There are two sites that nest rq lock inside scx_sched_lock: - scx_bypass() takes scx_sched_lock then rq lock per CPU to propagate per-cpu bypass flags and re-enqueue tasks. - sysrq_handle_sched_ext_dump() takes scx_sched_lock to iterate all scheds, scx_dump_state() then takes rq lock per CPU for dump. And scx_claim_exit() takes scx_sched_lock to propagate exits to descendants. It can be reached from scx_tick(), BPF kfuncs, and many other paths with rq lock already held, creating the reverse ordering: rq lock -> scx_sched_lock vs. scx_sched_lock -> rq lock Fix by flipping scx_bypass() to take rq lock first, and dropping scx_sched_lock from sysrq_handle_sched_ext_dump() as scx_sched_all is already RCU-traversable and scx_dump_lock now prevents dumping a dead sched. This makes the consistent ordering rq lock -> scx_sched_lock. Reported-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Link: https://lore.kernel.org/r/20260309163025.2240221-1-yphbchou0911@gmail.com Fixes: `ebeca1f930` ("sched_ext: Introduce cgroup sub-sched support") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-10 07:12:21 -10:00
Tejun Heo	f4a6c506d1	sched_ext: Always bounce scx_disable() through irq_work scx_disable() directly called kthread_queue_work() which can acquire worker->lock, pi_lock and rq->__lock. This made scx_disable() unsafe to call while holding locks that conflict with this chain - in particular, scx_claim_exit() calls scx_disable() for each descendant while holding scx_sched_lock, which nests inside rq->__lock in scx_bypass(). The error path (scx_vexit()) was already bouncing through irq_work to avoid this issue. Generalize the pattern to all scx_disable() calls by always going through irq_work. irq_work_queue() is lockless and safe to call from any context, and the actual kthread_queue_work() call happens in the irq_work handler outside any locks. Rename error_irq_work to disable_irq_work to reflect the broader usage. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-10 07:12:21 -10:00
Tejun Heo	b5bc043505	sched_ext: Add scx_dump_lock and dump_disabled Add a dedicated scx_dump_lock and per-sched dump_disabled flag so that debug dumping can be safely disabled during sched teardown without relying on scx_sched_lock. This is a prep for the next patch which decouples the sysrq dump path from scx_sched_lock to resolve a lock ordering issue. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-10 07:12:21 -10:00
Tejun Heo	7e92cf4354	sched_ext: Fix sub_detach op check to test the parent's ops sub_detach is the parent's op called to notify the parent that a child is detaching. Test parent->ops.sub_detach instead of sch->ops.sub_detach. Fixes: `ebeca1f930` ("sched_ext: Introduce cgroup sub-sched support") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-10 07:12:21 -10:00
Philipp Hahn	9805933538	sched: Prefer IS_ERR_OR_NULL over manual NULL check Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL check. Change generated with coccinelle. Signed-off-by: Philipp Hahn <phahn-oss@avm.de> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-10 06:58:37 -10:00
Tejun Heo	b884094264	sched_ext: Replace system_unbound_wq with system_dfl_wq in scx_kobj_release() `c2a57380df` ("sched: Replace use of system_unbound_wq with system_dfl_wq") converted system_unbound_wq usages in ext.c but missed the queue_rcu_work() call in scx_kobj_release() which was added later by the dynamic scx_sched allocation conversion. Apply the same conversion. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Marco Crivellari <marco.crivellari@suse.com>	2026-03-09 10:06:34 -10:00
Tejun Heo	0e7cd9cef6	Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-7.1 Pull sched/core to resolve conflicts between: `c2a57380df` ("sched: Replace use of system_unbound_wq with system_dfl_wq") from the tip tree and commit: `cde94c032b` ("sched_ext: Make watchdog sub-sched aware") The latter moves around code modified by the former. Apply the changes in the new locations. Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-09 09:59:36 -10:00
Zhao Mengmeng	bec10581e9	sched_ext: remove SCX_OPS_HAS_CGROUP_WEIGHT While running scx_flatcg, dmesg prints "SCX_OPS_HAS_CGROUP_WEIGHT is deprecated and a noop", in code, SCX_OPS_HAS_CGROUP_WEIGHT has been marked as DEPRECATED, and will be removed on 6.18. Now it's time to do it. Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-09 09:45:18 -10:00
Tejun Heo	6af9b39135	Merge branch 'for-7.0-fixes' into for-7.1	2026-03-09 06:19:12 -10:00
zhidao su	2fcfe5951e	sched_ext: Use WRITE_ONCE() for the write side of scx_enable helper pointer scx_enable() uses double-checked locking to lazily initialize a static kthread_worker pointer. The fast path reads helper locklessly, without helper_mutex: `if (!READ_ONCE(helper)) {`. The write side initializes helper under helper_mutex, but previously used a plain assignment: `helper = kthread_run_worker(0, "scx_enable_helper");`. That plain write is a KCSAN data race with the READ_ONCE() above. Since READ_ONCE() on the fast path and the plain write on the initialization path access the same variable without a common lock, they constitute a data race. KCSAN requires that all sides of a lock-free access use READ_ONCE()/WRITE_ONCE() consistently. Use a temporary variable to stage the result of kthread_run_worker(), and only WRITE_ONCE() into helper after confirming the pointer is valid. This avoids a window where a concurrent caller on the fast path could observe an ERR pointer via READ_ONCE(helper) before the error check completes. Fixes: `b06ccbabe2` ("sched_ext: Fix starvation of scx_enable() under fair-class saturation") Signed-off-by: zhidao su <suzhidao@xiaomi.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-09 06:08:26 -10:00
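
The stage-then-publish pattern described in `2fcfe5951e` can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: C11 atomics stand in for READ_ONCE()/WRITE_ONCE(), a pthread mutex stands in for helper_mutex, and `struct worker`, `create_worker()` and `get_helper()` are hypothetical names. The hypothetical creator reports failure with NULL, whereas the kernel call returns an ERR pointer.

```c
#include <assert.h>    /* only needed for the usage checks described below */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the lazily created kthread_worker. */
struct worker { int id; };

static struct worker the_worker = { .id = 1 };

/* Published pointer: stays NULL until a valid worker has been stored. */
static _Atomic(struct worker *) helper;
static pthread_mutex_t helper_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical creator: returns NULL when asked to fail. */
static struct worker *create_worker(int fail)
{
    return fail ? NULL : &the_worker;
}

struct worker *get_helper(int fail)
{
    /* Fast path: lockless read of the published pointer. */
    struct worker *w = atomic_load_explicit(&helper, memory_order_acquire);

    if (w)
        return w;

    pthread_mutex_lock(&helper_mutex);
    /* Re-check under the lock: another caller may have published already. */
    w = atomic_load_explicit(&helper, memory_order_relaxed);
    if (!w) {
        /* Stage the result in a local so readers never see a failed value. */
        struct worker *tmp = create_worker(fail);

        if (tmp) {
            /* Publish only after the result is known to be valid. */
            atomic_store_explicit(&helper, tmp, memory_order_release);
            w = tmp;
        }
    }
    pthread_mutex_unlock(&helper_mutex);
    return w;
}
```

In this sketch a failed creation leaves `helper` NULL, so `get_helper(1)` returns NULL and a later `get_helper(0)` retries and publishes the worker. Once published, every call returns the same pointer regardless of the argument.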

51100 Commits