linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-17 01:23:58 -04:00

Author	SHA1	Message	Date
Ke Zhao	068014daad	tools/sched_ext: Update stale scx_ops_error() comment in fcg_cgroup_move() The function scx_ops_error() was dropped, but the comment here is left pointing to the old name. Update to be consistent with current API. Signed-off-by: Ke Zhao <ke.zhao.kernel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-21 08:35:56 -10:00
zhidao su	818dbedd04	selftests/sched_ext: Return non-zero exit code on test failure runner.c always returned 0 regardless of test results. The kselftest framework (tools/testing/selftests/kselftest/runner.sh) invokes the runner binary and treats a non-zero exit code as a test failure; with the old code, failed sched_ext tests were silently hidden from the parent harness even though individual "not ok" TAP lines were emitted. Return 1 when at least one test failed, 0 when all tests passed or were skipped. Signed-off-by: zhidao su <suzhidao@xiaomi.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-21 08:35:51 -10:00
zhidao su	7e226f036a	sched_ext: Documentation: Document events sysfs file and module parameters Two categories of sched_ext diagnostics are currently undocumented: 1. Per-scheduler events sysfs file Each active BPF scheduler exposes a set of diagnostic counters at /sys/kernel/sched_ext/<name>/events. These counters are defined (with detailed comments) in kernel/sched/ext_internal.h but have no corresponding documentation in sched-ext.rst. BPF scheduler developers must read kernel source to understand what each counter means. Add a description of the events file, an example of its output, and a brief explanation of every counter. 2. Module parameters kernel/sched/ext.c registers two parameters under the sched_ext. prefix (slice_bypass_us, bypass_lb_intv_us) via module_param_cb() with MODULE_PARM_DESC() strings, but sched-ext.rst makes no mention of them. Users who need to tune bypass-mode behavior have no in-tree documentation to consult. Add a "Module Parameters" section documenting both knobs: their default values, valid ranges (taken from the set_*() validators in ext.c), and the note from the source that they are primarily for debugging. No functional changes. Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-21 08:32:41 -10:00
Andrea Righi	2197cecdb0	sched_ext: idle: Prioritize idle SMT sibling In the default built-in idle CPU selection policy, when @prev_cpu is busy and no fully idle core is available, try to place the task on its SMT sibling if that sibling is idle, before searching any other idle CPU in the same LLC. Migration to the sibling is cheap and keeps the task on the same core, preserving L1 cache and reducing wakeup latency. On large SMT systems this appears to consistently boost throughput by roughly 2-3% on CPU-bound workloads (running a number of tasks equal to the number of SMT cores). Cc: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-21 08:31:16 -10:00
Cheng-Yang Chou	f6689792ff	selftests/sched_ext: Show failed test names in summary When tests fail, the runner only printed the failure count, making it hard to tell which tests failed without scrolling through output. Track failed test names in an array and print them after the summary so failures are immediately visible at the end of the run. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-17 07:46:57 -10:00
zhidao su	2e5e5b3738	sched_ext: Fix typos in comments Fix five typos across three files: - kernel/sched/ext.c: 'monotically' -> 'monotonically' (line 55) - kernel/sched/ext.c: 'used by to check' -> 'used to check' (line 56) - kernel/sched/ext.c: 'hardlockdup' -> 'hardlockup' (line 3881) - kernel/sched/ext_idle.c: 'don't perfectly overlaps' -> 'don't perfectly overlap' (line 371) - tools/sched_ext/scx_flatcg.bpf.c: 'shaer' -> 'share' (line 21) Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-17 07:46:36 -10:00
Cheng-Yang Chou	2008fb2573	sched_ext: Fix slab-out-of-bounds in scx_alloc_and_add_sched() ancestors[] is a flexible array member that needs level + 1 slots to hold all ancestors including self (indices 0..level), but kzalloc_flex() only allocates `level` slots: sch = kzalloc_flex(sch, ancestors, level); ... sch->ancestors[level] = sch; /* one past the end */ For the root scheduler (level = 0), zero slots are allocated and ancestors[0] is written immediately past the end of the object. KASAN reports: BUG: KASAN: slab-out-of-bounds in scx_alloc_and_add_sched+0x1c17/0x1d10 Write of size 8 at addr ffff888066b56538 by task scx_enable_help/667 The buggy address is located 0 bytes to the right of allocated 1336-byte region [ffff888066b56000, ffff888066b56538) Fix by passing level + 1 to kzalloc_flex(). Tested with vng + scx_lavd, KASAN no longer triggers. Fixes: `ebeca1f930` ("sched_ext: Introduce cgroup sub-sched support") Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-16 07:55:50 -10:00
Tejun Heo	618a9db015	sched_ext: Use kobject_put() for kobject_init_and_add() failure in scx_alloc_and_add_sched() kobject_init_and_add() failure requires kobject_put() for proper cleanup, but the error paths were using kfree(sch) possibly leaking the kobject name. The kset_create_and_add() failure was already using kobject_put() correctly. Switch the kobject_init_and_add() error paths to use kobject_put(). As the release path puts the cgroup ref, make scx_alloc_and_add_sched() always consume @cgrp via a new err_put_cgrp label at the bottom of the error chain and update scx_sub_enable_workfn() accordingly. Fixes: `17108735b4` ("sched_ext: Use dynamic allocation for scx_sched") Reported-by: David Carlier <devnexen@gmail.com> Link: https://lore.kernel.org/r/20260314134457.46216-1-devnexen@gmail.com Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-15 23:27:04 -10:00
Tejun Heo	0c66b0da00	sched_ext: Fix cgroup double-put on sub-sched abort path The abort path in scx_sub_enable_workfn() fell through to out_put_cgrp, double-putting the cgroup ref already owned by sch->cgrp. It also skipped kthread_flush_work() needed to flush the disable path. Relocate the abort block above err_unlock_and_disable so it falls through to err_disable. Fixes: `337ec00b1d` ("sched_ext: Implement cgroup sub-sched enabling and disabling") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-15 23:26:35 -10:00
Cheng-Yang Chou	f96bc0fa92	sched_ext: Update selftests to drop ops.cpu_acquire/release() ops.cpu_acquire/release() are deprecated by commit `a3f5d48222` ("sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere") in favor of handling CPU preemption via the sched_switch tracepoint. In the maximal selftest, replace the cpu_acquire/release stubs with a minimal sched_switch TP program. Attach all non-struct_ops programs (including the new TP) via maximal__attach() after disabling auto-attach for the maximal_ops struct_ops map, which is managed manually in run(). Apply the same fix to reload_loop, which also uses the maximal skeleton. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-14 22:54:05 -10:00
Cheng-Yang Chou	6712c4fefc	sched_ext: Update demo schedulers and selftests to use scx_bpf_task_set_dsq_vtime() Direct writes to p->scx.dsq_vtime are deprecated in favor of scx_bpf_task_set_dsq_vtime(). Update scx_simple, scx_flatcg, and select_cpu_vtime selftest to use the new kfunc with scale_by_task_weight_inverse(). Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-14 22:53:59 -10:00
Cheng-Yang Chou	c959218c65	sched_ext/selftests: Fix incorrect include guard comments Fix two mismatched closing comments in header include guards: - util.h: closing comment says __SCX_TEST_H__ but the guard is __SCX_TEST_UTIL_H__ - exit_test.h: closing comment has a spurious '#' character before the guard name Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-13 23:01:06 -10:00
Cheng-Yang Chou	e36bc38ebf	sched_ext: Fix uninitialized ret in scx_alloc_and_add_sched() Under CONFIG_EXT_SUB_SCHED, the kzalloc() and kstrdup() failure paths jump to err_stop_helper without first setting ret. The function then returns ERR_PTR(ret) with ret uninitialized, which can produce ERR_PTR(0) (NULL), causing the caller's IS_ERR() check to pass and leading to a NULL pointer dereference. Set ret = -ENOMEM before each goto to fix the error path. Fixes: `ebeca1f930` ("sched_ext: Introduce cgroup sub-sched support") Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-13 23:00:53 -10:00
Andrea Righi	12b49dd15e	selftests/sched_ext: Update scx_bpf_dsq_move_to_local() in kselftests After commit `860683763e` ("sched_ext: Add enq_flags to scx_bpf_dsq_move_to_local()") some of the kselftests are failing to build: exit.bpf.c:44:34: error: too few arguments provided to function-like macro invocation 44 | scx_bpf_dsq_move_to_local(DSQ_ID); Update the kselftests adding the new argument to scx_bpf_dsq_move_to_local(). Fixes: `860683763e` ("sched_ext: Add enq_flags to scx_bpf_dsq_move_to_local()") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-13 22:43:52 -10:00
Tejun Heo	238eba8c21	sched_ext: Use schedule_deferred_locked() in schedule_dsq_reenq() schedule_dsq_reenq() always uses schedule_deferred() which falls back to irq_work. However, callers like schedule_reenq_local() already hold the target rq lock, and scx_bpf_dsq_reenq() may hold it via the ops callback. Add a locked_rq parameter so schedule_dsq_reenq() can use schedule_deferred_locked() when the target rq is already held. The locked variant can use cheaper paths (balance callbacks, wakeup hooks) instead of always bouncing through irq_work. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-13 09:43:23 -10:00
Tejun Heo	3229ac4a5e	sched_ext: Add SCX_OPS_ALWAYS_ENQ_IMMED ops flag SCX_ENQ_IMMED makes enqueue to local DSQs succeed only if the task can start running immediately. Otherwise, the task is re-enqueued through ops.enqueue(). This provides tighter control but requires specifying the flag on every insertion. Add SCX_OPS_ALWAYS_ENQ_IMMED ops flag. When set, SCX_ENQ_IMMED is automatically applied to all local DSQ enqueues including through scx_bpf_dsq_move_to_local(). scx_qmap is updated with -I option to test the feature and -F option for IMMED stress testing which forces every Nth enqueue to a busy local DSQ. v2: - Cover scx_bpf_dsq_move_to_local() path (now has enq_flags via ___v2). - scx_qmap: Remove sched_switch and cpu_release handlers (superseded by kernel-side wakeup_preempt_scx()). Add -F for IMMED stress testing. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-13 09:43:23 -10:00
Tejun Heo	860683763e	sched_ext: Add enq_flags to scx_bpf_dsq_move_to_local() scx_bpf_dsq_move_to_local() moves a task from a non-local DSQ to the current CPU's local DSQ. This is an indirect way of dispatching to a local DSQ and should support enq_flags like direct dispatches do - e.g. SCX_ENQ_HEAD for head-of-queue insertion and SCX_ENQ_IMMED for immediate execution guarantees. Add scx_bpf_dsq_move_to_local___v2() with an enq_flags parameter. The original becomes a v1 compat wrapper passing 0. The compat macro is updated to a three-level chain: v2 (7.1+) -> v1 (current) -> scx_bpf_consume (pre-rename). All in-tree BPF schedulers are updated to pass 0. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-13 09:43:23 -10:00
Tejun Heo	da32a2986e	sched_ext: Plumb enq_flags through the consume path Add enq_flags parameter to consume_dispatch_q() and consume_remote_task(), passing it through to move_{local,remote}_task_to_local_dsq(). All callers pass 0. No functional change. This prepares for SCX_ENQ_IMMED support on the consume path. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-13 09:43:23 -10:00
Tejun Heo	98d709cba3	sched_ext: Implement SCX_ENQ_IMMED Add SCX_ENQ_IMMED enqueue flag for local DSQ insertions. Once a task is dispatched with IMMED, it either gets on the CPU immediately and stays on it, or gets reenqueued back to the BPF scheduler. It will never linger on a local DSQ behind other tasks or on a CPU taken by a higher-priority class. rq_is_open() uses rq->next_class to determine whether the rq is available, and wakeup_preempt_scx() triggers reenqueue when a higher-priority class task arrives. These capture all higher class preemptions. Combined with reenqueue points in the dispatch path, all cases where an IMMED task would not execute immediately are covered. SCX_TASK_IMMED persists in p->scx.flags until the next fresh enqueue, so the guarantee survives SAVE/RESTORE cycles. If preempted while running, put_prev_task_scx() reenqueues through ops.enqueue() with SCX_TASK_REENQ_PREEMPTED instead of silently placing the task back on the local DSQ. This enables tighter scheduling latency control by preventing tasks from piling up on local DSQs. It also enables opportunistic CPU sharing across sub-schedulers - without this, a sub-scheduler can stuff the local DSQ of a shared CPU, making it difficult for others to use. v2: - Rewrite is_curr_done() as rq_is_open() using rq->next_class and implement wakeup_preempt_scx() to achieve complete coverage of all cases where IMMED tasks could get stranded. - Track IMMED persistently in p->scx.flags and reenqueue preempted-while-running tasks through ops.enqueue(). - Bound deferred reenq cycles (SCX_REENQ_LOCAL_MAX_REPEAT). - Misc renames, documentation. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-13 09:43:22 -10:00
Tejun Heo	b5b38761b4	sched_ext: Add scx_vet_enq_flags() and plumb dsq_id into preamble Add scx_vet_enq_flags() stub and call it from scx_dsq_insert_preamble() and scx_dsq_move(). Pass dsq_id into preamble so the vetting function can validate flag and DSQ combinations. No functional change. This prepares for SCX_ENQ_IMMED which will populate the vetting function. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-13 09:43:22 -10:00
Tejun Heo	f1c1dd9cc1	sched_ext: Split task_should_reenq() into local and user variants Split task_should_reenq() into local_task_should_reenq() and user_task_should_reenq(). The local variant takes reenq_flags by pointer. No functional change. This prepares for SCX_ENQ_IMMED which will add IMMED-specific logic to the local variant. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-13 09:43:22 -10:00
David Carlier	1d02346fec	selftests/sched_ext: Add missing error check for exit__load() exit__load(skel) was called without checking its return value. Every other test in the suite wraps the load call with SCX_FAIL_IF(). Add the missing check to be consistent with the rest of the test suite. Fixes: `a5db7817af` ("sched_ext: Add selftests") Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-13 07:00:45 -10:00
Cheng-Yang Chou	bd377af097	sched_ext: Fix incomplete help text usage strings Several demo schedulers and the selftest runner had usage strings that omitted options which are actually supported: - scx_central: add missing [-v] - scx_pair: add missing [-v] - scx_qmap: add missing [-S] and [-H] - scx_userland: add missing [-v] - scx_sdt: remove [-f] which no longer exists - runner.c: add missing [-s], [-l], [-q]; drop [-h] which none of the other sched_ext tools list in their usage lines Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-11 11:02:57 -10:00
Tejun Heo	6b4576b097	sched_ext: Reject sub-sched attachment to a disabled parent scx_claim_exit() propagates exits to descendants under scx_sched_lock. A sub-sched being attached concurrently could be missed if it links after the propagation. Check the parent's exit_kind in scx_link_sched() under scx_sched_lock to interlock against scx_claim_exit() - either the parent sees the child in its iteration or the child sees the parent's non-NONE exit_kind and fails attachment. Fixes: `ebeca1f930` ("sched_ext: Introduce cgroup sub-sched support") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-10 07:12:21 -10:00
Tejun Heo	6b36c4c293	sched_ext: Fix scx_sched_lock / rq lock ordering There are two sites that nest rq lock inside scx_sched_lock: - scx_bypass() takes scx_sched_lock then rq lock per CPU to propagate per-cpu bypass flags and re-enqueue tasks. - sysrq_handle_sched_ext_dump() takes scx_sched_lock to iterate all scheds, scx_dump_state() then takes rq lock per CPU for dump. And scx_claim_exit() takes scx_sched_lock to propagate exits to descendants. It can be reached from scx_tick(), BPF kfuncs, and many other paths with rq lock already held, creating the reverse ordering: rq lock -> scx_sched_lock vs. scx_sched_lock -> rq lock Fix by flipping scx_bypass() to take rq lock first, and dropping scx_sched_lock from sysrq_handle_sched_ext_dump() as scx_sched_all is already RCU-traversable and scx_dump_lock now prevents dumping a dead sched. This makes the consistent ordering rq lock -> scx_sched_lock. Reported-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Link: https://lore.kernel.org/r/20260309163025.2240221-1-yphbchou0911@gmail.com Fixes: `ebeca1f930` ("sched_ext: Introduce cgroup sub-sched support") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-10 07:12:21 -10:00
Tejun Heo	f4a6c506d1	sched_ext: Always bounce scx_disable() through irq_work scx_disable() directly called kthread_queue_work() which can acquire worker->lock, pi_lock and rq->__lock. This made scx_disable() unsafe to call while holding locks that conflict with this chain - in particular, scx_claim_exit() calls scx_disable() for each descendant while holding scx_sched_lock, which nests inside rq->__lock in scx_bypass(). The error path (scx_vexit()) was already bouncing through irq_work to avoid this issue. Generalize the pattern to all scx_disable() calls by always going through irq_work. irq_work_queue() is lockless and safe to call from any context, and the actual kthread_queue_work() call happens in the irq_work handler outside any locks. Rename error_irq_work to disable_irq_work to reflect the broader usage. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-10 07:12:21 -10:00
Tejun Heo	b5bc043505	sched_ext: Add scx_dump_lock and dump_disabled Add a dedicated scx_dump_lock and per-sched dump_disabled flag so that debug dumping can be safely disabled during sched teardown without relying on scx_sched_lock. This is a prep for the next patch which decouples the sysrq dump path from scx_sched_lock to resolve a lock ordering issue. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-10 07:12:21 -10:00
Tejun Heo	7e92cf4354	sched_ext: Fix sub_detach op check to test the parent's ops sub_detach is the parent's op called to notify the parent that a child is detaching. Test parent->ops.sub_detach instead of sch->ops.sub_detach. Fixes: `ebeca1f930` ("sched_ext: Introduce cgroup sub-sched support") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-10 07:12:21 -10:00
Philipp Hahn	9805933538	sched: Prefer IS_ERR_OR_NULL over manual NULL check Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL check. Change generated with coccinelle. Signed-off-by: Philipp Hahn <phahn-oss@avm.de> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-10 06:58:37 -10:00
Tejun Heo	b884094264	sched_ext: Replace system_unbound_wq with system_dfl_wq in scx_kobj_release() `c2a57380df` ("sched: Replace use of system_unbound_wq with system_dfl_wq") converted system_unbound_wq usages in ext.c but missed the queue_rcu_work() call in scx_kobj_release() which was added later by the dynamic scx_sched allocation conversion. Apply the same conversion. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Marco Crivellari <marco.crivellari@suse.com>	2026-03-09 10:06:34 -10:00
Tejun Heo	0e7cd9cef6	Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-7.1 Pull sched/core to resolve conflicts between: `c2a57380df` ("sched: Replace use of system_unbound_wq with system_dfl_wq") from the tip tree and commit: `cde94c032b` ("sched_ext: Make watchdog sub-sched aware") The latter moves around code modified by the former. Apply the changes in the new locations. Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-09 09:59:36 -10:00
Zhao Mengmeng	bec10581e9	sched_ext: remove SCX_OPS_HAS_CGROUP_WEIGHT While running scx_flatcg, dmesg prints "SCX_OPS_HAS_CGROUP_WEIGHT is deprecated and a noop", in code, SCX_OPS_HAS_CGROUP_WEIGHT has been marked as DEPRECATED, and will be removed on 6.18. Now it's time to do it. Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-09 09:45:18 -10:00
Tejun Heo	6af9b39135	Merge branch 'for-7.0-fixes' into for-7.1	2026-03-09 06:19:12 -10:00
zhidao su	2fcfe5951e	sched_ext: Use WRITE_ONCE() for the write side of scx_enable helper pointer scx_enable() uses double-checked locking to lazily initialize a static kthread_worker pointer. The fast path reads helper locklessly: if (!READ_ONCE(helper)) { // lockless read -- no helper_mutex The write side initializes helper under helper_mutex, but previously used a plain assignment: helper = kthread_run_worker(0, "scx_enable_helper"); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ plain write -- KCSAN data race with READ_ONCE() above Since READ_ONCE() on the fast path and the plain write on the initialization path access the same variable without a common lock, they constitute a data race. KCSAN requires that all sides of a lock-free access use READ_ONCE()/WRITE_ONCE() consistently. Use a temporary variable to stage the result of kthread_run_worker(), and only WRITE_ONCE() into helper after confirming the pointer is valid. This avoids a window where a concurrent caller on the fast path could observe an ERR pointer via READ_ONCE(helper) before the error check completes. Fixes: `b06ccbabe2` ("sched_ext: Fix starvation of scx_enable() under fair-class saturation") Signed-off-by: zhidao su <suzhidao@xiaomi.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-09 06:08:26 -10:00
Tejun Heo	0a0d3b8dd0	tools/sched_ext/include: Regenerate enum_defs.autogen.h Regenerate enum_defs.autogen.h from the current vmlinux.h to pick up new SCX enums added in the for-7.1 cycle. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 22:45:12 -10:00
Tejun Heo	93ac9b150e	tools/sched_ext/include: Add libbpf version guard for assoc_struct_ops Extract the inline bpf_program__assoc_struct_ops() call in SCX_OPS_LOAD() into a __scx_ops_assoc_prog() helper and wrap it with a libbpf >= 1.7 version guard. bpf_program__assoc_struct_ops() was added in libbpf 1.7; the guard provides a no-op fallback for older versions. Add the <bpf/libbpf.h> include needed by the helper, and fix "assumming" typo in a nearby comment. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 22:45:12 -10:00
Tejun Heo	c9c8546cde	tools/sched_ext/include: Add __COMPAT_HAS_scx_bpf_select_cpu_and macro scx_bpf_select_cpu_and() is now an inline wrapper so bpf_ksym_exists(scx_bpf_select_cpu_and) no longer works. Add __COMPAT_HAS_scx_bpf_select_cpu_and macro that checks for either the struct args type (new) or the compat ksym (old) to test availability. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 22:45:12 -10:00
Tejun Heo	3691d380d5	tools/sched_ext/include: Add missing helpers to common.bpf.h Sync several helpers from the scx repo: - bpf_cgroup_acquire() ksym declaration - __sink() macro for hiding values from verifier precision tracking - ctzll() count-trailing-zeros implementation - get_prandom_u64() helper - scx_clock_task/pelt/virt/irq() clock helpers with get_current_rq() Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 22:45:12 -10:00
Tejun Heo	9c6437f7c2	tools/sched_ext/include: Sync bpf_arena_common.bpf.h with scx repo Sync the following changes from the scx repo: - Guard __arena define with #ifndef to avoid redefinition when the attribute is already defined by another header. - Add bpf_arena_reserve_pages() and bpf_arena_mapping_nr_pages() ksym declarations. - Rename TEST to SCX_BPF_UNITTEST to avoid collision with generic TEST macros in other projects. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 22:45:12 -10:00
Tejun Heo	c90af06c80	tools/sched_ext/include: Remove dead sdt_task_defs.h guard from common.h The __has_include guard for sdt_task_defs.h is vestigial — the only remaining content is the bpf_arena_common.h include which is available unconditionally. Remove the dead guard. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 22:45:12 -10:00
Tejun Heo	80a54b807d	Revert "sched_ext: Use READ_ONCE() for the read side of dsq->nr update" This reverts commit `9adfcef334`. dsq->nr is protected by dsq->lock and reading while holding the lock doesn't constitute a racy read. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: zhidao su <suzhidao@xiaomi.com>	2026-03-07 21:42:12 -10:00
Cheng-Yang Chou	28c4ef2b2e	sched_ext: Fix scx_bpf_reenqueue_local() silently reenqueuing nothing `ffa7ae0724` ("sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq()") introduced task_should_reenq() as a filter inside reenq_local(), requiring SCX_REENQ_ANY to be set in order to match any task. scx_bpf_dsq_reenq() handles this correctly by converting a bare reenq_flags=0 to SCX_REENQ_ANY, but scx_bpf_reenqueue_local() was not updated and continued to call reenq_local() with 0, causing it to silently reenqueue zero tasks. Fix by passing SCX_REENQ_ANY directly. Fixes: `ffa7ae0724` ("sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq()") Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-07 08:10:25 -10:00
Tejun Heo	ce897abc21	sched_ext: Add SCX_TASK_REENQ_REASON flags SCX_ENQ_REENQ indicates that a task is being re-enqueued but doesn't tell the BPF scheduler why. Add SCX_TASK_REENQ_REASON flags using bits 12-13 of p->scx.flags to communicate the reason during ops.enqueue(): - NONE: Not being reenqueued - KFUNC: Reenqueued by scx_bpf_dsq_reenq() and friends More reasons will be added. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 05:29:50 -10:00
Tejun Heo	7203d77d6e	sched_ext: Simplify task state handling Task states (NONE, INIT, READY, ENABLED) were defined in a separate enum with unshifted values and then shifted when stored in scx_entity.flags. Simplify by defining them as pre-shifted values directly in scx_ent_flags and removing the separate scx_task_state enum. This removes the need for shifting when reading/writing state values. scx_get_task_state() now returns the masked flags value directly. scx_set_task_state() accepts the pre-shifted state value. scx_dump_task() shifts down for display to maintain readable output. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 05:29:50 -10:00
Tejun Heo	a90449b126	sched_ext: Optimize schedule_dsq_reenq() with lockless fast path schedule_dsq_reenq() always acquires deferred_reenq_lock to queue a reenqueue request. Add a lockless fast-path to skip lock acquisition when the request is already pending with the required flags set. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 05:29:50 -10:00
Tejun Heo	84b1a0ea0b	sched_ext: Implement scx_bpf_dsq_reenq() for user DSQs scx_bpf_dsq_reenq() currently only supports local DSQs. Extend it to support user-defined DSQs by adding a deferred re-enqueue mechanism similar to the local DSQ handling. Add per-cpu deferred_reenq_user_node/flags to scx_dsq_pcpu and deferred_reenq_users list to scx_rq. When scx_bpf_dsq_reenq() is called on a user DSQ, the DSQ's per-cpu node is added to the current rq's deferred list. process_deferred_reenq_users() then iterates the DSQ using the cursor helpers and re-enqueues each task. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 05:29:50 -10:00
Tejun Heo	35250720d6	sched_ext: Factor out nldsq_cursor_next_task() and nldsq_cursor_lost_task() Factor out cursor-based DSQ iteration from bpf_iter_scx_dsq_next() into nldsq_cursor_next_task() and the task-lost check from scx_dsq_move() into nldsq_cursor_lost_task() to prepare for reuse. As ->priv is only used to record dsq->seq for cursors, update INIT_DSQ_LIST_CURSOR() to take the DSQ pointer and set ->priv from dsq->seq so that users don't have to read it manually. Move scx_dsq_iter_flags enum earlier so nldsq_cursor_next_task() can use SCX_DSQ_ITER_REV. bypass_lb_cpu() now sets cursor.priv to dsq->seq but doesn't use it. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 05:29:50 -10:00
Tejun Heo	30b0515342	sched_ext: Add per-CPU data to DSQs Add per-CPU data structure to dispatch queues. Each DSQ now has a percpu scx_dsq_pcpu which contains a back-pointer to the DSQ. This will be used by future changes to implement per-CPU reenqueue tracking for user DSQs. init_dsq() now allocates the percpu data and can fail, so it returns an error code. All callers are updated to handle failures. exit_dsq() is added to free the percpu data and is called from all DSQ cleanup paths. In scx_bpf_create_dsq(), init_dsq() is called before rcu_read_lock() since alloc_percpu() requires GFP_KERNEL context, and dsq->sched is set afterwards. v2: Fix err_free_pcpu to only exit_dsq() initialized bypass DSQs (Andrea Righi). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 05:29:50 -10:00
Tejun Heo	ffa7ae0724	sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq() Add infrastructure to pass flags through the deferred reenqueue path. reenq_local() now takes a reenq_flags parameter, and scx_sched_pcpu gains a deferred_reenq_local_flags field to accumulate flags from multiple scx_bpf_dsq_reenq() calls before processing. No flags are defined yet. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 05:29:49 -10:00
Tejun Heo	9c34c5074d	sched_ext: Introduce scx_bpf_dsq_reenq() for remote local DSQ reenqueue scx_bpf_reenqueue_local() can only trigger re-enqueue of the current CPU's local DSQ. Introduce scx_bpf_dsq_reenq() which takes a DSQ ID and can target any local DSQ including remote CPUs via SCX_DSQ_LOCAL_ON | cpu. This will be expanded to support user DSQs by future changes. scx_bpf_reenqueue_local() is reimplemented as a simple wrapper around scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, 0) and may be deprecated in the future. Update compat.bpf.h with a compatibility shim and scx_qmap to test the new functionality. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 05:29:49 -10:00

1427724 Commits