Commit Graph

1427724 Commits

Author SHA1 Message Date
Ke Zhao
068014daad tools/sched_ext: Update stale scx_ops_error() comment in fcg_cgroup_move()
The function scx_ops_error() was dropped, but the
comment here is left pointing to the old name.
Update to be consistent with current API.

Signed-off-by: Ke Zhao <ke.zhao.kernel@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-21 08:35:56 -10:00
zhidao su
818dbedd04 selftests/sched_ext: Return non-zero exit code on test failure
runner.c always returned 0 regardless of test results.  The kselftest
framework (tools/testing/selftests/kselftest/runner.sh) invokes the runner
binary and treats a non-zero exit code as a test failure; with the old
code, failed sched_ext tests were silently hidden from the parent harness
even though individual "not ok" TAP lines were emitted.

Return 1 when at least one test failed, 0 when all tests passed or were
skipped.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-21 08:35:51 -10:00
zhidao su
7e226f036a sched_ext: Documentation: Document events sysfs file and module parameters
Two categories of sched_ext diagnostics are currently undocumented:

1. Per-scheduler events sysfs file
   Each active BPF scheduler exposes a set of diagnostic counters at
   /sys/kernel/sched_ext/<name>/events.  These counters are defined
   (with detailed comments) in kernel/sched/ext_internal.h but have
   no corresponding documentation in sched-ext.rst.  BPF scheduler
   developers must read kernel source to understand what each counter
   means.

   Add a description of the events file, an example of its output, and
   a brief explanation of every counter.

2. Module parameters
   kernel/sched/ext.c registers two parameters under the sched_ext.
   prefix (slice_bypass_us, bypass_lb_intv_us) via module_param_cb()
   with MODULE_PARM_DESC() strings, but sched-ext.rst makes no mention
   of them.  Users who need to tune bypass-mode behavior have no
   in-tree documentation to consult.

   Add a "Module Parameters" section documenting both knobs: their
   default values, valid ranges (taken from the set_*() validators in
   ext.c), and the note from the source that they are primarily for
   debugging.

No functional changes.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-21 08:32:41 -10:00
Andrea Righi
2197cecdb0 sched_ext: idle: Prioritize idle SMT sibling
In the default built-in idle CPU selection policy, when @prev_cpu is
busy and no fully idle core is available, try to place the task on its
SMT sibling if that sibling is idle, before searching any other idle CPU
in the same LLC.

Migration to the sibling is cheap and keeps the task on the same core,
preserving L1 cache and reducing wakeup latency.

On large SMT systems this appears to consistently boost throughput by
roughly 2-3% on CPU-bound workloads (running a number of tasks equal to
the number of SMT cores).

Cc: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-21 08:31:16 -10:00
Cheng-Yang Chou
f6689792ff selftests/sched_ext: Show failed test names in summary
When tests fail, the runner only printed the failure count, making
it hard to tell which tests failed without scrolling through output.

Track failed test names in an array and print them after the summary
so failures are immediately visible at the end of the run.

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-17 07:46:57 -10:00
zhidao su
2e5e5b3738 sched_ext: Fix typos in comments
Fix five typos across three files:

- kernel/sched/ext.c: 'monotically' -> 'monotonically' (line 55)
- kernel/sched/ext.c: 'used by to check' -> 'used to check' (line 56)
- kernel/sched/ext.c: 'hardlockdup' -> 'hardlockup' (line 3881)
- kernel/sched/ext_idle.c: 'don't perfectly overlaps' ->
  'don't perfectly overlap' (line 371)
- tools/sched_ext/scx_flatcg.bpf.c: 'shaer' -> 'share' (line 21)

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-17 07:46:36 -10:00
Cheng-Yang Chou
2008fb2573 sched_ext: Fix slab-out-of-bounds in scx_alloc_and_add_sched()
ancestors[] is a flexible array member that needs level + 1 slots to
hold all ancestors including self (indices 0..level), but kzalloc_flex()
only allocates `level` slots:

  sch = kzalloc_flex(*sch, ancestors, level);
  ...
  sch->ancestors[level] = sch;  /* one past the end */

For the root scheduler (level = 0), zero slots are allocated and
ancestors[0] is written immediately past the end of the object.

KASAN reports:

  BUG: KASAN: slab-out-of-bounds in scx_alloc_and_add_sched+0x1c17/0x1d10
  Write of size 8 at addr ffff888066b56538 by task scx_enable_help/667

  The buggy address is located 0 bytes to the right of
   allocated 1336-byte region [ffff888066b56000, ffff888066b56538)

Fix by passing level + 1 to kzalloc_flex().

Tested with vng + scx_lavd, KASAN no longer triggers.

Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-16 07:55:50 -10:00
Tejun Heo
618a9db015 sched_ext: Use kobject_put() for kobject_init_and_add() failure in scx_alloc_and_add_sched()
kobject_init_and_add() failure requires kobject_put() for proper cleanup, but
the error paths were using kfree(sch) possibly leaking the kobject name. The
kset_create_and_add() failure was already using kobject_put() correctly.

Switch the kobject_init_and_add() error paths to use kobject_put(). As the
release path puts the cgroup ref, make scx_alloc_and_add_sched() always
consume @cgrp via a new err_put_cgrp label at the bottom of the error chain
and update scx_sub_enable_workfn() accordingly.

Fixes: 17108735b4 ("sched_ext: Use dynamic allocation for scx_sched")
Reported-by: David Carlier <devnexen@gmail.com>
Link: https://lore.kernel.org/r/20260314134457.46216-1-devnexen@gmail.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-15 23:27:04 -10:00
Tejun Heo
0c66b0da00 sched_ext: Fix cgroup double-put on sub-sched abort path
The abort path in scx_sub_enable_workfn() fell through to out_put_cgrp,
double-putting the cgroup ref already owned by sch->cgrp. It also skipped
kthread_flush_work() needed to flush the disable path.

Relocate the abort block above err_unlock_and_disable so it falls through to
err_disable.

Fixes: 337ec00b1d ("sched_ext: Implement cgroup sub-sched enabling and disabling")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-15 23:26:35 -10:00
Cheng-Yang Chou
f96bc0fa92 sched_ext: Update selftests to drop ops.cpu_acquire/release()
ops.cpu_acquire/release() are deprecated by commit a3f5d48222
("sched_ext: Allow scx_bpf_reenqueue_local() to be called from
anywhere") in favor of handling CPU preemption via the sched_switch
tracepoint.

In the maximal selftest, replace the cpu_acquire/release stubs with a
minimal sched_switch TP program. Attach all non-struct_ops programs
(including the new TP) via maximal__attach() after disabling auto-attach
for the maximal_ops struct_ops map, which is managed manually in run().

Apply the same fix to reload_loop, which also uses the maximal skeleton.

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-14 22:54:05 -10:00
Cheng-Yang Chou
6712c4fefc sched_ext: Update demo schedulers and selftests to use scx_bpf_task_set_dsq_vtime()
Direct writes to p->scx.dsq_vtime are deprecated in favor of
scx_bpf_task_set_dsq_vtime(). Update scx_simple, scx_flatcg, and
select_cpu_vtime selftest to use the new kfunc with
scale_by_task_weight_inverse().

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-14 22:53:59 -10:00
Cheng-Yang Chou
c959218c65 sched_ext/selftests: Fix incorrect include guard comments
Fix two mismatched closing comments in header include guards:

- util.h: closing comment says __SCX_TEST_H__ but the guard is
  __SCX_TEST_UTIL_H__
- exit_test.h: closing comment has a spurious '#' character before
  the guard name

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-13 23:01:06 -10:00
Cheng-Yang Chou
e36bc38ebf sched_ext: Fix uninitialized ret in scx_alloc_and_add_sched()
Under CONFIG_EXT_SUB_SCHED, the kzalloc() and kstrdup() failure
paths jump to err_stop_helper without first setting ret. The
function then returns ERR_PTR(ret) with ret uninitialized, which
can produce ERR_PTR(0) (NULL), causing the caller's IS_ERR() check
to pass and leading to a NULL pointer dereference.

Set ret = -ENOMEM before each goto to fix the error path.

Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-13 23:00:53 -10:00
Andrea Righi
12b49dd15e selftests/sched_ext: Update scx_bpf_dsq_move_to_local() in kselftests
After commit 860683763e ("sched_ext: Add enq_flags to
scx_bpf_dsq_move_to_local()") some of the kselftests are failing to
build:

 exit.bpf.c:44:34: error: too few arguments provided to function-like macro invocation
    44 |         scx_bpf_dsq_move_to_local(DSQ_ID);

Update the kselftests adding the new argument to
scx_bpf_dsq_move_to_local().

Fixes: 860683763e ("sched_ext: Add enq_flags to scx_bpf_dsq_move_to_local()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-13 22:43:52 -10:00
Tejun Heo
238eba8c21 sched_ext: Use schedule_deferred_locked() in schedule_dsq_reenq()
schedule_dsq_reenq() always uses schedule_deferred() which falls back to
irq_work. However, callers like schedule_reenq_local() already hold the
target rq lock, and scx_bpf_dsq_reenq() may hold it via the ops callback.

Add a locked_rq parameter so schedule_dsq_reenq() can use
schedule_deferred_locked() when the target rq is already held. The locked
variant can use cheaper paths (balance callbacks, wakeup hooks) instead of
always bouncing through irq_work.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-13 09:43:23 -10:00
Tejun Heo
3229ac4a5e sched_ext: Add SCX_OPS_ALWAYS_ENQ_IMMED ops flag
SCX_ENQ_IMMED makes enqueue to local DSQs succeed only if the task can start
running immediately. Otherwise, the task is re-enqueued through ops.enqueue().
This provides tighter control but requires specifying the flag on every
insertion.

Add SCX_OPS_ALWAYS_ENQ_IMMED ops flag. When set, SCX_ENQ_IMMED is
automatically applied to all local DSQ enqueues including through
scx_bpf_dsq_move_to_local().

scx_qmap is updated with -I option to test the feature and -F option for
IMMED stress testing which forces every Nth enqueue to a busy local DSQ.

v2: - Cover scx_bpf_dsq_move_to_local() path (now has enq_flags via ___v2).
    - scx_qmap: Remove sched_switch and cpu_release handlers (superseded by
      kernel-side wakeup_preempt_scx()). Add -F for IMMED stress testing.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-13 09:43:23 -10:00
Tejun Heo
860683763e sched_ext: Add enq_flags to scx_bpf_dsq_move_to_local()
scx_bpf_dsq_move_to_local() moves a task from a non-local DSQ to the
current CPU's local DSQ. This is an indirect way of dispatching to a local
DSQ and should support enq_flags like direct dispatches do - e.g.
SCX_ENQ_HEAD for head-of-queue insertion and SCX_ENQ_IMMED for immediate
execution guarantees.

Add scx_bpf_dsq_move_to_local___v2() with an enq_flags parameter. The
original becomes a v1 compat wrapper passing 0. The compat macro is updated
to a three-level chain: v2 (7.1+) -> v1 (current) -> scx_bpf_consume
(pre-rename). All in-tree BPF schedulers are updated to pass 0.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-13 09:43:23 -10:00
Tejun Heo
da32a2986e sched_ext: Plumb enq_flags through the consume path
Add enq_flags parameter to consume_dispatch_q() and consume_remote_task(),
passing it through to move_{local,remote}_task_to_local_dsq(). All callers
pass 0.

No functional change. This prepares for SCX_ENQ_IMMED support on the consume
path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-13 09:43:23 -10:00
Tejun Heo
98d709cba3 sched_ext: Implement SCX_ENQ_IMMED
Add SCX_ENQ_IMMED enqueue flag for local DSQ insertions. Once a task is
dispatched with IMMED, it either gets on the CPU immediately and stays on it,
or gets reenqueued back to the BPF scheduler. It will never linger on a local
DSQ behind other tasks or on a CPU taken by a higher-priority class.

rq_is_open() uses rq->next_class to determine whether the rq is available,
and wakeup_preempt_scx() triggers reenqueue when a higher-priority class task
arrives. These capture all higher class preemptions. Combined with reenqueue
points in the dispatch path, all cases where an IMMED task would not execute
immediately are covered.

SCX_TASK_IMMED persists in p->scx.flags until the next fresh enqueue, so the
guarantee survives SAVE/RESTORE cycles. If preempted while running,
put_prev_task_scx() reenqueues through ops.enqueue() with
SCX_TASK_REENQ_PREEMPTED instead of silently placing the task back on the
local DSQ.

This enables tighter scheduling latency control by preventing tasks from
piling up on local DSQs. It also enables opportunistic CPU sharing across
sub-schedulers - without this, a sub-scheduler can stuff the local DSQ of a
shared CPU, making it difficult for others to use.

v2: - Rewrite is_curr_done() as rq_is_open() using rq->next_class and
      implement wakeup_preempt_scx() to achieve complete coverage of all
      cases where IMMED tasks could get stranded.
    - Track IMMED persistently in p->scx.flags and reenqueue
      preempted-while-running tasks through ops.enqueue().
    - Bound deferred reenq cycles (SCX_REENQ_LOCAL_MAX_REPEAT).
    - Misc renames, documentation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-13 09:43:22 -10:00
Tejun Heo
b5b38761b4 sched_ext: Add scx_vet_enq_flags() and plumb dsq_id into preamble
Add scx_vet_enq_flags() stub and call it from scx_dsq_insert_preamble() and
scx_dsq_move(). Pass dsq_id into preamble so the vetting function can
validate flag and DSQ combinations.

No functional change. This prepares for SCX_ENQ_IMMED which will populate the
vetting function.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-13 09:43:22 -10:00
Tejun Heo
f1c1dd9cc1 sched_ext: Split task_should_reenq() into local and user variants
Split task_should_reenq() into local_task_should_reenq() and
user_task_should_reenq(). The local variant takes reenq_flags by pointer.

No functional change. This prepares for SCX_ENQ_IMMED which will add
IMMED-specific logic to the local variant.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-13 09:43:22 -10:00
David Carlier
1d02346fec selftests/sched_ext: Add missing error check for exit__load()
exit__load(skel) was called without checking its return value.
Every other test in the suite wraps the load call with
SCX_FAIL_IF(). Add the missing check to be consistent with the
rest of the test suite.

Fixes: a5db7817af ("sched_ext: Add selftests")
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-13 07:00:45 -10:00
Cheng-Yang Chou
bd377af097 sched_ext: Fix incomplete help text usage strings
Several demo schedulers and the selftest runner had usage strings
that omitted options which are actually supported:

- scx_central: add missing [-v]
- scx_pair: add missing [-v]
- scx_qmap: add missing [-S] and [-H]
- scx_userland: add missing [-v]
- scx_sdt: remove [-f] which no longer exists
- runner.c: add missing [-s], [-l], [-q]; drop [-h] which none of the
  other sched_ext tools list in their usage lines

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-11 11:02:57 -10:00
Tejun Heo
6b4576b097 sched_ext: Reject sub-sched attachment to a disabled parent
scx_claim_exit() propagates exits to descendants under scx_sched_lock.
A sub-sched being attached concurrently could be missed if it links
after the propagation. Check the parent's exit_kind in scx_link_sched()
under scx_sched_lock to interlock against scx_claim_exit() - either the
parent sees the child in its iteration or the child sees the parent's
non-NONE exit_kind and fails attachment.

Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-10 07:12:21 -10:00
Tejun Heo
6b36c4c293 sched_ext: Fix scx_sched_lock / rq lock ordering
There are two sites that nest rq lock inside scx_sched_lock:

- scx_bypass() takes scx_sched_lock then rq lock per CPU to propagate
  per-cpu bypass flags and re-enqueue tasks.

- sysrq_handle_sched_ext_dump() takes scx_sched_lock to iterate all
  scheds, scx_dump_state() then takes rq lock per CPU for dump.

And scx_claim_exit() takes scx_sched_lock to propagate exits to
descendants. It can be reached from scx_tick(), BPF kfuncs, and many
other paths with rq lock already held, creating the reverse ordering:

  rq lock -> scx_sched_lock vs. scx_sched_lock -> rq lock

Fix by flipping scx_bypass() to take rq lock first, and dropping
scx_sched_lock from sysrq_handle_sched_ext_dump() as scx_sched_all is
already RCU-traversable and scx_dump_lock now prevents dumping a dead
sched. This makes the consistent ordering rq lock -> scx_sched_lock.

Reported-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Link: https://lore.kernel.org/r/20260309163025.2240221-1-yphbchou0911@gmail.com
Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-10 07:12:21 -10:00
Tejun Heo
f4a6c506d1 sched_ext: Always bounce scx_disable() through irq_work
scx_disable() directly called kthread_queue_work() which can acquire
worker->lock, pi_lock and rq->__lock. This made scx_disable() unsafe to
call while holding locks that conflict with this chain - in particular,
scx_claim_exit() calls scx_disable() for each descendant while holding
scx_sched_lock, which nests inside rq->__lock in scx_bypass().

The error path (scx_vexit()) was already bouncing through irq_work to
avoid this issue. Generalize the pattern to all scx_disable() calls by
always going through irq_work. irq_work_queue() is lockless and safe to
call from any context, and the actual kthread_queue_work() call happens
in the irq_work handler outside any locks.

Rename error_irq_work to disable_irq_work to reflect the broader usage.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-10 07:12:21 -10:00
Tejun Heo
b5bc043505 sched_ext: Add scx_dump_lock and dump_disabled
Add a dedicated scx_dump_lock and per-sched dump_disabled flag so that
debug dumping can be safely disabled during sched teardown without
relying on scx_sched_lock. This is a prep for the next patch which
decouples the sysrq dump path from scx_sched_lock to resolve a lock
ordering issue.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-10 07:12:21 -10:00
Tejun Heo
7e92cf4354 sched_ext: Fix sub_detach op check to test the parent's ops
sub_detach is the parent's op called to notify the parent that a child
is detaching. Test parent->ops.sub_detach instead of sch->ops.sub_detach.

Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-10 07:12:21 -10:00
Philipp Hahn
9805933538 sched: Prefer IS_ERR_OR_NULL over manual NULL check
Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL
check.

Change generated with coccinelle.

Signed-off-by: Philipp Hahn <phahn-oss@avm.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-10 06:58:37 -10:00
Tejun Heo
b884094264 sched_ext: Replace system_unbound_wq with system_dfl_wq in scx_kobj_release()
c2a57380df ("sched: Replace use of system_unbound_wq with system_dfl_wq")
converted system_unbound_wq usages in ext.c but missed the queue_rcu_work()
call in scx_kobj_release() which was added later by the dynamic scx_sched
allocation conversion. Apply the same conversion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
2026-03-09 10:06:34 -10:00
Tejun Heo
0e7cd9cef6 Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-7.1
Pull sched/core to resolve conflicts between:

  c2a57380df ("sched: Replace use of system_unbound_wq with system_dfl_wq")

from the tip tree and commit:

  cde94c032b ("sched_ext: Make watchdog sub-sched aware")

The latter moves around code modiefied by the former. Apply the changes in
the new locations.

Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-09 09:59:36 -10:00
Zhao Mengmeng
bec10581e9 sched_ext: remove SCX_OPS_HAS_CGROUP_WEIGHT
While running scx_flatcg, dmesg prints "SCX_OPS_HAS_CGROUP_WEIGHT is
deprecated and a noop", in code, SCX_OPS_HAS_CGROUP_WEIGHT has been
marked as DEPRECATED, and will be removed on 6.18. Now it's time to do it.

Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-09 09:45:18 -10:00
Tejun Heo
6af9b39135 Merge branch 'for-7.0-fixes' into for-7.1 2026-03-09 06:19:12 -10:00
zhidao su
2fcfe5951e sched_ext: Use WRITE_ONCE() for the write side of scx_enable helper pointer
scx_enable() uses double-checked locking to lazily initialize a static
kthread_worker pointer. The fast path reads helper locklessly:

    if (!READ_ONCE(helper)) {          // lockless read -- no helper_mutex

The write side initializes helper under helper_mutex, but previously
used a plain assignment:

        helper = kthread_run_worker(0, "scx_enable_helper");
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                 plain write -- KCSAN data race with READ_ONCE() above

Since READ_ONCE() on the fast path and the plain write on the
initialization path access the same variable without a common lock,
they constitute a data race. KCSAN requires that all sides of a
lock-free access use READ_ONCE()/WRITE_ONCE() consistently.

Use a temporary variable to stage the result of kthread_run_worker(),
and only WRITE_ONCE() into helper after confirming the pointer is
valid. This avoids a window where a concurrent caller on the fast path
could observe an ERR pointer via READ_ONCE(helper) before the error
check completes.

Fixes: b06ccbabe2 ("sched_ext: Fix starvation of scx_enable() under fair-class saturation")
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-09 06:08:26 -10:00
Tejun Heo
0a0d3b8dd0 tools/sched_ext/include: Regenerate enum_defs.autogen.h
Regenerate enum_defs.autogen.h from the current vmlinux.h to pick up
new SCX enums added in the for-7.1 cycle.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 22:45:12 -10:00
Tejun Heo
93ac9b150e tools/sched_ext/include: Add libbpf version guard for assoc_struct_ops
Extract the inline bpf_program__assoc_struct_ops() call in SCX_OPS_LOAD()
into a __scx_ops_assoc_prog() helper and wrap it with a libbpf >= 1.7
version guard. bpf_program__assoc_struct_ops() was added in libbpf 1.7;
the guard provides a no-op fallback for older versions. Add the
<bpf/libbpf.h> include needed by the helper, and fix "assumming" typo in
a nearby comment.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 22:45:12 -10:00
Tejun Heo
c9c8546cde tools/sched_ext/include: Add __COMPAT_HAS_scx_bpf_select_cpu_and macro
scx_bpf_select_cpu_and() is now an inline wrapper so
bpf_ksym_exists(scx_bpf_select_cpu_and) no longer works. Add
__COMPAT_HAS_scx_bpf_select_cpu_and macro that checks for either the
struct args type (new) or the compat ksym (old) to test availability.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 22:45:12 -10:00
Tejun Heo
3691d380d5 tools/sched_ext/include: Add missing helpers to common.bpf.h
Sync several helpers from the scx repo:
- bpf_cgroup_acquire() ksym declaration
- __sink() macro for hiding values from verifier precision tracking
- ctzll() count-trailing-zeros implementation
- get_prandom_u64() helper
- scx_clock_task/pelt/virt/irq() clock helpers with get_current_rq()

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 22:45:12 -10:00
Tejun Heo
9c6437f7c2 tools/sched_ext/include: Sync bpf_arena_common.bpf.h with scx repo
Sync the following changes from the scx repo:

- Guard __arena define with #ifndef to avoid redefinition when the
  attribute is already defined by another header.
- Add bpf_arena_reserve_pages() and bpf_arena_mapping_nr_pages() ksym
  declarations.
- Rename TEST to SCX_BPF_UNITTEST to avoid collision with generic TEST
  macros in other projects.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 22:45:12 -10:00
Tejun Heo
c90af06c80 tools/sched_ext/include: Remove dead sdt_task_defs.h guard from common.h
The __has_include guard for sdt_task_defs.h is vestigial — the only
remaining content is the bpf_arena_common.h include which is available
unconditionally. Remove the dead guard.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 22:45:12 -10:00
Tejun Heo
80a54b807d Revert "sched_ext: Use READ_ONCE() for the read side of dsq->nr update"
This reverts commit 9adfcef334.

dsq->nr is protected by dsq->lock and reading while holding the lock doesn't
constitute a racy read.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: zhidao su <suzhidao@xiaomi.com>
2026-03-07 21:42:12 -10:00
Cheng-Yang Chou
28c4ef2b2e sched_ext: Fix scx_bpf_reenqueue_local() silently reenqueuing nothing
ffa7ae0724 ("sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq()")
introduced task_should_reenq() as a filter inside reenq_local(), requiring
SCX_REENQ_ANY to be set in order to match any task. scx_bpf_dsq_reenq()
handles this correctly by converting a bare reenq_flags=0 to SCX_REENQ_ANY,
but scx_bpf_reenqueue_local() was not updated and continued to call
reenq_local() with 0, causing it to silently reenqueue zero tasks.

Fix by passing SCX_REENQ_ANY directly.

Fixes: ffa7ae0724 ("sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq()")
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-07 08:10:25 -10:00
Tejun Heo
ce897abc21 sched_ext: Add SCX_TASK_REENQ_REASON flags
SCX_ENQ_REENQ indicates that a task is being re-enqueued but doesn't tell the
BPF scheduler why. Add SCX_TASK_REENQ_REASON flags using bits 12-13 of
p->scx.flags to communicate the reason during ops.enqueue():

- NONE: Not being reenqueued
- KFUNC: Reenqueued by scx_bpf_dsq_reenq() and friends

More reasons will be added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 05:29:50 -10:00
Tejun Heo
7203d77d6e sched_ext: Simplify task state handling
Task states (NONE, INIT, READY, ENABLED) were defined in a separate enum with
unshifted values and then shifted when stored in scx_entity.flags. Simplify by
defining them as pre-shifted values directly in scx_ent_flags and removing the
separate scx_task_state enum. This removes the need for shifting when
reading/writing state values.

scx_get_task_state() now returns the masked flags value directly.
scx_set_task_state() accepts the pre-shifted state value. scx_dump_task()
shifts down for display to maintain readable output.

No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 05:29:50 -10:00
Tejun Heo
a90449b126 sched_ext: Optimize schedule_dsq_reenq() with lockless fast path
schedule_dsq_reenq() always acquires deferred_reenq_lock to queue a reenqueue
request. Add a lockless fast-path to skip lock acquisition when the request is
already pending with the required flags set.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 05:29:50 -10:00
Tejun Heo
84b1a0ea0b sched_ext: Implement scx_bpf_dsq_reenq() for user DSQs
scx_bpf_dsq_reenq() currently only supports local DSQs. Extend it to support
user-defined DSQs by adding a deferred re-enqueue mechanism similar to the
local DSQ handling.

Add per-cpu deferred_reenq_user_node/flags to scx_dsq_pcpu and
deferred_reenq_users list to scx_rq. When scx_bpf_dsq_reenq() is called on a
user DSQ, the DSQ's per-cpu node is added to the current rq's deferred list.
process_deferred_reenq_users() then iterates the DSQ using the cursor helpers
and re-enqueues each task.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 05:29:50 -10:00
Tejun Heo
35250720d6 sched_ext: Factor out nldsq_cursor_next_task() and nldsq_cursor_lost_task()
Factor out cursor-based DSQ iteration from bpf_iter_scx_dsq_next() into
nldsq_cursor_next_task() and the task-lost check from scx_dsq_move() into
nldsq_cursor_lost_task() to prepare for reuse.

As ->priv is only used to record dsq->seq for cursors, update
INIT_DSQ_LIST_CURSOR() to take the DSQ pointer and set ->priv from dsq->seq
so that users don't have to read it manually. Move scx_dsq_iter_flags enum
earlier so nldsq_cursor_next_task() can use SCX_DSQ_ITER_REV.

bypass_lb_cpu() now sets cursor.priv to dsq->seq but doesn't use it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 05:29:50 -10:00
Tejun Heo
30b0515342 sched_ext: Add per-CPU data to DSQs
Add per-CPU data structure to dispatch queues. Each DSQ now has a percpu
scx_dsq_pcpu which contains a back-pointer to the DSQ. This will be used by
future changes to implement per-CPU reenqueue tracking for user DSQs.

init_dsq() now allocates the percpu data and can fail, so it returns an
error code. All callers are updated to handle failures. exit_dsq() is added
to free the percpu data and is called from all DSQ cleanup paths.

In scx_bpf_create_dsq(), init_dsq() is called before rcu_read_lock() since
alloc_percpu() requires GFP_KERNEL context, and dsq->sched is set
afterwards.

v2: Fix err_free_pcpu to only exit_dsq() initialized bypass DSQs (Andrea
    Righi).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 05:29:50 -10:00
Tejun Heo
ffa7ae0724 sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq()
Add infrastructure to pass flags through the deferred reenqueue path.
reenq_local() now takes a reenq_flags parameter, and scx_sched_pcpu gains a
deferred_reenq_local_flags field to accumulate flags from multiple
scx_bpf_dsq_reenq() calls before processing. No flags are defined yet.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 05:29:49 -10:00
Tejun Heo
9c34c5074d sched_ext: Introduce scx_bpf_dsq_reenq() for remote local DSQ reenqueue
scx_bpf_reenqueue_local() can only trigger re-enqueue of the current CPU's
local DSQ. Introduce scx_bpf_dsq_reenq() which takes a DSQ ID and can target
any local DSQ including remote CPUs via SCX_DSQ_LOCAL_ON | cpu. This will be
expanded to support user DSQs by future changes.

scx_bpf_reenqueue_local() is reimplemented as a simple wrapper around
scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, 0) and may be deprecated in the future.

Update compat.bpf.h with a compatibility shim and scx_qmap to test the new
functionality.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 05:29:49 -10:00