Commit Graph

1331563 Commits

Author SHA1 Message Date
Andrea Righi
e4855fc90e sched_ext: idle: Refactor scx_select_cpu_dfl()
Make scx_select_cpu_dfl() more consistent with the other idle-related
APIs by returning a negative value when an idle CPU isn't found.

No functional changes, this is purely a refactoring.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-14 08:17:11 -10:00
Andrea Righi
c414c2171c sched_ext: idle: Honor idle flags in the built-in idle selection policy
Enable passing idle flags (%SCX_PICK_IDLE_*) to scx_select_cpu_dfl(),
to enforce strict selection criteria, such as selecting an idle CPU
strictly within @prev_cpu's node or choosing only a fully idle SMT core.

This functionality will be exposed through a dedicated kfunc in a
separate patch.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-14 08:17:01 -10:00
Andrea Righi
97e13ecb02 sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()
scx_bpf_reenqueue_local() can be invoked from ops.cpu_release() to give
tasks that are queued to the local DSQ a chance to migrate to other
CPUs, when a CPU is taken by a higher scheduling class.

However, there is no point re-enqueuing tasks that can only run on that
particular CPU, as they would simply be re-added to the same local DSQ
without any benefit.

Therefore, skip per-CPU tasks in scx_bpf_reenqueue_local().

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-13 07:07:27 -10:00
Changwoo Min
71d078803c sched_ext: Add trace point to track sched_ext core events
Add tracing support to track sched_ext core events
(/sched_ext/sched_ext_event). This may be useful for debugging sched_ext
schedulers that trigger a particular event.

The trace point can be used as other trace points, so it can be used in,
for example, `perf trace` and BPF programs, as follows:

======
$> sudo perf trace -e sched_ext:sched_ext_event --filter 'name == "SCX_EV_ENQ_SLICE_DFL"'
======

======
struct tp_sched_ext_event {
	struct trace_entry ent;
	u32 __data_loc_name;
	s64 delta;
};

SEC("tracepoint/sched_ext/sched_ext_event")
int rtp_add_event(struct tp_sched_ext_event *ctx)
{
	char event_name[128];
	unsigned short offset = ctx->__data_loc_name & 0xFFFF;
        bpf_probe_read_str((void *)event_name, 128, (char *)ctx + offset);

	bpf_printk("name %s   delta %lld", event_name, ctx->delta);
	return 0;
}
======

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-04 08:06:17 -10:00
Changwoo Min
038730dc12 sched_ext: Change the event type from u64 to s64
The event count could be negative in the future,
so change the event type from u64 to s64.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-04 08:05:23 -10:00
Tejun Heo
8a9b1585e2 sched_ext: Merge branch 'for-6.14-fixes' into for-6.15
Pull for-6.14-fixes to receive:

  9360dfe4cb ("sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()")

which conflicts with:

  337d1b354a ("sched_ext: Move built-in idle CPU selection policy to a separate file")

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-03 08:02:22 -10:00
Andrea Righi
9360dfe4cb sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()
If a BPF scheduler provides an invalid CPU (outside the nr_cpu_ids
range) as prev_cpu to scx_bpf_select_cpu_dfl() it can cause a kernel
crash.

To prevent this, validate prev_cpu in scx_bpf_select_cpu_dfl() and
trigger an scx error if an invalid CPU is specified.

Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-03 07:55:48 -10:00
Andrea Righi
0f0714a344 sched_ext: Documentation: add task lifecycle summary
Understanding the lifecycle of a task in sched_ext can be not trivial,
therefore add a section to the main documentation that summarizes the
entire workflow of a task using pseudo-code.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-27 06:28:23 -10:00
Andrea Righi
b214b04df9 tools/sched_ext: Provide a compatible helper for scx_bpf_events()
Introduce __COMPAT_scx_bpf_events() to use scx_bpf_events() in a
compatible way also with kernels that don't provide this kfunc.

This also fixes the following error with scx_qmap when running on a
kernel that does not provide scx_bpf_events():

 ; scx_bpf_events(&events, sizeof(events)); @ scx_qmap.bpf.c:777
 318: (b7) r2 = 72                     ; R2_w=72 async_cb
 319: <invalid kfunc call>
 kfunc 'scx_bpf_events' is referenced but wasn't resolved

Fixes: 9865f31d85 ("sched_ext: Add scx_bpf_events() and scx_read_event() for BPF schedulers")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-27 06:26:58 -10:00
Andrea Righi
5ae5161820 selftests/sched_ext: Add NUMA-aware scheduler test
Add a selftest to validate the behavior of the NUMA-aware scheduler
functionalities, including idle CPU selection within nodes, per-node
DSQs and CPU to node mapping.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-26 08:49:02 -10:00
Andrea Righi
5e3b64246f tools/sched_ext: Provide consistent access to scx flags
Make all the SCX_OPS_* and SCX_PICK_IDLE_* flags available to the
user-space part of the schedulers via the compat interface.

This allows schedulers / selftests to set all the ops flags in
user-space, rather than having them split between BPF and user-space.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-25 08:36:36 -10:00
Andrea Righi
fde7d64766 sched_ext: idle: Fix scx_bpf_pick_any_cpu_node() behavior
When %SCX_PICK_IDLE_IN_NODE is specified, scx_bpf_pick_any_cpu_node()
should always return a CPU from the specified node, regardless of its
idle state.

Also clarify this logic in the function documentation.

Fixes: 01059219b0 ("sched_ext: idle: Introduce node-aware idle cpu kfunc helpers")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-25 08:35:53 -10:00
Tejun Heo
8fef0a3b17 sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance()
a6250aa251 ("sched_ext: Handle cases where pick_task_scx() is called
without preceding balance_scx()") added a workaround to handle the cases
where pick_task_scx() is called without prececing balance_scx() which is due
to a fair class bug where pick_taks_fair() may return NULL after a true
return from balance_fair().

The workaround detects when pick_task_scx() is called without preceding
balance_scx() and emulates SCX_RQ_BAL_KEEP and triggers kicking to avoid
stalling. Unfortunately, the workaround code was testing whether @prev was
on SCX to decide whether to keep the task running. This is incorrect as the
task may be on SCX but no longer runnable.

This could lead to a non-runnable task to be returned from pick_task_scx()
which cause interesting confusions and failures. e.g. A common failure mode
is the task ending up with (!on_rq && on_cpu) state which can cause
potential wakers to busy loop, which can easily lead to deadlocks.

Fix it by testing whether @prev has SCX_TASK_QUEUED set. This makes
@prev_on_scx only used in one place. Open code the usage and improve the
comment while at it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Pat Cody <patcody@meta.com>
Fixes: a6250aa251 ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-25 08:28:52 -10:00
Andrea Righi
0e9b4c10e8 sched_ext: idle: Introduce scx_bpf_nr_node_ids()
Similarly to scx_bpf_nr_cpu_ids(), introduce a new kfunc
scx_bpf_nr_node_ids() to expose the maximum number of NUMA nodes in the
system.

BPF schedulers can use this information together with the new node-aware
kfuncs, for example to create per-node DSQs, validate node IDs, etc.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-24 07:37:42 -10:00
Andrea Righi
01059219b0 sched_ext: idle: Introduce node-aware idle cpu kfunc helpers
Introduce a new kfunc to retrieve the node associated to a CPU:

 int scx_bpf_cpu_node(s32 cpu)

Add the following kfuncs to provide BPF schedulers direct access to
per-node idle cpumasks information:

 const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
 const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
 s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed,
 				int node, u64 flags)
 s32 scx_bpf_pick_any_cpu_node(const cpumask_t *cpus_allowed,
 			       int node, u64 flags)

Moreover, trigger an scx error when any of the non-node aware idle CPU
kfuncs are used when SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled.

Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-18 09:10:24 -10:00
Andrea Righi
48849271e6 sched_ext: idle: Per-node idle cpumasks
Using a single global idle mask can lead to inefficiencies and a lot of
stress on the cache coherency protocol on large systems with multiple
NUMA nodes, since all the CPUs can create a really intense read/write
activity on the single global cpumask.

Therefore, split the global cpumask into multiple per-NUMA node cpumasks
to improve scalability and performance on large systems.

The concept is that each cpumask will track only the idle CPUs within
its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
In this way concurrent access to the idle cpumask will be restricted
within each NUMA node.

The split of multiple per-node idle cpumasks can be controlled using the
SCX_OPS_BUILTIN_IDLE_PER_NODE flag.

By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global
host-wide idle cpumask is used, maintaining the previous behavior.

NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via
SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will
trigger an scx error, since there are no system-wide cpumasks.

= Test =

Hardware:
 - System: DGX B200
    - CPUs: 224 SMT threads (112 physical cores)
    - Processor: INTEL(R) XEON(R) PLATINUM 8570
    - 2 NUMA nodes

Scheduler:
 - scx_simple [1] (so that we can focus at the built-in idle selection
   policy and not at the scheduling policy itself)

Test:
 - Run a parallel kernel build `make -j $(nproc)` and measure the average
   elapsed time over 10 runs:

          avg time | stdev
          ---------+------
 before:   52.431s | 2.895
  after:   50.342s | 2.895

= Conclusion =

Splitting the global cpumask into multiple per-NUMA cpumasks helped to
achieve a speedup of approximately +4% with this particular architecture
and test case.

The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @
2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%.

On smaller systems, I haven't noticed any measurable regressions or
improvements with the same test (parallel kernel build) and scheduler
(scx_simple).

Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs
I observed an additional +2-2.5% performance improvement with the same
test.

[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c

Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 06:52:20 -10:00
Andrea Righi
0aaaf89df8 sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE
Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows
BPF schedulers to select between using a global flat idle cpumask or
multiple per-node cpumasks.

This only introduces the flag and the mechanism to enable/disable this
feature without affecting any scheduling behavior.

Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 06:52:20 -10:00
Andrea Righi
d73249f887 sched_ext: idle: Make idle static keys private
Make all the static keys used by the idle CPU selection policy private
to ext_idle.c. This avoids unnecessary exposure in headers and improves
code encapsulation.

Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 06:52:20 -10:00
Andrea Righi
f09177ca5f sched/topology: Introduce for_each_node_numadist() iterator
Introduce the new helper for_each_node_numadist() to iterate over node
IDs in order of increasing NUMA distance from a given starting node.

This iterator is somehow similar to for_each_numa_hop_mask(), but
instead of providing a cpumask at each iteration, it provides a node ID.

Example usage:

  nodemask_t unvisited = NODE_MASK_ALL;
  int node, start = cpu_to_node(smp_processor_id());

  node = start;
  for_each_node_numadist(node, unvisited)
  	pr_info("node (%d, %d) -> %d\n",
  		 start, node, node_distance(start, node));

On a system with equidistant nodes:

 $ numactl -H
 ...
 node distances:
 node     0    1    2    3
    0:   10   20   20   20
    1:   20   10   20   20
    2:   20   20   10   20
    3:   20   20   20   10

Output of the example above (on node 0):

[    7.367022] node (0, 0) -> 10
[    7.367151] node (0, 1) -> 20
[    7.367186] node (0, 2) -> 20
[    7.367247] node (0, 3) -> 20

On a system with non-equidistant nodes (simulated using virtme-ng):

 $ numactl -H
 ...
 node distances:
 node     0    1    2    3
    0:   10   51   31   41
    1:   51   10   21   61
    2:   31   21   10   11
    3:   41   61   11   10

Output of the example above (on node 0):

 [    8.953644] node (0, 0) -> 10
 [    8.953712] node (0, 2) -> 31
 [    8.953764] node (0, 3) -> 41
 [    8.953817] node (0, 1) -> 51

Suggested-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 06:52:20 -10:00
Andrea Righi
16d79f2a4f mm/numa: Introduce nearest_node_nodemask()
Introduce the new helper nearest_node_nodemask() to find the closest
node in a specified nodemask from a given starting node.

Returns MAX_NUMNODES if no node is found.

Suggested-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 06:52:19 -10:00
Yury Norov
14a8262f50 nodemask: numa: reorganize inclusion path
Nodemasks now pull linux/numa.h for MAX_NUMNODES and NUMA_NO_NODE
macros. This series makes numa.h depending on nodemasks, so we hit
a circular dependency.

Nodemasks library is highly employed by NUMA code, and it would be
logical to resolve the circular dependency by making NUMA headers
dependent nodemask.h.

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 06:52:19 -10:00
Yury Norov
7665054ee0 nodemask: add nodes_copy()
Nodemasks API misses the plain nodes_copy() which is required in this
series.

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 06:52:19 -10:00
Tejun Heo
f2c880fc81 tools/sched_ext: Sync with scx repo
Synchronize with https://github.com/sched-ext/scx at d384453984a0 ("kernel:
Sync at ad3b301aa0 ("sched_ext: Provides a sysfs 'events' to expose core
event counters")").

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-14 08:46:20 -10:00
Changwoo Min
ad3b301aa0 sched_ext: Provides a sysfs 'events' to expose core event counters
Add a sysfs entry at /sys/kernel/sched_ext/root/events to expose core
event counters through the files system interface. Each line of the file
shows the event name and its counter value.

In addition, the format of scx_dump_event() is adjusted as the event name
gets longer.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-14 07:15:41 -10:00
Tejun Heo
3539c6411a sched_ext: Implement SCX_OPS_ALLOW_QUEUED_WAKEUP
A task wakeup can be either processed on the waker's CPU or bounced to the
wakee's previous CPU using an IPI (ttwu_queue). Bouncing to the wakee's CPU
avoids the waker's CPU locking and accessing the wakee's rq which can be
expensive across cache and node boundaries.

When ttwu_queue path is taken, select_task_rq() and thus ops.select_cpu()
may be skipped in some cases (racing against the wakee switching out). As
this confused some BPF schedulers, there wasn't a good way for a BPF
scheduler to tell whether idle CPU selection has been skipped, ops.enqueue()
couldn't insert tasks into foreign local DSQs, and the performance
difference on machines with simple toplogies were minimal, sched_ext
disabled ttwu_queue.

However, this optimization makes noticeable difference on more complex
topologies and a BPF scheduler now has an easy way tell whether
ops.select_cpu() was skipped since 9b671793c7 ("sched_ext, scx_qmap: Add
and use SCX_ENQ_CPU_SELECTED") and can insert tasks into foreign local DSQs
since 5b26f7b920 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct
dispatches").

Implement SCX_OPS_ALLOW_QUEUED_WAKEUP which allows BPF schedulers to choose
to enable ttwu_queue optimization.

v2: Update the patch description and comment re. ops.select_cpu() being
    skipped in some cases as opposed to always as per Neel.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Neel Natu <neelnatu@google.com>
Reported-by: Barret Rhoden <brho@google.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-13 07:30:46 -10:00
Chuyi Zhou
f5717c93a1 sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx
Now when we use scx_bpf_task_cgroup() in ops.tick() to get the cgroup of
the current task, the following error will occur:

scx_foo[3795244] triggered exit kind 1024:
  runtime error (called on a task not being operated on)

The reason is that we are using SCX_CALL_OP() instead of SCX_CALL_OP_TASK()
when calling ops.tick(), which triggers the error during the subsequent
scx_kf_allowed_on_arg_tasks() check.

SCX_CALL_OP_TASK() was first introduced in commit 36454023f5 ("sched_ext:
Track tasks that are subjects of the in-flight SCX operation") to ensure
task's rq lock is held when accessing task's sched_group. Since ops.tick()
is marked as SCX_KF_TERMINAL and task_tick_scx() is protected by the rq
lock, we can use SCX_CALL_OP_TASK() to avoid the above issue. Similarly,
the same changes should be made for ops.disable() and ops.exit_task(), as
they are also protected by task_rq_lock() and it's safe to access the
task's task_group.

Fixes: 36454023f5 ("sched_ext: Track tasks that are subjects of the in-flight SCX operation")
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-13 06:57:33 -10:00
Chuyi Zhou
2e2006c91c sched_ext: Fix the incorrect bpf_list kfunc API in common.bpf.h.
Now BPF only supports bpf_list_push_{front,back}_impl kfunc, not bpf_list_
push_{front,back}.

This patch fix this issue. Without this patch, if we use bpf_list kfunc
in scx, the BPF verifier would complain:

libbpf: extern (func ksym) 'bpf_list_push_back': not found in kernel or
module BTFs
libbpf: failed to load object 'scx_foo'
libbpf: failed to load BPF skeleton 'scx_foo': -EINVAL

With this patch, the bpf list kfunc will work as expected.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Fixes: 2a52ca7c98 ("sched_ext: Add scx_simple and scx_example_qmap example schedulers")
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-13 06:47:08 -10:00
Devaansh Kumar
0760d62dad sched_ext: selftests: Fix grammar in tests description
Fixed grammar for a few tests of sched_ext.

Signed-off-by: Devaansh Kumar <devaanshk840@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-13 06:46:22 -10:00
Tejun Heo
78e4690de4 Merge branch 'for-6.14-fixes' into for-6.15
Pull to receive f3f08c3acf ("sched_ext: Fix incorrect assumption about
migration disabled tasks in task_can_run_on_remote_rq()") which conflicts
with 26176116d9 ("sched_ext: Count SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE in
the right spot") in for-6.15.
2025-02-10 10:45:43 -10:00
Tejun Heo
f3f08c3acf sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq()
While fixing migration disabled task handling, 3296682157 ("sched_ext: Fix
migration disabled handling in targeted dispatches") assumed that a
migration disabled task's ->cpus_ptr would only have the pinned CPU. While
this is eventually true for migration disabled tasks that are switched out,
->cpus_ptr update is performed by migrate_disable_switch() which is called
right before context_switch() in __scheduler(). However, the task is
enqueued earlier during pick_next_task() via put_prev_task_scx(), so there
is a race window where another CPU can see the task on a DSQ.

If the CPU tries to dispatch the migration disabled task while in that
window, task_allowed_on_cpu() will succeed and task_can_run_on_remote_rq()
will subsequently trigger SCHED_WARN(is_migration_disabled()).

  WARNING: CPU: 8 PID: 1837 at kernel/sched/ext.c:2466 task_can_run_on_remote_rq+0x12e/0x140
  Sched_ext: layered (enabled+all), task: runnable_at=-10ms
  RIP: 0010:task_can_run_on_remote_rq+0x12e/0x140
  ...
   <TASK>
   consume_dispatch_q+0xab/0x220
   scx_bpf_dsq_move_to_local+0x58/0xd0
   bpf_prog_84dd17b0654b6cf0_layered_dispatch+0x290/0x1cfa
   bpf__sched_ext_ops_dispatch+0x4b/0xab
   balance_one+0x1fe/0x3b0
   balance_scx+0x61/0x1d0
   prev_balance+0x46/0xc0
   __pick_next_task+0x73/0x1c0
   __schedule+0x206/0x1730
   schedule+0x3a/0x160
   __do_sys_sched_yield+0xe/0x20
   do_syscall_64+0xbb/0x1e0
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fix it by converting the SCHED_WARN() back to a regular failure path. Also,
perform the migration disabled test before task_allowed_on_cpu() test so
that BPF schedulers which fail to handle migration disabled tasks can be
noticed easily.

While at it, adjust scx_ops_error() message for !task_allowed_on_cpu() case
for brevity and consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3296682157 ("sched_ext: Fix migration disabled handling in targeted dispatches")
Acked-by: Andrea Righi <arighi@nvidia.com>
Reported-by: Jake Hillion <jakehillion@meta.com>
2025-02-10 10:40:47 -10:00
Changwoo Min
2e7df12bdd tools/sched_ext: Update enum_defs.autogen.h
Add where the script is located to the comment lines of the header file.
This helps anyone re-generate the header file if required.

Note that this is a sync from the PR [1] in the scx repo.

  [1] https://github.com/sched-ext/scx/pull/1322

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-10 07:22:42 -10:00
Li RongQing
1a4e0d8682 sched_ext: Take NUMA node into account when allocating per-CPU cpumasks
per-CPU cpumasks are dominantly accessed from their own local CPUs,
so allocate them node-local to improve performance.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-10 07:20:13 -10:00
Changwoo Min
372033ad9e tools/sched_ext: Compatible testing of SCX_ENQ_CPU_SELECTED
This provides compatible testing of SCX_ENQ_CPU_SELECTED.
More specifically, it handles two cases:

  1. a BPF scheduler is compiled against vmlinux.h where
  SCX_ENQ_CPU_SELECTED is defined, but it runs on a kernel that does not
  have SCX_ENQ_CPU_SELECTED. In this case, the test result of
  'enq_flags & SCX_ENQ_CPU_SELECTED' will always be false. That test result
  is semantically incorrect because the kernel before SCX_ENQ_CPU_SELECTED
  has never skipped select_task_rq_scx(), so the result should be true.

  2. a BPF scheduler is compiling against vmlinux.h where
  SCX_ENQ_CPU_SELECTED is not defined. In this case, directly using
  SCX_ENQ_CPU_SELECTED causes compilation errors.

To hide such complexity, introduce __COMPAT_is_enq_cpu_selected(),
which checks if SCX_ENQ_CPU_SELECTED exists in runtime using BPF CO-RE.
This consists of three parts:

  1. Add enum_defs.autogen.h, which has macros (HAVE_{enum name}) denoting
  whether SCX enums are defined in the vmlinux.h or not.

  2. Implement __COMPAT_is_enq_cpu_selected(), which provide the test of
  SCX_ENQ_CPU_SELECTED in a compatible way.

  3. Use  __COMPAT_is_enq_cpu_selected() in scx_qmap.

Note that this is a sync of the relevant PR [1] in the scx repo.

  [1] https://github.com/sched-ext/scx/pull/1314

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-08 20:42:50 -10:00
Tejun Heo
eace54dff0 sched_ext: Add SCX_EV_ENQ_SKIP_MIGRATION_DISABLED
Count the number of times a migration disabled task is automatically
dispatched to its local DSQ.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-02-08 20:39:11 -10:00
Tejun Heo
26176116d9 sched_ext: Count SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE in the right spot
SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE wasn't quite right in two aspects:

- It counted both migration disabled and offline events.

- It didn't count events from scx_bpf_dsq_move() path.

Fix it by moving the counting into task_can_run_on_remote_rq() which is
shared by both paths and can distinguish the different rejection conditions.
The argument @trigger_error is renamed to @enforce as it now does more than
just triggering error.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-02-08 20:39:11 -10:00
Tejun Heo
46a0e16158 tool/sched_ext: Event counter dumping updates
- There's no need to dump event counters from both scx_qmap and scx_central.
  Drop counter dumping from scx_central.

- bpf_printk() implies a trailing new line and the explicit new line leads
  to double new lines. Drop the explicit new lines.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-02-08 20:39:11 -10:00
Tejun Heo
29ef4a2fcf Merge branch 'for-6.14-fixes' into for-6.15
Pull to receive:

- 2fa0fbeb69 ("sched_ext: Implement auto local dispatching of migration disabled tasks")
- 3296682157 ("sched_ext: Fix migration disabled handling in targeted dispatches")

as planned for-6.15 changes depend on them (e.g. adding event counter for
implicit migration disabled task handling).
2025-02-08 20:34:43 -10:00
Tejun Heo
3296682157 sched_ext: Fix migration disabled handling in targeted dispatches
A dispatch operation that can target a specific local DSQ -
scx_bpf_dsq_move_to_local() or scx_bpf_dsq_move() - checks whether the task
can be migrated to the target CPU using task_can_run_on_remote_rq(). If the
task can't be migrated to the targeted CPU, it is bounced through a global
DSQ.

task_can_run_on_remote_rq() assumes that the task is on a CPU that's
different from the targeted CPU but the callers doesn't uphold the
assumption and may call the function when the task is already on the target
CPU. When such task has migration disabled, task_can_run_on_remote_rq() ends
up returning %false incorrectly unnecessarily bouncing the task to a global
DSQ.

Fix it by updating the callers to only call task_can_run_on_remote_rq() when
the task is on a different CPU than the target CPU. As this is a bit subtle,
for clarity and documentation:

- Make task_can_run_on_remote_rq() trigger SCHED_WARN_ON() if the task is on
  the same CPU as the target CPU.

- is_migration_disabled() test in task_can_run_on_remote_rq() cannot trigger
  if the task is on a different CPU than the target CPU as the preceding
  task_allowed_on_cpu() test should fail beforehand. Convert the test into
  SCHED_WARN_ON().

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Fixes: 0366017e09 ("sched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq()")
Cc: stable@vger.kernel.org # v6.12+
2025-02-08 20:33:25 -10:00
Tejun Heo
2fa0fbeb69 sched_ext: Implement auto local dispatching of migration disabled tasks
Migration disabled tasks are special and pinned to their previous CPUs. They
tripped up some unsuspecting BPF schedulers as their ->nr_cpus_allowed may
not agree with the bits set in ->cpus_ptr. Make it easier for BPF schedulers
by automatically dispatching them to the pinned local DSQs by default. If a
BPF scheduler wants to handle migration disabled tasks explicitly, it can
set SCX_OPS_ENQ_MIGRATION_DISABLED.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-08 20:32:54 -10:00
Changwoo Min
38d65cd692 sched_ext: Print an event, SCX_EV_ENQ_SLICE_DFL, in scx_qmap/central
Modify the scx_qmap and scx_celtral schedulers
to print the SCX_EV_ENQ_SLICE_DFL event every second.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-07 11:24:59 -10:00
Changwoo Min
6d3f8fb4b2 sched_ext: Add an event, SCX_EV_ENQ_SLICE_DFL
Add a core event, SCX_EV_ENQ_SLICE_DFL, which represents how many
tasks have been enqueued (or pick_task-ed or select_cpu-ed) with
a default time slice (SCX_SLICE_DFL).

Scheduling a task with SCX_SLICE_DFL unintentionally would be a source
of latency spikes because SCX_SLICE_DFL is relatively long (20 msec).
Thus, soaring the SCX_EV_ENQ_SLICE_DFL value would be a sign of BPF
scheduler bugs, causing latency spikes, especially when ops.select_cpu()
is provided.

__scx_add_event() is used since the caller holds an rq lock or p->pi_lock,
so the preemption has already been disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-07 11:24:59 -10:00
Changwoo Min
2494e555fb sched_ext: Print core event count in scx_qmap scheduler
Modify the scx_qmap scheduler to print the core event counter
every second.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
6df93804b7 sched_ext: Print core event count in scx_central scheduler
Modify the scx_central scheduler to print the core event counter
every second.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
9865f31d85 sched_ext: Add scx_bpf_events() and scx_read_event() for BPF schedulers
scx_bpf_events() is added to the header files so the BPF scheduler
can use it. Also, scx_read_event() is added to read an event type in a
compatible way.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
d46457c31c sched_ext: Add an event, SCX_EV_BYPASS_DURATION
Add a core event, SCX_EV_BYPASS_DURATION, which represents the
total duration of bypass modes in nanoseconds.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
5c605cd33c sched_ext: Add an event, SCX_EV_BYPASS_DISPATCH
Add a core event, SCX_EV_BYPASS_DISPATCH, which represents how many
tasks have been dispatched in the bypass mode.

__scx_add_event() is used since the caller holds an rq lock or
p->pi_lock, so the preemption has already been disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
4f7a38c7c9 sched_ext: Add an event, SCX_EV_BYPASS_ACTIVATE
Add a core event, SCX_EV_BYPASS_ACTIVATE, which represents how many
times the bypass mode has been triggered.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
824d4f2dce sched_ext: Add an event, SCX_EV_ENQ_SKIP_EXITING
Add a core event, SCX_EV_ENQ_SKIP_EXITING, which represents how many
times a task is enqueued to a local DSQ when exiting if
SCX_OPS_ENQ_EXITING is not set.

__scx_add_event() is used since the caller holds an rq lock,
so the preemption has already been disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
029b6ce733 sched_ext: Fix incorrect time delta calculation in time_delta()
When (s64)(after - before) > 0, the code returns the result of
(s64)(after - before) > 0 while the intended result should be
(s64)(after - before). That happens because the middle operand of
the ternary operator was omitted incorrectly, returning the result of
(s64)(after - before) > 0. Thus, add the middle operand
-- (s64)(after - before) -- to return the correct time calculation.

Fixes: d07be814fc ("sched_ext: Add time helpers for BPF schedulers")
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 07:40:35 -10:00
Changwoo Min
7125660bc1 sched_ext: Add an event, SCX_EV_DISPATCH_KEEP_LAST
Add a core event, SCX_EV_DISPATCH_KEEP_LAST, which represents how many
times a task is continued to run without ops.enqueue() when
SCX_OPS_ENQ_LAST is not set.

__scx_add_event() is used since the caller holds an rq lock,
so the preemption has already been disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 07:23:18 -10:00