Commit Graph

1396692 Commits

Author SHA1 Message Date
Tejun Heo
a379fa1e2c sched_ext: Fix SCX_KICK_WAIT to work reliably
SCX_KICK_WAIT is used to synchronously wait for the target CPU to complete
a reschedule and can be used to implement operations like core scheduling.

This used to be implemented by scx_next_task_picked() incrementing pnt_seq,
which was always called when a CPU picks the next task to run, allowing
SCX_KICK_WAIT to reliably wait for the target CPU to enter the scheduler and
pick the next task.

However, commit b999e365c2 ("sched_ext: Replace scx_next_task_picked()
with switch_class()") replaced scx_next_task_picked() with the
switch_class() callback, which is only called when switching between sched
classes. This broke SCX_KICK_WAIT because pnt_seq would no longer be
reliably incremented unless the previous task was SCX and the next task was
not.

This fix leverages commit 4c95380701 ("sched/ext: Fold balance_scx() into
pick_task_scx()") which refactored the pick path making put_prev_task_scx()
the natural place to track task switches for SCX_KICK_WAIT. The fix moves
pnt_seq increment to put_prev_task_scx() and also increments it in
pick_task_scx() to handle cases where the same task is re-selected, whether
by BPF scheduler decision or slice refill. The semantics: If the current
task on the target CPU is SCX, SCX_KICK_WAIT waits until the CPU enters the
scheduling path. This provides sufficient guarantee for use cases like core
scheduling while keeping the operation self-contained within SCX.

v2: - Also increment pnt_seq in pick_task_scx() to handle same-task
      re-selection (Andrea Righi).
    - Use smp_cond_load_acquire() for the busy-wait loop for better
      architecture optimization (Peter Zijlstra).

Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
Link: http://lkml.kernel.org/r/228ebd9e6ed3437996dffe15735a9caa@honor.com
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-22 11:42:14 -10:00
Tejun Heo
a9c1fbbd6d sched_ext: Don't kick CPUs running higher classes
When a sched_ext scheduler tries to kick a CPU, the CPU may be running a
higher class task. sched_ext has no control over such CPUs. A sched_ext
scheduler couldn't have expected to get access to the CPU after kicking it
anyway. Skip kicking when the target CPU is running a higher class.

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-22 11:42:14 -10:00
Tejun Heo
2dbbdeda77 sched_ext: Fix scx_bpf_dsq_insert() backward binary compatibility
cded46d971 ("sched_ext: Make scx_bpf_dsq_insert*() return bool")
introduced a new bool-returning scx_bpf_dsq_insert() and renamed the old
void-returning version to scx_bpf_dsq_insert___compat, with the expectation
that libbpf would match old binaries to the ___compat variant, maintaining
backward binary compatibility. However, while libbpf ignores ___suffix on
the BPF side when matching symbols, it doesn't do so for kernel-side symbols.
Old binaries compiled with the original scx_bpf_dsq_insert() could no longer
resolve the symbol.

Fix by reversing the naming: Keep scx_bpf_dsq_insert() as the old
void-returning interface and add ___v2 to the new bool-returning version.
This allows old binaries to continue working while new code can use the
___v2 variant. Once libbpf is updated to ignore kernel-side ___SUFFIX, the
___v2 suffix can be dropped when the compat interface is removed.

v2: Use ___v2 instead of ___new.

Fixes: cded46d971 ("sched_ext: Make scx_bpf_dsq_insert*() return bool")
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-21 10:40:15 -10:00
Andrea Righi
67fa319f5f sched_ext: Allow forcibly picking an scx task
Refactor pick_task_scx() adding a new argument to forcibly pick a
SCHED_EXT task, ignoring any higher-priority sched class activity.

This refactoring prepares the code for future scenarios, e.g., allowing
the ext dl_server to force a SCHED_EXT task selection.

No functional changes.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-18 13:01:35 -10:00
Tejun Heo
70d837c3e0 sched_ext: Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.19
Pull in tip/sched/core to receive:

 50653216e4 ("sched: Add support to pick functions to take rf")
 4c95380701 ("sched/ext: Fold balance_scx() into pick_task_scx()")

which will enable clean integration of DL server support among other things.

This conflicts with the following from sched_ext/for-6.18-fixes:

 a8ad873113 ("sched_ext: defer queue_balance_callback() until after ops.dispatch")

which adds maybe_queue_balance_callback() to balance_scx() which is removed
by 50653216e4. Resolve by moving the invocation to pick_task_scx() in the
equivalent location.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-16 08:45:38 -10:00
Tejun Heo
075e3f7206 sched_ext: Merge branch 'for-6.18-fixes' into for-6.19
Pull sched_ext/for-6.18-fixes to sync trees to receive:

 05e63305c8 ("sched_ext: Fix scx_kick_pseqs corruption on concurrent scheduler loads")

to avoid conflicts with planned cgroup sub-sched support.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-16 08:34:12 -10:00
Emil Tsalapatis
a3c4a0a42e sched_ext: fix flag check for deferred callbacks
When scheduling the deferred balance callbacks, check SCX_RQ_BAL_CB_PENDING
instead of SCX_RQ_BAL_PENDING. This way schedule_deferred() properly tests
whether there is already a pending request for queue_balance_callback() to
be invoked at the end of .balance().

Fixes: a8ad873113 ("sched_ext: defer queue_balance_callback() until after ops.dispatch")
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-16 08:34:00 -10:00
Peter Zijlstra
73cbcfe255 sched/topology,x86: Fix build warning
A compile warning slipped through:

   arch/x86/kernel/smpboot.c:548:5: warning: no previous prototype for function 'arch_sched_node_distance' [-Wmissing-prototypes]

Fixes: 4d6dd05d07 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16 13:01:15 +02:00
Peter Zijlstra
4c95380701 sched/ext: Fold balance_scx() into pick_task_scx()
With pick_task() having an rf argument, it is possible to do the
lock-break there, get rid of the weird balance/pick_task hack.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16 11:13:55 +02:00
Joel Fernandes
50653216e4 sched: Add support to pick functions to take rf
Some pick functions like the internal pick_next_task_fair() already take
rf but some others dont. We need this for scx's server pick function.
Prepare for this by having pick functions accept it.

[peterz: - added RETRY_TASK handling
         - removed pick_next_task_fair indirection]
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16 11:13:55 +02:00
Peter Zijlstra
1e900f415c sched: Detect per-class runqueue changes
Have enqueue/dequeue set a per-class bit in rq->queue_mask. This then
enables easy tracking of which runqueues are modified over a
lock-break.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16 11:13:55 +02:00
Peter Zijlstra
73ec89a1ce sched: Mandate shared flags for sched_change
Shrikanth noted that sched_change pattern relies on using shared
flags.

Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16 11:13:54 +02:00
Peter Zijlstra
d4c64207b8 sched: Cleanup the sched_change NOCLOCK usage
Teach the sched_change pattern how to do update_rq_clock(); this
allows for some simplifications / cleanups.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:54 +02:00
Peter Zijlstra
5892cbd85d sched: Match __task_rq_{,un}lock()
In preparation to adding more rules to __task_rq_lock(), such that
__task_rq_unlock() will no longer be equivalent to rq_unlock(),
make sure every __task_rq_lock() is matched by a __task_rq_unlock()
and vice-versa.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:54 +02:00
Peter Zijlstra
46a177fb01 sched: Add locking comments to sched_class methods
'Document' the locking context the various sched_class methods are
called under.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:53 +02:00
Peter Zijlstra
650952d3fb sched: Make __do_set_cpus_allowed() use the sched_change pattern
Now that do_set_cpus_allowed() holds all the regular locks, convert it
to use the sched_change pattern helper.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:53 +02:00
Peter Zijlstra
b079d93796 sched: Rename do_set_cpus_allowed()
Hopefully saner naming.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:53 +02:00
Peter Zijlstra
abfc01077d sched: Fix do_set_cpus_allowed() locking
All callers of do_set_cpus_allowed() only take p->pi_lock, which is
not sufficient to actually change the cpumask. Again, this is mostly
ok in these cases, but it results in unnecessarily complicated
reasoning.

Furthermore, there is no reason what so ever to not just take all the
required locks, so do just that.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:52 +02:00
Peter Zijlstra
942b8db965 sched: Fix migrate_disable_switch() locking
For some reason migrate_disable_switch() was more complicated than it
needs to be, resulting in mind bending locking of dubious quality.

Recognise that migrate_disable_switch() must be called before a
context switch, but any place before that switch is equally good.
Since the current place results in troubled locking, simply move the
thing before taking rq->lock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:52 +02:00
Peter Zijlstra
6455ad5346 sched: Move sched_class::prio_changed() into the change pattern
Move sched_class::prio_changed() into the change pattern.

And while there, extend it with sched_class::get_prio() in order to
fix the deadline sitation.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:52 +02:00
Peter Zijlstra
1ae5f5dfe5 sched: Cleanup sched_delayed handling for class switches
Use the new sched_class::switching_from() method to dequeue delayed
tasks before switching to another class.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16 11:13:51 +02:00
Peter Zijlstra
637b068282 sched: Fold sched_class::switch{ing,ed}_{to,from}() into the change pattern
Add {DE,EN}QUEUE_CLASS and fold the sched_class::switch* methods into
the change pattern. This completes and makes the pattern more
symmetric.

This changes the order of callbacks slightly:

  OLD                              NEW
				|
				|  switching_from()
  dequeue_task();		|  dequeue_task()
  put_prev_task();		|  put_prev_task()
				|  switched_from()
				|
  ... change task ...		|  ... change task ...
				|
  switching_to();		|  switching_to()
  enqueue_task();		|  enqueue_task()
  set_next_task();		|  set_next_task()
  prev_class->switched_from()	|
  switched_to()			|  switched_to()
				|

Notably, it moves the switched_from() callback right after the
dequeue/put. Existing implementations don't appear to be affected by
this change in location -- specifically the task isn't enqueued on the
class in question in either location.

Make (CLASS)^(SAVE|MOVE), because there is nothing to save-restore
when changing scheduling classes.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:51 +02:00
Peter Zijlstra
5e42d4c123 sched/deadline: Prepare for switched_from() change
Prepare for the sched_class::switch*() methods getting folded into the
change pattern. As a result of that, the location of switched_from
will change slightly. SCHED_DEADLINE is affected by this change in
location:

  OLD                              NEW
				|
				|  switching_from()
  dequeue_task();		|  dequeue_task()
  put_prev_task();		|  put_prev_task()
				|  switched_from()
				|
  ... change task ...		|  ... change task ...
				|
  switching_to();		|  switching_to()
  enqueue_task();		|  enqueue_task()
  set_next_task();		|  set_next_task()
  prev_class->switched_from()	|
  switched_to()			|  switched_to()
				|

Notably, where switched_from() was called *after* the change to the
task, it will get called before it. Specifically, switched_from_dl()
uses dl_task(p) which uses p->prio; which is changed when switching
class (it might be the reason to switch class in case of PI).

When switched_from_dl() gets called, the task will have left the
deadline class and dl_task() must be false, while when doing
dequeue_dl_entity() the task must be a dl_task(), otherwise we'd have
called a different dequeue method.

Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16 11:13:51 +02:00
Peter Zijlstra
376f8963bb sched: Re-arrange the {EN,DE}QUEUE flags
Ensure the matched flags are in the low word while the unmatched flags
go into the second word.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:50 +02:00
Peter Zijlstra
e9139f765a sched: Employ sched_change guards
As proposed a long while ago -- and half done by scx -- wrap the
scheduler's 'change' pattern in a guard helper.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:50 +02:00
Adam Li
82d6e01a06 sched/fair: Only update stats for allowed CPUs when looking for dst group
Load imbalance is observed when the workload frequently forks new threads.
Due to CPU affinity, the workload can run on CPU 0-7 in the first
group, and only on CPU 8-11 in the second group. CPU 12-15 are always idle.

{ 0 1 2 3 4 5 6 7 } {8 9 10 11 12 13 14 15}
  * * * * * * * *    * * *  *

When looking for dst group for newly forked threads, in many times
update_sg_wakeup_stats() reports the second group has more idle CPUs
than the first group. The scheduler thinks the second group is less
busy. Then it selects least busy CPUs among CPU 8-11. Therefore CPU 8-11
can be crowded with newly forked threads, at the same time CPU 0-7
can be idle.

A task may not use all the CPUs in a schedule group due to CPU affinity.
Only update schedule group statistics for allowed CPUs.

Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16 11:13:50 +02:00
Tim Chen
4d6dd05d07 sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode
It is possible for Granite Rapids (GNR) and Clearwater Forest
(CWF) to have up to 3 dies per package. When sub-numa cluster (SNC-3)
is enabled, each die will become a separate NUMA node in the package
with different distances between dies within the same package.

For example, on GNR, we see the following numa distances for a 2 socket
system with 3 dies per socket:

    package 1       package2
	----------------
	|               |
    ---------       ---------
    |   0   |       |   3   |
    ---------       ---------
	|               |
    ---------       ---------
    |   1   |       |   4   |
    ---------       ---------
	|               |
    ---------       ---------
    |   2   |       |   5   |
    ---------       ---------
	|               |
	----------------

node distances:
node     0    1    2    3    4    5
0:   	10   15   17   21   28   26
1:   	15   10   15   23   26   23
2:   	17   15   10   26   23   21
3:   	21   28   26   10   15   17
4:   	23   26   23   15   10   15
5:   	26   23   21   17   15   10

The node distances above led to 2 problems:

1. Asymmetric routes taken between nodes in different packages led to
asymmetric scheduler domain perspective depending on which node you
are on.  Current scheduler code failed to build domains properly with
asymmetric distances.

2. Multiple remote distances to respective tiles on remote package create
too many levels of domain hierarchies grouping different nodes between
remote packages.

For example, the above GNR topology lead to NUMA domains below:

Sched domains from the perspective of a CPU in node 0, where the number
in bracket represent node number.

NUMA-level 1    [0,1] [2]
NUMA-level 2    [0,1,2] [3]
NUMA-level 3    [0,1,2,3] [5]
NUMA-level 4    [0,1,2,3,5] [4]

Sched domains from the perspective of a CPU in node 4
NUMA-level 1    [4] [3,5]
NUMA-level 2    [3,4,5] [0,2]
NUMA-level 3    [0,2,3,4,5] [1]

Scheduler group peers for load balancing from the perspective of CPU 0
and 4 are different.  Improper task could be chosen for load balancing
between groups such as [0,2,3,4,5] [1].  Ideally you should choose nodes
in 0 or 2 that are in same package as node 1 first.  But instead tasks
in the remote package node 3, 4, 5 could be chosen with an equal chance
and could lead to excessive remote package migrations and imbalance of
load between packages.  We should not group partial remote nodes and
local nodes together.
Simplify the remote distances for CWF and GNR for the purpose of
sched domains building, which maintains symmetry and leads to a more
reasonable load balance hierarchy.

The sched domains from the perspective of a CPU in node 0 NUMA-level 1
is now
NUMA-level 1    [0,1] [2]
NUMA-level 2    [0,1,2] [3,4,5]

The sched domains from the perspective of a CPU in node 4 NUMA-level 1
is now
NUMA-level 1    [4] [3,5]
NUMA-level 2    [3,4,5] [0,1,2]

We have the same balancing perspective from node 0 or node 4.  Loads are
now balanced equally between packages.

Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Zhao Liu <zhao1.liu@intel.com>
2025-10-16 11:13:50 +02:00
Tim Chen
06f2c90885 sched: Create architecture specific sched domain distances
Allow architecture specific sched domain NUMA distances that are
modified from actual NUMA node distances for the purpose of building
NUMA sched domains.

Keep actual NUMA distances separately if modified distances
are used for building sched domains. Such distances
are still needed as NUMA balancing benefits from finding the
NUMA nodes that are actually closer to a task numa_group.

Consolidate the recording of unique NUMA distances in an array to
sched_record_numa_dist() so the function can be reused to record NUMA
distances when the NUMA distance metric is changed.

No functional change and additional distance array
allocated if there're no arch specific NUMA distances
being defined.

Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
2025-10-16 11:13:49 +02:00
Doug Berger
382748c05e sched/deadline: only set free_cpus for online runqueues
Commit 16b269436b ("sched/deadline: Modify cpudl::free_cpus
to reflect rd->online") introduced the cpudl_set/clear_freecpu
functions to allow the cpu_dl::free_cpus mask to be manipulated
by the deadline scheduler class rq_on/offline callbacks so the
mask would also reflect this state.

Commit 9659e1eeee ("sched/deadline: Remove cpu_active_mask
from cpudl_find()") removed the check of the cpu_active_mask to
save some processing on the premise that the cpudl::free_cpus
mask already reflected the runqueue online state.

Unfortunately, there are cases where it is possible for the
cpudl_clear function to set the free_cpus bit for a CPU when the
deadline runqueue is offline. When this occurs while a CPU is
connected to the default root domain the flag may retain the bad
state after the CPU has been unplugged. Later, a different CPU
that is transitioning through the default root domain may push a
deadline task to the powered down CPU when cpudl_find sees its
free_cpus bit is set. If this happens the task will not have the
opportunity to run.

One example is outlined here:
https://lore.kernel.org/lkml/20250110233010.2339521-1-opendmb@gmail.com

Another occurs when the last deadline task is migrated from a
CPU that has an offlined runqueue. The dequeue_task member of
the deadline scheduler class will eventually call cpudl_clear
and set the free_cpus bit for the CPU.

This commit modifies the cpudl_clear function to be aware of the
online state of the deadline runqueue so that the free_cpus mask
can be updated appropriately.

It is no longer necessary to manage the mask outside of the
cpudl_set/clear functions so the cpudl_set/clear_freecpu
functions are removed. In addition, since the free_cpus mask is
now only updated under the cpudl lock the code was changed to
use the non-atomic __cpumask functions.

Signed-off-by: Doug Berger <opendmb@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16 11:13:49 +02:00
Fernand Sieber
79104becf4 sched/fair: Forfeit vruntime on yield
If a task yields, the scheduler may decide to pick it again. The task in
turn may decide to yield immediately or shortly after, leading to a tight
loop of yields.

If there's another runnable task as this point, the deadline will be
increased by the slice at each loop. This can cause the deadline to runaway
pretty quickly, and subsequent elevated run delays later on as the task
doesn't get picked again. The reason the scheduler can pick the same task
again and again despite its deadline increasing is because it may be the
only eligible task at that point.

Fix this by making the task forfeiting its remaining vruntime and pushing
the deadline one slice ahead. This implements yield behavior more
authentically.

We limit the forfeiting to eligible tasks. This is because core scheduling
prefers running ineligible tasks rather than force idling. As such, without
the condition, we can end up on a yield loop which makes the vruntime
increase rapidly, leading to anomalous run delays later down the line.

Fixes: 147f3efaa2 ("sched/fair: Implement an EEVDF-like scheduling  policy")
Signed-off-by: Fernand Sieber <sieberf@amazon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250401123622.584018-1-sieberf@amazon.com
Link: https://lore.kernel.org/r/20250911095113.203439-1-sieberf@amazon.com
Link: https://lore.kernel.org/r/20250916140228.452231-1-sieberf@amazon.com
2025-10-16 11:13:49 +02:00
Ryan Newton
5aff3b3199 sched_ext: Add a selftest for scx_bpf_dsq_peek
This commit adds two tests. The first is the most basic unit test:
make sure an empty queue peeks as empty, and when we put one element
in the queue, make sure peek returns that element.

However, even this simple test is a little complicated by the different
behavior of scx_bpf_dsq_insert in different calling contexts:
 - insert is for direct dispatch in enqueue
 - insert is delayed when called from select_cpu

In this case we split the insert and the peek that verifies the
result between enqueue/dispatch.

Note: An alternative would be to call `scx_bpf_dsq_move_to_local` on an
empty queue, which in turn calls `flush_dispatch_buf`, in order to flush
the buffered insert. Unfortunately, this is not viable within the
enqueue path, as it attempts a voluntary context switch within an RCU
read-side critical section.

The second test is a stress test that performs many peeks on all DSQs
and records the observed tasks.

Signed-off-by: Ryan Newton <newton@meta.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-15 06:46:36 -10:00
Ryan Newton
44f5c8ec5b sched_ext: Add lockless peek operation for DSQs
The builtin DSQ queue data structures are meant to be used by a wide
range of different sched_ext schedulers with different demands on these
data structures. They might be per-cpu with low-contention, or
high-contention shared queues. Unfortunately, DSQs have a coarse-grained
lock around the whole data structure. Without going all the way to a
lock-free, more scalable implementation, a small step we can take to
reduce lock contention is to allow a lockless, small-fixed-cost peek at
the head of the queue.

This change allows certain custom SCX schedulers to cheaply peek at
queues, e.g. during load balancing, before locking them. But it
represents a few extra memory operations to update the pointer each
time the DSQ is modified, including a memory barrier on ARM so the write
appears correctly ordered.

This commit adds a first_task pointer field which is updated
atomically when the DSQ is modified, and allows any thread to peek at
the head of the queue without holding the lock.

Signed-off-by: Ryan Newton <newton@meta.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-15 06:46:25 -10:00
Andrea Righi
05e63305c8 sched_ext: Fix scx_kick_pseqs corruption on concurrent scheduler loads
If we load a BPF scheduler while another scheduler is already running,
alloc_kick_pseqs() would be called again, overwriting the previously
allocated arrays.

Fix by moving the alloc_kick_pseqs() call after the scx_enable_state()
check, ensuring that the arrays are only allocated when a scheduler can
actually be loaded.

Fixes: 14c1da3895 ("sched_ext: Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-14 10:29:17 -10:00
zhidao su
347ed2d566 sched/ext: Implement cgroup_set_idle() callback
Implement the missing cgroup_set_idle() callback that was marked as a
TODO. This allows BPF schedulers to be notified when a cgroup's idle
state changes, enabling them to adjust their scheduling behavior
accordingly.

The implementation follows the same pattern as other cgroup callbacks
like cgroup_set_weight() and cgroup_set_bandwidth(). It checks if the
BPF scheduler has implemented the callback and invokes it with the
appropriate parameters.

Fixes a spelling error in the cgroup_set_bandwidth() documentation.

tj: s/scx_cgroup_rwsem/scx_cgroup_ops_rwsem/ to fix build breakage.

Signed-off-by: zhidao su <soolaugust@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-14 10:17:33 -10:00
Tejun Heo
bd7143e74e sched_ext/tools: Add compat wrapper for scx_bpf_task_set_slice/dsq_vtime()
for sub-scheduler authority checks. Add compat wrappers which fall back to
direct p->scx field writes on older kernels.

Suggested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13 08:49:29 -10:00
Tejun Heo
cded46d971 sched_ext: Make scx_bpf_dsq_insert*() return bool
In preparation for hierarchical schedulers, change scx_bpf_dsq_insert() and
scx_bpf_dsq_insert_vtime() to return bool instead of void. With
sub-schedulers, there will be no reliable way to guarantee a task is still
owned by the sub-scheduler at insertion time (e.g., the task may have been
migrated to another scheduler). The bool return value will enable
sub-schedulers to detect and gracefully handle insertion failures.

For the root scheduler, insertion failures will continue to trigger scheduler
abort via scx_error(), so existing code doesn't need to check the return
value. Backward compatibility is maintained through compat wrappers.

Also update scx_bpf_dsq_move() documentation to clarify that it can return
false for sub-schedulers when @dsq_id points to a disallowed local DSQ.

Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13 08:49:29 -10:00
Tejun Heo
c0d630ba34 sched_ext: Wrap kfunc args in struct to prepare for aux__prog
scx_bpf_dsq_insert_vtime() and scx_bpf_select_cpu_and() currently have 5
parameters. An upcoming change will add aux__prog parameter which will exceed
BPF's 5 argument limit.

Prepare by adding new kfuncs __scx_bpf_dsq_insert_vtime() and
__scx_bpf_select_cpu_and() that take args structs. The existing kfuncs are
kept as compatibility wrappers. BPF programs use inline wrappers that detect
kernel API version via bpf_core_type_exists() and use the new struct-based
kfuncs when available, falling back to compat kfuncs otherwise. This allows
BPF programs to work with both old and new kernels.

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13 08:49:29 -10:00
Tejun Heo
3035addfaf sched_ext: Add scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime()
With the planned hierarchical scheduler support, sub-schedulers will need to
be verified for authority before being allowed to modify task->scx.slice and
task->scx.dsq_vtime. Add scx_bpf_task_set_slice() and
scx_bpf_task_set_dsq_vtime() which will perform the necessary permission
checks.

Root schedulers can still directly write to these fields, so this doesn't
affect existing schedulers.

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13 08:49:29 -10:00
Tejun Heo
111a79800a tools/sched_ext: Strip compatibility macros for cgroup and dispatch APIs
Enough time has passed since the introduction of scx_bpf_task_cgroup() and
the scx_bpf_dispatch* -> scx_bpf_dsq* kfunc renaming. Strip the compatibility
macros.

Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13 08:49:29 -10:00
Andrea Righi
0128c85051 sched_ext: Exit early on hotplug events during attach
There is no need to complete the entire scx initialization if a
scheduler is failing to be attached due to a hotplug event.

Exit early to avoid unnecessary work and simplify the attach flow.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13 08:45:01 -10:00
Tejun Heo
14c1da3895 sched_ext: Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc()
On systems with >4096 CPUs, scx_kick_cpus_pnt_seqs allocation fails during
boot because it exceeds the 32,768 byte percpu allocator limit.

Restructure to use DEFINE_PER_CPU() for the per-CPU pointers, with each CPU
pointing to its own kvzalloc'd array. Move allocation from boot time to
scx_enable() and free in scx_disable(), so the O(nr_cpu_ids^2) memory is only
consumed when sched_ext is active.

Use RCU to guard against racing with free. Arrays are freed via call_rcu()
and kick_cpus_irq_workfn() uses rcu_dereference_bh() with a NULL check.

While at it, rename to scx_kick_pseqs for brevity and update comments to
clarify these are pick_task sequence numbers.

v2: RCU protect scx_kick_seqs to manage kick_cpus_irq_workfn() racing
    against disable as per Andrea.

v3: Fix bugs notcied by Andrea.

Reported-by: Phil Auld <pauld@redhat.com>
Link: http://lkml.kernel.org/r/20251007133523.GA93086@pauld.westford.csb
Cc: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13 08:42:19 -10:00
Emil Tsalapatis
a8ad873113 sched_ext: defer queue_balance_callback() until after ops.dispatch
The sched_ext code calls queue_balance_callback() during enqueue_task()
to defer operations that drop multiple locks until we can unpin them.
The call assumes that the rq lock is held until the callbacks are
invoked, and the pending callbacks will not be visible to any other
threads. This is enforced by a WARN_ON_ONCE() in rq_pin_lock().

However, balance_one() may actually drop the lock during a BPF dispatch
call. Another thread may win the race to get the rq lock and see the
pending callback. To avoid this, sched_ext must only queue the callback
after the dispatch calls have completed.

CPU 0                   CPU 1           CPU 2

scx_balance()
  rq_unpin_lock()
  scx_balance_one()
    |= IN_BALANCE	scx_enqueue()
    ops.dispatch()
      rq_unlock()
                        rq_lock()
                        queue_balance_callback()
                        rq_unlock()
                                        [WARN] rq_pin_lock()
      rq_lock()
    &= ~IN_BALANCE
rq_repin_lock()

Changelog

v2-> v1 (https://lore.kernel.org/sched-ext/aOgOxtHCeyRT_7jn@gpd4)

- Fixed explanation in patch description (Andrea)
- Fixed scx_rq mask state updates (Andrea)
- Added Reviewed-by tag from Andrea

Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13 08:36:19 -10:00
Tejun Heo
efeeaac9ae sched_ext: Sync error_irq_work before freeing scx_sched
By the time scx_sched_free_rcu_work() runs, the scx_sched is no longer
reachable. However, a previously queued error_irq_work may still be pending or
running. Ensure it completes before proceeding with teardown.

Fixes: bff3b5aec1 ("sched_ext: Move disable machinery into scx_sched")
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13 08:25:55 -10:00
Tejun Heo
54e96258a6 sched_ext: Mark scx_bpf_dsq_move_set_[slice|vtime]() with KF_RCU
scx_bpf_dsq_move_set_slice() and scx_bpf_dsq_move_set_vtime() take a DSQ
iterator argument which has to be valid. Mark them with KF_RCU.

Fixes: 4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13 08:13:38 -10:00
Linus Torvalds
3a86608788 Linux 6.18-rc1 v6.18-rc1 2025-10-12 13:42:36 -07:00
Linus Torvalds
3dd7b81235 Merge tag 'i2c-for-6.18-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fix from Wolfram Sang:
 "One revert because of a regression in the I2C core which has sadly not
  showed up during its time in -next"

* tag 'i2c-for-6.18-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  Revert "i2c: boardinfo: Annotate code used in init phase only"
2025-10-12 13:27:56 -07:00
Linus Torvalds
8765f46791 Merge tag 'irq_urgent_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:

 - Skip interrupt ID 0 in sifive-plic during suspend/resume because
   ID 0 is reserved and accessing reserved register space could result
   in undefined behavior

 - Fix a function's retval check in aspeed-scu-ic

* tag 'irq_urgent_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/sifive-plic: Avoid interrupt ID 0 handling during suspend/resume
  irqchip/aspeed-scu-ic: Fix an IS_ERR() vs NULL check
2025-10-12 08:45:52 -07:00
Linus Torvalds
67029a49db Merge tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
 "The previous fix to trace_marker required updating trace_marker_raw as
  well. The difference between trace_marker_raw from trace_marker is
  that the raw version is for applications to write binary structures
  directly into the ring buffer instead of writing ASCII strings. This
  is for applications that will read the raw data from the ring buffer
  and get the data structures directly. It's a bit quicker than using
  the ASCII version.

  Unfortunately, it appears that our test suite has several tests that
  test writes to the trace_marker file, but lacks any tests to the
  trace_marker_raw file (this needs to be remedied). Two issues came
  about the update to the trace_marker_raw file that syzbot found:

   - Fix tracing_mark_raw_write() to use per CPU buffer

     The fix to use the per CPU buffer to copy from user space was
     needed for both the trace_maker and trace_maker_raw file.

     The fix for reading from user space into per CPU buffers properly
     fixed the trace_marker write function, but the trace_marker_raw
     file wasn't fixed properly. The user space data was correctly
     written into the per CPU buffer, but the code that wrote into the
     ring buffer still used the user space pointer and not the per CPU
     buffer that had the user space data already written.

   - Stop the fortify string warning from writing into trace_marker_raw

     After converting the copy_from_user_nofault() into a memcpy(),
     another issue appeared. As writes to the trace_marker_raw expects
     binary data, the first entry is a 4 byte identifier. The entry
     structure is defined as:

     struct {
   	struct trace_entry ent;
   	int id;
   	char buf[];
     };

     The size of this structure is reserved on the ring buffer with:

       size = sizeof(*entry) + cnt;

     Then it is copied from the buffer into the ring buffer with:

       memcpy(&entry->id, buf, cnt);

     This use to be a copy_from_user_nofault(), but now converting it to
     a memcpy() triggers the fortify-string code, and causes a warning.

     The allocated space is actually more than what is copied, as the
     cnt used also includes the entry->id portion. Allocating
     sizeof(*entry) plus cnt is actually allocating 4 bytes more than
     what is needed.

     Change the size function to:

       size = struct_size(entry, buf, cnt - sizeof(entry->id));

     And update the memcpy() to unsafe_memcpy()"

* tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Stop fortify-string from warning in tracing_mark_raw_write()
  tracing: Fix tracing_mark_raw_write() to use buf and not ubuf
2025-10-11 16:06:04 -07:00
Linus Torvalds
c04022dccb Merge tag 'kbuild-fixes-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux
Pull Kbuild fixes from Nathan Chancellor:

 - Fix UAPI types check in headers_check.pl

 - Only enable -Werror for hostprogs with CONFIG_WERROR / W=e

 - Ignore fsync() error when output of gen_init_cpio is a pipe

 - Several little build fixes for recent modules.builtin.modinfo series

* tag 'kbuild-fixes-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux:
  kbuild: Use '--strip-unneeded-symbol' for removing module device table symbols
  s390/vmlinux.lds.S: Move .vmlinux.info to end of allocatable sections
  kbuild: Add '.rel.*' strip pattern for vmlinux
  kbuild: Restore pattern to avoid stripping .rela.dyn from vmlinux
  gen_init_cpio: Ignore fsync() returning EINVAL on pipes
  scripts/Makefile.extrawarn: Respect CONFIG_WERROR / W=e for hostprogs
  kbuild: uapi: Strip comments before size type check
2025-10-11 15:47:12 -07:00
Wolfram Sang
a8482d2c90 Revert "i2c: boardinfo: Annotate code used in init phase only"
This reverts commit 1a2b423be6 because we
got a regression report and need time to find out the details.

Reported-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Closes: https://lore.kernel.org/r/29ec0082-4dd4-4120-acd2-44b35b4b9487@oss.qualcomm.com
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
2025-10-11 23:57:33 +02:00