Commit Graph

49602 Commits

Tejun Heo
61debc251c sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode
Bypass mode routes tasks through fallback dispatch queues. Originally this was
a single global DSQ; b7b3b2dbae ("sched_ext: Split the global DSQ per NUMA
node") changed it to per-node DSQs to resolve NUMA-related livelocks.

Dan Schatzberg found per-node DSQs can still livelock when many threads are
pinned to different small CPU subsets: each CPU must scan many incompatible
tasks to find runnable ones, causing severe contention with high CPU counts.

Switch to per-CPU bypass DSQs. Each task queues on its current CPU. Default
idle CPU selection and direct dispatch handle most cases well.

This introduces a failure mode when tasks concentrate on one CPU in
over-saturated systems. If the BPF scheduler severely skews placement before
triggering bypass, that CPU's queue may be too long to drain, causing RCU
stalls. A load balancer in a future patch will address this. The bypass DSQ is
separate from local DSQ to enable load balancing: local DSQs use rq locks,
preventing efficient scanning and transfer across CPUs, especially problematic
when systems are already contended.
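
As a rough sketch of the resulting enqueue path (the bypass_dsq field in
scx_rq and its exact placement in do_enqueue_task() are assumptions here, not
the literal patch):

  if (unlikely(scx_rq_bypassing(rq))) {
          /*
           * Queue on the current CPU's own bypass DSQ. Keeping this
           * separate from the local DSQ means a later load balancer can
           * scan and move tasks without taking per-CPU rq locks.
           */
          dispatch_enqueue(sch, &rq->scx.bypass_dsq, p, enq_flags);
          return;
  }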

v2: Clarified why bypass DSQ is separate from local DSQ (Andrea Righi).

Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12 06:43:44 -10:00
Tejun Heo
3546119f18 sched_ext: Refactor do_enqueue_task() local and global DSQ paths
The local and global DSQ enqueue paths in do_enqueue_task() share the same
slice refill logic. Factor out the common code into a shared enqueue label.
This makes adding new enqueue cases easier. No functional changes.

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12 06:43:44 -10:00
Tejun Heo
bfd3749d48 sched_ext: Use shorter slice in bypass mode
There have been reported cases of bypass mode not making forward progress fast
enough. The 20ms default slice is unnecessarily long for bypass mode where the
primary goal is ensuring all tasks can make forward progress.

Introduce SCX_SLICE_BYPASS set to 5ms and make the scheduler automatically
switch to it when entering bypass mode. Also make the bypass slice value
tunable through the slice_bypass_us module parameter (adjustable between 100us
and 100ms) to make it easier to test whether slice durations are a factor in
problem cases.
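
A minimal sketch of the slice handling described above, assuming a
scx_slice_dfl variable accessed with READ_ONCE()/WRITE_ONCE() (names other
than SCX_SLICE_DFL/SCX_SLICE_BYPASS are illustrative):

  #define SCX_SLICE_DFL    (20 * NSEC_PER_MSEC)   /* 20ms default */
  #define SCX_SLICE_BYPASS ( 5 * NSEC_PER_MSEC)   /*  5ms in bypass mode */

  static u64 scx_slice_dfl = SCX_SLICE_DFL;

  /* called when entering/leaving bypass mode */
  static void scx_set_dfl_slice(bool bypass)
  {
          WRITE_ONCE(scx_slice_dfl, bypass ? SCX_SLICE_BYPASS : SCX_SLICE_DFL);
  }

  /* slice refill sites then do: p->scx.slice = READ_ONCE(scx_slice_dfl); */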

v3: Use READ_ONCE/WRITE_ONCE for scx_slice_dfl access (Dan).

v2: Removed slice_dfl_us module parameter. Fixed typos (Andrea).

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12 06:43:43 -10:00
Tejun Heo
5a629ecbcd sched_ext: Mark racy bitfields to prevent adding fields that can't tolerate races
The warned bitfields in struct scx_sched are updated racily from concurrent
CPUs causing RMW races, which is fine for these boolean warning flags. Add a
comment marking this area to prevent future fields that can't tolerate racy
updates from being added here.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-05 12:07:09 -10:00
Tejun Heo
d723f36e01 sched_ext: Minor cleanups to scx_task_iter
- Use memset() in scx_task_iter_start() instead of zeroing fields individually.

- In scx_task_iter_next(), move __scx_task_iter_maybe_relock() after the batch
  check, which is simpler.

- Update comment to reflect that tasks are removed from scx_tasks when dead
  (commit 7900aa699c ("sched_ext: Fix cgroup exit ordering by moving
  sched_ext_free() to finish_task_switch()")).

No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-04 11:46:25 -10:00
Tejun Heo
023af03cae sched_ext: Move __SCX_DSQ_ITER_ALL_FLAGS BUILD_BUG_ON to the right place
The BUILD_BUG_ON() which checks that __SCX_DSQ_ITER_ALL_FLAGS doesn't
overlap with the private lnode bits was in scx_task_iter_start() which has
nothing to do with DSQ iteration. Move it to bpf_iter_scx_dsq_new() where it
belongs.

No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-04 11:46:24 -10:00
Tejun Heo
7900aa699c sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()
sched_ext_free() was called from __put_task_struct() when the last reference
to the task is dropped, which could be long after the task has finished
running. This causes cgroup-related problems:

- ops.init_task() can be called with a cgroup which didn't get
  ops.cgroup_init()'d during scheduler load, because the cgroup might be
  destroyed/unlinked while the zombie or dead task is still lingering on the
  scx_tasks list.

- ops.cgroup_exit() could be called before ops.exit_task() is called on all
  member tasks, leading to incorrect exit ordering.

Fix by moving it to finish_task_switch() to be called right after the final
context switch away from the dying task, matching when sched_class->task_dead()
is called. Rename it to sched_ext_dead() to match the new calling context.

By calling sched_ext_dead() before cgroup_task_dead(), we ensure that:

- Tasks visible on scx_tasks list have valid cgroups during scheduler load,
  as cgroup_mutex prevents cgroup destruction while the task is still linked.

- All member tasks have ops.exit_task() called and are removed from scx_tasks
  before the cgroup can be destroyed and trigger ops.cgroup_exit().

This fix is made possible by the cgroup_task_dead() split in the previous patch.

This also makes more sense resource-wise as there's no point in keeping
scheduler side resources around for dead tasks.
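
Schematically, the relevant ordering in finish_task_switch() ends up looking
like this (only the dead-task branch is shown):

  unsigned int prev_state = READ_ONCE(prev->__state);

  if (unlikely(prev_state == TASK_DEAD)) {
          sched_ext_dead(prev);    /* ops.exit_task(), drop from scx_tasks */
          cgroup_task_dead(prev);  /* css_set unlink; cgroup may go empty */
  }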

Reported-by: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-03 11:57:30 -10:00
Tejun Heo
587eb08a5f sched_ext: Merge branch 'for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-6.19
Pull cgroup/for-6.19 to receive:

 16dad7801a ("cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()")
 260fbcb92b ("cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()")
 d245698d72 ("cgroup: Defer task cgroup unlink until after the task is done switching out")

These are needed for the sched_ext cgroup exit ordering fix.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-03 11:57:26 -10:00
Tejun Heo
d245698d72 cgroup: Defer task cgroup unlink until after the task is done switching out
When a task exits, css_set_move_task(tsk, cset, NULL, false) unlinks the task
from its cgroup. From the cgroup's perspective, the task is now gone. If this
makes the cgroup empty, it can be removed, triggering ->css_offline() callbacks
that notify controllers the cgroup is going offline resource-wise.

However, the exiting task can still run, perform memory operations, and schedule
until the final context switch in finish_task_switch(). This creates a confusing
situation where controllers are told a cgroup is offline while resource
activities are still happening in it. While this hasn't broken existing
controllers, it has caused direct confusion for sched_ext schedulers.

Split cgroup_task_exit() into two functions. cgroup_task_exit() now only calls
the subsystem exit callbacks and continues to be called from do_exit(). The
css_set cleanup is moved to the new cgroup_task_dead() which is called from
finish_task_switch() after the final context switch, so that the cgroup only
appears empty after the task is truly done running.

This also reorders operations so that subsys->exit() is now called before
unlinking from the cgroup, which shouldn't break anything.

Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-03 11:46:18 -10:00
Tejun Heo
260fbcb92b cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()
Currently, cgroup_task_exit() adds thread group leaders with live member
threads to their css_set's dying_tasks list (so cgroup.procs iteration can
still see the leader), and cgroup_task_release() later removes them with
list_del_init(&task->cg_list).

An upcoming patch will defer the dying_tasks list addition, moving it from
cgroup_task_exit() (called from do_exit()) to a new function called from
finish_task_switch(). However, release_task() (which calls
cgroup_task_release()) can run either before or after finish_task_switch(),
creating a race where cgroup_task_release() might try to remove the task from
dying_tasks before or while it's being added.

Move the list_del_init() from cgroup_task_release() to cgroup_task_free() to
fix this race. cgroup_task_free() runs from __put_task_struct(), which is
always after both paths, making the cleanup safe.

Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-03 11:46:18 -10:00
Tejun Heo
16dad7801a cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()
The current names cgroup_exit(), cgroup_release(), and cgroup_free() are
confusing because they look like they're operating on cgroups themselves when
they're actually task lifecycle hooks. For example, cgroup_init() initializes
the cgroup subsystem while cgroup_exit() is a task exit notification to
cgroup. Rename them to cgroup_task_exit(), cgroup_task_release(), and
cgroup_task_free() to make it clear that these operate on tasks.

Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-03 11:46:18 -10:00
Tejun Heo
a3f5d48222 sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
The ops.cpu_acquire/release() callbacks miss events under multiple conditions.
There are two distinct task dispatch gaps that can cause cpu_released flag
desynchronization:

1. balance-to-pick_task gap: This is what was originally reported. balance_scx()
   can enqueue a task, but during consume_remote_task() when the rq lock is
   released, a higher priority task can be enqueued and ultimately picked while
   cpu_released remains false. This gap is closeable via RETRY_TASK handling.

2. ttwu-to-pick_task gap: ttwu() can directly dispatch a task to a CPU's local
   DSQ. By the time the sched path runs on the target CPU, higher class tasks may
   already be queued. In such cases, nothing on sched_ext side will be invoked,
   and the only solution would be a hook invoked regardless of sched class, which
   isn't desirable.

Rather than adding invasive core hooks, BPF schedulers can use generic BPF
mechanisms like tracepoints. From an SCX scheduler's perspective, this is
congruent with other mechanisms it already uses and doesn't add further
friction.

The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when
a CPU gets preempted by a higher priority scheduling class. However, the old
scx_bpf_reenqueue_local() could only be called from cpu_release() context.

Add a new version of scx_bpf_reenqueue_local() that can be called from any
context by deferring the actual re-enqueue operation. This eliminates the need
for cpu_acquire/release() ops entirely. Schedulers can now use standard BPF
mechanisms like the sched_switch tracepoint to detect and handle CPU preemption.

Update scx_qmap to demonstrate the new approach using sched_switch instead of
cpu_release, with compat support for older kernels. Mark cpu_acquire/release()
as deprecated. The old scx_bpf_reenqueue_local() variant will be removed in
v6.23.
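
A hedged BPF-side sketch of the approach described above for scx_qmap (the
exact predicate and program layout here are assumptions, not the actual
scx_qmap change):

  #include <scx/common.bpf.h>

  /* scheduling policy values per include/uapi/linux/sched.h */
  #define SCHED_FIFO      1
  #define SCHED_RR        2
  #define SCHED_DEADLINE  6
  #define SCHED_EXT       7

  SEC("tp_btf/sched_switch")
  int BPF_PROG(on_sched_switch, bool preempt, struct task_struct *prev,
               struct task_struct *next)
  {
          /* an SCX task was switched out for a higher-class (RT/DL) task */
          if (prev->policy == SCHED_EXT &&
              (next->policy == SCHED_FIFO || next->policy == SCHED_RR ||
               next->policy == SCHED_DEADLINE))
                  scx_bpf_reenqueue_local();
          return 0;
  }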

Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
Link: https://lore.kernel.org/all/8d64c74118c6440f81bcf5a4ac6b9f00@honor.com/
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-29 05:29:04 -10:00
Tejun Heo
8803e6a7fb sched_ext: Factor out reenq_local() from scx_bpf_reenqueue_local()
Factor out the core re-enqueue logic from scx_bpf_reenqueue_local() into a
new reenq_local() helper function. scx_bpf_reenqueue_local() now handles the
BPF kfunc checks and calls reenq_local() to perform the actual work.

This is a prep patch to allow reenq_local() to be called from other contexts.

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-29 05:29:04 -10:00
Tejun Heo
180b4ac342 sched_ext: Split schedule_deferred() into locked and unlocked variants
schedule_deferred() currently requires the rq lock to be held so that it can
use scheduler hooks for efficiency when available. However, there are cases
where deferred actions need to be scheduled from contexts that don't hold the
rq lock.

Split into schedule_deferred() which can be called from any context and just
queues irq_work, and schedule_deferred_locked() which requires the rq lock and
can optimize by using scheduler hooks when available. Update the existing call
site to use the _locked variant.

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-29 05:29:04 -10:00
Tejun Heo
2d697e5f5a Merge branch 'for-6.18-fixes' into for-6.19
Pull to receive f4fa7c25f6 ("sched_ext: Fix use of uninitialized variable
in scx_bpf_cpuperf_set()"), which conflicts with the planned sub-sched
changes.
2025-10-29 05:18:13 -10:00
Andrea Righi
f4fa7c25f6 sched_ext: Fix use of uninitialized variable in scx_bpf_cpuperf_set()
scx_bpf_cpuperf_set() has a typo where it dereferences the local
variable @sch, instead of the global @scx_root pointer. Fix by
dereferencing the correct variable.

Fixes: 956f2b11a8 ("sched_ext: Drop kf_cpu_valid()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-29 05:14:39 -10:00
Tejun Heo
b7d4b28db7 sched_ext: Use SCX_TASK_READY test instead of tryget_task_struct() during class switch
ddf7233fca ("sched/ext: Fix invalid task state transitions on class
switch") added a tryget_task_struct() test to scx_enable()'s class-switching
loop. The reason for the addition was to avoid enabling tasks which skipped
prep in the previous loop due to being dead.

While tryget_task_struct() does work for this purpose, since a task that fails
tryget once will keep failing it, it's a bit roundabout. A more direct way is
testing whether the task is in READY state. Switch to testing SCX_TASK_READY
directly.
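
Schematically, the loop's check becomes something like (helper name is an
assumption):

  /* in scx_enable()'s class-switching loop */
  if (scx_get_task_state(p) != SCX_TASK_READY)
          continue;       /* prep was skipped earlier, e.g. task already dead */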

Cc: Andrea Righi <arighi@nvidia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-28 11:38:34 -10:00
Andrea Righi
71d7847cad sched_ext: Fix scx_bpf_dsq_peek() with FIFO DSQs
When removing a task from a FIFO DSQ, we must delete it from the list
before updating dsq->first_task, otherwise the following lookup will
just re-read the same task, leaving first_task pointing to the removed
entry.

This issue only affects DSQs operating in FIFO mode, as priority DSQs
correctly update the rbtree before re-evaluating the new first task.

Remove the item from the list before refreshing the first task to
guarantee the correct behavior in FIFO DSQs.
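
The fix amounts to an ordering change along these lines (a schematic; whether
the store uses rcu_assign_pointer() or a plain WRITE_ONCE() is an assumption):

  /* removing @p from a FIFO DSQ */
  struct task_struct *first;

  list_del_init(&p->scx.dsq_list.node);              /* 1) unlink first */
  first = list_first_entry_or_null(&dsq->list,       /* 2) then re-read */
                                   struct task_struct, scx.dsq_list.node);
  rcu_assign_pointer(dsq->first_task, first);        /*    for the peek */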

Fixes: 44f5c8ec5b ("sched_ext: Add lockless peek operation for DSQs")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-24 12:20:24 -10:00
Tejun Heo
e67708823d sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()
The find_user_dsq() function is called from contexts that are already
under RCU read lock protection. Switch from rhashtable_lookup_fast() to
rhashtable_lookup() to avoid redundant RCU locking.
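
Schematically (table and key names are placeholders; the parameters follow the
rhashtable API):

  /* find_user_dsq() runs with the caller's rcu_read_lock() already held */
  -       dsq = rhashtable_lookup_fast(&dsq_hash, &dsq_id, dsq_hash_params);
  +       dsq = rhashtable_lookup(&dsq_hash, &dsq_id, dsq_hash_params);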

Requires: bee8a520eb ("rhashtable: Use rcu_dereference_all and rcu_dereference_all_check")
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-22 11:55:06 -10:00
Tejun Heo
987e00035c sched_ext: Rename pnt_seq to kick_sync
The pnt_seq field and related infrastructure were originally named for
"pick next task sequence", reflecting their original implementation in
scx_next_task_picked(). However, the sequence counter is now incremented in
both put_prev_task_scx() and pick_task_scx() and its purpose is to
synchronize kick operations via SCX_KICK_WAIT, not specifically to track
pick_next_task events.

Rename to better reflect the actual semantics:
- pnt_seq -> kick_sync
- scx_kick_pseqs -> scx_kick_syncs
- pseqs variables -> ksyncs
- Update comments to refer to "kick_sync sequence" instead of "pick_task
  sequence"

This is a pure renaming with no functional changes.

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-22 11:42:14 -10:00
Tejun Heo
a379fa1e2c sched_ext: Fix SCX_KICK_WAIT to work reliably
SCX_KICK_WAIT is used to synchronously wait for the target CPU to complete
a reschedule and can be used to implement operations like core scheduling.

This used to be implemented by scx_next_task_picked() incrementing pnt_seq,
which was always called when a CPU picks the next task to run, allowing
SCX_KICK_WAIT to reliably wait for the target CPU to enter the scheduler and
pick the next task.

However, commit b999e365c2 ("sched_ext: Replace scx_next_task_picked()
with switch_class()") replaced scx_next_task_picked() with the
switch_class() callback, which is only called when switching between sched
classes. This broke SCX_KICK_WAIT because pnt_seq would no longer be
reliably incremented unless the previous task was SCX and the next task was
not.

This fix leverages commit 4c95380701 ("sched/ext: Fold balance_scx() into
pick_task_scx()") which refactored the pick path making put_prev_task_scx()
the natural place to track task switches for SCX_KICK_WAIT. The fix moves
pnt_seq increment to put_prev_task_scx() and also increments it in
pick_task_scx() to handle cases where the same task is re-selected, whether
by BPF scheduler decision or slice refill. The semantics: If the current
task on the target CPU is SCX, SCX_KICK_WAIT waits until the CPU enters the
scheduling path. This provides sufficient guarantee for use cases like core
scheduling while keeping the operation self-contained within SCX.
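
On the waiting side this boils down to roughly the following (field access and
snapshot variable are schematic; smp_cond_load_acquire() per the v2 note):

  /* snapshot the target CPU's pnt_seq when issuing SCX_KICK_WAIT ... */
  unsigned long seq = READ_ONCE(target_rq->scx.pnt_seq);

  /* ... then wait for the target to enter the scheduling path */
  smp_cond_load_acquire(&target_rq->scx.pnt_seq, VAL != seq);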

v2: - Also increment pnt_seq in pick_task_scx() to handle same-task
      re-selection (Andrea Righi).
    - Use smp_cond_load_acquire() for the busy-wait loop for better
      architecture optimization (Peter Zijlstra).

Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
Link: http://lkml.kernel.org/r/228ebd9e6ed3437996dffe15735a9caa@honor.com
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-22 11:42:14 -10:00
Tejun Heo
a9c1fbbd6d sched_ext: Don't kick CPUs running higher classes
When a sched_ext scheduler tries to kick a CPU, the CPU may be running a
higher class task. sched_ext has no control over such CPUs. A sched_ext
scheduler couldn't have expected to get access to the CPU after kicking it
anyway. Skip kicking when the target CPU is running a higher class.
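
Schematically, the kick path gains a check along these lines (placement is an
assumption; sched_class_above() is the existing class-ordering helper):

  if (sched_class_above(rq->curr->sched_class, &ext_sched_class))
          return;         /* CPU is owned by a higher class; kicking won't help */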

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-22 11:42:14 -10:00
Tejun Heo
2dbbdeda77 sched_ext: Fix scx_bpf_dsq_insert() backward binary compatibility
cded46d971 ("sched_ext: Make scx_bpf_dsq_insert*() return bool")
introduced a new bool-returning scx_bpf_dsq_insert() and renamed the old
void-returning version to scx_bpf_dsq_insert___compat, with the expectation
that libbpf would match old binaries to the ___compat variant, maintaining
backward binary compatibility. However, while libbpf ignores ___suffix on
the BPF side when matching symbols, it doesn't do so for kernel-side symbols.
Old binaries compiled with the original scx_bpf_dsq_insert() could no longer
resolve the symbol.

Fix by reversing the naming: Keep scx_bpf_dsq_insert() as the old
void-returning interface and add ___v2 to the new bool-returning version.
This allows old binaries to continue working while new code can use the
___v2 variant. Once libbpf is updated to ignore kernel-side ___SUFFIX, the
___v2 suffix can be dropped when the compat interface is removed.
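
The resulting kernel-side pair, per the description above (the parameter list
mirrors the existing scx_bpf_dsq_insert() interface):

  /* old void-returning interface, name kept so old binaries keep resolving */
  __bpf_kfunc void scx_bpf_dsq_insert(struct task_struct *p, u64 dsq_id,
                                      u64 slice, u64 enq_flags);

  /* new bool-returning variant, published under the ___v2 name for now */
  __bpf_kfunc bool scx_bpf_dsq_insert___v2(struct task_struct *p, u64 dsq_id,
                                           u64 slice, u64 enq_flags);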

v2: Use ___v2 instead of ___new.

Fixes: cded46d971 ("sched_ext: Make scx_bpf_dsq_insert*() return bool")
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-21 10:40:15 -10:00
Waiman Long
d5cf4d34a3 cgroup/cpuset: Don't track # of local child partitions
The cpuset structure has a nr_subparts field which tracks the number
of child local partitions underneath a particular cpuset. Right now,
nr_subparts is only used in partition_is_populated() to avoid iteration
of child cpusets if the condition is right. So by always performing the
child iteration, we can avoid tracking the number of child partitions
and simplify the code a bit.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-20 07:01:48 -10:00
Linus Torvalds
d9043c79ba Merge tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:

 - Make sure the check for lost pelt idle time is done unconditionally
   to have correct lost idle time accounting

 - Stop the deadline server task before a CPU goes offline

* tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix pelt lost idle time detection
  sched/deadline: Stop dl_server before CPU goes offline
2025-10-19 04:59:43 -10:00
Linus Torvalds
343b4b44a1 Merge tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:

 - Make sure perf reporting works correctly in setups using
   overlayfs or FUSE

 - Move the uprobe optimization to a better location logically

* tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix MMAP2 event device with backing files
  perf/core: Fix MMAP event path names with backing files
  perf/core: Fix address filter match with backing files
  uprobe: Move arch_uprobe_optimize right after handlers execution
2025-10-19 04:54:08 -10:00
Andrea Righi
67fa319f5f sched_ext: Allow forcibly picking an scx task
Refactor pick_task_scx() adding a new argument to forcibly pick a
SCHED_EXT task, ignoring any higher-priority sched class activity.

This refactoring prepares the code for future scenarios, e.g., allowing
the ext dl_server to force a SCHED_EXT task selection.

No functional changes.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-18 13:01:35 -10:00
Tejun Heo
70d837c3e0 sched_ext: Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.19
Pull in tip/sched/core to receive:

 50653216e4 ("sched: Add support to pick functions to take rf")
 4c95380701 ("sched/ext: Fold balance_scx() into pick_task_scx()")

which will enable clean integration of DL server support among other things.

This conflicts with the following from sched_ext/for-6.18-fixes:

 a8ad873113 ("sched_ext: defer queue_balance_callback() until after ops.dispatch")

which adds maybe_queue_balance_callback() to balance_scx() which is removed
by 50653216e4. Resolve by moving the invocation to pick_task_scx() in the
equivalent location.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-16 08:45:38 -10:00
Tejun Heo
075e3f7206 sched_ext: Merge branch 'for-6.18-fixes' into for-6.19
Pull sched_ext/for-6.18-fixes to sync trees to receive:

 05e63305c8 ("sched_ext: Fix scx_kick_pseqs corruption on concurrent scheduler loads")

to avoid conflicts with planned cgroup sub-sched support.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-16 08:34:12 -10:00
Emil Tsalapatis
a3c4a0a42e sched_ext: fix flag check for deferred callbacks
When scheduling the deferred balance callbacks, check SCX_RQ_BAL_CB_PENDING
instead of SCX_RQ_BAL_PENDING. This way schedule_deferred() properly tests
whether there is already a pending request for queue_balance_callback() to
be invoked at the end of .balance().

Fixes: a8ad873113 ("sched_ext: defer queue_balance_callback() until after ops.dispatch")
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-16 08:34:00 -10:00
Shardul Bankar
f6fddc6df3 bpf: Fix memory leak in __lookup_instance error path
When __lookup_instance() allocates a func_instance structure but fails
to allocate the must_write_set array, it returns an error without freeing
the previously allocated func_instance. This causes a memory leak of 192
bytes (sizeof(struct func_instance)) each time this error path is triggered.

Fix by freeing 'result' on must_write_set allocation failure.
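
The shape of the fix (a schematic of the error path, not the verifier code
itself; the element count is illustrative):

  result = kzalloc(sizeof(*result), GFP_KERNEL_ACCOUNT);
  if (!result)
          return ERR_PTR(-ENOMEM);

  result->must_write_set = kcalloc(nframes, sizeof(*result->must_write_set),
                                   GFP_KERNEL_ACCOUNT);
  if (!result->must_write_set) {
          kfree(result);          /* previously leaked on this path */
          return ERR_PTR(-ENOMEM);
  }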

Fixes: b3698c356a ("bpf: callchain sensitive stack liveness tracking using CFG")
Reported-by: BPF Runtime Fuzzer (BRF)
Signed-off-by: Shardul Bankar <shardulsb08@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://patch.msgid.link/20251016063330.4107547-1-shardulsb08@gmail.com
2025-10-16 10:45:17 -07:00
Peter Zijlstra
4c95380701 sched/ext: Fold balance_scx() into pick_task_scx()
With pick_task() having an rf argument, it is possible to do the
lock-break there, get rid of the weird balance/pick_task hack.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16 11:13:55 +02:00
Joel Fernandes
50653216e4 sched: Add support to pick functions to take rf
Some pick functions, like the internal pick_next_task_fair(), already take
rf but some others don't. We need this for scx's server pick function.
Prepare for this by having pick functions accept it.

[peterz: - added RETRY_TASK handling
         - removed pick_next_task_fair indirection]
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16 11:13:55 +02:00
Peter Zijlstra
1e900f415c sched: Detect per-class runqueue changes
Have enqueue/dequeue set a per-class bit in rq->queue_mask. This then
enables easy tracking of which runqueues are modified over a
lock-break.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16 11:13:55 +02:00
Peter Zijlstra
73ec89a1ce sched: Mandate shared flags for sched_change
Shrikanth noted that the sched_change pattern relies on using shared
flags.

Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16 11:13:54 +02:00
Peter Zijlstra
d4c64207b8 sched: Cleanup the sched_change NOCLOCK usage
Teach the sched_change pattern how to do update_rq_clock(); this
allows for some simplifications / cleanups.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:54 +02:00
Peter Zijlstra
5892cbd85d sched: Match __task_rq_{,un}lock()
In preparation to adding more rules to __task_rq_lock(), such that
__task_rq_unlock() will no longer be equivalent to rq_unlock(),
make sure every __task_rq_lock() is matched by a __task_rq_unlock()
and vice-versa.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:54 +02:00
Peter Zijlstra
46a177fb01 sched: Add locking comments to sched_class methods
'Document' the locking context the various sched_class methods are
called under.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:53 +02:00
Peter Zijlstra
650952d3fb sched: Make __do_set_cpus_allowed() use the sched_change pattern
Now that do_set_cpus_allowed() holds all the regular locks, convert it
to use the sched_change pattern helper.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:53 +02:00
Peter Zijlstra
b079d93796 sched: Rename do_set_cpus_allowed()
Hopefully saner naming.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:53 +02:00
Peter Zijlstra
abfc01077d sched: Fix do_set_cpus_allowed() locking
All callers of do_set_cpus_allowed() only take p->pi_lock, which is
not sufficient to actually change the cpumask. Again, this is mostly
ok in these cases, but it results in unnecessarily complicated
reasoning.

Furthermore, there is no reason whatsoever to not just take all the
required locks, so do just that.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:52 +02:00
Peter Zijlstra
942b8db965 sched: Fix migrate_disable_switch() locking
For some reason migrate_disable_switch() was more complicated than it
needed to be, resulting in mind-bending locking of dubious quality.

Recognise that migrate_disable_switch() must be called before a
context switch, but any place before that switch is equally good.
Since the current place results in troubled locking, simply move the
thing before taking rq->lock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:52 +02:00
Peter Zijlstra
6455ad5346 sched: Move sched_class::prio_changed() into the change pattern
Move sched_class::prio_changed() into the change pattern.

And while there, extend it with sched_class::get_prio() in order to
fix the deadline situation.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:52 +02:00
Peter Zijlstra
1ae5f5dfe5 sched: Cleanup sched_delayed handling for class switches
Use the new sched_class::switching_from() method to dequeue delayed
tasks before switching to another class.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16 11:13:51 +02:00
Peter Zijlstra
637b068282 sched: Fold sched_class::switch{ing,ed}_{to,from}() into the change pattern
Add {DE,EN}QUEUE_CLASS and fold the sched_class::switch* methods into
the change pattern. This completes and makes the pattern more
symmetric.

This changes the order of callbacks slightly:

  OLD                              NEW
				|
				|  switching_from()
  dequeue_task();		|  dequeue_task()
  put_prev_task();		|  put_prev_task()
				|  switched_from()
				|
  ... change task ...		|  ... change task ...
				|
  switching_to();		|  switching_to()
  enqueue_task();		|  enqueue_task()
  set_next_task();		|  set_next_task()
  prev_class->switched_from()	|
  switched_to()			|  switched_to()
				|

Notably, it moves the switched_from() callback right after the
dequeue/put. Existing implementations don't appear to be affected by
this change in location -- specifically the task isn't enqueued on the
class in question in either location.

Make (CLASS)^(SAVE|MOVE), because there is nothing to save-restore
when changing scheduling classes.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:51 +02:00
Peter Zijlstra
5e42d4c123 sched/deadline: Prepare for switched_from() change
Prepare for the sched_class::switch*() methods getting folded into the
change pattern. As a result of that, the location of switched_from
will change slightly. SCHED_DEADLINE is affected by this change in
location:

  OLD                              NEW
				|
				|  switching_from()
  dequeue_task();		|  dequeue_task()
  put_prev_task();		|  put_prev_task()
				|  switched_from()
				|
  ... change task ...		|  ... change task ...
				|
  switching_to();		|  switching_to()
  enqueue_task();		|  enqueue_task()
  set_next_task();		|  set_next_task()
  prev_class->switched_from()	|
  switched_to()			|  switched_to()
				|

Notably, where switched_from() was called *after* the change to the
task, it will get called before it. Specifically, switched_from_dl()
uses dl_task(p) which uses p->prio; which is changed when switching
class (it might be the reason to switch class in case of PI).

When switched_from_dl() gets called, the task will have left the
deadline class and dl_task() must be false, while when doing
dequeue_dl_entity() the task must be a dl_task(), otherwise we'd have
called a different dequeue method.

Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16 11:13:51 +02:00
Peter Zijlstra
376f8963bb sched: Re-arrange the {EN,DE}QUEUE flags
Ensure the matched flags are in the low word while the unmatched flags
go into the second word.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:50 +02:00
Peter Zijlstra
e9139f765a sched: Employ sched_change guards
As proposed a long while ago -- and half done by scx -- wrap the
scheduler's 'change' pattern in a guard helper.
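
Call sites then read roughly like this (the guard name comes from the subject;
the exact invocation shape and flags here are assumptions):

  scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
          p->sched_class = new_class;
          p->prio = prio;
  }
  /* dequeue/put_prev happen on entry, enqueue/set_next + callbacks on exit */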

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:50 +02:00
Adam Li
82d6e01a06 sched/fair: Only update stats for allowed CPUs when looking for dst group
Load imbalance is observed when the workload frequently forks new threads.
Due to CPU affinity, the workload can run on CPU 0-7 in the first
group, and only on CPU 8-11 in the second group. CPU 12-15 are always idle.

{ 0 1 2 3 4 5 6 7 } {8 9 10 11 12 13 14 15}
  * * * * * * * *    * * *  *

When looking for a dst group for newly forked threads, update_sg_wakeup_stats()
often reports that the second group has more idle CPUs than the first group.
The scheduler thinks the second group is less busy and selects the least busy
CPUs among CPUs 8-11. Therefore CPUs 8-11 can be crowded with newly forked
threads while CPUs 0-7 sit idle.

A task may not use all the CPUs in a schedule group due to CPU affinity.
Only update schedule group statistics for allowed CPUs.
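
The gist as a sketch (loop body abbreviated; not the exact
update_sg_wakeup_stats() hunk):

  for_each_cpu(i, sched_group_span(group)) {
          if (!cpumask_test_cpu(i, p->cpus_ptr))
                  continue;       /* @p can't run here, don't count this CPU */

          /* ... accumulate idle/load statistics for CPU i ... */
  }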

Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16 11:13:50 +02:00
Tim Chen
06f2c90885 sched: Create architecture specific sched domain distances
Allow architecture specific sched domain NUMA distances that are
modified from actual NUMA node distances for the purpose of building
NUMA sched domains.

Keep actual NUMA distances separately if modified distances
are used for building sched domains. Such distances
are still needed as NUMA balancing benefits from finding the
NUMA nodes that are actually closer to a task numa_group.

Consolidate the recording of unique NUMA distances in an array to
sched_record_numa_dist() so the function can be reused to record NUMA
distances when the NUMA distance metric is changed.

No functional change, and no additional distance array is allocated
if no arch-specific NUMA distances are defined.

Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
2025-10-16 11:13:49 +02:00