linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-02-19 14:47:06 -05:00

Author	SHA1	Message	Date
Tejun Heo	5a629ecbcd	sched_ext: Mark racy bitfields to prevent adding fields that can't tolerate races The warned bitfields in struct scx_sched are updated racily from concurrent CPUs causing RMW races, which is fine for these boolean warning flags. Add a comment marking this area to prevent future fields that can't tolerate racy updates from being added here. Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-05 12:07:09 -10:00
Tejun Heo	d723f36e01	sched_ext: Minor cleanups to scx_task_iter - Use memset() in scx_task_iter_start() instead of zeroing fields individually. - In scx_task_iter_next(), move __scx_task_iter_maybe_relock() after the batch check which is simpler. - Update comment to reflect that tasks are removed from scx_tasks when dead (commit `7900aa699c` ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()")). No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-04 11:46:25 -10:00
Tejun Heo	023af03cae	sched_ext: Move __SCX_DSQ_ITER_ALL_FLAGS BUILD_BUG_ON to the right place The BUILD_BUG_ON() which checks that __SCX_DSQ_ITER_ALL_FLAGS doesn't overlap with the private lnode bits was in scx_task_iter_start() which has nothing to do with DSQ iteration. Move it to bpf_iter_scx_dsq_new() where it belongs. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-04 11:46:24 -10:00
Tejun Heo	7900aa699c	sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch() sched_ext_free() was called from __put_task_struct() when the last reference to the task is dropped, which could be long after the task has finished running. This causes cgroup-related problems: - ops.init_task() can be called on a cgroup which didn't get ops.cgroup_init()'d during scheduler load, because the cgroup might be destroyed/unlinked while the zombie or dead task is still lingering on the scx_tasks list. - ops.cgroup_exit() could be called before ops.exit_task() is called on all member tasks, leading to incorrect exit ordering. Fix by moving it to finish_task_switch() to be called right after the final context switch away from the dying task, matching when sched_class->task_dead() is called. Rename it to sched_ext_dead() to match the new calling context. By calling sched_ext_dead() before cgroup_task_dead(), we ensure that: - Tasks visible on scx_tasks list have valid cgroups during scheduler load, as cgroup_mutex prevents cgroup destruction while the task is still linked. - All member tasks have ops.exit_task() called and are removed from scx_tasks before the cgroup can be destroyed and trigger ops.cgroup_exit(). This fix is made possible by the cgroup_task_dead() split in the previous patch. This also makes more sense resource-wise as there's no point in keeping scheduler side resources around for dead tasks. Reported-by: Dan Schatzberg <dschatzberg@meta.com> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-03 11:57:30 -10:00
Tejun Heo	587eb08a5f	sched_ext: Merge branch 'for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-6.19 Pull cgroup/for-6.19 to receive: `16dad7801a` ("cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()") `260fbcb92b` ("cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()") `d245698d72` ("cgroup: Defer task cgroup unlink until after the task is done switching out") These are needed for the sched_ext cgroup exit ordering fix. Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-03 11:57:26 -10:00
Tejun Heo	d245698d72	cgroup: Defer task cgroup unlink until after the task is done switching out When a task exits, css_set_move_task(tsk, cset, NULL, false) unlinks the task from its cgroup. From the cgroup's perspective, the task is now gone. If this makes the cgroup empty, it can be removed, triggering ->css_offline() callbacks that notify controllers the cgroup is going offline resource-wise. However, the exiting task can still run, perform memory operations, and schedule until the final context switch in finish_task_switch(). This creates a confusing situation where controllers are told a cgroup is offline while resource activities are still happening in it. While this hasn't broken existing controllers, it has caused direct confusion for sched_ext schedulers. Split cgroup_task_exit() into two functions. cgroup_task_exit() now only calls the subsystem exit callbacks and continues to be called from do_exit(). The css_set cleanup is moved to the new cgroup_task_dead() which is called from finish_task_switch() after the final context switch, so that the cgroup only appears empty after the task is truly done running. This also reorders operations so that subsys->exit() is now called before unlinking from the cgroup, which shouldn't break anything. Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-03 11:46:18 -10:00
Tejun Heo	260fbcb92b	cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free() Currently, cgroup_task_exit() adds thread group leaders with live member threads to their css_set's dying_tasks list (so cgroup.procs iteration can still see the leader), and cgroup_task_release() later removes them with list_del_init(&task->cg_list). An upcoming patch will defer the dying_tasks list addition, moving it from cgroup_task_exit() (called from do_exit()) to a new function called from finish_task_switch(). However, release_task() (which calls cgroup_task_release()) can run either before or after finish_task_switch(), creating a race where cgroup_task_release() might try to remove the task from dying_tasks before or while it's being added. Move the list_del_init() from cgroup_task_release() to cgroup_task_free() to fix this race. cgroup_task_free() runs from __put_task_struct(), which is always after both paths, making the cleanup safe. Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-03 11:46:18 -10:00
Tejun Heo	16dad7801a	cgroup: Rename cgroup lifecycle hooks to cgroup_task_*() The current names cgroup_exit(), cgroup_release(), and cgroup_free() are confusing because they look like they're operating on cgroups themselves when they're actually task lifecycle hooks. For example, cgroup_init() initializes the cgroup subsystem while cgroup_exit() is a task exit notification to cgroup. Rename them to cgroup_task_exit(), cgroup_task_release(), and cgroup_task_free() to make it clear that these operate on tasks. Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-03 11:46:18 -10:00
Tejun Heo	a3f5d48222	sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere The ops.cpu_acquire/release() callbacks miss events under multiple conditions. There are two distinct task dispatch gaps that can cause cpu_released flag desynchronization: 1. balance-to-pick_task gap: This is what was originally reported. balance_scx() can enqueue a task, but during consume_remote_task() when the rq lock is released, a higher priority task can be enqueued and ultimately picked while cpu_released remains false. This gap is closeable via RETRY_TASK handling. 2. ttwu-to-pick_task gap: ttwu() can directly dispatch a task to a CPU's local DSQ. By the time the sched path runs on the target CPU, higher class tasks may already be queued. In such cases, nothing on sched_ext side will be invoked, and the only solution would be a hook invoked regardless of sched class, which isn't desirable. Rather than adding invasive core hooks, BPF schedulers can use generic BPF mechanisms like tracepoints. From SCX scheduler's perspective, this is congruent with other mechanisms it already uses and doesn't add further friction. The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when a CPU gets preempted by a higher priority scheduling class. However, the old scx_bpf_reenqueue_local() could only be called from cpu_release() context. Add a new version of scx_bpf_reenqueue_local() that can be called from any context by deferring the actual re-enqueue operation. This eliminates the need for cpu_acquire/release() ops entirely. Schedulers can now use standard BPF mechanisms like the sched_switch tracepoint to detect and handle CPU preemption. Update scx_qmap to demonstrate the new approach using sched_switch instead of cpu_release, with compat support for older kernels. Mark cpu_acquire/release() as deprecated. The old scx_bpf_reenqueue_local() variant will be removed in v6.23. Reported-by: Wen-Fang Liu <liuwenfang@honor.com> Link: https://lore.kernel.org/all/8d64c74118c6440f81bcf5a4ac6b9f00@honor.com/ Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-29 05:29:04 -10:00
Tejun Heo	8803e6a7fb	sched_ext: Factor out reenq_local() from scx_bpf_reenqueue_local() Factor out the core re-enqueue logic from scx_bpf_reenqueue_local() into a new reenq_local() helper function. scx_bpf_reenqueue_local() now handles the BPF kfunc checks and calls reenq_local() to perform the actual work. This is a prep patch to allow reenq_local() to be called from other contexts. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-29 05:29:04 -10:00
Tejun Heo	180b4ac342	sched_ext: Split schedule_deferred() into locked and unlocked variants schedule_deferred() currently requires the rq lock to be held so that it can use scheduler hooks for efficiency when available. However, there are cases where deferred actions need to be scheduled from contexts that don't hold the rq lock. Split into schedule_deferred() which can be called from any context and just queues irq_work, and schedule_deferred_locked() which requires the rq lock and can optimize by using scheduler hooks when available. Update the existing call site to use the _locked variant. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-29 05:29:04 -10:00
Tejun Heo	2d697e5f5a	Merge branch 'for-6.18-fixes' into for-6.19 Pull to receive `f4fa7c25f6` ("sched_ext: Fix use of uninitialized variable in scx_bpf_cpuperf_set()") which conflicts with changes for planned sub-sched changes.	2025-10-29 05:18:13 -10:00
Andrea Righi	f4fa7c25f6	sched_ext: Fix use of uninitialized variable in scx_bpf_cpuperf_set() scx_bpf_cpuperf_set() has a typo where it dereferences the local variable @sch, instead of the global @scx_root pointer. Fix by dereferencing the correct variable. Fixes: `956f2b11a8` ("sched_ext: Drop kf_cpu_valid()") Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-29 05:14:39 -10:00
Tejun Heo	b7d4b28db7	sched_ext: Use SCX_TASK_READY test instead of tryget_task_struct() during class switch `ddf7233fca` ("sched/ext: Fix invalid task state transitions on class switch") added tryget_task_struct() test during scx_enable()'s class switching loop. The reason for the addition was to avoid enabling tasks which skipped prep in the previous loop due to being dead. While tryget_task_struct() does work for this purpose as tasks that fail tryget always will fail it, it's a bit roundabout. A more direct way is testing whether the task is in READY state. Switch to testing SCX_TASK_READY directly. Cc: Andrea Righi <arighi@nvidia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-28 11:38:34 -10:00
Andrea Righi	71d7847cad	sched_ext: Fix scx_bpf_dsq_peek() with FIFO DSQs When removing a task from a FIFO DSQ, we must delete it from the list before updating dsq->first_task, otherwise the following lookup will just re-read the same task, leaving first_task pointing to removed entry. This issue only affects DSQs operating in FIFO mode, as priority DSQs correctly update the rbtree before re-evaluating the new first task. Remove the item from the list before refreshing the first task to guarantee the correct behavior in FIFO DSQs. Fixes: `44f5c8ec5b` ("sched_ext: Add lockless peek operation for DSQs") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-24 12:20:24 -10:00
Tejun Heo	e67708823d	sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() The find_user_dsq() function is called from contexts that are already under RCU read lock protection. Switch from rhashtable_lookup_fast() to rhashtable_lookup() to avoid redundant RCU locking. Requires: `bee8a520eb` ("rhashtable: Use rcu_dereference_all and rcu_dereference_all_check") Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-22 11:55:06 -10:00
Tejun Heo	987e00035c	sched_ext: Rename pnt_seq to kick_sync The pnt_seq field and related infrastructure were originally named for "pick next task sequence", reflecting their original implementation in scx_next_task_picked(). However, the sequence counter is now incremented in both put_prev_task_scx() and pick_task_scx() and its purpose is to synchronize kick operations via SCX_KICK_WAIT, not specifically to track pick_next_task events. Rename to better reflect the actual semantics: - pnt_seq -> kick_sync - scx_kick_pseqs -> scx_kick_syncs - pseqs variables -> ksyncs - Update comments to refer to "kick_sync sequence" instead of "pick_task sequence" This is a pure renaming with no functional changes. Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-22 11:42:14 -10:00
Tejun Heo	a379fa1e2c	sched_ext: Fix SCX_KICK_WAIT to work reliably SCX_KICK_WAIT is used to synchronously wait for the target CPU to complete a reschedule and can be used to implement operations like core scheduling. This used to be implemented by scx_next_task_picked() incrementing pnt_seq, which was always called when a CPU picks the next task to run, allowing SCX_KICK_WAIT to reliably wait for the target CPU to enter the scheduler and pick the next task. However, commit `b999e365c2` ("sched_ext: Replace scx_next_task_picked() with switch_class()") replaced scx_next_task_picked() with the switch_class() callback, which is only called when switching between sched classes. This broke SCX_KICK_WAIT because pnt_seq would no longer be reliably incremented unless the previous task was SCX and the next task was not. This fix leverages commit `4c95380701` ("sched/ext: Fold balance_scx() into pick_task_scx()") which refactored the pick path making put_prev_task_scx() the natural place to track task switches for SCX_KICK_WAIT. The fix moves pnt_seq increment to put_prev_task_scx() and also increments it in pick_task_scx() to handle cases where the same task is re-selected, whether by BPF scheduler decision or slice refill. The semantics: If the current task on the target CPU is SCX, SCX_KICK_WAIT waits until the CPU enters the scheduling path. This provides sufficient guarantee for use cases like core scheduling while keeping the operation self-contained within SCX. v2: - Also increment pnt_seq in pick_task_scx() to handle same-task re-selection (Andrea Righi). - Use smp_cond_load_acquire() for the busy-wait loop for better architecture optimization (Peter Zijlstra). Reported-by: Wen-Fang Liu <liuwenfang@honor.com> Link: http://lkml.kernel.org/r/228ebd9e6ed3437996dffe15735a9caa@honor.com Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-22 11:42:14 -10:00
Tejun Heo	a9c1fbbd6d	sched_ext: Don't kick CPUs running higher classes When a sched_ext scheduler tries to kick a CPU, the CPU may be running a higher class task. sched_ext has no control over such CPUs. A sched_ext scheduler couldn't have expected to get access to the CPU after kicking it anyway. Skip kicking when the target CPU is running a higher class. Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-22 11:42:14 -10:00
Tejun Heo	2dbbdeda77	sched_ext: Fix scx_bpf_dsq_insert() backward binary compatibility `cded46d971` ("sched_ext: Make scx_bpf_dsq_insert() return bool") introduced a new bool-returning scx_bpf_dsq_insert() and renamed the old void-returning version to scx_bpf_dsq_insert___compat, with the expectation that libbpf would match old binaries to the ___compat variant, maintaining backward binary compatibility. However, while libbpf ignores ___suffix on the BPF side when matching symbols, it doesn't do so for kernel-side symbols. Old binaries compiled with the original scx_bpf_dsq_insert() could no longer resolve the symbol. Fix by reversing the naming: Keep scx_bpf_dsq_insert() as the old void-returning interface and add ___v2 to the new bool-returning version. This allows old binaries to continue working while new code can use the ___v2 variant. Once libbpf is updated to ignore kernel-side ___SUFFIX, the ___v2 suffix can be dropped when the compat interface is removed. v2: Use ___v2 instead of ___new. Fixes: `cded46d971` ("sched_ext: Make scx_bpf_dsq_insert() return bool") Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-21 10:40:15 -10:00
Waiman Long	d5cf4d34a3	cgroup/cpuset: Don't track # of local child partitions The cpuset structure has a nr_subparts field which tracks the number of child local partitions underneath a particular cpuset. Right now, nr_subparts is only used in partition_is_populated() to avoid iteration of child cpusets if the condition is right. So by always performing the child iteration, we can avoid tracking the number of child partitions and simplify the code a bit. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-20 07:01:48 -10:00
Linus Torvalds	d9043c79ba	Merge tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Make sure the check for lost pelt idle time is done unconditionally to have correct lost idle time accounting - Stop the deadline server task before a CPU goes offline * tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix pelt lost idle time detection sched/deadline: Stop dl_server before CPU goes offline	2025-10-19 04:59:43 -10:00
Linus Torvalds	343b4b44a1	Merge tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - Make sure perf reporting works correctly in setups using overlayfs or FUSE - Move the uprobe optimization to a better location logically * tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix MMAP2 event device with backing files perf/core: Fix MMAP event path names with backing files perf/core: Fix address filter match with backing files uprobe: Move arch_uprobe_optimize right after handlers execution	2025-10-19 04:54:08 -10:00
Andrea Righi	67fa319f5f	sched_ext: Allow forcibly picking an scx task Refactor pick_task_scx() adding a new argument to forcibly pick a SCHED_EXT task, ignoring any higher-priority sched class activity. This refactoring prepares the code for future scenarios, e.g., allowing the ext dl_server to force a SCHED_EXT task selection. No functional changes. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-18 13:01:35 -10:00
Tejun Heo	70d837c3e0	sched_ext: Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.19 Pull in tip/sched/core to receive: `50653216e4` ("sched: Add support to pick functions to take rf") `4c95380701` ("sched/ext: Fold balance_scx() into pick_task_scx()") which will enable clean integration of DL server support among other things. This conflicts with the following from sched_ext/for-6.18-fixes: `a8ad873113` ("sched_ext: defer queue_balance_callback() until after ops.dispatch") which adds maybe_queue_balance_callback() to balance_scx() which is removed by `50653216e4`. Resolve by moving the invocation to pick_task_scx() in the equivalent location. Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-16 08:45:38 -10:00
Tejun Heo	075e3f7206	sched_ext: Merge branch 'for-6.18-fixes' into for-6.19 Pull sched_ext/for-6.18-fixes to sync trees to receive: `05e63305c8` ("sched_ext: Fix scx_kick_pseqs corruption on concurrent scheduler loads") to avoid conflicts with planned cgroup sub-sched support. Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-16 08:34:12 -10:00
Emil Tsalapatis	a3c4a0a42e	sched_ext: fix flag check for deferred callbacks When scheduling the deferred balance callbacks, check SCX_RQ_BAL_CB_PENDING instead of SCX_RQ_BAL_PENDING. This way schedule_deferred() properly tests whether there is already a pending request for queue_balance_callback() to be invoked at the end of .balance(). Fixes: `a8ad873113` ("sched_ext: defer queue_balance_callback() until after ops.dispatch") Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-16 08:34:00 -10:00
Shardul Bankar	f6fddc6df3	bpf: Fix memory leak in __lookup_instance error path When __lookup_instance() allocates a func_instance structure but fails to allocate the must_write_set array, it returns an error without freeing the previously allocated func_instance. This causes a memory leak of 192 bytes (sizeof(struct func_instance)) each time this error path is triggered. Fix by freeing 'result' on must_write_set allocation failure. Fixes: `b3698c356a` ("bpf: callchain sensitive stack liveness tracking using CFG") Reported-by: BPF Runtime Fuzzer (BRF) Signed-off-by: Shardul Bankar <shardulsb08@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://patch.msgid.link/20251016063330.4107547-1-shardulsb08@gmail.com	2025-10-16 10:45:17 -07:00
Peter Zijlstra	4c95380701	sched/ext: Fold balance_scx() into pick_task_scx() With pick_task() having an rf argument, it is possible to do the lock-break there, get rid of the weird balance/pick_task hack. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org>	2025-10-16 11:13:55 +02:00
Joel Fernandes	50653216e4	sched: Add support to pick functions to take rf Some pick functions like the internal pick_next_task_fair() already take rf but some others dont. We need this for scx's server pick function. Prepare for this by having pick functions accept it. [peterz: - added RETRY_TASK handling - removed pick_next_task_fair indirection] Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org>	2025-10-16 11:13:55 +02:00
Peter Zijlstra	1e900f415c	sched: Detect per-class runqueue changes Have enqueue/dequeue set a per-class bit in rq->queue_mask. This then enables easy tracking of which runqueues are modified over a lock-break. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org>	2025-10-16 11:13:55 +02:00
Peter Zijlstra	73ec89a1ce	sched: Mandate shared flags for sched_change Shrikanth noted that sched_change pattern relies on using shared flags. Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2025-10-16 11:13:54 +02:00
Peter Zijlstra	d4c64207b8	sched: Cleanup the sched_change NOCLOCK usage Teach the sched_change pattern how to do update_rq_clock(); this allows for some simplifications / cleanups. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>	2025-10-16 11:13:54 +02:00
Peter Zijlstra	5892cbd85d	sched: Match __task_rq_{,un}lock() In preparation to adding more rules to __task_rq_lock(), such that __task_rq_unlock() will no longer be equivalent to rq_unlock(), make sure every __task_rq_lock() is matched by a __task_rq_unlock() and vice-versa. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>	2025-10-16 11:13:54 +02:00
Peter Zijlstra	46a177fb01	sched: Add locking comments to sched_class methods 'Document' the locking context the various sched_class methods are called under. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>	2025-10-16 11:13:53 +02:00
Peter Zijlstra	650952d3fb	sched: Make __do_set_cpus_allowed() use the sched_change pattern Now that do_set_cpus_allowed() holds all the regular locks, convert it to use the sched_change pattern helper. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>	2025-10-16 11:13:53 +02:00
Peter Zijlstra	b079d93796	sched: Rename do_set_cpus_allowed() Hopefully saner naming. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>	2025-10-16 11:13:53 +02:00
Peter Zijlstra	abfc01077d	sched: Fix do_set_cpus_allowed() locking All callers of do_set_cpus_allowed() only take p->pi_lock, which is not sufficient to actually change the cpumask. Again, this is mostly ok in these cases, but it results in unnecessarily complicated reasoning. Furthermore, there is no reason what so ever to not just take all the required locks, so do just that. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>	2025-10-16 11:13:52 +02:00
Peter Zijlstra	942b8db965	sched: Fix migrate_disable_switch() locking For some reason migrate_disable_switch() was more complicated than it needs to be, resulting in mind bending locking of dubious quality. Recognise that migrate_disable_switch() must be called before a context switch, but any place before that switch is equally good. Since the current place results in troubled locking, simply move the thing before taking rq->lock. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>	2025-10-16 11:13:52 +02:00
Peter Zijlstra	6455ad5346	sched: Move sched_class::prio_changed() into the change pattern Move sched_class::prio_changed() into the change pattern. And while there, extend it with sched_class::get_prio() in order to fix the deadline sitation. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>	2025-10-16 11:13:52 +02:00
Peter Zijlstra	1ae5f5dfe5	sched: Cleanup sched_delayed handling for class switches Use the new sched_class::switching_from() method to dequeue delayed tasks before switching to another class. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org>	2025-10-16 11:13:51 +02:00
Peter Zijlstra	637b068282	sched: Fold sched_class::switch{ing,ed}_{to,from}() into the change pattern Add {DE,EN}QUEUE_CLASS and fold the sched_class::switch* methods into the change pattern. This completes and makes the pattern more symmetric. This changes the order of callbacks slightly: OLD NEW \| \| switching_from() dequeue_task(); \| dequeue_task() put_prev_task(); \| put_prev_task() \| switched_from() \| ... change task ... \| ... change task ... \| switching_to(); \| switching_to() enqueue_task(); \| enqueue_task() set_next_task(); \| set_next_task() prev_class->switched_from() \| switched_to() \| switched_to() \| Notably, it moves the switched_from() callback right after the dequeue/put. Existing implementations don't appear to be affected by this change in location -- specifically the task isn't enqueued on the class in question in either location. Make (CLASS)^(SAVE\|MOVE), because there is nothing to save-restore when changing scheduling classes. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>	2025-10-16 11:13:51 +02:00
Peter Zijlstra	5e42d4c123	sched/deadline: Prepare for switched_from() change Prepare for the sched_class::switch() methods getting folded into the change pattern. As a result of that, the location of switched_from will change slightly. SCHED_DEADLINE is affected by this change in location: OLD NEW \| \| switching_from() dequeue_task(); \| dequeue_task() put_prev_task(); \| put_prev_task() \| switched_from() \| ... change task ... \| ... change task ... \| switching_to(); \| switching_to() enqueue_task(); \| enqueue_task() set_next_task(); \| set_next_task() prev_class->switched_from() \| switched_to() \| switched_to() \| Notably, where switched_from() was called after* the change to the task, it will get called before it. Specifically, switched_from_dl() uses dl_task(p) which uses p->prio; which is changed when switching class (it might be the reason to switch class in case of PI). When switched_from_dl() gets called, the task will have left the deadline class and dl_task() must be false, while when doing dequeue_dl_entity() the task must be a dl_task(), otherwise we'd have called a different dequeue method. Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2025-10-16 11:13:51 +02:00
Peter Zijlstra	376f8963bb	sched: Re-arrange the {EN,DE}QUEUE flags Ensure the matched flags are in the low word while the unmatched flags go into the second word. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>	2025-10-16 11:13:50 +02:00
Peter Zijlstra	e9139f765a	sched: Employ sched_change guards As proposed a long while ago -- and half done by scx -- wrap the scheduler's 'change' pattern in a guard helper. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>	2025-10-16 11:13:50 +02:00
Adam Li	82d6e01a06	sched/fair: Only update stats for allowed CPUs when looking for dst group Load imbalance is observed when the workload frequently forks new threads. Due to CPU affinity, the workload can run on CPU 0-7 in the first group, and only on CPU 8-11 in the second group. CPU 12-15 are always idle. { 0 1 2 3 4 5 6 7 } {8 9 10 11 12 13 14 15} * * * * * * * * * * * * When looking for dst group for newly forked threads, in many times update_sg_wakeup_stats() reports the second group has more idle CPUs than the first group. The scheduler thinks the second group is less busy. Then it selects least busy CPUs among CPU 8-11. Therefore CPU 8-11 can be crowded with newly forked threads, at the same time CPU 0-7 can be idle. A task may not use all the CPUs in a schedule group due to CPU affinity. Only update schedule group statistics for allowed CPUs. Signed-off-by: Adam Li <adamli@os.amperecomputing.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2025-10-16 11:13:50 +02:00
Tim Chen	06f2c90885	sched: Create architecture specific sched domain distances Allow architecture specific sched domain NUMA distances that are modified from actual NUMA node distances for the purpose of building NUMA sched domains. Keep actual NUMA distances separately if modified distances are used for building sched domains. Such distances are still needed as NUMA balancing benefits from finding the NUMA nodes that are actually closer to a task numa_group. Consolidate the recording of unique NUMA distances in an array to sched_record_numa_dist() so the function can be reused to record NUMA distances when the NUMA distance metric is changed. No functional change and additional distance array allocated if there're no arch specific NUMA distances being defined. Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com>	2025-10-16 11:13:49 +02:00
Doug Berger	382748c05e	sched/deadline: only set free_cpus for online runqueues Commit `16b269436b` ("sched/deadline: Modify cpudl::free_cpus to reflect rd->online") introduced the cpudl_set/clear_freecpu functions to allow the cpu_dl::free_cpus mask to be manipulated by the deadline scheduler class rq_on/offline callbacks so the mask would also reflect this state. Commit `9659e1eeee` ("sched/deadline: Remove cpu_active_mask from cpudl_find()") removed the check of the cpu_active_mask to save some processing on the premise that the cpudl::free_cpus mask already reflected the runqueue online state. Unfortunately, there are cases where it is possible for the cpudl_clear function to set the free_cpus bit for a CPU when the deadline runqueue is offline. When this occurs while a CPU is connected to the default root domain the flag may retain the bad state after the CPU has been unplugged. Later, a different CPU that is transitioning through the default root domain may push a deadline task to the powered down CPU when cpudl_find sees its free_cpus bit is set. If this happens the task will not have the opportunity to run. One example is outlined here: https://lore.kernel.org/lkml/20250110233010.2339521-1-opendmb@gmail.com Another occurs when the last deadline task is migrated from a CPU that has an offlined runqueue. The dequeue_task member of the deadline scheduler class will eventually call cpudl_clear and set the free_cpus bit for the CPU. This commit modifies the cpudl_clear function to be aware of the online state of the deadline runqueue so that the free_cpus mask can be updated appropriately. It is no longer necessary to manage the mask outside of the cpudl_set/clear functions so the cpudl_set/clear_freecpu functions are removed. In addition, since the free_cpus mask is now only updated under the cpudl lock the code was changed to use the non-atomic __cpumask functions. Signed-off-by: Doug Berger <opendmb@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2025-10-16 11:13:49 +02:00
Fernand Sieber	79104becf4	sched/fair: Forfeit vruntime on yield If a task yields, the scheduler may decide to pick it again. The task in turn may decide to yield immediately or shortly after, leading to a tight loop of yields. If there's another runnable task as this point, the deadline will be increased by the slice at each loop. This can cause the deadline to runaway pretty quickly, and subsequent elevated run delays later on as the task doesn't get picked again. The reason the scheduler can pick the same task again and again despite its deadline increasing is because it may be the only eligible task at that point. Fix this by making the task forfeiting its remaining vruntime and pushing the deadline one slice ahead. This implements yield behavior more authentically. We limit the forfeiting to eligible tasks. This is because core scheduling prefers running ineligible tasks rather than force idling. As such, without the condition, we can end up on a yield loop which makes the vruntime increase rapidly, leading to anomalous run delays later down the line. Fixes: `147f3efaa2` ("sched/fair: Implement an EEVDF-like scheduling policy") Signed-off-by: Fernand Sieber <sieberf@amazon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250401123622.584018-1-sieberf@amazon.com Link: https://lore.kernel.org/r/20250911095113.203439-1-sieberf@amazon.com Link: https://lore.kernel.org/r/20250916140228.452231-1-sieberf@amazon.com	2025-10-16 11:13:49 +02:00
Ryan Newton	44f5c8ec5b	sched_ext: Add lockless peek operation for DSQs The builtin DSQ queue data structures are meant to be used by a wide range of different sched_ext schedulers with different demands on these data structures. They might be per-cpu with low-contention, or high-contention shared queues. Unfortunately, DSQs have a coarse-grained lock around the whole data structure. Without going all the way to a lock-free, more scalable implementation, a small step we can take to reduce lock contention is to allow a lockless, small-fixed-cost peek at the head of the queue. This change allows certain custom SCX schedulers to cheaply peek at queues, e.g. during load balancing, before locking them. But it represents a few extra memory operations to update the pointer each time the DSQ is modified, including a memory barrier on ARM so the write appears correctly ordered. This commit adds a first_task pointer field which is updated atomically when the DSQ is modified, and allows any thread to peek at the head of the queue without holding the lock. Signed-off-by: Ryan Newton <newton@meta.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-15 06:46:25 -10:00

1 2 3 4 5 ...

49599 Commits