linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-06-09 04:30:33 -04:00

Author	SHA1	Message	Date
Yang Yingliang	fe7a11c78d	sched/core: Fix unbalance set_rq_online/offline() in sched_cpu_deactivate() If cpuset_cpu_inactive() fails, set_rq_online() needs to be called to roll back. Fixes: `120455c514` ("sched: Fix hotplug vs CPU bandwidth control") Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240703031610.587047-5-yangyingliang@huaweicloud.com	2024-07-29 12:22:33 +02:00
Yang Yingliang	2f02735412	sched/core: Introduce sched_set_rq_on/offline() helper Introduce the sched_set_rq_on/offline() helpers so that they can be called easily from both the normal and error paths. No functional change. Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240703031610.587047-4-yangyingliang@huaweicloud.com	2024-07-29 12:22:32 +02:00
Yang Yingliang	e22f910a26	sched/smt: Fix unbalance sched_smt_present dec/inc I got the following warn report while doing stress test: jump label: negative count! WARNING: CPU: 3 PID: 38 at kernel/jump_label.c:263 static_key_slow_try_dec+0x9d/0xb0 Call Trace: <TASK> __static_key_slow_dec_cpuslocked+0x16/0x70 sched_cpu_deactivate+0x26e/0x2a0 cpuhp_invoke_callback+0x3ad/0x10d0 cpuhp_thread_fun+0x3f5/0x680 smpboot_thread_fn+0x56d/0x8d0 kthread+0x309/0x400 ret_from_fork+0x41/0x70 ret_from_fork_asm+0x1b/0x30 </TASK> Because when cpuset_cpu_inactive() fails in sched_cpu_deactivate(), the cpu offline failed, but sched_smt_present is decremented before calling sched_cpu_deactivate(), it leads to unbalanced dec/inc, so fix it by incrementing sched_smt_present in the error path. Fixes: `c5511d03ec` ("sched/smt: Make sched_smt_present track topology") Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20240703031610.587047-3-yangyingliang@huaweicloud.com	2024-07-29 12:22:32 +02:00
Yang Yingliang	31b164e2e4	sched/smt: Introduce sched_smt_present_inc/dec() helper Introduce the sched_smt_present_inc/dec() helpers so that they can be called easily from both the normal and error paths. No functional change. Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240703031610.587047-2-yangyingliang@huaweicloud.com	2024-07-29 12:22:32 +02:00
Zheng Zucheng	77baa5bafc	sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime In extreme test scenarios: the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime, utime = 18446744073709518790 ns, rtime = 135989749728000 ns In cputime_adjust() process, stime is greater than rtime due to mul_u64_u64_div_u64() precision problem. before call mul_u64_u64_div_u64(), stime = 175136586720000, rtime = 135989749728000, utime = 1416780000. after call mul_u64_u64_div_u64(), stime = 135989949653530 unsigned reversion occurs because rtime is less than stime. utime = rtime - stime = 135989749728000 - 135989949653530 = -199925530 = (u64)18446744073709518790 Trigger condition: 1). User task run in kernel mode most of time 2). ARM64 architecture 3). TICK_CPU_ACCOUNTING=y CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime Fixes: `3dc167ba57` ("sched/cputime: Improve cputime_adjust()") Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20240726023235.217771-1-zhengzucheng@huawei.com	2024-07-29 12:22:32 +02:00
Valentin Schneider	d65d411c92	treewide: context_tracking: Rename CONTEXT_* into CT_STATE_* Context tracking state related symbols currently use a mix of the CONTEXT_ (e.g. CONTEXT_KERNEL) and CT_STATE_ (e.g. CT_STATE_MASK) prefixes. Clean up the naming and make the ctx_state enum use the CT_STATE_ prefix. Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>	2024-07-29 07:33:10 +05:30
Joel Granados	78eb4ea25c	sysctl: treewide: constify the ctl_table argument of proc_handlers const qualify the struct ctl_table argument in the proc_handler function signatures. This is a prerequisite to moving the static ctl_table structs into .rodata data which will ensure that proc_handler function pointers cannot be modified. This patch has been generated by the following coccinelle script: ``` virtual patch @r1@ identifier ctl, write, buffer, lenp, ppos; identifier func !~ "appldata_(timer\|interval)_handler\|sched_(rt\|rr)_handler\|rds_tcp_skbuf_handler\|proc_sctp_do_(hmac_alg\|rto_min\|rto_max\|udp_port\|alpha_beta\|auth\|probe_interval)"; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int write, void *buffer, size_t *lenp, loff_t *ppos); @r2@ identifier func, ctl, write, buffer, lenp, ppos; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int write, void *buffer, size_t *lenp, loff_t *ppos) { ... } @r3@ identifier func; @@ int func( - struct ctl_table * + const struct ctl_table * ,int , void *, size_t *, loff_t *); @r4@ identifier func, ctl; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int , void *, size_t *, loff_t *); @r5@ identifier func, write, buffer, lenp, ppos; @@ int func( - struct ctl_table * + const struct ctl_table * ,int write, void *buffer, size_t *lenp, loff_t *ppos); ``` Code formatting was adjusted in xfs_sysctl.c to comply with code conventions. The xfs_stats_clear_proc_handler, xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax were adjusted. * The ctl_table argument in proc_watchdog_common was const qualified. This is called from a proc_handler itself and is calling back into another proc_handler, making it necessary to change it as part of the proc_handler migration. Co-developed-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Co-developed-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Joel Granados <j.granados@samsung.com>	2024-07-24 20:59:29 +02:00
Linus Torvalds	4a996d90b9	Merge tag 'sched-core-2024-07-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Update Daniel Bristot de Oliveira's entry in MAINTAINERS, and credit him in CREDITS - Harmonize the lock-yielding behavior on dynamically selected preemption models with static ones - Reorganize the code a bit: split out sched/syscalls.c to reduce the size of sched/core.c - Micro-optimize psi_group_change() - Fix set_load_weight() for SCHED_IDLE tasks - Misc cleanups & fixes * tag 'sched-core-2024-07-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Update MAINTAINERS and CREDITS sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks sched/psi: Optimise psi_group_change a bit sched/core: Drop spinlocks on contention iff kernel is preemptible sched/core: Move preempt_model_*() helpers from sched.h to preempt.h sched/balance: Skip unnecessary updates to idle load balancer's flags idle: Remove stale RCU comment sched/headers: Move struct pre-declarations to the beginning of the header sched/core: Clean up kernel/sched/sched.h a bit sched/core: Simplify prefetch_curr_exec_start() sched: Fix spelling in comments sched/syscalls: Split out kernel/sched/syscalls.c from kernel/sched/core.c	2024-07-16 17:00:50 -07:00
Linus Torvalds	9855e87328	Merge tag 'rcu.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: - Update Tasks RCU and Tasks Rude RCU description in Requirements.rst and clarify rcu_assign_pointer() and rcu_dereference() ordering properties - Add lockdep assertions for RCU readers, limit inline wakeups for callback-bypass synchronize_rcu(), add an rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter, add Uladzislau Rezki as RCU maintainer, and fix a subtle callback-migration memory-ordering issue - Remove a number of redundant memory barriers - Remove unnecessary bypass-list lock-contention mitigation, use parking API instead of open-coded ad-hoc equivalent, and upgrade obsolete comments - Revert avoidance of a deadlock that can no longer occur and properly synchronize Tasks Trace RCU checking of runqueues - Add tests for handling of double-call_rcu() bug, add missing MODULE_DESCRIPTION, and add a script that histograms the number of calls to RCU updaters - Fill out SRCU polled-grace-period API * tag 'rcu.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (29 commits) rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU invocation rcu: Eliminate lockless accesses to rcu_sync->gp_count MAINTAINERS: Add Uladzislau Rezki as RCU maintainer rcu: Add rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter rcu/exp: Remove redundant full memory barrier at the end of GP rcu: Remove full memory barrier on RCU stall printout rcu: Remove full memory barrier on boot time eqs sanity check rcu/exp: Remove superfluous full memory barrier upon first EQS snapshot rcu: Remove superfluous full memory barrier upon first EQS snapshot rcu: Remove full ordering on second EQS snapshot srcu: Fill out polled grace-period APIs srcu: Update cleanup_srcu_struct() comment srcu: Add NUM_ACTIVE_SRCU_POLL_OLDSTATE srcu: Disable interrupts directly in srcu_gp_end() rcu: Disable interrupts directly in rcu_gp_init() rcu/tree: Reduce wake up for synchronize_rcu() common case rcu/tasks: Fix stale task snaphot for Tasks Trace tools/rcu: Add rcu-updaters.sh script rcutorture: Add missing MODULE_DESCRIPTION() macros rcutorture: Fix rcu_torture_fwd_cb_cr() data race ...	2024-07-15 15:25:27 -07:00
Jiapeng Chong	8bb30798fd	sched_ext: Fixes incorrect type in bpf_scx_init() type_id is defined as u32, so the check if (type_id < 0) is invalid; change its type to s32. ./kernel/sched/ext.c:4958:5-12: WARNING: Unsigned expression compared with zero: type_id < 0. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9523 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-14 18:10:10 -10:00
Tejun Heo	5b26f7b920	sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches In ops.dispatch(), SCX_DSQ_LOCAL_ON can be used to dispatch the task to the local DSQ of any CPU. However, during direct dispatch from ops.select_cpu() and ops.enqueue(), this isn't allowed. This is because dispatching to the local DSQ of a remote CPU requires locking both the task's current and new rq's and such double locking can't be done directly from ops.enqueue(). While waking up a task, as ops.select_cpu() can pick any CPU and both ops.select_cpu() and ops.enqueue() can use SCX_DSQ_LOCAL as the dispatch target to dispatch to the DSQ of the picked CPU, the BPF scheduler can still do whatever it wants to do. However, while a task is being enqueued for a different reason, e.g. after its slice expiration, only ops.enqueue() is called and there's no way for the BPF scheduler to directly dispatch to the local DSQ of a remote CPU. This gap in API forces schedulers into work-arounds which are not straightforward or optimal such as skipping direct dispatches in such cases. Implement deferred enqueueing to allow directly dispatching to the local DSQ of a remote CPU from ops.select_cpu() and ops.enqueue(). Such tasks are temporarily queued on rq->scx.ddsp_deferred_locals. When the rq lock can be safely released, the tasks are taken off the list and queued on the target local DSQs using dispatch_to_local_dsq(). v2: - Add missing return after queue_balance_callback() in schedule_deferred(). (David). - dispatch_to_local_dsq() now assumes that @rq is locked but unpinned and thus no longer takes @rf. Updated accordingly. - UP build warning fix. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Andrea Righi <righi.andrea@gmail.com> Acked-by: David Vernet <void@manifault.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Changwoo Min <changwoo@igalia.com>	2024-07-12 08:20:33 -10:00
Tejun Heo	f47a818950	sched_ext: s/SCX_RQ_BALANCING/SCX_RQ_IN_BALANCE/ and add SCX_RQ_IN_WAKEUP SCX_RQ_BALANCING is used to mark that the rq is currently in balance(). Rename it to SCX_RQ_IN_BALANCE and add SCX_RQ_IN_WAKEUP which marks whether the rq is currently enqueueing for a wakeup. This will be used to implement direct dispatching to local DSQ of another CPU. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>	2024-07-12 08:20:33 -10:00
Tejun Heo	3cf78c5d01	sched_ext: Unpin and repin rq lock from balance_scx() sched_ext often needs to migrate tasks across CPUs right before execution and thus uses the balance path to dispatch tasks from the BPF scheduler. balance_scx() is called with rq locked and pinned but is passed @rf and thus allowed to unpin and unlock. Currently, @rf is passed down the call stack so the rq lock is unpinned just when double locking is needed. This creates unnecessary complications such as having to explicitly manipulate lock pinning for core scheduling. We also want to use dispatch_to_local_dsq_lock() from other paths which are called with rq locked but unpinned. rq lock handling in the dispatch path is straightforward outside the migration implementation and extending the pinning protection down the callstack doesn't add enough meaningful extra protection to justify the extra complexity. Unpin and repin rq lock from the outer balance_scx() and drop @rf passing and lock pinning handling from the inner functions. UP is updated to call balance_one() instead of balance_scx() to avoid adding NULL @rf handling to balance_scx(). As this makes balance_scx() unused in UP, it's put inside a CONFIG_SMP block. No functional changes intended outside of lock annotation updates. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Righi <righi.andrea@gmail.com>	2024-07-12 08:20:32 -10:00
Tejun Heo	d6a05910d2	sched_ext: Open-code task_linked_on_dsq() task_linked_on_dsq() exists as a helper because it used to test both the rbtree and list nodes. It now only tests the list node and the list node will soon be used for something else too. The helper doesn't improve anything materially and the naming will become confusing. Open-code the list node testing and remove task_linked_on_dsq() Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>	2024-07-12 08:20:32 -10:00
Tejun Heo	fc283116d0	sched: Move struct balance_callback definition upward Move struct balance_callback definition upward so that it's visible to class-specific rq struct definitions. This will be used to embed a struct balance_callback in struct scx_rq. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org>	2024-07-12 08:20:32 -10:00
Ingo Molnar	011b1134b8	Merge branch 'sched/urgent' into sched/core, to pick up fixes and refresh the branch Signed-off-by: Ingo Molnar <mingo@kernel.org>	2024-07-11 10:42:33 +02:00
Tejun Heo	e7a6395a88	sched_ext: Make scx_bpf_reenqueue_local() skip tasks that are being migrated When a running task is migrated to another CPU, the stop_task is used to preempt the running task and migrate it. This, expectedly, invokes ops.cpu_release(). If the BPF scheduler then calls scx_bpf_reenqueue_local(), it re-enqueues all tasks on the local DSQ including the task which is being migrated. This creates an unnecessary re-enqueue of a task which is about to be deactivated and re-activated for migration anyway. It can also cause confusion for the BPF scheduler as scx_bpf_task_cpu() of the task and its allowed CPUs may not agree while migration is pending. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `245254f708` ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()") Acked-by: David Vernet <void@manifault.com>	2024-07-09 12:30:26 -10:00
Tejun Heo	fd0cf51695	sched_ext: Reimplement scx_bpf_reenqueue_local() scx_bpf_reenqueue_local() is used to re-enqueue tasks on the local DSQ from ops.cpu_release(). Because the BPF scheduler may dispatch tasks to the same local DSQ, to avoid processing the same tasks repeatedly, it first takes the number of queued tasks and processes the task at the head of the queue that number of times. This is incorrect as a task can be dispatched to the same local DSQ with SCX_ENQ_HEAD. Such a task will be processed repeatedly until the count is exhausted and the succeeding tasks won't be processed at all. Fix it by first moving all candidate tasks to a private list and then processing that list. While at it, remove the WARNs. They're rather superfluous as later steps will check them anyway. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `245254f708` ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()") Acked-by: David Vernet <void@manifault.com>	2024-07-09 12:30:26 -10:00
Tejun Heo	650ba21b13	sched_ext: Implement DSQ iterator DSQs are very opaque in the consumption path. The BPF scheduler has no way of knowing which tasks are being considered and which is picked. This patch adds BPF DSQ iterator. - Allows iterating tasks queued on a DSQ in the dispatch order or reverse from anywhere using bpf_for_each(scx_dsq) or calling the iterator kfuncs directly. - Has ordering guarantee where only tasks which were already queued when the iteration started are visible and consumable during the iteration. v5: - Add a comment to the naked list_empty(&dsq->list) test in consume_dispatch_q() to explain the reasoning behind the lockless test and by extension why nldsq_next_task() isn't used there. - scx_qmap changes separated into its own patch. v4: - bpf_iter_scx_dsq_new() declaration in common.bpf.h was using the wrong type for the last argument (bool rev instead of u64 flags). Fix it. v3: - Alexei pointed out that the iterator is too big to allocate on stack. Added a prep patch to reduce the size of the cursor. Now bpf_iter_scx_dsq is 48 bytes and bpf_iter_scx_dsq_kern is 40 bytes on 64bit. - u32_before() comparison factored out. v2: - scx_bpf_consume_task() is separated out into a separate patch. - DSQ seq and iter flags don't need to be u64. Use u32. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: bpf@vger.kernel.org	2024-07-08 14:30:55 -10:00
Tejun Heo	d4af01c373	sched_ext: Take out ->priq and ->flags from scx_dsq_node struct scx_dsq_node contains two data structure nodes to link the containing task to a DSQ and a flags field that is protected by the lock of the associated DSQ. One reason why they are grouped into a struct is to use the type independently as a cursor node when iterating tasks on a DSQ. However, when iterating, the cursor only needs to be linked on the FIFO list and the rb_node part ends up inflating the size of the iterator data structure unnecessarily making it potentially too expensive to place it on stack. Take ->priq and ->flags out of scx_dsq_node and put them in sched_ext_entity as ->dsq_priq and ->dsq_flags, respectively. scx_dsq_node is renamed to scx_dsq_list_node and the field names are renamed accordingly. This will help implementing DSQ task iterator that can be allocated on stack. No functional change intended. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: David Vernet <void@manifault.com>	2024-07-08 14:30:55 -10:00
Tejun Heo	e196c908f9	sched, sched_ext: Move some declarations from kernel/sched/ext.h to sched.h While sched_ext was out of tree, everything sched_ext specific which can be put in kernel/sched/ext.h was put there to ease forward porting. However, kernel/sched/sched.h is the better location for some of them. Relocate. - struct sched_enq_and_set_ctx, sched_deq_and_put_task() and sched_enq_and_set_task(). - scx_enabled() and scx_switched_all(). - for_active_class_range() and for_each_active_class(). sched_class declarations are moved above the class iterators for this. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: David Vernet <void@manifault.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de>	2024-07-08 09:39:48 -10:00
Tejun Heo	744d83601f	sched, sched_ext: Open code for_balance_class_range() For flexibility, sched_ext allows the BPF scheduler to select the CPU to execute a task on at dispatch time so that e.g. a queue can be shared across multiple CPUs. To enable this, the dispatch path is executed from balance() so that a dispatched task can be hot-migrated to its target CPU. This means that sched_ext needs its balance() method invoked before every pick_next_task() even when the CPU is waking up from SCHED_IDLE. for_balance_class_range() defined in kernel/sched/ext.h implements this selective iteration promotion. However, the indirection obfuscates more than helps. Open code the iteration promotion in put_prev_task_balance() and remove for_balance_class_range(). No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Vernet <void@manifault.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de>	2024-07-08 09:39:48 -10:00
Tejun Heo	6ab228ecc3	sched_ext: Minor cleanups in kernel/sched/ext.h - scx_ops_cpu_preempt is only used in kernel/sched/ext.c and doesn't need to be global. Make it static. - Relocate task_on_scx() so that the inline functions are located together. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>	2024-07-08 09:39:48 -10:00
Tejun Heo	9f391f94a1	sched_ext: Disallow loading BPF scheduler if isolcpus= domain isolation is in effect sched_domains regulate the load balancing for sched_classes. A machine can be partitioned into multiple sections that are not load-balanced across using either isolcpus= boot param or cpuset partitions. In such cases, tasks that are in one partition are expected to stay within that partition. cpuset configured partitions are always reflected in each member task's cpumask. As SCX always honors the task cpumasks, the BPF scheduler is automatically in compliance with the configured partitions. However, for isolcpus= domain isolation, the isolated CPUs are simply omitted from the top-level sched_domain[s] without further restrictions on tasks' cpumasks, so, for example, a task currently running in an isolated CPU may have more CPUs in its allowed cpumask while expected to remain on the same CPU. There is no straightforward way to enforce this partitioning preemptively on BPF schedulers and erroring out after a violation can be surprising. isolcpus= domain isolation is being replaced with cpuset partitions anyway, so keep it simple and simply disallow loading a BPF scheduler if isolcpus= domain isolation is in effect. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20240626082342.GY31592@noisy.programming.kicks-ass.net Cc: David Vernet <void@manifault.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <frederic@kernel.org>	2024-07-08 09:30:13 -10:00
Tejun Heo	e98abd22fb	sched_ext: Account for idle policy when setting p->scx.weight in scx_ops_enable_task() When initializing p->scx.weight, scx_ops_enable_task() wasn't considering whether the task is SCHED_IDLE. Update it to use WEIGHT_IDLEPRIO as the source weight for SCHED_IDLE tasks. This leaves reweight_task_scx() the sole user of set_task_scx_weight(). Open code it. @weight is going to be provided by sched core in the future anyway. v2: Use the newly available @lw->weight to set @p->scx.weight in reweight_task_scx(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com> Cc: Peter Zijlstra <peterz@infradead.org>	2024-07-08 09:25:35 -10:00
Tejun Heo	60564acbef	sched, sched_ext: Simplify dl_prio() case handling in sched_fork() sched_fork() returns with -EAGAIN if dl_prio(@p). `a7a9fc5492` ("sched_ext: Add boilerplate for extensible scheduler class") added scx_pre_fork() call before it and then scx_cancel_fork() on the exit path. This is silly as the dl_prio() block can just be moved above the scx_pre_fork() call. Move the dl_prio() block above the scx_pre_fork() call and remove the now unnecessary scx_cancel_fork() invocation. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Vernet <void@manifault.com>	2024-07-08 08:55:46 -10:00
Hongyan Xia	6203ef73fa	sched/ext: Add BPF function to fetch rq rq contains many useful fields to implement a custom scheduler. For example, various clock signals like clock_task and clock_pelt can be used to track load. It also contains stats in other sched_classes, which are useful to drive scheduling decisions in ext. tj: Put the new helper below scx_bpf_task_*() helpers. Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-08 07:10:48 -10:00
Tejun Heo	7b9f6c864a	Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.11 `d329605287` ("sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks") applied to sched/core changes how reweight_task() is called causing conflicts with `e83edbf88f` ("sched: Add sched_class->reweight_task()"). Resolve the conflicts by taking set_load_weight() changes from `d329605287` and updating sched_class->reweight_task() to take pointer to struct load_weight instead of int prio. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-08 07:06:26 -10:00
Tejun Heo	d329605287	sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks When a task's weight is being changed, set_load_weight() is called with @update_load set. As weight changes aren't trivial for the fair class, set_load_weight() calls fair.c::reweight_task() for fair class tasks. However, set_load_weight() first tests task_has_idle_policy() on entry and skips calling reweight_task() for SCHED_IDLE tasks. This is buggy as SCHED_IDLE tasks are just fair tasks with a very low weight and they would incorrectly skip load, vlag and position updates. Fix it by updating reweight_task() to take struct load_weight as idle weight can't be expressed with prio and making set_load_weight() call reweight_task() for SCHED_IDLE tasks too when @update_load is set. Fixes: `9059393e4e` ("sched/fair: Use reweight_entity() for set_user_nice()") Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org # v4.15+ Link: http://lkml.kernel.org/r/20240624102331.GI31592@noisy.programming.kicks-ass.net	2024-07-04 15:59:52 +02:00
Tvrtko Ursulin	0ec208ce98	sched/psi: Optimise psi_group_change a bit The current code loops over the psi_states only to call a helper which then resolves back to the action needed for each state using a switch statement. That is effectively creating a double indirection of a kind which, given how all the states need to be explicitly listed and handled anyway, we can simply remove. Both the for loop and the switch statement that is. The benefit is both in the code size and CPU time spent in this function. YMMV but on my Steam Deck, while in a game, the patch makes the CPU usage go from ~2.4% down to ~1.2%. Text size at the same time went from 0x323 to 0x2c1. Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20240625135000.38652-1-tursulin@igalia.com	2024-07-04 15:59:52 +02:00
Tejun Heo	b651d7c392	sched_ext: Swap argument positions in kcalloc() call to avoid compiler warning alloc_exit_info() calls kcalloc() but puts in the size of the element as the first argument which triggers the following gcc warning: kernel/sched/ext.c:3815:32: warning: ‘kmalloc_array_noprof’ sizes specified with ‘sizeof’ in the earlier argument and not in the later argument [-Wcalloc-transposed-args] Fix it by swapping the positions of the first two arguments. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Vishal Chourasia <vishalc@linux.ibm.com> Link: http://lkml.kernel.org/r/ZoG6zreEtQhAUr_2@linux.ibm.com	2024-07-01 08:30:02 -10:00
John Stultz	ddae0ca2a8	sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath It was reported that in moving to 6.1, a larger than 10% regression was seen in the performance of clock_gettime(CLOCK_THREAD_CPUTIME_ID,...). Using a simple reproducer, I found: 5.10: 100000000 calls in 24345994193 ns => 243.460 ns per call 100000000 calls in 24288172050 ns => 242.882 ns per call 100000000 calls in 24289135225 ns => 242.891 ns per call 6.1: 100000000 calls in 28248646742 ns => 282.486 ns per call 100000000 calls in 28227055067 ns => 282.271 ns per call 100000000 calls in 28177471287 ns => 281.775 ns per call The cause of this was finally narrowed down to the addition of psi_account_irqtime() in update_rq_clock_task(), in commit `52b1364ba0` ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure"). In my initial attempt to resolve this, I leaned towards moving all accounting work out of the clock_gettime() call path, but it wasn't very pretty, so it will have to wait for a later deeper rework. Instead, Peter shared this approach: Rework psi_account_irqtime() to use its own psi_irq_time base for accounting, and move it out of the hotpath, calling it instead from sched_tick() and __schedule(). In testing this, we found the importance of ensuring psi_account_irqtime() is run under the rq_lock, which Johannes Weiner helpfully explained, so also add some lockdep annotations to make that requirement clear. With this change the performance is back in-line with 5.10: 6.1+fix: 100000000 calls in 24297324597 ns => 242.973 ns per call 100000000 calls in 24318869234 ns => 243.189 ns per call 100000000 calls in 24291564588 ns => 242.916 ns per call Reported-by: Jimmy Shiu <jimmyshiu@google.com> Originally-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Qais Yousef <qyousef@layalina.io> Link: https://lore.kernel.org/r/20240618215909.4099720-1-jstultz@google.com	2024-07-01 13:01:44 +02:00
Wander Lairson Costa	b58652db66	sched/deadline: Fix task_struct reference leak During the execution of the following stress test with linux-rt: stress-ng --cyclic 30 --timeout 30 --minimize --quiet kmemleak frequently reported a memory leak concerning the task_struct: unreferenced object 0xffff8881305b8000 (size 16136): comm "stress-ng", pid 614, jiffies 4294883961 (age 286.412s) object hex dump (first 32 bytes): 02 40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 .@.............. 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ debug hex dump (first 16 bytes): 53 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 S............... backtrace: [<00000000046b6790>] dup_task_struct+0x30/0x540 [<00000000c5ca0f0b>] copy_process+0x3d9/0x50e0 [<00000000ced59777>] kernel_clone+0xb0/0x770 [<00000000a50befdc>] __do_sys_clone+0xb6/0xf0 [<000000001dbf2008>] do_syscall_64+0x5d/0xf0 [<00000000552900ff>] entry_SYSCALL_64_after_hwframe+0x6e/0x76 The issue occurs in start_dl_timer(), which increments the task_struct reference count and sets a timer. The timer callback, dl_task_timer, is supposed to decrement the reference count upon expiration. However, if enqueue_task_dl() is called before the timer expires and cancels it, the reference count is not decremented, leading to the leak. This patch fixes the reference leak by ensuring the task_struct reference count is properly decremented when the timer is canceled. Fixes: `feff2e65ef` ("sched/deadline: Unthrottle PI boosted threads while enqueuing") Signed-off-by: Wander Lairson Costa <wander@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20240620125618.11419-1-wander@redhat.com	2024-07-01 13:01:44 +02:00
Josh Don	2feab2492d	Revert "sched/fair: Make sure to try to detach at least one movable task" This reverts commit `b0defa7ae0`. `b0defa7ae0` changed the load balancing logic to ignore env.max_loop if all tasks examined to that point were pinned. The goal of the patch was to make it more likely to be able to detach a task buried in a long list of pinned tasks. However, this has the unfortunate side effect of creating an O(n) iteration in detach_tasks(), as we now must fully iterate every task on a cpu if all or most are pinned. Since this load balance code is done with rq lock held, and often in softirq context, it is very easy to trigger hard lockups. We observed such hard lockups with a user who affined O(10k) threads to a single cpu. When I discussed this with Vincent he initially suggested that we keep the limit on the number of tasks to detach, but increase the number of tasks we can search. However, after some back and forth on the mailing list, he recommended we instead revert the original patch, as it seems likely no one was actually getting hit by the original issue. Fixes: `b0defa7ae0` ("sched/fair: Make sure to try to detach at least one movable task") Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240620214450.316280-1-joshdon@google.com	2024-07-01 13:01:43 +02:00
Andrea Righi	1ff4f169c9	sched_ext: fix typo in set_weight() description Correct eight to weight in the description of the .set_weight() operation in sched_ext_ops. Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-06-27 08:57:09 -10:00
David Vernet	8a6c6b4b93	sched_ext: Make scx_bpf_cpuperf_set() @cpu arg signed The scx_bpf_cpuperf_set() kfunc allows a BPF program to set the relative performance target of a specified CPU. Commit `d86adb4fc0` ("sched_ext: Add cpuperf support") defined the @cpu argument to be unsigned. Let's update it to be signed to match the norm for the rest of ext.c and the kernel. Note that the kfunc declaration of scx_bpf_cpuperf_set() in the common.bpf.h header in tools/sched_ext already listed the cpu as signed, so this also fixes the build for tools/sched_ext and the sched_ext selftests due to kfunc declarations now being emitted in vmlinux.h based on BTF (thus causing the compiler to error due to observing conflicting types). Fixes: `d86adb4fc0` ("sched_ext: Add cpuperf support") Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-06-23 07:53:15 -10:00
Tejun Heo	d86adb4fc0	sched_ext: Add cpuperf support sched_ext currently does not integrate with schedutil. When schedutil is the governor, frequencies are left unregulated and usually get stuck close to the highest performance level from running RT tasks. Add CPU performance monitoring and scaling support by integrating into schedutil. The following kfuncs are added: - scx_bpf_cpuperf_cap(): Query the relative performance capacity of different CPUs in the system. - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU relative to its max performance. - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU. This gives direct control over CPU performance setting to the BPF scheduler. The only changes on the schedutil side are accounting for the utilization factor from sched_ext and disabling frequency holding heuristics as it may not apply well to sched_ext schedulers which may have a lot weaker connection between tasks and their current / last CPU. With cpuperf support added, there is no reason to block uclamp. Enable while at it. A toy implementation of cpuperf is added to scx_qmap as a demonstration of the feature. v2: Ignore cpu_util_cfs_boost() when scx_switched_all() in sugov_get_util() to avoid factoring in stale util metric. (Christian) Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Christian Loehle <christian.loehle@arm.com>	2024-06-21 12:37:22 -10:00
Tejun Heo	8988cad8d0	cpufreq_schedutil: Refactor sugov_cpu_is_busy() sugov_cpu_is_busy() is used to avoid decreasing the performance level while the CPU is busy and is called by sugov_update_single_freq() and sugov_update_single_perf(). Both callers repeat the same pattern to first test for uclamp and then the busyness. Let's refactor so that the tests aren't repeated. The new helper is named sugov_hold_freq() and tests both the uclamp exception and CPU busyness. No functional changes. This will make adding more exception conditions easier. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Viresh Kumar <viresh.kumar@linaro.org>	2024-06-21 12:37:03 -10:00
Tejun Heo	b999e365c2	sched, sched_ext: Replace scx_next_task_picked() with sched_class->switch_class() scx_next_task_picked() is used by sched_ext to notify the BPF scheduler when a CPU is taken away by a task dispatched from a higher priority sched_class so that the BPF scheduler can, e.g., punt the task[s] which was running or were waiting for the CPU to other CPUs. Replace the sched_ext specific hook scx_next_task_picked() with a new sched_class operation switch_class(). The changes are straightforward and the code looks better afterwards. However, when !CONFIG_SCHED_CLASS_EXT, this ends up adding an unused hook which is unlikely to be useful to other sched_classes. For further discussion on this subject, please refer to the following: http://lkml.kernel.org/r/CAHk-=wjFPLqo7AXu8maAGEGnOy6reUg-F4zzFhVB0Kyu22h7pw@mail.gmail.com Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de>	2024-06-21 09:49:28 -10:00
Tejun Heo	fa48e8d2c7	sched_ext: Documentation: scheduler: Document extensible scheduler class Add Documentation/scheduler/sched-ext.rst which gives a high-level overview and pointers to the examples. v6: - Add paragraph explaining debug dump. v5: - Updated to reflect /sys/kernel interface change. Kconfig options added. v4: - README improved, reformatted in markdown and renamed to README.md. v3: - Added tools/sched_ext/README. - Dropped _example prefix from scheduler names. v2: - Apply minor edits suggested by Bagas. Caveats section dropped as all of them are addressed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com>	2024-06-18 10:09:21 -10:00
Tejun Heo	06e51be3d5	sched_ext: Add vtime-ordered priority queue to dispatch_q's Currently, a dsq is always a FIFO. A task which is dispatched earlier gets consumed or executed earlier. While this is sufficient when dsq's are used for simple staging areas for tasks which are ready to execute, it'd make dsq's a lot more useful if they can implement custom ordering. This patch adds a vtime-ordered priority queue to dsq's. When the BPF scheduler dispatches a task with the new scx_bpf_dispatch_vtime() helper, it can specify the vtime that the task should be inserted at and the task is inserted into the priority queue in the dsq which is ordered according to time_before64() comparison of the vtime values. A DSQ can either be a FIFO or priority queue and automatically switches between the two depending on whether scx_bpf_dispatch() or scx_bpf_dispatch_vtime() is used. Using the wrong variant while the DSQ already has the other type queued is not allowed and triggers an ops error. Built-in DSQs must always be FIFOs. This makes it easy and efficient for the BPF schedulers to implement proper vtime based scheduling within each dsq at a negligible cost in terms of code complexity and overhead. scx_simple and scx_example_flatcg are updated to default to weighted vtime scheduling (the latter within each cgroup). FIFO scheduling can be selected with -f option. v4: - As allowing mixing priority queue and FIFO on the same DSQ sometimes led to unexpected starvations, DSQs now error out if both modes are used at the same time and the built-in DSQs are no longer allowed to be priority queues. - Explicit type struct scx_dsq_node added to contain fields needed to be linked on DSQs. This will be used to implement stateful iterator. - Tasks are now always linked on dsq->list whether the DSQ is in FIFO or PRIQ mode. This confines PRIQ related complexities to the enqueue and dequeue paths. Other paths only need to look at dsq->list.
This will also ease implementing BPF iterator. - Print p->scx.dsq_flags in debug dump. v3: - SCX_TASK_DSQ_ON_PRIQ flag is moved from p->scx.flags into its own p->scx.dsq_flags. The flag is protected with the dsq lock unlike other flags in p->scx.flags. This led to flag corruption in some cases. - Add comments explaining the interaction between using consumption of p->scx.slice to determine vtime progress and yielding. v2: - p->scx.dsq_vtime was not initialized on load or across cgroup migrations leading to some tasks being stalled for extended period of time depending on how saturated the machine is. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com>	2024-06-18 10:09:21 -10:00
Tejun Heo	7b0888b7cc	sched_ext: Implement core-sched support The core-sched support is composed of the following parts: - task_struct->scx.core_sched_at is added. This is a timestamp which can be used to order tasks. Depending on whether the BPF scheduler implements custom ordering, it tracks either global FIFO ordering of all tasks or local-DSQ ordering within the dispatched tasks on a CPU. - prio_less() is updated to call scx_prio_less() when comparing SCX tasks. scx_prio_less() calls ops.core_sched_before() if available or uses the core_sched_at timestamp. For global FIFO ordering, the BPF scheduler doesn't need to do anything. Otherwise, it should implement ops.core_sched_before() which reflects the ordering. - When core-sched is enabled, balance_scx() balances all SMT siblings so that they all have tasks dispatched if necessary before pick_task_scx() is called. pick_task_scx() picks between the current task and the first dispatched task on the local DSQ based on availability and the core_sched_at timestamps. Note that FIFO ordering is expected among the already dispatched tasks whether running or on the local DSQ, so this path always compares core_sched_at instead of calling into ops.core_sched_before(). qmap_core_sched_before() is added to scx_qmap. It scales the distances from the heads of the queues to compare the tasks across different priority queues and seems to behave as expected. v3: Fixed build error when !CONFIG_SCHED_SMT reported by Andrea Righi. v2: Sched core added the const qualifiers to prio_less task arguments. Explicitly drop them for ops.core_sched_before() task arguments. BPF enforces access control through the verifier, so the qualifier isn't actually operative and only gets in the way when interacting with various helpers. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Reviewed-by: Josh Don <joshdon@google.com> Cc: Andrea Righi <andrea.righi@canonical.com>	2024-06-18 10:09:20 -10:00
Tejun Heo	0fd55582ed	sched_ext: Bypass BPF scheduler while PM events are in progress PM operations freeze userspace. Some BPF schedulers have an active userspace component and may not behave as expected across PM events. While the system is frozen, nothing too interesting is happening in terms of scheduling and we can get by just fine with the fallback FIFO behavior. Let's make things easier by always bypassing the BPF scheduler while PM events are in progress. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com>	2024-06-18 10:09:20 -10:00
Tejun Heo	60c27fb59f	sched_ext: Implement sched_ext_ops.cpu_online/offline() Add ops.cpu_online/offline() which are invoked when CPUs come online and offline respectively. As the enqueue path already automatically bypasses tasks to the local dsq on a deactivated CPU, BPF schedulers are guaranteed to see tasks only on CPUs which are between online() and offline(). If the BPF scheduler doesn't implement ops.cpu_online/offline(), the scheduler is automatically exited with SCX_ECODE_RESTART | SCX_ECODE_RSN_HOTPLUG. Userspace can implement CPU hotplug support trivially by simply reinitializing and reloading the scheduler. scx_qmap is updated to print out online CPUs on hotplug events. Other schedulers are updated to restart based on ecode. v3: - The previous implementation added @reason to sched_class.rq_on/offline() to distinguish between CPU hotplug events and topology updates. This was buggy and fragile as the methods are skipped if the current state equals the target state. Instead, add scx_rq_[de]activate() which are directly called from sched_cpu_de/activate(). This also allows ops.cpu_on/offline() to sleep which can be useful. - ops.dispatch() could be called on a CPU that the BPF scheduler was told to be offline. The dispatch path is updated to bypass in such cases. v2: - To accommodate lock ordering change between scx_cgroup_rwsem and cpus_read_lock(), CPU hotplug operations are put into its own SCX_OPI block and enabled earlier during scx_ope_enable() so that cpus_read_lock() can be dropped before acquiring scx_cgroup_rwsem. - Auto exit with ECODE added. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>	2024-06-18 10:09:20 -10:00
David Vernet	245254f708	sched_ext: Implement sched_ext_ops.cpu_acquire/release() Scheduler classes are strictly ordered and when a higher priority class has tasks to run, the lower priority ones lose access to the CPU. Being able to monitor and act on these events is necessary for use cases including strict core-scheduling and latency management. This patch adds two operations ops.cpu_acquire() and .cpu_release(). The former is invoked when a CPU becomes available to the BPF scheduler and the opposite for the latter. This patch also implements scx_bpf_reenqueue_local() which can be called from .cpu_release() to trigger requeueing of all tasks in the local dsq of the CPU so that the tasks can be reassigned to other available CPUs. scx_pair is updated to use .cpu_acquire/release() along with %SCX_KICK_WAIT to make the pair scheduling guarantee strict even when a CPU is preempted by a higher priority scheduler class. scx_qmap is updated to use .cpu_acquire/release() to empty the local dsq of a preempted CPU. A similar approach can be adopted by BPF schedulers that want to have a tight control over latency. v4: Use the new SCX_KICK_IDLE to wake up a CPU after re-enqueueing. v3: Drop the const qualifier from scx_cpu_release_args.task. BPF enforces access control through the verifier, so the qualifier isn't actually operative and only gets in the way when interacting with various helpers. v2: Add p->scx.kf_mask annotation to allow calling scx_bpf_reenqueue_local() from ops.cpu_release() nested inside ops.init() and other sleepable operations. Signed-off-by: David Vernet <dvernet@meta.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>	2024-06-18 10:09:20 -10:00
David Vernet	90e55164da	sched_ext: Implement SCX_KICK_WAIT If set when calling scx_bpf_kick_cpu(), the invoking CPU will busy wait for the kicked cpu to enter the scheduler. See the following for example usage: https://github.com/sched-ext/scx/blob/main/scheds/c/scx_pair.bpf.c v2: - Updated to fit the updated kick_cpus_irq_workfn() implementation. - Include SCX_KICK_WAIT related information in debug dump. Signed-off-by: David Vernet <dvernet@meta.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>	2024-06-18 10:09:20 -10:00
Tejun Heo	36454023f5	sched_ext: Track tasks that are subjects of the in-flight SCX operation When some SCX operations are in flight, it is known that the subject task's rq lock is held throughout which makes it safe to access certain fields of the task - e.g. its current task_group. We want to add SCX kfunc helpers that can make use of this guarantee - e.g. to help determine the currently associated CPU cgroup from the task's current task_group. As it'd be dangerous to call such a helper on a task which isn't rq lock protected, the helper should be able to verify the input task and reject accordingly. This patch adds sched_ext_entity.kf_tasks[] that track the tasks which are currently being operated on by a terminal SCX operation. The new SCX_CALL_OP_[2]TASK[_RET]() can be used when invoking SCX operations which take tasks as arguments and the scx_kf_allowed_on_arg_tasks() can be used by kfunc helpers to verify the input task status. Note that as sched_ext_entity.kf_tasks[] can't handle nesting, the tracking is currently only limited to terminal SCX operations. If needed in the future, this restriction can be removed by moving the tracking to the task side with a couple per-task counters. v2: Updated to reflect the addition of SCX_KF_SELECT_CPU. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com>	2024-06-18 10:09:19 -10:00
Tejun Heo	22a920209a	sched_ext: Implement tickless support Allow BPF schedulers to indicate tickless operation by setting p->scx.slice to SCX_SLICE_INF. A CPU whose current task has an infinite slice goes into tickless operation. scx_central is updated to use tickless operations for all tasks and instead use a BPF timer to expire slices. This also uses the SCX_ENQ_PREEMPT and task state tracking added by the previous patches. Currently, there is no way to pin the timer on the central CPU, so it may end up on one of the worker CPUs; however, outside of that, the worker CPUs can go tickless both while running sched_ext tasks and idling. With schbench running, scx_central shows: root@test ~# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts LOC: 142024 656 664 449 Local timer interrupts LOC: 161663 663 665 449 Local timer interrupts Without it: root@test ~ [SIGINT]# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts LOC: 188778 3142 3793 3993 Local timer interrupts LOC: 198993 5314 6323 6438 Local timer interrupts While scx_central itself is too barebone to be useful as a production scheduler, a more featureful central scheduler can be built using the same approach. Google's experience shows that such an approach can have significant benefits for certain applications such as VM hosting. v4: Allow operation even if BPF_F_TIMER_CPU_PIN is not available. v3: Pin the central scheduler's timer on the central_cpu using BPF_F_TIMER_CPU_PIN. v2: Convert to BPF inline iterators. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>	2024-06-18 10:09:19 -10:00
Tejun Heo	1c29f8541e	sched_ext: Add task state tracking operations Being able to track the task runnable and running state transitions is useful for a variety of purposes including latency tracking and load factor calculation. Currently, BPF schedulers don't have a good way of tracking these transitions. Becoming runnable can be determined from ops.enqueue() but becoming quiescent can only be inferred from the lack of subsequent enqueue. Also, as the local dsq can have multiple tasks and some events are handled in the sched_ext core, it's difficult to determine when a given task starts and stops executing. This patch adds sched_ext_ops.runnable(), .running(), .stopping() and .quiescent() operations to track the task runnable and running state transitions. They're mostly self explanatory; however, we want to ensure that running <-> stopping transitions are always contained within runnable <-> quiescent transitions which is a bit different from how the scheduler core behaves. This adds a bit of complication. See the comment in dequeue_task_scx(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>	2024-06-18 10:09:19 -10:00
Tejun Heo	0922f54fdd	sched_ext: Make watchdog handle ops.dispatch() looping stall The dispatch path retries if the local DSQ is still empty after ops.dispatch() either dispatched or consumed a task. This is both out of necessity and for convenience. It has to retry because the dispatch path might lose the tasks to dequeue while the rq lock is released while trying to migrate tasks across CPUs, and the retry mechanism makes ops.dispatch() implementation easier as it only needs to make some forward progress each iteration. However, this makes it possible for ops.dispatch() to stall CPUs by repeatedly dispatching ineligible tasks. If all CPUs are stalled that way, the watchdog or sysrq handler can't run and the system can't be saved. Let's address the issue by breaking out of the dispatch loop after 32 iterations. It is unlikely but not impossible for ops.dispatch() to legitimately go over the iteration limit. We want to come back to the dispatch path in such cases as not doing so risks stalling the CPU by idling with runnable tasks pending. As the previous task is still current in balance_scx(), resched_curr() doesn't do anything - it will just get cleared. Let's instead use scx_kick_bpf() which will trigger reschedule after switching to the next task which will likely be the idle task. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com>	2024-06-18 10:09:19 -10:00
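
The retry-with-a-cap behavior described in commit 0922f54fdd can be sketched in plain userspace C. This is only an illustration of the idea, not the kernel code: `run_dispatch()`, `struct local_dsq`, `DISPATCH_MAX_LOOPS` and the two example callbacks are hypothetical names invented here, and the only detail taken from the commit message is the limit of 32 iterations.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical name; the commit message only states "32 iterations". */
#define DISPATCH_MAX_LOOPS 32

/* Toy stand-in for a CPU's local dispatch queue. */
struct local_dsq {
	size_t nr_tasks;
};

/*
 * Hypothetical callback type standing in for ops.dispatch().
 * Returns true if it dispatched or consumed something this round.
 */
typedef bool (*dispatch_fn)(struct local_dsq *dsq, void *ctx);

enum dispatch_result {
	DISPATCH_HAS_TASK,   /* local queue is non-empty, a task can be picked */
	DISPATCH_IDLE,       /* callback made no progress, nothing to run */
	DISPATCH_LOOP_LIMIT, /* cap reached, caller should arrange a later retry */
};

/*
 * Retry while the callback reports progress but the local queue is
 * still empty, and give up after DISPATCH_MAX_LOOPS rounds so that a
 * callback that keeps producing unusable work cannot spin forever.
 */
enum dispatch_result run_dispatch(struct local_dsq *dsq,
				  dispatch_fn dispatch, void *ctx)
{
	for (int i = 0; i < DISPATCH_MAX_LOOPS; i++) {
		bool progressed = dispatch(dsq, ctx);

		if (dsq->nr_tasks > 0)
			return DISPATCH_HAS_TASK;
		if (!progressed)
			return DISPATCH_IDLE;
	}
	return DISPATCH_LOOP_LIMIT;
}

/* Example callback: always claims progress but never fills the queue. */
bool dispatch_ineligible(struct local_dsq *dsq, void *ctx)
{
	(void)dsq;
	(*(int *)ctx)++;	/* count invocations */
	return true;
}

/* Example callback: queues one task on its third invocation. */
bool dispatch_third_call(struct local_dsq *dsq, void *ctx)
{
	int *calls = ctx;

	if (++(*calls) == 3)
		dsq->nr_tasks++;
	return true;
}
```

With `dispatch_ineligible` the loop stops at the cap and returns `DISPATCH_LOOP_LIMIT`; in the commit's terms this is the case where the CPU must be kicked to come back to the dispatch path later rather than idle with runnable tasks pending.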


4482 Commits
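
Commit 06e51be3d5 above orders tasks in a DSQ by a time_before64() comparison of their vtime values. The following userspace sketch shows why a signed-difference comparison keeps the ordering correct when a 64-bit counter wraps. It is an assumption-laden toy: `vtime_before()`, `struct vtime_pq` and `vtime_pq_insert()` are hypothetical names, and a small sorted array is swapped in for whatever queue structure the kernel actually uses.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Wraparound-safe "a comes before b" for 64-bit counters, following
 * the same signed-difference idea as the kernel's time_before64().
 * Relies on the usual two's-complement conversion to int64_t.
 */
bool vtime_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

#define PQ_CAP 16	/* arbitrary capacity for the sketch */

/* Toy priority queue: vtimes kept sorted, earliest first. */
struct vtime_pq {
	uint64_t vtime[PQ_CAP];
	size_t len;
};

/*
 * Insert @vtime at its ordered position. Entries with an equal vtime
 * stay ahead of the new one, so ties are served in FIFO order.
 * Returns false if the queue is full.
 */
bool vtime_pq_insert(struct vtime_pq *pq, uint64_t vtime)
{
	size_t pos;

	if (pq->len == PQ_CAP)
		return false;

	pos = pq->len;
	while (pos > 0 && vtime_before(vtime, pq->vtime[pos - 1])) {
		pq->vtime[pos] = pq->vtime[pos - 1];
		pos--;
	}
	pq->vtime[pos] = vtime;
	pq->len++;
	return true;
}
```

A plain `a < b` comparison would sort a freshly wrapped value such as 2 ahead of `UINT64_MAX`; the signed difference keeps it behind, which is the property the vtime ordering depends on.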