linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-18 13:31:45 -04:00

Author	SHA1	Message	Date
Tejun Heo	7ed8df0d15	sched_ext: Make handle_lockup() propagate scx_verror() result handle_lockup() currently calls scx_verror() but ignores its return value, always returning true when the scheduler is enabled. Make it capture and return the result from scx_verror(). This prepares for hardlockup handling. Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Emil Tsalapatis <etsal@meta.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-12 06:43:44 -10:00
Tejun Heo	4ba54a6cbd	sched_ext: Refactor lockup handlers into handle_lockup() scx_rcu_cpu_stall() and scx_softlockup() share the same pattern: check if the scheduler is enabled under RCU read lock and trigger an error if so. Extract the common pattern into handle_lockup() helper. Add scx_verror() macro and use guard(rcu)(). This simplifies both handlers, reduces code duplication, and prepares for hardlockup handling. Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Cc: Emil Tsalapatis <etsal@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-12 06:43:44 -10:00
Tejun Heo	f2fe382e1f	sched_ext: Make scx_exit() and scx_vexit() return bool Make scx_exit() and scx_vexit() return bool indicating whether the calling thread successfully claimed the exit. This will be used by the abort mechanism added in a later patch. Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Cc: Emil Tsalapatis <etsal@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-12 06:43:44 -10:00
Tejun Heo	5ebec443fb	sched_ext: Exit dispatch and move operations immediately when aborting `62dcbab8b0` ("sched_ext: Avoid live-locking bypass mode switching") introduced the breather mechanism to inject delays during bypass mode switching. It maintains operation semantics unchanged while reducing lock contention to avoid live-locks on large NUMA systems. However, the breather only activates when exiting the scheduler, so there's no need to maintain operation semantics. Simplify by exiting dispatch and move operations immediately when scx_aborting is set. In consume_dispatch_q(), break out of the task iteration loop. In scx_dsq_move(), return early before acquiring locks. This also fixes cases the breather mechanism cannot handle. When a large system has many runnable threads affinitized to different CPU subsets and the BPF scheduler places them all into a single DSQ, many CPUs can scan the DSQ concurrently for tasks they can run. This can cause DSQ and RQ locks to be held for extended periods, leading to various failure modes. The breather cannot solve this because once in the consume loop, there's no exit. The new mechanism fixes this by exiting the loop immediately. The bypass DSQ is exempted to ensure the bypass mechanism itself can make progress. v2: Use READ_ONCE() when reading scx_aborting (Andrea Righi). Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Andrea Righi <arighi@nvidia.com> Cc: Emil Tsalapatis <etsal@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-12 06:43:44 -10:00
Tejun Heo	a69040ed57	sched_ext: Simplify breather mechanism with scx_aborting flag The breather mechanism was introduced in `62dcbab8b0` ("sched_ext: Avoid live-locking bypass mode switching") and `e32c260195` ("sched_ext: Enable the ops breather and eject BPF scheduler on softlockup") to prevent live-locks by injecting delays when CPUs are trapped in dispatch paths. Currently, it uses scx_breather_depth (atomic_t) and scx_in_softlockup (unsigned long) with separate increment/decrement and cleanup operations. The breather is only activated when aborting, so tie it directly to the exit mechanism. Replace both variables with scx_aborting flag set when exit is claimed and cleared after bypass is enabled. Introduce scx_claim_exit() to consolidate exit_kind claiming and breather enablement. This eliminates scx_clear_softlockup() and simplifies scx_softlockup() and scx_bypass(). The breather mechanism will be replaced by a different abort mechanism in a future patch. This simplification prepares for that change. Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-12 06:43:44 -10:00
Tejun Heo	61debc251c	sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode Bypass mode routes tasks through fallback dispatch queues. Originally a single global DSQ, `b7b3b2dbae` ("sched_ext: Split the global DSQ per NUMA node") changed this to per-node DSQs to resolve NUMA-related livelocks. Dan Schatzberg found per-node DSQs can still livelock when many threads are pinned to different small CPU subsets: each CPU must scan many incompatible tasks to find runnable ones, causing severe contention with high CPU counts. Switch to per-CPU bypass DSQs. Each task queues on its current CPU. Default idle CPU selection and direct dispatch handle most cases well. This introduces a failure mode when tasks concentrate on one CPU in over-saturated systems. If the BPF scheduler severely skews placement before triggering bypass, that CPU's queue may be too long to drain, causing RCU stalls. A load balancer in a future patch will address this. The bypass DSQ is separate from local DSQ to enable load balancing: local DSQs use rq locks, preventing efficient scanning and transfer across CPUs, especially problematic when systems are already contended. v2: Clarified why bypass DSQ is separate from local DSQ (Andrea Righi). Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-12 06:43:44 -10:00
Tejun Heo	3546119f18	sched_ext: Refactor do_enqueue_task() local and global DSQ paths The local and global DSQ enqueue paths in do_enqueue_task() share the same slice refill logic. Factor out the common code into a shared enqueue label. This makes adding new enqueue cases easier. No functional changes. Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-12 06:43:44 -10:00
Tejun Heo	bfd3749d48	sched_ext: Use shorter slice in bypass mode There have been reported cases of bypass mode not making forward progress fast enough. The 20ms default slice is unnecessarily long for bypass mode where the primary goal is ensuring all tasks can make forward progress. Introduce SCX_SLICE_BYPASS set to 5ms and make the scheduler automatically switch to it when entering bypass mode. Also make the bypass slice value tunable through the slice_bypass_us module parameter (adjustable between 100us and 100ms) to make it easier to test whether slice durations are a factor in problem cases. v3: Use READ_ONCE/WRITE_ONCE for scx_slice_dfl access (Dan). v2: Removed slice_dfl_us module parameter. Fixed typos (Andrea). Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-12 06:43:43 -10:00
Zqiang	5f02151c41	sched_ext: Fix unsafe locking in the scx_dump_state() For built with CONFIG_PREEMPT_RT=y kernels, the dump_lock will be converted sleepable spinlock and not disable-irq, so the following scenarios occur: inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. irq_work/0/27 [HC0[0]:SC0[0]:HE1:SE1] takes: (&rq->__lock){?...}-{2:2}, at: raw_spin_rq_lock_nested+0x2b/0x40 {IN-HARDIRQ-W} state was registered at: lock_acquire+0x1e1/0x510 _raw_spin_lock_nested+0x42/0x80 raw_spin_rq_lock_nested+0x2b/0x40 sched_tick+0xae/0x7b0 update_process_times+0x14c/0x1b0 tick_periodic+0x62/0x1f0 tick_handle_periodic+0x48/0xf0 timer_interrupt+0x55/0x80 __handle_irq_event_percpu+0x20a/0x5c0 handle_irq_event_percpu+0x18/0xc0 handle_irq_event+0xb5/0x150 handle_level_irq+0x220/0x460 __common_interrupt+0xa2/0x1e0 common_interrupt+0xb0/0xd0 asm_common_interrupt+0x2b/0x40 _raw_spin_unlock_irqrestore+0x45/0x80 __setup_irq+0xc34/0x1a30 request_threaded_irq+0x214/0x2f0 hpet_time_init+0x3e/0x60 x86_late_time_init+0x5b/0xb0 start_kernel+0x308/0x410 x86_64_start_reservations+0x1c/0x30 x86_64_start_kernel+0x96/0xa0 common_startup_64+0x13e/0x148 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&rq->__lock); <Interrupt> lock(&rq->__lock); * DEADLOCK * stack backtrace: CPU: 0 UID: 0 PID: 27 Comm: irq_work/0 Call Trace: <TASK> dump_stack_lvl+0x8c/0xd0 dump_stack+0x14/0x20 print_usage_bug+0x42e/0x690 mark_lock.part.44+0x867/0xa70 ? __pfx_mark_lock.part.44+0x10/0x10 ? string_nocheck+0x19c/0x310 ? number+0x739/0x9f0 ? __pfx_string_nocheck+0x10/0x10 ? __pfx_check_pointer+0x10/0x10 ? kvm_sched_clock_read+0x15/0x30 ? sched_clock_noinstr+0xd/0x20 ? local_clock_noinstr+0x1c/0xe0 __lock_acquire+0xc4b/0x62b0 ? __pfx_format_decode+0x10/0x10 ? __pfx_string+0x10/0x10 ? __pfx___lock_acquire+0x10/0x10 ? __pfx_vsnprintf+0x10/0x10 lock_acquire+0x1e1/0x510 ? raw_spin_rq_lock_nested+0x2b/0x40 ? __pfx_lock_acquire+0x10/0x10 ? dump_line+0x12e/0x270 ? raw_spin_rq_lock_nested+0x20/0x40 _raw_spin_lock_nested+0x42/0x80 ? raw_spin_rq_lock_nested+0x2b/0x40 raw_spin_rq_lock_nested+0x2b/0x40 scx_dump_state+0x3b3/0x1270 ? finish_task_switch+0x27e/0x840 scx_ops_error_irq_workfn+0x67/0x80 irq_work_single+0x113/0x260 irq_work_run_list.part.3+0x44/0x70 run_irq_workd+0x6b/0x90 ? __pfx_run_irq_workd+0x10/0x10 smpboot_thread_fn+0x529/0x870 ? __pfx_smpboot_thread_fn+0x10/0x10 kthread+0x305/0x3f0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x40/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> This commit therefore use rq_lock_irqsave/irqrestore() to replace rq_lock/unlock() in the scx_dump_state(). Fixes: `07814a9439` ("sched_ext: Print debug dump after an error exit") Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-12 06:28:32 -10:00
Steven Rostedt	76680d0d28	tracing: Have function tracer define options per instance Currently the function tracer's options are saved via a global mask when it should be per instance. Use the new infrastructure to define a "default_flags" field in the tracer structure that is used for the top level instance as well as new ones. Currently the global mask causes confusion: # cd /sys/kernel/tracing # mkdir instances/foo # echo function > instances/foo/current_tracer # echo 1 > options/func-args # echo function > current_tracer # cat trace [..] <idle>-0 [005] d..3. 1050.656187: rcu_needs_cpu() <-tick_nohz_next_event <idle>-0 [005] d..3. 1050.656188: get_next_timer_interrupt(basej=0x10002dbad, basem=0xf45fd7d300) <-tick_nohz_next_event <idle>-0 [005] d..3. 1050.656189: _raw_spin_lock(lock=0xffff8944bdf5de80) <-__get_next_timer_interrupt <idle>-0 [005] d..4. 1050.656190: do_raw_spin_lock(lock=0xffff8944bdf5de80) <-__get_next_timer_interrupt <idle>-0 [005] d..4. 1050.656191: _raw_spin_lock_nested(lock=0xffff8944bdf5f140, subclass=1) <-__get_next_timer_interrupt # cat instances/foo/options/func-args 1 # cat instances/foo/trace [..] kworker/4:1-88 [004] ...1. 298.127735: next_zone <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127736: first_online_pgdat <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127738: next_online_pgdat <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127739: fold_diff <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127741: round_jiffies_relative <-vmstat_update [..] The above shows that updating the "func-args" option at the top level instance also updates the "func-args" option in the instance but because the update is only done by the instance that gets changed (as it should), it's confusing to see that the option is already set in the other instance. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251111232429.470883736@kernel.org Fixes: `f20a580627` ("ftrace: Allow instances to use function tracing") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-12 09:59:54 -05:00
Steven Rostedt	428add559b	tracing: Have tracer option be instance specific Tracers can add specify options to modify them. This logic was added before instances were created and the tracer flags were global variables. After instances were created where a tracer may exist in more than one instance, the flags were not updated from being global into instance specific. This causes confusion with these options. For example, the function tracer has an option to enable function arguments: # cd /sys/kernel/tracing # mkdir instances/foo # echo function > instances/foo/current_tracer # echo 1 > options/func-args # echo function > current_tracer # cat trace [..] <idle>-0 [005] d..3. 1050.656187: rcu_needs_cpu() <-tick_nohz_next_event <idle>-0 [005] d..3. 1050.656188: get_next_timer_interrupt(basej=0x10002dbad, basem=0xf45fd7d300) <-tick_nohz_next_event <idle>-0 [005] d..3. 1050.656189: _raw_spin_lock(lock=0xffff8944bdf5de80) <-__get_next_timer_interrupt <idle>-0 [005] d..4. 1050.656190: do_raw_spin_lock(lock=0xffff8944bdf5de80) <-__get_next_timer_interrupt <idle>-0 [005] d..4. 1050.656191: _raw_spin_lock_nested(lock=0xffff8944bdf5f140, subclass=1) <-__get_next_timer_interrupt # cat instances/foo/options/func-args 1 # cat instances/foo/trace [..] kworker/4:1-88 [004] ...1. 298.127735: next_zone <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127736: first_online_pgdat <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127738: next_online_pgdat <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127739: fold_diff <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127741: round_jiffies_relative <-vmstat_update [..] The above shows that setting "func-args" in the top level instance also set it in the instance "foo", but since the interface of the trace flags are per instance, the update didn't take affect in the "foo" instance. Update the infrastructure to allow tracers to add a "default_flags" field in the tracer structure that can be set instead of "flags" which will make the flags per instance. If a tracer needs to keep the flags global (like blktrace), keeping the "flags" field set will keep the old behavior. This does not update function or the function graph tracers. That will be handled later. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251111232429.305317942@kernel.org Fixes: `f20a580627` ("ftrace: Allow instances to use function tracing") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-12 09:59:54 -05:00
Christian Brauner	a3f8f86627	power: always freeze efivarfs The efivarfs filesystems must always be frozen and thawed to resync variable state. Make it so. Link: https://patch.msgid.link/20251105-vorbild-zutreffen-fe00d1dd98db@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-12 10:12:39 +01:00
Chen Ridong	f23cb0ced8	cpuset: remove need_rebuild_sched_domains Previously, update_cpumasks_hier() used need_rebuild_sched_domains to decide whether to invoke rebuild_sched_domains_locked(). Now that rebuild_sched_domains_locked() only sets force_rebuild, the flag is redundant. Hence, remove it. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-11 11:47:08 -10:00
Chen Ridong	648d43da64	cpuset: remove global remote_children list The remote_children list is used to track all remote partitions attached to a cpuset. However, it serves no other purpose. Using a boolean flag to indicate whether a cpuset is a remote partition is a more direct approach, making remote_children unnecessary. This patch replaces the list with a remote_partition flag in the cpuset structure and removes remote_children entirely. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-11 11:47:08 -10:00
Chen Ridong	0241e9e2bd	cpuset: simplify node setting on error There is no need to jump to the 'done' label upon failure, as no cleanup is required. Return the error code directly instead. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-11 11:47:08 -10:00
Bert Karwatzki	01a743550b	cgroup: include missing header for struct irq_work To compile cgroup.c with PREEMPT_RT=y include header which declares struct irq_work. Fixes: `9311e6c29b` ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT") Signed-off-by: Bert Karwatzki <spasswolf@web.de> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-11 08:52:42 -10:00
Shrikanth Hegde	65177ea9f6	sched/deadline: Minor cleanup in select_task_rq_dl() In select_task_rq_dl, there is only one goto statement, there is no need for it. No functional changes. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20251014100342.978936-2-sshegde@linux.ibm.com	2025-11-11 17:27:55 +01:00
Shrikanth Hegde	b4bfacd392	sched/deadline: Use cpumask_weight_and() in dl_bw_cpus cpumask_subset(a,b) -> cpumask_weight(a) should be same as cpumask_weight_and(a,b) for_each_cpu_and(a,b) to count cpus could be replaced by cpumask_weight_and(a,b) No Functional Change. It could save a few cycles since cpumask_weight_and would be more efficient. Plus one less stack variable. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20251014100342.978936-3-sshegde@linux.ibm.com	2025-11-11 17:27:55 +01:00
Peter Zijlstra	2614069c59	sched/deadline: Document dl_server Place the notes that resulted from going through the dl_server code in a comment. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2025-11-11 17:27:50 +01:00
Menglong Dong	cd06078a38	tracing: fprobe: use ftrace if CONFIG_DYNAMIC_FTRACE_WITH_ARGS For now, we will use ftrace for the fprobe if fp->exit_handler not exists and CONFIG_DYNAMIC_FTRACE_WITH_REGS is enabled. However, CONFIG_DYNAMIC_FTRACE_WITH_REGS is not supported by some arch, such as arm. What we need in the fprobe is the function arguments, so we can use ftrace for fprobe if CONFIG_DYNAMIC_FTRACE_WITH_ARGS is enabled. Therefore, use ftrace if CONFIG_DYNAMIC_FTRACE_WITH_REGS or CONFIG_DYNAMIC_FTRACE_WITH_ARGS enabled. Link: https://lore.kernel.org/all/20251103063434.47388-1-dongml2@chinatelecom.cn/ Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2025-11-11 22:32:10 +09:00
Menglong Dong	2c67dc457b	tracing: fprobe: optimization for entry only case For now, fgraph is used for the fprobe, even if we need trace the entry only. However, the performance of ftrace is better than fgraph, and we can use ftrace_ops for this case. Then performance of kprobe-multi increases from 54M to 69M. Before this commit: $ ./benchs/run_bench_trigger.sh kprobe-multi kprobe-multi : 54.663 ± 0.493M/s After this commit: $ ./benchs/run_bench_trigger.sh kprobe-multi kprobe-multi : 69.447 ± 0.143M/s Mitigation is disable during the bench testing above. Link: https://lore.kernel.org/all/20251015083238.2374294-2-dongml2@chinatelecom.cn/ Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2025-11-11 22:32:09 +09:00
Masami Hiramatsu (Google)	e667152e00	tracing: fprobe: Fix to init fprobe_ip_table earlier Since the fprobe_ip_table is used from module unloading in the failure path of load_module(), it must be initialized in the earlier timing than late_initcall(). Unless that, the fprobe_module_callback() will use an uninitialized spinlock of fprobe_ip_table. Initialize fprobe_ip_table in core_initcall which is the same timing as ftrace. Link: https://lore.kernel.org/all/175939434403.3665022.13030530757238556332.stgit@mhiramat.tok.corp.google.com/ Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202509301440.be4b3631-lkp@intel.com Fixes: e5a4cc28a052 ("tracing: fprobe: use rhltable for fprobe_ip_table") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Menglong Dong <menglong8.dong@gmail.com>	2025-11-11 22:32:09 +09:00
Thomas Weißschuh	69d8895cb9	rv: Add explicit lockdep context for reactors Reactors can be called from any context through tracepoints. When developing reactors care needs to be taken to only call APIs which are safe. As the tracepoints used during testing may not actually be called from restrictive contexts lockdep may not be helpful. Add explicit overrides to help lockdep find invalid code patterns. The usage of LD_WAIT_FREE will trigger lockdep warnings in the panic reactor. These are indeed valid warnings but they are out of scope for RV and will instead be fixed by the printk subsystem. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Nam Cao <namcao@linutronix.de> Link: https://lore.kernel.org/r/20251014-rv-lockdep-v1-3-0b9e51919ea8@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2025-11-11 13:18:56 +01:00
Thomas Weißschuh	68f63cea46	rv: Make rv_reacting_on() static There are no external users left. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Link: https://lore.kernel.org/r/20251014-rv-lockdep-v1-2-0b9e51919ea8@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2025-11-11 13:18:56 +01:00
Thomas Weißschuh	4f739ed19d	rv: Pass va_list to reactors The only thing the reactors can do with the passed in varargs is to convert it into a va_list. Do that in a central helper instead. It simplifies the reactors, removes some hairy macro-generated code and introduces a convenient hook point to modify reactor behavior. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Link: https://lore.kernel.org/r/20251014-rv-lockdep-v1-1-0b9e51919ea8@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2025-11-11 13:18:55 +01:00
Peter Zijlstra	f5a538c07d	sched/deadline: Fix dl_server stop condition Gabriel reported that the dl_server doesn't stop as expected. The problem was found to be the fact that idle time and fair runtime are treated equally. Both will count towards dl_server runtime and push the activation forwards when it is in the zero-laxity wait state. Notably: dl_server_update_idle() update_curr_dl_se() if (dl_defer && dl_throttled && dl_runtime_exceeded()) hrtimer_try_to_cancel(); // stop timer replenish_dl_new_period() deadline = now + dl_deadline; // fwd period runtime = dl_runtime; start_dl_timer(); // restart timer And while we do want idle time accounted towards the current activation of the dl_server -- after all, a fair task could've ran if we had any -- we don't necessarily want idle time to cause or push forward an activation. Introduce dl_defer_idle to make this distinction. It will be set once idle time pushed the activation forward, once set idle time will only be allowed to consume any runtime but not push the activation. This will then cause dl_server_timer() to fire, which will stop the dl_server. Any non-idle time accounting during this phase will clear dl_defer_idle, so only a full period of idle will cause the dl_server to stop. Reported-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251101000057.GA2184199@noisy.programming.kicks-ass.net	2025-11-11 12:33:39 +01:00
Peter Zijlstra	e636ffb9e3	sched/deadline: Fix dl_server time accounting The dl_server time accounting code is a little odd. The normal scheduler pattern is to update curr before doing something, such that the old state is fully accounted before changing state. Notably, the dl_server_timer() needs to propagate the current time accounting since the current task could be ran by dl_server and thus this can affect dl_se->runtime. Similarly for dl_server_start(). And since the (deferred) dl_server wants idle time accounted, rework sched_idle_class time accounting to be more like all the others. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251020141130.GJ3245006@noisy.programming.kicks-ass.net	2025-11-11 12:33:38 +01:00
Hao Jia	e40cea333e	sched/core: Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked() Since commit `d4c64207b8` ("sched: Cleanup the sched_change NOCLOCK usage"), update_rq_clock() is called in do_set_cpus_allowed() -> sched_change_begin() to update the rq clock. This results in a duplicate call update_rq_clock() in __set_cpus_allowed_ptr_locked(). While holding the rq lock and before calling do_set_cpus_allowed(), there is nothing that depends on an updated rq_clock. Therefore, remove the redundant update_rq_clock() in __set_cpus_allowed_ptr_locked() to avoid the warning about double rq clock updates. Fixes: `d4c64207b8` ("sched: Cleanup the sched_change NOCLOCK usage") Signed-off-by: Hao Jia <jiahao1@lixiang.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20251029093655.31252-1-jiahao.kernel@gmail.com	2025-11-11 12:33:38 +01:00
Peter Zijlstra	79f3f9bedd	sched/eevdf: Fix min_vruntime vs avg_vruntime Basically, from the constraint that the sum of lag is zero, you can infer that the 0-lag point is the weighted average of the individual vruntime, which is what we're trying to compute: \Sum w_i * v_i avg = -------------- \Sum w_i Now, since vruntime takes the whole u64 (worse, it wraps), this multiplication term in the numerator is not something we can compute; instead we do the min_vruntime (v0 henceforth) thing like: v_i = (v_i - v0) + v0 This does two things: - it keeps the key: (v_i - v0) 'small'; - it creates a relative 0-point in the modular space. If you do that subtitution and work it all out, you end up with: \Sum w_i * (v_i - v0) avg = --------------------- + v0 \Sum w_i Since you cannot very well track a ratio like that (and not suffer terrible numerical problems) we simpy track the numerator and denominator individually and only perform the division when strictly needed. Notably, the numerator lives in cfs_rq->avg_vruntime and the denominator lives in cfs_rq->avg_load. The one extra 'funny' is that these numbers track the entities in the tree, and current is typically outside of the tree, so avg_vruntime() adds current when needed before doing the division. (vruntime_eligible() elides the division by cross-wise multiplication) Anyway, as mentioned above, we currently use the CFS era min_vruntime for this purpose. However, this thing can only move forward, while the above avg can in fact move backward (when a non-eligible task leaves, the average becomes smaller), this can cause trouble when through happenstance (or construction) these values drift far enough apart to wreck the game. Replace cfs_rq::min_vruntime with cfs_rq::zero_vruntime which is kept near/at avg_vruntime, following its motion. The down-side is that this requires computing the avg more often. Fixes: `147f3efaa2` ("sched/fair: Implement an EEVDF-like scheduling policy") Reported-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251106111741.GC4068168@noisy.programming.kicks-ass.net Cc: stable@vger.kernel.org	2025-11-11 12:33:38 +01:00
Peter Zijlstra	9359d9785d	sched/core: Add comment explaining force-idle vruntime snapshots I always end up having to re-read these emails every time I look at this code. And a future patch is going to change this story a little. This means it is past time to stick them in a comment so it can be modified and stay current. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200506143506.GH5298@hirez.programming.kicks-ass.net Link: https://lkml.kernel.org/r/20200515103844.GG2978@hirez.programming.kicks-ass.net Link: https://patch.msgid.link/20251106111603.GB4068168@noisy.programming.kicks-ass.net	2025-11-11 12:33:37 +01:00
Fernand Sieber	7f829bde94	sched/core: Optimize core cookie matching check Early return true if the core cookie matches. This avoids the SMT mask loop to check for an idle core, which might be more expensive on wide platforms. Signed-off-by: Fernand Sieber <sieberf@amazon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Link: https://patch.msgid.link/20251105152538.470586-1-sieberf@amazon.com	2025-11-11 12:33:37 +01:00
Fernand Sieber	127b90315c	sched/proxy: Yield the donor task When executing a task in proxy context, handle yields as if they were requested by the donor task. This matches the traditional PI semantics of yield() as well. This avoids scenario like proxy task yielding, pick next task selecting the same previous blocked donor, running the proxy task again, etc. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202510211205.1e0f5223-lkp@intel.com Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Fernand Sieber <sieberf@amazon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251106104022.195157-1-sieberf@amazon.com	2025-11-11 12:33:36 +01:00
Christian Brauner	c2bbd2db52	ns: drop custom reference count initialization for initial namespaces Initial namespaces don't modify their reference count anymore. They remain fixed at one so drop the custom refcount initializations. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-16-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-11 10:01:32 +01:00
Christian Brauner	282879afa0	pid: rely on common reference count behavior Now that we changed the generic reference counting mechanism for all namespaces to never manipulate reference counts of initial namespaces we can drop the special handling for pid namespaces. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-15-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-11 10:01:32 +01:00
Christian Brauner	6bf253855a	ns: rename is_initial_namespace() Rename is_initial_namespace() to ns_init_inum() and make it symmetrical with the ns id variant. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-9-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-11 10:01:31 +01:00
Christian Brauner	298ab06ae4	nstree: use guards for ns_tree_lock Make use of the guard infrastructure for ns_tree_lock. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-7-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-11 10:01:31 +01:00
Christian Brauner	8a30420c89	nstree: simplify owner list iteration Make use of list_for_each_entry_from_rcu(). Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-6-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-11 10:01:30 +01:00
Christian Brauner	a657bc8a75	nstree: switch to new structures Switch the nstree management to the new combined structures. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-5-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-11 10:01:30 +01:00
Christian Brauner	d12ea8062f	nstree: add helper to operate on struct ns_tree_{node,root} Add helpers that work on the combined rbtree and rculist combined. This will make the code a lot more managable and legible. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-4-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-11 10:01:30 +01:00
Christian Brauner	a67ee4e2ba	Merge branch 'kbuild-6.19.fms.extension' Bring in the shared branch with the kbuild tree to enable '-fms-extensions' for 6.19. Further namespace cleanup work requires this extension. Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-11 09:59:08 +01:00
Masami Hiramatsu (Google)	7157062bb4	tracing: Report wrong dynamic event command Report wrong dynamic event type in the command via error_log. ----- # echo "z hoge" > /sys/kernel/tracing/dynamic_events sh: write error: Invalid argument # cat /sys/kernel/tracing/error_log [ 22.977022] dynevent: error: No matching dynamic event type Command: z hoge ^ ----- Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/176278970056.343441.10528135217342926645.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-10 19:26:14 -05:00
Steven Rostedt	3a0d5bc76f	tracing: Use switch statement instead of ifs in set_tracer_flag() The "mask" passed in to set_trace_flag() has a single bit set. The function then checks if the mask is equal to one of the option masks and performs the appropriate function associated to that option. Instead of having a bunch of "if ()" statement, use a "switch ()" statement instead to make it cleaner and a bit more optimal. No function changes. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251106003501.890298562@kernel.org Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-10 14:33:51 -05:00
Steven Rostedt	9c5053083e	tracing: Exit out immediately after update_marker_trace() The call to update_marker_trace() in set_tracer_flag() performs the update to the tr->trace_flags. There's no reason to perform it again after it is called. Return immediately instead. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251106003501.726406870@kernel.org Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-10 14:33:43 -05:00
Steven Rostedt	5aa0d18df0	tracing: Have add_tracer_options() error pass up to callers The function add_tracer_options() can fail, but currently it is ignored. Pass the status of add_tracer_options() up to adding a new tracer as well as when an instance is created. Have the instance creation fail if the add_tracer_options() fail. Only print a warning for the top level instance, like it does with other failures. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251105161935.375299297@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-10 14:28:31 -05:00
Steven Rostedt	c7bed15ccf	tracing: Remove dummy options and flags When a tracer does not define their own flags, dummy options and flags are used so that the values are always valid. There's not that many locations that reference these values so having dummy versions just complicates the code. Remove the dummy values and just check for NULL when appropriate. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251105161935.206093132@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-10 14:27:04 -05:00
Steven Rostedt	a10e6e6818	tracing: Hide __NR_utimensat and _NR_mq_timedsend when not defined Some architectures (riscv-32) do not define __NR_utimensat and _NR_mq_timedsend, and fails to build when they are used. Hide them in "ifdef"s. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251104205310.00a1db9a@batman.local.home Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202511031239.ZigDcWzY-lkp@intel.com/ Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-10 14:23:53 -05:00
D. Wythe	07c428ece3	bpf: Export necessary symbols for modules with struct_ops Exports three necessary symbols for implementing struct_ops with tristate subsystem. To hold or release refcnt of struct_ops refcnt by inline funcs bpf_try_module_get and bpf_module_put which use bpf_struct_ops_get(put) conditionally. And to copy obj name from one to the other with effective checks by bpf_obj_name_cpy. Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20251107035632.115950-2-alibuda@linux.alibaba.com	2025-11-10 11:07:34 -08:00
Jakub Sitnicki	f38499ff45	bpf: Unclone skb head on bpf_dynptr_write to skb metadata Currently bpf_dynptr_from_skb_meta() marks the dynptr as read-only when the skb is cloned, preventing writes to metadata. Remove this restriction and unclone the skb head on bpf_dynptr_write() to metadata, now that the metadata is preserved during uncloning. This makes metadata dynptr consistent with skb dynptr, allowing writes regardless of whether the skb is cloned. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-3-5ceb08a9b37b@cloudflare.com	2025-11-10 10:52:31 -08:00
zhang jiao	4457265c61	workqueue: Remove unused assert_rcu_or_wq_mutex_or_pool_mutex assert_rcu_or_wq_mutex_or_pool_mutex is never referenced in the code. Just remove it. Signed-off-by: zhang jiao <zhangjiao2@cmss.chinamobile.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-10 06:10:18 -10:00
Petr Mladek	394aa576c0	printk_ringbuffer: Create a helper function to decide whether more space is needed The decision whether some more space is needed is tricky in the printk ring buffer code: 1. The given lpos values might overflow. A subtraction must be used instead of a simple "lower than" check. 2. Another CPU might reuse the space in the mean time. It can be detected when the subtraction is bigger than DATA_SIZE(data_ring). 3. There is exactly enough space when the result of the subtraction is zero. But more space is needed when the result is exactly DATA_SIZE(data_ring). Add a helper function to make sure that the check is done correctly in all situations. Also it helps to make the code consistent and better documented. Suggested-by: John Ogness <john.ogness@linutronix.de> Link: https://lore.kernel.org/r/87tsz7iea2.fsf@jogness.linutronix.de Reviewed-by: John Ogness <john.ogness@linutronix.de> Link: https://patch.msgid.link/20251107194720.1231457-3-pmladek@suse.com [pmladek@suse.com: Updated wording as suggested by John] Signed-off-by: Petr Mladek <pmladek@suse.com>	2025-11-10 13:09:43 +01:00

... 6 7 8 9 10 ...

50282 Commits