linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-07-16 17:57:38 -04:00

Author	SHA1	Message	Date
Linus Torvalds	a53fcff8fc	Merge tag 'timers-nohz-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull NOHZ updates from Thomas Gleixner: - Fix a long standing TOCTOU in get_cpu_sleep_time_us() - Make the CPU offline NOHZ handling more robust by disabling NOHZ on the outgoing CPU early instead of creating unneeded state which needs to be undone. - Unify idle CPU time accounting instead of having two different accounting mechanisms. These two different mechanisms are not really independent, but the different properties can in the worst case cause that gloabl idle time can be observed going backwards. - Consolidate the idle/iowait time retrieval interfaces instead of converting back and forth between them. - Make idle interrupt time accounting more robust. The original code assumes that interrupt time accouting is enabled and therefore stops elapsing idle time while an interrupt is handled in NOHZ dyntick state. That assumption is not correct as interrupt time accounting can be disabled at compile and runtime. - Fix an accounting error between dyntick idle time and dyntick idle steal time. The stolen time is not accounted and therefore idle time becomes inaccurate. The stolen time is now accounted after the fact as there is no way to predict the steal time upfront. * tag 'timers-nohz-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: sched/cputime: Handle dyntick-idle steal time correctly sched/cputime: Handle idle irqtime gracefully sched/cputime: Provide get_cpu_[idle\|iowait]_time_us() off-case tick/sched: Consolidate idle time fetching APIs tick/sched: Account tickless idle cputime only when tick is stopped tick/sched: Remove unused fields tick/sched: Move dyntick-idle cputime accounting to cputime code tick/sched: Remove nohz disabled special case in cputime fetch tick/sched: Unify idle cputime accounting s390/time: Prepare to stop elapsing in dynticks-idle powerpc/time: Prepare to stop elapsing in dynticks-idle sched/cputime: Correctly support generic vtime idle time sched/cputime: Remove superfluous and error prone kcpustat_field() parameter sched/idle: Handle offlining first in idle loop tick/sched: Fix TOCTOU in nohz idle time fetch	2026-06-15 13:48:52 +05:30
Linus Torvalds	b8b674748f	Merge tag 'rcu.release.v7.2' of gitolite.kernel.org:pub/scm/linux/kernel/git/rcu/linux Pull RCU updates from Uladzislau Rezki: "Torture test updates: - Improve kvm-series.sh script by adding examples in its header comment - Lazy RCU is more fully tested now by replacing call_rcu_hurry() with call_rcu() and doing rcu_barrier() to motivate lazy callbacks during a stutter pause - Add more synonyms for the "--do-normal" group of torture.sh command-line arguments Misc changes: - Reduce stack usage of nocb_gp_wait() to address frame size warning when built with CONFIG_UBSAN_ALIGNMENT - The synchronize_rcu() call can detect the flood and latches a normal/default path temporary switching to wait_rcu_gp() path - Document using rcu_access_pointer() to fetch the old pointer for lockless cmpxchg() updates - Simplify some RCU code using clamp_val() - Fix a kerneldoc header comment typo in srcu_down_read_fast()" * tag 'rcu.release.v7.2' of gitolite.kernel.org:pub/scm/linux/kernel/git/rcu/linux: rcu/nocb: reduce stack usage in nocb_gp_wait() rcu-tasks: Fix possible boot-time tests failed for the call_rcu_tasks() rcu: Latch normal synchronize_rcu() path on flood rcu: Document rcu_access_pointer() feeding into cmpxchg() rcu: Simplify param_set_next_fqs_jiffies() by applying clamp_val() rcu: Simplify rcu_do_batch() by applying clamp() checkpatch: Undeprecate rcu_read_lock_trace() and rcu_read_unlock_trace() srcu: Fix kerneldoc header comment typo in srcu_down_read_fast() torture: Allow "norm" abbreviation for "normal" torture: Improve kvm-series.sh header comment torture: Add torture_sched_set_normal() for user-specified nice values rcutorture: Fully test lazy RCU	2026-06-15 09:16:00 +05:30
Frederic Weisbecker	080b5c6d95	sched/cputime: Remove superfluous and error prone kcpustat_field() parameter The first parameter to kcpustat_field() is a pointer to the cpu kcpustat to be fetched from. This parameter is error prone because a copy to a kcpustat could be passed by accident instead of the original one. Also the kcpustat structure can already be retrieved with the help of the mandatory CPU argument. Remove the needless parameter. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-4-frederic@kernel.org	2026-06-02 21:27:25 +02:00
Uladzislau Rezki (Sony)	e853c1b285	Merge branches 'rcutorture.2026.05.24' and 'misc.2026.05.24' into rcu-merge.2026.05.24 rcutorture.2026.05.24: Torture-test updates misc.2026.05.24: Miscellaneous RCU updates	2026-06-02 19:45:08 +02:00
Arnd Bergmann	002668809b	rcu/nocb: reduce stack usage in nocb_gp_wait() When CONFIG_UBSAN_ALIGNMENT is enabled, the stack usage of nocb_gp_wait() grows above typical warning limits: In file included from kernel/rcu/tree.c:4930: kernel/rcu/tree_nocb.h: In function 'rcu_nocb_gp_kthread': kernel/rcu/tree_nocb.h:866:1: error: the frame size of 1968 bytes is larger than 1280 bytes [-Werror=frame-larger-than=] Apparently, the problem is passing rcu_data from a 'void *' pointer, which gcc assumes may be misaligned. When the function is not inlined into rcu_nocb_gp_kthread(), that is no longer visible to gcc. Add a 'noinline_for_stack' annotation that leads to skipping a lot of the alignment sanitizer checks and keeps the stack usage 60% lower here. Reviewed-by: Kunwu Chan <chentao@kylinos.cn> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2026-06-02 19:43:53 +02:00
Zqiang	42c5468f9c	rcu-tasks: Fix possible boot-time tests failed for the call_rcu_tasks() The following scenarios will cause the call_rcu_tasks() boot-time tests failed: CPU0 CPU1 rcu_init_tasks_generic() ->rcu_tasks_initiate_self_tests() ->call_rcu_tasks_trace(&tests[1].rh, test_rcu_tasks_callback) ->call_rcu_tasks_generic() ->havekthread = smp_load_acquire(&rtp->kthread_ptr) "The havekthread is false" .... rcu_tasks_kthread() ->smp_store_release(&rtp->kthread_ptr, current) ->rcu_tasks_one_gp() ->rcuwait_wait_event() ->rcu_tasks_need_gpcb() ->for (cpu = 0; cpu < dequeue_limit; cpu++) ->rcu_segcblist_n_cbs(&rtpcp->cblist) == 0 ->schedule() ->raw_spin_trylock_rcu_node() ->needwake = (func == wakeme_after_rcu) \|\| (rcu_segcblist_n_cbs(&rtpcp->cblist) == rcu_task_lazy_lim) "the rcu_task_lazy_lim default value is 32, and the func pointer is test_rcu_tasks_callback, lead to needwake is false." ->if (havekthread && !needwake && !timer_pending(&rtpcp->lazy_timer)) "the havekthread is false, will not enter here." .... "the needwake is false lead to rtp_irq_work can not queue, even if the rtp->kthread_ptr already exists at this point." ->if (needwake && READ_ONCE(rtp->kthread_ptr)) ->irq_work_queue(&rtpcp->rtp_irq_work) For the above scenarios, if the call_rcu_tasks() is not called again afterward, the rcu_tasks_kthread will not have a chance to be wakeup, the test_rcu_tasks_callback() will never be called, the boot-time tests failed can happen, this commit therefore check havekthread variable, if it's false and the rtpcp->cblist is empty, set needwake variable is true, if the rtp->kthread_ptr exist, the rtpcp->rtp_irq_work can be queued to wakeup rcu_tasks_kthread. Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2026-05-24 09:40:13 +02:00
Uladzislau Rezki (Sony)	eceb659752	rcu: Latch normal synchronize_rcu() path on flood Currently, rcu_normal_wake_from_gp is only enabled by default on small systems(<= 16 CPUs) or when a user explicitly set it enabled. Introduce an adaptive latching mechanism: * Track the number of in-flight synchronize_rcu() requests using a new rcu_sr_normal_count counter; * If the count reaches/exceeds RCU_SR_NORMAL_LATCH_THR(64), it sets the rcu_sr_normal_latched, reverting new requests onto the scaled wait_rcu_gp() path; * The latch is cleared only when the pending requests are fully drained(nr == 0); * Enables rcu_normal_wake_from_gp by default for all systems, relying on this dynamic throttling instead of static CPU limits. Testing(synthetic flood workload): * Kernel version: 6.19.0-rc6 * Number of CPUs: 1536 * 60K concurrent synchronize_rcu() calls Perf(cycles, system-wide): total cycles: 932020263832 rcu_sr_normal_add_req(): 2650282811 cycles(~0.28%) Perf report excerpt: 0.01% 0.01% sync_test/... [k] rcu_sr_normal_add_req Measured overhead of rcu_sr_normal_add_req() remained ~0.28% of total CPU cycles in this synthetic stress test. Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Tested-by: Samir M <samir@linux.ibm.com> Suggested-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2026-05-24 09:40:08 +02:00
Paul E. McKenney	d9b4d36b8c	rcu: Simplify param_set_next_fqs_jiffies() by applying clamp_val() This commit replaces a nested ?: sequence with clamp_val(). This does not reduce the number of lines of code, but it does simplify the line that it modifies. Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2026-05-24 09:39:57 +02:00
Paul E. McKenney	5f136351ed	rcu: Simplify rcu_do_batch() by applying clamp() This commit replaces a nested ?: sequence with clamp(). This does not reduce the number of lines of code, but it does simplify the line that it modifies. Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2026-05-24 09:39:52 +02:00
Paul E. McKenney	aa61e8b4fb	rcutorture: Fully test lazy RCU Currently, rcutorture bypasses lazy RCU by using call_rcu_hurry(). This works, avoiding the dreaded rtort_pipe_count WARN(), but fails to fully test lazy RCU. The rtort_pipe_count WARN() splats because lazy RCU could delay the start of an RCU grace period for a full stutter period, which defaults to only three seconds. This commit therefore reverts the call_rcu_hurry() instances back to call_rcu(), but, in kernels built with CONFIG_RCU_LAZY=y, queues a workqueue handler just before the call to stutter_wait() in rcu_torture_writer(). This workqueue handler invokes rcu_barrier(), which motivates any lingering lazy callbacks, thus avoiding the splat. Reported-by: Saravana Kannan <saravanak@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2026-05-24 09:38:39 +02:00
Paul E. McKenney	593889c401	srcu: Don't queue workqueue handlers to never-online CPUs While an srcu_struct structure is in the midst of switching from CPU-0 to all-CPUs state, it can attempt to invoke callbacks for CPUs that have never been online. Worse yet, it can attempt in invoke callbacks for CPUs that never will be online, even including imaginary CPUs not in cpu_possible_mask. This can cause hangs on s390, which is not set up to deal with workqueue handlers being scheduled on such CPUs. This commit therefore causes Tree SRCU to refrain from queueing workqueue handlers on CPUs that have not yet (and might never) come online. Because callbacks are not invoked on CPUs that have not been online, it is an error to invoke call_srcu(), synchronize_srcu(), or synchronize_srcu_expedited() on a CPU that is not yet fully online. However, it turns out to be less code to redirect the callbacks from too-early invocations of call_srcu() than to warn about such invocations. This commit therefore also redirects callbacks queued on not-yet-fully-online CPUs to the boot CPU. Reported-by: Vasily Gorbik <gor@linux.ibm.com> Fixes: `61bbcfb505` ("srcu: Push srcu_node allocation to GP when non-preemptible") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Vasily Gorbik <gor@linux.ibm.com> Tested-by: Samir <samir@linux.ibm.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Boqun Feng <boqun@kernel.org>	2026-05-18 12:27:18 -07:00
Linus Torvalds	28483203f7	Merge tag 'rcu.2026.03.31a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux Pull RCU updates from Joel Fernandes: "NOCB CPU management: - Consolidate rcu_nocb_cpu_offload() and rcu_nocb_cpu_deoffload() to reduce code duplication - Extract nocb_bypass_needs_flush() helper to reduce duplication in NOCB bypass path rcutorture/torture infrastructure: - Add NOCB01 config for RCU_LAZY torture testing - Add NOCB02 config for NOCB poll mode testing - Add TRIVIAL-PREEMPT config for textbook-style preemptible RCU torture - Test call_srcu() with preemption both disabled and enabled - Remove kvm-check-branches.sh in favor of kvm-series.sh - Make hangs more visible in torture.sh output - Add informative message for tests without a recheck file - Fix numeric test comparison in srcu_lockdep.sh - Use torture_shutdown_init() in refscale and rcuscale instead of open-coded shutdown functions - Fix modulo-zero error in torture_hrtimeout_ns(). SRCU: - Fix SRCU read flavor macro comments - Fix s/they disables/they disable/ typo in srcu_read_unlock_fast() RCU Tasks: - Document that RCU Tasks Trace grace periods now imply RCU grace periods - Remove unnecessary smp_store_release() in cblist_init_generic()" * tag 'rcu.2026.03.31a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: rcutorture: Test call_srcu() with preemption disabled and not rcu: Add BOOTPARAM_RCU_STALL_PANIC Kconfig option torture: Avoid modulo-zero error in torture_hrtimeout_ns() rcu/nocb: Extract nocb_bypass_needs_flush() to reduce duplication rcu/nocb: Consolidate rcu_nocb_cpu_offload/deoffload functions rcu-tasks: Remove unnecessary smp_store_release() in cblist_init_generic() rcutorture: Add NOCB02 config for nocb poll mode testing rcutorture: Add NOCB01 config for RCU_LAZY torture testing rcu-tasks: Document that RCU Tasks Trace grace periods now imply RCU grace periods srcu: Fix s/they disables/they disable/ typo in srcu_read_unlock_fast() srcu: Fix SRCU read flavor macro comments rcuscale: Ditch rcu_scale_shutdown in favor of torture_shutdown_init() refscale: Ditch ref_scale_shutdown in favor of torture_shutdown_init() rcutorture: Fix numeric "test" comparison in srcu_lockdep.sh torture: Print informative message for test without recheck file torture: Make hangs more visible in torture.sh output kvm-check-branches.sh: Remove in favor of kvm-series.sh rcutorture: Add a textbook-style trivial preemptible RCU	2026-04-13 09:36:45 -07:00
Paul E. McKenney	95c7d025cc	rcutorture: Test call_srcu() with preemption disabled and not This commit tests invoking call_srcu() with preemption both enabled and disabled, via acquiring of pi lock. [ Joel: reword commit message. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>	2026-03-30 15:48:14 -04:00
Gustavo Luiz Duarte	ab875b3e17	rcu: Add BOOTPARAM_RCU_STALL_PANIC Kconfig option Add a Kconfig option to set the default value of the kernel.panic_on_rcu_stall sysctl, allowing the kernel to be built with panic-on-RCU-stall enabled by default. This is useful for high-availability systems that require automatic recovery (via panic_timeout) when a CPU stall is detected, without needing userspace to configure the sysctl at boot. This follows the pattern established by BOOTPARAM_SOFTLOCKUP_PANIC and BOOTPARAM_HUNG_TASK_PANIC. The runtime sysctl can still override the Kconfig default. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>	2026-03-30 15:48:14 -04:00
Joel Fernandes	2243517a54	rcu/nocb: Extract nocb_bypass_needs_flush() to reduce duplication The bypass flush decision logic is duplicated in rcu_nocb_try_bypass() and nocb_gp_wait() with similar conditions. This commit therefore extracts the functionality into a common helper function nocb_bypass_needs_flush() improving the code readability. A flush_faster parameter is added to controlling the flushing thresholds and timeouts. This design was in the original commit `d1b222c6be` ("rcu/nocb: Add bypass callback queueing") to avoid having the GP kthread aggressively flush the bypass queue. Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>	2026-03-30 15:48:14 -04:00
Joel Fernandes	18d01ff3b9	rcu/nocb: Consolidate rcu_nocb_cpu_offload/deoffload functions The rcu_nocb_cpu_offload() and rcu_nocb_cpu_deoffload() functions are nearly duplicates. Therefore, extract the common logic into rcu_nocb_cpu_toggle_offload() which takes an 'offload' boolean, and make both exported functions simple wrappers. This eliminates a bunch of duplicate code at the call sites, namely mutex locking, CPU hotplug locking and CPU online checks. Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>	2026-03-30 15:48:14 -04:00
Zqiang	3e3d7d8f3a	rcu-tasks: Remove unnecessary smp_store_release() in cblist_init_generic() The cblist_init_generic() is executed during the CPU early boot phase due to commit:30ef09635b9e ("rcu-tasks: Initialize callback lists at rcu_init() time"), at this time, only one boot CPU is online and the irq is disabled. this commit therefore use routine assignment replace of smp_store_release() and WRITE_ONCE() in the cblist_init_generic(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>	2026-03-30 15:48:14 -04:00
Paul E. McKenney	359cf5c942	rcuscale: Ditch rcu_scale_shutdown in favor of torture_shutdown_init() The torture_shutdown_init() function spawns a shutdown kthread in a manner very similar to that implemented by rcu_scale_shutdown(). This commit therefore re-implements rcu_scale_shutdown() in terms of torture_shutdown_init(). This patch was generated by Claude given as input the patch making the same transformation of ref_scale_shutdown(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>	2026-03-30 15:48:13 -04:00
Paul E. McKenney	b0c8dd5097	refscale: Ditch ref_scale_shutdown in favor of torture_shutdown_init() The torture_shutdown_init() function spawns a shutdown kthread in a manner very similar to that implemented by ref_scale_shutdown(). This commit therefore re-implements ref_scale_shutdown in terms of torture_shutdown_init(). The initial draft of this patch was generated by version 2.1.16 of the Claude AI/LLM, but trained and configured for use by my employer, and prompted to refer to Linux-kernel source code. This initial draft failed to provide a forward reference to ref_scale_cleanup(), passed zero to torture_shutdown_init() for an unwelcome insta-shutdown, and failed to pass the kvm.sh --duration argument in as a refscale module parameter. On the other hand, it did catch the need to NULL main_task on the post-test self-shutdown code path, which I might well have forgotten to do. This version of the patch fixes those problems, and in fact very little of the initial draft remains. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>	2026-03-30 15:48:13 -04:00
Paul E. McKenney	c6f4e552e1	rcutorture: Add a textbook-style trivial preemptible RCU This commit adds a trivial textbook implementation of preemptible RCU to rcutorture ("torture_type=trivial-preempt"), similar in spirit to the existing "torture_type=trivial" textbook implementation of non-preemptible RCU. Neither trivial RCU implementation has any value for production use, and are intended only to keep Paul honest in his introductory writings and presentations. [ paulmck: Apply kernel test robot feedback. ] Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>	2026-03-30 15:48:13 -04:00
Joel Fernandes	a6fc88b22b	srcu: Use irq_work to start GP in tiny SRCU Tiny SRCU's srcu_gp_start_if_needed() directly calls schedule_work(), which acquires the workqueue pool->lock. This causes a lockdep splat when call_srcu() is called with a scheduler lock held, due to: call_srcu() [holding pi_lock] srcu_gp_start_if_needed() schedule_work() -> pool->lock workqueue_init() / create_worker() [holding pool->lock] wake_up_process() -> try_to_wake_up() -> pi_lock Also add irq_work_sync() to cleanup_srcu_struct() to prevent a use-after-free if a queued irq_work fires after cleanup begins. Tested with rcutorture SRCU-T and no lockdep warnings. [ Thanks to Boqun for similar fix in patch "rcu: Use an intermediate irq_work to start process_srcu()" ] Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun@kernel.org>	2026-03-25 09:00:05 -07:00
Boqun Feng	7c405fb327	rcu: Use an intermediate irq_work to start process_srcu() Since commit `c27cea4416` ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can happen basically everywhere (including where a scheduler lock is held), call_srcu() now needs to avoid acquiring scheduler lock because otherwise it could cause deadlock [1]. Fix this by following what the previous RCU Tasks Trace did: using an irq_work to delay the queuing of the work to start process_srcu(). [boqun: Apply Joel's feedback] [boqun: Apply Andrea's test feedback] Reported-by: Andrea Righi <arighi@nvidia.com> Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ Fixes: commit `c27cea4416` ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] Suggested-by: Zqiang <qiang.zhang@linux.dev> Tested-by: Andrea Righi <arighi@nvidia.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun@kernel.org>	2026-03-25 08:59:59 -07:00
Paul E. McKenney	61bbcfb505	srcu: Push srcu_node allocation to GP when non-preemptible When the srcutree.convert_to_big and srcutree.big_cpu_lim kernel boot parameters specify initialization-time allocation of the srcu_node tree for statically allocated srcu_struct structures (for example, in DEFINE_SRCU() at build time instead of init_srcu_struct() at runtime), init_srcu_struct_nodes() will attempt to dynamically allocate this tree at the first run-time update-side use of this srcu_struct structure, but while holding a raw spinlock. Because the memory allocator can acquire non-raw spinlocks, this can result in lockdep splats. This commit therefore uses the same SRCU_SIZE_ALLOC trick that is used when the first run-time update-side use of this srcu_struct structure happens before srcu_init() is called. The actual allocation then takes place from workqueue context at the ends of upcoming SRCU grace periods. [boqun: Adjust the sha1 of the Fixes tag] Fixes: `175b45ed34` ("srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun@kernel.org>	2026-03-25 08:59:02 -07:00
Paul E. McKenney	175b45ed34	srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable() Tree SRCU has used non-raw spinlocks for many years, motivated by a desire to avoid unnecessary real-time latency and the absence of any reason to use raw spinlocks. However, the recent use of SRCU in tracing as the underlying implementation of RCU Tasks Trace means that call_srcu() is invoked from preemption-disabled regions of code, which in turn requires that any locks acquired by call_srcu() or its callees must be raw spinlocks. This commit therefore converts SRCU's spinlocks to raw spinlocks. [boqun: Add Fixes tag] Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Fixes: `c27cea4416` ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>	2026-03-25 08:55:50 -07:00
Kees Cook	189f164e57	Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-22 08:26:33 -08:00
Linus Torvalds	32a92f8c89	Convert more 'alloc_obj' cases to default GFP_KERNEL arguments This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 20:03:00 -08:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]alloc_objs(., GFP_KERNEL)' \| xargs sed -i 's/$alloc_objs(.*$, GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Linus Torvalds	3c6e577d5a	Merge tag 'trace-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: "User visible changes: - Add an entry into MAINTAINERS file for RUST versions of code There's now RUST code for tracing and static branches. To differentiate that code from the C code, add entries in for the RUST version (with "[RUST]" around it) so that the right maintainers get notified on changes. - New bitmask-list option added to tracefs When this is set, bitmasks in trace event are not displayed as hex numbers, but instead as lists: e.g. 0-5,7,9 instead of 0000015f - New show_event_filters file in tracefs Instead of having to search all events///filter for any active filters enabled in the trace instance, the file show_event_filters will list them so that there's only one file that needs to be examined to see if any filters are active. - New show_event_triggers file in tracefs Instead of having to search all events///trigger for any active triggers enabled in the trace instance, the file show_event_triggers will list them so that there's only one file that needs to be examined to see if any triggers are active. - Have traceoff_on_warning disable trace pintk buffer too Recently recording of trace_printk() could go to other trace instances instead of the top level instance. But if traceoff_on_warning triggers, it doesn't stop the buffer with trace_printk() and that data can easily be lost by being overwritten. Have traceoff_on_warning also disable the instance that has trace_printk() being written to it. - Update the hist_debug file to show what function the field uses When CONFIG_HIST_TRIGGERS_DEBUG is enabled, a hist_debug file exists for every event. This displays the internal data of any histogram enabled for that event. But it is lacking the function that is called to process one of its fields. This is very useful information that was missing when debugging histograms. - Up the histogram stack size from 16 to 31 Stack traces can be used as keys for event histograms. Currently the size of the stack that is stored is limited to just 16 entries. But the storage space in the histogram is 256 bytes, meaning that it can store up to 31 entries (plus one for the count of entries). Instead of letting that space go to waste, up the limit from 16 to 31. This makes the keys much more useful. - Fix permissions of per CPU file buffer_size_kb The per CPU file of buffer_size_kb was incorrectly set to read only in a previous cleanup. It should be writable. - Reset "last_boot_info" if the persistent buffer is cleared The last_boot_info shows address information of a persistent ring buffer if it contains data from a previous boot. It is cleared when recording starts again, but it is not cleared when the buffer is reset. The data is useless after a reset so clear it on reset too. Internal changes: - A change was made to allow tracepoint callbacks to have preemption enabled, and instead be protected by SRCU. This required some updates to the callbacks for perf and BPF. perf needed to disable preemption directly in its callback because it expects preemption disabled in the later code. BPF needed to disable migration, as its code expects to run completely on the same CPU. - Have irq_work wake up other CPU if current CPU is "isolated" When there's a waiter waiting on ring buffer data and a new event happens, an irq work is triggered to wake up that waiter. This is noisy on isolated CPUs (running NO_HZ_FULL). Trigger an IPI to a house keeping CPU instead. - Use proper free of trigger_data instead of open coding it in. - Remove redundant call of event_trigger_reset_filter() It was called immediately in a function that was called right after it. - Workqueue cleanups - Report errors if tracing_update_buffers() were to fail. - Make the enum update workqueue generic for other parts of tracing On boot up, a work queue is created to convert enum names into their numbers in the trace event format files. This work queue can also be used for other aspects of tracing that takes some time and shouldn't be called by the init call code. The blk_trace initialization takes a bit of time. Have the initialization code moved to the new tracing generic work queue function. - Skip kprobe boot event creation call if there's no kprobes defined on cmdline The kprobe initialization to set up kprobes if they are defined on the cmdline requires taking the event_mutex lock. This can be held by other tracing code doing initialization for a long time. Since kprobes added to the kernel command line need to be setup immediately, as they may be tracing early initialization code, they cannot be postponed in a work queue and must be setup in the initcall code. If there's no kprobe on the kernel cmdline, there's no reason to take the mutex and slow down the boot up code waiting to get the lock only to find out there's nothing to do. Simply exit out early if there's no kprobes on the kernel cmdline. If there are kprobes on the cmdline, then someone cares more about tracing over the speed of boot up. - Clean up the trigger code a bit - Move code out of trace.c and into their own files trace.c is now over 11,000 lines of code and has become more difficult to maintain. Start splitting it up so that related code is in their own files. Move all the trace_printk() related code into trace_printk.c. Move the __always_inline stack functions into trace.h. Move the pid filtering code into a new trace_pid.c file. - Better define the max latency and snapshot code The latency tracers have a "max latency" buffer that is a copy of the main buffer and gets swapped with it when a new high latency is detected. This keeps the trace up to the highest latency around where this max_latency buffer is never written to. It is only used to save the last max latency trace. A while ago a snapshot feature was added to tracefs to allow user space to perform the same logic. It could also enable events to trigger a "snapshot" if one of their fields hit a new high. This was built on top of the latency max_latency buffer logic. Because snapshots came later, they were dependent on the latency tracers to be enabled. In reality, the latency tracers depend on the snapshot code and not the other way around. It was just that they came first. Restructure the code and the kconfigs to have the latency tracers depend on snapshot code instead. This actually simplifies the logic a bit and allows to disable more when the latency tracers are not defined and the snapshot code is. - Fix a "false sharing" in the hwlat tracer code The loop to search for latency in hardware was using a variable that could be changed by user space for each sample. If the user change this variable, it could cause a bus contention, and reading that variable can show up as a large latency in the trace causing a false positive. Read this variable at the start of the sample with a READ_ONCE() into a local variable and keep the code from sharing cache lines with readers. - Fix function graph tracer static branch optimization code When only one tracer is defined for function graph tracing, it uses a static branch to call that tracer directly. When another tracer is added, it goes into loop logic to call all the registered callbacks. The code was incorrect when going back to one tracer and never re-enabled the static branch again to do the optimization code. - And other small fixes and cleanups" * tag 'trace-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (46 commits) function_graph: Restore direct mode when callbacks drop to one tracing: Fix indentation of return statement in print_trace_fmt() tracing: Reset last_boot_info if ring buffer is reset tracing: Fix to set write permission to per-cpu buffer_size_kb tracing: Fix false sharing in hwlat get_sample() tracing: Move d_max_latency out of CONFIG_FSNOTIFY protection tracing: Better separate SNAPSHOT and MAX_TRACE options tracing: Add tracer_uses_snapshot() helper to remove #ifdefs tracing: Rename trace_array field max_buffer to snapshot_buffer tracing: Move pid filtering into trace_pid.c tracing: Move trace_printk functions out of trace.c and into trace_printk.c tracing: Use system_state in trace_printk_init_buffers() tracing: Have trace_printk functions use flags instead of using global_trace tracing: Make tracing_update_buffers() take NULL for global_trace tracing: Make printk_trace global for tracing system tracing: Move ftrace_trace_stack() out of trace.c and into trace.h tracing: Move __trace_buffer_{un}lock_*() functions to trace.h tracing: Make tracing_selftest_running global to the tracing subsystem tracing: Make tracing_disabled global for tracing system tracing: Clean up use of trace_create_maxlat_file() ...	2026-02-13 19:25:16 -08:00
Paul E. McKenney	a77cb6a867	srcu: Fix warning to permit SRCU-fast readers in NMI handlers SRCU-fast is designed to be used in NMI handlers, even going so far as to use atomic operations for architectures supporting NMIs but not providing NMI-safe per-CPU atomic operations. However, the WARN_ON_ONCE() in __srcu_check_read_flavor() complains if SRCU-fast is used in an NMI handler. This commit therefore modifies that WARN_ON_ONCE() to avoid such complaints. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Steven Rostedt <rostedt@goodmis.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Boqun Feng <boqun@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Link: https://patch.msgid.link/8232efe8-a7a3-446c-af0b-19f9b523b4f7@paulmck-laptop Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-30 10:43:58 -05:00
Boqun Feng	ed062c41df	Merge branch 'rcu-nocb.20260123a' * rcu-nocb.20260123a: rcu/nocb: Extract nocb_defer_wakeup_cancel() helper rcu/nocb: Remove dead callback overload handling rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path	2026-01-23 11:15:36 -08:00
Joel Fernandes	cc74050f13	rcu/nocb: Extract nocb_defer_wakeup_cancel() helper The pattern of checking nocb_defer_wakeup and deleting the timer is duplicated in __wake_nocb_gp() and nocb_gp_wait(). Extract this into a common helper function nocb_defer_wakeup_cancel(). This removes code duplication and makes it easier to maintain. Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-23 11:12:25 -08:00
Joel Fernandes	b11c1efa7f	rcu/nocb: Remove dead callback overload handling During callback overload (exceeding qhimark), the NOCB code attempts opportunistic advancement via rcu_advance_cbs_nowake(). Analysis shows this code path is practically unreachable and serves no useful purpose. Testing with 300,000 callback floods showed: - 30 overload conditions triggered - 0 advancements actually occurred While a theoretical window exists where this code could execute (e.g., vCPU preemption between gp_seq update and rcu_nocb_gp_cleanup()), even if it did, the advancement would be redundant. The rcuog kthread must still run to wake the rcuoc callback thread - we would just be duplicating work that rcuog will perform when it finally gets to run. Since this path provides no meaningful benefit and extensive testing confirms it is never useful, remove it entirely. Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-23 11:12:25 -08:00
Joel Fernandes	d92eca60fe	rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path The WakeOvfIsDeferred code path in __call_rcu_nocb_wake() attempts to wake rcuog when the callback count exceeds qhimark and callbacks aren't done with their GP (newly queued or awaiting GP). However, a lot of testing proves this wake is always redundant or useless. In the flooding case, rcuog is always waiting for a GP to finish. So waking up the rcuog thread is pointless. The timer wakeup adds overhead, rcuog simply wakes up and goes back to sleep achieving nothing. This path also adds a full memory barrier, and additional timer expiry modifications unnecessarily. The root cause is that WakeOvfIsDeferred fires when !rcu_segcblist_ready_cbs() (GP not complete), but waking rcuog cannot accelerate GP completion. This commit therefore removes this path. Tested with rcutorture scenarios: TREE01, TREE05, TREE08 (all NOCB configurations) - all pass. Also stress tested using a kernel module that floods call_rcu() to trigger the overload conditions and made the observations confirming the findings. Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-23 11:12:25 -08:00
Boqun Feng	fe1d482884	Merge branch 'rcu-misc.20260111a' * rcu-misc.20260111a: rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early srcu: Use suitable gfp_flags for the init_srcu_struct_nodes() rcu: Fix rcu_read_unlock() deadloop due to softirq rcutorture: Correctly compute probability to invoke ->exp_current() rcu: Make expedited RCU CPU stall warnings detect stall-end races	2026-01-11 20:15:07 +08:00
Joel Fernandes	bc3705e209	rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early The RCU grace period mechanism uses a two-phase FQS (Force Quiescent State) design where the first FQS saves dyntick-idle snapshots and the second FQS compares them. This results in long and unnecessary latency for synchronize_rcu() on idle systems (two FQS waits of ~3ms each with 1000HZ) whenever one FQS wait sufficed. Some investigations showed that the GP kthread's CPU is the holdout CPU a lot of times after the first FQS as - it cannot be detected as "idle" because it's actively running the FQS scan in the GP kthread. Therefore, at the end of rcu_gp_init(), immediately report a quiescent state for the GP kthread's CPU using rcu_qs() + rcu_report_qs_rdp(). The GP kthread cannot be in an RCU read-side critical section while running GP initialization, so this is safe and results in significant latency improvements. The following tests were performed: (1) synchronize_rcu() benchmarking 100 synchronize_rcu() calls with 32 CPUs, 10 runs each (default fqs jiffies settings): Baseline (without fix): \| Run \| Mean \| Min \| Max \| \|-----\|-----------\|----------\|-----------\| \| 1 \| 10.088 ms \| 9.989 ms \| 18.848 ms \| \| 2 \| 10.064 ms \| 9.982 ms \| 16.470 ms \| \| 3 \| 10.051 ms \| 9.988 ms \| 15.113 ms \| \| 4 \| 10.125 ms \| 9.929 ms \| 22.411 ms \| \| 5 \| 8.695 ms \| 5.996 ms \| 15.471 ms \| \| 6 \| 10.157 ms \| 9.977 ms \| 25.723 ms \| \| 7 \| 10.102 ms \| 9.990 ms \| 20.224 ms \| \| 8 \| 8.050 ms \| 5.985 ms \| 10.007 ms \| \| 9 \| 10.059 ms \| 9.978 ms \| 15.934 ms \| \| 10 \| 10.077 ms \| 9.984 ms \| 17.703 ms \| With fix: \| Run \| Mean \| Min \| Max \| \|-----\|----------\|----------\|-----------\| \| 1 \| 6.027 ms \| 5.915 ms \| 8.589 ms \| \| 2 \| 6.032 ms \| 5.984 ms \| 9.241 ms \| \| 3 \| 6.010 ms \| 5.986 ms \| 7.004 ms \| \| 4 \| 6.076 ms \| 5.993 ms \| 10.001 ms \| \| 5 \| 6.084 ms \| 5.893 ms \| 10.250 ms \| \| 6 \| 6.034 ms \| 5.908 ms \| 9.456 ms \| \| 7 \| 6.051 ms \| 5.993 ms \| 10.000 ms \| \| 8 \| 6.057 ms \| 5.941 ms \| 10.001 ms \| \| 9 \| 6.016 ms \| 5.927 ms \| 7.540 ms \| \| 10 \| 6.036 ms \| 5.993 ms \| 9.579 ms \| Summary: - Mean latency: 9.75 ms -> 6.04 ms (38% improvement) - Max latency: 25.72 ms -> 10.25 ms (60% improvement) (2) Bridge setup/teardown latency (Uladzislau Rezki) x86_64 with 64 CPUs, 100 iterations of bridge add/configure/delete: real time 1 - default: 24.221s 2 - this patch: 20.754s (14% faster) 3 - this patch + wake_from_gp: 15.895s (34% faster) 4 - wake_from_gp only: 18.947s (22% faster) Per-synchronize_rcu() latency (in usec): 1 2 3 4 median: 37249.5 31540.5 15765 22480 min: 7881 7918 9803 7857 max: 63651 55639 31861 32040 This patch combined with rcu_normal_wake_from_gp reduces bridge setup/teardown time from 24 seconds to 16 seconds. (3) CPU overhead verification (Uladzislau Rezki) System CPU time across 5 runs showed no measurable increase: default: 1.698s - 1.937s this patch: 1.667s - 1.930s Conclusion: variations are within noise, no CPU overhead regression. (4) rcutorture Tested TREE and SRCU configurations - no regressions. Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org> Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Samir M <samir@linux.ibm.com> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-11 20:11:15 +08:00
Zqiang	cee2557ae3	srcu: Use suitable gfp_flags for the init_srcu_struct_nodes() For use the init_srcu_struct*() to initialized srcu structure, the srcu structure's->srcu_sup and sda use GFP_KERNEL flags to allocate memory. similarly, if set SRCU_SIZING_INIT, the srcu_sup's->node can still use GFP_KERNEL flags to allocate memory, not need to use GFP_ATOMIC flags all the time. Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-07 21:59:41 +08:00
Yao Kai	d41e37f26b	rcu: Fix rcu_read_unlock() deadloop due to softirq Commit `5f5fa7ea89` ("rcu: Don't use negative nesting depth in __rcu_read_unlock()") removes the recursion-protection code from __rcu_read_unlock(). Therefore, we could invoke the deadloop in raise_softirq_irqoff() with ftrace enabled as follows: WARNING: CPU: 0 PID: 0 at kernel/trace/trace.c:3021 __ftrace_trace_stack.constprop.0+0x172/0x180 Modules linked in: my_irq_work(O) CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.18.0-rc7-dirty #23 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:__ftrace_trace_stack.constprop.0+0x172/0x180 RSP: 0018:ffffc900000034a8 EFLAGS: 00010002 RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000 RDX: 0000000000000003 RSI: ffffffff826d7b87 RDI: ffffffff826e9329 RBP: 0000000000090009 R08: 0000000000000005 R09: ffffffff82afbc4c R10: 0000000000000008 R11: 0000000000011d7a R12: 0000000000000000 R13: ffff888003874100 R14: 0000000000000003 R15: ffff8880038c1054 FS: 0000000000000000(0000) GS:ffff8880fa8ea000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055b31fa7f540 CR3: 00000000078f4005 CR4: 0000000000770ef0 PKRU: 55555554 Call Trace: <IRQ> trace_buffer_unlock_commit_regs+0x6d/0x220 trace_event_buffer_commit+0x5c/0x260 trace_event_raw_event_softirq+0x47/0x80 raise_softirq_irqoff+0x6e/0xa0 rcu_read_unlock_special+0xb1/0x160 unwind_next_frame+0x203/0x9b0 __unwind_start+0x15d/0x1c0 arch_stack_walk+0x62/0xf0 stack_trace_save+0x48/0x70 __ftrace_trace_stack.constprop.0+0x144/0x180 trace_buffer_unlock_commit_regs+0x6d/0x220 trace_event_buffer_commit+0x5c/0x260 trace_event_raw_event_softirq+0x47/0x80 raise_softirq_irqoff+0x6e/0xa0 rcu_read_unlock_special+0xb1/0x160 unwind_next_frame+0x203/0x9b0 __unwind_start+0x15d/0x1c0 arch_stack_walk+0x62/0xf0 stack_trace_save+0x48/0x70 __ftrace_trace_stack.constprop.0+0x144/0x180 trace_buffer_unlock_commit_regs+0x6d/0x220 trace_event_buffer_commit+0x5c/0x260 trace_event_raw_event_softirq+0x47/0x80 raise_softirq_irqoff+0x6e/0xa0 rcu_read_unlock_special+0xb1/0x160 unwind_next_frame+0x203/0x9b0 __unwind_start+0x15d/0x1c0 arch_stack_walk+0x62/0xf0 stack_trace_save+0x48/0x70 __ftrace_trace_stack.constprop.0+0x144/0x180 trace_buffer_unlock_commit_regs+0x6d/0x220 trace_event_buffer_commit+0x5c/0x260 trace_event_raw_event_softirq+0x47/0x80 raise_softirq_irqoff+0x6e/0xa0 rcu_read_unlock_special+0xb1/0x160 __is_insn_slot_addr+0x54/0x70 kernel_text_address+0x48/0xc0 __kernel_text_address+0xd/0x40 unwind_get_return_address+0x1e/0x40 arch_stack_walk+0x9c/0xf0 stack_trace_save+0x48/0x70 __ftrace_trace_stack.constprop.0+0x144/0x180 trace_buffer_unlock_commit_regs+0x6d/0x220 trace_event_buffer_commit+0x5c/0x260 trace_event_raw_event_softirq+0x47/0x80 __raise_softirq_irqoff+0x61/0x80 __flush_smp_call_function_queue+0x115/0x420 __sysvec_call_function_single+0x17/0xb0 sysvec_call_function_single+0x8c/0xc0 </IRQ> Commit `b41642c877` ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work") fixed the infinite loop in rcu_read_unlock_special() for IRQ work by setting a flag before calling irq_work_queue_on(). We fix this issue by setting the same flag before calling raise_softirq_irqoff() and rename the flag to defer_qs_pending for more common. Fixes: `5f5fa7ea89` ("rcu: Don't use negative nesting depth in __rcu_read_unlock()") Reported-by: Tengda Wu <wutengda2@huawei.com> Signed-off-by: Yao Kai <yaokai34@huawei.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-07 21:58:37 +08:00
Paul E. McKenney	37d9b47507	rcutorture: Correctly compute probability to invoke ->exp_current() Lack of parentheses causes the ->exp_current() function, for example, srcu_expedite_current(), to be called only once in four billion times instead of the intended once in 256 times. This commit therefore adds the needed parentheses. Reported-by: Chris Mason <clm@meta.com> Reported-by: Joel Fernandes <joelagnelf@nvidia.com> Fixes: `950063c6e8` ("rcutorture: Test srcu_expedite_current()") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-07 21:58:34 +08:00
Paul E. McKenney	255019537c	rcu: Make expedited RCU CPU stall warnings detect stall-end races If an expedited RCU CPU stall ends just at the stall-warning timeout, the current code will print an expedited stall-warning message, but one that doesn't identify any CPUs or tasks causing the stall. This is most likely to happen for short-timeout stalls, for example, the 20-millisecond timeouts that are sometimes used for small embedded devices. Needless to say, these semi-empty stall-warning messages can be rather confusing. One option would be to suppress the stall-warning message entirely in this case, but the near-miss information can be quite valuable. Detect this race condition and emits a "INFO: Expedited stall ended before state dump start" message to clarify matters. [boqun: Apply feedback from Borislav] Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-07 21:58:26 +08:00
Boqun Feng	acb0b2f5d6	Merge branch 'rcu-torture.20260104a' into rcu-next * rcu-torture.20260104a: rcutorture: Add --kill-previous option to terminate previous kvm.sh runs rcutorture: Prevent concurrent kvm.sh runs on same source tree torture: Include commit discription in testid.txt torture: Make config2csv.sh properly handle comments in .boot files torture: Make kvm-series.sh give run numbers and totals torture: Make kvm-series.sh give build numbers and totals torture: Parallelize kvm-series.sh guest-OS execution rcutorture: Add context checks to rcu_torture_timer()	2026-01-04 18:53:06 +08:00
Paul E. McKenney	e8a534a671	rcutorture: Add context checks to rcu_torture_timer() This commit adds irq, NMI, and softirq context checks to the rcu_torture_timer() function. Just because you are paranoid does not mean that they are not out to get you... ;-) Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-01 16:43:21 +08:00
Paul E. McKenney	760f05bc83	rcutorture: Test rcu_tasks_trace_expedite_current() This commit adds a ->exp_current member to the tasks_tracing_ops structure to test the rcu_tasks_trace_expedite_current() function. [ paulmck: Apply kernel test robot feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-01 16:39:46 +08:00
Paul E. McKenney	1a72f4bb6f	rcu: Add noinstr-fast rcu_read_{,un}lock_tasks_trace() APIs When expressing RCU Tasks Trace in terms of SRCU-fast, it was necessary to keep a nesting count and per-CPU srcu_ctr structure pointer in the task_struct structure, which is slow to access. But an alternative is to instead make rcu_read_lock_tasks_trace() and rcu_read_unlock_tasks_trace(), which match the underlying SRCU-fast semantics, avoiding the task_struct accesses. When all callers have switched to the new API, the previous rcu_read_lock_trace() and rcu_read_unlock_trace() APIs will be removed. The rcu_read_{,un}lock_{,tasks_}trace() functions need to use smp_mb() only if invoked where RCU is not watching, that is, from locations where a call to rcu_is_watching() would return false. In architectures that define the ARCH_WANTS_NO_INSTR Kconfig option, use of noinstr and friends ensures that tracing happens only where RCU is watching, so those architectures can dispense entirely with the read-side calls to smp_mb(). Other architectures include these read-side calls by default, but in many installations there might be either larger than average tolerance for risk, prohibition of removing tracing on a running system, or careful review and approval of removing of tracing. Such installations can build their kernels with CONFIG_TASKS_TRACE_RCU_NO_MB=y to avoid those read-side calls to smp_mb(), thus accepting responsibility for run-time removal of tracing from code regions that RCU is not watching. Those wishing to disable read-side memory barriers for an entire architecture can select this TASKS_TRACE_RCU_NO_MB Kconfig option, hence the polarity. [ paulmck: Apply Peter Zijlstra feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-01 16:39:46 +08:00
Paul E. McKenney	176a6aeaf1	rcu: Move rcu_tasks_trace_srcu_struct out of #ifdef CONFIG_TASKS_RCU_GENERIC Moving the rcu_tasks_trace_srcu_struct structure instance out from under the CONFIG_TASKS_RCU_GENERIC Kconfig option permits the CONFIG_TASKS_TRACE_RCU Kconfig option to stop enabling this CONFIG_TASKS_RCU_GENERIC Kconfig option. This commit also therefore makes it so. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-01 16:39:46 +08:00
Paul E. McKenney	a73fc3dcc6	rcu: Clean up after the SRCU-fastification of RCU Tasks Trace Now that RCU Tasks Trace has been re-implemented in terms of SRCU-fast, the ->trc_ipi_to_cpu, ->trc_blkd_cpu, ->trc_blkd_node, ->trc_holdout_list, and ->trc_reader_special task_struct fields are no longer used. In addition, the rcu_tasks_trace_qs(), rcu_tasks_trace_qs_blkd(), exit_tasks_rcu_finish_trace(), and rcu_spawn_tasks_trace_kthread(), show_rcu_tasks_trace_gp_kthread(), rcu_tasks_trace_get_gp_data(), rcu_tasks_trace_torture_stats_print(), and get_rcu_tasks_trace_gp_kthread() functions and all the other functions that they invoke are no longer used. Also, the TRC_NEED_QS and TRC_NEED_QS_CHECKED CPP macros are no longer used. Neither are the rcu_tasks_trace_lazy_ms and rcu_task_ipi_delay rcupdate module parameters and the TASKS_TRACE_RCU_READ_MB Kconfig option. This commit therefore removes all of them. [ paulmck: Apply Alexei Starovoitov feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-01 16:39:46 +08:00
Paul E. McKenney	c27cea4416	rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast This commit saves more than 500 lines of RCU code by re-implementing RCU Tasks Trace in terms of SRCU-fast. Follow-up work will remove more code that does not cause problems by its presence, but that is no longer required. This variant places smp_mb() in rcu_read_{,un}lock_trace(), and in the same place that srcu_read_{,un}lock() would put them. These smp_mb() calls will be removed on common-case architectures in a later commit. In the meantime, it serves to enforce ordering between the underlying srcu_read_{,un}lock_fast() markers and the intervening critical section, even on architectures that permit attaching tracepoints on regions of code not watched by RCU. Such architectures defeat SRCU-fast's use of implicit single-instruction, interrupts-disabled, and atomic-operation RCU read-side critical sections, which have no effect when RCU is not watching. The aforementioned later commit will insert these smp_mb() calls only on architectures that have not used noinstr to prevent attaching tracepoints to code where RCU is not watching. [ paulmck: Apply kernel test robot, Boqun Feng, and Zqiang feedback. ] [ paulmck: Split out Tiny SRCU fixes per Andrii Nakryiko feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: kernel test robot <oliver.sang@intel.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-01 16:39:46 +08:00
Linus Torvalds	98e7dcbb82	Merge tag 'rcu.release.v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux Pull RCU updates from Frederic Weisbecker: "SRCU: - Properly handle SRCU readers within IRQ disabled sections in tiny SRCU - Preparation to reimplement RCU Tasks Trace on top of SRCU fast: - Introduce API to expedite a grace period and test it through rcutorture - Split srcu-fast in two flavours: SRCU-fast and SRCU-fast-updown. Both are still targeted toward faster readers (without full barriers on LOCK and UNLOCK) at the expense of heavier write side (using full RCU grace period ordering instead of simply full ordering) as compared to "traditional" non-fast SRCU. But those srcu-fast flavours are going to be optimized in two different ways: - SRCU-fast will become the reimplementation basis for RCU-TASK-TRACE for consolidation. Since RCU-TASK-TRACE must be NMI safe, SRCU-fast must be as well. - SRCU-fast-updown will be needed for uretprobes code in order to get rid of the read-side memory barriers while still allowing entering the reader at task level while exiting it in a timer handler. It is considered semaphore-like in that it can have different owners between LOCK and UNLOCK. However it is not NMI-safe. The actual optimizations are work in progress for the next cycle. Only the new interfaces are added for now, along with related torture and scalability test code. - Create/document/debug/torture new proper initializers for RCU fast: DEFINE_SRCU_FAST() and init_srcu_struct_fast() This allows for using right away the proper ordering on the write side (either full ordering or full RCU grace period ordering) without waiting for the read side to tell which to use. This also optimizes the read side altogether with moving flavour debug checks under debug config and with removing a costly RmW operation on their first call. - Make some diagnostic functions tracing safe Refscale: - Add performance testing for common context synchronizations (Preemption, IRQ, Softirq) and per-cpu increments. Those are relevant comparisons against SRCU-fast read side APIs, especially as they are planned to synchronize further tracing fast-path code Miscellanous: - In order to prepare the layout for nohz_full work deferral to user exit, the context tracking state must shrink the counter of transitions to/from RCU not watching. The only possible hazard is to trigger wrap-around more easily, delaying a bit grace periods when that happens. This should be a rare event though. Yet add debugging and torture code to test that assumption - Fix memory leak on locktorture module - Annotate accesses in rculist_nulls.h to prevent from KCSAN warnings. On recent discussions, we also concluded that all those WRITE_ONCE() and READ_ONCE() on list APIs deserve appropriate comments. Something to be expected for the next cycle - Provide a script to apply several configs to several commits with torture - Allow torture to reuse a build directory in order to save needless rebuild time - Various cleanups" * tag 'rcu.release.v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (29 commits) refscale: Add SRCU-fast-updown readers refscale: Exercise DEFINE_STATIC_SRCU_FAST() and init_srcu_struct_fast() rcutorture: Make srcu{,d}_torture_init() announce the SRCU type srcu: Create an SRCU-fast-updown API refscale: Do not disable interrupts for tests involving local_bh_enable() refscale: Add non-atomic per-CPU increment readers refscale: Add this_cpu_inc() readers refscale: Add preempt_disable() readers refscale: Add local_bh_disable() readers refscale: Add local_irq_disable() and local_irq_save() readers torture: Permit negative kvm.sh --kconfig numberic arguments srcu: Add SRCU_READ_FLAVOR_FAST_UPDOWN CPP macro rcu: Mark diagnostic functions as notrace rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE rcutorture: Remove redundant rcutorture_one_extend() from rcu_torture_one_read() rcutorture: Permit kvm-again.sh to re-use the build directory torture: Add kvm-series.sh to test commit/scenario combination rcu: use WRITE_ONCE() for ->next and ->pprev of hlist_nulls locktorture: Fix memory leak in param_set_cpumask() doc: Update for SRCU-fast definitions and initialization ...	2025-12-03 12:18:07 -08:00
Frederic Weisbecker	9a08942f17	Merge branch 'rcu/misc' into next - In order to prepare the layout for nohz_full work deferral to user exit, the context tracking state must shrink the counter of transitions to/from RCU not watching. The only possible hazard is to trigger wrap-around more easily, delaying a bit grace periods when that happens. This should be a rare event though. Yet add debugging and torture code to test that assumption. - Fix memory leak on locktorture module - Annotate accesses in rculist_nulls.h to prevent from KCSAN warnings. On recent discussions, we also concluded that all those WRITE_ONCE() and READ_ONCE() on list APIs deserve appropriate comments. Something to be expected for the next cycle. - Provide a script to apply several configs to several commits with torture. - Allow torture to reuse a build directory in order to save needless rebuild time. - Various cleanups.	2025-11-30 22:20:33 +01:00
Frederic Weisbecker	a50413848f	Merge branch 'rcu/refscale' into next Add performance testing for common context synchronizations (Preemption, IRQ, Softirq) and per-cpu increments. Those are relevant comparisons against SRCU-fast read side APIs, especially as they are planned to synchronize further tracing fast-path code.	2025-11-28 23:30:38 +01:00

1 2 3 4 5 ...

2847 Commits