Commit Graph

2775 Commits

Linus Torvalds
2215336295 Merge tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:

 - Unify guest entry code for KVM and MSHV (Sean Christopherson)

 - Switch Hyper-V MSI domain to use msi_create_parent_irq_domain()
   (Nam Cao)

 - Add CONFIG_HYPERV_VMBUS and limit the semantics of CONFIG_HYPERV
   (Mukesh Rathor)

 - Add kexec/kdump support on Azure CVMs (Vitaly Kuznetsov)

 - Deprecate hyperv_fb in favor of Hyper-V DRM driver (Prasanna
   Kumar T S M)

 - Miscellaneous enhancements, fixes and cleanups (Abhishek Tiwari,
   Alok Tiwari, Nuno Das Neves, Wei Liu, Roman Kisel, Michael Kelley)

* tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  hyperv: Remove the spurious null directive line
  MAINTAINERS: Mark hyperv_fb driver Obsolete
  fbdev/hyperv_fb: deprecate this in favor of Hyper-V DRM driver
  Drivers: hv: Make CONFIG_HYPERV bool
  Drivers: hv: Add CONFIG_HYPERV_VMBUS option
  Drivers: hv: vmbus: Fix typos in vmbus_drv.c
  Drivers: hv: vmbus: Fix sysfs output format for ring buffer index
  Drivers: hv: vmbus: Clean up sscanf format specifier in target_cpu_store()
  x86/hyperv: Switch to msi_create_parent_irq_domain()
  mshv: Use common "entry virt" APIs to do work in root before running guest
  entry: Rename "kvm" entry code assets to "virt" to genericize APIs
  entry/kvm: KVM: Move KVM details related to signal/-EINTR into KVM proper
  mshv: Handle NEED_RESCHED_LAZY before transferring to guest
  x86/hyperv: Add kexec/kdump support on Azure CVMs
  Drivers: hv: Simplify data structures for VMBus channel close message
  Drivers: hv: util: Cosmetic changes for hv_utils_transport.c
  mshv: Add support for a new parent partition configuration
  clocksource: hyper-v: Skip unnecessary checks for the root partition
  hyperv: Add missing field to hv_output_map_device_interrupt
2025-10-07 08:40:15 -07:00
Linus Torvalds
67da125e30 Merge tag 'rcu.2025.09.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU updates from Paul McKenney:
 "Documentation updates:

   - Update whatisRCU.rst and checklist.rst for recent RCU API additions

   - Fix RCU documentation formatting and typos

   - Replace dead Ottawa Linux Symposium links in RTFP.txt

  Miscellaneous RCU updates:

   - Document that rcu_barrier() hurries RCU_LAZY callbacks

   - Remove redundant interrupt disabling from
     rcu_preempt_deferred_qs_handler()

   - Move list_for_each_rcu from list.h to rculist.h, and adjust the
     include directive in kernel/cgroup/dmem.c accordingly

   - Make initial set of changes to accommodate upcoming
     system_percpu_wq changes

  SRCU updates:

   - Create an srcu_read_lock_fast_notrace() for eventual use in
     tracing, including adding guards

   - Document the reliance on per-CPU operations as implicit RCU readers
     in __srcu_read_{,un}lock_fast()

   - Document the srcu_flip() function's memory-barrier D's relationship
     to SRCU-fast readers

   - Remove a redundant preempt_disable() and preempt_enable() pair from
     srcu_gp_start_if_needed()

  Torture-test updates:

   - Fix jitter.sh spin time so that it actually varies as advertised.
     It is still quite coarse-grained, but at least it does now vary

   - Update torture.sh help text to include the not-so-new --do-normal
     parameter, which permits (for example) testing KCSAN kernels
     without doing non-debug kernels

   - Fix a number of false-positive diagnostics that were being
     triggered by rcutorture starting before boot completed. Running
     multiple near-CPU-bound rcutorture processes when there is only the
     boot CPU is after all a bit excessive

   - Substitute kcalloc() for kzalloc()

   - Remove a redundant kfree() and NULL out kfree()ed objects"

* tag 'rcu.2025.09.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (31 commits)
  rcu: WQ_UNBOUND added to sync_wq workqueue
  rcu: WQ_PERCPU added to alloc_workqueue users
  rcu: replace use of system_wq with system_percpu_wq
  refperf: Set reader_tasks to NULL after kfree()
  refperf: Remove redundant kfree() after torture_stop_kthread()
  srcu/tiny: Remove preempt_disable/enable() in srcu_gp_start_if_needed()
  srcu: Document srcu_flip() memory-barrier D relation to SRCU-fast
  srcu: Document __srcu_read_{,un}lock_fast() implicit RCU readers
  rculist: move list_for_each_rcu() to where it belongs
  refscale: Use kcalloc() instead of kzalloc()
  rcutorture: Use kcalloc() instead of kzalloc()
  docs: rcu: Replace multiple dead OLS links in RTFP.txt
  doc: Fix typo in RCU's torture.rst documentation
  Documentation: RCU: Retitle toctree index
  Documentation: RCU: Reduce toctree depth
  Documentation: RCU: Wrap kvm-remote.sh rerun snippet in literal code block
  rcu: docs: Requirements.rst: Abide by conventions of kernel documentation
  doc: Add RCU guards to checklist.rst
  doc: Update whatisRCU.rst for recent RCU API additions
  rcutorture: Delay forward-progress testing until boot completes
  ...
2025-10-04 11:28:45 -07:00
Sean Christopherson
9be7e1e320 entry: Rename "kvm" entry code assets to "virt" to genericize APIs
Rename the "kvm" entry code files and Kconfigs to use generic "virt"
nomenclature so that the code can be reused by other hypervisors (or
rather, their root/dom0 partition drivers), without incorrectly suggesting
the code somehow relies on and/or involves KVM.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-30 22:50:18 +00:00
Paul E. McKenney
1d289fc569 Merge branch 'torture.2025.08.14a' into HEAD
Torture-test updates:
* rcutorture: Fix jitter.sh spin time
* torture: Add --do-normal parameter to torture.sh help text
* torture: Announce kernel boot status at torture-test startup
* rcutorture: Suppress "Writer stall state" reports during boot
* rcutorture: Delay rcutorture readers and writers until boot completes
* torture: Delay CPU-hotplug operations until boot completes
* rcutorture: Delay forward-progress testing until boot completes
* rcutorture,refscale: Use kcalloc() instead of kzalloc()
* refperf: Remove redundant kfree() after torture_stop_kthread()
* refperf: Set reader_tasks to NULL after kfree()
2025-09-23 02:10:51 -07:00
Paul E. McKenney
a590b67d33 Merge branch 'srcu-next.2025.08.21a' into HEAD
SRCU updates:
* Create srcu_read_{,un}lock_fast_notrace()
* Add srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace()
* Add guards for notrace variants of SRCU-fast readers
* Document srcu_read_{,un}lock_fast() use of implicit RCU readers
* Document srcu_flip() memory-barrier D relation to SRCU-fast
* Remove preempt_disable/enable() in Tiny SRCU srcu_gp_start_if_needed()
2025-09-23 02:07:10 -07:00
Marco Crivellari
82c427bc93 rcu: WQ_UNBOUND added to sync_wq workqueue
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (a per-CPU wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to
schedule_work(), which uses system_wq, while queue_work() again makes use
of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This change adds the WQ_UNBOUND flag to sync_wq, to make explicit that this
workqueue can be unbound and that it does not benefit from per-CPU work.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
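
A minimal sketch, assuming the flag semantics described above, of what the
two explicit choices look like at an alloc_workqueue() call site (the
workqueue names here are hypothetical):

  struct workqueue_struct *wq_unbound, *wq_percpu;

  /* Explicitly unbound: the scheduler places workers freely. */
  wq_unbound = alloc_workqueue("example_unbound_wq", WQ_UNBOUND, 0);

  /* Explicitly per-CPU: work items run on the CPU that queued them. */
  wq_percpu = alloc_workqueue("example_percpu_wq", WQ_PERCPU, 0);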

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-09-23 02:01:18 -07:00
Marco Crivellari
499d48f75b rcu: WQ_PERCPU added to alloc_workqueue users
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (a per-CPU wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to
schedule_work(), which uses system_wq, while queue_work() again makes use
of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This patch adds a new WQ_PERCPU flag to explicitly request the use of
the per-CPU behavior. Both flags coexist for one release cycle to allow
callers to transition their calls.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

All existing users have been updated accordingly.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-09-23 02:01:18 -07:00
Marco Crivellari
143ddfa169 rcu: replace use of system_wq with system_percpu_wq
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (a per-CPU wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to
schedule_work(), which uses system_wq, while queue_work() again makes use
of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_wq is a per-CPU workqueue, yet nothing in its name conveys that
CPU-affinity constraint, which users very often do not require. Make it
clear by adding a system_percpu_wq.

The old wq will be kept for a few release cycles.
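
A minimal sketch of the migration this naming enables (the work item here
is hypothetical):

  static void example_work_fn(struct work_struct *work) { }
  static DECLARE_WORK(example_work, example_work_fn);

  /* Before: the per-CPU constraint is hidden behind the generic name. */
  queue_work(system_wq, &example_work);

  /* After: same behavior, but the per-CPU intent is explicit. */
  queue_work(system_percpu_wq, &example_work);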

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-09-23 02:01:18 -07:00
Kaushlendra Kumar
1441edd129 refperf: Set reader_tasks to NULL after kfree()
Set reader_tasks to NULL after kfree() in ref_scale_cleanup() to
improve debugging experience with kernel debugging tools. This
follows the common pattern of NULLing pointers after freeing to
avoid dangling pointer issues during debugging sessions.

Setting pointers to NULL after freeing helps debugging tools like
kgdb, drgn, and other kernel debuggers by providing a clear indication
that the memory has been freed and the pointer is no longer valid.
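
A minimal sketch of the pattern being applied, using the field name from
the changelog:

  kfree(reader_tasks);
  reader_tasks = NULL;  /* make the freed pointer obvious to debuggers */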

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Kaushlendra Kumar <kaushlendra.kumar@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-09-15 05:20:09 -07:00
Kaushlendra Kumar
fb7855a6b5 refperf: Remove redundant kfree() after torture_stop_kthread()
Remove unnecessary kfree(main_task) call in ref_scale_cleanup() as
torture_stop_kthread() already handles the memory cleanup for the
task structure internally.

The additional kfree(main_task) call after torture_stop_kthread()
is redundant and confusing since torture_stop_kthread() sets the
pointer to NULL, making this a no-op.

This pattern is consistent with other torture test modules where
torture_stop_kthread() is called without explicit kfree() of the
task pointer, as the torture framework manages the task lifecycle
internally.

Signed-off-by: Kaushlendra Kumar <kaushlendra.kumar@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-09-15 05:20:09 -07:00
Zqiang
e6a43aeb71 srcu/tiny: Remove preempt_disable/enable() in srcu_gp_start_if_needed()
Currently, srcu_gp_start_if_needed() is always invoked within a
preemption-disabled critical section. This commit therefore removes the
redundant preempt_disable/enable() pair from srcu_gp_start_if_needed()
and adds a call to lockdep_assert_preemption_disabled() in order
to enable lockdep to diagnose mistaken invocations of this function
from preemption-enabled code.
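
A sketch of the resulting shape of the function, with the grace-period
work elided:

  static void srcu_gp_start_if_needed(struct srcu_struct *ssp)
  {
  	/* Callers already run with preemption disabled, so assert
  	 * that instead of redundantly disabling it again. */
  	lockdep_assert_preemption_disabled();
  	/* ... start a grace period if one is needed ... */
  }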

Fixes: 65b4a59557 ("srcu: Make Tiny SRCU explicitly disable preemption")
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-09-15 05:16:28 -07:00
Paul E. McKenney
1c77e862b8 srcu: Document srcu_flip() memory-barrier D relation to SRCU-fast
The smp_mb() memory barrier at the end of srcu_flip() has a comment,
but that comment does not make it clear that this memory barrier is an
optimization, as opposed to being needed for correctness.  This commit
therefore adds this information and points out that it is omitted
for SRCU-fast, where a much heavier weight synchronize_srcu() would
be required.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: <bpf@vger.kernel.org>
2025-09-15 05:16:28 -07:00
Ye Liu
79e1c24285 mm: replace (20 - PAGE_SHIFT) with common macros for pages<->MB conversion
Replace repeated (20 - PAGE_SHIFT) calculations with standard macros:
- MB_TO_PAGES(mb)    converts MB to page count
- PAGES_TO_MB(pages) converts pages to MB

No functional change.
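
A sketch of the macros' likely definitions, inferred from the
(20 - PAGE_SHIFT) expression they replace:

  #define MB_TO_PAGES(mb)    ((mb) << (20 - PAGE_SHIFT))
  #define PAGES_TO_MB(pages) ((pages) >> (20 - PAGE_SHIFT))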

[akpm@linux-foundation.org: remove arc's private PAGES_TO_MB, remove its unused PAGES_TO_KB]
[akpm@linux-foundation.org: don't include mm.h due to include file ordering mess]
Link: https://lkml.kernel.org/r/20250718024134.1304745-1-ye.liu@linux.dev
Signed-off-by: Ye Liu <liuye@kylinos.cn>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lai jiangshan <jiangshanlai@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 16:54:42 -07:00
Qianfeng Rong
9a0352dd45 refscale: Use kcalloc() instead of kzalloc()
Use kcalloc() in main_func() to gain built-in overflow protection, making
memory allocation safer when calculating allocation size compared to
explicit multiplication.

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-08-22 06:26:22 -07:00
Qianfeng Rong
3e15cccf3e rcutorture: Use kcalloc() instead of kzalloc()
Use kcalloc() in rcu_torture_writer() to gain built-in overflow protection,
making memory allocation safer when calculating allocation size compared to
explicit multiplication.

Change sizeof(ulo[0]) and sizeof(rgo[0]) to sizeof(*ulo) and sizeof(*rgo),
as this is more consistent with coding conventions.
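
A minimal sketch contrasting the two allocation styles (the array name and
count are illustrative):

  /* Before: open-coded multiplication, with no overflow check. */
  ulo = kzalloc(n * sizeof(ulo[0]), GFP_KERNEL);

  /* After: kcalloc() returns NULL if n * sizeof(*ulo) would overflow. */
  ulo = kcalloc(n, sizeof(*ulo), GFP_KERNEL);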

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-08-22 06:26:22 -07:00
Paul E. McKenney
51c285baa3 rcutorture: Delay forward-progress testing until boot completes
Forward-progress testing can hog CPUs, which is not a great thing to do
before boot has completed.  This commit therefore makes forward-progress
testing hold off until boot has completed.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-08-14 15:26:30 -07:00
Paul E. McKenney
9a316fe3ad rcutorture: Delay rcutorture readers and writers until boot completes
The rcutorture writers and (especially) readers are the biggest CPU
hogs of the bunch, so this commit makes them wait until boot has
completed.

This makes the current setting of the boot_ended local variable dead code,
so while in the area, this commit removes that as well.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-08-14 15:26:30 -07:00
Paul E. McKenney
1b0f583843 rcutorture: Suppress "Writer stall state" reports during boot
When rcutorture is running on only the one boot-time CPU while that CPU
is busy invoking initcall() functions, the added load is quite likely to
unduly delay the RCU grace-period kthread, rcutorture readers, and much
else besides.  This can result in rcu_torture_stats_print() reporting
rcutorture writer stalls, which are not really a bug in that environment.
After all, one CPU can only do so much.

This commit therefore suppresses rcutorture writer stalls while the
kernel is booting, that is, while rcu_inkernel_boot_has_ended() continues
returning false.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-08-14 15:26:30 -07:00
Zqiang
42d590d100 rcu: Remove local_irq_save/restore() in rcu_preempt_deferred_qs_handler()
The per-CPU rcu_data structure's ->defer_qs_iw field is initialized
by IRQ_WORK_INIT_HARD(), which means that the subsequent invocation of
rcu_preempt_deferred_qs_handler() will always be executed with interrupts
disabled.  This commit therefore removes the local_irq_save/restore()
operations from rcu_preempt_deferred_qs_handler() and adds a call to
lockdep_assert_irqs_disabled() in order to enable lockdep to diagnose
mistaken invocations of this function from interrupts-enabled code.
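
A sketch of the handler's resulting shape, with the quiescent-state work
elided:

  static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
  {
  	/* IRQ_WORK_INIT_HARD() guarantees hard-irq context, so
  	 * interrupts are already off; assert rather than save/restore. */
  	lockdep_assert_irqs_disabled();
  	/* ... report the deferred quiescent state ... */
  }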

Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-08-14 15:25:15 -07:00
Paul E. McKenney
faab3ae329 rcu: Document that rcu_barrier() hurries lazy callbacks
This commit adds to the rcu_barrier() kerneldoc header stating that this
function hurries lazy callbacks and that it does not normally result in
additional RCU grace periods.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-08-13 15:04:06 -07:00
Frederic Weisbecker
61399e0c54 rcu: Fix racy re-initialization of irq_work causing hangs
RCU re-initializes the deferred QS irq work everytime before attempting
to queue it. However there are situations where the irq work is
attempted to be queued even though it is already queued. In that case
re-initializing messes-up with the irq work queue that is about to be
handled.

The chances for that to happen are higher when the architecture doesn't
support self-IPIs and irq work are then all lazy, such as with the
following sequence:

1) rcu_read_unlock() is called when IRQs are disabled and there is a
   grace period involving blocked tasks on the node. The irq work
   is then initialized and queued.

2) The related tasks are unblocked and the CPU quiescent state
   is reported. rdp->defer_qs_iw_pending is reset to DEFER_QS_IDLE,
   allowing the irq work to be requeued in the future (note the previous
   one hasn't fired yet).

3) A new grace period starts and the node has blocked tasks.

4) rcu_read_unlock() is called when IRQs are disabled again. The irq work
   is re-initialized (but it's queued! and its node is cleared) and
   requeued, which means it's requeued to itself.

5) The irq work finally fires with the tick. But since it was requeued
   to itself, it loops and hangs.

Fix this by initializing the irq work only once, before the CPU boots.
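
A sketch of the fix's shape, using the field and handler names from the
changelog:

  /* Once, during CPU bring-up: */
  rdp->defer_qs_iw = IRQ_WORK_INIT_HARD(rcu_preempt_deferred_qs_handler);

  /* Later, possibly many times, with no re-initialization in between: */
  irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);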

Fixes: b41642c877 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202508071303.c1134cce-lkp@intel.com
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-08-11 08:43:49 +05:30
Linus Torvalds
6a68cec16b Merge tag 'sched_ext-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:

 - Add support for cgroup "cpu.max" interface

 - Code organization cleanup so that ext_idle.c doesn't depend on the
   source-file-inclusion build method of sched/

 - Drop UP paths in accordance with sched core changes

 - Documentation and other misc changes

* tag 'sched_ext-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Fix scx_bpf_reenqueue_local() reference
  sched_ext: Drop kfuncs marked for removal in 6.15
  sched_ext, rcu: Eject BPF scheduler on RCU CPU stall panic
  kernel/sched/ext.c: fix typo "occured" -> "occurred" in comments
  sched_ext: Add support for cgroup bandwidth control interface
  sched_ext, sched/core: Factor out struct scx_task_group
  sched_ext: Return NULL in llc_span
  sched_ext: Always use SMP versions in kernel/sched/ext_idle.h
  sched_ext: Always use SMP versions in kernel/sched/ext_idle.c
  sched_ext: Always use SMP versions in kernel/sched/ext.h
  sched_ext: Always use SMP versions in kernel/sched/ext.c
  sched_ext: Documentation: Clarify time slice handling in task lifecycle
  sched_ext: Make scx_locked_rq() inline
  sched_ext: Make scx_rq_bypassing() inline
  sched_ext: idle: Make local functions static in ext_idle.c
  sched_ext: idle: Remove unnecessary ifdef in scx_bpf_cpu_node()
2025-07-31 16:29:46 -07:00
Linus Torvalds
2db4df0c09 Merge tag 'rcu.release.v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU updates from Neeraj Upadhyay:
 "Expedited grace period updates:

   - Protect against early RCU exp quiescent state reporting during exp
     grace period initialization

   - Remove superfluous barrier in task unblock path

   - Remove the CPU online quiescent state report optimization, which is
     error prone for certain scenarios

   - Add warning for unexpected pending requested expedited quiescent
     state on dying CPU

  Core:

   - Robustify rcu_is_cpu_rrupt_from_idle() by using more accurate
     indicators of the actual context tracking state of a CPU

   - Handle ->defer_qs_iw_pending field data race

   - Enable rcu_normal_wake_from_gp by default on systems with <= 16
     CPUs

   - Fix lockup in rcu_read_unlock() due to recursive irq_exit() calls

   - Refactor expedited handling condition in rcu_read_unlock_special()

   - Documentation updates for hotplug and GP init scan ordering,
     separation of rcu_state and rnp's gp_seq states, quiescent state
     reporting for offline CPUs

  torture-scripts:

   - Cleanup and improve scripts: remove superfluous warnings for
     disabled tests; better handling of kvm.sh --kconfig arg; suppress
     some confusing diagnostics; tolerate bad kvm.sh args; add new
     diagnostic for build output; fail allmodconfig testing on warnings

   - Include RCU_TORTURE_TEST_CHK_RDR_STATE config for KCSAN kernels

   - Disable default RCU-tasks and clocksource-wdog testing on arm64

   - Add EXPERT Kconfig option for arm64 KCSAN runs

   - Remove SRCU-lite testing

  rcutorture:

   - Start torture writer threads creation after reader threads to
     handle race in SRCU-P scenario

   - Add SRCU down_read()/up_read() test

   - Add diagnostics for delayed SRCU up_read(), unmatched up_read(),
     print number of up/down readers and the number of such readers
     which migrated to other CPU

   - Ignore certain unsupported configurations for trivial RCU test

   - Fix splats in RT kernels due to inaccurate checks for BH-disabled
     context

   - Enable checks and logs to capture intentionally exercised
     unexpected scenarios (too short readers) for BUSTED test

   - Remove SRCU-lite testing

  srcu:

   - Expedite SRCU-fast grace periods

   - Remove SRCU-lite implementation

   - Add guards for SRCU-fast readers

  rcu nocb:

   - Dump NOCB group leader state on stall detection

   - Robustify nocb_cb_kthread pointer accesses

   - Fix delayed execution of hurry callbacks when LAZY_RCU is enabled

  refscale:

   - Fix multiplication overflow in "loops" and "nreaders" calculations"

* tag 'rcu.release.v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (49 commits)
  rcu: Document concurrent quiescent state reporting for offline CPUs
  rcu: Document separation of rcu_state and rnp's gp_seq
  rcu: Document GP init vs hotplug-scan ordering requirements
  srcu: Add guards for SRCU-fast readers
  rcu: Fix delayed execution of hurry callbacks
  rcu: Refactor expedited handling check in rcu_read_unlock_special()
  checkpatch: Remove SRCU-lite deprecation
  srcu: Remove SRCU-lite implementation
  srcu: Expedite SRCU-fast grace periods
  rcutorture: Remove support for SRCU-lite
  rcutorture: Remove SRCU-lite scenarios
  torture: Remove support for SRCU-lite
  torture: Make torture.sh --allmodconfig testing fail on warnings
  torture: Add "ERROR" diagnostic for testing kernel-build output
  torture: Make torture.sh tolerate runs having bad kvm.sh arguments
  torture: Add textid.txt file to --do-allmodconfig and --do-rcu-rust runs
  torture: Extract testid.txt generation to separate script
  torture: Suppress "find" diagnostics from torture.sh --do-none run
  torture: Provide EXPERT Kconfig option for arm64 KCSAN torture.sh runs
  rcu: Fix rcu_read_unlock() deadloop due to IRQ work
  ...
2025-07-30 11:01:41 -07:00
Linus Torvalds
4b290aae78 Merge tag 'sysctl-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl updates from Joel Granados:

 - Move sysctls out of the kern_table array

   This is the final move of ctl_tables into their respective
   subsystems. Only 5 (out of the original 50) will remain in
   kernel/sysctl.c file; these handle either sysctl or common arch
   variables.

   By decentralizing sysctl registrations, subsystem maintainers regain
   control over their sysctl interfaces, improving maintainability and
   reducing the likelihood of merge conflicts.

 - docs: Remove false positives from check-sysctl-docs

   Stopped falsely identifying sysctls as undocumented or unimplemented
   in the check-sysctl-docs script. This script can now be used to
   automatically identify if documentation is missing.

* tag 'sysctl-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: (23 commits)
  docs: Downgrade arm64 & riscv from titles to comment
  docs: Replace spaces with tabs in check-sysctl-docs
  docs: Remove colon from ctltable title in vm.rst
  docs: Add awk section for ucount sysctl entries
  docs: Use skiplist when checking sysctl admin-guide
  docs: nixify check-sysctl-docs
  sysctl: rename kern_table -> sysctl_subsys_table
  kernel/sys.c: Move overflow{uid,gid} sysctl into kernel/sys.c
  uevent: mv uevent_helper into kobject_uevent.c
  sysctl: Removed unused variable
  sysctl: Nixify sysctl.sh
  sysctl: Remove superfluous includes from kernel/sysctl.c
  sysctl: Remove (very) old file changelog
  sysctl: Move sysctl_panic_on_stackoverflow to kernel/panic.c
  sysctl: move cad_pid into kernel/pid.c
  sysctl: Move tainted ctl_table into kernel/panic.c
  Input: sysrq: mv sysrq into drivers/tty/sysrq.c
  fork: mv threads-max into kernel/fork.c
  parisc/power: Move soft-power into power.c
  mm: move randomize_va_space into memory.c
  ...
2025-07-29 21:43:08 -07:00
Neeraj Upadhyay (AMD)
cc1d1365f0 Merge branches 'rcu-exp.23.07.2025', 'rcu.22.07.2025', 'torture-scripts.16.07.2025', 'srcu.19.07.2025', 'rcu.nocb.18.07.2025' and 'refscale.07.07.2025' into rcu.merge.23.07.2025 2025-07-23 21:42:20 +05:30
Joel Granados
fff6703fc8 rcu: Move rcu_stall related sysctls into rcu/tree_stall.h
Move sysctl_panic_on_rcu_stall and sysctl_max_rcu_stall_to_panic into
the kernel/rcu subdirectory. Make these static in tree_stall.h and
remove them as extern from panic.h, as their scope is now confined to
one file.

This is part of a greater effort to move ctl tables into their
respective subsystems which will reduce the merge conflicts in
kernel/sysctl.c.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-07-23 11:52:47 +02:00
Joel Fernandes
5d71c2b53f rcu: Document concurrent quiescent state reporting for offline CPUs
The synchronization of CPU offlining with GP initialization is confusing,
to put it mildly (rightfully so, as the issue it deals with is complex).
Recent discussions brought up a question -- what prevents
rcu_implicit_dyntick_qs() from warning about QS reports for offline
CPUs (missing QS reports for offline CPUs cause indefinite hangs).

QS reporting for now-offline CPUs should only happen from:
- gp_init()
- rcutree_cpu_report_dead()

Add some documentation on this and refer to it from comments in the code
explaining how QS reporting is not missed when these functions are
concurrently running.

I referred heavily to this post [1] about the need for the ofl_lock.
[1] https://lore.kernel.org/all/20180924164443.GF4222@linux.ibm.com/

[ Applied paulmck feedback on moving documentation to Requirements.rst ]

Link: https://lore.kernel.org/all/01b4d228-9416-43f8-a62e-124b92e8741a@paulmck-laptop/
Co-developed-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-22 17:10:50 +05:30
Joel Fernandes
186779c036 rcu: Document separation of rcu_state and rnp's gp_seq
The details of this are subtle and were discussed recently. Add a
quick-quiz about this and refer to it from the code, for more clarity.

Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-22 17:10:31 +05:30
Joel Fernandes
30a7806ada rcu: Document GP init vs hotplug-scan ordering requirements
Add detailed comments explaining the critical ordering constraints
during RCU grace period initialization, based on discussions with
Frederic.

Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-22 17:09:35 +05:30
Tze-nan Wu
463d46044f rcu: Fix delayed execution of hurry callbacks
We observed a regression in our customer’s environment after enabling
CONFIG_LAZY_RCU. In the Android Update Engine scenario, where ioctl() is
used heavily, we found that callbacks queued via call_rcu_hurry (such as
percpu_ref_switch_to_atomic_rcu) can sometimes be delayed by up to 5
seconds before execution. This occurs because the new grace period does
not start immediately after the previous one completes.

The root cause is that the wake_nocb_gp_defer() function now checks
"rdp->nocb_defer_wakeup" instead of "rdp_gp->nocb_defer_wakeup". On CPUs
that are not rcuog, "rdp->nocb_defer_wakeup" may always be
RCU_NOCB_WAKE_NOT. This can cause "rdp_gp->nocb_defer_wakeup" to be
downgraded and the "rdp_gp->nocb_timer" to be postponed by up to 10
seconds, delaying the execution of hurry RCU callbacks.

The trace log of one scenario we encountered is as follows:
  // previous GP ends at this point
  rcu_preempt   [000] d..1.   137.240210: rcu_grace_period: rcu_preempt 8369 end
  rcu_preempt   [000] .....   137.240212: rcu_grace_period: rcu_preempt 8372 reqwait
  // call_rcu_hurry enqueues "percpu_ref_switch_to_atomic_rcu", the callback waited on by UpdateEngine
  update_engine [002] d..1.   137.301593: __call_rcu_common: wyy: unlikely p_ref = 00000000********. lazy = 0
  // FirstQ on cpu 2 rdp_gp->nocb_timer is set to fire after 1 jiffy (4ms)
  // and the rdp_gp->nocb_defer_wakeup is set to RCU_NOCB_WAKE
  update_engine [002] d..2.   137.301595: rcu_nocb_wake: rcu_preempt 2 FirstQ on cpu2 with rdp_gp (cpu0).
  // A FirstBQ event on cpu2 during that 1 jiffy postpones the timer to 10 seconds later.
  // Also, the rdp_gp->nocb_defer_wakeup is overwritten to RCU_NOCB_WAKE_LAZY
  update_engine [002] d..1.   137.301601: rcu_nocb_wake: rcu_preempt 2 WakeEmptyIsDeferred
  ...
  ...
  ...
  // before the 10 seconds timeout, cpu0 received another call_rcu_hurry
  // reset the timer to jiffies+1 and set the waketype = RCU_NOCB_WAKE.
  kworker/u32:0 [000] d..2.   142.557564: rcu_nocb_wake: rcu_preempt 0 FirstQ
  kworker/u32:0 [000] d..1.   142.557576: rcu_nocb_wake: rcu_preempt 0 WakeEmptyIsDeferred
  kworker/u32:0 [000] d..1.   142.558296: rcu_nocb_wake: rcu_preempt 0 WakeNot
  kworker/u32:0 [000] d..1.   142.558562: rcu_nocb_wake: rcu_preempt 0 WakeNot
  // idle(do_nocb_deferred_wakeup) wake rcuog due to waketype == RCU_NOCB_WAKE
  <idle>        [000] d..1.   142.558786: rcu_nocb_wake: rcu_preempt 0 DoWake
  <idle>        [000] dN.1.   142.558839: rcu_nocb_wake: rcu_preempt 0 DeferredWake
  rcuog/0       [000] .....   142.558871: rcu_nocb_wake: rcu_preempt 0 EndSleep
  rcuog/0       [000] .....   142.558877: rcu_nocb_wake: rcu_preempt 0 Check
  // finally rcuog request a new GP at this point (5 seconds after the FirstQ event)
  rcuog/0       [000] d..2.   142.558886: rcu_grace_period: rcu_preempt 8372 newreq
  rcu_preempt   [001] d..1.   142.559458: rcu_grace_period: rcu_preempt 8373 start
  ...
  rcu_preempt   [000] d..1.   142.564258: rcu_grace_period: rcu_preempt 8373 end
  rcuop/2       [000] D..1.   142.566337: rcu_batch_start: rcu_preempt CBs=219 bl=10
  // the hurry CB is invoked at this point
  rcuop/2       [000] b....   142.566352: blk_queue_usage_counter_release: wyy: wakeup. p_ref = 00000000********.

This patch changes the condition to check "rdp_gp->nocb_defer_wakeup" in
the lazy path. This prevents an already scheduled "rdp_gp->nocb_timer"
from being postponed and avoids overwriting "rdp_gp->nocb_defer_wakeup"
when it is not RCU_NOCB_WAKE_NOT.
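
A schematic sketch of the changed condition (field names per the
changelog; this is not the literal patch):

  /* Consult the rcuog group leader's deferred-wakeup state, not this
   * CPU's, before touching the shared timer. Previously this tested
   * rdp->nocb_defer_wakeup, which stays RCU_NOCB_WAKE_NOT on
   * non-rcuog CPUs. */
  if (READ_ONCE(rdp_gp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) {
  	/* safe to arm rdp_gp->nocb_timer and set the wake type */
  }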

Fixes: 3cb278e73b ("rcu: Make call_rcu() lazy to save power")
Co-developed-by: Cheng-jui Wang <cheng-jui.wang@mediatek.com>
Signed-off-by: Cheng-jui Wang <cheng-jui.wang@mediatek.com>
Co-developed-by: Lorry.Luo@mediatek.com
Signed-off-by: Lorry.Luo@mediatek.com
Tested-by: weiyangyang@vivo.com
Signed-off-by: weiyangyang@vivo.com
Signed-off-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-18 09:25:34 +05:30
Joel Fernandes
908a97eba8 rcu: Refactor expedited handling check in rcu_read_unlock_special()
Extract the complex expedited handling condition in rcu_read_unlock_special()
into a separate function rcu_unlock_needs_exp_handling() with detailed
comments explaining each condition.

This improves code readability. No functional change intended.

Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-16 10:01:06 +05:30
Paul E. McKenney
3aea745a2a srcu: Expedite SRCU-fast grace periods
Currently, SRCU-fast grace periods use synchronize_rcu() to provide the
needed ordering with readers, even given an expedited SRCU-fast grace
period, which isn't all that expedited.  This commit therefore instead
uses synchronize_rcu_expedited() if there is an expedited SRCU-fast
grace period in flight.

Of course, given a non-expedited SRCU-fast grace period blocked in
synchronize_rcu(), a later request for an expedited SRCU-fast grace
period will wait for that synchronize_rcu() to return before switching
to use of synchronize_rcu_expedited().  If this turns out to be a real
problem for a production workload, we can increase the complexity (but
likely also degrade the energy efficiency) to speed things up further.
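
A schematic sketch of the policy described above (the in-flight test is a
stand-in, not the actual implementation):

  if (expedited_gp_in_flight)
  	synchronize_rcu_expedited();  /* hurry the ordering with readers */
  else
  	synchronize_rcu();            /* normal, more energy-efficient path */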

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-16 09:50:57 +05:30
Paul E. McKenney
941ab0b369 rcutorture: Remove support for SRCU-lite
Because SRCU-lite is being replaced by SRCU-fast, this commit removes
support for SRCU-lite from rcutorture.c.

Both SRCU-lite and SRCU-fast provide faster readers by dropping the
smp_mb() call from their lock and unlock primitives, but incur a pair
of added RCU grace periods during the SRCU grace period.  There is a
trivial mapping from the SRCU-lite API to that of SRCU-fast, so there
should be no transition issues.

[ paulmck: Apply Christoph Hellwig feedback. ]

Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-16 09:48:44 +05:30
Paul E. McKenney
cbd5d35e6d torture: Remove support for SRCU-lite
Because SRCU-lite is being replaced by SRCU-fast, this commit removes
support for SRCU-lite from refscale.c.

Both SRCU-lite and SRCU-fast provide faster readers by dropping the
smp_mb() call from their lock and unlock primitives, but incur a pair
of added RCU grace periods during the SRCU grace period.  There is a
trivial mapping from the SRCU-lite API to that of SRCU-fast, so there
should be no transition issues.

[ paulmck: Apply Christoph Hellwig feedback. ]

Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-16 09:44:44 +05:30
Joel Fernandes
b41642c877 rcu: Fix rcu_read_unlock() deadloop due to IRQ work
During rcu_read_unlock_special(), if this happens during irq_exit(), we
can lock up if an IPI is issued, because the IPI itself triggers the
irq_exit() path again, causing a recursive lockup.

This is precisely what Xiongfeng found when invoking a BPF program on
the trace_tick_stop() tracepoint, as shown in the trace below. Fix by
managing the irq_work state correctly.

irq_exit()
  __irq_exit_rcu()
    /* in_hardirq() returns false after this */
    preempt_count_sub(HARDIRQ_OFFSET)
    tick_irq_exit()
      tick_nohz_irq_exit()
	    tick_nohz_stop_sched_tick()
	      trace_tick_stop()  /* a bpf prog is hooked on this trace point */
		   __bpf_trace_tick_stop()
		      bpf_trace_run2()
			    rcu_read_unlock_special()
                              /* will send a IPI to itself */
			      irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);

A simple reproducer can also be obtained by doing the following in
tick_irq_exit(). It will hang on boot without the patch:

  static inline void tick_irq_exit(void)
  {
 +	rcu_read_lock();
 +	WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, true);
 +	rcu_read_unlock();
 +

Reported-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Closes: https://lore.kernel.org/all/9acd5f9f-6732-7701-6880-4b51190aa070@huawei.com/
Tested-by: Qi Xi <xiqi2@huawei.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
[neeraj: Apply Frederic's suggested fix for PREEMPT_RT]
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-16 09:38:26 +05:30
Uladzislau Rezki (Sony)
78370df5c3 rcu: Enable rcu_normal_wake_from_gp on small systems
Automatically enable the rcu_normal_wake_from_gp parameter on
systems with a small number of CPUs. The activation threshold
is set to 16 CPUs.
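
A sketch of the activation rule described above (the exact implementation
may differ):

  if (num_possible_cpus() <= 16)
  	rcu_normal_wake_from_gp = 1;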

This helps reduce the latency of the normal synchronize_rcu() API
by waking up GP waiters earlier and decoupling synchronize_rcu()
callers from regular callback handling.

A benchmark running 64 parallel jobs (on a system with 64 CPUs) invoking
synchronize_rcu() demonstrates a notable latency reduction with the
setting enabled.

Latency distribution (microseconds):

<default>
 0      - 9999   : 1
 10000  - 19999  : 4
 20000  - 29999  : 399
 30000  - 39999  : 3197
 40000  - 49999  : 10428
 50000  - 59999  : 17363
 60000  - 69999  : 15529
 70000  - 79999  : 9287
 80000  - 89999  : 4249
 90000  - 99999  : 1915
 100000 - 109999 : 922
 110000 - 119999 : 390
 120000 - 129999 : 187
 ...
<default>

<rcu_normal_wake_from_gp>
 0      - 9999  : 1
 10000  - 19999 : 234
 20000  - 29999 : 6678
 30000  - 39999 : 33463
 40000  - 49999 : 20669
 50000  - 59999 : 2766
 60000  - 69999 : 183
 ...
<rcu_normal_wake_from_gp>

Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-16 09:23:56 +05:30
Paul E. McKenney
90c09d57ca rcu: Protect ->defer_qs_iw_pending from data race
On kernels built with CONFIG_IRQ_WORK=y, when rcu_read_unlock() is
invoked within an interrupts-disabled region of code [1], it will invoke
rcu_read_unlock_special(), which uses an irq-work handler to force the
system to notice when the RCU read-side critical section actually ends.
That end won't happen until interrupts are enabled at the soonest.

In some kernels, such as those booted with rcutree.use_softirq=y, the
irq-work handler is used unconditionally.

The per-CPU rcu_data structure's ->defer_qs_iw_pending field is
updated by the irq-work handler and is both read and updated by
rcu_read_unlock_special().  This resulted in the following KCSAN splat:

------------------------------------------------------------------------

BUG: KCSAN: data-race in rcu_preempt_deferred_qs_handler / rcu_read_unlock_special

read to 0xffff96b95f42d8d8 of 1 bytes by task 90 on cpu 8:
 rcu_read_unlock_special+0x175/0x260
 __rcu_read_unlock+0x92/0xa0
 rt_spin_unlock+0x9b/0xc0
 __local_bh_enable+0x10d/0x170
 __local_bh_enable_ip+0xfb/0x150
 rcu_do_batch+0x595/0xc40
 rcu_cpu_kthread+0x4e9/0x830
 smpboot_thread_fn+0x24d/0x3b0
 kthread+0x3bd/0x410
 ret_from_fork+0x35/0x40
 ret_from_fork_asm+0x1a/0x30

write to 0xffff96b95f42d8d8 of 1 bytes by task 88 on cpu 8:
 rcu_preempt_deferred_qs_handler+0x1e/0x30
 irq_work_single+0xaf/0x160
 run_irq_workd+0x91/0xc0
 smpboot_thread_fn+0x24d/0x3b0
 kthread+0x3bd/0x410
 ret_from_fork+0x35/0x40
 ret_from_fork_asm+0x1a/0x30

no locks held by irq_work/8/88.
irq event stamp: 200272
hardirqs last  enabled at (200272): [<ffffffffb0f56121>] finish_task_switch+0x131/0x320
hardirqs last disabled at (200271): [<ffffffffb25c7859>] __schedule+0x129/0xd70
softirqs last  enabled at (0): [<ffffffffb0ee093f>] copy_process+0x4df/0x1cc0
softirqs last disabled at (0): [<0000000000000000>] 0x0

------------------------------------------------------------------------

The problem is that irq-work handlers run with interrupts enabled, which
means that rcu_preempt_deferred_qs_handler() could be interrupted,
and that interrupt handler might contain an RCU read-side critical
section, which might invoke rcu_read_unlock_special().  In the strict
KCSAN mode of operation used by RCU, this constitutes a data race on
the ->defer_qs_iw_pending field.

This commit therefore disables interrupts across the portion of the
rcu_preempt_deferred_qs_handler() that updates the ->defer_qs_iw_pending
field.  This suffices because this handler is not a fast path.
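
A schematic sketch of the fix's shape, with the handler's real work
elided:

  static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
  {
  	unsigned long flags;

  	local_irq_save(flags);
  	/* Update ->defer_qs_iw_pending with interrupts disabled so a
  	 * nested rcu_read_unlock_special() cannot interleave. */
  	local_irq_restore(flags);
  }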

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-16 09:22:02 +05:30
Zqiang
1bba3900ca rcu/nocb: Fix possible invalid rdp's->nocb_cb_kthread pointer access
During the preparation stage of CPU online, the corresponding
rdp->nocb_cb_kthread is created if it does not already exist. However,
creation of the rdp's rcuop kthread can fail, after which this CPU's
rdp is de-offloaded without its rdp->nocb_cb_kthread pointer ever being
assigned, even though the rdp's ->nocb_gp_rdp and
->rdp_gp->nocb_gp_kthread remain valid.

A subsequent re-offload operation on this offline CPU will then pass
the conditional check, and kthread_unpark() will access the invalid
rdp->nocb_cb_kthread pointer.

This commit therefore uses the rdp's ->nocb_gp_kthread instead of the
rdp_gp's ->nocb_gp_kthread for the safety check.

Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-08 23:42:51 +05:30
Frederic Weisbecker
fc39760cd0 rcu/exp: Warn on QS requested on dying CPU
It is not possible to send an IPI to a dying CPU that has passed the
CPUHP_TEARDOWN_CPU stage. Remaining unhandled IPIs are handled later at
CPUHP_AP_SMPCFD_DYING stage by stop machine. This is the last
opportunity for the RCU exp handler to request an expedited quiescent
state. And the upcoming final context switch between stop machine and
idle must have reported the requested quiescent state.

Therefore, it should not be possible to observe a pending requested
expedited quiescent state when RCU finally stops watching the outgoing
CPU. Once IPIs aren't possible anymore, the QS for the target CPU will
be reported on its behalf by the RCU exp kworker.

Provide an assertion to verify those expectations.
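
A schematic sketch of such an assertion (using the ->cpu_no_qs.b.exp
field named elsewhere in this series; the actual check may differ):

  /* RCU must not stop watching a CPU that still owes an expedited QS. */
  WARN_ON_ONCE(READ_ONCE(rdp->cpu_no_qs.b.exp));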

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-08 23:21:13 +05:30
Frederic Weisbecker
bf0a57445d rcu/exp: Remove needless CPU up quiescent state report
A CPU coming online checks for an ongoing grace period and reports
a quiescent state accordingly if needed. This special treatment, which
shortcuts the expedited IPI, originated as an optimization in the
following commit:

	338b0f760e ("rcu: Better hotplug handling for synchronize_sched_expedited()")

The point is to avoid an IPI while waiting for a CPU to become online
or failing to become offline.

However this is pointless and even error prone for several reasons:

* If the CPU has been seen offline in the first round scanning offline
  and idle CPUs, no IPI is even tried and the quiescent state is
  reported on behalf of the CPU.

* This means that if the IPI fails, the CPU just became offline. So
  it's unlikely to become online right away, unless the cpu hotplug
  operation failed and rolled back, which is a rare event that can
  wait a jiffy for a new IPI to be issued.

* But then the "optimization" applying on failing CPU hotplug down only
  applies to !PREEMPT_RCU.

* This forcibly reports a quiescent state even if ->cpu_no_qs.b.exp is
  not set. As a result it can race with remote QS reports on the same rdp.
  Fortunately it happens to be OK so far, but it is an accident waiting
  to happen.

For all those reasons, remove this optimization that doesn't look worth
keeping around.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-08 23:21:13 +05:30
Frederic Weisbecker
4b9432ed65 rcu/exp: Remove confusing needless full barrier on task unblock
A full memory barrier in the RCU-PREEMPT task unblock path advertises
that it orders the context switch (or rather the accesses prior to
rcu_read_unlock()) against the expedited grace period fastpath.

However, the grace period cannot complete without rcu_report_exp_rnp()
being called on the rnp with the node locked. This reports the quiescent
state in a fashion fully ordered against the updater's accesses thanks to:

1) The READ-SIDE smp_mb__after_unlock_lock() barrier across nodes
   locking while propagating QS up to the root.

2) The UPDATE-SIDE smp_mb__after_unlock_lock() barrier while holding the
   root rnp to wait/check for the GP completion.

3) The (perhaps redundant given steps 1) and 2)) smp_mb() in rcu_seq_end()
   before the grace period completes.

This makes the explicit barrier in this place superfluous. Therefore
remove it as it is confusing.

Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-08 23:20:19 +05:30
Artem Sadovnikov
005b618770 refscale: Check that nreaders and loops multiplication doesn't overflow
The nreaders and loops variables are exposed as module parameters, which,
in certain combinations, can lead to multiplication overflow.

Besides, the loops parameter is defined as long but is used as an int
throughout the code, which can cause truncation on 64-bit kernels and
possible zeroes where they shouldn't appear.

Since the code uses the result of the multiplication as an int anyway, it
only makes sense to change loops to int. A multiplication overflow check
is also added, due to possible multiplication between two very big numbers.
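
A minimal sketch of the added guard, using the helper from
linux/overflow.h (variable names per the changelog):

  int product;

  if (check_mul_overflow(nreaders, loops, &product))
  	return -EINVAL;  /* reject parameter combinations that overflow */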

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: 653ed64b01 ("refperf: Add a test to measure performance of read-side synchronization")
Signed-off-by: Artem Sadovnikov <a.sadovnikov@ispras.ru>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-07 09:45:45 +05:30
Frederic Weisbecker
a33ad03aae rcu/nocb: Dump gp state even if rdp gp itself is not offloaded
When a stall is detected, the state of each NOCB CPU is dumped along
with the state of each NOCB group. The latter part, however, is
incidentally ignored if the NOCB group leader happens not to be
offloaded itself.

Fix this to make sure this precious information isn't lost from
a stall report.

Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-07 09:45:19 +05:30
Zqiang
8d71351d88 rcutorture: Fix rcutorture_one_extend_check() splat in RT kernels
For kernels built with CONFIG_PREEMPT_RT=y, running rcutorture
tests resulted in the following splat:

[   68.797425] rcutorture_one_extend_check during change: Current 0x1  To add 0x1  To remove 0x0  preempt_count() 0x0
[   68.797533] WARNING: CPU: 2 PID: 512 at kernel/rcu/rcutorture.c:1993 rcutorture_one_extend_check+0x419/0x560 [rcutorture]
[   68.797601] Call Trace:
[   68.797602]  <TASK>
[   68.797619]  ? lockdep_softirqs_off+0xa5/0x160
[   68.797631]  rcutorture_one_extend+0x18e/0xcc0 [rcutorture 2466dbd2ff34dbaa36049cb323a80c3306ac997c]
[   68.797646]  ? local_clock+0x19/0x40
[   68.797659]  rcu_torture_one_read+0xf0/0x280 [rcutorture 2466dbd2ff34dbaa36049cb323a80c3306ac997c]
[   68.797678]  ? __pfx_rcu_torture_one_read+0x10/0x10 [rcutorture 2466dbd2ff34dbaa36049cb323a80c3306ac997c]
[   68.797804]  ? __pfx_rcu_torture_timer+0x10/0x10 [rcutorture 2466dbd2ff34dbaa36049cb323a80c3306ac997c]
[   68.797815] rcu-torture: rcu_torture_reader task started
[   68.797824] rcu-torture: Creating rcu_torture_reader task
[   68.797824]  rcu_torture_reader+0x238/0x580 [rcutorture 2466dbd2ff34dbaa36049cb323a80c3306ac997c]
[   68.797836]  ? kvm_sched_clock_read+0x15/0x30

Disabling BH does not change the corresponding SOFTIRQ bits in
preempt_count() for RT kernels, so this commit instead uses
softirq_count() to check whether BH is disabled.
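
A schematic sketch of the corrected check:

  /* On RT kernels, disabling BH does not touch the preempt_count()
   * SOFTIRQ bits, but softirq_count() still reflects BH state. */
  bool bh_disabled = softirq_count() != 0;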

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25 08:39:02 +05:30
Paul E. McKenney
5f2417ba05 rcutorture: Make Trivial RCU ignore onoff_interval and shuffle_interval
Trivial RCU is a textbook implementation that is not used in the
Linux kernel, but tested to keep textbooks (and presentations) honest.
It is so trivial that it cannot deal with either CPU hotplug or external
migration from one CPU to another.  This commit therefore splats whenever
onoff_interval or shuffle_interval are non-zero, and then sets them to
zero in order to avoid false-positive failures.

Those wishing to set these module parameters in order to force failures
in Trivial RCU are free to revert this commit.  Just don't expect me to
be sympathetic to any resulting bug reports!

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202505131651.af6e81d7-lkp@intel.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25 08:39:01 +05:30
Paul E. McKenney
f32367d96e rcutorture: Drop redundant "insoftirq" parameters
Given that the rcutorture_one_extend_check() function now uses
in_serving_softirq() and in_hardirq(), it is no longer necessary to pass
insoftirq flags down the function-call stack.  This commit therefore
removes those flags, and, while in the area, does a bit of whitespace
cleanup.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25 08:39:01 +05:30
Paul E. McKenney
cacba0bf6d rcutorture: Print number of RCU up/down readers and migrations
This commit prints the number of RCU up/down readers and the number
of such readers that migrated from one CPU to another, along
with the rest of the periodic rcu_torture_stats_print() output.
These statistics are currently used only by srcu_down_read{,_fast}()
and srcu_up_read{,_fast}().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25 08:39:01 +05:30
Paul E. McKenney
f6c8785f50 rcutorture: Check for no up/down readers at task level
The design of testing of up/down readers such as srcu_down_read()
and srcu_up_read() assumes that these are tested only by the
rcu_torture_updown() kthread, and never by the rcu_torture_reader()
kthread.  Because we all know which road is paved with good intentions,
this commit adds WARN_ON_ONCE() to verify that things are going to plan.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25 08:39:01 +05:30
Paul E. McKenney
93856948be rcutorture: Check for ->up_read() without matching ->down_read()
This commit creates counters in the rcu_torture_one_read_state_updown
structure that check for a call to ->up_read() that lacks a matching
call to ->down_read().

While in the area, add end-of-run cleanup code that prevents calls to
rcu_torture_updown_hrt() from happening after the test has moved on.  Yes,
the srcu_barrier() at the end of the test will wait for them, but this
could result in confusing states, statistics, and diagnostic information.
So explicitly wait for them before we get to the end-of-test output.

[ paulmck: Apply kernel test robot feedback. ]
[ joel: Apply Boqun's fix for counter increment ordering. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25 08:39:01 +05:30
Paul E. McKenney
62d92c9b87 rcutorture: Complain if an ->up_read() is delayed more than 10 seconds
The down/up SRCU reader testing uses an hrtimer handler to exit the SRCU
read-side critical section.  This might be delayed, and if delayed for
too long, it can prevent the rcutorture run from completing.  This commit
therefore complains if the hrtimer handler is delayed for more than
ten seconds.

[ paulmck, joel: Apply kernel test robot feedback to avoid
false-positive complaint of excessive ->up_read() delays by using
HRTIMER_MODE_HARD ]

Tested-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25 08:39:01 +05:30