linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-04-11 20:03:50 -04:00

Author	SHA1	Message	Date
Hou Tao	e8a65856c7	bpf: Add is_fd_htab() helper Add is_fd_htab() helper to check whether the map is htab of maps. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 20:12:53 -07:00
Hou Tao	2c304172e0	bpf: Support atomic update for htab of maps As reported by Cody Haas [1], when there is concurrent map lookup and map update operation in an existing element for htab of maps, the map lookup procedure may return -ENOENT unexpectedly. The root cause is twofold: 1) The update of an existing element involves two separate list operations. In htab_map_update_elem(), it first inserts the new element at the head of list, then it deletes the old element. Therefore, it is possible a lookup operation has already iterated to the middle of the list when a concurrent update operation begins, and the lookup operation will fail to find the target element. 2) The immediate reuse of htab element. It is more subtle. Even though the lookup operation finds the old element, it is possible that the target element has been removed by a concurrent update operation, and the element has been reused immediately by other update operation which runs on the same CPU as the previous update operation, and the element is inserted into the same bucket list. After these steps above, when the lookup operation tries to compare the key in the old element with the expected key, the match will fail because the key in the old element has been overwritten by other update operation. The two-step update process is relatively straightforward to address. The more challenging aspect is the immediate reuse. As Alexei pointed out: So since 2022 both prealloc and no_prealloc reuse elements. We can consider a new flag for the hash map like F_REUSE_AFTER_RCU_GP that will use _rcu() flavor of freeing into bpf_ma, but it has to have a strong reason. Given that htab of maps doesn't support special field in value and directly stores the inner map pointer in htab_element, just do in-place update for htab of maps instead of attempting to address the immediate reuse issue.
[1]: https://lore.kernel.org/xdp-newbies/CAH7f-ULFTwKdoH_t2SFc5rWCVYLEg-14d1fBYWH2eekudsnTRg@mail.gmail.com/ Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 20:12:53 -07:00
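The in-place update idea from the commit above can be illustrated with a small user-space sketch. This is not the kernel code: `struct elem`, `bucket_find()`, `update_in_place()` and `lookup()` are hypothetical names, and C11 atomics stand in for the kernel's own primitives. The point is only that the element stays on the bucket list, so a concurrent lookup never sees a window in which the key is missing.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified bucket element: a key plus an inner-map pointer. */
struct elem {
	int key;
	_Atomic(void *) inner_map;
	struct elem *next;
};

/* Walk the bucket list and return the element holding key, or NULL. */
static struct elem *bucket_find(struct elem *head, int key)
{
	for (struct elem *e = head; e; e = e->next)
		if (e->key == key)
			return e;
	return NULL;
}

/*
 * In-place update: the element is never unlinked or reinserted; only its
 * pointer is swapped atomically. Returns the previous pointer, or NULL if
 * the key is not present.
 */
static void *update_in_place(struct elem *head, int key, void *new_map)
{
	struct elem *e = bucket_find(head, key);

	if (!e)
		return NULL;
	return atomic_exchange(&e->inner_map, new_map);
}

/* A lookup sees either the old or the new pointer, never "not found". */
static void *lookup(struct elem *head, int key)
{
	struct elem *e = bucket_find(head, key);

	return e ? atomic_load(&e->inner_map) : NULL;
}
```

Compared with insert-new-then-delete-old, this avoids both problems described in the commit message for the update path: there is no second list operation to race with, and no element is freed and reused while a reader may still be comparing its key.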
Hou Tao	5771e306b6	bpf: Rename __htab_percpu_map_update_elem to htab_map_update_elem_in_place Rename __htab_percpu_map_update_elem to htab_map_update_elem_in_place, and add a new percpu argument for the helper to support in-place update for both per-cpu htab and htab of maps. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 20:12:53 -07:00
Hou Tao	ba2b31b0f3	bpf: Factor out htab_elem_value helper() All hash maps store map key and map value together. The relative offset of the map value compared to the map key is round_up(key_size, 8). Therefore, factor out a common helper htab_elem_value() to calculate the address of the map value instead of duplicating the logic. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 20:12:53 -07:00
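As a rough illustration of the layout rule stated in the commit above (value stored at `round_up(key_size, 8)` bytes past the key), here is a minimal user-space sketch. `ROUND_UP_8` and `elem_value()` are stand-in names chosen for this example; the kernel helper operates on its own element structure.

```c
#include <stddef.h>
#include <stdint.h>

/* Round n up to the next multiple of 8 (stand-in for round_up(n, 8)). */
#define ROUND_UP_8(n) (((n) + 7u) & ~(size_t)7)

/*
 * Given the address of the key inside an element, return the address of
 * the value that follows it. The value starts at the first 8-byte
 * aligned offset at or after the end of the key.
 */
static void *elem_value(void *key, size_t key_size)
{
	return (uint8_t *)key + ROUND_UP_8(key_size);
}
```

For example, a 4-byte key and an 8-byte key both place the value 8 bytes after the key, while a 9-byte key places it 16 bytes after.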
Stanislav Fomichev	311920774c	configs/debug: run and debug PREEMPT Recent change [0] resulted in a "BUG: using __this_cpu_read() in preemptible" splat [1]. PREEMPT kernels have additional requirements on what can and can not run with/without preemption enabled. Expose those constraints in the debug kernels. 0: https://lore.kernel.org/netdev/20250314120048.12569-2-justin.iurman@uliege.be/ 1: https://lore.kernel.org/netdev/20250402094458.006ba2a7@kernel.org/T/#mbf72641e9d7d274daee9003ef5edf6833201f1bc Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250402172305.1775226-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-09 17:47:06 -07:00
Tao Chen	a76116f422	bpf: Check link_create.flags parameter for multi_uprobe The link_create.flags are currently not used for multi-uprobes, so return -EINVAL if it is set, same as for other attach APIs. We allow target_fd to have an arbitrary value for multi-uprobe, though, as there are existing users (libbpf) relying on this. Fixes: `89ae89f53d` ("bpf: Add multi uprobe link") Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250407035752.1108927-2-chen.dylane@linux.dev	2025-04-09 16:28:51 -07:00
Tao Chen	243911982a	bpf: Check link_create.flags parameter for multi_kprobe The link_create.flags are currently not used for multi-kprobes, so return -EINVAL if it is set, same as for other attach APIs. We allow target_fd, on the other hand, to have an arbitrary value for multi-kprobe, as there are existing users (libbpf) relying on this. Fixes: `0dcac27254` ("bpf: Add multi kprobe link") Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250407035752.1108927-1-chen.dylane@linux.dev	2025-04-09 16:28:22 -07:00
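The validation rule shared by the two commits above (reject a non-zero `flags`, tolerate any `target_fd`) can be sketched as follows. The struct and function names are hypothetical and the struct is reduced to the two fields the commit messages mention; this is not the kernel's attribute layout.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical, reduced view of the link-create attributes. */
struct link_create_attr {
	uint32_t flags;
	uint32_t target_fd;
};

/*
 * flags is unused for multi-kprobe/multi-uprobe links, so any non-zero
 * value is rejected. target_fd is deliberately not checked, because
 * existing users are said to pass arbitrary values there.
 */
static int multi_probe_check_attr(const struct link_create_attr *attr)
{
	if (attr->flags)
		return -EINVAL;
	return 0;
}
```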
Sebastian Andrzej Siewior	92e250c624	timekeeping: Add a lockdep override in tick_freeze() tick_freeze() acquires a raw spinlock (tick_freeze_lock). Later in the callchain (timekeeping_suspend() -> mc146818_avoid_UIP()) the RTC driver acquires a spinlock which becomes a sleeping lock on PREEMPT_RT. Lockdep complains about this lock nesting. Add a lockdep override for this special case and a comment explaining why it is okay. Reported-by: Borislav Petkov <bp@alien8.de> Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20250404133429.pnAzf-eF@linutronix.de Closes: https://lore.kernel.org/all/20250330113202.GAZ-krsjAnurOlTcp-@fat_crate.local/ Closes: https://lore.kernel.org/all/CAP-bSRZ0CWyZZsMtx046YV8L28LhY0fson2g4EqcwRAVN1Jk+Q@mail.gmail.com/	2025-04-09 22:30:39 +02:00
Eric Dumazet	0df6db767a	posix-timers: Initialize cache early and move pointer into __timer_data Move posix_timers_cache initialization to posixtimer_init(). At that point the memory subsystem is already up and running. Also move the cache pointer to the __timer_data variable to avoid potential false sharing, since it never was marked as __ro_after_init. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250402133114.253901-1-edumazet@google.com	2025-04-09 21:21:36 +02:00
Tejun Heo	0b30461793	sched_ext: Make scx_has_op a bitmap scx_has_op is used to encode which ops are implemented by the BPF scheduler into an array of static_keys. While this saves a bit of branching overhead, that is unlikely to be noticeable compared to the overall cost. As the global static_keys can't work with the planned hierarchical multiple scheduler support, replace the static_key array with a bitmap. In repeated hackbench runs before and after static_keys removal on an AMD Ryzen 3900X, I couldn't tell any measurable performance difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>	2025-04-09 09:06:00 -10:00
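A minimal sketch of the bitmap approach described in the commit above, assuming a per-scheduler structure rather than global state. The enum values, `struct sched`, `set_has_op()` and `has_op()` are illustrative names only; the kernel uses its own bitmap helpers and op indices.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical op indices; the real list is much longer. */
enum op_idx { OP_ENQUEUE, OP_DISPATCH, OP_CPU_ACQUIRE, OP_IDX_END };

#define BITS_PER_LONG    (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/*
 * One bitmap per scheduler instance. Unlike a global array of static
 * keys, this can be replicated for each scheduler in a hierarchy.
 */
struct sched {
	unsigned long has_op[BITS_TO_LONGS(OP_IDX_END)];
};

/* Record that the scheduler implements the given op. */
static void set_has_op(struct sched *s, enum op_idx op)
{
	s->has_op[op / BITS_PER_LONG] |= 1UL << (op % BITS_PER_LONG);
}

/* Test whether the scheduler implements the given op. */
static bool has_op(const struct sched *s, enum op_idx op)
{
	return s->has_op[op / BITS_PER_LONG] & (1UL << (op % BITS_PER_LONG));
}
```

The test costs a load and a mask instead of a patched branch, which is the small overhead the commit message argues is unlikely to be noticeable.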
Tejun Heo	743354e3bb	sched_ext: Remove scx_ops_allow_queued_wakeup static_key scx_ops_allow_queued_wakeup is used to encode SCX_OPS_ALLOW_QUEUED_WAKEUP into a static_key. The test is gated behind scx_enabled(), and, even when sched_ext is enabled, it is unlikely for the static_key usage to make any meaningful difference. It is made to use a static_key mostly because there was no reason not to. However, global static_keys can't work with the planned hierarchical multiple scheduler support. Remove the static_key and instead test SCX_OPS_ALLOW_QUEUED_WAKEUP directly. In repeated hackbench runs before and after static_keys removal on an AMD Ryzen 3900X, I couldn't tell any measurable performance difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>	2025-04-09 09:06:00 -10:00
Tejun Heo	54d2e717bc	sched_ext: Remove scx_ops_cpu_preempt static_key scx_ops_cpu_preempt is used to encode whether ops.cpu_acquire/release() are implemented into a static_key. These tests aren't hot enough for static_key usage to make any meaningful difference and are made to use a static_key mostly because there was no reason not to. However, global static_keys can't work with the planned hierarchical multiple scheduler support. Remove the static_key and instead use an internal ops flag SCX_OPS_HAS_CPU_PREEMPT to record and test whether ops.cpu_acquire/release() are implemented. In repeated hackbench runs before and after static_keys removal on an AMD Ryzen 3900X, I couldn't tell any measurable performance difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>	2025-04-09 09:06:00 -10:00
Tejun Heo	cc39454c34	sched_ext: Remove scx_ops_enq_* static_keys scx_ops_enq_last/exiting/migration_disabled are used to encode the corresponding SCX_OPS_ flags into static_keys. These flags aren't hot enough for static_key usage to make any meaningful difference and are made static_keys mostly because there was no reason not to. However, global static_keys can't work with the planned hierarchical multiple scheduler support. Remove the static_keys and test the ops flags directly. In repeated hackbench runs before and after static_keys removal on an AMD Ryzen 3900X, I couldn't tell any measurable performance difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>	2025-04-09 09:05:59 -10:00
Tejun Heo	d75ee2d678	sched_ext: Indentation updates Purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>	2025-04-09 09:05:59 -10:00
Nam Cao	2424e146be	hrtimer: Add missing ACCESS_PRIVATE() for hrtimer::function The "function" field of struct hrtimer has been changed to private, but two instances have not been converted to use ACCESS_PRIVATE(). Convert them to use ACCESS_PRIVATE(). Fixes: `04257da0c9` ("hrtimers: Make callback function pointer private") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250408103854.1851093-1-namcao@linutronix.de Closes: https://lore.kernel.org/oe-kbuild-all/202504071931.vOVl13tt-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202504072155.5UAZjYGU-lkp@intel.com/	2025-04-09 21:00:42 +02:00
Thomas Gleixner	9357e329cd	genirq/msi: Rename msi_[un]lock_descs() Now that all abuse is gone and the legit users are converted to guard(msi_descs_lock), rename the lock functions and document them as internal. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huwei.com> Link: https://lore.kernel.org/all/20250319105506.864699741@linutronix.de	2025-04-09 20:47:30 +02:00
Thomas Gleixner	0dac2b0930	genirq/msi: Use lock guards for MSI descriptor locking Provide a lock guard for MSI descriptor locking and update the core code accordingly. No functional change intended. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/all/20250319105506.144672678@linutronix.de	2025-04-09 20:47:29 +02:00
Thorsten Blum	cfdb7520f9	PM: hibernate: Remove size arguments when calling strscpy() The size parameter is optional and strscpy() automatically determines the length of the destination buffer using sizeof() if the argument is omitted. This makes the explicit sizeof() calls unnecessary. Remove them to shorten and simplify the code. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://patch.msgid.link/20250318080755.61126-2-thorsten.blum@linux.dev Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-04-09 20:11:46 +02:00
Steven Rostedt	e1a453a57b	tracing: Do not add length to print format in synthetic events The following causes a vsnprintf fault: # echo 's:wake_lat char[] wakee; u64 delta;' >> /sys/kernel/tracing/dynamic_events # echo 'hist:keys=pid:ts=common_timestamp.usecs if !(common_flags & 0x18)' > /sys/kernel/tracing/events/sched/sched_waking/trigger # echo 'hist:keys=next_pid:delta=common_timestamp.usecs-$ts:onmatch(sched.sched_waking).trace(wake_lat,next_comm,$delta)' > /sys/kernel/tracing/events/sched/sched_switch/trigger Because the synthetic event's "wakee" field is created as a dynamic string (even though the string copied is not). The print format to print the dynamic string changed from "%*s" to "%s" because another location (__set_synth_event_print_fmt()) exported this to user space, and user space did not need that. But it is still used in print_synth_event(), and the output looks like: <idle>-0 [001] d..5. 193.428167: wake_lat: wakee=(efault)sshd-sessiondelta=155 sshd-session-879 [001] d..5. 193.811080: wake_lat: wakee=(efault)kworker/u34:5delta=58 <idle>-0 [002] d..5. 193.811198: wake_lat: wakee=(efault)bashdelta=91 bash-880 [002] d..5. 193.811371: wake_lat: wakee=(efault)kworker/u35:2delta=21 <idle>-0 [001] d..5. 193.811516: wake_lat: wakee=(efault)sshd-sessiondelta=129 sshd-session-879 [001] d..5. 193.967576: wake_lat: wakee=(efault)kworker/u34:5delta=50 The length isn't needed as the string is always nul terminated. Just print the string and not add the length (which was hard coded to the max string length anyway). 
Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Douglas Raillard <douglas.raillard@arm.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/20250407154139.69955768@gandalf.local.home Fixes: `4d38328eb4` ("tracing: Fix synth event printk format for str fields"); Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-04-09 11:34:21 -04:00
Arnd Bergmann	d7b98ae522	dma/contiguous: avoid warning about unused size_bytes When building with W=1, this variable is unused for configs with CONFIG_CMA_SIZE_SEL_PERCENTAGE=y: kernel/dma/contiguous.c:67:26: error: 'size_bytes' defined but not used [-Werror=unused-const-variable=] Change this to a macro to avoid the warning. Fixes: `c64be2bb1c` ("drivers: add Contiguous Memory Allocator") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20250409151557.3890443-1-arnd@kernel.org	2025-04-09 17:28:53 +02:00
Joel Granados	bc4f328ff5	sparc: mv sparc sysctls into their own file under arch/sparc/kernel Move sparc sysctls (reboot-cmd, stop-a, scons-poweroff and tsb-ratio) into a new file (arch/sparc/kernel/setup.c). This file will be included for both 32 and 64 bit sparc. Leave "tsb-ratio" under SPARC64 ifdef as it was in kernel/sysctl.c. The sysctl table register is called with arch_initcall placing it after its original place in proc_root_init. This is part of a greater effort to move ctl tables into their respective subsystems which will reduce the merge conflicts in kernel/sysctl.c. Signed-off-by: Joel Granados <joel.granados@kernel.org>	2025-04-09 13:32:16 +02:00
Joel Granados	67049b53e0	stack_tracer: move sysctl registration to kernel/trace/trace_stack.c Move stack_tracer_enabled into trace_stack_sysctl_table. This is part of a greater effort to move ctl tables into their respective subsystems which will reduce the merge conflicts in kernel/sysctl.c. Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Joel Granados <joel.granados@kernel.org>	2025-04-09 13:32:16 +02:00
Joel Granados	dd293df639	tracing: Move trace sysctls into trace.c Move trace ctl tables into their own const array in kernel/trace/trace.c. The sysctl table register is called with subsys_initcall placing it after its original place in proc_root_init. This is part of a greater effort to move ctl tables into their respective subsystems which will reduce the merge conflicts in kernel/sysctl.c. Signed-off-by: Joel Granados <joel.granados@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-04-09 13:32:16 +02:00
Joel Granados	256db5c9b5	signal: Move signal ctl tables into signal.c Move print-fatal-signals into its own const ctl table array in kernel/signal.c. This is part of a greater effort to move ctl tables into their respective subsystems which will reduce the merge conflicts in kernel/sysctl.c. Signed-off-by: Joel Granados <joel.granados@kernel.org>	2025-04-09 13:32:16 +02:00
Joel Granados	c09b981041	panic: Move panic ctl tables into panic.c Move panic, panic_on_oops, panic_print, panic_on_warn into kernel/panic.c. This is part of a greater effort to move ctl tables into their respective subsystems which will reduce the merge conflicts in kernel/sysctl.c. Signed-off-by: Joel Granados <joel.granados@kernel.org>	2025-04-09 13:32:16 +02:00
Geert Uytterhoeven	771487050f	genirq/generic-chip: Fix incorrect lock guard conversions When booting BeagleBone Black: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4398 lockdep_hardirqs_on_prepare+0x23c/0x280 DEBUG_LOCKS_WARN_ON(early_boot_irqs_disabled) CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.15.0-rc1-boneblack-00004-g195298c3b116 #209 NONE Hardware name: Generic AM33XX (Flattened Device Tree) Call trace: _raw_spin_unlock_irq from irq_map_generic_chip+0x144/0x190 irq_map_generic_chip from irq_domain_associate_locked+0x68/0x164 irq_domain_associate_locked from irq_create_fwspec_mapping+0x34c/0x43c irq_create_fwspec_mapping from irq_create_of_mapping+0x64/0x8c irq_create_of_mapping from irq_of_parse_and_map+0x54/0x7c irq_of_parse_and_map from dmtimer_clkevt_init_common+0x54/0x15c dmtimer_clkevt_init_common from dmtimer_systimer_init+0x41c/0x5b8 dmtimer_systimer_init from timer_probe+0x68/0xf0 timer_probe from start_kernel+0x4a4/0x6bc start_kernel from 0x0 irq event stamp: 0 hardirqs last enabled at (0): [<00000000>] 0x0 hardirqs last disabled at (0): [<00000000>] 0x0 softirqs last enabled at (0): [<00000000>] 0x0 softirqs last disabled at (0): [<00000000>] 0x0 ---[ end trace 0000000000000000 ]--- and: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 0 at init/main.c:1022 start_kernel+0x4e8/0x6bc Interrupts were enabled early CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G W 6.15.0-rc1-boneblack-00004-g195298c3b116 #209 NONE Tainted: [W]=WARN Hardware name: Generic AM33XX (Flattened Device Tree) Call trace: unwind_backtrace from show_stack+0x10/0x14 show_stack from dump_stack_lvl+0x6c/0x90 dump_stack_lvl from __warn+0x70/0x1b0 __warn from warn_slowpath_fmt+0x1d4/0x1ec warn_slowpath_fmt from start_kernel+0x4e8/0x6bc start_kernel from 0x0 irq event stamp: 0 hardirqs last enabled at (0): [<00000000>] 0x0 hardirqs last disabled at (0): [<00000000>] 0x0 softirqs last enabled at (0): [<00000000>] 0x0 softirqs 
last disabled at (0): [<00000000>] 0x0 ---[ end trace 0000000000000000 ]--- Fix this by correcting two misconversions of raw_spin_{,un}lock_irq{save,restore}() to lock guards. Fixes: `195298c3b1` ("genirq/generic-chip: Convert core code to lock guards") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/514f94c5891c61ac0a4a7fdad113e75db1eea367.1744135467.git.geert+renesas@glider.be	2025-04-08 22:49:02 +02:00
Linus Torvalds	bec7dcbc24	Merge tag 'probes-fixes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - fprobe: remove fprobe_hlist_node when module unloading When a fprobe target module is removed, the fprobe_hlist_node should be removed from the fprobe's hash table to prevent reusing accidentally if another module is loaded at the same address. - fprobe: lock module while registering fprobe The module containing the function to be probed is locked using a reference counter until the fprobe registration is complete, which prevents use after free. - fprobe-events: fix possible UAF on modules Basically the same as above, but in the fprobe-events layer we also need to get module reference counter when we find the tracepoint in the module. * tag 'probes-fixes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: fprobe: Cleanup fprobe hash when module unloading tracing: fprobe events: Fix possible UAF on modules tracing: fprobe: Fix to lock module while registering fprobe	2025-04-08 12:51:34 -07:00
Linus Torvalds	e37f72b3b4	Merge tag 'cgroup-for-6.15-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - A number of cpuset remote partition related fixes and cleanups along with selftest updates. - A change from this merge window made cgroup_rstat_updated_list() called outside cgroup_rstat_lock leading to list corruptions. Fix it by relocating the call inside the lock. * tag 'cgroup-for-6.15-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: Fix race between newly created partition and dying one cgroup: rstat: call cgroup_rstat_updated_list with cgroup_rstat_lock selftest/cgroup: Add a remote partition transition test to test_cpuset_prs.sh selftest/cgroup: Clean up and restructure test_cpuset_prs.sh selftest/cgroup: Update test_cpuset_prs.sh to use \| as effective CPUs and state separator cgroup/cpuset: Remove unneeded goto in sched_partition_write() and rename it cgroup/cpuset: Code cleanup and comment update cgroup/cpuset: Don't allow creation of local partition over a remote one cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition cgroup/cpuset: Fix error handling in remote_partition_disable() cgroup/cpuset: Fix incorrect isolated_cpus update in update_parent_effective_cpumask()	2025-04-08 12:15:05 -07:00
Tejun Heo	294f5ff474	sched_ext: Merge branch 'for-6.15-fixes' into for-6.16 Pull for-6.15-fixes to receive: `e776b26e37` ("sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings") which conflicts with: `1a7ff7216c` ("sched_ext: Drop "ops" from scx_ops_enable_state and friends") The former removes code updated by the latter. Resolved by removing the updated section. Signed-off-by: Tejun Heo <tj@kernel.org>	2025-04-08 08:56:57 -10:00
Joel Fernandes	f50ad4b73e	srcu: Use rcu_seq_done_exact() for polling API poll_state_synchronize_srcu() uses rcu_seq_done() unlike poll_state_synchronize_rcu() which uses rcu_seq_done_exact(). The rcu_seq_done_exact() makes more sense for polling API, as with this API, there is a higher chance that there is a significant delay between the get_state..() and poll_state..() calls since a cookie can be stored and reused at a later time. During such a delay, if the gp_seq counter progresses more than ULONG_MAX/2 distance, then poll_state..() may return false for a long time unwantedly. Fix by using the more accurate rcu_seq_done_exact() API which is exactly what straight RCU's polling does. It may make sense, as future work, to add debug code here as well, where we compare a physical timestamp between get_state..() and poll_state() calls and yell if significant time has passed but the grace period has still not progressed. Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>	2025-04-08 14:55:55 -04:00
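The wraparound behaviour discussed in the commit above can be sketched in user space with plain unsigned arithmetic. This is a simplified model, not the kernel implementation: `seq_done()` and `seq_done_exact()` are stand-in names, they take values rather than pointers, and the `guard` parameter is an assumption standing in for the small guardband constant the kernel derives from its sequence-state mask.

```c
#include <limits.h>
#include <stdbool.h>

/* Wraparound-safe comparisons on unsigned long sequence numbers. */
#define ULONG_CMP_GE(a, b) (ULONG_MAX / 2 >= (a) - (b))
#define ULONG_CMP_LT(a, b) (ULONG_MAX / 2 < (a) - (b))

/*
 * Loose check: "done" only while cur is within ULONG_MAX/2 ahead of
 * snap. Once the counter runs further ahead than that, the cookie
 * looks like it is in the future again and the check returns false.
 */
static bool seq_done(unsigned long cur, unsigned long snap)
{
	return ULONG_CMP_GE(cur, snap);
}

/*
 * Tighter check: additionally treat the cookie as done whenever cur is
 * more than `guard` behind snap in wraparound terms, which shrinks the
 * false-negative window from ULONG_MAX/2 down to `guard`.
 */
static bool seq_done_exact(unsigned long cur, unsigned long snap,
			   unsigned long guard)
{
	return ULONG_CMP_GE(cur, snap) || ULONG_CMP_LT(cur, snap - guard);
}
```

With a cookie that was stored long ago, so that the counter has advanced by more than `ULONG_MAX / 2`, `seq_done()` reports false while `seq_done_exact()` still reports true; a counter that is only slightly behind the cookie is reported as not done by both.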
Phil Auld	6432e163ba	sched/isolation: Make use of more than one housekeeping cpu The existing code uses housekeeping_any_cpu() to select a cpu for a given housekeeping task. However, this often ends up calling cpumask_any_and() which is defined as cpumask_first_and() which has the effect of always using the first cpu among those available. The same applies when multiple NUMA nodes are involved. In that case the first cpu in the local node is chosen which does provide a bit of spreading but with multiple HK cpus per node the same issues arise. We have numerous cases where a single HK cpu just cannot keep up and the remote_tick warning fires. It also can lead to the other things (orchestration sw, HA keepalives etc) on the HK cpus getting starved which leads to other issues. In these cases we recommend increasing the number of HK cpus. But... that only helps the userspace tasks somewhat. It does not help the actual housekeeping part. Spread the HK work out by having housekeeping_any_cpu() and sched_numa_find_closest() use cpumask_any_and_distribute() instead of cpumask_any_and(). Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20250218184618.1331715-1-pauld@redhat.com	2025-04-08 20:55:55 +02:00
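The difference between "first match" and "distributed" selection described in the commit above can be shown with a toy model that uses a 64-bit word as the CPU mask. `first_and()` and `any_and_distribute()` are hypothetical helpers written for this sketch, and the caller-supplied `prev` cursor is an assumption standing in for whatever state the kernel keeps between calls.

```c
#include <stdint.h>

#define NR_CPUS 64

/* Always returns the lowest CPU set in both masks, or -1 if none. */
static int first_and(uint64_t a, uint64_t b)
{
	uint64_t m = a & b;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if ((m >> cpu) & 1)
			return cpu;
	return -1;
}

/*
 * Returns the next CPU set in both masks after *prev, wrapping around,
 * and remembers it in *prev so successive calls rotate through all
 * candidates. Start with *prev = -1. Returns -1 if no CPU matches.
 */
static int any_and_distribute(uint64_t a, uint64_t b, int *prev)
{
	uint64_t m = a & b;

	for (int i = 1; i <= NR_CPUS; i++) {
		int cpu = (*prev + i) % NR_CPUS;

		if ((m >> cpu) & 1) {
			*prev = cpu;
			return cpu;
		}
	}
	return -1;
}
```

With housekeeping CPUs 1 and 2 available, `first_and()` keeps returning CPU 1, whereas repeated calls to `any_and_distribute()` alternate between CPU 1 and CPU 2, which is the spreading effect the commit is after.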
Harshit Agarwal	690e47d140	sched/rt: Fix race in push_rt_task Overview ======== When a CPU chooses to call push_rt_task and picks a task to push to another CPU's runqueue then it will call find_lock_lowest_rq method which would take a double lock on both CPUs' runqueues. If one of the locks isn't readily available, it may lead to dropping the current runqueue lock and reacquiring both the locks at once. During this window it is possible that the task is already migrated and is running on some other CPU. These cases are already handled. However, if the task is migrated and has already been executed and another CPU is now trying to wake it up (ttwu) such that it is queued again on the runqueue (on_rq is 1) and also if the task was run by the same CPU, then the current checks will pass even though the task was migrated out and is no longer in the pushable tasks list. Crashes ======= This bug resulted in quite a few flavors of crashes triggering kernel panics with various crash signatures such as assert failures, page faults, null pointer dereferences, and queue corruption errors all coming from scheduler itself. Some of the crashes: -> kernel BUG at kernel/sched/rt.c:1616! BUG_ON(idx >= MAX_RT_PRIO) Call Trace: ? __die_body+0x1a/0x60 ? die+0x2a/0x50 ? do_trap+0x85/0x100 ? pick_next_task_rt+0x6e/0x1d0 ? do_error_trap+0x64/0xa0 ? pick_next_task_rt+0x6e/0x1d0 ? exc_invalid_op+0x4c/0x60 ? pick_next_task_rt+0x6e/0x1d0 ? asm_exc_invalid_op+0x12/0x20 ? pick_next_task_rt+0x6e/0x1d0 __schedule+0x5cb/0x790 ? update_ts_time_stats+0x55/0x70 schedule_idle+0x1e/0x40 do_idle+0x15e/0x200 cpu_startup_entry+0x19/0x20 start_secondary+0x117/0x160 secondary_startup_64_no_verify+0xb0/0xbb -> BUG: kernel NULL pointer dereference, address: 00000000000000c0 Call Trace: ? __die_body+0x1a/0x60 ? no_context+0x183/0x350 ? __warn+0x8a/0xe0 ? exc_page_fault+0x3d6/0x520 ? asm_exc_page_fault+0x1e/0x30 ? pick_next_task_rt+0xb5/0x1d0 ? pick_next_task_rt+0x8c/0x1d0 __schedule+0x583/0x7e0 ? 
update_ts_time_stats+0x55/0x70 schedule_idle+0x1e/0x40 do_idle+0x15e/0x200 cpu_startup_entry+0x19/0x20 start_secondary+0x117/0x160 secondary_startup_64_no_verify+0xb0/0xbb -> BUG: unable to handle page fault for address: ffff9464daea5900 kernel BUG at kernel/sched/rt.c:1861! BUG_ON(rq->cpu != task_cpu(p)) -> kernel BUG at kernel/sched/rt.c:1055! BUG_ON(!rq->nr_running) Call Trace: ? __die_body+0x1a/0x60 ? die+0x2a/0x50 ? do_trap+0x85/0x100 ? dequeue_top_rt_rq+0xa2/0xb0 ? do_error_trap+0x64/0xa0 ? dequeue_top_rt_rq+0xa2/0xb0 ? exc_invalid_op+0x4c/0x60 ? dequeue_top_rt_rq+0xa2/0xb0 ? asm_exc_invalid_op+0x12/0x20 ? dequeue_top_rt_rq+0xa2/0xb0 dequeue_rt_entity+0x1f/0x70 dequeue_task_rt+0x2d/0x70 __schedule+0x1a8/0x7e0 ? blk_finish_plug+0x25/0x40 schedule+0x3c/0xb0 futex_wait_queue_me+0xb6/0x120 futex_wait+0xd9/0x240 do_futex+0x344/0xa90 ? get_mm_exe_file+0x30/0x60 ? audit_exe_compare+0x58/0x70 ? audit_filter_rules.constprop.26+0x65e/0x1220 __x64_sys_futex+0x148/0x1f0 do_syscall_64+0x30/0x80 entry_SYSCALL_64_after_hwframe+0x62/0xc7 -> BUG: unable to handle page fault for address: ffff8cf3608bc2c0 Call Trace: ? __die_body+0x1a/0x60 ? no_context+0x183/0x350 ? spurious_kernel_fault+0x171/0x1c0 ? exc_page_fault+0x3b6/0x520 ? plist_check_list+0x15/0x40 ? plist_check_list+0x2e/0x40 ? asm_exc_page_fault+0x1e/0x30 ? _cond_resched+0x15/0x30 ? futex_wait_queue_me+0xc8/0x120 ? futex_wait+0xd9/0x240 ? try_to_wake_up+0x1b8/0x490 ? futex_wake+0x78/0x160 ? do_futex+0xcd/0xa90 ? plist_check_list+0x15/0x40 ? plist_check_list+0x2e/0x40 ? plist_del+0x6a/0xd0 ? plist_check_list+0x15/0x40 ? plist_check_list+0x2e/0x40 ? dequeue_pushable_task+0x20/0x70 ? __schedule+0x382/0x7e0 ? asm_sysvec_reschedule_ipi+0xa/0x20 ? schedule+0x3c/0xb0 ? exit_to_user_mode_prepare+0x9e/0x150 ? irqentry_exit_to_user_mode+0x5/0x30 ? asm_sysvec_reschedule_ipi+0x12/0x20 Above are some of the common examples of the crashes that were observed due to this issue. 
Details ======= Let's look at the following scenario to understand this race. 1) CPU A enters push_rt_task a) CPU A has chosen next_task = task p. b) CPU A calls find_lock_lowest_rq(Task p, CPU Z’s rq). c) CPU A identifies CPU X as a destination CPU (X < Z). d) CPU A enters double_lock_balance(CPU Z’s rq, CPU X’s rq). e) Since X is lower than Z, CPU A unlocks CPU Z’s rq. Someone else has locked CPU X’s rq, and thus, CPU A must wait. 2) At CPU Z a) Previous task has completed execution and thus, CPU Z enters schedule, locks its own rq after CPU A releases it. b) CPU Z dequeues previous task and begins executing task p. c) CPU Z unlocks its rq. d) Task p yields the CPU (ex. by doing IO or waiting to acquire a lock) which triggers the schedule function on CPU Z. e) CPU Z enters schedule again, locks its own rq, and dequeues task p. f) As part of dequeue, it sets p.on_rq = 0 and unlocks its rq. 3) At CPU B a) CPU B enters try_to_wake_up with input task p. b) Since CPU Z dequeued task p, p.on_rq = 0, and CPU B updates p.state = WAKING. c) CPU B via select_task_rq determines CPU Y as the target CPU. 4) The race a) CPU A acquires CPU X’s lock and relocks CPU Z. b) CPU A reads task p.cpu = Z and incorrectly concludes task p is still on CPU Z. c) CPU A failed to notice task p had been dequeued from CPU Z while CPU A was waiting for locks in double_lock_balance. If CPU A knew that task p had been dequeued, it would return NULL forcing push_rt_task to give up the task p's migration. d) CPU B updates task p.cpu = Y and calls ttwu_queue. e) CPU B locks Y's rq. CPU B enqueues task p onto Y and sets task p.on_rq = 1. f) CPU B unlocks CPU Y, triggering memory synchronization. g) CPU A reads task p.on_rq = 1, cementing its assumption that task p has not migrated. h) CPU A decides to migrate p to CPU X. This leads to A dequeuing p from Y's queue and various crashes down the line. Solution ======== The solution here is fairly simple. 
After obtaining the lock (at 4a), the check is enhanced to make sure that the task is still at the head of the pushable tasks list. If not, then it is anyway not suitable for being pushed out. Testing ======= The fix is tested on a cluster of 3 nodes, where the panics due to this are hit every couple of days. A fix similar to this was deployed on such cluster and was stable for more than 30 days. Co-developed-by: Jon Kohler <jon@nutanix.com> Signed-off-by: Jon Kohler <jon@nutanix.com> Co-developed-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com> Signed-off-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com> Co-developed-by: Rahul Chunduru <rahul.chunduru@nutanix.com> Signed-off-by: Rahul Chunduru <rahul.chunduru@nutanix.com> Signed-off-by: Harshit Agarwal <harshit@nutanix.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Reviewed-by: Phil Auld <pauld@redhat.com> Tested-by: Will Ton <william.ton@nutanix.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250225180553.167995-1-harshit@nutanix.com	2025-04-08 20:55:55 +02:00
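The race and the fix described in the commit above can be reduced to a small model. Everything here is hypothetical and heavily simplified: `struct rq` keeps only a CPU number and the head of its pushable list, and the stale values CPU A observed in steps 4b and 4g are passed in as plain arguments. The sketch only contrasts a check based on `cpu`/`on_rq` with a check based on the pushable list head.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced task and runqueue. */
struct task {
	int cpu;
	bool on_rq;
};

struct rq {
	int cpu;
	struct task *pushable_head; /* next task eligible for pushing */
};

/*
 * Check in the spirit of the pre-fix logic: trusts the task's cpu and
 * on_rq fields as CPU A happened to read them. In the race, cpu was
 * read before the waker updated it and on_rq after the waker set it,
 * so both look fine even though the task has moved.
 */
static bool looks_unmigrated(const struct rq *rq, int cpu_seen,
			     bool on_rq_seen)
{
	return cpu_seen == rq->cpu && on_rq_seen;
}

/*
 * Check in the spirit of the fix: after retaking the locks, the task
 * must still be at the head of this runqueue's pushable list.
 */
static bool still_pushable(const struct rq *rq, const struct task *p)
{
	return rq->pushable_head == p;
}
```

In the racy interleaving the runqueue's pushable list no longer contains the task, so `still_pushable()` returns false and the push is abandoned, even though the stale field reads would have let it through.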
Michal Koutný	0ab94c3242	sched: Add annotations to RT_GROUP_SCHED fields Update comments to ease RT throttling understanding. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250310170442.504716-10-mkoutny@suse.com	2025-04-08 20:55:55 +02:00
Frederic Weisbecker	4d949edbc4	rcu: Comment on the extraneous delta test on rcu_seq_done_exact()

The numbers used in rcu_seq_done_exact() lack some explanation behind their magic. Especially after the commit:

    `85aad7cc41` ("rcu: Fix get_state_synchronize_rcu_full() GP-start detection")

which reported a subtle issue where a new GP sequence snapshot was taken on the root node state while a grace period had already been started and reflected on the global state sequence but not yet on the root node sequence, making a polling user wait on a wrong already started grace period that would ignore freshly online CPUs.

The fix involved taking the snapshot on the global state sequence and waiting on the root node sequence. And since a grace period is first started on the global state and only afterward reflected on the root node, a snapshot taken on the global state sequence might be two full grace periods ahead of the root node as in the following example:

    rnp->gp_seq = rcu_state.gp_seq = 0

        CPU 0                                           CPU 1
        -----                                           -----
        // rcu_state.gp_seq = 1
        rcu_seq_start(&rcu_state.gp_seq)
                                                        // snap = 8
                                                        snap = rcu_seq_snap(&rcu_state.gp_seq)
                                                        // Two full GP differences
                                                        rcu_seq_done_exact(&rnp->gp_seq, snap)
        // rnp->gp_seq = 1
        WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq);

Add a comment about those expectations and to clarify the magic within the relevant function.

Note that the issue arises mainly with the use of rcu_seq_done_exact() which has a much tighter guardband (of 2 GPs) to ensure the false-negative window of the API during wraparound is limited to just 2 GPs. rcu_seq_done() does not have such strict requirements, however its large false-negative window of ULONG_MAX/2 is not ideal for the polling API. However, this also means care is needed to ensure the guardband is as large as needed to avoid the example scenario described above, which a warning added in an earlier patch does.

[ Comment wordsmithing by Joel ]

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>	2025-04-08 14:55:54 -04:00
Michal Koutný	87f1fb77d8	sched: Add RT_GROUP WARN checks for non-root task_groups With CONFIG_RT_GROUP_SCHED but runtime disabling of RT_GROUPs we expect the existence of the root task_group only and all rt_sched_entity'ies should be queued on root's rt_rq. If we get a non-root RT_GROUP something went wrong. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250310170442.504716-9-mkoutny@suse.com	2025-04-08 20:55:54 +02:00
Joel Fernandes	4aa6e94cf9	rcu: Add warning to ensure rcu_seq_done_exact() is working The previous patch improved the rcu_seq_done_exact() function by adding a meaningful constant for the guardband. Ensure that this is working for the future by a quick check during rcu_gp_init(). Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>	2025-04-08 14:55:54 -04:00
Michal Koutný	d6809c2f60	sched: Do not construct nor expose RT_GROUP_SCHED structures if disabled Thanks to kernel cmdline being available early, before any cgroup hierarchy exists, we can achieve the RT_GROUP_SCHED boottime disabling goal by simply skipping any creation (and destruction) of RT_GROUP data and its exposure via RT attributes. We can do this thanks to previously placed runtime guards that would redirect all operations to root_task_group's data when RT_GROUP_SCHED disabled. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250310170442.504716-8-mkoutny@suse.com	2025-04-08 20:55:54 +02:00
Joel Fernandes	9c94c5ad39	rcu: Replace magic number with meaningful constant in rcu_seq_done_exact()

The rcu_seq_done_exact() function checks if a grace period has completed by comparing sequence numbers. It includes a guard band to handle sequence number wraparound, which was previously expressed using the magic number calculation '3 * RCU_SEQ_STATE_MASK + 1'.

This magic number is not immediately obvious in terms of what it represents. Instead, the reason we need this tiny guardband is because of the lag between the setting of rcu_state.gp_seq_polled and root rnp's gp_seq in rcu_gp_init().

This guardband needs to be at least 2 GPs worth of counts, to avoid recognizing the newly started GP as completed immediately, due to the following sequence which arises due to the delay between update of rcu_state.gp_seq_polled and root rnp's gp_seq:

    rnp->gp_seq = rcu_state.gp_seq = 0

        CPU 0                                           CPU 1
        -----                                           -----
        // rcu_state.gp_seq = 1
        rcu_seq_start(&rcu_state.gp_seq)
                                                        // snap = 8
                                                        snap = rcu_seq_snap(&rcu_state.gp_seq)
                                                        // Two full GP differences
                                                        rcu_seq_done_exact(&rnp->gp_seq, snap)
        // rnp->gp_seq = 1
        WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq);

This can happen because get_state_synchronize_rcu_full() samples rcu_state.gp_seq_polled, whereas poll_state_synchronize_rcu_full() samples the root rnp's gp_seq. The delay between the update of the 2 counters occurs in rcu_gp_init() during which the counters briefly go out of sync.

Make the guardband explicitly 2 GPs. This improves code readability and maintainability by making the intent clearer as well.

Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>	2025-04-08 14:55:54 -04:00
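The arithmetic in the example sequence can be replayed in user space. The following Python sketch models unsigned-long wraparound with a mask; it assumes a 64-bit unsigned long and a state field of two low bits, and it takes "2 GPs worth of counts" to mean `2 * (RCU_SEQ_STATE_MASK + 1)`. The helper `seq_done_exact()` with an explicit `guardband` parameter is a hypothetical stand-in for illustration, not the kernel function.

```python
ULONG_MAX = (1 << 64) - 1            # assumption: 64-bit unsigned long

RCU_SEQ_CTR_SHIFT = 2                # low two bits carry the GP state
RCU_SEQ_STATE_MASK = (1 << RCU_SEQ_CTR_SHIFT) - 1
TWO_GPS = 2 * (RCU_SEQ_STATE_MASK + 1)   # assumption: two GPs = 8 counts


def ulong(x: int) -> int:
    """Wrap a Python int the way unsigned long arithmetic would."""
    return x & ULONG_MAX


def ulong_cmp_ge(a: int, b: int) -> bool:
    return ULONG_MAX // 2 >= ulong(a - b)


def ulong_cmp_lt(a: int, b: int) -> bool:
    return ULONG_MAX // 2 < ulong(a - b)


def rcu_seq_snap(seq: int) -> int:
    """Sequence value at which a full GP after `seq` will have ended."""
    return ulong(seq + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK & ULONG_MAX


def seq_done_exact(cur: int, snap: int, guardband: int) -> bool:
    """Done if cur reached snap, or cur is more than `guardband` behind it
    (which is treated as a wraparound)."""
    return ulong_cmp_ge(cur, snap) or ulong_cmp_lt(cur, ulong(snap - guardband))


# CPU 0 has started a GP on the global counter only; the root node still reads 0.
snap = rcu_seq_snap(1)                           # snapshot of the global counter
stale_root = 0
with_two_gps = seq_done_exact(stale_root, snap, TWO_GPS)
with_too_small = seq_done_exact(stale_root, snap, TWO_GPS - 1)
```

Under these assumptions `snap` is 8, as in the diagram. A guardband of two GPs keeps the poll from reporting completion against the stale root value, while a guardband one count smaller misreads the 8-count gap as a wraparound and reports the grace period as already done.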
Michal Koutný	277e090975	sched: Bypass bandwitdh checks with runtime disabled RT_GROUP_SCHED When RT_GROUPs are compiled but not exposed, their bandwidth cannot be configured (and it is not initialized for non-root task_groups either). Therefore bypass any checks of task vs task_group bandwidth. This will achieve behavior very similar to setups that have !CONFIG_RT_GROUP_SCHED and attach cpu controller to cgroup v2 hierarchy. (On a related note, this may allow having RT tasks with CONFIG_RT_GROUP_SCHED and cgroup v2 hierarchy.) Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250310170442.504716-7-mkoutny@suse.com	2025-04-08 20:55:54 +02:00
Michal Koutný	61d3164fec	sched: Skip non-root task_groups with disabled RT_GROUP_SCHED First, we want to prevent placement of RT tasks on non-root rt_rqs which we achieve in the task migration code that'd fall back to root_task_group's rt_rq. Second, we want to work with only root_task_group's rt_rq when iterating all "real" rt_rqs when RT_GROUP is disabled. To achieve this we keep root_task_group as the first one on the task_groups and break out quickly. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250310170442.504716-6-mkoutny@suse.com	2025-04-08 20:55:53 +02:00
Michal Koutný	e34e0131fe	sched: Add commadline option for RT_GROUP_SCHED toggling Only simple implementation with a static key wrapper, it will be wired in later. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250310170442.504716-5-mkoutny@suse.com	2025-04-08 20:55:53 +02:00
Michal Koutný	a5a25b32c0	sched: Always initialize rt_rq's task_group rt_rq->tg may be NULL which denotes the root task_group. Store the pointer to root_task_group directly so that callers may use rt_rq->tg homogenously. root_task_group exists always with CONFIG_CGROUPS_SCHED, CONFIG_RT_GROUP_SCHED depends on that. This changes root level rt_rq's default limit from infinity to the value of (originally) global RT throttling. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250310170442.504716-4-mkoutny@suse.com	2025-04-08 20:55:53 +02:00
Michal Koutný	e285313f08	sched: Remove unneeed macro wrap rt_entity_is_task has split definitions based on CONFIG_RT_GROUP_SCHED, therefore we can use it always. No functional change intended. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250310170442.504716-3-mkoutny@suse.com	2025-04-08 20:55:53 +02:00
Michal Koutný	433bce5dad	sched: Convert CONFIG_RT_GROUP_SCHED macros to code conditions Convert the blocks guarded by macros to regular code so that the RT group code gets more compile validation. Reasoning is in Documentation/process/coding-style.rst 21) Conditional Compilation. With that, no functional change is expected. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250310170442.504716-2-mkoutny@suse.com	2025-04-08 20:55:52 +02:00
Pierre Gondois	f2d650618b	sched/fair: Allow decaying util_est when util_avg > CPU capa

commit `10a35e6812` ("sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity") prevents util_est from being updated if util_avg is higher than the underlying CPU capacity to avoid overestimating the task when the CPU is capped (due to thermal issue for instance). In this scenario, the task will miss its deadlines and start overlapping its wake-up events for instance. The task will appear as always running when the CPU is just not powerful enough to allow having a good estimation of the task.

commit `b8c9636140` ("sched/fair/util_est: Implement faster ramp-up EWMA on utilization increases") sets ewma to util_avg when ewma > util_avg, allowing ewma to quickly grow instead of slowly converge to the new util_avg value when a task profile changes from small to big.

However, the 2 conditions:
- Check util_avg against max CPU capacity
- Check whether util_est > util_avg
are placed in such an order that it is possible to set util_est to a value higher than the CPU capacity if util_est > util_avg, but util_est is prevented from decaying as long as: CPU capacity < util_avg < util_est.

Just remove the check as either:
1. There is idle time on the CPU. In that case the util_avg value of the task is actually correct. It is possible that the task missed a deadline and appears bigger, but this is also the case when the util_avg of the task is lower than the maximum CPU capacity.
2. There is no idle time. In that case, the util_avg value might as well be an underestimation of the size of the task. It is possible that undesired frequency spikes will appear when the task is later enqueued with an inflated util_est value, but the frequency spike might as well be deserved. The absence of idle time prevents drawing any conclusion.

Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.rog>
Link: https://lore.kernel.org/r/20250325150542.1077344-1-pierre.gondois@arm.com	2025-04-08 20:55:52 +02:00
Steve Wahl	ce29a7da84	sched/topology: Refinement to topology_span_sane speedup Simplify the topology_span_sane code further, removing the need to allocate an array and gotos used to make sure the array gets freed. This version is in a separate commit because it could return a different sanity result than the previous code, but only in odd circumstances that are not expected to actually occur; for example, when a CPU is not listed in its own mask. Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Link: https://lore.kernel.org/r/20250304160844.75373-3-steve.wahl@hpe.com	2025-04-08 20:55:52 +02:00
Steve Wahl	f55dac1daf	sched/topology: improve topology_span_sane speed Use a different approach to topology_span_sane(), that checks for the same constraint of no partial overlaps for any two CPU sets for non-NUMA topology levels, but does so in a way that is O(N) rather than O(N^2). Instead of comparing with all other masks to detect collisions, keep one mask that includes all CPUs seen so far and detect collisions with a single cpumask_intersects test. If the current mask has no collisions with previously seen masks, it should be a new mask, which can be uniquely identified by the lowest bit set in this mask. Keep a pointer to this mask for future reference (in an array indexed by the lowest bit set), and add the CPUs in this mask to the list of those seen. If the current mask does collide with previously seen masks, it should be exactly equal to a mask seen before, looked up in the same array indexed by the lowest bit set in the mask, a single comparison. Move the topology_span_sane() check out of the existing topology level loop, let it use its own loop so that the array allocation can be done only once, shared across levels. On a system with 1920 processors (16 sockets, 60 cores, 2 threads), the average time to take one processor offline is reduced from 2.18 seconds to 1.01 seconds. (Off-lining 959 of 1920 processors took 34m49.765s without this change, 16m10.038s with this change in place.) Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Link: https://lore.kernel.org/r/20250304160844.75373-2-steve.wahl@hpe.com	2025-04-08 20:55:51 +02:00
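The approach described in this message can be sketched outside the kernel. The Python model below represents each CPU mask as a `frozenset` of CPU numbers; `spans_sane()` and its handling of empty masks are assumptions made for illustration, and a dictionary stands in for the array indexed by the lowest set bit.

```python
from typing import Dict, FrozenSet, Iterable, Set


def spans_sane(masks: Iterable[FrozenSet[int]]) -> bool:
    """Return True if any two masks are either identical or disjoint.

    Each mask costs one intersection test against the union of everything
    seen so far, plus at most one equality test, instead of a comparison
    against every other mask.
    """
    seen: Set[int] = set()                      # all CPUs covered so far
    by_lowest: Dict[int, FrozenSet[int]] = {}   # lowest CPU -> its mask

    for mask in masks:
        if not mask:
            continue                            # assumption: ignore empty masks
        lowest = min(mask)
        if seen.isdisjoint(mask):
            # No collision: this is a new mask, identified by its lowest CPU.
            by_lowest[lowest] = mask
            seen |= mask
        elif by_lowest.get(lowest) != mask:
            # Collision with an earlier mask that it does not exactly equal:
            # a partial overlap.
            return False
    return True


# One mask per CPU, as a topology level would report them.
cores_ok = [frozenset({0, 1}), frozenset({0, 1}),
            frozenset({2, 3}), frozenset({2, 3})]
cores_overlap = [frozenset({0, 1}), frozenset({1, 2})]
cores_subset = [frozenset({0, 1, 2}), frozenset({1, 2})]

ok = spans_sane(cores_ok)
overlap = spans_sane(cores_overlap)
subset = spans_sane(cores_subset)
```

Here `ok` is true, while both `overlap` and `subset` are false: a colliding mask passes only if it exactly matches the mask already recorded under its own lowest CPU.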
Peter Zijlstra	8feb053d53	sched: Fix trace_sched_switch(.prev_state) Gabriele noted that in case of signal_pending_state(), the tracepoint sees a stale task-state. Fixes: `fa2c3254d7` ("sched/tracing: Don't re-read p->state when emitting sched_switch event") Reported-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Valentin Schneider <vschneid@redhat.com>	2025-04-08 20:55:51 +02:00
Peter Zijlstra	da916e96e2	perf: Make perf_pmu_unregister() useable Previously it was only safe to call perf_pmu_unregister() if there were no active events of that pmu around -- which was impossible to guarantee since it races all sorts against perf_init_event(). Rework the whole thing by: - keeping track of all events for a given pmu - 'hiding' the pmu from perf_init_event() - waiting for the appropriate (s)rcu grace periods such that all prior references to the PMU will be completed - detaching all still existing events of that pmu (see first point) and moving them to a new REVOKED state. - actually freeing the pmu data. Where notably the new REVOKED state must inhibit all event actions from reaching code that wants to use event->pmu. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lkml.kernel.org/r/20250307193723.525402029@infradead.org	2025-04-08 20:55:48 +02:00
Peter Zijlstra	4da0600eda	perf: Rename perf_event_exit_task(.child) The task passed to perf_event_exit_task() is not a child, it is current. Fix this confusing naming, since much of the rest of the code also relies on it being current. Specifically, both exec() and exit() callers use it with current as the argument. Notably, task_ctx_sched_out() doesn't make much sense outside of current. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lkml.kernel.org/r/20250307193305.486326750@infradead.org	2025-04-08 20:55:48 +02:00


48360 Commits