linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-19 18:41:01 -04:00

Author	SHA1	Message	Date
Feng Tang	a9af76a787	watchdog: add sys_info sysctls to dump sys info on system lockup When soft/hard lockup happens, developers may need different kinds of system information (call-stacks, memory info, locks, etc.) to help debugging. Add 'softlockup_sys_info' and 'hardlockup_sys_info' sysctl knobs to take human readable string like "tasks,mem,timers,locks,ftrace,...", and when system lockup happens, all requested information will be printed out. (refer kernel/sys_info.c for more details). Link: https://lkml.kernel.org/r/20251113111039.22701-4-feng.tang@linux.alibaba.com Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-20 14:03:43 -08:00
Feng Tang	8b2b9b4f6f	hung_task: add hung_task_sys_info sysctl to dump sys info on task-hung When task-hung happens, developers may need different kinds of system information (call-stacks, memory info, locks, etc.) to help debugging. Add 'hung_task_sys_info' sysctl knob to take human readable string like "tasks,mem,timers,locks,ftrace,...", and when task-hung happens, all requested information will be dumped. (refer kernel/sys_info.c for more details). Meanwhile, the newly introduced sys_info() call is used to unify some existing info-dumping knobs. [feng.tang@linux.alibaba.com: maintain consistecy established behavior, per Lance and Petr] Link: https://lkml.kernel.org/r/aRncJo1mA5Zk77Hr@U-2FWC9VHC-2323.local Link: https://lkml.kernel.org/r/20251113111039.22701-3-feng.tang@linux.alibaba.com Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Suggested-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-20 14:03:43 -08:00
Christoph Hellwig	af9b65d686	kernel/hung_task: unexport sysctl_hung_task_timeout_secs This was added by the bcachefs pull requests despite various objections, and with bcachefs removed is now unused. This reverts commit `5c3273ec3c` ("kernel/hung_task.c: export sysctl_hung_task_timeout_secs"). Link: https://lkml.kernel.org/r/20251104121920.2430568-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-20 14:03:41 -08:00
Andy Shevchenko	760fc597c3	panic: sys_info: align constant definition names with parameters Align constant definition names with parameters to make it easier to map. It's also better to maintain and extend the names while keeping their uniqueness. Link: https://lkml.kernel.org/r/20251030132007.3742368-3-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-20 14:03:40 -08:00
Oleg Nesterov	c25d24d0f4	release_task: kill unnecessary rcu_read_lock() around dec_rlimit_ucounts() rcu_read_lock() was added to shut RCU-lockdep up when this code used __task_cred()->rcu_dereference(), but after the commit `21d1c5e386` ("Reimplement RLIMIT_NPROC on top of ucounts") it is no longer needed: task_ucounts()->task_cred_xxx() takes rcu_read_lock() itself. NOTE: task_ucounts() returns the pointer to another rcu-protected data, struct ucounts. So it should either be used when task->real_cred and thus task->real_cred->ucounts is stable (release_task, copy_process, copy_creds), or it should be called under rcu_read_lock(). In both cases it is pointless to take rcu_read_lock() to read the cred->ucounts pointer. Link: https://lkml.kernel.org/r/20251026143140.GA22463@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Alexey Gladkov <legion@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-12 10:00:17 -08:00
Petr Pavlu	37ade54f38	taint/module: remove unnecessary taint_flag.module field The TAINT_RANDSTRUCT and TAINT_FWCTL flags are mistakenly set in the taint_flags table as per-module flags. While this can be trivially corrected, the issue can be avoided altogether by removing the taint_flag.module field. This is possible because, since commit `7fd8329ba5` ("taint/module: Clean up global and module taint flags handling") in 2016, the handling of module taint flags has been fully generic. Specifically, module_flags_taint() can print all flags, and the required output buffer size is properly defined in terms of TAINT_FLAGS_COUNT. The actual per-module flags are always those added to module.taints by calls to add_taint_module(). Link: https://lkml.kernel.org/r/20251022082938.26670-1-petr.pavlu@suse.com Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Acked-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-12 10:00:15 -08:00
Randy Dunlap	ed4bbe7e8f	taint: add reminder about updating docs and scripts Sometimes people update taint-related pieces of the kernel without updating the supporting documentation or scripts. Add a reminder to do this. Link: https://lkml.kernel.org/r/20251015221626.1126156-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: David Gow <davidgow@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-12 10:00:15 -08:00
Sourabh Jain	adc15829fb	crash: let architecture decide crash memory export to iomem_resource With the generic crashkernel reservation, the kernel emits the following warning on powerpc: WARNING: CPU: 0 PID: 1 at arch/powerpc/mm/mem.c:341 add_system_ram_resources+0xfc/0x180 Modules linked in: CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.17.0-auto-12607-g5472d60c129f #1 VOLUNTARY Hardware name: IBM,9080-HEX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1110.01 (NH1110_069) hv:phyp pSeries NIP: c00000000201de3c LR: c00000000201de34 CTR: 0000000000000000 REGS: c000000127cef8a0 TRAP: 0700 Not tainted (6.17.0-auto-12607-g5472d60c129f) MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 84000840 XER: 20040010 CFAR: c00000000017eed0 IRQMASK: 0 GPR00: c00000000201de34 c000000127cefb40 c0000000016a8100 0000000000000001 GPR04: c00000012005aa00 0000000020000000 c000000002b705c8 0000000000000000 GPR08: 000000007fffffff fffffffffffffff0 c000000002db8100 000000011fffffff GPR12: c00000000201dd40 c000000002ff0000 c0000000000112bc 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR20: 0000000000000000 0000000000000000 0000000000000000 c0000000015a3808 GPR24: c00000000200468c c000000001699888 0000000000000106 c0000000020d1950 GPR28: c0000000014683f8 0000000081000200 c0000000015c1868 c000000002b9f710 NIP [c00000000201de3c] add_system_ram_resources+0xfc/0x180 LR [c00000000201de34] add_system_ram_resources+0xf4/0x180 Call Trace: add_system_ram_resources+0xf4/0x180 (unreliable) do_one_initcall+0x60/0x36c do_initcalls+0x120/0x220 kernel_init_freeable+0x23c/0x390 kernel_init+0x34/0x26c ret_from_kernel_user_thread+0x14/0x1c This warning occurs due to a conflict between crashkernel and System RAM iomem resources. The generic crashkernel reservation adds the crashkernel memory range to /proc/iomem during early initialization. Later, all memblock ranges are added to /proc/iomem as System RAM. If the crashkernel region overlaps with any memblock range, it causes a conflict while adding those memblock regions as iomem resources, triggering the above warning. The conflicting memblock regions are then omitted from /proc/iomem. For example, if the following crashkernel region is added to /proc/iomem: 20000000-11fffffff : Crash kernel then the following memblock regions System RAM regions fail to be inserted: 00000000-7fffffff : System RAM 80000000-257fffffff : System RAM Fix this by not adding the crashkernel memory to /proc/iomem on powerpc. Introduce an architecture hook to let each architecture decide whether to export the crashkernel region to /proc/iomem. For more info checkout commit `c40dd2f766` ("powerpc: Add System RAM to /proc/iomem") and commit `bce074bdbc` ("powerpc: insert System RAM resource to prevent crashkernel conflict") Note: Before switching to the generic crashkernel reservation, powerpc never exported the crashkernel region to /proc/iomem. Link: https://lkml.kernel.org/r/20251016142831.144515-1-sourabhjain@linux.ibm.com Fixes: `e3185ee438` ("powerpc/crash: use generic crashkernel reservation"). Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Closes: https://lore.kernel.org/all/90937fe0-2e76-4c82-b27e-7b8a7fe3ac69@linux.ibm.com/ Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: Baoquan he <bhe@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-12 10:00:15 -08:00
Li RongQing	9544f9e694	hung_task: panic when there are more than N hung tasks at the same time The hung_task_panic sysctl is currently a blunt instrument: it's all or nothing. Panicking on a single hung task can be an overreaction to a transient glitch. A more reliable indicator of a systemic problem is when multiple tasks hang simultaneously. Extend hung_task_panic to accept an integer threshold, allowing the kernel to panic only when N hung tasks are detected in a single scan. This provides finer control to distinguish between isolated incidents and system-wide failures. The accepted values are: - 0: Don't panic (unchanged) - 1: Panic on the first hung task (unchanged) - N > 1: Panic after N hung tasks are detected in a single scan The original behavior is preserved for values 0 and 1, maintaining full backward compatibility. [lance.yang@linux.dev: new changelog] Link: https://lkml.kernel.org/r/20251015063615.2632-1-lirongqing@baidu.com Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Tested-by: Lance Yang <lance.yang@linux.dev> Acked-by: Andrew Jeffery <andrew@codeconstruct.com.au> [aspeed_g5_defconfig] Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Hildenbrand <david@redhat.com> Cc: Florian Wesphal <fw@strlen.de> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Joel Stanley <joel@jms.id.au> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Phil Auld <pauld@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Shuah Khan <shuah@kernel.org> Cc: Simon Horman <horms@kernel.org> Cc: Stanislav Fomichev <sdf@fomichev.me> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-12 10:00:14 -08:00
Zhichi Lin	08bd4c46d5	scs: fix a wrong parameter in __scs_magic __scs_magic() needs a 'void ' variable, but a 'struct task_struct ' is given. 'task_scs(tsk)' is the starting address of the task's shadow call stack, and '__scs_magic(task_scs(tsk))' is the end address of the task's shadow call stack. Here should be '__scs_magic(task_scs(tsk))'. The user-visible effect of this bug is that when CONFIG_DEBUG_STACK_USAGE is enabled, the shadow call stack usage checking function (scs_check_usage) would scan an incorrect memory range. This could lead to: 1. Inaccurate stack usage reporting: The function would calculate wrong usage statistics for the shadow call stack, potentially showing incorrect value in kmsg. 2. Potential kernel crash: If the value of __scs_magic(tsk)is greater than that of __scs_magic(task_scs(tsk)), the for loop may access unmapped memory, potentially causing a kernel panic. However, this scenario is unlikely because task_struct is allocated via the slab allocator (which typically returns lower addresses), while the shadow call stack returned by task_scs(tsk) is allocated via vmalloc(which typically returns higher addresses). However, since this is purely a debugging feature (CONFIG_DEBUG_STACK_USAGE), normal production systems should be not unaffected. The bug only impacts developers and testers who are actively debugging stack usage with this configuration enabled. Link: https://lkml.kernel.org/r/20251011082222.12965-1-zhichi.lin@vivo.com Fixes: `5bbaf9d1fc` ("scs: Add support for stack usage debugging") Signed-off-by: Jiyuan Xie <xiejiyuan@vivo.com> Signed-off-by: Zhichi Lin <zhichi.lin@vivo.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Will Deacon <will@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Will Deacon <will@kernel.org> Cc: Yee Lee <yee.lee@mediatek.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-12 10:00:13 -08:00
Justinien Bouron	6a2e57ad22	kexec_core: remove superfluous page offset handling in segment loading During kexec_segment loading, when copying the content of the segment (i.e. kexec_segment::kbuf or kexec_segment::buf) to its associated pages, kimage_load_{cma,normal,crash}_segment handle the case where the physical address of the segment is not page aligned, e.g. in kimage_load_normal_segment: page = kimage_alloc_page(image, GFP_HIGHUSER, maddr); // ... ptr = kmap_local_page(page); // ... ptr += maddr & ~PAGE_MASK; mchunk = min_t(size_t, mbytes, PAGE_SIZE - (maddr & ~PAGE_MASK)); // ^^^^ Non page-aligned segments handled here ^^^ // ... if (image->file_mode) memcpy(ptr, kbuf, uchunk); else result = copy_from_user(ptr, buf, uchunk); (similar logic is present in kimage_load_{cma,crash}_segment). This is actually not needed because, prior to their loading, all kexec_segments first go through a vetting step in `sanity_check_segment_list`, which rejects any segment that is not page-aligned: for (i = 0; i < nr_segments; i++) { unsigned long mstart, mend; mstart = image->segment[i].mem; mend = mstart + image->segment[i].memsz; // ... if ((mstart & ~PAGE_MASK) \|\| (mend & ~PAGE_MASK)) return -EADDRNOTAVAIL; // ... } In case `sanity_check_segment_list` finds a non-page aligned the whole kexec load is aborted and no segment is loaded. This means that `kimage_load_{cma,normal,crash}_segment` never actually have to handle non page-aligned segments and `(maddr & ~PAGE_MASK) == 0` is always true no matter if the segment is coming from a file (i.e. `kexec_file_load` syscall), from a user-space buffer (i.e. `kexec_load` syscall) or created by the kernel through `kexec_add_buffer`. In the latter case, `kexec_add_buffer` actually enforces the page alignment: /* Ensure minimum alignment needed for segments. */ kbuf->memsz = ALIGN(kbuf->memsz, PAGE_SIZE); kbuf->buf_align = max(kbuf->buf_align, PAGE_SIZE); [jbouron@amazon.com: v3] Link: https://lkml.kernel.org/r/20251024155009.39502-1-jbouron@amazon.com Link: https://lkml.kernel.org/r/20250929160220.47616-1-jbouron@amazon.com Signed-off-by: Justinien Bouron <jbouron@amazon.com> Reviewed-by: Gunnar Kudrjavets <gunnarku@amazon.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Alexander Graf <graf@amazon.com> Cc: Marcos Paulo de Souza <mpdesouza@suse.com> Cc: Mario Limonciello <mario.limonciello@amd.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-12 10:00:13 -08:00
Linus Torvalds	b5c0946029	Merge tag 'sched-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a group-throttling bug in the fair scheduler" * tag 'sched-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining	2025-11-08 08:59:05 -08:00
Linus Torvalds	133262cae9	Merge tag 'perf-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf event fix from Ingo Molnar: "Fix a system hang caused by cpu-clock events deadlock" * tag 'perf-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix system hang caused by cpu-clock usage	2025-11-08 08:54:13 -08:00
Linus Torvalds	e6f55fe790	Merge tag 'locking-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Ingo Molnar: "Fix (well, cut in half) a futex performance regression on PowerPC" * tag 'locking-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Optimize per-cpu reference counting	2025-11-08 08:51:22 -08:00
Linus Torvalds	5b95a50001	Merge tag 'trace-v6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Check for reader catching up in ring_buffer_map_get_reader() If the reader catches up to the writer in the memory mapped ring buffer then calling rb_get_reader_page() will return NULL as there's no pages left. But this isn't checked for before calling rb_get_reader_page() and the return of NULL causes a warning. If it is detected that the reader caught up to the writer, then simply exit the routine - Fix memory leak in histogram create_field_var() The couple of the error paths in create_field_var() did not properly clean up what was allocated. Make sure everything is freed properly on error - Fix help message of tools latency_collector The help message incorrectly stated that "-t" was the same as "--threads" whereas "--threads" is actually represented by "-e" * tag 'trace-v6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/tools: Fix incorrcet short option in usage text for --threads tracing: Fix memory leaks in create_field_var() ring-buffer: Do not warn in ring_buffer_map_get_reader() when reader catches up	2025-11-07 08:07:11 -08:00
Zilin Guan	80f0d631dc	tracing: Fix memory leaks in create_field_var() The function create_field_var() allocates memory for 'val' through create_hist_field() inside parse_atom(), and for 'var' through create_var(), which in turn allocates var->type and var->var.name internally. Simply calling kfree() to release these structures will result in memory leaks. Use destroy_hist_field() to properly free 'val', and explicitly release the memory of var->type and var->var.name before freeing 'var' itself. Link: https://patch.msgid.link/20251106120132.3639920-1-zilin@seu.edu.cn Fixes: `02205a6752` ("tracing: Add support for 'field variables'") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-06 19:51:33 -05:00
Steven Rostedt	aa997d2d2a	ring-buffer: Do not warn in ring_buffer_map_get_reader() when reader catches up The function ring_buffer_map_get_reader() is a bit more strict than the other get reader functions, and except for certain situations the rb_get_reader_page() should not return NULL. If it does, it triggers a warning. This warning was triggering but after looking at why, it was because another acceptable situation was happening and it wasn't checked for. If the reader catches up to the writer and there's still data to be read on the reader page, then the rb_get_reader_page() will return NULL as there's no new page to get. In this situation, the reader page should not be updated and no warning should trigger. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Reported-by: syzbot+92a3745cea5ec6360309@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/690babec.050a0220.baf87.0064.GAE@google.com/ Link: https://lore.kernel.org/20251016132848.1b11bb37@gandalf.local.home Fixes: `117c39200d` ("ring-buffer: Introducing ring-buffer mapping functions") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-06 19:38:54 -05:00
Masami Hiramatsu (Google)	c91afa7610	tracing: tprobe-events: Fix to put tracepoint_user when disable the tprobe __unregister_trace_fprobe() checks tf->tuser to put it when removing tprobe. However, disable_trace_fprobe() does not use it and only calls unregister_fprobe(). Thus it forgets to disable tracepoint_user. If the trace_fprobe has tuser, put it for unregistering the tracepoint callbacks when disabling tprobe correctly. Link: https://lore.kernel.org/all/176244794466.155515.3971904050506100243.stgit@devnote2/ Fixes: `2867495dea` ("tracing: tprobe-events: Register tracepoint when enable tprobe event") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Tested-by: Beau Belgrave <beaub@linux.microsoft.com> Reviewed-by: Beau Belgrave <beaub@linux.microsoft.com>	2025-11-07 07:36:20 +09:00
Masami Hiramatsu (Google)	10d9dda426	tracing: tprobe-events: Fix to register tracepoint correctly Since __tracepoint_user_init() calls tracepoint_user_register() without initializing tuser->tpoint with given tracpoint, it does not register tracepoint stub function as callback correctly, and tprobe does not work. Initializing tuser->tpoint correctly before tracepoint_user_register() so that it sets up tracepoint callback. I confirmed below example works fine again. echo "t sched_switch preempt prev_pid=prev->pid next_pid=next->pid" > /sys/kernel/tracing/dynamic_events echo 1 > /sys/kernel/tracing/events/tracepoints/sched_switch/enable cat /sys/kernel/tracing/trace_pipe Link: https://lore.kernel.org/all/176244793514.155515.6466348656998627773.stgit@devnote2/ Fixes: `2867495dea` ("tracing: tprobe-events: Register tracepoint when enable tprobe event") Reported-by: Beau Belgrave <beaub@linux.microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Tested-by: Beau Belgrave <beaub@linux.microsoft.com> Reviewed-by: Beau Belgrave <beaub@linux.microsoft.com>	2025-11-07 07:32:55 +09:00
Peter Zijlstra	4cb5ac2626	futex: Optimize per-cpu reference counting Shrikanth noted that the per-cpu reference counter was still some 10% slower than the old immutable option (which removes the reference counting entirely). Further optimize the per-cpu reference counter by: - switching from RCU to preempt; - using __this_cpu_() since we now have preempt disabled; - switching from smp_load_acquire() to READ_ONCE(). This is all safe because disabling preemption inhibits the RCU grace period exactly like rcu_read_lock(). Having preemption disabled allows using __this_cpu_() provided the only access to the variable is in task context -- which is the case here. Furthermore, since we know changing fph->state to FR_ATOMIC demands a full RCU grace period we can rely on the implied smp_mb() from that to replace the acquire barrier(). This is very similar to the percpu_down_read_internal() fast-path. The reason this is significant for PowerPC is that it uses the generic this_cpu_() implementation which relies on local_irq_disable() (the x86 implementation relies on it being a single memop instruction to be IRQ-safe). Switching to preempt_disable() and __this_cpu() avoids this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE barrier, not having to use explicit barriers safes a bunch. Combined this reduces the performance gap by half, down to some 5%. Fixes: `760e6f7bef` ("futex: Remove support for IMMUTABLE") Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20251106092929.GR4067720@noisy.programming.kicks-ass.net	2025-11-06 12:30:54 +01:00
Aaron Lu	956dfda6a7	sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining When a cfs_rq is to be throttled, its limbo list should be empty and that's why there is a warn in tg_throttle_down() for non empty cfs_rq->throttled_limbo_list. When running a test with the following hierarchy: root / \ A* ... / \| \ ... B / \ C* where both A and C have quota settings, that warn on non empty limbo list is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore the cpu part of the cfs_rq for the sake of simpler representation). Debug showed it happened like this: Task group C is created and quota is set, so in tg_set_cfs_bandwidth(), cfs_rq_c is initialized with runtime_enabled set, runtime_remaining equals to 0 and unthrottled. Before any tasks are enqueued to cfs_rq_c, multiple throttled tasks can migrate to cfs_rq_c (e.g., due to task group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled), these throttled tasks are directly placed into cfs_rq_c's limbo list by enqueue_throttled_task(). Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these tasks. The first enqueue triggers check_enqueue_throttle(), and with zero runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it can't get more runtime and enters tg_throttle_down(), where the warning is hit due to remaining tasks in the limbo list. I think it's a chaos to trigger throttle on unthrottle path, the status of a being unthrottled cfs_rq can be in a mixed state in the end, so fix this by granting 1ns to cfs_rq in tg_set_cfs_bandwidth(). This ensures cfs_rq_c has a positive runtime_remaining when initialized as unthrottled and cannot enter tg_unthrottle_up() with zero runtime_remaining. Also, update outdated comments in tg_throttle_down() since unthrottle_cfs_rq() is no longer called with zero runtime_remaining. While at it, remove a redundant assignment to se in tg_throttle_down(). Fixes: `e1fad12dcb` ("sched/fair: Switch to task based throttle model") Reviewed-By: Benjamin Segall <bsegall@google.com> Suggested-by: Benjamin Segall <bsegall@google.com> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Hao Jia <jiahao1@lixiang.com> Link: https://patch.msgid.link/20251030032755.560-1-ziqianlu@bytedance.com	2025-11-06 12:30:52 +01:00
Dapeng Mi	eb3182ef04	perf/core: Fix system hang caused by cpu-clock usage cpu-clock usage by the async-profiler tool can trigger a system hang, which got bisected back to the following commit by Octavia Togami: `18dbcbfabf` ("perf: Fix the POLL_HUP delivery breakage") causes this issue The root cause of the hang is that cpu-clock is a special type of SW event which relies on hrtimers. The __perf_event_overflow() callback is invoked from the hrtimer handler for cpu-clock events, and __perf_event_overflow() tries to call cpu_clock_event_stop() to stop the event, which calls htimer_cancel() to cancel the hrtimer. But that's a recursion into the hrtimer code from a hrtimer handler, which (unsurprisingly) deadlocks. To fix this bug, use hrtimer_try_to_cancel() instead, and set the PERF_HES_STOPPED flag, which causes perf_swevent_hrtimer() to stop the event once it sees the PERF_HES_STOPPED flag. [ mingo: Fixed the comments and improved the changelog. ] Closes: https://lore.kernel.org/all/CAHPNGSQpXEopYreir+uDDEbtXTBvBvi8c6fYXJvceqtgTPao3Q@mail.gmail.com/ Fixes: `18dbcbfabf` ("perf: Fix the POLL_HUP delivery breakage") Reported-by: Octavia Togami <octavia.togami@gmail.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Octavia Togami <octavia.togami@gmail.com> Cc: stable@vger.kernel.org Link: https://github.com/lucko/spark/issues/530 Link: https://patch.msgid.link/20251015051828.12809-1-dapeng1.mi@linux.intel.com	2025-11-03 11:04:19 +01:00
Linus Torvalds	ba36dd5ee6	Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Mark migrate_disable/enable() as always_inline to avoid issues with partial inlining (Yonghong Song) - Fix powerpc stack register definition in libbpf bpf_tracing.h (Andrii Nakryiko) - Reject negative head_room in __bpf_skb_change_head (Daniel Borkmann) - Conditionally include dynptr copy kfuncs (Malin Jonsson) - Sync pending IRQ work before freeing BPF ring buffer (Noorain Eqbal) - Do not audit capability check in x86 do_jit() (Ondrej Mosnacek) - Fix arm64 JIT of BPF_ST insn when it writes into arena memory (Puranjay Mohan) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf/arm64: Fix BPF_ST into arena memory bpf: Make migrate_disable always inline to avoid partial inlining bpf: Reject negative head_room in __bpf_skb_change_head bpf: Conditionally include dynptr copy kfuncs libbpf: Fix powerpc's stack register definition in bpf_tracing.h bpf: Do not audit capability check in do_jit() bpf: Sync pending IRQ work before freeing ring buffer	2025-10-31 18:22:26 -07:00
Linus Torvalds	a5dbbb39e1	Merge tag 'pm-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix three regressions, two recent ones and one introduced during the 6.17 development cycle: - Add an exit latency check to the menu cpuidle governor in the case when it considers using a real idle state instead of a polling one to address a performance regression (Rafael Wysocki) - Revert an attempted cleanup of a system suspend code path that introduced a regression elsewhere (Samuel Wu) - Allow pm_restrict_gfp_mask() to be called multiple times in a row and adjust pm_restore_gfp_mask() accordingly to avoid having to play nasty games with these calls during hibernation (Rafael Wysocki)" * tag 'pm-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: sleep: Allow pm_restrict_gfp_mask() stacking cpuidle: governors: menu: Select polling state in some more cases Revert "PM: sleep: Make pm_wakeup_clear() call more clear"	2025-10-30 19:02:16 -07:00
Rafael J. Wysocki	590c5cd106	Merge branches 'pm-cpuidle' and 'pm-sleep' Merge a cpuidle fix and two fixes related to system sleep for 6.18-rc4: - Add an exit latency check to the menu cpuidle governor in the case when it considers using a real idle state instead of a polling one to address a performance regression (Rafael Wysocki) - Revert an attempted cleanup of a system suspend code path that introduced a regression elsewhere (Samuel Wu) - Allow pm_restrict_gfp_mask() to be called multiple times in a row and adjust pm_restore_gfp_mask() accordingly to avoid having to play nasty games with these calls during hibernation (Rafael Wysocki) * pm-cpuidle: cpuidle: governors: menu: Select polling state in some more cases * pm-sleep: PM: sleep: Allow pm_restrict_gfp_mask() stacking Revert "PM: sleep: Make pm_wakeup_clear() call more clear"	2025-10-30 20:25:18 +01:00
Rafael J. Wysocki	35e4a69b20	PM: sleep: Allow pm_restrict_gfp_mask() stacking Allow pm_restrict_gfp_mask() to be called many times in a row to avoid issues with calling dpm_suspend_start() when the GFP mask has been already restricted. Only the first invocation of pm_restrict_gfp_mask() will actually restrict the GFP mask and the subsequent calls will warn if there is a mismatch between the expected allowed GFP mask and the actual one. Moreover, if pm_restrict_gfp_mask() is called many times in a row, pm_restore_gfp_mask() needs to be called matching number of times in a row to actually restore the GFP mask. Calling it when the GFP mask has not been restricted will cause it to warn. This is necessary for the GFP mask restriction starting in hibernation_snapshot() to continue throughout the entire hibernation flow until it completes or it is aborted (either by a wakeup event or by an error). Fixes: `449c9c0253` ("PM: hibernate: Restrict GFP mask in hibernation_snapshot()") Fixes: `469d80a371` ("PM: hibernate: Fix hybrid-sleep") Reported-by: Askar Safin <safinaskar@gmail.com> Closes: https://lore.kernel.org/linux-pm/20251025050812.421905-1-safinaskar@gmail.com/ Link: https://lore.kernel.org/linux-pm/20251028111730.2261404-1-safinaskar@gmail.com/ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Tested-by: Mario Limonciello (AMD) <superm1@kernel.org> Cc: 6.16+ <stable@vger.kernel.org> # 6.16+ Link: https://patch.msgid.link/5935682.DvuYhMxLoT@rafael.j.wysocki	2025-10-29 18:55:32 +01:00
Linus Torvalds	fd57572253	Merge tag 'sched_ext-for-6.18-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix scx_kick_pseqs corruption when multiple schedulers are loaded concurrently - Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc() to handle systems with large CPU counts - Defer queue_balance_callback() until after ops.dispatch to fix callback ordering issues - Sync error_irq_work before freeing scx_sched to prevent use-after-free - Mark scx_bpf_dsq_move_set_[slice\|vtime]() with KF_RCU for proper RCU protection - Fix flag check for deferred callbacks * tag 'sched_ext-for-6.18-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: fix flag check for deferred callbacks sched_ext: Fix scx_kick_pseqs corruption on concurrent scheduler loads sched_ext: Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc() sched_ext: defer queue_balance_callback() until after ops.dispatch sched_ext: Sync error_irq_work before freeing scx_sched sched_ext: Mark scx_bpf_dsq_move_set_[slice\|vtime]() with KF_RCU	2025-10-27 10:52:18 -07:00
Linus Torvalds	5fee0dafba	Merge tag 'irq_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Borislav Petkov: - Restore the original buslock locking in a couple of places in the irq core subsystem after a rework * tag 'irq_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/manage: Add buslock back in to enable_irq() genirq/manage: Add buslock back in to __disable_irq_nosync() genirq/chip: Add buslock back in to irq_set_handler()	2025-10-26 09:54:36 -07:00
Linus Torvalds	1bc9743b64	Merge tag 'sched_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: - Make sure a CFS runqueue on a throttled hierarchy has its PELT clock throttled otherwise task movement and manipulation would lead to dangling cfs_rq references and an eventual crash * tag 'sched_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled	2025-10-26 09:42:19 -07:00
Linus Torvalds	7ea5092f52	Merge tag 'timers_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Borislav Petkov: - Do not create more than eight (max supported) AUX clocks sysfs hierarchies * tag 'timers_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping: Fix aux clocks sysfs initialization loop bound	2025-10-26 09:40:16 -07:00
Andy Shevchenko	53abe3e1c1	sched: Remove never used code in mm_cid_get() Clang is not happy with set but unused variable (this is visible with `make W=1` build: kernel/sched/sched.h:3744:18: error: variable 'cpumask' set but not used [-Werror,-Wunused-but-set-variable] It seems like the variable was never used along with the assignment that does not have side effects as far as I can see. Remove those altogether. Fixes: `223baf9d17` ("sched: Fix performance regression introduced by mm_cid") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Tested-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Breno Leitao <leitao@debian.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2025-10-24 16:55:46 -07:00
Malin Jonsson	8ce93aabbf	bpf: Conditionally include dynptr copy kfuncs Since commit `a498ee7576` ("bpf: Implement dynptr copy kfuncs"), if CONFIG_BPF_EVENTS is not enabled, but BPF_SYSCALL and DEBUG_INFO_BTF are, the build will break like so: BTFIDS vmlinux.unstripped WARN: resolve_btfids: unresolved symbol bpf_probe_read_user_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_probe_read_user_dynptr WARN: resolve_btfids: unresolved symbol bpf_probe_read_kernel_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_probe_read_kernel_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_task_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_task_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_dynptr make[2]: * [scripts/Makefile.vmlinux:72: vmlinux.unstripped] Error 255 make[2]: * Deleting file 'vmlinux.unstripped' make[1]: * [/repo/malin/upstream/linux/Makefile:1242: vmlinux] Error 2 make: * [Makefile:248: __sub-make] Error 2 Guard these symbols with #ifdef CONFIG_BPF_EVENTS to resolve the problem. Fixes: `a498ee7576` ("bpf: Implement dynptr copy kfuncs") Reported-by: Yong Gu <yong.g.gu@ericsson.com> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Malin Jonsson <malin.jonsson@est.tech> Link: https://lore.kernel.org/r/20251024151436.139131-1-malin.jonsson@est.tech Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-24 09:44:47 -07:00
Charles Keepax	ef3330b99c	genirq/manage: Add buslock back in to enable_irq() The locking was changed from a buslock to a plain lock, but the patch description states there was no functional change. Assuming this was accidental so reverting to using the buslock. Fixes: `bddd10c554` ("genirq/manage: Rework enable_irq()") Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251023154901.1333755-4-ckeepax@opensource.cirrus.com	2025-10-24 11:38:39 +02:00
Charles Keepax	56363e25f7	genirq/manage: Add buslock back in to __disable_irq_nosync() The locking was changed from a buslock to a plain lock, but the patch description states there was no functional change. Assuming this was accidental so reverting to using the buslock. Fixes: `1b74444467` ("genirq/manage: Rework __disable_irq_nosync()") Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251023154901.1333755-3-ckeepax@opensource.cirrus.com	2025-10-24 11:38:39 +02:00
Charles Keepax	5d7e45dd67	genirq/chip: Add buslock back in to irq_set_handler() The locking was changed from a buslock to a plain lock, but the patch description states there was no functional change. Assuming this was accidental so reverting to using the buslock. Fixes: `5cd05f3e23` ("genirq/chip: Rework irq_set_handler() variants") Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251023154901.1333755-2-ckeepax@opensource.cirrus.com	2025-10-24 11:38:39 +02:00
Linus Torvalds	5121062e83	Merge tag 'trace-rv-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: "A couple of fixes for Runtime Verification: - A bug caused a kernel panic when reading enabled_monitors was reported. Change callback functions to always use list_head iterators and by doing so, fix the wrong pointer that was leading to the panic. - The rtapp/pagefault monitor relies on the MMU to be present (pagefaults exist) but that was not enforced via kconfig, leading to potential build errors on systems without an MMU. Add that kconfig dependency" * tag 'trace-rv-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rv: Make rtapp/pagefault monitor depends on CONFIG_MMU rv: Fully convert enabled_monitors to use list_head as iterator	2025-10-23 16:50:25 -07:00
Samuel Wu	79816d4b9e	Revert "PM: sleep: Make pm_wakeup_clear() call more clear" This reverts commit `56a232d93c`. The above commit changed the position of pm_wakeup_clear() for the suspend call path, but other call paths with references to freeze_processes() were not updated. This means that other call paths, such as hibernate(), will not have pm_wakeup_clear() called. Suggested-by: Saravana Kannan <saravanak@google.com> Signed-off-by: Samuel Wu <wusamuel@google.com> [ rjw: Changelog edits ] Link: https://patch.msgid.link/20251022222830.634086-1-wusamuel@google.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-10-23 12:48:04 +02:00
Linus Torvalds	0f3ad9c610	Merge tag 'mm-hotfixes-stable-2025-10-22-12-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "17 hotfixes. 12 are cc:stable and 14 are for MM. There's a two-patch DAMON series from SeongJae Park which addresses a missed check and possible memory leak. Apart from that it's all singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2025-10-22-12-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: csky: abiv2: adapt to new folio flags field mm/damon/core: use damos_commit_quota_goal() for new goal commit mm/damon/core: fix potential memory leak by cleaning ops_filter in damon_destroy_scheme hugetlbfs: move lock assertions after early returns in huge_pmd_unshare() vmw_balloon: indicate success when effectively deflating during migration mm/damon/core: fix list_add_tail() call on damon_call() mm/mremap: correctly account old mapping after MREMAP_DONTUNMAP remap mm: prevent poison consumption when splitting THP ocfs2: clear extent cache after moving/defragmenting extents mm: don't spin in add_stack_record when gfp flags don't allow dma-debug: don't report false positives with DMA_BOUNCE_UNALIGNED_KMALLOC mm/damon/sysfs: dealloc commit test ctx always mm/damon/sysfs: catch commit test ctx alloc failure hung_task: fix warnings caused by unaligned lock pointers	2025-10-22 14:57:35 -10:00
K Prateek Nayak	0e4a169d1a	sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled Matteo reported hitting the assert_list_leaf_cfs_rq() warning from enqueue_task_fair() post commit `fe8d238e64` ("sched/fair: Propagate load for throttled cfs_rq") which transitioned to using cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in propagate_entity_cfs_rq(). The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all its descendants should have their PELT frozen too or weird things can happen as a result of children accumulating PELT signals when the parents have their PELT clock stopped. Another side effect of this is the loss of integrity of the leaf cfs_rq list. As debugged by Aaron, consider the following hierarchy: root(#) / \ A(#) B() \| C <--- new cgroup \| D <--- new cgroup # - Already on leaf cfs_rq list - Throttled with PELT frozen The newly created cgroups don't have their "pelt_clock_throttled" signal synced with cgroup B. Next, the following series of events occur: 1. online_fair_sched_group() for cgroup D will call propagate_entity_cfs_rq(). (Same can happen if a throttled task is moved to cgroup C and enqueue_task_fair() returns early.) propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to "rq->tmp_alone_branch" since its PELT clock is not marked throttled and cfs_rq of cgroup B is not on the list. cfs_rq of cgroup B is skipped since its PELT is throttled. root cfs_rq already exists on cfs_rq leading to list_add_leaf_cfs_rq() returning early. The cfs_rq of cgroup C is left dangling on the "rq->tmp_alone_branch". 2. A new task wakes up on cgroup A. Since the whole hierarchy is already on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early without any modifications to "rq->tmp_alone_branch". The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch". !!! Splat !!! Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not enough since the new cfs_rq is not yet enqueued on the hierarchy. A dequeue on other subtree on the throttled hierarchy can freeze the PELT clock for the parent hierarchy without setting the indicators for this newly added cfs_rq which was never enqueued. Since there are no tasks on the new hierarchy, start a cfs_rq on a throttled hierarchy with its PELT clock throttled. The first enqueue, or the distribution (whichever happens first) will unfreeze the PELT clock and queue the cfs_rq on the leaf cfs_rq list. While at it, add an assert_list_leaf_cfs_rq() in propagate_entity_cfs_rq() to catch such cases in the future. Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/ Fixes: `e1fad12dcb` ("sched/fair: Switch to task based throttle model") Reported-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Suggested-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Aaron Lu <ziqianlu@bytedance.com> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Link: https://patch.msgid.link/20251021053522.37583-1-kprateek.nayak@amd.com	2025-10-22 15:21:52 +02:00
Noorain Eqbal	4e90776383	bpf: Sync pending IRQ work before freeing ring buffer Fix a race where irq_work can be queued in bpf_ringbuf_commit() but the ring buffer is freed before the work executes. In the syzbot reproducer, a BPF program attached to sched_switch triggers bpf_ringbuf_commit(), queuing an irq_work. If the ring buffer is freed before this work executes, the irq_work thread may accesses freed memory. Calling `irq_work_sync(&rb->work)` ensures that all pending irq_work complete before freeing the buffer. Fixes: `457f44363a` ("bpf: Implement BPF ring buffer and verifier support for it") Reported-by: syzbot+2617fc732430968b45d2@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2617fc732430968b45d2 Tested-by: syzbot+2617fc732430968b45d2@syzkaller.appspotmail.com Signed-off-by: Noorain Eqbal <nooraineqbal@gmail.com> Link: https://lore.kernel.org/r/20251020180301.103366-1-nooraineqbal@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-21 09:57:48 -07:00
Linus Torvalds	6548d364a3	Merge tag 'cgroup-for-6.18-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix seqcount lockdep assertion failure in cgroup freezer on PREEMPT_RT. Plain seqcount_t expects preemption disabled, but PREEMPT_RT spinlocks don't disable preemption. Switch to seqcount_spinlock_t to properly associate css_set_lock with the freeze timing seqcount. - Misc changes including kernel-doc warning fix for misc_res_type enum and improved selftest diagnostics. * tag 'cgroup-for-6.18-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/misc: fix misc_res_type kernel-doc warning selftests: cgroup: Use values_close_report in test_cpu selftests: cgroup: add values_close_report helper cgroup: Fix seqcount lockdep assertion in cgroup freezer	2025-10-20 09:41:27 -10:00
Haofeng Li	39a9ed0fb6	timekeeping: Fix aux clocks sysfs initialization loop bound The loop in tk_aux_sysfs_init() uses `i <= MAX_AUX_CLOCKS` as the termination condition, which results in 9 iterations (i=0 to 8) when MAX_AUX_CLOCKS is defined as 8. However, the kernel is designed to support only up to 8 auxiliary clocks. This off-by-one error causes the creation of a 9th sysfs entry that exceeds the intended auxiliary clock range. Fix the loop bound to use `i < MAX_AUX_CLOCKS` to ensure exactly 8 auxiliary clock entries are created, matching the design specification. Fixes: `7b95663a3d` ("timekeeping: Provide interface to control auxiliary clocks") Signed-off-by: Haofeng Li <lihaofeng@kylinos.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/tencent_2376993D9FC06A3616A4F981B3DE1C599607@qq.com	2025-10-20 19:56:12 +02:00
Nam Cao	3d62f95bd8	rv: Make rtapp/pagefault monitor depends on CONFIG_MMU There is no page fault without MMU. Compiling the rtapp/pagefault monitor without CONFIG_MMU fails as page fault tracepoints' definitions are not available. Make rtapp/pagefault monitor depends on CONFIG_MMU. Fixes: `9162620eb6` ("rv: Add rtapp_pagefault monitor") Signed-off-by: Nam Cao <namcao@linutronix.de> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202509260455.6Z9Vkty4-lkp@intel.com/ Cc: stable@vger.kernel.org Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/r/20251002082317.973839-1-namcao@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2025-10-20 12:47:40 +02:00
Nam Cao	103541e6a5	rv: Fully convert enabled_monitors to use list_head as iterator The callbacks in enabled_monitors_seq_ops are inconsistent. Some treat the iterator as struct rv_monitor , while others treat the iterator as struct list_head . This causes a wrong type cast and crashes the system as reported by Nathan. Convert everything to use struct list_head * as iterator. This also makes enabled_monitors consistent with available_monitors. Fixes: `de090d1cca` ("rv: Fix wrong type cast in enabled_monitors_next()") Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/linux-trace-kernel/20250923002004.GA2836051@ax162/ Signed-off-by: Nam Cao <namcao@linutronix.de> Cc: stable@vger.kernel.org Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/r/20251002082235.973099-1-namcao@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2025-10-20 12:47:40 +02:00
Linus Torvalds	d9043c79ba	Merge tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Make sure the check for lost pelt idle time is done unconditionally to have correct lost idle time accounting - Stop the deadline server task before a CPU goes offline * tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix pelt lost idle time detection sched/deadline: Stop dl_server before CPU goes offline	2025-10-19 04:59:43 -10:00
Linus Torvalds	343b4b44a1	Merge tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - Make sure perf reporting works correctly in setups using overlayfs or FUSE - Move the uprobe optimization to a better location logically * tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix MMAP2 event device with backing files perf/core: Fix MMAP event path names with backing files perf/core: Fix address filter match with backing files uprobe: Move arch_uprobe_optimize right after handlers execution	2025-10-19 04:54:08 -10:00
Emil Tsalapatis	a3c4a0a42e	sched_ext: fix flag check for deferred callbacks When scheduling the deferred balance callbacks, check SCX_RQ_BAL_CB_PENDING instead of SCX_RQ_BAL_PENDING. This way schedule_deferred() properly tests whether there is already a pending request for queue_balance_callback() to be invoked at the end of .balance(). Fixes: `a8ad873113` ("sched_ext: defer queue_balance_callback() until after ops.dispatch") Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-16 08:34:00 -10:00
Shardul Bankar	f6fddc6df3	bpf: Fix memory leak in __lookup_instance error path When __lookup_instance() allocates a func_instance structure but fails to allocate the must_write_set array, it returns an error without freeing the previously allocated func_instance. This causes a memory leak of 192 bytes (sizeof(struct func_instance)) each time this error path is triggered. Fix by freeing 'result' on must_write_set allocation failure. Fixes: `b3698c356a` ("bpf: callchain sensitive stack liveness tracking using CFG") Reported-by: BPF Runtime Fuzzer (BRF) Signed-off-by: Shardul Bankar <shardulsb08@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://patch.msgid.link/20251016063330.4107547-1-shardulsb08@gmail.com	2025-10-16 10:45:17 -07:00
Marek Szyprowski	03521c892b	dma-debug: don't report false positives with DMA_BOUNCE_UNALIGNED_KMALLOC Commit `370645f41e` ("dma-mapping: force bouncing if the kmalloc() size is not cache-line-aligned") introduced DMA_BOUNCE_UNALIGNED_KMALLOC feature and permitted architecture specific code configure kmalloc slabs with sizes smaller than the value of dma_get_cache_alignment(). When that feature is enabled, the physical address of some small kmalloc()-ed buffers might be not aligned to the CPU cachelines, thus not really suitable for typical DMA. To properly handle that case a SWIOTLB buffer bouncing is used, so no CPU cache corruption occurs. When that happens, there is no point reporting a false-positive DMA-API warning that the buffer is not properly aligned, as this is not a client driver fault. [m.szyprowski@samsung.com: replace is_swiotlb_allocated() with is_swiotlb_active(), per Catalin] Link: https://lkml.kernel.org/r/20251010173009.3916215-1-m.szyprowski@samsung.com Link: https://lkml.kernel.org/r/20251009141508.2342138-1-m.szyprowski@samsung.com Fixes: `370645f41e` ("dma-mapping: force bouncing if the kmalloc() size is not cache-line-aligned") Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: Robin Murohy <robin.murphy@arm.com> Cc: "Isaac J. Manjarres" <isaacmanjarres@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-10-15 13:24:33 -07:00
Alexei Starovoitov	5fb750e8a9	bpf: Replace bpf_map_kmalloc_node() with kmalloc_nolock() to allocate bpf_async_cb structures. The following kmemleak splat: [ 8.105530] kmemleak: Trying to color unknown object at 0xff11000100e918c0 as Black [ 8.106521] Call Trace: [ 8.106521] <TASK> [ 8.106521] dump_stack_lvl+0x4b/0x70 [ 8.106521] kvfree_call_rcu+0xcb/0x3b0 [ 8.106521] ? hrtimer_cancel+0x21/0x40 [ 8.106521] bpf_obj_free_fields+0x193/0x200 [ 8.106521] htab_map_update_elem+0x29c/0x410 [ 8.106521] bpf_prog_cfc8cd0f42c04044_overwrite_cb+0x47/0x4b [ 8.106521] bpf_prog_8c30cd7c4db2e963_overwrite_timer+0x65/0x86 [ 8.106521] bpf_prog_test_run_syscall+0xe1/0x2a0 happens due to the combination of features and fixes, but mainly due to commit `6d78b4473c` ("bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()") It's using __GFP_HIGH, which instructs slub/kmemleak internals to skip kmemleak_alloc_recursive() on allocation, so subsequent kfree_rcu()-> kvfree_call_rcu()->kmemleak_ignore() complains with the above splat. To fix this imbalance, replace bpf_map_kmalloc_node() with kmalloc_nolock() and kfree_rcu() with call_rcu() + kfree_nolock() to make sure that the objects allocated with kmalloc_nolock() are freed with kfree_nolock() rather than the implicit kfree() that kfree_rcu() uses internally. Note, the kmalloc_nolock() happens under bpf_spin_lock_irqsave(), so it will always fail in PREEMPT_RT. This is not an issue at the moment, since bpf_timers are disabled in PREEMPT_RT. In the future bpf_spin_lock will be replaced with state machine similar to bpf_task_work. Fixes: `6d78b4473c` ("bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: linux-mm@kvack.org Link: https://lore.kernel.org/bpf/20251015000700.28988-1-alexei.starovoitov@gmail.com	2025-10-15 12:22:22 +02:00

1 2 3 4 5 ...

49594 Commits