linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-16 13:41:48 -04:00

Author	SHA1	Message	Date
Linus Walleij	929cc1a53a	fork: zero vmap stack using clear_pages() instead of memset() After the introduction of clear_pages() we exploit the fact that the process vm_area is allocated in contiguous pages to just clear them all in one swift operation. Link: https://lkml.kernel.org/r/20260224-mm-fork-clear-pages-v1-1-184c65a72d49@kernel.org Signed-off-by: Linus Walleij <linusw@kernel.org> Suggested-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/linux-mm/dpnwsp7dl4535rd7qmszanw6u5an2p74uxfex4dh53frpb7pu3@2bnjjavjrepe/ Suggested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Link: https://lore.kernel.org/20240311164638.2015063-7-pasha.tatashin@soleen.com Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Ankur Arora <ankur.a.arora@oracle.com> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:49 -07:00
Oleg Nesterov	0f8e38eeb9	do_notify_parent: sanitize the valid_signal() checks Now that kernel_clone() checks valid_signal(args->exit_signal), the "sig" argument of do_notify_parent() must always be valid or we have a bug. However, do_notify_parent() only checks that sig != -1 at the start, then it does another valid_signal() check before __send_signal_locked(). This is confusing. Change do_notify_parent() to WARN and return early if valid_signal(sig) is false. Link: https://lkml.kernel.org/r/abld-ilvMEZ7VgMw@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Deepanshu Kartikey <Kartikey406@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:48 -07:00
Mayank Rungta	077ba03600	watchdog/hardlockup: improve buddy system detection timeliness Currently, the buddy system only performs checks every 3rd sample. With a 4-second interval. If a check window is missed, the next check occurs 12 seconds later, potentially delaying hard lockup detection for up to 24 seconds. Modify the buddy system to perform checks at every interval (4s). Introduce a missed-interrupt threshold to maintain the existing grace period while reducing the detection window to 8-12 seconds. Best and worst case detection scenarios: Before (12s check window): - Best case: Lockup occurs after first check but just before heartbeat interval. Detected in ~8s (8s till next check). - Worst case: Lockup occurs just after a check. Detected in ~24s (missed check + 12s till next check + 12s logic). After (4s check window with threshold of 3): - Best case: Lockup occurs just before a check. Detected in ~8s (0s till 1st check + 4s till 2nd + 4s till 3rd). - Worst case: Lockup occurs just after a check. Detected in ~12s (4s till 1st check + 4s till 2nd + 4s till 3rd). Link: https://lkml.kernel.org/r/20260312-hardlockup-watchdog-fixes-v2-4-45bd8a0cc7ed@google.com Signed-off-by: Mayank Rungta <mrungta@google.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Ian Rogers <irogers@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Li Huafei <lihuafei1@huawei.com> Cc: Max Kellermann <max.kellermann@ionos.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Stephane Erainan <eranian@google.com> Cc: Wang Jinchao <wangjinchao600@gmail.com> Cc: Yunhui Cui <cuiyunhui@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:47 -07:00
Mayank Rungta	746bb7fa91	watchdog: update saved interrupts during check Currently, arch_touch_nmi_watchdog() causes an early return that skips updating hrtimer_interrupts_saved. This leads to stale comparisons and delayed lockup detection. I found this issue because in our system the serial console is fairly chatty. For example, the 8250 console driver frequently calls touch_nmi_watchdog() via console_write(). If a CPU locks up after a timer interrupt but before next watchdog check, we see the following sequence: * watchdog_hardlockup_check() saves counter (e.g., 1000) * Timer runs and updates the counter (1001) * touch_nmi_watchdog() is called * CPU locks up * 10s pass: check() notices touch, returns early, skips update * 10s pass: check() saves counter (1001) * 10s pass: check() finally detects lockup This delays detection to 30 seconds. With this fix, we detect the lockup in 20 seconds. Link: https://lkml.kernel.org/r/20260312-hardlockup-watchdog-fixes-v2-2-45bd8a0cc7ed@google.com Signed-off-by: Mayank Rungta <mrungta@google.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Ian Rogers <irogers@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Li Huafei <lihuafei1@huawei.com> Cc: Max Kellermann <max.kellermann@ionos.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Stephane Erainan <eranian@google.com> Cc: Wang Jinchao <wangjinchao600@gmail.com> Cc: Yunhui Cui <cuiyunhui@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:46 -07:00
Mayank Rungta	3e811cae32	watchdog: return early in watchdog_hardlockup_check() Patch series "watchdog/hardlockup: Improvements to hardlockup", v2. This series addresses limitations in the hardlockup detector implementations and updates the documentation to reflect actual behavior and recent changes. The changes are structured as follows: Refactoring (Patch 1) ===================== Patch 1 refactors watchdog_hardlockup_check() to return early if no lockup is detected. This reduces the indentation level of the main logic block, serving as a clean base for the subsequent changes. Hardlockup Detection Improvements (Patches 2 & 4) ================================================= The hardlockup detector logic relies on updating saved interrupt counts to determine if the CPU is making progress. Patch 1 ensures that the saved interrupt count is updated unconditionally before checking the "touched" flag. This prevents stale comparisons which can delay detection. This is a logic fix that ensures the detector remains accurate even when the watchdog is frequently touched. Patch 3 improves the Buddy detector's timeliness. The current checking interval (every 3rd sample) causes high variability in detection time (up to 24s). This patch changes the Buddy detector to check at every hrtimer interval (4s) with a missed-interrupt threshold of 3, narrowing the detection window to a consistent 8-12 second range. Documentation Updates (Patches 3 & 5) ===================================== The current documentation does not fully capture the variable nature of detection latency or the details of the Buddy system. Patch 3 removes the strict "10 seconds" definition of a hardlockup, which was misleading given the periodic nature of the detector. It adds a "Detection Overhead" section to the admin guide, using "Best Case" and "Worst Case" scenarios to illustrate that detection time can vary significantly (e.g., ~6s to ~20s). Patch 5 adds a dedicated section for the Buddy detector, which was previously undocumented. It details the mechanism, the new timing logic, and known limitations. This patch (of 5): Invert the `is_hardlockup(cpu)` check in `watchdog_hardlockup_check()` to return early when a hardlockup is not detected. This flattens the main logic block, reducing the indentation level and making the code easier to read and maintain. This refactoring serves as a preparation patch for future hardlockup changes. Link: https://lkml.kernel.org/r/20260312-hardlockup-watchdog-fixes-v2-0-45bd8a0cc7ed@google.com Link: https://lkml.kernel.org/r/20260312-hardlockup-watchdog-fixes-v2-1-45bd8a0cc7ed@google.com Signed-off-by: Mayank Rungta <mrungta@google.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Ian Rogers <irogers@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Li Huafei <lihuafei1@huawei.com> Cc: Max Kellermann <max.kellermann@ionos.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Stephane Erainan <eranian@google.com> Cc: Wang Jinchao <wangjinchao600@gmail.com> Cc: Yunhui Cui <cuiyunhui@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:46 -07:00
Eric Biggers	ea2976032d	kernel/kexec: remove inclusion of crypto/hash.h kexec_core.c does not do any cryptographic hashing, so the header crypto/hash.h is not needed at all. Link: https://lkml.kernel.org/r/20260314204144.44884-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:46 -07:00
Eric Biggers	231bb8c1be	kernel/crash: remove inclusion of crypto/sha1.h Several files related to kernel crash dumps include crypto/sha1.h but never use any of its functionality. Remove these includes so that these files don't unnecessarily come up in searches for which kernel code is still using the obsolete SHA-1 algorithm. Link: https://lkml.kernel.org/r/20260314204243.45001-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:46 -07:00
Aaron Tomlin	73d40c42f6	hung_task: explicitly report I/O wait state in log output Currently, the hung task reporting mechanism indiscriminately labels all TASK_UNINTERRUPTIBLE (D) tasks as "blocked", irrespective of whether they are awaiting I/O completion or kernel locking primitives. This ambiguity compels system administrators to manually inspect stack traces to discern whether the delay stems from an I/O wait (typically indicative of hardware or filesystem anomalies) or software contention. Such detailed analysis is not always immediately accessible to system administrators or support engineers. To address this, this patch utilises the existing in_iowait field within struct task_struct to augment the failure report. If the task is blocked due to I/O (e.g., via io_schedule_prepare()), the log message is updated to explicitly state "blocked in I/O wait". Examples: - Standard Block: "INFO: task bash:123 blocked for more than 120 seconds". - I/O Block: "INFO: task dd:456 blocked in I/O wait for more than 120 seconds". Theoretically, concurrent executions of io_schedule_finish() could result in a race condition where the read flag does not precisely correlate with the subsequently printed backtrace. However, this limitation is deemed acceptable in practice. The entire reporting mechanism is inherently racy by design; nevertheless, it remains highly reliable in the vast majority of cases, particularly because it primarily captures protracted stalls. Consequently, introducing additional synchronisation to mitigate this minor inaccuracy would be entirely disproportionate to the situation. Link: https://lkml.kernel.org/r/20260303221324.4106917-1-atomlin@atomlin.com Signed-off-by: Aaron Tomlin <atomlin@atomlin.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Lance Yang <lance.yang@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:40 -07:00
Petr Mladek	5eaef7f8ee	hung_task: increment the global counter immediately A recent change allowed to reset the global counter of hung tasks using the sysctl interface. A potential race with the regular check has been solved by updating the global counter only once at the end of the check. However, the hung task check can take a significant amount of time, particularly when task information is being dumped to slow serial consoles. Some users monitor this global counter to trigger immediate migration of critical containers. Delaying the increment until the full check completes postpones these high-priority rescue operations. Update the global counter as soon as a hung task is detected. Since the value is read asynchronously, a relaxed atomic operation is sufficient. Link: https://lkml.kernel.org/r/20260303203031.4097316-4-atomlin@atomlin.com Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Aaron Tomlin <atomlin@atomlin.com> Reported-by: Lance Yang <lance.yang@linux.dev> Closes: https://lore.kernel.org/r/f239e00f-4282-408d-b172-0f9885f4b01b@linux.dev Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Joel Granados <joel.granados@kernel.org> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:40 -07:00
Aaron Tomlin	49085e1b70	hung_task: enable runtime reset of hung_task_detect_count Currently, the hung_task_detect_count sysctl provides a cumulative count of hung tasks since boot. In long-running, high-availability environments, this counter may lose its utility if it cannot be reset once an incident has been resolved. Furthermore, the previous implementation relied upon implicit ordering, which could not strictly guarantee that diagnostic metadata published by one CPU was visible to the panic logic on another. This patch introduces the capability to reset the detection count by writing "0" to the hung_task_detect_count sysctl. The proc_handler logic has been updated to validate this input and atomically reset the counter. The synchronisation of sysctl_hung_task_detect_count relies upon a transactional model to ensure the integrity of the detection counter against concurrent resets from userspace. The application of atomic_long_read_acquire() and atomic_long_cmpxchg_release() is correct and provides the following guarantees: 1. Prevention of Load-Store Reordering via Acquire Semantics By utilising atomic_long_read_acquire() to snapshot the counter before initiating the task traversal, we establish a strict memory barrier. This prevents the compiler or hardware from reordering the initial load to a point later in the scan. Without this "acquire" barrier, a delayed load could potentially read a "0" value resulting from a userspace reset that occurred mid-scan. This would lead to the subsequent cmpxchg succeeding erroneously, thereby overwriting the user's reset with stale increment data. 2. Atomicity of the "Commit" Phase via Release Semantics The atomic_long_cmpxchg_release() serves as the transaction's commit point. The "release" barrier ensures that all diagnostic recordings and task-state observations made during the scan are globally visible before the counter is incremented. 3. Race Condition Resolution This pairing effectively detects any "out-of-band" reset of the counter. If sysctl_hung_task_detect_count is modified via the procfs interface during the scan, the final cmpxchg will detect the discrepancy between the current value and the "acquire" snapshot. Consequently, the update will fail, ensuring that a reset command from the administrator is prioritised over a scan that may have been invalidated by that very reset. Link: https://lkml.kernel.org/r/20260303203031.4097316-3-atomlin@atomlin.com Signed-off-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Joel Granados <joel.granados@kernel.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Lance Yang <lance.yang@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:40 -07:00
Aaron Tomlin	00b5cdeb9f	hung_task: refactor detection logic and atomicise detection count Patch series "hung_task: Provide runtime reset interface for hung task detector", v9. This series introduces the ability to reset /proc/sys/kernel/hung_task_detect_count. Writing a "0" value to this file atomically resets the counter of detected hung tasks. This functionality provides system administrators with the means to clear the cumulative diagnostic history following incident resolution, thereby simplifying subsequent monitoring without necessitating a system restart. This patch (of 3): The check_hung_task() function currently conflates two distinct responsibilities: validating whether a task is hung and handling the subsequent reporting (printing warnings, triggering panics, or tracepoints). This patch refactors the logic by introducing hung_task_info(), a function dedicated solely to reporting. The actual detection check, task_is_hung(), is hoisted into the primary loop within check_hung_uninterruptible_tasks(). This separation clearly decouples the mechanism of detection from the policy of reporting. Furthermore, to facilitate future support for concurrent hung task detection, the global sysctl_hung_task_detect_count variable is converted from unsigned long to atomic_long_t. Consequently, the counting logic is updated to accumulate the number of hung tasks locally (this_round_count) during the iteration. The global counter is then updated atomically via atomic_long_cmpxchg_relaxed() once the loop concludes, rather than incrementally during the scan. These changes are strictly preparatory and introduce no functional change to the system's runtime behaviour. Link: https://lkml.kernel.org/r/20260303203031.4097316-1-atomlin@atomlin.com Link: https://lkml.kernel.org/r/20260303203031.4097316-2-atomlin@atomlin.com Signed-off-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Joel Granados <joel.granados@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:40 -07:00
Thorsten Blum	defec2ca7c	crash_dump: use sysfs_emit in sysfs show functions Replace sprintf() with sysfs_emit() in sysfs show functions. sysfs_emit() is preferred for formatting sysfs output because it provides safer bounds checking. No functional changes. Link: https://lkml.kernel.org/r/20260301125106.911980-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Acked-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:38 -07:00
Oleg Nesterov	8fba1920ac	pid: document the PIDNS_ADDING checks in alloc_pid() and copy_process() Both copy_process() and alloc_pid() do the same PIDNS_ADDING check. The reasons for these checks, and the fact that both are necessary, are not immediately obvious. Add the comments. Link: https://lkml.kernel.org/r/aaGIRElc78U4Er42@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Adrian Reber <areber@redhat.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:36 -07:00
Oleg Nesterov	040261b118	pid: make sub-init creation retryable Patch series "pid: make sub-init creation retryable". This patch (of 2): Currently we allow only one attempt to create init in a new namespace. If the first fork() fails after alloc_pid() succeeds, free_pid() clears PIDNS_ADDING and thus disables further PID allocations. Nowadays this looks like an unnecessary limitation. The original reason to handle "case PIDNS_ADDING" in free_pid() is gone, most probably after commit `69879c01a0` ("proc: Remove the now unnecessary internal mount of proc"). Change free_pid() to keep ns->pid_allocated == PIDNS_ADDING, and change alloc_pid() to reset the cursor early, right after taking pidmap_lock. Test-case: #define _GNU_SOURCE #include <linux/sched.h> #include <sys/syscall.h> #include <sys/wait.h> #include <assert.h> #include <sched.h> #include <errno.h> int main(void) { struct clone_args args = { .exit_signal = SIGCHLD, .flags = CLONE_PIDFD, .pidfd = 0, }; unsigned long pidfd; int pid; assert(unshare(CLONE_NEWPID) == 0); pid = syscall(__NR_clone3, &args, sizeof(args)); assert(pid == -1 && errno == EFAULT); args.pidfd = (unsigned long)&pidfd; pid = syscall(__NR_clone3, &args, sizeof(args)); if (pid) assert(pid > 0 && wait(NULL) == pid); else assert(getpid() == 1); return 0; } Link: https://lkml.kernel.org/r/aaGHu3ixbw9Y7kFj@redhat.com Link: https://lkml.kernel.org/r/aaGIHa7vGdwhEc_D@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andrei Vagin <avagin@gmail.com> Cc: Adrian Reber <areber@redhat.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:36 -07:00
Thorsten Blum	26430489b1	crash_dump: fix typo in function name read_key_from_user_keying The function read_key_from_user_keying() is missing an 'r' in its name. Fix the typo by renaming it to read_key_from_user_keyring(). Link: https://lkml.kernel.org/r/20260227230422.859423-1-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:35 -07:00
Thorsten Blum	c02474fe1a	crash_dump: remove redundant less-than-zero check 'key_count' is an 'unsigned int' and cannot be less than zero. Remove the redundant condition. Link: https://lkml.kernel.org/r/20260228085136.861971-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:35 -07:00
Thorsten Blum	380369ea2e	fork: replace simple_strtoul with kstrtoul in coredump_filter_setup Replace simple_strtoul() with the recommended kstrtoul() for parsing the 'coredump_filter=' boot parameter. Check the return value of kstrtoul() and reject invalid values. This adds error handling while preserving behavior for existing values, and removes use of the deprecated simple_strtoul() helper. The current code silently sets 'default_dump_filter = 0' if parsing fails, instead of leaving the default value (MMF_DUMP_FILTER_DEFAULT) unchanged. Rename the static variable 'default_dump_filter' to 'coredump_filter' since it does not necessarily contain the default value and the current name can be misleading. Link: https://lkml.kernel.org/r/20251215142152.4082-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:35 -07:00
Oleg Nesterov	162e4fd97d	complete_signal: kill always-true "core_state \|\| !SIGNAL_GROUP_EXIT" check The "(signal->core_state \|\| !(signal->flags & SIGNAL_GROUP_EXIT))" check in complete_signal() is not obvious at all, and in fact it only adds unnecessary confusion: this condition is always true. prepare_signal() does: if (signal->flags & SIGNAL_GROUP_EXIT) { if (signal->core_state) return sig == SIGKILL; /* * The process is in the middle of dying, drop the signal. */ return false; } This means that "!signal->core_state && (signal->flags & SIGNAL_GROUP_EXIT)" in complete_signal() is never possible. If SIGNAL_GROUP_EXIT is set, prepare_signal() can only return true if signal->core_state is not NULL. Link: https://lkml.kernel.org/r/aZsfkDhnqJ4s1oTs@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc; Deepanshu Kartikey <kartikey406@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:35 -07:00
Oleg Nesterov	617ab884b8	exit: kill unnecessary thread_group_leader() checks in exit_notify() and do_notify_parent() thread_group_empty(tsk) is only possible if tsk is a group leader, and thread_group_empty() already does the thread_group_leader() check. So it makes no sense to check "thread_group_leader() && thread_group_empty()"; thread_group_empty() alone is enough. Link: https://lkml.kernel.org/r/aZsfeegKZPZZszJh@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc; Deepanshu Kartikey <kartikey406@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:34 -07:00
Rio	48d76a8282	kernel/panic: mark init_taint_buf as __initdata and panic instead of warning in alloc_taint_buf() However there's a convention of assuming that __init-time allocations cannot fail. Because if a kmalloc() were to fail at this time, the kernel is hopelessly messed up anyway. So simply panic() if that kmalloc failed, then make that 350-byte buffer __initdata. Link: https://lkml.kernel.org/r/20260223035914.4033-1-rioo.tsukatsukii@gmail.com Signed-off-by: Rio <rioo.tsukatsukii@gmail.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Wang Jinchao <wangjinchao600@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:33 -07:00
Rio	a9dff0d0d1	kernel/panic: allocate taint string buffer dynamically The buffer used to hold the taint string is statically allocated, which requires updating whenever a new taint flag is added. Instead, allocate the exact required length at boot once the allocator is available in an init function. The allocation sums the string lengths in taint_flags[], along with space for separators and formatting. print_tainted() is switched to use this dynamically allocated buffer. If allocation fails, print_tainted() warns about the failure and continues to use the original static buffer as a fallback. Link: https://lkml.kernel.org/r/20260222140804.22225-1-rioo.tsukatsukii@gmail.com Signed-off-by: Rio <rioo.tsukatsukii@gmail.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Wang Jinchao <wangjinchao600@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:33 -07:00
Rio	a75d207916	kernel/panic: increase buffer size for verbose taint logging The verbose 'Tainted: ...' string in print_tainted_seq can total to 327 characters while the buffer defined in _print_tainted is 320 bytes. Increase its size to 350 characters to hold all flags, along with some headroom. [akpm@linux-foundation.org: fix spello, add comment] Link: https://lkml.kernel.org/r/20260220151500.13585-1-rioo.tsukatsukii@gmail.com Signed-off-by: Rio <rioo.tsukatsukii@gmail.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Wang Jinchao <wangjinchao600@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:33 -07:00
Michal Grzedzicki	a98621a0f1	unshare: fix nsproxy leak in ksys_unshare() on set_cred_ucounts() failure When set_cred_ucounts() fails in ksys_unshare() new_nsproxy is leaked. Let's call put_nsproxy() if that happens. Link: https://lkml.kernel.org/r/20260213193959.2556730-1-mge@meta.com Fixes: `905ae01c4a` ("Add a reference to ucounts for each cred") Signed-off-by: Michal Grzedzicki <mge@meta.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexey Gladkov (Intel) <legion@kernel.org> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 21:19:32 -07:00
Linus Torvalds	d5273fd3ca	Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix how linked registers track zero extension of subregisters (Daniel Borkmann) - Fix unsound scalar fork for OR instructions (Daniel Wade) - Fix exception exit lock check for subprogs (Ihor Solodrai) - Fix undefined behavior in interpreter for SDIV/SMOD instructions (Jenny Guanni Qu) - Release module's BTF when module is unloaded (Kumar Kartikeya Dwivedi) - Fix constant blinding for PROBE_MEM32 instructions (Sachin Kumar) - Reset register ID for END instructions to prevent incorrect value tracking (Yazhou Tang) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Add a test cases for sync_linked_regs regarding zext propagation bpf: Fix sync_linked_regs regarding BPF_ADD_CONST32 zext propagation selftests/bpf: Add tests for maybe_fork_scalars() OR vs AND handling bpf: Fix unsound scalar forking in maybe_fork_scalars() for BPF_OR selftests/bpf: Add tests for sdiv32/smod32 with INT_MIN dividend bpf: Fix undefined behavior in interpreter sdiv/smod for INT_MIN selftests/bpf: Add tests for bpf_throw lock leak from subprogs bpf: Fix exception exit lock checking for subprogs bpf: Release module BTF IDR before module unload selftests/bpf: Fix pkg-config call on static builds bpf: Fix constant blinding for PROBE_MEM32 stores selftests/bpf: Add test for BPF_END register ID reset bpf: Reset register ID for BPF_END value tracking	2026-03-22 11:16:06 -07:00
Linus Torvalds	ac57fa9faf	Merge tag 'trace-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Revert "tracing: Remove pid in task_rename tracing output" A change was made to remove the pid field from the task_rename event because it was thought that it was always done for the current task and recording the pid would be redundant. This turned out to be incorrect and there are a few corner case where this is not true and caused some regressions in tooling. - Fix the reading from user space for migration The reading of user space uses a seq lock type of logic where it uses a per-cpu temporary buffer and disables migration, then enables preemption, does the copy from user space, disables preemption, enables migration and checks if there was any schedule switches while preemption was enabled. If there was a context switch, then it is considered that the per-cpu buffer could be corrupted and it tries again. There's a protection check that tests if it takes a hundred tries, it issues a warning and exits out to prevent a live lock. This was triggered because the task was selected by the load balancer to be migrated to another CPU, every time preemption is enabled the migration task would schedule in try to migrate the task but can't because migration is disabled and let it run again. This caused the scheduler to schedule out the task every time it enabled preemption and made the loop never exit (until the 100 iteration test triggered). Fix this by enabling and disabling preemption and keeping migration enabled if the reading from user space needs to be done again. This will let the migration thread migrate the task and the copy from user space will likely pass on the next iteration. - Fix trace_marker copy option freeing The "copy_trace_marker" option allows a tracing instance to get a copy of a write to the trace_marker file of the top level instance. This is managed by a link list protected by RCU. When an instance is removed, a check is made if the option is set, and if so synchronized_rcu() is called. The problem is that an iteration is made to reset all the flags to what they were when the instance was created (to perform clean ups) was done before the check of the copy_trace_marker option and that option was cleared, so the synchronize_rcu() was never called. Move the clearing of all the flags after the check of copy_trace_marker to do synchronize_rcu() so that the option is still set if it was before and the synchronization is performed. - Fix entries setting when validating the persistent ring buffer When validating the persistent ring buffer on boot up, the number of events per sub-buffer is added to the sub-buffer meta page. The validator was updating cpu_buffer->head_page (the first sub-buffer of the per-cpu buffer) and not the "head_page" variable that was iterating the sub-buffers. This was causing the first sub-buffer to be assigned the entries for each sub-buffer and not the sub-buffer that was supposed to be updated. - Use "hash" value to update the direct callers When updating the ftrace direct callers, it assigned a temporary callback to all the callback functions of the ftrace ops and not just the functions represented by the passed in hash. This causes an unnecessary slow down of the functions of the ftrace_ops that is not being modified. Only update the functions that are going to be modified to call the ftrace loop function so that the update can be made on those functions. * tag 'trace-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Use hash argument for tmp_ops in update_ftrace_direct_mod ring-buffer: Fix to update per-subbuf entries of persistent ring buffer tracing: Fix trace_marker copy link list updates tracing: Fix failure to read user space from system call trace events tracing: Revert "tracing: Remove pid in task_rename tracing output"	2026-03-22 11:10:31 -07:00
Linus Torvalds	ebfd9b7af2	Merge tag 'perf-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: - Fix a PMU driver crash on AMD EPYC systems, caused by a race condition in x86_pmu_enable() - Fix a possible counter-initialization bug in x86_pmu_enable() - Fix a counter inheritance bug in inherit_event() and __perf_event_read() - Fix an Intel PMU driver branch constraints handling bug found by UBSAN - Fix the Intel PMU driver's new Off-Module Response (OMR) support code for Diamond Rapids / Nova lake, to fix a snoop information parsing bug * tag 'perf-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Fix OMR snoop information parsing issues perf/x86/intel: Add missing branch counters constraint apply perf: Make sure to use pmu_ctx->pmu for groups x86/perf: Make sure to program the counter value for stopped events on migration perf/x86: Move event pointer setup earlier in x86_pmu_enable()	2026-03-22 10:31:51 -07:00
Jiri Olsa	50b35c9e50	ftrace: Use hash argument for tmp_ops in update_ftrace_direct_mod The modify logic registers temporary ftrace_ops object (tmp_ops) to trigger the slow path for all direct callers to be able to safely modify attached addresses. At the moment we use ops->func_hash for tmp_ops filter, which represents all the systems attachments. It's faster to use just the passed hash filter, which contains only the modified sites and is always a subset of the ops->func_hash. Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Menglong Dong <menglong8.dong@gmail.com> Cc: Song Liu <song@kernel.org> Link: https://patch.msgid.link/20260312123738.129926-1-jolsa@kernel.org Fixes: `e93672f770` ("ftrace: Add update_ftrace_direct_mod function") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-03-21 16:51:04 -04:00
Masami Hiramatsu (Google)	f35dbac694	ring-buffer: Fix to update per-subbuf entries of persistent ring buffer Since the validation loop in rb_meta_validate_events() updates the same cpu_buffer->head_page->entries, the other subbuf entries are not updated. Fix to use head_page to update the entries field, since it is the cursor in this loop. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ian Rogers <irogers@google.com> Fixes: `5f3b6e839f` ("ring-buffer: Validate boot range memory events") Link: https://patch.msgid.link/177391153882.193994.17158784065013676533.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-03-21 16:47:28 -04:00
Steven Rostedt	07183aac4a	tracing: Fix trace_marker copy link list updates When the "copy_trace_marker" option is enabled for an instance, anything written into /sys/kernel/tracing/trace_marker is also copied into that instances buffer. When the option is set, that instance's trace_array descriptor is added to the marker_copies link list. This list is protected by RCU, as all iterations uses an RCU protected list traversal. When the instance is deleted, all the flags that were enabled are cleared. This also clears the copy_trace_marker flag and removes the trace_array descriptor from the list. The issue is after the flags are called, a direct call to update_marker_trace() is performed to clear the flag. This function returns true if the state of the flag changed and false otherwise. If it returns true here, synchronize_rcu() is called to make sure all readers see that its removed from the list. But since the flag was already cleared, the state does not change and the synchronization is never called, leaving a possible UAF bug. Move the clearing of all flags below the updating of the copy_trace_marker option which then makes sure the synchronization is performed. Also use the flag for checking the state in update_marker_trace() instead of looking at if the list is empty. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260318185512.1b6c7db4@gandalf.local.home Fixes: `7b382efd5e` ("tracing: Allow the top level trace_marker to write into another instances") Reported-by: Sasha Levin <sashal@kernel.org> Closes: https://lore.kernel.org/all/20260225133122.237275-1-sashal@kernel.org/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-03-21 16:43:53 -04:00
Steven Rostedt	edca33a562	tracing: Fix failure to read user space from system call trace events The system call trace events call trace_user_fault_read() to read the user space part of some system calls. This is done by grabbing a per-cpu buffer, disabling migration, enabling preemption, calling copy_from_user(), disabling preemption, enabling migration and checking if the task was preempted while preemption was enabled. If it was, the buffer is considered corrupted and it tries again. There's a safety mechanism that will fail out of this loop if it fails 100 times (with a warning). That warning message was triggered in some pi_futex stress tests. Enabling the sched_switch trace event and traceoff_on_warning, showed the problem: pi_mutex_hammer-1375 [006] d..21 138.981648: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981651: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981656: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981659: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981664: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981667: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981671: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981675: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981679: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981682: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981687: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981690: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981695: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981698: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981703: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981706: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981711: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981714: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981719: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981722: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981727: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981730: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981735: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981738: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 What happened was the task 1375 was flagged to be migrated. When preemption was enabled, the migration thread woke up to migrate that task, but failed because migration for that task was disabled. This caused the loop to fail to exit because the task scheduled out while trying to read user space. Every time the task enabled preemption the migration thread would schedule in, try to migrate the task, fail and let the task continue. But because the loop would only enable preemption with migration disabled, it would always fail because each time it enabled preemption to read user space, the migration thread would try to migrate it. To solve this, when the loop fails to read user space without being scheduled out, enabled and disable preemption with migration enabled. This will allow the migration task to successfully migrate the task and the next loop should succeed to read user space without being scheduled out. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260316130734.1858a998@gandalf.local.home Fixes: `64cf7d058a` ("tracing: Have trace_marker use per-cpu data to read user space") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-03-21 16:42:36 -04:00
Daniel Borkmann	bc308be380	bpf: Fix sync_linked_regs regarding BPF_ADD_CONST32 zext propagation Jenny reported that in sync_linked_regs() the BPF_ADD_CONST32 flag is checked on known_reg (the register narrowed by a conditional branch) instead of reg (the linked target register created by an alu32 operation). Example case with reg: 1. r6 = bpf_get_prandom_u32() 2. r7 = r6 (linked, same id) 3. w7 += 5 (alu32 -- r7 gets BPF_ADD_CONST32, zero-extended by CPU) 4. if w6 < 0xFFFFFFFC goto safe (narrows r6 to [0xFFFFFFFC, 0xFFFFFFFF]) 5. sync_linked_regs() propagates to r7 but does NOT call zext_32_to_64() 6. Verifier thinks r7 is [0x100000001, 0x100000004] instead of [1, 4] Since known_reg above does not have BPF_ADD_CONST32 set above, zext_32_to_64() is never called on alu32-derived linked registers. This causes the verifier to track incorrect 64-bit bounds, while the CPU correctly zero-extends the 32-bit result. The code checking known_reg->id was correct however (see scalars_alu32_wrap selftest case), but the real fix needs to handle both directions - zext propagation should be done when either register has BPF_ADD_CONST32, since the linked relationship involves a 32-bit operation regardless of which side has the flag. Example case with known_reg (exercised also by scalars_alu32_wrap): 1. r1 = r0; w1 += 0x100 (alu32 -- r1 gets BPF_ADD_CONST32) 2. if r1 > 0x80 - known_reg = r1 (has BPF_ADD_CONST32), reg = r0 (doesn't) Hence, fix it by checking for (reg->id \| known_reg->id) & BPF_ADD_CONST32. Moreover, sync_linked_regs() also has a soundness issue when two linked registers used different ALU widths: one with BPF_ADD_CONST32 and the other with BPF_ADD_CONST64. The delta relationship between linked registers assumes the same arithmetic width though. When one register went through alu32 (CPU zero-extends the 32-bit result) and the other went through alu64 (no zero-extension), the propagation produces incorrect bounds. Example: r6 = bpf_get_prandom_u32() // fully unknown if r6 >= 0x100000000 goto out // constrain r6 to [0, U32_MAX] r7 = r6 w7 += 1 // alu32: r7.id = N \| BPF_ADD_CONST32 r8 = r6 r8 += 2 // alu64: r8.id = N \| BPF_ADD_CONST64 if r7 < 0xFFFFFFFF goto out // narrows r7 to [0xFFFFFFFF, 0xFFFFFFFF] At the branch on r7, sync_linked_regs() runs with known_reg=r7 (BPF_ADD_CONST32) and reg=r8 (BPF_ADD_CONST64). The delta path computes: r8 = r7 + (delta_r8 - delta_r7) = 0xFFFFFFFF + (2 - 1) = 0x100000000 Then, because known_reg->id has BPF_ADD_CONST32, zext_32_to_64(r8) is called, truncating r8 to [0, 0]. But r8 used a 64-bit ALU op -- the CPU does NOT zero-extend it. The actual CPU value of r8 is 0xFFFFFFFE + 2 = 0x100000000, not 0. The verifier now underestimates r8's 64-bit bounds, which is a soundness violation. Fix sync_linked_regs() by skipping propagation when the two registers have mixed ALU widths (one BPF_ADD_CONST32, the other BPF_ADD_CONST64). Lastly, fix regsafe() used for path pruning: the existing checks used "& BPF_ADD_CONST" to test for offset linkage, which treated BPF_ADD_CONST32 and BPF_ADD_CONST64 as equivalent. Fixes: `7a433e5193` ("bpf: Support negative offsets, BPF_SUB, and alu32 for linked register tracking") Reported-by: Jenny Guanni Qu <qguanni@gmail.com> Co-developed-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260319211507.213816-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-21 13:19:40 -07:00
Daniel Wade	c845894ebd	bpf: Fix unsound scalar forking in maybe_fork_scalars() for BPF_OR maybe_fork_scalars() is called for both BPF_AND and BPF_OR when the source operand is a constant. When dst has signed range [-1, 0], it forks the verifier state: the pushed path gets dst = 0, the current path gets dst = -1. For BPF_AND this is correct: 0 & K == 0. For BPF_OR this is wrong: 0 \| K == K, not 0. The pushed path therefore tracks dst as 0 when the runtime value is K, producing an exploitable verifier/runtime divergence that allows out-of-bounds map access. Fix this by passing env->insn_idx (instead of env->insn_idx + 1) to push_stack(), so the pushed path re-executes the ALU instruction with dst = 0 and naturally computes the correct result for any opcode. Fixes: `bffacdb80b` ("bpf: Recognize special arithmetic shift in the verifier") Signed-off-by: Daniel Wade <danjwade95@gmail.com> Reviewed-by: Amery Hung <ameryhung@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260314021521.128361-2-danjwade95@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-21 13:14:28 -07:00
Jenny Guanni Qu	c77b30bd1d	bpf: Fix undefined behavior in interpreter sdiv/smod for INT_MIN The BPF interpreter's signed 32-bit division and modulo handlers use the kernel abs() macro on s32 operands. The abs() macro documentation (include/linux/math.h) explicitly states the result is undefined when the input is the type minimum. When DST contains S32_MIN (0x80000000), abs((s32)DST) triggers undefined behavior and returns S32_MIN unchanged on arm64/x86. This value is then sign-extended to u64 as 0xFFFFFFFF80000000, causing do_div() to compute the wrong result. The verifier's abstract interpretation (scalar32_min_max_sdiv) computes the mathematically correct result for range tracking, creating a verifier/interpreter mismatch that can be exploited for out-of-bounds map value access. Introduce abs_s32() which handles S32_MIN correctly by casting to u32 before negating, avoiding signed overflow entirely. Replace all 8 abs((s32)...) call sites in the interpreter's sdiv32/smod32 handlers. s32 is the only affected case -- the s64 division/modulo handlers do not use abs(). Fixes: `ec0e2da95f` ("bpf: Support new signed div/mod instructions.") Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com> Link: https://lore.kernel.org/r/20260311011116.2108005-2-qguanni@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-21 13:12:16 -07:00
Ihor Solodrai	6c2128505f	bpf: Fix exception exit lock checking for subprogs process_bpf_exit_full() passes check_lock = !curframe to check_resource_leak(), which is false in cases when bpf_throw() is called from a static subprog. This makes check_resource_leak() to skip validation of active_rcu_locks, active_preempt_locks, and active_irq_id on exception exits from subprogs. At runtime bpf_throw() unwinds the stack via ORC without releasing any user-acquired locks, which may cause various issues as the result. Fix by setting check_lock = true for exception exits regardless of curframe, since exceptions bypass all intermediate frame cleanup. Update the error message prefix to "bpf_throw" for exception exits to distinguish them from normal BPF_EXIT. Fix reject_subprog_with_rcu_read_lock test which was previously passing for the wrong reason. Test program returned directly from the subprog call without closing the RCU section, so the error was triggered by the unclosed RCU lock on normal exit, not by bpf_throw. Update __msg annotations for affected tests to match the new "bpf_throw" error prefix. The spin_lock case is not affected because they are already checked [1] at the call site in do_check_insn() before bpf_throw can run. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/verifier.c?h=v7.0-rc4#n21098 Assisted-by: Claude:claude-opus-4-6 Fixes: `f18b03faba` ("bpf: Implement BPF exceptions") Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260320000809.643798-1-ihor.solodrai@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-21 12:51:44 -07:00
Linus Torvalds	e9825d1c79	Merge tag 'pm-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix an idle loop issue exposed by recent changes and a race condition related to device removal in the runtime PM core code: - Consolidate the handling of two special cases in the idle loop that occur when only one CPU idle state is present (Rafael Wysocki) - Fix a race condition related to device removal in the runtime PM core code that may cause a stale device object pointer to be dereferenced (Bart Van Assche)" * tag 'pm-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: runtime: Fix a race condition related to device removal sched: idle: Consolidate the handling of two special cases	2026-03-19 08:45:34 -07:00
Kumar Kartikeya Dwivedi	146bd2a87a	bpf: Release module BTF IDR before module unload Gregory reported in [0] that the global_map_resize test when run in repeatedly ends up failing during program load. This stems from the fact that BTF reference has not dropped to zero after the previous run's module is unloaded, and the older module's BTF is still discoverable and visible. Later, in libbpf, load_module_btfs() will find the ID for this stale BTF, open its fd, and then it will be used during program load where later steps taking module reference using btf_try_get_module() fail since the underlying module for the BTF is gone. Logically, once a module is unloaded, it's associated BTF artifacts should become hidden. The BTF object inside the kernel may still remain alive as long its reference counts are alive, but it should no longer be discoverable. To fix this, let us call btf_free_id() from the MODULE_STATE_GOING case for the module unload to free the BTF associated IDR entry, and disable its discovery once module unload returns to user space. If a race happens during unload, the outcome is non-deterministic anyway. However, user space should be able to rely on the guarantee that once it has synchronously established a successful module unload, no more stale artifacts associated with this module can be obtained subsequently. Note that we must be careful to not invoke btf_free_id() in btf_put() when btf_is_module() is true now. There could be a window where the module unload drops a non-terminal reference, frees the IDR, but the same ID gets reused and the second unconditional btf_free_id() ends up releasing an unrelated entry. To avoid a special case for btf_is_module() case, set btf->id to zero to make btf_free_id() idempotent, such that we can unconditionally invoke it from btf_put(), and also from the MODULE_STATE_GOING case. Since zero is an invalid IDR, the idr_remove() should be a noop. Note that we can be sure that by the time we reach final btf_put() for btf_is_module() case, the btf_free_id() is already done, since the module itself holds the BTF reference, and it will call this function for the BTF before dropping its own reference. [0]: https://lore.kernel.org/bpf/cover.1773170190.git.grbell@redhat.com Fixes: `36e68442d1` ("bpf: Load and verify kernel module BTFs") Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Suggested-by: Martin KaFai Lau <martin.lau@kernel.org> Reported-by: Gregory Bell <grbell@redhat.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260312205307.1346991-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-18 17:26:40 -07:00
Rafael J. Wysocki	f4c31b07b1	sched: idle: Consolidate the handling of two special cases There are two special cases in the idle loop that are handled inconsistently even though they are analogous. The first one is when a cpuidle driver is absent and the default CPU idle time power management implemented by the architecture code is used. In that case, the scheduler tick is stopped every time before invoking default_idle_call(). The second one is when a cpuidle driver is present, but there is only one idle state in its table. In that case, the scheduler tick is never stopped at all. Since each of these approaches has its drawbacks, reconcile them with the help of one simple heuristic. Namely, stop the tick if the CPU has been woken up by it in the previous iteration of the idle loop, or let it tick otherwise. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Fixes: `ed98c34919` ("sched: idle: Do not stop the tick before cpuidle_idle_call()") [ rjw: Added Fixes tag, changelog edits ] Link: https://patch.msgid.link/4741364.LvFx2qVVIh@rafael.j.wysocki Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2026-03-16 20:29:47 +01:00
Linus Torvalds	8a91ebb337	Merge tag 'mm-hotfixes-stable-2026-03-16-12-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "6 hotfixes. 4 are cc:stable. 3 are for MM. All are singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2026-03-16-12-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: MAINTAINERS: update email address for Ignat Korchagin mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared THP mm/rmap: fix incorrect pte restoration for lazyfree folios mm/huge_memory: fix use of NULL folio in move_pages_huge_pmd() build_bug.h: correct function parameters names in kernel-doc crash_dump: don't log dm-crypt key bytes in read_key_from_user_keying	2026-03-16 12:21:00 -07:00
Linus Torvalds	d9bf296c39	Merge tag 'probes-fixes-v7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - Avoid crash when rmmod/insmod after ftrace killed This fixes a kernel crash caused by kprobes on the symbol in a module which is unloaded after ftrace_kill() is called. - Remove unneeded warnings from __arm_kprobe_ftrace() Remove unneeded WARN messages which can be triggered if the kprobe is using ftrace and it fails to enable the ftrace. Since kprobes correctly handle such failure, we don't need to warn it. * tag 'probes-fixes-v7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: kprobes: Remove unneeded warnings from __arm_kprobe_ftrace() kprobes: avoid crash when rmmod/insmod after ftrace killed	2026-03-15 13:08:05 -07:00
Linus Torvalds	164cb546e9	Merge tag 'timers-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix function tracer recursion bug by marking jiffies_64_to_clock_t() notrace" * tag 'timers-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time/jiffies: Mark jiffies_64_to_clock_t() notrace	2026-03-15 11:14:09 -07:00
Linus Torvalds	63724e9519	Merge tag 'sched-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "More MM-CID fixes, mostly fixing hangs/races: - Fix CID hangs due to a race between concurrent forks - Fix vfork()/CLONE_VM MMCID bug causing hangs - Remove pointless preemption guard - Fix CID task list walk performance regression on large systems by removing the known-flaky and slow counting logic using for_each_process_thread() in mm_cid_fixup_tasks_to_cpus(), and implementing a simple sched_mm_cid::node list instead" tag 'sched-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/mmcid: Avoid full tasklist walks sched/mmcid: Remove pointless preempt guard sched/mmcid: Handle vfork()/CLONE_VM correctly sched/mmcid: Prevent CID stalls due to concurrent forks	2026-03-15 10:49:47 -07:00
Linus Torvalds	9abff5748e	Merge tag 'wq-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: - Improve workqueue stall diagnostics: dump all busy workers (not just running ones), show wall-clock duration of in-flight work items, and add a sample module for reproducing stalls - Fix POOL_BH vs WQ_BH flag namespace mismatch in pr_cont_worker_id() - Rename pool->watchdog_ts to pool->last_progress_ts and related functions for clarity * tag 'wq-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Rename show_cpu_pool{s,}_hog{s,}() to reflect broadened scope workqueue: Add stall detector sample module workqueue: Show all busy workers in stall diagnostics workqueue: Show in-flight work item duration in stall diagnostics workqueue: Rename pool->watchdog_ts to pool->last_progress_ts workqueue: Use POOL_BH instead of WQ_BH when checking pool flags	2026-03-13 15:11:05 -07:00
Linus Torvalds	b073bcb8d4	Merge tag 'cgroup-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Hide PF_EXITING tasks from cgroup.procs to avoid exposing dead tasks that haven't been removed yet, fixing a systemd timeout issue on PREEMPT_RT - Call rebuild_sched_domains() directly in CPU hotplug instead of deferring to a workqueue, fixing a race where online/offline CPUs could briefly appear in stale sched domains * tag 'cgroup-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Don't expose dead tasks in cgroup cgroup/cpuset: Call rebuild_sched_domains() directly in hotplug	2026-03-13 15:06:31 -07:00
Linus Torvalds	8369b2e97d	Merge tag 'sched_ext-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix data races flagged by KCSAN: add missing READ_ONCE()/WRITE_ONCE() annotations for lock-free accesses to module parameters and dsq->seq - Fix silent truncation of upper 32 enqueue flags (SCX_ENQ_PREEMPT and above) when passed through the int sched_class interface - Documentation updates: scheduling class precedence, task ownership state machine, example scheduler descriptions, config list cleanup - Selftest fix for format specifier and buffer length in file_write_long() * tag 'sched_ext-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Use WRITE_ONCE() for the write side of scx_enable helper pointer sched_ext: Fix enqueue_task_scx() truncation of upper enqueue flags sched_ext: Documentation: Update sched-ext.rst sched_ext: Use READ_ONCE() for scx_slice_bypass_us in scx_bypass() sched_ext: Documentation: Mention scheduling class precedence sched_ext: Document task ownership state machine sched_ext: Use READ_ONCE() for lock-free reads of module param variables sched_ext/selftests: Fix format specifier and buffer length in file_write_long() sched_ext: Use WRITE_ONCE() for the write side of dsq->seq update	2026-03-13 14:54:56 -07:00
Masami Hiramatsu (Google)	5ef268cb7a	kprobes: Remove unneeded warnings from __arm_kprobe_ftrace() Remove unneeded warnings for handled errors from __arm_kprobe_ftrace() because all caller handled the error correctly. Link: https://lore.kernel.org/all/177261531182.1312989.8737778408503961141.stgit@mhiramat.tok.corp.google.com/ Reported-by: Zw Tang <shicenci@gmail.com> Closes: https://lore.kernel.org/all/CAPHJ_V+J6YDb_wX2nhXU6kh466Dt_nyDSas-1i_Y8s7tqY-Mzw@mail.gmail.com/ Fixes: `9c89bb8e32` ("kprobes: treewide: Cleanup the error messages for kprobes") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2026-03-13 23:15:26 +09:00
Masami Hiramatsu (Google)	e113f0b46d	kprobes: avoid crash when rmmod/insmod after ftrace killed After we hit ftrace is killed by some errors, the kernel crash if we remove modules in which kprobe probes. BUG: unable to handle page fault for address: fffffbfff805000d PGD 817fcc067 P4D 817fcc067 PUD 817fc8067 PMD 101555067 PTE 0 Oops: Oops: 0000 [#1] SMP KASAN PTI CPU: 4 UID: 0 PID: 2012 Comm: rmmod Tainted: G W OE Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE RIP: 0010:kprobes_module_callback+0x89/0x790 RSP: 0018:ffff88812e157d30 EFLAGS: 00010a02 RAX: 1ffffffff805000d RBX: dffffc0000000000 RCX: ffffffff86a8de90 RDX: ffffed1025c2af9b RSI: 0000000000000008 RDI: ffffffffc0280068 RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed1025c2af9a R10: ffff88812e157cd7 R11: 205d323130325420 R12: 0000000000000002 R13: ffffffffc0290488 R14: 0000000000000002 R15: ffffffffc0280040 FS: 00007fbc450dd740(0000) GS:ffff888420331000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffbfff805000d CR3: 000000010f624000 CR4: 00000000000006f0 Call Trace: <TASK> notifier_call_chain+0xc6/0x280 blocking_notifier_call_chain+0x60/0x90 __do_sys_delete_module.constprop.0+0x32a/0x4e0 do_syscall_64+0x5d/0xfa0 entry_SYSCALL_64_after_hwframe+0x76/0x7e This is because the kprobe on ftrace does not correctly handles the kprobe_ftrace_disabled flag set by ftrace_kill(). To prevent this error, check kprobe_ftrace_disabled in __disarm_kprobe_ftrace() and skip all ftrace related operations. Link: https://lore.kernel.org/all/176473947565.1727781.13110060700668331950.stgit@mhiramat.tok.corp.google.com/ Reported-by: Ye Bin <yebin10@huawei.com> Closes: https://lore.kernel.org/all/20251125020536.2484381-1-yebin@huaweicloud.com/ Fixes: `ae6aa16fdc` ("kprobes: introduce ftrace based optimization") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-03-13 23:14:14 +09:00
Peter Zijlstra	4b9ce67196	perf: Make sure to use pmu_ctx->pmu for groups Oliver reported that x86_pmu_del() ended up doing an out-of-bound memory access when group_sched_in() fails and needs to roll back. This should be handled by the transaction callbacks, but he found that when the group leader is a software event, the transaction handlers of the wrong PMU are used. Despite the move_group case in perf_event_open() and group_sched_in() using pmu_ctx->pmu. Turns out, inherit uses event->pmu to clone the events, effectively undoing the move_group case for all inherited contexts. Fix this by also making inherit use pmu_ctx->pmu, ensuring all inherited counters end up in the same pmu context. Similarly, __perf_event_read() should use equally use pmu_ctx->pmu for the group case. Fixes: `bd27568117` ("perf: Rewrite core context handling") Reported-by: Oliver Rosenberg <olrose55@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://patch.msgid.link/20260309133713.GB606826@noisy.programming.kicks-ass.net	2026-03-12 11:29:16 +01:00
Thomas Gleixner	192d852129	sched/mmcid: Avoid full tasklist walks Chasing vfork()'ed tasks on a CID ownership mode switch requires a full task list walk, which is obviously expensive on large systems. Avoid that by keeping a list of tasks using a mm MMCID entity in mm::mm_cid and walk this list instead. This removes the proven to be flaky counting logic and avoids a full task list walk in the case of vfork()'ed tasks. Fixes: `fbd0e71dc3` ("sched/mmcid: Provide CID ownership mode fixup functions") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260310202526.183824481@kernel.org	2026-03-11 12:01:07 +01:00
Thomas Gleixner	7574ac6e49	sched/mmcid: Remove pointless preempt guard This is a leftover from the early versions of this function where it could be invoked without mm::mm_cid::lock held. Remove it and add lockdep asserts instead. Fixes: `653fda7ae7` ("sched/mmcid: Switch over to the new mechanism") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260310202526.116363613@kernel.org	2026-03-11 12:01:06 +01:00
Thomas Gleixner	28b5a13950	sched/mmcid: Handle vfork()/CLONE_VM correctly Matthieu and Jiri reported stalls where a task endlessly loops in mm_get_cid() when scheduling in. It turned out that the logic which handles vfork()'ed tasks is broken. It is invoked when the number of tasks associated to a process is smaller than the number of MMCID users. It then walks the task list to find the vfork()'ed task, but accounts all the already processed tasks as well. If that double processing brings the number of to be handled tasks to 0, the walk stops and the vfork()'ed task's CID is not fixed up. As a consequence a subsequent schedule in fails to acquire a (transitional) CID and the machine stalls. Cure this by removing the accounting condition and make the fixup always walk the full task list if it could not find the exact number of users in the process' thread list. Fixes: `fbd0e71dc3` ("sched/mmcid: Provide CID ownership mode fixup functions") Closes: https://lore.kernel.org/b24ffcb3-09d5-4e48-9070-0b69bc654281@kernel.org Reported-by: Matthieu Baerts <matttbe@kernel.org> Reported-by: Jiri Slaby <jirislaby@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260310202526.048657665@kernel.org	2026-03-11 12:01:06 +01:00

1 2 3 4 5 ...

51060 Commits