linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-02-15 03:53:12 -05:00

Author	SHA1	Message	Date
Frederic Weisbecker	3c8eb36e2a	timers/migration: Assert that hotplug preparing CPU is part of stable active hierarchy The CPU doing the prepare work for a remote target must be online from the tree point of view and its hierarchy must be active, otherwise propagating its active state up to the new root branch would be either incorrect or racy. Assert those conditions with more sanity checks. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251024132536.39841-5-frederic@kernel.org	2025-11-01 20:38:25 +01:00
Frederic Weisbecker	5eb579dfd4	timers/migration: Fix imbalanced NUMA trees When a CPU from a new node boots, the old root may happen to be connected to the new root even if their node mismatch, as depicted in the following scenario: 1) CPU 0 boots and creates the first group for node 0. [GRP0:0] node 0 \| CPU 0 2) CPU 1 from node 1 boots and creates a new top that corresponds to node 1, but it also connects the old root from node 0 to the new root from node 1 by mistake. [GRP1:0] node 1 / \ / \ [GRP0:0] [GRP0:1] node 0 node 1 \| \| CPU 0 CPU 1 3) This eventually leads to an imbalanced tree where some node 0 CPUs migrate node 1 timers (and vice versa) way before reaching the crossnode groups, resulting in more frequent remote memory accesses than expected. [GRP2:0] NUMA_NO_NODE / \ [GRP1:0] [GRP1:1] node 1 node 0 / \ \| / \ [...] [GRP0:0] [GRP0:1] node 0 node 1 \| \| CPU 0... CPU 1... A balanced tree should only contain groups having children that belong to the same node: [GRP2:0] NUMA_NO_NODE / \ [GRP1:0] [GRP1:0] node 0 node 1 / \ / \ / \ / \ [GRP0:0] [...] [...] [GRP0:1] node 0 node 1 \| \| CPU 0... CPU 1... In order to fix this, the hierarchy must be unfolded up to the crossnode level as soon as a node mismatch is detected. For example the stage 2 above should lead to this layout: [GRP2:0] NUMA_NO_NODE / \ [GRP1:0] [GRP1:1] node 0 node 1 / \ / \ [GRP0:0] [GRP0:1] node 0 node 1 \| \| CPU 0 CPU 1 This means that not only GRP1:0 must be created but also GRP1:1 and GRP2:0 in order to prepare a balanced tree for next CPUs to boot. Fixes: `7ee9887703` ("timers: Implement the hierarchical pull model") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251024132536.39841-4-frederic@kernel.org	2025-11-01 20:38:25 +01:00
Frederic Weisbecker	fa9620355d	timers/migration: Remove locking on group connection Initializing the tmc's group, the group's number of children and the group's parent can all be done without locking because: 1) Reading the group's parent and its group mask is done locklessly. 2) The connections prepared for a given CPU hierarchy are visible to the target CPU once online, thanks to the CPU hotplug enforced memory ordering. 3) In case of a newly created upper level, the new root and its connections and initialization are made visible by the CPU which made the connections. When that CPU goes idle in the future, the new link is published by tmigr_inactive_up() through the atomic RmW on ->migr_state. 4) If CPUs were still walking up the active hierarchy, they could observe the new root earlier. In this case the ordering is enforced by an early initialization of the group mask and by barriers that maintain address dependency as explained in: `b729cc1ec2` ("timers/migration: Fix another race between hotplug and idle entry/exit") `de3ced72a7` ("timers/migration: Enforce group initialization visibility to tree walkers") 5) Timers are propagated by a chain of group locking from the bottom to the top. And while doing so, the tree also propagates group links and initialization. Therefore remote expiration, which also relies on group locking, will observe those links and initialization while holding the root lock before walking the tree remotely and updating remote timers. This is especially important for migrators in the active hierarchy that may observe the new root early. Therefore the locking is unnecessary at initialization. If anything, it just brings confusion. Remove it. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251024132536.39841-3-frederic@kernel.org	2025-11-01 20:38:25 +01:00
Frederic Weisbecker	6c181b5667	timers/migration: Convert "while" loops to use "for" Both the "do while" and "while" loops in tmigr_setup_groups() eventually mimic the behaviour of "for" loops. Simplify accordingly. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251024132536.39841-2-frederic@kernel.org	2025-11-01 20:38:24 +01:00
Steve Wahl	4138787408	tick/sched: Limit non-timekeeper CPUs calling jiffies update On large NUMA systems, while running a test program that saturates the inter-processor and inter-NUMA links, acquiring the jiffies_lock can be very expensive. If the cpu designated to do jiffies updates (tick_do_timer_cpu) gets delayed and other cpus decide to do the jiffies update themselves, a large number of them decide to do so at the same time. The inexpensive check against tick_next_period is far quicker than actually acquiring the lock, so most of these get in line to obtain the lock. If obtaining the lock is slow enough, this spirals into the vast majority of CPUs continuously being stuck waiting for this lock, just to obtain it and find out that time has already been updated by another cpu. For example, on one random entry to kdb by manually-injected NMI, 2912 of 3840 CPUs were observed to be stuck there. To avoid this, allow only one non-timekeeper CPU to call tick_do_update_jiffies64() at any given time, resetting ts->stalled_jiffies only if the jiffies update function is actually called. With this change, when manually interrupting the test, at most two CPUs are observed to invoke tick_do_update_jiffies64() - the timekeeper and one other. Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20251027183456.343407-1-steve.wahl@hpe.com	2025-11-01 20:25:53 +01:00
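The gating idea in the tick/sched message above can be sketched in user-space C. This is only a model under stated assumptions: it uses C11 atomics instead of kernel primitives, and every name in it (`update_in_progress`, `model_jiffies`, `next_period`, `model_update_jiffies`, `non_timekeeper_try_update`) is hypothetical and not a kernel symbol. It shows a cheap period check followed by a try-only gate, so that at most one helper runs the expensive update while the others back off instead of queueing.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names are kernel symbols. */
static atomic_flag update_in_progress = ATOMIC_FLAG_INIT;
static _Atomic uint64_t model_jiffies;
static _Atomic uint64_t next_period = 1;    /* time at which the next tick is due */

/* Stand-in for the expensive, lock-protected jiffies update. */
static void model_update_jiffies(uint64_t now)
{
    if (now >= atomic_load(&next_period)) {
        atomic_fetch_add(&model_jiffies, 1);
        atomic_store(&next_period, now + 1);
    }
}

/*
 * Called by a CPU that is not the designated timekeeper.
 * Returns true only if this caller actually ran the update, so the
 * caller can reset its own stall bookkeeping in that case alone.
 */
static bool non_timekeeper_try_update(uint64_t now)
{
    /* Cheap check first: nothing to do if the period has not elapsed. */
    if (now < atomic_load(&next_period))
        return false;

    /* Only one helper at a time; everyone else backs off immediately. */
    if (atomic_flag_test_and_set(&update_in_progress))
        return false;

    model_update_jiffies(now);
    atomic_flag_clear(&update_in_progress);
    return true;
}
```

A caller would treat a `false` return as "someone else is handling it or nothing is due" and leave its stall counter untouched, which corresponds to the message's point about resetting state only when the update function is really called.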
Haofeng Li	391253b25f	time: Fix spelling mistakes in comments Correct several typos found in comments across various files in the kernel/time directory. No functional changes are introduced by these corrections. Signed-off-by: Haofeng Li <lihaofeng@kylinos.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2025-09-21 10:02:02 +02:00
Jiri Wiesner	b9aa93aa51	clocksource: Print durations for sync check unconditionally A typical set of messages that gets printed as a result of the clocksource watchdog finding the TSC unstable usually does not contain messages indicating CPUs being ahead of or behind the CPU from which the check is carried out. That fact suggests that the TSC does not experience time skew between CPUs (if the clocksource.verify_n_cpus parameter is set to a negative value) but quantitative information is missing. The cs_nsec_max value printed by the "CPU %d check durations" message actually provides a worst case estimate of the time skew. If all CPUs have been checked, the cs_nsec_max value multiplied by 2 is the maximum possible time skew between the TSCs of any two CPUs on the system. The worst case estimate is derived from two boundary cases: 1. No time is consumed to execute instructions between csnow_begin and csnow_mid while all the cs_nsec_max time is consumed by the code between csnow_mid and csnow_end. In this case, the maximum undetectable time skew of a CPU being ahead would be cs_nsec_max. 2. All the cs_nsec_max time is consumed to execute instructions between csnow_begin and csnow_mid while no time is consumed by the code between csnow_mid and csnow_end. In this case, the maximum undetectable time skew of a CPU being behind would be cs_nsec_max. The worst case estimate assumes a system experiencing a corner case consisting of the two boundary cases. Always print the "CPU %d check durations" message so that the maximum possible time skew measured by the TSC sync check can be compared to the time skew measured by the clocksource watchdog. Signed-off-by: Jiri Wiesner <jwiesner@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/all/aIuXXfdITXdI0lLp@incl	2025-09-09 14:08:19 +02:00
Bibo Mao	fe2a449a45	tick: Do not set device to detached state in tick_shutdown() tick_shutdown() sets the state of the clockevent device to detached first and then invokes clockevents_exchange_device(), which in turn invokes clockevents_switch_state(). But clockevents_switch_state() returns without invoking the device shutdown callback as the device is already in detached state. As a consequence the timer device is not shut down when a CPU goes offline. tick_shutdown() does this because it was originally invoked on an online CPU and not on the outgoing CPU. It therefore could not access the clockevent device of the already offlined CPU and just set the state. Since commit `3b1596a21f` tick_shutdown() is called on the outgoing CPU, so the hardware device can be accessed. Remove the state set before calling clockevents_exchange_device(), so that the subsequent clockevents_switch_state() handles the state transition and invokes the shutdown callback of the clockevent device. [ tglx: Massaged change log ] Fixes: `3b1596a21f` ("clockevents: Shutdown and unregister current clockevents at CPUHP_AP_TICK_DYING") Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250906064952.3749122-2-maobibo@loongson.cn	2025-09-09 13:39:00 +02:00
Thomas Weißschuh	3c3af563b3	hrtimer: Reorder branches in hrtimer_clockid_to_base() Align the ordering to the one used for hrtimer_bases. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-9-3ae822e5bfbd@linutronix.de	2025-09-09 12:27:18 +02:00
Thomas Weißschuh	009eb5da29	hrtimer: Remove hrtimer_clock_base::get_time The get_time() callbacks always need to match the base's clockid. Instead of maintaining that association twice in hrtimer_bases, use a helper. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-8-3ae822e5bfbd@linutronix.de	2025-09-09 12:27:18 +02:00
Thomas Weißschuh	b68b7f3e9b	sched/core: Avoid direct access to hrtimer clockbase The field timer->base->get_time is a private implementation detail and should not be accessed outside of the hrtimer core. Switch to the equivalent helper. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-3-3ae822e5bfbd@linutronix.de	2025-09-09 12:27:18 +02:00
Thomas Weißschuh	5f531fe9cb	timers/itimer: Avoid direct access to hrtimer clockbase The field timer->base->get_time is a private implementation detail and should not be accessed outside of the hrtimer core. Switch to the equivalent helper. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-2-3ae822e5bfbd@linutronix.de	2025-09-09 12:27:17 +02:00
Thomas Weißschuh	24fb08dcc4	posix-timers: Avoid direct access to hrtimer clockbase The field timer->base->get_time is a private implementation detail and should not be accessed outside of the hrtimer core. Switch to the equivalent helpers. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-1-3ae822e5bfbd@linutronix.de	2025-09-09 12:27:17 +02:00
Linus Torvalds	fe3ad7a58b	Merge tag 'sched_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Fix a stall on the CPU offline path due to mis-counting a deadline server task twice as part of the runqueue's running tasks count - Fix a realtime tasks starvation case where failure to enqueue a timer whose expiration time is already in the past would cause repeated attempts to re-enqueue a deadline server task which leads to starving the former, realtime one - Prevent a delayed deadline server task stop from breaking the per-runqueue bandwidth tracking - Have a function checking whether the deadline server task has stopped, return the correct value * tag 'sched_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Don't count nr_running for dl_server proxy tasks sched/deadline: Fix RT task potential starvation when expiry time passed sched/deadline: Always stop dl-server before changing parameters sched/deadline: Fix dl_server_stopped()	2025-08-31 09:13:00 -07:00
Linus Torvalds	4d28e28098	Merge tag 'dma-mapping-6.17-2025-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: - another small fix for arm64 systems with memory encryption (Shanker Donthineni) - fix for arm32 systems with non-standard CMA configuration (Oreoluwa Babatunde) * tag 'dma-mapping-6.17-2025-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma/pool: Ensure DMA_DIRECT_REMAP allocations are decrypted of: reserved_mem: Restructure call site for dma_contiguous_early_fixup()	2025-08-28 16:04:14 -07:00
Yicong Yang	52d15521eb	sched/deadline: Don't count nr_running for dl_server proxy tasks On CPU offline the kernel stalled with below call trace: INFO: task kworker/0:1:11 blocked for more than 120 seconds. cpuhp hold the cpu hotplug lock endless and stalled vmstat_shepherd. This is because we count nr_running twice on cpuhp enqueuing and failed the wait condition of cpuhp: enqueue_task_fair() // pick cpuhp from idle, rq->nr_running = 0 dl_server_start() [...] add_nr_running() // rq->nr_running = 1 add_nr_running() // rq->nr_running = 2 [switch to cpuhp, waiting on balance_hotplug_wait()] rcuwait_wait_event(rq->nr_running == 1 && ...) // failed, rq->nr_running=2 schedule() // wait again It doesn't make sense to count the dl_server towards runnable tasks, since it runs other tasks. Fixes: `63ba8422f8` ("sched/deadline: Introduce deadline servers") Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250627035420.37712-1-yangyicong@huawei.com	2025-08-26 10:46:01 +02:00
kuyo chang	421fc59cf5	sched/deadline: Fix RT task potential starvation when expiry time passed [Symptom] The fair server mechanism, which is intended to prevent fair starvation when higher-priority tasks monopolize the CPU. Specifically, RT tasks on the runqueue may not be scheduled as expected. [Analysis] The log "sched: DL replenish lagged too much" triggered. By memory dump of dl_server: curr = 0xFFFFFF80D6A0AC00 ( dl_server = 0xFFFFFF83CD5B1470( dl_runtime = 0x02FAF080, dl_deadline = 0x3B9ACA00, dl_period = 0x3B9ACA00, dl_bw = 0xCCCC, dl_density = 0xCCCC, runtime = 0x02FAF080, deadline = 0x0000082031EB0E80, flags = 0x0, dl_throttled = 0x0, dl_yielded = 0x0, dl_non_contending = 0x0, dl_overrun = 0x0, dl_server = 0x1, dl_server_active = 0x1, dl_defer = 0x1, dl_defer_armed = 0x0, dl_defer_running = 0x1, dl_timer = ( node = ( expires = 0x000008199756E700), _softexpires = 0x000008199756E700, function = 0xFFFFFFDB9AF44D30 = dl_task_timer, base = 0xFFFFFF83CD5A12C0, state = 0x0, is_rel = 0x0, is_soft = 0x0, clock_update_flags = 0x4, clock = 0x000008204A496900, - The timer expiration time (rq->curr->dl_server->dl_timer->expires) is already in the past, indicating the timer has expired. - The timer state (rq->curr->dl_server->dl_timer->state) is 0. [Suspected Root Cause] The relevant code flow in the throttle path of update_curr_dl_se() as follows: dequeue_dl_entity(dl_se, 0); // the DL entity is dequeued if (unlikely(is_dl_boosted(dl_se) \|\| !start_dl_timer(dl_se))) { if (dl_server(dl_se)) // timer registration fails enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);//enqueue immediately ... } The failure of `start_dl_timer` is caused by attempting to register a timer with an expiration time that is already in the past. When this situation persists, the code repeatedly re-enqueues the DL entity without properly replenishing or restarting the timer, resulting in RT task may not be scheduled as expected. 
[Proposed Solution]: Instead of immediately re-enqueuing the DL entity on timer registration failure, this change ensures the DL entity is properly replenished and the timer is restarted, preventing RT potential starvation. Fixes: `63ba8422f8` ("sched/deadline: Introduce deadline servers") Signed-off-by: kuyo chang <kuyo.chang@mediatek.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Closes: https://lore.kernel.org/CAMuHMdXn4z1pioTtBGMfQM0jsLviqS2jwysaWXpoLxWYoGa82w@mail.gmail.com Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Jiri Slaby <jirislaby@kernel.org> Tested-by: Diederik de Haas <didi.debian@cknow.org> Link: https://lkml.kernel.org/r/20250615131129.954975-1-kuyo.chang@mediatek.com	2025-08-26 10:46:01 +02:00
Juri Lelli	bb4700adc3	sched/deadline: Always stop dl-server before changing parameters Commit `cccb45d7c4` ("sched/deadline: Less agressive dl_server handling") reduced dl-server overhead by delaying disabling servers only after there are no fair task around for a whole period, which means that deadline entities are not dequeued right away on a server stop event. However, the delay opens up a window in which a request for changing server parameters can break per-runqueue running_bw tracking, as reported by Yuri. Close the problematic window by unconditionally calling dl_server_stop() before applying the new parameters (ensuring deadline entities go through an actual dequeue). Fixes: `cccb45d7c4` ("sched/deadline: Less agressive dl_server handling") Reported-by: Yuri Andriaccio <yurand2000@gmail.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20250721-upstream-fix-dlserver-lessaggressive-b4-v1-1-4ebc10c87e40@redhat.com	2025-08-26 10:46:00 +02:00
Huacai Chen	4717432dfd	sched/deadline: Fix dl_server_stopped() Commit `cccb45d7c4` ("sched/deadline: Less agressive dl_server handling") introduces dl_server_stopped(). But it is obvious that dl_server_stopped() should return true if dl_se->dl_server_active is 0. Fixes: `cccb45d7c4` ("sched/deadline: Less agressive dl_server handling") Signed-off-by: Huacai Chen <chenhuacai@loongson.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250809130419.1980742-1-chenhuacai@loongson.cn	2025-08-26 10:46:00 +02:00
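The return-value rule stated in the dl_server_stopped() message above can be illustrated with a very small model. This is a sketch only: `struct model_dl_entity` and `model_dl_server_stopped()` are hypothetical stand-ins, and whatever the real function does for an active server is deliberately left out because the message does not describe it.

```c
#include <stdbool.h>

/* Hypothetical reduced entity; only the flag named in the message is modeled. */
struct model_dl_entity {
    unsigned int dl_server_active : 1;
};

/*
 * A server that is not active counts as stopped, so the early exit must
 * report true. The handling of an active server is not modeled here and
 * is simply reported as "not stopped".
 */
static bool model_dl_server_stopped(const struct model_dl_entity *dl_se)
{
    if (!dl_se->dl_server_active)
        return true;

    return false;
}
```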
Linus Torvalds	69fd6b99b8	Merge tag 'perf_urgent_for_v6.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Borislav Petkov: - Fix a case where the events throttling logic operates on inactive events * tag 'perf_urgent_for_v6.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Avoid undefined behavior from stopping/starting inactive events	2025-08-24 10:13:05 -04:00
Linus Torvalds	14f84cd318	Merge tag 'modules-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux Pull modules fix from Daniel Gomez: "This includes a fix part of the KSPP (Kernel Self Protection Project) to replace the deprecated and unsafe strcpy() calls in the kernel parameter string handler and sysfs parameters for built-in modules. Single commit, no functional changes" * tag 'modules-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux: params: Replace deprecated strcpy() with strscpy() and memcpy()	2025-08-24 09:43:50 -04:00
Linus Torvalds	e1d8f9ccb2	Merge tag 'trace-v6.17-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix rtla and latency tooling pkg-config errors If libtraceevent and libtracefs is installed, but their corresponding '.pc' files are not installed, it reports that the libraries are missing and confuses the developer. Instead, report that the pkg-config files are missing and should be installed. - Fix overflow bug of the parser in trace_get_user() trace_get_user() uses the parsing functions to parse the user space strings. If the parser fails due to incorrect processing, it doesn't terminate the buffer with a nul byte. Add a "failed" flag to the parser that gets set when parsing fails and is used to know if the buffer is fine to use or not. - Remove a semicolon that was at an end of a comment line - Fix register_ftrace_graph() to unregister the pm notifier on error The register_ftrace_graph() registers a pm notifier but there's an error path that can exit the function without unregistering it. Since the function returns an error, it will never be unregistered. - Allocate and copy ftrace hash for reader of ftrace filter files When the set_ftrace_filter or set_ftrace_notrace files are open for read, an iterator is created and sets its hash pointer to the associated hash that represents filtering or notrace filtering to it. The issue is that the hash it points to can change while the iteration is happening. All the locking used to access the tracer's hashes are released which means those hashes can change or even be freed. Using the hash pointed to by the iterator can cause UAF bugs or similar. Have the read of these files allocate and copy the corresponding hashes and use that as that will keep them the same while the iterator is open. 
This also simplifies the code as opening it for write already does an allocate and copy, and now that the read is doing the same, there's no need to check which way it was opened on the release of the file, and the iterator hash can always be freed. - Fix function graph to copy args into temp storage The output of the function graph tracer shows both the entry and the exit of a function. When the exit is right after the entry, it combines the two events into one with the output of "function();", instead of showing: function() { } In order to do this, the iterator descriptor that reads the events includes storage that saves the entry event while it peaks at the next event in the ring buffer. The peek can free the entry event so the iterator must store the information to use it after the peek. With the addition of function graph tracer recording the args, where the args are a dynamic array in the entry event, the temp storage does not save them. This causes the args to be corrupted or even cause a read of unsafe memory. Add space to save the args in the temp storage of the iterator. - Fix race between ftrace_dump and reading trace_pipe ftrace_dump() is used when a crash occurs where the ftrace buffer will be printed to the console. But it can also be triggered by sysrq-z. If a sysrq-z is triggered while a task is reading trace_pipe it can cause a race in the ftrace_dump() where it checks if the buffer has content, then it checks if the next event is available, and then prints the output (regardless if the next event was available or not). Reading trace_pipe at the same time can cause it to not be available, and this triggers a WARN_ON in the print. 
Move the printing into the check if the next event exists or not * tag 'trace-v6.17-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Also allocate and copy hash for reading of filter files ftrace: Fix potential warning in trace_printk_seq during ftrace_dump fgraph: Copy args in intermediate storage with entry trace/fgraph: Fix the warning caused by missing unregister notifier ring-buffer: Remove redundant semicolons tracing: Limit access to parser->buffer when trace_get_user failed rtla: Check pkg-config install tools/latency-collector: Check pkg-config install	2025-08-23 10:11:34 -04:00
Steven Rostedt	bfb336cf97	ftrace: Also allocate and copy hash for reading of filter files Currently the reader of set_ftrace_filter and set_ftrace_notrace just adds the pointer to the global tracer hash to its iterator. Unlike the writer that allocates a copy of the hash, the reader keeps the pointer to the filter hashes. This is problematic because this pointer is static across function calls that release the locks that can update the global tracer hashes. This can cause UAF and similar bugs. Allocate and copy the hash for reading the filter files like it is done for the writers. This not only fixes UAF bugs, but also makes the code a bit simpler as it doesn't have to differentiate when to free the iterator's hash between writers and readers. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20250822183606.12962cc3@batman.local.home Fixes: `c20489dad1` ("ftrace: Assign iter->hash to filter or notrace hashes on seq read") Closes: https://lore.kernel.org/all/20250813023044.2121943-1-wutengda@huaweicloud.com/ Closes: https://lore.kernel.org/all/20250822192437.GA458494@ax162/ Reported-by: Tengda Wu <wutengda@huaweicloud.com> Tested-by: Tengda Wu <wutengda@huaweicloud.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-08-22 19:58:35 -04:00
Tengda Wu	4013aef2ce	ftrace: Fix potential warning in trace_printk_seq during ftrace_dump When calling ftrace_dump_one() concurrently with reading trace_pipe, a WARN_ON_ONCE() in trace_printk_seq() can be triggered due to a race condition. The issue occurs because: CPU0 (ftrace_dump) CPU1 (reader) echo z > /proc/sysrq-trigger !trace_empty(&iter) trace_iterator_reset(&iter) <- len = size = 0 cat /sys/kernel/tracing/trace_pipe trace_find_next_entry_inc(&iter) __find_next_entry ring_buffer_empty_cpu <- all empty return NULL trace_printk_seq(&iter.seq) WARN_ON_ONCE(s->seq.len >= s->seq.size) In the context between trace_empty() and trace_find_next_entry_inc() during ftrace_dump, the ring buffer data was consumed by other readers. This caused trace_find_next_entry_inc to return NULL, failing to populate `iter.seq`. At this point, due to the prior trace_iterator_reset, both `iter.seq.len` and `iter.seq.size` were set to 0. Since they are equal, the WARN_ON_ONCE condition is triggered. Move the trace_printk_seq() into the if block that checks to make sure the return value of trace_find_next_entry_inc() is non-NULL in ftrace_dump_one(), ensuring the 'iter.seq' is properly populated before subsequent operations. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ingo Molnar <mingo@elte.hu> Link: https://lore.kernel.org/20250822033343.3000289-1-wutengda@huaweicloud.com Fixes: `d769041f86` ("ring_buffer: implement new locking") Signed-off-by: Tengda Wu <wutengda@huaweicloud.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-08-22 17:32:36 -04:00
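The race described in the ftrace_dump message above comes down to where the print call sits relative to the "next entry found" check. The following user-space sketch models that under loud assumptions: `model_seq`, `model_seq_warns()`, `model_format_entry()` and `model_dump_once()` are hypothetical names, the buffer sizes are arbitrary, and no real tracing API is used. The `print_inside_check` flag switches between the old flow (print unconditionally) and the fixed flow (print only when an entry was found).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the sequence buffer state. */
struct model_seq {
    size_t len;
    size_t size;
};

/* The condition quoted in the message: true when len >= size. */
static bool model_seq_warns(const struct model_seq *s)
{
    return s->len >= s->size;
}

/* Pretend that formatting one entry populates the buffer. */
static void model_format_entry(struct model_seq *s)
{
    s->size = 64;
    s->len = 10;
}

/*
 * One dump iteration. entry_available models whether the lookup for the
 * next entry succeeded (a concurrent reader may have consumed it).
 * Returns true if the warning condition was hit.
 */
static bool model_dump_once(bool entry_available, bool print_inside_check)
{
    struct model_seq seq = { 0, 0 };    /* state right after an iterator reset */
    bool warned = false;

    if (entry_available) {
        model_format_entry(&seq);
        if (print_inside_check)
            warned = model_seq_warns(&seq);     /* fixed flow */
    }
    if (!print_inside_check)
        warned = model_seq_warns(&seq);         /* old flow: always print */

    return warned;
}
```

In this model only the combination "entry consumed by someone else" plus "print unconditionally" trips the condition, which matches the scenario in the message.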
Steven Rostedt	e3d01979e4	fgraph: Copy args in intermediate storage with entry The output of the function graph tracer has two ways to display its entries. One way for leaf functions with no events recorded within them, and the other is for functions with events recorded inside it. As function graph has an entry and exit event, to simplify the output of leaf functions it combines the two, where as non leaf functions are separate: 2) \| invoke_rcu_core() { 2) \| raise_softirq() { 2) 0.391 us \| __raise_softirq_irqoff(); 2) 1.191 us \| } 2) 2.086 us \| } The __raise_softirq_irqoff() function above is really two events that were merged into one. Otherwise it would have looked like: 2) \| invoke_rcu_core() { 2) \| raise_softirq() { 2) \| __raise_softirq_irqoff() { 2) 0.391 us \| } 2) 1.191 us \| } 2) 2.086 us \| } In order to do this merge, the reading of the trace output file needs to look at the next event before printing. But since the pointer to the event is on the ring buffer, it needs to save the entry event before it looks at the next event as the next event goes out of focus as soon as a new event is read from the ring buffer. After it reads the next event, it will print the entry event with either the '{' (non leaf) or ';' and timestamps (leaf). The iterator used to read the trace file has storage for this event. The problem happens when the function graph tracer has arguments attached to the entry event as the entry now has a variable length "args" field. This field only gets set when funcargs option is used. But the args are not recorded in this temp data and garbage could be printed. The entry field is copied via: data->ent = *curr; Where "curr" is the entry field. But this method only saves the non variable length fields from the structure. Add a helper structure to the iterator data that adds the max args size to the data storage in the iterator. Then simply copy the entire entry into this storage (with size protection). 
Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/20250820195522.51d4a268@gandalf.local.home Reported-by: Sasha Levin <sashal@kernel.org> Tested-by: Sasha Levin <sashal@kernel.org> Closes: https://lore.kernel.org/all/aJaxRVKverIjF4a6@lappy/ Fixes: `ff5c9c576e` ("ftrace: Add support for function argument to graph tracer") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-08-22 17:32:35 -04:00
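The copy problem described in the fgraph message above (a plain struct assignment such as `data->ent = *curr` saves only the fixed-size fields) can be sketched in portable C. Everything here is an assumption made for illustration: `model_entry`, `model_saved_entry`, `MODEL_MAX_ARGS` and `model_save_entry()` are hypothetical, and a union with a byte buffer is used instead of whatever layout the kernel patch chose. The sketch shows storage sized for the largest args tail and a size-capped copy of the whole entry.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_ARGS 6    /* assumed cap, chosen only for this example */

/* Entry event with a variable-length tail. */
struct model_entry {
    unsigned long func;
    int depth;
    unsigned long args[];   /* flexible array member: not covered by sizeof */
};

/* Temp storage: the fixed header plus room for the largest args tail. */
union model_saved_entry {
    struct model_entry ent;
    unsigned char raw[sizeof(struct model_entry) +
                      MODEL_MAX_ARGS * sizeof(unsigned long)];
};

/*
 * Copy the whole entry, including args, but never more than the storage
 * can hold. A struct assignment (dst->ent = *src) would copy only
 * sizeof(struct model_entry) bytes and drop the args.
 * Returns the number of bytes copied.
 */
static size_t model_save_entry(union model_saved_entry *dst,
                               const struct model_entry *src,
                               size_t src_size)
{
    size_t n = src_size < sizeof(dst->raw) ? src_size : sizeof(dst->raw);

    memcpy(dst->raw, src, n);
    return n;
}
```

The caller passes the real size of the event, header plus however many args were recorded, and later reads `dst->ent.args[i]` from the saved copy rather than from the original buffer.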
Linus Torvalds	6eba757ce9	Merge tag 'mm-hotfixes-stable-2025-08-21-18-17' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "20 hotfixes. 10 are cc:stable and the remainder address post-6.16 issues or aren't considered necessary for -stable kernels. 17 of these fixes are for MM. As usual, singletons all over the place, apart from a three-patch series of KHO followup work from Pasha which is actually also a bunch of singletons" * tag 'mm-hotfixes-stable-2025-08-21-18-17' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/mremap: fix WARN with uffd that has remap events disabled mm/damon/sysfs-schemes: put damos dests dir after removing its files mm/migrate: fix NULL movable_ops if CONFIG_ZSMALLOC=m mm/damon/core: fix damos_commit_filter not changing allow mm/memory-failure: fix infinite UCE for VM_PFNMAP pfn MAINTAINERS: mark MGLRU as maintained mm: rust: add page.rs to MEMORY MANAGEMENT - RUST iov_iter: iterate_folioq: fix handling of offset >= folio size selftests/damon: fix selftests by installing drgn related script .mailmap: add entry for Easwar Hariharan selftests/mm: add test for invalid multi VMA operations mm/mremap: catch invalid multi VMA moves earlier mm/mremap: allow multi-VMA move when filesystem uses thp_get_unmapped_area mm/damon/core: fix commit_ops_filters by using correct nth function tools/testing: add linux/args.h header and fix radix, VMA tests mm/debug_vm_pgtable: clear page table entries at destroy_args() squashfs: fix memory leak in squashfs_fill_super kho: warn if KHO is disabled due to an error kho: mm: don't allow deferred struct page with KHO kho: init new_physxa->phys_bits to fix lockdep	2025-08-22 08:54:34 -04:00
Linus Torvalds	3957a57201	Merge tag 'cgroup-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix NULL de-ref in css_rstat_exit() which could happen after allocation failure - Fix a cpuset partition handling bug and a couple other misc issues - Doc spelling fix * tag 'cgroup-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: docs: cgroup: fixed spelling mistakes in documentation cgroup: avoid null de-ref in css_rstat_exit() cgroup/cpuset: Remove the unnecessary css_get/put() in cpuset_partition_write() cgroup/cpuset: Fix a partition error with CPU hotplug cgroup/cpuset: Use static_branch_enable_cpuslocked() on cpusets_insane_config_key	2025-08-21 16:31:27 -04:00
Linus Torvalds	d72052ac09	Merge tag 'sched_ext-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix a subtle bug during SCX enabling where a dead task skips init but doesn't skip sched class switch leading to invalid task state transition warning - Cosmetic fix in selftests * tag 'sched_ext-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: selftests/sched_ext: Remove duplicate sched.h header sched/ext: Fix invalid task state transitions on class switch	2025-08-21 16:02:35 -04:00
Linus Torvalds	068a56e56f	Merge tag 'probes-fixes-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fix from Masami Hiramatsu: "Sanitize wildcard for fprobe event name Fprobe event accepts wildcards for the target functions, but unless the user specifies its event name, it makes an event with the wildcards. Replace the wildcard '*' with the underscore '_'" * tag 'probes-fixes-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: fprobe-event: Sanitize wildcard for fprobe event name	2025-08-20 16:29:30 -07:00
Masami Hiramatsu (Google)	ec879e1a0b	tracing: fprobe-event: Sanitize wildcard for fprobe event name Fprobe event accepts wildcards for the target functions, but unless the user specifies its event name, it makes an event with the wildcards. /sys/kernel/tracing # echo 'f mutex*' >> dynamic_events /sys/kernel/tracing # cat dynamic_events f:fprobes/mutex*__entry mutex* /sys/kernel/tracing # ls events/fprobes/ enable filter mutex*__entry To fix this, replace the wildcard ('*') with an underscore. Link: https://lore.kernel.org/all/175535345114.282990.12294108192847938710.stgit@devnote2/ Fixes: `334e5519c3` ("tracing/probes: Add fprobe events for tracing function entry and exit.") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: stable@vger.kernel.org	2025-08-20 23:41:58 +09:00
Ye Weihua	edede7a6dc	trace/fgraph: Fix the warning caused by missing unregister notifier This warning was triggered during testing on v6.16: notifier callback ftrace_suspend_notifier_call already registered WARNING: CPU: 2 PID: 86 at kernel/notifier.c:23 notifier_chain_register+0x44/0xb0 ... Call Trace: <TASK> blocking_notifier_chain_register+0x34/0x60 register_ftrace_graph+0x330/0x410 ftrace_profile_write+0x1e9/0x340 vfs_write+0xf8/0x420 ? filp_flush+0x8a/0xa0 ? filp_close+0x1f/0x30 ? do_dup2+0xaf/0x160 ksys_write+0x65/0xe0 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x77/0x7f When writing to the function_profile_enabled interface, the notifier was not unregistered after start_graph_tracing failed, causing a warning the next time function_profile_enabled was written. Fixed by adding unregister_pm_notifier in the exception path. Link: https://lore.kernel.org/20250818073332.3890629-1-yeweihua4@huawei.com Fixes: `4a2b8dda3f` ("tracing/function-graph-tracer: fix a regression while suspend to disk") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Ye Weihua <yeweihua4@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-08-20 09:21:03 -04:00
Liao Yuanhong	cd6e4faba9	ring-buffer: Remove redundant semicolons Remove unnecessary semicolons. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250813095114.559530-1-liaoyuanhong@vivo.com Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-08-20 09:20:30 -04:00
Pu Lehui	6a909ea83f	tracing: Limit access to parser->buffer when trace_get_user failed When the length of the string written to set_ftrace_filter exceeds FTRACE_BUFF_MAX, the following KASAN alarm will be triggered: BUG: KASAN: slab-out-of-bounds in strsep+0x18c/0x1b0 Read of size 1 at addr ffff0000d00bd5ba by task ash/165 CPU: 1 UID: 0 PID: 165 Comm: ash Not tainted 6.16.0-g6bcdbd62bd56-dirty Hardware name: linux,dummy-virt (DT) Call trace: show_stack+0x34/0x50 (C) dump_stack_lvl+0xa0/0x158 print_address_description.constprop.0+0x88/0x398 print_report+0xb0/0x280 kasan_report+0xa4/0xf0 __asan_report_load1_noabort+0x20/0x30 strsep+0x18c/0x1b0 ftrace_process_regex.isra.0+0x100/0x2d8 ftrace_regex_release+0x484/0x618 __fput+0x364/0xa58 ____fput+0x28/0x40 task_work_run+0x154/0x278 do_notify_resume+0x1f0/0x220 el0_svc+0xec/0xf0 el0t_64_sync_handler+0xa0/0xe8 el0t_64_sync+0x1ac/0x1b0 The reason is that trace_get_user fails when processing a string longer than FTRACE_BUFF_MAX, but does not set the end of parser->buffer to 0. An OOB access is then triggered in ftrace_regex_release->ftrace_process_regex->strsep->strpbrk. We can solve this problem by limiting access to parser->buffer when trace_get_user fails. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250813040232.1344527-1-pulehui@huaweicloud.com Fixes: `8c9af478c0` ("ftrace: Handle commands when closing set_ftrace_filter file") Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-08-20 09:20:30 -04:00
Pasha Tatashin	44958f2025	kho: warn if KHO is disabled due to an error During boot, the scratch area is allocated based on command line parameters or auto calculated. However, the scratch area may fail to allocate, and in that case KHO is disabled. Currently, no warning is printed that KHO is disabled, which makes it confusing for the end user to figure out why KHO is not available. Add the missing warning message. Link: https://lkml.kernel.org/r/20250808201804.772010-4-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Vasilevsky <dave@vasilevsky.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Kees Cook <kees@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-19 16:35:53 -07:00
Pasha Tatashin	8b66ed2c3f	kho: mm: don't allow deferred struct page with KHO KHO uses struct pages for the preserved memory early in boot, however, with deferred struct page initialization, only a small portion of memory has properly initialized struct pages. This problem was detected where vmemmap is poisoned, and illegal flag combinations are detected. Don't allow them to be enabled together, and later we will have to teach KHO to work properly with deferred struct page init kernel feature. Link: https://lkml.kernel.org/r/20250808201804.772010-3-pasha.tatashin@soleen.com Fixes: `4e1d010e3b` ("kexec: add config option for KHO") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Vasilevsky <dave@vasilevsky.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Kees Cook <kees@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-19 16:35:53 -07:00
Pasha Tatashin	63b17b653d	kho: init new_physxa->phys_bits to fix lockdep Patch series "Several KHO Hotfixes". Three unrelated fixes for Kexec Handover. This patch (of 3): Lockdep shows the following warning: INFO: trying to register non-static key. The code is fine but needs lockdep annotation, or maybe you didn't initialize this object before use? turning off the locking correctness validator. [<ffffffff810133a6>] dump_stack_lvl+0x66/0xa0 [<ffffffff8136012c>] assign_lock_key+0x10c/0x120 [<ffffffff81358bb4>] register_lock_class+0xf4/0x2f0 [<ffffffff813597ff>] __lock_acquire+0x7f/0x2c40 [<ffffffff81360cb0>] ? __pfx_hlock_conflict+0x10/0x10 [<ffffffff811707be>] ? native_flush_tlb_global+0x8e/0xa0 [<ffffffff8117096e>] ? __flush_tlb_all+0x4e/0xa0 [<ffffffff81172fc2>] ? __kernel_map_pages+0x112/0x140 [<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0 [<ffffffff81359556>] lock_acquire+0xe6/0x280 [<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0 [<ffffffff8100b9e0>] _raw_spin_lock+0x30/0x40 [<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0 [<ffffffff813ec327>] xa_load_or_alloc+0x67/0xe0 [<ffffffff813eb4c0>] kho_preserve_folio+0x90/0x100 [<ffffffff813ebb7f>] __kho_finalize+0xcf/0x400 [<ffffffff813ebef4>] kho_finalize+0x34/0x70 This is because xa has its own lock, which is not initialized in xa_load_or_alloc. Modify __kho_preserve_order() to properly call xa_init(&new_physxa->phys_bits); Link: https://lkml.kernel.org/r/20250808201804.772010-2-pasha.tatashin@soleen.com Fixes: `fc33e4b44b` ("kexec: enable KHO support for memory preservation") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Vasilevsky <dave@vasilevsky.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-19 16:35:53 -07:00
Linus Torvalds	055f213075	Merge tag 'vfs-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix two memory leaks in pidfs - Prevent changing the idmapping of an already idmapped mount without OPEN_TREE_CLONE through open_tree_attr() - Don't fail listing extended attributes in kernfs when no extended attributes are set - Fix the return value in coredump_parse() - Fix the error handling for unbuffered writes in netfs - Fix broken data integrity guarantees for O_SYNC writes via iomap - Fix UAF in __mark_inode_dirty() - Keep inode->i_blkbits constant in fuse - Fix coredump selftests - Fix get_unused_fd_flags() usage in do_handle_open() - Rename EXPORT_SYMBOL_GPL_FOR_MODULES to EXPORT_SYMBOL_FOR_MODULES - Fix use-after-free in bh_read() - Fix incorrect lflags value in the move_mount() syscall * tag 'vfs-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: signal: Fix memory leak for PIDFD_SELF* sentinels kernfs: don't fail listing extended attributes coredump: Fix return value in coredump_parse() fs/buffer: fix use-after-free when call bh_read() helper pidfs: Fix memory leak in pidfd_info() netfs: Fix unbuffered write error handling fhandle: do_handle_open() should get FD with user flags module: Rename EXPORT_SYMBOL_GPL_FOR_MODULES to EXPORT_SYMBOL_FOR_MODULES fs: fix incorrect lflags value in the move_mount syscall selftests/coredump: Remove the read() that fails the test fuse: keep inode->i_blkbits constant iomap: Fix broken data integrity guarantees for O_SYNC writes selftests/mount_setattr: add smoke tests for open_tree_attr(2) bug open_tree_attr: do not allow id-mapping changes without OPEN_TREE_CLONE fs: writeback: fix use-after-free in __mark_inode_dirty()	2025-08-19 09:54:47 -07:00
Adrian Huang (Lenovo)	a2c1f82618	signal: Fix memory leak for PIDFD_SELF* sentinels Commit `f08d0c3a71` ("pidfd: add PIDFD_SELF* sentinels to refer to own thread/process") introduced a leak by acquiring a pid reference through get_task_pid(), which increments pid->count but never drops it with put_pid(). As a result, kmemleak reports unreferenced pid objects after running tools/testing/selftests/pidfd/pidfd_test, for example: unreferenced object 0xff1100206757a940 (size 160): comm "pidfd_test", pid 16965, jiffies 4294853028 hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 00 00 00 00 fd 57 50 04 .............WP. 5e 44 00 00 00 00 00 00 18 de 34 17 01 00 11 ff ^D........4..... backtrace (crc cd8844d4): kmem_cache_alloc_noprof+0x2f4/0x3f0 alloc_pid+0x54/0x3d0 copy_process+0xd58/0x1740 kernel_clone+0x99/0x3b0 __do_sys_clone3+0xbe/0x100 do_syscall_64+0x7b/0x2c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e Fix this by calling put_pid() after do_pidfd_send_signal() returns. Fixes: `f08d0c3a71` ("pidfd: add PIDFD_SELF* sentinels to refer to own thread/process") Signed-off-by: Adrian Huang (Lenovo) <adrianhuang0701@gmail.com> Link: https://lore.kernel.org/20250818134310.12273-1-adrianhuang0701@gmail.com Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-19 13:51:28 +02:00
Linus Torvalds	0a9ee9ce49	Merge tag 'locking_urgent_for_v6.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Borislav Petkov: - Make sure sanity checks down in the mutex lock path happen on the correct type of task so that they don't trigger falsely - Use the write unsafe user access pairs when writing a futex value to prevent an error on PowerPC which does user read and write accesses differently * tag 'locking_urgent_for_v6.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking: Fix __clear_task_blocked_on() warning from __ww_mutex_wound() path futex: Use user_write_access_begin/_end() in futex_put_value()	2025-08-17 05:57:47 -07:00
Thorsten Blum	5eb4b9a4cd	params: Replace deprecated strcpy() with strscpy() and memcpy() strcpy() is deprecated; use strscpy() and memcpy() instead. In param_set_copystring(), we can safely use memcpy() because we already know the length of the source string 'val' and that it is guaranteed to be NUL-terminated within the first 'kps->maxlen' bytes. Link: https://github.com/KSPP/linux/issues/88 Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Link: https://lore.kernel.org/r/20250813132200.184064-2-thorsten.blum@linux.dev Signed-off-by: Daniel Gomez <da.gomez@samsung.com>	2025-08-16 21:47:25 +02:00
Yunseong Kim	b64fdd422a	perf: Avoid undefined behavior from stopping/starting inactive events Calling pmu->start()/stop() on perf events in PERF_EVENT_STATE_OFF can leave event->hw.idx at -1. When PMU drivers later attempt to use this negative index as a shift exponent in bitwise operations, it leads to UBSAN shift-out-of-bounds reports. The issue is a logical flaw in how event groups handle throttling when some members are intentionally disabled. Based on the analysis and the reproducer provided by Mark Rutland (the issue occurs on both arm64 and x86-64), the scenario unfolds as follows: 1. A group leader event is configured with a very aggressive sampling period (e.g., sample_period = 1). This causes frequent interrupts and triggers the throttling mechanism. 2. A child event in the same group is created in a disabled state (.disabled = 1). This event remains in PERF_EVENT_STATE_OFF. Since it hasn't been scheduled onto the PMU, its event->hw.idx remains initialized at -1. 3. When throttling occurs, perf_event_throttle_group() and later perf_event_unthrottle_group() iterate through all siblings, including the disabled child event. 4. perf_event_throttle()/unthrottle() are called on this inactive child event, which then call event->pmu->start()/stop(). 5. The PMU driver receives the event with hw.idx == -1 and attempts to use it as a shift exponent, e.g., in macros like PMCNTENSET(idx), leading to the UBSAN report. The throttling mechanism attempts to start/stop events that are not actively scheduled on the hardware. Move the state check into perf_event_throttle()/perf_event_unthrottle() so that inactive events are skipped entirely. This ensures only active events with a valid hw.idx are processed, preventing undefined behavior and silencing UBSAN warnings. The corrected check ensures true before proceeding with PMU operations. The problem can be reproduced with the syzkaller reproducer: Fixes: `9734e25fbf` ("perf: Fix the throttle logic for a group") Signed-off-by: Yunseong Kim <ysk@kzalloc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20250812181046.292382-2-ysk@kzalloc.com	2025-08-15 13:12:56 +02:00
Linus Torvalds	63467137ec	Merge tag 'net-6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from Netfilter and IPsec. Current release - regressions: - netfilter: nft_set_pipapo: - don't return bogus extension pointer - fix null deref for empty set Current release - new code bugs: - core: prevent deadlocks when enabling NAPIs with mixed kthread config - eth: netdevsim: Fix wild pointer access in nsim_queue_free(). Previous releases - regressions: - page_pool: allow enabling recycling late, fix false positive warning - sched: ets: use old 'nbands' while purging unused classes - xfrm: - restore GSO for SW crypto - bring back device check in validate_xmit_xfrm - tls: handle data disappearing from under the TLS ULP - ptp: prevent possible ABBA deadlock in ptp_clock_freerun() - eth: - bnxt: fill data page pool with frags if PAGE_SIZE > BNXT_RX_PAGE_SIZE - hv_netvsc: fix panic during namespace deletion with VF Previous releases - always broken: - netfilter: fix refcount leak on table dump - vsock: do not allow binding to VMADDR_PORT_ANY - sctp: linearize cloned gso packets in sctp_rcv - eth: - hibmcge: fix the division by zero issue - microchip: fix KSZ8863 reset problem" * tag 'net-6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (54 commits) net: usb: asix_devices: add phy_mask for ax88772 mdio bus net: kcm: Fix race condition in kcm_unattach() selftests: net/forwarding: test purge of active DWRR classes net/sched: ets: use old 'nbands' while purging unused classes bnxt: fill data page pool with frags if PAGE_SIZE > BNXT_RX_PAGE_SIZE netdevsim: Fix wild pointer access in nsim_queue_free(). net: mctp: Fix bad kfree_skb in bind lookup test netfilter: nf_tables: reject duplicate device on updates ipvs: Fix estimator kthreads preferred affinity netfilter: nft_set_pipapo: fix null deref for empty set selftests: tls: test TCP stealing data from under the TLS socket tls: handle data disappearing from under the TLS ULP ptp: prevent possible ABBA deadlock in ptp_clock_freerun() ixgbe: prevent from unwanted interface name changes devlink: let driver opt out of automatic phys_port_name generation net: prevent deadlocks when enabling NAPIs with mixed kthread config net: update NAPI threaded config even for disabled NAPIs selftests: drv-net: don't assume device has only 2 queues docs: Fix name for net.ipv4.udp_child_hash_entries riscv: dts: thead: Add APB clocks for TH1520 GMACs ...	2025-08-14 07:14:30 -07:00
Shanker Donthineni	89a2d212bd	dma/pool: Ensure DMA_DIRECT_REMAP allocations are decrypted When CONFIG_DMA_DIRECT_REMAP is enabled, atomic pool pages are remapped via dma_common_contiguous_remap() using the supplied pgprot. Currently, the mapping uses pgprot_dmacoherent(PAGE_KERNEL), which leaves the memory encrypted on systems with memory encryption enabled (e.g., ARM CCA Realms). This can cause the DMA layer to fail or crash when accessing the memory, as the underlying physical pages are not configured as expected. Fix this by requesting a decrypted mapping in the vmap() call: pgprot_decrypted(pgprot_dmacoherent(PAGE_KERNEL)) This ensures that atomic pool memory is consistently mapped unencrypted. Cc: stable@vger.kernel.org Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20250811181759.998805-1-sdonthineni@nvidia.com	2025-08-13 11:02:10 +02:00
John Stultz	21924af67d	locking: Fix __clear_task_blocked_on() warning from __ww_mutex_wound() path The __clear_task_blocked_on() helper added a number of sanity checks ensuring we hold the mutex wait lock and that the task whose blocked_on pointer we are clearing (if set) matches the mutex. However, there is an edge case in the __ww_mutex_wound() logic where we need to clear the blocked_on pointer for the task that owns the mutex, not the task that is waiting on the mutex. For this case the sanity checks aren't valid, so handle this by allowing a NULL lock to skip the additional checks. K Prateek Nayak and Maarten Lankhorst also pointed out that in this case where we don't hold the owner's mutex wait_lock, we need to be a bit more careful using READ_ONCE/WRITE_ONCE in both the __clear_task_blocked_on() and __set_task_blocked_on() implementations to avoid accidentally tripping WARN_ONs if two instances race. So do that here as well. This issue was easier to miss, I realized, as the test-ww_mutex driver only exercises the wait-die class of ww_mutexes. I've sent a patch[1] to address this so the logic will be easier to test. [1]: https://lore.kernel.org/lkml/20250801023358.562525-2-jstultz@google.com/ Fixes: `a4f0b6fef4` ("locking/mutex: Add p->blocked_on wrappers for correctness checks") Closes: https://lore.kernel.org/lkml/68894443.a00a0220.26d0e1.0015.GAE@google.com/ Reported-by: syzbot+602c4720aed62576cd79@syzkaller.appspotmail.com Reported-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250805001026.2247040-1-jstultz@google.com	2025-08-13 10:34:54 +02:00
Frederic Weisbecker	c0a23bbc98	ipvs: Fix estimator kthreads preferred affinity The estimator kthreads' affinity is defined by sysctl overwritten preferences and applied through a plain call to the scheduler's affinity API. However, since the introduction of managed kthreads preferred affinity, such a practice shortcuts the kthreads core code, which eventually overwrites the target to the default unbound affinity. Fix this by using the appropriate kthread's API. Fixes: `d1a8919758` ("kthread: Default affine kthread to its preferred NUMA node") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-08-13 08:34:33 +02:00
Andrea Righi	ddf7233fca	sched/ext: Fix invalid task state transitions on class switch When enabling a sched_ext scheduler, we may trigger invalid task state transitions, resulting in warnings like the following (which can be easily reproduced by running the hotplug selftest in a loop): sched_ext: Invalid task state transition 0 -> 3 for fish[770] WARNING: CPU: 18 PID: 787 at kernel/sched/ext.c:3862 scx_set_task_state+0x7c/0xc0 ... RIP: 0010:scx_set_task_state+0x7c/0xc0 ... Call Trace: <TASK> scx_enable_task+0x11f/0x2e0 switching_to_scx+0x24/0x110 scx_enable.isra.0+0xd14/0x13d0 bpf_struct_ops_link_create+0x136/0x1a0 __sys_bpf+0x1edd/0x2c30 __x64_sys_bpf+0x21/0x30 do_syscall_64+0xbb/0x370 entry_SYSCALL_64_after_hwframe+0x77/0x7f This happens because we skip initialization for tasks that are already dead (with their usage counter set to zero), but we don't exclude them during the scheduling class transition phase. Fix this by also skipping dead tasks during class switching, preventing invalid task state transitions. Fixes: `a8532fac7b` ("sched_ext: TASK_DEAD tasks must be switched into SCX on ops_enable") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-08-11 06:56:37 -10:00
Waiman Long	dfb36e4a8d	futex: Use user_write_access_begin/_end() in futex_put_value() Commit `cec199c5e3` ("futex: Implement FUTEX2_NUMA") introduced the futex_put_value() helper to write a value to the given user address. However, it uses user_read_access_begin() before the write. For architectures that differentiate between read and write accesses, like PowerPC, futex_put_value() fails with -EFAULT. Fix that by using the user_write_access_begin/user_write_access_end() pair instead. Fixes: `cec199c5e3` ("futex: Implement FUTEX2_NUMA") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20250811141147.322261-1-longman@redhat.com	2025-08-11 17:53:21 +02:00
Oreoluwa Babatunde	2c223f7239	of: reserved_mem: Restructure call site for dma_contiguous_early_fixup() Restructure the call site for dma_contiguous_early_fixup() to where the reserved_mem nodes are being parsed from the DT so that dma_mmu_remap[] is populated before dma_contiguous_remap() is called. Fixes: `8a6e02d0c0` ("of: reserved_mem: Restructure how the reserved memory regions are processed") Signed-off-by: Oreoluwa Babatunde <oreoluwa.babatunde@oss.qualcomm.com> Tested-by: William Zhang <william.zhang@broadcom.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20250806172421.2748302-1-oreoluwa.babatunde@oss.qualcomm.com	2025-08-11 13:05:38 +02:00
Frederic Weisbecker	61399e0c54	rcu: Fix racy re-initialization of irq_work causing hangs RCU re-initializes the deferred QS irq work every time before attempting to queue it. However, there are situations where the irq work is attempted to be queued even though it is already queued. In that case re-initializing messes up the irq work queue that is about to be handled. The chances for that to happen are higher when the architecture doesn't support self-IPIs and irq work are then all lazy, such as with the following sequence: 1) rcu_read_unlock() is called when IRQs are disabled and there is a grace period involving blocked tasks on the node. The irq work is then initialized and queued. 2) The related tasks are unblocked and the CPU quiescent state is reported. rdp->defer_qs_iw_pending is reset to DEFER_QS_IDLE, allowing the irq work to be requeued in the future (note the previous one hasn't fired yet). 3) A new grace period starts and the node has blocked tasks. 4) rcu_read_unlock() is called when IRQs are disabled again. The irq work is re-initialized (but it's queued! and its node is cleared) and requeued. Which means it's requeued to itself. 5) The irq work finally fires with the tick. But since it was requeued to itself, it loops and hangs. Fix this by initializing the irq work only once before the CPU boots. Fixes: `b41642c877` ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202508071303.c1134cce-lkp@intel.com Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>	2025-08-11 08:43:49 +05:30
Linus Torvalds	b96ddbc5c8	Merge tag 'smp_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull smp fixes from Borislav Petkov: - Remove an obsolete comment and fix spelling * tag 'smp_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu: Remove obsolete comment from takedown_cpu() smp: Fix spelling in on_each_cpu_cond_mask()'s doc-comment	2025-08-10 08:51:37 +03:00
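
Note on commit ec879e1a0b above: the wildcard replacement it describes can be illustrated with a small user-space sketch. This is not the kernel's code; the helper name `replace_wildcards` is hypothetical and exists only for this illustration. It assumes the auto-generated event name is the target pattern followed by `__entry`, as shown in the commit message, and rewrites each wildcard character to an underscore so the result is usable as an event directory name.

```c
#include <stddef.h>

/*
 * Illustrative sketch only (hypothetical helper, not the kernel's
 * implementation): rewrite every '*' in an auto-generated event name
 * to '_' in place, and return how many characters were replaced.
 */
static size_t replace_wildcards(char *name)
{
	size_t replaced = 0;

	for (char *p = name; *p != '\0'; p++) {
		if (*p == '*') {
			*p = '_';
			replaced++;
		}
	}
	return replaced;
}
```

Under these assumptions, calling `replace_wildcards()` on a buffer holding `mutex*__entry` turns it into `mutex___entry`, while a name without wildcards is left unchanged.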


49039 Commits