Commit Graph

51107 Commits

Author SHA1 Message Date
Michal Koutný
985215804d sched/rt: Cleanup global RT bandwidth functions
The commit 5f6bd380c7 ("sched/rt: Remove default bandwidth control")
and followup changes made a few of the functions unnecessary, drop them
for simplicity.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260323-sched-rert_groups-v3-3-1e7d5ed6b249@suse.com
2026-04-08 13:11:44 +02:00
Michal Koutný
4f70a0456d sched/rt: Move group schedulability check to sched_rt_global_validate()
The sched_rt_global_constraints() function is a remnant that used to set
up global RT throttling but that is no more since commit 5f6bd380c7
("sched/rt: Remove default bandwidth control") and the function ended up
only doing schedulability check.
Move the check into the validation function where it fits better.
(The order of validations sched_dl_global_validate() and
sched_rt_global_validate() shouldn't matter.)

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260323-sched-rert_groups-v3-2-1e7d5ed6b249@suse.com
2026-04-08 13:11:44 +02:00
Michal Koutný
8b016dcec9 sched/rt: Skip group schedulable check with rt_group_sched=0
The warning from the commit 87f1fb77d8 ("sched: Add RT_GROUP WARN
checks for non-root task_groups") is wrong -- it assumes that only
task_groups with rt_rq are traversed, however, the schedulability check
would iterate all task_groups even when rt_group_sched=0 is disabled at
boot time but some non-root task_groups exist.

The schedulability check is supposed to validate:
  a) that children don't overcommit its parent,
  b) no RT task group overcommits global RT limit.
but with rt_group_sched=0 there is no (non-trivial) hierarchy of RT groups,
therefore skip the validation altogether. Otherwise, writes to the
global sched_rt_runtime_us knob will be rejected with incorrect
validation error.

This fix is immaterial with CONFIG_RT_GROUP_SCHED=n.

Fixes: 87f1fb77d8 ("sched: Add RT_GROUP WARN checks for non-root task_groups")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260323-sched-rert_groups-v3-1-1e7d5ed6b249@suse.com
2026-04-08 13:11:44 +02:00
K Prateek Nayak
556146ce5e sched/fair: Avoid overflow in enqueue_entity()
Here is one scenario which was triggered when running:

    stress-ng --yield=32 -t 10000000s&
    while true; do perf bench sched messaging -p -t -l 100000 -g 16; done

on a 256CPUs machine after about an hour into the run:

    __enqeue_entity: entity_key(-141245081754) weight(90891264) overflow_mul(5608800059305154560) vlag(57498) delayed?(0)
    cfs_rq: zero_vruntime(3809707759657809) sum_w_vruntime(0) sum_weight(0) nr_queued(1)
    cfs_rq->curr: entity_key(0) vruntime(3809707759657809) deadline(3809723966988476) weight(37)

The above comes from __enqueue_entity() after a place_entity(). Breaking
this down:

    vlag_initial = 57498
    vlag = (57498 * (37 + 90891264)) / 37 = 141,245,081,754

    vruntime = 3809707759657809 - 141245081754 = 3,809,566,514,576,055
    entity_key(se, cfs_rq) = -141,245,081,754

Now, multiplying the entity_key with its own weight results to
5,608,800,059,305,154,560 (same as what overflow_mul() suggests) but
in Python, without overflow, this would be: -1,2837,944,014,404,397,056

Avoid the overflow (without doing the division for avg_vruntime()), by moving
zero_vruntime to the new entity when it is heavier.

Fixes: 4823725d9d ("sched/fair: Increase weight bits for avg_vruntime")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
[peterz: suggested 'weight > load' condition]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260407120052.GG3738010@noisy.programming.kicks-ass.net
2026-04-07 14:02:00 +02:00
Joseph Salisbury
c6e80201e0 sched: Use u64 for bandwidth ratio calculations
to_ratio() computes BW_SHIFT-scaled bandwidth ratios from u64 period and
runtime values, but it returns unsigned long.  tg_rt_schedulable() also
stores the current group limit and the accumulated child sum in unsigned
long.

On 32-bit builds, large bandwidth ratios can be truncated and the RT
group sum can wrap when enough siblings are present.  That can let an
overcommitted RT hierarchy pass the schedulability check, and it also
narrows the helper result for other callers.

Return u64 from to_ratio() and use u64 for the RT group totals so
bandwidth ratios are preserved and compared at full width on both 32-bit
and 64-bit builds.

Fixes: b40b2e8eb5 ("sched: rt: multi level group constraints")
Assisted-by: Codex:GPT-5
Signed-off-by: Joseph Salisbury <joseph.salisbury@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260403210014.2713404-1-joseph.salisbury@oracle.com
2026-04-07 09:23:52 +02:00
Vincent Guittot
059258b0d4 sched/fair: Prevent negative lag increase during delayed dequeue
Delayed dequeue feature aims to reduce the negative lag of a dequeued
task while sleeping but it can happens that newly enqueued tasks will
move backward the avg vruntime and increase its negative lag.
When the delayed dequeued task wakes up, it has more neg lag compared
to being dequeued immediately or to other tasks that have been
dequeued just before theses new enqueues.

Ensure that the negative lag of a delayed dequeued task doesn't
increase during its delayed dequeued phase while waiting for its neg
lag to diseappear. Similarly, we remove any positive lag that the
delayed dequeued task could have gain during thsi period.

Short slice tasks are particularly impacted in overloaded system.

Test on snapdragon rb5:

hackbench -T -p -l 16000000 -g 2 1> /dev/null &
cyclictest -t 1 -i 2777 -D 333 --policy=fair --mlock  -h 20000 -q

The scheduling latency of cyclictest is:

                       tip/sched/core  tip/sched/core    +this patch
cyclictest slice  (ms) (default)2.8             8               8
hackbench slice   (ms) (default)2.8            20              20
Total Samples          |   115632          119733          119806
Average           (us) |      364              64(-82%)        61(- 5%)
Median (P50)      (us) |       60              56(- 7%)        56(  0%)
90th Percentile   (us) |     1166              62(-95%)        62(  0%)
99th Percentile   (us) |     4192              73(-98%)        72(- 1%)
99.9th Percentile (us) |     8528            2707(-68%)      1300(-52%)
Maximum           (us) |    17735           14273(-20%)     13525(- 5%)

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260331162352.551501-1-vincent.guittot@linaro.org
2026-04-03 14:23:41 +02:00
Vincent Guittot
2d4cc371ba sched/fair: Use sched_energy_enabled()
Use helper sched_energy_enabled() everywhere we want to test if EAS is
enabled instead of mixing sched_energy_enabled() and direct call to
static_branch_unlikely().

No functional change

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260327132013.2800517-1-vincent.guittot@linaro.org
2026-04-03 14:23:41 +02:00
John Stultz
b049b81bdf sched: Handle blocked-waiter migration (and return migration)
Add logic to handle migrating a blocked waiter to a remote
cpu where the lock owner is runnable.

Additionally, as the blocked task may not be able to run
on the remote cpu, add logic to handle return migration once
the waiting task is given the mutex.

Because tasks may get migrated to where they cannot run, also
modify the scheduling classes to avoid sched class migrations on
mutex blocked tasks, leaving find_proxy_task() and related logic
to do the migrations and return migrations.

This was split out from the larger proxy patch, and
significantly reworked.

Credits for the original patch go to:
  Peter Zijlstra (Intel) <peterz@infradead.org>
  Juri Lelli <juri.lelli@redhat.com>
  Valentin Schneider <valentin.schneider@arm.com>
  Connor O'Brien <connoro@google.com>

Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260324191337.1841376-11-jstultz@google.com
2026-04-03 14:23:41 +02:00
John Stultz
dec9554dc0 sched: Move attach_one_task and attach_task helpers to sched.h
The fair scheduler locally introduced attach_one_task() and
attach_task() helpers, but these could be generically useful so
move this code to sched.h so we can use them elsewhere.

One minor tweak made to utilize guard(rq_lock)(rq) to simplifiy
the function.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-10-jstultz@google.com
2026-04-03 14:23:40 +02:00
John Stultz
48fda62de6 sched: Add logic to zap balance callbacks if we pick again
With proxy-exec, a task is selected to run via pick_next_task(),
and then if it is a mutex blocked task, we call find_proxy_task()
to find a runnable owner. If the runnable owner is on another
cpu, we will need to migrate the selected donor task away, after
which we will pick_again can call pick_next_task() to choose
something else.

However, in the first call to pick_next_task(), we may have
had a balance_callback setup by the class scheduler. After we
pick again, its possible pick_next_task_fair() will be called
which calls sched_balance_newidle() and sched_balance_rq().

This will throw a warning:
[    8.796467] rq->balance_callback && rq->balance_callback != &balance_push_callback
[    8.796467] WARNING: CPU: 32 PID: 458 at kernel/sched/sched.h:1750 sched_balance_rq+0xe92/0x1250
...
[    8.796467] Call Trace:
[    8.796467]  <TASK>
[    8.796467]  ? __warn.cold+0xb2/0x14e
[    8.796467]  ? sched_balance_rq+0xe92/0x1250
[    8.796467]  ? report_bug+0x107/0x1a0
[    8.796467]  ? handle_bug+0x54/0x90
[    8.796467]  ? exc_invalid_op+0x17/0x70
[    8.796467]  ? asm_exc_invalid_op+0x1a/0x20
[    8.796467]  ? sched_balance_rq+0xe92/0x1250
[    8.796467]  sched_balance_newidle+0x295/0x820
[    8.796467]  pick_next_task_fair+0x51/0x3f0
[    8.796467]  __schedule+0x23a/0x14b0
[    8.796467]  ? lock_release+0x16d/0x2e0
[    8.796467]  schedule+0x3d/0x150
[    8.796467]  worker_thread+0xb5/0x350
[    8.796467]  ? __pfx_worker_thread+0x10/0x10
[    8.796467]  kthread+0xee/0x120
[    8.796467]  ? __pfx_kthread+0x10/0x10
[    8.796467]  ret_from_fork+0x31/0x50
[    8.796467]  ? __pfx_kthread+0x10/0x10
[    8.796467]  ret_from_fork_asm+0x1a/0x30
[    8.796467]  </TASK>

This is because if a RT task was originally picked, it will
setup the rq->balance_callback with push_rt_tasks() via
set_next_task_rt().

Once the task is migrated away and we pick again, we haven't
processed any balance callbacks, so rq->balance_callback is not
in the same state as it was the first time pick_next_task was
called.

To handle this, add a zap_balance_callbacks() helper function
which cleans up the balance callbacks without running them. This
should be ok, as we are effectively undoing the state set in
the first call to pick_next_task(), and when we pick again,
the new callback can be configured for the donor task actually
selected.

Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-9-jstultz@google.com
2026-04-03 14:23:40 +02:00
John Stultz
f9530b3183 sched: Add assert_balance_callbacks_empty helper
With proxy-exec utilizing pick-again logic, we can end up having
balance callbacks set by the preivous pick_next_task() call left
on the list.

So pull the warning out into a helper function, and make sure we
check it when we pick again.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-8-jstultz@google.com
2026-04-03 14:23:40 +02:00
John Stultz
2d76226698 sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration
As we add functionality to proxy execution, we may migrate a
donor task to a runqueue where it can't run due to cpu affinity.
Thus, we must be careful to ensure we return-migrate the task
back to a cpu in its cpumask when it becomes unblocked.

Peter helpfully provided the following example with pictures:
"Suppose we have a ww_mutex cycle:

                  ,-+-* Mutex-1 <-.
        Task-A ---' |             | ,-- Task-B
                    `-> Mutex-2 *-+-'

Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and
where Task-B holds Mutex-2 and tries to acquire Mutex-1.

Then the blocked_on->owner chain will go in circles.

        Task-A  -> Mutex-2
          ^          |
          |          v
        Mutex-1 <- Task-B

We need two things:

 - find_proxy_task() to stop iterating the circle;

 - the woken task to 'unblock' and run, such that it can
   back-off and re-try the transaction.

Now, the current code [without this patch] does:
        __clear_task_blocked_on();
        wake_q_add();

And surely clearing ->blocked_on is sufficient to break the
cycle.

Suppose it is Task-B that is made to back-off, then we have:

  Task-A -> Mutex-2 -> Task-B (no further blocked_on)

and it would attempt to run Task-B. Or worse, it could directly
pick Task-B and run it, without ever getting into
find_proxy_task().

Now, here is a problem because Task-B might not be runnable on
the CPU it is currently on; and because !task_is_blocked() we
don't get into the proxy paths, so nobody is going to fix this
up.

Ideally we would have dequeued Task-B alongside of clearing
->blocked_on, but alas, [the lock ordering prevents us from
getting the task_rq_lock() and] spoils things."

Thus we need more than just a binary concept of the task being
blocked on a mutex or not.

So allow setting blocked_on to PROXY_WAKING as a special value
which specifies the task is no longer blocked, but needs to
be evaluated for return migration *before* it can be run.

This will then be used in a later patch to handle proxy
return-migration.

Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-7-jstultz@google.com
2026-04-03 14:23:40 +02:00
John Stultz
56f4b24267 sched: Fix modifying donor->blocked on without proper locking
Introduce an action enum in find_proxy_task() which allows
us to handle work needed to be done outside the mutex.wait_lock
and task.blocked_lock guard scopes.

This ensures proper locking when we clear the donor's blocked_on
pointer in proxy_deactivate(), and the switch statement will be
useful as we add more cases to handle later in this series.

Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-6-jstultz@google.com
2026-04-03 14:23:39 +02:00
John Stultz
fa4a1ff8ab locking: Add task::blocked_lock to serialize blocked_on state
So far, we have been able to utilize the mutex::wait_lock
for serializing the blocked_on state, but when we move to
proxying across runqueues, we will need to add more state
and a way to serialize changes to this state in contexts
where we don't hold the mutex::wait_lock.

So introduce the task::blocked_lock, which nests under the
mutex::wait_lock in the locking order, and rework the locking
to use it.

Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-5-jstultz@google.com
2026-04-03 14:23:39 +02:00
John Stultz
f4fe6be82e sched: Fix potentially missing balancing with Proxy Exec
K Prateek pointed out that with Proxy Exec, we may have cases
where we context switch in __schedule(), while the donor remains
the same. This could cause balancing issues, since the
put_prev_set_next() logic short-cuts if (prev == next). With
proxy-exec prev is the previous donor, and next is the next
donor. Should the donor remain the same, but different tasks are
picked to actually run, the shortcut will have avoided enqueuing
the sched class balance callback.

So, if we are context switching, add logic to catch the
same-donor case, and trigger the put_prev/set_next calls to
ensure the balance callbacks get enqueued.

Closes: https://lore.kernel.org/lkml/20ea3670-c30a-433b-a07f-c4ff98ae2379@amd.com/
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260324191337.1841376-4-jstultz@google.com
2026-04-03 14:23:39 +02:00
John Stultz
37341ec573 sched: Minimise repeated sched_proxy_exec() checking
Peter noted: Compilers are really bad (as in they utterly refuse)
optimizing (even when marked with __pure) the static branch
things, and will happily emit multiple identical in a row.

So pull out the one obvious sched_proxy_exec() branch in
__schedule() and remove some of the 'implicit' ones in that
path.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-3-jstultz@google.com
2026-04-03 14:23:38 +02:00
John Stultz
e0ca8991b2 sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
With proxy-execution, the scheduler selects the donor, but for
blocked donors, we end up running the lock owner.

This caused some complexity, because the class schedulers make
sure to remove the task they pick from their pushable task
lists, which prevents the donor from being migrated, but there
wasn't then anything to prevent rq->curr from being migrated
if rq->curr != rq->donor.

This was sort of hacked around by calling proxy_tag_curr() on
the rq->curr task if we were running something other then the
donor. proxy_tag_curr() did a dequeue/enqueue pair on the
rq->curr task, allowing the class schedulers to remove it from
their pushable list.

The dequeue/enqueue pair was wasteful, and additonally K Prateek
highlighted that we didn't properly undo things when we stopped
proxying, leaving the lock owner off the pushable list.

After some alternative approaches were considered, Peter
suggested just having the RT/DL classes just avoid migrating
when task_on_cpu().

So rework pick_next_pushable_dl_task() and the rt
pick_next_pushable_task() functions so that they skip over the
first pushable task if it is on_cpu.

Then just drop all of the proxy_tag_curr() logic.

Fixes: be39617e38 ("sched: Fix proxy/current (push,pull)ability")
Closes: https://lore.kernel.org/lkml/e735cae0-2cc9-4bae-b761-fcb082ed3e94@amd.com/
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260324191337.1841376-2-jstultz@google.com
2026-04-03 14:23:38 +02:00
Ingo Molnar
9853914c08 Merge branch 'sched/urgent' into sched/core, to resolve conflicts
The following fix in sched/urgent:

  e08d007f9d ("sched/debug: Fix avg_vruntime() usage")

is in conflict with this pending commit in sched/core:

  4823725d9d ("sched/fair: Increase weight bits for avg_vruntime")

Both modify the same variable definition and initialization blocks,
resolve it by merging the two.

 Conflicts:
	kernel/sched/debug.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2026-04-02 15:04:09 +02:00
Peter Zijlstra
e08d007f9d sched/debug: Fix avg_vruntime() usage
John reported that stress-ng-yield could make his machine unhappy and
managed to bisect it to commit b3d99f43c7 ("sched/fair: Fix
zero_vruntime tracking").

The commit in question changes avg_vruntime() from a function that is
a pure reader, to a function that updates variables. This turns an
unlocked sched/debug usage of this function from a minor mistake into
a data corruptor.

Fixes: af4cf40470 ("sched/fair: Add cfs_rq::avg_vruntime")
Fixes: b3d99f43c7 ("sched/fair: Fix zero_vruntime tracking")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20260401132355.196370805@infradead.org
2026-04-02 13:42:43 +02:00
Peter Zijlstra
1319ea5752 sched/fair: Fix zero_vruntime tracking fix
John reported that stress-ng-yield could make his machine unhappy and
managed to bisect it to commit b3d99f43c7 ("sched/fair: Fix
zero_vruntime tracking").

The combination of yield and that commit was specific enough to
hypothesize the following scenario:

Suppose we have 2 runnable tasks, both doing yield. Then one will be
eligible and one will not be, because the average position must be in
between these two entities.

Therefore, the runnable task will be eligible, and be promoted a full
slice (all the tasks do is yield after all). This causes it to jump over
the other task and now the other task is eligible and current is no
longer. So we schedule.

Since we are runnable, there is no {de,en}queue. All we have is the
__{en,de}queue_entity() from {put_prev,set_next}_task(). But per the
fingered commit, those two no longer move zero_vruntime.

All that moves zero_vruntime are tick and full {de,en}queue.

This means, that if the two tasks playing leapfrog can reach the
critical speed to reach the overflow point inside one tick's worth of
time, we're up a creek.

Additionally, when multiple cgroups are involved, there is no guarantee
the tick will in fact hit every cgroup in a timely manner. Statistically
speaking it will, but that same statistics does not rule out the
possibility of one cgroup not getting a tick for a significant amount of
time -- however unlikely.

Therefore, just like with the yield() case, force an update at the end
of every slice. This ensures the update is never more than a single
slice behind and the whole thing is within 2 lag bounds as per the
comment on entity_key().

Fixes: b3d99f43c7 ("sched/fair: Fix zero_vruntime tracking")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20260401132355.081530332@infradead.org
2026-04-02 13:42:43 +02:00
Linus Torvalds
47e3f23f0e Merge tag 'timers-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
 "Fix an argument order bug in the alarm timer forwarding logic, which
  may cause missed expirations or incorrect overrun accounting"

* tag 'timers-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  alarmtimer: Fix argument order in alarm_timer_forward()
2026-03-29 10:02:38 -07:00
Linus Torvalds
f087b0bad4 Merge tag 'locking-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull futex fixes from Ingo Molnar:

 - Tighten up the sys_futex_requeue() ABI a bit, to disallow dissimilar
   futex flags and potential UaF access (Peter Zijlstra)

 - Fix UaF between futex_key_to_node_opt() and vma_replace_policy()
   (Hao-Yu Yang)

 - Clear stale exiting pointer in futex_lock_pi() retry path, which
   triggered a warning (and potential misbehavior) in stress-testing
   (Davidlohr Bueso)

* tag 'locking-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Clear stale exiting pointer in futex_lock_pi() retry path
  futex: Fix UaF between futex_key_to_node_opt() and vma_replace_policy()
  futex: Require sys_futex_requeue() to have identical flags
2026-03-29 09:59:46 -07:00
Linus Torvalds
cbfffcca2b Merge tag 'trace-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:

 - Fix potential deadlock in osnoise and hotplug

   The interface_lock can be called by a osnoise thread and the CPU
   shutdown logic of osnoise can wait for this thread to finish. But
   cpus_read_lock() can also be taken while holding the interface_lock.
   This produces a circular lock dependency and can cause a deadlock.

   Swap the ordering of cpus_read_lock() and the interface_lock to have
   interface_lock taken within the cpus_read_lock() context to prevent
   this circular dependency.

 - Fix freeing of event triggers in early boot up

   If the same trigger is added on the kernel command line, the second
   one will fail to be applied and the trigger created will be freed.
   This calls into the deferred logic and creates a kernel thread to do
   the freeing. But the command line logic is called before kernel
   threads can be created and this leads to a NULL pointer dereference.

   Delay freeing event triggers until late init.

* tag 'trace-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Drain deferred trigger frees if kthread creation fails
  tracing: Fix potential deadlock in cpu hotplug with osnoise
2026-03-28 09:59:09 -07:00
Davidlohr Bueso
210d36d892 futex: Clear stale exiting pointer in futex_lock_pi() retry path
Fuzzying/stressing futexes triggered:

    WARNING: kernel/futex/core.c:825 at wait_for_owner_exiting+0x7a/0x80, CPU#11: futex_lock_pi_s/524

When futex_lock_pi_atomic() sees the owner is exiting, it returns -EBUSY
and stores a refcounted task pointer in 'exiting'.

After wait_for_owner_exiting() consumes that reference, the local pointer
is never reset to nil. Upon a retry, if futex_lock_pi_atomic() returns a
different error, the bogus pointer is passed to wait_for_owner_exiting().

  CPU0			     CPU1		       CPU2
  futex_lock_pi(uaddr)
  // acquires the PI futex
  exit()
    futex_cleanup_begin()
      futex_state = EXITING;
			     futex_lock_pi(uaddr)
			       futex_lock_pi_atomic()
				 attach_to_pi_owner()
				   // observes EXITING
				   *exiting = owner;  // takes ref
				   return -EBUSY
			       wait_for_owner_exiting(-EBUSY, owner)
				 put_task_struct();   // drops ref
			       // exiting still points to owner
			       goto retry;
			       futex_lock_pi_atomic()
				 lock_pi_update_atomic()
				   cmpxchg(uaddr)
					*uaddr ^= WAITERS // whatever
				   // value changed
				 return -EAGAIN;
			       wait_for_owner_exiting(-EAGAIN, exiting) // stale
				 WARN_ON_ONCE(exiting)

Fix this by resetting upon retry, essentially aligning it with requeue_pi.

Fixes: 3ef240eaff ("futex: Prevent exit livelock")
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260326001759.4129680-1-dave@stgolabs.net
2026-03-28 13:54:02 +01:00
Wesley Atwell
250ab25391 tracing: Drain deferred trigger frees if kthread creation fails
Boot-time trigger registration can fail before the trigger-data cleanup
kthread exists. Deferring those frees until late init is fine, but the
post-boot fallback must still drain the deferred list if kthread
creation never succeeds.

Otherwise, boot-deferred nodes can accumulate on
trigger_data_free_list, later frees fall back to synchronously freeing
only the current object, and the older queued entries are leaked
forever.

To trigger this, add the following to the kernel command line:

  trace_event=sched_switch trace_trigger=sched_switch.traceon,sched_switch.traceon

The second traceon trigger will fail and be freed. This triggers a NULL
pointer dereference and crashes the kernel.

Keep the deferred boot-time behavior, but when kthread creation fails,
drain the whole queued list synchronously. Do the same in the late-init
drain path so queued entries are not stranded there either.

Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260324221326.1395799-3-atwellwea@gmail.com
Fixes: 61d445af0a ("tracing: Add bulk garbage collection of freeing event_trigger_data")
Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-28 08:32:44 -04:00
Linus Torvalds
0b8bf3b64e Merge tag 'sysctl-7.00-fixes-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl fix from Joel Granados:
 "Fix uninitialized variable error when writing to a sysctl bitmap

  Removed the possibility of returning an unjustified -EINVAL when
  writing to a sysctl bitmap"

* tag 'sysctl-7.00-fixes-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
  sysctl: fix uninitialized variable in proc_do_large_bitmap
2026-03-27 13:04:34 -07:00
Luo Haiyang
1f98857322 tracing: Fix potential deadlock in cpu hotplug with osnoise
The following sequence may leads deadlock in cpu hotplug:

    task1        task2        task3
    -----        -----        -----

 mutex_lock(&interface_lock)

            [CPU GOING OFFLINE]

            cpus_write_lock();
            osnoise_cpu_die();
              kthread_stop(task3);
                wait_for_completion();

                      osnoise_sleep();
                        mutex_lock(&interface_lock);

 cpus_read_lock();

 [DEAD LOCK]

Fix by swap the order of cpus_read_lock() and mutex_lock(&interface_lock).

Cc: stable@vger.kernel.org
Cc: <mathieu.desnoyers@efficios.com>
Cc: <zhang.run@zte.com.cn>
Cc: <yang.tao172@zte.com.cn>
Cc: <ran.xiaokai@zte.com.cn>
Fixes: bce29ac9ce ("trace: Add osnoise tracer")
Link: https://patch.msgid.link/20260326141953414bVSj33dAYktqp9Oiyizq8@zte.com.cn
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Luo Haiyang <luo.haiyang@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-27 15:18:06 -04:00
Linus Torvalds
d813f42193 Merge tag 'pm-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
 "These fix two cpufreq issues, one in the core and one in the
  conservative governor, and two issues related to system sleep:

   - Restore the cpufreq core behavior changed inadvertently during the
     6.19 development cycle to call cpufreq_frequency_table_cpuinfo()
     for cpufreq policies getting re-initialized which ensures that
     policy->max and policy->cpuinfo_max_freq will be valid going
     forward (Viresh Kumar)

   - Adjust the cached requested frequency in the conservative cpufreq
     governor on policy limits changes to prevent it from becoming stale
     in some cases (Viresh Kumar)

   - Prevent pm_restore_gfp_mask() from triggering a WARN_ON() in some
     code paths in which it is legitimately called without invoking
     pm_restrict_gfp_mask() previously (Youngjun Park)

   - Update snapshot_write_finalize() to take trailing zero pages into
     account properly which prevents user space restore from failing
     subsequently in some cases (Alberto Garcia)"

* tag 'pm-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: sleep: Drop spurious WARN_ON() from pm_restore_gfp_mask()
  PM: hibernate: Drain trailing zero pages on userspace restore
  cpufreq: conservative: Reset requested_freq on limits change
  cpufreq: Don't skip cpufreq_frequency_table_cpuinfo()
2026-03-26 12:42:28 -07:00
Linus Torvalds
dabb83ecf4 Merge tag 'dma-mapping-7.0-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
 "A set of fixes for DMA-mapping subsystem, which resolve false-
  positive warnings from KMSAN and DMA-API debug (Shigeru Yoshida
  and Leon Romanovsky) as well as a simple build fix (Miguel Ojeda)"

* tag 'dma-mapping-7.0-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma-mapping: add missing `inline` for `dma_free_attrs`
  mm/hmm: Indicate that HMM requires DMA coherency
  RDMA/umem: Tell DMA mapping that UMEM requires coherency
  iommu/dma: add support for DMA_ATTR_REQUIRE_COHERENT attribute
  dma-direct: prevent SWIOTLB path when DMA_ATTR_REQUIRE_COHERENT is set
  dma-mapping: Introduce DMA require coherency attribute
  dma-mapping: Clarify valid conditions for CPU cache line overlap
  dma-mapping: handle DMA_ATTR_CPU_CACHE_CLEAN in trace output
  dma-debug: Allow multiple invocations of overlapping entries
  dma: swiotlb: add KMSAN annotations to swiotlb_bounce()
2026-03-26 08:22:07 -07:00
Hao-Yu Yang
190a8c48ff futex: Fix UaF between futex_key_to_node_opt() and vma_replace_policy()
During futex_key_to_node_opt() execution, vma->vm_policy is read under
speculative mmap lock and RCU. Concurrently, mbind() may call
vma_replace_policy() which frees the old mempolicy immediately via
kmem_cache_free().

This creates a race where __futex_key_to_node() dereferences a freed
mempolicy pointer, causing a use-after-free read of mpol->mode.

[  151.412631] BUG: KASAN: slab-use-after-free in __futex_key_to_node (kernel/futex/core.c:349)
[  151.414046] Read of size 2 at addr ffff888001c49634 by task e/87

[  151.415969] Call Trace:

[  151.416732]  __asan_load2 (mm/kasan/generic.c:271)
[  151.416777]  __futex_key_to_node (kernel/futex/core.c:349)
[  151.416822]  get_futex_key (kernel/futex/core.c:374 kernel/futex/core.c:386 kernel/futex/core.c:593)

Fix by adding rcu to __mpol_put().

Fixes: c042c50521 ("futex: Implement FUTEX2_MPOL")
Reported-by: Hao-Yu Yang <naup96721@gmail.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hao-Yu Yang <naup96721@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Link: https://patch.msgid.link/20260324174418.GB1850007@noisy.programming.kicks-ass.net
2026-03-26 16:13:48 +01:00
Peter Zijlstra
19f94b3905 futex: Require sys_futex_requeue() to have identical flags
Nicholas reported that his LLM found it was possible to create a UaF
when sys_futex_requeue() is used with different flags. The initial
motivation for allowing different flags was the variable sized futex,
but since that hasn't been merged (yet), simply mandate the flags are
identical, as is the case for the old style sys_futex() requeue
operations.

Fixes: 0f4b5f9722 ("futex: Add sys_futex_requeue()")
Reported-by: Nicholas Carlini <npc@anthropic.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2026-03-26 16:13:48 +01:00
Marc Buerg
f63a9df7e3 sysctl: fix uninitialized variable in proc_do_large_bitmap
proc_do_large_bitmap() does not initialize variable c, which is expected
to be set to a trailing character by proc_get_long().

However, proc_get_long() only sets c when the input buffer contains a
trailing character after the parsed value.

If c is not initialized it may happen to contain a '-'. If this is the
case proc_do_large_bitmap() expects to be able to parse a second part of
the input buffer. If there is no second part an unjustified -EINVAL will
be returned.

Initialize c to 0 to prevent returning -EINVAL on valid input.

Fixes: 9f977fb7ae ("sysctl: add proc_do_large_bitmap")
Signed-off-by: Marc Buerg <buermarc@googlemail.com>
Reviewed-by: Joel Granados <joel.granados@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2026-03-26 09:32:19 +01:00
Linus Torvalds
aba9da0905 Merge tag 'rcu-fixes.v7.0-20260325a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU fixes from Boqun Feng:
 "Fix a regression introduced by commit c27cea4416 ("rcu: Re-implement
  RCU Tasks Trace in terms of SRCU-fast"): BPF contexts can run with
  preemption disabled or scheduler locks held, so call_srcu() must work
  in all such contexts.

  Fix this by converting SRCU's spinlocks to raw spinlocks and avoiding
  scheduler lock acquisition in call_srcu() by deferring to an irq_work
  (similar to call_rcu_tasks_generic()), for both tree SRCU and tiny
  SRCU.

  Also fix a follow-on lockdep splat caused by srcu_node allocation
  under the newly introduced raw spinlock by deferring the allocation to
  grace-period worker context"

* tag 'rcu-fixes.v7.0-20260325a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux:
  srcu: Use irq_work to start GP in tiny SRCU
  rcu: Use an intermediate irq_work to start process_srcu()
  srcu: Push srcu_node allocation to GP when non-preemptible
  srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()
2026-03-25 18:14:19 -07:00
Joel Fernandes
a6fc88b22b srcu: Use irq_work to start GP in tiny SRCU
Tiny SRCU's srcu_gp_start_if_needed() directly calls schedule_work(),
which acquires the workqueue pool->lock.

This causes a lockdep splat when call_srcu() is called with a scheduler
lock held, due to:

  call_srcu() [holding pi_lock]
    srcu_gp_start_if_needed()
      schedule_work() -> pool->lock

  workqueue_init() / create_worker() [holding pool->lock]
    wake_up_process() -> try_to_wake_up() -> pi_lock

Also add irq_work_sync() to cleanup_srcu_struct() to prevent a
use-after-free if a queued irq_work fires after cleanup begins.

Tested with rcutorture SRCU-T and no lockdep warnings.

[ Thanks to Boqun for similar fix in patch "rcu: Use an intermediate irq_work
to start process_srcu()" ]

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun@kernel.org>
2026-03-25 09:00:05 -07:00
Boqun Feng
7c405fb327 rcu: Use an intermediate irq_work to start process_srcu()
Since commit c27cea4416 ("rcu: Re-implement RCU Tasks Trace in terms
of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can
happen basically everywhere (including where a scheduler lock is held),
call_srcu() now needs to avoid acquiring scheduler lock because
otherwise it could cause deadlock [1]. Fix this by following what the
previous RCU Tasks Trace did: using an irq_work to delay the queuing of
the work to start process_srcu().

[boqun: Apply Joel's feedback]
[boqun: Apply Andrea's test feedback]

Reported-by: Andrea Righi <arighi@nvidia.com>
Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/
Fixes: commit c27cea4416 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast")
Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1]
Suggested-by: Zqiang <qiang.zhang@linux.dev>
Tested-by: Andrea Righi <arighi@nvidia.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun@kernel.org>
2026-03-25 08:59:59 -07:00
Paul E. McKenney
61bbcfb505 srcu: Push srcu_node allocation to GP when non-preemptible
When the srcutree.convert_to_big and srcutree.big_cpu_lim kernel boot
parameters specify initialization-time allocation of the srcu_node
tree for statically allocated srcu_struct structures (for example, in
DEFINE_SRCU() at build time instead of init_srcu_struct() at runtime),
init_srcu_struct_nodes() will attempt to dynamically allocate this tree
at the first run-time update-side use of this srcu_struct structure,
but while holding a raw spinlock.  Because the memory allocator can
acquire non-raw spinlocks, this can result in lockdep splats.

This commit therefore uses the same SRCU_SIZE_ALLOC trick that is used
when the first run-time update-side use of this srcu_struct structure
happens before srcu_init() is called.  The actual allocation then takes
place from workqueue context at the ends of upcoming SRCU grace periods.

[boqun: Adjust the sha1 of the Fixes tag]

Fixes: 175b45ed34 ("srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun@kernel.org>
2026-03-25 08:59:02 -07:00
Paul E. McKenney
175b45ed34 srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()
Tree SRCU has used non-raw spinlocks for many years, motivated by a desire
to avoid unnecessary real-time latency and the absence of any reason to
use raw spinlocks.  However, the recent use of SRCU in tracing as the
underlying implementation of RCU Tasks Trace means that call_srcu()
is invoked from preemption-disabled regions of code, which in turn
requires that any locks acquired by call_srcu() or its callees must be
raw spinlocks.

This commit therefore converts SRCU's spinlocks to raw spinlocks.

[boqun: Add Fixes tag]

Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: c27cea4416 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2026-03-25 08:55:50 -07:00
Zhan Xusheng
5d16467ae5 alarmtimer: Fix argument order in alarm_timer_forward()
alarm_timer_forward() passes arguments to alarm_forward() in the wrong
order:

  alarm_forward(alarm, timr->it_interval, now);

However, alarm_forward() is defined as:

  u64 alarm_forward(struct alarm *alarm, ktime_t now, ktime_t interval);

and uses the second argument as the current time:

  delta = ktime_sub(now, alarm->node.expires);

Passing the interval as "now" results in incorrect delta computation,
which can lead to missed expirations or incorrect overrun accounting.

This issue has been present since the introduction of
alarm_timer_forward().

Fix this by swapping the arguments.

Fixes: e7561f1633 ("alarmtimer: Implement forward callback")
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260323061130.29991-1-zhanxusheng@xiaomi.com
2026-03-24 23:17:14 +01:00
Shrikanth Hegde
0e81fe79fe sched/core: Get this cpu once in ttwu_queue_cond()
Calling smp_processor_id() on:
 - In CONFIG_DEBUG_PREEMPT=y, if preemption/irq is disabled, then it does
   not print any warning.
 - In CONFIG_DEBUG_PREEMPT=n, it doesn't do anything apart from getting
   __smp_processor_id

So with both CONFIG_DEBUG_PREEMPT=y/n, in preemption disabled section
it is better to cache the value. It could save a few cycles. Though
tiny, repeated could add up to a small value.

ttwu_queue_cond is called with interrupt disabled. So preemption is
disabled. Hence cache the value once instead.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com>
Link: https://patch.msgid.link/20260323193630.640311-3-sshegde@linux.ibm.com
2026-03-24 10:07:05 +01:00
Shrikanth Hegde
76504bce4e sched/fair: Get this cpu once in find_new_ilb()
Calling smp_processor_id() on:
 - In CONFIG_DEBUG_PREEMPT=y, if preemption/irq is disabled, then it does
   not print any warning.
 - In CONFIG_DEBUG_PREEMPT=n, it doesn't do anything apart from getting
   __smp_processor_id

So with both CONFIG_DEBUG_PREEMPT=y/n, in preemption disabled section
it is better to cache the value. It could save a few cycles. Though
tiny, repeated in loop could add up to a small value.

find_new_ilb is called in interrupt context. So preemption is disabled.
So Hoist the this_cpu out of loop

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260323193630.640311-2-sshegde@linux.ibm.com
2026-03-24 10:07:04 +01:00
Youngjun Park
a8d51efb59 PM: sleep: Drop spurious WARN_ON() from pm_restore_gfp_mask()
Commit 35e4a69b20 ("PM: sleep: Allow pm_restrict_gfp_mask()
stacking") introduced refcount-based GFP mask management that warns
when pm_restore_gfp_mask() is called with saved_gfp_count == 0.

Some hibernation paths call pm_restore_gfp_mask() defensively where
the GFP mask may or may not be restricted depending on the execution
path. For example, the uswsusp interface invokes it in
SNAPSHOT_CREATE_IMAGE, SNAPSHOT_UNFREEZE, and snapshot_release().
Before the stacking change this was a silent no-op; it now triggers
a spurious WARNING.

Remove the WARN_ON() wrapper from the !saved_gfp_count check while
retaining the check itself, so that defensive calls remain harmless
without producing false warnings.

Fixes: 35e4a69b20 ("PM: sleep: Allow pm_restrict_gfp_mask() stacking")
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
[ rjw: Subject tweak ]
Link: https://patch.msgid.link/20260322120528.750178-1-youngjun.park@lge.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-03-23 13:54:53 +01:00
Alberto Garcia
734eba62cd PM: hibernate: Drain trailing zero pages on userspace restore
Commit 005e8dddd4 ("PM: hibernate: don't store zero pages in the
image file") added an optimization to skip zero-filled pages in the
hibernation image. On restore, zero pages are handled internally by
snapshot_write_next() in a loop that processes them without returning
to the caller.

With the userspace restore interface, writing the last non-zero page
to /dev/snapshot is followed by the SNAPSHOT_ATOMIC_RESTORE ioctl. At
this point there are no more calls to snapshot_write_next() so any
trailing zero pages are not processed, snapshot_image_loaded() fails
because handle->cur is smaller than expected, the ioctl returns -EPERM
and the image is not restored.

The in-kernel restore path is not affected by this because the loop in
load_image() in swap.c calls snapshot_write_next() until it returns 0.
It is this final call that drains any trailing zero pages.

Fixed by calling snapshot_write_next() in snapshot_write_finalize(),
giving the kernel the chance to drain any trailing zero pages.

Fixes: 005e8dddd4 ("PM: hibernate: don't store zero pages in the image file")
Signed-off-by: Alberto Garcia <berto@igalia.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Link: https://patch.msgid.link/ef5a7c5e3e3dbd17dcb20efaa0c53a47a23498bb.1773075892.git.berto@igalia.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-03-23 13:33:06 +01:00
Linus Torvalds
d5273fd3ca Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:

 - Fix how linked registers track zero extension of subregisters (Daniel
   Borkmann)

 - Fix unsound scalar fork for OR instructions (Daniel Wade)

 - Fix exception exit lock check for subprogs (Ihor Solodrai)

 - Fix undefined behavior in interpreter for SDIV/SMOD instructions
   (Jenny Guanni Qu)

 - Release module's BTF when module is unloaded (Kumar Kartikeya
   Dwivedi)

 - Fix constant blinding for PROBE_MEM32 instructions (Sachin Kumar)

 - Reset register ID for END instructions to prevent incorrect value
   tracking (Yazhou Tang)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Add a test cases for sync_linked_regs regarding zext propagation
  bpf: Fix sync_linked_regs regarding BPF_ADD_CONST32 zext propagation
  selftests/bpf: Add tests for maybe_fork_scalars() OR vs AND handling
  bpf: Fix unsound scalar forking in maybe_fork_scalars() for BPF_OR
  selftests/bpf: Add tests for sdiv32/smod32 with INT_MIN dividend
  bpf: Fix undefined behavior in interpreter sdiv/smod for INT_MIN
  selftests/bpf: Add tests for bpf_throw lock leak from subprogs
  bpf: Fix exception exit lock checking for subprogs
  bpf: Release module BTF IDR before module unload
  selftests/bpf: Fix pkg-config call on static builds
  bpf: Fix constant blinding for PROBE_MEM32 stores
  selftests/bpf: Add test for BPF_END register ID reset
  bpf: Reset register ID for BPF_END value tracking
2026-03-22 11:16:06 -07:00
Linus Torvalds
ac57fa9faf Merge tag 'trace-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:

 - Revert "tracing: Remove pid in task_rename tracing output"

   A change was made to remove the pid field from the task_rename event
   because it was thought that it was always done for the current task
   and recording the pid would be redundant. This turned out to be
   incorrect and there are a few corner case where this is not true and
   caused some regressions in tooling.

 - Fix the reading from user space for migration

   The reading of user space uses a seq lock type of logic where it uses
   a per-cpu temporary buffer and disables migration, then enables
   preemption, does the copy from user space, disables preemption,
   enables migration and checks if there was any schedule switches while
   preemption was enabled. If there was a context switch, then it is
   considered that the per-cpu buffer could be corrupted and it tries
   again. There's a protection check that tests if it takes a hundred
   tries, it issues a warning and exits out to prevent a live lock.

   This was triggered because the task was selected by the load balancer
   to be migrated to another CPU, every time preemption is enabled the
   migration task would schedule in try to migrate the task but can't
   because migration is disabled and let it run again. This caused the
   scheduler to schedule out the task every time it enabled preemption
   and made the loop never exit (until the 100 iteration test
   triggered).

   Fix this by enabling and disabling preemption and keeping migration
   enabled if the reading from user space needs to be done again. This
   will let the migration thread migrate the task and the copy from user
   space will likely pass on the next iteration.

 - Fix trace_marker copy option freeing

   The "copy_trace_marker" option allows a tracing instance to get a
   copy of a write to the trace_marker file of the top level instance.
   This is managed by a link list protected by RCU. When an instance is
   removed, a check is made if the option is set, and if so
   synchronized_rcu() is called.

   The problem is that an iteration is made to reset all the flags to
   what they were when the instance was created (to perform clean ups)
   was done before the check of the copy_trace_marker option and that
   option was cleared, so the synchronize_rcu() was never called.

   Move the clearing of all the flags after the check of
   copy_trace_marker to do synchronize_rcu() so that the option is still
   set if it was before and the synchronization is performed.

 - Fix entries setting when validating the persistent ring buffer

   When validating the persistent ring buffer on boot up, the number of
   events per sub-buffer is added to the sub-buffer meta page. The
   validator was updating cpu_buffer->head_page (the first sub-buffer of
   the per-cpu buffer) and not the "head_page" variable that was
   iterating the sub-buffers. This was causing the first sub-buffer to
   be assigned the entries for each sub-buffer and not the sub-buffer
   that was supposed to be updated.

 - Use "hash" value to update the direct callers

   When updating the ftrace direct callers, it assigned a temporary
   callback to all the callback functions of the ftrace ops and not just
   the functions represented by the passed in hash. This causes an
   unnecessary slow down of the functions of the ftrace_ops that is not
   being modified. Only update the functions that are going to be
   modified to call the ftrace loop function so that the update can be
   made on those functions.

* tag 'trace-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ftrace: Use hash argument for tmp_ops in update_ftrace_direct_mod
  ring-buffer: Fix to update per-subbuf entries of persistent ring buffer
  tracing: Fix trace_marker copy link list updates
  tracing: Fix failure to read user space from system call trace events
  tracing: Revert "tracing: Remove pid in task_rename tracing output"
2026-03-22 11:10:31 -07:00
Linus Torvalds
ebfd9b7af2 Merge tag 'perf-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:

 - Fix a PMU driver crash on AMD EPYC systems, caused by
   a race condition in x86_pmu_enable()

 - Fix a possible counter-initialization bug in x86_pmu_enable()

 - Fix a counter inheritance bug in inherit_event() and
   __perf_event_read()

 - Fix an Intel PMU driver branch constraints handling bug
   found by UBSAN

 - Fix the Intel PMU driver's new Off-Module Response (OMR)
   support code for Diamond Rapids / Nova lake, to fix a snoop
   information parsing bug

* tag 'perf-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Fix OMR snoop information parsing issues
  perf/x86/intel: Add missing branch counters constraint apply
  perf: Make sure to use pmu_ctx->pmu for groups
  x86/perf: Make sure to program the counter value for stopped events on migration
  perf/x86: Move event pointer setup earlier in x86_pmu_enable()
2026-03-22 10:31:51 -07:00
Jiri Olsa
50b35c9e50 ftrace: Use hash argument for tmp_ops in update_ftrace_direct_mod
The modify logic registers temporary ftrace_ops object (tmp_ops) to trigger
the slow path for all direct callers to be able to safely modify attached
addresses.

At the moment we use ops->func_hash for tmp_ops filter, which represents all
the systems attachments. It's faster to use just the passed hash filter, which
contains only the modified sites and is always a subset of the ops->func_hash.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Menglong Dong <menglong8.dong@gmail.com>
Cc: Song Liu <song@kernel.org>
Link: https://patch.msgid.link/20260312123738.129926-1-jolsa@kernel.org
Fixes: e93672f770 ("ftrace: Add update_ftrace_direct_mod function")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-21 16:51:04 -04:00
Masami Hiramatsu (Google)
f35dbac694 ring-buffer: Fix to update per-subbuf entries of persistent ring buffer
Since the validation loop in rb_meta_validate_events() updates the same
cpu_buffer->head_page->entries, the other subbuf entries are not updated.
Fix to use head_page to update the entries field, since it is the cursor
in this loop.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ian Rogers <irogers@google.com>
Fixes: 5f3b6e839f ("ring-buffer: Validate boot range memory events")
Link: https://patch.msgid.link/177391153882.193994.17158784065013676533.stgit@mhiramat.tok.corp.google.com
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-21 16:47:28 -04:00
Steven Rostedt
07183aac4a tracing: Fix trace_marker copy link list updates
When the "copy_trace_marker" option is enabled for an instance, anything
written into /sys/kernel/tracing/trace_marker is also copied into that
instances buffer. When the option is set, that instance's trace_array
descriptor is added to the marker_copies link list. This list is protected
by RCU, as all iterations uses an RCU protected list traversal.

When the instance is deleted, all the flags that were enabled are cleared.
This also clears the copy_trace_marker flag and removes the trace_array
descriptor from the list.

The issue is after the flags are called, a direct call to
update_marker_trace() is performed to clear the flag. This function
returns true if the state of the flag changed and false otherwise. If it
returns true here, synchronize_rcu() is called to make sure all readers
see that its removed from the list.

But since the flag was already cleared, the state does not change and the
synchronization is never called, leaving a possible UAF bug.

Move the clearing of all flags below the updating of the copy_trace_marker
option which then makes sure the synchronization is performed.

Also use the flag for checking the state in update_marker_trace() instead
of looking at if the list is empty.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260318185512.1b6c7db4@gandalf.local.home
Fixes: 7b382efd5e ("tracing: Allow the top level trace_marker to write into another instances")
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/all/20260225133122.237275-1-sashal@kernel.org/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-21 16:43:53 -04:00
Steven Rostedt
edca33a562 tracing: Fix failure to read user space from system call trace events
The system call trace events call trace_user_fault_read() to read the user
space part of some system calls. This is done by grabbing a per-cpu
buffer, disabling migration, enabling preemption, calling
copy_from_user(), disabling preemption, enabling migration and checking if
the task was preempted while preemption was enabled. If it was, the buffer
is considered corrupted and it tries again.

There's a safety mechanism that will fail out of this loop if it fails 100
times (with a warning). That warning message was triggered in some
pi_futex stress tests. Enabling the sched_switch trace event and
traceoff_on_warning, showed the problem:

 pi_mutex_hammer-1375    [006] d..21   138.981648: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981651: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981656: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981659: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981664: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981667: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981671: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981675: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981679: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981682: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981687: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981690: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981695: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981698: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981703: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981706: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981711: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981714: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981719: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981722: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981727: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981730: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981735: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981738: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95

What happened was the task 1375 was flagged to be migrated. When
preemption was enabled, the migration thread woke up to migrate that task,
but failed because migration for that task was disabled. This caused the
loop to fail to exit because the task scheduled out while trying to read
user space.

Every time the task enabled preemption the migration thread would schedule
in, try to migrate the task, fail and let the task continue. But because
the loop would only enable preemption with migration disabled, it would
always fail because each time it enabled preemption to read user space,
the migration thread would try to migrate it.

To solve this, when the loop fails to read user space without being
scheduled out, enabled and disable preemption with migration enabled. This
will allow the migration task to successfully migrate the task and the
next loop should succeed to read user space without being scheduled out.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260316130734.1858a998@gandalf.local.home
Fixes: 64cf7d058a ("tracing: Have trace_marker use per-cpu data to read user space")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-21 16:42:36 -04:00
Daniel Borkmann
bc308be380 bpf: Fix sync_linked_regs regarding BPF_ADD_CONST32 zext propagation
Jenny reported that in sync_linked_regs() the BPF_ADD_CONST32 flag is
checked on known_reg (the register narrowed by a conditional branch)
instead of reg (the linked target register created by an alu32 operation).

Example case with reg:

  1. r6 = bpf_get_prandom_u32()
  2. r7 = r6 (linked, same id)
  3. w7 += 5 (alu32 -- r7 gets BPF_ADD_CONST32, zero-extended by CPU)
  4. if w6 < 0xFFFFFFFC goto safe (narrows r6 to [0xFFFFFFFC, 0xFFFFFFFF])
  5. sync_linked_regs() propagates to r7 but does NOT call zext_32_to_64()
  6. Verifier thinks r7 is [0x100000001, 0x100000004] instead of [1, 4]

Since known_reg above does not have BPF_ADD_CONST32 set above, zext_32_to_64()
is never called on alu32-derived linked registers. This causes the verifier
to track incorrect 64-bit bounds, while the CPU correctly zero-extends the
32-bit result.

The code checking known_reg->id was correct however (see scalars_alu32_wrap
selftest case), but the real fix needs to handle both directions - zext
propagation should be done when either register has BPF_ADD_CONST32, since
the linked relationship involves a 32-bit operation regardless of which
side has the flag.

Example case with known_reg (exercised also by scalars_alu32_wrap):

  1. r1 = r0; w1 += 0x100 (alu32 -- r1 gets BPF_ADD_CONST32)
  2. if r1 > 0x80 - known_reg = r1 (has BPF_ADD_CONST32), reg = r0 (doesn't)

Hence, fix it by checking for (reg->id | known_reg->id) & BPF_ADD_CONST32.

Moreover, sync_linked_regs() also has a soundness issue when two linked
registers used different ALU widths: one with BPF_ADD_CONST32 and the
other with BPF_ADD_CONST64. The delta relationship between linked registers
assumes the same arithmetic width though. When one register went through
alu32 (CPU zero-extends the 32-bit result) and the other went through
alu64 (no zero-extension), the propagation produces incorrect bounds.

Example:

  r6 = bpf_get_prandom_u32()     // fully unknown
  if r6 >= 0x100000000 goto out  // constrain r6 to [0, U32_MAX]
  r7 = r6
  w7 += 1                        // alu32: r7.id = N | BPF_ADD_CONST32
  r8 = r6
  r8 += 2                        // alu64: r8.id = N | BPF_ADD_CONST64
  if r7 < 0xFFFFFFFF goto out    // narrows r7 to [0xFFFFFFFF, 0xFFFFFFFF]

At the branch on r7, sync_linked_regs() runs with known_reg=r7
(BPF_ADD_CONST32) and reg=r8 (BPF_ADD_CONST64). The delta path
computes:

  r8 = r7 + (delta_r8 - delta_r7) = 0xFFFFFFFF + (2 - 1) = 0x100000000

Then, because known_reg->id has BPF_ADD_CONST32, zext_32_to_64(r8) is
called, truncating r8 to [0, 0]. But r8 used a 64-bit ALU op -- the
CPU does NOT zero-extend it. The actual CPU value of r8 is
0xFFFFFFFE + 2 = 0x100000000, not 0. The verifier now underestimates
r8's 64-bit bounds, which is a soundness violation.

Fix sync_linked_regs() by skipping propagation when the two registers
have mixed ALU widths (one BPF_ADD_CONST32, the other BPF_ADD_CONST64).

Lastly, fix regsafe() used for path pruning: the existing checks used
"& BPF_ADD_CONST" to test for offset linkage, which treated
BPF_ADD_CONST32 and BPF_ADD_CONST64 as equivalent.

Fixes: 7a433e5193 ("bpf: Support negative offsets, BPF_SUB, and alu32 for linked register tracking")
Reported-by: Jenny Guanni Qu <qguanni@gmail.com>
Co-developed-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260319211507.213816-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-21 13:19:40 -07:00