If a time slice extension is granted and the reschedule delayed, the kernel
has to ensure that user space cannot abuse the extension and exceed the
maximum granted time.
It was suggested to implement this via the existing hrtick() timer in the
scheduler, but that turned out to be problematic for several reasons:
1) It creates a dependency on CONFIG_SCHED_HRTICK, which can be disabled
independently of CONFIG_HIGHRES_TIMERS
2) HRTICK usage in the scheduler can be runtime disabled or is only used
for certain aspects of scheduling.
3) The function is calling into the scheduler code and that might have
unexpected consequences when this is invoked due to a time slice
enforcement expiry. Especially when the task managed to clear the
grant via sched_yield(0).
It would be possible to address #2 and #3 by storing state in the
scheduler, but that is extra complexity and fragility for no value.
Implement a dedicated per CPU hrtimer instead, which is solely used for the
purpose of time slice enforcement.
The timer is armed when an extension was granted right before actually
returning to user mode in rseq_exit_to_user_mode_restart().
It is disarmed, when the task relinquishes the CPU. This is expensive as
the timer is probably the first expiring timer on the CPU, which means it
has to reprogram the hardware. But that's less expensive than going through
a full hrtimer interrupt cycle for nothing.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251215155709.068329497@linutronix.de
The kernel sets SYSCALL_WORK_RSEQ_SLICE when it grants a time slice
extension. This allows to handle the rseq_slice_yield() syscall, which is
used by user space to relinquish the CPU after finishing the critical
section for which it requested an extension.
In case the kernel state is still GRANTED, the kernel resets both kernel
and user space state with a set of sanity checks. If the kernel state is
already cleared, then this raced against the timer or some other interrupt
and just clears the work bit.
Doing it in syscall entry work allows to catch misbehaving user space,
which issues an arbitrary syscall, i.e. not rseq_slice_yield(), from the
critical section. Contrary to the initial strict requirement to use
rseq_slice_yield() arbitrary syscalls are not considered a violation of the
ABI contract anymore to allow onion architecture applications, which cannot
control the code inside a critical section, to utilize this as well.
If the code detects inconsistent user space that result in a SIGSEGV for
the application.
If the grant was still active and the task was not preempted yet, the work
code reschedules immediately before continuing through the syscall.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155709.005777059@linutronix.de
Provide a new syscall which has the only purpose to yield the CPU after the
kernel granted a time slice extension.
sched_yield() is not suitable for that because it unconditionally
schedules, but the end of the time slice extension is not required to
schedule when the task was already preempted. This also allows to have a
strict check for termination to catch user space invoking random syscalls
including sched_yield() from a time slice extension region.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20251215155708.929634896@linutronix.de
Implement a prctl() so that tasks can enable the time slice extension
mechanism. This fails, when time slice extensions are disabled at compile
time or on the kernel command line and when no rseq pointer is registered
in the kernel.
That allows to implement a single trivial check in the exit to user mode
hotpath, to decide whether the whole mechanism needs to be invoked.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.858717691@linutronix.de
Aside of a Kconfig knob add the following items:
- Two flag bits for the rseq user space ABI, which allow user space to
query the availability and enablement without a syscall.
- A new member to the user space ABI struct rseq, which is going to be
used to communicate request and grant between kernel and user space.
- A rseq state struct to hold the kernel state of this
- Documentation of the new mechanism
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.669472597@linutronix.de
nohz.nr_cpus was observed as contended cacheline when running
enterprise workload on large systems.
Fundamental scalability challenge with nohz.idle_cpus_mask
and nohz.nr_cpus is the following:
(1) nohz_balancer_kick() observes (reads) nohz.nr_cpus
(or nohz.idle_cpu_mask) and nohz.has_blocked to see whether there's
any nohz balancing work to do, in every scheduler tick.
(2) nohz_balance_enter_idle() and nohz_balance_exit_idle()
(through nohz_balancer_kick() via sched_tick()) modify (write)
nohz.nr_cpus (and/or nohz.idle_cpu_mask) and nohz.has_blocked.
The characteristic frequencies are the following:
(1) nohz_balancer_kick() happens at scheduler (busy)tick frequency
on CPU(which has not gone idle). This is a relatively constant
frequency in the ~1 kHz range or lower.
(2) happens at idle enter/exit frequency on every CPU that goes to idle.
This is workload dependent, but can easily be hundreds of kHz for
IO-bound loads and high CPU counts. Ie. can be orders of magnitude
higher than (1), in which case a cachemiss at every invocation of (1)
is almost inevitable. idle exit will trigger (1) on the CPU
which is coming out of idle.
There's two types of costs from these functions:
(A) scheduler tick cost via (1): this happens on busy CPUs too, and is
thus a primary scalability cost. But the rate here is constant and
typically much lower than (B), hence the absolute benefit to workload
scalability will be lower as well.
(B) idle cost via (2): going-to-idle and coming-from-idle costs are
secondary concerns, because they impact power efficiency more than
they impact scalability. But in terms of absolute cost this scales
up with nr_cpus as well, and a much faster rate, and thus may also
approach and negatively impact system limits like
memory bus/fabric bandwidth.
Note that nohz.idle_cpus_mask and nohz.nr_cpus may appear to reside in the
same cacheline, however under CONFIG_CPUMASK_OFFSTACK=y the backing storage
for nohz.idle_cpus_mask will be elsewhere. With CPUMASK_OFFSTACK=n,
the nohz.idle_cpus_mask and rest of nohz fields are in different cachelines
under typical NR_CPUS=512/2048. This implies two separate cachelines
being dirtied upon idle entry / exit.
nohz.nr_cpus can be derived from the mask itself. Its usage doesn't warrant
a functionally correct value. This means one less cacheline being dirtied in
idle entry/exit path which helps to save some bus bandwidth w.r.t to those
nohz functions(approx 50%). This in turn helps to improve enterprise
workload throughput.
On system with 480 CPUs, running "hackbench 40 process 10000 loops"
(Avg of 3 runs)
baseline:
0.81% hackbench [k] nohz_balance_exit_idle
0.21% hackbench [k] nohz_balancer_kick
0.09% swapper [k] nohz_run_idle_balance
With patch:
0.35% hackbench [k] nohz_balance_exit_idle
0.09% hackbench [k] nohz_balancer_kick
0.07% swapper [k] nohz_run_idle_balance
[Ingo Molnar: scalability analysis changlog]
Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260115073524.376643-4-sshegde@linux.ibm.com
Current code does.
- Read nohz.nr_cpus
- Check if the time has passed to do NOHZ idle balance
Instead do this.
- Check if the time has passed to do NOHZ idle balance
- Read nohz.nr_cpus
This will skip the read most of the time in normal system usage.
i.e when there are nohz.nr_cpus (system is not 100% busy).
Note that when there are no idle CPUs(100% busy), even if the flag gets
set to NOHZ_STATS_KICK | NOHZ_NEXT_KICK, find_new_ilb will fail and
there will be no NOHZ idle balance. In such cases there will be a very
narrow window where, kick_ilb will be called un-necessarily.
However current functionality is still retained.
Note: This patch doesn't solve any cacheline overheads. No improvement
in performance apart from saving a few cycles of reading nohz.nr_cpus
Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260115073524.376643-2-sshegde@linux.ibm.com
The avg_vruntime comment contains a couple of mathematical notation
issues:
- The summation over w_i * (V - v_i) is written in an ambiguous form
- The delta term refers to v instead of v0, which is inconsistent
with the code and preceding explanation
Fix these to make the comment mathematically correct and consistent
with the implementation.
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260114090035.19033-1-zhanxusheng@xiaomi.com
Commit adcc3bfa88 ("sched: Adapt sched tracepoints for RV task model")
added a tracepoint to the need_resched action that can be triggered also
by set_tsk_need_resched.
This function was previously accessible from out-of-tree modules but
it's no longer available because the __trace_set_need_resched() symbol
is not exported (together with the tracepoint itself, which was exported
in a separate patch) and building such modules fails.
Export __trace_set_need_resched to modules to fix those build issues.
Fixes: adcc3bfa88 ("sched: Adapt sched tracepoints for RV task model")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://patch.msgid.link/20260112140413.362202-1-gmonaco@redhat.com
The tracepoints sched_entry, sched_exit and sched_set_need_resched
are not exported to tracefs as trace events, this allows only kernel
code to access them. Helper modules like [1] can be used to still have
the tracepoints available to ftrace for debugging purposes, but they do
rely on the tracepoints being exported.
Export the 3 not exported tracepoints.
Note that sched_set_state is already exported as the macro is called
from modules.
[1] - https://github.com/qais-yousef/sched_tp.git
Fixes: adcc3bfa88 ("sched: Adapt sched tracepoints for RV task model")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://patch.msgid.link/20251205131621.135513-9-gmonaco@redhat.com
The introduction of PREEMPT_LAZY was for multiple reasons:
- PREEMPT_RT suffered from over-scheduling, hurting performance compared to
!PREEMPT_RT.
- the introduction of (more) features that rely on preemption; like
folio_zero_user() which can do large memset() without preemption checks.
(Xen already had a horrible hack to deal with long running hypercalls)
- the endless and uncontrolled sprinkling of cond_resched() -- mostly cargo
cult or in response to poor to replicate workloads.
By moving to a model that is fundamentally preemptable these things become
managable and avoid needing to introduce more horrible hacks.
Since this is a requirement; limit PREEMPT_NONE to architectures that do not
support preemption at all. Further limit PREEMPT_VOLUNTARY to those
architectures that do not yet have PREEMPT_LAZY support (with the eventual goal
to make this the empty set and completely remove voluntary preemption and
cond_resched() -- notably VOLUNTARY is already limited to !ARCH_NO_PREEMPT.)
This leaves up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
x86) with only two preemption models: full and lazy.
While Lazy has been the recommended setting for a while, not all distributions
have managed to make the switch yet. Force things along. Keep the patch minimal
in case of hard to address regressions that might pop up.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://patch.msgid.link/20251219101502.GB1132199@noisy.programming.kicks-ass.net
This colocates some hot fields in "struct rq" to be on the same cache line
as others that are often accessed at the same time or in similar ways.
Using data from a Google-internal fleet-scale profiler, I found three
distinct groups of hot fields in struct rq:
- (1) The runqueue lock: __lock.
- (2) Those accessed from hot code in pick_next_task_fair():
nr_running, nr_numa_running, nr_preferred_running,
ttwu_pending, cpu_capacity, curr, idle.
- (3) Those accessed from some other hot codepaths, e.g.
update_curr(), update_rq_clock(), and scheduler_tick():
clock_task, clock_pelt, clock, lost_idle_time,
clock_update_flags, clock_pelt_idle, clock_idle.
The cycles spent on accessing these different groups of fields broke down
roughly as follows:
- 50% on group (1) (the runqueue lock, always read-write)
- 39% on group (2) (load:store ratio around 38:1)
- 8% on group (3) (load:store ratio around 5:1)
- 3% on all the other fields
Most of the fields in group (3) are already in a cache line grouping; this
patch just adds "clock" and "clock_update_flags" to that group. The fields
in group (2) are scattered across several cache lines; the main effect of
this patch is to group them together, on a single line at the beginning of
the structure. A few other less performance-critical fields (nr_switches,
numa_migrate_on, has_blocked_load, nohz_csd, last_blocked_load_update_tick)
were also reordered to reduce holes in the data structure.
Since the runqueue lock is acquired from so many different contexts, and is
basically always accessed using an atomic operation, putting it on either
of the cache lines for groups (2) or (3) would slow down accesses to those
fields dramatically, since those groups are read-mostly accesses.
To test this, I wrote a focused load test that would put load on the
pick_next_task_fair() path. A parent process would fork many child
processes, and each child would nanosleep() for 1 msec many times in a
loop. The load test was monitored with "perf", and I looked at the amount
of cycles that were spent with sched_balance_rq() on the stack. The test
was reliably spending ~5% of all of its cycles there. I ran it 100 times
on a pair of 2-socket Intel Haswell machines (72 vCPUs per machine) - one
running the tip of sched/core, the other running this change - using 360
child processes and 8192 1-msec sleeps per child. The mean cycle count
dropped from 5.14B to 4.91B, or a *4.6% decrease* in relevant scheduler
cycles.
Given that this change reduces cache misses in a very hot kernel codepath,
there's likely to be additional application performance improvement due to
reduced cache conflicts from kernel data structures.
On a Power11 system with 128-byte cache lines, my test showed a ~5%
decrease in relevant scheduler cycles, along with a slight increase in user
time - both positive indicators. This data comes from
https://lore.kernel.org/lkml/affdc6b1-9980-44d1-89db-d90730c1e384@linux.ibm.com/
This is the case even though the additional "____cacheline_aligned" that
puts the runqueue lock on the next cache line adds an additional 64 bytes
of padding on those machines. This patch does not change the size of
"struct rq" on machines with 64-byte cache lines.
I also ran "hackbench" to try to test this change, but it didn't show
conclusive results. Looking at a CPU cycle profile of the hackbench run,
it was spending 95% of its cycles inside __alloc_skb(), __kfree_skb(), or
kmem_cache_free() - almost all of which was spent updating memcg counters
or contending on the list_lock in kmem_cache_node. And it spent less than
0.5% of its cycles inside either schedule() or try_to_wake_up(). So it's
not surprising that it didn't show useful results here.
The "__no_randomize_layout" was added to reflect the fact that performance
of code that references this data structure is unusually sensitive to
placement of its members.
Signed-off-by: Blake Jones <blakejones@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Reviewed-by: Josh Don <joshdon@google.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Link: https://patch.msgid.link/20251202023743.1524247-1-blakejones@google.com
Commit 47efe2ddcc ("sched/core: Add assertions to QUEUE_CLASS") added an
assert to sched_change_end() verifying that a class demotion would result in a
reschedule.
As it turns out; rt_mutex_setprio() does not force a resched on class
demontion. Furthermore, this is only relevant to running tasks.
Change the warning into a reschedule and make sure to only do so for running
tasks.
Fixes: 47efe2ddcc ("sched/core: Add assertions to QUEUE_CLASS")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251216141725.GW3707837@noisy.programming.kicks-ass.net
Change sched_class::wakeup_preempt() to also get called for
cross-class wakeups, specifically those where the woken task
is of a higher class than the previous highest class.
In order to do this, track the current highest class of the runqueue
in rq::next_class and have wakeup_preempt() track this upwards for
each new wakeup. Additionally have schedule() re-set the value on
pick.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.901391274@infradead.org
There's three layers of logic in the scheduler that
deal with 'has_blocked' (load) handling of the NOHZ code:
(1) nohz.has_blocked,
(2) rq->has_blocked_load, deal with NOHZ idle balancing,
(3) and cfs_rq_has_blocked(), which is part of the layer
that is passing the SMP load-balancing signal to the
NOHZ layers.
The 'has_blocked' and 'has_blocked_load' names are used
in a mixed fashion, sometimes within the same function.
Standardize on 'has_blocked_load' to make it all easy
to read and easy to grep.
No change in functionality.
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/aS6yvxyc3JfMxxQW@gmail.com
We have to be careful with vruntime comparisons and subtraction,
due to the possibility of wrapping, so we have macros like:
#define vruntime_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; })
Which is used like this:
if (vruntime_gt(min_vruntime, se, rse))
se->min_vruntime = rse->min_vruntime;
Replace this with an easier to read pattern that uses the regular
arithmetics operators:
if (vruntime_cmp(se->min_vruntime, ">", rse->min_vruntime))
se->min_vruntime = rse->min_vruntime;
Also replace vruntime subtractions with vruntime_op():
- delta = (s64)(sea->vruntime - seb->vruntime) +
- (s64)(cfs_rqb->zero_vruntime_fi - cfs_rqa->zero_vruntime_fi);
+ delta = vruntime_op(sea->vruntime, "-", seb->vruntime) +
+ vruntime_op(cfs_rqb->zero_vruntime_fi, "-", cfs_rqa->zero_vruntime_fi);
In the vruntime_cmp() and vruntime_op() macros use Use __builtin_strcmp(),
because of __HAVE_ARCH_STRCMP might turn off the compiler optimizations
we rely on here to catch usage bugs.
No change in functionality.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The ::avg_vruntime field is a misnomer: it says it's an
'average vruntime', but in reality it's the momentary sum
of the weighted vruntimes of all queued tasks, which is
at least a division away from being an average.
This is clear from comments about the math of fair scheduling:
* \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime
This confusion is increased by the cfs_avg_vruntime() function,
which does perform the division and returns a true average.
The sum of all weighted vruntimes should be named thusly,
so rename the field to ::sum_w_vruntime. (As arguably
::sum_weighted_vruntime would be a bit of a mouthful.)
Understanding the scheduler is hard enough already, without
extra layers of obfuscated naming. ;-)
Also rename related helper functions:
sum_vruntime_add() => sum_w_vruntime_add()
sum_vruntime_sub() => sum_w_vruntime_sub()
sum_vruntime_update() => sum_w_vruntime_update()
With the notable exception of cfs_avg_vruntime(), which
was named accurately.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251201064647.1851919-7-mingo@kernel.org
The ::avg_load field is a long-standing misnomer: it says it's an
'average load', but in reality it's the momentary sum of the load
of all currently runnable tasks. We'd have to also perform a
division by nr_running (or use time-decay) to arrive at any sort
of average value.
This is clear from comments about the math of fair scheduling:
* \Sum w_i := cfs_rq->avg_load
The sum of all weights is ... the sum of all weights, not
the average of all weights.
To make it doubly confusing, there's also an ::avg_load
in the load-balancing struct sg_lb_stats, which *is* a
true average.
The second part of the field's name is a minor misnomer
as well: it says 'load', and it is indeed a load_weight
structure as it shares code with the load-balancer - but
it's only in an SMP load-balancing context where
load = weight, in the fair scheduling context the primary
purpose is the weighting of different nice levels.
So rename the field to ::sum_weight instead, which makes
the terminology of the EEVDF math match up with our
implementation of it:
* \Sum w_i := cfs_rq->sum_weight
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251201064647.1851919-6-mingo@kernel.org
Join two identical #ifdef blocks:
#ifdef CONFIG_FAIR_GROUP_SCHED
...
#endif
#ifdef CONFIG_FAIR_GROUP_SCHED
...
#endif
Also mark nested #ifdef blocks in the usual fashion, to make
it more apparent where in a nested hierarchy of #ifdefs we
are at a glance.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20251201064647.1851919-2-mingo@kernel.org
The task_tick_fair() function does:
- update the hierarchical runtimes
- drive NUMA-balancing
- update load-balance statistics
- drive force-idle preemption
All but the very first can be limited to the periodic tick. Let hrtick
only update accounting and drive preemption, not load-balancing and
other bits.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20250918080205.563385766@infradead.org
With the {rcu,sched,bh} RCU flavours being unified, it doesn't really
make sense to check for just the rcu one. Switch to the _all family of
verification which includes all 3 of the listed flavours.
Notably, this will enable us to remove some superfluous
rcu_read_lock() regions when we know they are inside preempt/IRQ
disabled regions.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove check from the name for being surplus to requirements.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While poking at this code recently I noted we do a pointless
unlock+lock cycle in sched_balance_newidle(). We drop the rq->lock (so
we can balance) but then instantly grab the same rq->lock again in
sched_balance_update_blocked_averages().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.532469061@infradead.org
Pull CPU hotplug fix from Ingo Molnar:
- Fix CPU hotplug callbacks to disable interrupts on UP kernels
* tag 'smp-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu: Make atomic hotplug callbacks run with interrupts disabled on UP
Pull irq fixes from Ingo Molnar:
- Fix error code in the irqchip/mchp-eic driver
- Fix setup_percpu_irq() affinity assumptions
- Remove the unused irq_domain_add_tree() function
* tag 'irq-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/mchp-eic: Fix error code in mchp_eic_domain_alloc()
irqdomain: Delete irq_domain_add_tree()
genirq: Allow NULL affinity for setup_percpu_irq()
Pull misc updates from Andrew Morton:
"There are no significant series in this small merge. Please see the
individual changelogs for details"
[ Editor's note: it's mainly ocfs2 and a couple of random fixes ]
* tag 'mm-nonmm-stable-2025-12-11-11-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: memfd_luo: add CONFIG_SHMEM dependency
mm: shmem: avoid build warning for CONFIG_SHMEM=n
ocfs2: fix memory leak in ocfs2_merge_rec_left()
ocfs2: invalidate inode if i_mode is zero after block read
ocfs2: avoid -Wflex-array-member-not-at-end warning
ocfs2: convert remaining read-only checks to ocfs2_emergency_state
ocfs2: add ocfs2_emergency_state helper and apply to setattr
checkpatch: add uninitialized pointer with __free attribute check
args: fix documentation to reflect the correct numbers
ocfs2: fix kernel BUG in ocfs2_find_victim_chain
liveupdate: luo_core: fix redundant bound check in luo_ioctl()
ocfs2: validate inline xattr size and entry count in ocfs2_xattr_ibody_list
fs/fat: remove unnecessary wrapper fat_max_cache()
ocfs2: replace deprecated strcpy with strscpy
ocfs2: check tl_used after reading it from trancate log inode
liveupdate: luo_file: don't use invalid list iterator
The new memfd code fails to link without SHMEM:
aarch64-linux-ld: mm/memfd_luo.o: in function `memfd_luo_retrieve_folios':
memfd_luo.c:(.text.memfd_luo_retrieve_folios+0xdc): undefined reference to `shmem_add_to_page_cache'
memfd_luo.c:(.text.memfd_luo_retrieve_folios+0x11c): undefined reference to `shmem_inode_acct_blocks'
memfd_luo.c:(.text.memfd_luo_retrieve_folios+0x134): undefined reference to `shmem_recalc_inode'
Add a Kconfig dependency to disallow that configuration.
Link: https://lkml.kernel.org/r/20251204100203.1034394-1-arnd@kernel.org
Fixes: b3749f174d ("mm: memfd_luo: allow preserving memfd")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The kernel test robot reported a Smatch warning:
kernel/liveupdate/luo_core.c:402 luo_ioctl() warn: unsigned 'nr' is
never less than zero.
This occurs because 'nr' is unsigned and LIVEUPDATE_CMD_BASE is currently
defined as 0, making the check (nr < LIVEUPDATE_CMD_BASE) always false.
Remove the explicit lower bound check. The logic remains correct because
'nr' is unsigned; if nr is less than LIVEUPDATE_CMD_BASE, the expression
(nr - LIVEUPDATE_CMD_BASE) will wrap around to a large positive value.
This will inevitably be larger than ARRAY_SIZE(luo_ioctl_ops) and be
caught by the upper bound check.
Link: https://lkml.kernel.org/r/20251130010919.1488230-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511280300.6pvBmXUS-lkp@intel.com/
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull more s390 updates from Heiko Carstens:
- Use the MSI parent domain API instead of the legacy API for setup and
teardown of PCI MSI IRQs
- Select POSIX_CPU_TIMERS_TASK_WORK now that VIRT_XFER_TO_GUEST_WORK
has been implemented for s390
- Fix a KVM bug which can lead to guest memory corruption
- Fix KASAN shadow memory mapping for hotplugged memory
- Minor bug fixes and improvements
* tag 's390-6.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/bug: Add missing alignment
s390/bug: Add missing CONFIG_BUG ifdef again
KVM: s390: Fix gmap_helper_zap_one_page() again
s390/pci: Migrate s390 IRQ logic to IRQ domain API
genirq: Change hwirq parameter to irq_hw_number_t
s390: Select POSIX_CPU_TIMERS_TASK_WORK
s390: Unmap early KASAN shadow on memory offlining
s390/vmem: Support 2G page splitting for KASAN shadow freeing
s390/boot: Use entire page for PTEs
s390/vmur: Use scnprintf() instead of sprintf()
Pull dma-mapping fixes from Marek Szyprowski:
- last minute fix for missing parenthesis in recently merged code (Hans
de Goede)
- removal of excessive, non-fatal warnings (Dave Kleikamp)
* tag 'dma-mapping-6.19-2025-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-mapping: Fix DMA_BIT_MASK() macro being broken
dma/pool: eliminate alloc_pages warning in atomic_pool_expand
Pull futex updates from Ingo Molnar:
- Standardize on ktime_t in restart_block::time as well (Thomas
Weißschuh)
- Futex selftests:
- Add robust list testcases (André Almeida)
- Formatting fixes/cleanups (Carlos Llamas)
* tag 'locking-futex-2025-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Store time as ktime_t in restart block
selftests/futex: Create test for robust list
selftests/futex: Skip tests if shmget unsupported
selftests/futex: Add newline to ksft_exit_fail_msg()
selftests/futex: Remove unused test_futex_mpol()
On SMP systems the CPU hotplug callbacks in the "starting" range are
invoked while the CPU is brought up and interrupts are still
disabled. Callbacks which are added later are invoked via the
hotplug-thread on the target CPU and interrupts are explicitly disabled.
In the UP case callbacks which are added later are invoked directly without
the thread indirection. This is in principle okay since there is just one
CPU but those callbacks are invoked with interrupt disabled code. That's
incorrect as those callbacks assume interrupt disabled context.
Disable interrupts before invoking the callbacks on UP if the state is
atomic and interrupts are expected to be disabled. The "save" part is
required because this is also invoked early in the boot process while
interrupts are disabled and must not be enabled prematurely.
Fixes: 06ddd17521 ("sched/smp: Always define is_percpu_thread() and scheduler_ipi()")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251127144723.ev9DuXXR@linutronix.de
For events with inherit_stat enabled, a "read" event will be generated
to collect per task event counts on task exit.
The call chain is as follows:
do_exit
-> perf_event_exit_task
-> perf_event_exit_task_context
-> perf_event_exit_event
-> perf_remove_from_context
-> perf_child_detach
-> sync_child_event
-> perf_event_read_event
However, the child event context detaches the task too early in
perf_event_exit_task_context, which causes sync_child_event to never
generate the read event in this case, since child_event->ctx->task is
always set to TASK_TOMBSTONE. Fix that by moving context lock section
backward to ensure ctx->task is not set to TASK_TOMBSTONE before
generating the read event.
Because perf_event_free_task calls perf_event_exit_task_context with
exit = false to tear down all child events from the context, and the
task never lived, accessing the task PID can lead to a use-after-free.
To fix that, let sync_child_event read task from argument and move the
call to the only place it should be triggered to avoid the effect of
setting ctx->task to TASK_TOMESTONE, and add a task parameter to
perf_event_exit_event to trigger the sync_child_event properly when
needed.
This bug can be reproduced by running "perf record -s" and attaching to
any program that generates perf events in its child tasks. If we check
the result with "perf report -T", the last line of the report will leave
an empty table like "# PID TID", which is expected to contain the
per-task event counts by design.
Fixes: ef54c1a476 ("perf: Rework perf_event_exit_event()")
Signed-off-by: Thaumy Cheng <thaumy.love@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: linux-perf-users@vger.kernel.org
Link: https://patch.msgid.link/20251209041600.963586-1-thaumy.love@gmail.com