Commit Graph

49584 Commits

Author SHA1 Message Date
Peter Zijlstra
b33a51564e perf: Use guard() for aux_mutex in perf_mmap()
After duplicating the common code into the rb/aux branches is it
possible to use a simple guard() for the aux_mutex. Making the aux
branch self-contained.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.246250452@infradead.org
2025-08-15 13:13:00 +02:00
Peter Zijlstra
41b80e1d74 perf: Remove redundant aux_unlock label
unlock and aux_unlock are now identical, remove the aux_unlock one.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.131293512@infradead.org
2025-08-15 13:13:00 +02:00
Peter Zijlstra
4118994b33 perf: Move common code into both rb and aux branches
if (cond) {
    A;
  } else {
    B;
  }
  C;

into

  if (cond) {
    A;
    C;
  } else {
    B;
    C;
  }

Notably C has a success branch and both A and B have two places for
success. For A (rb case), duplicate the success case because later
patches will result in them no longer being identical. For B (aux
case), share using goto (cleaned up later).

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.016252852@infradead.org
2025-08-15 13:12:59 +02:00
Peter Zijlstra
3821f25868 perf: Merge consecutive conditionals in perf_mmap()
if (cond) {
    A;
  } else {
    B;
  }

  if (cond) {
    C;
  } else {
    D;
  }

into:

  if (cond) {
    A;
    C;
  } else {
    B;
    D;
  }

Notably the conditions are not identical in form, but are equivalent.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104018.900078502@infradead.org
2025-08-15 13:12:59 +02:00
Peter Zijlstra
86a0a7c598 perf: Move perf_mmap_calc_limits() into both rb and aux branches
if (cond) {
    A;
  } else {
    B;
  }
  C;

into

  if (cond) {
    A;
    C;
  } else {
    B;
    C;
  }

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104018.781244099@infradead.org
2025-08-15 13:12:59 +02:00
Thomas Gleixner
1ea3e3b0da perf: Split out VM accounting
Similarly to the mlock limit calculation the VM accounting is required for
both the ringbuffer and the AUX buffer allocations.

To prepare for splitting them out into separate functions, move the
accounting into a helper function.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104018.660347811@infradead.org
2025-08-15 13:12:59 +02:00
Thomas Gleixner
81e026ca47 perf: Split out mlock limit handling
To prepare for splitting the buffer allocation out into separate functions
for the ring buffer and the AUX buffer, split out mlock limit handling into
a helper function, which can be called from both.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104018.541975109@infradead.org
2025-08-15 13:12:58 +02:00
Thomas Gleixner
e8c4f6ee8e perf: Remove redundant condition for AUX buffer size
It is already checked whether the VMA size is the same as
nr_pages * PAGE_SIZE, so later checking both:

      aux_size == vma_size && aux_size == nr_pages * PAGE_SIZE

is redundant. Remove the vma_size check as nr_pages is what is actually
used in the allocation function. That prepares for splitting out the buffer
allocation into separate functions, so that only nr_pages needs to be
handed in.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104018.424519320@infradead.org
2025-08-15 13:12:58 +02:00
Yunseong Kim
b64fdd422a perf: Avoid undefined behavior from stopping/starting inactive events
Calling pmu->start()/stop() on perf events in PERF_EVENT_STATE_OFF can
leave event->hw.idx at -1. When PMU drivers later attempt to use this
negative index as a shift exponent in bitwise operations, it leads to UBSAN
shift-out-of-bounds reports.

The issue is a logical flaw in how event groups handle throttling when some
members are intentionally disabled. Based on the analysis and the
reproducer provided by Mark Rutland (this issue on both arm64 and x86-64).

The scenario unfolds as follows:

 1. A group leader event is configured with a very aggressive sampling
    period (e.g., sample_period = 1). This causes frequent interrupts and
    triggers the throttling mechanism.
 2. A child event in the same group is created in a disabled state
    (.disabled = 1). This event remains in PERF_EVENT_STATE_OFF.
    Since it hasn't been scheduled onto the PMU, its event->hw.idx remains
    initialized at -1.
 3. When throttling occurs, perf_event_throttle_group() and later
    perf_event_unthrottle_group() iterate through all siblings, including
    the disabled child event.
 4. perf_event_throttle()/unthrottle() are called on this inactive child
    event, which then call event->pmu->start()/stop().
 5. The PMU driver receives the event with hw.idx == -1 and attempts to
    use it as a shift exponent. e.g., in macros like PMCNTENSET(idx),
    leading to the UBSAN report.

The throttling mechanism attempts to start/stop events that are not
actively scheduled on the hardware.

Move the state check into perf_event_throttle()/perf_event_unthrottle() so
that inactive events are skipped entirely. This ensures only active events
with a valid hw.idx are processed, preventing undefined behavior and
silencing UBSAN warnings. The corrected check ensures true before
proceeding with PMU operations.

The problem can be reproduced with the syzkaller reproducer:

Fixes: 9734e25fbf ("perf: Fix the throttle logic for a group")
Signed-off-by: Yunseong Kim <ysk@kzalloc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20250812181046.292382-2-ysk@kzalloc.com
2025-08-15 13:12:56 +02:00
Jiri Olsa
e4414b01c1 bpf: Check the helper function is valid in get_helper_proto
kernel test robot reported verifier bug [1] where the helper func
pointer could be NULL due to disabled config option.

As Alexei suggested we could check on that in get_helper_proto
directly. Marking tail_call helper func with BPF_PTR_POISON,
because it is unused by design.

  [1] https://lore.kernel.org/oe-lkp/202507160818.68358831-lkp@intel.com

Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: syzbot+a9ed3d9132939852d0df@syzkaller.appspotmail.com
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250814200655.945632-1-jolsa@kernel.org
Closes: https://lore.kernel.org/oe-lkp/202507160818.68358831-lkp@intel.com
2025-08-15 11:16:56 +02:00
Jesper Dangaard Brouer
2b986b9e91 bpf, cpumap: Disable page_pool direct xdp_return need larger scope
When running an XDP bpf_prog on the remote CPU in cpumap code
then we must disable the direct return optimization that
xdp_return can perform for mem_type page_pool.  This optimization
assumes code is still executing under RX-NAPI of the original
receiving CPU, which isn't true on this remote CPU.

The cpumap code already disabled this via helpers
xdp_set_return_frame_no_direct() and xdp_clear_return_frame_no_direct(),
but the scope didn't include xdp_do_flush().

When doing XDP_REDIRECT towards e.g devmap this causes the
function bq_xmit_all() to run with direct return optimization
enabled. This can lead to hard to find bugs.  The issue
only happens when bq_xmit_all() cannot ndo_xdp_xmit all
frames and them frees them via xdp_return_frame_rx_napi().

Fix by expanding scope to include xdp_do_flush(). This was found
by Dragos Tatulea.

Fixes: 11941f8a85 ("bpf: cpumap: Implement generic cpumap")
Reported-by: Dragos Tatulea <dtatulea@nvidia.com>
Reported-by: Chris Arges <carges@cloudflare.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Chris Arges <carges@cloudflare.com>
Link: https://patch.msgid.link/175519587755.3008742.1088294435150406835.stgit@firesoul
2025-08-15 11:08:08 +02:00
Paul E. McKenney
51c285baa3 rcutorture: Delay forward-progress testing until boot completes
Forward-progress testing can hog CPUs, which is not a great thing to do
before boot has completed.  This commit therefore makes the CPU-hotplug
operations hold off until boot has completed.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-08-14 15:26:30 -07:00
Paul E. McKenney
6e9c48b3e3 torture: Delay CPU-hotplug operations until boot completes
CPU-hotplug operations invoke stop-machine, which can hog CPUs, which is
not a great thing to do before boot has completed.  This commit therefore
makes the CPU-hotplug operations hold off until boot has completed.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-08-14 15:26:30 -07:00
Paul E. McKenney
9a316fe3ad rcutorture: Delay rcutorture readers and writers until boot completes
The rcutorture writers and (especially) readers are the biggest CPU
hogs of the bunch, so this commit therefore makes them wait until boot
has completed.

This makes the current setting of the boot_ended local variable dead code,
so while in the area, this commit removes that as well.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-08-14 15:26:30 -07:00
Paul E. McKenney
1b0f583843 rcutorture: Suppress "Writer stall state" reports during boot
When rcutorture is running on only the one boot-time CPU while that CPU
is busy invoking initcall() functions, the added load is quite likely to
unduly delay the RCU grace-period kthread, rcutorture readers, and much
else besides.  This can result in rcu_torture_stats_print() reporting
rcutorture writer stalls, which are not really a bug in that environment.
After all, one CPU can only do so much.

This commit therefore suppresses rcutorture writer stalls while the
kernel is booting, that is, while rcu_inkernel_boot_has_ended() continues
returning false.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-08-14 15:26:30 -07:00
Paul E. McKenney
b930ff84f3 torture: Announce kernel boot status at torture-test startup
Sometimes a given system takes surprisingly long to boot, for example,
in one recent case, 70 seconds instead of three seconds.  It would
be good to fix these slow-boot issues, but it would also be good for
the torture tests to announce that the system was still booting at the
start of the test.  Especially for tests that have a greater probability
of false positives when run in the single-CPU boot-time environment.
Yes, those tests should defend themselves, but we should also make this
situation easier to diagnose.

This commit therefore causes torture_print_module_parms() to print
"still booting" at the end of its printk() that dumps out the values of
its module parameters.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-08-14 15:26:30 -07:00
Zqiang
42d590d100 rcu: Remove local_irq_save/restore() in rcu_preempt_deferred_qs_handler()
The per-CPU rcu_data structure's ->defer_qs_iw field is initialized
by IRQ_WORK_INIT_HARD(), which means that the subsequent invocation of
rcu_preempt_deferred_qs_handler() will always be executed with interrupts
disabled.  This commit therefore removes the local_irq_save/restore()
operations from rcu_preempt_deferred_qs_handler() and adds a call to
lockdep_assert_irqs_disabled() in order to enable lockdep to diagnose
mistaken invocations of this function from interrupts-enabled code.

Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-08-14 15:25:15 -07:00
Linus Torvalds
63467137ec Merge tag 'net-6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
 "Including fixes from Netfilter and IPsec.

  Current release - regressions:

   - netfilter: nft_set_pipapo:
      - don't return bogus extension pointer
      - fix null deref for empty set

  Current release - new code bugs:

   - core: prevent deadlocks when enabling NAPIs with mixed kthread
     config

   - eth: netdevsim: Fix wild pointer access in nsim_queue_free().

  Previous releases - regressions:

   - page_pool: allow enabling recycling late, fix false positive
     warning

   - sched: ets: use old 'nbands' while purging unused classes

   - xfrm:
      - restore GSO for SW crypto
      - bring back device check in validate_xmit_xfrm

   - tls: handle data disappearing from under the TLS ULP

   - ptp: prevent possible ABBA deadlock in ptp_clock_freerun()

   - eth:
      - bnxt: fill data page pool with frags if PAGE_SIZE > BNXT_RX_PAGE_SIZE
      - hv_netvsc: fix panic during namespace deletion with VF

  Previous releases - always broken:

   - netfilter: fix refcount leak on table dump

   - vsock: do not allow binding to VMADDR_PORT_ANY

   - sctp: linearize cloned gso packets in sctp_rcv

   - eth:
      - hibmcge: fix the division by zero issue
      - microchip: fix KSZ8863 reset problem"

* tag 'net-6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (54 commits)
  net: usb: asix_devices: add phy_mask for ax88772 mdio bus
  net: kcm: Fix race condition in kcm_unattach()
  selftests: net/forwarding: test purge of active DWRR classes
  net/sched: ets: use old 'nbands' while purging unused classes
  bnxt: fill data page pool with frags if PAGE_SIZE > BNXT_RX_PAGE_SIZE
  netdevsim: Fix wild pointer access in nsim_queue_free().
  net: mctp: Fix bad kfree_skb in bind lookup test
  netfilter: nf_tables: reject duplicate device on updates
  ipvs: Fix estimator kthreads preferred affinity
  netfilter: nft_set_pipapo: fix null deref for empty set
  selftests: tls: test TCP stealing data from under the TLS socket
  tls: handle data disappearing from under the TLS ULP
  ptp: prevent possible ABBA deadlock in ptp_clock_freerun()
  ixgbe: prevent from unwanted interface name changes
  devlink: let driver opt out of automatic phys_port_name generation
  net: prevent deadlocks when enabling NAPIs with mixed kthread config
  net: update NAPI threaded config even for disabled NAPIs
  selftests: drv-net: don't assume device has only 2 queues
  docs: Fix name for net.ipv4.udp_child_hash_entries
  riscv: dts: thead: Add APB clocks for TH1520 GMACs
  ...
2025-08-14 07:14:30 -07:00
Paul E. McKenney
faab3ae329 rcu: Document that rcu_barrier() hurries lazy callbacks
This commit adds to the rcu_barrier() kerneldoc header stating that this
function hurries lazy callbacks and that it does not normally result in
additional RCU grace periods.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-08-13 15:04:06 -07:00
Chen Ridong
4c70fb2624 cpuset: remove redundant CS_ONLINE flag
The CS_ONLINE flag was introduced prior to the CSS_ONLINE flag in the
cpuset subsystem. Currently, the flag setting sequence is as follows:

1. cpuset_css_online() sets CS_ONLINE
2. css->flags gets CSS_ONLINE set
...
3. cgroup->kill_css sets CSS_DYING
4. cpuset_css_offline() clears CS_ONLINE
5. css->flags clears CSS_ONLINE

The is_cpuset_online() check currently occurs between steps 1 and 3.
However, it would be equally safe to perform this check between steps 2
and 3, as CSS_ONLINE provides the same synchronization guarantee as
CS_ONLINE.

Since CS_ONLINE is redundant with CSS_ONLINE and provides no additional
synchronization benefits, we can safely remove it to simplify the code.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-13 08:14:20 -10:00
Dan Carpenter
70d0085864 audit: add a missing tab
Someone got a bit carried away deleting tabs.  Add it back.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-13 11:51:43 -04:00
Shanker Donthineni
89a2d212bd dma/pool: Ensure DMA_DIRECT_REMAP allocations are decrypted
When CONFIG_DMA_DIRECT_REMAP is enabled, atomic pool pages are
remapped via dma_common_contiguous_remap() using the supplied
pgprot. Currently, the mapping uses
pgprot_dmacoherent(PAGE_KERNEL), which leaves the memory encrypted
on systems with memory encryption enabled (e.g., ARM CCA Realms).

This can cause the DMA layer to fail or crash when accessing the
memory, as the underlying physical pages are not configured as
expected.

Fix this by requesting a decrypted mapping in the vmap() call:
pgprot_decrypted(pgprot_dmacoherent(PAGE_KERNEL))

This ensures that atomic pool memory is consistently mapped
unencrypted.

Cc: stable@vger.kernel.org
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250811181759.998805-1-sdonthineni@nvidia.com
2025-08-13 11:02:10 +02:00
John Stultz
21924af67d locking: Fix __clear_task_blocked_on() warning from __ww_mutex_wound() path
The __clear_task_blocked_on() helper added a number of sanity
checks ensuring we hold the mutex wait lock and that the task
we are clearing blocked_on pointer (if set) matches the mutex.

However, there is an edge case in the _ww_mutex_wound() logic
where we need to clear the blocked_on pointer for the task that
owns the mutex, not the task that is waiting on the mutex.

For this case the sanity checks aren't valid, so handle this
by allowing a NULL lock to skip the additional checks.

K Prateek Nayak and Maarten Lankhorst also pointed out that in
this case where we don't hold the owner's mutex wait_lock, we
need to be a bit more careful using READ_ONCE/WRITE_ONCE in both
the __clear_task_blocked_on() and __set_task_blocked_on()
implementations to avoid accidentally tripping WARN_ONs if two
instances race. So do that here as well.

This issue was easier to miss, I realized, as the test-ww_mutex
driver only exercises the wait-die class of ww_mutexes. I've
sent a patch[1] to address this so the logic will be easier to
test.

[1]: https://lore.kernel.org/lkml/20250801023358.562525-2-jstultz@google.com/

Fixes: a4f0b6fef4 ("locking/mutex: Add p->blocked_on wrappers for correctness checks")
Closes: https://lore.kernel.org/lkml/68894443.a00a0220.26d0e1.0015.GAE@google.com/
Reported-by: syzbot+602c4720aed62576cd79@syzkaller.appspotmail.com
Reported-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250805001026.2247040-1-jstultz@google.com
2025-08-13 10:34:54 +02:00
Frederic Weisbecker
c0a23bbc98 ipvs: Fix estimator kthreads preferred affinity
The estimator kthreads' affinity are defined by sysctl overwritten
preferences and applied through a plain call to the scheduler's affinity
API.

However since the introduction of managed kthreads preferred affinity,
such a practice shortcuts the kthreads core code which eventually
overwrites the target to the default unbound affinity.

Fix this with using the appropriate kthread's API.

Fixes: d1a8919758 ("kthread: Default affine kthread to its preferred NUMA node")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-08-13 08:34:33 +02:00
Qianfeng Rong
bf0c2a84df bpf: Replace kvfree with kfree for kzalloc memory
The 'backedge' pointer is allocated with kzalloc(), which returns
physically contiguous memory. Using kvfree() to deallocate such
memory is functionally safe but semantically incorrect.

Replace kvfree() with kfree() to avoid unnecessary is_vmalloc_addr()
check in kvfree().

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20250811123949.552885-1-rongqianfeng@vivo.com
2025-08-12 15:55:01 -07:00
Qianfeng Rong
3e2b799008 bpf: Remove redundant __GFP_NOWARN
Commit 16f5dfbc85 ("gfp: include __GFP_NOWARN in GFP_NOWAIT")
made GFP_NOWAIT implicitly include __GFP_NOWARN.

Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT
(e.g., `GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean
up these redundant flags across subsystems.

No functional changes.

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20250804122731.460158-1-rongqianfeng@vivo.com
2025-08-12 14:56:04 -07:00
Martin KaFai Lau
9e293d47bf Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Cross merge bpf/master after 6.17-rc1.

No conflict.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-08-12 12:14:02 -07:00
Thorsten Blum
8a013ec9cb cgroup: Replace deprecated strcpy() with strscpy()
strcpy() is deprecated; use strscpy() instead.

Link: https://github.com/KSPP/linux/issues/88
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-12 08:57:03 -10:00
Andrea Righi
ddf7233fca sched/ext: Fix invalid task state transitions on class switch
When enabling a sched_ext scheduler, we may trigger invalid task state
transitions, resulting in warnings like the following (which can be
easily reproduced by running the hotplug selftest in a loop):

 sched_ext: Invalid task state transition 0 -> 3 for fish[770]
 WARNING: CPU: 18 PID: 787 at kernel/sched/ext.c:3862 scx_set_task_state+0x7c/0xc0
 ...
 RIP: 0010:scx_set_task_state+0x7c/0xc0
 ...
 Call Trace:
  <TASK>
  scx_enable_task+0x11f/0x2e0
  switching_to_scx+0x24/0x110
  scx_enable.isra.0+0xd14/0x13d0
  bpf_struct_ops_link_create+0x136/0x1a0
  __sys_bpf+0x1edd/0x2c30
  __x64_sys_bpf+0x21/0x30
  do_syscall_64+0xbb/0x370
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

This happens because we skip initialization for tasks that are already
dead (with their usage counter set to zero), but we don't exclude them
during the scheduling class transition phase.

Fix this by also skipping dead tasks during class swiching, preventing
invalid task state transitions.

Fixes: a8532fac7b ("sched_ext: TASK_DEAD tasks must be switched into SCX on ops_enable")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-11 06:56:37 -10:00
Waiman Long
dfb36e4a8d futex: Use user_write_access_begin/_end() in futex_put_value()
Commit cec199c5e3 ("futex: Implement FUTEX2_NUMA") introduced the
futex_put_value() helper to write a value to the given user
address.

However, it uses user_read_access_begin() before the write. For
architectures that differentiate between read and write accesses, like
PowerPC, futex_put_value() fails with -EFAULT.

Fix that by using the user_write_access_begin/user_write_access_end() pair
instead.

Fixes: cec199c5e3 ("futex: Implement FUTEX2_NUMA")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250811141147.322261-1-longman@redhat.com
2025-08-11 17:53:21 +02:00
Kieran Moy
df1145b56c audit: fix typo in auditfilter.c comment
Correct the misspelling of "searching" (was "serarching")
in the function documentation for audit_update_lsm_rules.

Found via code inspection, no functional impact.

Signed-off-by: Kieran Moy <kfatyuip@gmail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-11 11:44:57 -04:00
Thorsten Blum
d8c09d7b55 audit: Replace deprecated strcpy() with strscpy()
strcpy() is deprecated; use strscpy() instead.

Link: https://github.com/KSPP/linux/issues/88
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-11 11:44:55 -04:00
Casey Schaufler
c5055d0c8e audit: fix indentation in audit_log_exit()
Fix two indentation errors in audit_log_exit().

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subject tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-11 11:44:50 -04:00
Oreoluwa Babatunde
2c223f7239 of: reserved_mem: Restructure call site for dma_contiguous_early_fixup()
Restructure the call site for dma_contiguous_early_fixup() to
where the reserved_mem nodes are being parsed from the DT so that
dma_mmu_remap[] is populated before dma_contiguous_remap() is called.

Fixes: 8a6e02d0c0 ("of: reserved_mem: Restructure how the reserved memory regions are processed")
Signed-off-by: Oreoluwa Babatunde <oreoluwa.babatunde@oss.qualcomm.com>
Tested-by: William Zhang <william.zhang@broadcom.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250806172421.2748302-1-oreoluwa.babatunde@oss.qualcomm.com
2025-08-11 13:05:38 +02:00
Qianfeng Rong
110aa2c74d swiotlb: Remove redundant __GFP_NOWARN
Commit 16f5dfbc85 ("gfp: include __GFP_NOWARN in GFP_NOWAIT")
made GFP_NOWAIT implicitly include __GFP_NOWARN.

Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT
(e.g., `GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean
up these redundant flags across subsystems.

No functional changes.

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250805023222.332920-1-rongqianfeng@vivo.com
2025-08-11 11:29:38 +02:00
Petr Tesarik
9f683dfe80 dma-direct: clean up the logic in __dma_direct_alloc_pages()
Convert a goto-based loop to a while() loop. To allow the simplification,
return early when allocation from CMA is successful. As a bonus, this early
return avoids a repeated dma_coherent_ok() check.

No functional change.

Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250710083829.1853466-1-ptesarik@suse.com
2025-08-11 11:28:04 +02:00
Frederic Weisbecker
61399e0c54 rcu: Fix racy re-initialization of irq_work causing hangs
RCU re-initializes the deferred QS irq work everytime before attempting
to queue it. However there are situations where the irq work is
attempted to be queued even though it is already queued. In that case
re-initializing messes-up with the irq work queue that is about to be
handled.

The chances for that to happen are higher when the architecture doesn't
support self-IPIs and irq work are then all lazy, such as with the
following sequence:

1) rcu_read_unlock() is called when IRQs are disabled and there is a
   grace period involving blocked tasks on the node. The irq work
   is then initialized and queued.

2) The related tasks are unblocked and the CPU quiescent state
   is reported. rdp->defer_qs_iw_pending is reset to DEFER_QS_IDLE,
   allowing the irq work to be requeued in the future (note the previous
   one hasn't fired yet).

3) A new grace period starts and the node has blocked tasks.

4) rcu_read_unlock() is called when IRQs are disabled again. The irq work
   is re-initialized (but it's queued! and its node is cleared) and
   requeued. Which means it's requeued to itself.

5) The irq work finally fires with the tick. But since it was requeued
   to itself, it loops and hangs.

Fix this with initializing the irq work only once before the CPU boots.

Fixes: b41642c877 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202508071303.c1134cce-lkp@intel.com
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-08-11 08:43:49 +05:30
Linus Torvalds
b96ddbc5c8 Merge tag 'smp_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp fixes from Borislav Petkov:

 - Remove an obsolete comment and fix spelling

* tag 'smp_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu: Remove obsolete comment from takedown_cpu()
  smp: Fix spelling in on_each_cpu_cond_mask()'s doc-comment
2025-08-10 08:51:37 +03:00
Linus Torvalds
7d2fed1f3c Merge tag 'irq_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:

 - Fix a wrong ioremap size in mvebu-gicp

 - Remove yet another compile-test case for a driver which needs an
   additional dependency

 - Fix a lock inversion scenario in the IRQ unit test suite

 - Remove an impossible flag situation in gic-v5

 - Do not iounmap resources in gic-v5 which are managed by devm

 - Make sure stale, left-over interrupts in mvebu-gicp are cleared on
   driver init

 - Fix a reference counting mishap in msi-lib

 - Fix a dereference-before-null-ptr-check case in the riscv-imsic
   irqchip driver

* tag 'irq_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/mvebu-gicp: Use resource_size() for ioremap()
  irqchip: Build IMX_MU_MSI only on ARM
  genirq/test: Resolve irq lock inversion warnings
  irqchip/gic-v5: Remove IRQD_RESEND_WHEN_IN_PROGRESS for ITS IRQs
  irqchip/gic-v5: iwb: Fix iounmap probe failure path
  irqchip/mvebu-gicp: Clear pending interrupts on init
  irqchip/msi-lib: Fix fwnode refcount in msi_lib_irq_domain_select()
  irqchip/riscv-imsic: Don't dereference before NULL pointer check
2025-08-10 08:46:47 +03:00
Linus Torvalds
8e8f6b635f Merge tag 'locking_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Borislav Petkov:

 - Prevent a futex hash leak due to different mm lifetimes

* tag 'locking_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Move futex cleanup to __mmdrop()
2025-08-10 08:11:39 +03:00
JP Kobryn
eea51c6e3f cgroup: avoid null de-ref in css_rstat_exit()
css_rstat_exit() may be called asynchronously in scenarios where preceding
calls to css_rstat_init() have not completed. One such example is this
sequence below:

css_create(...)
{
	...
	init_and_link_css(css, ...);

	err = percpu_ref_init(...);
	if (err)
		goto err_free_css;
	err = cgroup_idr_alloc(...);
	if (err)
		goto err_free_css;
	err = css_rstat_init(css, ...);
	if (err)
		goto err_free_css;
	...
err_free_css:
	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
	return ERR_PTR(err);
}

If any of the three goto jumps are taken, async cleanup will begin and
css_rstat_exit() will be invoked on an uninitialized css->rstat_cpu.

Avoid accessing the unitialized field by returning early in
css_rstat_exit() if this is the case.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Suggested-by: Michal Koutný <mkoutny@suse.com>
Fixes: 5da3bfa029 ("cgroup: use separate rstat trees for each subsystem")
Cc: stable@vger.kernel.org # v6.16
Reported-by: syzbot+8d052e8b99e40bc625ed@syzkaller.appspotmail.com
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-09 08:46:32 -10:00
Waiman Long
87eba5bc5a cgroup/cpuset: Remove the unnecessary css_get/put() in cpuset_partition_write()
The css_get/put() calls in cpuset_partition_write() are unnecessary as
an active reference of the kernfs node will be taken which will prevent
its removal and guarantee the existence of the css. Only the online
check is needed.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-09 08:42:47 -10:00
Waiman Long
150e298ae0 cgroup/cpuset: Fix a partition error with CPU hotplug
It was found during testing that an invalid leaf partition with an
empty effective exclusive CPU list can become a valid empty partition
with no CPU afer an offline/online operation of an unrelated CPU. An
empty partition root is allowed in the special case that it has no
task in its cgroup and has distributed out all its CPUs to its child
partitions. That is certainly not the case here.

The problem is in the cpumask_subsets() test in the hotplug case
(update with no new mask) of update_parent_effective_cpumask() as it
also returns true if the effective exclusive CPU list is empty. Fix that
by addding the cpumask_empty() test to root out this exception case.
Also add the cpumask_empty() test in cpuset_hotplug_update_tasks()
to avoid calling update_parent_effective_cpumask() for this special case.

Fixes: 0c7f293efc ("cgroup/cpuset: Add cpuset.cpus.exclusive.effective for v2")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-09 08:42:28 -10:00
Waiman Long
65f97cc81b cgroup/cpuset: Use static_branch_enable_cpuslocked() on cpusets_insane_config_key
The following lockdep splat was observed.

[  812.359086] ============================================
[  812.359089] WARNING: possible recursive locking detected
[  812.359097] --------------------------------------------
[  812.359100] runtest.sh/30042 is trying to acquire lock:
[  812.359105] ffffffffa7f27420 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_enable+0xe/0x20
[  812.359131]
[  812.359131] but task is already holding lock:
[  812.359134] ffffffffa7f27420 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_write_resmask+0x98/0xa70
     :
[  812.359267] Call Trace:
[  812.359272]  <TASK>
[  812.359367]  cpus_read_lock+0x3c/0xe0
[  812.359382]  static_key_enable+0xe/0x20
[  812.359389]  check_insane_mems_config.part.0+0x11/0x30
[  812.359398]  cpuset_write_resmask+0x9f2/0xa70
[  812.359411]  cgroup_file_write+0x1c7/0x660
[  812.359467]  kernfs_fop_write_iter+0x358/0x530
[  812.359479]  vfs_write+0xabe/0x1250
[  812.359529]  ksys_write+0xf9/0x1d0
[  812.359558]  do_syscall_64+0x5f/0xe0

Since commit d74b27d63a ("cgroup/cpuset: Change cpuset_rwsem
and hotplug lock order"), the ordering of cpu hotplug lock
and cpuset_mutex had been reversed. That patch correctly
used the cpuslocked version of the static branch API to enable
cpusets_pre_enable_key and cpusets_enabled_key, but it didn't do the
same for cpusets_insane_config_key.

The cpusets_insane_config_key can be enabled in the
check_insane_mems_config() which is called from update_nodemask()
or cpuset_hotplug_update_tasks() with both cpu hotplug lock and
cpuset_mutex held. Deadlock can happen with a pending hotplug event that
tries to acquire the cpu hotplug write lock which will block further
cpus_read_lock() attempt from check_insane_mems_config(). Fix that by
switching to use static_branch_enable_cpuslocked().

Fixes: d74b27d63a ("cgroup/cpuset: Change cpuset_rwsem and hotplug lock order")
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-09 08:41:39 -10:00
Linus Torvalds
c30a13538d Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:

 - Fix memory leak of bpf_scc_info objects (Eduard Zingerman)

 - Fix a regression in the 'perf' tool caused by moving UID filtering to
   BPF (Ilya Leoshkevich)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  perf bpf-filter: Enable events manually
  libbpf: Add the ability to suppress perf event enablement
  bpf: Fix memory leak of bpf_scc_info objects
2025-08-09 09:03:21 +03:00
Eduard Zingerman
77620d1267 bpf: use realloc in bpf_patch_insn_data
Avoid excessive vzalloc/vfree calls when patching instructions in
do_misc_fixups(). bpf_patch_insn_data() uses vzalloc to allocate new
memory for env->insn_aux_data for each patch as follows:

  struct bpf_prog *bpf_patch_insn_data(env, ...)
  {
    ...
    new_data = vzalloc(... O(program size) ...);
    ...
    adjust_insn_aux_data(env, new_data, ...);
    ...
  }

  void adjust_insn_aux_data(env, new_data, ...)
  {
    ...
    memcpy(new_data, env->insn_aux_data);
    vfree(env->insn_aux_data);
    env->insn_aux_data = new_data;
    ...
  }

The vzalloc/vfree pair is hot in perf report collected for e.g.
pyperf180 test case. It can be replaced with a call to vrealloc in
order to reduce the number of actual memory allocations.

This is a stop-gap solution, as bpf_patch_insn_data is still hot in
the profile. More comprehansive solutions had been discussed before
e.g. as in [1].

[1] https://lore.kernel.org/bpf/CAEf4BzY_E8MSL4mD0UPuuiDcbJhh9e2xQo2=5w+ppRWWiYSGvQ@mail.gmail.com/

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Tested-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20250807010205.3210608-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-07 09:17:02 -07:00
Eduard Zingerman
cb070a8156 bpf: removed unused 'env' parameter from is_reg64 and insn_has_def32
Parameter 'env' is not used by is_reg64() and insn_has_def32()
functions. Remove the parameter to make it clear that neither function
depends on 'env' state, e.g. env->insn_aux_data.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250807010205.3210608-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-07 09:17:02 -07:00
Waiman Long
da274853fe cpu: Remove obsolete comment from takedown_cpu()
takedown_cpu() has a comment about "all preempt/rcu users must observe
!cpu_active()" which is kind of meaningless in this function. This
comment was originally introduced by commit 6acce3ef84 ("sched: Remove
get_online_cpus() usage") when _cpu_down() was setting cpu_active_mask
and synchronize_rcu()/synchronize_sched() were added after that.

Later commit 40190a78f8 ("sched/hotplug: Convert cpu_[in]active
notifiers to state machine") added a new CPUHP_AP_ACTIVE hotplug
state to set/clear cpu_active_mask. The following commit b2454caa89
("sched/hotplug: Move sync_rcu to be with set_cpu_active(false)")
move the synchronize_*() calls to sched_cpu_deactivate() associated
with the new hotplug state, but left the comment behind.

Remove this comment as it is no longer relevant in takedown_cpu().

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250729191232.664931-1-longman@redhat.com
2025-08-06 22:48:12 +02:00
Amery Hung
d87a513d09 bpf: Allow struct_ops to get map id by kdata
Add bpf_struct_ops_id() to enable struct_ops implementors to use
struct_ops map id as the unique id of a struct_ops in their subsystem.
A subsystem that wishes to create a mapping between id and struct_ops
instance pointer can update the mapping accordingly during
bpf_struct_ops::reg(), unreg(), and update().

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250806162540.681679-2-ameryhung@gmail.com
2025-08-06 13:39:58 -07:00
Brian Norris
5b65258229 genirq/test: Resolve irq lock inversion warnings
irq_shutdown_and_deactivate() is normally called with the descriptor lock
held, and interrupts disabled. Nested a few levels down, it grabs the
global irq_resend_lock. Lockdep rightfully complains when interrupts are
not disabled:

       CPU0                    CPU1
       ----                    ----
  lock(irq_resend_lock);
                               local_irq_disable();
                               lock(&irq_desc_lock_class);
                               lock(irq_resend_lock);
  <Interrupt>
    lock(&irq_desc_lock_class);

...
   _raw_spin_lock+0x2b/0x40
   clear_irq_resend+0x14/0x70
   irq_shutdown_and_deactivate+0x29/0x80
   irq_shutdown_depth_test+0x1ce/0x600
   kunit_try_run_case+0x90/0x120

Grab the descriptor lock and disable interrupts, to resolve the
problem.

Fixes: 66067c3c8a ("genirq: Add kunit tests for depth counts")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/all/aJJONEIoIiTSDMqc@google.com
Closes: https://lore.kernel.org/lkml/31a761e4-8f81-40cf-aaf5-d220ba11911c@roeck-us.net/
2025-08-06 10:29:48 +02:00