Add a FEAT_BTF_LAYOUT feature check which detects whether the
kernel supports BTF layout information. Also sanitize
BTF if it contains layout data but the kernel does not
support it. The sanitization requires rewriting the raw
BTF data to update the header and eliminate the layout
section (since it lies between the types and strings), so
refactor sanitization to retrieve the raw BTF, build an
updated copy, and return that new BTF on success.
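A hedged sketch of the reworked flow, not the actual libbpf change: the
layout-extended header and its layout_off/layout_len fields are
assumptions from this series, and btf_header_v2 below is a hypothetical
name for it. Only btf__raw_data()/btf__new() are existing libbpf API.

    #include <stdlib.h>
    #include <string.h>
    #include <linux/btf.h>
    #include <bpf/btf.h>

    struct btf_header_v2 {              /* hypothetical extended header */
        struct btf_header base;
        __u32 layout_off;               /* after types, before strings */
        __u32 layout_len;
    };

    static struct btf *btf_sanitize_layout(const struct btf *btf)
    {
        const struct btf_header_v2 *hdr;
        struct btf_header *new_hdr;
        struct btf *new_btf;
        const char *raw;
        char *data;
        __u32 size, new_size;

        raw = btf__raw_data(btf, &size);
        if (!raw)
            return NULL;
        hdr = (const struct btf_header_v2 *)raw;

        /* new image: plain header + types + strings, layout dropped */
        new_size = sizeof(*new_hdr) + hdr->base.type_len + hdr->base.str_len;
        data = malloc(new_size);
        if (!data)
            return NULL;

        new_hdr = (struct btf_header *)data;
        memcpy(new_hdr, &hdr->base, sizeof(*new_hdr));
        new_hdr->hdr_len = sizeof(*new_hdr);
        new_hdr->type_off = 0;
        new_hdr->str_off = hdr->base.type_len;  /* strings follow types */

        memcpy(data + sizeof(*new_hdr),
               raw + hdr->base.hdr_len + hdr->base.type_off,
               hdr->base.type_len);
        memcpy(data + sizeof(*new_hdr) + hdr->base.type_len,
               raw + hdr->base.hdr_len + hdr->base.str_off,
               hdr->base.str_len);

        new_btf = btf__new(data, new_size);     /* btf__new() copies data */
        free(data);
        return new_btf;
    }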
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-7-alan.maguire@oracle.com
Support encoding of BTF layout data via btf__new_empty_opts().
Currently supported opts are base_btf and add_layout.
Layout information is maintained in btf.c in the layouts[] array;
when BTF is created with the add_layout option it represents the
current view of supported BTF kinds.
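A hypothetical usage sketch; btf__new_empty_opts() and its opts are
introduced by this series and may differ in the final API (LIBBPF_OPTS()
itself is existing libbpf):

    LIBBPF_OPTS(btf_new_opts, opts,
        .base_btf = base_btf,   /* split BTF on top of base_btf */
        .add_layout = true,     /* emit the layout section */
    );
    struct btf *btf = btf__new_empty_opts(&opts);

    if (!btf)
        return -errno;          /* libbpf sets errno on failure */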
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-5-alan.maguire@oracle.com
Support reading in the layout section, fixing endianness
issues on read; also support writing the layout section
to the raw BTF object. There is not yet an API to populate
the layout with meaningful information.
As part of this, we need to consider multiple valid BTF
header sizes: the original and the layout-extended headers.
To support this, the "struct btf" representation is modified
to contain a "struct btf_header", and we copy the valid
portion from the raw data into it; this means we can always
safely check fields like btf->hdr.layout_len.
Note that parsed-in BTF may have extra header information
beyond sizeof(struct btf_header); if so, we make that BTF
ineligible for modification by setting btf->has_hdr_extra.
Ensure that we handle endianness issues for the BTF layout
section, though currently the only field that needs this
(flags) is unused.
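A sketch of the header-copy idea described above; struct btf's internals
are private to libbpf and the has_hdr_extra field is new in this series,
so the shape below is illustrative only:

    static void btf_copy_hdr(struct btf *btf, const void *raw_data,
                             __u32 hdr_len)
    {
        size_t n = hdr_len < sizeof(btf->hdr) ? hdr_len
                                              : sizeof(btf->hdr);

        memset(&btf->hdr, 0, sizeof(btf->hdr)); /* zero unknown fields */
        memcpy(&btf->hdr, raw_data, n);         /* copy valid portion */
        btf->has_hdr_extra = hdr_len > sizeof(struct btf_header);
    }

With this, btf->hdr.layout_len reads as 0 for legacy headers that never
carried the field, which is why the check is always safe.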
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-3-alan.maguire@oracle.com
The DEVMAP_HASH branch in dev_map_redirect_multi() uses
hlist_for_each_entry_safe() to iterate hash buckets, but this function
runs under RCU protection (called from xdp_do_generic_redirect_map()
in softirq context). Concurrent writers (__dev_map_hash_update_elem,
dev_map_hash_delete_elem) modify the list using RCU primitives
(hlist_add_head_rcu, hlist_del_rcu).
hlist_for_each_entry_safe() performs plain pointer dereferences without
rcu_dereference(), missing the acquire barrier needed to pair with
writers' rcu_assign_pointer(). On weakly-ordered architectures (ARM64,
POWER), a reader can observe a partially-constructed node. It also
defeats CONFIG_PROVE_RCU lockdep validation and KCSAN data-race
detection.
Replace with hlist_for_each_entry_rcu() using rcu_read_lock_bh_held()
as the lockdep condition, consistent with the rcu_dereference_check()
used in the DEVMAP (non-hash) branch of the same functions. Also fix
the same incorrect lockdep_is_held(&dtab->index_lock) condition in
dev_map_enqueue_multi(), where the lock is not held either.
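Illustrative before/after of the DEVMAP_HASH iteration (member names
follow the mainline devmap code; the loop body is elided):

    /* before: hlist_for_each_entry_safe() does plain loads, with no
     * rcu_dereference() pairing against writers' rcu_assign_pointer()
     */
    hlist_for_each_entry_safe(dst, next, head, index_hlist) {
        /* ... clone frame and enqueue to dst ... */
    }

    /* after: RCU-aware traversal, with a lockdep condition matching
     * the RCU-bh/softirq context this runs in
     */
    hlist_for_each_entry_rcu(dst, head, index_hlist,
                             rcu_read_lock_bh_held()) {
        /* ... clone frame and enqueue to dst ... */
    }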
Fixes: e624d4ed4a ("xdp: Extend xdp_redirect_map with broadcast support")
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260320072645.16731-1-devnexen@gmail.com
trampoline_count fills all trampoline attachment slots for a single
target function and verifies that one extra attach fails with -E2BIG.
It currently targets bpf_modify_return_test, which is also used by
other selftests such as modify_return, get_func_ip_test, and
get_func_args_test. When such tests run in parallel, they can contend
for the same per-function trampoline quota and cause unexpected attach
failures. This issue is currently masked by harness serialization.
Move trampoline_count to a dedicated bpf_testmod target and register it
for fmod_ret attachment. Also route the final trigger through
trigger_module_test_read(), so the execution path exercises the same
dedicated target.
This keeps the test semantics unchanged while isolating it from other
selftests, so it no longer needs to run in serial mode. Remove the
TODO comment as well.
Tested:
./test_progs -t trampoline_count -vv
./test_progs -j$(nproc) -t trampoline_count -vv
./test_progs -j$(nproc) -t \
trampoline_count,modify_return,get_func_ip_test,get_func_args_test -vv
20 runs of:
./test_progs -j$(nproc) -t \
trampoline_count,modify_return,get_func_ip_test,get_func_args_test
Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260324044949.869801-1-sun.jian.kdev@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Previously I added a FIONREAD test for sockmap, but it can occasionally
fail in CI [1].
The test sends 10 bytes in two segments (2 + 8). For UDP, FIONREAD only
reports the length of the first datagram, not the total queued data.
The original code used recv_timeout() expecting all 10 bytes, but under
high system load, the second datagram may not yet be processed by the
protocol stack, so recv would only return the first 2-byte datagram,
causing a size mismatch failure.
Fix this by receiving exactly the expected bytes (matching FIONREAD) in
the first recv. The remaining datagram is then consumed in a second recv
block, which is only reachable for UDP since TCP's expected already
equals sizeof(buf).
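A sketch of the fixed receive logic (helper and variable names follow
the selftest harness but the exact code may differ; for UDP expected
is 2, for TCP it already equals sizeof(buf)):

    int avail = 0;

    err = ioctl(fd, FIONREAD, &avail);
    ASSERT_EQ(avail, expected, "fionread");

    /* read exactly the bytes FIONREAD reported */
    recvd = recv_timeout(fd, buf, expected, 0, IO_TIMEOUT_SEC);
    ASSERT_EQ(recvd, expected, "recv_first");

    /* only reachable for UDP: drain the second datagram */
    if (expected < (int)sizeof(buf)) {
        recvd = recv_timeout(fd, buf, sizeof(buf), 0, IO_TIMEOUT_SEC);
        ASSERT_EQ(recvd, (int)sizeof(buf) - expected, "recv_rest");
    }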
Test:
./test_progs -a sockmap_basic
410/1 sockmap_basic/sockmap create_update_free:OK
...
Summary: 1/35 PASSED, 0 SKIPPED, 0 FAILED
[1] https://github.com/kernel-patches/bpf/actions/runs/22919385910/job/66515395423
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Fixes: 17e2ce02bf ("selftests/bpf: Add tests for FIONREAD and copied_seq")
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Link: https://lore.kernel.org/r/20260312072549.6766-1-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
A test failure was discovered in BPF CI [1], caused by a connection
timeout. The current test timeout of 500ms is insufficient for CI
environments, particularly under high load.
While the optimal timeout is unclear, this test was converted from the
original test_tc_tunnel.sh script. The original script used nc with "-w 1"
to specify a 1-second timeout [2]. Therefore, this test restores the
timeout to 1s.
Test:
./test_progs -a tc_tunnel
#478/1 tc_tunnel/ipip_none:OK
#478/2 tc_tunnel/ipip6_none:OK
#478/3 tc_tunnel/ip6tnl_none:OK
#478/4 tc_tunnel/sit_none:OK
#478/5 tc_tunnel/vxlan_eth:OK
#478/6 tc_tunnel/ip6vxlan_eth:OK
#478/7 tc_tunnel/gre_none:OK
#478/8 tc_tunnel/gre_eth:OK
#478/9 tc_tunnel/gre_mpls:OK
#478/10 tc_tunnel/ip6gre_none:OK
#478/11 tc_tunnel/ip6gre_eth:OK
#478/12 tc_tunnel/ip6gre_mpls:OK
#478/13 tc_tunnel/udp_none:OK
#478/14 tc_tunnel/udp_eth:OK
#478/15 tc_tunnel/udp_mpls:OK
#478/16 tc_tunnel/ip6udp_none:OK
#478/17 tc_tunnel/ip6udp_eth:OK
#478/18 tc_tunnel/ip6udp_mpls:OK
#478 tc_tunnel:OK
Summary: 1/18 PASSED, 0 SKIPPED, 0 FAILED
[1] https://github.com/kernel-patches/bpf/actions/runs/22674350732/job/65728072723
[2] https://lore.kernel.org/all/20251027-tc_tunnel-v3-4-505c12019f9d@bootlin.com/
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Link: https://lore.kernel.org/r/20260312083615.31835-1-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Slava Imameev says:
====================
bpf: Add multi-level pointer parameter support for trampolines
This is v6 of a series adding support for new pointer types for
trampoline parameters.
Originally, only support for multi-level pointers was proposed.
As suggested during review, it was extended to single-level pointers.
During discussion, it was proposed to add support for any single or
multi-level pointer type that is not a single-level pointer to a
structure, with the condition if (!btf_type_is_struct(t)). The safety
of this condition is based on BTF data verification performed for
modules and programs, and vmlinux BTF being trusted to not contain
invalid types, so it is not possible for invalid types, like
PTR->DATASEC, PTR->FUNC, PTR->VAR and corresponding multi-level
pointers, to reach btf_ctx_access.
These changes appear to be a safe extension since any future support
for arrays and output values would require annotation (similar to
Microsoft SAL), which differentiates between current unannotated
scalar cases and new annotated cases.
This series adds BPF verifier support for single- and multi-level
pointer parameters and return values in BPF trampolines. The
implementation treats these parameters as SCALAR_VALUE.
This is consistent with existing pointers to int and void that are
already treated as SCALAR.
This provides consistent logic for single- and multi-level pointers:
if the type is treated as SCALAR for a single-level pointer, the
same applies to multi-level pointers, except for pointers to structs
which are currently PTR_TO_BTF_ID. However, in the case of
multi-level pointers, they are treated as scalar since the verifier
lacks the context to infer the size of their target memory regions.
Background:
Prior to these changes, accessing multi-level pointer parameters or
return values through BPF trampoline context arrays resulted in
verification failures in btf_ctx_access, producing errors such as:
func '%s' arg%d type %s is not a struct
For example, consider a BPF program that logs an input parameter of
type struct posix_acl **:
SEC("fentry/__posix_acl_chmod")
int BPF_PROG(trace_posix_acl_chmod, struct posix_acl **ppacl, gfp_t gfp,
             umode_t mode)
{
    bpf_printk("__posix_acl_chmod ppacl = %px\n", ppacl);
    return 0;
}
This program failed BPF verification with the following error:
libbpf: prog 'trace_posix_acl_chmod': -- BEGIN PROG LOAD LOG --
0: R1=ctx() R10=fp0
; int BPF_PROG(trace_posix_acl_chmod, struct posix_acl **ppacl,
gfp_t gfp, umode_t mode) @ posix_acl_monitor.bpf.c:23
0: (79) r6 = *(u64 *)(r1 +16) ; R1=ctx() R6_w=scalar()
1: (79) r1 = *(u64 *)(r1 +0)
func '__posix_acl_chmod' arg0 type PTR is not a struct
invalid bpf_context access off=0 size=8
processed 2 insns (limit 1000000) max_states_per_insn 0 total_states 0
peak_states 0 mark_read 0
-- END PROG LOAD LOG --
The common workaround involved using helper functions to fetch
parameter values by passing the address of the context array entry:
SEC("fentry/__posix_acl_chmod")
int BPF_PROG(trace_posix_acl_chmod, struct posix_acl **ppacl, gfp_t gfp,
             umode_t mode)
{
    struct posix_acl **pp;

    bpf_probe_read_kernel(&pp, sizeof(ppacl), &ctx[0]);
    bpf_printk("__posix_acl_chmod %px\n", pp);
    return 0;
}
This approach introduced helper call overhead and created
inconsistency with parameter access patterns.
Improvements:
With this patch, trampoline programs can directly access multi-level
pointer parameters, eliminating helper call overhead and explicit ctx
access while ensuring consistent parameter handling. For example, the
following ctx access with a helper call:
SEC("fentry/__posix_acl_chmod")
int BPF_PROG(trace_posix_acl_chmod, struct posix_acl **ppacl, gfp_t gfp,
             umode_t mode)
{
    struct posix_acl **pp;

    bpf_probe_read_kernel(&pp, sizeof(pp), &ctx[0]);
    bpf_printk("__posix_acl_chmod %px\n", pp);
    ...
}
is replaced by a load instruction:
SEC("fentry/__posix_acl_chmod")
int BPF_PROG(trace_posix_acl_chmod, struct posix_acl **ppacl, gfp_t gfp,
             umode_t mode)
{
    bpf_printk("__posix_acl_chmod %px\n", ppacl);
    ...
}
The bpf_core_cast macro can be used for deeper level dereferences.
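For instance, after loading ppacl directly as a scalar, one more level
can be read and then turned into a trusted pointer (a sketch, not code
from this series; bpf_core_cast() comes from bpf_core_read.h and the
a_count field is from struct posix_acl):

    struct posix_acl *acl;

    /* *ppacl: ppacl is a scalar, so use a probe read for level one */
    bpf_probe_read_kernel(&acl, sizeof(acl), ppacl);
    if (acl) {
        struct posix_acl *pacl = bpf_core_cast(acl, struct posix_acl);

        bpf_printk("acl count: %u\n", pacl->a_count);
    }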
v1 -> v2:
* corrected maintainer's email
v2 -> v3:
* Addressed reviewers' feedback:
* Changed the register type from PTR_TO_MEM to SCALAR_VALUE.
* Modified tests to accommodate SCALAR_VALUE handling.
* Fixed a compilation error for loongarch
* https://lore.kernel.org/oe-kbuild-all/202602181710.tEK6nOl6-lkp@intel.com/
* Addressed AI bot review
* Added a commentary to address a NULL pointer case
* Removed WARN_ON
* Fixed a commentary
v3 -> v4:
* Added more consistent support for single- and multi-level pointers
as suggested by reviewers:
* added single-level pointers to 32- and 64-bit enums
* added single-level pointers to functions
* harmonized support for single- and multi-level pointer types
* added new tests to support the above changes
* Removed create_bad_kaddr that allocated and invalidated kernel VA
for tests, and replaced it with hardcoded values similar to
bpf_testmod_return_ptr as suggested by reviewers.
v4 -> v5:
* As suggested, extended support to single-level pointers and
covered all supported valid pointer (single- and multi-level) types
with a wider condition if (!btf_type_is_struct(t)).
* As requested, simplified tests by keeping only tests that check
the verifier log for scalar().
v5 -> v6:
* Fixed the test message based on the bot's feedback.
====================
Link: https://patch.msgid.link/20260314082127.7939-1-slava.imameev@crowdstrike.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add BPF verifier support for single- and multi-level pointer
parameters and return values in BPF trampolines by treating these
parameters as SCALAR_VALUE.
This extends the existing support for int and void pointers that are
already treated as SCALAR_VALUE.
This provides consistent logic for single and multi-level pointers:
if a type is treated as SCALAR for a single-level pointer, the same
applies to multi-level pointers. The exception is pointer-to-struct,
which is currently PTR_TO_BTF_ID for single-level but treated as
scalar for multi-level pointers since the verifier lacks context
to infer the size of target memory regions.
Safety is ensured by existing BTF verification, which rejects invalid
pointer types at the BTF verification stage.
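An illustrative shape of the resulting check in btf_ctx_access() (an
excerpt sketch, not the exact hunk): after peeling the argument's PTR
and skipping modifiers, any non-struct pointee is treated as a scalar
load, and only a single-level pointer to struct keeps producing
PTR_TO_BTF_ID:

    t = btf_type_skip_modifiers(btf, t->type, NULL);
    if (!btf_type_is_struct(t))
        /* ptr to int/void/enum/func/ptr/...: SCALAR_VALUE */
        return true;
    /* single-level pointer to struct: PTR_TO_BTF_ID as before */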
Signed-off-by: Slava Imameev <slava.imameev@crowdstrike.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260314082127.7939-2-slava.imameev@crowdstrike.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a test verifying that stacksafe() correctly handles 32-bit scalar
spills when comparing stack states for equivalence during state pruning.
A 32-bit scalar spill creates slot[0-3] = STACK_INVALID and
slot[4-7] = STACK_SPILL. Without the im=4 check in stacksafe(), the
STACK_SPILL vs STACK_MISC mismatch at byte 4 causes pruning to fail,
forcing the verifier to re-explore a path that is provably safe.
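The spill pattern such a test exercises, in the inline-asm style of the
verifier_*.c selftests (a hedged sketch; the actual test body differs):

    SEC("socket")
    __success
    __naked void spill_32bit_scalar(void)
    {
        asm volatile ("                             \
        call %[bpf_get_prandom_u32];                \
        *(u32 *)(r10 - 8) = w0;                     \
        r1 = *(u32 *)(r10 - 8);                     \
        r0 = 0;                                     \
        exit;                                       \
    "   :
        : __imm(bpf_get_prandom_u32)
        : __clobber_all);
    }

The 32-bit store fills only half of the 8-byte slot, producing exactly
the mixed STACK_INVALID/STACK_SPILL layout described above.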
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260323022410.75444-2-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a selftest to verify that the verifier correctly identifies refcounted
arguments in struct_ops programs, even when they are not the first
argument. This ensures that the restriction on tail calls for programs
with __ref arguments is properly enforced regardless of which argument
they appear in.
This test verifies the fix for check_struct_ops_btf_id() proposed by
Keisuke Nishimura [0], which corrected a bug where only the first
argument was checked for the refcounted flag.
The test includes:
- An update to bpf_testmod to add 'test_refcounted_multi', an operator with
three arguments where the third is tagged with "__ref" (sketched below).
- A BPF program 'test_refcounted_multi' that attempts a tail call.
- A test runner that asserts the verifier rejects the program with
"program with __ref argument cannot tail call".
[0]: https://lore.kernel.org/bpf/20260320130219.63711-1-keisuke.nishimura@inria.fr/
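A hedged sketch of the pieces described above (names and signatures are
illustrative; the __ref suffix convention on the stub argument is what
marks it as refcounted):

    /* bpf_testmod: operator with the refcounted arg in third position */
    static int bpf_testmod_ops__test_refcounted_multi(int dummy1, int dummy2,
                                        struct task_struct *task__ref)
    {
        return 0;
    }

    /* BPF side: the tail call the verifier must reject */
    struct {
        __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
        __uint(max_entries, 1);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
    } prog_array SEC(".maps");

    SEC("struct_ops/test_refcounted_multi")
    int BPF_PROG(test_refcounted_multi, int dummy1, int dummy2,
                 struct task_struct *task)
    {
        bpf_tail_call(ctx, &prog_array, 0);     /* must be rejected */
        return 0;
    }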
Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
Link: https://lore.kernel.org/r/20260321214038.80479-1-varunrmallya@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Simplify tnum_step() from a 10-variable algorithm into a straight-line
sequence of bitwise operations.
Problem Reduction:
tnum_step(): Given a tnum `(tval, tmask)` where `tval & tmask == 0`,
and a value `z` with `tval ≤ z < (tval | tmask)`, find the smallest
`r > z`, a tnum-satisfying value, i.e., `r & ~tmask == tval`.
Every tnum-satisfying value has the form tval | s where s is a subset
of tmask bits (s & ~tmask == 0). Since tval and tmask are disjoint:
tval | s = tval + s
Similarly z = tval + d where d = z - tval, so r > z becomes:
tval + s > tval + d
s > d
The problem reduces to: find the smallest s, a subset of tmask, such
that s > d. With s constrained to subsets of tmask, the problem is
now much simpler.
Algorithm:
The mask bits of `d` form a "counter" that we want to increment by one,
but the counter has gaps at the fixed-bit positions. A normal +1 would
stop at the first 0-bit it meets; we need it to skip over fixed-bit
gaps and land on the next mask bit.
Step 1 -- plug the gaps:
d | carry_mask | ~tmask
- ~tmask fills all fixed-bit positions with 1.
- carry_mask = (1 << fls64(d & ~tmask)) - 1 fills all positions
(including mask positions) at and below the highest non-mask bit of d.
After this, the only remaining 0s are mask bits above the highest
non-mask bit of d where d is also 0 -- exactly the positions where
the carry can validly land.
Step 2 -- increment:
(d | carry_mask | ~tmask) + 1
Adding 1 flips all trailing 1s to 0 and sets the first 0 to 1. Since
every gap has been plugged, that first 0 is guaranteed to be a mask bit
above all non-mask bits of d.
Step 3 -- mask:
((d | carry_mask | ~tmask) + 1) & tmask
Strip the scaffolding, keeping only mask bits. Call the result inc.
Step 4 -- result:
tval | inc
Reattach the fixed bits.
A simple 8-bit example:
  tmask:        1 1 0 1 0 1 1 0
  d:            1 0 1 0 0 0 1 0   (d = 162)
                    ^
                    non-mask 1 at bit 5
With carry_mask = 0b00111111 (smeared from bit 5):
  d|carry|~tm   1 0 1 1 1 1 1 1
  + 1           1 1 0 0 0 0 0 0
  & tmask       1 1 0 0 0 0 0 0
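A direct transcription of the four steps into C (a user-space sketch;
the kernel version would use fls64() from <linux/bitops.h>):

    #include <stdint.h>

    /* 1-based index of the most significant set bit, 0 for x == 0 */
    static int fls64_(uint64_t x)
    {
        return x ? 64 - __builtin_clzll(x) : 0;
    }

    static uint64_t tnum_step(uint64_t tval, uint64_t tmask, uint64_t z)
    {
        uint64_t d = z - tval;  /* distance above the fixed bits */
        /* step 1: plug the gaps at and below the highest non-mask bit */
        uint64_t carry_mask = ((uint64_t)1 << fls64_(d & ~tmask)) - 1;
        /* steps 2+3: increment, then keep only mask bits */
        uint64_t inc = ((d | carry_mask | ~tmask) + 1) & tmask;
        /* step 4: reattach the fixed bits */
        return tval | inc;
    }

The preconditions (tval & tmask == 0, tval <= z < (tval | tmask)) imply
d < tmask, so fls64_(d & ~tmask) <= 63 and the shift never overflows.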
The patch passes my local test: test_verifier, test_progs for
`-t verifier` and `-t reg_bounds`.
CBMC shows the new code is equivalent to the original one [1], and
a Lean 4 proof of correctness is available [2]:
theorem tnumStep_correct (tval tmask z : BitVec 64)
    -- Precondition: valid tnum and input z
    (h_consistent : (tval &&& tmask) = 0)
    (h_lo : tval ≤ z)
    (h_hi : z < (tval ||| tmask)) :
    -- Postcondition: r must be:
    --  (1) tnum member
    --  (2) z < r
    --  (3) for any other member w > z, r <= w
    let r := tnumStep tval tmask z
    satisfiesTnum64 r tval tmask ∧
    tval ≤ r ∧ r ≤ (tval ||| tmask) ∧
    z < r ∧
    ∀ w, satisfiesTnum64 w tval tmask → z < w → r ≤ w := by
  -- unfold definition
  unfold tnumStep satisfiesTnum64
  simp only []
  refine ⟨?_, ?_, ?_, ?_, ?_⟩
  -- the solver proves each conjunct
  · bv_decide
  · bv_decide
  · bv_decide
  · bv_decide
  · intro w hw1 hw2; bv_decide
[1] https://github.com/eddyz87/tnum-step-verif/blob/master/main.c
[2] https://pastebin.com/raw/czHKiyY0
Signed-off-by: Hao Sun <hao.sun@inf.ethz.ch>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Reviewed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Link: https://lore.kernel.org/r/20260320162336.166542-1-hao.sun@inf.ethz.ch
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Since commit 603b441623 ("bpf: Update the bpf_prog_calc_tag to use
SHA256") made BPF program tags use SHA-256 instead of SHA-1, the header
<crypto/sha1.h> no longer needs to be included. Remove the relevant
inclusions so that they no longer unnecessarily come up in searches for
which kernel code is still using the obsolete SHA-1 algorithm.
Since net/ipv6/addrconf.c was relying on the transitive inclusion of
<crypto/sha1.h> (for an unrelated purpose) via <linux/filter.h>, make it
include <crypto/sha1.h> explicitly in order to keep that file building.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Acked-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/20260314214555.112386-1-ebiggers@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cross-merge BPF and other fixes after downstream PR.
Minor conflicts in:
tools/testing/selftests/bpf/progs/exceptions_fail.c
tools/testing/selftests/bpf/progs/verifier_bounds.c
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull bpf fixes from Alexei Starovoitov:
- Fix how linked registers track zero extension of subregisters (Daniel
Borkmann)
- Fix unsound scalar fork for OR instructions (Daniel Wade)
- Fix exception exit lock check for subprogs (Ihor Solodrai)
- Fix undefined behavior in interpreter for SDIV/SMOD instructions
(Jenny Guanni Qu)
- Release module's BTF when module is unloaded (Kumar Kartikeya
Dwivedi)
- Fix constant blinding for PROBE_MEM32 instructions (Sachin Kumar)
- Reset register ID for END instructions to prevent incorrect value
tracking (Yazhou Tang)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add a test cases for sync_linked_regs regarding zext propagation
bpf: Fix sync_linked_regs regarding BPF_ADD_CONST32 zext propagation
selftests/bpf: Add tests for maybe_fork_scalars() OR vs AND handling
bpf: Fix unsound scalar forking in maybe_fork_scalars() for BPF_OR
selftests/bpf: Add tests for sdiv32/smod32 with INT_MIN dividend
bpf: Fix undefined behavior in interpreter sdiv/smod for INT_MIN
selftests/bpf: Add tests for bpf_throw lock leak from subprogs
bpf: Fix exception exit lock checking for subprogs
bpf: Release module BTF IDR before module unload
selftests/bpf: Fix pkg-config call on static builds
bpf: Fix constant blinding for PROBE_MEM32 stores
selftests/bpf: Add test for BPF_END register ID reset
bpf: Reset register ID for BPF_END value tracking
Pull tracing fixes from Steven Rostedt:
- Revert "tracing: Remove pid in task_rename tracing output"
A change was made to remove the pid field from the task_rename event
because it was thought that it was always done for the current task
and recording the pid would be redundant. This turned out to be
incorrect: there are a few corner cases where this is not true, and
the change caused some regressions in tooling.
- Fix the reading from user space for migration
The reading of user space uses seq-lock-type logic: it grabs a
per-cpu temporary buffer, disables migration, enables preemption,
does the copy from user space, disables preemption, re-enables
migration, and checks if there were any schedule switches while
preemption was enabled. If there was a context switch, the per-cpu
buffer is considered possibly corrupted and it tries again. As a
protection against live locks, if it takes a hundred tries, it
issues a warning and exits out.
This was triggered because the task was selected by the load
balancer to be migrated to another CPU: every time preemption was
enabled, the migration thread would schedule in, try to migrate the
task, fail because migration was disabled, and let it run again.
This caused the scheduler to schedule out the task every time it
enabled preemption and made the loop never exit (until the
100-iteration check triggered).
Fix this by enabling and disabling preemption with migration kept
enabled when the read from user space needs to be done again. This
will let the migration thread migrate the task, and the copy from
user space will likely pass on the next iteration.
- Fix trace_marker copy option freeing
The "copy_trace_marker" option allows a tracing instance to get a
copy of a write to the trace_marker file of the top level instance.
This is managed by a link list protected by RCU. When an instance is
removed, a check is made if the option is set, and if so
synchronized_rcu() is called.
The problem is that the iteration that resets all the flags to what
they were when the instance was created (to perform clean ups) was
done before the check of the copy_trace_marker option; by then that
option had already been cleared, so synchronize_rcu() was never
called.
Move the clearing of all the flags to after the check of
copy_trace_marker, so that the option is still set if it was before
and the synchronization is performed.
- Fix entries setting when validating the persistent ring buffer
When validating the persistent ring buffer on boot up, the number of
events per sub-buffer is added to the sub-buffer meta page. The
validator was updating cpu_buffer->head_page (the first sub-buffer of
the per-cpu buffer) and not the "head_page" variable that was
iterating the sub-buffers. This was causing the first sub-buffer to
be assigned the entries for each sub-buffer and not the sub-buffer
that was supposed to be updated.
- Use "hash" value to update the direct callers
When updating the ftrace direct callers, it assigned a temporary
callback to all the callback functions of the ftrace ops and not just
the functions represented by the passed in hash. This causes an
unnecessary slow down of the functions of the ftrace_ops that is not
being modified. Only update the functions that are going to be
modified to call the ftrace loop function so that the update can be
made on those functions.
* tag 'trace-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Use hash argument for tmp_ops in update_ftrace_direct_mod
ring-buffer: Fix to update per-subbuf entries of persistent ring buffer
tracing: Fix trace_marker copy link list updates
tracing: Fix failure to read user space from system call trace events
tracing: Revert "tracing: Remove pid in task_rename tracing output"
Pull i2c fixes from Wolfram Sang:
- fix broken I2C communication on Armada 3700 with recovery
- fix device_node reference leak in probe (fsi)
- fix NULL-deref when serial string is missing (cp2615)
* tag 'i2c-for-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: pxa: defer reset on Armada 3700 when recovery is used
i2c: fsi: Fix a potential leak in fsi_i2c_probe()
i2c: cp2615: fix serial string NULL-deref at probe
Pull x86 fixes from Ingo Molnar:
- Improve Qemu MCE-injection behavior by only using AMD SMCA MSRs if
the feature bit is set
- Fix the relative path of gettimeofday.c inclusion in vclock_gettime.c
- Fix a boot crash on UV clusters when a socket is marked as
'deconfigured'; such sockets are mapped to the SOCK_EMPTY node ID by
the UV firmware, while Linux APIs expect NUMA_NO_NODE.
The difference being 0xffff (unsigned short ~0) vs -1 (int)
* tag 'x86-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/uv: Handle deconfigured sockets
x86/entry/vdso: Fix path of included gettimeofday.c
x86/mce/amd: Check SMCA feature bit before accessing SMCA MSRs
Pull perf fixes from Ingo Molnar:
- Fix a PMU driver crash on AMD EPYC systems, caused by
a race condition in x86_pmu_enable()
- Fix a possible counter-initialization bug in x86_pmu_enable()
- Fix a counter inheritance bug in inherit_event() and
__perf_event_read()
- Fix an Intel PMU driver branch constraints handling bug
found by UBSAN
- Fix a snoop information parsing bug in the Intel PMU driver's
new Off-Module Response (OMR) support code for Diamond Rapids /
Nova Lake
* tag 'perf-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix OMR snoop information parsing issues
perf/x86/intel: Add missing branch counters constraint apply
perf: Make sure to use pmu_ctx->pmu for groups
x86/perf: Make sure to program the counter value for stopped events on migration
perf/x86: Move event pointer setup earlier in x86_pmu_enable()
Pull objtool fixes from Ingo Molnar:
"Fix three more livepatching related build environment bugs, and a
false positive warning with Clang jump tables"
* tag 'objtool-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Fix Clang jump table detection
livepatch/klp-build: Fix inconsistent kernel version
objtool/klp: fix mkstemp() failure with long paths
objtool/klp: fix data alignment in __clone_symbol()
Pull locking fix from Ingo Molnar:
"Fix a sparse build error regression in <linux/local_lock_internal.h>
caused by the locking context-analysis changes"
* tag 'locking-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
include/linux/local_lock_internal.h: Make this header file again compatible with sparse
Pull irq fix from Ingo Molnar:
"Fix a mailbox channel leak in the riscv-rpmi-sysmsi irqchip driver"
* tag 'irq-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/riscv-rpmi-sysmsi: Fix mailbox channel leak in rpmi_sysmsi_probe()
Pull driver core fixes from Danilo Krummrich:
- Generalize driver_override in the driver core, providing a common
sysfs implementation and concurrency-safe accessors for bus
implementations
- Do not use driver_override as IRQ name in the hwmon axi-fan driver
- Remove an unnecessary driver_override check in sh platform_early
- Migrate the platform bus to use the generic driver_override
infrastructure, fixing a UAF condition caused by accessing the
driver_override field without proper locking in the platform_match()
callback
* tag 'driver-core-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
driver core: platform: use generic driver_override infrastructure
sh: platform_early: remove pdev->driver_override check
hwmon: axi-fan: don't use driver_override as IRQ name
docs: driver-model: document driver_override
driver core: generalize driver_override in struct device
The modify logic registers a temporary ftrace_ops object (tmp_ops) to
trigger the slow path for all direct callers, so that attached addresses
can be modified safely.
At the moment we use ops->func_hash for the tmp_ops filter, which
represents all the system's attachments. It's faster to use just the
passed-in hash filter, which contains only the modified sites and is
always a subset of ops->func_hash.
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Menglong Dong <menglong8.dong@gmail.com>
Cc: Song Liu <song@kernel.org>
Link: https://patch.msgid.link/20260312123738.129926-1-jolsa@kernel.org
Fixes: e93672f770 ("ftrace: Add update_ftrace_direct_mod function")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When the "copy_trace_marker" option is enabled for an instance, anything
written into /sys/kernel/tracing/trace_marker is also copied into that
instances buffer. When the option is set, that instance's trace_array
descriptor is added to the marker_copies link list. This list is protected
by RCU, as all iterations use an RCU-protected list traversal.
When the instance is deleted, all the flags that were enabled are cleared.
This also clears the copy_trace_marker flag and removes the trace_array
descriptor from the list.
The issue is that after the flags are cleared, a direct call to
update_marker_trace() is performed to clear the flag. This function
returns true if the state of the flag changed and false otherwise. If it
returns true here, synchronize_rcu() is called to make sure all readers
see that its removed from the list.
But since the flag was already cleared, the state does not change and the
synchronization is never called, leaving a possible UAF bug.
Move the clearing of all flags below the updating of the copy_trace_marker
option which then makes sure the synchronization is performed.
Also use the flag for checking the state in update_marker_trace() instead
of looking at if the list is empty.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260318185512.1b6c7db4@gandalf.local.home
Fixes: 7b382efd5e ("tracing: Allow the top level trace_marker to write into another instances")
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/all/20260225133122.237275-1-sashal@kernel.org/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The system call trace events call trace_user_fault_read() to read the user
space part of some system calls. This is done by grabbing a per-cpu
buffer, disabling migration, enabling preemption, calling
copy_from_user(), disabling preemption, enabling migration and checking if
the task was preempted while preemption was enabled. If it was, the buffer
is considered corrupted and it tries again.
There's a safety mechanism that will fail out of this loop if it fails 100
times (with a warning). That warning message was triggered in some
pi_futex stress tests. Enabling the sched_switch trace event and
traceoff_on_warning showed the problem:
pi_mutex_hammer-1375 [006] d..21 138.981648: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981651: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981656: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981659: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981664: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981667: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981671: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981675: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981679: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981682: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981687: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981690: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981695: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981698: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981703: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981706: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981711: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981714: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981719: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981722: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981727: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981730: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981735: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981738: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
What happened was the task 1375 was flagged to be migrated. When
preemption was enabled, the migration thread woke up to migrate that task,
but failed because migration for that task was disabled. This caused the
loop to fail to exit because the task scheduled out while trying to read
user space.
Every time the task enabled preemption the migration thread would schedule
in, try to migrate the task, fail and let the task continue. But because
the loop would only enable preemption with migration disabled, it would
always fail because each time it enabled preemption to read user space,
the migration thread would try to migrate it.
To solve this, when the loop fails to read user space without being
scheduled out, enable and disable preemption with migration enabled. This
will allow the migration task to successfully migrate the task and the
next loop should succeed to read user space without being scheduled out.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260316130734.1858a998@gandalf.local.home
Fixes: 64cf7d058a ("tracing: Have trace_marker use per-cpu data to read user space")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add multiple test cases for linked register tracking with alu32 ops:
- Add a test that checks sync_linked_regs() regarding reg->id (the linked
target register) for BPF_ADD_CONST32 rather than known_reg->id (the
branch register).
- Add a test case for linked register tracking that exposes the cross-type
sync_linked_regs() bug. One register uses alu32 (w7 += 1, BPF_ADD_CONST32)
and another uses alu64 (r8 += 2, BPF_ADD_CONST64), both linked to the
same base register.
- Add a test case that exercises regsafe() path pruning when two execution
paths reach the same program point with linked registers carrying
different ADD_CONST flags (BPF_ADD_CONST32 from alu32 vs BPF_ADD_CONST64
from alu64). This particular test passes with and without the fix since
the pruning will fail due to different ranges, but it would still be
useful to carry this one as a regression test for the unreachable div
by zero.
With the fix applied all the tests pass:
# LDLIBS=-static PKG_CONFIG='pkg-config --static' ./vmtest.sh -- ./test_progs -t verifier_linked_scalars
[...]
./test_progs -t verifier_linked_scalars
#602/1 verifier_linked_scalars/scalars: find linked scalars:OK
#602/2 verifier_linked_scalars/sync_linked_regs_preserves_id:OK
#602/3 verifier_linked_scalars/scalars_neg:OK
#602/4 verifier_linked_scalars/scalars_neg_sub:OK
#602/5 verifier_linked_scalars/scalars_neg_alu32_add:OK
#602/6 verifier_linked_scalars/scalars_neg_alu32_sub:OK
#602/7 verifier_linked_scalars/scalars_pos:OK
#602/8 verifier_linked_scalars/scalars_sub_neg_imm:OK
#602/9 verifier_linked_scalars/scalars_double_add:OK
#602/10 verifier_linked_scalars/scalars_sync_delta_overflow:OK
#602/11 verifier_linked_scalars/scalars_sync_delta_overflow_large_range:OK
#602/12 verifier_linked_scalars/scalars_alu32_big_offset:OK
#602/13 verifier_linked_scalars/scalars_alu32_basic:OK
#602/14 verifier_linked_scalars/scalars_alu32_wrap:OK
#602/15 verifier_linked_scalars/scalars_alu32_zext_linked_reg:OK
#602/16 verifier_linked_scalars/scalars_alu32_alu64_cross_type:OK
#602/17 verifier_linked_scalars/scalars_alu32_alu64_regsafe_pruning:OK
#602/18 verifier_linked_scalars/alu32_negative_offset:OK
#602/19 verifier_linked_scalars/spurious_precision_marks:OK
#602 verifier_linked_scalars:OK
Summary: 1/19 PASSED, 0 SKIPPED, 0 FAILED
Co-developed-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260319211507.213816-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jenny reported that in sync_linked_regs() the BPF_ADD_CONST32 flag is
checked on known_reg (the register narrowed by a conditional branch)
instead of reg (the linked target register created by an alu32 operation).
Example case with reg:
1. r6 = bpf_get_prandom_u32()
2. r7 = r6 (linked, same id)
3. w7 += 5 (alu32 -- r7 gets BPF_ADD_CONST32, zero-extended by CPU)
4. if w6 < 0xFFFFFFFC goto safe (narrows r6 to [0xFFFFFFFC, 0xFFFFFFFF])
5. sync_linked_regs() propagates to r7 but does NOT call zext_32_to_64()
6. Verifier thinks r7 is [0x100000001, 0x100000004] instead of [1, 4]
Since known_reg above does not have BPF_ADD_CONST32 set, zext_32_to_64()
is never called on alu32-derived linked registers. This causes the verifier
to track incorrect 64-bit bounds, while the CPU correctly zero-extends the
32-bit result.
The code checking known_reg->id was correct however (see scalars_alu32_wrap
selftest case), but the real fix needs to handle both directions - zext
propagation should be done when either register has BPF_ADD_CONST32, since
the linked relationship involves a 32-bit operation regardless of which
side has the flag.
Example case with known_reg (exercised also by scalars_alu32_wrap):
1. r1 = r0; w1 += 0x100 (alu32 -- r1 gets BPF_ADD_CONST32)
2. if r1 > 0x80 - known_reg = r1 (has BPF_ADD_CONST32), reg = r0 (doesn't)
Hence, fix it by checking for (reg->id | known_reg->id) & BPF_ADD_CONST32.
Moreover, sync_linked_regs() also has a soundness issue when two linked
registers used different ALU widths: one with BPF_ADD_CONST32 and the
other with BPF_ADD_CONST64. The delta relationship between linked registers
assumes the same arithmetic width though. When one register went through
alu32 (CPU zero-extends the 32-bit result) and the other went through
alu64 (no zero-extension), the propagation produces incorrect bounds.
Example:
r6 = bpf_get_prandom_u32() // fully unknown
if r6 >= 0x100000000 goto out // constrain r6 to [0, U32_MAX]
r7 = r6
w7 += 1 // alu32: r7.id = N | BPF_ADD_CONST32
r8 = r6
r8 += 2 // alu64: r8.id = N | BPF_ADD_CONST64
if r7 < 0xFFFFFFFF goto out // narrows r7 to [0xFFFFFFFF, 0xFFFFFFFF]
At the branch on r7, sync_linked_regs() runs with known_reg=r7
(BPF_ADD_CONST32) and reg=r8 (BPF_ADD_CONST64). The delta path
computes:
r8 = r7 + (delta_r8 - delta_r7) = 0xFFFFFFFF + (2 - 1) = 0x100000000
Then, because known_reg->id has BPF_ADD_CONST32, zext_32_to_64(r8) is
called, truncating r8 to [0, 0]. But r8 used a 64-bit ALU op -- the
CPU does NOT zero-extend it. The actual CPU value of r8 is
0xFFFFFFFE + 2 = 0x100000000, not 0. The verifier now underestimates
r8's 64-bit bounds, which is a soundness violation.
Fix sync_linked_regs() by skipping propagation when the two registers
have mixed ALU widths (one BPF_ADD_CONST32, the other BPF_ADD_CONST64).
Lastly, fix regsafe() used for path pruning: the existing checks used
"& BPF_ADD_CONST" to test for offset linkage, which treated
BPF_ADD_CONST32 and BPF_ADD_CONST64 as equivalent.
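A hedged sketch of the three changes (shapes only; the actual hunks
differ in surrounding context):

    /* in sync_linked_regs(), per linked reg: */
    u32 known_w = known_reg->id & (BPF_ADD_CONST32 | BPF_ADD_CONST64);
    u32 reg_w   = reg->id       & (BPF_ADD_CONST32 | BPF_ADD_CONST64);

    /* fix 2: mixed alu32/alu64 linkage, delta math is unsound: skip */
    if (known_w && reg_w && known_w != reg_w)
        continue;
    /* ... copy bounds, apply delta ... */

    /* fix 1: a 32-bit op on either side means the CPU zero-extended */
    if ((reg->id | known_reg->id) & BPF_ADD_CONST32)
        zext_32_to_64(reg);

    /* fix 3, in regsafe(): compare exact widths, not just linkage */
    if ((rold->id & (BPF_ADD_CONST32 | BPF_ADD_CONST64)) !=
        (rcur->id & (BPF_ADD_CONST32 | BPF_ADD_CONST64)))
        return false;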
Fixes: 7a433e5193 ("bpf: Support negative offsets, BPF_SUB, and alu32 for linked register tracking")
Reported-by: Jenny Guanni Qu <qguanni@gmail.com>
Co-developed-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260319211507.213816-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>