- simple propagation of read/write marks;
- joining read/write marks from conditional branches;
- avoid must_write marks in when same instruction accesses different
stack offsets on different execution paths;
- avoid must_write marks in case same instruction accesses stack
and non-stack pointers on different execution paths;
- read/write marks propagation to outer stack frame;
- independent read marks for different callchains ending with the same
function;
- bpf_calls_callback() dependent logic in
liveness.c:bpf_stack_slot_alive().
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-12-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch adds tags __not_msg(<msg>) and __not_msg_unpriv(<msg>).
Test fails if <msg> is found in verifier log.
If __msg_not() is situated between __msg() tags framework matches
__msg() tags first, and then checks that <msg> is not present in a
portion of a log between bracketing __msg() tags.
__msg_not() tags bracketed by a same __msg() group are effectively
unordered.
The idea is borrowed from LLVM's CheckFile with its CHECK-NOT syntax.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-11-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Converting bpf_insn_successors() to use lookup table makes it ~1.5
times faster.
Also remove unnecessary conditionals:
- `idx + 1 < prog->len` is unnecessary because after check_cfg() all
jump targets are guaranteed to be within a program;
- `i == 0 || succ[0] != dst` is unnecessary because any client of
bpf_insn_successors() can handle duplicate edges:
- compute_live_registers()
- compute_scc()
Moving bpf_insn_successors() to liveness.c allows its inlining in
liveness.c:__update_stack_liveness().
Such inlining speeds up __update_stack_liveness() by ~40%.
bpf_insn_successors() is used in both verifier.c and liveness.c.
perf shows such move does not negatively impact users in verifier.c,
as these are executed only once before main varification pass.
Unlike __update_stack_liveness() which can be triggered multiple
times.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-10-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Remove register chain based liveness tracking:
- struct bpf_reg_state->{parent,live} fields are no longer needed;
- REG_LIVE_WRITTEN marks are superseded by bpf_mark_stack_write()
calls;
- mark_reg_read() calls are superseded by bpf_mark_stack_read();
- log.c:print_liveness() is superseded by logging in liveness.c;
- propagate_liveness() is superseded by bpf_update_live_stack();
- no need to establish register chains in is_state_visited() anymore;
- fix a bunch of tests expecting "_w" suffixes in verifier log
messages.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-9-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Allocate analysis instance:
- Add bpf_stack_liveness_{init,free}() calls to bpf_check().
Notify the instance about any stack reads and writes:
- Add bpf_mark_stack_write() call at every location where
REG_LIVE_WRITTEN is recorded for a stack slot.
- Add bpf_mark_stack_read() call at every location mark_reg_read() is
called.
- Both bpf_mark_stack_{read,write}() rely on
env->liveness->cur_instance callchain being in sync with
env->cur_state. It is possible to update env->liveness->cur_instance
every time a mark read/write is called, but that costs a hash table
lookup and is noticeable in the performance profile. Hence, manually
reset env->liveness->cur_instance whenever the verifier changes
env->cur_state call stack:
- call bpf_reset_live_stack_callchain() when the verifier enters a
subprogram;
- call bpf_update_live_stack() when the verifier exits a subprogram
(it implies the reset).
Make sure bpf_update_live_stack() is called for a callchain before
issuing liveness queries. And make sure that bpf_update_live_stack()
is called for any callee callchain first:
- Add bpf_update_live_stack() call at every location that processes
BPF_EXIT:
- exit from a subprogram;
- before pop_stack() call.
This makes sure that bpf_update_live_stack() is called for callee
callchains before caller callchains.
Make sure must_write marks are set to zero for instructions that
do not always access the stack:
- Wrap do_check_insn() with bpf_reset_stack_write_marks() /
bpf_commit_stack_write_marks() calls.
Any calls to bpf_mark_stack_write() are accumulated between this
pair of calls. If no bpf_mark_stack_write() calls were made
it means that the instruction does not access stack (at-least
on the current verification path) and it is important to record
this fact.
Finally, use bpf_live_stack_query_init() / bpf_stack_slot_alive()
to query stack liveness info.
The manual tracking of the correct order for callee/caller
bpf_update_live_stack() calls is a bit convoluted and may warrant some
automation in future revisions.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-7-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This commit adds a flow-sensitive, context-sensitive, path-insensitive
data flow analysis for live stack slots:
- flow-sensitive: uses program control flow graph to compute data flow
values;
- context-sensitive: collects data flow values for each possible call
chain in a program;
- path-insensitive: does not distinguish between separate control flow
graph paths reaching the same instruction.
Compared to the current path-sensitive analysis, this approach trades
some precision for not having to enumerate every path in the program.
This gives a theoretical capability to run the analysis before main
verification pass. See cover letter for motivation.
The basic idea is as follows:
- Data flow values indicate stack slots that might be read and stack
slots that are definitely written.
- Data flow values are collected for each
(call chain, instruction number) combination in the program.
- Within a subprogram, data flow values are propagated using control
flow graph.
- Data flow values are transferred from entry instructions of callee
subprograms to call sites in caller subprograms.
In other words, a tree of all possible call chains is constructed.
Each node of this tree represents a subprogram. Read and write marks
are collected for each instruction of each node. Live stack slots are
first computed for lower level nodes. Then, information about outer
stack slots that might be read or are definitely written by a
subprogram is propagated one level up, to the corresponding call
instructions of the upper nodes. Procedure repeats until root node is
processed.
In the absence of value range analysis, stack read/write marks are
collected during main verification pass, and data flow computation is
triggered each time verifier.c:states_equal() needs to query the
information.
Implementation details are documented in kernel/bpf/liveness.c.
Quantitative data about verification performance changes and memory
consumption is in the cover letter.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-6-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The next patch would require doing postorder traversal of individual
subprograms. Facilitate this by moving env->cfg.insn_postorder
computation from check_cfg() to a separate pass, as check_cfg()
descends into called subprograms (and it needs to, because of
merge_callee_effects() logic).
env->cfg.insn_postorder is used only by compute_live_registers(),
this function does not track cross subprogram dependencies,
thus the change does not affect it's operation.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-5-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently only array maps are supported, but the implementation can be
extended for other maps and objects. The hash is memoized only for
exclusive and frozen maps as their content is stable until the exclusive
program modifies the map.
This is required for BPF signing, enabling a trusted loader program to
verify a map's integrity. The loader retrieves
the map's runtime hash from the kernel and compares it against an
expected hash computed at build time.
Signed-off-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/20250914215141.15144-7-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Use AF_ALG sockets to not have libbpf depend on OpenSSL. The helper is
used for the loader generation code to embed the metadata hash in the
loader program and also by the bpf_map__make_exclusive API to calculate
the hash of the program the map is exclusive to.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/20250914215141.15144-4-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Exclusive maps allow maps to only be accessed by program with a
program with a matching hash which is specified in the excl_prog_hash
attr.
For the signing use-case, this allows the trusted loader program
to load the map and verify the integrity
Signed-off-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/20250914215141.15144-3-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Exclusive maps restrict map access to specific programs using a hash.
The current hash used for this is SHA1, which is prone to collisions.
This patch uses SHA256, which is more resilient against
collisions. This new hash is stored in bpf_prog and used by the verifier
to determine if a program can access a given exclusive map.
The original 64-bit tags are kept, as they are used by users as a short,
possibly colliding program identifier for non-security purposes.
Signed-off-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/20250914215141.15144-2-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Kumar Kartikeya Dwivedi says:
====================
Update KF_RCU_PROTECTED
Currently, KF_RCU_PROTECTED only applies to iterator APIs and that too
in a convoluted fashion: the presence of this flag on the kfunc is used
to set MEM_RCU in iterator type, and the lack of RCU protection results
in an error only later, once next() or destroy() methods are invoked on
the iterator. While there is no bug, this is certainly a bit unintuitive,
and makes the enforcement of the flag iterator specific.
In the interest of making this flag useful for other upcoming kfuncs,
e.g. scx_bpf_cpu_curr() [0][1], add enforcement for invoking the kfunc
in an RCU critical section in general.
In addition to this, the aforementioned kfunc also needs to return an
RCU protected pointer, which currently has no generic kfunc flag or
annotation. Add such a flag as well while we are at it.
[0]: https://lore.kernel.org/all/20250903212311.369697-3-christian.loehle@arm.com
[1]: https://lore.kernel.org/all/20250909195709.92669-1-arighi@nvidia.com
Changelog:
----------
v2 -> v3
v2: https://lore.kernel.org/bpf/20250917032014.4060112-1-memxor@gmail.com
* Add back lost hunk reworking documentation for KF_RCU_PROTECTED.
v1 -> v2
v1: https://lore.kernel.org/bpf/20250915024731.1494251-1-memxor@gmail.com
* Drop KF_RET_RCU and fold change into KF_RCU_PROTECTED. (Andrea, Alexei)
* Update tests for non-struct pointer return values with KF_RCU_PROTECTED.
====================
Link: https://patch.msgid.link/20250917032755.4068726-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, KF_RCU_PROTECTED only applies to iterator APIs and that too
in a convoluted fashion: the presence of this flag on the kfunc is used
to set MEM_RCU in iterator type, and the lack of RCU protection results
in an error only later, once next() or destroy() methods are invoked on
the iterator. While there is no bug, this is certainly a bit
unintuitive, and makes the enforcement of the flag iterator specific.
In the interest of making this flag useful for other upcoming kfuncs,
e.g. scx_bpf_cpu_curr() [0][1], add enforcement for invoking the kfunc
in an RCU critical section in general.
This would also mean that iterator APIs using KF_RCU_PROTECTED will
error out earlier, instead of throwing an error for lack of RCU CS
protection when next() or destroy() methods are invoked.
In addition to this, if the kfuncs tagged KF_RCU_PROTECTED return a
pointer value, ensure that this pointer value is only usable in an RCU
critical section. There might be edge cases where the return value is
special and doesn't need to imply MEM_RCU semantics, but in general, the
assumption should hold for the majority of kfuncs, and we can revisit
things if necessary later.
[0]: https://lore.kernel.org/all/20250903212311.369697-3-christian.loehle@arm.com
[1]: https://lore.kernel.org/all/20250909195709.92669-1-arighi@nvidia.com
Tested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250917032755.4068726-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This is a test case minimized from a syzbot reproducer from [1].
The test case triggers verifier.c:maybe_exit_scc() w/o
preceding call to verifier.c:maybe_enter_scc() on a speculative
symbolic execution path.
Here is verifier log for the test case:
Live regs before insn:
0: .......... (b7) r0 = 100
1 1: 0......... (7b) *(u64 *)(r10 -512) = r0
1 2: 0......... (b5) if r0 <= 0x0 goto pc-2
3: 0......... (95) exit
0: R1=ctx() R10=fp0
0: (b7) r0 = 100 ; R0_w=100
1: (7b) *(u64 *)(r10 -512) = r0 ; R0_w=100 R10=fp0 fp-512_w=100
2: (b5) if r0 <= 0x0 goto pc-2
mark_precise: ...
2: R0_w=100
3: (95) exit
from 2 to 1 (speculative execution): R0_w=scalar() R1=ctx() R10=fp0 fp-512_w=100
1: R0_w=scalar() R1=ctx() R10=fp0 fp-512_w=100
1: (7b) *(u64 *)(r10 -512) = r0
processed 5 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
- Non-speculative execution path 0-3 does not allocate any checkpoints
(and hence does not call maybe_enter_scc()), and schedules a
speculative jump from 2 to 1.
- Speculative execution path stops immediately because of an infinite
loop detection and triggers verifier.c:update_branch_counts() ->
maybe_exit_scc() calls.
[1] https://lore.kernel.org/bpf/68c85acd.050a0220.2ff435.03a4.GAE@google.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250916212251.3490455-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Syzbot generated a program that triggers a verifier_bug() call in
maybe_exit_scc(). maybe_exit_scc() assumes that, when called for a
state with insn_idx in some SCC, there should be an instance of struct
bpf_scc_visit allocated for that SCC. Turns out the assumption does
not hold for speculative execution paths. See example in the next
patch.
maybe_scc_exit() is called from update_branch_counts() for states that
reach branch count of zero, meaning that path exploration for a
particular path is finished. Path exploration can finish in one of
three ways:
a. Verification error is found. In this case, update_branch_counts()
is called only for non-speculative paths.
b. Top level BPF_EXIT is reached. Such instructions are never a part of
an SCC, so compute_scc_callchain() in maybe_scc_exit() will return
false, and maybe_scc_exit() will return early.
c. A checkpoint is reached and matched. Checkpoints are created by
is_state_visited(), which calls maybe_enter_scc(), which allocates
bpf_scc_visit instances for checkpoints within SCCs.
Hence, for non-speculative symbolic execution paths, the assumption
still holds: if maybe_scc_exit() is called for a state within an SCC,
bpf_scc_visit instance must exist.
This patch removes the verifier_bug() call for speculative paths.
Fixes: c9e31900b5 ("bpf: propagate read/precision marks over state graph backedges")
Reported-by: syzbot+3afc814e8df1af64b653@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/68c85acd.050a0220.2ff435.03a4.GAE@google.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250916212251.3490455-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Function bpf_patch_insn_data() has the following structure:
static struct bpf_prog *bpf_patch_insn_data(... env ...)
{
struct bpf_prog *new_prog;
struct bpf_insn_aux_data *new_data = NULL;
if (len > 1) {
new_data = vrealloc(...); // <--------- (1)
if (!new_data)
return NULL;
env->insn_aux_data = new_data; // <---- (2)
}
new_prog = bpf_patch_insn_single(env->prog, off, patch, len);
if (IS_ERR(new_prog)) {
...
vfree(new_data); // <----------------- (3)
return NULL;
}
... happy path ...
}
In case if bpf_patch_insn_single() returns an error the `new_data`
allocated at (1) will be freed at (3). However, at (2) this pointer
is stored in `env->insn_aux_data`. Which is freed unconditionally
by verifier.c:bpf_check() on both happy and error paths.
Thus, leading to double-free.
Fix this by removing vfree() call at (3), ownership over `new_data` is
already passed to `env->insn_aux_data` at this point.
Fixes: 77620d1267 ("bpf: use realloc in bpf_patch_insn_data")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250912-patch-insn-data-double-free-v1-1-af05bd85a21a@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Kumar Kartikeya Dwivedi says:
====================
Remove use of current->cgns in bpf_cgroup_from_id
bpf_cgroup_from_id currently ends up doing a check on whether the cgroup
being looked up is a descendant of the root cgroup of the current task's
cgroup namespace. This leads to unreliable results since this kfunc can
be invoked from any arbitrary context, for any arbitrary value of
current. Fix this by removing namespace-awarness in the kfunc, and
include a test that detects such a case and fails without the fix.
Changelog:
----------
v2 -> v3
v2: https://lore.kernel.org/bpf/20250811195901.1651800-1-memxor@gmail.com
* Refactor cgroup_get_from_id into non-ns version. (Andrii)
* Address nits from Eduard.
v1 -> v2
v1: https://lore.kernel.org/bpf/20250811175045.1055202-1-memxor@gmail.com
* Add Ack from Tejun.
* Fix selftest to perform namespace migration and cgroup setup in a
child process to avoid changing test_progs namespace.
====================
Link: https://patch.msgid.link/20250915032618.1551762-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Make sure that we only switch the cgroup namespace and enter a new
cgroup in a child process separate from test_progs, to not mess up the
environment for subsequent tests.
To remove this cgroup, we need to wait for the child to exit, and then
rmdir its cgroup. If the read call fails, or waitpid succeeds, we know
the child exited (read call would fail when the last pipe end is closed,
otherwise waitpid waits until exit(2) is called). We then invoke a newly
introduced remove_cgroup_pid() helper, that identifies cgroup path using
the passed in pid of the now dead child, instead of using the current
process pid (getpid()).
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250915032618.1551762-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The bpf_cgroup_from_id kfunc relies on cgroup_get_from_id to obtain the
cgroup corresponding to a given cgroup ID. This helper can be called in
a lot of contexts where the current thread can be random. A recent
example was its use in sched_ext's ops.tick(), to obtain the root cgroup
pointer. Since the current task can be whatever random user space task
preempted by the timer tick, this makes the behavior of the helper
unreliable.
Refactor out __cgroup_get_from_id as the non-namespace aware version of
cgroup_get_from_id, and change bpf_cgroup_from_id to make use of it.
There is no compatibility breakage here, since changing the namespace
against which the lookup is being done to the root cgroup namespace only
permits a wider set of lookups to succeed now. The cgroup IDs across
namespaces are globally unique, and thus don't need to be retranslated.
Reported-by: Dan Schatzberg <dschatzberg@meta.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250915032618.1551762-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
For systems having CONFIG_NR_CPUS set to > 1024 in kernel config
the selftest fails as arena_spin_lock_irqsave() returns EOPNOTSUPP.
(eg - incase of powerpc default value for CONFIG_NR_CPUS is 8192)
The selftest is skipped incase bpf program returns EOPNOTSUPP,
with a descriptive message logged.
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Saket Kumar Bhaskar <skb99@linux.ibm.com>
Link: https://lore.kernel.org/r/20250913091337.1841916-1-skb99@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Like commit fbdd61c94b ("selftests/bpf: Skip timer cases when bpf_timer is not supported"),
'timer_interrupt' test case should be skipped if verifier rejects
bpf_timer with returning -EOPNOTSUPP.
cd tools/testing/selftests/bpf
./test_progs -t timer
461 timer_interrupt:SKIP
Summary: 6/0 PASSED, 7 SKIPPED, 0 FAILED
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250915121657.28084-1-leon.hwang@linux.dev
With "bpftool prog tracelog", bpftool prints messages from the trace
pipe. To do so, it first needs to find the tracefs mount point to open
the pipe. Bpftool looks at a few "default" locations, including
/sys/kernel/debug/tracing and /sys/kernel/tracing.
Some of these locations, namely /tracing and /trace, are not standard.
They are in the list because some users used to hardcode the tracing
directory to short names; but we have no compelling reason to look at
these locations. If we fail to find the tracefs at the default
locations, we have an additional step to find it by parsing /proc/mounts
anyway, so it's safe to remove these entries from the list of default
locations to check.
Additionally, Alexei reports that looking for the tracefs at
/sys/kernel/debug/tracing may automatically mount the file system under
that location, and generate a kernel log message telling that
auto-mounting there is deprecated. To avoid this message, let's swap the
order for checking the potential mount points: try /sys/kernel/tracing
first, which should be the standard location nowadays. The kernel log
message may still appear if the tracefs is not mounted on
/sys/kernel/tracing when we run bpftool.
Reported-by: Alexei Starovoitov <ast@kernel.org>
Closes: https://lore.kernel.org/r/CAADnVQLcMi5YQhZKsU4z3S2uVUAGu_62C33G2Zx_ruG3uXa-Ug@mail.gmail.com/
Signed-off-by: Quentin Monnet <qmo@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20250915134209.36568-1-qmo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Puranjay Mohan says:
====================
bpf: report arena faults to BPF streams
Changes in v6->v7:
v6: https://lore.kernel.org/all/20250908163638.23150-1-puranjay@kernel.org/
- Added comments about the usage of arena_reg in x86 and arm64 jits. (Alexei)
- Used clear_lo32() for clearing the lower 32-bits of user_vm_start. (Alexei)
- Moved update of the old tests to use __stderr to a separate commit (Eduard)
- Used test__skip() in prog_tests/stream.c (Eduard)
- Start a sub-test for read / write
Changes in v5->v6:
v5: https://lore.kernel.org/all/20250901193730.43543-1-puranjay@kernel.org/
- Introduces __stderr and __stdout for easy testing of bpf streams
(Eduard)
- Add more test cases for arena fault reporting (subprog and callback)
- Fix main_prog_aux usage and return main_prog from find_from_stack_cb
(Kumar)
- Properly fix the build issue reported by kernel test robot
Changes in v4->v5:
v4: https://lore.kernel.org/all/20250827153728.28115-1-puranjay@kernel.org/
- Added patch 2 to introducing main_prog_aux for easier access to
streams.
- Fixed bug in fault handlers when arena_reg == dst_reg
- Updated selftest to check test above edge case.
- Added comments about the usage of barrier_var() in code and commit
message.
Changes in v3->v4:
v3: https://lore.kernel.org/all/20250827150113.15763-1-puranjay@kernel.org/
- Fixed a build issue when CONFIG_BPF_JIT=y and # CONFIG_BPF_SYSCALL is
not set
Changes in v2->v3:
v2: https://lore.kernel.org/all/20250811111828.13836-1-puranjay@kernel.org/
- Improved the selftest to check the exact fault address
- Dropped BPF_NO_KFUNC_PROTOTYPES and bpf_arena_alloc/free_pages() usage
- Rebased on bpf-next/master
Changes in v1->v2:
v1: https://lore.kernel.org/all/20250806085847.18633-1-puranjay@kernel.org/
- Changed variable and mask names for consistency (Yonghong)
- Added Acked-by: Yonghong Song <yonghong.song@linux.dev> on two patches
This set adds the support of reporting page faults inside arena to BPF
stderr stream. The reported address is the one that a user would expect
to see if they pass it to bpf_printk();
Here is an example output from the stderr stream and bpf_printk()
ERROR: Arena WRITE access at unmapped address 0xdeaddead0000
CPU: 9 UID: 0 PID: 502 Comm: test_progs
Call trace:
bpf_stream_stage_dump_stack+0xc0/0x150
bpf_prog_report_arena_violation+0x98/0xf0
ex_handler_bpf+0x5c/0x78
fixup_exception+0xf8/0x160
__do_kernel_fault+0x40/0x188
do_bad_area+0x70/0x88
do_translation_fault+0x54/0x98
do_mem_abort+0x4c/0xa8
el1_abort+0x44/0x70
el1h_64_sync_handler+0x50/0x108
el1h_64_sync+0x6c/0x70
bpf_prog_a64a9778d31b8e88_stream_arena_write_fault+0x84/0xc8
*(page) = 1; @ stream.c:100
bpf_prog_test_run_syscall+0x100/0x328
__sys_bpf+0x508/0xb98
__arm64_sys_bpf+0x2c/0x48
invoke_syscall+0x50/0x120
el0_svc_common.constprop.0+0x48/0xf8
do_el0_svc+0x28/0x40
el0_svc+0x48/0xf8
el0t_64_sync_handler+0xa0/0xe8
el0t_64_sync+0x198/0x1a0
Same address is printed by bpf_printk():
1389.078831: bpf_trace_printk: Read Address: 0xdeaddead0000
To make this possible, some extra metadata has to be passed to the bpf
exception handler, so the bpf exception handling mechanism for both
x86-64 and arm64 have been improved in this set.
The streams selftest has been updated to test this new feature.
====================
Link: https://patch.msgid.link/20250911145808.58042-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add selftests for testing the reporting of arena page faults through BPF
streams. Two new bpf programs are added that read and write to an
unmapped arena address and the fault reporting is verified in the
userspace through streams.
The added bpf programs need to access the user_vm_start in struct
bpf_arena, this is done by casting &arena to struct bpf_arena *, but
barrier_var() is used on this ptr before accessing ptr->user_vm_start;
to stop GCC from issuing an out-of-bound access due to the cast from
smaller map struct to larger "struct bpf_arena"
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250911145808.58042-7-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Begin reporting arena page faults and the faulting address to BPF
program's stderr, this patch adds support in the arm64 and x86-64 JITs,
support for other archs can be added later.
The fault handlers receive the 32 bit address in the arena region so
the upper 32 bits of user_vm_start is added to it before printing the
address. This is what the user would expect to see as this is what is
printed by bpf_printk() is you pass it an address returned by
bpf_arena_alloc_pages();
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250911145808.58042-4-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
BPF streams are only valid for the main programs, to make it easier to
access streams from subprogs, introduce main_prog_aux in struct
bpf_prog_aux.
prog->aux->main_prog_aux = prog->aux, for main programs and
prog->aux->main_prog_aux = main_prog->aux, for subprograms.
Make bpf_prog_find_from_stack() use the added main_prog_aux to return
the mainprog when a subprog is found on the stack.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250911145808.58042-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
BPF loads with BPF_PROBE_MEM(SX) can load from unsafe pointers and the
JIT adds an exception table entry for the JITed instruction which allows
the exeption handler to set the destination register of the load to zero
and continue execution from the next instruction.
As all arm64 instructions are AARCH64_INSN_SIZE size, the exception
handler can just increment the pc by AARCH64_INSN_SIZE without needing
the exact address of the instruction following the the faulting
instruction.
Simplify the exception table usage in arm64 JIT by only saving the
destination register in ex->fixup and drop everything related to
the fixup_offset. The fault handler is modified to add AARCH64_INSN_SIZE
to the pc.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20250911145808.58042-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull s390 fixes from Alexander Gordeev:
- ptep_modify_prot_start() may be called in a loop, which might lead to
the preempt_count overflow due to the unnecessary preemption
disabling. Do not disable preemption to prevent the overflow
- Events of type PERF_TYPE_HARDWARE are not tested for sampling and
return -EOPNOTSUPP eventually.
Instead, deny all sampling events by CPUMF counter facility and
return -ENOENT to allow other PMUs to be tried
- The PAI PMU driver returns -EINVAL if an event out of its range. That
aborts a search for an alternative PMU driver.
Instead, return -ENOENT to allow other PMUs to be tried
* tag 's390-6.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/cpum_cf: Deny all sampling events by counter PMU
s390/pai: Deny all events not handled by this PMU
s390/mm: Prevent possible preempt_count overflow
Pull power management fixes from Rafael Wysocki:
"These fix a nasty hibernation regression introduced during the 6.16
cycle, an issue related to energy model management occurring on Intel
hybrid systems where some CPUs are offline to start with, and two
regressions in the amd-pstate driver:
- Restore a pm_restrict_gfp_mask() call in hibernation_snapshot()
that was removed incorrectly during the 6.16 development cycle
(Rafael Wysocki)
- Introduce a function for registering a perf domain without
triggering a system-wide CPU capacity update and make the
intel_pstate driver use it to avoid reocurring unsuccessful
attempts to update capacities of all CPUs in the system (Rafael
Wysocki)
- Fix setting of CPPC.min_perf in the active mode with performance
governor in the amd-pstate driver to restore its expected behavior
changed recently (Gautham Shenoy)
- Avoid mistakenly setting EPP to 0 in the amd-pstate driver after
system resume as a result of recent code changes (Mario
Limonciello)"
* tag 'pm-6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: hibernate: Restrict GFP mask in hibernation_snapshot()
PM: EM: Add function for registering a PD without capacity update
cpufreq/amd-pstate: Fix a regression leading to EPP 0 after resume
cpufreq/amd-pstate: Fix setting of CPPC.min_perf in active mode for performance governor
Pull btrfs fixes from David Sterba:
- fix delayed inode tracking in xarray, eviction can race with
insertion and leave behind a disconnected inode
- on systems with large page (64K) and small block size (4K) fix
compression read that can return partially filled folio
- slightly relax compression option format for backward compatibility,
allow to specify level for LZO although there's only one
- fix simple quota accounting of compressed extents
- validate minimum device size in 'device add'
- update maintainers' entry
* tag 'for-6.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't allow adding block device of less than 1 MB
MAINTAINERS: update btrfs entry
btrfs: fix subvolume deletion lockup caused by inodes xarray race
btrfs: fix corruption reading compressed range when block size is smaller than page size
btrfs: accept and ignore compression level for lzo
btrfs: fix squota compressed stats leak
Pull bpf fixes from Alexei Starovoitov:
"A number of fixes accumulated due to summer vacations
- Fix out-of-bounds dynptr write in bpf_crypto_crypt() kfunc which
was misidentified as a security issue (Daniel Borkmann)
- Update the list of BPF selftests maintainers (Eduard Zingerman)
- Fix selftests warnings with icecc compiler (Ilya Leoshkevich)
- Disable XDP/cpumap direct return optimization (Jesper Dangaard
Brouer)
- Fix unexpected get_helper_proto() result in unusual configuration
BPF_SYSCALL=y and BPF_EVENTS=n (Jiri Olsa)
- Allow fallback to interpreter when JIT support is limited (KaFai
Wan)
- Fix rqspinlock and choose trylock fallback for NMI waiters. Pick
the simplest fix. More involved fix is targeted bpf-next (Kumar
Kartikeya Dwivedi)
- Fix cleanup when tcp_bpf_send_verdict() fails to allocate
psock->cork (Kuniyuki Iwashima)
- Disallow bpf_timer in PREEMPT_RT for now. Proper solution is being
discussed for bpf-next. (Leon Hwang)
- Fix XSK cq descriptor production (Maciej Fijalkowski)
- Tell memcg to use allow_spinning=false path in bpf_timer_init() to
avoid lockup in cgroup_file_notify() (Peilin Ye)
- Fix bpf_strnstr() to handle suffix match cases (Rong Tao)"
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Skip timer cases when bpf_timer is not supported
bpf: Reject bpf_timer for PREEMPT_RT
tcp_bpf: Call sk_msg_free() when tcp_bpf_send_verdict() fails to allocate psock->cork.
bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()
bpf: Allow fall back to interpreter for programs with stack size <= 512
rqspinlock: Choose trylock fallback for NMI waiters
xsk: Fix immature cq descriptor production
bpf: Update the list of BPF selftests maintainers
selftests/bpf: Add tests for bpf_strnstr
selftests/bpf: Fix "expression result unused" warnings with icecc
bpf: Fix bpf_strnstr() to handle suffix match cases better
selftests/bpf: Extend crypto_sanity selftest with invalid dst buffer
bpf: Fix out-of-bounds dynptr write in bpf_crypto_crypt
bpf: Check the helper function is valid in get_helper_proto
bpf, cpumap: Disable page_pool direct xdp_return need larger scope