Jiri Olsa says:
====================
ftrace,bpf: Use single direct ops for bpf trampolines
hi,
while poking the multi-tracing interface I ended up with just one ftrace_ops
object to attach all trampolines.
This change allows us to use fewer direct API calls during attachment changes
in future code, in effect speeding up the attachment.
Even with the current code we get a speedup from using just a single ftrace_ops object.
- with current code:
Performance counter stats for 'bpftrace -e fentry:vmlinux:ksys_* {} -c true':
6,364,157,902 cycles:k
828,728,902 cycles:u
1,064,803,824 instructions:u # 1.28 insn per cycle
23,797,500,067 instructions:k # 3.74 insn per cycle
4.416004987 seconds time elapsed
0.164121000 seconds user
1.289550000 seconds sys
- with the fix:
Performance counter stats for 'bpftrace -e fentry:vmlinux:ksys_* {} -c true':
6,535,857,905 cycles:k
810,809,429 cycles:u
1,064,594,027 instructions:u # 1.31 insn per cycle
23,962,552,894 instructions:k # 3.67 insn per cycle
1.666961239 seconds time elapsed
0.157412000 seconds user
1.283396000 seconds sys
The speedup seems to be related to the fact that with a single ftrace_ops object
we no longer call ftrace_shutdown (we use ftrace_update_ops instead) and so
we skip the synchronize_rcu calls (each ~100ms) at the end of that function.
rfc: https://lore.kernel.org/bpf/20250729102813.1531457-1-jolsa@kernel.org/
v1: https://lore.kernel.org/bpf/20250923215147.1571952-1-jolsa@kernel.org/
v2: https://lore.kernel.org/bpf/20251113123750.2507435-1-jolsa@kernel.org/
v3: https://lore.kernel.org/bpf/20251120212402.466524-1-jolsa@kernel.org/
v4: https://lore.kernel.org/bpf/20251203082402.78816-1-jolsa@kernel.org/
v5: https://lore.kernel.org/bpf/20251215211402.353056-10-jolsa@kernel.org/
v6 changes:
- rename add_hash_entry_direct to add_ftrace_hash_entry_direct [Steven]
- factor hash_add/hash_sub [Steven]
- add kerneldoc header for update_ftrace_direct_* functions [Steven]
- few assorted smaller fixes [Steven]
- added missing direct_ops wrappers for !CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
case [Steven]
v5 changes:
- do not export ftrace_hash object [Steven]
- fix update_ftrace_direct_add new_filter_hash leak [ci]
v4 changes:
- rebased on top of bpf-next/master (with jmp attach changes);
  added patch 1 to deal with that
- added extra checks for update_ftrace_direct_del/mod to address
the ci bot review
v3 changes:
- rebased on top of bpf-next/master
- fixed update_ftrace_direct_del cleanup path
- added missing inline to update_ftrace_direct_* stubs
v2 changes:
- rebased on top of bpf-next/master plus Song's livepatch fixes [1]
- renamed the API functions [2] [Steven]
- do not export the new api [Steven]
- kept the original direct interface:
I'm not sure if we want to merge both *_ftrace_direct and the new interface
into a single one. It's a bit different in semantics (hence the name change as
Steven suggested [2]) and the changes are not that big, so we can easily
keep both APIs.
v1 changes:
- make the change x86 specific, after discussing with Mark options for
arm64 [Mark]
thanks,
jirka
[1] https://lore.kernel.org/bpf/20251027175023.1521602-1-song@kernel.org/
[2] https://lore.kernel.org/bpf/20250924050415.4aefcb91@batman.local.home/
---
====================
Link: https://patch.msgid.link/20251230145010.103439-1-jolsa@kernel.org
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Use a single ftrace_ops object for direct call updates instead of allocating
a ftrace_ops object for each trampoline.
With a single ftrace_ops object we can use the update_ftrace_direct_* API,
which allows updating multiple ip sites on a single ftrace_ops object.
Add a HAVE_SINGLE_FTRACE_DIRECT_OPS config option to be enabled on
each arch that supports this.
At the moment we can enable this only on x86, because arm relies
on the ftrace_ops object representing just a single trampoline image (stored
in ftrace_ops::direct_call). Archs that do not support this will continue
to use the *_ftrace_direct API.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-10-jolsa@kernel.org
Add the update_ftrace_direct_mod function, which modifies all entries
(ip -> direct) provided in the hash argument on the direct ftrace ops and
updates its attachments.
The difference from the current modify_ftrace_direct is:
- a hash argument that allows modifying multiple ip -> direct
  entries at once
This change will allow us to have a single ftrace_ops for all bpf
direct interface users in the following changes.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-7-jolsa@kernel.org
Add the update_ftrace_direct_del function, which removes all entries
(ip -> addr) provided in the hash argument from the direct ftrace ops and
updates its attachments.
The difference from the current unregister_ftrace_direct is:
- a hash argument that allows unregistering multiple ip -> direct
  entries at once
- we can call update_ftrace_direct_del multiple times on the
  same ftrace_ops object, because we do not need to unregister
  all entries at once; we can do it gradually with the help of
  the ftrace_update_ops function
This change will allow us to have a single ftrace_ops for all bpf
direct interface users in the following changes.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-6-jolsa@kernel.org
Add the update_ftrace_direct_add function, which adds all entries
(ip -> addr) provided in the hash argument to the direct ftrace ops
and updates its attachments.
The difference from the current register_ftrace_direct is:
- a hash argument that allows registering multiple ip -> direct
  entries at once
- we can call update_ftrace_direct_add multiple times on the
  same ftrace_ops object, because after the first registration with
  register_ftrace_function_nolock, it uses ftrace_update_ops to
  update the ftrace_ops object
This change will allow us to have a single ftrace_ops for all bpf
direct interface users in the following changes.
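For illustration, a minimal sketch of how a caller could drive this API
(the same hash-based pattern applies to update_ftrace_direct_del/mod in
the changes above); the exact signatures and the hash construction are
assumptions based on this description, not the final interface:

/*
 * Sketch only: 'hash' maps each ip to its direct (trampoline) address.
 * Building the hash is omitted here since struct ftrace_hash stays
 * private to ftrace in this series.
 */
static struct ftrace_ops direct_ops;	/* one shared ops for all sites */

static int attach_batch(struct ftrace_hash *hash)
{
	/* the first call registers direct_ops, later calls only update it */
	return update_ftrace_direct_add(&direct_ops, hash);
}

static int detach_batch(struct ftrace_hash *hash)
{
	/* can be called repeatedly, removing entries gradually */
	return update_ftrace_direct_del(&direct_ops, hash);
}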
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-5-jolsa@kernel.org
At the moment we allow the jmp attach only for a ftrace_ops that
has FTRACE_OPS_FL_JMP set. This conflicts with the following changes,
where we use a single ftrace_ops object for all direct call sites,
so any of them could be attached via either call or jmp.
We already limit the jmp attach support with a config option and a bit
(LSB) set on the trampoline address. It turns out that's actually
enough to limit the jmp attach per architecture and only to chosen
addresses (with the LSB set).
Each user of register_ftrace_direct or modify_ftrace_direct can set
the trampoline bit (LSB) to indicate it has to be attached by jmp.
The bpf trampoline generation code uses trampoline flags to generate
jmp-attach specific code and the ftrace inner code uses the trampoline
bit (LSB) to handle the return from a jmp attachment, so there's no harm
in removing the FTRACE_OPS_FL_JMP bit.
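For illustration, the convention amounts to tagging the trampoline address
itself (the helper names below are illustrative, not taken from the patch):

#define JMP_TAG_BIT	0x1UL

/* caller side: request a jmp attachment for this trampoline */
static inline unsigned long tramp_set_jmp(unsigned long tramp)
{
	return tramp | JMP_TAG_BIT;
}

/* ftrace side: does this address ask for jmp instead of call? */
static inline bool tramp_is_jmp(unsigned long tramp)
{
	return tramp & JMP_TAG_BIT;
}

/* strip the tag to get the real trampoline entry */
static inline unsigned long tramp_addr(unsigned long tramp)
{
	return tramp & ~JMP_TAG_BIT;
}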
The fexit/fmodret performance stays the same (did not drop),
current code:
fentry : 77.904 ± 0.546M/s
fexit : 62.430 ± 0.554M/s
fmodret : 66.503 ± 0.902M/s
with this change:
fentry : 80.472 ± 0.061M/s
fexit : 63.995 ± 0.127M/s
fmodret : 67.362 ± 0.175M/s
Fixes: 25e4e3565d ("ftrace: Introduce FTRACE_OPS_FL_JMP")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-2-jolsa@kernel.org
This commit fixes a security issue where BPF_PROG_DETACH on tcx or
netkit devices could be executed by any user when no program fd was
provided, bypassing permission checks. The fix adds a capability
check for CAP_NET_ADMIN or CAP_SYS_ADMIN in this case.
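A rough sketch of the added check (the helper name and surrounding context
are simplified here, not the exact hunk):

static int check_detach_caps(void)
{
	/* fd-less BPF_PROG_DETACH on tcx/netkit: reject unprivileged callers */
	if (!capable(CAP_NET_ADMIN) && !capable(CAP_SYS_ADMIN))
		return -EPERM;
	return 0;
}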
Fixes: e420bed025 ("bpf: Add fd-based tcx multi-prog infra with link support")
Signed-off-by: Guillaume Gonnet <ggonnet.linux@gmail.com>
Link: https://lore.kernel.org/r/20260127160200.10395-1-ggonnet.linux@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jiayuan Chen says:
====================
bpf: Fix FIONREAD and copied_seq issues
syzkaller reported a bug [1] where a socket using sockmap, after being
unloaded, exposed incorrect copied_seq calculation. The selftest I
provided can be used to reproduce the issue reported by syzkaller.
TCP recvmsg seq # bug 2: copied E92C873, seq E68D125, rcvnxt E7CEB7C, fl 40
WARNING: CPU: 1 PID: 5997 at net/ipv4/tcp.c:2724 tcp_recvmsg_locked+0xb2f/0x2910 net/ipv4/tcp.c:2724
Call Trace:
<TASK>
receive_fallback_to_copy net/ipv4/tcp.c:1968 [inline]
tcp_zerocopy_receive+0x131a/0x2120 net/ipv4/tcp.c:2200
do_tcp_getsockopt+0xe28/0x26c0 net/ipv4/tcp.c:4713
tcp_getsockopt+0xdf/0x100 net/ipv4/tcp.c:4812
do_sock_getsockopt+0x34d/0x440 net/socket.c:2421
__sys_getsockopt+0x12f/0x260 net/socket.c:2450
__do_sys_getsockopt net/socket.c:2457 [inline]
__se_sys_getsockopt net/socket.c:2454 [inline]
__x64_sys_getsockopt+0xbd/0x160 net/socket.c:2454
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xcd/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
A sockmap socket maintains its own receive queue (ingress_msg) which may
contain data from either its own protocol stack or forwarded from other
sockets.
FD1:read()
-- FD1->copied_seq++
| [read data]
|
[enqueue data] v
[sockmap] -> ingress to self -> ingress_msg queue
FD1 native stack ------> ^
-- FD1->rcv_nxt++ -> redirect to other | [enqueue data]
| |
| ingress to FD1
v ^
... | [sockmap]
FD2 native stack
The issue occurs when reading from ingress_msg: we update tp->copied_seq
by default, but if the data comes from other sockets (not the socket's
own protocol stack), tcp->rcv_nxt remains unchanged. Later, when
converting back to a native socket, reads may fail as copied_seq could
be significantly larger than rcv_nxt.
Additionally, FIONREAD calculation based on copied_seq and rcv_nxt is
insufficient for sockmap sockets, requiring separate field tracking.
[1] https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983
---
v7 -> v9: Address Jakub Sitnicki's feedback:
- Remove sk_receive_queue check in tcp_bpf_ioctl, only report
ingress_msg data length for FIONREAD
- Minor nit fixes
- Add Reviewed-by tag from John Fastabend
- Fix CI error
https://lore.kernel.org/bpf/20260113025121.197535-1-jiayuan.chen@linux.dev/
v5 -> v7: Some modifications suggested by Jakub Sitnicki, and added Reviewed-by tag.
https://lore.kernel.org/bpf/20260106051458.279151-1-jiayuan.chen@linux.dev/
v1 -> v5: Use skmsg.sk instead of extending the BPF_F_XXX macro and fix
the failure reported by CI
v1: https://lore.kernel.org/bpf/20251117110736.293040-1-jiayuan.chen@linux.dev/
====================
Link: https://patch.msgid.link/20260124113314.113584-1-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
A socket using sockmap has its own independent receive queue: ingress_msg.
This queue may contain data from its own protocol stack or from other
sockets.
Therefore, for sockmap, relying solely on copied_seq and rcv_nxt to
calculate FIONREAD is not enough.
This patch adds a new msg_tot_len field in the psock structure to record
the data length in ingress_msg. Additionally, we implement new ioctl
interfaces for TCP and UDP to intercept FIONREAD operations.
Note that we intentionally do not include sk_receive_queue data in the
FIONREAD result. Data in sk_receive_queue has not yet been processed by
the BPF verdict program, and may be redirected to other sockets or
dropped. Including it would create semantic ambiguity since this data
may never be readable by the user.
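For illustration, a sketch of what such a FIONREAD intercept could look
like for TCP; the msg_tot_len field is the one described above, while the
function shape and locking details here are assumptions, not the final
code:

static int tcp_bpf_ioctl_sketch(struct sock *sk, int cmd, int *karg)
{
	struct sk_psock *psock;

	if (cmd != SIOCINQ)
		return -ENOIOCTLCMD;	/* anything else goes to the regular handler */

	rcu_read_lock();
	psock = sk_psock(sk);
	/* report only data already verdict-processed into ingress_msg */
	*karg = psock ? READ_ONCE(psock->msg_tot_len) : 0;
	rcu_read_unlock();
	return 0;
}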
Unix and VSOCK sockets have similar issues, but fixing them is outside
the scope of this patch as it would require more intrusive changes.
Previous work by John Fastabend made some efforts towards FIONREAD support:
commit e5c6de5fa0 ("bpf, sockmap: Incorrectly handling copied_seq")
Although the current patch is based on the previous work by John Fastabend,
it is acceptable for our Fixes tag to point to the same commit.
FD1:read()
-- FD1->copied_seq++
| [read data]
|
[enqueue data] v
[sockmap] -> ingress to self -> ingress_msg queue
FD1 native stack ------> ^
-- FD1->rcv_nxt++ -> redirect to other | [enqueue data]
| |
| ingress to FD1
v ^
... | [sockmap]
FD2 native stack
Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20260124113314.113584-3-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
A socket using sockmap has its own independent receive queue: ingress_msg.
This queue may contain data from its own protocol stack or from other
sockets.
The issue is that when reading from ingress_msg, we update tp->copied_seq
by default. However, if the data is not from the socket's own protocol
stack, tp->rcv_nxt is not increased. Later, if we convert this socket back
to a native socket, reading from it may fail because copied_seq might
be significantly larger than rcv_nxt.
This fix also addresses the syzkaller-reported bug referenced in the
Closes tag.
This patch marks the skmsg objects in ingress_msg. When reading, we update
copied_seq only if the data comes from the socket's own protocol stack.
FD1:read()
-- FD1->copied_seq++
| [read data]
|
[enqueue data] v
[sockmap] -> ingress to self -> ingress_msg queue
FD1 native stack ------> ^
-- FD1->rcv_nxt++ -> redirect to other | [enqueue data]
| |
| ingress to FD1
v ^
... | [sockmap]
FD2 native stack
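For illustration, the conditional update amounts to something like the
sketch below; the source-socket mark on the skmsg follows the series
description, and the helper name here is illustrative:

static void sketch_advance_copied_seq(struct sock *sk, struct sk_msg *msg,
				      u32 copied)
{
	/* data redirected from another socket never advanced this rcv_nxt */
	if (msg->sk != sk)
		return;

	tcp_sk(sk)->copied_seq += copied;	/* under the socket lock in real code */
}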
Closes: https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983
Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260124113314.113584-2-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, the BPF cgroup iterator supports walking descendants in
either pre-order (BPF_CGROUP_ITER_DESCENDANTS_PRE) or post-order
(BPF_CGROUP_ITER_DESCENDANTS_POST). These modes perform an exhaustive
depth-first search (DFS) of the hierarchy. In scenarios where a BPF
program may need to inspect only the direct children of a given parent
cgroup, a full DFS is unnecessarily expensive.
This patch introduces a new BPF cgroup iterator control option,
BPF_CGROUP_ITER_CHILDREN. This control option restricts the traversal
to the immediate children of a specified parent cgroup, allowing for
more targeted and efficient iteration when a full DFS traversal is not
required.
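For illustration, attaching a cgroup iterator program in this mode from
user space could look like the sketch below; BPF_CGROUP_ITER_CHILDREN is
the new value introduced here, while the program pointer and parent fd are
placeholders:

#include <linux/bpf.h>
#include <bpf/libbpf.h>

static struct bpf_link *attach_children_iter(struct bpf_program *prog,
					     int parent_cgroup_fd)
{
	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
	union bpf_iter_link_info linfo = {};

	linfo.cgroup.cgroup_fd = parent_cgroup_fd;
	linfo.cgroup.order = BPF_CGROUP_ITER_CHILDREN;	/* direct children only */
	opts.link_info = &linfo;
	opts.link_info_len = sizeof(linfo);

	return bpf_program__attach_iter(prog, &opts);
}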
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Link: https://lore.kernel.org/r/20260127085112.3608687-1-mattbobrowski@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The test_bpftool_map.sh script tests that map read/write accesses
are properly allowed or refused by the kernel depending on a specific
fmod_ret program being attached to the security_bpf_map function.
Rewrite this test to integrate it into test_progs. The
new test spawns a few subtests:
#36/1 bpftool_maps_access/unprotected_unpinned:OK
#36/2 bpftool_maps_access/unprotected_pinned:OK
#36/3 bpftool_maps_access/protected_unpinned:OK
#36/4 bpftool_maps_access/protected_pinned:OK
#36/5 bpftool_maps_access/nested_maps:OK
#36/6 bpftool_maps_access/btf_list:OK
#36 bpftool_maps_access:OK
Summary: 1/6 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
Acked-by: Quentin Monnet <qmo@kernel.org>
Link: https://lore.kernel.org/r/20260123-bpftool-tests-v4-3-a6653a7f28e7@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The test_bpftool_metadata.sh script validates that bpftool properly
returns in its output any metadata generated by bpf programs through
some .rodata sections.
Port this test to the test_progs framework so that it can be executed
automatically in CI. The new test, similarly to the former script,
checks that valid metadata appears in both textual and JSON output,
for both unused and used data. For the JSON
check, the expected JSON string is hardcoded to avoid bringing a
new external dependency (e.g. a JSON deserializer) into test_progs.
As the test is now converted to test_progs, remove the former script.
The newly converted test brings two new subtests:
#37/1 bpftool_metadata/metadata_unused:OK
#37/2 bpftool_metadata/metadata_used:OK
#37 bpftool_metadata:OK
Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
Link: https://lore.kernel.org/r/20260123-bpftool-tests-v4-2-a6653a7f28e7@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In order to integrate some bpftool tests into test_progs, define a few
specific helpers that allow executing bpftool commands and optionally
retrieving the command output. Those helpers most notably set the
path to the bpftool binary under test: they check different
possible paths relative to the directories where the different
test_progs runners are executed, to make sure we do not
accidentally use a bootstrap version of the binary.
Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
Link: https://lore.kernel.org/r/20260123-bpftool-tests-v4-1-a6653a7f28e7@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
CI occasionally reports failures in the
percpu_alloc/cpu_flag_lru_percpu_hash selftest, for example:
First test_progs failure (test_progs_no_alu32-x86_64-llvm-21):
#264/15 percpu_alloc/cpu_flag_lru_percpu_hash
...
test_percpu_map_op_cpu_flag:FAIL:bpf_map_lookup_batch value on specified cpu unexpected bpf_map_lookup_batch value on specified cpu: actual 0 != expected 3735929054
The unexpected value indicates that an element was removed from the map.
However, the test never calls delete_elem(), so the only possible cause
is LRU eviction.
This can happen when the current task migrates to another CPU: an
update_elem() triggers eviction because there is no available LRU node
on either the local or the global freelist.
Harden the test against this behavior by provisioning sufficient spare
elements. Set max_entries to 'nr_cpus * 2' and restrict the test to using
the first nr_cpus entries, ensuring that updates do not spuriously trigger
LRU eviction.
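A sketch of the hardening in selftest terms (the skeleton type and map
name are placeholders, not the actual test code):

static void harden_lru_test(struct test_skel *skel)	/* skeleton type is a placeholder */
{
	int nr_cpus = libbpf_num_possible_cpus();
	int key;

	/* before load: spare elements so update_elem() never has to evict */
	bpf_map__set_max_entries(skel->maps.lru_percpu_hash, nr_cpus * 2);

	/* ... load skeleton ... */

	/* only exercise the first nr_cpus keys, well below max_entries */
	for (key = 0; key < nr_cpus; key++)
		; /* lookup/update/batch checks go here */
}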
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260119133417.19739-1-leon.hwang@linux.dev
Changwoo Min says:
====================
selftests/bpf: Introduce execution context detection helpers
This series introduces four new BPF-native inline helpers -- bpf_in_nmi(),
bpf_in_hardirq(), bpf_in_serving_softirq(), and bpf_in_task() -- to allow
BPF programs to query the current execution context.
Following the feedback on v1, these are implemented in bpf_experimental.h
as inline helpers wrapping get_preempt_count(). This approach allows the
logic to be JIT-inlined for better performance compared to a kfunc call,
while providing the granular context detection (e.g., hardirq vs. softirq)
required by subsystems like sched_ext.
The series includes a new selftest suite, exe_ctx, which uses bpf_testmod
to verify context detection across Task, HardIRQ, and SoftIRQ boundaries
via irq_work and tasklets. NMI context testing is omitted as NMIs cannot
be triggered deterministically within software-only BPF CI environments.
ChangeLog v2 -> v3:
- Added exe_ctx to DENYLIST.s390x since new helpers are supported only
on x86 and arm64 (patch 2).
- Added comments to helpers describing supported architectures (patch 1).
ChangeLog v1 -> v2:
- Dropped the core kernel kfunc implementations, and implemented context
detection as inline BPF helpers in bpf_experimental.h.
- Renamed the selftest suite from ctx_kfunc to exe_ctx to reflect the
change from kfuncs to helpers.
- Updated BPF programs to use the new inline helpers.
- Swapped clean-up order between tasklet and irqwork in bpf_testmod to
avoid re-scheduling the already-killed tasklet (reported by bot+bpf-ci).
====================
Link: https://patch.msgid.link/20260125115413.117502-1-changwoo@igalia.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a new selftest suite `exe_ctx` to verify the accuracy of the
bpf_in_task(), bpf_in_hardirq(), and bpf_in_serving_softirq() helpers
introduced in bpf_experimental.h.
Testing these execution contexts deterministically requires crossing
context boundaries within a single CPU. To achieve this, the test
implements a "Trigger-Observer" pattern using bpf_testmod:
1. Trigger: A BPF syscall program calls a new bpf_testmod kfunc
bpf_kfunc_trigger_ctx_check().
2. Task to HardIRQ: The kfunc uses irq_work_queue() to trigger a
self-IPI on the local CPU.
3. HardIRQ to SoftIRQ: The irq_work handler calls a dummy function
(observed by BPF fentry) and then schedules a tasklet to
transition into SoftIRQ context.
The user-space runner ensures determinism by pinning itself to CPU 0
before execution, forcing the entire interrupt chain to remain on a
single core. Dummy noinline functions with compiler barriers are
added to bpf_testmod.c to serve as stable attachment points for
fentry programs. A retry loop is used in user-space to wait for the
asynchronous SoftIRQ to complete.
Note that testing on s390x is avoided because supporting those helpers
purely in BPF on s390x is not possible at this point.
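For illustration, the trigger chain boils down to the sketch below; the
bpf_testmod symbol names here are illustrative, not the actual test module
code:

#include <linux/irq_work.h>
#include <linux/interrupt.h>

static void sketch_softirq_fn(struct tasklet_struct *t)
{
	/* a fentry program observes a dummy function called from here (SoftIRQ) */
}
static DECLARE_TASKLET(sketch_tasklet, sketch_softirq_fn);

static void sketch_hardirq_fn(struct irq_work *work)
{
	/* a fentry program observes a dummy function called from here (HardIRQ) */
	tasklet_schedule(&sketch_tasklet);	/* hop HardIRQ -> SoftIRQ */
}
static DEFINE_IRQ_WORK(sketch_irq_work, sketch_hardirq_fn);

/* called from the BPF syscall program via the testmod kfunc (task context) */
static void sketch_trigger(void)
{
	irq_work_queue(&sketch_irq_work);	/* self-IPI on the local CPU */
}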
Reviewed-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Link: https://lore.kernel.org/r/20260125115413.117502-3-changwoo@igalia.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce bpf_in_nmi(), bpf_in_hardirq(), bpf_in_serving_softirq(), and
bpf_in_task() inline helpers in bpf_experimental.h. These allow BPF
programs to query the current execution context with higher granularity
than the existing bpf_in_interrupt() helper.
While BPF programs can often infer their context from attachment points,
subsystems like sched_ext may call the same BPF logic from multiple
contexts (e.g., task-to-task wake-ups vs. interrupt-to-task wake-ups).
These helpers provide a reliable way for logic to branch based on
the current CPU execution state.
Implementing these as BPF-native inline helpers wrapping
get_preempt_count() allows the compiler and JIT to inline the logic. The
implementation accounts for differences in preempt_count layout between
standard and PREEMPT_RT kernels.
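For illustration, one of these helpers could look roughly like the sketch
below; get_preempt_count() is the wrapper named above, the masks mirror
the non-PREEMPT_RT preempt_count layout, and the names are illustrative:

#define SKETCH_HARDIRQ_MASK	0x000f0000UL	/* HARDIRQ_MASK on !PREEMPT_RT */
#define SKETCH_SOFTIRQ_OFFSET	0x00000100UL	/* SOFTIRQ_OFFSET */

static inline bool sketch_in_hardirq(void)
{
	return get_preempt_count() & SKETCH_HARDIRQ_MASK;
}

static inline bool sketch_in_serving_softirq(void)
{
	return get_preempt_count() & SKETCH_SOFTIRQ_OFFSET;
}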
Reviewed-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Link: https://lore.kernel.org/r/20260125115413.117502-2-changwoo@igalia.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Menglong Dong says:
====================
bpf: fsession support
overall
-------
Sometimes, we need to hook both the entry and exit of a function with
TRACING. Therefore, we need define a FENTRY and a FEXIT for the target
function, which is not convenient.
Therefore, we add a tracing session support for TRACING. Generally
speaking, it's similar to kprobe session, which can hook both the entry
and exit of a function with a single BPF program.
We allow the usage of bpf_get_func_ret() to get the return value in the
fentry of the tracing session, as it will always get "0", which is safe
enough and is OK.
Session cookie is also supported with the kfunc bpf_session_cookie().
In order to limit the stack usage, we limit the maximum number of cookies
to 4.
kfunc design
------------
In order to keep consistency with the existing kfuncs, we don't introduce
new kfuncs for fsession. Instead, we reuse the existing kfuncs
bpf_session_cookie() and bpf_session_is_return().
The prototypes of bpf_session_cookie() and bpf_session_is_return() don't
satisfy our needs, so we change them by adding a "void *ctx" argument.
We inline bpf_session_cookie() and bpf_session_is_return() for fsession
directly in the verifier, so we don't need to introduce new
functions for them.
architecture
------------
The fsession support is arch specific, so -EOPNOTSUPP will be returned if
it is not yet supported by the arch. In this series, we only support
x86_64; other archs will be implemented later.
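For illustration, a user-side fsession program could look like the sketch
below; the "fsession/" section name and the exact kfunc prototypes (with
the added void *ctx argument) are assumptions based on this cover letter:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

extern bool bpf_session_is_return(void *ctx) __ksym __weak;
extern __u64 *bpf_session_cookie(void *ctx) __ksym __weak;

SEC("fsession/bpf_fentry_test1")
int BPF_PROG(handle, int a)
{
	__u64 *cookie = bpf_session_cookie(ctx);

	if (!bpf_session_is_return(ctx)) {
		if (cookie)
			*cookie = 42;		/* stash state on entry */
		return 0;
	}

	/* exit path: the cookie still holds the value written on entry */
	return 0;
}

char _license[] SEC("license") = "GPL";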
Changes v12 -> v13:
* fix the selftests fail on !x86_64 in the 11th patch
* v12: https://lore.kernel.org/bpf/20260124033119.28682-1-dongml2@chinatelecom.cn/
Changes v11 -> v12:
* update the variable "delta" in the 2nd patch
* improve the fsession testcase by adding the 11th patch, which will test
bpf_get_func_* for fsession
* v11: https://lore.kernel.org/bpf/20260123073532.238985-1-dongml2@chinatelecom.cn/
Changes v10 -> v11:
* rebase and fix the conflicts in the 2nd patch
* use "volatile" in the 11th patch
* rename BPF_TRAMP_SHIFT_* to BPF_TRAMP_*_SHIFT
* v10: https://lore.kernel.org/bpf/20260115112246.221082-1-dongml2@chinatelecom.cn/
Changes v9 -> v10:
* 1st patch: some small adjustment, such as use switch in
bpf_prog_has_trampoline()
* 2nd patch: some adjustment to the commit log and comment
* 3rd patch:
- drop the declaration of bpf_session_is_return() and
bpf_session_cookie()
- use vmlinux.h instead of bpf_kfuncs.h in uprobe_multi_session.c,
kprobe_multi_session_cookie.c and uprobe_multi_session_cookie.c
* 4th patch:
- some adjustment to the comment and commit log
- rename the prefix from BPF_TRAMP_M_ to BPF_TRAMP_SHIFT_
- remove the definition of BPF_TRAMP_M_NR_ARGS
- check the program type in bpf_session_filter()
* 5th patch: some adjustment to the commit log
* 6th patch:
- add the "reg" to the function arguments of emit_store_stack_imm64()
- use the positive offset in emit_store_stack_imm64()
* 7th patch:
- use "|" for func_meta instead of "+"
- pass the "func_meta_off" to invoke_bpf() explicitly, instead of
computing it with "stack_size + 8"
- pass the "cookie_off" to invoke_bpf() instead of computing the current
cookie index with "func_meta"
* 8th patch:
- split the modification to bpftool to a separate patch
* v9: https://lore.kernel.org/bpf/20260110141115.537055-1-dongml2@chinatelecom.cn/
Changes v8 -> v9:
* remove the definition of bpf_fsession_cookie and bpf_fsession_is_return
in the 4th and 5th patch
* rename emit_st_r0_imm64() to emit_store_stack_imm64() in the 6th patch
* v8: https://lore.kernel.org/bpf/20260108022450.88086-1-dongml2@chinatelecom.cn/
Changes v7 -> v8:
* use the last byte of nr_args for bpf_get_func_arg_cnt() in the 2nd patch
* v7: https://lore.kernel.org/bpf/20260107064352.291069-1-dongml2@chinatelecom.cn/
Changes v6 -> v7:
* change the prototype of bpf_session_cookie() and bpf_session_is_return(),
and reuse them instead of introducing new kfuncs for fsession.
* v6: https://lore.kernel.org/bpf/20260104122814.183732-1-dongml2@chinatelecom.cn/
Changes v5 -> v6:
* No changes in this version, just a rebase to deal with conflicts.
* v5: https://lore.kernel.org/bpf/20251224130735.201422-1-dongml2@chinatelecom.cn/
Changes v4 -> v5:
* use fsession terminology consistently in all patches
* 1st patch:
- use more explicit way in __bpf_trampoline_link_prog()
* 4th patch:
- remove "cookie_cnt" in struct bpf_trampoline
* 6th patch:
- rename nr_regs to func_md
- define cookie_off in a new line
* 7th patch:
- remove the handling of BPF_TRACE_SESSION in legacy fallback path for
BPF_RAW_TRACEPOINT_OPEN
* v4: https://lore.kernel.org/bpf/20251217095445.218428-1-dongml2@chinatelecom.cn/
Changes v3 -> v4:
* instead of adding a new hlist to progs_hlist in trampoline, add the bpf
program to both the fentry hlist and the fexit hlist.
* introduce the 2nd patch to reuse the nr_args field in the stack to
store all the information we need (except the session cookies).
* limit the maximum number of cookies to 4.
* remove the logic to skip fexit if the fentry return non-zero.
* v3: https://lore.kernel.org/bpf/20251026030143.23807-1-dongml2@chinatelecom.cn/
Changes v2 -> v3:
* squeeze some patches:
- the 2 patches for the kfunc bpf_tracing_is_exit() and
bpf_fsession_cookie() are merged into the second patch.
- the testcases for fsession are also squeezed.
* fix the CI error by move the testcase for bpf_get_func_ip to
fsession_test.c
* v2: https://lore.kernel.org/bpf/20251022080159.553805-1-dongml2@chinatelecom.cn/
Changes v1 -> v2:
* session cookie support.
In this version, session cookie is implemented, and the kfunc
bpf_fsession_cookie() is added.
* restructure the layout of the stack.
In this version, the session stuff that stored in the stack is changed,
and we locate them after the return value to not break
bpf_get_func_ip().
* testcase enhancement.
Some nits in the testcase that suggested by Jiri is fixed. Meanwhile,
the testcase for get_func_ip and session cookie is added too.
* v1: https://lore.kernel.org/bpf/20251018142124.783206-1-dongml2@chinatelecom.cn/
====================
Link: https://patch.msgid.link/20260124062008.8657-1-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Test the session cookie for fsession. Multiple fsession BPF programs are
attached to bpf_fentry_test1() and the session cookie is read and written
in the testcase.
bpf_get_func_ip() influences the layout of the session cookies, so we
test the cookie in two cases: with and without bpf_get_func_ip().
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20260124062008.8657-13-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add BPF_TRACE_FSESSION support to x86_64, including:
1. clear the return value on the stack before fentry so that the fentry
part of the fsession can only get 0 from bpf_get_func_ret().
2. clear all the session cookie values on the stack.
3. store the index of the cookie into ctx[-1] before calling the fsession
entry.
4. store the "is_return" flag into ctx[-1] before calling the fexit part of
the fsession.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Co-developed-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260124062008.8657-8-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Implement session cookie for fsession. The session cookies will be stored
in the stack, and the layout of the stack will look like this:
return value -> 8 bytes
argN -> 8 bytes
...
arg1 -> 8 bytes
nr_args -> 8 bytes
ip (optional) -> 8 bytes
cookie2 -> 8 bytes
cookie1 -> 8 bytes
The offset of the cookie for the current bpf program, in 8-byte units, is
obtained as
(((u64 *)ctx)[-1] >> BPF_TRAMP_COOKIE_INDEX_SHIFT) & 0xFF. Therefore, we
can get the session cookie with ((u64 *)ctx)[-offset].
Implement bpf_session_cookie() for fsession and inline it directly in the
verifier.
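For illustration, the inlined lookup effectively computes the following
(the helper name is illustrative; BPF_TRAMP_COOKIE_INDEX_SHIFT comes from
this series):

static __always_inline u64 *fsession_cookie_ptr(void *ctx)
{
	u64 meta = ((u64 *)ctx)[-1];
	u64 off = (meta >> BPF_TRAMP_COOKIE_INDEX_SHIFT) & 0xFF;	/* 8-byte units */

	return &((u64 *)ctx)[-off];	/* as described above: cookie sits at ctx[-offset] */
}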
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20260124062008.8657-6-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
If fsession exists, we will use the bit (1 << BPF_TRAMP_IS_RETURN_SHIFT)
in ((u64 *)ctx)[-1] to store the "is_return" flag.
The logic of bpf_session_is_return() for fsession is implemented in the
verifier by inlining the following code:
bool bpf_session_is_return(void *ctx)
{
return (((u64 *)ctx)[-1] >> BPF_TRAMP_IS_RETURN_SHIFT) & 1;
}
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Co-developed-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260124062008.8657-5-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
For now, ((u64 *)ctx)[-1] is used to store the nr_args in the trampoline.
However, 1 byte is enough to store such information. Therefore, we use
only the least significant byte of ((u64 *)ctx)[-1] to store the nr_args,
and reserve the rest for other usages.
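For illustration, readers of this field now mask out the low byte (the
helper name is illustrative):

static __always_inline u64 tramp_nr_args(void *ctx)
{
	return ((u64 *)ctx)[-1] & 0xff;	/* low byte holds nr_args, the rest is reserved */
}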
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20260124062008.8657-3-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
If the argument 'pull_len' of run_test() is 'PULL_MAX' or
'PULL_MAX | PULL_PLUS_ONE', the eventual pull_len size
will be close to the page size. On arm64 systems with 64K pages,
the pull_len size will be close to 64K, but the existing buffer
is close to 9000 bytes, which is not enough to pull from.
For those failing run_test() cases, make the buffer size
pg_sz + (pg_sz / 2)
This way, there will be enough buffer space to pull
regardless of the page size.
Tested-by: Alan Maguire <alan.maguire@oracle.com>
Cc: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260123055128.495265-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
On arm64 systems with 64K pages, the selftest task_local_data has the following
failures:
...
test_task_local_data_basic:PASS:tld_create_key 0 nsec
test_task_local_data_basic:FAIL:tld_create_key unexpected tld_create_key: actual 0 != expected -28
...
test_task_local_data_basic_thread:PASS:run task_main 0 nsec
test_task_local_data_basic_thread:FAIL:task_main retval unexpected error: 2 (errno 0)
test_task_local_data_basic_thread:FAIL:tld_get_data value0 unexpected tld_get_data value0: actual 0 != expected 6268
...
#447/1 task_local_data/task_local_data_basic:FAIL
...
#447/2 task_local_data/task_local_data_race:FAIL
#447 task_local_data:FAIL
When TLD_DYN_DATA_SIZE is the 64K page size, for
struct tld_meta_u {
	_Atomic __u8 cnt;
	__u16 size;
	struct tld_metadata metadata[];
};
the 'cnt' field would overflow. For example, with a 4K page 'cnt' reaches
4096/64 = 64, but with a 64K page it reaches 65536/64 = 1024, which does
not fit in a __u8. To accommodate 64K pages,
'_Atomic __u8 cnt' becomes '_Atomic __u16 cnt'. A few other places
are adjusted accordingly.
In test_task_local_data.c, the value for TLD_DYN_DATA_SIZE is changed
from 4096 to (getpagesize() - 8), since the maximum buffer size for
TLD_DYN_DATA_SIZE is (getpagesize() - 8).
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
Cc: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260123055122.494352-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The TAS fallback can be invoked directly when queued spin locks are
disabled, and through the slow path when paravirt is enabled for queued
spin locks. In the latter case, the res_spin_lock macro will attempt the
fast path and already hold the entry when entering the slow path. This
leads to the creation of extraneous entries that are never released, which
may cause false positives for deadlock detection.
Fix this by grabbing the held lock entry before invoking the TAS fallback
in every case, and add a comment noting this.
Fixes: c9102a68c0 ("rqspinlock: Add a test-and-set fallback")
Reported-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Tested-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260122115911.3668985-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When wq__attach() fails, serial_test_wq() returns early without calling
wq__destroy(), leaking the skeleton resources allocated by
wq__open_and_load(). This causes ASAN leak reports in selftests runs.
Fix this by jumping to a common clean_up label that calls wq__destroy()
on all exit paths after successful open_and_load.
Note that the early return after wq__open_and_load() failure is correct
and doesn't need fixing, since that function returns NULL on failure
(after internally cleaning up any partial allocations).
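The resulting flow looks roughly like this (simplified sketch, not the
exact patch):

static void serial_test_wq_sketch(void)
{
	struct wq *wq_skel;
	int err;

	wq_skel = wq__open_and_load();
	if (!ASSERT_OK_PTR(wq_skel, "wq_skel_load"))
		return;	/* NULL: skeleton already cleaned up internally */

	err = wq__attach(wq_skel);
	if (!ASSERT_OK(err, "wq_attach"))
		goto cleanup;

	/* ... run the actual bpf_wq_start() checks ... */

cleanup:
	wq__destroy(wq_skel);
}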
Fixes: 8290dba519 ("selftests/bpf: wq: add bpf_wq_start() checks")
Signed-off-by: Kery Qi <qikeyu2017@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20260121094114.1801-3-qikeyu2017@gmail.com
Yuzuki Ishiyama says:
====================
bpf: Add kfunc bpf_strncasecmp()
This patchset introduces bpf_strncasecmp to allow case-insensitive and
limited-length string comparison. This is useful for parsing protocol
headers like HTTP.
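For illustration, a hedged usage sketch from BPF C; the kfunc name comes
from this cover letter while the exact prototype is an assumption here:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

extern int bpf_strncasecmp(const char *s1, const char *s2, __u32 len) __ksym __weak;

SEC("syscall")
int check_header(void *ctx)
{
	char hdr[] = "content-length: 42";

	/* case-insensitive, length-limited: matches "Content-Length" too */
	return bpf_strncasecmp(hdr, "Content-Length", 14) == 0;
}

char _license[] SEC("license") = "GPL";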
---
Changes in v5:
- Fixed the test function numbering
Changes in v4:
- Updated the loop variable to maintain style consistency
Changes in v3:
- Use ternary operator to maintain style consistency
- Reverted unnecessary doc comment about XATTR_SIZE_MAX
Changes in v2:
- Compute max_sz upfront and remove len check from the loop body
- Document that @len is limited by XATTR_SIZE_MAX
====================
Link: https://patch.msgid.link/20260121033328.1850010-1-ishiyama@hpc.is.uec.ac.jp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>