linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-19 22:11:53 -04:00

Author	SHA1	Message	Date
Bastien Curutchet (eBPF Foundation)	2233ef8bba	selftests/bpf: test_xsk: Initialize bitmap before use The bitmap is used before being initialized. Initialize it to zero before using it. Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Link: https://lore.kernel.org/r/20251031-xsk-v7-2-39fe486593a3@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-31 09:24:38 -07:00
Bastien Curutchet (eBPF Foundation)	3ab77f35a7	selftests/bpf: test_xsk: Split xskxceiver AF_XDP features are tested by the test_xsk.sh script but not by the test_progs framework. The tests used by the script are defined in xskxceiver.c, which can't be integrated in the test_progs framework as is. Extract these test definitions from xskxceiver{.c/.h} to put them in new test_xsk{.c/.h} files. Keep the main() function and its unshared dependencies in xskxceiver to avoid impacting the test_xsk.sh script, which is often used to test real hardware. Move ksft_test_result_*() calls to xskxceiver.c to keep the kselftest's report valid. Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Link: https://lore.kernel.org/r/20251031-xsk-v7-1-39fe486593a3@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-31 09:24:38 -07:00
Puranjay Mohan	5701d5aefa	bpf: Use kmalloc_nolock() in bpf streams BPF stream kfuncs need to be non-sleeping as they can be called from programs running in any context, this requires a way to allocate memory from any context. Currently, this is done by a custom per-CPU NMI-safe bump allocation mechanism, backed by alloc_pages_nolock() and free_pages_nolock() primitives. As kmalloc_nolock() and kfree_nolock() primitives are available now, the custom allocator can be removed in favor of these. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251023161448.4263-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-29 18:19:46 -07:00
Alexei Starovoitov	d28c0e4921	Merge branch 'misc-rqspinlock-updates' Kumar Kartikeya Dwivedi says: ==================== Misc rqspinlock updates A couple of changes for rqspinlock, the first disables propagation of AA and ABBA deadlocks to waiters succeeding the deadlocking waiter. A more verbose rationale is available in the commit log. The second commit expands the stress test to introduce a ABBCCA mode that will reliably exercise the timeout fallback. ==================== Link: https://lore.kernel.org/r/20251029181828.231529-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-29 18:17:56 -07:00
Kumar Kartikeya Dwivedi	a8a0abf097	selftests/bpf: Add ABBCCA case for rqspinlock stress test Introduce a new mode for the rqspinlock stress test that exercises a deadlock that won't be detected by the AA and ABBA checks, such that we always reliably trigger the timeout fallback. We need 4 CPUs for this particular case, as CPU 0 is untouched, and three participant CPUs for triggering the ABBCCA case. Refactor the lock acquisition paths in the module to better reflect the three modes and choose the right lock depending on the context. Also drop ABBA case from running by default as part of test progs, since the stress test can consume a significant amount of time. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Reviewed-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251029181828.231529-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-29 18:17:56 -07:00
Kumar Kartikeya Dwivedi	7bd6e5ce5b	rqspinlock: Disable queue destruction for deadlocks Disable propagation and unwinding of the waiter queue in case the head waiter detects a deadlock condition, but keep it enabled in case of the timeout fallback. Currently, when the head waiter experiences an AA deadlock, it will signal all its successors in the queue to exit with an error. This is not ideal for cases where the same lock is held in contexts which can cause errors in an unrestricted fashion (e.g., BPF programs, or kernel paths invoked through BPF programs), and core kernel logic which is written in a correct fashion and does not expect deadlocks. The same reasoning can be extended to ABBA situations. Depending on the actual runtime schedule, one or both of the head waiters involved in an ABBA situation can detect and exit directly without terminating their waiter queue. If the ABBA situation manifests again, the waiters will keep exiting until progress can be made, or a timeout is triggered in case of more complicated locking dependencies. We still preserve the queue destruction in case of timeouts, as either the locking dependencies are too complex to be captured by AA and ABBA heuristics, or the owner is perpetually stuck. As such, it would be unwise to continue to apply the timeout for each new head waiter without terminating the queue, since we may end up waiting for more than 250 ms in aggregate with all participants in the locking transaction. The patch itself is fairly simple; we can simply signal our successor to become the next head waiter, and leave the queue without attempting to acquire the lock. With this change, the behavior for waiters in case of deadlocks experienced by a predecessor changes. It is guaranteed that call sites will no longer receive errors if the predecessors encounter deadlocks and the successors do not participate in one. 
This should lower the failure rate for waiters that are not doing improper locking operations but were unlucky enough to queue behind a misbehaving waiter. However, timeouts are still a possibility, hence they must be accounted for, so users cannot rely upon errors not occurring at all. Suggested-by: Amery Hung <ameryhung@gmail.com> Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251029181828.231529-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-29 18:17:56 -07:00
Mykyta Yatsenko	5913e936f6	selftests/bpf: Fix intermittent failures in file_reader test file_reader/on_open_expect_fault intermittently fails when test_progs runs tests in parallel, because it expects a page fault on first read. Another file_reader test running concurrently may have already pulled the same pages into the page cache, eliminating the fault and causing a spurious failure. Make file_reader/on_open_expect_fault read from a file region that does not overlap with other file_reader tests, so the initial access still faults even under parallel execution. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev> Link: https://lore.kernel.org/r/20251029195907.858217-1-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-29 18:15:30 -07:00
Martin KaFai Lau	e2e668bd81	Merge branch 'selftests-bpf-convert-test_tc_tunnel-sh-to-test_progs' Alexis Lothoré says: ==================== Hello, this is the v3 of test_tc_tunnel conversion into test_progs framework. This new revision: - fixes a few issues spotted by the bot reviewer - removes any test ensuring connection failure (and so depending on a timeout) to keep the execution time reasonable test_tc_tunnel.sh tests a variety of tunnels based on BPF: packets are encapsulated by a BPF program on the client egress. We then check that those packets can be decapsulated on server ingress side, either thanks to kernel-based or BPF-based decapsulation. Those tests are run thanks to two veths in two dedicated namespaces. - patches 1 and 2 are preparatory patches - patch 3 introduces tc_tunnel test into test_progs - patch 4 gets rid of the test_tc_tunnel.sh script The new test has been executed both in some x86 local qemu machine, as well as in CI: # ./test_progs -a tc_tunnel #454/1 tc_tunnel/ipip_none:OK #454/2 tc_tunnel/ipip6_none:OK #454/3 tc_tunnel/ip6tnl_none:OK #454/4 tc_tunnel/sit_none:OK #454/5 tc_tunnel/vxlan_eth:OK #454/6 tc_tunnel/ip6vxlan_eth:OK #454/7 tc_tunnel/gre_none:OK #454/8 tc_tunnel/gre_eth:OK #454/9 tc_tunnel/gre_mpls:OK #454/10 tc_tunnel/ip6gre_none:OK #454/11 tc_tunnel/ip6gre_eth:OK #454/12 tc_tunnel/ip6gre_mpls:OK #454/13 tc_tunnel/udp_none:OK #454/14 tc_tunnel/udp_eth:OK #454/15 tc_tunnel/udp_mpls:OK #454/16 tc_tunnel/ip6udp_none:OK #454/17 tc_tunnel/ip6udp_eth:OK #454/18 tc_tunnel/ip6udp_mpls:OK #454 tc_tunnel:OK Summary: 1/18 PASSED, 0 SKIPPED, 0 FAILED ==================== Link: https://patch.msgid.link/20251027-tc_tunnel-v3-0-505c12019f9d@bootlin.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2025-10-29 12:24:57 -07:00
Alexis Lothoré (eBPF Foundation)	5d3591607d	selftests/bpf: Remove test_tc_tunnel.sh Now that test_tc_tunnel.sh scope has been ported to the test_progs framework, remove it. Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20251027-tc_tunnel-v3-4-505c12019f9d@bootlin.com	2025-10-29 12:17:24 -07:00
Alexis Lothoré (eBPF Foundation)	8517b1abe5	selftests/bpf: Integrate test_tc_tunnel.sh tests into test_progs The test_tc_tunnel.sh script checks that a large variety of tunneling mechanisms handled by the kernel can be handled as well by eBPF programs. While this test shares similarities with test_tunnel.c (which is already integrated in test_progs), those are testing slightly different things: - test_tunnel.c creates a tunnel interface, and then gets and sets tunnel keys in packet metadata, from BPF programs. - test_tc_tunnel.sh manually parses/crafts packet content Bring the tests covered by test_tc_tunnel.sh into the test_progs framework, by creating a dedicated test_tc_tunnel.c. This new test defines a "generic" runner which, for each test configuration: - will configure the relevant veth pair, each of those isolated in a dedicated namespace - will check that traffic will fail if there is only an encapsulating program attached to one veth egress - will check that traffic succeeds if we enable some decapsulation module on kernel side - will check that traffic still succeeds if we replace the kernel decapsulation with some eBPF ingress decapsulation. 
Example of the new test execution: # ./test_progs -a tc_tunnel #447/1 tc_tunnel/ipip_none:OK #447/2 tc_tunnel/ipip6_none:OK #447/3 tc_tunnel/ip6tnl_none:OK #447/4 tc_tunnel/sit_none:OK #447/5 tc_tunnel/vxlan_eth:OK #447/6 tc_tunnel/ip6vxlan_eth:OK #447/7 tc_tunnel/gre_none:OK #447/8 tc_tunnel/gre_eth:OK #447/9 tc_tunnel/gre_mpls:OK #447/10 tc_tunnel/ip6gre_none:OK #447/11 tc_tunnel/ip6gre_eth:OK #447/12 tc_tunnel/ip6gre_mpls:OK #447/13 tc_tunnel/udp_none:OK #447/14 tc_tunnel/udp_eth:OK #447/15 tc_tunnel/udp_mpls:OK #447/16 tc_tunnel/ip6udp_none:OK #447/17 tc_tunnel/ip6udp_eth:OK #447/18 tc_tunnel/ip6udp_mpls:OK #447 tc_tunnel:OK Summary: 1/18 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20251027-tc_tunnel-v3-3-505c12019f9d@bootlin.com	2025-10-29 12:17:22 -07:00
Alexis Lothoré (eBPF Foundation)	86433db932	selftests/bpf: Make test_tc_tunnel.bpf.c compatible with big endian platforms When trying to run bpf-based encapsulation in a s390x environment, some parts of test_tc_tunnel.bpf.o do not encapsulate the traffic correctly, leading to test failures. Adding some logs shows for example that packets about to be sent on an interface with the ip6vxlan_eth program attached do not have the expected value 5 in the ip header ihl field, and so are ignored by the program. This phenomenon appears when trying to cross-compile the selftests, rather than compiling them from a virtualized host: the selftests build system may then wrongly pick some host headers. If <asm/byteorder.h> ends up being picked on the host (and if the host has an endianness different from the target one), it will then expose wrong endianness defines (e.g. __LITTLE_ENDIAN_BITFIELD instead of __BIG_ENDIAN_BITFIELD), and it will for example mess up the iphdr structure layout used in the ebpf program. To prevent this, directly use the vmlinux.h header generated by the selftests build system rather than directly including specific kernel headers. As a consequence, add some missing definitions that are not exposed by vmlinux.h, and adapt the bitfield manipulations to allow building and using the program on both types of platforms. Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20251027-tc_tunnel-v3-2-505c12019f9d@bootlin.com	2025-10-29 11:07:26 -07:00
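The bitfield problem described in the commit above can be illustrated with a minimal userspace sketch. It is not the code from the patch: the helper names (ipv4_version, ipv4_ihl, ipv4_set_ver_ihl) are hypothetical, and the sketch only assumes the standard IPv4 layout, where the first header byte carries the version in its high nibble and the IHL in its low nibble. Reading that byte with shifts and masks gives the same result regardless of which endianness define the build picked up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: access the first IPv4 header byte with shifts and
 * masks instead of C bitfields, so the result does not depend on
 * __LITTLE_ENDIAN_BITFIELD or __BIG_ENDIAN_BITFIELD being defined correctly. */
static inline uint8_t ipv4_version(const uint8_t *hdr)
{
	return hdr[0] >> 4;		/* high nibble: IP version */
}

static inline uint8_t ipv4_ihl(const uint8_t *hdr)
{
	return hdr[0] & 0x0f;		/* low nibble: header length in 32-bit words */
}

static inline void ipv4_set_ver_ihl(uint8_t *hdr, uint8_t version, uint8_t ihl)
{
	assert(version <= 0x0f && ihl <= 0x0f);	/* both fields are 4 bits wide */
	hdr[0] = (uint8_t)((version << 4) | ihl);
}
```

With a 20-byte header and no options, ipv4_set_ver_ihl(hdr, 4, 5) stores 0x45 in the first byte, and ipv4_ihl(hdr) reads back the value 5 that the ip6vxlan_eth program expects.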
Alexis Lothoré (eBPF Foundation)	1d5137c8d1	selftests/bpf: Add tc helpers The test_tunnel.c file defines small functions to easily attach eBPF programs to tc hooks, either on egress, ingress or both. Create a shared helper in network_helpers.c so that other tests can benefit from it. Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20251027-tc_tunnel-v3-1-505c12019f9d@bootlin.com	2025-10-29 11:07:24 -07:00
Jianyun Gao	54c134f379	libbpf: Fix the incorrect reference to the memlock_rlim variable in the comment. The variable "memlock_rlim_max" referenced in the comment does not exist. I think that the author probably meant the variable "memlock_rlim". So, correct it. Signed-off-by: Jianyun Gao <jianyungao89@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20251027032008.738944-1-jianyungao89@gmail.com	2025-10-28 10:28:53 -07:00
Jianyun Gao	4f361895ae	libbpf: Optimize the redundant code in the bpf_object__init_user_btf_maps() function. In the elf_sec_data() function, the input parameter 'scn' will be evaluated. If it is NULL, then it will directly return NULL. Therefore, the return value of the elf_sec_data() function already takes into account the case where the input parameter scn is NULL. Therefore, subsequently, the code only needs to check whether the return value of the elf_sec_data() function is NULL. Signed-off-by: Jianyun Gao <jianyungao89@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20251024080802.642189-1-jianyungao89@gmail.com	2025-10-28 10:26:00 -07:00
Arnaud Lecomte	23f852daa4	bpf: Fix stackmap overflow check in __bpf_get_stackid() Syzkaller reported a KASAN slab-out-of-bounds write in __bpf_get_stackid() when copying stack trace data. The issue occurs when the perf trace contains more stack entries than the stack map bucket can hold, leading to an out-of-bounds write in the bucket's data array. Fixes: `ee2a098851` ("bpf: Adjust BPF stack helper functions to accommodate skip > 0") Reported-by: syzbot+c9b724fbb41cf2538b7b@syzkaller.appspotmail.com Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20251025192941.1500-1-contact@arnaud-lcm.com Closes: https://syzkaller.appspot.com/bug?extid=c9b724fbb41cf2538b7b	2025-10-28 09:20:27 -07:00
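The class of bug described in the commit above (a trace with more entries than the bucket can hold) can be sketched in plain userspace C. This is an illustration under stated assumptions, not the kernel fix: struct stack_bucket, BUCKET_CAP and bucket_store() are hypothetical names, and clamping the copy length is one possible guard, which may differ from what the patch actually does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BUCKET_CAP 4u	/* hypothetical bucket capacity, in entries */

struct stack_bucket {
	uint32_t nr;			/* number of valid entries in data[] */
	uint64_t data[BUCKET_CAP];
};

/* Copy a stack trace into a fixed-size bucket, skipping the first 'skip'
 * entries. The count is clamped to BUCKET_CAP so that a long trace can
 * never write past the end of data[]. Returns the number of entries stored. */
static inline uint32_t bucket_store(struct stack_bucket *b, const uint64_t *trace,
				    uint32_t trace_nr, uint32_t skip)
{
	uint32_t n;

	assert(b && trace);
	if (skip >= trace_nr) {		/* nothing left after skipping */
		b->nr = 0;
		return 0;
	}
	n = trace_nr - skip;
	if (n > BUCKET_CAP)		/* the bounds check that prevents the overflow */
		n = BUCKET_CAP;
	memcpy(b->data, trace + skip, n * sizeof(b->data[0]));
	b->nr = n;
	return n;
}
```

Storing a six-entry trace with skip set to 1 keeps four entries (the second through the fifth) and drops the rest instead of writing out of bounds.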
Arnaud Lecomte	e17d62fedd	bpf: Refactor stack map trace depth calculation into helper function Extract the duplicated maximum allowed depth computation for stack traces stored in BPF stacks from bpf_get_stackid() and __bpf_get_stack() into a dedicated stack_map_calculate_max_depth() helper function. This unifies the logic for: - The max depth computation - Enforcing the sysctl_perf_event_max_stack limit No functional changes for existing code paths. Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20251025192858.31424-1-contact@arnaud-lcm.com	2025-10-28 09:20:27 -07:00
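The two steps that the commit above says the helper unifies (computing the maximum depth and enforcing the perf stack limit) can be sketched as follows. The function name calc_max_depth and its parameter list are assumptions made for this illustration; the real stack_map_calculate_max_depth() signature is not given in the commit message, and the limit is passed in as a plain argument here rather than read from sysctl_perf_event_max_stack.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace sketch: how many stack entries may be collected
 * for a buffer of buf_size bytes holding elem_size-byte entries, when the
 * first 'skip' entries are discarded, capped at the limit 'max_stack'. */
static inline uint32_t calc_max_depth(uint32_t buf_size, uint32_t elem_size,
				      uint32_t skip, uint32_t max_stack)
{
	uint32_t depth;

	assert(elem_size != 0);			/* caller must pass a real entry size */
	depth = buf_size / elem_size + skip;	/* entries that fit, plus skipped ones */
	return depth > max_stack ? max_stack : depth;
}
```

For a 64-byte buffer of 8-byte entries with skip set to 2 and a limit of 127, the sketch returns 10; for a 4096-byte buffer the limit applies and it returns 127.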
Zhang Chujun	88427328e3	bpftool: Fix missing closing parenthesis for BTF_KIND_UNKN In the btf_dumper_do_type function, the debug print statement for BTF_KIND_UNKN was missing a closing parenthesis in the output format. This patch adds the missing ')' to ensure proper formatting of the dump output. Signed-off-by: Zhang Chujun <zhangchujun@cmss.chinamobile.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20251028063345.1911-1-zhangchujun@cmss.chinamobile.com	2025-10-28 09:00:55 -07:00
Xu Kuohai	f9db3a3822	selftests/bpf/benchs: Add overwrite mode benchmark for BPF ring buffer Add --rb-overwrite option to benchmark BPF ring buffer in overwrite mode. Since overwrite mode is not yet supported by libbpf for consumer, also add --rb-bench-producer option to benchmark producer directly without a consumer. Benchmarks on an x86_64 and an arm64 CPU are shown below for reference. - AMD EPYC 9654 (x86_64) Ringbuf, multi-producer contention in overwrite mode, no consumer ================================================================= rb-prod nr_prod 1 32.180 ± 0.033M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 2 9.617 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 3 8.810 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 4 9.272 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 8 9.173 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 12 3.086 ± 0.032M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 16 2.945 ± 0.021M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 20 2.519 ± 0.021M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 24 2.545 ± 0.021M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 28 2.363 ± 0.024M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 32 2.357 ± 0.021M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 36 2.267 ± 0.011M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 40 2.284 ± 0.020M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 44 2.215 ± 0.025M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 48 2.193 ± 0.023M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 52 2.208 ± 0.024M/s (drops 0.000 ± 0.000M/s) - HiSilicon Kunpeng 920 (arm64) Ringbuf, multi-producer contention in overwrite mode, no consumer ================================================================= rb-prod nr_prod 1 14.478 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 2 21.787 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 3 6.045 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 4 5.352 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 8 4.850 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-prod 
nr_prod 12 3.542 ± 0.016M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 16 3.509 ± 0.021M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 20 3.171 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 24 3.154 ± 0.014M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 28 2.974 ± 0.015M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 32 3.167 ± 0.014M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 36 2.903 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 40 2.866 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 44 2.914 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 48 2.806 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 52 2.840 ± 0.012M/s (drops 0.000 ± 0.000M/s) Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20251018035738.4039621-4-xukuohai@huaweicloud.com	2025-10-27 19:47:32 -07:00
Xu Kuohai	8f7a86ecde	selftests/bpf: Add overwrite mode test for BPF ring buffer Add overwrite mode test for BPF ring buffer. The test creates a BPF ring buffer in overwrite mode, then repeatedly reserves and commits records to check if the ring buffer works as expected both before and after overwriting occurs. Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20251018035738.4039621-3-xukuohai@huaweicloud.com	2025-10-27 19:46:32 -07:00
Xu Kuohai	feeaf1346f	bpf: Add overwrite mode for BPF ring buffer When the BPF ring buffer is full, a new event cannot be recorded until one or more old events are consumed to make enough space for it. In cases such as fault diagnostics, where recent events are more useful than older ones, this mechanism may lead to critical events being lost. So add overwrite mode for BPF ring buffer to address it. In this mode, the new event overwrites the oldest event when the buffer is full. The basic idea is as follows: 1. producer_pos tracks the next position to record new event. When there is enough free space, producer_pos is simply advanced by producer to make space for the new event. 2. To avoid waiting for consumer when the buffer is full, a new variable, overwrite_pos, is introduced for producer. It points to the oldest event committed in the buffer. It is advanced by producer to discard one or more oldest events to make space for the new event when the buffer is full. 3. pending_pos tracks the oldest event to be committed. pending_pos is never passed by producer_pos, so multiple producers never write to the same position at the same time. The following example diagrams show how it works in a 4096-byte ring buffer. 1. At first, {producer,overwrite,pending,consumer}_pos are all set to 0. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ \| \| \| \| \| \| +-----------------------------------------------------------------------+ ^ \| \| producer_pos = 0 overwrite_pos = 0 pending_pos = 0 consumer_pos = 0 2. Now reserve a 512-byte event A. There is enough free space, so A is allocated at offset 0. And producer_pos is advanced to 512, the end of A. Since A is not submitted, the BUSY bit is set. 
0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ \| \| \| \| A \| \| \| [BUSY] \| \| +-----------------------------------------------------------------------+ ^ ^ \| \| \| \| \| producer_pos = 512 \| overwrite_pos = 0 pending_pos = 0 consumer_pos = 0 3. Reserve event B, size 1024. B is allocated at offset 512 with BUSY bit set, and producer_pos is advanced to the end of B. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ \| \| \| \| \| A \| B \| \| \| [BUSY] \| [BUSY] \| \| +-----------------------------------------------------------------------+ ^ ^ \| \| \| \| \| producer_pos = 1536 \| overwrite_pos = 0 pending_pos = 0 consumer_pos = 0 4. Reserve event C, size 2048. C is allocated at offset 1536, and producer_pos is advanced to 3584. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ \| \| \| \| \| \| A \| B \| C \| \| \| [BUSY] \| [BUSY] \| [BUSY] \| \| +-----------------------------------------------------------------------+ ^ ^ \| \| \| \| \| producer_pos = 3584 \| overwrite_pos = 0 pending_pos = 0 consumer_pos = 0 5. Submit event A. The BUSY bit of A is cleared. B becomes the oldest event to be committed, so pending_pos is advanced to 512, the start of B. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ \| \| \| \| \| \| A \| B \| C \| \| \| \| [BUSY] \| [BUSY] \| \| +-----------------------------------------------------------------------+ ^ ^ ^ \| \| \| \| \| \| \| pending_pos = 512 producer_pos = 3584 \| overwrite_pos = 0 consumer_pos = 0 6. Submit event B. The BUSY bit of B is cleared, and pending_pos is advanced to the start of C, which is now the oldest event to be committed. 
0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ \| \| \| \| \| \| A \| B \| C \| \| \| \| \| [BUSY] \| \| +-----------------------------------------------------------------------+ ^ ^ ^ \| \| \| \| \| \| \| pending_pos = 1536 producer_pos = 3584 \| overwrite_pos = 0 consumer_pos = 0 7. Reserve event D, size 1536 (3 * 512). There are 2048 bytes not being written between producer_pos (currently 3584) and pending_pos, so D is allocated at offset 3584, and producer_pos is advanced by 1536 (from 3584 to 5120). Since event D will overwrite all bytes of event A and the first 512 bytes of event B, overwrite_pos is advanced to the start of event C, the oldest event that is not overwritten. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ \| \| \| \| \| \| D End \| \| C \| D Begin\| \| [BUSY] \| \| [BUSY] \| [BUSY] \| +-----------------------------------------------------------------------+ ^ ^ ^ \| \| \| \| \| pending_pos = 1536 \| \| overwrite_pos = 1536 \| \| \| producer_pos=5120 \| consumer_pos = 0 8. Reserve event E, size 1024. Although there are 512 bytes not being written between producer_pos and pending_pos, E cannot be reserved, as it would overwrite the first 512 bytes of event C, which is still being written. 9. Submit event C and D. pending_pos is advanced to the end of D. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ \| \| \| \| \| \| D End \| \| C \| D Begin\| \| \| \| \| \| +-----------------------------------------------------------------------+ ^ ^ ^ \| \| \| \| \| overwrite_pos = 1536 \| \| \| producer_pos=5120 \| pending_pos=5120 \| consumer_pos = 0 The performance data for overwrite mode will be provided in a follow-up patch that adds overwrite-mode benchmarks. 
A sample of performance data for non-overwrite mode, collected on an x86_64 CPU and an arm64 CPU, before and after this patch, is shown below. As we can see, no obvious performance regression occurs. - x86_64 (AMD EPYC 9654) Before: Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 11.623 ± 0.027M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 15.812 ± 0.014M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 7.871 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.703 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 2.896 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.054 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 1.864 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 1.580 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 1.484 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.369 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.316 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.272 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.239 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.226 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.213 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.193 ± 0.001M/s (drops 0.000 ± 0.000M/s) After: Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 11.845 ± 0.036M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 15.889 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 8.155 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.708 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 2.918 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.065 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 1.870 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 1.582 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 1.482 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.372 ± 0.002M/s (drops 0.000 
± 0.000M/s) rb-libbpf nr_prod 32 1.323 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.264 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.236 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.209 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.189 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.165 ± 0.002M/s (drops 0.000 ± 0.000M/s) - arm64 (HiSilicon Kunpeng 920) Before: Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 11.310 ± 0.623M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 9.947 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 6.634 ± 0.011M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 4.502 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.888 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 3.372 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 3.189 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 2.998 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 3.086 ± 0.018M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 2.845 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 2.815 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 2.771 ± 0.009M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 2.814 ± 0.011M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 2.752 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 2.695 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 2.710 ± 0.006M/s (drops 0.000 ± 0.000M/s) After: Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 11.283 ± 0.550M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 9.993 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 6.898 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 5.257 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.830 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 3.528 ± 0.013M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 3.265 ± 0.018M/s (drops 
0.000 ± 0.000M/s) rb-libbpf nr_prod 20 2.990 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 2.929 ± 0.014M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 2.898 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 2.818 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 2.789 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 2.770 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 2.651 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 2.669 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 2.695 ± 0.009M/s (drops 0.000 ± 0.000M/s) Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20251018035738.4039621-2-xukuohai@huaweicloud.com	2025-10-27 19:42:39 -07:00
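The position bookkeeping walked through in the overwrite-mode commit above (producer_pos, overwrite_pos, pending_pos) can be reproduced with a small userspace model. This is a sketch under stated assumptions, not the kernel implementation: struct rb_model, rb_reserve() and rb_commit() are hypothetical names, record headers and size rounding are ignored, positions are plain monotonically increasing byte offsets, consumer_pos is not modelled, and there is no concurrency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RB_SIZE     4096u	/* ring size used in the commit's example */
#define RB_MAX_RECS 64u		/* model limit on tracked records (assumption) */

struct rb_rec { uint32_t len; bool busy; };

struct rb_model {
	uint64_t producer_pos;	/* next position to record a new event */
	uint64_t overwrite_pos;	/* start of the oldest event not yet overwritten */
	uint64_t pending_pos;	/* start of the oldest event still to be committed */
	struct rb_rec recs[RB_MAX_RECS];
	size_t head;		/* index of the record at overwrite_pos */
	size_t pend;		/* index of the record at pending_pos */
	size_t tail;		/* one past the newest record */
};

/* Reserve len bytes. Returns a record index, or -1 if the reservation
 * would overwrite an event that is still being written. */
static inline long rb_reserve(struct rb_model *rb, uint32_t len)
{
	uint64_t new_prod = rb->producer_pos + len;
	size_t idx = rb->tail;

	if (len == 0 || len > RB_SIZE || rb->tail - rb->head == RB_MAX_RECS)
		return -1;
	if (new_prod - rb->pending_pos > RB_SIZE)	/* never pass pending_pos */
		return -1;
	/* Discard the oldest committed events until the new one fits. */
	while (new_prod - rb->overwrite_pos > RB_SIZE) {
		rb->overwrite_pos += rb->recs[rb->head % RB_MAX_RECS].len;
		rb->head++;
	}
	rb->recs[idx % RB_MAX_RECS] = (struct rb_rec){ .len = len, .busy = true };
	rb->tail++;
	rb->producer_pos = new_prod;
	return (long)idx;
}

/* Submit a reserved record: clear its BUSY flag, then advance pending_pos
 * past every consecutive committed record. */
static inline void rb_commit(struct rb_model *rb, long idx)
{
	assert(idx >= 0 && (size_t)idx >= rb->pend && (size_t)idx < rb->tail);
	rb->recs[(size_t)idx % RB_MAX_RECS].busy = false;
	while (rb->pend < rb->tail && !rb->recs[rb->pend % RB_MAX_RECS].busy) {
		rb->pending_pos += rb->recs[rb->pend % RB_MAX_RECS].len;
		rb->pend++;
	}
}
```

Replaying the steps from the commit message on a zero-initialized struct rb_model (reserve A of 512, B of 1024 and C of 2048 bytes, submit A and B, then reserve D of 1536 bytes) leaves producer_pos at 5120 and overwrite_pos at 1536; a further 1024-byte reservation is refused until C is submitted, which matches steps 7 and 8 of the description.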
Alexei Starovoitov	ff880798de	Merge branch 'bpf-introduce-file-dynptr' Mykyta Yatsenko says: ==================== bpf: Introduce file dynptr From: Mykyta Yatsenko <yatsenko@meta.com> This series adds a new dynptr kind, file dynptr, which enables BPF programs to perform safe reads from files in a structured way. Initial motivations include: * Parsing the executable’s ELF to locate thread-local variable symbols * Capturing stack traces when frame pointers are disabled By leveraging the existing dynptr abstraction, we reuse the verifier’s lifetime/size checks and keep the API consistent with existing dynptr read helpers. Technical details: 1. Reuses the existing freader library to read files a folio at a time. 2. bpf_dynptr_slice() and bpf_dynptr_read() always copy data from folios into a program-provided buffer; zero-copy access is intentionally not supported to keep it simple. 3. Reads may sleep if the requested folios are not in the page cache. 4. Few verifier changes required: * Support dynptr destruction in kfuncs * Add kfunc address substitution based on whether the program runs in a sleepable or non-sleepable context. Testing: The final patch adds a selftest that validates BPF program reads the same data as userspace, page faults are enabled in sleepable context and disabled in non-sleepable. Changelog: --- v4 -> v5 v4: https://lore.kernel.org/all/20251021200334.220542-1-mykyta.yatsenko5@gmail.com/ * Inlined and removed kfunc_call_imm(), run overflow check for call_imm only if !bpf_jit_supports_far_kfunc_call(). 
v3 -> v4 v3: https://lore.kernel.org/bpf/20251020222538.932915-1-mykyta.yatsenko5@gmail.com/ * Remove ringbuf usage from selftests * bpf_dynptr_set_null(ptr) when discarding file dynptr * call kfunc_call_imm() in specialize_kfunc() only, removed call from add_kfunc_call() v2 -> v3 v2: https://lore.kernel.org/bpf/20251015161155.120148-1-mykyta.yatsenko5@gmail.com/ * Add negative tests * Rewrote tests to use LSM for bpf_get_task_exe_file() * Move call_imm overflow check into kfunc_call_imm() v1 -> v2 v1: https://lore.kernel.org/bpf/20251003160416.585080-1-mykyta.yatsenko5@gmail.com/ * Remove ELF parsing selftest * Expanded u32 -> u64 refactoring, changes in include/uapi/linux/bpf.h * Removed freader.{c,h}, instead move freader definitions into buildid.h. * Small refactoring of the multiple folios reading algorithm * Directly return error after unmark_stack_slots_dynptr(). * Make kfuncs receive trusted arguments. * Remove enum bpf_is_sleepable, use bool instead * Remove unnecessary sorting from specialize_kfunc() * Remove bool kfunc_in_sleepable_ctx; field from the struct bpf_insn_aux_data, rely on non_sleepable field introduced by Kumar * Refactor selftests, do madvise(...MADV_PAGEOUT) for all pages read by the test * Introduce the test for non-sleepable case, verify it fails with -EFAULT ==================== Link: https://lore.kernel.org/r/20251026203853.135105-1-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-27 09:56:28 -07:00
Mykyta Yatsenko	784cdf9315	selftests/bpf: add file dynptr tests Introducing selftests for validating file-backed dynptr works as expected. * validate implementation supports dynptr slice and read operations * validate destructors should be paired with initializers * validate sleepable progs can page in. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251026203853.135105-11-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-27 09:56:27 -07:00
Mykyta Yatsenko	2c52e8943a	bpf: dispatch to sleepable file dynptr File dynptr reads may sleep when the requested folios are not in the page cache. To avoid sleeping in non-sleepable contexts while still supporting valid sleepable use, given that dynptrs are non-sleepable by default, enable sleeping only when bpf_dynptr_from_file() is invoked from a sleepable context. This change: * Introduces a sleepable constructor: bpf_dynptr_from_file_sleepable() * Override non-sleepable constructor with sleepable if it's always called in sleepable context Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251026203853.135105-10-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-27 09:56:27 -07:00
Mykyta Yatsenko	d869d56ca8	bpf: verifier: refactor kfunc specialization Move kfunc specialization (function address substitution) to later stage of verification to support a new use case, where we need to take into consideration whether kfunc is called in sleepable context. Minor refactoring in add_kfunc_call(), making sure that if function fails, kfunc desc is not added to tab->descs (previously it could be added or not, depending on what failed). Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251026203853.135105-9-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-27 09:56:27 -07:00
Mykyta Yatsenko	e3e36edb1b	bpf: add kfuncs and helpers support for file dynptrs Add support for file dynptr. Introduce struct bpf_dynptr_file_impl to hold internal state for file dynptrs, with 64-bit size and offset support. Introduce lifecycle management kfuncs: - bpf_dynptr_from_file() for initialization - bpf_dynptr_file_discard() for destruction Extend existing helpers to support file dynptrs in: - bpf_dynptr_read() - bpf_dynptr_slice() Write helpers (bpf_dynptr_write() and bpf_dynptr_data()) are not modified, as file dynptr is read-only. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251026203853.135105-8-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-27 09:56:27 -07:00
Mykyta Yatsenko	8d8771dc03	bpf: add plumbing for file-backed dynptr Add the necessary verifier plumbing for the new file-backed dynptr type. Introduce two kfuncs for its lifecycle management: * bpf_dynptr_from_file() for initialization * bpf_dynptr_file_discard() for destruction Currently there is no mechanism for kfunc to release dynptr, this patch add one: * Dynptr release function sets meta->release_regno * Call unmark_stack_slots_dynptr() if meta->release_regno is set and dynptr ref_obj_id is set as well. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251026203853.135105-7-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-27 09:56:27 -07:00
Mykyta Yatsenko	9cba966f1c	bpf: verifier: centralize const dynptr check in unmark_stack_slots_dynptr() Move the const dynptr check into unmark_stack_slots_dynptr() so callers don’t have to duplicate it. This puts the validation next to the code that manipulates dynptr stack slots and allows upcoming changes to reuse it directly. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251026203853.135105-6-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-27 09:56:27 -07:00
Mykyta Yatsenko	5a5fff604f	lib/freader: support reading more than 2 folios freader_fetch currently reads from at most two folios. When a read spans into a third folio, the overflow bytes are copied adjacent to the second folio’s data instead of being handled as a separate folio. This patch modifies fetch algorithm to support reading from many folios. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Reviewed-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20251026203853.135105-5-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-27 09:56:27 -07:00
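The multi-folio fetch described in the commit above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the freader code: `CHUNK_SIZE`, `read_across_chunks()` and the array-of-chunks layout are hypothetical stand-ins for folios, chosen so that a single read can span three or more of them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical chunk size standing in for a folio; real folios are far larger. */
#define CHUNK_SIZE 4

/*
 * Copy up to len bytes starting at byte offset off from a source that is
 * split into nr_chunks chunks of CHUNK_SIZE bytes each. Returns the number
 * of bytes copied, which is short if the range runs past the last chunk.
 */
static size_t read_across_chunks(const unsigned char *const *chunks,
				 size_t nr_chunks, size_t off,
				 unsigned char *dst, size_t len)
{
	size_t copied = 0;

	while (copied < len) {
		size_t pos = off + copied;
		size_t idx = pos / CHUNK_SIZE;       /* which chunk holds pos */
		size_t in_chunk = pos % CHUNK_SIZE;  /* offset inside that chunk */
		size_t n = CHUNK_SIZE - in_chunk;    /* bytes left in that chunk */

		if (idx >= nr_chunks)
			break;
		if (n > len - copied)
			n = len - copied;
		memcpy(dst + copied, chunks[idx] + in_chunk, n);
		copied += n;
	}
	return copied;
}

/* Usage: an 8-byte read at offset 2 touches three chunks. */
static void read_across_chunks_example(void)
{
	const unsigned char *chunks[] = {
		(const unsigned char *)"abcd",
		(const unsigned char *)"efgh",
		(const unsigned char *)"ijkl",
	};
	unsigned char buf[8];

	assert(read_across_chunks(chunks, 3, 2, buf, sizeof(buf)) == 8);
	assert(memcmp(buf, "cdefghij", 8) == 0);
}
```

The loop treats every chunk the same way, so nothing special happens at the second or third chunk boundary. That is the property the commit message says the two-folio version lacked.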
Mykyta Yatsenko	76e4fed847	lib: move freader into buildid.h Move struct freader and prototypes of the functions operating on it into the buildid.h. This allows reusing freader outside buildid, e.g. for file dynptr support added later. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20251026203853.135105-4-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-27 09:56:27 -07:00
Mykyta Yatsenko	531b87d865	bpf: widen dynptr size/offset to 64 bit Dynptr currently caps size and offset at 24 bits, which isn’t sufficient for file-backed use cases; even 32 bits can be limiting. Refactor dynptr helpers/kfuncs to use 64-bit size and offset, ensuring consistency across the APIs. This change does not affect internals of xdp, skb or other dynptrs, which continue to behave as before. Also it does not break binary compatibility. The widening enables large-file access support via dynptr, implemented in the next patches. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251026203853.135105-3-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-27 09:56:26 -07:00
Mykyta Yatsenko	a61a257ff5	selftests/bpf: remove unnecessary kfunc prototypes Remove unnecessary kfunc prototypes from test programs, these are provided by vmlinux.h Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251026203853.135105-2-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-27 09:56:26 -07:00
Anton Protopopov	e7586577b7	libbpf: fix formatting of bpf_object__append_subprog_code The commit `6c918709bd` ("libbpf: Refactor bpf_object__reloc_code") added the bpf_object__append_subprog_code() with incorrect indentations. Use tabs instead. (This also makes a consequent commit better readable.) Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20251019202145.3944697-14-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-21 11:20:23 -07:00
Anton Protopopov	2f69c56854	bpf: make bpf_insn_successors to return a pointer The bpf_insn_successors() function is used to return successors to a BPF instruction. So far, an instruction could have 0, 1 or 2 successors. Prepare the verifier code to introduction of instructions with more than 2 successors (namely, indirect jumps). To do this, introduce a new struct, struct bpf_iarray, containing an array of bpf instruction indexes and make bpf_insn_successors to return a pointer of that type. The storage for all instructions is allocated in the env->succ, which holds an array of size 2, to be used for all instructions. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251019202145.3944697-10-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-21 11:20:23 -07:00
Anton Protopopov	44481e4925	bpf: generalize and export map_get_next_key for arrays The kernel/bpf/array.c file defines the array_map_get_next_key() function which finds the next key for array maps. It actually doesn't use any map fields besides the generic max_entries field. Generalize it, and export as bpf_array_get_next_key() such that it can be re-used by other array-like maps. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251019202145.3944697-4-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-21 11:17:25 -07:00
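The point that next-key iteration over an array needs nothing beyond `max_entries` can be shown with a userspace model. This is a sketch, assuming the usual array-map convention that a missing or out-of-range key restarts iteration at index 0; `array_get_next_key_sketch()` is a hypothetical name and is not the exported `bpf_array_get_next_key()`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of next-key lookup for an array-like map.
 * Only max_entries is consulted, so any array-like map could share it.
 * Returns 0 and stores the next index, or -ENOENT after the last index.
 */
static int array_get_next_key_sketch(uint32_t max_entries,
				     const uint32_t *key, uint32_t *next_key)
{
	/* A NULL key means "start from the beginning". */
	uint32_t index = key ? *key : UINT32_MAX;

	if (index >= max_entries) {
		*next_key = 0;
		return 0;
	}
	if (index == max_entries - 1)
		return -ENOENT;

	*next_key = index + 1;
	return 0;
}

/* Usage: walk a 3-entry array and count the visited indexes. */
static void array_get_next_key_example(void)
{
	uint32_t key, next;
	int visited = 0;

	if (array_get_next_key_sketch(3, NULL, &next) == 0) {
		do {
			key = next;
			visited++;
		} while (array_get_next_key_sketch(3, &key, &next) == 0);
	}
	assert(visited == 3);
}
```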
Anton Protopopov	f7d72d0b3f	bpf: save the start of functions in bpf_prog_aux Introduce a new subprog_start field in bpf_prog_aux. This field may be used by JIT compilers wanting to know the real absolute xlated offset of the function being jitted. The func_info[func_id] may have served this purpose, but func_info may be NULL, so JIT compilers can't rely on it. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251019202145.3944697-3-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-21 11:17:25 -07:00
Anton Protopopov	6ea5fc92a0	bpf: fix the return value of push_stack In [1] Eduard mentioned that on push_stack failure verifier code should return -ENOMEM instead of -EFAULT. After checking with the other call sites I've found that code randomly returns either -ENOMEM or -EFAULT. This patch unifies the return values for the push_stack (and similar push_async_cb) functions such that error codes are always assigned properly. [1] https://lore.kernel.org/bpf/20250615085943.3871208-1-a.s.protopopov@gmail.com Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251019202145.3944697-2-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-21 11:17:25 -07:00
Shardul Bankar	96d31dff3f	bpf: Clarify get_outer_instance() handling in propagate_to_outer_instance() propagate_to_outer_instance() calls get_outer_instance() and uses the returned pointer to reset and commit stack write marks. Under normal conditions, update_instance() guarantees that an outer instance exists, so get_outer_instance() cannot return an ERR_PTR. However, explicitly checking for IS_ERR(outer_instance) makes this code more robust and self-documenting. It reduces cognitive load when reading the control flow and silences potential false-positive reports from static analysis or automated tooling. No functional change intended. Signed-off-by: Shardul Bankar <shardulsb08@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251021080849.860072-1-shardulsb08@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-21 09:39:05 -07:00
Daniel Borkmann	04a899573f	bpf: Do not let BPF test infra emit invalid GSO types to stack Yinhao et al. reported that their fuzzer tool was able to trigger a skb_warn_bad_offload() from netif_skb_features() -> gso_features_check(). When a BPF program - triggered via BPF test infra - pushes the packet to the loopback device via bpf_clone_redirect() then mentioned offload warning can be seen. GSO-related features are then rightfully disabled. We get into this situation due to convert___skb_to_skb() setting gso_segs and gso_size but not gso_type. Technically, it makes sense that this warning triggers since the GSO properties are malformed due to the gso_type. Potentially, the gso_type could be marked non-trustworthy through setting it at least to SKB_GSO_DODGY without any other specific assumptions, but that also feels wrong given we should not go further into the GSO engine in the first place. The checks were added in `121d57af30` ("gso: validate gso_type in GSO handlers") because there were malicious (syzbot) senders that combine a protocol with a non-matching gso_type. If we would want to drop such packets, gso_features_check() currently only returns feature flags via netif_skb_features(), so one location for potentially dropping such skbs could be validate_xmit_unreadable_skb(), but then otoh it would be an additional check in the fast-path for a very corner case. Given bpf_clone_redirect() is the only place where BPF test infra could emit such packets, lets reject them right there. Fixes: `850a88cc40` ("bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN") Fixes: `cf62089b0e` ("bpf: Add gso_size to __sk_buff") Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reported-by: Dongliang Mu <dzm91@hust.edu.cn> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20251020075441.127980-1-daniel@iogearbox.net	2025-10-20 13:16:10 -07:00
Puranjay Mohan	7361c86485	selftests/bpf: Fix list_del() in arena list The __list_del function doesn't set the previous node's next pointer to the next node of the node to be deleted. It just updates the local variable and not the actual pointer in the previous node. The test was passing up till now because the bpf code is doing bpf_free() after list_del and therefore reading head->first from the userspace will read all zeroes. But after arena_list_del() is finished, head->first should point to NULL. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20251017141727.51355-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-18 19:27:26 -07:00
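The bug class described above, updating a local copy instead of the previous node's link, is easiest to see in a plain C model of an hlist-style list. This is a sketch under assumptions: `struct node`, `struct head`, `node_add_head()` and `node_del()` are hypothetical names, and regular pointers stand in for the arena pointers used by the selftest.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hlist-style node: pprev points at whichever pointer refers to us. */
struct node {
	struct node *next;
	struct node **pprev;
};

struct head {
	struct node *first;
};

static void node_add_head(struct node *n, struct head *h)
{
	struct node *first = h->first;

	n->next = first;
	if (first)
		first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

static void node_del(struct node *n)
{
	struct node *next = n->next;
	struct node **pprev = n->pprev;

	/*
	 * Store through pprev so the previous link (or head->first) really
	 * changes. Assigning to a local variable here would leave the list
	 * still pointing at the deleted node.
	 */
	*pprev = next;
	if (next)
		next->pprev = pprev;
	n->next = NULL;
	n->pprev = NULL;
}

/* Usage: after deleting the only node, head.first must be NULL. */
static void node_del_example(void)
{
	struct head h = { NULL };
	struct node a;

	node_add_head(&a, &h);
	node_del(&a);
	assert(h.first == NULL);
}
```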
Chu Guangqing	b74938a3bd	samples/bpf: Fix spelling typos in samples/bpf do_hbm_test.sh: The comment incorrectly used "upcomming" instead of "upcoming". hbm.c The comment incorrectly used "Managment" instead of "Management". The comment incorrectly used "Currrently" instead of "Currently". tcp_cong_kern.c The comment incorrectly used "deteremined" instead of "determined". tracex1.bpf.c The comment incorrectly used "loobpack" instead of "loopback". Signed-off-by: Chu Guangqing <chuguangqing@inspur.com> Link: https://lore.kernel.org/r/20251015015024.2212-2-chuguangqing@inspur.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-18 19:26:23 -07:00
Yonghong Song	4f8543b5f2	selftests/bpf: Fix selftest verif_scale_strobemeta failure with llvm22 With latest llvm22, I hit the verif_scale_strobemeta selftest failure below: $ ./test_progs -n 618 libbpf: prog 'on_event': BPF program load failed: -E2BIG libbpf: prog 'on_event': -- BEGIN PROG LOAD LOG -- BPF program is too large. Processed 1000001 insn verification time 7019091 usec stack depth 488 processed 1000001 insns (limit 1000000) max_states_per_insn 28 total_states 33927 peak_states 12813 mark_read 0 -- END PROG LOAD LOG -- libbpf: prog 'on_event': failed to load: -E2BIG libbpf: failed to load object 'strobemeta.bpf.o' scale_test:FAIL:expect_success unexpected error: -7 (errno 7) #618 verif_scale_strobemeta:FAIL But if I increase the verification insn limit from 1M to 10M, the above test_progs run actually will succeed. The below is the result from veristat: $ ./veristat strobemeta.bpf.o Processing 'strobemeta.bpf.o'... File Program Verdict Duration (us) Insns States Program size Jited size ---------------- -------- ------- ------------- ------- ------ ------------ ---------- strobemeta.bpf.o on_event success 90250893 9777685 358230 15954 80794 ---------------- -------- ------- ------------- ------- ------ ------------ ---------- Done. Processed 1 files, 0 programs. Skipped 1 files, 0 programs. Further debugging shows the llvm commit [1] is responsible for the verification failure as it tries to convert certain switch statement to if-condition. Such change may cause different transformation compared to original switch statement. In bpf program strobemeta.c case, the initial llvm ir for read_int_var() function is define internal void @read_int_var(ptr noundef %0, i64 noundef %1, ptr noundef %2, ptr noundef %3, ptr noundef %4) #2 !dbg !535 { %6 = alloca ptr, align 8 %7 = alloca i64, align 8 %8 = alloca ptr, align 8 %9 = alloca ptr, align 8 %10 = alloca ptr, align 8 %11 = alloca ptr, align 8 %12 = alloca i32, align 4 ... 
%20 = icmp ne ptr %19, null, !dbg !561 br i1 %20, label %22, label %21, !dbg !562 21: ; preds = %5 store i32 1, ptr %12, align 4 br label %48, !dbg !563 22: %23 = load ptr, ptr %9, align 8, !dbg !564 ... 47: ; preds = %38, %22 store i32 0, ptr %12, align 4, !dbg !588 br label %48, !dbg !588 48: ; preds = %47, %21 call void @llvm.lifetime.end.p0(ptr %11) #4, !dbg !588 %49 = load i32, ptr %12, align 4 switch i32 %49, label %51 [ i32 0, label %50 i32 1, label %50 ] 50: ; preds = %48, %48 ret void, !dbg !589 51: ; preds = %48 unreachable } Note that the above 'switch' statement is added by clang frontend. Without [1], the switch statement will survive until SelectionDag, so the switch statement acts like a 'barrier' and prevents some transformation involved with both 'before' and 'after' the switch statement. But with [1], the switch statement will be removed during middle end optimization and later middle end passes (esp. after inlining) have more freedom to reorder the code. The following is the related source code: static void *calc_location(struct strobe_value_loc *loc, void *tls_base): bpf_probe_read_user(&tls_ptr, sizeof(void *), dtv); /* if pointer has (void *)-1 value, then TLS wasn't initialized yet */ return tls_ptr && tls_ptr != (void *)-1 ? tls_ptr + tls_index.offset : NULL; In read_int_var() func, we have: void *location = calc_location(&cfg->int_locs[idx], tls_base); if (!location) return; bpf_probe_read_user(value, sizeof(struct strobe_value_generic), location); ... The static func calc_location() is called inside read_int_var(). 
The asm code without [1]: 77: .123....89 (85) call bpf_probe_read_user#112 78: ........89 (79) r1 = *(u64 *)(r10 -368) 79: .1......89 (79) r2 = *(u64 *)(r10 -8) 80: .12.....89 (bf) r3 = r2 81: .123....89 (0f) r3 += r1 82: ..23....89 (07) r2 += 1 83: ..23....89 (79) r4 = *(u64 *)(r10 -464) 84: ..234...89 (a5) if r2 < 0x2 goto pc+13 85: ...34...89 (15) if r3 == 0x0 goto pc+12 86: ...3....89 (bf) r1 = r10 87: .1.3....89 (07) r1 += -400 88: .1.3....89 (b4) w2 = 16 In this case, 'r2 < 0x2' and 'r3 == 0x0' go to null 'location' place, so the verifier actually prefers to do verification first at 'r1 = r10' etc. The asm code with [1]: 119: .123....89 (85) call bpf_probe_read_user#112 120: ........89 (79) r1 = *(u64 *)(r10 -368) 121: .1......89 (79) r2 = *(u64 *)(r10 -8) 122: .12.....89 (bf) r3 = r2 123: .123....89 (0f) r3 += r1 124: ..23....89 (07) r2 += -1 125: ..23....89 (a5) if r2 < 0xfffffffe goto pc+6 126: ........89 (05) goto pc+17 ... 144: ........89 (b4) w1 = 0 145: .1......89 (6b) *(u16 *)(r8 +80) = r1 In this case, if 'r2 < 0xfffffffe' is true, the control will go to non-null 'location' branch, so 'goto pc+17' will actually go to null 'location' branch. This seems to cause a tremendous amount of verification state. To fix the issue, rewrite the following code return tls_ptr && tls_ptr != (void *)-1 ? tls_ptr + tls_index.offset : NULL; to if/then statement and hopefully these explicit if/then statements are sticky during middle-end optimizations. Tested with llvm20 and llvm21 as well, and all strobemeta related selftests pass. [1] https://github.com/llvm/llvm-project/pull/161000 Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20251014051639.1996331-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-18 19:25:03 -07:00
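The ternary-to-if/then rewrite named in the commit above can be sketched in standalone C. This is an illustration of the transformation only, assuming a plain byte offset in place of `tls_index.offset`; `resolve_tls_location()` and `TLS_UNINITIALIZED` are hypothetical names, and the exact shape of the selftest change may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name for the (void *)-1 sentinel meaning "TLS not initialized yet". */
#define TLS_UNINITIALIZED ((void *)(intptr_t)-1)

/*
 * Same result as
 *   tls_ptr && tls_ptr != (void *)-1 ? tls_ptr + offset : NULL
 * but spelled as explicit early returns, one condition per branch.
 */
static void *resolve_tls_location(void *tls_ptr, size_t offset)
{
	if (!tls_ptr)
		return NULL;
	if (tls_ptr == TLS_UNINITIALIZED)
		return NULL;
	return (char *)tls_ptr + offset;
}

/* Usage: both sentinel values yield NULL, a real pointer is offset. */
static void resolve_tls_location_example(void)
{
	char block[16];

	assert(resolve_tls_location(NULL, 4) == NULL);
	assert(resolve_tls_location(TLS_UNINITIALIZED, 4) == NULL);
	assert(resolve_tls_location(block, 4) == (void *)(block + 4));
}
```

Both forms compute the same value. The commit message only expresses a hope that the explicit branches survive middle-end optimization, so the effect on generated code is not guaranteed by this shape alone.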
Alexei Starovoitov	7a9f475d52	Merge branch 'bpf-mm-related-minor-changes' Yafang Shao says: ==================== These two minor patches were developed during the implementation of BPF-THP: https://lwn.net/Articles/1042138/ As suggested by Andrii, they are being submitted separately. ==================== Link: https://patch.msgid.link/20251016063929.13830-1-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-18 19:23:46 -07:00
Yafang Shao	7484e7cd8a	bpf: mark vma->{vm_mm,vm_file} as __safe_trusted_or_null The vma->vm_mm might be NULL and it can be accessed outside of RCU. Thus, we can mark it as trusted_or_null. With this change, BPF helpers can safely access vma->vm_mm to retrieve the associated mm_struct from the VMA. Then we can make policy decision from the VMA. The "trusted" annotation enables direct access to vma->vm_mm within kfuncs marked with KF_TRUSTED_ARGS or KF_RCU, such as bpf_task_get_cgroup1() and bpf_task_under_cgroup(). Conversely, "null" enforcement requires all callsites using vma->vm_mm to perform NULL checks. The lsm selftest must be modified because it directly accesses vma->vm_mm without a NULL pointer check; otherwise it will break due to this change. For the VMA based THP policy, the use case is as follows, @mm = @vma->vm_mm; // vm_area_struct::vm_mm is trusted or null if (!@mm) return; bpf_rcu_read_lock(); // rcu lock must be held to dereference the owner @owner = @mm->owner; // mm_struct::owner is rcu trusted or null if (!@owner) goto out; @cgroup1 = bpf_task_get_cgroup1(@owner, MEMCG_HIERARCHY_ID); /* make the decision based on the @cgroup1 attribute */ bpf_cgroup_release(@cgroup1); // release the associated cgroup out: bpf_rcu_read_unlock(); PSI memory information can be obtained from the associated cgroup to inform policy decisions. Since upstream PSI support is currently limited to cgroup v2, the following example demonstrates cgroup v2 implementation: @owner = @mm->owner; if (@owner) { // @ancestor_cgid is user-configured @ancestor = bpf_cgroup_from_id(@ancestor_cgid); if (bpf_task_under_cgroup(@owner, @ancestor)) { @psi_group = @ancestor->psi; /* Extract PSI metrics from @psi_group and * implement policy logic based on the values */ } } The vma::vm_file can also be marked with __safe_trusted_or_null. No additional selftests are required since vma->vm_file and vma->vm_mm are already validated in the existing selftest suite. 
Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Link: https://lore.kernel.org/r/20251016063929.13830-3-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-18 19:23:08 -07:00
Yafang Shao	ec8e3e27a1	bpf: mark mm->owner as __safe_rcu_or_null When CONFIG_MEMCG is enabled, we can access mm->owner under RCU. The owner can be NULL. With this change, BPF helpers can safely access mm->owner to retrieve the associated task from the mm. We can then make policy decision based on the task attribute. The typical use case is as follows, bpf_rcu_read_lock(); // rcu lock must be held for rcu trusted field @owner = @mm->owner; // mm_struct::owner is rcu trusted or null if (!@owner) goto out; /* Do something based on the task attribute */ out: bpf_rcu_read_unlock(); Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20251016063929.13830-2-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-18 19:23:08 -07:00
Tiezhu Yang	c67f4ae737	selftests/bpf: Silence unused-but-set build warnings There are some set but not used build errors when compiling bpf selftests with the latest upstream mainline GCC, at the beginning add the attribute __maybe_unused for the variables, but it is better to just add the option -Wno-unused-but-set-variable to CFLAGS in Makefile to disable the errors instead of hacking the tests. tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c:229:36: error: variable ‘n_matches_after_delete’ set but not used [-Werror=unused-but-set-variable=] tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c:229:25: error: variable ‘n_matches’ set but not used [-Werror=unused-but-set-variable=] tools/testing/selftests/bpf/prog_tests/bpf_cookie.c:426:22: error: variable ‘j’ set but not used [-Werror=unused-but-set-variable=] tools/testing/selftests/bpf/prog_tests/find_vma.c:52:22: error: variable ‘j’ set but not used [-Werror=unused-but-set-variable=] tools/testing/selftests/bpf/prog_tests/perf_branches.c:67:22: error: variable ‘j’ set but not used [-Werror=unused-but-set-variable=] tools/testing/selftests/bpf/prog_tests/perf_link.c:15:22: error: variable ‘j’ set but not used [-Werror=unused-but-set-variable=] Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Link: https://lore.kernel.org/r/20251018082815.20622-1-yangtiezhu@loongson.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-18 19:21:29 -07:00
Alexei Starovoitov	50de48a4dd	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf at 6.18-rc2 Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-18 18:20:57 -07:00
Linus Torvalds	1c64efcb08	Merge tag 'rust-rustfmt' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux Pull rustfmt fixes from Miguel Ojeda: "Rust 'rustfmt' cleanup 'rustfmt', by default, formats imports in a way that is prone to conflicts while merging and rebasing, since in some cases it condenses several items into the same line. Document in our guidelines that we will handle this for the moment with the trailing empty comment workaround and make the tree 'rustfmt'-clean again" * tag 'rust-rustfmt' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux: rust: bitmap: fix formatting rust: cpufreq: fix formatting rust: alloc: employ a trailing comment to keep vertical layout docs: rust: add section on imports formatting	2025-10-18 10:05:13 -10:00
Linus Torvalds	648937f64a	Merge tag 'tpmdd-next-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd Pull tpm fix from Jarkko Sakkinen: "Correct the state transitions for ARM FF-A to match the spec and how tpm_crb behaves on other platforms" * tag 'tpmdd-next-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd: tpm_crb: Add idle support for the Arm FF-A start method	2025-10-18 08:38:28 -10:00
Linus Torvalds	e67bb0da33	Merge tag 'pci-v6.18-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci Pull pci fixes from Bjorn Helgaas: - Search for MSI Capability with correct ID to fix an MSI regression on platforms with Cadence IP (Hans Zhang) - Revert early bridge resource set up to fix resource assignment failures that broke at least alpha boot and Snapdragon ath12k WiFi (Ilpo Järvinen) - Implement VMD .irq_startup()/.irq_shutdown() to fix IRQ issues that caused boot crashes and broken devices below VMD (Inochi Amaoto) - Select CONFIG_SCREEN_INFO on X86 to fix black screen on boot when SCREEN_INFO not selected (Mario Limonciello) * tag 'pci-v6.18-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: PCI/VGA: Select SCREEN_INFO on X86 PCI: vmd: Override irq_startup()/irq_shutdown() in vmd_init_dev_msi_info() PCI: Revert early bridge resource set up PCI: cadence: Search for MSI Capability with correct ID	2025-10-18 08:35:09 -10:00
Linus Torvalds	ea0bdf2b94	Merge tag 'cxl-fixes-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull Compute Express Link fixes from Dave Jiang: "A small collection of CXL fixes. In addition to some misc fixes for the CXL subsystem, a number of fixes for CXL extended linear cache support are included to make it functional again. - Avoid missing port component registers setup due to dport enumeration failure - Add check for no entries in cxl_feature_info to address accessing invalid pointer. - Use %pa printk format to emit resource_size_t in validate_region_offset() CXL extended linear cache support fixes: - Fix setup of memory resource in cxl_acpi_set_cache_size() - Set range param for region_res_match_cxl_range() as const (addresses a compile warning for match_region_by_range() fix) - Fix match_region_by_range() to use region_res_match_cxl_range() - Subtract to find an hpa_alias0 in cxl_poison events to correct the alias math calculation" * tag 'cxl-fixes-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: cxl/trace: Subtract to find an hpa_alias0 in cxl_poison events cxl/region: Use %pa printk format to emit resource_size_t cxl: Fix match_region_by_range() to use region_res_match_cxl_range() cxl: Set range param for region_res_match_cxl_range() as const cxl/acpi: Fix setup of memory resource in cxl_acpi_set_cache_size() cxl/features: Add check for no entries in cxl_feature_info cxl/port: Avoid missing port component registers setup	2025-10-18 08:22:07 -10:00


1397063 Commits