linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-19 00:01:01 -04:00

Author	SHA1	Message	Date
Kumar Kartikeya Dwivedi	f25777056e	bpf: Enable unaligned accesses for syscall ctx Don't reject usage of fixed unaligned offsets for syscall ctx. Tests will be added in later commits. Unaligned offsets already work for variable offsets. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260406194403.1649608-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-06 15:27:26 -07:00
Kumar Kartikeya Dwivedi	ae5ef001aa	bpf: Support variable offsets for syscall PTR_TO_CTX Allow accessing PTR_TO_CTX with variable offsets in syscall programs. Fixed offsets are already enabled for all program types that do not convert their ctx accesses, since the changes we made in the commit `de6c7d99f8` ("bpf: Relax fixed offset check for PTR_TO_CTX"). Note that we also lift the restriction on passing syscall context into helpers, which was not permitted before, and passing modified syscall context into kfuncs. The structure of check_mem_access can be mostly shared and preserved, but we must use check_mem_region_access to correctly verify access with variable offsets. The check made in check_helper_mem_access is hardened to only allow PTR_TO_CTX for syscall programs to be passed in as helper memory. This was the original intention of the existing code anyway, and it makes little sense for other program types' context to be utilized as a memory buffer. In case a convincing example presents itself in the future, this check can be relaxed further. We also no longer use the last-byte access to simulate helper memory access, but instead go through check_mem_region_access. Since this no longer updates our max_ctx_offset, we must do so manually, to keep track of the maximum offset at which the program ctx may be accessed. Take care to ensure that when arg_type is ARG_PTR_TO_CTX, we do not relax any fixed or variable offset constraints around PTR_TO_CTX even in syscall programs, and require them to be passed unmodified. There are several reasons why this is necessary. First, if we pass a modified ctx, then the global subprog's accesses will not update the max_ctx_offset to its true maximum offset, and can lead to out of bounds accesses. Second, tail called program (or extension program replacing global subprog) where their max_ctx_offset exceeds the program they are being called from can also cause issues. For the latter, unmodified PTR_TO_CTX is the first requirement for the fix, the second is ensuring max_ctx_offset >= the program they are being called from, which has to be a separate change not made in this commit. All in all, we can hint using arg_type when we expect ARG_PTR_TO_CTX and make our relaxation around offsets conditional on it. Drop coverage of syscall tests from verifier_ctx.c temporarily for negative cases until they are updated in subsequent commits. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260406194403.1649608-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-06 15:27:26 -07:00
MingTao Huang	a1aa9ef47c	bpf: Fix stale offload->prog pointer after constant blinding When a dev-bound-only BPF program (BPF_F_XDP_DEV_BOUND_ONLY) undergoes JIT compilation with constant blinding enabled (bpf_jit_harden >= 2), bpf_jit_blind_constants() clones the program. The original prog is then freed in bpf_jit_prog_release_other(), which updates aux->prog to point to the surviving clone, but fails to update offload->prog. This leaves offload->prog pointing to the freed original program. When the network namespace is subsequently destroyed, cleanup_net() triggers bpf_dev_bound_netdev_unregister(), which iterates ondev->progs and calls __bpf_prog_offload_destroy(offload->prog). Accessing the freed prog causes a page fault: BUG: unable to handle page fault for address: ffffc900085f1038 Workqueue: netns cleanup_net RIP: 0010:__bpf_prog_offload_destroy+0xc/0x80 Call Trace: __bpf_offload_dev_netdev_unregister+0x257/0x350 bpf_dev_bound_netdev_unregister+0x4a/0x90 unregister_netdevice_many_notify+0x2a2/0x660 ... cleanup_net+0x21a/0x320 The test sequence that triggers this reliably is: 1. Set net.core.bpf_jit_harden=2 (echo 2 > /proc/sys/net/core/bpf_jit_harden) 2. Run xdp_metadata selftest, which creates a dev-bound-only XDP program on a veth inside a netns (./test_progs -t xdp_metadata) 3. cleanup_net -> page fault in __bpf_prog_offload_destroy Dev-bound-only programs are unique in that they have an offload structure but go through the normal JIT path instead of bpf_prog_offload_compile(). This means they are subject to constant blinding's prog clone-and-replace, while also having offload->prog that must stay in sync. Fix this by updating offload->prog in bpf_jit_prog_release_other(), alongside the existing aux->prog update. Both are back-pointers to the prog that must be kept in sync when the prog is replaced. Fixes: `2b3486bc2d` ("bpf: Introduce device-bound XDP programs") Signed-off-by: MingTao Huang <mintaohuang@tencent.com> Link: https://lore.kernel.org/r/tencent_BCF692F45859CCE6C22B7B0B64827947D406@qq.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-05 18:48:09 -07:00
Weiming Shi	5828b9e5b2	bpf: fix end-of-list detection in cgroup_storage_get_next_key() list_next_entry() never returns NULL -- when the current element is the last entry it wraps to the list head via container_of(). The subsequent NULL check is therefore dead code and get_next_key() never returns -ENOENT for the last element, instead reading storage->key from a bogus pointer that aliases internal map fields and copying the result to userspace. Replace it with list_entry_is_head() so the function correctly returns -ENOENT when there are no more entries. Fixes: `de9cbbaadb` ("bpf: introduce cgroup storage maps") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com> Acked-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/20260403132951.43533-2-bestswngs@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-05 18:45:05 -07:00
Mykyta Yatsenko	07738bc566	bpf: Use copy_map_value_locked() in alloc_htab_elem() for BPF_F_LOCK When a BPF_F_LOCK update races with a concurrent delete, the freed element can be immediately recycled by alloc_htab_elem(). The fast path in htab_map_update_elem() performs a lockless lookup and then calls copy_map_value_locked() under the element's spin_lock. If alloc_htab_elem() recycles the same memory, it overwrites the value with plain copy_map_value(), without taking the spin_lock, causing torn writes. Use copy_map_value_locked() when BPF_F_LOCK is set so the new element's value is written under the embedded spin_lock, serializing against any stale lock holders. Fixes: `96049f3afd` ("bpf: introduce BPF_F_LOCK flag") Reported-by: Aaron Esau <aaron1esau@gmail.com> Closes: https://lore.kernel.org/all/CADucPGRvSRpkneb94dPP08YkOHgNgBnskTK6myUag_Mkjimihg@mail.gmail.com/ Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260401-bpf_map_torn_writes-v1-1-782d071c55e7@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-05 18:37:32 -07:00
Alexei Starovoitov	1a1cadbd5d	bpf: Add helper and kfunc stack access size resolution The static stack liveness analysis needs to know how many bytes a helper or kfunc accesses through a stack pointer argument, so it can precisely mark the affected stack slots as stack 'def' or 'use'. Add bpf_helper_stack_access_bytes() and bpf_kfunc_stack_access_bytes() which resolve the access size for a given call argument. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-7-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:44 -07:00
Alexei Starovoitov	19dbb13474	bpf: Move verifier helpers to header Move several helpers to header as preparation for the subsequent stack liveness patches. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-6-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:41 -07:00
Alexei Starovoitov	f1606dd0ac	bpf: Add bpf_compute_const_regs() and bpf_prune_dead_branches() passes Add two passes before the main verifier pass: bpf_compute_const_regs() is a forward dataflow analysis that tracks register values in R0-R9 across the program using fixed-point iteration in reverse postorder. Each register is tracked with a six-state lattice: UNVISITED -> CONST(val) / MAP_PTR(map_index) / MAP_VALUE(map_index, offset) / SUBPROG(num) -> UNKNOWN At merge points, if two paths produce the same state and value for a register, it stays; otherwise it becomes UNKNOWN. The analysis handles: - MOV, ADD, SUB, AND with immediate or register operands - LD_IMM64 for plain constants, map FDs, map values, and subprogs - LDX from read-only maps: constant-folds the load by reading the map value directly via bpf_map_direct_read() Results that fit in 32 bits are stored per-instruction in insn_aux_data and bitmasks. bpf_prune_dead_branches() uses the computed constants to evaluate conditional branches. When both operands of a conditional jump are known constants, the branch outcome is determined statically and the instruction is rewritten to an unconditional jump. The CFG postorder is then recomputed to reflect new control flow. This eliminates dead edges so that subsequent liveness analysis doesn't propagate through dead code. Also add runtime sanity check to validate that precomputed constants match the verifier's tracked state. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-5-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:36 -07:00
Alexei Starovoitov	e6898ec751	bpf: Sort subprogs in topological order after check_cfg() Add a pass that sorts subprogs in topological order so that iterating subprog_topo_order[] walks leaf subprogs first, then their callers. This is computed as a DFS post-order traversal of the CFG. The pass runs after check_cfg() to ensure the CFG has been validated before traversing and after postorder has been computed to avoid walking dead code. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-3-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:30 -07:00
Alexei Starovoitov	503d21ef8e	bpf: Do register range validation early Instead of checking src/dst range multiple times during the main verifier pass do them once. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-2-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:26 -07:00
Alexei Starovoitov	891a05ccba	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf 7.0-rc6+ Cross-merge BPF and other fixes after downstream PR. Minor conflict in kernel/bpf/verifier.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:14:13 -07:00
Linus Torvalds	7b9e74c5a4	Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix register equivalence for pointers to packet (Alexei Starovoitov) - Fix incorrect pruning due to atomic fetch precision tracking (Daniel Borkmann) - Fix grace period wait for bpf_link-ed tracepoints (Kumar Kartikeya Dwivedi) - Fix use-after-free of sockmap's sk->sk_socket (Kuniyuki Iwashima) - Reject direct access to nullable PTR_TO_BUF pointers (Qi Tang) - Reject sleepable kprobe_multi programs at attach time (Varun R Mallya) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Add more precision tracking tests for atomics bpf: Fix incorrect pruning due to atomic fetch precision tracking bpf: Reject sleepable kprobe_multi programs at attach time bpf: reject direct access to nullable PTR_TO_BUF pointers bpf: sockmap: Fix use-after-free of sk->sk_socket in sk_psock_verdict_data_ready(). bpf: Fix grace period wait for tracepoint bpf_link bpf: Fix regsafe() for pointers to packet	2026-04-02 18:59:56 -07:00
Harishankar Vishwanathan	b254c6d816	bpf: Simulate branches to prune based on range violations This patch fixes the invariant violations that can happen after we refine ranges & tnum based on an incorrectly-detected branch condition. For example, the branch is always true, but we miss it in is_branch_taken; we then refine based on the branch being false and end up with incoherent ranges (e.g. umax < umin). To avoid this, we can simulate the refinement on both branches. More specifically, this patch simulates both branches taken using regs_refine_cond_op and reg_bounds_sync. If the resulting register states are ill-formed on one of the branches, is_branch_taken can mark that branch as "never taken". On a more formal note, we can deduce a branch is not taken when regs_refine_cond_op or reg_bounds_sync returns an ill-formed state because the branch operators are sound (verified with Agni [1]). Soundness means that the verifier is guaranteed to produce sound outputs on the taken branches. On the non-taken branch (explored because of imprecision in the bounds), the verifier is free to produce any output. We use ill-formedness as a signal that the branch is dead and prune that branch. This patch moves the refinement logic for both branches from reg_set_min_max to their own function, simulate_both_branches_taken, which is called from is_scalar_branch_taken. As a result, reg_set_min_max now only runs sanity checks and has been renamed to reg_bounds_sanity_check_branches to reflect that. We have had five patches fixing specific cases of invariant violations in the past, all added with selftests: - commit `fbc7aef517` ("bpf: Fix u32/s32 bounds when ranges cross min/max boundary") - commit `efc11a6678` ("bpf: Improve bounds when tnum has a single possible value") - commit `f41345f47f` ("bpf: Use tnums for JEQ/JNE is_branch_taken logic") - commit `00bf8d0c6c` ("bpf: Improve bounds when s64 crosses sign boundary") - commit `6279846b9b` ("bpf: Forget ranges when refining tnum after JSET") To confirm that this patch addresses all invariant violations, we have also reverted those five commits and verified that their related selftests don't cause any invariant violation warnings anymore. Those selftests still fail but only because of misdetected branches or less-precise bounds than expected. This demonstrates that the current patch is enough to avoid the invariant violation warning AND that the previous five patches are still useful to improve branch detection. In addition to the selftests, this change was also tested with the Cilium complexity test suite: all programs were successfully loaded and it didn't change the number of processed instructions. Link: https://github.com/bpfverif/agni [1] Reported-by: syzbot+c950cc277150935cc0b5@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c950cc277150935cc0b5 Co-developed-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/a166b54a3cbbbdbcdf8a87f53045f1097176218b.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 18:23:25 -07:00
Harishankar Vishwanathan	a2a14e874b	bpf: Exit early if reg_bounds_sync gets invalid inputs In the subsequent commit, to prune dead branches we will rely on detecting ill-formed ranges using range_bounds_violations() (e.g., umin > umax) after refining register bounds using regs_refine_cond_op(). However, reg_bounds_sync() can sometimes "repair" ill-formed bounds, potentially masking a violation that was produced by regs_refine_cond_op(). This commit modifies reg_bounds_sync() to exit early if an invariant violation is already present in the input. This ensures ill-formed reg_states remain ill-formed after reg_bounds_sync(), allowing simulate_both_branches_taken() to correctly identify dead branches with a single check to range_bounds_violation(). Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/73127d628841c59cb7423d6bdcd204bf90bcdc80.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 18:23:25 -07:00
Paul Chaignon	ec1d77cb0e	bpf: Use bpf_verifier_env buffers for reg_set_min_max In a subsequent patch, the regs_refine_cond_op and reg_bounds_sync functions will be called in is_branch_taken instead of reg_set_min_max, to simulate each branch's outcome. Since they will run before we branch out, these two functions will need to work on temporary registers for the two branches. This refactoring patch prepares for that change, by introducing the temporary registers on bpf_verifier_env and using them in reg_set_min_max. This change also allows us to save one fake_reg slot as we don't need to allocate an additional temporary buffer in case of a BPF_K condition. Finally, you may notice that this patch removes the check for "false_reg1 == false_reg2" in reg_set_min_max. That check was introduced in commit `d43ad9da80` ("bpf: Skip bounds adjustment for conditional jumps on same scalar register") to avoid an invariant violation. Given that "env->false_reg1 == env->false_reg2" doesn't make sense and invariant violations are addressed in a subsequent commit, this patch just removes the check. Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Co-developed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/260b0270052944a420e1c56e6a92df4d43cadf03.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 18:23:25 -07:00
Harishankar Vishwanathan	a1311b94ef	bpf: Refactor reg_bounds_sanity_check This commit refactors reg_bounds_sanity_check to factor out the logic that performs the sanity check from the logic that does the reporting. Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/198ec3e69343e2c46dc9cbe2b1bc9be9ae2df5bd.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 18:23:24 -07:00
Daniel Borkmann	179ee84a89	bpf: Fix incorrect pruning due to atomic fetch precision tracking When backtrack_insn encounters a BPF_STX instruction with BPF_ATOMIC and BPF_FETCH, the src register (or r0 for BPF_CMPXCHG) also acts as a destination, thus receiving the old value from the memory location. The current backtracking logic does not account for this. It treats atomic fetch operations the same as regular stores where the src register is only an input. This leads the backtrack_insn to fail to propagate precision to the stack location, which is then not marked as precise! Later, the verifier's path pruning can incorrectly consider two states equivalent when they differ in terms of stack state. Meaning, two branches can be treated as equivalent and thus get pruned when they should not be seen as such. Fix it as follows: Extend the BPF_LDX handling in backtrack_insn to also cover atomic fetch operations via is_atomic_fetch_insn() helper. When the fetch dst register is being tracked for precision, clear it, and propagate precision over to the stack slot. For non-stack memory, the precision walk stops at the atomic instruction, same as regular BPF_LDX. This covers all fetch variants. Before: 0: (b7) r1 = 8 ; R1=8 1: (7b) (u64 )(r10 -8) = r1 ; R1=8 R10=fp0 fp-8=8 2: (b7) r2 = 0 ; R2=0 3: (db) r2 = atomic64_fetch_add((u64 )(r10 -8), r2) ; R2=8 R10=fp0 fp-8=mmmmmmmm 4: (bf) r3 = r10 ; R3=fp0 R10=fp0 5: (0f) r3 += r2 mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r2 stack= before 4: (bf) r3 = r10 mark_precise: frame0: regs=r2 stack= before 3: (db) r2 = atomic64_fetch_add((u64 )(r10 -8), r2) mark_precise: frame0: regs=r2 stack= before 2: (b7) r2 = 0 6: R2=8 R3=fp8 6: (b7) r0 = 0 ; R0=0 7: (95) exit After: 0: (b7) r1 = 8 ; R1=8 1: (7b) (u64 )(r10 -8) = r1 ; R1=8 R10=fp0 fp-8=8 2: (b7) r2 = 0 ; R2=0 3: (db) r2 = atomic64_fetch_add((u64 )(r10 -8), r2) ; R2=8 R10=fp0 fp-8=mmmmmmmm 4: (bf) r3 = r10 ; R3=fp0 R10=fp0 5: (0f) r3 += r2 mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r2 stack= before 4: (bf) r3 = r10 mark_precise: frame0: regs=r2 stack= before 3: (db) r2 = atomic64_fetch_add((u64 )(r10 -8), r2) mark_precise: frame0: regs= stack=-8 before 2: (b7) r2 = 0 mark_precise: frame0: regs= stack=-8 before 1: (7b) (u64 )(r10 -8) = r1 mark_precise: frame0: regs=r1 stack= before 0: (b7) r1 = 8 6: R2=8 R3=fp8 6: (b7) r0 = 0 ; R0=0 7: (95) exit Fixes: `5ffa25502b` ("bpf: Add instructions for atomic_[cmp]xchg") Fixes: `5ca419f286` ("bpf: Add BPF_FETCH field / create atomic_fetch_add instruction") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260331222020.401848-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:57:59 -07:00
Varun R Mallya	eb7024bfcc	bpf: Reject sleepable kprobe_multi programs at attach time kprobe.multi programs run in atomic/RCU context and cannot sleep. However, bpf_kprobe_multi_link_attach() did not validate whether the program being attached had the sleepable flag set, allowing sleepable helpers such as bpf_copy_from_user() to be invoked from a non-sleepable context. This causes a "sleeping function called from invalid context" splat: BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo preempt_count: 1, expected: 0 RCU nest depth: 2, expected: 0 Fix this by rejecting sleepable programs early in bpf_kprobe_multi_link_attach(), before any further processing. Fixes: `0dcac27254` ("bpf: Add multi kprobe link") Signed-off-by: Varun R Mallya <varunrmallya@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Leon Hwang <leon.hwang@linux.dev> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260401191126.440683-1-varunrmallya@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:48:46 -07:00
Qi Tang	b0db1accbc	bpf: reject direct access to nullable PTR_TO_BUF pointers check_mem_access() matches PTR_TO_BUF via base_type() which strips PTR_MAYBE_NULL, allowing direct dereference without a null check. Map iterator ctx->key and ctx->value are PTR_TO_BUF \| PTR_MAYBE_NULL. On stop callbacks these are NULL, causing a kernel NULL dereference. Add a type_may_be_null() guard to the PTR_TO_BUF branch, matching the existing PTR_TO_BTF_ID pattern. Fixes: `20b2aff4bc` ("bpf: Introduce MEM_RDONLY flag") Signed-off-by: Qi Tang <tpluszz77@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260402092923.38357-2-tpluszz77@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:47:13 -07:00
Mykyta Yatsenko	cc878b4144	bpf: Migrate dynptr file to kmalloc_nolock Replace bpf_mem_alloc/bpf_mem_free with kmalloc_nolock/kfree_nolock for bpf_dynptr_file_impl, continuing the migration away from bpf_mem_alloc now that kmalloc can be used from NMI context. freader_cleanup() runs before kfree_nolock() while the dynptr still holds exclusive access, so plain kfree_nolock() is safe — no concurrent readers can access the object. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260330-kmalloc_special-v2-2-c90403f92ff0@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:31:42 -07:00
Mykyta Yatsenko	90f51ebff2	bpf: Migrate bpf_task_work to kmalloc_nolock Replace bpf_mem_alloc/bpf_mem_free with kmalloc_nolock/kfree_rcu for bpf_task_work_ctx. Replace guard(rcu_tasks_trace)() with guard(rcu)() in bpf_task_work_irq(). The function only accesses ctx struct members (not map values), so tasks trace protection is not needed - regular RCU is sufficient since ctx is freed via kfree_rcu. The guard in bpf_task_work_callback() remains as tasks trace since it accesses map values from process context. Sleepable BPF programs hold rcu_read_lock_trace but not regular rcu_read_lock. Since kfree_rcu waits for a regular RCU grace period, the ctx memory can be freed while a sleepable program is still running. Add scoped_guard(rcu) around the pointer read and refcount tryget in bpf_task_work_acquire_ctx to close this race window. Since kfree_rcu uses call_rcu internally which is not safe from NMI context, defer destruction via irq_work when irqs are disabled. For the lost-cmpxchg path the ctx was never published, so kfree_nolock is safe. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260330-kmalloc_special-v2-1-c90403f92ff0@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:31:42 -07:00
Leon Hwang	611fe4b79a	bpf: Fix abuse of kprobe_write_ctx via freplace uprobe programs are allowed to modify struct pt_regs. Since the actual program type of uprobe is KPROBE, it can be abused to modify struct pt_regs via kprobe+freplace when the kprobe attaches to kernel functions. For example, SEC("?kprobe") int kprobe(struct pt_regs regs) { return 0; } SEC("?freplace") int freplace_kprobe(struct pt_regs regs) { regs->di = 0; return 0; } freplace_kprobe prog will attach to kprobe prog. kprobe prog will attach to a kernel function. Without this patch, when the kernel function runs, its first arg will always be set as 0 via the freplace_kprobe prog. To fix the abuse of kprobe_write_ctx=true via kprobe+freplace, disallow attaching freplace programs on kprobe programs with different kprobe_write_ctx values. Fixes: `7384893d97` ("bpf: Allow uprobe program to change context registers") Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260331145353.87606-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:29:49 -07:00
Kumar Kartikeya Dwivedi	c76fef7dcd	bpf: Fix grace period wait for tracepoint bpf_link Recently, tracepoints were switched from using disabled preemption (which acts as RCU read section) to SRCU-fast when they are not faultable. This means that to do a proper grace period wait for programs running in such tracepoints, we must use SRCU's grace period wait. This is only for non-faultable tracepoints, faultable ones continue using RCU Tasks Trace. However, bpf_link_free() currently does call_rcu() for all cases when the link is non-sleepable (hence, for tracepoints, non-faultable). Fix this by doing a call_srcu() grace period wait. As far RCU Tasks Trace gp -> RCU gp chaining is concerned, it is deemed unnecessary for tracepoint programs. The link and program are either accessed under RCU Tasks Trace protection, or SRCU-fast protection now. The earlier logic of chaining both RCU Tasks Trace and RCU gp waits was to generalize the logic, even if it conceded an extra RCU gp wait, however that is unnecessary for tracepoints even before this change. In practice no cost was paid since rcu_trace_implies_rcu_gp() was always true. Hence we need not chaining any RCU gp after the SRCU gp. For instance, in the non-faultable raw tracepoint, the RCU read section of the program in __bpf_trace_run() is enclosed in the SRCU gp, likewise for faultable raw tracepoint, the program is under the RCU Tasks Trace protection. Hence, the outermost scope can be waited upon to ensure correctness. Also, sleepable programs cannot be attached to non-faultable tracepoints, so whenever program or link is sleepable, only RCU Tasks Trace protection is being used for the link and prog. Fixes: `a46023d561` ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast") Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com> Reviewed-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20260331211021.1632902-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-31 16:01:13 -07:00
Alexei Starovoitov	a8502a79e8	bpf: Fix regsafe() for pointers to packet In case rold->reg->range == BEYOND_PKT_END && rcur->reg->range == N regsafe() may return true which may lead to current state with valid packet range not being explored. Fix the bug. Fixes: `6d94e741a8` ("bpf: Support for pointers beyond pkt_end.") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Amery Hung <ameryhung@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20260331204228.26726-1-alexei.starovoitov@gmail.com	2026-03-31 15:18:10 -07:00
Linus Torvalds	9147566d80	Merge tag 'sched_ext-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix SCX_KICK_WAIT deadlock where multiple CPUs waiting for each other in hardirq context form a cycle. Move the wait to a balance callback which can drop the rq lock and process IPIs. - Fix inconsistent NUMA node lookup in scx_select_cpu_dfl() where the waker_node used cpu_to_node() while prev_cpu used scx_cpu_node_if_enabled(), leading to undefined behavior when per-node idle tracking is disabled. * tag 'sched_ext-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: selftests/sched_ext: Add cyclic SCX_KICK_WAIT stress test sched_ext: Fix SCX_KICK_WAIT deadlock by deferring wait to balance callback sched_ext: Fix inconsistent NUMA node lookup in scx_select_cpu_dfl()	2026-03-31 14:23:12 -07:00
Linus Torvalds	0958d657b4	Merge tag 'wq-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: - Fix false positive stall reports on weakly ordered architectures where the lockless worklist/timestamp check in the watchdog can observe stale values due to memory reordering. Recheck under pool->lock to confirm. * tag 'wq-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Better describe stall check workqueue: Fix false positive stall reports	2026-03-31 14:20:39 -07:00
Linus Torvalds	53d85a2056	Merge tag 'cgroup-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix cgroup rmdir racing with dying tasks. Deferred task cgroup unlink introduced a window where cgroup.procs is empty but the cgroup is still populated, causing rmdir to fail with -EBUSY and selftest failures. Make rmdir wait for dying tasks to fully leave and fix selftests to not depend on synchronous populated updates. - Fix cpuset v1 task migration failure from empty cpusets under strict security policies. When CPU hotplug removes the last CPU from a v1 cpuset, tasks must be migrated to an ancestor without a security_task_setscheduler() check that would block the migration. * tag 'cgroup-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: Skip security check for hotplug induced v1 task migration cgroup/cpuset: Simplify setsched decision check in task iteration loop of cpuset_can_attach() cgroup: Fix cgroup_drain_dying() testing the wrong condition selftests/cgroup: Don't require synchronous populated update on task exit cgroup: Wait for dying tasks to leave on rmdir	2026-03-31 13:59:51 -07:00
Waiman Long	089f3fcd69	cgroup/cpuset: Skip security check for hotplug induced v1 task migration When a CPU hot removal causes a v1 cpuset to lose all its CPUs, the cpuset hotplug handler will schedule a work function to migrate tasks in that cpuset with no CPU to its ancestor to enable those tasks to continue running. If a strict security policy is in place, however, the task migration may fail when security_task_setscheduler() call in cpuset_can_attach() returns a -EACCES error. That will mean that those tasks will have no CPU to run on. The system administrators will have to explicitly intervene to either add CPUs to that cpuset or move the tasks elsewhere if they are aware of it. This problem was found by a reported test failure in the LTP's cpuset_hotplug_test.sh. Fix this problem by treating this special case as an exception to skip the setsched security check in cpuset_can_attach() when a v1 cpuset with tasks have no CPU left. With that patch applied, the cpuset_hotplug_test.sh test can be run successfully without failure. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-31 09:14:13 -10:00
Waiman Long	bbe5ab8191	cgroup/cpuset: Simplify setsched decision check in task iteration loop of cpuset_can_attach() Centralize the check required to run security_task_setscheduler() in the task iteration loop of cpuset_can_attach() outside of the loop as it has no dependency on the characteristics of the tasks themselves. There is no functional change. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-31 09:14:13 -10:00
Jiri Olsa	9eccdd38fb	bpf: Fix block device hooks names Use proper names for block device hooks names. Fixes: `46df585fcf` ("bpf: classify block device hooks appropriately") Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Closes: https://lore.kernel.org/bpf/acrVKUy_EPiFFmV9@krava/T/#m7c7906a1ff4029e29185aec3266dbf5c8996dbf7 Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20260330210344.3073712-1-jolsa@kernel.org	2026-03-31 11:11:42 +02:00
Tejun Heo	415cb193bb	sched_ext: Fix SCX_KICK_WAIT deadlock by deferring wait to balance callback SCX_KICK_WAIT busy-waits in kick_cpus_irq_workfn() using smp_cond_load_acquire() until the target CPU's kick_sync advances. Because the irq_work runs in hardirq context, the waiting CPU cannot reschedule and its own kick_sync never advances. If multiple CPUs form a wait cycle, all CPUs deadlock. Replace the busy-wait in kick_cpus_irq_workfn() with resched_curr() to force the CPU through do_pick_task_scx(), which queues a balance callback to perform the wait. The balance callback drops the rq lock and enables IRQs following the sched_core_balance() pattern, so the CPU can process IPIs while waiting. The local CPU's kick_sync is advanced on entry to do_pick_task_scx() and continuously during the wait, ensuring any CPU that starts waiting for us sees the advancement and cannot form cyclic dependencies. Fixes: `90e55164da` ("sched_ext: Implement SCX_KICK_WAIT") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Christian Loehle <christian.loehle@arm.com> Link: https://lore.kernel.org/r/20260316100249.1651641-1-christian.loehle@arm.com Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Christian Loehle <christian.loehle@arm.com>	2026-03-30 08:37:27 -10:00
Linus Torvalds	47e3f23f0e	Merge tag 'timers-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix an argument order bug in the alarm timer forwarding logic, which may cause missed expirations or incorrect overrun accounting" * tag 'timers-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: alarmtimer: Fix argument order in alarm_timer_forward()	2026-03-29 10:02:38 -07:00
Linus Torvalds	f087b0bad4	Merge tag 'locking-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull futex fixes from Ingo Molnar: - Tighten up the sys_futex_requeue() ABI a bit, to disallow dissimilar futex flags and potential UaF access (Peter Zijlstra) - Fix UaF between futex_key_to_node_opt() and vma_replace_policy() (Hao-Yu Yang) - Clear stale exiting pointer in futex_lock_pi() retry path, which triggered a warning (and potential misbehavior) in stress-testing (Davidlohr Bueso) * tag 'locking-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Clear stale exiting pointer in futex_lock_pi() retry path futex: Fix UaF between futex_key_to_node_opt() and vma_replace_policy() futex: Require sys_futex_requeue() to have identical flags	2026-03-29 09:59:46 -07:00
Ihor Solodrai	d457072576	bpf: Support struct btf_struct_meta via KF_IMPLICIT_ARGS The following kfuncs currently accept void meta__ign argument: bpf_obj_new_impl * bpf_obj_drop_impl * bpf_percpu_obj_new_impl * bpf_percpu_obj_drop_impl * bpf_refcount_acquire_impl * bpf_list_push_back_impl * bpf_list_push_front_impl * bpf_rbtree_add_impl The __ign suffix is an indicator for the verifier to skip the argument in check_kfunc_args(). Then, in fixup_kfunc_call() the verifier may set the value of this argument to struct btf_struct_meta * kptr_struct_meta from insn_aux_data. BPF programs must pass a dummy NULL value when calling these kfuncs. Additionally, the list and rbtree _impl kfuncs also accept an implicit u64 argument, which doesn't require __ign suffix because it's a scalar, and BPF programs explicitly pass 0. Add new kfuncs with KF_IMPLICIT_ARGS [1], that correspond to each _impl kfunc accepting meta__ign. The existing _impl kfuncs remain unchanged for backwards compatibility. To support this, add "btf_struct_meta" to the list of recognized implicit argument types in resolve_btfids. Implement is_kfunc_arg_implicit() in the verifier, that determines implicit args by inspecting both a non-_impl BTF prototype of the kfunc. Update the special_kfunc_list in the verifier and relevant checks to support both the old _impl and the new KF_IMPLICIT_ARGS variants of btf_struct_meta users. [1] https://lore.kernel.org/bpf/20260120222638.3976562-1-ihor.solodrai@linux.dev/ Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260327203241.3365046-1-ihor.solodrai@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-29 09:56:06 -07:00
Linus Torvalds	cbfffcca2b	Merge tag 'trace-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix potential deadlock in osnoise and hotplug The interface_lock can be called by a osnoise thread and the CPU shutdown logic of osnoise can wait for this thread to finish. But cpus_read_lock() can also be taken while holding the interface_lock. This produces a circular lock dependency and can cause a deadlock. Swap the ordering of cpus_read_lock() and the interface_lock to have interface_lock taken within the cpus_read_lock() context to prevent this circular dependency. - Fix freeing of event triggers in early boot up If the same trigger is added on the kernel command line, the second one will fail to be applied and the trigger created will be freed. This calls into the deferred logic and creates a kernel thread to do the freeing. But the command line logic is called before kernel threads can be created and this leads to a NULL pointer dereference. Delay freeing event triggers until late init. * tag 'trace-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Drain deferred trigger frees if kthread creation fails tracing: Fix potential deadlock in cpu hotplug with osnoise	2026-03-28 09:59:09 -07:00
Davidlohr Bueso	210d36d892	futex: Clear stale exiting pointer in futex_lock_pi() retry path Fuzzying/stressing futexes triggered: WARNING: kernel/futex/core.c:825 at wait_for_owner_exiting+0x7a/0x80, CPU#11: futex_lock_pi_s/524 When futex_lock_pi_atomic() sees the owner is exiting, it returns -EBUSY and stores a refcounted task pointer in 'exiting'. After wait_for_owner_exiting() consumes that reference, the local pointer is never reset to nil. Upon a retry, if futex_lock_pi_atomic() returns a different error, the bogus pointer is passed to wait_for_owner_exiting(). CPU0 CPU1 CPU2 futex_lock_pi(uaddr) // acquires the PI futex exit() futex_cleanup_begin() futex_state = EXITING; futex_lock_pi(uaddr) futex_lock_pi_atomic() attach_to_pi_owner() // observes EXITING exiting = owner; // takes ref return -EBUSY wait_for_owner_exiting(-EBUSY, owner) put_task_struct(); // drops ref // exiting still points to owner goto retry; futex_lock_pi_atomic() lock_pi_update_atomic() cmpxchg(uaddr) uaddr ^= WAITERS // whatever // value changed return -EAGAIN; wait_for_owner_exiting(-EAGAIN, exiting) // stale WARN_ON_ONCE(exiting) Fix this by resetting upon retry, essentially aligning it with requeue_pi. Fixes: `3ef240eaff` ("futex: Prevent exit livelock") Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260326001759.4129680-1-dave@stgolabs.net	2026-03-28 13:54:02 +01:00
Wesley Atwell	250ab25391	tracing: Drain deferred trigger frees if kthread creation fails Boot-time trigger registration can fail before the trigger-data cleanup kthread exists. Deferring those frees until late init is fine, but the post-boot fallback must still drain the deferred list if kthread creation never succeeds. Otherwise, boot-deferred nodes can accumulate on trigger_data_free_list, later frees fall back to synchronously freeing only the current object, and the older queued entries are leaked forever. To trigger this, add the following to the kernel command line: trace_event=sched_switch trace_trigger=sched_switch.traceon,sched_switch.traceon The second traceon trigger will fail and be freed. This triggers a NULL pointer dereference and crashes the kernel. Keep the deferred boot-time behavior, but when kthread creation fails, drain the whole queued list synchronously. Do the same in the late-init drain path so queued entries are not stranded there either. Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260324221326.1395799-3-atwellwea@gmail.com Fixes: `61d445af0a` ("tracing: Add bulk garbage collection of freeing event_trigger_data") Signed-off-by: Wesley Atwell <atwellwea@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-03-28 08:32:44 -04:00
Linus Torvalds	0b8bf3b64e	Merge tag 'sysctl-7.00-fixes-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl Pull sysctl fix from Joel Granados: "Fix uninitialized variable error when writing to a sysctl bitmap Removed the possibility of returning an unjustified -EINVAL when writing to a sysctl bitmap" * tag 'sysctl-7.00-fixes-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: sysctl: fix uninitialized variable in proc_do_large_bitmap	2026-03-27 13:04:34 -07:00
Luo Haiyang	1f98857322	tracing: Fix potential deadlock in cpu hotplug with osnoise The following sequence may leads deadlock in cpu hotplug: task1 task2 task3 ----- ----- ----- mutex_lock(&interface_lock) [CPU GOING OFFLINE] cpus_write_lock(); osnoise_cpu_die(); kthread_stop(task3); wait_for_completion(); osnoise_sleep(); mutex_lock(&interface_lock); cpus_read_lock(); [DEAD LOCK] Fix by swap the order of cpus_read_lock() and mutex_lock(&interface_lock). Cc: stable@vger.kernel.org Cc: <mathieu.desnoyers@efficios.com> Cc: <zhang.run@zte.com.cn> Cc: <yang.tao172@zte.com.cn> Cc: <ran.xiaokai@zte.com.cn> Fixes: `bce29ac9ce` ("trace: Add osnoise tracer") Link: https://patch.msgid.link/20260326141953414bVSj33dAYktqp9Oiyizq8@zte.com.cn Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Luo Haiyang <luo.haiyang@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-03-27 15:18:06 -04:00
Christian Brauner	46df585fcf	bpf: classify block device hooks appropriately A bunch of new hooks for managing block devices were added a while ago but they weren't actually appropriately classified. * bpf_lsm_bdev_alloc() is called when the inode for the block device is allocated. This happens from a sleepable context so mark the function as sleepable. When this function is called the memory for the block device storage embedded into the inode is zeroed. That block device cannot be meaningfully reference or interacted with at this point. So mark it as untrusted for now. * bpf_lsm_bdev_free() is called when the inode for the block device is freed. A bunch of memory associated with the block device has already been freed and there's dangling pointers in there. So mark it as untrusted. It cannot be meaningfully referenced or interacted with anymore. It is also called from sb->s_op->free_inode:: which means it runs in rcu context (most of the times). So leave it as non-sleepable. * bpf_lsm_bdev_setintegrity() is called when a dm-verity device is instantiated (glossing over details for simplicity of the commit message). The block device is very much alive so it remains a trusted hook. It's also called with device mapper's suspend lock held and so the hook is able to sleep so mark it sleepable. Signed-off-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20260326-work-bpf-bdev-v2-1-5e3c58963987@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-27 09:05:13 -07:00
Alan Maguire	626e88c070	btf: Support kernel parsing of BTF with layout info Validate layout if present, but because the kernel must be strict in what it accepts, reject BTF with unsupported kinds, even if they are in the layout information. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260326145444.2076244-8-alan.maguire@oracle.com	2026-03-26 13:53:56 -07:00
Linus Torvalds	d813f42193	Merge tag 'pm-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix two cpufreq issues, one in the core and one in the conservative governor, and two issues related to system sleep: - Restore the cpufreq core behavior changed inadvertently during the 6.19 development cycle to call cpufreq_frequency_table_cpuinfo() for cpufreq policies getting re-initialized which ensures that policy->max and policy->cpuinfo_max_freq will be valid going forward (Viresh Kumar) - Adjust the cached requested frequency in the conservative cpufreq governor on policy limits changes to prevent it from becoming stale in some cases (Viresh Kumar) - Prevent pm_restore_gfp_mask() from triggering a WARN_ON() in some code paths in which it is legitimately called without invoking pm_restrict_gfp_mask() previously (Youngjun Park) - Update snapshot_write_finalize() to take trailing zero pages into account properly which prevents user space restore from failing subsequently in some cases (Alberto Garcia)" * tag 'pm-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: sleep: Drop spurious WARN_ON() from pm_restore_gfp_mask() PM: hibernate: Drain trailing zero pages on userspace restore cpufreq: conservative: Reset requested_freq on limits change cpufreq: Don't skip cpufreq_frequency_table_cpuinfo()	2026-03-26 12:42:28 -07:00
Linus Torvalds	dabb83ecf4	Merge tag 'dma-mapping-7.0-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: "A set of fixes for DMA-mapping subsystem, which resolve false- positive warnings from KMSAN and DMA-API debug (Shigeru Yoshida and Leon Romanovsky) as well as a simple build fix (Miguel Ojeda)" * tag 'dma-mapping-7.0-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-mapping: add missing `inline` for `dma_free_attrs` mm/hmm: Indicate that HMM requires DMA coherency RDMA/umem: Tell DMA mapping that UMEM requires coherency iommu/dma: add support for DMA_ATTR_REQUIRE_COHERENT attribute dma-direct: prevent SWIOTLB path when DMA_ATTR_REQUIRE_COHERENT is set dma-mapping: Introduce DMA require coherency attribute dma-mapping: Clarify valid conditions for CPU cache line overlap dma-mapping: handle DMA_ATTR_CPU_CACHE_CLEAN in trace output dma-debug: Allow multiple invocations of overlapping entries dma: swiotlb: add KMSAN annotations to swiotlb_bounce()	2026-03-26 08:22:07 -07:00
Hao-Yu Yang	190a8c48ff	futex: Fix UaF between futex_key_to_node_opt() and vma_replace_policy() During futex_key_to_node_opt() execution, vma->vm_policy is read under speculative mmap lock and RCU. Concurrently, mbind() may call vma_replace_policy() which frees the old mempolicy immediately via kmem_cache_free(). This creates a race where __futex_key_to_node() dereferences a freed mempolicy pointer, causing a use-after-free read of mpol->mode. [ 151.412631] BUG: KASAN: slab-use-after-free in __futex_key_to_node (kernel/futex/core.c:349) [ 151.414046] Read of size 2 at addr ffff888001c49634 by task e/87 [ 151.415969] Call Trace: [ 151.416732] __asan_load2 (mm/kasan/generic.c:271) [ 151.416777] __futex_key_to_node (kernel/futex/core.c:349) [ 151.416822] get_futex_key (kernel/futex/core.c:374 kernel/futex/core.c:386 kernel/futex/core.c:593) Fix by adding rcu to __mpol_put(). Fixes: `c042c50521` ("futex: Implement FUTEX2_MPOL") Reported-by: Hao-Yu Yang <naup96721@gmail.com> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Hao-Yu Yang <naup96721@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Link: https://patch.msgid.link/20260324174418.GB1850007@noisy.programming.kicks-ass.net	2026-03-26 16:13:48 +01:00
Peter Zijlstra	19f94b3905	futex: Require sys_futex_requeue() to have identical flags Nicholas reported that his LLM found it was possible to create a UaF when sys_futex_requeue() is used with different flags. The initial motivation for allowing different flags was the variable sized futex, but since that hasn't been merged (yet), simply mandate the flags are identical, as is the case for the old style sys_futex() requeue operations. Fixes: `0f4b5f9722` ("futex: Add sys_futex_requeue()") Reported-by: Nicholas Carlini <npc@anthropic.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2026-03-26 16:13:48 +01:00
Marc Buerg	f63a9df7e3	sysctl: fix uninitialized variable in proc_do_large_bitmap proc_do_large_bitmap() does not initialize variable c, which is expected to be set to a trailing character by proc_get_long(). However, proc_get_long() only sets c when the input buffer contains a trailing character after the parsed value. If c is not initialized it may happen to contain a '-'. If this is the case proc_do_large_bitmap() expects to be able to parse a second part of the input buffer. If there is no second part an unjustified -EINVAL will be returned. Initialize c to 0 to prevent returning -EINVAL on valid input. Fixes: `9f977fb7ae` ("sysctl: add proc_do_large_bitmap") Signed-off-by: Marc Buerg <buermarc@googlemail.com> Reviewed-by: Joel Granados <joel.granados@kernel.org> Signed-off-by: Joel Granados <joel.granados@kernel.org>	2026-03-26 09:32:19 +01:00
Linus Torvalds	aba9da0905	Merge tag 'rcu-fixes.v7.0-20260325a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux Pull RCU fixes from Boqun Feng: "Fix a regression introduced by commit `c27cea4416` ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast"): BPF contexts can run with preemption disabled or scheduler locks held, so call_srcu() must work in all such contexts. Fix this by converting SRCU's spinlocks to raw spinlocks and avoiding scheduler lock acquisition in call_srcu() by deferring to an irq_work (similar to call_rcu_tasks_generic()), for both tree SRCU and tiny SRCU. Also fix a follow-on lockdep splat caused by srcu_node allocation under the newly introduced raw spinlock by deferring the allocation to grace-period worker context" * tag 'rcu-fixes.v7.0-20260325a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: srcu: Use irq_work to start GP in tiny SRCU rcu: Use an intermediate irq_work to start process_srcu() srcu: Push srcu_node allocation to GP when non-preemptible srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()	2026-03-25 18:14:19 -07:00
Tejun Heo	4c56a8ac68	cgroup: Fix cgroup_drain_dying() testing the wrong condition cgroup_drain_dying() was using cgroup_is_populated() to test whether there are dying tasks to wait for. cgroup_is_populated() tests nr_populated_csets, nr_populated_domain_children and nr_populated_threaded_children, but cgroup_drain_dying() only needs to care about this cgroup's own tasks - whether there are children is cgroup_destroy_locked()'s concern. This caused hangs during shutdown. When systemd tried to rmdir a cgroup that had no direct tasks but had a populated child, cgroup_drain_dying() would enter its wait loop because cgroup_is_populated() was true from nr_populated_domain_children. The task iterator found nothing to wait for, yet the populated state never cleared because it was driven by live tasks in the child cgroup. Fix it by using cgroup_has_tasks() which only tests nr_populated_csets. v3: Fix cgroup_is_populated() -> cgroup_has_tasks() (Sebastian). v2: https://lore.kernel.org/r/20260323200205.1063629-1-tj@kernel.org Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Fixes: `1b164b876c` ("cgroup: Wait for dying tasks to leave on rmdir") Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>	2026-03-25 14:08:04 -10:00
Joel Fernandes	a6fc88b22b	srcu: Use irq_work to start GP in tiny SRCU Tiny SRCU's srcu_gp_start_if_needed() directly calls schedule_work(), which acquires the workqueue pool->lock. This causes a lockdep splat when call_srcu() is called with a scheduler lock held, due to: call_srcu() [holding pi_lock] srcu_gp_start_if_needed() schedule_work() -> pool->lock workqueue_init() / create_worker() [holding pool->lock] wake_up_process() -> try_to_wake_up() -> pi_lock Also add irq_work_sync() to cleanup_srcu_struct() to prevent a use-after-free if a queued irq_work fires after cleanup begins. Tested with rcutorture SRCU-T and no lockdep warnings. [ Thanks to Boqun for similar fix in patch "rcu: Use an intermediate irq_work to start process_srcu()" ] Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun@kernel.org>	2026-03-25 09:00:05 -07:00
Boqun Feng	7c405fb327	rcu: Use an intermediate irq_work to start process_srcu() Since commit `c27cea4416` ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can happen basically everywhere (including where a scheduler lock is held), call_srcu() now needs to avoid acquiring scheduler lock because otherwise it could cause deadlock [1]. Fix this by following what the previous RCU Tasks Trace did: using an irq_work to delay the queuing of the work to start process_srcu(). [boqun: Apply Joel's feedback] [boqun: Apply Andrea's test feedback] Reported-by: Andrea Righi <arighi@nvidia.com> Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ Fixes: commit `c27cea4416` ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] Suggested-by: Zqiang <qiang.zhang@linux.dev> Tested-by: Andrea Righi <arighi@nvidia.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun@kernel.org>	2026-03-25 08:59:59 -07:00

1 2 3 4 5 ...

51141 Commits