linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-01-18 23:29:48 -05:00

Author	SHA1	Message	Date
Alexei Starovoitov	852486b35f	x86/cfi,bpf: Fix bpf_exception_cb() signature As per the earlier patches, BPF sub-programs have bpf_callback_t signature and CFI expects callers to have matching signature. This is violated by bpf_prog_aux::bpf_exception_cb(). [peterz: Changelog] Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/CAADnVQ+Z7UcXXBBhMubhcMM=R-dExk-uHtfOLtoLxQ1XxEpqEA@mail.gmail.com Link: https://lore.kernel.org/r/20231215092707.910319166@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-15 16:25:55 -08:00
Peter Zijlstra	e9d13b9d2f	cfi: Add CFI_NOSEAL() Add a CFI_NOSEAL() helper to mark functions that need to retain their CFI information, despite not otherwise leaking their address. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231215092707.669401084@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-15 16:25:55 -08:00
Peter Zijlstra	2cd3e3772e	x86/cfi,bpf: Fix bpf_struct_ops CFI BPF struct_ops uses __arch_prepare_bpf_trampoline() to write trampolines for indirect function calls. These tramplines much have matching CFI. In order to obtain the correct CFI hash for the various methods, add a matching structure that contains stub functions, the compiler will generate correct CFI which we can pilfer for the trampolines. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231215092707.566977112@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-15 16:25:55 -08:00
Peter Zijlstra	4f9087f166	x86/cfi,bpf: Fix BPF JIT call The current BPF call convention is __nocfi, except when it calls !JIT things, then it calls regular C functions. It so happens that with FineIBT the __nocfi and C calling conventions are incompatible. Specifically __nocfi will call at func+0, while FineIBT will have endbr-poison there, which is not a valid indirect target. Causing #CP. Notably this only triggers on IBT enabled hardware, which is probably why this hasn't been reported (also, most people will have JIT on anyway). Implement proper CFI prologues for the BPF JIT codegen and drop __nocfi for x86. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231215092707.345270396@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-15 16:25:55 -08:00
Peter Zijlstra	4382159696	cfi: Flip headers Normal include order is that linux/foo.h should include asm/foo.h, CFI has it the wrong way around. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lore.kernel.org/r/20231215092707.231038174@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-15 16:25:55 -08:00
Daniel Xu	8f0ec8c681	bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc This commit adds an unstable kfunc helper to access internal xfrm_state associated with an SA. This is intended to be used for the upcoming IPsec pcpu work to assign special pcpu SAs to a particular CPU. In other words: for custom software RSS. That being said, the function that this kfunc wraps is fairly generic and used for a lot of xfrm tasks. I'm sure people will find uses elsewhere over time. This commit also adds a corresponding bpf_xdp_xfrm_state_release() kfunc to release the refcnt acquired by bpf_xdp_get_xfrm_state(). The verifier will require that all acquired xfrm_state's are released. Co-developed-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Antony Antony <antony.antony@secunet.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/a29699c42f5fad456b875c98dd11c6afc3ffb707.1702593901.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-14 17:12:49 -08:00
Randy Dunlap	04d25ccea2	net, xdp: Correct grammar Use the correct verb form in 2 places in the XDP rx-queue comment. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/bpf/20231213043735.30208-1-rdunlap@infradead.org	2023-12-14 16:38:59 +01:00
Larysa Zaremba	7978bad4b6	mlx5: implement VLAN tag XDP hint Implement the newly added .xmo_rx_vlan_tag() hint function. Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20231205210847.28460-15-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 16:16:41 -08:00
Larysa Zaremba	537fec0733	net: make vlan_get_tag() return -ENODATA instead of -EINVAL __vlan_hwaccel_get_tag() is used in veth XDP hints implementation, its return value (-EINVAL if skb is not VLAN tagged) is passed to bpf code, but XDP hints specification requires drivers to return -ENODATA, if a hint cannot be provided for a particular packet. Solve this inconsistency by changing error return value of __vlan_hwaccel_get_tag() from -EINVAL to -ENODATA, do the same thing to __vlan_get_tag(), because this function is supposed to follow the same convention. This, in turn, makes -ENODATA the only non-zero value vlan_get_tag() can return. We can do this with no side effects, because none of the users of the 3 above-mentioned functions rely on the exact value. Suggested-by: Jesper Dangaard Brouer <jbrouer@redhat.com> Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20231205210847.28460-14-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 16:16:41 -08:00
Larysa Zaremba	e6795330f8	xdp: Add VLAN tag hint Implement functionality that enables drivers to expose VLAN tag to XDP code. VLAN tag is represented by 2 variables: - protocol ID, which is passed to bpf code in BE - VLAN TCI, in host byte order Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20231205210847.28460-10-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 16:16:40 -08:00
Maciej Fijalkowski	b4e352ff11	xsk: add functions to fill control buffer Commit `94ecc5ca4d` ("xsk: Add cb area to struct xdp_buff_xsk") has added a buffer for custom data to xdp_buff_xsk. Particularly, this memory is used for data, consumed by XDP hints kfuncs. It does not always change on a per-packet basis and some parts can be set for example, at the same time as RX queue info. Add functions to fill all cbs in xsk_buff_pool with the same metadata. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20231205210847.28460-8-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 16:16:40 -08:00
Larysa Zaremba	0e6a7b0959	ice: Support RX hash XDP hint RX hash XDP hint requests both hash value and type. Type is XDP-specific, so we need a separate way to map these values to the hardware ptypes, so create a lookup table. Instead of creating a new long list, reuse contents of ice_decode_rx_desc_ptype[] through preprocessor. Current hash type enum does not contain ICMP packet type, but ice devices support it, so also add a new type into core code. Then use previously refactored code and create a function that allows XDP code to read RX hash. Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20231205210847.28460-7-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 16:16:40 -08:00
Jie Jiang	750e785796	bpf: Support uid and gid when mounting bpffs Parse uid and gid in bpf_parse_param() so that they can be passed in as the `data` parameter when mount() bpffs. This will be useful when we want to control which user/group has the control to the mounted bpffs, otherwise a separate chown() call will be needed. Signed-off-by: Jie Jiang <jiejiang@chromium.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Mike Frysinger <vapier@chromium.org> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231212093923.497838-1-jiejiang@chromium.org	2023-12-13 15:37:42 -08:00
Andrii Nakryiko	406a6fa44b	bpf: use bitfields for simple per-subprog bool flags We have a bunch of bool flags for each subprog. Instead of wasting bytes for them, use bitfields instead. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231204233931.49758-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-11 19:23:37 -08:00
Andrii Nakryiko	1a1ad782dc	bpf: tidy up exception callback management a bit Use the fact that we are passing subprog index around and have a corresponding struct bpf_subprog_info in bpf_verifier_env for each subprogram. We don't need to separately pass around a flag whether subprog is exception callback or not, each relevant verifier function can determine this using provided subprog index if we maintain bpf_subprog_info properly. Also move out exception callback-specific logic from btf_prepare_func_args(), keeping it generic. We can enforce all these restriction right before exception callback verification pass. We add out parameter, arg_cnt, for now, but this will be unnecessary with subsequent refactoring and will be removed. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231204233931.49758-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-11 19:23:32 -08:00
Aleksander Lobakin	2ebe81c814	net, xdp: Allow metadata > 32 32 bytes may be not enough for some custom metadata. Relax the restriction, allow metadata larger than 32 bytes and make __skb_metadata_differs() work with bigger lengths. Now size of metadata is only limited by the fact it is stored as u8 in skb_shared_info, so maximum possible value is 255. Size still has to be aligned to 4, so the actual upper limit becomes 252. Most driver implementations will offer less, none can offer more. Other important conditions, such as having enough space for xdp_frame building, are already checked in bpf_xdp_adjust_meta(). Signed-off-by: Aleksander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/eb87653c-8ff8-447d-a7a1-25961f60518a@kernel.org Link: https://lore.kernel.org/bpf/20231206205919.404415-3-larysa.zaremba@intel.com	2023-12-11 16:09:36 +01:00
Andrei Matei	92e1567ee3	bpf: Add some comments to stack representation Add comments to the datastructure tracking the stack state, as the mapping between each stack slot and where its state is stored is not entirely obvious. Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20231208032519.260451-2-andreimatei1@gmail.com	2023-12-08 14:19:00 -08:00
Song Liu	26ef208c20	bpf: Use arch_bpf_trampoline_size Instead of blindly allocating PAGE_SIZE for each trampoline, check the size of the trampoline with arch_bpf_trampoline_size(). This size is saved in bpf_tramp_image->size, and used for modmem charge/uncharge. The fallback arch_alloc_bpf_trampoline() still allocates a whole page because we need to use set_memory_* to protect the memory. struct_ops trampoline still uses a whole page for multiple trampolines. With this size check at caller (regular trampoline and struct_ops trampoline), remove arch_bpf_trampoline_size() from arch_prepare_bpf_trampoline() in archs. Also, update bpf_image_ksym_add() to handle symbol of different sizes. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Björn Töpel <bjorn@rivosinc.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> # on riscv Link: https://lore.kernel.org/r/20231206224054.492250-7-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-06 17:17:20 -08:00
Song Liu	96d1b7c081	bpf: Add arch_bpf_trampoline_size() This helper will be used to calculate the size of the trampoline before allocating the memory. arch_prepare_bpf_trampoline() for arm64 and riscv64 can use arch_bpf_trampoline_size() to check the trampoline fits in the image. OTOH, arch_prepare_bpf_trampoline() for s390 has to call the JIT process twice, so it cannot use arch_bpf_trampoline_size(). Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Björn Töpel <bjorn@rivosinc.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> # on riscv Link: https://lore.kernel.org/r/20231206224054.492250-6-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-06 17:17:20 -08:00
Song Liu	82583daa2e	bpf: Add helpers for trampoline image management As BPF trampoline of different archs moves from bpf_jit_[alloc\|free]_exec() to bpf_prog_pack_[alloc\|free](), we need to use different _alloc, _free for different archs during the transition. Add the following helpers for this transition: void arch_alloc_bpf_trampoline(unsigned int size); void arch_free_bpf_trampoline(void image, unsigned int size); void arch_protect_bpf_trampoline(void image, unsigned int size); void arch_unprotect_bpf_trampoline(void image, unsigned int size); The fallback version of these helpers require size <= PAGE_SIZE, but they are only called with size == PAGE_SIZE. They will be called with size < PAGE_SIZE when arch_bpf_trampoline_size() helper is introduced later. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20231206224054.492250-4-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-06 17:17:20 -08:00
Song Liu	7a3d9a159b	bpf: Adjust argument names of arch_prepare_bpf_trampoline() We are using "im" for "struct bpf_tramp_image" and "tr" for "struct bpf_trampoline" in most of the code base. The only exception is the prototype and fallback version of arch_prepare_bpf_trampoline(). Update them to match the rest of the code base. We mix "orig_call" and "func_addr" for the argument in different versions of arch_prepare_bpf_trampoline(). s/orig_call/func_addr/g so they match. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20231206224054.492250-3-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-06 17:17:20 -08:00
Song Liu	f08a1c6582	bpf: Let bpf_prog_pack_free handle any pointer Currently, bpf_prog_pack_free only can only free pointer to struct bpf_binary_header, which is not flexible. Add a size argument to bpf_prog_pack_free so that it can handle any pointer. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20231206224054.492250-2-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-06 17:17:20 -08:00
Andrii Nakryiko	7065eefb38	bpf: rename MAX_BPF_LINK_TYPE into __MAX_BPF_LINK_TYPE for consistency To stay consistent with the naming pattern used for similar cases in BPF UAPI (__MAX_BPF_ATTACH_TYPE, etc), rename MAX_BPF_LINK_TYPE into __MAX_BPF_LINK_TYPE. Also similar to MAX_BPF_ATTACH_TYPE and MAX_BPF_REG, add: #define MAX_BPF_LINK_TYPE __MAX_BPF_LINK_TYPE Not all __MAX_xxx enums have such #define, so I'm not sure if we should add it or not, but I figured I'll start with a completely backwards compatible way, and we can drop that, if necessary. Also adjust a selftest that used MAX_BPF_LINK_TYPE enum. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231206190920.1651226-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-06 14:41:16 -08:00
Andrii Nakryiko	d734ca7b33	bpf,lsm: add BPF token LSM hooks Wire up bpf_token_create and bpf_token_free LSM hooks, which allow to allocate LSM security blob (we add `void *security` field to struct bpf_token for that), but also control who can instantiate BPF token. This follows existing pattern for BPF map and BPF prog. Also add security_bpf_token_allow_cmd() and security_bpf_token_capable() LSM hooks that allow LSM implementation to control and negate (if necessary) BPF token's delegation of a specific bpf_cmd and capability, respectively. Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-12-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-06 10:03:00 -08:00
Andrii Nakryiko	66d636d70a	bpf,lsm: refactor bpf_map_alloc/bpf_map_free LSM hooks Similarly to bpf_prog_alloc LSM hook, rename and extend bpf_map_alloc hook into bpf_map_create, taking not just struct bpf_map, but also bpf_attr and bpf_token, to give a fuller context to LSMs. Unlike bpf_prog_alloc, there is no need to move the hook around, as it currently is firing right before allocating BPF map ID and FD, which seems to be a sweet spot. But like bpf_prog_alloc/bpf_prog_free combo, make sure that bpf_map_free LSM hook is called even if bpf_map_create hook returned error, as if few LSMs are combined together it could be that one LSM successfully allocated security blob for its needs, while subsequent LSM rejected BPF map creation. The former LSM would still need to free up LSM blob, so we need to ensure security_bpf_map_free() is called regardless of the outcome. Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-11-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-06 10:02:59 -08:00
Andrii Nakryiko	c3dd6e94df	bpf,lsm: refactor bpf_prog_alloc/bpf_prog_free LSM hooks Based on upstream discussion ([0]), rework existing bpf_prog_alloc_security LSM hook. Rename it to bpf_prog_load and instead of passing bpf_prog_aux, pass proper bpf_prog pointer for a full BPF program struct. Also, we pass bpf_attr union with all the user-provided arguments for BPF_PROG_LOAD command. This will give LSMs as much information as we can basically provide. The hook is also BPF token-aware now, and optional bpf_token struct is passed as a third argument. bpf_prog_load LSM hook is called after a bunch of sanity checks were performed, bpf_prog and bpf_prog_aux were allocated and filled out, but right before performing full-fledged BPF verification step. bpf_prog_free LSM hook is now accepting struct bpf_prog argument, for consistency. SELinux code is adjusted to all new names, types, and signatures. Note, given that bpf_prog_load (previously bpf_prog_alloc) hook can be used by some LSMs to allocate extra security blob, but also by other LSMs to reject BPF program loading, we need to make sure that bpf_prog_free LSM hook is called after bpf_prog_load/bpf_prog_alloc one even if the hook itself returned error. If we don't do that, we run the risk of leaking memory. This seems to be possible today when combining SELinux and BPF LSM, as one example, depending on their relative ordering. Also, for BPF LSM setup, add bpf_prog_load and bpf_prog_free to sleepable LSM hooks list, as they are both executed in sleepable context. Also drop bpf_prog_load hook from untrusted, as there is no issue with refcount or anything else anymore, that originally forced us to add it to untrusted list in `c0c852dd18` ("bpf: Do not mark certain LSM hook arguments as trusted"). We now trigger this hook much later and it should not be an issue anymore. [0] https://lore.kernel.org/bpf/9fe88aef7deabbe87d3fc38c4aea3c69.paul@paul-moore.com/ Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-10-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-06 10:02:59 -08:00
Andrii Nakryiko	8062fb12de	bpf: consistently use BPF token throughout BPF verifier logic Remove remaining direct queries to perfmon_capable() and bpf_capable() in BPF verifier logic and instead use BPF token (if available) to make decisions about privileges. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-9-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-06 10:02:59 -08:00
Andrii Nakryiko	4cbb270e11	bpf: take into account BPF token when fetching helper protos Instead of performing unconditional system-wide bpf_capable() and perfmon_capable() calls inside bpf_base_func_proto() function (and other similar ones) to determine eligibility of a given BPF helper for a given program, use previously recorded BPF token during BPF_PROG_LOAD command handling to inform the decision. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-8-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-06 10:02:59 -08:00
Andrii Nakryiko	e1cef620f5	bpf: add BPF token support to BPF_PROG_LOAD command Add basic support of BPF token to BPF_PROG_LOAD. Wire through a set of allowed BPF program types and attach types, derived from BPF FS at BPF token creation time. Then make sure we perform bpf_token_capable() checks everywhere where it's relevant. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-7-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-06 10:02:59 -08:00
Andrii Nakryiko	ee54b1a910	bpf: add BPF token support to BPF_BTF_LOAD command Accept BPF token FD in BPF_BTF_LOAD command to allow BTF data loading through delegated BPF token. BTF loading is a pretty straightforward operation, so as long as BPF token is created with allow_cmds granting BPF_BTF_LOAD command, kernel proceeds to parsing BTF data and creating BTF object. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-06 10:02:59 -08:00
Andrii Nakryiko	688b7270b3	bpf: add BPF token support to BPF_MAP_CREATE command Allow providing token_fd for BPF_MAP_CREATE command to allow controlled BPF map creation from unprivileged process through delegated BPF token. Wire through a set of allowed BPF map types to BPF token, derived from BPF FS at BPF token creation time. This, in combination with allowed_cmds allows to create a narrowly-focused BPF token (controlled by privileged agent) with a restrictive set of BPF maps that application can attempt to create. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-06 10:02:59 -08:00
Andrii Nakryiko	4527358b76	bpf: introduce BPF token object Add new kind of BPF kernel object, BPF token. BPF token is meant to allow delegating privileged BPF functionality, like loading a BPF program or creating a BPF map, from privileged process to a trusted unprivileged process, all while having a good amount of control over which privileged operations could be performed using provided BPF token. This is achieved through mounting BPF FS instance with extra delegation mount options, which determine what operations are delegatable, and also constraining it to the owning user namespace (as mentioned in the previous patch). BPF token itself is just a derivative from BPF FS and can be created through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF FS FD, which can be attained through open() API by opening BPF FS mount point. Currently, BPF token "inherits" delegated command, map types, prog type, and attach type bit sets from BPF FS as is. In the future, having an BPF token as a separate object with its own FD, we can allow to further restrict BPF token's allowable set of things either at the creation time or after the fact, allowing the process to guard itself further from unintentionally trying to load undesired kind of BPF programs. But for now we keep things simple and just copy bit sets as is. When BPF token is created from BPF FS mount, we take reference to the BPF super block's owning user namespace, and then use that namespace for checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN} capabilities that are normally only checked against init userns (using capable()), but now we check them using ns_capable() instead (if BPF token is provided). See bpf_token_capable() for details. Such setup means that BPF token in itself is not sufficient to grant BPF functionality. User namespaced process has to also have necessary combination of capabilities inside that user namespace. So while previously CAP_BPF was useless when granted within user namespace, now it gains a meaning and allows container managers and sys admins to have a flexible control over which processes can and need to use BPF functionality within the user namespace (i.e., container in practice). And BPF FS delegation mount options and derived BPF tokens serve as a per-container "flag" to grant overall ability to use bpf() (plus further restrict on which parts of bpf() syscalls are treated as namespaced). Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF) within the BPF FS owning user namespace, rounding up the ns_capable() story of BPF token. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-06 10:02:59 -08:00
Andrii Nakryiko	40bba140c6	bpf: add BPF token delegation mount options to BPF FS Add few new mount options to BPF FS that allow to specify that a given BPF FS instance allows creation of BPF token (added in the next patch), and what sort of operations are allowed under BPF token. As such, we get 4 new mount options, each is a bit mask - `delegate_cmds` allow to specify which bpf() syscall commands are allowed with BPF token derived from this BPF FS instance; - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies a set of allowable BPF map types that could be created with BPF token; - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies a set of allowable BPF program types that could be loaded with BPF token; - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies a set of allowable BPF program attach types that could be loaded with BPF token; delegate_progs and delegate_attachs are meant to be used together, as full BPF program type is, in general, determined through both program type and program attach type. Currently, these mount options accept the following forms of values: - a special value "any", that enables all possible values of a given bit set; - numeric value (decimal or hexadecimal, determined by kernel automatically) that specifies a bit mask value directly; - all the values for a given mount option are combined, if specified multiple times. E.g., `mount -t bpf nodev /path/to/mount -o delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3 mask. Ideally, more convenient (for humans) symbolic form derived from corresponding UAPI enums would be accepted (e.g., `-o delegate_progs=kprobe\|tracepoint`) and I intend to implement this, but it requires a bunch of UAPI header churn, so I postponed it until this feature lands upstream or at least there is a definite consensus that this feature is acceptable and is going to make it, just to minimize amount of wasted effort and not increase amount of non-essential code to be reviewed. Attentive reader will notice that BPF FS is now marked as FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init user namespace as long as the process has sufficient namespaced capabilities within that user namespace. But in reality we still restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN in init userns (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added to allow creating BPF FS context object (i.e., fsopen("bpf")) from inside unprivileged process inside non-init userns, to capture that userns as the owning userns. It will still be required to pass this context object back to privileged process to instantiate and mount it. This manipulation is important, because capturing non-init userns as the owning userns of BPF FS instance (super block) allows to use that userns to constraint BPF token to that userns later on (see next patch). So creating BPF FS with delegation inside unprivileged userns will restrict derived BPF token objects to only "work" inside that intended userns, making it scoped to a intended "container". Also, setting these delegation options requires capable(CAP_SYS_ADMIN), so unprivileged process cannot set this up without involvement of a privileged process. There is a set of selftests at the end of the patch set that simulates this sequence of steps and validates that everything works as intended. But careful review is requested to make sure there are no missed gaps in the implementation and testing. This somewhat subtle set of aspects is the result of previous discussions ([0]) about various user namespace implications and interactions with BPF token functionality and is necessary to contain BPF token inside intended user namespace. [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/ Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-06 10:02:58 -08:00
Andrii Nakryiko	41f6f64e69	bpf: support non-r10 register spill/fill to/from stack in precision tracking Use instruction (jump) history to record instructions that performed register spill/fill to/from stack, regardless if this was done through read-only r10 register, or any other register after copying r10 into it and potentially adjusting offset. To make this work reliably, we push extra per-instruction flags into instruction history, encoding stack slot index (spi) and stack frame number in extra 10 bit flags we take away from prev_idx in instruction history. We don't touch idx field for maximum performance, as it's checked most frequently during backtracking. This change removes basically the last remaining practical limitation of precision backtracking logic in BPF verifier. It fixes known deficiencies, but also opens up new opportunities to reduce number of verified states, explored in the subsequent patches. There are only three differences in selftests' BPF object files according to veristat, all in the positive direction (less states). File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF) -------------------------------------- ------------- --------- --------- ------------- ---------- ---------- ------------- test_cls_redirect_dynptr.bpf.linked3.o cls_redirect 2987 2864 -123 (-4.12%) 240 231 -9 (-3.75%) xdp_synproxy_kern.bpf.linked3.o syncookie_tc 82848 82661 -187 (-0.23%) 5107 5073 -34 (-0.67%) xdp_synproxy_kern.bpf.linked3.o syncookie_xdp 85116 84964 -152 (-0.18%) 5162 5130 -32 (-0.62%) Note, I avoided renaming jmp_history to more generic insn_hist to minimize number of lines changed and potential merge conflicts between bpf and bpf-next trees. Notice also cur_hist_entry pointer reset to NULL at the beginning of instruction verification loop. This pointer avoids the problem of relying on last jump history entry's insn_idx to determine whether we already have entry for current instruction or not. It can happen that we added jump history entry because current instruction is_jmp_point(), but also we need to add instruction flags for stack access. In this case, we don't want to entries, so we need to reuse last added entry, if it is present. Relying on insn_idx comparison has the same ambiguity problem as the one that was fixed recently in [0], so we avoid that. [0] https://patchwork.kernel.org/project/netdevbpf/patch/20231110002638.4168352-3-andrii@kernel.org/ Acked-by: Eduard Zingerman <eddyz87@gmail.com> Reported-by: Tao Lyu <tao.lyu@epfl.ch> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231205184248.1502704-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-05 13:40:20 -08:00
Hou Tao	af66bfd3c8	bpf: Optimize the free of inner map When removing the inner map from the outer map, the inner map will be freed after one RCU grace period and one RCU tasks trace grace period, so it is certain that the bpf program, which may access the inner map, has exited before the inner map is freed. However there is no need to wait for one RCU tasks trace grace period if the outer map is only accessed by non-sleepable program. So adding sleepable_refcnt in bpf_map and increasing sleepable_refcnt when adding the outer map into env->used_maps for sleepable program. Although the max number of bpf program is INT_MAX - 1, the number of bpf programs which are being loaded may be greater than INT_MAX, so using atomic64_t instead of atomic_t for sleepable_refcnt. When removing the inner map from the outer map, using sleepable_refcnt to decide whether or not a RCU tasks trace grace period is needed before freeing the inner map. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231204140425.1480317-6-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-04 17:50:26 -08:00
Hou Tao	8766733641	bpf: Defer the free of inner map when necessary When updating or deleting an inner map in map array or map htab, the map may still be accessed by non-sleepable program or sleepable program. However bpf_map_fd_put_ptr() decreases the ref-counter of the inner map directly through bpf_map_put(), if the ref-counter is the last one (which is true for most cases), the inner map will be freed by ops->map_free() in a kworker. But for now, most .map_free() callbacks don't use synchronize_rcu() or its variants to wait for the elapse of a RCU grace period, so after the invocation of ops->map_free completes, the bpf program which is accessing the inner map may incur use-after-free problem. Fix the free of inner map by invoking bpf_map_free_deferred() after both one RCU grace period and one tasks trace RCU grace period if the inner map has been removed from the outer map before. The deferment is accomplished by using call_rcu() or call_rcu_tasks_trace() when releasing the last ref-counter of bpf map. The newly-added rcu_head field in bpf_map shares the same storage space with work field to reduce the size of bpf_map. Fixes: `bba1dc0b55` ("bpf: Remove redundant synchronize_rcu.") Fixes: `638e4b825d` ("bpf: Allows per-cpu maps and map-in-map in sleepable programs") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231204140425.1480317-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-04 17:50:26 -08:00
Hou Tao	20c20bd11a	bpf: Add map and need_defer parameters to .map_fd_put_ptr() map is the pointer of outer map, and need_defer needs some explanation. need_defer tells the implementation to defer the reference release of the passed element and ensure that the element is still alive before the bpf program, which may manipulate it, exits. The following three cases will invoke map_fd_put_ptr() and different need_defer values will be passed to these callers: 1) release the reference of the old element in the map during map update or map deletion. The release must be deferred, otherwise the bpf program may incur use-after-free problem, so need_defer needs to be true. 2) release the reference of the to-be-added element in the error path of map update. The to-be-added element is not visible to any bpf program, so it is OK to pass false for need_defer parameter. 3) release the references of all elements in the map during map release. Any bpf program which has access to the map must have been exited and released, so need_defer=false will be OK. These two parameters will be used by the following patches to fix the potential use-after-free problem for map-in-map. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231204140425.1480317-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-04 17:50:26 -08:00
Andrii Nakryiko	8fa4ecd49b	bpf: enforce exact retval range on subprog/callback exit Instead of relying on potentially imprecise tnum representation of expected return value range for callbacks and subprogs, validate that smin/smax range satisfy exact expected range of return values. E.g., if callback would need to return [0, 2] range, tnum can't represent this precisely and instead will allow [0, 3] range. By checking smin/smax range, we can make sure that subprog/callback indeed returns only valid [0, 2] range. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231202175705.885270-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-02 11:36:50 -08:00
Andrii Nakryiko	45b5623f2d	bpf: rearrange bpf_func_state fields to save a bit of memory It's a trivial rearrangement saving 8 bytes. We have 4 bytes of padding at the end which can be filled with another field without increasing struct bpf_func_state. copy_func_state() logic remains correct without any further changes. BEFORE ====== struct bpf_func_state { struct bpf_reg_state regs[11]; /* 0 1320 / / --- cacheline 20 boundary (1280 bytes) was 40 bytes ago --- / int callsite; / 1320 4 / u32 frameno; / 1324 4 / u32 subprogno; / 1328 4 / u32 async_entry_cnt; / 1332 4 / bool in_callback_fn; / 1336 1 / / XXX 7 bytes hole, try to pack / / --- cacheline 21 boundary (1344 bytes) --- / struct tnum callback_ret_range; / 1344 16 / bool in_async_callback_fn; / 1360 1 / bool in_exception_callback_fn; / 1361 1 / / XXX 2 bytes hole, try to pack / int acquired_refs; / 1364 4 / struct bpf_reference_state refs; /* 1368 8 / int allocated_stack; / 1376 4 / / XXX 4 bytes hole, try to pack / struct bpf_stack_state stack; /* 1384 8 / / size: 1392, cachelines: 22, members: 13 / / sum members: 1379, holes: 3, sum holes: 13 / / last cacheline: 48 bytes / }; AFTER ===== struct bpf_func_state { struct bpf_reg_state regs[11]; / 0 1320 / / --- cacheline 20 boundary (1280 bytes) was 40 bytes ago --- / int callsite; / 1320 4 / u32 frameno; / 1324 4 / u32 subprogno; / 1328 4 / u32 async_entry_cnt; / 1332 4 / struct tnum callback_ret_range; / 1336 16 / / --- cacheline 21 boundary (1344 bytes) was 8 bytes ago --- / bool in_callback_fn; / 1352 1 / bool in_async_callback_fn; / 1353 1 / bool in_exception_callback_fn; / 1354 1 / / XXX 1 byte hole, try to pack / int acquired_refs; / 1356 4 / struct bpf_reference_state refs; /* 1360 8 / struct bpf_stack_state stack; /* 1368 8 / int allocated_stack; / 1376 4 / / size: 1384, cachelines: 22, members: 13 / / sum members: 1379, holes: 1, sum holes: 1 / / padding: 4 / / last cacheline: 40 bytes */ }; Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231202175705.885270-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-02 11:36:50 -08:00
Andrew Halaney	df16c1c51d	net: phy: mdio_device: Reset device only when necessary Currently the phy reset sequence is as shown below for a devicetree described mdio phy on boot: 1. Assert the phy_device's reset as part of registering 2. Deassert the phy_device's reset as part of registering 3. Deassert the phy_device's reset as part of phy_probe 4. Deassert the phy_device's reset as part of phy_hw_init The extra two deasserts include waiting the deassert delay afterwards, which is adding unnecessary delay. This applies to both possible types of resets (reset controller reference and a reset gpio) that can be used. Here's some snipped tracing output using the following command line params "trace_event=gpio:* trace_options=stacktrace" illustrating the reset handling and where its coming from: /* Assert / systemd-udevd-283 [002] ..... 6.780434: gpio_value: 544 set 0 systemd-udevd-283 [002] ..... 6.783849: <stack trace> => gpiod_set_raw_value_commit => gpiod_set_value_nocheck => gpiod_set_value_cansleep => mdio_device_reset => mdiobus_register_device => phy_device_register => fwnode_mdiobus_phy_device_register => fwnode_mdiobus_register_phy => __of_mdiobus_register => stmmac_mdio_register => stmmac_dvr_probe => stmmac_pltfr_probe => devm_stmmac_pltfr_probe => qcom_ethqos_probe => platform_probe / Deassert / systemd-udevd-283 [002] ..... 6.802480: gpio_value: 544 set 1 systemd-udevd-283 [002] ..... 6.805886: <stack trace> => gpiod_set_raw_value_commit => gpiod_set_value_nocheck => gpiod_set_value_cansleep => mdio_device_reset => phy_device_register => fwnode_mdiobus_phy_device_register => fwnode_mdiobus_register_phy => __of_mdiobus_register => stmmac_mdio_register => stmmac_dvr_probe => stmmac_pltfr_probe => devm_stmmac_pltfr_probe => qcom_ethqos_probe => platform_probe / Deassert / systemd-udevd-283 [002] ..... 6.882601: gpio_value: 544 set 1 systemd-udevd-283 [002] ..... 6.886014: <stack trace> => gpiod_set_raw_value_commit => gpiod_set_value_nocheck => gpiod_set_value_cansleep => mdio_device_reset => phy_probe => really_probe => __driver_probe_device => driver_probe_device => __device_attach_driver => bus_for_each_drv => __device_attach => device_initial_probe => bus_probe_device => device_add => phy_device_register => fwnode_mdiobus_phy_device_register => fwnode_mdiobus_register_phy => __of_mdiobus_register => stmmac_mdio_register => stmmac_dvr_probe => stmmac_pltfr_probe => devm_stmmac_pltfr_probe => qcom_ethqos_probe => platform_probe / Deassert */ NetworkManager-477 [000] ..... 7.023144: gpio_value: 544 set 1 NetworkManager-477 [000] ..... 7.026596: <stack trace> => gpiod_set_raw_value_commit => gpiod_set_value_nocheck => gpiod_set_value_cansleep => mdio_device_reset => phy_init_hw => phy_attach_direct => phylink_fwnode_phy_connect => __stmmac_open => stmmac_open There's a lot of paths where the device is getting its reset asserted and deasserted. Let's track the state and only actually do the assert/deassert when it changes. Reported-by: Sagar Cheluvegowda <quic_scheluve@quicinc.com> Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20231127-net-phy-reset-once-v2-1-448e8658779e@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-30 23:11:21 -08:00
Jakub Kicinski	753c8608f3	Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-11-30 We've added 30 non-merge commits during the last 7 day(s) which contain a total of 58 files changed, 1598 insertions(+), 154 deletions(-). The main changes are: 1) Add initial TX metadata implementation for AF_XDP with support in mlx5 and stmmac drivers. Two types of offloads are supported right now, that is, TX timestamp and TX checksum offload, from Stanislav Fomichev with stmmac implementation from Song Yoong Siang. 2) Change BPF verifier logic to validate global subprograms lazily instead of unconditionally before the main program, so they can be guarded using BPF CO-RE techniques, from Andrii Nakryiko. 3) Add BPF link_info support for uprobe multi link along with bpftool integration for the latter, from Jiri Olsa. 4) Use pkg-config in BPF selftests to determine ld flags which is in particular needed for linking statically, from Akihiko Odaki. 5) Fix a few BPF selftest failures to adapt to the upcoming LLVM18, from Yonghong Song. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (30 commits) bpf/tests: Remove duplicate JSGT tests selftests/bpf: Add TX side to xdp_hw_metadata selftests/bpf: Convert xdp_hw_metadata to XDP_USE_NEED_WAKEUP selftests/bpf: Add TX side to xdp_metadata selftests/bpf: Add csum helpers selftests/xsk: Support tx_metadata_len xsk: Add option to calculate TX checksum in SW xsk: Validate xsk_tx_metadata flags xsk: Document tx_metadata_len layout net: stmmac: Add Tx HWTS support to XDP ZC net/mlx5e: Implement AF_XDP TX timestamp and checksum offload tools: ynl: Print xsk-features from the sample xsk: Add TX timestamp and TX checksum offload support xsk: Support tx_metadata_len selftests/bpf: Use pkg-config for libelf selftests/bpf: Override PKG_CONFIG for static builds selftests/bpf: Choose pkg-config for the target bpftool: Add support to display uprobe_multi links selftests/bpf: Add link_info test for uprobe_multi link selftests/bpf: Use bpf_link__destroy in fill_link_info tests ... ==================== Conflicts: Documentation/netlink/specs/netdev.yaml: `839ff60df3` ("net: page_pool: add nlspec for basic access to page pools") `48eb03dd26` ("xsk: Add TX timestamp and TX checksum offload support") https://lore.kernel.org/all/20231201094705.1ee3cab8@canb.auug.org.au/ While at it also regen, tree is dirty after: `48eb03dd26` ("xsk: Add TX timestamp and TX checksum offload support") looks like code wasn't re-rendered after "render-max" was removed. Link: https://lore.kernel.org/r/20231130145708.32573-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-30 16:58:42 -08:00
Jakub Kicinski	975f2d73a9	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-30 16:11:19 -08:00
Linus Torvalds	6172a5180f	Merge tag 'net-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bpf and wifi. Current release - regressions: - neighbour: fix __randomize_layout crash in struct neighbour - r8169: fix deadlock on RTL8125 in jumbo mtu mode Previous releases - regressions: - wifi: - mac80211: fix warning at station removal time - cfg80211: fix CQM for non-range use - tools: ynl-gen: fix unexpected response handling - octeontx2-af: fix possible buffer overflow - dpaa2: recycle the RX buffer only after all processing done - rswitch: fix missing dev_kfree_skb_any() in error path Previous releases - always broken: - ipv4: fix uaf issue when receiving igmp query packet - wifi: mac80211: fix debugfs deadlock at device removal time - bpf: - sockmap: af_unix stream sockets need to hold ref for pair sock - netdevsim: don't accept device bound programs - selftests: fix a char signedness issue - dsa: mv88e6xxx: fix marvell 6350 probe crash - octeontx2-pf: restore TC ingress police rules when interface is up - wangxun: fix memory leak on msix entry - ravb: keep reverse order of operations in ravb_remove()" * tag 'net-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (51 commits) net: ravb: Keep reverse order of operations in ravb_remove() net: ravb: Stop DMA in case of failures on ravb_open() net: ravb: Start TX queues after HW initialization succeeded net: ravb: Make write access to CXR35 first before accessing other EMAC registers net: ravb: Use pm_runtime_resume_and_get() net: ravb: Check return value of reset_control_deassert() net: libwx: fix memory leak on msix entry ice: Fix VF Reset paths when interface in a failed over aggregate bpf, sockmap: Add af_unix test with both sockets in map bpf, sockmap: af_unix stream sockets need to hold ref for pair sock tools: ynl-gen: always construct struct ynl_req_state ethtool: don't propagate EOPNOTSUPP from dumps ravb: Fix races between ravb_tx_timeout_work() and net related ops r8169: prevent potential deadlock in rtl8169_close r8169: fix deadlock on RTL8125 in jumbo mtu mode neighbour: Fix __randomize_layout crash in struct neighbour octeontx2-pf: Restore TC ingress police rules when interface is up octeontx2-pf: Fix adding mbox work queue entry when num_vfs > 64 net: stmmac: xgmac: Disable FPE MMC interrupts octeontx2-af: Fix possible buffer overflow ...	2023-12-01 08:24:46 +09:00
Kuniyuki Iwashima	8e7bab6b96	tcp: Factorise cookie-dependent fields initialisation in cookie_v[46]_check() We will support arbitrary SYN Cookie with BPF, and then kfunc at TC will preallocate reqsk and initialise some fields that should not be overwritten later by cookie_v[46]_check(). To simplify the flow in cookie_v[46]_check(), we move such fields' initialisation to cookie_tcp_reqsk_alloc() and factorise non-BPF SYN Cookie handling into cookie_tcp_check(), where we validate the cookie and allocate reqsk, as done by kfunc later. Note that we set ireq->ecn_ok in two steps, the latter of which will be shared by the BPF case. As cookie_ecn_ok() is one-liner, now it's inlined. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231129022924.96156-9-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:16:38 -08:00
Kuniyuki Iwashima	7b0f570f87	tcp: Move TCP-AO bits from cookie_v[46]_check() to tcp_ao_syncookie(). We initialise treq->af_specific in cookie_tcp_reqsk_alloc() so that we can look up a key later in tcp_create_openreq_child(). Initially, that change was added for MD5 by commit `ba5a4fdd63` ("tcp: make sure treq->af_specific is initialized"), but it has not been used since commit `d0f2b7a9ca` ("tcp: Disable header prediction for MD5 flow."). Now, treq->af_specific is used only by TCP-AO, so, we can move that initialisation into tcp_ao_syncookie(). In addition to that, l3index in cookie_v[46]_check() is only used for tcp_ao_syncookie(), so let's move it as well. While at it, we move down tcp_ao_syncookie() in cookie_v4_check() so that it will be called after security_inet_conn_request() to make functions order consistent with cookie_v6_check(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231129022924.96156-7-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:16:28 -08:00
Kuniyuki Iwashima	efce3d1fdf	tcp: Don't initialise tp->tsoffset in tcp_get_cookie_sock(). When we create a full socket from SYN Cookie, we initialise tcp_sk(sk)->tsoffset redundantly in tcp_get_cookie_sock() as the field is inherited from tcp_rsk(req)->ts_off. cookie_v[46]_check \|- treq->ts_off = 0 `- tcp_get_cookie_sock \|- tcp_v[46]_syn_recv_sock \| `- tcp_create_openreq_child \| `- newtp->tsoffset = treq->ts_off `- tcp_sk(child)->tsoffset = tsoff Let's initialise tcp_rsk(req)->ts_off with the correct offset and remove the second initialisation of tcp_sk(sk)->tsoffset. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231129022924.96156-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:16:23 -08:00
Kuniyuki Iwashima	7577bc8249	tcp: Don't pass cookie to __cookie_v[46]_check(). tcp_hdr(skb) and SYN Cookie are passed to __cookie_v[46]_check(), but none of the callers passes cookie other than ntohl(th->ack_seq) - 1. Let's fetch it in __cookie_v[46]_check() instead of passing the cookie over and over. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231129022924.96156-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:16:19 -08:00
Colin Ian King	f422544118	net: mana: Fix spelling mistake "enforecement" -> "enforcement" There is a spelling mistake in struct field hc_tx_err_sqpdid_enforecement. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Link: https://lore.kernel.org/r/20231128095304.515492-1-colin.i.king@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:13:40 -08:00
Geliang Tang	6ebf6f90ab	mptcp: add mptcpi_subflows_total counter If the initial subflow has been removed, we cannot know without checking other counters, e.g. ss -ti <filter> \| grep -c tcp-ulp-mptcp or getsockopt(SOL_MPTCP, MPTCP_FULL_INFO, ...) (or others except MPTCP_INFO of course) and then check mptcp_subflow_data->num_subflows to get the total amount of subflows. This patch adds a new counter mptcpi_subflows_total in mptcpi_flags to store the total amount of subflows, including the initial one. A new helper __mptcp_has_initial_subflow() is added to check whether the initial subflow has been removed or not. With this helper, we can then compute the total amount of subflows from mptcp_info by doing something like: mptcpi_subflows_total = mptcpi_subflows + __mptcp_has_initial_subflow(msk). Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/428 Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231128-send-net-next-2023107-v4-1-8d6b94150f6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:06:17 -08:00
Jakub Kicinski	300fbb247e	Merge tag 'wireless-2023-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== wireless fixes: - debugfs had a deadlock (removal vs. use of files), fixes going through wireless ACKed by Greg - support for HT STAs on 320 MHz channels, even if it's not clear that should ever happen (that's 6 GHz), best not to WARN() - fix for the previous CQM fix that broke most cases - various wiphy locking fixes - various small driver fixes * tag 'wireless-2023-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: mac80211: use wiphy locked debugfs for sdata/link wifi: mac80211: use wiphy locked debugfs helpers for agg_status wifi: cfg80211: add locked debugfs wrappers debugfs: add API to allow debugfs operations cancellation debugfs: annotate debugfs handlers vs. removal with lockdep debugfs: fix automount d_fsdata usage wifi: mac80211: handle 320 MHz in ieee80211_ht_cap_ie_to_sta_ht_cap wifi: avoid offset calculation on NULL pointer wifi: cfg80211: hold wiphy mutex for send_interface wifi: cfg80211: lock wiphy mutex for rfkill poll wifi: cfg80211: fix CQM for non-range use wifi: mac80211: do not pass AP_VLAN vif pointer to drivers during flush wifi: iwlwifi: mvm: fix an error code in iwl_mvm_mld_add_sta() wifi: mt76: mt7925: fix typo in mt7925_init_he_caps wifi: mt76: mt7921: fix 6GHz disabled by the missing default CLC config ==================== Link: https://lore.kernel.org/r/20231129150809.31083-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 19:43:34 -08:00

1 2 3 4 5 ...

152103 Commits