linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-21 13:45:53 -04:00

Author	SHA1	Message	Date
Maciej Fijalkowski	c30d084960	xsk: avoid overwriting skb fields for multi-buffer traffic We are unnecessarily setting a bunch of skb fields per each processed descriptor, which is redundant for fragmented frames. Let us set these respective members for first fragment only. To address both paths that we have within xsk_build_skb(), move assignments onto xsk_set_destructor_arg() and rename it to xsk_skb_init_misc(). Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250925160009.2474816-2-maciej.fijalkowski@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 13:51:45 -07:00
Amery Hung	bc8712f2b5	bpf: Emit struct bpf_xdp_sock type in vmlinux BTF Similar to other BPF UAPI struct, force emit BTF of struct bpf_xdp_sock so that it is defined in vmlinux.h. In a later patch, a selftest will use vmlinux.h to get the definition of struct bpf_xdp_sock instead of bpf.h. Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250925170013.1752561-1-ameryhung@gmail.com	2025-09-25 14:29:46 -07:00
Jakub Kicinski	203e3beb73	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.17-rc8). Conflicts: drivers/net/can/spi/hi311x.c `6b69680847` ("can: hi311x: fix null pointer dereference when resuming from sleep before interface was enabled") `27ce71e1ce` ("net: WQ_PERCPU added to alloc_workqueue users") https://lore.kernel.org/72ce7599-1b5b-464a-a5de-228ff9724701@kernel.org net/smc/smc_loopback.c drivers/dibs/dibs_loopback.c `a35c04de25` ("net/smc: fix warning in smc_rx_splice() when calling get_page()") `cc21191b58` ("dibs: Move data path to dibs layer") https://lore.kernel.org/74368a5c-48ac-4f8e-a198-40ec1ed3cf5f@kernel.org Adjacent changes: drivers/net/dsa/lantiq/lantiq_gswip.c `c0054b25e2` ("net: dsa: lantiq_gswip: move gswip_add_single_port_br() call to port_setup()") `7a1eaef0a7` ("net: dsa: lantiq_gswip: support model-specific mac_select_pcs()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-25 11:00:59 -07:00
Richard Gobert	f095a358fa	net: gro: remove unnecessary df checks Currently, packets with fixed IDs will be merged only if their don't-fragment bit is set. This restriction is unnecessary since packets without the don't-fragment bit will be forwarded as-is even if they were merged together. The merged packets will be segmented into their original forms before being forwarded, either by GSO or by TSO. The IDs will also remain identical unless NETIF_F_TSO_MANGLEID is set, in which case the IDs can become incrementing, which is also fine. Clean up the code by removing the unnecessary don't-fragment checks. Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250923085908.4687-5-richardbgobert@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-25 12:42:49 +02:00
Richard Gobert	3271f19bf7	net: gso: restore ids of outer ip headers correctly Currently, NETIF_F_TSO_MANGLEID indicates that the inner-most ID can be mangled. Outer IDs can always be mangled. Make GSO preserve outer IDs by default, with NETIF_F_TSO_MANGLEID allowing both inner and outer IDs to be mangled. This commit also modifies a few drivers that use SKB_GSO_FIXEDID directly. Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> # for sfc Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250923085908.4687-4-richardbgobert@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-25 12:42:49 +02:00
Richard Gobert	21f7484220	net: gro: only merge packets with incrementing or fixed outer ids Only merge encapsulated packets if their outer IDs are either incrementing or fixed, just like for inner IDs and IDs of non-encapsulated packets. Add another ip_fixedid bit for a total of two bits: one for outer IDs (and for unencapsulated packets) and one for inner IDs. This commit preserves the current behavior of GSO where only the IDs of the inner-most headers are restored correctly. Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250923085908.4687-3-richardbgobert@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-25 12:42:49 +02:00
Richard Gobert	25c550464a	net: gro: remove is_ipv6 from napi_gro_cb Remove is_ipv6 from napi_gro_cb and use sk->sk_family instead. This frees up space for another ip_fixedid bit that will be added in the next commit. udp_sock_create always creates either an AF_INET or an AF_INET6 socket, so using sk->sk_family is reliable. In IPv6-FOU, cfg->ipv6_v6only is always enabled. Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250923085908.4687-2-richardbgobert@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-25 12:42:49 +02:00
Christian Brauner	4055526d35	ns: move ns type into struct ns_common It's misplaced in struct proc_ns_operations and ns->ops might be NULL if the namespace is compiled out but we still want to know the type of the namespace for the initial namespace struct. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-25 09:23:54 +02:00
Jakub Kicinski	c7ab8024ca	Merge tag 'nf-next-25-09-24' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter: fixes for net-next These fixes target next because the bug is either not severe or has existed for so long that there is no reason to cram them in at the last minute. 1) Fix IPVS ftp unregistering during netns cleanup, broken since netns support was introduced in 2011 in the 2.6.39 kernel. From Slavin Liu. 2) nfnetlink must reset the 'nlh' pointer back to the original address when a batch is replayed, else we emit bogus ACK messages and conceal real errno from userspace. From Fernando Fernandez Mancera. This was broken since 6.10. 3) Recent fix for nftables 'pipapo' set type was incomplete, it only made things work for the AVX2 version of the algorithm. 4) Testing revealed another problem with avx2 version that results in out-of-bounds read access, this bug always existed since feature was added in 5.7 kernel. This also comes with a selftest update. Last fix resolves a long-standing bug (since 4.9) in conntrack /proc interface: Decrease skip count when we reap an expired entry during dump. As-is we erroneously elide one conntrack entry from dump for every expired entry seen. From Eric Dumazet. * tag 'nf-next-25-09-24' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_conntrack: do not skip entries in /proc/net/nf_conntrack selftests: netfilter: nft_concat_range.sh: add check for double-create bug netfilter: nft_set_pipapo_avx2: fix skip of expired entries netfilter: nft_set_pipapo: use 0 genmask for packetpath lookups netfilter: nfnetlink: reset nlh pointer during batch replay ipvs: Defer ip_vs_ftp unregister during netns cleanup ==================== Link: https://patch.msgid.link/20250924140654.10210-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-24 17:45:15 -07:00
Gustavo A. R. Silva	b6db19d1df	tls: Avoid -Wflex-array-member-not-at-end warning Remove unused flexible-array member in struct tls_rec and, with this, fix the following warning: net/tls/tls.h:131:29: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Also, add a comment to prevent people from adding any members after struct aead_request, which is a flexible structure --this is a structure that ends in a flexible-array member. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/aNMG1lyXw4XEAVaE@kspp Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-24 16:23:02 -07:00
Jakub Kicinski	5e3fee34f6	Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-09-23 We've added 9 non-merge commits during the last 33 day(s) which contain a total of 10 files changed, 480 insertions(+), 53 deletions(-). The main changes are: 1) A new bpf_xdp_pull_data kfunc that supports pulling data from a frag into the linear area of a xdp_buff, from Amery Hung. This includes changes in the xdp_native.bpf.c selftest, which Nimrod's future work depends on. It is a merge from a stable branch 'xdp_pull_data' which has also been merged to bpf-next. There is a conflict with recent changes in 'include/net/xdp.h' in the net-next tree that will need to be resolved. 2) A compiler warning fix when CONFIG_NET=n in the recent dynptr skb_meta support, from Jakub Sitnicki. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: selftests: drv-net: Pull data before parsing headers selftests/bpf: Test bpf_xdp_pull_data bpf: Support specifying linear xdp packet data size for BPF_PROG_TEST_RUN bpf: Make variables in bpf_prog_test_run_xdp less confusing bpf: Clear packet pointers after changing packet data in kfuncs bpf: Support pulling non-linear xdp data bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail bpf: Clear pfmemalloc flag when freeing all fragments bpf: Return an error pointer for skb metadata when CONFIG_NET=n ==================== Link: https://patch.msgid.link/20250924050303.2466356-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-24 10:22:37 -07:00
Eric Dumazet	c5ba345b2d	netfilter: nf_conntrack: do not skip entries in /proc/net/nf_conntrack ct_seq_show() has an opportunistic garbage collector : if (nf_ct_should_gc(ct)) { nf_ct_kill(ct); goto release; } So if one nf_conn is killed there, next time ct_get_next() runs, we skip the following item in the bucket, even if it should have been displayed if gc did not take place. We can decrement st->skip_elems to tell ct_get_next() one of the items was removed from the chain. Fixes: `58e207e498` ("netfilter: evict stale entries when user reads /proc/net/nf_conntrack") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-09-24 11:50:28 +02:00
Florian Westphal	5823699a11	netfilter: nft_set_pipapo_avx2: fix skip of expired entries KASAN reports following splat: BUG: KASAN: slab-out-of-bounds in pipapo_get_avx2+0x941/0x25d0 Read of size 1 at addr ffff88814c561be0 by task nft/3944 Call Trace: pipapo_get_avx2+0x941/0x25d0 nft_pipapo_insert+0x440/0x11b0 nf_tables_newsetelem+0x220a/0x3a00 .. This bisects to commit `84c1da7b38` ("netfilter: nft_set_pipapo: use AVX2 algorithm for insertions too"). However, that change merely uncovers this bug. When we find a match but that match has expired or timed out, the AVX2 implementation restarts the full match loop. At that point, the pointer to the key data has already been changed and points to the key's last field. This will then result in out-of-bounds read once it's incremented again for the next field. The restart logic in AVX2 is different compared to the plain C implementation, but both should follow the same logic. The C implementation just calls pipapo_refill() again to check the next entry. Do the same in the AVX2 implementation. Note that with this change, due to implementation differences of pipapo_refill vs. nft_pipapo_avx2_refill, the refill call will return the same element again. Then, on the next call, it will move to the next entry as expected. This is because avx2_refill doesn't clear the bitmap in the 'last' conditional. This is harmless. Expired/timed out elements are also not expected to be frequent. selftest is added in a followup commit. Fixes: `7400b06396` ("nft_set_pipapo: Introduce AVX2-based lookup implementation") Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-09-24 11:50:28 +02:00
Florian Westphal	4dbac7db17	netfilter: nft_set_pipapo: use 0 genmask for packetpath lookups In commit `c4eaca2e10` ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups") I replaced genmask_cur() with NFT_GENMASK_ANY, but this change has no effect in the pipapo set type. New entries are unreachable from the active copy, so NFT_GENMASK_ANY has same result as genmask_cur(): current-gen elements are disabled and the new-generation elements cannot be found. Tests did not catch this incomplete fix because the change also dropped the genmask test from the AVX2 version of the algorithm, so test only fails if host cpu lacks AVX2 support. Use genmask test only from the control plane (inserts, deletions, ..). Packet path has to skip the check, use of 0 is enough for this because ext->genmask has the relevant bit set when the element is INACTIVE in that generation: using a 0 genmask thus makes nft_set_elem_active() always return true. Fix the comment and replace NFT_GENMASK_ANY with 0. Fixes: `c4eaca2e10` ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups") Signed-off-by: Florian Westphal <fw@strlen.de>	2025-09-24 11:50:28 +02:00
Fernando Fernandez Mancera	09efbac953	netfilter: nfnetlink: reset nlh pointer during batch replay During a batch replay, the nlh pointer is not reset until the parsing of the commands. Since commit `bf2ac490d2` ("netfilter: nfnetlink: Handle ACK flags for batch messages") that is problematic as the condition to add an ACK for batch begin will evaluate to true even if NLM_F_ACK wasn't used for batch begin message. If there is an error during the command processing, netlink is sending an ACK despite that. This misleads userspace tools which think that the return code was 0. Reset the nlh pointer to the original one when a replay is triggered. Fixes: `bf2ac490d2` ("netfilter: nfnetlink: Handle ACK flags for batch messages") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-09-24 11:50:28 +02:00
Slavin Liu	134121bfd9	ipvs: Defer ip_vs_ftp unregister during netns cleanup On the netns cleanup path, __ip_vs_ftp_exit() may unregister ip_vs_ftp before connections with valid cp->app pointers are flushed, leading to a use-after-free. Fix this by introducing a global `exiting_module` flag, set to true in ip_vs_ftp_exit() before unregistering the pernet subsystem. In __ip_vs_ftp_exit(), skip ip_vs_ftp unregister if called during netns cleanup (when exiting_module is false) and defer it to __ip_vs_cleanup_batch(), which unregisters all apps after all connections are flushed. If called during module exit, unregister ip_vs_ftp immediately. Fixes: `61b1ab4583` ("IPVS: netns, add basic init per netns.") Suggested-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Slavin Liu <slavin452@gmail.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-09-24 11:50:28 +02:00
Kuniyuki Iwashima	dc1dea796b	tcp: Remove stale locking comment for TFO. The listener -> child locking no longer exists in the fast path since commit `e994b2f0fb` ("tcp: do not lock listener to process SYN packets"). Let's remove the stale comment for reqsk_fastopen_remove(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250923005441.4131554-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-23 18:21:36 -07:00
Vadim Fedorenko	e8ab231782	net: ethtool: tsconfig: set command must provide a reply Timestamping configuration through ethtool has inconsistent behavior of skipping the reply for set command if configuration was not changed. Fix it by providing a reply in any case. Fixes: `6e9e2eed4f` ("net: ethtool: Add support for tsconfig command to get/set hwtstamp config") Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20250922231924.2769571-1-vadfed@meta.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-23 17:13:05 -07:00
Petr Machata	cd9a9562b2	net: bridge: Install FDB for bridge MAC on VLAN 0 Currently, after the bridge is created, the FDB does not hold an FDB entry for the bridge MAC on VLAN 0: # ip link add name br up type bridge # ip -br link show dev br br UNKNOWN 92:19:8c:4e:01:ed <BROADCAST,MULTICAST,UP,LOWER_UP> # bridge fdb show | grep 92:19:8c:4e:01:ed 92:19:8c:4e:01:ed dev br vlan 1 master br permanent Later when the bridge MAC is changed, or in fact when the address is given during netdevice creation, the entry appears: # ip link add name br up address 00:11:22:33:44:55 type bridge # bridge fdb show | grep 00:11:22:33:44:55 00:11:22:33:44:55 dev br vlan 1 master br permanent 00:11:22:33:44:55 dev br master br permanent However when the bridge address is set by the user to the current bridge address before the first port is enslaved, none of the address handlers gets invoked, because the address is not actually changed. The address is however marked as NET_ADDR_SET. Then when a port is enslaved, the address is not changed, because it is NET_ADDR_SET. Thus the VLAN 0 entry is not added, and it has not been added previously either: # ip link add name br up type bridge # ip -br link show dev br br UNKNOWN 7e:f0:a8:1a:be:c2 <BROADCAST,MULTICAST,UP,LOWER_UP> # ip link set dev br addr 7e:f0:a8:1a:be:c2 # ip link add name v up type veth # ip link set dev v master br # ip -br link show dev br br UNKNOWN 7e:f0:a8:1a:be:c2 <BROADCAST,MULTICAST,UP,LOWER_UP> # bridge fdb | grep 7e:f0:a8:1a:be:c2 7e:f0:a8:1a:be:c2 dev br vlan 1 master br permanent Then when the bridge MAC is used as DMAC, and br_handle_frame_finish() looks up an FDB entry with VLAN=0, it doesn't find any, and floods the traffic instead of passing it up. Fix this by simply adding the VLAN 0 FDB entry for the bridge itself always on netdevice creation. This also makes the behavior consistent with how ports are treated: ports always have an FDB entry for each member VLAN as well as VLAN 0.
Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/415202b2d1b9b0899479a502bbe2ba188678f192.1758550408.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-23 17:10:49 -07:00
Ido Schimmel	390b3a300d	nexthop: Forbid FDB status change while nexthop is in a group The kernel forbids the creation of non-FDB nexthop groups with FDB nexthops: # ip nexthop add id 1 via 192.0.2.1 fdb # ip nexthop add id 2 group 1 Error: Non FDB nexthop group cannot have fdb nexthops. And vice versa: # ip nexthop add id 3 via 192.0.2.2 dev dummy1 # ip nexthop add id 4 group 3 fdb Error: FDB nexthop group can only have fdb nexthops. However, as long as no routes are pointing to a non-FDB nexthop group, the kernel allows changing the type of a nexthop from FDB to non-FDB and vice versa: # ip nexthop add id 5 via 192.0.2.2 dev dummy1 # ip nexthop add id 6 group 5 # ip nexthop replace id 5 via 192.0.2.2 fdb # echo $? 0 This configuration is invalid and can result in a NPD [1] since FDB nexthops are not associated with a nexthop device: # ip route add 198.51.100.1/32 nhid 6 # ping 198.51.100.1 Fix by preventing nexthop FDB status change while the nexthop is in a group: # ip nexthop add id 7 via 192.0.2.2 dev dummy1 # ip nexthop add id 8 group 7 # ip nexthop replace id 7 via 192.0.2.2 fdb Error: Cannot change nexthop FDB status while in a group. [1] BUG: kernel NULL pointer dereference, address: 00000000000003c0 [...] Oops: Oops: 0000 [#1] SMP CPU: 6 UID: 0 PID: 367 Comm: ping Not tainted 6.17.0-rc6-virtme-gb65678cacc03 #1 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-4.fc41 04/01/2014 RIP: 0010:fib_lookup_good_nhc+0x1e/0x80 [...] 
Call Trace: <TASK> fib_table_lookup+0x541/0x650 ip_route_output_key_hash_rcu+0x2ea/0x970 ip_route_output_key_hash+0x55/0x80 __ip4_datagram_connect+0x250/0x330 udp_connect+0x2b/0x60 __sys_connect+0x9c/0xd0 __x64_sys_connect+0x18/0x20 do_syscall_64+0xa4/0x2a0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Fixes: `38428d6871` ("nexthop: support for fdb ecmp nexthops") Reported-by: syzbot+6596516dd2b635ba2350@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68c9a4d2.050a0220.3c6139.0e63.GAE@google.com/ Tested-by: syzbot+6596516dd2b635ba2350@syzkaller.appspotmail.com Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250921150824.149157-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-23 17:01:05 -07:00
Jason Baron	ca9f9cdc4d	net: allow alloc_skb_with_frags() to use MAX_SKB_FRAGS Currently, alloc_skb_with_frags() will only fill (MAX_SKB_FRAGS - 1) slots. I think it should use all MAX_SKB_FRAGS slots, as callers of alloc_skb_with_frags() will size their allocation of frags based on MAX_SKB_FRAGS. This issue was discovered via a test patch that sets 'order' to 0 in alloc_skb_with_frags(), which effectively tests/simulates high fragmentation. In this case sendmsg() on unix sockets will fail every time for large allocations. If the PAGE_SIZE is 4K, then data_len will request 68K or 17 pages, but alloc_skb_with_frags() can only allocate 64K in this case or 16 pages. Fixes: `09c2c90705` ("net: allow alloc_skb_with_frags() to allocate bigger packets") Signed-off-by: Jason Baron <jbaron@akamai.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250922191957.2855612-1-jbaron@akamai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-23 16:51:26 -07:00
Eric Dumazet	b650bf0977	udp: remove busylock and add per NUMA queues busylock was protecting UDP sockets against packet floods, but unfortunately was not protecting the host itself. Under stress, many cpus could spin while acquiring the busylock, and NIC had to drop packets. Or packets would be dropped in cpu backlog if RPS/RFS were in place. This patch replaces the busylock by intermediate lockless queues. (One queue per NUMA node). This means that fewer cpus have to acquire the UDP receive queue lock. Most of the cpus can either: - immediately drop the packet. - or queue it in their NUMA aware lockless queue. Then one of the cpus is chosen to process this lockless queue in a batch. The batch only contains packets that were cooked on the same NUMA node, thus with very limited latency impact. Tested: DDOS targeting a victim UDP socket, on a platform with 6 NUMA nodes (Intel(R) Xeon(R) 6985P-C) Before: nstat -n ; sleep 1 ; nstat | grep Udp Udp6InDatagrams 1004179 0.0 Udp6InErrors 3117 0.0 Udp6RcvbufErrors 3117 0.0 After: nstat -n ; sleep 1 ; nstat | grep Udp Udp6InDatagrams 1116633 0.0 Udp6InErrors 14197275 0.0 Udp6RcvbufErrors 14197275 0.0 We can see this host can now process 14.2 M more packets per second while under attack, and the victim socket can receive 11 % more packets. I used a small bpftrace program measuring time (in us) spent in __udp_enqueue_schedule_skb().
Before: @udp_enqueue_us[398]: [0] 24901 |@@@ | [1] 63512 |@@@@@@@@@ | [2, 4) 344827 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [4, 8) 244673 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [8, 16) 54022 |@@@@@@@@ | [16, 32) 222134 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [32, 64) 232042 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [64, 128) 4219 | | [128, 256) 188 | | After: @udp_enqueue_us[398]: [0] 5608855 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [1] 1111277 |@@@@@@@@@@ | [2, 4) 501439 |@@@@ | [4, 8) 102921 | | [8, 16) 29895 | | [16, 32) 43500 | | [32, 64) 31552 | | [64, 128) 979 | | [128, 256) 13 | | Note that the remaining bottleneck for this platform is in udp_drops_inc() because we limited struct numa_drop_counters to only two nodes so far. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250922104240.2182559-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-23 16:38:39 -07:00
Martin KaFai Lau	34f033a6c9	Merge branch 'bpf-next/xdp_pull_data' into 'bpf-next/master' Merge the xdp_pull_data stable branch into the master branch. No conflict. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2025-09-23 16:23:58 -07:00
Martin KaFai Lau	55d5a5154d	Merge branch 'bpf-next/xdp_pull_data' into 'bpf-next/net' Merge the xdp_pull_data stable branch into the net branch. No conflict. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2025-09-23 15:46:52 -07:00
Amery Hung	fe9544ed1a	bpf: Support specifying linear xdp packet data size for BPF_PROG_TEST_RUN To test bpf_xdp_pull_data(), an xdp packet containing fragments as well as free linear data area after xdp->data_end needs to be created. However, bpf_prog_test_run_xdp() always fills the linear area with data_in before creating fragments, leaving no space to pull data. This patch will allow users to specify the linear data size through ctx->data_end. Currently, ctx_in->data_end must match data_size_in and will not be the final ctx->data_end seen by xdp programs. This is because ctx->data_end is populated according to the xdp_buff passed to test_run. The linear data area available in an xdp_buff, max_linear_sz, is always filled up before copying data_in into fragments. This patch will allow users to specify the size of data that goes into the linear area. When ctx_in->data_end is different from data_size_in, only ctx_in->data_end bytes of data will be put into the linear area when creating the xdp_buff. While ctx_in->data_end will be allowed to be different from data_size_in, it cannot be larger than the data_size_in as there will be no data to copy from user space. If it is larger than the maximum linear data area size, the layout suggested by the user will not be honored. Data beyond max_linear_sz bytes will still be copied into fragments. Finally, since it is possible for a NIC to produce an xdp_buff with empty linear data area, allow it when calling bpf_test_init() from bpf_prog_test_run_xdp() so that we can test XDP kfuncs with such xdp_buff. This is done by moving lower-bound check to callers as most of them already do except bpf_prog_test_run_skb(). The change also fixes a bug that allows passing an xdp_buff with data < ETH_HLEN. This can happen when ctx is used and metadata is at least ETH_HLEN.
Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250922233356.3356453-7-ameryhung@gmail.com	2025-09-23 13:35:12 -07:00
Amery Hung	7eb83bff02	bpf: Make variables in bpf_prog_test_run_xdp less confusing Change the variable naming in bpf_prog_test_run_xdp() to make the overall logic less confusing. As different modes were added to the function over time, some variables got overloaded, making the code hard to understand and error-prone to change. Replace "size" with "linear_sz" where it refers to the size of metadata and data. If "size" refers to input data size, use test.data_size_in directly. Replace "max_data_sz" with "max_linear_sz" to better reflect the fact that it is the maximum size of metadata and data (i.e., linear_sz). Also, xdp_rxq.frags_size is always PAGE_SIZE, so just set it directly instead of subtracting headroom and tailroom and adding them back. Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250922233356.3356453-6-ameryhung@gmail.com	2025-09-23 13:35:12 -07:00
Amery Hung	4dce1a0d7c	bpf: Support pulling non-linear xdp data Add kfunc, bpf_xdp_pull_data(), to support pulling data from xdp fragments. Similar to bpf_skb_pull_data(), bpf_xdp_pull_data() makes the first len bytes of data directly readable and writable in bpf programs. If the "len" argument is larger than the linear data size, data in fragments will be copied to the linear data area when there is enough room. Specifically, the kfunc will try to use the tailroom first. When the tailroom is not enough, metadata and data will be shifted down to make room for pulling data. A use case of the kfunc is to decapsulate headers residing in xdp fragments. It is possible for a NIC driver to place headers in xdp fragments. To keep using direct packet access for parsing and decapsulating headers, users can pull headers into the linear data area by calling bpf_xdp_pull_data() and then pop the header with bpf_xdp_adjust_head(). Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250922233356.3356453-4-ameryhung@gmail.com	2025-09-23 13:35:12 -07:00
Amery Hung	dea1526fba	bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail Move skb_frag_t adjustment into bpf_xdp_shrink_data() and extend its functionality to be able to shrink an xdp fragment from both head and tail. In a later patch, bpf_xdp_pull_data() will reuse it to shrink an xdp fragment from head. Additionally, in bpf_xdp_frags_shrink_tail(), breaking the loop when bpf_xdp_shrink_data() returns false (i.e., not releasing the current fragment) is not necessary as the loop condition, offset > 0, has the same effect. Remove the else branch to simplify the code. Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20250922233356.3356453-3-ameryhung@gmail.com	2025-09-23 13:35:12 -07:00
Amery Hung	8f12d1137c	bpf: Clear pfmemalloc flag when freeing all fragments It is possible for bpf_xdp_adjust_tail() to free all fragments. The kfunc currently clears the XDP_FLAGS_HAS_FRAGS bit, but not XDP_FLAGS_FRAGS_PF_MEMALLOC. So far, this has not caused an issue when building sk_buff from xdp_buff since all readers of xdp_buff->flags use the flag only when there are fragments. Clear the XDP_FLAGS_FRAGS_PF_MEMALLOC bit as well to make the flags correct. Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20250922233356.3356453-2-ameryhung@gmail.com	2025-09-23 13:35:11 -07:00
Anna Schumaker	cc6ac66f1c	SUNRPC: Update gssx_accept_sec_context() to use xdr_set_scratch_folio() This was the last caller of xdr_set_scratch_page(), so I remove this function while I'm at it. Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:29:50 -04:00
Anna Schumaker	d57e43b72b	SUNRPC: Update svcxdr_init_decode() to call xdr_set_scratch_folio() The only snag here is that __folio_alloc_node() doesn't handle NUMA_NO_NODE, so I also need to update svc_pool_map_get_node() to return numa_mem_id() instead. I arrived at this approach by looking at what other users of __folio_alloc_node() do for this case. Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:29:50 -04:00
Qianfeng Rong	040058a8f7	SUNRPC: Remove redundant __GFP_NOWARN GFP_NOWAIT already includes __GFP_NOWARN, so let's remove the redundant __GFP_NOWARN. Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:29:50 -04:00
Chuck Lever	62c0c0e749	SUNRPC: Move the svc_rpcb_cleanup() call sites Clean up: because svc_rpcb_cleanup() and svc_xprt_destroy_all() are always invoked in pairs, we can deduplicate code by moving the svc_rpcb_cleanup() call sites into svc_xprt_destroy_all(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Olga Kornievskaia <okorniev@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:28:19 -04:00
Jeff Layton	ec7d8e68ef	sunrpc: add a Kconfig option to redirect dfprintk() output to trace buffer We have a lot of old dprintk() call sites that aren't going anywhere anytime soon. At the same time, turning them up is a serious burden on the host due to the console locking overhead. Add a new Kconfig option that redirects dfprintk() output to the trace buffer. This is more efficient than logging to the console and allows for proper interleaving of dprintk and static tracepoint events. Since using trace_printk() causes scary warnings to pop at boot time, this new option defaults to "n". Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:28:19 -04:00
NeilBrown	3d18f80ce1	VFS: rename kern_path_locked() and related functions. kern_path_locked() is now only used to prepare for removing an object from the filesystem (and that is the only credible reason for wanting a positive locked dentry). Thus it corresponds to kern_path_create() and so should have a corresponding name. Unfortunately the name "kern_path_create" is somewhat misleading as it doesn't actually create anything. The recently added simple_start_creating() provides a better pattern I believe. The "start" can be matched with "end" to bracket the creating or removing. So this patch changes names: kern_path_locked -> start_removing_path kern_path_create -> start_creating_path user_path_create -> start_creating_user_path user_path_locked_at -> start_removing_user_path_at done_path_create -> end_creating_path and also introduces end_removing_path() which is identical to end_creating_path(). __start_removing_path (which was __kern_path_locked) is enhanced to call mnt_want_write() for consistency with the start_creating_path(). Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-23 12:37:36 +02:00
Julian Ruess	a612dbe8d0	dibs: Move event handling to dibs layer Add defines for all event types and subtypes an ism device is known to produce as it can be helpful for debugging purposes. Introduces a generic 'struct dibs_event' and adopt ism device driver and smc-d client accordingly. Tolerate and ignore other type and subtype values to enable future device extensions. SMC-D and ISM are now independent. struct ism_dev can be moved to drivers/s390/net/ism.h. Note that in smc, the term 'ism' is still used. Future patches could replace that with 'dibs' or 'smc-d' as appropriate. Signed-off-by: Julian Ruess <julianr@linux.ibm.com> Co-developed-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-15-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:22 +02:00
Alexandra Winter	cc21191b58	dibs: Move data path to dibs layer Use struct dibs_dmb instead of struct smc_dmb and move the corresponding client tables to dibs_dev. Leave driver specific implementation details like sba in the device drivers. Register and unregister dmbs via dibs_dev_ops. A dmb is dedicated to a single client, but a dibs device can have dmbs for more than one client. Trigger dibs clients via dibs_client_ops->handle_irq(), when data is received into a dmb. For dibs_loopback replace scheduling an smcd receive tasklet with calling dibs_client_ops->handle_irq(). For loopback devices attach_dmb(), detach_dmb() and move_data() need to access the dmb tables, so move those to dibs_dev_ops in this patch as well. Remove remaining definitions of smc_loopback as they are no longer required, now that everything is in dibs_loopback. Note that struct ism_client and struct ism_dev are still required in smc until a follow-on patch moves event handling to dibs. (Loopback does not use events). Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-14-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:22 +02:00
Alexandra Winter	719c3b67bb	dibs: Move query_remote_gid() to dibs_dev_ops Provide the dibs_dev_ops->query_remote_gid() in ism and dibs_loopback dibs_devices. And call it in smc dibs_client. Reviewed-by: Julian Ruess <julianr@linux.ibm.com> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-13-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:22 +02:00
Alexandra Winter	92a0f7bb08	dibs: Move vlan support to dibs_dev_ops It can be debated how much benefit definition of vlan ids for dibs devices brings, as the dmbs are accessible only by a single peer anyhow. But ism provides vlan support and smcd exploits it, so move it to dibs layer as an optional feature. smcd_loopback simply ignores all vlan settings, do the same in dibs_loopback. SMC-D and ISM have a method to use the invalid VLAN ID 1FFF (ISM_RESERVED_VLANID), to indicate that both communication peers support routable SMC-Dv2. Tolerate it in dibs, but move it to SMC only. Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-12-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:22 +02:00
Alexandra Winter	05e68d8ded	dibs: Local gid for dibs devices Define a uuid_t GID attribute to identify a dibs device. SMC uses 64 Bit and 128 Bit Global Identifiers (GIDs) per device, that need to be sent via the SMC protocol. Because the smc code uses integers, network endianness and host endianness need to be considered. Avoid this in the dibs layer by using uuid_t byte arrays. Future patches could change SMC to use uuid_t. For now conversion helper functions are introduced. ISM devices provide 64 Bit GIDs. Map them to dibs uuid_t GIDs like this: _________________________________________ \| 64 Bit ISM-vPCI GID \| 00000000_00000000 \| ----------------------------------------- If interpreted as UUID [1], this would be interpreted as the UUID variant, that is reserved for NCS backward compatibility. So it will not collide with UUIDs that were generated according to the standard. smc_loopback already uses version 4 UUIDs as 128 Bit GIDs, move that to dibs loopback. A temporary change to smc_lo_query_rgid() is required, that will be moved to dibs_loopback with a follow-on patch. Provide gid of a dibs device as sysfs read-only attribute. Link: https://datatracker.ietf.org/doc/html/rfc4122 [1] Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Julian Ruess <julianr@linux.ibm.com> Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-11-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:22 +02:00
Julian Ruess	8047373498	dibs: Create class dibs Create '/sys/class/dibs' to represent multiple kinds of dibs devices in sysfs. Show s390/ism devices as well as dibs_loopback devices. Show attribute fabric_id using dibs_ops.get_fabric_id(). This can help users understand which dibs devices are connected to the same fabric in different systems and which dibs devices are loopback devices (fabric_id 0xffff). Instead of using the same name as the pci device, give the ism devices their own readable names based on uid or fid from the HW definition. smc_loopback was never visible in sysfs. dibs_loopback is now represented as a virtual device. For the SMC feature "software defined pnet-id" either the ib device name or the PCI-ID (actually the parent device name) can be used for SMC-R entries. Mimic this behaviour for SMC-D, and check the parent device name as well. So device name or PCI-ID can be used for ism and device name can be used for dibs-loopback. Note that this: IB_DEVICE_NAME_MAX - 1 == smc_pnet_policy.[SMC_PNETID_IBNAME].len is the length of smcd_name. Future SW-pnetid cleanup patches could use a meaningful define, but that would touch too much unrelated code here. 
Examples: --------- ism before: > ls /sys/bus/pci/devices/0000:00:00.0/0000:00:00.0 uevent ism now: > ls /sys/bus/pci/devices/0000:00:00.0/dibs/ism30 device -> ../../../0000:00:00.0/ fabric_id subsystem -> ../../../../../class/dibs/ uevent dibs loopback: > ls /sys/devices/virtual/dibs/lo/ fabric_id subsystem -> ../../../../class/dibs/ uevent dibs class: > ls -l /sys/class/dibs/ ism30 -> ../../devices/pci0000:00/0000:00:00.0/dibs/ism30/ lo -> ../../devices/virtual/dibs/lo/ For comparison: > ls -l /sys/class/net/ enc8410 -> ../../devices/qeth/0.0.8410/net/enc8410/ ens1693 -> ../../devices/pci0001:00/0001:00:00.0/net/ens1693/ lo -> ../../devices/virtual/net/lo/ Signed-off-by: Julian Ruess <julianr@linux.ibm.com> Co-developed-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-10-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:22 +02:00
Julian Ruess	845c334a01	dibs: Move struct device to dibs_dev Move struct device from ism_dev and smc_lo_dev to dibs_dev, and define a corresponding release function. Free ism_dev in ism_remove() and smc_lo_dev in smc_lo_dev_remove(). Replace smcd->ops->get_dev(smcd) by using dibs->dev directly. An alternative design would be to embed dibs_dev as a field in ism_dev and do the same for other dibs device driver specific structs. However that would have the disadvantage that each dibs device driver needs to allocate dibs_dev and each dibs device driver needs a different device release function. The advantage would be that ism_dev and other device driver specific structs would be covered by device reference counts. Signed-off-by: Julian Ruess <julianr@linux.ibm.com> Co-developed-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-9-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:22 +02:00
Alexandra Winter	69baaac936	dibs: Define dibs_client_ops and dibs_dev_ops Move the device add() and remove() functions from ism_client to dibs_client_ops and call add_dev()/del_dev() for ism devices and dibs_loopback devices. dibs_client_ops->add_dev() = smcd_register_dev() for the smc_dibs_client. This is the first step to handle ism and loopback devices alike (as dibs devices) in the smc dibs client. Define dibs_dev->ops and move smcd_ops->get_chid to dibs_dev_ops->get_fabric_id() for ism and loopback devices. See below for why this needs to be in the same patch as dibs_client_ops->add_dev(). The following changes contain intermediate steps, that will be obsoleted by follow-on patches, once more functionality has been moved to dibs: Use different smcd_ops and max_dmbs for ism and loopback. Follow-on patches will change SMC-D to directly use dibs_ops instead of smcd_ops. In smcd_register_dev() it is now necessary to identify a dibs_loopback device before smcd_dev and smcd_ops->get_chid() are available. So provide dibs_dev_ops->get_fabric_id() in this patch and evaluate it in smc_ism_is_loopback(). Call smc_loopback_init() in smcd_register_dev() and call smc_loopback_exit() in smcd_unregister_dev() to handle the functionality that is still in smc_loopback. Follow-on patches will move all smc_loopback code to dibs_loopback. In smcd_[un]register_dev() use only ism device name, this will be replaced by dibs device name by a follow-on patch. End of changes with intermediate parts. Allocate an smcd event workqueue for all dibs devices, although dibs_loopback does not generate events. Use kernel memory instead of devres memory for smcd_dev and smcd->conn. Since commit `a72178cfe8` ("net/smc: Fix dependency of SMC on ISM") an ism device and its driver can have a longer lifetime than the smc module, so smc should not rely on devres to free its resources [1]. 
It is now the responsibility of the smc client to free smcd and smcd->conn for all dibs devices, ism devices as well as loopback. Call client->ops->del_dev() for all existing dibs devices in dibs_unregister_client(), so all device related structures can be freed in the client. When dibs_unregister_client() is called in the context of smc_exit() or smc_core_reboot_event(), these functions have already called smc_lgrs_shutdown() which calls smc_smcd_terminate_all(smcd) and sets going_away. This is done a second time in smcd_unregister_dev(). This is analogous to how smcr is handled in these functions, by calling first smc_lgrs_shutdown() and then smc_ib_unregister_client() > smc_ib_remove_dev(), so leave it that way. It may be worth investigating, whether smc_lgrs_shutdown() is still required or useful. Remove CONFIG_SMC_LO. CONFIG_DIBS_LO now controls whether a dibs loopback device exists or not. Link: https://www.kernel.org/doc/Documentation/driver-model/devres.txt [1] Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-8-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:22 +02:00
Alexandra Winter	d324a2ca3f	dibs: Register smc as dibs_client Formally register smc as dibs client. Functionality will be moved by follow-on patches from ism_client to dibs_client until eventually ism_client can be removed. As DIBS is only a shim layer without any dependencies, we can depend SMC on DIBS without adding indirect dependencies. A follow-on patch will remove dependency of SMC on ISM. Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Julian Ruess <julianr@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-5-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:21 +02:00
Alexandra Winter	35758b0032	dibs: Create drivers/dibs Create the file structure for a 'DIBS - Direct Internal Buffer Sharing' shim layer that will provide generic functionality and declarations for dibs device drivers and dibs clients. Following patches will add functionality. Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-4-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:21 +02:00
Alexandra Winter	a4997e17d1	net/smc: Decouple sf and attached send_buf in smc_loopback Before this patch there was the following assumption in smc_loopback.c>smc_lo_move_data(): sf (signalling flag) == 0 : data is already in an attached target dmb sf == 1 : data is not yet in the target dmb This is true for the 2 callers in smc client smcd_cdc_msg_send() : sf=1 smcd_tx_rdma_writes() : sf=0 but should not be a general assumption. Add a bool to struct smc_buf_desc to indicate whether an SMC-D sndbuf_desc is an attached buffer. Don't call move_data() for attached send_buffers, because it is not necessary. Move the data in smc_lo_move_data() if len != 0 and signal when requested. Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Link: https://patch.msgid.link/20250918110500.1731261-3-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:21 +02:00
Alexandra Winter	884eee8e43	net/smc: Remove error handling of unregister_dmb() smcd_buf_free() calls smc_ism_unregister_dmb(lgr->smcd, buf_desc) and then unconditionally frees buf_desc. Remove the cleaning up of fields of buf_desc in smc_ism_unregister_dmb(), because it is not helpful. This removes the only usage of ISM_ERROR from the smc module. So move it to drivers/s390/net/ism.h. Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Link: https://patch.msgid.link/20250918110500.1731261-2-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:21 +02:00
Jakub Sitnicki	d57f4b8749	tcp: Update bind bucket state on port release Today, once an inet_bind_bucket enters a state where fastreuse >= 0 or fastreuseport >= 0 after a socket is explicitly bound to a port, it remains in that state until all sockets are removed and the bucket is destroyed. In this state, the bucket is skipped during ephemeral port selection in connect(). For applications using a reduced ephemeral port range (IP_LOCAL_PORT_RANGE socket option), this can cause faster port exhaustion since blocked buckets are excluded from reuse. The reason the bucket state isn't updated on port release is unclear. Possibly a performance trade-off to avoid scanning bucket owners, or just an oversight. Fix it by recalculating the bucket state when a socket releases a port. To limit overhead, each inet_bind2_bucket stores its own (fastreuse, fastreuseport) state. On port release, only the relevant port-addr bucket is scanned, and the overall state is derived from these. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250917-update-bind-bucket-state-on-unhash-v5-1-57168b661b47@cloudflare.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 10:12:15 +02:00
Vincent Mailhol	c67732d067	can: annotate mtu accesses with READ_ONCE() As hinted in commit `501a90c945` ("inet: protect against too small mtu values."), net_device->mtu is vulnerable to race conditions if it is written and read without holding the RTNL. At the moment, all the writes are done while the interface is down, either in the devices' probe() function or in can_changelink(). So there are no such issues yet. But upcoming changes will allow modifying the MTU while the CAN XL devices are up. In preparation for the introduction of CAN XL, annotate all the net_device->mtu accesses which are not yet guarded by the RTNL with a READ_ONCE(). Note that all the write accesses are already either guarded by the RTNL or are already annotated and thus need no changes. Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Link: https://patch.msgid.link/20250923-can-fix-mtu-v3-1-581bde113f52@kernel.org Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2025-09-23 10:04:58 +02:00
Ryder Lee	17f34ab55a	wifi: cfg80211: fix width unit in cfg80211_radio_chandef_valid() The original code used nl80211_chan_width_to_mhz(), which returns the width in MHz. However, the expected unit is KHz. Fixes: `510dba80ed` ("wifi: cfg80211: add helper for checking if a chandef is valid on a radio") Signed-off-by: Ryder Lee <ryder.lee@mediatek.com> Link: https://patch.msgid.link/df54294e6c4ed0f3ceff6e818b710478ddfc62c0.1758579480.git.Ryder%20Lee%20ryder.lee@mediatek.com/ Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2025-09-23 09:50:02 +02:00


82231 Commits