linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-18 03:07:06 -04:00

Author	SHA1	Message	Date
Konstantin Khorenko	29b1ee8788	net: add noinline __init __no_profile to skb_extensions_init() for GCOV compatibility With -fprofile-update=atomic in global CFLAGS_GCOV, GCC still cannot constant-fold the skb_ext_total_length() loop when it is inlined into a profiled caller. The existing __no_profile on skb_ext_total_length() itself is insufficient because after __always_inline expansion the code resides in the caller's body, which still carries GCOV instrumentation. Mark skb_extensions_init() with __no_profile so the BUILD_BUG_ON checks can be evaluated at compile time. Also mark it noinline to prevent the compiler from inlining it into skb_init() (which lacks __no_profile), which would re-expose the function body to GCOV instrumentation. Add __init since skb_extensions_init() is only called from __init skb_init(). Previously it was implicitly inlined into the .init.text section; with noinline it would otherwise remain in permanent .text, wasting memory after boot. Build-tested with both CONFIG_GCOV_PROFILE_ALL=y and CONFIG_KCOV_INSTRUMENT_ALL=y. Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com> Link: https://patch.msgid.link/20260410162150.3105738-3-khorenko@virtuozzo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 15:29:02 -07:00
Konstantin Khorenko	c0b4382c86	net: fix skb_ext_total_length() BUILD_BUG_ON with CONFIG_GCOV_PROFILE_ALL When CONFIG_GCOV_PROFILE_ALL=y is enabled, the kernel fails to build: In file included from <command-line>: In function 'skb_extensions_init', inlined from 'skb_init' at net/core/skbuff.c:5214:2: ././include/linux/compiler_types.h:706:45: error: call to '__compiletime_assert_1490' declared with attribute error: BUILD_BUG_ON failed: skb_ext_total_length() > 255 CONFIG_GCOV_PROFILE_ALL adds -fprofile-arcs -ftest-coverage -fno-tree-loop-im to CFLAGS globally. GCC inserts branch profiling counters into the skb_ext_total_length() loop and, combined with -fno-tree-loop-im (which disables loop invariant motion), cannot constant-fold the result. BUILD_BUG_ON requires a compile-time constant and fails. The issue manifests in kernels with 5+ SKB extension types enabled (e.g., after addition of SKB_EXT_CAN, SKB_EXT_PSP). With 4 extensions GCC can still unroll and fold the loop despite GCOV instrumentation; with 5+ it gives up. Mark skb_ext_total_length() with __no_profile to prevent GCOV from inserting counters into this function. Without counters the loop is "clean" and GCC can constant-fold it even with -fno-tree-loop-im active. This allows BUILD_BUG_ON to work correctly while keeping GCOV profiling for the rest of the kernel. This also removes the CONFIG_KCOV_INSTRUMENT_ALL preprocessor guard introduced by `d6e5794b06`. That guard was added as a precaution because KCOV instrumentation was also suspected of inhibiting constant folding. However, KCOV uses -fsanitize-coverage=trace-pc, which inserts lightweight trace callbacks that do not interfere with GCC's constant folding or loop optimization passes. Only GCOV's -fprofile-arcs combined with -fno-tree-loop-im actually prevents the compiler from evaluating the loop at compile time. The guard is therefore unnecessary and can be safely removed. Fixes: `96ea3a1e2d` ("can: add CAN skb extension infrastructure") Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com> Reviewed-by: Thomas Weissschuh <linux@weissschuh.net> Link: https://patch.msgid.link/20260410162150.3105738-2-khorenko@virtuozzo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 15:29:02 -07:00
Jakub Kicinski	9336854a59	Merge branch 'net-reduce-sk_filter-and-friends-bloat' Eric Dumazet says: ==================== net: reduce sk_filter() (and friends) bloat Some functions return an error by value, and a drop_reason by an output parameter. This extra parameter can force stack canaries. A drop_reason is enough and more efficient. This series reduces bloat by 678 bytes on x86_64: $ scripts/bloat-o-meter -t vmlinux.old vmlinux.final add/remove: 0/0 grow/shrink: 3/18 up/down: 79/-757 (-678) Function old new delta vsock_queue_rcv_skb 50 79 +29 ipmr_cache_report 1290 1315 +25 ip6mr_cache_report 1322 1347 +25 tcp_v6_rcv 3169 3167 -2 packet_rcv_spkt 329 327 -2 unix_dgram_sendmsg 1731 1726 -5 netlink_unicast 957 945 -12 netlink_dump 1372 1359 -13 sk_filter_trim_cap 889 858 -31 netlink_broadcast_filtered 1633 1595 -38 tcp_v4_rcv 3152 3111 -41 raw_rcv_skb 122 80 -42 ping_queue_rcv_skb 109 61 -48 ping_rcv 215 162 -53 rawv6_rcv_skb 278 224 -54 __sk_receive_skb 690 632 -58 raw_rcv 591 527 -64 udpv6_queue_rcv_one_skb 935 869 -66 udp_queue_rcv_one_skb 919 853 -66 tun_net_xmit 1146 1074 -72 sock_queue_rcv_skb_reason 166 76 -90 Total: Before=29722890, After=29722212, chg -0.00% Future conversions from sock_queue_rcv_skb() to sock_queue_rcv_skb_reason() can be done later. ==================== Link: https://patch.msgid.link/20260409145625.2306224-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 14:30:28 -07:00
Eric Dumazet	fb37aea2a0	net: change sk_filter_trim_cap() to return a drop_reason by value Current return value can be replaced with the drop_reason, reducing kernel bloat: $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/2 grow/shrink: 1/11 up/down: 32/-603 (-571) Function old new delta tcp_v6_rcv 3135 3167 +32 unix_dgram_sendmsg 1731 1726 -5 netlink_unicast 957 945 -12 netlink_dump 1372 1359 -13 sk_filter_trim_cap 882 858 -24 tcp_v4_rcv 3143 3111 -32 __pfx_tcp_filter 32 - -32 netlink_broadcast_filtered 1633 1595 -38 sock_queue_rcv_skb_reason 126 76 -50 tun_net_xmit 1127 1074 -53 __sk_receive_skb 690 632 -58 udpv6_queue_rcv_one_skb 935 869 -66 udp_queue_rcv_one_skb 919 853 -66 tcp_filter 154 - -154 Total: Before=29722783, After=29722212, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260409145625.2306224-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 14:30:25 -07:00
Eric Dumazet	97449a5f1a	tcp: change tcp_filter() to return the reason by value sk_filter_trim_cap() will soon return the reason by value, do the same for tcp_filter(). Note: tcp_filter() is no longer inlined. Following patch will inline it again. $ scripts/bloat-o-meter -t vmlinux.4 vmlinux.5 add/remove: 2/0 grow/shrink: 0/2 up/down: 186/-43 (143) Function old new delta tcp_filter - 154 +154 __pfx_tcp_filter - 32 +32 tcp_v4_rcv 3152 3143 -9 tcp_v6_rcv 3169 3135 -34 Total: Before=29722640, After=29722783, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260409145625.2306224-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 14:30:25 -07:00
Eric Dumazet	c78bcbd519	net: change sk_filter_reason() to return the reason by value sk_filter_trim_cap will soon return the reason by value, do the same for sk_filter_reason(). $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-21 (-21) Function old new delta sock_queue_rcv_skb_reason 128 126 -2 tun_net_xmit 1146 1127 -19 Total: Before=29722661, After=29722640, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260409145625.2306224-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 14:30:25 -07:00
Eric Dumazet	734ea7e324	net: always set reason in sk_filter_trim_cap() sk_filter_trim_cap() will soon return the drop reason by value. Make sure *reason is cleared when no error is returned, to ease this conversion. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-7 (-7) Function old new delta sk_filter_trim_cap 889 882 -7 Total: Before=29722668, After=29722661, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260409145625.2306224-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 14:30:25 -07:00
Eric Dumazet	900f27fb79	net: change sock_queue_rcv_skb_reason() to return a drop_reason Change sock_queue_rcv_skb_reason() to return the drop_reason directly instead of using a reference. This is part of an effort to remove stack canaries and reduce bloat. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 3/7 up/down: 79/-301 (-222) Function old new delta vsock_queue_rcv_skb 50 79 +29 ipmr_cache_report 1290 1315 +25 ip6mr_cache_report 1322 1347 +25 packet_rcv_spkt 329 327 -2 sock_queue_rcv_skb_reason 166 128 -38 raw_rcv_skb 122 80 -42 ping_queue_rcv_skb 109 61 -48 ping_rcv 215 162 -53 rawv6_rcv_skb 278 224 -54 raw_rcv 591 527 -64 Total: Before=29722890, After=29722668, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260409145625.2306224-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 14:30:25 -07:00
Gal Pressman	8632175ccb	gre: Count GRE packet drops GRE is silently dropping packets without updating statistics. In case of drop, increment rx_dropped counter to provide visibility into packet loss. For the case where no GRE protocol handler is registered, use rx_nohandler. Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Nimrod Oren <noren@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20260409090945.1542440-1-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 12:33:33 -07:00
Hangbin Liu	b2fb1a3363	ethtool: strset: check nla_len overflow The netlink attribute length field nla_len is a __u16, which can only represent values up to 65535 bytes. NICs with a large number of statistics strings (e.g. mlx5_core with thousands of ETH_SS_STATS entries) can produce a ETHTOOL_A_STRINGSET_STRINGS nest that exceeds this limit. When nla_nest_end() writes the actual nest size back to nla_len, the value is silently truncated. This results in a corrupted netlink message being sent to userspace: the parser reads a wrong (truncated) attribute length and misaligns all subsequent attribute boundaries, causing decode errors. Fix this by using the new helper nla_nest_end_safe and error out if the size exceeds U16_MAX. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20260408-b4-ynl_ethtool-v2-5-7623a5e8f70b@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 11:23:50 -07:00
Joe Damato	82db77f6fb	net: tso: Introduce tso_dma_map and helpers Add struct tso_dma_map to tso.h for tracking DMA addresses of mapped GSO payload data and tso_dma_map_completion_state. The tso_dma_map combines DMA mapping storage with iterator state, allowing drivers to walk pre-mapped DMA regions linearly. Includes fields for the DMA IOVA path (iova_state, iova_offset, total_len) and a fallback per-region path (linear_dma, frags[], frag_idx, offset). The tso_dma_map_completion_state makes the IOVA completion state opaque for drivers. Drivers are expected to allocate this and use the added helpers to update the completion state. Adds skb_frag_phys() to skbuff.h, returning the physical address of a paged fragment's data, which is used by the tso_dma_map helpers introduced in this commit described below. The added TSO DMA map helpers are: tso_dma_map_init(): DMA-maps the linear payload region and all frags upfront. Prefers the DMA IOVA API for a single contiguous mapping with one IOTLB sync; falls back to per-region dma_map_phys() otherwise. Returns 0 on success, cleans up partial mappings on failure. tso_dma_map_cleanup(): Handles both IOVA and fallback teardown paths. tso_dma_map_count(): counts how many descriptors the next N bytes of payload will need. Returns 1 if IOVA is used since the mapping is contiguous. tso_dma_map_next(): yields the next (dma_addr, chunk_len) pair. On the IOVA path, each segment is a single contiguous chunk. On the fallback path, indicates when a chunk starts a new DMA mapping so the driver can set dma_unmap_len on that descriptor for completion-time unmapping. tso_dma_map_completion_save(): updates the completion state. Drivers will call this at xmit time. tso_dma_map_complete(): tears down the mapping at completion time and returns true if the IOVA path was used. If it was not used, this is a no-op and returns false. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20260408230607.2019402-2-joe@dama.to Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 10:54:31 -07:00
Luigi Leonardi	006679268a	vsock/virtio: remove unnecessary call to `virtio_transport_get_ops` `virtio_transport_send_pkt_info` gets all the transport information from the parameter `t_ops`. There is no need to call `virtio_transport_get_ops()`. Remove it. Acked-by: Arseniy Krasnov <avkrasnov@salutedevices.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260408-remove_parameter-v2-1-e00f31cf7a17@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 09:57:01 -07:00
Jiayuan Chen	5758be283f	net: skb: clean up dead code after skb_kfree_head() simplification Since commit `0f42e3f4fe` ("net: skb: fix cross-cache free of KFENCE-allocated skb head"), skb_kfree_head() always calls kfree() and no longer uses end_offset to distinguish between skb_small_head_cache and generic kmalloc caches. Clean up the leftovers: - Remove the unused end_offset parameter from skb_kfree_head() and update all callers. - Remove the SKB_SMALL_HEAD_HEADROOM guard in __skb_unclone_keeptruesize() which was protecting the old skb_kfree_head() logic. - Update the SKB_SMALL_HEAD_CACHE_SIZE comment to reflect that the non-power-of-2 sizing is no longer used for free-path disambiguation. No functional change. Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260410034736.297900-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 09:43:49 -07:00
Jakub Kicinski	03a1569c2b	Merge tag 'nf-next-26-04-10' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter: updates for net-next 1-3) IPVS updates from Julian Anastasov to enhance visibility into IPVS internal state by exposing hash size, load factor etc and allows userspace to tune the load factor used for resizing hash tables. 4) reject empty/not nul terminated device names from xt_physdev. This isn't a bug fix; existing code doesn't require a c-string. But clean this up anyway because conceptually the interface name definitely should be a c-string. 5) Switch nfnetlink to skb_mac_header helpers that didn't exist back when this code was written. This gives us additional debug checks but is not intended to change functionality. 6) Let the xt ttl/hoplimit match reject unknown operator modes. This is a cleanup, the evaluation function simply returns false when the mode is out of range. From Marino Dzalto. 7) xt_socket match should enable defrag after all other checks. This bug is harmless, historically defrag could not be disabled either except by rmmod. 8) remove UDP-Lite conntrack support, from Fernando Fernandez Mancera. 9) Avoid a couple -Wflex-array-member-not-at-end warnings in the old xtables 32bit compat code, from Gustavo A. R. Silva. 10) nftables fwd expression should drop packets when their ttl/hl has expired. This is a bug fix deferred, its not deemed important enough for -rc8. 11) Add additional checks before assuming the mac header is an ethernet header, from Zhengchuan Liang. * tag 'nf-next-26-04-10' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: require Ethernet MAC header before using eth_hdr() netfilter: nft_fwd_netdev: check ttl/hl before forwarding netfilter: x_tables: Avoid a couple -Wflex-array-member-not-at-end warnings netfilter: conntrack: remove UDP-Lite conntrack support netfilter: xt_socket: enable defrag after all other checks netfilter: xt_HL: add pr_fmt and checkentry validation netfilter: nfnetlink: prefer skb_mac_header helpers netfilter: x_physdev: reject empty or not-nul terminated device names ipvs: add conn_lfactor and svc_lfactor sysctl vars ipvs: add ip_vs_status info ipvs: show the current conn_tab size to users ==================== Link: https://patch.msgid.link/20260410112352.23599-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 09:39:21 -07:00
Jakub Kicinski	118cbd428e	Merge tag 'wireless-next-2026-04-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Johannes Berg says: ==================== Final updates, notably: - crypto: move Michael MIC code into wireless (only) - mac80211: - multi-link 4-addr support - NAN data support (but no drivers yet) - ath10k: DT quirk to make it work on some devices - ath12k: IPQ5424 support - rtw89: USB improvements for performance * tag 'wireless-next-2026-04-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (124 commits) wifi: cfg80211: Explicitly include <linux/export.h> in michael-mic.c wifi: ath10k: Add device-tree quirk to skip host cap QMI requests dt-bindings: wireless: ath10k: Add quirk to skip host cap QMI requests crypto: Remove michael_mic from crypto_shash API wifi: ipw2x00: Use michael_mic() from cfg80211 wifi: ath12k: Use michael_mic() from cfg80211 wifi: ath11k: Use michael_mic() from cfg80211 wifi: mac80211, cfg80211: Export michael_mic() and move it to cfg80211 wifi: ipw2x00: Rename michael_mic() to libipw_michael_mic() wifi: libertas_tf: refactor endpoint lookup wifi: libertas: refactor endpoint lookup wifi: at76c50x: refactor endpoint lookup wifi: ath12k: Enable IPQ5424 WiFi device support wifi: ath12k: Add CE remap hardware parameters for IPQ5424 wifi: ath12k: add ath12k_hw_regs for IPQ5424 wifi: ath12k: add ath12k_hw_version_map entry for IPQ5424 wifi: ath12k: Add ath12k_hw_params for IPQ5424 dt-bindings: net: wireless: add ath12k wifi device IPQ5424 wifi: ath10k: fix station lookup failure during disconnect wifi: ath12k: Create symlink for each radio in a wiphy ... ==================== Link: https://patch.msgid.link/20260410064703.735099-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 09:17:42 -07:00
Eric Dumazet	29703d7813	tcp: add indirect call wrapper in tcp_conn_request() Small improvement in SYN processing, to directly call tcp_v6_init_seq_and_ts_off() or tcp_v4_init_seq_and_ts_off(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260410174950.745670-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 09:17:03 -07:00
Daniel Borkmann	59818773ba	net: Rename ifq_idx to rxq_idx in netif_mp_* helpers Rename the leftover ifq_idx parameter naming to rxq_idx to be consistent with the rest of the file and the header declaration. Back then this was taken out of the queue leasing series given the cleanup is independent. No functional change. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/netdev/20260131160237.07789674@kernel.org Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260410130602.552600-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 09:12:07 -07:00
Jakub Kicinski	0aa72fc37e	net: fix reference tracker mismanagement in netdev_put_lock() dev_put() releases a reference which didn't have a tracker. References without a tracker are accounted in the tracking code as "no_tracker". We can't free the tracker and then call dev_put(). The references themselves will be fine but the tracking code will think it's a double-release: refcount_t: decrement hit 0; leaking memory. IOW commit under fixes confused dev_put() (release never tracked reference) with __dev_put() (just release the reference, skipping the reference tracking infra). Since __netdev_put_lock() uses dev_put() we can't feed a previously tracked netdev ref into it. Let's flip things around. netdev_put(dev, NULL) is the same as dev_put(dev) so make netdev_put_lock() the real function and have __netdev_put_lock() feed it a NULL tracker for all the cases that were untracked. Fixes: `d04686d9bc` ("net: Implement netdev_nl_queue_create_doit") Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://patch.msgid.link/20260410153600.1984522-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 09:08:43 -07:00
Eric Dumazet	f5148298b0	tcp: return a drop_reason from tcp_add_backlog() Part of a stack canary removal from tcp_v{4,6}_rcv(). Return a drop_reason instead of a boolean, so that we no longer have to pass the address of a local variable. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-37 (-37) Function old new delta tcp_v6_rcv 3133 3129 -4 tcp_v4_rcv 3206 3202 -4 tcp_add_backlog 1281 1252 -29 Total: Before=25567186, After=25567149, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260409101147.1642967-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 09:07:53 -07:00
Chris J Arges	b258cba1e0	net: Add net_cookie to Dead loop messages Network devices can have the same name within different network namespaces. To help distinguish these devices, add the net_cookie value which can be used to identify the netns. Signed-off-by: Chris J Arges <carges@cloudflare.com> Link: https://patch.msgid.link/20260408191056.1036330-1-carges@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 09:05:02 -07:00
Luiz Angelo Daros de Luca	82f37bd9a4	net: dsa: tag_rtl8_4: set KEEP flag KEEP=1 is needed because we should respect the format of the packet as the kernel sends it to us. Unless tx forward offloading is used, the kernel is giving us the packet exactly as it should leave the specified port on the wire. Until now this was not needed because the ports were always functioning in a standalone mode in a VLAN-unaware way, so the switch would not tag or untag frames anyway. But arguably it should have been KEEP=1 all along. Co-developed-by: Alvin Šipraga <alsi@bang-olufsen.dk> Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk> Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Reviewed-by: Linus Walleij <linusw@kernel.org> Link: https://patch.msgid.link/20260408-realtek_fixes-v1-2-915ff1404d56@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 09:03:55 -07:00
Alvin Šipraga	297e1f411e	net: dsa: tag_rtl8_4: update format description Document the updated tag layout fields (EFID, VSEL/VIDX) and clarify which bits are set/cleared when emitting tags. Co-developed-by: Alvin Šipraga <alsi@bang-olufsen.dk> Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk> Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Reviewed-by: Linus Walleij <linusw@kernel.org> Link: https://patch.msgid.link/20260408-realtek_fixes-v1-1-915ff1404d56@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 09:03:54 -07:00
Andy Roulin	54fc83a172	net: bridge: add stp_mode attribute for STP mode selection The bridge-stp usermode helper is currently restricted to the initial network namespace, preventing userspace STP daemons (e.g. mstpd) from operating on bridges in other network namespaces. Since commit `ff62198553` ("bridge: Only call /sbin/bridge-stp for the initial network namespace"), bridges in non-init namespaces silently fall back to kernel STP with no way to use userspace STP. Add a new bridge attribute IFLA_BR_STP_MODE that allows explicit per-bridge control over STP mode selection: BR_STP_MODE_AUTO (default) - Existing behavior: invoke the /sbin/bridge-stp helper in init_net only; fall back to kernel STP if it fails or in non-init namespaces. BR_STP_MODE_USER - Directly enable userspace STP (BR_USER_STP) without invoking the helper. Works in any network namespace. Userspace is responsible for ensuring an STP daemon manages the bridge. BR_STP_MODE_KERNEL - Directly enable kernel STP (BR_KERNEL_STP) without invoking the helper. The mode can only be changed while STP is disabled, or set to the same value (-EBUSY otherwise). IFLA_BR_STP_MODE is processed before IFLA_BR_STP_STATE in br_changelink(), so both can be set atomically in a single netlink message. The mode can also be changed in the same message that disables STP. The stp_mode struct field is u8 since all possible values fit, while NLA_U32 is used for the netlink attribute since it occupies the same space in the netlink message as NLA_U8. A new stp_helper_active boolean tracks whether the /sbin/bridge-stp helper was invoked during br_stp_start(), so that br_stp_stop() only calls the helper for stop when it was called for start. This avoids calling the helper asymmetrically when stp_mode changes between start and stop. Suggested-by: Ido Schimmel <idosch@nvidia.com> Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Andy Roulin <aroulin@nvidia.com> Link: https://patch.msgid.link/20260405205224.3163000-2-aroulin@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-10 15:52:24 -07:00
Zhengchuan Liang	62443dc211	netfilter: require Ethernet MAC header before using eth_hdr() `ip6t_eui64`, `xt_mac`, the `bitmap:ip,mac`, `hash:ip,mac`, and `hash:mac` ipset types, and `nf_log_syslog` access `eth_hdr(skb)` after either assuming that the skb is associated with an Ethernet device or checking only that the `ETH_HLEN` bytes at `skb_mac_header(skb)` lie between `skb->head` and `skb->data`. Make these paths first verify that the skb is associated with an Ethernet device, that the MAC header was set, and that it spans at least a full Ethernet header before accessing `eth_hdr(skb)`. Suggested-by: Florian Westphal <fw@strlen.de> Tested-by: Ren Wei <enjou1224z@gmail.com> Signed-off-by: Zhengchuan Liang <zcliangcn@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-04-10 12:16:27 +02:00
Florian Westphal	1dfd95bdf4	netfilter: nft_fwd_netdev: check ttl/hl before forwarding Drop packets if their ttl/hl is too small for forwarding. Fixes: `d32de98ea7` ("netfilter: nft_fwd_netdev: allow to forward packets via neighbour layer") Signed-off-by: Florian Westphal <fw@strlen.de>	2026-04-10 12:16:27 +02:00
Gustavo A. R. Silva	f30e5a7291	netfilter: x_tables: Avoid a couple -Wflex-array-member-not-at-end warnings -Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it, globally. Use the TRAILING_OVERLAP() helper to fix the following warnings: 1 net/netfilter/x_tables.c:816:39: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] 1 net/netfilter/x_tables.c:811:39: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] This helper creates a union between a flexible-array member (FAM) and a set of members that would otherwise follow it. This overlays the trailing members onto the FAM while preserving the original memory layout. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-04-10 12:16:27 +02:00
Fernando Fernandez Mancera	84dee05d9d	netfilter: conntrack: remove UDP-Lite conntrack support UDP-Lite (RFC 3828) socket support was recently retired from the core networking stack. As a follow-up of that, drop the connection tracker and NAT support for UDP-Lite in Netfilter. This patch removes CONFIG_NF_CT_PROTO_UDPLITE and scrubs UDP-Lite awareness from the conntrack core, NAT core, nft_ct, and ctnetlink. Please note that stateless packet inspection, matching, ipsets or logging support for IPPROTO_UDPLITE is preserved. As conntrack no longer extracts UDP-Lite ports or tracks its L4 state, when performing NAT the UDP-Lite checksum cannot be updated anymore. That is an expected and acceptable consequence of removing UDP-Lite conntrack module. Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-04-10 12:16:26 +02:00
Florian Westphal	542be3fa5a	netfilter: xt_socket: enable defrag after all other checks Originally this did not matter because defrag was enabled once per netns and only disabled again on netns dismantle. When this got changed I should have adjusted checkentry to not leave defrag enabled on error. Fixes: `de8c12110a` ("netfilter: disable defrag once its no longer needed") Signed-off-by: Florian Westphal <fw@strlen.de>	2026-04-10 12:16:26 +02:00
Marino Dzalto	24bd5c2679	netfilter: xt_HL: add pr_fmt and checkentry validation Add pr_fmt to prefix log messages with the module name for easier debugging in dmesg. Add checkentry functions for IPv4 (ttl_mt_check) and IPv6 (hl_mt6_check) to validate the match mode at rule registration time, rejecting invalid modes with -EINVAL. The evaluation function returns false in case the mode is unknown, so this is a cleanup, not a bug fix. Signed-off-by: Marino Dzalto <marino.dzalto@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-04-10 12:16:26 +02:00
Florian Westphal	74feb7d373	netfilter: nfnetlink: prefer skb_mac_header helpers This adds implicit DEBUG_WARN_ON_ONCE for debug configurations. No other changes intended. Signed-off-by: Florian Westphal <fw@strlen.de>	2026-04-10 12:16:26 +02:00
Florian Westphal	8df772afc9	netfilter: x_physdev: reject empty or not-nul terminated device names Reject names that lack a \0 character and reject the empty string as well. iptables allows this but it fails to re-parse iptables-save output that contain such rules. Signed-off-by: Florian Westphal <fw@strlen.de>	2026-04-10 12:16:26 +02:00
Julian Anastasov	8d7de5477e	ipvs: add conn_lfactor and svc_lfactor sysctl vars Allow the default load factor for the connection and service tables to be configured. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-04-10 12:16:26 +02:00
Julian Anastasov	9a9ccef907	ipvs: add ip_vs_status info Add /proc/net/ip_vs_status to show current state of IPVS. The motivation for this new /proc interface is to provide the output for the users to help them decide when to tune the load factor for hash tables, which is possible with the new sysctl knobs coming in followup patch. The output also includes information for the kthreads used for stats. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-04-10 12:16:25 +02:00
Julian Anastasov	22e620fe84	ipvs: show the current conn_tab size to users As conn_tab is per-net, better to show the current hash table size to users instead of the ip_vs_conn_tab_size (max). Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-04-10 12:16:25 +02:00
Yue Haibing	3c6132ccc5	ipv6: sit: remove redundant ret = 0 assignment The variable ret is assigned a value at all places where it is used; There is no need to assign a value when it is initially defined. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://patch.msgid.link/20260408032051.3096449-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 20:37:40 -07:00
Paolo Abeni	8e6405f821	ipv6: move IFA_F_PERMANENT percpu allocation in process scope Observed at boot time: CPU: 43 UID: 0 PID: 3595 Comm: (t-daemon) Not tainted 6.12.0 #1 Call Trace: <TASK> dump_stack_lvl+0x4e/0x70 pcpu_alloc_noprof.cold+0x1f/0x4b fib_nh_common_init+0x4c/0x110 fib6_nh_init+0x387/0x740 ip6_route_info_create+0x46d/0x640 addrconf_f6i_alloc+0x13b/0x180 addrconf_permanent_addr+0xd0/0x220 addrconf_notify+0x93/0x540 notifier_call_chain+0x5a/0xd0 __dev_notify_flags+0x5c/0xf0 dev_change_flags+0x54/0x70 do_setlink+0x36c/0xce0 rtnl_setlink+0x11f/0x1d0 rtnetlink_rcv_msg+0x142/0x3f0 netlink_rcv_skb+0x50/0x100 netlink_unicast+0x242/0x390 netlink_sendmsg+0x21b/0x470 __sys_sendto+0x1dc/0x1f0 __x64_sys_sendto+0x24/0x30 do_syscall_64+0x7d/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f5c3852f127 Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 80 3d 85 ef 0c 00 00 41 89 ca 74 10 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 71 c3 55 48 83 ec 30 44 89 4c 24 2c 4c 89 44 RSP: 002b:00007ffe86caf4c8 EFLAGS: 00000202 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 0000556c5cd93210 RCX: 00007f5c3852f127 RDX: 0000000000000020 RSI: 0000556c5cd938b0 RDI: 0000000000000003 RBP: 00007ffe86caf5a0 R08: 00007ffe86caf4e0 R09: 0000000000000080 R10: 0000000000000000 R11: 0000000000000202 R12: 0000556c5cd932d0 R13: 00000000021d05d1 R14: 00000000021d05d1 R15: 0000000000000001 IFA_F_PERMANENT addresses require the allocation of a bunch of percpu pointers, currently in atomic scope. Similar to commit `51454ea42c` ("ipv6: fix locking issues with loops over idev->addr_list"), move fixup_permanent_addr() outside the &idev->lock scope, and do the allocations with GFP_KERNEL. With such change fixup_permanent_addr() is invoked with the BH enabled, and the ifp lock acquired there needs the BH variant. Note that we don't need to acquire a reference to the permanent addresses before releasing the mentioned write lock, because addrconf_permanent_addr() runs under RTNL and ifa removal always happens under RTNL, too. Also the PERMANENT flag is constant in the relevant scope, as it can be cleared only by inet6_addr_modify() under the RTNL lock. Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/46a7a030727e236af2dc7752994cd4f04f4a91d2.1775658924.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 19:32:45 -07:00
David Carlier	9addea5d44	net: use get_random_u{16,32,64}() where appropriate Use the typed random integer helpers instead of get_random_bytes() when filling a single integer variable. The helpers return the value directly, require no pointer or size argument, and better express intent. Skipped sites writing into __be16 (netdevsim) and __le64 (ceph) fields where a direct assignment would trigger sparse endianness warnings. Signed-off-by: David Carlier <devnexen@gmail.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260407150758.5889-1-devnexen@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 19:27:43 -07:00
Jakub Kicinski	581d28606c	net: remove the netif_get_rx_queue_lease_locked() helpers The netif_get_rx_queue_lease_locked() API hides the locking and the descend onto the leased queue. Making the code harder to follow (at least to me). Remove the API and open code the descend a bit. Most of the code now looks like: if (!leased) return __helper(x); hw_rxq = .. netdev_lock(hw_rxq->dev); ret = __helper(x); netdev_unlock(hw_rxq->dev); return ret; Of course if we have more code paths that need the wrapping we may need to revisit. For now, IMHO, having to know what netif_get_rx_queue_lease_locked() does is not worth the 20LoC it saves. Link: https://patch.msgid.link/20260408151251.72bd2482@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 18:26:28 -07:00
Jakub Kicinski	1508922588	Merge branch 'netkit-support-for-io_uring-zero-copy-and-af_xdp' Daniel Borkmann says: ==================== netkit: Support for io_uring zero-copy and AF_XDP Containers use virtual netdevs to route traffic from a physical netdev in the host namespace. They do not have access to the physical netdev in the host and thus can't use memory providers or AF_XDP that require reconfiguring/restarting queues in the physical netdev. This patchset adds the concept of queue leasing to virtual netdevs that allow containers to use memory providers and AF_XDP at native speed. Leased queues are bound to a real queue in a physical netdev and act as a proxy. Memory providers and AF_XDP operations take an ifindex and queue id, so containers would pass in an ifindex for a virtual netdev and a queue id of a leased queue, which then gets proxied to the underlying real queue. We have implemented support for this concept in netkit and tested the latter against Nvidia ConnectX-6 (mlx5) as well as Broadcom BCM957504 (bnxt_en) 100G NICs. For more details see the individual patches. ==================== Link: https://patch.msgid.link/20260402231031.447597-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 18:24:35 -07:00
Daniel Borkmann	2544447057	netkit: Add netkit notifier to check for unregistering devices Add a netdevice notifier in netkit to watch for NETDEV_UNREGISTER events. If the target device is indeed NETREG_UNREGISTERING and previously leased a queue to a netkit device, then collect the related netkit devices and batch-unregister_netdevice_many() them. If this were not done, then the netkit device would hold a reference on the physical device preventing it from going away. However, in case of both io_uring zero-copy as well as AF_XDP this situation is handled gracefully and the allocated resources are torn down. In the case where mentioned infra is used through netkit, the applications have a reference on netkit, and netkit in turn holds a reference on the physical device. In order to have netkit release the reference on the physical device, we need such watcher to then unregister the netkit ones. This is generally quite similar to the dependency handling in case of tunnels (e.g. vxlan bound to a underlying netdev) where the tunnel device gets removed along with the physical device. # ip a [...] 4: enp10s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000 link/ether e8:eb:d3:a3:43:f6 brd ff:ff:ff:ff:ff:ff inet 10.0.0.2/24 scope global enp10s0f0np0 valid_lft forever preferred_lft forever [...] 8: nk@NONE: <BROADCAST,MULTICAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff [...] # rmmod mlx5_ib # rmmod mlx5_core [...] [ 309.261822] mlx5_core 0000:0a:00.0 mlx5_0: Port: 1 Link DOWN [ 344.235236] mlx5_core 0000:0a:00.1: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), necvfs(0), active vports(0) [ 344.246948] mlx5_core 0000:0a:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0) [ 344.463754] mlx5_core 0000:0a:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0) [ 344.770155] mlx5_core 0000:0a:00.1: E-Switch: cleanup [...] # ip a [...] [ both enp10s0f0np0 and nk gone ] [...] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402231031.447597-13-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 18:21:47 -07:00
Daniel Borkmann	910f636db9	xsk: Proxy pool management for leased queues Similarly to the netif_mp_{open,close}_rxq handling for leased queues, proxy the xsk_{reg,clear}_pool_at_qid via netif_get_rx_queue_lease_locked such that in case a virtual netdev picked a leased rxq, the request gets through to the real rxq in the physical netdev. The proxying is only relevant for queue_id < dev->real_num_rx_queues since right now it's only supported for rxqs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402231031.447597-10-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 18:21:47 -07:00
Daniel Borkmann	9368397fb9	xsk: Extend xsk_rcv_check validation xsk_rcv_check tests for inbound packets to see whether they match the bound AF_XDP socket. Refactor the test into a small helper xsk_dev_queue_valid and move the validation against xs->dev and xs->queue_id there. The fast-path case stays in place and allows for quick return in xsk_dev_queue_valid. If it fails, the validation is extended to check whether the AF_XDP socket is bound against a leased queue, and if so, the test is redone. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402231031.447597-9-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 18:21:46 -07:00
David Wei	222b5566a0	net: Proxy netdev_queue_get_dma_dev for leased queues Extend netdev_queue_get_dma_dev to return the physical device of the real rxq for DMA in case the queue was leased. This allows memory providers like io_uring zero-copy or devmem to bind to the physically leased rxq via virtual devices such as netkit. Signed-off-by: David Wei <dw@davidwei.uk> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402231031.447597-8-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 18:21:46 -07:00
David Wei	5602ad61eb	net: Proxy netif_mp_{open,close}_rxq for leased queues When a process in a container wants to setup a memory provider, it will use the virtual netdev and a leased rxq, and call netif_mp_{open,close}_rxq to try and restart the queue. At this point, proxy the queue restart on the real rxq in the physical netdev. For memory providers (io_uring zero-copy rx and devmem), it causes the real rxq in the physical netdev to be filled from a memory provider that has DMA mapped memory from a process within a container. Signed-off-by: David Wei <dw@davidwei.uk> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402231031.447597-7-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 18:21:46 -07:00
Daniel Borkmann	1e91c98bc9	net: Slightly simplify net_mp_{open,close}_rxq net_mp_open_rxq is currently not used in the tree as all callers are using __net_mp_open_rxq directly, and net_mp_close_rxq is only used once while all other locations use __net_mp_close_rxq. Consolidate into a single API, netif_mp_{open,close}_rxq, using the netif_ prefix to indicate that the caller is responsible for locking. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402231031.447597-6-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 18:21:46 -07:00
Daniel Borkmann	22fdf28f7c	net, ethtool: Disallow leased real rxqs to be resized Similar to AF_XDP, do not allow queues in a physical netdev to be resized by ethtool -L when they are leased. Cover channel resize paths (both netlink and ioctl) to reject resizing when the queues would be affected. Given we need to have different checks for RX vs TX, detangle the code into a two-loop version rather than the range of new_combined + min(new_rx, new_tx) to old_combined + max(old_rx, old_tx). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402231031.447597-5-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 18:21:46 -07:00
Daniel Borkmann	21d58b35e5	net: Add lease info to queue-get response Populate nested lease info to the queue-get response that returns the ifindex, queue id with type and optionally netns id if the device resides in a different netns. Example with ynl client when using AF_XDP via queue leasing: # ip a [...] 4: enp10s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp/id:24 qdisc mq state UP group default qlen 1000 link/ether e8:eb:d3:a3:43:f6 brd ff:ff:ff:ff:ff:ff inet 10.0.0.2/24 scope global enp10s0f0np0 valid_lft forever preferred_lft forever inet6 fe80::eaeb:d3ff:fea3:43f6/64 scope link proto kernel_ll valid_lft forever preferred_lft forever [...] # ethtool -i enp10s0f0np0 driver: mlx5_core [...] # ynl --family netdev --output-json --do queue-get \ --json '{"ifindex": 4, "id": 15, "type": "rx"}' {'id': 15, 'ifindex': 4, 'lease': {'ifindex': 8, 'netns-id': 0, 'queue': {'id': 1, 'type': 'rx'}}, 'napi-id': 8227, 'type': 'rx', 'xsk': {}} # ip netns list foo (id: 0) # ip netns exec foo ip a [...] 8: nk@NONE: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff inet6 fe80::200:ff:fe00:0/64 scope link proto kernel_ll valid_lft forever preferred_lft forever [...] # ip netns exec foo ethtool -i nk driver: netkit [...] # ip netns exec foo ls /sys/class/net/nk/queues/ rx-0 rx-1 tx-0 # ip netns exec foo ynl --family netdev --output-json --do queue-get \ --json '{"ifindex": 8, "id": 1, "type": "rx"}' {"id": 1, "type": "rx", "ifindex": 8, "xsk": {}} Note that the caller of netdev_nl_queue_fill_one() holds the netdevice lock. For the queue-get we do not lock both devices. When queues get {un,}leased, both devices are locked, thus if __netif_get_rx_queue_lease() returns a lease pointer, it points to a valid device. The netns-id is fetched via peernet2id_alloc() similarly as done in OVS. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402231031.447597-4-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 18:21:46 -07:00
Daniel Borkmann	d04686d9bc	net: Implement netdev_nl_queue_create_doit Implement netdev_nl_queue_create_doit which creates a new rx queue in a virtual netdev and then leases it to a rx queue in a physical netdev. Example with ynl client: # ynl --family netdev --output-json --do queue-create \ --json '{"ifindex": 8, "type": "rx", "lease": {"ifindex": 4, "queue": {"type": "rx", "id": 15}}}' {'id': 1} Note that the netdevice locking order is always from the virtual to the physical device. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402231031.447597-3-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 18:21:45 -07:00
Daniel Borkmann	7789c6bb76	net: Add queue-create operation Add a ynl netdev family operation called queue-create that creates a new queue on a netdevice: name: queue-create attribute-set: queue flags: [admin-perm] do: request: attributes: - ifindex - type - lease reply: &queue-create-op attributes: - id This is a generic operation such that it can be extended for various use cases in future. Right now it is mandatory to specify ifindex, the queue type which is enforced to rx and a lease. The newly created queue id is returned to the caller. A queue from a virtual device can have a lease which refers to another queue from a physical device. This is useful for memory providers and AF_XDP operations which take an ifindex and queue id to allow applications to bind against virtual devices in containers. The lease couples both queues together and allows to proxy the operations from a virtual device in a container to the physical device. In future, the nested lease attribute can be lifted and made optional for other use-cases such as dynamic queue creation for physical netdevs. The lack of lease and the specification of the physical device as an ifindex will imply that we need a real queue to be allocated. Similarly, the queue type enforcement to rx can then be lifted as well to support tx. An early implementation had only driver-specific integration [0], but in order for other virtual devices to reuse, it makes sense to have this as a generic API in core net. For leasing queues, the virtual netdev must have real_num_rx_queues less than num_rx_queues at the time of calling queue-create. The queue-type must be rx as only rx queues are supported for leasing for now. We also enforce that the queue-create ifindex must point to a virtual device, and that the nested lease attribute's ifindex must point to a physical device. The nested lease attribute set contains a netns-id attribute which is optional and can specify a netns-id relative to the caller's netns. It requires cap_net_admin and if the netns-id attribute is not specified, the lease ifindex will be retrieved from the current netns. Also, it is modeled as an s32 type similarly as done elsewhere in the stack. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://bpfconf.ebpf.io/bpfconf2025/bpfconf2025_material/lsfmmbpf_2025_netkit_borkmann.pdf [0] Link: https://patch.msgid.link/20260402231031.447597-2-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 18:21:45 -07:00
Jakub Kicinski	b6e39e4846	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-7.0-rc8). Conflicts: net/ipv6/seg6_iptunnel.c `c3812651b5` ("seg6: separate dst_cache for input and output paths in seg6 lwtunnel") `78723a62b9` ("seg6: add per-route tunnel source address") https://lore.kernel.org/adZhwtOYfo-0ImSa@sirena.org.uk net/ipv4/icmp.c `fde29fd934` ("ipv4: icmp: fix null-ptr-deref in icmp_build_probe()") `d98adfbdd5` ("ipv4: drop ipv6_stub usage and use direct function calls") https://lore.kernel.org/adO3dccqnr6j-BL9@sirena.org.uk Adjacent changes: drivers/net/ethernet/stmicro/stmmac/chain_mode.c `51f4e090b9` ("net: stmmac: fix integer underflow in chain mode") `6b4286e055` ("net: stmmac: rename STMMAC_GET_ENTRY() -> STMMAC_NEXT_ENTRY()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 13:20:59 -07:00

1 2 3 4 5 ...

83842 Commits