linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-16 09:02:21 -04:00

Author	SHA1	Message	Date
Jakub Kicinski	28d0060632	Merge tag 'nf-26-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following batch contains Netfilter fixes for net: 1) Allow initial x_tables table replacement without emitting an audit log message. Delay the register message until after hooks are wired up to avoid unnecessary unregister logs during error unwinding. 2) Fix a NULL dereference by allocating hook ops before adding the table to the per-netns list. Use `synchronize_rcu()` during error unwinding to ensure the table stops processing packets before teardown. Defer audit log register message until all operations succeed. 3) Refactor xtables to use a single `xt_unregister_table_pre_exit` function. Eliminate code duplication by centralizing table unregistration logic within the xtables core. ebtables cannot be changed due to incompatibility. 4) Unregister xtables templates before module removal. This prevents a race condition where userspace instantiates a new table after the pernet unreg removed the current table. 5) Add `xtables_unregister_table_exit` to fully unregister netfilter tables during module removal. Unlink the table from dying lists, then free hook operations. 6) Implement a two-stage removal scheme for ebtables following the x_tables pattern. Assign table->ops while holding the ebt mutex to prevent exposing partially-filled structures. 7) Fix ebtables module initialization race. Register the template last in table initialization functions. Prevent table instantiation before pernet operations are available. 8) Fix a race condition in x_tables module initialization. Ensure pernet ops are fully set up before exposing the table to userspace. 9) Fix a race condition in ebtables module initialization, similar to previous patch. 10) Restore propagation of helper to expected connection, this is a fix-for-recent-fix. 11) Validate that the expectation tuple and mask netlink attributes are present when adding expectation via nfqueue, this fixes a possible null-ptr-deref. 12) Fix possible rare memleak in the SIP helper in case helper has been detached from conntrack entry, from Li Xiasong. 13) Fix refcount leak in nft_ct when creating custom expectation, also from Li Xiason. Patches 1-9 from Florian Westphal. 10) Restore propagation of helper to expected connection, this is a fix-for-recent-fix. 11) Check that tuple and mask netlink attributes are set when creating an expectation via nfqueue. * tag 'nf-26-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_ct: fix missing expect put in obj eval netfilter: nf_conntrack_sip: get helper before allocating expectation netfilter: ctnetlink: check tuple and mask in expectations created via nfqueue netfilter: nf_conntrack_expect: restore helper propagation via expectation netfilter: bridge: eb_tables: close module init race netfilter: x_tables: close dangling table module init race netfilter: ebtables: close dangling table module init race netfilter: ebtables: move to two-stage removal scheme netfilter: x_tables: add and use xtables_unregister_table_exit netfilter: x_tables: unregister the templates first netfilter: x_tables: add and use xt_unregister_table_pre_exit netfilter: x_tables: allocate hook ops while under mutex netfilter: x_tables: allow initial table replace without emitting audit log message ==================== Link: https://patch.msgid.link/20260507234509.603182-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-08 18:28:27 -07:00
Maoyi Xie	e68eadffb7	ipv6: flowlabel: enforce per-netns limit for unprivileged callers fl_size, fl_ht and ip6_fl_lock in net/ipv6/ip6_flowlabel.c are file scope and shared across netns. mem_check() reads fl_size to decide whether to deny non-CAP_NET_ADMIN callers. capable() runs against init_user_ns, so an unprivileged user in any non-init userns can push fl_size past FL_MAX_SIZE - FL_MAX_SIZE / 4 and starve every other unprivileged userns on the host. Add struct netns_ipv6::flowlabel_count, bumped and decremented next to fl_size in fl_intern, ip6_fl_gc and ip6_fl_purge. The new field fills the existing 4-byte hole after ipmr_seq, so struct netns_ipv6 stays the same size on 64-bit builds. Bump FL_MAX_SIZE from 4096 to 8192. It has been 4096 since the file was added. Machines and connection counts have grown. mem_check() folds an extra per-netns ceiling into the existing non-CAP_NET_ADMIN conditional. The ceiling is half of the total budget that unprivileged callers have ever been able to use, i.e. (FL_MAX_SIZE - FL_MAX_SIZE / 4) / 2 = 3072 entries. With FL_MAX_SIZE doubled, this preserves the original per-user reach of 3K (what an unprivileged caller could already obtain before this change), while forcing an attacker to spread allocations across at least two netns to exhaust the global non-CAP_NET_ADMIN budget. CAP_NET_ADMIN against init_user_ns still bypasses both caps. The previous patch took ip6_fl_lock across mem_check and fl_intern, so the new flowlabel_count read in mem_check and the new flowlabel_count++ in fl_intern run under the same critical section. flowlabel_count is therefore plain int, like fl_size. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Suggested-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Link: https://patch.msgid.link/20260506082416.2259567-3-maoyixie.tju@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-08 14:59:14 -07:00
Maoyi Xie	7ce5556f25	ipv6: flowlabel: take ip6_fl_lock across mem_check and fl_intern mem_check() in net/ipv6/ip6_flowlabel.c reads fl_size without holding ip6_fl_lock. fl_intern() takes the lock immediately afterwards. The two checks therefore race against concurrent fl_intern, ip6_fl_gc and ip6_fl_purge writers, which makes the mem_check budget check approximate. Move spin_lock_bh(&ip6_fl_lock) and the matching unlock from fl_intern() into its only caller ipv6_flowlabel_get(). The mem_check() call now runs under the same critical section as the fl_intern() insert, so the budget check is exact. With all writers and the read of fl_size under ip6_fl_lock, convert fl_size from atomic_t to plain int. The four sites that update or read fl_size are fl_intern (insert path), ip6_fl_gc (garbage collector, the !sched check and the per-entry decrement), ip6_fl_purge (per-netns purge), and mem_check (budget check), and all four now run under ip6_fl_lock. This is a prerequisite for adding a per-netns budget alongside fl_size. The follow-up patch adds netns_ipv6::flowlabel_count and folds it into mem_check(). Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Suggested-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Link: https://patch.msgid.link/20260506082416.2259567-2-maoyixie.tju@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-08 14:59:13 -07:00
Florian Westphal	16bc4b6686	netfilter: x_tables: close dangling table module init race Similar to the previous ebtables patch: template add exposes the table to userspace, we must do this last to rnsure the pernet ops are set up (contain the destructors). Fixes: `fdacd57c79` ("netfilter: x_tables: never register tables by default") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-08 01:30:17 +02:00
Florian Westphal	b4597d5fd7	netfilter: x_tables: add and use xtables_unregister_table_exit Previous change added xtables_unregister_table_pre_exit to detach the table from the packetpath and to unlink it from the active table list. In case of rmmod, userspace that is doing set/getsockopt for this table will not be able to re-instantiate the table: 1. The larval table has been removed already 2. existing instantiated table is no longer on the xt pernet table list. This adds the second stage helper: unlink the table from the dying list, free the hook ops (if any) and do the audit notification. It replaces xt_unregister_table(). Fixes: `fdacd57c79` ("netfilter: x_tables: never register tables by default") Reported-by: Tristan Madani <tristan@talencesecurity.com> Reviewed-by: Tristan Madani <tristan@talencesecurity.com> Closes: https://lore.kernel.org/netfilter-devel/20260429175613.1459342-1-tristmd@gmail.com/ Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-08 01:30:16 +02:00
Florian Westphal	d338693d77	netfilter: x_tables: unregister the templates first When the module is going away we need to zap the template first. Else there is a small race window where userspace could instantiate a new table after the pernet exit function has removed the current table. Fixes: `fdacd57c79` ("netfilter: x_tables: never register tables by default") Reported-by: Tristan Madani <tristan@talencesecurity.com> Reviewed-by: Tristan Madani <tristan@talencesecurity.com> Closes: https://lore.kernel.org/netfilter-devel/20260429175613.1459342-1-tristmd@gmail.com/ Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-08 01:30:16 +02:00
Florian Westphal	527d693147	netfilter: x_tables: add and use xt_unregister_table_pre_exit Remove the copypasted variants of _pre_exit and add one single function in the xtables core. ebtables is not compatible with x_tables and therefore unchanged. This is a preparation patch to reduce noise in the followup bug fixes. Reviewed-by: Tristan Madani <tristan@talencesecurity.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-08 01:30:16 +02:00
Florian Westphal	b62eb8dcf2	netfilter: x_tables: allocate hook ops while under mutex arp/ip(6)t_register_table() add the table to the per-netns list via xt_register_table() before allocating the per-netns hook ops copy via kmemdup_array(). This leaves a window where the table is visible in the list with ops=NULL. If the pernet exit happens runs concurrently the pre_exit callback finds the table via xt_find_table() and passes the NULL ops pointer to nf_unregister_net_hooks(), causing a NULL dereference: general protection fault in nf_unregister_net_hooks+0xbc/0x150 RIP: nf_unregister_net_hooks (net/netfilter/core.c:613) Call Trace: ipt_unregister_table_pre_exit iptable_mangle_net_pre_exit ops_pre_exit_list cleanup_net Fix by moving the ops allocation into the xtables core so the table is never in the list without valid ops. Also ensure the table is no longer processing packets before its torn down on error unwind. nf_register_net_hooks might have published at least one hook; call synchronize_rcu() if there was an error. audit log register message gets deferred until all operations have passed, this avoids need to emit another ureg message in case of error unwinding. Based on earlier patch by Tristan Madani. Fixes: `f9006acc8d` ("netfilter: arp_tables: pass table pointer via nf_hook_ops") Fixes: `ee177a5441` ("netfilter: ip6_tables: pass table pointer via nf_hook_ops") Fixes: `ae68933422` ("netfilter: ip_tables: pass table pointer via nf_hook_ops") Link: https://lore.kernel.org/netfilter-devel/20260429175613.1459342-1-tristmd@gmail.com/ Signed-off-by: Tristan Madani <tristan@talencesecurity.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-08 01:30:16 +02:00
Kuniyuki Iwashima	ecddc523cf	tcp: Fix dst leak in tcp_v6_connect(). If a socket is bound to a wildcard address, tcp_v[46]_connect() updates it with a non-wildcard address based on the route lookup. After bhash2 was introduced in the cited commit, we must call inet_bhash2_update_saddr() to update the bhash2 entry as well. If inet_bhash2_update_saddr() fails, we must release the refcount for dst by ip_route_connect() or ip6_dst_lookup_flow(). While tcp_v4_connect() calls ip_rt_put() in the error path, tcp_v6_connect() does not call dst_release(). Let's call dst_release() when inet_bhash2_update_saddr() fails in tcp_v6_connect(). Fixes: `28044fc1d4` ("net: Add a bhash2 table hashed by port and address") Reported-by: Damiano Melotti <melotti@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260506070443.1699879-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-07 08:39:15 -07:00
Eric Dumazet	c8f7244c8c	tcp: tcp_child_process() related UAF tcp_child_process( .. child ...) currently calls sock_put(child). Unfortunately @child (named @nsk in callers) can be used after this point to send a RST packet. To fix this UAF, I remove the sock_put() from tcp_child_process() and let the callers handle this after it is safe. Remove @rsk variable in tcp_v4_do_rcv() and change tcp_v6_do_rcv() so that both functions look the same. Fixes: `cfb6eeb4c8` ("[TCP]: MD5 Signature Option (RFC2385) support.") Reported-by: Damiano Melotti <melotti@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260505153927.3435532-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-06 18:11:33 -07:00
Eric Dumazet	7aaa8f5e45	ipv6: fix potential UAF caused by ip6_forward_proxy_check() ip6_forward_proxy_check() calls pskb_may_pull() which might re-allocate skb->head. Reload ipv6_hdr() after the pskb_may_pull() call to avoid using the freed memory. Fixes: `e21e0b5f19` ("[IPV6] NDISC: Handle NDP messages to proxied addresses.") Reported-by: Damiano Melotti <melotti@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260505130056.2927197-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-06 17:29:23 -07:00
Jakub Kicinski	dc61989e37	Merge tag 'ipsec-2026-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2026-05-05 1. Fix an IPv6 encapsulation error path that leaked route references when UDPv6 ESP decapsulation resolved to an error route. From Yilin Zhu. 2. Fix AH with ESN on async crypto paths by accounting for the extra high-order sequence number when reconstructing the temporary authentication layout in the completion callbacks. From Michael Bomarito. 3. Fix XFRM output so it does not overwrite already-correct inner header pointers when a tunnel layer such as VXLAN has already saved them. The fix comes with new selftests. From Cosmin Ratiu. 4. Add the missing native payload size entry for XFRM_MSG_MAPPING in the compat translation path. From Ruijie Li. 5. Harden __xfrm_state_delete() against repeated or inconsistent unhashing of state list nodes by keying the removal on actual list membership and using delete-and-init helpers. From Michal Kosiorek. 6. Prevent ESP from decrypting shared splice-backed skb fragments in place by marking UDP splice frags as shared and forcing copy-on-write in ESP input when needed. From Kuan-Ting Chen. * tag 'ipsec-2026-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: esp: avoid in-place decrypt on shared skb frags xfrm: defensively unhash xfrm_state lists in __xfrm_state_delete xfrm: provide message size for XFRM_MSG_MAPPING xfrm: Don't clobber inner headers when already set tools/selftests: Add a VXLAN+IPsec traffic test tools/selftests: Use a sensible timeout value for iperf3 client xfrm: ah: account for ESN high bits in async callbacks ipv6: xfrm6: release dst on error in xfrm6_rcv_encap() ==================== Link: https://patch.msgid.link/20260505132326.1362733-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-06 16:49:42 -07:00
Kuniyuki Iwashima	5ad509c1fd	ipv6: Fix null-ptr-deref in fib6_mtu(). syzbot reported null-ptr-deref in fib6_mtu(). [0] When res->f6i->fib6_pmtu is 0 in fib6_mtu(), it fetches MTU from __in6_dev_get(nh->fib_nh_dev)->cnf.mtu6. However, __in6_dev_get() could return NULL when the device is being unregistered. Let's return 0 MTU if __in6_dev_get() returns NULL in fib6_mtu(). [0]: Oops: general protection fault, probably for non-canonical address 0xdffffc00000000bc: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x00000000000005e0-0x00000000000005e7] CPU: 0 UID: 0 PID: 7890 Comm: syz.2.502 Tainted: G L syzkaller #0 PREEMPT(full) Tainted: [L]=SOFTLOCKUP Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:fib6_mtu net/ipv6/route.c:1648 [inline] RIP: 0010:rt6_insert_exception+0x9eb/0x10a0 net/ipv6/route.c:1753 Code: 3b 14 cf f7 45 85 f6 0f 85 1d 02 00 00 e8 7d 19 cf f7 48 8d bb e0 05 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 89 RSP: 0000:ffffc9000610f120 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffc9000c001000 RDX: 00000000000000bc RSI: ffffffff8a38bc83 RDI: 00000000000005e0 RBP: ffff888052f06000 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffff888042d16c00 R13: ffff888042d16cc8 R14: 0000000000000001 R15: 0000000000000500 FS: 0000000000000000(0000) GS:ffff88809717d000(0063) knlGS:00000000f540db40 CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 CR2: 00000000f73c6d50 CR3: 000000006eff0000 CR4: 0000000000352ef0 Call Trace: <TASK> __ip6_rt_update_pmtu+0x555/0xd60 net/ipv6/route.c:2982 ip6_update_pmtu+0x34f/0x3b0 net/ipv6/route.c:3014 icmpv6_err+0x2a2/0x3f0 net/ipv6/icmp.c:82 icmpv6_notify+0x35e/0x820 net/ipv6/icmp.c:1087 icmpv6_rcv+0x10bf/0x1ae0 net/ipv6/icmp.c:1228 ip6_protocol_deliver_rcu+0xf97/0x1500 net/ipv6/ip6_input.c:478 ip6_input_finish+0x1e4/0x4a0 net/ipv6/ip6_input.c:529 NF_HOOK include/linux/netfilter.h:318 [inline] NF_HOOK include/linux/netfilter.h:312 [inline] ip6_input+0x105/0x2f0 net/ipv6/ip6_input.c:540 ip6_mc_input+0x513/0xf50 net/ipv6/ip6_input.c:630 dst_input include/net/dst.h:480 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:119 [inline] NF_HOOK include/linux/netfilter.h:318 [inline] NF_HOOK include/linux/netfilter.h:312 [inline] ipv6_rcv+0x34c/0x3d0 net/ipv6/ip6_input.c:351 __netif_receive_skb_one_core+0x12d/0x1e0 net/core/dev.c:6202 __netif_receive_skb+0x1f/0x120 net/core/dev.c:6315 netif_receive_skb_internal net/core/dev.c:6401 [inline] netif_receive_skb+0x13b/0x7f0 net/core/dev.c:6460 tun_rx_batched.isra.0+0x3f6/0x750 drivers/net/tun.c:1511 tun_get_user+0x1e31/0x3c20 drivers/net/tun.c:1955 tun_chr_write_iter+0xdc/0x200 drivers/net/tun.c:2001 new_sync_write fs/read_write.c:595 [inline] vfs_write+0x6ac/0x1070 fs/read_write.c:688 ksys_write+0x12a/0x250 fs/read_write.c:740 do_syscall_32_irqs_on arch/x86/entry/syscall_32.c:83 [inline] do_int80_emulation+0x141/0x700 arch/x86/entry/syscall_32.c:172 asm_int80_emulation+0x1a/0x20 arch/x86/include/asm/idtentry.h:621 RIP: 0023:0xf715616b Code: 57 56 53 8b 44 24 14 f6 00 08 75 23 8b 44 24 18 8b 5c 24 1c 8b 4c 24 20 8b 54 24 24 8b 74 24 28 8b 7c 24 2c 8b 6c 24 30 cd 80 <5b> 5e 5f 5d c3 5b 5e 5f 5d e9 f7 a1 ff ff 66 90 66 90 66 90 90 53 RSP: 002b:00000000f540d44c EFLAGS: 00000246 ORIG_RAX: 0000000000000004 RAX: ffffffffffffffda RBX: 00000000000000c8 RCX: 0000000080000640 RDX: 000000000000007a RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000292 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> Fixes: `dcd1f57295` ("net/ipv6: Remove fib6_idev") Reported-by: syzbot+01f005f9c6387ca6f6dd@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/69f83f22.170a0220.13cc2.0004.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260504064316.3820775-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-05 17:32:57 -07:00
Alyssa Ross	901a7d9e2f	ipv6: default IPV6_SIT to m This basically defaulted to m until recently, since IPV6 defaulted to m. Since IPV6 was changed to a boolean with a default of y, IPV6_SIT started defaulting to built-in as well. This results in a surprise sit0 device by default for defconfig (and defconfig-derived config) users at boot. For me, this broke an (admittedly non-robust) script. Preserve the behaviour of most configs by avoiding building this module, that's probably overall seldom used compared to IPv6 as a whole, into the kernel. Fixes: `309b905dee` ("ipv6: convert CONFIG_IPV6 to built-in only and clean up Kconfigs") Signed-off-by: Alyssa Ross <hi@alyssa.is> Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Link: https://patch.msgid.link/20260503192515.290900-2-hi@alyssa.is Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-05 17:31:51 -07:00
Kuan-Ting Chen	f4c50a4034	xfrm: esp: avoid in-place decrypt on shared skb frags MSG_SPLICE_PAGES can attach pages from a pipe directly to an skb. TCP marks such skbs with SKBFL_SHARED_FRAG after skb_splice_from_iter(), so later paths that may modify packet data can first make a private copy. The IPv4/IPv6 datagram append paths did not set this flag when splicing pages into UDP skbs. That leaves an ESP-in-UDP packet made from shared pipe pages looking like an ordinary uncloned nonlinear skb. ESP input then takes the no-COW fast path for uncloned skbs without a frag_list and decrypts in place over data that is not owned privately by the skb. Mark IPv4/IPv6 datagram splice frags with SKBFL_SHARED_FRAG, matching TCP. Also make ESP input fall back to skb_cow_data() when the flag is present, so ESP does not decrypt externally backed frags in place. Private nonlinear skb frags still use the existing fast path. This intentionally does not change ESP output. In esp_output_head(), the path that appends the ESP trailer to existing skb tailroom without calling skb_cow_data() is not reachable for nonlinear skbs: skb_tailroom() returns zero when skb->data_len is nonzero, while ESP tailen is positive. Thus ESP output will either use the separate destination-frag path or fall back to skb_cow_data(). Fixes: `cac2661c53` ("esp4: Avoid skb_cow_data whenever possible") Fixes: `03e2a30f6a` ("esp6: Avoid skb_cow_data whenever possible") Fixes: `7da0dde684` ("ip, udp: Support MSG_SPLICE_PAGES") Fixes: `6d8192bd69` ("ip6, udp6: Support MSG_SPLICE_PAGES") Reported-by: Hyunwoo Kim <imv4bel@gmail.com> Reported-by: Kuan-Ting Chen <h3xrabbit@gmail.com> Tested-by: Hyunwoo Kim <imv4bel@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Kuan-Ting Chen <h3xrabbit@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2026-05-05 06:38:30 +02:00
Maoyi Xie	1d324c2f43	ip6_gre: Use cached t->net in ip6erspan_changelink(). After commit `5e72ce3e39` ("net: ipv6: Use link netns in newlink() of rtnl_link_ops"), ip6erspan_newlink() correctly resolves the per-netns ip6gre hash via link_net. ip6erspan_changelink() was not converted in that series and still uses dev_net(dev), which diverges from the device's creation netns after IFLA_NET_NS_FD migration. This re-inserts the tunnel into the wrong per-netns hash. The original netns keeps a stale entry. When that netns is later destroyed, ip6gre_exit_rtnl_net() walks the stale entry, producing a slab-use-after-free reported by KASAN, followed by a kernel BUG at net/core/dev.c (LIST_POISON1) in unregister_netdevice_many_notify(). Reachable from an unprivileged user namespace (unshare --user --map-root-user --net). ip6gre_changelink() earlier in the same file already uses the cached t->net; only ip6erspan_changelink() has the wrong shape. Fixes: `2d665034f2` ("net: ip6_gre: Fix ip6erspan hlen calculation") Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260430103318.3206018-1-maoyi.xie@ntu.edu.sg Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-02 10:36:21 -07:00
Sagarika Sharma	4bc852006b	ipv6: update route serial number on NETDEV_CHANGE When using IPv6 ECMP routes, if a netdev listed as a nexthop experiences a carrier change event (e.g., a bond device generating a NETDEV_CHANGE event after its slaves go linkdown), established connections utilizing that nexthop fail to fail over to other available nexthops. Instead, these connections stall or drop. This happens because the IPv6 FIB code does not invalidate the socket's cached destination when a NETDEV_CHANGE event occurs. While fib6_ifdown() correctly marks the nexthop with RTNH_F_LINKDOWN, it leaves the route's serial number unchanged. As a result, sockets with a previously cached dst do not realize the route is no longer viable and continue to try using the non-functional nexthop. This behavior contrasts with IPv4, which actively flushes cached destinations on a NETDEV_CHANGE event (see fib_netdev_event() in net/ipv4/fib_frontend.c). Fix this by updating the route serial number in fib6_ifdown() when setting RTNH_F_LINKDOWN. This invalidates stale cached destinations, forcing sockets to perform a new route lookup and fail over to a functioning nexthop. Fixes: `51ebd31815` ("ipv6: add support of equal cost multipath (ECMP)") Signed-off-by: Sagarika Sharma <sharmasagarika@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260430200909.527827-2-sharmasagarika@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-01 17:58:44 -07:00
Jakub Kicinski	2ab02ac411	Merge tag 'nf-26-05-01' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following batch contains Netfilter fixes for net: 1) Replace skb_try_make_writable() by skb_ensure_writable() in nft_fwd_netdev and the flowtable to deal with uncloned packets having their network header in paged fragments. 2) Drop packet if output device does not exist and ensure sufficient headroom in nft_fwd_netdev before transmitting the skb. 3) Use the existing dup recursion counter in nft_fwd_netdev for the neigh_xmit variant, from Weiming Shi. 4) Add .check_hooks interface to x_tables to detach the control plane hook check based on the match/target configuration. Then, update nft_compat to use .check_hooks from .validate path, this fixes a lack of hook validation for several match/targets. 5) Fix incorrect .usersize in xt_CT, from Florian Westphal. 6) Fix a memleak with netdev tables in dormant state, from Florian Westphal. 7) Several patches to check if the packet is a fragment, then skip layer 4 inspection, for x_tables and nf_tables; as well as common nf_socket infrastructure. The xt_hashlimit match drops fragments to stay consistent with the existing approach when failing to parse the layer 4 protocol header. 8) Ensure sufficient headroom in the flowtable before transmitting the skb. 9) Fix the flowtable inline vlan approach for double-tagged vlan: Reverse the iteration over .encap[] since it represents the encapsulation as seen from the ingress path. Postpone pushing layer 2 header so output device is available to calculate needed headroom. Finally, add and use nf_flow_vlan_push() to fix it. 10) Fix flowtable inline pppoe with GSO packets. Moreover, use FLOW_OFFLOAD_XMIT_DIRECT to fill up destination hardware address since neighbour cache does not exist in pppoe. 11) Use skb_pull_rcsum() to decapsulate vlan and pppoe headers, for double-tagged vlan in particular this should provide some benefits in certain scenarios. More notes regarding 9-11): - sashiko is also signalling to use it for IPIP headers, but that needs more adjustments such setting skb->protocol after removing the IPIP header, will follow up in a separated patch. - I plan to submit selftests to cover double-tagged-vlan. As for pppoe, it should be possible but that would mandate a few userspace dependencies. This has been semi-automatically tested by me and reporters describing broken double-vlan-tagged and pppoe currently in the flowtable. * tag 'nf-26-05-01' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: flowtable: use skb_pull_rcsum() to pop vlan/pppoe header netfilter: flowtable: fix inline pppoe encapsulation in xmit path netfilter: flowtable: fix inline vlan encapsulation in xmit path netfilter: flowtable: ensure sufficient headroom in xmit path netfilter: xtables: fix L4 header parsing for non-first fragments netfilter: nf_tables: skip L4 header parsing for non-first fragments netfilter: nf_socket: skip socket lookup for non-first fragments netfilter: nf_tables: fix netdev hook allocation memleak with dormant tables netfilter: xt_CT: fix usersize for v1 and v2 revision netfilter: nft_compat: run xt_check_hooks_{match,target}() from .validate netfilter: x_tables: add .check_hooks to matches and targets netfilter: nft_fwd_netdev: use recursion counter in neigh egress path netfilter: nft_fwd_netdev: add device and headroom validate with neigh forwarding netfilter: replace skb_try_make_writable() by skb_ensure_writable() ==================== Link: https://patch.msgid.link/20260501122237.296262-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-01 16:45:42 -07:00
Daniel Borkmann	3744b0964d	ipv6: Implement limits on extension header parsing ipv6_{skip_exthdr,find_hdr}() and ip6_{tnl_parse_tlv_enc_lim, protocol_deliver_rcu}() iterate over IPv6 extension headers until they find a non-extension-header protocol or run out of packet data. The loops have no iteration counter, relying solely on the packet length to bound them. For a crafted packet with 8-byte extension headers filling a 64KB jumbogram, this means a worst case of up to ~8k iterations with a skb_header_pointer call each. ipv6_skip_exthdr(), for example, is used where it parses the inner quoted packet inside an incoming ICMPv6 error: - icmpv6_rcv - checksum validation - case ICMPV6_DEST_UNREACH - icmpv6_notify - pskb_may_pull() <- pull inner IPv6 header - ipv6_skip_exthdr() <- iterates here - pskb_may_pull() - ipprot->err_handler() <- sk lookup The per-iteration cost of ipv6_skip_exthdr itself is generally light, but skb_header_pointer becomes more costly on reassembled packets: the first ~1232 bytes of the inner packet are in the skb's linear area, but the remaining ~63KB are in the frag_list where skb_copy_bits is needed to read data. Initially, the idea was to add a configurable limit via a new sysctl knob with default 8, in line with knobs from commit `47d3d7ac65` ("ipv6: Implement limits on Hop-by-Hop and Destination options"), but two reasons eventually argued against it: - It adds to UAPI that needs to be maintained forever, and upcoming work is restricting extension header ordering anyway, leaving little reason for another sysctl knob - exthdrs_core.c is always built-in even when CONFIG_IPV6=n, where struct net has no .ipv6 member, so the read site would need an ifdef'd fallback to a constant anyway Therefore, just use a constant (IP6_MAX_EXT_HDRS_CNT). All four extension header walking functions are now bound by this limit. Note that the check in ip6_protocol_deliver_rcu() happens right before the goto resubmit, such that we don't have to have a test for ipv6_ext_hdr() in the fast-path. There's an ongoing IETF draft-iurman-6man-eh-occurrences to enforce IPv6 extension headers ordering and occurrence. The latter also discusses security implications. As per RFC8200 section 4.1, the occurrence rules for extension headers provide a practical upper bound which is 8. In order to be conservative, let's define IP6_MAX_EXT_HDRS_CNT as 12 to leave enough room for quirky setups. In the unlikely event that this is still not enough, then we might need to reconsider a sysctl. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Justin Iurman <justin.iurman@gmail.com> Link: https://patch.msgid.link/20260429154648.809751-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-30 17:21:45 -07:00
Fernando Fernandez Mancera	0bf00859d7	netfilter: nf_socket: skip socket lookup for non-first fragments Both nft_socket and xt_socket relies on L4 headers to perform socket lookup in the slow path. For fragmented packets, while the IP protocol remains constant across all fragments, only the first fragment contains the actual L4 header. As the expression/match could be attached to a chain with a priority lower than -400, it could bypass defragmentation. Add a check for fragmentation in the lookup functions directly so the problem is handled for both nft_socket and xt_socket at the same time. In addition, future users of the functions would not need to care about this. Fixes: `902d6a4c2a` ("netfilter: nf_defrag: Skip defrag if NOTRACK is set") Fixes: `554ced0a6e` ("netfilter: nf_tables: add support for native socket matching") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-04-30 17:59:01 +02:00
Andrea Mayer	f9c52a6ba9	net: ipv6: fix NOREF dst use in seg6 and rpl lwtunnels seg6_input_core() and rpl_input() call ip6_route_input() which sets a NOREF dst on the skb, then pass it to dst_cache_set_ip6() invoking dst_hold() unconditionally. On PREEMPT_RT, ksoftirqd is preemptible and a higher-priority task can release the underlying pcpu_rt between the lookup and the caching through a concurrent FIB lookup on a shared nexthop. Simplified race sequence: ksoftirqd/X higher-prio task (same CPU X) ----------- -------------------------------- seg6_input_core(,skb)/rpl_input(skb) dst_cache_get() -> miss ip6_route_input(skb) -> ip6_pol_route(,skb,flags) [RT6_LOOKUP_F_DST_NOREF in flags] -> FIB lookup resolves fib6_nh [nhid=N route] -> rt6_make_pcpu_route() [creates pcpu_rt, refcount=1] pcpu_rt->sernum = fib6_sernum [fib6_sernum=W] -> cmpxchg(fib6_nh.rt6i_pcpu, NULL, pcpu_rt) [slot was empty, store succeeds] -> skb_dst_set_noref(skb, dst) [dst is pcpu_rt, refcount still 1] rt_genid_bump_ipv6() -> bumps fib6_sernum [fib6_sernum from W to Z] ip6_route_output() -> ip6_pol_route() -> FIB lookup resolves fib6_nh [nhid=N] -> rt6_get_pcpu_route() pcpu_rt->sernum != fib6_sernum [W <> Z, stale] -> prev = xchg(rt6i_pcpu, NULL) -> dst_release(prev) [prev is pcpu_rt, refcount 1->0, dead] dst = skb_dst(skb) [dst is the dead pcpu_rt] dst_cache_set_ip6(dst) -> dst_hold() on dead dst -> WARN / use-after-free For the race to occur, ksoftirqd must be preemptible (PREEMPT_RT without PREEMPT_RT_NEEDS_BH_LOCK) and a concurrent task must be able to release the pcpu_rt. Shared nexthop objects provide such a path, as two routes pointing to the same nhid share the same fib6_nh and its rt6i_pcpu entry. Fix seg6_input_core() and rpl_input() by calling skb_dst_force() after ip6_route_input() to force the NOREF dst into a refcounted one before caching. The output path is not affected as ip6_route_output() already returns a refcounted dst. Fixes: `af4a2209b1` ("ipv6: sr: use dst_cache in seg6_input") Fixes: `a7a29f9c36` ("net: ipv6: add rpl sr tunnel") Cc: stable@vger.kernel.org Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Justin Iurman <justin.iurman@gmail.com> Link: https://patch.msgid.link/20260421094735.20997-1-andrea.mayer@uniroma2.it Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-28 11:16:14 +02:00
Greg Kroah-Hartman	9e6bf146b5	ipv6: rpl: reserve mac_len headroom when recompressed SRH grows ipv6_rpl_srh_rcv() decompresses an RFC 6554 Source Routing Header, swaps the next segment into ipv6_hdr->daddr, recompresses, then pulls the old header and pushes the new one plus the IPv6 header back. The recompressed header can be larger than the received one when the swap reduces the common-prefix length the segments share with daddr (CmprI=0, CmprE>0, seg[0][0] != daddr[0] gives the maximum +8 bytes). pskb_expand_head() was gated on segments_left == 0, so on earlier segments the push consumed unchecked headroom. Once skb_push() leaves fewer than skb->mac_len bytes in front of data, skb_mac_header_rebuild()'s call to: skb_set_mac_header(skb, -skb->mac_len); will store (data - head) - mac_len into the u16 mac_header field, which wraps to ~65530, and the following memmove() writes mac_len bytes ~64KiB past skb->head. A single AF_INET6/SOCK_RAW/IPV6_HDRINCL packet over lo with a two segment type-3 SRH (CmprI=0, CmprE=15) reaches headroom 8 after one pass; KASAN reports a 14-byte OOB write in ipv6_rthdr_rcv. Fix this by expanding the head whenever the remaining room is less than the push size plus mac_len, and request that much extra so the rebuilt MAC header fits afterwards. Fixes: `8610c7c6e3` ("net: ipv6: add support for rpl sr exthdr") Cc: stable <stable@kernel.org> Reported-by: Anthropic Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://patch.msgid.link/2026042133-gout-unvented-1bd9@gregkh Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-27 17:47:26 -07:00
Daniel Borkmann	076b8cad77	ipv6: Cap TLV scan in ip6_tnl_parse_tlv_enc_lim Commit `47d3d7ac65` ("ipv6: Implement limits on Hop-by-Hop and Destination options") added net.ipv6.max_{hbh,dst}_opts_{cnt,len} and applied them in ip6_parse_tlv(), the generic TLV walker invoked from ipv6_destopt_rcv() and ipv6_parse_hopopts(). ip6_tnl_parse_tlv_enc_lim() does not go through ip6_parse_tlv(); it has its own hand-rolled TLV scanner inside its NEXTHDR_DEST branch which looks for IPV6_TLV_TNL_ENCAP_LIMIT. That inner loop is bounded only by optlen, which can be up to 2048 bytes. Stuffing the Destination Options header with 2046 Pad1 (type=0) entries advances the scanner a single byte at a time, yielding ~2000 TLV iterations per extension header. Reusing max_dst_opts_cnt to bound the TLV iterations, matching the semantics from `47d3d7ac65`, would require duplicating ip6_parse_tlv() to also validate Pad1/PadN payload. It would also mandate enforcing max_dst_opts_len, since otherwise an attacker shifts the axis to few options with a giant PadN and recovers the original DoS. Allowing up to 8 options before the tunnel encapsulation limit TLV is liberal enough; in practice encap limit is the first TLV. Thus, go with a hard-coded limit IP6_TUNNEL_MAX_DEST_TLVS (8). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Justin Iurman <justin.iurman@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-23 11:52:07 -07:00
Paolo Abeni	5a5db99c34	Merge tag 'nf-26-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following batch contains Netfilter/IPVS fixes for net: 1) nft_osf actually only supports IPv4, restrict it. 2) Address possible division by zero in nfnetlink_osf, from Xiang Mei. 3) Remove unsafe use of sprintf to fix possible buffer overflow in the SIP NAT helper, from Florian Westphal. 4) Restrict xt_mac, xt_owner and xt_physdev to inet families only; xt_realm is only for ipv4, otherwise null-pointer-deref is possible. 5) Use kfree_rcu() in nat core to release hooks, this can be an issue once nfnetlink_hook gets support to dump NAT hook information, not currently a real issue but better fix it now. From Florian Westphal. 6) Fix MTU checks in IPVS, from Yingnan Zhang. 7) Fix possible out-of-bounds when matching TCP options in nfnetlink_osf, from Fernando Fernandez Mancera. 8) Fix potential nul-ptr-deref in ttl check in nfnetlink_osf, remove useless loop to fix this, also from Fernando. This is a smaller batch, there are more patches pending in the queue to arm another pull request as soon as this is considered good enough. AI might complain again about one more issue regarding osf and big-endian arches in osf but this batch is targetting crash fixes for osf at this stage. netfilter pull request 26-04-20 * tag 'nf-26-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nfnetlink_osf: fix potential NULL dereference in ttl check netfilter: nfnetlink_osf: fix out-of-bounds read on option matching ipvs: fix MTU check for GSO packets in tunnel mode netfilter: nat: use kfree_rcu to release ops netfilter: xtables: restrict several matches to inet family netfilter: conntrack: remove sprintf usage netfilter: nfnetlink_osf: fix divide-by-zero in OSF_WSS_MODULO netfilter: nft_osf: restrict it to ipv4 ==================== Link: https://patch.msgid.link/20260420220215.111510-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 11:20:38 +02:00
Andrea Mayer	ade67d5f58	seg6: fix seg6 lwtunnel output redirect for L2 reduced encap mode When SEG6_IPTUN_MODE_L2ENCAP_RED (L2ENCAP_RED) was introduced, the condition in seg6_build_state() that excludes L2 encap modes from setting LWTUNNEL_STATE_OUTPUT_REDIRECT was not updated to account for the new mode. As a consequence, L2ENCAP_RED routes incorrectly trigger seg6_output() on the output path, where the packet is silently dropped because skb_mac_header_was_set() fails on L3 packets. Extend the check to also exclude L2ENCAP_RED, consistent with L2ENCAP. Fixes: `13f0296be8` ("seg6: add support for SRv6 H.L2Encaps.Red behavior") Cc: stable@vger.kernel.org Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: Justin Iurman <justin.iurman@gmail.com> Link: https://patch.msgid.link/20260418162838.31979-1-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 20:32:38 -07:00
Pablo Neira Ayuso	6eda0d771f	netfilter: nat: use kfree_rcu to release ops Florian Westphal says: "Historically this is not an issue, even for normal base hooks: the data path doesn't use the original nf_hook_ops that are used to register the callbacks. However, in v5.14 I added the ability to dump the active netfilter hooks from userspace. This code will peek back into the nf_hook_ops that are available at the tail of the pointer-array blob used by the datapath. The nat hooks are special, because they are called indirectly from the central nat dispatcher hook. They are currently invisible to the nfnl hook dump subsystem though. But once that changes the nat ops structures have to be deferred too." Update nf_nat_register_fn() to deal with partial exposition of the hooks from error path which can be also an issue for nfnetlink_hook. Fixes: `e2cf17d377` ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-04-20 23:45:41 +02:00
Michael Bommarito	ec54093e6a	xfrm: ah: account for ESN high bits in async callbacks AH allocates its temporary auth/ICV layout differently when ESN is enabled: the async ahash setup appends a 4-byte seqhi slot before the ICV or auth_data area, but the async completion callbacks still reconstruct the temporary layout as if seqhi were absent. With an async AH implementation selected, that makes AH copy or compare the wrong bytes on both the IPv4 and IPv6 paths. In UML repro on IPv4 AH with ESN and forced async hmac(sha1), ping fails with 100% packet loss, and the callback logs show the pre-fix drift: ah4 output_done: esn=1 err=0 icv_off=20 expected_off=24 ah4 input_done: esn=1 auth_off=20 expected_auth_off=24 icv_off=32 expected_icv_off=36 Reconstruct the callback-side layout the same way the setup path built it by skipping the ESN seqhi slot before locating the saved auth_data or ICV. Per RFC 4302, the ESN high-order 32 bits participate in the AH ICV computation, so the async callbacks must account for the seqhi slot. Post-fix, the same IPv4 AH+ESN+forced-async-hmac(sha1) UML repro shows the corrected offset (ah4 output_done: esn=1 err=0 icv_off=24 expected_off=24) and ping succeeds; net/ipv4/ah4.o and net/ipv6/ah6.o build clean at W=1. IPv6 AH+ESN was not exercised at runtime, and the change has not been tested against a real async hardware AH engine. Fixes: `d4d573d033` ("{IPv4,xfrm} Add ESN support for AH egress part") Fixes: `d8b2a8600b` ("{IPv4,xfrm} Add ESN support for AH ingress part") Fixes: `26dd70c3fa` ("{IPv6,xfrm} Add ESN support for AH egress part") Fixes: `8d6da6f325` ("{IPv6,xfrm} Add ESN support for AH ingress part") Cc: stable@vger.kernel.org Assisted-by: Codex:gpt-5-4 Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2026-04-20 09:28:34 +02:00
Eric Dumazet	f996edd761	ipv6: fix possible UAF in icmpv6_rcv() Caching saddr and daddr before pskb_pull() is problematic since skb->head can change. Remove these temporary variables: - We only access &ipv6_hdr(skb)->saddr and &ipv6_hdr(skb)->daddr when net_dbg_ratelimited() is called in the slow path. - Avoid potential future misuse after pskb_pull() call. Fixes: `4b3418fba0` ("ipv6: icmp: include addresses in debug messages") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Joe Damato <joe@dama.to> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260416103505.2380753-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:09:52 -07:00
Yilin Zhu	bc0fcb9823	ipv6: xfrm6: release dst on error in xfrm6_rcv_encap() xfrm6_rcv_encap() performs an IPv6 route lookup when the skb does not already have a dst attached. ip6_route_input_lookup() returns a referenced dst entry even when the lookup resolves to an error route. If dst->error is set, xfrm6_rcv_encap() drops the skb without attaching the dst to the skb and without releasing the reference returned by the lookup. Repeated packets hitting this path therefore leak dst entries. Release the dst before jumping to the drop path. Fixes: `0146dca70b` ("xfrm: add support for UDPv6 encapsulation of ESP") Cc: stable@kernel.org Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Co-developed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Yuan Tan <yuantan098@gmail.com> Suggested-by: Xin Liu <bird@lzu.edu.cn> Tested-by: Ruide Cao <caoruide123@gmail.com> Signed-off-by: Yilin Zhu <zylzyl2333@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2026-04-17 10:40:29 +02:00
Linus Torvalds	91a4855d6c	Merge tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Support HW queue leasing, allowing containers to be granted access to HW queues for zero-copy operations and AF_XDP - Number of code moves to help the compiler with inlining. Avoid output arguments for returning drop reason where possible - Rework drop handling within qdiscs to include more metadata about the reason and dropping qdisc in the tracepoints - Remove the rtnl_lock use from IP Multicast Routing - Pack size information into the Rx Flow Steering table pointer itself. This allows making the table itself a flat array of u32s, thus making the table allocation size a power of two - Report TCP delayed ack timer information via socket diag - Add ip_local_port_step_width sysctl to allow distributing the randomly selected ports more evenly throughout the allowed space - Add support for per-route tunsrc in IPv6 segment routing - Start work of switching sockopt handling to iov_iter - Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid buffer size drifting up - Support MSG_EOR in MPTCP - Add stp_mode attribute to the bridge driver for STP mode selection. This addresses concerns about call_usermodehelper() usage - Remove UDP-Lite support (as announced in 2023) - Remove support for building IPv6 as a module. Remove the now unnecessary function calling indirection Cross-tree stuff: - Move Michael MIC code from generic crypto into wireless, it's considered insecure but some WiFi networks still need it Netfilter: - Switch nft_fib_ipv6 module to no longer need temporary dst_entry object allocations by using fib6_lookup() + RCU. Florian W reports this gets us ~13% higher packet rate - Convert IPVS's global __ip_vs_mutex to per-net service_mutex and switch the service tables to be per-net. Convert some code that walks the service lists to use RCU instead of the service_mutex - Add more opinionated input validation to lower security exposure - Make IPVS hash tables to be per-netns and resizable Wireless: - Finished assoc frame encryption/EPPKE/802.1X-over-auth - Radar detection improvements - Add 6 GHz incumbent signal detection APIs - Multi-link support for FILS, probe response templates and client probing - New APIs and mac80211 support for NAN (Neighbor Aware Networking, aka Wi-Fi Aware) so less work must be in firmware Driver API: - Add numerical ID for devlink instances (to avoid having to create fake bus/device pairs just to have an ID). Support shared devlink instances which span multiple PFs - Add standard counters for reporting pause storm events (implement in mlx5 and fbnic) - Add configuration API for completion writeback buffering (implement in mana) - Support driver-initiated change of RSS context sizes - Support DPLL monitoring input frequency (implement in zl3073x) - Support per-port resources in devlink (implement in mlx5) Misc: - Expand the YAML spec for Netfilter Drivers - Software: - macvlan: support multicast rx for bridge ports with shared source MAC address - team: decouple receive and transmit enablement for IEEE 802.3ad LACP "independent control" - Ethernet high-speed NICs: - nVidia/Mellanox: - support high order pages in zero-copy mode (for payload coalescing) - support multiple packets in a page (for systems with 64kB pages) - Broadcom 25-400GE (bnxt): - implement XDP RSS hash metadata extraction - add software fallback for UDP GSO, lowering the IOMMU cost - Broadcom 800GE (bnge): - add link status and configuration handling - add various HW and SW statistics - Marvell/Cavium: - NPC HW block support for cn20k - Huawei (hinic3): - add mailbox / control queue - add rx VLAN offload - add driver info and link management - Ethernet NICs: - Marvell/Aquantia: - support reading SFP module info on some AQC100 cards - Realtek PCI (r8169): - add support for RTL8125cp - Realtek USB (r8152): - support for the RTL8157 5Gbit chip - add 2500baseT EEE status/configuration support - Ethernet NICs embedded and off-the-shelf IP: - Synopsys (stmmac): - cleanup and reorganize SerDes handling and PCS support - cleanup descriptor handling and per-platform data - cleanup and consolidate MDIO defines and handling - shrink driver memory use for internal structures - improve Tx IRQ coalescing - improve TCP segmentation handling - add support for Spacemit K3 - Cadence (macb): - support PHYs that have inband autoneg disabled with GEM - support IEEE 802.3az EEE - rework usrio capabilities and handling - AMD (xgbe): - improve power management for S0i3 - improve TX resilience for link-down handling - Virtual: - Google cloud vNIC: - support larger ring sizes in DQO-QPL mode - improve HW-GRO handling - support UDP GSO for DQO format - PCIe NTB: - support queue count configuration - Ethernet PHYs: - automatically disable PHY autonomous EEE if MAC is in charge - Broadcom: - add BCM84891/BCM84892 support - Micrel: - support for LAN9645X internal PHY - Realtek: - add RTL8224 pair order support - support PHY LEDs on RTL8211F-VD - support spread spectrum clocking (SSC) - Maxlinear: - add PHY-level statistics via ethtool - Ethernet switches: - Maxlinear (mxl862xx): - support for bridge offloading - support for VLANs - support driver statistics - Bluetooth: - large number of fixes and new device IDs - Mediatek: - support MT6639 (MT7927) - support MT7902 SDIO - WiFi: - Intel (iwlwifi): - UNII-9 and continuing UHR work - MediaTek (mt76): - mt7996/mt7925 MLO fixes/improvements - mt7996 NPU support (HW eth/wifi traffic offload) - Qualcomm (ath12k): - monitor mode support on IPQ5332 - basic hwmon temperature reporting - support IPQ5424 - Realtek: - add USB RX aggregation to improve performance - add USB TX flow control by tracking in-flight URBs - Cellular: - IPA v5.2 support" * tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1561 commits) net: pse-pd: fix kernel-doc function name for pse_control_find_by_id() wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit wireguard: allowedips: remove redundant space tools: ynl: add sample for wireguard wireguard: allowedips: Use kfree_rcu() instead of call_rcu() MAINTAINERS: Add netkit selftest files selftests/net: Add additional test coverage in nk_qlease selftests/net: Split netdevsim tests from HW tests in nk_qlease tools/ynl: Make YnlFamily closeable as a context manager net: airoha: Add missing PPE configurations in airoha_ppe_hw_init() net: airoha: Fix VIP configuration for AN7583 SoC net: caif: clear client service pointer on teardown net: strparser: fix skb_head leak in strp_abort_strp() net: usb: cdc-phonet: fix skb frags[] overflow in rx_complete() selftests/bpf: add test for xdp_master_redirect with bond not up net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master net: airoha: Remove PCE_MC_EN_MASK bit in REG_FE_PCE_CFG configuration sctp: disable BH before calling udp_tunnel_xmit_skb() sctp: fix missing encap_port propagation for GSO fragments net: airoha: Rely on net_device pointer in ETS callbacks ...	2026-04-14 18:36:10 -07:00
Linus Torvalds	f5ad410100	Merge tag 'bpf-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Welcome new BPF maintainers: Kumar Kartikeya Dwivedi, Eduard Zingerman while Martin KaFai Lau reduced his load to Reviwer. - Lots of fixes everywhere from many first time contributors. Thank you All. - Diff stat is dominated by mechanical split of verifier.c into multiple components: - backtrack.c: backtracking logic and jump history - states.c: state equivalence - cfg.c: control flow graph, postorder, strongly connected components - liveness.c: register and stack liveness - fixups.c: post-verification passes: instruction patching, dead code removal, bpf_loop inlining, finalize fastcall 8k line were moved. verifier.c still stands at 20k lines. Further refactoring is planned for the next release. - Replace dynamic stack liveness with static stack liveness based on data flow analysis. This improved the verification time by 2x for some programs and equally reduced memory consumption. New logic is in liveness.c and supported by constant folding in const_fold.c (Eduard Zingerman, Alexei Starovoitov) - Introduce BTF layout to ease addition of new BTF kinds (Alan Maguire) - Use kmalloc_nolock() universally in BPF local storage (Amery Hung) - Fix several bugs in linked registers delta tracking (Daniel Borkmann) - Improve verifier support of arena pointers (Emil Tsalapatis) - Improve verifier tracking of register bounds in min/max and tnum domains (Harishankar Vishwanathan, Paul Chaignon, Hao Sun) - Further extend support for implicit arguments in the verifier (Ihor Solodrai) - Add support for nop,nop5 instruction combo for USDT probes in libbpf (Jiri Olsa) - Support merging multiple module BTFs (Josef Bacik) - Extend applicability of bpf_kptr_xchg (Kaitao Cheng) - Retire rcu_trace_implies_rcu_gp() (Kumar Kartikeya Dwivedi) - Support variable offset context access for 'syscall' programs (Kumar Kartikeya Dwivedi) - Migrate bpf_task_work and dynptr to kmalloc_nolock() (Mykyta Yatsenko) - Fix UAF in in open-coded task_vma iterator (Puranjay Mohan) * tag 'bpf-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (241 commits) selftests/bpf: cover short IPv4/IPv6 inputs with adjust_room bpf: reject short IPv4/IPv6 inputs in bpf_prog_test_run_skb selftests/bpf: Use memfd_create instead of shm_open in cgroup_iter_memcg selftests/bpf: Add test for cgroup storage OOB read bpf: Fix OOB in pcpu_init_value selftests/bpf: Fix reg_bounds to match new tnum-based refinement selftests/bpf: Add tests for non-arena/arena operations bpf: Allow instructions with arena source and non-arena dest registers bpftool: add missing fsession to the usage and docs of bpftool docs/bpf: add missing fsession attach type to docs bpf: add missing fsession to the verifier log bpf: Move BTF checking logic into check_btf.c bpf: Move backtracking logic to backtrack.c bpf: Move state equivalence logic to states.c bpf: Move check_cfg() into cfg.c bpf: Move compute_insn_live_regs() into liveness.c bpf: Move fixup/post-processing logic from verifier.c into fixups.c bpf: Simplify do_check_insn() bpf: Move checks for reserved fields out of the main pass bpf: Delete unused variable ...	2026-04-14 18:04:04 -07:00
Gabriel Krisman Bertazi	b80a95ccf1	udp: Force compute_score to always inline Back in 2024 I reported a 7-12% regression on an iperf3 UDP loopback thoughput test that we traced to the extra overhead of calling compute_score on two places, introduced by commit `f0ea27e7bf` ("udp: re-score reuseport groups when connected sockets are present"). At the time, I pointed out the overhead was caused by the multiple calls, associated with cpu-specific mitigations, and merged commit `50aee97d15` ("udp: Avoid call to compute_score on multiple sites") to jump back explicitly, to force the rescore call in a single place. Recently though, we got another regression report against a newer distro version, which a team colleague traced back to the same root-cause. Turns out that once we updated to gcc-13, the compiler got smart enough to unroll the loop, undoing my previous mitigation. Let's bite the bullet and __always_inline compute_score on both ipv4 and ipv6 to prevent gcc from de-optimizing it again in the future. These functions are only called in two places each, udpX_lib_lookup1 and udpX_lib_lookup2, so the extra size shouldn't be a problem and it is hot enough to be very visible in profilings. In fact, with gcc13, forcing the inline will prevent gcc from unrolling the fix from commit `50aee97d15`, so we don't end up increasing udpX_lib_lookup2 at all. I haven't recollected the results myself, as I don't have access to the machine at the moment. But the same colleague reported 4.67% inprovement with this patch in the loopback benchmark, solving the regression report within noise margins. Eric Dumazet reported no size change to vmlinux when built with clang. I report the same also with gcc-13: scripts/bloat-o-meter vmlinux vmlinux-inline add/remove: 0/2 grow/shrink: 4/0 up/down: 616/-416 (200) Function old new delta udp6_lib_lookup2 762 949 +187 __udp6_lib_lookup 810 975 +165 udp4_lib_lookup2 757 906 +149 __udp4_lib_lookup 871 986 +115 __pfx_compute_score 32 - -32 compute_score 384 - -384 Total: Before=35011784, After=35011984, chg +0.00% Fixes: `50aee97d15` ("udp: Avoid call to compute_score on multiple sites") Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://patch.msgid.link/20260410155936.654915-1-krisman@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-13 15:44:42 -07:00
Linus Torvalds	b7d74ea0fd	Merge tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs i_ino updates from Christian Brauner: "For historical reasons, the inode->i_ino field is an unsigned long, which means that it's 32 bits on 32 bit architectures. This has caused a number of filesystems to implement hacks to hash a 64-bit identifier into a 32-bit field, and deprives us of a universal identifier field for an inode. This changes the inode->i_ino field from an unsigned long to a u64. This shouldn't make any material difference on 64-bit hosts, but 32-bit hosts will see struct inode grow by at least 4 bytes. This could have effects on slabcache sizes and field alignment. The bulk of the changes are to format strings and tracepoints, since the kernel itself doesn't care that much about the i_ino field. The first patch changes some vfs function arguments, so check that one out carefully. With this change, we may be able to shrink some inode structures. For instance, struct nfs_inode has a fileid field that holds the 64-bit inode number. With this set of changes, that field could be eliminated. I'd rather leave that sort of cleanups for later just to keep this simple" * tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nilfs2: fix 64-bit division operations in nilfs_bmap_find_target_in_group() EVM: add comment describing why ino field is still unsigned long vfs: remove externs from fs.h on functions modified by i_ino widening treewide: fix missed i_ino format specifier conversions ext4: fix signed format specifier in ext4_load_inode trace event treewide: change inode->i_ino from unsigned long to u64 nilfs2: widen trace event i_ino fields to u64 f2fs: widen trace event i_ino fields to u64 ext4: widen trace event i_ino fields to u64 zonefs: widen trace event i_ino fields to u64 hugetlbfs: widen trace event i_ino fields to u64 ext2: widen trace event i_ino fields to u64 cachefiles: widen trace event i_ino fields to u64 vfs: widen trace event i_ino fields to u64 net: change sock.sk_ino and sock_i_ino() to u64 audit: widen ino fields to u64 vfs: widen inode hash/lookup functions to u64	2026-04-13 12:19:01 -07:00
Jakub Kicinski	b025461303	tcp: update window_clamp when SO_RCVBUF is set Commit under Fixes moved recomputing the window clamp to tcp_measure_rcv_mss() (when scaling_ratio changes). I suspect it missed the fact that we don't recompute the clamp when rcvbuf is set. Until scaling_ratio changes we are stuck with the old window clamp which may be based on the small initial buffer. scaling_ratio may never change. Inspired by Eric's recent commit `d1361840f8` ("tcp: fix SO_RCVLOWAT and RCVBUF autotuning") plumb the user action thru to TCP and have it update the clamp. A smaller fix would be to just have tcp_rcvbuf_grow() adjust the clamp even if SOCK_RCVBUF_LOCK is set. But IIUC this is what we were trying to get away from in the first place. Fixes: `a2cbb16039` ("tcp: Update window clamping condition") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumaze@google.com> Link: https://patch.msgid.link/20260408001438.129165-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-13 15:32:35 +02:00
Jakub Kicinski	9336854a59	Merge branch 'net-reduce-sk_filter-and-friends-bloat' Eric Dumazet says: ==================== net: reduce sk_filter() (and friends) bloat Some functions return an error by value, and a drop_reason by an output parameter. This extra parameter can force stack canaries. A drop_reason is enough and more efficient. This series reduces bloat by 678 bytes on x86_64: $ scripts/bloat-o-meter -t vmlinux.old vmlinux.final add/remove: 0/0 grow/shrink: 3/18 up/down: 79/-757 (-678) Function old new delta vsock_queue_rcv_skb 50 79 +29 ipmr_cache_report 1290 1315 +25 ip6mr_cache_report 1322 1347 +25 tcp_v6_rcv 3169 3167 -2 packet_rcv_spkt 329 327 -2 unix_dgram_sendmsg 1731 1726 -5 netlink_unicast 957 945 -12 netlink_dump 1372 1359 -13 sk_filter_trim_cap 889 858 -31 netlink_broadcast_filtered 1633 1595 -38 tcp_v4_rcv 3152 3111 -41 raw_rcv_skb 122 80 -42 ping_queue_rcv_skb 109 61 -48 ping_rcv 215 162 -53 rawv6_rcv_skb 278 224 -54 __sk_receive_skb 690 632 -58 raw_rcv 591 527 -64 udpv6_queue_rcv_one_skb 935 869 -66 udp_queue_rcv_one_skb 919 853 -66 tun_net_xmit 1146 1074 -72 sock_queue_rcv_skb_reason 166 76 -90 Total: Before=29722890, After=29722212, chg -0.00% Future conversions from sock_queue_rcv_skb() to sock_queue_rcv_skb_reason() can be done later. ==================== Link: https://patch.msgid.link/20260409145625.2306224-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 14:30:28 -07:00
Eric Dumazet	fb37aea2a0	net: change sk_filter_trim_cap() to return a drop_reason by value Current return value can be replaced with the drop_reason, reducing kernel bloat: $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/2 grow/shrink: 1/11 up/down: 32/-603 (-571) Function old new delta tcp_v6_rcv 3135 3167 +32 unix_dgram_sendmsg 1731 1726 -5 netlink_unicast 957 945 -12 netlink_dump 1372 1359 -13 sk_filter_trim_cap 882 858 -24 tcp_v4_rcv 3143 3111 -32 __pfx_tcp_filter 32 - -32 netlink_broadcast_filtered 1633 1595 -38 sock_queue_rcv_skb_reason 126 76 -50 tun_net_xmit 1127 1074 -53 __sk_receive_skb 690 632 -58 udpv6_queue_rcv_one_skb 935 869 -66 udp_queue_rcv_one_skb 919 853 -66 tcp_filter 154 - -154 Total: Before=29722783, After=29722212, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260409145625.2306224-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 14:30:25 -07:00
Eric Dumazet	97449a5f1a	tcp: change tcp_filter() to return the reason by value sk_filter_trim_cap() will soon return the reason by value, do the same for tcp_filter(). Note: tcp_filter() is no longer inlined. Following patch will inline it again. $ scripts/bloat-o-meter -t vmlinux.4 vmlinux.5 add/remove: 2/0 grow/shrink: 0/2 up/down: 186/-43 (143) Function old new delta tcp_filter - 154 +154 __pfx_tcp_filter - 32 +32 tcp_v4_rcv 3152 3143 -9 tcp_v6_rcv 3169 3135 -34 Total: Before=29722640, After=29722783, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260409145625.2306224-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 14:30:25 -07:00
Eric Dumazet	900f27fb79	net: change sock_queue_rcv_skb_reason() to return a drop_reason Change sock_queue_rcv_skb_reason() to return the drop_reason directly instead of using a reference. This is part of an effort to remove stack canaries and reduce bloat. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 3/7 up/down: 79/-301 (-222) Function old new delta vsock_queue_rcv_skb 50 79 +29 ipmr_cache_report 1290 1315 +25 ip6mr_cache_report 1322 1347 +25 packet_rcv_spkt 329 327 -2 sock_queue_rcv_skb_reason 166 128 -38 raw_rcv_skb 122 80 -42 ping_queue_rcv_skb 109 61 -48 ping_rcv 215 162 -53 rawv6_rcv_skb 278 224 -54 raw_rcv 591 527 -64 Total: Before=29722890, After=29722668, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260409145625.2306224-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 14:30:25 -07:00
Gal Pressman	8632175ccb	gre: Count GRE packet drops GRE is silently dropping packets without updating statistics. In case of drop, increment rx_dropped counter to provide visibility into packet loss. For the case where no GRE protocol handler is registered, use rx_nohandler. Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Nimrod Oren <noren@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20260409090945.1542440-1-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 12:33:33 -07:00
Jakub Kicinski	03a1569c2b	Merge tag 'nf-next-26-04-10' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter: updates for net-next 1-3) IPVS updates from Julian Anastasov to enhance visibility into IPVS internal state by exposing hash size, load factor etc and allows userspace to tune the load factor used for resizing hash tables. 4) reject empty/not nul terminated device names from xt_physdev. This isn't a bug fix; existing code doesn't require a c-string. But clean this up anyway because conceptually the interface name definitely should be a c-string. 5) Switch nfnetlink to skb_mac_header helpers that didn't exist back when this code was written. This gives us additional debug checks but is not intended to change functionality. 6) Let the xt ttl/hoplimit match reject unknown operator modes. This is a cleanup, the evaluation function simply returns false when the mode is out of range. From Marino Dzalto. 7) xt_socket match should enable defrag after all other checks. This bug is harmless, historically defrag could not be disabled either except by rmmod. 8) remove UDP-Lite conntrack support, from Fernando Fernandez Mancera. 9) Avoid a couple -Wflex-array-member-not-at-end warnings in the old xtables 32bit compat code, from Gustavo A. R. Silva. 10) nftables fwd expression should drop packets when their ttl/hl has expired. This is a bug fix deferred, its not deemed important enough for -rc8. 11) Add additional checks before assuming the mac header is an ethernet header, from Zhengchuan Liang. * tag 'nf-next-26-04-10' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: require Ethernet MAC header before using eth_hdr() netfilter: nft_fwd_netdev: check ttl/hl before forwarding netfilter: x_tables: Avoid a couple -Wflex-array-member-not-at-end warnings netfilter: conntrack: remove UDP-Lite conntrack support netfilter: xt_socket: enable defrag after all other checks netfilter: xt_HL: add pr_fmt and checkentry validation netfilter: nfnetlink: prefer skb_mac_header helpers netfilter: x_physdev: reject empty or not-nul terminated device names ipvs: add conn_lfactor and svc_lfactor sysctl vars ipvs: add ip_vs_status info ipvs: show the current conn_tab size to users ==================== Link: https://patch.msgid.link/20260410112352.23599-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 09:39:21 -07:00
Eric Dumazet	29703d7813	tcp: add indirect call wrapper in tcp_conn_request() Small improvement in SYN processing, to directly call tcp_v6_init_seq_and_ts_off() or tcp_v4_init_seq_and_ts_off(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260410174950.745670-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 09:17:03 -07:00
Eric Dumazet	f5148298b0	tcp: return a drop_reason from tcp_add_backlog() Part of a stack canary removal from tcp_v{4,6}_rcv(). Return a drop_reason instead of a boolean, so that we no longer have to pass the address of a local variable. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-37 (-37) Function old new delta tcp_v6_rcv 3133 3129 -4 tcp_v4_rcv 3206 3202 -4 tcp_add_backlog 1281 1252 -29 Total: Before=25567186, After=25567149, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260409101147.1642967-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 09:07:53 -07:00
Zhengchuan Liang	62443dc211	netfilter: require Ethernet MAC header before using eth_hdr() `ip6t_eui64`, `xt_mac`, the `bitmap:ip,mac`, `hash:ip,mac`, and `hash:mac` ipset types, and `nf_log_syslog` access `eth_hdr(skb)` after either assuming that the skb is associated with an Ethernet device or checking only that the `ETH_HLEN` bytes at `skb_mac_header(skb)` lie between `skb->head` and `skb->data`. Make these paths first verify that the skb is associated with an Ethernet device, that the MAC header was set, and that it spans at least a full Ethernet header before accessing `eth_hdr(skb)`. Suggested-by: Florian Westphal <fw@strlen.de> Tested-by: Ren Wei <enjou1224z@gmail.com> Signed-off-by: Zhengchuan Liang <zcliangcn@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-04-10 12:16:27 +02:00
Yue Haibing	3c6132ccc5	ipv6: sit: remove redundant ret = 0 assignment The variable ret is assigned a value at all places where it is used; There is no need to assign a value when it is initially defined. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://patch.msgid.link/20260408032051.3096449-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 20:37:40 -07:00
Paolo Abeni	8e6405f821	ipv6: move IFA_F_PERMANENT percpu allocation in process scope Observed at boot time: CPU: 43 UID: 0 PID: 3595 Comm: (t-daemon) Not tainted 6.12.0 #1 Call Trace: <TASK> dump_stack_lvl+0x4e/0x70 pcpu_alloc_noprof.cold+0x1f/0x4b fib_nh_common_init+0x4c/0x110 fib6_nh_init+0x387/0x740 ip6_route_info_create+0x46d/0x640 addrconf_f6i_alloc+0x13b/0x180 addrconf_permanent_addr+0xd0/0x220 addrconf_notify+0x93/0x540 notifier_call_chain+0x5a/0xd0 __dev_notify_flags+0x5c/0xf0 dev_change_flags+0x54/0x70 do_setlink+0x36c/0xce0 rtnl_setlink+0x11f/0x1d0 rtnetlink_rcv_msg+0x142/0x3f0 netlink_rcv_skb+0x50/0x100 netlink_unicast+0x242/0x390 netlink_sendmsg+0x21b/0x470 __sys_sendto+0x1dc/0x1f0 __x64_sys_sendto+0x24/0x30 do_syscall_64+0x7d/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f5c3852f127 Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 80 3d 85 ef 0c 00 00 41 89 ca 74 10 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 71 c3 55 48 83 ec 30 44 89 4c 24 2c 4c 89 44 RSP: 002b:00007ffe86caf4c8 EFLAGS: 00000202 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 0000556c5cd93210 RCX: 00007f5c3852f127 RDX: 0000000000000020 RSI: 0000556c5cd938b0 RDI: 0000000000000003 RBP: 00007ffe86caf5a0 R08: 00007ffe86caf4e0 R09: 0000000000000080 R10: 0000000000000000 R11: 0000000000000202 R12: 0000556c5cd932d0 R13: 00000000021d05d1 R14: 00000000021d05d1 R15: 0000000000000001 IFA_F_PERMANENT addresses require the allocation of a bunch of percpu pointers, currently in atomic scope. Similar to commit `51454ea42c` ("ipv6: fix locking issues with loops over idev->addr_list"), move fixup_permanent_addr() outside the &idev->lock scope, and do the allocations with GFP_KERNEL. With such change fixup_permanent_addr() is invoked with the BH enabled, and the ifp lock acquired there needs the BH variant. Note that we don't need to acquire a reference to the permanent addresses before releasing the mentioned write lock, because addrconf_permanent_addr() runs under RTNL and ifa removal always happens under RTNL, too. Also the PERMANENT flag is constant in the relevant scope, as it can be cleared only by inet6_addr_modify() under the RTNL lock. Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/46a7a030727e236af2dc7752994cd4f04f4a91d2.1775658924.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 19:32:45 -07:00
Jakub Kicinski	b6e39e4846	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-7.0-rc8). Conflicts: net/ipv6/seg6_iptunnel.c `c3812651b5` ("seg6: separate dst_cache for input and output paths in seg6 lwtunnel") `78723a62b9` ("seg6: add per-route tunnel source address") https://lore.kernel.org/adZhwtOYfo-0ImSa@sirena.org.uk net/ipv4/icmp.c `fde29fd934` ("ipv4: icmp: fix null-ptr-deref in icmp_build_probe()") `d98adfbdd5` ("ipv4: drop ipv6_stub usage and use direct function calls") https://lore.kernel.org/adO3dccqnr6j-BL9@sirena.org.uk Adjacent changes: drivers/net/ethernet/stmicro/stmmac/chain_mode.c `51f4e090b9` ("net: stmmac: fix integer underflow in chain mode") `6b4286e055` ("net: stmmac: rename STMMAC_GET_ENTRY() -> STMMAC_NEXT_ENTRY()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 13:20:59 -07:00
Justin Iurman	b30b1675aa	net: ioam6: fix OOB and missing lock When trace->type.bit6 is set: if (trace->type.bit6) { ... queue = skb_get_tx_queue(dev, skb); qdisc = rcu_dereference(queue->qdisc); This code can lead to an out-of-bounds access of the dev->_tx[] array when is_input is true. In such a case, the packet is on the RX path and skb->queue_mapping contains the RX queue index of the ingress device. If the ingress device has more RX queues than the egress device (dev) has TX queues, skb_get_queue_mapping(skb) will exceed dev->num_tx_queues. Add a check to avoid this situation since skb_get_tx_queue() does not clamp the index. This issue has also revealed that per queue visibility cannot be accurate and will be replaced later as a new feature. While at it, add missing lock around qdisc_qstats_qlen_backlog(). The function __ioam6_fill_trace_data() is called from both softirq and process contexts, hence the use of spin_lock_bh() here. Fixes: `b63c5478e9` ("ipv6: ioam: Support for Queue depth data field") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20260403214418.2233266-2-kuba@kernel.org/ Signed-off-by: Justin Iurman <justin.iurman@gmail.com> Link: https://patch.msgid.link/20260404134137.24553-1-justin.iurman@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-08 19:08:56 -07:00
Jakub Kicinski	84ac9a922d	Merge tag 'ipsec-2026-04-08' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2026-04-08 1) Clear trailing padding in build_polexpire() to prevent leaking unititialized memory. From Yasuaki Torimaru. 2) Fix aevent size calculation when XFRMA_IF_ID is used. From Keenan Dong. 3) Wait for RCU readers during policy netns exit before freeing the policy hash tables. 4) Fix dome too eaerly dropped references on the netdev when uding transport mode. From Qi Tang. 5) Fix refcount leak in xfrm_migrate_policy_find(). From Kotlyarov Mihail. 6) Fix two fix info leaks in build_report() and in build_mapping(). From Greg Kroah-Hartman. 7) Zero aligned sockaddr tail in PF_KEY exports. From Zhengchuan Liang. * tag 'ipsec-2026-04-08' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: net: af_key: zero aligned sockaddr tail in PF_KEY exports xfrm_user: fix info leak in build_report() xfrm_user: fix info leak in build_mapping() xfrm: fix refcount leak in xfrm_migrate_policy_find xfrm: hold dev ref until after transport_finish NF_HOOK xfrm: Wait for RCU readers during policy netns exit xfrm: account XFRMA_IF_ID in aevent size calculation xfrm: clear trailing padding in build_polexpire() ==================== Link: https://patch.msgid.link/20260408095925.253681-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-08 18:54:32 -07:00
Zhengchuan Liang	fdce0b3590	netfilter: ip6t_eui64: reject invalid MAC header for all packets `eui64_mt6()` derives a modified EUI-64 from the Ethernet source address and compares it with the low 64 bits of the IPv6 source address. The existing guard only rejects an invalid MAC header when `par->fragoff != 0`. For packets with `par->fragoff == 0`, `eui64_mt6()` can still reach `eth_hdr(skb)` even when the MAC header is not valid. Fix this by removing the `par->fragoff != 0` condition so that packets with an invalid MAC header are rejected before accessing `eth_hdr(skb)`. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Co-developed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Yuan Tan <yuantan098@gmail.com> Suggested-by: Xin Liu <bird@lzu.edu.cn> Tested-by: Ren Wei <enjou1224z@gmail.com> Signed-off-by: Zhengchuan Liang <zcliangcn@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-04-08 13:33:56 +02:00
Andrea Mayer	c3812651b5	seg6: separate dst_cache for input and output paths in seg6 lwtunnel The seg6 lwtunnel uses a single dst_cache per encap route, shared between seg6_input_core() and seg6_output_core(). These two paths can perform the post-encap SID lookup in different routing contexts (e.g., ip rules matching on the ingress interface, or VRF table separation). Whichever path runs first populates the cache, and the other reuses it blindly, bypassing its own lookup. Fix this by splitting the cache into cache_input and cache_output, so each path maintains its own cached dst independently. Fixes: `6c8702c60b` ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels") Cc: stable@vger.kernel.org Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Justin Iurman <justin.iurman@gmail.com> Link: https://patch.msgid.link/20260404004405.4057-2-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-07 20:20:56 -07:00

1 2 3 4 5 ...

9005 Commits