As explained in commit 85d05e2817 ("ipv6: change inet6_sk_rebuild_header()
to use inet->cork.fl.u.ip6"):
TCP v6 spends a good amount of time rebuilding a fresh fl6 at each
transmit in inet6_csk_xmit()/inet6_csk_route_socket().
TCP v4 caches the information in inet->cork.fl.u.ip4 instead.
After this patch, passive TCP ipv6 flows have correctly initialized
inet->cork.fl.u.ip6 structure.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260206173426.1638518-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of using private @fl6 and @final variables
use respectively inet->cork.fl.u.ip6 and np->final.
As explained in commit 85d05e2817 ("ipv6: change inet6_sk_rebuild_header()
to use inet->cork.fl.u.ip6"):
TCP v6 spends a good amount of time rebuilding a fresh fl6 at each
transmit in inet6_csk_xmit()/inet6_csk_route_socket().
TCP v4 caches the information in inet->cork.fl.u.ip4 instead.
After this patch, active TCP ipv6 flows have correctly initialized
inet->cork.fl.u.ip6 structure.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260206173426.1638518-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
BIG TCP IPv6 inserts a hop-by-hop extension header to indicate the real
IPv6 payload length when it doesn't fit into the 16-bit field in the
IPv6 header itself. While it helps tools parse the packet, it also
requires every driver that supports TSO and BIG TCP to remove this
8-byte extension header. It might not sound that bad until we try to
apply it to tunneled traffic. Currently, the drivers don't attempt to
strip HBH if skb->encapsulation = 1. Moreover, trying to do so would
require dissecting different tunnel protocols and making corresponding
adjustments on case-by-case basis, which would slow down the fastpath
(potentially also requiring adjusting checksums in outer headers).
At the same time, BIG TCP IPv4 doesn't insert any extra headers and just
calculates the payload length from skb->len, significantly simplifying
implementing BIG TCP for tunnels.
Stop inserting HBH when building BIG TCP GSO SKBs.
Signed-off-by: Alice Mikityanska <alice@isovalent.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260205133925.526371-3-alice.kernel@fastmail.im
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The next commits will transition away from using the hop-by-hop
extension header to encode packet length for BIG TCP. Add wrappers
around ip6->payload_len that return the actual value if it's non-zero,
and calculate it from skb->len if payload_len is set to zero (and a
symmetrical setter).
The new helpers are used wherever the surrounding code supports the
hop-by-hop jumbo header for BIG TCP IPv6, or the corresponding IPv4 code
uses skb_ip_totlen (e.g., in include/net/netfilter/nf_tables_ipv6.h).
No behavioral change in this commit.
Signed-off-by: Alice Mikityanska <alice@isovalent.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260205133925.526371-2-alice.kernel@fastmail.im
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yizhou Zhao reported that simply having one RAW socket on protocol
IPPROTO_RAW (255) was dangerous.
socket(AF_INET, SOCK_RAW, 255);
A malicious incoming ICMP packet can set the protocol field to 255
and match this socket, leading to FNHE cache changes.
inner = IP(src="192.168.2.1", dst="8.8.8.8", proto=255)/Raw("TEST")
pkt = IP(src="192.168.1.1", dst="192.168.2.1")/ICMP(type=3, code=4, nexthopmtu=576)/inner
"man 7 raw" states:
A protocol of IPPROTO_RAW implies enabled IP_HDRINCL and is able
to send any IP protocol that is specified in the passed header.
Receiving of all IP protocols via IPPROTO_RAW is not possible
using raw sockets.
Make sure we drop these malicious packets.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
Link: https://lore.kernel.org/netdev/20251109134600.292125-1-zhaoyz24@mails.tsinghua.edu.cn/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260203192509.682208-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
TCP v6 spends a good amount of time rebuilding a fresh fl6 at each
transmit in inet6_csk_xmit()/inet6_csk_route_socket().
TCP v4 caches the information in inet->cork.fl.u.ip4 instead.
This patch is a first step converting IPv6 to the same strategy:
Before this patch inet6_sk_rebuild_header() only validated/rebuilt
a dst. Automatic variable @fl6 content was lost.
After this patch inet6_sk_rebuild_header() also initializes
inet->cork.fl.u.ip6, which can be reused in the future.
This makes inet6_sk_rebuild_header() very similar to
inet_sk_rebuild_header().
Also remove the EXPORT_SYMBOL_GPL(), inet6_sk_rebuild_header()
is not called from any module.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260204163035.4123817-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzbot reported a kernel BUG in fib6_add_rt2node() when adding an IPv6
route. [0]
Commit f72514b3c5 ("ipv6: clear RA flags when adding a static
route") introduced logic to clear RTF_ADDRCONF from existing routes
when a static route with the same nexthop is added. However, this
causes a problem when the existing route has a gateway.
When RTF_ADDRCONF is cleared from a route that has a gateway, that
route becomes eligible for ECMP, i.e. rt6_qualify_for_ecmp() returns
true. The issue is that this route was never added to the
fib6_siblings list.
This leads to a mismatch between the following counts:
- The sibling count computed by iterating fib6_next chain, which
includes the newly ECMP-eligible route
- The actual siblings in fib6_siblings list, which does not include
that route
When a subsequent ECMP route is added, fib6_add_rt2node() hits
BUG_ON(sibling->fib6_nsiblings != rt->fib6_nsiblings) because the
counts don't match.
Fix this by only clearing RTF_ADDRCONF when the existing route does
not have a gateway. Routes without a gateway cannot qualify for ECMP
anyway (rt6_qualify_for_ecmp() requires fib_nh_gw_family), so clearing
RTF_ADDRCONF on them is safe and matches the original intent of the
commit.
[0]:
kernel BUG at net/ipv6/ip6_fib.c:1217!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 6010 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
RIP: 0010:fib6_add_rt2node+0x3433/0x3470 net/ipv6/ip6_fib.c:1217
[...]
Call Trace:
<TASK>
fib6_add+0x8da/0x18a0 net/ipv6/ip6_fib.c:1532
__ip6_ins_rt net/ipv6/route.c:1351 [inline]
ip6_route_add+0xde/0x1b0 net/ipv6/route.c:3946
ipv6_route_ioctl+0x35c/0x480 net/ipv6/route.c:4571
inet6_ioctl+0x219/0x280 net/ipv6/af_inet6.c:577
sock_do_ioctl+0xdc/0x300 net/socket.c:1245
sock_ioctl+0x576/0x790 net/socket.c:1366
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes: f72514b3c5 ("ipv6: clear RA flags when adding a static route")
Reported-by: syzbot+cb809def1baaac68ab92@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=cb809def1baaac68ab92
Tested-by: syzbot+cb809def1baaac68ab92@syzkaller.appspotmail.com
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260204095837.1285552-1-syoshida@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1) daddr is unlikely a multicast in ip6_finish_output2().
2) ip6_finish_output_gso_slowpath_drop() should not be called often.
3) ip6_fragment() should not be called often.
4) opt is unlikely to be set.
5) ip6_xmit() and ip6_forward() mostly sends not too big packets.
6) Most __ip6_make_skb() calls are for UDP packets,
not ICMPV6 ones.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260130210303.3888261-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With CONFIG_STACKPROTECTOR_STRONG=y, it is better to avoid passing
a pointer to an automatic variable.
Change these exported functions to return 'u8 proto'
instead of void.
- ipv6_push_nfrag_opts()
- ipv6_push_frag_opts()
For instance, replace
ipv6_push_frag_opts(skb, opt, &proto);
with:
proto = ipv6_push_frag_opts(skb, opt, proto);
Note that even after this change, ip6_xmit() has to use a stack canary
because of @first_hop variable.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260130210303.3888261-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
By default, when a kmem_cache is created with SLAB_TYPESAFE_BY_RCU,
slub has to use extra storage for the freelist pointer after each
object, because slub assumes that any bit in the object
can be used by RCU readers.
Because proto_register() is also using SLAB_HWCACHE_ALIGN,
this forces slub to use one extra cache line per object.
We can instead put the slub freelist anywhere in the object,
granted the concurrent RCU readers are not supposed to
use the pointer value.
Add a new (struct sock)sk_freeptr field, in an union
with sk_rcu: No RCU readers would need to look at sk_rcu,
which is only used at free phase.
Tested:
grep . /sys/kernel/slab/TCP/{object_size,slab_size,objs_per_slab}
grep . /sys/kernel/slab/TCPv6/{object_size,slab_size,objs_per_slab}
Before:
/sys/kernel/slab/TCP/object_size:2368
/sys/kernel/slab/TCP/slab_size:2432
/sys/kernel/slab/TCP/objs_per_slab:13
/sys/kernel/slab/TCPv6/object_size:2496
/sys/kernel/slab/TCPv6/slab_size:2560
/sys/kernel/slab/TCPv6/objs_per_slab:12
After this patch, we can pack one more TCPv6 object per slab,
and object_size == slab_size.
/sys/kernel/slab/TCP/object_size:2368
/sys/kernel/slab/TCP/slab_size:2368
/sys/kernel/slab/TCP/objs_per_slab:13
/sys/kernel/slab/TCPv6/object_size:2496
/sys/kernel/slab/TCPv6/slab_size:2496
/sys/kernel/slab/TCPv6/objs_per_slab:13
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260129153458.4163797-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
fl6_update_dst() is called for every TCP (and others) transmit,
and is a nop for common cases.
Split it in two parts :
1) fl6_update_dst() inline helper, small and fast.
2) __fl6_update_dst() for the exception, out of line.
Small size increase to get better TX performance.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 2/2 grow/shrink: 8/0 up/down: 296/-125 (171)
Function old new delta
__fl6_update_dst - 104 +104
rawv6_sendmsg 2244 2284 +40
udpv6_sendmsg 3013 3043 +30
tcp_v6_connect 1514 1534 +20
cookie_v6_check 1501 1519 +18
ip6_datagram_dst_update 673 690 +17
inet6_sk_rebuild_header 499 516 +17
inet6_csk_route_socket 507 524 +17
inet6_csk_route_req 343 360 +17
__pfx___fl6_update_dst - 16 +16
__pfx_fl6_update_dst 16 - -16
fl6_update_dst 109 - -109
Total: Before=22570304, After=22570475, chg +0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260128185548.3738781-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch enhances GSO segment handling by properly checking
the SKB_GSO_DODGY flag for frag_list GSO packets, addressing
low throughput issues observed when a station accesses IPv4
servers via hotspots with an IPv6-only upstream interface.
Specifically, it fixes a bug in GSO segmentation when forwarding
GRO packets containing a frag_list. The function skb_segment_list
cannot correctly process GRO skbs that have been converted by XLAT,
since XLAT only translates the header of the head skb. Consequently,
skbs in the frag_list may remain untranslated, resulting in protocol
inconsistencies and reduced throughput.
To address this, the patch explicitly sets the SKB_GSO_DODGY flag
for GSO packets in XLAT's IPv4/IPv6 protocol translation helpers
(bpf_skb_proto_4_to_6 and bpf_skb_proto_6_to_4). This marks GSO
packets as potentially modified after protocol translation. As a
result, GSO segmentation will avoid using skb_segment_list and
instead falls back to skb_segment for packets with the SKB_GSO_DODGY
flag. This ensures that only safe and fully translated frag_list
packets are processed by skb_segment_list, resolving protocol
inconsistencies and improving throughput when forwarding GRO packets
converted by XLAT.
Signed-off-by: Jibin Zhang <jibin.zhang@mediatek.com>
Fixes: 9fd1ff5d2a ("udp: Support UDP fraglist GRO/GSO.")
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260126152114.1211-1-jibin.zhang@mediatek.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Introduce sw acceleration for rx path of IP6IP6 tunnels relying on the
netfilter flowtable infrastructure. Subsequent patches will add sw
acceleration for IP6IP6 tunnels tx path.
IP6IP6 rx sw acceleration can be tested running the following scenario
where the traffic is forwarded between two NICs (eth0 and eth1) and an
IP6IP6 tunnel is used to access a remote site (using eth1 as the underlay
device):
ETH0 -- TUN0 <==> ETH1 -- [IP network] -- TUN1 (2001:db8:3::2)
$ip addr show
6: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:00:22:33:11:55 brd ff:ff:ff:ff:ff:ff
inet6 2001:db8:1::2/64 scope global nodad
valid_lft forever preferred_lft forever
7: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:11:22:33:11:55 brd ff:ff:ff:ff:ff:ff
inet6 2001:db8:2::1/64 scope global nodad
valid_lft forever preferred_lft forever
8: tun0@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
link/tunnel6 2001:db8:2::1 peer 2001:db8:2::2 permaddr ce9c:2940:7dcc::
inet6 2002:db8:1::1/64 scope global nodad
valid_lft forever preferred_lft forever
$ip -6 route show
2001:db8:1::/64 dev eth0 proto kernel metric 256 pref medium
2001:db8:2::/64 dev eth1 proto kernel metric 256 pref medium
2002:db8:1::/64 dev tun0 proto kernel metric 256 pref medium
default via 2002:db8:1::2 dev tun0 metric 1024 pref medium
$nft list ruleset
table inet filter {
flowtable ft {
hook ingress priority filter
devices = { eth0, eth1 }
}
chain forward {
type filter hook forward priority filter; policy accept;
meta l4proto { tcp, udp } flow add @ft
}
}
Reproducing the scenario described above using veths I got the following
results:
- TCP stream received from the IPIP tunnel:
- net-next: (baseline) ~ 81Gbps
- net-next + IP6IP6 flowtbale support: ~112Gbps
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
There's now a proper SHA-1 API that follows the usual conventions for
hash function APIs: sha1_init(), sha1_update(), sha1_final(), sha1().
The only remaining user of the older low-level SHA-1 API,
sha1_init_raw() and sha1_transform(), is ipv6_generate_stable_address().
I'd like to remove this older API, which is too low-level.
Unfortunately, ipv6_generate_stable_address() does in fact skip the
SHA-1 finalization for some reason. So the values it computes are not
standard SHA-1 values, and it sort of does want the low-level API.
Still, it's still possible to use the higher-level functions sha1_init()
and sha1_update() to get the same result, provided that the resulting
state is used directly, skipping sha1_final().
So, let's do that instead. This will allow removing the low-level API.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Acked-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260123051656.396371-2-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When replying to a ICMPv6 echo request that comes from localhost address
the right output ifindex is 1 (lo) and not rt6i_idev dev index. Use the
skb device ifindex instead. This fixes pinging to a local address from
localhost source address.
$ ping6 -I ::1 2001:1:1::2 -c 3
PING 2001:1:1::2 (2001:1:1::2) from ::1 : 56 data bytes
64 bytes from 2001:1:1::2: icmp_seq=1 ttl=64 time=0.037 ms
64 bytes from 2001:1:1::2: icmp_seq=2 ttl=64 time=0.069 ms
64 bytes from 2001:1:1::2: icmp_seq=3 ttl=64 time=0.122 ms
2001:1:1::2 ping statistics
3 packets transmitted, 3 received, 0% packet loss, time 2032ms
rtt min/avg/max/mdev = 0.037/0.076/0.122/0.035 ms
Fixes: 1b70d792cf ("ipv6: Use rt6i_idev index for echo replies to a local address")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260121194409.6749-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
FDO/LTO are unable to inline tcp6_gro_receive() from ipv6_gro_receive()
Make sure tcp6_check_fraglist_gro() is only called only when needed,
so that compiler can leave it out-of-line.
$ scripts/bloat-o-meter -t vmlinux.1 vmlinux.2
add/remove: 2/0 grow/shrink: 3/1 up/down: 1123/-253 (870)
Function old new delta
ipv6_gro_receive 1069 1846 +777
tcp6_check_fraglist_gro - 272 +272
ipv6_offload_init 218 274 +56
__pfx_tcp6_check_fraglist_gro - 16 +16
ipv6_gro_complete 433 435 +2
tcp6_gro_receive 959 706 -253
Total: Before=22592662, After=22593532, chg +0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260120164903.1912995-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzbot found that ndisc_router_discovery() could read and write
in6_dev->ra_mtu without holding a lock [1]
This looks fine, IFLA_INET6_RA_MTU is best effort.
Add READ_ONCE()/WRITE_ONCE() to document the race.
Note that we might also reject illegal MTU values
(mtu < IPV6_MIN_MTU || mtu > skb->dev->mtu) in a future patch.
[1]
BUG: KCSAN: data-race in ndisc_router_discovery / ndisc_router_discovery
read to 0xffff888119809c20 of 4 bytes by task 25817 on cpu 1:
ndisc_router_discovery+0x151d/0x1c90 net/ipv6/ndisc.c:1558
ndisc_rcv+0x2ad/0x3d0 net/ipv6/ndisc.c:1841
icmpv6_rcv+0xe5a/0x12f0 net/ipv6/icmp.c:989
ip6_protocol_deliver_rcu+0xb2a/0x10d0 net/ipv6/ip6_input.c:438
ip6_input_finish+0xf0/0x1d0 net/ipv6/ip6_input.c:489
NF_HOOK include/linux/netfilter.h:318 [inline]
ip6_input+0x5e/0x140 net/ipv6/ip6_input.c:500
ip6_mc_input+0x27c/0x470 net/ipv6/ip6_input.c:590
dst_input include/net/dst.h:474 [inline]
ip6_rcv_finish+0x336/0x340 net/ipv6/ip6_input.c:79
...
write to 0xffff888119809c20 of 4 bytes by task 25816 on cpu 0:
ndisc_router_discovery+0x155a/0x1c90 net/ipv6/ndisc.c:1559
ndisc_rcv+0x2ad/0x3d0 net/ipv6/ndisc.c:1841
icmpv6_rcv+0xe5a/0x12f0 net/ipv6/icmp.c:989
ip6_protocol_deliver_rcu+0xb2a/0x10d0 net/ipv6/ip6_input.c:438
ip6_input_finish+0xf0/0x1d0 net/ipv6/ip6_input.c:489
NF_HOOK include/linux/netfilter.h:318 [inline]
ip6_input+0x5e/0x140 net/ipv6/ip6_input.c:500
ip6_mc_input+0x27c/0x470 net/ipv6/ip6_input.c:590
dst_input include/net/dst.h:474 [inline]
ip6_rcv_finish+0x336/0x340 net/ipv6/ip6_input.c:79
...
value changed: 0x00000000 -> 0xe5400659
Fixes: 49b99da2c9 ("ipv6: add IFLA_INET6_RA_MTU to expose mtu value")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Rocco Yue <rocco.yue@mediatek.com>
Link: https://patch.msgid.link/20260118152941.2563857-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Steffen Klassert says:
====================
pull request (net): ipsec 2026-01-14
1) Fix inner mode lookup in tunnel mode GSO segmentation.
The protocol was taken from the wrong field.
2) Set ipv4 no_pmtu_disc flag only on output SAs. The
insertation of input SAs can fail if no_pmtu_disc
is set.
Please pull or let me know if there are problems.
ipsec-2026-01-14
* tag 'ipsec-2026-01-14' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: set ipv4 no_pmtu_disc flag only on output sa when direction is set
xfrm: Fix inner mode lookup in tunnel mode GSO segmentation
====================
Link: https://patch.msgid.link/20260114121817.1106134-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
IPv4 allows for a nexthop device mismatch when the "onlink" keyword is
specified:
# ip link add name dummy1 up type dummy
# ip address add 192.0.2.1/24 dev dummy1
# ip link add name dummy2 up type dummy
# ip route add 198.51.100.0/24 nexthop via 192.0.2.2 dev dummy2
Error: Nexthop has invalid gateway.
# ip route add 198.51.100.0/24 nexthop via 192.0.2.2 dev dummy2 onlink
# echo $?
0
This seems to be consistent with the description of "onlink" in the
ip-route man page: "Pretend that the nexthop is directly attached to
this link, even if it does not match any interface prefix".
On the other hand, IPv6 rejects a nexthop device mismatch, even when
"onlink" is specified:
# ip link add name dummy1 up type dummy
# ip address add 2001:db8:1::1/64 dev dummy1
# ip link add name dummy2 up type dummy
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2
RTNETLINK answers: No route to host
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 onlink
Error: Nexthop has invalid gateway or device mismatch.
This is intentional according to commit fc1e64e109 ("net/ipv6: Add
support for onlink flag") which added IPv6 "onlink" support and states
that "any unicast gateway is allowed as long as the gateway is not a
local address and if it resolves it must match the given device".
The condition was later relaxed in commit 4ed591c8ab ("net/ipv6: Allow
onlink routes to have a device mismatch if it is the default route") to
allow for a nexthop device mismatch if the gateway address is resolved
via the default route:
# ip link add name dummy1 up type dummy
# ip route add ::/0 dev dummy1
# ip link add name dummy2 up type dummy
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2
RTNETLINK answers: No route to host
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 onlink
# echo $?
0
While the decision to forbid a nexthop device mismatch in IPv6 seems to
be intentional, it is unclear why it was made. Especially when it
differs from IPv4 and seems to go against the intended behavior of
"onlink".
Therefore, relax the condition further and allow for a nexthop device
mismatch when "onlink" is specified:
# ip link add name dummy1 up type dummy
# ip address add 2001:db8:1::1/64 dev dummy1
# ip link add name dummy2 up type dummy
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 onlink
# echo $?
0
The motivating use case is the fact that FRR would like to be able to
configure overlay routes of the following form:
# ip route add <host-Z> vrf <VRF> encap ip id <ID> src <VTEP-A> dst <VTEP-Z> via <VTEP-Z> dev vxlan0 onlink
Where vxlan0 is in the default VRF in which "VTEP-Z" is reachable via
one of the underlay routes (e.g., via swpX). Without this patch, the
above only works with IPv4, but not with IPv6.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260111120813.159799-5-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
IPv6 addresses with the same scope are returned in reverse insertion
order, unlike IPv4. For example, when adding a -> b -> c, the list is
reported as c -> b -> a, while IPv4 preserves the original order.
This behavior causes:
a. When using `ip -6 a save` and `ip -6 a restore`, addresses are restored
in the opposite order from which they were saved. See example below
showing addresses added as 1::1, 1::2, 1::3 but displayed and saved
in reverse order.
# ip -6 a a 1::1 dev x
# ip -6 a a 1::2 dev x
# ip -6 a a 1::3 dev x
# ip -6 a s dev x
2: x: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
inet6 1::3/128 scope global tentative
valid_lft forever preferred_lft forever
inet6 1::2/128 scope global tentative
valid_lft forever preferred_lft forever
inet6 1::1/128 scope global tentative
valid_lft forever preferred_lft forever
# ip -6 a save > dump
# ip -6 a d 1::1 dev x
# ip -6 a d 1::2 dev x
# ip -6 a d 1::3 dev x
# ip a d ::1 dev lo
# ip a restore < dump
# ip -6 a s dev x
2: x: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
inet6 1::1/128 scope global tentative
valid_lft forever preferred_lft forever
inet6 1::2/128 scope global tentative
valid_lft forever preferred_lft forever
inet6 1::3/128 scope global tentative
valid_lft forever preferred_lft forever
# ip a showdump < dump
if1:
inet6 ::1/128 scope host proto kernel_lo
valid_lft forever preferred_lft forever
if2:
inet6 1::3/128 scope global tentative
valid_lft forever preferred_lft forever
if2:
inet6 1::2/128 scope global tentative
valid_lft forever preferred_lft forever
if2:
inet6 1::1/128 scope global tentative
valid_lft forever preferred_lft forever
b. Addresses in pasta to appear in reversed order compared to host
addresses.
The ipv6 addresses were added in reverse order by commit e55ffac601
("[IPV6]: order addresses by scope"), then it was changed by commit
502a2ffd73 ("ipv6: convert idev_list to list macros"), and restored by
commit b54c9b98bb ("ipv6: Preserve pervious behavior in
ipv6_link_dev_addr()."). However, this reverse ordering within the same
scope causes inconsistency with IPv4 and the issues described above.
This patch aligns IPv6 address ordering with IPv4 for consistency
by changing the comparison from >= to > when inserting addresses
into the address list. Also updates the ioam6 selftest to reflect
the new address ordering behavior. Combine these two changes into
one patch for bisectability.
Link: https://bugs.passt.top/show_bug.cgi?id=175
Suggested-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Yumei Huang <yuhuang@redhat.com>
Acked-by: Justin Iurman <justin.iurman@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260104032357.38555-1-yuhuang@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
On PREEMPT_RT kernels, after rt6_get_pcpu_route() returns NULL, the
current task can be preempted. Another task running on the same CPU
may then execute rt6_make_pcpu_route() and successfully install a
pcpu_rt entry. When the first task resumes execution, its cmpxchg()
in rt6_make_pcpu_route() will fail because rt6i_pcpu is no longer
NULL, triggering the BUG_ON(prev). It's easy to reproduce it by adding
mdelay() after rt6_get_pcpu_route().
Using preempt_disable/enable is not appropriate here because
ip6_rt_pcpu_alloc() may sleep.
Fix this by handling the cmpxchg() failure gracefully on PREEMPT_RT:
free our allocation and return the existing pcpu_rt installed by
another task. The BUG_ON is replaced by WARN_ON_ONCE for non-PREEMPT_RT
kernels where such races should not occur.
Link: https://syzkaller.appspot.com/bug?extid=9b35e9bc0951140d13e6
Fixes: d2d6422f8b ("x86: Allow to enable PREEMPT_RT.")
Reported-by: syzbot+9b35e9bc0951140d13e6@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6918cd88.050a0220.1c914e.0045.GAE@google.com/T/
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://patch.msgid.link/20251223051413.124687-1-jiayuan.chen@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
There exists a kernel oops caused by a BUG_ON(nhead < 0) at
net/core/skbuff.c:2232 in pskb_expand_head().
This bug is triggered as part of the calipso_skbuff_setattr()
routine when skb_cow() is passed headroom > INT_MAX
(i.e. (int)(skb_headroom(skb) + len_delta) < 0).
The root cause of the bug is due to an implicit integer cast in
__skb_cow(). The check (headroom > skb_headroom(skb)) is meant to ensure
that delta = headroom - skb_headroom(skb) is never negative, otherwise
we will trigger a BUG_ON in pskb_expand_head(). However, if
headroom > INT_MAX and delta <= -NET_SKB_PAD, the check passes, delta
becomes negative, and pskb_expand_head() is passed a negative value for
nhead.
Fix the trigger condition in calipso_skbuff_setattr(). Avoid passing
"negative" headroom sizes to skb_cow() within calipso_skbuff_setattr()
by only using skb_cow() to grow headroom.
PoC:
Using `netlabelctl` tool:
netlabelctl map del default
netlabelctl calipso add pass doi:7
netlabelctl map add default address:0::1/128 protocol:calipso,7
Then run the following PoC:
int fd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_UDP);
// setup msghdr
int cmsg_size = 2;
int cmsg_len = 0x60;
struct msghdr msg;
struct sockaddr_in6 dest_addr;
struct cmsghdr * cmsg = (struct cmsghdr *) calloc(1,
sizeof(struct cmsghdr) + cmsg_len);
msg.msg_name = &dest_addr;
msg.msg_namelen = sizeof(dest_addr);
msg.msg_iov = NULL;
msg.msg_iovlen = 0;
msg.msg_control = cmsg;
msg.msg_controllen = cmsg_len;
msg.msg_flags = 0;
// setup sockaddr
dest_addr.sin6_family = AF_INET6;
dest_addr.sin6_port = htons(31337);
dest_addr.sin6_flowinfo = htonl(31337);
dest_addr.sin6_addr = in6addr_loopback;
dest_addr.sin6_scope_id = 31337;
// setup cmsghdr
cmsg->cmsg_len = cmsg_len;
cmsg->cmsg_level = IPPROTO_IPV6;
cmsg->cmsg_type = IPV6_HOPOPTS;
char * hop_hdr = (char *)cmsg + sizeof(struct cmsghdr);
hop_hdr[1] = 0x9; //set hop size - (0x9 + 1) * 8 = 80
sendmsg(fd, &msg, 0);
Fixes: 2917f57b6b ("calipso: Allow the lsm to label the skbuff directly.")
Suggested-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Will Rosenberg <whrosenb@asu.edu>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://patch.msgid.link/20251219173637.797418-1-whrosenb@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>