Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following batch contains Netfilter fixes for net:
1) Allow initial x_tables table replacement without emitting an audit
log message. Delay the register message until after hooks are wired up
to avoid unnecessary unregister logs during error unwinding.
2) Fix a NULL dereference by allocating hook ops before adding the
table to the per-netns list. Use `synchronize_rcu()` during error
unwinding to ensure the table stops processing packets before
teardown. Defer audit log register message until all operations
succeed.
3) Refactor xtables to use a single `xt_unregister_table_pre_exit`
function. Eliminate code duplication by centralizing table
unregistration logic within the xtables core. ebtables cannot be
changed due to incompatibility.
4) Unregister xtables templates before module removal. This prevents
a race condition where userspace instantiates a new table after the
pernet unreg removed the current table.
5) Add `xtables_unregister_table_exit` to fully unregister netfilter
tables during module removal. Unlink the table from dying lists,
then free hook operations.
6) Implement a two-stage removal scheme for ebtables following the
x_tables pattern. Assign table->ops while holding the ebt mutex to
prevent exposing partially-filled structures.
7) Fix ebtables module initialization race. Register the template last
in table initialization functions. Prevent table instantiation before
pernet operations are available.
8) Fix a race condition in x_tables module initialization. Ensure
pernet ops are fully set up before exposing the table to userspace.
9) Fix a race condition in ebtables module initialization, similar to
previous patch.
10) Restore propagation of helper to expected connection, this is a
fix-for-recent-fix.
11) Validate that the expectation tuple and mask netlink attributes are
present when adding expectation via nfqueue, this fixes a possible
null-ptr-deref.
12) Fix possible rare memleak in the SIP helper in case helper has been
detached from conntrack entry, from Li Xiasong.
13) Fix refcount leak in nft_ct when creating custom expectation, also
from Li Xiason.
Patches 1-9 from Florian Westphal.
10) Restore propagation of helper to expected connection, this is a
fix-for-recent-fix.
11) Check that tuple and mask netlink attributes are set when creating an
expectation via nfqueue.
* tag 'nf-26-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_ct: fix missing expect put in obj eval
netfilter: nf_conntrack_sip: get helper before allocating expectation
netfilter: ctnetlink: check tuple and mask in expectations created via nfqueue
netfilter: nf_conntrack_expect: restore helper propagation via expectation
netfilter: bridge: eb_tables: close module init race
netfilter: x_tables: close dangling table module init race
netfilter: ebtables: close dangling table module init race
netfilter: ebtables: move to two-stage removal scheme
netfilter: x_tables: add and use xtables_unregister_table_exit
netfilter: x_tables: unregister the templates first
netfilter: x_tables: add and use xt_unregister_table_pre_exit
netfilter: x_tables: allocate hook ops while under mutex
netfilter: x_tables: allow initial table replace without emitting audit log message
====================
Link: https://patch.msgid.link/20260507234509.603182-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
fl_size, fl_ht and ip6_fl_lock in net/ipv6/ip6_flowlabel.c are
file scope and shared across netns. mem_check() reads fl_size to
decide whether to deny non-CAP_NET_ADMIN callers. capable() runs
against init_user_ns, so an unprivileged user in any non-init
userns can push fl_size past FL_MAX_SIZE - FL_MAX_SIZE / 4 and
starve every other unprivileged userns on the host.
Add struct netns_ipv6::flowlabel_count, bumped and decremented
next to fl_size in fl_intern, ip6_fl_gc and ip6_fl_purge. The new
field fills the existing 4-byte hole after ipmr_seq, so struct
netns_ipv6 stays the same size on 64-bit builds.
Bump FL_MAX_SIZE from 4096 to 8192. It has been 4096 since the
file was added. Machines and connection counts have grown.
mem_check() folds an extra per-netns ceiling into the existing
non-CAP_NET_ADMIN conditional. The ceiling is half of the total
budget that unprivileged callers have ever been able to use, i.e.
(FL_MAX_SIZE - FL_MAX_SIZE / 4) / 2 = 3072 entries. With
FL_MAX_SIZE doubled, this preserves the original per-user reach
of 3K (what an unprivileged caller could already obtain before
this change), while forcing an attacker to spread allocations
across at least two netns to exhaust the global non-CAP_NET_ADMIN
budget.
CAP_NET_ADMIN against init_user_ns still bypasses both caps.
The previous patch took ip6_fl_lock across mem_check and
fl_intern, so the new flowlabel_count read in mem_check and the
new flowlabel_count++ in fl_intern run under the same critical
section. flowlabel_count is therefore plain int, like fl_size.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Suggested-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Cc: stable@vger.kernel.org # v5.15+
Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg>
Link: https://patch.msgid.link/20260506082416.2259567-3-maoyixie.tju@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
mem_check() in net/ipv6/ip6_flowlabel.c reads fl_size without
holding ip6_fl_lock. fl_intern() takes the lock immediately
afterwards. The two checks therefore race against concurrent
fl_intern, ip6_fl_gc and ip6_fl_purge writers, which makes the
mem_check budget check approximate.
Move spin_lock_bh(&ip6_fl_lock) and the matching unlock from
fl_intern() into its only caller ipv6_flowlabel_get(). The
mem_check() call now runs under the same critical section as the
fl_intern() insert, so the budget check is exact.
With all writers and the read of fl_size under ip6_fl_lock,
convert fl_size from atomic_t to plain int. The four sites that
update or read fl_size are fl_intern (insert path), ip6_fl_gc
(garbage collector, the !sched check and the per-entry decrement),
ip6_fl_purge (per-netns purge), and mem_check (budget check), and
all four now run under ip6_fl_lock.
This is a prerequisite for adding a per-netns budget alongside
fl_size. The follow-up patch adds netns_ipv6::flowlabel_count and
folds it into mem_check().
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Suggested-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg>
Link: https://patch.msgid.link/20260506082416.2259567-2-maoyixie.tju@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Similar to the previous ebtables patch:
template add exposes the table to userspace, we must do this last to
rnsure the pernet ops are set up (contain the destructors).
Fixes: fdacd57c79 ("netfilter: x_tables: never register tables by default")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Previous change added xtables_unregister_table_pre_exit to detach the
table from the packetpath and to unlink it from the active table list.
In case of rmmod, userspace that is doing set/getsockopt for this table
will not be able to re-instantiate the table:
1. The larval table has been removed already
2. existing instantiated table is no longer on the xt pernet table list.
This adds the second stage helper:
unlink the table from the dying list, free the hook ops (if any) and do
the audit notification. It replaces xt_unregister_table().
Fixes: fdacd57c79 ("netfilter: x_tables: never register tables by default")
Reported-by: Tristan Madani <tristan@talencesecurity.com>
Reviewed-by: Tristan Madani <tristan@talencesecurity.com>
Closes: https://lore.kernel.org/netfilter-devel/20260429175613.1459342-1-tristmd@gmail.com/
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Remove the copypasted variants of _pre_exit and add one single
function in the xtables core. ebtables is not compatible with
x_tables and therefore unchanged.
This is a preparation patch to reduce noise in the followup
bug fixes.
Reviewed-by: Tristan Madani <tristan@talencesecurity.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
arp/ip(6)t_register_table() add the table to the per-netns list via
xt_register_table() before allocating the per-netns hook ops copy
via kmemdup_array(). This leaves a window where the table is
visible in the list with ops=NULL.
If the pernet exit happens runs concurrently the pre_exit callback finds
the table via xt_find_table() and passes the NULL ops pointer to
nf_unregister_net_hooks(), causing a NULL dereference:
general protection fault in nf_unregister_net_hooks+0xbc/0x150
RIP: nf_unregister_net_hooks (net/netfilter/core.c:613)
Call Trace:
ipt_unregister_table_pre_exit
iptable_mangle_net_pre_exit
ops_pre_exit_list
cleanup_net
Fix by moving the ops allocation into the xtables core so the table is
never in the list without valid ops. Also ensure the table is no longer
processing packets before its torn down on error unwind.
nf_register_net_hooks might have published at least one hook; call
synchronize_rcu() if there was an error.
audit log register message gets deferred until all operations have
passed, this avoids need to emit another ureg message in case of
error unwinding.
Based on earlier patch by Tristan Madani.
Fixes: f9006acc8d ("netfilter: arp_tables: pass table pointer via nf_hook_ops")
Fixes: ee177a5441 ("netfilter: ip6_tables: pass table pointer via nf_hook_ops")
Fixes: ae68933422 ("netfilter: ip_tables: pass table pointer via nf_hook_ops")
Link: https://lore.kernel.org/netfilter-devel/20260429175613.1459342-1-tristmd@gmail.com/
Signed-off-by: Tristan Madani <tristan@talencesecurity.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If a socket is bound to a wildcard address, tcp_v[46]_connect()
updates it with a non-wildcard address based on the route lookup.
After bhash2 was introduced in the cited commit, we must call
inet_bhash2_update_saddr() to update the bhash2 entry as well.
If inet_bhash2_update_saddr() fails, we must release the refcount
for dst by ip_route_connect() or ip6_dst_lookup_flow().
While tcp_v4_connect() calls ip_rt_put() in the error path,
tcp_v6_connect() does not call dst_release().
Let's call dst_release() when inet_bhash2_update_saddr() fails
in tcp_v6_connect().
Fixes: 28044fc1d4 ("net: Add a bhash2 table hashed by port and address")
Reported-by: Damiano Melotti <melotti@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260506070443.1699879-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_child_process( .. child ...) currently calls sock_put(child).
Unfortunately @child (named @nsk in callers) can be used after
this point to send a RST packet.
To fix this UAF, I remove the sock_put() from tcp_child_process()
and let the callers handle this after it is safe.
Remove @rsk variable in tcp_v4_do_rcv() and change tcp_v6_do_rcv()
so that both functions look the same.
Fixes: cfb6eeb4c8 ("[TCP]: MD5 Signature Option (RFC2385) support.")
Reported-by: Damiano Melotti <melotti@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260505153927.3435532-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Steffen Klassert says:
====================
pull request (net): ipsec 2026-05-05
1. Fix an IPv6 encapsulation error path that leaked route references
when UDPv6 ESP decapsulation resolved to an error route.
From Yilin Zhu.
2. Fix AH with ESN on async crypto paths by accounting for the extra
high-order sequence number when reconstructing the temporary
authentication layout in the completion callbacks.
From Michael Bomarito.
3. Fix XFRM output so it does not overwrite already-correct inner header
pointers when a tunnel layer such as VXLAN has already saved them.
The fix comes with new selftests. From Cosmin Ratiu.
4. Add the missing native payload size entry for XFRM_MSG_MAPPING in the
compat translation path. From Ruijie Li.
5. Harden __xfrm_state_delete() against repeated or inconsistent unhashing
of state list nodes by keying the removal on actual list membership and
using delete-and-init helpers. From Michal Kosiorek.
6. Prevent ESP from decrypting shared splice-backed skb fragments in place
by marking UDP splice frags as shared and forcing copy-on-write in ESP
input when needed. From Kuan-Ting Chen.
* tag 'ipsec-2026-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: esp: avoid in-place decrypt on shared skb frags
xfrm: defensively unhash xfrm_state lists in __xfrm_state_delete
xfrm: provide message size for XFRM_MSG_MAPPING
xfrm: Don't clobber inner headers when already set
tools/selftests: Add a VXLAN+IPsec traffic test
tools/selftests: Use a sensible timeout value for iperf3 client
xfrm: ah: account for ESN high bits in async callbacks
ipv6: xfrm6: release dst on error in xfrm6_rcv_encap()
====================
Link: https://patch.msgid.link/20260505132326.1362733-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This basically defaulted to m until recently, since IPV6 defaulted to
m. Since IPV6 was changed to a boolean with a default of y, IPV6_SIT
started defaulting to built-in as well. This results in a surprise
sit0 device by default for defconfig (and defconfig-derived config)
users at boot. For me, this broke an (admittedly non-robust) script.
Preserve the behaviour of most configs by avoiding building this
module, that's probably overall seldom used compared to IPv6 as a
whole, into the kernel.
Fixes: 309b905dee ("ipv6: convert CONFIG_IPV6 to built-in only and clean up Kconfigs")
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260503192515.290900-2-hi@alyssa.is
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
MSG_SPLICE_PAGES can attach pages from a pipe directly to an skb. TCP
marks such skbs with SKBFL_SHARED_FRAG after skb_splice_from_iter(),
so later paths that may modify packet data can first make a private
copy. The IPv4/IPv6 datagram append paths did not set this flag when
splicing pages into UDP skbs.
That leaves an ESP-in-UDP packet made from shared pipe pages looking
like an ordinary uncloned nonlinear skb. ESP input then takes the no-COW
fast path for uncloned skbs without a frag_list and decrypts in place
over data that is not owned privately by the skb.
Mark IPv4/IPv6 datagram splice frags with SKBFL_SHARED_FRAG, matching
TCP. Also make ESP input fall back to skb_cow_data() when the flag is
present, so ESP does not decrypt externally backed frags in place.
Private nonlinear skb frags still use the existing fast path.
This intentionally does not change ESP output. In esp_output_head(),
the path that appends the ESP trailer to existing skb tailroom without
calling skb_cow_data() is not reachable for nonlinear skbs:
skb_tailroom() returns zero when skb->data_len is nonzero, while ESP
tailen is positive. Thus ESP output will either use the separate
destination-frag path or fall back to skb_cow_data().
Fixes: cac2661c53 ("esp4: Avoid skb_cow_data whenever possible")
Fixes: 03e2a30f6a ("esp6: Avoid skb_cow_data whenever possible")
Fixes: 7da0dde684 ("ip, udp: Support MSG_SPLICE_PAGES")
Fixes: 6d8192bd69 ("ip6, udp6: Support MSG_SPLICE_PAGES")
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Reported-by: Kuan-Ting Chen <h3xrabbit@gmail.com>
Tested-by: Hyunwoo Kim <imv4bel@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Kuan-Ting Chen <h3xrabbit@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
After commit 5e72ce3e39 ("net: ipv6: Use link netns in newlink() of
rtnl_link_ops"), ip6erspan_newlink() correctly resolves the per-netns
ip6gre hash via link_net. ip6erspan_changelink() was not converted in
that series and still uses dev_net(dev), which diverges from the
device's creation netns after IFLA_NET_NS_FD migration.
This re-inserts the tunnel into the wrong per-netns hash. The
original netns keeps a stale entry. When that netns is later
destroyed, ip6gre_exit_rtnl_net() walks the stale entry, producing a
slab-use-after-free reported by KASAN, followed by a kernel BUG at
net/core/dev.c (LIST_POISON1) in unregister_netdevice_many_notify().
Reachable from an unprivileged user namespace (unshare --user
--map-root-user --net).
ip6gre_changelink() earlier in the same file already uses the cached
t->net; only ip6erspan_changelink() has the wrong shape.
Fixes: 2d665034f2 ("net: ip6_gre: Fix ip6erspan hlen calculation")
Cc: stable@vger.kernel.org # v5.15+
Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260430103318.3206018-1-maoyi.xie@ntu.edu.sg
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When using IPv6 ECMP routes, if a netdev listed as a nexthop experiences
a carrier change event (e.g., a bond device generating a NETDEV_CHANGE
event after its slaves go linkdown), established connections utilizing
that nexthop fail to fail over to other available nexthops. Instead,
these connections stall or drop.
This happens because the IPv6 FIB code does not invalidate the socket's
cached destination when a NETDEV_CHANGE event occurs. While
fib6_ifdown() correctly marks the nexthop with RTNH_F_LINKDOWN, it
leaves the route's serial number unchanged. As a result, sockets with a
previously cached dst do not realize the route is no longer viable and
continue to try using the non-functional nexthop.
This behavior contrasts with IPv4, which actively flushes cached
destinations on a NETDEV_CHANGE event (see fib_netdev_event() in
net/ipv4/fib_frontend.c).
Fix this by updating the route serial number in fib6_ifdown() when
setting RTNH_F_LINKDOWN. This invalidates stale cached destinations,
forcing sockets to perform a new route lookup and fail over to a
functioning nexthop.
Fixes: 51ebd31815 ("ipv6: add support of equal cost multipath (ECMP)")
Signed-off-by: Sagarika Sharma <sharmasagarika@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430200909.527827-2-sharmasagarika@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following batch contains Netfilter fixes for net:
1) Replace skb_try_make_writable() by skb_ensure_writable() in
nft_fwd_netdev and the flowtable to deal with uncloned packets
having their network header in paged fragments.
2) Drop packet if output device does not exist and ensure sufficient
headroom in nft_fwd_netdev before transmitting the skb.
3) Use the existing dup recursion counter in nft_fwd_netdev for the
neigh_xmit variant, from Weiming Shi.
4) Add .check_hooks interface to x_tables to detach the control plane
hook check based on the match/target configuration. Then, update
nft_compat to use .check_hooks from .validate path, this fixes a
lack of hook validation for several match/targets.
5) Fix incorrect .usersize in xt_CT, from Florian Westphal.
6) Fix a memleak with netdev tables in dormant state,
from Florian Westphal.
7) Several patches to check if the packet is a fragment, then skip
layer 4 inspection, for x_tables and nf_tables; as well as common
nf_socket infrastructure. The xt_hashlimit match drops fragments
to stay consistent with the existing approach when failing to parse
the layer 4 protocol header.
8) Ensure sufficient headroom in the flowtable before transmitting
the skb.
9) Fix the flowtable inline vlan approach for double-tagged vlan:
Reverse the iteration over .encap[] since it represents the
encapsulation as seen from the ingress path. Postpone pushing
layer 2 header so output device is available to calculate needed
headroom. Finally, add and use nf_flow_vlan_push() to fix it.
10) Fix flowtable inline pppoe with GSO packets. Moreover, use
FLOW_OFFLOAD_XMIT_DIRECT to fill up destination hardware
address since neighbour cache does not exist in pppoe.
11) Use skb_pull_rcsum() to decapsulate vlan and pppoe headers, for
double-tagged vlan in particular this should provide some benefits
in certain scenarios.
More notes regarding 9-11):
- sashiko is also signalling to use it for IPIP headers, but that needs
more adjustments such setting skb->protocol after removing the IPIP
header, will follow up in a separated patch.
- I plan to submit selftests to cover double-tagged-vlan. As for pppoe,
it should be possible but that would mandate a few userspace dependencies.
This has been semi-automatically tested by me and reporters describing
broken double-vlan-tagged and pppoe currently in the flowtable.
* tag 'nf-26-05-01' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: flowtable: use skb_pull_rcsum() to pop vlan/pppoe header
netfilter: flowtable: fix inline pppoe encapsulation in xmit path
netfilter: flowtable: fix inline vlan encapsulation in xmit path
netfilter: flowtable: ensure sufficient headroom in xmit path
netfilter: xtables: fix L4 header parsing for non-first fragments
netfilter: nf_tables: skip L4 header parsing for non-first fragments
netfilter: nf_socket: skip socket lookup for non-first fragments
netfilter: nf_tables: fix netdev hook allocation memleak with dormant tables
netfilter: xt_CT: fix usersize for v1 and v2 revision
netfilter: nft_compat: run xt_check_hooks_{match,target}() from .validate
netfilter: x_tables: add .check_hooks to matches and targets
netfilter: nft_fwd_netdev: use recursion counter in neigh egress path
netfilter: nft_fwd_netdev: add device and headroom validate with neigh forwarding
netfilter: replace skb_try_make_writable() by skb_ensure_writable()
====================
Link: https://patch.msgid.link/20260501122237.296262-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipv6_{skip_exthdr,find_hdr}() and ip6_{tnl_parse_tlv_enc_lim,
protocol_deliver_rcu}() iterate over IPv6 extension headers until they
find a non-extension-header protocol or run out of packet data. The
loops have no iteration counter, relying solely on the packet length
to bound them. For a crafted packet with 8-byte extension headers
filling a 64KB jumbogram, this means a worst case of up to ~8k
iterations with a skb_header_pointer call each. ipv6_skip_exthdr(),
for example, is used where it parses the inner quoted packet inside
an incoming ICMPv6 error:
- icmpv6_rcv
- checksum validation
- case ICMPV6_DEST_UNREACH
- icmpv6_notify
- pskb_may_pull() <- pull inner IPv6 header
- ipv6_skip_exthdr() <- iterates here
- pskb_may_pull()
- ipprot->err_handler() <- sk lookup
The per-iteration cost of ipv6_skip_exthdr itself is generally
light, but skb_header_pointer becomes more costly on reassembled
packets: the first ~1232 bytes of the inner packet are in the skb's
linear area, but the remaining ~63KB are in the frag_list where
skb_copy_bits is needed to read data.
Initially, the idea was to add a configurable limit via a new
sysctl knob with default 8, in line with knobs from commit
47d3d7ac65 ("ipv6: Implement limits on Hop-by-Hop and Destination
options"), but two reasons eventually argued against it:
- It adds to UAPI that needs to be maintained forever, and
upcoming work is restricting extension header ordering anyway,
leaving little reason for another sysctl knob
- exthdrs_core.c is always built-in even when CONFIG_IPV6=n,
where struct net has no .ipv6 member, so the read site would
need an ifdef'd fallback to a constant anyway
Therefore, just use a constant (IP6_MAX_EXT_HDRS_CNT). All four
extension header walking functions are now bound by this limit.
Note that the check in ip6_protocol_deliver_rcu() happens right
before the goto resubmit, such that we don't have to have a test
for ipv6_ext_hdr() in the fast-path.
There's an ongoing IETF draft-iurman-6man-eh-occurrences to enforce
IPv6 extension headers ordering and occurrence. The latter also
discusses security implications. As per RFC8200 section 4.1, the
occurrence rules for extension headers provide a practical upper
bound which is 8. In order to be conservative, let's define
IP6_MAX_EXT_HDRS_CNT as 12 to leave enough room for quirky setups.
In the unlikely event that this is still not enough, then we might
need to reconsider a sysctl.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Link: https://patch.msgid.link/20260429154648.809751-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Both nft_socket and xt_socket relies on L4 headers to perform socket
lookup in the slow path. For fragmented packets, while the IP protocol
remains constant across all fragments, only the first fragment contains
the actual L4 header.
As the expression/match could be attached to a chain with a priority
lower than -400, it could bypass defragmentation.
Add a check for fragmentation in the lookup functions directly so the
problem is handled for both nft_socket and xt_socket at the same time.
In addition, future users of the functions would not need to care about
this.
Fixes: 902d6a4c2a ("netfilter: nf_defrag: Skip defrag if NOTRACK is set")
Fixes: 554ced0a6e ("netfilter: nf_tables: add support for native socket matching")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
seg6_input_core() and rpl_input() call ip6_route_input() which sets a
NOREF dst on the skb, then pass it to dst_cache_set_ip6() invoking
dst_hold() unconditionally.
On PREEMPT_RT, ksoftirqd is preemptible and a higher-priority task can
release the underlying pcpu_rt between the lookup and the caching
through a concurrent FIB lookup on a shared nexthop.
Simplified race sequence:
ksoftirqd/X higher-prio task (same CPU X)
----------- --------------------------------
seg6_input_core(,skb)/rpl_input(skb)
dst_cache_get()
-> miss
ip6_route_input(skb)
-> ip6_pol_route(,skb,flags)
[RT6_LOOKUP_F_DST_NOREF in flags]
-> FIB lookup resolves fib6_nh
[nhid=N route]
-> rt6_make_pcpu_route()
[creates pcpu_rt, refcount=1]
pcpu_rt->sernum = fib6_sernum
[fib6_sernum=W]
-> cmpxchg(fib6_nh.rt6i_pcpu,
NULL, pcpu_rt)
[slot was empty, store succeeds]
-> skb_dst_set_noref(skb, dst)
[dst is pcpu_rt, refcount still 1]
rt_genid_bump_ipv6()
-> bumps fib6_sernum
[fib6_sernum from W to Z]
ip6_route_output()
-> ip6_pol_route()
-> FIB lookup resolves fib6_nh
[nhid=N]
-> rt6_get_pcpu_route()
pcpu_rt->sernum != fib6_sernum
[W <> Z, stale]
-> prev = xchg(rt6i_pcpu, NULL)
-> dst_release(prev)
[prev is pcpu_rt,
refcount 1->0, dead]
dst = skb_dst(skb)
[dst is the dead pcpu_rt]
dst_cache_set_ip6(dst)
-> dst_hold() on dead dst
-> WARN / use-after-free
For the race to occur, ksoftirqd must be preemptible (PREEMPT_RT without
PREEMPT_RT_NEEDS_BH_LOCK) and a concurrent task must be able to release
the pcpu_rt. Shared nexthop objects provide such a path, as two routes
pointing to the same nhid share the same fib6_nh and its rt6i_pcpu
entry.
Fix seg6_input_core() and rpl_input() by calling skb_dst_force() after
ip6_route_input() to force the NOREF dst into a refcounted one before
caching.
The output path is not affected as ip6_route_output() already returns a
refcounted dst.
Fixes: af4a2209b1 ("ipv6: sr: use dst_cache in seg6_input")
Fixes: a7a29f9c36 ("net: ipv6: add rpl sr tunnel")
Cc: stable@vger.kernel.org
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Link: https://patch.msgid.link/20260421094735.20997-1-andrea.mayer@uniroma2.it
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
ipv6_rpl_srh_rcv() decompresses an RFC 6554 Source Routing Header, swaps
the next segment into ipv6_hdr->daddr, recompresses, then pulls the old
header and pushes the new one plus the IPv6 header back. The
recompressed header can be larger than the received one when the swap
reduces the common-prefix length the segments share with daddr (CmprI=0,
CmprE>0, seg[0][0] != daddr[0] gives the maximum +8 bytes).
pskb_expand_head() was gated on segments_left == 0, so on earlier
segments the push consumed unchecked headroom. Once skb_push() leaves
fewer than skb->mac_len bytes in front of data,
skb_mac_header_rebuild()'s call to:
skb_set_mac_header(skb, -skb->mac_len);
will store (data - head) - mac_len into the u16 mac_header field, which
wraps to ~65530, and the following memmove() writes mac_len bytes ~64KiB
past skb->head.
A single AF_INET6/SOCK_RAW/IPV6_HDRINCL packet over lo with a two
segment type-3 SRH (CmprI=0, CmprE=15) reaches headroom 8 after one
pass; KASAN reports a 14-byte OOB write in ipv6_rthdr_rcv.
Fix this by expanding the head whenever the remaining room is less than
the push size plus mac_len, and request that much extra so the rebuilt
MAC header fits afterwards.
Fixes: 8610c7c6e3 ("net: ipv6: add support for rpl sr exthdr")
Cc: stable <stable@kernel.org>
Reported-by: Anthropic
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/2026042133-gout-unvented-1bd9@gregkh
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 47d3d7ac65 ("ipv6: Implement limits on Hop-by-Hop and
Destination options") added net.ipv6.max_{hbh,dst}_opts_{cnt,len}
and applied them in ip6_parse_tlv(), the generic TLV walker
invoked from ipv6_destopt_rcv() and ipv6_parse_hopopts().
ip6_tnl_parse_tlv_enc_lim() does not go through ip6_parse_tlv();
it has its own hand-rolled TLV scanner inside its NEXTHDR_DEST
branch which looks for IPV6_TLV_TNL_ENCAP_LIMIT. That inner
loop is bounded only by optlen, which can be up to 2048 bytes.
Stuffing the Destination Options header with 2046 Pad1 (type=0)
entries advances the scanner a single byte at a time, yielding
~2000 TLV iterations per extension header.
Reusing max_dst_opts_cnt to bound the TLV iterations, matching
the semantics from 47d3d7ac65, would require duplicating
ip6_parse_tlv() to also validate Pad1/PadN payload. It would
also mandate enforcing max_dst_opts_len, since otherwise an
attacker shifts the axis to few options with a giant PadN and
recovers the original DoS. Allowing up to 8 options before the
tunnel encapsulation limit TLV is liberal enough; in practice
encap limit is the first TLV. Thus, go with a hard-coded limit
IP6_TUNNEL_MAX_DEST_TLVS (8).
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following batch contains Netfilter/IPVS fixes for net:
1) nft_osf actually only supports IPv4, restrict it.
2) Address possible division by zero in nfnetlink_osf, from Xiang Mei.
3) Remove unsafe use of sprintf to fix possible buffer overflow
in the SIP NAT helper, from Florian Westphal.
4) Restrict xt_mac, xt_owner and xt_physdev to inet families only;
xt_realm is only for ipv4, otherwise null-pointer-deref is possible.
5) Use kfree_rcu() in nat core to release hooks, this can be an issue
once nfnetlink_hook gets support to dump NAT hook information, not
currently a real issue but better fix it now. From Florian Westphal.
6) Fix MTU checks in IPVS, from Yingnan Zhang.
7) Fix possible out-of-bounds when matching TCP options in
nfnetlink_osf, from Fernando Fernandez Mancera.
8) Fix potential nul-ptr-deref in ttl check in nfnetlink_osf,
remove useless loop to fix this, also from Fernando.
This is a smaller batch, there are more patches pending in the queue
to arm another pull request as soon as this is considered good enough.
AI might complain again about one more issue regarding osf and
big-endian arches in osf but this batch is targetting crash fixes for
osf at this stage.
netfilter pull request 26-04-20
* tag 'nf-26-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nfnetlink_osf: fix potential NULL dereference in ttl check
netfilter: nfnetlink_osf: fix out-of-bounds read on option matching
ipvs: fix MTU check for GSO packets in tunnel mode
netfilter: nat: use kfree_rcu to release ops
netfilter: xtables: restrict several matches to inet family
netfilter: conntrack: remove sprintf usage
netfilter: nfnetlink_osf: fix divide-by-zero in OSF_WSS_MODULO
netfilter: nft_osf: restrict it to ipv4
====================
Link: https://patch.msgid.link/20260420220215.111510-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When SEG6_IPTUN_MODE_L2ENCAP_RED (L2ENCAP_RED) was introduced, the
condition in seg6_build_state() that excludes L2 encap modes from
setting LWTUNNEL_STATE_OUTPUT_REDIRECT was not updated to account for
the new mode.
As a consequence, L2ENCAP_RED routes incorrectly trigger seg6_output()
on the output path, where the packet is silently dropped because
skb_mac_header_was_set() fails on L3 packets.
Extend the check to also exclude L2ENCAP_RED, consistent with L2ENCAP.
Fixes: 13f0296be8 ("seg6: add support for SRv6 H.L2Encaps.Red behavior")
Cc: stable@vger.kernel.org
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Link: https://patch.msgid.link/20260418162838.31979-1-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Florian Westphal says:
"Historically this is not an issue, even for normal base hooks: the data
path doesn't use the original nf_hook_ops that are used to register the
callbacks.
However, in v5.14 I added the ability to dump the active netfilter
hooks from userspace.
This code will peek back into the nf_hook_ops that are available
at the tail of the pointer-array blob used by the datapath.
The nat hooks are special, because they are called indirectly from
the central nat dispatcher hook. They are currently invisible to
the nfnl hook dump subsystem though.
But once that changes the nat ops structures have to be deferred too."
Update nf_nat_register_fn() to deal with partial exposition of the hooks
from error path which can be also an issue for nfnetlink_hook.
Fixes: e2cf17d377 ("netfilter: add new hook nfnl subsystem")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
AH allocates its temporary auth/ICV layout differently when ESN is enabled:
the async ahash setup appends a 4-byte seqhi slot before the ICV or
auth_data area, but the async completion callbacks still reconstruct the
temporary layout as if seqhi were absent.
With an async AH implementation selected, that makes AH copy or compare
the wrong bytes on both the IPv4 and IPv6 paths. In UML repro on IPv4 AH
with ESN and forced async hmac(sha1), ping fails with 100% packet loss,
and the callback logs show the pre-fix drift:
ah4 output_done: esn=1 err=0 icv_off=20 expected_off=24
ah4 input_done: esn=1 auth_off=20 expected_auth_off=24 icv_off=32 expected_icv_off=36
Reconstruct the callback-side layout the same way the setup path built it
by skipping the ESN seqhi slot before locating the saved auth_data or ICV.
Per RFC 4302, the ESN high-order 32 bits participate in the AH ICV
computation, so the async callbacks must account for the seqhi slot.
Post-fix, the same IPv4 AH+ESN+forced-async-hmac(sha1) UML repro shows
the corrected offset (ah4 output_done: esn=1 err=0 icv_off=24
expected_off=24) and ping succeeds; net/ipv4/ah4.o and net/ipv6/ah6.o
build clean at W=1. IPv6 AH+ESN was not exercised at runtime, and the
change has not been tested against a real async hardware AH engine.
Fixes: d4d573d033 ("{IPv4,xfrm} Add ESN support for AH egress part")
Fixes: d8b2a8600b ("{IPv4,xfrm} Add ESN support for AH ingress part")
Fixes: 26dd70c3fa ("{IPv6,xfrm} Add ESN support for AH egress part")
Fixes: 8d6da6f325 ("{IPv6,xfrm} Add ESN support for AH ingress part")
Cc: stable@vger.kernel.org
Assisted-by: Codex:gpt-5-4
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Caching saddr and daddr before pskb_pull() is problematic
since skb->head can change.
Remove these temporary variables:
- We only access &ipv6_hdr(skb)->saddr and &ipv6_hdr(skb)->daddr
when net_dbg_ratelimited() is called in the slow path.
- Avoid potential future misuse after pskb_pull() call.
Fixes: 4b3418fba0 ("ipv6: icmp: include addresses in debug messages")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Joe Damato <joe@dama.to>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260416103505.2380753-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
xfrm6_rcv_encap() performs an IPv6 route lookup when the skb does not
already have a dst attached. ip6_route_input_lookup() returns a
referenced dst entry even when the lookup resolves to an error route.
If dst->error is set, xfrm6_rcv_encap() drops the skb without attaching
the dst to the skb and without releasing the reference returned by the
lookup. Repeated packets hitting this path therefore leak dst entries.
Release the dst before jumping to the drop path.
Fixes: 0146dca70b ("xfrm: add support for UDPv6 encapsulation of ESP")
Cc: stable@kernel.org
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Co-developed-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Yuan Tan <yuantan098@gmail.com>
Suggested-by: Xin Liu <bird@lzu.edu.cn>
Tested-by: Ruide Cao <caoruide123@gmail.com>
Signed-off-by: Yilin Zhu <zylzyl2333@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Support HW queue leasing, allowing containers to be granted access
to HW queues for zero-copy operations and AF_XDP
- Number of code moves to help the compiler with inlining. Avoid
output arguments for returning drop reason where possible
- Rework drop handling within qdiscs to include more metadata about
the reason and dropping qdisc in the tracepoints
- Remove the rtnl_lock use from IP Multicast Routing
- Pack size information into the Rx Flow Steering table pointer
itself. This allows making the table itself a flat array of u32s,
thus making the table allocation size a power of two
- Report TCP delayed ack timer information via socket diag
- Add ip_local_port_step_width sysctl to allow distributing the
randomly selected ports more evenly throughout the allowed space
- Add support for per-route tunsrc in IPv6 segment routing
- Start work of switching sockopt handling to iov_iter
- Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid
buffer size drifting up
- Support MSG_EOR in MPTCP
- Add stp_mode attribute to the bridge driver for STP mode selection.
This addresses concerns about call_usermodehelper() usage
- Remove UDP-Lite support (as announced in 2023)
- Remove support for building IPv6 as a module. Remove the now
unnecessary function calling indirection
Cross-tree stuff:
- Move Michael MIC code from generic crypto into wireless, it's
considered insecure but some WiFi networks still need it
Netfilter:
- Switch nft_fib_ipv6 module to no longer need temporary dst_entry
object allocations by using fib6_lookup() + RCU.
Florian W reports this gets us ~13% higher packet rate
- Convert IPVS's global __ip_vs_mutex to per-net service_mutex and
switch the service tables to be per-net. Convert some code that
walks the service lists to use RCU instead of the service_mutex
- Add more opinionated input validation to lower security exposure
- Make IPVS hash tables to be per-netns and resizable
Wireless:
- Finished assoc frame encryption/EPPKE/802.1X-over-auth
- Radar detection improvements
- Add 6 GHz incumbent signal detection APIs
- Multi-link support for FILS, probe response templates and client
probing
- New APIs and mac80211 support for NAN (Neighbor Aware Networking,
aka Wi-Fi Aware) so less work must be in firmware
Driver API:
- Add numerical ID for devlink instances (to avoid having to create
fake bus/device pairs just to have an ID). Support shared devlink
instances which span multiple PFs
- Add standard counters for reporting pause storm events (implement
in mlx5 and fbnic)
- Add configuration API for completion writeback buffering (implement
in mana)
- Support driver-initiated change of RSS context sizes
- Support DPLL monitoring input frequency (implement in zl3073x)
- Support per-port resources in devlink (implement in mlx5)
Misc:
- Expand the YAML spec for Netfilter
Drivers
- Software:
- macvlan: support multicast rx for bridge ports with shared
source MAC address
- team: decouple receive and transmit enablement for IEEE 802.3ad
LACP "independent control"
- Ethernet high-speed NICs:
- nVidia/Mellanox:
- support high order pages in zero-copy mode (for payload
coalescing)
- support multiple packets in a page (for systems with 64kB
pages)
- Broadcom 25-400GE (bnxt):
- implement XDP RSS hash metadata extraction
- add software fallback for UDP GSO, lowering the IOMMU cost
- Broadcom 800GE (bnge):
- add link status and configuration handling
- add various HW and SW statistics
- Marvell/Cavium:
- NPC HW block support for cn20k
- Huawei (hinic3):
- add mailbox / control queue
- add rx VLAN offload
- add driver info and link management
- Ethernet NICs:
- Marvell/Aquantia:
- support reading SFP module info on some AQC100 cards
- Realtek PCI (r8169):
- add support for RTL8125cp
- Realtek USB (r8152):
- support for the RTL8157 5Gbit chip
- add 2500baseT EEE status/configuration support
- Ethernet NICs embedded and off-the-shelf IP:
- Synopsys (stmmac):
- cleanup and reorganize SerDes handling and PCS support
- cleanup descriptor handling and per-platform data
- cleanup and consolidate MDIO defines and handling
- shrink driver memory use for internal structures
- improve Tx IRQ coalescing
- improve TCP segmentation handling
- add support for Spacemit K3
- Cadence (macb):
- support PHYs that have inband autoneg disabled with GEM
- support IEEE 802.3az EEE
- rework usrio capabilities and handling
- AMD (xgbe):
- improve power management for S0i3
- improve TX resilience for link-down handling
- Virtual:
- Google cloud vNIC:
- support larger ring sizes in DQO-QPL mode
- improve HW-GRO handling
- support UDP GSO for DQO format
- PCIe NTB:
- support queue count configuration
- Ethernet PHYs:
- automatically disable PHY autonomous EEE if MAC is in charge
- Broadcom:
- add BCM84891/BCM84892 support
- Micrel:
- support for LAN9645X internal PHY
- Realtek:
- add RTL8224 pair order support
- support PHY LEDs on RTL8211F-VD
- support spread spectrum clocking (SSC)
- Maxlinear:
- add PHY-level statistics via ethtool
- Ethernet switches:
- Maxlinear (mxl862xx):
- support for bridge offloading
- support for VLANs
- support driver statistics
- Bluetooth:
- large number of fixes and new device IDs
- Mediatek:
- support MT6639 (MT7927)
- support MT7902 SDIO
- WiFi:
- Intel (iwlwifi):
- UNII-9 and continuing UHR work
- MediaTek (mt76):
- mt7996/mt7925 MLO fixes/improvements
- mt7996 NPU support (HW eth/wifi traffic offload)
- Qualcomm (ath12k):
- monitor mode support on IPQ5332
- basic hwmon temperature reporting
- support IPQ5424
- Realtek:
- add USB RX aggregation to improve performance
- add USB TX flow control by tracking in-flight URBs
- Cellular:
- IPA v5.2 support"
* tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1561 commits)
net: pse-pd: fix kernel-doc function name for pse_control_find_by_id()
wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit
wireguard: allowedips: remove redundant space
tools: ynl: add sample for wireguard
wireguard: allowedips: Use kfree_rcu() instead of call_rcu()
MAINTAINERS: Add netkit selftest files
selftests/net: Add additional test coverage in nk_qlease
selftests/net: Split netdevsim tests from HW tests in nk_qlease
tools/ynl: Make YnlFamily closeable as a context manager
net: airoha: Add missing PPE configurations in airoha_ppe_hw_init()
net: airoha: Fix VIP configuration for AN7583 SoC
net: caif: clear client service pointer on teardown
net: strparser: fix skb_head leak in strp_abort_strp()
net: usb: cdc-phonet: fix skb frags[] overflow in rx_complete()
selftests/bpf: add test for xdp_master_redirect with bond not up
net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master
net: airoha: Remove PCE_MC_EN_MASK bit in REG_FE_PCE_CFG configuration
sctp: disable BH before calling udp_tunnel_xmit_skb()
sctp: fix missing encap_port propagation for GSO fragments
net: airoha: Rely on net_device pointer in ETS callbacks
...
Pull bpf updates from Alexei Starovoitov:
- Welcome new BPF maintainers: Kumar Kartikeya Dwivedi, Eduard
Zingerman while Martin KaFai Lau reduced his load to Reviwer.
- Lots of fixes everywhere from many first time contributors. Thank you
All.
- Diff stat is dominated by mechanical split of verifier.c into
multiple components:
- backtrack.c: backtracking logic and jump history
- states.c: state equivalence
- cfg.c: control flow graph, postorder, strongly connected
components
- liveness.c: register and stack liveness
- fixups.c: post-verification passes: instruction patching, dead
code removal, bpf_loop inlining, finalize fastcall
8k line were moved. verifier.c still stands at 20k lines.
Further refactoring is planned for the next release.
- Replace dynamic stack liveness with static stack liveness based on
data flow analysis.
This improved the verification time by 2x for some programs and
equally reduced memory consumption. New logic is in liveness.c and
supported by constant folding in const_fold.c (Eduard Zingerman,
Alexei Starovoitov)
- Introduce BTF layout to ease addition of new BTF kinds (Alan Maguire)
- Use kmalloc_nolock() universally in BPF local storage (Amery Hung)
- Fix several bugs in linked registers delta tracking (Daniel Borkmann)
- Improve verifier support of arena pointers (Emil Tsalapatis)
- Improve verifier tracking of register bounds in min/max and tnum
domains (Harishankar Vishwanathan, Paul Chaignon, Hao Sun)
- Further extend support for implicit arguments in the verifier (Ihor
Solodrai)
- Add support for nop,nop5 instruction combo for USDT probes in libbpf
(Jiri Olsa)
- Support merging multiple module BTFs (Josef Bacik)
- Extend applicability of bpf_kptr_xchg (Kaitao Cheng)
- Retire rcu_trace_implies_rcu_gp() (Kumar Kartikeya Dwivedi)
- Support variable offset context access for 'syscall' programs (Kumar
Kartikeya Dwivedi)
- Migrate bpf_task_work and dynptr to kmalloc_nolock() (Mykyta
Yatsenko)
- Fix UAF in in open-coded task_vma iterator (Puranjay Mohan)
* tag 'bpf-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (241 commits)
selftests/bpf: cover short IPv4/IPv6 inputs with adjust_room
bpf: reject short IPv4/IPv6 inputs in bpf_prog_test_run_skb
selftests/bpf: Use memfd_create instead of shm_open in cgroup_iter_memcg
selftests/bpf: Add test for cgroup storage OOB read
bpf: Fix OOB in pcpu_init_value
selftests/bpf: Fix reg_bounds to match new tnum-based refinement
selftests/bpf: Add tests for non-arena/arena operations
bpf: Allow instructions with arena source and non-arena dest registers
bpftool: add missing fsession to the usage and docs of bpftool
docs/bpf: add missing fsession attach type to docs
bpf: add missing fsession to the verifier log
bpf: Move BTF checking logic into check_btf.c
bpf: Move backtracking logic to backtrack.c
bpf: Move state equivalence logic to states.c
bpf: Move check_cfg() into cfg.c
bpf: Move compute_insn_live_regs() into liveness.c
bpf: Move fixup/post-processing logic from verifier.c into fixups.c
bpf: Simplify do_check_insn()
bpf: Move checks for reserved fields out of the main pass
bpf: Delete unused variable
...
Back in 2024 I reported a 7-12% regression on an iperf3 UDP loopback
thoughput test that we traced to the extra overhead of calling
compute_score on two places, introduced by commit f0ea27e7bf ("udp:
re-score reuseport groups when connected sockets are present"). At the
time, I pointed out the overhead was caused by the multiple calls,
associated with cpu-specific mitigations, and merged commit
50aee97d15 ("udp: Avoid call to compute_score on multiple sites") to
jump back explicitly, to force the rescore call in a single place.
Recently though, we got another regression report against a newer distro
version, which a team colleague traced back to the same root-cause.
Turns out that once we updated to gcc-13, the compiler got smart enough
to unroll the loop, undoing my previous mitigation. Let's bite the
bullet and __always_inline compute_score on both ipv4 and ipv6 to
prevent gcc from de-optimizing it again in the future. These functions
are only called in two places each, udpX_lib_lookup1 and
udpX_lib_lookup2, so the extra size shouldn't be a problem and it is hot
enough to be very visible in profilings. In fact, with gcc13, forcing
the inline will prevent gcc from unrolling the fix from commit
50aee97d15, so we don't end up increasing udpX_lib_lookup2 at all.
I haven't recollected the results myself, as I don't have access to the
machine at the moment. But the same colleague reported 4.67%
inprovement with this patch in the loopback benchmark, solving the
regression report within noise margins.
Eric Dumazet reported no size change to vmlinux when built with clang.
I report the same also with gcc-13:
scripts/bloat-o-meter vmlinux vmlinux-inline
add/remove: 0/2 grow/shrink: 4/0 up/down: 616/-416 (200)
Function old new delta
udp6_lib_lookup2 762 949 +187
__udp6_lib_lookup 810 975 +165
udp4_lib_lookup2 757 906 +149
__udp4_lib_lookup 871 986 +115
__pfx_compute_score 32 - -32
compute_score 384 - -384
Total: Before=35011784, After=35011984, chg +0.00%
Fixes: 50aee97d15 ("udp: Avoid call to compute_score on multiple sites")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://patch.msgid.link/20260410155936.654915-1-krisman@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull vfs i_ino updates from Christian Brauner:
"For historical reasons, the inode->i_ino field is an unsigned long,
which means that it's 32 bits on 32 bit architectures. This has caused
a number of filesystems to implement hacks to hash a 64-bit identifier
into a 32-bit field, and deprives us of a universal identifier field
for an inode.
This changes the inode->i_ino field from an unsigned long to a u64.
This shouldn't make any material difference on 64-bit hosts, but
32-bit hosts will see struct inode grow by at least 4 bytes. This
could have effects on slabcache sizes and field alignment.
The bulk of the changes are to format strings and tracepoints, since
the kernel itself doesn't care that much about the i_ino field. The
first patch changes some vfs function arguments, so check that one out
carefully.
With this change, we may be able to shrink some inode structures. For
instance, struct nfs_inode has a fileid field that holds the 64-bit
inode number. With this set of changes, that field could be
eliminated. I'd rather leave that sort of cleanups for later just to
keep this simple"
* tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
nilfs2: fix 64-bit division operations in nilfs_bmap_find_target_in_group()
EVM: add comment describing why ino field is still unsigned long
vfs: remove externs from fs.h on functions modified by i_ino widening
treewide: fix missed i_ino format specifier conversions
ext4: fix signed format specifier in ext4_load_inode trace event
treewide: change inode->i_ino from unsigned long to u64
nilfs2: widen trace event i_ino fields to u64
f2fs: widen trace event i_ino fields to u64
ext4: widen trace event i_ino fields to u64
zonefs: widen trace event i_ino fields to u64
hugetlbfs: widen trace event i_ino fields to u64
ext2: widen trace event i_ino fields to u64
cachefiles: widen trace event i_ino fields to u64
vfs: widen trace event i_ino fields to u64
net: change sock.sk_ino and sock_i_ino() to u64
audit: widen ino fields to u64
vfs: widen inode hash/lookup functions to u64
Commit under Fixes moved recomputing the window clamp to
tcp_measure_rcv_mss() (when scaling_ratio changes).
I suspect it missed the fact that we don't recompute the clamp
when rcvbuf is set. Until scaling_ratio changes we are
stuck with the old window clamp which may be based on
the small initial buffer. scaling_ratio may never change.
Inspired by Eric's recent commit d1361840f8 ("tcp: fix
SO_RCVLOWAT and RCVBUF autotuning") plumb the user action
thru to TCP and have it update the clamp.
A smaller fix would be to just have tcp_rcvbuf_grow()
adjust the clamp even if SOCK_RCVBUF_LOCK is set.
But IIUC this is what we were trying to get away from
in the first place.
Fixes: a2cbb16039 ("tcp: Update window clamping condition")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumaze@google.com>
Link: https://patch.msgid.link/20260408001438.129165-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
sk_filter_trim_cap() will soon return the reason by value,
do the same for tcp_filter().
Note:
tcp_filter() is no longer inlined. Following patch will inline it again.
$ scripts/bloat-o-meter -t vmlinux.4 vmlinux.5
add/remove: 2/0 grow/shrink: 0/2 up/down: 186/-43 (143)
Function old new delta
tcp_filter - 154 +154
__pfx_tcp_filter - 32 +32
tcp_v4_rcv 3152 3143 -9
tcp_v6_rcv 3169 3135 -34
Total: Before=29722640, After=29722783, chg +0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260409145625.2306224-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Florian Westphal says:
====================
netfilter: updates for net-next
1-3) IPVS updates from Julian Anastasov to enhance visibility into
IPVS internal state by exposing hash size, load factor etc and
allows userspace to tune the load factor used for resizing hash
tables.
4) reject empty/not nul terminated device names from xt_physdev.
This isn't a bug fix; existing code doesn't require a c-string.
But clean this up anyway because conceptually the interface name
definitely should be a c-string.
5) Switch nfnetlink to skb_mac_header helpers that didn't exist back
when this code was written. This gives us additional debug checks
but is not intended to change functionality.
6) Let the xt ttl/hoplimit match reject unknown operator modes.
This is a cleanup, the evaluation function simply returns false when
the mode is out of range. From Marino Dzalto.
7) xt_socket match should enable defrag after all other checks. This
bug is harmless, historically defrag could not be disabled either
except by rmmod.
8) remove UDP-Lite conntrack support, from Fernando Fernandez Mancera.
9) Avoid a couple -Wflex-array-member-not-at-end warnings in the old
xtables 32bit compat code, from Gustavo A. R. Silva.
10) nftables fwd expression should drop packets when their ttl/hl has
expired. This is a bug fix deferred, its not deemed important
enough for -rc8.
11) Add additional checks before assuming the mac header is an ethernet
header, from Zhengchuan Liang.
* tag 'nf-next-26-04-10' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: require Ethernet MAC header before using eth_hdr()
netfilter: nft_fwd_netdev: check ttl/hl before forwarding
netfilter: x_tables: Avoid a couple -Wflex-array-member-not-at-end warnings
netfilter: conntrack: remove UDP-Lite conntrack support
netfilter: xt_socket: enable defrag after all other checks
netfilter: xt_HL: add pr_fmt and checkentry validation
netfilter: nfnetlink: prefer skb_mac_header helpers
netfilter: x_physdev: reject empty or not-nul terminated device names
ipvs: add conn_lfactor and svc_lfactor sysctl vars
ipvs: add ip_vs_status info
ipvs: show the current conn_tab size to users
====================
Link: https://patch.msgid.link/20260410112352.23599-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Part of a stack canary removal from tcp_v{4,6}_rcv().
Return a drop_reason instead of a boolean, so that we no longer
have to pass the address of a local variable.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-37 (-37)
Function old new delta
tcp_v6_rcv 3133 3129 -4
tcp_v4_rcv 3206 3202 -4
tcp_add_backlog 1281 1252 -29
Total: Before=25567186, After=25567149, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260409101147.1642967-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
`ip6t_eui64`, `xt_mac`, the `bitmap:ip,mac`, `hash:ip,mac`, and
`hash:mac` ipset types, and `nf_log_syslog` access `eth_hdr(skb)`
after either assuming that the skb is associated with an Ethernet
device or checking only that the `ETH_HLEN` bytes at
`skb_mac_header(skb)` lie between `skb->head` and `skb->data`.
Make these paths first verify that the skb is associated with an
Ethernet device, that the MAC header was set, and that it spans at
least a full Ethernet header before accessing `eth_hdr(skb)`.
Suggested-by: Florian Westphal <fw@strlen.de>
Tested-by: Ren Wei <enjou1224z@gmail.com>
Signed-off-by: Zhengchuan Liang <zcliangcn@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Florian Westphal <fw@strlen.de>
Observed at boot time:
CPU: 43 UID: 0 PID: 3595 Comm: (t-daemon) Not tainted 6.12.0 #1
Call Trace:
<TASK>
dump_stack_lvl+0x4e/0x70
pcpu_alloc_noprof.cold+0x1f/0x4b
fib_nh_common_init+0x4c/0x110
fib6_nh_init+0x387/0x740
ip6_route_info_create+0x46d/0x640
addrconf_f6i_alloc+0x13b/0x180
addrconf_permanent_addr+0xd0/0x220
addrconf_notify+0x93/0x540
notifier_call_chain+0x5a/0xd0
__dev_notify_flags+0x5c/0xf0
dev_change_flags+0x54/0x70
do_setlink+0x36c/0xce0
rtnl_setlink+0x11f/0x1d0
rtnetlink_rcv_msg+0x142/0x3f0
netlink_rcv_skb+0x50/0x100
netlink_unicast+0x242/0x390
netlink_sendmsg+0x21b/0x470
__sys_sendto+0x1dc/0x1f0
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x7d/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f5c3852f127
Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 80 3d 85 ef 0c 00 00 41 89 ca 74 10 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 71 c3 55 48 83 ec 30 44 89 4c 24 2c 4c 89 44
RSP: 002b:00007ffe86caf4c8 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000556c5cd93210 RCX: 00007f5c3852f127
RDX: 0000000000000020 RSI: 0000556c5cd938b0 RDI: 0000000000000003
RBP: 00007ffe86caf5a0 R08: 00007ffe86caf4e0 R09: 0000000000000080
R10: 0000000000000000 R11: 0000000000000202 R12: 0000556c5cd932d0
R13: 00000000021d05d1 R14: 00000000021d05d1 R15: 0000000000000001
IFA_F_PERMANENT addresses require the allocation of a bunch of percpu
pointers, currently in atomic scope.
Similar to commit 51454ea42c ("ipv6: fix locking issues with loops
over idev->addr_list"), move fixup_permanent_addr() outside the
&idev->lock scope, and do the allocations with GFP_KERNEL. With such
change fixup_permanent_addr() is invoked with the BH enabled, and the
ifp lock acquired there needs the BH variant.
Note that we don't need to acquire a reference to the permanent
addresses before releasing the mentioned write lock, because
addrconf_permanent_addr() runs under RTNL and ifa removal always happens
under RTNL, too.
Also the PERMANENT flag is constant in the relevant scope, as it can be
cleared only by inet6_addr_modify() under the RTNL lock.
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/46a7a030727e236af2dc7752994cd4f04f4a91d2.1775658924.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When trace->type.bit6 is set:
if (trace->type.bit6) {
...
queue = skb_get_tx_queue(dev, skb);
qdisc = rcu_dereference(queue->qdisc);
This code can lead to an out-of-bounds access of the dev->_tx[] array
when is_input is true. In such a case, the packet is on the RX path and
skb->queue_mapping contains the RX queue index of the ingress device. If
the ingress device has more RX queues than the egress device (dev) has
TX queues, skb_get_queue_mapping(skb) will exceed dev->num_tx_queues.
Add a check to avoid this situation since skb_get_tx_queue() does not
clamp the index. This issue has also revealed that per queue visibility
cannot be accurate and will be replaced later as a new feature.
While at it, add missing lock around qdisc_qstats_qlen_backlog(). The
function __ioam6_fill_trace_data() is called from both softirq and
process contexts, hence the use of spin_lock_bh() here.
Fixes: b63c5478e9 ("ipv6: ioam: Support for Queue depth data field")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20260403214418.2233266-2-kuba@kernel.org/
Signed-off-by: Justin Iurman <justin.iurman@gmail.com>
Link: https://patch.msgid.link/20260404134137.24553-1-justin.iurman@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Steffen Klassert says:
====================
pull request (net): ipsec 2026-04-08
1) Clear trailing padding in build_polexpire() to prevent
leaking unititialized memory. From Yasuaki Torimaru.
2) Fix aevent size calculation when XFRMA_IF_ID is used.
From Keenan Dong.
3) Wait for RCU readers during policy netns exit before
freeing the policy hash tables.
4) Fix dome too eaerly dropped references on the netdev
when uding transport mode. From Qi Tang.
5) Fix refcount leak in xfrm_migrate_policy_find().
From Kotlyarov Mihail.
6) Fix two fix info leaks in build_report() and
in build_mapping(). From Greg Kroah-Hartman.
7) Zero aligned sockaddr tail in PF_KEY exports.
From Zhengchuan Liang.
* tag 'ipsec-2026-04-08' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
net: af_key: zero aligned sockaddr tail in PF_KEY exports
xfrm_user: fix info leak in build_report()
xfrm_user: fix info leak in build_mapping()
xfrm: fix refcount leak in xfrm_migrate_policy_find
xfrm: hold dev ref until after transport_finish NF_HOOK
xfrm: Wait for RCU readers during policy netns exit
xfrm: account XFRMA_IF_ID in aevent size calculation
xfrm: clear trailing padding in build_polexpire()
====================
Link: https://patch.msgid.link/20260408095925.253681-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
`eui64_mt6()` derives a modified EUI-64 from the Ethernet source address
and compares it with the low 64 bits of the IPv6 source address.
The existing guard only rejects an invalid MAC header when
`par->fragoff != 0`. For packets with `par->fragoff == 0`, `eui64_mt6()`
can still reach `eth_hdr(skb)` even when the MAC header is not valid.
Fix this by removing the `par->fragoff != 0` condition so that packets
with an invalid MAC header are rejected before accessing `eth_hdr(skb)`.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Co-developed-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Yuan Tan <yuantan098@gmail.com>
Suggested-by: Xin Liu <bird@lzu.edu.cn>
Tested-by: Ren Wei <enjou1224z@gmail.com>
Signed-off-by: Zhengchuan Liang <zcliangcn@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Florian Westphal <fw@strlen.de>
The seg6 lwtunnel uses a single dst_cache per encap route, shared
between seg6_input_core() and seg6_output_core(). These two paths
can perform the post-encap SID lookup in different routing contexts
(e.g., ip rules matching on the ingress interface, or VRF table
separation). Whichever path runs first populates the cache, and the
other reuses it blindly, bypassing its own lookup.
Fix this by splitting the cache into cache_input and cache_output,
so each path maintains its own cached dst independently.
Fixes: 6c8702c60b ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")
Cc: stable@vger.kernel.org
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Link: https://patch.msgid.link/20260404004405.4057-2-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>