linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-06-03 18:32:28 -04:00

Author	SHA1	Message	Date
Jakub Kicinski	504eaefa44	ethtool: module: check fw_flash_in_progress under rtnl_lock ethnl_set_module_validate() inspects module_fw_flash_in_progress but validate is meant for _input_ validation, not state validation. rtnl_lock is not held, yet. Move the check into ethnl_set_module(). Fixes: `32b4c8b53e` ("ethtool: Add ability to flash transceiver modules' firmware") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://patch.msgid.link/20260522231312.1710836-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 08:19:32 -07:00
Jakub Kicinski	7a84b965ff	ethtool: module: avoid racy updates to dev->ethtool bitfield When reviewing other changes Gemini points out that we currently update module_fw_flash_in_progress without holding any locks. Since module_fw_flash_in_progress is part of a bitfield this is not great, updates to other fields may be lost. We could use a bool and sprinkle some READ_ONCE/WRITE_ONCE here but seems like the issue is rather that the work is an unusual writer. The other writers already hold the right locks. So just very briefly take these locks when the work completes. Note that nothing ever cancels the FW update work, so there's no concern with deadlocks vs cancel. Fixes: `32b4c8b53e` ("ethtool: Add ability to flash transceiver modules' firmware") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://patch.msgid.link/20260522231312.1710836-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 08:19:32 -07:00
Jakub Kicinski	fb7f511d62	ethtool: module: avoid leaking a netdev ref on module flash errors module_flash_fw_schedule() is missing undo for setting the "in_progress" flag and taking the netdev reference. Delay taking these, the device can't disappear while we are holding rtnl_lock. Fixes: `32b4c8b53e` ("ethtool: Add ability to flash transceiver modules' firmware") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://patch.msgid.link/20260522231312.1710836-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 08:19:32 -07:00
Jakub Kicinski	84371fb584	ethtool: module: call ethnl_ops_complete() on module flash errors When validate() fails we are skipping over ethnl_ops_complete() even tho we already called ethnl_ops_begin(). Fixes: `32b4c8b53e` ("ethtool: Add ability to flash transceiver modules' firmware") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://patch.msgid.link/20260522231312.1710836-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 08:19:32 -07:00
Jakub Kicinski	32a9ecde62	ethtool: rss: avoid device context leak on reply-build failure We wait with filling the reply for new RSS context creation until after the driver ->create_rxfh_context call. The driver needs to fill some of the defaults in the context. The failure of rss_fill_reply() is somewhat theoretical, but doesn't take much effort to handle it properly. Call ->remove_rxfh_context(). If the driver's remove callback fails (some implementations like sfc can return real command errors from firmware RPCs) - skip the xa_erase and kfree, leaving the context in the xarray. This matches how ethnl_rss_delete_doit() behaves. Fixes: `a166ab7816` ("ethtool: rss: support creating contexts via Netlink") Link: https://patch.msgid.link/20260522230647.1705600-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 08:17:57 -07:00
Jakub Kicinski	78ccf1a70c	ethtool: rss: fix hkey leak when indir_size is 0 rss_get_data_alloc() allocates a single buffer that backs both the indirection table and the hash key, but only assigned data->indir_table when indir_size was nonzero. The expectation was that no driver implements RSS without supporting indirection table but apparently enic does just that (it's the only such in-tree driver). enic has get_rxfh_key_size but no get_rxfh_indir_size. data->indir_table stays as NULL, hkey gets set but rss_get_data_free() kfree(data->indir_table) is a nop and the allocation leaks. Always store the allocation base in data->indir_table so the free path is unambiguous. No caller treats indir_table as a sentinel; everything keys off indir_size. Fixes: `7112a04664` ("ethtool: add netlink based get rss support") Link: https://patch.msgid.link/20260522230647.1705600-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 08:17:57 -07:00
Jakub Kicinski	266297692f	ethtool: rss: fix indir_table and hkey leak on get_rxfh failure rss_prepare_get() allocates the indirection table and hash key buffer via rss_get_data_alloc(), then calls ops->get_rxfh() to populate them. If get_rxfh() fails, the function returns an error without freeing the allocation. Fixes: `4f038a6a02` ("net: ethtool: Don't call .cleanup_data when prepare_data fails") Link: https://patch.msgid.link/20260522230647.1705600-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 08:17:56 -07:00
Jakub Kicinski	8d60141a32	ethtool: rss: fix falsely ignoring indir table updates rss_set_prep_indir() compares the new indirection table against the current one to determine whether any update is needed. The memcmp call passes data->indir_size as the length argument, but indir_size is the number of u32 entries, not the byte count. Fixes: `c0ae03588b` ("ethtool: rss: initial RSS_SET (indirection table handling)") Link: https://patch.msgid.link/20260522230647.1705600-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 08:17:56 -07:00
Jakub Kicinski	3e6c6e9782	ethtool: rss: add missing errno on RSS context delete Remember to set ret before jumping out if someone tries to delete a context on a device which doesn't support contexts. Fixes: `fbe09277fa` ("ethtool: rss: support removing contexts via Netlink") Link: https://patch.msgid.link/20260522230647.1705600-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 08:17:56 -07:00
Jakub Kicinski	c75b6f6eaa	ethtool: rss: avoid modifying the RSS context response Gemini says that we're modifying the RSS_CREATE response skb. I think it's right, the comment says that unicast() should unshare the skb but I'm not entirely sure what I meant there. netlink_trim() does a copy but only if skb is not well sized (it's at least 2x larger than necessary for the payload). Fixes: `a166ab7816` ("ethtool: rss: support creating contexts via Netlink") Link: https://patch.msgid.link/20260522230647.1705600-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 08:17:56 -07:00
Björn Töpel	2e357f002c	net: Avoid checksumming unreadable skb tail on trim pskb_trim_rcsum_slow() keeps CHECKSUM_COMPLETE valid by subtracting the checksum of the bytes removed from the skb tail. That assumes the removed bytes can be read. io_uring zcrx skbs may contain unreadable net_iov frags. With fbnic header/data split, small TCP/IPv4 packets can carry Ethernet padding in such a frag. ip_rcv_core() trims the skb to iph->tot_len before TCP sees it, and the CHECKSUM_COMPLETE adjustment then calls skb_checksum() on the padding. This is exposed by IPv4 because small TCP/IPv4 frames can be shorter than the Ethernet minimum payload. TCP/IPv6 frames are large enough in the normal zcrx path, so they do not hit the same padding trim. Keep the existing checksum adjustment for readable skbs. If the remaining packet is fully linear, drop CHECKSUM_COMPLETE and let the stack validate the packet after trimming. If unreadable payload would remain, fail the trim; the checksum cannot be adjusted without reading the trimmed tail. Also clear skb->unreadable when trimming removes all frags. Fixes: `65249feb6b` ("net: add support for skbs with unreadable frags") Signed-off-by: Björn Töpel <bjorn@kernel.org> Reviewed-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20260522120643.242974-1-bjorn@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-26 15:21:27 +02:00
Maoyi Xie	8b484efd5c	ip6: vti: Use ip6_tnl.net in vti6_siocdevprivate(). After patch 1/2 in this series, vti6_update() unlinks and relinks the tunnel through t->net. vti6_siocdevprivate() still uses dev_net(dev) for the collision lookup. For a tunnel moved through IFLA_NET_NS_FD, dev_net(dev) is the new netns, not t->net. SIOCCHGTUNNEL on a migrated tunnel then runs: net = dev_net(dev) /* migrated netns */ t = vti6_locate(net, &p1, false) /* misses target in t->net */ ... t = netdev_priv(dev) vti6_update(t, &p1, false) /* mutates t->net's hash */ A caller in the migrated netns picks params that match a tunnel in the creation netns. The lookup in dev_net(dev) finds nothing. vti6_update() prepends the migrated tunnel at the head of the creation netns hash bucket for those params. Later lookups in the creation netns resolve to the migrated device. xfrm receive delivers the matched packets through a device the caller controls. Reachable from an unprivileged user namespace (unshare --user --map-root-user --net). Cross tenant scope on container hosts. Switch the SIOCCHGTUNNEL path on a non fallback device to use t->net for the lookup. The lookup now matches the netns vti6_update() operates on. Also add ns_capable(self->net->user_ns, CAP_NET_ADMIN) before the lookup. The check at the top of the case is against dev_net(dev)->user_ns, which after migration is the attacker's netns. A caller there can pick params absent from self->net, the lookup returns NULL, t becomes self, and vti6_update() inserts the device into the creation netns hash. The new check requires CAP_NET_ADMIN in the creation netns user_ns too. SIOCADDTUNNEL and SIOCCHGTUNNEL on the fallback device keep dev_net(dev), which equals init_net there. 
Fixes: `61220ab349` ("vti6: Enable namespace changing") Suggested-by: Jakub Kicinski <kuba@kernel.org> Suggested-by: Xiao Liang <shaw.leon@gmail.com> Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com> Link: https://patch.msgid.link/20260521130555.3421684-3-maoyixie.tju@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-26 11:16:12 +02:00
Kuniyuki Iwashima	11b326fb0a	ip6: vti: Use ip6_tnl.net in vti6_changelink(). ip netns add ns1 ip netns add ns2 ip -n ns1 link add vti6_test type vti6 remote ::1 local ::2 key 7 ip -n ns1 link set vti6_test netns ns2 ip -n ns2 link set vti6_test type vti6 remote ::3 local ::4 key 9 ip netns del ns2 ip netns del ns1 [ 132.495484] ------------[ cut here ]------------ [ 132.497609] kernel BUG at net/core/dev.c:12376! Commit `61220ab349` ("vti6: Enable namespace changing") dropped NETIF_F_NETNS_LOCAL from vti6 devices. A vti6 tunnel can then move through IFLA_NET_NS_FD. After the move dev_net(dev) points at the new netns while t->net stays at the creation netns. vti6_changelink() and vti6_update() still use dev_net(dev) and dev_net(t->dev). They unlink from one per netns hash and relink into another. The creation netns is left with a stale entry. cleanup_net() of that netns later walks freed memory. Reachable from an unprivileged user namespace (unshare --user --map-root-user --net). Cross tenant scope on container hosts. Fixes: `61220ab349` ("vti6: Enable namespace changing") Reported-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Reviewed-by: Eric Dumazet <edumazet@google.com> Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260521130555.3421684-2-maoyixie.tju@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-26 11:16:12 +02:00
Luka Gejak	f229426072	net: hsr: fix potential OOB access in supervision frame handling Ensure the entire TLV header is linearized before access by adding sizeof(struct hsr_sup_tlv) to the pskb_may_pull() calls. Without this, a truncated frame could cause an out-of-bounds access. Fixes: `eafaa88b3e` ("net: hsr: Add support for redbox supervision frames") Signed-off-by: Luka Gejak <luka.gejak@linux.dev> Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Link: https://patch.msgid.link/20260523130330.61880-1-luka.gejak@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-25 14:18:15 -07:00
Justin Iurman	d47548a366	ipv6: exthdrs: refresh nh pointer after ipv6_hop_jumbo() ipv6_hop_jumbo() calls pskb_trim_rcsum(), which can change skb pointers. Let's recompute nh pointer to make sure any change won't mess things up. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Signed-off-by: Justin Iurman <justin.iurman@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260522112013.12342-1-justin.iurman@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-25 11:08:00 -07:00
Zhengchuan Liang	f7b52afe35	ipv6: exthdrs: refresh nh after handling HAO option ip6_parse_tlv() caches skb_network_header(skb) in nh while walking IPv6 TLVs. ipv6_dest_hao() may call pskb_expand_head() for a cloned skb, which can move the skb head and invalidate the cached network header pointer. Refresh nh after ipv6_dest_hao() returns so any trailing padding or TLVs are parsed from the current skb head. This matches the existing pattern used in ip6_parse_tlv() after helpers that can modify skb header storage. Fixes: `a831f5bbc8` ("[IPV6] MIP6: Add inbound interface of home address option.") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Co-developed-by: Luxing Yin <tr0jan@lzu.edu.cn> Signed-off-by: Luxing Yin <tr0jan@lzu.edu.cn> Signed-off-by: Zhengchuan Liang <zcliangcn@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Reviewed-by: Justin Iurman <justin.iurman@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/7aba1debc2196189172499e5769802b026f8caf8.1779247873.git.zcliangcn@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-25 11:07:40 -07:00
Jakub Kicinski	f6f1bfc198	Merge tag 'nf-26-05-22' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Florian Westphal says: ==================== netfilter: updates for net Patches 7+8 fix a regression from 7.1-rc1. Everything else is from 2.6.x to 5.3 releases. There are additional known issues with these patches (drive-by-findings in related code). There are many old bugs all over netfilter and our ability to review feature patches has come to a complete halt due to lack of time. There are further security bugs that we cannot address due to lack of time, maintainers and reviewers. Other remarks: The xtables 32bit compat interface is already off in many vendor kernels, the plan is to remove it soon. 1) Prevent RST packets with invalid sequence numbers from forcing TCP connections into the CLOSE state without a direction check. From Hamza Mahfooz. 2) Re-derive the TCP header pointer after skb_ensure_writable in synproxy_tstamp_adjust. Prevent use-after-free and invalid checksum updates caused by stale pointers during buffer expansion. From Chris Mason. 3) Fix a race condition causing keymap list corruption in conntracks gre/pptp helper. 4) Use raw_smp_processor_id() in xt_cpu to prevent splats under PREEMPT_RCU. 5) Disable netfilter payload mangling in user namespaces (nft_payload.c and nf_queue). TCP option mangling via nft_exthdr.c remains enabled. There will be followups here to restrict resp. revalidate headers. 6) Fix an out-of-bounds read in ebtables's compat_mtw_from_user function. 7) Use list_for_each_entry_rcu() to traverse fib6_siblings in nft_fib6_info_nh_uses_dev(). Ensure safe list walking under RCU. 8) Fix an out-of-bounds read in nft_fib_ipv6 caused by incorrect list traversal. 9) Add nft_fib_nexthop selftest to netfilter. Cover nexthop enumeration for single, group, and multipath route shapes. All three nft_fib6 fixes from Jiayuan Chen. 10) Fix destination corruption in shift operations when source and destination registers overlap. 
Reject partial register overlap for all operations from control plane. From Fernando Fernandez Mancera. * tag 'nf-26-05-22' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: fix dst corruption in same register operation selftests: netfilter: add nft_fib_nexthop test netfilter: nft_fib_ipv6: handle routes via external nexthop netfilter: nft_fib_ipv6: walk fib6_siblings under RCU netfilter: ebtables: fix OOB read in compat_mtw_from_user netfilter: disable payload mangling in userns netfilter: xt_cpu: prefer raw_smp_processor_id netfilter: nf_conntrack_gre: fix gre keymap list corruption netfilter: synproxy: refresh tcphdr after skb_ensure_writable netfilter: conntrack: tcp: do not force CLOSE on invalid-seq RST without direction check ==================== Link: https://patch.msgid.link/20260522104257.2008-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-25 10:37:28 -07:00
Eric Dumazet	87a1e0fe77	ipv4: free net->ipv4.sysctl_local_reserved_ports after unregister_net_sysctl_table() ipv4_sysctl_exit_net() is currently freeing net->ipv4.sysctl_local_reserved_ports too soon. Only after unregister_net_sysctl_table() we can be sure no threads can possibly use the sysctls, including /proc/sys/net/ipv4/ip_local_reserved_ports. Fixes: `122ff243f5` ("ipv4: make ip_local_reserved_ports per netns") Reported-by: Ji'an Zhou <eilaimemedsnaimel@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://patch.msgid.link/20260521122147.3584624-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-22 19:05:31 -07:00
Stefano Garzarella	4157501b9a	vsock/virtio: fix skb overhead overflow on 32-bit builds On 32-bit architectures, both skb_queue_len() and SKB_TRUESIZE(0) evaluate to 32-bit values. The multiplication can overflow before being assigned to the u64 skb_overhead variable, making the skb overhead check ineffective. Cast skb_queue_len() to u64 so the multiplication is always performed in 64-bit arithmetic. This issue was reported by Sashiko while reviewing another patch. Fixes: `059b7dbd20` ("vsock/virtio: fix potential unbounded skb queue") Closes: https://sashiko.dev/#/patchset/20260518090656.134588-1-sgarzare%40redhat.com Cc: stable@vger.kernel.org Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/20260521124732.125771-1-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-22 19:05:10 -07:00
Breno Leitao	3589d20a66	net/iucv: fix locking in .getsockopt Mirror iucv_sock_setsockopt() and wrap the whole switch in lock_sock()/release_sock(). The pre-existing SO_MSGLIMIT-only lock becomes redundant and is removed. Any AF_IUCV HIPER user can potentially crash the kernel by racing recvmsg() with getsockopt(SO_MSGSIZE): the SO_MSGSIZE arm dereferences iucv->hs_dev->mtu after iucv_sock_close() (called from the racing recvmsg()) has set hs_dev to NULL, producing a NULL pointer dereference oops. Suggested-by: Stanislav Fomichev <sdf.kernel@gmail.com> Fixes: `51363b8751` ("af_iucv: allow retrieval of maximum message size") Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> Tested-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://patch.msgid.link/20260521-af_iucv_fix2-v1-1-f16b1c510aa9@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-22 17:40:32 -07:00
Alexandra Winter	9e4389b003	net/smc: Do not re-initialize smc hashtables INIT_HLIST_HEAD(&smc_v*_hashinfo.ht) are called after smc_nl_init(), proto_register() and sock_register(). This can lead to smc_v*_hashinfo.ht being reset even though hash entries already exist and are being used, possibly resulting in a corrupted list. Remove unnecessary and dangerous re-initialisation of smc_v*_hashinfo.ht in smc_init(); it is implicitly initialised to zero anyhow. Add HLIST_HEAD_INIT to the definitions for clarity. Fixes: `f16a7dd5cf` ("smc: netlink interface for SMC sockets") Suggested-by: Halil Pasic <pasic@linux.ibm.com> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Acked-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Link: https://patch.msgid.link/20260521145639.10317-1-wintera@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-22 17:38:02 -07:00
Ilya Maximets	88b126b39f	net: netlink: don't set nsid on local notifications In most cases, notifications on sockets with NETLINK_LISTEN_ALL_NSID do not contain NSID in their ancillary data in case the event is local to the listener. However, when a self-referential NSID is allocated for a namespace, every local notification starts sending this ID to the user space. This is problematic, because the listener cannot tell if those notifications are local or not anymore without making extra requests to figure out if the provided NSID is local or not. The listener can also not figure out the local NSID beforehand as it can be allocated at any point in time by other processes, changing the structure of the future notifications for everyone. The value is practically not useful, since it's the namespace's own ID that the application has to obtain from other sources in order to figure out if it's the same or not. So, for the application it's just extra busy work with no benefits. Moreover, applications that do not know about this quirk may be mishandling notifications with NSID set as notifications from remote namespaces. This is the case for ovs-vswitchd and the iproute2's 'ip monitor' that stops printing 'current' and starts printing the nsid number mid-session. Lack of clear documentation for this behavior is also not helping. A search through open-source projects doesn't reveal any projects that use NETNSA_NSID_NOT_ASSIGNED and rely on metadata to contain self-referential NSIDs (expected, since the value is not useful). Quite the opposite, as already mentioned, there are a few applications that rely on NSID to not be present in local events. Since the value is not useful and actively harmful in some cases, let's not report it for local events, making the notifications more consistent. Also adding some blank lines for readability. 
Fixes: `59324cf35a` ("netlink: allow to listen "all" netns") Reported-by: Matteo Perin <matteo.perin@canonical.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://patch.msgid.link/20260520172317.175168-3-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-22 17:11:09 -07:00
Ilya Maximets	70f8592ee9	net: netlink: fix sending unassigned nsid after assigned one If the current skb is not shared, it is re-used directly for all the sockets subscribed to the notification. If we have a remote all-nsid socket receiving a message first, then the 'nsid_is_set' will be set to 'true'. If the nsid is NOT_ASSIGNED for the next socket in the list, the 'nsid_is_set' will remain 'true' and the negative value will be delivered to the user space. All subsequent nsid values will be delivered as well, since there is no code path that sets the flag back to 'false'. Fix that by always dropping the flag to 'false' first. Fixes: `7212462fa6` ("netlink: don't send unknown nsid") Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://patch.msgid.link/20260520172317.175168-2-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-22 17:11:09 -07:00
Ziyu Zhang	aae9d8a552	vsock: keep poll shutdown state consistent vsock_poll() reads vsk->peer_shutdown before taking the socket lock to set EPOLLHUP and EPOLLRDHUP, then reads it again after taking the lock to report EOF readability. A shutdown packet can update peer_shutdown while poll is waiting for the lock, so one poll invocation can report EOF readability without the corresponding HUP/RDHUP bits. For connectible sockets, take one peer_shutdown snapshot after lock_sock() and use it for all peer-shutdown-derived poll bits. For datagram sockets, which do not take lock_sock() in poll(), take one lockless READ_ONCE() snapshot and pair it with WRITE_ONCE() on the writer side. This keeps the peer-shutdown-derived bits internally consistent for each poll pass. Fixes: `d021c34405` ("VSOCK: Introduce VM Sockets") Signed-off-by: Ziyu Zhang <ziyuzhang201@gmail.com> Link: https://patch.msgid.link/20260519165636.62542-1-ziyuzhang201@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-22 11:27:57 -07:00
Fernando Fernandez Mancera	18014147d3	netfilter: nf_tables: fix dst corruption in same register operation For lshift and rshift, the shift operations are performed in a loop over 32-bit words. The loop calculates the shifted value and writes it to dst, and then immediately reads from src to calculate the carry for the next iteration. Because src and dst could point to the same memory location, the carry is incorrectly calculated using the newly modified dst value instead of the original src value. Adding a temporary local variable to cache the original value before writing to dst and using it for the carry calculation solves the problem. In addition, partial overlap is rejected from control plane for all kinds of operations including byteorder. This was tested with the following bytecode: table test_table ip flags 0 use 1 handle 1 ip test_table test_chain use 3 type filter hook input prio 0 policy accept packets 0 bytes 0 flags 1 ip test_table test_chain 2 [ immediate reg 1 0x44332211 0x88776655 ] [ bitwise reg 1 = ( reg 1 << 0x08000000 ) ] [ cmp eq reg 1 0x66443322 0x00887766 ] [ counter pkts 0 bytes 0 ] ip test_table test_chain 4 3 [ immediate reg 1 0x44332211 0x88776655 ] [ bitwise reg 1 = ( reg 1 << 0x08000000 ) ] [ cmp eq reg 1 0x55443322 0x00887766 ] [ counter pkts 21794 bytes 1917798 ] Fixes: `567d746b55` ("netfilter: bitwise: add support for shifts.") Acked-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-05-22 12:28:46 +02:00
Jiayuan Chen	f81b0c2d28	netfilter: nft_fib_ipv6: handle routes via external nexthop fib6_info has a union: union { struct list_head fib6_siblings; struct list_head nh_list; }; Old-style multipath (ip -6 route add ... nexthop ... nexthop ...) uses fib6_siblings. External nexthop (ip -6 route add ... nhid N) uses nh_list, linked into &nh->f6i_list. nft_fib6_info_nh_uses_dev() blindly walks &rt->fib6_siblings, causing an OOB read past the struct nexthop slab when rt->nh is set: ================================================================== BUG: KASAN: slab-out-of-bounds in nft_fib6_eval+0x1362/0x16c0 Read of size 8 at addr ffff888103a099d0 by task ping/386 CPU: 2 UID: 0 PID: 386 Comm: ping Not tainted 7.1.0-rc3+ #251 PREEMPT Call Trace: <IRQ> dump_stack_lvl+0x76/0xa0 print_report+0xd1/0x5f0 kasan_report+0xe7/0x130 __asan_report_load8_noabort+0x14/0x30 nft_fib6_eval+0x1362/0x16c0 nft_do_chain+0x279/0x18c0 nft_do_chain_ipv6+0x1a8/0x230 nf_hook_slow+0xad/0x200 ipv6_rcv+0x152/0x380 __netif_receive_skb_one_core+0x118/0x1c0 ================================================================== Branch by route shape: when rt->nh is set, walk via nexthop_for_each_fib6_nh() (also covers nh groups, which the original code missed); otherwise walk fib6_siblings, guarded by READ_ONCE() of rt->fib6_nsiblings as required by commit `31d7d67ba1` ("ipv6: annotate data-races around rt->fib6_nsiblings"). Fixes: `1c32b24c23` ("netfilter: nft_fib_ipv6: switch to fib6_lookup") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-05-22 12:28:46 +02:00
Jiayuan Chen	1d001b0a61	netfilter: nft_fib_ipv6: walk fib6_siblings under RCU nft_fib6_info_nh_uses_dev() runs from nft_fib6_eval() in softirq under rcu_read_lock(). fib6_siblings is modified by writers that hold tb6_lock but do not wait for RCU readers, so the sibling walk should use list_for_each_entry_rcu(): it adds READ_ONCE() on the ->next pointer and lets CONFIG_PROVE_RCU_LIST validate the locking. No functional change for non-debug builds. Fixes: `1c32b24c23` ("netfilter: nft_fib_ipv6: switch to fib6_lookup") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-05-22 12:28:46 +02:00
Florian Westphal	f438d1786d	netfilter: ebtables: fix OOB read in compat_mtw_from_user Luxiao Xu says: The function compat_mtw_from_user() converts ebtables extensions from 32-bit user structures to kernel native structures. However, it lacks proper validation of the user-supplied match_size/target_size. When certain extensions are processed, the kernel-side translation logic may perform memory accesses based on the extension's expected size. If the user provides a size smaller than what the extension requires, it results in an out-of-bounds read as reported by KASAN. This fix introduces a check to ensure match_size is at least as large as the extension's required compatsize. This covers matches, watchers, and targets, while maintaining compatibility with standard targets. AFAIU this is relevant for matches that need to go through the match->compat_from_user() call. Those that use plain memcpy with the user-provided size are ok because the caller checks that size vs the start of the next rule entry offset (which itself is checked vs. total size copied from userspace). The ->compat_from_user() callbacks assume they can read compatsize bytes, so they need this extra check. Based on an earlier patch from Luxiao Xu. Fixes: `81e675c227` ("netfilter: ebtables: add CONFIG_COMPAT support") Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Luxiao Xu <rakukuip@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-05-22 12:28:46 +02:00
Florian Westphal	968cc2c963	netfilter: disable payload mangling in userns Several parts of network stack rely on iph->ihl validation done by network stack before PRE_ROUTING. Disable this feature for user namespaces for now. tcp option handling is likely safe even for LOCAL_IN, so this leaves tcp option mangling via nft_exthdr.c as-is. I don't think these are the only means to alter packets, but these appear to be relatively prominent. This could be relaxed later. Example: - allow userns for ingress hook. - allow userns if base is transport header. Also, we should revalidate or restrict generally: - Don't allow linklayer writes to spill into network header - restrict ipv4 and ipv6 to 'known safe' writes, e.g. saddr/daddr/check/tos Reported-by: Qi Tang <tpluszz77@gmail.com> Reported-by: Tong Liu <lyutoon@gmail.com> Tested-by: Qi Tang <tpluszz77@gmail.com> Link: https://lore.kernel.org/netfilter-devel/20260515100411.3141-1-fw@strlen.de/ Signed-off-by: Florian Westphal <fw@strlen.de>	2026-05-22 12:28:46 +02:00
Florian Westphal	c376f07e16	netfilter: xt_cpu: prefer raw_smp_processor_id With PREEMPT_RCU we get splat: BUG: using smp_processor_id() in preemptible [..] caller is cpu_mt+0x53/0xd0 net/netfilter/xt_cpu.c:37 CPU: 1 .. Comm: syz.3.1377 #0 PREEMPT(full) Call Trace: <TASK> dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 check_preemption_disabled+0xd3/0xe0 lib/smp_processor_id.c:47 cpu_mt+0x53/0xd0 net/netfilter/xt_cpu.c:37 [..] Just use raw version instead. This is similar to `14d14a5d29` ("netfilter: nft_meta: use raw_smp_processor_id()"). Fixes: `0ca743a559` ("netfilter: nf_tables: add compatibility layer for x_tables") Reported-by: syzbot+690d3e3ffa7335ac10eb@syzkaller.appspotmail.com Signed-off-by: Florian Westphal <fw@strlen.de>	2026-05-22 12:28:46 +02:00
Florian Westphal	47980b6dbf	netfilter: nf_conntrack_gre: fix gre keymap list corruption Quoting reporter: A race between GRE keymap insertion and destruction can corrupt the kernel list or use a freed object. `nf_ct_gre_keymap_add()` publishes a new keymap pointer before the embedded `list_head` is linked, while `nf_ct_gre_keymap_destroy()` can concurrently delete and free that same object. An unprivileged user can reach this through the PPTP conntrack helper by racing PPTP control messages or helper teardown, leading to KASAN-detectable list corruption/UAF in kernel context. ## Root Cause Analysis `exp_gre()` installs GRE expectations for a PPTP control flow and then adds two GRE keymap entries [..] The add path publishes `ct_pptp_info->keymap[dir]` before linking the embedded list node [..] Concurrent teardown deletes that partially initialized object. Make add/destroy symmetric: install both, destroy both while under lock. Furthermore, we should refuse to publish a new mapping in case ct is going away, else we may leak the allocation. The "retrans" detection is strange: existing mapping is checked for key equality with the new mapping, then for "is on the list" via list walk. But I can't see how an existing keymap entry can be NOT on list. Change this to only check if we're asked to map same tuple again -- if so, skip re-install, else signal failure. Last, add a bug trap for the keymap list; it has to be empty when namespace is going away. Reported-by: Leo Lin <leo@depthfirst.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-05-22 12:28:46 +02:00
Chris Mason	92170e6afe	netfilter: synproxy: refresh tcphdr after skb_ensure_writable synproxy_tstamp_adjust() rewrites the TCP timestamp option in place and then patches the TCP checksum via inet_proto_csum_replace4() on the caller-supplied tcphdr pointer. Both ipv4_synproxy_hook() and ipv6_synproxy_hook() obtain that pointer with skb_header_pointer() before calling in, so it may either alias skb->head directly or point at the caller's on-stack _tcph buffer. Between obtaining the pointer and using it, the function calls skb_ensure_writable(skb, optend), which on a cloned or non-linear skb invokes pskb_expand_head() and frees the old skb->head. After that point the cached th is stale: caller (ipv[46]_synproxy_hook) th = skb_header_pointer(skb, ..., &_tcph) synproxy_tstamp_adjust(skb, protoff, th, ...) skb_ensure_writable(skb, optend) pskb_expand_head() /* kfree(old skb->head) */ ... inet_proto_csum_replace4(&th->check, ...) /* writes into freed head, or into the caller's stack copy leaving the on-wire checksum stale */ The option bytes are written through skb->data and are fine; only the checksum update goes through th and so lands in the wrong place. The result is either a write into freed slab memory or a packet leaving with a checksum that does not match its payload. Fix by re-deriving th from skb->data + protoff immediately after skb_ensure_writable() succeeds, so the subsequent checksum update targets the linear, writable header. Fixes: `48b1de4c11` ("netfilter: add SYNPROXY core/target") Assisted-by: kres (claude-opus-4-7) Signed-off-by: Chris Mason <clm@meta.com> Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-05-22 12:28:40 +02:00
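The stale-pointer pattern in the synproxy commit above can be modelled outside the kernel. The sketch below is an assumption-laden illustration, not the actual fix: `Skb`, `adjust_stale` and `adjust_fixed` are hypothetical names, a `memoryview` stands in for the cached `th` pointer, and `ensure_writable()` mimics the buffer swap by replacing the backing `bytearray` (the old buffer stays alive in Python, so the model shows the stale checksum rather than the use-after-free).

```python
CHECK_OFF = 16  # byte offset of the checksum field inside a TCP header


class Skb:
    """Toy packet buffer; 'cloned' forces a reallocation on first write."""

    def __init__(self, data: bytes, cloned: bool):
        self.head = bytearray(data)
        self.cloned = cloned

    def ensure_writable(self) -> None:
        if self.cloned:
            # New private buffer; the old one is what the kernel would free.
            self.head = bytearray(self.head)
            self.cloned = False


def adjust_stale(skb: Skb, protoff: int, new_check: bytes) -> None:
    th = memoryview(skb.head)[protoff:]      # cached before the realloc
    skb.ensure_writable()
    th[CHECK_OFF:CHECK_OFF + 2] = new_check  # lands in the old buffer


def adjust_fixed(skb: Skb, protoff: int, new_check: bytes) -> None:
    skb.ensure_writable()
    th = memoryview(skb.head)[protoff:]      # re-derived after the realloc
    th[CHECK_OFF:CHECK_OFF + 2] = new_check


def on_wire_check(skb: Skb, protoff: int) -> bytes:
    start = protoff + CHECK_OFF
    return bytes(skb.head[start:start + 2])
```

With a cloned buffer, `adjust_stale` leaves the checksum in `skb.head` unchanged, while `adjust_fixed` updates it. The same ordering rule applies to any pointer derived before `skb_ensure_writable()`.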
Hamza Mahfooz	bed6e04be8	netfilter: conntrack: tcp: do not force CLOSE on invalid-seq RST without direction check An unintended behavior in the TCP conntrack state machine allows a connection to be forced into the CLOSE state using an RST packet with an invalid sequence number. Specifically, after a SYN packet is observed, an RST with an invalid SEQ can transition the conntrack entry to TCP_CONNTRACK_CLOSE, regardless of whether the RST corresponds to the expected reply direction. The relevant code path assumes the RST is a response to an outgoing SYN, but does not validate packet direction or ensure that a matching SYN was actually sent in the opposite direction. As a result, a crafted packet sequence consisting of a SYN followed by an invalid-sequence RST can prematurely terminate an active NAT entry. This makes connection teardown easier than intended. So, tighten the state transition logic to ensure that RST-triggered CLOSE transitions only occur when the RST is a valid response to a previously observed SYN in the correct direction. Cc: stable@vger.kernel.org Fixes: `9fb9cbb108` ("[NETFILTER]: Add nf_conntrack subsystem.") Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-05-22 12:27:55 +02:00
Linus Torvalds	68993ced0f	Merge tag 'net-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from Bluetooth, wireless and netfilter. Craziness continues with no end in sight. Even discounting the driver revert this is a pretty huge PR for standards of the previous era. I'd speculate - we haven't seen the worst of it, yet. Good news, I guess, is that so far we haven't seen many (any?) cases of "AI reported a bug, we fixed it and a real user regressed". Current release - fix to a fix: - Bluetooth: btmtk: accept too short WMT FUNC_CTRL events - vsock/virtio: relax the recently added memory limit a little Current release - regressions: - IB/IPoIB: make sure IB drivers always use async set_rx_mode since some (mlx5) are now required to use it due to locking changes Previous releases - regressions: - udp: fix UDP length on last GSO_PARTIAL segment - af_unix: fix UAF read of tail->len in unix_stream_data_wait() - tcp: fix stale per-CPU tcp_tw_isn leak enabling ISN prediction - mlx5e: fix unlocked writing to ICOSQ, breaking AF_XDP Previous releases - always broken: - tap: fix stack info leak in tap_ioctl() SIOCGIFHWADDR - ipv4: raw: reject IP_HDRINCL packets with ihl < 5 - Bluetooth: a lot of locking and concurrency fixes (as always) - batman-adv (mesh wireless networking): a lot of random fixes for issues reported by security researchers and Sashiko - netfilter: same thing, a lot of small security-ish fixes all over the place, nothing really stands out Misc: - bring back the old 3c509 driver, Maciej wants to maintain it" * tag 'net-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (187 commits) net: enetc: avoid VF->PF mailbox timeout during SR-IOV teardown net: enetc: fix init and teardown order to prevent use of unsafe resources net: enetc: fix unbounded loop and interrupt handling in VF-to-PF messaging net: enetc: fix DMA write to freed memory in enetc_msg_free_mbx() net: 
enetc: fix race condition in VF MAC address configuration net: enetc: fix TOCTOU race and validate VF MAC address net: enetc: add ratelimiting to VF mailbox error messages net: enetc: fix missing error code when pf->vf_state allocation fails net: enetc: fix incorrect mailbox message status returned to VFs net: bridge: prevent too big nested attributes in br_fill_linkxstats() l2tp: use list_del_rcu in l2tp_session_unhash net: bcmgenet: keep RBUF EEE/PM disabled ethernet: 3c509: Fix most coding style issues ethernet: 3c509: Update documentation to match MAINTAINERS ethernet: 3c509: Add GPL 2.0 SPDX license identifier ethernet: 3c509: Fix AUI transceiver type selection Revert "drivers: net: 3com: 3c509: Remove this driver" tools: ynl: support listening on all nsids net: gro: don't merge zcopy skbs pds_core: ensure null-termination for firmware version strings ...	2026-05-21 14:39:12 -07:00
Jakub Kicinski	0e3c08f1b7	Merge tag 'wireless-2026-05-21' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Quite a few more updates: - cfg80211/mac80211: - various security(-ish) fixes - fix A-MSDU subframe handling - fix multi-link element parsing - ath10: avoid sending commands to dead device - ath11k: - fix WMI buffer leaks on error conditions - fix UAF in RX MSDU coalesce path - allow peer ID 0 on RX path (legal for mobile devices) - reinitialize shared SRNG pointers on restart - ath12k: - fix 20 MHz-only parsing of EHT-MCS map - iwlwifi: - fix TSO segmentation explosion - don't TX to dead device - fix warning in WoWLAN - fix TX rates on old devices - disconnect on beacon loss only if also no other traffic - fix NULL-ptr deref - fix STEP_URM hardware access * tag 'wireless-2026-05-21' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (24 commits) wifi: cfg80211: wext: validate chandef in monitor mode wifi: mac80211: consume only present negotiated TTLM maps wifi: wilc1000: fix dma_buffer leak on bus acquire failure wifi: mac80211: capture fast-RX rate before mesh reuses skb->cb wifi: mac80211: fix multi-link element inheritance wifi: mac80211: fix MLE defragmentation wifi: mac80211: don't override max_amsdu_subframes wifi: mac80211: bounds-check link_id in ieee80211_ml_epcs wifi: ath12k: fix EHT TX MCS limitation due to wrong 20 MHz-only parsing wifi: ath11k: clear shared SRNG pointer state on restart wifi: ath11k: fix use after free in ath11k_dp_rx_msdu_coalesce() wifi: ath11k: fix peer resolution on rx path when peer_id=0 wifi: iwlwifi: mld: disconnect only after 6 beacons without Rx wifi: iwlwifi: mld: don't WARN on WoWLAN suspend w/o BSS vif wifi: iwlwifi: use correct function to read STEP_URM register wifi: iwlwifi: mvm: fix driver-set TX rates on old devices wifi: iwlwifi: mld: don't dereference a pointer before NULL checking it wifi: iwlwifi: mld: stop TX during firmware 
restart wifi: iwlwifi: mld: fix TSO segmentation explosion when AMSDU is disabled wifi: ath10k: skip WMI and beacon transmission when device is wedged ... ==================== Link: https://patch.msgid.link/20260521152903.374070-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-21 11:03:58 -07:00
Eric Dumazet	bdd39576bf	net: bridge: prevent too big nested attributes in br_fill_linkxstats() After commit ff205bf8c554 ("netlink: add one debug check in nla_nest_end()") syzbot found that br_fill_linkxstats() can send corrupted netlink packets. Make sure the nested attribute size is bounded. Fixes: `a60c090361` ("bridge: netlink: export per-vlan stats") Reported-by: syzbot+a35f9259d08f907c06e6@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6a0b0da3.050a0220.175f0c.0000.GAE@google.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260520114207.1394241-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-21 08:47:36 -07:00
Michael Bommarito	979c017803	l2tp: use list_del_rcu in l2tp_session_unhash An unprivileged local user can pin a host CPU indefinitely in l2tp_session_get_by_ifname() by issuing L2TP_CMD_SESSION_GET on L2TP_ATTR_IFNAME concurrently with L2TP_CMD_SESSION_CREATE and L2TP_CMD_SESSION_DELETE on the same tunnel. All three commands take GENL_UNS_ADMIN_PERM, so CAP_NET_ADMIN in the netns user namespace suffices; on any host that has l2tp_core loaded the trigger is reachable from a standard `unshare -Urn` sandbox. l2tp_session_unhash() removes a session from tunnel->session_list with list_del_init(), but that list is walked by l2tp_session_get_by_ifname() with list_for_each_entry_rcu() under rcu_read_lock_bh(). list_del_init() leaves the deleted entry's next/prev self-pointing; a reader that has loaded the entry and then advances pos->list.next reads &session->list, container_of()s back to the same session, and list_for_each_entry_rcu() never reaches the list head. The CPU stays in strcmp() inside the walker, with BH and preemption disabled, so RCU grace periods on the host stall behind it and the wedged thread cannot be killed (SIGKILL is delivered on syscall return). Use list_del_rcu() to match the existing list_add_rcu() in l2tp_session_register(); the deleted session remains visible to in-flight walkers with consistent next/prev pointers until kfree_rcu() in l2tp_session_free() releases it. tunnel->session_list has exactly one list_del_init() call site; the list_del_init (&session->clist) at l2tp_core.c:533 operates on the per-collision list, which is not walked under RCU. list_empty(&session->list) is not used anywhere in net/l2tp/ after the unhash point, so dropping the post-delete self-init is safe; the fix has no userspace-visible behavior change. 
Fixes: `89b768ec2d` ("l2tp: use rcu list add/del when updating lists") Cc: stable@vger.kernel.org # 6.11+ Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Link: https://patch.msgid.link/20260518183447.64078-1-michael.bommarito@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-21 08:47:20 -07:00
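The difference between `list_del_init()` and `list_del_rcu()` for a reader that is already standing on the deleted entry can be reproduced with a toy circular list. This is a minimal sketch: `Node`, `build` and `reader_resumes_from` are hypothetical helpers, and the bounded step count only exists so the model can report non-termination instead of spinning.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.next = self
        self.prev = self


def list_add_tail(new, head):
    last = head.prev
    new.prev, new.next = last, head
    last.next = new
    head.prev = new


def list_del_init(entry):
    entry.prev.next = entry.next
    entry.next.prev = entry.prev
    entry.next = entry.prev = entry  # self-pointing after removal


def list_del_rcu(entry):
    entry.prev.next = entry.next
    entry.next.prev = entry.prev
    entry.prev = None  # next stays valid for in-flight readers


def reader_resumes_from(pos, head, max_steps=16):
    """Steps a reader parked on pos needs to reach head, or None if it never does."""
    steps = 0
    while pos is not head:
        if steps == max_steps:
            return None
        pos = pos.next
        steps += 1
    return steps


def build():
    head = Node("head")
    nodes = [Node(n) for n in "abc"]
    for n in nodes:
        list_add_tail(n, head)
    return head, nodes
```

After `list_del_init(b)` a reader parked on `b` keeps following `b.next` back to `b`. After `list_del_rcu(b)` the same reader still sees `b.next` pointing at `c` and reaches the head.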
Sabrina Dubroca	4db79a322d	net: gro: don't merge zcopy skbs skb_gro_receive() can currently copy frags between the source and GRO skb, without checking the zerocopy status, and in particular the SKBFL_MANAGED_FRAG_REFS flag. When SKBFL_MANAGED_FRAG_REFS is set, the skb doesn't hold a reference on the pages in shinfo->frags. Appending those frags to another skb's frags without fixing up the page refcount can lead to UAF. When either the last skb in the GRO chain (the one we would append frags to) or the source skb is zerocopy, don't merge the skbs. Fixes: `753f1ca4e1` ("net: introduce managed frags infrastructure") Reported-by: Huzaifa Sidhpurwala <huzaifas@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/c3b7f906bbfcbdfd7b4fa9d6c18a438870df85be.1779307748.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-21 08:21:33 -07:00
Justin Iurman	e46e6bc97f	ipv6: ioam: refresh hdr pointer before ioam6_event() Reported by Sashiko: In ipv6_hop_ioam(), the hdr pointer is initialized to point into the skb's linear data buffer. Later, the code calls skb_ensure_writable(), which might reallocate the buffer: if (skb_ensure_writable(skb, optoff + 2 + hdr->opt_len)) goto drop; /* Trace pointer may have changed */ trace = (struct ioam6_trace_hdr *)(skb_network_header(skb) + optoff + sizeof(*hdr)); ioam6_fill_trace_data(skb, ns, trace, true); ioam6_event(IOAM6_EVENT_TRACE, dev_net(skb->dev), GFP_ATOMIC, (void *)trace, hdr->opt_len - 2); If the skb is cloned or lacks sufficient linear headroom, skb_ensure_writable() will invoke pskb_expand_head(), which reallocates the skb's data buffer and frees the old one, invalidating pointers to it. While the code recalculates the trace pointer immediately after the call to skb_ensure_writable(), it fails to recalculate the hdr pointer. This patch fixes the above by recalculating the hdr pointer before passing hdr->opt_len to ioam6_event(), so that we avoid any UaF. Fixes: `f655c78d62` ("net: exthdrs: ioam6: send trace event") Cc: stable@vger.kernel.org Signed-off-by: Justin Iurman <justin.iurman@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260520124242.32320-1-justin.iurman@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-21 08:19:25 -07:00
Zhang Cen	c367b90821	netpoll: normalize skb->dev to the netpoll device __netpoll_send_skb() always transmits through np->dev and queues busy packets on np->dev->npinfo->txq, but it leaves skb->dev unchanged. Stacked callers such as DSA and macvlan can reach netpoll with skb->dev still naming the upper device while np->dev is the lower device that owns the netpoll state. If the skb has to be deferred, queue_process() later dequeues it from the lower device's txq but retries it through skb->dev. That can re-enter the upper ndo_start_xmit path on an already transformed skb, and if the upper device disappears before the lower txq drains the workqueue can dereference a stale skb->dev pointer. The buggy scenario involves two paths, with each column showing the order within that path: path A label: netpoll enqueue path path B label: upper-device teardown 1. Stacked xmit calls netpoll 1. Teardown unregisters the upper with lower np->dev and upper net_device while lower npinfo skb->dev. stays alive. 2. __netpoll_send_skb() uses 2. netdev_release() runs for the np->dev->npinfo as the txq upper net_device. owner. 3. Busy transmit queues the skb 3. The lower txq still owns the on that lower txq with upper deferred skb. skb->dev. 4. queue_process() drains the 4. queue_process() dereferences lower txq and reads skb->dev. that stale upper skb->dev. Normalize skb->dev to np->dev after loading np->dev from the netpoll instance, before either the direct transmit path or the fallback enqueue. This keeps the queued skb in the same device and txq domain as the netpoll state that owns it. KASAN report as below: KASAN slab-use-after-free in queue_process+0x7c/0x480 Workqueue: events queue_process The buggy address belongs to the object at ffff88810906c000 which belongs to the cache kmalloc-4k of size 4096 The buggy address is located 168 bytes inside of freed 4096-byte region [ffff88810906c000, ffff88810906d000) Read of size 8 Call trace: dump_stack_lvl+0x73/0xb0 (?:?) 
print_report+0xd1/0x620 (?:?) srso_alias_return_thunk+0x5/0xfbef5 (?:?) __virt_addr_valid+0x215/0x420 (?:?) kasan_complete_mode_report_info+0x64/0x200 (?:?) kasan_report+0xf7/0x130 (?:?) queue_process+0x7c/0x480 (net/core/netpoll.c:88) kasan_check_range+0x10c/0x1c0 (?:?) __kasan_check_read+0x15/0x20 (?:?) process_one_work+0x8b7/0x1af0 (kernel/workqueue.c:3200) assign_work+0x170/0x3f0 (?:?) worker_thread+0x574/0xf10 (?:?) _raw_spin_unlock_irqrestore+0x4b/0x60 (?:?) trace_hardirqs_on+0x2a/0x180 (?:?) kthread+0x2fc/0x3f0 (?:?) ret_from_fork+0x58b/0x830 (?:?) __switch_to+0x58e/0xe90 (?:?) __switch_to_asm+0x39/0x70 (?:?) ret_from_fork_asm+0x1a/0x30 (?:?) Freed by task stack: kasan_save_stack+0x3d/0x60 (?:?) kasan_save_track+0x18/0x40 (?:?) kasan_save_free_info+0x3f/0x60 (?:?) __kasan_slab_free+0x48/0x70 (?:?) kfree+0x20e/0x4e0 (?:?) kvfree+0x31/0x40 (?:?) netdev_release+0x71/0x90 (net/core/net-sysfs.c:2227) device_release+0xd2/0x250 (?:?) kobject_put+0x181/0x4c0 (lib/kobject.c:730) netdev_run_todo+0x700/0x1000 (net/core/dev.c:11666) rtnl_dellink+0x396/0xc00 (net/core/rtnetlink.c:3558) rtnetlink_rcv_msg+0x740/0xc20 (net/core/rtnetlink.c:6897) netlink_rcv_skb+0x147/0x3a0 (?:?) rtnetlink_rcv+0x19/0x20 (net/core/rtnetlink.c:7021) netlink_unicast+0x4d1/0x830 (net/netlink/af_netlink.c:1327) netlink_sendmsg+0x840/0xe10 (net/netlink/af_netlink.c:1812) ____sys_sendmsg+0x8a7/0xb50 (?:?) ___sys_sendmsg+0x104/0x190 (?:?) __sys_sendmsg+0x135/0x1d0 (?:?) __x64_sys_sendmsg+0x7b/0xc0 (?:?) x64_sys_call+0x205c/0x2130 (?:?) do_syscall_64+0x115/0x6a0 (arch/x86/entry/syscall_64.c:87) entry_SYSCALL_64_after_hwframe+0x77/0x7f (?:?) Fixes: `5de4a473bd` ("netpoll queue cleanup") Signed-off-by: Zhang Cen <rollkingzzc@gmail.com> Link: https://patch.msgid.link/20260519104647.3517990-1-rollkingzzc@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-21 08:10:18 -07:00
Yuho Choi	1341db3224	ipv6: route: Unregister netdevice notifier on BPF init failure ip6_route_init() registers ip6_route_dev_notifier before registering the IPv6 route BPF iterator target. If bpf_iter_register() fails after the notifier has been registered, the error path currently jumps to out_register_late_subsys and unwinds the RTNL handlers and pernet route state without removing the notifier from the netdevice notifier chain. This leaves ip6_route_dev_notify() callable after the IPv6 route state it uses has been torn down. Add a separate unwind label for the BPF iterator failure path and unregister the netdevice notifier before continuing with the existing cleanup. Fixes: `138d0be35b` ("net: bpf: Add netlink and ipv6_route bpf_iter targets") Signed-off-by: Yuho Choi <dbgh9129@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260520030329.1061183-1-dbgh9129@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-21 07:43:15 -07:00
Zijing Yin	dbc81608e3	phonet/pep: disable BH around forwarded sk_receive_skb() The networking receive path is usually run from softirq context, but protocols that take the socket lock may have packets stored in the backlog and processed later from process context. In that case release_sock() -> __release_sock() drops the slock with spin_unlock_bh() and then calls sk->sk_backlog_rcv() with bottom halves enabled. Typical sk_backlog_rcv handlers process the socket whose backlog is being drained, so the BH state at entry is irrelevant for the slocks they touch. pep_do_rcv() is different: when the inbound skb targets an existing PEP pipe, it forwards the skb to a different child socket via sk_receive_skb(). That helper takes the child slock with bh_lock_sock_nested(), which is just spin_lock_nested() and assumes BH is already off. The same child slock therefore ends up acquired with BH on (process path) and with BH off (softirq path): process context softirq context --------------- --------------- release_sock(listener) __netif_receive_skb() __release_sock() phonet_rcv() spin_unlock_bh() __sk_receive_skb(listener) [BH now ENABLED] [BH already disabled] sk_backlog_rcv: sk_backlog_rcv: pep_do_rcv() pep_do_rcv() sk_receive_skb(child) sk_receive_skb(child) bh_lock_sock_nested(child) bh_lock_sock_nested(child) => SOFTIRQ-ON-W => IN-SOFTIRQ-W Lockdep flags this as inconsistent lock state, and it can become a real self-deadlock if a softirq on the same CPU tries to receive to the same child socket while its slock is held in the BH-enabled path: WARNING: inconsistent lock state inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. 
(slock-AF_PHONET/1){+.?.}-{3:3}, at: __sk_receive_skb+0x1cf/0x900 __sk_receive_skb net/core/sock.c:563 sk_receive_skb include/net/sock.h:2022 [inline] pep_do_rcv net/phonet/pep.c:675 sk_backlog_rcv include/net/sock.h:1190 __release_sock net/core/sock.c:3216 release_sock net/core/sock.c:3815 pep_sock_accept net/phonet/pep.c:879 Wrap the forwarded sk_receive_skb() in local_bh_disable() / local_bh_enable() so the child slock is always acquired with BH off. local_bh_disable() nests safely on the softirq path. Discovered via in-house syzkaller fuzzing; the same root cause also on the linux-6.1.y syzbot dashboard as extid 44f0626dd6284f02663c. Reproduced under KASAN + LOCKDEP + PROVE_LOCKING, reproducer: https://pastebin.com/A3t8xzCR Fixes: `9641458d3e` ("Phonet: Pipe End Point for Phonet Pipes protocol") Link: https://syzkaller.appspot.com/bug?extid=44f0626dd6284f02663c Cc: stable@vger.kernel.org Signed-off-by: Zijing Yin <yzjaurora@gmail.com> Acked-by: Rémi Denis-Courmont <remi@remlab.net> Reported-by: syzbot+9f4a135646b66c509935@syzkaller.appspotmail.com Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260519172635.86304-1-yzjaurora@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-21 07:38:21 -07:00
Paolo Abeni	42734af663	Merge tag 'batadv-net-pullrequest-20260520' of https://git.open-mesh.org/batadv Simon Wunderlich says: ==================== Here are batman-adv bugfixes, all by Sven Eckelmann. - fix batadv_skb_is_frag() kernel-doc - BATMAN V: stop OGMv2 on disabled interface - BATMAN IV: abort OGM send on tvlv append failure - BATMAN IV: reject oversized TVLV packets - tp_meter: fix race condition in send error reporting - tp_meter: avoid role confusion in tp_list - mcast: fix use-after-free in orig_node RCU release - BATMAN IV: recover OGM scheduling after forward packet error - bla: fix report_work leak on backbone_gw purge - bla: avoid double decrement of bla.num_requests - bla: avoid NULL-ptr deref for claim via dropped interface * tag 'batadv-net-pullrequest-20260520' of https://git.open-mesh.org/batadv: batman-adv: bla: avoid NULL-ptr deref for claim via dropped interface batman-adv: bla: avoid double decrement of bla.num_requests batman-adv: bla: fix report_work leak on backbone_gw purge batman-adv: iv: recover OGM scheduling after forward packet error batman-adv: mcast: fix use-after-free in orig_node RCU release batman-adv: tp_meter: avoid role confusion in tp_list batman-adv: tp_meter: fix race condition in send error reporting batman-adv: tvlv: reject oversized TVLV packets batman-adv: tvlv: abort OGM send on tvlv append failure batman-adv: v: stop OGMv2 on disabled interface batman-adv: fix batadv_skb_is_frag() kernel-doc ==================== Link: https://patch.msgid.link/20260520115422.53552-1-sw@simonwunderlich.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-21 15:59:11 +02:00
Stefano Garzarella	c6087c5aaa	vsock/virtio: fix skb overhead accounting to preserve full buf_alloc After commit `059b7dbd20` ("vsock/virtio: fix potential unbounded skb queue"), virtio_transport_inc_rx_pkt() subtracts per-skb overhead from buf_alloc when checking whether a new packet fits. This reduces the effective receive buffer below what the user configured via SO_VM_SOCKETS_BUFFER_SIZE, causing legitimate data packets to be silently dropped and applications that rely on the full buffer size to deadlock. Also, the reduced space is not communicated to the remote peer, so its credit calculation accounts more credit than the receiver will actually accept, causing data loss (there is no retransmission). With this approach we currently have failures in tools/testing/vsock/vsock_test.c. Test 18 sometimes fails, while test 22 always fails in this way: 18 - SOCK_STREAM MSG_ZEROCOPY...hash mismatch 22 - SOCK_STREAM virtio credit update + SO_RCVLOWAT...send failed: Resource temporarily unavailable Fix by allowing at most `buf_alloc * 2` as the total budget for payload plus skb overhead in virtio_transport_inc_rx_pkt(), similar to how SO_RCVBUF is doubled to reserve space for sk_buff metadata. This preserves the full buf_alloc for payload under normal operation, while still bounding the skb queue growth. With this patch, all tests in tools/testing/vsock/vsock_test.c are now passing again. Fixes: `059b7dbd20` ("vsock/virtio: fix potential unbounded skb queue") Cc: stable@vger.kernel.org Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260518090656.134588-3-sgarzare@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-21 13:14:01 +02:00
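The accounting change in the vsock/virtio commit above can be sketched numerically. The model below rests on assumptions: the per-skb overhead value is made up, the function names are hypothetical, and the post-fix rule is read from the commit text as "payload bounded by `buf_alloc`, payload plus overhead bounded by `buf_alloc * 2`". The real check in `virtio_transport_inc_rx_pkt()` may be structured differently.

```python
def fits_before(buf_alloc, q_payload, q_overhead, pkt_len, pkt_overhead):
    # Overhead is charged against buf_alloc, shrinking the usable buffer.
    return q_payload + q_overhead + pkt_len + pkt_overhead <= buf_alloc


def fits_after(buf_alloc, q_payload, q_overhead, pkt_len, pkt_overhead):
    # Assumed reading of the fix: full buf_alloc for payload, and a
    # doubled budget for payload plus skb overhead.
    payload_ok = q_payload + pkt_len <= buf_alloc
    total_ok = q_payload + q_overhead + pkt_len + pkt_overhead <= 2 * buf_alloc
    return payload_ok and total_ok


def count_accepted(fits, buf_alloc, pkt_len, pkt_overhead, n_packets):
    """Queue n_packets equal-sized packets; return how many were accepted."""
    q_payload = q_overhead = accepted = 0
    for _ in range(n_packets):
        if fits(buf_alloc, q_payload, q_overhead, pkt_len, pkt_overhead):
            q_payload += pkt_len
            q_overhead += pkt_overhead
            accepted += 1
    return accepted
```

With `buf_alloc=4096`, 1024-byte payloads and a hypothetical 256-byte overhead, a sender that uses its full advertised credit gets only three of four packets queued under the old rule and all four under the new one. A fifth packet is rejected either way.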
Stefano Garzarella	a4f0b00178	vsock/virtio: reset connection on receiving queue overflow When there is no more space to queue an incoming packet, the packet is silently dropped. This causes data loss without any notification to either peer, since there is no retransmission. Under normal circumstances, this should never happen. However, it could happen if the other peer doesn't respect the credit, or if the skb overhead, which we recently began to take into account with commit `059b7dbd20` ("vsock/virtio: fix potential unbounded skb queue"), is too high. Fix this by resetting the connection and setting the local socket error to ENOBUFS when virtio_transport_recv_enqueue() can no longer queue a packet, so both peers are explicitly notified of the failure rather than silently losing data. Fixes: `ae6fcfbf5f` ("vsock/virtio: discard packets if credit is not respected") Cc: stable@vger.kernel.org Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260518090656.134588-2-sgarzare@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-21 13:14:01 +02:00
Hyunwoo Kim	48f6a5356a	net: skbuff: propagate shared-frag marker through frag-transfer helpers Two frag-transfer helpers (__pskb_copy_fclone() and skb_shift()) fail to propagate the SKBFL_SHARED_FRAG bit in skb_shinfo()->flags when moving frags from source to destination. __pskb_copy_fclone() defers the rest of the shinfo metadata to skb_copy_header() after copying frag descriptors, but that helper only carries over gso_{size,segs, type} and never touches skb_shinfo()->flags; skb_shift() moves frag descriptors directly and leaves flags untouched. As a result, the destination skb keeps a reference to the same externally-owned or page-cache-backed pages while reporting skb_has_shared_frag() as false. The mismatch is harmful in any in-place writer that uses skb_has_shared_frag() to decide whether shared pages must be detoured through skb_cow_data(). ESP input is one such writer (esp4.c, esp6.c), and a single nft 'dup to <local>' rule -- or any other nf_dup_ipv4() / xt_TEE caller -- is enough to land a pskb_copy()'d skb in esp_input() with the marker stripped, letting an unprivileged user write into the page cache of a root-owned read-only file via authencesn-ESN stray writes. Set SKBFL_SHARED_FRAG on the destination whenever frag descriptors were actually moved from the source. skb_copy() and skb_copy_expand() share skb_copy_header() too but linearize all paged data into freshly allocated head storage and emerge with nr_frags == 0, so skb_has_shared_frag() returns false on its own; they need no change. The same omission exists in skb_gro_receive() and skb_gro_receive_list(). The former moves the incoming skb's frag descriptors into the accumulator's last sub-skb via two paths (a direct frag-move loop and the head_frag + memcpy path); the latter chains the incoming skb whole onto p's frag_list. Downstream skb_segment() reads only skb_shinfo(p)->flags, and skb_segment_list() reuses each sub-skb's shinfo as the nskb -- both p and lp must carry the marker. 
The same omission also exists in tcp_clone_payload(), which builds an MTU probe skb by moving frag descriptors from skbs on sk_write_queue into a freshly allocated nskb. The helper falls into the same family and warrants the same fix for consistency; no TCP TX-side in-place writer is currently known to reach a user page through this gap, but a future consumer depending on the marker would regress silently. The same omission exists in skb_segment(): the per-iteration flag merge takes only head_skb's flag, and the inner switch that rebinds frag_skb to list_skb on head_skb-frags exhaustion does not fold the new frag_skb's flag into nskb. Fold frag_skb's flag at both sites so segments drawing frags from frag_list members carry the marker. Fixes: `cef401de7b` ("net: fix possible wrong checksum generation") Fixes: `f4c50a4034` ("xfrm: esp: avoid in-place decrypt on shared skb frags") Suggested-by: Sabrina Dubroca <sd@queasysnail.net> Suggested-by: Sultan Alsawaf <sultan@kerneltoast.com> Suggested-by: Ben Hutchings <ben@decadent.org.uk> Suggested-by: Lin Ma <malin89@huawei.com> Suggested-by: Jingguo Tan <tanjingguo@huawei.com> Suggested-by: Aaron Esau <aaron1esau@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Tested-by: Rajat Gupta <rajat.gupta@oss.qualcomm.com> Link: https://patch.msgid.link/ageeJfJHwgzmKXbh@v4bel Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-21 11:31:05 +02:00
Eric Dumazet	1bbf0ced1d	tcp: fix stale per-CPU tcp_tw_isn leak enabling ISN prediction Blamed commit moved the TIME_WAIT-derived ISN from the skb control block to a per-CPU variable, assuming the value would always be consumed by tcp_conn_request() for the same packet that wrote it. That assumption is violated by multiple drop paths between the producer (__this_cpu_write(tcp_tw_isn, isn) in tcp_v{4,6}_rcv()) and the consumer (tcp_conn_request()): - min_ttl / min_hopcount check - xfrm policy check - tcp_inbound_hash() MD5/AO mismatch - tcp_filter() eBPF/SO_ATTACH_FILTER drop - th->syn && th->fin discard in tcp_rcv_state_process() TCP_LISTEN - psp_sk_rx_policy_check() in tcp_v{4,6}_do_rcv() - tcp_checksum_complete() in tcp_v{4,6}_do_rcv() - tcp_v{4,6}_cookie_check() returning NULL When a packet is dropped on any of these paths, tcp_tw_isn is left set. The next SYN processed on the same CPU then consumes the non zero value in tcp_conn_request(), receiving a potentially predictable ISN. This patch moves back tcp_tw_isn to skb->cb[], getting rid of the per-cpu variable. Note that tcp_v{4,6}_fill_cb() do not set it. Very little impact on overall code size/complexity: $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 2/1 up/down: 8/-15 (-7) Function old new delta tcp_v6_rcv 3038 3042 +4 tcp_v4_rcv 3035 3039 +4 tcp_conn_request 2938 2923 -15 Total: Before=24436060, After=24436053, chg -0.00% Fixes: `41eecbd712` ("tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260519084611.2485277-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-20 19:14:06 -07:00
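The leak described in the tcp_tw_isn commit above comes from state that outlives the packet that wrote it. The following sketch contrasts the two storage choices; `PerCpuModel` and `PerSkbModel` are hypothetical names, and a return value of 0 from `conn_request()` stands for "no TIME_WAIT-derived ISN, generate a fresh one".

```python
class PerCpuModel:
    """ISN hint stored in one per-CPU slot shared by consecutive packets."""

    def __init__(self):
        self.tcp_tw_isn = 0

    def rcv(self, isn, dropped):
        if isn:
            self.tcp_tw_isn = isn       # producer side
        if dropped:
            return None                 # drop path: slot is left set
        return self.conn_request()

    def conn_request(self):
        isn, self.tcp_tw_isn = self.tcp_tw_isn, 0
        return isn


class PerSkbModel:
    """ISN hint carried in the packet's own control block."""

    def rcv(self, isn, dropped):
        skb_cb = {"tcp_tw_isn": isn}    # lives and dies with this packet
        if dropped:
            return None
        return skb_cb["tcp_tw_isn"]
```

In the per-CPU model, a dropped packet carrying a hint leaves the value behind for the next SYN on that CPU. In the per-skb model the hint disappears together with the dropped packet.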
Minh Nguyen	99e22ddf4e	vsock/vmci: fix UAF when peer resets connection during handshake vmci_transport_recv_connecting_server() returned err = 0 for a peer RST in its default switch arm: err = pkt->type == VMCI_TRANSPORT_PACKET_TYPE_RST ? 0 : -EINVAL; That made vmci_transport_recv_listen() skip vsock_remove_pending(), leaving the pending socket on the listener's pending_links with sk_state = TCP_CLOSE while destroy: still dropped the explicit reference taken before schedule_delayed_work(). One second later vsock_pending_work() observed is_pending=true and performed full cleanup: vsock_remove_pending() then the two trailing sock_put(sk) calls -- the first reached refcount 0 and __sk_freed the socket, and the second wrote into the freed object: BUG: KASAN: slab-use-after-free in refcount_warn_saturate Write of size 4 at addr ffff88800b1cac80 by task kworker Workqueue: events vsock_pending_work Treat peer RST like any other unexpected packet type (err = -EINVAL). All destroy: arms now return err < 0, so vmci_transport_recv_listen() removes pending from pending_links synchronously and vsock_pending_work() takes the is_pending=false / !rejected branch, dropping only its own work reference. This also closes the multi-packet race Sashiko reported on v2: pending is removed from the list before any subsequent packet can find it. The pre-existing sk_acceptq_removed() gap on the err < 0 path of vmci_transport_recv_listen() that Sashiko also noted is not introduced or changed by this patch. Tested on lts-6.12.79 with KASAN: 52/100 unpatched -> 0/100 patched. Fixes: `d021c34405` ("VSOCK: Introduce VM Sockets") Cc: stable@vger.kernel.org Signed-off-by: Minh Nguyen <minhnguyen.080505@gmail.com> Acked-by: Bryan Tan <bryan-bt.tan@broadcom.com> Link: https://patch.msgid.link/20260519102310.237181-1-minhnguyen.080505@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-20 19:11:18 -07:00
Eric Dumazet	e4bdef4d32	ipv4: use WARN_ON_ONCE() in ip_rt_bug() It turns out ip_rt_bug() can be called more than expected. syzbot will still panic (because of panic_on_warn=1), but non debug kernels will no longer die while repeating stack traces on the console. Fixes: `c378a9c019` ("ipv4: Give backtrace in ip_rt_bug().") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://patch.msgid.link/20260519193248.4018872-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-20 19:00:36 -07:00
Eric Dumazet	7eb72c1e39	ipv4: icmp: reject broadcast/multicast routes syzbot was able to trigger ip_rt_bug() in a loop, using an IPv4 packet with a crafted IPOPT_SSRR option: options: ipv4_options { options: array[ipv4_option] { union ipv4_option { ssrr: ipv4_option_route[IPOPT_SSRR] { type: const = 0x89 (1 bytes) length: len = 0x7 (1 bytes) pointer: int8 = 0xa2 (1 bytes) data: array[ipv4_addr] { union ipv4_addr { broadcast: const = 0xffffffff (4 bytes) } } } } Change __icmp_send() to not send ICMP to broadcast/multicast destinations. Fixes: `c378a9c019` ("ipv4: Give backtrace in ip_rt_bug().") Reported-by: syzbot+c13a57c2639c2c0d03a6@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6a0cc169.170a0220.1f6c2d.0004.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260519200836.4141061-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-20 19:00:02 -07:00
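The "ipv4: icmp: reject broadcast/multicast routes" entry says only that __icmp_send() stops sending ICMP to broadcast/multicast destinations; the diff itself is not part of this log. The following is a minimal userspace sketch of that kind of guard, assuming a check on the raw destination address. The helper names are hypothetical, and the real kernel change may well test route type or flags instead, which would also cover subnet-directed broadcasts that an address-only test cannot see.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; addresses are in host byte order. */

/* 224.0.0.0/4: the top four bits of the address are 1110. */
static bool addr_is_multicast(uint32_t addr)
{
    return (addr & 0xf0000000u) == 0xe0000000u;
}

/* 255.255.255.255, the value carried in the syzbot SSRR option. */
static bool addr_is_limited_broadcast(uint32_t addr)
{
    return addr == 0xffffffffu;
}

/* Returns true when an ICMP error may be sent to this destination. */
static bool icmp_reply_allowed(uint32_t daddr)
{
    return !addr_is_multicast(daddr) && !addr_is_limited_broadcast(daddr);
}
```

With this guard, a packet whose source-route option points at 0xffffffff is refused before any route lookup for the reply, so a loop of the kind syzbot triggered would not start from this path.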

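The tcp_tw_isn entry in the log above describes a value written to a per-CPU slot by one packet and consumed by a later, unrelated packet when the first one is dropped in between. The sketch below models that in plain userspace C. All names (struct pkt, percpu_rx, cb_rx) are hypothetical stand-ins, not kernel APIs, and the single global variable only approximates a per-CPU variable on one CPU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an skb with a per-packet control block. */
struct pkt {
    uint32_t tw_isn;  /* per-packet slot, playing the role of skb->cb[] */
    bool dropped;     /* true if a check between producer and consumer drops it */
};

/* Models the per-CPU tcp_tw_isn variable for a single CPU. */
static uint32_t percpu_tw_isn;

/*
 * Per-CPU scheme: the producer writes the shared slot, the consumer reads
 * and clears it. Returns the ISN the consumer saw, or 0 if the packet was
 * dropped before reaching the consumer.
 */
static uint32_t percpu_rx(const struct pkt *p, uint32_t tw_isn)
{
    if (tw_isn)
        percpu_tw_isn = tw_isn;   /* producer */
    if (p->dropped)
        return 0;                 /* slot stays set: this is the leak */

    uint32_t isn = percpu_tw_isn; /* consumer */
    percpu_tw_isn = 0;
    return isn;
}

/* Per-packet scheme: the value travels with the packet and dies with it. */
static uint32_t cb_rx(struct pkt *p, uint32_t tw_isn)
{
    p->tw_isn = tw_isn;           /* producer writes this packet's own slot */
    if (p->dropped)
        return 0;
    return p->tw_isn;             /* consumer reads this packet's own slot */
}
```

Feeding a dropped packet with a TIME_WAIT-derived ISN followed by an ordinary packet through percpu_rx() hands the stale ISN to the second packet, while the same sequence through cb_rx() yields 0 for the second packet, which is the behaviour the commit restores by moving the value back into the skb.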
84400 Commits