linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-16 05:31:37 -04:00

Author	SHA1	Message	Date
Jamal Hadi Salim	320fb29ea2	net/sched: sch_cbs: Call qdisc_reset for child qdisc During a reset, CBS is not calling reset on its child qdisc, which might cause qlen/backlog accounting issues. For example, if we have CBS with a QFQ parent and a netem child with delay, we can create a scenario where the parent's qlen underflows. QFQ, specifically, uses qlen to check whether it should deference a pointer, so this scenario may cause a null-ptr deref in QFQ: [ 43.875639][ T319] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] SMP KASAN NOPTI [ 43.876124][ T319] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f] [ 43.876417][ T319] CPU: 10 UID: 0 PID: 319 Comm: ping Not tainted 7.0.0-13039-ge728258debd5 #773 PREEMPT(full) [ 43.876751][ T319] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 43.876949][ T319] RIP: 0010:qfq_dequeue+0x35c/0x1650 [ 43.877123][ T319] Code: 00 fc ff df 80 3c 02 00 0f 85 17 0e 00 00 4c 8d 73 48 48 89 9d b8 02 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 76 0c 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b [ 43.877648][ T319] RSP: 0018:ffff8881017ef4f0 EFLAGS: 00010216 [ 43.877845][ T319] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000 [ 43.878073][ T319] RDX: 0000000000000009 RSI: 0000000c40000000 RDI: ffff88810eef02b0 [ 43.878306][ T319] RBP: ffff88810eef0000 R08: ffff88810eef0280 R09: 1ffff1102120fd63 [ 43.878523][ T319] R10: 1ffff1102120fd66 R11: 1ffff1102120fd67 R12: 0000000c40000000 [ 43.878742][ T319] R13: ffff88810eef02b8 R14: 0000000000000048 R15: 0000000020000000 [ 43.878959][ T319] FS: 00007f9c51c47c40(0000) GS:ffff88817a0be000(0000) knlGS:0000000000000000 [ 43.879214][ T319] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 43.879403][ T319] CR2: 000055e69a2230a8 CR3: 000000010c07a000 CR4: 0000000000750ef0 [ 43.879621][ T319] PKRU: 55555554 [ 43.879735][ T319] Call Trace: [ 43.879844][ T319] <TASK> [ 43.879924][ T319] __qdisc_run+0x169/0x1900 [ 43.880075][ T319] ? dev_qdisc_enqueue+0x8b/0x210 [ 43.880222][ T319] __dev_queue_xmit+0x2346/0x37a0 [ 43.880376][ T319] ? register_lock_class+0x3f/0x800 [ 43.880531][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.880684][ T319] ? __pfx___dev_queue_xmit+0x10/0x10 [ 43.880834][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.880977][ T319] ? __lock_acquire+0x819/0x1df0 [ 43.881124][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.881275][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.881418][ T319] ? __asan_memcpy+0x3c/0x60 [ 43.881563][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.881708][ T319] ? eth_header+0x165/0x1a0 [ 43.881853][ T319] ? lockdep_hardirqs_on_prepare+0xdb/0x1a0 [ 43.882031][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.882174][ T319] ? neigh_resolve_output+0x3cc/0x7e0 [ 43.882325][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.882471][ T319] ip_finish_output2+0x6b6/0x1e10 Fix this by calling qdisc_reset for CBS' child qdisc. Sashiko caught an issue which could result in a null ptr deref if qdisc_create_dflt() is invoked on an unitialised cbs qdisc which is exposed by this patch. We add an early return if the qdisc is null to address this. This is a similar approach used by two other fixes[1][2]. The proper fix for this specific issue elucidated by sashiko is to remove the call to qdisc_reset when qdisc_create_dflt fails. Since the dflt qdisc isn't attached anywhere yet at that point, calling the reset callback doesn't make much sense (and as stated has been a source of two other bugs). We plan on submitting this fix in a later patch. [1] https://lore.kernel.org/netdev/20221018063201.306474-2-shaozhengchao@huawei.com/ [2] https://lore.kernel.org/netdev/20221018063201.306474-4-shaozhengchao@huawei.com/ Fixes: `585d763af0` ("net/sched: Introduce Credit Based Shaper (CBS) qdisc") Reported-by: Junyoung Jang <graypanda.inzag@gmail.com> Tested-by: Junyoung Jang <graypanda.inzag@gmail.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-13 17:53:39 -07:00
Chenguang Zhao	3d042592eb	ethtool: fix ethnl_bitmap32_not_zero() bit interval semantics ethnl_bitmap32_not_zero() should return true if some bit in [start, end) is set: - Fix inverted memchr_inv() sense: return true when the scan finds a non-zero byte, not when the middle words are all zero. - Return false for an empty interval (end <= start). - When end is 32-bit aligned, indices in [start, end) do not include any bits from map[end_word]; return false after earlier checks found no non-zero data. Fixes: `10b518d4e6` ("ethtool: netlink bitset handling") Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-12 18:45:13 -07:00
Xiang Mei	7bf563badd	net/smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint The smc_msg_event tracepoint class, shared by smc_tx_sendmsg and smc_rx_recvmsg, unconditionally dereferences smc->conn.lnk: __string(name, smc->conn.lnk->ibname) conn->lnk is only set for SMC-R; for SMC-D it is NULL. Other code on these paths already handles this (e.g. !conn->lnk in SMC_STAT_RMB_TX_SIZE_SMALL()). With the tracepoint enabled, the first sendmsg()/recvmsg() on an SMC-D socket crashes: Oops: general protection fault, probably for non-canonical address KASAN: null-ptr-deref in range [...] RIP: 0010:strlen+0x1e/0xa0 Call Trace: trace_event_raw_event_smc_msg_event (net/smc/smc_tracepoint.h:44) smc_rx_recvmsg (net/smc/smc_rx.c:515) smc_recvmsg (net/smc/af_smc.c:2859) __sys_recvfrom (net/socket.c:2315) __x64_sys_recvfrom (net/socket.c:2326) do_syscall_64 The faulting address 0x3e0 is offsetof(struct smc_link, ibname), confirming the NULL ->lnk deref. Enabling the tracepoint requires root, but the trigger itself is unprivileged: socket(AF_SMC, ...) has no capability check, and SMC-D negotiation needs no admin step on s390 or on x86 with the loopback ISM device loaded. Log an empty device name for SMC-D instead of dereferencing NULL. Fixes: `aff3083f10` ("net/smc: Introduce tracepoints for tx and rx msg") Reported-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Xiang Mei <xmei5@asu.edu> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-12 18:44:36 -07:00
Nicolò Coccia	a3fdd924d8	net/smc: fix sleep-inside-lock in __smc_setsockopt() causing local DoS A logic flaw in __smc_setsockopt() allows a local unprivileged user to cause a Denial of Service (DoS) by holding the socket lock indefinitely. The function __smc_setsockopt() calls copy_from_sockptr() while holding lock_sock(sk). By passing a userfaultfd-monitored memory page (or FUSE-backed memory on systems where unprivileged userfaultfd is disabled) as the optval, an attacker can halt execution during the copy operation, keeping the lock held. Combined with asynchronous tear-down operations like shutdown(), this exhausts the kernel wq (kworkers) and triggers the hung task watchdog. [ 240.123456] INFO: task kworker/u8:2 blocked for more than 120 seconds. [ 240.123489] Call Trace: [ 240.123501] smc_shutdown+... [ 240.123512] lock_sock_nested+... This patch moves the user-space copy outside the lock_sock() critical section to prevent the issue. Fixes: `a6a6fe27ba` ("net/smc: Dynamic control handshake limitation by socket options") Signed-off-by: Nicolò Coccia <n.coccia96@gmail.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Tested-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-12 18:43:40 -07:00
Wei Yang	f9e2342046	net: atm: fix skb leak in sigd_send() default branch The default branch in sigd_send() calls sock_put() and returns -EINVAL without freeing the skb, while all other exit paths do so. Add the missing dev_kfree_skb() before sock_put() to fix the leak. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Wei Yang <albinwyang@tencent.com> Link: https://patch.msgid.link/20260509122358.1102997-1-albin_yang@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-12 18:07:02 -07:00
David Carlier	e3adf69f8e	net: ethtool: phy: avoid NULL deref when PHY driver is unbound phydev->drv can become NULL while the phy_device is still attached to its net_device, namely after the PHY driver is unbound via sysfs: echo <mdio_id> > /sys/bus/mdio_bus/drivers/<phy_drv>/unbind phy_remove() clears phydev->drv but doesn't call phy_detach(), so the phy_device stays in the link topology xarray and ethnl_req_get_phydev() still hands it back. ETHTOOL_MSG_PHY_GET then oopses on: rep_data->drvname = kstrdup(phydev->drv->name, GFP_KERNEL); drvname is already treated as optional by phy_reply_size(), phy_fill_reply() and phy_cleanup_data(), so just skip the allocation when there is no driver bound. Fixes: `9dd2ad5e92` ("net: ethtool: phy: Convert the PHY_GET command to generic phy dump") Cc: stable@vger.kernel.org # 6.13.x Signed-off-by: David Carlier <devnexen@gmail.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260509215046.107157-1-devnexen@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-12 17:45:12 -07:00
Jakub Kicinski	ce372e869f	net: shaper: reject QUEUE scope handle with missing id net_shaper_parse_handle() does not enforce that the user provides the handle ID. For NODE the ID defaults to UNSPEC for all other cases it defaults to 0. For NETDEV 0 is the only option. For QUEUE defaulting to 0 makes less intuitive sense. Specifically because the behavior should (IMHO) be the same for all cases where there may be more than one ID (QUEUE and NODE). We should either document this as intentional or reject. I picked the latter with no strong conviction. Fixes: `4b623f9f0f` ("net-shapers: implement NL get operation") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-11-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-12 16:15:00 +02:00
Jakub Kicinski	b62b29e6de	net: shaper: enforce singleton NETDEV scope with id 0 The NETDEV scope represents a singleton root shaper in the per-device hierarchy. All code assumes NETDEV shapers have id 0: net_shaper_default_parent() hardcodes parent->id = 0 when returning the NETDEV parent for QUEUE/NODE children, and the UAPI documentation describes NETDEV scope as "the main shaper" (singular, not plural). Make sure we reject non-0 IDs. Fixes: `4b623f9f0f` ("net-shapers: implement NL get operation") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-10-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-12 16:15:00 +02:00
Jakub Kicinski	8d5806c600	net: shaper: reject handle IDs exceeding internal bit-width net_shaper_parse_handle() reads the user-supplied handle ID via nla_get_u32(), accepting the full u32 range. However, the xarray key is built by net_shaper_handle_to_index() using FIELD_PREP(NET_SHAPER_ID_MASK, handle->id), where NET_SHAPER_ID_MASK is GENMASK(25, 0) - only 26 bits wide. FIELD_PREP silently masks off the upper bits at runtime. A user-supplied NODE id like 0x04000123 becomes id 0x123. Additionally, a user-supplied id equal to NET_SHAPER_ID_UNSPEC (0x03FFFFFF, which is NET_SHAPER_ID_MASK itself) would collide with the sentinel used internally by the group operation to signal "allocate a new NODE id". Reject user-supplied IDs >= NET_SHAPER_ID_MASK (i.e., >= 0x03FFFFFF) in the policy. Fixes: `4b623f9f0f` ("net-shapers: implement NL get operation") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-9-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-12 16:15:00 +02:00
Jakub Kicinski	0f9a857e34	net: shaper: fix undersized reply skb allocation in GROUP command net_shaper_group_send_reply() writes both the NET_SHAPER_A_IFINDEX attribute (via net_shaper_fill_binding()) and the nested NET_SHAPER_A_HANDLE attribute (via net_shaper_fill_handle()), but the reply skb at the call site in net_shaper_nl_group_doit() is allocated using net_shaper_handle_size(), which only accounts for the nested handle. The allocation is therefore short by nla_total_size(sizeof(u32)) (8 bytes) for the IFINDEX attribute. In practice the slab allocator rounds up the small allocation so the bug is latent, but the size accounting is wrong and could bite if the reply grew further. Introduce net_shaper_group_reply_size() that accounts for the full reply payload and use it both at the genlmsg_new() call site and in the defensive WARN_ONCE message. Fixes: `5d5d4700e7` ("net-shapers: implement NL group operation") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-7-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-12 16:14:59 +02:00
Jakub Kicinski	8054f85b83	net: shaper: set ret to -ENOMEM when genlmsg_new() fails in group_doit genlmsg_new() alloc failure path in net_shaper_nl_group_doit() forgets to set ret before jumping to error handling. Fixes: `5d5d4700e7` ("net-shapers: implement NL group operation") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-6-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-12 16:14:59 +02:00
Jakub Kicinski	a9a2fa1da6	net: shaper: reject duplicate leaves in GROUP request net_shaper_nl_group_doit() does not deduplicate NET_SHAPER_A_LEAVES entries. When userspace supplies the same leaf handle twice, the same old-parent pointer lands twice in old_nodes[]. The cleanup loop double frees the parent. Of course the same parent may still be in old_nodes[] twice if we are moving multiple of its leaves. Note that this patch also implicitly fixes the fact that the i >= leaves_count path forgets to set ret. Fixes: `5d5d4700e7` ("net-shapers: implement NL group operation") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-4-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-12 16:14:59 +02:00
Jakub Kicinski	235fb53761	net: shaper: fix trivial ordering issue in net_shaper_commit() We should update the entry before we mark it as valid. Fixes: `93954b40f6` ("net-shapers: implement NL set and delete operations") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-3-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-12 16:14:59 +02:00
Jakub Kicinski	7cee43fcb0	net: shaper: flip the polarity of the valid flag The usual way of inserting entries which are not yet fully ready into XArray is to have a VALID flag. The shaper code has a NOT_VALID flag. Since XArray code does not let us create entries with marks already set - the creation of entries is currently not atomic. Flip the polarity of the VALID flag. This closes the tiny race in net_shaper_pre_insert() of entries being created without the NOT_VALID flag. Fixes: `93954b40f6` ("net-shapers: implement NL set and delete operations") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-12 16:14:59 +02:00
Stefano Garzarella	3a3e3d90cb	vsock/virtio: fix empty payload in tap skb for non-linear buffers For non-linear skbs, virtio_transport_build_skb() goes through virtio_transport_copy_nonlinear_skb() to copy the original payload in the new skb to be delivered to the vsockmon tap device. This manually initializes an iov_iter but does not set iov_iter.count. Since the iov_iter is zero-initialized, the copy length is zero and no payload is actually copied to the monitor interface, leaving data un-initialized. Fix this by removing the linear vs non-linear split and using skb_copy_datagram_iter() with iov_iter_kvec() for all cases, as vhost-vsock already does. This handles both linear and non-linear skbs, properly initializes the iov_iter, and removes the now unused virtio_transport_copy_nonlinear_skb(). While touching this code, let's also check the return value of skb_copy_datagram_iter(), even though it's unlikely to fail. Fixes: `4b0bf10eb0` ("vsock/virtio: non-linear skb handling for tap") Reported-by: Yiqi Sun <sunyiqixm@gmail.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Arseniy Krasnov <avkrasnov@rulkc.org> Link: https://patch.msgid.link/20260508164411.261440-3-sgarzare@redhat.com Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-12 12:52:15 +02:00
Stefano Garzarella	5f344d809e	vsock/virtio: fix length and offset in tap skb for split packets virtio_transport_build_skb() builds a new skb to be delivered to the vsockmon tap device. To build the new skb, it uses the original skb data length as payload length, but as the comment notes, the original packet stored in the skb may have been split in multiple packets, so we need to use the length in the header, which is correctly updated before the packet is delivered to the tap, and the offset for the data. This was also similar to what we did before commit `71dc9ec9ac` ("virtio/vsock: replace virtio_vsock_pkt with sk_buff") where we probably missed something during the skb conversion. Also update the comment above, which was left stale by the skb conversion and still mentioned a buffer pointer that no longer exists. Fixes: `71dc9ec9ac` ("virtio/vsock: replace virtio_vsock_pkt with sk_buff") Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Arseniy Krasnov <avkrasnov@rulkc.org> Link: https://patch.msgid.link/20260508164411.261440-2-sgarzare@redhat.com Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-12 12:52:15 +02:00
Quan Sun	911f54771c	net: hsr: fix NULL pointer dereference in hsr_get_node_data() In the HSR (High-availability Seamless Redundancy) protocol, node information is maintained in the node_db. When a supervision frame is received, node->addr_B_port is updated to track the receiving port type (e.g., HSR_PT_SLAVE_B). If the underlying physical interface associated with this slave port is removed (e.g., via `ip link del`), hsr_del_port() frees the hsr_port object. However, the stale node->addr_B_port reference is kept in the node_db until the node ages out. Subsequently, if userspace queries the node status via the Netlink command HSR_C_GET_NODE_STATUS, the kernel calls hsr_get_node_data(). This function unconditionally dereferences the pointer returned by hsr_port_get_hsr(): if (node->addr_B_port != HSR_PT_NONE) { port = hsr_port_get_hsr(hsr, node->addr_B_port); *addr_b_ifindex = port->dev->ifindex; // <-- NULL deref } If the slave port has been deleted, hsr_port_get_hsr() returns NULL, resulting in a kernel panic. Oops: general protection fault, probably for non-canonical address KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] RIP: 0010:hsr_get_node_data+0x7b6/0x9e0 Call Trace: <TASK> hsr_get_node_status+0x445/0xa40 Fix this by adding a proper NULL pointer check. If the port lookup fails due to a stale port type, gracefully treat it as if no valid port exists and assign -1 to the interface index. Steps to reproduce: 1. Create an HSR interface with two slave devices. 2. Receive a supervision frame to populate node_db with addr_B_port assigned to SLAVE_B. 3. Delete the underlying slave device B. 4. Send an HSR_C_GET_NODE_STATUS Netlink message. Fixes: `c5a7591172` ("net/hsr: Use list_head (and rcu) instead of array for slave devices.") Signed-off-by: Quan Sun <2022090917019@std.uestc.edu.cn> Link: https://patch.msgid.link/20260508124636.1462346-1-2022090917019@std.uestc.edu.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-12 12:28:34 +02:00
Davide Caratti	f8e64e956a	net/sched: dualpi2: initialize timer earlier in dualpi2_init() 'pi2_timer' needs to be initialized in all error paths of dualpi2_init(): otherwise, a failure in qdisc_create_dflt() causes the following crash in dualpi2_destroy(): # tc qdisc add dev crash0 handle 1: root dualpi2 BUG: kernel NULL pointer dereference, address: 0000000000000010 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP PTI CPU: 4 UID: 0 PID: 471 Comm: tc Tainted: G E 7.1.0-rc1-virtme #2 PREEMPT(full) Tainted: [E]=UNSIGNED_MODULE Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 RIP: 0010:hrtimer_active+0x39/0x60 Code: f9 eb 23 0f b6 41 38 3c 01 0f 87 87 64 c0 ff 83 e0 01 75 33 48 39 4a 50 74 28 44 3b 42 10 75 06 48 3b 51 30 74 21 48 8b 51 30 <44> 8b 42 10 41 f6 c0 01 74 cf f3 90 44 8b 42 10 41 f6 c0 01 74 c3 RSP: 0018:ffffd0db80b93620 EFLAGS: 00010282 RAX: ffffffffc0400320 RBX: ffff8cf24a4c86b8 RCX: ffff8cf24a4c86b8 RDX: 0000000000000000 RSI: ffff8cf2429c2ab0 RDI: ffff8cf24a4c86b8 RBP: 00000000fffffff4 R08: 0000000000000003 R09: 0000000000000000 R10: 0000000000000001 R11: ffff8cf24a39c500 R12: ffff8cf24822c000 R13: ffffd0db80b936c0 R14: ffffffffc02cf360 R15: 00000000ffffffff FS: 00007fbc01706580(0000) GS:ffff8cf2dc759000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000010 CR3: 0000000008e02003 CR4: 0000000000172ef0 Call Trace: <TASK> hrtimer_cancel+0x15/0x40 dualpi2_destroy+0x20/0x40 [sch_dualpi2] qdisc_create+0x230/0x570 tc_modify_qdisc+0x716/0xc10 rtnetlink_rcv_msg+0x188/0x780 netlink_rcv_skb+0xcd/0x150 netlink_unicast+0x1ba/0x290 netlink_sendmsg+0x242/0x4d0 ____sys_sendmsg+0x39e/0x3e0 ___sys_sendmsg+0xe1/0x130 __sys_sendmsg+0xad/0x110 do_syscall_64+0x14f/0xf80 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fbc0188b08e Code: 4d 89 d8 e8 94 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 03 ff ff ff 0f 1f 00 f3 0f 1e fa RSP: 002b:00007fff593260e0 EFLAGS: 00000202 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbc0188b08e RDX: 0000000000000000 RSI: 00007fff59326190 RDI: 0000000000000003 RBP: 00007fff593260f0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000202 R12: 000055f06124f260 R13: 0000000069fca043 R14: 000055f061255640 R15: 000055f06124d3f8 </TASK> Modules linked in: sch_dualpi2(E) CR2: 0000000000000010 [1] https://lore.kernel.org/netdev/2e78e01c504c633ebdff18d041833cf2e079a3a4.1607020450.git.dcaratti@redhat.com/ [2] https://lore.kernel.org/netdev/20200725201707.16909-1-xiyou.wangcong@gmail.com/ v2: - rebased on top of latest net.git Fixes: `320d031ad6` ("sched: Struct definition and parsing of dualpi2 qdisc") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/1faca91179702b31da5d87653e1e036543e32722.1778259798.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-11 18:03:16 -07:00
Kuniyuki Iwashima	03cb001ef8	tcp: Fix out-of-bounds access for twsk in tcp_ao_established_key(). lockdep_sock_is_held() was added in tcp_ao_established_key() by the cited commit. It can be called from tcp_v[46]_timewait_ack() with twsk. Since it does not have sk->sk_lock, the lockdep annotation results in out-of-bound access. $ pahole -C tcp_timewait_sock vmlinux \| grep size /* size: 288, cachelines: 5, members: 8 / $ pahole -C sock vmlinux \| grep sk_lock socket_lock_t sk_lock; / 440 192 */ Let's not use lockdep_sock_is_held() for TCP_TIME_WAIT. Fixes: `6b2d11e2d8` ("net/tcp: Add missing lockdep annotations for TCP-AO hlist traversals") Reported-by: Damiano Melotti <melotti@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260508120853.4098365-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-11 17:50:15 -07:00
Allison Henderson	e174929793	net/rds: reset op_nents when zerocopy page pin fails When iov_iter_get_pages2() fails in rds_message_zcopy_from_user(), the pinned pages are released with put_page(), and rm->data.op_mmp_znotifier is cleared. But we fail to properly clear rm->data.op_nents. Later when rds_message_purge() is called from rds_sendmsg() the cleanup loop iterates over the incorrectly non zero number of op_nents and frees them again. Fix this by properly resetting op_nents when it should be in rds_message_zcopy_from_user(). Fixes: `0cebaccef3` ("rds: zerocopy Tx support.") Signed-off-by: Allison Henderson <achender@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260505234336.2132721-1-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-11 17:20:02 -07:00
Jakub Kicinski	69754da484	Merge tag 'batadv-net-pullrequest-20260508' of https://git.open-mesh.org/batadv Simon Wunderlich says: ==================== Here are some batman-adv bugfixes: - fix integer overflow on buff_pos, by Lyes Bourennani - fix invalid tp_meter access during teardown, by Jiexun Wang (2 patches) - stop caching unowned originator pointers in BAT IV, by Jiexun Wang - tp_meter: fix tp_num leak on kmalloc failure, by Sven Eckelmann - fix BLA refcounting issues, by Sven Eckelmann (3 patches) * tag 'batadv-net-pullrequest-20260508' of https://git.open-mesh.org/batadv: batman-adv: bla: put backbone reference on failed claim hash insert batman-adv: bla: only purge non-released claims batman-adv: bla: prevent use-after-free when deleting claims batman-adv: tp_meter: fix tp_num leak on kmalloc failure batman-adv: stop caching unowned originator pointers in BAT IV batman-adv: stop tp_meter sessions during mesh teardown batman-adv: reject new tp_meter sessions during teardown batman-adv: fix integer overflow on buff_pos ==================== Link: https://patch.msgid.link/20260508154314.12817-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-10 10:07:38 -07:00
Jakub Kicinski	28d0060632	Merge tag 'nf-26-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following batch contains Netfilter fixes for net: 1) Allow initial x_tables table replacement without emitting an audit log message. Delay the register message until after hooks are wired up to avoid unnecessary unregister logs during error unwinding. 2) Fix a NULL dereference by allocating hook ops before adding the table to the per-netns list. Use `synchronize_rcu()` during error unwinding to ensure the table stops processing packets before teardown. Defer audit log register message until all operations succeed. 3) Refactor xtables to use a single `xt_unregister_table_pre_exit` function. Eliminate code duplication by centralizing table unregistration logic within the xtables core. ebtables cannot be changed due to incompatibility. 4) Unregister xtables templates before module removal. This prevents a race condition where userspace instantiates a new table after the pernet unreg removed the current table. 5) Add `xtables_unregister_table_exit` to fully unregister netfilter tables during module removal. Unlink the table from dying lists, then free hook operations. 6) Implement a two-stage removal scheme for ebtables following the x_tables pattern. Assign table->ops while holding the ebt mutex to prevent exposing partially-filled structures. 7) Fix ebtables module initialization race. Register the template last in table initialization functions. Prevent table instantiation before pernet operations are available. 8) Fix a race condition in x_tables module initialization. Ensure pernet ops are fully set up before exposing the table to userspace. 9) Fix a race condition in ebtables module initialization, similar to previous patch. 10) Restore propagation of helper to expected connection, this is a fix-for-recent-fix. 11) Validate that the expectation tuple and mask netlink attributes are present when adding expectation via nfqueue, this fixes a possible null-ptr-deref. 12) Fix possible rare memleak in the SIP helper in case helper has been detached from conntrack entry, from Li Xiasong. 13) Fix refcount leak in nft_ct when creating custom expectation, also from Li Xiason. Patches 1-9 from Florian Westphal. 10) Restore propagation of helper to expected connection, this is a fix-for-recent-fix. 11) Check that tuple and mask netlink attributes are set when creating an expectation via nfqueue. * tag 'nf-26-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_ct: fix missing expect put in obj eval netfilter: nf_conntrack_sip: get helper before allocating expectation netfilter: ctnetlink: check tuple and mask in expectations created via nfqueue netfilter: nf_conntrack_expect: restore helper propagation via expectation netfilter: bridge: eb_tables: close module init race netfilter: x_tables: close dangling table module init race netfilter: ebtables: close dangling table module init race netfilter: ebtables: move to two-stage removal scheme netfilter: x_tables: add and use xtables_unregister_table_exit netfilter: x_tables: unregister the templates first netfilter: x_tables: add and use xt_unregister_table_pre_exit netfilter: x_tables: allocate hook ops while under mutex netfilter: x_tables: allow initial table replace without emitting audit log message ==================== Link: https://patch.msgid.link/20260507234509.603182-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-08 18:28:27 -07:00
Ben Morris	abb5f36771	sctp: revalidate list cursor after sctp_sendmsg_to_asoc() in SCTP_SENDALL The SCTP_SENDALL path in sctp_sendmsg() iterates ep->asocs with list_for_each_entry_safe(), which caches the next entry in @tmp before the loop body runs. The body calls sctp_sendmsg_to_asoc(), which may drop the socket lock inside sctp_wait_for_sndbuf(). While the lock is dropped, another thread can SCTP_SOCKOPT_PEELOFF the association cached in @tmp, migrating it to a new endpoint via sctp_sock_migrate() (list_del_init() + list_add_tail() to newep->asocs), and optionally close the new socket which frees the association via kfree_rcu(). The cached @tmp can also be freed by a network ABORT for that association, processed in softirq while the lock is dropped. sctp_wait_for_sndbuf() revalidates @asoc (the current entry) on re-lock via the "sk != asoc->base.sk" and "asoc->base.dead" checks, but nothing revalidates @tmp. After a successful return, the iterator advances to the stale @tmp, yielding either a use-after-free (if the peeled socket was closed) or a list-walk onto the new endpoint's list head (type confusion of &newep->asocs as a struct sctp_association *). Both are reachable from CapEff=0; the type-confusion path gives controlled indirect call via the outqueue.sched->init_sid pointer. Fix by re-deriving @tmp from @asoc after sctp_sendmsg_to_asoc() returns. @asoc is known to still be on ep->asocs at that point: the only callers that list_del an association from ep->asocs are sctp_association_free() (which sets asoc->base.dead) and sctp_assoc_migrate() (which changes asoc->base.sk), and sctp_wait_for_sndbuf() checks both under the lock before any successful return; a tripped check propagates as err < 0 and the loop bails before the re-derive. The SCTP_ABORT path in sctp_sendmsg_check_sflags() returns 0 and the loop hits 'continue' before sctp_sendmsg_to_asoc() is ever called, so the @tmp cached by list_for_each_entry_safe() still covers the lock-held free that `ba59fb0273` ("sctp: walk the list of asoc safely") was added for. Fixes: `4910280503` ("sctp: add support for snd flag SCTP_SENDALL process in sendmsg") Cc: stable@vger.kernel.org Signed-off-by: Ben Morris <bmorris@anthropic.com> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20260508001455.3137-1-joycathacker@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-08 18:21:09 -07:00
Mohsin Bashir	a77d5a069d	net: shaper: Reject reparenting of existing nodes When an existing node-scope shaper is moved to a different parent via the group operation, the framework fails to update the leaves count on both the old and new parent shapers. Only newly created nodes (handle.id == NET_SHAPER_ID_UNSPEC) trigger the parent leaves increment at line 1039. This causes the parent's leaves counter to diverge from the actual number of children in the xarray. When the node is later deleted, pre_del_node() allocates an array sized by the stale leaves count, but the xarray iteration finds more children than expected, hitting the WARN_ON_ONCE guard and returning -EINVAL. Rather than adding reparenting support with complex leaves count bookkeeping, reject group calls that attempt to change an existing node's parent. Updates to an existing node's rate or leaves under the same parent remain permitted. We expect that for any modification of the topology user should always create new groups and let the kernel garbage collect the leaf-less nodes. Fixes: `5d5d4700e7` ("net-shapers: implement NL group operation") Signed-off-by: Mohsin Bashir <hmohsin@meta.com> Link: https://patch.msgid.link/20260506233745.111895-1-mohsin.bashr@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-08 15:45:47 -07:00
Alice Ryhl	efda25ee84	genetlink: free the skb on 'group >= family->n_mcgrps' These methods generally consume ownership of the provided skb, so even if an error path is encountered, the skb is freed. This is because the very first thing they do after some initial setup is to unconditionally consume the skb via consume_skb(skb). Any subsequent errors lead to the core netlink layer freeing the skb. However, there is one check that occurs before ownership is passed, which is the check for the group index. So if this error condition is encountered, then the skb is leaked. This error condition is generally considered a violation of the netlink API, so it's not expected to occur under normal circumstances. For the same reason, no callers check for this error condition, and no callers need to be adjusted. However, we should still follow the same ownership semantics of the rest of the function. Thus, free the skb in this codepath. Suggested-by: Andrew Lunn <andrew@lunn.ch> Suggested-by: Matthew Maurer <mmaurer@google.com> Fixes: `2a94fe48f3` ("genetlink: make multicast groups const, prevent abuse") Link: https://lore.kernel.org/r/845b36ba-7b3a-41f2-acb2-b284f253e2ca@lunn.ch Signed-off-by: Alice Ryhl <aliceryhl@google.com> Link: https://patch.msgid.link/20260506-genlmsg-return-v2-1-a63ee2a055d6@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-08 15:43:29 -07:00
Quan Sun	4908f1395f	net: ethtool: fix NULL pointer dereference in phy_reply_size In phy_prepare_data(), several strings such as 'name', 'drvname', 'upstream_sfp_name', and 'downstream_sfp_name' are allocated using kstrdup(). However, these allocations were not checked for failure. If kstrdup() fails for 'name', it returns NULL while the function continues. This leads to a kernel NULL pointer dereference and panic later in phy_reply_size() when it unconditionally calls strlen() on the NULL pointer. While other strings like 'upstream_sfp_name' might be checked before access in certain code paths, failing to handle these allocations consistently can lead to incomplete data reporting or hidden bugs. Fix this by adding proper NULL checks for all kstrdup() calls in phy_prepare_data() and implement a centralized error handling path using goto labels to ensure all previously allocated resources are freed on failure. Fixes: `9dd2ad5e92` ("net: ethtool: phy: Convert the PHY_GET command to generic phy dump") Signed-off-by: Quan Sun <2022090917019@std.uestc.edu.cn> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260507131738.1173835-1-2022090917019@std.uestc.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-08 15:30:10 -07:00
Dragos Tatulea	58e2330bd4	net: napi: Avoid gro timer misfiring at end of busypoll When in irq deferral mode (defer-hard-irqs > 0), a short enough gro-flush timeout can trigger before NAPI_STATE_SCHED is cleared if the last poll in busy_poll_stop() takes too long. This can have the effect of leaving the queue stuck with interrupts disabled and no timer armed which results in a tx timeout if there is no subsequent busypoll cycle. To prevent this, defer the gro-flush timer arm after the last poll. Fixes: `7fd3253a7d` ("net: Introduce preferred busy-polling") Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca> Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20260506090808.820559-2-dtatulea@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-08 15:02:32 -07:00
Maoyi Xie	e68eadffb7	ipv6: flowlabel: enforce per-netns limit for unprivileged callers fl_size, fl_ht and ip6_fl_lock in net/ipv6/ip6_flowlabel.c are file scope and shared across netns. mem_check() reads fl_size to decide whether to deny non-CAP_NET_ADMIN callers. capable() runs against init_user_ns, so an unprivileged user in any non-init userns can push fl_size past FL_MAX_SIZE - FL_MAX_SIZE / 4 and starve every other unprivileged userns on the host. Add struct netns_ipv6::flowlabel_count, bumped and decremented next to fl_size in fl_intern, ip6_fl_gc and ip6_fl_purge. The new field fills the existing 4-byte hole after ipmr_seq, so struct netns_ipv6 stays the same size on 64-bit builds. Bump FL_MAX_SIZE from 4096 to 8192. It has been 4096 since the file was added. Machines and connection counts have grown. mem_check() folds an extra per-netns ceiling into the existing non-CAP_NET_ADMIN conditional. The ceiling is half of the total budget that unprivileged callers have ever been able to use, i.e. (FL_MAX_SIZE - FL_MAX_SIZE / 4) / 2 = 3072 entries. With FL_MAX_SIZE doubled, this preserves the original per-user reach of 3K (what an unprivileged caller could already obtain before this change), while forcing an attacker to spread allocations across at least two netns to exhaust the global non-CAP_NET_ADMIN budget. CAP_NET_ADMIN against init_user_ns still bypasses both caps. The previous patch took ip6_fl_lock across mem_check and fl_intern, so the new flowlabel_count read in mem_check and the new flowlabel_count++ in fl_intern run under the same critical section. flowlabel_count is therefore plain int, like fl_size. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Suggested-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Link: https://patch.msgid.link/20260506082416.2259567-3-maoyixie.tju@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-08 14:59:14 -07:00
Maoyi Xie	7ce5556f25	ipv6: flowlabel: take ip6_fl_lock across mem_check and fl_intern mem_check() in net/ipv6/ip6_flowlabel.c reads fl_size without holding ip6_fl_lock. fl_intern() takes the lock immediately afterwards. The two checks therefore race against concurrent fl_intern, ip6_fl_gc and ip6_fl_purge writers, which makes the mem_check budget check approximate. Move spin_lock_bh(&ip6_fl_lock) and the matching unlock from fl_intern() into its only caller ipv6_flowlabel_get(). The mem_check() call now runs under the same critical section as the fl_intern() insert, so the budget check is exact. With all writers and the read of fl_size under ip6_fl_lock, convert fl_size from atomic_t to plain int. The four sites that update or read fl_size are fl_intern (insert path), ip6_fl_gc (garbage collector, the !sched check and the per-entry decrement), ip6_fl_purge (per-netns purge), and mem_check (budget check), and all four now run under ip6_fl_lock. This is a prerequisite for adding a per-netns budget alongside fl_size. The follow-up patch adds netns_ipv6::flowlabel_count and folds it into mem_check(). Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Suggested-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Link: https://patch.msgid.link/20260506082416.2259567-2-maoyixie.tju@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-08 14:59:13 -07:00
Kuniyuki Iwashima	7eca3292ca	tcp: Fix imbalanced icsk_accept_queue count. When TCP socket migration happens in reqsk_timer_handler(), @sk_listener will be updated with the new listener. When we call __inet_csk_reqsk_queue_drop(), the listener must be the one stored in req->rsk_listener. The cited commit accidentally replaced oreq->rsk_listener with sk_listener, leading to imbalanced icsk_accept_queue count. Let's pass the correct listener to __inet_csk_reqsk_queue_drop(). Fixes: `e8c526f2bd` ("tcp/dccp: Don't use timer_pending() in reqsk_queue_unlink().") Reported-by: Damiano Melotti <melotti@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260506035954.1563147-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-08 14:54:51 -07:00
Kuniyuki Iwashima	97c8a3c1f7	tcp: Fix potential UAF in reqsk_timer_handler(). When TCP socket migration fails at inet_ehash_insert() in reqsk_timer_handler(), we jump to the no_ownership: label and free the new reqsk immediately with __reqsk_free(). Thus, we must stop the new reqsk's timer before jumping to the label, but the timer might be missed since the cited commit, resulting in UAF. As we are in the original reqsk's timer context, we can safely call timer_delete_sync() for the new reqsk. Let's pass false to __inet_csk_reqsk_queue_drop() to stop the new reqsk's timer. Fixes: `83fccfc394` ("inet: fix potential deadlock in reqsk_queue_unlink()") Reported-by: Damiano Melotti <melotti@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260506035954.1563147-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-08 14:54:50 -07:00
Sven Eckelmann	ba9d20ee90	batman-adv: bla: put backbone reference on failed claim hash insert When batadv_bla_add_claim() fails to insert a new claim into the hash, it leaked a reference to the backbone_gw for which the claim was intended. Call batadv_backbone_gw_put() on the error path to release the reference and avoid leaking the backbone_gw object. Cc: stable@kernel.org Fixes: `3db0decf11` ("batman-adv: Fix non-atomic bla_claim::backbone_gw access") Signed-off-by: Sven Eckelmann <sven@narfation.org>	2026-05-08 14:29:02 +02:00
Sven Eckelmann	cf6b604011	batman-adv: bla: only purge non-released claims When batadv_bla_purge_claims() goes through the list of claims, it is only traversing the hash list with an rcu_read_lock(). Due to a potential parallel batadv_claim_put(), it can happen that it encounters a claim which was actually in the process of being released+freed by batadv_claim_release(). In this case, backbone_gw is set to NULL before the delayed RCU kfree is started. Calling batadv_bla_claim_get_backbone_gw() is then no longer allowed because it would cause a NULL-ptr derefence. To avoid this, only claims with a valid reference counter must be purged. All others are already taken care of. Cc: stable@kernel.org Fixes: `23721387c4` ("batman-adv: add basic bridge loop avoidance code") Signed-off-by: Sven Eckelmann <sven@narfation.org>	2026-05-08 14:28:56 +02:00
Sven Eckelmann	4ae1709a31	batman-adv: bla: prevent use-after-free when deleting claims When batadv_bla_del_backbone_claims() removes all claims for a backbone, it does this by dropping the link entry in the hash list. This list entry itself was one of the references which need to be dropped at the same time via batadv_claim_put(). But the batadv_claim_put() must not be done before the last access to the claim object in this function. Otherwise the claim might be freed already by the batadv_claim_release() function before the list entry was dropped. Cc: stable@kernel.org Fixes: `23721387c4` ("batman-adv: add basic bridge loop avoidance code") Signed-off-by: Sven Eckelmann <sven@narfation.org>	2026-05-08 14:28:51 +02:00
Sven Eckelmann	ce425dd05d	batman-adv: tp_meter: fix tp_num leak on kmalloc failure When batadv_tp_start() or batadv_tp_init_recv() fail to allocate a new tp_vars object, the previously incremented bat_priv->tp_num counter is never decremented. This causes tp_num to drift upward on each allocation failure. Since only BATADV_TP_MAX_NUM sessions can be started and the count is never reduced for these failed allocations, it causes to an exhaustion of throughput meter sessions. In worst case, no new throughput meter session can be started until the mesh interface is removed. The error handling must decrement tp_num releasing the lock and aborting the creation of an throughput meter session Cc: stable@kernel.org Fixes: `33a3bb4a33` ("batman-adv: throughput meter implementation") Signed-off-by: Sven Eckelmann <sven@narfation.org>	2026-05-08 14:28:44 +02:00
Jiexun Wang	f03e858353	batman-adv: stop caching unowned originator pointers in BAT IV BAT IV keeps the last-hop neighbor address in each neigh_node, but some paths also cache an originator pointer derived from a temporary lookup. That pointer is not owned by the neigh_node and may no longer refer to a live originator entry after purge handling runs. Stop storing the auxiliary originator pointer in the BAT IV neighbor state. When BAT IV needs the neighbor originator data, resolve it from the stored neighbor address and drop the reference again after use. Fixes: `c6c8fea297` ("net: Add batman-adv meshing protocol") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> [sven: avoid bonding logic for outgoing OGM] Signed-off-by: Sven Eckelmann <sven@narfation.org>	2026-05-08 14:28:40 +02:00
Li Xiasong	19f94b6fee	netfilter: nft_ct: fix missing expect put in obj eval nft_ct_expect_obj_eval() allocates an expectation and may call nf_ct_expect_related(), but never drops its local reference. Add nf_ct_expect_put(exp) before return to balance allocation. Fixes: `857b46027d` ("netfilter: nft_ct: add ct expectations support") Cc: stable@vger.kernel.org Signed-off-by: Li Xiasong <lixiasong1@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-08 01:30:17 +02:00
Li Xiasong	eb6317739b	netfilter: nf_conntrack_sip: get helper before allocating expectation process_register_request() allocates an expectation and then checks whether a conntrack helper is available. If helper lookup fails, the function returns early and the allocated expectation is left behind. Reorder the code to fetch and validate helper before calling nf_ct_expect_alloc(). This keeps the logic simpler and removes the leak path while preserving existing behavior. Fixes: `e14575fa75` ("netfilter: nf_conntrack: use rcu accessors where needed") Cc: stable@vger.kernel.org Signed-off-by: Li Xiasong <lixiasong1@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-08 01:30:17 +02:00
Pablo Neira Ayuso	d8ef54c83a	netfilter: ctnetlink: check tuple and mask in expectations created via nfqueue Ensure the expectation tuple and mask attributes are present in netlink message, otherwise null-ptr-deref is possible. Fixes: `bd07793705` ("netfilter: nfnetlink_queue: allow to attach expectations to conntracks") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-08 01:30:17 +02:00
Pablo Neira Ayuso	dcb0f9aefd	netfilter: nf_conntrack_expect: restore helper propagation via expectation A recent series to fix expectations broke helper propagation via expectation, this mechanism is used by the sip and h323 helper. This also propagates the conntrack helper to expected connections. I changed semantics of exp->helper which now tells us the actual helper that created the expectation. Add an explicit assign_helper field to expectations for this purpose and update helpers to use it. Restore this feature for userspace conntrack helper via ctnetlink nfqueue integration so it is again possible to attach a helper to an expectation, where it makes sense. This is not restored via ctnetlink expectation creation as there is no client for such feature. Use the expectation layer 4 protocol number for the helper lookup for consistency. Make sure the expectation using this helper propagation mechanism also go away when the helper is unregistered. Fixes: `9c42bc9db9` ("netfilter: nf_conntrack_expect: honor expectation helper field") Fixes: `917b61fa20` ("netfilter: ctnetlink: ignore explicit helper on new expectations") Reported-by: Ilya Maximets <i.maximets@ovn.org> Tested-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-08 01:30:17 +02:00
Florian Westphal	27414ff1b2	netfilter: bridge: eb_tables: close module init race sashiko reports for unrelated patch: Does the core ebtables initialization in ebtables.c suffer from a similar race? Once nf_register_sockopt() completes, the sockopts are exposed globally. sockopt has to be registered last, just like in ip/ip6/arptables. Fixes: `5b53951cfc` ("netfilter: ebtables: use net_generic infra") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-08 01:30:17 +02:00
Florian Westphal	16bc4b6686	netfilter: x_tables: close dangling table module init race Similar to the previous ebtables patch: template add exposes the table to userspace, we must do this last to rnsure the pernet ops are set up (contain the destructors). Fixes: `fdacd57c79` ("netfilter: x_tables: never register tables by default") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-08 01:30:17 +02:00
Florian Westphal	92c603fa07	netfilter: ebtables: close dangling table module init race sashiko reported for a related patch: In modules like iptable_raw.c, [..], if register_pernet_subsys() fails, the rollback might call kfree(rawtable_ops) before [..] During this window, could a concurrent userspace process find the globally visible template, trigger table_init(), [..] The table init functions must always register the template last. Otherwise, set/getsockopt can instantiate a table in a namespace while the required pernet ops (contain the destructor) isn't available. This change is also required in x_tables, handled in followup change. Fixes: `87663c39f8` ("netfilter: ebtables: do not hook tables by default") Reviewed-by: Tristan Madani <tristan@talencesecurity.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-08 01:30:17 +02:00
Florian Westphal	b7f0544d86	netfilter: ebtables: move to two-stage removal scheme Like previous patches for x_tables, follow same pattern in ebtables. We can't reuse xt helpers: ebt_table struct layout is incompatible. table->ops assignment is now done while still holding the ebt mutex to make sure we never expose partially-filled table struct. Fixes: `87663c39f8` ("netfilter: ebtables: do not hook tables by default") Reviewed-by: Tristan Madani <tristan@talencesecurity.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-08 01:30:17 +02:00
Florian Westphal	b4597d5fd7	netfilter: x_tables: add and use xtables_unregister_table_exit Previous change added xtables_unregister_table_pre_exit to detach the table from the packetpath and to unlink it from the active table list. In case of rmmod, userspace that is doing set/getsockopt for this table will not be able to re-instantiate the table: 1. The larval table has been removed already 2. existing instantiated table is no longer on the xt pernet table list. This adds the second stage helper: unlink the table from the dying list, free the hook ops (if any) and do the audit notification. It replaces xt_unregister_table(). Fixes: `fdacd57c79` ("netfilter: x_tables: never register tables by default") Reported-by: Tristan Madani <tristan@talencesecurity.com> Reviewed-by: Tristan Madani <tristan@talencesecurity.com> Closes: https://lore.kernel.org/netfilter-devel/20260429175613.1459342-1-tristmd@gmail.com/ Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-08 01:30:16 +02:00
Florian Westphal	d338693d77	netfilter: x_tables: unregister the templates first When the module is going away we need to zap the template first. Else there is a small race window where userspace could instantiate a new table after the pernet exit function has removed the current table. Fixes: `fdacd57c79` ("netfilter: x_tables: never register tables by default") Reported-by: Tristan Madani <tristan@talencesecurity.com> Reviewed-by: Tristan Madani <tristan@talencesecurity.com> Closes: https://lore.kernel.org/netfilter-devel/20260429175613.1459342-1-tristmd@gmail.com/ Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-08 01:30:16 +02:00
Florian Westphal	527d693147	netfilter: x_tables: add and use xt_unregister_table_pre_exit Remove the copypasted variants of _pre_exit and add one single function in the xtables core. ebtables is not compatible with x_tables and therefore unchanged. This is a preparation patch to reduce noise in the followup bug fixes. Reviewed-by: Tristan Madani <tristan@talencesecurity.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-08 01:30:16 +02:00
Florian Westphal	b62eb8dcf2	netfilter: x_tables: allocate hook ops while under mutex arp/ip(6)t_register_table() add the table to the per-netns list via xt_register_table() before allocating the per-netns hook ops copy via kmemdup_array(). This leaves a window where the table is visible in the list with ops=NULL. If the pernet exit happens runs concurrently the pre_exit callback finds the table via xt_find_table() and passes the NULL ops pointer to nf_unregister_net_hooks(), causing a NULL dereference: general protection fault in nf_unregister_net_hooks+0xbc/0x150 RIP: nf_unregister_net_hooks (net/netfilter/core.c:613) Call Trace: ipt_unregister_table_pre_exit iptable_mangle_net_pre_exit ops_pre_exit_list cleanup_net Fix by moving the ops allocation into the xtables core so the table is never in the list without valid ops. Also ensure the table is no longer processing packets before its torn down on error unwind. nf_register_net_hooks might have published at least one hook; call synchronize_rcu() if there was an error. audit log register message gets deferred until all operations have passed, this avoids need to emit another ureg message in case of error unwinding. Based on earlier patch by Tristan Madani. Fixes: `f9006acc8d` ("netfilter: arp_tables: pass table pointer via nf_hook_ops") Fixes: `ee177a5441` ("netfilter: ip6_tables: pass table pointer via nf_hook_ops") Fixes: `ae68933422` ("netfilter: ip_tables: pass table pointer via nf_hook_ops") Link: https://lore.kernel.org/netfilter-devel/20260429175613.1459342-1-tristmd@gmail.com/ Signed-off-by: Tristan Madani <tristan@talencesecurity.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-08 01:30:16 +02:00
Florian Westphal	8e72510db9	netfilter: x_tables: allow initial table replace without emitting audit log message At the moment we emit the audit log a bit too early, which makes it necessary to also emit an unregister log in case we have to unwind errors after possible hook register failure. Followup patch will be slightly simpler if we can delay the register message until after the hooks have been wired up. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-08 01:30:16 +02:00
Kuniyuki Iwashima	ecddc523cf	tcp: Fix dst leak in tcp_v6_connect(). If a socket is bound to a wildcard address, tcp_v[46]_connect() updates it with a non-wildcard address based on the route lookup. After bhash2 was introduced in the cited commit, we must call inet_bhash2_update_saddr() to update the bhash2 entry as well. If inet_bhash2_update_saddr() fails, we must release the refcount for dst by ip_route_connect() or ip6_dst_lookup_flow(). While tcp_v4_connect() calls ip_rt_put() in the error path, tcp_v6_connect() does not call dst_release(). Let's call dst_release() when inet_bhash2_update_saddr() fails in tcp_v6_connect(). Fixes: `28044fc1d4` ("net: Add a bhash2 table hashed by port and address") Reported-by: Damiano Melotti <melotti@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260506070443.1699879-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-07 08:39:15 -07:00

1 2 3 4 5 ...

84241 Commits