linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-06-03 12:43:01 -04:00

Author	SHA1	Message	Date
Gal Pressman	78effd896e	udp: Fix UDP length on last GSO_PARTIAL segment Following the cited commit, __udp_gso_segment() writes single MSS length in the UDP header. The cited patch doesn't account for the fact that the last segment could be a GSO skb by itself. This could happen when the size of the packet is a multiple of MSS, hence the first segment is also the last one (there is no need for a remainder skb). When the post-loop segment is a GSO skb, assign the single MSS length in the UDP header. Fixes: `b10b446ce7` ("udp: gso: Use single MSS length in UDP header for GSO_PARTIAL") Reported-by: Matthew Schwartz <matthew.schwartz@linux.dev> Closes: https://lore.kernel.org/all/6c3fb15e-711d-4b8d-b152-e03d9b05293f@linux.dev/ Tested-by: Matthew Schwartz <matthew.schwartz@linux.dev> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20260518062250.3019914-3-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-20 15:03:47 -07:00
Alice Mikityanska	5f17ae0f59	udp: gso: Fix handling checksum in __udp_gso_segment The cited commit started using msslen for uh->len, but still uses newlen to adjust uh->check. Although the checksum is ignored in most cases due to the hardware offload, __udp_gso_segment attempts to maintain the correct one. Fix uh->check and adjust it by the right value. Additionally, after the fix, newlen becomes assigned and unused before the loop. The code can be simplified a bit if mss adjustment is dropped, so that newlen becomes equal to msslen before the loop, and msslen can be also dropped, saving a few lines of code. This brings us back to one variable, drops an unneeded arithmetic for mss, and fixes the UDP checksum. Fixes: `b10b446ce7` ("udp: gso: Use single MSS length in UDP header for GSO_PARTIAL") Signed-off-by: Alice Mikityanska <alice@isovalent.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20260518062250.3019914-2-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-20 15:02:47 -07:00
Nikhil P. Rao	dc416e32ba	pds_core: fix debugfs_lookup dentry leak and error handling debugfs_lookup() returns a dentry with an elevated reference count that must be released with dput(). The current code discards the returned dentry without calling dput(), causing a reference leak on every firmware reset recovery. Additionally, when CONFIG_DEBUG_FS is disabled, debugfs_lookup() returns ERR_PTR(-ENODEV), not NULL. The current check passes for error pointers and would call dput() on an invalid pointer, causing a crash. Fixes: `bc90fbe0c3` ("pds_core: Rework teardown/setup flow to be more common") Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com> Link: https://patch.msgid.link/20260515212907.998028-3-nikhil.rao@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-19 19:18:33 -07:00
Nikhil P. Rao	0e46b6635b	pds_core: fix error handling in pdsc_devcmd_wait Fix two cases where pdsc_devcmd_wait() returns stale success from the completion register instead of an error: 1. FW crash: If firmware stops running, the wait loop breaks early with running=false. The condition "if ((!done \|\| timeout) && running)" is false, so error handling is bypassed and stale status is returned. Check !running first and return -ENXIO. 2. Timeout: If a command times out, err is set to -ETIMEDOUT but then overwritten by pdsc_err_to_errno(status) which reads stale status. Return -ETIMEDOUT immediately after cleaning up. Both errors now propagate to pdsc_devcmd_locked() which queues health_work for recovery. Fixes: `45d76f4929` ("pds_core: set up device and adminq") Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com> Link: https://patch.msgid.link/20260515212907.998028-1-nikhil.rao@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-19 19:12:42 -07:00
Christian Marangi	0cb5a74faa	net: airoha: Fix NPU RX DMA descriptor bits In an internal review from Airoha, it was notice that the RX DMA descriptor bits and mask are wrong. These values probably refer to an old NPU firmware never published. The previous value works correctly but it was reported that in some specific condition in mixed scenario with both Ethernet and WiFi offload it's possible that RX DMA descriptor signal wrong value with the problem to the RX ring or packets getting dropped. To handle these specific scenario, apply the new suggested bits mask from Airoha. Correct functionality of both AN7581 NPU and MT7996 variant were verified and confirmed working. Fixes: `a7fc8c641c` ("net: airoha: Fix npu rx DMA definitions") Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20260518134530.3683-1-ansuelsmth@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-19 19:05:20 -07:00
Jann Horn	be309f8eae	af_unix: Fix UAF read of tail->len in unix_stream_data_wait() unix_stream_data_wait() does skb_peek_tail(&sk->sk_receive_queue) without holding any lock that prevents SKBs on that queue from being dequeued and freed. This has been the case since commit `79f632c71b` ("unix/stream: fix peeking with an offset larger than data in queue"). The first consequence of this is that the pointer comparison `tail != last` can be false even if `last` semantically refers to an already-freed SKB while `tail` is a new SKB allocated at the same address; which can cause unix_stream_data_wait() to wrongly keep blocking after new data has arrived, but only in a weird scenario where a peeking recv() and a normal recv() on the same socket are racing, which is probably not a real problem. But since commit `2b514574f7` ("net: af_unix: implement splice for stream af_unix sockets"), `tail` is actually dereferenced, which can cause UAF in the following race scenario (where test_setup() runs single-threaded, and afterwards, test_thread1() and test_thread2() run concurrently in two threads: ``` static int socks[2]; void test_setup(void) { socketpair(AF_UNIX, SOCK_STREAM, 0, socks); send(socks[1], "A", 1, 0); int peekoff = 1; setsockopt(socks[0], SOL_SOCKET, SO_PEEK_OFF, &peekoff, sizeof(peekoff)); } void test_thread1(void) { char dummy; recv(socks[0], &dummy, 1, MSG_PEEK); } void test_thread2(void) { char dummy; recv(socks[0], &dummy, 1, 0); shutdown(socks[1], SHUT_WR); } ``` when racing like this: ``` thread1 thread2 unix_stream_read_generic mutex_lock(&u->iolock) skb_peek(&sk->sk_receive_queue) skb_peek_next(skb, &sk->sk_receive_queue) mutex_unlock(&u->iolock) unix_stream_read_generic unix_state_lock(sk) skb_peek(&sk->sk_receive_queue) unix_state_unlock(sk) unix_stream_data_wait unix_state_lock(sk) tail = skb_peek_tail(&sk->sk_receive_queue) spin_lock(&sk->sk_receive_queue.lock) __skb_unlink(skb, &sk->sk_receive_queue) spin_unlock(&sk->sk_receive_queue.lock) consume_skb(skb) [frees the SKB] `tail != last`: false `tail`: true `tail->len != last_len` *UAF* ``` Fix the UAF by removing the read of tail->len; checking tail->len would only make sense if SKBs in the receive queue of a UNIX socket could grow, which can no longer happen. Kuniyuki explained: > When commit `869e7c6248` ("net: af_unix: implement stream sendpage > support") added sendpage() support, data could be appended to the last > skb in the receiver's queue. > > That's why we needed to check if the length of the last skb was changed > while waiting for new data in unix_stream_data_wait(). > > However, commit `a0dbf5f818` ("af_unix: Support MSG_SPLICE_PAGES") and > commit `57d44a354a` ("unix: Convert unix_stream_sendpage() to use > MSG_SPLICE_PAGES") refactored sendmsg(), and now data is always added > to a new skb. That means this fix is not suitable for kernels before 6.5. Fixes: `2b514574f7` ("net: af_unix: implement splice for stream af_unix sockets") Cc: stable@vger.kernel.org # 6.5.x Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260518-b4-unix-recv-wait-hotfix-v2-1-83e29ce8ad31@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-19 18:53:56 -07:00
Justin Iurman	d4ea0dfd75	ipv6: ioam: add NULL check for idev in ipv6_hop_ioam() Reported by Sashiko: The function ipv6_hop_ioam() accesses __in6_dev_get(skb->dev)->cnf.ioam6_enabled without validating the returned idev pointer. Because addrconf_ifdown() can concurrently clear dev->ip6_ptr via RCU, __in6_dev_get() can return NULL during interface teardown, which could cause a NULL pointer dereference when processing an IOAM Hop-by-Hop option. Let's add a check and use SKB_DROP_REASON_IPV6DISABLED accordingly. Fixes: `9ee11f0fff` ("ipv6: ioam: Data plane support for Pre-allocated Trace") Cc: stable@vger.kernel.org Signed-off-by: Justin Iurman <justin.iurman@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260517183059.29140-1-justin.iurman@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-19 18:46:44 -07:00
Jakub Kicinski	d03e6124c7	Merge branch 'net-phy-honor-eee_disabled_modes-when-advertising-eee' Nicolai Buchwitz says: ==================== net: phy: honor eee_disabled_modes when advertising EEE While debugging why ethtool --show-eee reports "not supported" on a Raspberry Pi CM4 with eee-broken-1000t / eee-broken-100tx set on the PHY node, I noticed two phylib helpers copy phydev->supported_eee into phydev->advertising_eee without applying phydev->eee_disabled_modes: phy_support_eee() and phy_advertise_eee_all(). That undoes the filtering phy_probe() set up after of_set_phy_eee_broken(), so the PHY ends up advertising EEE for modes that were marked broken in DT (or by the driver via eee_disabled_modes). The visible effect on MAC drivers that call phy_support_eee() after probe (bcmgenet, fec, lan743x, lan78xx, r8169) is that ethtool on the local interface reports "not supported" (because supported is masked by eee_disabled_modes and ends up empty), while the link partner happily sees EEE negotiated and active. Patch 1 fixes phy_support_eee(). Patch 2 fixes phy_advertise_eee_all(), which is also reached from genphy_c45_ethtool_set_eee() when user space passes an empty advertisement. I went through the other users of supported_eee as suggested by Andrew and they look fine: - phy_probe() already masks via eee_disabled_modes after of_set_phy_eee_broken(). - genphy_c45_ethtool_get_eee() masks supported_eee with eee_disabled_modes when reporting to user space. - genphy_c45_ethtool_set_eee() masks user-supplied adv against eee_disabled_modes, and the empty-adv path is now covered by patch 2. - genphy_c45_read_eee_abilities(), read_eee_cap1/cap2 populate supported_eee from PHY registers (source of truth). - genphy_c45_read_eee_adv(), read_eee_lpa() and write_eee_adv() use supported_eee only to gate which MMD registers to access, not to construct an advertisement. ==================== Link: https://patch.msgid.link/20260518-devel-phy-support-eee-fix-v2-0-05b52626fa68@tipi-net.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-19 18:45:27 -07:00
Nicolai Buchwitz	8baa7506d7	net: phy: honor eee_disabled_modes in phy_advertise_eee_all() phy_advertise_eee_all() copies supported_eee into advertising_eee unconditionally, overwriting any filtering applied during phy_probe() based on DT eee-broken-* properties or driver-populated eee_disabled_modes. genphy_c45_ethtool_set_eee() calls this helper when user space passes an empty advertisement, undoing the filtering. Apply the same eee_disabled_modes mask in phy_advertise_eee_all() so the filtering survives the copy, matching the pattern in phy_probe() and phy_support_eee(). Fixes: `b64691274f` ("net: phy: add helper phy_advertise_eee_all") Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20260518-devel-phy-support-eee-fix-v2-2-05b52626fa68@tipi-net.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-19 18:45:26 -07:00
Nicolai Buchwitz	3655063e08	net: phy: honor eee_disabled_modes in phy_support_eee() phy_support_eee() copies supported_eee into advertising_eee unconditionally, overwriting any filtering applied during phy_probe() based on DT eee-broken-* properties or driver-populated eee_disabled_modes. MAC drivers that call phy_support_eee() after probe (e.g. bcmgenet, fec, lan743x, lan78xx, r8169) then cause the PHY to advertise EEE for modes the user marked as broken. The symptom is that ethtool --show-eee on the local interface reports "not supported" (supported & ~eee_disabled_modes is empty) while the link partner sees EEE negotiated and active. phy_probe() already filters advertising_eee via eee_disabled_modes after calling of_set_phy_eee_broken(). Apply the same mask in phy_support_eee() so the filtering survives the copy. Fixes: `49168d1980` ("net: phy: Add phy_support_eee() indicating MAC support EEE") Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20260518-devel-phy-support-eee-fix-v2-1-05b52626fa68@tipi-net.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-19 18:45:26 -07:00
Nerijus Bendžiūnas	960e77ce14	net: phy: skip EEE advertisement write when autoneg is disabled genphy_c45_an_config_eee_aneg() writes the EEE advertisement to the auto-negotiation device's MMD register space (MDIO_MMD_AN, register MDIO_AN_EEE_ADV). These registers are read by the link partner only during auto-negotiation, so writing them while autoneg is disabled cannot influence the link. On some PHYs (e.g. Broadcom BCM54213PE) the write nevertheless reaches the chip and disturbs the receive datapath. Concretely, running ethtool -s eth0 speed 100 duplex full autoneg off ethtool --set-eee eth0 eee off leaves eth0 with TX working and RX completely silent on a Raspberry Pi 4 / CM4 board (bcmgenet + BCM54213PE in rgmii-rxid). Switching back to autoneg recovers the link. Prior to commit `f26a29a038` ("net: phy: ensure that genphy_c45_an_config_eee_aneg() sees new value of phydev->eee_cfg.eee_enabled"), the disable path was effectively a no-op because the helper read the stale eee_cfg.eee_enabled, so the underlying PHY behavior never surfaced. Bisected on rpi-6.12.y between commits 83943264 (good) and effcbc88 (bad) to `f26a29a038`. Fixes: `f26a29a038` ("net: phy: ensure that genphy_c45_an_config_eee_aneg() sees new value of phydev->eee_cfg.eee_enabled") Cc: stable@vger.kernel.org Signed-off-by: Nerijus Bendžiūnas <nerijus.bendziunas@gmail.com> Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de> Tested-by: Nicolai Buchwitz <nb@tipi-net.de> Link: https://patch.msgid.link/20260516150251.879680-1-nerijus.bendziunas@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-19 18:45:16 -07:00
Jakub Kicinski	90fc1a3937	Merge branch 'bridge-mcast-fix-a-possible-use-after-free-when-removing-a-bridge-port' Ido Schimmel says: ==================== bridge: mcast: Fix a possible use-after-free when removing a bridge port Patch #1 fixes a possible use-after-free when removing a bridge port. Patch #2 adds a test case that triggers the problem. In net-next we can: 1. Add DEBUG_NET_WARN_ON_ONCE() when a port multicast context is de-initialized while enabled. 2. When de-initializing a port multicast context, synchronously shutdown all the timers that were initialized when the context was initialized. ==================== Link: https://patch.msgid.link/20260517121122.188333-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-19 18:15:24 -07:00
Ido Schimmel	ae743a8ca8	selftests: bridge_vlan_mcast: Test toggling of multicast snooping Test toggling of multicast snooping when per-VLAN multicast snooping is enabled. The test always passes, but without "bridge: mcast: Fix possible use-after-free when removing a bridge port" it results in a splat. Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260517121122.188333-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-19 18:15:22 -07:00
Ido Schimmel	4df78ff026	bridge: mcast: Fix a possible use-after-free when removing a bridge port When per-VLAN multicast snooping is enabled, the bridge iterates over all the bridge ports, disables the per-port multicast context on each port and enables the per-{port, VLAN} multicast contexts instead. The reverse happens when per-VLAN multicast snooping is disabled. When global multicast snooping is enabled, the bridge iterates over all the bridge ports and enables the per-port multicast context on each port. The reverse happens when multicast snooping is disabled. The above scheme can result in a situation where both types of contexts (per-port and per-{port, VLAN}) are enabled on a single bridge port: # ip link add name br1 up type bridge mcast_snooping 1 mcast_querier 1 vlan_filtering 1 # ip link add name dummy1 up master br1 type dummy # ip link set dev br1 type bridge mcast_vlan_snooping 1 # ip link set dev br1 type bridge mcast_snooping 0 # ip link set dev br1 type bridge mcast_snooping 1 This is not intended and it is a problem since the commit cited below. Prior to this commit, when removing a bridge port, br_multicast_disable_port() would disable the per-port multicast context and the per-{port, VLAN} multicast contexts would get disabled when flushing VLANs. After this commit, br_multicast_disable_port() only disables the per-port multicast context if per-VLAN multicast snooping is disabled. If both types of contexts were enabled on the port when it was removed, the per-port multicast context would remain enabled when freeing the bridge port, leading to a use-after-free [1]. Fix by preventing the bridge from enabling / disabling the per-port multicast contexts when toggling global multicast snooping if per-VLAN multicast snooping is enabled. [1] ODEBUG: free active (active state 0) object: ffff88810f8bda78 object type: timer_list hint: br_ip6_multicast_port_query_expired (net/bridge/br_multicast.c:1927) WARNING: lib/debugobjects.c:629 at debug_print_object+0x1b1/0x3e0, CPU#5: swapper/5/0 [...] Call Trace: <IRQ> __debug_check_no_obj_freed (lib/debugobjects.c:1116) kfree (mm/slub.c:2620 mm/slub.c:6250 mm/slub.c:6565) kobject_cleanup (lib/kobject.c:689) rcu_do_batch (kernel/rcu/tree.c:2617) rcu_core (kernel/rcu/tree.c:2869) handle_softirqs (kernel/softirq.c:622) __irq_exit_rcu (kernel/softirq.c:656 kernel/softirq.c:496 kernel/softirq.c:735) irq_exit_rcu (kernel/softirq.c:752) sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1061 (discriminator 47) arch/x86/kernel/apic/apic.c:1061 (discriminator 47)) </IRQ> Fixes: `4b30ae9adb` ("net: bridge: mcast: re-implement br_multicast_{enable, disable}_port functions") Reported-by: syzbot+ae231e0552fa77b26ea1@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/87qznowlfs.ffs@tglx/ Reported-by: Thomas Gleixner <tglx@kernel.org> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260517121122.188333-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-19 18:15:22 -07:00
Dawei Feng	9b244c242b	octeontx2-pf: avoid double free of pool->stack on AQ init failure otx2_pool_aq_init() frees pool->stack when mailbox sync or retry allocation fails, but leaves the pointer unchanged. Later, otx2_sq_aura_pool_init() unwinds the partial setup through otx2_aura_pool_free(), which frees pool->stack again. The CN20K-specific cn20k_pool_aq_init() implementation has the same bug in its corresponding error path. Set pool->stack to NULL immediately after the local free so the shared cleanup path does not free the same stack again while cleaning up partially initialized pool state. The bug was first flagged by an experimental analysis tool we are developing for kernel memory-management bugs while analyzing v6.13-rc1. The tool is still under development and is not yet publicly available. Manual inspection confirms that the bug is still present in v7.1-rc3. Runtime validation was not performed because reproducing this path requires OcteonTX2/CN20K hardware. Fixes: `caa2da34fd` ("octeontx2-pf: Initialize and config queues") Fixes: `d322fbd172` ("octeontx2-pf: Initialize cn20k specific aura and pool contexts") Cc: stable@vger.kernel.org Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260515151826.1005397-1-dawei.feng@seu.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-19 17:49:49 -07:00
Jonas Jelonek	33d35975cb	net: pse-pd: fix sign on -ENOENT check in of_load_pse_pis() of_count_phandle_with_args() returns the count on success and a negative errno on failure, including -ENOENT when the "pairsets" property is absent. The existing comparison in of_load_pse_pis() checks against ENOENT (positive 2) instead of -ENOENT, so the branch is taken for any error return: legitimate DTs that omit "pairsets" trigger a spurious "wrong number of pairsets" error and probe fails with -EINVAL. Compare against -ENOENT so a missing "pairsets" property is correctly treated as "this PI has no pairsets, continue". Fixes: `9be9567a7c` ("net: pse-pd: Add support for PSE PIs") Cc: stable@vger.kernel.org Signed-off-by: Jonas Jelonek <jelonek.jonas@gmail.com> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20260515143103.1721888-1-jelonek.jonas@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-19 17:44:27 -07:00
Paolo Abeni	edc502717b	Merge branch 'mptcp-misc-fixes-for-v7-1-rc4' Matthieu Baerts says: ==================== mptcp: misc fixes for v7.1-rc4 Here are various unrelated fixes: - Patch 1: avoid dropping partial packets. A previous version has been sent a few week ago. A fix for 5.10. - Patches 2-3: stop ADD_ADDR timer when an ADD_ADDR can never been sent due to insufficient option space. A fix for v5.10. - Patch 4: reset rcv_wnd_sent on disconnect, just in case the next connection falls back to TCP. A fix for 5.17. - Patch 5: update window_clamp when SO_RCVBUF is set during the connection. A fix similar to a recent one on TCP side, for v6.6. - Patch 6: avoid wrong time being displayed in the selftests when using uutils 0.8.0 which contains a regression with 'date +%3N'. It doesn't fix an issue in the kernel selftests, but having the fix is helpful for those using uutils 0.8.0. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> ==================== Link: https://patch.msgid.link/20260515-net-mptcp-misc-fixes-7-1-rc4-v2-0-701e96419f2f@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-19 15:36:39 +02:00
Matthieu Baerts (NGI0)	01ff78e4b3	selftests: mptcp: drop nanoseconds width specifier Using the format specifier +%s%3N with GNU date is honoured, and only prints 3 digits of the nanoseconds portion of the seconds since epoch, which corresponds to the milliseconds. The uutils implementation of date currently does not honour this, and always prints all 9 digits. This is a known issue [1], but can be worked around by adapting this test to use nanoseconds instead of microseconds, and then divide it by 1e6. This fix is similar to what has been done on systemd side [2], and it is needed to run the selftests on Ubuntu 26.04, containing uutils 0.8.0. Note that the Fixes tag is there even if this patch doesn't fix an issue in the kernel selftests, but it is useful for those using uutils 0.8.0. Fixes: `048d19d444` ("mptcp: add basic kselftest for mptcp") Cc: stable@vger.kernel.org Link: https://github.com/uutils/coreutils/issues/11658 [1] Link: https://github.com/systemd/systemd/pull/41627 [2] Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260515-net-mptcp-misc-fixes-7-1-rc4-v2-6-701e96419f2f@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-19 15:36:35 +02:00
Gang Yan	3a543ae0e2	mptcp: update window_clamp on subflows when SO_RCVBUF is set Add __mptcp_subflow_set_rcvbuf() helper to write the subflow sk_rcvbuf, but also to call the recently added tcp_set_rcvbuf() helper to update window_clamp. This is needed because the window clap is updated when scaling_ratio changes, in tcp_measure_rcv_mss(). Until scaling_ratio changes, the subflow is stuck with the old window clamp which may be based on a small initial buffer. Use this new helper in both mptcp_sol_socket_sync_intval() (setsockopt path) and sync_socket_options() (new subflow creation path). Note that this patch depends on commit `b025461303` ("tcp: update window_clamp when SO_RCVBUF is set"): it fixes the issue on TCP side, but the same fix is needed on MPTCP side as well. Fixes: `a2cbb16039` ("tcp: Update window clamping condition") Cc: stable@vger.kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/619 Signed-off-by: Gang Yan <yangang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260515-net-mptcp-misc-fixes-7-1-rc4-v2-5-701e96419f2f@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-19 15:36:35 +02:00
Paolo Abeni	0981f90e1a	mptcp: reset rcv wnd on disconnect If the MPTCP socket fallback to TCP before the MP handshake completion, the IASN remain 0, and the rcv_wnd_sent field is not explicitly initialized, just incremented over time with the data transfer. At disconnect time such value is not cleared. If the next connection falls back to TCP before the MP handshake completion, the data transfer will keep incrementing the receive window end sequence starting from the last value used in the previous connection: the announced window will be unrelated from the actual receiver buffer size and likely too big. Address the issue zeroing the field at disconnect time. Fixes: `b29fcfb54c` ("mptcp: full disconnect implementation") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260515-net-mptcp-misc-fixes-7-1-rc4-v2-4-701e96419f2f@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-19 15:36:35 +02:00
Li Xiasong	fc5ef43318	selftests: mptcp: join: cover ADD_ADDR tx drop and list progress Extend add_addr_ports_tests with IPv6 signaling cases that exercise ADD_ADDR tx-space shortage when tcp_timestamps are enabled. Add one case to verify PM still progresses to later signal endpoints after the first one is dropped. This covers both failure accounting and the non-blocking behavior of the announce list after a tx-space drop on pure ACK. Signed-off-by: Li Xiasong <lixiasong1@huawei.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260515-net-mptcp-misc-fixes-7-1-rc4-v2-3-701e96419f2f@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-19 15:36:35 +02:00
Li Xiasong	51e398a3b8	mptcp: pm: fix ADD_ADDR timer infinite retry on option space insufficient When TCP option space is insufficient (e.g., when sending ADD_ADDR with an IPv6 address and port while tcp_timestamps is enabled), the original code jumped to out_unlock without clearing the addr_signal flag. This caused mptcp_pm_add_timer to keep rescheduling indefinitely, not sending ADD_ADDR, preventing subsequent addresses in the endpoint list from being announced. Handle this case by clearing the ADD_ADDR signal and skipping the matching ADD_ADDR retransmission entry. The skip path cancels the matching timer (with id check) and advances PM state progression, preserving forward progress to subsequent PM work. This cancellation is inherently best-effort. A concurrent add_timer callback may already be running and may acquire pm.lock before the cancel path updates entry state. In that case, one final ADD_ADDR transmit attempt can still be executed. Once the cancel path sets entry->retrans_times to ADD_ADDR_RETRANS_MAX, the callback-side retrans_times check suppresses further ADD_ADDR retransmissions. Note that when an ADD_ADDR is being prepared, a pure-ACK is queued. On the output side, it means that it is fine to skip non-pure-ACK packets, when drop_other_suboptions is set: a pure-ACK will be processed soon after. Fixes: `00cfd77b90` ("mptcp: retransmit ADD_ADDR when timeout") Cc: stable@vger.kernel.org Signed-off-by: Li Xiasong <lixiasong1@huawei.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260515-net-mptcp-misc-fixes-7-1-rc4-v2-2-701e96419f2f@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-19 15:36:35 +02:00
Shardul Bankar	50c2d91c5d	mptcp: do not drop partial packets When a packet arrives with map_seq < ack_seq < end_seq, the beginning of the packet has already been acknowledged but the end contains new data. Currently the entire packet is dropped as "old data," forcing the sender to retransmit. Instead, skip the already-acked bytes by adjusting the skb offset and enqueue only the new portion. Update bytes_received and ack_seq to reflect the new data consumed. A previous attempt at this fix has been sent by Paolo Abeni [1], but had issues [2]: it also added a zero-window check and changed rcv_wnd_sent initialization, which caused test regressions. This version addresses only the partial packet handling without modifying receive window accounting. Fixes: `ab174ad8ef` ("mptcp: move ooo skbs into msk out of order queue.") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/c9b426a4e163aa3c4fe8b80c79f1a610f47ae7d8.1763075056.git.pabeni@redhat.com [1] Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/600 [2] Signed-off-by: Shardul Bankar <shardul.b@mpiricsoftware.com> [pabeni@redhat.com: update map] Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260515-net-mptcp-misc-fixes-7-1-rc4-v2-1-701e96419f2f@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-19 15:36:35 +02:00
Paolo Abeni	956aa1ec97	Merge tag 'ovpn-net-20260514' of https://github.com/OpenVPN/ovpn-net-next Antonio Quartulli says: ==================== Included fixes: * fix TCP selftest failures by reducing number of attempted pings * fix RCU ptr deref outside of RCU read section * fix UAF in case of TCP peer failed to be added to hashtable * fix race condition between iface teardown and new peer being added * ensure dstats are updated with BH disabled to avoid concurrency * tag 'ovpn-net-20260514' of https://github.com/OpenVPN/ovpn-net-next: ovpn: disable BHs when updating device stats ovpn: fix race between deleting interface and adding new peer ovpn: respect peer refcount in CMD_NEW_PEER error path ovpn: tcp - use cached peer pointer in ovpn_tcp_close() selftests: ovpn: reduce remaining ping flood counts ==================== Link: https://patch.msgid.link/20260514231544.795993-1-antonio@openvpn.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-19 13:51:08 +02:00
Erni Sri Satya Vennela	35f0f0a253	net: mana: Fix TOCTOU double-fetch of hwc_msg_id from DMA buffer In mana_hwc_rx_event_handler(), resp->response.hwc_msg_id is read from DMA-coherent memory and bounds-checked, then mana_hwc_handle_resp() re-reads the same field from the same DMA buffer for test_bit() and pointer arithmetic. DMA-coherent memory is mapped uncacheable on x86 and is shared, unencrypted, in Confidential VMs (SEV-SNP/TDX), so each load goes directly to host-visible memory. A H/W can modify the value between the check and the use, bypassing the bounds validation. Fix this by reading hwc_msg_id exactly once using READ_ONCE() into a stack-local variable in mana_hwc_rx_event_handler(), and passing the validated value as a parameter to mana_hwc_handle_resp(). Fixes: `ca9c54d2d6` ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)") Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com> Link: https://patch.msgid.link/20260514194156.466823-1-ernis@linux.microsoft.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-19 13:00:28 +02:00
Paolo Abeni	2d85ae5d0f	Merge branch 'net-dsa-mt7530-assorted-fixes' Daniel Golle says: ==================== net: dsa: mt7530: assorted fixes A batch of small, independent fixes for the MediaTek MT7530 family DSA driver, addressing long-standing correctness issues that surface on hardware with bridge VLAN filtering enabled, on link-local frame reception, and during bridge join/leave transitions. ==================== Link: https://patch.msgid.link/cover.1778766629.git.daniel@makrotopia.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-19 12:37:34 +02:00
Edward Parker	4cb3cd670b	net: dsa: mt7530: untag VLAN-aware bridge PVID With bridge VLAN filtering enabled on a port configured as untagged member of the bridge PVID, ingress untagged frames do not reach the corresponding bridge VLAN upper interface (br-lan.<vid>). ARP and similar traffic is visible on the physical port but not delivered to the VLAN sub-interface. The MT7530/MT7531 forwards frames to the CPU port with the user port's PVID tag applied even when the frame ingressed untagged on the wire, because the CPU port is set to MT7530_VLAN_EG_CONSISTENT and is a tagged member of the VLAN entry created for the bridge VLAN. The DSA core then sees a hwaccel-tagged frame whose VID matches the port's PVID, which the bridge does not treat as the untagged-on-the-wire frame that the user expects. Set ds->untag_vlan_aware_bridge_pvid in the mt7530 and mt7531 setup paths so the DSA core strips that hwaccel tag in software when the parsed VID matches the bridge port's PVID, restoring the on-the-wire frame as the bridge expects to see it. Link: https://github.com/openwrt/openwrt/issues/18576 Fixes: `83163f7dca` ("net: dsa: mediatek: add VLAN support for MT7530") Signed-off-by: Edward Parker <edward@topnotchit.com> [daniel@makrotopia.org: improve commit message] Signed-off-by: Daniel Golle <daniel@makrotopia.org> Link: https://patch.msgid.link/85d25ea1b26d3c907f815649f2e0bde6560282a3.1778766629.git.daniel@makrotopia.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-19 12:37:34 +02:00
Daniel Golle	2c4c76cacc	net: dsa: mt7530: fix CPU port VLAN not being reset to unaware After a VLAN-aware bridge is destroyed, creating any VLAN-unaware bridge loses all connectivity. The VID 0 VLAN table entry used by VLAN-unaware ports in FALLBACK mode gets corrupted during VLAN-aware operation: mt7530_hw_vlan_add() overwrites its EG_CON flag with VTAG_EN and bridge teardown removes ports from its PORT_MEM. The cleanup code that should restore it never runs because the current port's dp->vlan_filtering flag is still true when checked (DSA updates it only after the driver callback returns). Even when restored, the deferred VLAN deletion events from the switchdev workqueue can corrupt VID 0 again after the restoration. Skip the current port in the all_user_ports_removed check, call mt7530_setup_vlan0() to restore the VID 0 entry, and protect VID 0 from being modified by bridge VLAN operations in port_vlan_add and port_vlan_del since it is managed exclusively by mt7530_setup_vlan0(). Remove the CPU port PCR and PVC register writes which were clobbering PORT_VLAN mode and VLAN_ATTR with wrong values. Fixes: `83163f7dca` ("net: dsa: mediatek: add VLAN support for MT7530") Signed-off-by: Daniel Golle <daniel@makrotopia.org> Link: https://patch.msgid.link/da8bdaf08b2427a9057e6cb33e26d41f8a8d5000.1778766629.git.daniel@makrotopia.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-19 12:37:29 +02:00
Daniel Golle	3ac85bcfd4	net: dsa: mt7530: preserve VLAN tags on trapped link-local frames The BPC, RGAC1 and RGAC2 registers control the handling of link-local frames with reserved MAC DAs (01:80:C2:00:00:0x). These frames are correctly trapped to the CPU port, but the egress VLAN tag attribute was set to MT7530_VLAN_EG_UNTAGGED which causes the switch to strip any VLAN tags from trapped frames before they reach the CPU. This causes VLAN-tagged link-local frames (STP BPDUs, LLDP, PTP Peer Delay Requests) to arrive at the CPU without their VLAN tag, so they are delivered to the base network interface instead of the VLAN sub-interface. The DSA local_termination selftest confirms this: all link-local protocol tests on VLAN upper interfaces fail. Set the EG_TAG attribute to MT7530_VLAN_EG_DISABLED (system default) so that the switch does not modify VLAN tags in trapped frames. This way VLAN-tagged frames retain their original tag and are delivered to the correct VLAN sub-interface, matching the behavior of non-trapped frames which pass through without VLAN tag modification. Fixes: `69ddba9d17` ("net: dsa: mt7530: fix handling of all link-local frames") Signed-off-by: Daniel Golle <daniel@makrotopia.org> Acked-by: Chester A. Unal <chester.a.unal@arinc9.com> Link: https://patch.msgid.link/891e0cd34db2a5fe20ceb73283a81fb5f71427ca.1778766629.git.daniel@makrotopia.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-19 12:37:29 +02:00
Daniel Golle	e824e40d0e	net: dsa: mt7530: fix FDB entries not aging out with short timeout The DSA forwarding selftests bridge_vlan_aware.sh and bridge_vlan_unaware.sh configure the bridge with ageing_time set to LOW_AGEING_TIME (1000 centiseconds, i.e. 10 seconds) and then run learning_test() in lib.sh, which expects a learned FDB entry to be removed after ageing_time + 10 seconds. On MT7530/MT7531 the entry persisted past the deadline and the "Found FDB record when should not" assertion failed. With msecs=10000, the algorithm in mt7530_set_ageing_time() finds AGE_CNT=0 and AGE_UNIT=9 as the first exact match (starting the search from tmp_age_count=0). The per-entry aging counter is initialized to AGE_CNT when a MAC address is learned, so with AGE_CNT=0 new entries start with a counter value of 0, which the hardware treats as "already aged" and never removes, effectively disabling aging. Fix this by starting the search from tmp_age_count=1 to ensure entries always have a non-zero initial aging counter. For a 10-second ageing time this yields AGE_CNT=1 and AGE_UNIT=4 instead: the timer ticks every 5 seconds and entries are removed after 2 ticks. Starting the search at AGE_CNT=1 raises the minimum representable ageing time from 1 to 2 seconds. Without bounds, a stale ageing_time of 1 second would now make the loop fall through without setting age_count and age_unit, leaving them uninitialized when written to the MT7530_AAC hardware register. Set ds->ageing_time_min and ds->ageing_time_max so the DSA core validates the range before the callback is invoked, and drop the now-redundant range check from mt7530_set_ageing_time(). Fixes: `ea6d5c924e` ("net: dsa: mt7530: support setting ageing time") Signed-off-by: Daniel Golle <daniel@makrotopia.org> Link: https://patch.msgid.link/7788ded12dc07b1bce329ec35fa70f4b45f3f9b7.1778766629.git.daniel@makrotopia.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-19 12:37:29 +02:00
Jakub Kicinski	2beba18b01	Merge branch 'intel-wired-lan-driver-updates-2026-05-15-ice-ixgbevf-igc-e1000e' Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2026-05-15 (ice, ixgbevf, igc, e1000e) For ice: Jake fixes a mismatch in locking around wait queue usage. Jose Ignacio Tornos Martinez adjusts allowed lower bound for VF data buffer size to accommodate low MTU sizes. Marcin adjusts for -EEXIST to not trigger error path when the promisc filter already exists as part of adding VLAN Ids. Grzegorz fixes a few issues related to PTP. He adds locking to ice_start_phy_timer_eth56g() to protect proper register programming. Fixes the PTP lock used in 2xNAC configuration to always be the primary and restores PTP configuration on ethtool channel changes. For ixgbevf: Michael Bommarito sets freed skb pointer to NULL to prevent use-after-free. For igc: Kohei Enju resolves a couple of issues reported by Sashiko; setting buffer type for an SMD skb and freeing skb on error of igc_fpe_init_tx_descriptor(). ==================== Link: https://patch.msgid.link/20260515182419.1597859-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 19:01:37 -07:00
Kohei Enju	e935c37b8a	igc: fix potential skb leak in igc_fpe_xmit_smd_frame() When igc_fpe_init_tx_descriptor() fails, no one takes care of an allocated skb, leaking it. [1] Use dev_kfree_skb_any() on failure. Tested on an I226 adapter with the following command, while injecting faults in igc_fpe_init_tx_descriptor() to trigger the error path. # ethtool --set-mm $DEV verify-enabled on tx-enabled on pmac-enabled on [1] unreferenced object 0xffff888113c6cdc0 (size 224): ... backtrace (crc be3d3fda): kmem_cache_alloc_node_noprof+0x3b1/0x410 __alloc_skb+0xde/0x830 igc_fpe_xmit_smd_frame.isra.0+0xad/0x1b0 igc_fpe_send_mpacket+0x37/0x90 ethtool_mmsv_verify_timer+0x15e/0x300 Cc: stable@vger.kernel.org Fixes: `5422570c00` ("igc: add support for frame preemption verification") Signed-off-by: Kohei Enju <kohei@enjuk.jp> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com> Tested-by: Avigail Dahan <avigailx.dahan@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260515182419.1597859-10-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 19:00:44 -07:00
Kohei Enju	5acc641e59	igc: set tx buffer type for SMD frames Sashiko pointed out that igc_fpe_init_smd_frame() initializes igc_tx_buffer fields for an SMD skb, but does not set the buffer type: https://sashiko.dev/#/patchset/20260415025226.114115-1-kohei%40enjuk.jp Since igc_tx_buffer entries are reused, a stale XDP or XSK type can remain and make TX completion use the wrong cleanup path. Set the buffer type to IGC_TX_BUFFER_TYPE_SKB. Fixes: `5422570c00` ("igc: add support for frame preemption verification") Signed-off-by: Kohei Enju <kohei@enjuk.jp> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Avigail Dahan <avigailx.dahan@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260515182419.1597859-9-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 19:00:44 -07:00
Michael Bommarito	5d49b568c1	ixgbevf: fix use-after-free in VEPA multicast source pruning ixgbevf_clean_rx_irq() prunes frames whose source MAC matches the VF's own address (VEPA multicast workaround) by freeing the skb and continuing to the next descriptor: dev_kfree_skb_irq(skb); continue; The skb pointer is declared outside the while loop and persists across iterations. Because the continue skips the "skb = NULL" reset at the bottom of the loop, the next iteration enters the "else if (skb)" path and calls ixgbevf_add_rx_frag() on the freed skb, dereferencing skb_shinfo(skb)->nr_frags - a use-after-free in NAPI softirq context. The sibling driver iavf already handles this correctly by nulling the pointer before continuing. Apply the same pattern here. I do not have ixgbevf hardware; the bug was found by static analysis (scan_drop_continue_loops.py + semgrep drop_continue_in_loop, multi-tool corroboration with the highest score in the scan). The UAF was confirmed under KASAN by loading a test module that reproduces the exact code pattern (alloc skb, kfree_skb, then read skb_shinfo(skb)->nr_frags): BUG: KASAN: slab-use-after-free in ixgbevf_uaf_test_init+0x100/0x1000 Read of size 8 at addr 000000006163ae78 by task insmod/30 freed 208-byte region [000000006163adc0, 000000006163ae90) QEMU emulates igb (82576) but not ixgbe (82599), and the igbvf VF driver does not include the VEPA source pruning path, so a full end-to-end reproduction with emulated hardware was not possible. Fixes: `bad17234ba` ("ixgbevf: Change receive model to use double buffered page based receives") Cc: stable@vger.kernel.org Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260515182419.1597859-8-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 19:00:44 -07:00
Grzegorz Nitka	975b564d19	ice: restore PTP Rx timestamp config after ethtool set-channels When ethtool -L changes queue counts, ice_vsi_recfg_qs() closes and rebuilds the VSI, reallocating Rx rings. The newly allocated rings have ptp_rx cleared, so RX hardware timestamps are no longer attached to skb until hwtstamp configuration is applied again. Restore timestamp mode after ice_vsi_open() in the queue reconfiguration path, matching reset/rebuild behavior and ensuring newly rebuilt Rx rings have PTP RX timestamping re-enabled. Testing hints: - run ptp4l application in client synchronization mode: ptp4l -i ethX -m -s - run PTP traffic - change queue number on ethX netdev interface: ethtool -L ethX combined new_queue_size - observe ptp4l output - expected result: no "received DELAY_REQ without timestamp" messages Fixes: `77a781155a` ("ice: enable receive hardware timestamping") Cc: stable@vger.kernel.org Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260515182419.1597859-7-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 19:00:44 -07:00
Grzegorz Nitka	7b28523546	ice: ptp: use primary NAC semaphore on E825 For E825 2xNAC configurations, PTP semaphore operations must hit the primary NAC register block so both sides coordinate on the same lock. Commit `e2193f9f9e` ("ice: enable timesync operation on 2xNAC E825 devices") updated other primary-only PTP register accesses to use the primary NAC on non-primary functions, but left ice_ptp_lock() and ice_ptp_unlock() operating on the local NAC. As a result, secondary NAC PTP paths can take a different semaphore than the primary side. Select the primary hardware in ice_ptp_lock() and ice_ptp_unlock() when the current function is not primary, keeping semaphore operations symmetric and consistent with the rest of the 2xNAC PTP register access path. Fixes: `e2193f9f9e` ("ice: enable timesync operation on 2xNAC E825 devices") Reviewed-by: Arkadiusz Kubalewski <Arkadiusz.kubalewski@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260515182419.1597859-6-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 19:00:44 -07:00
Grzegorz Nitka	781ff8f2d5	ice: ptp: serialize E825 PHY timer start with PTP lock ice_start_phy_timer_eth56g() programs TIMETUS registers and issues INIT_INCVAL without holding the global PTP semaphore. This allows concurrent PTP command paths to interleave with PHY timer start, which can make the sequence fail and leave timer initialization inconsistent. Take the PTP lock around TIMETUS registers programming and INIT_INCVAL command execution, and make sure the lock is released on all error paths. Keep the subsequent sync step outside of this critical section, since ice_sync_phy_timer_eth56g() takes the same semaphore internally. Fixes: `7cab44f1c3` ("ice: Introduce ETH56G PHY model for E825C products") Reviewed-by: Arkadiusz Kubalewski <Arkadiusz.kubalewski@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260515182419.1597859-5-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 19:00:43 -07:00
Marcin Szycik	ebc8de716c	ice: fix setting promisc mode while adding VID filter There are at least two paths through which VSI promiscuous mode can be independently configured via ice_fltr_set_vsi_promisc(): - ice_vlan_rx_add_vid() (netdev op) - ice_service_task() -> ... -> ice_set_promisc() Both paths may try to program promiscuous mode concurrently. One such scenario is: 1. Add ice netdev to bond 2. Add the bond netdev to bridge 3. ice netdev enters allmulticast mode (IFF_ALLMULTI) 4. Service task programs promisc mode filter 5. Bridge -> bond calls ice_vlan_rx_add_vid() Crucially, ice_vlan_rx_add_vid() fails if ice_fltr_set_vsi_promisc() returns any error, including -EEXIST. This causes VLAN filtering setup to fail on the bond interface. ice_set_promisc() already handles -EEXIST correctly. Fix by adding the same -EEXIST check to ice_vlan_rx_add_vid(): if the promisc filter is already programmed, continue without returning error. Fixes: `1273f89578` ("ice: Fix broken IFF_ALLMULTI handling") Cc: stable@vger.kernel.org Signed-off-by: Marcin Szycik <marcin.szycik@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260515182419.1597859-4-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 19:00:43 -07:00
Jose Ignacio Tornos Martinez	3ba4dd024d	ice: fix VF queue configuration with low MTU values The ice driver's VF queue configuration validation rejects databuffer_size values below 1024 bytes, which prevents VFs from using MTU values below 871 bytes. The iavf driver calculates databuffer_size based on the MTU using: databuffer_size = ALIGN(MTU + LIBETH_RX_LL_LEN, 128) where LIBETH_RX_LL_LEN = 26 (ETH_HLEN + 2*VLAN_HLEN + ETH_FCS_LEN). For MTU values below 871: MTU 870: 870 + 26 = 896, aligned to 128 = 896 (< 1024, rejected) MTU 871: 871 + 26 = 897, aligned to 128 = 1024 (>= 1024, accepted) The 1024-byte minimum seems unnecessarily restrictive, because the hardware supports databuffer_size as low as 128 bytes (the alignment boundary), which should allow MTU values down to the standard minimum of 68 bytes. I haven't found the reason why the limit was configured in the commit `9c7dd7566d` ("ice: add validation in OP_CONFIG_VSI_QUEUES VF message"), so with no more information and since it is working, change the minimum databuffer_size validation from 1024 to 128 bytes to allow standard low MTU values while still preventing invalid configurations. Fixes: `9c7dd7566d` ("ice: add validation in OP_CONFIG_VSI_QUEUES VF message") cc: stable@vger.kernel.org Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260515182419.1597859-3-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 19:00:43 -07:00
Jacob Keller	89bbff099b	ice: fix locking around wait_event_interruptible_locked_irq Commit `50327223a8` ("ice: add lock to protect low latency interface") introduced a wait queue used to protect the low latency timer interface. The queue is used with the wait_event_interruptible_locked_irq macro, which unlocks the wait queue lock while sleeping. The irq variant uses spin_lock_irq and spin_unlock_irq to manage this. The wait queue lock was previously locked using spin_lock_irqsave. This difference in lock variants could lead to issues, since wait_event would unlock the wait queue and restore interrupts while sleeping. The ice_read_phy_tstamp_ll_e810() function is ultimately called through ice_read_phy_tstamp, which is called from ice_ptp_process_tx_tstamp or ice_ptp_clear_unexpected_tx_ready. The former is called through the miscellaneous IRQ thread function, while the latter is called from the service task work queue thread. Neither of these functions has interrupts disabled, so use spin_lock_irq instead of spin_lock_irqsave. Fixes: `50327223a8` ("ice: add lock to protect low latency interface") Cc: stable@vger.kernel.org Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20250109181823.77f44c69@kernel.org/ Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260515182419.1597859-2-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 19:00:43 -07:00
Jakub Kicinski	317bbe5301	Merge tag 'nf-26-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for net: 1) Fix small race windows in nf_ct_helper_log() when accessing helper, from Florian Westphal. 2) Fix potential infinite loop and race conditions in IPVS caused by frequent user-triggered service table changes, from Julia Anastasov. 3) Fix a race condition when dumping ipsets for restore, from Jozsef Kadlecsik. 4) Fix inner transport offset in IPv6 in nft_inner when extension headers come before the layer 4 transport header, from Yizhou Zhao. 5) Fix incorrect iteration over IPv4 ranges in several hash set types, from Nan Li. 6) Fix incorrect order when restoring BH in nft_inner_restore_tun_ctx(), from Florian Westphal. 7) Validate option array from ip6t_hbh checkpath() to fix an off-by-one access, from Zhengchuan Liang. 8) Fix race condition between ipset list -terse and concurrent updates, from Jozsef Kadlecisk. 9) Fix race condition when inserting elements into a hash bucket, also from Jozsef. 10) Annotate access to first free slot in hashtable, from Jozsef Kadlecsik. 11) Ensure sufficient headroom in br_netfilter neigh transmission, from Lorenzo Bianconi. 12) Hold reference on skb->dev in nfqueue exit path, bridge local input is speciall since skb->dev != state->indev, allowing for net_device to go away while packet is sitting in nfqueue. From Haoze Xie. * tag 'nf-26-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_queue: hold bridge skb->dev while queued netfilter: br_netfilter: Reallocate headroom if necessary in neigh_hh_bridge() netfilter: ipset: annotate "pos" for concurrent readers/writers netfilter: ipset: Fix data race between add and dump in all hash types netfilter: ipset: Fix data race between add and list header in all hash types netfilter: ip6t_hbh: reject oversized option lists netfilter: nft_inner: release local_lock before re-enabling softirqs netfilter: ipset: stop hash:* range iteration at end netfilter: nft_inner: Fix IPv6 inner_thoff desync netfilter: ipset: fix a potential dump-destroy race ipvs: avoid possible loop in ip_vs_dst_event on resizing netfilter: nf_conntrack_helper: fix possible null deref during error log ==================== Link: https://patch.msgid.link/20260516115627.967773-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 16:59:30 -07:00
Jakub Kicinski	aeff7e8e50	Merge tag 'batadv-net-pullrequest-20260515' of https://git.open-mesh.org/batadv Simon Wunderlich says: ==================== Here are various batman-adv bugfixes: - fix tp_meter counter underflow during shutdown, by Luxiao Xu - fix tp_meter tp_vars reference leak in receiver shutdown, by Sven Eckelmann - fix various translation table integer handling issues, by Sven Eckelmann (3 patches) - fix various translation table counter issues, by Sven Eckelmann (3 patches) - fix fragment reassembly length accounting, by Ruide Cao - clear current gateway during teardown, by Ruijie Li - handle forward allocation error in DAT, by Sven Eckelmann - tp_meter: avoid use of uninitialized sender variables in tp_meter, by Sven Eckelmann - disallow unicast fragment in fragment, by Sven Eckelmann - directly shut down tp_meter timer on cleanup, by Sven Eckelmann * tag 'batadv-net-pullrequest-20260515' of https://git.open-mesh.org/batadv: batman-adv: tp_meter: directly shut down timer on cleanup batman-adv: frag: disallow unicast fragment in fragment batman-adv: tp_meter: avoid use of uninit sender vars batman-adv: dat: handle forward allocation error batman-adv: clear current gateway during teardown batman-adv: fix fragment reassembly length accounting batman-adv: tt: prevent TVLV entry number overflow batman-adv: tt: avoid empty VLAN responses batman-adv: tt: fix TOCTOU race for reported vlans batman-adv: tt: fix negative last_changeset_len batman-adv: tt: fix negative tt_buff_len batman-adv: tt: reject oversized local TVLV buffers batman-adv: tp_meter: fix tp_vars reference leak in receiver shutdown batman-adv: fix tp_meter counter underflow during shutdown ==================== Link: https://patch.msgid.link/20260515095540.325586-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 16:50:04 -07:00
Jakub Kicinski	23f3b04f15	Merge tag 'for-net-2026-05-14' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - af_bluetooth: serialize accept_q access - L2CAP: ecred_reconfigure: send packed pdu, not stack pointer - btmtk: accept too short WMT FUNC_CTRL events - hci_qca: Convert timeout from jiffies to ms * tag 'for-net-2026-05-14' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: hci_qca: Convert timeout from jiffies to ms Bluetooth: L2CAP: ecred_reconfigure: send packed pdu, not stack pointer Bluetooth: btmtk: accept too short WMT FUNC_CTRL events Bluetooth: serialize accept_q access ==================== Link: https://patch.msgid.link/20260514172340.1515042-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 16:40:05 -07:00
Ilya Maximets	bae3ee802c	openvswitch: vport: fix race between linking and the device notifier Sashiko reports that it is technically possible that we got the device reference, but by the time we're linking it to the OVS datapath, it may be already in the process of being deleted. In this case if the notifier wins the race for RTNL, it will see that the device is not yet in the OVS datapath (ovs_netdev_get_vport() will fail in the dp_device_event()) and will do nothing. Then the ovs_netdev_link() will take the RTNL and link the unregistering device to OVS datapath. Eventually, netdev_wait_allrefs_any() will re-broadcast the event and the device will be properly detached, but it will take at least a second before that happens, so it's not something we should rely on. Let's avoid linking the non-registered device in the first place. Note: As per documentation, RTNL doesn't protect the reg_state, but it actually does for all the state transitions we care about here, so it should not be necessary to use READ_ONCE or taking the instance lock. We can still do that, but we have a few more places even in this file where the reg_state is accessed without those while under RTNL, and many more places like this across the kernel code, so it might make more sense to change all of them in a more centralized fashion in the future, if necessary. Fixes: `ccb1352e76` ("net: Add Open vSwitch kernel components.") Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Aaron Conole <aconole@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Link: https://patch.msgid.link/20260514184702.2461435-1-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 16:38:45 -07:00
Weiming Shi	d00c953a8f	net: qualcomm: rmnet: fix endpoint use-after-free in rmnet_dellink() rmnet_dellink() removes the endpoint from the hash table with hlist_del_init_rcu() and then immediately frees it with kfree(). However, RCU readers on the receive path (rmnet_rx_handler -> __rmnet_map_ingress_handler) may still hold a reference to the endpoint and dereference ep->egress_dev after the memory has been freed. The endpoint is a kmalloc-32 object, and the stale read at offset 8 corresponds to the egress_dev pointer. BUG: unable to handle page fault for address: ffffffffde942eef Oops: 0002 [#1] SMP NOPTI CPU: 1 UID: 0 PID: 137 Comm: poc_write Not tainted 7.0.0+ #4 PREEMPTLAZY RIP: 0010:rmnet_vnd_rx_fixup (rmnet_vnd.c:27) Call Trace: <TASK> __rmnet_map_ingress_handler (rmnet_handlers.c:48 rmnet_handlers.c:101) rmnet_rx_handler (rmnet_handlers.c:129 rmnet_handlers.c:235) __netif_receive_skb_core.constprop.0 (net/core/dev.c:6096) __netif_receive_skb_one_core (net/core/dev.c:6208) netif_receive_skb (net/core/dev.c:6467) tun_get_user (drivers/net/tun.c:1955) tun_chr_write_iter (drivers/net/tun.c:2003) vfs_write (fs/read_write.c:688) ksys_write (fs/read_write.c:740) </TASK> Add an rcu_head field to struct rmnet_endpoint and replace kfree() with kfree_rcu() so the endpoint memory remains valid through the RCU grace period. Also remove the rmnet_vnd_dellink() call and inline only the nr_rmnet_devs decrement, since rmnet_vnd_dellink() would set ep->egress_dev to NULL during the grace period, creating a data race with lockless readers. Fixes: `ceed73a2cf` ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Link: https://patch.msgid.link/20260514122511.3083479-2-bestswngs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 16:35:52 -07:00
Weiming Shi	9e7f36ab5b	net: appletalk: fix NULL pointer dereference in aarp_send_ddp() aarp_send_ddp() calls atalk_find_dev_addr(dev) in the LocalTalk fast path without checking for NULL. When the device has no AppleTalk interface configured (dev->atalk_ptr == NULL), this leads to a NULL pointer dereference at the at->s_net access. KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:aarp_send_ddp (net/appletalk/aarp.c:552 (discriminator 2)) Call Trace: <TASK> atalk_sendmsg (net/appletalk/ddp.c:1715) __sys_sendto (net/socket.c:2265 (discriminator 1)) __x64_sys_sendto (net/socket.c:2272) do_syscall_64 (arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121) Add a NULL check consistent with the other callers of atalk_find_dev_addr(). Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Link: https://patch.msgid.link/20260514123806.3085961-3-bestswngs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 16:33:34 -07:00
Dragos Tatulea	c326f9c689	net/mlx5e: xsk: Fix unlocked writing to ICOSQ During napi poll, when the affinity changes and there's still XSK work to be done, we trigger an ICOSQ interrupt on the new CPU. However, this triggering on the ICOSQ is done unprotected. There are 2 such races: A) mlx5e_trigger_irq() is called while mlx5e_xsk_alloc_rx_mpwqe() is running from a different CPU due to affinity change. This can happen because IRQ triggering is done after napi_complete_done(). At this point the NAPI can be scheduled on a different CPU. Like this: CPU A (old affinity, NAPI tail) CPU B (new affinity, fresh NAPI) ------------------------------- -------------------------------- napi_complete_done() clears SCHED mlx5e_cq_arm(...) napi_schedule_prep() sets SCHED mlx5e_napi_poll() mlx5e_xsk_alloc_rx_mpwqe() mlx5e_icosq_sync_lock() // noop memcpy 640 B UMR body advance sq->pc by 10 mlx5e_trigger_irq(&c->icosq) wqe_info[pi] = {NOP, 1} mlx5e_post_nop() advances sq->pc B) mlx5e_trigger_irq() is called on the ICOSQ when mlx5e_trigger_napi_icosq() is running. The obvious fix would be to lock the ICOSQ. But ICOSQ has an optimized locking scheme that doesn't work for this scenario. Kick the async ICOSQ instead which is always locked. This issue was noticed in the wild with the following splat: netdevice: ge-0-0-1: Bad OP in ICOSQ CQE: 0xd WARNING: drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:826 [...] [...] Call Trace: <IRQ> mlx5e_napi_poll+0x11d/0x7f0 [mlx5_core] __napi_poll+0x30/0x200 ? skb_defer_free_flush+0x9c/0xc0 net_rx_action+0x2fe/0x3f0 handle_softirqs+0xd8/0x340 __irq_exit_rcu+0xbc/0xe0 common_interrupt+0x85/0xa0 </IRQ> <TASK> asm_common_interrupt+0x26/0x40 [...] ---[ end trace 0000000000000000 ]--- mlx5_core 0000:08:00.0 ge-0-0-1: Error cqe on cqn 0x548, ci 0x2022, qn 0x8f4, opcode 0xd, syndrome 0x2, vendor syndrome 0x68 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000030: 00 00 00 00 01 00 68 02 01 00 08 f4 de 14 59 d2 WQE DUMP: WQ size 16384 WQ cur size 0, WQE index 0x1e14, len: 64 00000000: 00 00 00 01 d9 ed 80 02 00 00 00 01 d9 ed 90 02 00000010: 00 00 00 01 d9 ed a0 02 00 00 00 01 d9 ed b0 02 00000020: 00 00 00 01 d9 ed c0 02 00 00 00 01 d9 ed d0 02 00000030: 00 00 00 01 d9 ed e0 02 00 00 00 01 d9 ed f0 02 mlx5_core 0000:08:00.0 ge-0-0-1: Error cqe on cqn 0x548, ci 0x2023, qn 0x8f4, opcode 0xd, syndrome 0x5, vendor syndrome 0xf9 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000030: 00 00 00 00 01 00 f9 05 01 00 08 f4 de 15 cf d2 Fixes: `db05815b36` ("net/mlx5e: Add XSK zero-copy support") Reported-by: Paul Saab <ps@mu.org> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260513064613.334602-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 16:10:27 -07:00
Haoze Xie	e196115ec3	netfilter: nf_queue: hold bridge skb->dev while queued br_pass_frame_up() rewrites skb->dev from the ingress port to the bridge master before queueing bridge LOCAL_IN packets. NFQUEUE only holds references on state.in/out and bridge physdevs, so a queued bridge packet can retain a freed bridge master in skb->dev until reinjection. When the verdict is reinjected later, br_netif_receive_skb() re-enters the receive path with skb->dev still pointing at the freed bridge master, triggering a use-after-free. Store skb->dev in the queue entry, hold a reference on it for the queue lifetime, and use the saved device when dropping queued packets during NETDEV_DOWN handling. Fixes: `ac28634456` ("netfilter: bridge: add nf_afinfo to enable queuing to userspace") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Haoze Xie <royenheart@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-16 13:23:01 +02:00
Lorenzo Bianconi	b2870fc216	netfilter: br_netfilter: Reallocate headroom if necessary in neigh_hh_bridge() neigh_hh_bridge() assumes the skb always has sufficient headroom to copy the aligned L2 header. This assumption can trigger the crash reported below using the following netfilter setup: $modprobe br_netfilter $sysctl -w net.bridge.bridge-nf-call-iptables=1 $root@OpenWrt:~# nft list ruleset table ip nat { chain prerouting { type nat hook prerouting priority dstnat; policy accept; ip daddr 192.168.83.123 dnat to 192.168.83.120 } } - iperf3 client (192.168.83.119) --> bridge (192.168.83.118) --> iperf3 server (192.168.83.120) the iperf3 client is sending packet for 192.168.83.123 to the bridge device. [ 1579.036575] Unable to handle kernel write to read-only memory at virtual address ffffff8004d76ffe [ 1579.045482] Mem abort info: [ 1579.048273] ESR = 0x000000009600004f [ 1579.052024] EC = 0x25: DABT (current EL), IL = 32 bits [ 1579.057363] SET = 0, FnV = 0 [ 1579.060417] EA = 0, S1PTW = 0 [ 1579.063550] FSC = 0x0f: level 3 permission fault [ 1579.068345] Data abort info: [ 1579.071224] ISV = 0, ISS = 0x0000004f, ISS2 = 0x00000000 [ 1579.076720] CM = 0, WnR = 1, TnD = 0, TagAccess = 0 [ 1579.081770] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 1579.087092] swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000080dc4000 [ 1579.093794] [ffffff8004d76ffe] pgd=180000009ffff003, p4d=180000009ffff003, pud=180000009ffff003, pmd=180000009ffe3003, pte=0060000084d76787 [ 1579.106343] Internal error: Oops: 000000009600004f [#1] SMP [ 1579.193824] CPU: 0 UID: 0 PID: 235 Comm: napi/qdma_eth-3 Tainted: G O 6.12.57 #0 [ 1579.202614] Tainted: [O]=OOT_MODULE [ 1579.206102] Hardware name: Airoha AN7581 Evaluation Board (DT) [ 1579.211929] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 1579.218889] pc : br_nf_pre_routing_finish_bridge+0x1ac/0xcc8 [br_netfilter] [ 1579.225859] lr : br_nf_pre_routing_finish_bridge+0x18c/0xcc8 [br_netfilter] [ 1579.232822] sp : ffffffc0817cba20 [ 1579.236128] x29: ffffffc0817cba20 x28: 0000000000000000 x27: ffffff8002b89000 [ 1579.243273] x26: ffffff8004d7700e x25: 0000000000000008 x24: 0000000000000000 [ 1579.250416] x23: ffffffc08179d4c0 x22: 0000000000000000 x21: ffffffc08179d4c0 [ 1579.257561] x20: ffffff8004d9b800 x19: ffffff8015010000 x18: 0000000000000014 [ 1579.264704] x17: ffffffbf9e930000 x16: ffffffc0817c8000 x15: 0000000000000070 [ 1579.271848] x14: 0000000000000080 x13: 0000000000000001 x12: 0000000000000000 [ 1579.278993] x11: ffffffc0798caae0 x10: ffffff8014db6fd8 x9 : 0000000000000000 [ 1579.286136] x8 : 0000000000000003 x7 : ffffffc08171f628 x6 : 000000001a3b83d3 [ 1579.293281] x5 : 0000000000000000 x4 : 1beb76f22fee0000 x3 : ffffff8004d7700e [ 1579.300425] x2 : 0000000000000000 x1 : ffffff8004d9b8bc x0 : ffffff80026ed000 [ 1579.307570] Call trace: [ 1579.310018] br_nf_pre_routing_finish_bridge+0x1ac/0xcc8 [br_netfilter] [ 1579.316632] br_nf_hook_thresh+0xd4/0x14bc [br_netfilter] [ 1579.322032] br_nf_hook_thresh+0x250/0x14bc [br_netfilter] [ 1579.327517] br_nf_hook_thresh+0x76c/0x14bc [br_netfilter] [ 1579.333003] br_handle_frame+0x180/0x480 [ 1579.336935] __netif_receive_skb_core.constprop.0+0x540/0xf40 [ 1579.342682] __netif_receive_skb_one_core+0x28/0x50 [ 1579.347561] process_backlog+0x98/0x1e0 [ 1579.351398] __napi_poll+0x34/0x1c4 [ 1579.354887] net_rx_action+0x178/0x330 [ 1579.358638] handle_softirqs+0x108/0x2d4 [ 1579.362560] __do_softirq+0x10/0x18 [ 1579.366051] ____do_softirq+0xc/0x20 [ 1579.369627] call_on_irq_stack+0x30/0x4c [ 1579.373550] do_softirq_own_stack+0x18/0x20 [ 1579.377734] do_softirq+0x4c/0x60 [ 1579.381050] __local_bh_enable_ip+0x88/0x98 [ 1579.385234] napi_threaded_poll_loop+0x188/0x21c [ 1579.389853] napi_threaded_poll+0x70/0x80 [ 1579.393863] kthread+0xd8/0xdc [ 1579.396918] ret_from_fork+0x10/0x20 [ 1579.400499] Code: 88dffc22 3707ffc2 f9406663 f9406684 (f81f0064) [ 1579.406589] ---[ end trace 0000000000000000 ]--- [ 1579.411209] Kernel panic - not syncing: Oops: Fatal exception in interrupt [ 1579.418083] SMP: stopping secondary CPUs [ 1579.422012] Kernel Offset: disabled Fix the issue reallocating the skb headroom if necessary in neigh_hh_bridge routine. Fixes: `e179e6322a` ("netfilter: bridge-netfilter: Fix MAC header handling with IP DNAT") Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-16 13:22:50 +02:00
Jozsef Kadlecsik	7f7445840b	netfilter: ipset: annotate "pos" for concurrent readers/writers The "pos" structure member of struct hbucket stores the first free slot in the hash bucket of a hash type of set and there are concurrent readers/writers. Annotate accesses properly. Fixes: `18f84d41d3` ("netfilter: ipset: Introduce RCU locking in hash:* types") Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-16 13:21:42 +02:00

1 2 3 4 5 ...

1445342 Commits