linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-06-03 20:52:29 -04:00

Author	SHA1	Message	Date
Kohei Enju	5acc641e59	igc: set tx buffer type for SMD frames Sashiko pointed out that igc_fpe_init_smd_frame() initializes igc_tx_buffer fields for an SMD skb, but does not set the buffer type: https://sashiko.dev/#/patchset/20260415025226.114115-1-kohei%40enjuk.jp Since igc_tx_buffer entries are reused, a stale XDP or XSK type can remain and make TX completion use the wrong cleanup path. Set the buffer type to IGC_TX_BUFFER_TYPE_SKB. Fixes: `5422570c00` ("igc: add support for frame preemption verification") Signed-off-by: Kohei Enju <kohei@enjuk.jp> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Avigail Dahan <avigailx.dahan@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260515182419.1597859-9-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 19:00:44 -07:00
Michael Bommarito	5d49b568c1	ixgbevf: fix use-after-free in VEPA multicast source pruning ixgbevf_clean_rx_irq() prunes frames whose source MAC matches the VF's own address (VEPA multicast workaround) by freeing the skb and continuing to the next descriptor: dev_kfree_skb_irq(skb); continue; The skb pointer is declared outside the while loop and persists across iterations. Because the continue skips the "skb = NULL" reset at the bottom of the loop, the next iteration enters the "else if (skb)" path and calls ixgbevf_add_rx_frag() on the freed skb, dereferencing skb_shinfo(skb)->nr_frags - a use-after-free in NAPI softirq context. The sibling driver iavf already handles this correctly by nulling the pointer before continuing. Apply the same pattern here. I do not have ixgbevf hardware; the bug was found by static analysis (scan_drop_continue_loops.py + semgrep drop_continue_in_loop, multi-tool corroboration with the highest score in the scan). The UAF was confirmed under KASAN by loading a test module that reproduces the exact code pattern (alloc skb, kfree_skb, then read skb_shinfo(skb)->nr_frags): BUG: KASAN: slab-use-after-free in ixgbevf_uaf_test_init+0x100/0x1000 Read of size 8 at addr 000000006163ae78 by task insmod/30 freed 208-byte region [000000006163adc0, 000000006163ae90) QEMU emulates igb (82576) but not ixgbe (82599), and the igbvf VF driver does not include the VEPA source pruning path, so a full end-to-end reproduction with emulated hardware was not possible. Fixes: `bad17234ba` ("ixgbevf: Change receive model to use double buffered page based receives") Cc: stable@vger.kernel.org Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260515182419.1597859-8-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 19:00:44 -07:00
Grzegorz Nitka	975b564d19	ice: restore PTP Rx timestamp config after ethtool set-channels When ethtool -L changes queue counts, ice_vsi_recfg_qs() closes and rebuilds the VSI, reallocating Rx rings. The newly allocated rings have ptp_rx cleared, so RX hardware timestamps are no longer attached to skb until hwtstamp configuration is applied again. Restore timestamp mode after ice_vsi_open() in the queue reconfiguration path, matching reset/rebuild behavior and ensuring newly rebuilt Rx rings have PTP RX timestamping re-enabled. Testing hints: - run ptp4l application in client synchronization mode: ptp4l -i ethX -m -s - run PTP traffic - change queue number on ethX netdev interface: ethtool -L ethX combined new_queue_size - observe ptp4l output - expected result: no "received DELAY_REQ without timestamp" messages Fixes: `77a781155a` ("ice: enable receive hardware timestamping") Cc: stable@vger.kernel.org Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260515182419.1597859-7-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 19:00:44 -07:00
Grzegorz Nitka	7b28523546	ice: ptp: use primary NAC semaphore on E825 For E825 2xNAC configurations, PTP semaphore operations must hit the primary NAC register block so both sides coordinate on the same lock. Commit `e2193f9f9e` ("ice: enable timesync operation on 2xNAC E825 devices") updated other primary-only PTP register accesses to use the primary NAC on non-primary functions, but left ice_ptp_lock() and ice_ptp_unlock() operating on the local NAC. As a result, secondary NAC PTP paths can take a different semaphore than the primary side. Select the primary hardware in ice_ptp_lock() and ice_ptp_unlock() when the current function is not primary, keeping semaphore operations symmetric and consistent with the rest of the 2xNAC PTP register access path. Fixes: `e2193f9f9e` ("ice: enable timesync operation on 2xNAC E825 devices") Reviewed-by: Arkadiusz Kubalewski <Arkadiusz.kubalewski@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260515182419.1597859-6-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 19:00:44 -07:00
Grzegorz Nitka	781ff8f2d5	ice: ptp: serialize E825 PHY timer start with PTP lock ice_start_phy_timer_eth56g() programs TIMETUS registers and issues INIT_INCVAL without holding the global PTP semaphore. This allows concurrent PTP command paths to interleave with PHY timer start, which can make the sequence fail and leave timer initialization inconsistent. Take the PTP lock around TIMETUS registers programming and INIT_INCVAL command execution, and make sure the lock is released on all error paths. Keep the subsequent sync step outside of this critical section, since ice_sync_phy_timer_eth56g() takes the same semaphore internally. Fixes: `7cab44f1c3` ("ice: Introduce ETH56G PHY model for E825C products") Reviewed-by: Arkadiusz Kubalewski <Arkadiusz.kubalewski@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260515182419.1597859-5-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 19:00:43 -07:00
Marcin Szycik	ebc8de716c	ice: fix setting promisc mode while adding VID filter There are at least two paths through which VSI promiscuous mode can be independently configured via ice_fltr_set_vsi_promisc(): - ice_vlan_rx_add_vid() (netdev op) - ice_service_task() -> ... -> ice_set_promisc() Both paths may try to program promiscuous mode concurrently. One such scenario is: 1. Add ice netdev to bond 2. Add the bond netdev to bridge 3. ice netdev enters allmulticast mode (IFF_ALLMULTI) 4. Service task programs promisc mode filter 5. Bridge -> bond calls ice_vlan_rx_add_vid() Crucially, ice_vlan_rx_add_vid() fails if ice_fltr_set_vsi_promisc() returns any error, including -EEXIST. This causes VLAN filtering setup to fail on the bond interface. ice_set_promisc() already handles -EEXIST correctly. Fix by adding the same -EEXIST check to ice_vlan_rx_add_vid(): if the promisc filter is already programmed, continue without returning error. Fixes: `1273f89578` ("ice: Fix broken IFF_ALLMULTI handling") Cc: stable@vger.kernel.org Signed-off-by: Marcin Szycik <marcin.szycik@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260515182419.1597859-4-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 19:00:43 -07:00
Jose Ignacio Tornos Martinez	3ba4dd024d	ice: fix VF queue configuration with low MTU values The ice driver's VF queue configuration validation rejects databuffer_size values below 1024 bytes, which prevents VFs from using MTU values below 871 bytes. The iavf driver calculates databuffer_size based on the MTU using: databuffer_size = ALIGN(MTU + LIBETH_RX_LL_LEN, 128) where LIBETH_RX_LL_LEN = 26 (ETH_HLEN + 2*VLAN_HLEN + ETH_FCS_LEN). For MTU values below 871: MTU 870: 870 + 26 = 896, aligned to 128 = 896 (< 1024, rejected) MTU 871: 871 + 26 = 897, aligned to 128 = 1024 (>= 1024, accepted) The 1024-byte minimum seems unnecessarily restrictive, because the hardware supports databuffer_size as low as 128 bytes (the alignment boundary), which should allow MTU values down to the standard minimum of 68 bytes. I haven't found the reason why the limit was configured in the commit `9c7dd7566d` ("ice: add validation in OP_CONFIG_VSI_QUEUES VF message"), so with no more information and since it is working, change the minimum databuffer_size validation from 1024 to 128 bytes to allow standard low MTU values while still preventing invalid configurations. Fixes: `9c7dd7566d` ("ice: add validation in OP_CONFIG_VSI_QUEUES VF message") cc: stable@vger.kernel.org Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260515182419.1597859-3-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 19:00:43 -07:00
Jacob Keller	89bbff099b	ice: fix locking around wait_event_interruptible_locked_irq Commit `50327223a8` ("ice: add lock to protect low latency interface") introduced a wait queue used to protect the low latency timer interface. The queue is used with the wait_event_interruptible_locked_irq macro, which unlocks the wait queue lock while sleeping. The irq variant uses spin_lock_irq and spin_unlock_irq to manage this. The wait queue lock was previously locked using spin_lock_irqsave. This difference in lock variants could lead to issues, since wait_event would unlock the wait queue and restore interrupts while sleeping. The ice_read_phy_tstamp_ll_e810() function is ultimately called through ice_read_phy_tstamp, which is called from ice_ptp_process_tx_tstamp or ice_ptp_clear_unexpected_tx_ready. The former is called through the miscellaneous IRQ thread function, while the latter is called from the service task work queue thread. Neither of these functions has interrupts disabled, so use spin_lock_irq instead of spin_lock_irqsave. Fixes: `50327223a8` ("ice: add lock to protect low latency interface") Cc: stable@vger.kernel.org Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20250109181823.77f44c69@kernel.org/ Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260515182419.1597859-2-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 19:00:43 -07:00
Jakub Kicinski	317bbe5301	Merge tag 'nf-26-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for net: 1) Fix small race windows in nf_ct_helper_log() when accessing helper, from Florian Westphal. 2) Fix potential infinite loop and race conditions in IPVS caused by frequent user-triggered service table changes, from Julia Anastasov. 3) Fix a race condition when dumping ipsets for restore, from Jozsef Kadlecsik. 4) Fix inner transport offset in IPv6 in nft_inner when extension headers come before the layer 4 transport header, from Yizhou Zhao. 5) Fix incorrect iteration over IPv4 ranges in several hash set types, from Nan Li. 6) Fix incorrect order when restoring BH in nft_inner_restore_tun_ctx(), from Florian Westphal. 7) Validate option array from ip6t_hbh checkpath() to fix an off-by-one access, from Zhengchuan Liang. 8) Fix race condition between ipset list -terse and concurrent updates, from Jozsef Kadlecisk. 9) Fix race condition when inserting elements into a hash bucket, also from Jozsef. 10) Annotate access to first free slot in hashtable, from Jozsef Kadlecsik. 11) Ensure sufficient headroom in br_netfilter neigh transmission, from Lorenzo Bianconi. 12) Hold reference on skb->dev in nfqueue exit path, bridge local input is speciall since skb->dev != state->indev, allowing for net_device to go away while packet is sitting in nfqueue. From Haoze Xie. * tag 'nf-26-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_queue: hold bridge skb->dev while queued netfilter: br_netfilter: Reallocate headroom if necessary in neigh_hh_bridge() netfilter: ipset: annotate "pos" for concurrent readers/writers netfilter: ipset: Fix data race between add and dump in all hash types netfilter: ipset: Fix data race between add and list header in all hash types netfilter: ip6t_hbh: reject oversized option lists netfilter: nft_inner: release local_lock before re-enabling softirqs netfilter: ipset: stop hash:* range iteration at end netfilter: nft_inner: Fix IPv6 inner_thoff desync netfilter: ipset: fix a potential dump-destroy race ipvs: avoid possible loop in ip_vs_dst_event on resizing netfilter: nf_conntrack_helper: fix possible null deref during error log ==================== Link: https://patch.msgid.link/20260516115627.967773-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 16:59:30 -07:00
Jakub Kicinski	aeff7e8e50	Merge tag 'batadv-net-pullrequest-20260515' of https://git.open-mesh.org/batadv Simon Wunderlich says: ==================== Here are various batman-adv bugfixes: - fix tp_meter counter underflow during shutdown, by Luxiao Xu - fix tp_meter tp_vars reference leak in receiver shutdown, by Sven Eckelmann - fix various translation table integer handling issues, by Sven Eckelmann (3 patches) - fix various translation table counter issues, by Sven Eckelmann (3 patches) - fix fragment reassembly length accounting, by Ruide Cao - clear current gateway during teardown, by Ruijie Li - handle forward allocation error in DAT, by Sven Eckelmann - tp_meter: avoid use of uninitialized sender variables in tp_meter, by Sven Eckelmann - disallow unicast fragment in fragment, by Sven Eckelmann - directly shut down tp_meter timer on cleanup, by Sven Eckelmann * tag 'batadv-net-pullrequest-20260515' of https://git.open-mesh.org/batadv: batman-adv: tp_meter: directly shut down timer on cleanup batman-adv: frag: disallow unicast fragment in fragment batman-adv: tp_meter: avoid use of uninit sender vars batman-adv: dat: handle forward allocation error batman-adv: clear current gateway during teardown batman-adv: fix fragment reassembly length accounting batman-adv: tt: prevent TVLV entry number overflow batman-adv: tt: avoid empty VLAN responses batman-adv: tt: fix TOCTOU race for reported vlans batman-adv: tt: fix negative last_changeset_len batman-adv: tt: fix negative tt_buff_len batman-adv: tt: reject oversized local TVLV buffers batman-adv: tp_meter: fix tp_vars reference leak in receiver shutdown batman-adv: fix tp_meter counter underflow during shutdown ==================== Link: https://patch.msgid.link/20260515095540.325586-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 16:50:04 -07:00
Jakub Kicinski	23f3b04f15	Merge tag 'for-net-2026-05-14' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - af_bluetooth: serialize accept_q access - L2CAP: ecred_reconfigure: send packed pdu, not stack pointer - btmtk: accept too short WMT FUNC_CTRL events - hci_qca: Convert timeout from jiffies to ms * tag 'for-net-2026-05-14' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: hci_qca: Convert timeout from jiffies to ms Bluetooth: L2CAP: ecred_reconfigure: send packed pdu, not stack pointer Bluetooth: btmtk: accept too short WMT FUNC_CTRL events Bluetooth: serialize accept_q access ==================== Link: https://patch.msgid.link/20260514172340.1515042-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 16:40:05 -07:00
Ilya Maximets	bae3ee802c	openvswitch: vport: fix race between linking and the device notifier Sashiko reports that it is technically possible that we got the device reference, but by the time we're linking it to the OVS datapath, it may be already in the process of being deleted. In this case if the notifier wins the race for RTNL, it will see that the device is not yet in the OVS datapath (ovs_netdev_get_vport() will fail in the dp_device_event()) and will do nothing. Then the ovs_netdev_link() will take the RTNL and link the unregistering device to OVS datapath. Eventually, netdev_wait_allrefs_any() will re-broadcast the event and the device will be properly detached, but it will take at least a second before that happens, so it's not something we should rely on. Let's avoid linking the non-registered device in the first place. Note: As per documentation, RTNL doesn't protect the reg_state, but it actually does for all the state transitions we care about here, so it should not be necessary to use READ_ONCE or taking the instance lock. We can still do that, but we have a few more places even in this file where the reg_state is accessed without those while under RTNL, and many more places like this across the kernel code, so it might make more sense to change all of them in a more centralized fashion in the future, if necessary. Fixes: `ccb1352e76` ("net: Add Open vSwitch kernel components.") Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Aaron Conole <aconole@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Link: https://patch.msgid.link/20260514184702.2461435-1-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 16:38:45 -07:00
Weiming Shi	d00c953a8f	net: qualcomm: rmnet: fix endpoint use-after-free in rmnet_dellink() rmnet_dellink() removes the endpoint from the hash table with hlist_del_init_rcu() and then immediately frees it with kfree(). However, RCU readers on the receive path (rmnet_rx_handler -> __rmnet_map_ingress_handler) may still hold a reference to the endpoint and dereference ep->egress_dev after the memory has been freed. The endpoint is a kmalloc-32 object, and the stale read at offset 8 corresponds to the egress_dev pointer. BUG: unable to handle page fault for address: ffffffffde942eef Oops: 0002 [#1] SMP NOPTI CPU: 1 UID: 0 PID: 137 Comm: poc_write Not tainted 7.0.0+ #4 PREEMPTLAZY RIP: 0010:rmnet_vnd_rx_fixup (rmnet_vnd.c:27) Call Trace: <TASK> __rmnet_map_ingress_handler (rmnet_handlers.c:48 rmnet_handlers.c:101) rmnet_rx_handler (rmnet_handlers.c:129 rmnet_handlers.c:235) __netif_receive_skb_core.constprop.0 (net/core/dev.c:6096) __netif_receive_skb_one_core (net/core/dev.c:6208) netif_receive_skb (net/core/dev.c:6467) tun_get_user (drivers/net/tun.c:1955) tun_chr_write_iter (drivers/net/tun.c:2003) vfs_write (fs/read_write.c:688) ksys_write (fs/read_write.c:740) </TASK> Add an rcu_head field to struct rmnet_endpoint and replace kfree() with kfree_rcu() so the endpoint memory remains valid through the RCU grace period. Also remove the rmnet_vnd_dellink() call and inline only the nr_rmnet_devs decrement, since rmnet_vnd_dellink() would set ep->egress_dev to NULL during the grace period, creating a data race with lockless readers. Fixes: `ceed73a2cf` ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Link: https://patch.msgid.link/20260514122511.3083479-2-bestswngs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 16:35:52 -07:00
Weiming Shi	9e7f36ab5b	net: appletalk: fix NULL pointer dereference in aarp_send_ddp() aarp_send_ddp() calls atalk_find_dev_addr(dev) in the LocalTalk fast path without checking for NULL. When the device has no AppleTalk interface configured (dev->atalk_ptr == NULL), this leads to a NULL pointer dereference at the at->s_net access. KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:aarp_send_ddp (net/appletalk/aarp.c:552 (discriminator 2)) Call Trace: <TASK> atalk_sendmsg (net/appletalk/ddp.c:1715) __sys_sendto (net/socket.c:2265 (discriminator 1)) __x64_sys_sendto (net/socket.c:2272) do_syscall_64 (arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121) Add a NULL check consistent with the other callers of atalk_find_dev_addr(). Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Link: https://patch.msgid.link/20260514123806.3085961-3-bestswngs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 16:33:34 -07:00
Dragos Tatulea	c326f9c689	net/mlx5e: xsk: Fix unlocked writing to ICOSQ During napi poll, when the affinity changes and there's still XSK work to be done, we trigger an ICOSQ interrupt on the new CPU. However, this triggering on the ICOSQ is done unprotected. There are 2 such races: A) mlx5e_trigger_irq() is called while mlx5e_xsk_alloc_rx_mpwqe() is running from a different CPU due to affinity change. This can happen because IRQ triggering is done after napi_complete_done(). At this point the NAPI can be scheduled on a different CPU. Like this: CPU A (old affinity, NAPI tail) CPU B (new affinity, fresh NAPI) ------------------------------- -------------------------------- napi_complete_done() clears SCHED mlx5e_cq_arm(...) napi_schedule_prep() sets SCHED mlx5e_napi_poll() mlx5e_xsk_alloc_rx_mpwqe() mlx5e_icosq_sync_lock() // noop memcpy 640 B UMR body advance sq->pc by 10 mlx5e_trigger_irq(&c->icosq) wqe_info[pi] = {NOP, 1} mlx5e_post_nop() advances sq->pc B) mlx5e_trigger_irq() is called on the ICOSQ when mlx5e_trigger_napi_icosq() is running. The obvious fix would be to lock the ICOSQ. But ICOSQ has an optimized locking scheme that doesn't work for this scenario. Kick the async ICOSQ instead which is always locked. This issue was noticed in the wild with the following splat: netdevice: ge-0-0-1: Bad OP in ICOSQ CQE: 0xd WARNING: drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:826 [...] [...] Call Trace: <IRQ> mlx5e_napi_poll+0x11d/0x7f0 [mlx5_core] __napi_poll+0x30/0x200 ? skb_defer_free_flush+0x9c/0xc0 net_rx_action+0x2fe/0x3f0 handle_softirqs+0xd8/0x340 __irq_exit_rcu+0xbc/0xe0 common_interrupt+0x85/0xa0 </IRQ> <TASK> asm_common_interrupt+0x26/0x40 [...] ---[ end trace 0000000000000000 ]--- mlx5_core 0000:08:00.0 ge-0-0-1: Error cqe on cqn 0x548, ci 0x2022, qn 0x8f4, opcode 0xd, syndrome 0x2, vendor syndrome 0x68 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000030: 00 00 00 00 01 00 68 02 01 00 08 f4 de 14 59 d2 WQE DUMP: WQ size 16384 WQ cur size 0, WQE index 0x1e14, len: 64 00000000: 00 00 00 01 d9 ed 80 02 00 00 00 01 d9 ed 90 02 00000010: 00 00 00 01 d9 ed a0 02 00 00 00 01 d9 ed b0 02 00000020: 00 00 00 01 d9 ed c0 02 00 00 00 01 d9 ed d0 02 00000030: 00 00 00 01 d9 ed e0 02 00 00 00 01 d9 ed f0 02 mlx5_core 0000:08:00.0 ge-0-0-1: Error cqe on cqn 0x548, ci 0x2023, qn 0x8f4, opcode 0xd, syndrome 0x5, vendor syndrome 0xf9 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000030: 00 00 00 00 01 00 f9 05 01 00 08 f4 de 15 cf d2 Fixes: `db05815b36` ("net/mlx5e: Add XSK zero-copy support") Reported-by: Paul Saab <ps@mu.org> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260513064613.334602-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-18 16:10:27 -07:00
Haoze Xie	e196115ec3	netfilter: nf_queue: hold bridge skb->dev while queued br_pass_frame_up() rewrites skb->dev from the ingress port to the bridge master before queueing bridge LOCAL_IN packets. NFQUEUE only holds references on state.in/out and bridge physdevs, so a queued bridge packet can retain a freed bridge master in skb->dev until reinjection. When the verdict is reinjected later, br_netif_receive_skb() re-enters the receive path with skb->dev still pointing at the freed bridge master, triggering a use-after-free. Store skb->dev in the queue entry, hold a reference on it for the queue lifetime, and use the saved device when dropping queued packets during NETDEV_DOWN handling. Fixes: `ac28634456` ("netfilter: bridge: add nf_afinfo to enable queuing to userspace") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Haoze Xie <royenheart@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-16 13:23:01 +02:00
Lorenzo Bianconi	b2870fc216	netfilter: br_netfilter: Reallocate headroom if necessary in neigh_hh_bridge() neigh_hh_bridge() assumes the skb always has sufficient headroom to copy the aligned L2 header. This assumption can trigger the crash reported below using the following netfilter setup: $modprobe br_netfilter $sysctl -w net.bridge.bridge-nf-call-iptables=1 $root@OpenWrt:~# nft list ruleset table ip nat { chain prerouting { type nat hook prerouting priority dstnat; policy accept; ip daddr 192.168.83.123 dnat to 192.168.83.120 } } - iperf3 client (192.168.83.119) --> bridge (192.168.83.118) --> iperf3 server (192.168.83.120) the iperf3 client is sending packet for 192.168.83.123 to the bridge device. [ 1579.036575] Unable to handle kernel write to read-only memory at virtual address ffffff8004d76ffe [ 1579.045482] Mem abort info: [ 1579.048273] ESR = 0x000000009600004f [ 1579.052024] EC = 0x25: DABT (current EL), IL = 32 bits [ 1579.057363] SET = 0, FnV = 0 [ 1579.060417] EA = 0, S1PTW = 0 [ 1579.063550] FSC = 0x0f: level 3 permission fault [ 1579.068345] Data abort info: [ 1579.071224] ISV = 0, ISS = 0x0000004f, ISS2 = 0x00000000 [ 1579.076720] CM = 0, WnR = 1, TnD = 0, TagAccess = 0 [ 1579.081770] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 1579.087092] swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000080dc4000 [ 1579.093794] [ffffff8004d76ffe] pgd=180000009ffff003, p4d=180000009ffff003, pud=180000009ffff003, pmd=180000009ffe3003, pte=0060000084d76787 [ 1579.106343] Internal error: Oops: 000000009600004f [#1] SMP [ 1579.193824] CPU: 0 UID: 0 PID: 235 Comm: napi/qdma_eth-3 Tainted: G O 6.12.57 #0 [ 1579.202614] Tainted: [O]=OOT_MODULE [ 1579.206102] Hardware name: Airoha AN7581 Evaluation Board (DT) [ 1579.211929] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 1579.218889] pc : br_nf_pre_routing_finish_bridge+0x1ac/0xcc8 [br_netfilter] [ 1579.225859] lr : br_nf_pre_routing_finish_bridge+0x18c/0xcc8 [br_netfilter] [ 1579.232822] sp : ffffffc0817cba20 [ 1579.236128] x29: ffffffc0817cba20 x28: 0000000000000000 x27: ffffff8002b89000 [ 1579.243273] x26: ffffff8004d7700e x25: 0000000000000008 x24: 0000000000000000 [ 1579.250416] x23: ffffffc08179d4c0 x22: 0000000000000000 x21: ffffffc08179d4c0 [ 1579.257561] x20: ffffff8004d9b800 x19: ffffff8015010000 x18: 0000000000000014 [ 1579.264704] x17: ffffffbf9e930000 x16: ffffffc0817c8000 x15: 0000000000000070 [ 1579.271848] x14: 0000000000000080 x13: 0000000000000001 x12: 0000000000000000 [ 1579.278993] x11: ffffffc0798caae0 x10: ffffff8014db6fd8 x9 : 0000000000000000 [ 1579.286136] x8 : 0000000000000003 x7 : ffffffc08171f628 x6 : 000000001a3b83d3 [ 1579.293281] x5 : 0000000000000000 x4 : 1beb76f22fee0000 x3 : ffffff8004d7700e [ 1579.300425] x2 : 0000000000000000 x1 : ffffff8004d9b8bc x0 : ffffff80026ed000 [ 1579.307570] Call trace: [ 1579.310018] br_nf_pre_routing_finish_bridge+0x1ac/0xcc8 [br_netfilter] [ 1579.316632] br_nf_hook_thresh+0xd4/0x14bc [br_netfilter] [ 1579.322032] br_nf_hook_thresh+0x250/0x14bc [br_netfilter] [ 1579.327517] br_nf_hook_thresh+0x76c/0x14bc [br_netfilter] [ 1579.333003] br_handle_frame+0x180/0x480 [ 1579.336935] __netif_receive_skb_core.constprop.0+0x540/0xf40 [ 1579.342682] __netif_receive_skb_one_core+0x28/0x50 [ 1579.347561] process_backlog+0x98/0x1e0 [ 1579.351398] __napi_poll+0x34/0x1c4 [ 1579.354887] net_rx_action+0x178/0x330 [ 1579.358638] handle_softirqs+0x108/0x2d4 [ 1579.362560] __do_softirq+0x10/0x18 [ 1579.366051] ____do_softirq+0xc/0x20 [ 1579.369627] call_on_irq_stack+0x30/0x4c [ 1579.373550] do_softirq_own_stack+0x18/0x20 [ 1579.377734] do_softirq+0x4c/0x60 [ 1579.381050] __local_bh_enable_ip+0x88/0x98 [ 1579.385234] napi_threaded_poll_loop+0x188/0x21c [ 1579.389853] napi_threaded_poll+0x70/0x80 [ 1579.393863] kthread+0xd8/0xdc [ 1579.396918] ret_from_fork+0x10/0x20 [ 1579.400499] Code: 88dffc22 3707ffc2 f9406663 f9406684 (f81f0064) [ 1579.406589] ---[ end trace 0000000000000000 ]--- [ 1579.411209] Kernel panic - not syncing: Oops: Fatal exception in interrupt [ 1579.418083] SMP: stopping secondary CPUs [ 1579.422012] Kernel Offset: disabled Fix the issue reallocating the skb headroom if necessary in neigh_hh_bridge routine. Fixes: `e179e6322a` ("netfilter: bridge-netfilter: Fix MAC header handling with IP DNAT") Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-16 13:22:50 +02:00
Jozsef Kadlecsik	7f7445840b	netfilter: ipset: annotate "pos" for concurrent readers/writers The "pos" structure member of struct hbucket stores the first free slot in the hash bucket of a hash type of set and there are concurrent readers/writers. Annotate accesses properly. Fixes: `18f84d41d3` ("netfilter: ipset: Introduce RCU locking in hash:* types") Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-16 13:21:42 +02:00
Jozsef Kadlecsik	2358f7427c	netfilter: ipset: Fix data race between add and dump in all hash types When adding a new entry to the next position in the existing hash bucket, the position index was incremented too early and parallel dump could read it before the entry was populated with the value. Move the setting of the position index after populating the entry. v2: Position counting fixed, noticed by Florian Westphal. Fixes: `18f84d41d3` ("netfilter: ipset: Introduce RCU locking in hash:* types") Reported-by: syzbot+786c889f046e8b003ca6@syzkaller.appspotmail.com Reported-by: syzbot+1da17e4b41d795df059e@syzkaller.appspotmail.com Reported-by: syzbot+421c5f3ff8e9493084d9@syzkaller.appspotmail.com Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-16 13:21:42 +02:00
Jozsef Kadlecsik	c0c42a0fb2	netfilter: ipset: Fix data race between add and list header in all hash types The "ipset list -terse" command is actually a dump operation which may run parallel with "ipset add" commands, which can trigger an internal resizing of the hash type of sets just being dumped. However, dumping just the header part of the set was not protected against underlying resizing. Fix it by protecting the header dumping part as well. Fixes: `c4c997839c` ("netfilter: ipset: Fix parallel resizing and listing of the same set") Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-16 13:21:41 +02:00
Zhengchuan Liang	4322dcde6b	netfilter: ip6t_hbh: reject oversized option lists struct ip6t_opts stores at most IP6T_OPTS_OPTSNR option descriptors, but hbh_mt6_check() does not reject larger optsnr values supplied from userspace. Validate optsnr in the rule setup path so only match data that fits the fixed-size opts array can be installed. This follows the existing xtables pattern of rejecting invalid user-provided counts in checkentry() and keeps the packet matching path unchanged. `struct ip6t_opts` has a fixed `opts[IP6T_OPTS_OPTSNR]` array, where `IP6T_OPTS_OPTSNR` is 16, then off-by-one array access is possible: [ 137.924693][ T8692] UBSAN: array-index-out-of-bounds in ../net/ipv6/netfilter/ip6t_hbh.c:110:29 [ 137.926167][ T8692] index 16 is out of range for type '__u16 [16]' Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Zhengchuan Liang <zcliangcn@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-16 13:21:41 +02:00
Florian Westphal	a6cb3ff979	netfilter: nft_inner: release local_lock before re-enabling softirqs Quoting sashiko: In the error path, local_bh_enable() is called before local_unlock_nested_bh(). Fixes: `ba36fada9a` ("netfilter: nft_inner: Use nested-BH locking for nft_pcpu_tun_ctx") Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-16 13:21:41 +02:00
Nan Li	0d3a282ab5	netfilter: ipset: stop hash:* range iteration at end The following hash set variants: hash:ip,mark hash:ip,port hash:ip,port,ip hash:ip,port,net iterate IPv4 ranges with a 32-bit iterator. The iterator must stop once the last address in the requested range has been processed. Advancing it once more can move the traversal state past the end of the request, so a later retry may continue from an unintended position. Handle the iterator increment explicitly at the end of the loop and stop once the upper bound has been processed. This keeps the existing retry behaviour intact for valid ranges while preventing traversal from continuing past the original boundary. Fixes: `48596a8ddc` ("netfilter: ipset: Fix adding an IPv4 range containing more than 2^31 addresses") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Nan Li <tonanli66@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-16 13:21:15 +02:00
Yizhou Zhao	b6a91f68eb	netfilter: nft_inner: Fix IPv6 inner_thoff desync In nft_inner_parse_l2l3(), when processing inner IPv6 packets, ipv6_find_hdr() correctly computes the transport header offset traversing all extension headers, but the result is immediately overwritten with nhoff + sizeof(_ip6h) (40 bytes), which only accounts for the IPv6 base header. This creates a desync between inner_thoff (wrong — points to extension header start) and l4proto (correct — e.g., IPPROTO_TCP), enabling transport header forgery and potential firewall bypass. This issue affects stable versions from Linux 6.2. For comparison, the normal (non-inner) IPv6 path correctly preserves ipv6_find_hdr()'s result. Removing the incorrect overwrite ensures that ipv6_find_hdr()'s calculated transport header offset is preserved, thereby fixing the desynchronization. Fixes: `3a07327d10` ("netfilter: nft_inner: support for inner tunnel header matching") Cc: stable@vger.kernel.org Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Assisted-by: GLM:5.1 Z.ai Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-16 12:19:56 +02:00
Jozsef Kadlecsik	53d7fd878c	netfilter: ipset: fix a potential dump-destroy race When dumping sets in order to create the proper order for restore, the list type of sets dumped last. Therefore internally we run the dumping loop twice: first with all non-list type of sets and skipping the list type ones and then secondly for the list type of sets. Sashiko noticed that there's a potential race between dump and destroy if in the first loop the last set was a list type of set: its pointer remains unreferenced and a concurrent destroy can free it. Fix the issue by resetting the variable holding the pointer. Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-16 12:19:56 +02:00
Julian Anastasov	5522d65d81	ipvs: avoid possible loop in ip_vs_dst_event on resizing Sashiko points out that unprivileged user can frequently call ip_vs_flush() or ip_vs_del_service() to trigger svc_table_changes updates that can lead to infinite loop in ip_vs_dst_event(). This can also happen if the user triggers frequent table resizing without deleting all services. We should also consider the possible effects if the user triggers many NETDEV_DOWN events. One way to solve it is to hold svc_resize_sem in ip_vs_dst_event() but this can block the dev notifier during the whole resizing process. Instead, use new rw_semaphore svc_replace_sem to protect just the svc_table replacement which is a short code section. Then hold svc_replace_sem in ip_vs_dst_event() to serialize with replacing the svc_table. As result, loop is avoided as there is no need to repeat the table walking from the start. By this way changes in svc_table_changes can happen only when all services are removed and all dev references dropped which allows us to abort the table walking. As IP_VS_WORK_SVC_NORESIZE is the flag used to stop the svc_resize_work under service_mutex, we should check only this flag often but not while under service_mutex. To remove the mutex_trylock() for service_mutex in the second phase where the resizer installs the new table after rehashing, we will avoid holding the service_mutex there. As result, the code in configuration context which is under service_mutex should access ipvs->svc_table under RCU because it can be replaced at anytime and released after a RCU grace period. As for ip_vs_zero_all(), it needs different solution as a table walker which can escape single RCU read-side critical section: to hold the svc_replace_sem to prevent table to be replaced. In ip_vs_status_show() prefer to hold svc_replace_sem to avoid many loops, just detect if the svc_table is removed. Prefer the newly attached table for the u_thresh/l_thresh checks to know when to grow/shrink while adding or deleting services because the new table size is based on the latest parameters. Link: https://sashiko.dev/#/patchset/20260505001648.360569-1-pablo%40netfilter.org Fixes: `840aac3d90` ("ipvs: use resizable hash table for services") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-16 12:19:56 +02:00
Florian Westphal	1afc25ae75	netfilter: nf_conntrack_helper: fix possible null deref during error log Reported by sashiko: there is a small race window. If a helper module is unloaded or a userspace-defined helper is removed, nf_conntrack_helper_unregister() sets ->helper to NULL. Handle this safely. This needs a second patch to close related race during nf_conntrack_helper_unregister(). Fixes: `b20ab9cc63` ("netfilter: nf_ct_helper: better logging for dropped packets") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-16 12:19:56 +02:00
Michael Bommarito	aaec7096f9	net: hsr: defer node table free until after RCU readers HSR node-list and node-status generic-netlink operations run under rcu_read_lock(). They walk hsr->node_db through hsr_get_next_node() and hsr_get_node_data(), but RTM_DELLINK teardown removes the same node table with plain list_del() and frees each node immediately. That lets a generic-netlink reader hold a struct hsr_node pointer across hsr_dellink(). In a KASAN build, widening the reader window after hsr_get_next_node() obtains the node reproduces a slab-use-after-free when the reader copies node->macaddress_A; the freeing stack is hsr_del_nodes() from hsr_dellink(). Use list_del_rcu() and defer the free through the existing hsr_free_node_rcu() callback. This matches the lifetime rule used by the HSR prune paths, which already delete nodes with list_del_rcu() and call_rcu(). Fixes: `b9a1e62740` ("hsr: implement dellink to clean up resources") Cc: stable@vger.kernel.org # v5.3+ Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Link: https://patch.msgid.link/20260513233838.3064715-2-michael.bommarito@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-15 18:25:26 -07:00
Stefano Garzarella	ae38d91791	vsock/virtio: fix zerocopy completion for multi-skb sends When a large message is fragmented into multiple skbs, the zerocopy uarg is only allocated and attached to the last skb in the loop. Non-final skbs carry pinned user pages with no completion tracking, so the kernel has no way to notify userspace when those pages are safe to reuse. If the loop breaks early the uarg is never allocated at all, leaking pinned pages with no completion notification. Fix this by following the approach used by TCP: allocate the zerocopy uarg (if not provided by the caller) before the send loop and attach it to every skb via skb_zcopy_set(), which takes a reference per skb. Each skb's completion properly decrements the refcount, and the notification only fires after the last skb is freed. On failure, if no data was sent, the uarg is cleanly aborted via net_zcopy_put_abort(). This issue was initially discovered by sashiko while reviewing commit `1cb36e2522` ("vsock/virtio: fix MSG_ZEROCOPY pinned-pages accounting") but was pre-existing. Fixes: `581512a6dc` ("vsock/virtio: MSG_ZEROCOPY flag support") Closes: https://sashiko.dev/#/patchset/20260420132051.217589-1-sgarzare%40redhat.com Reported-by: Maher Azzouzi <maherazz04@gmail.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Arseniy Krasnov <avkrasnov@salutedevices.com> Link: https://patch.msgid.link/20260514092948.268720-1-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-15 17:38:15 -07:00
Sam Daly	c0bf0a4f3f	octeontx2-af: CGX: add bounds check to cgx_speed_mbps index cgx_speed_mbps has 13 elements but RESP_LINKSTAT_SPEED can yield values 0-15. If it returns a value >= 13, this causes an out-of-bounds array access. Add a bounds check and default to speed 0 if the index is out of range. Fixes: `61071a871e` ("octeontx2-af: Forward CGX link notifications to PFs") Cc: Sunil Goutham <sgoutham@marvell.com> Cc: Linu Cherian <lcherian@marvell.com> Cc: Geetha sowjanya <gakula@marvell.com> Cc: hariprasad <hkelam@marvell.com> Cc: Subbaraya Sundeep <sbhatta@marvell.com> Cc: Andrew Lunn <andrew+netdev@lunn.ch> Cc: stable <stable@kernel.org> Signed-off-by: Sam Daly <sam@samdaly.ie> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://patch.msgid.link/2026051352-refined-demise-e88d@gregkh Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-15 17:22:43 -07:00
Dragos Tatulea	cfd08f0972	IB/IPoIB: ndo_set_rx_mode_async conversion The commit in the fixes tag added a warning for devices that are netdev ops locked that they should be converted to .ndo_set_rx_mode_async. IPoIB for mlx5 is such a driver which was missed during the conversion because the flow is more complex: - mlx5 part of IPoIB device was converted to ops-lock in commit [1]. - ipoib_intf_init() then overrides netdev_ops with ipoib_netdev_ops_{pf,vf}, which still wired ndo_set_rx_mode to the legacy sync path -- tripping the new warning on every probe. So now we have the following splat: netdevice: ib0 (uninitialized): ops-locked drivers should use ndo_set_rx_mode_async WARNING: net/core/dev.c:11366 at register_netdevice+0x83c/0x21d0 ... register_netdev+0x1f/0x40 ipoib_add_one+0x35c/0x880 [ib_ipoib] This patch implements .ndo_set_rx_mode_async but it simply schedules the multicast restart task like before. This is done to maintain the assumption that this task and others [2] must run on the same order workqueue to avoid racing with themselves. The race between ipoib_mcast_join_task() and ipoib_mcast_restart_task() would be the most obvious example. [1] `8f7b00307b`, "net/mlx5e: Convert mlx5 netdevs to instance locking") [2] ipoib_mcast_join_task, ipoib_mcast_restart_task, ipoib_mcast_carrier_on_task, ipoib_reap_ah, ipoib_reap_neigh Fixes: `3cbd229388` ("net: warn ops-locked drivers still using ndo_set_rx_mode") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Acked-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://patch.msgid.link/20260513124519.3357165-1-dtatulea@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-15 17:16:33 -07:00
Michael Bommarito	915fab6982	ipv4: raw: reject IP_HDRINCL packets with ihl < 5 raw_send_hdrinc() validates that the caller-supplied IPv4 header fits within the message length: iphlen = iph->ihl * 4; err = -EINVAL; if (iphlen > length) goto error_free; if (iphlen >= sizeof(iph)) { / fix up saddr, tot_len, id, csum, transport_header / } It does not, however, reject ihl < 5. For such a packet the "if (iphlen >= sizeof(iph))" branch is skipped, leaving the crafted iphdr untouched, but the packet is still handed to __ip_local_out() and onward. Downstream consumers that read iph->ihl assume a sane value: net/ipv4/ah4.c:ah_output() in particular subtracts sizeof(struct iphdr) from top_iph->ihl * 4 and passes the (signed-int-negative, then cast to size_t) result to memcpy(), producing an OOB access of length close to SIZE_MAX and a host kernel panic. An IPv4 header with ihl < 5 is malformed by definition (RFC 791: "Internet Header Length is the length of the internet header in 32 bit words ... Note that the minimum value for a correct header is 5."). The kernel should not be willing to inject such a packet into its own output path. Reject "iphlen < sizeof(iph)" alongside the existing "iphlen > length" check. This matches the principle that locally constructed packets that re-enter the IP stack must pass the same basic sanity tests that a foreign packet would be subjected to. Once this lands, the "if (iphlen >= sizeof(iph))" wrapper around the fixup branch becomes redundant; left in place to keep the patch minimal and backport-friendly. A follow-up can unwrap it. Note that commit `86f4c90a1c` ("ipv4, ipv6: ensure raw socket message is big enough to hold an IP header") ensures the message buffer is large enough to hold an iphdr, but does not constrain the self-reported iph->ihl. Reachability: the malformed packet source is any caller with CAP_NET_RAW, including an unprivileged process in a user+net namespace on a kernel with CONFIG_USER_NS=y. The reproduced AH crash also requires a matching xfrm AH policy on the outgoing route; a container granted CAP_NET_ADMIN can install that state and policy in its netns. Loopback bypasses xfrm_output, so the trigger uses a real netdev. Reproduced on UML + KASAN: kernel-mode fault at addr 0x0 with memcpy_orig at the crash site. Same shape reproduces inside a rootless Docker container with --cap-add NET_ADMIN on a stock distro kernel. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Suggested-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Link: https://patch.msgid.link/77ec2b5e8111961c2c39883c92e8aa2709039c17.1778614451.git.michael.bommarito@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-15 15:55:02 -07:00
Sven Eckelmann	d5487249a8	batman-adv: tp_meter: directly shut down timer on cleanup batadv_tp_sender_cleanup() was calling timer_delete_sync() followed by timer_delete() to guard against the timer handler re-arming itself between the two calls. This double-deletion hack relied on the sending status being set to 0 to suppress re-arming. Replace both calls with a single timer_shutdown_sync(). This function both waits for any running timer callback to complete (like timer_delete_sync()) and permanently disarms the timer so it cannot be re-armed afterwards, making re-arming prevention unconditional and self-documenting. The re-arming property is also required because otherwise: 1. context 0 (batadv_tp_recv_ack()) checks in batadv_tp_reset_sender_timer() if sending is still 1 -> it is 2. context 1 changes in batadv_tp_sender_shutdown() sending to 0 and in this process forces the kthread to stop timer in batadv_tp_sender_cleanup() 3. context 0 continues in batadv_tp_reset_sender_timer() and rearms the timer -> but the reference for it is already gone Cc: stable@kernel.org Fixes: `33a3bb4a33` ("batman-adv: throughput meter implementation") Signed-off-by: Sven Eckelmann <sven@narfation.org>	2026-05-15 10:41:55 +02:00
Sven Eckelmann	bc62216dc8	batman-adv: frag: disallow unicast fragment in fragment batadv_frag_skb_buffer() is called by batadv_batman_skb_recv() when a BATADV_UNICAST_FRAG packet is received. Once all fragments are collected and the packet is reassembled, batadv_recv_frag_packet() calls batadv_batman_skb_recv() again to process the defragmented payload. A malicious sender can craft a BATADV_UNICAST_FRAG packet whose reassembled payload is itself a BATADV_UNICAST_FRAG packet (matryoshka-style nesting). Each nesting level recurses through batadv_batman_skb_recv() without bound, growing the kernel stack until it is exhausted. Since refragmentation or fragments in fragments are not actually allowed, discard all packets which are still BATADV_UNICAST_FRAG packets after the defragmentation process. Cc: stable@kernel.org Fixes: `610bfc6bc9` ("batman-adv: Receive fragmented packets and merge") Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Reviewed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Sven Eckelmann <sven@narfation.org>	2026-05-15 10:41:49 +02:00
Michael Bommarito	5db89c9956	net: ifb: report ethtool stats over num_tx_queues ifb_dev_init() allocates dp->tx_private to dev->num_tx_queues entries via kzalloc_objs(*txp, dev->num_tx_queues). Both IFB per-queue RX and TX stats live in those entries: ifb_xmit() updates txp->rx_stats using the skb queue mapping, ifb_ri_tasklet() updates txp->tx_stats, and ifb_stats64() aggregates both over dev->num_tx_queues. The ethtool stats callbacks instead size and walk the per-queue stats with dev->real_num_rx_queues and dev->real_num_tx_queues. With an asymmetric device where the RX queue count exceeds the TX queue count, for example: ip link add name ifb10 numtxqueues 1 numrxqueues 8 type ifb ethtool -S ifb10 ifb_get_ethtool_stats() indexes past the tx_private allocation and copies adjacent slab data through ETHTOOL_GSTATS. Use dev->num_tx_queues consistently for the stats strings, the stats count, and the stats data walks. This reports one RX stats group and one TX stats group for each backing ifb_q_private entry, which is the queue set IFB can actually populate. Reproduced under UML+KASAN at v7.1-rc2: BUG: KASAN: slab-out-of-bounds in ifb_fill_stats_data+0x3c/0xae Read of size 8 at addr 0000000062dbd228 by task ethtool/36 ifb_fill_stats_data+0x3c/0xae ifb_get_ethtool_stats+0xc0/0x129 __dev_ethtool+0x1ca5/0x363c dev_ethtool+0x123/0x1b3 dev_ioctl+0x56c/0x744 sock_do_ioctl+0x15f/0x1b2 sock_ioctl+0x4d5/0x50a sys_ioctl+0xd8b/0xde9 With the patch applied, the same UML+KASAN repro is silent and ethtool -S ifb10 reports only the stats backed by the single allocated tx_private entry. Fixes: `a21ee5b2fc` ("net: ifb: support ethtools stats") Cc: stable@vger.kernel.org Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Link: https://patch.msgid.link/20260514013739.3549624-1-michael.bommarito@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-14 18:40:04 -07:00
Or Har-Toov	c6df9a65cb	net/mlx5: Skip disabled vports when setting max TX speed When setting vports max TX speed during LAG activation or bond state changes, the code iterates over all eswitch vports. However, some vports may not be enabled yet. Skip vports that are not enabled to avoid sending FW commands for uninitialized vports. Save the LAG aggregated speed in the vport struct so it can be applied when the vport is enabled later. Fixes: `50f1d188c5` ("net/mlx5: Propagate LAG effective max_tx_speed to vports") Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260513063640.334132-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-14 18:35:36 -07:00
Jeroen Massar	8d0a5af8b1	net/mlx5: Do not restore destination-less TC rules After IPsec policy/state TX rules are added, any TC flow rule, which forwards packets to uplink, is modified to forward to IPsec TX tables. As these tables are destroyed dynamically, whenever there is no reference to them, the destinations of this kind of rules must be restored to uplink, unless there is no destination for that rule. The flow rules FLOW_ACTION_ACCEPT, DROP, TRAP, GOTO and SAMPLE do not have a destination port, and thus out_count = 0. At cleanup time of the rules in mlx5_esw_ipsec_modify_flow_dests we call mlx5_eswitch_restore_ipsec_rule but as the above types do not have a destination we get an underflow of out_count, as the port is passed, which is esw_attr->out_count - 1. This change avoids calling mlx5_eswitch_restore_ipsec_rule when there are no output destinations and thus avoids the underflow. Fixes: `d1569537a8` ("net/mlx5e: Modify and restore TC rules for IPSec TX rules") Signed-off-by: Jeroen Massar <jmassar@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260513063302.333761-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-14 18:34:20 -07:00
Gal Pressman	c9d08c8c4c	net/mlx5e: Don't leak RSS context in case of error If mlx5e_rx_res_rss_set_rxfh() fails during mlx5e_create_rxfh_context(), the RSS context is not cleaned up. This leaves a stale entry in 'res->rss[rss_idx]' that occupies a context slot. Destroy the RSS context before returning the error. Fixes: `6c2509d446` ("net/mlx5e: Add error flow for ethtool -X command") Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Nimrod Oren <noren@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260513062737.333259-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-14 18:33:48 -07:00
Chuck Lever	f508262ae9	tls: Preserve sk_err across recvmsg() when data has been copied The sk_err check in tls_rx_rec_wait() consumes the error via sock_error(), which clears sk_err atomically. When the caller (tls_sw_recvmsg, tls_sw_splice_read, or tls_sw_read_sock) already has bytes copied to userspace, it returns those bytes and discards the error from this call. sk_err is now zero on the socket, so the next read syscall observes only RCV_SHUTDOWN and reports a clean EOF instead of the actual error (typically -ECONNRESET). The race is reachable when tls_read_flush_backlog()'s periodic sk_flush_backlog() triggers tcp_reset() in the middle of a multi-record read. Pass a has_copied flag to tls_rx_rec_wait(). When has_copied is false, consume sk_err via sock_error() as before. When has_copied is true, report the error from READ_ONCE() but leave sk_err set: the caller returns the byte count and discards the err from this call, and the next read syscall surfaces the preserved sk_err. This mirrors the tcp_recvmsg() preserve-and-surface pattern. The decrypt-abort path is unaffected: tls_err_abort() raises sk_err to EBADMSG after tls_rx_rec_wait() returns, and nothing on the caller's return path consumes it, so the EBADMSG surfaces on the next read. tls_sw_splice_read() passes has_copied=false: it processes one record per call, so no bytes have been copied within the function when tls_rx_rec_wait() runs. A reset that arrives between iterations of splice_direct_to_actor() (the sendfile() path) is still consumed by sock_error() in the later call, and the outer loop returns the prior iterations' byte count and drops the error. tcp_splice_read() exhibits the same pattern at the iteration boundary; addressing it belongs at the splice_direct_to_actor() layer and is out of scope here. Fixes: `c46b01839f` ("tls: rx: periodically flush socket backlog") Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20260513125825.205189-1-cel@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-14 18:19:44 -07:00
Dawei Feng	e8fb3de2a8	octeontx2-pf: fix double free in rvu_rep_rsrc_init() rvu_rep_rsrc_init() allocates queue memory before calling otx2_init_hw_resources(). When hardware resource setup fails, otx2_init_hw_resources() already unwinds the partially initialized SQ, CQ, and aura state before returning an error. The representor error path then calls otx2_free_hw_resources() again and can free the same resources a second time. Fix this by splitting the cleanup labels so that a failure from otx2_init_hw_resources() only releases queue memory. Keep the otx2_free_hw_resources() call for failures that happen after hardware resource initialization completed successfully. The bug was first flagged by an experimental analysis tool we are developing for kernel memory-management bugs while analyzing v6.13-rc1. The tool is still under development and is not yet publicly available. Manual inspection confirms that the bug is still present in v7.1-rc3. Runtime validation was not performed because reproducing this path requires OcteonTX2 representor hardware. Fixes: `3937b7308d` ("octeontx2-pf: Create representor netdev") Cc: stable@vger.kernel.org # v6.13+ Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn> Reviewed-by: Geetha sowjanya <gakula@marvell.com> Link: https://patch.msgid.link/20260513151320.213260-1-dawei.feng@seu.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-14 18:18:09 -07:00
William Bowling	f84eca5817	net: skbuff: preserve shared-frag marker during coalescing skb_try_coalesce() can attach paged frags from @from to @to. If @from has SKBFL_SHARED_FRAG set, the resulting @to skb can contain the same externally-owned or page-cache-backed frags, but the shared-frag marker is currently lost. That breaks the invariant relied on by later in-place writers. In particular, ESP input checks skb_has_shared_frag() before deciding whether an uncloned nonlinear skb can skip skb_cow_data(). If TCP receive coalescing has moved shared frags into an unmarked skb, ESP can see skb_has_shared_frag() as false and decrypt in place over page-cache backed frags. Propagate SKBFL_SHARED_FRAG when skb_try_coalesce() transfers paged frags. The tailroom copy path does not need the marker because it copies bytes into @to's linear data rather than transferring frag descriptors. Fixes: `cef401de7b` ("net: fix possible wrong checksum generation") Fixes: `f4c50a4034` ("xfrm: esp: avoid in-place decrypt on shared skb frags") Signed-off-by: William Bowling <vakzz@zellic.io> Reviewed-by: Eric Dumazet <edumazet@google.com> Tested-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://patch.msgid.link/20260513041635.1289541-1-vakzz@zellic.io Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-14 17:22:00 -07:00
Matt Fleming	7d260c5d2d	net/mlx5e: Fix use-after-free in mlx5e_tx_reporter_timeout_recover mlx5e_tx_reporter_timeout_recover() accesses sq->netdev after mlx5e_safe_reopen_channels() has torn down and freed the channel (and its embedded SQs). Replace the three sq->netdev references with priv->netdev which is safe because priv outlives channel teardown. The netdev_err() call already used priv->netdev for this reason; make the trylock/unlock and health_channel_eq_recover calls consistent. This fixes the following KASAN splat: BUG: KASAN: use-after-free in mlx5e_tx_reporter_timeout_recover+0x1dd/0x360 [mlx5_core] Read of size 8 at addr ffff889860ed0b28 by task kworker/u113:2/5277 Call Trace: mlx5e_tx_reporter_timeout_recover+0x1dd/0x360 [mlx5_core] devlink_health_reporter_recover+0xa2/0x150 devlink_health_report+0x254/0x7c0 mlx5e_reporter_tx_timeout+0x297/0x380 [mlx5_core] mlx5e_tx_timeout_work+0x109/0x170 [mlx5_core] process_one_work+0x677/0xf20 worker_thread+0x51f/0xd90 kthread+0x3a5/0x810 ret_from_fork+0x208/0x400 ret_from_fork_asm+0x1a/0x30 Fixes: `83ac0304a2` ("net/mlx5e: Fix deadlocks between devlink and netdev instance locks") Cc: stable@vger.kernel.org Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Matt Fleming <mfleming@cloudflare.com> Link: https://patch.msgid.link/20260513112226.140512-1-matt@readmodwrite.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-14 17:17:11 -07:00
Maoyi Xie	d2bfdbb69c	rds_tcp: close NULL deref window in rds_tcp_set_callbacks rds_tcp_set_callbacks() links a new rds_tcp_connection onto rds_tcp_tc_list under rds_tcp_tc_list_lock. It releases the lock, then assigns tc->t_sock = sock outside the lock. rds_tcp_tc_info() and rds6_tcp_tc_info() walk rds_tcp_tc_list under the same lock. Both dereference tc->t_sock->sk without a NULL check. A reader can acquire rds_tcp_tc_list_lock between the writer's spin_unlock and the t_sock store. It then sees a list entry whose t_sock is NULL. The dereference of tc->t_sock->sk is a NULL access. Move tc->t_sock = sock inside rds_tcp_tc_list_lock, before list_add_tail. A reader holding the lock then observes the linkage and the t_sock store together. The restore path is safe. rds_tcp_restore_callbacks() does list_del_init inside the lock. The matching tc->t_sock = NULL after unlink is harmless to readers holding the lock. Fixes: `70041088e3` ("RDS: Add TCP transport to RDS") Suggested-by: Simon Horman <horms@kernel.org> Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Reviewed-by: Allison Henderson <achender@kernel.org> Link: https://patch.msgid.link/20260512142807.1855619-1-maoyi.xie@ntu.edu.sg Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-14 17:06:59 -07:00
Sven Eckelmann	6c65cf23d4	batman-adv: tp_meter: avoid use of uninit sender vars batadv_tp_recv_ack() and batadv_tp_stop() are only valid for tp_vars in the BATADV_TP_SENDER role. When called with a BATADV_TP_RECEIVER role, it proceeds to read sender-only members that were never initialized, leading to undefined behavior. This can be triggered when a node that is currently acting as a receiver in an ongoing tp_meter session receives a malicious ACK packet. Guard against this by checking tp_vars->role immediately after the lookup and bailing out if it is not BATADV_TP_SENDER, before any of those members are accessed. Cc: stable@kernel.org Fixes: `33a3bb4a33` ("batman-adv: throughput meter implementation") Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Reviewed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Sven Eckelmann <sven@narfation.org>	2026-05-14 20:01:31 +02:00
Sven Eckelmann	2d8826a2d3	batman-adv: dat: handle forward allocation error batadv_dat_forward_data() calls pskb_copy_for_clone() to duplicate an skb for each DHT candidate, but does not check the return value before passing it to batadv_send_skb_prepare_unicast_4addr(). That function dereferences the skb unconditionally, so a failed allocation triggers a NULL pointer dereference. Skip forwarding to the current DHT candidate on allocation failure. Cc: stable@kernel.org Fixes: `785ea11441` ("batman-adv: Distributed ARP Table - create DHT helper functions") Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Reviewed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Sven Eckelmann <sven@narfation.org>	2026-05-14 20:01:31 +02:00
Ruijie Li	a340a51ed8	batman-adv: clear current gateway during teardown batadv_gw_node_free() removes the gateway list entries during mesh teardown, but it does not clear the currently selected gateway. This leaves stale gateway state behind across cleanup and can break a later mesh recreation. Clear bat_priv->gw.curr_gw before walking the gateway list so the selected gateway reference is dropped as part of teardown. Fixes: `2265c14108` ("batman-adv: gateway election code refactoring") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Ruijie Li <ruijieli51@gmail.com> Signed-off-by: Zhanpeng Li <lzhanpeng2025@lzu.edu.cn> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Sven Eckelmann <sven@narfation.org>	2026-05-14 18:48:40 +02:00
Ruide Cao	9cd3f16c32	batman-adv: fix fragment reassembly length accounting batman-adv keeps a running payload length for queued fragments and uses it to validate a fragment chain before reassembly. That accounting currently allows the accumulated fragment length to be truncated during updates. As a result, malformed fragment chains can bypass the intended validation and drive reassembly with inconsistent length state, leading to a local denial of service. Fix the accounting by storing the accumulated length in a length-typed field and rejecting update overflows before the existing validation logic runs. The fix was verified against the original reproducer and against valid fragment reassembly paths. Fixes: `610bfc6bc9` ("batman-adv: Receive fragmented packets and merge") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Ruide Cao <caoruide123@gmail.com> Tested-by: Ren Wei <enjou1224z@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Sven Eckelmann <sven@narfation.org>	2026-05-14 18:33:29 +02:00
Linus Torvalds	66182ca873	Merge tag 'net-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter. Previous releases - regressions: - ethtool: fix NULL pointer dereference in phy_reply_size - netfilter: - allocate hook ops while under mutex - close dangling table module init race - restore nf_conntrack helper propagation via expectation - tcp: - fix potential UAF in reqsk_timer_handler(). - fix out-of-bounds access for twsk in tcp_ao_established_key(). - vsock: fix empty payload in tap skb for non-linear buffers - hsr: fix NULL pointer dereference in hsr_get_node_data() - eth: - cortina: fix RX drop accounting - ice: fix locking in ice_dcb_rebuild() Previous releases - always broken: - napi: avoid gro timer misfiring at end of busypoll - sched: - dualpi2: initialize timer earlier in dualpi2_init() - sch_cbs: Call qdisc_reset for child qdisc - shaper: - fix ordering issue in net_shaper_commit() - reject handle IDs exceeding internal bit-width - ipv6: flowlabel: enforce per-netns limit for unprivileged callers - tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring - smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint - sctp: revalidate list cursor after sctp_sendmsg_to_asoc() in SCTP_SENDALL - batman-adv: - reject new tp_meter sessions during teardown - purge non-released claims - eth: - i40e: cleanup PTP registration on probe failure - idpf: fix double free and use-after-free in aux device error paths - ena: fix potential use-after-free in get_timestamp" * tag 'net-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits) net: phy: DP83TC811: add reading of abilities net: tls: prevent chain-after-chain in plain text SG net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot macsec: use rcu_work to defer TX SA crypto cleanup out of softirq macsec: use rcu_work to defer RX SA crypto cleanup out of softirq macsec: introduce dedicated workqueue for SA crypto cleanup net: net_failover: Fix the deadlock in slave register MAINTAINERS: update atlantic driver maintainer selftests/tc-testing: Add QFQ/CBS qlen underflow test net/sched: sch_cbs: Call qdisc_reset for child qdisc FDDI: defza: Sanitise the reset safety timer net: ethernet: ravb: Do not check URAM suspension when WoL is active ethtool: fix ethnl_bitmap32_not_zero() bit interval semantics net/smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint net/smc: fix sleep-inside-lock in __smc_setsockopt() causing local DoS net: atm: fix skb leak in sigd_send() default branch net: ethtool: phy: avoid NULL deref when PHY driver is unbound net: atlantic: preserve PCI wake-from-D3 on shutdown when WOL enabled net: shaper: reject QUEUE scope handle with missing id ...	2026-05-14 08:57:43 -07:00
Linus Torvalds	eb5441518f	Merge tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit fixes from Paul Moore: - Correctly log the inheritable capabilities - Honor AUDIT_LOCKED in the AUDIT_TRIM and AUDIT_MAKE_EQUIV commands * tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: enforce AUDIT_LOCKED for AUDIT_TRIM and AUDIT_MAKE_EQUIV audit: fix incorrect inheritable capability in CAPSET records	2026-05-14 08:53:24 -07:00
Linus Torvalds	31e62c2ebb	ptrace: slightly saner 'get_dumpable()' logic The 'dumpability' of a task is fundamentally about the memory image of the task - the concept comes from whether it can core dump or not - and makes no sense when you don't have an associated mm. And almost all users do in fact use it only for the case where the task has a mm pointer. But we have one odd special case: ptrace_may_access() uses 'dumpable' to check various other things entirely independently of the MM (typically explicitly using flags like PTRACE_MODE_READ_FSCREDS). Including for threads that no longer have a VM (and maybe never did, like most kernel threads). It's not what this flag was designed for, but it is what it is. The ptrace code does check that the uid/gid matches, so you do have to be uid-0 to see kernel thread details, but this means that the traditional "drop capabilities" model doesn't make any difference for this all. Make it all make a bit more sense by saying that if you don't have a MM pointer, we'll use a cached "last dumpability" flag if the thread ever had a MM (it will be zero for kernel threads since it is never set), and require a proper CAP_SYS_PTRACE capability to override. Reported-by: Qualys Security Advisory <qsa@qualys.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-05-14 08:32:11 -07:00

1 2 3 4 5 ...

1445305 Commits