The SMA/U.FL pin redesign (commit 2dd5d03c77 ("ice: redesign dpll
sma/u.fl pins control")) introduced software-controlled pins that wrap
backing CGU input/output pins, but never updated the notification and
data paths to propagate pin events to these SW wrappers.
The periodic work sends dpll_pin_change_ntf() notifications only for direct CGU input
pins. SW pins that wrap these inputs never receive change or phase
offset notifications, so userspace consumers such as synce4l monitoring
SMA pins via dpll netlink never learn about state transitions or phase
offset updates. Similarly, ice_dpll_phase_offset_get() reads the SW
pin's own phase_offset field which is never updated; the PPS monitor
writes to the backing CGU input's field instead.
Fix by introducing ice_dpll_pin_ntf(), a wrapper around
dpll_pin_change_ntf() that also notifies any registered SMA/U.FL pin
whose backing CGU input matches. Replace all direct
dpll_pin_change_ntf() calls in the periodic notification paths with
this wrapper. Fix ice_dpll_phase_offset_get() to return the backing
CGU input's phase_offset for input-direction SW pins.
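The wrapper's fan-out can be sketched as a small userspace model (the pin layout and names below are stand-ins for the driver's structures, not the actual ice code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model: a SW pin wraps a backing CGU input; notifying the CGU pin
 * must also notify every SW wrapper whose backing input matches. */
struct pin {
	struct pin *backing;	/* backing CGU input, NULL for CGU pins */
	bool notified;
};

static void pin_change_ntf(struct pin *p)	/* dpll_pin_change_ntf() */
{
	p->notified = true;
}

/* Sketch of the ice_dpll_pin_ntf() wrapper described above. */
static void pin_ntf(struct pin *cgu, struct pin **sw, size_t n)
{
	pin_change_ntf(cgu);
	for (size_t i = 0; i < n; i++)
		if (sw[i]->backing == cgu)
			pin_change_ntf(sw[i]);
}
```

Replacing the direct dpll_pin_change_ntf() calls with this wrapper is what lets SMA/U.FL consumers see the same events as the backing input.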
Fixes: 2dd5d03c77 ("ice: redesign dpll sma/u.fl pins control")
Signed-off-by: Petr Oros <poros@redhat.com>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-10-cdcb48303fd8@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
SMA and U.FL pins share physical signal paths in pairs (SMA1/U.FL1 and
SMA2/U.FL2) controlled by the PCA9575 GPIO expander. Each pair can
only have one active pin at a time: SMA1 output and U.FL1 output share
the same CGU output, SMA2 input and U.FL2 input share the same CGU
input. The PCA9575 register bits determine which connector in each
pair owns the signal path.
The driver does not account for this pairing in two places:
ice_dpll_ufl_pin_state_set() modifies PCA9575 bits and disables the
backing CGU pin without checking whether the U.FL pin is currently
active. Disconnecting an already inactive U.FL pin flips bits that
the paired SMA pin relies on, breaking its connection.
ice_dpll_sma_direction_set() does not propagate direction changes to
the paired U.FL pin. For SMA2/U.FL2 the ICE_SMA2_UFL2_RX_DIS bit is
never managed, so U.FL2 stays disconnected after SMA2 switches to
output. For both pairs the backing CGU pin of the U.FL side is never
enabled when a direction change activates it, so userspace sees the
pin as disconnected even though the routing is correct.
Fix by guarding the U.FL disconnect path against inactive pins and by
updating the paired U.FL pin fully on SMA direction changes: manage
ICE_SMA2_UFL2_RX_DIS for the SMA2/U.FL2 pair and enable the backing
CGU pin whenever the peer becomes active.
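The disconnect guard can be modeled in a few lines (struct layout and names are hypothetical; the real state lives in PCA9575 register bits):

```c
#include <assert.h>
#include <stdbool.h>

/* Model: each SMA/U.FL pair shares one signal path, so disconnecting
 * an already-inactive U.FL pin must not flip the routing bits the
 * paired SMA pin relies on. */
struct pair {
	bool sma_active;
	bool ufl_active;
	unsigned int path_bits;		/* shared PCA9575 routing bits */
};

static void ufl_set_disconnected(struct pair *p)
{
	if (!p->ufl_active)
		return;			/* guard added by the fix */
	p->ufl_active = false;
	p->path_bits = 0;		/* safe: U.FL owned the path */
}
```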
Fixes: 2dd5d03c77 ("ice: redesign dpll sma/u.fl pins control")
Signed-off-by: Petr Oros <poros@redhat.com>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-8-cdcb48303fd8@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The DPLL SMA/U.FL pin redesign introduced ice_dpll_sw_pin_frequency_get()
which gates frequency reporting on the pin's active flag. This flag is
determined by ice_dpll_sw_pins_update() from the PCA9575 GPIO expander
state. Before the redesign, SMA pins were exposed as direct HW
input/output pins and ice_dpll_frequency_get() returned the CGU
frequency unconditionally — the PCA9575 state was never consulted.
The PCA9575 powers on with all outputs high, setting ICE_SMA1_DIR_EN,
ICE_SMA1_TX_EN, ICE_SMA2_DIR_EN and ICE_SMA2_TX_EN. Nothing in the
driver writes the register during initialization, so
ice_dpll_sw_pins_update() sees all pins as inactive and
ice_dpll_sw_pin_frequency_get() permanently returns 0 Hz for every
SW pin.
Fix this by writing a default SMA configuration in
ice_dpll_init_info_sw_pins(): clear all SMA bits, then set SMA1 and
SMA2 as active inputs (DIR_EN=0) with U.FL1 output and U.FL2 input
disabled. Each SMA/U.FL pair shares a physical signal path so only
one pin per pair can be active at a time. U.FL pins still report
frequency 0 after this fix: U.FL1 (output-only) is disabled by
ICE_SMA1_TX_EN which keeps the TX output buffer off, and U.FL2
(input-only) is disabled by ICE_SMA2_UFL2_RX_DIS. They can be
activated by changing the corresponding SMA pin direction via dpll
netlink.
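The default configuration can be sketched as a bitmask helper; the bit positions below are hypothetical stand-ins for the driver's PCA9575 defines, only the set/clear relationships mirror the commit text:

```c
#include <assert.h>
#include <stdint.h>

#define SMA1_DIR_EN		(1u << 0)	/* 1 = SMA1 is an output */
#define SMA1_TX_EN		(1u << 1)	/* 1 = U.FL1 TX buffer off */
#define SMA2_UFL2_RX_DIS	(1u << 3)	/* 1 = U.FL2 RX disabled */
#define SMA2_DIR_EN		(1u << 4)	/* 1 = SMA2 is an output */
#define SMA2_TX_EN		(1u << 5)

/* Sketch of the default written by ice_dpll_init_info_sw_pins():
 * clear all SMA bits, keep SMA1/SMA2 as inputs (DIR_EN = 0), and
 * disable the U.FL1 output and U.FL2 input. */
static uint8_t sma_default_cfg(void)
{
	uint8_t v = 0;			/* clear all SMA bits first */

	v |= SMA1_TX_EN;		/* U.FL1 output disabled */
	v |= SMA2_UFL2_RX_DIS;		/* U.FL2 input disabled */
	return v;			/* DIR_EN bits stay 0: inputs */
}
```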
Fixes: 2dd5d03c77 ("ice: redesign dpll sma/u.fl pins control")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-7-cdcb48303fd8@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
On certain E810 configurations where firmware supports Tx scheduler
topology switching (tx_sched_topo_comp_mode_en), ice_cfg_tx_topo()
may need to apply a new 5-layer or 9-layer topology from the DDP
package. If the AQ command to set the topology fails (e.g. due to
invalid DDP data or firmware limitations), the global configuration
lock must still be cleared via a CORER reset.
Commit 86aae43f21 ("ice: don't leave device non-functional if Tx
scheduler config fails") correctly fixed this by refactoring
ice_cfg_tx_topo() to always trigger CORER after acquiring the global
lock and re-initialize hardware via ice_init_hw() afterwards.
However, commit 8a37f9e2ff ("ice: move ice_deinit_dev() to the end
of deinit paths") later moved ice_init_dev_hw() into ice_init_hw(),
breaking the reinit path introduced by 86aae43f21. This creates an
infinite recursive call chain:
ice_init_hw()
  ice_init_dev_hw()
    ice_cfg_tx_topo()          # topology change needed
      ice_deinit_hw()
      ice_init_hw()            # reinit after CORER
        ice_init_dev_hw()      # recurse
          ice_cfg_tx_topo()
            ...                # stack overflow
Fix by moving ice_init_dev_hw() back out of ice_init_hw() and calling
it explicitly from ice_probe() and ice_devlink_reinit_up(). The third
caller, ice_cfg_tx_topo(), intentionally does not need ice_init_dev_hw()
during its reinit; it only needs the core HW reinitialization. This
breaks the recursion cleanly without adding flags or guards.
The deinit ordering changes from commit 8a37f9e2ff ("ice: move
ice_deinit_dev() to the end of deinit paths") which fixed slow rmmod
are preserved; only the init-side placement of ice_init_dev_hw() is
reverted.
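The resulting call graph can be modeled in userspace (names loosely mirror the driver functions; bodies are stubs):

```c
#include <assert.h>

static int topo_entries;

static int init_hw(void)		/* ice_init_hw(): core HW only */
{
	return 0;
}

static int cfg_tx_topo(void)		/* ice_cfg_tx_topo() */
{
	topo_entries++;
	/* ... set topology via AQ, trigger CORER ... */
	return init_hw();		/* core reinit, no recursion */
}

static int init_dev_hw(void)		/* ice_init_dev_hw() */
{
	return cfg_tx_topo();
}

static int probe(void)			/* ice_probe() */
{
	int err = init_hw();

	if (err)
		return err;
	return init_dev_hw();		/* now called explicitly */
}
```

Because init_hw() no longer calls init_dev_hw(), the reinit inside cfg_tx_topo() terminates after one pass.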
Fixes: 8a37f9e2ff ("ice: move ice_deinit_dev() to the end of deinit paths")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-6-cdcb48303fd8@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
ice_reset_all_vfs() ignores the return value of ice_vf_rebuild_vsi().
When the VSI rebuild fails (e.g. during NVM firmware update via
nvmupdate64e), ice_vsi_rebuild() tears down the VSI on its error path,
leaving txq_map and rxq_map as NULL. The subsequent unconditional call
to ice_vf_post_vsi_rebuild() leads to a NULL pointer dereference in
ice_ena_vf_q_mappings() when it accesses vsi->txq_map[0].
The single-VF reset path in ice_reset_vf() already handles this
correctly by checking the return value of ice_vf_reconfig_vsi() and
skipping ice_vf_post_vsi_rebuild() on failure.
Apply the same pattern to ice_reset_all_vfs(): check the return value
of ice_vf_rebuild_vsi() and skip ice_vf_post_vsi_rebuild() and
ice_eswitch_attach_vf() on failure. The VF is left safely disabled
(ICE_VF_STATE_INIT not set, VFGEN_RSTAT not set to VFACTIVE) and can
be recovered via a VFLR triggered by a PCI reset of the VF
(sysfs reset or driver rebind).
Note that this patch does not prevent the VF VSI rebuild from failing
during NVM update — the underlying cause is firmware being in a
transitional state while the EMP reset is processed, which can cause
Admin Queue commands (ice_add_vsi, ice_cfg_vsi_lan) to fail. This
patch only prevents the subsequent NULL pointer dereference that
crashes the kernel when the rebuild does fail.
crash> bt
PID: 50795 TASK: ff34c9ee708dc680 CPU: 1 COMMAND: "kworker/u512:5"
#0 [ff72159bcfe5bb50] machine_kexec at ffffffffaa8850ee
#1 [ff72159bcfe5bba8] __crash_kexec at ffffffffaaa15fba
#2 [ff72159bcfe5bc68] crash_kexec at ffffffffaaa16540
#3 [ff72159bcfe5bc70] oops_end at ffffffffaa837eda
#4 [ff72159bcfe5bc90] page_fault_oops at ffffffffaa893997
#5 [ff72159bcfe5bce8] exc_page_fault at ffffffffab528595
#6 [ff72159bcfe5bd10] asm_exc_page_fault at ffffffffab600bb2
[exception RIP: ice_ena_vf_q_mappings+0x79]
RIP: ffffffffc0a85b29 RSP: ff72159bcfe5bdc8 RFLAGS: 00010206
RAX: 00000000000f0000 RBX: ff34c9efc9c00000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000010 RDI: ff34c9efc9c00000
RBP: ff34c9efc27d4828 R8: 0000000000000093 R9: 0000000000000040
R10: ff34c9efc27d4828 R11: 0000000000000040 R12: 0000000000100000
R13: 0000000000000010 R14: R15:
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ff72159bcfe5bdf8] ice_sriov_post_vsi_rebuild at ffffffffc0a85e2e [ice]
#8 [ff72159bcfe5be08] ice_reset_all_vfs at ffffffffc0a920b4 [ice]
#9 [ff72159bcfe5be48] ice_service_task at ffffffffc0a31519 [ice]
#10 [ff72159bcfe5be88] process_one_work at ffffffffaa93dca4
#11 [ff72159bcfe5bec8] worker_thread at ffffffffaa93e9de
#12 [ff72159bcfe5bf18] kthread at ffffffffaa946663
#13 [ff72159bcfe5bf50] ret_from_fork at ffffffffaa8086b9
The panic occurs attempting to dereference the NULL pointer in RDX at
ice_sriov.c:294, which loads vsi->txq_map (offset 0x4b8 in ice_vsi).
The faulting VSI is an allocated slab object but not fully initialized
after a failed ice_vsi_rebuild():
crash> struct ice_vsi 0xff34c9efc27d4828
netdev = 0x0,
rx_rings = 0x0,
tx_rings = 0x0,
q_vectors = 0x0,
txq_map = 0x0,
rxq_map = 0x0,
alloc_txq = 0x10,
num_txq = 0x10,
alloc_rxq = 0x10,
num_rxq = 0x10,
The nvmupdate64e process was performing NVM firmware update:
crash> bt 0xff34c9edd1a30000
PID: 49858 TASK: ff34c9edd1a30000 CPU: 1 COMMAND: "nvmupdate64e"
#0 [ff72159bcd617618] __schedule at ffffffffab5333f8
#4 [ff72159bcd617750] ice_sq_send_cmd at ffffffffc0a35347 [ice]
#5 [ff72159bcd6177a8] ice_sq_send_cmd_retry at ffffffffc0a35b47 [ice]
#6 [ff72159bcd617810] ice_aq_send_cmd at ffffffffc0a38018 [ice]
#7 [ff72159bcd617848] ice_aq_read_nvm at ffffffffc0a40254 [ice]
#8 [ff72159bcd6178b8] ice_read_flat_nvm at ffffffffc0a4034c [ice]
#9 [ff72159bcd617918] ice_devlink_nvm_snapshot at ffffffffc0a6ffa5 [ice]
dmesg:
ice 0000:13:00.0: firmware recommends not updating fw.mgmt, as it
may result in a downgrade. continuing anyways
ice 0000:13:00.1: ice_init_nvm failed -5
ice 0000:13:00.1: Rebuild failed, unload and reload driver
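The guarded loop can be sketched as a small userspace model (struct and helper names are stand-ins, not driver code):

```c
#include <assert.h>
#include <stdbool.h>

/* Model of ice_reset_all_vfs() after the fix: a failed VSI rebuild
 * skips the post-rebuild steps, so the torn-down VSI (txq_map == NULL)
 * is never dereferenced and the VF stays safely disabled. */
struct vf {
	bool rebuild_ok;	/* whether ice_vf_rebuild_vsi() succeeds */
	bool active;		/* set by ice_vf_post_vsi_rebuild() */
};

static int vf_rebuild_vsi(struct vf *vf)
{
	return vf->rebuild_ok ? 0 : -5;		/* -EIO */
}

static void reset_all_vfs(struct vf *vfs, int n)
{
	for (int i = 0; i < n; i++) {
		if (vf_rebuild_vsi(&vfs[i]))
			continue;	/* guard: leave the VF disabled */
		vfs[i].active = true;	/* post_vsi_rebuild + eswitch */
	}
}
```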
Fixes: 12bb018c53 ("ice: Refactor VF reset")
Signed-off-by: Petr Oros <poros@redhat.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-5-cdcb48303fd8@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The V1 ADD_VLAN opcode had no success handler; filters sent via V1
stayed in ADDING state permanently. Add a fallthrough case so V1
filters also transition ADDING -> ACTIVE on PF confirmation.
Critically, add an `if (v_retval) break` guard: the error switch in
iavf_virtchnl_completion() does NOT return after handling errors,
it falls through to the success switch. Without this guard, a
PF-rejected ADD would incorrectly mark ADDING filters as ACTIVE,
creating a driver/HW mismatch where the driver believes the filter
is installed but the PF never accepted it.
For V2, this is harmless: iavf_vlan_add_reject() in the error
block already kfree'd all ADDING filters, so the success handler
finds nothing to transition.
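The completion logic can be modeled as follows (enum and function names are illustrative, not the iavf identifiers):

```c
#include <assert.h>

enum fstate { FS_ADDING, FS_ACTIVE };

/* Model of the ADD_VLAN completion after the fix: bail out on a PF
 * error before the success handling, and let both V1 and V2 filters
 * transition ADDING -> ACTIVE on success. */
static void add_vlan_completion(int v_retval, enum fstate *f, int n)
{
	if (v_retval)
		return;	/* rejected ADD must not activate filters */
	for (int i = 0; i < n; i++)
		if (f[i] == FS_ADDING)
			f[i] = FS_ACTIVE;
}
```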
Fixes: 968996c070 ("iavf: Fix VLAN_V2 addition/rejection")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-4-cdcb48303fd8@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The VLAN filter DELETE path was asymmetric with the ADD path: ADD
waits for PF confirmation (ADD -> ADDING -> ACTIVE), but DELETE
immediately frees the filter struct after sending the DEL message
without waiting for the PF response.
This is problematic because:
- If the PF rejects the DEL, the filter remains in HW but the driver
has already freed the tracking structure, losing sync.
- Race conditions between DEL pending and other operations
(add, reset) cannot be properly resolved if the filter struct
is already gone.
Add IAVF_VLAN_REMOVING state to make the DELETE path symmetric:
REMOVE -> REMOVING (send DEL) -> PF confirms -> kfree
                              -> PF rejects  -> ACTIVE
In iavf_del_vlans(), transition filters from REMOVE to REMOVING
instead of immediately freeing them. The new DEL completion handler
in iavf_virtchnl_completion() frees filters on success or reverts
them to ACTIVE on error.
Update iavf_add_vlan() to handle the REMOVING state: if a DEL is
pending and the user re-adds the same VLAN, queue it for ADD so
it gets re-programmed after the PF processes the DEL.
The !VLAN_FILTERING_ALLOWED early-exit path still frees filters
directly since no PF message is sent in that case.
Also update iavf_del_vlan() to skip filters already in REMOVING
state: DEL has been sent to PF and the completion handler will
free the filter when PF confirms. Without this guard, the sequence
DEL(pending) -> user-del -> second DEL could cause the PF to return
an error for the second DEL (filter already gone), causing the
completion handler to incorrectly revert a deleted filter back to
ACTIVE.
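The state transitions above can be sketched as two pure functions (FS_FREED stands in for kfree of the filter struct; names are illustrative):

```c
#include <assert.h>

enum fstate { FS_ACTIVE, FS_REMOVE, FS_REMOVING, FS_FREED };

/* iavf_del_vlans(): send DEL and move REMOVE -> REMOVING; a filter
 * already in REMOVING is skipped, avoiding a second DEL. */
static enum fstate del_send(enum fstate s)
{
	return s == FS_REMOVE ? FS_REMOVING : s;
}

/* DEL completion: free on PF confirm, revert to ACTIVE on PF reject. */
static enum fstate del_complete(enum fstate s, int v_retval)
{
	if (s != FS_REMOVING)
		return s;
	return v_retval ? FS_ACTIVE : FS_FREED;
}
```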
Fixes: 968996c070 ("iavf: Fix VLAN_V2 addition/rejection")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-3-cdcb48303fd8@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When a VF goes down, the driver currently sends DEL_VLAN to the PF for
every VLAN filter (ACTIVE -> DISABLE -> send DEL -> INACTIVE), then
re-adds them all on UP (INACTIVE -> ADD -> send ADD -> ADDING ->
ACTIVE). This round-trip is unnecessary because:
1. The PF disables the VF's queues via VIRTCHNL_OP_DISABLE_QUEUES,
which already prevents all RX/TX traffic regardless of VLAN filter
state.
2. Leaving the VLAN filters in PF HW while the VF is down is
harmless - packets matching those filters have nowhere to go with
queues disabled.
3. The DEL+ADD cycle during down/up creates race windows where the
VLAN filter list is incomplete. With spoofcheck enabled, the PF
enables TX VLAN filtering on the first non-zero VLAN add, blocking
traffic for any VLANs not yet re-added.
Remove the entire DISABLE/INACTIVE state machinery:
- Remove IAVF_VLAN_DISABLE and IAVF_VLAN_INACTIVE enum values
- Remove iavf_restore_filters() and its call from iavf_open()
- Remove VLAN filter handling from iavf_clear_mac_vlan_filters(),
rename it to iavf_clear_mac_filters()
- Remove DEL_VLAN_FILTER scheduling from iavf_down()
- Remove all DISABLE/INACTIVE handling from iavf_del_vlans()
VLAN filters now stay ACTIVE across down/up cycles. Only explicit
user removal (ndo_vlan_rx_kill_vid) or PF/VF reset triggers VLAN
filter deletion/re-addition.
Fixes: ed1f5b58ea ("i40evf: remove VLAN filters on close")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-2-cdcb48303fd8@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When page_pool_create_percpu() fails in page_pool_list(), it falls
through to its err_uninit: label, which calls page_pool_uninit().
At that point page_pool_init() has already taken two references
when the user requested PP_FLAG_ALLOW_UNREADABLE_NETMEM:
pool->mp_ops->init(pool)
static_branch_inc(&page_pool_mem_providers);
Neither is undone by page_pool_uninit(); both are only undone by
__page_pool_destroy() (success-side teardown). The error path
therefore leaks the per-provider reference taken by mp_ops->init
(io_zcrx_ifq->refs in the io_uring zcrx provider, the dmabuf
binding refcount in the devmem provider) plus one increment of
the page_pool_mem_providers static branch on every failure of
xa_alloc_cyclic() inside page_pool_list().
The leaked io_zcrx_ifq->refs in turn pins everything
io_zcrx_ifq_free() would release on cleanup: ifq->user (uid),
ifq->mm_account (mmdrop), ifq->dev (device refcount),
ifq->netdev_tracker (netdev refcount), and the rbuf region.
The leaked static branch increment forces all subsequent
page_pool_alloc_netmems() and page_pool_return_page() callers to
take the slow mp_ops branch for the lifetime of the kernel.
Reachable via the io_uring zcrx path:
io_uring_register(IORING_REGISTER_ZCRX_IFQ)   /* CAP_NET_ADMIN */
  -> __io_uring_register
    -> io_register_zcrx
      -> zcrx_register_netdev
        -> netif_mp_open_rxq
          -> driver ndo_queue_mem_alloc
            -> page_pool_create_percpu
              -> page_pool_init succeeds (mp_ops->init runs, branch++)
              -> page_pool_list fails (xa_alloc_cyclic -ENOMEM)
              -> goto err_uninit   <-- leak
The same shape applies to the devmem dmabuf provider via
mp_dmabuf_devmem_init()/mp_dmabuf_devmem_destroy().
Restore the cleanup symmetry by moving the mp_ops->destroy() and
static_branch_dec() calls out of __page_pool_destroy() and into
page_pool_uninit(), so page_pool_uninit() is again the strict
inverse of page_pool_init(). page_pool_uninit() has only two
callers (the err_uninit: path and __page_pool_destroy()), so this
preserves the single-call invariant on the success path while
fixing the err path. The error path of page_pool_init() itself
still skips the mp_ops cleanup correctly: mp_ops->init is the
last action that takes a reference before page_pool_init() returns
0, so when it returns an error neither the refcount nor the static
branch has been touched.
Triggering the bug requires xa_alloc_cyclic() to fail with -ENOMEM,
which under normal GFP_KERNEL retry behaviour is rare. It is
deterministic under CONFIG_FAULT_INJECTION with fail_page_alloc /
xa fault injection, or under sustained memory pressure. The leak
is silent: there is no warning, and the released kernel build
continues running with a permanently-incremented static branch.
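The restored init/uninit symmetry can be modeled with counters standing in for the provider refcount and the static branch (names and error codes are illustrative):

```c
#include <assert.h>

static int mp_refs, branch;

static int pool_init(void)
{
	mp_refs++;	/* mp_ops->init() */
	branch++;	/* static_branch_inc(&page_pool_mem_providers) */
	return 0;
}

/* After the fix, uninit is the strict inverse of init. */
static void pool_uninit(void)
{
	mp_refs--;	/* mp_ops->destroy(), moved here by the fix */
	branch--;	/* static_branch_dec(), moved here by the fix */
}

static int pool_create(int list_err)	/* page_pool_create_percpu() */
{
	int err = pool_init();

	if (err)
		return err;
	if (list_err) {		/* page_pool_list() failed */
		pool_uninit();	/* err_uninit: no longer leaks */
		return list_err;
	}
	return 0;
}
```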
Fixes: 0f92140468 ("memory-provider: dmabuf devmem memory provider")
Signed-off-by: Hasan Basbunar <basbunarhasan@gmail.com>
Link: https://patch.msgid.link/20260428170739.34881-1-basbunarhasan@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The IPv4/IPv6 and routing code is not very well separated from
the TCP/UDP code. Scope it down properly by providing a more
accurate file list instead of all of net/ipv4/ and net/ipv6/.
Now that the entry more accurately represents layer 3 and
routing, merge the nexthop entry into it.
Add Ido Schimmel as a co-maintainer; Ido's git history speaks
for itself.
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260428203924.1229169-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Minor clarifications in the README:
- call out what linters we expect to be clean
- make it clear that by "frameworks" we mean code under lib/
not just factoring code out in the same file
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit db359fccf2 ("mm: introduce a new page type for page pool in
page type") added a page_type field to struct net_iov at the same
offset as struct page::page_type, so that page_pool_set_pp_info() can
call __SetPageNetpp() uniformly on both pages and net_iovs.
The page-type API requires the field to hold the UINT_MAX "no type"
sentinel before a type can be set; for real struct page that invariant
is established by the page allocator on free. struct net_iov is not
allocated through the page allocator, so the field is left as zero
(io_uring zcrx, which uses __GFP_ZERO) or as slab garbage (devmem,
which uses kvmalloc_objs() without zeroing). When the page pool then
calls page_pool_set_pp_info() on a freshly-bound niov,
__SetPageNetpp()'s VM_BUG_ON_PAGE(page->page_type != UINT_MAX) fires
and the kernel BUGs. Triggered in selftests by io_uring zcrx setup
through the fbnic queue restart path:
kernel BUG at ./include/linux/page-flags.h:1062!
RIP: 0010:page_pool_set_pp_info (./include/linux/page-flags.h:1062
net/core/page_pool.c:716)
Call Trace:
<TASK>
net_mp_niov_set_page_pool (net/core/page_pool.c:1360)
io_pp_zc_alloc_netmems (io_uring/zcrx.c:1089 io_uring/zcrx.c:1110)
fbnic_fill_bdq (./include/net/page_pool/helpers.h:160
drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:906)
__fbnic_nv_restart (drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2470
drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2874)
fbnic_queue_start (drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2903)
netdev_rx_queue_reconfig (net/core/netdev_rx_queue.c:137)
__netif_mp_open_rxq (net/core/netdev_rx_queue.c:234)
io_register_zcrx (io_uring/zcrx.c:818 io_uring/zcrx.c:903)
__io_uring_register (io_uring/register.c:931)
__do_sys_io_uring_register (io_uring/register.c:1029)
do_syscall_64 (arch/x86/entry/syscall_64.c:63
arch/x86/entry/syscall_64.c:94)
</TASK>
The same path is reachable through devmem dmabuf binding via
netdev_nl_bind_rx_doit() -> net_devmem_bind_dmabuf_to_queue().
Add a net_iov_init() helper that stamps ->owner, ->type and the
->page_type sentinel, and use it from both the devmem and io_uring
zcrx niov init loops.
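The invariant can be modeled in a few lines (the "netpp" type value below is a hypothetical stand-in; the real check is the VM_BUG_ON_PAGE in __SetPageNetpp()):

```c
#include <assert.h>
#include <limits.h>

/* Model: page_type must hold the UINT_MAX "no type" sentinel before a
 * type can be set. net_iov_init() stamps it because niovs never pass
 * through the page allocator. */
struct niov {
	unsigned int page_type;
};

static void niov_init(struct niov *n)	/* net_iov_init() */
{
	n->page_type = UINT_MAX;
}

static int set_netpp(struct niov *n)	/* __SetPageNetpp() */
{
	if (n->page_type != UINT_MAX)
		return -1;	/* the VM_BUG_ON in the real code */
	n->page_type = 0x1;	/* hypothetical "netpp" type value */
	return 0;
}
```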
Fixes: db359fccf2 ("mm: introduce a new page type for page pool in page type")
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Acked-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/20260428025320.853452-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthieu Baerts says:
====================
mptcp: misc fixes for v7.1-rc2
Here are various unrelated fixes:
- Patches 1-2: set timestamp flags on 'ssk', not 'sk' (typo), and do
that with a sleepable lock_sock/release_sock. A fix for v5.14.
- Patch 3: respect SO_LINGER(1, 0) by sending MP_FASTCLOSE at close time
as expected. A fix for v6.1.
- Patch 4: reset fullmesh counter after a flush. A fix for v6.19.
====================
Link: https://patch.msgid.link/20260427-net-mptcp-misc-fixes-7-1-rc2-v1-0-7432b7f279fa@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The SO_LINGER socket option has been supported for a while with MPTCP
sockets [1], but it didn't cause the equivalent of a TCP reset as
expected when enabled and its time was set to 0. This was causing some
behavioural differences with TCP where some connections were not
promptly stopped as expected.
To fix that, an extra condition is checked at close() time before
sending an MP_FASTCLOSE, the MPTCP equivalent of a TCP reset.
Note that backporting up to [1] will be difficult as more changes are
needed to be able to send MP_FASTCLOSE. It seems better to stop at [2],
which was supposed to already imitate TCP.
Validated with MPTCP packetdrill tests [3].
Fixes: 268b123874 ("mptcp: setsockopt: support SO_LINGER") [1]
Fixes: d21f834855 ("mptcp: use fastclose on more edge scenarios") [2]
Cc: stable@vger.kernel.org
Reported-by: Lance Tuller <lance@lance0.com>
Closes: https://github.com/lance0/xfr/pull/67
Link: https://github.com/multipath-tcp/packetdrill/pull/196 [3]
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260427-net-mptcp-misc-fixes-7-1-rc2-v1-3-7432b7f279fa@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Both mptcp_setsockopt_sol_socket_tstamp() and
mptcp_setsockopt_sol_socket_timestamping() iterate over subflows,
acquire the subflow socket lock, but then erroneously pass the MPTCP
msk socket to sock_set_timestamp() / sock_set_timestamping() instead
of the subflow ssk. As a result, the timestamp flags are set on the
wrong socket and have no effect on the actual subflows.
Pass ssk instead of sk to both helpers.
Fixes: 9061f24bf8 ("mptcp: sockopt: propagate timestamp request to subflows")
Cc: stable@vger.kernel.org
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260427-net-mptcp-misc-fixes-7-1-rc2-v1-1-7432b7f279fa@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Breno Leitao says:
====================
netconsole: configfs store callback fixes
There are still some changes I want to make, such as taking the
dynamic lock when reading from configfs (_show() callbacks), which
will solve other issues, but I will keep that for later.
====================
Link: https://patch.msgid.link/20260427-netconsole_ai_fixes-v2-0-59965f29d9cc@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
userdatum_value_store() updates udm->value first and only then calls
update_userdata() to rebuild the on-the-wire payload. If
update_userdata() fails (e.g. -ENOMEM from kmalloc), the function
returns the error to userspace, but udm->value already holds the new
string while the live nt->userdata buffer still reflects the old one.
The next successful write to any sibling userdatum on the same target
will call update_userdata() again, which walks every entry and packs
the now-stale udm->value into the payload. The failed write is thus
silently activated later, with no indication to userspace that the
value it tried to set was rejected.
Snapshot the previous value before overwriting udm->value and restore
it if update_userdata() fails so the visible state and the active
payload stay consistent.
Fixes: eb83801af2 ("netconsole: Dynamic allocation of userdata buffer")
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260427-netconsole_ai_fixes-v2-4-59965f29d9cc@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dev_name_store() calls strscpy(nt->np.dev_name, buf, IFNAMSIZ) without
checking the return value. If userspace writes an interface name longer
than IFNAMSIZ - 1, strscpy() silently truncates and returns -E2BIG, but
the function ignores it and reports a fully successful write back to
userspace.
If a real interface happens to match the truncated name, netconsole will
bind to the wrong device on the next enable, sending kernel logs and
panic output to an unintended network segment with no indication to
userspace that anything was rewritten.
Reject writes whose length cannot fit in nt->np.dev_name up front:
if (count >= IFNAMSIZ)
return -ENAMETOOLONG;
This is not a serious problem, but it is still the correct
approach.
Fixes: 0bcc181618 ("[NET] netconsole: Support dynamic reconfiguration using configfs")
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260427-netconsole_ai_fixes-v2-3-59965f29d9cc@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
userdatum_value_store() bounds count by MAX_EXTRADATA_VALUE_LEN (200)
and then copies straight into udm->value, which is itself 200 bytes:
if (count > MAX_EXTRADATA_VALUE_LEN)
return -EMSGSIZE;
...
ret = strscpy(udm->value, buf, sizeof(udm->value));
if (ret < 0)
goto out_unlock;
If userspace writes exactly MAX_EXTRADATA_VALUE_LEN bytes with no NUL
within them, strscpy() copies 199 bytes plus a NUL into udm->value and
returns -E2BIG. The function jumps to out_unlock and reports the error
to userspace, but udm->value has already been overwritten with the
truncated string and update_userdata() is skipped, so the corruption
is not yet visible on the wire.
The next successful write to any userdatum entry under the same target
calls update_userdata(), which packs udm->value into the active
netconsole payload. From that point on, every netconsole message
carries the silently truncated value, and userspace has no indication
that a previous, error-returning write left state behind.
Tighten the entry check from "count > MAX_EXTRADATA_VALUE_LEN" to
"count >= MAX_EXTRADATA_VALUE_LEN". With count strictly less than
sizeof(udm->value), strscpy() can no longer return -E2BIG here, so
the corrupting truncation path is removed entirely.
Fixes: 8a6d5fec6c ("net: netconsole: add a userdata config_group member to netconsole_target")
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260427-netconsole_ai_fixes-v2-2-59965f29d9cc@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Several configfs store callbacks in netconsole end with:
ret = strnlen(buf, count);
This under-reports the number of bytes consumed when the input
contains an embedded NUL within count, telling the VFS that fewer
bytes were written than userspace actually handed in. A conformant
partial-write loop would then retry the trailing bytes against a
callback that has already accepted them.
Every other configfs driver in the tree returns count directly from
its store callbacks once parsing has succeeded, including
drivers/nvme/target/configfs.c, drivers/gpio/gpio-sim.c,
drivers/most/configfs.c, drivers/block/null_blk/main.c,
drivers/pci/endpoint/pci-ep-cfs.c, and the rest of the configfs
users. netconsole was the outlier (along with
drivers/infiniband/core/cma_configfs.c, which has the same latent
issue).
Align netconsole with the rest of the configfs ecosystem: return
count once the parser/validator has accepted the input. The numeric
and boolean parsers (kstrtobool, kstrtou16, mac_pton,
netpoll_parse_ip_addr) have already validated the meaningful prefix;
any trailing bytes are padding and should simply be reported as
consumed.
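The convention can be illustrated with a minimal store callback (a hypothetical boolean attribute, not one of netconsole's actual callbacks):

```c
#include <assert.h>
#include <string.h>

/* Once the parser accepts the input, report the full count as
 * consumed. strnlen(buf, count) would under-report when the buffer
 * carries an embedded NUL or trailing padding, inviting a conformant
 * partial-write loop to retry bytes that were already accepted. */
static long enabled_store(const char *buf, size_t count)
{
	if (count < 1 || (buf[0] != '0' && buf[0] != '1'))
		return -22;	/* -EINVAL */
	/* ... apply the parsed value ... */
	return (long)count;	/* not strnlen(buf, count) */
}
```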
Fixes: 0bcc181618 ("[NET] netconsole: Support dynamic reconfiguration using configfs")
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260427-netconsole_ai_fixes-v2-1-59965f29d9cc@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet says:
====================
net/sched: sch_cake: annotate data-races in cake_dump_stats() (series)
cake_dump_stats() runs without qdisc spinlock being held.
This mini series adds missing READ_ONCE()/WRITE_ONCE() annotations.
Original patch was too big, splitting it eases code review.
====================
Link: https://patch.msgid.link/20260427083606.459355-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cake_dump_stats() runs without qdisc spinlock being held.
In this fourth patch, I add READ_ONCE()/WRITE_ONCE() annotations
for the following fields:
- avg_peak_bandwidth
- buffer_limit
- buffer_max_used
- avg_netoff
- max_netlen
- max_adjlen
- min_netlen
- min_adjlen
- active_queues
- tin_rate_bps
- bytes
- tin_backlog
Other annotations are added in following patch, to ease code review.
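The annotation pattern shared by this series can be sketched with minimal userspace stand-ins for the kernel macros (the real READ_ONCE/WRITE_ONCE in rwonce.h do more; this relies on the GCC __typeof__ extension):

```c
#include <assert.h>

#define WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))

struct tin_stats {
	unsigned long long bytes;
	unsigned int tin_backlog;
};

/* Datapath side: writes annotated so the lockless reader sees
 * untorn values. */
static void enqueue_account(struct tin_stats *ts, unsigned int len)
{
	WRITE_ONCE(ts->bytes, ts->bytes + len);
	WRITE_ONCE(ts->tin_backlog, ts->tin_backlog + len);
}

/* Dump side: runs without the qdisc spinlock, so reads are
 * annotated too. */
static unsigned long long dump_bytes(const struct tin_stats *ts)
{
	return READ_ONCE(ts->bytes);
}
```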
Fixes: 046f6fd5da ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260427083606.459355-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cake_dump_stats() runs without qdisc spinlock being held.
In this third patch, I add READ_ONCE()/WRITE_ONCE() annotations
for the following fields:
- packets
- tin_dropped
- tin_ecn_mark
- ack_drops
- peak_delay
- avge_delay
- base_delay
Other annotations are added in following patches, to ease code review.
Fixes: 046f6fd5da ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: "Toke Høiland-Jørgensen" <toke@toke.dk>
Link: https://patch.msgid.link/20260427083606.459355-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cake_dump_stats() runs without qdisc spinlock being held.
In this second patch, I add READ_ONCE()/WRITE_ONCE() annotations
for the following fields:
- bulk_flow_count
- unresponsive_flow_count
- max_skblen
- flow_quantum
Other annotations are added in following patches, to ease code review.
Fixes: 046f6fd5da ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: "Toke Høiland-Jørgensen" <toke@toke.dk>
Link: https://patch.msgid.link/20260427083606.459355-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cake_dump_stats() runs without qdisc spinlock being held.
In this first patch, I add READ_ONCE()/WRITE_ONCE() annotations
for the following fields:
- way_hits
- way_misses
- way_collisions
- sparse_flow_count
- decaying_flow_count
Other annotations are added in following patches, to ease code review.
Fixes: 046f6fd5da ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: "Toke Høiland-Jørgensen" <toke@toke.dk>
Link: https://patch.msgid.link/20260427083606.459355-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
bareudp_fill_metadata_dst() passes bareudp->sock to
udp_tunnel6_dst_lookup() in the IPv6 path without a NULL check.
The socket is only created in bareudp_open() and NULLed in
bareudp_stop(), so calling this function while the device is down
triggers a NULL dereference via sock->sk.
BUG: kernel NULL pointer dereference, address: 0000000000000018
RIP: 0010:udp_tunnel6_dst_lookup (net/ipv6/ip6_udp_tunnel.c:160)
Call Trace:
<TASK>
bareudp_fill_metadata_dst (drivers/net/bareudp.c:532)
do_execute_actions (net/openvswitch/actions.c:901)
ovs_execute_actions (net/openvswitch/actions.c:1589)
ovs_packet_cmd_execute (net/openvswitch/datapath.c:700)
genl_family_rcv_msg_doit (net/netlink/genetlink.c:1114)
genl_rcv_msg (net/netlink/genetlink.c:1209)
netlink_rcv_skb (net/netlink/af_netlink.c:2550)
</TASK>
Add a NULL check returning -ESHUTDOWN, consistent with the xmit paths
in the same driver.
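The guard can be sketched as follows; the struct layout is a simplified stand-in for the driver's types, keeping only what the check needs:

```c
#include <assert.h>
#include <stddef.h>

#define ESHUTDOWN 108

/* Sketch of the fix: bail out before dereferencing the tunnel socket
 * when the device is down, mirroring the check the xmit paths already
 * perform. Types are illustrative stand-ins, not the driver structs. */
struct bareudp_sock { int sk; };
struct bareudp_dev  { struct bareudp_sock *sock; };

static int fill_metadata_dst_sketch(struct bareudp_dev *bareudp)
{
	if (!bareudp->sock)	/* device not open: no socket yet */
		return -ESHUTDOWN;
	return 0;		/* proceed with the route lookup */
}
```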
Fixes: 571912c69f ("net: UDP tunnel encapsulation module for tunnelling different protocols like MPLS, IP, NSH etc.")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260426165350.1663137-2-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long says:
====================
sctp: fix a vtag verification failure caused by stale INITs
Similar to Scenario B in commit 8e56b063c8 ("netfilter: handle the
connecting collision properly in nf_conntrack_proto_sctp"):
Scenario B: INIT_ACK is delayed until the peer completes its own handshake
192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO]
192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK]
192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021] *
There is another case:
Scenario F: INIT is delayed until the peer completes its own handshake
192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
(OVS upcall)
192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO]
192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK]
192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
(delayed)
192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021] *
In this case, the delayed INIT (e.g. due to OVS upcall) is recorded by
conntrack, which prevents vtag verification from dropping the unexpected
INIT-ACK in nf_conntrack_sctp_packet():
vtag = ct->proto.sctp.vtag[!dir];
if (!ct->proto.sctp.init[!dir] && vtag && vtag != ih->init_tag)
goto out_unlock;
This happens because ct->proto.sctp.init[!dir] is set by the delayed INIT,
even though it is stale.
Fix this in two parts:
- In netfilter: Do not record INITs whose init_tag matches the peer vtag,
as they carry no new handshake state in the 1st patch.
- In SCTP: Prevent endpoints from responding to such INITs with INIT-ACK,
ensuring correctness even when middleboxes lack the netfilter fix in
the 2nd patch.
A follow-up selftest for this scenario will be posted in a separate patch
by Yi Chen.
====================
Link: https://patch.msgid.link/cover.1777214801.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
An INIT whose init_tag matches the peer's vtag does not provide new state
information. It indicates either:
- a stale INIT (after INIT-ACK has already been seen on the same side), or
- a retransmitted INIT (after INIT has already been recorded on the same
side).
In both cases, the INIT must not update ct->proto.sctp.init[] state, since
it does not advance the handshake tracking and may otherwise corrupt
INIT/INIT-ACK validation logic.
Allow INIT processing only when the conntrack entry is newly created
(SCTP_CONNTRACK_NONE), or when the init_tag differs from the stored peer
vtag.
Note that the check is skipped for a ct with old_state
SCTP_CONNTRACK_NONE in nf_conntrack_sctp_packet(), as it has just been
created in sctp_new(), which sets
ct->proto.sctp.vtag[IP_CT_DIR_REPLY] = ih->init_tag.
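The rule above reduces to a small predicate; the names here model the conntrack state rather than quoting the patch verbatim:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SCTP_CONNTRACK_NONE 0

/* Sketch of the new rule: only let an INIT update the tracked
 * handshake state when the entry is brand new, or when its init_tag
 * actually differs from the vtag already stored for the peer.
 * Parameter names are simplified stand-ins for the ct fields. */
static bool init_may_update_state(int old_state,
				  uint32_t stored_peer_vtag,
				  uint32_t init_tag)
{
	if (old_state == SCTP_CONNTRACK_NONE)
		return true;			/* just created in sctp_new() */
	return init_tag != stored_peer_vtag;	/* stale/retransmit: skip */
}
```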
Fixes: 9fb9cbb108 ("[NETFILTER]: Add nf_conntrack subsystem.")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/ee56c3e416452b2a40589a2a85245ac2ad5e9f4b.1777214801.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The dev-set and key-rotate netlink operations modify shared device
state (PSP version configuration and cryptographic key material,
respectively) but do not require CAP_NET_ADMIN. The only access
control is psp_dev_check_access(), which merely verifies netns
membership. Require CAP_NET_ADMIN for both operations.
Fixes: 00c94ca2b9 ("psp: base PSP device support")
Reviewed-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260427195856.401223-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
psp_assoc_device_get_locked() obtains a psp_dev reference via
psp_dev_get_for_sock() (which uses psp_dev_tryget() under RCU);
it then acquires psd->lock and drops the reference. Before
the lock is taken, psp_dev_unregister() can run to completion:
take psd->lock, clear out state, unlock, drop the registration
reference.
The expectation is that the lock prevents device unregistration,
but much like with netdevs special care has to be taken when
"upgrading" a reference to a locked device. Add the missing
check if device is still alive. psp_dev_is_registered() exists
already but had no callers, which makes me wonder if I either
forgot to add this or lost the check during refactoring...
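The "upgrade" pattern can be modeled in a few lines; the struct and helper are illustrative stand-ins for struct psp_dev and psp_dev_is_registered():

```c
#include <assert.h>
#include <stdbool.h>

#define ENODEV 19

/* Userspace model of the fix: holding psd->lock only helps if we also
 * re-check that the device did not finish unregistering between the
 * refcount "tryget" and the lock acquisition. */
struct psp_dev_model { bool registered; };

static int assoc_device_get_locked(struct psp_dev_model *psd)
{
	/* (psd->lock is taken here in the real code) */
	if (!psd->registered)	/* the missing liveness check */
		return -ENODEV;
	return 0;		/* safe to use the device under the lock */
}
```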
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Fixes: 6b46ca260e ("net: psp: add socket security association code")
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260427190606.366101-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) IEEE1394 ARP payloads contain no target hardware address in the
   ARP packet. Apparently, arp_tables was never updated to deal with
   IEEE1394 ARP properly. To deal with this, return no match in case
   the target hardware address selector is used, either for an inverse
   or a normal match. Moreover, arpt_mangle now disallows mangling of
   the target hardware and IP address: it is not worth adjusting the
   offset calculation to fix this, as we suspect there are no users of
   arp_tables for this family.
2) Use list_del_rcu() to delete device hooks in nf_tables, this hook
list is RCU protected, concurrent netlink dump readers can be
walking on this list, fix it by adding a helper function and use it
for consistency. From Florian Westphal.
3) Add list_splice_rcu(), this is useful for joining the local list of
new device hooks to the RCU protected hook list in chain and
flowtable. Reviewed by Paul E. McKenney.
4) Use list_splice_rcu() to publish the new device hooks in chain and
flowtable to fix concurrent netlink dump traversal.
5) Add a new hook transaction object to track device hook deletions.
The current approach moves device hooks to be deleted around during
the preparation phase, this breaks concurrent RCU reader via netlink
dump. This new hook transaction is combined with NFT_HOOK_REMOVE
flag to annotate hooks for removal in the preparation phase.
6) xt_policy inbound policy check in strict mode can lead to
   out-of-bounds access of the secpath array due to incorrect
   iteration order. The iteration over the secpath needs to be
   reversed for the inbound check: the human-readable policy expects
   inner in first position and outer in second position, while the
   inbound secpath actually stores outer in first position and inner
   in second position. From Jiexun Wang.
7) Fix possible zero shift in nft_bitwise triggering UBSAN splat,
reject zero shift from control plane, from Kai Ma.
8) Replace simple_strtoul() in the conntrack SIP helper since it relies
on nul-terminated strings. From Florian Westphal.
* tag 'nf-26-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_conntrack_sip: don't use simple_strtoul
netfilter: reject zero shift in nft_bitwise
netfilter: xt_policy: fix strict mode inbound policy matching
netfilter: nf_tables: add hook transactions for device deletions
netfilter: nf_tables: join hook list via splice_list_rcu() in commit phase
rculist: add list_splice_rcu() for private lists
netfilter: nf_tables: use list_del_rcu for netlink hooks
netfilter: arp_tables: fix IEEE1394 ARP payload parsing
====================
Link: https://patch.msgid.link/20260428095840.51961-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Table 7-121 in the datasheet says we have to set register 0xc6
to value 0x10 before CLK_O_SEL can be modified. No further information
about this field is found in the datasheet. With this fix, setting
the CLK_O_SEL field in the IO_MUX_CFG register works through the dts
property "ti,clk-output-sel" on a DP83869HMRGZR.
Signed-off-by: Heiko Schocher <hs@nabladev.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 01db923e83 ("net: phy: dp83869: Add TI dp83869 phy")
Link: https://patch.msgid.link/20260425031339.3318-1-hs@nabladev.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently, mctp_i2c_get_tx_flow_state() is called before the packet length
sanity check. This function marks a new flow as active in the MCTP core.
If the sanity check fails, mctp_i2c_xmit() returns early without calling
mctp_i2c_lock_nest(). This results in a mismatched locking state: the
flow is active, but the I2C bus lock was never acquired for it.
When the flow is later released, mctp_i2c_release_flow() will see the
active state and queue an unlock marker. The TX thread will then
decrement midev->i2c_lock_count from 0, causing it to underflow to -1.
This underflow permanently breaks the driver's locking logic, allowing
future transmissions to occur without holding the I2C bus lock, leading
to bus collisions and potential hardware hangs.
Move the mctp_i2c_get_tx_flow_state() call to after the length sanity
check to ensure we only transition the flow state if we are actually
going to proceed with the transmission and locking.
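The reordering can be sketched as below; the length limit, error code, and state variable model the driver's behavior rather than its actual API:

```c
#include <assert.h>

#define EMSGSIZE 90
#define MAX_LEN  255u

static int flow_active;	/* models the flow state in the MCTP core */

/* Sketch of the ordering fix: validate the packet length first, and
 * only mark the flow active once we know we will proceed to lock the
 * bus and transmit. Before the fix the two lines were swapped, so the
 * early exit left the flow active with no lock ever taken. */
static int xmit_sketch(unsigned int len)
{
	if (len > MAX_LEN)
		return -EMSGSIZE;	/* early exit: flow state untouched */
	flow_active = 1;		/* get_tx_flow_state(), now after the check */
	/* ... mctp_i2c_lock_nest() and transmit ... */
	return 0;
}
```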
Fixes: f5b8abf9fc ("mctp i2c: MCTP I2C binding driver")
Signed-off-by: William A. Kennington III <william@wkennington.com>
Acked-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20260423074741.201460-1-william@wkennington.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The CPU receives frames from the MAC through conventional DMA: the CPU
allocates buffers for the MAC, then the MAC fills them and returns
ownership to the CPU. For each hardware RX queue, the CPU and MAC
coordinate through a shared ring array of DMA descriptors: one
descriptor per DMA buffer. Each descriptor includes the buffer's
physical address and a status flag ("OWN") indicating which side owns
the buffer: OWN=0 for CPU, OWN=1 for MAC. The CPU is only allowed to set
the flag and the MAC is only allowed to clear it, and both must move
through the ring in sequence: thus the ring is used for both
"submissions" and "completions."
In the stmmac driver, stmmac_rx() bookmarks its position in the ring
with the `cur_rx` index. The main receive loop in that function checks
for rx_descs[cur_rx].own=0, gives the corresponding buffer to the
network stack (NULLing the pointer), and increments `cur_rx` modulo the
ring size. After the loop exits, stmmac_rx_refill(), which bookmarks its
position with `dirty_rx`, allocates fresh buffers and rearms the
descriptors (setting OWN=1). If it fails any allocation, it simply stops
early (leaving OWN=0) and will retry where it left off when next called.
This means descriptors have a three-stage lifecycle (terms my own):
- `empty` (OWN=1, buffer valid)
- `full` (OWN=0, buffer valid and populated)
- `dirty` (OWN=0, buffer NULL)
But because stmmac_rx() only checks OWN, it confuses `full`/`dirty`. In
the past (see 'Fixes:'), there was a bug where the loop could cycle
`cur_rx` all the way back to the first descriptor it dirtied, resulting
in a NULL dereference when mistaken for `full`. The aforementioned
commit resolved that *specific* failure by capping the loop's iteration
limit at `dma_rx_size - 1`, but this is only a partial fix: if the
previous stmmac_rx_refill() didn't complete, then there are leftover
`dirty` descriptors that the loop might encounter without needing to
cycle fully around. The current code therefore panics (see 'Closes:')
when stmmac_rx_refill() is memory-starved long enough for `cur_rx` to
catch up to `dirty_rx`.
Fix this by explicitly checking, before advancing `cur_rx`, if the next
entry is dirty; exit the loop if so. This prevents processing of the
final, used descriptor until stmmac_rx_refill() succeeds, but
fully prevents the `cur_rx == dirty_rx` ambiguity as the previous bugfix
intended: so remove the clamp as well. Since stmmac_rx_zc() is a
copy-paste-and-tweak of stmmac_rx() and the code structure is identical,
any fix to stmmac_rx() will also need a corresponding fix for
stmmac_rx_zc(). Therefore, apply the same check there.
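The new exit condition can be modeled on a toy ring; the descriptor struct and ring size are simplified stand-ins for the driver's DMA ring:

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4

/* Sketch of the fix: before advancing cur_rx, peek at the next
 * descriptor and stop if it is `dirty` (OWN=0 but the buffer was
 * already handed to the stack and NULLed), instead of mistaking it
 * for a freshly completed `full` descriptor. */
struct desc { int own; void *buf; };

static int rx_should_stop(const struct desc *ring, unsigned int cur_rx)
{
	unsigned int next = (cur_rx + 1) % RING_SIZE;

	/* dirty: CPU-owned but not yet rearmed by stmmac_rx_refill() */
	return ring[next].own == 0 && ring[next].buf == NULL;
}
```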
In stmmac_rx() (not stmmac_rx_zc()), a related bug remains: after the
MAC sets OWN=0 on the final descriptor, it will be unable to send any
further DMA-complete IRQs until it's given more `empty` descriptors.
Currently, the driver simply *hopes* that the next stmmac_rx_refill()
succeeds, risking an indefinite stall of the receive process if not. But
this is not a regression, so it can be addressed in a future change.
Fixes: b6cb454185 ("net: stmmac: avoid rx queue overrun")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221010
Cc: stable@vger.kernel.org
Suggested-by: Russell King <linux@armlinux.org.uk>
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
Link: https://patch.msgid.link/20260422044503.5349-1-CFSworks@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
seg6_input_core() and rpl_input() call ip6_route_input() which sets a
NOREF dst on the skb, then pass it to dst_cache_set_ip6() invoking
dst_hold() unconditionally.
On PREEMPT_RT, ksoftirqd is preemptible and a higher-priority task can
release the underlying pcpu_rt between the lookup and the caching
through a concurrent FIB lookup on a shared nexthop.
Simplified race sequence:
ksoftirqd/X higher-prio task (same CPU X)
----------- --------------------------------
seg6_input_core(,skb)/rpl_input(skb)
dst_cache_get()
-> miss
ip6_route_input(skb)
-> ip6_pol_route(,skb,flags)
[RT6_LOOKUP_F_DST_NOREF in flags]
-> FIB lookup resolves fib6_nh
[nhid=N route]
-> rt6_make_pcpu_route()
[creates pcpu_rt, refcount=1]
pcpu_rt->sernum = fib6_sernum
[fib6_sernum=W]
-> cmpxchg(fib6_nh.rt6i_pcpu,
NULL, pcpu_rt)
[slot was empty, store succeeds]
-> skb_dst_set_noref(skb, dst)
[dst is pcpu_rt, refcount still 1]
rt_genid_bump_ipv6()
-> bumps fib6_sernum
[fib6_sernum from W to Z]
ip6_route_output()
-> ip6_pol_route()
-> FIB lookup resolves fib6_nh
[nhid=N]
-> rt6_get_pcpu_route()
pcpu_rt->sernum != fib6_sernum
[W <> Z, stale]
-> prev = xchg(rt6i_pcpu, NULL)
-> dst_release(prev)
[prev is pcpu_rt,
refcount 1->0, dead]
dst = skb_dst(skb)
[dst is the dead pcpu_rt]
dst_cache_set_ip6(dst)
-> dst_hold() on dead dst
-> WARN / use-after-free
For the race to occur, ksoftirqd must be preemptible (PREEMPT_RT without
PREEMPT_RT_NEEDS_BH_LOCK) and a concurrent task must be able to release
the pcpu_rt. Shared nexthop objects provide such a path, as two routes
pointing to the same nhid share the same fib6_nh and its rt6i_pcpu
entry.
Fix seg6_input_core() and rpl_input() by calling skb_dst_force() after
ip6_route_input() to force the NOREF dst into a refcounted one before
caching.
The output path is not affected as ip6_route_output() already returns a
refcounted dst.
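The refcount transition that skb_dst_force() performs can be modeled in a toy form; the struct and helper below are illustrative stand-ins, not the kernel dst API:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the fix: a NOREF dst borrowed from the per-CPU route
 * must be converted into a refcounted one before it is stored in the
 * long-lived dst_cache, so a concurrent pcpu_rt release cannot drop
 * the refcount to zero underneath us. */
struct dst_model { int refcnt; bool noref; };

static void dst_force(struct dst_model *dst)
{
	if (dst->noref) {	/* take the reference we were borrowing */
		dst->refcnt++;
		dst->noref = false;
	}
}
```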
Fixes: af4a2209b1 ("ipv6: sr: use dst_cache in seg6_input")
Fixes: a7a29f9c36 ("net: ipv6: add rpl sr tunnel")
Cc: stable@vger.kernel.org
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Link: https://patch.msgid.link/20260421094735.20997-1-andrea.mayer@uniroma2.it
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
netpoll_setup() decides whether to auto-populate the local source
address by testing np->local_ip.ip, which only inspects the first 4
bytes of the union inet_addr storage.
For an IPv6 netpoll whose caller-supplied local address has zero
high 32 bits (::1, ::<suffix>, IPv4-mapped ::ffff:a.b.c.d, etc.), this
misdetects the address as unset (it is not, but its first 4 bytes are
zero), calls netpoll_take_ipv6() and overwrites it with whatever
matching link-local/global address the device happens to expose
first.
Introduce a helper netpoll_local_ip_unset() that picks the correct
family-aware test (ipv6_addr_any() for IPv6, !.ip for IPv4) and use it
from netpoll_setup().
Reproducer is something like:
echo "::2" > local_ip
echo 1 > enabled
cat local_ip
# before this fix: 2001:db8::1 (caller-supplied ::2 was clobbered)
# after this fix: ::2
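The family-aware test (and why the old 32-bit test misfires on ::2) can be shown with a small sketch; the union here is a simplified stand-in for union inet_addr:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the helper's intent: "is the local address unset?" must
 * be answered per address family. Checking only the first 32 bits
 * (the IPv4 member of the union) misreads IPv6 addresses whose high
 * bits are zero. */
union addr_model {
	uint32_t ip;		/* IPv4 */
	uint8_t  in6[16];	/* IPv6 */
};

static bool local_ip_unset(const union addr_model *a, bool ipv6)
{
	if (ipv6) {
		static const uint8_t any[16];	/* models ipv6_addr_any() */
		return memcmp(a->in6, any, 16) == 0;
	}
	return a->ip == 0;
}
```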
Fixes: b7394d2429 ("netpoll: prepare for ipv6")
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260424-netpoll_fix-v1-1-3a55348c625f@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_clamp_probe0_to_user_timeout() computes remaining time in jiffies
using subtraction with an unsigned lvalue. If elapsed probing time
exceeds the configured TCP_USER_TIMEOUT, the underflow yields a large
value.
This ends up re-arming the probe timer for a full backoff interval
instead of expiring immediately, delaying connection teardown beyond
the configured timeout.
Fix this by preventing underflow so user-set timeout expiration is
handled correctly without extending the probe timer.
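The guarded subtraction amounts to this; the function name is illustrative, not the exact kernel helper:

```c
#include <assert.h>

/* Sketch of the underflow fix: compute the remaining probe0 time with
 * a guarded subtraction so that elapsed >= user_timeout yields 0
 * (expire now) rather than wrapping to a huge unsigned value that
 * re-arms the timer for a full backoff interval. */
static unsigned long remaining_probe0(unsigned long user_timeout,
				      unsigned long elapsed)
{
	return user_timeout > elapsed ? user_timeout - elapsed : 0;
}
```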
Fixes: 344db93ae3 ("tcp: make TCP_USER_TIMEOUT accurate for zero window probes")
Link: https://lore.kernel.org/r/20260414013634.43997-1-ahacigu.linux@gmail.com
Signed-off-by: Altan Hacigumus <ahacigu.linux@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260424014639.54110-1-ahacigu.linux@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some physical adapters on Power systems do not support segmentation
offload when the MSS is less than 224 bytes. Attempting to send such
packets causes the adapter to freeze, stopping all traffic until
manually reset.
Implement ndo_features_check to disable GSO for packets with small MSS
values. The network stack will perform software segmentation instead.
The 224-byte minimum matches ibmvnic
commit f10b09ef687f ("ibmvnic: Enforce stronger sanity checks
on GSO packets"), which uses the same physical adapters in SEA
configurations.
The issue occurs specifically when the hardware attempts to perform
segmentation (gso_segs > 1) with a small MSS. Single-segment GSO packets
(gso_segs == 1) do not trigger the problematic LSO code path and are
transmitted normally without segmentation.
Add an ndo_features_check callback to disable GSO when MSS < 224 bytes.
Also call vlan_features_check() to ensure proper handling of VLAN packets,
particularly QinQ (802.1ad) configurations where the hardware parser may
not support certain offload features.
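The feature-masking logic can be sketched as follows; the 224-byte minimum is from the text above, while the feature bit values are illustrative stand-ins for the NETIF_F_* masks:

```c
#include <assert.h>
#include <stdint.h>

#define IBMVETH_MIN_MSS 224u	/* minimum MSS the adapter can segment */
#define FEAT_GSO        0x1u	/* stand-in for NETIF_F_GSO_MASK */
#define FEAT_CSUM       0x2u	/* stand-in for another offload bit */

/* Sketch of the ndo_features_check logic: strip the GSO feature bits
 * for multi-segment packets whose MSS is below the hardware minimum,
 * forcing the stack to segment in software. Single-segment GSO
 * packets do not hit the problematic LSO path and are left alone. */
static uint32_t features_check_sketch(uint32_t features,
				      uint32_t gso_size, uint32_t gso_segs)
{
	if (gso_segs > 1 && gso_size < IBMVETH_MIN_MSS)
		features &= ~FEAT_GSO;
	return features;
}
```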
Validated using iptables to force small MSS values. Without the fix,
the adapter freezes. With the fix, packets are segmented in software
and transmission succeeds. Comprehensive regression testing completed
(MSS tests, performance, stability).
Fixes: 8641dd8579 ("ibmveth: Add support for TSO")
Cc: stable@vger.kernel.org
Reviewed-by: Brian King <bjking1@linux.ibm.com>
Tested-by: Shaik Abdulla <shaik.abdulla1@ibm.com>
Tested-by: Naveed Ahmed <naveedaus@in.ibm.com>
Signed-off-by: Mingming Cao <mmc@linux.ibm.com>
Link: https://patch.msgid.link/20260424162917.65725-1-mmc@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
neigh_xmit always releases the skb, except when no neighbour table is
found. But even the first added user of neigh_xmit (mpls) relied on
neigh_xmit to release the skb (or queue it for tx).
sashiko reported:
If neigh_xmit() is called with an uninitialized neighbor table (for
example, NEIGH_ND_TABLE when IPv6 is disabled), it returns -EAFNOSUPPORT
and bypasses its internal out_kfree_skb error path. Because the return
value of neigh_xmit() is ignored here, does this leak the SKB?
Assume full ownership and remove the last code path that doesn't
xmit or free skb.
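The ownership contract after the fix can be modeled in a toy form; the table lookup is reduced to a flag and the free counter stands in for kfree_skb():

```c
#include <assert.h>
#include <stddef.h>

#define EAFNOSUPPORT 97

static int freed;			/* counts kfree_skb() calls */
static void kfree_skb_model(void *skb) { (void)skb; freed++; }

/* Toy model of the contract after the fix: every path out of
 * neigh_xmit() either queues the skb for transmit or frees it, so
 * callers may always treat ownership as transferred. The previously
 * leaking -EAFNOSUPPORT path now frees the skb as well. */
static int neigh_xmit_model(int have_table, void *skb)
{
	if (!have_table) {
		kfree_skb_model(skb);	/* previously leaked here */
		return -EAFNOSUPPORT;
	}
	/* ... resolve neighbour and transmit ... */
	return 0;
}
```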
Fixes: 4fd3d7d9e8 ("neigh: Add helper function neigh_xmit")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260424145843.74055-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With CONFIG_IP_MROUTE_MULTIPLE_TABLES=n, ipmr_fib_lookup()
does not check if net->ipv4.mrt is NULL.
Since default_device_exit_batch() is called after ->exit_rtnl(),
a device could receive IGMP packets and access net->ipv4.mrt
during/after ipmr_rules_exit_rtnl().
If ipmr_rules_exit_rtnl() had already cleared it and freed the
memory, the access would trigger null-ptr-deref or use-after-free.
Let's fix it by using RCU helper and free mrt after RCU grace
period.
In addition, check_net(net) is added to mroute_clean_tables()
and ipmr_cache_unresolved() to synchronise via mfc_unres_lock.
This prevents ipmr_cache_unresolved() from putting skb into
c->_c.mfc_un.unres.unresolved after mroute_clean_tables()
purges it.
For the same reason, timer_shutdown_sync() is moved after
mroute_clean_tables().
Since rhltable_destroy() holds a mutex internally, rcu_work is
used, and it is placed as the first member because rcu_head
must be placed within a 4K offset. mr_table is already 3864
bytes without rcu_work.
Note that IP6MR is not yet converted to ->exit_rtnl(), so this
change is not needed for now but will be.
Fixes: b22b018674 ("ipmr: Convert ipmr_net_exit_batch() to ->exit_rtnl().")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260423053456.4097409-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>