The SMA/U.FL pin redesign (commit 2dd5d03c77 ("ice: redesign dpll
sma/u.fl pins control")) introduced software-controlled pins that wrap
backing CGU input/output pins, but never updated the notification and
data paths to propagate pin events to these SW wrappers.
The periodic work sends dpll_pin_change_ntf() notifications only for direct CGU input
pins. SW pins that wrap these inputs never receive change or phase
offset notifications, so userspace consumers such as synce4l monitoring
SMA pins via dpll netlink never learn about state transitions or phase
offset updates. Similarly, ice_dpll_phase_offset_get() reads the SW
pin's own phase_offset field which is never updated; the PPS monitor
writes to the backing CGU input's field instead.
Fix by introducing ice_dpll_pin_ntf(), a wrapper around
dpll_pin_change_ntf() that also notifies any registered SMA/U.FL pin
whose backing CGU input matches. Replace all direct
dpll_pin_change_ntf() calls in the periodic notification paths with
this wrapper. Fix ice_dpll_phase_offset_get() to return the backing
CGU input's phase_offset for input-direction SW pins.
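The wrapper's fan-out can be sketched as a small userspace model (the pin layout and names below are stand-ins for the driver's structures, not the actual ice code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model: a SW pin wraps a backing CGU input; notifying the CGU pin
 * must also notify every SW wrapper whose backing input matches. */
struct pin {
	struct pin *backing;	/* backing CGU input, NULL for CGU pins */
	bool notified;
};

static void pin_change_ntf(struct pin *p)	/* dpll_pin_change_ntf() */
{
	p->notified = true;
}

/* Sketch of the ice_dpll_pin_ntf() wrapper described above. */
static void pin_ntf(struct pin *cgu, struct pin **sw, size_t n)
{
	pin_change_ntf(cgu);
	for (size_t i = 0; i < n; i++)
		if (sw[i]->backing == cgu)
			pin_change_ntf(sw[i]);
}
```

Replacing the direct dpll_pin_change_ntf() calls with this wrapper is what lets SMA/U.FL consumers see the same events as the backing input.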
Fixes: 2dd5d03c77 ("ice: redesign dpll sma/u.fl pins control")
Signed-off-by: Petr Oros <poros@redhat.com>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-10-cdcb48303fd8@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
SMA and U.FL pins share physical signal paths in pairs (SMA1/U.FL1 and
SMA2/U.FL2) controlled by the PCA9575 GPIO expander. Each pair can
only have one active pin at a time: SMA1 output and U.FL1 output share
the same CGU output, SMA2 input and U.FL2 input share the same CGU
input. The PCA9575 register bits determine which connector in each
pair owns the signal path.
The driver does not account for this pairing in two places:
ice_dpll_ufl_pin_state_set() modifies PCA9575 bits and disables the
backing CGU pin without checking whether the U.FL pin is currently
active. Disconnecting an already inactive U.FL pin flips bits that
the paired SMA pin relies on, breaking its connection.
ice_dpll_sma_direction_set() does not propagate direction changes to
the paired U.FL pin. For SMA2/U.FL2 the ICE_SMA2_UFL2_RX_DIS bit is
never managed, so U.FL2 stays disconnected after SMA2 switches to
output. For both pairs the backing CGU pin of the U.FL side is never
enabled when a direction change activates it, so userspace sees the
pin as disconnected even though the routing is correct.
Fix by guarding the U.FL disconnect path against inactive pins and by
updating the paired U.FL pin fully on SMA direction changes: manage
ICE_SMA2_UFL2_RX_DIS for the SMA2/U.FL2 pair and enable the backing
CGU pin whenever the peer becomes active.
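The disconnect guard can be modeled in a few lines (struct layout and names are hypothetical; the real state lives in PCA9575 register bits):

```c
#include <assert.h>
#include <stdbool.h>

/* Model: each SMA/U.FL pair shares one signal path, so disconnecting
 * an already-inactive U.FL pin must not flip the routing bits the
 * paired SMA pin relies on. */
struct pair {
	bool sma_active;
	bool ufl_active;
	unsigned int path_bits;		/* shared PCA9575 routing bits */
};

static void ufl_set_disconnected(struct pair *p)
{
	if (!p->ufl_active)
		return;			/* guard added by the fix */
	p->ufl_active = false;
	p->path_bits = 0;		/* safe: U.FL owned the path */
}
```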
Fixes: 2dd5d03c77 ("ice: redesign dpll sma/u.fl pins control")
Signed-off-by: Petr Oros <poros@redhat.com>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-8-cdcb48303fd8@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The DPLL SMA/U.FL pin redesign introduced ice_dpll_sw_pin_frequency_get()
which gates frequency reporting on the pin's active flag. This flag is
determined by ice_dpll_sw_pins_update() from the PCA9575 GPIO expander
state. Before the redesign, SMA pins were exposed as direct HW
input/output pins and ice_dpll_frequency_get() returned the CGU
frequency unconditionally — the PCA9575 state was never consulted.
The PCA9575 powers on with all outputs high, setting ICE_SMA1_DIR_EN,
ICE_SMA1_TX_EN, ICE_SMA2_DIR_EN and ICE_SMA2_TX_EN. Nothing in the
driver writes the register during initialization, so
ice_dpll_sw_pins_update() sees all pins as inactive and
ice_dpll_sw_pin_frequency_get() permanently returns 0 Hz for every
SW pin.
Fix this by writing a default SMA configuration in
ice_dpll_init_info_sw_pins(): clear all SMA bits, then set SMA1 and
SMA2 as active inputs (DIR_EN=0) with U.FL1 output and U.FL2 input
disabled. Each SMA/U.FL pair shares a physical signal path so only
one pin per pair can be active at a time. U.FL pins still report
frequency 0 after this fix: U.FL1 (output-only) is disabled by
ICE_SMA1_TX_EN which keeps the TX output buffer off, and U.FL2
(input-only) is disabled by ICE_SMA2_UFL2_RX_DIS. They can be
activated by changing the corresponding SMA pin direction via dpll
netlink.
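The default configuration can be sketched as a bitmask helper; the bit positions below are hypothetical stand-ins for the driver's PCA9575 defines, only the set/clear relationships mirror the commit text:

```c
#include <assert.h>
#include <stdint.h>

#define SMA1_DIR_EN		(1u << 0)	/* 1 = SMA1 is an output */
#define SMA1_TX_EN		(1u << 1)	/* 1 = U.FL1 TX buffer off */
#define SMA2_UFL2_RX_DIS	(1u << 3)	/* 1 = U.FL2 RX disabled */
#define SMA2_DIR_EN		(1u << 4)	/* 1 = SMA2 is an output */
#define SMA2_TX_EN		(1u << 5)

/* Sketch of the default written by ice_dpll_init_info_sw_pins():
 * clear all SMA bits, keep SMA1/SMA2 as inputs (DIR_EN = 0), and
 * disable the U.FL1 output and U.FL2 input. */
static uint8_t sma_default_cfg(void)
{
	uint8_t v = 0;			/* clear all SMA bits first */

	v |= SMA1_TX_EN;		/* U.FL1 output disabled */
	v |= SMA2_UFL2_RX_DIS;		/* U.FL2 input disabled */
	return v;			/* DIR_EN bits stay 0: inputs */
}
```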
Fixes: 2dd5d03c77 ("ice: redesign dpll sma/u.fl pins control")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-7-cdcb48303fd8@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
On certain E810 configurations where firmware supports Tx scheduler
topology switching (tx_sched_topo_comp_mode_en), ice_cfg_tx_topo()
may need to apply a new 5-layer or 9-layer topology from the DDP
package. If the AQ command to set the topology fails (e.g. due to
invalid DDP data or firmware limitations), the global configuration
lock must still be cleared via a CORER reset.
Commit 86aae43f21 ("ice: don't leave device non-functional if Tx
scheduler config fails") correctly fixed this by refactoring
ice_cfg_tx_topo() to always trigger CORER after acquiring the global
lock and re-initialize hardware via ice_init_hw() afterwards.
However, commit 8a37f9e2ff ("ice: move ice_deinit_dev() to the end
of deinit paths") later moved ice_init_dev_hw() into ice_init_hw(),
breaking the reinit path introduced by 86aae43f21. This creates an
infinite recursive call chain:
ice_init_hw()
  ice_init_dev_hw()
    ice_cfg_tx_topo()          # topology change needed
      ice_deinit_hw()
      ice_init_hw()            # reinit after CORER
        ice_init_dev_hw()      # recurse
          ice_cfg_tx_topo()
            ...                # stack overflow
Fix by moving ice_init_dev_hw() back out of ice_init_hw() and calling
it explicitly from ice_probe() and ice_devlink_reinit_up(). The third
caller, ice_cfg_tx_topo(), intentionally does not need ice_init_dev_hw()
during its reinit; it only needs the core HW reinitialization. This
breaks the recursion cleanly without adding flags or guards.
The deinit ordering changes from commit 8a37f9e2ff ("ice: move
ice_deinit_dev() to the end of deinit paths") which fixed slow rmmod
are preserved; only the init-side placement of ice_init_dev_hw() is
reverted.
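The resulting call graph can be modeled in userspace (names loosely mirror the driver functions; bodies are stubs):

```c
#include <assert.h>

static int topo_entries;

static int init_hw(void)		/* ice_init_hw(): core HW only */
{
	return 0;
}

static int cfg_tx_topo(void)		/* ice_cfg_tx_topo() */
{
	topo_entries++;
	/* ... set topology via AQ, trigger CORER ... */
	return init_hw();		/* core reinit, no recursion */
}

static int init_dev_hw(void)		/* ice_init_dev_hw() */
{
	return cfg_tx_topo();
}

static int probe(void)			/* ice_probe() */
{
	int err = init_hw();

	if (err)
		return err;
	return init_dev_hw();		/* now called explicitly */
}
```

Because init_hw() no longer calls init_dev_hw(), the reinit inside cfg_tx_topo() terminates after one pass.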
Fixes: 8a37f9e2ff ("ice: move ice_deinit_dev() to the end of deinit paths")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-6-cdcb48303fd8@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
ice_reset_all_vfs() ignores the return value of ice_vf_rebuild_vsi().
When the VSI rebuild fails (e.g. during NVM firmware update via
nvmupdate64e), ice_vsi_rebuild() tears down the VSI on its error path,
leaving txq_map and rxq_map as NULL. The subsequent unconditional call
to ice_vf_post_vsi_rebuild() leads to a NULL pointer dereference in
ice_ena_vf_q_mappings() when it accesses vsi->txq_map[0].
The single-VF reset path in ice_reset_vf() already handles this
correctly by checking the return value of ice_vf_reconfig_vsi() and
skipping ice_vf_post_vsi_rebuild() on failure.
Apply the same pattern to ice_reset_all_vfs(): check the return value
of ice_vf_rebuild_vsi() and skip ice_vf_post_vsi_rebuild() and
ice_eswitch_attach_vf() on failure. The VF is left safely disabled
(ICE_VF_STATE_INIT not set, VFGEN_RSTAT not set to VFACTIVE) and can
be recovered via a VFLR triggered by a PCI reset of the VF
(sysfs reset or driver rebind).
Note that this patch does not prevent the VF VSI rebuild from failing
during NVM update — the underlying cause is firmware being in a
transitional state while the EMP reset is processed, which can cause
Admin Queue commands (ice_add_vsi, ice_cfg_vsi_lan) to fail. This
patch only prevents the subsequent NULL pointer dereference that
crashes the kernel when the rebuild does fail.
crash> bt
PID: 50795 TASK: ff34c9ee708dc680 CPU: 1 COMMAND: "kworker/u512:5"
#0 [ff72159bcfe5bb50] machine_kexec at ffffffffaa8850ee
#1 [ff72159bcfe5bba8] __crash_kexec at ffffffffaaa15fba
#2 [ff72159bcfe5bc68] crash_kexec at ffffffffaaa16540
#3 [ff72159bcfe5bc70] oops_end at ffffffffaa837eda
#4 [ff72159bcfe5bc90] page_fault_oops at ffffffffaa893997
#5 [ff72159bcfe5bce8] exc_page_fault at ffffffffab528595
#6 [ff72159bcfe5bd10] asm_exc_page_fault at ffffffffab600bb2
[exception RIP: ice_ena_vf_q_mappings+0x79]
RIP: ffffffffc0a85b29 RSP: ff72159bcfe5bdc8 RFLAGS: 00010206
RAX: 00000000000f0000 RBX: ff34c9efc9c00000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000010 RDI: ff34c9efc9c00000
RBP: ff34c9efc27d4828 R8: 0000000000000093 R9: 0000000000000040
R10: ff34c9efc27d4828 R11: 0000000000000040 R12: 0000000000100000
R13: 0000000000000010 R14: R15:
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ff72159bcfe5bdf8] ice_sriov_post_vsi_rebuild at ffffffffc0a85e2e [ice]
#8 [ff72159bcfe5be08] ice_reset_all_vfs at ffffffffc0a920b4 [ice]
#9 [ff72159bcfe5be48] ice_service_task at ffffffffc0a31519 [ice]
#10 [ff72159bcfe5be88] process_one_work at ffffffffaa93dca4
#11 [ff72159bcfe5bec8] worker_thread at ffffffffaa93e9de
#12 [ff72159bcfe5bf18] kthread at ffffffffaa946663
#13 [ff72159bcfe5bf50] ret_from_fork at ffffffffaa8086b9
The panic occurs attempting to dereference the NULL pointer in RDX at
ice_sriov.c:294, which loads vsi->txq_map (offset 0x4b8 in ice_vsi).
The faulting VSI is an allocated slab object but not fully initialized
after a failed ice_vsi_rebuild():
crash> struct ice_vsi 0xff34c9efc27d4828
netdev = 0x0,
rx_rings = 0x0,
tx_rings = 0x0,
q_vectors = 0x0,
txq_map = 0x0,
rxq_map = 0x0,
alloc_txq = 0x10,
num_txq = 0x10,
alloc_rxq = 0x10,
num_rxq = 0x10,
The nvmupdate64e process was performing NVM firmware update:
crash> bt 0xff34c9edd1a30000
PID: 49858 TASK: ff34c9edd1a30000 CPU: 1 COMMAND: "nvmupdate64e"
#0 [ff72159bcd617618] __schedule at ffffffffab5333f8
#4 [ff72159bcd617750] ice_sq_send_cmd at ffffffffc0a35347 [ice]
#5 [ff72159bcd6177a8] ice_sq_send_cmd_retry at ffffffffc0a35b47 [ice]
#6 [ff72159bcd617810] ice_aq_send_cmd at ffffffffc0a38018 [ice]
#7 [ff72159bcd617848] ice_aq_read_nvm at ffffffffc0a40254 [ice]
#8 [ff72159bcd6178b8] ice_read_flat_nvm at ffffffffc0a4034c [ice]
#9 [ff72159bcd617918] ice_devlink_nvm_snapshot at ffffffffc0a6ffa5 [ice]
dmesg:
ice 0000:13:00.0: firmware recommends not updating fw.mgmt, as it
may result in a downgrade. continuing anyways
ice 0000:13:00.1: ice_init_nvm failed -5
ice 0000:13:00.1: Rebuild failed, unload and reload driver
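The guarded loop can be sketched as a small userspace model (struct and helper names are stand-ins, not driver code):

```c
#include <assert.h>
#include <stdbool.h>

/* Model of ice_reset_all_vfs() after the fix: a failed VSI rebuild
 * skips the post-rebuild steps, so the torn-down VSI (txq_map == NULL)
 * is never dereferenced and the VF stays safely disabled. */
struct vf {
	bool rebuild_ok;	/* whether ice_vf_rebuild_vsi() succeeds */
	bool active;		/* set by ice_vf_post_vsi_rebuild() */
};

static int vf_rebuild_vsi(struct vf *vf)
{
	return vf->rebuild_ok ? 0 : -5;		/* -EIO */
}

static void reset_all_vfs(struct vf *vfs, int n)
{
	for (int i = 0; i < n; i++) {
		if (vf_rebuild_vsi(&vfs[i]))
			continue;	/* guard: leave the VF disabled */
		vfs[i].active = true;	/* post_vsi_rebuild + eswitch */
	}
}
```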
Fixes: 12bb018c53 ("ice: Refactor VF reset")
Signed-off-by: Petr Oros <poros@redhat.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-5-cdcb48303fd8@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The V1 ADD_VLAN opcode had no success handler; filters sent via V1
stayed in ADDING state permanently. Add a fallthrough case so V1
filters also transition ADDING -> ACTIVE on PF confirmation.
Critically, add an `if (v_retval) break` guard: the error switch in
iavf_virtchnl_completion() does NOT return after handling errors,
it falls through to the success switch. Without this guard, a
PF-rejected ADD would incorrectly mark ADDING filters as ACTIVE,
creating a driver/HW mismatch where the driver believes the filter
is installed but the PF never accepted it.
For V2, this is harmless: iavf_vlan_add_reject() in the error
block already kfree'd all ADDING filters, so the success handler
finds nothing to transition.
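The completion logic can be modeled as follows (enum and function names are illustrative, not the iavf identifiers):

```c
#include <assert.h>

enum fstate { FS_ADDING, FS_ACTIVE };

/* Model of the ADD_VLAN completion after the fix: bail out on a PF
 * error before the success handling, and let both V1 and V2 filters
 * transition ADDING -> ACTIVE on success. */
static void add_vlan_completion(int v_retval, enum fstate *f, int n)
{
	if (v_retval)
		return;	/* rejected ADD must not activate filters */
	for (int i = 0; i < n; i++)
		if (f[i] == FS_ADDING)
			f[i] = FS_ACTIVE;
}
```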
Fixes: 968996c070 ("iavf: Fix VLAN_V2 addition/rejection")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-4-cdcb48303fd8@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The VLAN filter DELETE path was asymmetric with the ADD path: ADD
waits for PF confirmation (ADD -> ADDING -> ACTIVE), but DELETE
immediately frees the filter struct after sending the DEL message
without waiting for the PF response.
This is problematic because:
- If the PF rejects the DEL, the filter remains in HW but the driver
has already freed the tracking structure, losing sync.
- Race conditions between DEL pending and other operations
(add, reset) cannot be properly resolved if the filter struct
is already gone.
Add IAVF_VLAN_REMOVING state to make the DELETE path symmetric:
REMOVE -> REMOVING (send DEL) -> PF confirms -> kfree
                              -> PF rejects  -> ACTIVE
In iavf_del_vlans(), transition filters from REMOVE to REMOVING
instead of immediately freeing them. The new DEL completion handler
in iavf_virtchnl_completion() frees filters on success or reverts
them to ACTIVE on error.
Update iavf_add_vlan() to handle the REMOVING state: if a DEL is
pending and the user re-adds the same VLAN, queue it for ADD so
it gets re-programmed after the PF processes the DEL.
The !VLAN_FILTERING_ALLOWED early-exit path still frees filters
directly since no PF message is sent in that case.
Also update iavf_del_vlan() to skip filters already in REMOVING
state: DEL has been sent to PF and the completion handler will
free the filter when PF confirms. Without this guard, the sequence
DEL(pending) -> user-del -> second DEL could cause the PF to return
an error for the second DEL (filter already gone), causing the
completion handler to incorrectly revert a deleted filter back to
ACTIVE.
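The state transitions above can be sketched as two pure functions (FS_FREED stands in for kfree of the filter struct; names are illustrative):

```c
#include <assert.h>

enum fstate { FS_ACTIVE, FS_REMOVE, FS_REMOVING, FS_FREED };

/* iavf_del_vlans(): send DEL and move REMOVE -> REMOVING; a filter
 * already in REMOVING is skipped, avoiding a second DEL. */
static enum fstate del_send(enum fstate s)
{
	return s == FS_REMOVE ? FS_REMOVING : s;
}

/* DEL completion: free on PF confirm, revert to ACTIVE on PF reject. */
static enum fstate del_complete(enum fstate s, int v_retval)
{
	if (s != FS_REMOVING)
		return s;
	return v_retval ? FS_ACTIVE : FS_FREED;
}
```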
Fixes: 968996c070 ("iavf: Fix VLAN_V2 addition/rejection")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-3-cdcb48303fd8@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When a VF goes down, the driver currently sends DEL_VLAN to the PF for
every VLAN filter (ACTIVE -> DISABLE -> send DEL -> INACTIVE), then
re-adds them all on UP (INACTIVE -> ADD -> send ADD -> ADDING ->
ACTIVE). This round-trip is unnecessary because:
1. The PF disables the VF's queues via VIRTCHNL_OP_DISABLE_QUEUES,
which already prevents all RX/TX traffic regardless of VLAN filter
state.
2. Leaving the VLAN filters in PF HW while the VF is down is
harmless - packets matching those filters have nowhere to go with
queues disabled.
3. The DEL+ADD cycle during down/up creates race windows where the
VLAN filter list is incomplete. With spoofcheck enabled, the PF
enables TX VLAN filtering on the first non-zero VLAN add, blocking
traffic for any VLANs not yet re-added.
Remove the entire DISABLE/INACTIVE state machinery:
- Remove IAVF_VLAN_DISABLE and IAVF_VLAN_INACTIVE enum values
- Remove iavf_restore_filters() and its call from iavf_open()
- Remove VLAN filter handling from iavf_clear_mac_vlan_filters(),
rename it to iavf_clear_mac_filters()
- Remove DEL_VLAN_FILTER scheduling from iavf_down()
- Remove all DISABLE/INACTIVE handling from iavf_del_vlans()
VLAN filters now stay ACTIVE across down/up cycles. Only explicit
user removal (ndo_vlan_rx_kill_vid) or PF/VF reset triggers VLAN
filter deletion/re-addition.
Fixes: ed1f5b58ea ("i40evf: remove VLAN filters on close")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-2-cdcb48303fd8@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When page_pool_create_percpu() fails in page_pool_list(), it falls
through to its err_uninit: label, which calls page_pool_uninit().
At that point page_pool_init() has already taken two references
when the user requested PP_FLAG_ALLOW_UNREADABLE_NETMEM:
pool->mp_ops->init(pool)
static_branch_inc(&page_pool_mem_providers);
Neither is undone by page_pool_uninit(); both are only undone by
__page_pool_destroy() (success-side teardown). The error path
therefore leaks the per-provider reference taken by mp_ops->init
(io_zcrx_ifq->refs in the io_uring zcrx provider, the dmabuf
binding refcount in the devmem provider) plus one increment of
the page_pool_mem_providers static branch on every failure of
xa_alloc_cyclic() inside page_pool_list().
The leaked io_zcrx_ifq->refs in turn pins everything
io_zcrx_ifq_free() would release on cleanup: ifq->user (uid),
ifq->mm_account (mmdrop), ifq->dev (device refcount),
ifq->netdev_tracker (netdev refcount), and the rbuf region.
The leaked static branch increment forces all subsequent
page_pool_alloc_netmems() and page_pool_return_page() callers to
take the slow mp_ops branch for the lifetime of the kernel.
Reachable via the io_uring zcrx path:
io_uring_register(IORING_REGISTER_ZCRX_IFQ)   /* CAP_NET_ADMIN */
  -> __io_uring_register
    -> io_register_zcrx
      -> zcrx_register_netdev
        -> netif_mp_open_rxq
          -> driver ndo_queue_mem_alloc
            -> page_pool_create_percpu
              -> page_pool_init succeeds (mp_ops->init runs, branch++)
              -> page_pool_list fails (xa_alloc_cyclic -ENOMEM)
              -> goto err_uninit   <-- leak
The same shape applies to the devmem dmabuf provider via
mp_dmabuf_devmem_init()/mp_dmabuf_devmem_destroy().
Restore the cleanup symmetry by moving the mp_ops->destroy() and
static_branch_dec() calls out of __page_pool_destroy() and into
page_pool_uninit(), so page_pool_uninit() is again the strict
inverse of page_pool_init(). page_pool_uninit() has only two
callers (the err_uninit: path and __page_pool_destroy()), so this
preserves the single-call invariant on the success path while
fixing the err path. The error path of page_pool_init() itself
still skips the mp_ops cleanup correctly: mp_ops->init is the
last action that takes a reference before page_pool_init() returns
0, so when it returns an error neither the refcount nor the static
branch has been touched.
Triggering the bug requires xa_alloc_cyclic() to fail with -ENOMEM,
which under normal GFP_KERNEL retry behaviour is rare. It is
deterministic under CONFIG_FAULT_INJECTION with fail_page_alloc /
xa fault injection, or under sustained memory pressure. The leak
is silent: there is no warning, and the released kernel build
continues running with a permanently-incremented static branch.
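The restored init/uninit symmetry can be modeled with counters standing in for the provider refcount and the static branch (names and error codes are illustrative):

```c
#include <assert.h>

static int mp_refs, branch;

static int pool_init(void)
{
	mp_refs++;	/* mp_ops->init() */
	branch++;	/* static_branch_inc(&page_pool_mem_providers) */
	return 0;
}

/* After the fix, uninit is the strict inverse of init. */
static void pool_uninit(void)
{
	mp_refs--;	/* mp_ops->destroy(), moved here by the fix */
	branch--;	/* static_branch_dec(), moved here by the fix */
}

static int pool_create(int list_err)	/* page_pool_create_percpu() */
{
	int err = pool_init();

	if (err)
		return err;
	if (list_err) {		/* page_pool_list() failed */
		pool_uninit();	/* err_uninit: no longer leaks */
		return list_err;
	}
	return 0;
}
```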
Fixes: 0f92140468 ("memory-provider: dmabuf devmem memory provider")
Signed-off-by: Hasan Basbunar <basbunarhasan@gmail.com>
Link: https://patch.msgid.link/20260428170739.34881-1-basbunarhasan@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The IPv4/IPv6 and routing code is not very well separated from
the TCP/UDP code. Scope it down properly by providing a more
accurate file list instead of all of net/ipv4/ and net/ipv6/.
Now that the entry more accurately represents layer 3 and
routing, merge the nexthop entry into it.
Add Ido Schimmel as a co-maintainer; Ido's git history speaks
for itself.
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260428203924.1229169-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Minor clarifications in the README:
- call out what linters we expect to be clean
- make it clear that by "frameworks" we mean code under lib/
not just factoring code out in the same file
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit db359fccf2 ("mm: introduce a new page type for page pool in
page type") added a page_type field to struct net_iov at the same
offset as struct page::page_type, so that page_pool_set_pp_info() can
call __SetPageNetpp() uniformly on both pages and net_iovs.
The page-type API requires the field to hold the UINT_MAX "no type"
sentinel before a type can be set; for real struct page that invariant
is established by the page allocator on free. struct net_iov is not
allocated through the page allocator, so the field is left as zero
(io_uring zcrx, which uses __GFP_ZERO) or as slab garbage (devmem,
which uses kvmalloc_objs() without zeroing). When the page pool then
calls page_pool_set_pp_info() on a freshly-bound niov,
__SetPageNetpp()'s VM_BUG_ON_PAGE(page->page_type != UINT_MAX) fires
and the kernel BUGs. Triggered in selftests by io_uring zcrx setup
through the fbnic queue restart path:
kernel BUG at ./include/linux/page-flags.h:1062!
RIP: 0010:page_pool_set_pp_info (./include/linux/page-flags.h:1062
net/core/page_pool.c:716)
Call Trace:
<TASK>
net_mp_niov_set_page_pool (net/core/page_pool.c:1360)
io_pp_zc_alloc_netmems (io_uring/zcrx.c:1089 io_uring/zcrx.c:1110)
fbnic_fill_bdq (./include/net/page_pool/helpers.h:160
drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:906)
__fbnic_nv_restart (drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2470
drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2874)
fbnic_queue_start (drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2903)
netdev_rx_queue_reconfig (net/core/netdev_rx_queue.c:137)
__netif_mp_open_rxq (net/core/netdev_rx_queue.c:234)
io_register_zcrx (io_uring/zcrx.c:818 io_uring/zcrx.c:903)
__io_uring_register (io_uring/register.c:931)
__do_sys_io_uring_register (io_uring/register.c:1029)
do_syscall_64 (arch/x86/entry/syscall_64.c:63
arch/x86/entry/syscall_64.c:94)
</TASK>
The same path is reachable through devmem dmabuf binding via
netdev_nl_bind_rx_doit() -> net_devmem_bind_dmabuf_to_queue().
Add a net_iov_init() helper that stamps ->owner, ->type and the
->page_type sentinel, and use it from both the devmem and io_uring
zcrx niov init loops.
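The invariant can be modeled in a few lines (the "netpp" type value below is a hypothetical stand-in; the real check is the VM_BUG_ON_PAGE in __SetPageNetpp()):

```c
#include <assert.h>
#include <limits.h>

/* Model: page_type must hold the UINT_MAX "no type" sentinel before a
 * type can be set. net_iov_init() stamps it because niovs never pass
 * through the page allocator. */
struct niov {
	unsigned int page_type;
};

static void niov_init(struct niov *n)	/* net_iov_init() */
{
	n->page_type = UINT_MAX;
}

static int set_netpp(struct niov *n)	/* __SetPageNetpp() */
{
	if (n->page_type != UINT_MAX)
		return -1;	/* the VM_BUG_ON in the real code */
	n->page_type = 0x1;	/* hypothetical "netpp" type value */
	return 0;
}
```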
Fixes: db359fccf2 ("mm: introduce a new page type for page pool in page type")
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Acked-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/20260428025320.853452-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthieu Baerts says:
====================
mptcp: misc fixes for v7.1-rc2
Here are various unrelated fixes:
- Patches 1-2: set timestamp flags on 'ssk', not 'sk' (typo), and do
that with a sleepable lock_sock/release_sock. A fix for v5.14.
- Patch 3: respect SO_LINGER(1, 0) by sending MP_FASTCLOSE at close time
as expected. A fix for v6.1.
- Patch 4: reset fullmesh counter after a flush. A fix for v6.19.
====================
Link: https://patch.msgid.link/20260427-net-mptcp-misc-fixes-7-1-rc2-v1-0-7432b7f279fa@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The SO_LINGER socket option has been supported for a while with MPTCP
sockets [1], but it didn't cause the equivalent of a TCP reset as
expected when enabled and its time was set to 0. This was causing some
behavioural differences with TCP where some connections were not
promptly stopped as expected.
To fix that, an extra condition is checked at close() time before
sending an MP_FASTCLOSE, the MPTCP equivalent of a TCP reset.
Note that backporting up to [1] will be difficult as more changes are
needed to be able to send MP_FASTCLOSE. It seems better to stop at [2],
which was supposed to already imitate TCP.
Validated with MPTCP packetdrill tests [3].
Fixes: 268b123874 ("mptcp: setsockopt: support SO_LINGER") [1]
Fixes: d21f834855 ("mptcp: use fastclose on more edge scenarios") [2]
Cc: stable@vger.kernel.org
Reported-by: Lance Tuller <lance@lance0.com>
Closes: https://github.com/lance0/xfr/pull/67
Link: https://github.com/multipath-tcp/packetdrill/pull/196 [3]
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260427-net-mptcp-misc-fixes-7-1-rc2-v1-3-7432b7f279fa@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Both mptcp_setsockopt_sol_socket_tstamp() and
mptcp_setsockopt_sol_socket_timestamping() iterate over subflows,
acquire the subflow socket lock, but then erroneously pass the MPTCP
msk socket to sock_set_timestamp() / sock_set_timestamping() instead
of the subflow ssk. As a result, the timestamp flags are set on the
wrong socket and have no effect on the actual subflows.
Pass ssk instead of sk to both helpers.
Fixes: 9061f24bf8 ("mptcp: sockopt: propagate timestamp request to subflows")
Cc: stable@vger.kernel.org
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260427-net-mptcp-misc-fixes-7-1-rc2-v1-1-7432b7f279fa@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Breno Leitao says:
====================
netconsole: configfs store callback fixes
There are still some changes I want to make, such as taking the
dynamic lock when reading from configfs (_show() callbacks), which
will solve other issues, but I will keep that for later.
====================
Link: https://patch.msgid.link/20260427-netconsole_ai_fixes-v2-0-59965f29d9cc@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
userdatum_value_store() updates udm->value first and only then calls
update_userdata() to rebuild the on-the-wire payload. If
update_userdata() fails (e.g. -ENOMEM from kmalloc), the function
returns the error to userspace, but udm->value already holds the new
string while the live nt->userdata buffer still reflects the old one.
The next successful write to any sibling userdatum on the same target
will call update_userdata() again, which walks every entry and packs
the now-stale udm->value into the payload. The failed write is thus
silently activated later, with no indication to userspace that the
value it tried to set was rejected.
Snapshot the previous value before overwriting udm->value and restore
it if update_userdata() fails so the visible state and the active
payload stay consistent.
Fixes: eb83801af2 ("netconsole: Dynamic allocation of userdata buffer")
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260427-netconsole_ai_fixes-v2-4-59965f29d9cc@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dev_name_store() calls strscpy(nt->np.dev_name, buf, IFNAMSIZ) without
checking the return value. If userspace writes an interface name longer
than IFNAMSIZ - 1, strscpy() silently truncates and returns -E2BIG, but
the function ignores it and reports a fully successful write back to
userspace.
If a real interface happens to match the truncated name, netconsole will
bind to the wrong device on the next enable, sending kernel logs and
panic output to an unintended network segment with no indication to
userspace that anything was rewritten.
Reject writes whose length cannot fit in nt->np.dev_name up front:
if (count >= IFNAMSIZ)
return -ENAMETOOLONG;
This is not a serious problem, but it is still the correct
approach.
Fixes: 0bcc181618 ("[NET] netconsole: Support dynamic reconfiguration using configfs")
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260427-netconsole_ai_fixes-v2-3-59965f29d9cc@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
userdatum_value_store() bounds count by MAX_EXTRADATA_VALUE_LEN (200)
and then copies straight into udm->value, which is itself 200 bytes:
if (count > MAX_EXTRADATA_VALUE_LEN)
return -EMSGSIZE;
...
ret = strscpy(udm->value, buf, sizeof(udm->value));
if (ret < 0)
goto out_unlock;
If userspace writes exactly MAX_EXTRADATA_VALUE_LEN bytes with no NUL
within them, strscpy() copies 199 bytes plus a NUL into udm->value and
returns -E2BIG. The function jumps to out_unlock and reports the error
to userspace, but udm->value has already been overwritten with the
truncated string and update_userdata() is skipped, so the corruption
is not yet visible on the wire.
The next successful write to any userdatum entry under the same target
calls update_userdata(), which packs udm->value into the active
netconsole payload. From that point on, every netconsole message
carries the silently truncated value, and userspace has no indication
that a previous, error-returning write left state behind.
Tighten the entry check from "count > MAX_EXTRADATA_VALUE_LEN" to
"count >= MAX_EXTRADATA_VALUE_LEN". With count strictly less than
sizeof(udm->value), strscpy() can no longer return -E2BIG here, so
the corrupting truncation path is removed entirely.
Fixes: 8a6d5fec6c ("net: netconsole: add a userdata config_group member to netconsole_target")
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260427-netconsole_ai_fixes-v2-2-59965f29d9cc@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Several configfs store callbacks in netconsole end with:
ret = strnlen(buf, count);
This under-reports the number of bytes consumed when the input
contains an embedded NUL within count, telling the VFS that fewer
bytes were written than userspace actually handed in. A conformant
partial-write loop would then retry the trailing bytes against a
callback that has already accepted them.
Every other configfs driver in the tree returns count directly from
its store callbacks once parsing has succeeded, including
drivers/nvme/target/configfs.c, drivers/gpio/gpio-sim.c,
drivers/most/configfs.c, drivers/block/null_blk/main.c,
drivers/pci/endpoint/pci-ep-cfs.c, and the rest of the configfs
users. netconsole was the outlier (along with
drivers/infiniband/core/cma_configfs.c, which has the same latent
issue).
Align netconsole with the rest of the configfs ecosystem: return
count once the parser/validator has accepted the input. The numeric
and boolean parsers (kstrtobool, kstrtou16, mac_pton,
netpoll_parse_ip_addr) have already validated the meaningful prefix;
any trailing bytes are padding and should simply be reported as
consumed.
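The convention can be illustrated with a minimal store callback (a hypothetical boolean attribute, not one of netconsole's actual callbacks):

```c
#include <assert.h>
#include <string.h>

/* Once the parser accepts the input, report the full count as
 * consumed. strnlen(buf, count) would under-report when the buffer
 * carries an embedded NUL or trailing padding, inviting a conformant
 * partial-write loop to retry bytes that were already accepted. */
static long enabled_store(const char *buf, size_t count)
{
	if (count < 1 || (buf[0] != '0' && buf[0] != '1'))
		return -22;	/* -EINVAL */
	/* ... apply the parsed value ... */
	return (long)count;	/* not strnlen(buf, count) */
}
```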
Fixes: 0bcc181618 ("[NET] netconsole: Support dynamic reconfiguration using configfs")
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260427-netconsole_ai_fixes-v2-1-59965f29d9cc@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet says:
====================
net/sched: sch_cake: annotate data-races in cake_dump_stats() (series)
cake_dump_stats() runs without qdisc spinlock being held.
This mini series adds missing READ_ONCE()/WRITE_ONCE() annotations.
Original patch was too big, splitting it eases code review.
====================
Link: https://patch.msgid.link/20260427083606.459355-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cake_dump_stats() runs without qdisc spinlock being held.
In this fourth patch, I add READ_ONCE()/WRITE_ONCE() annotations
for the following fields:
- avg_peak_bandwidth
- buffer_limit
- buffer_max_used
- avg_netoff
- max_netlen
- max_adjlen
- min_netlen
- min_adjlen
- active_queues
- tin_rate_bps
- bytes
- tin_backlog
Other annotations are added in following patch, to ease code review.
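The annotation pattern shared by this series can be sketched with minimal userspace stand-ins for the kernel macros (the real READ_ONCE/WRITE_ONCE in rwonce.h do more; this relies on the GCC __typeof__ extension):

```c
#include <assert.h>

#define WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))

struct tin_stats {
	unsigned long long bytes;
	unsigned int tin_backlog;
};

/* Datapath side: writes annotated so the lockless reader sees
 * untorn values. */
static void enqueue_account(struct tin_stats *ts, unsigned int len)
{
	WRITE_ONCE(ts->bytes, ts->bytes + len);
	WRITE_ONCE(ts->tin_backlog, ts->tin_backlog + len);
}

/* Dump side: runs without the qdisc spinlock, so reads are
 * annotated too. */
static unsigned long long dump_bytes(const struct tin_stats *ts)
{
	return READ_ONCE(ts->bytes);
}
```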
Fixes: 046f6fd5da ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260427083606.459355-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cake_dump_stats() runs without qdisc spinlock being held.
In this third patch, I add READ_ONCE()/WRITE_ONCE() annotations
for the following fields:
- packets
- tin_dropped
- tin_ecn_mark
- ack_drops
- peak_delay
- avge_delay
- base_delay
Other annotations are added in following patches, to ease code review.
Fixes: 046f6fd5da ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: "Toke Høiland-Jørgensen" <toke@toke.dk>
Link: https://patch.msgid.link/20260427083606.459355-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cake_dump_stats() runs without qdisc spinlock being held.
In this second patch, I add READ_ONCE()/WRITE_ONCE() annotations
for the following fields:
- bulk_flow_count
- unresponsive_flow_count
- max_skblen
- flow_quantum
Other annotations are added in following patches, to ease code review.
Fixes: 046f6fd5da ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: "Toke Høiland-Jørgensen" <toke@toke.dk>
Link: https://patch.msgid.link/20260427083606.459355-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cake_dump_stats() runs without qdisc spinlock being held.
In this first patch, I add READ_ONCE()/WRITE_ONCE() annotations
for the following fields:
- way_hits
- way_misses
- way_collisions
- sparse_flow_count
- decaying_flow_count
Other annotations are added in following patches, to ease code review.
Fixes: 046f6fd5da ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: "Toke Høiland-Jørgensen" <toke@toke.dk>
Link: https://patch.msgid.link/20260427083606.459355-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
bareudp_fill_metadata_dst() passes bareudp->sock to
udp_tunnel6_dst_lookup() in the IPv6 path without a NULL check.
The socket is only created in bareudp_open() and NULLed in
bareudp_stop(), so calling this function while the device is down
triggers a NULL dereference via sock->sk.
BUG: kernel NULL pointer dereference, address: 0000000000000018
RIP: 0010:udp_tunnel6_dst_lookup (net/ipv6/ip6_udp_tunnel.c:160)
Call Trace:
<TASK>
bareudp_fill_metadata_dst (drivers/net/bareudp.c:532)
do_execute_actions (net/openvswitch/actions.c:901)
ovs_execute_actions (net/openvswitch/actions.c:1589)
ovs_packet_cmd_execute (net/openvswitch/datapath.c:700)
genl_family_rcv_msg_doit (net/netlink/genetlink.c:1114)
genl_rcv_msg (net/netlink/genetlink.c:1209)
netlink_rcv_skb (net/netlink/af_netlink.c:2550)
</TASK>
Add a NULL check returning -ESHUTDOWN, consistent with the xmit paths
in the same driver.
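The guard can be sketched as follows; the struct layout is a simplified stand-in for the driver's types, keeping only what the check needs:

```c
#include <assert.h>
#include <stddef.h>

#define ESHUTDOWN 108

/* Sketch of the fix: bail out before dereferencing the tunnel socket
 * when the device is down, mirroring the check the xmit paths already
 * perform. Types are illustrative stand-ins, not the driver structs. */
struct bareudp_sock { int sk; };
struct bareudp_dev  { struct bareudp_sock *sock; };

static int fill_metadata_dst_sketch(struct bareudp_dev *bareudp)
{
	if (!bareudp->sock)	/* device not open: no socket yet */
		return -ESHUTDOWN;
	return 0;		/* proceed with the route lookup */
}
```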
Fixes: 571912c69f ("net: UDP tunnel encapsulation module for tunnelling different protocols like MPLS, IP, NSH etc.")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260426165350.1663137-2-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long says:
====================
sctp: fix a vtag verification failure caused by stale INITs
Similar to Scenario B in commit 8e56b063c8 ("netfilter: handle the
connecting collision properly in nf_conntrack_proto_sctp"):
Scenario B: INIT_ACK is delayed until the peer completes its own handshake
192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO]
192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK]
192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021] *
There is another case:
Scenario F: INIT is delayed until the peer completes its own handshake
192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
(OVS upcall)
192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO]
192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK]
192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
(delayed)
192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021] *
In this case, the delayed INIT (e.g. due to OVS upcall) is recorded by
conntrack, which prevents vtag verification from dropping the unexpected
INIT-ACK in nf_conntrack_sctp_packet():
vtag = ct->proto.sctp.vtag[!dir];
if (!ct->proto.sctp.init[!dir] && vtag && vtag != ih->init_tag)
goto out_unlock;
This happens because ct->proto.sctp.init[!dir] is set by the delayed INIT,
even though it is stale.
Fix this in two parts:
- In netfilter: Do not record INITs whose init_tag matches the peer vtag,
as they carry no new handshake state in the 1st patch.
- In SCTP: Prevent endpoints from responding to such INITs with INIT-ACK,
ensuring correctness even when middleboxes lack the netfilter fix in
the 2nd patch.
A follow-up selftest for this scenario will be posted in a separate patch
by Yi Chen.
====================
Link: https://patch.msgid.link/cover.1777214801.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
An INIT whose init_tag matches the peer's vtag does not provide new state
information. It indicates either:
- a stale INIT (after INIT-ACK has already been seen on the same side), or
- a retransmitted INIT (after INIT has already been recorded on the same
side).
In both cases, the INIT must not update ct->proto.sctp.init[] state, since
it does not advance the handshake tracking and may otherwise corrupt
INIT/INIT-ACK validation logic.
Allow INIT processing only when the conntrack entry is newly created
(SCTP_CONNTRACK_NONE), or when the init_tag differs from the stored peer
vtag.
Note that the check is skipped for a ct with old_state
SCTP_CONNTRACK_NONE in nf_conntrack_sctp_packet(), as it has just been
created in sctp_new(), which sets
ct->proto.sctp.vtag[IP_CT_DIR_REPLY] = ih->init_tag.
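The rule above reduces to a small predicate; the names here model the conntrack state rather than quoting the patch verbatim:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SCTP_CONNTRACK_NONE 0

/* Sketch of the new rule: only let an INIT update the tracked
 * handshake state when the entry is brand new, or when its init_tag
 * actually differs from the vtag already stored for the peer.
 * Parameter names are simplified stand-ins for the ct fields. */
static bool init_may_update_state(int old_state,
				  uint32_t stored_peer_vtag,
				  uint32_t init_tag)
{
	if (old_state == SCTP_CONNTRACK_NONE)
		return true;			/* just created in sctp_new() */
	return init_tag != stored_peer_vtag;	/* stale/retransmit: skip */
}
```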
Fixes: 9fb9cbb108 ("[NETFILTER]: Add nf_conntrack subsystem.")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/ee56c3e416452b2a40589a2a85245ac2ad5e9f4b.1777214801.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The dev-set and key-rotate netlink operations modify shared device
state (PSP version configuration and cryptographic key material,
respectively) but do not require CAP_NET_ADMIN. The only access
control is psp_dev_check_access(), which merely verifies netns
membership. Require CAP_NET_ADMIN for both operations.
Fixes: 00c94ca2b9 ("psp: base PSP device support")
Reviewed-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260427195856.401223-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
psp_assoc_device_get_locked() obtains a psp_dev reference via
psp_dev_get_for_sock() (which uses psp_dev_tryget() under RCU);
it then acquires psd->lock and drops the reference. Before
the lock is taken, psp_dev_unregister() can run to completion:
take psd->lock, clear out state, unlock, drop the registration
reference.
The expectation is that the lock prevents device unregistration,
but much like with netdevs special care has to be taken when
"upgrading" a reference to a locked device. Add the missing
check if device is still alive. psp_dev_is_registered() exists
already but had no callers, which makes me wonder if I either
forgot to add this or lost the check during refactoring...
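The "upgrade" pattern can be modeled in a few lines; the struct and helper are illustrative stand-ins for struct psp_dev and psp_dev_is_registered():

```c
#include <assert.h>
#include <stdbool.h>

#define ENODEV 19

/* Userspace model of the fix: holding psd->lock only helps if we also
 * re-check that the device did not finish unregistering between the
 * refcount "tryget" and the lock acquisition. */
struct psp_dev_model { bool registered; };

static int assoc_device_get_locked(struct psp_dev_model *psd)
{
	/* (psd->lock is taken here in the real code) */
	if (!psd->registered)	/* the missing liveness check */
		return -ENODEV;
	return 0;		/* safe to use the device under the lock */
}
```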
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Fixes: 6b46ca260e ("net: psp: add socket security association code")
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260427190606.366101-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) IEEE1394 ARP payloads contain no target hardware address in the
   ARP packet. Apparently, arp_tables was never updated to deal with
   IEEE1394 ARP properly. To deal with this, return no match in case
   the target hardware address selector is used, either for an inverse
   or a normal match. Moreover, arpt_mangle now disallows mangling of
   the target hardware and IP address: it is not worth adjusting the
   offset calculation to fix this, as we suspect there are no users of
   arp_tables for this family.
2) Use list_del_rcu() to delete device hooks in nf_tables, this hook
list is RCU protected, concurrent netlink dump readers can be
walking on this list, fix it by adding a helper function and use it
for consistency. From Florian Westphal.
3) Add list_splice_rcu(), this is useful for joining the local list of
new device hooks to the RCU protected hook list in chain and
flowtable. Reviewed by Paul E. McKenney.
4) Use list_splice_rcu() to publish the new device hooks in chain and
flowtable to fix concurrent netlink dump traversal.
5) Add a new hook transaction object to track device hook deletions.
The current approach moves device hooks to be deleted around during
the preparation phase, this breaks concurrent RCU reader via netlink
dump. This new hook transaction is combined with NFT_HOOK_REMOVE
flag to annotate hooks for removal in the preparation phase.
6) xt_policy inbound policy check in strict mode can lead to
   out-of-bounds access of the secpath array due to incorrect
   iteration order. The iteration over the secpath needs to be
   reversed for the inbound check: the human-readable policy expects
   inner in first position and outer in second position, while the
   inbound secpath actually stores outer in first position and inner
   in second position. From Jiexun Wang.
7) Fix possible zero shift in nft_bitwise triggering UBSAN splat,
reject zero shift from control plane, from Kai Ma.
8) Replace simple_strtoul() in the conntrack SIP helper since it relies
on nul-terminated strings. From Florian Westphal.
* tag 'nf-26-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_conntrack_sip: don't use simple_strtoul
netfilter: reject zero shift in nft_bitwise
netfilter: xt_policy: fix strict mode inbound policy matching
netfilter: nf_tables: add hook transactions for device deletions
netfilter: nf_tables: join hook list via splice_list_rcu() in commit phase
rculist: add list_splice_rcu() for private lists
netfilter: nf_tables: use list_del_rcu for netlink hooks
netfilter: arp_tables: fix IEEE1394 ARP payload parsing
====================
Link: https://patch.msgid.link/20260428095840.51961-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Table 7-121 in the datasheet says we have to set register 0xc6
to value 0x10 before CLK_O_SEL can be modified. No further information
about this field is found in the datasheet. With this fix, setting
the CLK_O_SEL field in the IO_MUX_CFG register works through the dts
property "ti,clk-output-sel" on a DP83869HMRGZR.
Signed-off-by: Heiko Schocher <hs@nabladev.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 01db923e83 ("net: phy: dp83869: Add TI dp83869 phy")
Link: https://patch.msgid.link/20260425031339.3318-1-hs@nabladev.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently, mctp_i2c_get_tx_flow_state() is called before the packet length
sanity check. This function marks a new flow as active in the MCTP core.
If the sanity check fails, mctp_i2c_xmit() returns early without calling
mctp_i2c_lock_nest(). This results in a mismatched locking state: the
flow is active, but the I2C bus lock was never acquired for it.
When the flow is later released, mctp_i2c_release_flow() will see the
active state and queue an unlock marker. The TX thread will then
decrement midev->i2c_lock_count from 0, causing it to underflow to -1.
This underflow permanently breaks the driver's locking logic, allowing
future transmissions to occur without holding the I2C bus lock, leading
to bus collisions and potential hardware hangs.
Move the mctp_i2c_get_tx_flow_state() call to after the length sanity
check to ensure we only transition the flow state if we are actually
going to proceed with the transmission and locking.
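The reordering can be sketched as below; the length limit, error code, and state variable model the driver's behavior rather than its actual API:

```c
#include <assert.h>

#define EMSGSIZE 90
#define MAX_LEN  255u

static int flow_active;	/* models the flow state in the MCTP core */

/* Sketch of the ordering fix: validate the packet length first, and
 * only mark the flow active once we know we will proceed to lock the
 * bus and transmit. Before the fix the two lines were swapped, so the
 * early exit left the flow active with no lock ever taken. */
static int xmit_sketch(unsigned int len)
{
	if (len > MAX_LEN)
		return -EMSGSIZE;	/* early exit: flow state untouched */
	flow_active = 1;		/* get_tx_flow_state(), now after the check */
	/* ... mctp_i2c_lock_nest() and transmit ... */
	return 0;
}
```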
Fixes: f5b8abf9fc ("mctp i2c: MCTP I2C binding driver")
Signed-off-by: William A. Kennington III <william@wkennington.com>
Acked-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20260423074741.201460-1-william@wkennington.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The CPU receives frames from the MAC through conventional DMA: the CPU
allocates buffers for the MAC, then the MAC fills them and returns
ownership to the CPU. For each hardware RX queue, the CPU and MAC
coordinate through a shared ring array of DMA descriptors: one
descriptor per DMA buffer. Each descriptor includes the buffer's
physical address and a status flag ("OWN") indicating which side owns
the buffer: OWN=0 for CPU, OWN=1 for MAC. The CPU is only allowed to set
the flag and the MAC is only allowed to clear it, and both must move
through the ring in sequence: thus the ring is used for both
"submissions" and "completions."
In the stmmac driver, stmmac_rx() bookmarks its position in the ring
with the `cur_rx` index. The main receive loop in that function checks
for rx_descs[cur_rx].own=0, gives the corresponding buffer to the
network stack (NULLing the pointer), and increments `cur_rx` modulo the
ring size. After the loop exits, stmmac_rx_refill(), which bookmarks its
position with `dirty_rx`, allocates fresh buffers and rearms the
descriptors (setting OWN=1). If it fails any allocation, it simply stops
early (leaving OWN=0) and will retry where it left off when next called.
This means descriptors have a three-stage lifecycle (terms my own):
- `empty` (OWN=1, buffer valid)
- `full` (OWN=0, buffer valid and populated)
- `dirty` (OWN=0, buffer NULL)
But because stmmac_rx() only checks OWN, it confuses `full`/`dirty`. In
the past (see 'Fixes:'), there was a bug where the loop could cycle
`cur_rx` all the way back to the first descriptor it dirtied, resulting
in a NULL dereference when mistaken for `full`. The aforementioned
commit resolved that *specific* failure by capping the loop's iteration
limit at `dma_rx_size - 1`, but this is only a partial fix: if the
previous stmmac_rx_refill() didn't complete, then there are leftover
`dirty` descriptors that the loop might encounter without needing to
cycle fully around. The current code therefore panics (see 'Closes:')
when stmmac_rx_refill() is memory-starved long enough for `cur_rx` to
catch up to `dirty_rx`.
Fix this by explicitly checking, before advancing `cur_rx`, if the next
entry is dirty; exit the loop if so. This prevents processing of the
final, used descriptor until stmmac_rx_refill() succeeds, but
fully prevents the `cur_rx == dirty_rx` ambiguity as the previous bugfix
intended: so remove the clamp as well. Since stmmac_rx_zc() is a
copy-paste-and-tweak of stmmac_rx() and the code structure is identical,
any fix to stmmac_rx() will also need a corresponding fix for
stmmac_rx_zc(). Therefore, apply the same check there.
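The new exit condition can be modeled on a toy ring; the descriptor struct and ring size are simplified stand-ins for the driver's DMA ring:

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4

/* Sketch of the fix: before advancing cur_rx, peek at the next
 * descriptor and stop if it is `dirty` (OWN=0 but the buffer was
 * already handed to the stack and NULLed), instead of mistaking it
 * for a freshly completed `full` descriptor. */
struct desc { int own; void *buf; };

static int rx_should_stop(const struct desc *ring, unsigned int cur_rx)
{
	unsigned int next = (cur_rx + 1) % RING_SIZE;

	/* dirty: CPU-owned but not yet rearmed by stmmac_rx_refill() */
	return ring[next].own == 0 && ring[next].buf == NULL;
}
```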
In stmmac_rx() (not stmmac_rx_zc()), a related bug remains: after the
MAC sets OWN=0 on the final descriptor, it will be unable to send any
further DMA-complete IRQs until it's given more `empty` descriptors.
Currently, the driver simply *hopes* that the next stmmac_rx_refill()
succeeds, risking an indefinite stall of the receive process if not. But
this is not a regression, so it can be addressed in a future change.
Fixes: b6cb454185 ("net: stmmac: avoid rx queue overrun")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221010
Cc: stable@vger.kernel.org
Suggested-by: Russell King <linux@armlinux.org.uk>
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
Link: https://patch.msgid.link/20260422044503.5349-1-CFSworks@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
seg6_input_core() and rpl_input() call ip6_route_input() which sets a
NOREF dst on the skb, then pass it to dst_cache_set_ip6() invoking
dst_hold() unconditionally.
On PREEMPT_RT, ksoftirqd is preemptible and a higher-priority task can
release the underlying pcpu_rt between the lookup and the caching
through a concurrent FIB lookup on a shared nexthop.
Simplified race sequence:
ksoftirqd/X higher-prio task (same CPU X)
----------- --------------------------------
seg6_input_core(,skb)/rpl_input(skb)
dst_cache_get()
-> miss
ip6_route_input(skb)
-> ip6_pol_route(,skb,flags)
[RT6_LOOKUP_F_DST_NOREF in flags]
-> FIB lookup resolves fib6_nh
[nhid=N route]
-> rt6_make_pcpu_route()
[creates pcpu_rt, refcount=1]
pcpu_rt->sernum = fib6_sernum
[fib6_sernum=W]
-> cmpxchg(fib6_nh.rt6i_pcpu,
NULL, pcpu_rt)
[slot was empty, store succeeds]
-> skb_dst_set_noref(skb, dst)
[dst is pcpu_rt, refcount still 1]
rt_genid_bump_ipv6()
-> bumps fib6_sernum
[fib6_sernum from W to Z]
ip6_route_output()
-> ip6_pol_route()
-> FIB lookup resolves fib6_nh
[nhid=N]
-> rt6_get_pcpu_route()
pcpu_rt->sernum != fib6_sernum
[W <> Z, stale]
-> prev = xchg(rt6i_pcpu, NULL)
-> dst_release(prev)
[prev is pcpu_rt,
refcount 1->0, dead]
dst = skb_dst(skb)
[dst is the dead pcpu_rt]
dst_cache_set_ip6(dst)
-> dst_hold() on dead dst
-> WARN / use-after-free
For the race to occur, ksoftirqd must be preemptible (PREEMPT_RT without
PREEMPT_RT_NEEDS_BH_LOCK) and a concurrent task must be able to release
the pcpu_rt. Shared nexthop objects provide such a path, as two routes
pointing to the same nhid share the same fib6_nh and its rt6i_pcpu
entry.
Fix seg6_input_core() and rpl_input() by calling skb_dst_force() after
ip6_route_input() to force the NOREF dst into a refcounted one before
caching.
The output path is not affected as ip6_route_output() already returns a
refcounted dst.
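The refcount transition that skb_dst_force() performs can be modeled in a toy form; the struct and helper below are illustrative stand-ins, not the kernel dst API:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the fix: a NOREF dst borrowed from the per-CPU route
 * must be converted into a refcounted one before it is stored in the
 * long-lived dst_cache, so a concurrent pcpu_rt release cannot drop
 * the refcount to zero underneath us. */
struct dst_model { int refcnt; bool noref; };

static void dst_force(struct dst_model *dst)
{
	if (dst->noref) {	/* take the reference we were borrowing */
		dst->refcnt++;
		dst->noref = false;
	}
}
```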
Fixes: af4a2209b1 ("ipv6: sr: use dst_cache in seg6_input")
Fixes: a7a29f9c36 ("net: ipv6: add rpl sr tunnel")
Cc: stable@vger.kernel.org
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Link: https://patch.msgid.link/20260421094735.20997-1-andrea.mayer@uniroma2.it
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
netpoll_setup() decides whether to auto-populate the local source
address by testing np->local_ip.ip, which only inspects the first 4
bytes of the union inet_addr storage.
For an IPv6 netpoll whose caller-supplied local address has zero
high 32 bits (::1, ::<suffix>, IPv4-mapped ::ffff:a.b.c.d, etc.), this
misdetects the address as unset (it is not, but its first 4 bytes are
zero), calls netpoll_take_ipv6() and overwrites it with whatever
matching link-local/global address the device happens to expose
first.
Introduce a helper netpoll_local_ip_unset() that picks the correct
family-aware test (ipv6_addr_any() for IPv6, !.ip for IPv4) and use it
from netpoll_setup().
Reproducer is something like:
echo "::2" > local_ip
echo 1 > enabled
cat local_ip
# before this fix: 2001:db8::1 (caller-supplied ::2 was clobbered)
# after this fix: ::2
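The family-aware test (and why the old 32-bit test misfires on ::2) can be shown with a small sketch; the union here is a simplified stand-in for union inet_addr:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the helper's intent: "is the local address unset?" must
 * be answered per address family. Checking only the first 32 bits
 * (the IPv4 member of the union) misreads IPv6 addresses whose high
 * bits are zero. */
union addr_model {
	uint32_t ip;		/* IPv4 */
	uint8_t  in6[16];	/* IPv6 */
};

static bool local_ip_unset(const union addr_model *a, bool ipv6)
{
	if (ipv6) {
		static const uint8_t any[16];	/* models ipv6_addr_any() */
		return memcmp(a->in6, any, 16) == 0;
	}
	return a->ip == 0;
}
```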
Fixes: b7394d2429 ("netpoll: prepare for ipv6")
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260424-netpoll_fix-v1-1-3a55348c625f@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_clamp_probe0_to_user_timeout() computes remaining time in jiffies
using subtraction with an unsigned lvalue. If elapsed probing time
exceeds the configured TCP_USER_TIMEOUT, the underflow yields a large
value.
This ends up re-arming the probe timer for a full backoff interval
instead of expiring immediately, delaying connection teardown beyond
the configured timeout.
Fix this by preventing underflow so user-set timeout expiration is
handled correctly without extending the probe timer.
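The guarded subtraction amounts to this; the function name is illustrative, not the exact kernel helper:

```c
#include <assert.h>

/* Sketch of the underflow fix: compute the remaining probe0 time with
 * a guarded subtraction so that elapsed >= user_timeout yields 0
 * (expire now) rather than wrapping to a huge unsigned value that
 * re-arms the timer for a full backoff interval. */
static unsigned long remaining_probe0(unsigned long user_timeout,
				      unsigned long elapsed)
{
	return user_timeout > elapsed ? user_timeout - elapsed : 0;
}
```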
Fixes: 344db93ae3 ("tcp: make TCP_USER_TIMEOUT accurate for zero window probes")
Link: https://lore.kernel.org/r/20260414013634.43997-1-ahacigu.linux@gmail.com
Signed-off-by: Altan Hacigumus <ahacigu.linux@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260424014639.54110-1-ahacigu.linux@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some physical adapters on Power systems do not support segmentation
offload when the MSS is less than 224 bytes. Attempting to send such
packets causes the adapter to freeze, stopping all traffic until
manually reset.
Implement ndo_features_check to disable GSO for packets with small MSS
values. The network stack will perform software segmentation instead.
The 224-byte minimum matches ibmvnic
commit f10b09ef687f ("ibmvnic: Enforce stronger sanity checks
on GSO packets"), which uses the same physical adapters in SEA
configurations.
The issue occurs specifically when the hardware attempts to perform
segmentation (gso_segs > 1) with a small MSS. Single-segment GSO packets
(gso_segs == 1) do not trigger the problematic LSO code path and are
transmitted normally without segmentation.
Add an ndo_features_check callback to disable GSO when MSS < 224 bytes.
Also call vlan_features_check() to ensure proper handling of VLAN packets,
particularly QinQ (802.1ad) configurations where the hardware parser may
not support certain offload features.
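The feature-masking logic can be sketched as follows; the 224-byte minimum is from the text above, while the feature bit values are illustrative stand-ins for the NETIF_F_* masks:

```c
#include <assert.h>
#include <stdint.h>

#define IBMVETH_MIN_MSS 224u	/* minimum MSS the adapter can segment */
#define FEAT_GSO        0x1u	/* stand-in for NETIF_F_GSO_MASK */
#define FEAT_CSUM       0x2u	/* stand-in for another offload bit */

/* Sketch of the ndo_features_check logic: strip the GSO feature bits
 * for multi-segment packets whose MSS is below the hardware minimum,
 * forcing the stack to segment in software. Single-segment GSO
 * packets do not hit the problematic LSO path and are left alone. */
static uint32_t features_check_sketch(uint32_t features,
				      uint32_t gso_size, uint32_t gso_segs)
{
	if (gso_segs > 1 && gso_size < IBMVETH_MIN_MSS)
		features &= ~FEAT_GSO;
	return features;
}
```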
Validated using iptables to force small MSS values. Without the fix,
the adapter freezes. With the fix, packets are segmented in software
and transmission succeeds. Comprehensive regression testing completed
(MSS tests, performance, stability).
Fixes: 8641dd8579 ("ibmveth: Add support for TSO")
Cc: stable@vger.kernel.org
Reviewed-by: Brian King <bjking1@linux.ibm.com>
Tested-by: Shaik Abdulla <shaik.abdulla1@ibm.com>
Tested-by: Naveed Ahmed <naveedaus@in.ibm.com>
Signed-off-by: Mingming Cao <mmc@linux.ibm.com>
Link: https://patch.msgid.link/20260424162917.65725-1-mmc@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
neigh_xmit always releases the skb, except when no neighbour table is
found. But even the first added user of neigh_xmit (mpls) relied on
neigh_xmit to release the skb (or queue it for tx).
sashiko reported:
If neigh_xmit() is called with an uninitialized neighbor table (for
example, NEIGH_ND_TABLE when IPv6 is disabled), it returns -EAFNOSUPPORT
and bypasses its internal out_kfree_skb error path. Because the return
value of neigh_xmit() is ignored here, does this leak the SKB?
Assume full ownership and remove the last code path that doesn't
xmit or free skb.
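The ownership contract after the fix can be modeled in a toy form; the table lookup is reduced to a flag and the free counter stands in for kfree_skb():

```c
#include <assert.h>
#include <stddef.h>

#define EAFNOSUPPORT 97

static int freed;			/* counts kfree_skb() calls */
static void kfree_skb_model(void *skb) { (void)skb; freed++; }

/* Toy model of the contract after the fix: every path out of
 * neigh_xmit() either queues the skb for transmit or frees it, so
 * callers may always treat ownership as transferred. The previously
 * leaking -EAFNOSUPPORT path now frees the skb as well. */
static int neigh_xmit_model(int have_table, void *skb)
{
	if (!have_table) {
		kfree_skb_model(skb);	/* previously leaked here */
		return -EAFNOSUPPORT;
	}
	/* ... resolve neighbour and transmit ... */
	return 0;
}
```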
Fixes: 4fd3d7d9e8 ("neigh: Add helper function neigh_xmit")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260424145843.74055-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With CONFIG_IP_MROUTE_MULTIPLE_TABLES=n, ipmr_fib_lookup()
does not check if net->ipv4.mrt is NULL.
Since default_device_exit_batch() is called after ->exit_rtnl(),
a device could receive IGMP packets and access net->ipv4.mrt
during/after ipmr_rules_exit_rtnl().
If ipmr_rules_exit_rtnl() had already cleared it and freed the
memory, the access would trigger null-ptr-deref or use-after-free.
Let's fix it by using RCU helper and free mrt after RCU grace
period.
In addition, check_net(net) is added to mroute_clean_tables()
and ipmr_cache_unresolved() to synchronise via mfc_unres_lock.
This prevents ipmr_cache_unresolved() from putting skb into
c->_c.mfc_un.unres.unresolved after mroute_clean_tables()
purges it.
For the same reason, timer_shutdown_sync() is moved after
mroute_clean_tables().
Since rhltable_destroy() holds a mutex internally, rcu_work is
used, and it is placed as the first member because rcu_head
must be placed within a 4K offset. mr_table is already 3864
bytes without rcu_work.
Note that IP6MR is not yet converted to ->exit_rtnl(), so this
change is not needed for now but will be.
Fixes: b22b018674 ("ipmr: Convert ipmr_net_exit_batch() to ->exit_rtnl().")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260423053456.4097409-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>