Commit Graph

1444339 Commits

Author SHA1 Message Date
Ilya Maximets
aa69918bd4 openvswitch: vport: fix self-deadlock on release of tunnel ports
vports are used concurrently and protected by RCU, so netdev_put()
must happen after the RCU grace period.  So, either in an RCU call or
after the synchronize_net().  The rtnl_delete_link() must happen under
RTNL and so can't be executed in RCU context.  Calling synchronize_net()
while holding RTNL is not a good idea for performance and system
stability under load in general, so calling netdev_put() in RCU call
is the right solution here.

However,
when the device is deleted, rtnl_unlock() will call netdev_run_todo()
and block until all the references are gone.  In the current code this
means that we never reach the call_rcu() and the vport is never freed
and the reference is never released, causing a self-deadlock on device
removal.

Fix that by moving the rcu_call() before the rtnl_unlock(), so the
scheduled RCU callback will be executed when synchronize_net() is
called from the rtnl_unlock()->netdev_run_todo() while the RTNL itself
is already released.

Fixes: 6931d21f87 ("openvswitch: defer tunnel netdev_put to RCU release")
Cc: stable@vger.kernel.org
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20260430233848.440994-2-i.maximets@ovn.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-05 15:19:37 +02:00
Ilya Maximets
83861c48ba openvswitch: vport: fix race between tunnel creation and linking
When a tunnel vport is created it first creates the tunnel device, e.g.,
with geneve_dev_create_fb(), then it calls ovs_netdev_link() to take a
reference and link it to the device that represents openvswitch datapath.

The creation of the device is happening under RTNL, but then RTNL is
released and re-acquired to find the device by name.  It is technically
possible for the tunnel device to be re-named or deleted within that
window while RTNL is not held, and some other device created in its
place.  This will cause a non-tunnel device to be referenced in the
vport and tunnel-specific functions used on it, e.g. vxlan_get_options()
that directly casts the private netdev data into a struct vxlan_dev
causing an invalid memory access:

 BUG: KASAN: slab-use-after-free in vxlan_get_options+0x323/0x3a0
  vxlan_get_options+0x323/0x3a0
  ovs_vport_cmd_new+0x6e3/0xd30

Fix that by taking a reference to the just created device before
releasing RTNL.  This ensures that the device in the vport is always
the one that was just created.  The search by name is only needed
for a standard vport-netdev that links pre-existing devices, so that
functionality and device type checks are moved to netdev_create().

It is also awkward that ovs_netdev_link() takes ownership of the vport
and destroys it on failure.  It doesn't know the type of the port it is
dealing with, so we need to pass down the indicator that it's a tunnel,
so the link can be properly deleted on failure.

It's possible to refactor the logic to make the ovs_netdev_link() do
only the linking part and let the callers perform a proper destruction,
but it will be much more code for each legacy tunnel port type, so it
is not worth it for the bug fix.

Fixes: 614732eaa1 ("openvswitch: Use regular VXLAN net_device device")
Reported-by: Yuan Tan <tanyuan98@outlook.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Reported-by: Yang Yang <n05ec@lzu.edu.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://patch.msgid.link/20260430213349.407991-1-i.maximets@ovn.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-05 15:14:33 +02:00
Paolo Abeni
6bdcbd79ad Merge branch 'net-mana-fix-mana_destroy_rxq-cleanup-for-partial-rxq-init'
Dipayaan Roy says:

====================
net: mana: Fix mana_destroy_rxq() cleanup for partial RXQ init

When mana_create_rxq() fails partway through initialization (e.g. the
hardware rejects the WQ object creation), the error path calls
mana_destroy_rxq() to tear down a partially-initialized RXQ.
This exposed multiple issues in mana_destroy_rxq() path, as it assumed
the RXQ was always fully initialized, leading to multiple issues:

1. xdp_rxq_info_unreg() was called on an unregistered xdp_rxq,
   triggering a WARN_ON ("Driver BUG") in net/core/xdp.c.

2. mana_destroy_wq_obj() was called with INVALID_MANA_HANDLE,
   sending a bogus destroy command to the hardware.

3. mana_deinit_cq() was called twice — once inside mana_destroy_rxq()
   and again in mana_create_rxq()'s error path — causing a
   use-after-free since mana_destroy_rxq() frees the rxq first.

This was observed during ethtool ring parameter changes when the
hardware returned an error creating the RXQ. This series makes
mana_destroy_rxq() safe to call at any stage of RXQ initialization
by guarding each teardown step, and removes the redundant cleanup
in mana_create_rxq().
====================

Link: https://patch.msgid.link/20260430035935.1859220-1-dipayanroy@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-05 12:16:26 +02:00
Dipayaan Roy
3985c9a56d net: mana: remove double CQ cleanup in mana_create_rxq error path
In mana_create_rxq(), the error cleanup path calls mana_destroy_rxq()
followed by mana_deinit_cq(). This is incorrect for two reasons:

1. mana_destroy_rxq() already calls mana_deinit_cq() internally,
   so the CQ's GDMA queue is destroyed twice.

2. mana_destroy_rxq() frees the rxq via kfree(rxq) before returning.
   The subsequent mana_deinit_cq(apc, cq) then operates on freed memory
   since cq points to &rxq->rx_cq, which is embedded in the
   already-freed rxq structure — a use-after-free.

Remove the redundant mana_deinit_cq() call from the error path since
mana_destroy_rxq() already handles CQ cleanup. mana_deinit_cq() is
itself safe for an uninitialized CQ as it checks for a NULL gdma_cq
before proceeding.

Fixes: ca9c54d2d6 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Aditya Garg <gargaditya@linux.microsoft.com>
Link: https://patch.msgid.link/20260430035935.1859220-4-dipayanroy@linux.microsoft.com
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-05 12:16:23 +02:00
Dipayaan Roy
2a1c691182 net: mana: Skip WQ object destruction for uninitialized RXQ
In mana_destroy_rxq(), mana_destroy_wq_obj() is called unconditionally
even when the WQ object was never created (rxobj is still
INVALID_MANA_HANDLE). When mana_create_rxq() fails before
mana_create_wq_obj() succeeds, the error path calls mana_destroy_rxq()
which sends a bogus destroy command to the hardware:

    mana 7870:00:00.0: HWC: Failed hw_channel req: 0x1d
    mana 7870:00:00.0: Failed to send mana message: -71, 0x1d
    mana 7870:00:00.0 eth7: Failed to destroy WQ object: -71

Guard mana_destroy_wq_obj() with an INVALID_MANA_HANDLE check so that
mana_destroy_rxq() is safe to call at any stage of RXQ initialization.

Fixes: ca9c54d2d6 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Link: https://patch.msgid.link/20260430035935.1859220-3-dipayanroy@linux.microsoft.com
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-05 12:16:23 +02:00
Dipayaan Roy
e9e334f806 net: mana: check xdp_rxq registration before unreg in mana_destroy_rxq()
When mana_create_rxq() fails at mana_create_wq_obj() or any step before
xdp_rxq_info_reg() is called, the error path jumps to `out:` which calls
mana_destroy_rxq(). mana_destroy_rxq() unconditionally calls
xdp_rxq_info_unreg() on xilinx xdp_rxq that was never registered,
triggering a WARN_ON in net/core/xdp.c:

mana 7870:00:00.0: HWC: Failed hw_channel req: 0xc000009a
mana 7870:00:00.0 eth7: Failed to create RXQ: err = -71
Driver BUG
WARNING: CPU: 442 PID: 491615 at ../net/core/xdp.c:150 xdp_rxq_info_unreg+0x44/0x70
Modules linked in: tcp_bbr xsk_diag udp_diag raw_diag unix_diag af_packet_diag netlink_diag nf_tables nfnetlink tcp_diag inet_diag binfmt_misc rpcsec_gss_krb5 nfsv3 nfs_acl auth_rpcgss nfsv4 dns_resolver nfs lockd ext4 grace crc16 iscsi_tcp mbcache fscache libiscsi_tcp jbd2 netfs rpcrdma af_packet sunrpc rdma_ucm ib_iser rdma_cm iw_cm iscsi_ibft ib_cm iscsi_boot_sysfs libiscsi rfkill scsi_transport_iscsi mana_ib ib_uverbs ib_core mana hyperv_drm(X) drm_shmem_helper intel_rapl_msr drm_kms_helper intel_rapl_common syscopyarea nls_iso8859_1 sysfillrect intel_uncore_frequency_common nls_cp437 vfat fat nfit sysimgblt libnvdimm hv_netvsc(X) hv_utils(X) fb_sys_fops hv_balloon(X) joydev fuse drm dm_mod configfs ip_tables x_tables xfs libcrc32c sd_mod nvme nvme_core nvme_common t10_pi crc64_rocksoft_generic crc64_rocksoft crc64 hid_generic serio_raw pci_hyperv(X) hv_storvsc(X) scsi_transport_fc hyperv_keyboard(X) hid_hyperv(X) pci_hyperv_intf(X) crc32_pclmul
 crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd hv_vmbus(X) softdog sg scsi_mod efivarfs
Supported: Yes, External
CPU: 442 PID: 491615 Comm: ethtool Kdump: loaded Tainted: G               X    5.14.21-150500.55.136-default #1 SLE15-SP5 a627be1b53abbfd64ad16b2685e4308c52847f42
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 07/25/2025
RIP: 0010:xdp_rxq_info_unreg+0x44/0x70
Code: e8 91 fe ff ff c7 43 0c 02 00 00 00 48 c7 03 00 00 00 00 5b c3 cc cc cc cc e9 58 3a 1c 00 48 c7 c7 f6 5f 19 97 e8 5c a4 7e ff <0f> 0b 83 7b 0c 01 74 ca 48 c7 c7 d9 5f 19 97 e8 48 a4 7e ff 0f 0b
RSP: 0018:ff3df6c8f7207818 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ff30d89f94808a80 RCX: 0000000000000027
RDX: 0000000000000000 RSI: 0000000000000002 RDI: ff30d94bdcca2908
RBP: 0000000000080000 R08: ffffffff98ed11a0 R09: ff3df6c8f72077a0
R10: dead000000000100 R11: 000000000000000a R12: 0000000000000000
R13: 0000000000002000 R14: 0000000000040000 R15: ff30d89f94800000
FS:  00007fe6d8432b80(0000) GS:ff30d94bdcc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe6d81a89b1 CR3: 00000b3b6d578001 CR4: 0000000000371ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
Call Trace:
 <TASK>
 mana_destroy_rxq+0x5b/0x2f0 [mana 267acf7006bcb696095bba4d810643d1db3b9e94]
 mana_create_rxq.isra.55+0x3db/0x720 [mana 267acf7006bcb696095bba4d810643d1db3b9e94]
 ? simple_lookup+0x36/0x50
 ? current_time+0x42/0x80
 ? __d_free_external+0x30/0x30
 mana_alloc_queues+0x32a/0x470 [mana 267acf7006bcb696095bba4d810643d1db3b9e94]
 ? _raw_spin_unlock+0xa/0x30
 ? d_instantiate.part.29+0x2e/0x40
 ? _raw_spin_unlock+0xa/0x30
 ? debugfs_create_dir+0xe4/0x140
 mana_attach+0x5c/0xf0 [mana 267acf7006bcb696095bba4d810643d1db3b9e94]
 mana_set_ringparam+0xd5/0x1a0 [mana 267acf7006bcb696095bba4d810643d1db3b9e94]
 ethnl_set_rings+0x292/0x320
 genl_family_rcv_msg_doit.isra.15+0x11b/0x150
 genl_rcv_msg+0xe3/0x1e0
 ? rings_prepare_data+0x80/0x80
 ? genl_family_rcv_msg_doit.isra.15+0x150/0x150
 netlink_rcv_skb+0x50/0x100
 genl_rcv+0x24/0x40
 netlink_unicast+0x1b6/0x280
 netlink_sendmsg+0x365/0x4d0
 sock_sendmsg+0x5f/0x70
 __sys_sendto+0x112/0x140
 __x64_sys_sendto+0x24/0x30
 do_syscall_64+0x5b/0x80
 ? handle_mm_fault+0xd7/0x290
 ? do_user_addr_fault+0x2d8/0x740
 ? exc_page_fault+0x67/0x150
 entry_SYSCALL_64_after_hwframe+0x6b/0xd5
RIP: 0033:0x7fe6d8122f06
Code: 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 72 f3 c3 41 57 41 56 4d 89 c7 41 55 41 54 41
RSP: 002b:00007fff2b66b068 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 000055771123d2a0 RCX: 00007fe6d8122f06
RDX: 0000000000000034 RSI: 000055771123d3b0 RDI: 0000000000000003
RBP: 00007fff2b66b100 R08: 00007fe6d8203360 R09: 000000000000000c
R10: 0000000000000000 R11: 0000000000000246 R12: 000055771123d350
R13: 000055771123d340 R14: 0000000000000000 R15: 00007fff2b66b2b0
 </TASK>

Guard the xdp_rxq_info_unreg() call with xdp_rxq_info_is_reg() so that
mana_destroy_rxq() is safe to call regardless of how far initialization
progressed.

Fixes: ed5356b53f ("net: mana: Add XDP support")
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Link: https://patch.msgid.link/20260430035935.1859220-2-dipayanroy@linux.microsoft.com
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-05 12:16:23 +02:00
Daniel Golle
07d9958739 net: dsa: mt7530: fix .get_stats64 sleeping in atomic context
The .get_stats64 callback runs in atomic context, but on
MDIO-connected switches every register read acquires the MDIO bus
mutex, which can sleep:
[   12.645973] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:609
[   12.654442] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 759, name: grep
[   12.663377] preempt_count: 0, expected: 0
[   12.667410] RCU nest depth: 1, expected: 0
[   12.671511] INFO: lockdep is turned off.
[   12.675441] CPU: 0 UID: 0 PID: 759 Comm: grep Tainted: G S      W           7.0.0+ #0 PREEMPT
[   12.675453] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
[   12.675456] Hardware name: Bananapi BPI-R64 (DT)
[   12.675459] Call trace:
[   12.675462]  show_stack+0x14/0x1c (C)
[   12.675477]  dump_stack_lvl+0x68/0x8c
[   12.675487]  dump_stack+0x14/0x1c
[   12.675495]  __might_resched+0x14c/0x220
[   12.675504]  __might_sleep+0x44/0x80
[   12.675511]  __mutex_lock+0x50/0xb10
[   12.675523]  mutex_lock_nested+0x20/0x30
[   12.675532]  mt7530_get_stats64+0x40/0x2ac
[   12.675542]  dsa_user_get_stats64+0x2c/0x40
[   12.675553]  dev_get_stats+0x44/0x1e0
[   12.675564]  dev_seq_printf_stats+0x24/0xe0
[   12.675575]  dev_seq_show+0x14/0x3c
[   12.675583]  seq_read_iter+0x37c/0x480
[   12.675595]  seq_read+0xd0/0xec
[   12.675605]  proc_reg_read+0x94/0xe4
[   12.675615]  vfs_read+0x98/0x29c
[   12.675625]  ksys_read+0x54/0xdc
[   12.675633]  __arm64_sys_read+0x18/0x20
[   12.675642]  invoke_syscall.constprop.0+0x54/0xec
[   12.675653]  do_el0_svc+0x3c/0xb4
[   12.675662]  el0_svc+0x38/0x200
[   12.675670]  el0t_64_sync_handler+0x98/0xdc
[   12.675679]  el0t_64_sync+0x158/0x15c

For MDIO-connected switches, poll MIB counters asynchronously using a
delayed workqueue every second and let .get_stats64 return the cached
values under a spinlock. A mod_delayed_work() call on each read
triggers an immediate refresh so counters stay responsive when queried
more frequently.

MMIO-connected switches (MT7988, EN7581, AN7583) are not affected
because their regmap does not sleep, so they continue to read MIB
counters directly in .get_stats64.

Fixes: 88c810f35e ("net: dsa: mt7530: implement .get_stats64")
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Acked-by: Chester A. Unal <chester.a.unal@arinc9.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/6940b913da2c29156f0feff74b678d3c526ee84c.1777719253.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-04 19:28:54 -07:00
Kuniyuki Iwashima
a6039776c7 ipmr: Add __rcu to netns_ipv4.mrt.
kernel test robot reported this Sparse warning:

  $ make C=1 net/ipv4/ipmr.o
  net/ipv4/ipmr.c:312:24: error: incompatible types in comparison expression (different address spaces):
  net/ipv4/ipmr.c:312:24:    struct mr_table [noderef] __rcu *
  net/ipv4/ipmr.c:312:24:    struct mr_table *

Let's add __rcu annotation to netns_ipv4.mrt.

Fixes: b3b6babf47 ("ipmr: Free mr_table after RCU grace period.")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202605030032.glNApko7-lkp@intel.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260502180755.359554-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-04 19:26:13 -07:00
David Carlier
30cb24f97d psp: strip variable-length PSP header in psp_dev_rcv()
psp_dev_rcv() unconditionally removes a fixed PSP_ENCAP_HLEN, even
when psph->hdrlen indicates that the PSP header carries optional
fields. A frame whose PSP header advertises a non-zero VC or any
extension would therefore be silently mis-decapsulated: option bytes
would spill into the inner packet head and downstream parsing would
fail on a corrupted skb.

Compute the full PSP header length from psph->hdrlen, pull the
optional bytes into the linear region, and strip the whole header
when decapsulating. Optional fields (VC, ...) are still ignored,
just discarded with the rest of the header instead of leaking.
crypt_offset and the VIRT flag are intentionally not validated here
- callers know their device's PSP implementation and can decide.

Both in-tree callers gate on hardware-validated PSP, so this is a
correctness fix rather than a reachable corruption path under
current configurations.

Fixes: 0eddb8023c ("psp: provide decapsulation and receive helper for drivers")
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Daniel Zahka <daniel.zahka@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: David Carlier <devnexen@gmail.com>
Link: https://patch.msgid.link/20260502141945.14484-1-devnexen@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-04 19:25:14 -07:00
Eric Dumazet
ac0841d7d2 net: prevent possible UAF in rtnl_prop_list_size()
I was mistaken by synchronize_rcu() [1] call in netdev_name_node_alt_destroy(),
giving a false sense of RCU safety at delete times.

We have to use list_del_rcu() to not confuse potential readers
in rtnl_prop_list_size().

[1] This synchronize_rcu() call was later removed in commit 723de3ebef
("net: free altname using an RCU callback").

Fixes: 9f30831390 ("net: add rcu safety to rtnl_prop_list_size()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260502124102.499204-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-04 19:24:27 -07:00
Jakub Kicinski
9d7ebff0c3 Merge branch 'mptcp-misc-fixes-for-v7-1-rc3'
Matthieu Baerts says:

====================
mptcp: misc fixes for v7.1-rc3

Here are various unrelated fixes:

- Patch 1: increment the right MIB counter. A fix for v5.7.

- Patch 2: set the right MPTCP reset reason. A fix for v5.9.

- Patch 3: fix rx timestamp corruption when on MPTCP passive fastopen. A
  fix for v6.2.

- Patch 4: increase sockopt seq after having set TCP_MAXSEG to propagate
  it to newer subflows later. A fix for 6.17.
====================

Link: https://patch.msgid.link/20260501-net-mptcp-misc-fixes-7-1-rc3-v1-0-b70118df778e@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-04 19:20:52 -07:00
Matthieu Baerts (NGI0)
70ece9d702 mptcp: sockopt: increase seq in mptcp_setsockopt_all_sf
mptcp_setsockopt_all_sf() was missing a call to sockopt_seq_inc(). This
is required not to cause missing synchronization for newer subflows
created later on.

This helper is called each time a socket option is set on subflows, and
future ones will need to inherit this option after their creation.

Fixes: 51c5fd09e1 ("mptcp: add TCP_MAXSEG sockopt support")
Cc: stable@vger.kernel.org
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260501-net-mptcp-misc-fixes-7-1-rc3-v1-4-b70118df778e@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-04 19:20:50 -07:00
Paolo Abeni
6254a16d6f mptcp: fix rx timestamp corruption on fastopen
The skb cb offset containing the timestamp presence flag is cleared
before loading such information. Cache such value before MPTCP CB
initialization.

Fixes: 36b122baf6 ("mptcp: add subflow_v(4,6)_send_synack()")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260501-net-mptcp-misc-fixes-7-1-rc3-v1-3-b70118df778e@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-04 19:20:50 -07:00
Shardul Bankar
a6da02d4c0 mptcp: use MPTCP_RST_EMPTCP for ACK HMAC validation failure
When HMAC validation fails on a received ACK + MP_JOIN in
subflow_syn_recv_sock(), the subflow is reset with reason
MPTCP_RST_EPROHIBIT ("Administratively prohibited"). This is
incorrect: HMAC validation failure is an MPTCP protocol-level
error, not an administrative policy denial.

The mirror site on the client, in subflow_finish_connect(), already
uses MPTCP_RST_EMPTCP ("MPTCP-specific error") for the same kind of
HMAC failure on the SYN/ACK + MP_JOIN. Use the same reason on the
server side for symmetry and accuracy.

Suggested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Fixes: 443041deb5 ("mptcp: fix NULL pointer in can_accept_new_subflow")
Cc: stable@vger.kernel.org
Signed-off-by: Shardul Bankar <shardul.b@mpiricsoftware.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260501-net-mptcp-misc-fixes-7-1-rc3-v1-2-b70118df778e@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-04 19:20:50 -07:00
Shardul Bankar
c4a99a9219 mptcp: use MPJoinSynAckHMacFailure for SynAck HMAC failure
In subflow_finish_connect(), HMAC validation of the server's HMAC
in SYN/ACK + MP_JOIN increments MPTCP_MIB_JOINACKMAC ("HMAC was
wrong on ACK + MP_JOIN") on failure. The function processes the
SYN/ACK, not the ACK; the matching MPTCP_MIB_JOINSYNACKMAC counter
("HMAC was wrong on SYN/ACK + MP_JOIN") exists but is not
incremented anywhere in the tree.

The mirror site on the server, subflow_syn_recv_sock(), already
uses JOINACKMAC correctly for ACK HMAC failure. Use JOINSYNACKMAC
at the SYN/ACK validation site so each counter reflects the packet
whose HMAC actually failed.

Suggested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Fixes: fc518953bc ("mptcp: add and use MIB counter infrastructure")
Cc: stable@vger.kernel.org
Signed-off-by: Shardul Bankar <shardul.b@mpiricsoftware.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260501-net-mptcp-misc-fixes-7-1-rc3-v1-1-b70118df778e@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-04 19:20:50 -07:00
Eric Dumazet
059b7dbd20 vsock/virtio: fix potential unbounded skb queue
virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc.

virtio_transport_recv_enqueue() skips coalescing for packets
with VIRTIO_VSOCK_SEQ_EOM.

If fed with packets with len == 0 and VIRTIO_VSOCK_SEQ_EOM,
a very large number of packets can be queued
because vvs->rx_bytes stays at 0.

Fix this by estimating the skb metadata size:

	(Number of skbs in the queue) * SKB_TRUESIZE(0)

Fixes: 0777061657 ("virtio/vsock: don't use skbuff state to account credit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Eugenio Pérez" <eperezma@redhat.com>
Cc: virtualization@lists.linux.dev
Link: https://patch.msgid.link/20260430122653.554058-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-04 19:12:37 -07:00
Markus Baier
36bdc0e815 net: usb: asix: ax88772: re-add usbnet_link_change() in phylink callbacks
Commit e0bffe3e68 ("net: asix: ax88772: migrate to phylink") replaced
the asix_adjust_link() PHY callback with phylink's mac_link_up() and
mac_link_down() handlers, but did not carry over the usbnet_link_change()
notification that commit 805206e66f ("net: asix: fix "can't send until
first packet is send" issue") had added.

As a result, the original symptom returns: when the link comes up,
usbnet is never notified, so the RX URB submission stays dormant until
some other event (e.g. a transmitted packet triggering the status
endpoint interrupt) wakes it up.

This is reproducible with the Apple A1277 USB Ethernet Adapter
(05ac:1402, AX88772A based) on a Banana Pro using a static IPv4
configuration. After bringing the interface up, no incoming packets are
received until the first outgoing frame triggers usbnet's RX path.

Restore the link change notification, gated on a carrier transition so
the call remains idempotent if the status endpoint also reports the
change later.

Fixes: e0bffe3e68 ("net: asix: ax88772: migrate to phylink")
Signed-off-by: Markus Baier <Markus.Baier@soslab.tu-darmstadt.de>
Tested-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20260501163941.107668-1-Markus.Baier@soslab.tu-darmstadt.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-04 19:04:15 -07:00
Breno Leitao
76b93a8107 netpoll: pass buffer size to egress_dev() to avoid MAC truncation
egress_dev() formats np->dev_mac via snprintf() but receives buf as
a bare char *, so it cannot derive the buffer size from the pointer. The
size argument was hardcoded to MAC_ADDR_STR_LEN (3 * ETH_ALEN - 1 = 17),
which is silly wrong in two ways:

 1) misleading kernel log output on the MAC-selected target path
    (np->dev_name[0] == '\0'); for example "aa:bb:cc:dd:ee:ff doesn't
    exist, aborting" was logged as "aa:bb:cc:dd:ee:f doesn't exist,
    aborting".

 2) the second argument of snprintf is the size of the buffer, not the
    size of what you want to write.

Add a bufsz parameter to egress_dev() and pass sizeof(buf) from each
caller, matching the standard snprintf() idiom and removing the
hardcoded size from the helper.

Every caller already declares "char buf[MAC_ADDR_STR_LEN + 1]" so the
formatted MAC continues to fit.

Tested by booting with
  netconsole=6665@/aa:bb:cc:dd:ee:ff,6666@10.0.0.1/00:11:22:33:44:55
on a kernel without a matching device. Pre-fix dmesg shows
"aa:bb:cc:dd:ee:f doesn't exist, aborting"; post-fix shows the full
"aa:bb:cc:dd:ee:ff doesn't exist, aborting".

Fixes: f8a10bed32 ("netconsole: allow selection of egress interface via MAC address")
Cc: stable@vger.kernel.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260501-netpoll_snprintf_fix-v1-1-84b0566e6597@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-04 18:37:25 -07:00
Kuniyuki Iwashima
d82ba05263 af_unix: Set gc_in_progress to true in unix_gc().
Igor Ushakov reported that unix_gc() could run with gc_in_progress
being false if the work is scheduled while running:

  Thread 1         Thread 2                     Thread 3
  --------         --------                     --------
                   unix_schedule_gc()           unix_schedule_gc()
                   `- if (!gc_in_progress)      `- if (!gc_in_progress)
                      |- gc_in_progress = true     |
                      `- queue_work()              |
  unix_gc() <----------------/                     |
  |                                                |- gc_in_progress = true
  ...                                              `- queue_work()
  |                                                       |
  `- gc_in_progress = false                               |
                                                          |
  unix_gc() <---------------------------------------------'
  |
  ... /* gc_in_progress == false */
  |
  `- gc_in_progress = false

unix_peek_fpl() relies on gc_in_progress not to confuse GC
by MSG_PEEK.

Let's set gc_in_progress to true in unix_gc().

Fixes: 8b90a9f819 ("af_unix: Run GC on only one CPU.")
Reported-by: Igor Ushakov <sysroot314@gmail.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260501073945.1884564-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-04 18:34:45 -07:00
Jakub Kicinski
bd3a4795d5 selftests: tls: add test for data loss on small pipe
Add selftest for data loss on short splice.

Link: https://patch.msgid.link/20260429222944.2139041-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-02 18:27:14 -07:00
Jakub Kicinski
7e7be31bfd net: tls: fix silent data drop under pipe back-pressure
tls_sw_splice_read() uses len when advancing rxm->offset / rxm->full_len
after skb_splice_bits(), rather than copied (the actual number of bytes
successfully spliced into the pipe). When the destination pipe cannot
accept all the requested bytes, splice_to_pipe() returns fewer bytes
than len, and 'len - copied' of data is effectively skipped over.

Fixes: e062fe99cc ("tls: splice_read: fix accessing pre-processed records")
Link: https://patch.msgid.link/20260429222944.2139041-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-02 18:27:14 -07:00
Jakub Kicinski
b42f68cf04 Merge branch 'net-sched-sch_cake-annotate-data-races-in-cake_dump_class_stats-series'
Eric Dumazet says:

====================
net/sched: sch_cake: annotate data-races in cake_dump_class_stats (series)

cake_dump_class_stats() runs without qdisc spinlock being held.

In this series (of two), I add READ_ONCE()/WRITE_ONCE() annotations for:

- flow->head
- flow->dropped
- b->backlogs[]
- flow->deficit
- flow->cvars.dropping
- flow->cvars.count
- flow->cvars.p_drop
- flow->cvars.blue_timer
- flow->cvars.drop_next
====================

Link: https://patch.msgid.link/20260430061610.3503483-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-02 16:59:12 -07:00
Eric Dumazet
67dc6c56b8 net/sched: sch_cake: annotate data-races in cake_dump_class_stats (II)
cake_dump_class_stats() runs without qdisc spinlock being held.

In this second patch, I add READ_ONCE()/WRITE_ONCE() annotations for:

- flow->deficit
- flow->cvars.dropping
- flow->cvars.count
- flow->cvars.p_drop
- flow->cvars.blue_timer
- flow->cvars.drop_next

Fixes: 046f6fd5da ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260430061610.3503483-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-02 16:59:09 -07:00
Eric Dumazet
046111a1a3 net/sched: sch_cake: annotate data-races in cake_dump_class_stats (I)
cake_dump_class_stats() runs without qdisc spinlock being held.

In this first patch, I add READ_ONCE()/WRITE_ONCE() annotations for:

- flow->head
- flow->dropped
- b->backlogs[]

Fixes: 046f6fd5da ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260430061610.3503483-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-02 16:59:09 -07:00
Maoyi Xie
1d324c2f43 ip6_gre: Use cached t->net in ip6erspan_changelink().
After commit 5e72ce3e39 ("net: ipv6: Use link netns in newlink() of
rtnl_link_ops"), ip6erspan_newlink() correctly resolves the per-netns
ip6gre hash via link_net. ip6erspan_changelink() was not converted in
that series and still uses dev_net(dev), which diverges from the
device's creation netns after IFLA_NET_NS_FD migration.

This re-inserts the tunnel into the wrong per-netns hash. The
original netns keeps a stale entry. When that netns is later
destroyed, ip6gre_exit_rtnl_net() walks the stale entry, producing a
slab-use-after-free reported by KASAN, followed by a kernel BUG at
net/core/dev.c (LIST_POISON1) in unregister_netdevice_many_notify().

Reachable from an unprivileged user namespace (unshare --user
--map-root-user --net).

ip6gre_changelink() earlier in the same file already uses the cached
t->net; only ip6erspan_changelink() has the wrong shape.

Fixes: 2d665034f2 ("net: ip6_gre: Fix ip6erspan hlen calculation")
Cc: stable@vger.kernel.org # v5.15+
Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260430103318.3206018-1-maoyi.xie@ntu.edu.sg
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-02 10:36:21 -07:00
Jakub Kicinski
386829cd89 Merge branch 'replace-direct-dequeue-call-with-qdisc_dequeue_peeked'
Jamal Hadi Salim says:

====================
Replace direct dequeue call with qdisc_dequeue_peeked

When sfb and red qdiscs have children (eg qfq qdisc) whose peek() callback is
qdisc_peek_dequeued(), we could get a kernel panic. When the parent of such
qdiscs (eg illustrated in patch #3 as tbf) wants to retrieve an skb from
its child (red/sfb in this case), it will do the following:
 1a. do a peek() - and when sensing there's an skb the child can offer, then
     - the child in this case(red/sfb) calls its child's (qfq) peek.
        qfq does the right thing and will return the gso_skb queue packet.
        Note: if there wasnt a gso_skb entry then qfq will store it there.
 1b. invoke a dequeue() on the child (red/sfb). And herein lies the problem.
     - red/sfb will call the child's dequeue() which will essentially just
       try to grab something of qfq's queue.

The right thing to do in #1b is to grab the skb off gso_skb queue.
This patchset fixes that issue by changing #1b to use qdisc_dequeue_peeked()
method instead.

Patch 1 fixes the issue for red qdisc. Patch 2 fixes it for sfb.
Patch 3 adds testcases for the two setups.
====================

Link: https://patch.msgid.link/20260430152957.194015-1-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-02 10:21:00 -07:00
Victor Nogueira
3a3a30c14d selftests/tc-testing: Add tests that force red and sfb to dequeue from child's gso_skb
Create 4 test cases:
- Force red to dequeue from its child's gso_skb with qfq leaf
- Force sfb to dequeue from its child's gso_skb with qfq leaf
- Force red to dequeue from its child's gso_skb with dualpi2 leaf
- Force sfb to dequeue from its child's gso_skb with dualpi2 leaf

All of them have tbf followed by red (or sfb) followed by qfq (or
dualpi2). Since tbf calls its child's peek followed by
qdisc_dequeue_peeked, it will force red/sfb to call their child's peek.
In this case, since the child (qfq/dualpi2) has qdisc_peek_dequeued as
its peek callback, the packet will be stored in its gso_skb queue. During
the subsequent call to qdisc_dequeue_peeked, red/sfb will have to dequeue
from the child's gso_skb to retrieve the packet.
Not doing so will cause a NULL ptr deref which was happening before a
recent fix.

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260430152957.194015-4-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-02 10:20:56 -07:00
Victor Nogueria
1b9bc71153 net/sched: sch_sfb: Replace direct dequeue call with peek and qdisc_dequeue_peeked
When sfb has children (eg qfq qdisc) whose peek() callback is
qdisc_peek_dequeued(), we could get a kernel panic. When the parent of such
qdiscs (eg illustrated in patch #3 as tbf) wants to retrieve an skb from
its child (sfb in this case), it will do the following:
 1a. do a peek() - and when sensing there's an skb the child can offer, then
     - the child in this case(sfb) calls its child's (qfq) peek.
        qfq does the right thing and will return the gso_skb queue packet.
        Note: if there wasnt a gso_skb entry then qfq will store it there.
 1b. invoke a dequeue() on the child (sfb). And herein lies the problem.
     - sfb will call the child's dequeue() which will essentially just
       try to grab something of qfq's queue.

[  127.594489][  T453] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
[  127.594741][  T453] CPU: 2 UID: 0 PID: 453 Comm: ping Not tainted 7.1.0-rc1-00035-gac961974495b-dirty #793 PREEMPT(full)
[  127.595059][  T453] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  127.595254][  T453] RIP: 0010:qfq_dequeue+0x35c/0x1650 [sch_qfq]
[  127.595461][  T453] Code: 00 fc ff df 80 3c 02 00 0f 85 17 0e 00 00 4c 8d 73 48 48 89 9d b8 02 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 76 0c 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b
[  127.596081][  T453] RSP: 0018:ffff88810e5af440 EFLAGS: 00010216
[  127.596337][  T453] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000
[  127.596623][  T453] RDX: 0000000000000009 RSI: 0000001880000000 RDI: ffff888104fd82b0
[  127.596917][  T453] RBP: ffff888104fd8000 R08: ffff888104fd8280 R09: 1ffff110211893a3
[  127.597165][  T453] R10: 1ffff110211893a6 R11: 1ffff110211893a7 R12: 0000001880000000
[  127.597404][  T453] R13: ffff888104fd82b8 R14: 0000000000000048 R15: 0000000040000000
[  127.597644][  T453] FS:  00007fc380cbfc40(0000) GS:ffff88816f2a8000(0000) knlGS:0000000000000000
[  127.597956][  T453] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  127.598160][  T453] CR2: 00005610aa9890a8 CR3: 000000010369e000 CR4: 0000000000750ef0
[  127.598390][  T453] PKRU: 55555554
[  127.598509][  T453] Call Trace:
[  127.598629][  T453]  <TASK>
[  127.598718][  T453]  ? mark_held_locks+0x40/0x70
[  127.598890][  T453]  ? srso_alias_return_thunk+0x5/0xfbef5
[  127.599053][  T453]  sfb_dequeue+0x88/0x4d0
[  127.599174][  T453]  ? ktime_get+0x137/0x230
[  127.599328][  T453]  ? srso_alias_return_thunk+0x5/0xfbef5
[  127.599480][  T453]  ? qdisc_peek_dequeued+0x7b/0x350 [sch_qfq]
[  127.599670][  T453]  ? srso_alias_return_thunk+0x5/0xfbef5
[  127.599831][  T453]  tbf_dequeue+0x6b1/0x1098 [sch_tbf]
[  127.599988][  T453]  __qdisc_run+0x169/0x1900

The right thing to do in #1b is to grab the skb off gso_skb queue.
This patchset fixes that issue by changing #1b to use qdisc_dequeue_peeked()
method instead.

Fixes: e13e02a3c6 ("net_sched: SFB flow scheduler")
Signed-off-by: Victor Nogueria <victor@mojatatu.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430152957.194015-3-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-02 10:20:56 -07:00
Jamal Hadi Salim
458d561527 net/sched: sch_red: Replace direct dequeue call with peek and qdisc_dequeue_peeked
When red qdisc has children (eg qfq qdisc) whose peek() callback is
qdisc_peek_dequeued(), we could get a kernel panic. When the parent of such
qdiscs (eg illustrated in patch #3 as tbf) wants to retrieve an skb from
its child (red in this case), it will do the following:
 1a. do a peek() - and when sensing there's an skb the child can offer, then
     - the child in this case(red) calls its child's (qfq) peek.
        qfq does the right thing and will return the gso_skb queue packet.
        Note: if there wasnt a gso_skb entry then qfq will store it there.
 1b. invoke a dequeue() on the child (red). And herein lies the problem.
     - red will call the child's dequeue() which will essentially just
       try to grab something of qfq's queue.

[   78.667668][  T363] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
[   78.667927][  T363] CPU: 1 UID: 0 PID: 363 Comm: ping Not tainted 7.1.0-rc1-00033-g46f74a3f7d57-dirty #790 PREEMPT(full)
[   78.668263][  T363] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   78.668486][  T363] RIP: 0010:qfq_dequeue+0x446/0xc90 [sch_qfq]
[   78.668718][  T363] Code: 54 c0 e8 dd 90 00 f1 48 c7 c7 e0 03 54 c0 48 89 de e8 ce 90 00 f1 48 8d 7b 48 b8 ff ff 37 00 48 89 fa 48 c1 e0 2a 48 c1 ea 03 <80> 3c 02 00 74 05 e8 ef a1 e1 f1 48 8b 7b 48 48 8d 54 24 58 48 8d
[   78.669312][  T363] RSP: 0018:ffff88810de573e0 EFLAGS: 00010216
[   78.669533][  T363] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
[   78.669790][  T363] RDX: 0000000000000009 RSI: 0000000000000004 RDI: 0000000000000048
[   78.670044][  T363] RBP: ffff888110dc4000 R08: ffffffffb1b0885a R09: fffffbfff6ba9078
[   78.670297][  T363] R10: 0000000000000003 R11: ffff888110e31c80 R12: 0000001880000000
[   78.670560][  T363] R13: ffff888110dc4150 R14: ffff888110dc42b8 R15: 0000000000000200
[   78.670814][  T363] FS:  00007f66a8f09c40(0000) GS:ffff888163428000(0000) knlGS:0000000000000000
[   78.671110][  T363] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   78.671324][  T363] CR2: 000055db4c6a30a8 CR3: 000000010da67000 CR4: 0000000000750ef0
[   78.671585][  T363] PKRU: 55555554
[   78.671713][  T363] Call Trace:
[   78.671843][  T363]  <TASK>
[   78.671936][  T363]  ? __pfx_qfq_dequeue+0x10/0x10 [sch_qfq]
[   78.672148][  T363]  ? __pfx__printk+0x10/0x10
[   78.672322][  T363]  ? srso_alias_return_thunk+0x5/0xfbef5
[   78.672496][  T363]  ? lockdep_hardirqs_on_prepare+0xa8/0x1a0
[   78.672706][  T363]  ? srso_alias_return_thunk+0x5/0xfbef5
[   78.672875][  T363]  ? trace_hardirqs_on+0x19/0x1a0
[   78.673047][  T363]  red_dequeue+0x65/0x270 [sch_red]
[   78.673217][  T363]  ? srso_alias_return_thunk+0x5/0xfbef5
[   78.673385][  T363]  tbf_dequeue.cold+0xb0/0x70c [sch_tbf]
[   78.673566][  T363]  __qdisc_run+0x169/0x1900

The right thing to do in #1b is to grab the skb off gso_skb queue.
This patchset fixes that issue by changing #1b to use qdisc_dequeue_peeked()
method instead.

Fixes: 77be155cba ("pkt_sched: Add peek emulation for non-work-conserving qdiscs.")
Reported-by: Manas <ghandatmanas@gmail.com>
Reported-by: Rakshit Awasthi <rakshitawasthi17@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430152957.194015-2-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-02 10:20:55 -07:00
Gregory Fuchedgi
383d0fb894 amd-xgbe: fix PTP addend overflow causing frozen clock
XGBE_PTP_ACT_CLK_FREQ and XGBE_V2_PTP_ACT_CLK_FREQ were 10x too
large (500MHz/1GHz instead of 50MHz/100MHz), causing the computed
addend to overflow the 32-bit tstamp_addend. In the general case
this would result in the clock advancing at the wrong rate. For v2
(PCI), ptpclk_rate is hardcoded to 125MHz, so the addend formula
(ACT_CLK_FREQ << 32) / ptpclk_rate yields exactly 8 * 2^32, and
when stored to the 32-bit tstamp_addend the value is zero. With
addend = 0 the hardware accumulator never overflows and the PTP
clock is fully stopped. For v1 (platform), ptpclk_rate is read from
ACPI/DT so the exact overflow behavior depends on the
firmware-reported frequency.

Define the constants as NSEC_PER_SEC / SSINC so the relationship is
explicit and cannot drift out of sync.

Fixes: fbd47be098 ("amd-xgbe: add hardware PTP timestamping support")
Tested-by: Gregory Fuchedgi <gfuchedgi@gmail.com>
Signed-off-by: Gregory Fuchedgi <gfuchedgi@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260429-fix-xgbe-ptp-addend-v1-1-fca5b0ca5e62@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-02 10:16:27 -07:00
Holger Brunck
851bba8068 net: wan: fsl_ucc_hdlc: fix ucc_hdlc_remove
If the driver is used in a non tdm mode priv->utdm is a NULL pointer.
Therefore we need to check this pointer first before checking si_regs.

Fixes: c19b6d246a ("drivers/net: support hdlc function for QE-UCC")
Signed-off-by: Holger Brunck <holger.brunck@hitachienergy.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-02 10:14:06 -07:00
Holger Brunck
1a57efe250 net: wan: fsl_ucc_hdlc: fix uhdlc_memclean
Unmapping of uf_regs is done from ucc_fast_free and doesn't need to be
done explicitly. If already unmapped ucc_fast_free will crash.

Fixes: c19b6d246a ("drivers/net: support hdlc function for QE-UCC")
Signed-off-by: Holger Brunck <holger.brunck@hitachienergy.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-02 10:14:00 -07:00
Maciej W. Rozycki
ddca6da148 MAINTAINERS: Add self for the DEC LANCE network driver
Like with the rest of DECstation and TURBOchannel hardware I have been
handling the DEC LANCE network driver for some 25 years now anyway.

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/alpine.DEB.2.21.2604271113520.28583@angie.orcam.me.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-02 10:06:32 -07:00
Jakub Kicinski
ebb639024e Merge branch 'ipv6-fix-ecmp-route-failover-on-carrier-loss'
Sagarika Sharma says:

====================
ipv6: fix ECMP route failover on carrier loss

This patchset resolves an issue where established IPv6 connections are
unable to transition to alternative ECMP nexthops upon carrier loss.

Unlike IPv4, the IPv6 routing subsystem does not actively invalidate
cached destinations during a NETDEV_CHANGE event. Sockets persist
with dead routes, leading to stalled traffic or connection drops.

This series introduces a fix to trigger route invalidation by
updating the route serial number on link carrier loss and provides
a corresponding selftest to validate the failover behavior for IPv4
and IPv6.
====================

Link: https://patch.msgid.link/20260430200909.527827-1-sharmasagarika@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-01 17:58:46 -07:00
Kuniyuki Iwashima
d1ae37dc68 selftest: net: Add test for TCP flow failover with ECMP routes.
Without the previous commit, TCP failed to switch to alternative
IPv6 routes immediately upon carrier loss.

It would persist with the dead route until reaching the threshold
net.ipv4.tcp_retries1, leading to unnecessary delays in failover.

Let's add a selftest for this scenario to ensure TCP fails over
immediately upon a carrier loss event.

Before:
  TEST: TCP IPv4 failover                                             [ OK ]
  TEST: TCP IPv6 failover                                             [FAIL]

After:
  TEST: TCP IPv4 failover                                             [ OK ]
  TEST: TCP IPv6 failover                                             [ OK ]

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Sagarika Sharma <sharmasagarika@google.com>
Link: https://patch.msgid.link/20260430200909.527827-3-sharmasagarika@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-01 17:58:44 -07:00
Sagarika Sharma
4bc852006b ipv6: update route serial number on NETDEV_CHANGE
When using IPv6 ECMP routes, if a netdev listed as a nexthop experiences
a carrier change event (e.g., a bond device generating a NETDEV_CHANGE
event after its slaves go linkdown), established connections utilizing
that nexthop fail to fail over to other available nexthops. Instead,
these connections stall or drop.

This happens because the IPv6 FIB code does not invalidate the socket's
cached destination when a NETDEV_CHANGE event occurs. While
fib6_ifdown() correctly marks the nexthop with RTNH_F_LINKDOWN, it
leaves the route's serial number unchanged. As a result, sockets with a
previously cached dst do not realize the route is no longer viable and
continue to try using the non-functional nexthop.

This behavior contrasts with IPv4, which actively flushes cached
destinations on a NETDEV_CHANGE event (see fib_netdev_event() in
net/ipv4/fib_frontend.c).

Fix this by updating the route serial number in fib6_ifdown() when
setting RTNH_F_LINKDOWN. This invalidates stale cached destinations,
forcing sockets to perform a new route lookup and fail over to a
functioning nexthop.

Fixes: 51ebd31815 ("ipv6: add support of equal cost multipath (ECMP)")
Signed-off-by: Sagarika Sharma <sharmasagarika@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430200909.527827-2-sharmasagarika@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-01 17:58:44 -07:00
Eric Dumazet
6d4106e8df net/sched: sch_pie: annotate more data-races in pie_dump_stats()
My prior patch missed few READ_ONCE()/WRITE_ONCE() annotations.

Fixes: 5154561d9b ("net/sched: sch_pie: annotate data-races in pie_dump_stats()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430080056.35104-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-01 17:54:57 -07:00
Alex Cheema
a5148bc2fa net: usb: cdc_ncm: add Apple Mac USB-C direct networking quirk
Apple Silicon Macs expose two CDC NCM "private" data interfaces over
USB-C with VID:PID 0x05ac:0x1905 and product string "Mac". This is the
same protocol Apple already ships on iPhone (0x05ac:0x12a8) and iPad
(0x05ac:0x12ab) for RemoteXPC since iOS 17 -- both data interfaces lack
an interrupt status endpoint, so they rely on the FLAG_LINK_INTR-
conditional bind path introduced in commit 3ec8d7572a ("CDC-NCM: add
support for Apple's private interface").

The id_table currently has entries for iPhone and iPad but not for the
Mac. Without a match, cdc_ncm falls through to the generic CDC NCM
class-match entry, which uses the FLAG_LINK_INTR-having cdc_ncm_info
struct, so bind_common() fails on the missing status endpoint and no
netdev appears.

Add id_table entries for both interface numbers (0 and 2) of the Mac,
bound to the existing apple_private_interface_info driver_info.

Verified empirically on a Mac Studio M3 Ultra running macOS 26.5: when
a Mac is connected via USB-C, ioreg shows VID 0x05ac, PID 0x1905,
product string "Mac", with two NCM data interfaces at numbers 0 and 2.
The same PID is presented by all current Apple Silicon Mac models
(MacBook Pro/Air, Mac mini, Mac Studio across the M-series), mirroring
Apple's single-PID-per-family pattern from iPhone/iPad.

After this patch, plugging a Mac into a Linux host running the patched
kernel produces two enx... interfaces (one per data interface),
"ip -br link" lists them as UP, and standard userspace networking
(DHCP, NetworkManager shared mode, etc.) works without any modprobe
overrides or out-of-tree modules.

Signed-off-by: Alex Cheema <alex@exolabs.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260429175739.34426-1-alex@exolabs.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-01 17:17:27 -07:00
Eric Dumazet
c6bebaa744 ipv4: igmp: annotate data-races in igmp_heard_query()
Multiple cpus can run igmp_heard_query() concurrently.

Add missing READ_ONCE()/WRITE_ONCE() over following in_dev fields.

- mr_qrv
- mr_qi
- mr_qri
- mr_v1_seen
- mr_v2_seen

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot+ae9a171f239b14485310@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/69f38675.050a0220.3cbe47.0002.GAE@google.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430164836.872079-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-01 17:11:42 -07:00
Kai Zen
4b9e327991 net: rtnetlink: zero ifla_vf_broadcast to avoid stack infoleak in rtnl_fill_vfinfo
rtnl_fill_vfinfo() declares struct ifla_vf_broadcast on the stack
without initialisation:

	struct ifla_vf_broadcast vf_broadcast;

The struct contains a single fixed 32-byte field:

	/* include/uapi/linux/if_link.h */
	struct ifla_vf_broadcast {
		__u8 broadcast[32];
	};

The function then copies dev->broadcast into it using dev->addr_len
as the length:

	memcpy(vf_broadcast.broadcast, dev->broadcast, dev->addr_len);

On Ethernet devices (the overwhelming majority of SR-IOV NICs)
dev->addr_len is 6, so only the first 6 bytes of broadcast[] are
written. The remaining 26 bytes retain whatever was previously on
the kernel stack. The full struct is then handed to userspace via:

	nla_put(skb, IFLA_VF_BROADCAST,
		sizeof(vf_broadcast), &vf_broadcast)

leaking up to 26 bytes of uninitialised kernel stack per VF per
RTM_GETLINK request, repeatable.

The other vf_* structs in the same function are explicitly zeroed
for exactly this reason - see the memset() calls for ivi,
vf_vlan_info, node_guid and port_guid a few lines above.
vf_broadcast was simply missed when it was added.

Reachability: any unprivileged local process can open AF_NETLINK /
NETLINK_ROUTE without capabilities and send RTM_GETLINK with an
IFLA_EXT_MASK attribute carrying RTEXT_FILTER_VF. The kernel walks
each VF and emits IFLA_VF_BROADCAST, leaking 26 bytes of stack per
VF per request. Stack residue at this call site can include return
addresses and transient sensitive data; KASAN with stack
instrumentation, or KMSAN, will flag the nla_put() when reproduced.

Zero the on-stack struct before the partial memcpy, matching the
existing pattern used for the other vf_* structs in the same
function.

Fixes: 75345f888f ("ipoib: show VF broadcast address")
Cc: stable@vger.kernel.org
Signed-off-by: Kai Zen <kai.aizen.dev@gmail.com>
Link: https://patch.msgid.link/3c506e8f936e52b57620269b55c348af05d413a2.1777557228.git.kai.aizen.dev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-01 17:04:28 -07:00
Eric Dumazet
4f34002e2e ipmr: prevent info-leak in pmr_cache_report()
Yiming Qian reported:

<quote>
 ipmr_cache_report()` allocates a report skb with `alloc_skb(128,
 GFP_ATOMIC)` and appends a `struct igmphdr` using `skb_put()`. In the
 non-`IGMPMSG_WHOLEPKT` path it initializes only:

 - `igmp->type`
 - `igmp->code`

 but does not initialize:

 - `igmp->csum`
 - `igmp->group`

 Later, `igmpmsg_netlink_event()` copies the bytes after `sizeof(struct
 igmpmsg)` into the `IPMRA_CREPORT_PKT` netlink attribute and emits
 `RTM_NEWCACHEREPORT` on `RTNLGRP_IPV4_MROUTE_R`.

 As a result, 6 bytes of stale heap data from the skb head are
 disclosed to userspace.
</quote>

Let's use skb_put_zero() instead of skb_put() to fix this bug.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260430070611.4004529-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-01 17:01:40 -07:00
Aleksander Jan Bajkowski
f93836b236 net: usb: r8152: add TRENDnet TUC-ET2G v2.0
The TRENDnet TUC-ET2G V2.0 is an RTL8156B based 2.5G Ethernet controller.

Add the vendor and product ID values to the driver. This makes Ethernet
work with the adapter.

Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Birger Koblitz <mail@birger-koblitz.de>
Link: https://patch.msgid.link/20260430213435.21821-1-olek2@wp.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-01 16:50:49 -07:00
Jakub Kicinski
2ab02ac411 Merge tag 'nf-26-05-01' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following batch contains Netfilter fixes for net:

1) Replace skb_try_make_writable() by skb_ensure_writable() in
   nft_fwd_netdev and the flowtable to deal with uncloned packets
   having their network header in paged fragments.

2) Drop packet if output device does not exist and ensure sufficient
   headroom in nft_fwd_netdev before transmitting the skb.

3) Use the existing dup recursion counter in nft_fwd_netdev for the
   neigh_xmit variant, from Weiming Shi.

4) Add .check_hooks interface to x_tables to detach the control plane
   hook check based on the match/target configuration. Then, update
   nft_compat to use .check_hooks from .validate path, this fixes a
   lack of hook validation for several match/targets.

5) Fix incorrect .usersize in xt_CT, from Florian Westphal.

6) Fix a memleak with netdev tables in dormant state,
   from Florian Westphal.

7) Several patches to check if the packet is a fragment, then skip
   layer 4 inspection, for x_tables and nf_tables; as well as common
   nf_socket infrastructure. The xt_hashlimit match drops fragments
   to stay consistent with the existing approach when failing to parse
   the layer 4 protocol header.

8) Ensure sufficient headroom in the flowtable before transmitting
   the skb.

9) Fix the flowtable inline vlan approach for double-tagged vlan:
   Reverse the iteration over .encap[] since it represents the
   encapsulation as seen from the ingress path. Postpone pushing
   layer 2 header so output device is available to calculate needed
   headroom. Finally, add and use nf_flow_vlan_push() to fix it.

10) Fix flowtable inline pppoe with GSO packets. Moreover, use
    FLOW_OFFLOAD_XMIT_DIRECT to fill up destination hardware
    address since neighbour cache does not exist in pppoe.

11) Use skb_pull_rcsum() to decapsulate vlan and pppoe headers, for
    double-tagged vlan in particular this should provide some benefits
    in certain scenarios.

More notes regarding 9-11):

- sashiko is also signalling to use it for IPIP headers, but that needs
  more adjustments such setting skb->protocol after removing the IPIP
  header, will follow up in a separated patch.
- I plan to submit selftests to cover double-tagged-vlan. As for pppoe,
  it should be possible but that would mandate a few userspace dependencies.
  This has been semi-automatically  tested by me and reporters describing
  broken double-vlan-tagged and pppoe currently in the flowtable.

* tag 'nf-26-05-01' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: flowtable: use skb_pull_rcsum() to pop vlan/pppoe header
  netfilter: flowtable: fix inline pppoe encapsulation in xmit path
  netfilter: flowtable: fix inline vlan encapsulation in xmit path
  netfilter: flowtable: ensure sufficient headroom in xmit path
  netfilter: xtables: fix L4 header parsing for non-first fragments
  netfilter: nf_tables: skip L4 header parsing for non-first fragments
  netfilter: nf_socket: skip socket lookup for non-first fragments
  netfilter: nf_tables: fix netdev hook allocation memleak with dormant tables
  netfilter: xt_CT: fix usersize for v1 and v2 revision
  netfilter: nft_compat: run xt_check_hooks_{match,target}() from .validate
  netfilter: x_tables: add .check_hooks to matches and targets
  netfilter: nft_fwd_netdev: use recursion counter in neigh egress path
  netfilter: nft_fwd_netdev: add device and headroom validate with neigh forwarding
  netfilter: replace skb_try_make_writable() by skb_ensure_writable()
====================

Link: https://patch.msgid.link/20260501122237.296262-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-01 16:45:42 -07:00
Pablo Neira Ayuso
baa3c65435 netfilter: flowtable: use skb_pull_rcsum() to pop vlan/pppoe header
This adjusts the checksum, if required, after pulling the layer 2
header, either the pppoe header or the inner vlan header in the
double-tagged vlan packets.

Fixes: 4cd91f7c29 ("netfilter: flowtable: add vlan support")
Fixes: 72efd585f7 ("netfilter: flowtable: add pppoe support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-05-01 12:39:23 +02:00
Jakub Kicinski
85da3965df Merge branch 'octeontx2-af-npc-cn20k-mcam-fixes'
Ratheesh Kannoth says:

====================
octeontx2-af: npc: cn20k: MCAM fixes

This series tightens Marvell OcteonTX2 AF NPC support for CN20K silicon
around MCAM key typing, optional debugfs setup, defrag allocation rollback,
defrag entry relocation bookkeeping, logical MCAM clear and programming,
default-rule index handling with explicit teardown, and NIXLF reserved-slot
lookup when default rules are missing.

Patches 1 through 3 focus on AF error handling: propagate
npc_mcam_idx_2_key_type() failures through cn20k MCAM enable, config, copy,
and read paths; treat cn20k NPC debugfs nodes as optional so probe does not
fail when debugfs is unavailable; and fix defrag MCAM allocation rollback
so allocation errno is not overwritten during subbank index resolution.

Patch 4 fixes npc_defrag_move_vdx_to_free(): when an MCAM line is moved to
a new physical index, move entry2target_pffunc[] association to the new
slot, clear the old slot, and retarget the matching mcam_rules entry so
software state matches hardware after defrag.

Patches 5 through 7 refine cn20k MCAM programming: clear entries using the
logical MCAM index and resolved key width, fix bank/CFG sequencing in
npc_cn20k_config_mcam_entry(), and read action metadata from the correct
bank in npc_cn20k_read_mcam_entry().

Patches 8 through 10 complete default-rule lifecycle handling: initialize
default-rule index outputs eagerly, tear down reserved default MCAM rules
explicitly (coordinated with npc_mcam_free_all_entries()), and reject
USHRT_MAX sentinel indices from npc_get_nixlf_mcam_index() on cn20k.
====================

Link: https://patch.msgid.link/20260429022722.1110289-1-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-30 18:50:19 -07:00
Ratheesh Kannoth
bc968f61bf octeontx2-af: npc: cn20k: Reject missing default-rule MCAM indices
When cn20k default L2 rules are not installed,
npc_cn20k_dft_rules_idx_get() leaves broadcast, multicast, promiscuous, and
unicast slots at USHRT_MAX. npc_get_nixlf_mcam_index() previously returned
that sentinel as a valid MCAM index, so callers could program hardware with
an invalid index.

Return -EINVAL from the cn20k branches of npc_get_nixlf_mcam_index() when
the requested slot is still USHRT_MAX. Harden cn20k NPC MCAM entry helpers
to reject out-of-range indices before touching hardware.

Drop the early bounds check in npc_enable_mcam_entry() for cn20k so invalid
indices are validated inside npc_cn20k_enable_mcam_entry() instead of being
silently ignored.

In rvu_npc_update_flowkey_alg_idx(), treat negative MCAM indices like
out-of-range values, and only update RSS actions for promiscuous and
all-multi paths when the resolved index is non-negative.

Cc: Suman Ghosh <sumang@marvell.com>
Fixes: 6d1e70282f ("octeontx2-af: npc: cn20k: Use common APIs")
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260429022722.1110289-11-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-30 18:50:17 -07:00
Ratheesh Kannoth
013717353c octeontx2-af: npc: cn20k: Tear down default MCAM rules explicitly on free
npc_cn20k_dft_rules_free() used the NPC MCAM mbox "free all" path, which
does not match how cn20k tracks default-rule MCAM slots indexes.

Resolve the default-rule indices, then for each valid slot clear the bitmap
entry, drop the PF/VF map, disable the MCAM line, clear the target
function, and npc_cn20k_idx_free(). Remove any matching software mcam_rules
nodes. On hard failure from idx_free, WARN and stop so the box stays up for
analysis.

In npc_mcam_free_all_entries(), prefetch the same default-rule indices and,
on cn20k, skip bitmap clear and idx_free when the scanned entry is one of
those reserved defaults (they are released by npc_cn20k_dft_rules_free).

Fixes: 09d3b7a140 ("octeontx2-af: npc: cn20k: Allocate default MCAM indexes")
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260429022722.1110289-10-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-30 18:50:17 -07:00
Ratheesh Kannoth
afb474bd4f octeontx2-af: npc: cn20k: Initialize default-rule index outputs up front
npc_cn20k_dft_rules_idx_get() wrote USHRT_MAX into individual outputs only
on some error paths (lbk promisc lookup, VF ucast lookup, and the PF rule
walk), which could leave other caller slots stale across retries.

Set every non-NULL bcast/mcast/promisc/ucast pointer to USHRT_MAX once at
entry, then drop the duplicate assignments on failure. Successful lookups
still overwrite the relevant slot before returning.

Fixes: 09d3b7a140 ("octeontx2-af: npc: cn20k: Allocate default MCAM indexes")
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260429022722.1110289-9-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-30 18:50:17 -07:00
Ratheesh Kannoth
f6803eb070 octeontx2-af: npc: cn20k: Fix MCAM actions read
npc_cn20k_read_mcam_entry() always reloaded action and vtag_action from
bank 0 after programming the CAM words. Use the bank returned by
npc_get_bank() for the ACTION reads as well, and read those registers once
up front so both X2 and X4 paths share the same metadata.

Return directly from the X2 keyword path now that the action fields are
already populated.

Cc: Suman Ghosh <sumang@marvell.com>
Fixes: 6d1e70282f ("octeontx2-af: npc: cn20k: Use common APIs")
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260429022722.1110289-8-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-30 18:50:16 -07:00
Ratheesh Kannoth
2b6d6bb728 octeontx2-af: npc: cn20k: Fix bank value
For X4 keys its loop reused the bank parameter as the loop counter, so bank
no longer reflected the caller's bank after the loop and the control flow
was hard to follow.

Program NPC_AF_CN20K_MCAMEX_BANKX_CFG_EXT directly in
npc_cn20k_config_mcam_entry(): one CFG write for X2 using the computed
bank, and one CFG write per bank inside the X4 action loop. Enable the
entry at the end with npc_cn20k_enable_mcam_entry(..., true) instead of
embedding the enable bit in bank_cfg via the removed helper.

Cc: Suman Ghosh <sumang@marvell.com>
Fixes: 4e527f1e5c ("octeontx2-af: npc: cn20k: Add new mailboxes for CN20K silicon")
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260429022722.1110289-7-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-30 18:50:16 -07:00