The vport netlink reply helpers allocate a fixed-size skb with
nlmsg_new(NLMSG_DEFAULT_SIZE, ...) but serialize the full upcall PID
array via ovs_vport_get_upcall_portids(). Since
ovs_vport_set_upcall_portids() accepts any non-zero multiple of
sizeof(u32) with no upper bound, a CAP_NET_ADMIN user can install a PID
array large enough to overflow the reply buffer, causing nla_put() to
fail with -EMSGSIZE and hitting BUG_ON(err < 0). On systems with
unprivileged user namespaces enabled (e.g., Ubuntu default), this is
reachable via unshare -Urn since OVS vport mutation operations use
GENL_UNS_ADMIN_PERM.
kernel BUG at net/openvswitch/datapath.c:2414!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 1 UID: 0 PID: 65 Comm: poc Not tainted 7.0.0-rc7-00195-geb216e422044 #1
RIP: 0010:ovs_vport_cmd_set+0x34c/0x400
Call Trace:
<TASK>
genl_family_rcv_msg_doit (net/netlink/genetlink.c:1116)
genl_rcv_msg (net/netlink/genetlink.c:1194)
netlink_rcv_skb (net/netlink/af_netlink.c:2550)
genl_rcv (net/netlink/genetlink.c:1219)
netlink_unicast (net/netlink/af_netlink.c:1344)
netlink_sendmsg (net/netlink/af_netlink.c:1894)
__sys_sendto (net/socket.c:2206)
__x64_sys_sendto (net/socket.c:2209)
do_syscall_64 (arch/x86/entry/syscall_64.c:63)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
</TASK>
Kernel panic - not syncing: Fatal exception
Reject attempts to set more PIDs than nr_cpu_ids in
ovs_vport_set_upcall_portids(), and pre-compute the worst-case reply
size in ovs_vport_cmd_msg_size() based on that bound, similar to the
existing ovs_dp_cmd_msg_size(). nr_cpu_ids matches the cap already
used by the per-CPU dispatch configuration on the datapath side
(ovs_dp_cmd_fill_info() serialises at most nr_cpu_ids PIDs), so the
two sides stay consistent.
Fixes: 5cd667b0a4 ("openvswitch: Allow each vport to have an array of 'port_id's.")
Reported-by: Xiang Mei <xmei5@asu.edu>
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Ilya Maximets <i.maximets@ovn.org>
Link: https://patch.msgid.link/20260416024653.153456-2-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
RFC 2516 Section 7 states that Protocol Field Compression (PFC) is NOT
RECOMMENDED for PPPoE. In practice, pppd does not support negotiating
PFC for PPPoE sessions, and the current PPPoE driver assumes an
uncompressed (2-byte) protocol field. However, the generic PPP layer
function ppp_input() is not aware of the negotiation result, and still
accepts PFC frames.
If a peer with a broken implementation or an attacker sends a frame with
a compressed (1-byte) protocol field, the subsequent PPP payload is
shifted by one byte. This causes the network header to be 4-byte
misaligned, which may trigger unaligned access exceptions on some
architectures.
To reduce the attack surface, drop PPPoE PFC frames. Introduce
ppp_skb_is_compressed_proto() helper function to be used in both
ppp_generic.c and pppoe.c to avoid open-coding.
Fixes: 7fb1b8ca8f ("ppp: Move PFC decompression to PPP generic layer")
Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260415022456.141758-2-qingfang.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
RFC 2516 Section 7 states that Protocol Field Compression (PFC) is NOT
RECOMMENDED for PPPoE. In practice, pppd does not support negotiating
PFC for PPPoE sessions, and the flow dissector driver has assumed an
uncompressed frame until the blamed commit.
During the review process of that commit [1], support for PFC is
suggested. However, having a compressed (1-byte) protocol field means
the subsequent PPP payload is shifted by one byte, causing 4-byte
misalignment for the network header and an unaligned access exception
on some architectures.
The exception can be reproduced by sending a PPPoE PFC frame to an
ethernet interface of a MIPS board, with RPS enabled, even if no PPPoE
session is active on that interface:
$ 0 : 00000000 80c40000 00000000 85144817
$ 4 : 00000008 00000100 80a75758 81dc9bb8
$ 8 : 00000010 8087ae2c 0000003d 00000000
$12 : 000000e0 00000039 00000000 00000000
$16 : 85043240 80a75758 81dc9bb8 00006488
$20 : 0000002f 00000007 85144810 80a70000
$24 : 81d1bda0 00000000
$28 : 81dc8000 81dc9aa8 00000000 805ead08
Hi : 00009d51
Lo : 2163358a
epc : 805e91f0 __skb_flow_dissect+0x1b0/0x1b50
ra : 805ead08 __skb_get_hash_net+0x74/0x12c
Status: 11000403 KERNEL EXL IE
Cause : 40800010 (ExcCode 04)
BadVA : 85144817
PrId : 0001992f (MIPS 1004Kc)
Call Trace:
[<805e91f0>] __skb_flow_dissect+0x1b0/0x1b50
[<805ead08>] __skb_get_hash_net+0x74/0x12c
[<805ef330>] get_rps_cpu+0x1b8/0x3fc
[<805fca70>] netif_receive_skb_list_internal+0x324/0x364
[<805fd120>] napi_complete_done+0x68/0x2a4
[<8058de5c>] mtk_napi_rx+0x228/0xfec
[<805fd398>] __napi_poll+0x3c/0x1c4
[<805fd754>] napi_threaded_poll_loop+0x234/0x29c
[<805fd848>] napi_threaded_poll+0x8c/0xb0
[<80053544>] kthread+0x104/0x12c
[<80002bd8>] ret_from_kernel_thread+0x14/0x1c
Code: 02d51821 1060045b 00000000 <8c640000> 3084000f 2c820005 144001a2 00042080 8e220000
To reduce the attack surface and maintain performance, do not process
PPPoE PFC frames.
[1] https://lore.kernel.org/r/20220630231016.GA392@debian.home
Fixes: 46126db9c8 ("flow_dissector: Add PPPoE dissectors")
Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>
Link: https://patch.msgid.link/20260415022456.141758-1-qingfang.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sctp_getsockopt_peer_auth_chunks() checks that the caller's optval
buffer is large enough for the peer AUTH chunk list with
if (len < num_chunks)
return -EINVAL;
but then writes num_chunks bytes to p->gauth_chunks, which lives
at offset offsetof(struct sctp_authchunks, gauth_chunks) == 8
inside optval. The check is missing the sizeof(struct
sctp_authchunks) = 8-byte header. When the caller supplies
len == num_chunks (for any num_chunks > 0) the test passes but
copy_to_user() writes sizeof(struct sctp_authchunks) = 8 bytes
past the declared buffer.
The sibling function sctp_getsockopt_local_auth_chunks() at the
next line already has the correct check:
if (len < sizeof(struct sctp_authchunks) + num_chunks)
return -EINVAL;
Align the peer variant with its sibling.
Reproducer confirms on v7.0-13-generic: an unprivileged userspace
caller that opens a loopback SCTP association with AUTH enabled,
queries num_chunks with a short optval, then issues the real
getsockopt with len == num_chunks and sentinel bytes painted past
the buffer observes those sentinel bytes overwritten with the
peer's AUTH chunk type. The bytes written are under the peer's
control but land in the caller's own userspace; this is not a
kernel memory corruption, but it is a kernel-side contract
violation that can silently corrupt adjacent userspace data.
Fixes: 65b07e5d0d ("[SCTP]: API updates to suport SCTP-AUTH extensions.")
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20260416031903.1447072-1-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If the driver executes ks8851_irq() AND a TX packet has been sent, then
the driver enables TX queue via netif_wake_queue() which schedules TX
softirq to queue packets for this device.
If CONFIG_PREEMPT_RT=y is set AND a packet has also been received by
the MAC, then ks8851_rx_pkts() calls netdev_alloc_skb_ip_align() to
allocate SKBs for the received packets. If netdev_alloc_skb_ip_align()
is called with BH enabled, then local_bh_enable() at the end of
netdev_alloc_skb_ip_align() will trigger the pending softirq processing,
which may ultimately call the .xmit callback ks8851_start_xmit_par().
The ks8851_start_xmit_par() will try to lock struct ks8851_net_par
.lock spinlock, which is already locked by ks8851_irq() from which
ks8851_start_xmit_par() was called. This leads to a deadlock, which
is reported by the kernel, including a trace listed below.
If CONFIG_PREEMPT_RT is not set, then since commit 0913ec336a
("net: ks8851: Fix deadlock with the SPI chip variant") the deadlock
can also be triggered without received packet in the RX FIFO. The
pending softirqs will be processed on return from
spin_unlock_bh(&ks->statelock) in ks8851_irq(), which triggers the
deadlock as well.
Fix the problem by disabling BH around critical sections, including the
IRQ handler, thus preventing the net_tx_action() softirq from triggering
during these critical sections. The net_tx_action() softirq is triggered
once BH are re-enabled and at the end of the IRQ handler, once all the
other IRQ handler actions have been completed.
__schedule from schedule_rtlock+0x1c/0x34
schedule_rtlock from rtlock_slowlock_locked+0x548/0x904
rtlock_slowlock_locked from rt_spin_lock+0x60/0x9c
rt_spin_lock from ks8851_start_xmit_par+0x74/0x1a8
ks8851_start_xmit_par from netdev_start_xmit+0x20/0x44
netdev_start_xmit from dev_hard_start_xmit+0xd0/0x188
dev_hard_start_xmit from sch_direct_xmit+0xb8/0x25c
sch_direct_xmit from __qdisc_run+0x1f8/0x4ec
__qdisc_run from qdisc_run+0x1c/0x28
qdisc_run from net_tx_action+0x1f0/0x268
net_tx_action from handle_softirqs+0x1a4/0x270
handle_softirqs from __local_bh_enable_ip+0xcc/0xe0
__local_bh_enable_ip from __alloc_skb+0xd8/0x128
__alloc_skb from __netdev_alloc_skb+0x3c/0x19c
__netdev_alloc_skb from ks8851_irq+0x388/0x4d4
ks8851_irq from irq_thread_fn+0x24/0x64
irq_thread_fn from irq_thread+0x178/0x28c
irq_thread from kthread+0x12c/0x138
kthread from ret_from_fork+0x14/0x28
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Fixes: e0863634bf ("net: ks8851: Queue RX packets in IRQ handler instead of disabling BHs")
Cc: stable@vger.kernel.org
Signed-off-by: Marek Vasut <marex@nabladev.com>
Link: https://patch.msgid.link/20260415231020.455298-1-marex@nabladev.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
SOCKMAP can hide inflight fd from AF_UNIX GC.
When a socket in SOCKMAP receives skb with inflight fd,
sk_psock_verdict_data_ready() looks up the mapped socket and
enqueue skb to its psock->ingress_skb.
Since neither the old nor the new GC can inspect the psock
queue, the hidden skb leaks the inflight sockets. Note that
this cannot be detected via kmemleak because inflight sockets
are linked to a global list.
In addition, SOCKMAP redirect breaks the Tarjan-based GC's
assumption that unix_edge.successor is always alive, which
is no longer true once skb is redirected, resulting in
use-after-free below. [0]
Moreover, SOCKMAP does not call scm_stat_del() properly,
so unix_show_fdinfo() could report an incorrect fd count.
sk_msg_recvmsg() does not support any SCM attributes in the
first place.
Let's drop all SCM attributes before passing skb to the
SOCKMAP layer.
[0]:
BUG: KASAN: slab-use-after-free in unix_del_edges (net/unix/garbage.c:118 net/unix/garbage.c:181 net/unix/garbage.c:251)
Read of size 8 at addr ffff888125362670 by task kworker/56:1/496
CPU: 56 UID: 0 PID: 496 Comm: kworker/56:1 Not tainted 7.0.0-rc7-00263-gb9d8b856689d #3 PREEMPT(lazy)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
Workqueue: events sk_psock_backlog
Call Trace:
<TASK>
dump_stack_lvl (lib/dump_stack.c:122)
print_report (mm/kasan/report.c:379)
kasan_report (mm/kasan/report.c:597)
unix_del_edges (net/unix/garbage.c:118 net/unix/garbage.c:181 net/unix/garbage.c:251)
unix_destroy_fpl (net/unix/garbage.c:317)
unix_destruct_scm (./include/net/scm.h:80 ./include/net/scm.h:86 net/unix/af_unix.c:1976)
sk_psock_backlog (./include/linux/skbuff.h:?)
process_scheduled_works (kernel/workqueue.c:?)
worker_thread (kernel/workqueue.c:?)
kthread (kernel/kthread.c:438)
ret_from_fork (arch/x86/kernel/process.c:164)
ret_from_fork_asm (arch/x86/entry/entry_64.S:258)
</TASK>
Allocated by task 955:
kasan_save_track (mm/kasan/common.c:58 mm/kasan/common.c:78)
__kasan_slab_alloc (mm/kasan/common.c:369)
kmem_cache_alloc_noprof (mm/slub.c:4539)
sk_prot_alloc (net/core/sock.c:2240)
sk_alloc (net/core/sock.c:2301)
unix_create1 (net/unix/af_unix.c:1099)
unix_create (net/unix/af_unix.c:1169)
__sock_create (net/socket.c:1606)
__sys_socketpair (net/socket.c:1811)
__x64_sys_socketpair (net/socket.c:1863 net/socket.c:1860 net/socket.c:1860)
do_syscall_64 (arch/x86/entry/syscall_64.c:?)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
Freed by task 496:
kasan_save_track (mm/kasan/common.c:58 mm/kasan/common.c:78)
kasan_save_free_info (mm/kasan/generic.c:587)
__kasan_slab_free (mm/kasan/common.c:287)
kmem_cache_free (mm/slub.c:6165)
__sk_destruct (net/core/sock.c:2282 net/core/sock.c:2384)
sk_psock_destroy (./include/net/sock.h:?)
process_scheduled_works (kernel/workqueue.c:?)
worker_thread (kernel/workqueue.c:?)
kthread (kernel/kthread.c:438)
ret_from_fork (arch/x86/kernel/process.c:164)
ret_from_fork_asm (arch/x86/entry/entry_64.S:258)
Fixes: c63829182c ("af_unix: Implement ->psock_update_sk_prot()")
Fixes: 77462de14a ("af_unix: Add read_sock for stream socket types")
Reported-by: Xingyu Jin <xingyuj@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260415184830.3988432-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
get_interfaces() will update both the plat->phy_interfaces and
mdio_bus_data->default_an_inband based on reading a SERDES register. As
get_interfaces() will be called after default_an_inband had already been
read, dwmac-intel regressed as a result with incorrect default_an_inband
value in phylink_config.
Therefore, we moved the priv->plat->get_interfaces() to be executed first
before assigning priv->plat->default_an_inband to config->default_an_inband
to ensure default_an_inband is in correct value.
Fixes: d3836052fe ("net: stmmac: intel: convert speed_mode_2500() to get_interfaces()")
Signed-off-by: KhaiWenTan <khai.wen.tan@linux.intel.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20260416102609.7953-1-khai.wen.tan@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Caching saddr and daddr before pskb_pull() is problematic
since skb->head can change.
Remove these temporary variables:
- We only access &ipv6_hdr(skb)->saddr and &ipv6_hdr(skb)->daddr
when net_dbg_ratelimited() is called in the slow path.
- Avoid potential future misuse after pskb_pull() call.
Fixes: 4b3418fba0 ("ipv6: icmp: include addresses in debug messages")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Joe Damato <joe@dama.to>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260416103505.2380753-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jacob Keller says:
====================
Intel Wired LAN Driver Updates 2026-04-14 (ice, i40e, iavf, e1000e)
Grzegorz updates the logic for adjusting the PTP hardware clock on E830,
fixing a bug that prevented adjustments below S32_MAX/MIN nanoseconds.
Grzegorz and Zoli update the PCS latency settings for E825 devices at 10GbE
and 25GbE, improving the accuracy of timestamps based on data from
production hardware.
Michal Schmidt fixes a double-free that could happen if a particular error
path is taken in ice_xmit_frame_ring().
Guangshuo fixes a double-free that could happen during error paths in the
ice_sf_eth_activate() function.
Paul Greenwalt fixes the PHY link configuration when the link-down-on-close
driver parameter is enabled and new media is inserted.
Paul Greenwalt fixes the ICE_AQ_LINK_SPEED_M macro for 200G, enabling 200G
link speed advertisement.
Keita Morisaki fixes a race condition in the ice Tx timestamp ring cleanup,
preventing a possible NULL pointer dereference.
Kohei Enju fixes a potential NULL pointer dereference in ice_set_ring_param().
Kohei Enju fixes i40e to stop advertising IFF_SUPP_NOFCS, when the driver
does not actually support the feature.
Petr fixes the VLAN L2TAG2 mask when the iAVF VF and a PF negotiate use of
the legacy Rx descriptor format.
Matt fixes the unrolling logic for PTP when the e1000e probe fails after
the PTP clock has been registered.
**A note to stable backports**
The patches [7/12] ("ice: fix race condition in TX timestamp ring
cleanup") and [8/12] ("ice: fix potential NULL pointer deref in error
path of ice_set_ringparam()") must be backported together. Otherwise the
fix in patch 8 will not work properly.
====================
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-0-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The IAVF_RXD_LEGACY_L2TAG2_M mask was incorrectly defined as
GENMASK_ULL(63, 32), extracting 32 bits from qw2 instead of the
16-bit VLAN tag. In the legacy Rx descriptor layout, the 2nd L2TAG2
(VLAN tag) occupies bits 63:48 of qw2, not 63:32.
The oversized mask causes FIELD_GET to return a 32-bit value where the
actual VLAN tag sits in bits 31:16. When this value is passed to
iavf_receive_skb() as a u16 parameter, it gets truncated to the lower
16 bits (which contain the 1st L2TAG2, typically zero). As a result,
__vlan_hwaccel_put_tag() is never called and software VLAN interfaces
on VFs receive no traffic.
This affects VFs behind ice PF (VIRTCHNL VLAN v2) when the PF
advertises VLAN stripping into L2TAG2_2 and legacy descriptors are
used.
The flex descriptor path already uses the correct mask
(IAVF_RXD_FLEX_L2TAG2_2_M = GENMASK_ULL(63, 48)).
Reproducer:
1. Create 2 VFs on ice PF (echo 2 > sriov_numvfs)
2. Disable spoofchk on both VFs
3. Move each VF into a separate network namespace
4. On each VF: create VLAN interface (e.g. vlan 198), assign IP,
bring up
5. Set rx-vlan-offload OFF on both VFs
6. Ping between VLAN interfaces -> expect PASS
(VLAN tag stays in packet data, kernel matches in-band)
7. Set rx-vlan-offload ON on both VFs
8. Ping between VLAN interfaces -> expect FAIL if bug present
(HW strips VLAN tag into descriptor L2TAG2 field, wrong mask
extracts bits 47:32 instead of 63:48, truncated to u16 -> zero,
__vlan_hwaccel_put_tag() never called, packet delivered to parent
interface, not VLAN interface)
The reproducer requires legacy Rx descriptors. On modern ice + iavf
with full PTP support, flex descriptors are always negotiated and the
buggy legacy path is never reached. Flex descriptors require all of:
- CONFIG_PTP_1588_CLOCK enabled
- VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC granted by PF
- PTP capabilities negotiated (VIRTCHNL_VF_CAP_PTP)
- VIRTCHNL_1588_PTP_CAP_RX_TSTAMP supported
- VIRTCHNL_RXDID_2_FLEX_SQ_NIC present in DDP profile
If any condition is not met, iavf_select_rx_desc_format() falls back
to legacy descriptors (RXDID=1) and the wrong L2TAG2 mask is hit.
Fixes: 2dc8e7c36d ("iavf: refactor iavf_clean_rx_irq to support legacy and flex descriptors")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-10-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
i40e advertises IFF_SUPP_NOFCS, allowing users to use the SO_NOFCS
socket option. However, this option is silently ignored, as the driver
does not check skb->no_fcs, and always enables FCS insertion offload.
Fix this by removing the advertisement of IFF_SUPP_NOFCS.
This behavior can be reproduced with a simple AF_PACKET socket:
import socket
s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
s.setsockopt(socket.SOL_SOCKET, 43, 1) # SO_NOFCS
s.bind(("eth0", 0))
s.send(b'\xff' * 64)
Previously, send() succeeds but the driver ignores SO_NOFCS.
With this change, send() fails with -EPROTONOSUPPORT, as expected.
Fixes: 41c445ff0f ("i40e: main driver core")
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-9-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ice_set_ringparam nullifies tstamp_ring of temporary tx_rings, without
clearing ICE_TX_RING_FLAGS_TXTIME bit.
When ICE_TX_RING_FLAGS_TXTIME is set and the subsequent
ice_setup_tx_ring() call fails, a NULL pointer dereference could happen
in the unwinding sequence:
ice_clean_tx_ring()
-> ice_is_txtime_cfg() == true (ICE_TX_RING_FLAGS_TXTIME is set)
-> ice_free_tx_tstamp_ring()
-> ice_free_tstamp_ring()
-> tstamp_ring->desc (NULL deref)
Clear ICE_TX_RING_FLAGS_TXTIME bit to avoid the potential issue.
Note that this potential issue is found by manual code review.
Compile test only since unfortunately I don't have E830 devices.
Fixes: ccde82e909 ("ice: add E830 Earliest TxTime First Offload support")
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Reviewed-by: Paul Greenwalt <paul.greenwalt@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-8-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix a race condition between ice_free_tx_tstamp_ring() and ice_tx_map()
that can cause a NULL pointer dereference.
ice_free_tx_tstamp_ring currently clears the ICE_TX_FLAGS_TXTIME flag
after NULLing the tstamp_ring. This could allow a concurrent ice_tx_map
call on another CPU to dereference the tstamp_ring, which could lead to
a NULL pointer dereference.
CPU A:ice_free_tx_tstamp_ring() | CPU B:ice_tx_map()
--------------------------------|---------------------------------
tx_ring->tstamp_ring = NULL |
| ice_is_txtime_cfg() -> true
| tstamp_ring = tx_ring->tstamp_ring
| tstamp_ring->count // NULL deref!
flags &= ~ICE_TX_FLAGS_TXTIME |
Fix by:
1. Reordering ice_free_tx_tstamp_ring() to clear the flag before
NULLing the pointer, with smp_wmb() to ensure proper ordering.
2. Adding smp_rmb() in ice_tx_map() after the flag check to order the
flag read before the pointer read, using READ_ONCE() for the
pointer, and adding a NULL check as a safety net.
3. Converting tx_ring->flags from u8 to DECLARE_BITMAP() and using
atomic bitops (set_bit(), clear_bit(), test_bit()) for all flag
operations throughout the driver:
- ICE_TX_RING_FLAGS_XDP
- ICE_TX_RING_FLAGS_VLAN_L2TAG1
- ICE_TX_RING_FLAGS_VLAN_L2TAG2
- ICE_TX_RING_FLAGS_TXTIME
Fixes: ccde82e909 ("ice: add E830 Earliest TxTime First Offload support")
Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-7-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When setting PHY configuration during driver initialization, 200G link
speed is not being advertised even when the PHY is capable. This is
because the get PHY capabilities link speed response is being masked by
ICE_AQ_LINK_SPEED_M, which does not include the 200G link speed bit.
ICE_AQ_LINK_SPEED_200GB is defined as BIT(11), but the mask 0x7FF only
covers bits 0-10. Fix ICE_AQ_LINK_SPEED_M to use GENMASK(11, 0) so
that it covers all defined link speed bits including 200G.
Fixes: 24407a01e5 ("ice: Add 200G speed/phy type use")
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-6-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 1a3571b593 ("ice: restore PHY settings on media insertion")
introduced separate flows for setting PHY configuration on media
present: ice_configure_phy() when link-down-on-close is disabled, and
ice_force_phys_link_state() when enabled. The latter incorrectly uses
the previous configuration even after module change, causing link
issues such as wrong speed or no link.
Unify PHY configuration into a single ice_phy_cfg() function with a
link_en parameter, ensuring PHY capabilities are always fetched fresh
from hardware.
Fixes: 1a3571b593 ("ice: restore PHY settings on media insertion")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-5-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If ice_tso() or ice_tx_csum() fail, the error path in
ice_xmit_frame_ring() frees the skb, but the 'first' tx_buf still points
to it and is marked as valid (ICE_TX_BUF_SKB).
'next_to_use' remains unchanged, so the potential problem will
likely fix itself when the next packet is transmitted and the tx_buf
gets overwritten. But if there is no next packet and the interface is
brought down instead, ice_clean_tx_ring() -> ice_unmap_and_free_tx_buf()
will find the tx_buf and free the skb for the second time.
The fix is to reset the tx_buf type to ICE_TX_BUF_EMPTY in the error
path, so that ice_unmap_and_free_tx_buf().
Move the initialization of 'first' up, to ensure it's already valid in
case we hit the linearization error path.
The bug was spotted by AI while I had it looking for something else.
It also proposed an initial version of the patch.
I reproduced the bug and tested the fix by adding code to inject
failures, on a build with KASAN.
I looked for similar bugs in related Intel drivers and did not find any.
Fixes: d76a60ba7a ("ice: Add support for VLANs and offloads")
Assisted-by: Claude:claude-4.6-opus-high Cursor
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-4-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Update MAC Rx/Tx offset registers settings (PHY_MAC_[RX|TX]_OFFSET
registers) with the data obtained with the latest research. It applies
to PCS latency settings for the following speeds/modes:
* 10Gb NO-FEC
- TX latency changed from 71.25 ns to 73 ns
- RX latency changed from -25.6 ns to -28 ns
* 25Gb NO-FEC
- TX latency changed from 28.17 ns to 33 ns
- RX latency changed from -12.45 ns to -12 ns
* 25Gb RS-FEC
- TX latency changed from 64.5 ns to 69 ns
- RX latency changed from -3.6 ns to -3 ns
The original data came from simulation and pre-production hardware.
The new data measures the actual delays and as such is more accurate.
Fixes: 7cab44f1c3 ("ice: Introduce ETH56G PHY model for E825C products")
Co-developed-by: Zoltan Fodor <zoltan.fodor@intel.com>
Signed-off-by: Zoltan Fodor <zoltan.fodor@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-2-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix incorrect 'adjust the timer' programming sequence for E830 devices
series. Only shadow registers GLTSYN_SHADJ were programmed in the
current implementation. According to the specification [1], write to
command GLTSYN_CMD register is also required with CMD field set to
"Adjust the Time" value, for the timer adjustment to take the effect.
The flow was broken for the adjustment less than S32_MAX/MIN range
(around +/- 2 seconds). For bigger adjustment, non-atomic programming
flow is used, involving set timer programming. Non-atomic flow is
implemented correctly.
Testing hints:
Run command:
phc_ctl /dev/ptpX get adj 2 get
Expected result:
Returned timestamps differ at least by 2 seconds
[1] Intel® Ethernet Controller E830 Datasheet rev 1.3, chapter 9.7.5.4
https://cdrdv2.intel.com/v1/dl/getContent/787353?explicitVersion=true
Fixes: f003075227 ("ice: Implement PTP support for E830 devices")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-1-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Antonio Quartulli says:
====================
This batch includes only fixes to the selftest harness:
* switch to TAP test orchestration
* parse slurped notifications as returned by jq -s
* add ovpn_ prefix to helpers and global variables to avoid clashes
* fail test in case of netlink notification mismatch
* add missing kernel config dependencies
* add delay when launching multiple ynl/cli.py listeners
* tag 'ovpn-net-20260417' of https://github.com/OpenVPN/ovpn-net-next:
selftests: ovpn: serialize YNL listener startup
selftests: ovpn: align command flow with TAP
selftests: ovpn: add prefix to helpers and shared variables
selftests: ovpn: flatten slurped notification JSON before filtering
selftests: ovpn: fail notification check on mismatch
selftests: ovpn: add nftables config dependencies for test-mark
====================
Link: https://patch.msgid.link/20260417090305.2775723-1-antonio@openvpn.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet says:
====================
tcp: take care of tcp_get_timestamping_opt_stats() races
tcp_get_timestamping_opt_stats() does not own the socket lock,
this is intentional.
It calls tcp_get_info_chrono_stats() while other threads could
change chrono fields in tcp_chrono_set(). It also reads many
tcp socket fields that can be modified by other cpus/threads.
I do not think we need coherent TCP socket state snapshot
in tcp_get_timestamping_opt_stats().
Add READ_ONCE()/WRITE_ONCE() or data_race() annotations.
Note that icsk_ca_state is a bitfield, thus not covered
in this series.
====================
Link: https://patch.msgid.link/20260416200319.3608680-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_get_timestamping_opt_stats() does not own the socket lock,
this is intentional.
It calls tcp_get_info_chrono_stats() while other threads could
change chrono fields in tcp_chrono_set().
I do not think we need coherent TCP socket state snapshot
in tcp_get_timestamping_opt_stats(), I chose to only
add annotations to keep KCSAN happy.
Fixes: 1c885808e4 ("tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Starting one background YNL notification listener per peer back-to-back
can intermittently stall the test setup before the listeners even reach
the Python main function.
This was reproducible in a reduced test.sh setup-only loop: a single
listener stayed stable across repeated runs, while starting listeners
for all peers could hang early in the listener launch phase. Adding a
short delay between listener launches makes the listeners start cleanly
and eliminates the reproduced hangs in repeated normal and slow-runner
tests.
Serialize listener startup with a small sleep between setup_listener
calls.
Fixes: 77de28cd7c ("selftests: ovpn: add notification parsing and matching")
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Current tests do not properly adhere to the TAP infrastructure
therefore they do not properly report failures leading to hangs of
the CI machinery.
Restructure ovpn selftests into using the TAP infrastructure: split each
test in stages, execute stage bodies with fail-fast semantics, and emit
KTAP pass/fail for each stage.
Centralize behavior control in common.sh and makes the scripts use
dedicated wrappers for required-success, expected-failure, and non-fatal
commands. Also add the OVPN_VERBOSE mode that exposes captured command
output for debugging.
This way tests won't hang anymore in case of failure when executed
within the CI machinery.
This change also makes default OVPN_CLI and YNL resolution
independent from the caller CWD by anchoring both to COMMON_DIR, so
behavior is stable across direct execution and run_tests-style
execution.
Fixes: 959bc330a4 ("testing/selftests: add test tool and scripts for ovpn module")
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Current naming for shared variables, helpers and netnamespaces is
a bit unfortunate as it doesn't come with a clean prefix.
This showed to be problematic in case of name clashes with external
scripts or in case of abrupt test termination (hanging netns' weren't
easily reconducible to ovpn).
Rename common helper entry points and all shared globals in the ovpn
selftests to ovpn_ or OVPN_ names so test scripts and wrappers use a
single explicit prefix. Also rename the temporary network namespaces
created by the tests from peerN to ovpn_peerN. This makes leaked
namespaces easier to identify.
This is a mechanical refactor only, behavior is unchanged.
Fixes: 959bc330a4 ("testing/selftests: add test tool and scripts for ovpn module")
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Notification comparison uses jq -s, which slurps all inputs into an
array. Some inputs can be arrays themselves, and applying the .msg.peer
filter directly on those entries triggers jq type errors.
Expand any array-valued JSON items returned by jq -s before selecting
.msg.peer, so the filter handles both normal notification objects and []
entries without type errors.
Fixes: 77de28cd7c ("selftests: ovpn: add notification parsing and matching")
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
compare_ntfs doesn't fail when expected and received notification
streams diverge.
Fix this bug by tracking the diff exit status explicitly and return it
to the caller so notification mismatches propagate as test failures.
Fixes: 77de28cd7c ("selftests: ovpn: add notification parsing and matching")
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
test-mark.sh installs nftables rules in an inet/filter output chain and
verifies packet drops via nft counters. In vmksft this can fail when the
nftables core is not enabled by the ovpn selftest config.
Add the missing kernel options required by this test:
- CONFIG_NETFILTER
- CONFIG_NF_TABLES
- CONFIG_NF_TABLES_INET
Fixes: 7b80d8a335 ("selftests: ovpn: add test for the FW mark feature")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/all/20260319124114.42f91f72@kernel.org/
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Luigi Leonardi says:
====================
vsock/virtio: fix MSG_PEEK calculation on bytes to copy
`virtio_transport_stream_do_peek`, when calculating the number of bytes to
copy, didn't consider the `offset`, caused by partial reads that happened
before.
This might cause out-of-bounds read that lead to an EFAULT.
More details in the commits.
Commit 1 introduces the fix
Commit 2 introduces some preliminary work for adding a test and fixes a
problem in existing tests.
Commit 3 introduces a test that checks for this bug to avoid future
regressions.
For disclosure: this bug was found initially by claude opus 4.6, I then analyzed
it and worked on the fix and the test.
====================
Link: https://patch.msgid.link/20260415-fix_peek-v4-0-8207e872759e@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a test that verifies MSG_PEEK works correctly after a partial
recv().
This is to test a bug that was present in the
`virtio_transport_stream_do_peek()` when computing the number of bytes to
copy: After a partial read, the peek function didn't take into
consideration the number of bytes that were already read. So peeking the
whole buffer would cause an out-of-bounds read, that resulted in a -EFAULT.
This test does exactly this: do a partial recv on a buffer, then try to
peek the whole buffer content. The test re-uses
`test_stream_msg_peek_client()` to also cover this scenario.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Luigi Leonardi <leonardi@redhat.com>
Link: https://patch.msgid.link/20260415-fix_peek-v4-3-8207e872759e@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
`recv_buf` does not handle the MSG_PEEK flag correctly: it keeps calling
`recv` until all requested bytes are available or an error occurs.
The problem is how it calculates the number of bytes read: MSG_PEEK
doesn't consume any bytes and will re-read the same bytes from the buffer
head, so summing the return value every time is wrong.
Moreover, MSG_PEEK doesn't consume the bytes in the buffer, so if more
bytes are requested than are available, the loop will never terminate,
because `recv` will never return EOF. For this reason, we need to compare
the number of bytes read with the number of bytes expected.
Add a check: if the MSG_PEEK flag is present, update the byte counter and
break out of the loop only after at least the expected number of bytes
have been received; otherwise, retry after a short delay to avoid
consuming too many CPU cycles.
This allows us to simplify the `test_stream_credit_update_test` by
reusing `recv_buf`, like some other tests already do.
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20260415-fix_peek-v4-2-8207e872759e@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
`virtio_transport_stream_do_peek()` does not account for the skb offset
when computing the number of bytes to copy.
This means that, after a partial recv() that advances the offset, a peek
requesting more bytes than are available in the sk_buff causes
`skb_copy_datagram_iter()` to go past the valid payload, resulting in
a -EFAULT.
The dequeue path already handles this correctly.
Apply the same logic to the peek path.
Fixes: 0df7cd3c13 ("vsock/virtio/vhost: read data from non-linear skb")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Signed-off-by: Luigi Leonardi <leonardi@redhat.com>
Link: https://patch.msgid.link/20260415-fix_peek-v4-1-8207e872759e@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wei Fang says:
====================
net: enetc: fix command BD ring issues
Currently, the implementation of command BD ring has two issues, one is
that the driver may obtain wrong consumer index of the ring, because the
driver does not mask out the SBE bit of the CIR value, so a wrong index
will be obtained when a SBE error ouccrs. The other one is that the DMA
buffer may be used after free. If netc_xmit_ntmp_cmd() times out and
returns an error, the pending command is not explicitly aborted, while
ntmp_free_data_mem() unconditionally frees the DMA buffer. If the buffer
has already been reallocated elsewhere, this may lead to silent memory
corruption. Because the hardware eventually processes the pending command
and perform a DMA write of the response to the physical address of the
freed buffer. So this patch set is to fix these two issues.
====================
Link: https://patch.msgid.link/20260415060833.2303846-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>