linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-18 01:58:29 -04:00

Author	SHA1	Message	Date
Zilin Guan	fe868b499d	ice: Fix memory leak in ice_set_ringparam() In ice_set_ringparam, tx_rings and xdp_rings are allocated before rx_rings. If the allocation of rx_rings fails, the code jumps to the done label leaking both tx_rings and xdp_rings. Furthermore, if the setup of an individual Rx ring fails during the loop, the code jumps to the free_tx label which releases tx_rings but leaks xdp_rings. Fix this by introducing a free_xdp label and updating the error paths to ensure both xdp_rings and tx_rings are properly freed if rx_rings allocation or setup fails. Compile tested only. Issue found using a prototype static analysis tool and code review. Fixes: `fcea6f3da5` ("ice: Add stats and ethtool support") Fixes: `efc2214b60` ("ice: Add support for XDP") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2026-03-03 13:06:04 -08:00
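Note on fe868b499d: the fix described above is an instance of the usual goto-based cleanup ladder, where each label releases one resource and falls through to the labels for everything allocated earlier. The following is a minimal userspace sketch of that pattern, not the driver code: `ring_alloc()`, `ring_free()`, `resize_rings_sketch()`, `live_allocs` and the `fail_step` knob are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Live allocation counter so a leak is observable in this sketch. */
int live_allocs;

/* Allocate a dummy ring array; 'fail' simulates an allocation failure. */
static void *ring_alloc(int fail)
{
	void *p = fail ? NULL : malloc(64);

	if (p)
		live_allocs++;
	return p;
}

static void ring_free(void *p)
{
	if (p) {
		free(p);
		live_allocs--;
	}
	assert(live_allocs >= 0);
}

/*
 * fail_step (hypothetical knob): 0 = no failure, 1 = tx alloc,
 * 2 = xdp alloc, 3 = rx alloc, 4 = per-ring rx setup.
 */
int resize_rings_sketch(int fail_step)
{
	void *tx_rings, *xdp_rings, *rx_rings;
	int err = 0;

	tx_rings = ring_alloc(fail_step == 1);
	if (!tx_rings) {
		err = -ENOMEM;
		goto done;
	}

	xdp_rings = ring_alloc(fail_step == 2);
	if (!xdp_rings) {
		err = -ENOMEM;
		goto free_tx;
	}

	rx_rings = ring_alloc(fail_step == 3);
	if (!rx_rings) {
		err = -ENOMEM;
		goto free_xdp;	/* jumping to "done" here would leak tx and xdp */
	}

	if (fail_step == 4) {
		err = -ENOMEM;
		goto free_rx;
	}

	/* A real driver would install the new rings here; the sketch simply
	 * falls through and releases everything so the counter returns to 0. */
free_rx:
	ring_free(rx_rings);
free_xdp:
	ring_free(xdp_rings);
free_tx:
	ring_free(tx_rings);
done:
	return err;
}
```

Calling `resize_rings_sketch()` with any `fail_step` from 0 to 4 should leave `live_allocs` at zero, which is the property the commit restores for the rx failure paths.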
Jakub Staniszewski	fb4903b335	ice: fix retry for AQ command 0x06EE Executing ethtool -m can fail, reporting a netlink I/O error while firmware link management holds the i2c bus used to communicate with the module. According to Intel(R) Ethernet Controller E810 Datasheet Rev 2.8 [1] Section 3.3.10.4 Read/Write SFF EEPROM (0x06EE) request should be retried upon receiving EBUSY from firmware. Commit `e9c9692c8a` ("ice: Reimplement module reads used by ethtool") implemented it only for part of ice_get_module_eeprom(), leaving all other calls to ice_aq_sff_eeprom() vulnerable to returning early on getting EBUSY without retrying. Remove the retry loop from ice_get_module_eeprom() and add Admin Queue (AQ) command with opcode 0x06EE to the list of commands that should be retried on receiving EBUSY from firmware. Cc: stable@vger.kernel.org Fixes: `e9c9692c8a` ("ice: Reimplement module reads used by ethtool") Signed-off-by: Jakub Staniszewski <jakub.staniszewski@linux.intel.com> Co-developed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Signed-off-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://www.intel.com/content/www/us/en/content-details/613875/intel-ethernet-controller-e810-datasheet.html [1] Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2026-03-03 13:06:04 -08:00
Jakub Staniszewski	326256c0a7	ice: reintroduce retry mechanism for indirect AQ Add retry mechanism for indirect Admin Queue (AQ) commands. To do so we need to keep the command buffer. This technically reverts commit `43a630e37e` ("ice: remove unused buffer copy code in ice_sq_send_cmd_retry()"), but combines it with a fix in the logic by using a kmemdup() call, making it more robust and less likely to break in the future due to programmer error. Cc: Michal Schmidt <mschmidt@redhat.com> Cc: stable@vger.kernel.org Fixes: `3056df93f7` ("ice: Re-send some AQ commands, as result of EBUSY AQ error") Signed-off-by: Jakub Staniszewski <jakub.staniszewski@linux.intel.com> Co-developed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Signed-off-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2026-03-03 13:06:04 -08:00
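Note on 326256c0a7: a retry loop for a command that carries a data buffer needs a pristine copy of that buffer if a failed attempt can overwrite it. The sketch below shows that idea in userspace C, with `malloc()` plus `memcpy()` standing in for `kmemdup()`. Everything here is hypothetical illustration, including `fake_send()`, `send_with_retry()` and the assumption that the response clobbers the request buffer; it is not the ice driver's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical firmware stand-in: reports EBUSY 'busy_left' times and
 * overwrites the caller's buffer on every attempt. */
static int busy_left;
static unsigned char last_seen_first_byte;

static int fake_send(unsigned char *buf, size_t len)
{
	assert(len > 0);
	last_seen_first_byte = buf[0];	/* what the "firmware" received */
	memset(buf, 0xff, len);		/* write-back clobbers the request */
	if (busy_left > 0) {
		busy_left--;
		return -EBUSY;
	}
	return 0;
}

static int send_with_retry(unsigned char *buf, size_t len, int max_tries)
{
	unsigned char *copy;
	int err = -EBUSY;
	int i;

	copy = malloc(len);		/* userspace analogue of kmemdup() */
	if (!copy)
		return -ENOMEM;
	memcpy(copy, buf, len);

	for (i = 0; i < max_tries; i++) {
		if (i > 0)
			memcpy(buf, copy, len);	/* restore the original request */
		err = fake_send(buf, len);
		if (err != -EBUSY)
			break;
	}

	free(copy);
	return err;
}

/* Returns the first request byte seen on the final attempt, or 0 on error. */
unsigned char retry_demo(int busy_times)
{
	unsigned char req[4] = { 0x2a, 0, 0, 0 };

	busy_left = busy_times;
	if (send_with_retry(req, sizeof(req), 5))
		return 0;
	return last_seen_first_byte;
}
```

Without the `memcpy()` restore, the second attempt in this sketch would resend the clobbered bytes rather than the original request.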
Larysa Zaremba	eef33aa449	ice: fix adding AQ LLDP filter for VF The referenced commit came from a misunderstanding of the FW LLDP filter AQ (Admin Queue) command due to the error in the internal documentation. Contrary to the assumptions in the original commit, VFs can be added and deleted from this filter without any problems. Introduced dev_info message proved to be useful, so reverting the whole commit does not make sense. Without this fix, trusted VFs do not receive LLDP traffic, if there is an AQ LLDP filter on PF. When trusted VF attempts to add an LLDP multicast MAC address, the following message can be seen in dmesg on host: ice 0000:33:00.0: Failed to add Rx LLDP rule on VSI 20 error: -95 Revert checking VSI type when adding LLDP filter through AQ. Fixes: `4d5a1c4e6d` ("ice: do not add LLDP-specific filter if not necessary") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2026-03-03 13:06:04 -08:00
YiFei Zhu	1a86a1f7d8	net: Fix rcu_tasks stall in threaded busypoll I was debugging a NIC driver when I noticed that when I enable threaded busypoll, bpftrace hangs when starting up. dmesg showed: rcu_tasks_wait_gp: rcu_tasks grace period number 85 (since boot) is 10658 jiffies old. rcu_tasks_wait_gp: rcu_tasks grace period number 85 (since boot) is 40793 jiffies old. rcu_tasks_wait_gp: rcu_tasks grace period number 85 (since boot) is 131273 jiffies old. rcu_tasks_wait_gp: rcu_tasks grace period number 85 (since boot) is 402058 jiffies old. INFO: rcu_tasks detected stalls on tasks: 00000000769f52cd: .N nvcsw: 2/2 holdout: 1 idle_cpu: -1/64 task:napi/eth2-8265 state:R running task stack:0 pid:48300 tgid:48300 ppid:2 task_flags:0x208040 flags:0x00004000 Call Trace: <TASK> ? napi_threaded_poll_loop+0x27c/0x2c0 ? __pfx_napi_threaded_poll+0x10/0x10 ? napi_threaded_poll+0x26/0x80 ? kthread+0xfa/0x240 ? __pfx_kthread+0x10/0x10 ? ret_from_fork+0x31/0x50 ? __pfx_kthread+0x10/0x10 ? ret_from_fork_asm+0x1a/0x30 </TASK> The cause is that in threaded busypoll, the main loop is in napi_threaded_poll rather than napi_threaded_poll_loop, where the latter rarely iterates more than once within its loop. For rcu_softirq_qs_periodic inside napi_threaded_poll_loop to report its qs state, the last_qs must be 100ms behind, and this can't happen because napi_threaded_poll_loop rarely iterates in threaded busypoll, and each time napi_threaded_poll_loop is called last_qs is reset to latest jiffies. This patch changes so that in threaded busypoll, last_qs is saved in the outer napi_threaded_poll, and whether busy_poll_last_qs is NULL indicates whether napi_threaded_poll_loop is called for busypoll. This way last_qs would not reset to latest jiffies on each invocation of napi_threaded_poll_loop. 
Fixes: `c18d4b190a` ("net: Extend NAPI threaded polling to allow kthread based busy polling") Cc: stable@vger.kernel.org Signed-off-by: YiFei Zhu <zhuyifei@google.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Link: https://patch.msgid.link/20260227221937.1060857-1-zhuyifei@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 13:44:28 +01:00
Allison Henderson	6a877ececd	net/rds: Fix circular locking dependency in rds_tcp_tune syzbot reported a circular locking dependency in rds_tcp_tune() where sk_net_refcnt_upgrade() is called while holding the socket lock: ====================================================== WARNING: possible circular locking dependency detected ====================================================== kworker/u10:8/15040 is trying to acquire lock: ffffffff8e9aaf80 (fs_reclaim){+.+.}-{0:0}, at: __kmalloc_cache_noprof+0x4b/0x6f0 but task is already holding lock: ffff88805a3c1ce0 (k-sk_lock-AF_INET6){+.+.}-{0:0}, at: rds_tcp_tune+0xd7/0x930 The issue occurs because sk_net_refcnt_upgrade() performs memory allocation (via get_net_track() -> ref_tracker_alloc()) while the socket lock is held, creating a circular dependency with fs_reclaim. Fix this by moving sk_net_refcnt_upgrade() outside the socket lock critical section. This is safe because the fields modified by the sk_net_refcnt_upgrade() call (sk_net_refcnt, ns_tracker) are not accessed by any concurrent code path at this point. v2: - Corrected fixes tag - check patch line wrap nits - ai commentary nits Reported-by: syzbot+2e2cf5331207053b8106@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2e2cf5331207053b8106 Fixes: `3a58f13a88` ("net: rds: acquire refcount on TCP sockets") Signed-off-by: Allison Henderson <achender@kernel.org> Link: https://patch.msgid.link/20260227202336.167757-1-achender@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 12:57:06 +01:00
Eric Dumazet	710f5c7658	indirect_call_wrapper: do not reevaluate function pointer We have an increasing number of READ_ONCE(xxx->function) combined with INDIRECT_CALL_[1234]() helpers. Unfortunately this forces INDIRECT_CALL_[1234]() to read xxx->function many times, which is not what we wanted. Fix these macros so that xxx->function value is not reloaded. $ scripts/bloat-o-meter -t vmlinux.0 vmlinux add/remove: 0/0 grow/shrink: 1/65 up/down: 122/-1084 (-962) Function old new delta ip_push_pending_frames 59 181 +122 ip6_finish_output 687 681 -6 __udp_enqueue_schedule_skb 1078 1072 -6 ioam6_output 2319 2312 -7 xfrm4_rcv_encap_finish2 64 56 -8 xfrm4_output 297 289 -8 vrf_ip_local_out 278 270 -8 vrf_ip6_local_out 278 270 -8 seg6_input_finish 64 56 -8 rpl_output 700 692 -8 ipmr_forward_finish 124 116 -8 ip_forward_finish 143 135 -8 ip6mr_forward2_finish 100 92 -8 ip6_forward_finish 73 65 -8 input_action_end_bpf 1091 1083 -8 dst_input 52 44 -8 __xfrm6_output 801 793 -8 __xfrm4_output 83 75 -8 bpf_input 500 491 -9 __tcp_check_space 530 521 -9 input_action_end_dt6 291 280 -11 vti6_tnl_xmit 1634 1622 -12 bpf_xmit 1203 1191 -12 rpl_input 497 483 -14 rawv6_send_hdrinc 1355 1341 -14 ndisc_send_skb 1030 1016 -14 ipv6_srh_rcv 1377 1363 -14 ip_send_unicast_reply 1253 1239 -14 ip_rcv_finish 226 212 -14 ip6_rcv_finish 300 286 -14 input_action_end_x_core 205 191 -14 input_action_end_x 355 341 -14 input_action_end_t 205 191 -14 input_action_end_dx6_finish 127 113 -14 input_action_end_dx4_finish 373 359 -14 input_action_end_dt4 426 412 -14 input_action_end_core 186 172 -14 input_action_end_b6_encap 292 278 -14 input_action_end_b6 198 184 -14 igmp6_send 1332 1318 -14 ip_sublist_rcv 864 848 -16 ip6_sublist_rcv 1091 1075 -16 ipv6_rpl_srh_rcv 1937 1920 -17 xfrm_policy_queue_process 1246 1228 -18 seg6_output_core 903 885 -18 mld_sendpack 856 836 -20 NF_HOOK 756 736 -20 vti_tunnel_xmit 1447 1426 -21 input_action_end_dx6 664 642 -22 input_action_end 1502 1480 -22 sock_sendmsg_nosec 134 111 
-23 ip6mr_forward2 388 364 -24 sock_recvmsg_nosec 134 109 -25 seg6_input_core 836 810 -26 ip_send_skb 172 146 -26 ip_local_out 140 114 -26 ip6_local_out 140 114 -26 __sock_sendmsg 162 136 -26 __ip_queue_xmit 1196 1170 -26 __ip_finish_output 405 379 -26 ipmr_queue_fwd_xmit 373 346 -27 sock_recvmsg 173 145 -28 ip6_xmit 1635 1607 -28 xfrm_output_resume 1418 1389 -29 ip_build_and_send_pkt 625 591 -34 dst_output 504 432 -72 Total: Before=25217686, After=25216724, chg -0.00% Fixes: `283c16a2df` ("indirect call wrappers: helpers to speed-up indirect calls of builtin") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260227172603.1700433-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 12:41:29 +01:00
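Note on 710f5c7658: the problem described above is ordinary macro double evaluation. If the function-pointer argument is an expression with a cost or side effect (such as a `READ_ONCE()` load), a macro that names it both in the comparison and in the fallback call evaluates it more than once. The sketch below reproduces the effect in userspace with simplified, hypothetical macros (`CALL_1_REEVAL`, `CALL_1_ONCE`) rather than the kernel's `INDIRECT_CALL_*()` definitions. It relies on the GNU C statement-expression and `__typeof__` extensions, so it assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*op_fn)(int);

static int add_one(int x) { return x + 1; }
static int add_two(int x) { return x + 2; }

static op_fn slot = add_two;	/* stands in for xxx->function */
static int loads;		/* how many times the pointer was read */

/* Stand-in for READ_ONCE(xxx->function): counts every evaluation. */
static op_fn load_slot(void)
{
	assert(slot != NULL);
	loads++;
	return slot;
}

/* Names 'f' twice: once in the comparison, once in the fallback call. */
#define CALL_1_REEVAL(f, f1, ...) \
	((f) == (f1) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))

/* Evaluates 'f' exactly once into a local, then reuses the local. */
#define CALL_1_ONCE(f, f1, ...)					\
	({							\
		__typeof__(f) fn_ = (f);			\
		fn_ == (f1) ? f1(__VA_ARGS__) : fn_(__VA_ARGS__); \
	})

int loads_with_reeval(void)
{
	loads = 0;
	(void)CALL_1_REEVAL(load_slot(), add_one, 1);
	return loads;
}

int loads_with_once(void)
{
	loads = 0;
	(void)CALL_1_ONCE(load_slot(), add_one, 1);
	return loads;
}

int call_once_result(void)
{
	return CALL_1_ONCE(load_slot(), add_one, 1);
}
```

Because `slot` does not match the predicted target `add_one`, the first macro reads the pointer twice and the second reads it once; both still call `add_two`.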
Paolo Abeni	699f3b2e51	Merge branch 'avoid-compiler-and-iq-oq-reordering' Vimlesh Kumar says: ==================== avoid compiler and IQ/OQ reordering Utilize READ_ONCE and WRITE_ONCE APIs to prevent compiler optimization and reordering. Ensure IO queue OUT/IN_CNT registers are flushed. Relocate IQ/OQ IN/OUT_CNTS updates to occur before NAPI completion, and replace napi_complete with napi_complete_done. ==================== Link: https://patch.msgid.link/20260227091402.1773833-1-vimleshk@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 11:34:22 +01:00
Vimlesh Kumar	6c73126ecd	octeon_ep_vf: avoid compiler and IQ/OQ reordering Utilize READ_ONCE and WRITE_ONCE APIs for IO queue Tx/Rx variable access to prevent compiler optimization and reordering. Additionally, ensure IO queue OUT/IN_CNT registers are flushed by performing a read-back after writing. The compiler could reorder reads/writes to pkts_pending, last_pkt_count, etc., causing stale values to be used when calculating packets to process or register updates to send to hardware. The Octeon hardware requires a read-back after writing to OUT_CNT/IN_CNT registers to ensure the write has been flushed through any posted write buffers before the interrupt resend bit is set. Without this, we have observed cases where the hardware didn't properly update its internal state. wmb/rmb only provides ordering guarantees but doesn't prevent the compiler from performing optimizations like caching in registers, load tearing etc. Fixes: `1cd3b40797` ("octeon_ep_vf: add Tx/Rx processing and interrupt support") Signed-off-by: Sathesh Edara <sedara@marvell.com> Signed-off-by: Shinas Rasheed <srasheed@marvell.com> Signed-off-by: Vimlesh Kumar <vimleshk@marvell.com> Link: https://patch.msgid.link/20260227091402.1773833-5-vimleshk@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 11:34:20 +01:00
Vimlesh Kumar	2ae7d20fb2	octeon_ep_vf: Relocate counter updates before NAPI Relocate IQ/OQ IN/OUT_CNTS updates to occur before NAPI completion. Moving the IQ/OQ counter updates before napi_complete_done ensures 1. Counter registers are updated before re-enabling interrupts. 2. Prevents a race where new packets arrive but counters aren't properly synchronized. Fixes: `1cd3b40797` ("octeon_ep_vf: add Tx/Rx processing and interrupt support") Signed-off-by: Sathesh Edara <sedara@marvell.com> Signed-off-by: Shinas Rasheed <srasheed@marvell.com> Signed-off-by: Vimlesh Kumar <vimleshk@marvell.com> Link: https://patch.msgid.link/20260227091402.1773833-4-vimleshk@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 11:34:20 +01:00
Vimlesh Kumar	43b3160cb6	octeon_ep: avoid compiler and IQ/OQ reordering Utilize READ_ONCE and WRITE_ONCE APIs for IO queue Tx/Rx variable access to prevent compiler optimization and reordering. Additionally, ensure IO queue OUT/IN_CNT registers are flushed by performing a read-back after writing. The compiler could reorder reads/writes to pkts_pending, last_pkt_count, etc., causing stale values to be used when calculating packets to process or register updates to send to hardware. The Octeon hardware requires a read-back after writing to OUT_CNT/IN_CNT registers to ensure the write has been flushed through any posted write buffers before the interrupt resend bit is set. Without this, we have observed cases where the hardware didn't properly update its internal state. wmb/rmb only provides ordering guarantees but doesn't prevent the compiler from performing optimizations like caching in registers, load tearing etc. Fixes: `37d79d0596` ("octeon_ep: add Tx/Rx processing and interrupt support") Signed-off-by: Sathesh Edara <sedara@marvell.com> Signed-off-by: Shinas Rasheed <srasheed@marvell.com> Signed-off-by: Vimlesh Kumar <vimleshk@marvell.com> Link: https://patch.msgid.link/20260227091402.1773833-3-vimleshk@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 11:34:20 +01:00
Vimlesh Kumar	18c04a808c	octeon_ep: Relocate counter updates before NAPI Relocate IQ/OQ IN/OUT_CNTS updates to occur before NAPI completion, and replace napi_complete with napi_complete_done. Moving the IQ/OQ counter updates before napi_complete_done ensures 1. Counter registers are updated before re-enabling interrupts. 2. Prevents a race where new packets arrive but counters aren't properly synchronized. napi_complete_done (vs napi_complete) allows for better interrupt coalescing. Fixes: `37d79d0596` ("octeon_ep: add Tx/Rx processing and interrupt support") Signed-off-by: Sathesh Edara <sedara@marvell.com> Signed-off-by: Shinas Rasheed <srasheed@marvell.com> Signed-off-by: Vimlesh Kumar <vimleshk@marvell.com> Link: https://patch.msgid.link/20260227091402.1773833-2-vimleshk@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 11:34:20 +01:00
Paolo Abeni	210fd8f408	Merge branch 'bonding-fix-missing-xdp-compat-check-on-xmit_hash_policy-change' Jiayuan Chen says: ==================== bonding: fix missing XDP compat check on xmit_hash_policy change syzkaller reported a bug https://syzkaller.appspot.com/bug?extid=5a287bcdc08104bc3132 When a bond device is in 802.3ad or balance-xor mode, XDP is supported only when xmit_hash_policy != vlan+srcmac. This constraint is enforced in bond_option_mode_set() via bond_xdp_check(), which prevents switching to an XDP-incompatible mode while a program is loaded. However, the symmetric path -- changing xmit_hash_policy while XDP is loaded -- had no such guard in bond_option_xmit_hash_policy_set(). This means the following sequence silently creates an inconsistent state: 1. Create a bond in 802.3ad mode with xmit_hash_policy=layer2+3. 2. Attach a native XDP program to the bond. 3. Change xmit_hash_policy to vlan+srcmac (no error, not checked). Now bond->xdp_prog is set but bond_xdp_check() returns false for the same device. When the bond is later torn down (e.g. 
netns deletion), dev_xdp_uninstall() calls bond_xdp_set(dev, NULL) to remove the program, which hits the bond_xdp_check() guard and returns -EOPNOTSUPP, triggering a kernel WARNING: bond1 (unregistering): Error: No native XDP support for the current bonding mode ------------[ cut here ]------------ dev_xdp_install(dev, mode, bpf_op, NULL, 0, NULL) WARNING: net/core/dev.c:10361 at dev_xdp_uninstall net/core/dev.c:10361 [inline], CPU#0: kworker/u8:22/11031 Modules linked in: CPU: 0 UID: 0 PID: 11031 Comm: kworker/u8:22 Not tainted syzkaller #0 PREEMPT(full) Workqueue: netns cleanup_net RIP: 0010:dev_xdp_uninstall net/core/dev.c:10361 [inline] RIP: 0010:unregister_netdevice_many_notify+0x1efd/0x2370 net/core/dev.c:12393 RSP: 0018:ffffc90003b2f7c0 EFLAGS: 00010293 RAX: ffffffff8971e99c RBX: ffff888052f84c40 RCX: ffff88807896bc80 RDX: 0000000000000000 RSI: 00000000ffffffa1 RDI: 0000000000000000 RBP: ffffc90003b2f930 R08: ffffc90003b2f207 R09: 1ffff92000765e40 R10: dffffc0000000000 R11: fffff52000765e41 R12: 00000000ffffffa1 R13: ffff888052f84c38 R14: 1ffff1100a5f0988 R15: ffffc9000df67000 FS: 0000000000000000(0000) GS:ffff8881254ae000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f60871d5d58 CR3: 000000006c41c000 CR4: 00000000003526f0 Call Trace: <TASK> ops_exit_rtnl_list net/core/net_namespace.c:187 [inline] ops_undo_list+0x3d3/0x940 net/core/net_namespace.c:248 cleanup_net+0x56b/0x800 net/core/net_namespace.c:704 process_one_work kernel/workqueue.c:3275 [inline] process_scheduled_works+0xaec/0x17a0 kernel/workqueue.c:3358 worker_thread+0xa50/0xfc0 kernel/workqueue.c:3439 kthread+0x388/0x470 kernel/kthread.c:467 ret_from_fork+0x51e/0xb90 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> Beyond the WARNING itself, when dev_xdp_install() fails during dev_xdp_uninstall(), bond_xdp_set() returns early without calling bpf_prog_put() on the old program. 
dev_xdp_uninstall() then releases only the reference held by dev->xdp_state[], while the reference held by bond->xdp_prog is never dropped, leaking the struct bpf_prog. The fix refactors the core logic of bond_xdp_check() into a new helper __bond_xdp_check_mode(mode, xmit_policy) that takes both parameters explicitly, avoiding the need to read them from the bond struct. bond_xdp_check() becomes a thin wrapper around it. bond_option_xmit_hash_policy_set() then uses __bond_xdp_check_mode() directly, passing the candidate xmit_policy before it is committed, mirroring exactly what bond_option_mode_set() already does for mode changes. Patch 1 adds the kernel fix. Patch 2 adds a selftest that reproduces the WARNING by attaching native XDP to a bond in 802.3ad mode, then attempting to change xmit_hash_policy to vlan+srcmac -- verifying the change is rejected with the fix applied. ==================== Link: https://patch.msgid.link/20260226080306.98766-1-jiayuan.chen@linux.dev Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 10:47:44 +01:00
Jiayuan Chen	181cafbd8a	selftests/bpf: add test for xdp_bonding xmit_hash_policy compat Add a selftest to verify that changing xmit_hash_policy to vlan+srcmac is rejected when a native XDP program is loaded on a bond in 802.3ad mode. Without the fix in bond_option_xmit_hash_policy_set(), the change succeeds silently, creating an inconsistent state that triggers a kernel WARNING in dev_xdp_uninstall() when the bond is torn down. The test attaches native XDP to a bond0 (802.3ad, layer2+3), then attempts to switch xmit_hash_policy to vlan+srcmac and asserts the operation fails. It also verifies the change succeeds after XDP is detached, confirming the rejection is specific to the XDP-loaded state. Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Link: https://patch.msgid.link/20260226080306.98766-3-jiayuan.chen@linux.dev Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 10:47:38 +01:00
Jiayuan Chen	479d589b40	bpf/bonding: reject vlan+srcmac xmit_hash_policy change when XDP is loaded bond_option_mode_set() already rejects mode changes that would make a loaded XDP program incompatible via bond_xdp_check(). However, bond_option_xmit_hash_policy_set() has no such guard. For 802.3ad and balance-xor modes, bond_xdp_check() returns false when xmit_hash_policy is vlan+srcmac, because the 802.1q payload is usually absent due to hardware offload. This means a user can: 1. Attach a native XDP program to a bond in 802.3ad/balance-xor mode with a compatible xmit_hash_policy (e.g. layer2+3). 2. Change xmit_hash_policy to vlan+srcmac while XDP remains loaded. This leaves bond->xdp_prog set but bond_xdp_check() now returning false for the same device. When the bond is later destroyed, dev_xdp_uninstall() calls bond_xdp_set(dev, NULL, NULL) to remove the program, which hits the bond_xdp_check() guard and returns -EOPNOTSUPP, triggering: WARN_ON(dev_xdp_install(dev, mode, bpf_op, NULL, 0, NULL)) Fix this by rejecting xmit_hash_policy changes to vlan+srcmac when an XDP program is loaded on a bond in 802.3ad or balance-xor mode. commit `39a0876d59` ("net, bonding: Disallow vlan+srcmac with XDP") introduced bond_xdp_check() which returns false for 802.3ad/balance-xor modes when xmit_hash_policy is vlan+srcmac. The check was wired into bond_xdp_set() to reject XDP attachment with an incompatible policy, but the symmetric path -- preventing xmit_hash_policy from being changed to an incompatible value after XDP is already loaded -- was left unguarded in bond_option_xmit_hash_policy_set(). Note: commit `094ee6017e` ("bonding: check xdp prog when set bond mode") later added a similar guard to bond_option_mode_set(), but bond_option_xmit_hash_policy_set() remained unprotected. 
Reported-by: syzbot+5a287bcdc08104bc3132@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6995aff6.050a0220.2eeac1.014e.GAE@google.com/T/ Fixes: `39a0876d59` ("net, bonding: Disallow vlan+srcmac with XDP") Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Link: https://patch.msgid.link/20260226080306.98766-2-jiayuan.chen@linux.dev Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 10:47:37 +01:00
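Note on 479d589b40: the shape of the fix is a pure check function that takes the mode and the candidate policy as explicit arguments, so a setter can validate a value before committing it. The sketch below models that in userspace. The enum names, `struct bond_sketch`, `xdp_check_mode()` and `set_xmit_policy()` are hypothetical; only the rule for 802.3ad and balance-xor comes from the commit text, and the handling of the other modes is an assumption made to keep the example complete.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum mode { MODE_ROUNDROBIN, MODE_ACTIVEBACKUP, MODE_XOR, MODE_8023AD, MODE_OTHER };
enum policy { POLICY_LAYER2, POLICY_LAYER23, POLICY_VLAN_SRCMAC };

struct bond_sketch {
	enum mode mode;
	enum policy xmit_policy;
	const void *xdp_prog;	/* non-NULL while a program is attached */
};

/* Takes both inputs explicitly instead of reading them from the bond. */
static bool xdp_check_mode(enum mode mode, enum policy policy)
{
	switch (mode) {
	case MODE_ROUNDROBIN:	/* assumption: XDP-capable regardless of policy */
	case MODE_ACTIVEBACKUP:
		return true;
	case MODE_XOR:
	case MODE_8023AD:
		return policy != POLICY_VLAN_SRCMAC;
	default:
		return false;	/* assumption: other modes lack native XDP */
	}
}

/* Validate the candidate policy before it is committed. */
static int set_xmit_policy(struct bond_sketch *bond, enum policy candidate)
{
	assert(bond != NULL);
	if (bond->xdp_prog && !xdp_check_mode(bond->mode, candidate))
		return -EOPNOTSUPP;
	bond->xmit_policy = candidate;
	return 0;
}

/* Try switching an 802.3ad bond from layer2+3 to vlan+srcmac. */
int policy_change_demo(bool xdp_loaded)
{
	static const int dummy_prog = 1;
	struct bond_sketch bond = {
		.mode = MODE_8023AD,
		.xmit_policy = POLICY_LAYER23,
		.xdp_prog = xdp_loaded ? &dummy_prog : NULL,
	};

	return set_xmit_policy(&bond, POLICY_VLAN_SRCMAC);
}
```

With a program attached the change is rejected and the stored policy is left untouched; without one it succeeds, which mirrors what the selftest in 181cafbd8a is described as asserting.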
Arthur Kiyanovski	1939d9816d	MAINTAINERS: ena: update AMAZON ETHERNET maintainers Remove Shay Agroskin and Saeed Bishara. Promote David Arinzon to maintainer. Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Link: https://patch.msgid.link/20260301191652.5916-1-akiyano@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 19:03:34 -08:00
Simon Baatz	3f10543c5b	selftests/net: packetdrill: restore tcp_rcv_big_endseq.pkt Commit `1cc93c48b5` ("selftests/net: packetdrill: remove tests for tcp_rcv_*big") removed the test for the reverted commit `1d2fbaad7c` ("tcp: stronger sk_rcvbuf checks") but also the one for commit `9ca48d616e` ("tcp: do not accept packets beyond window"). Restore the test with the necessary adaptation: expect a delayed ACK instead of an immediate one, since tcp_can_ingest() does not fail anymore for the last data packet. Signed-off-by: Simon Baatz <gmbnomis@gmail.com> Link: https://patch.msgid.link/20260301-tcp_rcv_big_endseq-v1-1-86ab7415ab58@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:47:46 -08:00
Mieczyslaw Nalewaj	7cbe98f7be	net: dsa: realtek: rtl8365mb: fix rtl8365mb_phy_ocp_write return value Function rtl8365mb_phy_ocp_write() always returns 0, even when an error occurs during register access. This patch fixes the return value to propagate the actual error code from regmap operations. Link: https://lore.kernel.org/netdev/a2dfde3c-d46f-434b-9d16-1e251e449068@yahoo.com/ Fixes: `2796728460` ("net: dsa: realtek: rtl8365mb: serialize indirect PHY register access") Signed-off-by: Mieczyslaw Nalewaj <namiltd@yahoo.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Reviewed-by: Linus Walleij <linusw@kernel.org> Link: https://patch.msgid.link/20260301-realtek_namiltd_fix1-v1-1-43a6bb707f9c@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:32:40 -08:00
Raju Rangoju	9439a661c2	amd-xgbe: fix MAC_TCR_SS register width for 2.5G and 10M speeds Extend the MAC_TCR_SS (Speed Select) register field width from 2 bits to 3 bits to properly support all speed settings. The MAC_TCR register's SS field encoding requires 3 bits to represent all supported speeds: - 0x00: 10Gbps (XGMII) - 0x02: 2.5Gbps (GMII) / 100Mbps - 0x03: 1Gbps / 10Mbps - 0x06: 2.5Gbps (XGMII) - P100a only With only 2 bits, values 0x04-0x07 cannot be represented, which breaks 2.5G XGMII mode on newer platforms and causes incorrect speed select values to be programmed. Fixes: `07445f3c7c` ("amd-xgbe: Add support for 10 Mbps speed") Co-developed-by: Guruvendra Punugupati <Guruvendra.Punugupati@amd.com> Signed-off-by: Guruvendra Punugupati <Guruvendra.Punugupati@amd.com> Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com> Link: https://patch.msgid.link/20260226170753.250312-1-Raju.Rangoju@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 14:22:34 -08:00
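Note on 9439a661c2: the effect of a too-narrow field width can be checked with plain mask arithmetic. With a 2-bit mask (0x3), the value 0x06 is truncated to 0x02, which the encoding list above assigns to a different speed. The sketch below uses hypothetical helpers (`set_field()`, `get_field()`, `roundtrip()`) and a bit position of 0 chosen only for illustration; it does not reproduce the driver's register macros or the real position of the SS field.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low 'width' bits set. */
static uint32_t field_mask(unsigned int width)
{
	assert(width > 0 && width < 32);
	return (1u << width) - 1u;
}

/* Write 'val' into the field at 'shift'; bits beyond 'width' are dropped. */
static uint32_t set_field(uint32_t reg, unsigned int shift,
			  unsigned int width, uint32_t val)
{
	uint32_t mask = field_mask(width) << shift;

	return (reg & ~mask) | ((val << shift) & mask);
}

static uint32_t get_field(uint32_t reg, unsigned int shift, unsigned int width)
{
	return (reg >> shift) & field_mask(width);
}

/* Write then read back a value through a field of the given width. */
uint32_t roundtrip(uint32_t val, unsigned int width)
{
	const unsigned int shift = 0;	/* hypothetical position */

	return get_field(set_field(0, shift, width, val), shift, width);
}
```

Here `roundtrip(0x06, 2)` yields 0x02 while `roundtrip(0x06, 3)` yields 0x06, so only the 3-bit field preserves the 2.5Gbps (XGMII) encoding.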
MD Danish Anwar	147792c395	net: ti: icssg-prueth: Fix ping failure after offload mode setup when link speed is not 1G When both eth interfaces with links up are added to a bridge or hsr interface, ping fails if the link speed is not 1Gbps (e.g., 100Mbps). The issue is seen because when switching to offload (bridge/hsr) mode, prueth_emac_restart() restarts the firmware and clears DRAM with memset_io(), setting all memory to 0. This includes PORT_LINK_SPEED_OFFSET which firmware reads for link speed. The value 0 corresponds to FW_LINK_SPEED_1G (0x00), so for 1Gbps links the default value is correct and ping works. For 100Mbps links, the firmware needs FW_LINK_SPEED_100M (0x01) but gets 0 instead, causing ping to fail. The function emac_adjust_link() is called to reconfigure, but it detects no state change (emac->link is still 1, speed/duplex match PHY) so new_state remains false and icssg_config_set_speed() is never called to correct the firmware speed value. The fix resets emac->link to 0 before calling emac_adjust_link() in prueth_emac_common_start(). This forces new_state=true, ensuring icssg_config_set_speed() is called to write the correct speed value to firmware memory. Fixes: `06feac1540` ("net: ti: icssg-prueth: Fix emac link speed handling") Signed-off-by: MD Danish Anwar <danishanwar@ti.com> Link: https://patch.msgid.link/20260226102356.2141871-1-danishanwar@ti.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 13:41:35 -08:00
Jiayuan Chen	101bacb303	atm: lec: fix null-ptr-deref in lec_arp_clear_vccs syzkaller reported a null-ptr-deref in lec_arp_clear_vccs(). This issue can be easily reproduced using the syzkaller reproducer. In the ATM LANE (LAN Emulation) module, the same atm_vcc can be shared by multiple lec_arp_table entries (e.g., via entry->vcc or entry->recv_vcc). When the underlying VCC is closed, lec_vcc_close() iterates over all ARP entries and calls lec_arp_clear_vccs() for each matched entry. For example, when lec_vcc_close() iterates through the hlists in priv->lec_arp_empty_ones or other ARP tables: 1. In the first iteration, for the first matched ARP entry sharing the VCC, lec_arp_clear_vccs() frees the associated vpriv (which is vcc->user_back) and sets vcc->user_back to NULL. 2. In the second iteration, for the next matched ARP entry sharing the same VCC, lec_arp_clear_vccs() is called again. It obtains a NULL vpriv from vcc->user_back (via LEC_VCC_PRIV(vcc)) and then attempts to dereference it via `vcc->pop = vpriv->old_pop`, leading to a null-ptr-deref crash. Fix this by adding a null check for vpriv before dereferencing it. If vpriv is already NULL, it means the VCC has been cleared by a previous call, so we can safely skip the cleanup and just clear the entry's vcc/recv_vcc pointers. The entire cleanup block (including vcc_release_async()) is placed inside the vpriv guard because a NULL vpriv indicates the VCC has already been fully released by a prior iteration — repeating the teardown would redundantly set flags and trigger callbacks on an already-closing socket. The Fixes tag points to the initial commit because the entry->vcc path has been vulnerable since the original code. The entry->recv_vcc path was later added by commit `8d9f73c0ad` ("atm: fix a memory leak of vcc->user_back") with the same pattern, and both paths are fixed here. 
Reported-by: syzbot+72e3ea390c305de0e259@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68c95a83.050a0220.3c6139.0e5c.GAE@google.com/T/ Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Suggested-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Link: https://patch.msgid.link/20260225123250.189289-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 09:33:26 -08:00
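Note on 101bacb303: the two-iteration scenario above can be modeled with two entries that point at the same connection object. The sketch below is a simplified userspace model with hypothetical types (`struct vcc`, `struct vcc_priv`, `struct arp_entry`) and a hypothetical `clear_vcc()`; it only illustrates why the private-data pointer must be checked before it is dereferenced, and omits the rest of the real teardown.

```c
#include <assert.h>
#include <stdlib.h>

struct vcc_priv { int old_pop; };
struct vcc { struct vcc_priv *user_back; int pop; };
struct arp_entry { struct vcc *vcc; };

/* Clear one entry's reference; safe when the shared vcc was already cleared. */
static void clear_vcc(struct arp_entry *entry)
{
	struct vcc *vcc = entry->vcc;
	struct vcc_priv *vpriv;

	if (!vcc)
		return;

	vpriv = vcc->user_back;
	if (vpriv) {			/* NULL on the second entry sharing the vcc */
		vcc->pop = vpriv->old_pop;
		free(vpriv);
		vcc->user_back = NULL;
	}
	entry->vcc = NULL;		/* always drop this entry's pointer */
}

/* Two entries share one vcc; both are cleared in turn. Returns 1 on success. */
int shared_vcc_demo(void)
{
	struct vcc vcc = { .user_back = NULL, .pop = 0 };
	struct arp_entry a = { &vcc }, b = { &vcc };

	vcc.user_back = malloc(sizeof(*vcc.user_back));
	assert(vcc.user_back != NULL);
	vcc.user_back->old_pop = 7;

	clear_vcc(&a);	/* frees vpriv, restores pop */
	clear_vcc(&b);	/* sees NULL vpriv, only clears its own pointer */

	return vcc.pop == 7 && !vcc.user_back && !a.vcc && !b.vcc;
}
```

Removing the `if (vpriv)` guard would make the second `clear_vcc()` call dereference NULL, which is the crash the commit describes.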
Guenter Roeck	74badb9c20	dpaa2-switch: Fix interrupt storm after receiving bad if_id in IRQ handler Commit `31a7a0bbeb` ("dpaa2-switch: add bounds check for if_id in IRQ handler") introduces a range check for if_id to avoid an out-of-bounds access. If an out-of-bounds if_id is detected, the interrupt status is not cleared. This may result in an interrupt storm. Clear the interrupt status after detecting an out-of-bounds if_id to avoid the problem. Found by an experimental AI code review agent at Google. Fixes: `31a7a0bbeb` ("dpaa2-switch: add bounds check for if_id in IRQ handler") Cc: Junrui Luo <moonafterrain@outlook.com> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com> Link: https://patch.msgid.link/20260227055812.1777915-1-linux@roeck-us.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 09:01:41 -08:00
Jakub Kicinski	0eb5965b29	Merge branch 'xsk-fixes-for-af_xdp-fragment-handling' Nikhil P. Rao says: ==================== xsk: Fixes for AF_XDP fragment handling This series fixes two issues in AF_XDP zero-copy fragment handling: Patch 1 fixes a buffer leak caused by incorrect list node handling after commit `b692bf9a75`. The list_node field is now reused for both the xskb pool list and the buffer free list. Using list_del() instead of list_del_init() causes list_empty() checks in xp_free() to fail, preventing buffers from being added to the free list. Patch 2 fixes partial packet delivery to userspace. In the zero-copy path, if the Rx queue fills up while enqueuing fragments, the remaining fragments are dropped, causing the application to receive incomplete packets. The fix ensures the Rx queue has sufficient space for all fragments before starting to enqueue them. [1] https://lore.kernel.org/oe-kbuild-all/202602051720.YfZO23pZ-lkp@intel.com/ [2] https://lore.kernel.org/oe-kbuild-all/202602172046.vf9DtpdF-lkp@intel.com/ ==================== Link: https://patch.msgid.link/20260225000456.107806-1-nikhil.rao@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 08:55:15 -08:00
Nikhil P. Rao	f7387d6579	xsk: Fix zero-copy AF_XDP fragment drop AF_XDP should ensure that only a complete packet is sent to application. In the zero-copy case, if the Rx queue gets full as fragments are being enqueued, the remaining fragments are dropped. For the multi-buffer case, add a check to ensure that the Rx queue has enough space for all fragments of a packet before starting to enqueue them. Fixes: `24ea50127e` ("xsk: support mbuf on ZC RX") Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com> Link: https://patch.msgid.link/20260225000456.107806-3-nikhil.rao@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 08:55:11 -08:00
Nikhil P. Rao	60abb0ac11	xsk: Fix fragment node deletion to prevent buffer leak After commit `b692bf9a75` ("xsk: Get rid of xdp_buff_xsk::xskb_list_node"), the list_node field is reused for both the xskb pool list and the buffer free list, this causes a buffer leak as described below. xp_free() checks if a buffer is already on the free list using list_empty(&xskb->list_node). When list_del() is used to remove a node from the xskb pool list, it doesn't reinitialize the node pointers. This means list_empty() will return false even after the node has been removed, causing xp_free() to incorrectly skip adding the buffer to the free list. Fix this by using list_del_init() instead of list_del() in all fragment handling paths, this ensures the list node is reinitialized after removal, allowing the list_empty() to work correctly. Fixes: `b692bf9a75` ("xsk: Get rid of xdp_buff_xsk::xskb_list_node") Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com> Link: https://patch.msgid.link/20260225000456.107806-2-nikhil.rao@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 08:55:11 -08:00
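The list_del() versus list_del_init() distinction described in the commit above can be illustrated outside the kernel. The following is a minimal user-space sketch, not the kernel's list implementation: the list type is re-declared locally, and the helper names (list_del_plain, list_del_reinit, node_looks_free_after) are hypothetical. It shows only why a list_empty()-style check on a removed node depends on whether the node was reinitialized.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for the kernel's circular doubly linked list node. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Same idea as list_empty(): a head (or node) that points at itself. */
static bool list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_tail_node(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Models list_del(): unlink the node but leave its own pointers stale.
 * (The kernel writes poison values instead; either way the node no
 * longer points at itself.) */
static void list_del_plain(struct list_head *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
}

/* Models list_del_init(): unlink, then make the node self-referential. */
static void list_del_reinit(struct list_head *node)
{
	list_del_plain(node);
	init_list_head(node);
}

/* Hypothetical helper: put one node on a pool list, remove it, and
 * report what an "is this node on no list?" check would conclude. */
static bool node_looks_free_after(bool use_del_init)
{
	struct list_head pool, node;

	init_list_head(&pool);
	init_list_head(&node);
	assert(list_is_empty(&node));	/* fresh node is on no list */

	list_add_tail_node(&node, &pool);
	if (use_del_init)
		list_del_reinit(&node);
	else
		list_del_plain(&node);

	return list_is_empty(&node);
}
```

With the plain removal the node still points into the old list, so the emptiness check reports that it is still queued somewhere. That is the condition the commit message says makes xp_free() skip the buffer.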
Jakub Kicinski	6df0022b6c	Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2026-02-19 (idpf, ice, i40e, ixgbevf, e1000e) For idpf: Li Li moves the check for software marker to occur after incrementing next to clean to avoid re-encountering the same packet. He also adds a couple of checks to prevent NULL pointer dereferences and NULLs rss_key, after free, in error path so that later checks are properly evaluated. Brian Vazquez adjusts IRQ naming to have correlation with netdev naming. Sreedevi removes validation of action type as part of ntuple rule deletion. For ice: Aaron Ma breaks RDMA initialization into two steps and adjusts calls so that VSIs are entirely configured before plugging. Michal Schmidt fixes initialization of loopback VSI to have proper resources allocated to allow for loopback testing to occur. For i40e: Thomas Gleixner fixes a leak of preempt count by replacing get_cpu() with smp_processor_id(). For ixgbevf: Jedrzej adds a check for mailbox version before attempting to call an associated link state call that is supported in that mailbox version. For e1000e: Vitaly clears power gating feature for Panther Lake systems to avoid packet issues. 
* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: e1000e: clear DPG_EN after reset to avoid autonomous power-gating e1000e: introduce new board type for Panther Lake PCH ixgbevf: fix link setup issue i40e: Fix preempt count leak in napi poll tracepoint ice: fix crash in ethtool offline loopback test ice: recap the VSI and QoS info after rebuild idpf: Fix flow rule delete failure due to invalid validation idpf: change IRQ naming to match netdev and ethtool queue numbering idpf: nullify pointers after they are freed idpf: skip deallocating txq group's txqs if it is NULL idpf: skip deallocating bufq_sets from rx_qgrp if it is NULL idpf: increment completion queue next_to_clean in sw marker wait routine ==================== Link: https://patch.msgid.link/20260225211546.1949260-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 08:43:56 -08:00
Jakub Kicinski	1cc93c48b5	selftests/net: packetdrill: remove tests for tcp_rcv_*big Since commit `1d2fbaad7c` ("tcp: stronger sk_rcvbuf checks") has been reverted we need to remove the corresponding tests. Link: https://lore.kernel.org/20260227003359.2391017-1-kuba@kernel.org Link: https://patch.msgid.link/20260227033446.2596457-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 07:55:52 -08:00
Jakub Kicinski	026dfef287	tcp: give up on stronger sk_rcvbuf checks (for now) We hit another corner case which leads to TcpExtTCPRcvQDrop Connections which send RPCs in the 20-80kB range over loopback experience spurious drops. The exact conditions for most of the drops I investigated are that: - socket exchanged >1MB of data so its not completely fresh - rcvbuf is around 128kB (default, hasn't grown) - there is ~60kB of data in rcvq - skb > 64kB arrives The sum of skb->len (!) of both of the skbs (the one already in rcvq and the arriving one) is larger than rwnd. My suspicion is that this happens because __tcp_select_window() rounds the rwnd up to (1 << wscale) if less than half of the rwnd has been consumed. Eric suggests that given the number of Fixes we already have pointing to `1d2fbaad7c` it's probably time to give up on it, until a bigger revamp of rmem management. Also while we could risk tweaking the rwnd math, there are other drops on workloads I investigated, after the commit in question, not explained by this phenomenon. Suggested-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/20260225122355.585fd57b@kernel.org Fixes: `1d2fbaad7c` ("tcp: stronger sk_rcvbuf checks") Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260227003359.2391017-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 07:55:39 -08:00
Kuniyuki Iwashima	6996a2d2d0	udp: Unhash auto-bound connected sk from 4-tuple hash table when disconnected. Let's say we bind() a UDP socket to the wildcard address with a non-zero port, connect() it to an address, and disconnect it from the address. bind() sets SOCK_BINDPORT_LOCK on sk->sk_userlocks (but not SOCK_BINDADDR_LOCK), and connect() calls udp_lib_hash4() to put the socket into the 4-tuple hash table. Then, __udp_disconnect() calls sk->sk_prot->rehash(sk). It computes a new hash based on the wildcard address and moves the socket to a new slot in the 4-tuple hash table, leaving a garbage entry in the chain that no packet hits. Let's remove such a socket from the 4-tuple hash table when disconnected. Note that udp_sk(sk)->udp_portaddr_hash needs to be updated after udp_hash4_dec(hslot2) in udp_unhash4(). Fixes: `78c91ae2c6` ("ipv4/udp: Add 4-tuple hash for connected socket") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260227035547.3321327-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 07:46:24 -08:00
Long Li	dabffd0854	net: mana: Ring doorbell at 4 CQ wraparounds MANA hardware requires at least one doorbell ring every 8 wraparounds of the CQ. The driver rings the doorbell as a form of flow control to inform hardware that CQEs have been consumed. The NAPI poll functions mana_poll_tx_cq() and mana_poll_rx_cq() can poll up to CQE_POLLING_BUFFER (512) completions per call. If the CQ has fewer than 512 entries, a single poll call can process more than 4 wraparounds without ringing the doorbell. The doorbell threshold check also uses ">" instead of ">=", delaying the ring by one extra CQE beyond 4 wraparounds. Combined, these issues can cause the driver to exceed the 8-wraparound hardware limit, leading to missed completions and stalled queues. Fix this by capping the number of CQEs polled per call to 4 wraparounds of the CQ in both TX and RX paths. Also change the doorbell threshold from ">" to ">=" so the doorbell is rung as soon as 4 wraparounds are reached. Cc: stable@vger.kernel.org Fixes: `58a63729c9` ("net: mana: Fix doorbell out of order violation and avoid unnecessary doorbell rings") Signed-off-by: Long Li <longli@microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20260226192833.1050807-1-longli@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 19:29:38 -08:00
Valentin Spreckels	15fba71533	net: usb: r8152: add TRENDnet TUC-ET2G The TRENDnet TUC-ET2G is a RTL8156 based usb ethernet adapter. Add its vendor and product IDs. Signed-off-by: Valentin Spreckels <valentin@spreckels.dev> Link: https://patch.msgid.link/20260226195409.7891-2-valentin@spreckels.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 19:27:31 -08:00
Victor Nogueira	b14e82abf7	selftests/tc-testing: Create tests to exercise act_ct binding restrictions Add 4 test cases to exercise new act_ct binding restrictions: - Try to attach act_ct to an ets qdisc - Attach act_ct to an ingress qdisc - Attach act_ct to a clsact/egress qdisc - Attach act_ct to a shared block Signed-off-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260225134349.1287037-2-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 19:06:21 -08:00
Victor Nogueira	11cb63b0d1	net/sched: Only allow act_ct to bind to clsact/ingress qdiscs and shared blocks As Paolo said earlier [1]: "Since the blamed commit below, classify can return TC_ACT_CONSUMED while the current skb being held by the defragmentation engine. As reported by GangMin Kim, if such packet is that may cause a UaF when the defrag engine later on tries to tuch again such packet." act_ct was never meant to be used in the egress path, however some users are attaching it to egress today [2]. Attempting to reach a middle ground, we noticed that, while most qdiscs are not handling TC_ACT_CONSUMED, clsact/ingress qdiscs are. With that in mind, we address the issue by only allowing act_ct to bind to clsact/ingress qdiscs and shared blocks. That way it's still possible to attach act_ct to egress (albeit only with clsact). [1] https://lore.kernel.org/netdev/674b8cbfc385c6f37fb29a1de08d8fe5c2b0fbee.1771321118.git.pabeni@redhat.com/ [2] https://lore.kernel.org/netdev/cc6bfb4a-4a2b-42d8-b9ce-7ef6644fb22b@ovn.org/ Reported-by: GangMin Kim <km.kim1503@gmail.com> Fixes: `3f14b377d0` ("net/sched: act_ct: fix skb leak and crash on ooo frags") CC: stable@vger.kernel.org Signed-off-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260225134349.1287037-1-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 19:06:21 -08:00
Florian Westphal	ba14798653	selftests: netfilter: nft_queue.sh: avoid flakes on debug kernels Jakub reports test flakes on debug kernels: FAIL: test_udp_gro_ct: Expected software segmentation to occur, had 23 and 17 This test assumes that the kernel's nfnetlink_queue module sees N GSO packets, segments them into M skbs and queues them to userspace for reinjection. Hence, if M >= N, no segmentation occurred. However, it's possible that this happens: - nfnetlink_queue gets GSO packet - segments that into n skbs - userspace buffer is full, kernel drops the segmented skbs -> "toqueue" counter incremented by 1, "fromqueue" is unchanged. If this happens often enough in a single run, M >= N check triggers incorrectly. To solve this, allow the nf_queue.c test program to set the FAIL_OPEN flag so that the segmented skbs bypass the queueing step in the kernel if the receive buffer is full. Also, reduce number of sending socat instances, decrease their priority and increase nice value for the nf_queue program itself to reduce the probability of overruns happening in the first place. Fixes: `59ecffa399` ("selftests: netfilter: nft_queue.sh: add udp fraglist gro test case") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20260218184114.0b405b72@kernel.org/ Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260226161920.1205-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 18:36:59 -08:00
Jakub Kicinski	71347b9d8c	Merge branch 'net-sched-sch_cake-fixes-for-cake_mq' Jonas Köppeler says: ==================== net/sched: sch_cake: fixes for cake_mq This patch contains two fixes for cake_mq: - do not sync when bandwidth is unlimited - adjust the rates for all tins during sync ==================== Link: https://patch.msgid.link/20260226-cake-mq-skip-sync-bandwidth-unlimited-v1-0-01830bb4db87@tu-berlin.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 18:35:42 -08:00
Jonas Köppeler	15c2715a52	net/sched: sch_cake: fixup cake_mq rate adjustment for diffserv config cake_mq's rate adjustment during the sync periods did not adjust the rates for every tin in a diffserv config. This led to inconsistencies of rates between the tins. Fix this by setting the rates for all tins during synchronization. Fixes: `1bddd758ba` ("net/sched: sch_cake: share shaper state across sub-instances of cake_mq") Signed-off-by: Jonas Köppeler <j.koeppeler@tu-berlin.de> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Link: https://patch.msgid.link/20260226-cake-mq-skip-sync-bandwidth-unlimited-v1-2-01830bb4db87@tu-berlin.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 18:35:40 -08:00
Jonas Köppeler	0b3cd139be	net/sched: sch_cake: avoid sync overhead when unlimited Skip inter-instance sync when no rate limit is configured, as it serves no purpose and only adds overhead. Fixes: `1bddd758ba` ("net/sched: sch_cake: share shaper state across sub-instances of cake_mq") Signed-off-by: Jonas Köppeler <j.koeppeler@tu-berlin.de> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Link: https://patch.msgid.link/20260226-cake-mq-skip-sync-bandwidth-unlimited-v1-1-01830bb4db87@tu-berlin.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 18:35:40 -08:00
Eric Dumazet	29252397bc	inet: annotate data-races around isk->inet_num UDP/TCP lookups are using RCU, thus isk->inet_num accesses should use READ_ONCE() and WRITE_ONCE() where needed. Fixes: `3ab5aee7fe` ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260225203545.1512417-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 17:16:59 -08:00
Paul Moses	62413a9c3c	net/sched: act_gate: snapshot parameters with RCU on replace The gate action can be replaced while the hrtimer callback or dump path is walking the schedule list. Convert the parameters to an RCU-protected snapshot and swap updates under tcf_lock, freeing the previous snapshot via call_rcu(). When REPLACE omits the entry list, preserve the existing schedule so the effective state is unchanged. Fixes: `a51c328df3` ("net: qos: introduce a gate control flow action") Cc: stable@vger.kernel.org Signed-off-by: Paul Moses <p@1g4.org> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260223150512.2251594-2-p@1g4.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 16:10:36 -08:00
Chintan Vankar	be11a53722	net: ethernet: ti: am65-cpsw-nuss/cpsw-ale: Fix multicast entry handling in ALE table In the current implementation, flushing multicast entries in MAC mode incorrectly deletes entries for all ports instead of only the target port, disrupting multicast traffic on other ports. The cause is adding multicast entries by setting only host port bit, and not setting the MAC port bits. Fix this by setting the MAC port's bit in the port mask while adding the multicast entry. Also fix the flush logic to preserve the host port bit during removal of MAC port and free ALE entries when mask contains only host port. Fixes: `5c50a856d5` ("drivers: net: ethernet: cpsw: add multicast address to ALE table") Signed-off-by: Chintan Vankar <c-vankar@ti.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260224181359.2055322-1-c-vankar@ti.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:43:54 -08:00
Jakub Kicinski	7e5b450c49	Merge branch 'bridge-check-relevant-options-in-vlan-range-grouping' Danielle Ratson says: ==================== bridge: Check relevant options in VLAN range grouping The br_vlan_opts_eq_range() function determines if consecutive VLANs can be grouped together in a range for compact netlink notifications. It currently checks state, tunnel info, and multicast router configuration, but misses two categories of per-VLAN options that affect the output: 1. User-visible priv_flags (neigh_suppress, mcast_enabled) 2. Port multicast context options (mcast_max_groups, mcast_n_groups) When VLANs have different settings for these options, they are incorrectly grouped into ranges, causing netlink notifications to report only one VLAN's settings for the entire range. Fix by checking priv_flags equality, but only for flags that affect netlink output (BR_VLFLAG_NEIGH_SUPPRESS_ENABLED and BR_VLFLAG_MCAST_ENABLED), and comparing multicast context options (mcast_max_groups, mcast_n_groups). Add a test with four test cases for each option, to ensure that VLANs with different values are not grouped into ranges and VLANs with matching values are properly grouped together. ==================== Link: https://patch.msgid.link/20260225143956.3995415-1-danieller@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:24:33 -08:00
Danielle Ratson	13540021be	selftests: net: Add bridge VLAN range grouping tests Add a new test file bridge_vlan_dump.sh with four test cases that verify VLANs with different per-VLAN options are not incorrectly grouped into ranges in the dump output. The tests verify the kernel's br_vlan_opts_eq_range() function correctly prevents VLAN range grouping when neigh_suppress, mcast_max_groups, mcast_n_groups, or mcast_enabled options differ. Each test verifies that VLANs with different option values appear as individual entries rather than ranges, and that VLANs with matching values are properly grouped together. Example output: $ ./bridge_vlan_dump.sh TEST: VLAN range grouping with neigh_suppress [ OK ] TEST: VLAN range grouping with mcast_max_groups [ OK ] TEST: VLAN range grouping with mcast_n_groups [ OK ] TEST: VLAN range grouping with mcast_enabled [ OK ] Signed-off-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20260225143956.3995415-3-danieller@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:24:29 -08:00
Danielle Ratson	93c9475c04	bridge: Check relevant per-VLAN options in VLAN range grouping The br_vlan_opts_eq_range() function determines if consecutive VLANs can be grouped together in a range for compact netlink notifications. It currently checks state, tunnel info, and multicast router configuration, but misses two categories of per-VLAN options that affect the output: 1. User-visible priv_flags (neigh_suppress, mcast_enabled) 2. Port multicast context (mcast_max_groups, mcast_n_groups) When VLANs have different settings for these options, they are incorrectly grouped into ranges, causing netlink notifications to report only one VLAN's settings for the entire range. Fix by checking priv_flags equality, but only for flags that affect netlink output (BR_VLFLAG_NEIGH_SUPPRESS_ENABLED and BR_VLFLAG_MCAST_ENABLED), and comparing multicast context (mcast_max_groups and mcast_n_groups). Example showing the bugs before the fix: $ bridge vlan set vid 10 dev dummy1 neigh_suppress on $ bridge vlan set vid 11 dev dummy1 neigh_suppress off $ bridge -d vlan show dev dummy1 port vlan-id dummy1 10-11 ... neigh_suppress on $ bridge vlan set vid 10 dev dummy1 mcast_max_groups 100 $ bridge vlan set vid 11 dev dummy1 mcast_max_groups 200 $ bridge -d vlan show dev dummy1 port vlan-id dummy1 10-11 ... mcast_max_groups 100 After the fix, VLANs 10 and 11 are shown as separate entries with their correct individual settings. Fixes: `a1aee20d5d` ("net: bridge: Add netlink knobs for number / maximum MDB entries") Fixes: `83f6d60079` ("bridge: vlan: Allow setting VLAN neighbor suppression state") Signed-off-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260225143956.3995415-2-danieller@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:24:29 -08:00
Eric Dumazet	2ef2b20cf4	net: annotate data-races around sk->sk_{data_ready,write_space} skmsg (and probably other layers) are changing these pointers while other cpus might read them concurrently. Add corresponding READ_ONCE()/WRITE_ONCE() annotations for UDP, TCP and AF_UNIX. Fixes: `604326b41a` ("bpf, sockmap: convert to generic sk_msg interface") Reported-by: syzbot+87f770387a9e5dc6b79b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/699ee9fc.050a0220.1cd54b.0009.GAE@google.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260225131547.1085509-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:23:03 -08:00
Jakub Kicinski	754a3d081a	Merge tag 'batadv-net-pullrequest-20260225' of https://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== Here is a batman-adv bugfix: - Avoid double-rtnl_lock ELP metric worker, by Sven Eckelmann * tag 'batadv-net-pullrequest-20260225' of https://git.open-mesh.org/linux-merge: batman-adv: Avoid double-rtnl_lock ELP metric worker ==================== Link: https://patch.msgid.link/20260225084614.229077-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:15:09 -08:00
Davide Caratti	e35626f610	net/sched: ets: fix divide by zero in the offload path Offloading ETS requires computing each class' WRR weight: this is done by averaging over the sums of quanta as 'q_sum' and 'q_psum'. Using unsigned int, the same integer size as the individual DRR quanta, can overflow and even cause division by zero, like it happened in the following splat: Oops: divide error: 0000 [#1] SMP PTI CPU: 13 UID: 0 PID: 487 Comm: tc Tainted: G E 6.19.0-virtme #45 PREEMPT(full) Tainted: [E]=UNSIGNED_MODULE Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 RIP: 0010:ets_offload_change+0x11f/0x290 [sch_ets] Code: e4 45 31 ff eb 03 41 89 c7 41 89 cb 89 ce 83 f9 0f 0f 87 b7 00 00 00 45 8b 08 31 c0 45 01 cc 45 85 c9 74 09 41 6b c4 64 31 d2 <41> f7 f2 89 c2 44 29 fa 45 89 df 41 83 fb 0f 0f 87 c7 00 00 00 44 RSP: 0018:ffffd0a180d77588 EFLAGS: 00010246 RAX: 00000000ffffff38 RBX: ffff8d3d482ca000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffd0a180d77660 RBP: ffffd0a180d77690 R08: ffff8d3d482ca2d8 R09: 00000000fffffffe R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffe R13: ffff8d3d472f2000 R14: 0000000000000003 R15: 0000000000000000 FS: 00007f440b6c2740(0000) GS:ffff8d3dc9803000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000003cdd2000 CR3: 0000000007b58002 CR4: 0000000000172ef0 Call Trace: <TASK> ets_qdisc_change+0x870/0xf40 [sch_ets] qdisc_create+0x12b/0x540 tc_modify_qdisc+0x6d7/0xbd0 rtnetlink_rcv_msg+0x168/0x6b0 netlink_rcv_skb+0x5c/0x110 netlink_unicast+0x1d6/0x2b0 netlink_sendmsg+0x22e/0x470 ____sys_sendmsg+0x38a/0x3c0 ___sys_sendmsg+0x99/0xe0 __sys_sendmsg+0x8a/0xf0 do_syscall_64+0x111/0xf80 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f440b81c77e Code: 4d 89 d8 e8 d4 bc 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 13 ff ff ff 0f 1f 00 f3 0f 1e fa RSP: 
002b:00007fff951e4c10 EFLAGS: 00000202 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000481820 RCX: 00007f440b81c77e RDX: 0000000000000000 RSI: 00007fff951e4cd0 RDI: 0000000000000003 RBP: 00007fff951e4c20 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff951f4fa8 R13: 00000000699ddede R14: 00007f440bb01000 R15: 0000000000486980 </TASK> Modules linked in: sch_ets(E) netdevsim(E) ---[ end trace 0000000000000000 ]--- RIP: 0010:ets_offload_change+0x11f/0x290 [sch_ets] Code: e4 45 31 ff eb 03 41 89 c7 41 89 cb 89 ce 83 f9 0f 0f 87 b7 00 00 00 45 8b 08 31 c0 45 01 cc 45 85 c9 74 09 41 6b c4 64 31 d2 <41> f7 f2 89 c2 44 29 fa 45 89 df 41 83 fb 0f 0f 87 c7 00 00 00 44 RSP: 0018:ffffd0a180d77588 EFLAGS: 00010246 RAX: 00000000ffffff38 RBX: ffff8d3d482ca000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffd0a180d77660 RBP: ffffd0a180d77690 R08: ffff8d3d482ca2d8 R09: 00000000fffffffe R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffe R13: ffff8d3d472f2000 R14: 0000000000000003 R15: 0000000000000000 FS: 00007f440b6c2740(0000) GS:ffff8d3dc9803000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000003cdd2000 CR3: 0000000007b58002 CR4: 0000000000172ef0 Kernel panic - not syncing: Fatal exception Kernel Offset: 0x30000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception ]--- Fix this using 64-bit integers for 'q_sum' and 'q_psum'. Cc: stable@vger.kernel.org Fixes: `d35eb52bd2` ("net: sch_ets: Make the ETS qdisc offloadable") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/28504887df314588c7255e9911769c36f751edee.1771964872.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 18:28:47 -08:00
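The overflow described in the ets commit above comes down to unsigned wraparound of a running sum. The sketch below is a simplified illustration under stated assumptions, not the code in ets_offload_change(): the function names are hypothetical, and only the accumulation step is modelled. It shows that two 32-bit quanta can sum to exactly zero in a 32-bit accumulator, which would then be unsafe to use as a divisor, while a 64-bit accumulator keeps the true total.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: accumulate quanta in a 32-bit sum, as an
 * 'unsigned int' accumulator would. Unsigned arithmetic wraps modulo
 * 2^32, so large quanta can bring the total back to 0. */
static uint32_t sum_quanta_u32(const uint32_t *quanta, size_t n)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += quanta[i];
	return sum;
}

/* Same accumulation with a 64-bit sum: a handful of 32-bit quanta
 * cannot wrap it, so the total is nonzero whenever any quantum is. */
static uint64_t sum_quanta_u64(const uint32_t *quanta, size_t n)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += quanta[i];
	return sum;
}
```

For example, the two quanta 0x80000000 and 0x80000000 give a 32-bit sum of 0 and a 64-bit sum of 0x100000000.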
Linus Torvalds	b9c8fc2cae	Merge tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from IPsec, Bluetooth and netfilter Current release - regressions: - wifi: fix dev_alloc_name() return value check - rds: fix recursive lock in rds_tcp_conn_slots_available Current release - new code bugs: - vsock: lock down child_ns_mode as write-once Previous releases - regressions: - core: - do not pass flow_id to set_rps_cpu() - consume xmit errors of GSO frames - netconsole: avoid OOB reads, msg is not nul-terminated - netfilter: h323: fix OOB read in decode_choice() - tcp: re-enable acceptance of FIN packets when RWIN is 0 - udplite: fix null-ptr-deref in __udp_enqueue_schedule_skb(). - wifi: brcmfmac: fix potential kernel oops when probe fails - phy: register phy led_triggers during probe to avoid AB-BA deadlock - eth: - bnxt_en: fix deleting of Ntuple filters - wan: farsync: fix use-after-free bugs caused by unfinished tasklets - xscale: check for PTP support properly Previous releases - always broken: - tcp: fix potential race in tcp_v6_syn_recv_sock() - kcm: fix zero-frag skb in frag_list on partial sendmsg error - xfrm: - fix race condition in espintcp_close() - always flush state and policy upon NETDEV_UNREGISTER event - bluetooth: - purge error queues in socket destructors - fix response to L2CAP_ECRED_CONN_REQ - eth: - mlx5: - fix circular locking dependency in dump - fix "scheduling while atomic" in IPsec MAC address query - gve: fix incorrect buffer cleanup for QPL - team: avoid NETDEV_CHANGEMTU event when unregistering slave - usb: validate USB endpoints" * tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits) netfilter: nf_conntrack_h323: fix OOB read in decode_choice() dpaa2-switch: validate num_ifs to prevent out-of-bounds write net: consume xmit errors of GSO frames vsock: document write-once behavior of the child_ns_mode sysctl vsock: lock 
down child_ns_mode as write-once selftests/vsock: change tests to respect write-once child ns mode net/mlx5e: Fix "scheduling while atomic" in IPsec MAC address query net/mlx5: Fix missing devlink lock in SRIOV enable error path net/mlx5: E-switch, Clear legacy flag when moving to switchdev net/mlx5: LAG, disable MPESW in lag_disable_change() net/mlx5: DR, Fix circular locking dependency in dump selftests: team: Add a reference count leak test team: avoid NETDEV_CHANGEMTU event when unregistering slave net: mana: Fix double destroy_workqueue on service rescan PCI path MAINTAINERS: Update maintainer entry for QUALCOMM ETHQOS ETHERNET DRIVER dpll: zl3073x: Remove redundant cleanup in devm_dpll_init() selftests/net: packetdrill: Verify acceptance of FIN packets when RWIN is 0 tcp: re-enable acceptance of FIN packets when RWIN is 0 vsock: Use container_of() to get net namespace in sysctl handlers net: usb: kaweth: validate USB endpoints ...	2026-02-26 08:00:13 -08:00
Vahagn Vardanian	baed0d9ba9	netfilter: nf_conntrack_h323: fix OOB read in decode_choice() In decode_choice(), the boundary check before get_len() uses the variable `len`, which is still 0 from its initialization at the top of the function: unsigned int type, ext, len = 0; ... if (ext || (son->attr & OPEN)) { BYTE_ALIGN(bs); if (nf_h323_error_boundary(bs, len, 0)) /* len is 0 here */ return H323_ERROR_BOUND; len = get_len(bs); /* OOB read */ When the bitstream is exactly consumed (bs->cur == bs->end), the check nf_h323_error_boundary(bs, 0, 0) evaluates to (bs->cur + 0 > bs->end), which is false. The subsequent get_len() call then dereferences bs->cur++, reading 1 byte past the end of the buffer. If that byte has bit 7 set, get_len() reads a second byte as well. This can be triggered remotely by sending a crafted Q.931 SETUP message with a User-User Information Element containing exactly 2 bytes of PER-encoded data ({0x08, 0x00}) to port 1720 through a firewall with the nf_conntrack_h323 helper active. The decoder fully consumes the PER buffer before reaching this code path, resulting in a 1-2 byte heap-buffer-overflow read confirmed by AddressSanitizer. Fix this by checking for 2 bytes (the maximum that get_len() may read) instead of the uninitialized `len`. This matches the pattern used at every other get_len() call site in the same file, where the caller checks for 2 bytes of available data before calling get_len(). Fixes: `ec8a8f3c31` ("netfilter: nf_ct_h323: Extend nf_h323_error_boundary to work on bits as well") Signed-off-by: Vahagn Vardanian <vahagn@redrays.io> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260225130619.1248-2-fw@strlen.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 12:50:42 +01:00
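The boundary condition in the h323 commit above can be checked with a small stand-alone model. This is a sketch under assumptions, not the kernel's nf_h323_error_boundary(): the struct and the function name are hypothetical, and only the byte-level part of the check is modelled. It shows that asking "are 0 more bytes available?" always passes on a fully consumed buffer, while asking for the 2 bytes that the length decoder may read does not.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the decoder's cursor; not the kernel struct. */
struct bitstr_sketch {
	const unsigned char *cur;	/* next byte to read */
	const unsigned char *end;	/* one past the last valid byte */
};

/* Hypothetical helper: true if reading 'bytes' more bytes would run
 * past the end. Equivalent to (cur + bytes > end) for cur <= end, but
 * written as a length comparison so that no out-of-range pointer is
 * ever formed. */
static bool would_overrun(const struct bitstr_sketch *bs, size_t bytes)
{
	return (size_t)(bs->end - bs->cur) < bytes;
}
```

With cur equal to end, would_overrun(&bs, 0) is false, so a following read is not blocked, whereas would_overrun(&bs, 2) is true and rejects the input before any byte is dereferenced.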
Junrui Luo	8a5752c6dc	dpaa2-switch: validate num_ifs to prevent out-of-bounds write The driver obtains sw_attr.num_ifs from firmware via dpsw_get_attributes() but never validates it against DPSW_MAX_IF (64). This value controls iteration in dpaa2_switch_fdb_get_flood_cfg(), which writes port indices into the fixed-size cfg->if_id[DPSW_MAX_IF] array. When firmware reports num_ifs >= 64, the loop can write past the array bounds. Add a bound check for num_ifs in dpaa2_switch_init(). dpaa2_switch_fdb_get_flood_cfg() appends the control interface (port num_ifs) after all matched ports. When num_ifs == DPSW_MAX_IF and all ports match the flood filter, the loop fills all 64 slots and the control interface write overflows by one entry. The check uses >= because num_ifs == DPSW_MAX_IF is also functionally broken. build_if_id_bitmap() silently drops any ID >= 64: if (id[i] < DPSW_MAX_IF) bmap[id[i] / 64] |= ... Fixes: `539dda3c5d` ("staging: dpaa2-switch: properly setup switching domains") Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com> Link: https://patch.msgid.link/SYBPR01MB78812B47B7F0470B617C408AAF74A@SYBPR01MB7881.ausprd01.prod.outlook.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 12:37:21 +01:00
Jakub Kicinski	7aa767d0d3	net: consume xmit errors of GSO frames udpgro_frglist.sh and udpgro_bench.sh are the flakiest tests currently in NIPA. They fail in the same exact way, TCP GRO test stalls occasionally and the test gets killed after 10min. These tests use veth to simulate GRO. They attach a trivial ("return XDP_PASS;") XDP program to the veth to force TSO off and NAPI on. Digging into the failure mode we can see that the connection is completely stuck after a burst of drops. The sender's snd_nxt is at sequence number N [1], but the receiver claims to have received (rcv_nxt) up to N + 3 * MSS [2]. Last piece of the puzzle is that senders rtx queue is not empty (let's say the block in the rtx queue is at sequence number N - 4 * MSS [3]). In this state, sender sends a retransmission from the rtx queue with a single segment, and sequence numbers N-4*MSS:N-3*MSS [3]. Receiver sees it and responds with an ACK all the way up to N + 3 * MSS [2]. But sender will reject this ack as TCP_ACK_UNSENT_DATA because it has no recollection of ever sending data that far out [1]. And we are stuck. The root cause is the mess of the xmit return codes. veth returns an error when it can't xmit a frame. We end up with a loss event like this:
  -------------------------------------------------
  |   GSO super frame 1   |   GSO super frame 2   |
  |-----------------------------------------------|
  | seg | seg | seg | seg | seg | seg | seg | seg |
  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |
  -------------------------------------------------
     x    ok    ok    <ok>|  ok    ok    ok   <x>
                          \
                           snd_nxt
"x" means packet lost by veth, and "ok" means it went thru. Since veth has TSO disabled in this test it sees individual segments. Segment 1 is on the retransmit queue and will be resent. So why did the sender not advance snd_nxt even tho it clearly did send up to seg 8? tcp_write_xmit() interprets the return code from the core to mean that data has not been sent at all. 
Since TCP deals with GSO super frames, not individual segment the crux of the problem is that loss of a single segment can be interpreted as loss of all. TCP only sees the last return code for the last segment of the GSO frame (in <> brackets in the diagram above). Of course for the problem to occur we need a setup or a device without a Qdisc. Otherwise Qdisc layer disconnects the protocol layer from the device errors completely. We have multiple ways to fix this.
 1) make veth not return an error when it lost a packet.
    While this is what I think we did in the past, the issue keeps reappearing and it's annoying to debug. The game of whack a mole is not great.
 2) fix the damn return codes
    We only talk about NETDEV_TX_OK and NETDEV_TX_BUSY in the documentation, so maybe we should make the return code from ndo_start_xmit() a boolean. I like that the most, but perhaps some ancient, not-really-networking protocol would suffer.
 3) make TCP ignore the errors
    It is not entirely clear to me what benefit TCP gets from interpreting the result of ip_queue_xmit()? Specifically once the connection is established and we're pushing data - packet loss is just packet loss?
 4) this fix
    Ignore the rc in the Qdisc-less+GSO case, since it's unreliable. We already always return OK in the TCQ_F_CAN_BYPASS case. In the Qdisc-less case let's be a bit more conservative and only mask the GSO errors. This path is taken by non-IP-"networks" like CAN, MCTP etc, so we could regress some ancient thing. This is the simplest, but also maybe the hackiest fix?
Similar fix has been proposed by Eric in the past but never committed because original reporter was working with an OOT driver and wasn't providing feedback (see Link). 
Link: https://lore.kernel.org/CANn89iJcLepEin7EtBETrZ36bjoD9LrR=k4cfwWh046GB+4f9A@mail.gmail.com Fixes: `1f59533f9c` ("qdisc: validate frames going through the direct_xmit path") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260223235100.108939-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:35:00 +01:00
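The return-code problem described in the commit above can be modelled with a toy example. This is a hedged sketch rather than the actual change in the networking core: `enum xmit_rc`, `last_segment_rc()` and `reported_rc()` are hypothetical names, and the real code path has more cases than this model covers. It only shows why the last segment's result is a poor summary of a multi-segment frame, and what masking it for GSO frames looks like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-segment transmit results. */
enum xmit_rc { XMIT_OK = 0, XMIT_DROP = 1 };

/* Unmasked behaviour: the caller only sees the last segment's result. */
static enum xmit_rc last_segment_rc(const enum xmit_rc *seg_rc, size_t nsegs)
{
	assert(nsegs > 0);
	return seg_rc[nsegs - 1];
}

/*
 * Masked behaviour: for a multi-segment (GSO) frame the per-segment
 * result is unreliable, so report OK and let the protocol's own loss
 * recovery handle any dropped segment. Single frames keep their result.
 */
static enum xmit_rc reported_rc(const enum xmit_rc *seg_rc, size_t nsegs,
				bool is_gso)
{
	enum xmit_rc rc = last_segment_rc(seg_rc, nsegs);

	return is_gso ? XMIT_OK : rc;
}
```

Feeding in the two super frames from the diagram (`x ok ok ok` and `ok ok ok x`) shows the mismatch: the unmasked result is OK for the first frame although a segment was lost, and a drop for the second although three segments went out.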


1427062 Commits