linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-14 11:11:22 -04:00

Author	SHA1	Message	Date
Alexey Nepomnyashih	cca7b1cfd7	net: liquidio: fix overflow in octeon_init_instr_queue() The expression `(conf->instr_type == 64) << iq_no` can overflow because `iq_no` may be as high as 64 (`CN23XX_MAX_RINGS_PER_PF`). Casting the operand to `u64` ensures correct 64-bit arithmetic. Fixes: `f21fb3ed36` ("Add support of Cavium Liquidio ethernet adapters") Signed-off-by: Alexey Nepomnyashih <sdl@nppct.ru> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-18 07:47:17 -07:00
Eric Dumazet	87ebb628a5	net: clear sk->sk_ino in sk_set_socket(sk, NULL) Andrei Vagin reported that blamed commit broke CRIU. Indeed, while we want to keep sk_uid unchanged when a socket is cloned, we want to clear sk->sk_ino. Otherwise, sock_diag might report multiple sockets sharing the same inode number. Move the clearing part from sock_orphan() to sk_set_socket(sk, NULL), called both from sock_orphan() and sk_clone_lock(). Fixes: `5d6b58c932` ("net: lockless sock_i_ino()") Closes: https://lore.kernel.org/netdev/aMhX-VnXkYDpKd9V@google.com/ Closes: https://github.com/checkpoint-restore/criu/issues/2744 Reported-by: Andrei Vagin <avagin@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Andrei Vagin <avagin@google.com> Link: https://patch.msgid.link/20250917135337.1736101-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-18 07:47:17 -07:00
Tariq Toukan	3fbfe251cc	Revert "net/mlx5e: Update and set Xon/Xoff upon port speed set" This reverts commit `d24341740f`. It causes errors when trying to configure QoS, as well as loss of L2 connectivity (on multi-host devices). Reported-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/20250910170011.70528106@kernel.org Fixes: `d24341740f` ("net/mlx5e: Update and set Xon/Xoff upon port speed set") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-18 07:47:17 -07:00
Jakub Kicinski	4c05c7ed88	selftests: tls: test skb copy under mem pressure and OOB Add a test which triggers mem pressure via OOB writes. Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250917002814.1743558-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 12:43:54 +02:00
Jakub Kicinski	0aeb54ac4c	tls: make sure to abort the stream if headers are bogus Normally we wait for the socket to buffer up the whole record before we service it. If the socket has a tiny buffer, however, we read out the data sooner, to prevent connection stalls. Make sure that we abort the connection when we find out late that the record is actually invalid. Retrying the parsing is fine in itself but since we copy some more data each time before we parse we can overflow the allocated skb space. Constructing a scenario in which we're under pressure without enough data in the socket to parse the length upfront is quite hard. syzbot figured out a way to do this by serving us the header in small OOB sends, and then filling in the recvbuf with a large normal send. Make sure that tls_rx_msg_size() aborts strp, if we reach an invalid record there's really no way to recover. Reported-by: Lee Jones <lee@kernel.org> Fixes: `84c61fe1a7` ("tls: rx: do not use the standard strparser") Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250917002814.1743558-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 12:43:54 +02:00
Jakub Kicinski	0984710897	Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2025-09-16 (ice, i40e, ixgbe, igc) For ice: Jake resolves leaking pages with multi-buffer frames when a 0-sized descriptor is encountered. For i40e: Maciej removes a redundant, and incorrect, memory barrier. For ixgbe: Jedrzej adjusts lifespan of ACI lock to ensure uses are while it is valid. For igc: Kohei Enju does not fail probe on LED setup failure which resolves a kernel panic in the cleanup path, if we were to fail. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: igc: don't fail igc_probe() on LED setup error ixgbe: destroy aci.lock later within ixgbe_remove path ixgbe: initialize aci.lock before it's used i40e: remove redundant memory barrier when cleaning Tx descs ice: fix Rx page leak on multi-buffer frames ==================== Link: https://patch.msgid.link/20250916212801.2818440-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 16:14:25 -07:00
Jakub Kicinski	934da21f99	Merge tag 'wireless-2025-09-17' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Just two fixes: - fix crash in rfkill due to uninitialized type_name - fix aggregation in iwlwifi 7000/8000 devices * tag 'wireless-2025-09-17' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: net: rfkill: gpio: Fix crash due to dereferencering uninitialized pointer wifi: iwlwifi: pcie: fix byte count table for some devices ==================== Link: https://patch.msgid.link/20250917105159.161583-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 16:12:46 -07:00
Jakub Kicinski	d7995c2b91	Merge branch 'tcp-clear-tcp_sk-sk-fastopen_rsk-in-tcp_disconnect' Kuniyuki Iwashima says: ==================== tcp: Clear tcp_sk(sk)->fastopen_rsk in tcp_disconnect(). syzbot reported a warning in tcp_retransmit_timer() for TCP Fast Open socket. Patch 1 fixes the issue and Patch 2 adds a test for the scenario. ==================== Link: https://patch.msgid.link/20250915175800.118793-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 16:01:54 -07:00
Kuniyuki Iwashima	1fd0362262	selftest: packetdrill: Add tcp_fastopen_server_reset-after-disconnect.pkt. The test reproduces the scenario explained in the previous patch. Without the patch, the test triggers the warning and cannot see the last retransmitted packet. # ./ksft_runner.sh tcp_fastopen_server_reset-after-disconnect.pkt TAP version 13 1..2 [ 29.229250] ------------[ cut here ]------------ [ 29.231414] WARNING: CPU: 26 PID: 0 at net/ipv4/tcp_timer.c:542 tcp_retransmit_timer+0x32/0x9f0 ... tcp_fastopen_server_reset-after-disconnect.pkt:26: error handling packet: Timed out waiting for packet not ok 1 ipv4 tcp_fastopen_server_reset-after-disconnect.pkt:26: error handling packet: Timed out waiting for packet not ok 2 ipv6 # Totals: pass:0 fail:2 xfail:0 xpass:0 skip:0 error:0 Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250915175800.118793-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 16:01:52 -07:00
Kuniyuki Iwashima	45c8a6cc2b	tcp: Clear tcp_sk(sk)->fastopen_rsk in tcp_disconnect(). syzbot reported the splat below where a socket had tcp_sk(sk)->fastopen_rsk in the TCP_ESTABLISHED state. [0] syzbot reused the server-side TCP Fast Open socket as a new client before the TFO socket completes 3WHS: 1. accept() 2. connect(AF_UNSPEC) 3. connect() to another destination As of accept(), sk->sk_state is TCP_SYN_RECV, and tcp_disconnect() changes it to TCP_CLOSE and makes connect() possible, which restarts timers. Since tcp_disconnect() forgot to clear tcp_sk(sk)->fastopen_rsk, the retransmit timer triggered the warning and the intended packet was not retransmitted. Let's call reqsk_fastopen_remove() in tcp_disconnect(). [0]: WARNING: CPU: 2 PID: 0 at net/ipv4/tcp_timer.c:542 tcp_retransmit_timer (net/ipv4/tcp_timer.c:542 (discriminator 7)) Modules linked in: CPU: 2 UID: 0 PID: 0 Comm: swapper/2 Not tainted 6.17.0-rc5-g201825fb4278 #62 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:tcp_retransmit_timer (net/ipv4/tcp_timer.c:542 (discriminator 7)) Code: 41 55 41 54 55 53 48 8b af b8 08 00 00 48 89 fb 48 85 ed 0f 84 55 01 00 00 0f b6 47 12 3c 03 74 0c 0f b6 47 12 3c 04 74 04 90 <0f> 0b 90 48 8b 85 c0 00 00 00 48 89 ef 48 8b 40 30 e8 6a 4f 06 3e RSP: 0018:ffffc900002f8d40 EFLAGS: 00010293 RAX: 0000000000000002 RBX: ffff888106911400 RCX: 0000000000000017 RDX: 0000000002517619 RSI: ffffffff83764080 RDI: ffff888106911400 RBP: ffff888106d5c000 R08: 0000000000000001 R09: ffffc900002f8de8 R10: 00000000000000c2 R11: ffffc900002f8ff8 R12: ffff888106911540 R13: ffff888106911480 R14: ffff888106911840 R15: ffffc900002f8de0 FS: 0000000000000000(0000) GS:ffff88907b768000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8044d69d90 CR3: 0000000002c30003 CR4: 0000000000370ef0 Call Trace: <IRQ> tcp_write_timer (net/ipv4/tcp_timer.c:738) call_timer_fn (kernel/time/timer.c:1747) __run_timers (kernel/time/timer.c:1799 kernel/time/timer.c:2372) timer_expire_remote (kernel/time/timer.c:2385 kernel/time/timer.c:2376 kernel/time/timer.c:2135) tmigr_handle_remote_up (kernel/time/timer_migration.c:944 kernel/time/timer_migration.c:1035) __walk_groups.isra.0 (kernel/time/timer_migration.c:533 (discriminator 1)) tmigr_handle_remote (kernel/time/timer_migration.c:1096) handle_softirqs (./arch/x86/include/asm/jump_label.h:36 ./include/trace/events/irq.h:142 kernel/softirq.c:580) irq_exit_rcu (kernel/softirq.c:614 kernel/softirq.c:453 kernel/softirq.c:680 kernel/softirq.c:696) sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1050 (discriminator 35) arch/x86/kernel/apic/apic.c:1050 (discriminator 35)) </IRQ> Fixes: `8336886f78` ("tcp: TCP Fast Open Server - support TFO listeners") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250915175800.118793-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 16:01:52 -07:00
Sathesh B Edara	a72175c985	octeon_ep: fix VF MAC address lifecycle handling Currently, VF MAC address info is not updated when the MAC address is configured from VF, and it is not cleared when the VF is removed. This leads to stale or missing MAC information in the PF, which may cause incorrect state tracking or inconsistencies when VFs are hot-plugged or reassigned. Fix this by: - storing the VF MAC address in the PF when it is set from VF - clearing the stored VF MAC address when the VF is removed This ensures that the PF always has correct VF MAC state. Fixes: `cde29af9e6` ("octeon_ep: add PF-VF mailbox communication") Signed-off-by: Sathesh B Edara <sedara@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250916133207.21737-1-sedara@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 15:58:32 -07:00
Hangbin Liu	dc5f94b1ec	selftests: bonding: add vlan over bond testing Add a vlan over bond testing to make sure arp/ns target works. Also change all the configs to mudules. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20250916080127.430626-2-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 15:13:51 -07:00
Hangbin Liu	a8ba87f04c	bonding: don't set oif to bond dev when getting NS target destination Unlike IPv4, IPv6 routing strictly requires the source address to be valid on the outgoing interface. If the NS target is set to a remote VLAN interface, and the source address is also configured on a VLAN over a bond interface, setting the oif to the bond device will fail to retrieve the correct destination route. Fix this by not setting the oif to the bond device when retrieving the NS target destination. This allows the correct destination device (the VLAN interface) to be determined, so that bond_verify_device_path can return the proper VLAN tags for sending NS messages. Reported-by: David Wilder <wilder@us.ibm.com> Closes: https://lore.kernel.org/netdev/aGOKggdfjv0cApTO@fedora/ Suggested-by: Jay Vosburgh <jv@jvosburgh.net> Tested-by: David Wilder <wilder@us.ibm.com> Acked-by: Jay Vosburgh <jv@jvosburgh.net> Fixes: `4e24be018e` ("bonding: add new parameter ns_targets") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20250916080127.430626-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 15:13:51 -07:00
Hans de Goede	b6f56a44e4	net: rfkill: gpio: Fix crash due to dereferencering uninitialized pointer Since commit `7d5e9737ef` ("net: rfkill: gpio: get the name and type from device property") rfkill_find_type() gets called with the possibly uninitialized "const char *type_name;" local variable. On x86 systems when rfkill-gpio binds to a "BCM4752" or "LNV4752" acpi_device, the rfkill->type is set based on the ACPI acpi_device_id: rfkill->type = (unsigned)id->driver_data; and there is no "type" property so device_property_read_string() will fail and leave type_name uninitialized, leading to a potential crash. rfkill_find_type() does accept a NULL pointer, fix the potential crash by initializing type_name to NULL. Note likely sofar this has not been caught because: 1. Not many x86 machines actually have a "BCM4752"/"LNV4752" acpi_device 2. The stack happened to contain NULL where type_name is stored Fixes: `7d5e9737ef` ("net: rfkill: gpio: get the name and type from device property") Cc: stable@vger.kernel.org Cc: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Hans de Goede <hansg@kernel.org> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Link: https://patch.msgid.link/20250913113515.21698-1-hansg@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2025-09-17 12:37:05 +02:00
Johannes Berg	e882985b09	Merge tag 'iwlwifi-fixes-2025-09-15' of https://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-next Miri Korenblit says: ==================== iwlwifi fix ==================== The fix is for byte count tables in 7000/8000 family devices. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2025-09-17 12:36:08 +02:00
Jakub Kicinski	8c47485399	Merge branch 'mlx5e-misc-fixes-2025-09-15' Tariq Toukan says: ==================== mlx5e misc fixes 2025-09-15 This patchset provides misc bug fixes from the team to the mlx5 Eth driver. ==================== Link: https://patch.msgid.link/1757939074-617281-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-16 17:19:15 -07:00
Lama Kayal	7601a0a462	net/mlx5e: Add a miss level for ipsec crypto offload The cited commit adds a miss table for switchdev mode. But it uses the same level as policy table. Will hit the following error when running command: # ip xfrm state add src 192.168.1.22 dst 192.168.1.21 proto \ esp spi 1001 reqid 10001 aead 'rfc4106(gcm(aes))' \ 0x3a189a7f9374955d3817886c8587f1da3df387ff 128 \ mode tunnel offload dev enp8s0f0 dir in Error: mlx5_core: Device failed to offload this state. The dmesg error is: mlx5_core 0000:03:00.0: ipsec_miss_create:578:(pid 311797): fail to create IPsec miss_rule err=-22 Fix it by adding a new miss level to avoid the error. Fixes: `7d9e292ecd` ("net/mlx5e: Move IPSec policy check after decryption") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Chris Mi <cmi@nvidia.com> Signed-off-by: Lama Kayal <lkayal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1757939074-617281-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-16 17:19:14 -07:00
Jianbo Liu	6b4be64fd9	net/mlx5e: Harden uplink netdev access against device unbind The function mlx5_uplink_netdev_get() gets the uplink netdevice pointer from mdev->mlx5e_res.uplink_netdev. However, the netdevice can be removed and its pointer cleared when unbound from the mlx5_core.eth driver. This results in a NULL pointer, causing a kernel panic. BUG: unable to handle page fault for address: 0000000000001300 at RIP: 0010:mlx5e_vport_rep_load+0x22a/0x270 [mlx5_core] Call Trace: <TASK> mlx5_esw_offloads_rep_load+0x68/0xe0 [mlx5_core] esw_offloads_enable+0x593/0x910 [mlx5_core] mlx5_eswitch_enable_locked+0x341/0x420 [mlx5_core] mlx5_devlink_eswitch_mode_set+0x17e/0x3a0 [mlx5_core] devlink_nl_eswitch_set_doit+0x60/0xd0 genl_family_rcv_msg_doit+0xe0/0x130 genl_rcv_msg+0x183/0x290 netlink_rcv_skb+0x4b/0xf0 genl_rcv+0x24/0x40 netlink_unicast+0x255/0x380 netlink_sendmsg+0x1f3/0x420 __sock_sendmsg+0x38/0x60 __sys_sendto+0x119/0x180 do_syscall_64+0x53/0x1d0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Ensure the pointer is valid before use by checking it for NULL. If it is valid, immediately call netdev_hold() to take a reference, and preventing the netdevice from being freed while it is in use. Fixes: `7a9fb35e8c` ("net/mlx5e: Do not reload ethernet ports when changing eswitch mode") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1757939074-617281-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-16 17:19:11 -07:00
Jakub Kicinski	94ff1ed303	MAINTAINERS: make the DPLL entry cover drivers DPLL maintainers should probably be CCed on driver patches, too. Remove the *, which makes the pattern only match files directly under drivers/dpll but not its sub-directories. Acked-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Acked-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Link: https://patch.msgid.link/20250915234255.1306612-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-16 16:55:59 -07:00
Remy D. Farley	109f8b5154	doc/netlink: Fix typos in operation attributes I'm trying to generate Rust bindings for netlink using the yaml spec. It looks like there's a typo in conntrack spec: attribute set conntrack-attrs defines attributes "counters-{orig,reply}" (plural), while get operation references "counter-{orig,reply}" (singular). The latter should be fixed, as it denotes multiple counters (packet and byte). The corresonding C define is CTA_COUNTERS_ORIG. Also, dump request references "nfgen-family" attribute, which neither exists in conntrack-attrs attrset nor ctattr_type enum. There's member of nfgenmsg struct with the same name, which is where family value is actually taken from. > static int ctnetlink_dump_exp_ct(struct net net, struct sock ctnl, > struct sk_buff skb, > const struct nlmsghdr nlh, > const struct nlattr * const cda[], > struct netlink_ext_ack extack) > { > int err; > struct nfgenmsg nfmsg = nlmsg_data(nlh); > u_int8_t u3 = nfmsg->nfgen_family; ^^^^^^^^^^^^ Signed-off-by: Remy D. Farley <one-d-wide@protonmail.com> Fixes: `23fc9311a5` ("netlink: specs: add conntrack dump and stats dump support") Link: https://patch.msgid.link/20250913140515.1132886-1-one-d-wide@protonmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-16 16:30:06 -07:00
Kohei Enju	528eb4e19e	igc: don't fail igc_probe() on LED setup error When igc_led_setup() fails, igc_probe() fails and triggers kernel panic in free_netdev() since unregister_netdev() is not called. [1] This behavior can be tested using fault-injection framework, especially the failslab feature. [2] Since LED support is not mandatory, treat LED setup failures as non-fatal and continue probe with a warning message, consequently avoiding the kernel panic. [1] kernel BUG at net/core/dev.c:12047! Oops: invalid opcode: 0000 [#1] SMP NOPTI CPU: 0 UID: 0 PID: 937 Comm: repro-igc-led-e Not tainted 6.17.0-rc4-enjuk-tnguy-00865-gc4940196ab02 #64 PREEMPT(voluntary) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:free_netdev+0x278/0x2b0 [...] Call Trace: <TASK> igc_probe+0x370/0x910 local_pci_probe+0x3a/0x80 pci_device_probe+0xd1/0x200 [...] [2] #!/bin/bash -ex FAILSLAB_PATH=/sys/kernel/debug/failslab/ DEVICE=0000:00:05.0 START_ADDR=$(grep " igc_led_setup" /proc/kallsyms \ \| awk '{printf("0x%s", $1)}') END_ADDR=$(printf "0x%x" $((START_ADDR + 0x100))) echo $START_ADDR > $FAILSLAB_PATH/require-start echo $END_ADDR > $FAILSLAB_PATH/require-end echo 1 > $FAILSLAB_PATH/times echo 100 > $FAILSLAB_PATH/probability echo N > $FAILSLAB_PATH/ignore-gfp-wait echo $DEVICE > /sys/bus/pci/drivers/igc/bind Fixes: `ea578703b0` ("igc: Add support for LEDs on i225/i226") Signed-off-by: Kohei Enju <enjuk@amazon.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-09-16 14:01:53 -07:00
Jedrzej Jagielski	316ba68175	ixgbe: destroy aci.lock later within ixgbe_remove path There's another issue with aci.lock and previous patch uncovers it. aci.lock is being destroyed during removing ixgbe while some of the ixgbe closing routines are still ongoing. These routines use Admin Command Interface which require taking aci.lock which has been already destroyed what leads to call trace. [ +0.000004] DEBUG_LOCKS_WARN_ON(lock->magic != lock) [ +0.000007] WARNING: CPU: 12 PID: 10277 at kernel/locking/mutex.c:155 mutex_lock+0x5f/0x70 [ +0.000002] Call Trace: [ +0.000003] <TASK> [ +0.000006] ixgbe_aci_send_cmd+0xc8/0x220 [ixgbe] [ +0.000049] ? try_to_wake_up+0x29d/0x5d0 [ +0.000009] ixgbe_disable_rx_e610+0xc4/0x110 [ixgbe] [ +0.000032] ixgbe_disable_rx+0x3d/0x200 [ixgbe] [ +0.000027] ixgbe_down+0x102/0x3b0 [ixgbe] [ +0.000031] ixgbe_close_suspend+0x28/0x90 [ixgbe] [ +0.000028] ixgbe_close+0xfb/0x100 [ixgbe] [ +0.000025] __dev_close_many+0xae/0x220 [ +0.000005] dev_close_many+0xc2/0x1a0 [ +0.000004] ? kernfs_should_drain_open_files+0x2a/0x40 [ +0.000005] unregister_netdevice_many_notify+0x204/0xb00 [ +0.000006] ? __kernfs_remove.part.0+0x109/0x210 [ +0.000006] ? kobj_kset_leave+0x4b/0x70 [ +0.000008] unregister_netdevice_queue+0xf6/0x130 [ +0.000006] unregister_netdev+0x1c/0x40 [ +0.000005] ixgbe_remove+0x216/0x290 [ixgbe] [ +0.000021] pci_device_remove+0x42/0xb0 [ +0.000007] device_release_driver_internal+0x19c/0x200 [ +0.000008] driver_detach+0x48/0x90 [ +0.000003] bus_remove_driver+0x6d/0xf0 [ +0.000006] pci_unregister_driver+0x2e/0xb0 [ +0.000005] ixgbe_exit_module+0x1c/0xc80 [ixgbe] Same as for the previous commit, the issue has been highlighted by the commit `337369f8ce` ("locking/mutex: Add MUTEX_WARN_ON() into fast path"). Move destroying aci.lock to the end of ixgbe_remove(), as this simply fixes the issue. Fixes: `4600cdf9f5` ("ixgbe: Enable link management in E610 device") Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-09-16 14:01:53 -07:00
Jedrzej Jagielski	b85936e95a	ixgbe: initialize aci.lock before it's used Currently aci.lock is initialized too late. A bunch of ACI callbacks using the lock are called prior it's initialized. Commit `337369f8ce` ("locking/mutex: Add MUTEX_WARN_ON() into fast path") highlights that issue what results in call trace. [ 4.092899] DEBUG_LOCKS_WARN_ON(lock->magic != lock) [ 4.092910] WARNING: CPU: 0 PID: 578 at kernel/locking/mutex.c:154 mutex_lock+0x6d/0x80 [ 4.098757] Call Trace: [ 4.098847] <TASK> [ 4.098922] ixgbe_aci_send_cmd+0x8c/0x1e0 [ixgbe] [ 4.099108] ? hrtimer_try_to_cancel+0x18/0x110 [ 4.099277] ixgbe_aci_get_fw_ver+0x52/0xa0 [ixgbe] [ 4.099460] ixgbe_check_fw_error+0x1fc/0x2f0 [ixgbe] [ 4.099650] ? usleep_range_state+0x69/0xd0 [ 4.099811] ? usleep_range_state+0x8c/0xd0 [ 4.099964] ixgbe_probe+0x3b0/0x12d0 [ixgbe] [ 4.100132] local_pci_probe+0x43/0xa0 [ 4.100267] work_for_cpu_fn+0x13/0x20 [ 4.101647] </TASK> Move aci.lock mutex initialization to ixgbe_sw_init() before any ACI command is sent. Along with that move also related SWFW semaphore in order to reduce size of ixgbe_probe() and that way all locks are initialized in ixgbe_sw_init(). Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Fixes: `4600cdf9f5` ("ixgbe: Enable link management in E610 device") Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-09-16 14:01:53 -07:00
Maciej Fijalkowski	e37084a260	i40e: remove redundant memory barrier when cleaning Tx descs i40e has a feature which writes to memory location last descriptor successfully sent. Memory barrier in i40e_clean_tx_irq() was used to avoid forward-reading descriptor fields in case DD bit was not set. Having mentioned feature in place implies that such situation will not happen as we know in advance how many descriptors HW has dealt with. Besides, this barrier placement was wrong. Idea is to have this protection after reading DD bit from HW descriptor, not before. Digging through git history showed me that indeed barrier was before DD bit check, anyways the commit introducing i40e_get_head() should have wiped it out altogether. Also, there was one commit doing s/read_barrier_depends/smp_rmb when get head feature was already in place, but it was only theoretical based on ixgbe experiences, which is different in these terms as that driver has to read DD bit from HW descriptor. Fixes: `1943d8ba95` ("i40e/i40evf: enable hardware feature head write back") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-09-16 14:01:53 -07:00
Jacob Keller	84bf1ac85a	ice: fix Rx page leak on multi-buffer frames The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for each buffer in the current frame. This function was introduced as part of handling multi-buffer XDP support in the ice driver. It works by iterating over the buffers from first_desc up to 1 plus the total number of fragments in the frame, cached from before the XDP program was executed. If the hardware posts a descriptor with a size of 0, the logic used in ice_put_rx_mbuf() breaks. Such descriptors get skipped and don't get added as fragments in ice_add_xdp_frag. Since the buffer isn't counted as a fragment, we do not iterate over it in ice_put_rx_mbuf(), and thus we don't call ice_put_rx_buf(). Because we don't call ice_put_rx_buf(), we don't attempt to re-use the page or free it. This leaves a stale page in the ring, as we don't increment next_to_alloc. The ice_reuse_rx_page() assumes that the next_to_alloc has been incremented properly, and that it always points to a buffer with a NULL page. Since this function doesn't check, it will happily recycle a page over the top of the next_to_alloc buffer, losing track of the old page. Note that this leak only occurs for multi-buffer frames. The ice_put_rx_mbuf() function always handles at least one buffer, so a single-buffer frame will always get handled correctly. It is not clear precisely why the hardware hands us descriptors with a size of 0 sometimes, but it happens somewhat regularly with "jumbo frames" used by 9K MTU. To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on all buffers between first_desc and next_to_clean. Borrow the logic of a similar function in i40e used for this same purpose. Use the same logic also in ice_get_pgcnts(). Instead of iterating over just the number of fragments, use a loop which iterates until the current index reaches to the next_to_clean element just past the current frame. Unlike i40e, the ice_put_rx_mbuf() function does call ice_put_rx_buf() on the last buffer of the frame indicating the end of packet. For non-linear (multi-buffer) frames, we need to take care when adjusting the pagecnt_bias. An XDP program might release fragments from the tail of the frame, in which case that fragment page is already released. Only update the pagecnt_bias for the first descriptor and fragments still remaining post-XDP program. Take care to only access the shared info for fragmented buffers, as this avoids a significant cache miss. The xdp_xmit value only needs to be updated if an XDP program is run, and only once per packet. Drop the xdp_xmit pointer argument from ice_put_rx_mbuf(). Instead, set xdp_xmit in the ice_clean_rx_irq() function directly. This avoids needing to pass the argument and avoids an extra bit-wise OR for each buffer in the frame. Move the increment of the ntc local variable to ensure its updated before all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic requires the index of the element just after the current frame. Now that we use an index pointer in the ring to identify the packet, we no longer need to track or cache the number of fragments in the rx_ring. Cc: Christoph Petrausch <christoph.petrausch@deepl.com> Cc: Jesper Dangaard Brouer <hawk@kernel.org> Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Closes: https://lore.kernel.org/netdev/CAK8fFZ4hY6GUJNENz3wY9jaYLZXGfpr7dnZxzGMYoE44caRbgw@mail.gmail.com/ Fixes: `743bbd93cf` ("ice: put Rx buffers after being done with current frame") Tested-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Tested-by: Priya Singh <priyax.singh@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-09-16 14:01:46 -07:00
Yeounsu Moon	93ab4881a4	net: natsemi: fix `rx_dropped` double accounting on `netif_rx()` failure `netif_rx()` already increments `rx_dropped` core stat when it fails. The driver was also updating `ndev->stats.rx_dropped` in the same path. Since both are reported together via `ip -s -s` command, this resulted in drops being counted twice in user-visible stats. Keep the driver update on `if (unlikely(!skb))`, but skip it after `netif_rx()` errors. Fixes: `caf586e5f2` ("net: add a core netdev->rx_dropped counter") Signed-off-by: Yeounsu Moon <yyyynoom@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250913060135.35282-3-yyyynoom@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 19:06:25 -07:00
Jakub Kicinski	97499e2818	Merge branch 'mptcp-pm-nl-announce-deny-join-id0-flag' Matthieu Baerts says: ==================== mptcp: pm: nl: announce deny-join-id0 flag During the connection establishment, a peer can tell the other one that it cannot establish new subflows to the initial IP address and port by setting the 'C' flag [1]. Doing so makes sense when the sender is behind a strict NAT, operating behind a legacy Layer 4 load balancer, or using anycast IP address for example. When this 'C' flag is set, the path-managers must then not try to establish new subflows to the other peer's initial IP address and port. The in-kernel PM has access to this info, but the userspace PM didn't, not letting the userspace daemon able to respect the RFC8684. Here are a few fixes related to this 'C' flag (aka 'deny-join-id0'): - Patch 1: add remote_deny_join_id0 info on passive connections. A fix for v5.14. - Patch 2: let the userspace PM daemon know about the deny_join_id0 attribute, so when set, it can avoid creating new subflows to the initial IP address and port. A fix for v5.19. - Patch 3: a validation for the previous commit. - Patch 4: record the deny_join_id0 info when TFO is used. A fix for v6.2. - Patch 5: not related to deny-join-id0, but it fixes errors messages in the sockopt selftests, not to create confusions. A fix for v6.5. ==================== Link: https://patch.msgid.link/20250912-net-mptcp-pm-uspace-deny_join_id0-v1-0-40171884ade8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:12:08 -07:00
Geliang Tang	b86418bead	selftests: mptcp: sockopt: fix error messages This patch fixes several issues in the error reporting of the MPTCP sockopt selftest: 1. Fix diff not printed: The error messages for counter mismatches had the actual difference ('diff') as argument, but it was missing in the format string. Displaying it makes the debugging easier. 2. Fix variable usage: The error check for 'mptcpi_bytes_acked' incorrectly used 'ret2' (sent bytes) for both the expected value and the difference calculation. It now correctly uses 'ret' (received bytes), which is the expected value for bytes_acked. 3. Fix off-by-one in diff: The calculation for the 'mptcpi_rcv_delta' diff was 's.mptcpi_rcv_delta - ret', which is off-by-one. It has been corrected to 's.mptcpi_rcv_delta - (ret + 1)' to match the expected value in the condition above it. Fixes: `5dcff89e14` ("selftests: mptcp: explicitly tests aggregate counters") Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250912-net-mptcp-pm-uspace-deny_join_id0-v1-5-40171884ade8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:12:05 -07:00
Matthieu Baerts (NGI0)	92da495cb6	mptcp: tfo: record 'deny join id0' info When TFO is used, the check to see if the 'C' flag (deny join id0) was set was bypassed. This flag can be set when TFO is used, so the check should also be done when TFO is used. Note that the set_fully_established label is also used when a 4th ACK is received. In this case, deny_join_id0 will not be set. Fixes: `dfc8d06030` ("mptcp: implement delayed seq generation for passive fastopen") Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250912-net-mptcp-pm-uspace-deny_join_id0-v1-4-40171884ade8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:12:05 -07:00
Matthieu Baerts (NGI0)	24733e193a	selftests: mptcp: userspace pm: validate deny-join-id0 flag The previous commit adds the MPTCP_PM_EV_FLAG_DENY_JOIN_ID0 flag. Make sure it is correctly announced by the other peer when it has been received. pm_nl_ctl will now display 'deny_join_id0:1' when monitoring the events, and when this flag was set by the other peer. The 'Fixes' tag here below is the same as the one from the previous commit: this patch here is not fixing anything wrong in the selftests, but it validates the previous fix for an issue introduced by this commit ID. Fixes: `702c2f646d` ("mptcp: netlink: allow userspace-driven subflow establishment") Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250912-net-mptcp-pm-uspace-deny_join_id0-v1-3-40171884ade8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:12:05 -07:00
Matthieu Baerts (NGI0)	2293c57484	mptcp: pm: nl: announce deny-join-id0 flag During the connection establishment, a peer can tell the other one that it cannot establish new subflows to the initial IP address and port by setting the 'C' flag [1]. Doing so makes sense when the sender is behind a strict NAT, operating behind a legacy Layer 4 load balancer, or using anycast IP address for example. When this 'C' flag is set, the path-managers must then not try to establish new subflows to the other peer's initial IP address and port. The in-kernel PM has access to this info, but the userspace PM didn't. The RFC8684 [1] is strict about that: (...) therefore the receiver MUST NOT try to open any additional subflows toward this address and port. So it is important to tell the userspace about that as it is responsible for the respect of this flag. When a new connection is created and established, the Netlink events now contain the existing but not currently used 'flags' attribute. When MPTCP_PM_EV_FLAG_DENY_JOIN_ID0 is set, it means no other subflows to the initial IP address and port -- info that are also part of the event -- can be established. Link: https://datatracker.ietf.org/doc/html/rfc8684#section-3.1-20.6 [1] Fixes: `702c2f646d` ("mptcp: netlink: allow userspace-driven subflow establishment") Reported-by: Marek Majkowski <marek@cloudflare.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/532 Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250912-net-mptcp-pm-uspace-deny_join_id0-v1-2-40171884ade8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:12:05 -07:00
Matthieu Baerts (NGI0)	96939cec99	mptcp: set remote_deny_join_id0 on SYN recv When a SYN containing the 'C' flag (deny join id0) was received, this piece of information was not propagated to the path-manager. Even if this flag is mainly set on the server side, a client can also tell the server it cannot try to establish new subflows to the client's initial IP address and port. The server's PM should then record such info when received, and before sending events about the new connection. Fixes: `df377be387` ("mptcp: add deny_join_id0 in mptcp_options_received") Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250912-net-mptcp-pm-uspace-deny_join_id0-v1-1-40171884ade8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:12:05 -07:00
Jakub Kicinski	33a09c64c2	Merge branch 'selftests-mptcp-avoid-spurious-errors-on-tcp-disconnect' Matthieu Baerts says: ==================== selftests: mptcp: avoid spurious errors on TCP disconnect This series should fix the recent instabilities seen by MPTCP and NIPA CIs where the 'mptcp_connect.sh' tests fail regularly when running the 'disconnect' subtests with "plain" TCP sockets, e.g. # INFO: disconnect # 63 ns1 MPTCP -> ns1 (10.0.1.1:20001 ) MPTCP (duration 996ms) [ OK ] # 64 ns1 MPTCP -> ns1 (10.0.1.1:20002 ) TCP (duration 851ms) [ OK ] # 65 ns1 TCP -> ns1 (10.0.1.1:20003 ) MPTCP Unexpected revents: POLLERR/POLLNVAL(19) # (duration 896ms) [FAIL] file received by server does not match (in, out): # -rw-r--r-- 1 root root 11112852 Aug 19 09:16 /tmp/tmp.hlJe5DoMoq.disconnect # Trailing bytes are: # /{ga 6@=#.8:-rw------- 1 root root 10085368 Aug 19 09:16 /tmp/tmp.blClunilxx # Trailing bytes are: # /{ga 6@=#.8:66 ns1 MPTCP -> ns1 (dead:beef:1::1:20004) MPTCP (duration 987ms) [ OK ] # 67 ns1 MPTCP -> ns1 (dead:beef:1::1:20005) TCP (duration 911ms) [ OK ] # 68 ns1 TCP -> ns1 (dead:beef:1::1:20006) MPTCP (duration 980ms) [ OK ] # [FAIL] Tests of the full disconnection have failed These issues started to be visible after some behavioural changes in TCP, where too quick re-connections after a shutdown() can now be more easily rejected. Patch 3 modifies the selftests to wait, but this resolution revealed an issue in MPTCP which is fixed by patch 1 (a fix for v5.9 kernel). Patches 2 and 4 improve some errors reported by the selftests, and patch 5 helps with the debugging of such issues. ==================== Link: https://patch.msgid.link/20250912-net-mptcp-fix-sft-connect-v1-0-d40e77cbbf02@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:10:40 -07:00
Matthieu Baerts (NGI0)	cf74e0aa0e	selftests: mptcp: connect: print pcap prefix To be able to find which capture files have been produced after several runs. This prefix was not printed anywhere before. While at it, always use the same prefix by taking info from ns1, instead of "$connector_ns", which is sometimes ns1, sometimes ns2 in the subtests. Reviewed-by: Mat Martineau <martineau@kernel.org> Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250912-net-mptcp-fix-sft-connect-v1-5-d40e77cbbf02@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:10:37 -07:00
Matthieu Baerts (NGI0)	a17c5aa3a3	selftests: mptcp: print trailing bytes with od This is better than printing random bytes in the terminal. Note that Jakub suggested 'hexdump', but Mat found out this tool is not often installed by default. 'od' can do a similar job, and it is in the POSIX specs and available in coreutils, so it should be on more systems. While at it, display a few more bytes, just to fill in the two lines. And no need to display the 3rd only line showing the next number of bytes: 0000040. Suggested-by: Jakub Kicinski <kuba@kernel.org> Suggested-by: Mat Martineau <martineau@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250912-net-mptcp-fix-sft-connect-v1-4-d40e77cbbf02@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:10:37 -07:00
Matthieu Baerts (NGI0)	8708c5d8b3	selftests: mptcp: avoid spurious errors on TCP disconnect The disconnect test-case, with 'plain' TCP sockets generates spurious errors, e.g. 07 ns1 TCP -> ns1 (dead:beef:1::1:10006) MPTCP read: Connection reset by peer read: Connection reset by peer (duration 155ms) [FAIL] client exit code 3, server 3 netns ns1-FloSdv (listener) socket stat for 10006: TcpActiveOpens 2 0.0 TcpPassiveOpens 2 0.0 TcpEstabResets 2 0.0 TcpInSegs 274 0.0 TcpOutSegs 276 0.0 TcpOutRsts 3 0.0 TcpExtPruneCalled 2 0.0 TcpExtRcvPruned 1 0.0 TcpExtTCPPureAcks 104 0.0 TcpExtTCPRcvCollapsed 2 0.0 TcpExtTCPBacklogCoalesce 42 0.0 TcpExtTCPRcvCoalesce 43 0.0 TcpExtTCPChallengeACK 1 0.0 TcpExtTCPFromZeroWindowAdv 42 0.0 TcpExtTCPToZeroWindowAdv 41 0.0 TcpExtTCPWantZeroWindowAdv 13 0.0 TcpExtTCPOrigDataSent 164 0.0 TcpExtTCPDelivered 165 0.0 TcpExtTCPRcvQDrop 1 0.0 In the failing scenarios (TCP -> MPTCP), the involved sockets are actually plain TCP ones, as fallbacks for passive sockets at 2WHS time cause the MPTCP listeners to actually create 'plain' TCP sockets. Similar to commit `218cc16632` ("selftests: mptcp: avoid spurious errors on disconnect"), the root cause is in the user-space bits: the test program tries to disconnect as soon as all the pending data has been spooled, generating an RST. If such option reaches the peer before the connection has reached the closed status, the TCP socket will report an error to the user-space, as per protocol specification, causing the above failure. Note that it looks like this issue got more visible since the "tcp: receiver changes" series from commit `06baf9bfa6` ("Merge branch 'tcp-receiver-changes'"). Address the issue by explicitly waiting for the TCP sockets (-t) to reach a closed status before performing the disconnect. More precisely, the test program now waits for plain TCP sockets or TCP subflows in addition to the MPTCP sockets that were already monitored. While at it, use 'ss' with '-n' to avoid resolving service names, which is not needed here. Fixes: `218cc16632` ("selftests: mptcp: avoid spurious errors on disconnect") Cc: stable@vger.kernel.org Suggested-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250912-net-mptcp-fix-sft-connect-v1-3-d40e77cbbf02@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:10:37 -07:00
Matthieu Baerts (NGI0)	14e22b43df	selftests: mptcp: connect: catch IO errors on listen side IO errors were correctly printed to stderr, and propagated up to the main loop for the server side, but the returned value was ignored. As a consequence, the program for the listener side was no longer exiting with an error code in case of IO issues. Because of that, some issues might not have been seen. But very likely, most issues either had an effect on the client side, or the file transfer was not the expected one, e.g. the connection got reset before the end. Still, it is better to fix this. The main consequence of this issue is the error that was reported by the selftests: the received and sent files were different, and the MIB counters were not printed. Also, when such errors happened during the 'disconnect' tests, the program tried to continue until the timeout. Now when an IO error is detected, the program exits directly with an error. Fixes: `05be5e273c` ("selftests: mptcp: add disconnect tests") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250912-net-mptcp-fix-sft-connect-v1-2-d40e77cbbf02@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:10:36 -07:00
Matthieu Baerts (NGI0)	f755be0b1f	mptcp: propagate shutdown to subflows when possible When the MPTCP DATA FIN have been ACKed, there is no more MPTCP related metadata to exchange, and all subflows can be safely shutdown. Before this patch, the subflows were actually terminated at 'close()' time. That's certainly fine most of the time, but not when the userspace 'shutdown()' a connection, without close()ing it. When doing so, the subflows were staying in LAST_ACK state on one side -- and consequently in FIN_WAIT2 on the other side -- until the 'close()' of the MPTCP socket. Now, when the DATA FIN have been ACKed, all subflows are shutdown. A consequence of this is that the TCP 'FIN' flag can be set earlier now, but the end result is the same. This affects the packetdrill tests looking at the end of the MPTCP connections, but for a good reason. Note that tcp_shutdown() will check the subflow state, so no need to do that again before calling it. Fixes: `3721b9b646` ("mptcp: Track received DATA_FIN sequence number and add related helpers") Cc: stable@vger.kernel.org Fixes: `16a9a9da17` ("mptcp: Add helper to process acks of DATA_FIN") Reviewed-by: Mat Martineau <martineau@kernel.org> Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250912-net-mptcp-fix-sft-connect-v1-1-d40e77cbbf02@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:10:36 -07:00
Hangbin Liu	71379e1c95	selftests: bonding: add fail_over_mac testing Add a test to check each value of bond fail_over_mac option. Also fix a minor garp_test print issue. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20250910024336.400253-2-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 17:48:42 -07:00
Hangbin Liu	35ae4e8629	bonding: set random address only when slaves already exist After commit `5c3bf6cba7` ("bonding: assign random address if device address is same as bond"), bonding will erroneously randomize the MAC address of the first interface added to the bond if fail_over_mac = follow. Correct this by additionally testing for the bond being empty before randomizing the MAC. Fixes: `5c3bf6cba7` ("bonding: assign random address if device address is same as bond") Reported-by: Qiuling Ren <qren@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20250910024336.400253-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 17:48:42 -07:00
Håkon Bugge	4351ca3fcb	rds: ib: Increment i_fastreg_wrs before bailing out We need to increment i_fastreg_wrs before we bail out from rds_ib_post_reg_frmr(). We have a fixed budget of how many FRWR operations that can be outstanding using the dedicated QP used for memory registrations and de-registrations. This budget is enforced by the atomic_t i_fastreg_wrs. If we bail out early in rds_ib_post_reg_frmr(), we will "leak" the possibility of posting an FRWR operation, and if that accumulates, no FRWR operation can be carried out. Fixes: `1659185fb4` ("RDS: IB: Support Fastreg MR (FRMR) memory registration mode") Fixes: `3a2886cca7` ("net/rds: Keep track of and wait for FRWR segments in use upon shutdown") Cc: stable@vger.kernel.org Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20250911133336.451212-1-haakon.bugge@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 16:47:53 -07:00
Johannes Berg	a38108a23a	wifi: iwlwifi: pcie: fix byte count table for some devices In my previous fix for this condition, I erroneously listed 9000 instead of 7000 family, when 7000/8000 were already using iwlmvm. Thus the condition ended up wrong, causing the issue I had fixed for older devices to suddenly appear on 7000/8000 family devices. Correct the condition accordingly. Reported-by: David Wang <00107082@163.com> Closes: https://lore.kernel.org/r/20250909165811.10729-1-00107082@163.com/ Fixes: `586e3cb33b` ("wifi: iwlwifi: fix byte count table for old devices") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250915102743.777aaafbcc6c.I84404edfdfbf400501f6fb06def5b86c501da198@changeid Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>	2025-09-15 11:20:47 +03:00
Jakub Kicinski	2e5fb2ff31	Merge branch 'net-dst_metadata-fix-df-flag-extraction-on-tunnel-rx' Ilya Maximets says: ==================== net: dst_metadata: fix DF flag extraction on tunnel rx Two patches here, first fixes the issue where tunnel core doesn't actually extract DF bit from the outer IP header, even though both OVS and TC flower allow matching on it. More details in the commit message. The second is a selftest for openvswitch that reproduces the issue, but also just adds some basic coverage for the tunnel metadata extraction and related openvswitch uAPI. ==================== Link: https://patch.msgid.link/20250909165440.229890-1-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-14 14:28:15 -07:00
Ilya Maximets	6cafb93c1f	selftests: openvswitch: add a simple test for tunnel metadata This test ensures that upon receiving decapsulated packets from a tunnel interface in openvswitch, the tunnel metadata fields are properly populated. This partially covers interoperability of the kernel tunnel ports and openvswitch tunnels (LWT) and parsing and formatting of the tunnel metadata fields of the openvswitch netlink uAPI. Doing so, this test also ensures that fields and flags are properly extracted during decapsulation by the tunnel core code, serving as a regression test for the previously fixed issue with the DF bit not being extracted from the outer IP header. The ovs-dpctl.py script already supports all that is necessary for the tunnel ports for this test, so we only need to adjust the ovs_add_if() function to pass the '-t' port type argument in order to be able to create tunnel ports in the openvswitch datapath. Reviewed-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Link: https://patch.msgid.link/20250909165440.229890-3-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-14 14:28:13 -07:00
Ilya Maximets	a9888628cb	net: dst_metadata: fix IP_DF bit not extracted from tunnel headers Both OVS and TC flower allow extracting and matching on the DF bit of the outer IP header via OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT in the OVS_KEY_ATTR_TUNNEL and TCA_FLOWER_KEY_FLAGS_TUNNEL_DONT_FRAGMENT in the TCA_FLOWER_KEY_ENC_FLAGS respectively. Flow dissector extracts this information as FLOW_DIS_F_TUNNEL_DONT_FRAGMENT from the tunnel info key. However, the IP_TUNNEL_DONT_FRAGMENT_BIT in the tunnel key is never actually set, because the tunneling code doesn't actually extract it from the IP header. OAM and CRIT_OPT are extracted by the tunnel implementation code, same code also sets the KEY flag, if present. UDP tunnel core takes care of setting the CSUM flag if the checksum is present in the UDP header, but the DONT_FRAGMENT is not handled at any layer. Fix that by checking the bit and setting the corresponding flag while populating the tunnel info in the IP layer where it belongs. Not using __assign_bit as we don't really need to clear the bit in a just initialized field. It also doesn't seem like using __assign_bit will make the code look better. Clearly, users didn't rely on this functionality for anything very important until now. The reason why this doesn't break OVS logic is that it only matches on what kernel previously parsed out and if kernel consistently reports this bit as zero, OVS will only match on it to be zero, which sort of works. But it is still a bug that the uAPI reports and allows matching on the field that is not actually checked in the packet. And this is causing misleading -df reporting in OVS datapath flows, while the tunnel traffic actually has the bit set in most cases. This may also cause issues if a hardware properly implements support for tunnel flag matching as it will disagree with the implementation in a software path of TC flower. Fixes: `7d5437c709` ("openvswitch: Add tunneling interface.") Fixes: `1d17568e74` ("net/sched: cls_flower: add support for matching tunnel control flags") Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250909165440.229890-2-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-14 14:28:12 -07:00
Jamie Bainbridge	56c0a2a9dd	qed: Don't collect too many protection override GRC elements In the protection override dump path, the firmware can return far too many GRC elements, resulting in attempting to write past the end of the previously-kmalloc'ed dump buffer. This will result in a kernel panic with reason: BUG: unable to handle kernel paging request at ADDRESS where "ADDRESS" is just past the end of the protection override dump buffer. The start address of the buffer is: p_hwfn->cdev->dbg_features[DBG_FEATURE_PROTECTION_OVERRIDE].dump_buf and the size of the buffer is buf_size in the same data structure. The panic can be arrived at from either the qede Ethernet driver path: [exception RIP: qed_grc_dump_addr_range+0x108] qed_protection_override_dump at ffffffffc02662ed [qed] qed_dbg_protection_override_dump at ffffffffc0267792 [qed] qed_dbg_feature at ffffffffc026aa8f [qed] qed_dbg_all_data at ffffffffc026b211 [qed] qed_fw_fatal_reporter_dump at ffffffffc027298a [qed] devlink_health_do_dump at ffffffff82497f61 devlink_health_report at ffffffff8249cf29 qed_report_fatal_error at ffffffffc0272baf [qed] qede_sp_task at ffffffffc045ed32 [qede] process_one_work at ffffffff81d19783 or the qedf storage driver path: [exception RIP: qed_grc_dump_addr_range+0x108] qed_protection_override_dump at ffffffffc068b2ed [qed] qed_dbg_protection_override_dump at ffffffffc068c792 [qed] qed_dbg_feature at ffffffffc068fa8f [qed] qed_dbg_all_data at ffffffffc0690211 [qed] qed_fw_fatal_reporter_dump at ffffffffc069798a [qed] devlink_health_do_dump at ffffffff8aa95e51 devlink_health_report at ffffffff8aa9ae19 qed_report_fatal_error at ffffffffc0697baf [qed] qed_hw_err_notify at ffffffffc06d32d7 [qed] qed_spq_post at ffffffffc06b1011 [qed] qed_fcoe_destroy_conn at ffffffffc06b2e91 [qed] qedf_cleanup_fcport at ffffffffc05e7597 [qedf] qedf_rport_event_handler at ffffffffc05e7bf7 [qedf] fc_rport_work at ffffffffc02da715 [libfc] process_one_work at ffffffff8a319663 Resolve this by clamping the firmware's return value to the maximum number of legal elements the firmware should return. Fixes: `d52c89f120` ("qed*: Utilize FW 8.37.2.0") Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com> Link: https://patch.msgid.link/f8e1182934aa274c18d0682a12dbaf347595469c.1757485536.git.jamie.bainbridge@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-14 14:25:03 -07:00
Kamal Heib	af82e857df	octeon_ep: Validate the VF ID Add a helper to validate the VF ID and use it in the VF ndo ops to prevent accessing out-of-range entries. Without this check, users can run commands such as: # ip link show dev enp135s0 2: enp135s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether 00:00:00:01:01:00 brd ff:ff:ff:ff:ff:ff vf 0 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state enable, trust off vf 1 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state enable, trust off # ip link set dev enp135s0 vf 4 mac 00:00:00:00:00:14 # echo $? 0 even though VF 4 does not exist, which results in silent success instead of returning an error. Fixes: `8a241ef9b9` ("octeon_ep: add ndo ops for VFs in PF driver") Signed-off-by: Kamal Heib <kheib@redhat.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250911223610.1803144-1-kheib@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-14 13:12:03 -07:00
David Howells	2429a19764	rxrpc: Fix untrusted unsigned subtract Fix the following Smatch static checker warning: net/rxrpc/rxgk_app.c:65 rxgk_yfs_decode_ticket() warn: untrusted unsigned subtract. 'ticket_len - 10 * 4' by prechecking the length of what we're trying to extract in two places in the token and decoding for a response packet. Also use sizeof() on the struct we're extracting rather specifying the size numerically to be consistent with the other related statements. Fixes: `9d1d2b5934` ("rxrpc: rxgk: Implement the yfs-rxgk security class (GSSAPI)") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lists.infradead.org/pipermail/linux-afs/2025-September/010135.html Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/2039268.1757631977@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-14 13:05:22 -07:00
David Howells	64863f4ca4	rxrpc: Fix unhandled errors in rxgk_verify_packet_integrity() rxgk_verify_packet_integrity() may get more errors than just -EPROTO from rxgk_verify_mic_skb(). Pretty much anything other than -ENOMEM constitutes an unrecoverable error. In the case of -ENOMEM, we can just drop the packet and wait for a retransmission. Similar happens with rxgk_decrypt_skb() and its callers. Fix rxgk_decrypt_skb() or rxgk_verify_mic_skb() to return a greater variety of abort codes and fix their callers to abort the connection on any error apart from -ENOMEM. Also preclear the variables used to hold the abort code returned from rxgk_decrypt_skb() or rxgk_verify_mic_skb() to eliminate uninitialised variable warnings. Fixes: `9d1d2b5934` ("rxrpc: rxgk: Implement the yfs-rxgk security class (GSSAPI)") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lists.infradead.org/pipermail/linux-afs/2025-April/009739.html Closes: https://lists.infradead.org/pipermail/linux-afs/2025-April/009740.html Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/2038804.1757631496@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-14 13:05:22 -07:00
Ivan Vecera	70d99623d5	dpll: fix clock quality level reporting The DPLL_CLOCK_QUALITY_LEVEL_ITU_OPT1_EPRC is not reported via netlink due to bug in dpll_msg_add_clock_quality_level(). The usage of DPLL_CLOCK_QUALITY_LEVEL_MAX for both DECLARE_BITMAP() and for_each_set_bit() is not correct because these macros requires bitmap size and not the highest valid bit in the bitmap. Use correct bitmap size to fix this issue. Fixes: `a1afb959ad` ("dpll: add clock quality level attribute and op") Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Link: https://patch.msgid.link/20250912093331.862333-1-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-14 13:03:40 -07:00

1 2 3 4 5 ...

1383202 Commits