linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-16 10:11:38 -04:00

Author	SHA1	Message	Date
Cosmin Ratiu	50690733db	net/mlx5e: psp: Expose only a fully initialized priv->psp Currently, during PSP init, priv->psp is initialized to an incompletely built psp struct. Additionally, on fs init failure priv->psp is reset to NULL. Change this so that only a fully initialized priv->psp is set, which makes the code easier to reason about in failure scenarios. Fixes: `af2196f494` ("net/mlx5e: Implement PSP operations .assoc_add and .assoc_del") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260504181100.269334-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-05 19:09:04 -07:00
Cosmin Ratiu	ae9582cd0b	net/mlx5e: psp: Fix invalid access on PSP dev registration fail priv->psp->psp is initialized with the PSP device as returned by psp_dev_create(). This could also return an error, in which case a future psp_dev_unregister() will result in unpleasantness. Avoid that by using a local variable and only saving the PSP device when registration succeeds. In case psp_dev_create() fails, priv->psp and steering structs are left in place, but they will be inert. The unchecked access of priv->psp in mlx5e_psp_offload_handle_rx_skb() won't happen because without a PSP device, there can be no SAs added and therefore no packets will be successfully decrypted and be handed off to the SW handler. Fixes: `89ee2d92f6` ("net/mlx5e: Support PSP offload functionality") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260504181100.269334-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-05 19:09:04 -07:00
Pavitra Jha	0e7c074cfc	net: wwan: t7xx: validate port_count against message length in t7xx_port_enum_msg_handler t7xx_port_enum_msg_handler() uses the modem-supplied port_count field as a loop bound over port_msg->data[] without checking that the message buffer contains sufficient data. A modem sending port_count=65535 in a 12-byte buffer triggers a slab-out-of-bounds read of up to 262140 bytes. Add a sizeof(*port_msg) check before accessing the port message header fields to guard against undersized messages. Add a struct_size() check after extracting port_count and before the loop. In t7xx_parse_host_rt_data(), guard the rt_feature header read with a remaining-buffer check before accessing data_len, validate feat_data_len against the actual remaining buffer to prevent OOB reads and signed integer overflow on offset. Pass msg_len from both call sites: skb->len at the DPMAIF path after skb_pull(), and the validated feat_data_len at the handshake path. Fixes: `da45d2566a` ("net: wwan: t7xx: Add control port") Cc: stable@vger.kernel.org Signed-off-by: Pavitra Jha <jhapavitra98@gmail.com> Link: https://patch.msgid.link/20260501110713.145563-1-jhapavitra98@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-05 19:05:11 -07:00
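The two length checks described in the t7xx commit above can be sketched in userspace C. This is only an illustration under stated assumptions: `struct port_msg_sketch` and `port_msg_len_ok()` are hypothetical names, the 12-byte header and 4-byte entries are inferred from the figures quoted in the commit message, and the real driver uses `struct_size()` on its own structure.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's port message: a 12-byte header
 * followed by port_count 32-bit entries. Not the real t7xx layout. */
struct port_msg_sketch {
	uint32_t head_pattern;
	uint32_t info;          /* low 16 bits carry port_count in this sketch */
	uint32_t tail_pattern;
	uint32_t data[];        /* port_count entries follow the header */
};

/* Return true only if msg_len bytes can hold the header and every entry. */
static bool port_msg_len_ok(const void *buf, size_t msg_len)
{
	const struct port_msg_sketch *msg = buf;
	size_t port_count;

	/* 1. The header must fit before any of its fields is read. */
	if (msg_len < sizeof(*msg))
		return false;

	port_count = msg->info & 0xffff;

	/* 2. All entries must fit in the bytes after the header. Dividing the
	 * remainder avoids a multiplication that could overflow. */
	return port_count <= (msg_len - sizeof(*msg)) / sizeof(msg->data[0]);
}
```

With this check, a 12-byte message that claims 65535 ports is rejected before the loop over `data[]` would start.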
Eric Dumazet	f83e07b292	net/sched: sch_fq_codel: annotate data-races from fq_codel_dump_class_stats() fq_codel_dump_class_stats() acquires qdisc spinlock only when requested to follow flow->head chain. As we did in sch_cake recently, add the missing READ_ONCE()/WRITE_ONCE() annotations. Fixes: `edb09eb17e` ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260504163842.1162001-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-05 18:01:28 -07:00
Jakub Kicinski	40aa9fcea0	Merge tag 'nf-26-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== IPVS fixes for net The following batch contains IPVS fixes for net to address issues from the latest net-next pull request. Julian Anastasov made the following summary: 1-3) Fixes for the recently added resizable hash tables 4) dest from trash can be leaked if ip_vs_start_estimator() fails 5) fixed races and locking for the estimation kthreads 6) fix for wrong roundup_pow_of_two() usage in the resizable hash tables 7-8) v2 of the changes from Waiman Long to properly guard against the housekeeping_cpumask() updates: https://lore.kernel.org/netfilter-devel/20260331165015.2777765-1-longman@redhat.com/ I added missing Fixes tag. The original description: Since commit `041ee6f372` ("kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management"), the HK_TYPE_KTHREAD housekeeping cpumask may no longer be correct in showing the actual CPU affinity of kthreads that have no predefined CPU affinity. As the ipvs networking code is still using HK_TYPE_KTHREAD, we need to make HK_TYPE_KTHREAD reflect the reality. This patch series makes HK_TYPE_KTHREAD an alias of HK_TYPE_DOMAIN and uses RCU to protect access to the HK_TYPE_KTHREAD housekeeping cpumask. Julian plans to post a nf-next patch to limit the connections by using "conn_max" sysctl. With Simon Horman, they agreed that this is an old problem that we do not have a limit of connections and it is not a stopper for this patchset. 
* tag 'nf-26-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: sched/isolation: Make HK_TYPE_KTHREAD an alias of HK_TYPE_DOMAIN ipvs: Guard access of HK_TYPE_KTHREAD cpumask with RCU ipvs: fix shift-out-of-bounds in ip_vs_rht_desired_size ipvs: fix races around est_mutex and est_cpulist ipvs: do not leak dest after get from dest trash ipvs: fix the spin_lock usage for RT build ipvs: fix races around the conn_lfactor and svc_lfactor sysctl vars ipvs: fixes for the new ip_vs_status info ==================== Link: https://patch.msgid.link/20260505001648.360569-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-05 17:55:25 -07:00
Jakub Kicinski	561a22d979	Merge branch 'bnxt_en-bug-fixes' Pavan Chebbi says: ==================== bnxt_en: Bug fixes This patchset adds the following fixes for bnxt: Patch #1 fixes DPC AER handling to make it more reliable Patch #2 fixes incorrect capping bp->max_tpa based on what the FW supports Patch #3 fixes ignoring of VNIC configuration result when RDMA driver is loading Patch #4 fixes logic to make phase adjustment on the PPS OUT signal ==================== Link: https://patch.msgid.link/20260504083611.1383776-1-pavan.chebbi@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-05 17:36:17 -07:00
Pavan Chebbi	bd279e104e	bnxt_en: Use absolute target ns from ptp_clock_request There is no need to calculate the target PHC cycles required to make phase adjustment on the PPS OUT signal. This is because the application supplies absolute n_sec value in the future and is already the actual desired target value. Remove the unnecessary code. Fixes: `9e518f2580` ("bnxt_en: 1PPS functions to configure TSIO pins") Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Cc: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Tested-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20260504083611.1383776-5-pavan.chebbi@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-05 17:36:15 -07:00
Kalesh AP	16517bc98a	bnxt_en: Check return value of bnxt_hwrm_vnic_cfg When the bnxt RDMA driver is loaded, it calls bnxt_register_dev(). As part of this, driver sends HWRM_VNIC_CFG firmware command to configure the VNIC to operate in dual VNIC mode. Currently the driver ignores the result of this firmware command. The RDMA driver must know the result since it affects its functioning. Check return value of call to bnxt_hwrm_vnic_cfg() in bnxt_register_dev() and return failure on error. Fixes: `a588e4580a` ("bnxt_en: Add interface to support RDMA driver.") Reviewed-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Link: https://patch.msgid.link/20260504083611.1383776-4-pavan.chebbi@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-05 17:36:14 -07:00
Michael Chan	54c28fab2f	bnxt_en: Set bp->max_tpa according to what the FW supports Fix the logic to set bp->max_tpa no higher than what the FW supports. On P5 chips, some older FW sets max_tpa very low so we override it to prevent performance regressions with the older FW. Fixes: `79632e9ba3` ("bnxt_en: Expand bnxt_tpa_info struct to support 57500 chips.") Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Reviewed-by: Colin Winegarden <colin.winegarden@broadcom.com> Reviewed-by: Rukhsana Ansari <rukhsana.ansari@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Link: https://patch.msgid.link/20260504083611.1383776-3-pavan.chebbi@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-05 17:36:14 -07:00
Michael Chan	07f4443335	bnxt_en: Delay for 5 seconds after AER DPC for all chips The FW on all chips is requiring a 5-second delay after Downstream Port Containment (DPC) AER. The previously added 900 msec delay was not long enough in all cases because the chip's CRS (Configuration Request Retry Status) mechanism is not always reliable. Fixes: `d5ab32e9b0` ("bnxt_en: Add delay to handle Downstream Port Containment (DPC) AER") Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Link: https://patch.msgid.link/20260504083611.1383776-2-pavan.chebbi@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-05 17:36:14 -07:00
Kuniyuki Iwashima	5ad509c1fd	ipv6: Fix null-ptr-deref in fib6_mtu(). syzbot reported null-ptr-deref in fib6_mtu(). [0] When res->f6i->fib6_pmtu is 0 in fib6_mtu(), it fetches MTU from __in6_dev_get(nh->fib_nh_dev)->cnf.mtu6. However, __in6_dev_get() could return NULL when the device is being unregistered. Let's return 0 MTU if __in6_dev_get() returns NULL in fib6_mtu(). [0]: Oops: general protection fault, probably for non-canonical address 0xdffffc00000000bc: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x00000000000005e0-0x00000000000005e7] CPU: 0 UID: 0 PID: 7890 Comm: syz.2.502 Tainted: G L syzkaller #0 PREEMPT(full) Tainted: [L]=SOFTLOCKUP Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:fib6_mtu net/ipv6/route.c:1648 [inline] RIP: 0010:rt6_insert_exception+0x9eb/0x10a0 net/ipv6/route.c:1753 Code: 3b 14 cf f7 45 85 f6 0f 85 1d 02 00 00 e8 7d 19 cf f7 48 8d bb e0 05 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 89 RSP: 0000:ffffc9000610f120 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffc9000c001000 RDX: 00000000000000bc RSI: ffffffff8a38bc83 RDI: 00000000000005e0 RBP: ffff888052f06000 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffff888042d16c00 R13: ffff888042d16cc8 R14: 0000000000000001 R15: 0000000000000500 FS: 0000000000000000(0000) GS:ffff88809717d000(0063) knlGS:00000000f540db40 CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 CR2: 00000000f73c6d50 CR3: 000000006eff0000 CR4: 0000000000352ef0 Call Trace: <TASK> __ip6_rt_update_pmtu+0x555/0xd60 net/ipv6/route.c:2982 ip6_update_pmtu+0x34f/0x3b0 net/ipv6/route.c:3014 icmpv6_err+0x2a2/0x3f0 net/ipv6/icmp.c:82 icmpv6_notify+0x35e/0x820 net/ipv6/icmp.c:1087 icmpv6_rcv+0x10bf/0x1ae0 net/ipv6/icmp.c:1228 ip6_protocol_deliver_rcu+0xf97/0x1500 net/ipv6/ip6_input.c:478 ip6_input_finish+0x1e4/0x4a0 
net/ipv6/ip6_input.c:529 NF_HOOK include/linux/netfilter.h:318 [inline] NF_HOOK include/linux/netfilter.h:312 [inline] ip6_input+0x105/0x2f0 net/ipv6/ip6_input.c:540 ip6_mc_input+0x513/0xf50 net/ipv6/ip6_input.c:630 dst_input include/net/dst.h:480 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:119 [inline] NF_HOOK include/linux/netfilter.h:318 [inline] NF_HOOK include/linux/netfilter.h:312 [inline] ipv6_rcv+0x34c/0x3d0 net/ipv6/ip6_input.c:351 __netif_receive_skb_one_core+0x12d/0x1e0 net/core/dev.c:6202 __netif_receive_skb+0x1f/0x120 net/core/dev.c:6315 netif_receive_skb_internal net/core/dev.c:6401 [inline] netif_receive_skb+0x13b/0x7f0 net/core/dev.c:6460 tun_rx_batched.isra.0+0x3f6/0x750 drivers/net/tun.c:1511 tun_get_user+0x1e31/0x3c20 drivers/net/tun.c:1955 tun_chr_write_iter+0xdc/0x200 drivers/net/tun.c:2001 new_sync_write fs/read_write.c:595 [inline] vfs_write+0x6ac/0x1070 fs/read_write.c:688 ksys_write+0x12a/0x250 fs/read_write.c:740 do_syscall_32_irqs_on arch/x86/entry/syscall_32.c:83 [inline] do_int80_emulation+0x141/0x700 arch/x86/entry/syscall_32.c:172 asm_int80_emulation+0x1a/0x20 arch/x86/include/asm/idtentry.h:621 RIP: 0023:0xf715616b Code: 57 56 53 8b 44 24 14 f6 00 08 75 23 8b 44 24 18 8b 5c 24 1c 8b 4c 24 20 8b 54 24 24 8b 74 24 28 8b 7c 24 2c 8b 6c 24 30 cd 80 <5b> 5e 5f 5d c3 5b 5e 5f 5d e9 f7 a1 ff ff 66 90 66 90 66 90 90 53 RSP: 002b:00000000f540d44c EFLAGS: 00000246 ORIG_RAX: 0000000000000004 RAX: ffffffffffffffda RBX: 00000000000000c8 RCX: 0000000080000640 RDX: 000000000000007a RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000292 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> Fixes: `dcd1f57295` ("net/ipv6: Remove fib6_idev") Reported-by: syzbot+01f005f9c6387ca6f6dd@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/69f83f22.170a0220.13cc2.0004.GAE@google.com/ Signed-off-by: 
Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260504064316.3820775-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-05 17:32:57 -07:00
Alyssa Ross	901a7d9e2f	ipv6: default IPV6_SIT to m This basically defaulted to m until recently, since IPV6 defaulted to m. Since IPV6 was changed to a boolean with a default of y, IPV6_SIT started defaulting to built-in as well. This results in a surprise sit0 device by default for defconfig (and defconfig-derived config) users at boot. For me, this broke an (admittedly non-robust) script. Preserve the behaviour of most configs by avoiding building this module, that's probably overall seldom used compared to IPv6 as a whole, into the kernel. Fixes: `309b905dee` ("ipv6: convert CONFIG_IPV6 to built-in only and clean up Kconfigs") Signed-off-by: Alyssa Ross <hi@alyssa.is> Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Link: https://patch.msgid.link/20260503192515.290900-2-hi@alyssa.is Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-05 17:31:51 -07:00
Dipayaan Roy	95084f1883	net: mana: Fix crash from unvalidated SHM offset read from BAR0 during FLR During Function Level Reset recovery, the MANA driver reads hardware BAR0 registers that may temporarily contain garbage values. The SHM (Shared Memory) offset read from GDMA_REG_SHM_OFFSET is used to compute gc->shm_base, which is later dereferenced via readl() in mana_smc_poll_register(). If the hardware returns an unaligned or out-of-range value, the driver must not blindly use it, as this would propagate the hardware error into a kernel crash. The following crash was observed on an arm64 Hyper-V guest running kernel 6.17.0-3013-azure during VF reset recovery triggered by HWC timeout. [13291.785274] Unable to handle kernel paging request at virtual address ffff8000a200001b [13291.785311] Mem abort info: [13291.785332] ESR = 0x0000000096000021 [13291.785343] EC = 0x25: DABT (current EL), IL = 32 bits [13291.785355] SET = 0, FnV = 0 [13291.785363] EA = 0, S1PTW = 0 [13291.785372] FSC = 0x21: alignment fault [13291.785382] Data abort info: [13291.785391] ISV = 0, ISS = 0x00000021, ISS2 = 0x00000000 [13291.785404] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [13291.785412] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [13291.785421] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000014df3a1000 [13291.785432] [ffff8000a200001b] pgd=1000000100438403, p4d=1000000100438403, pud=1000000100439403, pmd=0068000fc2000711 [13291.785703] Internal error: Oops: 0000000096000021 [#1] SMP [13291.830975] Modules linked in: tls qrtr mana_ib ib_uverbs ib_core xt_owner xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables cfg80211 8021q garp mrp stp llc binfmt_misc joydev serio_raw nls_iso8859_1 hid_generic aes_ce_blk aes_ce_cipher polyval_ce ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher hid_hyperv sm4 sm3_ce sha3_ce hv_netvsc hid vmgenid hyperv_keyboard hyperv_drm sch_fq_codel nvme_fabrics efi_pstore dm_multipath nfnetlink vsock_loopback 
vmw_vsock_virtio_transport_common hv_sock vmw_vsock_vmci_transport vmw_vmci vsock dmi_sysfs ip_tables x_tables autofs4 [13291.862630] CPU: 122 UID: 0 PID: 61796 Comm: kworker/122:2 Tainted: G W 6.17.0-3013-azure #13-Ubuntu VOLUNTARY [13291.869902] Tainted: [W]=WARN [13291.871901] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 01/08/2026 [13291.878086] Workqueue: events mana_serv_func [13291.880718] pstate: 62400005 (nZCv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--) [13291.884835] pc : mana_smc_poll_register+0x48/0xb0 [13291.887902] lr : mana_smc_setup_hwc+0x70/0x1c0 [13291.890493] sp : ffff8000ab79bbb0 [13291.892364] x29: ffff8000ab79bbb0 x28: ffff00410c8b5900 x27: ffff00410d630680 [13291.896252] x26: ffff004171f9fd80 x25: 000000016ed55000 x24: 000000017f37e000 [13291.899990] x23: 0000000000000000 x22: 000000016ed55000 x21: 0000000000000000 [13291.904497] x20: ffff8000a200001b x19: 0000000000004e20 x18: ffff8000a6183050 [13291.908308] x17: 0000000000000000 x16: 0000000000000000 x15: 000000000000000a [13291.912542] x14: 0000000000000004 x13: 0000000000000000 x12: 0000000000000000 [13291.916298] x11: 0000000000000000 x10: 0000000000000001 x9 : ffffc45006af1bd8 [13291.920945] x8 : ffff000151129000 x7 : 0000000000000000 x6 : 0000000000000000 [13291.925293] x5 : 000000015f214000 x4 : 000000017217a000 x3 : 000000016ed50000 [13291.930436] x2 : 000000016ed55000 x1 : 0000000000000000 x0 : ffff8000a1ffffff [13291.934342] Call trace: [13291.935736] mana_smc_poll_register+0x48/0xb0 (P) [13291.938611] mana_smc_setup_hwc+0x70/0x1c0 [13291.941113] mana_hwc_create_channel+0x1a0/0x3a0 [13291.944283] mana_gd_setup+0x16c/0x398 [13291.946584] mana_gd_resume+0x24/0x70 [13291.948917] mana_do_service+0x13c/0x1d0 [13291.951583] mana_serv_func+0x34/0x68 [13291.953732] process_one_work+0x168/0x3d0 [13291.956745] worker_thread+0x2ac/0x480 [13291.959104] kthread+0xf8/0x110 [13291.961026] ret_from_fork+0x10/0x20 [13291.963560] Code: d2807d00 
9417c551 71000673 54000220 (b9400281) [13291.967299] ---[ end trace 0000000000000000 ]--- Disassembly of mana_smc_poll_register() around the crash site: Disassembly of section .text: 00000000000047c8 <mana_smc_poll_register>: 47c8: d503201f nop 47cc: d503201f nop 47d0: d503233f paciasp 47d4: f800865e str x30, [x18], #8 47d8: a9bd7bfd stp x29, x30, [sp, #-48]! 47dc: 910003fd mov x29, sp 47e0: a90153f3 stp x19, x20, [sp, #16] 47e4: 91007014 add x20, x0, #0x1c 47e8: 5289c413 mov w19, #0x4e20 47ec: f90013f5 str x21, [sp, #32] 47f0: 12001c35 and w21, w1, #0xff 47f4: 14000008 b 4814 <mana_smc_poll_register+0x4c> 47f8: 36f801e1 tbz w1, #31, 4834 <mana_smc_poll_register+0x6c> 47fc: 52800042 mov w2, #0x2 4800: d280fa01 mov x1, #0x7d0 4804: d2807d00 mov x0, #0x3e8 4808: 94000000 bl 0 <usleep_range_state> 480c: 71000673 subs w19, w19, #0x1 4810: 54000200 b.eq 4850 <mana_smc_poll_register+0x88> 4814: b9400281 ldr w1, [x20] <-- ** CRASHED HERE *** 4818: d50331bf dmb oshld 481c: 2a0103e2 mov w2, w1 ... From the crash signature x20 = ffff8000a200001b, this address ends in 0x1b which is not 4-byte aligned, so the 'ldr w1, [x20]' instruction (readl) triggers the arm64 alignment fault (FSC = 0x21). The root cause is in mana_gd_init_vf_regs(), which computes: gc->shm_base = gc->bar0_va + mana_gd_r64(gc, GDMA_REG_SHM_OFFSET); The offset is used without any validation. The same problem exists in mana_gd_init_pf_regs() for sriov_base_off and sriov_shm_off. Fix this by validating all offsets before use: - VF: check shm_off is within BAR0, properly aligned to 4 bytes (readl requirement), and leaves room for the full 256-bit (32-byte) SMC aperture. - PF: check sriov_base_off is within BAR0, aligned to 8 bytes (readq requirement), and leaves room to safely read the sriov_shm_off register at sriov_base_off + GDMA_PF_REG_SHM_OFF. Then check sriov_shm_off leaves room for the full SMC aperture. All arithmetic uses subtraction rather than addition to avoid integer overflow on garbage values. 
Define SMC_APERTURE_SIZE (32 bytes, derived from the 256-bit aperture width). Return -EPROTO on invalid values. The existing recovery path in mana_serv_reset() already handles -EPROTO by falling through to PCI device rescan, giving the hardware another chance to present valid register values after reset. Fixes: `9bf66036d6` ("net: mana: Handle hardware recovery events when probing the device") Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com> Link: https://patch.msgid.link/afQUMClyjmBVfD+u@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-05 15:43:08 +02:00
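The VF-side validation described in the MANA commit above (alignment, range, and subtraction instead of addition) might look roughly like the following userspace sketch. `shm_offset_ok()` and its `bar_size` parameter are hypothetical; only the 32-byte aperture size and the 4-byte alignment requirement come from the commit text.

```c
#include <stdbool.h>
#include <stdint.h>

#define SMC_APERTURE_SIZE 32u  /* 256-bit aperture, per the commit text */

/* Check a shared-memory offset read from hardware against the BAR size.
 * bar_size is a hypothetical stand-in for the mapped BAR0 length. */
static bool shm_offset_ok(uint64_t shm_off, uint64_t bar_size)
{
	/* readl() needs a 4-byte aligned address. */
	if (shm_off & 3)
		return false;

	/* Keeps the subtraction below from wrapping. */
	if (bar_size < SMC_APERTURE_SIZE)
		return false;

	/* shm_off + SMC_APERTURE_SIZE could wrap on a garbage value;
	 * bar_size - SMC_APERTURE_SIZE cannot wrap after the check above. */
	return shm_off <= bar_size - SMC_APERTURE_SIZE;
}
```

An offset ending in 0x1b, as in the crash signature, fails the alignment test, and a near-maximal garbage value fails the range test without any overflow.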
Nan Li	44b550d88b	net/rds: handle zerocopy send cleanup before the message is queued A zerocopy send can fail after user pages have been pinned but before the message is attached to the sending socket. The purge path currently infers zerocopy state from rm->m_rs, so an unqueued message can be cleaned up as if it owned normal payload pages. However, zerocopy ownership is really determined by the presence of op_mmp_znotifier, regardless of whether the message has reached the socket queue. Capture op_mmp_znotifier up front in rds_message_purge() and use it as the cleanup discriminator. If the message is already associated with a socket, keep the existing completion path. Otherwise, drop the pinned page accounting directly and release the notifier before putting the payload pages. This keeps early send failure cleanup consistent with the zerocopy lifetime rules without changing the normal queued completion path. Fixes: `0cebaccef3` ("rds: zerocopy Tx support.") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Co-developed-by: Xiao Liu <lx24@stu.ynu.edu.cn> Signed-off-by: Xiao Liu <lx24@stu.ynu.edu.cn> Signed-off-by: Nan Li <tonanli66@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Reviewed-by: Allison Henderson <achender@kernel.org> Link: https://patch.msgid.link/d2ea98a6313d5467bac00f7c9fef8c7acddb9258.1777550074.git.tonanli66@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-05 15:32:40 +02:00
Paolo Abeni	0c21517ac8	Merge branch 'openvswitch-fix-self-deadlock-on-release-of-tunnel-vports' Ilya Maximets says: ==================== openvswitch: fix self-deadlock on release of tunnel vports Two patches - the fix for the actual bug and the selftest that reproduces it. I missed the self-deadlock in the original patch that introduced the issue, because testing required code modification in the ovs-vswitchd to force it to use legacy tunnel ports. I thought I made the change correctly, but apparently something went wrong and the tests were run with the standard LWT infra instead. The selftest added in this patch set will at least prevent this kind of mistake in the future. I should mention, however, that these tunnel vports are legacy and not actually used by ovs-vswitchd. RTM_NEWLINK + COLLECT_METADATA is used in conjunction with the standard OVS_VPORT_TYPE_NETDEV instead since 2017. The code to use the legacy tunnels still exists in ovs-vswitchd however, but only as a fallback for older kernels and we're planning to remove it in the next release. I'll be sending an RFC to remove support for these legacy tunnel types from the kernel, as they serve no real purpose today and only increase the uAPI surface for CVEs, but we need to fix the known bugs for stable versions. v1: https://lore.kernel.org/netdev/20260429151756.4157670-1-i.maximets@ovn.org/ ==================== Link: https://patch.msgid.link/20260430233848.440994-1-i.maximets@ovn.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-05 15:19:40 +02:00
Ilya Maximets	05416ada37	selftests: openvswitch: add tests for tunnel vport refcounting There were a few issues found with the tunnel vport types around the vport destruction code. Add some basic tests, so at least we know that they can be properly added and removed without obvious issues. The test creates an OVS datapath, adds a non-LWT tunnel port, makes sure they are created, and then removes the datapath and waits for all the ports to be gone. The dpctl script had a few bugs in the non-LWT tunnel creation code, so fixing them as well to make the testing possible: - The type of the --lwt option changed in order to properly disable it. - Removed byte order conversion for the port numbers, as the value is supposed to be in the host order. - Added missing 'gre' choice for the tunnel type. Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Link: https://patch.msgid.link/20260430233848.440994-3-i.maximets@ovn.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-05 15:19:37 +02:00
Ilya Maximets	aa69918bd4	openvswitch: vport: fix self-deadlock on release of tunnel ports vports are used concurrently and protected by RCU, so netdev_put() must happen after the RCU grace period. So, either in an RCU call or after the synchronize_net(). The rtnl_delete_link() must happen under RTNL and so can't be executed in RCU context. Calling synchronize_net() while holding RTNL is not a good idea for performance and system stability under load in general, so calling netdev_put() in RCU call is the right solution here. However, when the device is deleted, rtnl_unlock() will call netdev_run_todo() and block until all the references are gone. In the current code this means that we never reach the call_rcu() and the vport is never freed and the reference is never released, causing a self-deadlock on device removal. Fix that by moving the call_rcu() before the rtnl_unlock(), so the scheduled RCU callback will be executed when synchronize_net() is called from the rtnl_unlock()->netdev_run_todo() while the RTNL itself is already released. Fixes: `6931d21f87` ("openvswitch: defer tunnel netdev_put to RCU release") Cc: stable@vger.kernel.org Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Aaron Conole <aconole@redhat.com> Link: https://patch.msgid.link/20260430233848.440994-2-i.maximets@ovn.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-05 15:19:37 +02:00
Ilya Maximets	83861c48ba	openvswitch: vport: fix race between tunnel creation and linking When a tunnel vport is created it first creates the tunnel device, e.g., with geneve_dev_create_fb(), then it calls ovs_netdev_link() to take a reference and link it to the device that represents openvswitch datapath. The creation of the device is happening under RTNL, but then RTNL is released and re-acquired to find the device by name. It is technically possible for the tunnel device to be re-named or deleted within that window while RTNL is not held, and some other device created in its place. This will cause a non-tunnel device to be referenced in the vport and tunnel-specific functions used on it, e.g. vxlan_get_options() that directly casts the private netdev data into a struct vxlan_dev causing an invalid memory access: BUG: KASAN: slab-use-after-free in vxlan_get_options+0x323/0x3a0 vxlan_get_options+0x323/0x3a0 ovs_vport_cmd_new+0x6e3/0xd30 Fix that by taking a reference to the just created device before releasing RTNL. This ensures that the device in the vport is always the one that was just created. The search by name is only needed for a standard vport-netdev that links pre-existing devices, so that functionality and device type checks are moved to netdev_create(). It is also awkward that ovs_netdev_link() takes ownership of the vport and destroys it on failure. It doesn't know the type of the port it is dealing with, so we need to pass down the indicator that it's a tunnel, so the link can be properly deleted on failure. It's possible to refactor the logic to make the ovs_netdev_link() do only the linking part and let the callers perform a proper destruction, but it will be much more code for each legacy tunnel port type, so it is not worth it for the bug fix. 
Fixes: `614732eaa1` ("openvswitch: Use regular VXLAN net_device device") Reported-by: Yuan Tan <tanyuan98@outlook.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Reported-by: Yang Yang <n05ec@lzu.edu.cn> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Eelco Chaudron <echaudro@redhat.com> Link: https://patch.msgid.link/20260430213349.407991-1-i.maximets@ovn.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-05 15:14:33 +02:00
Paolo Abeni	6bdcbd79ad	Merge branch 'net-mana-fix-mana_destroy_rxq-cleanup-for-partial-rxq-init' Dipayaan Roy says: ==================== net: mana: Fix mana_destroy_rxq() cleanup for partial RXQ init When mana_create_rxq() fails partway through initialization (e.g. the hardware rejects the WQ object creation), the error path calls mana_destroy_rxq() to tear down a partially-initialized RXQ. This exposed several problems in the mana_destroy_rxq() path, which assumed the RXQ was always fully initialized: 1. xdp_rxq_info_unreg() was called on an unregistered xdp_rxq, triggering a WARN_ON ("Driver BUG") in net/core/xdp.c. 2. mana_destroy_wq_obj() was called with INVALID_MANA_HANDLE, sending a bogus destroy command to the hardware. 3. mana_deinit_cq() was called twice (once inside mana_destroy_rxq() and again in mana_create_rxq()'s error path), causing a use-after-free since mana_destroy_rxq() frees the rxq first. This was observed during ethtool ring parameter changes when the hardware returned an error creating the RXQ. This series makes mana_destroy_rxq() safe to call at any stage of RXQ initialization by guarding each teardown step, and removes the redundant cleanup in mana_create_rxq(). ==================== Link: https://patch.msgid.link/20260430035935.1859220-1-dipayanroy@linux.microsoft.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-05 12:16:26 +02:00
Dipayaan Roy	3985c9a56d	net: mana: remove double CQ cleanup in mana_create_rxq error path In mana_create_rxq(), the error cleanup path calls mana_destroy_rxq() followed by mana_deinit_cq(). This is incorrect for two reasons: 1. mana_destroy_rxq() already calls mana_deinit_cq() internally, so the CQ's GDMA queue is destroyed twice. 2. mana_destroy_rxq() frees the rxq via kfree(rxq) before returning. The subsequent mana_deinit_cq(apc, cq) then operates on freed memory since cq points to &rxq->rx_cq, which is embedded in the already-freed rxq structure — a use-after-free. Remove the redundant mana_deinit_cq() call from the error path since mana_destroy_rxq() already handles CQ cleanup. mana_deinit_cq() is itself safe for an uninitialized CQ as it checks for a NULL gdma_cq before proceeding. Fixes: `ca9c54d2d6` ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)") Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com> Reviewed-by: Aditya Garg <gargaditya@linux.microsoft.com> Link: https://patch.msgid.link/20260430035935.1859220-4-dipayanroy@linux.microsoft.com Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-05 12:16:23 +02:00
Dipayaan Roy	2a1c691182	net: mana: Skip WQ object destruction for uninitialized RXQ In mana_destroy_rxq(), mana_destroy_wq_obj() is called unconditionally even when the WQ object was never created (rxobj is still INVALID_MANA_HANDLE). When mana_create_rxq() fails before mana_create_wq_obj() succeeds, the error path calls mana_destroy_rxq() which sends a bogus destroy command to the hardware: mana 7870:00:00.0: HWC: Failed hw_channel req: 0x1d mana 7870:00:00.0: Failed to send mana message: -71, 0x1d mana 7870:00:00.0 eth7: Failed to destroy WQ object: -71 Guard mana_destroy_wq_obj() with an INVALID_MANA_HANDLE check so that mana_destroy_rxq() is safe to call at any stage of RXQ initialization. Fixes: `ca9c54d2d6` ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)") Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com> Link: https://patch.msgid.link/20260430035935.1859220-3-dipayanroy@linux.microsoft.com Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-05 12:16:23 +02:00
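The guard described in this commit can be illustrated with a minimal userspace sketch. It is not the mana driver's code: `demo_rxq`, `DEMO_INVALID_HANDLE`, and the call counters are hypothetical stand-ins, and the only point is that each teardown step checks whether its matching init step ever ran.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT the mana driver's real definitions. */
#define DEMO_INVALID_HANDLE UINT64_MAX

struct demo_rxq {
	uint64_t wq_handle;   /* DEMO_INVALID_HANDLE until the WQ object exists */
	bool xdp_registered;  /* set only after registration succeeds */
};

static int hw_destroy_calls;  /* destroy commands that would reach hardware */
static int xdp_unreg_calls;   /* unregister calls that would reach the XDP core */

static struct demo_rxq *demo_rxq_alloc(void)
{
	struct demo_rxq *rxq = calloc(1, sizeof(*rxq));

	if (rxq)
		rxq->wq_handle = DEMO_INVALID_HANDLE;
	return rxq;
}

/* Safe at any stage of initialization: every step checks its own state. */
static void demo_rxq_destroy(struct demo_rxq *rxq)
{
	if (!rxq)
		return;
	if (rxq->xdp_registered) {
		xdp_unreg_calls++;
		rxq->xdp_registered = false;
	}
	if (rxq->wq_handle != DEMO_INVALID_HANDLE) {
		hw_destroy_calls++;
		rxq->wq_handle = DEMO_INVALID_HANDLE;
	}
	free(rxq);
}

/* Run `steps` init steps (0: none, 1: WQ object, 2: WQ object + XDP), then tear down. */
static void demo_create_then_destroy(int steps)
{
	struct demo_rxq *rxq = demo_rxq_alloc();

	assert(rxq != NULL);
	if (steps >= 1)
		rxq->wq_handle = 42;
	if (steps >= 2)
		rxq->xdp_registered = true;
	demo_rxq_destroy(rxq);
}
```

Calling `demo_create_then_destroy(0)` leaves both counters at zero, which is the behaviour the commit wants for an RXQ whose creation failed before any object existed.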
Dipayaan Roy	e9e334f806	net: mana: check xdp_rxq registration before unreg in mana_destroy_rxq() When mana_create_rxq() fails at mana_create_wq_obj() or any step before xdp_rxq_info_reg() is called, the error path jumps to `out:` which calls mana_destroy_rxq(). mana_destroy_rxq() unconditionally calls xdp_rxq_info_unreg() on an xdp_rxq that was never registered, triggering a WARN_ON in net/core/xdp.c: mana 7870:00:00.0: HWC: Failed hw_channel req: 0xc000009a mana 7870:00:00.0 eth7: Failed to create RXQ: err = -71 Driver BUG WARNING: CPU: 442 PID: 491615 at ../net/core/xdp.c:150 xdp_rxq_info_unreg+0x44/0x70 Modules linked in: tcp_bbr xsk_diag udp_diag raw_diag unix_diag af_packet_diag netlink_diag nf_tables nfnetlink tcp_diag inet_diag binfmt_misc rpcsec_gss_krb5 nfsv3 nfs_acl auth_rpcgss nfsv4 dns_resolver nfs lockd ext4 grace crc16 iscsi_tcp mbcache fscache libiscsi_tcp jbd2 netfs rpcrdma af_packet sunrpc rdma_ucm ib_iser rdma_cm iw_cm iscsi_ibft ib_cm iscsi_boot_sysfs libiscsi rfkill scsi_transport_iscsi mana_ib ib_uverbs ib_core mana hyperv_drm(X) drm_shmem_helper intel_rapl_msr drm_kms_helper intel_rapl_common syscopyarea nls_iso8859_1 sysfillrect intel_uncore_frequency_common nls_cp437 vfat fat nfit sysimgblt libnvdimm hv_netvsc(X) hv_utils(X) fb_sys_fops hv_balloon(X) joydev fuse drm dm_mod configfs ip_tables x_tables xfs libcrc32c sd_mod nvme nvme_core nvme_common t10_pi crc64_rocksoft_generic crc64_rocksoft crc64 hid_generic serio_raw pci_hyperv(X) hv_storvsc(X) scsi_transport_fc hyperv_keyboard(X) hid_hyperv(X) pci_hyperv_intf(X) crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd hv_vmbus(X) softdog sg scsi_mod efivarfs Supported: Yes, External CPU: 442 PID: 491615 Comm: ethtool Kdump: loaded Tainted: G X 5.14.21-150500.55.136-default #1 SLE15-SP5 a627be1b53abbfd64ad16b2685e4308c52847f42 Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 07/25/2025 RIP: 
0010:xdp_rxq_info_unreg+0x44/0x70 Code: e8 91 fe ff ff c7 43 0c 02 00 00 00 48 c7 03 00 00 00 00 5b c3 cc cc cc cc e9 58 3a 1c 00 48 c7 c7 f6 5f 19 97 e8 5c a4 7e ff <0f> 0b 83 7b 0c 01 74 ca 48 c7 c7 d9 5f 19 97 e8 48 a4 7e ff 0f 0b RSP: 0018:ff3df6c8f7207818 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ff30d89f94808a80 RCX: 0000000000000027 RDX: 0000000000000000 RSI: 0000000000000002 RDI: ff30d94bdcca2908 RBP: 0000000000080000 R08: ffffffff98ed11a0 R09: ff3df6c8f72077a0 R10: dead000000000100 R11: 000000000000000a R12: 0000000000000000 R13: 0000000000002000 R14: 0000000000040000 R15: ff30d89f94800000 FS: 00007fe6d8432b80(0000) GS:ff30d94bdcc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe6d81a89b1 CR3: 00000b3b6d578001 CR4: 0000000000371ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 Call Trace: <TASK> mana_destroy_rxq+0x5b/0x2f0 [mana 267acf7006bcb696095bba4d810643d1db3b9e94] mana_create_rxq.isra.55+0x3db/0x720 [mana 267acf7006bcb696095bba4d810643d1db3b9e94] ? simple_lookup+0x36/0x50 ? current_time+0x42/0x80 ? __d_free_external+0x30/0x30 mana_alloc_queues+0x32a/0x470 [mana 267acf7006bcb696095bba4d810643d1db3b9e94] ? _raw_spin_unlock+0xa/0x30 ? d_instantiate.part.29+0x2e/0x40 ? _raw_spin_unlock+0xa/0x30 ? debugfs_create_dir+0xe4/0x140 mana_attach+0x5c/0xf0 [mana 267acf7006bcb696095bba4d810643d1db3b9e94] mana_set_ringparam+0xd5/0x1a0 [mana 267acf7006bcb696095bba4d810643d1db3b9e94] ethnl_set_rings+0x292/0x320 genl_family_rcv_msg_doit.isra.15+0x11b/0x150 genl_rcv_msg+0xe3/0x1e0 ? rings_prepare_data+0x80/0x80 ? genl_family_rcv_msg_doit.isra.15+0x150/0x150 netlink_rcv_skb+0x50/0x100 genl_rcv+0x24/0x40 netlink_unicast+0x1b6/0x280 netlink_sendmsg+0x365/0x4d0 sock_sendmsg+0x5f/0x70 __sys_sendto+0x112/0x140 __x64_sys_sendto+0x24/0x30 do_syscall_64+0x5b/0x80 ? handle_mm_fault+0xd7/0x290 ? do_user_addr_fault+0x2d8/0x740 ? 
exc_page_fault+0x67/0x150 entry_SYSCALL_64_after_hwframe+0x6b/0xd5 RIP: 0033:0x7fe6d8122f06 Code: 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 72 f3 c3 41 57 41 56 4d 89 c7 41 55 41 54 41 RSP: 002b:00007fff2b66b068 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 000055771123d2a0 RCX: 00007fe6d8122f06 RDX: 0000000000000034 RSI: 000055771123d3b0 RDI: 0000000000000003 RBP: 00007fff2b66b100 R08: 00007fe6d8203360 R09: 000000000000000c R10: 0000000000000000 R11: 0000000000000246 R12: 000055771123d350 R13: 000055771123d340 R14: 0000000000000000 R15: 00007fff2b66b2b0 </TASK> Guard the xdp_rxq_info_unreg() call with xdp_rxq_info_is_reg() so that mana_destroy_rxq() is safe to call regardless of how far initialization progressed. Fixes: `ed5356b53f` ("net: mana: Add XDP support") Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com> Link: https://patch.msgid.link/20260430035935.1859220-2-dipayanroy@linux.microsoft.com Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-05 12:16:23 +02:00
Daniel Golle	07d9958739	net: dsa: mt7530: fix .get_stats64 sleeping in atomic context The .get_stats64 callback runs in atomic context, but on MDIO-connected switches every register read acquires the MDIO bus mutex, which can sleep: [ 12.645973] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:609 [ 12.654442] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 759, name: grep [ 12.663377] preempt_count: 0, expected: 0 [ 12.667410] RCU nest depth: 1, expected: 0 [ 12.671511] INFO: lockdep is turned off. [ 12.675441] CPU: 0 UID: 0 PID: 759 Comm: grep Tainted: G S W 7.0.0+ #0 PREEMPT [ 12.675453] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN [ 12.675456] Hardware name: Bananapi BPI-R64 (DT) [ 12.675459] Call trace: [ 12.675462] show_stack+0x14/0x1c (C) [ 12.675477] dump_stack_lvl+0x68/0x8c [ 12.675487] dump_stack+0x14/0x1c [ 12.675495] __might_resched+0x14c/0x220 [ 12.675504] __might_sleep+0x44/0x80 [ 12.675511] __mutex_lock+0x50/0xb10 [ 12.675523] mutex_lock_nested+0x20/0x30 [ 12.675532] mt7530_get_stats64+0x40/0x2ac [ 12.675542] dsa_user_get_stats64+0x2c/0x40 [ 12.675553] dev_get_stats+0x44/0x1e0 [ 12.675564] dev_seq_printf_stats+0x24/0xe0 [ 12.675575] dev_seq_show+0x14/0x3c [ 12.675583] seq_read_iter+0x37c/0x480 [ 12.675595] seq_read+0xd0/0xec [ 12.675605] proc_reg_read+0x94/0xe4 [ 12.675615] vfs_read+0x98/0x29c [ 12.675625] ksys_read+0x54/0xdc [ 12.675633] __arm64_sys_read+0x18/0x20 [ 12.675642] invoke_syscall.constprop.0+0x54/0xec [ 12.675653] do_el0_svc+0x3c/0xb4 [ 12.675662] el0_svc+0x38/0x200 [ 12.675670] el0t_64_sync_handler+0x98/0xdc [ 12.675679] el0t_64_sync+0x158/0x15c For MDIO-connected switches, poll MIB counters asynchronously using a delayed workqueue every second and let .get_stats64 return the cached values under a spinlock. A mod_delayed_work() call on each read triggers an immediate refresh so counters stay responsive when queried more frequently. 
MMIO-connected switches (MT7988, EN7581, AN7583) are not affected because their regmap does not sleep, so they continue to read MIB counters directly in .get_stats64. Fixes: `88c810f35e` ("net: dsa: mt7530: implement .get_stats64") Signed-off-by: Daniel Golle <daniel@makrotopia.org> Acked-by: Chester A. Unal <chester.a.unal@arinc9.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/6940b913da2c29156f0feff74b678d3c526ee84c.1777719253.git.daniel@makrotopia.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-04 19:28:54 -07:00
Kuniyuki Iwashima	a6039776c7	ipmr: Add __rcu to netns_ipv4.mrt. kernel test robot reported this Sparse warning: $ make C=1 net/ipv4/ipmr.o net/ipv4/ipmr.c:312:24: error: incompatible types in comparison expression (different address spaces): net/ipv4/ipmr.c:312:24: struct mr_table [noderef] __rcu * net/ipv4/ipmr.c:312:24: struct mr_table * Let's add __rcu annotation to netns_ipv4.mrt. Fixes: `b3b6babf47` ("ipmr: Free mr_table after RCU grace period.") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202605030032.glNApko7-lkp@intel.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260502180755.359554-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-04 19:26:13 -07:00
David Carlier	30cb24f97d	psp: strip variable-length PSP header in psp_dev_rcv() psp_dev_rcv() unconditionally removes a fixed PSP_ENCAP_HLEN, even when psph->hdrlen indicates that the PSP header carries optional fields. A frame whose PSP header advertises a non-zero VC or any extension would therefore be silently mis-decapsulated: option bytes would spill into the inner packet head and downstream parsing would fail on a corrupted skb. Compute the full PSP header length from psph->hdrlen, pull the optional bytes into the linear region, and strip the whole header when decapsulating. Optional fields (VC, ...) are still ignored, just discarded with the rest of the header instead of leaking. crypt_offset and the VIRT flag are intentionally not validated here - callers know their device's PSP implementation and can decide. Both in-tree callers gate on hardware-validated PSP, so this is a correctness fix rather than a reachable corruption path under current configurations. Fixes: `0eddb8023c` ("psp: provide decapsulation and receive helper for drivers") Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Daniel Zahka <daniel.zahka@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: David Carlier <devnexen@gmail.com> Link: https://patch.msgid.link/20260502141945.14484-1-devnexen@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-04 19:25:14 -07:00
Eric Dumazet	ac0841d7d2	net: prevent possible UAF in rtnl_prop_list_size() I was mistaken by synchronize_rcu() [1] call in netdev_name_node_alt_destroy(), giving a false sense of RCU safety at delete times. We have to use list_del_rcu() to not confuse potential readers in rtnl_prop_list_size(). [1] This synchronize_rcu() call was later removed in commit `723de3ebef` ("net: free altname using an RCU callback"). Fixes: `9f30831390` ("net: add rcu safety to rtnl_prop_list_size()") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260502124102.499204-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-04 19:24:27 -07:00
Jakub Kicinski	9d7ebff0c3	Merge branch 'mptcp-misc-fixes-for-v7-1-rc3' Matthieu Baerts says: ==================== mptcp: misc fixes for v7.1-rc3 Here are various unrelated fixes: - Patch 1: increment the right MIB counter. A fix for v5.7. - Patch 2: set the right MPTCP reset reason. A fix for v5.9. - Patch 3: fix rx timestamp corruption when on MPTCP passive fastopen. A fix for v6.2. - Patch 4: increase sockopt seq after having set TCP_MAXSEG to propagate it to newer subflows later. A fix for 6.17. ==================== Link: https://patch.msgid.link/20260501-net-mptcp-misc-fixes-7-1-rc3-v1-0-b70118df778e@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-04 19:20:52 -07:00
Matthieu Baerts (NGI0)	70ece9d702	mptcp: sockopt: increase seq in mptcp_setsockopt_all_sf mptcp_setsockopt_all_sf() was missing a call to sockopt_seq_inc(). This is required not to cause missing synchronization for newer subflows created later on. This helper is called each time a socket option is set on subflows, and future ones will need to inherit this option after their creation. Fixes: `51c5fd09e1` ("mptcp: add TCP_MAXSEG sockopt support") Cc: stable@vger.kernel.org Suggested-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260501-net-mptcp-misc-fixes-7-1-rc3-v1-4-b70118df778e@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-04 19:20:50 -07:00
Paolo Abeni	6254a16d6f	mptcp: fix rx timestamp corruption on fastopen The skb cb offset containing the timestamp presence flag is cleared before loading such information. Cache such value before MPTCP CB initialization. Fixes: `36b122baf6` ("mptcp: add subflow_v(4,6)_send_synack()") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260501-net-mptcp-misc-fixes-7-1-rc3-v1-3-b70118df778e@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-04 19:20:50 -07:00
Shardul Bankar	a6da02d4c0	mptcp: use MPTCP_RST_EMPTCP for ACK HMAC validation failure When HMAC validation fails on a received ACK + MP_JOIN in subflow_syn_recv_sock(), the subflow is reset with reason MPTCP_RST_EPROHIBIT ("Administratively prohibited"). This is incorrect: HMAC validation failure is an MPTCP protocol-level error, not an administrative policy denial. The mirror site on the client, in subflow_finish_connect(), already uses MPTCP_RST_EMPTCP ("MPTCP-specific error") for the same kind of HMAC failure on the SYN/ACK + MP_JOIN. Use the same reason on the server side for symmetry and accuracy. Suggested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Fixes: `443041deb5` ("mptcp: fix NULL pointer in can_accept_new_subflow") Cc: stable@vger.kernel.org Signed-off-by: Shardul Bankar <shardul.b@mpiricsoftware.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260501-net-mptcp-misc-fixes-7-1-rc3-v1-2-b70118df778e@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-04 19:20:50 -07:00
Shardul Bankar	c4a99a9219	mptcp: use MPJoinSynAckHMacFailure for SynAck HMAC failure In subflow_finish_connect(), HMAC validation of the server's HMAC in SYN/ACK + MP_JOIN increments MPTCP_MIB_JOINACKMAC ("HMAC was wrong on ACK + MP_JOIN") on failure. The function processes the SYN/ACK, not the ACK; the matching MPTCP_MIB_JOINSYNACKMAC counter ("HMAC was wrong on SYN/ACK + MP_JOIN") exists but is not incremented anywhere in the tree. The mirror site on the server, subflow_syn_recv_sock(), already uses JOINACKMAC correctly for ACK HMAC failure. Use JOINSYNACKMAC at the SYN/ACK validation site so each counter reflects the packet whose HMAC actually failed. Suggested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Fixes: `fc518953bc` ("mptcp: add and use MIB counter infrastructure") Cc: stable@vger.kernel.org Signed-off-by: Shardul Bankar <shardul.b@mpiricsoftware.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260501-net-mptcp-misc-fixes-7-1-rc3-v1-1-b70118df778e@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-04 19:20:50 -07:00
Eric Dumazet	059b7dbd20	vsock/virtio: fix potential unbounded skb queue virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc. virtio_transport_recv_enqueue() skips coalescing for packets with VIRTIO_VSOCK_SEQ_EOM. If fed with packets with len == 0 and VIRTIO_VSOCK_SEQ_EOM, a very large number of packets can be queued because vvs->rx_bytes stays at 0. Fix this by estimating the skb metadata size: (Number of skbs in the queue) * SKB_TRUESIZE(0) Fixes: `0777061657` ("virtio/vsock: don't use skbuff state to account credit") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Stefano Garzarella <sgarzare@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: "Eugenio Pérez" <eperezma@redhat.com> Cc: virtualization@lists.linux.dev Link: https://patch.msgid.link/20260430122653.554058-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-04 19:12:37 -07:00
Markus Baier	36bdc0e815	net: usb: asix: ax88772: re-add usbnet_link_change() in phylink callbacks Commit `e0bffe3e68` ("net: asix: ax88772: migrate to phylink") replaced the asix_adjust_link() PHY callback with phylink's mac_link_up() and mac_link_down() handlers, but did not carry over the usbnet_link_change() notification that commit `805206e66f` ("net: asix: fix "can't send until first packet is send" issue") had added. As a result, the original symptom returns: when the link comes up, usbnet is never notified, so the RX URB submission stays dormant until some other event (e.g. a transmitted packet triggering the status endpoint interrupt) wakes it up. This is reproducible with the Apple A1277 USB Ethernet Adapter (05ac:1402, AX88772A based) on a Banana Pro using a static IPv4 configuration. After bringing the interface up, no incoming packets are received until the first outgoing frame triggers usbnet's RX path. Restore the link change notification, gated on a carrier transition so the call remains idempotent if the status endpoint also reports the change later. Fixes: `e0bffe3e68` ("net: asix: ax88772: migrate to phylink") Signed-off-by: Markus Baier <Markus.Baier@soslab.tu-darmstadt.de> Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20260501163941.107668-1-Markus.Baier@soslab.tu-darmstadt.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-04 19:04:15 -07:00
Breno Leitao	76b93a8107	netpoll: pass buffer size to egress_dev() to avoid MAC truncation egress_dev() formats np->dev_mac via snprintf() but receives buf as a bare char *, so it cannot derive the buffer size from the pointer. The size argument was hardcoded to MAC_ADDR_STR_LEN (3 * ETH_ALEN - 1 = 17), which is wrong in two ways: 1) misleading kernel log output on the MAC-selected target path (np->dev_name[0] == '\0'); for example "aa:bb:cc:dd:ee:ff doesn't exist, aborting" was logged as "aa:bb:cc:dd:ee:f doesn't exist, aborting". 2) the second argument of snprintf is the size of the buffer, not the size of what you want to write. Add a bufsz parameter to egress_dev() and pass sizeof(buf) from each caller, matching the standard snprintf() idiom and removing the hardcoded size from the helper. Every caller already declares "char buf[MAC_ADDR_STR_LEN + 1]" so the formatted MAC continues to fit. Tested by booting with netconsole=6665@/aa:bb:cc:dd:ee:ff,6666@10.0.0.1/00:11:22:33:44:55 on a kernel without a matching device. Pre-fix dmesg shows "aa:bb:cc:dd:ee:f doesn't exist, aborting"; post-fix shows the full "aa:bb:cc:dd:ee:ff doesn't exist, aborting". Fixes: `f8a10bed32` ("netconsole: allow selection of egress interface via MAC address") Cc: stable@vger.kernel.org Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20260501-netpoll_snprintf_fix-v1-1-84b0566e6597@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-04 18:37:25 -07:00
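The truncation described in this commit follows directly from how snprintf() treats its size argument, and can be reproduced in userspace. The sketch below is not the netpoll code: `demo_format_mac()` and the `DEMO_` constants are hypothetical names that mirror the values quoted in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DEMO_ETH_ALEN 6
#define DEMO_MAC_STR_LEN (3 * DEMO_ETH_ALEN - 1) /* 17 visible characters */

/* Format a MAC address; bufsz is the size of buf, including room for the NUL. */
static int demo_format_mac(char *buf, size_t bufsz, const uint8_t mac[DEMO_ETH_ALEN])
{
	return snprintf(buf, bufsz, "%02x:%02x:%02x:%02x:%02x:%02x",
			mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
}

/* Format aa:bb:cc:dd:ee:ff with the given size argument; return the visible length. */
static size_t demo_formatted_len(size_t bufsz)
{
	static const uint8_t mac[DEMO_ETH_ALEN] = { 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff };
	char buf[DEMO_MAC_STR_LEN + 1];

	assert(bufsz > 0 && bufsz <= sizeof(buf));
	demo_format_mac(buf, bufsz, mac);
	return strlen(buf);
}
```

Passing `DEMO_MAC_STR_LEN` as the size leaves room for only 16 characters plus the terminating NUL, so the last hex digit is dropped; passing `DEMO_MAC_STR_LEN + 1` (the `sizeof(buf)` idiom) keeps all 17.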
Kuniyuki Iwashima	d82ba05263	af_unix: Set gc_in_progress to true in unix_gc(). Igor Ushakov reported that unix_gc() could run with gc_in_progress being false if the work is scheduled while running: Thread 1 Thread 2 Thread 3 -------- -------- -------- unix_schedule_gc() unix_schedule_gc() `- if (!gc_in_progress) `- if (!gc_in_progress) \|- gc_in_progress = true \| `- queue_work() \| unix_gc() <----------------/ \| \| \|- gc_in_progress = true ... `- queue_work() \| \| `- gc_in_progress = false \| \| unix_gc() <---------------------------------------------' \| ... /* gc_in_progress == false */ \| `- gc_in_progress = false unix_peek_fpl() relies on gc_in_progress not to confuse GC by MSG_PEEK. Let's set gc_in_progress to true in unix_gc(). Fixes: `8b90a9f819` ("af_unix: Run GC on only one CPU.") Reported-by: Igor Ushakov <sysroot314@gmail.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260501073945.1884564-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-04 18:34:45 -07:00
Waiman Long	8f78b749f3	sched/isolation: Make HK_TYPE_KTHREAD an alias of HK_TYPE_DOMAIN Since commit `041ee6f372` ("kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management"), kthreads default to use the HK_TYPE_DOMAIN cpumask. IOW, it is no longer affected by the setting of the nohz_full boot kernel parameter. That means HK_TYPE_KTHREAD should now be an alias of HK_TYPE_DOMAIN instead of HK_TYPE_KERNEL_NOISE to correctly reflect the current kthread behavior. Make the change as HK_TYPE_KTHREAD is still being used in some networking code. Fixes: `041ee6f372` ("kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-05 01:52:55 +02:00
Waiman Long	aa60652069	ipvs: Guard access of HK_TYPE_KTHREAD cpumask with RCU The ip_vs_ctl.c file and the associated ip_vs.h file are the only places in the kernel where the HK_TYPE_KTHREAD cpumask is retrieved and used. Now that the HK_TYPE_KTHREAD/HK_TYPE_DOMAIN cpumask can be changed at run time, we need to use RCU to guard access to this cpumask to avoid a potential UAF problem, as the returned cpumask may be freed before it is used. We can replace HK_TYPE_KTHREAD by HK_TYPE_DOMAIN as they are aliases of each other, but keeping the HK_TYPE_KTHREAD name can highlight the fact that it is the kthread initiated by ipvs that is being controlled. Fixes: `03ff735101` ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-05 01:52:55 +02:00
Julian Anastasov	4ee52b7021	ipvs: fix shift-out-of-bounds in ip_vs_rht_desired_size Calling roundup_pow_of_two() with 0 has undefined result: UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13 shift exponent 64 is too large for 64-bit type 'unsigned long' CPU: 1 UID: 0 PID: 77 Comm: kworker/u8:4 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026 Workqueue: events_unbound conn_resize_work_handler Call Trace: <TASK> dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 ubsan_epilogue+0xa/0x30 lib/ubsan.c:233 __ubsan_handle_shift_out_of_bounds+0x385/0x410 lib/ubsan.c:494 __roundup_pow_of_two include/linux/log2.h:57 [inline] ip_vs_rht_desired_size+0x2cf/0x410 net/netfilter/ipvs/ip_vs_core.c:240 ip_vs_conn_desired_size net/netfilter/ipvs/ip_vs_conn.c:765 [inline] conn_resize_work_handler+0x1b6/0x14c0 net/netfilter/ipvs/ip_vs_conn.c:822 process_one_work kernel/workqueue.c:3302 [inline] process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3385 worker_thread+0xa53/0xfc0 kernel/workqueue.c:3466 kthread+0x388/0x470 kernel/kthread.c:436 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> Reported-by: syzbot+217f1db9c791e27fe54a@syzkaller.appspotmail.com Fixes: `b655388111` ("ipvs: add resizable hash tables") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-05 01:52:55 +02:00
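The UBSAN report in this commit comes from rounding up 0: the helper computes 1 shifted left by the bit length of n - 1, and for n == 0 that bit length equals the full word width. The userspace sketch below reproduces the arithmetic with a guard; `demo_fls_long()` and `demo_roundup_pow_of_two()` are hypothetical names, and returning 1 for an input of 0 is an assumption made for illustration, not necessarily what the ipvs fix does.

```c
#include <assert.h>
#include <limits.h>

/* Position of the highest set bit, counting from 1; returns 0 when x == 0. */
static unsigned int demo_fls_long(unsigned long x)
{
	unsigned int bits = 0;

	while (x) {
		bits++;
		x >>= 1;
	}
	return bits;
}

/*
 * Round n up to the next power of two.
 * Without the n <= 1 guard, n == 0 would compute fls(ULONG_MAX), which is the
 * full word width, and shifting by that amount is undefined behaviour.
 */
static unsigned long demo_roundup_pow_of_two(unsigned long n)
{
	/* Larger inputs have no representable power of two above them. */
	assert(n <= (ULONG_MAX >> 1) + 1);
	if (n <= 1)
		return 1;
	return 1UL << demo_fls_long(n - 1);
}
```

A caller that may legitimately see a zero count can either clamp the input before the call or, as here, fold the clamp into the helper.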
Julian Anastasov	2fd1092389	ipvs: fix races around est_mutex and est_cpulist Sashiko reports races and a possible crash around the usage of est_cpulist_valid and sysctl_est_cpulist. The problem is that we do not lock est_mutex in some places, which can lead to wrong write ordering and, as a result, problems when calling cpumask_weight() and cpumask_empty(). Fix them by moving the est_max_threads read/write under locked est_mutex. Do the same for one ip_vs_est_reload_start() call to protect the cpumask_empty() usage of sysctl_est_cpulist. To remove the chance of deadlock while stopping the estimation kthreads, keep the data structure for kthread 0 even after the last estimator is removed and do not hold mutexes while stopping this task. Now we will use a new flag 'needed' to know when kthread 0 should run. The kthreads above 0 do not use mutexes, so stop them under est_mutex because their kthread data still can be destroyed if they do not serve estimators. Now all kthreads will be started by the est_reload_work to properly serialize the stop/start for kthread 0. Reduce the use of service_mutex in ip_vs_est_calc_phase() because under est_mutex we can safely walk est_kt_arr to stop the kthreads above slot 0. As ip_vs_stop_estimator() for tot_stats should be called under service_mutex, do it early in the netns exit path in ip_vs_flush() to avoid locking the mutex again later. It still should be called in ip_vs_control_net_cleanup_sysctl() when we are called during netns init error. Use -2 for ktid as an indicator that the estimator was already stopped. Finally, fix use-after-free for kd->est_row in ip_vs_est_calc_phase(). est->ktrow should simply switch to a delay value while the estimator is linked to est_temp_list. 
Link: https://sashiko.dev/#/patchset/20260331165015.2777765-1-longman%40redhat.com Link: https://sashiko.dev/#/patchset/20260420171308.87192-1-ja%40ssi.bg Link: https://sashiko.dev/#/patchset/20260422125123.40658-1-ja%40ssi.bg Link: https://sashiko.dev/#/patchset/20260424175858.54752-1-ja%40ssi.bg Link: https://sashiko.dev/#/patchset/20260425103918.7447-1-ja%40ssi.bg Fixes: `f0be83d542` ("ipvs: add est_cpulist and est_nice sysctl vars") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-05 01:52:55 +02:00
Julian Anastasov	fbe1e01e81	ipvs: do not leak dest after get from dest trash Sashiko warns about leaked dest if ip_vs_start_estimator() fails in ip_vs_add_dest(). Add ip_vs_trash_put_dest() to put back the dest into dest trash. Link: https://sashiko.dev/#/patchset/20260428175725.72050-1-ja%40ssi.bg Fixes: `705dd34440` ("ipvs: use kthreads for stats estimation") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-05 01:52:55 +02:00
Julian Anastasov	d493d9de1c	ipvs: fix the spin_lock usage for RT build syzbot reports for sleeping function called from invalid context [1]. The recently added code for resizable hash tables uses hlist_bl bit locks in combination with spin_lock for the connection fields (cp->lock). Fix the following problems: * avoid using spin_lock(&cp->lock) under locked bit lock because it sleeps on PREEMPT_RT * as the recent changes call ip_vs_conn_hash() only for newly allocated connection, the spin_lock can be removed there because the connection is still not linked to table and does not need cp->lock protection. * the lock can be removed also from ip_vs_conn_unlink() where we are the last connection user. * the last place that is fixed is ip_vs_conn_fill_cport() where now the cp->lock is locked before the other locks to ensure other packets do not modify the cp->flags in non-atomic way. Here we make sure cport and flags are changed only once if two or more packets race to fill the cport. Also, we fill cport early, so that if we race with resizing there will be valid cport key for the hashing. Add a warning if too many hash table changes occur for our RCU read-side critical section which is error condition but minor because the connection still can expire gracefully. Still, restore the cport to 0 to allow retransmitted packet to properly fill the cport. Problems reported by Sashiko. 
[1]:
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 16, name: ktimers/0
preempt_count: 2, expected: 0
RCU nest depth: 3, expected: 3
8 locks held by ktimers/0/16:
 #0: ffffffff8de5f260 (local_bh){.+.+}-{1:3}, at: __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163
 #1: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163
 #2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at: spin_lock include/linux/spinlock_rt.h:45 [inline]
 #2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at: timer_base_lock_expiry kernel/time/timer.c:1502 [inline]
 #2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at: __run_timer_base+0x120/0x9f0 kernel/time/timer.c:2384
 #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:300 [inline]
 #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
 #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: __rt_spin_lock kernel/locking/spinlock_rt.c:50 [inline]
 #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rt_spin_lock+0x1e0/0x400 kernel/locking/spinlock_rt.c:57
 #4: ffffc90000157a80 ((&cp->timer)){+...}-{0:0}, at: call_timer_fn+0xd4/0x5e0 kernel/time/timer.c:1745
 #5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:300 [inline]
 #5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
 #5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: ip_vs_conn_unlink net/netfilter/ipvs/ip_vs_conn.c:315 [inline]
 #5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: ip_vs_conn_expire+0x257/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260
 #6: ffffffff8de5f260 (local_bh){.+.+}-{1:3}, at: __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163
 #7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at: spin_lock include/linux/spinlock_rt.h:45 [inline]
 #7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at: ip_vs_conn_unlink net/netfilter/ipvs/ip_vs_conn.c:324 [inline]
 #7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at: ip_vs_conn_expire+0xd4a/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260
Preemption disabled at:
[<ffffffff898a6358>] bit_spin_lock include/linux/bit_spinlock.h:38 [inline]
[<ffffffff898a6358>] hlist_bl_lock+0x18/0x110 include/linux/list_bl.h:149
CPU: 0 UID: 0 PID: 16 Comm: ktimers/0 Tainted: G W L syzkaller #0 PREEMPT_{RT,(full)}
Tainted: [W]=WARN, [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/18/2026
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 __might_resched+0x329/0x480 kernel/sched/core.c:9162
 __rt_spin_lock kernel/locking/spinlock_rt.c:48 [inline]
 rt_spin_lock+0xc2/0x400 kernel/locking/spinlock_rt.c:57
 spin_lock include/linux/spinlock_rt.h:45 [inline]
 ip_vs_conn_unlink net/netfilter/ipvs/ip_vs_conn.c:324 [inline]
 ip_vs_conn_expire+0xd4a/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260
 call_timer_fn+0x192/0x5e0 kernel/time/timer.c:1748
 expire_timers kernel/time/timer.c:1799 [inline]
 __run_timers kernel/time/timer.c:2374 [inline]
 __run_timer_base+0x6a3/0x9f0 kernel/time/timer.c:2386
 run_timer_base kernel/time/timer.c:2395 [inline]
 run_timer_softirq+0xb7/0x170 kernel/time/timer.c:2405
 handle_softirqs+0x1de/0x6d0 kernel/softirq.c:622
 __do_softirq kernel/softirq.c:656 [inline]
 run_ktimerd+0x69/0x100 kernel/softirq.c:1151
 smpboot_thread_fn+0x541/0xa50 kernel/smpboot.c:160
 kthread+0x388/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>
Reported-by: syzbot+504e778ddaecd36fdd17@syzkaller.appspotmail.com
Link: https://sashiko.dev/#/patchset/20260415200216.79699-1-ja%40ssi.bg
Link: https://sashiko.dev/#/patchset/20260420165539.85174-4-ja%40ssi.bg
Link: https://sashiko.dev/#/patchset/20260422135823.50489-4-ja%40ssi.bg
Fixes: `2fa7cc9c70` ("ipvs: switch to per-net connection table")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-05 01:52:55 +02:00
Julian Anastasov	f2da9a96ab	ipvs: fix races around the conn_lfactor and svc_lfactor sysctl vars Sashiko warns that the new sysctls vars can be changed after the hash tables are destroyed and their respective resizing works canceled, leading to mod_delayed_work() being called for canceled works. Solve this in different ways. conn_tab can be present even without services and is destroyed only on netns exit, so use disable_delayed_work_sync() to disable the work instead of adding more synchronization mechanisms. As for the svc_table, it is destroyed when the services are deleted, so we must be sure that netns exit is not called yet (the check for 'enable') and the work is not canceled by checking all under same mutex lock. Also, use WRITE_ONCE when updating the sysctl vars as we already read them with READ_ONCE. Link: https://sashiko.dev/#/patchset/20260410112352.23599-1-fw%40strlen.de Fixes: `8d7de5477e` ("ipvs: add conn_lfactor and svc_lfactor sysctl vars") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-05 01:52:55 +02:00
Julian Anastasov	afbd961305	ipvs: fixes for the new ip_vs_status info
Sashiko reports some problems for the recently added /proc/net/ip_vs_status:
* ip_vs_status_show() as a table reader may run long after the conn_tab and svc_table table are released. While ip_vs_conn_flush() properly changes the conn_tab_changes counter when conn_tab is removed, ip_vs_del_service() and ip_vs_flush() were missing such change for the svc_table_changes counter. As result, readers like ip_vs_dst_event() and ip_vs_status_show() may continue to use a freed table after a cond_resched_rcu() call.
* While counting the buckets in ip_vs_status_show() make sure we traverse only the needed number of entries in the chain. This also prevents possible overflow of the 'count' variable.
* Add check for 'loops' to prevent infinite loops while restarting the traversal on table change.
* While IP_VS_CONN_TAB_MAX_BITS is 20 on 32-bit platforms and there is no risk to overflow when multiplying the number of conn_tab buckets to 100, prefer the div_u64() helper to make the following dividing safer.
* Use 0440 permissions for ip_vs_status to restrict the info only to root due to the exported information for hash distribution.
Link: https://sashiko.dev/#/patchset/20260410112352.23599-1-fw%40strlen.de
Fixes: `9a9ccef907` ("ipvs: add ip_vs_status info")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-05-05 01:52:55 +02:00
Jakub Kicinski	bd3a4795d5	selftests: tls: add test for data loss on small pipe Add selftest for data loss on short splice. Link: https://patch.msgid.link/20260429222944.2139041-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-02 18:27:14 -07:00
Jakub Kicinski	7e7be31bfd	net: tls: fix silent data drop under pipe back-pressure tls_sw_splice_read() uses len when advancing rxm->offset / rxm->full_len after skb_splice_bits(), rather than copied (the actual number of bytes successfully spliced into the pipe). When the destination pipe cannot accept all the requested bytes, splice_to_pipe() returns fewer bytes than len, and 'len - copied' of data is effectively skipped over. Fixes: `e062fe99cc` ("tls: splice_read: fix accessing pre-processed records") Link: https://patch.msgid.link/20260429222944.2139041-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-02 18:27:14 -07:00
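The accounting error described in the commit above can be sketched in userspace C. This is a toy model, not the kernel code: `struct rec_state`, `pipe_accept()` and both `splice_advance_*()` functions are hypothetical names invented for illustration, and the pipe is reduced to a byte budget (`room`). It only shows why advancing the record cursor by the requested `len` instead of the `copied` count silently skips `len - copied` bytes when the pipe is short on space.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-record state (rxm->offset / rxm->full_len). */
struct rec_state {
    size_t offset;   /* start of the unread payload within the record */
    size_t full_len; /* unread payload bytes still in the record */
};

/* Hypothetical pipe model: accepts at most `room` bytes, returns bytes taken. */
static size_t pipe_accept(size_t room, size_t want)
{
    return want < room ? want : room;
}

/* Buggy shape: the cursor moves by the requested length. */
static size_t splice_advance_by_len(struct rec_state *r, size_t len, size_t room)
{
    size_t copied;

    if (len > r->full_len)
        len = r->full_len;
    copied = pipe_accept(room, len);
    assert(copied <= len);
    r->offset += len;      /* skips len - copied bytes that never reached the pipe */
    r->full_len -= len;
    return copied;
}

/* Fixed shape: the cursor moves only by what the pipe actually took. */
static size_t splice_advance_by_copied(struct rec_state *r, size_t len, size_t room)
{
    size_t copied;

    if (len > r->full_len)
        len = r->full_len;
    copied = pipe_accept(room, len);
    assert(copied <= len);
    r->offset += copied;   /* the unspliced remainder stays readable */
    r->full_len -= copied;
    return copied;
}
```

With a 100-byte record and room for only 30 bytes, the first variant reports 30 bytes spliced but leaves `full_len` at 0, so 70 bytes are lost; the second leaves `offset == 30` and `full_len == 70` for the next call.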
Jakub Kicinski	b42f68cf04	Merge branch 'net-sched-sch_cake-annotate-data-races-in-cake_dump_class_stats-series'
Eric Dumazet says:
====================
net/sched: sch_cake: annotate data-races in cake_dump_class_stats (series)
cake_dump_class_stats() runs without qdisc spinlock being held. In this series (of two), I add READ_ONCE()/WRITE_ONCE() annotations for:
- flow->head
- flow->dropped
- b->backlogs[]
- flow->deficit
- flow->cvars.dropping
- flow->cvars.count
- flow->cvars.p_drop
- flow->cvars.blue_timer
- flow->cvars.drop_next
====================
Link: https://patch.msgid.link/20260430061610.3503483-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-02 16:59:12 -07:00
Eric Dumazet	67dc6c56b8	net/sched: sch_cake: annotate data-races in cake_dump_class_stats (II)
cake_dump_class_stats() runs without qdisc spinlock being held. In this second patch, I add READ_ONCE()/WRITE_ONCE() annotations for:
- flow->deficit
- flow->cvars.dropping
- flow->cvars.count
- flow->cvars.p_drop
- flow->cvars.blue_timer
- flow->cvars.drop_next
Fixes: `046f6fd5da` ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260430061610.3503483-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-02 16:59:09 -07:00
Eric Dumazet	046111a1a3	net/sched: sch_cake: annotate data-races in cake_dump_class_stats (I)
cake_dump_class_stats() runs without qdisc spinlock being held. In this first patch, I add READ_ONCE()/WRITE_ONCE() annotations for:
- flow->head
- flow->dropped
- b->backlogs[]
Fixes: `046f6fd5da` ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260430061610.3503483-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-02 16:59:09 -07:00
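The annotation pattern used by the two sch_cake patches above can be illustrated with a small userspace sketch. The macros below are simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE() (they assume the GCC/Clang `__typeof__` extension), and `struct toy_flow`, `struct toy_stats` and the `toy_*` functions are hypothetical, with field types chosen for illustration only. The idea shown is that the writer, which holds the lock in the real code, marks its stores, and the lockless stats reader marks its loads, so each access is a single volatile access rather than one the compiler may tear or repeat.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel macros (assumption). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical flow state; the real struct has more fields and other types. */
struct toy_flow {
    uint32_t dropped;
    int32_t deficit;
};

struct toy_stats {
    uint32_t dropped;
    int32_t deficit;
};

/* Writer side: in the real code this runs with the qdisc lock held. */
static void toy_flow_drop(struct toy_flow *f)
{
    WRITE_ONCE(f->dropped, f->dropped + 1);
}

static void toy_flow_charge(struct toy_flow *f, int32_t len)
{
    assert(len >= 0);
    WRITE_ONCE(f->deficit, f->deficit - len);
}

/* Reader side: takes no lock, so every shared field is read exactly once. */
static struct toy_stats toy_dump_stats(const struct toy_flow *f)
{
    struct toy_stats s;

    s.dropped = READ_ONCE(f->dropped);
    s.deficit = READ_ONCE(f->deficit);
    return s;
}
```

The snapshot returned by `toy_dump_stats()` may mix values from different moments; the annotations only guarantee that each individual field is read as one access.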
Maoyi Xie	1d324c2f43	ip6_gre: Use cached t->net in ip6erspan_changelink(). After commit `5e72ce3e39` ("net: ipv6: Use link netns in newlink() of rtnl_link_ops"), ip6erspan_newlink() correctly resolves the per-netns ip6gre hash via link_net. ip6erspan_changelink() was not converted in that series and still uses dev_net(dev), which diverges from the device's creation netns after IFLA_NET_NS_FD migration. This re-inserts the tunnel into the wrong per-netns hash. The original netns keeps a stale entry. When that netns is later destroyed, ip6gre_exit_rtnl_net() walks the stale entry, producing a slab-use-after-free reported by KASAN, followed by a kernel BUG at net/core/dev.c (LIST_POISON1) in unregister_netdevice_many_notify(). Reachable from an unprivileged user namespace (unshare --user --map-root-user --net). ip6gre_changelink() earlier in the same file already uses the cached t->net; only ip6erspan_changelink() has the wrong shape. Fixes: `2d665034f2` ("net: ip6_gre: Fix ip6erspan hlen calculation") Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260430103318.3206018-1-maoyi.xie@ntu.edu.sg Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-02 10:36:21 -07:00
Jakub Kicinski	386829cd89	Merge branch 'replace-direct-dequeue-call-with-qdisc_dequeue_peeked'
Jamal Hadi Salim says:
====================
Replace direct dequeue call with qdisc_dequeue_peeked
When sfb and red qdiscs have children (eg qfq qdisc) whose peek() callback is qdisc_peek_dequeued(), we could get a kernel panic. When the parent of such qdiscs (eg illustrated in patch #3 as tbf) wants to retrieve an skb from its child (red/sfb in this case), it will do the following:
 1a. do a peek() - and when sensing there's an skb the child can offer, then
     - the child in this case(red/sfb) calls its child's (qfq) peek. qfq does the right thing and will return the gso_skb queue packet. Note: if there wasnt a gso_skb entry then qfq will store it there.
 1b. invoke a dequeue() on the child (red/sfb). And herein lies the problem.
     - red/sfb will call the child's dequeue() which will essentially just try to grab something of qfq's queue.
The right thing to do in #1b is to grab the skb off gso_skb queue. This patchset fixes that issue by changing #1b to use qdisc_dequeue_peeked() method instead.
Patch 1 fixes the issue for red qdisc.
Patch 2 fixes it for sfb.
Patch 3 adds testcases for the two setups.
====================
Link: https://patch.msgid.link/20260430152957.194015-1-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-02 10:21:00 -07:00
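The peek/dequeue mismatch described in the cover letter above can be modelled with a toy queue in plain C. Everything here is hypothetical and heavily simplified: `struct toy_qdisc` and the `toy_*` functions are invented names, packets are non-zero ints, and a single `stash` slot plays the role of the gso_skb queue. The sketch only shows why a raw dequeue after a peek bypasses the stashed packet, while a "dequeue peeked" helper returns it first.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QCAP 8

/* Hypothetical toy child qdisc; 0 means "no packet". */
struct toy_qdisc {
    int pkts[TOY_QCAP];
    size_t head, tail;
    int stash; /* models gso_skb: holds the packet handed out by peek */
};

static void toy_enqueue(struct toy_qdisc *q, int pkt)
{
    assert(pkt != 0 && q->tail < TOY_QCAP);
    q->pkts[q->tail++] = pkt;
}

/* Raw dequeue: looks only at the queue and ignores the stash. */
static int toy_dequeue(struct toy_qdisc *q)
{
    return q->head < q->tail ? q->pkts[q->head++] : 0;
}

/* Peek in the style of qdisc_peek_dequeued(): dequeue once, keep it stashed. */
static int toy_peek(struct toy_qdisc *q)
{
    if (!q->stash)
        q->stash = toy_dequeue(q);
    return q->stash;
}

/* Dequeue in the style of qdisc_dequeue_peeked(): drain the stash first. */
static int toy_dequeue_peeked(struct toy_qdisc *q)
{
    int pkt = q->stash;

    if (pkt) {
        q->stash = 0;
        return pkt;
    }
    return toy_dequeue(q);
}
```

After enqueueing packets 11 and 22 and peeking (which returns 11), `toy_dequeue()` hands back 22 and leaves 11 stranded in the stash, whereas `toy_dequeue_peeked()` returns 11 as the parent expects.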

1444363 Commits