linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-11 09:43:54 -04:00

Author	SHA1	Message	Date
William Liu	52bf272636	net/sched: Fix backlog accounting in qdisc_dequeue_internal This issue applies for the following qdiscs: hhf, fq, fq_codel, and fq_pie, and occurs in their change handlers when adjusting to the new limit. The problem is the following in the values passed to the subsequent qdisc_tree_reduce_backlog call given a tbf parent: When the tbf parent runs out of tokens, skbs of these qdiscs will be placed in gso_skb. Their peek handlers are qdisc_peek_dequeued, which accounts for both qlen and backlog. However, in the case of qdisc_dequeue_internal, ONLY qlen is accounted for when pulling from gso_skb. This means that these qdiscs are missing a qdisc_qstats_backlog_dec when dropping packets to satisfy the new limit in their change handlers. One can observe this issue with the following (with tc patched to support a limit of 0): export TARGET=fq tc qdisc del dev lo root tc qdisc add dev lo root handle 1: tbf rate 8bit burst 100b latency 1ms tc qdisc replace dev lo handle 3: parent 1:1 $TARGET limit 1000 echo ''; echo 'add child'; tc -s -d qdisc show dev lo ping -I lo -f -c2 -s32 -W0.001 127.0.0.1 2>&1 >/dev/null echo ''; echo 'after ping'; tc -s -d qdisc show dev lo tc qdisc change dev lo handle 3: parent 1:1 $TARGET limit 0 echo ''; echo 'after limit drop'; tc -s -d qdisc show dev lo tc qdisc replace dev lo handle 2: parent 1:1 sfq echo ''; echo 'post graft'; tc -s -d qdisc show dev lo The second to last show command shows 0 packets but a positive number (74) of backlog bytes. The problem becomes clearer in the last show command, where qdisc_purge_queue triggers qdisc_tree_reduce_backlog with the positive backlog and causes an underflow in the tbf parent's backlog (4096 Mb instead of 0). To fix this issue, the codepath for all clients of qdisc_dequeue_internal has been simplified: codel, pie, hhf, fq, fq_pie, and fq_codel. qdisc_dequeue_internal handles the backlog adjustments for all cases that do not directly use the dequeue handler. The old fq_codel_change limit adjustment loop accumulated the arguments to the subsequent qdisc_tree_reduce_backlog call through the cstats field. However, this is confusing and error prone as fq_codel_dequeue could also potentially mutate this field (which qdisc_dequeue_internal calls in the non gso_skb case), so we have unified the code here with other qdiscs. Fixes: `2d3cbfd6d5` ("net_sched: Flush gso_skb list too during ->change()") Fixes: `4b549a2ef4` ("fq_codel: Fair Queue Codel AQM") Fixes: `10239edf86` ("net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc") Signed-off-by: William Liu <will@willsroot.io> Reviewed-by: Savino Dicanosa <savy@syst3mfailure.io> Link: https://patch.msgid.link/20250812235725.45243-1-will@willsroot.io Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-14 17:52:29 -07:00
Wang Liang	d1547bf460	net: bridge: fix soft lockup in br_multicast_query_expired() When set multicast_query_interval to a large value, the local variable 'time' in br_multicast_send_query() may overflow. If the time is smaller than jiffies, the timer will expire immediately, and then call mod_timer() again, which creates a loop and may trigger the following soft lockup issue. watchdog: BUG: soft lockup - CPU#1 stuck for 221s! [rb_consumer:66] CPU: 1 UID: 0 PID: 66 Comm: rb_consumer Not tainted 6.16.0+ #259 PREEMPT(none) Call Trace: <IRQ> __netdev_alloc_skb+0x2e/0x3a0 br_ip6_multicast_alloc_query+0x212/0x1b70 __br_multicast_send_query+0x376/0xac0 br_multicast_send_query+0x299/0x510 br_multicast_query_expired.constprop.0+0x16d/0x1b0 call_timer_fn+0x3b/0x2a0 __run_timers+0x619/0x950 run_timer_softirq+0x11c/0x220 handle_softirqs+0x18e/0x560 __irq_exit_rcu+0x158/0x1a0 sysvec_apic_timer_interrupt+0x76/0x90 </IRQ> This issue can be reproduced with: ip link add br0 type bridge echo 1 > /sys/class/net/br0/bridge/multicast_querier echo 0xffffffffffffffff > /sys/class/net/br0/bridge/multicast_query_interval ip link set dev br0 up The multicast_startup_query_interval can also cause this issue. Similar to the commit `99b4061095` ("net: bridge: mcast: add and enforce query interval minimum"), add check for the query interval maximum to fix this issue. Link: https://lore.kernel.org/netdev/20250806094941.1285944-1-wangliang74@huawei.com/ Link: https://lore.kernel.org/netdev/20250812091818.542238-1-wangliang74@huawei.com/ Fixes: `d902eee43f` ("bridge: Add multicast count/interval sysfs entries") Suggested-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Wang Liang <wangliang74@huawei.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250813021054.1643649-1-wangliang74@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-14 17:49:33 -07:00
Suraj Gupta	fd980bf6e9	net: xilinx: axienet: Fix RX skb ring management in DMAengine mode Submit multiple descriptors in axienet_rx_cb() to fill Rx skb ring. This ensures the ring "catches up" on previously missed allocations. Increment Rx skb ring head pointer after BD is successfully allocated. Previously, head pointer was incremented before verifying if descriptor is successfully allocated and has valid entries, which could lead to ring state inconsistency if descriptor setup failed. These changes improve reliability by maintaining adequate descriptor availability and ensuring proper ring buffer state management. Fixes: `6a91b846af` ("net: axienet: Introduce dmaengine support") Signed-off-by: Suraj Gupta <suraj.gupta2@amd.com> Link: https://patch.msgid.link/20250813135559.1555652-1-suraj.gupta2@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-14 17:38:44 -07:00
Alexandra Winter	1548549e17	MAINTAINERS: update s390/net Remove Thorsten Winkler as maintainer and add Aswin Karuvally as reviewer. Thank you Thorsten for your support, welcome Aswin! Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Acked-by: Thorsten Winkler <twinkler@linux.ibm.com> Acked-by: Aswin Karuvally <aswin@linux.ibm.com> Link: https://patch.msgid.link/20250813111633.241111-1-wintera@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-14 17:36:15 -07:00
Linus Torvalds	63467137ec	Merge tag 'net-6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from Netfilter and IPsec. Current release - regressions: - netfilter: nft_set_pipapo: - don't return bogus extension pointer - fix null deref for empty set Current release - new code bugs: - core: prevent deadlocks when enabling NAPIs with mixed kthread config - eth: netdevsim: Fix wild pointer access in nsim_queue_free(). Previous releases - regressions: - page_pool: allow enabling recycling late, fix false positive warning - sched: ets: use old 'nbands' while purging unused classes - xfrm: - restore GSO for SW crypto - bring back device check in validate_xmit_xfrm - tls: handle data disappearing from under the TLS ULP - ptp: prevent possible ABBA deadlock in ptp_clock_freerun() - eth: - bnxt: fill data page pool with frags if PAGE_SIZE > BNXT_RX_PAGE_SIZE - hv_netvsc: fix panic during namespace deletion with VF Previous releases - always broken: - netfilter: fix refcount leak on table dump - vsock: do not allow binding to VMADDR_PORT_ANY - sctp: linearize cloned gso packets in sctp_rcv - eth: - hibmcge: fix the division by zero issue - microchip: fix KSZ8863 reset problem" * tag 'net-6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (54 commits) net: usb: asix_devices: add phy_mask for ax88772 mdio bus net: kcm: Fix race condition in kcm_unattach() selftests: net/forwarding: test purge of active DWRR classes net/sched: ets: use old 'nbands' while purging unused classes bnxt: fill data page pool with frags if PAGE_SIZE > BNXT_RX_PAGE_SIZE netdevsim: Fix wild pointer access in nsim_queue_free(). net: mctp: Fix bad kfree_skb in bind lookup test netfilter: nf_tables: reject duplicate device on updates ipvs: Fix estimator kthreads preferred affinity netfilter: nft_set_pipapo: fix null deref for empty set selftests: tls: test TCP stealing data from under the TLS socket tls: handle data disappearing from under the TLS ULP ptp: prevent possible ABBA deadlock in ptp_clock_freerun() ixgbe: prevent from unwanted interface name changes devlink: let driver opt out of automatic phys_port_name generation net: prevent deadlocks when enabling NAPIs with mixed kthread config net: update NAPI threaded config even for disabled NAPIs selftests: drv-net: don't assume device has only 2 queues docs: Fix name for net.ipv4.udp_child_hash_entries riscv: dts: thead: Add APB clocks for TH1520 GMACs ...	2025-08-14 07:14:30 -07:00
Xu Yang	4faff70959	net: usb: asix_devices: add phy_mask for ax88772 mdio bus Without setting phy_mask for ax88772 mdio bus, current driver may create at most 32 mdio phy devices with phy address range from 0x00 ~ 0x1f. DLink DUB-E100 H/W Ver B1 is such a device. However, only one main phy device will bind to net phy driver. This is creating issue during system suspend/resume since phy_polling_mode() in phy_state_machine() will directly deference member of phydev->drv for non-main phy devices. Then NULL pointer dereference issue will occur. Due to only external phy or internal phy is necessary, add phy_mask for ax88772 mdio bus to workarnoud the issue. Closes: https://lore.kernel.org/netdev/20250806082931.3289134-1-xu.yang_2@nxp.com Fixes: `e532a096be` ("net: usb: asix: ax88772: add phylib support") Cc: stable@vger.kernel.org Signed-off-by: Xu Yang <xu.yang_2@nxp.com> Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20250811092931.860333-1-xu.yang_2@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-14 10:09:28 +02:00
Linus Torvalds	0cc53520e6	Merge tag 'probes-fixes-v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fix from Masami Hiramatsu: - MAINTAINERS: Remove bouncing kprobes maintainer * tag 'probes-fixes-v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: MAINTAINERS: Remove bouncing kprobes maintainer	2025-08-13 20:23:32 -07:00
Dave Hansen	e2f9ae9161	MAINTAINERS: Remove bouncing kprobes maintainer The kprobes MAINTAINERS entry includes anil.s.keshavamurthy@intel.com. That address is bouncing. Remove it. This still leaves three other listed maintainers. Link: https://lore.kernel.org/all/20250808180124.7DDE2ECD@davehans-spike.ostc.intel.com/ Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Naveen N Rao <naveen@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: linux-trace-kernel@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2025-08-14 11:38:58 +09:00
Sven Stegemann	52565a9352	net: kcm: Fix race condition in kcm_unattach() syzbot found a race condition when kcm_unattach(psock) and kcm_release(kcm) are executed at the same time. kcm_unattach() is missing a check of the flag kcm->tx_stopped before calling queue_work(). If the kcm has a reserved psock, kcm_unattach() might get executed between cancel_work_sync() and unreserve_psock() in kcm_release(), requeuing kcm->tx_work right before kcm gets freed in kcm_done(). Remove kcm->tx_stopped and replace it by the less error-prone disable_work_sync(). Fixes: `ab7ac4eb98` ("kcm: Kernel Connection Multiplexor module") Reported-by: syzbot+e62c9db591c30e174662@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=e62c9db591c30e174662 Reported-by: syzbot+d199b52665b6c3069b94@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d199b52665b6c3069b94 Reported-by: syzbot+be6b1fdfeae512726b4e@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=be6b1fdfeae512726b4e Signed-off-by: Sven Stegemann <sven@stegemann.de> Link: https://patch.msgid.link/20250812191810.27777-1-sven@stegemann.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-13 18:18:33 -07:00
Jakub Kicinski	1c756093cd	Merge branch 'ets-use-old-nbands-while-purging-unused-classes' Davide Caratti says: ==================== ets: use old 'nbands' while purging unused classes - patch 1/2 fixes a NULL dereference in the control path of sch_ets qdisc - patch 2/2 extends kselftests to verify effectiveness of the above fix ==================== Link: https://patch.msgid.link/cover.1755016081.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-13 18:11:56 -07:00
Davide Caratti	774a2ae661	selftests: net/forwarding: test purge of active DWRR classes Extend sch_ets.sh to add a reproducer for problematic list deletions when active DWRR class are purged by ets_qdisc_change() [1] [2]. [1] https://lore.kernel.org/netdev/e08c7f4a6882f260011909a868311c6e9b54f3e4.1639153474.git.dcaratti@redhat.com/ [2] https://lore.kernel.org/netdev/f3b9bacc73145f265c19ab80785933da5b7cbdec.1754581577.git.dcaratti@redhat.com/ Suggested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/489497cb781af7389011ca1591fb702a7391f5e7.1755016081.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-13 18:11:48 -07:00
Davide Caratti	87c6efc5ce	net/sched: ets: use old 'nbands' while purging unused classes Shuang reported sch_ets test-case [1] crashing in ets_class_qlen_notify() after recent changes from Lion [2]. The problem is: in ets_qdisc_change() we purge unused DWRR queues; the value of 'q->nbands' is the new one, and the cleanup should be done with the old one. The problem is here since my first attempts to fix ets_qdisc_change(), but it surfaced again after the recent qdisc len accounting fixes. Fix it purging idle DWRR queues before assigning a new value of 'q->nbands', so that all purge operations find a consistent configuration: - old 'q->nbands' because it's needed by ets_class_find() - old 'q->nstrict' because it's needed by ets_class_is_strict() BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 62 UID: 0 PID: 39457 Comm: tc Kdump: loaded Not tainted 6.12.0-116.el10.x86_64 #1 PREEMPT(voluntary) Hardware name: Dell Inc. PowerEdge R640/06DKY5, BIOS 2.12.2 07/09/2021 RIP: 0010:__list_del_entry_valid_or_report+0x4/0x80 Code: ff 4c 39 c7 0f 84 39 19 8e ff b8 01 00 00 00 c3 cc cc cc cc 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa <48> 8b 17 48 8b 4f 08 48 85 d2 0f 84 56 19 8e ff 48 85 c9 0f 84 ab RSP: 0018:ffffba186009f400 EFLAGS: 00010202 RAX: 00000000000000d6 RBX: 0000000000000000 RCX: 0000000000000004 RDX: ffff9f0fa29b69c0 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffffffc12c2400 R08: 0000000000000008 R09: 0000000000000004 R10: ffffffffffffffff R11: 0000000000000004 R12: 0000000000000000 R13: ffff9f0f8cfe0000 R14: 0000000000100005 R15: 0000000000000000 FS: 00007f2154f37480(0000) GS:ffff9f269c1c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000001530be001 CR4: 00000000007726f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ets_class_qlen_notify+0x65/0x90 [sch_ets] qdisc_tree_reduce_backlog+0x74/0x110 ets_qdisc_change+0x630/0xa40 [sch_ets] __tc_modify_qdisc.constprop.0+0x216/0x7f0 tc_modify_qdisc+0x7c/0x120 rtnetlink_rcv_msg+0x145/0x3f0 netlink_rcv_skb+0x53/0x100 netlink_unicast+0x245/0x390 netlink_sendmsg+0x21b/0x470 ____sys_sendmsg+0x39d/0x3d0 ___sys_sendmsg+0x9a/0xe0 __sys_sendmsg+0x7a/0xd0 do_syscall_64+0x7d/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f2155114084 Code: 89 02 b8 ff ff ff ff eb bb 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 80 3d 25 f0 0c 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 89 54 24 1c 48 89 RSP: 002b:00007fff1fd7a988 EFLAGS: 00000202 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000560ec063e5e0 RCX: 00007f2155114084 RDX: 0000000000000000 RSI: 00007fff1fd7a9f0 RDI: 0000000000000003 RBP: 00007fff1fd7aa60 R08: 0000000000000010 R09: 000000000000003f R10: 0000560ee9b3a010 R11: 0000000000000202 R12: 00007fff1fd7aae0 R13: 000000006891ccde R14: 0000560ec063e5e0 R15: 00007fff1fd7aad0 </TASK> [1] https://lore.kernel.org/netdev/e08c7f4a6882f260011909a868311c6e9b54f3e4.1639153474.git.dcaratti@redhat.com/ [2] https://lore.kernel.org/netdev/d912cbd7-193b-4269-9857-525bee8bbb6a@gmail.com/ Cc: stable@vger.kernel.org Fixes: `103406b38c` ("net/sched: Always pass notifications when child class becomes empty") Fixes: `c062f2a0b0` ("net/sched: sch_ets: don't remove idle classes from the round-robin list") Fixes: `dcc68b4d80` ("net: sch_ets: Add a new Qdisc") Reported-by: Li Shuang <shuali@redhat.com> Closes: https://issues.redhat.com/browse/RHEL-108026 Reviewed-by: Petr Machata <petrm@nvidia.com> Co-developed-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Link: https://patch.msgid.link/7928ff6d17db47a2ae7cc205c44777b1f1950545.1755016081.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-13 18:11:48 -07:00
Jakub Kicinski	212122775e	Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== ixgbe: bypass devlink phys_port_name generation Jedrzej adds option to skip phys_port_name generation and opts ixgbe into it as some configurations rely on pre-devlink naming which could end up broken as a result. * '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: ixgbe: prevent from unwanted interface name changes devlink: let driver opt out of automatic phys_port_name generation ==================== Link: https://patch.msgid.link/20250812205226.1984369-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-13 17:31:46 -07:00
David Wei	39f8fcda20	bnxt: fill data page pool with frags if PAGE_SIZE > BNXT_RX_PAGE_SIZE The data page pool always fills the HW rx ring with pages. On arm64 with 64K pages, this will waste _at least_ 32K of memory per entry in the rx ring. Fix by fragmenting the pages if PAGE_SIZE > BNXT_RX_PAGE_SIZE. This makes the data page pool the same as the header pool. Tested with iperf3 with a small (64 entries) rx ring to encourage buffer circulation. Fixes: `cd1fafe7da` ("eth: bnxt: add support rx side device memory TCP") Reviewed-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250812182907.1540755-1-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-13 17:27:23 -07:00
Kuniyuki Iwashima	b2cafefaf0	netdevsim: Fix wild pointer access in nsim_queue_free(). syzbot reported the splat below. [0] When nsim_queue_uninit() is called from nsim_init_netdevsim(), register_netdevice() has not been called, thus dev->dstats has not been allocated. Let's not call dev_dstats_rx_dropped_add() in such a case. [0] BUG: unable to handle page fault for address: ffff88809782c020 PF: supervisor write access in kernel mode PF: error_code(0x0002) - not-present page PGD 1b401067 P4D 1b401067 PUD 0 Oops: Oops: 0002 [#1] SMP KASAN NOPTI CPU: 3 UID: 0 PID: 8476 Comm: syz.1.251 Not tainted 6.16.0-syzkaller-06699-ge8d780dcd957 #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:local_add arch/x86/include/asm/local.h:33 [inline] RIP: 0010:u64_stats_add include/linux/u64_stats_sync.h:89 [inline] RIP: 0010:dev_dstats_rx_dropped_add include/linux/netdevice.h:3027 [inline] RIP: 0010:nsim_queue_free+0xba/0x120 drivers/net/netdevsim/netdev.c:714 Code: 07 77 6c 4a 8d 3c ed 20 7e f1 8d 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 75 46 4a 03 1c ed 20 7e f1 8d <4c> 01 63 20 be 00 02 00 00 48 8d 3d 00 00 00 00 e8 61 2f 58 fa 48 RSP: 0018:ffffc900044af150 EFLAGS: 00010286 RAX: dffffc0000000000 RBX: ffff88809782c000 RCX: 00000000000079c3 RDX: 1ffffffff1be2fc7 RSI: ffffffff8c15f380 RDI: ffffffff8df17e38 RBP: ffff88805f59d000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000003 R14: ffff88806ceb3d00 R15: ffffed100dfd308e FS: 0000000000000000(0000) GS:ffff88809782c000(0063) knlGS:00000000f505db40 CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 CR2: ffff88809782c020 CR3: 000000006fc6a000 CR4: 0000000000352ef0 Call Trace: <TASK> nsim_queue_uninit drivers/net/netdevsim/netdev.c:993 [inline] nsim_init_netdevsim drivers/net/netdevsim/netdev.c:1049 [inline] nsim_create+0xd0a/0x1260 drivers/net/netdevsim/netdev.c:1101 __nsim_dev_port_add+0x435/0x7d0 drivers/net/netdevsim/dev.c:1438 nsim_dev_port_add_all drivers/net/netdevsim/dev.c:1494 [inline] nsim_dev_reload_create drivers/net/netdevsim/dev.c:1546 [inline] nsim_dev_reload_up+0x5b8/0x860 drivers/net/netdevsim/dev.c:1003 devlink_reload+0x322/0x7c0 net/devlink/dev.c:474 devlink_nl_reload_doit+0xe31/0x1410 net/devlink/dev.c:584 genl_family_rcv_msg_doit+0x206/0x2f0 net/netlink/genetlink.c:1115 genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline] genl_rcv_msg+0x55c/0x800 net/netlink/genetlink.c:1210 netlink_rcv_skb+0x155/0x420 net/netlink/af_netlink.c:2552 genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219 netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline] netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1346 netlink_sendmsg+0x8d1/0xdd0 net/netlink/af_netlink.c:1896 sock_sendmsg_nosec net/socket.c:714 [inline] __sock_sendmsg net/socket.c:729 [inline] ____sys_sendmsg+0xa95/0xc70 net/socket.c:2614 ___sys_sendmsg+0x134/0x1d0 net/socket.c:2668 __sys_sendmsg+0x16d/0x220 net/socket.c:2700 do_syscall_32_irqs_on arch/x86/entry/syscall_32.c:83 [inline] __do_fast_syscall_32+0x7c/0x3a0 arch/x86/entry/syscall_32.c:306 do_fast_syscall_32+0x32/0x80 arch/x86/entry/syscall_32.c:331 entry_SYSENTER_compat_after_hwframe+0x84/0x8e RIP: 0023:0xf708e579 Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00 RSP: 002b:00000000f505d55c EFLAGS: 00000296 ORIG_RAX: 0000000000000172 RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 0000000080000080 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000296 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> Modules linked in: CR2: ffff88809782c020 Fixes: `2a68a22304` ("netdevsim: account dropped packet length in stats on queue free") Reported-by: syzbot+8aa80c6232008f7b957d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/688bb9ca.a00a0220.26d0e1.0050.GAE@google.com/ Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250812162130.4129322-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-13 17:26:39 -07:00
Matt Johnston	a58893aa17	net: mctp: Fix bad kfree_skb in bind lookup test The kunit test's skb_pkt is consumed by mctp_dst_input() so shouldn't be freed separately. Fixes: `e6d8e7dbc5` ("net: mctp: Add bind lookup test") Reported-by: Alexandre Ghiti <alex@ghiti.fr> Closes: https://lore.kernel.org/all/734b02a3-1941-49df-a0da-ec14310d41e4@ghiti.fr/ Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://patch.msgid.link/20250812-fix-mctp-bind-test-v1-1-5e2128664eb3@codeconstruct.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-13 17:07:34 -07:00
Jakub Kicinski	3bfc778297	Merge tag 'nf-25-08-13' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Florian Westphal says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) I managed to add a null dereference crash in nft_set_pipapo in the current development cycle, was not caught by CI because the avx2 implementation is fine, but selftest splats when run on non-avx2 host. 2) Fix the ipvs estimater kthread affinity, was incorrect since 6.14. From Frederic Weisbecker. 3) nf_tables should not allow to add a device to a flowtable or netdev chain more than once -- reject this. From Pablo Neira Ayuso. This has been broken for long time, blamed commit dates from v5.8. * tag 'nf-25-08-13' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: reject duplicate device on updates ipvs: Fix estimator kthreads preferred affinity netfilter: nft_set_pipapo: fix null deref for empty set ==================== Link: https://patch.msgid.link/20250813113800.20775-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-13 14:51:51 -07:00
Linus Torvalds	dfc0f63730	Merge tag 'erofs-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: - Align FSDAX enablement among multiple devices - Fix EROFS_FS_ZIP_ACCEL build dependency again to prevent forcing CRYPTO{,_DEFLATE}=y even if EROFS=m - Fix atomic context detection to properly launch kworkers on demand - Fix block count statistics for 48-bit addressing support * tag 'erofs-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: fix block count report when 48-bit layout is on erofs: fix atomic context detection when !CONFIG_DEBUG_LOCK_ALLOC erofs: Do not select tristate symbols from bool symbols erofs: Fallback to normal access if DAX is not supported on extra device	2025-08-13 11:29:27 -07:00
Linus Torvalds	3a4a0367c9	Merge tag 'rcu.fixes.6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux Pull RCU fix from Neeraj Upadhyay: "Fix a regression introduced by commit `b41642c877` ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work") which results in boot hang as reported by kernel test bot at [1]. This issue happens because RCU re-initializes the deferred QS IRQ work everytime it is queued. With commit `b41642c877`, the IRQ work re-initialization can happen while it is already queued. This results in IRQ work being requeued to itself. When IRQ work finally fires, as it is requeued to itself, it is repeatedly executed and results in hang. Fix this with initializing the IRQ work only once before the CPU boots" Link: https://lore.kernel.org/rcu/202508071303.c1134cce-lkp@intel.com/ [1] * tag 'rcu.fixes.6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: rcu: Fix racy re-initialization of irq_work causing hangs	2025-08-13 10:23:28 -07:00
Linus Torvalds	91325f31af	Merge tag 'mm-hotfixes-stable-2025-08-12-20-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "12 hotfixes. 5 are cc:stable and the remainder address post-6.16 issues or aren't considered necessary for -stable kernels. 10 of these fixes are for MM" * tag 'mm-hotfixes-stable-2025-08-12-20-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: proc: proc_maps_open allow proc_mem_open to return NULL mm/mremap: avoid expensive folio lookup on mremap folio pte batch userfaultfd: fix a crash in UFFDIO_MOVE when PMD is a migration entry mm: pass page directly instead of using folio_page selftests/proc: fix string literal warning in proc-maps-race.c fs/proc/task_mmu: hold PTL in pagemap_hugetlb_range and gather_hugetlb_stats mm/smaps: fix race between smaps_hugetlb_range and migration mm: fix the race between collapse and PT_RECLAIM under per-vma lock mm/kmemleak: avoid soft lockup in __kmemleak_do_cleanup() MAINTAINERS: add Masami as a reviewer of hung task detector mm/kmemleak: avoid deadlock by moving pr_warn() outside kmemleak_lock kasan/test: fix protection against compiler elision	2025-08-13 08:28:33 -07:00
Pablo Neira Ayuso	cf5fb87fcd	netfilter: nf_tables: reject duplicate device on updates A chain/flowtable update with duplicated devices in the same batch is possible. Unfortunately, netdev event path only removes the first device that is found, leaving unregistered the hook of the duplicated device. Check if a duplicated device exists in the transaction batch, bail out with EEXIST in such case. WARNING is hit when unregistering the hook: [49042.221275] WARNING: CPU: 4 PID: 8425 at net/netfilter/core.c:340 nf_hook_entry_head+0xaa/0x150 [49042.221375] CPU: 4 UID: 0 PID: 8425 Comm: nft Tainted: G S 6.16.0+ #170 PREEMPT(full) [...] [49042.221382] RIP: 0010:nf_hook_entry_head+0xaa/0x150 Fixes: `78d9f48f7f` ("netfilter: nf_tables: add devices to existing flowtable") Fixes: `b9703ed44f` ("netfilter: nf_tables: support for adding new devices to an existing netdev chain") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-08-13 08:34:55 +02:00
Frederic Weisbecker	c0a23bbc98	ipvs: Fix estimator kthreads preferred affinity The estimator kthreads' affinity are defined by sysctl overwritten preferences and applied through a plain call to the scheduler's affinity API. However since the introduction of managed kthreads preferred affinity, such a practice shortcuts the kthreads core code which eventually overwrites the target to the default unbound affinity. Fix this with using the appropriate kthread's API. Fixes: `d1a8919758` ("kthread: Default affine kthread to its preferred NUMA node") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-08-13 08:34:33 +02:00
Florian Westphal	30c1d25b98	netfilter: nft_set_pipapo: fix null deref for empty set Blamed commit broke the check for a null scratch map: - if (unlikely(!m \|\| !raw_cpu_ptr(m->scratch))) + if (unlikely(!raw_cpu_ptr(m->scratch))) This should have been "if (!raw_ ...)". Use the pattern of the avx2 version which is more readable. This can only be reproduced if avx2 support isn't available. Fixes: `d8d871a35c` ("netfilter: nft_set_pipapo: merge pipapo_get/lookup") Signed-off-by: Florian Westphal <fw@strlen.de>	2025-08-13 08:34:32 +02:00
Jakub Kicinski	d7e82594a4	selftests: tls: test TCP stealing data from under the TLS socket Check a race where data disappears from the TCP socket after TLS signaled that its ready to receive. ok 6 global.data_steal # RUN tls_basic.base_base ... # OK tls_basic.base_base Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250807232907.600366-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-12 18:59:06 -07:00
Jakub Kicinski	6db015fc4b	tls: handle data disappearing from under the TLS ULP TLS expects that it owns the receive queue of the TCP socket. This cannot be guaranteed in case the reader of the TCP socket entered before the TLS ULP was installed, or uses some non-standard read API (eg. zerocopy ones). Replace the WARN_ON() and a buggy early exit (which leaves anchor pointing to a freed skb) with real error handling. Wipe the parsing state and tell the reader to retry. We already reload the anchor every time we (re)acquire the socket lock, so the only condition we need to avoid is an out of bounds read (not having enough bytes in the socket for previously parsed record len). If some data was read from under TLS but there's enough in the queue we'll reload and decrypt what is most likely not a valid TLS record. Leading to some undefined behavior from TLS perspective (corrupting a stream? missing an alert? missing an attack?) but no kernel crash should take place. Reported-by: William Liu <will@willsroot.io> Reported-by: Savino Dicanosa <savy@syst3mfailure.io> Link: https://lore.kernel.org/tFjq_kf7sWIG3A7CrCg_egb8CVsT_gsmHAK0_wxDPJXfIzxFAMxqmLwp3MlU5EHiet0AwwJldaaFdgyHpeIUCS-3m3llsmRzp9xIOBR4lAI=@syst3mfailure.io Fixes: `84c61fe1a7` ("tls: rx: do not use the standard strparser") Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250807232907.600366-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-12 18:59:05 -07:00
Jeongjun Park	2efe41234d	ptp: prevent possible ABBA deadlock in ptp_clock_freerun() syzbot reported the following ABBA deadlock: CPU0 CPU1 ---- ---- n_vclocks_store() lock(&ptp->n_vclocks_mux) [1] (physical clock) pc_clock_adjtime() lock(&clk->rwsem) [2] (physical clock) ... ptp_clock_freerun() ptp_vclock_in_use() lock(&ptp->n_vclocks_mux) [3] (physical clock) ptp_clock_unregister() posix_clock_unregister() lock(&clk->rwsem) [4] (virtual clock) Since ptp virtual clock is registered only under ptp physical clock, both ptp_clock and posix_clock must be physical clocks for ptp_vclock_in_use() to lock &ptp->n_vclocks_mux and check ptp->n_vclocks. However, when unregistering vclocks in n_vclocks_store(), the locking ptp->n_vclocks_mux is a physical clock lock, but clk->rwsem of ptp_clock_unregister() called through device_for_each_child_reverse() is a virtual clock lock. Therefore, clk->rwsem used in CPU0 and clk->rwsem used in CPU1 are different locks, but in lockdep, a false positive occurs because the possibility of deadlock is determined through lock-class. To solve this, lock subclass annotation must be added to the posix_clock rwsem of the vclock. Reported-by: syzbot+7cfb66a237c4a5fb22ad@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=7cfb66a237c4a5fb22ad Fixes: `73f37068d5` ("ptp: support ptp physical/virtual clocks conversion") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20250728062649.469882-1-aha310510@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-12 14:17:35 -07:00
Jedrzej Jagielski	e67a0bc3ed	ixgbe: prevent from unwanted interface name changes Users of the ixgbe driver report that after adding devlink support by the commit `a0285236ab` ("ixgbe: add initial devlink support") their configs got broken due to unwanted changes of interface names. It's caused by automatic phys_port_name generation during devlink port initialization flow. To prevent from that set no_phys_port_name flag for ixgbe devlink ports. Reported-by: David Howells <dhowells@redhat.com> Closes: https://lore.kernel.org/netdev/3452224.1745518016@warthog.procyon.org.uk/ Reported-by: David Kaplan <David.Kaplan@amd.com> Closes: https://lore.kernel.org/netdev/LV3PR12MB92658474624CCF60220157199470A@LV3PR12MB9265.namprd12.prod.outlook.com/ Fixes: `a0285236ab` ("ixgbe: add initial devlink support") Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-08-12 13:24:09 -07:00
Jedrzej Jagielski	c5ec7f49b4	devlink: let driver opt out of automatic phys_port_name generation Currently when adding devlink port, phys_port_name is automatically generated within devlink port initialization flow. As a result adding devlink port support to driver may result in forced changes of interface names, which breaks already existing network configs. This is an expected behavior but in some scenarios it would not be preferable to provide such limitation for legacy driver not being able to keep 'pre-devlink' interface name. Add flag no_phys_port_name to devlink_port_attrs struct which indicates if devlink should not alter name of interface. Suggested-by: Jiri Pirko <jiri@resnulli.us> Link: https://lore.kernel.org/all/nbwrfnjhvrcduqzjl4a2jafnvvud6qsbxlvxaxilnryglf4j7r@btuqrimnfuly/ Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-08-12 13:23:39 -07:00
Linus Torvalds	8742b2d893	Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull habanalabs fix from Al Viro: "Yet another use-after-free fix due to dma_buf_fd() misuse" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: habanalabs: fix UAF in export_dmabuf()	2025-08-12 12:10:33 -07:00
Linus Torvalds	0e39a73182	Merge tag 'for-6.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix bug in qgroups reporting incorrect usage for higher level qgroups - in zoned mode, do not select metadata group as finish target - convert xarray lock to RCU when trying to release extent buffer to avoid a deadlock - do not allow relocation on partially dropped subvolumes, which is normally not possible but has been reported on old filesystems - in tree-log, report errors on missing block group when unaccounting log tree extent buffers - with large folios, fix range length when processing ordered extents * tag 'for-6.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix iteration bug in __qgroup_excl_accounting() btrfs: zoned: do not select metadata BG as finish target btrfs: do not allow relocation of partially dropped subvolumes btrfs: error on missing block group when unaccounting log tree extent buffers btrfs: fix wrong length parameter for btrfs_cleanup_ordered_extents() btrfs: make btrfs_cleanup_ordered_extents() support large folios btrfs: fix subpage deadlock in try_release_subpage_extent_buffer()	2025-08-12 08:52:05 -07:00
Linus Torvalds	20e0d85764	Merge tag 'snp_cache_coherency' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip - Add a mitigation for a cache coherency vulnerability when running an SNP guest which makes sure all cache lines belonging to a 4K page are evicted after latter has been converted to a guest-private page [ SNP: Secure Nested Paging - not to be confused with Single Nucleotide Polymorphism, which is the more common use of that TLA. I am on a mission to write out the more obscure TLAs in order to keep track of them. Because while math tells us that there are only about 17k different combinations of three-letter acronyms using English letters (26^3), I am convinced that somehow Intel, AMD and ARM have together figured out new mathematics, and have at least a million different TLAs that they use. - Linus ] * tag 'snp_cache_coherency' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sev: Evict cache lines during SNP memory validation	2025-08-12 08:19:23 -07:00
Paolo Abeni	c04fdca8a9	Merge tag 'ipsec-2025-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2025-08-11 1) Fix flushing of all states in xfrm_state_fini. From Sabrina Dubroca. 2) Fix some IPsec software offload features. These got lost with some recent HW offload changes. From Sabrina Dubroca. Please pull or let me know if there are problems. * tag 'ipsec-2025-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: udp: also consider secpath when evaluating ipsec use for checksumming xfrm: bring back device check in validate_xmit_xfrm xfrm: restore GSO for SW crypto xfrm: flush all states in xfrm_state_fini ==================== Link: https://patch.msgid.link/20250811092008.731573-1-steffen.klassert@secunet.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-12 15:01:09 +02:00
Paolo Abeni	74078816f8	Merge branch 'net-prevent-deadlocks-and-mis-configuration-with-per-napi-threaded-config' Jakub Kicinski says: ==================== net: prevent deadlocks and mis-configuration with per-NAPI threaded config Running the test added with a recent fix on a driver with persistent NAPI config leads to a deadlock. The deadlock is fixed by patch 3, patch 2 is I think a more fundamental problem with the way we implemented the config. I hope the fix makes sense, my own thinking is definitely colored by my preference (IOW how the per-queue config RFC was implemented). v1: https://lore.kernel.org/20250808014952.724762-1-kuba@kernel.org ==================== Link: https://patch.msgid.link/20250809001205.1147153-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-12 14:43:08 +02:00
Jakub Kicinski	b3fc08ab9a	net: prevent deadlocks when enabling NAPIs with mixed kthread config The following order of calls currently deadlocks if: - device has threaded=1; and - NAPI has persistent config with threaded=0. netif_napi_add_weight_config() dev->threaded == 1 napi_kthread_create() napi_enable() napi_restore_config() napi_set_threaded(0) napi_stop_kthread() while (NAPIF_STATE_SCHED) msleep(20) We deadlock because disabled NAPI has STATE_SCHED set. Creating a thread in netif_napi_add() just to destroy it in napi_disable() is fairly ugly in the first place. Let's read both the device config and the NAPI config in netif_napi_add(). Fixes: `e6d7626881` ("net: Update threaded state in napi config in netif_set_threaded") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20250809001205.1147153-4-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-12 14:43:05 +02:00
Jakub Kicinski	ccba9f6baa	net: update NAPI threaded config even for disabled NAPIs We have to make sure that all future NAPIs will have the right threaded state when the state is configured on the device level. We chose not to have an "unset" state for threaded, and not to wipe the NAPI config clean when channels are explicitly disabled. This means the persistent config structs "exist" even when their NAPIs are not instantiated. Differently put - the NAPI persistent state lives in the net_device (ncfg == struct napi_config): ,--- [napi 0] - [napi 1] [dev] \| \| `--- [ncfg 0] - [ncfg 1] so say we a device with 2 queues but only 1 enabled: ,--- [napi 0] [dev] \| `--- [ncfg 0] - [ncfg 1] now we set the device to threaded=1: ,---------- [napi 0 (thr:1)] [dev(thr:1)] \| `---------- [ncfg 0 (thr:1)] - [ncfg 1 (thr:?)] Since [ncfg 1] was not attached to a NAPI during configuration we skipped it. If we create a NAPI for it later it will have the old setting (presumably disabled). One could argue if this is right or not "in principle", but it's definitely not how things worked before per-NAPI config.. Fixes: `2677010e77` ("Add support to set NAPI threaded for individual NAPI") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20250809001205.1147153-3-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-12 14:43:05 +02:00
Jakub Kicinski	bda053d644	selftests: drv-net: don't assume device has only 2 queues The test is implicitly assuming the device only has 2 queues. A real device will likely have more. The exact problem is that because NAPIs get added to the list from the head, the netlink dump reports them in reverse order. So the naive napis[0] will actually likely give us the _last_ NAPI, not the first one. Re-enable all the NAPIs instead of hard-coding 2 in the test. This way the NAPIs we operated on will always reappear, doesn't matter where they were in the registration order. Fixes: `e6d7626881` ("net: Update threaded state in napi config in netif_set_threaded") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20250809001205.1147153-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-12 14:43:05 +02:00
Jordan Rife	e93f7af148	docs: Fix name for net.ipv4.udp_child_hash_entries udp_child_ehash_entries -> udp_child_hash_entries Fixes: `9804985bf2` ("udp: Introduce optional per-netns hash table.") Signed-off-by: Jordan Rife <jordan@jrife.io> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250808185800.1189042-1-jordan@jrife.io Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-12 13:03:13 +02:00
Paolo Abeni	b3e8c3dfce	Merge branch 'fix-broken-link-with-th1520-gmac-when-linkspeed-changes' Yao Zi says: ==================== Fix broken link with TH1520 GMAC when linkspeed changes It's noted that on TH1520 SoC, the GMAC's link becomes broken after the link speed is changed (for example, running ethtool -s eth0 speed 100 on the peer when negotiated to 1Gbps), but the GMAC could function normally if the speed is brought back to the initial. Just like many other SoCs utilizing STMMAC IP, we need to adjust the TX clock supplying TH1520's GMAC through some SoC-specific glue registers when linkspeed changes. But it's found that after the full kernel startup, reading from them results in garbage and writing to them makes no effect, which is the cause of broken link. Further testing shows perisys-apb4-hclk must be ungated for normal access to Th1520 GMAC APB glue registers, which is neither described in dt-binding nor acquired by the driver. This series expands the dt-binding of TH1520's GMAC to allow an extra "APB glue registers interface clock", instructs the driver to acquire and enable the clock, and finally supplies CLK_PERISYS_APB4_HCLK for TH1520's GMACs in SoC devicetree. v2: https://lore.kernel.org/netdev/20250801091240.46114-1-ziyao@disroot.org/ v1: https://lore.kernel.org/all/20250729093734.40132-1-ziyao@disroot.org/ ==================== Link: https://patch.msgid.link/20250808093655.48074-2-ziyao@disroot.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-12 12:52:25 +02:00
Yao Zi	a7f75e2883	riscv: dts: thead: Add APB clocks for TH1520 GMACs Describe perisys-apb4-hclk as the APB clock for TH1520 SoC, which is essential for accessing GMAC glue registers. Fixes: `7e756671a6` ("riscv: dts: thead: Add TH1520 ethernet nodes") Signed-off-by: Yao Zi <ziyao@disroot.org> Reviewed-by: Drew Fustini <fustini@kernel.org> Tested-by: Drew Fustini <fustini@kernel.org> Link: https://patch.msgid.link/20250808093655.48074-5-ziyao@disroot.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-12 12:52:21 +02:00
Yao Zi	4cc339ce48	net: stmmac: thead: Get and enable APB clock on initialization It's necessary to adjust the MAC TX clock when the linkspeed changes, but it's noted such adjustment always fails on TH1520 SoC, and reading back from APB glue registers that control clock generation results in garbage, causing broken link. With some testing, it's found a clock must be ungated for access to APB glue registers. Without any consumer, the clock is automatically disabled during late kernel startup. Let's get and enable it if it's described in devicetree. For backward compatibility with older devicetrees, probing won't fail if the APB clock isn't found. In this case, we emit a warning since the link will break if the speed changes. Fixes: `33a1a01e3a` ("net: stmmac: Add glue layer for T-HEAD TH1520 SoC") Signed-off-by: Yao Zi <ziyao@disroot.org> Tested-by: Drew Fustini <fustini@kernel.org> Reviewed-by: Drew Fustini <fustini@kernel.org> Link: https://patch.msgid.link/20250808093655.48074-4-ziyao@disroot.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-12 12:52:21 +02:00
Yao Zi	c8a9a619c0	dt-bindings: net: thead,th1520-gmac: Describe APB interface clock Besides ones for GMAC core and peripheral registers, the TH1520 GMAC requires one more clock for configuring APB glue registers. Describe it in the binding. Fixes: `f920ce04c3` ("dt-bindings: net: Add T-HEAD dwmac support") Signed-off-by: Yao Zi <ziyao@disroot.org> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Drew Fustini <fustini@kernel.org> Link: https://patch.msgid.link/20250808093655.48074-3-ziyao@disroot.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-12 12:52:21 +02:00
Buday Csaba	8ea25274eb	net: mdiobus: release reset_gpio in mdiobus_unregister_device() reset_gpio is claimed in mdiobus_register_device(), but it is not released in mdiobus_unregister_device(). It is instead only released when the whole MDIO bus is unregistered. When a device uses the reset_gpio property, it becomes impossible to unregister it and register it again, because the GPIO remains claimed. This patch resolves that issue. Fixes: `bafbdd527d` ("phylib: Add device reset GPIO support") # see notes Reviewed-by: Andrew Lunn <andrew@lunn.ch> Cc: Csókás Bence <csokas.bence@prolan.hu> [ csokas.bence: Resolve rebase conflict and clarify msg ] Signed-off-by: Buday Csaba <buday.csaba@prolan.hu> Link: https://patch.msgid.link/20250807135449.254254-2-csokas.bence@prolan.hu Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-12 12:32:58 +02:00
Clark Wang	8ee90742cf	net: phy: nxp-c45-tja11xx: fix the PHY ID mismatch issue when using C45 TJA1103/04/20/21 support both C22 and C45 accessing methods. The TJA11xx driver has implemented the match_phy_device() API. However, it does not handle the C45 ID. If C45 was used to access TJA11xx, match_phy_device() would always return false due to phydev->phy_id only used by C22 being empty, resulting in the generic phy driver being used for TJA11xx PHYs. Therefore, check phydev->c45_ids.device_ids[MDIO_MMD_PMAPMD] when using C45. Fixes: `1b76b2497a` ("net: phy: nxp-c45-tja11xx: simplify .match_phy_device OP") Signed-off-by: Clark Wang <xiaoning.wang@nxp.com> Link: https://patch.msgid.link/20250807040832.2455306-1-xiaoning.wang@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-12 12:01:07 +02:00
Jialin Wang	c0e1b774f6	proc: proc_maps_open allow proc_mem_open to return NULL The commit `65c6604725` ("proc: fix the issue of proc_mem_open returning NULL") caused proc_maps_open() to return -ESRCH when proc_mem_open() returns NULL. This breaks legitimate /proc/<pid>/maps access for kernel threads since kernel threads have NULL mm_struct. The regression causes perf to fail and exit when profiling a kernel thread: # perf record -v -g -p $(pgrep kswapd0) ... couldn't open /proc/65/task/65/maps This patch partially reverts the commit to fix it. Link: https://lkml.kernel.org/r/20250807165455.73656-1-wjl.linux@gmail.com Fixes: `65c6604725` ("proc: fix the issue of proc_mem_open returning NULL") Signed-off-by: Jialin Wang <wjl.linux@gmail.com> Cc: Penglei Jiang <superman.xpt@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-11 23:01:00 -07:00
Lorenzo Stoakes	0b5be138ce	mm/mremap: avoid expensive folio lookup on mremap folio pte batch It was discovered in the attached report that commit `f822a9a81a` ("mm: optimize mremap() by PTE batching") introduced a significant performance regression on a number of metrics on x86-64, most notably stress-ng.bigheap.realloc_calls_per_sec - indicating a 37.3% regression in number of mremap() calls per second. I was able to reproduce this locally on an intel x86-64 raptor lake system, noting an average of 143,857 realloc calls/sec (with a stddev of 4,531 or 3.1%) prior to this patch being applied, and 81,503 afterwards (stddev of 2,131 or 2.6%) - a 43.3% regression. During testing I was able to determine that there was no meaningful difference in efforts to optimise the folio_pte_batch() operation, nor checking folio_test_large(). This is within expectation, as a regression this large is likely to indicate we are accessing memory that is not yet in a cache line (and perhaps may even cause a main memory fetch). The expectation by those discussing this from the start was that vm_normal_folio() (invoked by mremap_folio_pte_batch()) would likely be the culprit due to having to retrieve memory from the vmemmap (which mremap() page table moves does not otherwise do, meaning this is inevitably cold memory). I was able to definitively determine that this theory is indeed correct and the cause of the issue. The solution is to restore part of an approach previously discarded on review, that is to invoke pte_batch_hint() which explicitly determines, through reference to the PTE alone (thus no vmemmap lookup), what the PTE batch size may be. On platforms other than arm64 this is currently hardcoded to return 1, so this naturally resolves the issue for x86-64, and for arm64 introduces little to no overhead as the pte cache line will be hot. With this patch applied, we move from 81,503 realloc calls/sec to 138,701 (stddev of 496.1 or 0.4%), which is a -3.6% regression, however accounting for the variance in the original result, this is broadly restoring performance to its prior state. Link: https://lkml.kernel.org/r/20250807185819.199865-1-lorenzo.stoakes@oracle.com Fixes: `f822a9a81a` ("mm: optimize mremap() by PTE batching") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202508071609.4e743d7c-lkp@intel.com Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-11 23:00:59 -07:00
Suren Baghdasaryan	aba6faec01	userfaultfd: fix a crash in UFFDIO_MOVE when PMD is a migration entry When UFFDIO_MOVE encounters a migration PMD entry, it proceeds with obtaining a folio and accessing it even though the entry is swp_entry_t. Add the missing check and let split_huge_pmd() handle migration entries. While at it also remove unnecessary folio check. [surenb@google.com: remove extra folio check, per David] Link: https://lkml.kernel.org/r/20250807200418.1963585-1-surenb@google.com Link: https://lkml.kernel.org/r/20250806220022.926763-1-surenb@google.com Fixes: `adef440691` ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: syzbot+b446dbe27035ef6bd6c2@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68794b5c.a70a0220.693ce.0050.GAE@google.com/ Reviewed-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-11 23:00:59 -07:00
Dev Jain	cf1b80dc31	mm: pass page directly instead of using folio_page In commit_anon_folio_batch(), we iterate over all pages pointed to by the PTE batch. Therefore we need to know the first page of the batch; currently we derive that via folio_page(folio, 0), but, that takes us to the first (head) page of the folio instead - our PTE batch may lie in the middle of the folio, leading to incorrectness. Bite the bullet and throw away the micro-optimization of reusing the folio in favour of code simplicity. Derive the page and the folio in change_pte_range, and pass the page too to commit_anon_folio_batch to fix the aforementioned issue. Link: https://lkml.kernel.org/r/20250806145611.3962-1-dev.jain@arm.com Fixes: `cac1db8c3a` ("mm: optimize mprotect() by PTE batching") Reported-by: syzbot+57bcc752f0df8bb1365c@syzkaller.appspotmail.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Debugged-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Zhenhua Huang <quic_zhenhuah@quicinc.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-11 23:00:59 -07:00
Sukrut Heroorkar	ab5ac789ef	selftests/proc: fix string literal warning in proc-maps-race.c This change resolves non literal string format warning invoked for proc-maps-race.c while compiling. proc-maps-race.c:205:17: warning: format not a string literal and no format arguments [-Wformat-security] 205 \| printf(text); \| ^~~~~~ proc-maps-race.c:209:17: warning: format not a string literal and no format arguments [-Wformat-security] 209 \| printf(text); \| ^~~~~~ proc-maps-race.c: In function `print_last_lines': proc-maps-race.c:224:9: warning: format not a string literal and no format arguments [-Wformat-security] 224 \| printf(start); \| ^~~~~~ Add string format specifier %s for the printf calls in both print_first_lines() and print_last_lines() thus resolving the warnings. The test executes fine after this change thus causing no effect to the functional behavior of the test. Link: https://lkml.kernel.org/r/20250804225633.841777-1-hsukrut3@gmail.com Fixes: `aadc099c48` ("selftests/proc: add verbose mode for /proc/pid/maps tearing tests") Signed-off-by: Sukrut Heroorkar <hsukrut3@gmail.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: David Hunter <david.hunter.linux@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-11 23:00:59 -07:00
Russell King (Oracle)	89886abd07	net: stmmac: dwc-qos: fix clk prepare/enable leak on probe failure dwc_eth_dwmac_probe() gets bulk clocks, and then prepares and enables them. Unfortunately, if dwc_eth_dwmac_config_dt() or stmmac_dvr_probe() fail, we leave the clocks prepared and enabled. Fix this by using devm_clk_bulk_get_all_enabled() to combine the steps and provide devm based release of the prepare and enable state. This also fixes a similar leakin dwc_eth_dwmac_remove() which wasn't correctly retrieving the struct plat_stmmacenet_data. This becomes unnecessary. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Simon Horman <horms@kernel.org> Fixes: `a045e40645` ("net: stmmac: refactor clock management in EQoS driver") Link: https://patch.msgid.link/E1ukM1X-0086qu-Td@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-11 20:11:48 -07:00
Russell King (Oracle)	de1e963ad0	net: stmmac: rk: put the PHY clock on remove The PHY clock (bsp_priv->clk_phy) is obtained using of_clk_get(), which doesn't take part in the devm release. Therefore, when a device is unbound, this clock needs to be explicitly put. Fix this. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Simon Horman <horms@kernel.org> Fixes: `fecd4d7eef` ("net: stmmac: dwmac-rk: Add integrated PHY support") Link: https://patch.msgid.link/E1ukM1S-0086qo-PC@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-11 20:11:23 -07:00

1 2 3 4 5 ...

1381799 Commits