linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-16 17:12:50 -04:00

Author	SHA1	Message	Date
Suman Ghosh	5868682b68	octeontx2-af: npc: cn20k: KPM profile changes KPU (Kangaroo Processing Unit) profiles are primarily used to set the required packet pointers that will be used in later stages for key generation. In the new CN20K silicon variant, a new KPM profile is introduced alongside the existing KPU profiles. In CN20K, a total of 16 KPUs are grouped into 8 KPM profiles. As per the current hardware design, each KPM configuration contains a combination of 2 KPUs: KPM0 = KPU0 + KPU8 KPM1 = KPU1 + KPU9 ... KPM7 = KPU7 + KPU15 This configuration enables more efficient use of KPU resources. This patch adds support for the new KPM profile configuration. Signed-off-by: Suman Ghosh <sumang@marvell.com> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Link: https://patch.msgid.link/20260224080009.4147301-3-rkannoth@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 10:29:25 -08:00
Ratheesh Kannoth	1396771b0b	octeontx2-af: npc: cn20k: Index management In CN20K silicon, the MCAM is divided vertically into two banks. Each bank has a depth of 8192. The MCAM is divided horizontally into 32 subbanks, with each subbank having a depth of 256. Each subbank can accommodate either x2 keys or x4 keys. x2 keys are 256 bits in size, and x4 keys are 512 bits in size. Bank1 Bank0 |-----------------------------| | | | subbank 31 { depth 256 } | | | |-----------------------------| | | | subbank 30 | | | ------------------------------ ............................... |-----------------------------| | | | subbank 0 | | | ------------------------------| This patch implements the following allocation schemes in NPC. The allocation API accepts reference (ref), limit, contig, priority, and count values. For example, specifying ref=100, limit=200, contig=1, priority=LOW, and count=20 will allocate 20 contiguous MCAM entries between entries 100 and 200. 1. Contiguous allocation with ref, limit, and priority. 2. Non-contiguous allocation with ref, limit, and priority. 3. Non-contiguous allocation without ref. 4. Contiguous allocation without ref. Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Link: https://patch.msgid.link/20260224080009.4147301-2-rkannoth@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 10:29:25 -08:00
Thorsten Blum	ded4a02e7d	ipv6: sit: Replace deprecated strcpy with strscpy strcpy() has been deprecated [1] because it performs no bounds checking on the destination buffer, which can lead to buffer overflows. Replace it with the safer strscpy(). Use the two-argument version of strscpy() to copy 'parms->name' in ipip6_tunnel_locate(). Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy [1] Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://patch.msgid.link/20260227004541.798966-3-thorsten.blum@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 10:06:29 -08:00
Jakub Kicinski	eed562b2a6	Merge branch 'gve-support-larger-ring-sizes-in-dqo-qpl-mode' Max Yuan says: ==================== gve: Support larger ring sizes in DQO-QPL mode This patch series updates the gve driver to improve Queue Page List (QPL) management and enable support for larger ring sizes when using the DQO-QPL queue format. Previously, the driver used hardcoded multipliers to determine the number of pages to register for QPLs (e.g., 2x ring size for RX). This rigid approach made it difficult to support larger ring sizes without potentially exceeding the "max_registered_pages" limit reported by the device. The first patch introduces a unified and flexible logic for calculating QPL page requirements. It balances TX and RX page allocations based on the configured ring sizes and scales the total count down proportionally if it would otherwise exceed the device's global registration limit. The second patch leverages this new flexibility to stop ignoring the maximum ring size supported by the device in DQO-QPL mode. Users can now configure ring sizes up to the device-reported maximum, as the driver will automatically adjust the QPL size to stay within allowed memory bounds. ==================== Link: https://patch.msgid.link/20260225182342.1049816-1-joshwash@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 08:58:41 -08:00
Matt Olson	a2f1918401	gve: Enable reading max ring size from the device in DQO-QPL mode The gVNIC device indicates a device option (MODIFY_RING) to the driver, which presents a range of ring sizes from which the user is allowed to select. But in DQO-QPL queue format, the driver ignores the "max" of this range and instead allows the user to configure the ring size in the range [min, default]. This was done because increasing the ring size could result in the number of registered pages being higher than the max allowed by the device. In order to support large ring sizes, stop ignoring the "max" of the range presented in the MODIFY_RING option. Signed-off-by: Matt Olson <maolson@google.com> Signed-off-by: Max Yuan <maxyuan@google.com> Reviewed-by: Jordan Rhee <jordanrhee@google.com> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com> Signed-off-by: Joshua Washington <joshwash@google.com> Link: https://patch.msgid.link/20260225182342.1049816-3-joshwash@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 08:58:29 -08:00
Matt Olson	07993df560	gve: Update QPL page registration logic For DQO, change QPL page registration logic to be more flexible to honor the "max_registered_pages" parameter from the gVNIC device. Previously the number of RX pages per QPL was hardcoded to twice the ring size, and the number of TX pages per QPL was dictated by the device in the DQO-QPL device option. Now [in DQO-QPL mode], the driver will ignore the "tx_pages_per_qpl" parameter indicated in the DQO-QPL device option and instead allocate up to (tx_queue_length / 2) pages per TX QPL and up to (rx_queue_length * 2) pages per RX QPL while keeping the total number of pages under the "max_registered_pages". Merge DQO and GQI QPL page calculation logic into a unified gve_update_num_qpl_pages function. Add rx_pages_per_qpl to the priv struct for consumption by both DQO and GQI. Signed-off-by: Matt Olson <maolson@google.com> Signed-off-by: Max Yuan <maxyuan@google.com> Reviewed-by: Jordan Rhee <jordanrhee@google.com> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com> Signed-off-by: Joshua Washington <joshwash@google.com> Link: https://patch.msgid.link/20260225182342.1049816-2-joshwash@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 08:58:29 -08:00
Thorsten Blum	a9a13c7379	keys, dns: Use kmalloc_flex to improve dns_resolver_preparse Use kmalloc_flex() when allocating a new 'struct user_key_payload' in dns_resolver_preparse() to replace the open-coded size arithmetic. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://patch.msgid.link/20260226214930.785423-3-thorsten.blum@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 08:48:21 -08:00
Jiayuan Chen	58e443b773	net: fix sock compilation error under CONFIG_PREEMPT_RT When CONFIG_PREEMPT_RT is enabled, __SPIN_LOCK_UNLOCKED() expands to a brace-enclosed initializer rather than a compound literal, which cannot be used in assignment expressions. This causes a build failure: net/core/sock.c:3787:29: error: expected expression before '{' token 3787 | tmp.slock = __SPIN_LOCK_UNLOCKED(tmp.slock); Use declaration-with-initializer instead of assignment, consistent with how __SPIN_LOCK_UNLOCKED() is used elsewhere in the kernel (e.g. DEFINE_SPINLOCK). Fixes: `5151ec54f5` ("net: use try_cmpxchg() in lock_sock_nested()") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228111319.79506-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 07:42:39 -08:00
Jakub Kicinski	1e08faf996	Merge branch 'net-ethernet-litex-minor-improvment-for-the-codebase' Inochi Amaoto says: ==================== net: ethernet: litex: minor improvment for the codebase Improve the litex code for using the device managed function to register netdev and replace all the "pdev->dev" with dev pointer instead. ==================== Link: https://patch.msgid.link/20260227003351.752934-1-inochiama@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 19:25:20 -08:00
Inochi Amaoto	621e3634df	net: ethernet: litex: use device pointer to simplify code. As there is already a device pointer in the probe function, replace all "&pdev->dev" pattern with this predefined device pointer. Signed-off-by: Inochi Amaoto <inochiama@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20260227003351.752934-3-inochiama@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 19:25:16 -08:00
Inochi Amaoto	97c55c1298	net: ethernet: litex: use devm_register_netdev() to register netdev Use devm_register_netdev to avoid unnecessary remove() callback in platform_driver structure. Signed-off-by: Inochi Amaoto <inochiama@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20260227003351.752934-2-inochiama@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 19:25:15 -08:00
Leon Kral	57cc8ab3e9	net/handshake: Fixed grammar mistake The word "a" was used instead of "an" which is grammatically incorrect. Fixed by changing from "a" to "an". This improves readability of the documentation. Signed-off-by: Leon Kral <leon.j.kral@protonmail.com> Reviewed-by: Alistair Francis <alistair.francis@wdc.com> Link: https://patch.msgid.link/20260227001151.41610-1-leon.j.kral@protonmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 19:24:08 -08:00
Randy Dunlap	2164242c50	NFC: fix header file kernel-doc warnings Repair some of the comments: - use the correct enum names - don't use "/**" for a non-kernel-doc comment to fix these warnings: Warning: include/uapi/linux/nfc.h:127 Excess enum value '@NFC_EVENT_DEVICE_DEACTIVATED' description in 'nfc_commands' Warning: include/uapi/linux/nfc.h:204 Excess enum value '@NFC_ATTR_APDU' description in 'nfc_attrs' Warning: include/uapi/linux/nfc.h:302 expecting prototype for Pseudo(). Prototype was for NFC_RAW_HEADER_SIZE() instead Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20260226221004.1037909-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 19:21:56 -08:00
Eric Dumazet	6466441a5e	net: inline skb_add_rx_frag_netmem() This critical helper (via skb_add_rx_frag()) is mostly used from drivers rx fast path. It is time to inline it, this actually saves space in vmlinux: size vmlinux.old vmlinux text data bss dec hex filename 37350766 23092977 4846992 65290735 3e441ef vmlinux.old 37350600 23092977 4846992 65290569 3e44149 vmlinux Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260226041213.1892561-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 19:20:34 -08:00
Fernando Fernandez Mancera	9ff2d2a983	ipv6: discard fragment queue earlier if there is malformed datagram Currently the kernel IPv6 implementation is not discarding the fragment queue upon receiving an IPv6 fragment that is not 8 bytes aligned. It relies on queue expiration to free the queue. While RFC 8200 section 4.5 does not explicitly mention that the rest of the fragments must be discarded, it does not make sense to keep them. The parameter problem message is sent regardless of that. In addition, if the sender is able to re-compose the datagram so it is 8 bytes aligned it would qualify as a new whole datagram not fitting into the same fragment queue. The same situation happens if the segment end exceeds the IPv6 maximum packet length. The sooner we can free resources during reassembly, the better. Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Link: https://patch.msgid.link/20260225133758.4553-1-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 19:08:54 -08:00
Birger Koblitz	e8e83b6796	r8152: Add 2500baseT EEE status/configuration support The r8152 driver supports the RTL8156, which is a 2.5Gbit Ethernet controller for USB 3.0, for which support is added for configuring and displaying the EEE advertisement status for 2.5GBit connections. The patch also corrects the determination of whether EEE is active to include the 2.5GBit connection status and make the determination dependent not on the desired speed configuration (tp->speed), but on the actual speed used by the controller. For consistency, this is corrected also for the RTL8152/3. This was tested on an Edimax EU-4307 V1.0 USB-Ethernet adapter with RTL8156, and a SECOMP Value 12.99.1115 USB-C 3.1 Ethernet converter with RTL8153. Signed-off-by: Birger Koblitz <mail@birger-koblitz.de> Link: https://patch.msgid.link/20260224-b4-eee2g5-v2-1-cf5c83df036e@birger-koblitz.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 18:42:22 -08:00
Aaron Tomlin	c31770c493	vmxnet3: Suppress page allocation warning for massive Rx Data ring The vmxnet3 driver supports an Rx Data ring (rx-mini) to optimise the processing of small packets. The size of this ring's DMA-coherent memory allocation is determined by the product of the primary Rx ring size and the data ring descriptor size: sz = rq->rx_ring[0].size * rq->data_ring.desc_size; When a user configures the maximum supported parameters via ethtool (rx_ring[0].size = 4096, data_ring.desc_size = 2048), the required contiguous memory allocation reaches 8 MB (8,388,608 bytes). In environments lacking Contiguous Memory Allocator (CMA), dma_alloc_coherent() falls back to the standard zone buddy allocator. An 8 MB allocation translates to a page order of 11, which strictly exceeds the default MAX_PAGE_ORDER (10) on most architectures. Consequently, __alloc_pages_noprof() catches the oversize request and triggers a loud kernel warning stack trace: WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp) This warning is unnecessary and alarming to system administrators because the vmxnet3 driver already handles this allocation failure gracefully. If dma_alloc_coherent() returns NULL, the driver safely disables the Rx Data ring (adapter->rxdataring_enabled = false) and falls back to standard, streaming DMA packet processing. To resolve this, append the __GFP_NOWARN flag to the dma_alloc_coherent() gfp_mask. This instructs the page allocator to silently fail the allocation if it exceeds order limits or memory is too fragmented, preventing the spurious warning stack trace. Furthermore, enhance the subsequent netdev_err() fallback message to include the requested allocation size. This provides critical debugging context to the administrator (e.g., revealing that an 8 MB allocation was attempted and failed) without making hardcoded assumptions about the state of the system's configurations. Reviewed-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: Aaron Tomlin <atomlin@atomlin.com> Link: https://patch.msgid.link/20260226163121.4045808-1-atomlin@atomlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 18:36:45 -08:00
Joris Vaisvila	9a04d3b2f0	net: ethernet: mtk_eth_soc: avoid writing to ESW registers on MT7628 The MT7628 has a fixed-link PHY and does not expose MAC control registers. Writes to these registers only corrupt the ESW VLAN configuration. This patch explicitly registers no-op phylink_mac_ops for MT7628, as after removing the invalid register accesses, the existing phylink_mac_ops effectively become no-ops. This code was introduced by commit `296c912075` ("net: ethernet: mediatek: Add MT7628/88 SoC support") Signed-off-by: Joris Vaisvila <joey@tinyisr.com> Reviewed-by: Daniel Golle <daniel@makrotpia.org> Reviewed-by: Stefan Roese <stefan.roese@mailbox.org> Link: https://patch.msgid.link/20260226154547.68553-1-joey@tinyisr.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 18:36:21 -08:00
Sabrina Dubroca	da89f2e312	tls: don't select STREAM_PARSER ktls was converted to its own stream parser in commit `84c61fe1a7` ("tls: rx: do not use the standard strparser"), but the Kconfig dependency was left. The only part of the original strparser that's shared with ktls are a few structs (strp_msg, sk_skb_cb) and the strp_msg helper, those don't require building the net/strparser code. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/cb41e513a30eeaac0b419284cc87433f049b2ee0.1771871995.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 18:36:13 -08:00
Eric Dumazet	5151ec54f5	net: use try_cmpxchg() in lock_sock_nested() Add a fast path in lock_sock_nested(), to avoid acquiring the socket spinlock only to set @owned to one: spin_lock_bh(&sk->sk_lock.slock); if (unlikely(sock_owned_by_user_nocheck(sk))) __lock_sock(sk); sk->sk_lock.owned = 1; spin_unlock_bh(&sk->sk_lock.slock); On x86_64 compiler generates something quite efficient: 00000000000077c0 <lock_sock_nested>: 77c0: f3 0f 1e fa endbr64 77c4: e8 00 00 00 00 call __fentry__ 77c9: b9 01 00 00 00 mov $0x1,%ecx 77ce: 31 c0 xor %eax,%eax 77d0: f0 48 0f b1 8f 48 01 00 00 lock cmpxchg %rcx,0x148(%rdi) 77d9: 75 06 jne slow_path 77db: 2e e9 00 00 00 00 cs jmp __x86_return_thunk-0x4 slow_path: ... Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20260226021215.1764237-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 17:25:45 -08:00
Kexin Sun	b99ccb37ed	net/hsr: update outdated comments The function hsr_rcv() was renamed hsr_handle_frame() and moved to net/hsr/hsr_slave.c by commit `81ba6afd6e` ("net/hsr: Switch from dev_add_pack() to netdev_rx_handler_register()"). Update all remaining references in the comments accordingly. Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260225145159.2953-1-kexinsun@smail.nju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 17:24:58 -08:00
Jens Emil Schulz Østergaard	11c0663a59	net: phy: micrel: Add support for lan9645x internal phy LAN9645X is a family of switch chips with 5 internal copper phys. The internal PHY is based on parts of LAN8832. This is a low-power, single port triple-speed (10BASE-T/100BASE-TX/1000BASE-T) ethernet physical layer transceiver (PHY) that supports transmission and reception of data on standard CAT-5, as well as CAT-5e and CAT-6 Unshielded Twisted Pair (UTP) cables. Add support for the internal PHY of the lan9645x chip family. Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com> Reviewed-by: Daniel Machon <daniel.machon@microchip.com> Signed-off-by: Jens Emil Schulz Østergaard <jensemil.schulzostergaard@microchip.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20260226-phy_micrel_add_support_for_lan9645x_internal_phy-v3-1-1fe82379962b@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 17:23:37 -08:00
Byungchul Park	fd6dad4e1a	netmem: remove the pp fields from net_iov Now that the pp fields in net_iov have no users, remove them from net_iov and clean up. Signed-off-by: Byungchul Park <byungchul@sk.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20260224061424.11219-1-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:45:24 -08:00
Zhengping Zhang	aebf15e8eb	net: airoha: fix typo in function name Corrected the typo in the function name from `airhoa_is_lan_gdm_port` to `airoha_is_lan_gdm_port`. This change ensures consistency in the API naming convention. Signed-off-by: Zhengping Zhang <aquapinn@qq.com> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/tencent_E4FD5D6BC0131E617D848896F5F9FCED6E0A@qq.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:21:40 -08:00
Tiernan Hubble	b70190d767	net: atlantic: fix reading SFP module info on some AQC100 cards Commit `853a2944aa` ("net: atlantic: support reading SFP module info") added support for reading SFP module info on AQC100-based cards. However, it only supports reading directly from the controller's hardware registers, and this does not seem to be supported on certain cards, including my TRENDnet TEG-10GECSFP V3. "ethtool -m" times out when reading certain registers, even when I increase the read poll timeout values. The DPDK "atlantic" driver reads module info via firmware calls instead of directly reading the hardware registers, provided that the NIC's firmware version supports it. This change adapts the DPDK firmware call code to the kernel driver. It preserves the old hardware-based module read code as a fallback when the firmware does not support it, to avoid breaking cards that are currently working. Tested on 2 different TRENDnet TEG-10GECSFP V3 cards, both with firmware version 3.1.121 (current at the time of this patch). Both cards correctly reported module info for a passive DAC cable and 2 different 10G optical transceivers. Signed-off-by: Tiernan Hubble <thubble@thubble.ca> Link: https://patch.msgid.link/20260225002026.1754045-1-thubble@thubble.ca Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:20:53 -08:00
Jakub Kicinski	ed02c6b8b5	Merge branch 'support-phys-that-have-inband-autoneg-disabled-with-gem' Charles Perry says: ==================== Support PHYs that have inband autoneg disabled with GEM I'm testing SGMII with a VSC8574 PHY [1] and microchip HPSC SoC [2]. The link can work with or without autoneg, as long as the MAC and the PHY are configured the same way. This doesn't work with the current MAC driver because the MAC inband autoneg is always enabled (in the ->mac_config() phylink_mac_ops). More precisely, the PHY driver (mscc_main.c) has phylink's ->config_inband() implemented while the MAC ->pcs_config() ops has an empty body. This is based on code written by Sean Anderson [3]. Let me know if I should add a From: or Co-developed-by: tag. Logs with inband autoneg (managed = "in-band-status"): root@p64h:~# ifconfig eth1 up 10.180.59.33 macb 40004184000.ethernet eth1: PHY 4000c21e000.mdio-mdio:02 doesn't supply possible interfaces macb 40004184000.ethernet eth1: PHY [4000c21e000.mdio-mdio:02] driver [Microsemi GE VSC8574 SyncE] (irq=POLL) macb 40004184000.ethernet eth1: phy: sgmii setting supported 00000000,00000000,00000000,000042ff advertising 00000000,00000000,00000000,000042ff macb 40004184000.ethernet eth1: configuring for inband/sgmii link mode macb 40004184000.ethernet eth1: major config, requested inband/sgmii macb 40004184000.ethernet eth1: interface sgmii inband modes: pcs=03 phy=03 macb 40004184000.ethernet eth1: major config, active inband/inband,an-enabled/sgmii macb 40004184000.ethernet eth1: phylink_mac_config: mode=inband/sgmii/none adv=00000000,00000000,00000000,000042ff pause=00 macb_pcs_config: PCSANADV=0x1 PCSCNTRL=0x1040 macb_pcs_get_state: PCSSTS=0x109 PCSANLPBASE=0x1 macb_pcs_get_state: PCSSTS=0x12d PCSANLPBASE=0x1801 macb 40004184000.ethernet eth1: phy link down sgmii/Unknown/Unknown/none/off/nolpi macb_pcs_get_state: PCSSTS=0x12d PCSANLPBASE=0x1801 macb_pcs_get_state: PCSSTS=0x12d PCSANLPBASE=0x1801 macb 40004184000.ethernet eth1: phy link up sgmii/1Gbps/Full/none/tx/nolpi macb_pcs_get_state: PCSSTS=0x129 PCSANLPBASE=0x9801 macb_pcs_get_state: PCSSTS=0x12d PCSANLPBASE=0x9801 macb 40004184000.ethernet eth1: Link is Up - 1Gbps/Full - flow control tx Logs without inband autoneg: root@p64h:~# ifconfig eth1 up 10.180.59.33 macb 40004184000.ethernet eth1: PHY 4000c21e000.mdio-mdio:02 doesn't supply possible interfaces macb 40004184000.ethernet eth1: PHY [4000c21e000.mdio-mdio:02] driver [Microsemi GE VSC8574 SyncE] (irq=POLL) macb 40004184000.ethernet eth1: phy: sgmii setting supported 00000000,00000000,00000000,000042ff advertising 00000000,00000000,00000000,000042ff macb 40004184000.ethernet eth1: configuring for phy/sgmii link mode macb 40004184000.ethernet eth1: major config, requested phy/sgmii macb 40004184000.ethernet eth1: interface sgmii inband modes: pcs=03 phy=03 macb 40004184000.ethernet eth1: major config, active phy/outband/sgmii macb 40004184000.ethernet eth1: phylink_mac_config: mode=phy/sgmii/none adv=00000000,00000000,00000000,00000000 pause=00 macb_pcs_config: PCSANADV=0x1 PCSCNTRL=0x40 macb 40004184000.ethernet eth1: phy link down sgmii/Unknown/Unknown/none/off/nolpi macb 40004184000.ethernet eth1: phy link up sgmii/1Gbps/Full/none/tx/nolpi macb 40004184000.ethernet eth1: Link is Up - 1Gbps/Full - flow control tx The above logs are generated with an additional printk() in macb_pcs_config() and macb_pcs_get_state() and "#define DEBUG" in phylink.c. [1]: https://www.microchip.com/en-us/product/vsc8574 [2]: https://www.microchip.com/en-us/products/microprocessors/64-bit-mpus/pic64-hpsc [3]: https://lore.kernel.org/all/20250610233547.3588356-1-sean.anderson@linux.dev/ ==================== Link: https://patch.msgid.link/20260224202854.112813-1-charles.perry@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:19:28 -08:00
Charles Perry	d3549e2b48	net: macb: add the .pcs_inband_caps() callback for SGMII In SGMII mode, GEM can work with or without inband autonegotiation. Signed-off-by: Charles Perry <charles.perry@microchip.com> Link: https://patch.msgid.link/20260224202854.112813-4-charles.perry@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:19:24 -08:00
Charles Perry	7f44b2acc5	net: macb: add support for reporting SGMII inband link status This makes it possible to use in-band autonegotiation with SGMII. If using a device tree, this can be done by adding the managed = "in-band-status" property to the gem node. Signed-off-by: Charles Perry <charles.perry@microchip.com> Link: https://patch.msgid.link/20260224202854.112813-3-charles.perry@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:19:24 -08:00
Charles Perry	1338cfef1f	net: macb: fix SGMII with inband aneg disabled Make it possible to connect a PHY which does not use inband autoneg to a gem MAC using phylink's information. The previous implementation relied on whether or not the link was a fixed-link to disable SGMII autoneg. This commit extends this to all links which are not configured for inband autonegotiation. Signed-off-by: Charles Perry <charles.perry@microchip.com> Link: https://patch.msgid.link/20260224202854.112813-2-charles.perry@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:19:24 -08:00
Eric Dumazet	363c5108e4	inet: remove three EXPORT_SYMBOL() inet_rcv_saddr_equal() and inet_csk_listen_stop() are not used from any modules. inet_csk_accept() can use EXPORT_IPV6_MOD() Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260225134023.1176738-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 18:27:21 -08:00
Yohei Kojima	5cf47393d9	docs: ethtool: clarify the bit-by-bit bitset format description Clarify the bit-by-bit bitset format's behavior around mandatory attributes and bit identification. More specifically, the following changes are made: * Rephrase a misleading sentence which implies name and index are mutually exclusive * Describe that ETHTOOL_A_BITSET_BITS nest is mandatory * Describe that a request fails if inconsistent identifiers are given Signed-off-by: Yohei Kojima <yk@y-koj.net> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/ef90a56965ca66e57aa177929ce3e10c5ca815fa.1772031974.git.yk@y-koj.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 17:55:42 -08:00
Bo Sun	f22b4e6fbb	octeontx2-af: CGX: replace kfree() with rvu_free_bitmap() mac_to_index_bmap is allocated with rvu_alloc_bitmap(), so free it with rvu_free_bitmap() instead of open-coding kfree(.bmap) to keep the alloc/free API pairing consistent. Signed-off-by: Bo Sun <bo@mboxify.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Jijie Shao <shaojijie@huawei.com> Link: https://patch.msgid.link/20260225082348.2519131-1-bo@mboxify.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 17:45:00 -08:00
Gabriel Goller	d68d21ea6b	docs: net: document neigh gc_interval sysctl Add entry for the neigh/default/gc_interval sysctl. This sysctl is unused since kernel v2.6.8. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Gabriel Goller <g.goller@proxmox.com> Link: https://patch.msgid.link/20260225095822.44050-1-g.goller@proxmox.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 17:37:04 -08:00
Russell King (Oracle)	5c894879f1	net: stmmac: ptp: limit n_per_out ptp_clock_ops.n_per_out sets the number of PPS outputs, which the PTP subsystem uses to validate userspace input, such as the index number used in a PTP_CLK_REQ_PEROUT request. stmmac_enable() uses this to index the priv->pps array, which is an array of size STMMAC_PPS_MAX. ptp_clock_ops.n_per_out is initialised using priv->dma_cap.pps_out_num, which is a three bit field read from hardware. Documentation that I've checked suggests that values >= 5 are reserved, but that doesn't mean such values won't appear, and if they do, we can overrun the priv->pps array in stmmac_enable(). stmmac_ptp_register() has protection against this in its loop, but it doesn't act to limit ptp_clock_ops.n_per_out. Fix this by introducing a local variable, pps_out_num which is limited to STMMAC_PPS_MAX, and use that when initialising the array and setting priv->ptp_clock_ops.n_per_out. Print a warning when we limit the number of outputs. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1vvBhn-0000000ArCg-4C4u@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 17:35:48 -08:00
Jakub Kicinski	0314e382cf	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-7.0-rc2). Conflicts: tools/testing/selftests/drivers/net/hw/rss_ctx.py `19c3a2a81d` ("selftests: drv-net: rss: Generate unique ports for RSS context tests") `ce5a0f4612` ("selftests: drv-net: rss_ctx: test RSS contexts persist after ifdown/up") include/net/inet_connection_sock.h `858d2a4f67` ("tcp: fix potential race in tcp_v6_syn_recv_sock()") `fcd3d039fa` ("tcp: make tcp_v{4,6}_send_check() static") https://lore.kernel.org/aZ8PSFLzBrEU3I89@sirena.org.uk drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.c `69050f8d6d` ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types") `bf4afc53b7` ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument") `8a96b9144f` ("net/mlx5e: Alloc xsk channel param out of mlx5e_open_xsk()") Adjacent changes: net/netfilter/ipvs/ip_vs_ctl.c `c59bd9e62e` ("ipvs: use more counters to avoid service lookups") `bf4afc53b7` ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 10:23:00 -08:00
Linus Torvalds	b9c8fc2cae	Merge tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from IPsec, Bluetooth and netfilter Current release - regressions: - wifi: fix dev_alloc_name() return value check - rds: fix recursive lock in rds_tcp_conn_slots_available Current release - new code bugs: - vsock: lock down child_ns_mode as write-once Previous releases - regressions: - core: - do not pass flow_id to set_rps_cpu() - consume xmit errors of GSO frames - netconsole: avoid OOB reads, msg is not nul-terminated - netfilter: h323: fix OOB read in decode_choice() - tcp: re-enable acceptance of FIN packets when RWIN is 0 - udplite: fix null-ptr-deref in __udp_enqueue_schedule_skb(). - wifi: brcmfmac: fix potential kernel oops when probe fails - phy: register phy led_triggers during probe to avoid AB-BA deadlock - eth: - bnxt_en: fix deleting of Ntuple filters - wan: farsync: fix use-after-free bugs caused by unfinished tasklets - xscale: check for PTP support properly Previous releases - always broken: - tcp: fix potential race in tcp_v6_syn_recv_sock() - kcm: fix zero-frag skb in frag_list on partial sendmsg error - xfrm: - fix race condition in espintcp_close() - always flush state and policy upon NETDEV_UNREGISTER event - bluetooth: - purge error queues in socket destructors - fix response to L2CAP_ECRED_CONN_REQ - eth: - mlx5: - fix circular locking dependency in dump - fix "scheduling while atomic" in IPsec MAC address query - gve: fix incorrect buffer cleanup for QPL - team: avoid NETDEV_CHANGEMTU event when unregistering slave - usb: validate USB endpoints" * tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits) netfilter: nf_conntrack_h323: fix OOB read in decode_choice() dpaa2-switch: validate num_ifs to prevent out-of-bounds write net: consume xmit errors of GSO frames vsock: document write-once behavior of the child_ns_mode sysctl vsock: lock down child_ns_mode as write-once selftests/vsock: change tests to respect write-once child ns mode net/mlx5e: Fix "scheduling while atomic" in IPsec MAC address query net/mlx5: Fix missing devlink lock in SRIOV enable error path net/mlx5: E-switch, Clear legacy flag when moving to switchdev net/mlx5: LAG, disable MPESW in lag_disable_change() net/mlx5: DR, Fix circular locking dependency in dump selftests: team: Add a reference count leak test team: avoid NETDEV_CHANGEMTU event when unregistering slave net: mana: Fix double destroy_workqueue on service rescan PCI path MAINTAINERS: Update maintainer entry for QUALCOMM ETHQOS ETHERNET DRIVER dpll: zl3073x: Remove redundant cleanup in devm_dpll_init() selftests/net: packetdrill: Verify acceptance of FIN packets when RWIN is 0 tcp: re-enable acceptance of FIN packets when RWIN is 0 vsock: Use container_of() to get net namespace in sysctl handlers net: usb: kaweth: validate USB endpoints ...	2026-02-26 08:00:13 -08:00
Vahagn Vardanian	baed0d9ba9	netfilter: nf_conntrack_h323: fix OOB read in decode_choice() In decode_choice(), the boundary check before get_len() uses the variable `len`, which is still 0 from its initialization at the top of the function: unsigned int type, ext, len = 0; ... if (ext \|\| (son->attr & OPEN)) { BYTE_ALIGN(bs); if (nf_h323_error_boundary(bs, len, 0)) /* len is 0 here */ return H323_ERROR_BOUND; len = get_len(bs); /* OOB read */ When the bitstream is exactly consumed (bs->cur == bs->end), the check nf_h323_error_boundary(bs, 0, 0) evaluates to (bs->cur + 0 > bs->end), which is false. The subsequent get_len() call then dereferences bs->cur++, reading 1 byte past the end of the buffer. If that byte has bit 7 set, get_len() reads a second byte as well. This can be triggered remotely by sending a crafted Q.931 SETUP message with a User-User Information Element containing exactly 2 bytes of PER-encoded data ({0x08, 0x00}) to port 1720 through a firewall with the nf_conntrack_h323 helper active. The decoder fully consumes the PER buffer before reaching this code path, resulting in a 1-2 byte heap-buffer-overflow read confirmed by AddressSanitizer. Fix this by checking for 2 bytes (the maximum that get_len() may read) instead of the uninitialized `len`. This matches the pattern used at every other get_len() call site in the same file, where the caller checks for 2 bytes of available data before calling get_len(). Fixes: `ec8a8f3c31` ("netfilter: nf_ct_h323: Extend nf_h323_error_boundary to work on bits as well") Signed-off-by: Vahagn Vardanian <vahagn@redrays.io> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260225130619.1248-2-fw@strlen.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 12:50:42 +01:00
Junrui Luo	8a5752c6dc	dpaa2-switch: validate num_ifs to prevent out-of-bounds write The driver obtains sw_attr.num_ifs from firmware via dpsw_get_attributes() but never validates it against DPSW_MAX_IF (64). This value controls iteration in dpaa2_switch_fdb_get_flood_cfg(), which writes port indices into the fixed-size cfg->if_id[DPSW_MAX_IF] array. When firmware reports num_ifs >= 64, the loop can write past the array bounds. Add a bound check for num_ifs in dpaa2_switch_init(). dpaa2_switch_fdb_get_flood_cfg() appends the control interface (port num_ifs) after all matched ports. When num_ifs == DPSW_MAX_IF and all ports match the flood filter, the loop fills all 64 slots and the control interface write overflows by one entry. The check uses >= because num_ifs == DPSW_MAX_IF is also functionally broken. build_if_id_bitmap() silently drops any ID >= 64: if (id[i] < DPSW_MAX_IF) bmap[id[i] / 64] \|= ... Fixes: `539dda3c5d` ("staging: dpaa2-switch: properly setup switching domains") Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com> Link: https://patch.msgid.link/SYBPR01MB78812B47B7F0470B617C408AAF74A@SYBPR01MB7881.ausprd01.prod.outlook.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 12:37:21 +01:00
Hangbin Liu	4916f2e2f3	bonding: print churn state via netlink Currently, the churn state is printed only in sysfs. Add netlink support so users could get the state via netlink. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20260224020215.6012-1-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:45:35 +01:00
Qingfang Deng	15c9ed1d82	pppoe: remove kernel-mode relay support The kernel-mode PPPoE relay feature and its two associated ioctls (PPPOEIOCSFWD and PPPOEIOCDFWD) are not used by any existing userspace PPPoE implementations. The most commonly-used package, RP-PPPoE [1], handles the relaying entirely in userspace. This legacy code has remained in the driver since its introduction in kernel 2.3.99-pre7 for over two decades, but has served no practical purpose. Remove the unused relay code. [1] https://dianne.skoll.ca/projects/rp-pppoe/ Signed-off-by: Qingfang Deng <dqfext@gmail.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20260224015053.42472-1-dqfext@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:41:00 +01:00
Jakub Kicinski	7aa767d0d3	net: consume xmit errors of GSO frames udpgro_frglist.sh and udpgro_bench.sh are the flakiest tests currently in NIPA. They fail in the same exact way, TCP GRO test stalls occasionally and the test gets killed after 10min. These tests use veth to simulate GRO. They attach a trivial ("return XDP_PASS;") XDP program to the veth to force TSO off and NAPI on. Digging into the failure mode we can see that the connection is completely stuck after a burst of drops. The sender's snd_nxt is at sequence number N [1], but the receiver claims to have received (rcv_nxt) up to N + 3 * MSS [2]. Last piece of the puzzle is that senders rtx queue is not empty (let's say the block in the rtx queue is at sequence number N - 4 * MSS [3]). In this state, sender sends a retransmission from the rtx queue with a single segment, and sequence numbers N-4MSS:N-3MSS [3]. Receiver sees it and responds with an ACK all the way up to N + 3 * MSS [2]. But sender will reject this ack as TCP_ACK_UNSENT_DATA because it has no recollection of ever sending data that far out [1]. And we are stuck. The root cause is the mess of the xmit return codes. veth returns an error when it can't xmit a frame. We end up with a loss event like this: ------------------------------------------------- \| GSO super frame 1 \| GSO super frame 2 \| \|-----------------------------------------------\| \| seg \| seg \| seg \| seg \| seg \| seg \| seg \| seg \| \| 1 \| 2 \| 3 \| 4 \| 5 \| 6 \| 7 \| 8 \| ------------------------------------------------- x ok ok <ok>\| ok ok ok <x> \\ snd_nxt "x" means packet lost by veth, and "ok" means it went thru. Since veth has TSO disabled in this test it sees individual segments. Segment 1 is on the retransmit queue and will be resent. So why did the sender not advance snd_nxt even tho it clearly did send up to seg 8? tcp_write_xmit() interprets the return code from the core to mean that data has not been sent at all. 
Since TCP deals with GSO super frames, not individual segment the crux of the problem is that loss of a single segment can be interpreted as loss of all. TCP only sees the last return code for the last segment of the GSO frame (in <> brackets in the diagram above). Of course for the problem to occur we need a setup or a device without a Qdisc. Otherwise Qdisc layer disconnects the protocol layer from the device errors completely. We have multiple ways to fix this. 1) make veth not return an error when it lost a packet. While this is what I think we did in the past, the issue keeps reappearing and it's annoying to debug. The game of whack a mole is not great. 2) fix the damn return codes We only talk about NETDEV_TX_OK and NETDEV_TX_BUSY in the documentation, so maybe we should make the return code from ndo_start_xmit() a boolean. I like that the most, but perhaps some ancient, not-really-networking protocol would suffer. 3) make TCP ignore the errors It is not entirely clear to me what benefit TCP gets from interpreting the result of ip_queue_xmit()? Specifically once the connection is established and we're pushing data - packet loss is just packet loss? 4) this fix Ignore the rc in the Qdisc-less+GSO case, since it's unreliable. We already always return OK in the TCQ_F_CAN_BYPASS case. In the Qdisc-less case let's be a bit more conservative and only mask the GSO errors. This path is taken by non-IP-"networks" like CAN, MCTP etc, so we could regress some ancient thing. This is the simplest, but also maybe the hackiest fix? Similar fix has been proposed by Eric in the past but never committed because original reporter was working with an OOT driver and wasn't providing feedback (see Link). 
Link: https://lore.kernel.org/CANn89iJcLepEin7EtBETrZ36bjoD9LrR=k4cfwWh046GB+4f9A@mail.gmail.com Fixes: `1f59533f9c` ("qdisc: validate frames going through the direct_xmit path") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260223235100.108939-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:35:00 +01:00
Paolo Abeni	f0a2f2aadb	Merge branch 'vsock-add-write-once-semantics-to-child_ns_mode' Bobby Eshleman says: ==================== vsock: add write-once semantics to child_ns_mode Two administrator processes may race when setting child_ns_mode: one sets it to "local" and creates a namespace, but another changes it to "global" in between. The first process ends up with a namespace in the wrong mode. Make child_ns_mode write-once so that a namespace manager can set it once, check the value, and be guaranteed it won't change before creating its namespaces. Writing a different value after the first write returns -EBUSY. One patch for the implementation, one for docs, and one for tests. v2: https://lore.kernel.org/r/20260218-vsock-ns-write-once-v2-0-19e4c50d509a@meta.com v1: https://lore.kernel.org/r/20260217-vsock-ns-write-once-v1-1-a1fb30f289a9@meta.com ==================== Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-0-c0cde6959923@meta.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:10:06 +01:00
Bobby Eshleman	b6302e057f	vsock: document write-once behavior of the child_ns_mode sysctl Update the vsock child_ns_mode documentation to include the new write-once semantics of setting child_ns_mode. The semantics are implemented in a preceding patch in this series. Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-3-c0cde6959923@meta.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:10:03 +01:00
Bobby Eshleman	102eab95f0	vsock: lock down child_ns_mode as write-once Two administrator processes may race when setting child_ns_mode as one process sets child_ns_mode to "local" and then creates a namespace, but another process changes child_ns_mode to "global" between the write and the namespace creation. The first process ends up with a namespace in "global" mode instead of "local". While this can be detected after the fact by reading ns_mode and retrying, it is fragile and error-prone. Make child_ns_mode write-once so that a namespace manager can set it once and be sure it won't change. Writing a different value after the first write returns -EBUSY. This applies to all namespaces, including init_net, where an init process can write "local" to lock all future namespaces into local mode. Fixes: `eafb64f40c` ("vsock: add netns to vsock core") Suggested-by: Daan De Meyer <daan.j.demeyer@gmail.com> Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Co-developed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-2-c0cde6959923@meta.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:10:03 +01:00
Bobby Eshleman	a382a34276	selftests/vsock: change tests to respect write-once child ns mode The child_ns_mode sysctl parameter becomes write-once in a future patch in this series, which breaks existing tests. This patch updates the tests to respect this new policy. No additional tests are added. Add "global-parent" and "local-parent" namespaces as intermediaries to spawn namespaces in the given modes. This avoids the need to change "child_ns_mode" in the init_ns. nsenter must be used because ip netns unshares the mount namespace so nested "ip netns add" breaks exec calls from the init ns. Adds nsenter to the deps check. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-1-c0cde6959923@meta.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:10:03 +01:00
Paolo Abeni	90fcb0f3bc	Merge branch 'net-mlx5e-shampo-allow-high-order-pages-in-zerocopy-mode' Tariq Toukan says: ==================== net/mlx5e: SHAMPO, Allow high order pages in zerocopy mode This series adds support for high order pages when io_uring/devmem zero copy is used. See detailed description by Dragos below. The first patches are moving code around to allow using queue specific parameters that are not just for XSK. They are a bit large as they touch a lot of functions. The middle part of the series is updating various formulas to remove remaining hardcoded use of PAGE_SIZE/PAGE_SHIFT. The last part adds support for high order pages by implementing the queue configuration functions and allowing larger rx_page_size configurations when in zero-copy mode. Results show an increase in BW and a decrease in CPU usage. The benchmark was done with the zcrx samples from liburing [0]. rx_buf_len=4K, oncpu [1]: packets=3358832 (MB=820027), rps=55794 (MB/s=13621) Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: 9 1.56 0.00 18.09 13.42 0.00 66.80 0.00 0.00 0.00 0.12 rx_buf_len=128K, oncpu [2]: packets=3781376 (MB=923187), rps=62813 (MB/s=15335) Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: 9 0.33 0.00 7.61 18.86 0.00 73.08 0.00 0.00 0.00 0.12 rx_buf_len=4K, offcpu [3]: packets=3460368 (MB=844816), rps=57481 (MB/s=14033) Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: 9 0.00 0.00 0.26 0.00 0.00 92.63 0.00 0.00 0.00 7.11 Average: 11 3.04 0.00 68.09 28.87 0.00 0.00 0.00 0.00 0.00 0.00 rx_buf_len=128K, offcpu [4]: packets=4119840 (MB=1005820), rps=68435 (MB/s=16707) Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: 9 0.00 0.00 0.87 0.00 0.00 63.77 0.00 0.00 0.00 35.36 Average: 11 1.96 0.00 43.68 54.37 0.00 0.00 0.00 0.00 0.00 0.00 [0] https://github.com/isilence/liburing/tree/zcrx/rx-buf-len [1] commands: $> taskset -c 9 ./zcrx 6 
-i eth2 -q 9 -A 1 -B 4096 -S 33554432 $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000 [2] commands: $> taskset -c 9 ./zcrx 6 -i eth2 -q 9 -A 1 -B 131072 -S 33554432 $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000 [3] commands: $> taskset -c 11 ./zcrx 6 -i eth2 -q 9 -A 1 -B 4096 -S 33554432 $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000 [4] commands: $> taskset -c 11 ./zcrx 6 -i eth2 -q 9 -A 1 -B 131072 -S 33554432 $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000 ==================== Link: https://patch.msgid.link/20260223204155.1783580-1-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:41 +01:00
Dragos Tatulea	df5135fced	net/mlx5e: SHAMPO, Allow high order pages in zerocopy mode Allow high order pages only when SHAMPO mode is enabled (hw-gro) and the queue is used for zerocopy (has memory provider ops set). The limit is 128K and it was chosen for the following reasons: - 256K size requires a special case during MTT calculation to split the page in two. That's because two MTTs are needed to form an octword. - Higher sizes require increasing WQE size and/or reducing the number of WQEs. - Having the RQ lined with too few large pages can lead to refill issues. Results show an increase in BW and a decrease in CPU usage. The benchmark was done with the zcrx samples from liburing [0]. rx_buf_len=4K, oncpu [1]: packets=3358832 (MB=820027), rps=55794 (MB/s=13621) Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: 9 1.56 0.00 18.09 13.42 0.00 66.80 0.00 0.00 0.00 0.12 rx_buf_len=128K, oncpu [2]: packets=3781376 (MB=923187), rps=62813 (MB/s=15335) Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: 9 0.33 0.00 7.61 18.86 0.00 73.08 0.00 0.00 0.00 0.12 rx_buf_len=4K, offcpu [3]: packets=3460368 (MB=844816), rps=57481 (MB/s=14033) Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: 9 0.00 0.00 0.26 0.00 0.00 92.63 0.00 0.00 0.00 7.11 Average: 11 3.04 0.00 68.09 28.87 0.00 0.00 0.00 0.00 0.00 0.00 rx_buf_len=128K, offcpu [4]: packets=4119840 (MB=1005820), rps=68435 (MB/s=16707) Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: 9 0.00 0.00 0.87 0.00 0.00 63.77 0.00 0.00 0.00 35.36 Average: 11 1.96 0.00 43.68 54.37 0.00 0.00 0.00 0.00 0.00 0.00 [0] https://github.com/isilence/liburing/tree/zcrx/rx-buf-len [1] commands: $> taskset -c 9 ./zcrx 6 -i eth2 -q 9 -A 1 -B 4096 -S 33554432 $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000 [2] commands: $> taskset -c 9 ./zcrx 6 -i eth2 -q 9 -A 1 -B 131072 -S 33554432 
$> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000 [3] commands: $> taskset -c 11 ./zcrx 6 -i eth2 -q 9 -A 1 -B 4096 -S 33554432 $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000 [4] commands: $> taskset -c 11 ./zcrx 6 -i eth2 -q 9 -A 1 -B 131072 -S 33554432 $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000 Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260223204155.1783580-16-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:24 +01:00
Dragos Tatulea	5b6e0ddb36	net/mlx5e: Add param helper to calculate max page size This function will be necessary to determine the upper limit of rx-page-size. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260223204155.1783580-15-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:24 +01:00
Dragos Tatulea	585cfa99d3	net/mlx5e: Pass netdev queue config to param calculations If set, take rx_page_size into consideration when calculating the page shift in Multi Packet WQE mode. The queue config is saved in the mlx5e_rq_opt_param struct which is added to the mlx5e_channel_param struct. Now the configuration can be read from the struct instead of adding it as an argument to all call sites. For consistency, the queue config is assigned in mlx5e_build_channel_param(). The queue configuration is read only from queue management ops as that's the only place where it is currently useful. Furthermore, netdev_queue_config() expects netdev->queue_mgmt_ops to be set which is not always the case (representor netdevs). Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260223204155.1783580-14-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:24 +01:00
Dragos Tatulea	0fa8c93357	net/mlx5e: Add queue config ops for page size For now allow only PAGE_SIZE. A subsequent patch will add support for high order pages in zero-copy mode. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260223204155.1783580-13-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:23 +01:00


1427114 Commits