Repair some of the comments:
- use the correct enum names
- don't use "/**" for a non-kernel-doc comment
to fix these warnings:
Warning: include/uapi/linux/nfc.h:127 Excess enum value
'@NFC_EVENT_DEVICE_DEACTIVATED' description in 'nfc_commands'
Warning: include/uapi/linux/nfc.h:204 Excess enum value
'@NFC_ATTR_APDU' description in 'nfc_attrs'
Warning: include/uapi/linux/nfc.h:302 expecting prototype for Pseudo().
Prototype was for NFC_RAW_HEADER_SIZE() instead
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20260226221004.1037909-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This critical helper (via skb_add_rx_frag()) is mostly used
from drivers rx fast path.
It is time to inline it, this actually saves space in vmlinux:
size vmlinux.old vmlinux
text data bss dec hex filename
37350766 23092977 4846992 65290735 3e441ef vmlinux.old
37350600 23092977 4846992 65290569 3e44149 vmlinux
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260226041213.1892561-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently the kernel IPv6 implementation is not dicarding the fragment
queue upon receiving a IPv6 fragment that is not 8 bytes aligned. It
relies on queue expiration to free the queue.
While RFC 8200 section 4.5 does not explicitly mention that the rest of
fragments must be discarded, it does not make sense to keep them. The
parameter problem message is sent regardless that. In addition, if the
sender is able to re-compose the datagram so it is 8 bytes aligned it
would qualify as a new whole datagram not fitting into the same fragment
queue.
The same situation happens if segment end is exceeding the IPv6 maximum
packet length. The sooner we can free resources the better during
reassembly, the better.
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260225133758.4553-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The r8152 driver supports the RTL8156, which is a 2.5Gbit Ethernet controller for
USB 3.0, for which support is added for configuring and displaying the EEE
advertisement status for 2.5GBit connections.
The patch also corrects the determination of whether EEE is active to include
the 2.5GBit connection status and make the determination dependent not on the
desired speed configuration (tp->speed), but on the actual speed used by
the controller. For consistency, this is corrected also for the RTL8152/3.
This was tested on an Edimax EU-4307 V1.0 USB-Ethernet adapter with RTL8156,
and a SECOMP Value 12.99.1115 USB-C 3.1 Ethernet converter with RTL8153.
Signed-off-by: Birger Koblitz <mail@birger-koblitz.de>
Link: https://patch.msgid.link/20260224-b4-eee2g5-v2-1-cf5c83df036e@birger-koblitz.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The vmxnet3 driver supports an Rx Data ring (rx-mini) to optimise the
processing of small packets. The size of this ring's DMA-coherent memory
allocation is determined by the product of the primary Rx ring size and
the data ring descriptor size:
sz = rq->rx_ring[0].size * rq->data_ring.desc_size;
When a user configures the maximum supported parameters via ethtool
(rx_ring[0].size = 4096, data_ring.desc_size = 2048), the required
contiguous memory allocation reaches 8 MB (8,388,608 bytes).
In environments lacking Contiguous Memory Allocator (CMA),
dma_alloc_coherent() falls back to the standard zone buddy allocator. An
8 MB allocation translates to a page order of 11, which strictly exceeds
the default MAX_PAGE_ORDER (10) on most architectures.
Consequently, __alloc_pages_noprof() catches the oversize request and
triggers a loud kernel warning stack trace:
WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)
This warning is unnecessary and alarming to system administrators because
the vmxnet3 driver already handles this allocation failure gracefully.
If dma_alloc_coherent() returns NULL, the driver safely disables the
Rx Data ring (adapter->rxdataring_enabled = false) and falls back to
standard, streaming DMA packet processing.
To resolve this, append the __GFP_NOWARN flag to the dma_alloc_coherent()
gfp_mask. This instructs the page allocator to silently fail the
allocation if it exceeds order limits or memory is too fragmented,
preventing the spurious warning stack trace.
Furthermore, enhance the subsequent netdev_err() fallback message to
include the requested allocation size. This provides critical debugging
context to the administrator (e.g., revealing that an 8 MB allocation
was attempted and failed) without making hardcoded assumptions about
the state of the system's configurations.
Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Link: https://patch.msgid.link/20260226163121.4045808-1-atomlin@atomlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The MT7628 has a fixed-link PHY and does not expose MAC control
registers. Writes to these registers only corrupt the ESW VLAN
configuration.
This patch explicitly registers no-op phylink_mac_ops for MT7628, as
after removing the invalid register accesses, the existing
phylink_mac_ops effectively become no-ops.
This code was introduced by commit 296c912075
("net: ethernet: mediatek: Add MT7628/88 SoC support")
Signed-off-by: Joris Vaisvila <joey@tinyisr.com>
Reviewed-by: Daniel Golle <daniel@makrotpia.org>
Reviewed-by: Stefan Roese <stefan.roese@mailbox.org>
Link: https://patch.msgid.link/20260226154547.68553-1-joey@tinyisr.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 853a2944aa ("net: atlantic: support reading SFP module info")
added support for reading SFP module info on AQC100-based cards. However,
it only supports reading directly from the controller's hardware
registers, and this does not seem to be supported on certain cards,
including my TRENDnet TEG-10GECSFP V3. "ethtool -m" times out when reading
certain registers, even when I increase the read poll timeout values.
The DPDK "atlantic" driver reads module info via firmware calls instead of
directly reading the hardware registers, provided that the NIC's firmware
version supports it.
This change adapts the DPDK firmware call code to the kernel driver. It
preserves the old hardware-based module read code as a fallback when the
firmware does not support it, to avoid breaking cards that are currently
working.
Tested on 2 different TRENDnet TEG-10GECSFP V3 cards, both with firmware
version 3.1.121 (current at the time of this patch). Both cards correctly
reported module info for a passive DAC cable and 2 different 10G optical
transceivers.
Signed-off-by: Tiernan Hubble <thubble@thubble.ca>
Link: https://patch.msgid.link/20260225002026.1754045-1-thubble@thubble.ca
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Charles Perry says:
====================
Support PHYs that have inband autoneg disabled with GEM
I'm testing SGMII with a VSC8574 PHY [1] and microchip HPSC SoC [2].
The link can work with or without autoneg, as long as the MAC and the PHY
are configured the same way. This doesn't work with the current MAC driver
because the MAC inband autoneg is always enabled (in the ->mac_config()
phylink_mac_ops). More precisely, the PHY driver (mscc_main.c) has
phylink's ->config_inband() implemented while the MAC ->pcs_config() ops
has an empty body.
This is based on code written by Sean Anderson [3]. Let me know if I
should add a From: or Co-developed-by: tag.
Logs with inband autoneg (managed = "in-band-status"):
root@p64h:~# ifconfig eth1 up 10.180.59.33
macb 40004184000.ethernet eth1: PHY 4000c21e000.mdio-mdio:02 doesn't supply possible interfaces
macb 40004184000.ethernet eth1: PHY [4000c21e000.mdio-mdio:02] driver [Microsemi GE VSC8574 SyncE] (irq=POLL)
macb 40004184000.ethernet eth1: phy: sgmii setting supported 00000000,00000000,00000000,000042ff advertising 00000000,00000000,00000000,000042ff
macb 40004184000.ethernet eth1: configuring for inband/sgmii link mode
macb 40004184000.ethernet eth1: major config, requested inband/sgmii
macb 40004184000.ethernet eth1: interface sgmii inband modes: pcs=03 phy=03
macb 40004184000.ethernet eth1: major config, active inband/inband,an-enabled/sgmii
macb 40004184000.ethernet eth1: phylink_mac_config: mode=inband/sgmii/none adv=00000000,00000000,00000000,000042ff pause=00
macb_pcs_config: PCSANADV=0x1 PCSCNTRL=0x1040
macb_pcs_get_state: PCSSTS=0x109 PCSANLPBASE=0x1
macb_pcs_get_state: PCSSTS=0x12d PCSANLPBASE=0x1801
macb 40004184000.ethernet eth1: phy link down sgmii/Unknown/Unknown/none/off/nolpi
macb_pcs_get_state: PCSSTS=0x12d PCSANLPBASE=0x1801
macb_pcs_get_state: PCSSTS=0x12d PCSANLPBASE=0x1801
macb 40004184000.ethernet eth1: phy link up sgmii/1Gbps/Full/none/tx/nolpi
macb_pcs_get_state: PCSSTS=0x129 PCSANLPBASE=0x9801
macb_pcs_get_state: PCSSTS=0x12d PCSANLPBASE=0x9801
macb 40004184000.ethernet eth1: Link is Up - 1Gbps/Full - flow control tx
Logs without inband autoneg:
root@p64h:~# ifconfig eth1 up 10.180.59.33
macb 40004184000.ethernet eth1: PHY 4000c21e000.mdio-mdio:02 doesn't supply possible interfaces
macb 40004184000.ethernet eth1: PHY [4000c21e000.mdio-mdio:02] driver [Microsemi GE VSC8574 SyncE] (irq=POLL)
macb 40004184000.ethernet eth1: phy: sgmii setting supported 00000000,00000000,00000000,000042ff advertising 00000000,00000000,00000000,000042ff
macb 40004184000.ethernet eth1: configuring for phy/sgmii link mode
macb 40004184000.ethernet eth1: major config, requested phy/sgmii
macb 40004184000.ethernet eth1: interface sgmii inband modes: pcs=03 phy=03
macb 40004184000.ethernet eth1: major config, active phy/outband/sgmii
macb 40004184000.ethernet eth1: phylink_mac_config: mode=phy/sgmii/none adv=00000000,00000000,00000000,00000000 pause=00
macb_pcs_config: PCSANADV=0x1 PCSCNTRL=0x40
macb 40004184000.ethernet eth1: phy link down sgmii/Unknown/Unknown/none/off/nolpi
macb 40004184000.ethernet eth1: phy link up sgmii/1Gbps/Full/none/tx/nolpi
macb 40004184000.ethernet eth1: Link is Up - 1Gbps/Full - flow control tx
The above logs are generated with an additional printk() in macb_psc_config()
and macb_pcs_get_state() and "#define DEBUG" in phylink.c.
[1]: https://www.microchip.com/en-us/product/vsc8574
[2]: https://www.microchip.com/en-us/products/microprocessors/64-bit-mpus/pic64-hpsc
[3]: https://lore.kernel.org/all/20250610233547.3588356-1-sean.anderson@linux.dev/
====================
Link: https://patch.msgid.link/20260224202854.112813-1-charles.perry@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Make it possible to connect a PHY which does not use inband
autoneg to a gem MAC using phylink's information.
The previous implementation relied on whether or not the link
was a fixed-link to disable SGMII autoneg. This commit extend
this to all link which are not configured for inband
autonegotiation.
Signed-off-by: Charles Perry <charles.perry@microchip.com>
Link: https://patch.msgid.link/20260224202854.112813-2-charles.perry@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Clarify the bit-by-bit bitset format's behavior around mandatory
attributes and bit identification. More specifically, the following
changes are made:
* Rephrase a misleading sentence which implies name and index are
mutually exclusive
* Describe that ETHTOOL_A_BITSET_BITS nest is mandatory
* Describe that a request fails if inconsistent identifiers are given
Signed-off-by: Yohei Kojima <yk@y-koj.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/ef90a56965ca66e57aa177929ce3e10c5ca815fa.1772031974.git.yk@y-koj.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ptp_clock_ops.n_per_out sets the number of PPS outputs, which the PTP
subsystem uses to validate userspace input, such as the index number
used in a PTP_CLK_REQ_PEROUT request.
stmmac_enable() uses this to index the priv->pps array, which is an
array of size STMMAC_PPS_MAX. ptp_clock_ops.n_per_out is initialised
using priv->dma_cap.pps_out_num, which is a three bit field read from
hardware.
Documentation that I've checked suggests that values >= 5 are reserved,
but that doesn't mean such values won't appear, and if they do, we
can overrun the priv->pps array in stmmac_enable().
stmmac_ptp_register() has protection against this in its loop, but it
doesn't act to limit ptp_clock_ops.n_per_out.
Fix this by introducing a local variable, pps_out_num which is limited
to STMMAC_PPS_MAX, and use that when initialising the array and setting
priv->ptp_clock_ops.n_per_out. Print a warning when we limit the number
of outputs.
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvBhn-0000000ArCg-4C4u@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR (net-7.0-rc2).
Conflicts:
tools/testing/selftests/drivers/net/hw/rss_ctx.py
19c3a2a81d ("selftests: drv-net: rss: Generate unique ports for RSS context tests")
ce5a0f4612 ("selftests: drv-net: rss_ctx: test RSS contexts persist after ifdown/up")
include/net/inet_connection_sock.h
858d2a4f67 ("tcp: fix potential race in tcp_v6_syn_recv_sock()")
fcd3d039fa ("tcp: make tcp_v{4,6}_send_check() static")
https://lore.kernel.org/aZ8PSFLzBrEU3I89@sirena.org.uk
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.c
69050f8d6d ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
bf4afc53b7 ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument")
8a96b9144f ("net/mlx5e: Alloc xsk channel param out of mlx5e_open_xsk()")
Adjacent changes:
net/netfilter/ipvs/ip_vs_ctl.c
c59bd9e62e ("ipvs: use more counters to avoid service lookups")
bf4afc53b7 ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull networking fixes from Paolo Abeni:
"Including fixes from IPsec, Bluetooth and netfilter
Current release - regressions:
- wifi: fix dev_alloc_name() return value check
- rds: fix recursive lock in rds_tcp_conn_slots_available
Current release - new code bugs:
- vsock: lock down child_ns_mode as write-once
Previous releases - regressions:
- core:
- do not pass flow_id to set_rps_cpu()
- consume xmit errors of GSO frames
- netconsole: avoid OOB reads, msg is not nul-terminated
- netfilter: h323: fix OOB read in decode_choice()
- tcp: re-enable acceptance of FIN packets when RWIN is 0
- udplite: fix null-ptr-deref in __udp_enqueue_schedule_skb().
- wifi: brcmfmac: fix potential kernel oops when probe fails
- phy: register phy led_triggers during probe to avoid AB-BA deadlock
- eth:
- bnxt_en: fix deleting of Ntuple filters
- wan: farsync: fix use-after-free bugs caused by unfinished tasklets
- xscale: check for PTP support properly
Previous releases - always broken:
- tcp: fix potential race in tcp_v6_syn_recv_sock()
- kcm: fix zero-frag skb in frag_list on partial sendmsg error
- xfrm:
- fix race condition in espintcp_close()
- always flush state and policy upon NETDEV_UNREGISTER event
- bluetooth:
- purge error queues in socket destructors
- fix response to L2CAP_ECRED_CONN_REQ
- eth:
- mlx5:
- fix circular locking dependency in dump
- fix "scheduling while atomic" in IPsec MAC address query
- gve: fix incorrect buffer cleanup for QPL
- team: avoid NETDEV_CHANGEMTU event when unregistering slave
- usb: validate USB endpoints"
* tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits)
netfilter: nf_conntrack_h323: fix OOB read in decode_choice()
dpaa2-switch: validate num_ifs to prevent out-of-bounds write
net: consume xmit errors of GSO frames
vsock: document write-once behavior of the child_ns_mode sysctl
vsock: lock down child_ns_mode as write-once
selftests/vsock: change tests to respect write-once child ns mode
net/mlx5e: Fix "scheduling while atomic" in IPsec MAC address query
net/mlx5: Fix missing devlink lock in SRIOV enable error path
net/mlx5: E-switch, Clear legacy flag when moving to switchdev
net/mlx5: LAG, disable MPESW in lag_disable_change()
net/mlx5: DR, Fix circular locking dependency in dump
selftests: team: Add a reference count leak test
team: avoid NETDEV_CHANGEMTU event when unregistering slave
net: mana: Fix double destroy_workqueue on service rescan PCI path
MAINTAINERS: Update maintainer entry for QUALCOMM ETHQOS ETHERNET DRIVER
dpll: zl3073x: Remove redundant cleanup in devm_dpll_init()
selftests/net: packetdrill: Verify acceptance of FIN packets when RWIN is 0
tcp: re-enable acceptance of FIN packets when RWIN is 0
vsock: Use container_of() to get net namespace in sysctl handlers
net: usb: kaweth: validate USB endpoints
...
In decode_choice(), the boundary check before get_len() uses the
variable `len`, which is still 0 from its initialization at the top of
the function:
unsigned int type, ext, len = 0;
...
if (ext || (son->attr & OPEN)) {
BYTE_ALIGN(bs);
if (nf_h323_error_boundary(bs, len, 0)) /* len is 0 here */
return H323_ERROR_BOUND;
len = get_len(bs); /* OOB read */
When the bitstream is exactly consumed (bs->cur == bs->end), the check
nf_h323_error_boundary(bs, 0, 0) evaluates to (bs->cur + 0 > bs->end),
which is false. The subsequent get_len() call then dereferences
*bs->cur++, reading 1 byte past the end of the buffer. If that byte
has bit 7 set, get_len() reads a second byte as well.
This can be triggered remotely by sending a crafted Q.931 SETUP message
with a User-User Information Element containing exactly 2 bytes of
PER-encoded data ({0x08, 0x00}) to port 1720 through a firewall with
the nf_conntrack_h323 helper active. The decoder fully consumes the
PER buffer before reaching this code path, resulting in a 1-2 byte
heap-buffer-overflow read confirmed by AddressSanitizer.
Fix this by checking for 2 bytes (the maximum that get_len() may read)
instead of the uninitialized `len`. This matches the pattern used at
every other get_len() call site in the same file, where the caller
checks for 2 bytes of available data before calling get_len().
Fixes: ec8a8f3c31 ("netfilter: nf_ct_h323: Extend nf_h323_error_boundary to work on bits as well")
Signed-off-by: Vahagn Vardanian <vahagn@redrays.io>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260225130619.1248-2-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The driver obtains sw_attr.num_ifs from firmware via dpsw_get_attributes()
but never validates it against DPSW_MAX_IF (64). This value controls
iteration in dpaa2_switch_fdb_get_flood_cfg(), which writes port indices
into the fixed-size cfg->if_id[DPSW_MAX_IF] array. When firmware reports
num_ifs >= 64, the loop can write past the array bounds.
Add a bound check for num_ifs in dpaa2_switch_init().
dpaa2_switch_fdb_get_flood_cfg() appends the control interface (port
num_ifs) after all matched ports. When num_ifs == DPSW_MAX_IF and all
ports match the flood filter, the loop fills all 64 slots and the control
interface write overflows by one entry.
The check uses >= because num_ifs == DPSW_MAX_IF is also functionally
broken.
build_if_id_bitmap() silently drops any ID >= 64:
if (id[i] < DPSW_MAX_IF)
bmap[id[i] / 64] |= ...
Fixes: 539dda3c5d ("staging: dpaa2-switch: properly setup switching domains")
Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Link: https://patch.msgid.link/SYBPR01MB78812B47B7F0470B617C408AAF74A@SYBPR01MB7881.ausprd01.prod.outlook.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The kernel-mode PPPoE relay feature and its two associated ioctls
(PPPOEIOCSFWD and PPPOEIOCDFWD) are not used by any existing userspace
PPPoE implementations. The most commonly-used package, RP-PPPoE [1],
handles the relaying entirely in userspace.
This legacy code has remained in the driver since its introduction in
kernel 2.3.99-pre7 for over two decades, but has served no practical
purpose.
Remove the unused relay code.
[1] https://dianne.skoll.ca/projects/rp-pppoe/
Signed-off-by: Qingfang Deng <dqfext@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20260224015053.42472-1-dqfext@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
udpgro_frglist.sh and udpgro_bench.sh are the flakiest tests
currently in NIPA. They fail in the same exact way, TCP GRO
test stalls occasionally and the test gets killed after 10min.
These tests use veth to simulate GRO. They attach a trivial
("return XDP_PASS;") XDP program to the veth to force TSO off
and NAPI on.
Digging into the failure mode we can see that the connection
is completely stuck after a burst of drops. The sender's snd_nxt
is at sequence number N [1], but the receiver claims to have
received (rcv_nxt) up to N + 3 * MSS [2]. Last piece of the puzzle
is that senders rtx queue is not empty (let's say the block in
the rtx queue is at sequence number N - 4 * MSS [3]).
In this state, sender sends a retransmission from the rtx queue
with a single segment, and sequence numbers N-4*MSS:N-3*MSS [3].
Receiver sees it and responds with an ACK all the way up to
N + 3 * MSS [2]. But sender will reject this ack as TCP_ACK_UNSENT_DATA
because it has no recollection of ever sending data that far out [1].
And we are stuck.
The root cause is the mess of the xmit return codes. veth returns
an error when it can't xmit a frame. We end up with a loss event
like this:
-------------------------------------------------
| GSO super frame 1 | GSO super frame 2 |
|-----------------------------------------------|
| seg | seg | seg | seg | seg | seg | seg | seg |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
-------------------------------------------------
x ok ok <ok>| ok ok ok <x>
\\
snd_nxt
"x" means packet lost by veth, and "ok" means it went thru.
Since veth has TSO disabled in this test it sees individual segments.
Segment 1 is on the retransmit queue and will be resent.
So why did the sender not advance snd_nxt even tho it clearly did
send up to seg 8? tcp_write_xmit() interprets the return code
from the core to mean that data has not been sent at all. Since
TCP deals with GSO super frames, not individual segment the crux
of the problem is that loss of a single segment can be interpreted
as loss of all. TCP only sees the last return code for the last
segment of the GSO frame (in <> brackets in the diagram above).
Of course for the problem to occur we need a setup or a device
without a Qdisc. Otherwise Qdisc layer disconnects the protocol
layer from the device errors completely.
We have multiple ways to fix this.
1) make veth not return an error when it lost a packet.
While this is what I think we did in the past, the issue keeps
reappearing and it's annoying to debug. The game of whack
a mole is not great.
2) fix the damn return codes
We only talk about NETDEV_TX_OK and NETDEV_TX_BUSY in the
documentation, so maybe we should make the return code from
ndo_start_xmit() a boolean. I like that the most, but perhaps
some ancient, not-really-networking protocol would suffer.
3) make TCP ignore the errors
It is not entirely clear to me what benefit TCP gets from
interpreting the result of ip_queue_xmit()? Specifically once
the connection is established and we're pushing data - packet
loss is just packet loss?
4) this fix
Ignore the rc in the Qdisc-less+GSO case, since it's unreliable.
We already always return OK in the TCQ_F_CAN_BYPASS case.
In the Qdisc-less case let's be a bit more conservative and only
mask the GSO errors. This path is taken by non-IP-"networks"
like CAN, MCTP etc, so we could regress some ancient thing.
This is the simplest, but also maybe the hackiest fix?
Similar fix has been proposed by Eric in the past but never committed
because original reporter was working with an OOT driver and wasn't
providing feedback (see Link).
Link: https://lore.kernel.org/CANn89iJcLepEin7EtBETrZ36bjoD9LrR=k4cfwWh046GB+4f9A@mail.gmail.com
Fixes: 1f59533f9c ("qdisc: validate frames going through the direct_xmit path")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260223235100.108939-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Two administrator processes may race when setting child_ns_mode as one
process sets child_ns_mode to "local" and then creates a namespace, but
another process changes child_ns_mode to "global" between the write and
the namespace creation. The first process ends up with a namespace in
"global" mode instead of "local". While this can be detected after the
fact by reading ns_mode and retrying, it is fragile and error-prone.
Make child_ns_mode write-once so that a namespace manager can set it
once and be sure it won't change. Writing a different value after the
first write returns -EBUSY. This applies to all namespaces, including
init_net, where an init process can write "local" to lock all future
namespaces into local mode.
Fixes: eafb64f40c ("vsock: add netns to vsock core")
Suggested-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Co-developed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-2-c0cde6959923@meta.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The child_ns_mode sysctl parameter becomes write-once in a future patch
in this series, which breaks existing tests. This patch updates the
tests to respect this new policy. No additional tests are added.
Add "global-parent" and "local-parent" namespaces as intermediaries to
spawn namespaces in the given modes. This avoids the need to change
"child_ns_mode" in the init_ns. nsenter must be used because ip netns
unshares the mount namespace so nested "ip netns add" breaks exec calls
from the init ns. Adds nsenter to the deps check.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-1-c0cde6959923@meta.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
If set, take rx_page_size into consideration when calculating
the page shift in Multi Packet WQE mode.
The queue config is saved in the mlx5e_rq_opt_param struct which is
added to the mlx5e_channel_param struct. Now the configuration can be
read from the struct instead of adding it as an argument to all call
sites. For consistency, the queue config is assigned in
mlx5e_build_channel_param().
The queue configuration is read only from queue management ops
as that's the only place where it is currently useful. Furthermore,
netdev_queue_config() expects netdev->queue_mgmt_ops to be
set which is not always the case (representor netdevs).
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-14-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The formula uses the system page size but does not account
for high order pages.
One way to fix this would be to adapt the formula to take
into account the pool order. This would require calculating it
for every allocation or adding an additional rq struct member to
hold the bias max.
However, the above is not really needed as the driver doesn't
check the bias value. It has other means to calculate the expected
number of fragments based on context.
This patch simply sets the value to the max possible value. A sanity
check is added during queue init phase to avoid having really big pages
from using more fragments than the type can fit.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-12-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
An upcoming patch will allow setting the page order for RX
pages to be greater than 0. Make sure that the drop page will
also be allocated with the right size when that happens.
Take extra care when calculating the drop page size to
account for page_shift < PAGE_SHIFT which can happen for xsk.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-11-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The channel parameters from struct mlx5_qmgmt_data are
built in mlx5e_queue_mem_alloc() but are not used.
mlx5e_open_channel() builds the channel parameters internally and those
parameters will be the ones that are used when opening the queue.
This patch drops the unused parameters.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-8-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The xsk parameter configuration (struct mlx5e_xsk_param) is passed
around many places during parameter calculation. It is used to contain
channel specific information (as opposed to the global info from
struct mlx5e_params).
Upcoming changes will need to push similar channel specific rq
configuration. Instead of adding one more parameter to all these
functions, create a new container structure that has optional rq
specific parameters. The xsk parameter will be the first of such kind.
The new container struct is itself optional. That means that before
checking its members, it has to be checked itself for validity.
This patch has no functional changes.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-7-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently the allocation and filling of the xsk channel
parameters was done in mlx5e_open_xsk().
Move this responsibility out of mlx5e_open_xsk() and have
the function take an already filled mlx5e_channel_param.
mlx5e_open_channel() already allocates channel parameters.
The only precaution that is needed is to call
mlx5e_build_xsk_channel_param() before mlx5e_open_xsk().
mlx5e_xsk_enable_locked() now allocates and fills the xsk parameters.
For simplicity, link the xsk parameters in struct mlx5e_channel_params
so that channel params can be passed around.
This patch has no functional changes.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-6-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
mlx5e_build_xsk_cparam() is meant to be the alternative
to mlx5e_build_channel_param(). It calculates only the parameters
that it requires using the previously configured mlx5e_xsk_param.
Move this function to params.c to be alongside
mlx5e_build_channel_param() and give it a similar name.
Expose the function as it will be needed by upcoming changes.
This patch has no functional changes.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-5-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Calculating parameters for striding rq is large enough
to deserve its own function. As the names are also very long
it is very easy to hit on the 80 char limitation every time
a change is made. This is an additional sign that it should
be extracted into its own function.
This patch has no functional change.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-3-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>