Introduce support for traffic class (TC) scheduling on vports by
allowing the vport to own multiple TC scheduling nodes. This patch
enables more granular control of QoS by defining three distinct QoS
states for vports, each providing unique scheduling behavior:
1. Regular QoS: The `sched_node` represents the vport directly,
handling QoS as a single scheduling entity.
2. TC QoS on the vport: The `sched_node` acts as a TC arbiter, enabling
TC scheduling directly on the vport.
3. TC QoS on the parent node: The `sched_node` functions as a rate
limiter, with TC arbitration enabled at the parent level, associating
multiple scheduling nodes with each vport.
Key changes include:
- Added support for new scheduling elements, vport traffic class and
rate limiter.
- New helper functions for creating, destroying, and restoring vport TC
scheduling nodes, handling transitions between regular QoS and TC
arbitration states.
- Updated `esw_qos_vport_enable()` and `esw_qos_vport_disable()` to
support both regular QoS and TC arbitration states, ensuring consistent
transitions between scheduling modes.
- Introduced a `sched_nodes` array under `vport->qos` to store multiple
TC scheduling nodes per vport, enabling finer control over per-TC QoS.
- Enhanced `esw_qos_vport_update_parent()` to handle transitions between
the three QoS states based on the current and new parent node types.
This patch lays the groundwork for future support for configuring tc-bw
on vports. Although the infrastructure is in place, full support for
tc-bw is not yet implemented; attempts to set tc-bw on vports will
return `-EOPNOTSUPP`.
No functional changes are introduced at this stage.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-7-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce support for enabling and disabling Traffic Class (TC)
arbitration for existing devlink rate nodes. This patch adds support
for a new scheduling node type, `SCHED_NODE_TYPE_TC_ARBITER_TSAR`.
Key changes include:
- New helper functions for transitioning existing rate nodes to TC
arbiter nodes and vice versa. These functions handle the allocation
of TC arbiter nodes, copying of child nodes, and restoring vport QoS
settings when TC arbitration is disabled.
- Implementation of `mlx5_esw_devlink_rate_node_tc_bw_set()` to manage
tc-bw configuration on nodes.
- Introduced stubs for `esw_qos_tc_arbiter_scheduling_setup()` and
`esw_qos_tc_arbiter_scheduling_teardown()`, which will be extended in
future patches to provide full support for tc-bw on devlink rate
objects.
- Validation functions for tc-bw settings, allowing graceful handling
of unsupported traffic class bandwidth configurations.
- Updated `__esw_qos_alloc_node()` to insert the new node into the
parent’s children list only if the parent is not NULL. For the root
TSAR, the new node is inserted directly after the allocation call.
- Don't allow `tc-bw` configuration for nodes containing non-leaf
children.
This patch lays the groundwork for future support for configuring tc-bw
on devlink rate nodes. Although the infrastructure is in place, full
support for tc-bw is not yet implemented; attempts to set tc-bw on
nodes will return `-EOPNOTSUPP`.
No functional changes are introduced at this stage.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-6-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce support for specifying relative bandwidth shares between
traffic classes (TC) in the devlink-rate API. This new option allows
users to allocate bandwidth across multiple traffic classes in a
single command.
This feature provides a more granular control over traffic management,
especially for scenarios requiring Enhanced Transmission Selection.
Users can now define a relative bandwidth share for each traffic class.
For example, assigning share values of 20 to TC0 (TCP/UDP) and 80 to TC5
(RoCE) will result in TC0 receiving 20% and TC5 receiving 80% of the
total bandwidth. The actual percentage each class receives depends on
the ratio of its share value to the sum of all shares.
Example:
DEV=pci/0000:08:00.0
$ devlink port function rate add $DEV/vfs_group tx_share 10Gbit \
tx_max 50Gbit tc-bw 0:20 1:0 2:0 3:0 4:0 5:80 6:0 7:0
$ devlink port function rate set $DEV/vfs_group \
tc-bw 0:20 1:0 2:0 3:0 4:0 5:20 6:60 7:0
Example usage with ynl:
./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \
--do rate-set --json '{
"bus-name": "pci",
"dev-name": "0000:08:00.0",
"port-index": 1,
"rate-tc-bws": [
{"rate-tc-index": 0, "rate-tc-bw": 50},
{"rate-tc-index": 1, "rate-tc-bw": 50},
{"rate-tc-index": 2, "rate-tc-bw": 0},
{"rate-tc-index": 3, "rate-tc-bw": 0},
{"rate-tc-index": 4, "rate-tc-bw": 0},
{"rate-tc-index": 5, "rate-tc-bw": 0},
{"rate-tc-index": 6, "rate-tc-bw": 0},
{"rate-tc-index": 7, "rate-tc-bw": 0}
]
}'
./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \
--do rate-get --json '{
"bus-name": "pci",
"dev-name": "0000:08:00.0",
"port-index": 1
}'
output for rate-get:
{'bus-name': 'pci',
'dev-name': '0000:08:00.0',
'port-index': 1,
'rate-tc-bws': [{'rate-tc-bw': 50, 'rate-tc-index': 0},
{'rate-tc-bw': 50, 'rate-tc-index': 1},
{'rate-tc-bw': 0, 'rate-tc-index': 2},
{'rate-tc-bw': 0, 'rate-tc-index': 3},
{'rate-tc-bw': 0, 'rate-tc-index': 4},
{'rate-tc-bw': 0, 'rate-tc-index': 5},
{'rate-tc-bw': 0, 'rate-tc-index': 6},
{'rate-tc-bw': 0, 'rate-tc-index': 7}],
'rate-tx-max': 0,
'rate-tx-priority': 0,
'rate-tx-share': 0,
'rate-tx-weight': 0,
'rate-type': 'leaf'}
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-3-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We used to do twice copy_from_iter() to copy virtio-net and packet
separately. This introduce overheads for userspace access hardening as
well as SMAP (for x86 it's stac/clac). So this patch tries to use one
copy_from_iter() to copy them once and move the virtio-net header
afterwards to reduce overheads.
Testpmd + vhost_net shows 10% improvement from 5.45Mpps to 6.0Mpps.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250701010352.74515-2-jasowang@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Willem de Bruijn says:
====================
preserve MSG_ZEROCOPY with forwarding
Avoid false positive copying of zerocopy skb frags when entering the
ingress path if the skb is not queued locally but forwarded.
Patch 1 for more details and feature.
Patch 2 converts the existing selftest to a pass/fail test and adds
coverage for this new feature.
====================
Link: https://patch.msgid.link/20250630194312.1571410-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Zerocopy skbs are converted to regular copy skbs when data is queued
to a local socket. This happens in the existing test with a sender and
receiver communicating over a veth device.
Zerocopy skbs are sent without copying if egressing a device. Verify
that this behavior is maintained even in the common container setup
where data is forwarded over a veth to the physical device.
Update msg_zerocopy.sh to
1. Have a dummy network device to simulate a physical device.
2. Have forwarding enabled between veth and dummy.
3. Add a tx-only test that sends out dummy via the forwarding path.
4. Verify the exitcode of the sender, which signals zerocopy success.
As dummy drops all packets, this cannot be a TCP connection. Test
the new case with unconnected UDP only.
Update msg_zerocopy.c to
- Accept an argument whether send with zerocopy is expected.
- Return an exitcode whether behavior matched that expectation.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630194312.1571410-3-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
MSG_ZEROCOPY data must be copied before data is queued to local
sockets, to avoid indefinite timeout until memory release.
This test is performed by skb_orphan_frags_rx, which is called when
looping an egress skb to packet sockets, error queue or ingress path.
To preserve zerocopy for skbs that are looped to ingress but are then
forwarded to an egress device rather than delivered locally, defer
this last check until an skb enters the local IP receive path.
This is analogous to existing behavior of skb_clear_delivery_time.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630194312.1571410-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a new test to ensure that when the transport changes a null pointer
dereference does not occur. The bug was reported upstream [1] and fixed
with commit 2cb7c756f6 ("vsock/virtio: discard packets if the
transport changes").
KASAN: null-ptr-deref in range [0x0000000000000060-0x0000000000000067]
CPU: 2 UID: 0 PID: 463 Comm: kworker/2:3 Not tainted
Workqueue: vsock-loopback vsock_loopback_work
RIP: 0010:vsock_stream_has_data+0x44/0x70
Call Trace:
virtio_transport_do_close+0x68/0x1a0
virtio_transport_recv_pkt+0x1045/0x2ae4
vsock_loopback_work+0x27d/0x3f0
process_one_work+0x846/0x1420
worker_thread+0x5b3/0xf80
kthread+0x35a/0x700
ret_from_fork+0x2d/0x70
ret_from_fork_asm+0x1a/0x30
Note that this test may not fail in a kernel without the fix, but it may
hang on the client side if it triggers a kernel oops.
This works by creating a socket, trying to connect to a server, and then
executing a second connect operation on the same socket but to a
different CID (0). This triggers a transport change. If the connect
operation is interrupted by a signal, this could cause a null-ptr-deref.
Since this bug is non-deterministic, we need to try several times. It
is reasonable to assume that the bug will show up within the timeout
period.
If there is a G2H transport loaded in the system, the bug is not
triggered and this test will always pass. This is because
`vsock_assign_transport`, when using CID 0, like in this case, sets
vsk->transport to `transport_g2h` that is not NULL if a G2H transport is
available.
[1]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/
Suggested-by: Hyunwoo Kim <v4bel@theori.io>
Suggested-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250630-test_vsock-v5-2-2492e141e80b@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
AMD XGBE hardware supports giant Ethernet frames up to 16K bytes.
Add support for configuring and enabling giant packet handling
in the driver.
- Define new register fields and macros for giant packet support.
- Update the jumbo frame configuration logic to enable giant
packet mode when MTU exceeds the jumbo threshold.
Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250701121929.319690-1-Raju.Rangoju@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dst->dev is read locklessly in many contexts,
and written in dst_dev_put().
Fixing all the races is going to need many changes.
We probably will have to add full RCU protection.
Add three helpers to ease this painful process.
static inline struct net_device *dst_dev(const struct dst_entry *dst)
{
return READ_ONCE(dst->dev);
}
static inline struct net_device *skb_dst_dev(const struct sk_buff *skb)
{
return dst_dev(skb_dst(skb));
}
static inline struct net *skb_dst_dev_net(const struct sk_buff *skb)
{
return dev_net(skb_dst_dev(skb));
}
static inline struct net *skb_dst_dev_net_rcu(const struct sk_buff *skb)
{
return dev_net_rcu(skb_dst_dev(skb));
}
Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dst_dev_put() can overwrite dst->output while other
cpus might read this field (for instance from dst_output())
Add READ_ONCE()/WRITE_ONCE() annotations to suppress
potential issues.
We will likely need RCU protection in the future.
Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet says:
====================
net: introduce net_aligned_data
____cacheline_aligned_in_smp on small fields like
tcp_memory_allocated and udp_memory_allocated is not good enough.
It makes sure to put these fields at the start of a cache line,
but does not prevent the linker from using the cache line for other
fields, with potential performance impact.
nm -v vmlinux|egrep -5 "tcp_memory_allocated|udp_memory_allocated"
...
ffffffff83e35cc0 B tcp_memory_allocated
ffffffff83e35cc8 b __key.0
ffffffff83e35cc8 b __tcp_tx_delay_enabled.1
ffffffff83e35ce0 b tcp_orphan_timer
...
ffffffff849dddc0 B udp_memory_allocated
ffffffff849dddc8 B udp_encap_needed_key
ffffffff849dddd8 B udpv6_encap_needed_key
ffffffff849dddf0 b inetsw_lock
One solution is to move these sensitive fields to a structure,
so that the compiler is forced to add empty holes between each member.
nm -v vmlinux|egrep -2 "tcp_memory_allocated|udp_memory_allocated|net_aligned_data"
ffffffff885af970 b mem_id_init
ffffffff885af980 b __key.0
ffffffff885af9c0 B net_aligned_data
ffffffff885afa80 B page_pool_mem_providers
ffffffff885afa90 b __key.2
v1: https://lore.kernel.org/netdev/20250627200551.348096-1-edumazet@google.com/
====================
Link: https://patch.msgid.link/20250630093540.3052835-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
____cacheline_aligned_in_smp attribute only makes sure to align
a field to a cache line. It does not prevent the linker to use
the remaining of the cache line for other variables, causing
potential false sharing.
Move tcp_memory_allocated into a dedicated cache line.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is not only a cosmetic change because the section mismatch checks
(implemented in scripts/mod/modpost.c) also depend on the object's name
and for drivers the checks are stricter than for ops.
However aq_pci_driver also passes the stricter checks just fine, so no
further changes needed.
The cheating^Wmisleading name was introduced in commit 97bde5c4f9
("net: ethernet: aquantia: Support for NIC-specific code")
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250630164406.57589-2-u.kleine-koenig@baylibre.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If the user reinitializes the network interface, the PHY will reinitialize,
and the CKO settings will revert to their initial configuration(be enabled).
To prevent CKO from being re-enabled,
en8811h_clk_restore_context and en8811h_resume were added
to ensure the CKO settings remain correct.
Signed-off-by: Lucien.Jheng <lucienzx159@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250630154147.80388-1-lucienzx159@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
'struct devlink_region_ops' and 'struct hellcreek_fdb_entry' are not
modified in this driver.
Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.
On a x86_64, with allmodconfig:
Before:
======
text data bss dec hex filename
55320 19216 320 74856 12468 drivers/net/dsa/hirschmann/hellcreek.o
After:
=====
text data bss dec hex filename
55960 18576 320 74856 12468 drivers/net/dsa/hirschmann/hellcreek.o
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Link: https://patch.msgid.link/2f7e8dc30db18bade94999ac7ce79f333342e979.1751231174.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andrea Mayer says:
====================
seg6: fix typos in comments within the SRv6 subsystem
In this patchset, we correct some typos found both in the SRv6 Endpoints
implementation (i.e., seg6local) and in some SRv6 selftests, using
codespell.
====================
Link: https://patch.msgid.link/20250629171226.4988-1-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
'struct devlink_region_ops' and 'struct mv88e6xxx_region' are not modified
in this driver.
Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.
On a x86_64, with allmodconfig, as an example:
Before:
======
text data bss dec hex filename
18076 6496 64 24636 603c drivers/net/dsa/mv88e6xxx/devlink.o
After:
=====
text data bss dec hex filename
18652 5920 64 24636 603c drivers/net/dsa/mv88e6xxx/devlink.o
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/46040062161dda211580002f950a6d60433243dc.1751200453.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for the Signal Quality Indicator (SQI) feature on KSZ9477
family switches, providing a relative measure of receive signal quality.
The hardware exposes separate SQI readings per channel. For 1000BASE-T,
all four channels are read. For 100BASE-TX, only one channel is reported,
but which receive pair is active depends on Auto MDI-X negotiation, which
is not exposed by the hardware. Therefore, it is not possible to reliably
map the measured channel to a specific wire pair.
This resolves an earlier discussion about how to handle multi-channel
SQI. Originally, the plan was to expose all channels individually.
However, since pair mapping is sometimes unavailable, this
implementation treats SQI as a per-link metric instead. This fallback
avoids ambiguity and ensures consistent behavior. The existing get_sqi()
UAPI was designed for single-pair Ethernet (SPE), where per-pair and
per-link are effectively equivalent. Restricting its use to per-link
metrics does not introduce regressions for existing users.
The raw 7-bit SQI value (0–127, lower is better) is converted to the
standard 0–7 (high is better) scale. Empirical testing showed that the
link becomes unstable around a raw value of 8.
The SQI raw value remains zero if no data is received, even if noise is
present. This confirms that the measurement reflects the "quality" during
active data reception rather than the passive line state. User space
must ensure that traffic is present on the link to obtain valid SQI
readings.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250627112539.895255-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>