Commit Graph

1368885 Commits

Author SHA1 Message Date
Carolina Jubran
97733d1e00 net/mlx5: Add traffic class scheduling support for vport QoS
Introduce support for traffic class (TC) scheduling on vports by
allowing the vport to own multiple TC scheduling nodes. This patch
enables more granular control of QoS by defining three distinct QoS
states for vports, each providing unique scheduling behavior:

1. Regular QoS: The `sched_node` represents the vport directly,
   handling QoS as a single scheduling entity.
2. TC QoS on the vport: The `sched_node` acts as a TC arbiter, enabling
   TC scheduling directly on the vport.
3. TC QoS on the parent node: The `sched_node` functions as a rate
   limiter, with TC arbitration enabled at the parent level, associating
   multiple scheduling nodes with each vport.

Key changes include:

- Added support for new scheduling elements, vport traffic class and
  rate limiter.

- New helper functions for creating, destroying, and restoring vport TC
  scheduling nodes, handling transitions between regular QoS and TC
  arbitration states.

- Updated `esw_qos_vport_enable()` and `esw_qos_vport_disable()` to
  support both regular QoS and TC arbitration states, ensuring consistent
  transitions between scheduling modes.

- Introduced a `sched_nodes` array under `vport->qos` to store multiple
  TC scheduling nodes per vport, enabling finer control over per-TC QoS.

- Enhanced `esw_qos_vport_update_parent()` to handle transitions between
  the three QoS states based on the current and new parent node types.

This patch lays the groundwork for future support for configuring tc-bw
on vports. Although the infrastructure is in place, full support for
tc-bw is not yet implemented; attempts to set tc-bw on vports will
return `-EOPNOTSUPP`.

No functional changes are introduced at this stage.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-7-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 15:39:05 -07:00
Carolina Jubran
96619c485f net/mlx5: Add support for setting tc-bw on nodes
Introduce support for enabling and disabling Traffic Class (TC)
arbitration for existing devlink rate nodes. This patch adds support
for a new scheduling node type, `SCHED_NODE_TYPE_TC_ARBITER_TSAR`.

Key changes include:

- New helper functions for transitioning existing rate nodes to TC
  arbiter nodes and vice versa. These functions handle the allocation
  of TC arbiter nodes, copying of child nodes, and restoring vport QoS
  settings when TC arbitration is disabled.

- Implementation of `mlx5_esw_devlink_rate_node_tc_bw_set()` to manage
  tc-bw configuration on nodes.

- Introduced stubs for `esw_qos_tc_arbiter_scheduling_setup()` and
  `esw_qos_tc_arbiter_scheduling_teardown()`, which will be extended in
  future patches to provide full support for tc-bw on devlink rate
  objects.

- Validation functions for tc-bw settings, allowing graceful handling
  of unsupported traffic class bandwidth configurations.

- Updated `__esw_qos_alloc_node()` to insert the new node into the
  parent’s children list only if the parent is not NULL. For the root
  TSAR, the new node is inserted directly after the allocation call.

- Don't allow `tc-bw` configuration for nodes containing non-leaf
  children.

This patch lays the groundwork for future support for configuring tc-bw
on devlink rate nodes. Although the infrastructure is in place, full
support for tc-bw is not yet implemented; attempts to set tc-bw on
nodes will return `-EOPNOTSUPP`.

No functional changes are introduced at this stage.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-6-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 15:39:05 -07:00
Carolina Jubran
7109282124 net/mlx5: Add no-op implementation for setting tc-bw on rate objects
Introduce `mlx5_esw_devlink_rate_node_tc_bw_set()` and
`mlx5_esw_devlink_rate_leaf_tc_bw_set()` with no-op logic.

Future patches will add support for setting traffic class bandwidth
on rate objects.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-5-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 15:39:05 -07:00
Carolina Jubran
236156d80d selftest: netdevsim: Add devlink rate tc-bw test
Test verifies that netdevsim correctly implements devlink ops callbacks
that set tc-bw on leaf or node rate object.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-4-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 15:39:05 -07:00
Carolina Jubran
566e8f108f devlink: Extend devlink rate API with traffic classes bandwidth management
Introduce support for specifying relative bandwidth shares between
traffic classes (TC) in the devlink-rate API. This new option allows
users to allocate bandwidth across multiple traffic classes in a
single command.

This feature provides a more granular control over traffic management,
especially for scenarios requiring Enhanced Transmission Selection.

Users can now define a relative bandwidth share for each traffic class.
For example, assigning share values of 20 to TC0 (TCP/UDP) and 80 to TC5
(RoCE) will result in TC0 receiving 20% and TC5 receiving 80% of the
total bandwidth. The actual percentage each class receives depends on
the ratio of its share value to the sum of all shares.

Example:
DEV=pci/0000:08:00.0

$ devlink port function rate add $DEV/vfs_group tx_share 10Gbit \
  tx_max 50Gbit tc-bw 0:20 1:0 2:0 3:0 4:0 5:80 6:0 7:0

$ devlink port function rate set $DEV/vfs_group \
  tc-bw 0:20 1:0 2:0 3:0 4:0 5:20 6:60 7:0

Example usage with ynl:

./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \
  --do rate-set --json '{
  "bus-name": "pci",
  "dev-name": "0000:08:00.0",
  "port-index": 1,
  "rate-tc-bws": [
    {"rate-tc-index": 0, "rate-tc-bw": 50},
    {"rate-tc-index": 1, "rate-tc-bw": 50},
    {"rate-tc-index": 2, "rate-tc-bw": 0},
    {"rate-tc-index": 3, "rate-tc-bw": 0},
    {"rate-tc-index": 4, "rate-tc-bw": 0},
    {"rate-tc-index": 5, "rate-tc-bw": 0},
    {"rate-tc-index": 6, "rate-tc-bw": 0},
    {"rate-tc-index": 7, "rate-tc-bw": 0}
  ]
}'

./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \
  --do rate-get --json '{
  "bus-name": "pci",
  "dev-name": "0000:08:00.0",
  "port-index": 1
}'

output for rate-get:
{'bus-name': 'pci',
 'dev-name': '0000:08:00.0',
 'port-index': 1,
 'rate-tc-bws': [{'rate-tc-bw': 50, 'rate-tc-index': 0},
                 {'rate-tc-bw': 50, 'rate-tc-index': 1},
                 {'rate-tc-bw': 0, 'rate-tc-index': 2},
                 {'rate-tc-bw': 0, 'rate-tc-index': 3},
                 {'rate-tc-bw': 0, 'rate-tc-index': 4},
                 {'rate-tc-bw': 0, 'rate-tc-index': 5},
                 {'rate-tc-bw': 0, 'rate-tc-index': 6},
                 {'rate-tc-bw': 0, 'rate-tc-index': 7}],
 'rate-tx-max': 0,
 'rate-tx-priority': 0,
 'rate-tx-share': 0,
 'rate-tx-weight': 0,
 'rate-type': 'leaf'}

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-3-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 15:39:05 -07:00
Carolina Jubran
42401c4238 netlink: introduce type-checking attribute iteration for nlmsg
Add the nlmsg_for_each_attr_type() macro to simplify iteration over
attributes of a specific type in a Netlink message.

Convert existing users in vxlan and nfsd to use the new macro.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-2-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 15:39:04 -07:00
Jason Wang
97b2409f28 vhost-net: reduce one userspace copy when building XDP buff
We used to do twice copy_from_iter() to copy virtio-net and packet
separately. This introduce overheads for userspace access hardening as
well as SMAP (for x86 it's stac/clac). So this patch tries to use one
copy_from_iter() to copy them once and move the virtio-net header
afterwards to reduce overheads.

Testpmd + vhost_net shows 10% improvement from 5.45Mpps to 6.0Mpps.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250701010352.74515-2-jasowang@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 15:29:46 -07:00
Jason Wang
4d313f2bd2 tun: remove unnecessary tun_xdp_hdr structure
With f95f0f95cfb7("net, xdp: Introduce xdp_init_buff utility routine"),
buffer length could be stored as frame size so there's no need to have
a dedicated tun_xdp_hdr structure. We can simply store virtio net
header instead.

Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250701010352.74515-1-jasowang@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 15:29:46 -07:00
Jakub Kicinski
285c895fba Merge branch 'preserve-msg_zerocopy-with-forwarding'
Willem de Bruijn says:

====================
preserve MSG_ZEROCOPY with forwarding

Avoid false positive copying of zerocopy skb frags when entering the
ingress path if the skb is not queued locally but forwarded.

Patch 1 for more details and feature.

Patch 2 converts the existing selftest to a pass/fail test and adds
        coverage for this new feature.
====================

Link: https://patch.msgid.link/20250630194312.1571410-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 15:07:18 -07:00
Willem de Bruijn
81d572a551 selftest: net: extend msg_zerocopy test with forwarding
Zerocopy skbs are converted to regular copy skbs when data is queued
to a local socket. This happens in the existing test with a sender and
receiver communicating over a veth device.

Zerocopy skbs are sent without copying if egressing a device. Verify
that this behavior is maintained even in the common container setup
where data is forwarded over a veth to the physical device.

Update msg_zerocopy.sh to

1. Have a dummy network device to simulate a physical device.
2. Have forwarding enabled between veth and dummy.
3. Add a tx-only test that sends out dummy via the forwarding path.
4. Verify the exitcode of the sender, which signals zerocopy success.

As dummy drops all packets, this cannot be a TCP connection. Test
the new case with unconnected UDP only.

Update msg_zerocopy.c to
- Accept an argument whether send with zerocopy is expected.
- Return an exitcode whether behavior matched that expectation.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630194312.1571410-3-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 15:07:16 -07:00
Willem de Bruijn
d2527ad3a9 net: preserve MSG_ZEROCOPY with forwarding
MSG_ZEROCOPY data must be copied before data is queued to local
sockets, to avoid indefinite timeout until memory release.

This test is performed by skb_orphan_frags_rx, which is called when
looping an egress skb to packet sockets, error queue or ingress path.

To preserve zerocopy for skbs that are looped to ingress but are then
forwarded to an egress device rather than delivered locally, defer
this last check until an skb enters the local IP receive path.

This is analogous to existing behavior of skb_clear_delivery_time.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630194312.1571410-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 15:07:16 -07:00
Jakub Kicinski
04b1d18c5b Merge branch 'vsock-test-check-for-null-ptr-deref-when-transport-changes'
Luigi Leonardi says:

====================
vsock/test: check for null-ptr-deref when transport changes

This series introduces a new test that checks for a null pointer
dereference that may happen when there is a transport change[1]. This
bug was fixed in [2].

Note that this test *cannot* fail, it hangs if it triggers a kernel
oops. The intended use-case is to run it and then check if there is any
oops in the dmesg.

This test is based on Hyunwoo Kim's[3] and Michal's python
reproducers[4].

[1]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/
[2]https://lore.kernel.org/netdev/20250110083511.30419-1-sgarzare@redhat.com/
[3]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/#t
[4]https://lore.kernel.org/netdev/2b3062e3-bdaa-4c94-a3c0-2930595b9670@rbox.co/

v4: https://lore.kernel.org/20250624-test_vsock-v4-1-087c9c8e25a2@redhat.com
v3: https://lore.kernel.org/20250611-test_vsock-v3-1-8414a2d4df62@redhat.com
v2: https://lore.kernel.org/20250314-test_vsock-v2-1-3c0a1d878a6d@redhat.com
v1: https://lore.kernel.org/20250306-test_vsock-v1-0-0320b5accf92@redhat.com
====================

Link: https://patch.msgid.link/20250630-test_vsock-v5-0-2492e141e80b@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 15:05:25 -07:00
Luigi Leonardi
3a764d9338 vsock/test: Add test for null ptr deref when transport changes
Add a new test to ensure that when the transport changes a null pointer
dereference does not occur. The bug was reported upstream [1] and fixed
with commit 2cb7c756f6 ("vsock/virtio: discard packets if the
transport changes").

KASAN: null-ptr-deref in range [0x0000000000000060-0x0000000000000067]
CPU: 2 UID: 0 PID: 463 Comm: kworker/2:3 Not tainted
Workqueue: vsock-loopback vsock_loopback_work
RIP: 0010:vsock_stream_has_data+0x44/0x70
Call Trace:
 virtio_transport_do_close+0x68/0x1a0
 virtio_transport_recv_pkt+0x1045/0x2ae4
 vsock_loopback_work+0x27d/0x3f0
 process_one_work+0x846/0x1420
 worker_thread+0x5b3/0xf80
 kthread+0x35a/0x700
 ret_from_fork+0x2d/0x70
 ret_from_fork_asm+0x1a/0x30

Note that this test may not fail in a kernel without the fix, but it may
hang on the client side if it triggers a kernel oops.

This works by creating a socket, trying to connect to a server, and then
executing a second connect operation on the same socket but to a
different CID (0). This triggers a transport change. If the connect
operation is interrupted by a signal, this could cause a null-ptr-deref.

Since this bug is non-deterministic, we need to try several times. It
is reasonable to assume that the bug will show up within the timeout
period.

If there is a G2H transport loaded in the system, the bug is not
triggered and this test will always pass. This is because
`vsock_assign_transport`, when using CID 0, like in this case, sets
vsk->transport to `transport_g2h` that is not NULL if a G2H transport is
available.

[1]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/

Suggested-by: Hyunwoo Kim <v4bel@theori.io>
Suggested-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250630-test_vsock-v5-2-2492e141e80b@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 15:05:23 -07:00
Luigi Leonardi
e84b20b25d vsock/test: Add macros to identify transports
Add three new macros: TRANSPORTS_G2H, TRANSPORTS_H2G and
TRANSPORTS_LOCAL.
They can be used to identify the type of the transport(s) loaded when
using the `get_transports()` function.

Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250630-test_vsock-v5-1-2492e141e80b@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 15:05:22 -07:00
Matthew Gerlach
6d359cf464 dt-bindings: net: Convert socfpga-dwmac bindings to yaml
Convert the bindings for socfpga-dwmac to yaml. Since the original
text contained descriptions for two separate nodes, two separate
yaml files were created.

Signed-off-by: Mun Yew Tham <mun.yew.tham@altera.com>
Signed-off-by: Matthew Gerlach <matthew.gerlach@altera.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250630213748.71919-1-matthew.gerlach@altera.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 15:04:58 -07:00
Raju Rangoju
9e2a7ad4ae amd-xgbe: add support for giant packet size
AMD XGBE hardware supports giant Ethernet frames up to 16K bytes.
Add support for configuring and enabling giant packet handling
in the driver.

- Define new register fields and macros for giant packet support.
- Update the jumbo frame configuration logic to enable giant
  packet mode when MTU exceeds the jumbo threshold.

Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250701121929.319690-1-Raju.Rangoju@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:50:07 -07:00
Eric Dumazet
7d2dabaa17 net: ifb: support BIG TCP packets
Set the driver limit to GSO_MAX_SIZE (512 KB).

This allows the admin/user to set a GSO limit up to this value, to
avoid segmenting too large GRO packets in the netem -> ifb path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250701084540.459261-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:46:19 -07:00
Jakub Kicinski
7945fe4858 Merge branch 'net-add-data-race-annotations-around-dst-fields'
Eric Dumazet says:

====================
net: add data-race annotations around dst fields

Add annotations around various dst fields, which can change under us.

Add four helpers to prepare better dst->dev protection,
and start using them. More to come later.

v1: https://lore.kernel.org/20250627112526.3615031-1-edumazet@google.com
====================

Link: https://patch.msgid.link/20250630121934.3399505-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:32 -07:00
Eric Dumazet
46a94e44b9 ipv6: ip6_mc_input() and ip6_mr_input() cleanups
Both functions are always called under RCU.

We remove the extra rcu_read_lock()/rcu_read_unlock().

We use skb_dst_dev_net_rcu() instead of skb_dst_dev_net().

We use dev_net_rcu() instead of dev_net().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-11-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:30 -07:00
Eric Dumazet
93d1cff35a ipv6: adopt skb_dst_dev() and skb_dst_dev_net[_rcu]() helpers
Use the new helpers as a step to deal with potential dst->dev races.

v2: fix typo in ipv6_rthdr_rcv() (kernel test robot <lkp@intel.com>)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:30 -07:00
Eric Dumazet
1caf272972 ipv6: adopt dst_dev() helper
Use the new helper as a step to deal with potential dst->dev races.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:30 -07:00
Eric Dumazet
a74fc62eec ipv4: adopt dst_dev, skb_dst_dev and skb_dst_dev_net[_rcu]
Use the new helpers as a first step to deal with
potential dst->dev races.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:30 -07:00
Eric Dumazet
88fe14253e net: dst: add four helpers to annotate data-races around dst->dev
dst->dev is read locklessly in many contexts,
and written in dst_dev_put().

Fixing all the races is going to need many changes.

We probably will have to add full RCU protection.

Add three helpers to ease this painful process.

static inline struct net_device *dst_dev(const struct dst_entry *dst)
{
       return READ_ONCE(dst->dev);
}

static inline struct net_device *skb_dst_dev(const struct sk_buff *skb)
{
       return dst_dev(skb_dst(skb));
}

static inline struct net *skb_dst_dev_net(const struct sk_buff *skb)
{
       return dev_net(skb_dst_dev(skb));
}

static inline struct net *skb_dst_dev_net_rcu(const struct sk_buff *skb)
{
       return dev_net_rcu(skb_dst_dev(skb));
}

Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:30 -07:00
Eric Dumazet
2dce8c52a9 net: dst: annotate data-races around dst->output
dst_dev_put() can overwrite dst->output while other
cpus might read this field (for instance from dst_output())

Add READ_ONCE()/WRITE_ONCE() annotations to suppress
potential issues.

We will likely need RCU protection in the future.

Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:30 -07:00
Eric Dumazet
f1c5fd3489 net: dst: annotate data-races around dst->input
dst_dev_put() can overwrite dst->input while other
cpus might read this field (for instance from dst_input())

Add READ_ONCE()/WRITE_ONCE() annotations to suppress
potential issues.

We will likely need full RCU protection later.

Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:29 -07:00
Eric Dumazet
8f2b2282d0 net: dst: annotate data-races around dst->lastuse
(dst_entry)->lastuse is read and written locklessly,
add corresponding annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:29 -07:00
Eric Dumazet
36229b2cac net: dst: annotate data-races around dst->expires
(dst_entry)->expires is read and written locklessly,
add corresponding annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:29 -07:00
Eric Dumazet
8a402bbe54 net: dst: annotate data-races around dst->obsolete
(dst_entry)->obsolete is read locklessly, add corresponding
annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:29 -07:00
Jakub Kicinski
8077b9a911 Merge branch 'net-introduce-net_aligned_data'
Eric Dumazet says:

====================
net: introduce net_aligned_data

____cacheline_aligned_in_smp on small fields like
tcp_memory_allocated and udp_memory_allocated is not good enough.

It makes sure to put these fields at the start of a cache line,
but does not prevent the linker from using the cache line for other
fields, with potential performance impact.

nm -v vmlinux|egrep -5 "tcp_memory_allocated|udp_memory_allocated"

...
ffffffff83e35cc0 B tcp_memory_allocated
ffffffff83e35cc8 b __key.0
ffffffff83e35cc8 b __tcp_tx_delay_enabled.1
ffffffff83e35ce0 b tcp_orphan_timer
...
ffffffff849dddc0 B udp_memory_allocated
ffffffff849dddc8 B udp_encap_needed_key
ffffffff849dddd8 B udpv6_encap_needed_key
ffffffff849dddf0 b inetsw_lock

One solution is to move these sensitive fields to a structure,
so that the compiler is forced to add empty holes between each member.

nm -v vmlinux|egrep -2 "tcp_memory_allocated|udp_memory_allocated|net_aligned_data"

ffffffff885af970 b mem_id_init
ffffffff885af980 b __key.0
ffffffff885af9c0 B net_aligned_data
ffffffff885afa80 B page_pool_mem_providers
ffffffff885afa90 b __key.2

v1: https://lore.kernel.org/netdev/20250627200551.348096-1-edumazet@google.com/
====================

Link: https://patch.msgid.link/20250630093540.3052835-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:22:04 -07:00
Eric Dumazet
e3d4825124 udp: move udp_memory_allocated into net_aligned_data
____cacheline_aligned_in_smp attribute only makes sure to align
a field to a cache line. It does not prevent the linker to use
the remaining of the cache line for other variables, causing
potential false sharing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:22:02 -07:00
Eric Dumazet
8308133741 tcp: move tcp_memory_allocated into net_aligned_data
____cacheline_aligned_in_smp attribute only makes sure to align
a field to a cache line. It does not prevent the linker to use
the remaining of the cache line for other variables, causing
potential false sharing.

Move tcp_memory_allocated into a dedicated cache line.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:22:02 -07:00
Eric Dumazet
998642e999 net: move net_cookie into net_aligned_data
Using per-cpu data for net->net_cookie generation is overkill,
because even busy hosts do not create hundreds of netns per second.

Make sure to put net_cookie in a private cache line to avoid
potential false sharing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:22:02 -07:00
Eric Dumazet
3715b5df09 net: add struct net_aligned_data
This structure will hold networking data that must
consume a full cache line to avoid accidental false sharing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:22:02 -07:00
zhangjianrong
a8065af334 net: thunderbolt: Enable end-to-end flow control also in transmit
According to USB4 specification, if E2E flow control is disabled for
the Transmit Descriptor Ring, the Host Interface Adapter Layer shall
not require any credits to be available before transmitting a Tunneled
Packet from this Transmit Descriptor Ring, so e2e flow control should
be enabled in both directions.

Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Link: https://lore.kernel.org/20250624153805.GC2824380@black.fi.intel.com
Signed-off-by: zhangjianrong <zhangjianrong5@huawei.com>
Link: https://patch.msgid.link/20250628093813.647005-1-zhangjianrong5@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:17:43 -07:00
zhangjianrong
8ec31cb17c net: thunderbolt: Fix the parameter passing of tb_xdomain_enable_paths()/tb_xdomain_disable_paths()
According to the description of tb_xdomain_enable_paths(), the third
parameter represents the transmit ring and the fifth parameter represents
the receive ring. tb_xdomain_disable_paths() is the same case.

[Jakub] Mika says: it works now because both rings ->hop is the same

Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Link: https://lore.kernel.org/20250625051149.GD2824380@black.fi.intel.com
Signed-off-by: zhangjianrong <zhangjianrong5@huawei.com>
Link: https://patch.msgid.link/20250628094920.656658-1-zhangjianrong5@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:15:09 -07:00
Uwe Kleine-König
e96ee511c9 net: tulip: Rename PCI driver struct to end in _driver
This is not only a cosmetic change because the section mismatch checks
also depend on the name and for drivers the checks are stricter than for
ops.

However xircom_driver also passes the stricter checks just fine, so no
further changes needed.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Link: https://patch.msgid.link/20250627102220.1937649-2-u.kleine-koenig@baylibre.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01 19:36:27 -07:00
Uwe Kleine-König
b9ac2ae000 net: atlantic: Rename PCI driver struct to end in _driver
This is not only a cosmetic change because the section mismatch checks
(implemented in scripts/mod/modpost.c) also depend on the object's name
and for drivers the checks are stricter than for ops.

However aq_pci_driver also passes the stricter checks just fine, so no
further changes needed.

The cheating^Wmisleading name was introduced in commit 97bde5c4f9
("net: ethernet: aquantia: Support for NIC-specific code")

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250630164406.57589-2-u.kleine-koenig@baylibre.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01 19:35:58 -07:00
Lucien.Jheng
6b9c9def95 net: phy: air_en8811h: Introduce resume/suspend and clk_restore_context to ensure correct CKO settings after network interface reinitialization.
If the user reinitializes the network interface, the PHY will reinitialize,
and the CKO settings will revert to their initial configuration(be enabled).
To prevent CKO from being re-enabled,
en8811h_clk_restore_context and en8811h_resume were added
to ensure the CKO settings remain correct.

Signed-off-by: Lucien.Jheng <lucienzx159@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250630154147.80388-1-lucienzx159@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01 19:35:43 -07:00
Christophe JAILLET
10c38949e0 net: dsa: hellcreek: Constify struct devlink_region_ops and struct hellcreek_fdb_entry
'struct devlink_region_ops' and 'struct hellcreek_fdb_entry' are not
modified in this driver.

Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.

On a x86_64, with allmodconfig:
Before:
======
   text	   data	    bss	    dec	    hex	filename
  55320	  19216	    320	  74856	  12468	drivers/net/dsa/hirschmann/hellcreek.o

After:
=====
   text	   data	    bss	    dec	    hex	filename
  55960	  18576	    320	  74856	  12468	drivers/net/dsa/hirschmann/hellcreek.o

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Link: https://patch.msgid.link/2f7e8dc30db18bade94999ac7ce79f333342e979.1751231174.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01 19:33:31 -07:00
Jakub Kicinski
215891acb4 Merge branch 'seg6-fix-typos-in-comments-within-the-srv6-subsystem'
Andrea Mayer says:

====================
seg6: fix typos in comments within the SRv6 subsystem

In this patchset, we correct some typos found both in the SRv6 Endpoints
implementation (i.e., seg6local) and in some SRv6 selftests, using
codespell.
====================

Link: https://patch.msgid.link/20250629171226.4988-1-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01 19:32:47 -07:00
Andrea Mayer
3bedaff19b selftests: seg6: fix instaces typo in comments
Fix a typo:
  instaces -> instances

The typo has been identified using codespell, and the tool does not
report any additional issues in the selftests considered.

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250629171226.4988-3-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01 19:32:45 -07:00
Andrea Mayer
db3e2ceab3 seg6: fix lenghts typo in a comment
Fix a typo:
  lenghts -> length

The typo has been identified using codespell, and the tool currently
does not report any additional issues in comments.

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250629171226.4988-2-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01 19:32:45 -07:00
Christophe JAILLET
a63b5a0bb7 net: dsa: mv88e6xxx: Use kcalloc()
Use kcalloc() instead of hand writing it. This is less verbose.

Also move the initialization of 'count' to save some LoC.

On a x86_64, with allmodconfig, as an example:
Before:
======
   text	   data	    bss	    dec	    hex	filename
  18652	   5920	     64	  24636	   603c	drivers/net/dsa/mv88e6xxx/devlink.o

After:
=====
   text	   data	    bss	    dec	    hex	filename
  18498	   5920	     64	  24482	   5fa2	drivers/net/dsa/mv88e6xxx/devlink.o

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/2f4fca4ff84950da71e007c9169f18a0272476f3.1751200453.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01 19:31:38 -07:00
Christophe JAILLET
ff2d4cfdaf net: dsa: mv88e6xxx: Constify struct devlink_region_ops and struct mv88e6xxx_region
'struct devlink_region_ops' and 'struct mv88e6xxx_region' are not modified
in this driver.

Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.

On a x86_64, with allmodconfig, as an example:
Before:
======
   text	   data	    bss	    dec	    hex	filename
  18076	   6496	     64	  24636	   603c	drivers/net/dsa/mv88e6xxx/devlink.o

After:
=====
   text	   data	    bss	    dec	    hex	filename
  18652	   5920	     64	  24636	   603c	drivers/net/dsa/mv88e6xxx/devlink.o

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/46040062161dda211580002f950a6d60433243dc.1751200453.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01 19:31:38 -07:00
Eric Work
fad9cf2165 net: atlantic: add set_power to fw_ops for atl2 to fix wol
Aquantia AQC113(C) using ATL2FW doesn't properly prepare the NIC for
enabling wake-on-lan. The FW operation `set_power` was only implemented
for `hw_atl` and not `hw_atl2`. Implement the `set_power` functionality
for `hw_atl2`.

Tested with both AQC113 and AQC113C devices. Confirmed you can shutdown
the system and wake from S5 using magic packets. NIC was previously
powered off when entering S5. If the NIC was configured for WOL by the
Windows driver, loading the atlantic driver would disable WOL.

Partially cherry-picks changes from commit,
https://github.com/Aquantia/AQtion/commit/37bd5cc

Attributing original authors from Marvell for the referenced commit.

Closes: https://github.com/Aquantia/AQtion/issues/70
Co-developed-by: Igor Russkikh <irusskikh@marvell.com>
Co-developed-by: Mark Starovoitov <mstarovoitov@marvell.com>
Co-developed-by: Dmitry Bogdanov <dbogdanov@marvell.com>
Co-developed-by: Pavel Belous <pbelous@marvell.com>
Co-developed-by: Nikita Danilov <ndanilov@marvell.com>
Signed-off-by: Eric Work <work.eric@gmail.com>
Reviewed-by: Igor Russkikh <irusskikh@marvell.com>
Link: https://patch.msgid.link/20250629051535.5172-1-work.eric@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01 19:30:58 -07:00
Haiyang Zhang
fbe346ce9d net: mana: Handle Reset Request from MANA NIC
Upon receiving the Reset Request, pause the connection and clean up
queues, wait for the specified period, then resume the NIC.
In the cleanup phase, the HWC is no longer responding, so set hwc_timeout
to zero to skip waiting on the response.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1751055983-29760-1-git-send-email-haiyangz@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01 19:17:53 -07:00
Oleksij Rempel
f461c7a885 phy: micrel: add Signal Quality Indicator (SQI) support for KSZ9477 switch PHYs
Add support for the Signal Quality Indicator (SQI) feature on KSZ9477
family switches, providing a relative measure of receive signal quality.

The hardware exposes separate SQI readings per channel. For 1000BASE-T,
all four channels are read. For 100BASE-TX, only one channel is reported,
but which receive pair is active depends on Auto MDI-X negotiation, which
is not exposed by the hardware. Therefore, it is not possible to reliably
map the measured channel to a specific wire pair.

This resolves an earlier discussion about how to handle multi-channel
SQI. Originally, the plan was to expose all channels individually.
However, since pair mapping is sometimes unavailable, this
implementation treats SQI as a per-link metric instead. This fallback
avoids ambiguity and ensures consistent behavior. The existing get_sqi()
UAPI was designed for single-pair Ethernet (SPE), where per-pair and
per-link are effectively equivalent. Restricting its use to per-link
metrics does not introduce regressions for existing users.

The raw 7-bit SQI value (0–127, lower is better) is converted to the
standard 0–7 (high is better) scale. Empirical testing showed that the
link becomes unstable around a raw value of 8.

The SQI raw value remains zero if no data is received, even if noise is
present. This confirms that the measurement reflects the "quality" during
active data reception rather than the passive line state. User space
must ensure that traffic is present on the link to obtain valid SQI
readings.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250627112539.895255-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01 19:13:18 -07:00
Mina Almasry
be75d319d1 selftests: pp-bench: remove page_pool_put_page wrapper
Minor cleanup: remove the pointless looking _ wrapper around
page_pool_put_page, and just do the call directly.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/20250627200501.1712389-2-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01 17:42:11 -07:00
Mina Almasry
8d3e0982f7 selftests: pp-bench: remove unneeded linux/version.h
linux/version.h was used by the out-of-tree version, but not needed in
the upstream one anymore.

While I'm at it, sort the includes.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202506271434.Gk0epC9H-lkp@intel.com/
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/20250627200501.1712389-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01 17:42:11 -07:00
Nicolas Dichtel
0341e34727 ip6_tunnel: enable to change proto of fb tunnels
This is possible via the ioctl API:
> ip -6 tunnel change ip6tnl0 mode any

Let's align the netlink API:
> ip link set ip6tnl0 type ip6tnl mode any

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://patch.msgid.link/20250630145602.1027220-1-nicolas.dichtel@6wind.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01 17:40:57 -07:00