Commit Graph

1367723 Commits

Author SHA1 Message Date
David Arinzon
51d58804a5 net: ena: PHC silent reset
Each PHC device kernel registration receives a unique kernel index,
which is associated with a new PHC device file located at
"/dev/ptp<index>".
This device file serves as an interface for obtaining PHC timestamps.
Examples of tools that use "/dev/ptp" include testptp [1]
and chrony [2].

A reset flow may occur in the ENA driver while PHC is active.
During a reset, the driver will unregister and then re-register the
PHC device with the kernel.
Under race conditions, particularly during heavy PHC loads,
the driver’s reset flow might complete faster than the kernel’s PHC
unregister/register process.
This can result in the PHC index being different from what it was prior
to the reset, as the PHC index is selected using kernel ID
allocation [3].

While driver rmmod/insmod are done by the user, a reset may occur
at any time without the user's awareness. Consequently, the driver
might receive a new PHC index after the reset, potentially compromising
the user experience.

To prevent this issue, the PHC flow will detect the reset during PHC
destruction and will skip the PHC unregister/register calls to preserve
the kernel PHC index.
During the reset flow, any attempt to get a PHC timestamp will fail as
expected, but the kernel PHC index will remain unchanged.
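
As a rough illustration of the flow described above, the following
user-space C sketch mocks the idea of skipping unregistration during a
reset. Every name in it (phc_state, phc_init, phc_destroy, phc_gettime,
phc_index_survives, next_free_index) is hypothetical and not taken from the
ENA driver. The mock allocator hands out a new index on every registration,
which is a simplification of the kernel's ID allocation.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the ENA driver's real types. */
struct phc_state {
	bool registered; /* clock is registered with the (mock) kernel */
	bool active;     /* timestamp requests are served */
	int index;       /* index behind /dev/ptp<index> */
};

/* Mock allocator (assumption): every registration gets a fresh index. */
static int next_free_index;

void phc_init(struct phc_state *phc)
{
	if (!phc->registered) {
		/* Normal path: register and receive a new index. */
		phc->index = next_free_index++;
		phc->registered = true;
	}
	/* Reset path: registration was kept, only re-enable timestamps. */
	phc->active = true;
}

void phc_destroy(struct phc_state *phc, bool in_reset)
{
	phc->active = false;
	if (in_reset)
		return; /* skip unregister so the index survives the reset */
	phc->registered = false;
}

/* Returns 0 on success, -1 while the PHC is inactive (e.g. mid-reset). */
int phc_gettime(const struct phc_state *phc, long long *ns)
{
	if (!phc->active)
		return -1;
	*ns = 0; /* a real driver would read the device clock here */
	return 0;
}

/* Runs one destroy/init cycle and reports whether the index was kept. */
bool phc_index_survives(bool in_reset)
{
	struct phc_state phc = { 0 };
	int before;

	phc_init(&phc);
	before = phc.index;
	phc_destroy(&phc, in_reset);
	phc_init(&phc);
	return phc.index == before;
}
```

In this sketch, phc_index_survives(true) models a driver reset and keeps
the index, while phc_index_survives(false) models a full
unregister/register cycle and ends up with a different one.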

[1]: https://github.com/torvalds/linux/blob/v6.1/tools/testing/selftests/ptp/testptp.c
[2]: https://github.com/mlichvar/chrony
[3]: https://www.kernel.org/doc/html/latest/core-api/idr.html

Signed-off-by: Amit Bernstein <amitbern@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Link: https://patch.msgid.link/20250617110545.5659-3-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 18:57:28 -07:00
David Arinzon
e0ea34158e net: ena: Add PHC support in the ENA driver
The ENA driver will be extended to support the new PHC feature using the
ptp_clock interface [1]. This will provide a timestamp reference for user
space, allowing it to measure the time offset between the PHC and the
system clock in order to achieve nanosecond accuracy.

[1] - https://www.kernel.org/doc/html/latest/driver-api/ptp.html

Signed-off-by: Amit Bernstein <amitbern@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Link: https://patch.msgid.link/20250617110545.5659-2-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 18:57:28 -07:00
Jakub Kicinski
253833da4e Merge branch 'udp_tunnel-remove-rtnl_lock-dependency'
Stanislav Fomichev says:

====================
udp_tunnel: remove rtnl_lock dependency

Recently bnxt had to grow back a bunch of rtnl dependencies because
of udp_tunnel's infra. Add a separate (global) mutex to protect
udp_tunnel state.
====================

Link: https://patch.msgid.link/20250616162117.287806-1-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 18:53:53 -07:00
Stanislav Fomichev
850d9248d2 Revert "bnxt_en: bring back rtnl_lock() in the bnxt_open() path"
This reverts commit 325eb217e4.

udp_tunnel infra doesn't need RTNL, should be safe to get back
to only netdev instance lock.

Cc: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250616162117.287806-7-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 18:53:51 -07:00
Stanislav Fomichev
e054c8ba3b netdevsim: remove udp_ports_sleep
Now that there is only one path in udp_tunnel, there is no need to
have udp_ports_sleep knob. Remove it and adjust the test.

Cc: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250616162117.287806-6-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 18:53:51 -07:00
Stanislav Fomichev
3a321b6b1f net: remove redundant ASSERT_RTNL() in queue setup functions
The existing netdev_ops_assert_locked() already asserts that either
the RTNL lock or the per-device lock is held, making the explicit
ASSERT_RTNL() redundant.

Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250616162117.287806-5-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 18:53:51 -07:00
Stanislav Fomichev
1ead750109 udp_tunnel: remove rtnl_lock dependency
Drivers that are using ops lock and don't depend on RTNL lock
still need to manage it because of udp_tunnel's RTNL dependency.
Introduce new udp_tunnel_nic_lock and use it instead of
rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from
udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to
grab udp_tunnel_nic_lock mutex and might sleep).

Cover more places in v4:

- netlink
  - udp_tunnel_notify_add_rx_port (ndo_open)
    - triggers udp_tunnel_nic_device_sync_work
  - udp_tunnel_notify_del_rx_port (ndo_stop)
    - triggers udp_tunnel_nic_device_sync_work
  - udp_tunnel_get_rx_info (__netdev_update_features)
    - triggers NETDEV_UDP_TUNNEL_PUSH_INFO
  - udp_tunnel_drop_rx_info (__netdev_update_features)
    - triggers NETDEV_UDP_TUNNEL_DROP_INFO
  - udp_tunnel_nic_reset_ntf (ndo_open)

- notifiers
  - udp_tunnel_nic_netdevice_event, depending on the event:
    - triggers NETDEV_UDP_TUNNEL_PUSH_INFO
    - triggers NETDEV_UDP_TUNNEL_DROP_INFO

- ethnl_tunnel_info_reply_size
- udp_tunnel_nic_set_port_priv (two intel drivers)

Cc: Michael Chan <michael.chan@broadcom.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250616162117.287806-4-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 18:53:51 -07:00
Stanislav Fomichev
df5425b3bd vxlan: drop sock_lock
We won't be able to sleep soon in vxlan_offload_rx_ports and won't be
able to grab sock_lock. Instead of having separate spinlock to
manage sockets, rely on rtnl lock. This is similar to how geneve
manages its sockets.

Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250616162117.287806-3-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 18:53:51 -07:00
Stanislav Fomichev
3e14960f3b geneve: rely on rtnl lock in geneve_offload_rx_ports
udp_tunnel_push_rx_port will grab mutex in the next patch so
we can't use rcu. geneve_offload_rx_ports is called
from geneve_netdevice_event for NETDEV_UDP_TUNNEL_PUSH_INFO and
NETDEV_UDP_TUNNEL_DROP_INFO which both have ASSERT_RTNL.
Entries are added to and removed from the sock_list under rtnl
lock as well (when adding or removing a tunneling device).

Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250616162117.287806-2-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 18:53:51 -07:00
Yue Haibing
a33556940b tcp: Remove inet_hashinfo2_free_mod()
DCCP was removed, inet_hashinfo2_free_mod() is unused now.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250617130613.498659-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 18:29:58 -07:00
Heiner Kallweit
d8155c1df5 dpaa_eth: don't use fixed_phy_change_carrier
This effectively reverts 6e8b0ff1ba ("dpaa_eth: Add change_carrier()
for Fixed PHYs"). Usage of fixed_phy_change_carrier() requires that
fixed_phy_register() has been called before, directly or indirectly.
And that's not the case in this driver.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/7eb189b3-d5fd-4be6-8517-a66671a4e4e3@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 18:28:54 -07:00
Mark Zhang
e0e3265acf net/mlx4e: Don't redefine IB_MTU_XXX enum
Rely on existing IB_MTU_XXX definitions which exist in ib_verbs.h.

Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/382c91ee506e7f1f3c1801957df6b28963484b7d.1750147222.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 14:17:56 -07:00
Simon Horman
a9874d961e nfc: Remove checks for nla_data returning NULL
The implementation of nla_data is as follows:

static inline void *nla_data(const struct nlattr *nla)
{
	return (char *) nla + NLA_HDRLEN;
}

Excluding the case where nla is exactly -NLA_HDRLEN, it will not return
NULL. And it seems misleading to assume that it can, other than in this
corner case. So drop checks for this condition.

Flagged by Smatch.

Compile tested only.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250617-nfc-null-data-v1-1-c7525ead2e95@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 14:17:32 -07:00
Jakub Kicinski
4f451b977e Merge branch 'eth-migrate-more-drivers-to-new-rxfh-callbacks'
Jakub Kicinski says:

====================
eth: migrate more drivers to new RXFH callbacks

Migrate a batch of drivers to the recently added dedicated
.get_rxfh_fields and .set_rxfh_fields ethtool callbacks.
====================

Link: https://patch.msgid.link/20250617014848.436741-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 13:19:08 -07:00
Jakub Kicinski
c2cd2f6125 eth: sxgbe: migrate to new RXFH callbacks
Migrate to new callbacks added by commit 9bb00786fc ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

RXFH is all this driver supports in RXNFC so old callbacks are
completely removed.

Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250617014848.436741-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 13:19:01 -07:00
Jakub Kicinski
20ffe3bbc2 eth: dpaa2: migrate to new RXFH callbacks
Migrate to new callbacks added by commit 9bb00786fc ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250617014848.436741-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 13:19:00 -07:00
Jakub Kicinski
17da66f140 eth: dpaa: migrate to new RXFH callbacks
Migrate to new callbacks added by commit 9bb00786fc ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

RXFH is all this driver supports in RXNFC so old callbacks are
completely removed.

Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250617014848.436741-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 13:19:00 -07:00
Jakub Kicinski
b6f7e4fafe eth: mvpp2: migrate to new RXFH callbacks
Migrate to new callbacks added by commit 9bb00786fc ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250617014848.436741-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 13:19:00 -07:00
Jakub Kicinski
b82d92dd71 eth: niu: migrate to new RXFH callbacks
Migrate to new callbacks added by commit 9bb00786fc ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250617014848.436741-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 13:19:00 -07:00
Jakub Kicinski
2fca0d1277 Merge branch 'eth-migrate-some-drivers-to-new-rxfh-callbacks'
Jakub Kicinski says:

====================
eth: migrate some drivers to new RXFH callbacks

Migrate a batch of drivers to the recently added dedicated
.get_rxfh_fields and .set_rxfh_fields ethtool callbacks.
====================

Link: https://patch.msgid.link/20250617014555.434790-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 13:17:52 -07:00
Jakub Kicinski
f99ff3c2a3 eth: otx2: migrate to new RXFH callbacks
Migrate to new callbacks added by commit 9bb00786fc ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

Link: https://patch.msgid.link/20250617014555.434790-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 13:17:50 -07:00
Jakub Kicinski
e8b8738439 eth: thunder: migrate to new RXFH callbacks
Migrate to new callbacks added by commit 9bb00786fc ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

The driver has no other RXNFC functionality so the SET callback can
be now removed.

Link: https://patch.msgid.link/20250617014555.434790-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 13:17:47 -07:00
Jakub Kicinski
e7860a6e18 eth: ena: migrate to new RXFH callbacks
Migrate to new callbacks added by commit 9bb00786fc ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

The driver has no other RXNFC functionality so the SET callback can
be now removed.

Reviewed-by: David Arinzon <darinzon@amazon.com>
Link: https://patch.msgid.link/20250617014555.434790-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 13:17:43 -07:00
Jakub Kicinski
82113468a0 eth: bnxt: migrate to new RXFH callbacks
Migrate to new callbacks added by commit 9bb00786fc ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250617014555.434790-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 13:17:38 -07:00
Jakub Kicinski
f1a6fcc454 eth: bnx2x: migrate to new RXFH callbacks
Migrate to new callbacks added by commit 9bb00786fc ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

The driver has no other RXNFC functionality so the SET callback can
be now removed.

Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/20250617014555.434790-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 13:17:32 -07:00
David S. Miller
fc4842cd0f Merge branch 'netconsole-msgid' into main
Gustavo Luiz Duarte says:

====================
netconsole: Add support for msgid in sysdata

This patch series introduces a new feature to netconsole which allows
appending a message ID to the userdata dictionary.

If the msgid feature is enabled, the message ID is built from a per-target 32
bit counter that is incremented and appended to every message sent to the target.

Example::
  echo 1 > "/sys/kernel/config/netconsole/cmdline0/userdata/msgid_enabled"
  echo "This is message #1" > /dev/kmsg
  echo "This is message #2" > /dev/kmsg
  13,434,54928466,-;This is message #1
   msgid=1
  13,435,54934019,-;This is message #2
   msgid=2

This feature can be used by the target to detect if messages were dropped or
reordered before reaching the target. This allows system administrators to
assess the reliability of their netconsole pipeline and detect loss of messages
due to network contention or temporary unavailability.
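
On the receiving side, a check along these lines could be built on the
msgid values. This is only a sketch: the function name msgid_missing is
hypothetical, and treating a backwards jump of more than half the 32-bit
range as a late or duplicate message is an assumption made here, not
something the series specifies. Parsing the "msgid=" field out of the
received text is left out.

```c
#include <stdint.h>

/*
 * Compare the msgid of the previously received message with the current one.
 * Returns 0 when cur directly follows prev, a positive count of missing
 * messages when there is a gap, and -1 when cur is a duplicate or arrived
 * late (reordered).
 */
long long msgid_missing(uint32_t prev, uint32_t cur)
{
	uint32_t delta = cur - prev; /* unsigned math wraps modulo 2^32 */

	if (delta == 0 || delta > UINT32_MAX / 2)
		return -1;
	return (long long)delta - 1;
}
```

Because the subtraction is done on unsigned 32-bit values, a counter that
wraps from its maximum back to 0 is still seen as consecutive.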

---
Changes in v3:
- Add kdoc documentation for msgcounter.
- Link to v2: https://lore.kernel.org/r/20250612-netconsole-msgid-v2-0-d4c1abc84bac@gmail.com

Changes in v2:
- Use wrapping_assign_add() to avoid warnings in UBSAN and friends.
- Improve documentation to clarify wrapping and distinguish msgid from sequnum.
- Rebase and fix conflict in prepare_extradata().
- Link to v1: https://lore.kernel.org/r/20250611-netconsole-msgid-v1-0-1784a51feb1e@gmail.com
====================

Suggested-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-06-18 10:46:31 +01:00
Gustavo Luiz Duarte
8c587aa3fa docs: netconsole: document msgid feature
Add documentation explaining the msgid feature in netconsole.

This feature appends a unique ID to the userdata dictionary. The message
ID is populated from a per-target 32 bit counter which is incremented
for each message sent to the target. This allows a target to detect if
messages are dropped before reaching the target.

Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-06-18 10:46:10 +01:00
Gustavo Luiz Duarte
68707c079e selftests: netconsole: Add tests for 'msgid' feature in sysdata
Extend the self-tests to cover the 'msgid' feature in sysdata.

Verify that msgid is appended to the message when the feature is enabled
and that it is not appended when the feature is disabled.

Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com>
Reviewed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-06-18 10:46:10 +01:00
Gustavo Luiz Duarte
c5efaabd45 netconsole: append msgid to sysdata
Add msgcounter to the netconsole_target struct to generate message IDs.
If the msgid_enabled attribute is true, increment msgcounter and append
msgid=<msgcounter> to sysdata buffer before sending the message.

Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com>
Reviewed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-06-18 10:46:10 +01:00
Gustavo Luiz Duarte
53def0c4c8 netconsole: implement configfs for msgid_enabled
Implement the _show and _store functions for the msgid_enabled configfs
attribute under userdata.
Set the sysdata_fields bit accordingly.

Reviewed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-06-18 10:46:10 +01:00
Gustavo Luiz Duarte
15b3c930a2 netconsole: introduce 'msgid' as a new sysdata field
This adds a new sysdata field to enable assigning a per-target unique id
to each message sent to that target. This id can later be appended as
part of sysdata, allowing targets to detect dropped netconsole messages.
Update count_extradata_entries() to take the new field into account.

Reviewed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-06-18 10:46:10 +01:00
Simon Horman
ec315832f6 dpll: remove documentation of rclk_dev_name
Remove documentation of rclk_dev_name member of dpll_device which
doesn't exist.

Flagged by ./scripts/kernel-doc -none

Introduced by commit 9431063ad3 ("dpll: core: Add DPLL framework base
functions")

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250616-dpll-member-v1-1-8c9e6b8e1fd4@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:53:37 -07:00
Jakub Kicinski
189bd9c873 Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
libeth: add libeth_xdp helper lib

Alexander Lobakin says:

Time to add XDP helpers infra to libeth to greatly simplify adding
XDP to idpf and iavf, as well as improve and extend XDP in ice and
i40e. Any vendor is free to reuse helpers. If this happens, I'm fine
with moving the folder out of intel/.

The helpers greatly simplify building xdp_buff, running a prog,
handling the verdict, implement XDP_TX, .ndo_xdp_xmit, XDP buffer
completion. Same applies to XSk (with XSk xmit instead of
.ndo_xdp_xmit, plus stuff like XSk wakeup).
They are entirely generic with no HW definitions or assumptions.
HW-specific stuff like parsing Rx desc / filling Tx desc is passed
from the driver as inline callbacks.

For now, key assumptions that optimize performance / avoid code
bloat, but might not fit every driver in drivers/net/:
 * netmem holding the buffers are always order-0;
 * driver has separate XDP Tx queues, doesn't use stack queues for
   that. For best efficiency, you may want to have nr_cpu_ids XDP
   queues, but less (queue sharing) is also supported;
 * XDP Tx queues are interrupt-less and use "lazy" cleaning only
   when there are less than 1/4 free Tx descriptors of the queue
   size;
 * main target platforms are 64-bit; 32-bit is also fully
   supported, but the code might not be as optimized for it.

Library code already supports multi-buffer for all kinds of Tx and
both header split and no split for Rx and Tx. Frags can come from
devmem/io_uring etc., direct `struct page *` is used only for header
buffers for which it's always true.
Drivers are free to pass their own Rx hints and XSK xmit hints ops.

XDP_TX and ndo_xdp_xmit use an on-stack bulk for the frames to be sent
and send them in batches of 16 buffers. This eats ~280 bytes on the
stack, but gives good boosts and allows greatly optimizing the main
sending function, leaving it without any error/exception paths.
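
The batching idea can be pictured with a small user-space sketch. All
names here (demo_frame, demo_bulk, demo_bulk_enqueue, demo_bulk_flush,
demo_batches_for) are made up for illustration and are not the libeth_xdp
API. The only detail taken from the text is the batch size of 16.

```c
#include <stddef.h>

#define BULK_MAX 16 /* batch size quoted in the text */

/* Hypothetical frame descriptor; a real driver would carry more state. */
struct demo_frame {
	const void *data;
	size_t len;
};

/* Meant to live on the caller's stack for the duration of one send run. */
struct demo_bulk {
	size_t count;
	size_t batches; /* number of non-empty flushes, for illustration */
	struct demo_frame frames[BULK_MAX];
};

/* Hand the collected frames to the (mock) hardware queue in one go. */
void demo_bulk_flush(struct demo_bulk *bq)
{
	if (!bq->count)
		return;
	/* A real driver would fill Tx descriptors for frames[0..count) here. */
	bq->batches++;
	bq->count = 0;
}

/* Queue one frame, flushing first when the bulk is already full. */
void demo_bulk_enqueue(struct demo_bulk *bq, const void *data, size_t len)
{
	if (bq->count == BULK_MAX)
		demo_bulk_flush(bq);
	bq->frames[bq->count].data = data;
	bq->frames[bq->count].len = len;
	bq->count++;
}

/* Send n dummy frames and report how many batches were needed. */
size_t demo_batches_for(size_t n)
{
	struct demo_bulk bq = { 0 };
	static const char payload[1] = { 0 };
	size_t i;

	for (i = 0; i < n; i++)
		demo_bulk_enqueue(&bq, payload, sizeof(payload));
	demo_bulk_flush(&bq); /* final, possibly partial, batch */
	return bq.batches;
}
```

For example, 40 frames go out as two full batches of 16 plus one partial
batch of 8 in this sketch.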

XSk xmit fills Tx descriptors in a loop unrolled by 8. This was
proven to improve perf on ice and i40e. XDP_TX and ndo_xdp_xmit
don't use unrolling as I wasn't able to get any improvements in
those scenarios from this, while +1 Kb for their sending functions
for nothing doesn't sound reasonable.

XSk wakeup, instead of traditionally used "SW interrupts" provided
by NICs, uses IPI to schedule NAPI on the CPU corresponding to the
given queue pair. It gives better control over CPU distribution and
in general performs way better than "SW interrupts", plus allows us
to not pass any HW-specific callbacks there.

The code is built in such a way that all callbacks passed from drivers
get inlined; in general, most of the hotpath gets inlined. Everything
slow/exception lands in .c files in the libeth folder, doesn't
create copies in the drivers themselves and doesn't overload the
hotpath.
Sure, inlining means that hotpath will be compiled into every driver
that uses the lib, but the core code is written in one place, so no
copying of bugs happens. Fixed once -- works everywhere.

The last commit might look like sorta hack, but it gives really good
boosts and decreases object code size, plus there are checks that
all those wider accesses are fully safe, so I don't feel anything
bad about it.

An example of using libeth_xdp can be found either on my GitHub or
on the mailing lists here ("XDP for idpf"). Thanks to the macros for
building driver XDP functions, some implementations (XDP_TX,
ndo_xdp_xmit etc.) consist of really only a few lines.

* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  libeth: xdp, xsk: access adjacent u32s as u64 where applicable
  libeth: xsk: add XSkFQ refill and XSk wakeup helpers
  libeth: xsk: add XSk Rx processing support
  libeth: xsk: add XSk xmit functions
  libeth: xsk: add XSk XDP_TX sending helpers
  libeth: xdp: add RSS hash hint and XDP features setup helpers
  libeth: xdp: add templates for building driver-side callbacks
  libeth: xdp: add XDP prog run and verdict result handling
  libeth: xdp: add helpers for preparing/processing &libeth_xdp_buff
  libeth: xdp: add XDPSQ cleanup timers
  libeth: xdp: add XDPSQ locking helpers
  libeth: xdp: add XDPSQE completion helpers
  libeth: xdp: add .ndo_xdp_xmit() helpers
  libeth: xdp: add XDP_TX buffers sending
  libeth: support native XDP and register memory model
  libeth: convert to netmem
  libeth, libie: clean symbol exports up a little
====================

Link: https://patch.msgid.link/20250616201639.710420-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:50:57 -07:00
Jakub Kicinski
8152c4028c Merge branch 'net-mlx5e-add-support-for-devmem-and-io_uring-tcp-zero-copy'
Mark Bloch says:

====================
net/mlx5e: Add support for devmem and io_uring TCP zero-copy

This series adds support for zerocopy rx TCP with devmem and io_uring
for ConnectX7 NICs and above. For performance reasons and simplicity
HW-GRO will also be turned on when header-data split mode is on.

Performance
===========

Test setup:

* CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (single NUMA)
* NIC: ConnectX7
* Benchmarking tool: kperf [0]
* Single TCP flow
* Test duration: 60s

With application thread and interrupts pinned to the *same* core:

|------+-----------+----------|
| MTU  | epoll     | io_uring |
|------+-----------+----------|
| 1500 | 61.6 Gbps | 114 Gbps |
| 4096 | 69.3 Gbps | 151 Gbps |
| 9000 | 67.8 Gbps | 187 Gbps |
|------+-----------+----------|

The CPU usage for io_uring is 95%.

Reproduction steps for io_uring:

server --no-daemon -a 2001:db8::1 --no-memcmp --iou --iou_sendzc \
	--iou_zcrx --iou_dev_name eth2 --iou_zcrx_queue_id 2

server --no-daemon -a 2001:db8::2 --no-memcmp --iou --iou_sendzc

client --src 2001:db8::2 --dst 2001:db8::1 \
	--msg-zerocopy -t 60 --cpu-min=2 --cpu-max=2

Patch overview:
================

First, a netmem API for skb_can_coalesce is added to the core to be able
to do skb fragment coalescing on netmems.

The next patches introduce some cleanups in the internal SHAMPO code and
improvements to hw gro capability checks in FW.

A separate page_pool is introduced for headers, to be used only when
the rxq has a memory provider.

Then the driver is converted to use the netmem API and to allow support
for unreadable netmem page pool.

The queue management ops are implemented.

Finally, the tcp-data-split ring parameter is exposed.

References
==========

[0] kperf: git://git.kernel.dk/kperf.git
v1: https://lore.kernel.org/20250116215530.158886-1-saeed@kernel.org
v2: https://lore.kernel.org/1747950086-1246773-1-git-send-email-tariqt@nvidia.com
v3: https://lore.kernel.org/20250609145833.990793-1-mbloch@nvidia.com
v4: https://lore.kernel.org/20250610150950.1094376-1-mbloch@nvidia.com
v5: https://lore.kernel.org/20250612154648.1161201-1-mbloch@nvidia.com
====================

Link: https://patch.msgid.link/20250616141441.1243044-1-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:34:15 -07:00
Dragos Tatulea
5a842c288c net/mlx5e: Add TX support for netmems
Declare netmem TX support in netdev.

As required, use the netmem aware dma unmapping APIs
for unmapping netmems in tx completion path.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-13-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:34:13 -07:00
Saeed Mahameed
46bcce5dfd net/mlx5e: Support ethtool tcp-data-split settings
In mlx5, tcp header-data split requires HW GRO to be on.

Enabling it fails when HW GRO is off.
mlx5e_fix_features now keeps HW GRO on when tcp data split is enabled.
Finally, when tcp data split is disabled, features are updated to maybe
remove the forced HW GRO.
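
The dependency between the two settings can be summarized in a small
sketch. The helper names and the error value below are hypothetical and do
not mirror the actual mlx5e functions. They only restate the rules above:
enabling the split is refused while HW GRO is off, and HW GRO is kept on
for as long as the split stays enabled.

```c
#include <stdbool.h>

#define DEMO_EINVAL 22 /* stand-in error code for the sketch */

/* Enabling tcp-data-split is refused while HW GRO is off. */
int demo_check_split_enable(bool hw_gro_on)
{
	return hw_gro_on ? 0 : -DEMO_EINVAL;
}

/*
 * fix_features-style hook: while the split is enabled, a request to turn
 * HW GRO off is overridden; once the split is disabled, the requested
 * value is honoured again.
 */
bool demo_fix_hw_gro(bool split_on, bool requested_hw_gro)
{
	return split_on ? true : requested_hw_gro;
}
```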

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-12-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:34:13 -07:00
Saeed Mahameed
b2588ea40e net/mlx5e: Implement queue mgmt ops and single channel swap
The bulk of the work is done in mlx5e_queue_mem_alloc, where we allocate
and create the new channel resources, similar to
mlx5e_safe_switch_params, but here we do it for a single channel using
existing params, sort of a clone channel.
To swap the old channel with the new one, we deactivate and close the
old channel, then replace it with the new one. Since the swap procedure
doesn't fail in mlx5, we do it all in one place (mlx5e_queue_start).

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Acked-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250616141441.1243044-11-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:34:13 -07:00
Saeed Mahameed
db3010bb5a net/mlx5e: Add support for UNREADABLE netmem page pools
On netdev_rx_queue_restart, a special type of page pool may be expected.

This patch declares support for UNREADABLE netmem iov pages in the
pool params only when header data split shampo RQ mode is enabled, and
also sets the queue index in the page pool params struct.

Shampo mode requirement: without header split, rx needs to peek at the
data, so we can't do UNREADABLE_NETMEM.

The patch also enables the use of a separate page pool for headers when
a memory provider is installed for the queue, otherwise the same common
page pool continues to be used.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-10-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:34:12 -07:00
Saeed Mahameed
d1668f1199 net/mlx5e: Convert over to netmem
mlx5e_page_frag holds the physical page itself. To naturally support
zc page pools, remove the physical page reference from mlx5 and replace
it with netmem_ref, avoiding internal handling in mlx5 for net_iov
backed pages.

SHAMPO can issue packets that are not split into header and data. These
packets will be dropped if the data part resides in a net_iov as the
driver can't read into this area.

No performance degradation observed.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-9-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:34:12 -07:00
Saeed Mahameed
e225d9bd93 net/mlx5e: SHAMPO: Separate pool for headers
Allow allocating a separate page pool for headers when SHAMPO is on.
This will be useful for adding support to zc page pool, which has to be
different from the headers page pool.
For now, the pools are the same.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-8-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:34:12 -07:00
Saeed Mahameed
d2760abded net/mlx5e: SHAMPO: Improve hw gro capability checking
Add missing HW capabilities, declare the feature in
netdev->vlan_features, similar to other features in mlx5e_build_nic_netdev.
No functional change here as all by default disabled features are
explicitly disabled at the bottom of the function.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-7-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:34:12 -07:00
Saeed Mahameed
16142defd3 net/mlx5e: SHAMPO: Remove redundant params
Two SHAMPO params are static and always the same, remove them from the
global mlx5e_params struct.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-6-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:34:12 -07:00
Saeed Mahameed
af4312c4c9 net/mlx5e: SHAMPO: Reorganize mlx5_rq_shampo_alloc
Drop redundant SHAMPO structure alloc/free functions.

Gather together the function calls pertaining to header split info, and pass
headers per WQE (hd_per_wqe) as a parameter to those functions to avoid
future use-before-initialization mistakes.

Allocate HW GRO related info outside of the header related info scope.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-5-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:34:12 -07:00
Dragos Tatulea
a202f24b08 page_pool: Add page_pool_dev_alloc_netmems helper
This is the netmem counterpart of page_pool_dev_alloc_pages() which
uses the default GFP flags for RX.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-4-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:34:12 -07:00
Dragos Tatulea
1cbb49f85b net: Add skb_can_coalesce for netmem
Allow drivers that have moved over to netmem to do fragment coalescing.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-3-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:34:11 -07:00
Dragos Tatulea
c9e1225352 net: Allow const args for page_to_netmem()
This allows calling page_to_netmem() with a const page * argument.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-2-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:34:11 -07:00
Tejun Heo
fd0406e5ca net: tcp: tsq: Convert from tasklet to BH workqueue
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.

This patch converts TCP Small Queues implementation from tasklet to BH
workqueue.

Semantically, this is an equivalent conversion and there shouldn't be any
user-visible behavior changes. While workqueue's queueing and execution
paths are a bit heavier than tasklet's, unless the work item is being queued
for every packet, the difference hopefully shouldn't matter.

My experience with the networking stack is very limited and this patch
definitely needs attention from someone who actually understands networking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/aFBeJ38AS1ZF3Dq5@slm.duckdns.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:29:21 -07:00
Jakub Kicinski
e15962ae74 Merge branch 'ipmr-ip6mr-allow-mc-routing-locally-generated-mc-packets'
Petr Machata says:

====================
ipmr, ip6mr: Allow MC-routing locally-generated MC packets

Multicast routing is today handled in the input path. Locally generated MC
packets don't hit the IPMR code. Thus if a VXLAN remote address is
multicast, the driver needs to set an OIF during route lookup. In practice
that means that MC routing configuration needs to be kept in sync with the
VXLAN FDB and MDB. Ideally, the VXLAN packets would be routed by the MC
routing code instead.

To that end, this patchset adds support to route locally generated
multicast packets.

However, an installation that uses a VXLAN underlay netdevice for which it
also has matching MC routes would get different routing with this patchset.
Previously, the MC packets would be delivered directly to the underlay
port, whereas now they would be MC-routed. In order to avoid this change in
behavior, introduce an IPCB/IP6CB flag. Unless the flag is set, the new
MC-routing code is skipped.

All this is keyed to a new VXLAN attribute, IFLA_VXLAN_MC_ROUTE. Only when
it is set does any of the above engage.
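
The gating described above can be pictured with a small user-space model.
This is only a sketch of the decision logic, not kernel code: Packet,
vxlan_xmit() and output_decision() are hypothetical names, mc_route_flag
stands in for the IPCB/IP6CB flag, and mc_route_enabled stands in for
IFLA_VXLAN_MC_ROUTE.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    dst_is_multicast: bool
    # Stand-in for the IPCB/IP6CB flag (hypothetical name).
    mc_route_flag: bool = False


def vxlan_xmit(packet: Packet, mc_route_enabled: bool) -> Packet:
    """Mark the packet only if the VXLAN device opted in.

    mc_route_enabled models the IFLA_VXLAN_MC_ROUTE attribute.
    """
    packet.mc_route_flag = mc_route_enabled
    return packet


def output_decision(packet: Packet) -> str:
    """Pick the output handling for a locally generated packet."""
    if packet.dst_is_multicast and packet.mc_route_flag:
        return "mc-route"  # new behavior: hand over to MC routing
    return "direct"        # old behavior: deliver to the underlay port
```

In this model a multicast packet sent through a device without the
attribute still takes the "direct" path, which is the backward
compatibility the flag is meant to preserve.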

In addition to that, and as is the case today with MC forwarding,
IPV4_DEVCONF_MC_FORWARDING must be enabled for the netdevice that acts as a
source of MC traffic (i.e. the VXLAN PHYS_DEV), so an MC daemon must be
attached to the netdevice.

When a VXLAN netdevice with an MC remote is brought up, the physical
netdevice joins the indicated MC group. This is important for local
delivery of MC packets, so it is still necessary to configure a physical
netdevice -- the parameter cannot go away. The netdevice would however
typically not be a front panel port, but a dummy. An MC daemon would then
sit on top of that netdevice as well as any front panel ports that it needs
to service, and have routes set up between the two.

A way to configure the VXLAN netdevice to take advantage of the new MC
routing would be:

 # ip link add name d up type dummy
 # ip link add name vx10 up type vxlan id 1000 dstport 4789 \
	local 192.0.2.1 group 225.0.0.1 ttl 16 dev d mcroute
 # ip link set dev vx10 master br # plus vlans etc.

With the following MC routes:

 (192.0.2.1, 225.0.0.1) iif=d oil=swp1,swp2 # TX route
 (*, 225.0.0.1) iif=swp1 oil=d,swp2         # RX route
 (*, 225.0.0.1) iif=swp2 oil=d,swp1         # RX route

The RX path has not changed, with the exception of an extra MC hop. Packets
are delivered to the front panel port and MC-forwarded to the VXLAN
physical port, here "d". Since the port has joined the multicast group, the
packets are locally delivered, and end up being processed by the VXLAN
netdevice.
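
To make the three example routes concrete, here is a toy user-space model
of the lookup. It is a sketch under simplifying assumptions, not the
kernel's MFC implementation: McRoute and lookup() are hypothetical names,
a source of None models the "*" wildcard, and an exact (S, G) match is
assumed to take precedence over a (*, G) match.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class McRoute(NamedTuple):
    source: Optional[str]   # None models the "*" wildcard
    group: str
    iif: str                # input interface the route applies to
    oil: Tuple[str, ...]    # output interface list


# The three routes from the example above.
ROUTES = [
    McRoute("192.0.2.1", "225.0.0.1", "d", ("swp1", "swp2")),  # TX route
    McRoute(None, "225.0.0.1", "swp1", ("d", "swp2")),         # RX route
    McRoute(None, "225.0.0.1", "swp2", ("d", "swp1")),         # RX route
]


def lookup(source: str, group: str, iif: str,
           routes: Sequence[McRoute] = ROUTES) -> Tuple[str, ...]:
    """Return the output interfaces for a packet, or () if nothing matches."""
    wildcard = None
    for route in routes:
        if route.group != group or route.iif != iif:
            continue
        if route.source == source:
            return route.oil        # exact (S, G) match wins
        if route.source is None and wildcard is None:
            wildcard = route        # remember the first (*, G) match
    return wildcard.oil if wildcard is not None else ()
```

A locally generated VXLAN packet from 192.0.2.1 enters via "d" and is sent
out of swp1 and swp2. A packet arriving on swp1 is forwarded to "d", where
it is delivered locally, and to swp2.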

This patchset is based on earlier patches from Nikolay Aleksandrov and
Roopa Prabhu, though it underwent significant changes. Roopa broadly
presented the topic at LPC 2019 [0].

Patchset progression:

- Patches #1 to #4 add ip_mr_output()
- Patches #5 to #10 add ip6_mr_output()
- Patch #11 adds the VXLAN bits to enable MR engagement
- Patches #12 to #14 prepare selftest libraries
- Patch #15 includes a new test suite

[0] https://www.youtube.com/watch?v=xlReECfi-uo
====================

Link: https://patch.msgid.link/cover.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:18:49 -07:00
Petr Machata
e3180379e2 selftests: forwarding: Add a test for verifying VXLAN MC underlay
Add tests for MC-routing underlay VXLAN traffic.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/eecd2c0fefc754182e74be8e8e65751bf5749c21.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:18:46 -07:00
Petr Machata
237f84a6d2 selftests: forwarding: adf_mcd_start(): Allow configuring custom interfaces
Tests may wish to add other interfaces to listen on. Notably, locally
generated traffic uses dummy interfaces. The multicast daemon needs to know
about these so that it allows forming rules that involve these interfaces,
and so that net.ipv4.conf.X.mc_forwarding is set for the interfaces.

To that end, allow passing in a list of interfaces to configure in addition
to all the physical ones.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/2e8d83297985933be4850f2b9f296b3c27110388.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:18:46 -07:00