Commit Graph

1324917 Commits

Author SHA1 Message Date
Wei Fang
69797ff888 net: enetc: add LSO support for i.MX95 ENETC PF
ENETC rev 4.1 supports large send offload (LSO), segmenting large TCP
and UDP transmit units into multiple Ethernet frames. To support LSO,
software needs to fill some auxiliary information in Tx BD, such as LSO
header length, frame length, LSO maximum segment size, etc.

At 1Gbps link rate, TCP segmentation was tested using iperf3, and the
CPU performance before and after applying the patch was compared through
the top command. It can be seen that LSO saves a significant amount of
CPU cycles compared to software TSO.

Before applying the patch:
%Cpu(s):  0.1 us,  4.1 sy,  0.0 ni, 85.7 id,  0.0 wa,  0.5 hi,  9.7 si

After applying the patch:
%Cpu(s):  0.1 us,  2.3 sy,  0.0 ni, 94.5 id,  0.0 wa,  0.4 hi,  2.6 si
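
The saving can be quantified from the two top summary lines above. The
snippet below is only an illustrative calculation, not part of the
patch; parse_top_cpu() is a hypothetical helper name, and "busy" is
assumed here to mean everything that is not idle time.

```python
def parse_top_cpu(line):
    """Parse a top(1) '%Cpu(s):' summary line into {field: percent}."""
    _, _, values = line.partition(":")
    fields = {}
    for part in values.split(","):
        percent, name = part.split()
        fields[name] = float(percent)
    return fields


before = parse_top_cpu(
    "%Cpu(s):  0.1 us,  4.1 sy,  0.0 ni, 85.7 id,  0.0 wa,  0.5 hi,  9.7 si")
after = parse_top_cpu(
    "%Cpu(s):  0.1 us,  2.3 sy,  0.0 ni, 94.5 id,  0.0 wa,  0.4 hi,  2.6 si")

# Non-idle CPU share, in percent, before and after the patch.
busy_before = round(100.0 - before["id"], 1)           # 14.3
busy_after = round(100.0 - after["id"], 1)             # 5.5
# Reduction in softirq and system time, in percentage points.
softirq_saved = round(before["si"] - after["si"], 1)   # 7.1
system_saved = round(before["sy"] - after["sy"], 1)    # 1.8
```

Most of the difference comes from softirq time, which is where the
software segmentation work would be expected to show up.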

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Link: https://patch.msgid.link/20241219054755.1615626-4-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-23 09:54:33 -08:00
Wei Fang
93c5d5a0dd net: enetc: update max chained Tx BD number for i.MX95 ENETC
The max number of chained Tx BDs of the latest ENETC (i.MX95 ENETC,
rev 4.1) has been increased to 63. However, since MAX_SKB_FRAGS ranges
from 17 to 45, it is better to set ENETC4_MAX_SKB_FRAGS to MAX_SKB_FRAGS
for i.MX95 ENETC and later revisions.

In addition, add max_frags to struct enetc_drvdata to indicate the max
number of chained BDs supported by the device, because this number
differs between LS1028A and i.MX95 ENETC.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241219054755.1615626-3-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-23 09:54:33 -08:00
Wei Fang
d9a093d2d1 net: enetc: add Tx checksum offload for i.MX95 ENETC
In addition to supporting Rx checksum offload, i.MX95 ENETC also supports
Tx checksum offload. The transmit checksum offload is implemented through
the Tx BD. To support Tx checksum offload, software needs to fill some
auxiliary information in Tx BD, such as IP version, IP header offset and
size, whether L4 is UDP or TCP, etc.

As with Rx checksum offload, the Tx checksum offload capability isn't
defined in a register, so a tx_csum bit is added to struct enetc_drvdata
to indicate whether the device supports Tx checksum offload.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Link: https://patch.msgid.link/20241219054755.1615626-2-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-23 09:54:33 -08:00
Stefano Brivio
a502ea6fa9 udp: Deal with race between UDP socket address change and rehash
If a UDP socket changes its local address while it's receiving
datagrams, as a result of connect(), there is a period during which
a lookup operation might fail to find it, after the address is changed
but before the secondary hash (port and address) and the four-tuple
hash (local and remote ports and addresses) are updated.

Secondary hash chains were introduced by commit 30fff9231f ("udp:
bind() optimisation") and, as a result, a rehash operation became
needed to make a bound socket reachable again after a connect().

This operation was introduced by commit 719f835853 ("udp: add
rehash on connect()") which isn't however a complete fix: the
socket will be found once the rehashing completes, but not while
it's pending.

This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a
client sending datagrams to it. After the server receives the first
datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to
the address of the sender, in order to set up a directed flow.

Now, if the client, running on a different CPU thread, happens to
send a (subsequent) datagram while the server's socket changes its
address, but is not rehashed yet, this will result in a failed
lookup and a port unreachable error delivered to the client, as
apparent from the following reproducer:

  LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4))
  dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in

  while :; do
  	taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
  	sleep 0.1 || sleep 1
  	taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null
  	wait
  done

where the client will eventually get ECONNREFUSED on a write()
(typically the second or third one of a given iteration):

  2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused

This issue was first observed as a rare failure in Podman's tests
checking UDP functionality while using pasta(1) to connect the
container's network namespace, which leads us to a reproducer with
the lookup error resulting in an ICMP packet on a tap device:

  LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"

  while :; do
  	./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
  	sleep 0.2 || sleep 1
  	socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null
  	wait
  	cmp tmp.in tmp.out
  done

Once this fails:

  tmp.in tmp.out differ: char 8193, line 29

we can finally have a look at what's going on:

  $ tshark -r pasta.pcap
      1   0.000000           :: → ff02::16     ICMPv6 110 Multicast Listener Report Message v2
      2   0.168690 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      3   0.168767 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      4   0.168806 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      5   0.168827 c6:47:05:8d:dc:04 → Broadcast    ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
      6   0.168851 9a:55:9a:55:9a:55 → c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55
      7   0.168875 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      8   0.168896 88.198.0.164 → 88.198.0.161 ICMP 590 Destination unreachable (Port unreachable)
      9   0.168926 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
     10   0.168959 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
     11   0.168989 88.198.0.161 → 88.198.0.164 UDP 4138 60260 → 1337 Len=4096
     12   0.169010 88.198.0.161 → 88.198.0.164 UDP 42 60260 → 1337 Len=0

On the third datagram received, the network namespace of the container
initiates an ARP lookup to deliver the ICMP message.

In another variant of this reproducer, starting the client with:

  strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log &

and connecting to the socat server using a loopback address:

  socat OPEN:tmp.in UDP4:localhost:1337,shut-null

we can more clearly observe a sendmmsg() call failing after the
first datagram is delivered:

  [pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0
  [...]
  [pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable)
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1
  [...]
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused)

and, somewhat confusingly, after a connect() on the same socket
succeeded.

Until commit 4cdeeee925 ("net: udp: prefer listeners bound to an
address"), the race between receive address change and lookup didn't
actually cause visible issues, because, once the lookup based on the
secondary hash chain failed, we would still attempt a lookup based on
the primary hash (destination port only), and find the socket with the
outdated secondary hash.

That change, however, dropped port-only lookups altogether as a side
effect, making the race visible.

To fix this, while avoiding the need to make address changes and
rehash atomic against lookups, reintroduce primary hash lookups as
fallback, if lookups based on four-tuple and secondary hashes fail.

To this end, introduce a simplified lookup implementation, which
doesn't take care of SO_REUSEPORT groups: if we have one, there are
multiple sockets that would match the four-tuple or secondary hash,
meaning that we can't run into this race at all.
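
The effect of the fallback can be illustrated with a small user-space
model. This is only a sketch under simplifying assumptions, not the
kernel implementation: the Sock and Tables classes and all function
names are hypothetical, the four-tuple hash is left out, and plain
dictionaries stand in for the hash tables.

```python
from dataclasses import dataclass

ANY = "0.0.0.0"


@dataclass(eq=False)
class Sock:
    local_addr: str
    port: int


class Tables:
    def __init__(self):
        self.by_port = {}        # primary hash: destination port only
        self.by_addr_port = {}   # secondary hash: (local address, port)

    def hash_sock(self, sk):
        self.by_port.setdefault(sk.port, []).append(sk)
        self.by_addr_port.setdefault((sk.local_addr, sk.port), []).append(sk)

    def rehash(self, sk, old_addr):
        # Move the socket to the chain matching its current address.
        self.by_addr_port[(old_addr, sk.port)].remove(sk)
        self.by_addr_port.setdefault((sk.local_addr, sk.port), []).append(sk)


def lookup_secondary(tables, daddr, port):
    # Try the exact address first, then the wildcard chain. A socket only
    # matches if its *current* address agrees with the chain it sits in.
    for key_addr in (daddr, ANY):
        for sk in tables.by_addr_port.get((key_addr, port), []):
            if sk.local_addr == key_addr:
                return sk
    return None


def lookup_with_fallback(tables, daddr, port):
    sk = lookup_secondary(tables, daddr, port)
    if sk is not None:
        return sk
    # Fallback: walk the port-only chain, which never goes stale when
    # the local address changes.
    for sk in tables.by_port.get(port, []):
        if sk.local_addr in (daddr, ANY):
            return sk
    return None
```

In this model, a socket hashed as (0.0.0.0, 1337) whose address is then
changed to 127.0.0.1 is missed by lookup_secondary() until rehash() is
called, while lookup_with_fallback() still finds it through the
port-only chain.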

v2:
  - instead of synchronising lookup operations against address change
    plus rehash, reintroduce a simplified version of the original
    primary hash lookup as fallback

v1:
  - fix build with CONFIG_IPV6=n: add ifdef around sk_v6_rcv_saddr
    usage (Kuniyuki Iwashima)
  - directly use sk_rcv_saddr for IPv4 receive addresses instead of
    fetching inet_rcv_saddr (Kuniyuki Iwashima)
  - move inet_update_saddr() to inet_hashtables.h and use that
    to set IPv4/IPv6 addresses as suitable (Kuniyuki Iwashima)
  - rebase onto net-next, update commit message accordingly

Reported-by: Ed Santiago <santiago@redhat.com>
Link: https://github.com/containers/podman/issues/24147
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: 30fff9231f ("udp: bind() optimisation")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-12-23 11:39:55 +00:00
Jakub Kicinski
ae418e95dd Merge branch 'ipv4-consolidate-route-lookups-from-ipv4-sockets'
Guillaume Nault says:

====================
ipv4: Consolidate route lookups from IPv4 sockets.

Create inet_sk_init_flowi4() so that the different IPv4 code paths that
need to do a route lookup based on an IPv4 socket don't need to
reimplement that logic.
====================

Link: https://patch.msgid.link/cover.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 13:50:15 -08:00
Guillaume Nault
c63e9f3b89 l2tp: Use inet_sk_init_flowi4() in l2tp_ip_sendmsg().
Use inet_sk_init_flowi4() to automatically initialise the flowi4
structure in l2tp_ip_sendmsg() instead of passing parameters manually
to ip_route_output_ports().

Override ->daddr with the value passed in the msghdr structure if
provided.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: James Chapman <jchapman@katalix.com>
Link: https://patch.msgid.link/2ff22a3560c5050228928456662b80b9c84a8fe4.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 13:50:09 -08:00
Guillaume Nault
148721f8e0 ipv4: Use inet_sk_init_flowi4() in __ip_queue_xmit().
Use inet_sk_init_flowi4() to automatically initialise the flowi4
structure in __ip_queue_xmit() instead of passing parameters manually
to ip_route_output_ports().

Override ->flowi4_tos with the value passed as parameter since that's
required by SCTP.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/37e64ffbd9adac187b14aa9097b095f5c86e85be.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 13:50:09 -08:00
Guillaume Nault
42e5ffc385 ipv4: Use inet_sk_init_flowi4() in inet_csk_rebuild_route().
Use inet_sk_init_flowi4() to automatically initialise the flowi4
structure in inet_csk_rebuild_route() instead of passing parameters
manually to ip_route_output_ports().

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/b270931636effa1095508e0f0a3e8c3a0e6d357f.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 13:50:09 -08:00
Guillaume Nault
5be1323b50 ipv4: Use inet_sk_init_flowi4() in ip4_datagram_release_cb().
Use inet_sk_init_flowi4() to automatically initialise the flowi4
structure in ip4_datagram_release_cb() instead of passing parameters
manually to ip_route_output_ports().

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/9c326b8d9e919478f7952b21473d31da07eba2dd.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 13:50:09 -08:00
Guillaume Nault
1dbdce30f0 ipv4: Define inet_sk_init_flowi4() and use it in inet_sk_rebuild_header().
IPv4 code commonly has to initialise a flowi4 structure from an IPv4
socket. This requires looking at potential IPv4 options to set the
proper destination address, call flowi4_init_output() with the correct
set of parameters and run the sk_classify_flow security hook.

Instead of reimplementing these operations in different parts of the
stack, let's define inet_sk_init_flowi4() which does all these
operations.

The first user is inet_sk_rebuild_header(), where inet_sk_init_flowi4()
replaces ip_route_output_ports(). Unlike ip_route_output_ports(), which
sets the flowi4 structure and performs the route lookup in one go,
inet_sk_init_flowi4() only initialises the flow. The route lookup is
then done by ip_route_output_flow(). Decoupling flow initialisation
from route lookup makes this new interface applicable more broadly as
it will allow some users to overwrite specific struct flowi4 members
before the route lookup.
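
The benefit of the split can be sketched in a user-space model. All
names below (Flow, Sock, init_flow(), route_lookup()) are hypothetical
stand-ins chosen for illustration; they are not the kernel structures
or functions, and the routing table is reduced to a dictionary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Flow:
    daddr: str
    saddr: str
    tos: int
    oif: int


@dataclass
class Sock:
    daddr: str
    saddr: str
    tos: int
    bound_dev_if: int
    # First hop taken from a source-route IP option, if one is set.
    srr_first_hop: Optional[str] = None


def init_flow(sk):
    """Step 1: build the flow key from the socket, nothing else."""
    daddr = sk.srr_first_hop or sk.daddr
    return Flow(daddr=daddr, saddr=sk.saddr, tos=sk.tos, oif=sk.bound_dev_if)


def route_lookup(flow, table):
    """Step 2: resolve the flow against a (toy) routing table."""
    return table.get(flow.daddr, table["default"])


table = {"192.0.2.10": "eth0", "default": "eth1"}
sk = Sock(daddr="192.0.2.10", saddr="192.0.2.1", tos=0, bound_dev_if=2)

flow = init_flow(sk)
flow.tos = 0x10          # caller-specific override between the two steps
route = route_lookup(flow, table)
```

Because the flow is a plain value between the two steps, a caller can
adjust individual members without reimplementing the initialisation.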

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/fd416275262b1f518d5abfcef740ce4f4a1a6522.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 13:50:09 -08:00
Tristram Ha
6ed3472173 net: dsa: microchip: Do not execute PTP driver code for unsupported switches
The PTP driver code only works for certain KSZ switches like KSZ9477,
KSZ9567, LAN937X and their variants.  This code is enabled by the kernel
configuration option CONFIG_NET_DSA_MICROCHIP_KSZ_PTP.  As the DSA driver
is common to all KSZ switches, this PTP code is not appropriate for the
unsupported ones.  A ptp_capable indication is added to the chip data
structure to signal whether to execute that code.

Signed-off-by: Tristram Ha <tristram.ha@microchip.com>
Link: https://patch.msgid.link/20241218020240.70601-1-Tristram.Ha@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 13:28:42 -08:00
Thomas Weißschuh
3272040790 qlcnic: use const 'struct bin_attribute' callbacks
The sysfs core now provides callback variants that explicitly take a
const pointer. Use them so the non-const variants can be removed.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://patch.msgid.link/20241219-sysfs-const-bin_attr-net-v2-1-93bdaece3c90@weissschuh.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 13:24:50 -08:00
Jakub Kicinski
c1bc6d2178 Merge branch 'bridge-handle-changes-in-vlan_flag_bridge_binding'
Petr Machata says:

====================
bridge: Handle changes in VLAN_FLAG_BRIDGE_BINDING

When bridge binding is enabled on a VLAN netdevice, its link state should
track bridge ports that are members of the corresponding VLAN. This works
for newly-added netdevices. However, toggling the option does not have the
effect of enabling or disabling the behavior as appropriate.

In this patchset, have bridge react to bridge_binding toggles on VLAN
uppers.

There has been another attempt at supporting this behavior in 2022 by
Sevinj Aghayeva [0]. A discussion ensued that informed how this new
patchset is constructed, namely that the logic is in the bridge as opposed
to the 8021q driver, and the bridge reacts to NETDEV_CHANGE events on the
8021q upper.

Patches #1 and #2 contain the implementation, patches #3 and #4 a
selftest.

[0] https://lore.kernel.org/netdev/cover.1660100506.git.sevinj.aghayeva@gmail.com/
====================

Link: https://patch.msgid.link/cover.1734540770.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 13:16:47 -08:00
Petr Machata
dca12e9ab7 selftests: net: Add a VLAN bridge binding selftest
Add a test that exercises bridge binding.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/baf7244fd1fe223a6d93e027584fa9f99dee982c.1734540770.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 13:14:17 -08:00
Petr Machata
976d248bd3 selftests: net: lib: Add a couple autodefer helpers
Alongside the helper ip_link_set_up(), one to set the link down will be
useful as well. Add a helper to determine the link state as well,
ip_link_is_up(), and use it to short-circuit any changes if the state is
already the desired one.

Furthermore, add a helper bridge_vlan_add().

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/856d9e01725fdba21b7f6716358f645b19131af2.1734540770.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 13:14:17 -08:00
Petr Machata
3abd45122c net: bridge: Handle changes in VLAN_FLAG_BRIDGE_BINDING
When bridge binding is enabled on a VLAN netdevice, its link state should
track bridge ports that are members of the corresponding VLAN. This works
for newly-added netdevices. However toggling the option does not have the
effect of enabling or disabling the behavior as appropriate.

In this patch, react to bridge_binding toggles on VLAN uppers.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/90a8ca8aea4d81378b29d75d9e562433e0d5c7ff.1734540770.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 13:14:17 -08:00
Petr Machata
f284424dc1 net: bridge: Extract a helper to handle bridge_binding toggles
Currently, the BROPT_VLAN_BRIDGE_BINDING bridge option is only toggled when
VLAN devices are added on top of a bridge or removed from it. Extract the
toggling of the option to a function so that it could be invoked by a
subsequent patch when the state of an upper VLAN device changes.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/a7455f6fe1dfa7b13126ed8a7fb33d3b611eecb8.1734540770.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 13:14:17 -08:00
Eric Dumazet
05dd04b218 inetpeer: avoid false sharing in inet_peer_xrlim_allow()
Under DoS, inet_peer_xrlim_allow() might be called millions
of times per second from different CPUs.

Make sure to write over peer->rate_tokens and peer->rate_last
only when really needed.

Note the inherent races of this function are still there,
we do not care about precise ICMP rate limiting.
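
The idea of writing only when needed can be sketched as a token bucket
that skips stores whose value would not change. This is a user-space
model, not the kernel function: the Peer class, the writes counter and
the burst_factor default of 6 are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    rate_tokens: int = 0
    rate_last: int = 0
    writes: int = 0  # counts stores to the shared fields, for illustration


def xrlim_allow(peer, now, timeout, burst_factor=6):
    """Return True if a message may be sent; store only on change."""
    tokens = old_tokens = peer.rate_tokens
    delta = now - peer.rate_last
    if delta:
        # Time has advanced: refill, capped at the burst limit.
        peer.rate_last = now
        peer.writes += 1
        tokens = min(tokens + delta, burst_factor * timeout)
    allowed = tokens >= timeout
    if allowed:
        tokens -= timeout
    if tokens != old_tokens:
        peer.rate_tokens = tokens
        peer.writes += 1
    return allowed
```

When many calls arrive within the same tick and the bucket is empty,
every call is refused without a single store, so the fields are only
read and never dirtied by the callers.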

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20241219150330.3159027-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 13:04:40 -08:00
Jakub Kicinski
710aebdad6 Merge branch 'hisilicon-hns-deadcoding'
Dr. David Alan Gilbert says:

====================
hisilicon hns deadcoding

From: "Dr. David Alan Gilbert" <linux@treblig.org>

A small set of deadcoding for functions that are not
called, and a couple of function pointers that they
called.

Build tested only; I don't have the hardware.
====================

Link: https://patch.msgid.link/20241218163341.40297-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 12:56:19 -08:00
Dr. David Alan Gilbert
8973ce1893 net: hisilicon: hns: Remove unused enums
The enums dsaf_roce_port_mode, dsaf_roce_port_num and dsaf_roce_qos_sl
are unused after the removal of the reset code.

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20241218163341.40297-5-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 12:56:17 -08:00
Dr. David Alan Gilbert
0198b459f5 net: hisilicon: hns: Remove reset helpers
With hns_dsaf_roce_reset() removed in a previous patch, the two
helper member pointers, 'hns_dsaf_roce_srst',  and 'hns_dsaf_srst_chns'
are now unread.

Remove them, and the helper functions that they were initialised
to, that is hns_dsaf_srst_chns(), hns_dsaf_srst_chns_acpi(),
hns_dsaf_roce_srst() and hns_dsaf_roce_srst_acpi().

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20241218163341.40297-4-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 12:56:17 -08:00
Dr. David Alan Gilbert
0265e9edf2 net: hisilicon: hns: Remove unused hns_rcb_start
hns_rcb_start() has been unused since 2016's
commit 454784d85d ("net: hns: delete redundancy ring enable operations")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20241218163341.40297-3-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 12:56:17 -08:00
Dr. David Alan Gilbert
a574fe14ed net: hisilicon: hns: Remove unused hns_dsaf_roce_reset
hns_dsaf_roce_reset() has been unused since 2021's
commit 38d2208824 ("RDMA/hns: Remove support for HIP06")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20241218163341.40297-2-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 12:56:16 -08:00
Jakub Kicinski
b73e56f162 Merge branch 'xdp-a-fistful-of-generic-changes-pt-iii'
Alexander Lobakin says:

====================
xdp: a fistful of generic changes pt. III

XDP for idpf is currently 5.(6) chapters:
* convert Rx to libeth;
* convert Tx and stats to libeth;
* generic XDP and XSk code changes;
* generic XDP and XSk code additions pt. 1;
* generic XDP and XSk code additions pt. 2 (you are here);
* actual XDP for idpf via new libeth_xdp;
* XSk for idpf (via ^).

Part III.3 does the following:
* adds generic functions to build skbs from xdp_buffs (regular and
  XSk) and attach frags to xdp_buffs (regular and XSk);
* adds helper to optimize XSk xmit in drivers.

Everything is prereq for libeth_xdp, but will be useful standalone
as well: less code in drivers, faster XSk XDP_PASS, smaller object
code.
====================

Link: https://patch.msgid.link/20241218174435.1445282-1-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 19:51:18 -08:00
Alexander Lobakin
560d958c6c xsk: add generic XSk &xdp_buff -> skb conversion
Same as with converting &xdp_buff to skb on Rx, the code which allocates
a new skb and copies the XSk frame there is identical across the
drivers, so make it generic. This includes copying all the frags if they
are present in the original buff.
System percpu page_pools greatly improve XDP_PASS performance on XSk:
instead of page_alloc() + page_free(), the net core recycles the same
pages, so the only overhead left is memcpy()s. When the Page Pool is
not compiled in, the whole function is a return-NULL (but it always
gets selected when eBPF is enabled).
Note that the passed buff gets freed if the conversion is done w/o any
error, assuming you don't need this buffer after you convert it to an
skb.

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241218174435.1445282-6-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 19:51:14 -08:00
Alexander Lobakin
51205f841a xsk: make xsk_buff_add_frag() really add the frag via __xdp_buff_add_frag()
Currently, xsk_buff_add_frag() only adds the frag to pool's linked list,
not doing anything with the &xdp_buff. The drivers do that manually and
the logic is the same.
Make it really add an skb frag, just like xdp_buff_add_frag() does,
freeing frags on error if needed. This allows removing the repeated
code from i40e and ice and avoids adding the same code again and again.

Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241218174435.1445282-5-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 19:51:14 -08:00
Alexander Lobakin
539c1fba1a xdp: add generic xdp_build_skb_from_buff()
The code which builds an skb from an &xdp_buff keeps multiplying itself
around the drivers with almost no changes. Let's try to stop that by
adding a generic function.
Unlike __xdp_build_skb_from_frame(), always allocate an skbuff head
using napi_build_skb() and make use of the available xdp_rxq pointer to
assign the Rx queue index. In case of PP-backed buffer, mark the skb to
be recycled, as every PP user's been switched to recycle skbs.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241218174435.1445282-4-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 19:51:14 -08:00
Alexander Lobakin
68ddc8ae17 xdp: add generic xdp_buff_add_frag()
The code piece which would attach a frag to &xdp_buff is almost
identical across the drivers supporting XDP multi-buffer on Rx.
Make it a generic elegant "oneliner".
Also, I see lots of drivers calculating frags_truesize as
`xdp->frame_sz * nr_frags`. I can't say this is fully correct, since
frags might be backed by chunks of different sizes, especially with
stuff like the header split. Even page_pool_alloc() can give you two
different truesizes on two subsequent requests to allocate the same
buffer size. Add a field to &skb_shared_info (unionized as there's no
free slot currently on x86_64) to track the "true" truesize. It can
be used later when updating the skb.
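
The difference between the two ways of computing truesize can be shown
with a short calculation. The sizes below are made-up example values
and both function names are hypothetical; the sketch only contrasts the
multiplication with a per-frag running sum.

```python
def estimated_truesize(frame_sz, nr_frags):
    """The shortcut: assume every frag occupies a full frame_sz chunk."""
    return frame_sz * nr_frags


def tracked_truesize(frag_truesizes):
    """Sum the real backing size recorded as each frag is attached."""
    return sum(frag_truesizes)


frame_sz = 4096
# Example frags backed by chunks of different sizes, as may happen with
# header split or with a page pool handing out fragments of pages.
frag_truesizes = [4096, 2048, 256]

estimate = estimated_truesize(frame_sz, len(frag_truesizes))  # 12288
tracked = tracked_truesize(frag_truesizes)                    # 6400
```

With these example values the multiplication overstates the memory
actually held by the frags by almost a factor of two.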

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241218174435.1445282-3-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 19:51:13 -08:00
Alexander Lobakin
a19d0236f4 page_pool: add page_pool_dev_alloc_netmem()
Similarly to other _dev shorthands, add one for page_pool_alloc_netmem()
to allocate a netmem using the default Rx GFP flags (ATOMIC | NOWARN) to
make the page -> netmem transition of drivers easier.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241218174435.1445282-2-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 19:51:13 -08:00
Guillaume Nault
29b540795b gre: Drop ip_route_output_gre().
We already have enough variants of ip_route_output*() functions. We
don't need a GRE specific one in the generic route.h header file.

Furthermore, ip_route_output_gre() is only used once, in ipgre_open(),
where it can be easily replaced by a simple call to
ip_route_output_key().

While there, and for clarity, explicitly set .flowi4_scope to
RT_SCOPE_UNIVERSE instead of relying on the implicit zero
initialisation.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/ab7cba47b8558cd4bfe2dc843c38b622a95ee48e.1734527729.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 19:24:47 -08:00
Florent Revest
5155cbcdbf af_unix: Add a prompt to CONFIG_AF_UNIX_OOB
This makes it possible to disable the MSG_OOB support in .config.

Signed-off-by: Florent Revest <revest@chromium.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241218143334.1507465-1-revest@chromium.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 19:16:15 -08:00
Jamal Hadi Salim
6724bc65e5 selftests: net: remove redundant ncdevmem print
Remove extraneous fprintf

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20241218140018.15607-1-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 19:15:01 -08:00
Andy Moreton
455e135c30 sfc: remove efx_writed_page_locked
From: Andy Moreton <andy.moreton@amd.com>

efx_writed_page_locked is a workaround for Siena hardware that is not
needed on later adapters, and has no callers. Remove it.

Signed-off-by: Andy Moreton <andy.moreton@amd.com>
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/20241218135930.2350358-1-edward.cree@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 19:14:39 -08:00
Furong Xu
5c98e89d96 net: stmmac: Drop useless code related to ethtool rx-copybreak
After commit 2af6106ae9 ("net: stmmac: Introducing support for Page
Pool"), the driver always copies frames to get better performance and
zero-copy for RX frames is gone. The code related to rx-copybreak has
therefore become useless, and users of ethtool may be confused by the
unhandled rx-copybreak parameter.

This patch mostly reverts
commit 22ad383815 ("stmmac: do not perform zero-copy for rx frames")

Signed-off-by: Furong Xu <0x1207@gmail.com>
Link: https://patch.msgid.link/20241218083407.390509-1-0x1207@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 19:14:01 -08:00
Mina Almasry
f6038d913b net: Document netmem driver support
Document expectations from drivers looking to add support for device
memory TCP or other netmem-based features.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://patch.msgid.link/20241217201206.2360389-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 19:12:36 -08:00
Lorenzo Bianconi
0c7469ee71 net: airoha: Fix error path in airoha_probe()
Do not run napi_disable() if airoha_hw_init() fails since Tx/Rx napi
has not been started yet. In order to fix the issue, introduce
airoha_qdma_stop_napi routine and remove napi_disable in
airoha_hw_cleanup().

Fixes: 23020f0493 ("net: airoha: Introduce ethernet support for EN7581 SoC")
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20241216-airoha_probe-error-path-fix-v2-1-6b10e04e9a5c@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 19:08:28 -08:00
Jakub Kicinski
5fde7edade Merge branch 'net-add-and-use-phy_disable_eee'
Heiner Kallweit says:

====================
net: add and use phy_disable_eee

If a MAC driver doesn't support EEE, then the PHY shouldn't advertise it.
Add phy_disable_eee() for this purpose, and use it in cpsw driver.
====================

Link: https://patch.msgid.link/5139374e-7151-4d0d-8ba9-9ec3d9b52f67@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 19:07:55 -08:00
Heiner Kallweit
c9f5a5dabb net: ethernet: ti: cpsw: disable PHY EEE advertisement
It seems the cpsw MAC doesn't support EEE. See e.g. the commit message of
ce2899428e ("ARM: dts: am335x-baltos: disable EEE for Atheros 8035 PHY").
There are cases where this causes issues if the PHYs on both sides have
negotiated EEE. As a workaround, EEE modes of the PHY are marked broken
in DT, effectively disabling EEE advertisement.
Improve this by using the new function phy_disable_eee() in the MAC
driver. This properly disables EEE advertisement, and allows removing
the eee-broken-xxx properties from DT. As EEE is disabled anyway, we can
also remove the set_eee ethtool op.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/d08a798e-8565-422c-b2ed-121794db077f@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 19:07:48 -08:00
Heiner Kallweit
b55498ff14 net: phy: add phy_disable_eee
If a MAC driver doesn't support EEE, then the PHY shouldn't advertise it.
Add phy_disable_eee() for this purpose.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/fd51738c-dcd6-4d61-b8c5-faa6ac0f1026@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 19:07:48 -08:00
Jakub Kicinski
e8f3323805 Merge tag 'wireless-next-2024-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:

====================
wireless-next patches for v6.14

Multi-Link Operation implementation continues, both in stack and in
drivers. Otherwise it has been relatively quiet.

Major changes:

cfg80211/mac80211
 - define wiphy guard
 - get TX power per link
 - EHT 320 MHz channel support for mesh

ath11k
 - QCA6698AQ support

ath9k
 - RX inactivity detection

rtl8xxxu
 - add more USB device IDs

rtw88
 - add more USB device IDs
 - enable USB RX aggregation and USB 3 to improve performance

rtw89
 - PowerSave flow for Multi-Link Operation

* tag 'wireless-next-2024-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (121 commits)
  wifi: wlcore: sysfs: constify 'struct bin_attribute'
  wifi: brcmfmac: clarify unmodifiable headroom log message
  wifi: brcmfmac: add missing header include for brcmf_dbg
  wifi: brcmsmac: add gain range check to wlc_phy_iqcal_gainparams_nphy()
  wifi: qtnfmac: fix spelling error in core.h
  wifi: rtw89: phy: add dummy C2H event handler for report of TAS power
  wifi: rtw89: 8851b: rfk: remove unnecessary assignment of return value of _dpk_dgain_read()
  wifi: rtw89: 8852c: rfk: refine target channel calculation in _rx_dck_channel_calc()
  wifi: rtlwifi: pci: wait for firmware loading before releasing memory
  wifi: rtlwifi: fix memory leaks and invalid access at probe error path
  wifi: rtlwifi: destroy workqueue at rtl_deinit_core
  wifi: rtlwifi: remove unused check_buddy_priv
  wifi: rtw89: 8922a: update format of RFK pre-notify H2C command v2
  wifi: rtw89: regd: update regulatory map to R68-R51
  wifi: rtw89: 8852c: disable ER SU when 4x HE-LTF and 0.8 GI capability differ
  wifi: rtw89: disable firmware training HE GI and LTF
  wifi: rtw89: ps: update data for firmware and settings for hardware before/after PS
  wifi: rtw89: ps: refactor channel info to firmware before entering PS
  wifi: rtw89: ps: refactor PS flow to support MLO
  wifi: mwifiex: decrease timeout waiting for host sleep from 10s to 5s
  ...
====================

Link: https://patch.msgid.link/20241219185709.774EDC4CECE@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 18:54:07 -08:00
Jakub Kicinski
75e2c86c7b net: netlink: catch attempts to send empty messages
syzbot can figure out a way to redirect a netlink message to a tap.
Sending empty skbs to devices is not valid and we end up hitting
a skb_assert_len() in __dev_queue_xmit().

To make these mistakes easier to catch, assert the skb size directly
in the netlink core.
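
The check can be pictured with the minimal userspace sketch below. It is
an assumption-laden model, not the patch itself: struct toy_skb and
toy_netlink_send() are hypothetical, and the -EINVAL return value is chosen
for illustration only. The point is that the empty buffer is rejected at
the send entry point instead of being detected later in the device
transmit path.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket buffer; only the length matters here. */
struct toy_skb {
	size_t len;
};

/*
 * Model of a send entry point that validates the buffer up front.
 * Returns 0 on success or a negative errno-style value (assumed) on rejection.
 */
static int toy_netlink_send(const struct toy_skb *skb)
{
	if (skb->len == 0)
		return -EINVAL; /* empty messages are never valid to transmit */

	/* A real implementation would hand the buffer to the next layer here. */
	return 0;
}
```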

Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241218024400.824355-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 18:06:28 -08:00
Tristram Ha
3fc87cb94f net: dsa: microchip: Add suspend/resume support to KSZ DSA driver
The KSZ DSA driver starts a timer to read MIB counters periodically to
avoid counter overrun.  During system suspend this fails with an error
about being unable to write to a register, because the SPI subsystem
returns an error while it is suspended.  This implementation stops the
timer when the system goes into suspend and restarts it on resume.
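
The stop-on-suspend, restart-on-resume pattern can be sketched in plain C
as follows. This is a simplified model under stated assumptions, not the
driver code: struct toy_switch and the toy_* functions are hypothetical,
and the periodic timer is reduced to a flag that a tick function checks
before touching the (simulated) bus.

```c
#include <stdbool.h>

/* Hypothetical switch state: a periodic MIB poller and a read counter. */
struct toy_switch {
	bool mib_timer_running;  /* true while periodic polling is armed */
	unsigned int mib_reads;  /* number of simulated MIB register reads */
};

/* Called on every timer expiry; does nothing while polling is stopped. */
static void toy_mib_tick(struct toy_switch *sw)
{
	if (!sw->mib_timer_running)
		return;
	sw->mib_reads++; /* stands in for a bus access to the switch */
}

/* Suspend hook: stop polling before the bus becomes unavailable. */
static void toy_switch_suspend(struct toy_switch *sw)
{
	sw->mib_timer_running = false;
}

/* Resume hook: re-arm polling once the bus is usable again. */
static void toy_switch_resume(struct toy_switch *sw)
{
	sw->mib_timer_running = true;
}
```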

Signed-off-by: Tristram Ha <tristram.ha@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241218020311.70628-1-Tristram.Ha@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 18:05:57 -08:00
Jakub Kicinski
d5872aa262 Merge branch 'bnxt_en-driver-update'
Michael Chan says:

====================
bnxt_en: Driver update

The first patch configures context memory for RoCE resources based
on FW limits.  The next 4 patches restrict certain ethtool
operations when they are not supported.  The last patch adds Pavan
Chebbi as co-maintainer of the driver.

v1: https://lore.kernel.org/20241215205943.2341612-1-michael.chan@broadcom.com
====================

Link: https://patch.msgid.link/20241217182620.2454075-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 17:30:04 -08:00
Michael Chan
73df38b097 MAINTAINERS: bnxt_en: Add Pavan Chebbi as co-maintainer
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241217182620.2454075-7-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 17:30:01 -08:00
Michael Chan
bf2afe0f14 bnxt_en: Skip reading PXP registers during ethtool -d if unsupported
Newer firmware does not allow reading the PXP registers during
ethtool -d, so skip the firmware call in that case.  Userspace
(bnxt.c) always expects the register block to be populated so
zeroes will be returned instead.

Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241217182620.2454075-6-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 17:30:00 -08:00
Michael Chan
b45a850585 bnxt_en: Skip MAC loopback selftest if it is unsupported by FW
Call the new HWRM_PORT_MAC_QCAPS to check if MAC loopback is
supported.  Skip the MAC loopback ethtool self test if it is
not supported.

Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20241217182620.2454075-5-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 17:30:00 -08:00
Michael Chan
36d1e70a90 bnxt_en: Skip PHY loopback ethtool selftest if unsupported by FW
Skip the PHY loopback selftest if firmware advertises that it is
unsupported in the HWRM_PORT_PHY_QCAPS call.  Report a PHY loopback test
result of 0 only if the test has run and passed.  Do the same for
external loopback to be consistent.

Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241217182620.2454075-4-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 17:30:00 -08:00
Michael Chan
fac5472fc8 bnxt_en: Do not allow ethtool -m on an untrusted VF
Block all ethtool module operations on an untrusted VF.  The firmware
won't allow it and will return an error.

Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241217182620.2454075-3-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 17:30:00 -08:00
Hongguang Gao
b1b66ae094 bnxt_en: Use FW defined resource limits for RoCE
If FW supports setting resource limits for RoCE, then just use the
FW limits instead of using some fixed values in the driver.  These
limits will be used to allocate context memory for QP, SRQ, AH, and
MR resources for RoCE.
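
The selection logic described above amounts to a simple fallback, sketched
here in userspace C. Everything in it is illustrative: struct
toy_roce_limits, toy_pick_roce_limits() and the default numbers are
hypothetical placeholders, not values or names taken from the driver.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-resource limits used to size RoCE context memory. */
struct toy_roce_limits {
	uint32_t max_qp;
	uint32_t max_srq;
	uint32_t max_ah;
	uint32_t max_mr;
};

/* Placeholder fixed values; the real driver defaults are not shown here. */
static const struct toy_roce_limits toy_driver_defaults = {
	.max_qp = 1024, .max_srq = 256, .max_ah = 512, .max_mr = 2048,
};

/* Prefer the firmware-provided limits when the firmware supports them. */
static struct toy_roce_limits
toy_pick_roce_limits(bool fw_sets_limits, const struct toy_roce_limits *fw)
{
	if (fw_sets_limits && fw != NULL)
		return *fw;
	return toy_driver_defaults;
}
```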

Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241217182620.2454075-2-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 17:30:00 -08:00
Jakub Kicinski
07e5c4eb94 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc4).

No conflicts.

Adjacent changes:

drivers/net/ethernet/renesas/rswitch.h
  32fd46f5b6 ("net: renesas: rswitch: remove speed from gwca structure")
  922b4b955a ("net: renesas: rswitch: rework ts tags management")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 11:35:07 -08:00