The default IPv6 multipath hash policy takes the flow label into account
when calculating a multipath hash and previous patches added a flow
label selector to IPv6 FIB rules.
Allow user space to specify a flow label in route get requests by adding
a new netlink attribute and using its value to populate the "flowlabel"
field in the IPv6 flow info structure prior to a route lookup.
Deny the attribute in RTM_{NEW,DEL}ROUTE requests by checking for it in
rtm_to_fib6_config() and returning an error if present.
A subsequent patch will use this capability to test the new flow label
selector in IPv6 FIB rules.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Now that both IPv4 and IPv6 correctly handle the new flow label
attributes, enable user space to configure FIB rules that make use of
the flow label by changing the policy to stop rejecting them and
accepting 32 bit values in big-endian byte order.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Implement support for the new flow label selector which allows IPv6 FIB
rules to match on the flow label with a mask. Ensure that both flow
label attributes are specified (or none) and that the mask is valid.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
IPv4 FIB rules cannot match on flow label so reject requests that try to
add such rules. Do that in the IPv4 configure callback as the netlink
policy resides in the core and used by both IPv4 and IPv6.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add new FIB rule attributes which will allow user space to match on the
IPv6 flow label with a mask. Temporarily set the type of the attributes
to 'NLA_REJECT' while support is being added in the IPv6 code.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tony Nguyen says:
====================
ice: add support for devlink health events
Przemek Kitszel says:
Reports for two kinds of events are implemented, Malicious Driver
Detection (MDD) and Tx hang.
Patches 1, 2, 3: core improvements (checkpatch.pl, devlink extension)
Patch 4: rename current ice devlink/ files
Patches 5, 6, 7: ice devlink health infra + reporters
Mateusz did good job caring for this series, and hardening the code.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
ice: Add MDD logging via devlink health
ice: add Tx hang devlink health reporter
ice: rename devlink_port.[ch] to port.[ch]
devlink: add devlink_fmsg_dump_skb() function
devlink: add devlink_fmsg_put() macro
checkpatch: don't complain on _Generic() use
====================
Link: https://patch.msgid.link/20241217210835.3702003-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We noticed a high number of rx_discards_phy events on certain servers while
running `ethtool -S`. However, this critical counter is not currently
included in the standard /proc/net/dev statistics file, making it difficult
to monitor effectively—especially given the diversity of vendors across a
large fleet of servers.
Let's report it via the standard rx_dropped metric.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Gal Pressman <gal@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241210022706.6665-1-laoar.shao@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Soham Chakradeo says:
====================
selftests/net: packetdrill: import multiple tests
Import tests for the following features (folder names in brackets):
ECN (ecn) : RFC 3168
Close (close) : RFC 9293
TCP_INFO (tcp_info) : RFC 9293
Fast recovery (fast_recovery) : RFC 5681
Timestamping (timestamping) : RFC 1323
Nagle (nagle) : RFC 896
Selective Acknowledgments (sack) : RFC 2018
Recent Timestamp (ts_recent) : RFC 1323
Send file (sendfile)
Syscall bad arg (syscall_bad_arg)
Validate (validate)
Blocking (blocking)
Splice (splice)
End of record (eor)
Limited transmit (limited_transmit)
Procedure to import and test the packetdrill tests into upstream linux
is explained in the first patch of this series
These tests have many authors. We only import them here from
github.com/google/packetdrill. Thanks to the following authors fo their
contributions over the years to these tests: Neal Cardwell, Shuo Chen,
Yuchung Cheng, Jerry Chu, Eric Dumazet, Luke Hsiao, Priyaranjan Jha,
Chonggang Li, Tanner Love, John Sperbeck, Wei Wang and Maciej
Żenczykowski. For more info see the original github commits, such as
https://github.com/google/packetdrill/commit/8229c94928ac.
====================
Link: https://patch.msgid.link/20241217185203.297935-1-sohamch.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Same as initial tests, import verbatim from
github.com/google/packetdrill, aside from:
- update `source ./defaults.sh` path to adjust for flat dir
- add SPDX headers
- remove author statements if any
- drop blank lines at EOF
Same test process as previous tests. Both with and without debug mode.
Recording the steps once:
make mrproper
vng --build \
--config tools/testing/selftests/net/packetdrill/config \
--config kernel/configs/debug.config
vng -v --run . --user root --cpus 4 -- \
make -C tools/testing/selftests TARGETS=net/packetdrill run_tests
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Soham Chakradeo <sohamch@google.com>
Link: https://patch.msgid.link/20241217185203.297935-2-sohamch.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
wait_for_complete_timeout() expects a timeout in jiffies. With the
driver, some call sites converted QCA8K_ETHERNET_TIMEOUT to jiffies,
others did not. Make the code consistent by changes the #define to
include a call to msecs_to_jiffies, and remove all other calls to
msecs_to_jiffies.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: from Christian would be very welcome.
Signed-off-by: David S. Miller <davem@davemloft.net>
Jijie Shao says:
====================
Support some features for the HIBMCGE driver
In this patch series, The HIBMCGE driver implements some functions
such as dump register, unicast MAC address filtering, debugfs and reset.
====================
Link: https://patch.msgid.link/20241216040532.1566229-1-shaojijie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sometimes, if the port doesn't work, we can try to fix it by resetting it.
This patch supports reset triggered by ethtool or FLR of PCIe, For example:
ethtool --reset eth0 dedicated
echo 1 > /sys/bus/pci/devices/0000\:83\:00.1/reset
We hope that the reset can be performed only when the port is down,
and the port cannot be up during the reset.
Therefore, the entire reset process is protected by the rtnl lock.
After the reset is complete, the hardware registers are restored
to their default values. Therefore, some rebuild operations are
required to rewrite the user configuration to the registers.
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241216040532.1566229-7-shaojijie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The dump register is an effective way to analyze problems.
To ensure code flexibility, each register contains the type,
offset, and value information. The ethtool does the pretty print
based on these information.
The driver can dynamically add or delete registers that need to be dumped
in the future because information such as type and offset is contained.
ethtool always can do pretty print.
With the ethtool of a specific version,
the following effects are achieved:
[root@localhost sjj]# ./ethtool -d enp131s0f1
[SPEC] VALID [0x0000]: 0x00000001
[SPEC] EVENT_REQ [0x0004]: 0x00000000
[SPEC] MAC_ID [0x0008]: 0x00000002
[SPEC] PHY_ADDR [0x000c]: 0x00000002
[SPEC] MAC_ADDR_L [0x0010]: 0x00000808
[SPEC] MAC_ADDR_H [0x0014]: 0x08080802
[SPEC] UC_MAX_NUM [0x0018]: 0x00000004
[SPEC] MAX_MTU [0x0028]: 0x00000fc2
[SPEC] MIN_MTU [0x002c]: 0x00000100
[SPEC] TX_FIFO_NUM [0x0030]: 0x00000040
[SPEC] RX_FIFO_NUM [0x0034]: 0x0000007f
[SPEC] VLAN_LAYERS [0x0038]: 0x00000002
[MDIO] COMMAND_REG [0x0000]: 0x0000185f
[MDIO] ADDR_REG [0x0004]: 0x00000000
[MDIO] WDATA_REG [0x0008]: 0x0000a000
[MDIO] RDATA_REG [0x000c]: 0x00000000
[MDIO] STA_REG [0x0010]: 0x00000000
[GMAC] DUPLEX_TYPE [0x0008]: 0x00000001
[GMAC] FD_FC_TYPE [0x000c]: 0x00008808
[GMAC] FC_TX_TIMER [0x001c]: 0x000000ff
[GMAC] FD_FC_ADDR_LOW [0x0020]: 0xc2000001
[GMAC] FD_FC_ADDR_HIGH [0x0024]: 0x00000180
[GMAC] MAX_FRM_SIZE [0x003c]: 0x000005f6
[GMAC] PORT_MODE [0x0040]: 0x00000002
[GMAC] PORT_EN [0x0044]: 0x00000006
...
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241216040532.1566229-5-shaojijie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Oleksij Rempel says:
====================
lan78xx: Preparations for PHYlink
This patch set is a third part of the preparatory work for migrating
the lan78xx USB Ethernet driver to the PHYlink framework. During
extensive testing, I observed that resetting the USB adapter can lead to
various read/write errors. While the errors themselves are acceptable,
they generate excessive log messages, resulting in significant log spam.
This set improves error handling to reduce logging noise by addressing
errors directly and returning early when necessary.
====================
Link: https://patch.msgid.link/20241216120941.1690908-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Enhance error handling in Wake-on-LAN (WoL) operations:
- Log a warning in `lan78xx_get_wol` if `lan78xx_read_reg` fails.
- Check and handle errors from `device_set_wakeup_enable` and
`phy_ethtool_set_wol` in `lan78xx_set_wol`.
- Ensure proper cleanup with a unified error handling path.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241216120941.1690908-7-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove PHY register handling from `lan78xx_get_regs` and
`lan78xx_get_regs_len`. Since the controller can have different PHYs
attached, the first 32 registers are not universally relevant or the
most interesting. Simplify the implementation to focus on MAC and device
registers.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20241216120941.1690908-6-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Update lan78xx_stop_hw to return -ETIMEDOUT instead of -ETIME when
a timeout occurs. While -ETIME indicates a general timer expiration,
-ETIMEDOUT is more commonly used for signaling operation timeouts and
provides better consistency with standard error handling in the driver.
The -ETIME checks in tx_complete() and rx_complete() are unrelated to
this error handling change. In these functions, the error values are derived
from urb->status, which reflects USB transfer errors. The error value from
lan78xx_stop_hw will be exposed in the following cases:
- usb_driver::suspend
- net_device_ops::ndo_stop (potentially, though currently the return value
is not used).
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Link: https://patch.msgid.link/20241216120941.1690908-3-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Update `lan78xx_get_regs` to handle errors during register and PHY
reads. Log warnings for failed reads and exit the function early if an
error occurs. Drop all previously logged registers to signal
inconsistent readings to the user space. This ensures that invalid data
is not returned to users.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20241216120941.1690908-2-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Benefit from the recent conversion of the driver to NAPI and enable GRO
support through the use of napi_gro_receive(). Pass the NAPI pointer
from the bus driver (mlxsw_pci) to the switch driver (mlxsw_spectrum)
through the skb control block where various packet metadata is already
encoded.
The main motivation is to improve forwarding performance through the use
of GRO fraglist [1]. In my testing, when the forwarding data path is
simple (routing between two ports) there is not much difference in
forwarding performance between GRO disabled and GRO enabled with
fraglist.
The improvement becomes more noticeable as the data path becomes more
complex since it is traversed less times with GRO enabled. For example,
with 10 ingress and 10 egress flower filters with different priorities
on the two ports between which routing is performed, there is an
improvement of about 140% in forwarded bandwidth.
[1] https://lore.kernel.org/netdev/20200125102645.4782-1-steffen.klassert@secunet.com/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/21258fe55f608ccf1ee2783a5a4534220af28903.1734354812.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet says:
====================
inetpeer: reduce false sharing and atomic operations
After commit 8c2bd38b95 ("icmp: change the order of rate limits"),
there is a risk that a host receiving packets from an unique
source targeting closed ports is using a common inet_peer structure
from many cpus.
All these cpus have to acquire/release a refcount and update
the inet_peer timestamp (p->dtime)
Switch to pure RCU to avoid changing the refcount, and update
p->dtime only once per jiffy.
Tested:
DUT : 128 cores, 32 hw rx queues.
receiving 8,400,000 UDP packets per second, targeting closed ports.
Before the series:
- napi poll can not keep up, NIC drops 1,200,000 packets
per second.
- We use 20 % of cpu cycles
After this series:
- All packets are received (no more hw drops)
- We use 12 % of cpu cycles.
v1: https://lore.kernel.org/20241213130212.1783302-1-edumazet@google.com
====================
Link: https://patch.msgid.link/20241215175629.1248773-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All inet_getpeer() callers except ip4_frag_init() don't need
to acquire a permanent refcount on the inetpeer.
They can switch to full RCU protection.
Move the refcount_inc_not_zero() into ip4_frag_init(),
so that all the other callers no longer have to
perform a pair of expensive atomic operations on
a possibly contended cache line.
inet_putpeer() no longer needs to be exported.
After this patch, my DUT can receive 8,400,000 UDP packets
per second targeting closed ports, using 50% less cpu cycles
than before.
Also change two calls to l3mdev_master_ifindex() by
l3mdev_master_ifindex_rcu() (Ido ideas)
Fixes: 8c2bd38b95 ("icmp: change the order of rate limits")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241215175629.1248773-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Thomas Weißschuh says:
====================
net: constify 'struct bin_attribute'
The sysfs core now allows instances of 'struct bin_attribute' to be
moved into read-only memory. Make use of that to protect them against
accidental or malicious modifications.
====================
Link: https://patch.msgid.link/20241216-sysfs-const-bin_attr-net-v1-0-ec460b91f274@weissschuh.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>