dst->dev is read locklessly in many contexts,
and written in dst_dev_put().
Fixing all the races is going to need many changes.
We probably will have to add full RCU protection.
Add three helpers to ease this painful process.
static inline struct net_device *dst_dev(const struct dst_entry *dst)
{
return READ_ONCE(dst->dev);
}
static inline struct net_device *skb_dst_dev(const struct sk_buff *skb)
{
return dst_dev(skb_dst(skb));
}
static inline struct net *skb_dst_dev_net(const struct sk_buff *skb)
{
return dev_net(skb_dst_dev(skb));
}
static inline struct net *skb_dst_dev_net_rcu(const struct sk_buff *skb)
{
return dev_net_rcu(skb_dst_dev(skb));
}
Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dst_dev_put() can overwrite dst->output while other
cpus might read this field (for instance from dst_output())
Add READ_ONCE()/WRITE_ONCE() annotations to suppress
potential issues.
We will likely need RCU protection in the future.
Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet says:
====================
net: introduce net_aligned_data
____cacheline_aligned_in_smp on small fields like
tcp_memory_allocated and udp_memory_allocated is not good enough.
It makes sure to put these fields at the start of a cache line,
but does not prevent the linker from using the cache line for other
fields, with potential performance impact.
nm -v vmlinux|egrep -5 "tcp_memory_allocated|udp_memory_allocated"
...
ffffffff83e35cc0 B tcp_memory_allocated
ffffffff83e35cc8 b __key.0
ffffffff83e35cc8 b __tcp_tx_delay_enabled.1
ffffffff83e35ce0 b tcp_orphan_timer
...
ffffffff849dddc0 B udp_memory_allocated
ffffffff849dddc8 B udp_encap_needed_key
ffffffff849dddd8 B udpv6_encap_needed_key
ffffffff849dddf0 b inetsw_lock
One solution is to move these sensitive fields to a structure,
so that the compiler is forced to add empty holes between each member.
nm -v vmlinux|egrep -2 "tcp_memory_allocated|udp_memory_allocated|net_aligned_data"
ffffffff885af970 b mem_id_init
ffffffff885af980 b __key.0
ffffffff885af9c0 B net_aligned_data
ffffffff885afa80 B page_pool_mem_providers
ffffffff885afa90 b __key.2
v1: https://lore.kernel.org/netdev/20250627200551.348096-1-edumazet@google.com/
====================
Link: https://patch.msgid.link/20250630093540.3052835-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
____cacheline_aligned_in_smp attribute only makes sure to align
a field to a cache line. It does not prevent the linker to use
the remaining of the cache line for other variables, causing
potential false sharing.
Move tcp_memory_allocated into a dedicated cache line.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is not only a cosmetic change because the section mismatch checks
(implemented in scripts/mod/modpost.c) also depend on the object's name
and for drivers the checks are stricter than for ops.
However aq_pci_driver also passes the stricter checks just fine, so no
further changes needed.
The cheating^Wmisleading name was introduced in commit 97bde5c4f9
("net: ethernet: aquantia: Support for NIC-specific code")
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250630164406.57589-2-u.kleine-koenig@baylibre.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If the user reinitializes the network interface, the PHY will reinitialize,
and the CKO settings will revert to their initial configuration(be enabled).
To prevent CKO from being re-enabled,
en8811h_clk_restore_context and en8811h_resume were added
to ensure the CKO settings remain correct.
Signed-off-by: Lucien.Jheng <lucienzx159@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250630154147.80388-1-lucienzx159@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
'struct devlink_region_ops' and 'struct hellcreek_fdb_entry' are not
modified in this driver.
Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.
On a x86_64, with allmodconfig:
Before:
======
text data bss dec hex filename
55320 19216 320 74856 12468 drivers/net/dsa/hirschmann/hellcreek.o
After:
=====
text data bss dec hex filename
55960 18576 320 74856 12468 drivers/net/dsa/hirschmann/hellcreek.o
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Link: https://patch.msgid.link/2f7e8dc30db18bade94999ac7ce79f333342e979.1751231174.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andrea Mayer says:
====================
seg6: fix typos in comments within the SRv6 subsystem
In this patchset, we correct some typos found both in the SRv6 Endpoints
implementation (i.e., seg6local) and in some SRv6 selftests, using
codespell.
====================
Link: https://patch.msgid.link/20250629171226.4988-1-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
'struct devlink_region_ops' and 'struct mv88e6xxx_region' are not modified
in this driver.
Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.
On a x86_64, with allmodconfig, as an example:
Before:
======
text data bss dec hex filename
18076 6496 64 24636 603c drivers/net/dsa/mv88e6xxx/devlink.o
After:
=====
text data bss dec hex filename
18652 5920 64 24636 603c drivers/net/dsa/mv88e6xxx/devlink.o
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/46040062161dda211580002f950a6d60433243dc.1751200453.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for the Signal Quality Indicator (SQI) feature on KSZ9477
family switches, providing a relative measure of receive signal quality.
The hardware exposes separate SQI readings per channel. For 1000BASE-T,
all four channels are read. For 100BASE-TX, only one channel is reported,
but which receive pair is active depends on Auto MDI-X negotiation, which
is not exposed by the hardware. Therefore, it is not possible to reliably
map the measured channel to a specific wire pair.
This resolves an earlier discussion about how to handle multi-channel
SQI. Originally, the plan was to expose all channels individually.
However, since pair mapping is sometimes unavailable, this
implementation treats SQI as a per-link metric instead. This fallback
avoids ambiguity and ensures consistent behavior. The existing get_sqi()
UAPI was designed for single-pair Ethernet (SPE), where per-pair and
per-link are effectively equivalent. Restricting its use to per-link
metrics does not introduce regressions for existing users.
The raw 7-bit SQI value (0–127, lower is better) is converted to the
standard 0–7 (high is better) scale. Empirical testing showed that the
link becomes unstable around a raw value of 8.
The SQI raw value remains zero if no data is received, even if noise is
present. This confirms that the measurement reflects the "quality" during
active data reception rather than the passive line state. User space
must ensure that traffic is present on the link to obtain valid SQI
readings.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250627112539.895255-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido spotted that I made a mistake in commit under Fixes,
ethnl_default_parse() may acquire a dev reference even when it returns
an error. This may have been driven by the code structure in dumps
(which unconditionally release dev before handling errors), but it's
too much of a trap. Functions should undo what they did before returning
an error, rather than expecting caller to clean up.
Rather than fixing ethnl_default_set_doit() directly make
ethnl_default_parse() clean up errors.
Reported-by: Ido Schimmel <idosch@idosch.org>
Link: https://lore.kernel.org/aGEPszpq9eojNF4Y@shredder
Fixes: 963781bdfe ("net: ethtool: call .parse_request for SET handlers")
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250630154053.1074664-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If no PHY device is found (e.g., for LAN7801 in fixed-link mode),
lan78xx_phy_init() may proceed to dereference a NULL phydev pointer,
leading to a crash.
Update the logic to perform MAC configuration first, then check for the presence
of a PHY. For the fixed-link case, set up the fixed link and return early,
bypassing any code that assumes a valid phydev pointer.
It is safe to move lan78xx_mac_prepare_for_phy() earlier because this function
only uses information from dev->interface, which is configured by
lan78xx_get_phy() beforehand. The function does not access phydev or any data
set up by later steps.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Fixes: e110bc8258 ("net: usb: lan78xx: Convert to PHYLINK for improved PHY and MAC management")
Link: https://patch.msgid.link/20250626103731.3986545-1-o.rempel@pengutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Some counters in enetc_port_counters are 32-bit registers, and some are
64-bit registers. But in the current driver, they are all read through
enetc_port_rd(), which can only read a 32-bit value. Therefore, separate
64-bit counters (enetc_pm_counters) from enetc_port_counters and use
enetc_port_rd64() to read the 64-bit statistics.
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250627021108.3359642-3-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The statistics of the ring are all unsigned int type, so the statistics
will overflow quickly under heavy traffic. In addition, the statistics
of struct net_device_stats are obtained from struct enetc_ring_stats,
but the statistics of net_device_stats are unsigned long type. So it is
better to keep the statistics types consistent in these two structures.
Considering these two factors, and the fact that both LS1028A and i.MX95
are arm64 architecture, the statistics of enetc_ring_stats are changed
to unsigned long type. Note that unsigned int and unsigned long are the
same thing on some systems, and on such systems there is no overflow
advantage of one over the other.
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250627021108.3359642-2-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In the current implementation, IP coalescing is always enabled and
cannot be disabled.
As setting maximum frames to 0 or 1, or setting delay to zero implies
immediate delivery of single packets/IRQs, disable coalescing in
hardware in these cases.
This also guarantees that coalescing is never enabled with ICFT or ICTT
set to zero, a configuration that could lead to unpredictable behaviour
according to i.MX8MP reference manual.
Signed-off-by: Jonas Rebmann <jre@pengutronix.de>
Reviewed-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20250626-fec_deactivate_coalescing-v2-1-0b217f2e80da@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel says:
====================
Add support for externally validated neighbor entries
Patch #1 adds a new neighbor flag ("extern_valid") that prevents the
kernel from invalidating or removing a neighbor entry, while allowing
the kernel to notify user space when the entry becomes reachable. See
motivation and implementation details in the commit message.
Patch #2 adds a selftest.
v1: https://lore.kernel.org/20250611141551.462569-1-idosch@nvidia.com
====================
Link: https://patch.msgid.link/20250626073111.244534-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add test cases for externally validated neighbor entries, testing both
IPv4 and IPv6. Name the file "test_neigh.sh" so that it could be
possibly extended in the future with more neighbor test cases.
Example output:
# ./test_neigh.sh
TEST: IPv4 "extern_valid" flag: Add entry [ OK ]
TEST: IPv4 "extern_valid" flag: Add with an invalid state [ OK ]
TEST: IPv4 "extern_valid" flag: Add with "use" flag [ OK ]
TEST: IPv4 "extern_valid" flag: Replace entry [ OK ]
TEST: IPv4 "extern_valid" flag: Replace entry with "managed" flag [ OK ]
TEST: IPv4 "extern_valid" flag: Replace with an invalid state [ OK ]
TEST: IPv4 "extern_valid" flag: Interface down [ OK ]
TEST: IPv4 "extern_valid" flag: Carrier down [ OK ]
TEST: IPv4 "extern_valid" flag: Transition to "reachable" state [ OK ]
TEST: IPv4 "extern_valid" flag: Transition back to "stale" state [ OK ]
TEST: IPv4 "extern_valid" flag: Forced garbage collection [ OK ]
TEST: IPv4 "extern_valid" flag: Periodic garbage collection [ OK ]
TEST: IPv6 "extern_valid" flag: Add entry [ OK ]
TEST: IPv6 "extern_valid" flag: Add with an invalid state [ OK ]
TEST: IPv6 "extern_valid" flag: Add with "use" flag [ OK ]
TEST: IPv6 "extern_valid" flag: Replace entry [ OK ]
TEST: IPv6 "extern_valid" flag: Replace entry with "managed" flag [ OK ]
TEST: IPv6 "extern_valid" flag: Replace with an invalid state [ OK ]
TEST: IPv6 "extern_valid" flag: Interface down [ OK ]
TEST: IPv6 "extern_valid" flag: Carrier down [ OK ]
TEST: IPv6 "extern_valid" flag: Transition to "reachable" state [ OK ]
TEST: IPv6 "extern_valid" flag: Transition back to "stale" state [ OK ]
TEST: IPv6 "extern_valid" flag: Forced garbage collection [ OK ]
TEST: IPv6 "extern_valid" flag: Periodic garbage collection [ OK ]
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250626073111.244534-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tl;dr
=====
Add a new neighbor flag ("extern_valid") that can be used to indicate to
the kernel that a neighbor entry was learned and determined to be valid
externally. The kernel will not try to remove or invalidate such an
entry, leaving these decisions to the user space control plane. This is
needed for EVPN multi-homing where a neighbor entry for a multi-homed
host needs to be synced across all the VTEPs among which the host is
multi-homed.
Background
==========
In a typical EVPN multi-homing setup each host is multi-homed using a
set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf
switches (VTEPs). VTEPs that are connected to the same ES are called ES
peers.
When a neighbor entry is learned on a VTEP, it is distributed to both ES
peers and remote VTEPs using EVPN MAC/IP advertisement routes. ES peers
use the neighbor entry when routing traffic towards the multi-homed host
and remote VTEPs use it for ARP/NS suppression.
Motivation
==========
If the ES link between a host and the VTEP on which the neighbor entry
was locally learned goes down, the EVPN MAC/IP advertisement route will
be withdrawn and the neighbor entries will be removed from both ES peers
and remote VTEPs. Routing towards the multi-homed host and ARP/NS
suppression can fail until another ES peer locally learns the neighbor
entry and distributes it via an EVPN MAC/IP advertisement route.
"draft-rbickhart-evpn-ip-mac-proxy-adv-03" [1] suggests avoiding these
intermittent failures by having the ES peers install the neighbor
entries as before, but also injecting EVPN MAC/IP advertisement routes
with a proxy indication. When the previously mentioned ES link goes down
and the original EVPN MAC/IP advertisement route is withdrawn, the ES
peers will not withdraw their neighbor entries, but instead start aging
timers for the proxy indication.
If an ES peer locally learns the neighbor entry (i.e., it becomes
"reachable"), it will restart its aging timer for the entry and emit an
EVPN MAC/IP advertisement route without a proxy indication. An ES peer
will stop its aging timer for the proxy indication if it observes the
removal of the proxy indication from at least one of the ES peers
advertising the entry.
In the event that the aging timer for the proxy indication expired, an
ES peer will withdraw its EVPN MAC/IP advertisement route. If the timer
expired on all ES peers and they all withdrew their proxy
advertisements, the neighbor entry will be completely removed from the
EVPN fabric.
Implementation
==============
In the above scheme, when the control plane (e.g., FRR) advertises a
neighbor entry with a proxy indication, it expects the corresponding
entry in the data plane (i.e., the kernel) to remain valid and not be
removed due to garbage collection or loss of carrier. The control plane
also expects the kernel to notify it if the entry was learned locally
(i.e., became "reachable") so that it will remove the proxy indication
from the EVPN MAC/IP advertisement route. That is why these entries
cannot be programmed with dummy states such as "permanent" or "noarp".
Instead, add a new neighbor flag ("extern_valid") which indicates that
the entry was learned and determined to be valid externally and should
not be removed or invalidated by the kernel. The kernel can probe the
entry and notify user space when it becomes "reachable" (it is initially
installed as "stale"). However, if the kernel does not receive a
confirmation, have it return the entry to the "stale" state instead of
the "failed" state.
In other words, an entry marked with the "extern_valid" flag behaves
like any other dynamically learned entry other than the fact that the
kernel cannot remove or invalidate it.
One can argue that the "extern_valid" flag should not prevent garbage
collection and that instead a neighbor entry should be programmed with
both the "extern_valid" and "extern_learn" flags. There are two reasons
for not doing that:
1. Unclear why a control plane would like to program an entry that the
kernel cannot invalidate but can completely remove.
2. The "extern_learn" flag is used by FRR for neighbor entries learned
on remote VTEPs (for ARP/NS suppression) whereas here we are
concerned with local entries. This distinction is currently irrelevant
for the kernel, but might be relevant in the future.
Given that the flag only makes sense when the neighbor has a valid
state, reject attempts to add a neighbor with an invalid state and with
this flag set. For example:
# ip neigh add 192.0.2.1 nud none dev br0.10 extern_valid
Error: Cannot create externally validated neighbor with an invalid state.
# ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid
# ip neigh replace 192.0.2.1 nud failed dev br0.10 extern_valid
Error: Cannot mark neighbor as externally validated with an invalid state.
The above means that a neighbor cannot be created with the
"extern_valid" flag and flags such as "use" or "managed" as they result
in a neighbor being created with an invalid state ("none") and
immediately getting probed:
# ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
Error: Cannot create externally validated neighbor with an invalid state.
However, these flags can be used together with "extern_valid" after the
neighbor was created with a valid state:
# ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid
# ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
One consequence of preventing the kernel from invalidating a neighbor
entry is that by default it will only try to determine reachability
using unicast probes. This can be changed using the "mcast_resolicit"
sysctl:
# sysctl net.ipv4.neigh.br0/10.mcast_resolicit
0
# tcpdump -nn -e -i br0.10 -Q out arp &
# ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
# sysctl -wq net.ipv4.neigh.br0/10.mcast_resolicit=3
# ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
iproute2 patches can be found here [2].
[1] https://datatracker.ietf.org/doc/html/draft-rbickhart-evpn-ip-mac-proxy-adv-03
[2] https://github.com/idosch/iproute2/tree/submit/extern_valid_v1
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20250626073111.244534-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>