Currently, rtnl_setlink() and rtnl_dellink() cannot be fully converted
to per-netns RTNL due to a lack of handling peer/lower/upper devices in
different netns.
For example, when we change a device in rtnl_setlink() and need to
propagate that to its upper devices, we want to avoid acquiring all netns
locks, for which we do not know the upper limit.
The same situation happens when we remove a device.
rtnl_dellink() could be transformed to remove a single device in the
requested netns and delegate other devices to per-netns work, and
rtnl_setlink() might be ?
Until we come up with a better idea, let's use a new flag
RTNL_FLAG_DOIT_PERNET_WIP for rtnl_dellink() and rtnl_setlink().
This will unblock converting RTNL users where such devices are not related.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241108004823.29419-11-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In ops->newlink(), veth, vxcan, and netkit call rtnl_link_get_net() with
a net pointer, which is the first argument of ->newlink().
rtnl_link_get_net() could return another netns based on IFLA_NET_NS_PID
and IFLA_NET_NS_FD in the peer device's attributes.
We want to get it and fill rtnl_nets->nets[] in advance in rtnl_newlink()
for per-netns RTNL.
All of the three get the peer netns in the same way:
1. Call rtnl_nla_parse_ifinfomsg()
2. Call ops->validate() (vxcan doesn't have)
3. Call rtnl_link_get_net_tb()
Let's add a new field peer_type to struct rtnl_link_ops and prefetch
netns in the peer ifla to add it to rtnl_nets in rtnl_newlink().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241108004823.29419-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rtnl_newlink() needs to hold 3 per-netns RTNL: 2 for a new device
and 1 for its peer.
We will add rtnl_nets_lock() later, which performs the nested locking
based on struct rtnl_nets, which has an array of struct net pointers.
rtnl_nets_add() adds a net pointer to the array and sorts it so that
rtnl_nets_lock() can simply acquire per-netns RTNL from array[0] to [2].
Before calling rtnl_nets_add(), get_net() must be called for the net,
and rtnl_nets_destroy() will call put_net() for each.
Let's apply the helpers to rtnl_newlink().
When CONFIG_DEBUG_NET_SMALL_RTNL is disabled, we do not call
rtnl_net_lock() thus do not care about the array order, so
rtnl_net_cmp_locks() returns -1 so that the loop in rtnl_nets_add()
can be optimised to NOP.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241108004823.29419-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
link_ops is protected by link_ops_mutex and no longer needs RTNL,
so we have no reason to have __rtnl_link_register() separately.
Let's remove it and call rtnl_link_register() from ifb.ko and
dummy.ko.
Note that both modules' init() work on init_net only, so we need
not export pernet_ops_rwsem and can use rtnl_net_lock() there.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241108004823.29419-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rtnl_link_unregister() holds RTNL and calls synchronize_srcu(),
but rtnl_newlink() will acquire SRCU frist and then RTNL.
Then, we need to unlink ops and call synchronize_srcu() outside
of RTNL to avoid the deadlock.
rtnl_link_unregister() rtnl_newlink()
---- ----
lock(rtnl_mutex);
lock(&ops->srcu);
lock(rtnl_mutex);
sync(&ops->srcu);
Let's move as such and add a mutex to protect link_ops.
Now, link_ops is protected by its dedicated mutex and
rtnl_link_register() no longer needs to hold RTNL.
While at it, we move the initialisation of ops->dellink and
ops->srcu out of the mutex scope.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241108004823.29419-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rtnl_link_unregister() holds RTNL and calls __rtnl_link_unregister(),
where we call synchronize_srcu() to wait inflight RTM_NEWLINK requests
for per-netns RTNL.
We put synchronize_srcu() in __rtnl_link_unregister() due to ifb.ko
and dummy.ko.
However, rtnl_newlink() will acquire SRCU before RTNL later in this
series. Then, lockdep will detect the deadlock:
rtnl_link_unregister() rtnl_newlink()
---- ----
lock(rtnl_mutex);
lock(&ops->srcu);
lock(rtnl_mutex);
sync(&ops->srcu);
To avoid the problem, we must call synchronize_srcu() before RTNL in
rtnl_link_unregister().
As a preparation, let's remove __rtnl_link_unregister().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241108004823.29419-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stanislav Fomichev says:
====================
selftests: ncdevmem: Add ncdevmem to ksft
The goal of the series is to simplify and make it possible to use
ncdevmem in an automated way from the ksft python wrapper.
ncdevmem is slowly mutated into a state where it uses stdout
to print the payload and the python wrapper is added to
make sure the arrived payload matches the expected one.
====================
Link: https://patch.msgid.link/20241107181211.3934153-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Only RX side for now and small message to test the setup.
In the future, we can extend it to TX side and to testing
both sides with a couple of megs of data.
make \
-C tools/testing/selftests \
TARGETS="drivers/hw/net" \
install INSTALL_PATH=~/tmp/ksft
scp ~/tmp/ksft ${HOST}:
scp ~/tmp/ksft ${PEER}:
cfg+="NETIF=${DEV}\n"
cfg+="LOCAL_V6=${HOST_IP}\n"
cfg+="REMOTE_V6=${PEER_IP}\n"
cfg+="REMOTE_TYPE=ssh\n"
cfg+="REMOTE_ARGS=root@${PEER}\n"
echo -e "$cfg" | ssh root@${HOST} "cat > ksft/drivers/net/net.config"
ssh root@${HOST} "cd ksft && ./run_kselftest.sh -t drivers/net:devmem.py"
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20241107181211.3934153-13-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In the next patch the hard-coded queue numbers are gonna be removed.
So introduce some initial support for ethtool YNL and use
it to enable header split.
Also, tcp-data-split requires latest ethtool which is unlikely
to be present in the distros right now.
(ideally, we should not shell out to ethtool at all).
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241107181211.3934153-9-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ley Foon Tan says:
====================
net: stmmac: dwmac4: Fixes issues in dwmac4
This patch series fixes issues in the dwmac4 driver. These three patches
don't cause any user-visible issues, so they are targeted for net-next.
Patch #1:
Corrects the masking logic in the MTL Operation Mode RTC mask and shift
macros. The current code lacks the use of the ~ operator, which is
necessary to clear the bits properly.
Patch #2:
Addresses inaccuracies in the MTL_OP_MODE_*_MASK macros. The RTC fields
are located in bits [1:0], and this patch ensures the mask and shift
macros use the appropriate values to reflect this.
Patch #3:
Moves the handling of the Receive Watchdog Timeout (RWT) out of the
Abnormal Interrupt Summary (AIS) condition. According to the databook,
the RWT interrupt is not included in the AIS.
v1: https://lore.kernel.org/20241023112005.GN402847@kernel.org
v2: https://lore.kernel.org/20241101082336.1552084-3-leyfoon.tan@starfivetech.com
====================
Link: https://patch.msgid.link/20241107063637.2122726-1-leyfoon.tan@starfivetech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The Receive Watchdog Timeout (RWT, bit[9]) is not part of Abnormal
Interrupt Summary (AIS). Move the RWT handling out of the AIS
condition statement.
From databook, the AIS is the logical OR of the following interrupt bits:
- Bit 1: Transmit Process Stopped
- Bit 7: Receive Buffer Unavailable
- Bit 8: Receive Process Stopped
- Bit 10: Early Transmit Interrupt
- Bit 12: Fatal Bus Error
- Bit 13: Context Descriptor Error
Signed-off-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241107063637.2122726-4-leyfoon.tan@starfivetech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds support for VLAN ctag based filtering at slave devices.
The slave ethernet device may be capable of filtering ethernet packets
based on VLAN ID. This requires that when the VLAN interface is created
over an HSR/PRP interface, it passes the VID information to the
associated slave ethernet devices so that it updates the hardware
filters to filter ethernet frames based on VID. This patch adds the
required functions to propagate the vid information to the slave
devices.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20241106091710.3308519-3-danishanwar@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Oleksij Rempel says:
====================
Side MDIO Support for LAN937x Switches
This patch set introduces support for an internal MDIO bus in LAN937x
switches, enabling the use of a side MDIO channel for PHY management
while keeping SPI as the main interface for switch configuration.
other changelogs are added to separate patches.
====================
Link: https://patch.msgid.link/20241106075942.1636998-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce ksz_parse_dt_phy_config() to validate and parse PHY
configuration from the device tree for KSZ switches. This function
ensures proper setup of internal PHYs by checking `phy-handle`
properties, verifying expected PHY IDs, and handling parent node
mismatches. Sets the PHY mask on the MII bus if validation is
successful. Returns -EINVAL on configuration errors.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20241106075942.1636998-7-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Implement side MDIO channel support for LAN937x switches, providing an
alternative to SPI for PHY management alongside existing SPI-based
switch configuration. This is needed to reduce SPI load, as SPI can be
relatively expensive for small packets compared to MDIO support.
Also, implemented static mappings for PHY addresses for various LAN937x
models to support different internal PHY configurations. Since the PHY
address mappings are not equal to the port indexes, this patch also
provides PHY address calculation based on hardware strapping
configuration.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241106075942.1636998-6-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for accessing PHYs via a side MDIO interface in LAN937x
switches. The existing code already supports accessing PHYs via main
management interfaces, which can be SPI, I2C, or MDIO, depending on the
chip variant. This patch enables using a side MDIO bus, where SPI is
used for the main switch configuration and MDIO for managing the
integrated PHYs. On LAN937x, this is optional, allowing them to operate
in both configurations: SPI only, or SPI + MDIO. Typically, the SPI
interface is used for switch configuration, while MDIO handles PHY
management.
Additionally, update interrupt controller code to support non-linear
port to PHY address mapping, enabling correct interrupt handling for
configurations where PHY addresses do not directly correspond to port
indexes. This change ensures that the interrupt mechanism properly
aligns with the new, flexible PHY address mappings introduced by side
MDIO support.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241106075942.1636998-4-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
irq_set_affinity_hint() is deprecated, Use irq_update_affinity_hint()
instead. This removes the side-effect of actually applying the affinity.
The driver does not really need to worry about spreading its IRQs across
CPUs. The core code already takes care of that. when the driver applies the
affinities by itself, it breaks the users' expectations:
1. The user configures irqbalance with IRQBALANCE_BANNED_CPULIST in
order to prevent IRQs from being moved to certain CPUs that run a
real-time workload.
2. atlantic device reopening will resets the affinity
in aq_ndev_open().
3. atlantic has no idea about irqbalance's config, so it may move an IRQ to
a banned CPU. The real-time workload suffers unacceptable latency.
Signed-off-by: Mohammad Heib <mheib@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241107120739.415743-1-mheib@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
irq_set_affinity_hint() is deprecated, Use irq_update_affinity_hint()
instead. This removes the side-effect of actually applying the affinity.
The driver does not really need to worry about spreading its IRQs across
CPUs. The core code already takes care of that. when the driver applies the
affinities by itself, it breaks the users' expectations:
1. The user configures irqbalance with IRQBALANCE_BANNED_CPULIST in
order to prevent IRQs from being moved to certain CPUs that run a
real-time workload.
2. nfp device reopening will resets the affinity
in nfp_net_netdev_open().
3. nfp has no idea about irqbalance's config, so it may move an IRQ to
a banned CPU. The real-time workload suffers unacceptable latency.
Signed-off-by: Mohammad Heib <mheib@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Louis Peens <louis.peens@corigine.com>
Link: https://patch.msgid.link/20241107115002.413358-1-mheib@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
irq_set_affinity_hint() is deprecated, Use irq_update_affinity_hint()
instead. This removes the side-effect of actually applying the affinity.
The driver does not really need to worry about spreading its IRQs across
CPUs. The core code already takes care of that. when the driver applies the
affinities by itself, it breaks the users' expectations:
1. The user configures irqbalance with IRQBALANCE_BANNED_CPULIST in
order to prevent IRQs from being moved to certain CPUs that run a
real-time workload.
2. bnxt_en device reopening will resets the affinity
in bnxt_open().
3. bnxt_en has no idea about irqbalance's config, so it may move an IRQ to
a banned CPU. The real-time workload suffers unacceptable latency.
Signed-off-by: Mohammad Heib <mheib@redhat.com>
Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Link: https://patch.msgid.link/20241106180811.385175-1-mheib@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>