When net_shaper_ops are enabled for MANA, netdev_ops_lock
becomes active.
MANA VF setup/teardown by netvsc follows this call chain:
netvsc_vf_setup()
dev_change_flags()
...
__dev_open() OR __dev_close()
dev_change_flags() holds the netdev mutex via netdev_lock_ops.
Meanwhile, mana_create_txq() and mana_create_rxq() in mana_open()
path call NAPI APIs (netif_napi_add_tx(), netif_napi_add_weight(),
napi_enable()), which also try to acquire the same lock, risking
deadlock.
Similarly in the teardown path (mana_close()), netif_napi_disable()
and netif_napi_del(), contend for the same lock.
Switch to the _locked variants of these APIs to avoid deadlocks
when the netdev_ops_lock is held.
Fixes: d4c22ec680 ("net: hold netdev instance lock during ndo_open/ndo_stop")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Saurabh Singh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1750144656-2021-2-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Since commit 3e6a0243ff ("gre: Fix again IPv6 link-local address
generation."), addrconf_gre_config() has stopped handling IP6GRE
devices specially and just calls the regular addrconf_addr_gen()
function to create their link-local IPv6 addresses.
We can thus avoid using addrconf_gre_config() for IP6GRE devices and
use the normal IPv6 initialisation path instead (that is, jump directly
to addrconf_dev_config() in addrconf_init_auto_addrs()).
See commit 3e6a0243ff ("gre: Fix again IPv6 link-local address
generation.") for a deeper explanation on how and why GRE devices
started handling their IPv6 link-local address generation specially,
why it was a problem, and why this is not even necessary in most cases
(especially for GRE over IPv6).
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/a9144be9c7ec3cf09f25becae5e8fdf141fde9f6.1750075076.git.gnault@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Kory Maincent says:
====================
Add support for PSE budget evaluation strategy
This series brings support for budget evaluation strategy in the PSE
subsystem. PSE controllers can set priorities to decide which ports should
be turned off in case of special events like over-current.
This patch series adds support for two budget evaluation strategy.
1. Static Method:
This method involves distributing power based on PD classification.
It’s straightforward and stable, the PSE core keeping track of the
budget and subtracting the power requested by each PD’s class.
Advantages: Every PD gets its promised power at any time, which
guarantees reliability.
Disadvantages: PD classification steps are large, meaning devices
request much more power than they actually need. As a result, the power
supply may only operate at, say, 50% capacity, which is inefficient and
wastes money.
2. Dynamic Method:
To address the inefficiencies of the static method, vendors like
Microchip have introduced dynamic power budgeting, as seen in the
PD692x0 firmware. This method monitors the current consumption per port
and subtracts it from the available power budget. When the budget is
exceeded, lower-priority ports are shut down.
Advantages: This method optimizes resource utilization, saving costs.
Disadvantages: Low-priority devices may experience instability.
The UAPI allows adding support for software port priority mode managed from
userspace later if needed.
Patches 1-2: Add support for interrupt event report in PSE core, ethtool
and ethtool specs.
Patch 3: Adds support for interrupt and event report in TPS23881 driver.
Patches 4,5: Add support for PSE power domain in PSE core and ethtool.
Patches 6-8: Add support for budget evaluation strategy in PSE core,
ethtool and ethtool specs.
Patches 9-11: Add support for port priority and power supplies in PD692x0
drivers.
Patches 12,13: Add support for port priority in TPS23881 drivers.
====================
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-0-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add an interrupt property to the device tree bindings for the TI TPS23881
PSE controller. The interrupt is primarily used to detect classification
and disconnection events, which are essential for managing the PSE
controller in compliance with the PoE standard.
Interrupt support is essential for the proper functioning of the TPS23881
controller. Without it, after a power-on (PWON), the controller will
no longer perform detection and classification. This could lead to
potential hazards, such as connecting a non-PoE device after a PoE device,
which might result in magic smoke.
Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-13-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch enhances PSE callbacks by introducing support for the static
port priority feature. It extends interrupt management to handle and report
detection, classification, and disconnection events. Additionally, it
introduces the pi_get_pw_req() callback, which provides information about
the power requested by the Powered Devices.
Interrupt support is essential for the proper functioning of the TPS23881
controller. Without it, after a power-on (PWON), the controller will
no longer perform detection and classification. This could lead to
potential hazards, such as connecting a non-PoE device after a PoE device,
which might result in magic smoke.
Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-12-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch introduces the ability to configure the PSE PI budget evaluation
strategies. Budget evaluation strategies is utilized by PSE controllers to
determine which ports to turn off first in scenarios such as power budget
exceedance.
The pis_prio_max value is used to define the maximum priority level
supported by the controller. Both the current priority and the maximum
priority are exposed to the user through the pse_ethtool_get_status call.
This patch add support for two mode of budget evaluation strategies.
1. Static Method:
This method involves distributing power based on PD classification.
It’s straightforward and stable, the PSE core keeping track of the
budget and subtracting the power requested by each PD’s class.
Advantages: Every PD gets its promised power at any time, which
guarantees reliability.
Disadvantages: PD classification steps are large, meaning devices
request much more power than they actually need. As a result, the power
supply may only operate at, say, 50% capacity, which is inefficient and
wastes money.
Priority max value is matching the number of PSE PIs within the PSE.
2. Dynamic Method:
To address the inefficiencies of the static method, vendors like
Microchip have introduced dynamic power budgeting, as seen in the
PD692x0 firmware. This method monitors the current consumption per port
and subtracts it from the available power budget. When the budget is
exceeded, lower-priority ports are shut down.
Advantages: This method optimizes resource utilization, saving costs.
Disadvantages: Low-priority devices may experience instability.
Priority max value is set by the PSE controller driver.
For now, budget evaluation methods are not configurable and cannot be
mixed. They are hardcoded in the PSE driver itself, as no current PSE
controller supports both methods.
Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-7-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce PSE power domain support as groundwork for upcoming port
priority features. Multiple PSE PIs can now be grouped under a single
PSE power domain, enabling future enhancements like defining available
power budgets, port priority modes, and disconnection policies. This
setup will allow the system to assess whether activating a port would
exceed the available power budget, preventing over-budget states
proactively.
Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-4-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In preparation for reporting PSE events via ethtool notifications,
introduce an attached_phydev field in the pse_control structure.
This field stores the phy_device associated with the PSE PI,
ensuring that notifications are sent to the correct network
interface.
The attached_phydev pointer is directly tied to the PHY lifecycle. It
is set when the PHY is registered and cleared when the PHY is removed.
There is no need to use a refcount, as doing so could interfere with
the PHY removal process.
Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-1-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Arinzon says:
====================
PHC support in ENA driver
This patchset adds the support for PHC (PTP Hardware Clock)
in the ENA driver. The documentation part of the patchset
includes additional information, including statistics,
utilization and invocation examples through the testptp
utility.
====================
Link: https://patch.msgid.link/20250617110545.5659-1-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Adding the base directory of debugfs to the driver.
In order for the folder to be unique per driver instantiation,
the chosen name is the device name.
This commit contains the initialization and the
base folder.
The creation of the base folder may fail, but is considered
non-fatal.
Signed-off-by: David Arinzon <darinzon@amazon.com>
Link: https://patch.msgid.link/20250617110545.5659-8-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add the capability to set parameters through the devlink framework.
The parameter used for controlling PHC (enable/disable) details
are as follows:
- Name: enable_phc
- Type: Boolean (true - enable/false - disable)
- Mode: DEVLINK_PARAM_CMODE_DRIVERINIT
- Effect: Changes take place during driver initialization,
any changes require a devlink reload to take effect.
Signed-off-by: David Arinzon <darinzon@amazon.com>
Link: https://patch.msgid.link/20250617110545.5659-7-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Adding basic devlink capability support of reloading the driver.
This capability is required to support driver init type
devlink params (DEVLINK_PARAM_CMODE_DRIVERINIT). Such params
require reloading of the driver (destroy/restore sequence).
The reloading is done by the devlink framework using the
hooks provided by the driver.
Signed-off-by: David Arinzon <darinzon@amazon.com>
Link: https://patch.msgid.link/20250617110545.5659-4-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Each PHC device kernel registration receives a unique kernel index,
which is associated with a new PHC device file located at
"/dev/ptp<index>".
This device file serves as an interface for obtaining PHC timestamps.
Examples of tools that use "/dev/ptp" include testptp [1]
and chrony [2].
A reset flow may occur in the ENA driver while PHC is active.
During a reset, the driver will unregister and then re-register the
PHC device with the kernel.
Under race conditions, particularly during heavy PHC loads,
the driver’s reset flow might complete faster than the kernel’s PHC
unregister/register process.
This can result in the PHC index being different from what it was prior
to the reset, as the PHC index is selected using kernel ID
allocation [3].
While driver rmmod/insmod are done by the user, a reset may occur
at anytime, without the user awareness, consequently, the driver
might receive a new PHC index after the reset, potentially compromising
the user experience.
To prevent this issue, the PHC flow will detect the reset during PHC
destruction and will skip the PHC unregister/register calls to preserve
the kernel PHC index.
During the reset flow, any attempt to get a PHC timestamp will fail as
expected, but the kernel PHC index will remain unchanged.
[1]: https://github.com/torvalds/linux/blob/v6.1/tools/testing/selftests/ptp/testptp.c
[2]: https://github.com/mlichvar/chrony
[3]: https://www.kernel.org/doc/html/latest/core-api/idr.html
Signed-off-by: Amit Bernstein <amitbern@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Link: https://patch.msgid.link/20250617110545.5659-3-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stanislav Fomichev says:
====================
udp_tunnel: remove rtnl_lock dependency
Recently bnxt had to grow back a bunch of rtnl dependencies because
of udp_tunnel's infra. Add separate (global) mutext to protect
udp_tunnel state.
====================
Link: https://patch.msgid.link/20250616162117.287806-1-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Drivers that are using ops lock and don't depend on RTNL lock
still need to manage it because udp_tunnel's RTNL dependency.
Introduce new udp_tunnel_nic_lock and use it instead of
rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from
udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to
grab udp_tunnel_nic_lock mutex and might sleep).
Cover more places in v4:
- netlink
- udp_tunnel_notify_add_rx_port (ndo_open)
- triggers udp_tunnel_nic_device_sync_work
- udp_tunnel_notify_del_rx_port (ndo_stop)
- triggers udp_tunnel_nic_device_sync_work
- udp_tunnel_get_rx_info (__netdev_update_features)
- triggers NETDEV_UDP_TUNNEL_PUSH_INFO
- udp_tunnel_drop_rx_info (__netdev_update_features)
- triggers NETDEV_UDP_TUNNEL_DROP_INFO
- udp_tunnel_nic_reset_ntf (ndo_open)
- notifiers
- udp_tunnel_nic_netdevice_event, depending on the event:
- triggers NETDEV_UDP_TUNNEL_PUSH_INFO
- triggers NETDEV_UDP_TUNNEL_DROP_INFO
- ethnl_tunnel_info_reply_size
- udp_tunnel_nic_set_port_priv (two intel drivers)
Cc: Michael Chan <michael.chan@broadcom.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250616162117.287806-4-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
udp_tunnel_push_rx_port will grab mutex in the next patch so
we can't use rcu. geneve_offload_rx_ports is called
from geneve_netdevice_event for NETDEV_UDP_TUNNEL_PUSH_INFO and
NETDEV_UDP_TUNNEL_DROP_INFO which both have ASSERT_RTNL.
Entries are added to and removed from the sock_list under rtnl
lock as well (when adding or removing a tunneling device).
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250616162117.287806-2-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The implementation of nla_data is as follows:
static inline void *nla_data(const struct nlattr *nla)
{
return (char *) nla + NLA_HDRLEN;
}
Excluding the case where nla is exactly -NLA_HDRLEN, it will not return
NULL. And it seems misleading to assume that it can, other than in this
corner case. So drop checks for this condition.
Flagged by Smatch.
Compile tested only.
Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250617-nfc-null-data-v1-1-c7525ead2e95@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski says:
====================
eth: migrate more drivers to new RXFH callbacks
Migrate a batch of drivers to the recently added dedicated
.get_rxfh_fields and .set_rxfh_fields ethtool callbacks.
====================
Link: https://patch.msgid.link/20250617014848.436741-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski says:
====================
eth: migrate some drivers to new RXFH callbacks
Migrate a batch of drivers to the recently added dedicated
.get_rxfh_fields and .set_rxfh_fields ethtool callbacks.
====================
Link: https://patch.msgid.link/20250617014555.434790-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>