After the recent removal of gpio support member no_carrier isn't
needed any longer.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The two callers of __fixed_phy_add() both pass PHY_POLL, so we can
remove the irq argument to simplify the function.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The pmtu test takes nearly an hour when run on a debug kernel
(10min on a normal kernel, so the debug slow down is quite significant).
NIPA tries to ensure all results are delivered by a certain deadline
so this prevents it from retrying the test in case of a flake.
Looks like one of the slowest operations in the test is calling out
to ./openvswitch/ovs-dpctl.py to remove potential leftover OvS interfaces.
Check whether the interfaces exist in the first place in sysfs,
since it can be done directly in bash it is very fast.
This should save us around 20-30% of the test runtime.
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250906214535.3204785-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tariq Toukan says:
====================
mlx5-next updates 2025-09-09
The following pull-request contains a common mlx5 update.
* tag 'mlx5-rs-fec-ifc' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Add RS FEC histogram infrastructure
====================
Link: https://patch.msgid.link/1757413460-539097-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add selftest for the IPv6 fragmentation regression which affected
several stable kernels.
Commit a18dfa9925 ("ipv6: save dontfrag in cork") was backported to
stable without some prerequisite commits. This caused a regression when
sending IPv6 UDP packets by preventing fragmentation and instead
returning -1 (EMSGSIZE).
Add selftest to check for this issue by attempting to send a packet
larger than the interface MTU. The packet will be fragmented on a
working kernel, with sendmsg(2) correctly returning the expected number
of bytes sent. When the regression is present, sendmsg returns -1 and
sets errno to EMSGSIZE.
Link: https://lore.kernel.org/stable/aElivdUXqd1OqgMY@karahi.gladserv.com
Signed-off-by: Brett A C Sheffield <bacs@librecast.net>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250903154925.13481-1-bacs@librecast.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Unlike VLAN devices, HSR changes the lower device’s rx_handler, which
prevents the lower device from being attached to another master.
Switch to using netdev_master_upper_dev_link() when setting up the lower
device.
This could improves user experience, since ip link will now display the
HSR device as the master for its ports.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250902065558.360927-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Hangbin Liu says:
====================
bonding: support aggregator selection based on port priority
This patchset introduces a new per-port bonding option: `ad_actor_port_prio`.
It allows users to configure the actor's port priority, which can then be used
by the bonding driver for aggregator selection based on port priority.
This provides finer control over LACP aggregator choice, especially in setups
with multiple eligible aggregators over 2 switches.
====================
Link: https://patch.msgid.link/20250902064501.360822-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add comprehensive selftest to verify:
- Per-port actor priority setting via ad_actor_port_prio
- Aggregator selection behavior with port_priority ad_select policy
Also move cmd_jq helper from forwarding/lib.sh to net/lib.sh for
broader reusability across network selftests.
Here is the result output
# ./bond_lacp_prio.sh
TEST: bond 802.3ad (ad_actor_port_prio setting) [ OK ]
TEST: bond 802.3ad (ad_actor_port_prio select) [ OK ]
TEST: bond 802.3ad (ad_actor_port_prio switch) [ OK ]
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250902064501.360822-4-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add a new ad_select policy 'port_priority' that uses the per-port
actor priority values (set via ad_actor_port_prio) to determine
aggregator selection.
This allows administrators to influence which ports are preferred
for aggregation by assigning different priority values, providing
more flexible load balancing control in LACP configurations.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250902064501.360822-3-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Introduce a new netlink attribute 'actor_port_prio' to allow setting
the LACP actor port priority on a per-slave basis. This extends the
existing bonding infrastructure to support more granular control over
LACP negotiations.
The priority value is embedded in LACPDU packets and will be used by
subsequent patches to influence aggregator selection policies.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250902064501.360822-2-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tariq Toukan says:
====================
Support exposing raw cycle counters in PTP and mlx5
This series by Carolina adds support in ptp and usage in mlx5 for
exposing the raw free-running cycle counter of PTP hardware clocks.
This is V2. Find previous one here:
https://lore.kernel.org/all/1752556533-39218-1-git-send-email-tariqt@nvidia.com/
Find detailed description by Carolina below [1].
[1]
This patch series introduces support for exposing the raw free-running
cycle counter of PTP hardware clocks. When the device is in free-running
mode, it emits timestamps as raw cycle values instead of nanoseconds.
These values may be passed directly to user space through:
- fwctl: exposes internal device event records that include raw
cycle-based timestamps.
- DPDK: retrieves CQEs that contain raw cycle counters, which are passed
to user space unmodified.
To address this, the series introduces two new ioctl commands that allow
userspace to query the device's raw cycle counter together with host
time:
- PTP_SYS_OFFSET_PRECISE_CYCLES
- PTP_SYS_OFFSET_EXTENDED_CYCLES
These commands work like their existing counterparts but return the
device timestamp in cycle units instead of real-time nanoseconds. This
allows user space to collect (cycle, time) pairs and build a mapping
between the device’s free-running clock and host time.
This can also be useful in the XDP fast path: if a driver inserts the
raw cycle value into metadata instead of a real-time timestamp, it can
avoid the overhead of converting cycles to time in the kernel. Then
userspace can resolve the cycle-to-time mapping using this ioctl when
needed.
The ioctl enables user space to correlate those with host time, without
requiring the PHC to be synchronized, so long as the drift remains
stable during collection.
Adds the new PTP ioctls and integrates support in ptp_ioctl():
- ptp: Add ioctl commands to expose raw cycle counter values
Support for exposing raw cycles in mlx5:
- net/mlx5: Extract MTCTR register read logic into helper function
- net/mlx5: Support getcyclesx and getcrosscycles
====================
Link: https://patch.msgid.link/1755008228-88881-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Introduce two new ioctl commands, PTP_SYS_OFFSET_PRECISE_CYCLES and
PTP_SYS_OFFSET_EXTENDED_CYCLES, to allow user space to access the
raw free-running cycle counter from PTP devices.
These ioctls are variants of the existing PRECISE and EXTENDED
offset queries, but instead of returning device time in realtime,
they return the raw cycle counter value.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Link: https://patch.msgid.link/1755008228-88881-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In the old days, RDS used FMR (Fast Memory Registration) to register
IB MRs to be used by RDMA. A newer and better verbs based
registration/de-registration method called FRWR (Fast Registration
Work Request) was added to RDS by commit 1659185fb4 ("RDS: IB:
Support Fastreg MR (FRMR) memory registration mode") in 2016.
Detection and enablement of FRWR was done in commit 2cb2912d65
("RDS: IB: add Fastreg MR (FRMR) detection support"). But said commit
added an extern bool prefer_frmr, which was not used by said commit -
nor used by later commits. Hence, remove it.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20250905101958.4028647-1-haakon.bugge@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King says:
====================
net: stmmac: mdio cleanups
Clean up the stmmac MDIO code:
- provide an address register formatter to avoid repeated code
- provide a common function to wait for the busy bit to clear
- pre-compute the CR field (mdio clock divider)
- move address formatter into read/write functions
- combine the read/write functions into a common accessor function
- move runtime PM handling into common accessor function
- rename register constants to better reflect manufacturer names
- move stmmac_clk_csr_set() into stmmac_mdio
- make stmmac_clk_csr_set() return the CR field value and remove
priv->clk_csr
- clean up if() range tests in stmmac_clk_csr_set()
- use STMMAC_CSR_xxx definitions in initialisers
For Qualcomm QCS9100 Ride R3 board with the AQR115C PHY:
Tested-by: Mohd Ayaan Anwar <quic_mohdayaa@quicinc.com>
====================
Link: https://patch.msgid.link/aLmBwsMdW__XBv7g@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All the readl_poll_timeout()s follow the same pattern - test a register
for a bit being clear every 100us, and timeout after 10ms returning
-EBUSY. Wrap this up into a function to avoid duplicating this.
This slightly changes the return value for stmmac_mdio_write() if the
second readl_poll_timeout() fails - rather than returning -ETIMEDOUT
we return -EBUSY matching the stmmac_mdio_read() behaviour.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Mohd Ayaan Anwar <quic_mohdayaa@quicinc.com>
Link: https://patch.msgid.link/E1uu8nx-00000001voa-0tJ0@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet says:
====================
ipv6: snmp: avoid performance issue with RATELIMITHOST
Addition of ICMP6_MIB_RATELIMITHOST in commit d0941130c9
("icmp: Add counters for rate limits") introduced a performance drop
in case of DOS (like receiving UDP packets to closed ports).
Per netns ICMP6_MIB_RATELIMITHOST tracking uses per-cpu storage and
is enough, we do not need per-device and slow tracking for this metric.
In v2 of this series, I completed the removal of SNMP_MIB_SENTINEL
in all the kernel for consistency.
v1: https://lore.kernel.org/20250904092432.113c4940@kernel.org
====================
Link: https://patch.msgid.link/20250905165813.1470708-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 9bb88c6596 ("selftests: net: test extacks in netlink dumps")
moved netlink-dumps from TEST_GEN_PROGS to YNL_GEN_FILES.
But _FILES are not for tests, rather for utilities / helpers.
Create YNL_GEN_PROGS and include netlink-dumps there.
This makes netlink-dumps part of executed tests, again.
Link: https://patch.msgid.link/20250906211351.3192412-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Recent changes to make netlink socket memory accounting must
have broken the implicit assumption of the netlink-dump test
that we can fit exactly 64 dumps into the socket. Handle the
failure mode properly, and increase the dump count to 80
to make sure we still run into the error condition if
the default buffer size increases in the future.
Link: https://patch.msgid.link/20250906211351.3192412-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean says:
====================
10G-QXGMII for AQR412C, Felix DSA and Lynx PCS driver
Introduce the first user of the "10g-qxgmii" phy-mode, since its
introduction from commit 5dfabcdd76 ("dt-bindings: net:
ethernet-controller: add 10g-qxgmii mode").
The arch/arm64/boot/dts/freescale/fsl-ls1028a-qds-13bb.dtso already
exists upstream, but has phy-mode = "usxgmii", which comes from the fact
that the AQR412(C) PHY does not distinguish between the two modes.
Yet, the distinction is crucial for the upcoming SerDes driver for the
LS1028A platform.
The series is comprised of:
- preliminary patches to the Lynx PCS and Felix DSA driver which accept
the phy-mode and treat it like "usxgmii"
- an ad-hoc whitelisting mechanism in the Aquantia PHY driver based on
firmware version, which was agreed upon with Marvell, and which serves
as "detection"
- in-band auto-negotiation capability reporting and configuration. This
makes sure this feature is enabled in the PHY, because the Lynx PCS
only works with USXGMII/10G-QXGMII in-band autoneg enabled.
Notably, it lacks a device tree update, which will come later, but
should not be strictly necessary. The expectation is for the Aquantia
PHY driver to pick up "10g-qxgmii" with existing device trees as well,
which it does, except for the slightly confusing "configuring for
inband/usxgmii link mode" initial message. This changes to "configuring
for inband/10g-qxgmii link mode" once phylink gets a chance to pick up
the phydev->interface in its pl->link_config.interface.
$ ip link set swp3 up
mscc_felix 0000:00:00.5 swp3: configuring for inband/usxgmii link mode
mscc_felix 0000:00:00.5 swp3: phylink_mac_config: mode=inband/usxgmii/none adv=0000000,00000000,00008000,0002606c pause=04
mscc_felix 0000:00:00.5 swp3: phylink_phy_change: phy interface 10g-qxgmii link 0
mscc_felix 0000:00:00.5 swp3: phylink_phy_change: phy interface 10g-qxgmii link 1
mscc_felix 0000:00:00.5 swp3: phylink_mac_config: mode=inband/10g-qxgmii/none adv=0000000,00000000,00008000,0002606c pause=00
mscc_felix 0000:00:00.5 swp3: Link is Up - 2.5Gbps/Full - flow control off
$ ip link set swp3 down
mscc_felix 0000:00:00.5 swp3: phylink_phy_change: phy interface 10g-qxgmii link 0
mscc_felix 0000:00:00.5 swp3: Link is Down
$ ip link set swp3 up
mscc_felix 0000:00:00.5 swp3: configuring for inband/10g-qxgmii link mode
mscc_felix 0000:00:00.5 swp3: phylink_mac_config: mode=inband/10g-qxgmii/none adv=0000000,00000000,00008000,0002606c pause=04
mscc_felix 0000:00:00.5 swp3: phylink_phy_change: phy interface 10g-qxgmii link 0
mscc_felix 0000:00:00.5 swp3: phylink_phy_change: phy interface 10g-qxgmii link 1
mscc_felix 0000:00:00.5 swp3: Link is Up - 2.5Gbps/Full - flow control off
====================
Link: https://patch.msgid.link/20250903130730.2836022-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The quad port PHYs (AQR4*) have 4 system interfaces, and some of them,
like AQR412C, can be used with a special firmware provisioning which
multiplexes all ports over a single host-side SerDes lane. The protocol
used over this lane is Cisco 10G-QXGMII feature, or "MUSX", as Aquantia
seems to call it.
One such example is the AQR412C PHY from the NXP SPF-30841 10G-QXGMII
add-in card, which uses this firmware file:
https://github.com/nxp-qoriq/qoriq-firmware-aquantia/blob/master/AQR-G3_v4.3.C-AQR_NXP_SPF-30841_MUSX_ID40019_VER1198.cld
There seems to be no disagreement, including from Marvell FAE, that
10G-QXGMII is reported to the host over MDIO as USXGMII and
indistinguishable from it. This includes the registers from the
provisioning based on which the firmware configures a single system
interface (lane C in the case of SPF-30841) to multiplex all ports -
they are also only accessible from the firmware, or over I2C (?!).
However, the Linux MAC and especially SerDes drivers may need to know if
it is using 1 port per lane (USXGMII) or 4 ports per lane (10G-QXGMII).
In the downstream Layerscape SDK we have previously implemented a
simpler scheme where for certain PHY interface modes, we trust the
device tree and never let the PHY driver overwrite phydev->interface:
862694a496
but for upstream, a nicer detection method is implemented, where
although we can not distinguish USXGMII from 10G-QXGMII per se, we
create a whitelist of firmware fingerprints for which USXGMII is
translated into 10G-QXGMII. At the time of writing, it is expected that
this should only happen for the NXP SPF-30841 card, although extending
for more is trivial - just uncomment the phydev_dbg() in
aqr_build_fingerprint().
An advantage of this method is that it doesn't strictly require updates
to arch/arm64/boot/dts/freescale/fsl-ls1028a-qds-13bb.dtso, since the
PHY driver will transition from "usxgmii" to "10g-qxgmii".
All aqr_translate_interface() callers have also previously called
aqr107_probe(), so dereferencing phydev->priv is safe.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20250903130730.2836022-7-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some PHY features cannot be queried through MDIO registers and require
alternative driver detection methods.
One such feature is 10G-QXGMII (4 ports of up to 2.5G multiplexed over
a single SerDes lane), or "MUSX" as it is called by Aquantia/Marvell.
The firmware has provisioning to modify some registers which seem
inaccessible for read or write over MDIO, which configure an internal
mux for MUSX. To the host, over MDIO, the system interface appears
indistinguishable from single-port-per-lane USXGMII.
Marvell FAE Ziang You recommended a detection method for this feature
based on a tuple which should hopefully identify the firmware build
uniquely. Most of the tuple items are already printed by
aqr107_chip_info(), and an extra set is the misc ID (reg 1.c41d) and the
misc version (reg 1.c41e). These are auto-generated by the Marvell
firmware tool for formal builds, and should be unique (not my claim).
In addition, at least for the builds provided to NXP and redistributed
here:
https://github.com/nxp-qoriq/qoriq-firmware-aquantia/tree/master
these registers are part of the name, for example in
AQR-G3_v4.3.C-AQR_NXP_SPF-30841_MUSX_ID40019_VER1198.cld, reg 1.c41d
will contain 40019 and reg 1.c41e will contain 1198.
Note that according to commit 43429a0353 ("net: phy: aquantia: report
PHY details like firmware version"), the "chip may be functional even
w/o firmware image." In that case, we can't construct a fingerprint and
it will remain zero. That shouldn't imact the use case though.
Dereferencing phydev->priv should be ok in all cases: all
aqr_gen1_config_init() callers have also previously called
aqr107_probe().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20250903130730.2836022-6-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The Global System Configuration registers for each media side link speed
have bit 3 which controls auto-negotiation for the system interface.
Since bits 2:0 of the same register indicate the SerDes protocol for the
same system interface, it makes sense to filter these registers for the
SerDes protocol matching phydev->interface, and to read/write the
auto-negotiation bit.
However, experimentally, USXGMII in-band auto-negotiation is unaffected
by this bit, and instead reacts to bit 3 of register 4.C441 (PHY XS
Transmit Reserved Vendor Provisioning 2).
Both the Global System Configuration as well as the aforementioned
register 4.C441 are documented as PD (Provisioning Defaults), i.e. each
PHY firmware may provision its own values.
I was initially planning to only read these values and not support
changing them (instead just the MAC PCS reconfigures itself, if it can).
But there is one problem: Linux expects that the in-band capability is
configured the same for all speeds where a given SerDes protocol is used.
I was going to add logic that detects mismatched vendor provisioning
(in-band autoneg enabled for speed X, disabled for speed Y) and warn
about it and return 0 (unknown capabilities).
Funnily enough, there is already a known instance where speed 2500 has
"autoneg 1" and the lower speeds have "autoneg 0":
https://lore.kernel.org/netdev/aJH8n0zheqB8tWzb@FUE-ALEWI-WINX/
I don't think it's worth fighting the battle with inconsistent firmware
images built by Aquantia/Marvell, and reporting that to the user, when
we have the ability to modify these fields to values that make sense to
us. We see the same situation with all the aqr*_get_features() functions
which fix up nonsensical supported link modes.
Furthermore, altering the in-band auto-negotiation setting can be
considered a minor change, compared to changing the SerDes protocol in
its entirety, for which we are still not prepared.
Testing was done on:
- AQR107 (Gen2) in USXGMII mode, as found on the NXP LX2160A-RDB.
- AQR112 (Gen3) in USXGMII mode, as found on the NXP SCH-30842 riser
card, plugged into LS1028A-QDS.
- AQR412C (Gen3) in 10G-QXGMII mode, as found on the NXP SCH-30841 riser
card, plugged into the LS1028A-QDS.
- AQR115 (Gen4) in SGMII mode, as found on the NXP LS1046A-RDB rev E.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20250903130730.2836022-5-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sometimes people with unknown firmware provisioning post on the mailing
lists asking for support. The information collected by
aqr_gen2_read_global_syscfg() is sufficiently important to warrant a
phydev_dbg() that can easily be turned into a verbose print by the
system owner in case some debugging is needed.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20250903130730.2836022-4-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>