MPFS (Multi PF Switch) is enabled by default in Multi-Host environments,
the driver keeps a list of desired unicast mac addresses of all vports
(vfs/Sfs) and applied to HW via L2_table FW command.
Add API to dynamically apply the list of MACs to HW when needed for next
patches, to utilize this new API in devlink eswitch active/in-active uAPI.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Adithya Jayachandran <ajayachandra@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20251108070404.1551708-3-saeed@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Adds DEVLINK_ESWITCH_MODE_SWITCHDEV_INACTIVE attribute to UAPI and
documentation.
Before having traffic flow through an eswitch, a user may want to have the
ability to block traffic towards the FDB until FDB is fully programmed and
the user is ready to send traffic to it. For example: when two eswitches
are present for vports in a multi-PF setup, one eswitch may take over the
traffic from the other when the user chooses.
Before this take over, a user may want to first program the inactive
eswitch and then once ready redirect traffic to this new eswitch.
switchdev modes transition semantics:
legacy->switchdev_inactive: Create switchdev mode normally, traffic not
allowed to flow yet.
switchdev_inactive->switchdev: Enable traffic to flow.
switchdev->switchdev_inactive: Block traffic on the FDB, FDB and
representros state and content is preserved.
When eswitch is configured to this mode, traffic is ignored/dropped on
this eswitch FDB, while current configuration is kept, e.g FDB rules and
netdev representros are kept available, FDB programming is allowed.
Example:
# start inactive switchdev
devlink dev eswitch set pci/0000:08:00.1 mode switchdev_inactive
# setup TC rules, representors etc ..
# activate
devlink dev eswitch set pci/0000:08:00.1 mode switchdev
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20251108070404.1551708-2-saeed@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski says:
====================
tools: ynl: turn the page-pool sample into a real tool
The page-pool YNL sample is quite useful. It's helps calculate
recycling rate and memory consumption. Since we still haven't
figured out a way to integrate with iproute2 (not for the lack
of thinking how to solve it) - create a ynltool command in ynl.
Add page-pool and qstats support.
Most commands can use the Python YNL CLI directly but low level
stats often need aggregation or some math on top to be useful.
Specifically in this patch set:
- page pool stats are aggregated and recycling rate computed
- per-queue stats are used to compute traffic balance across queues
v1: https://lore.kernel.org/20251104232348.1954349-1-kuba@kernel.org
====================
Link: https://patch.msgid.link/20251107162227.980672-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Based on past discussions it seems like integration of YNL into
iproute2 is unlikely. YNL itself is not great as a C library,
since it has no backward compat (we routinely change types).
Most of the operations can be performed with the generic Python
CLI directly. There is, however, a handful of operations where
summarization of kernel output is very useful (mostly related
to stats: page-pool, qstat).
Create a command (inspired by bpftool, I think it stood the test
of time reasonably well) to be able to plug the subcommands into.
Link: https://lore.kernel.org/1754895902-8790-1-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20251107162227.980672-2-kuba@kernel.org
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Some Marvell AP firmware used with mwl8k misbehaves when beacon frames
do not contain a WLAN_EID_DS_PARAMS element with the current channel.
It was reported on OpenWrt Github issues [0].
When hostapd/mac80211 omits DSSS Parameter Set from the beacon (which is
valid on some bands), the firmware stops transmitting sane frames and RX
status starts reporting bogus channel information. This makes AP mode
unusable.
Newer Marvell drivers (mwlwifi [1]) hard-code DSSS Parameter Set into
AP beacons for all chips, which suggests this is a firmware requirement
rather than a mwl8k-specific quirk.
Mirror that behaviour in mwl8k: when setting the beacon, check if
WLAN_EID_DS_PARAMS is present, and if not, extend the beacon and inject
a DSSS Parameter Set element, using the current channel from
hw->conf.chandef.chan.
Tested on Linksys EA4500 (88W8366).
[0] https://github.com/openwrt/openwrt/issues/19088
[1] db97edf20f/hif/fwcmd.c (L675)
Fixes: b64fe619e3 ("mwl8k: basic AP interface support")
Tested-by: Antony Kolitsos <zeusomighty@hotmail.com>
Signed-off-by: Pawel Dembicki <paweldembicki@gmail.com>
Link: https://patch.msgid.link/20251111100733.2825970-3-paweldembicki@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For monitoring, userspace will try to configure the VIF sdata, while the
driver may see the monitor_sdata that is created when only monitor
interfaces are up. This causes the odd situation that it may not be
possible to store the MU-MIMO configuration on monitor_sdata.
Fix this by storing that information on the VIF sdata and updating the
monitor_sdata when available and the interface is up. Also, adjust the
code that adds monitor_sdata so that it will configure MU-MIMO based on
the newly added interface or one of the existing ones.
This should give a mostly consistent behaviour when configuring MU-MIMO
on sniffer interfaces. Should the user configure MU-MIMO on multiple
sniffer interfaces, then mac80211 will simply select one of the
configurations. This behaviour should be good enough and avoids breaking
user expectations in the common scenarios.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20251110141514.677915f8f6bb.If4e04a57052f9ca763562a67248b06fd80d0c2c1@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Jeff Johnson says:
==================
ath.git update for v6.18-rc6
Fix an ath11k transmit status reporting issue. This issue has always
been present, but not reported until recently.
Bringing this through the current release since there is now a
userspace entity that wants to leverage this.
==================
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Breno Leitao says:
====================
net: netpoll: fix memory leak and add comprehensive selftests
Fix a memory leak in netpoll and introduce netconsole selftests that
expose the issue when running with kmemleak detection enabled.
This patchset includes a selftest for netpoll with multiple concurrent
users (netconsole + bonding), which simulates the scenario from test[1]
that originally demonstrated the issue allegedly fixed by commit
efa95b01da ("netpoll: fix use after free") - a commit that is now
being reverted.
Sending this to "net" branch because this is a fix, and the selftest
might help with the backports validation.
Link: https://lore.kernel.org/lkml/96b940137a50e5c387687bb4f57de8b0435a653f.1404857349.git.decot@googlers.com/ [1]
====================
Link: https://patch.msgid.link/20251107-netconsole_torture-v10-0-749227b55f63@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Create a netconsole test that puts a lot of pressure on the netconsole
list manipulation. Do it by creating dynamic targets and deleting
targets while messages are being sent. Also put interface down while the
messages are being sent, as creating parallel targets.
The code launches three background jobs on distinct schedules:
* Toggle netcons target every 30 iterations
* create and delete random_target every 50 iterations
* toggle iface every 70 iterations
This creates multiple concurrency sources that interact with netconsole
states. This is good practice to simulate stress, and exercise netpoll
and netconsole locks.
This test already found an issue as reported in [1]
Link: https://lore.kernel.org/all/20250901-netpoll_memleak-v1-1-34a181977dfc@debian.org/ [1]
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Andre Carvalho <asantostc@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251107-netconsole_torture-v10-3-749227b55f63@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extract the netconsole target creation from create_dynamic_target(), by
moving it from create_dynamic_target() into a new helper function. This
enables other tests to use the creation of netconsole targets with
arbitrary parameters and no sleep.
The new helper will be utilized by forthcoming torture-type selftests
that require dynamic target management.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251107-netconsole_torture-v10-2-749227b55f63@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
commit efa95b01da ("netpoll: fix use after free") incorrectly
ignored the refcount and prematurely set dev->npinfo to NULL during
netpoll cleanup, leading to improper behavior and memory leaks.
Scenario causing lack of proper cleanup:
1) A netpoll is associated with a NIC (e.g., eth0) and netdev->npinfo is
allocated, and refcnt = 1
- Keep in mind that npinfo is shared among all netpoll instances. In
this case, there is just one.
2) Another netpoll is also associated with the same NIC and
npinfo->refcnt += 1.
- Now dev->npinfo->refcnt = 2;
- There is just one npinfo associated to the netdev.
3) When the first netpolls goes to clean up:
- The first cleanup succeeds and clears np->dev->npinfo, ignoring
refcnt.
- It basically calls `RCU_INIT_POINTER(np->dev->npinfo, NULL);`
- Set dev->npinfo = NULL, without proper cleanup
- No ->ndo_netpoll_cleanup() is either called
4) Now the second target tries to clean up
- The second cleanup fails because np->dev->npinfo is already NULL.
* In this case, ops->ndo_netpoll_cleanup() was never called, and
the skb pool is not cleaned as well (for the second netpoll
instance)
- This leaks npinfo and skbpool skbs, which is clearly reported by
kmemleak.
Revert commit efa95b01da ("netpoll: fix use after free") and adds
clarifying comments emphasizing that npinfo cleanup should only happen
once the refcount reaches zero, ensuring stable and correct netpoll
behavior.
Cc: <stable@vger.kernel.org> # 3.17.x
Cc: Jay Vosburgh <jv@jvosburgh.net>
Fixes: efa95b01da ("netpoll: fix use after free")
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251107-netconsole_torture-v10-1-749227b55f63@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")
This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251107134452.198378-1-marco.crivellari@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Aksh Garg says:
====================
Fix IET verification implementation for CPSW driver
The CPSW module supports Intersperse Express Traffic (IET) and allows
the MAC layer to verify whether the peer supports IET through its MAC
merge sublayer, by sending a verification packet and waiting for its
response until the timeout. As defined in IEEE 802.3 Clause 99, the
verification process involves up to 3 verification attempts to
establish support.
This patch series fixes issues in the implementation of this IET
verification process.
====================
Link: https://patch.msgid.link/20251106092305.1437347-1-a-garg7@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The am65_cpsw_iet_verify_wait() function attempts verification 20 times,
toggling the AM65_CPSW_PN_IET_MAC_LINKFAIL bit in each iteration. When
the LINKFAIL bit transitions from 1 to 0, the MAC merge layer initiates
the verification process and waits for the timeout configured in
MAC_VERIFY_CNT before automatically retransmitting. The MAC_VERIFY_CNT
register is configured according to the user-defined verify/response
timeout in am65_cpsw_iet_set_verify_timeout_count(). As per IEEE 802.3
Clause 99, the hardware performs this automatic retry up to 3 times.
Current implementation toggles LINKFAIL after the user-configured
verify/response timeout in each iteration, forcing the hardware to
restart verification instead of respecting the MAC_VERIFY_CNT timeout.
This bypasses the hardware's automatic retry mechanism.
Fix this by moving the LINKFAIL bit toggle outside the retry loop and
reducing the retry count from 20 to 3. The software now only monitors
the status register while the hardware autonomously handles the 3
verification attempts at proper MAC_VERIFY_CNT intervals.
Fixes: 49a2eb9068 ("net: ethernet: ti: am65-cpsw-qos: Add Frame Preemption MAC Merge support")
Signed-off-by: Aksh Garg <a-garg7@ti.com>
Link: https://patch.msgid.link/20251106092305.1437347-3-a-garg7@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The CPSW module uses the MAC_VERIFY_CNT bit field in the
CPSW_PN_IET_VERIFY_REG_k register to set the verify/response timeout
count. This register specifies the number of clock cycles to wait before
resending a verify packet if the verification fails.
The verify/response timeout count, as being set by the function
am65_cpsw_iet_set_verify_timeout_count() is hardcoded for 125MHz
clock frequency, which varies based on PHY mode and link speed.
The respective clock frequencies are as follows:
- RGMII mode:
* 1000 Mbps: 125 MHz
* 100 Mbps: 25 MHz
* 10 Mbps: 2.5 MHz
- QSGMII/SGMII mode: 125 MHz (all speeds)
Fix this by adding logic to calculate the correct timeout counts
based on the actual PHY interface mode and link speed.
Fixes: 49a2eb9068 ("net: ethernet: ti: am65-cpsw-qos: Add Frame Preemption MAC Merge support")
Signed-off-by: Aksh Garg <a-garg7@ti.com>
Link: https://patch.msgid.link/20251106092305.1437347-2-a-garg7@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In tls_handshake_accept(), a netlink message is allocated using
genlmsg_new(). In the error handling path, genlmsg_cancel() is called
to cancel the message construction, but the message itself is not freed.
This leads to a memory leak.
Fix this by calling nlmsg_free() in the error path after genlmsg_cancel()
to release the allocated memory.
Fixes: 2fd5532044 ("net/handshake: Add a kernel API for requesting a TLSv1.3 handshake")
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20251106144511.3859535-1-zilin@seu.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The current CLC proposal message construction uses a mix of
`ini->smc_type_v1/v2` and `pclc_base->hdr.typev1/v2` to decide whether
to include optional extensions (IPv6 prefix extension for v1, and v2
extension). This leads to a critical inconsistency: when
`smc_clc_prfx_set()` fails - for example, in IPv6-only environments with
only link-local addresses, or when the local IP address and the outgoing
interface’s network address are not in the same subnet.
As a result, the proposal message is assembled using the stale
`ini->smc_type_v1` value—causing the IPv6 prefix extension to be
included even though the header indicates v1 is not supported.
The peer then receives a malformed CLC proposal where the header type
does not match the payload, and immediately resets the connection.
The fix ensures consistency between the CLC header flags and the actual
payload by synchronizing `ini->smc_type_v1` with `pclc_base->hdr.typev1`
when prefix setup fails.
Fixes: 8c3dca341a ("net/smc: build and send V2 CLC proposal")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20251107024029.88753-1-alibuda@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ankit Garg says:
====================
gve: Improve RX buffer length management
This patch series improves the management of the RX buffer length for
the DQO queue format in the gve driver. The goal is to make RX buffer
length config more explicit, easy to change, and performant by default.
We accomplish that in four patches:
1. Currently, the buffer length is implicitly coupled with the header
split setting, which is an unintuitive and restrictive design. The
first patch decouples the RX buffer length from the header split
configuration.
2. The second patch is a preparatory step for third. It converts the XDP
config verification method to use extack for better error reporting.
3. The third patch exposes the `rx_buf_len` parameter to userspace via
ethtool, allowing user to directly view or modify the RX buffer length
if supported by the device.
4. The final patch improves the out-of-the-box RX single stream throughput
by >10% by changing the driver's default behavior to select the
maximum supported RX buffer length advertised by the device during
initialization.
====================
Link: https://patch.msgid.link/20251106192746.243525-1-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Change the driver's default behavior to prefer the largest available RX
buffer length supported by the device for DQO format, rather than always
using the hardcoded 2K default.
Previously, the driver would initialize with
`GVE_DEFAULT_RX_BUFFER_SIZE` (2K), even if the device advertised support
for a larger length (e.g., 4K).
Performance observations:
- With LRO disabled, we observed >10% improvement in RX single stream
throughput when MTU >=2048.
- With LRO enabled, we observed >10% improvement in RX single stream
throughput when MTU >=1460.
- No regressions were observed.
Signed-off-by: Ankit Garg <nktgrg@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Jordan Rhee <jordanrhee@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Link: https://patch.msgid.link/20251106192746.243525-5-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for getting and setting the RX buffer length via the
ethtool ring parameters (`ethtool -g`/`-G`). The driver restricts the
allowed buffer length to 2048 (SZ_2K) by default and allows 4096 (SZ_4K)
based on device options.
As XDP is only supported when the `rx_buf_len` is 2048, the driver now
enforces this in two places:
1. In `gve_xdp_set`, rejecting XDP programs if the current buffer
length is not 2048.
2. In `gve_set_rx_buf_len_config`, rejecting buffer length changes if XDP
is loaded and the new length is not 2048.
Signed-off-by: Ankit Garg <nktgrg@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Jordan Rhee <jordanrhee@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Link: https://patch.msgid.link/20251106192746.243525-4-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Previously, enabling header split via `gve_set_hsplit_config` also
implicitly changed the RX buffer length to 4K (if supported by the
device). This coupled two settings that should be orthogonal; this patch
removes that side effect.
After this change, `gve_set_hsplit_config` only toggles the header
split configuration. The RX buffer length is no longer affected and
must be configured independently.
Signed-off-by: Ankit Garg <nktgrg@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Jordan Rhee <jordanrhee@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Link: https://patch.msgid.link/20251106192746.243525-2-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King says:
====================
net: stmmac: ingenic: convert to set_phy_intf_sel()
Convert ingenic to use the new ->set_phy_intf_sel() method that was
recently introduced in net-next.
This is the largest of the conversions, as there is scope for cleanups
along with the conversion.
====================
Link: https://patch.msgid.link/aQ2tgEu-dudzlZlg@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
x1000, x1600 and x1830 only accept RMII mode. PHY_INTF_SEL_RMII is only
selected with PHY_INTERFACE_MODE_RMII, and PHY_INTF_SEL_RMII has been
validated by the SoC's .valid_phy_intf_sel bitmask. Thus, checking the
interface mode in these functions becomes unnecessary. Remove these.
jz4775 is similar, except for a greater set of PHY_INTF_SEL_x valies.
Also remove the switch statement here.
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vHHqI-0000000DjrV-3ygL@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use stmmac_get_phy_intf_sel() to decode the PHY interface mode to the
phy_intf_sel value, validate the result against the SoC specific
supported phy_intf_sel values, and pass into the SoC specific
set_mode() methods, replacing the local phy_intf_sel variable. This
provides the value for the MACPHYC_PHY_INFT_MASK field.
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vHHq8-0000000DjrJ-2NRK@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>