Commit Graph

1427104 Commits

Author SHA1 Message Date
Kavita Kavita
bed80a08ff wifi: mac80211: Advertise IEEE 802.1X authentication support
Advertise support for IEEE 802.1X authentication protocol directly from
mac80211, without depending on driver indication of (Re)Association
frame encryption capability.

As specified in "IEEE P802.11bi/D4.0, clauses 12.16.5 and 12.16.8.2",
IEEE 802.1X authentication can operate with or without (Re)Association
frame encryption support. Therefore, mac80211 can safely advertise
802.1X support independently of driver capabilities.

Signed-off-by: Kavita Kavita <kavita.kavita@oss.qualcomm.com>
Link: https://patch.msgid.link/20260226185553.1516290-6-kavita.kavita@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:53:23 +01:00
Kavita Kavita
9347878b15 wifi: mac80211: Add support for IEEE 802.1X authentication protocol in non-AP STA mode
Add support for the IEEE 802.1X authentication protocol in non-AP STA
mode, as specified in "IEEE P802.11bi/D4.0, 12.16.5".

IEEE 802.1X authentication involves multiple Authentication frame
exchanges, with the non-AP STA and AP alternating transaction
sequence numbers. The number of Authentication frame exchanges
depends on the EAP method in use. For IEEE 802.1X authentication,
process only Authentication frames with the expected transaction
sequence number.

For IEEE 802.1X Authentication, Table 9-71 specifies that the
Encapsulation Length field as specified in Clause 9.4.1.82 shall be
present in all IEEE 802.1X Authentication frames. Drop the frame in
mac80211 if the Encapsulation Length field is missing.

After receiving the final Authentication frame with status code
WLAN_STATUS_8021X_AUTH_SUCCESS from the AP, mac80211 marks the state
as authenticated, as it indicates the EAP handshake has completed
successfully over the Authentication frames as specified in
Clause 12.16.5.

In the PMKSA caching case, only two Authentication frames are
exchanged if the AP identifies a valid PMKSA. As specified in
Clause 12.16.8.3, the AP shall then set the Status Code to
WLAN_STATUS_SUCCESS in the final Authentication frame and must not
include an encapsulated EAPOL PDU. This frame will be the final
Authentication frame from the AP when PMKSA caching is enabled,
and mac80211 marks the state as authenticated.

In case of authentication success or failure, forward the
Authentication frame to userspace (e.g. wpa_supplicant), and let
userspace validate the Authentication frame from the AP as per the
specification.

Signed-off-by: Kavita Kavita <kavita.kavita@oss.qualcomm.com>
Link: https://patch.msgid.link/20260226185553.1516290-5-kavita.kavita@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:53:23 +01:00
Kavita Kavita
bd77375097 wifi: cfg80211: add support for IEEE 802.1X Authentication Protocol
Add an extended feature flag NL80211_EXT_FEATURE_IEEE8021X_AUTH to
allow a driver to indicate support for the IEEE 802.1X authentication
protocol in non-AP STA mode, as defined in
"IEEE P802.11bi/D4.0, 12.16.5".

In case of SME in userspace, the Authentication frame body is prepared
in userspace while the driver finalizes the Authentication frame once
it receives the required fields and elements. The driver indicates
support for IEEE 802.1X authentication using the extended feature flag
so that userspace can initiate IEEE 802.1X authentication.

When the feature flag is set, process IEEE 802.1X Authentication frames
from userspace in non-AP STA mode. If the flag is not set, reject
IEEE 802.1X Authentication frames.

Define a new authentication type NL80211_AUTHTYPE_IEEE8021X for
IEEE 802.1X authentication.

Signed-off-by: Kavita Kavita <kavita.kavita@oss.qualcomm.com>
Link: https://patch.msgid.link/20260226185553.1516290-4-kavita.kavita@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:53:23 +01:00
Kavita Kavita
0e88342dbd wifi: mac80211: Advertise EPPKE support based on driver capabilities
Advertise support for the Enhanced Privacy Protection Key Exchange (EPPKE)
authentication protocol in mac80211 when the driver supports
(Re)Association frame encryption, since EPPKE mandates (Re)Association
frame encryption.

Signed-off-by: Kavita Kavita <kavita.kavita@oss.qualcomm.com>
Link: https://patch.msgid.link/20260226185553.1516290-3-kavita.kavita@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:53:22 +01:00
Kavita Kavita
ae61f43df1 wifi: mac80211_hwsim: Advertise support for (Re)Association frame encryption
Advertise support for (Re)Association frame encryption in mac80211_hwsim
for testing scenarios.

Signed-off-by: Kavita Kavita <kavita.kavita@oss.qualcomm.com>
Link: https://patch.msgid.link/20260226185553.1516290-2-kavita.kavita@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:53:22 +01:00
Sai Pratyusha Magam
a536be9231 wifi: mac80211: Fix AAD/Nonce computation for management frames with MLO
Per IEEE Std 802.11be-2024, 12.5.2.3.3, if the MPDU is an
individually addressed Data frame between an AP MLD and a
non-AP MLD associated with the AP MLD, then A1/A2/A3
will be MLD MAC addresses. Otherwise, A1/A2/A3 will be
over-the-air link MAC addresses.

Currently, during AAD and Nonce computation for software based
encryption/decryption cases, mac80211 directly uses the addresses it
receives in the skb frame header. However, after the first
authentication, management frame addresses for non-AP MLD stations
are translated to MLD addresses from over the air link addresses in
software. This means that the skb header could contain translated MLD
addresses, which when used as is, can lead to incorrect AAD/Nonce
computation.

Ensure that the right set of addresses is used as follows:

In the receive path, stash the pre-translated link addresses in
ieee80211_rx_data and use them for the AAD/Nonce computations
when required.

In the transmit path, offload the encryption for a CCMP/GCMP key
to the hwsim driver that can then ensure that encryption and hence
the AAD/Nonce computations are performed on the frame containing the
right set of addresses, i.e., MLD addresses if unicast data frame and
link addresses otherwise.

To do so, register the set key handler in hwsim driver so mac80211 is
aware that it is the driver that would take care of encrypting the
frame. Offload encryption for a CCMP/GCMP key, while keeping the
encryption for WEP/TKIP and MMIE generation for an AES_CMAC or an
AES_GMAC key in the software crypto in the MAC layer.
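
As a rough illustration of the receive-path idea, the sketch below picks which address set feeds the AAD/Nonce computation. It is not the mac80211 implementation: the sketch_* types are hypothetical stand-ins for ieee80211_rx_data, and the selection rule follows the maintainer's note in this commit (stashed link addresses are used only for unicast non-data frames).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ETH_ALEN 6

struct sketch_addrs {
	uint8_t a1[SKETCH_ETH_ALEN];
	uint8_t a2[SKETCH_ETH_ALEN];
	uint8_t a3[SKETCH_ETH_ALEN];
};

/* Hypothetical stand-in for the per-frame receive context. */
struct sketch_rx {
	struct sketch_addrs hdr;        /* header, possibly translated to MLD addresses */
	struct sketch_addrs link_addrs; /* over-the-air addresses stashed before translation */
	bool link_addrs_valid;          /* true when link_addrs was stashed */
	bool is_data;
	bool is_unicast;
};

/* Return the address set to use when building the AAD and Nonce. */
const struct sketch_addrs *sketch_aad_addrs(const struct sketch_rx *rx)
{
	assert(rx != NULL);

	/* Unicast non-data frames are protected with link addresses. */
	if (rx->is_unicast && !rx->is_data && rx->link_addrs_valid)
		return &rx->link_addrs;

	/* Everything else uses the addresses found in the header. */
	return &rx->hdr;
}
```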

Co-developed-by: Rohan Dutta <quic_drohan@quicinc.com>
Signed-off-by: Rohan Dutta <quic_drohan@quicinc.com>
Signed-off-by: Sai Pratyusha Magam <sai.magam@oss.qualcomm.com>
Link: https://patch.msgid.link/20260226042959.3766157-1-sai.magam@oss.qualcomm.com
[only store and apply link_addrs for unicast non-data
 rather than storing always and applying for !unicast_data]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:53:19 +01:00
Rosen Penev
5249fcc0ef wifi: rt2x00: use generic nvmem_cell_get
The library doesn't necessarily depend on OF. This codepath is used by
both soc (OF only) and pci (no such requirement). After this, the only
OF-specific function is of_get_mac_address, which is needed for nvmem.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Acked-by: Stanislaw Gruszka <stf_xl@wp.pl>
Link: https://patch.msgid.link/20260223214004.19960-1-rosenp@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:31:15 +01:00
Sriram R
e098c26b35 wifi: mac80211: fetch unsolicited probe response template by link ID
Currently, the unsolicited probe response template is always fetched from
the default link of a virtual interface in both Multi-Link Operation (MLO)
and non-MLO cases. However, in the MLO case there is a need to fetch the
unsolicited probe response template from a specific link instead of the
default link.

Hence, add support for fetching the unsolicited probe response template
based on the link ID from the corresponding link data.
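
A minimal sketch of a per-link lookup, assuming a hypothetical layout in which the interface holds an array of link pointers indexed by link ID. None of the sketch_* names or the SKETCH_MAX_LINKS value are taken from mac80211.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_LINKS 15 /* hypothetical upper bound on link IDs */

struct sketch_template {
	const char *name; /* placeholder for the frame template contents */
};

struct sketch_link {
	const struct sketch_template *unsol_probe_resp;
};

struct sketch_vif {
	/* NULL for links that are not set up; in this sketch a non-MLO
	 * interface is assumed to use slot 0 only. */
	const struct sketch_link *links[SKETCH_MAX_LINKS];
};

/* Fetch the template of one specific link, or NULL if there is none. */
const struct sketch_template *
sketch_get_unsol_probe_resp(const struct sketch_vif *vif, unsigned int link_id)
{
	const struct sketch_link *link;

	assert(vif != NULL);

	if (link_id >= SKETCH_MAX_LINKS)
		return NULL;

	link = vif->links[link_id];
	if (!link)
		return NULL;

	return link->unsol_probe_resp;
}
```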

Signed-off-by: Sriram R <quic_srirrama@quicinc.com>
Co-developed-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com>
Signed-off-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com>
Link: https://patch.msgid.link/20260220-fils-prob-by-link-v1-2-a2746a853f75@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:29:15 +01:00
Sriram R
0495b64132 wifi: mac80211: fetch FILS discovery template by link ID
Currently, the FILS discovery template is always fetched from the default
link of a virtual interface in both Multi-Link Operation (MLO) and
non-MLO cases. However, in the MLO case there is a need to fetch the FILS
discovery template from a specific link instead of the default link.

Hence, add support for fetching the FILS discovery template based on the
link ID from the corresponding link data.

Signed-off-by: Sriram R <quic_srirrama@quicinc.com>
Co-developed-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com>
Signed-off-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com>
Link: https://patch.msgid.link/20260220-fils-prob-by-link-v1-1-a2746a853f75@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:29:15 +01:00
Miri Korenblit
a34951ef56 wifi: nl80211: don't allow DFS channels for NAN
NAN cannot use DFS channels.
Mark DFS channels as unusable if the chandef is to be used for NAN.

Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260108102921.c2a5a0a14b9f.Idca29fb8a235df980e63b733a298fd1f2bdf2f48@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260219094725.3846371-3-miriam.rachel.korenblit@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:23:11 +01:00
Miri Korenblit
9e2f7f4a2c wifi: cfg80211: refactor wiphy_suspend
The sequence of operations that needs to be done in wiphy_suspend is
identical for the case where there is no wowlan configured, and for the
case that it is but the driver refused to do wowlan (by returning 1 from
rdev_suspend).

The current code duplicates this set of operations for each one of the
cases.

In particular, the next patch will change the locking of cfg80211_leave_all to
not hold the wiphy lock, which will be easier to do if it is not called
twice.

Change the code to handle first the case that wowlan is configured, and
then handle both cases (driver refused to do wowlan and no wowlan
configured) in one place.
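
The resulting control flow can be sketched as below. This is a simplified userspace model, not the cfg80211 code: the struct fields are hypothetical, the driver's return values are supplied as plain integers, and registration and the suspended flag are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the state that matters for the control flow. */
struct sketch_wiphy {
	bool wowlan_configured;
	int wowlan_suspend_ret; /* driver result when asked to do wowlan */
	int plain_suspend_ret;  /* driver result for a suspend without wowlan */
	int leave_calls;        /* how often the "leave all" step ran */
};

int sketch_suspend(struct sketch_wiphy *w)
{
	assert(w != NULL);

	/* First handle the case that wowlan is configured. */
	if (w->wowlan_configured) {
		int ret = w->wowlan_suspend_ret;

		/* 0 means the driver accepted wowlan, negative is an error;
		 * only a return value of 1 (driver refused) falls through. */
		if (ret != 1)
			return ret;
	}

	/* Single shared path: no wowlan configured, or the driver refused. */
	w->leave_calls++;
	return w->plain_suspend_ret;
}
```

Because both cases share one path, the "leave all" step appears at only one call site, which is what the commit message says makes the follow-up locking change easier.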

Note that this changes the behaviour to set suspended=true also when
we were not registered yet, but that makes sense anyway, as wiphy works
can be queued also before registration.

Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260108102921.00336669ac32.Id76f272662e1315cd93a628808cc2d1625036b00@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260219094725.3846371-2-miriam.rachel.korenblit@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:23:07 +01:00
Miri Korenblit
033fe322f5 wifi: nl80211/cfg80211: support stations of non-netdev interfaces
Currently, a station can only be added to a netdev interface,
mainly because there was no need for a station of a non-netdev
interface.

But for NAN, we will have stations that belong to the NL80211_IFTYPE_NAN
interface.

Prepare for adding/changing/deleting a station that belongs to a non-netdev
interface. This doesn't actually allow such stations - this will be done
in a different patch.

Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260219114327.65c9cc96f814.Ic02066b88bb8ad6b21e15cbea8d720280008c83b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:23:03 +01:00
Miri Korenblit
137b61fdfc wifi: cfg80211: remove unneeded call to cfg80211_leave
In cfg80211_destroy_ifaces, we first close all netdev wdevs, which will
trigger a NETDEV_GOING_DOWN event that will call cfg80211_leave,
and for non-netdev wdevs, we call cfg80211_remove_virtual_intf which
calls cfg80211_unregister_wdev, which handles the "leaving" for those
interfaces (i.e. stop_nan and stop_p2p_device).

Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260219114327.c43709c9d3af.I3179a28f237ea3b8ec368af720fbf77455a7763f@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:22:58 +01:00
Miri Korenblit
49a1e65c6d wifi: nl80211: refactor nl80211_parse_chandef
In order to be able to use this function also for nested attributes,
change this function to receive a pointer to extack and to the
attributes array, instead of receiving the info and extracting them out
of it.
While at it, use NL_SET_ERR_MSG_ATTR with the frequency of the chandef.

Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260219114327.2b994566a63b.I6c2b6f4c7e2e09f4c47285ca4ac8a37b20700e19@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:22:52 +01:00
Gustavo A. R. Silva
abf37167e7 wifi: iwlegacy: Avoid multiple -Wflex-array-member-not-at-end warnings
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.

Move the conflicting declarations (which in a couple of cases happen
to be in a union, so the entire unions are moved) to the end of the
corresponding structures, struct il_frame, and struct il3945_frame.

Notice that `struct il_tx_beacon_cmd`, `struct il4965_tx_resp`, and
`struct il3945_tx_beacon_cmd` are flexible structures, that is,
structures that contain a flexible-array member.

The case for struct il4965_beacon_notif is different. Since this
structure is defined by hardware, we create the new `struct
il4965_tx_resp_hdr` type. We then use this newly created type to
replace the object type causing trouble in struct il4965_beacon_notif,
namely `struct il4965_tx_resp`.

Also, once -fms-extensions is enabled, we can use transparent struct
members in struct il4965_tx_resp.

Notice that the newly created type does not contain the flex-array
member `agg_status`, which is the object causing the -Wfamnae warnings.
This object is currently in a union along with `__le32 status`, so
anything using struct il4965_beacon_notif needs to have its own view
of `status`. To preserve the memory layout, we therefore add member
`__le32 beacon_tx_status` to struct il4965_beacon_notif.

After these changes, the size of struct il4965_beacon_notif and
the offsets of its members remain the same, hence the memory layout
doesn't change:

Before changes:
struct il4965_beacon_notif {
	struct il4965_tx_resp      beacon_notify_hdr;    /*     0    24 */
	__le32                     low_tsf;              /*    24     4 */
	__le32                     high_tsf;             /*    28     4 */
	__le32                     ibss_mgr_status;      /*    32     4 */

	/* size: 36, cachelines: 1, members: 4 */
	/* last cacheline: 36 bytes */
};

After changes:
struct il4965_beacon_notif {
	struct il4965_tx_resp_hdr  beacon_notify_hdr;    /*     0    20 */
	__le32                     beacon_tx_status;     /*    20     4 */
	__le32                     low_tsf;              /*    24     4 */
	__le32                     high_tsf;             /*    28     4 */
	__le32                     ibss_mgr_status;      /*    32     4 */

	/* size: 36, cachelines: 1, members: 5 */
	/* last cacheline: 36 bytes */
};
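
One way to keep such a layout claim honest is to encode it as compile-time assertions. The sketch below uses simplified stand-in types with the sizes from the pahole output above (a 20-byte header followed by a 4-byte status word); the sketch_* names are hypothetical and the real driver structures have more detailed members.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the 20 bytes that precede the status word. */
struct sketch_tx_resp_hdr {
	uint8_t raw[20];
};

/* Stand-in for the full response: header plus status. In the driver the
 * status shares a union with a flexible array, omitted here. */
struct sketch_tx_resp {
	struct sketch_tx_resp_hdr hdr;
	uint32_t status;
};

struct sketch_beacon_notif_before {
	struct sketch_tx_resp beacon_notify_hdr;
	uint32_t low_tsf;
	uint32_t high_tsf;
	uint32_t ibss_mgr_status;
};

struct sketch_beacon_notif_after {
	struct sketch_tx_resp_hdr beacon_notify_hdr;
	uint32_t beacon_tx_status; /* takes the place of the old status word */
	uint32_t low_tsf;
	uint32_t high_tsf;
	uint32_t ibss_mgr_status;
};

static_assert(sizeof(struct sketch_beacon_notif_before) ==
	      sizeof(struct sketch_beacon_notif_after),
	      "total size must not change");
static_assert(offsetof(struct sketch_beacon_notif_after, beacon_tx_status) ==
	      offsetof(struct sketch_tx_resp, status),
	      "new status member must sit where the old one was");
static_assert(offsetof(struct sketch_beacon_notif_before, low_tsf) ==
	      offsetof(struct sketch_beacon_notif_after, low_tsf),
	      "low_tsf offset must not change");
static_assert(offsetof(struct sketch_beacon_notif_before, ibss_mgr_status) ==
	      offsetof(struct sketch_beacon_notif_after, ibss_mgr_status),
	      "ibss_mgr_status offset must not change");
```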

Lastly, adjust the rest of the code, accordingly.

These changes fix the following warnings:

11 drivers/net/wireless/intel/iwlegacy/common.h:526:11: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
11 drivers/net/wireless/intel/iwlegacy/commands.h:2667:31: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
4 drivers/net/wireless/intel/iwlegacy/3945.h:131:11: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Stanislaw Gruszka <stf_xl@wp.pl>
Link: https://patch.msgid.link/aZLienEatf9KC6Rx@kspp
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:22:30 +01:00
Hari Chandrakanthan
6a584e336c wifi: cfg80211: add support to handle incumbent signal detected event from mac80211/driver
When any incumbent signal is detected by an AP/mesh interface operating
in 6 GHz band, FCC mandates the AP/mesh to vacate the channels affected
by it [1].

Add a new API cfg80211_incumbent_signal_notify() that can be used
by mac80211 or drivers to notify the higher layers about the signal
interference event with the interference bitmap in which each bit
denotes the affected 20 MHz in the operating channel.
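
To illustrate how a consumer might read such a bitmap, here is a small sketch. The helper names are hypothetical, and it assumes that bit 0 corresponds to the lowest 20 MHz segment of the operating channel, which the commit message does not specify.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask of the bits that are meaningful for a channel of the given width. */
uint32_t sketch_valid_mask(unsigned int width_mhz)
{
	unsigned int segments;

	assert(width_mhz % 20 == 0);

	segments = width_mhz / 20;
	return segments >= 32 ? UINT32_MAX : (UINT32_C(1) << segments) - 1;
}

/* True if the given 20 MHz segment is flagged as affected. */
bool sketch_segment_affected(uint32_t bitmap, unsigned int segment)
{
	return segment < 32 && ((bitmap >> segment) & 1);
}

/* Number of affected 20 MHz segments inside the operating channel. */
unsigned int sketch_affected_count(uint32_t bitmap, unsigned int width_mhz)
{
	uint32_t bits = bitmap & sketch_valid_mask(width_mhz);
	unsigned int count = 0;

	while (bits) {
		bits &= bits - 1; /* clear the lowest set bit */
		count++;
	}
	return count;
}
```

A userspace daemon could use a count like this to decide between reducing the bandwidth and vacating the channel entirely.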

Add support for the new nl80211 event and nl80211 attribute as well to
notify userspace on the details about the interference event. Userspace is
expected to process it and take further action - vacate the channel, or
reduce the bandwidth.

[1] - https://apps.fcc.gov/kdb/GetAttachment.html?id=nXQiRC%2B4mfiA54Zha%2BrW4Q%3D%3D&desc=987594%20D02%20U-NII%206%20GHz%20EMC%20Measurement%20v03&tracking_number=277034

Signed-off-by: Hari Chandrakanthan <quic_haric@quicinc.com>
Signed-off-by: Amith A <amith.a@oss.qualcomm.com>
Link: https://patch.msgid.link/20260216032027.2310956-2-amith.a@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:14:54 +01:00
Suraj P Kizhakkethil
f3f52e6f20 wifi: mac80211: Set link ID for NULL packets sent to probe stations
Currently, for AP MLD, the link ID is not provided when a NULL
packet is triggered to probe a station. For non-MLO stations connected
to an AP MLD, use the station's default link to send the NULL packets
and set addr2 and addr3 to the link address. For MLO stations, set the
link ID to unspecified to let the driver select the appropriate link.
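
The selection rule can be sketched as follows. All names and the value of SKETCH_LINK_UNSPECIFIED are hypothetical, and the use of the MLD address for MLO stations is an assumption of this sketch rather than something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ETH_ALEN 6
#define SKETCH_MAX_LINKS 15
#define SKETCH_LINK_UNSPECIFIED 0xf /* hypothetical "let the driver choose" marker */

struct sketch_ap {
	uint8_t mld_addr[SKETCH_ETH_ALEN];
	uint8_t link_addr[SKETCH_MAX_LINKS][SKETCH_ETH_ALEN];
};

struct sketch_sta {
	bool mlo;                /* station is a non-AP MLD */
	unsigned int deflink_id; /* default link of a non-MLO station */
};

struct sketch_probe {
	unsigned int link_id;
	uint8_t addr2[SKETCH_ETH_ALEN];
	uint8_t addr3[SKETCH_ETH_ALEN];
};

void sketch_fill_probe(struct sketch_probe *p, const struct sketch_ap *ap,
		       const struct sketch_sta *sta)
{
	const uint8_t *src;

	assert(p != NULL && ap != NULL && sta != NULL);

	if (sta->mlo) {
		/* MLO station: leave the link choice to the driver. */
		p->link_id = SKETCH_LINK_UNSPECIFIED;
		src = ap->mld_addr; /* assumption of this sketch */
	} else {
		/* Non-MLO station: use its default link and that link's address. */
		assert(sta->deflink_id < SKETCH_MAX_LINKS);
		p->link_id = sta->deflink_id;
		src = ap->link_addr[sta->deflink_id];
	}

	memcpy(p->addr2, src, SKETCH_ETH_ALEN);
	memcpy(p->addr3, src, SKETCH_ETH_ALEN);
}
```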

Co-developed-by: Sriram R <quic_srirrama@quicinc.com>
Signed-off-by: Sriram R <quic_srirrama@quicinc.com>
Co-developed-by: Rameshkumar Sundaram <rameshkumar.sundaram@oss.qualcomm.com>
Signed-off-by: Rameshkumar Sundaram <rameshkumar.sundaram@oss.qualcomm.com>
Signed-off-by: Suraj P Kizhakkethil <suraj.kizhakkethil@oss.qualcomm.com>
Link: https://patch.msgid.link/20260213100126.1414398-3-suraj.kizhakkethil@oss.qualcomm.com
[init link_id in each branch instead of default to zero]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:13:39 +01:00
Suraj P Kizhakkethil
73e7df69ed wifi: mac80211: set band information only for non-MLD when probing stations using NULL frame
Currently, when sending a NULL frame to probe a station, the band
information is derived from the chanctx_conf in the mac80211 vif's
bss_conf. However, for AP MLD, chanctx_conf is not assigned to the
vif's bss_conf; instead it is assigned on a per-link basis. As a result,
for AP MLD, sending a NULL packet to probe will trigger a warning.

WARNING: net/mac80211/cfg.c:4635 at ieee80211_probe_client+0x1a8/0x1d8 [mac80211], CPU#2: hostapd/244
Call trace:
 ieee80211_probe_client+0x1a8/0x1d8 [mac80211] (P)
 nl80211_probe_client+0xac/0x170 [cfg80211]
 genl_family_rcv_msg_doit+0xc8/0x134
 genl_rcv_msg+0x200/0x280
 netlink_rcv_skb+0x38/0xf0
 genl_rcv+0x34/0x48
 netlink_unicast+0x314/0x3a0
 netlink_sendmsg+0x150/0x390
 ____sys_sendmsg+0x1f4/0x21c
 ___sys_sendmsg+0x98/0xc0
 __sys_sendmsg+0x74/0xcc
 __arm64_sys_sendmsg+0x20/0x34
 invoke_syscall.constprop.0+0x4c/0xd0
 do_el0_svc+0x3c/0xd0
 el0_svc+0x28/0xc0
 el0t_64_sync_handler+0x98/0xdc
 el0t_64_sync+0x154/0x158
---[ end trace 0000000000000000 ]---

For NULL packets sent to probe stations, set the band information only
for non-MLD, since MLD transmissions do not rely on band.

Signed-off-by: Suraj P Kizhakkethil <suraj.kizhakkethil@oss.qualcomm.com>
Link: https://patch.msgid.link/20260213100126.1414398-2-suraj.kizhakkethil@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:11:14 +01:00
Daniel Hodges
ae5e95d415 wifi: mwifiex: fix use-after-free in mwifiex_adapter_cleanup()
The mwifiex_adapter_cleanup() function uses timer_delete()
(non-synchronous) for the wakeup_timer before the adapter structure is
freed. This is incorrect because timer_delete() does not wait for any
running timer callback to complete.

If the wakeup_timer callback (wakeup_timer_fn) is executing when
mwifiex_adapter_cleanup() is called, the callback will continue to
access adapter fields (adapter->hw_status, adapter->if_ops.card_reset,
etc.) which may be freed by mwifiex_free_adapter() called later in the
mwifiex_remove_card() path.

Use timer_delete_sync() instead to ensure any running timer callback has
completed before returning.

Fixes: 4636187da6 ("mwifiex: add wakeup timer based recovery mechanism")
Cc: stable@vger.kernel.org
Signed-off-by: Daniel Hodges <git@danielhodges.dev>
Link: https://patch.msgid.link/20260206194401.2346-1-git@danielhodges.dev
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:11:00 +01:00
Janusz Dziedzic
668b233b7a wifi: mac80211_hwsim: background CAC support
Report background CAC support, and allow canceling background CAC
and simulating radar:

echo cancel > /sys/kernel/debug/ieee80211/phy2/hwsim/dfs_background_cac
echo radar > /sys/kernel/debug/ieee80211/phy2/hwsim/dfs_background_cac

Signed-off-by: Janusz Dziedzic <janusz.dziedzic@gmail.com>
Link: https://patch.msgid.link/20260206171830.553879-5-janusz.dziedzic@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:10:28 +01:00
Janusz Dziedzic
68b908b3c8 wifi: cfg80211: events, report background radar
When reporting a radar event, also indicate whether it is
connected with background CAC, so that a user-mode
application such as hostapd can check it and behave
correctly.

Signed-off-by: Janusz Dziedzic <janusz.dziedzic@gmail.com>
Link: https://patch.msgid.link/20260206171830.553879-4-janusz.dziedzic@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:10:28 +01:00
Janusz Dziedzic
d69cb039ab wifi: cfg80211: set and report chandef CAC ongoing
Allow tracking and checking the CAC state from user mode by
simply checking the phy channels, e.g. using the iw phy1 channels
command.
This is done for regular CAC and background CAC.
It is important for background CAC because it can be started
from any app (e.g. iw or hostapd).

Signed-off-by: Janusz Dziedzic <janusz.dziedzic@gmail.com>
Link: https://patch.msgid.link/20260206171830.553879-3-janusz.dziedzic@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:10:28 +01:00
Janusz Dziedzic
92fecd2744 wifi: cfg80211: fix background CAC
Fix:
- Send CAC_ABORT event when background CAC is canceled
- Cancel CAC done workqueue when radar is detected
- Release background wdev ownership when CAC is aborted or passed
- Clean lower layer background radar state when CAC is aborted or passed
- Prevent sending abort event when radar event is sent

Signed-off-by: Janusz Dziedzic <janusz.dziedzic@gmail.com>
Link: https://patch.msgid.link/20260206171830.553879-2-janusz.dziedzic@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 09:10:28 +01:00
Zilin Guan
990a73dec3 wifi: mwifiex: Fix memory leak in mwifiex_11n_aggregate_pkt()
In mwifiex_11n_aggregate_pkt(), skb_aggr is allocated via
mwifiex_alloc_dma_align_buf(). If mwifiex_is_ralist_valid() returns false,
the function currently returns -1 immediately without freeing the
previously allocated skb_aggr, causing a memory leak.

Since skb_aggr has not yet been queued via skb_queue_tail(), no other
references to this memory exist. Therefore, it has to be freed locally
before returning the error.

Fix this by calling mwifiex_write_data_complete() to free skb_aggr before
returning the error status.
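
The shape of the fix, releasing a buffer on an early error return before ownership has been handed over, can be modelled in userspace C. This is a generic sketch, not the mwifiex code: the sketch_* helpers are hypothetical and plain malloc()/free() stand in for the driver's allocation and completion routines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Number of buffers currently allocated and not yet released. */
int sketch_live_buffers;

void *sketch_alloc(size_t size)
{
	void *buf = malloc(size);

	if (buf)
		sketch_live_buffers++;
	return buf;
}

void sketch_release(void *buf)
{
	if (!buf)
		return;
	assert(sketch_live_buffers > 0);
	free(buf);
	sketch_live_buffers--;
}

/* Returns 0 on success and -1 on error, mirroring the error convention
 * described in the commit message. */
int sketch_aggregate(bool list_valid)
{
	void *aggr = sketch_alloc(64);

	if (!aggr)
		return -1;

	if (!list_valid) {
		/* Nothing else references aggr yet, so free it here. */
		sketch_release(aggr);
		return -1;
	}

	/* On success the real code would queue the buffer and hand over
	 * ownership; the sketch simply releases it. */
	sketch_release(aggr);
	return 0;
}
```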

Compile tested only. Issue found using a prototype static analysis tool
and code review.

Fixes: 5e6e3a92b9 ("wireless: mwifiex: initial commit for Marvell mwifiex driver")
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Reviewed-by: Jeff Chen <jeff.chen_1@nxp.com>
Link: https://patch.msgid.link/20260119092625.1349934-1-zilin@seu.edu.cn
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-03-02 08:17:22 +01:00
Jakub Kicinski
0314e382cf Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-7.0-rc2).

Conflicts:

tools/testing/selftests/drivers/net/hw/rss_ctx.py
  19c3a2a81d ("selftests: drv-net: rss: Generate unique ports for RSS context tests")
  ce5a0f4612 ("selftests: drv-net: rss_ctx: test RSS contexts persist after ifdown/up")

include/net/inet_connection_sock.h
  858d2a4f67 ("tcp: fix potential race in tcp_v6_syn_recv_sock()")
  fcd3d039fa ("tcp: make tcp_v{4,6}_send_check() static")
https://lore.kernel.org/aZ8PSFLzBrEU3I89@sirena.org.uk

drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.c
  69050f8d6d ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
  bf4afc53b7 ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument")
  8a96b9144f ("net/mlx5e: Alloc xsk channel param out of mlx5e_open_xsk()")

Adjacent changes:

net/netfilter/ipvs/ip_vs_ctl.c
  c59bd9e62e ("ipvs: use more counters to avoid service lookups")
  bf4afc53b7 ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-26 10:23:00 -08:00
Linus Torvalds
b9c8fc2cae Merge tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
 "Including fixes from IPsec, Bluetooth and netfilter

  Current release - regressions:

   - wifi: fix dev_alloc_name() return value check

   - rds: fix recursive lock in rds_tcp_conn_slots_available

  Current release - new code bugs:

   - vsock: lock down child_ns_mode as write-once

  Previous releases - regressions:

   - core:
      - do not pass flow_id to set_rps_cpu()
      - consume xmit errors of GSO frames

   - netconsole: avoid OOB reads, msg is not nul-terminated

   - netfilter: h323: fix OOB read in decode_choice()

   - tcp: re-enable acceptance of FIN packets when RWIN is 0

   - udplite: fix null-ptr-deref in __udp_enqueue_schedule_skb().

   - wifi: brcmfmac: fix potential kernel oops when probe fails

   - phy: register phy led_triggers during probe to avoid AB-BA deadlock

   - eth:
      - bnxt_en: fix deleting of Ntuple filters
      - wan: farsync: fix use-after-free bugs caused by unfinished tasklets
      - xscale: check for PTP support properly

  Previous releases - always broken:

   - tcp: fix potential race in tcp_v6_syn_recv_sock()

   - kcm: fix zero-frag skb in frag_list on partial sendmsg error

   - xfrm:
      - fix race condition in espintcp_close()
      - always flush state and policy upon NETDEV_UNREGISTER event

   - bluetooth:
      - purge error queues in socket destructors
      - fix response to L2CAP_ECRED_CONN_REQ

   - eth:
      - mlx5:
         - fix circular locking dependency in dump
         - fix "scheduling while atomic" in IPsec MAC address query
      - gve: fix incorrect buffer cleanup for QPL
      - team: avoid NETDEV_CHANGEMTU event when unregistering slave
      - usb: validate USB endpoints"

* tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits)
  netfilter: nf_conntrack_h323: fix OOB read in decode_choice()
  dpaa2-switch: validate num_ifs to prevent out-of-bounds write
  net: consume xmit errors of GSO frames
  vsock: document write-once behavior of the child_ns_mode sysctl
  vsock: lock down child_ns_mode as write-once
  selftests/vsock: change tests to respect write-once child ns mode
  net/mlx5e: Fix "scheduling while atomic" in IPsec MAC address query
  net/mlx5: Fix missing devlink lock in SRIOV enable error path
  net/mlx5: E-switch, Clear legacy flag when moving to switchdev
  net/mlx5: LAG, disable MPESW in lag_disable_change()
  net/mlx5: DR, Fix circular locking dependency in dump
  selftests: team: Add a reference count leak test
  team: avoid NETDEV_CHANGEMTU event when unregistering slave
  net: mana: Fix double destroy_workqueue on service rescan PCI path
  MAINTAINERS: Update maintainer entry for QUALCOMM ETHQOS ETHERNET DRIVER
  dpll: zl3073x: Remove redundant cleanup in devm_dpll_init()
  selftests/net: packetdrill: Verify acceptance of FIN packets when RWIN is 0
  tcp: re-enable acceptance of FIN packets when RWIN is 0
  vsock: Use container_of() to get net namespace in sysctl handlers
  net: usb: kaweth: validate USB endpoints
  ...
2026-02-26 08:00:13 -08:00
Vahagn Vardanian
baed0d9ba9 netfilter: nf_conntrack_h323: fix OOB read in decode_choice()
In decode_choice(), the boundary check before get_len() uses the
variable `len`, which is still 0 from its initialization at the top of
the function:

    unsigned int type, ext, len = 0;
    ...
    if (ext || (son->attr & OPEN)) {
        BYTE_ALIGN(bs);
        if (nf_h323_error_boundary(bs, len, 0))  /* len is 0 here */
            return H323_ERROR_BOUND;
        len = get_len(bs);                        /* OOB read */

When the bitstream is exactly consumed (bs->cur == bs->end), the check
nf_h323_error_boundary(bs, 0, 0) evaluates to (bs->cur + 0 > bs->end),
which is false.  The subsequent get_len() call then dereferences
*bs->cur++, reading 1 byte past the end of the buffer.  If that byte
has bit 7 set, get_len() reads a second byte as well.

This can be triggered remotely by sending a crafted Q.931 SETUP message
with a User-User Information Element containing exactly 2 bytes of
PER-encoded data ({0x08, 0x00}) to port 1720 through a firewall with
the nf_conntrack_h323 helper active.  The decoder fully consumes the
PER buffer before reaching this code path, resulting in a 1-2 byte
heap-buffer-overflow read confirmed by AddressSanitizer.

Fix this by checking for 2 bytes (the maximum that get_len() may read)
instead of the uninitialized `len`.  This matches the pattern used at
every other get_len() call site in the same file, where the caller
checks for 2 bytes of available data before calling get_len().
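
The difference between checking for 0 bytes and checking for 2 bytes can be reproduced with a small userspace model. The sketch_* names are hypothetical; the boundary helper compares the remaining byte count instead of forming bs->cur + n, which expresses the same condition, and the length decoding follows the description above (one byte, or two when bit 7 of the first is set).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_bitstr {
	const unsigned char *cur; /* next byte to read */
	const unsigned char *end; /* one past the last valid byte */
};

/* True if reading `bytes` more bytes would run past the end. */
bool sketch_error_boundary(const struct sketch_bitstr *bs, size_t bytes)
{
	assert(bs->cur <= bs->end);
	return (size_t)(bs->end - bs->cur) < bytes;
}

/* Decode a length field, refusing to read unless 2 bytes are available. */
bool sketch_get_len_checked(struct sketch_bitstr *bs, unsigned int *len)
{
	unsigned int v;

	/* 2 is the maximum number of bytes the decoding below may read. */
	if (sketch_error_boundary(bs, 2))
		return false;

	v = *bs->cur++;
	if (v & 0x80) {
		v &= 0x3f;
		v <<= 8;
		v += *bs->cur++;
	}
	*len = v;
	return true;
}
```

With a fully consumed buffer (cur == end), a check for 0 bytes reports no error while a check for 2 bytes does, which is the gap the fix closes.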

Fixes: ec8a8f3c31 ("netfilter: nf_ct_h323: Extend nf_h323_error_boundary to work on bits as well")
Signed-off-by: Vahagn Vardanian <vahagn@redrays.io>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260225130619.1248-2-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 12:50:42 +01:00
Junrui Luo
8a5752c6dc dpaa2-switch: validate num_ifs to prevent out-of-bounds write
The driver obtains sw_attr.num_ifs from firmware via dpsw_get_attributes()
but never validates it against DPSW_MAX_IF (64). This value controls
iteration in dpaa2_switch_fdb_get_flood_cfg(), which writes port indices
into the fixed-size cfg->if_id[DPSW_MAX_IF] array. When firmware reports
num_ifs >= 64, the loop can write past the array bounds.

Add a bound check for num_ifs in dpaa2_switch_init().

dpaa2_switch_fdb_get_flood_cfg() appends the control interface (port
num_ifs) after all matched ports. When num_ifs == DPSW_MAX_IF and all
ports match the flood filter, the loop fills all 64 slots and the control
interface write overflows by one entry.

The check uses >= because num_ifs == DPSW_MAX_IF is also functionally
broken.

build_if_id_bitmap() silently drops any ID >= 64:
      if (id[i] < DPSW_MAX_IF)
          bmap[id[i] / 64] |= ...
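
A sketch of why the check has to be >= rather than >: the flood configuration holds up to num_ifs matched ports plus one extra slot for the control interface. The names below are hypothetical, the port filter is modelled as a 64-bit mask, and -EINVAL is an assumed error code, not necessarily what the driver returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_IF 64

struct sketch_flood_cfg {
	uint16_t if_id[SKETCH_MAX_IF];
	unsigned int num_ifs;
};

/* Reject firmware-reported interface counts that cannot be handled. */
int sketch_validate_num_ifs(unsigned int num_ifs)
{
	return num_ifs >= SKETCH_MAX_IF ? -EINVAL : 0;
}

/* Collect every port selected in match_mask, then append the control
 * interface (port number num_ifs). Returns the number of entries written,
 * or 0 if num_ifs is out of range. */
unsigned int sketch_fill_flood_cfg(struct sketch_flood_cfg *cfg,
				   unsigned int num_ifs, uint64_t match_mask)
{
	unsigned int i = 0;

	assert(cfg != NULL);

	if (sketch_validate_num_ifs(num_ifs))
		return 0;

	for (unsigned int port = 0; port < num_ifs; port++)
		if (match_mask & (UINT64_C(1) << port))
			cfg->if_id[i++] = (uint16_t)port;

	/* At most num_ifs <= 63 entries so far, so this slot always exists. */
	cfg->if_id[i++] = (uint16_t)num_ifs;
	cfg->num_ifs = i;
	return i;
}
```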

Fixes: 539dda3c5d ("staging: dpaa2-switch: properly setup switching domains")
Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Link: https://patch.msgid.link/SYBPR01MB78812B47B7F0470B617C408AAF74A@SYBPR01MB7881.ausprd01.prod.outlook.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 12:37:21 +01:00
Hangbin Liu
4916f2e2f3 bonding: print churn state via netlink
Currently, the churn state is printed only in sysfs. Add netlink support
so that users can get the state via netlink.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20260224020215.6012-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 11:45:35 +01:00
Qingfang Deng
15c9ed1d82 pppoe: remove kernel-mode relay support
The kernel-mode PPPoE relay feature and its two associated ioctls
(PPPOEIOCSFWD and PPPOEIOCDFWD) are not used by any existing userspace
PPPoE implementations. The most commonly-used package, RP-PPPoE [1],
handles the relaying entirely in userspace.

This legacy code has remained in the driver for over two decades, since
its introduction in kernel 2.3.99-pre7, but has served no practical
purpose.

Remove the unused relay code.

[1] https://dianne.skoll.ca/projects/rp-pppoe/

Signed-off-by: Qingfang Deng <dqfext@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20260224015053.42472-1-dqfext@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 11:41:00 +01:00
Jakub Kicinski
7aa767d0d3 net: consume xmit errors of GSO frames
udpgro_frglist.sh and udpgro_bench.sh are the flakiest tests
currently in NIPA. They fail in exactly the same way: the TCP GRO
test occasionally stalls and the test gets killed after 10 min.

These tests use veth to simulate GRO. They attach a trivial
("return XDP_PASS;") XDP program to the veth to force TSO off
and NAPI on.

Digging into the failure mode we can see that the connection
is completely stuck after a burst of drops. The sender's snd_nxt
is at sequence number N [1], but the receiver claims to have
received (rcv_nxt) up to N + 3 * MSS [2]. The last piece of the puzzle
is that the sender's rtx queue is not empty (let's say the block in
the rtx queue is at sequence number N - 4 * MSS [3]).

In this state, sender sends a retransmission from the rtx queue
with a single segment, and sequence numbers N-4*MSS:N-3*MSS [3].
Receiver sees it and responds with an ACK all the way up to
N + 3 * MSS [2]. But sender will reject this ack as TCP_ACK_UNSENT_DATA
because it has no recollection of ever sending data that far out [1].
And we are stuck.
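
The stuck state can be modelled with a few lines of sequence-number
arithmetic. This is a simplified sketch, not the TCP stack: seq_after(),
ack_covers_unsent_data() and example_stuck_state() are hypothetical
names, and the values chosen for N and MSS are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe "a is after b" comparison on 32-bit sequence numbers. */
static bool seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/*
 * Simplified sender-side check: an ACK that covers data beyond snd_nxt
 * acknowledges something the sender believes it never sent, so the
 * sender rejects it.
 */
static bool ack_covers_unsent_data(uint32_t snd_nxt, uint32_t ack)
{
	return seq_after(ack, snd_nxt);
}

/* Walk through the state from the text with arbitrary N and MSS. */
static void example_stuck_state(void)
{
	const uint32_t mss = 1448, n = 100000;
	const uint32_t snd_nxt = n;		/* [1] sender's view */
	const uint32_t rcv_nxt = n + 3 * mss;	/* [2] receiver's view */

	/* Every retransmission is answered with ack == rcv_nxt ... */
	assert(ack_covers_unsent_data(snd_nxt, rcv_nxt));
	/* ... while an ACK at or below snd_nxt would pass this check. */
	assert(!ack_covers_unsent_data(snd_nxt, n));
}
```

Because the receiver always answers with the same ACK and the sender
always rejects it, neither side makes progress in this model.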

The root cause is the mess of the xmit return codes. veth returns
an error when it can't xmit a frame. We end up with a loss event
like this:

  -------------------------------------------------
  |   GSO super frame 1   |   GSO super frame 2   |
  |-----------------------------------------------|
  | seg | seg | seg | seg | seg | seg | seg | seg |
  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |
  -------------------------------------------------
     x    ok    ok    <ok>|  ok    ok    ok   <x>
                          \\
			   snd_nxt

"x" means packet lost by veth, and "ok" means it went thru.
Since veth has TSO disabled in this test it sees individual segments.
Segment 1 is on the retransmit queue and will be resent.

So why did the sender not advance snd_nxt even though it clearly did
send up to seg 8? tcp_write_xmit() interprets the return code
from the core to mean that data has not been sent at all. Since
TCP deals with GSO super frames, not individual segments, the crux
of the problem is that loss of a single segment can be interpreted
as loss of all. TCP only sees the last return code for the last
segment of the GSO frame (in <> brackets in the diagram above).

Of course, for the problem to occur we need a setup or a device
without a Qdisc. Otherwise the Qdisc layer disconnects the protocol
layer from the device errors completely.

We have multiple ways to fix this.

 1) make veth not return an error when it lost a packet.
    While this is what I think we did in the past, the issue keeps
    reappearing and it's annoying to debug. The game of whack
    a mole is not great.

 2) fix the damn return codes
    We only talk about NETDEV_TX_OK and NETDEV_TX_BUSY in the
    documentation, so maybe we should make the return code from
    ndo_start_xmit() a boolean. I like that the most, but perhaps
    some ancient, not-really-networking protocol would suffer.

 3) make TCP ignore the errors
    It is not entirely clear to me what benefit TCP gets from
    interpreting the result of ip_queue_xmit()? Specifically once
    the connection is established and we're pushing data - packet
    loss is just packet loss?

 4) this fix
    Ignore the rc in the Qdisc-less+GSO case, since it's unreliable.
    We already always return OK in the TCQ_F_CAN_BYPASS case.
    In the Qdisc-less case let's be a bit more conservative and only
    mask the GSO errors. This path is taken by non-IP-"networks"
    like CAN, MCTP etc, so we could regress some ancient thing.
    This is the simplest, but also maybe the hackiest fix?
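
Option 4 can be illustrated with a toy model of the return-code flow.
This is a sketch under stated assumptions, not the actual patch: TX_OK,
TX_DROP, xmit_segments() and direct_xmit_rc() are hypothetical names,
and the per-segment results are passed in as an array instead of coming
from a device.

```c
#include <assert.h>
#include <stdbool.h>

#define TX_OK   0	/* stand-in for a successful transmit */
#define TX_DROP 1	/* stand-in for a device-reported drop */

/*
 * Send the segments of one frame in order. As in the text, only the
 * result of the last segment survives to the caller.
 */
static int xmit_segments(const int *seg_rc, int nsegs)
{
	int rc = TX_OK, i;

	for (i = 0; i < nsegs; i++)
		rc = seg_rc[i];
	return rc;
}

/*
 * Qdisc-less path: for a GSO frame the surviving rc says nothing
 * reliable about the frame as a whole, so report OK and let the
 * protocol treat any loss as ordinary packet loss. Non-GSO frames
 * keep their rc.
 */
static int direct_xmit_rc(const int *seg_rc, int nsegs, bool is_gso)
{
	int rc;

	assert(nsegs > 0);
	rc = xmit_segments(seg_rc, nsegs);
	return is_gso ? TX_OK : rc;
}
```

Feeding in the two super frames from the diagram shows the asymmetry:
frame 1 (first segment lost) yields OK even without masking, while
frame 2 (last segment lost) yields an error although three of its four
segments went through.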

A similar fix was proposed by Eric in the past but never committed
because the original reporter was working with an OOT driver and wasn't
providing feedback (see Link).

Link: https://lore.kernel.org/CANn89iJcLepEin7EtBETrZ36bjoD9LrR=k4cfwWh046GB+4f9A@mail.gmail.com
Fixes: 1f59533f9c ("qdisc: validate frames going through the direct_xmit path")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260223235100.108939-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 11:35:00 +01:00
Paolo Abeni
f0a2f2aadb Merge branch 'vsock-add-write-once-semantics-to-child_ns_mode'
Bobby Eshleman says:

====================
vsock: add write-once semantics to child_ns_mode

Two administrator processes may race when setting child_ns_mode: one
sets it to "local" and creates a namespace, but another changes it to
"global" in between. The first process ends up with a namespace in the
wrong mode. Make child_ns_mode write-once so that a namespace manager
can set it once, check the value, and be guaranteed it won't change
before creating its namespaces. Writing a different value after the
first write returns -EBUSY.

One patch for the implementation, one for docs, and one for tests.

v2: https://lore.kernel.org/r/20260218-vsock-ns-write-once-v2-0-19e4c50d509a@meta.com
v1: https://lore.kernel.org/r/20260217-vsock-ns-write-once-v1-1-a1fb30f289a9@meta.com
====================

Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-0-c0cde6959923@meta.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 11:10:06 +01:00
Bobby Eshleman
b6302e057f vsock: document write-once behavior of the child_ns_mode sysctl
Update the vsock child_ns_mode documentation to include the new
write-once semantics of setting child_ns_mode. The semantics are
implemented in a preceding patch in this series.

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-3-c0cde6959923@meta.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 11:10:03 +01:00
Bobby Eshleman
102eab95f0 vsock: lock down child_ns_mode as write-once
Two administrator processes may race when setting child_ns_mode: one
process sets child_ns_mode to "local" and then creates a namespace, but
another process changes child_ns_mode to "global" between the write and
the namespace creation. The first process ends up with a namespace in
"global" mode instead of "local". While this can be detected after the
fact by reading ns_mode and retrying, it is fragile and error-prone.

Make child_ns_mode write-once so that a namespace manager can set it
once and be sure it won't change. Writing a different value after the
first write returns -EBUSY. This applies to all namespaces, including
init_net, where an init process can write "local" to lock all future
namespaces into local mode.
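
Write-once semantics of this kind can be sketched with a single atomic
compare-and-exchange. This is a userspace illustration using C11
atomics, not the vsock implementation: ns_cfg, ns_cfg_init(),
set_child_mode() and the MODE_UNSET sentinel are assumptions made for
the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* MODE_UNSET is a hypothetical sentinel meaning "never written". */
enum ns_mode { MODE_UNSET = 0, MODE_GLOBAL, MODE_LOCAL };

struct ns_cfg {
	atomic_int child_mode;
};

static void ns_cfg_init(struct ns_cfg *cfg)
{
	atomic_init(&cfg->child_mode, MODE_UNSET);
}

/*
 * The first writer wins. Rewriting the same value is accepted; writing
 * a different value after the first write fails with -EBUSY.
 */
static int set_child_mode(struct ns_cfg *cfg, enum ns_mode mode)
{
	int expected = MODE_UNSET;

	assert(mode != MODE_UNSET);
	if (atomic_compare_exchange_strong(&cfg->child_mode, &expected,
					   (int)mode))
		return 0;

	/* On failure, `expected` holds the value that is already stored. */
	return expected == (int)mode ? 0 : -EBUSY;
}
```

A namespace manager that gets 0 back knows the mode matches what it
asked for and cannot change afterwards. One that gets -EBUSY knows
another writer got there first with a different value.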

Fixes: eafb64f40c ("vsock: add netns to vsock core")
Suggested-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Co-developed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-2-c0cde6959923@meta.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 11:10:03 +01:00
Bobby Eshleman
a382a34276 selftests/vsock: change tests to respect write-once child ns mode
The child_ns_mode sysctl parameter becomes write-once in a future patch
in this series, which breaks existing tests. This patch updates the
tests to respect this new policy. No additional tests are added.

Add "global-parent" and "local-parent" namespaces as intermediaries to
spawn namespaces in the given modes. This avoids the need to change
"child_ns_mode" in the init_ns. nsenter must be used because ip netns
unshares the mount namespace so nested "ip netns add" breaks exec calls
from the init ns. Add nsenter to the deps check.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-1-c0cde6959923@meta.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 11:10:03 +01:00
Paolo Abeni
90fcb0f3bc Merge branch 'net-mlx5e-shampo-allow-high-order-pages-in-zerocopy-mode'
Tariq Toukan says:

====================
net/mlx5e: SHAMPO, Allow high order pages in zerocopy mode

This series adds support for high order pages when io_uring/devmem
zero copy is used.

See detailed description by Dragos below.

The first patches are moving code around to allow using queue specific
parameters that are not just for XSK. They are a bit large as they touch
a lot of functions.

The middle part of the series is updating various formulas to remove
remaining hardcoded use of PAGE_SIZE/PAGE_SHIFT.

The last part adds support for high order pages by implementing the
queue configuration functions and allowing larger rx_page_size
configurations when in zero-copy mode.

Results show an increase in BW and a decrease in CPU usage.
The benchmark was done with the zcrx samples from liburing [0].

rx_buf_len=4K, oncpu [1]:
packets=3358832 (MB=820027), rps=55794 (MB/s=13621)
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       9    1.56    0.00   18.09   13.42    0.00   66.80    0.00    0.00    0.00    0.12

rx_buf_len=128K, oncpu [2]:
packets=3781376 (MB=923187), rps=62813 (MB/s=15335)
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       9    0.33    0.00    7.61   18.86    0.00   73.08    0.00    0.00    0.00    0.12

rx_buf_len=4K, offcpu [3]:
packets=3460368 (MB=844816), rps=57481 (MB/s=14033)
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       9    0.00    0.00    0.26    0.00    0.00   92.63    0.00    0.00    0.00    7.11
Average:      11    3.04    0.00   68.09   28.87    0.00    0.00    0.00    0.00    0.00    0.00

rx_buf_len=128K, offcpu [4]:
packets=4119840 (MB=1005820), rps=68435 (MB/s=16707)
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       9    0.00    0.00    0.87    0.00    0.00   63.77    0.00    0.00    0.00   35.36
Average:      11    1.96    0.00   43.68   54.37    0.00    0.00    0.00    0.00    0.00    0.00

[0] https://github.com/isilence/liburing/tree/zcrx/rx-buf-len

[1] commands:
  $> taskset -c 9 ./zcrx 6 -i eth2 -q 9 -A 1 -B 4096 -S 33554432
  $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000

[2] commands:
  $> taskset -c 9 ./zcrx 6 -i eth2 -q 9 -A 1 -B 131072 -S 33554432
  $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000

[3] commands:
  $> taskset -c 11 ./zcrx 6 -i eth2 -q 9 -A 1 -B 4096 -S 33554432
  $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000

[4] commands:
  $> taskset -c 11 ./zcrx 6 -i eth2 -q 9 -A 1 -B 131072 -S 33554432
  $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000
====================

Link: https://patch.msgid.link/20260223204155.1783580-1-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 10:54:41 +01:00
Dragos Tatulea
df5135fced net/mlx5e: SHAMPO, Allow high order pages in zerocopy mode
Allow high order pages only when SHAMPO mode is enabled (hw-gro) and the
queue is used for zerocopy (has memory provider ops set). The limit is
128K and it was chosen for the following reasons:
- 256K size requires a special case during MTT calculation to split the
  page in two. That's because two MTTs are needed to form an octword.
- Higher sizes require increasing WQE size and/or reducing the number
  of WQEs.
- Having the RQ lined with too few large pages can lead to refill
  issues.

Results show an increase in BW and a decrease in CPU usage.
The benchmark was done with the zcrx samples from liburing [0].

rx_buf_len=4K, oncpu [1]:
packets=3358832 (MB=820027), rps=55794 (MB/s=13621)
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       9    1.56    0.00   18.09   13.42    0.00   66.80    0.00    0.00    0.00    0.12

rx_buf_len=128K, oncpu [2]:
packets=3781376 (MB=923187), rps=62813 (MB/s=15335)
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       9    0.33    0.00    7.61   18.86    0.00   73.08    0.00    0.00    0.00    0.12

rx_buf_len=4K, offcpu [3]:
packets=3460368 (MB=844816), rps=57481 (MB/s=14033)
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       9    0.00    0.00    0.26    0.00    0.00   92.63    0.00    0.00    0.00    7.11
Average:      11    3.04    0.00   68.09   28.87    0.00    0.00    0.00    0.00    0.00    0.00

rx_buf_len=128K, offcpu [4]:
packets=4119840 (MB=1005820), rps=68435 (MB/s=16707)
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       9    0.00    0.00    0.87    0.00    0.00   63.77    0.00    0.00    0.00   35.36
Average:      11    1.96    0.00   43.68   54.37    0.00    0.00    0.00    0.00    0.00    0.00

[0] https://github.com/isilence/liburing/tree/zcrx/rx-buf-len

[1] commands:
  $> taskset -c 9 ./zcrx 6 -i eth2 -q 9 -A 1 -B 4096 -S 33554432
  $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000

[2] commands:
  $> taskset -c 9 ./zcrx 6 -i eth2 -q 9 -A 1 -B 131072 -S 33554432
  $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000

[3] commands:
  $> taskset -c 11 ./zcrx 6 -i eth2 -q 9 -A 1 -B 4096 -S 33554432
  $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000

[4] commands:
  $> taskset -c 11 ./zcrx 6 -i eth2 -q 9 -A 1 -B 131072 -S 33554432
  $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-16-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 10:54:24 +01:00
Dragos Tatulea
5b6e0ddb36 net/mlx5e: Add param helper to calculate max page size
This function will be necessary to determine the upper limit of
rx-page-size.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-15-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 10:54:24 +01:00
Dragos Tatulea
585cfa99d3 net/mlx5e: Pass netdev queue config to param calculations
If set, take rx_page_size into consideration when calculating
the page shift in Multi Packet WQE mode.

The queue config is saved in the mlx5e_rq_opt_param struct which is
added to the mlx5e_channel_param struct. Now the configuration can be
read from the struct instead of adding it as an argument to all call
sites. For consistency, the queue config is assigned in
mlx5e_build_channel_param().

The queue configuration is read only from queue management ops
as that's the only place where it is currently useful. Furthermore,
netdev_queue_config() expects netdev->queue_mgmt_ops to be
set which is not always the case (representor netdevs).

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-14-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 10:54:24 +01:00
Dragos Tatulea
0fa8c93357 net/mlx5e: Add queue config ops for page size
For now allow only PAGE_SIZE. A subsequent patch will add support for
high order pages in zero-copy mode.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-13-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 10:54:23 +01:00
Dragos Tatulea
8611660778 net/mlx5e: RX, Make page frag bias more robust
The formula uses the system page size but does not account
for high order pages.

One way to fix this would be to adapt the formula to take
into account the pool order. This would require calculating it
for every allocation or adding an additional rq struct member to
hold the bias max.

However, the above is not really needed as the driver doesn't
check the bias value. It has other means to calculate the expected
number of fragments based on context.

This patch simply sets the value to the max possible value. A sanity
check is added during the queue init phase to prevent really big pages
from using more fragments than the type can hold.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-12-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 10:54:23 +01:00
Dragos Tatulea
0285cc3dac net/mlx5e: Alloc rq drop page based on calculated page_shift
An upcoming patch will allow setting the page order for RX
pages to be greater than 0. Make sure that the drop page will
also be allocated with the right size when that happens.

Take extra care when calculating the drop page size to
account for page_shift < PAGE_SHIFT which can happen for xsk.
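
The size calculation can be sketched as below. This is an illustration
under stated assumptions, not the driver code: rx_page_order() and
drop_page_size() are hypothetical helpers, SYS_PAGE_SHIFT assumes 4 KiB
system pages, and MAX_RX_PAGE_SHIFT encodes the 128K limit named
elsewhere in this series.

```c
#include <assert.h>
#include <stddef.h>

#define SYS_PAGE_SHIFT    12u	/* assumption: 4 KiB system pages */
#define MAX_RX_PAGE_SHIFT 17u	/* the 128K limit named in this series */

/*
 * Allocation order for an RX page of (1 << page_shift) bytes. A
 * page_shift below the system page shift (possible for xsk) must not
 * underflow, so it is clamped to order 0.
 */
static unsigned int rx_page_order(unsigned int page_shift)
{
	assert(page_shift <= MAX_RX_PAGE_SHIFT);
	return page_shift > SYS_PAGE_SHIFT ? page_shift - SYS_PAGE_SHIFT : 0;
}

/* Bytes to allocate for the drop page: never less than one system page. */
static size_t drop_page_size(unsigned int page_shift)
{
	return (size_t)1 << (SYS_PAGE_SHIFT + rx_page_order(page_shift));
}
```

Under these assumptions a 128K RX page maps to order 5, and an xsk
configuration with a 2 KiB page_shift still gets a full 4 KiB drop page.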

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-11-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 10:54:23 +01:00
Dragos Tatulea
3a145cf492 net/mlx5e: Set page_pool order based on calculated page_shift
Instead of unconditionally setting the page_pool order to 0, calculate
it from page_shift for the MPWQE case.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-10-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 10:54:23 +01:00
Dragos Tatulea
dff1c3164a net/mlx5e: SHAMPO, Always calculate page size
Adapt the rx path in SHAMPO mode to calculate page size based on
configured page_shift when dealing with payload data.

This is necessary as an upcoming patch will add support for using
different page sizes.

This patch has no functional changes.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-9-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 10:54:23 +01:00
Dragos Tatulea
3707a73854 net/mlx5e: Drop unused channel parameters
The channel parameters from struct mlx5_qmgmt_data are
built in mlx5e_queue_mem_alloc() but are not used.

mlx5e_open_channel() builds the channel parameters internally and those
parameters will be the ones that are used when opening the queue.

This patch drops the unused parameters.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-8-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 10:54:23 +01:00
Dragos Tatulea
099efb294e net/mlx5e: Move xsk param into new option container struct
The xsk parameter configuration (struct mlx5e_xsk_param) is passed
around many places during parameter calculation. It is used to contain
channel specific information (as opposed to the global info from
struct mlx5e_params).

Upcoming changes will need to push similar channel specific rq
configuration. Instead of adding one more parameter to all these
functions, create a new container structure that has optional rq
specific parameters. The xsk parameter will be the first of such kind.

The new container struct is itself optional. That means that before
checking its members, it has to be checked itself for validity.
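
The two-level optionality can be sketched as follows. This is a
userspace illustration, not the driver structs: xsk_opts, rq_opts,
rq_opts_xsk() and rq_chunk_size() are hypothetical names, and
chunk_size is an invented member used only to show the access pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-channel xsk settings. */
struct xsk_opts {
	unsigned int chunk_size;
};

/* Hypothetical container of optional rq parameters. */
struct rq_opts {
	const struct xsk_opts *xsk;	/* NULL when xsk is not in use */
};

/* The container itself may be NULL, so check it before its members. */
static const struct xsk_opts *rq_opts_xsk(const struct rq_opts *opts)
{
	return opts ? opts->xsk : NULL;
}

/* Use the xsk chunk size when configured, else the caller's fallback. */
static unsigned int rq_chunk_size(const struct rq_opts *opts,
				  unsigned int fallback)
{
	const struct xsk_opts *xsk = rq_opts_xsk(opts);

	assert(fallback != 0);
	return xsk ? xsk->chunk_size : fallback;
}
```

Routing every access through one accessor keeps the NULL check in a
single place, so call sites can pass NULL when they have no rq-specific
options at all.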

This patch has no functional changes.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-7-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 10:54:23 +01:00
Dragos Tatulea
8a96b9144f net/mlx5e: Alloc xsk channel param out of mlx5e_open_xsk()
Currently the allocation and filling of the xsk channel
parameters is done in mlx5e_open_xsk().

Move this responsibility out of mlx5e_open_xsk() and have
the function take an already filled mlx5e_channel_param.

mlx5e_open_channel() already allocates channel parameters.
The only precaution that is needed is to call
mlx5e_build_xsk_channel_param() before mlx5e_open_xsk().

mlx5e_xsk_enable_locked() now allocates and fills the xsk parameters.

For simplicity, link the xsk parameters in struct mlx5e_channel_params
so that channel params can be passed around.

This patch has no functional changes.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-6-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 10:54:23 +01:00
Dragos Tatulea
ba4f39c256 net/mlx5e: Expose and rename xsk channel parameter function
mlx5e_build_xsk_cparam() is meant to be the alternative
to mlx5e_build_channel_param(). It calculates only the parameters
that it requires using the previously configured mlx5e_xsk_param.

Move this function to params.c to be alongside
mlx5e_build_channel_param() and give it a similar name.

Expose the function as it will be needed by upcoming changes.

This patch has no functional changes.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-5-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 10:54:23 +01:00
Dragos Tatulea
a2ff2f5f80 net/mlx5e: Extract max_xsk_wqebbs into its own function
Calculating max_xsk_wqebbs seems large enough to deserve its own
function. It will make upcoming changes easier.

This patch has no functional changes.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-4-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 10:54:23 +01:00
Dragos Tatulea
d3a99b71a2 net/mlx5e: Extract striding rq param calculation in function
Calculating parameters for striding rq is large enough
to deserve its own function. As the names are also very long,
it is very easy to hit the 80 char limit every time
a change is made. This is an additional sign that it should
be extracted into its own function.

This patch has no functional changes.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-3-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 10:54:23 +01:00