Add struct tso_dma_map to tso.h for tracking DMA addresses of mapped
GSO payload data and tso_dma_map_completion_state.
The tso_dma_map combines DMA mapping storage with iterator state, allowing
drivers to walk pre-mapped DMA regions linearly. Includes fields for
the DMA IOVA path (iova_state, iova_offset, total_len) and a fallback
per-region path (linear_dma, frags[], frag_idx, offset).
The tso_dma_map_completion_state makes the IOVA completion state opaque
for drivers. Drivers are expected to allocate this and use the added
helpers to update the completion state.
Adds skb_frag_phys() to skbuff.h, returning the physical address
of a paged fragment's data, which is used by the tso_dma_map helpers
introduced in this commit described below.
The added TSO DMA map helpers are:
tso_dma_map_init(): DMA-maps the linear payload region and all frags
upfront. Prefers the DMA IOVA API for a single contiguous mapping with
one IOTLB sync; falls back to per-region dma_map_phys() otherwise.
Returns 0 on success, cleans up partial mappings on failure.
tso_dma_map_cleanup(): Handles both IOVA and fallback teardown paths.
tso_dma_map_count(): counts how many descriptors the next N bytes of
payload will need. Returns 1 if IOVA is used since the mapping is
contiguous.
tso_dma_map_next(): yields the next (dma_addr, chunk_len) pair.
On the IOVA path, each segment is a single contiguous chunk. On the
fallback path, indicates when a chunk starts a new DMA mapping so the
driver can set dma_unmap_len on that descriptor for completion-time
unmapping.
tso_dma_map_completion_save(): updates the completion state. Drivers
will call this at xmit time.
tso_dma_map_complete(): tears down the mapping at completion time and
returns true if the IOVA path was used. If it was not used, this is a
no-op and returns false.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260408230607.2019402-2-joe@dama.to
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Florian Westphal says:
====================
netfilter: updates for net-next
1-3) IPVS updates from Julian Anastasov to enhance visibility into
IPVS internal state by exposing hash size, load factor etc and
allows userspace to tune the load factor used for resizing hash
tables.
4) reject empty/not nul terminated device names from xt_physdev.
This isn't a bug fix; existing code doesn't require a c-string.
But clean this up anyway because conceptually the interface name
definitely should be a c-string.
5) Switch nfnetlink to skb_mac_header helpers that didn't exist back
when this code was written. This gives us additional debug checks
but is not intended to change functionality.
6) Let the xt ttl/hoplimit match reject unknown operator modes.
This is a cleanup, the evaluation function simply returns false when
the mode is out of range. From Marino Dzalto.
7) xt_socket match should enable defrag after all other checks. This
bug is harmless, historically defrag could not be disabled either
except by rmmod.
8) remove UDP-Lite conntrack support, from Fernando Fernandez Mancera.
9) Avoid a couple -Wflex-array-member-not-at-end warnings in the old
xtables 32bit compat code, from Gustavo A. R. Silva.
10) nftables fwd expression should drop packets when their ttl/hl has
expired. This is a bug fix deferred, its not deemed important
enough for -rc8.
11) Add additional checks before assuming the mac header is an ethernet
header, from Zhengchuan Liang.
* tag 'nf-next-26-04-10' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: require Ethernet MAC header before using eth_hdr()
netfilter: nft_fwd_netdev: check ttl/hl before forwarding
netfilter: x_tables: Avoid a couple -Wflex-array-member-not-at-end warnings
netfilter: conntrack: remove UDP-Lite conntrack support
netfilter: xt_socket: enable defrag after all other checks
netfilter: xt_HL: add pr_fmt and checkentry validation
netfilter: nfnetlink: prefer skb_mac_header helpers
netfilter: x_physdev: reject empty or not-nul terminated device names
ipvs: add conn_lfactor and svc_lfactor sysctl vars
ipvs: add ip_vs_status info
ipvs: show the current conn_tab size to users
====================
Link: https://patch.msgid.link/20260410112352.23599-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Johannes Berg says:
====================
Final updates, notably:
- crypto: move Michael MIC code into wireless (only)
- mac80211:
- multi-link 4-addr support
- NAN data support (but no drivers yet)
- ath10k: DT quirk to make it work on some devices
- ath12k: IPQ5424 support
- rtw89: USB improvements for performance
* tag 'wireless-next-2026-04-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (124 commits)
wifi: cfg80211: Explicitly include <linux/export.h> in michael-mic.c
wifi: ath10k: Add device-tree quirk to skip host cap QMI requests
dt-bindings: wireless: ath10k: Add quirk to skip host cap QMI requests
crypto: Remove michael_mic from crypto_shash API
wifi: ipw2x00: Use michael_mic() from cfg80211
wifi: ath12k: Use michael_mic() from cfg80211
wifi: ath11k: Use michael_mic() from cfg80211
wifi: mac80211, cfg80211: Export michael_mic() and move it to cfg80211
wifi: ipw2x00: Rename michael_mic() to libipw_michael_mic()
wifi: libertas_tf: refactor endpoint lookup
wifi: libertas: refactor endpoint lookup
wifi: at76c50x: refactor endpoint lookup
wifi: ath12k: Enable IPQ5424 WiFi device support
wifi: ath12k: Add CE remap hardware parameters for IPQ5424
wifi: ath12k: add ath12k_hw_regs for IPQ5424
wifi: ath12k: add ath12k_hw_version_map entry for IPQ5424
wifi: ath12k: Add ath12k_hw_params for IPQ5424
dt-bindings: net: wireless: add ath12k wifi device IPQ5424
wifi: ath10k: fix station lookup failure during disconnect
wifi: ath12k: Create symlink for each radio in a wiphy
...
====================
Link: https://patch.msgid.link/20260410064703.735099-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Part of a stack canary removal from tcp_v{4,6}_rcv().
Return a drop_reason instead of a boolean, so that we no longer
have to pass the address of a local variable.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-37 (-37)
Function old new delta
tcp_v6_rcv 3133 3129 -4
tcp_v4_rcv 3206 3202 -4
tcp_add_backlog 1281 1252 -29
Total: Before=25567186, After=25567149, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260409101147.1642967-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
UDP-Lite (RFC 3828) socket support was recently retired from the core
networking stack. As a follow-up of that, drop the connection tracker
and NAT support for UDP-Lite in Netfilter.
This patch removes CONFIG_NF_CT_PROTO_UDPLITE and scrubs UDP-Lite
awareness from the conntrack core, NAT core, nft_ct, and ctnetlink.
Please note that stateless packet inspection, matching, ipsets or
logging support for IPPROTO_UDPLITE is preserved.
As conntrack no longer extracts UDP-Lite ports or tracks its L4 state,
when performing NAT the UDP-Lite checksum cannot be updated anymore.
That is an expected and acceptable consequence of removing UDP-Lite
conntrack module.
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
The netif_get_rx_queue_lease_locked() API hides the locking
and the descend onto the leased queue. Making the code
harder to follow (at least to me). Remove the API and open
code the descend a bit. Most of the code now looks like:
if (!leased)
return __helper(x);
hw_rxq = ..
netdev_lock(hw_rxq->dev);
ret = __helper(x);
netdev_unlock(hw_rxq->dev);
return ret;
Of course if we have more code paths that need the wrapping
we may need to revisit. For now, IMHO, having to know what
netif_get_rx_queue_lease_locked() does is not worth the 20LoC
it saves.
Link: https://patch.msgid.link/20260408151251.72bd2482@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
netkit: Support for io_uring zero-copy and AF_XDP
Containers use virtual netdevs to route traffic from a physical netdev
in the host namespace. They do not have access to the physical netdev
in the host and thus can't use memory providers or AF_XDP that require
reconfiguring/restarting queues in the physical netdev.
This patchset adds the concept of queue leasing to virtual netdevs that
allow containers to use memory providers and AF_XDP at native speed.
Leased queues are bound to a real queue in a physical netdev and act
as a proxy.
Memory providers and AF_XDP operations take an ifindex and queue id,
so containers would pass in an ifindex for a virtual netdev and a queue
id of a leased queue, which then gets proxied to the underlying real
queue.
We have implemented support for this concept in netkit and tested the
latter against Nvidia ConnectX-6 (mlx5) as well as Broadcom BCM957504
(bnxt_en) 100G NICs. For more details see the individual patches.
====================
Link: https://patch.msgid.link/20260402231031.447597-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net_mp_open_rxq is currently not used in the tree as all callers are
using __net_mp_open_rxq directly, and net_mp_close_rxq is only used
once while all other locations use __net_mp_close_rxq.
Consolidate into a single API, netif_mp_{open,close}_rxq, using the
netif_ prefix to indicate that the caller is responsible for locking.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-6-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Populate nested lease info to the queue-get response that returns the
ifindex, queue id with type and optionally netns id if the device
resides in a different netns.
Example with ynl client when using AF_XDP via queue leasing:
# ip a
[...]
4: enp10s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp/id:24 qdisc mq state UP group default qlen 1000
link/ether e8:eb:d3:a3:43:f6 brd ff:ff:ff:ff:ff:ff
inet 10.0.0.2/24 scope global enp10s0f0np0
valid_lft forever preferred_lft forever
inet6 fe80::eaeb:d3ff:fea3:43f6/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
[...]
# ethtool -i enp10s0f0np0
driver: mlx5_core
[...]
# ynl --family netdev --output-json --do queue-get \
--json '{"ifindex": 4, "id": 15, "type": "rx"}'
{'id': 15,
'ifindex': 4,
'lease': {'ifindex': 8, 'netns-id': 0, 'queue': {'id': 1, 'type': 'rx'}},
'napi-id': 8227,
'type': 'rx',
'xsk': {}}
# ip netns list
foo (id: 0)
# ip netns exec foo ip a
[...]
8: nk@NONE: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
inet6 fe80::200:ff:fe00:0/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
[...]
# ip netns exec foo ethtool -i nk
driver: netkit
[...]
# ip netns exec foo ls /sys/class/net/nk/queues/
rx-0 rx-1 tx-0
# ip netns exec foo ynl --family netdev --output-json --do queue-get \
--json '{"ifindex": 8, "id": 1, "type": "rx"}'
{"id": 1, "type": "rx", "ifindex": 8, "xsk": {}}
Note that the caller of netdev_nl_queue_fill_one() holds the netdevice
lock. For the queue-get we do not lock both devices. When queues get
{un,}leased, both devices are locked, thus if __netif_get_rx_queue_lease()
returns a lease pointer, it points to a valid device. The netns-id is
fetched via peernet2id_alloc() similarly as done in OVS.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-4-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Implement netdev_nl_queue_create_doit which creates a new rx queue in a
virtual netdev and then leases it to a rx queue in a physical netdev.
Example with ynl client:
# ynl --family netdev --output-json --do queue-create \
--json '{"ifindex": 8, "type": "rx", "lease": {"ifindex": 4, "queue": {"type": "rx", "id": 15}}}'
{'id': 1}
Note that the netdevice locking order is always from the virtual to
the physical device.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-3-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipvlan and macvlan use queues to process broadcast/multicast packets
from a work queue.
Under attack these queues can drop packets.
Add MACVLAN_BROADCAST_BACKLOG drop_reason for macvlan broadcast queue.
Add IPVLAN_MULTICAST_BACKLOG drop_reason for ipvlan multicast queue.
Use different reasons as some deployments use both ipvlan and macvlan.
Also change ipvlan_rcv_frame() to use SKB_DROP_REASON_DEV_READY
when the device is not UP.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260407150710.1640747-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
codel_dump_stats() only runs with RTNL held,
reading fields that can be changed in qdisc fast path.
Add READ_ONCE()/WRITE_ONCE() annotations.
Alternative would be to acquire the qdisc spinlock, but our long-term
goal is to make qdisc dump operations lockless as much as we can.
tc_codel_xstats fields don't need to be latched atomically,
otherwise this bug would have been caught earlier.
No change in kernel size:
$ scripts/bloat-o-meter -t vmlinux.0 vmlinux
add/remove: 0/0 grow/shrink: 1/1 up/down: 3/-1 (2)
Function old new delta
codel_qdisc_dequeue 2462 2465 +3
codel_dump_stats 250 249 -1
Total: Before=29739919, After=29739921, chg +0.00%
Fixes: 76e3cc126b ("codel: Controlled Delay AQM")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260407143053.1570620-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit 2884bf72fb ("net: bonding: fix use-after-free in
bond_xmit_broadcast()"), bond_is_last_slave() was only used in
bond_xmit_broadcast(). After the recent fix replaced that usage with
a simple index comparison, bond_is_last_slave() has no remaining
callers. bond_is_first_slave() likewise has no callers.
Remove both unused macros.
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Link: https://patch.msgid.link/20260404220412.444753-1-xmei5@asu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sharing a global hash table among all queues is tempting, but
it can cause crash:
BUG: KASAN: slab-use-after-free in nfqnl_recv_verdict+0x11ac/0x15e0 [nfnetlink_queue]
[..]
nfqnl_recv_verdict+0x11ac/0x15e0 [nfnetlink_queue]
nfnetlink_rcv_msg+0x46a/0x930
kmem_cache_alloc_node_noprof+0x11e/0x450
struct nf_queue_entry is freed via kfree, but parallel cpu can still
encounter such an nf_queue_entry when walking the list.
Alternative fix is to free the nf_queue_entry via kfree_rcu() instead,
but as we have to alloc/free for each skb this will cause more mem
pressure.
Cc: Scott Mitchell <scott.k.mitch1@gmail.com>
Fixes: e19079adcd ("netfilter: nfnetlink_queue: optimize verdict lookup with hash table")
Signed-off-by: Florian Westphal <fw@strlen.de>
nft_ct_timeout_obj_destroy() frees the timeout object with kfree()
immediately after nf_ct_untimeout(), without waiting for an RCU grace
period. Concurrent packet processing on other CPUs may still hold
RCU-protected references to the timeout object obtained via
rcu_dereference() in nf_ct_timeout_data().
Add an rcu_head to struct nf_ct_timeout and use kfree_rcu() to defer
freeing until after an RCU grace period, matching the approach already
used in nfnetlink_cttimeout.c.
KASAN report:
BUG: KASAN: slab-use-after-free in nf_conntrack_tcp_packet+0x1381/0x29d0
Read of size 4 at addr ffff8881035fe19c by task exploit/80
Call Trace:
nf_conntrack_tcp_packet+0x1381/0x29d0
nf_conntrack_in+0x612/0x8b0
nf_hook_slow+0x70/0x100
__ip_local_out+0x1b2/0x210
tcp_sendmsg_locked+0x722/0x1580
__sys_sendto+0x2d8/0x320
Allocated by task 75:
nft_ct_timeout_obj_init+0xf6/0x290
nft_obj_init+0x107/0x1b0
nf_tables_newobj+0x680/0x9c0
nfnetlink_rcv_batch+0xc29/0xe00
Freed by task 26:
nft_obj_destroy+0x3f/0xa0
nf_tables_trans_destroy_work+0x51c/0x5c0
process_one_work+0x2c4/0x5a0
Fixes: 7e0b2b57f0 ("netfilter: nft_ct: add ct timeout support")
Cc: stable@vger.kernel.org
Signed-off-by: Tuan Do <tuan@calif.io>
Signed-off-by: Florian Westphal <fw@strlen.de>
Add a new helper function to retrieve the next action entry in flow
rule, check if the maximum number of actions is reached, bail out in
such case.
Replace existing opencoded iteration on the action array by this
helper function.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Currently:
add rule netdev x y ip saddr 1.1.1.1
does not work with neither double-tagged vlan nor pppoe packets. This is
because the network and transport header offset are not pointing to the
IP and transport protocol headers in the stack.
This patch expands NFT_META_PROTOCOL and NFT_META_L4PROTO to parse
double-tagged vlan and pppoe packets so matching network and transport
header fields becomes possible with the existing userspace generated
bytecode. Note that this parser only supports double-tagged vlan which
is composed of vlan offload + vlan header in the skb payload area for
simplicity.
NFT_META_PROTOCOL is used by bridge and netdev family as an implicit
dependency in the bytecode to match on network header fields.
Similarly, there is also NFT_META_L4PROTO, which is also used as an
implicit dependency when matching on the transport protocol header
fields.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Most ndo_start_xmit() methods expects headers of gso packets
to be already in skb->head.
net/core/tso.c users are particularly at risk, because tso_build_hdr()
does a memcpy(hdr, skb->data, hdr_len);
qdisc_pkt_len_segs_init() already does a dissection of gso packets.
Use pskb_may_pull() instead of skb_header_pointer() to make
sure drivers do not have to reimplement this.
Some malicious packets could be fed, detect them so that we can
drop them sooner with a new SKB_DROP_REASON_SKB_BAD_GSO drop_reason.
Fixes: e876f208af ("net: Add a software TSO helper API")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260403221540.3297753-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Peer schedules specify which channels the peer is available on and when.
Add support for configuring peer NAN schedules:
- build and store the schedule and maps
- for each channel, make sure that it fits into the capabilities, and
take the minimum between it and the local compatible nan channel.
- configure the driver
Note that the removal of a peer schedule should be done by the driver
upon NMI station removal.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260326121156.185ff2283fa6.I0345eb665be8ccf4a77eb1aca9a421eb8d2432e2@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support for both NMI and NDI stations.
The NDI station will be linked to the NMI station of the NAN peer for
which the NDI station is added.
A peer can choose to reuse its NMI address as the NDI address.
Since different keys might be in use for NAN management and for data
frames, we will have 2 different stations, even if they'll have the same
address.
Even though there are no links in NAN, sta->deflink will still be used
to store the one set of capabilities and SMPS mode.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260326121156.9fdd37b8e755.I7a7bd6e8e751cab49c329419485839afd209cfc6@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
A NAN local schedule consist of a list of NAN channels, and an array
that maps time slots to the channel it is scheduled to (or NULL to indicate
unscheduled).
A NAN channel is the configuration of a channel which is used for NAN
operations. It is a new type of chanctx user (before, the only user is a
link). A NAN channel may not have a chanctx assigned if it is ULWed out.
A NAN channel may or may not be scheduled (for example, user space
may want to prepare the resources before the actual schedule is
configured).
Add management of the NAN local schedule.
Since we introduce a new chanctx user, also adjust the different
for_each_chanctx_user_* macros to visit also the NAN channels and take
those into account.
Co-developed-by: Avraham Stern <avraham.stern@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260326121156.03350fd40630.Id158f815cfc9b5ab1ebdb8ee608bda426e4d7474@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently xp_assign_dev_shared() is missing XDP_USE_SG being propagated
to flags so set it in order to preserve mtu check that is supposed to be
done only when no multi-buffer setup is in picture.
Also, this flag has the same value as XDP_UMEM_TX_SW_CSUM so we could
get unexpected SG setups for software Tx checksums. Since csum flag is
UAPI, modify value of XDP_UMEM_SG_FLAG.
Fixes: d609f3d228 ("xsk: add multi-buffer support for sockets sharing umem")
Reviewed-by: Björn Töpel <bjorn@kernel.org>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20260402154958.562179-4-maciej.fijalkowski@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Multi-buffer XDP stores information about frags in skb_shared_info that
sits at the tailroom of a packet. The storage space is reserved via
xdp_data_hard_end():
((xdp)->data_hard_start + (xdp)->frame_sz - \
SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
and then we refer to it via macro below:
static inline struct skb_shared_info *
xdp_get_shared_info_from_buff(const struct xdp_buff *xdp)
{
return (struct skb_shared_info *)xdp_data_hard_end(xdp);
}
Currently we do not respect this tailroom space in multi-buffer AF_XDP
ZC scenario. To address this, introduce xsk_pool_get_tailroom() and use
it within xsk_pool_get_rx_frame_size() which is used in ZC drivers to
configure length of HW Rx buffer.
Typically drivers on Rx Hw buffers side work on 128 byte alignment so
let us align the value returned by xsk_pool_get_rx_frame_size() in order
to avoid addressing this on driver's side. This addresses the fact that
idpf uses mentioned function *before* pool->dev being set so we were at
risk that after subtracting tailroom we would not provide 128-byte
aligned value to HW.
Since xsk_pool_get_rx_frame_size() is actively used in xsk_rcv_check()
and __xsk_rcv(), add a variant of this routine that will not include 128
byte alignment and therefore old behavior is preserved.
Reviewed-by: Björn Töpel <bjorn@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Fixes: 24ea50127e ("xsk: support mbuf on ZC RX")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20260402154958.562179-3-maciej.fijalkowski@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Drivers that offload bridges need to iterate over the ports that are
members of a given bridge, for example to rebuild per-port forwarding
bitmaps when membership changes. Currently drivers typically open-code
this by combining dsa_switch_for_each_user_port() with a
dsa_port_offloads_bridge_dev() check, or cache bridge membership
within the driver.
Add dsa_switch_for_each_bridge_member() macro to express this pattern
directly, and use it for the existing dsa_bridge_ports() inline
helper.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/e7136aaa26773f39e805a00fe4ecf13cd2b83fc0.1775049897.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rather than querying the output device for its address in
mctp_local_output, set up the source address when we're populating the
dst structure. If no address is assigned, use MCTP_ADDR_NULL.
This will allow us more flexibility when routing for NULL-source-eid
cases. For now though, we still reject a NULL source address in the
output path.
We need to update the tests a little, so that addresses are assigned
before we do the dst lookups.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20260331-dev-mctp-null-eids-v1-1-b4d047372eaf@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
As IPv6 is built-in only, the ipv6_stub infrastructure is no longer
necessary.
Convert remaining ipv6_stub users to make direct function calls. The
fallback functions introduced previously will prevent linkage errors
when CONFIG_IPV6 is disabled.
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Tested-by: Ricardo B. Marlière <rbm@suse.com>
Link: https://patch.msgid.link/20260325120928.15848-9-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In preparation for dropping ipv6_stub and converting its users to direct
function calls, introduce static inline dummy functions and fallback
macros in the IPv6 networking headers. In addition, introduce checks on
fib6_nh_init(), ip6_dst_lookup_flow() and ip6_fragment() to avoid a
crash due to ipv6.disable=1 set during booting. The other functions are
safe as they cannot be called with ipv6.disable=1 set.
These fallbacks ensure that when CONFIG_IPV6 is completely disabled,
there are no compiling or linking errors due to code paths not guarded
by preprocessor macro IS_ENABLED(CONFIG_IPV6).
In addition, export ndisc_send_na(), ip6_route_input() and
ip6_fragment().
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Tested-by: Ricardo B. Marlière <rbm@suse.com>
Link: https://patch.msgid.link/20260325120928.15848-6-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As IPv6 is built-in only, it does not make sense to continue using
IS_BUILTIN(CONFIG_IPV6). Therefore, replace it with IS_ENABLED() when
necessary and drop it if it isn't valid anymore.
Notice that there is still one instance related to ICMPv6, as it
requires more changes it will be handle separately.
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Tested-by: Ricardo B. Marlière <rbm@suse.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260325120928.15848-4-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The RCU-protected codepaths (mpls_forward, mpls_dump_routes) can have
an inconsistent view of platform_labels vs platform_label in case of a
concurrent resize (resize_platform_label_table, under
platform_mutex). This can lead to OOB accesses.
This patch adds a seqcount, so that we get a consistent snapshot.
Note that mpls_label_ok is also susceptible to this, so the check
against RTA_DST in rtm_to_route_config, done outside platform_mutex,
is not sufficient. This value gets passed to mpls_label_ok once more
in both mpls_route_add and mpls_route_del, so there is no issue, but
that additional check must not be removed.
Reported-by: Yuan Tan <tanyuan98@outlook.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Fixes: 7720c01f3f ("mpls: Add a sysctl to control the size of the mpls label table")
Fixes: dde1b38e87 ("mpls: Convert mpls_dump_routes() to RCU.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/cd8fca15e3eb7e212b094064cd83652e20fd9d31.1774284088.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Johannes Berg says:
====================
A fairly big set of changes all over, notably with:
- cfg80211: new APIs for NAN (Neighbor Aware Networking,
aka Wi-Fi Aware) so less work must be in firmware
- mt76:
- mt7996/mt7925 MLO fixes/improvements
- mt7996 NPU support (HW eth/wifi traffic offload)
- iwlwifi: UNII-9 and continuing UHR work
* tag 'wireless-next-2026-03-26' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (230 commits)
wifi: mac80211: ignore reserved bits in reconfiguration status
wifi: cfg80211: allow protected action frame TX for NAN
wifi: ieee80211: Add some missing NAN definitions
wifi: nl80211: Add a notification to notify NAN channel evacuation
wifi: nl80211: add NL80211_CMD_NAN_ULW_UPDATE notification
wifi: nl80211: allow reporting spurious NAN Data frames
wifi: cfg80211: allow ToDS=0/FromDS=0 data frames on NAN data interfaces
wifi: nl80211: define an API for configuring the NAN peer's schedule
wifi: nl80211: add support for NAN stations
wifi: cfg80211: separately store HT, VHT and HE capabilities for NAN
wifi: cfg80211: add support for NAN data interface
wifi: cfg80211: make sure NAN chandefs are valid
wifi: cfg80211: Add an API to configure local NAN schedule
wifi: mac80211: cleanup error path of ieee80211_do_open
wifi: mac80211: extract channel logic from link logic
wifi: iwlwifi: mld: set RX_FLAG_RADIOTAP_TLV_AT_END generically
wifi: iwlwifi: reduce the number of prints upon firmware crash
wifi: iwlwifi: fix the description of SESSION_PROTECTION_CMD
wifi: iwlwifi: mld: introduce iwl_mld_vif_fw_id_valid
wifi: iwlwifi: mld: block EMLSR during TDLS connections
...
====================
Link: https://patch.msgid.link/20260326152021.305959-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Set the default number of queues per vPort to MANA_DEF_NUM_QUEUES (16),
as 16 queues can achieve optimal throughput for typical workloads. The
actual number of queues may be lower if it exceeds the hardware reported
limit. Users can increase the number of queues up to max_queues via
ethtool if needed.
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/20260323194925.1766385-1-longli@microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
__nf_ct_expect_find() and nf_ct_expect_find_get() are called under
rcu_read_lock() but they dereference the master conntrack via
exp->master.
Since the expectation does not hold a reference on the master conntrack,
this could be dying conntrack or different recycled conntrack than the
real master due to SLAB_TYPESAFE_RCU.
Store the netns, the master_tuple and the zone in struct
nf_conntrack_expect as a safety measure.
This patch is required by the follow up fix not to dump expectations
that do not belong to this netns.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Holding reference on the expectation is not sufficient, the master
conntrack object can just go away, making exp->master invalid.
To access exp->master safely:
- Grab the nf_conntrack_expect_lock, this gets serialized with
clean_from_lists() which also holds this lock when the master
conntrack goes away.
- Hold reference on master conntrack via nf_conntrack_find_get().
Not so easy since the master tuple to look up for the master conntrack
is not available in the existing problematic paths.
This patch goes for extending the nf_conntrack_expect_lock section
to address this issue for simplicity, in the cases that are described
below this is just slightly extending the lock section.
The add expectation command already holds a reference to the master
conntrack from ctnetlink_create_expect().
However, the delete expectation command needs to grab the spinlock
before looking up for the expectation. Expand the existing spinlock
section to address this to cover the expectation lookup. Note that,
the nf_ct_expect_iterate_net() calls already grabs the spinlock while
iterating over the expectation table, which is correct.
The get expectation command needs to grab the spinlock to ensure master
conntrack does not go away. This also expands the existing spinlock
section to cover the expectation lookup too. I needed to move the
netlink skb allocation out of the spinlock to keep it GFP_KERNEL.
For the expectation events, the IPEXP_DESTROY event is already delivered
under the spinlock, just move the delivery of IPEXP_NEW under the
spinlock too because the master conntrack event cache is reached through
exp->master.
While at it, add lockdep notations to help identify what codepaths need
to grab the spinlock.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The expectation helper field is mostly unused. As a result, the
netfilter codebase relies on accessing the helper through exp->master.
Always set on the expectation helper field so it can be used to reach
the helper.
nf_ct_expect_init() is called from packet path where the skb owns
the ct object, therefore accessing exp->master for the newly created
expectation is safe. This saves a lot of updates in all callsites
to pass the ct object as parameter to nf_ct_expect_init().
This is a preparation patches for follow up fixes.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>