inet_ehashfn() and inet6_ehashfn() initialise random secrets
on the first call by net_get_random_once().
While the init part is patched out using static keys, with
CONFIG_STACKPROTECTOR_STRONG=y, this causes a compiler to
generate a stack canary due to an automatic variable,
unsigned long ___flags, in the DO_ONCE() macro being passed
to __do_once_start().
With FDO, this is visible in __inet_lookup_established() and
__inet6_lookup_established() too.
Let's initialise the secrets by get_random_sleepable_once()
in the slow paths: inet_hash() for listen(), and
inet_hash_connect() and inet6_hash_connect() for connect().
Note that IPv6 listener will initialise both IPv4 & IPv6 secrets
in inet_hash() for IPv4-mapped IPv6 address.
With the patch, the stack size is reduced by 16 bytes (___flags
+ a stack canary) and NOPs for the static key go away.
Before: __inet6_lookup_established()
...
push %rbx
sub $0x38,%rsp # stack is 56 bytes
mov %edx,%ebx # sport
mov %gs:0x299419f(%rip),%rax # load stack canary
mov %rax,0x30(%rsp) and store it onto stack
mov 0x440(%rdi),%r15 # net->ipv4.tcp_death_row.hashinfo
nop
32: mov %r8d,%ebp # hnum
shl $0x10,%ebp # hnum << 16
nop
3d: mov 0x70(%rsp),%r14d # sdif
or %ebx,%ebp # INET_COMBINED_PORTS(sport, hnum)
mov 0x11a8382(%rip),%eax # inet6_ehashfn() ...
After: __inet6_lookup_established()
...
push %rbx
sub $0x28,%rsp # stack is 40 bytes
mov 0x60(%rsp),%ebp # sdif
mov %r8d,%r14d # hnum
shl $0x10,%r14d # hnum << 16
or %edx,%r14d # INET_COMBINED_PORTS(sport, hnum)
mov 0x440(%rdi),%rax # net->ipv4.tcp_death_row.hashinfo
mov 0x1194f09(%rip),%r10d # inet6_ehashfn() ...
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260303235424.3877267-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR (net-7.0-rc3).
No conflicts.
Adjacent changes:
net/netfilter/nft_set_rbtree.c
fb7fb40163 ("netfilter: nf_tables: clone set on flush only")
3aea466a43 ("netfilter: nft_set_rbtree: don't disable bh when acquiring tree lock")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects whole buffer size instead
of DMA write size. Different assumptions in idpf driver configuration lead
to negative tailroom.
To make it worse, buffer sizes are not actually uniform in idpf when
splitq is enabled, as there are several buffer queues, so rxq->rx_buf_size
is meaningless in this case.
Use truesize of the first bufq in AF_XDP ZC, as there is only one. Disable
growing tail for regular splitq.
Fixes: ac8a861f63 ("idpf: prepare structures to support XDP")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-8-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yiming Qian reports Use-after-free in the pipapo set type:
Under a large number of expired elements, commit-time GC can run for a very
long time in a non-preemptible context, triggering soft lockup warnings and
RCU stall reports (local denial of service).
We must split GC in an unlink and a reclaim phase.
We cannot queue elements for freeing until pointers have been swapped.
Expired elements are still exposed to both the packet path and userspace
dumpers via the live copy of the data structure.
call_rcu() does not protect us: dump operations or element lookups starting
after call_rcu has fired can still observe the free'd element, unless the
commit phase has made enough progress to swap the clone and live pointers
before any new reader has picked up the old version.
This a similar approach as done recently for the rbtree backend in commit
35f83a7552 ("netfilter: nft_set_rbtree: don't gc elements on insert").
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Syzbot with fault injection triggered a failing memory allocation with
GFP_KERNEL which results in a WARN splat:
iter.err
WARNING: net/netfilter/nf_tables_api.c:845 at nft_map_deactivate+0x34e/0x3c0 net/netfilter/nf_tables_api.c:845, CPU#0: syz.0.17/5992
Modules linked in:
CPU: 0 UID: 0 PID: 5992 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026
RIP: 0010:nft_map_deactivate+0x34e/0x3c0 net/netfilter/nf_tables_api.c:845
Code: 8b 05 86 5a 4e 09 48 3b 84 24 a0 00 00 00 75 62 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc e8 63 6d fa f7 90 <0f> 0b 90 43
+80 7c 35 00 00 0f 85 23 fe ff ff e9 26 fe ff ff 89 d9
RSP: 0018:ffffc900045af780 EFLAGS: 00010293
RAX: ffffffff89ca45bd RBX: 00000000fffffff4 RCX: ffff888028111e40
RDX: 0000000000000000 RSI: 00000000fffffff4 RDI: 0000000000000000
RBP: ffffc900045af870 R08: 0000000000400dc0 R09: 00000000ffffffff
R10: dffffc0000000000 R11: fffffbfff1d141db R12: ffffc900045af7e0
R13: 1ffff920008b5f24 R14: dffffc0000000000 R15: ffffc900045af920
FS: 000055557a6a5500(0000) GS:ffff888125496000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb5ea271fc0 CR3: 000000003269e000 CR4: 00000000003526f0
Call Trace:
<TASK>
__nft_release_table+0xceb/0x11f0 net/netfilter/nf_tables_api.c:12115
nft_rcv_nl_event+0xc25/0xdb0 net/netfilter/nf_tables_api.c:12187
notifier_call_chain+0x19d/0x3a0 kernel/notifier.c:85
blocking_notifier_call_chain+0x6a/0x90 kernel/notifier.c:380
netlink_release+0x123b/0x1ad0 net/netlink/af_netlink.c:761
__sock_release net/socket.c:662 [inline]
sock_close+0xc3/0x240 net/socket.c:1455
Restrict set clone to the flush set command in the preparation phase.
Add NFT_ITER_UPDATE_CLONE and use it for this purpose, update the rbtree
and pipapo backends to only clone the set when this iteration type is
used.
As for the existing NFT_ITER_UPDATE type, update the pipapo backend to
use the existing set clone if available, otherwise use the existing set
representation. After this update, there is no need to clone a set that
is being deleted, this includes bound anonymous set.
An alternative approach to NFT_ITER_UPDATE_CLONE is to add a .clone
interface and call it from the flush set path.
Reported-by: syzbot+4924a0edc148e8b4b342@syzkaller.appspotmail.com
Fixes: 3f1d886cc7 ("netfilter: nft_set_pipapo: move cloning of match info to insert/removal path")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Florian Westphal says:
====================
netfilter: updates for net-next
The following patchset contains Netfilter updates for *net-next*,
including changes to IPv6 stack and updates to IPVS from Julian Anastasov.
1) ipv6: export fib6_lookup for nft_fib_ipv6 module
2) factor out ipv6_anycast_destination logic so its usable without
dst_entry. These are dependencies for patch 3.
3) switch nft_fib_ipv6 module to no longer need temporary dst_entry
object allocations by using fib6_lookup() + RCU.
This gets us ~13% higher packet rate in my tests.
Patches 4 to 8, from Eric Dumazet, zap sk_callback_lock usage in
netfilter. Patch 9 removes another sk_callback_lock instance.
Remaining patches, from Julian Anastasov, improve IPVS, Quoting Julian:
* Add infrastructure for resizable hash tables based on hlist_bl.
* Change the 256-bucket service hash table to be resizable.
* Change the global connection table to be per-net and resizable.
* Make connection hashing more secure for setups with multiple services.
netfilter pull request nf-next-26-03-04
* tag 'nf-next-26-03-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
ipvs: use more keys for connection hashing
ipvs: switch to per-net connection table
ipvs: use resizable hash table for services
ipvs: add resizable hash tables
rculist_bl: add hlist_bl_for_each_entry_continue_rcu
netfilter: nfnetlink_queue: remove locking in nfqnl_get_sk_secctx
netfilter: nfnetlink_queue: no longer acquire sk_callback_lock
netfilter: nfnetlink_log: no longer acquire sk_callback_lock
netfilter: nft_meta: no longer acquire sk_callback_lock in nft_meta_get_eval_skugid()
netfilter: xt_owner: no longer acquire sk_callback_lock in mt_owner()
netfilter: nf_log_syslog: no longer acquire sk_callback_lock in nf_log_dump_sk_uid_gid()
netfilter: nft_fib_ipv6: switch to fib6_lookup
ipv6: make ipv6_anycast_destination logic usable without dst_entry
ipv6: export fib6_lookup for nft_fib_ipv6
====================
Link: https://patch.msgid.link/20260304114921.31042-1-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Instead of using struct timespec64 in scm_timestamping_internal,
use ktime_t, saving 24 bytes in kernel stack.
This makes tcp_update_recv_tstamps() small enough to be inlined.
The ktime_t -> timespec64 conversions happen after socket lock
has been released in tcp_recvmsg(), and only if the application
requested them.
$ scripts/bloat-o-meter -t vmlinux.0 vmlinux
add/remove: 0/2 grow/shrink: 5/4 up/down: 146/-277 (-131)
Function old new delta
tcp_zerocopy_receive 2383 2425 +42
mptcp_recvmsg 1565 1607 +42
tcp_recvmsg_locked 3797 3823 +26
put_cmsg_scm_timestamping64 131 149 +18
put_cmsg_scm_timestamping 131 149 +18
__pfx_tcp_update_recv_tstamps 16 - -16
do_tcp_getsockopt 4024 4006 -18
tcp_recv_timestamp 474 430 -44
tcp_zc_handle_leftover 417 371 -46
__sock_recv_timestamp 1087 1031 -56
tcp_update_recv_tstamps 97 - -97
Total: Before=25223788, After=25223657, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20260304012747.881644-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts 28ee1b746f ("secure_seq: downgrade to per-host timestamp offsets")
tcp_tw_recycle went away in 2017.
Zhouyan Deng reported off-path TCP source port leakage via
SYN cookie side-channel that can be fixed in multiple ways.
One of them is to bring back TCP ports in TS offset randomization.
As a bonus, we perform a single siphash() computation
to provide both an ISN and a TS offset.
Fixes: 28ee1b746f ("secure_seq: downgrade to per-host timestamp offsets")
Reported-by: Zhouyan Deng <dengzhouyan_nwpu@163.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260302205527.1982836-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When shrinking the number of real tx queues,
netif_set_real_num_tx_queues() calls qdisc_reset_all_tx_gt() to flush
qdiscs for queues which will no longer be used.
qdisc_reset_all_tx_gt() currently serializes qdisc_reset() with
qdisc_lock(). However, for lockless qdiscs, the dequeue path is
serialized by qdisc_run_begin/end() using qdisc->seqlock instead, so
qdisc_reset() can run concurrently with __qdisc_run() and free skbs
while they are still being dequeued, leading to UAF.
This can easily be reproduced on e.g. virtio-net by imposing heavy
traffic while frequently changing the number of queue pairs:
iperf3 -ub0 -c $peer -t 0 &
while :; do
ethtool -L eth0 combined 1
ethtool -L eth0 combined 2
done
With KASAN enabled, this leads to reports like:
BUG: KASAN: slab-use-after-free in __qdisc_run+0x133f/0x1760
...
Call Trace:
<TASK>
...
__qdisc_run+0x133f/0x1760
__dev_queue_xmit+0x248f/0x3550
ip_finish_output2+0xa42/0x2110
ip_output+0x1a7/0x410
ip_send_skb+0x2e6/0x480
udp_send_skb+0xb0a/0x1590
udp_sendmsg+0x13c9/0x1fc0
...
</TASK>
Allocated by task 1270 on cpu 5 at 44.558414s:
...
alloc_skb_with_frags+0x84/0x7c0
sock_alloc_send_pskb+0x69a/0x830
__ip_append_data+0x1b86/0x48c0
ip_make_skb+0x1e8/0x2b0
udp_sendmsg+0x13a6/0x1fc0
...
Freed by task 1306 on cpu 3 at 44.558445s:
...
kmem_cache_free+0x117/0x5e0
pfifo_fast_reset+0x14d/0x580
qdisc_reset+0x9e/0x5f0
netif_set_real_num_tx_queues+0x303/0x840
virtnet_set_channels+0x1bf/0x260 [virtio_net]
ethnl_set_channels+0x684/0xae0
ethnl_default_set_doit+0x31a/0x890
...
Serialize qdisc_reset_all_tx_gt() against the lockless dequeue path by
taking qdisc->seqlock for TCQ_F_NOLOCK qdiscs, matching the
serialization model already used by dev_reset_queue().
Additionally clear QDISC_STATE_NON_EMPTY after reset so the qdisc state
reflects an empty queue, avoiding needless re-scheduling.
Fixes: 6b3ba9146f ("net: sched: allow qdiscs to handle locking")
Signed-off-by: Koichiro Den <den@valinux.co.jp>
Link: https://patch.msgid.link/20260228145307.3955532-1-den@valinux.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of storing the @log at the beginning of rps_dev_flow_table
use 5 low order bits of the rps_tag_ptr to store the log of the size.
This removes a potential cache line miss (for light traffic).
This allows us to switch to one high-order allocation instead of vmalloc()
when CONFIG_RFS_ACCEL is not set.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260302181432.1836150-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of storing the @mask at the beginning of rps_sock_flow_table,
use 5 low order bits of the rps_tag_ptr to store the log of the size.
This removes a potential cache line miss to fetch @mask.
More importantly, we can switch to vmalloc_huge() without wasting memory.
Tested with:
numactl --interleave=all bash -c "echo 4194304 >/proc/sys/net/core/rps_sock_flow_entries"
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260302181432.1836150-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
udp_flow_src_port() and psp_write_headers() use ip_local_port_range.
ip_local_port_range is inclusive : all ports between min and max
can be used.
Before this patch, if ip_local_port_range was set to 40000-40001
40001 would not be used as a source port.
Use reciprocal_scale() to help code readability.
Not tagged for stable trees, as this change could break user
expectations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260302163933.1754393-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Johannes Berg says:
====================
Notable features this time:
- cfg80211/mac80211
- finished assoc frame encryption/EPPKE/802.1X-over-auth
(also hwsim)
- radar detection improvements
- 6 GHz incumbent signal detection APIs
- multi-link support for FILS, probe response
templates and client probling
- ath12k:
- monitor mode support on IPQ5332
- basic hwmon temperature reporting
* tag 'wireless-next-2026-03-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (38 commits)
wifi: UHR: define DPS/DBE/P-EDCA elements and fix size parsing
wifi: mac80211_hwsim: change hwsim_class to a const struct
wifi: mac80211: give the AP more time for EPPKE as well
wifi: ath12k: Remove the unused argument from the Rx data path
wifi: ath12k: Enable monitor mode support on IPQ5332
wifi: ath12k: Set up MLO after SSR
wifi: ath11k: Silence remoteproc probe deferral prints
wifi: cfg80211: support key installation on non-netdev wdevs
wifi: cfg80211: make cluster id an array
wifi: mac80211: update outdated comment
wifi: mac80211: Advertise IEEE 802.1X authentication support
wifi: mac80211: Add support for IEEE 802.1X authentication protocol in non-AP STA mode
wifi: cfg80211: add support for IEEE 802.1X Authentication Protocol
wifi: mac80211: Advertise EPPKE support based on driver capabilities
wifi: mac80211_hwsim: Advertise support for (Re)Association frame encryption
wifi: mac80211: Fix AAD/Nonce computation for management frames with MLO
wifi: rt2x00: use generic nvmem_cell_get
wifi: mac80211: fetch unsolicited probe response template by link ID
wifi: mac80211: fetch FILS discovery template by link ID
wifi: nl80211: don't allow DFS channels for NAN
...
====================
Link: https://patch.msgid.link/20260304113707.175181-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Simon Kirby reported long time ago that IPVS connection hashing
based only on the client address/port (caddr, cport) as hash keys
is not suitable for setups that accept traffic on multiple virtual
IPs and ports. It can happen for multiple VIP:VPORT services, for
single or many fwmark service(s) that match multiple virtual IPs
and ports or even for passive FTP with peristence in DR/TUN mode
where we expect traffic on multiple ports for the virtual IP.
Fix it by adding virtual addresses and ports to the hash function.
This causes the traffic from NAT real servers to clients to use
second hashing for the in->out direction.
As result:
- the IN direction from client will use hash node hn0 where
the source/dest addresses and ports used by client will be used
as hash keys
- the OUT direction from NAT real servers will use hash node hn1
for the traffic from real server to client
- the persistence templates are hashed only with parameters based on
the IN direction, so they now will also use the virtual address,
port and fwmark from the service.
OLD:
- all methods: c_list node: proto, caddr:cport
- persistence templates: c_list node: proto, caddr_net:0
- persistence engine templates: c_list node: per-PE, PE-SIP uses jhash
NEW:
- all methods: hn0 node (dir 0): proto, caddr:cport -> vaddr:vport
- MASQ method: hn1 node (dir 1): proto, daddr:dport -> caddr:cport
- persistence templates: hn0 node (dir 0):
proto, caddr_net:0 -> vaddr:vport_or_0
proto, caddr_net:0 -> fwmark:0
- persistence engine templates: hn0 node (dir 0): as before
Also reorder the ip_vs_conn fields, so that hash nodes are on same
read-mostly cache line while write-mostly fields are on separate
cache line.
Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Use per-net resizable hash table for connections. The global table
is slow to walk when using many namespaces.
The table can be resized in the range of [256 - ip_vs_conn_tab_size].
Table is attached only while services are present. Resizing is done
by delayed work based on load (the number of connections).
Add a hash_key field into the connection to store the table ID in
the highest bit and the entry's hash value in the lowest bits. The
lowest part of the hash value is used as bucket ID, the remaining
part is used to filter the entries in the bucket before matching
the keys and as result, helps the lookup operation to access only
one cache line. By knowing the table ID and bucket ID for entry,
we can unlink it without calculating the hash value and doing
lookup by keys. We need only to validate the saved hash_key under
lock.
For better security switch from jhash to siphash for the default
connection hashing but the persistence engines may use their own
function. Keeping the hash table loaded with entries below the
size (12%) allows to avoid collision for 96+% of the conns.
ip_vs_conn_fill_cport() now will rehash the connection with proper
locking because unhash+hash is not safe for RCU readers.
To invalidate the templates setting just dport to 0xffff is enough,
no need to rehash them. As result, ip_vs_conn_unhash() is now
unused and removed.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Make the hash table for services resizable in the bit range of 4-20.
Table is attached only while services are present. Resizing is done
by delayed work based on load (the number of hashed services).
Table grows when load increases 2+ times (above 12.5% with lfactor=-3)
and shrinks 8+ times when load decreases 16+ times (below 0.78%).
Switch to jhash hashing to reduce the collisions for multiple
services.
Add a hash_key field into the service to store the table ID in
the highest bit and the entry's hash value in the lowest bits. The
lowest part of the hash value is used as bucket ID, the remaining
part is used to filter the entries in the bucket before matching
the keys and as result, helps the lookup operation to access only
one cache line. By knowing the table ID and bucket ID for entry,
we can unlink it without calculating the hash value and doing
lookup by keys. We need only to validate the saved hash_key under
lock.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Add infrastructure for resizable hash tables based on hlist_bl
which we will use in followup patches.
The tables allow RCU lookups during resizing, bucket modifications
are protected with per-bucket bit lock and additional custom locking,
the tables are resized when load reaches thresholds determined based
on load factor parameter.
Compared to other implementations we rely on:
* fast entry removal by using node unlinking without pre-lookup
* entry rehashing when hash key changes
* entries can contain multiple hash nodes
* custom locking depending on different contexts
* adjustable load factor to customize the grow/shrink process
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
nft_fib_ipv6 uses ipv6_anycast_destination(), but upcoming patch removes
the dst_entry usage in favor of fib6_result.
Move the 'plen > 127' logic to a new helper and call it from the
existing one.
Signed-off-by: Florian Westphal <fw@strlen.de>
`struct sysctl_fib_multipath_hash_seed` contains two u32 fields
(user_seed and mp_seed), making it an 8-byte structure with a 4-byte
alignment requirement.
In `fib_multipath_hash_from_keys()`, the code evaluates the entire
struct atomically via `READ_ONCE()`:
mp_seed = READ_ONCE(net->ipv4.sysctl_fib_multipath_hash_seed).mp_seed;
While this silently works on GCC by falling back to unaligned regular
loads which the ARM64 kernel tolerates, it causes a fatal kernel panic
when compiled with Clang and LTO enabled.
Commit e35123d83e ("arm64: lto: Strengthen READ_ONCE() to acquire
when CONFIG_LTO=y") strengthens `READ_ONCE()` to use Load-Acquire
instructions (`ldar` / `ldapr`) to prevent compiler reordering bugs
under Clang LTO. Since the macro evaluates the full 8-byte struct,
Clang emits a 64-bit `ldar` instruction. ARM64 architecture strictly
requires `ldar` to be naturally aligned, thus executing it on a 4-byte
aligned address triggers a strict Alignment Fault (FSC = 0x21).
Fix the read side by moving the `READ_ONCE()` directly to the `u32`
member, which emits a safe 32-bit `ldar Wn`.
Furthermore, Eric Dumazet pointed out that `WRITE_ONCE()` on the entire
struct in `proc_fib_multipath_hash_set_seed()` is also flawed. Analysis
shows that Clang splits this 8-byte write into two separate 32-bit
`str` instructions. While this avoids an alignment fault, it destroys
atomicity and exposes a tear-write vulnerability. Fix this by
explicitly splitting the write into two 32-bit `WRITE_ONCE()`
operations.
Finally, add the missing `READ_ONCE()` when reading `user_seed` in
`proc_fib_multipath_hash_seed()` to ensure proper pairing and
concurrency safety.
Fixes: 4ee2a8cace ("net: ipv4: Add a sysctl to set multipath hash seed")
Signed-off-by: Yung Chih Su <yuuchihsu@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260302060247.7066-1-yuuchihsu@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The GF stats periodic query is used as mechanism to monitor HWC health
check. If this HWC command times out, it is a strong indication that
the device/SoC is in a faulty state and requires recovery.
Today, when a timeout is detected, the driver marks
hwc_timeout_occurred, clears cached stats, and stops rescheduling the
periodic work. However, the device itself is left in the same failing
state.
Extend the timeout handling path to trigger the existing MANA VF
recovery service by queueing a GDMA_EQE_HWC_RESET_REQUEST work item.
This is expected to initiate the appropriate recovery flow by suspende
resume first and if it fails then trigger a bus rescan.
This change is intentionally limited to HWC command timeouts and does
not trigger recovery for errors reported by the SoC as a normal command
response.
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/aaFShvKnwR5FY8dH@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
bond_option_mode_set() already rejects mode changes that would make a
loaded XDP program incompatible via bond_xdp_check(). However,
bond_option_xmit_hash_policy_set() has no such guard.
For 802.3ad and balance-xor modes, bond_xdp_check() returns false when
xmit_hash_policy is vlan+srcmac, because the 802.1q payload is usually
absent due to hardware offload. This means a user can:
1. Attach a native XDP program to a bond in 802.3ad/balance-xor mode
with a compatible xmit_hash_policy (e.g. layer2+3).
2. Change xmit_hash_policy to vlan+srcmac while XDP remains loaded.
This leaves bond->xdp_prog set but bond_xdp_check() now returning false
for the same device. When the bond is later destroyed, dev_xdp_uninstall()
calls bond_xdp_set(dev, NULL, NULL) to remove the program, which hits
the bond_xdp_check() guard and returns -EOPNOTSUPP, triggering:
WARN_ON(dev_xdp_install(dev, mode, bpf_op, NULL, 0, NULL))
Fix this by rejecting xmit_hash_policy changes to vlan+srcmac when an
XDP program is loaded on a bond in 802.3ad or balance-xor mode.
commit 39a0876d59 ("net, bonding: Disallow vlan+srcmac with XDP")
introduced bond_xdp_check() which returns false for 802.3ad/balance-xor
modes when xmit_hash_policy is vlan+srcmac. The check was wired into
bond_xdp_set() to reject XDP attachment with an incompatible policy, but
the symmetric path -- preventing xmit_hash_policy from being changed to an
incompatible value after XDP is already loaded -- was left unguarded in
bond_option_xmit_hash_policy_set().
Note:
commit 094ee6017e ("bonding: check xdp prog when set bond mode")
later added a similar guard to bond_option_mode_set(), but
bond_option_xmit_hash_policy_set() remained unprotected.
Reported-by: syzbot+5a287bcdc08104bc3132@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6995aff6.050a0220.2eeac1.014e.GAE@google.com/T/
Fixes: 39a0876d59 ("net, bonding: Disallow vlan+srcmac with XDP")
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Link: https://patch.msgid.link/20260226080306.98766-2-jiayuan.chen@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We will no longer hold RTNL for ipmr_rtm_route() to modify the
MFC hash table.
Only __dev_get_by_index() in rtm_to_ipmr_mfcc() is the RTNL
dependant, otherwise, we just need protection for mrt->mfc_hash
and mrt->mfc_cache_list.
Let's add a new mutex for ipmr_mfc_add(), ipmr_mfc_delete(),
and mroute_clean_tables() (setsockopt(MRT_FLUSH or MRT_DONE)).
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-15-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will no longer hold RTNL for ipmr_mfc_add() and ipmr_mfc_delete().
MFC entry can be loosely connected with VIF by its index for
mrt->vif_table[] (stored in mfc_parent), but the two tables are
not synchronised. i.e. Even if VIF 1 is removed, MFC for VIF 1
is not automatically removed.
The only field that the MFC/VIF interfaces share is
net->ipv[46].ipmr_seq, which is protected by RTNL.
Adding a new mutex for both just to protect a single field is overkill.
Let's convert the field to atomic_t.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-14-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Per IEEE Std 802.11be-2024, 12.5.2.3.3, if the MPDU is an
individually addressed Data frame between an AP MLD and a
non-AP MLD associated with the AP MLD, then A1/A2/A3
will be MLD MAC addresses. Otherwise, Al/A2/A3 will be
over-the-air link MAC addresses.
Currently, during AAD and Nonce computation for software based
encryption/decryption cases, mac80211 directly uses the addresses it
receives in the skb frame header. However, after the first
authentication, management frame addresses for non-AP MLD stations
are translated to MLD addresses from over the air link addresses in
software. This means that the skb header could contain translated MLD
addresses, which when used as is, can lead to incorrect AAD/Nonce
computation.
In the following manner, ensure that the right set of addresses are used:
In the receive path, stash the pre-translated link addresses in
ieee80211_rx_data and use them for the AAD/Nonce computations
when required.
In the transmit path, offload the encryption for a CCMP/GCMP key
to the hwsim driver that can then ensure that encryption and hence
the AAD/Nonce computations are performed on the frame containing the
right set of addresses, i.e, MLD addresses if unicast data frame and
link addresses otherwise.
To do so, register the set key handler in hwsim driver so mac80211 is
aware that it is the driver that would take care of encrypting the
frame. Offload encryption for a CCMP/GCMP key, while keeping the
encryption for WEP/TKIP and MMIE generation for a AES_CMAC or a
AES_GMAC key still at the SW crypto in mac layer
Co-developed-by: Rohan Dutta <quic_drohan@quicinc.com>
Signed-off-by: Rohan Dutta <quic_drohan@quicinc.com>
Signed-off-by: Sai Pratyusha Magam <sai.magam@oss.qualcomm.com>
Link: https://patch.msgid.link/20260226042959.3766157-1-sai.magam@oss.qualcomm.com
[only store and apply link_addrs for unicast non-data
rather storing always and applying for !unicast_data]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, the unsolicited probe response template is always fetched from
the default link of a virtual interface in both Multi-Link Operation (MLO)
and non-MLO cases. However, in the MLO case there is a need to fetch the
unsolicited probe response template from a specific link instead of the
default link.
Hence, add support for fetching the unsolicited probe response template
based on the link ID from the corresponding link data.
Signed-off-by: Sriram R <quic_srirrama@quicinc.com>
Co-developed-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com>
Signed-off-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com>
Link: https://patch.msgid.link/20260220-fils-prob-by-link-v1-2-a2746a853f75@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
DualPI2 drops packets during dequeue but was using kfree_skb_reason()
directly, bypassing trace_qdisc_drop. Convert to qdisc_dequeue_drop()
and add QDISC_DROP_L4S_STEP_NON_ECN to the qdisc drop reason enum.
- Set TCQ_F_DEQUEUE_DROPS flag in dualpi2_init()
- Use enum qdisc_drop_reason in drop_and_retry()
- Replace kfree_skb_reason() with qdisc_dequeue_drop()
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211351978.3011628.11267023360997620069.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rename QDISC_DROP_CAKE_FLOOD to QDISC_DROP_FLOOD_PROTECTION to use a
generic name without embedding the qdisc name. This follows the
principle that drop reasons should describe the drop mechanism rather
than being tied to a specific qdisc implementation.
The flood protection drop reason is used by qdiscs implementing
probabilistic drop algorithms (like BLUE) that detect unresponsive
flows indicating potential DoS or flood attacks. CAKE uses this via
its Cobalt AQM component.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211347537.3011628.13759059534638729639.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rename FQ-specific drop reasons to generic names:
- QDISC_DROP_FQ_BAND_LIMIT -> QDISC_DROP_BAND_LIMIT
- QDISC_DROP_FQ_HORIZON_LIMIT -> QDISC_DROP_HORIZON_LIMIT
This follows the principle that drop reasons should describe the drop
mechanism rather than being tied to a specific qdisc implementation.
These concepts (priority band limits, timestamp horizon) could apply
to other qdiscs as well.
Remove the local macro define FQDR() and instead use the
full QDISC_DROP_* name to make it easier to navigate code.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211346902.3011628.12523261489552097455.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert SFQ to use the new qdisc-specific drop reason infrastructure.
This patch demonstrates how to convert a flow-based qdisc to use the
new enum qdisc_drop_reason. As part of this conversion:
- Add QDISC_DROP_MAXFLOWS for flow table exhaustion
- Rename FQ_FLOW_LIMIT to generic FLOW_LIMIT, now shared by FQ and SFQ
- Use QDISC_DROP_OVERLIMIT for sfq_drop() when overall limit exceeded
- Use QDISC_DROP_FLOW_LIMIT for per-flow depth limit exceeded
The FLOW_LIMIT reason is now a common drop reason for per-flow limits,
applicable to both FQ and SFQ qdiscs.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211345946.3011628.12770616071857185664.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Create new enum qdisc_drop_reason and trace_qdisc_drop tracepoint
for qdisc layer drop diagnostics with direct qdisc context visibility.
The new tracepoint includes qdisc handle, parent, kind (name), and
device information. Existing SKB_DROP_REASON_QDISC_DROP is retained
for backwards compatibility via kfree_skb_reason().
Convert qdiscs with drop reasons to use the new infrastructure.
Change CAKE's cobalt_should_drop() return type from enum skb_drop_reason
to enum qdisc_drop_reason to fix implicit enum conversion warnings.
Use QDISC_DROP_UNSPEC as the 'not dropped' sentinel instead of
SKB_NOT_DROPPED_YET. Both have the same compiled value (0), so the
comparison logic remains semantically equivalent.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211345275.3011628.1974310302645218067.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After commit b692bf9a75 ("xsk: Get rid of xdp_buff_xsk::xskb_list_node"),
the list_node field is reused for both the xskb pool list and the buffer
free list, this causes a buffer leak as described below.
xp_free() checks if a buffer is already on the free list using
list_empty(&xskb->list_node). When list_del() is used to remove a node
from the xskb pool list, it doesn't reinitialize the node pointers.
This means list_empty() will return false even after the node has been
removed, causing xp_free() to incorrectly skip adding the buffer to the
free list.
Fix this by using list_del_init() instead of list_del() in all fragment
handling paths, this ensures the list node is reinitialized after removal,
allowing the list_empty() to work correctly.
Fixes: b692bf9a75 ("xsk: Get rid of xdp_buff_xsk::xskb_list_node")
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com>
Link: https://patch.msgid.link/20260225000456.107806-2-nikhil.rao@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>