linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-21 19:35:23 -04:00

Author	SHA1	Message	Date
Eric Dumazet	95fa78830e	inet_diag: avoid cache line misses in inet_diag_bc_sk() inet_diag_bc_sk() pulls five cache lines per socket, while most filters only need the two first ones. Add three booleans to struct inet_diag_dump_data, that are selectively set if a filter needs specific socket fields. - mark_needed /* INET_DIAG_BC_MARK_COND present. / - cgroup_needed / INET_DIAG_BC_CGROUP_COND present. / - userlocks_needed / INET_DIAG_BC_AUTO present. */ This removes millions of cache lines misses per ss invocation when simple filters are specified on busy servers. offsetof(struct sock, sk_userlocks) = 0xf3 offsetof(struct sock, sk_mark) = 0x20c offsetof(struct sock, sk_cgrp_data) = 0x298 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250828102738.2065992-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 19:29:24 -07:00
Eric Dumazet	9529320ad6	inet_diag: change inet_diag_bc_sk() first argument We want to have access to the inet_diag_dump_data structure in the following patch. This patch removes duplication in callers. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250828102738.2065992-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 19:29:24 -07:00
Eric Dumazet	4fd84a0aaf	inet_diag: annotate data-races in inet_diag_bc_sk() inet_diag_bc_sk() runs with an unlocked socket, annotate potential races with READ_ONCE(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250828102738.2065992-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 19:29:24 -07:00
Eric Dumazet	8e60447f08	tcp: annotate data-races in tcp_req_diag_fill() req->num_retrans and rsk_timer.expires are read locklessly, and can be changed from tcp_rtx_synack(). Add READ_ONCE()/WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250828102738.2065992-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 19:29:24 -07:00
Eric Dumazet	9a574257b9	inet_diag: annotate data-races in inet_diag_msg_common_fill() inet_diag_msg_common_fill() can run without socket lock. Add READ_ONCE() or data_race() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250828102738.2065992-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 19:29:24 -07:00
Eric Dumazet	9f74c0ea9b	net_sched: gen_estimator: fix est_timer() vs CONFIG_PREEMPT_RT=y syzbot reported a WARNING in est_timer() [1] Problem here is that with CONFIG_PREEMPT_RT=y, timer callbacks can be preempted. Adopt preempt_disable_nested()/preempt_enable_nested() to fix this. [1] WARNING: CPU: 0 PID: 16 at ./include/linux/seqlock.h:221 __seqprop_assert include/linux/seqlock.h:221 [inline] WARNING: CPU: 0 PID: 16 at ./include/linux/seqlock.h:221 est_timer+0x6dc/0x9f0 net/core/gen_estimator.c:93 Modules linked in: CPU: 0 UID: 0 PID: 16 Comm: ktimers/0 Not tainted syzkaller #0 PREEMPT_{RT,(full)} Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/12/2025 RIP: 0010:__seqprop_assert include/linux/seqlock.h:221 [inline] RIP: 0010:est_timer+0x6dc/0x9f0 net/core/gen_estimator.c:93 Call Trace: <TASK> call_timer_fn+0x17e/0x5f0 kernel/time/timer.c:1747 expire_timers kernel/time/timer.c:1798 [inline] __run_timers kernel/time/timer.c:2372 [inline] __run_timer_base+0x648/0x970 kernel/time/timer.c:2384 run_timer_base kernel/time/timer.c:2393 [inline] run_timer_softirq+0xb7/0x180 kernel/time/timer.c:2403 handle_softirqs+0x22c/0x710 kernel/softirq.c:579 __do_softirq kernel/softirq.c:613 [inline] run_ktimerd+0xcf/0x190 kernel/softirq.c:1043 smpboot_thread_fn+0x53f/0xa60 kernel/smpboot.c:160 kthread+0x70e/0x8a0 kernel/kthread.c:463 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> Fixes: `d2d6422f8b` ("x86: Allow to enable PREEMPT_RT.") Reported-by: syzbot+72db9ee39db57c3fecc5@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68adf6fa.a70a0220.3cafd4.0000.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20250827162352.3960779-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 19:04:20 -07:00
Jakub Kicinski	c407beb5b8	Merge tag 'nf-25-08-27' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Florian Westphal says: ==================== netfilter updates for net 1) Remove bogus WARN_ON in br_netfilter that came in 6.8. This is now more prominent due to commit `2d72afb340` ("netfilter: nf_conntrack: fix crash due to removal of uninitialised entry"). From Wang Liang. 2) Better error reporting when a helper module clashes with an existing helper name: -EEXIST makes modprobe believe that the module is already loaded, so error message is elided. From Phil Sutter. * tag 'nf-25-08-27' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: conntrack: helper: Replace -EEXIST by -EBUSY netfilter: br_netfilter: do not check confirmed bit in br_nf_local_in() after confirm ==================== Link: https://patch.msgid.link/20250827133900.16552-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 18:57:23 -07:00
Jakub Kicinski	629840e208	Merge tag 'wireless-2025-08-28' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Some fixes for the current cycle: - mt76: MLO regressions, offchannel handling, list corruption - mac80211: scan allocation size, no 40 MHz EHT, signed type - rt2x00: (randconfig) build - cfg80211: use-after-free - iwlwifi: config/old devices, BIOS compatibility - mwifiex: vmalloc content leak * tag 'wireless-2025-08-28' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (29 commits) wifi: iwlwifi: cfg: add back more lost PCI IDs wifi: iwlwifi: fix byte count table for old devices wifi: iwlwifi: cfg: restore some 1000 series configs wifi: mwifiex: Initialize the chan_stats array to zero wifi: mac80211: do not permit 40 MHz EHT operation on 5/6 GHz wifi: iwlwifi: uefi: check DSM item validity wifi: iwlwifi: acpi: check DSM func validity wifi: iwlwifi: if scratch is ~0U, consider it a failure wifi: mt76: fix linked list corruption wifi: mt76: free pending offchannel tx frames on wcid cleanup wifi: mt76: mt7915: fix list corruption after hardware restart wifi: mt76: mt7996: add missing check for rx wcid entries wifi: mt76: do not add non-sta wcid entries to the poll list wifi: mt76: mt7996: fix crash on some tx status reports wifi: mt76: mt7996: use the correct vif link for scanning/roc wifi: mt76: mt7996: disable beacons when going offchannel wifi: mt76: prevent non-offchannel mgmt tx during scan/roc wifi: mt76: mt7925: skip EHT MLD TLV on non-MLD and pass conn_state for sta_cmd wifi: mt76: mt7925u: use connac3 tx aggr check in tx complete wifi: mt76: mt7925: fix the wrong bss cleanup for SAP ... ==================== Link: https://patch.msgid.link/20250828122654.1167754-8-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 13:35:13 -07:00
Kuniyuki Iwashima	862c628108	Bluetooth: Fix use-after-free in l2cap_sock_cleanup_listen() syzbot reported the splat below without a repro. In the splat, a single thread calling bt_accept_dequeue() freed sk and touched it after that. The root cause would be the racy l2cap_sock_cleanup_listen() call added by the cited commit. bt_accept_dequeue() is called under lock_sock() except for l2cap_sock_release(). Two threads could see the same socket during the list iteration in bt_accept_dequeue(): CPU1 CPU2 (close()) ---- ---- sock_hold(sk) sock_hold(sk); lock_sock(sk) <-- block close() sock_put(sk) bt_accept_unlink(sk) sock_put(sk) <-- refcnt by bt_accept_enqueue() release_sock(sk) lock_sock(sk) sock_put(sk) bt_accept_unlink(sk) sock_put(sk) <-- last refcnt bt_accept_unlink(sk) <-- UAF Depending on the timing, the other thread could show up in the "Freed by task" part. Let's call l2cap_sock_cleanup_listen() under lock_sock() in l2cap_sock_release(). [0]: BUG: KASAN: slab-use-after-free in debug_spin_lock_before kernel/locking/spinlock_debug.c:86 [inline] BUG: KASAN: slab-use-after-free in do_raw_spin_lock+0x26f/0x2b0 kernel/locking/spinlock_debug.c:115 Read of size 4 at addr ffff88803b7eb1c4 by task syz.5.3276/16995 CPU: 3 UID: 0 PID: 16995 Comm: syz.5.3276 Not tainted syzkaller #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xcd/0x630 mm/kasan/report.c:482 kasan_report+0xe0/0x110 mm/kasan/report.c:595 debug_spin_lock_before kernel/locking/spinlock_debug.c:86 [inline] do_raw_spin_lock+0x26f/0x2b0 kernel/locking/spinlock_debug.c:115 spin_lock_bh include/linux/spinlock.h:356 [inline] release_sock+0x21/0x220 net/core/sock.c:3746 bt_accept_dequeue+0x505/0x600 net/bluetooth/af_bluetooth.c:312 l2cap_sock_cleanup_listen+0x5c/0x2a0 net/bluetooth/l2cap_sock.c:1451 l2cap_sock_release+0x5c/0x210 net/bluetooth/l2cap_sock.c:1425 __sock_release+0xb3/0x270 net/socket.c:649 sock_close+0x1c/0x30 net/socket.c:1439 __fput+0x3ff/0xb70 fs/file_table.c:468 task_work_run+0x14d/0x240 kernel/task_work.c:227 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] exit_to_user_mode_loop+0xeb/0x110 kernel/entry/common.c:43 exit_to_user_mode_prepare include/linux/irq-entry-common.h:225 [inline] syscall_exit_to_user_mode_work include/linux/entry-common.h:175 [inline] syscall_exit_to_user_mode include/linux/entry-common.h:210 [inline] do_syscall_64+0x3f6/0x4c0 arch/x86/entry/syscall_64.c:100 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f2accf8ebe9 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffdb6cb1378 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4 RAX: 0000000000000000 RBX: 00000000000426fb RCX: 00007f2accf8ebe9 RDX: 0000000000000000 RSI: 000000000000001e RDI: 0000000000000003 RBP: 00007f2acd1b7da0 R08: 0000000000000001 R09: 00000012b6cb166f R10: 0000001b30e20000 R11: 0000000000000246 R12: 00007f2acd1b609c R13: 00007f2acd1b6090 R14: ffffffffffffffff R15: 00007ffdb6cb1490 </TASK> Allocated by task 5326: kasan_save_stack+0x33/0x60 mm/kasan/common.c:47 kasan_save_track+0x14/0x30 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:388 [inline] __kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:405 kasan_kmalloc include/linux/kasan.h:260 [inline] __do_kmalloc_node mm/slub.c:4365 [inline] __kmalloc_noprof+0x223/0x510 mm/slub.c:4377 kmalloc_noprof include/linux/slab.h:909 [inline] sk_prot_alloc+0x1a8/0x2a0 net/core/sock.c:2239 sk_alloc+0x36/0xc20 net/core/sock.c:2295 bt_sock_alloc+0x3b/0x3a0 net/bluetooth/af_bluetooth.c:151 l2cap_sock_alloc.constprop.0+0x33/0x1d0 net/bluetooth/l2cap_sock.c:1894 l2cap_sock_new_connection_cb+0x101/0x240 net/bluetooth/l2cap_sock.c:1482 l2cap_connect_cfm+0x4c4/0xf80 net/bluetooth/l2cap_core.c:7287 hci_connect_cfm include/net/bluetooth/hci_core.h:2050 [inline] hci_remote_features_evt+0x4dd/0x970 net/bluetooth/hci_event.c:3712 hci_event_func net/bluetooth/hci_event.c:7519 [inline] hci_event_packet+0xa0d/0x11c0 net/bluetooth/hci_event.c:7573 hci_rx_work+0x2c5/0x16b0 net/bluetooth/hci_core.c:4071 process_one_work+0x9cf/0x1b70 kernel/workqueue.c:3236 process_scheduled_works kernel/workqueue.c:3319 [inline] worker_thread+0x6c8/0xf10 kernel/workqueue.c:3400 kthread+0x3c2/0x780 kernel/kthread.c:463 ret_from_fork+0x5d7/0x6f0 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 Freed by task 16995: kasan_save_stack+0x33/0x60 mm/kasan/common.c:47 kasan_save_track+0x14/0x30 mm/kasan/common.c:68 kasan_save_free_info+0x3b/0x60 mm/kasan/generic.c:576 poison_slab_object mm/kasan/common.c:243 [inline] __kasan_slab_free+0x60/0x70 mm/kasan/common.c:275 kasan_slab_free include/linux/kasan.h:233 [inline] slab_free_hook mm/slub.c:2417 [inline] slab_free mm/slub.c:4680 [inline] kfree+0x2b4/0x4d0 mm/slub.c:4879 sk_prot_free net/core/sock.c:2278 [inline] __sk_destruct+0x75f/0x9a0 net/core/sock.c:2373 sk_destruct+0xc2/0xf0 net/core/sock.c:2401 __sk_free+0xf4/0x3e0 net/core/sock.c:2412 sk_free+0x6a/0x90 net/core/sock.c:2423 sock_put include/net/sock.h:1960 [inline] bt_accept_unlink+0x245/0x2e0 net/bluetooth/af_bluetooth.c:262 bt_accept_dequeue+0x517/0x600 net/bluetooth/af_bluetooth.c:308 l2cap_sock_cleanup_listen+0x5c/0x2a0 net/bluetooth/l2cap_sock.c:1451 l2cap_sock_release+0x5c/0x210 net/bluetooth/l2cap_sock.c:1425 __sock_release+0xb3/0x270 net/socket.c:649 sock_close+0x1c/0x30 net/socket.c:1439 __fput+0x3ff/0xb70 fs/file_table.c:468 task_work_run+0x14d/0x240 kernel/task_work.c:227 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] exit_to_user_mode_loop+0xeb/0x110 kernel/entry/common.c:43 exit_to_user_mode_prepare include/linux/irq-entry-common.h:225 [inline] syscall_exit_to_user_mode_work include/linux/entry-common.h:175 [inline] syscall_exit_to_user_mode include/linux/entry-common.h:210 [inline] do_syscall_64+0x3f6/0x4c0 arch/x86/entry/syscall_64.c:100 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: `1728137b33` ("Bluetooth: L2CAP: Fix use-after-free in l2cap_sock_ready_cb") Reported-by: syzbot+e5e64cdf8e92046dd3e1@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-bluetooth/68af6b9d.a70a0220.3cafd4.0032.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2025-08-29 14:51:06 -04:00
Jakub Kicinski	d23ad54de7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.17-rc4). No conflicts. Adjacent changes: drivers/net/ethernet/intel/idpf/idpf_txrx.c `02614eee26` ("idpf: do not linearize big TSO packets") `6c4e684802` ("idpf: remove obsolete stashing code") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 11:48:01 -07:00
Lauri Vasama	db2ab24a34	Add RWF_NOSIGNAL flag for pwritev2 For a user mode library to avoid generating SIGPIPE signals (e.g. because this behaviour is not portable across operating systems) is cumbersome. It is generally bad form to change the process-wide signal mask in a library, so a local solution is needed instead. For I/O performed directly using system calls (synchronous or readiness based asynchronous) this currently involves applying a thread-specific signal mask before the operation and reverting it afterwards. This can be avoided when it is known that the file descriptor refers to neither a pipe nor a socket, but a conservative implementation must always apply the mask. This incurs the cost of two additional system calls. In the case of sockets, the existing MSG_NOSIGNAL flag can be used with send. For asynchronous I/O performed using io_uring, currently the only option (apart from MSG_NOSIGNAL for sockets), is to mask SIGPIPE entirely in the call to io_uring_enter. Thankfully io_uring_enter takes a signal mask, so only a single syscall is needed. However, copying the signal mask on every call incurs a non-zero performance penalty. Furthermore, this mask applies to all completions, meaning that if the non-signaling behaviour is desired only for some subset of operations, the desired signals must be raised manually from user-mode depending on the completed operation. Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE signal from being raised when writing on disconnected pipes or sockets. The flag is handled directly by the pipe filesystem and converted to the existing MSG_NOSIGNAL flag for sockets. Signed-off-by: Lauri Vasama <git@vasama.org> Link: https://lore.kernel.org/20250827133901.1820771-1-git@vasama.org Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-29 15:08:07 +02:00
Eric Dumazet	53df77e785	net_sched: act_skbmod: use RCU in tcf_skbmod_dump() Also storing tcf_action into struct tcf_skbmod_params makes sure there is no discrepancy in tcf_skbmod_act(). No longer block BH in tcf_skbmod_init() when acquiring tcf_lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250827125349.3505302-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-28 16:46:23 -07:00
Eric Dumazet	e97ae74297	net_sched: act_tunnel_key: use RCU in tunnel_key_dump() Also storing tcf_action into struct tcf_tunnel_key_params makes sure there is no discrepancy in tunnel_key_act(). No longer block BH in tunnel_key_init() when acquiring tcf_lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250827125349.3505302-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-28 16:46:23 -07:00
Eric Dumazet	48b5e5dbdb	net_sched: act_vlan: use RCU in tcf_vlan_dump() Also storing tcf_action into struct tcf_vlan_params makes sure there is no discrepancy in tcf_vlan_act(). No longer block BH in tcf_vlan_init() when acquiring tcf_lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250827125349.3505302-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-28 16:46:23 -07:00
Eric Dumazet	3133d5c15c	net_sched: remove BH blocking in eight actions Followup of `f45b45cbfa` ("Merge branch 'net_sched-act-extend-rcu-use-in-dump-methods'") We never grab tcf_lock from BH context in these modules: act_connmark act_csum act_ct act_ctinfo act_mpls act_nat act_pedit act_skbedit No longer block BH when acquiring tcf_lock from init functions. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250827125349.3505302-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-28 16:46:23 -07:00
Dragos Tatulea	b8aab4bb95	net: devmem: allow binding on rx queues with same DMA devices Multi-PF netdevs have queues belonging to different PFs which also means different DMA devices. This means that the binding on the DMA buffer can be done to the incorrect device. This change allows devmem binding to multiple queues only when the queues have the same DMA device. Otherwise an error is returned. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Link: https://patch.msgid.link/20250827144017.1529208-9-dtatulea@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-28 16:05:32 -07:00
Dragos Tatulea	1b416902cd	net: devmem: pre-read requested rx queues during bind Instead of reading the requested rx queues after binding the buffer, read the rx queues in advance in a bitmap and iterate over them when needed. This is a preparation for fetching the DMA device for each queue. This patch has no functional changes besides adding an extra rq index bounds check. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250827144017.1529208-8-dtatulea@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-28 16:05:32 -07:00
Dragos Tatulea	512c88fb0e	net: devmem: pull out dma_dev out of net_devmem_bind_dmabuf Fetch the DMA device before calling net_devmem_bind_dmabuf() and pass it on as a parameter. This is needed for an upcoming change which will read the DMA device per queue. This patch has no functional changes. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250827144017.1529208-7-dtatulea@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-28 16:05:32 -07:00
Dragos Tatulea	7c7e94603a	net: devmem: get netdev DMA device via new API Switch to the new API for fetching DMA devices for a netdev. The API is called with queue index 0 for now which is equivalent with the previous behavior. This patch will allow devmem to work with devices where the DMA device is not stored in the parent device. mlx5 SFs are an example of such a device. Multi-PF netdevs are still problematic (as they were before this change). Upcoming patches will address this for the rx binding. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250827144017.1529208-5-dtatulea@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-28 16:05:31 -07:00
Dragos Tatulea	13d8e05adf	queue_api: add support for fetching per queue DMA dev For zerocopy (io_uring, devmem), there is an assumption that the parent device can do DMA. However that is not always the case: - Scalable Function netdevs [1] have the DMA device in the grandparent. - For Multi-PF netdevs [2] queues can be associated to different DMA devices. This patch introduces the a queue based interface for allowing drivers to expose a different DMA device for zerocopy. [1] Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst [2] Documentation/networking/multi-pf-netdev.rst Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250827144017.1529208-3-dtatulea@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-28 16:05:31 -07:00
Benjamin Berg	75575e2d25	wifi: mac80211: do not permit 40 MHz EHT operation on 5/6 GHz The EHT PHY requirements state that 80 MHz must be supported on the 5 and 6 GHz bands unless the STA is 20 MHz only. So if the channel width is limited to 40 MHz on a band other than 2.4 GHz, then disable EHT and downgrade to HE. The primary case where this can happen is if the hardware disables puncturing using IEEE80211_HW_DISALLOW_PUNCTURING. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Cc: stable@vger.kernel.org Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250826202553.a6582f3abf57.Ic670429dc7127f68c818b4290d950ebfb5a0b9e1@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2025-08-28 13:39:16 +02:00
Eric Dumazet	b81aa23234	inet: raw: add drop_counters to raw sockets When a packet flood hits one or more RAW sockets, many cpus have to update sk->sk_drops. This slows down other cpus, because currently sk_drops is in sock_write_rx group. Add a socket_drop_counters structure to raw sockets. Using dedicated cache lines to hold drop counters makes sure that consumers no longer suffer from false sharing if/when producers only change sk->sk_drops. This adds 128 bytes per RAW socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-6-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-28 13:14:50 +02:00
Eric Dumazet	c51613fa27	net: add sk->sk_drop_counters Some sockets suffer from heavy false sharing on sk->sk_drops, and fields in the same cache line. Add sk->sk_drop_counters to: - move the drop counter(s) to dedicated cache lines. - Add basic NUMA awareness to these drop counter(s). Following patches will use this infrastructure for UDP and RAW sockets. sk_clone_lock() is not yet ready, it would need to properly set newsk->sk_drop_counters if we plan to use this for TCP sockets. v2: used Paolo suggestion from https://lore.kernel.org/netdev/8f09830a-d83d-43c9-b36b-88ba0a23e9b2@redhat.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-4-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-28 13:14:50 +02:00
Eric Dumazet	cb4d5a6eb6	net: add sk_drops_skbadd() helper Existing sk_drops_add() helper is renamed to sk_drops_skbadd(). Add sk_drops_add() and convert sk_drops_inc() to use it. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-3-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-28 13:14:50 +02:00
Eric Dumazet	f86f42ed2c	net: add sk_drops_read(), sk_drops_inc() and sk_drops_reset() helpers We want to split sk->sk_drops in the future to reduce potential contention on this field. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-2-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-28 13:14:50 +02:00
Oscar Maes	5189446ba9	net: ipv4: fix regression in local-broadcast routes Commit `9e30ecf23b` ("net: ipv4: fix incorrect MTU in broadcast routes") introduced a regression where local-broadcast packets would have their gateway set in __mkroute_output, which was caused by fi = NULL being removed. Fix this by resetting the fib_info for local-broadcast packets. This preserves the intended changes for directed-broadcast packets. Cc: stable@vger.kernel.org Fixes: `9e30ecf23b` ("net: ipv4: fix incorrect MTU in broadcast routes") Reported-by: Brett A C Sheffield <bacs@librecast.net> Closes: https://lore.kernel.org/regressions/20250822165231.4353-4-bacs@librecast.net Signed-off-by: Oscar Maes <oscmaes92@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250827062322.4807-1-oscmaes92@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-28 10:52:30 +02:00
Krishna Kumar	48aa30443e	net: Cache hash and flow_id to avoid recalculation get_rps_cpu() can cache flow_id and hash as both are required by set_rps_cpu() instead of recalculating them twice. Signed-off-by: Krishna Kumar <krikku@gmail.com> Link: https://patch.msgid.link/20250825031005.3674864-3-krikku@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-27 18:24:20 -07:00
Krishna Kumar	97bcc5b6f4	net: Prevent RPS table overwrite of active flows This patch fixes an issue where two different flows on the same RXq produce the same hash resulting in continuous flow overwrites. Flow #1: A packet for Flow #1 comes in, kernel calls the steering function. The driver gives back a filter id. The kernel saves this filter id in the selected slot. Later, the driver's service task checks if any filters have expired and then installs the rule for Flow #1. Flow #2: A packet for Flow #2 comes in. It goes through the same steps. But this time, the chosen slot is being used by Flow #1. The driver gives a new filter id and the kernel saves it in the same slot. When the driver's service task runs, it runs through all the flows, checks if Flow #1 should be expired, the kernel returns True as the slot has a different filter id, and then the driver installs the rule for Flow #2. Flow #1: Another packet for Flow #1 comes in. The same thing repeats. The slot is overwritten with a new filter id for Flow #1. This causes a repeated cycle of flow programming for missed packets, wasting CPU cycles while not improving performance. This problem happens at higher rates when the RPS table is small, but tests show it still happens even with 12,000 connections and an RPS size of 16K per queue (global table size = 144x16K = 64K). This patch prevents overwriting an rps_dev_flow entry if it is active. The intention is that it is better to do aRFS for the first flow instead of hurting all flows on the same hash. Without this, two (or more) flows on one RX queue with the same hash can keep overwriting each other. This causes the driver to reprogram the flow repeatedly. Changes: 1. Add a new 'hash' field to struct rps_dev_flow. 2. Add rps_flow_is_active(): a helper function to check if a flow is active or not, extracted from rps_may_expire_flow(). It is further simplified as per reviewer feedback. 3. In set_rps_cpu(): - Avoid overwriting by programming a new filter if: - The slot is not in use, or - The slot is in use but the flow is not active, or - The slot has an active flow with the same hash, but target CPU differs. - Save the hash in the rps_dev_flow entry. 4. rps_may_expire_flow(): Use earlier extracted rps_flow_is_active(). Testing & results: - Driver: ice (E810 NIC), Kernel: net-next - #CPUs = #RXq = 144 (1:1) - Number of flows: 12K - Eight RPS settings from 256 to 32768. Though RPS=256 is not ideal, it is still sufficient to cover 12K flows (256144 rx-queues = 64K global table slots) - Global Table Size = 144 RPS (effectively equal to 256 * RPS) - Each RPS test duration = 8 mins (org code) + 8 mins (new code). - Metrics captured on client Legend for following tables: Steer-C: #times ndo_rx_flow_steer() was Called by set_rps_cpu() Steer-L: #times ice_arfs_flow_steer() Looped over aRFS entries Add: #times driver actually programmed aRFS (ice_arfs_build_entry()) Del: #times driver deleted the flow (ice_arfs_del_flow_rules()) Units: K = 1,000 times, M = 1 million times \|-------\|---------\|------\| Org Code \|---------\|---------\| \| RPS \| Latency \| CPU \| Add \| Del \| Steer-C \| Steer-L \| \|-------\|---------\|------\|--------\|--------\|---------\|---------\| \| 256 \| 227.0 \| 93.2 \| 1.6M \| 1.6M \| 121.7M \| 267.6M \| \| 512 \| 225.9 \| 94.1 \| 11.5M \| 11.2M \| 65.7M \| 199.6M \| \| 1024 \| 223.5 \| 95.6 \| 16.5M \| 16.5M \| 27.1M \| 187.3M \| \| 2048 \| 222.2 \| 96.3 \| 10.5M \| 10.5M \| 12.5M \| 115.2M \| \| 4096 \| 223.9 \| 94.1 \| 5.5M \| 5.5M \| 7.2M \| 65.9M \| \| 8192 \| 224.7 \| 92.5 \| 2.7M \| 2.7M \| 3.0M \| 29.9M \| \| 16384 \| 223.5 \| 92.5 \| 1.3M \| 1.3M \| 1.4M \| 13.9M \| \| 32768 \| 219.6 \| 93.2 \| 838.1K \| 838.1K \| 965.1K \| 8.9M \| \|-------\|---------\|------\| New Code \|---------\|---------\| \| 256 \| 201.5 \| 99.1 \| 13.4K \| 5.0K \| 13.7K \| 75.2K \| \| 512 \| 202.5 \| 98.2 \| 11.2K \| 5.9K \| 11.2K \| 55.5K \| \| 1024 \| 207.3 \| 93.9 \| 11.5K \| 9.7K \| 11.5K \| 59.6K \| \| 2048 \| 207.5 \| 96.7 \| 11.8K \| 11.1K \| 15.5K \| 79.3K \| \| 4096 \| 206.9 \| 96.6 \| 11.8K \| 11.7K \| 11.8K \| 63.2K \| \| 8192 \| 205.8 \| 96.7 \| 11.9K \| 11.8K \| 11.9K \| 63.9K \| \| 16384 \| 200.9 \| 98.2 \| 11.9K \| 11.9K \| 11.9K \| 64.2K \| \| 32768 \| 202.5 \| 98.0 \| 11.9K \| 11.9K \| 11.9K \| 64.2K \| \|-------\|---------\|------\|--------\|--------\|---------\|---------\| Some observations: 1. Overall Latency improved: (1790.19-1634.94)/1790.19100 = 8.67% 2. Overall CPU increased: (777.32-751.49)/751.45100 = 3.44% 3. Flow Management (add/delete) remained almost constant at ~11K compared to values in millions. Signed-off-by: Krishna Kumar <krikku@gmail.com> Link: https://patch.msgid.link/20250825031005.3674864-2-krikku@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-27 18:24:13 -07:00
Eric Dumazet	1cc8a5b534	net: rose: fix a typo in rose_clear_routes() syzbot crashed in rose_clear_routes(), after a recent patch typo. KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] CPU: 0 UID: 0 PID: 10591 Comm: syz.3.1856 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/12/2025 RIP: 0010:rose_clear_routes net/rose/rose_route.c:565 [inline] RIP: 0010:rose_rt_ioctl+0x162/0x1250 net/rose/rose_route.c:760 <TASK> rose_ioctl+0x3ce/0x8b0 net/rose/af_rose.c:1381 sock_do_ioctl+0xd9/0x300 net/socket.c:1238 sock_ioctl+0x576/0x790 net/socket.c:1359 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:598 [inline] __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:584 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: `da9c9c8775` ("net: rose: include node references in rose_neigh refcount") Reported-by: syzbot+2eb8d1719f7cfcfa6840@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68af3e29.a70a0220.3cafd4.002e.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Takamitsu Iwai <takamitz@amazon.co.jp> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250827172149.5359-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-27 17:27:52 -07:00
Eric Dumazet	9b8c88f875	l2tp: do not use sock_hold() in pppol2tp_session_get_sock() pppol2tp_session_get_sock() is using RCU, it must be ready for sk_refcnt being zero. Commit `ee40fb2e1e` ("l2tp: protect sock pointer of struct pppol2tp_session with RCU") was correct because it had a call_rcu(..., pppol2tp_put_sk) which was later removed in blamed commit. pppol2tp_recv() can use pppol2tp_session_get_sock() as well. Fixes: `c5cbaef992` ("l2tp: refactor ppp socket/session relationship") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: James Chapman <jchapman@katalix.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20250826134435.1683435-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-27 17:16:13 -07:00
Eric Dumazet	2e87504692	sctp: initialize more fields in sctp_v6_from_sk() syzbot found that sin6_scope_id was not properly initialized, leading to undefined behavior. Clear sin6_scope_id and sin6_flowinfo. BUG: KMSAN: uninit-value in __sctp_v6_cmp_addr+0x887/0x8c0 net/sctp/ipv6.c:649 __sctp_v6_cmp_addr+0x887/0x8c0 net/sctp/ipv6.c:649 sctp_inet6_cmp_addr+0x4f2/0x510 net/sctp/ipv6.c:983 sctp_bind_addr_conflict+0x22a/0x3b0 net/sctp/bind_addr.c:390 sctp_get_port_local+0x21eb/0x2440 net/sctp/socket.c:8452 sctp_get_port net/sctp/socket.c:8523 [inline] sctp_listen_start net/sctp/socket.c:8567 [inline] sctp_inet_listen+0x710/0xfd0 net/sctp/socket.c:8636 __sys_listen_socket net/socket.c:1912 [inline] __sys_listen net/socket.c:1927 [inline] __do_sys_listen net/socket.c:1932 [inline] __se_sys_listen net/socket.c:1930 [inline] __x64_sys_listen+0x343/0x4c0 net/socket.c:1930 x64_sys_call+0x271d/0x3e20 arch/x86/include/generated/asm/syscalls_64.h:51 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xd9/0x210 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Local variable addr.i.i created at: sctp_get_port net/sctp/socket.c:8515 [inline] sctp_listen_start net/sctp/socket.c:8567 [inline] sctp_inet_listen+0x650/0xfd0 net/sctp/socket.c:8636 __sys_listen_socket net/socket.c:1912 [inline] __sys_listen net/socket.c:1927 [inline] __do_sys_listen net/socket.c:1932 [inline] __se_sys_listen net/socket.c:1930 [inline] __x64_sys_listen+0x343/0x4c0 net/socket.c:1930 Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Reported-by: syzbot+e69f06a0f30116c68056@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68adc0a2.050a0220.37038e.00c4.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20250826141314.1802610-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-27 17:15:21 -07:00
Takamitsu Iwai	da9c9c8775	net: rose: include node references in rose_neigh refcount Current implementation maintains two separate reference counting mechanisms: the 'count' field in struct rose_neigh tracks references from rose_node structures, while the 'use' field (now refcount_t) tracks references from rose_sock. This patch merges these two reference counting systems using 'use' field for proper reference management. Specifically, this patch adds incrementing and decrementing of rose_neigh->use when rose_neigh->count is incremented or decremented. This patch also modifies rose_rt_free(), rose_rt_device_down() and rose_clear_route() to properly release references to rose_neigh objects before freeing a rose_node through rose_remove_node(). These changes ensure rose_neigh structures are properly freed only when all references, including those from rose_node structures, are released. As a result, this resolves a slab-use-after-free issue reported by Syzbot. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Reported-by: syzbot+942297eecf7d2d61d1f1@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=942297eecf7d2d61d1f1 Signed-off-by: Takamitsu Iwai <takamitz@amazon.co.jp> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250823085857.47674-4-takamitz@amazon.co.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-27 07:43:09 -07:00
Takamitsu Iwai	d860d1faa6	net: rose: convert 'use' field to refcount_t The 'use' field in struct rose_neigh is used as a reference counter but lacks atomicity. This can lead to race conditions where a rose_neigh structure is freed while still being referenced by other code paths. For example, when rose_neigh->use becomes zero during an ioctl operation via rose_rt_ioctl(), the structure may be removed while its timer is still active, potentially causing use-after-free issues. This patch changes the type of 'use' from unsigned short to refcount_t and updates all code paths to use rose_neigh_hold() and rose_neigh_put() which operate reference counts atomically. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Takamitsu Iwai <takamitz@amazon.co.jp> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250823085857.47674-3-takamitz@amazon.co.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-27 07:43:08 -07:00
Takamitsu Iwai	dcb3465902	net: rose: split remove and free operations in rose_remove_neigh() The current rose_remove_neigh() performs two distinct operations: 1. Removes rose_neigh from rose_neigh_list 2. Frees the rose_neigh structure Split these operations into separate functions to improve maintainability and prepare for upcoming refcount_t conversion. The timer cleanup remains in rose_remove_neigh() because free operations can be called from timer itself. This patch introduce rose_neigh_put() to handle the freeing of rose_neigh structures and modify rose_remove_neigh() to handle removal only. Signed-off-by: Takamitsu Iwai <takamitz@amazon.co.jp> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250823085857.47674-2-takamitz@amazon.co.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-27 07:43:08 -07:00
Phil Sutter	54416fd767	netfilter: conntrack: helper: Replace -EEXIST by -EBUSY The helper registration return value is passed-through by module_init callbacks which modprobe confuses with the harmless -EEXIST returned when trying to load an already loaded module. Make sure modprobe fails so users notice their helper has not been registered and won't work. Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Fixes: `12f7a50533` ("netfilter: add user-space connection tracking helper infrastructure") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-08-27 11:53:38 +02:00
Wang Liang	479a54ab92	netfilter: br_netfilter: do not check confirmed bit in br_nf_local_in() after confirm When send a broadcast packet to a tap device, which was added to a bridge, br_nf_local_in() is called to confirm the conntrack. If another conntrack with the same hash value is added to the hash table, which can be triggered by a normal packet to a non-bridge device, the below warning may happen. ------------[ cut here ]------------ WARNING: CPU: 1 PID: 96 at net/bridge/br_netfilter_hooks.c:632 br_nf_local_in+0x168/0x200 CPU: 1 UID: 0 PID: 96 Comm: tap_send Not tainted 6.17.0-rc2-dirty #44 PREEMPT(voluntary) RIP: 0010:br_nf_local_in+0x168/0x200 Call Trace: <TASK> nf_hook_slow+0x3e/0xf0 br_pass_frame_up+0x103/0x180 br_handle_frame_finish+0x2de/0x5b0 br_nf_hook_thresh+0xc0/0x120 br_nf_pre_routing_finish+0x168/0x3a0 br_nf_pre_routing+0x237/0x5e0 br_handle_frame+0x1ec/0x3c0 __netif_receive_skb_core+0x225/0x1210 __netif_receive_skb_one_core+0x37/0xa0 netif_receive_skb+0x36/0x160 tun_get_user+0xa54/0x10c0 tun_chr_write_iter+0x65/0xb0 vfs_write+0x305/0x410 ksys_write+0x60/0xd0 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> ---[ end trace 0000000000000000 ]--- To solve the hash conflict, nf_ct_resolve_clash() try to merge the conntracks, and update skb->_nfct. However, br_nf_local_in() still use the old ct from local variable 'nfct' after confirm(), which leads to this warning. If confirm() does not insert the conntrack entry and return NF_DROP, the warning may also occur. There is no need to reserve the WARN_ON_ONCE, just remove it. Link: https://lore.kernel.org/netdev/20250820043329.2902014-1-wangliang74@huawei.com/ Fixes: `62e7151ae3` ("netfilter: bridge: confirm multicast packets before passing them up the stack") Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Wang Liang <wangliang74@huawei.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-08-27 11:53:37 +02:00
Eric Biggers	fe60065689	ipv6: sr: Prepare HMAC key ahead of time Prepare the HMAC key when it is added to the kernel, instead of preparing it implicitly for every packet. This significantly improves the performance of seg6_hmac_compute(). A microbenchmark on x86_64 shows seg6_hmac_compute() (with HMAC-SHA256) dropping from ~1978 cycles to ~1419 cycles, a 28% improvement. The size of 'struct seg6_hmac_info' increases by 128 bytes, but that should be fine, since there should not be a massive number of keys. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Link: https://patch.msgid.link/20250824013644.71928-3-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-26 18:11:29 -07:00
Eric Biggers	095928e7d8	ipv6: sr: Use HMAC-SHA1 and HMAC-SHA256 library functions Use the HMAC-SHA1 and HMAC-SHA256 library functions instead of crypto_shash. This is simpler and faster. Pre-allocating per-CPU hash transformation objects and descriptors is no longer needed, and a microbenchmark on x86_64 shows seg6_hmac_compute() (with HMAC-SHA256) dropping from ~2494 cycles to ~1978 cycles, a 20% improvement. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Link: https://patch.msgid.link/20250824013644.71928-2-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-26 18:11:29 -07:00
Guillaume Nault	1bec9d0c00	ipv4: Convert ->flowi4_tos to dscp_t. Convert the ->flowic_tos field of struct flowi_common from __u8 to dscp_t, rename it ->flowic_dscp and propagate these changes to struct flowi and struct flowi4. We've had several bugs in the past where ECN bits could interfere with IPv4 routing, because these bits were not properly cleared when setting ->flowi4_tos. These bugs should be fixed now and the dscp_t type has been introduced to ensure that variables carrying DSCP values don't accidentally have any ECN bits set. Several variables and structure fields have been converted to dscp_t already, but the main IPv4 routing structure, struct flowi4, is still using a __u8. To avoid any future regression, this patch converts it to dscp_t. There are many users to convert at once. Fortunately, around half of ->flowi4_tos users already have a dscp_t value at hand, which they currently convert to __u8 using inet_dscp_to_dsfield(). For all of these users, we just need to drop that conversion. But, although we try to do the __u8 <-> dscp_t conversions at the boundaries of the network or of user space, some places still store TOS/DSCP variables as __u8 in core networking code. Those can hardly be converted either because the data structure is part of UAPI or because the same variable or field is also used for handling ECN in other parts of the code. In all of these cases where we don't have a dscp_t variable at hand, we need to use inet_dsfield_to_dscp() when interacting with ->flowi4_dscp. Changes since v1: * Fix space alignment in __bpf_redirect_neigh_v4() (Ido). Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/29acecb45e911d17446b9a3dbdb1ab7b821ea371.1756128932.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-26 17:34:31 -07:00
Shahar Shitrit	da0e219764	devlink: Make health reporter burst period configurable Enable configuration of the burst period — a time window starting from the first error recovery, during which the reporter allows recovery attempts for each reported error. This feature is helpful when a single underlying issue causes multiple errors, as it delays the start of the grace period to allow sufficient time for recovering all related errors. For example, if multiple TX queues time out simultaneously, a sufficient burst period could allow all affected TX queues to be recovered within that window. Without this period, only the first TX queue that reports a timeout will undergo recovery, while the remaining TX queues will be blocked once the grace period begins. Configuration example: $ devlink health set pci/0000:00:09.0 reporter tx burst_period 500 Configuration example with ynl: ./tools/net/ynl/pyynl/cli.py \ --spec Documentation/netlink/specs/devlink.yaml \ --do health-reporter-set --json '{ "bus-name": "auxiliary", "dev-name": "mlx5_core.eth.0", "port-index": 65535, "health-reporter-name": "tx", "health-reporter-burst-period": 500 }' Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250824084354.533182-5-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-26 17:24:16 -07:00
Shahar Shitrit	6a06d8c405	devlink: Introduce burst period for health reporter Currently, the devlink health reporter starts the grace period immediately after handling an error, blocking any further recoveries until it finished. However, when a single root cause triggers multiple errors in a short time frame, it is desirable to treat them as a bulk of errors and to allow their recoveries, avoiding premature blocking of subsequent related errors, and reducing the risk of inconsistent or incomplete error handling. To address this, introduce a configurable burst period for devlink health reporter. Start this period when the first error is handled, and allow recovery attempts for reported errors during this window. Once burst period expires, begin the grace period to block further recoveries until it concludes. Timeline summary: ----\|--------\|------------------------------/----------------------/-- error is error is burst period grace period reported recovered (recoveries allowed) (recoveries blocked) For calculating the burst period duration, use the same last_recovery_ts as the grace period. Update it on recovery only when the burst period is inactive (either disabled or at the first error). This patch implements the framework for the burst period and effectively sets its value to 0 at reporter creation, so the current behavior remains unchanged, which ensures backward compatibility. A downstream patch will make the burst period configurable. Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250824084354.533182-4-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-26 17:24:16 -07:00
Shahar Shitrit	20597fb943	devlink: Move health reporter recovery abort logic to a separate function Extract the health reporter recovery abort logic into a separate function devlink_health_recover_abort(). The function encapsulates the conditions for aborting recovery: - When auto-recovery is disabled - When previous error wasn't recovered - When within the grace period after last recovery Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250824084354.533182-3-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-26 17:24:16 -07:00
Shahar Shitrit	d2b0073745	devlink: Move graceful period parameter to reporter ops Move the default graceful period from a parameter to devlink_health_reporter_create() to a field in the devlink_health_reporter_ops structure. This change improves consistency, as the graceful period is inherently tied to the reporter's behavior and recovery policy. It simplifies the signature of devlink_health_reporter_create() and its internal helper functions. It also centralizes the reporter configuration at the ops structure, preparing the groundwork for a downstream patch that will introduce a devlink health reporter burst period attribute whose default value will similarly be provided by the driver via the ops structure. Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250824084354.533182-2-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-26 17:24:16 -07:00
Lachlan Hodges	7e2f3213e8	wifi: mac80211: increase scan_ies_len for S1G Currently the S1G capability element is not taken into account for the scan_ies_len, which leads to a buffer length validation failure in ieee80211_prep_hw_scan() and subsequent WARN in __ieee80211_start_scan(). This prevents hw scanning from functioning. To fix ensure we accommodate for the S1G capability length. Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com> Link: https://patch.msgid.link/20250826085437.3493-1-lachlan.hodges@morsemicro.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2025-08-26 13:47:21 +02:00
Liao Yuanhong	a33b375ab5	wifi: mac80211: fix incorrect type for ret The variable ret is declared as a u32 type, but it is assigned a value of -EOPNOTSUPP. Since unsigned types cannot correctly represent negative values, the type of ret should be changed to int. Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com> Link: https://patch.msgid.link/20250825022911.139377-1-liaoyuanhong@vivo.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2025-08-26 13:46:48 +02:00
Dmitry Antipov	26e84445f0	wifi: cfg80211: fix use-after-free in cmp_bss() Following bss_free() quirk introduced in commit `776b358017` ("cfg80211: track hidden SSID networks properly"), adjust cfg80211_update_known_bss() to free the last beacon frame elements only if they're not shared via the corresponding 'hidden_beacon_bss' pointer. Reported-by: syzbot+30754ca335e6fb7e3092@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=30754ca335e6fb7e3092 Fixes: `3ab8227d3e` ("cfg80211: refactor cfg80211_bss_update") Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Link: https://patch.msgid.link/20250813135236.799384-1-dmantipov@yandex.ru Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2025-08-26 13:45:09 +02:00
Kuniyuki Iwashima	382a4d9cb6	tcp: Move TCP-specific diag functions to tcp_diag.c. tcp_diag_dump() / tcp_diag_dump_one() is just a wrapper of inet_diag_dump_icsk() / inet_diag_dump_one_icsk(), respectively. Let's inline them in tcp_diag.c and move static callees as well. Note that inet_sk_attr_size() is merged into tcp_diag_get_aux_size(), and we remove inet_diag_handler.idiag_get_aux_size() accordingly. While at it, BUG_ON() is replaced with DEBUG_NET_WARN_ON_ONCE(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250822190803.540788-7-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-25 17:53:36 -07:00
Kuniyuki Iwashima	f1241200cd	tcp: Don't pass hashinfo to inet_diag helpers. These inet_diag functions required struct inet_hashinfo because they are shared by TCP and DCCP: * inet_diag_dump_icsk() * inet_diag_dump_one_icsk() * inet_diag_find_one_icsk() DCCP has gone, and we don't need to pass hashinfo down to them. Let's fetch net->ipv4.tcp_death_row.hashinfo directly in the first 2 functions. Note that inet_diag_find_one_icsk() don't need hashinfo since the previous patch. We will move TCP-specific functions to tcp_diag.c in the next patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250822190803.540788-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-25 17:53:36 -07:00
Kuniyuki Iwashima	cb16f4b6c7	tcp: Don't pass hashinfo to socket lookup helpers. These socket lookup functions required struct inet_hashinfo because they are shared by TCP and DCCP. * __inet_lookup_established() * __inet_lookup_listener() * __inet6_lookup_established() * inet6_lookup_listener() DCCP has gone, and we don't need to pass hashinfo down to them. Let's fetch net->ipv4.tcp_death_row.hashinfo directly in the above 4 functions. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250822190803.540788-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-25 17:53:35 -07:00
Kuniyuki Iwashima	8150f3a44b	tcp: Remove hashinfo test for inet6?_lookup_run_sk_lookup(). Commit `6c886db2e7` ("net: remove duplicate sk_lookup helpers") started to check if hashinfo == net->ipv4.tcp_death_row.hashinfo in __inet_lookup_listener() and inet6_lookup_listener() and stopped invoking BPF sk_lookup prog for DCCP. DCCP has gone and the condition is always true. Let's remove the hashinfo test. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250822190803.540788-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-25 17:53:35 -07:00

... 10 11 12 13 14 ...

82231 Commits