Commit Graph

82328 Commits

Jakub Kicinski
7fc2bf8d30 Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-11-10

We've added 19 non-merge commits during the last 3 day(s) which contain
a total of 22 files changed, 1345 insertions(+), 197 deletions(-).

The main changes are:

1) Preserve skb metadata after a TC BPF program has changed the skb,
   from Jakub Sitnicki.
   This allows a TC program at the end of a TC filter chain to still see
   the skb metadata, even if another TC program at the front of the chain
   has changed the skb using BPF helpers.

2) Initial af_smc bpf_struct_ops support to control the smc specific
   syn/synack options, from D. Wythe.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  bpf/selftests: Add selftest for bpf_smc_hs_ctrl
  net/smc: bpf: Introduce generic hook for handshake flow
  bpf: Export necessary symbols for modules with struct_ops
  selftests/bpf: Cover skb metadata access after bpf_skb_change_proto
  selftests/bpf: Cover skb metadata access after change_head/tail helper
  selftests/bpf: Cover skb metadata access after bpf_skb_adjust_room
  selftests/bpf: Cover skb metadata access after vlan push/pop helper
  selftests/bpf: Expect unclone to preserve skb metadata
  selftests/bpf: Dump skb metadata on verification failure
  selftests/bpf: Verify skb metadata in BPF instead of userspace
  bpf: Make bpf_skb_change_head helper metadata-safe
  bpf: Make bpf_skb_change_proto helper metadata-safe
  bpf: Make bpf_skb_adjust_room metadata-safe
  bpf: Make bpf_skb_vlan_push helper metadata-safe
  bpf: Make bpf_skb_vlan_pop helper metadata-safe
  vlan: Make vlan_remove_tag return nothing
  bpf: Unclone skb head on bpf_dynptr_write to skb metadata
  net: Preserve metadata on pskb_expand_head
  net: Helper to move packet data and metadata after skb_push/pull
====================
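
As background, a minimal sketch (not part of this PR) of a TC program
reading skb metadata through the bpf_dynptr_from_skb_meta() kfunc from
the earlier skb-metadata work; treat the exact kfunc signature as an
assumption:

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>

  /* kfunc from the skb metadata series; signature assumed here */
  extern int bpf_dynptr_from_skb_meta(struct __sk_buff *skb, u64 flags,
                                      struct bpf_dynptr *ptr__uninit) __ksym;

  SEC("tc")
  int read_meta(struct __sk_buff *skb)
  {
          struct bpf_dynptr meta;
          __u32 val = 0;

          if (bpf_dynptr_from_skb_meta(skb, 0, &meta))
                  return 0; /* TC_ACT_OK */

          /* Read 4 bytes of metadata left by an earlier program. With
           * this series it survives helpers such as bpf_skb_adjust_room()
           * run by programs earlier in the filter chain.
           */
          bpf_dynptr_read(&val, sizeof(val), &meta, 0, 0);
          skb->mark = val;
          return 0; /* TC_ACT_OK */
  }

  char LICENSE[] SEC("license") = "GPL";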

Link: https://patch.msgid.link/20251110232427.3929291-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-10 16:43:51 -08:00
Kuniyuki Iwashima
73edb26b06 sctp: Don't inherit do_auto_asconf in sctp_clone_sock().
syzbot reported list_del(&sp->auto_asconf_list) corruption
in sctp_destroy_sock().

The repro calls setsockopt(SCTP_AUTO_ASCONF, 1) to a SCTP
listener, calls accept(), and close()s the child socket.

setsockopt(SCTP_AUTO_ASCONF, 1) sets sp->do_auto_asconf
to 1 and links sp->auto_asconf_list to a per-netns list.

Both fields are placed after sp->pd_lobby in struct sctp_sock,
and sctp_copy_descendant() did not copy the fields before the
cited commit.

Also, sctp_clone_sock() did not set them explicitly.

In addition, sctp_auto_asconf_init() is called from
sctp_sock_migrate(), but it initialises the fields only
conditionally.

The two fields relied on __GFP_ZERO added in sk_alloc(),
but sk_clone() does not use it.

Let's clear newsp->do_auto_asconf in sctp_clone_sock().
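
A hedged sketch of what the fix could look like inside sctp_clone_sock();
the exact placement, and whether the list head also needs
re-initialising, are assumptions based on the description above:

  newsp->do_auto_asconf = 0;
  /* assumed for illustration: keep the (unlinked) list head sane */
  INIT_LIST_HEAD(&newsp->auto_asconf_list);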

[0]:
list_del corruption. prev->next should be ffff8880799e9148, but was ffff8880799e8808. (prev=ffff88803347d9f8)
kernel BUG at lib/list_debug.c:64!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 6008 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/02/2025
RIP: 0010:__list_del_entry_valid_or_report+0x15a/0x190 lib/list_debug.c:62
Code: e8 7b 26 71 fd 43 80 3c 2c 00 74 08 4c 89 ff e8 7c ee 92 fd 49 8b 17 48 c7 c7 80 0a bf 8b 48 89 de 4c 89 f9 e8 07 c6 94 fc 90 <0f> 0b 4c 89 f7 e8 4c 26 71 fd 43 80 3c 2c 00 74 08 4c 89 ff e8 4d
RSP: 0018:ffffc90003067ad8 EFLAGS: 00010246
RAX: 000000000000006d RBX: ffff8880799e9148 RCX: b056988859ee6e00
RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
RBP: dffffc0000000000 R08: ffffc90003067807 R09: 1ffff9200060cf00
R10: dffffc0000000000 R11: fffff5200060cf01 R12: 1ffff1100668fb3f
R13: dffffc0000000000 R14: ffff88803347d9f8 R15: ffff88803347d9f8
FS:  00005555823e5500(0000) GS:ffff88812613e000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000200000000480 CR3: 00000000741ce000 CR4: 00000000003526f0
Call Trace:
 <TASK>
 __list_del_entry_valid include/linux/list.h:132 [inline]
 __list_del_entry include/linux/list.h:223 [inline]
 list_del include/linux/list.h:237 [inline]
 sctp_destroy_sock+0xb4/0x370 net/sctp/socket.c:5163
 sk_common_release+0x75/0x310 net/core/sock.c:3961
 sctp_close+0x77e/0x900 net/sctp/socket.c:1550
 inet_release+0x144/0x190 net/ipv4/af_inet.c:437
 __sock_release net/socket.c:662 [inline]
 sock_close+0xc3/0x240 net/socket.c:1455
 __fput+0x44c/0xa70 fs/file_table.c:468
 task_work_run+0x1d4/0x260 kernel/task_work.c:227
 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
 exit_to_user_mode_loop+0xe9/0x130 kernel/entry/common.c:43
 exit_to_user_mode_prepare include/linux/irq-entry-common.h:225 [inline]
 syscall_exit_to_user_mode_work include/linux/entry-common.h:175 [inline]
 syscall_exit_to_user_mode include/linux/entry-common.h:210 [inline]
 do_syscall_64+0x2bd/0xfa0 arch/x86/entry/syscall_64.c:100
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: 16942cf4d3 ("sctp: Use sk_clone() in sctp_accept().")
Reported-by: syzbot+ba535cb417f106327741@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/690d2185.a70a0220.22f260.000e.GAE@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20251106223418.1455510-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-10 16:22:09 -08:00
D. Wythe
15f295f556 net/smc: bpf: Introduce generic hook for handshake flow
The introduction of IPPROTO_SMC enables eBPF programs to determine
whether to use SMC based on the context of socket creation, such as
network namespaces, PID and comm name, etc.

As a subsequent enhancement, introduce a new generic hook that allows
deciding at runtime whether to use SMC, based on, but not limited to,
the local/remote IP addresses or ports.

Users can now write their own implementation via bpf_struct_ops to
choose whether to use SMC before the TCP 3-way handshake completes.
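
A hypothetical sketch of such a struct_ops implementation; the
smc_hs_ctrl type, the callback name, and the return-value semantics are
assumptions inferred from the bpf_smc_hs_ctrl selftest name, not the
merged UAPI:

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  SEC("struct_ops")
  int BPF_PROG(syn_option, struct tcp_sock *tp)
  {
          /* assumed semantics: non-zero keeps SMC, zero falls back to TCP */
          return tp->inet_conn.icsk_inet.sk.__sk_common.skc_num != 443;
  }

  SEC(".struct_ops.link")
  struct smc_hs_ctrl sample_hs_ctrl = {
          .name       = "sample",
          .syn_option = (void *)syn_option,
  };

  char LICENSE[] SEC("license") = "GPL";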

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Link: https://patch.msgid.link/20251107035632.115950-3-alibuda@linux.alibaba.com
2025-11-10 11:19:41 -08:00
Jakub Sitnicki
fb206fc312 bpf: Make bpf_skb_change_head helper metadata-safe
Although bpf_skb_change_head() doesn't move packet data after skb_push(),
skb metadata still needs to be relocated. Use the dedicated helper to
handle it.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-9-5ceb08a9b37b@cloudflare.com
2025-11-10 10:52:32 -08:00
Jakub Sitnicki
8cfc172ce2 bpf: Make bpf_skb_change_proto helper metadata-safe
bpf_skb_change_proto reuses the same headroom operations as
bpf_skb_adjust_room, already updated to handle metadata safely.

The remaining step is to ensure that there is sufficient headroom to
accommodate metadata on skb_push().

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-8-5ceb08a9b37b@cloudflare.com
2025-11-10 10:52:32 -08:00
Jakub Sitnicki
be83105d38 bpf: Make bpf_skb_adjust_room metadata-safe
bpf_skb_adjust_room() may push or pull bytes from skb->data. In both cases,
skb metadata must be moved accordingly to stay accessible.

Replace existing memmove() calls, which only move payload, with a helper
that also handles metadata. Reserve enough space for metadata to fit after
skb_push.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-7-5ceb08a9b37b@cloudflare.com
2025-11-10 10:52:32 -08:00
Jakub Sitnicki
f38499ff45 bpf: Unclone skb head on bpf_dynptr_write to skb metadata
Currently bpf_dynptr_from_skb_meta() marks the dynptr as read-only when
the skb is cloned, preventing writes to metadata.

Remove this restriction and unclone the skb head on bpf_dynptr_write() to
metadata, now that the metadata is preserved during uncloning. This makes
metadata dynptr consistent with skb dynptr, allowing writes regardless of
whether the skb is cloned.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-3-5ceb08a9b37b@cloudflare.com
2025-11-10 10:52:31 -08:00
Jakub Sitnicki
290fc0be09 net: Preserve metadata on pskb_expand_head
pskb_expand_head() copies headroom, including skb metadata, into the newly
allocated head, but then clears the metadata. As a result, metadata is lost
when BPF helpers trigger an skb head reallocation.

Let the skb metadata remain in the newly created copy of head.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-2-5ceb08a9b37b@cloudflare.com
2025-11-10 10:52:31 -08:00
Jakub Kicinski
a0c3aefb08 Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2025-11-06 (i40e, ice, iavf)

Mohammad Heib introduces a new devlink parameter, max_mac_per_vf, for
controlling the maximum number of MAC address filters allowed by a VF. This
allows administrators to control the VF behavior in a more nuanced manner.

Aleksandr and Przemek add support for Receive Side Scaling of GTP to iAVF
for VFs running on E800 series ice hardware. This improves performance and
scalability for virtualized network functions in 5G and LTE deployments.

* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  iavf: add RSS support for GTP protocol via ethtool
  ice: Extend PTYPE bitmap coverage for GTP encapsulated flows
  ice: improve TCAM priority handling for RSS profiles
  ice: implement GTP RSS context tracking and configuration
  ice: add virtchnl definitions and static data for GTP RSS
  ice: add flow parsing for GTP and new protocol field support
  i40e: support generic devlink param "max_mac_per_vf"
  devlink: Add new "max_mac_per_vf" generic device param
====================

Link: https://patch.msgid.link/20251106225321.1609605-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 19:15:36 -08:00
Eric Dumazet
b61785852e net: increase skb_defer_max default to 128
The default skb_defer_max value is very conservative, and can be
increased to avoid too many calls to kick_defer_list_purge().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20251106202935.1776179-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 19:02:40 -08:00
Eric Dumazet
e20dfbad8a net: fix napi_consume_skb() with alien skbs
There is a lack of NUMA awareness and, more generally, a lack
of slab cache affinity on the TX completion path.

Modern drivers are using napi_consume_skb(), hoping to cache sk_buffs
in per-cpu caches so that they can be recycled in the RX path.

Only use this cache if the skb was allocated on the same cpu;
otherwise use skb_attempt_defer_free() so that the skb
is freed on the original cpu.

This removes contention on SLUB spinlocks and data structures.
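
A hedged sketch of the resulting check in napi_consume_skb(); the
surrounding code is elided and only the cpu test is illustrated:

  /* skb->alloc_cpu records the cpu the skb head was allocated on */
  if (unlikely(skb->alloc_cpu != smp_processor_id())) {
          /* alien skb: queue it for freeing on its original cpu */
          skb_attempt_defer_free(skb);
          return;
  }
  /* same cpu: recycle into the per-cpu NAPI cache as before */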

After this patch, I get ~50% improvement for a UDP tx workload
on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).

80 Mpps -> 120 Mpps.

Profiling one of the 32 cpus servicing NIC interrupts :

Before:

mpstat -P 511 1 1

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     511    0.00    0.00    0.00    0.00    0.00   98.00    0.00    0.00    0.00    2.00

    31.01%  ksoftirqd/511    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
    12.45%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     5.60%  ksoftirqd/511    [kernel.kallsyms]  [k] __slab_free
     3.31%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
     3.27%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
     2.95%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_start
     2.52%  ksoftirqd/511    [kernel.kallsyms]  [k] fq_dequeue
     2.32%  ksoftirqd/511    [kernel.kallsyms]  [k] read_tsc
     2.25%  ksoftirqd/511    [kernel.kallsyms]  [k] build_detached_freelist
     2.15%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free
     2.11%  swapper          [kernel.kallsyms]  [k] __slab_free
     2.06%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_features_check
     2.01%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
     1.97%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_data
     1.52%  ksoftirqd/511    [kernel.kallsyms]  [k] sock_wfree
     1.34%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
     1.23%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
     1.15%  ksoftirqd/511    [kernel.kallsyms]  [k] dma_unmap_page_attrs
     1.11%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
     1.03%  swapper          [kernel.kallsyms]  [k] fq_dequeue
     0.94%  swapper          [kernel.kallsyms]  [k] kmem_cache_free
     0.93%  swapper          [kernel.kallsyms]  [k] read_tsc
     0.81%  ksoftirqd/511    [kernel.kallsyms]  [k] napi_consume_skb
     0.79%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
     0.77%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_free_head
     0.76%  swapper          [kernel.kallsyms]  [k] idpf_features_check
     0.72%  swapper          [kernel.kallsyms]  [k] skb_release_data
     0.69%  swapper          [kernel.kallsyms]  [k] build_detached_freelist
     0.58%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_head_state
     0.56%  ksoftirqd/511    [kernel.kallsyms]  [k] __put_partials
     0.55%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free_bulk
     0.48%  swapper          [kernel.kallsyms]  [k] sock_wfree

After:

mpstat -P 511 1 1

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     511    0.00    0.00    0.00    0.00    0.00   51.49    0.00    0.00    0.00   48.51

    19.10%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
    13.86%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
    10.80%  swapper          [kernel.kallsyms]  [k] skb_attempt_defer_free
    10.57%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
     7.18%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     6.69%  swapper          [kernel.kallsyms]  [k] sock_wfree
     5.55%  swapper          [kernel.kallsyms]  [k] dma_unmap_page_attrs
     3.10%  swapper          [kernel.kallsyms]  [k] fq_dequeue
     3.00%  swapper          [kernel.kallsyms]  [k] skb_release_head_state
     2.73%  swapper          [kernel.kallsyms]  [k] read_tsc
     2.48%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
     1.20%  swapper          [kernel.kallsyms]  [k] idpf_features_check
     1.13%  swapper          [kernel.kallsyms]  [k] napi_consume_skb
     0.93%  swapper          [kernel.kallsyms]  [k] idpf_vport_splitq_napi_poll
     0.64%  swapper          [kernel.kallsyms]  [k] native_send_call_func_single_ipi
     0.60%  swapper          [kernel.kallsyms]  [k] acpi_processor_ffh_cstate_enter
     0.53%  swapper          [kernel.kallsyms]  [k] io_idle
     0.43%  swapper          [kernel.kallsyms]  [k] netif_skb_features
     0.41%  swapper          [kernel.kallsyms]  [k] __direct_call_cpuidle_state_enter2
     0.40%  swapper          [kernel.kallsyms]  [k] native_irq_return_iret
     0.40%  swapper          [kernel.kallsyms]  [k] idpf_tx_buf_hw_update
     0.36%  swapper          [kernel.kallsyms]  [k] sched_clock_noinstr
     0.34%  swapper          [kernel.kallsyms]  [k] handle_softirqs
     0.32%  swapper          [kernel.kallsyms]  [k] net_rx_action
     0.32%  swapper          [kernel.kallsyms]  [k] dql_completed
     0.32%  swapper          [kernel.kallsyms]  [k] validate_xmit_skb
     0.31%  swapper          [kernel.kallsyms]  [k] skb_network_protocol
     0.29%  swapper          [kernel.kallsyms]  [k] skb_csum_hwoffload_help
     0.29%  swapper          [kernel.kallsyms]  [k] x2apic_send_IPI
     0.28%  swapper          [kernel.kallsyms]  [k] ktime_get
     0.24%  swapper          [kernel.kallsyms]  [k] __qdisc_run

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20251106202935.1776179-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 19:02:39 -08:00
Eric Dumazet
1fcf572211 net: allow skb_release_head_state() to be called multiple times
Currently, only skb dst is cleared (thanks to skb_dst_drop()).

Make sure skb->destructor, conntrack and extensions are cleared.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20251106202935.1776179-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 19:02:39 -08:00
Eric Dumazet
fd9557c360 net: add prefetch() in skb_defer_free_flush()
skb_defer_free_flush() is becoming more important these days.

Add a prefetch operation to reduce latency a bit on some
platforms like AMD EPYC 7B12.

On more recent cpus, a stall happens when reading skb_shinfo().
Avoiding it will require a more elaborate strategy.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20251106085500.2438951-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 19:00:16 -08:00
Jakub Kicinski
f05d26198c psp: add stats from psp spec to driver facing api
Provide a driver api for reporting device statistics required by the
"Implementation Requirements" section of the PSP Architecture
Specification. Use a warning to ensure drivers report stats required
by the spec.

Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20251106002608.1578518-4-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 18:53:57 -08:00
Jakub Kicinski
dae4a92399 psp: report basic stats from the core
Track and report stats common to all psp devices from the core. A
'stale-event' is when the core marks the rx state of an active
psp_assoc as incapable of authenticating psp encapsulated data.

Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20251106002608.1578518-2-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 18:53:56 -08:00
Eric Dumazet
416dd649f3 tcp: add net.ipv4.tcp_comp_sack_rtt_percent
TCP SACK compression has been added in 2018 in commit
5d9f4262b7 ("tcp: add SACK compression").

It is working great for WAN flows (with large RTT).
Wifi in particular gets a significant boost _when_ ACKs are suppressed.

Add a new sysctl so that we can tune the very conservative 5% value
that has been used so far in this formula, so that small-RTT flows
can benefit from this feature.

delay = min(5% of RTT, 1 ms)

This patch adds a new tcp_comp_sack_rtt_percent sysctl
to ease experiments and tuning.

Given that we cap the delay to 1 ms (tcp_comp_sack_delay_ns sysctl),
set the default value to 33%.

Quoting Neal Cardwell ( https://lore.kernel.org/netdev/CADVnQymZ1tFnEA1Q=vtECs0=Db7zHQ8=+WCQtnhHFVbEOzjVnQ@mail.gmail.com/ )

The rationale for 33% is basically to try to facilitate pipelining,
where there are always at least 3 ACKs and 3 GSO/TSO skbs per SRTT, so
that the path can maintain a budget for 3 full-sized GSO/TSO skbs "in
flight" at all times:

+ 1 skb in the qdisc waiting to be sent by the NIC next
+ 1 skb being sent by the NIC (being serialized by the NIC out onto the wire)
+ 1 skb being received and aggregated by the receiver machine's
aggregation mechanism (some combination of LRO, GRO, and sack
compression)

Note that this is basically the same magic number (3) and the same
rationales as:

(a) tcp_tso_should_defer() ensuring that we defer sending data for no
longer than cwnd/tcp_tso_win_divisor (where tcp_tso_win_divisor = 3),
and
(b) bbr_quantization_budget() ensuring that cwnd is at least 3 GSO/TSO
skbs to maintain pipelining and full throughput at low RTTs
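
A hedged sketch of the resulting delay computation; the variable names
are assumptions patterned on the existing tcp_comp_sack_delay_ns
handling:

  /* srtt_us: smoothed RTT in microseconds */
  u64 delay_ns = min_t(u64,
                       (u64)srtt_us * NSEC_PER_USEC *
                       READ_ONCE(net->ipv4.sysctl_tcp_comp_sack_rtt_percent) / 100,
                       READ_ONCE(net->ipv4.sysctl_tcp_comp_sack_delay_ns));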

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20251106115236.3450026-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 18:41:44 -08:00
Kuniyuki Iwashima
207ce0f6bc tcp: Remove timeout arg from reqsk_timeout().
reqsk_timeout() is always called with @timeout being TCP_RTO_MAX.

Let's remove the arg.

As a prep for the next patch, reqsk_timeout() is moved to tcp.h
and renamed to tcp_reqsk_timeout().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 18:05:26 -08:00
Kuniyuki Iwashima
6fbf648d5c tcp: Remove redundant init for req->num_timeout.
Commit 5903123f66 ("tcp: Use BPF timeout setting for SYN ACK
RTO") introduced req->timeout and initialised it in 3 places:

  1. reqsk_alloc() sets 0
  2. inet_reqsk_alloc() sets TCP_TIMEOUT_INIT
  3. tcp_conn_request() sets tcp_timeout_init()

1. has always been redundant as 2. overwrites it immediately.

2. was necessary for TFO SYN+ACK but is no longer needed after
commit 8ea731d4c2 ("tcp: Make SYN ACK RTO tunable by BPF
programs with TFO").

3. was moved to reqsk_queue_hash_req() in the previous patch.

Now, we always initialise req->timeout just before scheduling
the SYN+ACK timer:

  * For non-TFO SYN+ACK : reqsk_queue_hash_req()
  * For TFO SYN+ACK     : tcp_fastopen_create_child()

Let's remove the redundant initialisation of req->timeout in
reqsk_alloc() and inet_reqsk_alloc().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 18:05:25 -08:00
Kuniyuki Iwashima
3ce5dd8161 tcp: Remove timeout arg from reqsk_queue_hash_req().
inet_csk_reqsk_queue_hash_add() is no longer shared by DCCP.

We do not need to pass req->timeout down to reqsk_queue_hash_req().

Let's move tcp_timeout_init() from tcp_conn_request() to
reqsk_queue_hash_req().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 18:05:25 -08:00
Kuniyuki Iwashima
be88c549e9 tcp: Call tcp_syn_ack_timeout() directly.
Since DCCP has been removed, we do not need to use
request_sock_ops.syn_ack_timeout().

Let's call tcp_syn_ack_timeout() directly.

Now other function pointers of request_sock_ops are
protocol-dependent.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 18:05:25 -08:00
Daniel Borkmann
24ab8efb9a xsk: Move NETDEV_XDP_ACT_ZC into generic header
Move NETDEV_XDP_ACT_ZC into the xdp_sock_drv.h header such that external
code can reuse it, and rename it to the more generic NETDEV_XDP_ACT_XSK.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20251031212103.310683-7-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-06 16:46:11 -08:00
Daniel Golle
c6230446b1 net: dsa: add tagging driver for MaxLinear GSW1xx switch family
Add support for a new DSA tagging protocol driver for the MaxLinear
GSW1xx switch family. The GSW1xx switches use a proprietary 8-byte
special tag inserted between the source MAC address and the EtherType
field to indicate the source and destination ports for frames
traversing the CPU port.

Implement the tag handling logic to insert the special tag on transmit
and parse it on receive.
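
A hedged sketch of the transmit side using the common DSA tagger
pattern; the tag contents beyond "8 bytes between the MAC addresses and
the EtherType" are assumptions:

  #define GSW1XX_TAG_LEN 8

  u8 *tag;

  skb_push(skb, GSW1XX_TAG_LEN);
  /* shift the two MAC addresses up to open a gap before the EtherType */
  dsa_alloc_etype_header(skb, GSW1XX_TAG_LEN);
  tag = dsa_etype_header_pos_tx(skb);
  memset(tag, 0, GSW1XX_TAG_LEN);
  /* encode the destination port number into the tag here */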

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Tested-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Link: https://patch.msgid.link/0e973ebfd9433c30c96f50670da9e9449a0d98f2.1762170107.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-06 14:16:17 -08:00
Mohammad Heib
9352d40c8b devlink: Add new "max_mac_per_vf" generic device param
Add a new generic device parameter to control the maximum
number of MAC filters allowed per VF.

For example, to limit a VF to 3 MAC addresses:
 $ devlink dev param set pci/0000:3b:00.0 name max_mac_per_vf \
        value 3 \
        cmode runtime

Signed-off-by: Mohammad Heib <mheib@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-11-06 12:57:31 -08:00
Jakub Kicinski
1ec9871fbb Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.18-rc5).

Conflicts:

drivers/net/wireless/ath/ath12k/mac.c
  9222582ec5 ("Revert "wifi: ath12k: Fix missing station power save configuration"")
  6917e268c4 ("wifi: ath12k: Defer vdev bring-up until CSA finalize to avoid stale beacon")
https://lore.kernel.org/11cece9f7e36c12efd732baa5718239b1bf8c950.camel@sipsolutions.net

Adjacent changes:

drivers/net/ethernet/intel/Kconfig
  b1d16f7c00 ("libie: depend on DEBUG_FS when building LIBIE_FWLOG")
  93f53db9f9 ("ice: switch to Page Pool")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-06 09:27:40 -08:00
Linus Torvalds
c2c2ccfd4b Merge tag 'net-6.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
  "Including fixes from bluetooth and wireless.

  Current release - new code bugs:

   - ptp: expose raw cycles only for clocks with free-running counter

   - bonding: fix null-deref in actor_port_prio setting

   - mdio: ERR_PTR-check regmap pointer returned by
     device_node_to_regmap()

   - eth: libie: depend on DEBUG_FS when building LIBIE_FWLOG

  Previous releases - regressions:

   - virtio_net: fix perf regression due to bad alignment of
     virtio_net_hdr_v1_hash

   - Revert "wifi: ath10k: avoid unnecessary wait for service ready
     message" caused regressions for QCA988x and QCA9984

   - Revert "wifi: ath12k: Fix missing station power save configuration"
     caused regressions for WCN7850

   - eth: bnxt_en: shutdown FW DMA in bnxt_shutdown(), fix memory
     corruptions after kexec

  Previous releases - always broken:

   - virtio-net: fix received packet length check for big packets

   - sctp: fix races in socket diag handling

   - wifi: add an hrtimer-based delayed work item to avoid low
     granularity of timers set relatively far in the future, and use it
     where it matters (e.g. when performing AP-scheduled channel switch)

   - eth: mlx5e:
       - correctly propagate error in case of module EEPROM read failure
       - fix HW-GRO on systems with PAGE_SIZE == 64kB

   - dsa: b53: fixes for tagging, link configuration / RMII, FDB,
     multicast

   - phy: lan8842: implement latest errata"

* tag 'net-6.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (63 commits)
  selftests/vsock: avoid false-positives when checking dmesg
  net: bridge: fix MST static key usage
  net: bridge: fix use-after-free due to MST port state bypass
  lan966x: Fix sleeping in atomic context
  bonding: fix NULL pointer dereference in actor_port_prio setting
  net: dsa: microchip: Fix reserved multicast address table programming
  net: wan: framer: pef2256: Switch to devm_mfd_add_devices()
  net: libwx: fix device bus LAN ID
  net/mlx5e: SHAMPO, Fix header formulas for higher MTUs and 64K pages
  net/mlx5e: SHAMPO, Fix skb size check for 64K pages
  net/mlx5e: SHAMPO, Fix header mapping for 64K pages
  net: ti: icssg-prueth: Fix fdb hash size configuration
  net/mlx5e: Fix return value in case of module EEPROM read error
  net: gro_cells: Reduce lock scope in gro_cell_poll
  libie: depend on DEBUG_FS when building LIBIE_FWLOG
  wifi: mac80211_hwsim: Limit destroy_on_close radio removal to netgroup
  netpoll: Fix deadlock in memory allocation under spinlock
  net: ethernet: ti: netcp: Standardize knav_dma_open_channel to return NULL on error
  virtio-net: fix received length check in big packets
  bnxt_en: Fix warning in bnxt_dl_reload_down()
  ...
2025-11-06 08:52:30 -08:00
Nikolay Aleksandrov
ee87c63f9b net: bridge: fix MST static key usage
As Ido pointed out, the static key usage in MST is buggy and should use
inc/dec instead of enable/disable, because we can have multiple bridges
with MST enabled, which means a single bridge could disable MST for all
of them. Use static_branch_inc/dec to avoid that. When destroying a
bridge, decrement the key if MST was enabled.
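
A hedged sketch of the change; the br_mst_used key name matches the
existing code in net/bridge/br_mst.c, but treat the call sites as
assumptions:

  static_branch_inc(&br_mst_used);        /* a bridge enables MST */

  /* on MST disable, or when destroying a bridge with MST enabled */
  if (br_opt_get(br, BROPT_MST_ENABLED))
          static_branch_dec(&br_mst_used);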

Fixes: ec7328b591 ("net: bridge: mst: Multiple Spanning Tree (MST) mode")
Reported-by: Ido Schimmel <idosch@nvidia.com>
Closes: https://lore.kernel.org/netdev/20251104120313.1306566-1-razor@blackwall.org/T/#m6888d87658f94ed1725433940f4f4ebb00b5a68b
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20251105111919.1499702-3-razor@blackwall.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-06 07:32:17 -08:00
Nikolay Aleksandrov
8dca36978a net: bridge: fix use-after-free due to MST port state bypass
syzbot reported[1] a use-after-free when deleting an expired fdb. It is
due to a race condition between learning still happening and a port being
deleted, after all its fdbs have been flushed. The port's state has been
toggled to disabled, so no learning should happen at that time, but if we
have MST enabled, it will bypass the port's state, which, together with
VLAN filtering disabled, can lead to fdb learning at a time when it
shouldn't happen while the port is being deleted. VLAN filtering must be
disabled because we flush the port VLANs when the port is being deleted,
which would stop learning. This fix adds a check for the port's vlan
group, which is set to NULL when the port is being deleted; that avoids
the port state bypass. When MST is enabled, there is a minimal new
overhead in the fast path because the port's vlan group pointer is
cache-hot.

[1] https://syzkaller.appspot.com/bug?extid=dd280197f0f7ab3917be

Fixes: ec7328b591 ("net: bridge: mst: Multiple Spanning Tree (MST) mode")
Reported-by: syzbot+dd280197f0f7ab3917be@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/69088ffa.050a0220.29fc44.003d.GAE@google.com/
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20251105111919.1499702-2-razor@blackwall.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-06 07:32:17 -08:00
Raju Rangoju
6b47af35a6 net: selftests: export packet creation helpers for driver use
Export the network selftest packet creation infrastructure to allow
network drivers to reuse the existing selftest framework instead of
duplicating packet creation code.

Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20251031111811.775434-1-Raju.Rangoju@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-06 13:38:11 +01:00
Sebastian Andrzej Siewior
d917c217b6 net: gro_cells: Reduce lock scope in gro_cell_poll
One GRO-cell device's NAPI callback can nest into the GRO-cell of
another device if the underlying device is also using GRO-cell.
This is the case for IPsec over vxlan.
These two GRO-cells are separate devices, but from lockdep's point of
view they are the same because each device shares the same lock class,
so it reports a possible deadlock, assuming one device is nesting into
itself.

Hold the bh_lock only while accessing gro_cell::napi_skbs in
gro_cell_poll(). This reduces the locking scope and avoids acquiring the
same lock class multiple times.

Fixes: 25718fdcbd ("net: gro_cells: Use nested-BH locking for gro_cell")
Reported-by: Gal Pressman <gal@nvidia.com>
Closes: https://lore.kernel.org/all/66664116-edb8-48dc-ad72-d5223696dd19@nvidia.com/
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20251104153435.ty88xDQt@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-05 17:41:29 -08:00
Breno Leitao
327c20c21d netpoll: Fix deadlock in memory allocation under spinlock
Fix an AA deadlock in refill_skbs() where memory allocation while holding
skb_pool->lock can trigger a recursive lock acquisition attempt.

The deadlock scenario occurs when the system is under severe memory
pressure:

1. refill_skbs() acquires skb_pool->lock (spinlock)
2. alloc_skb() is called while holding the lock
3. Memory allocator fails and calls slab_out_of_memory()
4. This triggers printk() for the OOM warning
5. The console output path calls netpoll_send_udp()
6. netpoll_send_udp() attempts to acquire the same skb_pool->lock
7. Deadlock: the lock is already held by the same CPU

Call stack:
  refill_skbs()
    spin_lock_irqsave(&skb_pool->lock)    <- lock acquired
    __alloc_skb()
      kmem_cache_alloc_node_noprof()
        slab_out_of_memory()
          printk()
            console_flush_all()
              netpoll_send_udp()
                skb_dequeue()
                  spin_lock_irqsave(&skb_pool->lock)     <- deadlock attempt

This bug was exposed by commit 248f6571fd ("netpoll: Optimize skb
refilling on critical path") which removed refill_skbs() from the
critical path (where nested printk was being deferred), letting nested
printk be called from inside refill_skbs().

Refactor refill_skbs() to never allocate memory while holding
the spinlock.
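
A hedged sketch of the refactored loop: the allocation happens with no
lock held and skb_queue_tail() takes skb_pool->lock only around the
enqueue itself (MAX_SKBS and MAX_SKB_SIZE being the existing netpoll.c
constants):

  while (skb_queue_len(skb_pool) < MAX_SKBS) {
          skb = alloc_skb(MAX_SKB_SIZE, GFP_ATOMIC);
          if (!skb)
                  break;
          skb_queue_tail(skb_pool, skb);  /* locks internally */
  }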

Another possible solution to fix this problem is protecting the
refill_skbs() from nested printks, basically calling
printk_deferred_{enter,exit}() in refill_skbs(), then, any nested
pr_warn() would be deferred.

I prefer this approach, given I _think_ it might be a good idea to move
the alloc_skb() from GFP_ATOMIC to GFP_KERNEL in the future, so having
the alloc_skb() outside of the lock will be a necessary step.

There is a possible TOCTOU issue between checking the pool length and
queueing the newly allocated skb, but this is not a problem, given that
an extra SKB in the pool is harmless and will eventually be used.

Signed-off-by: Breno Leitao <leitao@debian.org>
Fixes: 248f6571fd ("netpoll: Optimize skb refilling on critical path")
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251103-fix_netpoll_aa-v4-1-4cfecdf6da7c@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04 19:17:00 -08:00
Kees Cook
2b5e9f9b7e net: Convert struct sockaddr to fixed-size "sa_data[14]"
Revert struct sockaddr from flexible array to fixed 14-byte "sa_data",
to solve over 36,000 -Wflex-array-member-not-at-end warnings, since
struct sockaddr is embedded within many network structs.

With socket/proto sockaddr-based internal APIs switched to use struct
sockaddr_unsized, there should be no more uses of struct sockaddr that
depend on reading beyond the end of struct sockaddr::sa_data that might
trigger bounds checking.

Comparing an x86_64 "allyesconfig" vmlinux build before and after this
patch showed no new "ud1" instructions from CONFIG_UBSAN_BOUNDS nor any
new "field-spanning" memcpy CONFIG_FORTIFY_SOURCE instrumentations.

Cc: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-8-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04 19:10:33 -08:00
Kees Cook
8116d803e7 bpf: Convert cgroup sockaddr filters to use sockaddr_unsized consistently
Update BPF cgroup sockaddr filtering infrastructure to use sockaddr_unsized
consistently throughout the call chain, removing redundant explicit casts
from callers.

No binary changes expected.

Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-6-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04 19:10:33 -08:00
Kees Cook
449f68f8ff net: Convert proto callbacks from sockaddr to sockaddr_unsized
Convert struct proto pre_connect(), connect(), bind(), and bind_add()
callback function prototypes from struct sockaddr to struct sockaddr_unsized.
This does not change per-implementation use of sockaddr for passing around
an arbitrarily sized sockaddr struct. Those will be addressed in future
patches.

Additionally removes the no longer referenced struct sockaddr from
include/net/inet_common.h.

No binary changes expected.
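
A hedged sketch of the shape of the change; the sockaddr_unsized
definition shown here is an assumption:

  /* a sockaddr that does not claim a fixed sa_data length */
  struct sockaddr_unsized {
          __kernel_sa_family_t sa_family;
          char                 sa_data[];
  };

  /* struct proto callback, before */
  int (*bind)(struct sock *sk, struct sockaddr *addr, int addr_len);
  /* struct proto callback, after */
  int (*bind)(struct sock *sk, struct sockaddr_unsized *addr, int addr_len);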

Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-5-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04 19:10:33 -08:00
Kees Cook
85cb0757d7 net: Convert proto_ops connect() callbacks to use sockaddr_unsized
Update all struct proto_ops connect() callback function prototypes from
"struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the
compiler about object sizes. Calls into struct proto handlers gain casts
that will be removed in the struct proto conversion patch.

No binary changes expected.

Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04 19:10:32 -08:00
Kees Cook
0e50474fa5 net: Convert proto_ops bind() callbacks to use sockaddr_unsized
Update all struct proto_ops bind() callback function prototypes from
"struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the
compiler about object sizes. Calls into struct proto handlers gain casts
that will be removed in the struct proto conversion patch.

No binary changes expected.

Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-2-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04 19:10:32 -08:00
Yue Haibing
f2143e283c net: devmem: Remove unused declaration net_devmem_bind_tx_release()
Commit bd61848900 ("net: devmem: Implement TX path") declared this
but never implemented it.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20251103072046.1670574-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04 17:29:19 -08:00
Matthieu Baerts (NGI0)
e461e8a799 mptcp: pm: in kernel: only use fullmesh endp if any
Our documentation says that the in-kernel PM is only using fullmesh
endpoints to establish subflows to announced addresses when at least one
endpoint has a fullmesh flag. But this was not totally correct: only
fullmesh endpoints were used if at least one endpoint *from the same
address family as the received ADD_ADDR* had the fullmesh flag.

This is confusing, and it seems clearer not to have differences
depending on the address family.

So, now, when at least one MPTCP endpoint has a fullmesh flag, the local
addresses are picked from all fullmesh endpoints, which might be 0 if
there are no endpoints for the correct address family.

One selftest needs to be adapted for this behaviour change.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251101-net-next-mptcp-fm-endp-nb-bind-v1-2-b4166772d6bb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04 17:15:06 -08:00
Matthieu Baerts (NGI0)
f88191c7f3 mptcp: pm: in-kernel: record fullmesh endp nb
Instead of iterating over all endpoints, under RCU read lock, just to
check if one of them has the fullmesh flag, we can keep a counter of
fullmesh endpoints, similar to what is done with the other flags.

This counter is now checked, before iterating over all endpoints.

Similar to the other counters, this new one is also exposed. A userspace
app can then know when it is being used in a fullmesh mode, with
potentially (too) many subflows.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251101-net-next-mptcp-fm-endp-nb-bind-v1-1-b4166772d6bb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04 17:15:06 -08:00
Eric Dumazet
46173144e0 net: mark deliver_skb() as unlikely and not inlined
deliver_skb() should not be inlined as it is not called
in the fast path.

Add unlikely() clauses giving hints to the compiler about this fact.
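
For reference, a lightly abridged sketch of deliver_skb() in
net/core/dev.c and the resulting call pattern (whether the function is
merely non-inline or explicitly noinline is an assumption here):

  static noinline int deliver_skb(struct sk_buff *skb,
                                  struct packet_type *pt_prev,
                                  struct net_device *orig_dev)
  {
          refcount_inc(&skb->users);
          return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
  }

  ...
  if (unlikely(pt_prev))
          ret = deliver_skb(skb, pt_prev, orig_dev);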

Before this patch:

size net/core/dev.o
   text	   data	    bss	    dec	    hex	filename
 121794	  13330	    176	 135300	  21084	net/core/dev.o

__netif_receive_skb_core() size on x86_64 : 4080 bytes.

After:

size net/core/dev.o
   text	   data	    bss	    dec	    hex	filename
 120330	  13338	    176	 133844	  20ad4	net/core/dev.o

__netif_receive_skb_core() size on x86_64 : 2781 bytes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251103165256.1712169-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04 16:08:25 -08:00
Adrian Moreno
105bae3218 rtnetlink: honor RTEXT_FILTER_SKIP_STATS in IFLA_STATS
Gathering interface statistics can be a relatively expensive operation
on certain systems as it requires iterating over all the cpus.

RTEXT_FILTER_SKIP_STATS was first introduced [1] to skip AF_INET6
statistics from interface dumps and it was then extended [2] to
also exclude IFLA_VF_INFO.

The semantics of the flag do not seem to be limited to AF_INET6
or VF statistics, and having a way to query the interface status
(e.g. carrier, address) without retrieving its statistics seems
reasonable. So this patch extends the use of RTEXT_FILTER_SKIP_STATS
to also affect IFLA_STATS.

[1] https://lore.kernel.org/all/20150911204848.GC9687@oracle.com/
[2] https://lore.kernel.org/all/20230611105108.122586-1-gal@nvidia.com/
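
A hedged userspace sketch: a raw RTM_GETLINK dump request carrying
IFLA_EXT_MASK = RTEXT_FILTER_SKIP_STATS, which after this patch also
makes the kernel omit IFLA_STATS/IFLA_STATS64:

  #include <linux/rtnetlink.h>
  #include <sys/socket.h>

  static int dump_links_skip_stats(int fd)
  {
          struct {
                  struct nlmsghdr  nlh;
                  struct ifinfomsg ifm;
                  struct rtattr    rta;
                  __u32            ext_mask;
          } req = {
                  .nlh.nlmsg_len   = sizeof(req),
                  .nlh.nlmsg_type  = RTM_GETLINK,
                  .nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
                  .ifm.ifi_family  = AF_UNSPEC,
                  .rta.rta_type    = IFLA_EXT_MASK,
                  .rta.rta_len     = RTA_LENGTH(sizeof(__u32)),
                  .ext_mask        = RTEXT_FILTER_SKIP_STATS,
          };

          return send(fd, &req, sizeof(req), 0);
  }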

Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://patch.msgid.link/20251103154006.1189707-1-amorenoz@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04 16:07:37 -08:00
Jason Xing
30ed05adca xsk: use a smaller new lock for shared pool case
- Split cq_lock into two smaller locks: cq_prod_lock and
  cq_cached_prod_lock
- Avoid disabling/enabling interrupts in the hot xmit path

In either xsk_cq_cancel_locked() or xsk_cq_reserve_locked() function,
the race condition is only between multiple xsks sharing the same
pool. They are all in the process context rather than interrupt context,
so now the small lock named cq_cached_prod_lock can be used without
handling interrupts.

While cq_cached_prod_lock ensures exclusive modification of
@cached_prod, cq_prod_lock in xsk_cq_submit_addr_locked() only cares
about @producer and the corresponding @desc. Neither of them needs to
be consistent with @cached_prod, which is protected by
cq_cached_prod_lock. That is why the previous big lock can be split
into two smaller ones. Please note that the SPSC rule is all about the
global state of producer and consumer, which can affect both layers,
rather than the local or cached ones.

Frequently disabling and enabling interrupts is very time consuming in
some cases, especially at per-descriptor granularity; this can now be
avoided after this optimization, even when the pool is shared by
multiple xsks.
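
A hedged sketch of the split in struct xsk_buff_pool; the field names
follow the commit message, their exact placement is an assumption:

  /* serializes cq->producer and the descriptor writes; may be taken
   * where interrupts matter, so it keeps the irqsave variant
   */
  spinlock_t cq_prod_lock;
  /* serializes cq->cached_prod between xsks sharing the pool; process
   * context only, so plain spin_lock()/spin_unlock() suffices
   */
  spinlock_t cq_cached_prod_lock;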

With this patch, the performance number[1] could go from 1,872,565 pps
to 1,961,009 pps. It's a minor rise of around 5%.

[1]: taskset -c 1 ./xdpsock -i enp2s0f1 -q 0 -t -S -s 64

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20251030000646.18859-3-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-04 16:10:53 +01:00
Jason Xing
4622800434 xsk: do not enable/disable irq when grabbing/releasing xsk_tx_list_lock
Commit ac98d8aab6 ("xsk: wire upp Tx zero-copy functions"), which
originally introduced this lock, put the deletion process in
sk_destruct, which can obviously run in irq context, so the
xxx_irqsave()/xxx_irqrestore() pair was used. But a later commit
541d7fdd76 ("xsk: proper AF_XDP socket teardown ordering") moved the
deletion into xsk_release(), which only happens in process context.
Since that commit, the pair has no longer been necessary.

Now, there are two places that use this xsk_tx_list_lock, and both run
only in process context. So avoid manipulating the irq state there.
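
A hedged sketch of the resulting pattern at one of the two call sites:

  spin_lock(&pool->xsk_tx_list_lock);     /* was spin_lock_irqsave() */
  list_del_rcu(&xs->tx_list);
  spin_unlock(&pool->xsk_tx_list_lock);   /* was spin_unlock_irqrestore() */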

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20251030000646.18859-2-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-04 16:10:52 +01:00
Oleksij Rempel
e6e93fb013 ethtool: netlink: add ETHTOOL_MSG_MSE_GET and wire up PHY MSE access
Introduce the userspace entry point for PHY MSE diagnostics via
ethtool netlink. This exposes the core API added previously and
returns both capability information and one or more snapshots.

Userspace sends ETHTOOL_MSG_MSE_GET. The reply carries:
- ETHTOOL_A_MSE_CAPABILITIES: scale limits and timing information
- ETHTOOL_A_MSE_CHANNEL_* nests: one or more snapshots (per-channel
  if available, otherwise WORST, otherwise LINK)

Link down returns -ENETDOWN.

Changes:
  - YAML: add attribute sets (mse, mse-capabilities, mse-snapshot)
    and the mse-get operation
  - UAPI (generated): add ETHTOOL_A_MSE_* enums and message IDs,
    ETHTOOL_MSG_MSE_GET/REPLY
  - ethtool core: add net/ethtool/mse.c implementing the request,
    register genl op, and hook into ethnl dispatch
  - docs: document MSE_GET in ethtool-netlink.rst

The include/uapi/linux/ethtool_netlink_generated.h is generated
from Documentation/netlink/specs/ethtool.yaml.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20251027122801.982364-3-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03 18:32:27 -08:00
Samiullah Khawaja
c18d4b190a net: Extend NAPI threaded polling to allow kthread based busy polling
Add a new state NAPI_STATE_THREADED_BUSY_POLL to the NAPI state enum to
enable and disable threaded busy polling.

When threaded busy polling is enabled for a NAPI, enable
NAPI_STATE_THREADED also.

When the threaded NAPI is scheduled, set NAPI_STATE_IN_BUSY_POLL to
signal napi_complete_done not to rearm interrupts.

Whenever NAPI_STATE_THREADED_BUSY_POLL is unset, NAPI_STATE_IN_BUSY_POLL
is also unset; napi_complete_done then unsets the NAPI_STATE_SCHED_THREADED
bit as well, which in turn makes the kthread go to sleep.
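
A hedged sketch of the state-bit choreography described above; the
surrounding helpers are elided and the exact call sites are assumptions:

  /* enable: threaded busy polling implies threaded mode */
  set_bit(NAPI_STATE_THREADED_BUSY_POLL, &napi->state);
  set_bit(NAPI_STATE_THREADED, &napi->state);

  /* on schedule: tell napi_complete_done() not to re-arm interrupts */
  set_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state);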

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Martin Karsten <mkarsten@uwaterloo.ca>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
Link: https://patch.msgid.link/20251028203007.575686-2-skhawaja@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03 18:11:40 -08:00
Kuniyuki Iwashima
7d99a7c6c6 mpls: Drop RTNL for RTM_NEWROUTE, RTM_DELROUTE, and RTM_GETROUTE.
RTM_NEWROUTE looks up dev under RCU (ip_route_output(),
ipv6_stub->ipv6_dst_lookup_flow(), netdev_get_by_index()),
and each neighbour holds the refcnt of its dev.

Also, net->mpls.platform_label is protected by a dedicated
per-netns mutex.

Now, no MPLS code depends on RTNL.

Let's drop RTNL for RTM_NEWROUTE, RTM_DELROUTE, and RTM_GETROUTE.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-14-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03 17:40:54 -08:00
Kuniyuki Iwashima
e833eb2516 mpls: Protect net->mpls.platform_label with a per-netns mutex.
MPLS (re)uses RTNL to protect net->mpls.platform_label,
but the lock does not need to be RTNL at all.

Let's protect net->mpls.platform_label with a dedicated
per-netns mutex.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-13-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03 17:40:53 -08:00
Kuniyuki Iwashima
fb2b77b9b1 mpls: Convert RTM_GETNETCONF to RCU.
mpls_netconf_get_devconf() calls __dev_get_by_index(),
and this is its only dependency on RTNL.

Let's convert mpls_netconf_get_devconf() to RCU and use
dev_get_by_index_rcu().

Note that nlmsg_new() is moved ahead to use GFP_KERNEL.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-12-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03 17:40:52 -08:00
Kuniyuki Iwashima
dde1b38e87 mpls: Convert mpls_dump_routes() to RCU.
mpls_dump_routes() sets fib_dump_filter.rtnl_held to true and
calls __dev_get_by_index() in mpls_valid_fib_dump_req().

This is the only RTNL dependency in mpls_dump_routes().

Also, synchronize_rcu() in resize_platform_label_table()
guarantees that net->mpls.platform_label is alive under RCU.

Let's convert mpls_dump_routes() to RCU and use dev_get_by_index_rcu().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-11-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03 17:40:51 -08:00
Kuniyuki Iwashima
3a49629335 mpls: Use mpls_route_input() where appropriate.
In many places, we uses rtnl_dereference() twice for
net->mpls.platform_label and net->mpls.platform_label[index].

Let's replace the code with mpls_route_input().

We do not use mpls_route_input() in mpls_dump_routes() since
we will rely on RCU there.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-10-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03 17:40:50 -08:00
Kuniyuki Iwashima
73e4053939 mpls: Add mpls_route_input().
mpls_route_input_rcu() is called from mpls_forward() and
mpls_getroute().

The former is under RCU, and the latter is under RTNL, so
mpls_route_input_rcu() uses rcu_dereference_rtnl().

Let's use rcu_dereference() in mpls_route_input_rcu() and
add an RTNL variant for mpls_getroute().

Later, we will remove rtnl_dereference() there.
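
A hedged sketch of the two accessors; the RCU variant mirrors the
existing mpls_route_input_rcu(), the RTNL variant being the new helper:

  static struct mpls_route *mpls_route_input_rcu(struct net *net,
                                                 unsigned int index)
  {
          struct mpls_route __rcu **platform_label;

          if (index >= net->mpls.platform_labels)
                  return NULL;
          platform_label = rcu_dereference(net->mpls.platform_label);
          return rcu_dereference(platform_label[index]);
  }

  static struct mpls_route *mpls_route_input(struct net *net,
                                             unsigned int index)
  {
          struct mpls_route __rcu **platform_label;

          if (index >= net->mpls.platform_labels)
                  return NULL;
          platform_label = rtnl_dereference(net->mpls.platform_label);
          return rtnl_dereference(platform_label[index]);
  }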

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-9-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03 17:40:49 -08:00