linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-06-01 23:23:46 -04:00

Author	SHA1	Message	Date
Ido Schimmel	d12d04d221	ipv6: icmp: Add RFC 5837 support Add the ability to append the incoming IP interface information to ICMPv6 error messages in accordance with RFC 5837 and RFC 4884. This is required for more meaningful traceroute results in unnumbered networks. The feature is disabled by default and controlled via a new sysctl ("net.ipv6.icmp.errors_extension_mask") which accepts a bitmask of ICMP extensions to append to ICMP error messages. Currently, only a single value is supported, but the interface and the implementation should be able to support more extensions, if needed. Clone the skb and copy the relevant data portions before modifying the skb as the caller of icmp6_send() still owns the skb after the function returns. This should be fine since by default ICMP error messages are rate limited to 1000 per second and no more than 1 per second per specific host. Trim or pad the packet to 128 bytes before appending the ICMP extension structure in order to be compatible with legacy applications that assume that the ICMP extension structure always starts at this offset (the minimum length specified by RFC 4884). Since commit `20e1954fe2` ("ipv6: RFC 4884 partial support for SIT/GRE tunnels") it is possible for icmp6_send() to be called with an skb that already contains ICMP extensions. This can happen when we receive an ICMPv4 message with extensions from a tunnel and translate it to an ICMPv6 message towards an IPv6 host in the overlay network. I could not find an RFC that supports this behavior, but it makes sense to not overwrite the original extensions that were appended to the packet. Therefore, avoid appending extensions if the length field in the provided ICMPv6 header is already filled. Export netdev_copy_name() using EXPORT_IPV6_MOD_GPL() to make it available to IPv6 when it is built as a module. Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251027082232.232571-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 18:28:30 -07:00
Ido Schimmel	f0e7036fc9	ipv4: icmp: Add RFC 5837 support Add the ability to append the incoming IP interface information to ICMPv4 error messages in accordance with RFC 5837 and RFC 4884. This is required for more meaningful traceroute results in unnumbered networks. The feature is disabled by default and controlled via a new sysctl ("net.ipv4.icmp_errors_extension_mask") which accepts a bitmask of ICMP extensions to append to ICMP error messages. Currently, only a single value is supported, but the interface and the implementation should be able to support more extensions, if needed. Clone the skb and copy the relevant data portions before modifying the skb as the caller of __icmp_send() still owns the skb after the function returns. This should be fine since by default ICMP error messages are rate limited to 1000 per second and no more than 1 per second per specific host. Trim or pad the packet to 128 bytes before appending the ICMP extension structure in order to be compatible with legacy applications that assume that the ICMP extension structure always starts at this offset (the minimum length specified by RFC 4884). Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251027082232.232571-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 18:28:29 -07:00
Kuniyuki Iwashima	b8a7826e4b	net: sched: Don't use WARN_ON_ONCE() for -ENOMEM in tcf_classify(). As demonstrated by syzbot, WARN_ON_ONCE() in tcf_classify() can be easily triggered by fault injection. [0] We should not use WARN_ON_ONCE() for the simple -ENOMEM case. Also, we provide SKB_DROP_REASON_NOMEM for the same error. Let's remove WARN_ON_ONCE() there. [0]: FAULT_INJECTION: forcing a failure. name failslab, interval 1, probability 0, space 0, times 0 CPU: 0 UID: 0 PID: 31392 Comm: syz.8.7081 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/02/2025 Call Trace: <TASK> dump_stack_lvl+0x189/0x250 should_fail_ex+0x414/0x560 should_failslab+0xa8/0x100 kmem_cache_alloc_noprof+0x74/0x6e0 skb_ext_add+0x148/0x8f0 tcf_classify+0xeba/0x1140 multiq_enqueue+0xfd/0x4c0 net/sched/sch_multiq.c:66 ... WARNING: CPU: 0 PID: 31392 at net/sched/cls_api.c:1869 tcf_classify+0xfd7/0x1140 Modules linked in: CPU: 0 UID: 0 PID: 31392 Comm: syz.8.7081 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/02/2025 RIP: 0010:tcf_classify+0xfd7/0x1140 Code: e8 03 42 0f b6 04 30 84 c0 0f 85 41 01 00 00 66 41 89 1f eb 05 e8 89 26 75 f8 bb ff ff ff ff e9 04 f9 ff ff e8 7a 26 75 f8 90 <0f> 0b 90 49 83 c5 44 4c 89 eb 49 c1 ed 03 43 0f b6 44 35 00 84 c0 RSP: 0018:ffffc9000b7671f0 EFLAGS: 00010293 RAX: ffffffff894addf6 RBX: 0000000000000002 RCX: ffff888025029e40 RDX: 0000000000000000 RSI: ffffffff8bbf05c0 RDI: ffffffff8bbf0580 RBP: 0000000000000000 R08: 00000000ffffffff R09: 1ffffffff1c0bfd6 R10: dffffc0000000000 R11: fffffbfff1c0bfd7 R12: ffff88805a90de5c R13: ffff88805a90ddc0 R14: dffffc0000000000 R15: ffffc9000b7672c0 FS: 00007f20739f66c0(0000) GS:ffff88812613e000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000110c2d2a80 CR3: 0000000024e36000 CR4: 00000000003526f0 Call Trace: <TASK> multiq_classify net/sched/sch_multiq.c:39 [inline] multiq_enqueue+0xfd/0x4c0 net/sched/sch_multiq.c:66 dev_qdisc_enqueue+0x4e/0x260 net/core/dev.c:4118 __dev_xmit_skb net/core/dev.c:4214 [inline] __dev_queue_xmit+0xe83/0x3b50 net/core/dev.c:4729 packet_snd net/packet/af_packet.c:3076 [inline] packet_sendmsg+0x3e33/0x5080 net/packet/af_packet.c:3108 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg+0x21c/0x270 net/socket.c:742 ____sys_sendmsg+0x505/0x830 net/socket.c:2630 ___sys_sendmsg+0x21f/0x2a0 net/socket.c:2684 __sys_sendmsg net/socket.c:2716 [inline] __do_sys_sendmsg net/socket.c:2721 [inline] __se_sys_sendmsg net/socket.c:2719 [inline] __x64_sys_sendmsg+0x19b/0x260 net/socket.c:2719 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f207578efc9 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f20739f6038 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f20759e5fa0 RCX: 00007f207578efc9 RDX: 0000000000000004 RSI: 00002000000000c0 RDI: 0000000000000008 RBP: 00007f20739f6090 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 R13: 00007f20759e6038 R14: 00007f20759e5fa0 R15: 00007f2075b0fa28 </TASK> Reported-by: syzbot+87e1289a044fcd0c5f62@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/69003e33.050a0220.32483.00e8.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20251028035859.2067690-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 18:00:37 -07:00
Wang Liang	f58abec23d	net: ipv4: Remove extern udp_v4_early_demux()/tcp_v4_early_demux() in .c files Function udp_v4_early_demux() was already declared in 'include/net/udp.h', no need to keep the extern in 'ip_input.c', which may produce the following checkpatch warning: WARNING: externs should be avoided in .c files #45: FILE: net/ipv4/ip_input.c:322: +enum skb_drop_reason udp_v4_early_demux(struct sk_buff *skb); Replace it by including 'net/udp.h'. Do the same for tcp_v4_early_demux(). Signed-off-by: Wang Liang <wangliang74@huawei.com> Link: https://patch.msgid.link/20251025092637.1020960-1-wangliang74@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:05:30 -07:00
Rakuram Eswaran	5c00da851c	net: tcp_lp: fix kernel-doc warnings and update outdated reference links Fix kernel-doc warnings in tcp_lp.c by adding missing parameter descriptions for tcp_lp_cong_avoid() and tcp_lp_pkts_acked() when building with W=1. Also replace invalid URLs in the file header comment with the currently valid links to the TCP-LP paper and implementation page. No functional changes. Signed-off-by: Rakuram Eswaran <rakuram.e96@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251025-net_ipv4_tcp_lp_c-v1-1-058cc221499e@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-28 17:52:44 -07:00
Christophe JAILLET	294bfe0343	sctp: Constify struct sctp_sched_ops 'struct sctp_sched_ops' is not modified in these drivers. Constifying this structure moves some data to a read-only section, so increases overall security, especially when the structure holds some function pointers. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 8019 568 0 8587 218b net/sctp/stream_sched_fc.o After: ===== text data bss dec hex filename 8275 312 0 8587 218b net/sctp/stream_sched_fc.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://patch.msgid.link/dce03527eb7b7cc8a3c26d5cdac12bafe3350135.1761377890.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-28 17:50:55 -07:00
Eric Dumazet	a086e9860c	net: optimize enqueue_to_backlog() for the fast path Add likely() and unlikely() clauses for the common cases: Device is running. Queue is not full. Queue is less than half capacity. Add max_backlog parameter to skb_flow_limit() to avoid a second READ_ONCE(net_hotdata.max_backlog). skb_flow_limit() does not need the backlog_lock protection, and can be called before we acquire the lock, for even better resistance to attacks. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251024090517.3289181-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-28 17:39:03 -07:00
Eric Dumazet	0ae1ac7335	tcp: remove one ktime_get() from recvmsg() fast path Each time some payload is consumed by user space (recvmsg() and friends), TCP calls tcp_rcv_space_adjust() to run DRS algorithm to check if an increase of sk->sk_rcvbuf is needed. This function is based on time sampling, and currently calls tcp_mstamp_refresh(tp), which is a wrapper around ktime_get_ns(). ktime_get_ns() has a high cost on some platforms. 100+ cycles for rdtscp on AMD EPYC Turin for instance. We do not have to refresh tp->tcp_mpstamp, using the last cached value is enough. We only need to refresh it from __tcp_cleanup_rbuf() if an ACK must be sent (this is a rare event). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251024120707.3516550-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-27 18:15:38 -07:00
Kuniyuki Iwashima	71068e2e1b	sctp: Remove sctp_copy_sock() and sctp_copy_descendant(). Now, sctp_accept() and sctp_do_peeloff() use sk_clone(), and we no longer need sctp_copy_sock() and sctp_copy_descendant(). Let's remove them. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20251023231751.4168390-9-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-27 18:04:59 -07:00
Kuniyuki Iwashima	b7ddb55f31	sctp: Use sctp_clone_sock() in sctp_do_peeloff(). sctp_do_peeloff() calls sock_create() to allocate and initialise struct sock, inet_sock, and sctp_sock, but later sctp_copy_sock() and sctp_sock_migrate() overwrite most fields. What sctp_do_peeloff() does is more like accept(). Let's use sock_create_lite() and sctp_clone_sock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20251023231751.4168390-8-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-27 18:04:58 -07:00
Kuniyuki Iwashima	c49ed521f1	sctp: Remove sctp_pf.create_accept_sk(). sctp_v[46]_create_accept_sk() are no longer used. Let's remove sctp_pf.create_accept_sk(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20251023231751.4168390-7-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-27 18:04:58 -07:00
Kuniyuki Iwashima	16942cf4d3	sctp: Use sk_clone() in sctp_accept(). sctp_accept() calls sctp_v[46]_create_accept_sk() to allocate a new socket and calls sctp_sock_migrate() to copy fields from the parent socket to the new socket. sctp_v4_create_accept_sk() allocates sk by sk_alloc(), initialises it by sock_init_data(), and copy a bunch of fields from the parent socekt by sctp_copy_sock(). sctp_sock_migrate() calls sctp_copy_descendant() to copy most fields in sctp_sock from the parent socket by memcpy(). These can be simply replaced by sk_clone(). Let's consolidate sctp_v[46]_create_accept_sk() to sctp_clone_sock() with sk_clone(). We will reuse sctp_clone_sock() for sctp_do_peeloff() and then remove sctp_copy_descendant(). Note that sock_reset_flag(newsk, SOCK_ZAPPED) is not copied to sctp_clone_sock() as sctp does not use SOCK_ZAPPED at all. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20251023231751.4168390-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-27 18:04:58 -07:00
Kuniyuki Iwashima	151b98d10e	net: Add sk_clone(). sctp_accept() will use sk_clone_lock(), but it will be called with the parent socket locked, and sctp_migrate() acquires the child lock later. Let's add no lock version of sk_clone_lock(). Note that lockdep complains if we simply use bh_lock_sock_nested(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20251023231751.4168390-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-27 18:04:57 -07:00
Kuniyuki Iwashima	b7185792f8	sctp: Don't call sk->sk_prot->init() in sctp_v[46]_create_accept_sk(). sctp_accept() calls sctp_v[46]_create_accept_sk() to allocate a new socket and calls sctp_sock_migrate() to copy fields from the parent socket to the new socket. sctp_v[46]_create_accept_sk() calls sctp_init_sock() to initialise sctp_sock, but most fields are overwritten by sctp_copy_descendant() called from sctp_sock_migrate(). Things done in sctp_init_sock() but not in sctp_sock_migrate() are the following: 1. Copy sk->sk_gso 2. Copy sk->sk_destruct (sctp_v6_init_sock()) 3. Allocate sctp_sock.ep 4. Initialise sctp_sock.pd_lobby 5. Count sk_sockets_allocated_inc(), sock_prot_inuse_add(), and SCTP_DBG_OBJCNT_INC() Let's do these in sctp_copy_sock() and sctp_sock_migrate() and avoid calling sk->sk_prot->init() in sctp_v[46]_create_accept_sk(). Note that sk->sk_destruct is already copied in sctp_copy_sock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20251023231751.4168390-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-27 18:04:57 -07:00
Kuniyuki Iwashima	2d4df59aae	sctp: Don't copy sk_sndbuf and sk_rcvbuf in sctp_sock_migrate(). sctp_sock_migrate() is called from 2 places. 1) sctp_accept() calls sp->pf->create_accept_sk() before sctp_sock_migrate(), and sp->pf->create_accept_sk() calls sctp_copy_sock(). 2) sctp_do_peeloff() also calls sctp_copy_sock() before sctp_sock_migrate(). sctp_copy_sock() copies sk_sndbuf and sk_rcvbuf from the parent socket. Let's not copy the two fields in sctp_sock_migrate(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20251023231751.4168390-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-27 18:04:57 -07:00
Kuniyuki Iwashima	622e8838a2	sctp: Defer SCTP_DBG_OBJCNT_DEC() to sctp_destroy_sock(). SCTP_DBG_OBJCNT_INC() is called only when sctp_init_sock() returns 0 after successfully allocating sctp_sk(sk)->ep. OTOH, SCTP_DBG_OBJCNT_DEC() is called in sctp_close(). The code seems to expect that the socket is always exposed to userspace once SCTP_DBG_OBJCNT_INC() is incremented, but there is a path where the assumption is not true. In sctp_accept(), sctp_sock_migrate() could fail after sctp_init_sock(). Then, sk_common_release() does not call inet_release() nor sctp_close(). Instead, it calls sk->sk_prot->destroy(). Let's move SCTP_DBG_OBJCNT_DEC() from sctp_close() to sctp_destroy_sock(). Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20251023231751.4168390-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-27 18:04:56 -07:00
Jakub Kicinski	bbfa5e7c8d	Merge tag 'batadv-next-pullrequest-20251024' of https://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== This cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - use skb_crc32c() instead of skb_seq_read(), by Sven Eckelmann * tag 'batadv-next-pullrequest-20251024' of https://git.open-mesh.org/linux-merge: batman-adv: use skb_crc32c() instead of skb_seq_read() batman-adv: Start new development cycle ==================== Link: https://patch.msgid.link/20251024092315.232636-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-27 18:02:38 -07:00
Petr Machata	68800bbf58	net: bridge: Flush multicast groups when snooping is disabled When forwarding multicast packets, the bridge takes MDB into account when IGMP / MLD snooping is enabled. Currently, when snooping is disabled, the MDB is retained, even though it is not used anymore. At the same time, during the time that snooping is disabled, the IGMP / MLD control packets are obviously ignored, and after the snooping is reenabled, the administrator has to assume it is out of sync. In particular, missed join and leave messages would lead to traffic being forwarded to wrong interfaces. Keeping the MDB entries around thus serves no purpose, and just takes memory. Note also that disabling per-VLAN snooping does actually flush the relevant MDB entries. This patch flushes non-permanent MDB entries as global snooping is disabled. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/5e992df1bb93b88e19c0ea5819e23b669e3dde5d.1761228273.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-27 17:57:21 -07:00
Wilfred Mallawa	82cb5be6ad	net/tls: support setting the maximum payload size During a handshake, an endpoint may specify a maximum record size limit. Currently, the kernel defaults to TLS_MAX_PAYLOAD_SIZE (16KB) for the maximum record size. Meaning that, the outgoing records from the kernel can exceed a lower size negotiated during the handshake. In such a case, the TLS endpoint must send a fatal "record_overflow" alert [1], and thus the record is discarded. Upcoming Western Digital NVMe-TCP hardware controllers implement TLS support. For these devices, supporting TLS record size negotiation is necessary because the maximum TLS record size supported by the controller is less than the default 16KB currently used by the kernel. Currently, there is no way to inform the kernel of such a limit. This patch adds support to a new setsockopt() option `TLS_TX_MAX_PAYLOAD_LEN` that allows for setting the maximum plaintext fragment size. Once set, outgoing records are no larger than the size specified. This option can be used to specify the record size limit. [1] https://www.rfc-editor.org/rfc/rfc8449 Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20251022001937.20155-1-wilfred.opensource@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-27 16:13:42 -07:00
Dust Li	f0773d0b41	smc: rename smc_find_ism_store_rc to reflect broader usage The function smc_find_ism_store_rc() is used to record the reason why a suitable device (either ISM or RDMA) could not be found. However, its name suggests it is ISM-specific, which is misleading. Rename it to better reflect its actual usage. No functional changes. Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251023020012.69609-1-dust.li@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-24 18:48:03 -07:00
Julia Lawall	d0d2203b9a	strparser: fix typo in comment The name frags_list doesn't appear in the kernel. It should be frag_list as in the next sentence. Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Link: https://patch.msgid.link/20251023013051.1728388-1-Julia.Lawall@inria.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-24 18:47:08 -07:00
Kuniyuki Iwashima	3064d0fe02	neighbour: Convert rwlock of struct neigh_table to spinlock. Only neigh_for_each() and neigh_seq_start/stop() are on the reader side of neigh_table.lock. Let's convert rwlock to the plain spinlock. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251022054004.2514876-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-24 17:57:20 -07:00
Kuniyuki Iwashima	55a6046b48	neighbour: Convert RTM_SETNEIGHTBL to RCU. neightbl_set() fetches neigh_tables[] and updates attributes under write_lock_bh(&tbl->lock), so RTNL is not needed. neigh_table_clear() synchronises RCU only, and rcu_dereference_rtnl() protects nothing here. If we released RCU after fetching neigh_tables[], there would be no synchronisation to block neigh_table_clear() further, so RCU is held until the end of the function. Another option would be to protect neigh_tables[] user with SRCU and add synchronize_srcu() in neigh_table_clear(). But, holding RCU should be fine as we hold write_lock_bh() for the rest of neightbl_set() anyway. Let's perform RTM_SETNEIGHTBL under RCU and drop RTNL. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251022054004.2514876-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-24 17:57:20 -07:00
Kuniyuki Iwashima	4ae34be500	neighbour: Convert RTM_GETNEIGHTBL to RCU. neightbl_dump_info() calls these functions for each neigh_tables[] entry: 1. neightbl_fill_info() for tbl->parms 2. neightbl_fill_param_info() for tbl->parms_list (except tbl->parms) Both functions rely on the table lock (read_lock_bh(&tbl->lock)) and RTNL is not needed. Let's fetch the table under RCU and convert RTM_GETNEIGHTBL to RCU. Note that the first entry of tbl->parms_list is tbl->parms.list and embedded in neigh_table, so list_next_entry() is safe. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251022054004.2514876-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-24 17:57:20 -07:00
Kuniyuki Iwashima	35d7c70870	neighbour: Annotate access to neigh_parms fields. NEIGH_VAR() is read locklessly in the fast path, and IPv6 ndisc uses NEIGH_VAR_SET() locklessly. The next patch will convert neightbl_dump_info() to RCU. Let's annotate accesses to neigh_param with READ_ONCE() and WRITE_ONCE(). Note that ndisc_ifinfo_sysctl_change() uses &NEIGH_VAR() and we cannot use '&' with READ_ONCE(), so NEIGH_VAR_PTR() is introduced. Note also that NEIGH_VAR_INIT() does not need WRITE_ONCE() as it is before parms is published. Also, the only user hippi_neigh_setup_dev() is no longer called since commit `e3804cbebb` ("net: remove COMPAT_NET_DEV_OPS"), which looks wrong, but probably no one uses HIPPI and RoadRunner. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251022054004.2514876-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-24 17:57:20 -07:00
Kuniyuki Iwashima	06d6322280	neighbour: Use RCU list helpers for neigh_parms.list writers. We will convert RTM_GETNEIGHTBL to RCU soon, where we traverse tbl->parms_list under RCU in neightbl_dump_info(). Let's use RCU list helper for neigh_parms in neigh_parms_alloc() and neigh_parms_release(). neigh_table_init() uses the plain list_add() for the default neigh_parm that is embedded in the table and not yet published. Note that neigh_parms_release() already uses call_rcu() to free neigh_parms. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251022054004.2514876-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-24 17:57:20 -07:00
Eric Biggers	05774d7e42	tcp: Remove unnecessary null check in tcp_inbound_md5_hash() The 'if (!key && hash_location)' check in tcp_inbound_md5_hash() implies that hash_location might be null. However, later code in the function dereferences hash_location anyway, without checking for null first. Fortunately, there is no real bug, since tcp_inbound_md5_hash() is called only with non-null values of hash_location. Therefore, remove the unnecessary and misleading null check of hash_location. This silences a Smatch static checker warning (https://lore.kernel.org/netdev/aPi4b6aWBbBR52P1@stanley.mountain/) Also fix the related comment at the beginning of the function. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20251022221209.19716-1-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-23 18:49:55 -07:00
Sunday Adelodun	ec538867a3	net: unix: remove outdated BSD behavior comment in unix_release_sock() Remove the long-standing comment in unix_release_sock() that described a behavioral difference between Linux and BSD regarding when ECONNRESET is sent to connected UNIX sockets upon closure. As confirmed by testing on macOS (similar to BSD behavior), ECONNRESET is only observed for SOCK_DGRAM sockets, not for SOCK_STREAM. Meanwhile, Linux already returns ECONNRESET in cases where a socket is closed with unread data or is not yet accept()ed. This means the previous comment no longer accurately describes current behavior and is misleading. Suggested-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Sunday Adelodun <adelodunolaoluwa@yahoo.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251021195906.20389-1-adelodunolaoluwa@yahoo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-23 17:26:16 -07:00
Jakub Kicinski	2b7553db91	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.18-rc3). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-23 10:53:08 -07:00
Linus Torvalds	ab431bc397	Merge tag 'net-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from can. Slim pickings, I'm guessing people haven't really started testing. Current release - new code bugs: - eth: mlx5e: - psp: avoid 'accel' NULL pointer dereference - skip PPHCR register query for FEC histogram if not supported Previous releases - regressions: - bonding: update the slave array for broadcast mode - rtnetlink: re-allow deleting FDB entries in user namespace - eth: dpaa2: fix the pointer passed to PTR_ALIGN on Tx path Previous releases - always broken: - can: drop skb on xmit if device is in listen-only mode - gro: clear skb_shinfo(skb)->hwtstamps in napi_reuse_skb() - eth: mlx5e - RX, fix generating skb from non-linear xdp_buff if program trims frags - make devcom init failures non-fatal, fix races with IPSec Misc: - some documentation formatting 'fixes'" * tag 'net-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits) net/mlx5: Fix IPsec cleanup over MPV device net/mlx5: Refactor devcom to return NULL on failure net/mlx5e: Skip PPHCR register query if not supported by the device net/mlx5: Add PPHCR to PCAM supported registers mask virtio-net: zero unused hash fields net: phy: micrel: always set shared->phydev for LAN8814 vsock: fix lock inversion in vsock_assign_transport() ovpn: use datagram_poll_queue for socket readiness in TCP espintcp: use datagram_poll_queue for socket readiness net: datagram: introduce datagram_poll_queue for custom receive queues net: bonding: fix possible peer notify event loss or dup issue net: hsr: prevent creation of HSR device with slaves from another netns sctp: avoid NULL dereference when chunk data buffer is missing ptp: ocp: Fix typo using index 1 instead of i in SMA initialization loop net: ravb: Ensure memory write completes before ringing TX doorbell net: ravb: Enforce descriptor type ordering net: hibmcge: select FIXED_PHY net: dlink: use dev_kfree_skb_any instead of dev_kfree_skb Documentation: networking: ax25: update the mailing list info. net: gro_cells: fix lock imbalance in gro_cells_receive() ...	2025-10-23 07:03:18 -10:00
Stefano Garzarella	f7c877e753	vsock: fix lock inversion in vsock_assign_transport() Syzbot reported a potential lock inversion deadlock between vsock_register_mutex and sk_lock-AF_VSOCK when vsock_linger() is called. The issue was introduced by commit `687aa0c558` ("vsock: Fix transport_* TOCTOU") which added vsock_register_mutex locking in vsock_assign_transport() around the transport->release() call, that can call vsock_linger(). vsock_assign_transport() can be called with sk_lock held. vsock_linger() calls sk_wait_event() that temporarily releases and re-acquires sk_lock. During this window, if another thread hold vsock_register_mutex while trying to acquire sk_lock, a circular dependency is created. Fix this by releasing vsock_register_mutex before calling transport->release() and vsock_deassign_transport(). This is safe because we don't need to hold vsock_register_mutex while releasing the old transport, and we ensure the new transport won't disappear by obtaining a module reference first via try_module_get(). Reported-by: syzbot+10e35716f8e4929681fa@syzkaller.appspotmail.com Tested-by: syzbot+10e35716f8e4929681fa@syzkaller.appspotmail.com Fixes: `687aa0c558` ("vsock: Fix transport_* TOCTOU") Cc: mhal@rbox.co Cc: stable@vger.kernel.org Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20251021121718.137668-1-sgarzare@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-10-23 16:07:58 +02:00
Ralf Lici	0fc3e32c2c	espintcp: use datagram_poll_queue for socket readiness espintcp uses a custom queue (ike_queue) to deliver packets to userspace. The polling logic relies on datagram_poll, which checks sk_receive_queue, which can lead to false readiness signals when that queue contains non-userspace packets. Switch espintcp_poll to use datagram_poll_queue with ike_queue, ensuring poll only signals readiness when userspace data is actually available. Fixes: `e27cca96cd` ("xfrm: add espintcp (RFC 8229)") Signed-off-by: Ralf Lici <ralf@mandelbit.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20251021100942.195010-3-ralf@mandelbit.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-10-23 15:46:04 +02:00
Ralf Lici	f6ceec6434	net: datagram: introduce datagram_poll_queue for custom receive queues Some protocols using TCP encapsulation (e.g., espintcp, openvpn) deliver userspace-bound packets through a custom skb queue rather than the standard sk_receive_queue. Introduce datagram_poll_queue that accepts an explicit receive queue, and convert datagram_poll into a wrapper around datagram_poll_queue. This allows protocols with custom skb queues to reuse the core polling logic without relying on sk_receive_queue. Cc: Sabrina Dubroca <sd@queasysnail.net> Cc: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: Ralf Lici <ralf@mandelbit.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Antonio Quartulli <antonio@openvpn.net> Link: https://patch.msgid.link/20251021100942.195010-2-ralf@mandelbit.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-10-23 15:46:04 +02:00
Fernando Fernandez Mancera	c0178eec88	net: hsr: prevent creation of HSR device with slaves from another netns HSR/PRP driver does not handle correctly having slaves/interlink devices in a different net namespace. Currently, it is possible to create a HSR link in a different net namespace than the slaves/interlink with the following command: ip link add hsr0 netns hsr-ns type hsr slave1 eth1 slave2 eth2 As there is no use-case on supporting this scenario, enforce that HSR device link matches netns defined by IFLA_LINK_NETNSID. The iproute2 command mentioned above will throw the following error: Error: hsr: HSR slaves/interlink must be on the same net namespace than HSR link. Fixes: `f421436a59` ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Link: https://patch.msgid.link/20251020135533.9373-1-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-22 19:22:22 -07:00
Alexey Simakov	441f0647f7	sctp: avoid NULL dereference when chunk data buffer is missing chunk->skb pointer is dereferenced in the if-block where it's supposed to be NULL only. chunk->skb can only be NULL if chunk->head_skb is not. Check for frag_list instead and do it just before replacing chunk->skb. We're sure that otherwise chunk->skb is non-NULL because of outer if() condition. Fixes: `90017accff` ("sctp: Add GSO support") Signed-off-by: Alexey Simakov <bigalex934@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Link: https://patch.msgid.link/20251021130034.6333-1-bigalex934@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-22 19:19:31 -07:00
David Yang	ca4709843b	net: dsa: tag_yt921x: add support for Motorcomm YT921x tags Add support for Motorcomm YT921x tags, which includes a proper configurable ethertype field (default to 0x9988). Signed-off-by: David Yang <mmyangfl@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20251017060859.326450-3-mmyangfl@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-21 18:25:30 -07:00
Hangbin Liu	0152747a52	net: bridge: use common function to compute the features Previously, bridge ignored all features propagation and DST retention, only handling explicitly the GSO limits. By switching to the new helper netdev_compute_master_upper_features(), the bridge now expose additional features, depending on the lowers capabilities. Since br_set_gso_limits() is already covered by the helper, it can be removed safely. Bridge has it's own way to update needed_headroom. So we don't need to update it in the helper. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20251017034155.61990-5-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-21 18:08:23 -07:00
Hangbin Liu	28098defc7	net: add a common function to compute features for upper devices Some high level software drivers need to compute features from lower devices. But each has their own implementations and may lost some feature compute. Let's use one common function to compute features for kinds of these devices. The new helper uses the current bond implementation as the reference one, as the latter already handles all the relevant aspects: netdev features, TSO limits and dst retention. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20251017034155.61990-2-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-21 18:08:23 -07:00
Eric Dumazet	c5394b8b7a	net: gro_cells: fix lock imbalance in gro_cells_receive() syzbot found that the local_unlock_nested_bh() call was missing in some cases. WARNING: possible recursive locking detected syzkaller #0 Not tainted -------------------------------------------- syz.2.329/7421 is trying to acquire lock: ffffe8ffffd48888 ((&cell->bh_lock)){+...}-{3:3}, at: spin_lock include/linux/spinlock_rt.h:44 [inline] ffffe8ffffd48888 ((&cell->bh_lock)){+...}-{3:3}, at: gro_cells_receive+0x404/0x790 net/core/gro_cells.c:30 but task is already holding lock: ffffe8ffffd48888 ((&cell->bh_lock)){+...}-{3:3}, at: spin_lock include/linux/spinlock_rt.h:44 [inline] ffffe8ffffd48888 ((&cell->bh_lock)){+...}-{3:3}, at: gro_cells_receive+0x404/0x790 net/core/gro_cells.c:30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock((&cell->bh_lock)); lock((&cell->bh_lock)); * DEADLOCK * Given the introduction of @have_bh_lock variable, it seems the author intent was to have the local_unlock_nested_bh() after the @unlock label. Fixes: `25718fdcbd` ("net: gro_cells: Use nested-BH locking for gro_cell") Reported-by: syzbot+f9651b9a8212e1c8906f@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68f65eb9.a70a0220.205af.0034.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20251020161114.1891141-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-21 17:41:09 -07:00
Alok Tiwari	0364ca3309	devlink: region: correct port region lookup to use port_ops The function devlink_port_region_get_by_name() incorrectly uses region->ops->name to compare the region name. as it is not any critical impact as ops and port_ops define as union for devlink_region but as per code logic it should refer port_ops here. No functional impact as ops and port_ops are part of same union, and name is the first member of both. Update it to use region->port_ops->name to properly reference the name of the devlink port region. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20251020170916.1741808-1-alok.a.tiwari@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-21 17:40:26 -07:00
Matthieu Baerts (NGI0)	e84cb860ac	mptcp: pm: in-kernel: C-flag: handle late ADD_ADDR The special C-flag case expects the ADD_ADDR to be received when switching to 'fully-established'. But for various reasons, the ADD_ADDR could be sent after the "4th ACK", and the special case doesn't work. On NIPA, the new test validating this special case for the C-flag failed a few times, e.g. 102 default limits, server deny join id 0 syn rx [FAIL] got 0 JOIN[s] syn rx expected 2 Server ns stats (...) MPTcpExtAddAddrTx 1 MPTcpExtEchoAdd 1 Client ns stats (...) MPTcpExtAddAddr 1 MPTcpExtEchoAddTx 1 synack rx [FAIL] got 0 JOIN[s] synack rx expected 2 ack rx [FAIL] got 0 JOIN[s] ack rx expected 2 join Rx [FAIL] see above syn tx [FAIL] got 0 JOIN[s] syn tx expected 2 join Tx [FAIL] see above I had a suspicion about what the issue could be: the ADD_ADDR might have been received after the switch to the 'fully-established' state. The issue was not easy to reproduce. The packet capture shown that the ADD_ADDR can indeed be sent with a delay, and the client would not try to establish subflows to it as expected. A simple fix is not to mark the endpoints as 'used' in the C-flag case, when looking at creating subflows to the remote initial IP address and port. In this case, there is no need to try. Note: newly added fullmesh endpoints will still continue to be used as expected, thanks to the conditions behind mptcp_pm_add_addr_c_flag_case. Fixes: `4b1ff850e0` ("mptcp: pm: in-kernel: usable client side with C-flag") Cc: stable@vger.kernel.org Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251020-net-mptcp-c-flag-late-add-addr-v1-1-8207030cb0e8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-21 17:36:45 -07:00
Eric Dumazet	3ff9bcecce	net: avoid extra access to sk->sk_wmem_alloc in sock_wfree() UDP TX packets destructor is sock_wfree(). It suffers from a cache line bouncing in sock_def_write_space_wfree(). Instead of reading sk->sk_wmem_alloc after we just did an atomic RMW on it, use __refcount_sub_and_test() to get the old value for free, and pass the new value to sock_def_write_space_wfree(). Add __sock_writeable() helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251017133712.2842665-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-10-21 15:56:21 +02:00
Aswin Karuvally	38516e3fa4	s390/iucv: Convert sprintf/snprintf to scnprintf Convert sprintf/snprintf calls to scnprintf to better align with the kernel development community practices [1]. Link: https://lwn.net/Articles/69419 [1] Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: Aswin Karuvally <aswin@linux.ibm.com> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251017094954.1402684-1-wintera@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-20 18:35:21 -07:00
Wang Liang	f584239a9e	net/smc: fix general protection fault in __smc_diag_dump The syzbot report a crash: Oops: general protection fault, probably for non-canonical address 0xfbd5a5d5a0000003: 0000 [#1] SMP KASAN NOPTI KASAN: maybe wild-memory-access in range [0xdead4ead00000018-0xdead4ead0000001f] CPU: 1 UID: 0 PID: 6949 Comm: syz.0.335 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025 RIP: 0010:smc_diag_msg_common_fill net/smc/smc_diag.c:44 [inline] RIP: 0010:__smc_diag_dump.constprop.0+0x3ca/0x2550 net/smc/smc_diag.c:89 Call Trace: <TASK> smc_diag_dump_proto+0x26d/0x420 net/smc/smc_diag.c:217 smc_diag_dump+0x27/0x90 net/smc/smc_diag.c:234 netlink_dump+0x539/0xd30 net/netlink/af_netlink.c:2327 __netlink_dump_start+0x6d6/0x990 net/netlink/af_netlink.c:2442 netlink_dump_start include/linux/netlink.h:341 [inline] smc_diag_handler_dump+0x1f9/0x240 net/smc/smc_diag.c:251 __sock_diag_cmd net/core/sock_diag.c:249 [inline] sock_diag_rcv_msg+0x438/0x790 net/core/sock_diag.c:285 netlink_rcv_skb+0x158/0x420 net/netlink/af_netlink.c:2552 netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline] netlink_unicast+0x5a7/0x870 net/netlink/af_netlink.c:1346 netlink_sendmsg+0x8d1/0xdd0 net/netlink/af_netlink.c:1896 sock_sendmsg_nosec net/socket.c:714 [inline] __sock_sendmsg net/socket.c:729 [inline] ____sys_sendmsg+0xa95/0xc70 net/socket.c:2614 ___sys_sendmsg+0x134/0x1d0 net/socket.c:2668 __sys_sendmsg+0x16d/0x220 net/socket.c:2700 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xcd/0x4e0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> The process like this: (CPU1) \| (CPU2) ---------------------------------\|------------------------------- inet_create() \| // init clcsock to NULL \| sk = sk_alloc() \| \| // unexpectedly change clcsock \| inet_init_csk_locks() \| \| // add sk to hash table \| smc_inet_init_sock() \| smc_sk_init() \| smc_hash_sk() \| \| // traverse the hash table \| smc_diag_dump_proto \| __smc_diag_dump() \| // visit wrong clcsock \| smc_diag_msg_common_fill() // alloc clcsock \| smc_create_clcsk \| sock_create_kern \| With CONFIG_DEBUG_LOCK_ALLOC=y, the smc->clcsock is unexpectedly changed in inet_init_csk_locks(). The INET_PROTOSW_ICSK flag is no need by smc, just remove it. After removing the INET_PROTOSW_ICSK flag, this patch alse revert commit `6fd27ea183` ("net/smc: fix lacks of icsk_syn_mss with IPPROTO_SMC") to avoid casting smc_sock to inet_connection_sock. Reported-by: syzbot+f775be4458668f7d220e@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=f775be4458668f7d220e Tested-by: syzbot+f775be4458668f7d220e@syzkaller.appspotmail.com Fixes: `d25a92ccae` ("net/smc: Introduce IPPROTO_SMC") Signed-off-by: Wang Liang <wangliang74@huawei.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: D. Wythe <alibuda@linux.alibaba.com> Link: https://patch.msgid.link/20251017024827.3137512-1-wangliang74@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-20 17:46:06 -07:00
Eric Dumazet	a5cd3a60aa	net: shrink napi_skb_cache_{put,get}() and napi_skb_cache_get_bulk() Following loop in napi_skb_cache_put() is unrolled by the compiler even if CONFIG_KASAN is not enabled: for (i = NAPI_SKB_CACHE_HALF; i < NAPI_SKB_CACHE_SIZE; i++) kasan_mempool_unpoison_object(nc->skb_cache[i], kmem_cache_size(net_hotdata.skbuff_cache)); We have 32 times this sequence, for a total of 384 bytes. 48 8b 3d 00 00 00 00 net_hotdata.skbuff_cache,%rdi e8 00 00 00 00 call kmem_cache_size This is because kmem_cache_size() is not an inline and not const, and kasan_unpoison_object_data() is an inline function. Cache kmem_cache_size() result in a variable, so that the compiler can remove dead code (and variable) when/if CONFIG_KASAN is unset. After this patch, napi_skb_cache_put() is inlined in its callers, and we avoid one kmem_cache_size() call in napi_skb_cache_get() and napi_skb_cache_get_bulk(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20251016182911.1132792-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-20 17:37:57 -07:00
Eric Dumazet	f8a55d5e71	net: add a fast path in __netif_schedule() Cpus serving NIC interrupts and specifically TX completions are often trapped in also restarting a busy qdisc (because qdisc was stopped by BQL or the driver's own flow control). When they call netdev_tx_completed_queue() or netif_tx_wake_queue(), they call __netif_schedule() so that the queue can be run later from net_tx_action() (involving NET_TX_SOFTIRQ) Quite often, by the time the cpu reaches net_tx_action(), another cpu grabbed the qdisc spinlock from __dev_xmit_skb(), and we spend too much time spinning on this lock. We can detect in __netif_schedule() if a cpu is already at a specific point in __dev_xmit_skb() where we have the guarantee the queue will be run. This patch gives a 13 % increase of throughput on an IDPF NIC (200Gbit), 32 TX qeues, sending UDP packets of 120 bytes. This also helps __qdisc_run() to not force a NET_TX_SOFTIRQ if another thread is waiting in __dev_xmit_skb() Before: sar -n DEV 5 5\|grep eth1\|grep Average Average: eth1 1496.44 52191462.56 210.00 13369396.90 0.00 0.00 0.00 54.76 After: sar -n DEV 5 5\|grep eth1\|grep Average Average: eth1 1457.88 59363099.96 205.08 15206384.35 0.00 0.00 0.00 62.29 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251017145334.3016097-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-20 16:45:25 -07:00
Linus Torvalds	d303caf5ca	Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Replace bpf_map_kmalloc_node() with kmalloc_nolock() to fix kmemleak imbalance in tracking of bpf_async_cb structures (Alexei Starovoitov) - Make selftests/bpf arg_parsing.c more robust to errors (Andrii Nakryiko) - Fix redefinition of 'off' as different kind of symbol when I40E driver is builtin (Brahmajit Das) - Do not disable preemption in bpf_test_run (Sahil Chandna) - Fix memory leak in __lookup_instance error path (Shardul Bankar) - Ensure test data is flushed to disk before reading it (Xing Guo) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Fix redefinition of 'off' as different kind of symbol bpf: Do not disable preemption in bpf_test_run(). bpf: Fix memory leak in __lookup_instance error path selftests: arg_parsing: Ensure data is flushed to disk before reading. bpf: Replace bpf_map_kmalloc_node() with kmalloc_nolock() to allocate bpf_async_cb structures. selftests/bpf: make arg_parsing.c more robust to crashes bpf: test_run: Fix ctx leak in bpf_prog_test_run_xdp error path	2025-10-18 08:00:43 -10:00
Jakub Kicinski	e90576829c	Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-10-16 We've added 6 non-merge commits during the last 1 day(s) which contain a total of 18 files changed, 577 insertions(+), 38 deletions(-). The main changes are: 1) Bypass the global per-protocol memory accounting either by setting a netns sysctl or using bpf_setsockopt in a bpf program, from Kuniyuki Iwashima. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: selftests/bpf: Add test for sk->sk_bypass_prot_mem. bpf: Introduce SK_BPF_BYPASS_PROT_MEM. bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE. net: Introduce net.core.bypass_prot_mem sysctl. net: Allow opt-out from global protocol memory accounting. tcp: Save lock_sock() for memcg in inet_csk_accept(). ==================== Link: https://patch.msgid.link/20251016204539.773707-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-17 17:20:42 -07:00
Eric Biggers	37a183d3b7	tcp: Convert tcp-md5 to use MD5 library instead of crypto_ahash Make tcp-md5 use the MD5 library API (added in 6.18) instead of the crypto_ahash API. This is much simpler and also more efficient: - The library API just operates on struct md5_ctx. Just allocate this struct on the stack instead of using a pool of pre-allocated crypto_ahash and ahash_request objects. - The library API accepts standard pointers and doesn't require scatterlists. So, for hashing the headers just use an on-stack buffer instead of a pool of pre-allocated kmalloc'ed scratch buffers. - The library API never fails. Therefore, checking for MD5 hashing errors is no longer necessary. Update tcp_v4_md5_hash_skb(), tcp_v6_md5_hash_skb(), tcp_v4_md5_hash_hdr(), tcp_v6_md5_hash_hdr(), tcp_md5_hash_key(), tcp_sock_af_ops::calc_md5_hash, and tcp_request_sock_ops::calc_md5_hash to return void instead of int. - The library API provides direct access to the MD5 code, eliminating unnecessary overhead such as indirect function calls and scatterlist management. Microbenchmarks of tcp_v4_md5_hash_skb() on x86_64 show a speedup from 7518 to 7041 cycles (6% fewer) with skb->len == 1440, or from 1020 to 678 cycles (33% fewer) with skb->len == 140. Since tcp_sigpool_hash_skb_data() can no longer be used, add a function tcp_md5_hash_skb_data() which is specialized to MD5. Of course, to the extent that this duplicates any code, it's well worth it. To preserve the existing behavior of TCP-MD5 support being disabled when the kernel is booted with "fips=1", make tcp_md5_do_add() check fips_enabled itself. Previously it relied on the error from crypto_alloc_ahash("md5") being bubbled up. I don't know for sure that this is actually needed, but this preserves the existing behavior. Tested with bidirectional TCP-MD5, both IPv4 and IPv6, between a kernel that includes this commit and a kernel that doesn't include this commit. (Side note: please don't use TCP-MD5! It's cryptographically weak. But as long as Linux supports it, it might as well be implemented properly.) Signed-off-by: Eric Biggers <ebiggers@kernel.org> Link: https://patch.msgid.link/20251014215836.115616-1-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-17 17:14:54 -07:00
Florian Westphal	2af8ff1e47	net: Kconfig: discourage drop_monitor enablement Quoting Eric Dumazet: "I do not understand the fascination with net/core/drop_monitor.c [..] misses all the features, flexibility, scalability that 'perf', eBPF tracing, bpftrace, .... have today." Reword DROP_MONITOR kconfig help text to clearly state that its not related to perf-based drop monitoring and that its safe to disable this unless support for the older netlink-based tools is needed. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251016115147.18503-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-17 16:29:26 -07:00

1 2 3 4 5 ...

82211 Commits