linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-21 21:54:52 -04:00

Author	SHA1	Message	Date
Kuniyuki Iwashima	2d842b6c67	tcp: Remove timewait_sock_ops.twsk_destructor(). Since DCCP has been removed, sk->sk_prot->twsk_prot->twsk_destructor is always tcp_twsk_destructor(). Let's call tcp_twsk_destructor() directly in inet_twsk_free() and remove ->twsk_destructor(). While at it, tcp_twsk_destructor() is un-exported. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250822190803.540788-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-25 17:53:35 -07:00
Kuniyuki Iwashima	9db0163e3c	tcp: Remove sk_protocol test for tcp_twsk_unique(). Commit `383eed2de5` ("tcp: get rid of twsk_unique()") added sk->sk_protocol test in __inet_check_established() and __inet6_check_established() to remove twsk_unique() and call tcp_twsk_unique() directly. DCCP has gone, and the condition is always true. Let's remove the sk_protocol test. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250822190803.540788-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-25 17:53:35 -07:00
Yue Haibing	60c481d4ca	ipv6: mcast: Add ip6_mc_find_idev() helper Extract the logic shared by __ipv6_sock_mc_join() and ip6_mc_find_dev() into a new helper, ip6_mc_find_idev(), to reduce redundancy and improve readability. No functional changes intended. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Link: https://patch.msgid.link/20250822064051.2991480-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-25 16:36:59 -07:00
Eric Dumazet	9bd999eb35	tcp: annotate data-races around icsk->icsk_probes_out icsk->icsk_probes_out is read locklessly from inet_sk_diag_fill(), get_tcp4_sock() and get_tcp6_sock(). Add corresponding READ_ONCE()/WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20250822091727.835869-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-25 16:20:59 -07:00
Eric Dumazet	e6f178be3c	tcp: annotate data-races around icsk->icsk_retransmits icsk->icsk_retransmits is read locklessly from inet_sk_diag_fill(), tcp_get_timestamping_opt_stats, get_tcp4_sock() and get_tcp6_sock(). Add corresponding READ_ONCE()/WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20250822091727.835869-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-25 16:20:59 -07:00
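The two annotation commits above follow a common pattern: a field that is written under the socket lock but read without it is accessed through READ_ONCE()/WRITE_ONCE(). A minimal user-space sketch of that pattern follows. It assumes simplified volatile-cast macros (the kernel's real definitions are more involved) and a hypothetical `struct demo_sock`; only the field name is taken from the commit message, and `__typeof__` is a GCC/Clang extension.

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel macros (assumption): a volatile
 * access keeps the compiler from tearing, fusing, or repeating it. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical structure; only the field name mirrors the commit. */
struct demo_sock {
    uint8_t icsk_retransmits;
};

/* Writer side: in the kernel this would run with the socket lock held. */
static void demo_retransmit(struct demo_sock *sk)
{
    WRITE_ONCE(sk->icsk_retransmits, sk->icsk_retransmits + 1);
}

/* Lockless reader, for example a diagnostics dump. */
static unsigned int demo_diag_retransmits(const struct demo_sock *sk)
{
    return READ_ONCE(sk->icsk_retransmits);
}
```

The annotations do not add ordering or atomic read-modify-write semantics; they only mark the lockless access as intentional and keep each load or store a single access.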
Oscar Maes	1b8c5fa0cb	net: ipv4: allow directed broadcast routes to use dst hint Currently, ip_extract_route_hint uses RTN_BROADCAST to decide whether to use the route dst hint mechanism. This check is too strict, as it prevents directed broadcast routes from using the hint, resulting in poor performance during bursts of directed broadcast traffic. Fix this in ip_extract_route_hint and modify ip_route_use_hint to preserve the intended behaviour. Signed-off-by: Oscar Maes <oscmaes92@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250819174642.5148-2-oscmaes92@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-25 16:07:16 -07:00
Nalivayko Sergey	674b56aa57	net/9p: fix double req put in p9_fd_cancelled Syzkaller reports a KASAN issue as below: general protection fault, probably for non-canonical address 0xfbd59c0000000021: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: maybe wild-memory-access in range [0xdead000000000108-0xdead00000000010f] CPU: 0 PID: 5083 Comm: syz-executor.2 Not tainted 6.1.134-syzkaller-00037-g855bd1d7d838 #0 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 RIP: 0010:__list_del include/linux/list.h:114 [inline] RIP: 0010:__list_del_entry include/linux/list.h:137 [inline] RIP: 0010:list_del include/linux/list.h:148 [inline] RIP: 0010:p9_fd_cancelled+0xe9/0x200 net/9p/trans_fd.c:734 Call Trace: <TASK> p9_client_flush+0x351/0x440 net/9p/client.c:614 p9_client_rpc+0xb6b/0xc70 net/9p/client.c:734 p9_client_version net/9p/client.c:920 [inline] p9_client_create+0xb51/0x1240 net/9p/client.c:1027 v9fs_session_init+0x1f0/0x18f0 fs/9p/v9fs.c:408 v9fs_mount+0xba/0xcb0 fs/9p/vfs_super.c:126 legacy_get_tree+0x108/0x220 fs/fs_context.c:632 vfs_get_tree+0x8e/0x300 fs/super.c:1573 do_new_mount fs/namespace.c:3056 [inline] path_mount+0x6a6/0x1e90 fs/namespace.c:3386 do_mount fs/namespace.c:3399 [inline] __do_sys_mount fs/namespace.c:3607 [inline] __se_sys_mount fs/namespace.c:3584 [inline] __x64_sys_mount+0x283/0x300 fs/namespace.c:3584 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 This happens because of a race condition between: - The 9p client sending an invalid flush request and later cleaning it up; - The 9p client in p9_read_work() canceled all pending requests. Thread 1 Thread 2 ... p9_client_create() ... p9_fd_create() ... p9_conn_create() ... // start Thread 2 INIT_WORK(&m->rq, p9_read_work); p9_read_work() ... p9_client_rpc() ... ... p9_conn_cancel() ... spin_lock(&m->req_lock); ... p9_fd_cancelled() ... ... 
spin_unlock(&m->req_lock); // status rewrite p9_client_cb(m->client, req, REQ_STATUS_ERROR) // first remove list_del(&req->req_list); ... spin_lock(&m->req_lock) ... // second remove list_del(&req->req_list); spin_unlock(&m->req_lock) ... Commit `74d6a5d566` ("9p/trans_fd: Fix concurrency del of req_list in p9_fd_cancelled/p9_read_work") fixes a concurrency issue in the 9p filesystem client where the req_list could be deleted simultaneously by both p9_read_work and p9_fd_cancelled functions, but only for the case where req->status equals REQ_STATUS_RCVD. Update the check for req->status in p9_fd_cancelled to skip processing not just received requests, but anything that is not SENT, as whatever changed the state from SENT also removed the request from its list. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: `afd8d65411` ("9P: Add cancelled() to the transport functions.") Cc: stable@vger.kernel.org Signed-off-by: Nalivayko Sergey <Sergey.Nalivayko@kaspersky.com> Message-ID: <20250715154815.3501030-1-Sergey.Nalivayko@kaspersky.com> [updated the check from status == RECV \|\| status == ERROR to status != SENT] Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>	2025-08-23 15:34:47 +09:00
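The rule described in the 9p fix above ("anything that is not SENT has already been unlinked") can be illustrated with a small model. This is a sketch under assumptions: the types and function names are hypothetical, the list is reduced to a flag and a counter, and locking is omitted. It is not the code in net/9p/trans_fd.c.

```c
#include <stdbool.h>

/* Hypothetical request states, loosely modelled on the names in the commit. */
enum demo_status { DEMO_SENT, DEMO_RCVD, DEMO_ERROR, DEMO_FLSHD };

struct demo_req {
    enum demo_status status;
    bool on_list;    /* stands in for membership of req_list */
    int  list_dels;  /* counts how often the request was unlinked */
};

static void demo_list_del(struct demo_req *req)
{
    req->on_list = false;
    req->list_dels++;
}

/* Error path: changes the state away from SENT and unlinks the request. */
static void demo_conn_cancel(struct demo_req *req)
{
    req->status = DEMO_ERROR;
    demo_list_del(req);
}

/* Cancel path: unlink only while the request is still in the SENT state.
 * Returns true if this call removed the request from the list. */
static bool demo_fd_cancelled(struct demo_req *req)
{
    if (req->status != DEMO_SENT)
        return false;  /* whoever changed the state already unlinked it */
    req->status = DEMO_FLSHD;
    demo_list_del(req);
    return true;
}
```

With the `!= DEMO_SENT` check, a request that went through the error path first is unlinked exactly once; a check for `DEMO_RCVD` alone would let the cancel path unlink it a second time.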
Dominique Martinet	c04db81cd0	net/9p: Fix buffer overflow in USB transport layer A buffer overflow vulnerability exists in the USB 9pfs transport layer where inconsistent size validation between packet header parsing and actual data copying allows a malicious USB host to overflow heap buffers. The issue occurs because: - usb9pfs_rx_header() validates only the declared size in packet header - usb9pfs_rx_complete() uses req->actual (actual received bytes) for memcpy This allows an attacker to craft packets with small declared size (bypassing validation) but large actual payload (triggering overflow in memcpy). Add validation in usb9pfs_rx_complete() to ensure req->actual does not exceed the buffer capacity before copying data. Reported-by: Yuhao Jiang <danisjiang@gmail.com> Closes: https://lkml.kernel.org/r/20250616132539.63434-1-danisjiang@gmail.com Fixes: `a3be076dc1` ("net/9p/usbg: Add new usb gadget function transport") Cc: stable@vger.kernel.org Message-ID: <20250622-9p-usb_overflow-v3-1-ab172691b946@codewreck.org> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>	2025-08-23 15:34:46 +09:00
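The USB transport fix above boils down to checking the number of bytes actually received, not the size declared in the packet header, against the destination capacity before copying. A minimal sketch of that check follows; `struct demo_rx_buf` and `demo_rx_complete()` are hypothetical names, and the error code is an assumption.

```c
#include <stddef.h>
#include <string.h>
#include <errno.h>

/* Hypothetical receive buffer; `capacity` is the allocated size of `data`. */
struct demo_rx_buf {
    unsigned char *data;
    size_t capacity;
};

/* Copy `actual` received bytes into `dst`. The bound is checked against the
 * bytes that really arrived, so a small declared size in the header cannot
 * hide a large payload. Returns 0 on success, -EINVAL if it does not fit. */
static int demo_rx_complete(struct demo_rx_buf *dst,
                            const unsigned char *src, size_t actual)
{
    if (actual > dst->capacity)
        return -EINVAL;
    memcpy(dst->data, src, actual);
    return 0;
}
```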
Kuniyuki Iwashima	ec79003c5f	atm: atmtcp: Prevent arbitrary write in atmtcp_recv_control(). syzbot reported the splat below. [0] When atmtcp_v_open() or atmtcp_v_close() is called via connect() or close(), atmtcp_send_control() is called to send an in-kernel special message. The message has ATMTCP_HDR_MAGIC in atmtcp_control.hdr.length. Also, a pointer of struct atm_vcc is set to atmtcp_control.vcc. The notable thing is struct atmtcp_control is uAPI but has a space for an in-kernel pointer. struct atmtcp_control { struct atmtcp_hdr hdr; /* must be first */ ... atm_kptr_t vcc; /* both directions */ ... } __ATM_API_ALIGN; typedef struct { unsigned char _[8]; } __ATM_API_ALIGN atm_kptr_t; The special message is processed in atmtcp_recv_control() called from atmtcp_c_send(). atmtcp_c_send() is vcc->dev->ops->send() and called from 2 paths: 1. .ndo_start_xmit() (vcc->send() == atm_send_aal0()) 2. vcc_sendmsg() The problem is sendmsg() does not validate the message length and userspace can abuse atmtcp_recv_control() to overwrite any kptr by atmtcp_control. Let's add a new ->pre_send() hook to validate messages from sendmsg(). 
[0]: Oops: general protection fault, probably for non-canonical address 0xdffffc00200000ab: 0000 [#1] SMP KASAN PTI KASAN: probably user-memory-access in range [0x0000000100000558-0x000000010000055f] CPU: 0 UID: 0 PID: 5865 Comm: syz-executor331 Not tainted 6.17.0-rc1-syzkaller-00215-gbab3ce404553 #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/12/2025 RIP: 0010:atmtcp_recv_control drivers/atm/atmtcp.c:93 [inline] RIP: 0010:atmtcp_c_send+0x1da/0x950 drivers/atm/atmtcp.c:297 Code: 4d 8d 75 1a 4c 89 f0 48 c1 e8 03 42 0f b6 04 20 84 c0 0f 85 15 06 00 00 41 0f b7 1e 4d 8d b7 60 05 00 00 4c 89 f0 48 c1 e8 03 <42> 0f b6 04 20 84 c0 0f 85 13 06 00 00 66 41 89 1e 4d 8d 75 1c 4c RSP: 0018:ffffc90003f5f810 EFLAGS: 00010203 RAX: 00000000200000ab RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff88802a510000 RSI: 00000000ffffffff RDI: ffff888030a6068c RBP: ffff88802699fb40 R08: ffff888030a606eb R09: 1ffff1100614c0dd R10: dffffc0000000000 R11: ffffffff8718fc40 R12: dffffc0000000000 R13: ffff888030a60680 R14: 000000010000055f R15: 00000000ffffffff FS: 00007f8d7e9236c0(0000) GS:ffff888125c1c000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000045ad50 CR3: 0000000075bde000 CR4: 00000000003526f0 Call Trace: <TASK> vcc_sendmsg+0xa10/0xc60 net/atm/common.c:645 sock_sendmsg_nosec net/socket.c:714 [inline] __sock_sendmsg+0x219/0x270 net/socket.c:729 ____sys_sendmsg+0x505/0x830 net/socket.c:2614 ___sys_sendmsg+0x21f/0x2a0 net/socket.c:2668 __sys_sendmsg net/socket.c:2700 [inline] __do_sys_sendmsg net/socket.c:2705 [inline] __se_sys_sendmsg net/socket.c:2703 [inline] __x64_sys_sendmsg+0x19b/0x260 net/socket.c:2703 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f8d7e96a4a9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 51 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 
89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f8d7e923198 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f8d7e9f4308 RCX: 00007f8d7e96a4a9 RDX: 0000000000000000 RSI: 0000200000000240 RDI: 0000000000000005 RBP: 00007f8d7e9f4300 R08: 65732f636f72702f R09: 65732f636f72702f R10: 65732f636f72702f R11: 0000000000000246 R12: 00007f8d7e9c10ac R13: 00007f8d7e9231a0 R14: 0000200000000200 R15: 0000200000000250 </TASK> Modules linked in: Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Reported-by: syzbot+1741b56d54536f4ec349@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68a6767c.050a0220.3d78fd.0011.GAE@google.com/ Tested-by: syzbot+1741b56d54536f4ec349@syzkaller.appspotmail.com Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250821021901.2814721-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-22 17:23:15 -07:00
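The atmtcp commit above adds an optional ->pre_send() hook so that messages coming from sendmsg() can be validated before they reach the send path. The sketch below shows one plausible shape for such a hook. Everything in it is an assumption for illustration: the `demo_*` names, the marker value, the header layout, and the specific checks (minimum length, rejection of the in-kernel control marker) are not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <errno.h>

/* Hypothetical marker reserved for in-kernel control messages. */
#define DEMO_HDR_MAGIC 0xFFFFFFFFu

/* Hypothetical message header. */
struct demo_hdr {
    uint16_t vpi;
    uint16_t vci;
    uint32_t length;  /* DEMO_HDR_MAGIC marks a control message */
};

struct demo_ops {
    int (*pre_send)(const void *msg, size_t len);  /* optional validation hook */
    int (*send)(const void *msg, size_t len);
};

/* Reject messages too short to hold a header, and messages from userspace
 * that claim to be in-kernel control messages. */
static int demo_pre_send(const void *msg, size_t len)
{
    struct demo_hdr hdr;

    if (len < sizeof(hdr))
        return -EINVAL;
    memcpy(&hdr, msg, sizeof(hdr));
    if (hdr.length == DEMO_HDR_MAGIC)
        return -EINVAL;
    return 0;
}

/* Stub transmit function for the sketch. */
static int demo_send(const void *msg, size_t len)
{
    (void)msg;
    (void)len;
    return 0;
}

/* sendmsg()-like entry point: run the hook, if any, before sending. */
static int demo_sendmsg(const struct demo_ops *ops, const void *msg, size_t len)
{
    if (ops->pre_send) {
        int err = ops->pre_send(msg, len);
        if (err)
            return err;
    }
    return ops->send(msg, len);
}
```

Keeping the hook optional means devices that do not need validation leave the pointer NULL and are unaffected.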
Ujwal Kundur	bcb28bee98	rds: Fix endianness annotations for RDS extension headers Per the RDS 3.1 spec [1], RDS extension headers EXTHDR_NPATHS and EXTHDR_GEN_NUM are be16 and be32 values respectively, exchanged during normal operations over-the-wire (RDS Ping/Pong). This contrasts their declarations as host endian unsigned ints. Fix the annotations across occurrences. Flagged by Sparse. [1] https://oss.oracle.com/projects/rds/dist/documentation/rds-3.1-spec.html Signed-off-by: Ujwal Kundur <ujwal.kundur@gmail.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20250820175550.498-5-ujwal.kundur@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-22 16:44:39 -07:00
Ujwal Kundur	77907a0687	rds: Fix endianness annotation for RDS_MPATH_HASH jhash_1word accepts host endian inputs while rs_bound_port is a be16 value (sockaddr_in6.sin6_port). Use ntohs() for consistency. Flagged by Sparse. Signed-off-by: Ujwal Kundur <ujwal.kundur@gmail.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20250820175550.498-4-ujwal.kundur@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-22 16:44:39 -07:00
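The RDS_MPATH_HASH fix above converts a big-endian port to host order before passing it to a hash that expects host-endian input. A user-space sketch of the same idea follows. `demo_hash_1word()` is a toy mixer standing in for jhash_1word() and is not the kernel hash; `demo_port_hash()` is a hypothetical wrapper.

```c
#include <stdint.h>
#include <arpa/inet.h>  /* htons(), ntohs() */

/* Toy host-endian mixer standing in for jhash_1word(); NOT the kernel hash. */
static uint32_t demo_hash_1word(uint32_t a, uint32_t seed)
{
    uint32_t h = a ^ seed;

    h ^= h >> 16;
    h *= 0x45d9f3bu;
    h ^= h >> 16;
    return h;
}

/* `port_be` is in network byte order, as stored in a sockaddr. Converting
 * with ntohs() makes the hash input the same on every architecture. */
static uint32_t demo_port_hash(uint16_t port_be, uint32_t seed)
{
    return demo_hash_1word(ntohs(port_be), seed);
}
```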
Ujwal Kundur	92b925297a	rds: Fix endianness annotation of jhash wrappers __ipv6_addr_jhash (wrapper around jhash2()) and __inet_ehashfn (wrapper around jhash_3words()) work with u32 (host endian) values but accept big endian inputs. Declare the local variables as big endian to avoid unnecessary casts. Flagged by Sparse. Signed-off-by: Ujwal Kundur <ujwal.kundur@gmail.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20250820175550.498-3-ujwal.kundur@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-22 16:44:34 -07:00
Ujwal Kundur	9308987803	rds: Replace POLLERR with EPOLLERR Both constants are 1<<3, but EPOLLERR uses the correct annotations. Flagged by Sparse. Signed-off-by: Ujwal Kundur <ujwal.kundur@gmail.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20250820175550.498-2-ujwal.kundur@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-22 16:42:25 -07:00
Jakub Kicinski	1559c9c231	Merge tag 'for-net-2025-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: * tag 'for-net-2025-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: hci_sync: fix set_local_name race condition Bluetooth: hci_event: Disconnect device when BIG sync is lost Bluetooth: hci_event: Detect if HCI_EV_NUM_COMP_PKTS is unbalanced Bluetooth: hci_event: Mark connection as closed during suspend disconnect Bluetooth: hci_event: Treat UNKNOWN_CONN_ID on disconnect as success Bluetooth: hci_conn: Make unacked packet handling more robust ==================== Link: https://patch.msgid.link/20250822180230.345979-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-22 16:10:24 -07:00
Eric Dumazet	9217146fee	tcp: lockless TCP_MAXSEG option setsockopt(TCP_MAXSEG) writes over a field that does not need socket lock protection anymore. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20250821141901.18839-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-22 15:58:59 -07:00
Eric Dumazet	d5ffba0f25	tcp: annotate data-races around tp->rx_opt.user_mss This field is already read locklessly for listeners, next patch will make setsockopt(TCP_MAXSEG) lockless. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20250821141901.18839-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-22 15:58:58 -07:00
Mina Almasry	abadf0ff63	page_pool: fix incorrect mp_ops error handling Minor fix to the memory provider error handling, we should be jumping to free_ptr_ring in this error case rather than returning directly. Found by code-inspection. Cc: skhawaja@google.com Fixes: `b400f4b874` ("page_pool: Set `dma_sync` to false for devmem memory provider") Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Link: https://patch.msgid.link/20250821030349.705244-1-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-22 15:52:02 -07:00
Pavel Shpakovskiy	6bbd0d3f0c	Bluetooth: hci_sync: fix set_local_name race condition Function set_name_sync() uses hdev->dev_name field to send HCI_OP_WRITE_LOCAL_NAME command, but copying from data to hdev->dev_name is called after mgmt cmd was queued, so it is possible that function set_name_sync() will read old name value. This change adds name as a parameter for function hci_update_name_sync() to avoid race condition. Fixes: `6f6ff38a1e` ("Bluetooth: hci_sync: Convert MGMT_OP_SET_LOCAL_NAME") Signed-off-by: Pavel Shpakovskiy <pashpakovskii@salutedevices.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2025-08-22 13:57:31 -04:00
Yang Li	55b9551fcd	Bluetooth: hci_event: Disconnect device when BIG sync is lost When a BIG sync is lost, the device should be set to "disconnected". This ensures symmetry with the ISO path setup, where the device is marked as "connected" once the path is established. Without this change, the device state remains inconsistent and may lead to a memory leak. Fixes: `b2a5f2e1c1` ("Bluetooth: hci_event: Add support for handling LE BIG Sync Lost event") Signed-off-by: Yang Li <yang.li@amlogic.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2025-08-22 13:57:14 -04:00
Luiz Augusto von Dentz	15bf2c6391	Bluetooth: hci_event: Detect if HCI_EV_NUM_COMP_PKTS is unbalanced This attempts to detect whether HCI_EV_NUM_COMP_PKTS contains an unbalanced number of packets (more than are currently considered outstanding); otherwise hcon->sent could underflow and wrap around, breaking the tracking of outstanding packets pending acknowledgment. Fixes: `f428091858` ("Bluetooth: Simplify num_comp_pkts_evt function") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2025-08-22 13:56:57 -04:00
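The underflow described in the commit above comes from subtracting a reported completion count from an unsigned counter of outstanding packets. A minimal sketch of a guard follows. The names are hypothetical, and clamping the count (rather than, say, dropping the event) is an assumption about how one might handle the unbalanced case.

```c
#include <stdio.h>

/* Hypothetical connection; `sent` counts packets awaiting acknowledgment. */
struct demo_conn {
    unsigned int sent;
};

/* Credit `count` completed packets. If the controller reports more than are
 * outstanding, warn and clamp so that `sent` cannot wrap around.
 * Returns the number of packets actually credited. */
static unsigned int demo_num_comp_pkts(struct demo_conn *conn, unsigned int count)
{
    if (count > conn->sent) {
        fprintf(stderr, "unbalanced completion: %u reported, %u outstanding\n",
                count, conn->sent);
        count = conn->sent;
    }
    conn->sent -= count;
    return count;
}
```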
Ludovico de Nittis	b7fafbc499	Bluetooth: hci_event: Mark connection as closed during suspend disconnect When suspending, the disconnect command for an active Bluetooth connection could be issued, but the corresponding `HCI_EV_DISCONN_COMPLETE` event might not be received before the system completes the suspend process. This can lead to an inconsistent state. On resume, the controller may auto-accept reconnections from the same device (due to suspend event filters), but these new connections are rejected by the kernel, which still has connection objects from before suspend, resulting in errors like: ``` kernel: Bluetooth: hci0: ACL packet for unknown connection handle 1 kernel: Bluetooth: hci0: Ignoring HCI_Connection_Complete for existing connection ``` This is a btmon snippet that shows the issue: ``` < HCI Command: Disconnect (0x01\|0x0006) plen 3 Handle: 1 Address: 78:20:A5:4A:DF:28 (Nintendo Co.,Ltd) Reason: Remote User Terminated Connection (0x13) > HCI Event: Command Status (0x0f) plen 4 Disconnect (0x01\|0x0006) ncmd 2 Status: Success (0x00) [...] // Host suspends with the event filter set for the device // On resume, the device tries to reconnect with a new handle > HCI Event: Connect Complete (0x03) plen 11 Status: Success (0x00) Handle: 2 Address: 78:20:A5:4A:DF:28 (Nintendo Co.,Ltd) // Kernel ignores this event because there is an existing connection with // handle 1 ``` By explicitly setting the connection state to BT_CLOSED we can ensure a consistent state, even if we don't receive the disconnect complete event in time. Link: https://github.com/bluez/bluez/issues/1226 Fixes: `182ee45da0` ("Bluetooth: hci_sync: Rework hci_suspend_notifier") Signed-off-by: Ludovico de Nittis <ludovico.denittis@collabora.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2025-08-22 13:55:29 -04:00
Ludovico de Nittis	2f050a5392	Bluetooth: hci_event: Treat UNKNOWN_CONN_ID on disconnect as success When the host sends an HCI_OP_DISCONNECT command, the controller may respond with the status HCI_ERROR_UNKNOWN_CONN_ID (0x02). E.g. this can happen on resume from suspend, if the link was terminated by the remote device before the event mask was correctly set. This is a btmon snippet that shows the issue: ``` > ACL Data RX: Handle 3 flags 0x02 dlen 12 L2CAP: Disconnection Request (0x06) ident 5 len 4 Destination CID: 65 Source CID: 72 < ACL Data TX: Handle 3 flags 0x00 dlen 12 L2CAP: Disconnection Response (0x07) ident 5 len 4 Destination CID: 65 Source CID: 72 > ACL Data RX: Handle 3 flags 0x02 dlen 12 L2CAP: Disconnection Request (0x06) ident 6 len 4 Destination CID: 64 Source CID: 71 < ACL Data TX: Handle 3 flags 0x00 dlen 12 L2CAP: Disconnection Response (0x07) ident 6 len 4 Destination CID: 64 Source CID: 71 < HCI Command: Set Event Mask (0x03\|0x0001) plen 8 Mask: 0x3dbff807fffbffff Inquiry Complete Inquiry Result Connection Complete Connection Request Disconnection Complete Authentication Complete [...] < HCI Command: Disconnect (0x01\|0x0006) plen 3 Handle: 3 Address: 78:20:A5:4A:DF:28 (Nintendo Co.,Ltd) Reason: Remote User Terminated Connection (0x13) > HCI Event: Command Status (0x0f) plen 4 Disconnect (0x01\|0x0006) ncmd 1 Status: Unknown Connection Identifier (0x02) ``` Currently, the hci_cs_disconnect function treats any non-zero status as a command failure. This can be misleading because the connection is indeed being terminated and the controller is confirming that it has no knowledge of that connection handle, meaning that the initial request to disconnect a device should be treated as done. With this change we allow the function to proceed, following the success path, which correctly calls `mgmt_device_disconnected` and ensures a consistent state. 
Link: https://github.com/bluez/bluez/issues/1226 Fixes: `182ee45da0` ("Bluetooth: hci_sync: Rework hci_suspend_notifier") Signed-off-by: Ludovico de Nittis <ludovico.denittis@collabora.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2025-08-22 13:54:27 -04:00
Luiz Augusto von Dentz	5d7eba62e5	Bluetooth: hci_conn: Make unacked packet handling more robust This attempts to make unacked packet handling more robust by detecting when no connections are left and then restoring all buffers of the respective pool. Fixes: `5638d9ea9c` ("Bluetooth: hci_conn: Fix not restoring ISO buffer count on disconnect") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2025-08-22 13:40:18 -04:00
Will Deacon	7fb1291257	vsock/virtio: Fix message iterator handling on transmit path Commit `6693731487` ("vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers") converted the virtio vsock transmit path to utilise nonlinear SKBs when handling large buffers. As part of this change, virtio_transport_fill_skb() was updated to call skb_copy_datagram_from_iter() instead of memcpy_from_msg() as the latter expects a single destination buffer and cannot handle nonlinear SKBs correctly. Unfortunately, during this conversion, I overlooked the error case when the copying function returns -EFAULT due to a fault on the input buffer in userspace. In this case, memcpy_from_msg() reverts the iterator to its initial state thanks to copy_from_iter_full() whereas skb_copy_datagram_from_iter() leaves the iterator partially advanced. This results in a WARN_ONCE() from the vsock code, which expects the iterator to stay in sync with the number of bytes transmitted so that virtio_transport_send_pkt_info() can return -EFAULT when it is called again: ------------[ cut here ]------------ 'send_pkt()' returns 0, but 65536 expected WARNING: CPU: 0 PID: 5503 at net/vmw_vsock/virtio_transport_common.c:428 virtio_transport_send_pkt_info+0xd11/0xf00 net/vmw_vsock/virtio_transport_common.c:426 Modules linked in: CPU: 0 UID: 0 PID: 5503 Comm: syz.0.17 Not tainted 6.16.0-syzkaller-12063-g37816488247d #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Call virtio_transport_fill_skb_full() to restore the previous iterator behaviour. Cc: Jason Wang <jasowang@redhat.com> Cc: Stefano Garzarella <sgarzare@redhat.com> Fixes: `6693731487` ("vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers") Reported-by: syzbot+b4d960daf7a3c7c2b7b1@syzkaller.appspotmail.com Signed-off-by: Will Deacon <will@kernel.org> Acked-by: Michael S. 
Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Link: https://patch.msgid.link/20250818180355.29275-3-will@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-21 17:49:19 -07:00
Will Deacon	b08a784a5d	net: Introduce skb_copy_datagram_from_iter_full() In a similar manner to copy_from_iter()/copy_from_iter_full(), introduce skb_copy_datagram_from_iter_full() which reverts the iterator to its initial state when returning an error. A subsequent fix for a vsock regression will make use of this new function. Cc: Christian Brauner <brauner@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Will Deacon <will@kernel.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Link: https://patch.msgid.link/20250818180355.29275-2-will@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-21 17:47:57 -07:00
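The difference between a plain copy helper and its `_full` variant, as described in the two vsock commits above, is what happens to the iterator on error: the plain helper may leave it partially advanced, the `_full` variant puts it back where it started. The toy model below illustrates that contract. The iterator type, the simulated fault, and the save-and-restore implementation are all assumptions for illustration, not the kernel's iov_iter code.

```c
#include <stddef.h>
#include <string.h>
#include <errno.h>

/* Toy iterator: reads from `buf`; a fault is simulated once `fault_at`
 * bytes have been consumed. */
struct demo_iter {
    const unsigned char *buf;
    size_t pos;
    size_t len;
    size_t fault_at;
};

/* Copy n bytes, advancing the iterator. On a short copy, returns -EFAULT
 * and leaves the iterator partially advanced. */
static int demo_copy_from_iter(unsigned char *dst, size_t n, struct demo_iter *it)
{
    size_t avail = it->len - it->pos;
    size_t limit = it->fault_at > it->pos ? it->fault_at - it->pos : 0;
    size_t ok = n;

    if (ok > avail)
        ok = avail;
    if (ok > limit)
        ok = limit;
    memcpy(dst, it->buf + it->pos, ok);
    it->pos += ok;
    return ok == n ? 0 : -EFAULT;
}

/* Same copy, but on error the iterator is restored to its initial state. */
static int demo_copy_from_iter_full(unsigned char *dst, size_t n, struct demo_iter *it)
{
    struct demo_iter saved = *it;
    int err = demo_copy_from_iter(dst, n, it);

    if (err)
        *it = saved;
    return err;
}
```

A caller that retries or reports progress based on the iterator position can rely on the `_full` variant leaving no partial progress behind after a failure.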
Jakub Kicinski	c3439666d1	Merge tag 'nf-next-25-08-20' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter: updates for net-next First patch gets rid of refcounting for dying list dumping, use a cookie value instead of keeping the object around. Remaining patches extend nftables pipapo (concatenated ranges) set type. Make the AVX2 optimized version available from the control plane as well, then use it during insert. This gives a nice speedup for large sets. All from myself. On PREEMPT_RT, we can't rely on local_bh_disable to protect the access to the percpu scratch maps. Use nested-BH locking for this, From Sebastian Siewior. * tag 'nf-next-25-08-20' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nft_set_pipapo: Use nested-BH locking for nft_pipapo_scratch netfilter: nft_set_pipapo: Store real pointer, adjust later. netfilter: nft_set_pipapo: use avx2 algorithm for insertions too netfilter: nft_set_pipapo_avx2: split lookup function in two parts netfilter: nft_set_pipapo_avx2: Drop the comment regarding protection netfilter: ctnetlink: remove refcounting in dying list dumping ==================== Link: https://patch.msgid.link/20250820144738.24250-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-21 17:23:26 -07:00
Jakub Kicinski	4dba4a936f	Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-08-21 We've added 9 non-merge commits during the last 3 day(s) which contain a total of 13 files changed, 1027 insertions(+), 27 deletions(-). The main changes are: 1) Added bpf dynptr support for accessing the metadata of a skb, from Jakub Sitnicki. The patches are merged from a stable branch bpf-next/skb-meta-dynptr. The same patches have also been merged into bpf-next/master. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: selftests/bpf: Cover metadata access from a modified skb clone selftests/bpf: Cover read/write to skb metadata at an offset selftests/bpf: Cover write access to skb metadata via dynptr selftests/bpf: Cover read access to skb metadata via dynptr selftests/bpf: Parametrize test_xdp_context_tuntap selftests/bpf: Pass just bpf_map to xdp_context_test helper selftests/bpf: Cover verifier checks for skb_meta dynptr type bpf: Enable read/write access to skb metadata through a dynptr bpf: Add dynptr type for skb metadata ==================== Link: https://patch.msgid.link/20250821191827.2099022-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-21 15:37:16 -07:00
Jakub Kicinski	a9af709fda	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.17-rc3). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-21 11:33:15 -07:00
Florian Westphal	91a79b7922	netfilter: nf_reject: don't leak dst refcount for loopback packets Recent patches to add a WARN() when replacing skb dst entry found an old bug: WARNING: include/linux/skbuff.h:1165 skb_dst_check_unset include/linux/skbuff.h:1164 [inline] WARNING: include/linux/skbuff.h:1165 skb_dst_set include/linux/skbuff.h:1210 [inline] WARNING: include/linux/skbuff.h:1165 nf_reject_fill_skb_dst+0x2a4/0x330 net/ipv4/netfilter/nf_reject_ipv4.c:234 [..] Call Trace: nf_send_unreach+0x17b/0x6e0 net/ipv4/netfilter/nf_reject_ipv4.c:325 nft_reject_inet_eval+0x4bc/0x690 net/netfilter/nft_reject_inet.c:27 expr_call_ops_eval net/netfilter/nf_tables_core.c:237 [inline] .. This is because the blamed commit forgot about loopback packets. Such packets already have a dst_entry attached, even at PRE_ROUTING stage. Instead of checking the hook, just check whether the skb already has a route attached to it. Fixes: `f53b9b0bdc` ("netfilter: introduce support for reject at prerouting stage") Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20250820123707.10671-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-21 10:02:00 -07:00
Jakub Kicinski	62708b9452	tls: fix handling of zero-length records on the rx_list Each recvmsg() call must process either - only contiguous DATA records (any number of them) - one non-DATA record If the next record has different type than what has already been processed we break out of the main processing loop. If the record has already been decrypted (which may be the case for TLS 1.3 where we don't know type until decryption) we queue the pending record to the rx_list. Next recvmsg() will pick it up from there. Queuing the skb to rx_list after zero-copy decrypt is not possible, since in that case we decrypted directly to the user space buffer, and we don't have an skb to queue (darg.skb points to the ciphertext skb for access to metadata like length). Only data records are allowed zero-copy, and we break the processing loop after each non-data record. So we should never zero-copy and then find out that the record type has changed. The corner case we missed is when the initial record comes from rx_list, and it's zero length. Reported-by: Muhammad Alifa Ramdhan <ramdhan@starlabs.sg> Reported-by: Billy Jheng Bing-Jhong <billy@starlabs.sg> Fixes: `84c61fe1a7` ("tls: rx: do not use the standard strparser") Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250820021952.143068-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-21 07:52:30 -07:00
Thorsten Blum	833e43171b	net: pktgen: Use min()/min_t() to improve pktgen_finalize_skb() Use min() and min_t() to improve pktgen_finalize_skb() and avoid calculating 'datalen / frags' twice. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://patch.msgid.link/20250815153334.295431-3-thorsten.blum@linux.dev Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-21 10:12:11 +02:00
Yury Norov (NVIDIA)	62a2b35025	net: openvswitch: Use for_each_cpu() where appropriate Due to legacy reasons, openvswitch code open-codes for_each_cpu() to make sure that CPU0 is always considered. Since commit `c4b2bf6b4a` ("openvswitch: Optimize operations for OvS flow_stats."), the corresponding flow->cpu_used_mask is initialized such that CPU0 is explicitly set. So, switch the code to using plain for_each_cpu(). Suggested-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Acked-by: Ilya Maximets <i.maximets@ovn.org> Link: https://patch.msgid.link/20250818172806.189325-1-yury.norov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-20 19:47:22 -07:00
Eric Dumazet	a6d4f25888	net: set net.core.rmem_max and net.core.wmem_max to 4 MB SO_RCVBUF and SO_SNDBUF have limited range today, unless distros or system admins change rmem_max and wmem_max. Even iproute2 uses 1 MB SO_RCVBUF which is capped by the kernel. Decouple [rw]mem_max and [rw]mem_default and increase [rw]mem_max to 4 MB. Before: $ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max net.core.rmem_default = 212992 net.core.rmem_max = 212992 net.core.wmem_default = 212992 net.core.wmem_max = 212992 After: $ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max net.core.rmem_default = 212992 net.core.rmem_max = 4194304 net.core.wmem_default = 212992 net.core.wmem_max = 4194304 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20250819174030.1986278-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-20 19:35:00 -07:00
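The capping described in the commit above can be observed from user space. The following is a minimal sketch, assuming a Linux host; `effective_rcvbuf()` is a hypothetical helper name, not part of the kernel or of iproute2. It requests a receive buffer size and reads back what the kernel actually granted.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/*
 * Hypothetical helper: request a receive buffer of `requested` bytes on a
 * fresh UDP socket and return the value the kernel reports back, or -1 on
 * error. On Linux the request is capped by net.core.rmem_max, and the
 * stored value is doubled to leave room for bookkeeping overhead.
 */
int effective_rcvbuf(int requested)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int actual = -1;
	socklen_t len = sizeof(actual);

	if (fd < 0)
		return -1;

	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) < 0 ||
	    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) < 0)
		actual = -1;

	close(fd);
	return actual;
}
```

Calling `effective_rcvbuf(1 << 20)` on a kernel with the old 212992-byte limit would be expected to return a value well below the doubled request, while a kernel with the 4 MB limit can honour the full 1 MB. The exact figure depends on the running kernel's sysctl settings.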
Eric Biggers	a458b29021	ipv6: sr: Fix MAC comparison to be constant-time To prevent timing attacks, MACs need to be compared in constant time. Use the appropriate helper function for this. Fixes: `bf355b8d2c` ("ipv6: sr: add core files for SR HMAC support") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it> Link: https://patch.msgid.link/20250818202724.15713-1-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-20 19:32:30 -07:00
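To illustrate why the comparison above matters: `memcmp()` may return as soon as it finds a differing byte, so its running time can leak how many leading bytes of a forged MAC were correct. In the kernel, `crypto_memneq()` exists for this purpose. The sketch below is a simplified user-space illustration of the idea, not the kernel's implementation, and `ct_memneq()` is a hypothetical name.

```c
#include <stddef.h>

/*
 * Compare two buffers without an early exit. Returns 0 if they are equal
 * and 1 otherwise. Every byte is always examined, so the running time
 * does not depend on where the first mismatch occurs. The volatile
 * qualifiers discourage the compiler from reintroducing a short-circuit.
 */
int ct_memneq(const void *a, const void *b, size_t len)
{
	const volatile unsigned char *pa = a;
	const volatile unsigned char *pb = b;
	unsigned char diff = 0;
	size_t i;

	for (i = 0; i < len; i++)
		diff |= pa[i] ^ pb[i];	/* accumulate any differing bits */

	return diff != 0;
}
```

A verifier would compute the expected MAC and reject the packet when `ct_memneq(expected, received, mac_len)` is nonzero.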
Jakub Acs	7af76e9d18	net, hsr: reject HSR frame if skb can't hold tag Receiving HSR frame with insufficient space to hold HSR tag in the skb can result in a crash (kernel BUG): [ 45.390915] skbuff: skb_under_panic: text:ffffffff86f32cac len:26 put:14 head:ffff888042418000 data:ffff888042417ff4 tail:0xe end:0x180 dev:bridge_slave_1 [ 45.392559] ------------[ cut here ]------------ [ 45.392912] kernel BUG at net/core/skbuff.c:211! [ 45.393276] Oops: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI [ 45.393809] CPU: 1 UID: 0 PID: 2496 Comm: reproducer Not tainted 6.15.0 #12 PREEMPT(undef) [ 45.394433] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [ 45.395273] RIP: 0010:skb_panic+0x15b/0x1d0 <snip registers, remove unreliable trace> [ 45.402911] Call Trace: [ 45.403105] <IRQ> [ 45.404470] skb_push+0xcd/0xf0 [ 45.404726] br_dev_queue_push_xmit+0x7c/0x6c0 [ 45.406513] br_forward_finish+0x128/0x260 [ 45.408483] __br_forward+0x42d/0x590 [ 45.409464] maybe_deliver+0x2eb/0x420 [ 45.409763] br_flood+0x174/0x4a0 [ 45.410030] br_handle_frame_finish+0xc7c/0x1bc0 [ 45.411618] br_handle_frame+0xac3/0x1230 [ 45.413674] __netif_receive_skb_core.constprop.0+0x808/0x3df0 [ 45.422966] __netif_receive_skb_one_core+0xb4/0x1f0 [ 45.424478] __netif_receive_skb+0x22/0x170 [ 45.424806] process_backlog+0x242/0x6d0 [ 45.425116] __napi_poll+0xbb/0x630 [ 45.425394] net_rx_action+0x4d1/0xcc0 [ 45.427613] handle_softirqs+0x1a4/0x580 [ 45.427926] do_softirq+0x74/0x90 [ 45.428196] </IRQ> This issue was found by syzkaller. The panic happens in br_dev_queue_push_xmit() once it receives a corrupted skb with ETH header already pushed in linear data. When it attempts the skb_push() call, there's not enough headroom and skb_push() panics. The corrupted skb is put on the queue by HSR layer, which makes a sequence of unintended transformations when it receives a specific corrupted HSR frame (with incomplete TAG). 
Fix it by dropping and consuming frames that are not long enough to contain both ethernet and hsr headers. Alternative fix would be to check for enough headroom before skb_push() in br_dev_queue_push_xmit(). In the reproducer, this is injected via AF_PACKET, but I don't easily see why it couldn't be sent over the wire from adjacent network. Further Details: In the reproducer, the following network interface chain is set up: ┌────────────────┐ ┌────────────────┐ │ veth0_to_hsr ├───┤ hsr_slave0 ┼───┐ └────────────────┘ └────────────────┘ │ │ ┌──────┐ ├─┤ hsr0 ├───┐ │ └──────┘ │ ┌────────────────┐ ┌────────────────┐ │ │┌────────┐ │ veth1_to_hsr ┼───┤ hsr_slave1 ├───┘ └┤ │ └────────────────┘ └────────────────┘ ┌┼ bridge │ ││ │ │└────────┘ │ ┌───────┐ │ │ ... ├──────┘ └───────┘ To trigger the events leading up to crash, reproducer sends a corrupted HSR frame with incomplete TAG, via AF_PACKET socket on 'veth0_to_hsr'. The first HSR-layer function to process this frame is hsr_handle_frame(). It then checks if the protocol is ETH_P_PRP or ETH_P_HSR. If it is, it calls skb_set_network_header(skb, ETH_HLEN + HSR_HLEN), without checking that the skb is long enough. For the crashing frame it is not, and hence the skb->network_header and skb->mac_len fields are set incorrectly, pointing after the end of the linear buffer. I will call this a BUG#1 and it is what is addressed by this patch. In the crashing scenario before the fix, the skb continues to go down the hsr path as follows. hsr_handle_frame() then calls this sequence hsr_forward_skb() fill_frame_info() hsr->proto_ops->fill_frame_info() hsr_fill_frame_info() hsr_fill_frame_info() contains a check that intends to check whether the skb actually contains the HSR header. 
But the check relies on the skb->mac_len field which was erroneously set up due to BUG#1, so the check passes and the execution continues back in the hsr_forward_skb(): hsr_forward_skb() hsr_forward_do() hsr->proto_ops->get_untagged_frame() hsr_get_untagged_frame() create_stripped_skb_hsr() In create_stripped_skb_hsr(), a copy of the skb is created and is further corrupted by operation that attempts to strip the HSR tag in a call to __pskb_copy(). The skb enters create_stripped_skb_hsr() with ethernet header pushed in linear buffer. The skb_pull(skb_in, HSR_HLEN) thus pulls 6 bytes of ethernet header into the headroom, creating skb_in with a headroom of size 8. The subsequent __pskb_copy() then creates an skb with headroom of just 2 and skb->len of just 12, this is how it looks after the copy: (gdb) p skb->len $10 = 12 (gdb) p skb->data $11 = (unsigned char *) 0xffff888041e45382 "\252\252\252\252\252!\210\373", (gdb) p skb->head $12 = (unsigned char *) 0xffff888041e45380 "" It seems create_stripped_skb_hsr() assumes that ETH header is pulled in the headroom when it's entered, because it just pulls HSR header on top. But that is not the case in our code-path and we end up with the corrupted skb instead. I will call this BUG#2. I got confused here because it seems that under no conditions can create_stripped_skb_hsr() work well, the assumption it makes is not true during the processing of hsr frames - since the skb_push() in hsr_handle_frame to skb_pull in hsr_deliver_master(). I wonder whether I missed something here. Next, the execution arrives in hsr_deliver_master(). It calls skb_pull(ETH_HLEN), which just returns NULL - the SKB does not have enough space for the pull (as it only has 12 bytes in total at this point). The skb_pull() here further suggests that ethernet header is meant to be pushed through the whole hsr processing and create_stripped_skb_hsr() should pull it before doing the HSR header pull. 
hsr_deliver_master() then puts the corrupted skb on the queue, it is then picked up from there by bridge frame handling layer and finally lands in br_dev_queue_push_xmit where it panics. Cc: stable@kernel.org Fixes: `48b491a5cc` ("net: hsr: fix mac_len checks") Reported-by: syzbot+a81f2759d022496b40ab@syzkaller.appspotmail.com Signed-off-by: Jakub Acs <acsjakub@amazon.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250819082842.94378-1-acsjakub@amazon.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-20 19:31:25 -07:00
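The fix described above amounts to a length check before any header offsets are derived from the frame. The following is a simplified user-space model of that check, not the actual patch: `hsr_frame_has_tag()` and the two constants are hypothetical names that mirror the kernel's `ETH_HLEN` (14 bytes) and `HSR_HLEN` (6 bytes, per the commit message).

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the kernel's ETH_HLEN and HSR_HLEN. */
enum {
	ETH_HDR_LEN = 14,	/* destination MAC + source MAC + EtherType */
	HSR_TAG_LEN = 6		/* HSR tag that follows the Ethernet header */
};

/*
 * Return true only if a frame of `frame_len` bytes is long enough to hold
 * both the Ethernet header and the HSR tag. A receiver would drop the
 * frame when this returns false, before setting any header offsets.
 */
bool hsr_frame_has_tag(size_t frame_len)
{
	return frame_len >= (size_t)ETH_HDR_LEN + HSR_TAG_LEN;
}
```

With these values, any frame shorter than 20 bytes is rejected, so later stages never see header offsets that point past the end of the buffer.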
William Liu	2c2192e5f9	net/sched: Remove unnecessary WARNING condition for empty child qdisc in htb_activate The WARN_ON trigger based on !cl->leaf.q->q.qlen is unnecessary in htb_activate. htb_dequeue_tree already accounts for that scenario. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: William Liu <will@willsroot.io> Reviewed-by: Savino Dicanosa <savy@syst3mfailure.io> Link: https://patch.msgid.link/20250819033632.579854-1-will@willsroot.io Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-20 19:27:08 -07:00
William Liu	15de71d06a	net/sched: Make cake_enqueue return NET_XMIT_CN when past buffer_limit The following setup can trigger a WARNING in htb_activate due to the condition: !cl->leaf.q->q.qlen tc qdisc del dev lo root tc qdisc add dev lo root handle 1: htb default 1 tc class add dev lo parent 1: classid 1:1 \ htb rate 64bit tc qdisc add dev lo parent 1:1 handle f: \ cake memlimit 1b ping -I lo -f -c1 -s64 -W0.001 127.0.0.1 This is because the low memlimit leads to a low buffer_limit, which causes packet dropping. However, cake_enqueue still returns NET_XMIT_SUCCESS, causing htb_enqueue to call htb_activate with an empty child qdisc. We should return NET_XMIT_CN when packets are dropped from the same tin and flow. I do not believe return value of NET_XMIT_CN is necessary for packet drops in the case of ack filtering, as that is meant to optimize performance, not to signal congestion. Fixes: `046f6fd5da` ("sched: Add Common Applications Kept Enhanced (cake) qdisc") Signed-off-by: William Liu <will@willsroot.io> Reviewed-by: Savino Dicanosa <savy@syst3mfailure.io> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20250819033601.579821-1-will@willsroot.io Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-20 19:27:08 -07:00
Pengtao He	8f2c72f225	net: avoid one loop iteration in __skb_splice_bits If len is equal to 0 at the beginning of __splice_segment it returns true directly. But when decreasing len from a positive number to 0 in __splice_segment, it returns false. The __skb_splice_bits needs to call __splice_segment again. Recheck *len if it changes, return true in time. Reduce unnecessary calls to __splice_segment. Signed-off-by: Pengtao He <hept.hept.hept@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250819021551.8361-1-hept.hept.hept@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-20 19:24:17 -07:00
Sebastian Andrzej Siewior	456010c8b9	netfilter: nft_set_pipapo: Use nested-BH locking for nft_pipapo_scratch nft_pipapo_scratch is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. Add a local_lock_t to the data structure and use local_lock_nested_bh() for locking. This change adds only lockdep coverage and does not alter the functional behaviour for !PREEMPT_RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-08-20 13:52:37 +02:00
Sebastian Andrzej Siewior	6aa67d5706	netfilter: nft_set_pipapo: Store real pointer, adjust later. The struct nft_pipapo_scratch is allocated, then aligned to the required alignment and difference (in bytes) is then saved in align_off. The aligned pointer is used later. While this works, it gets complicated with all the extra checks if all member before map are larger than the required alignment. Instead of saving the aligned pointer, just save the returned pointer and align the map pointer in nft_pipapo_lookup() before using it. The alignment later on shouldn't be that expensive. With this change, the align_off can be removed and the pointer can be passed to kfree() as is. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-08-20 13:52:37 +02:00
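The approach in the commit above (keep the pointer returned by the allocator, align at the point of use) can be sketched in plain C. This is an illustration under assumptions, not the nft_set_pipapo code: `struct scratch`, `scratch_alloc()`, `scratch_map()` and the 32-byte alignment are all hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

#define SCRATCH_ALIGN 32u	/* assumed alignment requirement */

struct scratch {
	unsigned char map_index;
	unsigned long raw_map[];	/* unaligned storage, allocated with slack */
};

/*
 * Allocate room for `map_bytes` of map data plus enough slack that an
 * aligned region of that size always fits. The returned pointer is the
 * allocator's own pointer, so it can be handed to free() unchanged.
 */
struct scratch *scratch_alloc(size_t map_bytes)
{
	return malloc(sizeof(struct scratch) + map_bytes + SCRATCH_ALIGN - 1);
}

/* Compute the aligned map pointer on demand instead of storing it. */
unsigned long *scratch_map(struct scratch *s)
{
	uintptr_t p = (uintptr_t)s->raw_map;

	p = (p + SCRATCH_ALIGN - 1) & ~(uintptr_t)(SCRATCH_ALIGN - 1);
	return (unsigned long *)p;
}
```

Because only the original pointer is stored, no alignment offset has to be remembered for the free path. The cost is one add and one mask per lookup.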
Florian Westphal	84c1da7b38	netfilter: nft_set_pipapo: use avx2 algorithm for insertions too Always prefer the avx2 implementation if it's available. This greatly improves insertion performance (each insertion checks if the new element would overlap with an existing one): time nft -f - <<EOF table ip pipapo { set s { typeof ip saddr . tcp dport flags interval size 800000 elements = { 10.1.1.1 - 10.1.1.4 . 3996, [.. 800k entries elided .. ] before: real 1m55.993s user 0m2.505s sys 1m53.296s after: real 0m42.586s user 0m2.554s sys 0m39.811s Fold patch from Sebastian: kernel_fpu_begin_mask()/ _end() remains in pipapo_get_avx2() where it is required. A followup patch will add local_lock_t to struct nft_pipapo_scratch in order to protect the map pointer. The lock can not be acquired in preemption disabled context which is what kernel_fpu_begin*() does. Link: https://lore.kernel.org/netfilter-devel/20250818110213.1319982-2-bigeasy@linutronix.de/ Co-developed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-08-20 13:52:37 +02:00
Florian Westphal	416e53e395	netfilter: nft_set_pipapo_avx2: split lookup function in two parts Split the main avx2 lookup function into a helper. This is a preparation patch: followup change will use the new helper from the insertion path if possible. This greatly improves insertion performance when avx2 is supported. Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-08-20 13:52:37 +02:00
Sebastian Andrzej Siewior	d11b26402a	netfilter: nft_set_pipapo_avx2: Drop the comment regarding protection The comment claims that the kernel_fpu_begin_mask() below protects access to the scratch map. This is not true because the access is only protected by local_bh_disable() above. Remove the misleading comment. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-08-20 13:52:37 +02:00
Florian Westphal	08d07f25fd	netfilter: ctnetlink: remove refcounting in dying list dumping There is no need to keep the object alive via refcount, use a cookie and then use that as the skip hint for dump resumption. Unlike the two earlier, similar patches in this file, this is a cleanup without intended side effects. Signed-off-by: Florian Westphal <fw@strlen.de>	2025-08-20 13:52:36 +02:00
Eric Biggers	d5a253702a	sctp: Stop accepting md5 and sha1 for net.sctp.cookie_hmac_alg The upgrade of the cookie authentication algorithm to HMAC-SHA256 kept some backwards compatibility for the net.sctp.cookie_hmac_alg sysctl by still accepting the values 'md5' and 'sha1'. Those algorithms are no longer actually used, but rather those values were just treated as requests to enable cookie authentication. As requested at https://lore.kernel.org/netdev/CADvbK_fmCRARc8VznH8cQa-QKaCOQZ6yFbF=1-VDK=zRqv_cXw@mail.gmail.com/ and https://lore.kernel.org/netdev/20250818084345.708ac796@kernel.org/ , go further and start rejecting 'md5' and 'sha1' completely. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Link: https://patch.msgid.link/20250818205426.30222-6-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 19:36:26 -07:00
Eric Biggers	2f3dd6ec90	sctp: Convert cookie authentication to use HMAC-SHA256 Convert SCTP cookies to use HMAC-SHA256, instead of the previous choice of the legacy algorithms HMAC-MD5 and HMAC-SHA1. Simplify and optimize the code by using the HMAC-SHA256 library instead of crypto_shash, and by preparing the HMAC key when it is generated instead of per-operation. This doesn't break compatibility, since the cookie format is an implementation detail, not part of the SCTP protocol itself. Note that the cookie size doesn't change either. The HMAC field was already 32 bytes, even though previously at most 20 bytes were actually compared. 32 bytes exactly fits an untruncated HMAC-SHA256 value. So, although we could safely truncate the MAC to something slightly shorter, for now just keep the cookie size the same. I also considered SipHash, but that would generate only 8-byte MACs. An 8-byte MAC might suffice here. However, there's quite a lot of information in the SCTP cookies: more than in TCP SYN cookies. So absent an analysis that occasional forgeries of all that information are okay in SCTP, I erred on the side of caution. Remove HMAC-MD5 and HMAC-SHA1 as options, since the new HMAC-SHA256 option is just better. It's faster as well as more secure. For example, benchmarking on x86_64, cookie authentication is now nearly 3x as fast as the previous default choice and implementation of HMAC-MD5. Also just make the kernel always support cookie authentication if SCTP is supported at all, rather than making it optional in the build. (It was sort of optional before, but it didn't really work properly. E.g., a kernel with CONFIG_SCTP_COOKIE_HMAC_MD5=n still supported HMAC-MD5 cookie authentication if CONFIG_CRYPTO_HMAC and CONFIG_CRYPTO_MD5 happened to be enabled in the kconfig for other reasons.) 
Acked-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Link: https://patch.msgid.link/20250818205426.30222-5-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 19:36:26 -07:00
Eric Biggers	bf40785fa4	sctp: Use HMAC-SHA1 and HMAC-SHA256 library for chunk authentication For SCTP chunk authentication, use the HMAC-SHA1 and HMAC-SHA256 library functions instead of crypto_shash. This is simpler and faster. There's no longer any need to pre-allocate 'crypto_shash' objects; the SCTP code now simply calls into the HMAC code directly. As part of this, make SCTP always support both HMAC-SHA1 and HMAC-SHA256. Previously, it only guaranteed support for HMAC-SHA1. However, HMAC-SHA256 tended to be supported too anyway, as it was supported if CONFIG_CRYPTO_SHA256 was enabled elsewhere in the kconfig. Acked-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Link: https://patch.msgid.link/20250818205426.30222-4-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 19:36:25 -07:00
Eric Biggers	dd91c79e4f	sctp: Fix MAC comparison to be constant-time To prevent timing attacks, MACs need to be compared in constant time. Use the appropriate helper function for this. Fixes: `bbd0d59809` ("[SCTP]: Implement the receive and verification of AUTH chunk") Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Link: https://patch.msgid.link/20250818205426.30222-3-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 19:36:25 -07:00
Kuniyuki Iwashima	bf64002c94	net: Define sk_memcg under CONFIG_MEMCG. Except for sk_clone_lock(), all accesses to sk->sk_memcg are done under CONFIG_MEMCG. As a bonus, let's define sk->sk_memcg under CONFIG_MEMCG. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://patch.msgid.link/20250815201712.1745332-11-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 19:20:59 -07:00
Kuniyuki Iwashima	bb178c6bc0	net-memcg: Pass struct sock to mem_cgroup_sk_(un)?charge(). We will store a flag in the lowest bit of sk->sk_memcg. Then, we cannot pass the raw pointer to mem_cgroup_charge_skmem() and mem_cgroup_uncharge_skmem(). Let's pass struct sock to the functions. While at it, they are renamed to match other functions starting with mem_cgroup_sk_. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://patch.msgid.link/20250815201712.1745332-9-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 19:20:59 -07:00
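The commit above explains why callers can no longer pass the raw `sk->sk_memcg` pointer once a flag lives in its lowest bit. The following is a minimal user-space sketch of that pattern, assuming the pointed-to object is at least 2-byte aligned; all type, function and flag names here are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_MEMCG_FLAG ((uintptr_t)1)	/* hypothetical flag in bit 0 */

struct memcg_model {
	long charged;	/* alignment of long keeps bit 0 of the address clear */
};

struct sock_model {
	uintptr_t sk_memcg;	/* pointer and flag packed into one word */
};

void sk_set_memcg(struct sock_model *sk, struct memcg_model *memcg, bool flag)
{
	sk->sk_memcg = (uintptr_t)memcg | (flag ? SK_MEMCG_FLAG : 0);
}

/* Strip the flag bit before using the value as a pointer. */
struct memcg_model *sk_get_memcg(const struct sock_model *sk)
{
	return (struct memcg_model *)(sk->sk_memcg & ~SK_MEMCG_FLAG);
}

bool sk_memcg_flag(const struct sock_model *sk)
{
	return (sk->sk_memcg & SK_MEMCG_FLAG) != 0;
}

/* Takes the socket, not the raw pointer, so the masking cannot be skipped. */
void sk_charge(struct sock_model *sk, long nr_pages)
{
	struct memcg_model *memcg = sk_get_memcg(sk);

	if (memcg)
		memcg->charged += nr_pages;
}
```

Passing the socket to the charge function keeps the unmasking in one place. A caller holding only the packed word would otherwise dereference an address that is off by one whenever the flag is set.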


82231 Commits