linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-20 07:31:07 -04:00

Author	SHA1	Message	Date
Ido Schimmel	4a8c416602	ipv4: icmp: Fix source IP derivation in presence of VRFs When the "icmp_errors_use_inbound_ifaddr" sysctl is enabled, the source IP of ICMP error messages should be the "primary address of the interface that received the packet that caused the icmp error". The IPv4 ICMP code determines this interface using inet_iif() which in the input path translates to skb->skb_iif. If the interface that received the packet is a VRF port, skb->skb_iif will contain the ifindex of the VRF device and not that of the receiving interface. This is because in the input path the VRF driver overrides skb->skb_iif with the ifindex of the VRF device itself (see vrf_ip_rcv()). As such, the source IP that will be chosen for the ICMP error message is either an address assigned to the VRF device itself (if present) or an address assigned to some VRF port, not necessarily the input or output interface. This behavior is especially problematic when the error messages are "Time Exceeded" messages as it means that utilities like traceroute will show an incorrect packet path. Solve this by determining the input interface based on the iif field in the control block, if present. This field is set in the input path to skb->skb_iif and is not later overridden by the VRF driver, unlike skb->skb_iif. This behavior is consistent with the IPv6 counterpart that already uses the iif from the control block. Reported-by: Andy Roulin <aroulin@nvidia.com> Reported-by: Rajkumar Srinivasan <rajsrinivasa@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250908073238.119240-4-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-11 12:22:38 +02:00
Ido Schimmel	0d3c4a4416	ipv4: icmp: Pass IPv4 control block structure as an argument to __icmp_send() __icmp_send() is used to generate ICMP error messages in response to various situations such as MTU errors (i.e., "Fragmentation Required") and too many hops (i.e., "Time Exceeded"). The skb that generated the error does not necessarily come from the IPv4 layer and does not always have a valid IPv4 control block in skb->cb. Therefore, commit `9ef6b42ad6` ("net: Add __icmp_send helper.") changed the function to take the IP options structure as argument instead of deriving it from the skb's control block. Some callers of this function such as icmp_send() pass the IP options structure from the skb's control block as in these call paths the control block is known to be valid, but other callers simply pass a zeroed structure. A subsequent patch will need __icmp_send() to access more information from the IPv4 control block (specifically, the ifindex of the input interface). As a preparation for this change, change the function to take the IPv4 control block structure as an argument instead of the IP options structure. This makes the function similar to its IPv6 counterpart that already takes the IPv6 control block structure as an argument. No functional changes intended. Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250908073238.119240-3-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-11 12:22:38 +02:00
Ido Schimmel	cda276bcb9	ipv4: cipso: Simplify IP options handling in cipso_v4_error() When __ip_options_compile() is called with an skb, the IP options are parsed from the skb data into the provided IP option argument. This is in contrast to the case where the skb argument is NULL and the options are parsed from opt->__data. Given that cipso_v4_error() always passes an skb to __ip_options_compile(), there is no need to allocate an extra 40 bytes (maximum IP options size). Therefore, simplify the function by removing these extra bytes and make the function similar to ipv4_send_dest_unreach() which also calls both __ip_options_compile() and __icmp_send(). This is a preparation for changing the arguments being passed to __icmp_send(). No functional changes intended. Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250908073238.119240-2-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-11 12:22:38 +02:00
Jakub Kicinski	1827f773e4	net: xdp: pass full flags to xdp_update_skb_shared_info() xdp_update_skb_shared_info() needs to update skb state which was maintained in xdp_buff / frame. Pass full flags into it, instead of breaking it out bit by bit. We will need to add a bit for unreadable frags (even tho XDP doesn't support those the driver paths may be common), at which point almost all call sites would become: xdp_update_skb_shared_info(skb, num_frags, sinfo->xdp_frags_size, MY_PAGE_SIZE * num_frags, xdp_buff_is_frag_pfmemalloc(xdp), xdp_buff_is_frag_unreadable(xdp)); Keep a helper for accessing the flags, in case we need to transform them somehow in the future (e.g. to cover up xdp_buff vs xdp_frame differences). While we are touching call callers - rename the helper to xdp_update_skb_frags_info(), previous name may have implied that it's shinfo that's updated. We are updating flags in struct sk_buff based on frags that got attched. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://patch.msgid.link/20250905221539.2930285-2-kuba@kernel.org Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-11 12:00:20 +02:00
Vlad Dumitrescu	ce0b015e26	devlink: Add 'total_vfs' generic device param NICs are typically configured with total_vfs=0, forcing users to rely on external tools to enable SR-IOV (a widely used and essential feature). Add total_vfs parameter to devlink for SR-IOV max VF configurability. Enables standard kernel tools to manage SR-IOV, addressing the need for flexible VF configuration. Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Tested-by: Kamal Heib <kheib@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250907012953.301746-2-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-09 19:14:23 -07:00
Geliang Tang	30549eebc4	mptcp: make ADD_ADDR retransmission timeout adaptive Currently the ADD_ADDR option is retransmitted with a fixed timeout. This patch makes the retransmission timeout adaptive by using the maximum RTO among all the subflows, while still capping it at the configured maximum value (add_addr_timeout_max). This improves responsiveness when establishing new subflows. Specifically: 1. Adds mptcp_adjust_add_addr_timeout() helper to compute the adaptive timeout. 2. Uses maximum subflow RTO (icsk_rto) when available. 3. Applies exponential backoff based on retransmission count. 4. Maintains fallback to configured max timeout when no RTO data exists. This slightly changes the behaviour of the MPTCP "add_addr_timeout" sysctl knob to be used as a maximum instead of a fixed value. But this is seen as an improvement: the ADD_ADDR might be sent quicker than before to improve the overall MPTCP connection. Also, the default value is set to 2 min, which was already way too long, and caused the ADD_ADDR not to be retransmitted for connections shorter than 2 minutes. Suggested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/576 Reviewed-by: Christoph Paasch <cpaasch@openai.com> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250907-net-next-mptcp-add_addr-retrans-adapt-v1-1-824cc805772b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-09 18:57:45 -07:00
Alok Tiwari	d436b5abba	ipv4: udp: fix typos in comments Correct typos in ipv4/udp.c comments for clarity: "Encapulation" -> "Encapsulation" "measureable" -> "measurable" "tacking care" -> "taking care" No functional changes. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250907192535.3610686-1-alok.a.tiwari@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-09 16:29:05 -07:00
Hangbin Liu	d67ca09ca3	hsr: use netdev_master_upper_dev_link() when linking lower ports Unlike VLAN devices, HSR changes the lower device’s rx_handler, which prevents the lower device from being attached to another master. Switch to using netdev_master_upper_dev_link() when setting up the lower device. This could improves user experience, since ip link will now display the HSR device as the master for its ports. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20250902065558.360927-1-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-09 11:27:37 +02:00
Håkon Bugge	9f0730b063	rds: ib: Remove unused extern definition In the old days, RDS used FMR (Fast Memory Registration) to register IB MRs to be used by RDMA. A newer and better verbs based registration/de-registration method called FRWR (Fast Registration Work Request) was added to RDS by commit `1659185fb4` ("RDS: IB: Support Fastreg MR (FRMR) memory registration mode") in 2016. Detection and enablement of FRWR was done in commit `2cb2912d65` ("RDS: IB: add Fastreg MR (FRMR) detection support"). But said commit added an extern bool prefer_frmr, which was not used by said commit - nor used by later commits. Hence, remove it. Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20250905101958.4028647-1-haakon.bugge@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-08 18:16:49 -07:00
Eric Dumazet	c73d583e70	xfrm: snmp: do not use SNMP_MIB_SENTINEL anymore Use ARRAY_SIZE(), so that we know the limit at compile time. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250905165813.1470708-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-08 18:06:21 -07:00
Eric Dumazet	3a951f9520	tls: snmp: do not use SNMP_MIB_SENTINEL anymore Use ARRAY_SIZE(), so that we know the limit at compile time. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250905165813.1470708-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-08 18:06:21 -07:00
Eric Dumazet	52a33cae6a	sctp: snmp: do not use SNMP_MIB_SENTINEL anymore Use ARRAY_SIZE(), so that we know the limit at compile time. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20250905165813.1470708-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-08 18:06:21 -07:00
Eric Dumazet	35cb2da0ab	mptcp: snmp: do not use SNMP_MIB_SENTINEL anymore Use ARRAY_SIZE(), so that we know the limit at compile time. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Mat Martineau <martineau@kernel.org> Cc: Geliang Tang <geliang@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250905165813.1470708-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-08 18:06:20 -07:00
Eric Dumazet	b7b74953f8	ipv4: snmp: do not use SNMP_MIB_SENTINEL anymore Use ARRAY_SIZE(), so that we know the limit at compile time. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250905165813.1470708-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-08 18:06:20 -07:00
Eric Dumazet	2fab94bcf3	ipv6: snmp: do not track per idev ICMP6_MIB_RATELIMITHOST Blamed commit added a critical false sharing on a single atomic_long_t under DOS, like receiving UDP packets to closed ports. Per netns ICMP6_MIB_RATELIMITHOST tracking uses per-cpu storage and is enough, we do not need per-device and slow tracking. Fixes: `d0941130c9` ("icmp: Add counters for rate limits") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jamie Bainbridge <jamie.bainbridge@gmail.com> Cc: Abhishek Rawal <rawal.abhishek92@gmail.com> Link: https://patch.msgid.link/20250905165813.1470708-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-08 18:06:20 -07:00
Eric Dumazet	ceac1fb229	ipv6: snmp: do not use SNMP_MIB_SENTINEL anymore Use ARRAY_SIZE(), so that we know the limit at compile time. Following patch needs this preliminary change. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250905165813.1470708-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-08 18:06:20 -07:00
Eric Dumazet	b7fe8c1be7	ipv6: snmp: remove icmp6type2name[] This 2KB array can be replaced by a switch() to save space. Before: $ size net/ipv6/proc.o text data bss dec hex filename 6410 624 0 7034 1b7a net/ipv6/proc.o After: $ size net/ipv6/proc.o text data bss dec hex filename 5516 592 0 6108 17dc net/ipv6/proc.o Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250905165813.1470708-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-08 18:06:20 -07:00
Alok Tiwari	bd64723327	net: mctp: fix typo in comment Correct a typo in af_mctp.c: "fist" -> "first". Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Acked-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://patch.msgid.link/20250905165006.3032472-1-alok.a.tiwari@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-08 17:47:57 -07:00
Eric Dumazet	16c610162d	net: call cond_resched() less often in __release_sock() While stress testing TCP I had unexpected retransmits and sack packets when a single cpu receives data from multiple high-throughput flows. super_netperf 4 -H srv -T,10 -l 3000 & Tcpdump extract: 00:00:00.000007 IP6 clnt > srv: Flags [.], seq 26062848:26124288, ack 1, win 66, options [nop,nop,TS val 651460834 ecr 3100749131], length 61440 00:00:00.000006 IP6 clnt > srv: Flags [.], seq 26124288:26185728, ack 1, win 66, options [nop,nop,TS val 651460834 ecr 3100749131], length 61440 00:00:00.000005 IP6 clnt > srv: Flags [P.], seq 26185728:26243072, ack 1, win 66, options [nop,nop,TS val 651460834 ecr 3100749131], length 57344 00:00:00.000006 IP6 clnt > srv: Flags [.], seq 26243072:26304512, ack 1, win 66, options [nop,nop,TS val 651460844 ecr 3100749141], length 61440 00:00:00.000005 IP6 clnt > srv: Flags [.], seq 26304512:26365952, ack 1, win 66, options [nop,nop,TS val 651460844 ecr 3100749141], length 61440 00:00:00.000007 IP6 clnt > srv: Flags [P.], seq 26365952:26423296, ack 1, win 66, options [nop,nop,TS val 651460844 ecr 3100749141], length 57344 00:00:00.000006 IP6 clnt > srv: Flags [.], seq 26423296:26484736, ack 1, win 66, options [nop,nop,TS val 651460853 ecr 3100749150], length 61440 00:00:00.000005 IP6 clnt > srv: Flags [.], seq 26484736:26546176, ack 1, win 66, options [nop,nop,TS val 651460853 ecr 3100749150], length 61440 00:00:00.000005 IP6 clnt > srv: Flags [P.], seq 26546176:26603520, ack 1, win 66, options [nop,nop,TS val 651460853 ecr 3100749150], length 57344 00:00:00.003932 IP6 clnt > srv: Flags [P.], seq 26603520:26619904, ack 1, win 66, options [nop,nop,TS val 651464844 ecr 3100753141], length 16384 00:00:00.006602 IP6 clnt > srv: Flags [.], seq 24862720:24866816, ack 1, win 66, options [nop,nop,TS val 651471419 ecr 3100759716], length 4096 00:00:00.013000 IP6 clnt > srv: Flags [.], seq 24862720:24866816, ack 1, win 66, options [nop,nop,TS val 651484421 ecr 3100772718], length 4096 00:00:00.000416 IP6 srv > clnt: Flags [.], ack 26619904, win 1393, options [nop,nop,TS val 3100773185 ecr 651484421,nop,nop,sack 1 {24862720:24866816}], length 0 After analysis, it appears this is because of the cond_resched() call from __release_sock(). When current thread is yielding, while still holding the TCP socket lock, it might regain the cpu after a very long time. Other peer TLP/RTO is firing (multiple times) and packets are retransmit, while the initial copy is waiting in the socket backlog or receive queue. In this patch, I call cond_resched() only once every 16 packets. Modern TCP stack now spends less time per packet in the backlog, especially because ACK are no longer sent (commit `133c4c0d37` "tcp: defer regular ACK while processing socket backlog") Before: clnt:/# nstat -n;sleep 10;nstat\|egrep "TcpOutSegs\|TcpRetransSegs\|TCPFastRetrans\|TCPTimeouts\|Probes\|TCPSpuriousRTOs\|DSACK" TcpOutSegs 19046186 0.0 TcpRetransSegs 1471 0.0 TcpExtTCPTimeouts 1397 0.0 TcpExtTCPLossProbes 1356 0.0 TcpExtTCPDSACKRecv 1352 0.0 TcpExtTCPSpuriousRTOs 114 0.0 TcpExtTCPDSACKRecvSegs 1352 0.0 After: clnt:/# nstat -n;sleep 10;nstat\|egrep "TcpOutSegs\|TcpRetransSegs\|TCPFastRetrans\|TCPTimeouts\|Probes\|TCPSpuriousRTOs\|DSACK" TcpOutSegs 19218936 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250903174811.1930820-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-04 19:16:51 -07:00
Eric Dumazet	b13592d20b	tcp: use tcp_eat_recv_skb in __tcp_close() Small change to use tcp_eat_recv_skb() instead of __kfree_skb(). This can help if an application under attack has to close many sockets with unread data. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20250903084720.1168904-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-04 19:13:41 -07:00
Eric Dumazet	5f92385309	tcp: fix __tcp_close() to only send RST when required If the receive queue contains payload that was already received, __tcp_close() can send an unexpected RST. Refine the code to take tp->copied_seq into account, as we already do in tcp recvmsg(). Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20250903084720.1168904-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-04 19:13:41 -07:00
Jakub Kicinski	5ef04a7b06	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.17-rc5). No conflicts. Adjacent changes: include/net/sock.h `c51613fa27` ("net: add sk->sk_drop_counters") `5d6b58c932` ("net: lockless sock_i_ino()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-04 13:33:00 -07:00
Alexandra Winter	c975e1dfcc	net/smc: Improve log message for devices w/o pnetid Explicitly state in the log message, when a device has no pnetid. "with pnetid" and "has pnetid" was misleading for devices without pnetid. Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250901145842.1718373-3-wintera@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-04 08:57:04 -07:00
Jakub Kicinski	d93b10e894	Merge tag 'nf-25-09-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Florian Westphal says: ==================== netfilter: updates for net 1) Fix a silly bug in conntrack selftest, busyloop may get optimized to for (;;), reported by Yi Chen. 2) Introduce new NFTA_DEVICE_PREFIX attribute in nftables netlink api, re-using old NFTA_DEVICE_NAME led to confusion with different kernel/userspace versions. This refines the wildcard interface support added in 6.16 release. From Phil Sutter. * tag 'nf-25-09-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: Introduce NFTA_DEVICE_PREFIX selftests: netfilter: fix udpclash tool hang ==================== Link: https://patch.msgid.link/20250904072548.3267-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-04 06:59:27 -07:00
Jakub Kicinski	3ceb08838b	net: add helper to pre-check if PP for an Rx queue will be unreadable mlx5 pokes into the rxq state to check if the queue has a memory provider, and therefore whether it may produce unreadable mem. Add a helper for doing this in the page pool API. fbnic will want a similar thing (tho, for a slightly different reason). Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250901211214.1027927-11-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-04 10:19:17 +02:00
Yue Haibing	61481d72e1	ipv6: sit: Add ipip6_tunnel_dst_find() for cleanup Extract the dst lookup logic from ipip6_tunnel_xmit() into new helper ipip6_tunnel_dst_find() to reduce code duplication and enhance readability. No functional change intended. On a x86_64, with allmodconfig object size is also reduced: ./scripts/bloat-o-meter net/ipv6/sit.o net/ipv6/sit-new.o add/remove: 5/3 grow/shrink: 3/4 up/down: 1841/-2275 (-434) Function old new delta ipip6_tunnel_dst_find - 1697 +1697 __pfx_ipip6_tunnel_dst_find - 64 +64 __UNIQUE_ID_modinfo2094 - 43 +43 ipip6_tunnel_xmit.isra.cold 79 88 +9 __UNIQUE_ID_modinfo2096 12 20 +8 __UNIQUE_ID___addressable_init_module2092 - 8 +8 __UNIQUE_ID___addressable_cleanup_module2093 - 8 +8 __func__ 55 59 +4 __UNIQUE_ID_modinfo2097 20 18 -2 __UNIQUE_ID___addressable_init_module2093 8 - -8 __UNIQUE_ID___addressable_cleanup_module2094 8 - -8 __UNIQUE_ID_modinfo2098 18 - -18 __UNIQUE_ID_modinfo2095 43 12 -31 descriptor 112 56 -56 ipip6_tunnel_xmit.isra 9910 7758 -2152 Total: Before=72537, After=72103, chg -0.60% Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20250901114857.1968513-1-yuehaibing@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-04 10:03:59 +02:00
Wang Liang	0a228624bc	net: atm: fix memory leak in atm_register_sysfs when device_register fail When device_register() return error in atm_register_sysfs(), which can be triggered by kzalloc fail in device_private_init() or other reasons, kmemleak reports the following memory leaks: unreferenced object 0xffff88810182fb80 (size 8): comm "insmod", pid 504, jiffies 4294852464 hex dump (first 8 bytes): 61 64 75 6d 6d 79 30 00 adummy0. backtrace (crc 14dfadaf): __kmalloc_node_track_caller_noprof+0x335/0x450 kvasprintf+0xb3/0x130 kobject_set_name_vargs+0x45/0x120 dev_set_name+0xa9/0xe0 atm_register_sysfs+0xf3/0x220 atm_dev_register+0x40b/0x780 0xffffffffa000b089 do_one_initcall+0x89/0x300 do_init_module+0x27b/0x7d0 load_module+0x54cd/0x5ff0 init_module_from_file+0xe4/0x150 idempotent_init_module+0x32c/0x610 __x64_sys_finit_module+0xbd/0x120 do_syscall_64+0xa8/0x270 entry_SYSCALL_64_after_hwframe+0x77/0x7f When device_create_file() return error in atm_register_sysfs(), the same issue also can be triggered. Function put_device() should be called to release kobj->name memory and other device resource, instead of kfree(). Fixes: `1fa5ae857b` ("driver core: get rid of struct device's bus_id string array") Signed-off-by: Wang Liang <wangliang74@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250901063537.1472221-1-wangliang74@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-04 09:53:44 +02:00
Phil Sutter	4039ce7ef4	netfilter: nf_tables: Introduce NFTA_DEVICE_PREFIX This new attribute is supposed to be used instead of NFTA_DEVICE_NAME for simple wildcard interface specs. It holds a NUL-terminated string representing an interface name prefix to match on. While kernel code to distinguish full names from prefixes in NFTA_DEVICE_NAME is simpler than this solution, reusing the existing attribute with different semantics leads to confusion between different versions of kernel and user space though: * With old kernels, wildcards submitted by user space are accepted yet silently treated as regular names. * With old user space, wildcards submitted by kernel may cause crashes since libnftnl expects NUL-termination when there is none. Using a distinct attribute type sanitizes these situations as the receiving part detects and rejects the unexpected attribute nested in *_HOOK_DEVS attributes. Fixes: `6d07a28950` ("netfilter: nf_tables: Support wildcard netdev hook specs") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-09-04 09:19:25 +02:00
Eric Dumazet	8156210d36	ax25: properly unshare skbs in ax25_kiss_rcv() Bernard Pidoux reported a regression apparently caused by commit `c353e8983e` ("net: introduce per netns packet chains"). skb->dev becomes NULL and we crash in __netif_receive_skb_core(). Before above commit, different kind of bugs or corruptions could happen without a major crash. But the root cause is that ax25_kiss_rcv() can queue/mangle input skb without checking if this skb is shared or not. Many thanks to Bernard Pidoux for his help, diagnosis and tests. We had a similar issue years ago fixed with commit `7aaed57c5c` ("phonet: properly unshare skbs in phonet_rcv()"). Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Reported-by: Bernard Pidoux <f6bvp@free.fr> Closes: https://lore.kernel.org/netdev/1713f383-c538-4918-bc64-13b3288cd542@free.fr/ Tested-by: Bernard Pidoux <f6bvp@free.fr> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Joerg Reuter <jreuter@yaina.de> Cc: David Ranch <dranch@trinnet.net> Cc: Folkert van Heusden <folkert@vanheusden.com> Reviewed-by: Dan Cross <crossd@gmail.com> Link: https://patch.msgid.link/20250902124642.212705-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-03 17:06:30 -07:00
Alok Tiwari	a125c8fb9d	mctp: return -ENOPROTOOPT for unknown getsockopt options In mctp_getsockopt(), unrecognized options currently return -EINVAL. In contrast, mctp_setsockopt() returns -ENOPROTOOPT for unknown options. Update mctp_getsockopt() to also return -ENOPROTOOPT for unknown options. This aligns the behavior of getsockopt() and setsockopt(), and matches the standard kernel socket API convention for handling unsupported options. Fixes: `99ce45d5e7` ("mctp: Implement extended addressing") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Link: https://patch.msgid.link/20250902102059.1370008-1-alok.a.tiwari@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-03 17:01:52 -07:00
Mahanta Jambigi	cc282f73bc	net/smc: Remove validation of reserved bits in CLC Decline message Currently SMC code is validating the reserved bits while parsing the incoming CLC decline message & when this validation fails, its treated as a protocol error. As a result, the SMC connection is terminated instead of falling back to TCP. As per RFC7609[1] specs we shouldn't be validating the reserved bits that is part of CLC message. This patch fixes this issue. CLC Decline message format can viewed here[2]. [1] https://datatracker.ietf.org/doc/html/rfc7609#page-92 [2] https://datatracker.ietf.org/doc/html/rfc7609#page-105 Fixes: `8ade200c26` ("net/smc: add v2 format of CLC decline message") Signed-off-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com> Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Link: https://patch.msgid.link/20250902082041.98996-1-mjambigi@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-03 17:01:07 -07:00
Dan Carpenter	a51160f8da	ipv4: Fix NULL vs error pointer check in inet_blackhole_dev_init() The inetdev_init() function never returns NULL. Check for error pointers instead. Fixes: `22600596b6` ("ipv4: give an IPv4 dev to blackhole_netdev") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/aLaQWL9NguWmeM1i@stanley.mountain Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-03 16:58:44 -07:00
Eric Dumazet	5d6b58c932	net: lockless sock_i_ino() Followup of commit `c51da3f7a1` ("net: remove sock_i_uid()") A recent syzbot report was the trigger for this change. Over the years, we had many problems caused by the read_lock[_bh](&sk->sk_callback_lock) in sock_i_uid(). We could fix smc_diag_dump_proto() or make a more radical move: Instead of waiting for new syzbot reports, cache the socket inode number in sk->sk_ino, so that we no longer need to acquire sk->sk_callback_lock in sock_i_ino(). This makes socket dumps faster (one less cache line miss, and two atomic ops avoided). Prior art: commit `25a9c8a443` ("netlink: Add __sock_i_ino() for __netlink_diag_dump().") commit `4f9bf2a2f5` ("tcp: Don't acquire inet_listen_hashbucket::lock with disabled BH.") commit `efc3dbc374` ("rds: Make rds_sock_lock BH rather than IRQ safe.") Fixes: `d2d6422f8b` ("x86: Allow to enable PREEMPT_RT.") Reported-by: syzbot+50603c05bbdf4dfdaffa@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68b73804.050a0220.3db4df.01d8.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20250902183603.740428-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-03 16:08:24 -07:00
Jakub Kicinski	24ee9feeb3	Merge tag 'nf-next-25-09-02' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter: updates for net-next 1) prefer vmalloc_array in ebtables, from Qianfeng Rong. 2) Use csum_replace4 instead of open-coding it, from Christophe Leroy. 3+4) Get rid of GFP_ATOMIC in transaction object allocations, those cause silly failures with large sets under memory pressure, from myself. 5) Remove test for AVX cpu feature in nftables pipapo set type, testing for AVX2 feature is sufficient. 6) Unexport a few function in nf_reject infra: no external callers. 7) Extend payload offset to u16, this was restricted to values <=255 so far, from Fernando Fernandez Mancera. * tag 'nf-next-25-09-02' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nft_payload: extend offset to 65535 bytes netfilter: nf_reject: remove unneeded exports netfilter: nft_set_pipapo: remove redundant test for avx feature bit netfilter: nf_tables: all transaction allocations can now sleep netfilter: nf_tables: allow iter callbacks to sleep netfilter: nft_payload: Use csum_replace4() instead of opencoding netfilter: ebtables: Use vmalloc_array() to improve code ==================== Link: https://patch.msgid.link/20250902133549.15945-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-03 16:06:45 -07:00
Asbjørn Sloth Tønnesen	9f9581ba74	netlink: specs: fou: change local-v6/peer-v6 check While updating the binary min-len implementation, I noticed that the only user, should AFAICT be using exact-len instead. In net/ipv4/fou_core.c FOU_ATTR_LOCAL_V6 and FOU_ATTR_PEER_V6 are only used for singular IPv6 addresses, and there are AFAICT no known implementations trying to send more, it therefore appears safe to change it to an exact-len policy. This patch therefore changes the local-v6/peer-v6 attributes to use an exact-len check, instead of a min-len check. Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250902154640.759815-2-ast@fiberby.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-03 15:16:49 -07:00
Christoph Paasch	3bd4f98a4e	mptcp: record subflows in RPS table Accelerated Receive Flow Steering (aRFS) relies on sockets recording their RX flow hash into the rps_sock_flow_table so that incoming packets are steered to the CPU where the application runs. With MPTCP, the application interacts with the parent MPTCP socket while data is carried over per-subflow TCP sockets. Without recording these subflows, aRFS cannot steer interrupts and RX processing for the flows to the desired CPU. Record all subflows in the RPS table by calling sock_rps_record_flow() for each subflow at the start of mptcp_sendmsg(), mptcp_recvmsg() and mptcp_stream_accept(), by using the new helper mptcp_rps_record_subflows(). It does not by itself improve throughput, but ensures that IRQ and RX processing are directed to the right CPU, which is a prerequisite for effective aRFS. Signed-off-by: Christoph Paasch <cpaasch@openai.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250902-net-next-mptcp-misc-feat-6-18-v2-4-fa02bb3188b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-03 15:08:20 -07:00
Eric Biggers	2d5be5629c	mptcp: use HMAC-SHA256 library instead of open-coded HMAC Now that there are easy-to-use HMAC-SHA256 library functions, use these in net/mptcp/crypto.c instead of open-coding the HMAC algorithm. Remove the WARN_ON_ONCE() for messages longer than SHA256_DIGEST_SIZE. The new implementation handles all message lengths correctly. The mptcp-crypto KUnit test still passes after this change. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250902-net-next-mptcp-misc-feat-6-18-v2-1-fa02bb3188b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-03 15:08:20 -07:00
Jakub Kicinski	c5142df58d	Merge tag 'wireless-2025-09-03' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Just a few updates: - a set of buffer overflow fixes - ath11k: a fix for GTK rekeying - ath12k: a missed WiFi7 capability * tag 'wireless-2025-09-03' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: wilc1000: avoid buffer overflow in WID string configuration wifi: cfg80211: sme: cap SSID length in __cfg80211_connect_result() wifi: libertas: cap SSID len in lbs_associate() wifi: cw1200: cap SSID length in cw1200_do_join() wifi: ath11k: fix group data packet drops during rekey wifi: ath12k: Set EMLSR support flag in MLO flags for EML-capable stations ==================== Link: https://patch.msgid.link/20250903075602.30263-4-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-03 14:56:15 -07:00
Dan Carpenter	62b635dcd6	wifi: cfg80211: sme: cap SSID length in __cfg80211_connect_result() If the ssid->datalen is more than IEEE80211_MAX_SSID_LEN (32) it would lead to memory corruption so add some bounds checking. Fixes: `c38c701851` ("wifi: cfg80211: Set SSID if it is not already set") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://patch.msgid.link/0aaaae4a3ed37c6252363c34ae4904b1604e8e32.1756456951.git.dan.carpenter@linaro.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2025-09-03 09:37:55 +02:00
Yue Haibing	3d95261eeb	ipv6: Add sanity checks on ipv6_devconf.rpl_seg_enabled In ipv6_rpl_srh_rcv() we use min(net->ipv6.devconf_all->rpl_seg_enabled, idev->cnf.rpl_seg_enabled) is intended to return 0 when either value is zero, but if one of the values is negative it will in fact return non-zero. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://patch.msgid.link/20250901123726.1972881-3-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-02 17:01:14 -07:00
Yue Haibing	3a5f55500f	ipv6: annotate data-races around devconf->rpl_seg_enabled devconf->rpl_seg_enabled can be changed concurrently from /proc/sys/net/ipv6/conf, annotate lockless reads on it. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://patch.msgid.link/20250901123726.1972881-2-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-02 17:01:06 -07:00
Christoph Paasch	fa390321ab	net/tcp: Fix socket memory leak in TCP-AO failure handling for IPv6 When tcp_ao_copy_all_matching() fails in tcp_v6_syn_recv_sock() it just exits the function. This ends up causing a memory-leak: unreferenced object 0xffff0000281a8200 (size 2496): comm "softirq", pid 0, jiffies 4295174684 hex dump (first 32 bytes): 7f 00 00 06 7f 00 00 06 00 00 00 00 cb a8 88 13 ................ 0a 00 03 61 00 00 00 00 00 00 00 00 00 00 00 00 ...a............ backtrace (crc 5ebdbe15): kmemleak_alloc+0x44/0xe0 kmem_cache_alloc_noprof+0x248/0x470 sk_prot_alloc+0x48/0x120 sk_clone_lock+0x38/0x3b0 inet_csk_clone_lock+0x34/0x150 tcp_create_openreq_child+0x3c/0x4a8 tcp_v6_syn_recv_sock+0x1c0/0x620 tcp_check_req+0x588/0x790 tcp_v6_rcv+0x5d0/0xc18 ip6_protocol_deliver_rcu+0x2d8/0x4c0 ip6_input_finish+0x74/0x148 ip6_input+0x50/0x118 ip6_sublist_rcv+0x2fc/0x3b0 ipv6_list_rcv+0x114/0x170 __netif_receive_skb_list_core+0x16c/0x200 netif_receive_skb_list_internal+0x1f0/0x2d0 This is because in tcp_v6_syn_recv_sock (and the IPv4 counterpart), when exiting upon error, inet_csk_prepare_forced_close() and tcp_done() need to be called. They make sure the newsk will end up being correctly free'd. tcp_v4_syn_recv_sock() makes this very clear by having the put_and_exit label that takes care of things. So, this patch here makes sure tcp_v4_syn_recv_sock and tcp_v6_syn_recv_sock have similar error-handling and thus fixes the leak for TCP-AO. Fixes: `06b22ef295` ("net/tcp: Wire TCP-AO to request sockets") Signed-off-by: Christoph Paasch <cpaasch@openai.com> Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20250830-tcpao_leak-v1-1-e5878c2c3173@openai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-02 15:58:22 -07:00
Eric Dumazet	5d14bbf9d1	net_sched: act: remove tcfa_qstats tcfa_qstats is currently only used to hold drops and overlimits counters. tcf_action_inc_drop_qstats() and tcf_action_inc_overlimit_qstats() currently acquire a->tcfa_lock to increment these counters. Switch to two atomic_t to get lock-free accounting. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20250901093141.2093176-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-02 15:52:24 -07:00
Eric Dumazet	3016024d75	net_sched: add back BH safety to tcf_lock Jamal reported that we had to use BH safety after all, because stats can be updated from BH handler. Fixes: `3133d5c15c` ("net_sched: remove BH blocking in eight actions") Fixes: `53df77e785` ("net_sched: act_skbmod: use RCU in tcf_skbmod_dump()") Fixes: `e97ae74297` ("net_sched: act_tunnel_key: use RCU in tunnel_key_dump()") Fixes: `48b5e5dbdb` ("net_sched: act_vlan: use RCU in tcf_vlan_dump()") Reported-by: Jamal Hadi Salim <jhs@mojatatu.com> Closes: https://lore.kernel.org/netdev/CAM0EoMmhq66EtVqDEuNik8MVFZqkgxFbMu=fJtbNoYD7YXg4bA@mail.gmail.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20250901092608.2032473-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-02 15:51:45 -07:00
James Flowers	d250f14f5f	net/smc: Replace use of strncpy on NUL-terminated string with strscpy strncpy is deprecated for use on NUL-terminated strings, as indicated in Documentation/process/deprecated.rst. strncpy NUL-pads the destination buffer and doesn't guarantee the destination buffer will be NUL terminated. Signed-off-by: James Flowers <bold.zone2373@fastmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Link: https://patch.msgid.link/20250901030512.80099-1-bold.zone2373@fastmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-02 14:05:49 -07:00
Fernando Fernandez Mancera	077dc4a275	netfilter: nft_payload: extend offset to 65535 bytes In some situations 255 bytes offset is not enough to match or manipulate the desired packet field. Increase the offset limit to 65535 or U16_MAX. In addition, the nla policy maximum value is not set anymore as it is limited to s16. Instead, the maximum value is checked during the payload expression initialization function. Tested with the nft command line tool. table ip filter { chain output { @nh,2040,8 set 0xff @nh,524280,8 set 0xff @nh,524280,8 0xff @nh,2040,8 0xff } } Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-09-02 15:28:18 +02:00
Florian Westphal	f4f9e05904	netfilter: nf_reject: remove unneeded exports These functions have no external callers and can be static. Signed-off-by: Florian Westphal <fw@strlen.de>	2025-09-02 15:28:17 +02:00
Florian Westphal	8959f27d39	netfilter: nft_set_pipapo: remove redundant test for avx feature bit Sebastian points out that avx2 depends on avx, see check_cpufeature_deps() in arch/x86/kernel/cpu/cpuid-deps.c: avx2 feature bit will be cleared when avx isn't available. No functional change intended. Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-09-02 15:28:17 +02:00
Florian Westphal	3d95a2e016	netfilter: nf_tables: all transaction allocations can now sleep Now that nft_setelem_flush is not called with rcu read lock held or disabled softinterrupts anymore this can now use GFP_KERNEL too. This is the last atomic allocation of transaction elements, so remove all gfp_t arguments and the wrapper function. This makes attempts to delete large sets much more reliable, before this was prone to transient memory allocation failures. Signed-off-by: Florian Westphal <fw@strlen.de>	2025-09-02 15:28:17 +02:00
Florian Westphal	a60a5abe19	netfilter: nf_tables: allow iter callbacks to sleep Quoting Sven Auhagen: we do see on occasions that we get the following error message, more so on x86 systems than on arm64: Error: Could not process rule: Cannot allocate memory delete table inet filter It is not a consistent error and does not happen all the time. We are on Kernel 6.6.80, seems to me like we have something along the lines of the nf_tables: allow clone callbacks to sleep problem using GFP_ATOMIC. As hinted at by Sven, this is because of GFP_ATOMIC allocations during set flush. When set is flushed, all elements are deactivated. This triggers a set walk and each element gets added to the transaction list. The rbtree and rhashtable sets don't allow the iter callback to sleep: rbtree walk acquires read side of an rwlock with bh disabled, rhashtable walk happens with rcu read lock held. Rbtree is simple enough to resolve: When the walk context is ITER_READ, no change is needed (the iter callback must not deactivate elements; we're not in a transaction). When the iter type is ITER_UPDATE, the rwlock isn't needed because the caller holds the transaction mutex, this prevents any and all changes to the ruleset, including add/remove of set elements. Rhashtable is slightly more complex. When the iter type is ITER_READ, no change is needed, like rbtree. For ITER_UPDATE, we hold transaction mutex which prevents elements from getting free'd, even outside of rcu read lock section. So build a temporary list of all elements while doing the rcu iteration and then call the iterator in a second pass. The disadvantage is the need to iterate twice, but this cost comes with the benefit to allow the iter callback to use GFP_KERNEL allocations in a followup patch. The new list based logic makes it necessary to catch recursive calls to the same set earlier. Such walk -> iter -> walk recursion for the same set can happen during ruleset validation in case userspace gave us a bogus (cyclic) ruleset where verdict map m jumps to chain that sooner or later also calls "vmap @m". Before the new ->in_update_walk test, the ruleset is rejected because the infinite recursion causes ctx->level to exceed the allowed maximum. But with the new logic added here, elements would get skipped: nft_rhash_walk_update would see elements that are on the walk_list of an older stack frame. As all recursive calls into same map results in -EMLINK, we can avoid this problem by using the new in_update_walk flag and reject immediately. Next patch converts the problematic GFP_ATOMIC allocations. Reported-by: Sven Auhagen <Sven.Auhagen@belden.com> Closes: https://lore.kernel.org/netfilter-devel/BY1PR18MB5874110CAFF1ED098D0BC4E7E07BA@BY1PR18MB5874.namprd18.prod.outlook.com/ Signed-off-by: Florian Westphal <fw@strlen.de>	2025-09-02 15:28:17 +02:00

1 2 3 4 5 ...

81744 Commits