linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-21 13:45:53 -04:00

Author	SHA1	Message	Date
Eric Dumazet	faf7b4aefd	udp: update sk_rmem_alloc before busylock acquisition Avoid piling too many producers on the busylock by updating sk_rmem_alloc before busylock acquisition. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250916160951.541279-7-edumazet@google.com Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 10:17:10 +02:00
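The ordering described in the commit above can be illustrated in user space. This is a minimal sketch, not the kernel code: `struct rx_queue_sketch` and `reserve_rmem()` are hypothetical names, a C11 atomic stands in for `sk_rmem_alloc`, and a plain `int` stands in for the receive-buffer limit. It only shows the idea that the memory charge is applied with one atomic operation, so a producer that finds the queue already over its limit can give up before contending on a lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the socket fields named in the commit. */
struct rx_queue_sketch {
	atomic_int rmem_alloc;	/* plays the role of sk->sk_rmem_alloc */
	int rcvbuf;		/* plays the role of the receive buffer limit */
};

/*
 * Charge `size` bytes to the queue before any lock is taken.
 * Returns true if the caller may go on to take the lock and enqueue,
 * false if the packet should be dropped without touching the lock.
 */
static bool reserve_rmem(struct rx_queue_sketch *q, int size)
{
	int before = atomic_fetch_add(&q->rmem_alloc, size);

	if (before > q->rcvbuf) {
		/* Already over the limit: undo the charge and drop. */
		atomic_fetch_sub(&q->rmem_alloc, size);
		return false;
	}
	return true;
}
```

In this sketch the limit is checked against the value seen before the charge, so the counter may end up slightly above `rcvbuf`; a rejected producer always restores the counter.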
Eric Dumazet	9aaec660b5	udp: refine __udp_enqueue_schedule_skb() test Commit `5a465a0da1` ("udp: Fix multiple wraparounds of sk->sk_rmem_alloc.") allowed sk->sk_rmem_alloc to be slightly overshot when many CPUs are trying to feed packets to a common UDP socket. This patch, combined with the following one, reduces false sharing on the victim socket under DDOS. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250916160951.541279-6-edumazet@google.com Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 10:17:10 +02:00
Eric Dumazet	9fba1eb39e	ipv6: np->rxpmtu race annotation Add READ_ONCE() annotations because np->rxpmtu can be changed while udpv6_recvmsg() and rawv6_recvmsg() read it. Since this is a very rarely used feature, and that udpv6_recvmsg() and rawv6_recvmsg() read np->rxopt anyway, change the test order so that np->rxpmtu does not need to be in a hot cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250916160951.541279-4-edumazet@google.com Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 10:17:09 +02:00
Eric Dumazet	5489f333ef	ipv6: make ipv6_pinfo.daddr_cache a boolean ipv6_pinfo.daddr_cache is either NULL or &sk->sk_v6_daddr. We do not need 8 bytes, a boolean is enough. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250916160951.541279-3-edumazet@google.com Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 10:17:09 +02:00
Eric Dumazet	3fbb2a6f3a	ipv6: make ipv6_pinfo.saddr_cache a boolean ipv6_pinfo.saddr_cache is either NULL or &np->saddr. We do not need 8 bytes, a boolean is enough. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250916160951.541279-2-edumazet@google.com Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 10:17:09 +02:00
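The pointer-to-boolean change in the two commits above follows a general pattern: a pointer that can only be NULL or the address of one known field carries a single bit of information. A minimal sketch, assuming a hypothetical `struct pinfo_sketch` and helper `cached_saddr()` that are not the kernel's definitions:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical 16-byte address type, standing in for an IPv6 address. */
struct addr6_sketch {
	unsigned char b[16];
};

/* Hypothetical per-socket state; only the two relevant members are shown. */
struct pinfo_sketch {
	struct addr6_sketch saddr;
	bool saddr_cache;	/* replaces a pointer that was NULL or &saddr */
};

/* Recover the old pointer value from the flag when a caller needs it. */
static const struct addr6_sketch *cached_saddr(const struct pinfo_sketch *np)
{
	return np->saddr_cache ? &np->saddr : NULL;
}
```

The flag occupies one byte where the pointer occupied eight on a 64-bit build, and callers that want the address can still derive it.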
Chia-Yu Chang	e7e9da850a	tcp: accecn: try to fit AccECN option with SACK As SACK blocks tend to eat all option space when there are many holes, it is useful to compromise on sending many SACK blocks in every ACK and attempt to fit the AccECN option there by reducing the number of SACK blocks. However, it will never go below two SACK blocks because of the AccECN option. As the AccECN option is often not put to every ACK, the space hijack is usually only temporary. Depending on the required AccECN fields (can be either 3, 2, 1, or 0, cf. Table 5 in AccECN spec) and the NOPs used for alignment of other TCP options, up to two SACK blocks will be reduced. Please find below tables for more details:
+===========+==========+===========+=============+=============+
| Number of | Required | Remaining | Number of   | Final       |
| SACK      | AccECN   | option    | reduced     | number of   |
| blocks    | fields   | spaces    | SACK blocks | SACK blocks |
+===========+==========+===========+=============+=============+
| x (<=2)   | 0 to 3   | any       | 0           | x           |
+-----------+----------+-----------+-------------+-------------+
| 3         | 0        | any       | 0           | 3           |
| 3         | 1        | <4        | 1           | 2           |
| 3         | 1        | >=4       | 0           | 3           |
| 3         | 2        | <8        | 1           | 2           |
| 3         | 2        | >=8       | 0           | 3           |
| 3         | 3        | <12       | 1           | 2           |
| 3         | 3        | >=12      | 0           | 3           |
+-----------+----------+-----------+-------------+-------------+
| y (>=4)   | 0        | any       | 0           | y           |
| y (>=4)   | 1        | <4        | 1           | y-1         |
| y (>=4)   | 1        | >=4       | 0           | y           |
| y (>=4)   | 2        | <8        | 1           | y-1         |
| y (>=4)   | 2        | >=8       | 0           | y           |
| y (>=4)   | 3        | <4        | 2           | y-2         |
| y (>=4)   | 3        | <12       | 1           | y-1         |
| y (>=4)   | 3        | >=12      | 0           | y           |
+===========+==========+===========+=============+=============+
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Co-developed-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Ilpo Järvinen <ij@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-11-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:52 +02:00
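The table in the commit above can be reproduced with a few lines of arithmetic. This is a sketch that matches the table rows, not the kernel's implementation: `sack_blocks_to_drop()` is a hypothetical helper, the thresholds of 4, 8 and 12 bytes are read off the table as 4 bytes per required field, and 8 bytes is the size of one SACK block (two 32-bit sequence numbers).

```c
/*
 * How many SACK blocks would be given up to make room for the AccECN
 * option, following the table in the commit message.
 *
 * sack_blocks:   SACK blocks that would otherwise be sent
 * accecn_fields: required AccECN fields, 0..3
 * remaining:     remaining TCP option space in bytes
 */
static int sack_blocks_to_drop(int sack_blocks, int accecn_fields, int remaining)
{
	int needed = 4 * accecn_fields;	/* table thresholds: 4, 8, 12 */
	int shortfall, drop, max_drop;

	/* Never go below two SACK blocks; nothing to do if the option fits. */
	if (sack_blocks <= 2 || accecn_fields == 0 || remaining >= needed)
		return 0;

	shortfall = needed - remaining;
	drop = (shortfall + 7) / 8;	/* each dropped SACK block frees 8 bytes */
	max_drop = sack_blocks - 2;

	return drop < max_drop ? drop : max_drop;
}
```

For instance, with five SACK blocks, three required fields and no remaining space, the sketch gives up two blocks, which corresponds to the `y-2` row.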
Ilpo Järvinen	fe2cddc648	tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics The AccECN option ceb/cep heuristic algorithm is from AccECN spec Appendix A.2.2 to mitigate against false ACE field overflows. Armed with ceb delta from option, delivered bytes, and delivered packets it is possible to estimate how many times ACE field wrapped. This calculation is necessary only if more than one wrap is possible. Without SACK, delivered bytes and packets are not always trustworthy in which case TCP falls back to the simpler no-or-all wraps ceb algorithm. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-10-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:52 +02:00
Chia-Yu Chang	b40671b5ee	tcp: accecn: AccECN option failure handling AccECN option may fail in various ways; handle these:
- Attempt to negotiate the use of AccECN on the 1st retransmitted SYN
- From the 2nd retransmitted SYN, stop AccECN negotiation
- Remove option from SYN/ACK rexmits to handle blackholes
- If no option arrives in SYN/ACK, assume Option is not usable
- If an option arrives later, re-enable it
- If option is zeroed, disable AccECN option processing
This patch uses existing padding bits in tcp_request_sock and holes in tcp_sock without increasing the size. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-9-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:52 +02:00
Chia-Yu Chang	aa55a7dde7	tcp: accecn: AccECN option send control Instead of sending the option in every ACK, limit sending to those ACKs where the option is necessary:
- Handshake
- "Change-triggered ACK" + the ACK following it. The 2nd ACK is necessary to unambiguously indicate which of the ECN byte counters is increasing. The first ACK has two counters increasing due to the ecnfield edge.
- ACKs with CE to allow CEP delta validations to take advantage of the option.
- Force option to be sent at least once per 2^22 bytes. The check is done using the bit edges of the byte counters (avoids need for extra variables).
- AccECN option beacon to send a few times per RTT even if nothing in the ECN state requires that. The default is 3 times per RTT, and its period can be set via sysctl_tcp_ecn_option_beacon.
Below are the pahole outcomes before and after this patch, in which the group size of tcp_sock_write_tx is increased from 89 to 97 due to the new u64 accecn_opt_tstamp member:
[BEFORE THIS PATCH]
struct tcp_sock {
    [...]
    u64 tcp_wstamp_ns; /* 2488 8 */
    struct list_head tsorted_sent_queue; /* 2496 16 */
    [...]
    __cacheline_group_end__tcp_sock_write_tx[0]; /* 2521 0 */
    __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */
    u8 nonagle:4; /* 2521: 0 1 */
    u8 rate_app_limited:1; /* 2521: 4 1 */
    /* XXX 3 bits hole, try to pack */
    /* Force alignment to the next boundary: */
    u8 :0;
    u8 received_ce_pending:4; /* 2522: 0 1 */
    u8 unused2:4; /* 2522: 4 1 */
    u8 accecn_minlen:2; /* 2523: 0 1 */
    u8 est_ecnfield:2; /* 2523: 2 1 */
    u8 unused3:4; /* 2523: 4 1 */
    [...]
    __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2628 0 */
    [...]
    /* size: 3200, cachelines: 50, members: 171 */
}
[AFTER THIS PATCH]
struct tcp_sock {
    [...]
    u64 tcp_wstamp_ns; /* 2488 8 */
    u64 accecn_opt_tstamp; /* 2496 8 */
    struct list_head tsorted_sent_queue; /* 2504 16 */
    [...]
    __cacheline_group_end__tcp_sock_write_tx[0]; /* 2529 0 */
    __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2529 0 */
    u8 nonagle:4; /* 2529: 0 1 */
    u8 rate_app_limited:1; /* 2529: 4 1 */
    /* XXX 3 bits hole, try to pack */
    /* Force alignment to the next boundary: */
    u8 :0;
    u8 received_ce_pending:4; /* 2530: 0 1 */
    u8 unused2:4; /* 2530: 4 1 */
    u8 accecn_minlen:2; /* 2531: 0 1 */
    u8 est_ecnfield:2; /* 2531: 2 1 */
    u8 accecn_opt_demand:2; /* 2531: 4 1 */
    u8 prev_ecnfield:2; /* 2531: 6 1 */
    [...]
    __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2636 0 */
    [...]
    /* size: 3200, cachelines: 50, members: 173 */
}
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Co-developed-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Ilpo Järvinen <ij@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-8-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:52 +02:00
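The "bit edges of the byte counters" check mentioned in the commit above can be sketched without any extra state. The helper below is hypothetical and only illustrates the technique: two counter values lie in different 2^22-byte windows exactly when they differ in some bit at position 22 or higher, so XOR followed by a shift detects the crossing.

```c
#include <stdbool.h>
#include <stdint.h>

#define FORCE_SHIFT_SKETCH 22	/* "once per 2^22 bytes" from the commit text */

/*
 * Returns true when a byte counter moved from one 2^22-byte window into
 * another between old_bytes and new_bytes, using only the two values.
 */
static bool crossed_bit_edge(uint32_t old_bytes, uint32_t new_bytes)
{
	return ((old_bytes ^ new_bytes) >> FORCE_SHIFT_SKETCH) != 0;
}
```

Because the comparison is on bit patterns rather than magnitudes, a wrap of the 32-bit counter is also reported as a crossing.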
Ilpo Järvinen	b5e74132df	tcp: accecn: AccECN option The Accurate ECN allows echoing back the sum of bytes for each IP ECN field value in the received packets using AccECN option. This change implements AccECN option tx & rx side processing without option send control related features that are added by a later change. Based on specification: https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt (Some features of the spec will be added in the later changes rather than in this one). A full-length AccECN option is always attempted but if it does not fit, the minimum length is selected based on the counters that have changed since the last update. The AccECN option (with 24-bit fields) often ends in odd sizes so the option write code tries to take advantage of some nop used to pad the other TCP options. The delivered_ecn_bytes pairs with received_ecn_bytes similar to how delivered_ce pairs with received_ce. In contrast to ACE field, however, the option is not always available to update delivered_ecn_bytes. For ACK w/o AccECN option, the delivered bytes calculated based on the cumulative ACK+SACK information are assigned to one of the counters using an estimation heuristic to select the most likely ECN byte counter. Any estimation error is corrected when the next AccECN option arrives. It may occur that the heuristic gets too confused when there are enough different byte counter deltas between ACKs with the AccECN option in which case the heuristic just gives up on updating the counters for a while. tcp_ecn_option sysctl can be used to select option sending mode for AccECN: TCP_ECN_OPTION_DISABLED, TCP_ECN_OPTION_MINIMUM, and TCP_ECN_OPTION_FULL. This patch increases the size of tcp_info struct, as there are no existing holes for new u32 variables. Below are the pahole outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_info {
    [...]
    __u32 tcpi_total_rto_time; /* 244 4 */
    /* size: 248, cachelines: 4, members: 61 */
}
[AFTER THIS PATCH]
struct tcp_info {
    [...]
    __u32 tcpi_total_rto_time; /* 244 4 */
    __u32 tcpi_received_ce; /* 248 4 */
    __u32 tcpi_delivered_e1_bytes; /* 252 4 */
    __u32 tcpi_delivered_e0_bytes; /* 256 4 */
    __u32 tcpi_delivered_ce_bytes; /* 260 4 */
    __u32 tcpi_received_e1_bytes; /* 264 4 */
    __u32 tcpi_received_e0_bytes; /* 268 4 */
    __u32 tcpi_received_ce_bytes; /* 272 4 */
    /* size: 280, cachelines: 5, members: 68 */
}
This patch uses the existing 1-byte holes in the tcp_sock_write_txrx group for new u8 members, but adds a 4-byte hole in tcp_sock_write_rx group after the new u32 delivered_ecn_bytes[3] member. Therefore, the group size of tcp_sock_write_rx is increased from 96 to 112. Below are the pahole outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_sock {
    [...]
    u8 received_ce_pending:4; /* 2522: 0 1 */
    u8 unused2:4; /* 2522: 4 1 */
    /* XXX 1 byte hole, try to pack */
    [...]
    u32 rcv_rtt_last_tsecr; /* 2668 4 */
    [...]
    __cacheline_group_end__tcp_sock_write_rx[0]; /* 2728 0 */
    [...]
    /* size: 3200, cachelines: 50, members: 167 */
}
[AFTER THIS PATCH]
struct tcp_sock {
    [...]
    u8 received_ce_pending:4; /* 2522: 0 1 */
    u8 unused2:4; /* 2522: 4 1 */
    u8 accecn_minlen:2; /* 2523: 0 1 */
    u8 est_ecnfield:2; /* 2523: 2 1 */
    u8 unused3:4; /* 2523: 4 1 */
    [...]
    u32 rcv_rtt_last_tsecr; /* 2668 4 */
    u32 delivered_ecn_bytes[3]; /* 2672 12 */
    /* XXX 4 bytes hole, try to pack */
    [...]
    __cacheline_group_end__tcp_sock_write_rx[0]; /* 2744 0 */
    [...]
    /* size: 3200, cachelines: 50, members: 171 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Neal Cardwell <ncardwell@google.com> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-7-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:52 +02:00
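The remark in the commit above that an option with 24-bit fields "often ends in odd sizes" follows from simple arithmetic: a TCP option has a 2-byte kind/length header, and each field adds 3 bytes. The sketch below uses hypothetical helper names and deliberately ignores any initial counter offsets the specification defines; it only shows the length calculation and how the low 24 bits of a 32-bit byte counter could be written in network byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Option length in bytes: kind + length octets, then 3 bytes per field. */
static size_t accecn_option_len(unsigned int num_fields)
{
	return 2 + 3 * (size_t)num_fields;
}

/* Store the low 24 bits of a byte counter, most significant byte first. */
static void put_be24(uint8_t *dst, uint32_t counter)
{
	dst[0] = (counter >> 16) & 0xff;
	dst[1] = (counter >> 8) & 0xff;
	dst[2] = counter & 0xff;
}
```

With one, two or three fields the lengths are 5, 8 and 11 bytes, so two of the three sizes need padding to reach a 4-byte boundary, which is where reusing NOPs from other options helps.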
Ilpo Järvinen	77a4fdf43c	tcp: sack option handling improvements 1) Don't early return when sack doesn't fit. AccECN code will be placed after this fragment so no early returns please. 2) Make sure opts->num_sack_blocks is not left undefined. E.g., tcp_current_mss() does not memset its opts struct to zero. AccECN code checks if SACK option is present and may even alter it to make room for AccECN option when many SACK blocks are present. Thus, num_sack_blocks needs to be always valid. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-6-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:52 +02:00
Ilpo Järvinen	a92543d597	tcp: accecn: AccECN needs to know delivered bytes AccECN byte counter estimation requires delivered bytes which can be calculated while processing SACK blocks and cumulative ACK. The delivered bytes will be used to estimate the byte counters between AccECN option (on ACKs w/o the option). Accurate ECN does not depend on SACK to function; however, the calculation would be more accurate if SACK were there. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-5-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:52 +02:00
Ilpo Järvinen	9a01127744	tcp: accecn: add AccECN rx byte counters These three byte counters track IP ECN field payload byte sums for all arriving (acceptable) packets for ECT0, ECT1, and CE. The AccECN option (added by a later patch in the series) echoes these counters back to sender side; therefore, it is placed within the group of tcp_sock_write_txrx. Below are the pahole outcomes before and after this patch, in which the group size of tcp_sock_write_txrx is increased from 95 + 4 to 107 + 4 and an extra 4-byte hole is created but will be exploited in later patches:
[BEFORE THIS PATCH]
struct tcp_sock {
    [...]
    u32 delivered_ce; /* 2576 4 */
    u32 received_ce; /* 2580 4 */
    u32 app_limited; /* 2584 4 */
    u32 rcv_wnd; /* 2588 4 */
    struct tcp_options_received rx_opt; /* 2592 24 */
    __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2616 0 */
    [...]
    /* size: 3200, cachelines: 50, members: 166 */
}
[AFTER THIS PATCH]
struct tcp_sock {
    [...]
    u32 delivered_ce; /* 2576 4 */
    u32 received_ce; /* 2580 4 */
    u32 received_ecn_bytes[3]; /* 2584 12 */
    u32 app_limited; /* 2596 4 */
    u32 rcv_wnd; /* 2600 4 */
    struct tcp_options_received rx_opt; /* 2604 24 */
    __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2628 0 */
    /* XXX 4 bytes hole, try to pack */
    [...]
    /* size: 3200, cachelines: 50, members: 167 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Neal Cardwell <ncardwell@google.com> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-4-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:51 +02:00
Ilpo Järvinen	3cae34274c	tcp: accecn: AccECN negotiation Accurate ECN negotiation parts based on the specification: https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt Accurate ECN is negotiated using ECE, CWR and AE flags in the TCP header. TCP falls back into using RFC3168 ECN if one of the ends supports only RFC3168-style ECN. The AccECN negotiation includes reflecting IP ECN field value seen in SYN and SYNACK back using the same bits as negotiation to allow responding to SYN CE marks and to detect ECN field mangling. CE marks should not occur currently because SYN=1 segments are sent with Non-ECT in IP ECN field (but proposal exists to remove this restriction). Reflecting SYN IP ECN field in SYNACK is relatively simple. Reflecting SYNACK IP ECN field in the final/third ACK of the handshake is more challenging. Linux TCP code is not well prepared for using the final/third ACK as a signalling channel which makes things somewhat complicated here. tcp_ecn sysctl can be used to select the highest ECN variant (Accurate ECN, ECN, No ECN) that is attempted to be negotiated and requested for incoming connection and outgoing connection: TCP_ECN_IN_NOECN_OUT_NOECN, TCP_ECN_IN_ECN_OUT_ECN, TCP_ECN_IN_ECN_OUT_NOECN, TCP_ECN_IN_ACCECN_OUT_ACCECN, TCP_ECN_IN_ACCECN_OUT_ECN, and TCP_ECN_IN_ACCECN_OUT_NOECN. After this patch, the size of tcp_request_sock remains unchanged and no new holes are added. Below are the pahole outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_request_sock {
    [...]
    u32 rcv_nxt; /* 352 4 */
    u8 syn_tos; /* 356 1 */
    /* size: 360, cachelines: 6, members: 16 */
}
[AFTER THIS PATCH]
struct tcp_request_sock {
    [...]
    u32 rcv_nxt; /* 352 4 */
    u8 syn_tos; /* 356 1 */
    bool accecn_ok; /* 357 1 */
    u8 syn_ect_snt:2; /* 358: 0 1 */
    u8 syn_ect_rcv:2; /* 358: 2 1 */
    u8 accecn_fail_mode:4; /* 358: 4 1 */
    /* size: 360, cachelines: 6, members: 20 */
}
After this patch, the size of tcp_sock remains unchanged and no new holes are added. Also, 4 bits of the existing 2-byte hole are exploited. Below are the pahole outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_sock {
    [...]
    u8 dup_ack_counter:2; /* 2761: 0 1 */
    u8 tlp_retrans:1; /* 2761: 2 1 */
    u8 unused:5; /* 2761: 3 1 */
    u8 thin_lto:1; /* 2762: 0 1 */
    u8 fastopen_connect:1; /* 2762: 1 1 */
    u8 fastopen_no_cookie:1; /* 2762: 2 1 */
    u8 fastopen_client_fail:2; /* 2762: 3 1 */
    u8 frto:1; /* 2762: 5 1 */
    /* XXX 2 bits hole, try to pack */
    [...]
    u8 keepalive_probes; /* 2765 1 */
    /* XXX 2 bytes hole, try to pack */
    [...]
    /* size: 3200, cachelines: 50, members: 164 */
}
[AFTER THIS PATCH]
struct tcp_sock {
    [...]
    u8 dup_ack_counter:2; /* 2761: 0 1 */
    u8 tlp_retrans:1; /* 2761: 2 1 */
    u8 syn_ect_snt:2; /* 2761: 3 1 */
    u8 syn_ect_rcv:2; /* 2761: 5 1 */
    u8 thin_lto:1; /* 2761: 7 1 */
    u8 fastopen_connect:1; /* 2762: 0 1 */
    u8 fastopen_no_cookie:1; /* 2762: 1 1 */
    u8 fastopen_client_fail:2; /* 2762: 2 1 */
    u8 frto:1; /* 2762: 4 1 */
    /* XXX 3 bits hole, try to pack */
    [...]
    u8 keepalive_probes; /* 2765 1 */
    u8 accecn_fail_mode:4; /* 2766: 0 1 */
    /* XXX 4 bits hole, try to pack */
    /* XXX 1 byte hole, try to pack */
    [...]
    /* size: 3200, cachelines: 50, members: 166 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org> Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-3-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:51 +02:00
Ilpo Järvinen	542a495cba	tcp: AccECN core This change implements Accurate ECN without negotiation and AccECN Option (that will be added by later changes). Based on AccECN specifications: https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt Accurate ECN allows feeding back the number of CE (congestion experienced) marks accurately to the sender in contrast to RFC3168 ECN that can only signal one marks-seen-yes/no per RTT. Congestion control algorithms can take advantage of the accurate ECN information to fine-tune their congestion response to avoid drastic rate reduction when only mild congestion is encountered. With Accurate ECN, tp->received_ce (r.cep in AccECN spec) keeps track of how many segments have arrived with a CE mark. Accurate ECN uses ACE field (ECE, CWR, AE) to communicate the value back to the sender which updates tp->delivered_ce (s.cep) based on the feedback. This signalling channel is lossy when ACE field overflow occurs. Conservative strategy is selected here to deal with the ACE overflow, however, some strategies using the AccECN option later in the overall patchset mitigate against false overflows detected. The ACE field values on the wire are offset by TCP_ACCECN_CEP_INIT_OFFSET. Delivered_ce/received_ce count the real CE marks rather than forcing all downstream users to adapt to the wire offset. This patch uses the first 1-byte hole and the last 4-byte hole of the tcp_sock_write_txrx for 'received_ce_pending' and 'received_ce'. Also, the group size of tcp_sock_write_txrx is increased from 91 + 4 to 95 + 4 due to the new u32 received_ce member. Below are the trimmed pahole outcomes before and after this patch.
[BEFORE THIS PATCH]
struct tcp_sock {
    [...]
    __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */
    u8 nonagle:4; /* 2521: 0 1 */
    u8 rate_app_limited:1; /* 2521: 4 1 */
    /* XXX 3 bits hole, try to pack */
    /* XXX 2 bytes hole, try to pack */
    [...]
    u32 delivered_ce; /* 2576 4 */
    u32 app_limited; /* 2580 4 */
    u32 rcv_wnd; /* 2584 4 */
    struct tcp_options_received rx_opt; /* 2588 24 */
    __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2612 0 */
    /* XXX 4 bytes hole, try to pack */
    [...]
    /* size: 3200, cachelines: 50, members: 161 */
}
[AFTER THIS PATCH]
struct tcp_sock {
    [...]
    __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */
    u8 nonagle:4; /* 2521: 0 1 */
    u8 rate_app_limited:1; /* 2521: 4 1 */
    /* XXX 3 bits hole, try to pack */
    /* Force alignment to the next boundary: */
    u8 :0;
    u8 received_ce_pending:4; /* 2522: 0 1 */
    u8 unused2:4; /* 2522: 4 1 */
    /* XXX 1 byte hole, try to pack */
    [...]
    u32 delivered_ce; /* 2576 4 */
    u32 received_ce; /* 2580 4 */
    u32 app_limited; /* 2584 4 */
    u32 rcv_wnd; /* 2588 4 */
    struct tcp_options_received rx_opt; /* 2592 24 */
    __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2616 0 */
    [...]
    /* size: 3200, cachelines: 50, members: 164 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org> Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-2-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:51 +02:00
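The relationship between the real CE counts and the offset ACE field described in the commit above can be sketched in a few lines. This is an illustration under stated assumptions, not the kernel code: the helper names are hypothetical, the field is taken to be 3 bits wide because it is built from the ECE, CWR and AE flags, and the value 5 for the offset is an assumption taken from the AccECN specification's initial counter value.

```c
#include <stdint.h>

#define ACE_MASK_SKETCH		0x7u	/* 3-bit field: ECE, CWR, AE */
#define CEP_INIT_OFFSET_SKETCH	5u	/* assumed TCP_ACCECN_CEP_INIT_OFFSET */

/* Value the receiver would place in the ACE field for a real CE count. */
static uint32_t ace_on_wire(uint32_t received_ce)
{
	return (received_ce + CEP_INIT_OFFSET_SKETCH) & ACE_MASK_SKETCH;
}

/*
 * Smallest number of new CE marks consistent with the ACE value seen,
 * given the sender's current real count. Any multiple of 8 further marks
 * is indistinguishable, which is why the channel is lossy on overflow.
 */
static uint32_t cep_delta(uint32_t delivered_ce, uint32_t ace)
{
	return (ace - ace_on_wire(delivered_ce)) & ACE_MASK_SKETCH;
}
```

If eight or more CE marks arrive between two ACKs, `cep_delta()` under-reports them; this is the overflow case for which the commit chooses a conservative strategy.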
Cosmin Ratiu	6bdcb735fe	devlink: Add a 'num_doorbells' driverinit param This parameter can be used by drivers to configure a different number of doorbells. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:30:51 -07:00
Kuniyuki Iwashima	893c49a78d	mptcp: Use __sk_dst_get() and dst_dev_rcu() in mptcp_active_enable(). mptcp_active_enable() is called from subflow_finish_connect(), which is icsk->icsk_af_ops->sk_rx_dst_set() and it's not always under RCU. Using sk_dst_get(sk)->dev could trigger UAF. Let's use __sk_dst_get() and dst_dev_rcu(). Fixes: `27069e7cb3` ("mptcp: disable active MPTCP in case of blackhole") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916214758.650211-8-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:22 -07:00
Kuniyuki Iwashima	108a86c71c	mptcp: Call dst_release() in mptcp_active_enable(). mptcp_active_enable() calls sk_dst_get(), which returns dst with its refcount bumped, but forgot dst_release(). Let's add missing dst_release(). Cc: stable@vger.kernel.org Fixes: `27069e7cb3` ("mptcp: disable active MPTCP in case of blackhole") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916214758.650211-7-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:22 -07:00
Kuniyuki Iwashima	c65f27b9c3	tls: Use __sk_dst_get() and dst_dev_rcu() in get_netdev_for_sock(). get_netdev_for_sock() is called during setsockopt(), so not under RCU. Using sk_dst_get(sk)->dev could trigger UAF. Let's use __sk_dst_get() and dst_dev_rcu(). Note that the only ->ndo_sk_get_lower_dev() user is bond_sk_get_lower_dev(), which uses RCU. Fixes: `e8f6979981` ("net/tls: Add generic NIC offload infrastructure") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250916214758.650211-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:22 -07:00
Kuniyuki Iwashima	0b0e4d51c6	smc: Use __sk_dst_get() and dst_dev_rcu() in smc_vlan_by_tcpsk(). smc_vlan_by_tcpsk() fetches sk_dst_get(sk)->dev before RTNL and passes it to netdev_walk_all_lower_dev(), which is illegal. Also, smc_vlan_by_tcpsk_walk() does not require RTNL at all. Let's use __sk_dst_get(), dst_dev_rcu(), and netdev_walk_all_lower_dev_rcu(). Note that the returned value of smc_vlan_by_tcpsk() is not used in the caller. Fixes: `0cfdd8f92c` ("smc: connection and link group creation") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916214758.650211-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:22 -07:00
Kuniyuki Iwashima	235f81045c	smc: Use __sk_dst_get() and dst_dev_rcu() in smc_clc_prfx_match(). smc_clc_prfx_match() is called from smc_listen_work() and not under RCU nor RTNL. Using sk_dst_get(sk)->dev could trigger UAF. Let's use __sk_dst_get() and dst_dev_rcu(). Note that the returned value of smc_clc_prfx_match() is not used in the caller. Fixes: `a046d57da1` ("smc: CLC handshake (incl. preparation steps)") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916214758.650211-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:22 -07:00
Kuniyuki Iwashima	935d783e5d	smc: Use __sk_dst_get() and dst_dev_rcu() in smc_clc_prfx_set(). smc_clc_prfx_set() is called during connect() and not under RCU nor RTNL. Using sk_dst_get(sk)->dev could trigger UAF. Let's use __sk_dst_get() and dst_dev_rcu() under rcu_read_lock() after kernel_getsockname(). Note that the returned value of smc_clc_prfx_set() is not used in the caller. While at it, we change the 1st arg of smc_clc_prfx_set[46]_rcu() not to touch dst there. Fixes: `a046d57da1` ("smc: CLC handshake (incl. preparation steps)") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916214758.650211-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:21 -07:00
Kuniyuki Iwashima	3d3466878a	smc: Fix use-after-free in __pnet_find_base_ndev(). syzbot reported use-after-free of net_device in __pnet_find_base_ndev(), which was called during connect(). [0] smc_pnet_find_ism_resource() fetches sk_dst_get(sk)->dev and passes down to pnet_find_base_ndev(), where RTNL is held. Then, UAF happened at __pnet_find_base_ndev() when the dev is first used. This means dev had already been freed before acquiring RTNL in pnet_find_base_ndev(). While dev is going away, dst->dev could be swapped with blackhole_netdev, and the dev's refcnt by dst will be released. We must hold dev's refcnt before calling smc_pnet_find_ism_resource(). Also, smc_pnet_find_roce_resource() has the same problem. Let's use __sk_dst_get() and dst_dev_rcu() in the two functions. [0]: BUG: KASAN: use-after-free in __pnet_find_base_ndev+0x1b1/0x1c0 net/smc/smc_pnet.c:926 Read of size 1 at addr ffff888036bac33a by task syz.0.3632/18609 CPU: 1 UID: 0 PID: 18609 Comm: syz.0.3632 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025 Call Trace: <TASK> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xca/0x240 mm/kasan/report.c:482 kasan_report+0x118/0x150 mm/kasan/report.c:595 __pnet_find_base_ndev+0x1b1/0x1c0 net/smc/smc_pnet.c:926 pnet_find_base_ndev net/smc/smc_pnet.c:946 [inline] smc_pnet_find_ism_by_pnetid net/smc/smc_pnet.c:1103 [inline] smc_pnet_find_ism_resource+0xef/0x390 net/smc/smc_pnet.c:1154 smc_find_ism_device net/smc/af_smc.c:1030 [inline] smc_find_proposal_devices net/smc/af_smc.c:1115 [inline] __smc_connect+0x372/0x1890 net/smc/af_smc.c:1545 smc_connect+0x877/0xd90 net/smc/af_smc.c:1715 __sys_connect_file net/socket.c:2086 [inline] __sys_connect+0x313/0x440 net/socket.c:2105 __do_sys_connect net/socket.c:2111 [inline] __se_sys_connect net/socket.c:2108 [inline] __x64_sys_connect+0x7a/0x90 net/socket.c:2108 
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f47cbf8eba9 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f47ccdb1038 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 00007f47cc1d5fa0 RCX: 00007f47cbf8eba9 RDX: 0000000000000010 RSI: 0000200000000280 RDI: 000000000000000b RBP: 00007f47cc011e19 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f47cc1d6038 R14: 00007f47cc1d5fa0 R15: 00007ffc512f8aa8 </TASK> The buggy address belongs to the physical page: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff888036bacd00 pfn:0x36bac flags: 0xfff00000000000(node=0\|zone=1\|lastcpupid=0x7ff) raw: 00fff00000000000 ffffea0001243d08 ffff8880b863fdc0 0000000000000000 raw: ffff888036bacd00 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as freed page last allocated via order 2, migratetype Unmovable, gfp_mask 0x446dc0(GFP_KERNEL_ACCOUNT\|__GFP_ZERO\|__GFP_NOWARN\|__GFP_RETRY_MAYFAIL\|__GFP_COMP), pid 16741, tgid 16741 (syz-executor), ts 343313197788, free_ts 380670750466 set_page_owner include/linux/page_owner.h:32 [inline] post_alloc_hook+0x240/0x2a0 mm/page_alloc.c:1851 prep_new_page mm/page_alloc.c:1859 [inline] get_page_from_freelist+0x21e4/0x22c0 mm/page_alloc.c:3858 __alloc_frozen_pages_noprof+0x181/0x370 mm/page_alloc.c:5148 alloc_pages_mpol+0x232/0x4a0 mm/mempolicy.c:2416 ___kmalloc_large_node+0x5f/0x1b0 mm/slub.c:4317 __kmalloc_large_node_noprof+0x18/0x90 mm/slub.c:4348 __do_kmalloc_node mm/slub.c:4364 [inline] __kvmalloc_node_noprof+0x6d/0x5f0 mm/slub.c:5067 alloc_netdev_mqs+0xa3/0x11b0 net/core/dev.c:11812 
tun_set_iff+0x532/0xef0 drivers/net/tun.c:2775 __tun_chr_ioctl+0x788/0x1df0 drivers/net/tun.c:3085 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:598 [inline] __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:584 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f page last free pid 18610 tgid 18608 stack trace: reset_page_owner include/linux/page_owner.h:25 [inline] free_pages_prepare mm/page_alloc.c:1395 [inline] __free_frozen_pages+0xbc4/0xd30 mm/page_alloc.c:2895 free_large_kmalloc+0x13a/0x1f0 mm/slub.c:4820 device_release+0x99/0x1c0 drivers/base/core.c:-1 kobject_cleanup lib/kobject.c:689 [inline] kobject_release lib/kobject.c:720 [inline] kref_put include/linux/kref.h:65 [inline] kobject_put+0x22b/0x480 lib/kobject.c:737 netdev_run_todo+0xd2e/0xea0 net/core/dev.c:11513 rtnl_unlock net/core/rtnetlink.c:157 [inline] rtnl_net_unlock include/linux/rtnetlink.h:135 [inline] rtnl_dellink+0x537/0x710 net/core/rtnetlink.c:3563 rtnetlink_rcv_msg+0x7cc/0xb70 net/core/rtnetlink.c:6946 netlink_rcv_skb+0x208/0x470 net/netlink/af_netlink.c:2552 netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline] netlink_unicast+0x82f/0x9e0 net/netlink/af_netlink.c:1346 netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1896 sock_sendmsg_nosec net/socket.c:714 [inline] __sock_sendmsg+0x219/0x270 net/socket.c:729 ____sys_sendmsg+0x505/0x830 net/socket.c:2614 ___sys_sendmsg+0x21f/0x2a0 net/socket.c:2668 __sys_sendmsg net/socket.c:2700 [inline] __do_sys_sendmsg net/socket.c:2705 [inline] __se_sys_sendmsg net/socket.c:2703 [inline] __x64_sys_sendmsg+0x19b/0x260 net/socket.c:2703 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Memory state around the buggy address: ffff888036bac200: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff888036bac280: ff ff ff ff ff ff ff ff ff ff ff ff ff 
ff ff ff >ffff888036bac300: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ ffff888036bac380: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff888036bac400: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff Fixes: `0afff91c6f` ("net/smc: add pnetid support") Fixes: `1619f77058` ("net/smc: add pnetid support for SMC-D and ISM") Reported-by: syzbot+ea28e9d85be2f327b6c6@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68c237c7.050a0220.3c6139.0036.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916214758.650211-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:21 -07:00
Jakub Kicinski	934da21f99	Merge tag 'wireless-2025-09-17' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Just two fixes: - fix crash in rfkill due to uninitialized type_name - fix aggregation in iwlwifi 7000/8000 devices * tag 'wireless-2025-09-17' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: net: rfkill: gpio: Fix crash due to dereferencering uninitialized pointer wifi: iwlwifi: pcie: fix byte count table for some devices ==================== Link: https://patch.msgid.link/20250917105159.161583-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 16:12:46 -07:00
Kuniyuki Iwashima	45c8a6cc2b	tcp: Clear tcp_sk(sk)->fastopen_rsk in tcp_disconnect(). syzbot reported the splat below where a socket had tcp_sk(sk)->fastopen_rsk in the TCP_ESTABLISHED state. [0] syzbot reused the server-side TCP Fast Open socket as a new client before the TFO socket completes 3WHS: 1. accept() 2. connect(AF_UNSPEC) 3. connect() to another destination As of accept(), sk->sk_state is TCP_SYN_RECV, and tcp_disconnect() changes it to TCP_CLOSE and makes connect() possible, which restarts timers. Since tcp_disconnect() forgot to clear tcp_sk(sk)->fastopen_rsk, the retransmit timer triggered the warning and the intended packet was not retransmitted. Let's call reqsk_fastopen_remove() in tcp_disconnect(). [0]: WARNING: CPU: 2 PID: 0 at net/ipv4/tcp_timer.c:542 tcp_retransmit_timer (net/ipv4/tcp_timer.c:542 (discriminator 7)) Modules linked in: CPU: 2 UID: 0 PID: 0 Comm: swapper/2 Not tainted 6.17.0-rc5-g201825fb4278 #62 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:tcp_retransmit_timer (net/ipv4/tcp_timer.c:542 (discriminator 7)) Code: 41 55 41 54 55 53 48 8b af b8 08 00 00 48 89 fb 48 85 ed 0f 84 55 01 00 00 0f b6 47 12 3c 03 74 0c 0f b6 47 12 3c 04 74 04 90 <0f> 0b 90 48 8b 85 c0 00 00 00 48 89 ef 48 8b 40 30 e8 6a 4f 06 3e RSP: 0018:ffffc900002f8d40 EFLAGS: 00010293 RAX: 0000000000000002 RBX: ffff888106911400 RCX: 0000000000000017 RDX: 0000000002517619 RSI: ffffffff83764080 RDI: ffff888106911400 RBP: ffff888106d5c000 R08: 0000000000000001 R09: ffffc900002f8de8 R10: 00000000000000c2 R11: ffffc900002f8ff8 R12: ffff888106911540 R13: ffff888106911480 R14: ffff888106911840 R15: ffffc900002f8de0 FS: 0000000000000000(0000) GS:ffff88907b768000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8044d69d90 CR3: 0000000002c30003 CR4: 0000000000370ef0 Call Trace: <IRQ> tcp_write_timer (net/ipv4/tcp_timer.c:738) call_timer_fn 
(kernel/time/timer.c:1747) __run_timers (kernel/time/timer.c:1799 kernel/time/timer.c:2372) timer_expire_remote (kernel/time/timer.c:2385 kernel/time/timer.c:2376 kernel/time/timer.c:2135) tmigr_handle_remote_up (kernel/time/timer_migration.c:944 kernel/time/timer_migration.c:1035) __walk_groups.isra.0 (kernel/time/timer_migration.c:533 (discriminator 1)) tmigr_handle_remote (kernel/time/timer_migration.c:1096) handle_softirqs (./arch/x86/include/asm/jump_label.h:36 ./include/trace/events/irq.h:142 kernel/softirq.c:580) irq_exit_rcu (kernel/softirq.c:614 kernel/softirq.c:453 kernel/softirq.c:680 kernel/softirq.c:696) sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1050 (discriminator 35) arch/x86/kernel/apic/apic.c:1050 (discriminator 35)) </IRQ> Fixes: `8336886f78` ("tcp: TCP Fast Open Server - support TFO listeners") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250915175800.118793-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 16:01:52 -07:00
Paul Chaignon	6fabca2fc9	bpf: Explicitly check accesses to bpf_sock_addr Syzkaller found a kernel warning on the following sock_addr program: 0: r0 = 0 1: r2 = (u32 )(r1 +60) 2: exit which triggers: verifier bug: error during ctx access conversion (0) This is happening because offset 60 in bpf_sock_addr corresponds to an implicit padding of 4 bytes, right after msg_src_ip4. Access to this padding isn't rejected in sock_addr_is_valid_access and it thus later fails to convert the access. This patch fixes it by explicitly checking the various fields of bpf_sock_addr in sock_addr_is_valid_access. I checked the other ctx structures and is_valid_access functions and didn't find any other similar cases. Other cases of (properly handled) padding are covered in new tests in a subsequent patch. Fixes: `1cedee13d2` ("bpf: Hooks for sys_sendmsg") Reported-by: syzbot+136ca59d411f92e821b7@syzkaller.appspotmail.com Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Closes: https://syzkaller.appspot.com/bug?extid=136ca59d411f92e821b7 Link: https://lore.kernel.org/bpf/b58609d9490649e76e584b0361da0abd3c2c1779.1758094761.git.paul.chaignon@gmail.com	2025-09-17 16:15:17 +02:00
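The padding problem described in the commit above can be sketched in user-space C. This is only an illustration: `struct demo_ctx` and `demo_is_valid_access()` are hypothetical names, not the real `bpf_sock_addr` layout or the kernel's `sock_addr_is_valid_access()`. It shows how an 8-byte-aligned member leaves a 4-byte implicit hole at offset 60, and how a per-field check rejects an access to that hole.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical context struct: fifteen 32-bit fields followed by an
 * 8-byte-aligned member, so bytes 60..63 are implicit padding. */
struct demo_ctx {
	uint32_t words[15];		/* offsets 0..59 */
	_Alignas(8) uint64_t ptr;	/* offset 64 */
};

/* Accept an access only if it lies entirely within one named field. */
static bool demo_is_valid_access(size_t off, size_t size)
{
	if (off < offsetof(struct demo_ctx, ptr)) {
		/* 32-bit fields: aligned 4-byte reads inside the array only;
		 * anything reaching into the padding is rejected. */
		return size == sizeof(uint32_t) &&
		       off % sizeof(uint32_t) == 0 &&
		       off + size <= sizeof(((struct demo_ctx *)0)->words);
	}
	/* The pointer-sized member must be read whole. */
	return off == offsetof(struct demo_ctx, ptr) &&
	       size == sizeof(uint64_t);
}
```

A range-based check such as "offset is below the struct size" would accept offset 60 here, which is the kind of gap the commit closes by checking fields explicitly.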
Hans de Goede	b6f56a44e4	net: rfkill: gpio: Fix crash due to dereferencering uninitialized pointer Since commit `7d5e9737ef` ("net: rfkill: gpio: get the name and type from device property") rfkill_find_type() gets called with the possibly uninitialized "const char *type_name;" local variable. On x86 systems when rfkill-gpio binds to a "BCM4752" or "LNV4752" acpi_device, the rfkill->type is set based on the ACPI acpi_device_id: rfkill->type = (unsigned)id->driver_data; and there is no "type" property so device_property_read_string() will fail and leave type_name uninitialized, leading to a potential crash. rfkill_find_type() does accept a NULL pointer, fix the potential crash by initializing type_name to NULL. Note that this has likely not been caught so far because: 1. Not many x86 machines actually have a "BCM4752"/"LNV4752" acpi_device 2. The stack happened to contain NULL where type_name is stored Fixes: `7d5e9737ef` ("net: rfkill: gpio: get the name and type from device property") Cc: stable@vger.kernel.org Cc: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Hans de Goede <hansg@kernel.org> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Link: https://patch.msgid.link/20250913113515.21698-1-hansg@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2025-09-17 12:37:05 +02:00
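A minimal user-space sketch of the pattern behind the rfkill fix above, assuming a reader function that leaves its output untouched on failure. `demo_read_string()`, `demo_find_type()` and `demo_probe()` are hypothetical stand-ins, not the kernel's `device_property_read_string()` or `rfkill_find_type()`; only the "initialize to NULL because the consumer tolerates NULL" idea is taken from the commit message.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical property read: fails without touching *out when the
 * property is absent (modelled here by a NULL value). */
static int demo_read_string(const char *value, const char **out)
{
	if (!value)
		return -1;	/* *out is left unmodified on failure */
	*out = value;
	return 0;
}

/* Hypothetical lookup that tolerates a NULL name. */
static int demo_find_type(const char *name)
{
	if (!name)
		return 0;	/* "unknown" type */
	return strcmp(name, "gps") == 0 ? 1 : 0;
}

static int demo_probe(const char *prop)
{
	const char *type_name = NULL;	/* the fix: initialize before the read */

	/* The return value is deliberately ignored, as a failed read
	 * simply leaves type_name at NULL. */
	demo_read_string(prop, &type_name);
	return demo_find_type(type_name);
}
```

Without the initializer, a failed read would hand an indeterminate pointer to the lookup, which is undefined behaviour in C and a potential crash in the driver.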
Jakub Kicinski	5e87fdc37f	Merge tag 'batadv-next-pullrequest-20250916' of https://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== This cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - Remove network coding support, by Sven Eckelmann (2 patches) - remove includes for extern declarations, by Sven Eckelmann * tag 'batadv-next-pullrequest-20250916' of https://git.open-mesh.org/linux-merge: batman-adv: remove includes for extern declarations batman-adv: keep skb crc32 helper local in BLA batman-adv: remove network coding support batman-adv: Start new development cycle ==================== Link: https://patch.msgid.link/20250916122441.89246-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-16 17:36:03 -07:00
Stefan Wahren	d2d3f529e7	ethernet: Extend device_get_mac_address() to use NVMEM A lot of modern SoC have the ability to store MAC addresses in their NVMEM. So extend the generic function device_get_mac_address() to obtain the MAC address from an nvmem cell named 'mac-address' in case there is no firmware node which contains the MAC address directly. Signed-off-by: Stefan Wahren <wahrenst@gmx.net> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250912140332.35395-3-wahrenst@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:34:08 -07:00
Matthieu Baerts (NGI0)	3f9a22be37	mptcp: pm: netlink: fix if-idx type As pointed out by Donald, when parsing an entry, the wrong type was set for the temp value: this value is signed. There are no real issues here, because the intermediate variable was only wrong for the sign, not for the size, and the final variable had the right sign. But this feels wrong, and is confusing, so fixing this small typo introduced by commit `ef0da3b8a2` ("mptcp: move address attribute into mptcp_addr_info"). Reported-by: Donald Hunter <donald.hunter@gmail.com> Closes: https://lore.kernel.org/m2plc0ui9z.fsf@gmail.com Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250912-net-next-mptcp-minor-fixes-6-18-v1-3-99d179b483ad@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:14:23 -07:00
Jakub Kicinski	f3b52167a0	page_pool: always add GFP_NOWARN for ATOMIC allocations Driver authors often forget to add GFP_NOWARN for page allocation from the datapath. This is annoying to users as OOMs are a fact of life, and we pretty much expect network Rx to hit page allocation failures during OOM. Make page pool add GFP_NOWARN for ATOMIC allocations by default. Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250912161703.361272-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:13:49 -07:00
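The page_pool change above can be pictured as a small flag transformation. The sketch below uses hypothetical flag values and a hypothetical `demo_pool_gfp()` helper; the real GFP bit values differ, and detecting an atomic allocation by a mask-equality test is an assumption made for illustration rather than something stated in the commit message.

```c
#include <stdint.h>

/* Hypothetical flag bits; the kernel's real GFP values differ. */
#define DEMO_GFP_HIGH		0x1u
#define DEMO_GFP_KSWAPD_RECLAIM	0x2u
#define DEMO_GFP_NOWARN		0x4u
#define DEMO_GFP_DIRECT_RECLAIM	0x8u
#define DEMO_GFP_ATOMIC	(DEMO_GFP_HIGH | DEMO_GFP_KSWAPD_RECLAIM)
#define DEMO_GFP_KERNEL	(DEMO_GFP_DIRECT_RECLAIM | DEMO_GFP_KSWAPD_RECLAIM)

/* Add the "no warning" bit whenever every atomic bit is present, so
 * callers on the datapath do not have to remember it themselves. */
static uint32_t demo_pool_gfp(uint32_t gfp)
{
	if ((gfp & DEMO_GFP_ATOMIC) == DEMO_GFP_ATOMIC)
		gfp |= DEMO_GFP_NOWARN;
	return gfp;
}
```

Putting the adjustment in the allocator wrapper rather than in each driver means a forgotten flag in one driver no longer produces allocation-failure warnings during OOM.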
Matthieu Baerts (NGI0)	92da495cb6	mptcp: tfo: record 'deny join id0' info When TFO is used, the check to see if the 'C' flag (deny join id0) was set was bypassed. This flag can be set when TFO is used, so the check should also be done when TFO is used. Note that the set_fully_established label is also used when a 4th ACK is received. In this case, deny_join_id0 will not be set. Fixes: `dfc8d06030` ("mptcp: implement delayed seq generation for passive fastopen") Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250912-net-mptcp-pm-uspace-deny_join_id0-v1-4-40171884ade8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:12:05 -07:00
Matthieu Baerts (NGI0)	2293c57484	mptcp: pm: nl: announce deny-join-id0 flag During the connection establishment, a peer can tell the other one that it cannot establish new subflows to the initial IP address and port by setting the 'C' flag [1]. Doing so makes sense when the sender is behind a strict NAT, operating behind a legacy Layer 4 load balancer, or using anycast IP address for example. When this 'C' flag is set, the path-managers must then not try to establish new subflows to the other peer's initial IP address and port. The in-kernel PM has access to this info, but the userspace PM didn't. The RFC8684 [1] is strict about that: (...) therefore the receiver MUST NOT try to open any additional subflows toward this address and port. So it is important to tell the userspace about that as it is responsible for the respect of this flag. When a new connection is created and established, the Netlink events now contain the existing but not currently used 'flags' attribute. When MPTCP_PM_EV_FLAG_DENY_JOIN_ID0 is set, it means no other subflows to the initial IP address and port -- info that are also part of the event -- can be established. Link: https://datatracker.ietf.org/doc/html/rfc8684#section-3.1-20.6 [1] Fixes: `702c2f646d` ("mptcp: netlink: allow userspace-driven subflow establishment") Reported-by: Marek Majkowski <marek@cloudflare.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/532 Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250912-net-mptcp-pm-uspace-deny_join_id0-v1-2-40171884ade8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:12:05 -07:00
Matthieu Baerts (NGI0)	96939cec99	mptcp: set remote_deny_join_id0 on SYN recv When a SYN containing the 'C' flag (deny join id0) was received, this piece of information was not propagated to the path-manager. Even if this flag is mainly set on the server side, a client can also tell the server it cannot try to establish new subflows to the client's initial IP address and port. The server's PM should then record such info when received, and before sending events about the new connection. Fixes: `df377be387` ("mptcp: add deny_join_id0 in mptcp_options_received") Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250912-net-mptcp-pm-uspace-deny_join_id0-v1-1-40171884ade8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:12:05 -07:00
Matthieu Baerts (NGI0)	f755be0b1f	mptcp: propagate shutdown to subflows when possible When the MPTCP DATA FIN have been ACKed, there is no more MPTCP related metadata to exchange, and all subflows can be safely shutdown. Before this patch, the subflows were actually terminated at 'close()' time. That's certainly fine most of the time, but not when the userspace 'shutdown()' a connection, without close()ing it. When doing so, the subflows were staying in LAST_ACK state on one side -- and consequently in FIN_WAIT2 on the other side -- until the 'close()' of the MPTCP socket. Now, when the DATA FIN have been ACKed, all subflows are shutdown. A consequence of this is that the TCP 'FIN' flag can be set earlier now, but the end result is the same. This affects the packetdrill tests looking at the end of the MPTCP connections, but for a good reason. Note that tcp_shutdown() will check the subflow state, so no need to do that again before calling it. Fixes: `3721b9b646` ("mptcp: Track received DATA_FIN sequence number and add related helpers") Cc: stable@vger.kernel.org Fixes: `16a9a9da17` ("mptcp: Add helper to process acks of DATA_FIN") Reviewed-by: Mat Martineau <martineau@kernel.org> Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250912-net-mptcp-fix-sft-connect-v1-1-d40e77cbbf02@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:10:36 -07:00
Håkon Bugge	4351ca3fcb	rds: ib: Increment i_fastreg_wrs before bailing out We need to increment i_fastreg_wrs before we bail out from rds_ib_post_reg_frmr(). We have a fixed budget of how many FRWR operations that can be outstanding using the dedicated QP used for memory registrations and de-registrations. This budget is enforced by the atomic_t i_fastreg_wrs. If we bail out early in rds_ib_post_reg_frmr(), we will "leak" the possibility of posting an FRWR operation, and if that accumulates, no FRWR operation can be carried out. Fixes: `1659185fb4` ("RDS: IB: Support Fastreg MR (FRMR) memory registration mode") Fixes: `3a2886cca7` ("net/rds: Keep track of and wait for FRWR segments in use upon shutdown") Cc: stable@vger.kernel.org Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20250911133336.451212-1-haakon.bugge@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 16:47:53 -07:00
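The budget leak described in the RDS commit above can be modelled with a C11 atomic counter. This is a sketch under stated assumptions: `demo_budget`, `demo_take_slot()`, `demo_post()` and `demo_complete()` are hypothetical names, and the budget of 2 is arbitrary. It only illustrates that a slot taken from a fixed budget must be returned on every early exit.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical fixed budget of outstanding operations. */
static atomic_int demo_budget = 2;

/* Try to reserve one slot; undo the decrement if none was available. */
static bool demo_take_slot(void)
{
	if (atomic_fetch_sub(&demo_budget, 1) <= 0) {
		atomic_fetch_add(&demo_budget, 1);
		return false;
	}
	return true;
}

/* Post one operation. On an early failure the slot is handed back
 * before returning; skipping that step leaks one slot per failure. */
static int demo_post(bool fail_early)
{
	if (!demo_take_slot())
		return -1;	/* budget exhausted */
	if (fail_early) {
		atomic_fetch_add(&demo_budget, 1);
		return -2;	/* bailed out, slot returned */
	}
	return 0;		/* slot is returned later by demo_complete() */
}

/* Completion handler: the operation finished, free its slot. */
static void demo_complete(void)
{
	atomic_fetch_add(&demo_budget, 1);
}
```

If the `atomic_fetch_add()` in the failure branch were missing, repeated early failures would drain the budget to zero and no further operation could ever be posted.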
Chia-Yu Chang	30f5ca0062	tcp: ecn functions in separated include file The following patches will modify ECN helpers and add AccECN helpers, and this patch moves the existing ones into a separate include file. No functional changes. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250911110642.87529-5-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 16:26:33 -07:00
Chia-Yu Chang	c3426ba2ed	tcp: reorganize tcp_sock_write_txrx group for variables later Use the first 3-byte hole at the beginning of the tcp_sock_write_txrx group for 'nonagle'/'rate_app_limited' to fill in the existing hole in later patches. Therefore, the group size of tcp_sock_write_txrx is reduced from 92 + 4 to 91 + 4. In addition, the group size of tcp_sock_write_rx is changed to 96 to fit in the pahole outcome. Below are the trimmed pahole outcomes before and after this patch: [BEFORE THIS PATCH] struct tcp_sock { [...] __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */ /* XXX 3 bytes hole, try to pack */ [...] struct tcp_options_received rx_opt; /* 2588 24 */ u8 nonagle:4; /* 2612: 0 1 */ u8 rate_app_limited:1; /* 2612: 4 1 */ /* XXX 3 bits hole, try to pack */ __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2613 0 */ /* XXX 3 bytes hole, try to pack */ __cacheline_group_begin__tcp_sock_write_rx[0] __attribute__((__aligned__(8))); /* 2616 0 */ [...] __cacheline_group_end__tcp_sock_write_rx[0]; /* 2712 0 */ [...] /* size: 3200, cachelines: 50, members: 161 */ } [AFTER THIS PATCH] struct tcp_sock { [...] __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */ u8 nonagle:4; /* 2521: 0 1 */ u8 rate_app_limited:1; /* 2521: 4 1 */ /* XXX 3 bits hole, try to pack */ /* XXX 2 bytes hole, try to pack */ [...] struct tcp_options_received rx_opt; /* 2588 24 */ __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2612 0 */ /* XXX 4 bytes hole, try to pack */ __cacheline_group_begin__tcp_sock_write_rx[0] __attribute__((__aligned__(8))); /* 2616 0 */ [...] __cacheline_group_end__tcp_sock_write_rx[0]; /* 2712 0 */ [...] /* size: 3200, cachelines: 50, members: 161 */ } Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250911110642.87529-4-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 16:26:33 -07:00
Ilpo Järvinen	449144f4d5	tcp: reorganize SYN ECN code Prepare for AccECN that needs to have access here on IP ECN field value which is only available after INET_ECN_xmit(). No functional changes. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250911110642.87529-2-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 16:26:33 -07:00
Sabrina Dubroca	91d8a53db2	xfrm: fix offloading of cross-family tunnels Xiumei reported a regression in IPsec offload tests over xfrmi, where the traffic for IPv6 over IPv4 tunnels is processed in SW instead of going through crypto offload, after commit `cc18f482e8` ("xfrm: provide common xdo_dev_offload_ok callback implementation"). Commit `cc18f482e8` added a generic version of existing checks attempting to prevent packets with IPv4 options or IPv6 extension headers from being sent to HW that doesn't support offloading such packets. The check mistakenly uses x->props.family (the outer family) to determine the inner packet's family and verify if options/extensions are present. In the case of IPv6 over IPv4, the check compares some of the traffic class bits to the expected no-options ihl value (5). The original check was introduced in commit `2ac9cfe782` ("net/mlx5e: IPSec, Add Innova IPSec offload TX data path"), and then duplicated in the other drivers. Before commit `cc18f482e8`, the loose check (ihl > 5) passed because those traffic class bits were not set to a value that triggered the no-offload codepath. Packets with options/extension headers that should have been handled in SW went through the offload path, and were likely dropped by the NIC or incorrectly processed. Since commit `cc18f482e8`, the check is now strict (ihl != 5), and in a basic setup (no traffic class configured), all packets go through the no-offload codepath. 
The commits that introduced the incorrect family checks in each driver are: `2ac9cfe782` ("net/mlx5e: IPSec, Add Innova IPSec offload TX data path") `8362ea16f6` ("crypto: chcr - ESN for Inline IPSec Tx") `859a497fe8` ("nfp: implement xfrm callbacks and expose ipsec offload feature to upper layer") `32188be805` ("cn10k-ipsec: Allow ipsec crypto offload for skb with SA") [ixgbe/ixgbevf commits are ignored, as that HW does not support tunnel mode, thus no cross-family setups are possible] Fixes: `cc18f482e8` ("xfrm: provide common xdo_dev_offload_ok callback implementation") Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2025-09-15 11:35:06 +02:00
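The xfrm commit above notes that, for IPv6 over IPv4, the check ends up comparing traffic class bits against the IPv4 no-options ihl value of 5. A small sketch makes that concrete, assuming only the standard header layouts: in IPv4 the low nibble of the first byte is ihl, while in IPv6 the same nibble holds the top four bits of the traffic class. The helper names and sample bytes below are hypothetical, not taken from any driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* First header byte of an IPv4 packet without options (version 4, ihl 5)
 * and of an IPv6 packet with traffic class 0 (version 6). */
static const uint8_t demo_ipv4_no_options[] = { 0x45 };
static const uint8_t demo_ipv6_tclass_zero[] = { 0x60 };

/* Read the low nibble of the first byte as if it were an IPv4 ihl. */
static unsigned int demo_first_byte_ihl(const uint8_t *hdr)
{
	return hdr[0] & 0x0f;
}

/* Loose form of the check: only ihl > 5 sends the packet to software. */
static bool demo_loose_rejects(const uint8_t *hdr)
{
	return demo_first_byte_ihl(hdr) > 5;
}

/* Strict form of the check: anything other than ihl == 5 is rejected. */
static bool demo_strict_rejects(const uint8_t *hdr)
{
	return demo_first_byte_ihl(hdr) != 5;
}
```

With traffic class 0 the misread "ihl" of an IPv6 inner packet is 0, so the loose test lets it through to the offload path while the strict test rejects it, which matches the regression described in the message.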
David Howells	2429a19764	rxrpc: Fix untrusted unsigned subtract Fix the following Smatch static checker warning: net/rxrpc/rxgk_app.c:65 rxgk_yfs_decode_ticket() warn: untrusted unsigned subtract. 'ticket_len - 10 * 4' by prechecking the length of what we're trying to extract in two places in the token and decoding for a response packet. Also use sizeof() on the struct we're extracting rather specifying the size numerically to be consistent with the other related statements. Fixes: `9d1d2b5934` ("rxrpc: rxgk: Implement the yfs-rxgk security class (GSSAPI)") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lists.infradead.org/pipermail/linux-afs/2025-September/010135.html Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/2039268.1757631977@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-14 13:05:22 -07:00
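The "untrusted unsigned subtract" warning in the commit above refers to a general hazard: subtracting a fixed size from an attacker-controlled unsigned length wraps around to a huge value if the length is too small. A minimal sketch of the precheck, using a hypothetical `struct demo_ticket_hdr` of ten 32-bit words (mirroring the `10 * 4` in the warning) and a hypothetical `demo_payload_len()` helper rather than the actual rxgk code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size header that precedes the payload. */
struct demo_ticket_hdr {
	uint32_t words[10];
};

/* Compute the payload length only after confirming the untrusted total
 * length covers the header, so the subtraction cannot wrap. */
static bool demo_payload_len(size_t ticket_len, size_t *payload_len)
{
	if (ticket_len < sizeof(struct demo_ticket_hdr))
		return false;	/* too short: reject instead of wrapping */
	*payload_len = ticket_len - sizeof(struct demo_ticket_hdr);
	return true;
}
```

Using `sizeof()` on the struct, as the commit also does, keeps the check and the subtraction tied to the same definition instead of a repeated numeric constant.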
David Howells	64863f4ca4	rxrpc: Fix unhandled errors in rxgk_verify_packet_integrity() rxgk_verify_packet_integrity() may get more errors than just -EPROTO from rxgk_verify_mic_skb(). Pretty much anything other than -ENOMEM constitutes an unrecoverable error. In the case of -ENOMEM, we can just drop the packet and wait for a retransmission. Similar happens with rxgk_decrypt_skb() and its callers. Fix rxgk_decrypt_skb() or rxgk_verify_mic_skb() to return a greater variety of abort codes and fix their callers to abort the connection on any error apart from -ENOMEM. Also preclear the variables used to hold the abort code returned from rxgk_decrypt_skb() or rxgk_verify_mic_skb() to eliminate uninitialised variable warnings. Fixes: `9d1d2b5934` ("rxrpc: rxgk: Implement the yfs-rxgk security class (GSSAPI)") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lists.infradead.org/pipermail/linux-afs/2025-April/009739.html Closes: https://lists.infradead.org/pipermail/linux-afs/2025-April/009740.html Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/2038804.1757631496@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-14 13:05:22 -07:00
Anderson Nascimento	2e7bba0892	net/tcp: Fix a NULL pointer dereference when using TCP-AO with TCP_REPAIR A NULL pointer dereference can occur in tcp_ao_finish_connect() during a connect() system call on a socket with a TCP-AO key added and TCP_REPAIR enabled. The function is called with skb being NULL and attempts to dereference it on tcp_hdr(skb)->seq without a prior skb validation. Fix this by checking if skb is NULL before dereferencing it. The commentary is taken from bpf_skops_established(), which is also called in the same flow. Unlike the function being patched, bpf_skops_established() validates the skb before dereferencing it. int main(void){ struct sockaddr_in sockaddr; struct tcp_ao_add tcp_ao; int sk; int one = 1; memset(&sockaddr,'\0',sizeof(sockaddr)); memset(&tcp_ao,'\0',sizeof(tcp_ao)); sk = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); sockaddr.sin_family = AF_INET; memcpy(tcp_ao.alg_name,"cmac(aes128)",12); memcpy(tcp_ao.key,"ABCDEFGHABCDEFGH",16); tcp_ao.keylen = 16; memcpy(&tcp_ao.addr,&sockaddr,sizeof(sockaddr)); setsockopt(sk, IPPROTO_TCP, TCP_AO_ADD_KEY, &tcp_ao, sizeof(tcp_ao)); setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &one, sizeof(one)); sockaddr.sin_family = AF_INET; sockaddr.sin_port = htobe16(123); inet_aton("127.0.0.1", &sockaddr.sin_addr); connect(sk,(struct sockaddr *)&sockaddr,sizeof(sockaddr)); return 0; } $ gcc tcp-ao-nullptr.c -o tcp-ao-nullptr -Wall $ unshare -Urn BUG: kernel NULL pointer dereference, address: 00000000000000b6 PGD 1f648d067 P4D 1f648d067 PUD 1982e8067 PMD 0 Oops: Oops: 0000 [#1] SMP NOPTI Hardware name: VMware, Inc. 
VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020 RIP: 0010:tcp_ao_finish_connect (net/ipv4/tcp_ao.c:1182) Fixes: `7c2ffaf21b` ("net/tcp: Calculate TCP-AO traffic keys") Signed-off-by: Anderson Nascimento <anderson@allelesecurity.com> Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250911230743.2551-3-anderson@allelesecurity.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-14 12:49:53 -07:00
Mahanta Jambigi	010fe36ad2	net/smc: Remove unused argument from 2 SMC functions The smc argument is not used in both smc_connect_ism_vlan_setup() & smc_connect_ism_vlan_cleanup(). Hence removing it. Signed-off-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Link: https://patch.msgid.link/20250910063125.2112577-1-mjambigi@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-14 11:55:31 -07:00
Eric Dumazet	fdae0ab67d	net: use NUMA drop counters for softnet_data.dropped Hosts under DOS attack can suffer from false sharing in enqueue_to_backlog() : atomic_inc(&sd->dropped). This is because sd->dropped can be touched from many cpus, possibly residing on different NUMA nodes. Generalize the sk_drop_counters infrastructure added in commit `c51613fa27` ("net: add sk->sk_drop_counters") and use it to replace softnet_data.dropped with NUMA friendly softnet_data.drop_counters. This adds 64 bytes per cpu, maybe more in the future if we increase the number of counters (currently 2) per 'struct numa_drop_counters'. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250909121942.1202585-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-14 11:35:17 -07:00
Linus Torvalds	5cd64d4f92	Merge tag 'ceph-for-6.17-rc6' of https://github.com/ceph/ceph-client Pull ceph fixes from Ilya Dryomov: "A fix for a race condition around r_parent tracking that took a long time to track down from Alex and some fixes for potential crashes on accessing invalid memory from Max and myself. All marked for stable" * tag 'ceph-for-6.17-rc6' of https://github.com/ceph/ceph-client: libceph: fix invalid accesses to ceph_connection_v1_info ceph: fix crash after fscrypt_encrypt_pagecache_blocks() error ceph: always call ceph_shift_unused_folios_left() ceph: fix race condition where r_parent becomes stale before sending message ceph: fix race condition validating r_parent before applying state	2025-09-13 10:45:11 -07:00
Russell King (Oracle)	201825fb42	net: ethtool: handle EOPNOTSUPP from ethtool get_ts_info() method Network drivers sometimes return -EOPNOTSUPP from their get_ts_info() method, and this should not cause the reporting of PHY timestamping information to be prohibited. Handle this error code, and also arrange for ethtool_net_get_ts_info_by_phc() to return -EOPNOTSUPP when the method is not implemented. This allows e.g. PHYs connected to DSA switches which support timestamping to report their timestamping capabilities. Fixes: `b9e3f7dc9e` ("net: ethtool: tsinfo: Enhance tsinfo to support several hwtstamp by net topology") Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/E1uwiW3-00000004jRF-3CnC@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-12 17:09:10 -07:00
Jakub Kicinski	bd569dd935	Merge tag 'nf-next-25-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter: updates for net-next 1) Don't respond to ICMP_UNREACH errors with another ICMP_UNREACH error. 2) Support fetching the current bridge ethernet address. This allows a more flexible approach to packet redirection on bridges without need to use hardcoded addresses. From Fernando Fernandez Mancera. 3) Zap a few no-longer needed conditionals from ipvs packet path and convert to READ/WRITE_ONCE to avoid KCSAN warnings. From Zhang Tengfei. 4) Remove a no-longer-used macro argument in ipset, from Zhen Ni. * tag 'nf-next-25-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_reject: don't reply to icmp error messages ipvs: Use READ_ONCE/WRITE_ONCE for ipvs->enable netfilter: nft_meta_bridge: introduce NFT_META_BRI_IIFHWADDR support netfilter: ipset: Remove unused htable_bits in macro ahash_region selftest:net: fixed spelling mistakes ==================== Link: https://patch.msgid.link/20250911143819.14753-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-12 17:06:25 -07:00
Alok Tiwari	dc2f650f7e	udp_tunnel: use netdev_warn() instead of netdev_WARN() netdev_WARN() uses WARN/WARN_ON to print a backtrace along with file and line information. In this case, udp_tunnel_nic_register() returning an error is just a failed operation, not a kernel bug. udp_tunnel_nic_register() can fail due to a memory allocation failure (kzalloc() or udp_tunnel_nic_alloc()). This is a normal runtime error and not a kernel bug. Replace netdev_WARN() with netdev_warn() accordingly. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250910195031.3784748-1-alok.a.tiwari@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-11 19:09:48 -07:00
Dmitry Safonov	51e547e8c8	tcp: Free TCP-AO/TCP-MD5 info/keys without RCU Now that the destruction of info/keys is delayed until the socket destructor, it's safe to use kfree() without an RCU callback. The socket is in TCP_CLOSE state either because it never left it, or it's already closed and the refcounter is zero. In any way, no one can discover it anymore, it's safe to release memory straight away. Similar thing was possible for twsk already. Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Dmitry Safonov <dima@arista.com> Link: https://patch.msgid.link/20250909-b4-tcp-ao-md5-rst-finwait2-v5-2-9ffaaaf8b236@arista.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-11 19:05:56 -07:00


82231 Commits