As SACK blocks tend to eat all option space when there are
many holes, it is useful to compromise on sending many SACK
blocks in every ACK and attempt to fit the AccECN option
there by reducing the number of SACK blocks. However, it will
never go below two SACK blocks because of the AccECN option.
As the AccECN option is often not put to every ACK, the space
hijack is usually only temporary. Depending on the reuqired
AccECN fields (can be either 3, 2, 1, or 0, cf. Table 5 in
AccECN spec) and the NOPs used for alignment of other
TCP options, up to two SACK blocks will be reduced. Please
find below tables for more details:
+====================+=========================================+
| Number of | Required | Remaining | Number of | Final |
| SACK | AccECN | option | reduced | number of |
| blocks | fields | spaces | SACK blocks | SACK blocks |
+===========+==========+===========+=============+=============+
| x (<=2) | 0 to 3 | any | 0 | x |
+-----------+----------+-----------+-------------+-------------+
| 3 | 0 | any | 0 | 3 |
| 3 | 1 | <4 | 1 | 2 |
| 3 | 1 | >=4 | 0 | 3 |
| 3 | 2 | <8 | 1 | 2 |
| 3 | 2 | >=8 | 0 | 3 |
| 3 | 3 | <12 | 1 | 2 |
| 3 | 3 | >=12 | 0 | 3 |
+-----------+----------+-----------+-------------+-------------+
| y (>=4) | 0 | any | 0 | y |
| y (>=4) | 1 | <4 | 1 | y-1 |
| y (>=4) | 1 | >=4 | 0 | y |
| y (>=4) | 2 | <8 | 1 | y-1 |
| y (>=4) | 2 | >=8 | 0 | y |
| y (>=4) | 3 | <4 | 2 | y-2 |
| y (>=4) | 3 | <12 | 1 | y-1 |
| y (>=4) | 3 | >=12 | 0 | y |
+===========+==========+===========+=============+=============+
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Co-developed-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-11-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The AccECN option ceb/cep heuristic algorithm is from AccECN spec
Appendix A.2.2 to mitigate against false ACE field overflows. Armed
with ceb delta from option, delivered bytes, and delivered packets it
is possible to estimate how many times ACE field wrapped.
This calculation is necessary only if more than one wrap is possible.
Without SACK, delivered bytes and packets are not always trustworthy in
which case TCP falls back to the simpler no-or-all wraps ceb algorithm.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-10-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
AccECN option may fail in various way, handle these:
- Attempt to negotiate the use of AccECN on the 1st retransmitted SYN
- From the 2nd retransmitted SYN, stop AccECN negotiation
- Remove option from SYN/ACK rexmits to handle blackholes
- If no option arrives in SYN/ACK, assume Option is not usable
- If an option arrives later, re-enabled
- If option is zeroed, disable AccECN option processing
This patch use existing padding bits in tcp_request_sock and
holes in tcp_sock without increasing the size.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-9-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Instead of sending the option in every ACK, limit sending to
those ACKs where the option is necessary:
- Handshake
- "Change-triggered ACK" + the ACK following it. The
2nd ACK is necessary to unambiguously indicate which
of the ECN byte counters in increasing. The first
ACK has two counters increasing due to the ecnfield
edge.
- ACKs with CE to allow CEP delta validations to take
advantage of the option.
- Force option to be sent every at least once per 2^22
bytes. The check is done using the bit edges of the
byte counters (avoids need for extra variables).
- AccECN option beacon to send a few times per RTT even if
nothing in the ECN state requires that. The default is 3
times per RTT, and its period can be set via
sysctl_tcp_ecn_option_beacon.
Below are the pahole outcomes before and after this patch,
in which the group size of tcp_sock_write_tx is increased
from 89 to 97 due to the new u64 accecn_opt_tstamp member:
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
u64 tcp_wstamp_ns; /* 2488 8 */
struct list_head tsorted_sent_queue; /* 2496 16 */
[...]
__cacheline_group_end__tcp_sock_write_tx[0]; /* 2521 0 */
__cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */
u8 nonagle:4; /* 2521: 0 1 */
u8 rate_app_limited:1; /* 2521: 4 1 */
/* XXX 3 bits hole, try to pack */
/* Force alignment to the next boundary: */
u8 :0;
u8 received_ce_pending:4;/* 2522: 0 1 */
u8 unused2:4; /* 2522: 4 1 */
u8 accecn_minlen:2; /* 2523: 0 1 */
u8 est_ecnfield:2; /* 2523: 2 1 */
u8 unused3:4; /* 2523: 4 1 */
[...]
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2628 0 */
[...]
/* size: 3200, cachelines: 50, members: 171 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
u64 tcp_wstamp_ns; /* 2488 8 */
u64 accecn_opt_tstamp; /* 2596 8 */
struct list_head tsorted_sent_queue; /* 2504 16 */
[...]
__cacheline_group_end__tcp_sock_write_tx[0]; /* 2529 0 */
__cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2529 0 */
u8 nonagle:4; /* 2529: 0 1 */
u8 rate_app_limited:1; /* 2529: 4 1 */
/* XXX 3 bits hole, try to pack */
/* Force alignment to the next boundary: */
u8 :0;
u8 received_ce_pending:4;/* 2530: 0 1 */
u8 unused2:4; /* 2530: 4 1 */
u8 accecn_minlen:2; /* 2531: 0 1 */
u8 est_ecnfield:2; /* 2531: 2 1 */
u8 accecn_opt_demand:2; /* 2531: 4 1 */
u8 prev_ecnfield:2; /* 2531: 6 1 */
[...]
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2636 0 */
[...]
/* size: 3200, cachelines: 50, members: 173 */
}
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Co-developed-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-8-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The Accurate ECN allows echoing back the sum of bytes for
each IP ECN field value in the received packets using
AccECN option. This change implements AccECN option tx & rx
side processing without option send control related features
that are added by a later change.
Based on specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
(Some features of the spec will be added in the later changes
rather than in this one).
A full-length AccECN option is always attempted but if it does
not fit, the minimum length is selected based on the counters
that have changed since the last update. The AccECN option
(with 24-bit fields) often ends in odd sizes so the option
write code tries to take advantage of some nop used to pad
the other TCP options.
The delivered_ecn_bytes pairs with received_ecn_bytes similar
to how delivered_ce pairs with received_ce. In contrast to
ACE field, however, the option is not always available to update
delivered_ecn_bytes. For ACK w/o AccECN option, the delivered
bytes calculated based on the cumulative ACK+SACK information
are assigned to one of the counters using an estimation
heuristic to select the most likely ECN byte counter. Any
estimation error is corrected when the next AccECN option
arrives. It may occur that the heuristic gets too confused
when there are enough different byte counter deltas between
ACKs with the AccECN option in which case the heuristic just
gives up on updating the counters for a while.
tcp_ecn_option sysctl can be used to select option sending
mode for AccECN: TCP_ECN_OPTION_DISABLED, TCP_ECN_OPTION_MINIMUM,
and TCP_ECN_OPTION_FULL.
This patch increases the size of tcp_info struct, as there is
no existing holes for new u32 variables. Below are the pahole
outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_info {
[...]
__u32 tcpi_total_rto_time; /* 244 4 */
/* size: 248, cachelines: 4, members: 61 */
}
[AFTER THIS PATCH]
struct tcp_info {
[...]
__u32 tcpi_total_rto_time; /* 244 4 */
__u32 tcpi_received_ce; /* 248 4 */
__u32 tcpi_delivered_e1_bytes; /* 252 4 */
__u32 tcpi_delivered_e0_bytes; /* 256 4 */
__u32 tcpi_delivered_ce_bytes; /* 260 4 */
__u32 tcpi_received_e1_bytes; /* 264 4 */
__u32 tcpi_received_e0_bytes; /* 268 4 */
__u32 tcpi_received_ce_bytes; /* 272 4 */
/* size: 280, cachelines: 5, members: 68 */
}
This patch uses the existing 1-byte holes in the tcp_sock_write_txrx
group for new u8 members, but adds a 4-byte hole in tcp_sock_write_rx
group after the new u32 delivered_ecn_bytes[3] member. Therefore, the
group size of tcp_sock_write_rx is increased from 96 to 112. Below
are the pahole outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
u8 received_ce_pending:4; /* 2522: 0 1 */
u8 unused2:4; /* 2522: 4 1 */
/* XXX 1 byte hole, try to pack */
[...]
u32 rcv_rtt_last_tsecr; /* 2668 4 */
[...]
__cacheline_group_end__tcp_sock_write_rx[0]; /* 2728 0 */
[...]
/* size: 3200, cachelines: 50, members: 167 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
u8 received_ce_pending:4;/* 2522: 0 1 */
u8 unused2:4; /* 2522: 4 1 */
u8 accecn_minlen:2; /* 2523: 0 1 */
u8 est_ecnfield:2; /* 2523: 2 1 */
u8 unused3:4; /* 2523: 4 1 */
[...]
u32 rcv_rtt_last_tsecr; /* 2668 4 */
u32 delivered_ecn_bytes[3];/* 2672 12 */
/* XXX 4 bytes hole, try to pack */
[...]
__cacheline_group_end__tcp_sock_write_rx[0]; /* 2744 0 */
[...]
/* size: 3200, cachelines: 50, members: 171 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-7-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
1) Don't early return when sack doesn't fit. AccECN code will be
placed after this fragment so no early returns please.
2) Make sure opts->num_sack_blocks is not left undefined. E.g.,
tcp_current_mss() does not memset its opts struct to zero.
AccECN code checks if SACK option is present and may even
alter it to make room for AccECN option when many SACK blocks
are present. Thus, num_sack_blocks needs to be always valid.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-6-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
These three byte counters track IP ECN field payload byte sums for
all arriving (acceptable) packets for ECT0, ECT1, and CE. The
AccECN option (added by a later patch in the series) echoes these
counters back to sender side; therefore, it is placed within the
group of tcp_sock_write_txrx.
Below are the pahole outcomes before and after this patch, in which
the group size of tcp_sock_write_txrx is increased from 95 + 4 to
107 + 4 and an extra 4-byte hole is created but will be exploited
in later patches:
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
u32 delivered_ce; /* 2576 4 */
u32 received_ce; /* 2580 4 */
u32 app_limited; /* 2584 4 */
u32 rcv_wnd; /* 2588 4 */
struct tcp_options_received rx_opt; /* 2592 24 */
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2616 0 */
[...]
/* size: 3200, cachelines: 50, members: 166 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
u32 delivered_ce; /* 2576 4 */
u32 received_ce; /* 2580 4 */
u32 received_ecn_bytes[3];/* 2584 12 */
u32 app_limited; /* 2596 4 */
u32 rcv_wnd; /* 2600 4 */
struct tcp_options_received rx_opt; /* 2604 24 */
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2628 0 */
/* XXX 4 bytes hole, try to pack */
[...]
/* size: 3200, cachelines: 50, members: 167 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-4-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Accurate ECN negotiation parts based on the specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
Accurate ECN is negotiated using ECE, CWR and AE flags in the
TCP header. TCP falls back into using RFC3168 ECN if one of the
ends supports only RFC3168-style ECN.
The AccECN negotiation includes reflecting IP ECN field value
seen in SYN and SYNACK back using the same bits as negotiation
to allow responding to SYN CE marks and to detect ECN field
mangling. CE marks should not occur currently because SYN=1
segments are sent with Non-ECT in IP ECN field (but proposal
exists to remove this restriction).
Reflecting SYN IP ECN field in SYNACK is relatively simple.
Reflecting SYNACK IP ECN field in the final/third ACK of
the handshake is more challenging. Linux TCP code is not well
prepared for using the final/third ACK a signalling channel
which makes things somewhat complicated here.
tcp_ecn sysctl can be used to select the highest ECN variant
(Accurate ECN, ECN, No ECN) that is attemped to be negotiated and
requested for incoming connection and outgoing connection:
TCP_ECN_IN_NOECN_OUT_NOECN, TCP_ECN_IN_ECN_OUT_ECN,
TCP_ECN_IN_ECN_OUT_NOECN, TCP_ECN_IN_ACCECN_OUT_ACCECN,
TCP_ECN_IN_ACCECN_OUT_ECN, and TCP_ECN_IN_ACCECN_OUT_NOECN.
After this patch, the size of tcp_request_sock remains unchanged
and no new holes are added. Below are the pahole outcomes before
and after this patch:
[BEFORE THIS PATCH]
struct tcp_request_sock {
[...]
u32 rcv_nxt; /* 352 4 */
u8 syn_tos; /* 356 1 */
/* size: 360, cachelines: 6, members: 16 */
}
[AFTER THIS PATCH]
struct tcp_request_sock {
[...]
u32 rcv_nxt; /* 352 4 */
u8 syn_tos; /* 356 1 */
bool accecn_ok; /* 357 1 */
u8 syn_ect_snt:2; /* 358: 0 1 */
u8 syn_ect_rcv:2; /* 358: 2 1 */
u8 accecn_fail_mode:4; /* 358: 4 1 */
/* size: 360, cachelines: 6, members: 20 */
}
After this patch, the size of tcp_sock remains unchanged and no new
holes are added. Also, 4 bits of the existing 2-byte hole are exploited.
Below are the pahole outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
u8 dup_ack_counter:2; /* 2761: 0 1 */
u8 tlp_retrans:1; /* 2761: 2 1 */
u8 unused:5; /* 2761: 3 1 */
u8 thin_lto:1; /* 2762: 0 1 */
u8 fastopen_connect:1; /* 2762: 1 1 */
u8 fastopen_no_cookie:1; /* 2762: 2 1 */
u8 fastopen_client_fail:2; /* 2762: 3 1 */
u8 frto:1; /* 2762: 5 1 */
/* XXX 2 bits hole, try to pack */
[...]
u8 keepalive_probes; /* 2765 1 */
/* XXX 2 bytes hole, try to pack */
[...]
/* size: 3200, cachelines: 50, members: 164 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
u8 dup_ack_counter:2; /* 2761: 0 1 */
u8 tlp_retrans:1; /* 2761: 2 1 */
u8 syn_ect_snt:2; /* 2761: 3 1 */
u8 syn_ect_rcv:2; /* 2761: 5 1 */
u8 thin_lto:1; /* 2761: 7 1 */
u8 fastopen_connect:1; /* 2762: 0 1 */
u8 fastopen_no_cookie:1; /* 2762: 1 1 */
u8 fastopen_client_fail:2; /* 2762: 2 1 */
u8 frto:1; /* 2762: 4 1 */
/* XXX 3 bits hole, try to pack */
[...]
u8 keepalive_probes; /* 2765 1 */
u8 accecn_fail_mode:4; /* 2766: 0 1 */
/* XXX 4 bits hole, try to pack */
/* XXX 1 byte hole, try to pack */
[...]
/* size: 3200, cachelines: 50, members: 166 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-3-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This change implements Accurate ECN without negotiation and
AccECN Option (that will be added by later changes). Based on
AccECN specifications:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
Accurate ECN allows feeding back the number of CE (congestion
experienced) marks accurately to the sender in contrast to
RFC3168 ECN that can only signal one marks-seen-yes/no per RTT.
Congestion control algorithms can take advantage of the accurate
ECN information to fine-tune their congestion response to avoid
drastic rate reduction when only mild congestion is encountered.
With Accurate ECN, tp->received_ce (r.cep in AccECN spec) keeps
track of how many segments have arrived with a CE mark. Accurate
ECN uses ACE field (ECE, CWR, AE) to communicate the value back
to the sender which updates tp->delivered_ce (s.cep) based on the
feedback. This signalling channel is lossy when ACE field overflow
occurs.
Conservative strategy is selected here to deal with the ACE
overflow, however, some strategies using the AccECN option later
in the overall patchset mitigate against false overflows detected.
The ACE field values on the wire are offset by
TCP_ACCECN_CEP_INIT_OFFSET. Delivered_ce/received_ce count the
real CE marks rather than forcing all downstream users to adapt
to the wire offset.
This patch uses the first 1-byte hole and the last 4-byte hole of
the tcp_sock_write_txrx for 'received_ce_pending' and 'received_ce'.
Also, the group size of tcp_sock_write_txrx is increased from
91 + 4 to 95 + 4 due to the new u32 received_ce member. Below are
the trimmed pahole outcomes before and after this patch.
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
__cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */
u8 nonagle:4; /* 2521: 0 1 */
u8 rate_app_limited:1; /* 2521: 4 1 */
/* XXX 3 bits hole, try to pack */
/* XXX 2 bytes hole, try to pack */
[...]
u32 delivered_ce; /* 2576 4 */
u32 app_limited; /* 2580 4 */
u32 rcv_wnd; /* 2684 4 */
struct tcp_options_received rx_opt; /* 2688 24 */
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2612 0 */
/* XXX 4 bytes hole, try to pack */
[...]
/* size: 3200, cachelines: 50, members: 161 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
__cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */
u8 nonagle:4; /* 2521: 0 1 */
u8 rate_app_limited:1; /* 2521: 4 1 */
/* XXX 3 bits hole, try to pack */
/* Force alignment to the next boundary: */
u8 :0;
u8 received_ce_pending:4;/* 2522: 0 1 */
u8 unused2:4; /* 2522: 4 1 */
/* XXX 1 byte hole, try to pack */
[...]
u32 delivered_ce; /* 2576 4 */
u32 received_ce; /* 2580 4 */
u32 app_limited; /* 2584 4 */
u32 rcv_wnd; /* 2588 4 */
struct tcp_options_received rx_opt; /* 2592 24 */
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2616 0 */
[...]
/* size: 3200, cachelines: 50, members: 164 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-2-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
get_netdev_for_sock() is called during setsockopt(),
so not under RCU.
Using sk_dst_get(sk)->dev could trigger UAF.
Let's use __sk_dst_get() and dst_dev_rcu().
Note that the only ->ndo_sk_get_lower_dev() user is
bond_sk_get_lower_dev(), which uses RCU.
Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250916214758.650211-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
smc_vlan_by_tcpsk() fetches sk_dst_get(sk)->dev before RTNL and
passes it to netdev_walk_all_lower_dev(), which is illegal.
Also, smc_vlan_by_tcpsk_walk() does not require RTNL at all.
Let's use __sk_dst_get(), dst_dev_rcu(), and
netdev_walk_all_lower_dev_rcu().
Note that the returned value of smc_vlan_by_tcpsk() is not used
in the caller.
Fixes: 0cfdd8f92c ("smc: connection and link group creation")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916214758.650211-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
smc_clc_prfx_match() is called from smc_listen_work() and
not under RCU nor RTNL.
Using sk_dst_get(sk)->dev could trigger UAF.
Let's use __sk_dst_get() and dst_dev_rcu().
Note that the returned value of smc_clc_prfx_match() is not
used in the caller.
Fixes: a046d57da1 ("smc: CLC handshake (incl. preparation steps)")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916214758.650211-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
smc_clc_prfx_set() is called during connect() and not under RCU
nor RTNL.
Using sk_dst_get(sk)->dev could trigger UAF.
Let's use __sk_dst_get() and dev_dst_rcu() under rcu_read_lock()
after kernel_getsockname().
Note that the returned value of smc_clc_prfx_set() is not used
in the caller.
While at it, we change the 1st arg of smc_clc_prfx_set[46]_rcu()
not to touch dst there.
Fixes: a046d57da1 ("smc: CLC handshake (incl. preparation steps)")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916214758.650211-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Simon Wunderlich says:
====================
This cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
- Remove network coding support, by Sven Eckelmann (2 patches)
- remove includes for extern declarations, by Sven Eckelmann
* tag 'batadv-next-pullrequest-20250916' of https://git.open-mesh.org/linux-merge:
batman-adv: remove includes for extern declarations
batman-adv: keep skb crc32 helper local in BLA
batman-adv: remove network coding support
batman-adv: Start new development cycle
====================
Link: https://patch.msgid.link/20250916122441.89246-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A lot of modern SoC have the ability to store MAC addresses in their
NVMEM. So extend the generic function device_get_mac_address() to
obtain the MAC address from an nvmem cell named 'mac-address' in
case there is no firmware node which contains the MAC address directly.
Signed-off-by: Stefan Wahren <wahrenst@gmx.net>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250912140332.35395-3-wahrenst@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Driver authors often forget to add GFP_NOWARN for page allocation
from the datapath. This is annoying to users as OOMs are a fact
of life, and we pretty much expect network Rx to hit page allocation
failures during OOM. Make page pool add GFP_NOWARN for ATOMIC allocations
by default.
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250912161703.361272-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hosts under DOS attack can suffer from false sharing
in enqueue_to_backlog() : atomic_inc(&sd->dropped).
This is because sd->dropped can be touched from many cpus,
possibly residing on different NUMA nodes.
Generalize the sk_drop_counters infrastucture
added in commit c51613fa27 ("net: add sk->sk_drop_counters")
and use it to replace softnet_data.dropped
with NUMA friendly softnet_data.drop_counters.
This adds 64 bytes per cpu, maybe more in the future
if we increase the number of counters (currently 2)
per 'struct numa_drop_counters'.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250909121942.1202585-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Florian Westphal says:
====================
netfilter: updates for net-next
1) Don't respond to ICMP_UNREACH errors with another ICMP_UNREACH
error.
2) Support fetching the current bridge ethernet address.
This allows a more flexible approach to packet redirection
on bridges without need to use hardcoded addresses. From
Fernando Fernandez Mancera.
3) Zap a few no-longer needed conditionals from ipvs packet path
and convert to READ/WRITE_ONCE to avoid KCSAN warnings.
From Zhang Tengfei.
4) Remove a no-longer-used macro argument in ipset, from Zhen Ni.
* tag 'nf-next-25-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_reject: don't reply to icmp error messages
ipvs: Use READ_ONCE/WRITE_ONCE for ipvs->enable
netfilter: nft_meta_bridge: introduce NFT_META_BRI_IIFHWADDR support
netfilter: ipset: Remove unused htable_bits in macro ahash_region
selftest:net: fixed spelling mistakes
====================
Link: https://patch.msgid.link/20250911143819.14753-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
netdev_WARN() uses WARN/WARN_ON to print a backtrace along with
file and line information. In this case, udp_tunnel_nic_register()
returning an error is just a failed operation, not a kernel bug.
udp_tunnel_nic_register() can fail due to a memory allocation
failure (kzalloc() or udp_tunnel_nic_alloc()).
This is a normal runtime error and not a kernel bug.
Replace netdev_WARN() with netdev_warn() accordingly.
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250910195031.3784748-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that the destruction of info/keys is delayed until the socket
destructor, it's safe to use kfree() without an RCU callback.
The socket is in TCP_CLOSE state either because it never left it,
or it's already closed and the refcounter is zero. In any way,
no one can discover it anymore, it's safe to release memory
straight away.
Similar thing was possible for twsk already.
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Link: https://patch.msgid.link/20250909-b4-tcp-ao-md5-rst-finwait2-v5-2-9ffaaaf8b236@arista.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently there are a couple of minor issues with destroying the keys
tcp_v4_destroy_sock():
1. The socket is yet in TCP bind buckets, making it reachable for
incoming segments [on another CPU core], potentially available to send
late FIN/ACK/RST replies.
2. There is at least one code path, where tcp_done() is called before
sending RST [kudos to Bob for investigation]. This is a case of
a server, that finished sending its data and just called close().
The socket is in TCP_FIN_WAIT2 and has RCV_SHUTDOWN (set by
__tcp_close())
tcp_v4_do_rcv()/tcp_v6_do_rcv()
tcp_rcv_state_process() /* LINUX_MIB_TCPABORTONDATA */
tcp_reset()
tcp_done_with_error()
tcp_done()
inet_csk_destroy_sock() /* Destroys AO/MD5 keys */
/* tcp_rcv_state_process() returns SKB_DROP_REASON_TCP_ABORT_ON_DATA */
tcp_v4_send_reset() /* Sends an unsigned RST segment */
tcpdump:
> 22:53:15.399377 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 33929, offset 0, flags [DF], proto TCP (6), length 60)
> 1.0.0.1.34567 > 1.0.0.2.49848: Flags [F.], seq 2185658590, ack 3969644355, win 502, options [nop,nop,md5 valid], length 0
> 22:53:15.399396 00:00:01:01:00:00 > 00:00:b2:1f:00:00, ethertype IPv4 (0x0800), length 86: (tos 0x0, ttl 64, id 51951, offset 0, flags [DF], proto TCP (6), length 72)
> 1.0.0.2.49848 > 1.0.0.1.34567: Flags [.], seq 3969644375, ack 2185658591, win 128, options [nop,nop,md5 valid,nop,nop,sack 1 {2185658590:2185658591}], length 0
> 22:53:16.429588 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 60: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
> 1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658590, win 0, length 0
> 22:53:16.664725 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
> 1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658591, win 0, options [nop,nop,md5 valid], length 0
> 22:53:17.289832 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
> 1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658591, win 0, options [nop,nop,md5 valid], length 0
Note the signed RSTs later in the dump - those are sent by the server
when the fin-wait socket gets removed from hash buckets, by
the listener socket.
Instead of destroying AO/MD5 info and their keys in inet_csk_destroy_sock(),
slightly delay it until the actual socket .sk_destruct(). As shutdown'ed
socket can yet send non-data replies, they should be signed in order for
the peer to process them. Now it also matches how AO/MD5 gets destructed
for TIME-WAIT sockets (in tcp_twsk_destructor()).
This seems optimal for TCP-MD5, while for TCP-AO it seems to have an
open problem: once RST get sent and socket gets actually destructed,
there is no information on the initial sequence numbers. So, in case
this last RST gets lost in the network, the server's listener socket
won't be able to properly sign another RST. Nothing in RFC 1122
prescribes keeping any local state after non-graceful reset.
Luckily, BGP are known to use keep alive(s).
While the issue is quite minor/cosmetic, these days monitoring network
counters is a common practice and getting invalid signed segments from
a trusted BGP peer can get customers worried.
Investigated-by: Bob Gilligan <gilligan@arista.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Link: https://patch.msgid.link/20250909-b4-tcp-ao-md5-rst-finwait2-v5-1-9ffaaaf8b236@arista.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When BROPT_FDB_LOCAL_VLAN_0 is enabled, the local FDB entries for the
bridge itself should not be created per-VLAN, but instead only on VLAN 0.
When the bridge address changes, the local FDB entries need to be updated,
which is done in br_fdb_change_mac_address().
Bail out early when in VLAN-0 mode, so that the per-VLAN FDB entries are
not created. The per-VLAN walk is only done afterwards.
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/0bd432cf91921ef7c4ed0e129de1d1cd358c716b.1757004393.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When BROPT_FDB_LOCAL_VLAN_0 is enabled, the local FDB entries for member
ports should not be created per-VLAN, but instead only on VLAN 0. When the
member port address changes, the local FDB entries need to be updated,
which is done in br_fdb_changeaddr().
Under the VLAN-0 mode, only one local FDB entry will ever be added for a
port's address, and that on VLAN 0. Thus bail out of the delete loop early.
For the same reason, also skip adding the per-VLAN entries.
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/0cf9d41836d2a245b0ce07e1a16ee05ca506cbe9.1757004393.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When BROPT_FDB_LOCAL_VLAN_0 is enabled, the local FDB entries for the
member ports as well as the bridge itself should not be created per-VLAN,
but instead only on VLAN 0.
That means that br_handle_frame_finish() needs to make two lookups: the
primary lookup on an appropriate VLAN, and when that misses, a lookup on
VLAN 0.
Have the second lookup only accept local MAC addresses. Turning this into a
generic second-lookup feature is not the goal.
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/8087475009dce360fb68d873b1ed9c80827da302.1757004393.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_recvmsg_dmabuf can export the following errors:
- EFAULT when linear copy fails
- ETOOSMALL when cmsg put fails
- ENODEV if one of the frags is readable
- ENOMEM on xarray failures
But they are all ignored and replaced by EFAULT in the caller
(tcp_recvmsg_locked). Expose real error to the userspace to
add more transparency on what specifically fails.
In non-devmem case (skb_copy_datagram_msg) doing `if (!copied)
copied=-EFAULT` is ok because skb_copy_datagram_msg can return only EFAULT.
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250910162429.4127997-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In a system with high real-time requirements, the timeout mechanism of
ordinary timers with jiffies granularity is insufficient to meet the
demands for real-time performance. Meanwhile, the optimization of CPU
usage with af_packet is quite significant. Use hrtimer instead of timer
to help compensate for the shortcomings in real-time performance.
In HZ=100 or HZ=250 system, the update of TP_STATUS_USER is not real-time
enough, with fluctuations reaching over 8ms (on a system with HZ=250).
This is unacceptable in some high real-time systems that require timely
processing of network packets. By replacing it with hrtimer, if a timeout
of 2ms is set, the update of TP_STATUS_USER can be stabilized to within
3 ms.
Delete delete_blk_timer field, because hrtimer_cancel will check and wait
until the timer callback return and ensure never enter callback again.
Simplify the logic related to setting timeout, only update the hrtimer
expire time within the hrtimer callback, no longer update the expire time
in prb_open_block which is called by tpacket_rcv or timer callback.
Reasons why NOT update hrtimer in prb_open_block:
1) It will increase complexity to distinguish the two caller scenario.
2) hrtimer_cancel and hrtimer_start need to be called if you want to update
TMO of an already enqueued hrtimer, leading to complex shutdown logic.
One side effect of NOT update hrtimer when called by tpacket_rcv is that
a newly opened block triggered by tpacket_rcv may be retired earlier than
expected. On the other hand, if timeout is updated in prb_open_block, the
frequent reception of network packets that leads to prb_open_block being
called may cause hrtimer to be removed and enqueued repeatedly.
The retire hrtimer expiration is unconditional and periodic. If there are
numerous packet sockets on the system, please set an appropriate timeout
to avoid frequent enqueueing of hrtimers.
Reviewed-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/all/20250831100822.1238795-1-jackzxcui1989@163.com/
Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
Link: https://patch.msgid.link/20250908104549.204412-3-jackzxcui1989@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
kactive_blk_num (K) is only incremented on block close.
In timer callback prb_retire_rx_blk_timer_expired, except delete_blk_timer
is true, last_kactive_blk_num (L) is set to match kactive_blk_num (K) in
all cases. L is also set to match K in prb_open_block.
The only case K not equal to L is when scheduled by tpacket_rcv
and K is just incremented on block close but no new block could be opened,
so that it does not call prb_open_block in prb_dispatch_next_block.
This patch modifies the prb_retire_rx_blk_timer_expired function by simply
removing the check for L == K. This patch just provides another checkpoint
to thaw the might-be-frozen block in any case. It doesn't have any effect
because __packet_lookup_frame_in_block() has the same logic and does it
again without this patch when detecting the ring is frozen. The patch only
advances checking the status of the ring.
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Reviewed-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/all/20250831100822.1238795-1-jackzxcui1989@163.com/
Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
Link: https://patch.msgid.link/20250908104549.204412-2-jackzxcui1989@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Johannes Berg says:
====================
Plenty of things going on, notably:
- iwlwifi: major cleanups/rework
- brcmfmac: gets AP isolation support
- mac80211: gets more S1G support
* tag 'wireless-next-2025-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (94 commits)
wifi: mwifiex: fix endianness handling in mwifiex_send_rgpower_table
wifi: cfg80211: Remove the redundant wiphy_dev
wifi: mac80211: fix incorrect comment
wifi: cfg80211: update the time stamps in hidden ssid
wifi: mac80211: Fix HE capabilities element check
wifi: mac80211: add tx_handlers_drop statistics to ethtool
wifi: mac80211: fix reporting of all valid links in sta_set_sinfo()
wifi: iwlwifi: mld: CHANNEL_SURVEY_NOTIF is always supported
wifi: iwlwifi: mld: remove support of iwl_esr_mode_notif version 1
wifi: iwlwifi: mld: remove support from of sta cmd version 1
wifi: iwlwifi: mld: remove support of roc cmd version 5
wifi: iwlwifi: mld: remove support of mac cmd ver 2
wifi: iwlwifi: mld: don't consider phy cmd version 5
wifi: iwlwifi: implement wowlan status notification API update
wifi: iwlwifi: fw: Add ASUS to PPAG and TAS list
wifi: iwlwifi: add kunit tests for nvm parse
wifi: iwlwifi: api: add a flag to iwl_link_ctx_modify_flags
wifi: iwlwifi: pcie: move ltr_enabled to the specific transport
wifi: iwlwifi: pcie: move pm_support to the specific transport
wifi: iwlwifi: rename iwl_finish_nic_init
...
====================
Link: https://patch.msgid.link/20250911100854.20445-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR (net-6.17-rc6).
Conflicts:
net/netfilter/nft_set_pipapo.c
net/netfilter/nft_set_pipapo_avx2.c
c4eaca2e10 ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups")
84c1da7b38 ("netfilter: nft_set_pipapo: use avx2 algorithm for insertions too")
Only trivial adjacent changes (in a doc and a Makefile).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull networking fixes from Paolo Abeni:
"Including fixes from CAN, netfilter and wireless.
We have an IPv6 routing regression with the relevant fix still a WiP.
This includes a last-minute revert to avoid more problems.
Current release - new code bugs:
- wifi: nl80211: completely disable per-link stats for now
Previous releases - regressions:
- dev_ioctl: take ops lock in hwtstamp lower paths
- netfilter:
- fix spurious set lookup failures
- fix lockdep splat due to missing annotation
- genetlink: fix genl_bind() invoking bind() after -EPERM
- phy: transfer phy_config_inband() locking responsibility to phylink
- can: xilinx_can: fix use-after-free of transmitted SKB
- hsr: fix lock warnings
- eth:
- igb: fix NULL pointer dereference in ethtool loopback test
- i40e: fix Jumbo Frame support after iPXE boot
- macsec: sync features on RTM_NEWLINK
Previous releases - always broken:
- tunnels: reset the GSO metadata before reusing the skb
- mptcp: make sync_socket_options propagate SOCK_KEEPOPEN
- can: j1939: implement NETDEV_UNREGISTER notification hanidler
- wifi: ath12k: fix WMI TLV header misalignment"
* tag 'net-6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits)
Revert "net: usb: asix: ax88772: drop phylink use in PM to avoid MDIO runtime PM wakeups"
hsr: hold rcu and dev lock for hsr_get_port_ndev
hsr: use hsr_for_each_port_rtnl in hsr_port_get_hsr
hsr: use rtnl lock when iterating over ports
wifi: nl80211: completely disable per-link stats for now
net: usb: asix: ax88772: drop phylink use in PM to avoid MDIO runtime PM wakeups
net: ethtool: fix wrong type used in struct kernel_ethtool_ts_info
MAINTAINERS: add Phil as netfilter reviewer
netfilter: nf_tables: restart set lookup on base_seq change
netfilter: nf_tables: make nft_set_do_lookup available unconditionally
netfilter: nf_tables: place base_seq in struct net
netfilter: nft_set_rbtree: continue traversal if element is inactive
netfilter: nft_set_pipapo: don't check genbit from packetpath lookups
netfilter: nft_set_bitmap: fix lockdep splat due to missing annotation
can: rcar_can: rcar_can_resume(): fix s2ram with PSCI
can: xilinx_can: xcan_write_frame(): fix use-after-free of transmitted SKB
can: j1939: j1939_local_ecu_get(): undo increment when j1939_local_ecu_get() fails
can: j1939: j1939_sk_bind(): call j1939_priv_put() immediately when j1939_local_ecu_get() failed
can: j1939: implement NETDEV_UNREGISTER notification handler
selftests: can: enable CONFIG_CAN_VCAN as a module
...
Pull bpf fixes from Alexei Starovoitov:
"A number of fixes accumulated due to summer vacations
- Fix out-of-bounds dynptr write in bpf_crypto_crypt() kfunc which
was misidentified as a security issue (Daniel Borkmann)
- Update the list of BPF selftests maintainers (Eduard Zingerman)
- Fix selftests warnings with icecc compiler (Ilya Leoshkevich)
- Disable XDP/cpumap direct return optimization (Jesper Dangaard
Brouer)
- Fix unexpected get_helper_proto() result in unusual configuration
BPF_SYSCALL=y and BPF_EVENTS=n (Jiri Olsa)
- Allow fallback to interpreter when JIT support is limited (KaFai
Wan)
- Fix rqspinlock and choose trylock fallback for NMI waiters. Pick
the simplest fix. More involved fix is targeted bpf-next (Kumar
Kartikeya Dwivedi)
- Fix cleanup when tcp_bpf_send_verdict() fails to allocate
psock->cork (Kuniyuki Iwashima)
- Disallow bpf_timer in PREEMPT_RT for now. Proper solution is being
discussed for bpf-next. (Leon Hwang)
- Fix XSK cq descriptor production (Maciej Fijalkowski)
- Tell memcg to use allow_spinning=false path in bpf_timer_init() to
avoid lockup in cgroup_file_notify() (Peilin Ye)
- Fix bpf_strnstr() to handle suffix match cases (Rong Tao)"
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Skip timer cases when bpf_timer is not supported
bpf: Reject bpf_timer for PREEMPT_RT
tcp_bpf: Call sk_msg_free() when tcp_bpf_send_verdict() fails to allocate psock->cork.
bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()
bpf: Allow fall back to interpreter for programs with stack size <= 512
rqspinlock: Choose trylock fallback for NMI waiters
xsk: Fix immature cq descriptor production
bpf: Update the list of BPF selftests maintainers
selftests/bpf: Add tests for bpf_strnstr
selftests/bpf: Fix "expression result unused" warnings with icecc
bpf: Fix bpf_strnstr() to handle suffix match cases better
selftests/bpf: Extend crypto_sanity selftest with invalid dst buffer
bpf: Fix out-of-bounds dynptr write in bpf_crypto_crypt
bpf: Check the helper function is valid in get_helper_proto
bpf, cpumap: Disable page_pool direct xdp_return need larger scope
tcp reject code won't reply to a tcp reset.
But the icmp reject 'netdev' family versions will reply to icmp
dst-unreach errors, unlike icmp_send() and icmp6_send() which are used
by the inet family implementation (and internally by the REJECT target).
Check for the icmp(6) type and do not respond if its an unreachable error.
Without this, something like 'ip protocol icmp reject', when used
in a netdev chain attached to 'lo', cause a packet loop.
Same for two hosts that both use such a rule: each error packet
will be replied to.
Such situation persist until the (bogus) rule is amended to ratelimit or
checks the icmp type before the reject statement.
As the inet versions don't do this make the netdev ones follow along.
Signed-off-by: Florian Westphal <fw@strlen.de>
KCSAN reported a data-race on the `ipvs->enable` flag, which is
written in the control path and read concurrently from many other
contexts.
Following a suggestion by Julian, this patch fixes the race by
converting all accesses to use `WRITE_ONCE()/READ_ONCE()`.
This lightweight approach ensures atomic access and acts as a
compiler barrier, preventing unsafe optimizations where the flag
is checked in loops (e.g., in ip_vs_est.c).
Additionally, the `enable` checks in the fast-path hooks
(`ip_vs_in_hook`, `ip_vs_out_hook`, `ip_vs_forward_icmp`) are
removed. These are unnecessary since commit 857ca89711
("ipvs: register hooks only with services"). The `enable=0`
condition they check for can only occur in two rare and non-fatal
scenarios: 1) after hooks are registered but before the flag is set,
and 2) after hooks are unregistered on cleanup_net. In the worst
case, a single packet might be mishandled (e.g., dropped), which
does not lead to a system crash or data corruption. Adding a check
in the performance-critical fast-path to handle this harmless
condition is not a worthwhile trade-off.
Fixes: 857ca89711 ("ipvs: register hooks only with services")
Reported-by: syzbot+1651b5234028c294c339@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=1651b5234028c294c339
Suggested-by: Julian Anastasov <ja@ssi.bg>
Link: https://lore.kernel.org/lvs-devel/2189fc62-e51e-78c9-d1de-d35b8e3657e3@ssi.bg/
Signed-off-by: Zhang Tengfei <zhtfdev@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>