Currently, in cfg80211_parse_ml_elem_sta_data(), when RNR element
indicates a BSS that operates in a channel that current regulatory
domain doesn't support, a NULL value is returned by
ieee80211_get_channel_khz() and assigned to this BSS entry's channel
field. Later in cfg80211_inform_single_bss_data(), the reported
BSS entry's channel will be wrongly overridden by transmitted BSS's.
This could result in connection failure that when wpa_supplicant
tries to select this reported BSS entry while it actually resides in
an unsupported channel.
Since this channel is not supported, it is reasonable to skip such
entries instead of reporting wrong information.
Signed-off-by: Chenming Huang <quic_chenhuan@quicinc.com>
Link: https://patch.msgid.link/20240923021644.12885-1-quic_chenhuan@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The global and per-netns netdev notifier depend on RTNL, and its
dependency is not so clear due to nested calls.
Let's add a placeholder to place ASSERT_RTNL_NET() for each event.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Once an RTNL scope is converted with rtnl_net_lock(), we will replace
RTNL helper functions inside the scope with the following per-netns
alternatives:
ASSERT_RTNL() -> ASSERT_RTNL_NET(net)
rcu_dereference_rtnl(p) -> rcu_dereference_rtnl_net(net, p)
Note that the per-netns helpers are equivalent to the conventional
helpers unless CONFIG_DEBUG_NET_SMALL_RTNL is enabled.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The goal is to break RTNL down into per-netns mutex.
This patch adds per-netns mutex and its helper functions, rtnl_net_lock()
and rtnl_net_unlock().
rtnl_net_lock() acquires the global RTNL and per-netns RTNL mutex, and
rtnl_net_unlock() releases them.
We will replace 800+ rtnl_lock() with rtnl_net_lock() and finally removes
rtnl_lock() in rtnl_net_lock().
When we need to nest per-netns RTNL mutex, we will use __rtnl_net_lock(),
and its locking order is defined by rtnl_net_lock_cmp_fn() as follows:
1. init_net is first
2. netns address ascending order
Note that the conversion will be done under CONFIG_DEBUG_NET_SMALL_RTNL
with LOCKDEP so that we can carefully add the extra mutex without slowing
down RTNL operations during conversion.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- RFCOMM: FIX possible deadlock in rfcomm_sk_state_change
- hci_conn: Fix UAF in hci_enhanced_setup_sync
- btusb: Don't fail external suspend requests
* tag 'for-net-2024-10-04' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: btusb: Don't fail external suspend requests
Bluetooth: hci_conn: Fix UAF in hci_enhanced_setup_sync
Bluetooth: RFCOMM: FIX possible deadlock in rfcomm_sk_state_change
====================
Link: https://patch.msgid.link/20241004210124.4010321-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
fib_devindex_hashfn() converts a 32bit ifindex value to a 8bit hash.
It makes no sense doing this from fib_info_hashfn() and
fib_find_info_nh().
It is better to keep as many bits as possible to let
fib_info_hashfn_result() have better spread.
Only fib_info_devhash_bucket() needs to make this operation,
we can 'inline' trivial fib_devindex_hashfn() in it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241004134720.579244-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have recently noticed the exact same KASAN splat as in commit
6cd4a78d96 ("net: do not leave a dangling sk pointer, when socket
creation fails"). The problem is that commit did not fully address the
problem, as some pf->create implementations do not use sk_common_release
in their error paths.
For example, we can use the same reproducer as in the above commit, but
changing ping to arping. arping uses AF_PACKET socket and if packet_create
fails, it will just sk_free the allocated sk object.
While we could chase all the pf->create implementations and make sure they
NULL the freed sk object on error from the socket, we can't guarantee
future protocols will not make the same mistake.
So it is easier to just explicitly NULL the sk pointer upon return from
pf->create in __sock_create. We do know that pf->create always releases the
allocated sk object on error, so if the pointer is not NULL, it is
definitely dangling.
Fixes: 6cd4a78d96 ("net: do not leave a dangling sk pointer, when socket creation fails")
Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
Cc: stable@vger.kernel.org
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241003170151.69445-1-ignat@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull virtio fixes from Michael Tsirkin:
"Several small bugfixes all over the place.
Most notably, fixes the vsock allocation with GFP_KERNEL in atomic
context, which has been triggering warnings for lots of testers"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost/scsi: null-ptr-dereference in vhost_scsi_get_req()
vsock/virtio: use GFP_ATOMIC under RCU read lock
virtio_console: fix misc probe bugs
virtio_ring: tag event_triggered as racy for KCSAN
vdpa/octeon_ep: Fix format specifier for pointers in debug messages
some of those used to be needed, some had been cargo-culted for
no reason...
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This expands the validation introduced in commit 07bf790895 ("xfrm:
Validate address prefix lengths in the xfrm selector.")
syzbot created an SA with
usersa.sel.family = AF_UNSPEC
usersa.sel.prefixlen_s = 128
usersa.family = AF_INET
Because of the AF_UNSPEC selector, verify_newsa_info doesn't put
limits on prefixlen_{s,d}. But then copy_from_user_state sets
x->sel.family to usersa.family (AF_INET). Do the same conversion in
verify_newsa_info before validating prefixlen_{s,d}, since that's how
prefixlen is going to be used later on.
Reported-by: syzbot+cc39f136925517aed571@syzkaller.appspotmail.com
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The commit ea4692c75e ("lib/string_helpers: Consolidate string helpers
implementation") introduced common helpers to print string representations
of boolean helpers. These are supposed to be used instead of open coded
versions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The "struct batadv_tvlv_tt_data" uses a dynamically sized set of
trailing elements. Specifically, it uses an array of structures of type
"batadv_tvlv_tt_vlan_data". So, use the preferred way in the kernel
declaring a flexible array [1].
At the same time, prepare for the coming implementation by GCC and Clang
of the __counted_by attribute. Flexible array members annotated with
__counted_by can have their accesses bounds-checked at run-time via
CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for
strcpy/memcpy-family functions). In this case, it is important to note
that the attribute used is specifically __counted_by_be since variable
"num_vlan" is of type __be16.
The following change to the "batadv_tt_tvlv_ogm_handler_v1" function:
- tt_vlan = (struct batadv_tvlv_tt_vlan_data *)(tt_data + 1);
- tt_change = (struct batadv_tvlv_tt_change *)(tt_vlan + num_vlan);
+ tt_change = (struct batadv_tvlv_tt_change *)((void *)tt_data
+ + flex_size);
is intended to prevent the compiler from generating an "out-of-bounds"
notification due to the __counted_by attribute. The compiler can do a
pointer calculation using the vlan_data flexible array memory, or in
other words, this may be calculated as an array offset, since it is the
same as:
&tt_data->vlan_data[num_vlan]
Therefore, we go past the end of the array. In other "multiple trailing
flexible array" situations, this has been solved by addressing from the
base pointer, since the compiler either knows the full allocation size
or it knows nothing about it (this case, since it came from a "void *"
function argument).
The order in which the structure batadv_tvlv_tt_data and the structure
batadv_tvlv_tt_vlan_data are defined must be swap to avoid an incomplete
type error.
Also, avoid the open-coded arithmetic in memory allocator functions [2]
using the "struct_size" macro and use the "flex_array_size" helper to
clarify some calculations, when possible.
Moreover, the new structure member also allow us to avoid the open-coded
arithmetic on pointers in some situations. Take advantage of this.
This code was detected with the help of Coccinelle, and audited and
modified manually.
Link: https://www.kernel.org/doc/html/next/process/deprecated.html#zero-length-and-one-element-arrays [1]
Link: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [2]
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Erick Archer <erick.archer@outlook.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
This version will contain all the (major or even only minor) changes for
Linux 6.13.
The version number isn't a semantic version number with major and minor
information. It is just encoding the year of the expected publishing as
Linux -rc1 and the number of published versions this year (starting at 0).
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Some network devices have the ability to offload EDT (Earliest
Departure Time) which is the model used for TCP pacing and FQ packet
scheduler.
Some of them implement the timing wheel mechanism described in
https://saeed.github.io/files/carousel-sigcomm17.pdf
with an associated 'timing wheel horizon'.
This patchs adds to FQ packet scheduler TCA_FQ_OFFLOAD_HORIZON
attribute.
Its value is capped by the device max_pacing_offload_horizon,
added in the prior patch.
It allows FQ to let packets within pacing offload horizon
to be delivered to the device, which will handle the needed
delay without host involvement.
Signed-off-by: Jeffrey Ji <jeffreyji@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20241003121219.2396589-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The kernel may crash when deleting a genetlink family if there are still
listeners for that family:
Oops: Kernel access of bad area, sig: 11 [#1]
...
NIP [c000000000c080bc] netlink_update_socket_mc+0x3c/0xc0
LR [c000000000c0f764] __netlink_clear_multicast_users+0x74/0xc0
Call Trace:
__netlink_clear_multicast_users+0x74/0xc0
genl_unregister_family+0xd4/0x2d0
Change the unsafe loop on the list to a safe one, because inside the
loop there is an element removal from this list.
Fixes: b8273570f8 ("genetlink: fix netns vs. netlink table locking (2)")
Cc: stable@vger.kernel.org
Signed-off-by: Anastasia Kovaleva <a.kovaleva@yadro.com>
Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241003104431.12391-1-a.kovaleva@yadro.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
delack timer is not stopped from inet_csk_clear_xmit_timer()
because we do not define INET_CSK_CLEAR_TIMERS.
This is a conscious choice : inet_csk_clear_xmit_timer()
is often called from another cpu. Calling del_timer()
would cause false sharing and lock contention.
This means that very often, tcp_delack_timer() is called
at the timer expiration, while there is no ACK to transmit.
This can be detected very early, avoiding the socket spinlock.
Notes:
- test about tp->compressed_ack is racy,
but in the unlikely case there is a race, the dedicated
compressed_ack_timer hrtimer would close it.
- Even if the fast path is not taken, reading
icsk->icsk_ack.pending and tp->compressed_ack
before acquiring the socket spinlock reduces
acquisition time and chances of contention.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241002173042.917928-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
retransmit timer is not stopped from inet_csk_clear_xmit_timer()
because we do not define INET_CSK_CLEAR_TIMERS.
This is a conscious choice : for active TCP flows, it is better
to only call mod_timer(), because there is more chances of
keeping the timer unchanged. Also inet_csk_clear_xmit_timer()
is often called from another cpu, and calling del_timer()
would cause false sharing and lock contention.
This means that very often, tcp_write_timer() is called
at the timer expiration, while there is nothing to retransmit.
This can be detected very early, avoiding the socket spinlock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241002173042.917928-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
icsk->icsk_pending can be read locklessly already.
Following patch in the series will add another lockless read.
Add smp_load_acquire() and smp_store_release() annotations
because following patch will add a test in tcp_write_timer(),
and READ_ONCE()/WRITE_ONCE() alone would possibly lead to races.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241002173042.917928-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rfcomm_sk_state_change attempts to use sock_lock so it must never be
called with it locked but rfcomm_sock_ioctl always attempt to lock it
causing the following trace:
======================================================
WARNING: possible circular locking dependency detected
6.8.0-syzkaller-08951-gfe46a7dd189e #0 Not tainted
------------------------------------------------------
syz-executor386/5093 is trying to acquire lock:
ffff88807c396258 (sk_lock-AF_BLUETOOTH-BTPROTO_RFCOMM){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1671 [inline]
ffff88807c396258 (sk_lock-AF_BLUETOOTH-BTPROTO_RFCOMM){+.+.}-{0:0}, at: rfcomm_sk_state_change+0x5b/0x310 net/bluetooth/rfcomm/sock.c:73
but task is already holding lock:
ffff88807badfd28 (&d->lock){+.+.}-{3:3}, at: __rfcomm_dlc_close+0x226/0x6a0 net/bluetooth/rfcomm/core.c:491
Reported-by: syzbot+d7ce59b06b3eb14fd218@syzkaller.appspotmail.com
Tested-by: syzbot+d7ce59b06b3eb14fd218@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d7ce59b06b3eb14fd218
Fixes: 3241ad820d ("[Bluetooth] Add timestamp support to L2CAP, RFCOMM and SCO")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This warning is emitted when a driver does not default populate an rss
key when one is not provided from userspace. Some devices do not
support individual rss keys per context. For these devices, it is ok
to leave the key zeroed out in ethtool_rxfh_context. Do not warn on
zeroed key when ethtool_ops.rxfh_per_ctx_key == 0.
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20241003162310.1310576-1-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The last type of sockets which supports SOF_TIMESTAMPING_OPT_ID is RAW
sockets. To add new option this patch converts all callers (direct and
indirect) of _sock_tx_timestamp to provide sockcm_cookie instead of
tsflags. And while here fix __sock_tx_timestamp to receive tsflags as
__u32 instead of __u16.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://patch.msgid.link/20241001125716.2832769-3-vadfed@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
SOF_TIMESTAMPING_OPT_ID socket option flag gives a way to correlate TX
timestamps and packets sent via socket. Unfortunately, there is no way
to reliably predict socket timestamp ID value in case of error returned
by sendmsg. For UDP sockets it's impossible because of lockless
nature of UDP transmit, several threads may send packets in parallel. In
case of RAW sockets MSG_MORE option makes things complicated. More
details are in the conversation [1].
This patch adds new control message type to give user-space
software an opportunity to control the mapping between packets and
values by providing ID with each sendmsg for UDP sockets.
The documentation is also added in this patch.
[1] https://lore.kernel.org/netdev/CALCETrU0jB+kg0mhV6A8mrHfTE1D1pr1SD_B9Eaa9aDPfgHdtA@mail.gmail.com/
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://patch.msgid.link/20241001125716.2832769-2-vadfed@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
alloc_ltalkdev in net/appletalk/dev.c is dead since
commit 00f3696f75 ("net: appletalk: remove cops support")
Removing it (and it's helper) leaves dev.c and if_ltalk.h empty;
remove them and the Makefile entry.
tun.c was including that if_ltalk.h but actually wanted
the uapi version for LTALK_ALEN, fix up the path.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In rxrpc_open_socket(), it sets up the socket and then sets up the I/O
thread that will handle it. This is a problem, however, as there's a gap
between the two phases in which a packet may come into rxrpc_encap_rcv()
from the UDP packet but we oops when trying to wake the not-yet created I/O
thread.
As a quick fix, just make rxrpc_encap_rcv() discard the packet if there's
no I/O thread yet.
A better, but more intrusive fix would perhaps be to rearrange things such
that the socket creation is done by the I/O thread.
Fixes: a275da62e8 ("rxrpc: Create a per-local endpoint receive queue and I/O thread")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: yuxuanzhe@outlook.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241001132702.3122709-2-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pass a dscp_t variable to ip_route_input_rcu(), instead of a plain u8,
to prevent accidental setting of ECN bits in ->flowi4_tos.
Callers of ip_route_input_rcu() to consider are:
* ip_route_input_noref(), which already has a dscp_t variable to pass
as parameter. We just need to remove the inet_dscp_to_dsfield()
conversion.
* inet_rtm_getroute(), which receives a u8 from user space and needs
to convert it with inet_dsfield_to_dscp().
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/c4dbb5aa9cbc79c4fcb317abbffa7c7156bc56a7.1727807926.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pass a dscp_t variable to ip_route_input_noref(), instead of a plain
u8, to prevent accidental setting of ECN bits in ->flowi4_tos.
Callers of ip_route_input_noref() to consider are:
* arp_process() in net/ipv4/arp.c. This function sets the tos
parameter to 0, which is already a valid dscp_t value, so it
doesn't need to be adjusted for the new prototype.
* ip_route_input(), which already has a dscp_t variable to pass as
parameter. We just need to remove the inet_dscp_to_dsfield()
conversion.
* ipvlan_l3_rcv(), bpf_lwt_input_reroute(), ip_expire(),
ip_rcv_finish_core(), xfrm4_rcv_encap_finish() and
xfrm4_rcv_encap(), which get the DSCP directly from IPv4 headers
and can simply use the ip4h_dscp() helper.
While there, declare the IPv4 header pointers as const in
ipvlan_l3_rcv() and bpf_lwt_input_reroute().
Also, modify the declaration of ip_route_input_noref() in
include/net/route.h so that it matches the prototype of its
implementation in net/ipv4/route.c.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/a8a747bed452519c4d0cc06af32c7e7795d7b627.1727807926.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pass a dscp_t variable to ip_route_input(), instead of a plain u8, to
prevent accidental setting of ECN bits in ->flowi4_tos.
Callers of ip_route_input() to consider are:
* input_action_end_dx4_finish() and input_action_end_dt4() in
net/ipv6/seg6_local.c. These functions set the tos parameter to 0,
which is already a valid dscp_t value, so they don't need to be
adjusted for the new prototype.
* icmp_route_lookup(), which already has a dscp_t variable to pass as
parameter. We just need to remove the inet_dscp_to_dsfield()
conversion.
* br_nf_pre_routing_finish(), ip_options_rcv_srr() and ip4ip6_err(),
which get the DSCP directly from IPv4 headers. Define a helper to
read the .tos field of struct iphdr as dscp_t, so that these
function don't have to do the conversion manually.
While there, declare *iph as const in br_nf_pre_routing_finish(),
declare its local variables in reverse-christmas-tree order and move
the "err = ip_route_input()" assignment out of the conditional to avoid
checkpatch warning.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/e9d40781d64d3d69f4c79ac8a008b8d67a033e8d.1727807926.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix tcp_rcv_synrecv_state_fastopen() to not zero retrans_stamp
if retransmits are outstanding.
tcp_fastopen_synack_timer() sets retrans_stamp, so typically we'll
need to zero retrans_stamp here to prevent spurious
retransmits_timed_out(). The logic to zero retrans_stamp is from this
2019 commit:
commit cd736d8b67 ("tcp: fix retrans timestamp on passive Fast Open")
However, in the corner case where the ACK of our TFO SYNACK carried
some SACK blocks that caused us to enter TCP_CA_Recovery then that
non-zero retrans_stamp corresponds to the active fast recovery, and we
need to leave retrans_stamp with its current non-zero value, for
correct ETIMEDOUT and undo behavior.
Fixes: cd736d8b67 ("tcp: fix retrans timestamp on passive Fast Open")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241001200517.2756803-4-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix tcp_enter_recovery() so that if there are no retransmits out then
we zero retrans_stamp when entering fast recovery. This is necessary
to fix two buggy behaviors.
Currently a non-zero retrans_stamp value can persist across multiple
back-to-back loss recovery episodes. This is because we generally only
clears retrans_stamp if we are completely done with loss recoveries,
and get to tcp_try_to_open() and find !tcp_any_retrans_done(sk). This
behavior causes two bugs:
(1) When a loss recovery episode (CA_Loss or CA_Recovery) is followed
immediately by a new CA_Recovery, the retrans_stamp value can persist
and can be a time before this new CA_Recovery episode starts. That
means that timestamp-based undo will be using the wrong retrans_stamp
(a value that is too old) when comparing incoming TS ecr values to
retrans_stamp to see if the current fast recovery episode can be
undone.
(2) If there is a roughly minutes-long sequence of back-to-back fast
recovery episodes, one after another (e.g. in a shallow-buffered or
policed bottleneck), where each fast recovery successfully makes
forward progress and recovers one window of sequence space (but leaves
at least one retransmit in flight at the end of the recovery),
followed by several RTOs, then the ETIMEDOUT check may be using the
wrong retrans_stamp (a value set at the start of the first fast
recovery in the sequence). This can cause a very premature ETIMEDOUT,
killing the connection prematurely.
This commit changes the code to zero retrans_stamp when entering fast
recovery, when this is known to be safe (no retransmits are out in the
network). That ensures that when starting a fast recovery episode, and
it is safe to do so, retrans_stamp is set when we send the fast
retransmit packet. That addresses both bug (1) and bug (2) by ensuring
that (if no retransmits are out when we start a fast recovery) we use
the initial fast retransmit of this fast recovery as the time value
for undo and ETIMEDOUT calculations.
This makes intuitive sense, since the start of a new fast recovery
episode (in a scenario where no lost packets are out in the network)
means that the connection has made forward progress since the last RTO
or fast recovery, and we should thus "restart the clock" used for both
undo and ETIMEDOUT logic.
Note that if when we start fast recovery there *are* retransmits out
in the network, there can still be undesirable (1)/(2) issues. For
example, after this patch we can still have the (1) and (2) problems
in cases like this:
+ round 1: sender sends flight 1
+ round 2: sender receives SACKs and enters fast recovery 1,
retransmits some packets in flight 1 and then sends some new data as
flight 2
+ round 3: sender receives some SACKs for flight 2, notes losses, and
retransmits some packets to fill the holes in flight 2
+ fast recovery has some lost retransmits in flight 1 and continues
for one or more rounds sending retransmits for flight 1 and flight 2
+ fast recovery 1 completes when snd_una reaches high_seq at end of
flight 1
+ there are still holes in the SACK scoreboard in flight 2, so we
enter fast recovery 2, but some retransmits in the flight 2 sequence
range are still in flight (retrans_out > 0), so we can't execute the
new retrans_stamp=0 added here to clear retrans_stamp
It's not yet clear how to fix these remaining (1)/(2) issues in an
efficient way without breaking undo behavior, given that retrans_stamp
is currently used for undo and ETIMEDOUT. Perhaps the optimal (but
expensive) strategy would be to set retrans_stamp to the timestamp of
the earliest outstanding retransmit when entering fast recovery. But
at least this commit makes things better.
Note that this does not change the semantics of retrans_stamp; it
simply makes retrans_stamp accurate in some cases where it was not
before:
(1) Some loss recovery, followed by an immediate entry into a fast
recovery, where there are no retransmits out when entering the fast
recovery.
(2) When a TFO server has a SYNACK retransmit that sets retrans_stamp,
and then the ACK that completes the 3-way handshake has SACK blocks
that trigger a fast recovery. In this case when entering fast recovery
we want to zero out the retrans_stamp from the TFO SYNACK retransmit,
and set the retrans_stamp based on the timestamp of the fast recovery.
We introduce a tcp_retrans_stamp_cleanup() helper, because this
two-line sequence already appears in 3 places and is about to appear
in 2 more as a result of this bug fix patch series. Once this bug fix
patches series in the net branch makes it into the net-next branch
we'll update the 3 other call sites to use the new helper.
This is a long-standing issue. The Fixes tag below is chosen to be the
oldest commit at which the patch will apply cleanly, which is from
Linux v3.5 in 2012.
Fixes: 1fbc340514 ("tcp: early retransmit: tcp_enter_recovery()")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241001200517.2756803-3-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix the TCP loss recovery undo logic in tcp_packet_delayed() so that
it can trigger undo even if TSQ prevents a fast recovery episode from
reaching tcp_retransmit_skb().
Geumhwan Yu <geumhwan.yu@samsung.com> recently reported that after
this commit from 2019:
commit bc9f38c832 ("tcp: avoid unconditional congestion window undo
on SYN retransmit")
...and before this fix we could have buggy scenarios like the
following:
+ Due to reordering, a TCP connection receives some SACKs and enters a
spurious fast recovery.
+ TSQ prevents all invocations of tcp_retransmit_skb(), because many
skbs are queued in lower layers of the sending machine's network
stack; thus tp->retrans_stamp remains 0.
+ The connection receives a TCP timestamp ECR value echoing a
timestamp before the fast recovery, indicating that the fast
recovery was spurious.
+ The connection fails to undo the spurious fast recovery because
tp->retrans_stamp is 0, and thus tcp_packet_delayed() returns false,
due to the new logic in the 2019 commit: commit bc9f38c832 ("tcp:
avoid unconditional congestion window undo on SYN retransmit")
This fix tweaks the logic to be more similar to the
tcp_packet_delayed() logic before bc9f38c832, except that we take
care not to be fooled by the FLAG_SYN_ACKED code path zeroing out
tp->retrans_stamp (the bug noted and fixed by Yuchung in
bc9f38c832).
Note that this returns the high-level behavior of tcp_packet_delayed()
to again match the comment for the function, which says: "Nothing was
retransmitted or returned timestamp is less than timestamp of the
first retransmission." Note that this comment is in the original
2005-04-16 Linux git commit, so this is evidently long-standing
behavior.
Fixes: bc9f38c832 ("tcp: avoid unconditional congestion window undo on SYN retransmit")
Reported-by: Geumhwan Yu <geumhwan.yu@samsung.com>
Diagnosed-by: Geumhwan Yu <geumhwan.yu@samsung.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241001200517.2756803-2-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull networking fixes from Paolo Abeni:
"Including fixes from ieee802154, bluetooth and netfilter.
Current release - regressions:
- eth: mlx5: fix wrong reserved field in hca_cap_2 in mlx5_ifc
- eth: am65-cpsw: fix forever loop in cleanup code
Current release - new code bugs:
- eth: mlx5: HWS, fixed double-free in error flow of creating SQ
Previous releases - regressions:
- core: avoid potential underflow in qdisc_pkt_len_init() with UFO
- core: test for not too small csum_start in virtio_net_hdr_to_skb()
- vrf: revert "vrf: remove unnecessary RCU-bh critical section"
- bluetooth:
- fix uaf in l2cap_connect
- fix possible crash on mgmt_index_removed
- dsa: improve shutdown sequence
- eth: mlx5e: SHAMPO, fix overflow of hd_per_wq
- eth: ip_gre: fix drops of small packets in ipgre_xmit
Previous releases - always broken:
- core: fix gso_features_check to check for both
dev->gso_{ipv4_,}max_size
- core: fix tcp fraglist segmentation after pull from frag_list
- netfilter: nf_tables: prevent nf_skb_duplicated corruption
- sctp: set sk_state back to CLOSED if autobind fails in
sctp_listen_start
- mac802154: fix potential RCU dereference issue in
mac802154_scan_worker
- eth: fec: restart PPS after link state change"
* tag 'net-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (48 commits)
sctp: set sk_state back to CLOSED if autobind fails in sctp_listen_start
dt-bindings: net: xlnx,axi-ethernet: Add missing reg minItems
doc: net: napi: Update documentation for napi_schedule_irqoff
net/ncsi: Disable the ncsi work before freeing the associated structure
net: phy: qt2025: Fix warning: unused import DeviceId
gso: fix udp gso fraglist segmentation after pull from frag_list
bridge: mcast: Fail MDB get request on empty entry
vrf: revert "vrf: Remove unnecessary RCU-bh critical section"
net: ethernet: ti: am65-cpsw: Fix forever loop in cleanup code
net: phy: realtek: Check the index value in led_hw_control_get
ppp: do not assume bh is held in ppp_channel_bridge_input()
selftests: rds: move include.sh to TEST_FILES
net: test for not too small csum_start in virtio_net_hdr_to_skb()
net: gso: fix tcp fraglist segmentation after pull from frag_list
ipv4: ip_gre: Fix drops of small packets in ipgre_xmit
net: stmmac: dwmac4: extend timeout for VLAN Tag register busy bit check
net: add more sanity checks to qdisc_pkt_len_init()
net: avoid potential underflow in qdisc_pkt_len_init() with UFO
net: ethernet: ti: cpsw_ale: Fix warning on some platforms
net: microchip: Make FDMA config symbol invisible
...