Given how late we are in the cycle, merge the two fixes from
wireless into wireless-next as they don't see that urgent.
This way, the wireless tree won't need rebasing later.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ath.git patches for v6.10
ath12k
* debugfs support
* dfs_simulate_radar debugfs file
* disable Wireless Extensions
* suspend and hibernation support
* ACPI support
* refactoring in preparation of multi-link support
ath11k
* support hibernation (required changes in qrtr and MHI subsystems)
* ieee80211-freq-limit Device Tree property support
ath10k
* firmware-name Device Tree property support
wdev->valid_links is not cleared when upper layer disconnect from a
wdev->AP MLD. It has been observed that this would prevent offchannel
operations like remain-on-channel which would be needed for user space
operations with Public Action frame.
Clear the wdev->valid_links when STA disconnects.
Signed-off-by: Xin Deng <quic_deng@quicinc.com>
Link: https://msgid.link/20240426092501.8592-1-quic_deng@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Before request->channels[] can be used, request->n_channels must be set.
Additionally, address calculations for memory after the "channels" array
need to be calculated from the allocation base ("request") rather than
via the first "out of bounds" index of "channels", otherwise run-time
bounds checking will throw a warning.
Reported-by: Nathan Chancellor <nathan@kernel.org>
Fixes: e3eac9f32e ("wifi: cfg80211: Annotate struct cfg80211_scan_request with __counted_by")
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://msgid.link/20240424220057.work.819-kees@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Tony Nguyen says:
====================
net: intel: start The Great Code Dedup + Page Pool for iavf
Alexander Lobakin says:
Here's a two-shot: introduce {,Intel} Ethernet common library (libeth and
libie) and switch iavf to Page Pool. Details are in the commit messages;
here's a summary:
Not a secret there's a ton of code duplication between two and more Intel
ethernet modules. Before introducing new changes, which would need to be
copied over again, start decoupling the already existing duplicate
functionality into a new module, which will be shared between several
Intel Ethernet drivers. The first name that came to my mind was
"libie" -- "Intel Ethernet common library". Also this sounds like
"lovelie" (-> one word, no "lib I E" pls) and can be expanded as
"lib Internet Explorer" :P
The "generic", pure-software part is placed separately, so that it can be
easily reused in any driver by any vendor without linking to the Intel
pre-200G guts. In a few words, it's something any modern driver does the
same way, but nobody moved it level up (yet).
The series is only the beginning. From now on, adding every new feature
or doing any good driver refactoring will remove much more lines than add
for quite some time. There's a basic roadmap with some deduplications
planned already, not speaking of that touching every line now asks:
"can I share this?". The final destination is very ambitious: have only
one unified driver for at least i40e, ice, iavf, and idpf with a struct
ops for each generation. That's never gonna happen, right? But you still
can at least try.
PP conversion for iavf lands within the same series as these two are tied
closely. libie will support Page Pool model only, so that a driver can't
use much of the lib until it's converted. iavf is only the example, the
rest will eventually be converted soon on a per-driver basis. That is
when it gets really interesting. Stay tech.
* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
MAINTAINERS: add entry for libeth and libie
iavf: switch to Page Pool
iavf: pack iavf_ring more efficiently
libeth: add Rx buffer management
page_pool: add DMA-sync-for-CPU inline helper
page_pool: constify some read-only function arguments
slab: introduce kvmalloc_array_node() and kvcalloc_node()
iavf: drop page splitting and recycling
iavf: kill "legacy-rx" for good
net: intel: introduce {, Intel} Ethernet common library
====================
Link: https://lore.kernel.org/r/20240424203559.3420468-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While testing TCP performance with latest trees,
I saw suspect SOCKET_BACKLOG drops.
tcp_add_backlog() computes its limit with :
limit = (u32)READ_ONCE(sk->sk_rcvbuf) +
(u32)(READ_ONCE(sk->sk_sndbuf) >> 1);
limit += 64 * 1024;
This does not take into account that sk->sk_backlog.len
is reset only at the very end of __release_sock().
Both sk->sk_backlog.len and sk->sk_rmem_alloc could reach
sk_rcvbuf in normal conditions.
We should double sk->sk_rcvbuf contribution in the formula
to absorb bubbles in the backlog, which happen more often
for very fast flows.
This change maintains decent protection against abuses.
Fixes: c377411f24 ("net: sk_add_backlog() take rmem_alloc into account")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240423125620.3309458-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kalle Valo says:
====================
wireless-next patches for v6.10
The second "new features" pull request for v6.10 with changes both in
stack and in drivers. This time the pull request is rather small and
nothing special standing out except maybe that we have several
kernel-doc fixes. Great to see that we are getting warning free
wireless code (until new warnings are added).
Major changes:
rtl8xxxu:
* enable Management Frame Protection (MFP) support
rtw88:
* disable unsupported interface type of mesh point for all chips, and only
support station mode for SDIO chips.
* tag 'wireless-next-2024-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (63 commits)
wifi: mac80211: handle link ID during management Tx
wifi: mac80211: handle sdata->u.ap.active flag with MLO
wifi: cfg80211: add return docs for regulatory functions
wifi: cfg80211: make some regulatory functions void
wifi: mac80211: add return docs for sta_info_flush()
wifi: mac80211: keep mac80211 consistent on link activation failure
wifi: mac80211: simplify ieee80211_assign_link_chanctx()
wifi: mac80211: reserve chanctx during find
wifi: cfg80211: fix cfg80211 function kernel-doc
wifi: mac80211_hwsim: Use wider regulatory for custom for 6GHz tests
wifi: iwlwifi: mvm: Don't allow EMLSR when the RSSI is low
wifi: iwlwifi: mvm: disable EMLSR when we suspend with wowlan
wifi: iwlwifi: mvm: get periodic statistics in EMLSR
wifi: iwlwifi: mvm: don't recompute EMLSR mode in can_activate_links
wifi: iwlwifi: mvm: implement EMLSR prevention mechanism.
wifi: iwlwifi: mvm: exit EMLSR upon missed beacon
wifi: iwlwifi: mvm: init vif works only once
wifi: iwlwifi: mvm: Add helper functions to update EMLSR status
wifi: iwlwifi: mvm: Implement new link selection algorithm
wifi: iwlwifi: mvm: move EMLSR/links code
...
====================
Link: https://lore.kernel.org/r/20240424100122.217AEC113CE@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter, wireless and bluetooth.
Nothing major, regression fixes are mostly in drivers, two more of
those are flowing towards us thru various trees. I wish some of the
changes went into -rc5, we'll try to keep an eye on frequency of PRs
from sub-trees.
Also disproportional number of fixes for bugs added in v6.4, strange
coincidence.
Current release - regressions:
- igc: fix LED-related deadlock on driver unbind
- wifi: mac80211: small fixes to recent clean up of the connection
process
- Revert "wifi: iwlwifi: bump FW API to 90 for BZ/SC devices", kernel
doesn't have all the code to deal with that version, yet
- Bluetooth:
- set power_ctrl_enabled on NULL returned by gpiod_get_optional()
- qca: fix invalid device address check, again
- eth: ravb: fix registered interrupt names
Current release - new code bugs:
- wifi: mac80211: check EHT/TTLM action frame length
Previous releases - regressions:
- fix sk_memory_allocated_{add|sub} for architectures where
__this_cpu_{add|sub}* are not IRQ-safe
- dsa: mv88e6xx: fix link setup for 88E6250
Previous releases - always broken:
- ip: validate dev returned from __in_dev_get_rcu(), prevent possible
null-derefs in a few places
- switch number of for_each_rcu() loops using call_rcu() on the
iterator to for_each_safe()
- macsec: fix isolation of broadcast traffic in presence of offload
- vxlan: drop packets from invalid source address
- eth: mlxsw: trap and ACL programming fixes
- eth: bnxt: PCIe error recovery fixes, fix counting dropped packets
- Bluetooth:
- lots of fixes for the command submission rework from v6.4
- qca: fix NULL-deref on non-serdev suspend
Misc:
- tools: ynl: don't ignore errors in NLMSG_DONE messages"
* tag 'net-6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().
net: b44: set pause params only when interface is up
tls: fix lockless read of strp->msg_ready in ->poll
dpll: fix dpll_pin_on_pin_register() for multiple parent pins
net: ravb: Fix registered interrupt names
octeontx2-af: fix the double free in rvu_npc_freemem()
net: ethernet: ti: am65-cpts: Fix PTPv1 message type on TX packets
ice: fix LAG and VF lock dependency in ice_reset_vf()
iavf: Fix TC config comparison with existing adapter TC config
i40e: Report MFS in decimal base instead of hex
i40e: Do not use WQ_MEM_RECLAIM flag for workqueue
net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()
net/mlx5e: Advertise mlx5 ethernet driver updates sk_buff md_dst for MACsec
macsec: Detect if Rx skb is macsec-related for offloading devices that update md_dst
ethernet: Add helper for assigning packet type when dest address does not match device address
macsec: Enable devices to advertise whether they update sk_buff md_dst during offloads
net: phy: dp83869: Fix MII mode failure
netfilter: nf_tables: honor table dormant flag from netdev release event path
eth: bnxt: fix counting packets discarded due to OOM and netpoll
igc: Fix LED-related deadlock on driver unbind
...
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains two Netfilter/IPVS fixes for net:
Patch #1 fixes SCTP checksumming for IPVS with gso packets,
from Ismael Luceno.
Patch #2 honor dormant flag from netdev event path to fix a possible
double hook unregistration.
* tag 'nf-24-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: honor table dormant flag from netdev release event path
ipvs: Fix checksumming on GSO of SCTP packets
====================
Link: https://lore.kernel.org/r/20240425090149.1359547-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In br_fill_forward_path(), f->dst is checked not to be NULL, then
immediately read using READ_ONCE and checked again. The first check is
useless, so this patch aims to remove the redundant check of f->dst.
Signed-off-by: linke li <lilinke99@qq.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes berg says:
====================
Fixes for the current cycle:
* ath11k: convert to correct RCU iteration of IPv6 addresses
* iwlwifi: link ID, FW API version, scanning and PASN fixes
* cfg80211: NULL-deref and tracing fixes
* mac80211: connection mode, mesh fast-TX, multi-link and
various other small fixes
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Marking TCP_SKB_CB(skb)->sacked with TCPCB_EVER_RETRANS after the
traceopint (trace_tcp_retransmit_skb), then we can get the
retransmission efficiency by counting skbs w/ and w/o TCPCB_EVER_RETRANS
mark in this tracepoint.
We have discussed to achieve this with BPF_SOCK_OPS in [0], and using
tracepoint is thought to be a better solution.
[0]
https://lore.kernel.org/all/20240417124622.35333-1-lulie@linux.alibaba.com/
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- qca: set power_ctrl_enabled on NULL returned by gpiod_get_optional()
- hci_sync: Using hci_cmd_sync_submit when removing Adv Monitor
- qca: fix invalid device address check
- hci_sync: Use advertised PHYs on hci_le_ext_create_conn_sync
- Fix type of len in {l2cap,sco}_sock_getsockopt_old()
- btusb: mediatek: Fix double free of skb in coredump
- btusb: Add Realtek RTL8852BE support ID 0x0bda:0x4853
- btusb: Fix triggering coredump implementation for QCA
* tag 'for-net-2024-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: qca: set power_ctrl_enabled on NULL returned by gpiod_get_optional()
Bluetooth: hci_sync: Using hci_cmd_sync_submit when removing Adv Monitor
Bluetooth: qca: fix NULL-deref on non-serdev setup
Bluetooth: qca: fix NULL-deref on non-serdev suspend
Bluetooth: btusb: mediatek: Fix double free of skb in coredump
Bluetooth: MGMT: Fix failing to MGMT_OP_ADD_UUID/MGMT_OP_REMOVE_UUID
Bluetooth: qca: fix invalid device address check
Bluetooth: hci_event: Fix sending HCI_OP_READ_ENC_KEY_SIZE
Bluetooth: btusb: Fix triggering coredump implementation for QCA
Bluetooth: btusb: Add Realtek RTL8852BE support ID 0x0bda:0x4853
Bluetooth: hci_sync: Use advertised PHYs on hci_le_ext_create_conn_sync
Bluetooth: Fix type of len in {l2cap,sco}_sock_getsockopt_old()
====================
Link: https://lore.kernel.org/r/20240424204102.2319483-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tony Nguyen says:
====================
ice: Support 5 layer Tx scheduler topology
Mateusz Polchlopek says:
For performance reasons there is a need to have support for selectable
Tx scheduler topology. Currently firmware supports only the default
9-layer and 5-layer topology. This patch series enables switch from
default to 5-layer topology, if user decides to opt-in.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
ice: Document tx_scheduling_layers parameter
ice: Add tx_scheduling_layers devlink param
ice: Enable switching default Tx scheduler topology
ice: Adjust the VSI/Aggregator layers
ice: Support 5 layer topology
devlink: extend devlink_param *set pointer
====================
Link: https://lore.kernel.org/r/20240422203913.225151-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since kfree_rcu, which is called in the hlist_for_each_entry_rcu traversal
of ovs_ct_limit_exit, is not part of the RCU read critical section, it
is possible that the RCU grace period will pass during the traversal and
the key will be free.
To prevent this, it should be changed to hlist_for_each_entry_safe.
Fixes: 11efd5cb04 ("openvswitch: Support conntrack zone limit")
Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://lore.kernel.org/r/ZiYvzQN/Ry5oeFQW@v4bel-B760M-AORUS-ELITE-AX
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It was observed in the wild that pairs of consecutive packets would leave
the IPVS with the same wrong checksum, and the issue only went away when
disabling GSO.
IPVS needs to avoid computing the SCTP checksum when using GSO.
Fixes: 90017accff ("sctp: Add GSO support")
Co-developed-by: Firo Yang <firo.yang@suse.com>
Signed-off-by: Ismael Luceno <iluceno@suse.de>
Tested-by: Andreas Taschner <andreas.taschner@suse.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since the d883a4669a be introduced in v6.4, bluetooth daemon
got the following failed message of MGMT_OP_REMOVE_ADV_MONITOR
command when controller is power-off:
bluetoothd[20976]:
src/adapter.c:reset_adv_monitors_complete() Failed to reset Adv
Monitors: Failed>
Normally this situation is happened when the bluetoothd deamon
be started manually after system booting. Which means that
bluetoothd received MGMT_EV_INDEX_ADDED event after kernel
runs hci_power_off().
Base on doc/mgmt-api.txt, the MGMT_OP_REMOVE_ADV_MONITOR command
can be used when the controller is not powered. This patch changes
the code in remove_adv_monitor() to use hci_cmd_sync_submit()
instead of hci_cmd_sync_queue().
Fixes: d883a4669a ("Bluetooth: hci_sync: Only allow hci_cmd_sync_queue if running")
Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Cc: Manish Mandlik <mmandlik@google.com>
Cc: Archie Pusaka <apusaka@chromium.org>
Cc: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Chun-Yi Lee <jlee@suse.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
These commands don't require the adapter to be up and running so don't
use hci_cmd_sync_queue which would check that flag, instead use
hci_cmd_sync_submit which would ensure mgmt_class_complete is set
properly regardless if any command was actually run or not.
Link: https://github.com/bluez/bluez/issues/809
Fixes: d883a4669a ("Bluetooth: hci_sync: Only allow hci_cmd_sync_queue if running")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The code shall always check if HCI_QUIRK_BROKEN_READ_ENC_KEY_SIZE has
been set before attempting to use HCI_OP_READ_ENC_KEY_SIZE.
Fixes: c569242cd4 ("Bluetooth: hci_event: set the conn encrypted before conn establishes")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The extended advertising reports do report the PHYs so this store then
in hci_conn so it can be later used in hci_le_ext_create_conn_sync to
narrow the PHYs to be scanned since the controller will also perform a
scan having a smaller set of PHYs shall reduce the time it takes to
find and connect peers.
Fixes: 288c90224e ("Bluetooth: Enable all supported LE PHY by default")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
After an innocuous optimization change in LLVM main (19.0.0), x86_64
allmodconfig (which enables CONFIG_KCSAN / -fsanitize=thread) fails to
build due to the checks in check_copy_size():
In file included from net/bluetooth/sco.c:27:
In file included from include/linux/module.h:13:
In file included from include/linux/stat.h:19:
In file included from include/linux/time.h:60:
In file included from include/linux/time32.h:13:
In file included from include/linux/timex.h:67:
In file included from arch/x86/include/asm/timex.h:6:
In file included from arch/x86/include/asm/tsc.h:10:
In file included from arch/x86/include/asm/msr.h:15:
In file included from include/linux/percpu.h:7:
In file included from include/linux/smp.h:118:
include/linux/thread_info.h:244:4: error: call to '__bad_copy_from'
declared with 'error' attribute: copy source size is too small
244 | __bad_copy_from();
| ^
The same exact error occurs in l2cap_sock.c. The copy_to_user()
statements that are failing come from l2cap_sock_getsockopt_old() and
sco_sock_getsockopt_old(). This does not occur with GCC with or without
KCSAN or Clang without KCSAN enabled.
len is defined as an 'int' because it is assigned from
'__user int *optlen'. However, it is clamped against the result of
sizeof(), which has a type of 'size_t' ('unsigned long' for 64-bit
platforms). This is done with min_t() because min() requires compatible
types, which results in both len and the result of sizeof() being casted
to 'unsigned int', meaning len changes signs and the result of sizeof()
is truncated. From there, len is passed to copy_to_user(), which has a
third parameter type of 'unsigned long', so it is widened and changes
signs again. This excessive casting in combination with the KCSAN
instrumentation causes LLVM to fail to eliminate the __bad_copy_from()
call, failing the build.
The official recommendation from LLVM developers is to consistently use
long types for all size variables to avoid the unnecessary casting in
the first place. Change the type of len to size_t in both
l2cap_sock_getsockopt_old() and sco_sock_getsockopt_old(). This clears
up the error while allowing min_t() to be replaced with min(), resulting
in simpler code with no casts and fewer implicit conversions. While len
is a different type than optlen now, it should result in no functional
change because the result of sizeof() will clamp all values of optlen in
the same manner as before.
Cc: stable@vger.kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/2007
Link: https://github.com/llvm/llvm-project/issues/85647
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
There are several functions taking pointers to data they don't modify.
This includes statistics fetching, page and page_pool parameters, etc.
Constify the pointers, so that call sites will be able to pass const
pointers as well.
No functional changes, no visible changes in functions sizes.
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
It is impossible to use init_dummy_netdev together with alloc_netdev()
as the 'setup' argument.
This is because alloc_netdev() initializes some fields in the net_device
structure, and later init_dummy_netdev() memzero them all. This causes
some problems as reported here:
https://lore.kernel.org/all/20240322082336.49f110cc@kernel.org/
Split the init_dummy_netdev() function in two. Create a new function called
init_dummy_netdev_core() that does not memzero the net_device structure.
Then have init_dummy_netdev() memzero-ing and calling
init_dummy_netdev_core(), keeping the old behaviour.
init_dummy_netdev_core() is the new function that could be called as an
argument for alloc_netdev().
Also, create a helper to allocate and initialize dummy net devices,
leveraging init_dummy_netdev_core() as the setup argument. This function
basically simplify the allocation of dummy devices, by allocating and
initializing it. Freeing the device continue to be done through
free_netdev()
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For dummy devices, exit earlier at free_netdev() instead of executing
the whole function. This is necessary, because dummy devices are
special, and shouldn't have the second part of the function executed.
Otherwise reg_state, which is NETREG_DUMMY, will be overwritten and
there will be no way to identify that this is a dummy device. Also, this
device do not need the final put_device(), since dummy devices are not
registered (through register_netdevice()), where the device reference is
increased (at netdev_register_kobject()/device_add()).
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix bad grammar in description of init_dummy_netdev() function. This
topic showed up in the review of the "allocate dummy device dynamically"
patch set.
Suggested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since call_rcu, which is called in the hlist_for_each_entry_rcu traversal
of tcp_ao_connect_init, is not part of the RCU read critical section, it
is possible that the RCU grace period will pass during the traversal and
the key will be free.
To prevent this, it should be changed to hlist_for_each_entry_safe.
Fixes: 7c2ffaf21b ("net/tcp: Calculate TCP-AO traffic keys")
Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://lore.kernel.org/r/ZiYu9NJ/ClR8uSkH@v4bel-B760M-AORUS-ELITE-AX
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Note that when this commit message refers to netlink dump
it only means the actual dumping part, the parsing / dump
start is handled by the same code as "doit".
Commit 4a19edb60d ("netlink: Pass extack to dump handlers")
added support for returning extack messages from dump handlers,
but left out other extack info, e.g. bad attribute.
This used to be fine because until YNL we had little practical
use for the machine readable attributes, and only messages were
used in practice.
YNL flips the preference 180 degrees, it's now much more useful
to point to a bad attr with NL_SET_BAD_ATTR() than type
an English message saying "attribute XYZ is $reason-why-bad".
Support all of extack. The fact that extack only gets added if
it fits remains unaddressed.
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240420023543.3300306-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzbot reported use-after-free in unix_del_edges(). [0]
What the repro does is basically repeat the following quickly.
1. pass a fd of an AF_UNIX socket to itself
socketpair(AF_UNIX, SOCK_DGRAM, 0, [3, 4]) = 0
sendmsg(3, {..., msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET,
cmsg_type=SCM_RIGHTS, cmsg_data=[4]}], ...}, 0) = 0
2. pass other fds of AF_UNIX sockets to the socket above
socketpair(AF_UNIX, SOCK_SEQPACKET, 0, [5, 6]) = 0
sendmsg(3, {..., msg_control=[{cmsg_len=48, cmsg_level=SOL_SOCKET,
cmsg_type=SCM_RIGHTS, cmsg_data=[5, 6]}], ...}, 0) = 0
3. close all sockets
Here, two skb are created, and every unix_edge->successor is the first
socket. Then, __unix_gc() will garbage-collect the two skb:
(a) free skb with self-referencing fd
(b) free skb holding other sockets
After (a), the self-referencing socket will be scheduled to be freed
later by the delayed_fput() task.
syzbot repeated the sequences above (1. ~ 3.) quickly and triggered
the task concurrently while GC was running.
So, at (b), the socket was already freed, and accessing it was illegal.
unix_del_edges() accesses the receiver socket as edge->successor to
optimise GC. However, we should not do it during GC.
Garbage-collecting sockets does not change the shape of the rest
of the graph, so we need not call unix_update_graph() to update
unix_graph_grouped when we purge skb.
However, if we clean up all loops in the unix_walk_scc_fast() path,
unix_graph_maybe_cyclic remains unchanged (true), and __unix_gc()
will call unix_walk_scc_fast() continuously even though there is no
socket to garbage-collect.
To keep that optimisation while fixing UAF, let's add the same
updating logic of unix_graph_maybe_cyclic in unix_walk_scc_fast()
as done in unix_walk_scc() and __unix_walk_scc().
Note that when unix_del_edges() is called from other places, the
receiver socket is always alive:
- sendmsg: the successor's sk_refcnt is bumped by sock_hold()
unix_find_other() for SOCK_DGRAM, connect() for SOCK_STREAM
- recvmsg: the successor is the receiver, and its fd is alive
[0]:
BUG: KASAN: slab-use-after-free in unix_edge_successor net/unix/garbage.c:109 [inline]
BUG: KASAN: slab-use-after-free in unix_del_edge net/unix/garbage.c:165 [inline]
BUG: KASAN: slab-use-after-free in unix_del_edges+0x148/0x630 net/unix/garbage.c:237
Read of size 8 at addr ffff888079c6e640 by task kworker/u8:6/1099
CPU: 0 PID: 1099 Comm: kworker/u8:6 Not tainted 6.9.0-rc4-next-20240418-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
Workqueue: events_unbound __unix_gc
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
print_address_description mm/kasan/report.c:377 [inline]
print_report+0x169/0x550 mm/kasan/report.c:488
kasan_report+0x143/0x180 mm/kasan/report.c:601
unix_edge_successor net/unix/garbage.c:109 [inline]
unix_del_edge net/unix/garbage.c:165 [inline]
unix_del_edges+0x148/0x630 net/unix/garbage.c:237
unix_destroy_fpl+0x59/0x210 net/unix/garbage.c:298
unix_detach_fds net/unix/af_unix.c:1811 [inline]
unix_destruct_scm+0x13e/0x210 net/unix/af_unix.c:1826
skb_release_head_state+0x100/0x250 net/core/skbuff.c:1127
skb_release_all net/core/skbuff.c:1138 [inline]
__kfree_skb net/core/skbuff.c:1154 [inline]
kfree_skb_reason+0x16d/0x3b0 net/core/skbuff.c:1190
__skb_queue_purge_reason include/linux/skbuff.h:3251 [inline]
__skb_queue_purge include/linux/skbuff.h:3256 [inline]
__unix_gc+0x1732/0x1830 net/unix/garbage.c:575
process_one_work kernel/workqueue.c:3218 [inline]
process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3299
worker_thread+0x86d/0xd70 kernel/workqueue.c:3380
kthread+0x2f0/0x390 kernel/kthread.c:389
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
</TASK>
Allocated by task 14427:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
unpoison_slab_object mm/kasan/common.c:312 [inline]
__kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:338
kasan_slab_alloc include/linux/kasan.h:201 [inline]
slab_post_alloc_hook mm/slub.c:3897 [inline]
slab_alloc_node mm/slub.c:3957 [inline]
kmem_cache_alloc_noprof+0x135/0x290 mm/slub.c:3964
sk_prot_alloc+0x58/0x210 net/core/sock.c:2074
sk_alloc+0x38/0x370 net/core/sock.c:2133
unix_create1+0xb4/0x770
unix_create+0x14e/0x200 net/unix/af_unix.c:1034
__sock_create+0x490/0x920 net/socket.c:1571
sock_create net/socket.c:1622 [inline]
__sys_socketpair+0x33e/0x720 net/socket.c:1773
__do_sys_socketpair net/socket.c:1822 [inline]
__se_sys_socketpair net/socket.c:1819 [inline]
__x64_sys_socketpair+0x9b/0xb0 net/socket.c:1819
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 1805:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579
poison_slab_object+0xe0/0x150 mm/kasan/common.c:240
__kasan_slab_free+0x37/0x60 mm/kasan/common.c:256
kasan_slab_free include/linux/kasan.h:184 [inline]
slab_free_hook mm/slub.c:2190 [inline]
slab_free mm/slub.c:4393 [inline]
kmem_cache_free+0x145/0x340 mm/slub.c:4468
sk_prot_free net/core/sock.c:2114 [inline]
__sk_destruct+0x467/0x5f0 net/core/sock.c:2208
sock_put include/net/sock.h:1948 [inline]
unix_release_sock+0xa8b/0xd20 net/unix/af_unix.c:665
unix_release+0x91/0xc0 net/unix/af_unix.c:1049
__sock_release net/socket.c:659 [inline]
sock_close+0xbc/0x240 net/socket.c:1421
__fput+0x406/0x8b0 fs/file_table.c:422
delayed_fput+0x59/0x80 fs/file_table.c:445
process_one_work kernel/workqueue.c:3218 [inline]
process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3299
worker_thread+0x86d/0xd70 kernel/workqueue.c:3380
kthread+0x2f0/0x390 kernel/kthread.c:389
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
The buggy address belongs to the object at ffff888079c6e000
which belongs to the cache UNIX of size 1920
The buggy address is located 1600 bytes inside of
freed 1920-byte region [ffff888079c6e000, ffff888079c6e780)
Reported-by: syzbot+f3f3eef1d2100200e593@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f3f3eef1d2100200e593
Fixes: 77e5593aeb ("af_unix: Skip GC if no cycle exists.")
Fixes: fd86344823 ("af_unix: Try not to hold unix_gc_lock during accept().")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240419235102.31707-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The NLM_F_ACK flag is ignored for nfnetlink batch begin and end
messages. This is a problem for ynl which wants to receive an ack for
every message it sends, not just the commands in between the begin/end
messages.
Add processing for ACKs for begin/end messages and provide responses
when requested.
I have checked that iproute2, pyroute2 and systemd are unaffected by
this change since none of them use NLM_F_ACK for batch begin/end.
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240418104737.77914-5-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov says:
====================
implement io_uring notification (ubuf_info) stacking (net part)
To have per request buffer notifications each zerocopy io_uring send
request allocates a new ubuf_info. However, as an skb can carry only
one uarg, it may force the stack to create many small skbs hurting
performance in many ways.
The patchset implements notification, i.e. an io_uring's ubuf_info
extension, stacking. It attempts to link ubuf_info's into a list,
allowing to have multiple of them per skb.
liburing/examples/send-zerocopy shows up 6 times performance improvement
for TCP with 4KB bytes per send, and levels it with MSG_ZEROCOPY. Without
the patchset it requires much larger sends to utilise all potential.
bytes | before | after (Kqps)
1200 | 195 | 1023
4000 | 193 | 1386
8000 | 154 | 1058
====================
Link: https://lore.kernel.org/all/cover.1713369317.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
At the moment an skb can only have one ubuf_info associated with it,
which might be a performance problem for zerocopy sends in cases like
TCP via io_uring. Add a callback for assigning ubuf_info to skb, this
way we will implement smarter assignment later like linking ubuf_info
together.
Note, it's an optional callback, which should be compatible with
skb_zcopy_set(), that's because the net stack might potentially decide
to clone an skb and take another reference to ubuf_info whenever it
wishes. Also, a correct implementation should always be able to bind to
an skb without prior ubuf_info, otherwise we could end up in a situation
when the send would not be able to progress.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/all/b7918aadffeb787c84c9e72e34c729dc04f3a45d.1713369317.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While investigating TCP performance, I found that TCP would
sometimes send big skbs followed by a single MSS skb,
in a 'locked' pattern.
For instance, BIG TCP is enabled, MSS is set to have 4096 bytes
of payload per segment. gso_max_size is set to 181000.
This means that an optimal TCP packet size should contain
44 * 4096 = 180224 bytes of payload,
However, I was seeing packets sizes interleaved in this pattern:
172032, 8192, 172032, 8192, 172032, 8192, <repeat>
tcp_tso_should_defer() heuristic is defeated, because after a split of
a packet in write queue for whatever reason (this might be a too small
CWND or a small enough pacing_rate),
the leftover packet in the queue is smaller than the optimal size.
It is time to try to make 'leftover packets' bigger so that
tcp_tso_should_defer() can give its full potential.
After this patch, we can see the following output:
14:13:34.009273 IP6 sender > receiver: Flags [P.], seq 4048380:4098360, ack 1, win 256, options [nop,nop,TS val 3425678144 ecr 1561784500], length 49980
14:13:34.010272 IP6 sender > receiver: Flags [P.], seq 4098360:4148340, ack 1, win 256, options [nop,nop,TS val 3425678145 ecr 1561784501], length 49980
14:13:34.011271 IP6 sender > receiver: Flags [P.], seq 4148340:4198320, ack 1, win 256, options [nop,nop,TS val 3425678146 ecr 1561784502], length 49980
14:13:34.012271 IP6 sender > receiver: Flags [P.], seq 4198320:4248300, ack 1, win 256, options [nop,nop,TS val 3425678147 ecr 1561784503], length 49980
14:13:34.013272 IP6 sender > receiver: Flags [P.], seq 4248300:4298280, ack 1, win 256, options [nop,nop,TS val 3425678148 ecr 1561784504], length 49980
14:13:34.014271 IP6 sender > receiver: Flags [P.], seq 4298280:4348260, ack 1, win 256, options [nop,nop,TS val 3425678149 ecr 1561784505], length 49980
14:13:34.015272 IP6 sender > receiver: Flags [P.], seq 4348260:4398240, ack 1, win 256, options [nop,nop,TS val 3425678150 ecr 1561784506], length 49980
14:13:34.016270 IP6 sender > receiver: Flags [P.], seq 4398240:4448220, ack 1, win 256, options [nop,nop,TS val 3425678151 ecr 1561784507], length 49980
14:13:34.017269 IP6 sender > receiver: Flags [P.], seq 4448220:4498200, ack 1, win 256, options [nop,nop,TS val 3425678152 ecr 1561784508], length 49980
14:13:34.018276 IP6 sender > receiver: Flags [P.], seq 4498200:4548180, ack 1, win 256, options [nop,nop,TS val 3425678153 ecr 1561784509], length 49980
14:13:34.019259 IP6 sender > receiver: Flags [P.], seq 4548180:4598160, ack 1, win 256, options [nop,nop,TS val 3425678154 ecr 1561784510], length 49980
With 200 concurrent flows on a 100Gbit NIC, we can see a reduction
of TSO packets (and ACK packets) of about 30 %.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240418214600.1291486-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_write_xmit() calls tcp_init_tso_segs()
to set gso_size and gso_segs on the packet.
tcp_init_tso_segs() requires the stack to maintain
an up to date tcp_skb_pcount(), and this makes sense
for packets in rtx queue. Not so much for packets
still in the write queue.
In the following patch, we don't want to deal with
tcp_skb_pcount() when moving payload from 2nd
skb to 1st skb in the write queue.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240418214600.1291486-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_cwnd_test() has a special handing for the last packet in
the write queue if it is smaller than one MSS and has the FIN flag.
This is in violation of TCP RFC, and seems quite dubious.
This packet can be sent only if the current CWND is bigger
than the number of packets in flight.
Making tcp_cwnd_test() result independent of the first skb
in the write queue is needed for the last patch of the series.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240418214600.1291486-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>