Commit Graph

1478 Commits

Author SHA1 Message Date
Linus Torvalds
1a9239bb42 Merge tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Continue Netlink conversions to per-namespace RTNL lock
     (IPv4 routing, routing rules, routing next hops, ARP ioctls)

   - Continue extending the use of netdev instance locks. As a driver
     opt-in protect queue operations and (in due course) ethtool
     operations with the instance lock and not RTNL lock.

   - Support collecting TCP timestamps (data submitted, sent, acked) in
     BPF, allowing for transparent (to the application) and lower
     overhead tracking of TCP RPC performance.

   - Tweak existing networking Rx zero-copy infra to support zero-copy
     Rx via io_uring.

   - Optimize MPTCP performance in single subflow mode by 29%.

   - Enable GRO on packets which went thru XDP CPU redirect (were queued
     for processing on a different CPU). Improving TCP stream
     performance up to 2x.

   - Improve performance of contended connect() by 200% by searching for
     an available 4-tuple under RCU rather than a spin lock. Bring an
     additional 229% improvement by tweaking hash distribution.

   - Avoid unconditionally touching sk_tsflags on RX, improving
     performance under UDP flood by as much as 10%.

   - Avoid skb_clone() dance in ping_rcv() to improve performance under
     ping flood.

   - Avoid FIB lookup in netfilter if socket is available, 20% perf win.

   - Rework network device creation (in-kernel) API to more clearly
     identify network namespaces and their roles. There are up to 4
     namespace roles but we used to have just 2 netns pointer arguments,
     interpreted differently based on context.

   - Use sysfs_break_active_protection() instead of trylock to avoid
     deadlocks between unregistering objects and sysfs access.

   - Add a new sysctl and sockopt for capping max retransmit timeout in
     TCP.

   - Support masking port and DSCP in routing rule matches.

   - Support dumping IPv4 multicast addresses with RTM_GETMULTICAST.

   - Support specifying at what time packet should be sent on AF_XDP
     sockets.

   - Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin
     users.

   - Add Netlink YAML spec for WiFi (nl80211) and conntrack.

   - Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols
     which only need to be exported when IPv6 support is built as a
     module.

   - Age FDB entries based on Rx not Tx traffic in VxLAN, similar to
     normal bridging.

   - Allow users to specify source port range for GENEVE tunnels.

   - netconsole: allow attaching kernel release, CPU ID and task name to
     messages as metadata

  Driver API:

   - Continue rework / fixing of Energy Efficient Ethernet (EEE) across
     the SW layers. Delegate the responsibilities to phylink where
     possible. Improve its handling in phylib.

   - Support symmetric OR-XOR RSS hashing algorithm.

   - Support tracking and preserving IRQ affinity by NAPI itself.

   - Support loopback mode speed selection for interface selftests.

  Device drivers:

   - Remove the IBM LCS driver for s390

   - Remove the sb1000 cable modem driver

   - Add support for SFP module access over SMBus

   - Add MCTP transport driver for MCTP-over-USB

   - Enable XDP metadata support in multiple drivers

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - add PCIe TLP Processing Hints (TPH) support for new AMD
           platforms
         - support dumping RoCE queue state for debug
         - opt into instance locking
      - Intel (100G, ice, idpf):
         - ice: rework MSI-X IRQ management and distribution
         - ice: support for E830 devices
         - iavf: add support for Rx timestamping
         - iavf: opt into instance locking
      - nVidia/Mellanox:
         - mlx4: use page pool memory allocator for Rx
         - mlx5: support for one PTP device per hardware clock
         - mlx5: support for 200Gbps per-lane link modes
         - mlx5: move IPSec policy check after decryption
      - AMD/Solarflare:
         - support FW flashing via devlink
      - Cisco (enic):
         - use page pool memory allocator for Rx
         - enable 32, 64 byte CQEs
         - get max rx/tx ring size from the device
      - Meta (fbnic):
         - support flow steering and RSS configuration
         - report queue stats
         - support TCP segmentation
         - support IRQ coalescing
         - support ring size configuration
      - Marvell/Cavium:
         - support AF_XDP
      - Wangxun:
         - support for PTP clock and timestamping
      - Huawei (hibmcge):
         - checksum offload
         - add more statistics

   - Ethernet virtual:
      - VirtIO net:
         - aggressively suppress Tx completions, improve perf by 96%
           with 1 CPU and 55% with 2 CPUs
         - expose NAPI to IRQ mapping and persist NAPI settings
      - Google (gve):
         - support XDP in DQO RDA Queue Format
         - opt into instance locking
      - Microsoft vNIC:
         - support BIG TCP

   - Ethernet NICs consumer, and embedded:
      - Synopsys (stmmac):
         - cleanup Tx and Tx clock setting and other link-focused
           cleanups
         - enable SGMII and 2500BASEX mode switching for Intel platforms
         - support Sophgo SG2044
      - Broadcom switches (b53):
         - support for BCM53101
      - TI:
         - iep: add perout configuration support
         - icssg: support XDP
      - Cadence (macb):
         - implement BQL
      - Xilinx (axinet):
         - support dynamic IRQ moderation and changing coalescing at
           runtime
         - implement BQL
         - report standard stats
      - MediaTek:
         - support phylink managed EEE
      - Intel:
         - igc: don't restart the interface on every XDP program change
      - RealTek (r8169):
         - support reading registers of internal PHYs directly
         - increase max jumbo packet size on RTL8125/RTL8126
      - Airoha:
         - support for RISC-V NPU packet processing unit
         - enable scatter-gather and support MTU up to 9kB
      - Tehuti (tn40xx):
         - support cards with TN4010 MAC and an Aquantia AQR105 PHY

   - Ethernet PHYs:
      - support for TJA1102S, TJA1121
      - dp83tg720: add randomized polling intervals for link detection
      - dp83822: support changing the transmit amplitude voltage
      - support for LEDs on 88q2xxx

   - CAN:
      - canxl: support Remote Request Substitution bit access
      - flexcan: add S32G2/S32G3 SoC

   - WiFi:
      - remove cooked monitor support
      - strict mode for better AP testing
      - basic EPCS support
      - OMI RX bandwidth reduction support
      - batman-adv: add support for jumbo frames

   - WiFi drivers:
      - RealTek (rtw88):
         - support RTL8814AE and RTL8814AU
      - RealTek (rtw89):
         - switch using wiphy_lock and wiphy_work
         - add BB context to manipulate two PHY as preparation of MLO
         - improve BT-coexistence mechanism to play A2DP smoothly
      - Intel (iwlwifi):
         - add new iwlmld sub-driver for latest HW/FW combinations
      - MediaTek (mt76):
         - preparation for mt7996 Multi-Link Operation (MLO) support
      - Qualcomm/Atheros (ath12k):
         - continued work on MLO
      - Silabs (wfx):
         - Wake-on-WLAN support

   - Bluetooth:
      - add support for skb TX SND/COMPLETION timestamping
      - hci_core: enable buffer flow control for SCO/eSCO
      - coredump: log devcd dumps into the monitor

   - Bluetooth drivers:
      - intel: add support to configure TX power
      - nxp: handle bootloader error during cmd5 and cmd7"

* tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1681 commits)
  unix: fix up for "apparmor: add fine grained af_unix mediation"
  mctp: Fix incorrect tx flow invalidation condition in mctp-i2c
  net: usb: asix: ax88772: Increase phy_name size
  net: phy: Introduce PHY_ID_SIZE — minimum size for PHY ID string
  net: libwx: fix Tx L4 checksum
  net: libwx: fix Tx descriptor content for some tunnel packets
  atm: Fix NULL pointer dereference
  net: tn40xx: add pci-id of the aqr105-based Tehuti TN4010 cards
  net: tn40xx: prepare tn40xx driver to find phy of the TN9510 card
  net: tn40xx: create swnode for mdio and aqr105 phy and add to mdiobus
  net: phy: aquantia: add essential functions to aqr105 driver
  net: phy: aquantia: search for firmware-name in fwnode
  net: phy: aquantia: add probe function to aqr105 for firmware loading
  net: phy: Add swnode support to mdiobus_scan
  gve: add XDP DROP and PASS support for DQ
  gve: update XDP allocation path support RX buffer posting
  gve: merge packet buffer size fields
  gve: update GQ RX to use buf_size
  gve: introduce config-based allocation for XDP
  gve: remove xdp_xsk_done and xdp_xsk_wakeup statistics
  ...
2025-03-26 21:48:21 -07:00
Linus Torvalds
91928e0d3c Merge tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
 "This is the first of the io_uring pull requests for the 6.15 merge
  window, there will be others once the net tree has gone in. This
  contains:

   - Cleanup and unification of cancelation handling across various
     request types.

   - Improvement for bundles, supporting them both for incrementally
     consumed buffers, and for non-multishot requests.

   - Enable toggling of using iowait while waiting on io_uring events or
     not. Unfortunately this is still tied with CPU frequency boosting
     on short waits, as the scheduler side has not been very receptive
     to splitting the (useless) iowait stat from the cpufreq implied
     boost.

   - Add support for kbuf nodes, enabling zero-copy support for the ublk
     block driver.

   - Various cleanups for resource node handling.

   - Series greatly cleaning up the legacy provided (non-ring based)
     buffers. For years, we've been pushing the ring provided buffers as
     the way to go, and that is what people have been using. Reduce the
     complexity and code associated with legacy provided buffers.

   - Series cleaning up the compat handling.

   - Series improving and cleaning up the recvmsg/sendmsg iovec and msg
     handling.

   - Series of cleanups for io-wq.

   - Start adding a bunch of selftests. The liburing repository
     generally carries feature and regression tests for everything, but
     at least for ublk initially, we'll try and go the route of having
     it in selftests as well. We'll see how this goes, might decide to
     migrate more tests this way in the future.

   - Various little cleanups and fixes"

* tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux: (108 commits)
  selftests: ublk: add stripe target
  selftests: ublk: simplify loop io completion
  selftests: ublk: enable zero copy for null target
  selftests: ublk: prepare for supporting stripe target
  selftests: ublk: move common code into common.c
  selftests: ublk: increase max buffer size to 1MB
  selftests: ublk: add single sqe allocator helper
  selftests: ublk: add generic_01 for verifying sequential IO order
  selftests: ublk: fix starting ublk device
  io_uring: enable toggle of iowait usage when waiting on CQEs
  selftests: ublk: fix write cache implementation
  selftests: ublk: add variable for user to not show test result
  selftests: ublk: don't show `modprobe` failure
  selftests: ublk: add one dependency header
  io_uring/kbuf: enable bundles for incrementally consumed buffers
  Revert "io_uring/rsrc: simplify the bvec iter count calculation"
  selftests: ublk: improve test usability
  selftests: ublk: add stress test for covering IO vs. killing ublk server
  selftests: ublk: add one stress test for covering IO vs. removing device
  selftests: ublk: load/unload ublk_drv when preparing & cleaning up tests
  ...
2025-03-26 17:56:00 -07:00
Linus Torvalds
054570267d Merge tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm updates from Paul Moore:

 - Various minor updates to the LSM Rust bindings

   Changes include marking trivial Rust bindings as inlines and comment
   tweaks to better reflect the LSM hooks.

 - Add LSM/SELinux access controls to io_uring_allowed()

   Similar to the io_uring_disabled sysctl, add a LSM hook to
   io_uring_allowed() to enable LSMs a simple way to enforce security
   policy on the use of io_uring. This pull request includes SELinux
   support for this new control using the io_uring/allowed permission.

 - Remove an unused parameter from the security_perf_event_open() hook

   The perf_event_attr struct parameter was not used by any currently
   supported LSMs, remove it from the hook.

 - Add an explicit MAINTAINERS entry for the credentials code

   We've seen problems in the past where patches to the credentials code
   sent by non-maintainers would often languish on the lists for
   multiple months as there was no one explicitly tasked with the
   responsibility of reviewing and/or merging credentials related code.

   Considering that most of the code under security/ has a vested
   interest in ensuring that the credentials code is well maintained,
   I'm volunteering to look after the credentials code and Serge Hallyn
   has also volunteered to step up as an official reviewer. I posted the
   MAINTAINERS update as a RFC to LKML in hopes that someone else would
   jump up with an "I'll do it!", but beyond Serge it was all crickets.

 - Update Stephen Smalley's old email address to prevent confusion

   This includes a corresponding update to the mailmap file.

* tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  mailmap: map Stephen Smalley's old email addresses
  lsm: remove old email address for Stephen Smalley
  MAINTAINERS: add Serge Hallyn as a credentials reviewer
  MAINTAINERS: add an explicit credentials entry
  cred,rust: mark Credential methods inline
  lsm,rust: reword "destroy" -> "release" in SecurityCtx
  lsm,rust: mark SecurityCtx methods inline
  perf: Remove unnecessary parameter of security check
  lsm: fix a missing security_uring_allowed() prototype
  io_uring,lsm,selinux: add LSM hooks for io_uring_setup()
  io_uring: refactor io_uring_allowed()
2025-03-25 15:44:19 -07:00
Linus Torvalds
a50b4fe095 Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanups from Thomas Gleixner:
 "A treewide hrtimer timer cleanup

  hrtimers are initialized with hrtimer_init() and a subsequent store to
  the callback pointer. This turned out to be suboptimal for the
  upcoming Rust integration and is obviously a silly implementation to
  begin with.

  This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
  with hrtimer_setup(T, cb);

  The conversion was done with Coccinelle and a few manual fixups.

  Once the conversion has completely landed in mainline, hrtimer_init()
  will be removed and the hrtimer::function becomes a private member"

* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  wifi: rt2x00: Switch to use hrtimer_update_function()
  io_uring: Use helper function hrtimer_update_function()
  serial: xilinx_uartps: Use helper function hrtimer_update_function()
  ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
  RDMA: Switch to use hrtimer_setup()
  virtio: mem: Switch to use hrtimer_setup()
  drm/vmwgfx: Switch to use hrtimer_setup()
  drm/xe/oa: Switch to use hrtimer_setup()
  drm/vkms: Switch to use hrtimer_setup()
  drm/msm: Switch to use hrtimer_setup()
  drm/i915/request: Switch to use hrtimer_setup()
  drm/i915/uncore: Switch to use hrtimer_setup()
  drm/i915/pmu: Switch to use hrtimer_setup()
  drm/i915/perf: Switch to use hrtimer_setup()
  drm/i915/gvt: Switch to use hrtimer_setup()
  drm/i915/huc: Switch to use hrtimer_setup()
  drm/amdgpu: Switch to use hrtimer_setup()
  stm class: heartbeat: Switch to use hrtimer_setup()
  i2c: Switch to use hrtimer_setup()
  iio: Switch to use hrtimer_setup()
  ...
2025-03-25 10:54:15 -07:00
Linus Torvalds
bb18645ac1 Merge tag 'io_uring-6.14-20250322' of git://git.kernel.dk/linux
Pull io_uring fix from Jens Axboe:
 "Just a single fix for the commit that went into your tree yesterday,
  which exposed an issue with not always clearing notifications. That
  could cause them to be used more than once"

* tag 'io_uring-6.14-20250322' of git://git.kernel.dk/linux:
  io_uring/net: fix sendzc double notif flush
2025-03-22 10:45:44 -07:00
Pavel Begunkov
67c007d6c1 io_uring/net: fix sendzc double notif flush
refcount_t: underflow; use-after-free.
WARNING: CPU: 0 PID: 5823 at lib/refcount.c:28 refcount_warn_saturate+0x15a/0x1d0 lib/refcount.c:28
RIP: 0010:refcount_warn_saturate+0x15a/0x1d0 lib/refcount.c:28
Call Trace:
 <TASK>
 io_notif_flush io_uring/notif.h:40 [inline]
 io_send_zc_cleanup+0x121/0x170 io_uring/net.c:1222
 io_clean_op+0x58c/0x9a0 io_uring/io_uring.c:406
 io_free_batch_list io_uring/io_uring.c:1429 [inline]
 __io_submit_flush_completions+0xc16/0xd20 io_uring/io_uring.c:1470
 io_submit_flush_completions io_uring/io_uring.h:159 [inline]

Before the blamed commit, sendzc relied on io_req_msg_cleanup() to clear
REQ_F_NEED_CLEANUP, so after the following snippet the request will
never hit the core io_uring cleanup path.

io_notif_flush();
io_req_msg_cleanup();

The easiest fix is to null the notification. io_send_zc_cleanup() can
still be called after, but it's tolerated.

Reported-by: syzbot+cf285a028ffba71b2ef5@syzkaller.appspotmail.com
Tested-by: syzbot+cf285a028ffba71b2ef5@syzkaller.appspotmail.com
Fixes: cc34d8330e ("io_uring/net: don't clear REQ_F_NEED_CLEANUP unconditionally")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e1306007458b8891c88c4f20c966a17595f766b0.1742643795.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-22 08:14:36 -06:00
Linus Torvalds
d07de43e3f Merge tag 'io_uring-6.14-20250321' of git://git.kernel.dk/linux
Pull io_uring fix from Jens Axboe:
 "Single fix heading to stable, fixing an issue with io_req_msg_cleanup()
  sometimes too eagerly clearing cleanup flags"

* tag 'io_uring-6.14-20250321' of git://git.kernel.dk/linux:
  io_uring/net: don't clear REQ_F_NEED_CLEANUP unconditionally
2025-03-21 10:30:15 -07:00
Jens Axboe
07754bfd9a io_uring: enable toggle of iowait usage when waiting on CQEs
By default, io_uring marks a waiting task as being in iowait, if it's
sleeping waiting on events and there are pending requests. This isn't
necessarily always useful, and may be confusing on non-storage setups
where iowait isn't expected. It can also cause extra power usage, by
preventing the CPU from entering lower sleep states.

This adds a new enter flag, IORING_ENTER_NO_IOWAIT. If set, then
io_uring will not account the sleeping task as being in iowait. If the
kernel supports this feature, then it will be marked by having the
IORING_FEAT_NO_IOWAIT feature flag set.

As the kernel currently does not support separating the iowait
accounting and CPU frequency boosting, the IORING_ENTER_NO_IOWAIT
controls both of these at the same time. In the future, if those do end
up being split, then it'd be possible to control them separately.
However, it seems more likely that the kernel will decouple iowait and
CPU frequency boosting anyway.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-20 20:01:03 -06:00
Jens Axboe
cc34d8330e io_uring/net: don't clear REQ_F_NEED_CLEANUP unconditionally
io_req_msg_cleanup() relies on the fact that io_netmsg_recycle() will
always fully recycle, but that may not be the case if the msg cache
was already full. To ensure that normal cleanup always gets run,
let io_netmsg_recycle() deal with clearing the relevant cleanup flags,
as it knows exactly when that should be done.

Cc: stable@vger.kernel.org
Reported-by: David Wei <dw@davidwei.uk>
Fixes: 7519134178 ("io_uring/net: add iovec recycling")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-20 12:27:27 -06:00
Paolo Abeni
941defcea7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc6).

Conflicts:

tools/testing/selftests/drivers/net/ping.py
  75cc19c8ff ("selftests: drv-net: add xdp cases for ping.py")
  de94e86974 ("selftests: drv-net: store addresses in dict indexed by ipver")
https://lore.kernel.org/netdev/20250311115758.17a1d414@canb.auug.org.au/

net/core/devmem.c
  a70f891e0f ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()")
  1d22d3060b ("net: drop rtnl_lock for queue_mgmt operations")
https://lore.kernel.org/netdev/20250313114929.43744df1@canb.auug.org.au/

Adjacent changes:

tools/testing/selftests/net/Makefile
  6f50175cca ("selftests: Add IPv6 link-local address generation tests for GRE devices.")
  2e5584e0f9 ("selftests/net: expand cmsg_ipv6.sh with ipv4")

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  661958552e ("eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue restart logic")
  fe96d717d3 ("bnxt_en: Extend queue stop/start for TX rings")

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-13 23:08:11 +01:00
Jens Axboe
cf9536e550 io_uring/kbuf: enable bundles for incrementally consumed buffers
The original support for incrementally consumed buffers didn't allow it
to be used with bundles, with the assumption being that incremental
buffers are generally larger, and hence there's less of a nedd to
support it.

But that assumption may not be correct - it's perfectly viable to use
smaller buffers with incremental consumption, and there may be valid
reasons for an application or framework to do so.

As there's really no need to explicitly disable bundles with
incrementally consumed buffers, allow it. This actually makes the peek
side cheaper and simpler, with the completion side basically the same,
just needing to iterate for the consumed length.

Reported-by: Norman Maurer <norman_maurer@apple.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10 16:24:43 -06:00
Keith Busch
334f795ff8 Revert "io_uring/rsrc: simplify the bvec iter count calculation"
This reverts commit 2a51c327d4.

The kernel registered bvecs do use the iov_iter_advance() API, so we
can't rely on this simplification anymore.

Fixes: 27cb27b6d5 ("io_uring: add support for kernel registered bvecs")
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250310184825.569371-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10 16:24:43 -06:00
Linus Torvalds
d53276d292 Merge tag 'io_uring-6.14-20250306' of git://git.kernel.dk/linux
Pull io_uring fix from Jens Axboe:
 "A single fix for a regression introduced in the 6.14 merge window,
  causing stalls/hangs with IOPOLL reads or writes"

* tag 'io_uring-6.14-20250306' of git://git.kernel.dk/linux:
  io_uring/rw: ensure reissue path is correctly handled for IOPOLL
2025-03-07 11:09:33 -10:00
Yue Haibing
30c970354c io_uring: Remove unused declaration io_alloc_async_data()
Commit ef623a647f ("io_uring: Move old async data allocation helper
to header") leave behind this unused declaration.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20250305013454.3635021-1-yuehaibing@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-07 14:09:16 -07:00
Jakub Kicinski
2525e16a2b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc6).

Conflicts:

net/ethtool/cabletest.c
  2bcf4772e4 ("net: ethtool: try to protect all callback with netdev instance lock")
  637399bf7e ("net: ethtool: netlink: Allow NULL nlattrs when getting a phy_device")

No Adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06 13:03:35 -08:00
Jens Axboe
bcb0fda3c2 io_uring/rw: ensure reissue path is correctly handled for IOPOLL
The IOPOLL path posts CQEs when the io_kiocb is marked as completed,
so it cannot rely on the usual retry that non-IOPOLL requests do for
read/write requests.

If -EAGAIN is received and the request should be retried, go through
the normal completion path and let the normal flush logic catch it and
reissue it, like what is done for !IOPOLL reads or writes.

Fixes: d803d12394 ("io_uring/rw: handle -EAGAIN retry at IO completion time")
Reported-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/io-uring/2b43ccfa-644d-4a09-8f8f-39ad71810f41@oracle.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-05 14:03:34 -07:00
Caleb Sander Mateos
0d83b8a9f1 io_uring: introduce io_cache_free() helper
Add a helper function io_cache_free() that returns an allocation to a
io_alloc_cache, falling back on kfree() if the io_alloc_cache is full.
This is the inverse of io_cache_alloc(), which takes an allocation from
an io_alloc_cache and falls back on kmalloc() if the cache is empty.

Convert 4 callers to use the helper.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Suggested-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250304194814.2346705-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-05 07:38:55 -07:00
Caleb Sander Mateos
fe21a4532e io_uring/rsrc: skip NULL file/buffer checks in io_free_rsrc_node()
io_rsrc_node's of type IORING_RSRC_FILE always have a file attached
immediately after they are allocated. IORING_RSRC_BUFFER nodes won't be
returned from io_sqe_buffer_register()/io_buffer_register_bvec() until
they have a io_mapped_ubuf attached.

So remove the checks for a NULL file/buffer in io_free_rsrc_node().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-5-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-04 07:17:15 -07:00
Caleb Sander Mateos
6e5d321a08 io_uring/rsrc: avoid NULL node check on io_sqe_buffer_register() failure
The done: label is only reachable if node is non-NULL. So don't bother
checking, just call io_free_node().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-4-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-04 07:17:15 -07:00
Caleb Sander Mateos
13f7f9686e io_uring/rsrc: call io_free_node() on io_sqe_buffer_register() failure
io_sqe_buffer_register() currently calls io_put_rsrc_node() if it fails
to fully set up the io_rsrc_node. io_put_rsrc_node() is more involved
than necessary, since we already know the reference count will reach 0
and no io_mapped_ubuf has been attached to the node yet.

So just call io_free_node() to release the node's memory. This also
avoids the need to temporarily set the node's buf pointer to NULL.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-3-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-04 07:17:15 -07:00
Caleb Sander Mateos
a387b96d2a io_uring/rsrc: free io_rsrc_node using kfree()
io_rsrc_node_alloc() calls io_cache_alloc(), which uses kmalloc() to
allocate the node. So it can be freed with kfree() instead of kvfree().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-04 07:17:15 -07:00
Caleb Sander Mateos
6a53541829 io_uring/rsrc: split out io_free_node() helper
Split the freeing of the io_rsrc_node from io_free_rsrc_node(), for use
with nodes that haven't been fully initialized.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-04 07:17:15 -07:00
Caleb Sander Mateos
a1967280a1 io_uring/rsrc: include io_uring_types.h in rsrc.h
io_uring/rsrc.h uses several types from include/linux/io_uring_types.h.
Include io_uring_types.h explicitly in rsrc.h to avoid depending on
users of rsrc.h including io_uring_types.h first.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250301183612.937529-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-04 07:16:07 -07:00
Caleb Sander Mateos
6e83a442fb io_uring/nop: use io_find_buf_node()
Call io_find_buf_node() to avoid duplicating it in io_nop().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250301001610.678223-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 19:35:37 -07:00
Caleb Sander Mateos
bf931be52e io_uring/rsrc: declare io_find_buf_node() in header file
Declare io_find_buf_node() in io_uring/rsrc.h so it can be called from
other files.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250301001610.678223-1-csander@purestorage.com
[axboe: keep the inline for local hot path usage]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 19:35:22 -07:00
Caleb Sander Mateos
e6ea7ec494 io_uring/ublk: report error when unregister operation fails
Indicate to userspace applications if a UBLK_IO_UNREGISTER_IO_BUF
command specifies an invalid buffer index by returning an error code.
Return -EINVAL if no buffer is registered with the given index, and
-EBUSY if the registered buffer is not a kernel bvec.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228231432.642417-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 19:15:05 -07:00
Caleb Sander Mateos
0c542a69cb io_uring/uring_cmd: specify io_uring_cmd_import_fixed() pointer type
io_uring_cmd_import_fixed() takes a struct io_uring_cmd *, but the type
of the ioucmd parameter is void *. Make the pointer type explicit so the
compiler can type check it.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228221514.604350-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 19:14:43 -07:00
Caleb Sander Mateos
2fced37638 io_uring/rsrc: use rq_data_dir() to compute bvec dir
The macro rq_data_dir() already computes a request's data direction.
Use it in place of the if-else to set imu->dir.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228223057.615284-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 19:11:55 -07:00
Linus Torvalds
3e5d15dd83 Merge tag 'io_uring-6.14-20250228' of git://git.kernel.dk/linux
Pull io_uring fix from Jens Axboe:
 "Just a single fix headed for stable, ensuring that msg_control is
  properly saved in compat mode as well"

* tag 'io_uring-6.14-20250228' of git://git.kernel.dk/linux:
  io_uring/net: save msg_control for compat
2025-02-28 09:11:15 -08:00
Keith Busch
ed9f3112a8 io_uring: cache nodes and mapped buffers
Frequent alloc/free cycles on these is pretty costly. Use an io cache to
more efficiently reuse these buffers.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-7-kbusch@meta.com
[axboe: fix imu leak]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 07:05:46 -07:00
Keith Busch
27cb27b6d5 io_uring: add support for kernel registered bvecs
Provide an interface for the kernel to leverage the existing
pre-registered buffers that io_uring provides. User space can reference
these later to achieve zero-copy IO.

User space must register an empty fixed buffer table with io_uring in
order for the kernel to make use of it.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-5-kbusch@meta.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 07:05:36 -07:00
Keith Busch
ff92d824d0 io_uring/rw: move fixed buffer import to issue path
Registered buffers may depend on a linked command, which makes the prep
path too early to import. Move to the issue path when the node is
actually needed like all the other users of fixed buffers.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-3-kbusch@meta.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 07:05:26 -07:00
Keith Busch
2a61e63891 io_uring/rw: move buffer_select outside generic prep
Cleans up the generic rw prep to not require the do_import flag. Use a
different prep function for callers that might need buffer select.

Based-on-a-patch-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-2-kbusch@meta.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 07:05:21 -07:00
Jakub Kicinski
357660d759 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc5).

Conflicts:

drivers/net/ethernet/cadence/macb_main.c
  fa52f15c74 ("net: cadence: macb: Synchronize stats calculations")
  75696dd0fd ("net: cadence: macb: Convert to get_stats64")
https://lore.kernel.org/20250224125848.68ee63e5@canb.auug.org.au

Adjacent changes:

drivers/net/ethernet/intel/ice/ice_sriov.c
  79990cf5e7 ("ice: Fix deinitializing VF in error path")
  a203163274 ("ice: simplify VF MSI-X managing")

net/ipv4/tcp.c
  18912c5206 ("tcp: devmem: don't write truncated dmabuf CMSGs to userspace")
  297d389e9e ("net: prefix devmem specific helpers")

net/mptcp/subflow.c
  8668860b0a ("mptcp: reset when MPTCP opts are dropped after join")
  c3349a22c2 ("mptcp: consolidate subflow cleanup")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-27 10:20:58 -08:00
Arnd Bergmann
4afc332bc8 io_uring/net: fix build warning for !CONFIG_COMPAT
A code rework resulted in an uninitialized return code when COMPAT
mode is disabled:

io_uring/net.c:722:6: error: variable 'ret' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
  722 |         if (io_is_compat(req->ctx)) {
      |             ^~~~~~~~~~~~~~~~~~~~~~
io_uring/net.c:736:15: note: uninitialized use occurs here
  736 |         if (unlikely(ret))
      |                      ^~~

Since io_is_compat() turns into a compile-time 'false', the #ifdef
here is completely unnecessary, and removing it avoids the warning.

Fixes: 51e158d405 ("io_uring/net: unify *mshot_prep calls with compat")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20250227132018.1111094-1-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-27 07:53:31 -07:00
Pavel Begunkov
0fea2c4509 io_uring: rearrange opdef flags by use pattern
Keep all flags that we use in the generic req init path close together.
That saves a load for x86 because apparently some compilers prefer
reading single bytes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ef03b6ce4a0c2a5234cd4037fa07e9e4902dcc9e.1740602793.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-27 07:27:56 -07:00
Pavel Begunkov
5ee6e3ea31 io_uring/net: extract iovec import into a helper
Deduplicate iovec imports between compat and !compat by introducing a
helper function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6a5f8c526f6732c4249a7fa0213b49e1a3ecccf0.1740569495.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-27 07:27:56 -07:00
Pavel Begunkov
51e158d405 io_uring/net: unify *mshot_prep calls with compat
Instead of duplicating a io_recvmsg_mshot_prep() call in the compat
path, let the common code handle it. For that, copy necessary compat
fields into struct user_msghdr. Note, it zeroes user_msghdr to be on the
safe side as compat is not that interesting and overhead shouldn't be
high.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/94e62386dec570f83b4a4270a46ac60bc415fb71.1740569495.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-27 07:27:56 -07:00
Pavel Begunkov
0c623f4899 io_uring/net: derive iovec storage later
Don't read free_iov until right before we need it to import the iovec.
The only place that uses it before that is provided buffer selection,
but it only serves as temporary storage and iovec content is not reused
afterwards, so use a local variable for that.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8bfa7d74c33e37860a724f4e0e96660c25cd4c02.1740569495.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-27 07:27:55 -07:00
Pavel Begunkov
00a9143d98 io_uring/net: verify msghdr before copying iovec
Normally, net/ would verify msghdr before importing iovec, for example
see copy_msghdr_from_user(), which further assumed by __copy_msghdr()
validating msg->msg_iovlen.

io_uring does it in reverse order, which is fine, but it'll be more
convenient for flip it so that the iovec business is done at the end and
eventually can be nicely pulled out of msghdr parsing section and
thought as a sepaarate step. That also makes structure accesses more
localised, which should be better for caches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cd35dc1b48d4e6e31f59ae7304c037fbe8a3fd3d.1740569495.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-27 07:27:55 -07:00
Pavel Begunkov
a223e96f73 io_uring/net: isolate msghdr copying code
The user access section in io_msg_copy_hdr() is overextended by covering
selected buffers. It's hard to work with and prone to errors. Limit the
section to msghdr import only, selected buffers will do a separate
copy_from_user() call, and then move it into its own function. This
should be fine, selected buffer single shots are not important, for
multishots the overhead should be non-existent, and it's not that
expensive overall.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d3eb1f81c8cfbea9f1aa57dab90c472d2aa6e371.1740569495.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-27 07:27:55 -07:00
Pavel Begunkov
0fc5a589af io_uring/net: simplify compat selbuf iov parsing
Use copy_from_user() instead of open coded access_ok() + get_user(),
that's simpler and we don't care about compat that much.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e51f9c323a3cd4ad7c8da656559bdf6237f052fb.1740569495.git.asml.silence@gmail.com
[axboe: fold in bogus < 0 check for tmp_iov.iov_len]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-27 07:27:37 -07:00
Pavel Begunkov
80b3de7da7 io_uring/net: remove unnecessary REQ_F_NEED_CLEANUP
REQ_F_NEED_CLEANUP in io_recvmsg_prep_setup() and in io_sendmsg_setup()
are relics of the past and don't do anything useful, the flag should be
and are set earlier on iovec and async_data allocation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6aedc3141c1fc027128a4503656cfd686a6980ef.1740569495.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-27 07:19:01 -07:00
Jens Axboe
c0d8c0362b Merge branch 'io_uring-6.14' into for-6.15/io_uring
Merge mainline fixes into 6.15 branch, as upcoming patches depend on
fixes that went into the 6.14 mainline branch.

* io_uring-6.14:
  io_uring/net: save msg_control for compat
  io_uring/rw: clean up mshot forced sync mode
  io_uring/rw: move ki_complete init into prep
  io_uring/rw: don't directly use ki_complete
  io_uring/rw: forbid multishot async reads
  io_uring/rsrc: remove unused constants
  io_uring: fix spelling error in uapi io_uring.h
  io_uring: prevent opcode speculation
  io-wq: backoff when retrying worker creation
2025-02-27 07:18:01 -07:00
Pavel Begunkov
5d30991477 io_uring: combine buffer lookup and import
Registered buffer are currently imported in two steps, first we lookup
a rsrc node and then use it to set up the iterator. The first part is
usually done at the prep stage, and import happens whenever it's needed.
As we want to defer binding to a node so that it works with linked
requests, combine both steps into a single helper.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-6-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-27 07:17:39 -07:00
Pavel Begunkov
69d483d5f4 io_uring/nvme: pass issue_flags to io_uring_cmd_import_fixed()
io_uring_cmd_import_fixed() will need to know the io_uring execution
state in following commits, for now just pass issue_flags into it
without actually using.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-5-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-27 07:17:39 -07:00
Pavel Begunkov
81cc96fcb3 io_uring/net: reuse req->buf_index for sendzc
There is already a field in io_kiocb that can store a registered buffer
index, use that instead of stashing the value into struct io_sr_msg.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-4-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-27 07:17:39 -07:00
Keith Busch
a14ca7a413 io_uring/nop: reuse req->buf_index
There is already a field in io_kiocb that can store a registered buffer
index, use that instead of stashing the value into struct io_nop.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-3-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-27 07:17:39 -07:00
Keith Busch
c5b47d5a8c io_uring/rsrc: remove redundant check for valid imu
The only caller to io_buffer_unmap already checks if the node's buf is
not null, so no need to check again.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-2-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-27 07:17:39 -07:00
Pavel Begunkov
7a9b0d6925 io_uring/rw: open code io_prep_rw_setup()
Open code io_prep_rw_setup() into its only caller, it doesn't provide
any meaningful abstraction anymore.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/61ba72e2d46119db71f27ab908018e6a6cd6c064.1740425922.git.asml.silence@gmail.com
[axboe: fold in 'ret' being unused fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-27 07:17:03 -07:00