1382415 Commits

Author SHA1 Message Date
Pavel Begunkov
01464ea405 io_uring/zcrx: move area reg checks into io_import_area
io_import_area() is responsible for importing memory and parsing
io_uring_zcrx_area_reg, so move all area reg structure checks into the
function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16 12:37:20 -06:00
Pavel Begunkov
d425f13146 io_uring/zcrx: don't pass slot to io_zcrx_create_area
Don't pass a pointer to a pointer where an area should be stored to
io_zcrx_create_area(), and let it handle finding the right place for a
new area. It's more straightforward and will be needed to support
multiple areas.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16 12:37:20 -06:00
Pavel Begunkov
c49606fc4b io_uring/zcrx: remove extra io_zcrx_drop_netdev
io_close_queue() already detaches the netdev, don't unnecessarily call
io_zcrx_drop_netdev() right after.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16 12:37:20 -06:00
Pavel Begunkov
d5e31db9a9 io_uring/zcrx: use page_pool_unref_and_test()
page_pool_unref_and_test() tries to better follow usual refcount
semantics, use it instead of page_pool_unref_netmem().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16 12:37:20 -06:00
Pavel Begunkov
bdc0d478a1 io_uring/zcrx: replace memchar_inv with is_zero
memchr_inv() is more ambiguous than mem_is_zero(), so use the latter
for zero checks.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16 12:37:20 -06:00
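Sketch for bdc0d478a1: a userspace illustration of why a dedicated zero check reads more clearly than an inverted memchr_inv() test. The demo_* functions are stand-ins written for this example, not the kernel implementations.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace stand-in for memchr_inv(): returns a pointer to the first byte
 * that differs from `c`, or NULL if all `n` bytes equal `c`.
 */
static const void *demo_memchr_inv(const void *p, int c, size_t n)
{
	const unsigned char *b = p;
	size_t i;

	assert(p != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (b[i] != (unsigned char)c)
			return &b[i];
	return NULL;
}

/* Stand-in for mem_is_zero(): true when every byte is zero. */
static int demo_mem_is_zero(const void *p, size_t n)
{
	return demo_memchr_inv(p, 0, n) == NULL;
}
```

With the first helper a caller has to remember that NULL means "all zero"; the second states the intent directly.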
Pavel Begunkov
9eb3c57178 io_uring/zcrx: improve rqe cache alignment
Refill queue entries are 16B structures, but because of the ring header
placement, they're 8B aligned but not naturally / 16B aligned, which
means some of them span across 2 cache lines. Push rqes to a new cache
line.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16 12:37:20 -06:00
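Sketch for 9eb3c57178: a small calculation showing how 16B entries that start at an 8B-aligned (but not 16B-aligned) offset end up straddling cache lines. The 64-byte cache line and the example offsets are assumptions for illustration; the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64u	/* assumed cache line size */
#define RQE_SIZE   16u	/* refill queue entries are 16B per the commit message */

/* Returns 1 if a 16B entry starting at byte offset `off` crosses a cache line. */
static int rqe_straddles(size_t off)
{
	return (off / CACHE_LINE) != ((off + RQE_SIZE - 1) / CACHE_LINE);
}

/* Counts straddling entries among `n` entries laid out back to back. */
static unsigned count_straddlers(size_t first_off, unsigned n)
{
	unsigned i, hits = 0;

	assert(first_off % 8 == 0);
	for (i = 0; i < n; i++)
		hits += rqe_straddles(first_off + (size_t)i * RQE_SIZE);
	return hits;
}

/* Rounds `off` up to the next cache line boundary. */
static size_t cache_align(size_t off)
{
	return (off + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
}
```

With a hypothetical 8-byte header, every fourth entry crosses a line; starting the entries at cache_align(8) removes the straddling entirely.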
Jens Axboe
1b3aa39007 io_uring/uring_cmd: correct signature for io_uring_mshot_cmd_post_cqe()
The !CONFIG_IO_URING signature is wrong, fix that up. The non-stub
signature got updated for the io_br_sel changes that happened before
this patch went in, but the stub one did not.

Fixes: 620a50c927 ("io_uring: uring_cmd: add multishot support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-15 09:17:24 -06:00
Jens Axboe
9adc6669a6 io_uring: correct size of overflow CQE calculation
If a 32b CQE is required, don't double the size of the overflow struct,
just add the size of the io_uring_cqe addition that is needed. This
avoids allocating too much memory, as the io_overflow_cqe size includes
the list member required to queue them too.

Fixes: e26dca67fd ("io_uring: add support for IORING_SETUP_CQE_MIXED")
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 17:30:51 -06:00
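Sketch for 9adc6669a6: the size arithmetic behind the fix, using simplified stand-in structures. The demo_* layouts are assumptions and only mirror the idea that the overflow entry embeds a list member in front of the CQE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; not the kernel definitions. */
struct demo_cqe {
	uint64_t user_data;
	int32_t res;
	uint32_t flags;
};

struct demo_list_head {
	struct demo_list_head *next, *prev;
};

struct demo_overflow_cqe {
	struct demo_list_head list;	/* needed to queue the entry */
	struct demo_cqe cqe;
};

/* Old approach: double the whole struct for a 32b CQE. */
static size_t ocqe_size_doubled(int is_cqe32)
{
	size_t sz = sizeof(struct demo_overflow_cqe);

	return is_cqe32 ? sz * 2 : sz;
}

/* Fixed approach: only add room for the extra 16B of CQE payload. */
static size_t ocqe_size_exact(int is_cqe32)
{
	size_t sz = sizeof(struct demo_overflow_cqe);

	assert(sizeof(struct demo_cqe) == 16);
	return is_cqe32 ? sz + sizeof(struct demo_cqe) : sz;
}
```

In this model the doubled size over-allocates by exactly the size of the list member, which is the waste the commit message describes.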
Marco Crivellari
9f5f69d98e io_uring: replace use of system_unbound_wq with system_dfl_wq
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.

Adding system_dfl_wq encourages its use when unbound work should be used.

queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new unbound wq: if the user still uses the old wq, a warning will be
printed along with a wq redirect to the new one.

The old system_unbound_wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 08:21:03 -06:00
Marco Crivellari
8577441d4a io_uring: replace use of system_wq with system_percpu_wq
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_wq is a per-CPU workqueue, yet nothing in its name indicates that
CPU affinity constraint, which is very often not required by users. Make
it clear by adding a system_percpu_wq.

queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new per-cpu wq: if the user still sticks to the old name, a warning will
be printed along with a wq redirect to the new one.

This patch adds the new system_percpu_wq except for the mm, fs and net
subsystems, which are handled in separate patches.

The old wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 08:21:02 -06:00
Caleb Sander Mateos
2f076a453f io_uring/rsrc: respect submitter_task in io_register_clone_buffers()
io_ring_ctx's enabled with IORING_SETUP_SINGLE_ISSUER are only allowed
a single task submitting to the ctx. Although the documentation only
mentions this restriction applying to io_uring_enter() syscalls,
commit d7cce96c44 ("io_uring: limit registration w/ SINGLE_ISSUER")
extends it to io_uring_register(). Ensuring only one task interacts
with the io_ring_ctx will be important to allow this task to avoid
taking the uring_lock.
There is, however, one gap in these checks: io_register_clone_buffers()
may take the uring_lock on a second (source) io_ring_ctx, but
__io_uring_register() only checks the current thread against the
*destination* io_ring_ctx's submitter_task. Fail the
IORING_REGISTER_CLONE_BUFFERS with -EEXIST if the source io_ring_ctx has
a registered submitter_task other than the current task.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-08 13:21:24 -06:00
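Sketch for 2f076a453f: the shape of the added check on the source ring, modelled with hypothetical demo_* types. Only the -EEXIST return value comes from the commit message; the rest is an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_task {
	int pid;
};

struct demo_ctx {
	/* NULL when no single submitter has been registered */
	const struct demo_task *submitter_task;
};

/*
 * The destination ring is already validated by the generic register path;
 * this models the extra validation of the source ring.
 */
static int demo_clone_check_src(const struct demo_ctx *src,
				const struct demo_task *current_task)
{
	assert(src != NULL);
	if (src->submitter_task && src->submitter_task != current_task)
		return -EEXIST;
	return 0;
}
```

A source ring with no registered submitter, or one owned by the calling task, passes; a ring owned by another task is rejected.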
Caleb Sander Mateos
5d4c52bfa8 io_uring: don't include filetable.h in io_uring.h
io_uring/io_uring.h doesn't use anything declared in
io_uring/filetable.h, so drop the unnecessary #include. Add filetable.h
includes in .c files previously relying on the transitive include from
io_uring.h.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-08 13:20:46 -06:00
Thorsten Blum
7b0604d77a io_uring: Replace kzalloc() + copy_from_user() with memdup_user()
Replace kzalloc() followed by copy_from_user() with memdup_user() to
improve and simplify io_probe().

No functional changes intended.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-08 08:21:36 -06:00
Jens Axboe
473efbc3ca io_uring/uring_cmd: fix __io_uring_cmd_do_in_task !CONFIG_IO_URING typo
A manual application of this patch resulted in a typo for the stub
function __io_uring_cmd_do_in_task(), for the case where CONFIG_IO_URING
isn't true. Fix that up.

Reported-by: Klara Modin <klarasmodin@gmail.com>
Fixes: df3a7762ee ("io_uring/uring_cmd: add io_uring_cmd_tw_t type alias")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-08 08:18:15 -06:00
Pavel Begunkov
c265ae75f9 io_uring: introduce io_uring querying
There are many parameters users might want to query about io_uring like
available request types or the ring sizes. This patch introduces an
interface for such slow path queries.

It was written with several requirements in mind:
- Can be used with or without an io_uring instance. Asking for supported
  setup flags before creating an instance as well as querying info about
  an already created ring are valid use cases.
- Should be moderately fast. For example, users might use it to
  periodically retrieve ring attributes at runtime. As a consequence,
  it should be able to query multiple attributes in a single syscall.
- Backward and forward compatible.
- Should be reasonably easy to use.
- Reduce the kernel code size for introducing new query types.

It's implemented as a new registration opcode IORING_REGISTER_QUERY.
The user passes one or more query structures linked together, each
represented by struct io_uring_query_hdr. The header stores common
control fields needed for processing and points to query type specific
information.

The header contains
- The query type
- The result field, which on return contains the error code for the query
- Pointer to the query type specific information
- The size of the query structure. The kernel will only populate up to
  the size, which helps with backward compatibility. The kernel can also
  reduce the size, so if the current kernel is older than the interface
  the user tries to use, it'll get only the supported bits.
- next_entry field is used to chain multiple queries.

Apart from common registration syscall failures, it can only immediately
return an error code when the headers are incorrect or any other
addresses are invalid. That usually means that the userspace
doesn't use the API right and should be corrected. All query type
specific errors are returned in the header's result field.

As an example, the patch adds a single query type for now, i.e.
IO_URING_QUERY_OPCODES, which tells what register / request / etc.
opcodes are supported, but there are particular plans to extend it.

Note: there is a request probing interface via IORING_REGISTER_PROBE,
but it's a mess. It requires the user to create a ring first, it only
works for requests, and requires dynamic allocations.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-08 08:06:37 -06:00
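Sketch for c265ae75f9: a userspace model of how a chain of query headers could be processed, with per-query results and size clamping. The demo_* structures, field widths, opcode value, reported numbers, and the chain length cap are all assumptions for illustration and do not reproduce the real UAPI layout.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define DEMO_QUERY_OPCODES 0u	/* hypothetical query type value */
#define DEMO_MAX_CHAIN     16	/* hypothetical guard against cyclic chains */

struct demo_query_hdr {
	uint64_t next_entry;	/* address of the next header, 0 ends the chain */
	uint64_t query_data;	/* address of the type specific payload */
	uint32_t query_op;	/* the query type */
	uint32_t size;		/* in: bytes provided; out: bytes filled */
	int32_t result;		/* per-query error code */
};

struct demo_query_opcodes {
	uint32_t nr_request_opcodes;
	uint32_t nr_register_opcodes;
	uint64_t feature_flags;
};

/* Fills one query; type specific errors only ever land in hdr->result. */
static void demo_handle_one(struct demo_query_hdr *hdr)
{
	/* made-up values standing in for what a kernel would report */
	static const struct demo_query_opcodes supported = { 11, 22, 0x33 };
	uint32_t n;

	if (hdr->query_op != DEMO_QUERY_OPCODES) {
		hdr->result = -EINVAL;
		hdr->size = 0;
		return;
	}
	/* populate only up to the smaller of the two sizes */
	n = hdr->size < sizeof(supported) ? hdr->size : (uint32_t)sizeof(supported);
	memcpy((void *)(uintptr_t)hdr->query_data, &supported, n);
	hdr->size = n;
	hdr->result = 0;
}

/* Walks the chain; a non-zero return models a malformed chain. */
static int demo_query(struct demo_query_hdr *first)
{
	struct demo_query_hdr *hdr = first;
	int seen = 0;

	while (hdr) {
		if (++seen > DEMO_MAX_CHAIN)
			return -EINVAL;
		demo_handle_one(hdr);
		hdr = (struct demo_query_hdr *)(uintptr_t)hdr->next_entry;
	}
	return 0;
}
```

A caller built against an older, smaller structure gets only the bytes it asked for, while a caller passing a larger buffer sees the size reduced to what this model supports.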
Pavel Begunkov
63805d0a9b io_uring: add macros for avaliable flags
Add constants for supported setup / request / feature flags as well as
the feature mask. They'll be used in the next patch.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-08 08:06:37 -06:00
Pavel Begunkov
da8bc3c81c io_uring: add helper for *REGISTER_SEND_MSG_RING
Move handling of IORING_REGISTER_SEND_MSG_RING into a separate function
in preparation to growing io_uring_register_blind().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-08 08:06:37 -06:00
Caleb Sander Mateos
c2685729fa io_uring: remove WRITE_ONCE() in io_uring_create()
There's no need to use WRITE_ONCE() to set ctx->submitter_task in
io_uring_create() since no other task can access the io_ring_ctx until a
file descriptor is associated with it. So use a normal assignment
instead of WRITE_ONCE().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250904161223.2600435-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-04 17:22:24 -06:00
Caleb Sander Mateos
9f8608fce9 io_uring/cmd: remove unused io_uring_cmd_iopoll_done()
io_uring_cmd_iopoll_done()'s only caller was removed in commit
9ce6c9875f ("nvme: always punt polled uring_cmd end_io work to
task_work"). So remove the unused function too.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250902013328.1517686-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-03 17:35:26 -06:00
Caleb Sander Mateos
dd386b0d5e io_uring/uring_cmd: correct io_uring_cmd_done() ret type
io_uring_cmd_done() takes the result code for the CQE as a ssize_t ret
argument. However, the CQE res field is a s32 value, as is the argument
to io_req_set_res(). To clarify that only s32 values can be faithfully
represented without truncation, change io_uring_cmd_done()'s ret
argument type to s32.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250902012609.1513123-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-03 17:34:36 -06:00
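Sketch for dd386b0d5e: why a 32-bit parameter documents the real range of the result. The demo_* names are hypothetical, and int64_t stands in for ssize_t on a 64-bit build.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the CQE: res is a signed 32-bit field. */
struct demo_cqe {
	uint64_t user_data;
	int32_t res;
	uint32_t flags;
};

/* True when a wider result value can be stored in cqe->res unchanged. */
static int fits_s32(int64_t v)
{
	return v >= INT32_MIN && v <= INT32_MAX;
}

/* The parameter type already matches cqe->res, so nothing is narrowed here. */
static void demo_cmd_done(struct demo_cqe *cqe, int32_t ret)
{
	assert(cqe != 0);
	cqe->res = ret;
}
```

With the narrower parameter, any truncation would have to happen visibly at the call site rather than silently inside the helper.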
Caleb Sander Mateos
df3a7762ee io_uring/uring_cmd: add io_uring_cmd_tw_t type alias
Introduce a function pointer type alias io_uring_cmd_tw_t for the
uring_cmd task work callback. This avoids repeating the signature in
several places. Also name both arguments to the callback to clarify what
they represent.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250902160657.1726828-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-02 19:21:12 -06:00
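Sketch for df3a7762ee: what a function pointer type alias with named parameters buys. All demo_* names are hypothetical and only mirror the pattern, not the kernel's actual signature.

```c
#include <assert.h>

struct demo_cmd {
	int result;
};

/* One alias instead of repeating the signature; both parameters are named. */
typedef void (*demo_cmd_tw_t)(struct demo_cmd *cmd, unsigned int issue_flags);

/* An example callback matching the alias. */
static void demo_complete(struct demo_cmd *cmd, unsigned int issue_flags)
{
	cmd->result = (int)issue_flags;
}

/* Users of the callback now take the alias rather than a raw signature. */
static void demo_run_in_task(struct demo_cmd *cmd, demo_cmd_tw_t cb,
			     unsigned int issue_flags)
{
	assert(cb != 0);
	cb(cmd, issue_flags);
}
```

If the callback signature changes later, only the typedef and the callbacks need updating, not every prototype that accepts one.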
Caleb Sander Mateos
8b9c9a2e7d io_uring/register: drop redundant submitter_task check
For IORING_SETUP_SINGLE_ISSUER io_ring_ctx's, io_register_resize_rings()
checks that the current task is the ctx's submitter_task. However, its
caller __io_uring_register() already checks this. Drop the redundant
check in io_register_resize_rings().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250902215108.1925105-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-02 19:20:24 -06:00
Jens Axboe
37500634d0 io_uring/net: correct type for min_not_zero() cast
The kernel test robot reports that after a recent change, the signedness
of a min_not_zero() compare is now incorrect. Fix that up and cast to
the right type.

Fixes: 429884ff35 ("io_uring/kbuf: use struct io_br_sel for multiple buffers picking")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202509020426.WJtrdwOU-lkp@intel.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-02 05:19:42 -06:00
Jens Axboe
4c0b26e23c io_uring: add async data clear/free helpers
Futex recently had an issue where it mishandled ->async_data and
REQ_F_ASYNC_DATA. To avoid future issues like that, add a set
of helpers that either clear or clear-and-free the async data assigned
to a struct io_kiocb.

Convert existing manual handling of that to use the helpers. No intended
functional changes in this patch.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-27 11:24:25 -06:00
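Sketch for 4c0b26e23c: the general idea of keeping a pointer and its "valid" flag in sync through helpers. The demo_* struct, flag value, and function names are hypothetical, and free() stands in for the kernel allocator.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_REQ_F_ASYNC_DATA (1u << 0)	/* placeholder flag value */

struct demo_req {
	unsigned flags;
	void *async_data;
};

/* Clear only: drops the flag and the pointer together, never just one. */
static void demo_async_data_clear(struct demo_req *req)
{
	assert(req != NULL);
	req->flags &= ~DEMO_REQ_F_ASYNC_DATA;
	req->async_data = NULL;
}

/* Clear and free: releases the memory first, then clears both fields. */
static void demo_async_data_free(struct demo_req *req)
{
	assert(req != NULL);
	free(req->async_data);
	demo_async_data_clear(req);
}
```

Routing every site through helpers like these makes it hard to end up with the flag set and a stale or NULL pointer, which is the class of bug the commit message refers to.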
Jens Axboe
c986f7586b io_uring/zcrx: add support for IORING_SETUP_CQE_MIXED
zcrx currently requires the ring to be set up with fixed 32b CQEs,
allow it to use IORING_SETUP_CQE_MIXED as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-27 11:24:22 -06:00
Jens Axboe
1e81bf1414 io_uring/uring_cmd: add support for IORING_SETUP_CQE_MIXED
Certain users of uring_cmd currently require fixed 32b CQE support,
which is propagated through IO_URING_F_CQE32. Allow
IORING_SETUP_CQE_MIXED to cover that case as well, so not all CQEs
posted need to be 32b in size.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-27 11:24:15 -06:00
Jens Axboe
806ecb209a io_uring/nop: add support for IORING_SETUP_CQE_MIXED
This adds support for setting IORING_NOP_CQE32 as a flag for a NOP
command, in which case a 32b CQE will be posted rather than a regular
one. This is the default if the ring has been set up with
IORING_SETUP_CQE32. If the ring has been set up with
IORING_SETUP_CQE_MIXED, then 16b CQEs will be posted without this flag
set, and 32b CQEs if this flag is set. For the latter case, sqe->off is
what will be posted as cqe->big_cqe[0] and sqe->addr is what will be
posted as cqe->big_cqe[1].

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-27 11:24:15 -06:00
Jens Axboe
e26dca67fd io_uring: add support for IORING_SETUP_CQE_MIXED
Normal rings support 16b CQEs for posting completions, while certain
features require the ring to be configured with IORING_SETUP_CQE32, as
they need to convey more information per completion. This, in turn,
makes ALL the CQEs be 32b in size. This is somewhat wasteful and
inefficient, particularly when only certain CQEs need to be of the
bigger variant.

This adds support for setting up a ring with mixed CQE sizes, using
IORING_SETUP_CQE_MIXED. When set up in this mode, CQEs posted to the ring
may be either 16b or 32b in size. If a CQE is 32b in size, then
IORING_CQE_F_32 is set in the CQE flags to indicate that this is the
case. If this flag isn't set, the CQE is the normal 16b variant.

CQEs on these types of mixed rings may also have IORING_CQE_F_SKIP set.
This can happen if the ring is one (small) CQE entry away from wrapping,
and an attempt is made to post a 32b CQE. As CQEs must be contiguous in
the CQ ring, a 32b CQE cannot wrap the ring. For this case, a single
dummy CQE is posted with the SKIP flag set. The application should
simply ignore those.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-27 11:23:57 -06:00
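Sketch for e26dca67fd: a userspace model of a mixed-size CQ ring, showing the skip entry posted before a 32b CQE that would otherwise wrap, and a consumer that honours both flags. The demo_* names and flag bit values are placeholders, not the UAPI definitions, and the producer assumes the ring has free space.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_CQE_F_SKIP (1u << 0)	/* placeholder bit values */
#define DEMO_CQE_F_32   (1u << 1)

struct demo_cqe {
	uint64_t user_data;
	int32_t res;
	uint32_t flags;
};

/* Posts one CQE; `entries` must be a power of two. */
static void demo_post(struct demo_cqe *ring, unsigned entries, unsigned *tail,
		      uint64_t user_data, int big)
{
	unsigned idx = *tail & (entries - 1);

	if (big && idx == entries - 1) {
		/* a 32b CQE cannot wrap: fill the last slot with a dummy */
		ring[idx].user_data = 0;
		ring[idx].res = 0;
		ring[idx].flags = DEMO_CQE_F_SKIP;
		(*tail)++;
		idx = 0;
	}
	ring[idx].user_data = user_data;
	ring[idx].res = 0;
	ring[idx].flags = big ? DEMO_CQE_F_32 : 0;
	*tail += big ? 2 : 1;
}

/* Consumes slots from head to tail; returns the number of real completions. */
static unsigned demo_reap(const struct demo_cqe *ring, unsigned entries,
			  unsigned head, unsigned tail, unsigned *nr_big)
{
	unsigned done = 0;

	*nr_big = 0;
	while ((int32_t)(tail - head) > 0) {
		const struct demo_cqe *cqe = &ring[head & (entries - 1)];

		if (cqe->flags & DEMO_CQE_F_SKIP) {
			head += 1;	/* dummy entry, ignore it */
			continue;
		}
		if (cqe->flags & DEMO_CQE_F_32) {
			head += 2;	/* big CQE occupies two 16b slots */
			(*nr_big)++;
		} else {
			head += 1;
		}
		done++;
	}
	return done;
}
```

The consumer advances by one slot for normal and skip entries and by two for big ones, so the second half of a 32b CQE is never misread as a separate completion.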
Jens Axboe
89a8859721 io_uring/trace: support completion tracing of mixed 32b CQEs
Check for IORING_CQE_F_32 as well, not just whether the ring was set up with
IORING_SETUP_CQE32 to only support big CQEs.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:13 -06:00
Jens Axboe
82ceb7fcc5 io_uring/fdinfo: handle mixed sized CQEs
Ensure that the CQ ring iteration handles differently sized CQEs, not
just a fixed 16b or 32b size per ring. These CQEs aren't possible just
yet, but prepare the fdinfo CQ ring dumping for handling them.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:13 -06:00
Jens Axboe
b69458735d io_uring: add UAPI definitions for mixed CQE postings
This adds the CQE flags related to supporting a mixed CQ ring mode, where
both normal (16b) and big (32b) CQEs may be posted.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:12 -06:00
Jens Axboe
d0201c4436 io_uring: remove io_ctx_cqe32() helper
It's pretty pointless and only used for the tracing helper, get rid
of it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:12 -06:00
Caleb Sander Mateos
e5c717e795 io_uring/cmd: consolidate REQ_F_BUFFER_SELECT checks
io_uring_cmd_prep() checks that REQ_F_BUFFER_SELECT is set in the
io_kiocb's flags iff IORING_URING_CMD_MULTISHOT is set in the SQE's
uring_cmd_flags. Consolidate the IORING_URING_CMD_MULTISHOT and
!IORING_URING_CMD_MULTISHOT branches into a single check that the
IORING_URING_CMD_MULTISHOT flag matches the REQ_F_BUFFER_SELECT flag.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250821163308.977915-4-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:12 -06:00
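Sketch for e5c717e795: two branches that each test one direction of an "iff" can be folded into a single comparison of the two conditions. The flag values and the -EINVAL return code are assumptions for this example; only the flag names' roles come from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define DEMO_CMD_MULTISHOT     (1u << 1)	/* placeholder values */
#define DEMO_REQ_BUFFER_SELECT (1u << 4)

/* Before: one branch per direction of the requirement. */
static int check_split(unsigned cmd_flags, unsigned req_flags)
{
	if (cmd_flags & DEMO_CMD_MULTISHOT) {
		if (!(req_flags & DEMO_REQ_BUFFER_SELECT))
			return -EINVAL;
	} else {
		if (req_flags & DEMO_REQ_BUFFER_SELECT)
			return -EINVAL;
	}
	return 0;
}

/* After: the two flags must simply agree. */
static int check_merged(unsigned cmd_flags, unsigned req_flags)
{
	bool mshot = cmd_flags & DEMO_CMD_MULTISHOT;
	bool bufsel = req_flags & DEMO_REQ_BUFFER_SELECT;

	return mshot != bufsel ? -EINVAL : 0;
}

/* Exhaustively compares both versions over the four flag combinations. */
static int demo_checks_agree(void)
{
	static const unsigned cmd[] = { 0, DEMO_CMD_MULTISHOT };
	static const unsigned req[] = { 0, DEMO_REQ_BUFFER_SELECT };
	int i, j;

	for (i = 0; i < 2; i++)
		for (j = 0; j < 2; j++)
			if (check_split(cmd[i], req[j]) != check_merged(cmd[i], req[j]))
				return 0;
	return 1;
}
```

Normalising each masked value to a bool before comparing matters here, since the two flags occupy different bit positions.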
Caleb Sander Mateos
3484f530f8 io_uring/cmd: deduplicate uring_cmd_flags checks
io_uring_cmd_prep() currently has two checks for whether
IORING_URING_CMD_FIXED and IORING_URING_CMD_MULTISHOT are both set in
uring_cmd_flags. Remove the second check.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250821163308.977915-3-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:12 -06:00
Ming Lei
620a50c927 io_uring: uring_cmd: add multishot support
Add UAPI flag IORING_URING_CMD_MULTISHOT for supporting multishot
uring_cmd operations with provided buffer.

This enables drivers to post multiple completion events from a single
uring_cmd submission, which is useful for:

- Notifying userspace of device events (e.g., interrupt handling)
- Supporting devices with multiple event sources (e.g., multi-queue devices)
- Avoiding the need for device poll() support when events originate
  from multiple sources device-wide

The implementation adds two new APIs:
- io_uring_cmd_select_buffer(): selects a buffer from the provided
  buffer group for multishot uring_cmd
- io_uring_mshot_cmd_post_cqe(): posts a CQE after event data is
  pushed to the provided buffer

Multishot uring_cmd must be used with buffer select (IOSQE_BUFFER_SELECT)
and is mutually exclusive with IORING_URING_CMD_FIXED for now.

The ublk driver will be the first user of this functionality:

	https://github.com/ming1/linux/commits/ublk-devel/

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250821040210.1152145-3-ming.lei@redhat.com
[axboe: fold in fix for !CONFIG_IO_URING]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:12 -06:00
Ming Lei
d589bcddaa io-uring: move struct io_br_sel into io_uring_types.h
Move `struct io_br_sel` into io_uring_types.h and prepare for supporting
provided buffer on uring_cmd.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250821040210.1152145-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:12 -06:00
Jens Axboe
fe524b0684 io_uring/kbuf: check for ring provided buffers first in recycling
This is the most likely of paths if a provided buffer is used, so offer
it up first and push the legacy buffers to later.

Link: https://lore.kernel.org/r/20250821020750.598432-14-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:12 -06:00
Jens Axboe
e973837b54 io_uring: remove async/poll related provided buffer recycles
These aren't necessary anymore, get rid of them.

Link: https://lore.kernel.org/r/20250821020750.598432-13-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:12 -06:00
Jens Axboe
5fda512554 io_uring/kbuf: switch to storing struct io_buffer_list locally
Currently the buffer list is stored in struct io_kiocb. The buffer list
can be of two types:

1) Classic/legacy buffer list. These don't need to get referenced after
   a buffer pick, and hence storing them in struct io_kiocb is perfectly
   fine.

2) Ring provided buffer lists. These DO need to be referenced after the
   initial buffer pick, as they need to get consumed later on. This can
   be either just incrementing the head of the ring, or it can be
   consuming parts of a buffer if incremental buffer consumption has
   been configured.

For case 2, io_uring needs to be careful not to access the buffer list
after the initial pick-and-execute context. The core does recycling of
these, but it's easy to make a mistake, because it's stored in the
io_kiocb which does persist across multiple execution contexts. Either
because it's a multishot request, or simply because it needed some kind
of async trigger (eg poll) for retry purposes.

Add a struct io_buffer_list to struct io_br_sel, which is always on
stack for the various users of it. This prevents the buffer list from
leaking outside of that execution context, and additionally it enables
kbuf to not even pass back the struct io_buffer_list if the given
context isn't appropriately locked already.

This doesn't fix any bugs, it's simply a defensive measure to prevent
any issues with reuse of a buffer list.

Link: https://lore.kernel.org/r/20250821020750.598432-12-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:12 -06:00
Jens Axboe
461382a51f io_uring/net: use struct io_br_sel->val as the send finish value
Currently a pointer is passed in to the 'ret' in the send mshot handler,
but since we already have a value field in io_br_sel, just use that.
This is also in preparation for needing to pass in struct io_br_sel
to io_send_finish() anyway.

Link: https://lore.kernel.org/r/20250821020750.598432-11-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:12 -06:00
Jens Axboe
58d8150918 io_uring/net: use struct io_br_sel->val as the recv finish value
Currently a pointer is passed in to the 'ret' in the receive handlers,
but since we already have a value field in io_br_sel, just use that.
This is also in preparation for needing to pass in struct io_br_sel
to io_recv_finish() anyway.

Link: https://lore.kernel.org/r/20250821020750.598432-10-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:12 -06:00
Jens Axboe
429884ff35 io_uring/kbuf: use struct io_br_sel for multiple buffers picking
The networking side uses bundles, which is picking multiple buffers at
the same time. Pass in struct io_br_sel to those helpers.

Link: https://lore.kernel.org/r/20250821020750.598432-9-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:12 -06:00
Jens Axboe
d8e1dec2f8 io_uring/rw: recycle buffers manually for non-mshot reads
The mshot side of reads already does this, but the regular read path
does not. This leads to needing recycling checks sprinkled in various
spots in the "go async" path, like arming poll. In preparation for
getting rid of those, ensure that read recycles appropriately.

Link: https://lore.kernel.org/r/20250821020750.598432-8-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:12 -06:00
Jens Axboe
ab6559bdbb io_uring/kbuf: introduce struct io_br_sel
Rather than return addresses directly from buffer selection, add a
struct around it. No functional changes in this patch, it's in
preparation for storing more buffer related information locally, rather
than in struct io_kiocb.

Link: https://lore.kernel.org/r/20250821020750.598432-7-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:12 -06:00
Jens Axboe
1b5add75d7 io_uring/kbuf: pass in struct io_buffer_list to commit/recycle helpers
Rather than have this implied being in the io_kiocb, pass it in directly
so it's immediately obvious where these users of ->buf_list are coming
from.

Link: https://lore.kernel.org/r/20250821020750.598432-6-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:12 -06:00
Jens Axboe
b22743f29b io_uring/net: clarify io_recv_buf_select() return value
It returns 0 on success, less than zero on error.

Link: https://lore.kernel.org/r/20250821020750.598432-5-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:12 -06:00
Jens Axboe
15ba5e51e6 io_uring/net: don't use io_net_kbuf_recyle() for non-provided cases
A previous commit used io_net_kbuf_recyle() for any network helper that
did IO and needed partial retry. However, that's only needed if the
opcode does buffer selection, which isn't supported for sendzc, sendmsg_zc,
or sendmsg. Just remove them - they don't do any harm, but it is a bit
confusing when reading the code.

Link: https://lore.kernel.org/r/20250821020750.598432-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:11 -06:00
Jens Axboe
5e73b402cb io_uring/kbuf: drop 'issue_flags' from io_put_kbuf(s)() arguments
Picking multiple buffers always requires the ring lock to be held across
the operation, so there's no need to pass in the issue_flags to
io_put_kbufs(). On the single buffer side, if the initial picking of a
ring buffer was unlocked, then it will have been committed already. For
legacy buffers, no locking is required, as they will simply be freed.

Link: https://lore.kernel.org/r/20250821020750.598432-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:11 -06:00
Pavel Begunkov
ab3ea6eac5 io_uring/zctx: check chained notif contexts
Send zc only links ubuf_info for requests coming from the same context.
There are some ambiguous syz reports, so let's check the assumption on
notification completion.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fd527d8638203fe0f1c5ff06ff2e1d8fd68f831b.1755179962.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:11 -06:00
Pavel Begunkov
92a96b0a22 io_uring: add request poisoning
Poison various request fields on free. __io_req_caches_free() is a slow
path, so can be done unconditionally, but gate it on kasan for
io_req_add_to_cache(). Note that some fields are logically retained
between cache allocations and can't be poisoned in
io_req_add_to_cache().

Ideally, it'd be replaced with KASAN'ed caches, but that can't be
enabled because of some synchronisation nuances.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7a78e8a7f5be434313c400650b862e36c211b312.1755459452.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-24 11:41:11 -06:00