linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-21 09:05:20 -04:00

Author	SHA1	Message	Date
Pavel Begunkov	bdc0d478a1	io_uring/zcrx: replace memchar_inv with is_zero memchr_inv() is more ambiguous than mem_is_zero(), so use the latter for zero checks. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-16 12:37:20 -06:00
Pavel Begunkov	9eb3c57178	io_uring/zcrx: improve rqe cache alignment Refill queue entries are 16B structures, but because of the ring header placement, they're 8B aligned but not naturally / 16B aligned, which means some of them span across 2 cache lines. Push rqes to a new cache line. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-16 12:37:20 -06:00
Feng Zhou	3a0ac20253	io_uring/zcrx: fix ifq->if_rxq is -1, get dma_dev is NULL ifq->if_rxq has not been assigned, is -1, the correct value is in reg.if_rxq. Fixes: `59b8b32ac8` ("io_uring/zcrx: add support for custom DMA devices") Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://patch.msgid.link/20250912140133.97741-1-zhoufeng.zf@bytedance.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-15 18:12:53 -07:00
Max Kellermann	cd4ea81be3	io_uring/io-wq: fix `max_workers` breakage and `nr_workers` underflow Commit 88e6c42e40de ("io_uring/io-wq: add check free worker before create new worker") reused the variable `do_create` for something else, abusing it for the free worker check. This caused the value to effectively always be `true` at the time `nr_workers < max_workers` was checked, but it should really be `false`. This means the `max_workers` setting was ignored, and worse: if the limit had already been reached, incrementing `nr_workers` was skipped even though another worker would be created. When later lots of workers exit, the `nr_workers` field could easily underflow, making the problem worse because more and more workers would be created without incrementing `nr_workers`. The simple solution is to use a different variable for the free worker check instead of using one variable for two different things. Cc: stable@vger.kernel.org Fixes: 88e6c42e40de ("io_uring/io-wq: add check free worker before create new worker") Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Fengnan Chang <changfengnan@bytedance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-15 10:46:13 -06:00
Jakub Kicinski	fc3a281041	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.17-rc6). Conflicts: net/netfilter/nft_set_pipapo.c net/netfilter/nft_set_pipapo_avx2.c `c4eaca2e10` ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups") `84c1da7b38` ("netfilter: nft_set_pipapo: use avx2 algorithm for insertions too") Only trivial adjacent changes (in a doc and a Makefile). Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-11 17:40:13 -07:00
Jens Axboe	9adc6669a6	io_uring: correct size of overflow CQE calculation If a 32b CQE is required, don't double the size of the overflow struct, just add the size of the io_uring_cqe addition that is needed. This avoids allocating too much memory, as the io_overflow_cqe size includes the list member required to queue them too. Fixes: `e26dca67fd` ("io_uring: add support for IORING_SETUP_CQE_MIXED") Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-10 17:30:51 -06:00
Marco Crivellari	9f5f69d98e	io_uring: replace use of system_unbound_wq with system_dfl_wq Currently, if a user enqueues a work item using schedule_delayed_work(), the wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work(), which uses system_wq, and queue_work(), which again uses WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required. Add system_dfl_wq to encourage its use when unbound work should be used. queue_work() / queue_delayed_work() / mod_delayed_work() will now use the new unbound wq: if the user still uses the old wq, a warning will be printed along with a wq redirect to the new one. The old system_unbound_wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-10 08:21:03 -06:00
Marco Crivellari	8577441d4a	io_uring: replace use of system_wq with system_percpu_wq Currently, if a user enqueues a work item using schedule_delayed_work(), the wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work(), which uses system_wq, and queue_work(), which again uses WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. system_wq is a per-CPU workqueue, yet nothing in its name tells about that CPU affinity constraint, which is very often not required by users. Make it clear by adding a system_percpu_wq. queue_work() / queue_delayed_work() / mod_delayed_work() will now use the new per-cpu wq: if the user still sticks to the old name, a warning will be printed along with a wq redirect to the new one. This patch adds the new system_percpu_wq except for the mm, fs and net subsystems, which are handled in separate patches. The old wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-10 08:21:02 -06:00
Caleb Sander Mateos	2f076a453f	io_uring/rsrc: respect submitter_task in io_register_clone_buffers() io_ring_ctx's enabled with IORING_SETUP_SINGLE_ISSUER are only allowed a single task submitting to the ctx. Although the documentation only mentions this restriction applying to io_uring_enter() syscalls, commit `d7cce96c44` ("io_uring: limit registration w/ SINGLE_ISSUER") extends it to io_uring_register(). Ensuring only one task interacts with the io_ring_ctx will be important to allow this task to avoid taking the uring_lock. There is, however, one gap in these checks: io_register_clone_buffers() may take the uring_lock on a second (source) io_ring_ctx, but __io_uring_register() only checks the current thread against the destination io_ring_ctx's submitter_task. Fail the IORING_REGISTER_CLONE_BUFFERS with -EEXIST if the source io_ring_ctx has a registered submitter_task other than the current task. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-08 13:21:24 -06:00
Caleb Sander Mateos	5d4c52bfa8	io_uring: don't include filetable.h in io_uring.h io_uring/io_uring.h doesn't use anything declared in io_uring/filetable.h, so drop the unnecessary #include. Add filetable.h includes in .c files previously relying on the transitive include from io_uring.h. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-08 13:20:46 -06:00
Linus Torvalds	f777d1112e	Merge tag 'vfs-6.17-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "fuse: - Prevent opening of non-regular backing files. Fuse doesn't support non-regular files anyway. - Check whether copy_file_range() returns a larger size than requested. - Prevent overflow in copy_file_range() as fuse currently only supports 32-bit sized copies. - Cache the blocksize value if the server returned a new value as inode->i_blkbits isn't modified directly anymore. - Fix i_blkbits handling for iomap partial writes. By default i_blkbits is set to PAGE_SIZE which causes iomap to mark the whole folio as uptodate even on a partial write. But fuseblk filesystems support choosing a blocksize smaller than PAGE_SIZE risking data corruption. Simply enforce PAGE_SIZE as blocksize for fuseblk's internal inode for now. - Prevent out-of-bounds access in fuse_dev_write() when the number of bytes to be retrieved is truncated to the fc->max_pages limit. virtiofs: - Fix page faults for DAX page addresses. Misc: - Tighten file handle decoding from userns. Check that the decoded dentry itself has a valid idmapping in the user namespace. - Fix mount-notify selftests. - Fix some indentation errors. - Add an FMODE_ flag to indicate IOCB_HAS_METADATA availability. This will be moved to an FOP_* flag with a bit more rework needed for that to happen not suitable for a fix. - Don't silently ignore metadata for sync read/write. - Don't pointlessly log warning when reading coredump sysctls" * tag 'vfs-6.17-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fuse: virtio_fs: fix page fault for DAX page address selftests/fs/mount-notify: Fix compilation failure. fhandle: use more consistent rules for decoding file handle from userns fuse: Block access to folio overlimit fuse: fix fuseblk i_blkbits for iomap partial writes fuse: reflect cached blocksize if blocksize was changed fuse: prevent overflow in copy_file_range return value fuse: check if copy_file_range() returns larger than requested size fuse: do not allow mapping a non-regular backing file coredump: don't pointlessly check and spew warnings fs: fix indentation style block: don't silently ignore metadata for sync read/write fs: add a FMODE_ flag to indicate IOCB_HAS_METADATA availability Please enter a commit message to explain why this merge is necessary, especially if it merges an updated upstream into a topic branch.	2025-09-08 07:53:01 -07:00
Thorsten Blum	7b0604d77a	io_uring: Replace kzalloc() + copy_from_user() with memdup_user() Replace kzalloc() followed by copy_from_user() with memdup_user() to improve and simplify io_probe(). No functional changes intended. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-08 08:21:36 -06:00
Pavel Begunkov	c265ae75f9	io_uring: introduce io_uring querying There are many parameters users might want to query about io_uring like available request types or the ring sizes. This patch introduces an interface for such slow path queries. It was written with several requirements in mind: - Can be used with or without an io_uring instance. Asking for supported setup flags before creating an instance as well as querying info about an already created ring are valid use cases. - Should be moderately fast. For example, users might use it to periodically retrieve ring attributes at runtime. As a consequence, it should be able to query multiple attributes in a single syscall. - Backward and forward compatible. - Should be reasonably easy to use. - Reduce the kernel code size for introducing new query types. It's implemented as a new registration opcode IORING_REGISTER_QUERY. The user passes one or more query structures linked together, each represented by struct io_uring_query_hdr. The header stores common control fields needed for processing and points to query type specific information. The header contains - The query type - The result field, which on return contains the error code for the query - Pointer to the query type specific information - The size of the query structure. The kernel will only populate up to the size, which helps with backward compatibility. The kernel can also reduce the size, so if the current kernel is older than the interface the user tries to use, it'll get only the supported bits. - next_entry field is used to chain multiple queries. Apart from common registration syscall failures, it can only immediately return an error code in case when the headers are incorrect or any other addresses are invalid. That usually means that the userspace doesn't use the API right and should be corrected. All query type specific errors are returned in the header's result field. As an example, the patch adds a single query type for now, i.e. IO_URING_QUERY_OPCODES, which tells what register / request / etc. opcodes are supported, but there are particular plans to extend it. Note: there is a request probing interface via IORING_REGISTER_PROBE, but it's a mess. It requires the user to create a ring first, it only works for requests, and requires dynamic allocations. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-08 08:06:37 -06:00
Pavel Begunkov	63805d0a9b	io_uring: add macros for avaliable flags Add constants for supported setup / request / feature flags as well as the feature mask. They'll be used in the next patch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-08 08:06:37 -06:00
Pavel Begunkov	da8bc3c81c	io_uring: add helper for *REGISTER_SEND_MSG_RING Move handling of IORING_REGISTER_SEND_MSG_RING into a separate function in preparation to growing io_uring_register_blind(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-08 08:06:37 -06:00
Caleb Sander Mateos	c2685729fa	io_uring: remove WRITE_ONCE() in io_uring_create() There's no need to use WRITE_ONCE() to set ctx->submitter_task in io_uring_create() since no other task can access the io_ring_ctx until a file descriptor is associated with it. So use a normal assignment instead of WRITE_ONCE(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250904161223.2600435-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-04 17:22:24 -06:00
Jakub Kicinski	5ef04a7b06	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.17-rc5). No conflicts. Adjacent changes: include/net/sock.h `c51613fa27` ("net: add sk->sk_drop_counters") `5d6b58c932` ("net: lockless sock_i_ino()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-04 13:33:00 -07:00
Caleb Sander Mateos	dd386b0d5e	io_uring/uring_cmd: correct io_uring_cmd_done() ret type io_uring_cmd_done() takes the result code for the CQE as a ssize_t ret argument. However, the CQE res field is a s32 value, as is the argument to io_req_set_res(). To clarify that only s32 values can be faithfully represented without truncation, change io_uring_cmd_done()'s ret argument type to s32. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250902012609.1513123-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-03 17:34:36 -06:00
Caleb Sander Mateos	df3a7762ee	io_uring/uring_cmd: add io_uring_cmd_tw_t type alias Introduce a function pointer type alias io_uring_cmd_tw_t for the uring_cmd task work callback. This avoids repeating the signature in several places. Also name both arguments to the callback to clarify what they represent. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250902160657.1726828-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-02 19:21:12 -06:00
Caleb Sander Mateos	8b9c9a2e7d	io_uring/register: drop redundant submitter_task check For IORING_SETUP_SINGLE_ISSUER io_ring_ctx's, io_register_resize_rings() checks that the current task is the ctx's submitter_task. However, its caller __io_uring_register() already checks this. Drop the redundant check in io_register_resize_rings(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250902215108.1925105-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-02 19:20:24 -06:00
Jens Axboe	37500634d0	io_uring/net: correct type for min_not_zero() cast The kernel test robot reports that after a recent change, the signedness of a min_not_zero() compare is now incorrect. Fix that up and cast to the right type. Fixes: `429884ff35` ("io_uring/kbuf: use struct io_br_sel for multiple buffers picking") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202509020426.WJtrdwOU-lkp@intel.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-02 05:19:42 -06:00
Christian Brauner	e23654f5b1	Merge tag 'fuse-fixes-6.17-rc5' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse into vfs.fixes fuse fixes for 6.17-rc5 * tag 'fuse-fixes-6.17-rc5' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (6 commits) fuse: Block access to folio overlimit fuse: fix fuseblk i_blkbits for iomap partial writes fuse: reflect cached blocksize if blocksize was changed fuse: prevent overflow in copy_file_range return value fuse: check if copy_file_range() returns larger than requested size fuse: do not allow mapping a non-regular backing file Link: https://lore.kernel.org/CAJfpeguEVMMyw_zCb+hbOuSxdE2Z3Raw=SJsq=Y56Ae6dn2W3g@mail.gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-01 12:48:28 +02:00
Jakub Kicinski	d23ad54de7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.17-rc4). No conflicts. Adjacent changes: drivers/net/ethernet/intel/idpf/idpf_txrx.c `02614eee26` ("idpf: do not linearize big TSO packets") `6c4e684802` ("idpf: remove obsolete stashing code") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 11:48:01 -07:00
Dragos Tatulea	59b8b32ac8	io_uring/zcrx: add support for custom DMA devices Use the new API for getting a DMA device for a specific netdev queue. This patch will allow io_uring zero-copy rx to work with devices where the DMA device is not stored in the parent device. mlx5 SFs are an example of such a device. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/20250827144017.1529208-4-dtatulea@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-28 16:05:31 -07:00
Jens Axboe	98b6fa62c8	io_uring/kbuf: always use READ_ONCE() to read ring provided buffer lengths Since the buffers are mapped from userspace, it is prudent to use READ_ONCE() to read the value into a local variable, and use that for any other actions taken. Having a stable read of the buffer length avoids worrying about it changing after checking, or being read multiple times. Similarly, the buffer may well change in between it being picked and being committed. Ensure the looping for incremental ring buffer commit stops if it hits a zero sized buffer, as no further progress can be made at that point. Fixes: `ae98dbf43d` ("io_uring/kbuf: add support for incremental buffer consumption") Link: https://lore.kernel.org/io-uring/tencent_000C02641F6250C856D0C26228DE29A3D30A@qq.com/ Reported-by: Qingyue Zhang <chunzhennn@qq.com> Reported-by: Suoxing Zhang <aftern00n@qq.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-28 05:48:34 -06:00
Jens Axboe	4c0b26e23c	io_uring: add async data clear/free helpers Futex recently had an issue where it mishandled how ->async_data and REQ_F_ASYNC_DATA is handled. To avoid future issues like that, add a set of helpers that either clear or clear-and-free the async data assigned to a struct io_kiocb. Convert existing manual handling of that to use the helpers. No intended functional changes in this patch. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-27 11:24:25 -06:00
Jens Axboe	c986f7586b	io_uring/zcrx: add support for IORING_SETUP_CQE_MIXED zcrx currently requires the ring to be set up with fixed 32b CQEs, allow it to use IORING_SETUP_CQE_MIXED as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-27 11:24:22 -06:00
Jens Axboe	1e81bf1414	io_uring/uring_cmd: add support for IORING_SETUP_CQE_MIXED Certain users of uring_cmd currently require fixed 32b CQE support, which is propagated through IO_URING_F_CQE32. Allow IORING_SETUP_CQE_MIXED to cover that case as well, so not all CQEs posted need to be 32b in size. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-27 11:24:15 -06:00
Jens Axboe	806ecb209a	io_uring/nop: add support for IORING_SETUP_CQE_MIXED This adds support for setting IORING_NOP_CQE32 as a flag for a NOP command, in which case a 32b CQE will be posted rather than a regular one. This is the default if the ring has been setup with IORING_SETUP_CQE32. If the ring has been setup with IORING_SETUP_CQE_MIXED, then 16b CQEs will be posted without this flag set, and 32b CQEs if this flag is set. For the latter case, sqe->off is what will be posted as cqe->big_cqe[0] and sqe->addr is what will be posted as cqe->big_cqe[1]. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-27 11:24:15 -06:00
Jens Axboe	e26dca67fd	io_uring: add support for IORING_SETUP_CQE_MIXED Normal rings support 16b CQEs for posting completions, while certain features require the ring to be configured with IORING_SETUP_CQE32, as they need to convey more information per completion. This, in turn, makes ALL the CQEs be 32b in size. This is somewhat wasteful and inefficient, particularly when only certain CQEs need to be of the bigger variant. This adds support for setting up a ring with mixed CQE sizes, using IORING_SETUP_CQE_MIXED. When set up in this mode, CQEs posted to the ring may be either 16b or 32b in size. If a CQE is 32b in size, then IORING_CQE_F_32 is set in the CQE flags to indicate that this is the case. If this flag isn't set, the CQE is the normal 16b variant. CQEs on these types of mixed rings may also have IORING_CQE_F_SKIP set. This can happen if the ring is one (small) CQE entry away from wrapping, and an attempt is made to post a 32b CQE. As CQEs must be contiguous in the CQ ring, a 32b CQE cannot wrap the ring. For this case, a single dummy CQE is posted with the SKIP flag set. The application should simply ignore those. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-27 11:23:57 -06:00
Qingyue Zhang	c64eff368a	io_uring/kbuf: fix signedness in this_len calculation When importing and using buffers, buf->len is considered unsigned. However, buf->len is converted to signed int when committing. This can lead to unexpected behavior if the buffer is large enough to be interpreted as a negative value. Make min_t calculation unsigned. Fixes: `ae98dbf43d` ("io_uring/kbuf: add support for incremental buffer consumption") Co-developed-by: Suoxing Zhang <aftern00n@qq.com> Signed-off-by: Suoxing Zhang <aftern00n@qq.com> Signed-off-by: Qingyue Zhang <chunzhennn@qq.com> Link: https://lore.kernel.org/r/tencent_4DBB3674C0419BEC2C0C525949DA410CA307@qq.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-27 08:39:44 -06:00
Jens Axboe	82ceb7fcc5	io_uring/fdinfo: handle mixed sized CQEs Ensure that the CQ ring iteration handles differently sized CQEs, not just a fixed 16b or 32b size per ring. These CQEs aren't possible just yet, but prepare the fdinfo CQ ring dumping for handling them. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-24 11:41:13 -06:00
Caleb Sander Mateos	e5c717e795	io_uring/cmd: consolidate REQ_F_BUFFER_SELECT checks io_uring_cmd_prep() checks that REQ_F_BUFFER_SELECT is set in the io_kiocb's flags iff IORING_URING_CMD_MULTISHOT is set in the SQE's uring_cmd_flags. Consolidate the IORING_URING_CMD_MULTISHOT and !IORING_URING_CMD_MULTISHOT branches into a single check that the IORING_URING_CMD_MULTISHOT flag matches the REQ_F_BUFFER_SELECT flag. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250821163308.977915-4-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-24 11:41:12 -06:00
Caleb Sander Mateos	3484f530f8	io_uring/cmd: deduplicate uring_cmd_flags checks io_uring_cmd_prep() currently has two checks for whether IORING_URING_CMD_FIXED and IORING_URING_CMD_MULTISHOT are both set in uring_cmd_flags. Remove the second check. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250821163308.977915-3-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-24 11:41:12 -06:00
Ming Lei	620a50c927	io_uring: uring_cmd: add multishot support Add UAPI flag IORING_URING_CMD_MULTISHOT for supporting multishot uring_cmd operations with provided buffer. This enables drivers to post multiple completion events from a single uring_cmd submission, which is useful for: - Notifying userspace of device events (e.g., interrupt handling) - Supporting devices with multiple event sources (e.g., multi-queue devices) - Avoiding the need for device poll() support when events originate from multiple sources device-wide The implementation adds two new APIs: - io_uring_cmd_select_buffer(): selects a buffer from the provided buffer group for multishot uring_cmd - io_uring_mshot_cmd_post_cqe(): posts a CQE after event data is pushed to the provided buffer Multishot uring_cmd must be used with buffer select (IOSQE_BUFFER_SELECT) and is mutually exclusive with IORING_URING_CMD_FIXED for now. The ublk driver will be the first user of this functionality: https://github.com/ming1/linux/commits/ublk-devel/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250821040210.1152145-3-ming.lei@redhat.com [axboe: fold in fix for !CONFIG_IO_URING] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-24 11:41:12 -06:00
Ming Lei	d589bcddaa	io-uring: move `struct io_br_sel` into io_uring_types.h Move `struct io_br_sel` into io_uring_types.h and prepare for supporting provided buffer on uring_cmd. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250821040210.1152145-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-24 11:41:12 -06:00
Jens Axboe	fe524b0684	io_uring/kbuf: check for ring provided buffers first in recycling This is the most likely of paths if a provided buffer is used, so offer it up first and push the legacy buffers to later. Link: https://lore.kernel.org/r/20250821020750.598432-14-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-24 11:41:12 -06:00
Jens Axboe	e973837b54	io_uring: remove async/poll related provided buffer recycles These aren't necessary anymore, get rid of them. Link: https://lore.kernel.org/r/20250821020750.598432-13-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-24 11:41:12 -06:00
Jens Axboe	5fda512554	io_uring/kbuf: switch to storing struct io_buffer_list locally Currently the buffer list is stored in struct io_kiocb. The buffer list can be of two types: 1) Classic/legacy buffer list. These don't need to get referenced after a buffer pick, and hence storing them in struct io_kiocb is perfectly fine. 2) Ring provided buffer lists. These DO need to be referenced after the initial buffer pick, as they need to get consumed later on. This can be either just incrementing the head of the ring, or it can be consuming parts of a buffer if incremental buffer consumptions has been configured. For case 2, io_uring needs to be careful not to access the buffer list after the initial pick-and-execute context. The core does recycling of these, but it's easy to make a mistake, because it's stored in the io_kiocb which does persist across multiple execution contexts. Either because it's a multishot request, or simply because it needed some kind of async trigger (eg poll) for retry purposes. Add a struct io_buffer_list to struct io_br_sel, which is always on stack for the various users of it. This prevents the buffer list from leaking outside of that execution context, and additionally it enables kbuf to not even pass back the struct io_buffer_list if the given context isn't appropriately locked already. This doesn't fix any bugs, it's simply a defensive measure to prevent any issues with reuse of a buffer list. Link: https://lore.kernel.org/r/20250821020750.598432-12-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-24 11:41:12 -06:00
Jens Axboe	461382a51f	io_uring/net: use struct io_br_sel->val as the send finish value Currently a pointer is passed in to the 'ret' in the send mshot handler, but since we already have a value field in io_br_sel, just use that. This is also in preparation for needing to pass in struct io_br_sel to io_send_finish() anyway. Link: https://lore.kernel.org/r/20250821020750.598432-11-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-24 11:41:12 -06:00
Jens Axboe	58d8150918	io_uring/net: use struct io_br_sel->val as the recv finish value Currently a pointer is passed in to the 'ret' in the receive handlers, but since we already have a value field in io_br_sel, just use that. This is also in preparation for needing to pass in struct io_br_sel to io_recv_finish() anyway. Link: https://lore.kernel.org/r/20250821020750.598432-10-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-24 11:41:12 -06:00
Jens Axboe	429884ff35	io_uring/kbuf: use struct io_br_sel for multiple buffers picking The networking side uses bundles, which is picking multiple buffers at the same time. Pass in struct io_br_sel to those helpers. Link: https://lore.kernel.org/r/20250821020750.598432-9-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-24 11:41:12 -06:00
Jens Axboe	d8e1dec2f8	io_uring/rw: recycle buffers manually for non-mshot reads The mshot side of reads already does this, but the regular read path does not. This leads to needing recycling checks sprinkled in various spots in the "go async" path, like arming poll. In preparation for getting rid of those, ensure that read recycles appropriately. Link: https://lore.kernel.org/r/20250821020750.598432-8-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-24 11:41:12 -06:00
Jens Axboe	ab6559bdbb	io_uring/kbuf: introduce struct io_br_sel Rather than return addresses directly from buffer selection, add a struct around it. No functional changes in this patch, it's in preparation for storing more buffer related information locally, rather than in struct io_kiocb. Link: https://lore.kernel.org/r/20250821020750.598432-7-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-24 11:41:12 -06:00
Jens Axboe	1b5add75d7	io_uring/kbuf: pass in struct io_buffer_list to commit/recycle helpers Rather than have this implied being in the io_kiocb, pass it in directly so it's immediately obvious where these users of ->buf_list are coming from. Link: https://lore.kernel.org/r/20250821020750.598432-6-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-24 11:41:12 -06:00
Jens Axboe	b22743f29b	io_uring/net: clarify io_recv_buf_select() return value It returns 0 on success, less than zero on error. Link: https://lore.kernel.org/r/20250821020750.598432-5-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-24 11:41:12 -06:00
Jens Axboe	15ba5e51e6	io_uring/net: don't use io_net_kbuf_recyle() for non-provided cases A previous commit used io_net_kbuf_recyle() for any network helper that did IO and needed partial retry. However, that's only needed if the opcode does buffer selection, which isn't supported for sendzc, sendmsg_zc, or sendmsg. Just remove them - they don't do any harm, but it is a bit confusing when reading the code. Link: https://lore.kernel.org/r/20250821020750.598432-4-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-24 11:41:11 -06:00
Jens Axboe	5e73b402cb	io_uring/kbuf: drop 'issue_flags' from io_put_kbuf(s)() arguments Picking multiple buffers always requires the ring lock to be held across the operation, so there's no need to pass in the issue_flags to io_put_kbufs(). On the single buffer side, if the initial picking of a ring buffer was unlocked, then it will have been committed already. For legacy buffers, no locking is required, as they will simply be freed. Link: https://lore.kernel.org/r/20250821020750.598432-3-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-24 11:41:11 -06:00
Pavel Begunkov	ab3ea6eac5	io_uring/zctx: check chained notif contexts Send zc only links ubuf_info for requests coming from the same context. There are some ambiguous syz reports, so let's check the assumption on notification completion. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/fd527d8638203fe0f1c5ff06ff2e1d8fd68f831b.1755179962.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-24 11:41:11 -06:00
Pavel Begunkov	92a96b0a22	io_uring: add request poisoning Poison various request fields on free. __io_req_caches_free() is a slow path, so can be done unconditionally, but gate it on kasan for io_req_add_to_cache(). Note that some fields are logically retained between cache allocations and can't be poisoned in io_req_add_to_cache(). Ideally, it'd be replaced with KASAN'ed caches, but that can't be enabled because of some synchronisation nuances. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7a78e8a7f5be434313c400650b862e36c211b312.1755459452.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-24 11:41:11 -06:00
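
The IORING_SETUP_CQE_MIXED commit above (e26dca67fd) describes a ring whose completions may be 16b or 32b, with a SKIP filler posted when a 32b CQE would otherwise wrap. The following is a minimal userspace sketch of that rule, not the kernel's implementation. Every `demo_` / `DEMO_` name is hypothetical, the flag bit values are placeholders rather than the real IORING_CQE_F_32 / IORING_CQE_F_SKIP values, and placing the extra 16 bytes in the slot directly after the CQE is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Placeholder bits for this sketch only; the real flag values live in the
 * kernel's UAPI header and are NOT reproduced here. */
#define DEMO_CQE_F_SKIP (1U << 0)
#define DEMO_CQE_F_32   (1U << 1)

#define DEMO_CQ_ENTRIES 8U                      /* power of two, counted in 16b slots */
#define DEMO_CQ_MASK    (DEMO_CQ_ENTRIES - 1)

/* One 16b slot. A 32b CQE uses this slot plus the one right after it. */
struct demo_cqe {
    uint64_t user_data;
    int32_t  res;
    uint32_t flags;
};
static_assert(sizeof(struct demo_cqe) == 16, "slot must be 16 bytes");

struct demo_cq {
    struct demo_cqe slots[DEMO_CQ_ENTRIES];
    uint32_t head;                              /* consumer index, free-running */
    uint32_t tail;                              /* producer index, free-running */
};

struct demo_stats { unsigned small, big, skipped; uint64_t big_sum; };

/* Post a 16b CQE. Returns 0 on success, -1 if the ring is full. */
static int demo_post16(struct demo_cq *cq, uint64_t user_data, int32_t res,
                       uint32_t flags)
{
    if (cq->tail - cq->head >= DEMO_CQ_ENTRIES)
        return -1;
    struct demo_cqe *c = &cq->slots[cq->tail & DEMO_CQ_MASK];
    c->user_data = user_data;
    c->res = res;
    c->flags = flags;
    cq->tail++;
    return 0;
}

/* Post a 32b CQE. A 32b CQE must be contiguous, so if only the last slot
 * before the wrap point is left, fill it with a SKIP entry first. */
static int demo_post32(struct demo_cq *cq, uint64_t user_data, int32_t res,
                       uint64_t extra0, uint64_t extra1)
{
    int pad = (cq->tail & DEMO_CQ_MASK) == DEMO_CQ_MASK;
    uint32_t need = pad ? 3 : 2;

    if (cq->tail - cq->head + need > DEMO_CQ_ENTRIES)
        return -1;
    if (pad)
        demo_post16(cq, 0, 0, DEMO_CQE_F_SKIP);

    uint32_t idx = cq->tail & DEMO_CQ_MASK;     /* never the last slot here */
    uint64_t extra[2] = { extra0, extra1 };
    cq->slots[idx].user_data = user_data;
    cq->slots[idx].res = res;
    cq->slots[idx].flags = DEMO_CQE_F_32;
    memcpy(&cq->slots[idx + 1], extra, sizeof(extra));
    cq->tail += 2;
    return 0;
}

/* Consumer: ignore SKIP entries, step over two slots for a 32b CQE and
 * over one slot for a 16b CQE. */
static void demo_reap(struct demo_cq *cq, struct demo_stats *st)
{
    while (cq->head != cq->tail) {
        const struct demo_cqe *c = &cq->slots[cq->head & DEMO_CQ_MASK];

        if (c->flags & DEMO_CQE_F_SKIP) {
            st->skipped++;
            cq->head++;
        } else if (c->flags & DEMO_CQE_F_32) {
            uint64_t big[2];
            memcpy(big, c + 1, sizeof(big));    /* payload lives in the next slot */
            st->big++;
            st->big_sum += big[0] + big[1];
            cq->head += 2;
        } else {
            st->small++;
            cq->head++;
        }
    }
}

/* Scenario: park the tail on the last slot, then post one 32b CQE.
 * Returns the number of SKIP fillers the consumer had to ignore. */
static unsigned demo_wrap_skips(void)
{
    struct demo_cq cq = { 0 };
    struct demo_stats st = { 0 };

    for (uint64_t i = 0; i < DEMO_CQ_ENTRIES - 1; i++)
        demo_post16(&cq, i, 0, 0);
    demo_reap(&cq, &st);                        /* head == tail == 7 */

    st = (struct demo_stats){ 0 };
    demo_post32(&cq, 100, 0, 11, 22);
    demo_reap(&cq, &st);
    return st.skipped;
}
```

In this sketch `demo_wrap_skips()` exercises the wrap case the commit message describes: the consumer sees one filler entry, discards it, and then reads the 32b completion from the start of the ring.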


1825 Commits