A manual application of this patch resulted in a typo for the stub
function __io_uring_cmd_do_in_task(), for the case where CONFIG_IO_URING
isn't true. Fix that up.
Reported-by: Klara Modin <klarasmodin@gmail.com>
Fixes: df3a7762ee ("io_uring/uring_cmd: add io_uring_cmd_tw_t type alias")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are many parameters users might want to query about io_uring like
available request types or the ring sizes. This patch introduces an
interface for such slow path queries.
It was written with several requirements in mind:
- Can be used with or without an io_uring instance. Asking for supported
setup flags before creating an instance as well as qeurying info about
an already created ring are valid use cases.
- Should be moderately fast. For example, users might use it to
periodically retrieve ring attributes at runtime. As a consequence,
it should be able to query multiple attributes in a single syscall.
- Backward and forward compatible.
- Should be reasobably easy to use.
- Reduce the kernel code size for introducing new query types.
It's implemented as a new registration opcode IORING_REGISTER_QUERY.
The user passes one or more query strutctures linked together, each
represented by struct io_uring_query_hdr. The header stores common
control fields needed for processing and points to query type specific
information.
The header contains
- The query type
- The result field, which on return contains the error code for the query
- Pointer to the query type specific information
- The size of the query structure. The kernel will only populate up to
the size, which helps with backward compatibility. The kernel can also
reduce the size, so if the current kernel is older than the inteface
the user tries to use, it'll get only the supported bits.
- next_entry field is used to chain multiple queries.
Apart from common registeration syscall failures, it can only immediately
return an error code in case when the headers are incorrect or any
other addresses and invalid. That usually mean that the userspace
doesn't use the API right and should be corrected. All query type
specific errors are returned in the header's result field.
As an example, the patch adds a single query type for now, i.e.
IO_URING_QUERY_OPCODES, which tells what register / request / etc.
opcodes are supported, but there are particular plans to extend it.
Note: there is a request probing interface via IORING_REGISTER_PROBE,
but it's a mess. It requires the user to create a ring first, it only
works for requests, and requires dynamic allocations.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add constants for supported setup / request / feature flags as well as
the feature mask. They'll be used in the next patch.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move handling of IORING_REGISTER_SEND_MSG_RING into a separate function
in preparation to growing io_uring_register_blind().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring_cmd_done() takes the result code for the CQE as a ssize_t ret
argument. However, the CQE res field is a s32 value, as is the argument
to io_req_set_res(). To clarify that only s32 values can be faithfully
represented without truncation, change io_uring_cmd_done()'s ret
argument type to s32.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250902012609.1513123-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Futex recently had an issue where it mishandled how ->async_data and
REQ_F_ASYNC_DATA is handled. To avoid future issues like that, add a set
of helpers that either clear or clear-and-free the async data assigned
to a struct io_kiocb.
Convert existing manual handling of that to use the helpers. No intended
functional changes in this patch.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
zcrx currently requires the ring to be set up with fixed 32b CQEs,
allow it to use IORING_SETUP_CQE_MIXED as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Certain users of uring_cmd currently require fixed 32b CQE support,
which is propagated through IO_URING_F_CQE32. Allow
IORING_SETUP_CQE_MIXED to cover that case as well, so not all CQEs
posted need to be 32b in size.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds support for setting IORING_NOP_CQE32 as a flag for a NOP
command, in which case a 32b CQE will be posted rather than a regular
one. This is the default if the ring has been setup with
IORING_SETUP_CQE32. If the ring has been setup with
IORING_SETUP_CQE_MIXED, then 16b CQEs will be posted without this flag
set, and 32b CQEs if this flag is set. For the latter case, sqe->off is
what will be posted as cqe->big_cqe[0] and sqe->addr is what will be
posted as cqe->big_cqe[1].
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Normal rings support 16b CQEs for posting completions, while certain
features require the ring to be configured with IORING_SETUP_CQE32, as
they need to convey more information per completion. This, in turn,
makes ALL the CQEs be 32b in size. This is somewhat wasteful and
inefficient, particularly when only certain CQEs need to be of the
bigger variant.
This adds support for setting up a ring with mixed CQE sizes, using
IORING_SETUP_CQE_MIXED. When setup in this mode, CQEs posted to the ring
may be either 16b or 32b in size. If a CQE is 32b in size, then
IORING_CQE_F_32 is set in the CQE flags to indicate that this is the
case. If this flag isn't set, the CQE is the normal 16b variant.
CQEs on these types of mixed rings may also have IORING_CQE_F_SKIP set.
This can happen if the ring is one (small) CQE entry away from wrapping,
and an attempt is made to post a 32b CQE. As CQEs must be contigious in
the CQ ring, a 32b CQE cannot wrap the ring. For this case, a single
dummy CQE is posted with the SKIP flag set. The application should
simply ignore those.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Check for IORING_CQE_F_32 as well, not just if the ring was setup with
IORING_SETUP_CQE32 to only support big CQEs.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure that the CQ ring iteration handles differently sized CQEs, not
just a fixed 16b or 32b size per ring. These CQEs aren't possible just
yet, but prepare the fdinfo CQ ring dumping for handling them.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds the CQE flags related to supporting a mixed CQ ring mode, where
both normal (16b) and big (32b) CQEs may be posted.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring_cmd_prep() checks that REQ_F_BUFFER_SELECT is set in the
io_kiocb's flags iff IORING_URING_CMD_MULTISHOT is set in the SQE's
uring_cmd_flags. Consolidate the IORING_URING_CMD_MULTISHOT and
!IORING_URING_CMD_MULTISHOT branches into a single check that the
IORING_URING_CMD_MULTISHOT flag matches the REQ_F_BUFFER_SELECT flag.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250821163308.977915-4-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add UAPI flag IORING_URING_CMD_MULTISHOT for supporting multishot
uring_cmd operations with provided buffer.
This enables drivers to post multiple completion events from a single
uring_cmd submission, which is useful for:
- Notifying userspace of device events (e.g., interrupt handling)
- Supporting devices with multiple event sources (e.g., multi-queue devices)
- Avoiding the need for device poll() support when events originate
from multiple sources device-wide
The implementation adds two new APIs:
- io_uring_cmd_select_buffer(): selects a buffer from the provided
buffer group for multishot uring_cmd
- io_uring_mshot_cmd_post_cqe(): posts a CQE after event data is
pushed to the provided buffer
Multishot uring_cmd must be used with buffer select (IOSQE_BUFFER_SELECT)
and is mutually exclusive with IORING_URING_CMD_FIXED for now.
The ublk driver will be the first user of this functionality:
https://github.com/ming1/linux/commits/ublk-devel/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250821040210.1152145-3-ming.lei@redhat.com
[axboe: fold in fix for !CONFIG_IO_URING]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently the buffer list is stored in struct io_kiocb. The buffer list
can be of two types:
1) Classic/legacy buffer list. These don't need to get referenced after
a buffer pick, and hence storing them in struct io_kiocb is perfectly
fine.
2) Ring provided buffer lists. These DO need to be referenced after the
initial buffer pick, as they need to get consumed later on. This can
be either just incrementing the head of the ring, or it can be
consuming parts of a buffer if incremental buffer consumptions has
been configured.
For case 2, io_uring needs to be careful not to access the buffer list
after the initial pick-and-execute context. The core does recycling of
these, but it's easy to make a mistake, because it's stored in the
io_kiocb which does persist across multiple execution contexts. Either
because it's a multishot request, or simply because it needed some kind
of async trigger (eg poll) for retry purposes.
Add a struct io_buffer_list to struct io_br_sel, which is always on
stack for the various users of it. This prevents the buffer list from
leaking outside of that execution context, and additionally it enables
kbuf to not even pass back the struct io_buffer_list if the given
context isn't appropriately locked already.
This doesn't fix any bugs, it's simply a defensive measure to prevent
any issues with reuse of a buffer list.
Link: https://lore.kernel.org/r/20250821020750.598432-12-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently a pointer is passed in to the 'ret' in the send mshot handler,
but since we already have a value field in io_br_sel, just use that.
This is also in preparation for needing to pass in struct io_br_sel
to io_send_finish() anyway.
Link: https://lore.kernel.org/r/20250821020750.598432-11-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently a pointer is passed in to the 'ret' in the receive handlers,
but since we already have a value field in io_br_sel, just use that.
This is also in preparation for needing to pass in struct io_br_sel
to io_recv_finish() anyway.
Link: https://lore.kernel.org/r/20250821020750.598432-10-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The mshot side of reads already does this, but the regular read path
does not. This leads to needing recycling checks sprinkled in various
spots in the "go async" path, like arming poll. In preparation for
getting rid of those, ensure that read recycles appropriately.
Link: https://lore.kernel.org/r/20250821020750.598432-8-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than return addresses directly from buffer selection, add a
struct around it. No functional changes in this patch, it's in
preparation for storing more buffer related information locally, rather
than in struct io_kiocb.
Link: https://lore.kernel.org/r/20250821020750.598432-7-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit used io_net_kbuf_recyle() for any network helper that
did IO and needed partial retry. However, that's only needed if the
opcode does buffer selection, which isnt support for sendzc, sendmsg_zc,
or sendmsg. Just remove them - they don't do any harm, but it is a bit
confusing when reading the code.
Link: https://lore.kernel.org/r/20250821020750.598432-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Picking multiple buffers always requires the ring lock to be held across
the operation, so there's no need to pass in the issue_flags to
io_put_kbufs(). On the single buffer side, if the initial picking of a
ring buffer was unlocked, then it will have been committed already. For
legacy buffers, no locking is required, as they will simply be freed.
Link: https://lore.kernel.org/r/20250821020750.598432-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Poison various request fields on free. __io_req_caches_free() is a slow
path, so can be done unconditionally, but gate it on kasan for
io_req_add_to_cache(). Note that some fields are logically retained
between cache allocations and can't be poisoned in
io_req_add_to_cache().
Ideally, it'd be replaced with KASAN'ed caches, but that can't be
enabled because of some synchronisation nuances.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7a78e8a7f5be434313c400650b862e36c211b312.1755459452.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull perf fix from Borislav Petkov:
- Fix a case where the events throttling logic operates on inactive
events
* tag 'perf_urgent_for_v6.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Avoid undefined behavior from stopping/starting inactive events
Pull x86 fixes from Borislav Petkov:
- Fix the GDS mitigation detection on some machines after the recent
attack vectors conversion
- Filter out the invalid machine reset reason value -1 when running as
a guest as in such cases the reason why the machine was rebooted does
not make a whole lot of sense
- Init the resource control machinery on Hygon hw in order to avoid a
division by zero and to actually enable the feature on hw which
supports it
* tag 'x86_urgent_for_v6.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bugs: Fix GDS mitigation selecting when mitigation is off
x86/CPU/AMD: Ignore invalid reset reason value
x86/cpu/hygon: Add missing resctrl_cpu_detect() in bsp_init helper
Pull modules fix from Daniel Gomez:
"This includes a fix part of the KSPP (Kernel Self Protection Project)
to replace the deprecated and unsafe strcpy() calls in the kernel
parameter string handler and sysfs parameters for built-in modules.
Single commit, no functional changes"
* tag 'modules-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
params: Replace deprecated strcpy() with strscpy() and memcpy()
Pull char/misc/iio fixes from Greg KH:
"Here are a small number of char/misc/iio and other driver fixes for
6.17-rc3. Included in here are:
- IIO driver bugfixes for reported issues
- bunch of comedi driver fixes
- most core bugfix
- fpga driver bugfix
- cdx driver bugfix
All of these have been in linux-next this week with no reported
issues"
* tag 'char-misc-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
most: core: Drop device reference after usage in get_channel()
comedi: Make insn_rw_emulate_bits() do insn->n samples
comedi: Fix use of uninitialized memory in do_insn_ioctl() and do_insnlist_ioctl()
comedi: pcl726: Prevent invalid irq number
cdx: Fix off-by-one error in cdx_rpmsg_probe()
fpga: zynq_fpga: Fix the wrong usage of dma_map_sgtable()
iio: pressure: bmp280: Use IS_ERR() in bmp280_common_probe()
iio: light: as73211: Ensure buffer holes are zeroed
iio: adc: rzg2l_adc: Set driver data before enabling runtime PM
iio: adc: rzg2l: Cleanup suspend/resume path
iio: adc: ad7380: fix missing max_conversion_rate_hz on adaq4381-4
iio: adc: bd79124: Add GPIOLIB dependency
iio: imu: inv_icm42600: change invalid data error to -EBUSY
iio: adc: ad7124: fix channel lookup in syscalib functions
iio: temperature: maxim_thermocouple: use DMA-safe buffer for spi_read()
iio: adc: ad7173: prevent scan if too many setups requested
iio: proximity: isl29501: fix buffered read on big-endian systems
iio: accel: sca3300: fix uninitialized iio scan data
Pull USB fixes from Greg KH:
"Here are some small USB driver fixes for 6.17-rc3 to resolve a bunch
of reported issues. Included in here are:
- typec driver fixes
- dwc3 new device id
- dwc3 driver fixes
- new usb-storage driver quirks
- xhci driver fixes
- other tiny USB driver fixes to resolve bugs
All of these have been in linux-next this week with no reported issues"
* tag 'usb-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: xhci: fix host not responding after suspend and resume
usb: xhci: Fix slot_id resource race conflict
usb: typec: fusb302: Revert incorrect threaded irq fix
USB: core: Update kerneldoc for usb_hcd_giveback_urb()
usb: typec: maxim_contaminant: re-enable cc toggle if cc is open and port is clean
usb: typec: maxim_contaminant: disable low power mode when reading comparator values
usb: dwc3: Remove WARN_ON for device endpoint command timeouts
USB: storage: Ignore driver CD mode for Realtek multi-mode Wi-Fi dongles
usb: storage: realtek_cr: Use correct byte order for bcs->Residue
usb: chipidea: imx: improve usbmisc_imx7d_pullup()
kcov, usb: Don't disable interrupts in kcov_remote_start_usb_softirq()
usb: dwc3: pci: add support for the Intel Wildcat Lake
usb: dwc3: Ignore late xferNotReady event to prevent halt timeout
USB: storage: Add unusual-devs entry for Novatek NTK96550-based camera
usb: core: hcd: fix accessing unmapped memory in SINGLE_STEP_SET_FEATURE test
usb: renesas-xhci: Fix External ROM access timeouts
usb: gadget: tegra-xudc: fix PM use count underflow
usb: quirks: Add DELAY_INIT quick for another SanDisk 3.2Gen1 Flash Drive
Pull tracing fixes from Steven Rostedt:
- Fix rtla and latency tooling pkg-config errors
If libtraceevent and libtracefs is installed, but their corresponding
'.pc' files are not installed, it reports that the libraries are
missing and confuses the developer. Instead, report that the
pkg-config files are missing and should be installed.
- Fix overflow bug of the parser in trace_get_user()
trace_get_user() uses the parsing functions to parse the user space
strings. If the parser fails due to incorrect processing, it doesn't
terminate the buffer with a nul byte. Add a "failed" flag to the
parser that gets set when parsing fails and is used to know if the
buffer is fine to use or not.
- Remove a semicolon that was at an end of a comment line
- Fix register_ftrace_graph() to unregister the pm notifier on error
The register_ftrace_graph() registers a pm notifier but there's an
error path that can exit the function without unregistering it. Since
the function returns an error, it will never be unregistered.
- Allocate and copy ftrace hash for reader of ftrace filter files
When the set_ftrace_filter or set_ftrace_notrace files are open for
read, an iterator is created and sets its hash pointer to the
associated hash that represents filtering or notrace filtering to it.
The issue is that the hash it points to can change while the
iteration is happening. All the locking used to access the tracer's
hashes are released which means those hashes can change or even be
freed. Using the hash pointed to by the iterator can cause UAF bugs
or similar.
Have the read of these files allocate and copy the corresponding
hashes and use that as that will keep them the same while the
iterator is open. This also simplifies the code as opening it for
write already does an allocate and copy, and now that the read is
doing the same, there's no need to check which way it was opened on
the release of the file, and the iterator hash can always be freed.
- Fix function graph to copy args into temp storage
The output of the function graph tracer shows both the entry and the
exit of a function. When the exit is right after the entry, it
combines the two events into one with the output of "function();",
instead of showing:
function() {
}
In order to do this, the iterator descriptor that reads the events
includes storage that saves the entry event while it peaks at the
next event in the ring buffer. The peek can free the entry event so
the iterator must store the information to use it after the peek.
With the addition of function graph tracer recording the args, where
the args are a dynamic array in the entry event, the temp storage
does not save them. This causes the args to be corrupted or even
cause a read of unsafe memory.
Add space to save the args in the temp storage of the iterator.
- Fix race between ftrace_dump and reading trace_pipe
ftrace_dump() is used when a crash occurs where the ftrace buffer
will be printed to the console. But it can also be triggered by
sysrq-z. If a sysrq-z is triggered while a task is reading trace_pipe
it can cause a race in the ftrace_dump() where it checks if the
buffer has content, then it checks if the next event is available,
and then prints the output (regardless if the next event was
available or not). Reading trace_pipe at the same time can cause it
to not be available, and this triggers a WARN_ON in the print. Move
the printing into the check if the next event exists or not
* tag 'trace-v6.17-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Also allocate and copy hash for reading of filter files
ftrace: Fix potential warning in trace_printk_seq during ftrace_dump
fgraph: Copy args in intermediate storage with entry
trace/fgraph: Fix the warning caused by missing unregister notifier
ring-buffer: Remove redundant semicolons
tracing: Limit access to parser->buffer when trace_get_user failed
rtla: Check pkg-config install
tools/latency-collector: Check pkg-config install
Pull driver core fixes from Danilo Krummrich:
- Fix swapped handling of lru_gen and lru_gen_full debugfs files in
vmscan
- Fix debugfs mount options (uid, gid, mode) being silently ignored
- Fix leak of devres action in the unwind path of Devres::new()
- Documentation:
- Expand and fix documentation of (outdated) Device, DeviceContext
and generic driver infrastructure
- Fix C header link of faux device abstractions
- Clarify expected interaction with the security team
- Smooth text flow in the security bug reporting process
documentation
* tag 'driver-core-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
Documentation: smooth the text flow in the security bug reporting process
Documentation: clarify the expected collaboration with security bugs reporters
debugfs: fix mount options not being applied
rust: devres: fix leaking call to devm_add_action()
rust: faux: fix C header link
driver: rust: expand documentation for driver infrastructure
device: rust: expand documentation for Device
device: rust: expand documentation for DeviceContext
mm/vmscan: fix inverted polarity in lru_gen_seq_show()
Pull drm fixes from Dave Airlie:
"Weekly drm fixes. Looks like things did indeed get busier after rc2,
nothing seems too major, but stuff scattered all over the place,
amdgpu, xe, i915, hibmc, rust support code, and other small fixes.
rust:
- drm device memory layout and safety fixes
tests:
- Endianness fixes
gpuvm:
- docs warning fix
panic:
- fix division on 32-bit arm
i915:
- TypeC DP display Fixes
- Silence rpm wakeref asserts on GEN11_GU_MISC_IIR access
- Relocate compression repacking WA for JSL/EHL
xe:
- xe_vm_create fixes
- fix vm bind ioctl double free
amdgpu:
- Replay fixes
- SMU14 fix
- Null check DC fixes
- DCE6 DC fixes
- Misc DC fixes
bridge:
- analogix_dp: devm_drm_bridge_alloc() error handling fix
habanalabs:
- Memory deallocation fix
hibmc:
- modesetting black screen fixes
- fix UAF on irq
- fix leak on i2c failure path
nouveau:
- memory leak fixes
- typos
rockchip:
- Kconfig fix
- register caching fix"
* tag 'drm-fixes-2025-08-23-1' of https://gitlab.freedesktop.org/drm/kernel: (49 commits)
drm/xe: Fix vm_bind_ioctl double free bug
drm/xe: Move ASID allocation and user PT BO tracking into xe_vm_create
drm/xe: Assign ioctl xe file handler to vm in xe_vm_create
drm/i915/gt: Relocate compression repacking WA for JSL/EHL
drm/i915: silence rpm wakeref asserts on GEN11_GU_MISC_IIR access
drm/amd/display: Fix DP audio DTO1 clock source on DCE 6.
drm/amd/display: Fix fractional fb divider in set_pixel_clock_v3
drm/amd/display: Don't print errors for nonexistent connectors
drm/amd/display: Don't warn when missing DCE encoder caps
drm/amd/display: Fill display clock and vblank time in dce110_fill_display_configs
drm/amd/display: Find first CRTC and its line time in dce110_fill_display_configs
drm/amd/display: Adjust DCE 8-10 clock, don't overclock by 15%
drm/amd/display: Don't overclock DCE 6 by 15%
drm/amd/display: Add null pointer check in mod_hdcp_hdcp1_create_session()
drm/amd/display: Fix Xorg desktop unresponsive on Replay panel
drm/amd/display: Avoid a NULL pointer dereference
drm/amdgpu/swm14: Update power limit logic
drm/amd/display: Revert Add HPO encoder support to Replay
drm/i915/icl+/tc: Convert AUX powered WARN to a debug message
drm/i915/lnl+/tc: Use the cached max lane count value
...