Pull io_uring large rx buffer support from Jens Axboe:
"Now that the networking updates are upstream, here's the support for
large buffers for zcrx.
Using larger (bigger than 4K) rx buffers can increase the efficiency of
zcrx. For example, it's been shown that using 32K buffers can decrease
CPU usage by ~30% compared to 4K buffers"
* tag 'for-7.0/io_uring-zcrx-large-buffers-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/zcrx: implement large rx buffer support
Pull block updates from Jens Axboe:
- Support for batch request processing for ublk, improving the
efficiency of the kernel/ublk server communication. This can yield
nice 7-12% performance improvements
- Support for integrity data for ublk
- Various other ublk improvements and additions, including a ton of
selftests additions and updates
- Move the handling of blk-crypto software fallback from below the
block layer to above it. This reduces the complexity of dealing with
bio splitting
- Series fixing a number of potential deadlocks in blk-mq related to
the queue usage counter and writeback throttling and rq-qos debugfs
handling
- Add an async_depth queue attribute, to resolve a performance
regression that's been around for a while related to the scheduler
depth handling
- Only use task_work for IOPOLL completions on NVMe if it is necessary
to do so. An earlier fix for an issue resulted in all these
completions being punted to task_work, to guarantee that completions
were only run for a given io_uring ring when it was local to that
ring. With the new changes, we can detect if it's necessary to use
task_work or not, and avoid it if possible.
- rnbd fixes:
- Fix refcount underflow in device unmap path
- Handle PREFLUSH and NOUNMAP flags properly in protocol
- Fix server-side bi_size for special IOs
- Zero response buffer before use
- Fix trace format for flags
- Add .release to rnbd_dev_ktype
- MD pull requests via Yu Kuai
- Fix raid5_run() to return error when log_init() fails
- Fix IO hang with degraded array with llbitmap
- Fix percpu_ref not resurrected on suspend timeout in llbitmap
- Fix GPF in write_page caused by resize race
- Fix NULL pointer dereference in process_metadata_update
- Fix hang when stopping arrays with metadata through dm-raid
- Fix any_working flag handling in raid10_sync_request
- Refactor sync/recovery code path, improve error handling for
badblocks, and remove unused recovery_disabled field
- Consolidate mddev boolean fields into mddev_flags
- Use mempool to allocate stripe_request_ctx and make sure
max_sectors is not less than io_opt in raid5
- Fix return value of mddev_trylock
- Fix memory leak in raid1_run()
- Add Li Nan as mdraid reviewer
- Move phys_vec definitions to the kernel types, mostly in preparation
for some VFIO and RDMA changes
- Improve the speed for secure erase for some devices
- Various little rust updates
- Various other minor fixes, improvements, and cleanups
* tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
blk-mq: ABI/sysfs-block: fix docs build warnings
selftests: ublk: organize test directories by test ID
block: decouple secure erase size limit from discard size limit
block: remove redundant kill_bdev() call in set_blocksize()
blk-mq: add documentation for new queue attribute async_depth
block, bfq: convert to use request_queue->async_depth
mq-deadline: convert to use request_queue->async_depth
kyber: convert to use request_queue->async_depth
blk-mq: add a new queue sysfs attribute async_depth
blk-mq: factor out a helper blk_mq_limit_depth()
blk-mq-sched: unify elevators checking for async requests
block: convert nr_requests to unsigned int
block: don't use strcpy to copy blockdev name
blk-mq-debugfs: warn about possible deadlock
blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs()
blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos()
blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
blk-rq-qos: fix possible debugfs_mutex deadlock
blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos
blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter
...
Pull io_uring bpf filters from Jens Axboe:
"This adds support for both cBPF filters for io_uring, as well as task
inherited restrictions and filters.
seccomp and io_uring don't play nicely together, as most of the
interesting data to filter on resides somewhat out-of-band, in the
submission queue ring.
As a result, things like containers and systemd that apply seccomp
filters can't filter io_uring operations.
That leaves them with just one choice if filtering is critical -
filter the actual io_uring_setup(2) system call to simply disallow
io_uring. That's rather unfortunate, and it has limited us.
io_uring already has some filtering support. It requires the ring to
be set up in a disabled state, and then a filter set can be applied.
This filter set is completely bi-modal - an opcode is either enabled
or it's not. Once a filter set is registered, the ring can be enabled.
This is very restrictive, and it's not useful at all to systemd or
containers which really want both broader and more specific control.
This first adds support for cBPF filters for opcodes, which enables
tighter control over what exactly a specific opcode may do. As
examples, specific support is added for IORING_OP_OPENAT/OPENAT2,
allowing filtering on resolve flags. And another example is added for
IORING_OP_SOCKET, allowing filtering on domain/type/protocol. These
are both common use cases. cBPF was chosen rather than eBPF, because
the latter is often restricted in containers as well.
These filters are run after the init phase of the request, which allows
filters to even dip into data that is passed in structs in user
memory, as the init side of a request makes that data stable by bringing
it into the kernel. This allows filtering without needing to copy this
data twice, or having filters know about the exact layout of the
user data. The filters get the already copied and sanitized data
passed to them.
On top of that, support is added for per-task filters, meaning that any
ring created by a task that has per-task filters will get those
filters applied when it's created. These filters are inherited across
fork as well. Once a filter has been registered, any further added
filters may only further restrict what operations are permitted.
Filters cannot change the return value of an operation, they can only
permit or deny it based on the contents"
* tag 'io_uring-bpf-restrictions.4-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring: allow registration of per-task restrictions
io_uring: add task fork hook
io_uring/bpf_filter: add ref counts to struct io_bpf_filter
io_uring/bpf_filter: cache lookup table in ctx->bpf_filters
io_uring/bpf_filter: allow filtering on contents of struct open_how
io_uring/net: allow filtering on IORING_OP_SOCKET data
io_uring: add support for BPF filtering for opcode restrictions
Pull io_uring updates from Jens Axboe:
- Clean up the IORING_SETUP_R_DISABLED and submitter task checking,
mostly just in preparation for relaxing the locking for SINGLE_ISSUER
in the future.
- Improve IOPOLL by using a doubly linked list to manage completions.
Previously it was singly linked, which meant that to complete request
N in the chain, requests 0..N-1 had to have completed first. With a doubly
linked list, requests can be completed in whatever order they finish,
rather than needing to wait for a consecutive range to be available.
This reduces latencies.
- Improve the restriction setup and checking. Mostly in preparation for
adding further features on top of that. Coming in a separate pull
request.
- Split out task_work and wait handling into separate files. These are
mostly nicely abstracted already, but still remained in the
io_uring.c file which is on the larger side.
- Use GFP_KERNEL_ACCOUNT in a few more spots, where appropriate.
- Ensure even the idle io-wq worker exits if a task no longer has any
rings open.
- Add support for a non-circular submission queue.
By default, the SQ ring keeps moving around, even if only a few
entries are used for each submission. This can be wasteful in terms
of cachelines.
If IORING_SETUP_SQ_REWIND is set for the ring when created, each
submission will start at offset 0 instead of where we last left off
doing submissions.
- Various little cleanups
* tag 'for-7.0/io_uring-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (30 commits)
io_uring/kbuf: fix memory leak if io_buffer_add_list fails
io_uring: Add SPDX id lines to remaining source files
io_uring: allow io-wq workers to exit when unused
io_uring/io-wq: add exit-on-idle state
io_uring/net: don't continue send bundle if poll was required for retry
io_uring/rsrc: use GFP_KERNEL_ACCOUNT consistently
io_uring/futex: use GFP_KERNEL_ACCOUNT for futex data allocation
io_uring/io-wq: handle !sysctl_hung_task_timeout_secs
io_uring: fix bad indentation for setup flags if statement
io_uring/rsrc: take unsigned index in io_rsrc_node_lookup()
io_uring: introduce non-circular SQ
io_uring: split out CQ waiting code into wait.c
io_uring: split out task work code into tw.c
io_uring/io-wq: don't trigger hung task for syzbot craziness
io_uring: add IO_URING_EXIT_WAIT_MAX definition
io_uring/sync: validate passed in offset
io_uring/eventfd: remove unused ctx->evfd_last_cq_tail member
io_uring/timeout: annotate data race in io_flush_timeouts()
io_uring/uring_cmd: explicitly disallow cancelations for IOPOLL
io_uring: fix IOPOLL with passthrough I/O
...
Pull vfs 'struct filename' updates from Al Viro:
"[Mostly] sanitize struct filename handling"
* tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (68 commits)
sysfs(2): fs_index() argument is _not_ a pathname
alpha: switch osf_mount() to strndup_user()
ksmbd: use CLASS(filename_kernel)
mqueue: switch to CLASS(filename)
user_statfs(): switch to CLASS(filename)
statx: switch to CLASS(filename_maybe_null)
quotactl_block(): switch to CLASS(filename)
chroot(2): switch to CLASS(filename)
move_mount(2): switch to CLASS(filename_maybe_null)
namei.c: switch user pathname imports to CLASS(filename{,_flags})
namei.c: convert getname_kernel() callers to CLASS(filename_kernel)
do_f{chmod,chown,access}at(): use CLASS(filename_uflags)
do_readlinkat(): switch to CLASS(filename_flags)
do_sys_truncate(): switch to CLASS(filename)
do_utimes_path(): switch to CLASS(filename_uflags)
chdir(2): unspaghettify a bit...
do_fchownat(): unspaghettify a bit...
fspick(2): use CLASS(filename_flags)
name_to_handle_at(): use CLASS(filename_uflags)
vfs_open_tree(): use CLASS(filename_uflags)
...
Currently io_uring supports restricting operations on a per-ring basis.
To use those, the ring must be setup in a disabled state by setting
IORING_SETUP_R_DISABLED. Then restrictions can be set for the ring, and
the ring can then be enabled.
This commit adds support for IORING_REGISTER_RESTRICTIONS with ring_fd
== -1, like the other "blind" register opcodes which work on the task
rather than a specific ring. This allows registration of the same kind
of restrictions as can be done on a specific ring, but with the task
itself. Once done, any ring created will inherit these restrictions.
If a restriction filter is registered with a task, then it's inherited
on fork for its children. Children may only further restrict operations,
not extend them.
Inherited restrictions include both the classic
IORING_REGISTER_RESTRICTIONS based restrictions, as well as the BPF
filters that have been registered with the task via
IORING_REGISTER_BPF_FILTER.
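As a rough illustration (not part of this patch), a userspace sketch of
task-level registration is below. It assumes the existing struct
io_uring_restriction layout from <linux/io_uring.h>, passes ring_fd == -1
as described above, and the nr_args convention is an assumption:

#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Restrict the *task* (ring_fd == -1) to READ/WRITE SQEs, so every ring
 * it creates afterwards inherits the restriction. */
static int restrict_task_to_rw(void)
{
        struct io_uring_restriction res[2];

        memset(res, 0, sizeof(res));
        res[0].opcode = IORING_RESTRICTION_SQE_OP;
        res[0].sqe_op = IORING_OP_READ;
        res[1].opcode = IORING_RESTRICTION_SQE_OP;
        res[1].sqe_op = IORING_OP_WRITE;

        /* fd == -1 selects the task-wide ("blind") variant described above */
        return syscall(__NR_io_uring_register, -1,
                       IORING_REGISTER_RESTRICTIONS, res, 2);
}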
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Called when copy_process() is called to copy state to a new child.
Right now this is just a stub, but will be used shortly to properly
handle fork'ing of task based io_uring restrictions.
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull io_uring fixes from Jens Axboe:
- Two small fixes for zcrx
- Two small fixes for fdinfo - one is just killing a superfluous newline
* tag 'io_uring-6.19-20260205' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/fdinfo: be a bit nicer when looping a lot of SQEs/CQEs
io_uring/fdinfo: kill unnecessary newline feed in CQE32 printing
io_uring/zcrx: fix rq flush locking
io_uring/zcrx: fix page array leak
io_register_pbuf_ring() ignores the return value of io_buffer_add_list(),
which can fail if xa_store() returns an error (e.g., -ENOMEM). When this
happens, the function returns 0 (success) to the caller, but the
io_buffer_list structure is neither added to the xarray nor freed.
In practice this requires failure injection to hit, hence not a real
issue. But it should get fixed up nonetheless.
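The shape of the fix is roughly the following, a hedged sketch of the
error path in io_register_pbuf_ring() rather than the literal diff:

        ret = io_buffer_add_list(ctx, bl, reg.bgid);
        if (ret) {
                /* xa_store() failed: free the list instead of leaking it */
                kfree(bl);
                return ret;
        }
        return 0;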
Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some io_uring files are missing SPDX-License-Identifier lines.
Add lines with GPL-2.0 license IDs to these files.
Signed-off-by: Tim Bird <tim.bird@sony.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's an unconditional newline feed anyway after dumping both normal
and big CQE contents; remove the \n from the CQE32 extra1/extra2
printing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
zcrx needs to keep the rq lock for uref manipulations, for now move all
zcrx_return_buffers() under the lock.
Fixes: 475eb39b00 ("io_uring/zcrx: add sync refill queue flushing")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
d9f595b9a6 ("io_uring/zcrx: fix leaking pages on sg init fail") fixed
a page leakage but didn't free the page array; release it as well.
Fixes: b84621d96e ("io_uring/zcrx: allocate sgtable for umem areas")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring keeps a per-task io-wq around, even when the task no longer has
any io_uring instances.
If the task previously used io_uring for file I/O, this can leave an
unrelated iou-wrk-* worker thread behind after the last io_uring
instance is gone.
When the last io_uring ctx is removed from the task context, mark the
io-wq exit-on-idle so workers can go away. Clear the flag on subsequent
io_uring usage.
Signed-off-by: Li Chen <me@linux.beauty>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io-wq uses an idle timeout to shrink the pool, but keeps the last worker
around indefinitely to avoid churn.
For tasks that used io_uring for file I/O and then stop using io_uring,
this can leave an iou-wrk-* thread behind even after all io_uring
instances are gone. This is unnecessary overhead and also gets in the
way of process checkpoint/restore.
Add an exit-on-idle state that makes all io-wq workers exit as soon as
they become idle, and provide io_wq_set_exit_on_idle() to toggle it.
Signed-off-by: Li Chen <me@linux.beauty>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a send bundle has picked a bunch of buffers, then it needs to send
all of those to be complete. This may require poll arming, if the send
buffer ends up being full. Once a send bundle has been poll armed, no
further bundles should be attempted.
This allows a current bundle to complete even though it needs to go
through polling to do so, but it will not allow another bundle to be
started once that has happened. Ideally we would abort a bundle if it
was only partially sent, but as some parts of it already went out on the
wire, this obviously isn't feasible. Not continuing more bundle attempts
after encountering a full socket buffer is the second best thing.
Cc: stable@vger.kernel.org
Fixes: a05d1f625c ("io_uring/net: support bundles for send")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for allowing inheritance of BPF filters and filter
tables, add a reference count to the filter. This allows multiple tables
to safely include the same filter.
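A minimal sketch of the refcounting this implies is below; the helper
names and struct layout are assumptions for illustration, only the
refcount_t pattern is the point:

#include <linux/refcount.h>
#include <linux/slab.h>

struct io_bpf_filter {
        refcount_t refs;
        /* filter program, length, per-opcode linkage, ... */
};

/* every table that links the filter takes a reference */
static void io_bpf_filter_get(struct io_bpf_filter *f)
{
        refcount_inc(&f->refs);
}

/* the last table to drop it frees the filter */
static void io_bpf_filter_put(struct io_bpf_filter *f)
{
        if (refcount_dec_and_test(&f->refs))
                kfree(f);
}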
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently a few pointer dereferences need to be made to both check if
BPF filters are installed, and then also to retrieve the actual filter
for the opcode. Cache the table in ctx->bpf_filters to avoid that.
Add a bit of debug info on ring exit to show if we ever got this wrong.
Small risk of that given that the table is currently only updated in one
spot, but once task forking is enabled, that will add one more spot.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds custom filtering for IORING_OP_OPENAT and IORING_OP_OPENAT2,
where the open_how flags, mode, and resolve can be checked by filters.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Example population method for the BPF based opcode filtering. This
exposes the socket family, type, and protocol to a registered BPF
filter. This in turn enables the filter to make decisions based on
what was passed in to the IORING_OP_SOCKET request type.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add support for loading classic BPF programs with io_uring to provide
fine-grained filtering of SQE operations. Unlike
IORING_REGISTER_RESTRICTIONS which only allows bitmap-based allow/deny
of opcodes, BPF filters can inspect request attributes and make dynamic
decisions.
The filter is registered via IORING_REGISTER_BPF_FILTER with a struct
io_uring_bpf:
struct io_uring_bpf_filter {
        __u32 opcode;           /* io_uring opcode to filter */
        __u32 flags;
        __u32 filter_len;       /* number of BPF instructions */
        __u32 resv;
        __u64 filter_ptr;       /* pointer to BPF filter */
        __u64 resv2[5];
};

enum {
        IO_URING_BPF_CMD_FILTER = 1,
};

struct io_uring_bpf {
        __u16 cmd_type;         /* IO_URING_BPF_* values */
        __u16 cmd_flags;        /* none so far */
        __u32 resv;
        union {
                struct io_uring_bpf_filter filter;
        };
};
and the filters get supplied a struct io_uring_bpf_ctx:
struct io_uring_bpf_ctx {
        __u64 user_data;
        __u8 opcode;
        __u8 sqe_flags;
        __u8 pdu_size;
        __u8 pad[5];
};
where it's possible to filter on opcode and sqe_flags, with pdu_size
indicating how much extra data is being passed in beyond the pad field.
This will be used for specific, finer grained filtering inside an opcode.
An example of that for sockets is in one of the following patches.
Anything the opcode supports can end up in this struct, populated by
the opcode itself, and hence can be filtered for.
Filters have the following semantics:
- Return 1 to allow the request
- Return 0 to deny the request with -EACCES
- Multiple filters can be stacked per opcode. All filters must
return 1 for the opcode to be allowed.
- Filters are evaluated in registration order (most recent first)
The implementation uses classic BPF (cBPF) rather than eBPF, as that's
required for containers, and since cBPF filters can be used by any user
in the system.
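For illustration, a hedged userspace sketch of registering one such
filter is below. It assumes the updated <linux/io_uring.h> exports the
structs quoted above together with IORING_REGISTER_BPF_FILTER, and the
nr_args value is an assumption; the policy itself (deny IORING_OP_SOCKET
requests that carry IOSQE_FIXED_FILE) is arbitrary and only meant to show
a conditional cBPF program over struct io_uring_bpf_ctx:

#include <linux/filter.h>
#include <linux/io_uring.h>
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int register_socket_filter(int ring_fd)
{
        struct sock_filter insns[] = {
                /* A = ctx->sqe_flags */
                BPF_STMT(BPF_LD | BPF_B | BPF_ABS,
                         offsetof(struct io_uring_bpf_ctx, sqe_flags)),
                /* deny if IOSQE_FIXED_FILE is set, allow otherwise */
                BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, IOSQE_FIXED_FILE, 0, 1),
                BPF_STMT(BPF_RET | BPF_K, 0),   /* 0: deny -> -EACCES */
                BPF_STMT(BPF_RET | BPF_K, 1),   /* 1: allow */
        };
        struct io_uring_bpf reg;

        memset(&reg, 0, sizeof(reg));
        reg.cmd_type = IO_URING_BPF_CMD_FILTER;
        reg.filter.opcode = IORING_OP_SOCKET;
        reg.filter.filter_len = sizeof(insns) / sizeof(insns[0]);
        reg.filter.filter_ptr = (unsigned long)insns;

        /* ring_fd == -1 would attach it to the task instead, per the
         * per-task restriction patches in this series */
        return syscall(__NR_io_uring_register, ring_fd,
                       IORING_REGISTER_BPF_FILTER, &reg, 1);
}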
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For potential long term allocations, ensure that we play nicer with
memcg and use the accounting variant of the GFP_KERNEL allocation.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are network cards that support receive buffers larger than 4K,
which can be vastly beneficial for performance; benchmarks for this
patch showed up to 30% CPU utilization improvement for 32K vs 4K buffers.
Allow zcrx users to specify the size in struct
io_uring_zcrx_ifq_reg::rx_buf_len. If set to zero, zcrx will use a
default value. zcrx will check and fail if the memory backing the area
can't be split into physically contiguous chunks of the required size.
This is more restrictive than strictly necessary, as only the DMA
addresses need to be contiguous, but relaxing that is beyond this series.
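A hedged sketch of the registration side for a 32K configuration;
everything other than rx_buf_len is the usual zcrx ifq setup (interface,
refill ring, area registration) and is elided here, and the usual
register call shape is assumed:

#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <unistd.h>

/* reg is assumed to be filled in as for a normal zcrx registration */
static int register_zcrx_ifq_32k(int ring_fd, struct io_uring_zcrx_ifq_reg *reg)
{
        /* new in this patch: 0 keeps the default, otherwise the backing
         * area must split into physically contiguous 32K chunks */
        reg->rx_buf_len = 32 * 1024;
        return syscall(__NR_io_uring_register, ring_fd,
                       IORING_REGISTER_ZCRX_IFQ, reg, 1);
}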
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: kill duplicate netdev_queues.h include]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the hung_task_timeout sysctl is set to 0, then we'll end up busy
looping inside io_wq_exit_workers() after an earlier commit switched to
using wait_for_completion_timeout(). Use the maximum schedule timeout
value for that case.
Fixes: 1f293098a3 ("io_uring/io-wq: don't trigger hung task for syzbot craziness")
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull io_uring fixes from Jens Axboe:
- Fix for a potential leak of an iovec, if a specific cleanup path is
used and the rw_cache is full at the time of the call
- Fix for a regression added in this cycle, where waitid should be
using proper release/acquire semantics for updating the wait queue
head
- Check for the cancelation bit being set for every work item processed
by io-wq, not just at the start of the loop. Has no real practical
implications other than to shut up syzbot doing crazy things that
grossly overload a system, hence slowing down ring exit
- A few selftest additions, updating the mini_liburing that selftests
use
* tag 'io_uring-6.19-20260122' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
selftests/io_uring: support NO_SQARRAY in miniliburing
selftests/io_uring: add io_uring_queue_init_params
io_uring/io-wq: check IO_WQ_BIT_EXIT inside work run loop
io_uring/waitid: fix KCSAN warning on io_waitid->head
io_uring/rw: free potentially allocated iovec on cache put failure
io_rsrc_node_lookup() takes a signed int index as input and compares it
to an unsigned length. Since the signed int is implicitly cast to an
unsigned int for the comparison and the length is bounded by
IORING_MAX_FIXED_FILES/IORING_MAX_REG_BUFFERS, negative indices are
already rejected on architectures where int is at least 32 bits. Make
this a bit clearer and avoid compiler warnings for comparisons of
signed and unsigned values by taking an unsigned int index instead.
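The conversion behaviour being relied on can be seen in a tiny standalone
demo (illustrative only, not kernel code):

#include <stdio.h>

int main(void)
{
        unsigned int nr = 1024; /* e.g. bounded by IORING_MAX_FIXED_FILES */
        int index = -1;

        /* index is implicitly converted to unsigned for the comparison,
         * so -1 becomes UINT_MAX and fails the bounds check; compilers
         * may emit -Wsign-compare here, which the patch silences */
        if (index < nr)
                printf("index accepted\n");
        else
                printf("index rejected\n");     /* this branch is taken */
        return 0;
}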
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Outside of SQPOLL, normally SQ entries are consumed by the time the
submission syscall returns. For those cases we don't need a circular
buffer and the head/tail tracking, instead the kernel can assume that
entries always start from the beginning of the SQ at index 0. This patch
introduces a setup flag doing exactly that. It's simpler and helps
to keep SQEs hot in cache.
The feature is optional and enabled by setting IORING_SETUP_SQ_REWIND.
The flag is rejected if passed together with SQPOLL as it'd require
waiting for SQ before each submission. It also requires
IORING_SETUP_NO_SQARRAY, which can be supported but it's unlikely there
will be users, so leave more space for future optimisations.
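A hedged liburing-based sketch of opting in, assuming a liburing build
whose headers carry the new flag:

#include <liburing.h>
#include <string.h>

static int setup_rewind_ring(struct io_uring *ring, unsigned entries)
{
        struct io_uring_params p;

        memset(&p, 0, sizeof(p));
        /* SQ_REWIND needs NO_SQARRAY and is rejected with SQPOLL */
        p.flags = IORING_SETUP_SQ_REWIND | IORING_SETUP_NO_SQARRAY;
        return io_uring_queue_init_params(entries, ring, &p);
}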
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the completion queue waiting and scheduling code out of io_uring.c
into a dedicated wait.c file. This further moves code out of the main
io_uring source and header files and into a new topical file.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the task work handling code out of io_uring.c into a new tw.c file.
This includes the local work, normal work, and fallback work handling
infrastructure.
The associated tw.h header contains io_should_terminate_tw() as a static
inline helper, along with the necessary function declarations.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the same trick that blk_io_schedule() does to avoid triggering the
hung task warning (and potential reboot/panic, depending on system
settings), and only wait for half the hung task timeout at a time.
If we exceed the default IO_URING_EXIT_WAIT_MAX period where we expect
things to certainly have finished unless there's a bug, then throw a
WARN_ON_ONCE() for that case.
Reported-by: syzbot+4eb282331cab6d5b6588@syzkaller.appspotmail.com
Tested-by: syzbot+4eb282331cab6d5b6588@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add the timeout we normally wait before complaining about things being
stuck waiting for cancelations to complete as a define, and use it in
io_ring_exit_work().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Check if the passed in offset is negative once cast to sync->off. This
ensures that -EINVAL is returned for that case, like it would be for
sync_file_range(2).
Fixes: c992fe2925 ("io_uring: add fsync support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When multiple io_uring rings poll on the same NVMe queue, one ring can
find completions belonging to another ring. The current code always
uses task_work to handle this, but this adds overhead for the common
single-ring case.
This patch passes the polling io_ring_ctx through io_comp_batch's new
poll_ctx field. In io_do_iopoll(), the polling ring's context is stored
in iob.poll_ctx before calling the iopoll callbacks.
In nvme_uring_cmd_end_io(), we now compare iob->poll_ctx with the
request's owning io_ring_ctx (via io_uring_cmd_ctx_handle()). If they
match (local context), we complete inline with io_uring_cmd_done32().
If they differ (remote context) or iob is NULL (non-iopoll path), we
use task_work as before.
This optimization eliminates task_work scheduling overhead for the
common case where a ring polls and finds its own completions.
~10% IOPS improvement is observed in the following benchmark:
fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -O0 -P1 -u1 -n1 /dev/ng0n1
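Roughly, the completion-side decision described above looks like the
fragment below, a hedged sketch inside nvme_uring_cmd_end_io(); the
status/result/issue_flags names stand in for the driver's actual
completion arguments:

        if (iob && iob->poll_ctx == io_uring_cmd_ctx_handle(ioucmd)) {
                /* local: the polling ring owns this request, complete inline */
                io_uring_cmd_done32(ioucmd, status, result, issue_flags);
        } else {
                /* remote ring, or non-iopoll path: punt to task_work as before */
                io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb);
        }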
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
syzbot correctly reports this as a KCSAN race, as ctx->cached_cq_tail
should be read under ->uring_lock. This isn't immediately feasible in
io_flush_timeouts(), but as long as we read a stable value, that should
be good enough. If two io-wq threads compete on this value, then they
will both end up calling io_flush_timeouts() and at least one of them
will see the correct value.
Reported-by: syzbot+6c48db7d94402407301e@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently this is checked before running the pending work. Normally this
is quite fine, as work items either end up blocking (which will create a
new worker for other items), or they complete fairly quickly. But syzbot
reports an issue where io-wq takes seemingly forever to exit, and with a
bit of debugging, this turns out to be because it queues a bunch of big
(2GB - 4096b) reads with a /dev/msr* file. Since this file type doesn't
support ->read_iter(), loop_rw_iter() ends up handling them. Each read
returns 16MB of data read, which takes 20 (!!) seconds. With a bunch of
these pending, processing the whole chain can take a long time. Easily
longer than the syzbot uninterruptible sleep timeout of 140 seconds.
This then triggers a complaint off the io-wq exit path:
INFO: task syz.4.135:6326 blocked for more than 143 seconds.
Not tainted syzkaller #0
Blocked by coredump.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz.4.135 state:D stack:26824 pid:6326 tgid:6324 ppid:5957 task_flags:0x400548 flags:0x00080000
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5256 [inline]
__schedule+0x1139/0x6150 kernel/sched/core.c:6863
__schedule_loop kernel/sched/core.c:6945 [inline]
schedule+0xe7/0x3a0 kernel/sched/core.c:6960
schedule_timeout+0x257/0x290 kernel/time/sleep_timeout.c:75
do_wait_for_common kernel/sched/completion.c:100 [inline]
__wait_for_common+0x2fc/0x4e0 kernel/sched/completion.c:121
io_wq_exit_workers io_uring/io-wq.c:1328 [inline]
io_wq_put_and_exit+0x271/0x8a0 io_uring/io-wq.c:1356
io_uring_clean_tctx+0x10d/0x190 io_uring/tctx.c:203
io_uring_cancel_generic+0x69c/0x9a0 io_uring/cancel.c:651
io_uring_files_cancel include/linux/io_uring.h:19 [inline]
do_exit+0x2ce/0x2bd0 kernel/exit.c:911
do_group_exit+0xd3/0x2a0 kernel/exit.c:1112
get_signal+0x2671/0x26d0 kernel/signal.c:3034
arch_do_signal_or_restart+0x8f/0x7e0 arch/x86/kernel/signal.c:337
__exit_to_user_mode_loop kernel/entry/common.c:41 [inline]
exit_to_user_mode_loop+0x8c/0x540 kernel/entry/common.c:75
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode_work include/linux/entry-common.h:159 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:194 [inline]
do_syscall_64+0x4ee/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fa02738f749
RSP: 002b:00007fa0281ae0e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 00007fa0275e6098 RCX: 00007fa02738f749
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007fa0275e6098
RBP: 00007fa0275e6090 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fa0275e6128 R14: 00007fff14e4fcb0 R15: 00007fff14e4fd98
There's really nothing wrong here, outside of the fact that processing these reads
will take a LONG time. However, we can speed up the exit by checking the
IO_WQ_BIT_EXIT inside the io_worker_handle_work() loop, as syzbot will
exit the ring after queueing up all of these reads. Then once the first
item is processed, io-wq will simply cancel the rest. That should avoid
syzbot running into this complaint again.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/68a2decc.050a0220.e29e5.0099.GAE@google.com/
Reported-by: syzbot+4eb282331cab6d5b6588@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Storing of the iw->head entry inside the wait_queue callback, or when
removing a waitid item, really should use proper load/store
acquire/release semantics, and KCSAN correctly warns of that. Ensure
that they do so.
Reported-by: syzbot+eb441775f4f948a0902f@syzkaller.appspotmail.com
Fixes: a48c0cbf28 ("io_uring/waitid: have io_waitid_complete() remove wait queue entry")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a read/write request goes through io_req_rw_cleanup() and has an
allocated iovec attached and fails to put to the rw_cache, then it may
end up with an unaccounted iovec pointer. Have io_rw_recycle() return
whether it recycled the request or not, and use that to gauge whether to
free a potential iovec or not.
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull io_uring fix from Jens Axboe:
"Just a single fix moving local task_work inside the cancelation loop,
rather than only before cancelations.
If any cancelations generate task_work, we do need to re-run it"
* tag 'io_uring-6.19-20260116' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring: move local task_work in exit cancel loop
Callers switched to CLASS(filename_maybe_null) (in fs/xattr.c)
and CLASS(filename_complete_delayed) (in io_uring/xattr.c).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
filename_renameat2() replaces do_renameat2(); unlike the latter,
it does not drop filename references - these days it can be just
as easily arranged in the caller.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This currently isn't supported, and due to a recent commit, it also
cannot easily be supported by io_uring due to hash_node and IOPOLL
completion data overlapping.
This can be revisited if we ever do support cancelations of requests
that have gone to the block stack.
Suggested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With IORING_SETUP_DEFER_TASKRUN, task work is queued to ctx->work_llist
(local work) rather than the fallback list. During io_ring_exit_work(),
io_move_task_work_from_local() was called once before the cancel loop,
moving work from work_llist to fallback_llist.
However, task work can be added to work_llist during the cancel loop
itself. There are two cases:
1) io_kill_timeouts() is called from io_uring_try_cancel_requests() to
cancel pending timeouts, and it adds task work via io_req_queue_tw_complete()
for each cancelled timeout.
2) URING_CMD requests like ublk can be completed via
io_uring_cmd_complete_in_task() from ublk_queue_rq() during canceling,
given the ublk request queue is only quiesced when canceling the first uring_cmd.
Since io_allowed_defer_tw_run() returns false in io_ring_exit_work()
(kworker != submitter_task), io_run_local_work() is never invoked,
and the work_llist entries are never processed. This causes
io_uring_try_cancel_requests() to loop indefinitely, resulting in
100% CPU usage in kworker threads.
Fix this by moving io_move_task_work_from_local() inside the cancel
loop, ensuring any work on work_llist is moved to fallback before
each cancel attempt.
Cc: stable@vger.kernel.org
Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>