Commit Graph

1200119 Commits

Author SHA1 Message Date
Pavel Begunkov
2af89abda7 io_uring: add option to remove SQ indirection
Not many aware, but io_uring submission queue has two levels. The first
level usually appears as sq_array and stores indexes into the actual SQ.

To my knowledge, no one has ever seriously used it, nor liburing exposes
it to users. Add IORING_SETUP_NO_SQARRAY, when set we don't bother
creating and using the sq_array and SQ heads/tails will be pointing
directly into the SQ. Improves memory footprint, in term of both
allocations as well as cache usage, and also should make io_get_sqe()
less branchy in the end.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0ffa3268a5ef61d326201ff43a233315c96312e0.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-24 17:16:19 -06:00
Pavel Begunkov
e5598d6ae6 io_uring: compact SQ/CQ heads/tails
Queues heads and tails cache line aligned. That makes sq, cq taking 4
lines or 5 lines if we include the rest of struct io_rings (e.g.
sq_flags is frequently accessed).

Since modern io_uring is mostly single threaded, it doesn't make much
send to spread them as such, it wastes space and puts additional pressure
on caches. Put them all into a single line.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9c8deddf9a7ed32069235a530d1e117fb460bc4c.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-24 17:16:19 -06:00
Pavel Begunkov
093a650b75 io_uring: force inline io_fill_cqe_req
There are only 2 callers of io_fill_cqe_req left, and one of them is
extremely hot. Force inline the function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ffce4fc5e3521966def848a4d930586dfe33ae11.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-24 17:16:19 -06:00
Pavel Begunkov
ec26c225f0 io_uring: merge iopoll and normal completion paths
io_do_iopoll() and io_submit_flush_completions() are pretty similar,
both filling CQEs and then free a list of requests. Don't duplicate it
and make iopoll use __io_submit_flush_completions(), which also helps
with inlining and other optimisations.

For that, we need to first find all completed iopoll requests and splice
them from the iopoll list and then pass it down. This adds one extra
list traversal, which should be fine as requests will stay hot in cache.

CQ locking is already conditional, introduce ->lockless_cq and skip
locking for IOPOLL as it's protected by ->uring_lock.

We also add a wakeup optimisation for IOPOLL to __io_cq_unlock_post(),
so it works just like io_cqring_ev_posted_iopoll().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3840473f5e8a960de35b77292026691880f6bdbc.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-24 17:16:19 -06:00
Pavel Begunkov
54927baf6c io_uring: reorder cqring_flush and wakeups
Unlike in the past, io_commit_cqring_flush() doesn't do anything that
may need io_cqring_wake() to be issued after, all requests it completes
will go via task_work. Do io_commit_cqring_flush() after
io_cqring_wake() to clean up __io_cq_unlock_post().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ed32dcfeec47e6c97bd6b18c152ddce5b218403f.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-24 17:16:19 -06:00
Pavel Begunkov
59fbc409e7 io_uring: optimise extra io_get_cqe null check
If the cached cqe check passes in io_get_cqe*() it already means that
the cqe we return is valid and non-zero, however the compiler is unable
to optimise null checks like in io_fill_cqe_req().

Do a bit of trickery, return success/fail boolean from io_get_cqe*()
and store cqe in the cqe parameter. That makes it do the right thing,
erasing the check together with the introduced indirection.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/322ea4d3377d3d4efd8ae90ab8ed28a99f518210.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-24 17:16:19 -06:00
Pavel Begunkov
20d6b63387 io_uring: refactor __io_get_cqe()
Make __io_get_cqe simpler by not grabbing the cqe from refilled cached,
but letting io_get_cqe() do it for us. That's cleaner and removes some
duplication.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/74dc8fdf2657e438b2e05e1d478a3596924604e9.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-24 17:16:19 -06:00
Pavel Begunkov
b24c5d7529 io_uring: simplify big_cqe handling
Don't keep big_cqe bits of req in a union with hash_node, find a
separate space for it. It's bit safer, but also if we keep it always
initialised, we can get rid of ugly REQ_F_CQE32_INIT handling.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/447aa1b2968978c99e655ba88db536e903df0fe9.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-24 17:16:19 -06:00
Pavel Begunkov
31d3ba924f io_uring: cqe init hardening
io_kiocb::cqe stores the completion info which we'll memcpy to
userspace, and we rely on callbacks and other later steps to populate
it with right values. We have never had problems with that, but it would
still be safer to zero it on allocation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b16a3b64dde678686460d3c3792c3ba6d3d1bc7a.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-24 17:16:19 -06:00
Pavel Begunkov
a0727c7383 io_uring: improve cqe !tracing hot path
While looking at io_fill_cqe_req()'s asm I stumbled on our trace points
turning into the chunk below:

trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
			req->cqe.res, req->cqe.flags,
			req->extra1, req->extra2);

io_uring/io_uring.c:898: 	trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
	movq	232(%rbx), %rdi	# req_44(D)->big_cqe.extra2, _5
	movq	224(%rbx), %rdx	# req_44(D)->big_cqe.extra1, _6
	movl	84(%rbx), %r9d	# req_44(D)->cqe.D.81184.flags, _7
	movl	80(%rbx), %r8d	# req_44(D)->cqe.res, _8
	movq	72(%rbx), %rcx	# req_44(D)->cqe.user_data, _9
	movq	88(%rbx), %rsi	# req_44(D)->ctx, _10
./arch/x86/include/asm/jump_label.h:27: 	asm_volatile_goto("1:"
	1:jmp .L1772 # objtool NOPs this 	#
	...

It does a jump_label for actual tracing, but those 6 moves will stay
there in the hottest io_uring path. As an optimisation, add a
trace_io_uring_complete_enabled() check, which is also uses jump_labels,
it tricks the compiler into behaving. It removes the junk without
changing anything else int the hot path.

Note: apparently, it's not only me noticing it, and people are also
working it around. We should remove the check when it's solved
generically or rework tracing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/555d8312644b3776f4be7e23f9b92943875c4bc7.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-24 17:16:19 -06:00
Kees Cook
04d9244c94 io_uring/rsrc: Annotate struct io_mapped_ubuf with __counted_by
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).

As found with Coccinelle[1], add __counted_by for struct io_mapped_ubuf.

[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: io-uring@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20230817212146.never.853-kees@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-17 19:14:47 -06:00
Jens Axboe
ebdfefc09c io_uring/sqpoll: fix io-wq affinity when IORING_SETUP_SQPOLL is used
If we setup the ring with SQPOLL, then that polling thread has its
own io-wq setup. This means that if the application uses
IORING_REGISTER_IOWQ_AFF to set the io-wq affinity, we should not be
setting it for the invoking task, but rather the sqpoll task.

Add an sqpoll helper that parks the thread and updates the affinity,
and use that one if we're using SQPOLL.

Fixes: fe76421d1d ("io_uring: allow user configurable IO thread CPU affinity")
Cc: stable@vger.kernel.org # 5.10+
Link: https://github.com/axboe/liburing/discussions/884
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-16 13:40:28 -06:00
Pavel Begunkov
d246c759c4 io_uring: simplify io_run_task_work_sig return
Nobody cares about io_run_task_work_sig returning 1, we only check for
negative errors. Simplify by keeping to 0/-error returns.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3aec8a532c003d6e50739b969a82989402696170.1691757663.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11 10:42:57 -06:00
Pavel Begunkov
19a63c4021 io_uring/rsrc: keep one global dummy_ubuf
We set empty registered buffers to dummy_ubuf as an optimisation.
Currently, we allocate the dummy entry for each ring, whenever we can
simply have one global instance.

We're casting out const on assignment, it's fine as we're not going to
change the content of the dummy, the constness gives us an extra layer
of protection if sth ever goes wrong.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e4a96dda35ab755914bc43f6781bba0df97ac489.1691757663.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11 10:42:57 -06:00
Pavel Begunkov
b6b2bb58a7 io_uring: never overflow io_aux_cqe
Now all callers of io_aux_cqe() set allow_overflow to false, remove the
parameter and not allow overflowing auxilary multishot cqes.

When CQ is full the function callers and all multishot requests in
general are expected to complete the request. That prevents indefinite
in-background grows of the overflow list and let's the userspace to
handle the backlog at its own pace.

Resubmitting a request should also be faster than accounting a bunch of
overflows, so it should be better for perf when it happens, but a well
behaving userspace should be trying to avoid overflows in any case.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bb20d14d708ea174721e58bb53786b0521e4dd6d.1691757663.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11 10:42:57 -06:00
Pavel Begunkov
056695bffa io_uring: remove return from io_req_cqe_overflow()
Nobody checks io_req_cqe_overflow()'s return, make it return void.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8f2029ad0c22f73451664172d834372608ee0a77.1691757663.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11 10:42:57 -06:00
Pavel Begunkov
00b0db5624 io_uring: open code io_fill_cqe_req()
io_fill_cqe_req() is only called from one place, open code it, and
rename __io_fill_cqe_req().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f432ce75bb1c94cadf0bd2add4d6aa510bd1fb36.1691757663.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11 10:42:57 -06:00
Pavel Begunkov
b2e74db55d io_uring/net: don't overflow multishot recv
Don't allow overflowing multishot recv CQEs, it might get out of
hand, hurt performance, and in the worst case scenario OOM the task.

Cc: stable@vger.kernel.org
Fixes: b3fdea6ecb ("io_uring: multishot recv")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0b295634e8f1b71aa764c984608c22d85f88f75c.1691757663.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11 10:42:51 -06:00
Pavel Begunkov
1bfed23349 io_uring/net: don't overflow multishot accept
Don't allow overflowing multishot accept CQEs, we want to limit
the grows of the overflow list.

Cc: stable@vger.kernel.org
Fixes: 4e86a2c980 ("io_uring: implement multishot mode for accept")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7d0d749649244873772623dd7747966f516fe6e2.1691757663.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11 10:42:34 -06:00
Jens Axboe
22f7fb80e6 io_uring/io-wq: don't gate worker wake up success on wake_up_process()
All we really care about is finding a free worker. If said worker is
already running, it's either starting new work already or it's just
finishing up existing work. For the latter, we'll be finding this work
item next anyway, and for the former, if the worker does go to sleep,
it'll create a new worker anyway as we have pending items.

This reduces try_to_wake_up() overhead considerably:

23.16%    -10.46%  [kernel.kallsyms]      [k] try_to_wake_up

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11 10:36:20 -06:00
Jens Axboe
de36a15f9a io_uring/io-wq: reduce frequency of acct->lock acquisitions
When we check if we have work to run, we grab the acct lock, check,
drop it, and then return the result. If we do have work to run, then
running the work will again grab acct->lock and get the work item.

This causes us to grab acct->lock more frequently than we need to.
If we have work to do, have io_acct_run_queue() return with the acct
lock still acquired. io_worker_handle_work() is then always invoked
with the acct lock already held.

In a simple test cases that stats files (IORING_OP_STATX always hits
io-wq), we see a nice reduction in locking overhead with this change:

19.32%   -12.55%  [kernel.kallsyms]      [k] __cmpwait_case_32
20.90%   -12.07%  [kernel.kallsyms]      [k] queued_spin_lock_slowpath

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11 10:36:17 -06:00
Jens Axboe
78848b9b05 io_uring/io-wq: don't grab wq->lock for worker activation
The worker free list is RCU protected, and checks for workers going away
when iterating it. There's no need to hold the wq->lock around the
lookup.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11 10:36:12 -06:00
Jens Axboe
89226307b1 io_uring: remove unnecessary forward declaration
We never use io_move_task_work_from_local() before it's defined in the
file anyway, so kill the forward declaration.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-10 15:01:58 -06:00
Jens Axboe
17bc28374c io_uring: have io_file_put() take an io_kiocb rather than the file
No functional changes in this patch, just a prep patch for needing the
request in io_file_put().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-10 10:27:46 -06:00
Jens Axboe
9f69a25957 io_uring/splice: use fput() directly
No point in using io_file_put() here, as we need to check if it's a
fixed file in the caller anyway.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-10 10:24:25 -06:00
Jens Axboe
3aaf22b62a io_uring/fdinfo: get rid of ref tryget
The caller holds a reference to the ring itself, so by definition
the ring cannot go away. There's no need to play games with tryget
for the reference, as we don't need an extra reference at all.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-10 10:24:19 -06:00
Jens Axboe
9e4bef2ba9 io_uring: cleanup 'ret' handling in io_iopoll_check()
We return 0 for success, or -error when there's an error. Move the 'ret'
variable into the loop where we are actually using it, to make it
clearer that we don't carry this variable forward for return outside of
the loop.

While at it, also move the need_resched() break condition out of the
while check itself, keeping it with the signal pending check.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09 10:46:46 -06:00
Pavel Begunkov
dc314886cb io_uring: break iopolling on signal
Don't keep spinning iopoll with a signal set. It'll eventually return
back, e.g. by virtue of need_resched(), but it's not a nice user
experience.

Cc: stable@vger.kernel.org
Fixes: def596e955 ("io_uring: support for IO polling")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/eeba551e82cad12af30c3220125eb6cb244cc94c.1691594339.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09 10:46:46 -06:00
Pavel Begunkov
17619322e5 io_uring: kill io_uring userspace examples
There are tons of io_uring tests and examples in liburing and on the
Internet. If you're looking for a benchmark, io_uring-bench.c is just an
acutely outdated version of fio/io_uring. And for basic condensed init
template for likes of selftests take a peek at io_uring_zerocopy_tx.c.

Kill tools/io_uring/, it's a burden keeping it here.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7c740701d3b475dcad8c92602a551044f72176b4.1691543666.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09 10:46:46 -06:00
Pavel Begunkov
569f5308e5 io_uring: fix false positive KASAN warnings
io_req_local_work_add() peeks into the work list, which can be executed
in the meanwhile. It's completely fine without KASAN as we're in an RCU
read section and it's SLAB_TYPESAFE_BY_RCU. With KASAN though it may
trigger a false positive warning because internal io_uring caches are
sanitised.

Remove sanitisation from the io_uring request cache for now.

Cc: stable@vger.kernel.org
Fixes: 8751d15426 ("io_uring: reduce scheduling due to tw")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c6fbf7a82a341e66a0007c76eefd9d57f2d3ba51.1691541473.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09 10:46:46 -06:00
Pavel Begunkov
cfdbaa3a29 io_uring: fix drain stalls by invalid SQE
cq_extra is protected by ->completion_lock, which io_get_sqe() misses.
The bug is harmless as it doesn't happen in real life, requires invalid
SQ index array and racing with submission, and only messes up the
userspace, i.e. stall requests execution but will be cleaned up on
ring destruction.

Fixes: 15641e4270 ("io_uring: don't cache number of dropped SQEs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/66096d54651b1a60534bb2023f2947f09f50ef73.1691538547.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09 10:46:46 -06:00
Yue Haibing
d4b30eed51 io_uring/rsrc: Remove unused declaration io_rsrc_put_tw()
Commit 36b9818a5a ("io_uring/rsrc: don't offload node free")
removed the implementation but leave declaration.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20230808151058.4572-1-yuehaibing@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09 10:46:46 -06:00
Jens Axboe
b97f96e22f io_uring: annotate the struct io_kiocb slab for appropriate user copy
When compiling the kernel with clang and having HARDENED_USERCOPY
enabled, the liburing openat2.t test case fails during request setup:

usercopy: Kernel memory overwrite attempt detected to SLUB object 'io_kiocb' (offset 24, size 24)!
------------[ cut here ]------------
kernel BUG at mm/usercopy.c:102!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 3 PID: 413 Comm: openat2.t Tainted: G                 N 6.4.3-g6995e2de6891-dirty #19
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
RIP: 0010:usercopy_abort+0x84/0x90
Code: ce 49 89 ce 48 c7 c3 68 48 98 82 48 0f 44 de 48 c7 c7 56 c6 94 82 4c 89 de 48 89 c1 41 52 41 56 53 e8 e0 51 c5 00 48 83 c4 18 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 41 57 41 56
RSP: 0018:ffffc900016b3da0 EFLAGS: 00010296
RAX: 0000000000000062 RBX: ffffffff82984868 RCX: 4e9b661ac6275b00
RDX: ffff8881b90ec580 RSI: ffffffff82949a64 RDI: 00000000ffffffff
RBP: 0000000000000018 R08: 0000000000000000 R09: 0000000000000000
R10: ffffc900016b3c88 R11: ffffc900016b3c30 R12: 00007ffe549659e0
R13: ffff888119014000 R14: 0000000000000018 R15: 0000000000000018
FS:  00007f862e3ca680(0000) GS:ffff8881b90c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005571483542a8 CR3: 0000000118c11000 CR4: 00000000003506e0
Call Trace:
 <TASK>
 ? __die_body+0x63/0xb0
 ? die+0x9d/0xc0
 ? do_trap+0xa7/0x180
 ? usercopy_abort+0x84/0x90
 ? do_error_trap+0xc6/0x110
 ? usercopy_abort+0x84/0x90
 ? handle_invalid_op+0x2c/0x40
 ? usercopy_abort+0x84/0x90
 ? exc_invalid_op+0x2f/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? usercopy_abort+0x84/0x90
 __check_heap_object+0xe2/0x110
 __check_object_size+0x142/0x3d0
 io_openat2_prep+0x68/0x140
 io_submit_sqes+0x28a/0x680
 __se_sys_io_uring_enter+0x120/0x580
 do_syscall_64+0x3d/0x80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x55714834de26
Code: ca 01 0f b6 82 d0 00 00 00 8b ba cc 00 00 00 45 31 c0 31 d2 41 b9 08 00 00 00 83 e0 01 c1 e0 04 41 09 c2 b8 aa 01 00 00 0f 05 <c3> 66 0f 1f 84 00 00 00 00 00 89 30 eb 89 0f 1f 40 00 8b 00 a8 06
RSP: 002b:00007ffe549659c8 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
RAX: ffffffffffffffda RBX: 00007ffe54965a50 RCX: 000055714834de26
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000008
R10: 0000000000000000 R11: 0000000000000246 R12: 000055714834f057
R13: 00007ffe54965a50 R14: 0000000000000001 R15: 0000557148351dd8
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---

when it tries to copy struct open_how from userspace into the per-command
space in the io_kiocb. There's nothing wrong with the copy, but we're
missing the appropriate annotations for allowing user copies to/from the
io_kiocb slab.

Allow copies in the per-command area, which is from the 'file' pointer to
when 'opcode' starts. We do have existing user copies there, but they are
not all annotated like the one that openat2_prep() uses,
copy_struct_from_user(). But in practice opcodes should be allowed to
copy data into their per-command area in the io_kiocb.

Reported-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09 10:46:44 -06:00
Breno Leitao
8e9fad0e70 io_uring: Add io_uring command support for sockets
Enable io_uring commands on network sockets. Create two new
SOCKET_URING_OP commands that will operate on sockets.

In order to call ioctl on sockets, use the file_operations->io_uring_cmd
callbacks, and map it to a uring socket function, which handles the
SOCKET_URING_OP accordingly, and calls socket ioctls.

This patches was tested by creating a new test case in liburing.
Link: https://github.com/leitao/liburing/tree/io_uring_cmd

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230627134424.2784797-1-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09 10:46:15 -06:00
Jens Axboe
f77569d22a io_uring/cancel: wire up IORING_ASYNC_CANCEL_OP for sync cancel
Allow usage of IORING_ASYNC_CANCEL_OP through the sync cancelation
API as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17 10:05:48 -06:00
Jens Axboe
d7b8b079a8 io_uring/cancel: support opcode based lookup and cancelation
Add IORING_ASYNC_CANCEL_OP flag for cancelation, which allows the
application to target cancelation based on the opcode of the original
request.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17 10:05:48 -06:00
Jens Axboe
8165b56604 io_uring/cancel: add IORING_ASYNC_CANCEL_USERDATA
Add a flag to explicitly match on user_data in the request for
cancelation purposes. This is the default behavior if none of the
other match flags are set, but if we ALSO want to match on user_data,
then this flag can be set.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17 10:05:48 -06:00
Jens Axboe
a30badf66d io_uring: use cancelation match helper for poll and timeout requests
Get rid of the request vs io_cancel_data checking and just use the
exported helper for this.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17 10:05:48 -06:00
Jens Axboe
3a372b6692 io_uring/cancel: fix sequence matching for IORING_ASYNC_CANCEL_ANY
We always need to check/update the cancel sequence if
IORING_ASYNC_CANCEL_ALL is set. Also kill the redundant check for
IORING_ASYNC_CANCEL_ANY at the end, if we get here we know it's
not set as we would've matched it higher up.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17 10:05:48 -06:00
Jens Axboe
aa5cd116f3 io_uring/cancel: abstract out request match helper
We have different match code in a variety of spots. Start the cleanup of
this by abstracting out a helper that can be used to check if a given
request matches the cancelation criteria outlined in io_cancel_data.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17 10:05:48 -06:00
Jens Axboe
faa9c0ee3c io_uring/timeout: always set 'ctx' in io_cancel_data
In preparation for using a generic handler to match requests for
cancelation purposes, ensure that ctx is set in io_cancel_data. The
timeout handlers don't check for this as it'll always match, but we'll
need it set going forward.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17 10:05:48 -06:00
Jens Axboe
ad711c5d11 io_uring/poll: always set 'ctx' in io_cancel_data
This isn't strictly necessary for this callsite, as it uses it's
internal lookup for this cancelation purpose. But let's be consistent
with how it's used in general and set ctx as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17 10:05:48 -06:00
Linus Torvalds
fdf0eaf114 Linux 6.5-rc2 v6.5-rc2 2023-07-16 15:10:37 -07:00
Linus Torvalds
5b8d6e8539 Merge tag 'xtensa-20230716' of https://github.com/jcmvbkbc/linux-xtensa
Pull xtensa fixes from Max Filippov:

 - fix interaction between unaligned exception handler and load/store
   exception handler

 - fix parsing ISS network interface specification string

 - add comment about etherdev freeing to ISS network driver

* tag 'xtensa-20230716' of https://github.com/jcmvbkbc/linux-xtensa:
  xtensa: fix unaligned and load/store configuration interaction
  xtensa: ISS: fix call to split_if_spec
  xtensa: ISS: add comment about etherdev freeing
2023-07-16 14:12:49 -07:00
Linus Torvalds
1667e630c2 Merge tag 'perf_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:

 - Fix a lockdep warning when the event given is the first one, no event
   group exists yet but the code still goes and iterates over event
   siblings

* tag 'perf_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix lockdep warning in for_each_sibling_event() on SPR
2023-07-16 13:46:08 -07:00
Linus Torvalds
8a3e4a6484 Merge tag 'objtool_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Borislav Petkov:

 - Mark copy_iovec_from_user() __noclone in order to prevent gcc from
   doing an inter-procedural optimization and confuse objtool

 - Initialize struct elf fully to avoid build failures

* tag 'objtool_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  iov_iter: Mark copy_iovec_from_user() noclone
  objtool: initialize all of struct elf
2023-07-16 13:34:29 -07:00
Linus Torvalds
f61a89ca11 Merge tag 'sched_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:

 - Remove a cgroup from under a polling process properly

 - Fix the idle sibling selection

* tag 'sched_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/psi: use kernfs polling functions for PSI trigger polling
  sched/fair: Use recent_used_cpu to test p->cpus_ptr
2023-07-16 13:22:08 -07:00
Linus Torvalds
ede950b019 Merge tag 'pinctrl-v6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
 "I'm mostly on vacation but what would vacation be without a few
  critical fixes so people can use their gaming laptops when hiding away
  from the sun (or rain)?

   - Fix a really annoying interrupt storm in the AMD driver affecting
     Asus TUF gaming notebooks

   - Fix device tree parsing in the Renesas driver"

* tag 'pinctrl-v6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: amd: Unify debounce handling into amd_pinconf_set()
  pinctrl: amd: Drop pull up select configuration
  pinctrl: amd: Use amd_pinconf_set() for all config options
  pinctrl: amd: Only use special debounce behavior for GPIO 0
  pinctrl: renesas: rzg2l: Handle non-unique subnode names
  pinctrl: renesas: rzv2m: Handle non-unique subnode names
2023-07-16 12:55:31 -07:00
Linus Torvalds
fe756ad021 Merge tag '6.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:

 - Two reconnect fixes: important fix to address inFlight count to leak
   (which can leak credits), and fix for better handling a deleted share

 - DFS fix

 - SMB1 cleanup fix

 - deferred close fix

* tag '6.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix mid leak during reconnection after timeout threshold
  cifs: is_network_name_deleted should return a bool
  smb: client: fix missed ses refcounting
  smb: client: Fix -Wstringop-overflow issues
  cifs: if deferred close is disabled then close files immediately
2023-07-16 12:49:05 -07:00
Linus Torvalds
20edcec23f Merge tag 'powerpc-6.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:

 - Fix Speculation_Store_Bypass reporting in /proc/self/status on
   Power10

 - Fix HPT with 4K pages since recent changes by implementing pmd_same()

 - Fix 64-bit native_hpte_remove() to be irq-safe

Thanks to Aneesh Kumar K.V, Nageswara R Sastry, and Russell Currey.

* tag 'powerpc-6.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/mm/book3s64/hash/4k: Add pmd_same callback for 4K page size
  powerpc/64e: Fix obtool warnings in exceptions-64e.S
  powerpc/security: Fix Speculation_Store_Bypass reporting on Power10
  powerpc/64s: Fix native_hpte_remove() to be irq-safe
2023-07-16 12:28:04 -07:00