Now ublk_abort_queue() is run exclusively with ublk_queue_rq() and the
ubq_daemon task, so simplify aborting request:
- set UBLK_IO_FLAG_ABORTED in ublk_abort_queue() just for aborting
this request
- abort request in ublk_queue_rq() if ubq->canceling is set
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231009093324.957829-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Monitor work actually introduces one extra context for handling abort, this
way is easy to cause race, and also introduce extra delay when handling
aborting.
Now we start to support cancelable uring_cmd, so use it instead:
1) this cancel callback is either run from the uring cmd submission task
context or called after the io_uring context is exit, so the callback is
run exclusively with ublk_ch_uring_cmd() and __ublk_rq_task_work().
2) the previous patch freezes request queue when calling ublk_abort_queue(),
which is now completely exclusive with ublk_queue_rq() and
ublk_ch_uring_cmd()/__ublk_rq_task_work().
3) in timeout handler, if all IOs are in-flight, then all uring commands
are completed, uring command canceling can't help us to provide forward
progress any more, so call ublk_abort_requests() in timeout handler.
This way simplifies aborting queue, and is helpful for adding new feature,
such as, relax the limit of using single task for handling one queue.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231009093324.957829-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
So far aborting queue ends request when the ubq daemon is exiting, and
it can be run concurrently with ublk_queue_rq(), this way is fragile and
we depend on the tricky usage of UBLK_IO_FLAG_ABORTED for avoiding such
race.
Quiesce queue when aborting queue, and the two code paths can be run
completely exclusively, then it becomes easier to add new ublk feature,
such as relaxing single same task limit for each queue.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231009093324.957829-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk_cancel_dev() just calls ublk_cancel_queue() to cancel all pending
io commands after ublk request queue is idle. The only protection is just
the read & write of ubq->nr_io_ready and avoid duplicated command cancel,
so add one per-queue lock with cancel flag for providing this protection,
meantime move ublk_cancel_dev() out of ub->mutex.
Then we needn't to call io_uring_cmd_complete_in_task() to cancel
pending command. And the same cancel logic will be re-used for
cancelable uring command.
This patch basically reverts commit ac5902f84b ("ublk: fix AB-BA lockdep warning").
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231009093324.957829-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In well-done ublk server implementation, ublk io command won't be
linked into any link chain. Meantime they are always handled in no-wait
style, so basically io cmd is always handled in submitter task context.
However, the server may set IOSQE_ASYNC, or io command is linked to one
chain mistakenly, then we may still run into io-wq context and
ctx->uring_lock isn't held.
So in case of IO_URING_F_UNLOCKED, schedule this command by
io_uring_cmd_complete_in_task to force running it in submitter task. Then
ublk_ch_uring_cmd_local() is guaranteed to run with context uring_lock held,
and we needn't to worry about sync among submission code path any more.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231009093324.957829-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The dev_id/ub_number is used for the ublk dev's char device's minor
number so it has to fit into MINORMASK. This patch adds checks to prevent
userspace from passing a number that's too large and limits what can be
allocated by the ublk_index_idr for the case where userspace has the
kernel allocate the dev_id/ub_number.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231012150600.6198-2-michael.christie@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge in io_uring fixes, as the ublk simplifying cancelations and
aborts depend on the two patches from Ming adding cancelation support
for uring_cmd.
* for-6.7/io_uring:
io_uring/kbuf: Use slab for struct io_buffer objects
io_uring/kbuf: Allow the full buffer id space for provided buffers
io_uring/kbuf: Fix check of BID wrapping in provided buffers
io_uring/rsrc: cleanup io_pin_pages()
io_uring: cancelable uring_cmd
io_uring: retain top 8bits of uring_cmd flags for kernel internal use
io_uring: add IORING_OP_WAITID support
exit: add internal include file with helpers
exit: add kernel_waitid_prepare() helper
exit: move core of do_wait() into helper
exit: abstract out should_wake helper for child_wait_callback()
io_uring/rw: add support for IORING_OP_READ_MULTISHOT
io_uring/rw: mark readv/writev as vectored in the opcode definition
io_uring/rw: split io_read() into a helper
Pull MD changes from Song:
"1. Rewrite mddev_suspend(), by Yu Kuai;
2. Simplify md_seq_ops, by Yu Kuai;
3. Reduce unnecessary locking array_state_store(), by Mariusz Tkaczyk."
* tag 'md-next-20231012' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (23 commits)
md: rename __mddev_suspend/resume() back to mddev_suspend/resume()
md: remove old apis to suspend the array
md: suspend array in md_start_sync() if array need reconfiguration
md/raid5: replace suspend with quiesce() callback
md/md-linear: cleanup linear_add()
md: cleanup mddev_create/destroy_serial_pool()
md: use new apis to suspend array before mddev_create/destroy_serial_pool
md: use new apis to suspend array for ioctls involed array reconfiguration
md: use new apis to suspend array for adding/removing rdev from state_store()
md: use new apis to suspend array for sysfs apis
md/raid5: use new apis to suspend array
md/raid5-cache: use new apis to suspend array
md/md-bitmap: use new apis to suspend array for location_store()
md/dm-raid: use new apis to suspend array
md: add new helpers to suspend/resume and lock/unlock array
md: add new helpers to suspend/resume array
md: replace is_md_suspended() with 'mddev->suspended' in md_check_recovery()
md/raid5-cache: use READ_ONCE/WRITE_ONCE for 'conf->log'
md: use READ_ONCE/WRITE_ONCE for 'suspend_lo' and 'suspend_hi'
md/raid1: don't split discard io for write behind
...
From Yu Kuai, written by Song Liu
Recent tests with raid10 revealed many issues with the following scenarios:
- add or remove disks to the array
- issue io to the array
At first, we fixed each problem independently respect that io can
concurrent with array reconfiguration. However, with more issues reported
continuously, I am hoping to fix these problems thoroughly.
Refer to how block layer protect io with queue reconfiguration (for
example, change elevator):
blk_mq_freeze_queue
-> wait for all io to be done, and prevent new io to be dispatched
// reconfiguration
blk_mq_unfreeze_queue
I think we can do something similar to synchronize io with array
reconfiguration.
Current synchronization works as the following. For the reconfiguration
operation:
1. Hold 'reconfig_mutex';
2. Check that rdev can be added/removed, one condition is that there is no
IO (for example, check nr_pending).
3. Do the actual operations to add/remove a rdev, one procedure is
set/clear a pointer to rdev.
4. Check if there is still no IO on this rdev, if not, revert the
change.
IO path uses rcu_read_lock/unlock() to access rdev.
- rcu is used wrongly;
- There are lots of places involved that old rdev can be read, however,
many places doesn't handle old value correctly;
- Between step 3 and 4, if new io is dispatched, NULL will be read for
the rdev, and data will be lost if step 4 failed.
The new synchronization is similar to blk_mq_freeze_queue(). To add or
remove disk:
1. Suspend the array, that is, stop new IO from being dispatched
and wait for inflight IO to finish.
2. Add or remove rdevs to array;
3. Resume the array;
IO path doesn't need to change for now, and all rcu implementation can
be removed.
Then main work is divided into 3 steps:
First, first make sure new apis to suspend the array is general:
- make sure suspend array will wait for io to be done(Done by [1]);
- make sure suspend array can be called for all personalities(Done by [2]);
- make sure suspend array can be called at any time(Done by [3]);
- make sure suspend array doesn't rely on 'reconfig_mutex'(PATCH 3-5);
Second replace old apis with new apis(PATCH 6-16). Specifically, the
synchronization is changed from:
lock reconfig_mutex
suspend array
make changes
resume array
unlock reconfig_mutex
to:
suspend array
lock reconfig_mutex
make changes
unlock reconfig_mutex
resume array
Finally, for the remain path that involved reconfiguration, suspend the
array first(PATCH 11,12, [4] and PATCH 17):
Preparatory work:
[1] https://lore.kernel.org/all/20230621165110.1498313-1-yukuai1@huaweicloud.com/
[2] https://lore.kernel.org/all/20230628012931.88911-2-yukuai1@huaweicloud.com/
[3] https://lore.kernel.org/all/20230825030956.1527023-1-yukuai1@huaweicloud.com/
[4] https://lore.kernel.org/all/20230825031622.1530464-1-yukuai1@huaweicloud.com/
* md-suspend-rewrite:
md: rename __mddev_suspend/resume() back to mddev_suspend/resume()
md: remove old apis to suspend the array
md: suspend array in md_start_sync() if array need reconfiguration
md/raid5: replace suspend with quiesce() callback
md/md-linear: cleanup linear_add()
md: cleanup mddev_create/destroy_serial_pool()
md: use new apis to suspend array before mddev_create/destroy_serial_pool
md: use new apis to suspend array for ioctls involed array reconfiguration
md: use new apis to suspend array for adding/removing rdev from state_store()
md: use new apis to suspend array for sysfs apis
md/raid5: use new apis to suspend array
md/raid5-cache: use new apis to suspend array
md/md-bitmap: use new apis to suspend array for location_store()
md/dm-raid: use new apis to suspend array
md: add new helpers to suspend/resume and lock/unlock array
md: add new helpers to suspend/resume array
md: replace is_md_suspended() with 'mddev->suspended' in md_check_recovery()
md/raid5-cache: use READ_ONCE/WRITE_ONCE for 'conf->log'
md: use READ_ONCE/WRITE_ONCE for 'suspend_lo' and 'suspend_hi'
raid5 is the only personality to suspend array in check_reshape() and
start_reshape() callback, suspend and quiesce() callback can both wait
for all normal io to be done, and prevent new io to be dispatched, the
difference is that suspend is implemented in common layer, and quiesce()
callback is implemented in raid5.
In order to cleanup all the usage of mddev_suspend(), the new apis
__mddev_suspend() need to be called before 'reconfig_mutex' is held,
and it's not good to affect all the personalities in common layer just
for raid5. Hence replace suspend with quiesce() callaback, prepare to
reomove all the users of mddev_suspend().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-17-yukuai1@huaweicloud.com
User can write 'remove' and 're-add' to trigger array reconfiguration
through sysfs, suspend array in this case so that io won't concurrent
with array reconfiguration.
And now that all the caller of add_bound_rdev() alread suspend the
array, remove mddev_suspend/resume() from add_bound_rdev() as well.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-12-yukuai1@huaweicloud.com
The new helpers suspend the array first and then lock the array,
Prepare to refactor from:
mddev_lock/lock_nointr
mddev_suspend
...
mddev_resuem
mddev_lock
With:
mddev_suspend_and_lock/lock_nointr
...
mddev_unlock_and_resume
After all the use cases is refactored, mddev_suspend/resume() will be
removed.
And mddev_suspend_and_lock() will also replace mddev_lock() for the case
that the array will be reconfigured, in order to synchronize with io to
prevent problems in many corner cases.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-6-yukuai1@huaweicloud.com
Advantages for new apis:
- reconfig_mutex is not required;
- the weird logical that suspend array hold 'reconfig_mutex' for
mddev_check_recovery() to update superblock is not needed;
- the specail handling, 'pers->prepare_suspend', for raid456 is not
needed;
- It's safe to be called at any time once mddev is allocated, and it's
designed to be used from slow path where array configuration is changed;
- the new helpers is designed to be called before mddev_lock(), hence
it support to be interrupted by user as well.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-5-yukuai1@huaweicloud.com
Prepare to cleanup pers->prepare_suspend(), which is used to fix a
deadlock in raid456 by returning error for io that is waiting for
reshape to make progress in mddev_suspend().
This change will allow reshape to make progress while waiting for io to
be done in mddev_suspend() in following patches.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-4-yukuai1@huaweicloud.com
The allocation of struct io_buffer for metadata of provided buffers is
done through a custom allocator that directly gets pages and
fragments them. But, slab would do just fine, as this is not a hot path
(in fact, it is a deprecated feature) and, by keeping a custom allocator
implementation we lose benefits like tracking, poisoning,
sanitizers. Finally, the custom code is more complex and requires
keeping the list of pages in struct ctx for no good reason. This patch
cleans this path up and just uses slab.
I microbenchmarked it by forcing the allocation of a large number of
objects with the least number of io_uring commands possible (keeping
nbufs=USHRT_MAX), with and without the patch. There is a slight
increase in time spent in the allocation with slab, of course, but even
when allocating to system resources exhaustion, which is not very
realistic and happened around 1/2 billion provided buffers for me, it
wasn't a significant hit in system time. Specially if we think of a
real-world scenario, an application doing register/unregister of
provided buffers will hit ctx->io_buffers_cache more often than actually
going to slab.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20231005000531.30800-4-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
nbufs tracks the number of buffers and not the last bgid. In 16-bit, we
have 2^16 valid buffers, but the check mistakenly rejects the last
bid. Let's fix it to make the interface consistent with the
documentation.
Fixes: ddf0322db7 ("io_uring: add IORING_OP_PROVIDE_BUFFERS")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20231005000531.30800-3-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 3851d25c75 ("io_uring: check for rollover of buffer ID when
providing buffers") introduced a check to prevent wrapping the BID
counter when sqe->off is provided, but it's off-by-one too
restrictive, rejecting the last possible BID (65534).
i.e., the following fails with -EINVAL.
io_uring_prep_provide_buffers(sqe, addr, size, 0xFFFF, 0, 0);
Fixes: 3851d25c75 ("io_uring: check for rollover of buffer ID when providing buffers")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20231005000531.30800-2-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The length values for volume label type and volume label id are
hard-coded in several places. Provide defines for those values and
replace all occurrences accordingly.
Note that the length is defined and used, and not the size since the
volume label type string and volume label id string are not
nul-terminated.
Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20230915131001.697070-4-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
strncpy() is deprecated and needs to be replaced. The volume label
information strings are not nul-terminated and strncpy() can simply be
replaced with memcpy().
To enhance the readability of find_label() alongside this change, the
following improvements are made:
- Introduce the array dasd_vollabels[] containing all information
necessary for the label detection.
- Provide a helper function to obtain an index value corresponding to a
volume label type. This allows the use of a switch statement to reduce
indentation levels.
- The 'temp' variable is used to check against valid volume label types.
In the good case, this variable already contains the volume label type
making it unnecessary to copy the information again from e.g.
label->vol.vollbl. Remove the 'temp' variable and the second copy as
all information are already provided.
- Remove the 'found' variable and replace it with early returns
Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20230915131001.697070-3-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
`strncpy` is deprecated for use on NUL-terminated destination strings [1].
We should favor a more robust and less ambiguous interface.
We expect that both `nullb->disk_name` and `disk->disk_name` be
NUL-terminated:
| snprintf(nullb->disk_name, sizeof(nullb->disk_name),
| "%s", config_item_name(&dev->group.cg_item));
...
| pr_info("disk %s created\n", nullb->disk_name);
It seems like NUL-padding may be required due to __assign_disk_name()
utilizing a memcpy as opposed to a `str*cpy` api.
| static inline void __assign_disk_name(char *name, struct gendisk *disk)
| {
| if (disk)
| memcpy(name, disk->disk_name, DISK_NAME_LEN);
| else
| memset(name, 0, DISK_NAME_LEN);
| }
Then we go and print it with `__print_disk_name` which wraps `nullb_trace_disk_name()`.
| #define __print_disk_name(name) nullb_trace_disk_name(p, name)
This function obviously expects a NUL-terminated string.
| const char *nullb_trace_disk_name(struct trace_seq *p, char *name)
| {
| const char *ret = trace_seq_buffer_ptr(p);
|
| if (name && *name)
| trace_seq_printf(p, "disk=%s, ", name);
| trace_seq_putc(p, 0);
|
| return ret;
| }
>From the above, we need both 1) a NUL-terminated string and 2) a
NUL-padded string. So, let's use strscpy_pad() as per Kees' suggestion
from v1.
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Cc: Kees Cook <keescook@chromium.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230919-strncpy-drivers-block-null_blk-main-c-v3-1-10cf0a87a2c3@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This function is overly convoluted with a goto error path, and checks
under the mmap_read_lock() that don't need to be at all. Rearrange it
a bit so the checks and errors fall out naturally, rather than needing
to jump around for it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull MD updates from Song:
"1. Make rdev add/remove independent from daemon thread, by Yu Kuai;
2. Refactor code around quiesce() and mddev_suspend(), by Yu Kuai."
* tag 'md-next-20230927' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md: replace deprecated strncpy with memcpy
md/md-linear: Annotate struct linear_conf with __counted_by
md: don't check 'mddev->pers' and 'pers->quiesce' from suspend_lo_store()
md: don't check 'mddev->pers' from suspend_hi_store()
md-bitmap: suspend array earlier in location_store()
md-bitmap: remove the checking of 'pers->quiesce' from location_store()
md: don't rely on 'mddev->pers' to be set in mddev_suspend()
md: initialize 'writes_pending' while allocating mddev
md: initialize 'active_io' while allocating mddev
md: delay remove_and_add_spares() for read only array to md_start_sync()
md: factor out a helper rdev_addable() from remove_and_add_spares()
md: factor out a helper rdev_is_spare() from remove_and_add_spares()
md: factor out a helper rdev_removeable() from remove_and_add_spares()
md: delay choosing sync action to md_start_sync()
md: factor out a helper to choose sync action from md_check_recovery()
md: use separate work_struct for md_start_sync()
uring_cmd may never complete, such as ublk, in which uring cmd isn't
completed until one new block request is coming from ublk block device.
Add cancelable uring_cmd to provide mechanism to driver for cancelling
pending commands in its own way.
Add API of io_uring_cmd_mark_cancelable() for driver to mark one command as
cancelable, then io_uring will cancel this command in
io_uring_cancel_generic(). ->uring_cmd() callback is reused for canceling
command in driver's way, then driver gets notified with the cancelling
from io_uring.
Add API of io_uring_cmd_get_task() to help driver cancel handler
deal with the canceling.
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Retain top 8bits of uring_cmd flags for kernel internal use, so that we
can move IORING_URING_CMD_POLLED out of uapi header.
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Before this patch, the implementation is hacky and hard to understand:
1) md_seq_start set pos to 1;
2) md_seq_show found pos is 1, then print Personalities;
3) md_seq_next found pos is 1, then it update pos to the first mddev;
4) md_seq_show found pos is not 1 or 2, show mddev;
5) md_seq_next found pos is not 1 or 2, update pos to next mddev;
6) loop 4-5 until the last mddev, then md_seq_next update pos to 2;
7) md_seq_show found pos is 2, then print unused devices;
8) md_seq_next found pos is 2, stop;
This patch remove the magic value and use seq_list_start/next/stop()
directly, and move printing "Personalities" to md_seq_start(),
"unsed devices" to md_seq_stop():
1) md_seq_start print Personalities, and then set pos to first mddev;
2) md_seq_show show mddev;
3) md_seq_next update pos to next mddev;
4) loop 2-3 until the last mddev;
5) md_seq_stop print unsed devices;
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230927061241.1552837-3-yukuai1@huaweicloud.com