Extend blk_rq_map_user_iov so that it can handle bvec iterator, using
the new blk_rq_map_user_bvec function. It maps the pages from bvec
iterator into a bio and place the bio into request.
This helper will be used by nvme for uring-passthrough path when IO is
done using pre-mapped buffers.
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220930062749.152261-11-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With end_io handlers now being able to potentially pass ownership of
the request upon completion, we can allow requests with end_io handlers
in the batch completion handling.
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Co-developed-by: Stefan Roesch <shr@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Everything is just converted to returning RQ_END_IO_NONE, and there
should be no functional changes with this patch.
In preparation for allowing the end_io handler to pass ownership back
to the block layer, rather than retain ownership of the request.
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The filesystem IO path can take advantage of allocating batches of
requests, if the underlying submitter tells the block layer about it
through the blk_plug. For passthrough IO, the exported API is the
blk_mq_alloc_request() helper, and that one does not allow for
request caching.
Wire up request caching for blk_mq_alloc_request(), which is generally
done without having a bio available upfront.
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We've never had any useful reports from this BUG_ON(), and in fact a
number of the BUG_ON()'s in the flush handling need to be turned into
more graceful handling.
In preparation for allowing batched completions of the end_io handling,
where we can enter the flush completion with queuelist having been reused
for the batch, get rid of this BUG_ON().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* for-6.1/block: (162 commits)
sbitmap: fix lockup while swapping
block: add rationale for not using blk_mq_plug() when applicable
block: adapt blk_mq_plug() to not plug for writes that require a zone lock
s390/dasd: use blk_mq_alloc_disk
blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
nvmet: don't look at the request_queue in nvmet_bdev_set_limits
nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
blk-mq: use quiesced elevator switch when reinitializing queues
block: replace blk_queue_nowait with bdev_nowait
nvme: remove nvme_ctrl_init_connect_q
nvme-loop: use the tagset alloc/free helpers
nvme-loop: store the generic nvme_ctrl in set->driver_data
nvme-loop: initialize sqsize later
nvme-fc: use the tagset alloc/free helpers
nvme-fc: store the generic nvme_ctrl in set->driver_data
nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
nvme-rdma: use the tagset alloc/free helpers
nvme-rdma: store the generic nvme_ctrl in set->driver_data
nvme-tcp: use the tagset alloc/free helpers
nvme-tcp: store the generic nvme_ctrl in set->driver_data
...
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The hctx's run_work may be racing with the elevator switch when
reinitializing hardware queues. The queue is merely frozen in this
context, but that only prevents requests from allocating and doesn't
stop the hctx work from running. The work may get an elevator pointer
that's being torn down, and can result in use-after-free errors and
kernel panics (example below). Use the quiesced elevator switch instead,
and make the previous one static since it is now only used locally.
nvme nvme0: resetting controller
nvme nvme0: 32/0/0 default/read/poll queues
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 80000020c8861067 P4D 80000020c8861067 PUD 250f8c8067 PMD 0
Oops: 0000 [#1] SMP PTI
Workqueue: kblockd blk_mq_run_work_fn
RIP: 0010:kyber_has_work+0x29/0x70
...
Call Trace:
__blk_mq_do_dispatch_sched+0x83/0x2b0
__blk_mq_sched_dispatch_requests+0x12e/0x170
blk_mq_sched_dispatch_requests+0x30/0x60
__blk_mq_run_hw_queue+0x2b/0x50
process_one_work+0x1ef/0x380
worker_thread+0x2d/0x3e0
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220927155652.3260724-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass the gendisk to blkcg_init_disk and blkcg_exit_disk as part of moving
the blk-cgroup infrastructure to be gendisk based. Also remove the
rather pointless kerneldoc comments for these internal functions with a
single caller each.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block fixes from Jens Axboe:
"Fix a regression that's been plaguing us by reverting the offending
commit, as attempts to both reproduce the issue and fix it in a saner
fashion have failed.
Fix for a potential oops condition in the s390 dasd block driver"
* tag 'block-6.0-2022-09-22' of git://git.kernel.dk/linux:
Revert "block: freeze the queue earlier in del_gendisk"
s390/dasd: fix Oops in dasd_alias_get_start_dev due to missing pavgroup
"tg->has_rules" is extended to "tg->has_rules_iops/bps", thus bios that
don't need to be throttled can be checked accurately.
With this patch, bio will be throttled if:
1) Bio is read/write, and corresponding read/write iops limit exist.
2) If corresponding doesn't exist, corresponding bps limit exist and
bio is not throttled before.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921095309.1481289-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, "tg->has_rules" and "tg->flags & THROTL_TG_HAS_IOPS_LIMIT"
both try to bypass bios that don't need to be throttled, however, they are
a little redundant and both not perfect:
1) "tg->has_rules" only distinguish read and write, but not iops and bps
limit.
2) "tg->flags & THROTL_TG_HAS_IOPS_LIMIT" only check if iops limit
exist, read and write is not distinguished, and bps limit is not
checked.
tg->has_rules will extended to distinguish bps and iops in the following
patch. There is no need to keep the flag.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921095309.1481289-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
for-6.0 has the following fix for cgroup_get_from_id().
836ac87d ("cgroup: fix cgroup_get_from_id")
which conflicts with the following two commits in for-6.1.
4534dee9 ("cgroup: cgroup: Honor caller's cgroup NS when resolving cgroup id")
fa7e439c ("cgroup: Homogenize cgroup_get_from_id() return value")
While the resolution is straightforward, the code ends up pretty ugly
afterwards. Let's pull for-6.0-fixes into for-6.1 so that the code can be
fixed up there.
Signed-off-by: Tejun Heo <tj@kernel.org>
Our test found a problem that wbt inflight counter is negative, which
will cause io hang(noted that this problem doesn't exist in mainline):
t1: device create t2: issue io
add_disk
blk_register_queue
wbt_enable_default
wbt_init
rq_qos_add
// wb_normal is still 0
/*
* in mainline, disk can't be opened before
* bdev_add(), however, in old kernels, disk
* can be opened before blk_register_queue().
*/
blkdev_issue_flush
// disk size is 0, however, it's not checked
submit_bio_wait
submit_bio
blk_mq_submit_bio
rq_qos_throttle
wbt_wait
bio_to_wbt_flags
rwb_enabled
// wb_normal is 0, inflight is not increased
wbt_queue_depth_changed(&rwb->rqos);
wbt_update_limits
// wb_normal is initialized
rq_qos_track
wbt_track
rq->wbt_flags |= bio_to_wbt_flags(rwb, bio);
// wb_normal is not 0,wbt_flags will be set
t3: io completion
blk_mq_free_request
rq_qos_done
wbt_done
wbt_is_tracked
// return true
__wbt_done
wbt_rqw_done
atomic_dec_return(&rqw->inflight);
// inflight is decreased
commit 8235b5c1e8 ("block: call bdev_add later in device_add_disk") can
avoid this problem, however it's better to fix this problem in wbt:
1) Lower kernel can't backport this patch due to lots of refactor.
2) Root cause is that wbt call rq_qos_add() before wb_normal is
initialized.
Fixes: e34cbd3074 ("blk-wbt: add general throttling mechanism")
Cc: <stable@vger.kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20220913105749.3086243-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit a09b314005.
Dusty Mabe reported consistent hang during CoreOS shutdown with a MD
RAID1 setup. Although apparently similar hangs happened before,
and this patch most likely is not the root cause it made it much
more severe. Revert it until we can figure out what is going on
with the md driver.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220919144049.978907-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block fixes from Jens Axboe:
"Two fixes for -rc6:
- Fix a mixup of sectors and bytes in the secure erase ioctl
(Mikulas)
- Fix for a bad return value for a non-blocking bio/blk queue enter
call (me)"
* tag 'block-6.0-2022-09-16' of git://git.kernel.dk/linux-block:
blk-lib: fix blkdev_issue_secure_erase
block: blk_queue_enter() / __bio_queue_enter() must return -EAGAIN for nowait
There's a bug in blkdev_issue_secure_erase. The statement
"unsigned int len = min_t(sector_t, nr_sects, max_sectors);"
sets the variable "len" to the length in sectors, but the statement
"bio->bi_iter.bi_size = len" treats it as if it were in bytes.
The statements "sector += len << SECTOR_SHIFT" and "nr_sects -= len <<
SECTOR_SHIFT" are thinko.
This patch fixes it.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # v5.19
Fixes: 44abff2c0b ("block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD")
Link: https://lore.kernel.org/r/alpine.LRH.2.02.2209141549480.28100@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Today blk_queue_enter() and __bio_queue_enter() return -EBUSY for the
nowait code path. This is not correct: they should return -EAGAIN
instead.
This problem was detected by fio. The following command exposed the
above problem:
t/io_uring -p0 -d128 -b4096 -s32 -c32 -F1 -B0 -R0 -X1 -n24 -P1 -u1 -O0 /dev/ng0n1
By applying the patch, the retry case is handled correctly in the slow
path.
Signed-off-by: Stefan Roesch <shr@fb.com>
Fixes: bfd343aa17 ("blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
tg_update_disptime() only need to adjust postion for 'tg' in
'parent_sq', there is no need to call throtl_enqueue/dequeue_tg(),
since they will set/clear flag THROTL_TG_PENDING and increase/decrease
nr_pending, which is useless. By the way, clear the flag/decrease
nr_pending while there are still throttled bios is not good for debugging.
There are no functional changes.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220827101637.1775111-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>