A split bio still needs to be throttled when an IOPS limit is set, even
though it has already been marked BIO_THROTTLED, because the block layer
accounts each split bio as a separate I/O.
If only a throughput limit is set up, there is no need to throttle again
once BIO_THROTTLED is set, since the bytes of the whole bio have already
been accounted and considered.
Add a THROTL_TG_HAS_IOPS_LIMIT flag to serve this purpose.
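The decision described above can be sketched as a small userspace model.
This is only an illustration under stated assumptions: struct model_tg,
struct model_bio, model_should_throttle() and the flag value used here are
hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in; the bit value is illustrative only. */
#define THROTL_TG_HAS_IOPS_LIMIT (1u << 0)

/* Hypothetical model of a throttle group. */
struct model_tg {
	unsigned int flags;	/* THROTL_TG_HAS_IOPS_LIMIT if an IOPS limit is set */
	bool has_rules;		/* some limit applies to this group */
};

/* Hypothetical model of a bio; 'throttled' stands for BIO_THROTTLED. */
struct model_bio {
	bool throttled;
};

/* Return true if the bio still has to go through the throttle path. */
static bool model_should_throttle(const struct model_tg *tg,
				  const struct model_bio *bio)
{
	assert(tg && bio);

	/* Bytes were charged already; only an IOPS limit needs another pass. */
	if (bio->throttled && !(tg->flags & THROTL_TG_HAS_IOPS_LIMIT))
		return false;

	return tg->has_rules;
}
```

In this model a bio that is already marked throttled skips the throttle
path for a throughput-only group, but not for a group with an IOPS limit.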
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 111be88398 ("block-throttle: avoid double charge") marks a bio as
BIO_THROTTLED unconditionally once __blk_throtl_bio() has been called on
it, so that the bio never enters __blk_throtl_bio() again. The intent is
to avoid double charging when a bio is split. That is reasonable for the
read/write throughput limit, but not for the IOPS limit, because the
block layer does I/O accounting against each split bio.
Chunguang Xu has already observed this issue and fixed it in commit
4f1e9630af ("blk-throtl: optimize IOPS throttle for large IO scenarios").
However, that patch only covers bio splitting in __blk_queue_split(), and
there are other kinds of bio splitting, such as bio_split() combined with
submit_bio_noacct(), among others.
This patch fixes the issue in a generic way by always charging the bio
for the IOPS limit in blk_throtl_bio(). This is reasonable: a
re-submitted or fast-cloned bio is charged if it is submitted to the
same disk/queue, and BIO_THROTTLED is cleared if bio->bi_bdev changes.
The new approach gives a much smoother and more stable IOPS limit than
commit 4f1e9630af ("blk-throtl: optimize IOPS throttle for large IO
scenarios"), since that commit cannot actually throttle the current split
bios.
This approach also does not introduce a new double IOPS charge in
blk_throtl_dispatch_work_fn(), because blk_throtl_bio() is no longer
called on that path.
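The charging rule can be sketched as a simplified userspace model: bytes
are charged only the first time a bio is seen, while one I/O is charged on
every submission. The names struct model_counters and model_charge_bio()
are hypothetical and only illustrate the idea; they are not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-group dispatch counters. */
struct model_counters {
	uint64_t bytes_disp;	/* bytes charged against the throughput limit */
	unsigned int io_disp;	/* I/Os charged against the IOPS limit */
};

/*
 * Charge one submission. 'already_throttled' stands for BIO_THROTTLED
 * being set, i.e. the bytes of the whole bio were charged earlier.
 */
static void model_charge_bio(struct model_counters *c, uint64_t bio_size,
			     bool already_throttled)
{
	assert(c);

	if (!already_throttled)
		c->bytes_disp += bio_size;

	/* The IOPS charge is applied on every submission. */
	c->io_disp++;
}
```

For example, a 1 MiB bio that ends up as four pieces is charged 1 MiB once
on the first submission and one I/O for each of the four submissions in
this model.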
Reported-by: Ning Li <lining2020x@163.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Chunguang Xu <brookxu@tencent.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both ->start_time_ns and RQF_IO_STAT are set in blk_mq_rq_ctx_init when
dm-mpath allocates the request using blk_mq_alloc_request.
The block layer also ensures ->start_time_ns is only set when actually
needed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220215100540.3892965-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The code to stack blk-mq drivers is only used by dm-multipath, and
will preferably stay that way. Make it optional and only selected
by device mapper, so that the buildbots more easily catch abuses
like the one that slipped into the ufs driver in the last merge
window. Another positive side effect is that kernel builds without
device mapper shrink a little bit as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220215100540.3892965-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This document is completely out of date and extremely misleading. In
general the existing kerneldoc comments serve as much better
documentation of the functionality that still exists, while the history
blurbs are pretty much irrelevant today.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220215081047.3693582-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit 7eaceaccab ("block: remove per-queue plugging"), the kernel
has removed blk_run_address_space(), blk_unplug() and sync_buffer(),
and moved to on-stack plugging. The document has been obsolete for
years.
Given that there are no obvious counterparts in the new mechanism to
replace the old APIs, this patch drops the content directly.
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Link: https://lore.kernel.org/r/20220207074931.20067-1-song.bao.hua@hisilicon.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, rasdaemon uses the existing tracepoint block_rq_complete
and filters out non-error cases in order to capture block disk errors.
But there are a few problems with this approach:
1. Even though the kernel trace filter can do the filtering, there is
still some overhead once this tracepoint is enabled.
2. The filter is based merely on errno, which does not match the kernel
logic that print_req_error() uses to check for errors.
3. block_rq_complete only provides the device major and minor numbers to
identify the block device, which is inconvenient to use in user space.
So introduce a new tracepoint block_rq_error just for the error case.
With this patch, rasdaemon could switch to block_rq_error.
Since the new tracepoint has an implementation similar to
block_rq_complete, move the existing code from TRACE_EVENT
block_rq_complete() into a new event class, block_rq_completion(). Then
add events for block_rq_complete and block_rq_error from the newly
created event class, per the suggestion from Chaitanya Kulkarni.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220210225222.260069-1-shy828301@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Only the last sbitmap_word can have a different depth; all the others
must have the same depth of 1U << sb->shift. So it is not necessary to
store the depth in sbitmap_word, and it can be retrieved easily and
efficiently by adding one internal helper, __map_depth(sb, index).
Remove the 'depth' field from sbitmap_word; the
____cacheline_aligned_in_smp annotation for 'word' is then no longer
needed.
No performance effect was observed when running a highly parallel IOPS
test on null_blk.
This saves one cacheline (usually 64 bytes) per sbitmap_word.
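The depth lookup described above can be sketched in userspace as follows.
struct model_sbitmap and model_map_depth() are hypothetical stand-ins that
only keep the three fields needed for the computation; they are not the
kernel's struct sbitmap or __map_depth().

```c
#include <assert.h>

/* Hypothetical model holding only the fields used for the lookup. */
struct model_sbitmap {
	unsigned int depth;	/* total number of bits in the bitmap */
	unsigned int shift;	/* log2 of the number of bits per word */
	unsigned int map_nr;	/* number of words in the bitmap */
};

/* Depth of word 'index': only the last word may be shorter. */
static unsigned int model_map_depth(const struct model_sbitmap *sb,
				    unsigned int index)
{
	assert(sb && index < sb->map_nr);

	if (index == sb->map_nr - 1)
		return sb->depth - (index << sb->shift);

	return 1U << sb->shift;
}
```

With depth = 100, shift = 5 and map_nr = 4, words 0 to 2 have depth 32 and
the last word has depth 100 - 96 = 4.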
Cc: Martin Wilck <martin.wilck@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20220110072945.347535-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace open coded bio_clone_fast implementations with the actual helper.
Note that the bio allocated as part of the dm_io structure in alloc_io
will only actually be used later in alloc_tio, making this earlier
cloning of the information safe.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__bio_clone_fast should also clone integrity and crypto data, as a clone
without those is incomplete. Right now the only caller that can actually
support crypto and integrity data (dm) does it manually for the one
callchain that supports these, but we better do it properly in the core.
Note that all callers except for the above mentioned one also don't need
to handle failure at all, given that the integrity and crypto clones are
based on mempool allocations that won't fail for sleeping allocations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, drivers report BLK_STS_IOERR for devices that are not fully
online or are being removed. This behavior can confuse users, as these
are not really I/O errors from the device.
Solve this issue with a new state BLK_STS_OFFLINE, which reports "device
offline error" in dmesg instead of "I/O error".
EIO is intentionally kept so that the user-visible return value does not
change.
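The idea of a distinct status that still maps to the same errno can be
sketched with a small userspace table. The enum model_status, the
model_errors table and the helper names are hypothetical; the strings
follow the messages quoted above, with the caller expected to append
" error" when printing.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical status codes modelling the block layer statuses. */
enum model_status {
	MODEL_STS_OK,
	MODEL_STS_IOERR,
	MODEL_STS_OFFLINE,
	MODEL_STS_NR
};

struct model_err {
	int errno_val;		/* value returned to user space */
	const char *name;	/* prefix used in the log message */
};

static const struct model_err model_errors[MODEL_STS_NR] = {
	[MODEL_STS_OK]      = { 0,    "" },
	[MODEL_STS_IOERR]   = { -EIO, "I/O" },
	/* Same errno as IOERR, but a different log message. */
	[MODEL_STS_OFFLINE] = { -EIO, "device offline" },
};

static int model_status_to_errno(enum model_status s)
{
	assert((unsigned int)s < MODEL_STS_NR);
	return model_errors[s].errno_val;
}

static const char *model_status_name(enum model_status s)
{
	assert((unsigned int)s < MODEL_STS_NR);
	return model_errors[s].name;
}
```

Both error statuses return -EIO in this model, so user space sees no
change, while the logged text differs.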
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220203192827.1370270-2-song@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This code triggers a Smatch warning:
fs/ntfs3/fsntfs.c:1606 ntfs_bio_fill_1()
warn: variable dereferenced before check 'bio' (see line 1591)
The "bio" pointer cannot be NULL so there is no need to check.
Originally there was more extensive NULL checking but it was removed
because bio_alloc() will never fail if it is allowed to sleep.
Remove this check as well.
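The pattern behind this kind of warning can be illustrated with a generic
userspace example. struct model_buf and both functions are hypothetical
and unrelated to the ntfs3 code; they only show a pointer that is
dereferenced before it is checked, and the cleaned-up form for the case
where callers guarantee a non-NULL pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structure used only for this illustration. */
struct model_buf {
	size_t len;
};

/* Flagged pattern: 'b' is dereferenced first and checked afterwards. */
static size_t model_len_checked_late(struct model_buf *b)
{
	size_t len = b->len;	/* dereference happens here */

	if (!b)			/* too late to protect the line above */
		return 0;

	return len;
}

/* Cleaned-up form when 'b' can never be NULL: no check at all. */
static size_t model_len(struct model_buf *b)
{
	return b->len;
}
```

Both functions behave identically for valid pointers; the late check in
the first one adds nothing, which is why removing it is safe when the
pointer cannot be NULL.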
Fixes: 39146b6f66 ("ntfs3: remove ntfs_alloc_bio")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220128140922.GA29766@kili
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_needs_flush_plug fails to account for the cb_list, which needs
flushing as well. Remove it and just check if there is a plug instead
of poking into the internals of the plug structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220127070549.1377856-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass the block_device that we plan to use this bio for and the
operation to bio_reset to optimize the assignment. A NULL block_device
can be passed, both for the passthrough case on a raw request_queue and
to temporarily avoid refactoring some nasty code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass the block_device that we plan to use this bio for and the
operation to bio_init to optimize the assignment. A NULL block_device
can be passed, both for the passthrough case on a raw request_queue and
to temporarily avoid refactoring some nasty code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass the block_device and operation that we plan to use this bio for to
bio_alloc to optimize the assignment. NULL/0 can be passed, both for the
passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.
Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>