Commit Graph

1382501 Commits

Author SHA1 Message Date
Caleb Sander Mateos
692cf47e1a ublk: don't access ublk_queue in ublk_register_io_buf()
For ublk servers with many ublk queues, accessing the ublk_queue in
ublk_register_io_buf() is a frequent cache miss. Get the flags from the
ublk_device instead, which is accessed earlier in
ublk_ch_uring_cmd_local().
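
A minimal sketch of the idea (struct layouts and flag bits here are
illustrative, not the actual driver code): read the feature flags from the
ublk_device the caller has already touched, instead of dereferencing the
likely-cold per-queue structure.

  /* illustrative sketch only */
  struct ublk_device { unsigned long long flags; };
  struct ublk_queue  { unsigned long long flags; };

  /* before: extra cache miss on the per-queue structure */
  static int may_register_buf_old(const struct ublk_queue *ubq)
  {
          return ubq->flags & (1ULL << 0); /* e.g. zero-copy bit */
  }

  /* after: reuse the ublk_device already accessed by the caller */
  static int may_register_buf_new(const struct ublk_device *ub)
  {
          return ub->flags & (1ULL << 0);
  }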

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20 06:36:27 -06:00
Caleb Sander Mateos
8a81926e45 ublk: pass ublk_device to ublk_register_io_buf()
Avoid repeating the 2 dereferences to get the ublk_device from the
io_uring_cmd by passing it from ublk_ch_uring_cmd_local() to
ublk_register_io_buf().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20 06:36:27 -06:00
Caleb Sander Mateos
b40dcdf823 ublk: don't dereference ublk_queue in ublk_check_and_get_req()
For ublk servers with many ublk queues, accessing the ublk_queue in
ublk_ch_{read,write}_iter() is a frequent cache miss. Get the flags and
queue depth from the ublk_device instead, which is accessed just before.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20 06:36:27 -06:00
Caleb Sander Mateos
5125535f90 ublk: don't dereference ublk_queue in ublk_ch_uring_cmd_local()
For ublk servers with many ublk queues, accessing the ublk_queue to
handle a ublk command is a frequent cache miss. Get the queue depth from
the ublk_device instead, which is accessed just before.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20 06:36:27 -06:00
Caleb Sander Mateos
d74a383ec7 ublk: add helpers to check ublk_device flags
Introduce ublk_device analogues of the ublk_queue flag helpers:
- ublk_support_zero_copy() -> ublk_dev_support_zero_copy()
- ublk_support_auto_buf_reg() -> ublk_dev_support_auto_buf_reg()
- ublk_support_user_copy() -> ublk_dev_support_user_copy()
- ublk_need_map_io() -> ublk_dev_need_map_io()
- ublk_need_req_ref() -> ublk_dev_need_req_ref()
- ublk_need_get_data() -> ublk_dev_need_get_data()

These will be used in subsequent changes to avoid accessing the
ublk_queue just for the flags, and instead use the ublk_device.
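
A compile-clean sketch of the shape of these helpers (the flag values match
the ublk UAPI feature bits, but the struct layout is simplified):

  #include <stdbool.h>

  #define UBLK_F_SUPPORT_ZERO_COPY (1ULL << 0)
  #define UBLK_F_USER_COPY         (1ULL << 7)

  struct ublk_device { unsigned long long flags; };

  static inline bool ublk_dev_support_zero_copy(const struct ublk_device *ub)
  {
          return ub->flags & UBLK_F_SUPPORT_ZERO_COPY;
  }

  static inline bool ublk_dev_support_user_copy(const struct ublk_device *ub)
  {
          return ub->flags & UBLK_F_USER_COPY;
  }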

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20 06:36:27 -06:00
Caleb Sander Mateos
0265595002 ublk: don't pass ublk_queue to __ublk_fail_req()
__ublk_fail_req() only uses the ublk_queue to get the ublk_device, which
its caller already has. So just pass the ublk_device directly.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20 06:36:27 -06:00
Caleb Sander Mateos
b7e255b034 ublk: don't pass q_id to ublk_queue_cmd_buf_size()
ublk_queue_cmd_buf_size() only needs the queue depth, which is the same
for all queues. Get the queue depth from the ublk_device instead so the
q_id parameter can be dropped.
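
A sketch of why q_id can go away: the buffer size is a pure function of the
per-device queue depth (the descriptor size and rounding here are
illustrative, not the driver's exact math):

  #include <stddef.h>

  struct ublk_device { unsigned short queue_depth; };

  #define IO_DESC_SIZE 24 /* illustrative descriptor size */

  static size_t ublk_queue_cmd_buf_size(const struct ublk_device *ub)
  {
          /* same for every queue, so no q_id parameter is needed */
          return (size_t)ub->queue_depth * IO_DESC_SIZE;
  }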

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20 06:36:27 -06:00
Caleb Sander Mateos
163f80dabf ublk: remove ubq check in ublk_check_and_get_req()
ublk_get_queue() never returns a NULL pointer, so there's no need to
check its return value in ublk_check_and_get_req(). Drop the check.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20 06:36:27 -06:00
Uday Shankar
a755da0dd0 selftests: ublk: add test to verify that feat_map is complete
Add a test that verifies that the currently running kernel does not
report support for any features that are unrecognized by kublk. This
should catch cases where features are added without updating kublk's
feat_map accordingly, which has happened multiple times in the past (see
[1], [2]).

Note that this new test may fail if the test suite is older than the
kernel, and the newer kernel contains a newly introduced feature. I
believe this is not a use case we currently care about - we only care
about newer test suites passing on older kernels.

[1] https://lore.kernel.org/linux-block/20250606214011.2576398-1-csander@purestorage.com/t/#u
[2] https://lore.kernel.org/linux-block/2a370ab1-d85b-409d-b762-f9f3f6bdf705@nvidia.com/t/#m1c520a058448d594fd877f07804e69b28908533f
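
A sketch of how such a completeness check can work (the feat_map layout is
assumed, not taken verbatim from kublk): OR together every feature bit the
test suite knows about, then verify the kernel reports nothing outside that
mask.

  #include <stdint.h>
  #include <stdio.h>

  struct feat { uint64_t bit; const char *name; };

  static const struct feat feat_map[] = {
          { 1ULL << 0, "UBLK_F_SUPPORT_ZERO_COPY" },
          /* ... one entry per feature known to kublk ... */
  };

  static int check_feat_map_complete(uint64_t kernel_features)
  {
          uint64_t known = 0;

          for (size_t i = 0; i < sizeof(feat_map) / sizeof(feat_map[0]); i++)
                  known |= feat_map[i].bit;

          if (kernel_features & ~known) {
                  fprintf(stderr, "unknown features: 0x%llx\n",
                          (unsigned long long)(kernel_features & ~known));
                  return 1; /* fail: feat_map is incomplete */
          }
          return 0;
  }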

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-19 11:06:15 -06:00
Uday Shankar
742bcc1101 selftests: ublk: kublk: add UBLK_F_BUF_REG_OFF_DAEMON to feat_map
When UBLK_F_BUF_REG_OFF_DAEMON was added, we missed updating kublk's
feat_map, which results in the feature being reported as "unknown." Add
UBLK_F_BUF_REG_OFF_DAEMON to feat_map to fix this.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-19 11:06:15 -06:00
Uday Shankar
1f924cf781 selftests: ublk: kublk: simplify feat_map definition
Simplify the definition of feat_map by introducing a helper macro
FEAT_NAME to avoid having to type the feature name twice. As a side
effect, this changes the names in the feature list to be the full macro
name instead of the abbreviated names that were used before, but this is
a good change for clarity.

Using the full feature macro names ruins the alignment of the output, so
change the output format to put each feature's hex value before its
name, as this is easier to align nicely. The output now looks as
follows:

root# ./kublk features
ublk_drv features: 0x7fff
0x1               : UBLK_F_SUPPORT_ZERO_COPY
0x2               : UBLK_F_URING_CMD_COMP_IN_TASK
0x4               : UBLK_F_NEED_GET_DATA
0x8               : UBLK_F_USER_RECOVERY
0x10              : UBLK_F_USER_RECOVERY_REISSUE
0x20              : UBLK_F_UNPRIVILEGED_DEV
0x40              : UBLK_F_CMD_IOCTL_ENCODE
0x80              : UBLK_F_USER_COPY
0x100             : UBLK_F_ZONED
0x200             : UBLK_F_USER_RECOVERY_FAIL_IO
0x400             : UBLK_F_UPDATE_SIZE
0x800             : UBLK_F_AUTO_BUF_REG
0x1000            : UBLK_F_QUIESCE
0x2000            : UBLK_F_PER_IO_DAEMON
0x4000            : unknown
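
A sketch of the macro trick (the exact kublk definition may differ
slightly): the preprocessor's stringizing operator derives the name from
the macro, so each feature only has to be typed once.

  struct feat { unsigned long long bit; const char *name; };

  #define FEAT_NAME(f) { f, #f } /* value and stringized name in one go */

  #define UBLK_F_SUPPORT_ZERO_COPY (1ULL << 0)

  static const struct feat feat_map[] = {
          FEAT_NAME(UBLK_F_SUPPORT_ZERO_COPY),
          /* ... */
  };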

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-19 11:06:15 -06:00
Yu Kuai
336aec7b06 blk-throttle: fix throtl_data leak during disk release
Tightening the throttle activation check in blk_throtl_activated() to
require both q->td presence and the policy bit being set introduced a
memory leak during disk release:

blkg_destroy_all() clears the policy bit first during queue deactivation,
causing the subsequent blk_throtl_exit() to skip throtl_data cleanup when
blk_throtl_activated() fails the policy check.

Ideally we should avoid modifying the blk_throtl_exit() activation check,
because it's intuitive that blk-throttle starts in blk_throtl_init() and
ends in blk_throtl_exit(). However, calling blk_throtl_exit() before
blkg_destroy_all() would make a long-standing deadlock problem easier to
trigger [1]. Hence fix this problem by checking whether q->td is NULL in
blk_throtl_exit(), and remove the policy deactivation as well since it's
useless.
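
A simplified sketch of the resulting exit path (not the exact kernel
code): rely on q->td itself rather than the policy bit that
blkg_destroy_all() has already cleared.

  struct throtl_data;
  struct request_queue { struct throtl_data *td; };

  static void blk_throtl_exit(struct request_queue *q)
  {
          if (!q->td) /* never initialized, or already torn down */
                  return;
          /* ... tear down and free q->td ... */
          q->td = 0;
  }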

[1] https://lore.kernel.org/all/CAHj4cs9p9H5yx+ywsb3CMUdbqGPhM+8tuBvhW=9ADiCjAqza9w@mail.gmail.com/#t

Fixes: bd9fd5be6b ("blk-throttle: fix access race during throttle policy activation")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs-p-ZwBEKigBj7T6hQCOo-H68-kVwCrV6ZvRovrr9Z+HA@mail.gmail.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-17 07:27:29 -06:00
Bart Van Assche
0b507305a0 blk-mq: Fix the blk_mq_tagset_busy_iter() documentation
Commit 2dd6532e95 ("blk-mq: Drop 'reserved' arg of busy_tag_iter_fn")
removed the 'reserved' argument from tag iteration callback functions.
Bring the blk_mq_tagset_busy_iter() documentation in sync with that
change.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: John Garry <john.g.garry@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-17 07:26:29 -06:00
John Garry
da7b97ba0d block: relax atomic write boundary vs chunk size check
blk_validate_atomic_write_limits() ensures that any boundary fits into
and is aligned to any chunk size.

However, it should also be possible to fit the chunk size into any
boundary. That check is already made in
blk_stack_atomic_writes_boundary_head().

Relax the check in blk_validate_atomic_write_limits() by reusing (and
renaming) blk_stack_atomic_writes_boundary_head().

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16 12:29:10 -06:00
John Garry
f2d8c5a2f7 block: fix stacking of atomic writes when atomics are not supported
Atomic write support may not always be possible when stacking devices
that themselves support atomic writes. One such case is a mismatched
atomic write boundary between the stacked devices (which is not
supported).

In the case that atomic writes cannot be supported, the top device's
queue HW limits are set to 0.

However, in blk_stack_atomic_writes_limits(), we detect that we are
stacking the first bottom device by checking whether the top device's
atomic_write_hw_max value is 0. This gets confused with the
atomic-writes-not-supported case above.

Distinguish between stacking the first bottom device and atomic writes
being unsupported by initializing the stacked device's
atomic_write_hw_max to UINT_MAX and checking for that value when stacking
the first bottom device.
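
A sketch of the sentinel scheme (field and function names simplified):
UINT_MAX now means "no bottom device stacked yet", leaving 0 to mean,
unambiguously, "atomic writes not supported".

  #include <limits.h>

  struct lim { unsigned int atomic_write_hw_max; };

  static void stack_init(struct lim *top)
  {
          top->atomic_write_hw_max = UINT_MAX; /* no bottom device yet */
  }

  static void stack_bottom(struct lim *top, const struct lim *bottom)
  {
          if (top->atomic_write_hw_max == UINT_MAX)
                  /* first bottom device: adopt its limits */
                  top->atomic_write_hw_max = bottom->atomic_write_hw_max;
          else if (top->atomic_write_hw_max != bottom->atomic_write_hw_max)
                  /* cannot combine: disable, 0 is now unambiguous */
                  top->atomic_write_hw_max = 0;
  }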

Fixes: d7f36dc446 ("block: Support atomic writes limits for stacked devices")
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16 12:29:10 -06:00
John Garry
bfd4037296 block: update validation of atomic writes boundary for stacked devices
In commit 63d092d1c1 ("block: use chunk_sectors when evaluating stacked
atomic write limits"), the chunk_sectors limit check was missed in
blk_stack_atomic_writes_boundary_head(), so update that function to do
the proper check.

Fixes: 63d092d1c1 ("block: use chunk_sectors when evaluating stacked atomic write limits")
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16 12:29:10 -06:00
chengkaitao
74b1db8684 block/mq-deadline: Remove the redundant rb_entry_rq() in deadline_from_pos()
In commit fde02699c2, the "if (blk_rq_is_seq_zoned_write(rq))" check
was removed, but the "rb_entry_rq(node)" call and some other code were
inadvertently left behind. This patch fixes that.

Signed-off-by: chengkaitao <chengkaitao@kylinos.cn>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Li Nan <linan122@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-15 13:00:05 -06:00
Nathan Chancellor
7935b843ce md/md-llbitmap: Use DIV_ROUND_UP_SECTOR_T
When building for 32-bit platforms, there are several link (if builtin)
or modpost (if a module) errors due to dividends of type 'sector_t' in
DIV_ROUND_UP:

  arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_resize':
  drivers/md/md-llbitmap.c:1017:(.text+0xae8): undefined reference to `__aeabi_uldivmod'
  arm-linux-gnueabi-ld: drivers/md/md-llbitmap.c:1020:(.text+0xb10): undefined reference to `__aeabi_uldivmod'
  arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_end_discard':
  drivers/md/md-llbitmap.c:1114:(.text+0xf14): undefined reference to `__aeabi_uldivmod'
  arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_start_discard':
  drivers/md/md-llbitmap.c:1097:(.text+0x1808): undefined reference to `__aeabi_uldivmod'
  arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_read_sb':
  drivers/md/md-llbitmap.c:867:(.text+0x2080): undefined reference to `__aeabi_uldivmod'
  arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o:drivers/md/md-llbitmap.c:895: more undefined references to `__aeabi_uldivmod' follow

Use DIV_ROUND_UP_SECTOR_T instead of DIV_ROUND_UP, which exists to
handle this exact situation.
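
For reference, the kernel defines the helper along these lines; the point
is that on 32-bit the division is routed through DIV_ROUND_UP_ULL()/
div_u64() so no libgcc routine such as __aeabi_uldivmod is required:

  #if BITS_PER_LONG == 32
  #define DIV_ROUND_UP_SECTOR_T(ll, d)  DIV_ROUND_UP_ULL(ll, d)
  #else
  #define DIV_ROUND_UP_SECTOR_T(ll, d)  DIV_ROUND_UP(ll, d)
  #endif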

Fixes: 5ab829f197 ("md/md-llbitmap: introduce new lockless bitmap")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 18:18:41 -06:00
Yu Kuai
a75fe12fa2 blk-mq: fix stale nr_requests documentation
The nr_requests documentation still describes the removed single-queue
code; remove that text and update it for the current blk-mq code.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai
9784041145 blk-mq: remove blk_mq_tag_update_depth()
This helper is no longer used.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai
b86433721f blk-mq: fix potential deadlock while nr_requests grown
Allocating and freeing sched_tags while the queue is frozen can
deadlock [1]; this is a long-standing problem. Hence allocate memory
before freezing the queue and free memory after the queue is unfrozen.
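
A compile-clean sketch of the reordered flow (all helpers here are
stand-ins, not kernel APIs):

  struct tags;
  struct tags *alloc_sched_tags(unsigned int nr); /* may sleep */
  struct tags *install_tags(struct tags *fresh);  /* returns old tags */
  void free_tags(struct tags *t);
  void freeze_queue(void);
  void unfreeze_queue(void);

  static int update_nr_requests(unsigned int nr)
  {
          struct tags *fresh, *old;

          fresh = alloc_sched_tags(nr); /* allocate before freezing */
          if (!fresh)
                  return -1;            /* -ENOMEM in the kernel */
          freeze_queue();
          old = install_tags(fresh);    /* no allocation while frozen */
          unfreeze_queue();
          free_tags(old);               /* free after unfreezing */
          return 0;
  }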

[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Fixes: e3a2b3f931 ("blk-mq: allow changing of queue depth through sysfs")

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai
6293e336f6 blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags()
This helper only supports allocating the default number of requests;
add a new parameter to support a specific number of requests.

This prepares to fix a potential deadlock in the case where nr_requests
grows.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai
e632004044 blk-mq: split bitmap grow and resize case in blk_mq_update_nr_requests()
No functional changes are intended; this makes the code cleaner and
prepares to fix the grow case in the following patches.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai
7f2799c546 blk-mq: cleanup shared tags case in blk_mq_update_nr_requests()
In the shared tags case, all hctx->sched_tags/tags are the same, so it
doesn't make sense to call into blk_mq_tag_update_depth() multiple times
for the same tags.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai
626ff4f8eb blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock
request_queue->nr_requests can be changed by:

a) switch elevator by updating nr_hw_queues
b) switch elevator by elevator sysfs attribute
c) configure the queue sysfs attribute nr_requests

Current lock order is:

1) update_nr_hwq_lock, case a,b
2) freeze_queue
3) elevator_lock, case a,b,c

Updating nr_requests is already serialized by elevator_lock; however,
in case c) we'll have to allocate new sched_tags if nr_requests grows,
and doing this with elevator_lock held and the queue frozen risks
deadlock.

Hence use update_nr_hwq_lock instead, making it possible to allocate
memory when tags grow, while also preventing nr_requests from being
changed concurrently.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai
b46d4c447d blk-mq: check invalid nr_requests in queue_requests_store()
queue_requests_store() is the only caller of
blk_mq_update_nr_requests(), and blk_mq_update_nr_requests() is the
only caller of blk_mq_tag_update_depth(); however, they all check the
nr_requests value input by the user.

Make the code cleaner by moving all the checks to the top-level function:

1) nr_requests > reserved tags;
2) if there is elevator, 4 <= nr_requests <= 2048;
3) if elevator is none, 4 <= nr_requests <= tag_set->queue_depth;

Meanwhile, case 2 is the only case where tags can grow and -ENOMEM might
be returned.
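
A sketch of the consolidated validation (return values and parameter
names are illustrative; the bounds are as listed above):

  static int validate_nr_requests(unsigned int nr, unsigned int reserved,
                                  unsigned int set_depth, int has_elevator)
  {
          if (nr <= reserved)
                  return -1; /* must leave room beyond reserved tags */
          if (has_elevator)
                  return (nr < 4 || nr > 2048) ? -1 : 0;
          return (nr < 4 || nr > set_depth) ? -1 : 0;
  }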

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai
8bd7195fea blk-mq: remove useless checks in blk_mq_update_nr_requests()
1) queue_requests_store() is the only caller of
blk_mq_update_nr_requests(), where the queue is already frozen, so there
is no need to check mq_freeze_depth;
2) q->tag_set must be set for a request-based device, and queue_is_mq()
is already checked in blk_mq_queue_attr_visible(), so there is no need
to check q->tag_set.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai
dc1dd13d44 blk-mq: remove useless checking in queue_requests_store()
blk_mq_queue_attr_visible() already checks queue_is_mq(); there is no
need to check this again in queue_requests_store().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Caleb Sander Mateos
97e8ba31b8 ublk: consolidate nr_io_ready and nr_queues_ready
ublk_mark_io_ready() tracks whether all the ublk_device's I/Os have been
fetched by incrementing ublk_queue's nr_io_ready count and incrementing
ublk_device's nr_queues_ready count if the whole queue is ready.
Simplify the logic by just tracking the total number of fetched I/Os on
each ublk_device. When this count reaches nr_hw_queues * queue_depth,
the ublk_device is ready to receive I/O.
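
A sketch of the simplified readiness tracking (field names illustrative):
a single per-device count compared against the total expected number of
I/Os.

  struct ublk_dev_state {
          unsigned int nr_io_ready; /* total fetched I/Os, all queues */
          unsigned int nr_hw_queues;
          unsigned int queue_depth;
  };

  static int ublk_dev_ready(const struct ublk_dev_state *ub)
  {
          return ub->nr_io_ready == ub->nr_hw_queues * ub->queue_depth;
  }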

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:24:53 -06:00
Yu Kuai
e0ed2bca7b md/raid0: convert raid0_make_request() to use bio_submit_split_bioset()
Currently, raid0_make_request() remaps the original bio to the underlying
disks to prevent reordered IO. Now that bio_submit_split_bioset() puts
the original bio at the head of current->bio_list, it's safe to convert
to this helper and the bios will still be ordered.

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:46 -06:00
Yu Kuai
b2f5974079 block: fix ordering of recursive split IO
Currently, a split bio is chained to the original bio, and the original
bio is resubmitted to the tail of current->bio_list, waiting for the
split bio to be issued. However, if the split bio gets split again, the
IO order will be messed up. This problem, on the one hand, causes
performance degradation, especially for mdraid with large IO sizes; on
the other hand, it causes write errors for zoned block devices[1].

For example, in raid456 the IO will first be split by max_sectors from
md_submit_bio(), and then later be split again by chunksize for internal
handling:

For example, assume max_sectors is 1M, and chunksize is 512k

1) issue a 2M IO:

bio issuing: 0+2M
current->bio_list: NULL

2) md_submit_bio() split by max_sector:

bio issuing: 0+1M
current->bio_list: 1M+1M

3) chunk_aligned_read() split by chunksize:

bio issuing: 0+512k
current->bio_list: 1M+1M -> 512k+512k

4) after the first bio is issued, __submit_bio_noacct() will continue
issuing the next bio:

bio issuing: 1M+1M
current->bio_list: 512k+512k
bio issued: 0+512k

5) chunk_aligned_read() split by chunksize:

bio issuing: 1M+512k
current->bio_list: 512k+512k -> 1536k+512k
bio issued: 0+512k

6) no split afterwards, finally the issue order is:

0+512k -> 1M+512k -> 512k+512k -> 1536k+512k

This behaviour causes a large read IO on raid456 to end up as small
discontinuous IOs on the underlying disks. Fix this problem by placing
the split bio at the head of current->bio_list.
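
A sketch of the change (bio_list_add_head() is the existing kernel helper
for head insertion; the surrounding plumbing is simplified):

  struct bio;
  struct bio_list;

  void bio_list_add(struct bio_list *bl, struct bio *bio);      /* tail */
  void bio_list_add_head(struct bio_list *bl, struct bio *bio); /* head */

  static void resubmit_after_split(struct bio_list *bl, struct bio *rest)
  {
          /* before: bio_list_add(bl, rest) -- tail insertion lets a
             re-split jump the queue and reorder offsets */
          bio_list_add_head(bl, rest); /* keep the remainder next in line */
  }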

Test script: test on 8 disk raid5 with 64k chunksize
dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct

Test results:
Before this patch
1) iostat results:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
md0           52430.00   3276.87     0.00   0.00    0.62    64.00   32.60  80.10
sd*           4487.00    409.00  2054.00  31.40    0.82    93.34    3.68  71.20
2) blktrace G stage:
  8,0    0   486445    11.357392936   843  G   R 14071424 + 128 [dd]
  8,0    0   486451    11.357466360   843  G   R 14071168 + 128 [dd]
  8,0    0   486454    11.357515868   843  G   R 14071296 + 128 [dd]
  8,0    0   486468    11.357968099   843  G   R 14072192 + 128 [dd]
  8,0    0   486474    11.358031320   843  G   R 14071936 + 128 [dd]
  8,0    0   486480    11.358096298   843  G   R 14071552 + 128 [dd]
  8,0    0   486490    11.358303858   843  G   R 14071808 + 128 [dd]
3) io seek for sdx:
Noted io seek is the result from blktrace D stage, statistic of:
ABS((offset of next IO) - (offset + len of previous IO))

Read|Write seek
cnt 55175, zero cnt 25079
    >=(KB) .. <(KB)     : count       ratio |distribution                            |
         0 .. 1         : 25079       45.5% |########################################|
         1 .. 2         : 0            0.0% |                                        |
         2 .. 4         : 0            0.0% |                                        |
         4 .. 8         : 0            0.0% |                                        |
         8 .. 16        : 0            0.0% |                                        |
        16 .. 32        : 0            0.0% |                                        |
        32 .. 64        : 12540       22.7% |#####################                   |
        64 .. 128       : 2508         4.5% |#####                                   |
       128 .. 256       : 0            0.0% |                                        |
       256 .. 512       : 10032       18.2% |#################                       |
       512 .. 1024      : 5016         9.1% |#########                               |

After this patch:
1) iostat results:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
md0           87965.00   5271.88     0.00   0.00    0.16    61.37   14.03  90.60
sd*           6020.00    658.44  5117.00  45.95    0.44   112.00    2.68  86.50
2) blktrace G stage:
  8,0    0   206296     5.354894072   664  G   R 7156992 + 128 [dd]
  8,0    0   206305     5.355018179   664  G   R 7157248 + 128 [dd]
  8,0    0   206316     5.355204438   664  G   R 7157504 + 128 [dd]
  8,0    0   206319     5.355241048   664  G   R 7157760 + 128 [dd]
  8,0    0   206333     5.355500923   664  G   R 7158016 + 128 [dd]
  8,0    0   206344     5.355837806   664  G   R 7158272 + 128 [dd]
  8,0    0   206353     5.355960395   664  G   R 7158528 + 128 [dd]
  8,0    0   206357     5.356020772   664  G   R 7158784 + 128 [dd]
3) io seek for sdx
Read|Write seek
cnt 28644, zero cnt 21483
    >=(KB) .. <(KB)     : count       ratio |distribution                            |
         0 .. 1         : 21483       75.0% |########################################|
         1 .. 2         : 0            0.0% |                                        |
         2 .. 4         : 0            0.0% |                                        |
         4 .. 8         : 0            0.0% |                                        |
         8 .. 16        : 0            0.0% |                                        |
        16 .. 32        : 0            0.0% |                                        |
        32 .. 64        : 7161        25.0% |##############                          |

BTW, this looks like a long-standing problem from day one, and large
sequential read IO is a pretty common case, e.g. video playback.

And even with this patch, IO in this test case is merged to at most 128k
due to the block layer plug limit BLK_PLUG_FLUSH_SIZE; increasing that
limit can give even better performance. However, we'll figure out how to
do this properly later.

[1] https://lore.kernel.org/all/e40b076d-583d-406b-b223-005910a9f46f@acm.org/

Fixes: d89d87965d ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
Reported-by: Tie Ren <tieren@fnnas.com>
Closes: https://lore.kernel.org/all/7dro5o7u5t64d6bgiansesjavxcuvkq5p2pok7dtwkav7b7ape@3isfr44b6352/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:46 -06:00
Yu Kuai
0b64682e78 block: skip unnecessary checks for split bio
Lots of checks are already done when this bio is submitted the first
time, and there is no need to check them again when the bio is
resubmitted after a split.

Hence open-code should_fail_bio() and blk_throtl_bio(), which are still
necessary, in bio_submit_split_bioset().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:46 -06:00
Yu Kuai
e3290419d9 blk-crypto: convert to use bio_submit_split_bioset()
Unify bio split code, prepare to fix ordering of split IO.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:46 -06:00
Yu Kuai
6529d41d87 md/md-linear: convert to use bio_submit_split_bioset()
Unify bio split code, prepare to fix reordered split IO.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:46 -06:00
Yu Kuai
9e8a5b37c9 md/raid5: convert to use bio_submit_split_bioset()
Unify bio split code, prepare to fix ordering of split IO.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai
6fc07785d9 md/raid10: convert read/write to use bio_submit_split_bioset()
Unify the bio split code and prepare to fix the ordering of split IO.
The error path is modified a bit, but no functional changes are intended:

- bio_submit_split_bioset() can fail the original bio directly on a
  split error; set R10BIO_Uptodate in this case to notify
  raid_end_bio_io() that the original bio has already been returned.
- setting R10BIO_Uptodate and setting the error value to -EIO is useless
  now; for an r10_bio without R10BIO_Uptodate, -EIO will be returned for
  the original bio.

Discard is not handled, because discard is only split for the unaligned
head and tail; this can be considered a slow path, so the reordering
here does not matter much.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai
deeeab3028 md/raid10: add a new r10bio flag R10BIO_Returned
The new helper bio_submit_split_bioset() can fail the original bio on
split errors; prepare to handle this case in raid_end_bio_io().

The flag name follows the corresponding r1bio flag name.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai
a6fcc160d6 md/raid1: convert to use bio_submit_split_bioset()
Unify the bio split code, and prepare to fix the ordering of split IO.

Note that bio_submit_split_bioset() can fail the original bio directly
on a split error; set R1BIO_Returned in this case to notify
raid_end_bio_io() that the original bio has already been returned.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai
5b38ee5a4a md/raid0: convert raid0_handle_discard() to use bio_submit_split_bioset()
Unify the bio split code, and prepare to fix the ordering of split IO.

Note that commit 319ff40a54 ("md/raid0: Fix performance regression for
large sequential writes") already fixed the ordering of split IO by
remapping the bio to the underlying disks before resubmitting it, given
that md_submit_bio() has already split it by sectors and that
raid0_make_request() will split at most once for unaligned IO. This is a
bit hacky and we'll convert it to a general solution later.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai
e37b5596a1 block: factor out a helper bio_submit_split_bioset()
No functional changes are intended. Some drivers, like mdraid, split
bios in their internal processing; prepare to unify the bio split code.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai
06d712d297 blk-crypto: fix missing blktrace bio split events
trace_block_split() is missing, leaving blktrace unable to catch BIO
split events and making it harder to analyze the BIO sequence.

Cc: stable@vger.kernel.org
Fixes: 488f6682c8 ("block: blk-crypto-fallback for Inline Encryption")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai
22f166218f md: fix missing blktrace bio split events
If a bio is split by internal handling like chunksize or badblocks, the
corresponding trace_block_split() is missing, leaving blktrace unable to
catch BIO split events and making it harder to analyze the BIO sequence.

Cc: stable@vger.kernel.org
Fixes: 4b1faf9316 ("block: Kill bio_pair_split()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai
ea3d1f104d blk-mq: add QUEUE_FLAG_BIO_ISSUE_TIME
bio->issue_time_ns is initialized for every bio; however, it's only used
by blk-iolatency. Add a new queue_flag and only set this flag when
blk-iolatency is enabled, so that the extra blk_time_get_ns() call can be
saved on disks where blk-iolatency is not enabled.
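
A sketch of the gating (plumbing simplified; flag and helper names per
the commit message):

  struct bio_state   { unsigned long long issue_time_ns; };
  struct queue_state { unsigned long flags; };

  #define QUEUE_FLAG_BIO_ISSUE_TIME 0

  unsigned long long blk_time_get_ns(void); /* stand-in declaration */

  static void bio_stamp_issue_time(struct queue_state *q,
                                   struct bio_state *bio)
  {
          /* only pay for the clock read when blk-iolatency wants it */
          if (q->flags & (1UL << QUEUE_FLAG_BIO_ISSUE_TIME))
                  bio->issue_time_ns = blk_time_get_ns();
  }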

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai
1f963bdd64 block: initialize bio issue time in blk_mq_submit_bio()
bio->issue_time_ns is only used by blk-iolatency, which can only be
enabled for rq-based disks, hence it's not necessary to initialize
the time for bio-based disks.

Meanwhile, if a bio is split by blk_crypto_fallback_split_bio_if_needed(),
the issue time is not initialized for the new split bio; this can be
fixed as well.

Note that the next patch will optimize this further: the bio issue time
will only be recorded when blk-iolatency is actually enabled on the disk.

Fixes: 488f6682c8 ("block: blk-crypto-fallback for Inline Encryption")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai
1733e88874 block: cleanup bio_issue
Now that bio->bi_issue is only used by blk-iolatency to get the bio
issue time, replace bio_issue with a u64 time directly and remove
bio_issue to make the code cleaner.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Jens Axboe
79b24810a2 Merge tag 'md-6.18-20250909' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux into for-6.18/block
Pull MD changes from Yu Kuai:

"Redundant data is used to enhance data fault tolerance, and the storage
 method for redundant data varies depending on the RAID level. And it's
 important to maintain the consistency of redundant data.

 Bitmap is used to record which data blocks have been synchronized and
 which ones need to be resynchronized or recovered. Each bit in the
 bitmap represents a segment of data in the array. When a bit is set,
 it indicates that the multiple redundant copies of that data segment
 may not be consistent. Data synchronization can be performed based on
 the bitmap after a power failure or after re-adding a disk. If there is no
 bitmap, a full disk synchronization is required.

 Due to known performance issues with md-bitmap and its unreasonable
 implementation:

 - self-managed IO submission like filemap_write_page();
 - a global spin_lock

 I have decided not to continue optimizing based on the current bitmap
 implementation; this new bitmap is designed without locking in the IO
 fast path and can be used with fast disks.

 Key features for the new bitmap:
  - the IO fast path is lockless; if the user issues lots of write IO to
    the same bitmap bit in a short time, only the first write has the
    additional overhead of updating the bitmap bit, and there is no
    additional overhead for the following writes;
  - supports resyncing or recovering only written data, meaning that when
    creating a new array or replacing a disk, there is no need to do a
    full disk resync/recovery;"

* tag 'md-6.18-20250909' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux: (24 commits)
  md/md-llbitmap: introduce new lockless bitmap
  md/md-bitmap: make method bitmap_ops->daemon_work optional
  md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
  md/md-bitmap: add a new method blocks_synced() in bitmap_operations
  md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
  md/md-bitmap: delay registration of bitmap_ops until creating bitmap
  md/md-bitmap: add a new sysfs api bitmap_type
  md: add a new mddev field 'bitmap_id'
  md/md-bitmap: support discard for bitmap ops
  md: factor out a helper raid_is_456()
  md: add a new parameter 'offset' to md_super_write()
  md/md-bitmap: introduce CONFIG_MD_BITMAP
  md: check before referencing mddev->bitmap_ops
  md/dm-raid: check before referencing mddev->bitmap_ops
  md/raid5: check before referencing mddev->bitmap_ops
  md/raid10: check before referencing mddev->bitmap_ops
  md/raid1: check before referencing mddev->bitmap_ops
  md/raid1: check bitmap before behind write
  md/md-bitmap: handle the case bitmap is not enabled before end_sync()
  md/md-bitmap: handle the case bitmap is not enabled before start_sync()
  ...
2025-09-09 11:24:03 -06:00
Keith Busch
d0d1d52231 blk-map: provide the bdev to bio if one exists
We can now safely provide a block device when extracting user pages for
driver and user passthrough commands. Set the bdev so the caller doesn't
have to do that later. This has the additional benefit of being able to
extract P2P pages in the passthrough path.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:35:28 -06:00
Keith Busch
d57447ffb5 blk-mq-dma: bring back p2p request flags
We only need to consider data and metadata dma mapping types separately.
The request and bio integrity payload have enough flag bits to
internally track the mapping type for each. Use these so the caller
doesn't need to track them, and provide separate request and integrity
helpers to the common code. This will make it easier to scale new
mappings, like the proposed MMIO attribute, without burdening the caller
to track such things.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:33:35 -06:00
Keith Busch
05ceea5d3e blk-integrity: enable p2p source and destination
Set the extraction flags to allow p2p pages for the metadata buffer if
the block device allows it. Similar to data payloads, ensure the bio
does not use merging if we see a p2p page.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:33:27 -06:00
Keith Busch
b475272f03 iov_iter: remove iov_iter_is_aligned
No more callers.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:27:01 -06:00