linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-13 23:29:10 -04:00

Author	SHA1	Message	Date
Randy Dunlap	d211a28035	block/rnbd: correct all kernel-doc complaints Fix all kernel-doc warnings in rnbd-proto.h: - use correct enum name in kdoc comment - mark several struct members as "/* private: /" so that no kdoc is required for them - don't use "/" for a non-kernel-doc comment - use the correct struct member name for "dev_name" - use " " for a blank kernel-doc line Fixes these warnings: Warning: drivers/block/rnbd/rnbd-proto.h:41 expecting prototype for enum rnbd_msg_types. Prototype was for enum rnbd_msg_type instead Warning: drivers/block/rnbd/rnbd-proto.h:50 struct member '__padding' not described in 'rnbd_msg_hdr' Warning: drivers/block/rnbd/rnbd-proto.h:53 This comment starts with '/*', but isn't a kernel-doc comment. We allow to map RO many times and RW only once. We allow to map yet another Warning: drivers/block/rnbd/rnbd-proto.h:81 struct member 'reserved' not described in 'rnbd_msg_sess_info' Warning: drivers/block/rnbd/rnbd-proto.h:92 struct member 'reserved' not described in 'rnbd_msg_sess_info_rsp' Warning: drivers/block/rnbd/rnbd-proto.h:107 struct member 'resv1' not described in 'rnbd_msg_open' Warning: drivers/block/rnbd/rnbd-proto.h:107 struct member 'dev_name' not described in 'rnbd_msg_open' Warning: drivers/block/rnbd/rnbd-proto.h:107 struct member 'reserved' not described in 'rnbd_msg_open' Warning: drivers/block/rnbd/rnbd-proto.h:158 struct member 'reserved' not described in 'rnbd_msg_open_rsp' Warning: drivers/block/rnbd/rnbd-proto.h:189 bad line: Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-01 07:19:50 -07:00
Fengnan Chang	4d0e1f2139	blk-mq: use queue_hctx in blk_mq_map_queue_type Some caller of blk_mq_map_queue_type now didn't grab 'q_usage_counter', such as blk_mq_cpu_mapped_to_hctx, so we need protect 'queue_hw_ctx' through rcu. Also checked all other functions, no more missed cases. Fixes: `89e1fb7cef` ("blk-mq: fix potential uaf for 'queue_hw_ctx'") Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-01 07:18:31 -07:00
Jens Axboe	c1536df942	Merge tag 'md-6.19-20251130' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux into for-6.19/block Pull MD changes from Yu: "- fix null-ptr-dereference regression for dm-raid0 (Yu Kuai) - fix IO hang for raid5 when array is broken with IO inflight (Yu Kuai) - remove legacy 1s delay to speed up system shutdown (Tarun Sahu)" * tag 'md-6.19-20251130' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux: md: remove legacy 1s delay in md_notify_reboot md/raid5: fix IO hang when array is broken with IO inflight md: warn about updating super block failure md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid	2025-11-30 06:31:01 -07:00
Tarun Sahu	fdd0c6a649	md: remove legacy 1s delay in md_notify_reboot During system shutdown, the md driver registered notifier function (md_notify_reboot) currently imposes a hardcoded one-second delay. This delay was introduced approximately 23 years ago and was likely necessary for the hardware generation of that time. Proposing this patch to make sure there are no known devices that need this delay. Link: https://lore.kernel.org/linux-raid/20251121191422.2758555-1-tarunsahu@google.com Signed-off-by: Tarun Sahu <tarunsahu@google.com> Reviewed-by: Yu Kuai<yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2025-11-30 09:42:28 +08:00
Yu Kuai	a913d1f6a7	md/raid5: fix IO hang when array is broken with IO inflight Following test can cause IO hang: mdadm -CvR /dev/md0 -l10 -n4 /dev/sd[abcd] --assume-clean --chunk=64K --bitmap=none sleep 5 echo 1 > /sys/block/sda/device/delete echo 1 > /sys/block/sdb/device/delete echo 1 > /sys/block/sdc/device/delete echo 1 > /sys/block/sdd/device/delete dd if=/dev/md0 of=/dev/null bs=8k count=1 iflag=direct Root cause: 1) all disks removed, however all rdevs in the array is still in sync, IO will be issued normally. 2) IO failure from sda, and set badblocks failed, sda will be faulty and MD_SB_CHANGING_PENDING will be set. 3) error recovery try to recover this IO from other disks, IO will be issued to sdb, sdc, and sdd. 4) IO failure from sdb, and set badblocks failed again, now array is broken and will become read-only. 5) IO failure from sdc and sdd, however, stripe can't be handled anymore because MD_SB_CHANGING_PENDING is set: handle_stripe handle_stripe if (test_bit MD_SB_CHANGING_PENDING) set_bit STRIPE_HANDLE goto finish // skip handling failed stripe release_stripe if (test_bit STRIPE_HANDLE) list_add_tail conf->hand_list 6) later raid5d can't handle failed stripe as well: raid5d md_check_recovery md_update_sb if (!md_is_rdwr()) // can't clear pending bit return if (test_bit MD_SB_CHANGING_PENDING) break; // can't handle failed stripe Since MD_SB_CHANGING_PENDING can never be cleared for read-only array, fix this problem by skip this checking for read-only array. Link: https://lore.kernel.org/linux-raid/20251117085557.770572-3-yukuai@fnnas.com Fixes: `d87f064f58` ("md: never update metadata when array is read-only.") Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com>	2025-11-30 09:38:45 +08:00
Yu Kuai	8c9e376b9d	md: warn about updating super block failure Many personalities will handle IO error from daemon thread(like raid1d, raid10d, raid5d), and sb will require to be clean before hanlding these failed IO. However update sb can fail, for example array is broken by IO failure, or user config sysfs api array_state. This patch adds warning if updating sb failed first, in case this will be related to IO hang. Link: https://lore.kernel.org/linux-raid/20251117085557.770572-2-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com>	2025-11-30 09:38:22 +08:00
Yu Kuai	46f21952c4	md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid Commit `2107457e31` ("md/raid0: Move queue limit setup before r0conf initialization") dereference mddev->gendisk unconditionally, which is NULL for dm-raid. Fix this problem by reverting to old codes for dm-raid. Link: https://lore.kernel.org/linux-raid/20251116021816.107648-1-yukuai@fnnas.com Fixes: `2107457e31` ("md/raid0: Move queue limit setup before r0conf initialization") Reported-and-tested-by: Changhui Zhong <czhong@redhat.com> Closes: https://lore.kernel.org/all/CAGVVp+VqVnvGeneUoTbYvBv2cw6GwQRrR3B-iQ-_9rVfyumoKA@mail.gmail.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Li Nan <linan122@huawei.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>	2025-11-30 09:36:50 +08:00
Randy Dunlap	418de94e75	sbitmap: fix all kernel-doc warnings Modify kernel-doc comments in sbitmap.h to prevent warnings: Warning: include/linux/sbitmap.h:84 struct member 'alloc_hint' not described in 'sbitmap' Warning: include/linux/sbitmap.h:151 struct member 'ws_active' not described in 'sbitmap_queue' Warning: include/linux/sbitmap.h:552 No description found for return value of 'sbq_wait_ptr' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-28 09:21:18 -07:00
Ming Lei	28d7a371f0	ublk: add helper of __ublk_fetch() Add helper __ublk_fetch() for refactoring ublk_fetch(). Meantime move ublk_config_io_buf() out of __ublk_fetch() to make the code structure cleaner. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-28 09:20:13 -07:00
Ming Lei	3443bab2f8	ublk: pass const pointer to ublk_queue_is_zoned() Pass const pointer to ublk_queue_is_zoned() because it is readonly. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-28 09:20:13 -07:00
Ming Lei	0a9beafa7c	ublk: refactor auto buffer register in ublk_dispatch_req() Refactor auto buffer register code and prepare for supporting batch IO feature, and the main motivation is to put 'ublk_io' operation code together, so that per-io lock can be applied for the code block. The key changes are: - Rename ublk_auto_buf_reg() as ublk_do_auto_buf_reg() - Introduce an enum `auto_buf_reg_res` to represent the result of the buffer registration attempt (FAIL, FALLBACK, OK). - Split the existing `ublk_do_auto_buf_reg` function into two: - `__ublk_do_auto_buf_reg`: Performs the actual buffer registration and returns the `auto_buf_reg_res` status. - `ublk_do_auto_buf_reg`: A wrapper that calls the internal function and handles the I/O preparation based on the result. - Introduce `ublk_prep_auto_buf_reg_io` to encapsulate the logic for preparing the I/O for completion after buffer registration. - Pass the `tag` directly to `ublk_auto_buf_reg_fallback` to avoid recalculating it. This refactoring makes the control flow clearer and isolates the different stages of the auto buffer registration process. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-28 09:20:13 -07:00
Ming Lei	8d61ece156	ublk: add `union ublk_io_buf` with improved naming Add `union ublk_io_buf` for naming the anonymous union of struct ublk_io's addr and buf fields, meantime apply it to `struct ublk_io` for storing either ublk auto buffer register data or ublk server io buffer address. The union uses clear field names: - `addr`: for regular ublk server io buffer addresses - `auto_reg`: for ublk auto buffer registration data This eliminates confusing access patterns and improves code readability. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-28 09:20:13 -07:00
Ming Lei	3035b9b46b	ublk: add parameter `struct io_uring_cmd ` to ublk_prep_auto_buf_reg() Add parameter `struct io_uring_cmd ` to ublk_prep_auto_buf_reg() and prepare for reusing this helper for the coming UBLK_BATCH_IO feature, which can fetch & commit one batch of io commands via single uring_cmd. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-28 09:20:13 -07:00
Ming Lei	9574b21e95	kfifo: add kfifo_alloc_node() helper for NUMA awareness Add __kfifo_alloc_node() by refactoring and reusing __kfifo_alloc(), and define kfifo_alloc_node() macro to support NUMA-aware memory allocation. The new __kfifo_alloc_node() function accepts a NUMA node parameter and uses kmalloc_array_node() instead of kmalloc_array() for node-specific allocation. The existing __kfifo_alloc() now calls __kfifo_alloc_node() with NUMA_NO_NODE to maintain backward compatibility. This enables users to allocate kfifo buffers on specific NUMA nodes, which is important for performance in NUMA systems where the kfifo will be primarily accessed by threads running on specific nodes. Cc: Stefani Seibold <stefani@seibold.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-28 09:20:13 -07:00
Fengnan Chang	89e1fb7cef	blk-mq: fix potential uaf for 'queue_hw_ctx' This is just apply Kuai's patch in [1] with mirror changes. blk_mq_realloc_hw_ctxs() will free the 'queue_hw_ctx'(e.g. undate submit_queues through configfs for null_blk), while it might still be used from other context(e.g. switch elevator to none): t1 t2 elevator_switch blk_mq_unquiesce_queue blk_mq_run_hw_queues queue_for_each_hw_ctx // assembly code for hctx = (q)->queue_hw_ctx[i] mov 0x48(%rbp),%rdx -> read old queue_hw_ctx __blk_mq_update_nr_hw_queues blk_mq_realloc_hw_ctxs hctxs = q->queue_hw_ctx q->queue_hw_ctx = new_hctxs kfree(hctxs) movslq %ebx,%rax mov (%rdx,%rax,8),%rdi ->uaf This problem was found by code review, and I comfirmed that the concurrent scenario do exist(specifically 'q->queue_hw_ctx' can be changed during blk_mq_run_hw_queues()), however, the uaf problem hasn't been repoduced yet without hacking the kernel. Sicne the queue is freezed in __blk_mq_update_nr_hw_queues(), fix the problem by protecting 'queue_hw_ctx' through rcu where it can be accessed without grabbing 'q_usage_counter'. [1] https://lore.kernel.org/all/20220225072053.2472431-1-yukuai3@huawei.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-28 09:09:19 -07:00
Fengnan Chang	d0c98769ee	blk-mq: use array manage hctx map instead of xarray After commit `4e5cc99e1e` ("blk-mq: manage hctx map via xarray"), we use an xarray instead of array to store hctx, but in poll mode, each time in blk_mq_poll, we need use xa_load to find corresponding hctx, this introduce some costs. In my test, xa_load may cost 3.8% cpu. This patch revert previous change, eliminates the overhead of xa_load and can result in a 3% performance improvement. Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-28 09:09:19 -07:00
Kevin Brodsky	c6a45ee760	ublk: prevent invalid access with DEBUG ublk_ch_uring_cmd_local() may jump to the out label before initialising the io pointer. This will cause trouble if DEBUG is defined, because the pr_devel() call dereferences io. Clang reports: drivers/block/ublk_drv.c:2403:6: error: variable 'io' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized] 2403 \| if (tag >= ub->dev_info.queue_depth) \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ drivers/block/ublk_drv.c:2492:32: note: uninitialized use occurs here 2492 \| __func__, cmd_op, tag, ret, io->flags); \| Fix this by initialising io to NULL and checking it before dereferencing it. Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Fixes: `71f28f3136` ("ublk_drv: add io_uring based userspace block driver") Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-26 10:59:18 -07:00
Jan Höppner	a857d99201	s390/dasd: Use scnprintf() instead of sprintf() Use scnprintf() instead of sprintf() for those cases where the destination is an array and the size of the array is known at compile time. This prevents theoretical buffer overflows, but also avoids that people again and again spend time to figure out if the code is actually safe. Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com> Reviewed-by: Stefan Haberland <sth@linux.ibm.com> Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-26 10:14:17 -07:00
Jan Höppner	43198756ee	s390/dasd: Move device name formatting into separate function The device name formatting can be generalized and made more readable compared to the current state. SCSI already provides a generalized way to format many devices in the same naming scheme as DASD does, which was introduced with commit `3e1a7ff8a0` ("block: allow disk to have extended device number"). Use this much cleaner code from drivers/scsi/sd.c to handle the legacy naming scheme in DASD as a replacement for the current implementation. For easier error handling for the new function, move the gendisk free portion of dasd_gendisk_free() out into a new function dasd_gd_free(). Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com> Reviewed-by: Stefan Haberland <sth@linux.ibm.com> Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-26 10:14:17 -07:00
Stefan Haberland	764def9e8e	s390/dasd: Remove unnecessary debugfs_create() return checks The DASD driver only uses the dentry pointers when removing debugfs entries, and debugfs_remove() can safely handle both NULL and ERR_PTR. There is therefore no need to check debugfs_create() return values. This simplifies the debugfs setup code without changing functionality. Suggested-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com> Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-26 10:14:17 -07:00
Stefan Haberland	c943bfc6af	s390/dasd: Fix gendisk parent after copy pair swap After a copy pair swap the block device's "device" symlink points to the secondary CCW device, but the gendisk's parent remained the primary, leaving /sys/block/<dasdx> under the wrong parent. Move the gendisk to the secondary's device with device_move(), keeping the sysfs topology consistent after the swap. Fixes: `413862caad` ("s390/dasd: add copy pair swap capability") Cc: stable@vger.kernel.org #6.1 Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com> Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-26 10:14:16 -07:00
Chaitanya Kulkarni	7d09a8e251	block: ignore __blkdev_issue_discard() return value __blkdev_issue_discard() always returns 0, making the error check in blkdev_issue_discard() dead code. In function blkdev_issue_discard() initialize ret = 0, remove ret assignment from __blkdev_issue_discard(), rely on bio == NULL check to call submit_bio_wait(), preserve submit_bio_wait() error handling, and preserve -EOPNOTSUPP to 0 mapping. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-25 12:19:39 -07:00
shechenglong	3a64c46c40	block: fix typos in comments and strings in blk-core This patch fixes multiple spelling mistakes in comments and documentation in the file block/blk-core.c. No functional changes intended. Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: shechenglong <shechenglong@xfusion.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-25 10:39:49 -07:00
John Garry	a74de0c366	block: Remove references to __device_add_disk() Since commit `d1254a8749` ("block: remove support for delayed queue registrations"), function __device_add_disk() has been replaced with device_add_disk(), so fix up comments. Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-25 10:36:37 -07:00
Jens Axboe	96f03c8cb2	Revert "Merge branch 'loop-aio-nowait' into for-6.19/block" This reverts commit `f43fdeb9a3`, reversing changes made to `2c6d792d4b`. There are concerns that doing inline submits can cause excessive stack usage, particularly when going back into the filesystem. Revert the loop dio nowait change for now. Link: https://lore.kernel.org/linux-block/aSP3SG_KaROJTBHx@infradead.org/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-24 20:53:19 -07:00
David Laight	9420e720ad	block: use min() instead of min_t() min_t(unsigned int, a, b) casts an 'unsigned long' to 'unsigned int'. Use min(a, b) instead as it promotes any 'unsigned int' to 'unsigned long' and so cannot discard significant bits. In this case the 'unsigned long' value is small enough that the result is ok. (Similarly for max_t() and clamp_t().) Detected by an extra check added to min_t(). Signed-off-by: David Laight <david.laight.linux@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-20 07:44:29 -07:00
Chaitanya Kulkarni	e8f0abdd49	zloop: clear nowait flag in workqueue context The zloop driver advertises REQ_NOWAIT support through BLK_FEAT_NOWAIT (enabled by default for all blk-mq devices), and honors the nowait behavior throughout zloop_queue_rq(). However, actual I/O to the backing file is performed in a workqueue, where blocking is allowed. To avoid imposing unnecessary non-blocking constraints in this blocking context, clear the REQ_NOWAIT flag before processing the request in the workqueue context. Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-20 07:42:32 -07:00
Chaitanya Kulkarni	b11e483a1c	loop: clear nowait flag in workqueue context The loop driver advertises REQ_NOWAIT support through BLK_FEAT_NOWAIT (enabled by default for all blk-mq devices), and honors the nowait behavior throughout loop_queue_rq(). However, actual I/O to the backing file is performed in a workqueue, where blocking is allowed. To avoid imposing unnecessary non-blocking constraints in this blocking context, clear the REQ_NOWAIT flag before processing the request in the workqueue context. Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-20 07:42:32 -07:00
Damien Le Moal	a9637ab93c	zloop: fix zone append check in zloop_rw() While commit `cf28f6f923` ("zloop: fail zone append operations that are targeting full zones") added a check in zloop_rw() that a zone append is not issued to a full zone, commit `e3a96ca904` ("zloop: simplify checks for writes to sequential zones") inadvertently removed the check to verify that there is enough unwritten space in a zone for an incoming zone append opration. Re-add this check in zloop_rw() to make sure we do not write beyond the end of a zone. Of note is that this same check is already present in the function zloop_set_zone_append_sector() when ordered zone append is in use. Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com> Fixes: `e3a96ca904` ("zloop: simplify checks for writes to sequential zones") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-19 07:39:42 -07:00
Damien Le Moal	ebcc028b4a	MAINTAINERS: add a maintainer for zoned block device support Add myself as the maintainer of the block layer support for the zoned block device code and user API. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-19 07:37:48 -07:00
Damien Le Moal	00ed035094	MAINTAINERS: add missing block layer user API header files Add the missing user API header files related to the block layer to the list of matching file patterns for Jens's block layer entry. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-19 07:37:40 -07:00
Chengkaitao	8e1d91c258	block: remove the declaration of elevator_init_mq function In commit `1e44bedbc9` ("block: unifying elevator change"), the elevator_init_mq function was deleted, but its declaration in elevator.h was overlooked. This patch fixes it. Signed-off-by: Chengkaitao <chengkaitao@kylinos.cn> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 16:27:19 -07:00
Jens Axboe	caebce24f6	Revert "block: consider discard merge last" This reverts commit `2516c246d0`. Suspected issues with discard merging post this patch, hence revert it for now. Link: https://lore.kernel.org/linux-block/26acdfdf-de13-430b-8c73-f890c7689a84@kernel.dk/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 15:00:12 -07:00
Bart Van Assche	935a20d1be	block: Remove queue freezing from several sysfs store callbacks Freezing the request queue from inside sysfs store callbacks may cause a deadlock in combination with the dm-multipath driver and the queue_if_no_path option. Additionally, freezing the request queue slows down system boot on systems where sysfs attributes are set synchronously. Fix this by removing the blk_mq_freeze_queue() / blk_mq_unfreeze_queue() calls from the store callbacks that do not strictly need these callbacks. Add the __data_racy annotation to request_queue.rq_timeout to suppress KCSAN data race reports about the rq_timeout reads. This patch may cause a small delay in applying the new settings. For all the attributes affected by this patch, I/O will complete correctly whether the old or the new value of the attribute is used. This patch affects the following sysfs attributes: * io_poll_delay * io_timeout * nomerges * read_ahead_kb * rq_affinity Here is an example of a deadlock triggered by running test srp/002 if this patch is not applied: task:multipathd Call Trace: <TASK> __schedule+0x8c1/0x1bf0 schedule+0xdd/0x270 schedule_preempt_disabled+0x1c/0x30 __mutex_lock+0xb89/0x1650 mutex_lock_nested+0x1f/0x30 dm_table_set_restrictions+0x823/0xdf0 __bind+0x166/0x590 dm_swap_table+0x2a7/0x490 do_resume+0x1b1/0x610 dev_suspend+0x55/0x1a0 ctl_ioctl+0x3a5/0x7e0 dm_ctl_ioctl+0x12/0x20 __x64_sys_ioctl+0x127/0x1a0 x64_sys_call+0xe2b/0x17d0 do_syscall_64+0x96/0x3a0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> task:(udev-worker) Call Trace: <TASK> __schedule+0x8c1/0x1bf0 schedule+0xdd/0x270 blk_mq_freeze_queue_wait+0xf2/0x140 blk_mq_freeze_queue_nomemsave+0x23/0x30 queue_ra_store+0x14e/0x290 queue_attr_store+0x23e/0x2c0 sysfs_kf_write+0xde/0x140 kernfs_fop_write_iter+0x3b2/0x630 vfs_write+0x4fd/0x1390 ksys_write+0xfd/0x230 __x64_sys_write+0x76/0xc0 x64_sys_call+0x276/0x17d0 do_syscall_64+0x96/0x3a0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Nilay Shroff <nilay@linux.ibm.com> Cc: Martin Wilck <mwilck@suse.com> Cc: Benjamin Marzinski <bmarzins@redhat.com> Cc: stable@vger.kernel.org Fixes: `af28141498` ("block: freeze the queue in queue_attr_store") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 15:00:11 -07:00
Bart Van Assche	42adb2d4ef	fs: Add the __data_racy annotation to backing_dev_info.ra_pages Some but not all .ra_pages changes happen while block layer I/O is paused with blk_mq_freeze_queue(). Filesystems may read .ra_pages even while block layer I/O is paused, e.g. from inside their .fadvise callback. Annotating all .ra_pages reads with READ_ONCE() would be cumbersome. Hence, add the __data_racy annotatation to the .ra_pages member variable. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 15:00:11 -07:00
Xue He	152c331bcd	block: plug attempts to batch allocate tags multiple times This patch aims to enable batch allocation of sufficient tags after batch IO submission with plug mechanism, thereby avoiding the need for frequent individual requests when the initial allocation is insufficient. ----------------------------------------------------------- HW: 16 CPUs/16 poll queues Disk: Samsung PM9A3 Gen4 3.84T CMD: [global] ioengine=io_uring group_reporting=1 time_based=1 runtime=1m refill_buffers=1 norandommap=1 randrepeat=0 fixedbufs=1 registerfiles=1 rw=randread iodepth=128 iodepth_batch_submit=32 iodepth_batch_complete_min=32 iodepth_batch_complete_max=128 iodepth_low=32 bs=4k numjobs=1 direct=1 hipri=1 [job1] filename=/dev/nvme0n1 name=batch_test ------------------------------------------------------------ Perf: base code: __blk_mq_alloc_requests() 1.47% patch: __blk_mq_alloc_requests() 0.75% ------------------------------------------------------------ Signed-off-by: hexue <xue01.he@samsung.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 14:59:41 -07:00
Jens Axboe	f43fdeb9a3	Merge branch 'loop-aio-nowait' into for-6.19/block Merge async IO IOCB_NOWAIT support from Ming: "This patchset improves loop aio perf by using IOCB_NOWAIT for avoiding to queue aio command to workqueue context, meantime refactor lo_rw_aio() a bit. In my test VM, loop disk perf becomes very close to perf of the backing block device(nvme/mq virtio-scsi). And Mikulas verified that this way can improve 12jobs sequential readwrite io by ~5X, and basically solve the reported problem together with loop MQ change. https://lore.kernel.org/linux-block/a8e5c76a-231f-07d1-a394-847de930f638@redhat.com/ Zhaoyang Huang also mentioned it may fix their performance issue on Android use case. The loop MQ change will be posted as standalone patch, because it needs UAPI change." Link: https://lore.kernel.org/linux-block/20251015110735.1361261-1-ming.lei@redhat.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk> * loop-aio-nowait: loop: add hint for handling aio via IOCB_NOWAIT loop: try to handle loop aio command via NOWAIT IO first loop: move command blkcg/memcg initialization into loop_queue_work loop: add lo_submit_rw_aio() loop: add helper lo_rw_aio_prep() loop: add helper lo_cmd_nr_bvec()	2025-11-18 06:49:59 -07:00
Ming Lei	837ed30396	loop: add hint for handling aio via IOCB_NOWAIT Add hint for using IOCB_NOWAIT to handle loop aio command for avoiding to cause write(especially randwrite) perf regression on sparse backed file. Try IOCB_NOWAIT in the following situations: - backing file is block device OR - READ aio command OR - there isn't any queued blocking async WRITEs, because NOWAIT won't cause contention with blocking WRITE, which often implies exclusive lock With this simple policy, perf regression of randwrite/write on sparse backing file is fixed. Link: https://lore.kernel.org/dm-devel/7d6ae2c9-df8e-50d0-7ad6-b787cb3cfab4@redhat.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 06:49:52 -07:00
Ming Lei	0ba93a906d	loop: try to handle loop aio command via NOWAIT IO first Try to handle loop aio command via NOWAIT IO first, then we can avoid to queue the aio command into workqueue. This is usually one big win in case that FS block mapping is stable, Mikulas verified [1] that this way improves IO perf by close to 5X in 12jobs sequential read/write test, in which FS block mapping is just stable. Fallback to workqueue in case of -EAGAIN. This way may bring a little cost from the 1st retry, but when running the following write test over loop/sparse_file, the actual effect on randwrite is obvious: ``` truncate -s 4G 1.img #1.img is created on XFS/virtio-scsi losetup -f 1.img --direct-io=on fio --direct=1 --bs=4k --runtime=40 --time_based --numjobs=1 --ioengine=libaio \ --iodepth=16 --group_reporting=1 --filename=/dev/loop0 -name=job --rw=$RW ``` - RW=randwrite: obvious IOPS drop observed - RW=write: a little drop(%5 - 10%) This perf drop on randwrite over sparse file will be addressed in the following patch. BLK_MQ_F_BLOCKING has to be set for calling into .read_iter() or .write_iter() which might sleep even though it is NOWAIT, and the only effect is that rcu read lock is replaced with srcu read lock. Link: https://lore.kernel.org/linux-block/a8e5c76a-231f-07d1-a394-847de930f638@redhat.com/ [1] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 06:49:52 -07:00
Ming Lei	f4788ae9d7	loop: move command blkcg/memcg initialization into loop_queue_work Move loop command blkcg/memcg initialization into loop_queue_work, and prepare for supporting to handle loop io command by IOCB_NOWAIT. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 06:49:52 -07:00
Ming Lei	c66e9708f9	loop: add lo_submit_rw_aio() Refactor lo_rw_aio() by extracting the I/O submission logic into a new helper function lo_submit_rw_aio(). This further improves code organization by separating the I/O preparation, submission, and completion handling into distinct phases. Prepare for using NOWAIT to improve loop performance. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 06:49:51 -07:00
Ming Lei	fd858d1ca9	loop: add helper lo_rw_aio_prep() Add helper lo_rw_aio_prep() to separate the preparation phase(setting up bio vectors and initializing the iocb structure) from the actual I/O execution in the loop block driver. Prepare for using NOWAIT to improve loop performance. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 06:49:51 -07:00
Ming Lei	c3e6c11147	loop: add helper lo_cmd_nr_bvec() Add lo_cmd_nr_bvec() and prepare for refactoring lo_rw_aio(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 06:49:51 -07:00
Sukrut Heroorkar	2c6d792d4b	drbd: turn bitmap I/O comments into regular block comments W=1 build warns because the bitmap I/O comments use '/**', which marks them as kernel-doc comments even though these functions do not document an external API. Convert these comments to regular block comments so kernel-doc no longer parses them. Signed-off-by: Sukrut Heroorkar <hsukrut3@gmail.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 06:48:33 -07:00
Li Chen	3179a5f7f8	block: rate-limit capacity change info log loop devices under heavy stress-ng loop streessor can trigger many capacity change events in a short time. Each event prints an info message from set_capacity_and_notify(), flooding the console and contributing to soft lockups on slow consoles. Switch the printk in set_capacity_and_notify() to pr_info_ratelimited() so frequent capacity changes do not spam the log while still reporting occasional changes. Cc: stable@vger.kernel.org Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 11:30:24 -07:00
Damien Le Moal	ade260ca85	Documentation: admin-guide: blockdev: update zloop parameters In Documentation/admin-guide/blockdev/zoned_loop.rst, add the description of the zone_append and ordered_zone_append configuration arguments of zloop "add" command (device creation). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 09:40:09 -07:00
Damien Le Moal	fcc6eaa3a0	zloop: introduce the ordered_zone_append configuration parameter The zone append operation processing for zloop devices is similar to any other command, that is, the operation is processed as a command work item, without any special serialization between the work items (beside the zone mutex for mutually exclusive code sections). This processing is fine and gives excellent performance. However, it has a side effect: zone append operation are very often reordered and processed in a sequence that is very different from their issuing order by the user. This effect is very visible using an XFS file system on top of a zloop device. A simple file write leads to many file extents as the data writes using zone append are reordered and so result in the physical order being different than the file logical order. E.g. executing: $ dd if=/dev/zero of=/mnt/test bs=1M count=10 && sync $ xfs_bmap /mnt/test /mnt/test: 0: [0..4095]: 2162688..2166783 1: [4096..6143]: 2168832..2170879 2: [6144..8191]: 2166784..2168831 3: [8192..10239]: 2170880..2172927 4: [10240..12287]: 2174976..2177023 5: [12288..14335]: 2172928..2174975 6: [14336..20479]: 2177024..2183167 For 10 IOs, 6 extents are created. This is fine and actually allows to exercise XFS zone garbage collection very well. However, this also makes debugging/working on XFS data placement harder as the underlying device will most of the time reorder IOs, resulting in many file extents. Allow a user to mitigate this with the new ordered_zone_append configuration parameter. For a zloop device created with this parameter specified, the sector of a zone append command is set early, when the command is submitted by the block layer with the zloop_queue_rq() function, instead of in the zloop_rw() function which is exectued later in the command work item context. This change ensures that more often than not, zone append operations data end up being written in the same order as the command submission by the user. In the case of XFS, this leads to far less file data extents. E.g., for the previous example, we get a single file data extent for the written file. $ dd if=/dev/zero of=/mnt/test bs=1M count=10 && sync $ xfs_bmap /mnt/test /mnt/test: 0: [0..20479]: 2162688..2183167 Since we cannot use a mutex in the context of the zloop_queue_rq() function to atomically set a zone append operation sector to the target zone write pointer location and increment that the write pointer, a new per-zone spinlock is introduced to protect a zone write pointer access and modifications. To check a zone write pointer location and set a zone append operation target sector to that value, the function zloop_set_zone_append_sector() is introduced and called from zloop_queue_rq(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 09:40:09 -07:00
Damien Le Moal	9236c5fdd5	zloop: introduce the zone_append configuration parameter A zloop zoned block device declares to the block layer that it supports zone append operations. That is, a zloop device ressembles an NVMe ZNS devices supporting zone append. This native support is fine but it does not allow exercising the block layer zone write plugging emulation of zone append, as is done with SCSI or ATA SMR HDDs. Introduce the zone_append configuration parameter to allow creating a zloop device without native support for zone append, thus relying on the block layer zone append emulation. If not specified, zone append support is enabled by default. Otherwise, a value of 0 disables native zone append and a value of 1 enables it. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 09:40:09 -07:00
Damien Le Moal	e3a96ca904	zloop: simplify checks for writes to sequential zones The function zloop_rw() already checks early that a request is fully contained within the target zone. So this check does not need to be done again for regular writes to sequential zones. Furthermore, since zone append operations are always directed to the zone write pointer location, we do not need to check for their alignment to that value after setting it. So turn the "if" checking the write pointer alignment into an "else if". While at it, improve the comment describing the write pointer modification and how this value is corrected in case of error. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 09:40:09 -07:00
Damien Le Moal	cf28f6f923	zloop: fail zone append operations that are targeting full zones zloop_rw() will fail any regular write operation that targets a full sequential zone. The check for this is indirect and achieved by checking the write pointer alignment of the write operation. But this check is ineffective for zone append operations since these are alwasy automatically directed at a zone write pointer. Prevent zone append operations from being executed in a full zone with an explicit check of the zone condition. Fixes: `eb0570c7df` ("block: new zoned loop block device driver") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 09:40:09 -07:00

1 2 3 4 5 ...

1397181 Commits