linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-12-28 06:44:36 -05:00

Author	SHA1	Message	Date
Xue He	152c331bcd	block: plug attempts to batch allocate tags multiple times This patch aims to enable batch allocation of sufficient tags after batch IO submission with plug mechanism, thereby avoiding the need for frequent individual requests when the initial allocation is insufficient. ----------------------------------------------------------- HW: 16 CPUs/16 poll queues Disk: Samsung PM9A3 Gen4 3.84T CMD: [global] ioengine=io_uring group_reporting=1 time_based=1 runtime=1m refill_buffers=1 norandommap=1 randrepeat=0 fixedbufs=1 registerfiles=1 rw=randread iodepth=128 iodepth_batch_submit=32 iodepth_batch_complete_min=32 iodepth_batch_complete_max=128 iodepth_low=32 bs=4k numjobs=1 direct=1 hipri=1 [job1] filename=/dev/nvme0n1 name=batch_test ------------------------------------------------------------ Perf: base code: __blk_mq_alloc_requests() 1.47% patch: __blk_mq_alloc_requests() 0.75% ------------------------------------------------------------ Signed-off-by: hexue <xue01.he@samsung.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 14:59:41 -07:00
Jens Axboe	f43fdeb9a3	Merge branch 'loop-aio-nowait' into for-6.19/block Merge async IO IOCB_NOWAIT support from Ming: "This patchset improves loop aio perf by using IOCB_NOWAIT for avoiding to queue aio command to workqueue context, meantime refactor lo_rw_aio() a bit. In my test VM, loop disk perf becomes very close to perf of the backing block device(nvme/mq virtio-scsi). And Mikulas verified that this way can improve 12jobs sequential readwrite io by ~5X, and basically solve the reported problem together with loop MQ change. https://lore.kernel.org/linux-block/a8e5c76a-231f-07d1-a394-847de930f638@redhat.com/ Zhaoyang Huang also mentioned it may fix their performance issue on Android use case. The loop MQ change will be posted as standalone patch, because it needs UAPI change." Link: https://lore.kernel.org/linux-block/20251015110735.1361261-1-ming.lei@redhat.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk> * loop-aio-nowait: loop: add hint for handling aio via IOCB_NOWAIT loop: try to handle loop aio command via NOWAIT IO first loop: move command blkcg/memcg initialization into loop_queue_work loop: add lo_submit_rw_aio() loop: add helper lo_rw_aio_prep() loop: add helper lo_cmd_nr_bvec()	2025-11-18 06:49:59 -07:00
Ming Lei	837ed30396	loop: add hint for handling aio via IOCB_NOWAIT Add hint for using IOCB_NOWAIT to handle loop aio command for avoiding to cause write(especially randwrite) perf regression on sparse backed file. Try IOCB_NOWAIT in the following situations: - backing file is block device OR - READ aio command OR - there isn't any queued blocking async WRITEs, because NOWAIT won't cause contention with blocking WRITE, which often implies exclusive lock With this simple policy, perf regression of randwrite/write on sparse backing file is fixed. Link: https://lore.kernel.org/dm-devel/7d6ae2c9-df8e-50d0-7ad6-b787cb3cfab4@redhat.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 06:49:52 -07:00
Ming Lei	0ba93a906d	loop: try to handle loop aio command via NOWAIT IO first Try to handle loop aio command via NOWAIT IO first, then we can avoid to queue the aio command into workqueue. This is usually one big win in case that FS block mapping is stable, Mikulas verified [1] that this way improves IO perf by close to 5X in 12jobs sequential read/write test, in which FS block mapping is just stable. Fallback to workqueue in case of -EAGAIN. This way may bring a little cost from the 1st retry, but when running the following write test over loop/sparse_file, the actual effect on randwrite is obvious: ``` truncate -s 4G 1.img #1.img is created on XFS/virtio-scsi losetup -f 1.img --direct-io=on fio --direct=1 --bs=4k --runtime=40 --time_based --numjobs=1 --ioengine=libaio \ --iodepth=16 --group_reporting=1 --filename=/dev/loop0 -name=job --rw=$RW ``` - RW=randwrite: obvious IOPS drop observed - RW=write: a little drop(%5 - 10%) This perf drop on randwrite over sparse file will be addressed in the following patch. BLK_MQ_F_BLOCKING has to be set for calling into .read_iter() or .write_iter() which might sleep even though it is NOWAIT, and the only effect is that rcu read lock is replaced with srcu read lock. Link: https://lore.kernel.org/linux-block/a8e5c76a-231f-07d1-a394-847de930f638@redhat.com/ [1] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 06:49:52 -07:00
Ming Lei	f4788ae9d7	loop: move command blkcg/memcg initialization into loop_queue_work Move loop command blkcg/memcg initialization into loop_queue_work, and prepare for supporting to handle loop io command by IOCB_NOWAIT. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 06:49:52 -07:00
Ming Lei	c66e9708f9	loop: add lo_submit_rw_aio() Refactor lo_rw_aio() by extracting the I/O submission logic into a new helper function lo_submit_rw_aio(). This further improves code organization by separating the I/O preparation, submission, and completion handling into distinct phases. Prepare for using NOWAIT to improve loop performance. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 06:49:51 -07:00
Ming Lei	fd858d1ca9	loop: add helper lo_rw_aio_prep() Add helper lo_rw_aio_prep() to separate the preparation phase(setting up bio vectors and initializing the iocb structure) from the actual I/O execution in the loop block driver. Prepare for using NOWAIT to improve loop performance. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 06:49:51 -07:00
Ming Lei	c3e6c11147	loop: add helper lo_cmd_nr_bvec() Add lo_cmd_nr_bvec() and prepare for refactoring lo_rw_aio(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 06:49:51 -07:00
Sukrut Heroorkar	2c6d792d4b	drbd: turn bitmap I/O comments into regular block comments W=1 build warns because the bitmap I/O comments use '/**', which marks them as kernel-doc comments even though these functions do not document an external API. Convert these comments to regular block comments so kernel-doc no longer parses them. Signed-off-by: Sukrut Heroorkar <hsukrut3@gmail.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 06:48:33 -07:00
Li Chen	3179a5f7f8	block: rate-limit capacity change info log loop devices under heavy stress-ng loop streessor can trigger many capacity change events in a short time. Each event prints an info message from set_capacity_and_notify(), flooding the console and contributing to soft lockups on slow consoles. Switch the printk in set_capacity_and_notify() to pr_info_ratelimited() so frequent capacity changes do not spam the log while still reporting occasional changes. Cc: stable@vger.kernel.org Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 11:30:24 -07:00
Damien Le Moal	ade260ca85	Documentation: admin-guide: blockdev: update zloop parameters In Documentation/admin-guide/blockdev/zoned_loop.rst, add the description of the zone_append and ordered_zone_append configuration arguments of zloop "add" command (device creation). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 09:40:09 -07:00
Damien Le Moal	fcc6eaa3a0	zloop: introduce the ordered_zone_append configuration parameter The zone append operation processing for zloop devices is similar to any other command, that is, the operation is processed as a command work item, without any special serialization between the work items (beside the zone mutex for mutually exclusive code sections). This processing is fine and gives excellent performance. However, it has a side effect: zone append operation are very often reordered and processed in a sequence that is very different from their issuing order by the user. This effect is very visible using an XFS file system on top of a zloop device. A simple file write leads to many file extents as the data writes using zone append are reordered and so result in the physical order being different than the file logical order. E.g. executing: $ dd if=/dev/zero of=/mnt/test bs=1M count=10 && sync $ xfs_bmap /mnt/test /mnt/test: 0: [0..4095]: 2162688..2166783 1: [4096..6143]: 2168832..2170879 2: [6144..8191]: 2166784..2168831 3: [8192..10239]: 2170880..2172927 4: [10240..12287]: 2174976..2177023 5: [12288..14335]: 2172928..2174975 6: [14336..20479]: 2177024..2183167 For 10 IOs, 6 extents are created. This is fine and actually allows to exercise XFS zone garbage collection very well. However, this also makes debugging/working on XFS data placement harder as the underlying device will most of the time reorder IOs, resulting in many file extents. Allow a user to mitigate this with the new ordered_zone_append configuration parameter. For a zloop device created with this parameter specified, the sector of a zone append command is set early, when the command is submitted by the block layer with the zloop_queue_rq() function, instead of in the zloop_rw() function which is exectued later in the command work item context. This change ensures that more often than not, zone append operations data end up being written in the same order as the command submission by the user. In the case of XFS, this leads to far less file data extents. E.g., for the previous example, we get a single file data extent for the written file. $ dd if=/dev/zero of=/mnt/test bs=1M count=10 && sync $ xfs_bmap /mnt/test /mnt/test: 0: [0..20479]: 2162688..2183167 Since we cannot use a mutex in the context of the zloop_queue_rq() function to atomically set a zone append operation sector to the target zone write pointer location and increment that the write pointer, a new per-zone spinlock is introduced to protect a zone write pointer access and modifications. To check a zone write pointer location and set a zone append operation target sector to that value, the function zloop_set_zone_append_sector() is introduced and called from zloop_queue_rq(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 09:40:09 -07:00
Damien Le Moal	9236c5fdd5	zloop: introduce the zone_append configuration parameter A zloop zoned block device declares to the block layer that it supports zone append operations. That is, a zloop device ressembles an NVMe ZNS devices supporting zone append. This native support is fine but it does not allow exercising the block layer zone write plugging emulation of zone append, as is done with SCSI or ATA SMR HDDs. Introduce the zone_append configuration parameter to allow creating a zloop device without native support for zone append, thus relying on the block layer zone append emulation. If not specified, zone append support is enabled by default. Otherwise, a value of 0 disables native zone append and a value of 1 enables it. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 09:40:09 -07:00
Damien Le Moal	e3a96ca904	zloop: simplify checks for writes to sequential zones The function zloop_rw() already checks early that a request is fully contained within the target zone. So this check does not need to be done again for regular writes to sequential zones. Furthermore, since zone append operations are always directed to the zone write pointer location, we do not need to check for their alignment to that value after setting it. So turn the "if" checking the write pointer alignment into an "else if". While at it, improve the comment describing the write pointer modification and how this value is corrected in case of error. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 09:40:09 -07:00
Damien Le Moal	cf28f6f923	zloop: fail zone append operations that are targeting full zones zloop_rw() will fail any regular write operation that targets a full sequential zone. The check for this is indirect and achieved by checking the write pointer alignment of the write operation. But this check is ineffective for zone append operations since these are alwasy automatically directed at a zone write pointer. Prevent zone append operations from being executed in a full zone with an explicit check of the zone condition. Fixes: `eb0570c7df` ("block: new zoned loop block device driver") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 09:40:09 -07:00
Damien Le Moal	866d65745b	zloop: make the write pointer of full zones invalid The write pointer of zones that are in the full condition is always invalid. Reflect that fact by setting the write pointer of full zones to ULLONG_MAX. Fixes: `eb0570c7df` ("block: new zoned loop block device driver") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 09:40:09 -07:00
Guenter Roeck	6483faa393	block/blk-throttle: Remove throtl_slice from struct throtl_data throtl_slice is now a constant. Remove the variable and use the constant directly where needed. Cc: Yu Kuai <yukuai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 09:39:48 -07:00
Guenter Roeck	20d0b359c7	block/blk-throttle: drop unneeded blk_stat_enable_accounting After the removal of CONFIG_BLK_DEV_THROTTLING_LOW, it is no longer necessary to enable block accounting, so remove the call to blk_stat_enable_accounting(). With that, the track_bio_latency variable is no longer used and can be deleted from struct throtl_data. Also, including blk-stat.h is no longer necessary. Fixes: `bf20ab538c` ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") Cc: Yu Kuai <yukuai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 09:39:48 -07:00
Guenter Roeck	f76581f9f1	block/blk-throttle: Fix throttle slice time for SSDs Commit `d61fcfa4bb` ("blk-throttle: choose a small throtl_slice for SSD") introduced device type specific throttle slices if BLK_DEV_THROTTLING_LOW was enabled. Commit `bf20ab538c` ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") removed support for BLK_DEV_THROTTLING_LOW, but left the device type specific throttle slices in place. This effectively changed throttling behavior on systems with SSD which now use a different and non-configurable slice time compared to non-SSD devices. Practical impact is that throughput tests with low configured throttle values (65536 bps) experience less than expected throughput on SSDs, presumably due to rounding errors associated with the small throttle slice time used for those devices. The same tests pass when setting the throttle values to 65536 * 4 = 262144 bps. The original code sets the throttle slice time to DFL_THROTL_SLICE_HD if CONFIG_BLK_DEV_THROTTLING_LOW is disabled. Restore that code to fix the problem. With that, DFL_THROTL_SLICE_SSD is no longer necessary. Revert to the original code and re-introduce DFL_THROTL_SLICE to replace both DFL_THROTL_SLICE_HD and DFL_THROTL_SLICE_SSD. This effectively reverts commit `d61fcfa4bb` ("blk-throttle: choose a small throtl_slice for SSD"). While at it, also remove MAX_THROTL_SLICE since it is not used anymore. Fixes: `bf20ab538c` ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") Cc: Yu Kuai <yukuai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 09:39:48 -07:00
Keith Busch	2516c246d0	block: consider discard merge last If the next discard range is contiguous with the current range being considered, it's cheaper to expand the current range than to append an additional bio. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 09:39:36 -07:00
Rene Rebe	82d2048102	floppy: fix for PAGE_SIZE != 4KB For years I wondered why the floppy driver does not just work on sparc64, e.g: root@SUNW_375_0066:# disktype /dev/fd0 disktype: Can't open /dev/fd0: No such device or address [ 525.341906] disktype: attempt to access beyond end of device fd0: rw=0, sector=0, nr_sectors = 16 limit=8 [ 525.341991] floppy: error 10 while reading block 0 Turns out floppy.c __floppy_read_block_0 tries to read one page for the first test read to determine the disk size and thus fails if that is greater than 4k. Adjust minimum MAX_DISK_SIZE to PAGE_SIZE to fix floppy on sparc64 and likely all other PAGE_SIZE != 4KB configs. Cc: stable@vger.kernel.org Signed-off-by: René Rebe <rene@exactco.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 08:22:00 -07:00
Rene Rebe	79bd8c9814	ps3disk: use memcpy_{from,to}_bvec index With `6e0a48552b` (ps3disk: use memcpy_{from,to}_bvec) converting ps3disk to new bvec helpers, incrementing the offset was accidently lost, corrupting consecutive buffers. Restore index for non-corrupted data transfers. Fixes: `6e0a48552b` (ps3disk: use memcpy_{from,to}_bvec) Signed-off-by: René Rebe <rene@exactco.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-14 09:10:16 -07:00
Jens Axboe	fa0d2dc69e	Merge branch 'p2pdma-mmio-6.19.v5' into for-6.19/block Merge MMIO P2P DMA series from Leon: "This patch series improves block layer and NVMe driver support for MMIO memory regions, particularly for peer-to-peer (P2P) DMA transfers that go through the host bridge. The series addresses a critical gap where P2P transfers through the host bridge (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE) were not properly marked as MMIO memory, leading to potential issues with: - Inappropriate CPU cache synchronization operations on MMIO regions - Incorrect DMA mapping/unmapping that doesn't respect MMIO semantics - Missing IOMMU configuration for MMIO memory handling This work is extracted from the larger DMA physical API improvement series [1] and focuses specifically on block layer and NVMe requirements for MMIO memory support. [1] https://lore.kernel.org/all/cover.1757423202.git.leonro@nvidia.com/" Link: https://lore.kernel.org/linux-block/20251114-block-with-mmio-v5-0-69d00f73d766@nvidia.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk> * p2pdma-mmio-6.19.v5: block-dma: properly take MMIO path nvme-pci: migrate to dma_map_phys instead of map_page	2025-11-14 05:10:09 -07:00
Leon Romanovsky	37f0c7a8df	block-dma: properly take MMIO path In commit `eadaa8b255` ("dma-mapping: introduce new DMA attribute to indicate MMIO memory"), DMA_ATTR_MMIO attribute was added to describe MMIO addresses, which require to avoid any memory cache flushing, as an outcome of the discussion pointed in Link tag below. In case of PCI_P2PDMA_MAP_THRU_HOST_BRIDGE transfer, blk-mq-dm logic treated this as regular page and relied on "struct page" DMA flow. That flow performs CPU cache flushing, which shouldn't be done here, and doesn't set IOMMU_MMIO flag in DMA-IOMMU case. As a solution, let's encode peer-to-peer transaction type in NVMe IOD flags variable and provide it to blk-mq-dma API. Link: https://lore.kernel.org/all/f912c446-1ae9-4390-9c11-00dce7bf0fd3@arm.com/ Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-14 05:09:56 -07:00
Leon Romanovsky	61d43b1731	nvme-pci: migrate to dma_map_phys instead of map_page After introduction of dma_map_phys(), there is no need to convert from physical address to struct page in order to map page. So let's use it directly. Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-14 05:09:56 -07:00
Jens Axboe	8e1bf774ab	Merge branch 'elevator-switch-6.19' into for-6.19/block Merge elevator switching improvements from Nilay: "This patchset reorganizes the elevator switch path used during both nr_hw_queues update and elv_iosched_store() operations to address a recently reported lockdep splat [1]. The warning highlights a locking dependency between ->freeze_lock and ->elevator_lock on pcpu_alloc_mutex, triggered when the Kyber scheduler dynamically allocates its private scheduling data. The fix is to ensure that such allocations occur outside the locked sections, thus eliminating the dependency chain. While working on this, it also became evident that the nr_hw_queue update code maintains two disjoint xarrays—one for elevator tags and another for elevator type—both serving the same purpose. Unifying these into a single elv_change_ctx structure improves clarity and maintainability. This series therefore implements five patches: The first perparatory patch unifies elevator tags and type xarrays. It combines both xarrays into a single struct elv_change_ctx, simplifying per-queue elevator state management. The second patch is aimed to group together all elevator-related resources that share the same lifetime and as a first step we move the elevator tags pointer from struct elv_change_ctx into the newly introduced struct elevator_resources. The subsequent patch extends the struct elevator_resources to include other elevator-related data. The third patch introduce ->alloc_sched_data and ->free_sched_data elevator ops which could be then used to safely allocate and free scheduler data. The fourth patch now builds upon the previous patch and starts using the newly introduced alloc/free sched data methods in the earlier patch during elevator switch and nr_hw_queue update. And while doing so, it's ensured that sched data allocation and free happens before we acquire ->freeze_lock and ->elevator_lock thus preventing its dependency on pcpu_alloc_mutex. The last patch of this series converts Kyber scheduler to use the new methods inroduced in the previous patch. It hooks Kyber’s scheduler data allocation and teardown logic from ->init_sched and ->exit_sched into the new methods, ensuring memory operations are performed outside locked sections. Together, these changes simplify the elevator switch logic and prevent the reported lockdep splat." Link: https://lore.kernel.org/linux-block/20251113090619.2030737-1-nilay@linux.ibm.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk> * elevator-switch-6.19: block: define alloc_sched_data and free_sched_data methods for kyber block: use {alloc\|free}_sched data methods block: introduce alloc_sched_data and free_sched_data elevator methods block: move elevator tags into struct elevator_resources block: unify elevator tags and type xarrays into struct elv_change_ctx	2025-11-13 09:28:00 -07:00
Nilay Shroff	d4c3ef56a1	block: define alloc_sched_data and free_sched_data methods for kyber Currently, the Kyber elevator allocates its private data dynamically in ->init_sched and frees it in ->exit_sched. However, since ->init_sched is invoked during elevator switch after acquiring both ->freeze_lock and ->elevator_lock, it may trigger the lockdep splat [1] due to dependency on pcpu_alloc_mutex. To resolve this, move the elevator data allocation and deallocation logic from ->init_sched and ->exit_sched into the newly introduced ->alloc_sched_data and ->free_sched_data methods. These callbacks are invoked before acquiring ->freeze_lock and ->elevator_lock, ensuring that memory allocation happens safely without introducing additional locking dependencies. This change breaks the dependency chain involving pcpu_alloc_mutex and prevents the reported lockdep warning. [1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/ Reported-by: Changhui Zhong <czhong@redhat.com> Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/ Tested-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 09:27:49 -07:00
Nilay Shroff	0315476e78	block: use {alloc\|free}_sched data methods The previous patch introduced ->alloc_sched_data and ->free_sched_data methods. This patch builds upon that by now using these methods during elevator switch and nr_hw_queue update. It's also ensured that scheduler-specific data is allocated and freed through the new callbacks outside of the ->freeze_lock and ->elevator_lock locking contexts, thereby preventing any dependency on pcpu_alloc_mutex. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 09:27:49 -07:00
Nilay Shroff	61019afdf6	block: introduce alloc_sched_data and free_sched_data elevator methods The recent lockdep splat [1] highlights a potential deadlock risk involving ->elevator_lock and ->freeze_lock dependencies on -pcpu_alloc_ mutex. The trace shows that the issue occurs when the Kyber scheduler allocates dynamic memory for its elevator data during initialization. To address this, introduce two new elevator operation callbacks: ->alloc_sched_data and ->free_sched_data. The subsequent patch would build upon these newly introduced methods to suppress lockdep splat[1]. [1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/ Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 09:27:49 -07:00
Nilay Shroff	04728ce909	block: move elevator tags into struct elevator_resources This patch introduces a new structure, struct elevator_resources, to group together all elevator-related resources that share the same lifetime. As a first step, this change moves the elevator tag pointer from struct elv_change_ctx into the new struct elevator_resources. Additionally, rename blk_mq_alloc_sched_tags_batch() and blk_mq_free_sched_tags_batch() to blk_mq_alloc_sched_res_batch() and blk_mq_free_sched_res_batch(), respectively. Introduce two new wrapper helpers, blk_mq_alloc_sched_res() and blk_mq_free_sched_res(), around blk_mq_alloc_sched_tags() and blk_mq_free_sched_tags(). These changes pave the way for consolidating the allocation and freeing of elevator-specific resources into common helper functions. This refactoring improves encapsulation and prepares the code for future extensions, allowing additional elevator-specific data to be added to struct elevator_resources without cluttering struct elv_change_ctx. Subsequent patches will extend struct elevator_resources to include other elevator-related data. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 09:27:49 -07:00
Nilay Shroff	232143b605	block: unify elevator tags and type xarrays into struct elv_change_ctx Currently, the nr_hw_queues update path manages two disjoint xarrays — one for elevator tags and another for elevator type — both used during elevator switching. Maintaining these two parallel structures for the same purpose adds unnecessary complexity and potential for mismatched state. This patch unifies both xarrays into a single structure, struct elv_change_ctx, which holds all per-queue elevator change context. A single xarray, named elv_tbl, now maps each queue (q->id) in a tagset to its corresponding elv_change_ctx entry, encapsulating the elevator tags, type and name references. This unification simplifies the code, improves maintainability, and clarifies ownership of per-queue elevator state. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 09:27:49 -07:00
Jens Axboe	c3f42a6de7	Merge branch 'bcache-updates-6.19' into for-6.19/block Merge bcache updates from Coly for 6.19: "The major change is from me, which is to remove useless discard interface and code for cache device (not the backing device). And the last patch about gc latency is a cooperative result from Robert Pang (Google), Mingzhe Zou (Easystack) and me, by inspired from their previous works, I compose the final version and Robert prvides positive benchmark result. Marco contributes 2 patches to improve the usage of per-cpu system work queue. Gustavo contributes a patch to fix the not-at-end flexible-array member warning by gcc14. And Qianfeng contributes a code cleanup patch to remove redundant __GFP_NOWARN." Link: https://lore.kernel.org/linux-block/20251113053630.54218-1-colyli@fnnas.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk> * bcache-updates-6.19: bcache: Avoid -Wflex-array-member-not-at-end warning bcache: WQ_PERCPU added to alloc_workqueue users bcache: replace use of system_wq with system_percpu_wq bcache: remove redundant __GFP_NOWARN bcache: reduce gc latency by processing less nodes and sleep less time bcache: remove discard sysfs interface document bcache: drop discard sysfs interface bcache: remove discard code from alloc.c bcache: get rid of discard code from journal	2025-11-13 09:18:19 -07:00
Gustavo A. R. Silva	699122b590	bcache: Avoid -Wflex-array-member-not-at-end warning -Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it, globally. Use the new TRAILING_OVERLAP() helper to fix the following warning: drivers/md/bcache/bset.h:330:27: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] This helper creates a union between a flexible-array member (FAM) and a set of MEMBERS that would otherwise follow it. This overlays the trailing MEMBER struct btree_iter_set stack_data[MAX_BSETS]; onto the FAM struct btree_iter::data[], while keeping the FAM and the start of MEMBER aligned. The static_assert() ensures this alignment remains, and it's intentionally placed immediately after the corresponding structures --no blank line in between. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Coly Li <colyli@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 09:18:06 -07:00
Marco Crivellari	c0c8082142	bcache: WQ_PERCPU added to alloc_workqueue users Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt-in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This patch continues the effort to refactor worqueue APIs, which has begun with the change introducing new workqueues and a new alloc_workqueue flag: commit `128ea9f6cc` ("workqueue: Add system_percpu_wq and system_dfl_wq") commit `930c2ea566` ("workqueue: Add new WQ_PERCPU flag") This change adds a new WQ_PERCPU flag to explicitly request alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Coly Li <colyli@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 09:18:06 -07:00
Marco Crivellari	fd82071814	bcache: replace use of system_wq with system_percpu_wq Currently if a user enqueues a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. This patch continues the effort to refactor worqueue APIs, which has begun with the change introducing new workqueues and a new alloc_workqueue flag: commit `128ea9f6cc` ("workqueue: Add system_percpu_wq and system_dfl_wq") commit `930c2ea566` ("workqueue: Add new WQ_PERCPU flag") system_wq should be the per-cpu workqueue, yet in this name nothing makes that clear, so replace system_wq with system_percpu_wq. The old wq (system_wq) will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Coly Li <colyli@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 09:18:06 -07:00
Qianfeng Rong	21194c44b6	bcache: remove redundant __GFP_NOWARN GFP_NOWAIT already includes __GFP_NOWARN, so let's remove the redundant __GFP_NOWARN. Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Acked-by: Coly Li <colyli@fnnas.com> Acked-by: Coly Li <colyli@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 09:18:06 -07:00
Coly Li	70bc173ce0	bcache: reduce gc latency by processing less nodes and sleep less time When bcache device is busy for high I/O loads, there are two methods to reduce the garbage collection latency, - Process less nodes in eac loop of incremental garbage collection in btree_gc_recurse(). - Sleep less time between two full garbage collection in bch_btree_gc(). This patch introduces to hleper routines to provide different garbage collection nodes number and sleep intervel time. - btree_gc_min_nodes() If there is no front end I/O, return 128 nodes to process in each incremental loop, otherwise only 10 nodes are returned. Then front I/O is able to access the btree earlier. - btree_gc_sleep_ms() If there is no synchronized wait for bucket allocation, sleep 100 ms between two incremental GC loop. Othersize only sleep 10 ms before incremental GC loop. Then a faster GC may provide available buckets earlier, to avoid most of bcache working threads from being starved by buckets allocation. The idea is inspired by works from Mingzhe Zou and Robert Pang, but much simpler and the expected behavior is more predictable. Signed-off-by: Coly Li <colyli@fnnas.com> Signed-off-by: Robert Pang <robertpang@google.com> Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 09:18:06 -07:00
Coly Li	7bf90cd740	bcache: remove discard sysfs interface document This patch removes documents of bcache discard sysfs interface, it drops discard related sections from, - Documentation/ABI/testing/sysfs-block-bcache - Documentation/admin-guide/bcache.rst Signed-off-by: Coly Li <colyli@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 09:18:06 -07:00
Coly Li	73a004f83c	bcache: drop discard sysfs interface Since discard code is removed, now the sysfs interface to enable discard is useless. This patch removes the corresponding sysfs entry, and remove bool variable 'discard' from struct cache as well. Signed-off-by: Coly Li <colyli@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 09:18:06 -07:00
Coly Li	b4056afbd4	bcache: remove discard code from alloc.c Bcache allocator initially has no free space to allocate. Firstly it does a garbage collection which is triggered by a cache device write and fills free space into ca->free[] lists. The discard happens after the free bucket is handled by garbage collection added into one of the ca->free[] lists. But normally this bucket will be allocated out very soon to requester and filled data onto it. The discard hint on this bucket LBA range doesn't help SSD control to improve internal erasure performance, and waste extra CPU cycles to issue discard bios. This patch removes the almost-useless discard code from alloc.c. Signed-off-by: Coly Li <colyli@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 09:18:06 -07:00
Coly Li	0c72e9fcc1	bcache: get rid of discard code from journal In bcache journal there is discard functionality but almost useless in reality. Because discard happens after a journal bucket is reclaimed, and the reclaimed bucket is allocated for new journaling immediately. There is no time for underlying SSD to use the discard hint for internal data management. The discard code in bcache journal doesn't bring any performance optimization and wastes CPU cycles for issuing discard bios. Therefore this patch gits rid of it from journal.c and journal.h. Signed-off-by: Coly Li <colyli@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 09:18:06 -07:00
Damien Le Moal	7b2038b1b1	dm: fix zone reset all operation processing dm_zone_get_reset_bitmap() is used to generate a bitmap of the zones of a zoned device target when a REQ_OP_ZONE_RESET_ALL request is being processed. This bitmap is built by executing a zone report with a report callback set to the function dm_zone_need_reset_cb() in struct dm_report_zones_args. However, the cb callback pointer is not anymore the same as the callback specified by callers of the blkdev_report_zones() function. Rather, this is a DM internal callback and report zones callback functions from blkdev_report_zones() are passed using struct blk_report_zones_args, introduced with commit db9aed869f34 ("block: introduce disk_report_zone()"). This commit changed the DM main report zones callback handler function dm_report_zones_cb() to call the new disk_report_zone() so that callback functions from blkdev_report_zones() are executed, and this change resulted in the DM internal dm_zone_need_reset_cb() callback function to not be executed anymore, turning any REQ_OP_ZONE_RESET_ALL request into a no-op. Fix this by calling in dm_report_zones_cb() the DM internal cb function specified in struct dm_report_zones_args. Fixes: db9aed869f34 ("block: introduce disk_report_zone()"). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 09:10:04 -07:00
Damien Le Moal	881880b6f3	block: fix NULL pointer dereference in disk_report_zones() Commit 2284eec5053d ("block: introduce blkdev_get_zone_info()") introduced the report_active field in struct blk_report_zones_args so that open and closed zones can be reported with the condition BLK_ZONE_COND_ACTIVE in the case of a cached report zone. However, the args pointer to a struct blk_report_zones_args that is passed to disk_report_zones() can be NULL, e.g. in the case of internal report zones operations for device mapper zoned targets. Fix disk_report_zones() to make sure to check that the args is not null before updating a zone condition for cached zone reports. Fixes: 2284eec5053d ("block: introduce blkdev_get_zone_info()") Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 09:10:04 -07:00
Damien Le Moal	c2b8d20628	block: fix NULL pointer dereference in blk_zone_reset_all_bio_endio() For zoned block devices that do not need zone write plugs (e.g. most device mapper devices that support zones), the disk hash table of zone write plugs is NULL. For such devices, blk_zone_reset_all_bio_endio() should not attempt to scan this has table as that causes a NULL pointer dereference. Fix this by checking that the disk does have zone write plugs using the atomic counter. This is equivalent to checking for a non-NULL hash table but has the advantage to also speed up the execution of blk_zone_reset_all_bio_endio() for devices that do use zone write plugs but do not have any plug in the hash table (e.g. a disk with only full zones). Fixes: `efae226c2e` ("block: handle zone management operations completions") Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 09:10:04 -07:00
Bart Van Assche	f233339188	blk-zoned: Move code from disk_zone_wplug_add_bio() into its caller Move the following code into the only caller of disk_zone_wplug_add_bio(): - The code for clearing the REQ_NOWAIT flag. - The code that sets the BLK_ZONE_WPLUG_PLUGGED flag. - The disk_zone_wplug_schedule_bio_work() call. This patch moves all code that is related to REQ_NOWAIT or to bio scheduling into a single function. Additionally, the 'schedule_bio_work' variable is removed. No functionality has been changed. Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlmoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-12 14:05:23 -07:00
Bart Van Assche	faa3be1a61	blk-zoned: Document disk_zone_wplug_schedule_bio_work() locking Document that all callers hold this lock because the code in disk_zone_wplug_schedule_bio_work() depends on this. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-12 14:05:23 -07:00
Bart Van Assche	fd0ae4754c	blk-zoned: Fix a typo in a source code comment Remove a superfluous parenthesis that was introduced by commit `fa8555630b` ("blk-zoned: Improve the queue reference count strategy documentation"). Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-12 14:05:23 -07:00
Keith Busch	3749ea4dee	null_blk: fix zone read length beyond write pointer Fix up the divisor calculating the number of zone sectors being read and handle a read that straddles the zone write pointer. The length is rounded up a sector boundary, so be sure to truncate any excess bytes off to avoid copying past the data segment. Fixes: `3451cf34f5` ("null_blk: allow byte aligned memory offsets") Signed-off-by: Keith Busch <kbusch@kernel.org> Tested-by: Bart van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-12 10:02:56 -07:00
Kriish Sharma	6d7e3870af	blk-mq-dma: fix kernel-doc function name for integrity DMA iterator Documentation build reported: Warning: block/blk-mq-dma.c:373 expecting prototype for blk_rq_integrity_dma_map_iter_start(). Prototype was for blk_rq_integrity_dma_map_iter_next() instead The kernel-doc comment above `blk_rq_integrity_dma_map_iter_next()` used the wrong function name (`blk_rq_integrity_dma_map_iter_start`) in its header. This patch corrects the function name in the kernel-doc block to match the actual implementation, ensuring clean documentation builds. Fixes: `fec9b16dc5` ("blk-mq-dma: add scatter-less integrity data DMA mapping") Signed-off-by: Kriish Sharma <kriish.sharma2006@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-11 08:37:33 -07:00
Keith Busch	fd9ecd0052	block: fix merging data-less bios The data segment gaps the block layer tracks doesn't apply to bio's that don't have data. Skip calculating this to fix a NULL pointer access. Fixes: `2f6b2565d4` ("block: accumulate memory segment gaps per bio") Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-11 08:35:59 -07:00

1 2 3 4 5 ...

1397146 Commits