Commit Graph

12 Commits

Yongpeng Yang
4b2b03151e zloop: use READ_ONCE() to read lo->lo_state in queue_rq path
In the queue_rq path, zlo->state is accessed without locking, and direct
access may read stale data. This patch uses READ_ONCE() to read
zlo->state and data_race() to silence code checkers, and changes all
assignments to use WRITE_ONCE().
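
As a minimal sketch of the pattern (the state values and the error
returned below are illustrative, not taken from the driver):

  /* Writer side (setup/removal paths): pair every store with WRITE_ONCE(). */
  WRITE_ONCE(zlo->state, Zlo_live);

  /* Reader side in the queue_rq path: no lock is held, so mark the read. */
  if (READ_ONCE(zlo->state) != Zlo_live)
          return BLK_STS_IOERR;

  /* Reads that only feed diagnostics can use data_race() to tell KCSAN
   * that the race is tolerated. */
  pr_debug("zloop: state %d\n", data_race(zlo->state));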

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-15 09:32:42 -07:00
Chaitanya Kulkarni
71075d25ca blk-mq: add blk_rq_nr_bvec() helper
Add a new helper function blk_rq_nr_bvec() that returns the number of
bvecs in a request. This count represents the number of iterations
rq_for_each_bvec() would perform on a request.

Drivers need to pre-allocate bvec arrays before iterating through
a request's bvecs. Currently, they manually count bvecs using
rq_for_each_bvec() in a loop, which is repetitive. The new helper
centralizes this logic.
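
The helper essentially wraps the counting loop the drivers were open
coding; a sketch of its likely shape (the actual definition lives in
include/linux/blk-mq.h):

  static inline unsigned int blk_rq_nr_bvec(struct request *rq)
  {
          struct req_iterator iter;
          struct bio_vec bv;
          unsigned int nr_bvec = 0;

          /* One iteration per bvec: this is exactly the count a driver
           * needs to size a contiguous bvec array before building an
           * iov_iter for file I/O. */
          rq_for_each_bvec(bv, rq, iter)
                  nr_bvec++;

          return nr_bvec;
  }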

This pattern exists in loop and zloop drivers, where multi-bio requests
require copying bvecs into a contiguous array before creating
an iov_iter for file operations.

Update loop and zloop drivers to use the new helper, eliminating
duplicate code.

This patch also provides a clear API that avoids potential misuse of
blk_rq_nr_phys_segments() for counting bvecs: one bvec can cover more
than one segment, so using blk_rq_nr_phys_segments() for the count can
lead to extra memory allocation:

[ 6155.673749] nullb_bio: 128K bio as ONE bvec: sector=0, size=131072
[ 6155.673846] null_blk: #### null_handle_data_transfer:1375
[ 6155.673850] null_blk: nr_bvec=1 blk_rq_nr_phys_segments=2
[ 6155.674263] null_blk: #### null_handle_data_transfer:1375
[ 6155.674267] null_blk: nr_bvec=1 blk_rq_nr_phys_segments=1

Reviewed-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04 07:19:26 -07:00
Chaitanya Kulkarni
e8f0abdd49 zloop: clear nowait flag in workqueue context
The zloop driver advertises REQ_NOWAIT support through BLK_FEAT_NOWAIT
(enabled by default for all blk-mq devices), and honors the nowait
behavior throughout zloop_queue_rq().

However, actual I/O to the backing file is performed in a workqueue,
where blocking is allowed.

To avoid imposing unnecessary non-blocking constraints in this blocking
context, clear the REQ_NOWAIT flag before processing the request in the
workqueue context.
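
A sketch of the idea; the work handler and command structure names
below follow the driver's naming pattern but are not quoted verbatim:

  static void zloop_rw_work(struct work_struct *work)
  {
          struct zloop_cmd *cmd = container_of(work, struct zloop_cmd, work);
          struct request *rq = blk_mq_rq_from_pdu(cmd);

          /*
           * The submitter may have set REQ_NOWAIT, but this work item runs
           * in a context where blocking on the backing file is allowed.
           * Clear the flag so the file I/O path does not fail with -EAGAIN
           * for no reason.
           */
          rq->cmd_flags &= ~REQ_NOWAIT;

          /* ... perform the read/write against the zone backing file ... */
  }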

Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-20 07:42:32 -07:00
Damien Le Moal
a9637ab93c zloop: fix zone append check in zloop_rw()
While commit cf28f6f923 ("zloop: fail zone append operations that are
targeting full zones") added a check in zloop_rw() that a zone append is
not issued to a full zone, commit e3a96ca904 ("zloop: simplify checks
for writes to sequential zones") inadvertently removed the check to
verify that there is enough unwritten space in a zone for an incoming
zone append operation.

Re-add this check in zloop_rw() to make sure we do not write beyond the
end of a zone. Of note is that this same check is already present in the
function zloop_set_zone_append_sector() when ordered zone append is in
use.
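
The re-added check boils down to comparing the request size against the
space left below the zone's writable capacity; a sketch with
approximate names (zone_capacity and the error path are assumptions):

  /* A zone append writes at the write pointer: fail it if the data does
   * not fit in what remains of the zone's writable capacity. */
  if (zone->wp + nr_sectors > zone->start + zone_capacity) {
          ret = -EIO;
          goto unlock;
  }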

Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com>
Fixes: e3a96ca904 ("zloop: simplify checks for writes to sequential zones")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-19 07:39:42 -07:00
Damien Le Moal
fcc6eaa3a0 zloop: introduce the ordered_zone_append configuration parameter
Zone append operation processing for zloop devices is similar to that
of any other command: the operation is processed as a command work
item, without any special serialization between work items (besides the
zone mutex for mutually exclusive code sections).

This processing is fine and gives excellent performance. However, it has
a side effect: zone append operations are very often reordered and
processed in a sequence that differs from the order in which the user
issued them. This effect is very visible with an XFS file system on top
of a zloop device: a simple file write leads to many file extents, as
the data writes issued using zone append are reordered, making the
physical order different from the file's logical order.
E.g. executing:

$ dd if=/dev/zero of=/mnt/test bs=1M count=10 && sync
$ xfs_bmap /mnt/test
/mnt/test:
	0: [0..4095]: 2162688..2166783
	1: [4096..6143]: 2168832..2170879
	2: [6144..8191]: 2166784..2168831
	3: [8192..10239]: 2170880..2172927
	4: [10240..12287]: 2174976..2177023
	5: [12288..14335]: 2172928..2174975
	6: [14336..20479]: 2177024..2183167

For 10 IOs, 6 extents are created.

This is fine and actually exercises XFS zone garbage collection very
well. However, it also makes debugging and working on XFS data
placement harder, as the underlying device will most of the time
reorder IOs, resulting in many file extents.

Allow a user to mitigate this with the new ordered_zone_append
configuration parameter. For a zloop device created with this parameter
specified, the sector of a zone append command is set early, when the
command is submitted by the block layer with the zloop_queue_rq()
function, instead of in the zloop_rw() function, which is executed
later in the command work item context. This change ensures that, more
often than not, the data of zone append operations ends up being
written in the same order in which the user submitted the commands.

In the case of XFS, this leads to far fewer file data extents. E.g., for
the previous example, we get a single file data extent for the written
file.

$ dd if=/dev/zero of=/mnt/test bs=1M count=10 && sync
$ xfs_bmap /mnt/test
/mnt/test:
	0: [0..20479]: 2162688..2183167

Since a mutex cannot be used in the context of the zloop_queue_rq()
function to atomically set a zone append operation sector to the target
zone write pointer location and increment that write pointer, a new
per-zone spinlock is introduced to protect zone write pointer accesses
and modifications. To check a zone write pointer location and set a
zone append operation target sector to that value, the function
zloop_set_zone_append_sector() is introduced and called from
zloop_queue_rq().
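
A sketch of how such a function can look; the prototype, return
convention and field names (wp_lock, zone_capacity) are assumptions,
not the exact code:

  static bool zloop_set_zone_append_sector(struct zloop_device *zlo,
                                           struct zloop_zone *zone,
                                           unsigned int nr_sectors,
                                           sector_t *sector)
  {
          bool ret = true;

          /*
           * Called from zloop_queue_rq(): a per-zone spinlock, rather than
           * the zone mutex, protects the write pointer so the append sector
           * can be assigned at submission time, preserving issuing order.
           */
          spin_lock(&zone->wp_lock);
          if (zone->cond == BLK_ZONE_COND_FULL ||
              zone->wp + nr_sectors > zone->start + zlo->zone_capacity) {
                  ret = false;
          } else {
                  *sector = zone->wp;
                  zone->wp += nr_sectors;
          }
          spin_unlock(&zone->wp_lock);

          return ret;
  }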

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 09:40:09 -07:00
Damien Le Moal
9236c5fdd5 zloop: introduce the zone_append configuration parameter
A zloop zoned block device declares to the block layer that it supports
zone append operations. That is, a zloop device resembles an NVMe ZNS
device supporting zone append.

This native support is fine but it does not allow exercising the block
layer zone write plugging emulation of zone append, as is done with SCSI
or ATA SMR HDDs.

Introduce the zone_append configuration parameter to allow creating a
zloop device without native support for zone append, thus relying on the
block layer zone append emulation. If not specified, zone append support
is enabled by default. Otherwise, a value of 0 disables native zone
append and a value of 1 enables it.
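
Most likely this comes down to the zone append queue limit advertised
when the disk is created; a sketch under that assumption (the limit
field name and the zlo->zone_append flag are illustrative):

  /* Queue limits setup at device creation time. */
  if (zlo->zone_append) {
          /* Native zone append: accept appends up to the zone capacity. */
          lim.max_hw_zone_append_sectors = zlo->zone_capacity;
  } else {
          /* A zero limit makes the block layer emulate REQ_OP_ZONE_APPEND
           * with zone write plugging, as it does for SMR SCSI/ATA disks. */
          lim.max_hw_zone_append_sectors = 0;
  }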

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 09:40:09 -07:00
Damien Le Moal
e3a96ca904 zloop: simplify checks for writes to sequential zones
The function zloop_rw() already checks early that a request is fully
contained within the target zone. So this check does not need to be done
again for regular writes to sequential zones. Furthermore, since zone
append operations are always directed to the zone write pointer
location, we do not need to check for their alignment to that value
after setting it. So turn the "if" checking the write pointer alignment
into an "else if".

While at it, improve the comment describing the write pointer
modification and how this value is corrected in case of error.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 09:40:09 -07:00
Damien Le Moal
cf28f6f923 zloop: fail zone append operations that are targeting full zones
zloop_rw() will fail any regular write operation that targets a full
sequential zone. The check for this is indirect and achieved by checking
the write pointer alignment of the write operation. But this check is
ineffective for zone append operations since these are always
automatically directed at a zone write pointer.

Prevent zone append operations from being executed in a full zone with
an explicit check of the zone condition.
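
In code, the explicit check is of the form below (the error value and
unlock label are assumptions):

  /* Zone appends are automatically redirected to the write pointer, so
   * the write pointer alignment check cannot catch a full zone. Check
   * the zone condition explicitly. */
  if (zone->cond == BLK_ZONE_COND_FULL) {
          ret = -EIO;
          goto unlock;
  }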

Fixes: eb0570c7df ("block: new zoned loop block device driver")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 09:40:09 -07:00
Damien Le Moal
866d65745b zloop: make the write pointer of full zones invalid
The write pointer of zones that are in the full condition is always
invalid. Reflect that fact by setting the write pointer of full zones
to ULLONG_MAX.
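
In practice this means every transition to the full condition now also
invalidates the write pointer, along these lines:

  /* A full zone has no defined write pointer: do not leave a stale
   * sector value behind when transitioning to the full condition. */
  zone->cond = BLK_ZONE_COND_FULL;
  zone->wp = ULLONG_MAX;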

Fixes: eb0570c7df ("block: new zoned loop block device driver")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 09:40:09 -07:00
Damien Le Moal
fdb9aed869 block: introduce disk_report_zone()
Commit b76b840fd9 ("dm: Fix dm-zoned-reclaim zone write pointer
alignment") introduced an indirect call for the callback function of a
report zones executed with blkdev_report_zones(). This is necessary so
that the function disk_zone_wplug_sync_wp_offset() can be called to
refresh a zone write plug zone write pointer offset after a write error.
However, this solution makes the flow of zone information harder to
follow.

Clean this up by introducing the new blk_report_zones_args structure to
define a zone report callback and its private data and introduce the
helper function disk_report_zone() which calls both
disk_zone_wplug_sync_wp_offset() and the zone report user callback
function for all zones of a zone report. This helper function must be
called by all block device drivers that implement the report zones
block operation in order to correctly report zone information.

All block device drivers supporting the report_zones block operation are
updated to use this new scheme.
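
A sketch of the shape of the new interface as described above; the
exact fields and parameter list are defined by the block layer headers,
so treat the following as an approximation:

  struct blk_report_zones_args {
          report_zones_cb cb;     /* user callback invoked for each zone */
          void *data;             /* private data passed to the callback */
  };

  /* In a driver's ->report_zones() method, each zone read from the device
   * (or from the backing files, for zloop) is funneled through the helper,
   * which syncs the zone write plug write pointer offset before calling
   * the user callback. */
  ret = disk_report_zone(disk, &blkz, idx, args);
  if (ret)
          return ret;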

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05 08:07:21 -07:00
Shin'ichiro Kawasaki
765761851d zloop: fix KASAN use-after-free of tag set
When a zoned loop device, or zloop device, is removed, a KASAN-enabled
kernel reports "BUG KASAN use-after-free" in blk_mq_free_tag_set(). The
BUG happens because zloop_ctl_remove() calls put_disk(), which invokes
zloop_free_disk(). The zloop_free_disk() frees the memory allocated for
the zlo pointer. However, after the memory is freed, zloop_ctl_remove()
calls blk_mq_free_tag_set(&zlo->tag_set), which accesses the freed zlo.
Hence the KASAN use-after-free.

 zloop_ctl_remove()
  put_disk(zlo->disk)
   put_device()
    kobject_put()
     ...
      zloop_free_disk()
        kvfree(zlo)
  blk_mq_free_tag_set(&zlo->tag_set)

To avoid the BUG, move the call to blk_mq_free_tag_set(&zlo->tag_set)
from zloop_ctl_remove() into zloop_free_disk(). This ensures that
the tag_set is freed before the call to kvfree(zlo).
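
The fixed teardown order looks roughly like this (the private_data
access and the elided resource cleanup are assumptions):

  static void zloop_free_disk(struct gendisk *disk)
  {
          struct zloop_device *zlo = disk->private_data;

          /* ... release zone resources and backing files ... */

          /* Free the tag set while zlo is still valid, instead of doing
           * it after put_disk() in zloop_ctl_remove(). */
          blk_mq_free_tag_set(&zlo->tag_set);
          kvfree(zlo);
  }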

Fixes: eb0570c7df ("block: new zoned loop block device driver")
CC: stable@vger.kernel.org
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250731110745.165751-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-31 15:01:07 -06:00
Damien Le Moal
eb0570c7df block: new zoned loop block device driver
The zoned loop block device driver allows a user to create emulated
zoned block devices using one regular file per zone as backing storage.
Compared to null_blk or scsi_debug, it has the advantage of allowing
emulating large zoned devices without requiring the same amount of
memory as the capacity of the emulated device. Furthermore, zoned
devices emulated with this driver can be re-started after a host reboot
without any loss of the state of the device zones, which is something
that null_blk and scsi_debug do not support.

This initial implementation is simple and does not support zone resource
limits. That is, a zoned loop block device's limits for the maximum
number of open zones and maximum number of active zones are always 0.

This driver can be either compiled in-kernel or as a module, named
"zloop". Compilation of this driver depends on the block layer support
for zoned block devices (CONFIG_BLK_DEV_ZONED must be set).

Using the zloop driver to create and delete zoned block devices is
done by writing commands to the zoned loop control character device file
(/dev/zloop-control). Creating a device is done with:

  $ echo "add [options]" > /dev/zloop-control

The options available for the "add" operation cat be listed by reading
the zloop-control device file:

  $ cat /dev/zloop-control
  add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u
  remove id=%d

The available options allow controlling the zoned device total
capacity, zone size, zone capacity of sequential zones, total number
of conventional zones, base directory for the zone backing files,
number of I/O queues, and maximum queue depth of the I/O queues.

Deleting a device is done using the "remove" command:

  $ echo "remove id=0" > /dev/zloop-control

This implementation passes various tests using zonefs and fio (t/zbd
tests) and provides a state machine for zone conditions that is
compliant with the T10 ZBC and NVMe ZNS specifications.

Co-developed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250407075222.170336-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-01 17:03:56 -06:00