linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-01 00:02:19 -04:00

Author	SHA1	Message	Date
John Garry	fa6fec8281	md/raid0: Atomic write support Set BLK_FEAT_ATOMIC_WRITES_STACKED to enable atomic writes. All other stacked device request queue limits should automatically be set properly. With regards to atomic write max bytes limit, this will be set at hw_max_sectors and this is limited by the stripe width, which we want. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20241118105018.1870052-4-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-19 10:30:02 -07:00
John Garry	d7f36dc446	block: Support atomic writes limits for stacked devices Allow stacked devices to support atomic writes by aggregating the minimum capability of all bottom devices. Flag BLK_FEAT_ATOMIC_WRITES_STACKED is set for stacked devices which have been enabled to support atomic writes. Some things to note on the implementation: - For simplicity, all bottom devices must have same atomic write boundary value (if any) - The atomic write boundary must be a power-of-2 already, but this restriction could be relaxed. Furthermore, it is now required that the chunk sectors for a top device must be aligned with this boundary. - If a bottom device atomic write unit min/max are not aligned with the top device chunk sectors, the top device atomic write unit min/max are reduced to a value which works for the chunk sectors. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20241118105018.1870052-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-19 10:30:02 -07:00
John Garry	d00eea91de	block: Add extra checks in blk_validate_atomic_write_limits() It is so far expected that the limits passed are valid. In future atomic writes will be supported for stacked block devices, and calculating the limits there will be complicated, so add extra sanity checks to ensure that the values are always valid. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20241118105018.1870052-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-19 10:30:02 -07:00
Ming Lei	34c1227035	ublk: fix error code for unsupported command ENOTSUPP is for kernel use only, and shouldn't be sent to userspace. Fix it by replacing it with EOPNOTSUPP. Cc: stable@vger.kernel.org Fixes: `bfbcef0363` ("ublk_drv: move ublk_get_device_from_id into ublk_ctrl_uring_cmd") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241119030646.2319030-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-19 09:19:46 -07:00
John Garry	e924da7d66	block: Drop granularity check in queue_limit_discard_alignment() lim->discard_granularity is always at least SECTOR_SIZE, so drop the pointless check for granularity less than SECTOR_SIZE. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241112092144.4059847-1-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-19 09:09:47 -07:00
OGAWA Hirofumi	b49125574c	loop: Fix ABBA locking race Current loop calls vfs_statfs() while holding the q->limits_lock. If FS takes some locking in vfs_statfs callback, this may lead to ABBA locking bug (at least, FAT fs has this issue actually). So this patch calls vfs_statfs() outside q->limits_locks instead, because looks like no reason to hold q->limits_locks while getting discord configs. Chain exists of: &sbi->fat_lock --> &q->q_usage_counter(io)#17 --> &q->limits_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&q->limits_lock); lock(&q->q_usage_counter(io)#17); lock(&q->limits_lock); lock(&sbi->fat_lock); * DEADLOCK * Reported-by: syzbot+a5d8c609c02f508672cc@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=a5d8c609c02f508672cc Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-19 07:54:56 -07:00
Christoph Hellwig	46fd48ab3e	block: return unsigned int from bdev_io_min The underlying limit is defined as an unsigned int, so return that from bdev_io_min as well. Fixes: `ac481c20ef` ("block: Topology ioctls") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241119072602.1059488-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-19 07:19:48 -07:00
Yu Kuai	3802f73bd8	block: fix uaf for flush rq while iterating tags blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared in del_gendisk by commit `aec89dc5d4` ("block: keep q_usage_counter in atomic mode after del_gendisk"), hence for disk like scsi, following blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well, cause following uaf that is found by our syzkaller for v6.6: ================================================================== BUG: KASAN: slab-use-after-free in blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261 Read of size 4 at addr ffff88811c969c20 by task kworker/1:2H/224909 CPU: 1 PID: 224909 Comm: kworker/1:2H Not tainted 6.6.0-ga836a5060850 #32 Workqueue: kblockd blk_mq_timeout_work Call Trace: __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106 print_address_description.constprop.0+0x66/0x300 mm/kasan/report.c:364 print_report+0x3e/0x70 mm/kasan/report.c:475 kasan_report+0xb8/0xf0 mm/kasan/report.c:588 blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261 bt_iter block/blk-mq-tag.c:288 [inline] __sbitmap_for_each_set include/linux/sbitmap.h:295 [inline] sbitmap_for_each_set include/linux/sbitmap.h:316 [inline] bt_for_each+0x455/0x790 block/blk-mq-tag.c:325 blk_mq_queue_tag_busy_iter+0x320/0x740 block/blk-mq-tag.c:534 blk_mq_timeout_work+0x1a3/0x7b0 block/blk-mq.c:1673 process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631 process_scheduled_works kernel/workqueue.c:2704 [inline] worker_thread+0x804/0xe40 kernel/workqueue.c:2785 kthread+0x346/0x450 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293 Allocated by task 942: kasan_save_stack+0x22/0x50 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 ____kasan_kmalloc mm/kasan/common.c:374 [inline] __kasan_kmalloc mm/kasan/common.c:383 [inline] __kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:380 kasan_kmalloc include/linux/kasan.h:198 [inline] __do_kmalloc_node mm/slab_common.c:1007 [inline] __kmalloc_node+0x69/0x170 mm/slab_common.c:1014 kmalloc_node include/linux/slab.h:620 [inline] kzalloc_node include/linux/slab.h:732 [inline] blk_alloc_flush_queue+0x144/0x2f0 block/blk-flush.c:499 blk_mq_alloc_hctx+0x601/0x940 block/blk-mq.c:3788 blk_mq_alloc_and_init_hctx+0x27f/0x330 block/blk-mq.c:4261 blk_mq_realloc_hw_ctxs+0x488/0x5e0 block/blk-mq.c:4294 blk_mq_init_allocated_queue+0x188/0x860 block/blk-mq.c:4350 blk_mq_init_queue_data block/blk-mq.c:4166 [inline] blk_mq_init_queue+0x8d/0x100 block/blk-mq.c:4176 scsi_alloc_sdev+0x843/0xd50 drivers/scsi/scsi_scan.c:335 scsi_probe_and_add_lun+0x77c/0xde0 drivers/scsi/scsi_scan.c:1189 __scsi_scan_target+0x1fc/0x5a0 drivers/scsi/scsi_scan.c:1727 scsi_scan_channel drivers/scsi/scsi_scan.c:1815 [inline] scsi_scan_channel+0x14b/0x1e0 drivers/scsi/scsi_scan.c:1791 scsi_scan_host_selected+0x2fe/0x400 drivers/scsi/scsi_scan.c:1844 scsi_scan+0x3a0/0x3f0 drivers/scsi/scsi_sysfs.c:151 store_scan+0x2a/0x60 drivers/scsi/scsi_sysfs.c:191 dev_attr_store+0x5c/0x90 drivers/base/core.c:2388 sysfs_kf_write+0x11c/0x170 fs/sysfs/file.c:136 kernfs_fop_write_iter+0x3fc/0x610 fs/kernfs/file.c:338 call_write_iter include/linux/fs.h:2083 [inline] new_sync_write+0x1b4/0x2d0 fs/read_write.c:493 vfs_write+0x76c/0xb00 fs/read_write.c:586 ksys_write+0x127/0x250 fs/read_write.c:639 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x78/0xe2 Freed by task 244687: kasan_save_stack+0x22/0x50 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522 ____kasan_slab_free mm/kasan/common.c:236 [inline] __kasan_slab_free+0x12a/0x1b0 mm/kasan/common.c:244 kasan_slab_free include/linux/kasan.h:164 [inline] slab_free_hook mm/slub.c:1815 [inline] slab_free_freelist_hook mm/slub.c:1841 [inline] slab_free mm/slub.c:3807 [inline] __kmem_cache_free+0xe4/0x520 mm/slub.c:3820 blk_free_flush_queue+0x40/0x60 block/blk-flush.c:520 blk_mq_hw_sysfs_release+0x4a/0x170 block/blk-mq-sysfs.c:37 kobject_cleanup+0x136/0x410 lib/kobject.c:689 kobject_release lib/kobject.c:720 [inline] kref_put include/linux/kref.h:65 [inline] kobject_put+0x119/0x140 lib/kobject.c:737 blk_mq_release+0x24f/0x3f0 block/blk-mq.c:4144 blk_free_queue block/blk-core.c:298 [inline] blk_put_queue+0xe2/0x180 block/blk-core.c:314 blkg_free_workfn+0x376/0x6e0 block/blk-cgroup.c:144 process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631 process_scheduled_works kernel/workqueue.c:2704 [inline] worker_thread+0x804/0xe40 kernel/workqueue.c:2785 kthread+0x346/0x450 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293 Other than blk_mq_clear_flush_rq_mapping(), the flag is only used in blk_register_queue() from initialization path, hence it's safe not to clear the flag in del_gendisk. And since QUEUE_FLAG_REGISTERED already make sure that queue should only be registered once, there is no need to test the flag as well. Fixes: `6cfeadbff3` ("blk-mq: don't clear flush_rq from tags->rqs[]") Depends-on: commit `aec89dc5d4` ("block: keep q_usage_counter in atomic mode after del_gendisk") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241104110005.1412161-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-18 18:31:57 -07:00
Mikulas Patocka	9c0ba14828	blk-settings: round down io_opt to physical_block_size There was a bug report [1] where the user got a warning alignment inconsistency. The user has optimal I/O 16776704 (0xFFFE00) and physical block size 4096. Note that the optimal I/O size may be set by the DMA engines or SCSI controllers and they have no knowledge about the disks attached to them, so the situation with optimal I/O not aligned to physical block size may happen. This commit makes blk_validate_limits round down optimal I/O size to the physical block size of the block device. Closes: https://lore.kernel.org/dm-devel/1426ad71-79b4-4062-b2bf-84278be66a5d@redhat.com/T/ [1] Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: `a23634644a` ("block: take io_opt and io_min into account for max_sectors") Cc: stable@vger.kernel.org # v6.11+ Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/3dc0014b-9690-dc38-81c9-4a316a2d4fb2@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-18 14:54:50 -07:00
Manas	a3f143c461	rust: block: simplify Result<()> in validate_block_size return `Result` is used in place of `Result<()>` because the default type parameters are unit `()` and `Error` types, which are automatically inferred. Thus keep the usage consistent throughout codebase. Suggested-by: Miguel Ojeda <ojeda@kernel.org> Link: https://github.com/Rust-for-Linux/linux/issues/1128 Signed-off-by: Manas <manas18244@iiitd.ac.in> Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Link: https://lore.kernel.org/r/20241118-simplify-result-v3-1-6b1566a77eab@iiitd.ac.in Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-18 09:10:33 -07:00
Jens Axboe	88d47f6293	Merge tag 'md-6.13-20241115' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-6.13/block Pull MD fixes from Song: "This set contains a fix for a W=1 warning, by John Garry, and a MAINTAINERS update." * tag 'md-6.13-20241115' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux: MAINTAINERS: Update git tree for mdraid subsystem md/raid5: Increase r5conf.cache_name size	2024-11-15 12:37:33 -07:00
Song Liu	886e4757f4	MAINTAINERS: Update git tree for mdraid subsystem Moving the official git tree to the MDRAID Group account. Signed-off-by: Song Liu <song@kernel.org>	2024-11-15 10:25:53 -08:00
Jens Axboe	957860cbc1	block: make struct rq_list available for !CONFIG_BLOCK A previous commit changed how requests are linked in the plug structure, but unlike the previous method, it uses a new type for it rather than struct request. The latter is available even for !CONFIG_BLOCK, while struct rq_list is now. Move it outside CONFIG_BLOCK. Reported-by: Nathan Chancellor <nathan@kernel.org> Fixes: `a3396b9999` ("block: add a rq_list type") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-15 07:14:03 -07:00
David Wang	bda9c7d92f	block/genhd: use seq_put_decimal_ull for diskstats decimal values seq_printf is costly. For each block device, 19 decimal values are yielded in /proc/diskstats via seq_printf. On a system with 16 logical block devices, profiling for open/read/close sequences shows seq_printf took ~75% samples of diskstats_show: diskstats_show(92.626% 2269372/2450040) seq_printf(76.026% 1725313/2269372) vsnprintf(99.163% 1710866/1725313) format_decode(26.597% 455040/1710866) number(19.554% 334542/1710866) memcpy_orig(4.183% 71570/1710866) ... srso_return_thunk(0.009% 148/1725313) part_stat_read_all(8.030% 182236/2269372) One million rounds of open/read/close /proc/diskstats takes: real 0m37.687s user 0m0.264s sys 0m32.911s On average, each sequence tooks ~0.032ms With this patch, most decimal values are yield via seq_put_decimal_ull, performance is significantly improved: real 0m20.792s user 0m0.316s sys 0m20.463s On average, each sequence tooks ~0.020ms, a ~37.5% improvement. Signed-off-by: David Wang <00107082@163.com> Link: https://lore.kernel.org/r/20241108054500.4251-1-00107082@163.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-13 14:02:19 -07:00
Christoph Hellwig	00e8d290b5	block: don't reorder requests in blk_mq_add_to_batch LIFO ordering for batched completions is a bit unexpected and also defeats some merging optimizations in e.g. the XFS buffered write code. Now that we can easily add the request to the tail of the list do that. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241113152050.157179-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-13 12:04:58 -07:00
Christoph Hellwig	e70c301fae	block: don't reorder requests in blk_add_rq_to_plug Add requests to the tail of the list instead of the front so that they are queued up in submission order. Remove the re-reordering in blk_mq_dispatch_plug_list, virtio_queue_rqs and nvme_queue_rqs now that the list is ordered as expected. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241113152050.157179-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-13 12:04:58 -07:00
Christoph Hellwig	a3396b9999	block: add a rq_list type Replace the semi-open coded request list helpers with a proper rq_list type that mirrors the bio_list and has head and tail pointers. Besides better type safety this actually allows to insert at the tail of the list, which will be useful soon. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241113152050.157179-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-13 12:04:58 -07:00
Christoph Hellwig	e8225ab150	block: remove rq_list_move Unused now. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241113152050.157179-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-13 12:04:58 -07:00
Christoph Hellwig	7f212e997e	virtio_blk: reverse request order in virtio_queue_rqs blk_mq_flush_plug_list submits requests in the reverse order that they were submitted, which leads to a rather suboptimal I/O pattern especially in rotational devices. Fix this by rewriting virtio_queue_rqs so that it always pops the requests from the passed in request list, and then adds them to the head of a local submit list. This actually simplifies the code a bit as it removes the complicated list splicing, at the cost of extra updates of the rq_next pointer. As that should be cache hot anyway it should be an easy price to pay. Fixes: `0e9911fa76` ("virtio-blk: support mq_ops->queue_rqs()") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241113152050.157179-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-13 12:04:26 -07:00
Christoph Hellwig	beadf00885	nvme-pci: reverse request order in nvme_queue_rqs blk_mq_flush_plug_list submits requests in the reverse order that they were submitted, which leads to a rather suboptimal I/O pattern especially in rotational devices. Fix this by rewriting nvme_queue_rqs so that it always pops the requests from the passed in request list, and then adds them to the head of a local submit list. This actually simplifies the code a bit as it removes the complicated list splicing, at the cost of extra updates of the rq_next pointer. As that should be cache hot anyway it should be an easy price to pay. Fixes: `d62cbcf62f` ("nvme: add support for mq_ops->queue_rqs()") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241113152050.157179-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-13 11:40:33 -07:00
Christoph Hellwig	e559ee0226	btrfs: validate queue limits Call blk_validate_limits on the queue limits used for zone append splitting so that calculated values get filled in and any stacking conflicts get cought. Without this there isn't a max_zone_append_sectors limits as of commit `559218d43e` ("block: pre-calculate max_zone_append_sectors"). Fixes: `559218d43e` ("block: pre-calculate max_zone_append_sectors") Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241113084541.34315-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-13 11:40:11 -07:00
Christoph Hellwig	470d2bc3a0	block: export blk_validate_limits While block drivers do the validation as part of committing them to the queue, users that use the limit outside of a block device context have to validate the limits and fill in the calculated values as well. So far btrfs is the only user of queue limits without a block device, and it has gotten away with that more or less by accident. But with commit `559218d43e` ("block: pre-calculate max_zone_append_sectors") this became fatal for setups that have small max zone append size, as it won't be limited now. Export blk_validate_limits so that it can be called directly from btrfs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241113084541.34315-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-13 11:40:11 -07:00
Jens Axboe	15da3dd3f5	Merge tag 'nvme-6.13-2024-11-13' of git://git.infradead.org/nvme into for-6.13/block Pull NVMe updates from Keith: "nvme updates for Linux 6.13 - Use uring_cmd helper (Pavel) - Host Memory Buffer allocation enhancements (Christoph) - Target persistent reservation support (Guixin) - Persistent reservation tracing (Guixen) - NVMe 2.1 specification support (Keith) - Rotational Meta Support (Matias, Wang, Keith) - Volatile cache detection enhancment (Guixen)" * tag 'nvme-6.13-2024-11-13' of git://git.infradead.org/nvme: (22 commits) nvmet: add tracing of reservation commands nvme: parse reservation commands's action and rtype to string nvmet: report ns's vwc not present nvme: check ns's volatile write cache not present nvme: add rotational support nvme: use command set independent id ns if available nvmet: support for csi identify ns nvmet: implement rotational media information log nvmet: implement endurance groups nvmet: declare 2.1 version compliance nvmet: implement crto property nvmet: implement supported features log nvmet: implement supported log pages nvmet: implement active command set ns list nvmet: implement id ns for nvm command set nvmet: support reservation feature nvme: add reservation command's defines nvme-core: remove repeated wq flags nvmet: make nvmet_wq visible in sysfs nvme-pci: use dma_alloc_noncontigous if possible ...	2024-11-13 10:43:11 -07:00
Guixin Liu	50bee3857d	nvmet: add tracing of reservation commands Add tracing of reservation commands, including register, acquire, release and report, and also parse the action and rtype to string to make the trace log more human-readable. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2024-11-13 08:51:24 -08:00
Guixin Liu	8a502b5c16	nvme: parse reservation commands's action and rtype to string Parse reservation commands's action(including rrega, racqa and rrela) and rtype to string to make the trace log more human-readable. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2024-11-13 08:51:24 -08:00
Guixin Liu	609e60a3a9	nvmet: report ns's vwc not present Currently, we report that controller has vwc even though the ns may not have vwc. Report ns's vwc not present when not buffered_io or backdev doesn't have vwc. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2024-11-13 08:51:17 -08:00
John Garry	ea90d27034	md/raid5: Increase r5conf.cache_name size For compiling with W=1, the following warning can be seen: drivers/md/raid5.c: In function ‘setup_conf’: drivers/md/raid5.c:2423:12: error: ‘%s’ directive output may be truncated writing up to 31 bytes into a region of size between 16 and 26 [-Werror=format-truncation=] "raid%d-%s", conf->level, mdname(conf->mddev)); ^~ drivers/md/raid5.c:2422:3: note: ‘snprintf’ output between 7 and 48 bytes into a destination of size 32 snprintf(conf->cache_name[0], namelen, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ "raid%d-%s", conf->level, mdname(conf->mddev)); ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cc1: all warnings being treated as errors Increase the array size to avoid this warning. Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241112161019.4154616-2-john.g.garry@oracle.com Signed-off-by: Song Liu <song@kernel.org>	2024-11-12 15:11:09 -08:00
Christoph Hellwig	6975c1a486	block: remove the ioprio field from struct request The request ioprio is only initialized from the first attached bio, so requests without a bio already never set it. Directly use the bio field instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20241112170050.1612998-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-12 14:42:02 -07:00
Christoph Hellwig	61952bb734	block: remove the write_hint field from struct request The write_hint is only used for read/write requests, which must have a bio attached to them. Just use the bio field instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20241112170050.1612998-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-12 14:42:02 -07:00
Guixin Liu	8a825d22a7	nvme: check ns's volatile write cache not present When the VWC of a namespace does not exist, the BLK_FEAT_WRITE_CACHE flag should not be set when registering the block device, regardless of whether the controller supports VWC. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2024-11-11 09:49:50 -08:00
Wang Yugui	1d81143885	nvme: add rotational support Rotational devices, such as hard-drives, can be detected using the rotational bit in the namespace independent identify namespace data structure. Make the bit visible to the block layer through the rotational queue setting. Signed-off-by: Wang Yugui <wangyugui@e16-tech.com> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2024-11-11 09:49:50 -08:00
Matias Bjørling	ee9f36db1f	nvme: use command set independent id ns if available The NVMe 2.0 specification adds an independent identify namespace data structure that contains generic attributes that apply to all namespace types. Some attributes carry over from the NVM command set identify namespace data structure, and others are new. Currently, the data structure only considered when CRIMS is enabled or when the namespace type is key-value. However, the independent namespace data structure is mandatory for devices that implement features from the 2.0+ specification. Therefore, we can check this data structure first. If unavailable, retrieve the generic attributes from the NVM command set identify namespace data structure. Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2024-11-11 09:49:50 -08:00
Keith Busch	e2758c76a0	nvmet: support for csi identify ns Implements reporting the I/O Command Set Independent Identify Namespace command. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2024-11-11 09:49:49 -08:00
Keith Busch	5fd075cdaf	nvmet: implement rotational media information log Most of the information is stubbed. Supporting these commands is a requirement for supporting rotational media. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2024-11-11 09:49:49 -08:00
Keith Busch	266b652c65	nvmet: implement endurance groups Most of the returned information is just stubbed data. The target must support these in order to report rotational media. Since this driver doesn't know any better, each namespace is its own endurance group with the engid value matching the nsid. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2024-11-11 09:49:49 -08:00
Keith Busch	81ee2f2811	nvmet: declare 2.1 version compliance The target driver implements all the mandatory logs, identifications, features, and properties up to nvme sepcification 2.1. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2024-11-11 09:49:49 -08:00
Keith Busch	1e058089d2	nvmet: implement crto property This property is required for nvme 2.1. The target only supports ready with media, so this is just the same value as CAP.TO. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2024-11-11 09:49:49 -08:00
Keith Busch	e973c91727	nvmet: implement supported features log This log is required for nvme 2.1. Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2024-11-11 09:49:49 -08:00
Keith Busch	83acb24e6d	nvmet: implement supported log pages This log is required for nvme 2.1. Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2024-11-11 09:49:48 -08:00
Keith Busch	61c9967cd6	nvmet: implement active command set ns list This is required for nvme 2.1 for targets that support multiple command sets. We support NVM and ZNS, so are required to support this identification. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2024-11-11 09:49:48 -08:00
Keith Busch	64a51080ea	nvmet: implement id ns for nvm command set We don't report anything here, but it's a mandatory identification for nvme 2.1. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2024-11-11 09:49:48 -08:00
Guixin Liu	5a47c2080a	nvmet: support reservation feature This patch implements the reservation feature, including: 1. reservation register(register, unregister and replace). 2. reservation acquire(acquire, preempt, preempt and abort). 3. reservation release(release and clear). 4. reservation report. 5. set feature and get feature of reservation notify mask. 6. get log page of reservation event. Not supported: 1. persistent reservation through power loss. Test cases: Use nvme-cli and fio to test all implemented sub features: 1. use nvme resv-register to register host a registrant or unregister or replace a new key. 2. use nvme resv-acquire to set host to the holder, and use fio to send read and write io in all reservation type. And also test preempt and "preempt and abort". 3. use nvme resv-report to show all registrants and reservation status. 4. use nvme resv-release to release all registrants. 5. use nvme get-log to get events generated by the preceding operations. In addition, make reservation configurable, one can set ns to support reservation before enable ns. The default of resv_enable is false. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2024-11-11 09:49:48 -08:00
Christoph Hellwig	0b4ace9da5	nvme-multipath: don't bother clearing max_hw_zone_append_sectors The limits stacking now properly zeroes it if at least one of the underlying limits clears it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20241108154657.845768-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-11 09:20:36 -07:00
Christoph Hellwig	559218d43e	block: pre-calculate max_zone_append_sectors max_zone_append_sectors differs from all other queue limits in that the final value used is not stored in the queue_limits but needs to be obtained using queue_limits_max_zone_append_sectors helper. This not only adds (tiny) extra overhead to the I/O path, but also can be easily forgotten in file system code. Add a new max_hw_zone_append_sectors value to queue_limits which is set by the driver, and calculate max_zone_append_sectors from that and the other inputs in blk_validate_zoned_limits, similar to how max_sectors is calculated to fix this. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20241108154657.845768-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-11 09:20:36 -07:00
Christoph Hellwig	0ef2b9e698	block: lift bio_is_zone_append to bio.h Make bio_is_zone_append globally available, because file systems need to use to check for a zone append bio in their end_io handlers to deal with the block layer emulation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241104062647.91160-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-11 09:06:45 -07:00
Christoph Hellwig	7ecd2cd4fa	block: fix bio_split_rw_at to take zone_write_granularity into account Otherwise it can create unaligned writes on zoned devices. Fixes: `a805a4fa4f` ("block: introduce zone_write_granularity limit") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241104062647.91160-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-11 09:06:45 -07:00
Christoph Hellwig	60dc5ea6bc	block: take chunk_sectors into account in bio_split_write_zeroes For zoned devices, write zeroes must be split at the zone boundary which is represented as chunk_sectors. For other uses like the internally RAIDed NVMe devices it is probably at least useful. Enhance get_max_io_size to know about write zeroes and use it in bio_split_write_zeroes. Also add a comment about the seemingly nonsensical zero max_write_zeroes limit. Fixes: `885fa13f65` ("block: implement splitting of REQ_OP_WRITE_ZEROES bios") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20241104062647.91160-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-11 09:06:45 -07:00
John Garry	4cf58d9529	md/raid10: Handle bio_split() errors Add proper bio_split() error handling. For any error, call raid_end_bio_io() and return. Except for discard, where we end the bio directly. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241111112150.3756529-7-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-11 08:35:46 -07:00
John Garry	b1a7ad8b5c	md/raid1: Handle bio_split() errors Add proper bio_split() error handling. For any error, call raid_end_bio_io() and return. For the case of an in the write path, we need to undo the increment in the rdev pending count and NULLify the r1_bio->bios[] pointers. For read path failure, we need to undo rdev pending count increment from the earlier read_balance() call. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241111112150.3756529-6-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-11 08:35:46 -07:00
John Garry	74538fdac3	md/raid0: Handle bio_split() errors Add proper bio_split() error handling. For any error, set bi_status, end the bio, and return. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241111112150.3756529-5-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-11 08:35:46 -07:00

1 2 3 4 5 ...

1310151 Commits