linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-08 09:12:39 -04:00

Author	SHA1	Message	Date
Caleb Sander Mateos	64d1dc522b	ublk: store request pointer in ublk_io A ublk_io is converted to a request in several places in the I/O path by using blk_mq_tag_to_rq() to look up the (qid, tag) on the ublk device's tagset. This involves a bunch of dereferences and a tag bounds check. To make this conversion cheaper, store the request pointer in ublk_io. Overlap this storage with the io_uring_cmd pointer. This is safe because the io_uring_cmd pointer is only valid if UBLK_IO_FLAG_ACTIVE is set on the ublk_io, the request pointer is valid if UBLK_IO_FLAG_OWNED_BY_SRV, and these flags are mutually exclusive. Suggested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250430225234.2676781-10-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:22:30 -06:00
Caleb Sander Mateos	8ed95b5470	ublk: check UBLK_IO_FLAG_OWNED_BY_SRV in ublk_abort_queue() ublk_abort_queue() currently checks whether the UBLK_IO_FLAG_ACTIVE flag is cleared to tell whether to abort each ublk_io in the queue. But it's possible for a ublk_io to not be ACTIVE but also not have a request in flight, such as when no fetch request has yet been submitted for a tag or when a fetch request is cancelled. So ublk_abort_queue() must additionally check for an inflight request. Simplify this code by checking for UBLK_IO_FLAG_OWNED_BY_SRV instead, which indicates precisely whether a request is currently inflight. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250430225234.2676781-9-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:22:30 -06:00
Caleb Sander Mateos	9810362a57	ublk: don't call ublk_dispatch_req() for NEED_GET_DATA ublk_dispatch_req() currently handles 3 different cases: incoming ublk requests that don't need to wait for a data buffer, incoming requests that do need to wait for a buffer, and resuming those requests once the buffer is provided. But the call site that provides a data buffer (UBLK_IO_NEED_GET_DATA) is separate from those for incoming requests. So simplify the function by splitting the UBLK_IO_NEED_GET_DATA case into its own function ublk_get_data(). This avoids several redundant checks in the UBLK_IO_NEED_GET_DATA case, and streamlines the incoming request cases. Don't call ublk_fill_io_cmd() for UBLK_IO_NEED_GET_DATA, as it's no longer necessary to set io->cmd or the UBLK_IO_FLAG_ACTIVE flag for ublk_dispatch_req(). Since UBLK_IO_NEED_GET_DATA no longer relies on ublk_dispatch_req() calling io_uring_cmd_done(), return the UBLK_IO_RES_OK status directly from the ->uring_cmd() handler. If ublk_start_io() fails, don't complete the UBLK_IO_NEED_GET_DATA command, matching the existing behavior. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250430225234.2676781-8-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:22:30 -06:00
Caleb Sander Mateos	2fcb88bdf2	ublk: factor out ublk_start_io() helper In preparation for calling it from outside ublk_dispatch_req(), factor out the code responsible for setting up an incoming ublk I/O request. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250430225234.2676781-7-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:22:30 -06:00
Caleb Sander Mateos	551270690d	ublk: don't log uring_cmd cmd_op in ublk_dispatch_req() cmd_op is either UBLK_U_IO_FETCH_REQ, UBLK_U_IO_COMMIT_AND_FETCH_REQ, or UBLK_U_IO_NEED_GET_DATA. Which one isn't particularly interesting and is already recorded by the log line in __ublk_ch_uring_cmd(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250430225234.2676781-6-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:22:30 -06:00
Caleb Sander Mateos	2a86eec639	ublk: take const ubq pointer in ublk_get_iod() ublk_get_iod() doesn't modify the struct ublk_queue it is passed. Clarify that by making the argument a const pointer. Move the function definition earlier in the file so it doesn't need a forward declaration. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250430225234.2676781-5-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:22:30 -06:00
Caleb Sander Mateos	5a43d93588	ublk: remove misleading "ubq" in "ubq_complete_io_cmd()" ubq_complete_io_cmd() doesn't interact with a ublk queue, so "ubq" in the name is confusing. Most likely "ubq" was meant to be "ublk". Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250430225234.2676781-4-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:22:30 -06:00
Caleb Sander Mateos	80c0789a7d	ublk: fix "immepdately" typo in comment Looks like "immediately" was intended. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250430225234.2676781-3-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:22:30 -06:00
Uday Shankar	0cb39afd2a	ublk: factor out ublk_commit_and_fetch Move the logic for the UBLK_IO_COMMIT_AND_FETCH_REQ opcode into its own function. This also allows us to mark ublk_queue pointers as const for that operation, which can help prevent data races since we may allow concurrent operation on one ublk_queue in the future. Also open code ublk_commit_completion in ublk_commit_and_fetch to reduce the number of parameters/avoid a redundant lookup. [Restore __ublk_ch_uring_cmd() req variable used in commit `d6aa0c178b` ("ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA")] Suggested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250430225234.2676781-2-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:22:30 -06:00
Caleb Sander Mateos	9712c57ec1	block: avoid hctx spinlock for plug with multiple queues blk_mq_flush_plug_list() has a fast path if all requests in the plug are destined for the same request_queue. It calls ->queue_rqs() with the whole batch of requests, falling back on ->queue_rq() for any requests not handled by ->queue_rqs(). However, if the requests are destined for multiple queues, blk_mq_flush_plug_list() has a slow path that calls blk_mq_dispatch_list() repeatedly to filter the requests by ctx/hctx. Each queue's requests are inserted into the hctx's dispatch list under a spinlock, then __blk_mq_sched_dispatch_requests() takes them out of the dispatch list (taking the spinlock again), and finally blk_mq_dispatch_rq_list() calls ->queue_rq() on each request. Acquiring the hctx spinlock twice and calling ->queue_rq() instead of ->queue_rqs() makes the slow path significantly more expensive. Thus, batching more requests into a single plug (e.g. io_uring_enter syscall) can counterintuitively hurt performance by causing the plug to span multiple queues. We have observed 2-3% of CPU time spent acquiring the hctx spinlock alone on workloads issuing requests to multiple NVMe devices in the same io_uring SQE batches. Add a medium path in blk_mq_flush_plug_list() for plugs that don't have elevators or come from a schedule, but do span multiple queues. Filter the requests by queue and call ->queue_rqs()/->queue_rq() on the list of requests destined to each request_queue. With this change, we no longer see any CPU time spent in _raw_spin_lock from blk_mq_flush_plug_list and throughput increases accordingly. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250426011728.4189119-4-csander@purestorage.com [axboe: fix whitespace damage] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:21:36 -06:00
Caleb Sander Mateos	a5728a1d1e	block: factor out blk_mq_dispatch_queue_requests() helper Factor out the logic from blk_mq_flush_plug_list() that calls ->queue_rqs() with a fallback to ->queue_rq() into a helper function blk_mq_dispatch_queue_requests(). This is in preparation for using this code with other lists of requests. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250426011728.4189119-3-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:21:08 -06:00
Caleb Sander Mateos	0aeb7ebfc7	block: take rq_list instead of plug in dispatch functions blk_mq_plug_issue_direct(), __blk_mq_flush_plug_list(), and blk_mq_dispatch_plug_list() take a struct blk_plug * but only use its mq_list. Pass the struct rq_list * instead in preparation for calling them with other lists of requests. Drop "plug" from the function names as they are no longer plug-specific. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250426011728.4189119-2-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:21:08 -06:00
Damien Le Moal	9e4f11c122	Documentation: Document the new zoned loop block device driver Introduce the zoned_loop.rst documentation file under admin-guide/blockdev to document the zoned loop block device driver. An overview of the driver is provided and its usage to create and delete zoned devices described. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250407075222.170336-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-01 17:03:56 -06:00
Damien Le Moal	eb0570c7df	block: new zoned loop block device driver The zoned loop block device driver allows a user to create emulated zoned block devices using one regular file per zone as backing storage. Compared to null_blk or scsi_debug, it has the advantage of allowing emulating large zoned devices without requiring the same amount of memory as the capacity of the emulated device. Furthermore, zoned devices emulated with this driver can be re-started after a host reboot without any loss of the state of the device zones, which is something that null_blk and scsi_debug do not support. This initial implementation is simple and does not support zone resource limits. That is, a zoned loop block device limits for the maximum number of open zones and maximum number of active zones is always 0. This driver can be either compiled in-kernel or as a module, named "zloop". Compilation of this driver depends on the block layer support for zoned block device (CONFIG_BLK_DEV_ZONED must be set). Using the zloop driver to create and delete zoned block devices is done by writing commands to the zoned loop control character device file (/dev/zloop-control). Creating a device is done with: $ echo "add [options]" > /dev/zloop-control The options available for the "add" operation cat be listed by reading the zloop-control device file: $ cat /dev/zloop-control add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u remove id=%d The options available allow controlling the zoned device total capacity, zone size, zone capactity of sequential zones, total number of conventional zones, base directory for the zones backing file, number of I/O queues and the maximum queue depth of I/O queues. Deleting a device is done using the "remove" command: $ echo "remove id=0" > /dev/zloop-control This implementation passes various tests using zonefs and fio (t/zbd tests) and provides a state machine for zone conditions that is compliant with the T10 ZBC and NVMe ZNS specifications. Co-developed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250407075222.170336-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-01 17:03:56 -06:00
Christoph Hellwig	53ec1abce7	brd: use memcpy_{to,from]_page in brd_rw_bvec Use the proper helpers to copy to/from potential highmem pages, which do a local instead of atomic kmap underneath, and perform flush_dcache_page where needed. This also simplifies the code so much that the separate read write helpers are not required any more. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250428141014.2360063-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-28 11:45:41 -06:00
Christoph Hellwig	3185444f05	brd: split I/O at page boundaries A lot of complexity in brd stems from the fact that it tries to handle I/O spanning two backing pages. Instead limit the size of a single bvec iteration so that it never crosses a page boundary and remove all the now unneeded code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250428141014.2360063-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-28 11:45:41 -06:00
Christoph Hellwig	95a375a3be	brd: use bvec_kmap_local in brd_do_bvec Use the proper helper to kmap a bvec in brd_do_bvec instead of directly accessing the bvec fields and use the deprecated kmap_atomic API. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250428141014.2360063-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-28 11:45:41 -06:00
Christoph Hellwig	857aba38b5	brd: remove the sector variable in brd_submit_bio The bvec iter iterates over the sector already, no need to duplicate the work. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250428141014.2360063-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-28 11:45:40 -06:00
Christoph Hellwig	75d99aa279	brd: pass a bvec pointer to brd_do_bvec Pass the bvec to brd_do_bvec instead of marshalling the information into individual arguments. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250428141014.2360063-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-28 11:45:40 -06:00
Jens Axboe	bf4b8794de	Merge branch 'block-6.15' into for-6.16/block Merge 6.15 block fixes - both to get the fixes causing issues with XFS testing, but also to make it easier for 6.16 ublk patches to avoid conflicts. * block-6.15: ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA block: don't autoload drivers on blk-cgroup configuration block: don't autoload drivers on stat block: remove the backing_inode variable in bdev_statx block: move blkdev_{get,put} _no_open prototypes out of blkdev.h block: never reduce ra_pages in blk_apply_bdi_limits selftests: ublk: common: fix _get_disk_dev_t for pre-9.0 coreutils selftests: ublk: remove useless 'delay_us' from 'struct dev_ctx' selftests: ublk: fix recover test block: hoist block size validation code to a separate function block: fix race between set_blocksize and read paths nvmet: fix out-of-bounds access in nvmet_enable_port	2025-04-24 20:41:11 -06:00
Ming Lei	f40139fde5	ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd ublk_cancel_cmd() calls io_uring_cmd_done() to complete uring_cmd, but we may have scheduled task work via io_uring_cmd_complete_in_task() for dispatching request, then kernel crash can be triggered. Fix it by not trying to canceling the command if ublk block request is started. Fixes: `216c8f5ef0` ("ublk: replace monitor with cancelable uring_cmd") Reported-by: Jared Holzman <jholzman@nvidia.com> Tested-by: Jared Holzman <jholzman@nvidia.com> Closes: https://lore.kernel.org/linux-block/d2179120-171b-47ba-b664-23242981ef19@nvidia.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250425013742.1079549-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-24 19:52:20 -06:00
Ming Lei	d6aa0c178b	ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA We call io_uring_cmd_complete_in_task() to schedule task_work for handling UBLK_U_IO_NEED_GET_DATA. This way is really not necessary because the current context is exactly the ublk queue context, so call ublk_dispatch_req() directly for handling UBLK_U_IO_NEED_GET_DATA. Fixes: `216c8f5ef0` ("ublk: replace monitor with cancelable uring_cmd") Tested-by: Jared Holzman <jholzman@nvidia.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250425013742.1079549-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-24 19:52:20 -06:00
Christoph Hellwig	c4d2519c6a	block: don't autoload drivers on blk-cgroup configuration Loading a driver just to configure blk-cgroup doesn't make sense, as that assumes and already existing device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-24 07:35:23 -06:00
Christoph Hellwig	5f33b5226c	block: don't autoload drivers on stat blkdev_get_no_open can trigger the legacy autoload of block drivers. A simple stat of a block device has not historically done that, so disable this behavior again. Fixes: `9abcfbd235` ("block: Add atomic write support for statx") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-24 07:35:23 -06:00
Christoph Hellwig	d13b7090b2	block: remove the backing_inode variable in bdev_statx backing_inode is only used once, so remove it and update the comment describing the bdev lookup to be a bit more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-24 07:35:09 -06:00
Christoph Hellwig	c63202140d	block: move blkdev_{get,put} _no_open prototypes out of blkdev.h These are only to be used by block internal code. Remove the comment as we grew more users due to reworking block device node opening. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-24 07:33:38 -06:00
Christoph Hellwig	7b720c7202	block: never reduce ra_pages in blk_apply_bdi_limits When the user increased the read-ahead size through sysfs this value currently get lost if the device is reprobe, including on a resume from suspend. As there is no hardware limitation for the read-ahead size there is no real need to reset it or track a separate hardware limitation like for max_sectors. This restores the pre-atomic queue limit behavior in the sd driver as sd did not use blk_queue_io_opt and thus never updated the read ahead size to the value based of the optimal I/O, but changes behavior for all other drivers. As the new behavior seems useful and sd is the driver for which the readahead size tweaks are most useful that seems like a worthwhile trade off. Fixes: `804e498e04` ("sd: convert to the atomic queue limits API") Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20250424082521.1967286-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-24 07:32:17 -06:00
Uday Shankar	1d019736b6	selftests: ublk: common: fix _get_disk_dev_t for pre-9.0 coreutils Some distributions, such as centos stream 9, still have a version of coreutils which does not yet support the %Hr and %Lr formats for stat(1) [1, 2]. Running ublk selftests on these distributions results in the following error in tests that use the _get_disk_dev_t helper: line 23: ?r: syntax error: operand expected (error token is "?r") To better accommodate older distributions, rewrite _get_disk_dev_t to use the much older %t and %T formats for stat instead. [1] https://github.com/coreutils/coreutils/blob/v9.0/NEWS#L114 [2] https://pkgs.org/download/coreutils Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250423-ublk_selftests-v1-2-7d060e260e76@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-24 06:30:54 -06:00
Jens Axboe	6c9c56d94e	Merge tag 'nvme-6.15-2025-04-24' of git://git.infradead.org/nvme into block-6.15 Pull NVMe fix from Christoph: "nvme fixes for Linux 6.15 - fix an out-of-bounds access in nvmet_enable_port (Richard Weinberger)" * tag 'nvme-6.15-2025-04-24' of git://git.infradead.org/nvme: nvmet: fix out-of-bounds access in nvmet_enable_port	2025-04-24 06:27:54 -06:00
Ming Lei	8f50363789	selftests: ublk: remove useless 'delay_us' from 'struct dev_ctx' 'delay_us' shouldn't be added to 'struct dev_ctx' since now it is handled by per-target command line & 'struct fault_inject_ctx'. So remove it. Fixes: `81586652bb` ("selftests: ublk: add generic_06 for covering fault inject") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Uday Shankar <ushankar@purestorage.com> Link: https://lore.kernel.org/r/20250421235947.715272-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-23 13:58:55 -06:00
Ming Lei	5533bc70ae	selftests: ublk: fix recover test When adding recovery test: - 'break' is missed for handling '-g' argument - test name of test_generic_05.sh is wrong So fix the two. Fixes: `57e13a2e8c` ("selftests: ublk: support user recovery") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Uday Shankar <ushankar@purestorage.com> Link: https://lore.kernel.org/r/20250421235947.715272-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-23 13:58:55 -06:00
Darrick J. Wong	e03463d247	block: hoist block size validation code to a separate function Hoist the block size validation code to bdev_validate_blocksize so that we can call it from filesystems that don't care about the bdev pagecache manipulations of set_blocksize. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/174543795720.4139148.840349813093799165.stgit@frogsfrogsfrogs Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-23 13:58:06 -06:00
Darrick J. Wong	c0e473a0d2	block: fix race between set_blocksize and read paths With the new large sector size support, it's now the case that set_blocksize can change i_blksize and the folio order in a manner that conflicts with a concurrent reader and causes a kernel crash. Specifically, let's say that udev-worker calls libblkid to detect the labels on a block device. The read call can create an order-0 folio to read the first 4096 bytes from the disk. But then udev is preempted. Next, someone tries to mount an 8k-sectorsize filesystem from the same block device. The filesystem calls set_blksize, which sets i_blksize to 8192 and the minimum folio order to 1. Now udev resumes, still holding the order-0 folio it allocated. It then tries to schedule a read bio and do_mpage_readahead tries to create bufferheads for the folio. Unfortunately, blocks_per_folio == 0 because the page size is 4096 but the blocksize is 8192 so no bufferheads are attached and the bh walk never sets bdev. We then submit the bio with a NULL block device and crash. Therefore, truncate the page cache after flushing but before updating i_blksize. However, that's not enough -- we also need to lock out file IO and page faults during the update. Take both the i_rwsem and the invalidate_lock in exclusive mode for invalidations, and in shared mode for read/write operations. I don't know if this is the correct fix, but xfs/259 found it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/174543795699.4139148.2086129139322431423.stgit@frogsfrogsfrogs Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-23 13:58:06 -06:00
Caleb Sander Mateos	4c7d3c88c7	ublk: remove unnecessary ubq checks ublk_init_queues() ensures that all nr_hw_queues queues are initialized, with each ublk_queue's q_id set to its index. And ublk_init_queues() is called before ublk_add_chdev(), which creates the cdev. Is is therefore impossible for the !ubq \|\| ub_cmd->q_id != ubq->q_id condition to hit in __ublk_ch_uring_cmd(). Remove it to avoids some branches in the I/O path. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Uday Shankar <ushankar@purestorage.com> Link: https://lore.kernel.org/r/20250416170154.3621609-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-22 20:09:12 -06:00
Richard Weinberger	3d7aa0c7b4	nvmet: fix out-of-bounds access in nvmet_enable_port When trying to enable a port that has no transport configured yet, nvmet_enable_port() uses NVMF_TRTYPE_MAX (255) to query the transports array, causing an out-of-bounds access: [ 106.058694] BUG: KASAN: global-out-of-bounds in nvmet_enable_port+0x42/0x1da [ 106.058719] Read of size 8 at addr ffffffff89dafa58 by task ln/632 [...] [ 106.076026] nvmet: transport type 255 not supported Since commit `200adac758`, NVMF_TRTYPE_MAX is the default state as configured by nvmet_ports_make(). Avoid this by checking for NVMF_TRTYPE_MAX before proceeding. Fixes: `200adac758` ("nvme: Add PCI transport type") Signed-off-by: Richard Weinberger <richard@nod.at> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org>	2025-04-22 09:50:28 +02:00
Omri Mann	98b995660b	ublk: Add UBLK_U_CMD_UPDATE_SIZE Currently ublk only allows the size of the ublkb block device to be set via UBLK_CMD_SET_PARAMS before UBLK_CMD_START_DEV is triggered. This does not provide support for extendable user-space block devices without having to stop and restart the underlying ublkb block device causing IO interruption. This patch adds a new ublk command UBLK_U_CMD_UPDATE_SIZE to allow the ublk block device to be resized on-the-fly. Feature flag UBLK_F_UPDATE_SIZE is also added to indicate support. Signed-off-by: Omri Mann <omri@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/2a370ab1-d85b-409d-b762-f9f3f6bdf705@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 09:25:41 -06:00
Jens Axboe	033b667a82	block: blk-rq-qos: guard rq-qos helpers by static key Even if blk-rq-qos isn't used or configured, dipping into the queue to fetch ->rq_qos is a noticeable slowdown and visible in profiles. Add an unlikely static key around blk-rq-qos, to avoid fetching this cacheline if blk-iolatency or blk-wbt isn't configured or used. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:07:03 -06:00
Jens Axboe	9b79f86e06	block: ensure that struct blk_mq_alloc_data is fully initialized On x86, rep stos will be emitted to clear the the blk_mq_alloc_data struct, as not all members are being explicitly initialied. Depending on the type of CPU, this is a noticeable slowdown compared to just ensuring that the struct is fully initialized when setup. For the 4 spots that setup a struct blk_mq_alloc_data on the stack, ensure all members are being initialized. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:07:02 -06:00
Bart Van Assche	e093b784ab	block: Simplify blk_mq_dispatch_rq_list() and its callers The 'nr_budgets' argument of blk_mq_dispatch_rq_list() is either the number of elements in the 'list' argument or zero. Instead of passing the number of list elements to blk_mq_dispatch_rq_list(), pass a boolean argument that indicates whether or not blk_mq_dispatch_rq_list() should request the block driver for a budget for each request in 'list'. Remove the code for counting list elements from blk_mq_dispatch_rq_list() callers where possible. Remove the code that decrements nr_budgets from blk_mq_dispatch_rq_list() because it is superfluous. Each request that is processed by blk_mq_dispatch_rq_list() is in one of these two states if 'get_budget' is false: * Either the request is on 'list' and the budget for the request has to be released from the error path. * Or the request is not on 'list' and q->mq_ops->queue_rq() has already released the budget (ret != BLK_STS_OK) or q->mq_ops->queue_rq() will release the budget asynchronously (ret == BLK_STS_OK). Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: John Garry <john.g.garry@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20250415205134.3650042-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:07:02 -06:00
Linus Torvalds	9d7a0577c9	gcc-15: disable '-Wunterminated-string-initialization' entirely for now I had left the warning around but as a non-fatal error to get my gcc-15 builds going, but fixed up some of the most annoying warning cases so that it wouldn't be too verbose. Because I like the _concept_ of the warning, even if I detested the implementation to shut it up. It turns out the implementation to shut it up is even more broken than I thought, and my "shut up most of the warnings" patch just caused fatal errors on gcc-14 instead. I had tested with clang, but when I upgrade my development environment, I try to do it on all machines because I hate having different systems to maintain, and hadn't realized that gcc-14 now had issues. The ACPI case is literally why I wanted to have a type that doesn't trigger the warning (see commit `d5d45a7f26`: "gcc-15: make 'unterminated string initialization' just a warning"), instead of marking individual places as "__nonstring". But gcc-14 doesn't like that __nonstring location that shut gcc-15 up, because it's on an array of char arrays, not on one single array: drivers/acpi/tables.c:399:1: error: 'nonstring' attribute ignored on objects of type 'const char[][4]' [-Werror=attributes] 399 \| static const char table_sigs[][ACPI_NAMESEG_SIZE] __initconst __nonstring = { \| ^~~~~~ and my attempts to nest it properly with a type had failed, because of how gcc doesn't like marking the types as having attributes, only symbols. There may be some trick to it, but I was already annoyed by the bad attribute design, now I'm just entirely fed up with it. I wish gcc had a proper way to say "this type is a byte array, not a string". The obvious thing would be to distinguish between "char []" and an explicitly signed "unsigned char []" (as opposed to an implicitly unsigned char, which is typically an architecture-specific default, but for the kernel is universal thanks to '-funsigned-char'). But any "we can typedef a 8-bit type to not become a string just because it's an array" model would be fine. But "__attribute__((nonstring))" is sadly not that sane model. Reported-by: Chris Clayton <chris2553@googlemail.com> Fixes: `4b4bd8c50f` ("gcc-15: acpi: sprinkle random '__nonstring' crumbles around") Fixes: `d5d45a7f26` ("gcc-15: make 'unterminated string initialization' just a warning") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2025-04-20 15:30:53 -07:00
Linus Torvalds	9c32cda43e	Linux 6.15-rc3 v6.15-rc3	2025-04-20 13:43:47 -07:00
Linus Torvalds	ac71fabf15	gcc-15: work around sequence-point warning The C sequence points are complicated things, and gcc-15 has apparently added a warning for the case where an object is both used and modified multiple times within the same sequence point. That's a great warning. Or rather, it would be a great warning, except gcc-15 seems to not really be very exact about it, and doesn't notice that the modification are to two entirely different members of the same object: the array counter and the array entries. So that seems kind of silly. That said, the code that gcc complains about is unnecessarily complicated, so moving the array counter update into a separate statement seems like the most straightforward fix for these warnings: drivers/net/wireless/intel/iwlwifi/mld/d3.c: In function ‘iwl_mld_set_netdetect_info’: drivers/net/wireless/intel/iwlwifi/mld/d3.c:1102:66: error: operation on ‘netdetect_info->n_matches’ may be undefined [-Werror=sequence-point] 1102 \| netdetect_info->matches[netdetect_info->n_matches++] = match; \| ~~~~~~~~~~~~~~~~~~~~~~~~~^~ drivers/net/wireless/intel/iwlwifi/mld/d3.c:1120:58: error: operation on ‘match->n_channels’ may be undefined [-Werror=sequence-point] 1120 \| match->channels[match->n_channels++] = \| ~~~~~~~~~~~~~~~~~^~ side note: the code at that second warning is actively buggy, and only works on little-endian machines that don't do strict alignment checks. The code casts an array of integers into an array of unsigned long in order to use our bitmap iterators. That happens to work fine on any sane architecture, but it's still wrong. This does not fix that more serious problem. This only splits the two assignments into two statements and fixes the compiler warning. I need to get rid of the new warnings in order to be able to actually do any build testing. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2025-04-20 11:57:54 -07:00
Linus Torvalds	05e8d261a3	gcc-15: add '__nonstring' markers to byte arrays All of these cases are perfectly valid and good traditional C, but hit by the "you're not NUL-terminating your byte array" warning. And none of the cases want any terminating NUL character. Mark them __nonstring to shut up gcc-15 (and in the case of the ak8974 magnetometer driver, I just removed the explicit array size and let gcc expand the 3-byte and 6-byte arrays by one extra byte, because it was the simpler change). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2025-04-20 11:57:54 -07:00
Linus Torvalds	be913e7c40	gcc-15: get rid of misc extra NUL character padding This removes two cases of explicit NUL padding that now causes warnings because of '-Wunterminated-string-initialization' being part of -Wextra in gcc-15. Gcc is being silly in this case when it says that it truncates a NUL terminator, because in these cases there were _multiple_ NUL characters. But we can get rid of the warning by just simplifying the two initializers that trigger the warning for me, so this does exactly that. I'm not sure why the power supply code did that odd .attr_name = #_name "\0", pattern: it was introduced in commit `2cabeaf151` ("power: supply: core: Cleanup power supply sysfs attribute list"), but that 'attr_name[]' field is an explicitly sized character array in a statically initialized variable, and a string initializer always has a terminating NUL _and_ statically initialized character arrays are zero-padded anyway, so it really seems to be rather extraneous belt-and-suspenders. The zero_uuid[16] initialization in drivers/md/bcache/super.c makes perfect sense, but it isn't necessary for the same reasons, and not worth the new gcc warning noise. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2025-04-20 11:57:54 -07:00
Linus Torvalds	4b4bd8c50f	gcc-15: acpi: sprinkle random '__nonstring' crumbles around This is not great: I'd much rather introduce a typedef that is a "ACPI name byte buffer", and use that to mark these special 4-byte ACPI names that do not use NUL termination. But as noted in the previous commit ("gcc-15: make 'unterminated string initialization' just a warning") gcc doesn't actually seem to support that notion, so instead you have to just mark every single array declaration individually. So this is not pretty, but this gets rid of the bulk of the annoying warnings during an allmodconfig build for me. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2025-04-20 11:57:54 -07:00
Linus Torvalds	d5d45a7f26	gcc-15: make 'unterminated string initialization' just a warning gcc-15 enabling -Wunterminated-string-initialization in -Wextra by default was done with the best intentions, but the warning is still quite broken. What annoys me about the warning is that this is a very traditional AND CORRECT way to initialize fixed byte arrays in C: unsigned char hex[16] = "0123456789abcdef"; and we use this all over the kernel. And the warning is fine, but gcc developers apparently never made a reasonable way to disable it. As is (sadly) tradition with these things. Yes, there's "__attribute__((nonstring))", and we have a macro to make that absolutely disgusting syntax more palatable (ie the kernel syntax for that monstrosity is just "__nonstring"). But that attribute is misdesigned. What you'd typically want to do is tell the compiler that you are using a type that isn't a string but a byte array, but that doesn't work at all: warning: ‘nonstring’ attribute does not apply to types [-Wattributes] and because of this fundamental mis-design, you then have to mark each instance of that pattern. This is particularly noticeable in our ACPI code, because ACPI has this notion of a 4-byte "type name" that gets used all over, and is exactly this kind of byte array. This is a sad oversight, because the warning is useful, but really would be so much better if gcc had also given a sane way to indicate that we really just want a byte array type at a type level, not the broken "each and every array definition" level. So now instead of creating a nice "ACPI name" type using something like typedef char acpi_name_t[4] __nonstring; we have to do things like char name[ACPI_NAMESEG_SIZE] __nonstring; in every place that uses this concept and then happens to have the typical initializers. This is annoying me mainly because I think the warning _is_ a good warning, which is why I'm not just turning it off in disgust. But it is hampered by this bad implementation detail. [ And obviously I'm doing this now because system upgrades for me are something that happen in the middle of the release cycle: don't do it before or during travel, or just before or during the busy merge window period. ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2025-04-20 11:57:54 -07:00
Linus Torvalds	6fea5fabd3	Merge tag 'mm-hotfixes-stable-2025-04-19-21-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc hotfixes from Andrew Morton: "16 hotfixes. 2 are cc:stable and the remainder address post-6.14 issues or aren't considered necessary for -stable kernels. All patches are basically for MM although five are alterations to MAINTAINERS" [ Basic counting skills are clearly not a strictly necessary requirement for kernel maintainers. - Linus ] * tag 'mm-hotfixes-stable-2025-04-19-21-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: MAINTAINERS: add section for locking of mm's and VMAs mm: vmscan: fix kswapd exit condition in defrag_mode mm: vmscan: restore high-cpu watermark safety in kswapd MAINTAINERS: add Pedro as reviewer to the MEMORY MAPPING section mm/memory: move sanity checks in do_wp_page() after mapcount vs. refcount stabilization mm, hugetlb: increment the number of pages to be reset on HVO writeback: fix false warning in inode_to_wb() docs: ABI: replace mcroce@microsoft.com with new Meta address mm/gup: fix wrongly calculated returned value in fault_in_safe_writeable() MAINTAINERS: add memory advice section MAINTAINERS: add mmap trace events to MEMORY MAPPING mm: memcontrol: fix swap counter leak from offline cgroup MAINTAINERS: add MM subsection for the page allocator MAINTAINERS: update SLAB ALLOCATOR maintainers fs/dax: fix folio splitting issue by resetting old folio order + _nr_pages mm/page_alloc: fix deadlock on cpu_hotplug_lock in __accept_page()	2025-04-19 21:46:58 -07:00
Linus Torvalds	119009db26	Merge tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Revert the hfs{plus} deprecation warning that's also included in this pull request. The commit introducing the deprecation warning resides rather early in this branch. So simply dropping it would've rebased all other commits which I decided to avoid. Hence the revert in the same branch [ Background - the deprecation warning discussion resulted in people stepping up, and so hfs{plus} will have a maintainer taking care of it after all.. - Linus ] - Switch CONFIG_SYSFS_SYCALL default to n and decouple from CONFIG_EXPERT - Fix an audit bug caused by changes to our kernel path lookup helpers this cycle. Audit needs the parent path even if the dentry it tried to look up is negative - Ensure that the kernel path lookup helpers leave the passed in path argument clean when they return an error. This is consistent with all our other helpers - Ensure that vfs_getattr_nosec() calls bdev_statx() so the relevant information is available to kernel consumers as well - Don't set a timer and call schedule() if the timer will expire immediately in epoll - Make netfs lookup tables with __nonstring * tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: Revert "hfs{plus}: add deprecation warning" fs: move the bdex_statx call to vfs_getattr_nosec netfs: Mark __nonstring lookup tables eventpoll: Set epoll timeout if it's in the future fs: ensure that path_locked() helpers leave passed path pristine fs: add kern_path_locked_negative() hfs{plus}: add deprecation warning Kconfig: switch CONFIG_SYSFS_SYCALL default to n	2025-04-19 14:31:08 -07:00
Linus Torvalds	6fe8131757	Merge tag 'i2c-for-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: - Address translator: fix wrong include - ChromeOS EC tunnel: fix potential NULL pointer dereference * tag 'i2c-for-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: atr: Fix wrong include i2c: cros-ec-tunnel: defer probe if parent EC is not present	2025-04-19 13:59:04 -07:00
Christian Brauner	408e4504f9	Revert "hfs{plus}: add deprecation warning" This reverts commit `ddee68c499`. There's ongoing discussion about better maintenance of at least hfsplus. Rever the deprecation warning for now. Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-04-19 22:48:59 +02:00

1 2 3 4 5 ...

1351979 Commits