linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-16 12:31:52 -04:00

Author	SHA1	Message	Date
Eric Biggers	844d950bb2	nvme-auth: common: remove nvme_auth_digest_name() Since nvme_auth_digest_name() is no longer used, remove it and the associated data from the hash_map array. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:02 -07:00
Eric Biggers	16977e7755	nvme-auth: target: use crypto library in nvmet_auth_ctrl_hash() For the HMAC computation in nvmet_auth_ctrl_hash(), use the crypto library instead of crypto_shash. This is simpler, faster, and more reliable. Notably, this eliminates the crypto transformation object allocation for every call, which was very slow. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:02 -07:00
Eric Biggers	e501533f67	nvme-auth: target: use crypto library in nvmet_auth_host_hash() For the HMAC computation in nvmet_auth_host_hash(), use the crypto library instead of crypto_shash. This is simpler, faster, and more reliable. Notably, this eliminates the crypto transformation object allocation for every call, which was very slow. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:02 -07:00
Eric Biggers	efe8df9f9c	nvme-auth: target: remove obsolete crypto_has_shash() checks Since nvme-auth is now doing its HMAC computations using the crypto library, it's guaranteed that all the algorithms actually work. Therefore, remove the crypto_has_shash() checks which are now obsolete. However, the caller in nvmet_auth_negotiate() seems to have also been relying on crypto_has_shash(nvme_auth_hmac_name(host_hmac_id)) to validate the host_hmac_id. Therefore, make it validate the ID more directly by checking whether nvme_auth_hmac_hash_len() returns 0 or not. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:02 -07:00
Eric Biggers	ac9a49cf6e	nvme-auth: host: remove allocation of crypto_shash Now that the crypto_shash that is being allocated in nvme_auth_process_dhchap_challenge() and stored in the struct nvme_dhchap_queue_context is no longer used, remove it. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:02 -07:00
Eric Biggers	c4f216c2a9	nvme-auth: host: use crypto library in nvme_auth_dhchap_setup_ctrl_response() For the HMAC computation in nvme_auth_dhchap_setup_ctrl_response(), use the crypto library instead of crypto_shash. This is simpler, faster, and more reliable. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:02 -07:00
Eric Biggers	6be8d3f043	nvme-auth: host: use crypto library in nvme_auth_dhchap_setup_host_response() For the HMAC computation in nvme_auth_dhchap_setup_host_response(), use the crypto library instead of crypto_shash. This is simpler, faster, and more reliable. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:02 -07:00
Eric Biggers	d126cbaa7d	nvme-auth: common: use crypto library in nvme_auth_derive_tls_psk() For the HKDF-Expand-Label computation in nvme_auth_derive_tls_psk(), use the crypto library instead of crypto_shash and crypto/hkdf.c. While this means the HKDF "helper" functions are no longer utilized, they clearly weren't buying us much: it's simpler to just inline the HMAC computations directly, and this code needs to be tested anyway. (A similar result was seen in fs/crypto/. As a result, this eliminates the last user of crypto/hkdf.c, which we'll be able to remove as well.) As usual this is also a lot more efficient, eliminating the allocation of a transformation object and multiple other dynamic allocations. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:01 -07:00
Eric Biggers	0002764c2f	nvme-auth: common: use crypto library in nvme_auth_generate_digest() For the HMAC computation in nvme_auth_generate_digest(), use the crypto library instead of crypto_shash. This is simpler, faster, and more reliable. Notably, this eliminates the crypto transformation object allocation for every call, which was very slow. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:01 -07:00
Eric Biggers	be01b841d3	nvme-auth: common: use crypto library in nvme_auth_generate_psk() For the HMAC computation in nvme_auth_generate_psk(), use the crypto library instead of crypto_shash. This is simpler, faster, and more reliable. Notably, this eliminates the crypto transformation object allocation for every call, which was very slow. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:01 -07:00
Eric Biggers	a67d096fe9	nvme-auth: common: use crypto library in nvme_auth_augmented_challenge() For the hash and HMAC computations in nvme_auth_augmented_challenge(), use the crypto library instead of crypto_shash. This is simpler, faster, and more reliable. Notably, this eliminates two crypto transformation object allocations for every call, which was very slow. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:01 -07:00
Eric Biggers	092c05f8de	nvme-auth: common: use crypto library in nvme_auth_transform_key() For the HMAC computation in nvme_auth_transform_key(), use the crypto library instead of crypto_shash. This is simpler, faster, and more reliable. Notably, this eliminates the transformation object allocation for every call, which was very slow. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:01 -07:00
Eric Biggers	4263ca1cae	nvme-auth: common: add HMAC helper functions Add some helper functions for computing HMAC-SHA256, HMAC-SHA384, or HMAC-SHA512 values using the crypto library instead of crypto_shash. These will enable some significant simplifications and performance improvements in nvme-auth. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:01 -07:00
Eric Biggers	4454820b4e	nvme-auth: common: explicitly verify psk_len == hash_len nvme_auth_derive_tls_psk() is always called with psk_len == hash_len. And based on the comments above nvme_auth_generate_psk() and nvme_auth_derive_tls_psk(), this isn't an implementation choice but rather just the length the spec uses. Add a check which makes this explicit, so that when cleaning up nvme_auth_derive_tls_psk() we don't have to retain support for arbitrary values of psk_len. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:00 -07:00
Eric Biggers	0beeca72cf	nvme-auth: rename nvme_auth_generate_key() to nvme_auth_parse_key() This function does not generate a key. It parses the key from the string that the caller passes in. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:00 -07:00
Eric Biggers	f990ad67f0	nvme-auth: common: add KUnit tests for TLS key derivation Unit-test the sequence of function calls that derive tls_psk, so that we can be more confident that changes in the implementation don't break it. Since the NVMe specification doesn't seem to include any test vectors for this (nor does its description of the algorithm seem to match what was actually implemented, for that matter), I just set the expected values to the values that the code currently produces. In the case of SHA-512, nvme_auth_generate_digest() currently returns -EINVAL, so for now the test tests for that too. If it is later determined that some other behavior is needed, the test can be updated accordingly. Tested with: tools/testing/kunit/kunit.py run --kunitconfig drivers/nvme/common/ Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:00 -07:00
Eric Biggers	bf0e2567a6	nvme-auth: use proper argument types For input parameters, use pointer to const. This makes it easier to understand which parameters are inputs and which are outputs. In addition, consistently use char for strings and u8 for binary. This makes it easier to understand what is a string and what is binary data. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:00 -07:00
Eric Biggers	e57406c07b	nvme-auth: common: constify static data Fully constify the dhgroup_map and hash_map arrays. Remove 'const' from individual fields, as it is now redundant. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:00 -07:00
Eric Biggers	9100a28c8b	nvme-auth: add NVME_AUTH_MAX_DIGEST_SIZE constant Define a NVME_AUTH_MAX_DIGEST_SIZE constant and use it in the appropriate places. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:00 -07:00
Vasily Gorbik	67807fbaf1	block: fix bio_alloc_bioset slowpath GFP handling bio_alloc_bioset() first strips __GFP_DIRECT_RECLAIM from the optimistic fast allocation attempt with try_alloc_gfp(). If that fast path fails, the slowpath checks saved_gfp to decide whether blocking allocation is allowed, but then still calls mempool_alloc() with the stripped gfp mask. That can lead to a NULL bio pointer being passed into bio_init(). Fix the slowpath by using saved_gfp for the bio and bvec mempool allocations. Fixes: `b520c4eef8` ("block: split bio_alloc_bioset more clearly into a fast and slowpath") Reported-by: syzbot+09ddb593eea76a158f42@syzkaller.appspotmail.com Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/p01.gc6e9ad5845ad.ttca29g@ub.hpns Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-23 07:58:32 -06:00
Ming Lei	24d4c90286	ublk: move cold paths out of __ublk_batch_dispatch() for icache efficiency Mark ublk_filter_unused_tags() as noinline since it is only called from the unlikely(needs_filter) branch. Extract the error-handling block from __ublk_batch_dispatch() into a new noinline ublk_batch_dispatch_fail() function to keep the hot path compact and icache-friendly. This also makes __ublk_batch_dispatch() more readable by separating the error recovery logic from the normal dispatch flow. Before: __ublk_batch_dispatch is ~1419 bytes After: __ublk_batch_dispatch is ~1090 bytes (-329 bytes, -23%) Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://patch.msgid.link/20260318014112.3125432-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-22 19:03:52 -06:00
Jens Axboe	713db70d6d	Merge tag 'md-7.1-20260323' of git://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-7.1/block Pull MD changes from Yu Kuia: "Bug Fixes: - md: suppress spurious superblock update error message for dm-raid (Chen Cheng) - md/raid1: fix the comparing region of interval tree (Xiao Ni) - md/raid10: fix deadlock with check operation and nowait requests (Josh Hunt) - md/raid5: skip 2-failure compute when other disk is R5_LOCKED (FengWei Shih) - md/md-llbitmap: raise barrier before state machine transition (Yu Kuai) - md/md-llbitmap: skip reading rdevs that are not in_sync (Yu Kuai) Improvements: - md/raid5: set chunk_sectors to enable full stripe I/O splitting (Yu Kuai) Cleanups: - md: remove unused mddev argument from export_rdev (Chen Cheng) - md/raid5: remove stale md_raid5_kick_device() declaration (Chen Cheng) - md/raid5: move handle_stripe() comment to correct location (Chen Cheng)" * tag 'md-7.1-20260323' of git://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux: md: remove unused mddev argument from export_rdev md/raid5: move handle_stripe() comment to correct location md/raid5: remove stale md_raid5_kick_device() declaration md/raid1: fix the comparing region of interval tree md/raid5: skip 2-failure compute when other disk is R5_LOCKED md/md-llbitmap: raise barrier before state machine transition md/md-llbitmap: skip reading rdevs that are not in_sync md/raid5: set chunk_sectors to enable full stripe I/O splitting md/raid10: fix deadlock with check operation and nowait requests md: suppress spurious superblock update error message for dm-raid	2026-03-22 13:37:45 -06:00
Chen Cheng	6f507eb2bb	md: remove unused mddev argument from export_rdev The mddev argument in export_rdev() is never used. Remove it to simplify callers. Signed-off-by: Chen Cheng <chencheng@fnnas.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Link: https://lore.kernel.org/linux-raid/20260304111417.20777-1-chencheng@fnnas.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2026-03-23 02:15:11 +08:00
Chen Cheng	81c041260a	md/raid5: move handle_stripe() comment to correct location Move the handle_stripe() documentation comment from above analyse_stripe() to directly above handle_stripe() where it belongs. Signed-off-by: Chen Cheng <chencheng@fnnas.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Link: https://lore.kernel.org/linux-raid/20260304111001.15767-1-chencheng@fnnas.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2026-03-23 02:15:11 +08:00
Chen Cheng	af5c99b8ea	md/raid5: remove stale md_raid5_kick_device() declaration Remove the unused md_raid5_kick_device() declaration from raid5.h - no definition exists for this function. Signed-off-by: Chen Cheng <chencheng@fnnas.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Link: https://lore.kernel.org/linux-raid/20260304110919.15071-1-chencheng@fnnas.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2026-03-23 02:15:11 +08:00
Xiao Ni	de3544d2e5	md/raid1: fix the comparing region of interval tree Interval tree uses [start, end] as a region which stores in the tree. In raid1, it uses the wrong end value. For example: bio(A,B) is too big and needs to be split to bio1(A,C-1), bio2(C,B). The region of bio1 is [A,C] and the region of bio2 is [C,B]. So bio1 and bio2 overlap which is not right. Fix this problem by using right end value of the region. Fixes: `d0d2d8ba04` ("md/raid1: introduce wait_for_serialization") Signed-off-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/linux-raid/20260305011839.5118-2-xni@redhat.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2026-03-23 02:15:10 +08:00
FengWei Shih	52e4324935	md/raid5: skip 2-failure compute when other disk is R5_LOCKED When skip_copy is enabled on a doubly-degraded RAID6, a device that is being written to will be in R5_LOCKED state with R5_UPTODATE cleared. If a new read triggers fetch_block() while the write is still in flight, the 2-failure compute path may select this locked device as a compute target because it is not R5_UPTODATE. Because skip_copy makes the device page point directly to the bio page, reconstructing data into it might be risky. Also, since the compute marks the device R5_UPTODATE, it triggers WARN_ON in ops_run_io() which checks that R5_SkipCopy and R5_UPTODATE are not both set. This can be reproduced by running small-range concurrent read/write on a doubly-degraded RAID6 with skip_copy enabled, for example: mdadm -C /dev/md0 -l6 -n6 -R -f /dev/loop[0-3] missing missing echo 1 > /sys/block/md0/md/skip_copy fio --filename=/dev/md0 --rw=randrw --bs=4k --numjobs=8 \ --iodepth=32 --size=4M --runtime=30 --time_based --direct=1 Fix by checking R5_LOCKED before proceeding with the compute. The compute will be retried once the lock is cleared on IO completion. Signed-off-by: FengWei Shih <dannyshih@synology.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Link: https://lore.kernel.org/linux-raid/20260319053351.3676794-1-dannyshih@synology.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2026-03-22 09:57:33 +08:00
Kees Cook	c2d466b9fe	block: partitions: Replace pp_buf with struct seq_buf In preparation for removing the strlcat API[1], replace the char *pp_buf with a struct seq_buf, which tracks the current write position and remaining space internally. This allows for: - Direct use of seq_buf_printf() in place of snprintf()+strlcat() pairs, eliminating local tmp buffers throughout. - Adjacent strlcat() calls that build strings piece-by-piece (e.g., strlcat("["); strlcat(name); strlcat("]")) to be collapsed into single seq_buf_printf() calls. - Simpler call sites: seq_buf_puts() takes only the buffer and string, with no need to pass PAGE_SIZE at every call. The backing buffer allocation is unchanged (__get_free_page), and the output path uses seq_buf_str() to NUL-terminate before passing to printk(). Link: https://github.com/KSPP/linux/issues/370 [1] Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Josh Law <objecting@objecting.org> Signed-off-by: Kees Cook <kees@kernel.org> Reviewed-by: Josh Law <objecting@objecting.org> Link: https://patch.msgid.link/20260321004840.work.670-kees@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-21 08:27:08 -06:00
Yang Xiuwei	7b6d3255e7	scsi: bsg: add io_uring passthrough handler Implement the SCSI-specific io_uring command handler for BSG using struct bsg_uring_cmd. The handler builds a SCSI request from the io_uring command, maps user buffers (including fixed buffers), and completes asynchronously via a request end_io callback and task_work. Completion returns a 32-bit status and packed residual/sense information via CQE res and res2, and supports IO_URING_F_NONBLOCK. Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://patch.msgid.link/20260317072226.2598233-4-yangxiuwei@kylinos.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-19 11:38:24 -06:00
Yang Xiuwei	a1e97ce80d	bsg: add io_uring command support to generic layer Add an io_uring command handler to the generic BSG layer. The new .uring_cmd file operation validates io_uring features and delegates handling to a per-queue bsg_uring_cmd_fn callback. Extend bsg_register_queue() so transport drivers can register both sg_io and io_uring command handlers. Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://patch.msgid.link/20260317072226.2598233-3-yangxiuwei@kylinos.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-19 11:38:24 -06:00
Yang Xiuwei	7da9261bab	bsg: add bsg_uring_cmd uapi structure Add the bsg_uring_cmd structure to the BSG UAPI header to support io_uring-based SCSI passthrough operations via IORING_OP_URING_CMD. Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://patch.msgid.link/20260317072226.2598233-2-yangxiuwei@kylinos.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-19 11:38:24 -06:00
Qu Wenruo	643893647c	block: reject zero length in bio_add_page() The function bio_add_page() returns the number of bytes added to the bio, and if that failed it should return 0. However there is a special quirk, if a caller is passing a page with length 0, that function will always return 0 but with different results: - The page is added to the bio If there is enough bvec slot or the folio can be merged with the last bvec. The return value 0 is just the length passed in, which is also 0. - The page is not added to the bio If the page is not mergeable with the last bvec, or there is no bvec slot available. The return value 0 means page is not added into the bio. Unfortunately the caller is not able to distinguish the above two cases, and will treat the 0 return value as page addition failure. In that case, this can lead to the double releasing of the last page: - By the bio cleanup Which normally goes through every page of the bio, including the last page which is added into the bio. - By the caller Which believes the page is not added into the bio, thus would manually release the page. I do not think anyone should call bio_add_folio()/bio_add_page() with zero length, but idiots like me can still show up. So add an extra WARN_ON_ONCE() check for zero length and rejects it early to avoid double freeing. Signed-off-by: Qu Wenruo <wqu@suse.com> Link: https://patch.msgid.link/bc2223c080f38d0b63f968f605c918181c840f40.1773734749.git.wqu@suse.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-17 19:32:13 -06:00
Thomas Weißschuh	3141e0e536	blk-mq: make blk_mq_hw_ctx_sysfs_entry instances const The blk_mq_hw_ctx_sysfs_entry structures are never modified, mark them as const. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://patch.msgid.link/20260316-b4-sysfs-const-attr-block-v1-4-a35d73b986b0@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-17 19:29:16 -06:00
Thomas Weißschuh	f00d826f1b	blk-crypto: make blk_crypto_attr instances const The blk_crypto_attrs structures are never modified, mark them as const. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: John Garry <john.g.garry@oracle.com>> Link: https://patch.msgid.link/20260316-b4-sysfs-const-attr-block-v1-3-a35d73b986b0@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-17 19:29:16 -06:00
Thomas Weißschuh	3c91226309	block: ia-ranges: make blk_ia_range_sysfs_entry instances const The blk_ia_range_sysfs_entry structures are never modified, mark them as const. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://patch.msgid.link/20260316-b4-sysfs-const-attr-block-v1-2-a35d73b986b0@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-17 19:29:16 -06:00
Thomas Weißschuh	223983874d	block: make queue_sysfs_entry instances const The queue_sysfs_entry structures are never modified, mark them as const. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://patch.msgid.link/20260316-b4-sysfs-const-attr-block-v1-1-a35d73b986b0@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-17 19:29:16 -06:00
Christoph Hellwig	e80fd7a089	block: remove bvec_free bvec_free is only called by bio_free, so inline it there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> -ck Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://patch.msgid.link/20260316161144.1607877-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-17 19:27:14 -06:00
Christoph Hellwig	b520c4eef8	block: split bio_alloc_bioset more clearly into a fast and slowpath bio_alloc_bioset tries non-waiting slab allocations first for the bio and bvec array, but does so in a somewhat convoluted way. Restructure the function so that it first open codes these slab allocations, and then falls back to the mempools with the original gfp mask. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> -ck Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://patch.msgid.link/20260316161144.1607877-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-17 19:27:14 -06:00
Christoph Hellwig	fed406f3c1	block: mark bvec_{alloc,free} static Only used in bio.c these days. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> -ck Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://patch.msgid.link/20260316161144.1607877-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-17 19:27:14 -06:00
Yu Kuai	ef4ca3d4bf	md/md-llbitmap: raise barrier before state machine transition Move the barrier raise operation before calling llbitmap_state_machine() in both llbitmap_start_write() and llbitmap_start_discard(). This ensures the barrier is in place before any state transitions occur, preventing potential race conditions where the state machine could complete before the barrier is properly raised. Cc: stable@vger.kernel.org Fixes: `5ab829f197` ("md/md-llbitmap: introduce new lockless bitmap") Link: https://lore.kernel.org/linux-raid/20260223024038.3084853-3-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-03-16 01:24:59 +08:00
Yu Kuai	7701e68b50	md/md-llbitmap: skip reading rdevs that are not in_sync When reading bitmap pages from member disks, the code iterates through all rdevs and attempts to read from the first available one. However, it only checks for raid_disk assignment and Faulty flag, missing the In_sync flag check. This can cause bitmap data to be read from spare disks that are still being rebuilt and don't have valid bitmap information yet. Reading stale or uninitialized bitmap data from such disks can lead to incorrect dirty bit tracking, potentially causing data corruption during recovery or normal operation. Add the In_sync flag check to ensure bitmap pages are only read from fully synchronized member disks that have valid bitmap data. Cc: stable@vger.kernel.org Fixes: `5ab829f197` ("md/md-llbitmap: introduce new lockless bitmap") Link: https://lore.kernel.org/linux-raid/20260223024038.3084853-2-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-03-16 01:24:59 +08:00
Yu Kuai	d51e1668fa	md/raid5: set chunk_sectors to enable full stripe I/O splitting Set chunk_sectors to the full stripe width (io_opt) so that the block layer splits I/O at full stripe boundaries. This ensures that large writes are aligned to full stripes, avoiding the read-modify-write overhead that occurs with partial stripe writes in RAID-5/6. When chunk_sectors is set, the block layer's bio splitting logic in get_max_io_size() uses blk_boundary_sectors_left() to limit I/O size to the boundary. This naturally aligns split bios to full stripe boundaries, enabling more efficient full stripe writes. Test results with 24-disk RAID5 (chunk_size=64k): dd if=/dev/zero of=/dev/md0 bs=10M oflag=direct Before: 461 MB/s After: 520 MB/s (+12.8%) Link: https://lore.kernel.org/linux-raid/20260223035834.3132498-1-yukuai@fnnas.com Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-03-16 01:24:59 +08:00
Josh Hunt	7d96f3120a	md/raid10: fix deadlock with check operation and nowait requests When an array check is running it will raise the barrier at which point normal requests will become blocked and increment the nr_pending value to signal there is work pending inside of wait_barrier(). NOWAIT requests do not block and so will return immediately with an error, and additionally do not increment nr_pending in wait_barrier(). Upstream change commit `43806c3d5b` ("raid10: cleanup memleak at raid10_make_request") added a call to raid_end_bio_io() to fix a memory leak when NOWAIT requests hit this condition. raid_end_bio_io() eventually calls allow_barrier() and it will unconditionally do an atomic_dec_and_test(&conf->nr_pending) even though the corresponding increment on nr_pending didn't happen in the NOWAIT case. This can be easily seen by starting a check operation while an application is doing nowait IO on the same array. This results in a deadlocked state due to nr_pending value underflowing and so the md resync thread gets stuck waiting for nr_pending to == 0. Output of r10conf state of the array when we hit this condition: crash> struct r10conf barrier = 1, nr_pending = { counter = -41 }, nr_waiting = 15, nr_queued = 0, Example of md_sync thread stuck waiting on raise_barrier() and other requests stuck in wait_barrier(): md1_resync [<0>] raise_barrier+0xce/0x1c0 [<0>] raid10_sync_request+0x1ca/0x1ed0 [<0>] md_do_sync+0x779/0x1110 [<0>] md_thread+0x90/0x160 [<0>] kthread+0xbe/0xf0 [<0>] ret_from_fork+0x34/0x50 [<0>] ret_from_fork_asm+0x1a/0x30 kworker/u1040:2+flush-253:4 [<0>] wait_barrier+0x1de/0x220 [<0>] regular_request_wait+0x30/0x180 [<0>] raid10_make_request+0x261/0x1000 [<0>] md_handle_request+0x13b/0x230 [<0>] __submit_bio+0x107/0x1f0 [<0>] submit_bio_noacct_nocheck+0x16f/0x390 [<0>] ext4_io_submit+0x24/0x40 [<0>] ext4_do_writepages+0x254/0xc80 [<0>] ext4_writepages+0x84/0x120 [<0>] do_writepages+0x7a/0x260 [<0>] __writeback_single_inode+0x3d/0x300 [<0>] writeback_sb_inodes+0x1dd/0x470 [<0>] __writeback_inodes_wb+0x4c/0xe0 [<0>] wb_writeback+0x18b/0x2d0 [<0>] wb_workfn+0x2a1/0x400 [<0>] process_one_work+0x149/0x330 [<0>] worker_thread+0x2d2/0x410 [<0>] kthread+0xbe/0xf0 [<0>] ret_from_fork+0x34/0x50 [<0>] ret_from_fork_asm+0x1a/0x30 Fixes: `43806c3d5b` ("raid10: cleanup memleak at raid10_make_request") Cc: stable@vger.kernel.org Signed-off-by: Josh Hunt <johunt@akamai.com> Link: https://lore.kernel.org/linux-raid/20260303005619.1352958-1-johunt@akamai.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-03-16 01:24:59 +08:00
Chen Cheng	eff0d74c6c	md: suppress spurious superblock update error message for dm-raid dm-raid has external metadata management (mddev->external = 1) and no persistent superblock (mddev->persistent = 0). For these arrays, there's no superblock to update, so the error message is spurious. The error appears as: md_update_sb: can't update sb for read-only array md0 Fixes: `8c9e376b9d` ("md: warn about updating super block failure") Reported-by: Tj <tj.iam.tj@proton.me> Closes: https://lore.kernel.org/all/20260128082430.96788-1-tj.iam.tj@proton.me/ Signed-off-by: Chen Cheng <chencheng@fnnas.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Link: https://lore.kernel.org/linux-raid/20260210133847.269986-1-chencheng@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-03-16 01:24:59 +08:00
Caleb Sander Mateos	5d54016205	ublk: report BLK_SPLIT_INTERVAL_CAPABLE The ublk driver doesn't access request integrity buffers directly, it only copies them to/from the ublk server in ublk_copy_user_integrity(). ublk_copy_user_integrity() uses bio_for_each_integrity_vec() to walk all the integrity segments. ublk devices are therefore capable of handling requests with integrity intervals split across segments. Set BLK_SPLIT_INTERVAL_CAPABLE in the struct blk_integrity flags for ublk devices to opt out of the integrity-interval dma_alignment limit. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://patch.msgid.link/20260313144701.1221652-3-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-14 07:44:30 -06:00
Keith Busch	203247c5cb	blk-integrity: support arbitrary buffer alignment A bio segment may have partial interval block data with the rest continuing into the next segments because direct-io data payloads only need to align in memory to the device's DMA limits. At the same time, the protection information may also be split in multiple segments. The most likely way that may happen is if two requests merge, or if we're directly using the io_uring user metadata. The generate/verify, however, only ever accessed the first bip_vec. Further, it may be possible to unalign the protection fields from the user space buffer, or if there are odd additional opaque bytes in front or in back of the protection information metadata region. Change up the iteration to allow spanning multiple segments. This patch is mostly a re-write of the protection information handling to allow any arbitrary alignments, so it's probably easier to review the end result rather than the diff. Many controllers are not able to handle interval data composed of multiple segments when PI is used, so this patch introduces a new integrity limit that a low level driver can set to notify that it is capable, default to false. The nvme driver is the first one to enable it in this patch. Everyone else will force DMA alignment to the logical block size as before to ensure interval data is always aligned within a single segment. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://patch.msgid.link/20260313144701.1221652-2-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-14 07:44:30 -06:00
Ming Lei	3dbaacf6ab	blk-cgroup: wait for blkcg cleanup before initializing new disk When a queue is shared across disk rebind (e.g., SCSI unbind/bind), the previous disk's blkcg state is cleaned up asynchronously via disk_release() -> blkcg_exit_disk(). If the new disk's blkcg_init_disk() runs before that cleanup finishes, we may overwrite q->root_blkg while the old one is still alive, and radix_tree_insert() in blkg_create() fails with -EEXIST because the old blkg entries still occupy the same queue id slot in blkcg->blkg_tree. This causes the sd probe to fail with -ENOMEM. Fix it by waiting in blkcg_init_disk() for root_blkg to become NULL, which indicates the previous disk's blkcg cleanup has completed. Fixes: `1059699f87` ("block: move blkcg initialization/destroy into disk allocation/release handler") Cc: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260311032837.2368714-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-11 08:30:30 -06:00
Chaitanya Kulkarni	daa6c79858	block: clear BIO_QOS flags in blk_steal_bios() When a bio goes through the rq_qos infrastructure on a path's request queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These flags indicate that rq_qos_done_bio() should be called on completion to update rq_qos accounting. During path failover in nvme_failover_req(), the bio's bi_bdev is redirected from the failed path's disk to the multipath head's disk via bio_set_dev(). However, the BIO_QOS flags are not cleared. When the bio eventually completes (either successfully via a new path or with an error via bio_io_error()), rq_qos_done_bio() checks for these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is obtained from the bio's current bi_bdev - which is now the multipath head's queue, not the original path's queue. The multipath head's queue does not have rq_qos enabled (q->rq_qos is NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos must be valid. This breaks when a bio is moved between queues during NVMe multipath failover, leading to a NULL pointer dereference. Execution Context timeline :- * =====> dd process context [USER] dd process [SYSCALL] write() - dd process context submit_bio() nvme_ns_head_submit_bio() - path selection blk_mq_submit_bio() #### QOS FLAGS SET HERE [USER] dd waits or returns ==== I/O in flight on NVMe hardware ===== ===== End of submission path ==== ------------------------------------------------------ * dd ====> Interrupt context; [IRQ] NVMe completion interrupt nvme_irq() nvme_complete_rq() nvme_failover_req() ### BIO MOVED TO HEAD spin_lock_irqsave (atomic section) bio_set_dev() changes bi_bdev ### BUG: QOS flags NOT cleared kblockd_schedule_work() * Interrupt context =====> kblockd workqueue [WQ] kblockd workqueue - kworker process nvme_requeue_work() submit_bio_noacct() nvme_ns_head_submit_bio() nvme_find_path() returns NULL bio_io_error() bio_endio() rq_qos_done_bio() ### CRASH ### KERNEL PANIC / OOPS Crash from blktests nvme/058 (rapid namespace remapping): [ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 1339.641025] nvme nvme4: rescanning namespaces. [ 1339.642064] #PF: supervisor read access in kernel mode [ 1339.642067] #PF: error_code(0x0000) - not-present page [ 1339.642070] PGD 0 P4D 0 [ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI [ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary) [ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST [ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core] [ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40 [ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5 53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee 48 89 df ff d0 0f 1f [ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202 [ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000 [ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000 [ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000 [ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010 [ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020 [ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000 [ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0 [ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee [ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600 [ 1339.748488] Call Trace: [ 1339.749512] <TASK> [ 1339.750449] bio_endio+0x71/0x2e0 [ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core] [ 1339.754073] __submit_bio+0x222/0x5e0 [ 1339.755623] ? rcu_is_watching+0xd/0x40 [ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370 [ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370 [ 1339.761189] ? submit_bio_noacct+0x20/0x620 [ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core] [ 1339.764828] process_one_work+0x20e/0x630 [ 1339.766528] worker_thread+0x184/0x330 [ 1339.768129] ? __pfx_worker_thread+0x10/0x10 [ 1339.769942] kthread+0x10a/0x250 [ 1339.771263] ? __pfx_kthread+0x10/0x10 [ 1339.772776] ? __pfx_kthread+0x10/0x10 [ 1339.774381] ret_from_fork+0x273/0x2e0 [ 1339.775948] ? __pfx_kthread+0x10/0x10 [ 1339.777504] ret_from_fork_asm+0x1a/0x30 [ 1339.779163] </TASK> Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags when bios are redirected to the multipath head in nvme_failover_req(). This is consistent with the existing code that clears REQ_POLLED and REQ_NOWAIT flags when the bio changes queues. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260226031243.87200-3-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-10 07:11:09 -06:00
Chaitanya Kulkarni	b2c45ced59	block: move bio queue-transition flag fixups into blk_steal_bios() blk_steal_bios() transfers bios from a request to a bio_list when the request is requeued to a different queue. The NVMe multipath failover path (nvme_failover_req) currently open-codes clearing of REQ_POLLED, bi_cookie, and REQ_NOWAIT on each bio before calling blk_steal_bios(). Move these fixups into blk_steal_bios() itself so that any caller automatically gets correct flag state when bios cross queue boundaries. Simplify nvme_failover_req() accordingly. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260226031243.87200-2-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-10 07:11:09 -06:00
Jens Axboe	89d10b7803	Merge branch 'for-7.1/block-integrity' into for-7.1/block Merge in integrity changes which are also landing in the VFS tree as dependencies for fs related changes. * for-7.1/block-integrity: block: pass a maxlen argument to bio_iov_iter_bounce block: add fs_bio_integrity helpers block: make max_integrity_io_size public block: prepare generation / verification helpers for fs usage block: add a bdev_has_integrity_csum helper block: factor out a bio_integrity_setup_default helper block: factor out a bio_integrity_action helper	2026-03-09 14:30:14 -06:00

1 2 3 4 5 ...

1427945 Commits