linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-16 12:31:52 -04:00

Author	SHA1	Message	Date
Chia-Ming Chang	7f9f7c6974	md/raid5: fix soft lockup in retry_aligned_read() When retry_aligned_read() encounters an overlapped stripe, it releases the stripe via raid5_release_stripe() which puts it on the lockless released_stripes llist. In the next raid5d loop iteration, release_stripe_list() drains the stripe onto handle_list (since STRIPE_HANDLE is set by the original IO), but retry_aligned_read() runs before handle_active_stripes() and removes the stripe from handle_list via find_get_stripe() -> list_del_init(). This prevents handle_stripe() from ever processing the stripe to resolve the overlap, causing an infinite loop and soft lockup. Fix this by using __release_stripe() with temp_inactive_list instead of raid5_release_stripe() in the failure path, so the stripe does not go through the released_stripes llist. This allows raid5d to break out of its loop, and the overlap will be resolved when the stripe is eventually processed by handle_stripe(). Fixes: `773ca82fa1` ("raid5: make release_stripe lockless") Cc: stable@vger.kernel.org Signed-off-by: FengWei Shih <dannyshih@synology.com> Signed-off-by: Chia-Ming Chang <chiamingc@synology.com> Link: https://lore.kernel.org/linux-raid/20260402061406.455755-1-chiamingc@synology.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-04-07 15:13:52 +08:00
Yu Kuai	cf86bb53b9	md: wake raid456 reshape waiters before suspend During raid456 reshape, direct IO across the reshape position can sleep in raid5_make_request() waiting for reshape progress while still holding an active_io reference. If userspace then freezes reshape and writes md/suspend_lo or md/suspend_hi, mddev_suspend() kills active_io and waits for all in-flight IO to drain. This can deadlock: the IO needs reshape progress to continue, but the reshape thread is already frozen, so the active_io reference is never dropped and suspend never completes. raid5_prepare_suspend() already wakes wait_for_reshape for dm-raid. Do the same for normal md suspend when reshape is already interrupted, so waiting raid456 IO can abort, drop its reference, and let suspend finish. The mdadm test tests/25raid456-reshape-deadlock reproduces the hang. Fixes: `714d20150e` ("md: add new helpers to suspend/resume array") Link: https://lore.kernel.org/linux-raid/20260327140729.2030564-1-yukuai@fnnas.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-04-07 13:09:22 +08:00
Xiao Ni	808cec7460	md/raid1: serialize overlap io for writemostly disk Previously, using wait_event() would wake up all waiters simultaneously, and they would compete for the tree lock. The bio which gets the lock first will be handled, so the write sequence cannot be guaranteed. For example: bio1(100,200) bio2(150,200) bio3(150,300) The write sequence of fast device is bio1,bio2,bio3. But the write sequence of slow device could be bio1,bio3,bio2 due to lock competition. This causes data corruption. Replace waitqueue with a fifo list to guarantee the write sequence. And it also needs to iterate the list when removing one entry. If not, it may miss the opportunity to wake up the waiting io. For example: bio1(1,3), bio2(2,4) bio3(5,7), bio4(6,8) These four bios are in the same bucket. bio1 and bio3 are inserted into the rbtree. bio2 and bio4 are added to the waiting list and bio2 is the first one. bio3 returns from slow disk and tries to wake up the waiting bios. bio2 is removed from the list and will be handled. But bio1 hasn't finished. So bio2 will be added into waiting list again. Then bio1 returns from slow disk and wakes up waiting bios. bio4 is removed from the list and will be handled. Now bio1, bio3 and bio4 all finish and bio2 is left on the waiting list. So it needs to iterate the waiting list to wake up the right bio. Signed-off-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/linux-raid/20260324072501.59865-1-xni@redhat.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-04-07 13:09:22 +08:00
Yu Kuai	e92a5325b5	md/md-llbitmap: optimize initial sync with write_zeroes_unmap support For RAID-456 arrays with llbitmap, if all underlying disks support write_zeroes with unmap, issue write_zeroes to zero all disk data regions and initialize the bitmap to BitCleanUnwritten instead of BitUnwritten. This optimization skips the initial XOR parity building because: 1. write_zeroes with unmap guarantees zeroed reads after the operation 2. For RAID-456, when all data is zero, parity is automatically consistent (0 XOR 0 XOR ... = 0) 3. BitCleanUnwritten indicates parity is valid but no user data has been written The implementation adds two helper functions: - llbitmap_all_disks_support_wzeroes_unmap(): Checks if all active disks support write_zeroes with unmap - llbitmap_zero_all_disks(): Issues blkdev_issue_zeroout() to each rdev's data region to zero all disks The zeroing and bitmap state setting happens in llbitmap_init_state() during bitmap initialization. If any disk fails to zero, we fall back to BitUnwritten and normal lazy recovery. This significantly reduces array initialization time for RAID-456 arrays built on modern NVMe SSDs or other devices that support write_zeroes with unmap. Reviewed-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-4-yukuai@fnnas.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-04-07 13:09:22 +08:00
Yu Kuai	4403023e2a	md/md-llbitmap: add CleanUnwritten state for RAID-5 proactive parity building Add new states to the llbitmap state machine to support proactive XOR parity building for RAID-5 arrays. This allows users to pre-build parity data for unwritten regions before any user data is written. New states added: - BitNeedSyncUnwritten: Transitional state when proactive sync is triggered via sysfs on Unwritten regions. - BitSyncingUnwritten: Proactive sync in progress for unwritten region. - BitCleanUnwritten: XOR parity has been pre-built, but no user data written yet. When user writes to this region, it transitions to BitDirty. New actions added: - BitmapActionProactiveSync: Trigger for proactive XOR parity building. - BitmapActionClearUnwritten: Convert CleanUnwritten/NeedSyncUnwritten/ SyncingUnwritten states back to Unwritten before recovery starts. State flows: - Current (lazy): Unwritten -> (write) -> NeedSync -> (sync) -> Dirty -> Clean - New (proactive): Unwritten -> (sysfs) -> NeedSyncUnwritten -> (sync) -> CleanUnwritten - On write to CleanUnwritten: CleanUnwritten -> (write) -> Dirty -> Clean - On disk replacement: CleanUnwritten regions are converted to Unwritten before recovery starts, so recovery only rebuilds regions with user data A new sysfs interface is added at /sys/block/mdX/md/llbitmap/proactive_sync (write-only) to trigger proactive sync. This only works for RAID-456 arrays. Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-3-yukuai@fnnas.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-04-07 13:09:22 +08:00
Yu Kuai	09af773650	md: add fallback to correct bitmap_ops on version mismatch If default bitmap version and on-disk version doesn't match, and mdadm is not the latest version to set bitmap_type, set bitmap_ops based on the disk version. Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-2-yukuai@fnnas.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-04-07 13:09:22 +08:00
Junrui Luo	b0cc3ae97e	md/raid5: validate payload size before accessing journal metadata r5c_recovery_analyze_meta_block() and r5l_recovery_verify_data_checksum_for_mb() iterate over payloads in a journal metadata block using on-disk payload size fields without validating them against the remaining space in the metadata block. A corrupted journal contains payload sizes extending beyond the PAGE_SIZE boundary can cause out-of-bounds reads when accessing payload fields or computing offsets. Add bounds validation for each payload type to ensure the full payload fits within meta_size before processing. Fixes: `b4c625c673` ("md/r5cache: r5cache recovery: part 1") Cc: stable@vger.kernel.org Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Link: https://lore.kernel.org/linux-raid/SYBPR01MB78815E78D829BB86CD7C8015AF5FA@SYBPR01MB7881.ausprd01.prod.outlook.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-04-07 13:09:22 +08:00
Abd-Alrhman Masalkhi	e4979f4fac	md: remove unused static md_wq workqueue The md_wq workqueue is defined as static and initialized in md_init(), but it is not used anywhere within md.c. All asynchronous and deferred work in this file is handled via md_misc_wq or dedicated md threads. Fixes: `b75197e86e` ("md: Remove flush handling") Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://lore.kernel.org/linux-raid/20260328193522.3624-1-abd.masalkhi@gmail.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-04-07 13:09:21 +08:00
Gregory Price	078d1d8e68	md/raid0: use kvzalloc/kvfree for strip_zone and devlist allocations syzbot reported a WARNING at mm/page_alloc.c:__alloc_frozen_pages_noprof() triggered by create_strip_zones() in the RAID0 driver. When raid_disks is large, the allocation size exceeds MAX_PAGE_ORDER (4MB on x86), causing WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER). Convert the strip_zone and devlist allocations from kzalloc/kzalloc_objs to kvzalloc/kvzalloc_objs, which first attempts a contiguous allocation with __GFP_NOWARN and then falls back to vmalloc for large sizes. Convert the corresponding kfree calls to kvfree. Both arrays are pure metadata lookup tables (arrays of pointers and zone descriptors) accessed only via indexing, so they do not require physically contiguous memory. Reported-by: syzbot+924649752adf0d3ac9dd@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69adaba8.a00a0220.b130.0005.GAE@google.com/ Signed-off-by: Gregory Price <gourry@gourry.net> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/linux-raid/20260308234202.3118119-1-gourry@gourry.net/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-04-07 13:09:21 +08:00
Yu Kuai	2aa72276fa	md: fix array_state=clear sysfs deadlock When "clear" is written to array_state, md_attr_store() breaks sysfs active protection so the array can delete itself from its own sysfs store method. However, md_attr_store() currently drops the mddev reference before calling sysfs_unbreak_active_protection(). Once do_md_stop(..., 0) has made the mddev eligible for delayed deletion, the temporary kobject reference taken by sysfs_break_active_protection() can become the last kobject reference protecting the md kobject. That allows sysfs_unbreak_active_protection() to drop the last kobject reference from the current sysfs writer context. kobject teardown then recurses into kernfs removal while the current sysfs node is still being unwound, and lockdep reports recursive locking on kn->active with kernfs_drain() in the call chain. Reproducer on an existing level: 1. Create an md0 linear array and activate it: mknod /dev/md0 b 9 0 echo none > /sys/block/md0/md/metadata_version echo linear > /sys/block/md0/md/level echo 1 > /sys/block/md0/md/raid_disks echo "$(cat /sys/class/block/sdb/dev)" > /sys/block/md0/md/new_dev echo "$(($(cat /sys/class/block/sdb/size) / 2))" > \ /sys/block/md0/md/dev-sdb/size echo 0 > /sys/block/md0/md/dev-sdb/slot echo active > /sys/block/md0/md/array_state 2. Wait briefly for the array to settle, then clear it: sleep 2 echo clear > /sys/block/md0/md/array_state The warning looks like: WARNING: possible recursive locking detected bash/588 is trying to acquire lock: (kn->active#65) at __kernfs_remove+0x157/0x1d0 but task is already holding lock: (kn->active#65) at sysfs_unbreak_active_protection+0x1f/0x40 ... Call Trace: kernfs_drain __kernfs_remove kernfs_remove_by_name_ns sysfs_remove_group sysfs_remove_groups __kobject_del kobject_put md_attr_store kernfs_fop_write_iter vfs_write ksys_write Restore active protection before mddev_put() so the extra sysfs kobject reference is dropped while the mddev is still held alive. The actual md kobject deletion is then deferred until after the sysfs write path has fully returned. Fixes: `9e59d60976` ("md: call del_gendisk in control path") Reviewed-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/linux-raid/20260330055213.3976052-1-yukuai@fnnas.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-04-06 11:40:14 +08:00
Caleb Sander Mateos	8b155f2e4a	block: remove unused BVEC_ITER_ALL_INIT This macro no longer has any users, so remove it. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://patch.msgid.link/20260403184852.2140919-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-04 08:10:37 -06:00
Mingzhe Zou	20a8e451ec	bcache: fix uninitialized closure object In the previous patch ("bcache: fix cached_dev.sb_bio use-after-free and crash"), we adopted a simple modification suggestion from AI to fix the use-after-free. But in actual testing, we found an extreme case where the device is stopped before calling bch_write_bdev_super(). At this point, struct closure sb_write has not been initialized yet. For this patch, we ensure that sb_bio has been completed via sb_write_mutex. Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Coly Li <colyli@fnnas.com> Link: https://patch.msgid.link/20260403042135.2221247-1-colyli@fnnas.com Fixes: `fec114a98b` ("bcache: fix cached_dev.sb_bio use-after-free and crash") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-03 05:11:08 -06:00
Mingzhe Zou	fec114a98b	bcache: fix cached_dev.sb_bio use-after-free and crash In our production environment, we have received multiple crash reports regarding libceph, which have caught our attention: ``` [6888366.280350] Call Trace: [6888366.280452] blk_update_request+0x14e/0x370 [6888366.280561] blk_mq_end_request+0x1a/0x130 [6888366.280671] rbd_img_handle_request+0x1a0/0x1b0 [rbd] [6888366.280792] rbd_obj_handle_request+0x32/0x40 [rbd] [6888366.280903] __complete_request+0x22/0x70 [libceph] [6888366.281032] osd_dispatch+0x15e/0xb40 [libceph] [6888366.281164] ? inet_recvmsg+0x5b/0xd0 [6888366.281272] ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph] [6888366.281405] ceph_con_process_message+0x79/0x140 [libceph] [6888366.281534] ceph_con_v1_try_read+0x5d7/0xf30 [libceph] [6888366.281661] ceph_con_workfn+0x329/0x680 [libceph] ``` After analyzing the coredump file, we found that the address of dc->sb_bio has been freed. We know that cached_dev is only freed when it is stopped. Since sb_bio is a part of struct cached_dev, rather than an alloc every time. If the device is stopped while writing to the superblock, the released address will be accessed at endio. This patch hopes to wait for sb_write to complete in cached_dev_free. It should be noted that we analyzed the cause of the problem, then tell all details to the QWEN and adopted the modifications it made. Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Fixes: `cafe563591` ("bcache: A block layer cache") Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Coly Li <colyli@fnnas.com> Link: https://patch.msgid.link/20260322134102.480107-1-colyli@fnnas.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-03 05:08:15 -06:00
Thorsten Blum	a175ee8273	block: use sysfs_emit in sysfs show functions Replace sprintf() with sysfs_emit() in sysfs show functions. sysfs_emit() is preferred for formatting sysfs output because it provides safer bounds checking. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://patch.msgid.link/20260402164958.894879-4-thorsten.blum@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-03 05:07:46 -06:00
Christoph Hellwig	4e56428ed4	blk-crypto: fix name of the bio completion callback Fix a simple naming issue in the documentation: the completion routine is called bi_end_io and not bi_complete. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Link: https://patch.msgid.link/20260401135854.125109-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-02 07:09:03 -06:00
Ming Lei	c691e4b0d8	bio: fix kmemleak false positives from percpu bio alloc cache When a bio is allocated from the mempool with REQ_ALLOC_CACHE set and later completed, bio_put() places it into the per-cpu bio_alloc_cache via bio_put_percpu_cache() instead of freeing it back to the mempool/slab. The slab allocation remains tracked by kmemleak, but the only reference to the bio is through the percpu cache's free_list, which kmemleak fails to trace through percpu memory. This causes kmemleak to report the cached bios as unreferenced objects. Use symmetric kmemleak_free()/kmemleak_alloc() calls to properly track bios across percpu cache transitions: - bio_put_percpu_cache: call kmemleak_free() when a bio enters the cache, unregistering it from kmemleak tracking. - bio_alloc_percpu_cache: call kmemleak_alloc() when a bio is taken from the cache for reuse, re-registering it so that genuine leaks of reused bios remain detectable. - __bio_alloc_cache_prune: call kmemleak_alloc() before bio_free() so that kmem_cache_free()'s internal kmemleak_free() has a matching allocation to pair with. Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://patch.msgid.link/20260326144058.2392319-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-02 07:07:03 -06:00
Jialin Wang	f91ffe89b2	blk-iocost: fix busy_level reset when no IOs complete When a disk is saturated, it is common for no IOs to complete within a timer period. Currently, in this case, rq_wait_pct and missed_ppm are calculated as 0, the iocost incorrectly interprets this as meeting QoS targets and resets busy_level to 0. This reset prevents busy_level from reaching the threshold (4) needed to reduce vrate. On certain cloud storage, such as Azure Premium SSD, we observed that iocost may fail to reduce vrate for tens of seconds during saturation, failing to mitigate noisy neighbor issues. Fix this by tracking the number of IO completions (nr_done) in a period. If nr_done is 0 and there are lagging IOs, the saturation status is unknown, so we keep busy_level unchanged. The issue is consistently reproducible on Azure Standard_D8as_v5 (Dasv5) VMs with 512GB Premium SSD (P20) using the script below. It was not observed on GCP n2d VMs (with 100G pd-ssd and 1.5T local-ssd), and no regressions were found with this patch. In this script, cgA performs large IOs with iodepth=128, while cgB performs small IOs with iodepth=1 rate_iops=100 rw=randrw. With iocost enabled, we expect it to throttle cgA, the submission latency (slat) of cgA should be significantly higher, cgB can reach 200 IOPS and the completion latency (clat) should below. BLK_DEVID="8:0" MODEL="rbps=173471131 rseqiops=3566 rrandiops=3566 wbps=173333269 wseqiops=3566 wrandiops=3566" QOS="rpct=90 rlat=3500 wpct=90 wlat=3500 min=80 max=10000" echo "$BLK_DEVID ctrl=user model=linear $MODEL" > /sys/fs/cgroup/io.cost.model echo "$BLK_DEVID enable=1 ctrl=user $QOS" > /sys/fs/cgroup/io.cost.qos CG_A="/sys/fs/cgroup/cgA" CG_B="/sys/fs/cgroup/cgB" FILE_A="/path/to/sda/A.fio.testfile" FILE_B="/path/to/sda/B.fio.testfile" RESULT_DIR="./iocost_results_$(date +%Y%m%d_%H%M%S)" mkdir -p "$CG_A" "$CG_B" "$RESULT_DIR" get_result() { local file=$1 local label=$2 local results=$(jq -r ' .jobs[0].mixed \| ( .iops \| tonumber \| round ) as $iops \| ( .bw_bytes / 1024 / 1024 ) as $bps \| ( .slat_ns.mean / 1000000 ) as $slat \| ( .clat_ns.mean / 1000000 ) as $avg \| ( .clat_ns.max / 1000000 ) as $max \| ( .clat_ns.percentile["90.000000"] / 1000000 ) as $p90 \| ( .clat_ns.percentile["99.000000"] / 1000000 ) as $p99 \| ( .clat_ns.percentile["99.900000"] / 1000000 ) as $p999 \| ( .clat_ns.percentile["99.990000"] / 1000000 ) as $p9999 \| "\($iops)\|\($bps)\|\($slat)\|\($avg)\|\($max)\|\($p90)\|\($p99)\|\($p999)\|\($p9999)" ' "$file") IFS='\|' read -r iops bps slat avg max p90 p99 p999 p9999 <<<"$results" printf "%-8s %-6s %-7.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f\n" \ "$label" "$iops" "$bps" "$slat" "$avg" "$max" "$p90" "$p99" "$p999" "$p9999" } run_fio() { local cg_path=$1 local filename=$2 local name=$3 local bs=$4 local qd=$5 local out=$6 shift 6 local extra=$@ ( pid=$(sh -c 'echo $PPID') echo $pid >"${cg_path}/cgroup.procs" fio --name="$name" --filename="$filename" --direct=1 --rw=randrw --rwmixread=50 \ --ioengine=libaio --bs="$bs" --iodepth="$qd" --size=4G --runtime=10 \ --time_based --group_reporting --unified_rw_reporting=mixed \ --output-format=json --output="$out" $extra >/dev/null 2>&1 ) & } echo "Starting Test ..." for bs_b in "4k" "32k" "256k"; do echo "Running iteration: BS=$bs_b" out_a="${RESULT_DIR}/cgA_1m.json" out_b="${RESULT_DIR}/cgB_${bs_b}.json" # cgA: Heavy background (BS 1MB, QD 128) run_fio "$CG_A" "$FILE_A" "cgA" "1m" 128 "$out_a" # cgB: Latency sensitive (Variable BS, QD 1, Read/Write IOPS limit 100) run_fio "$CG_B" "$FILE_B" "cgB" "$bs_b" 1 "$out_b" "--rate_iops=100" wait SUMMARY_DATA+="$(get_result "$out_a" "cgA-1m")"$'\n' SUMMARY_DATA+="$(get_result "$out_b" "cgB-$bs_b")"$'\n\n' done echo -e "\nFinal Results Summary:\n" printf "%-8s %-6s %-7s %-8s %-8s %-8s %-8s %-8s %-8s %-8s\n" \ "" "" "" "slat" "clat" "clat" "clat" "clat" "clat" "clat" printf "%-8s %-6s %-7s %-8s %-8s %-8s %-8s %-8s %-8s %-8s\n\n" \ "CGROUP" "IOPS" "MB/s" "avg(ms)" "avg(ms)" "max(ms)" "P90(ms)" "P99" "P99.9" "P99.99" echo "$SUMMARY_DATA" echo "Results saved in $RESULT_DIR" Before: slat clat clat clat clat clat clat CGROUP IOPS MB/s avg(ms) avg(ms) max(ms) P90(ms) P99 P99.9 P99.99 cgA-1m 166 166.37 3.44 748.95 1298.29 977.27 1233.13 1300.23 1300.23 cgB-4k 5 0.02 0.02 181.74 761.32 742.39 759.17 759.17 759.17 cgA-1m 167 166.51 1.98 748.68 1549.41 809.50 1451.23 1551.89 1551.89 cgB-32k 6 0.18 0.02 169.98 761.76 742.39 759.17 759.17 759.17 cgA-1m 166 165.55 2.89 750.89 1540.37 851.44 1451.23 1535.12 1535.12 cgB-256k 5 1.30 0.02 191.35 759.51 750.78 759.17 759.17 759.17 After: slat clat clat clat clat clat clat CGROUP IOPS MB/s avg(ms) avg(ms) max(ms) P90(ms) P99 P99.9 P99.99 cgA-1m 162 162.48 6.14 749.69 850.02 826.28 834.67 843.06 851.44 cgB-4k 199 0.78 0.01 1.95 42.12 2.57 7.50 34.87 42.21 cgA-1m 146 146.20 6.83 833.04 908.68 893.39 901.78 910.16 910.16 cgB-32k 200 6.25 0.01 2.32 31.40 3.06 7.50 16.58 31.33 cgA-1m 110 110.46 9.04 1082.67 1197.91 1182.79 1199.57 1199.57 1199.57 cgB-256k 200 49.98 0.02 3.69 22.20 4.88 9.11 20.05 22.15 Signed-off-by: Jialin Wang <wjl.linux@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://patch.msgid.link/20260331100509.182882-1-wjl.linux@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-31 13:56:38 -06:00
Jackie Liu	23308af722	blk-cgroup: fix disk reference leak in blkcg_maybe_throttle_current() Add the missing put_disk() on the error path in blkcg_maybe_throttle_current(). When blkcg lookup, blkg lookup, or blkg_tryget() fails, the function jumps to the out label which only calls rcu_read_unlock() but does not release the disk reference acquired by blkcg_schedule_throttle() via get_device(). Since current->throttle_disk is already set to NULL before the lookup, blkcg_exit() cannot release this reference either, causing the disk to never be freed. Restore the reference release that was present as blk_put_queue() in the original code but was inadvertently dropped during the conversion from request_queue to gendisk. Fixes: `f05837ed73` ("blk-cgroup: store a gendisk to throttle in struct task_struct") Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260331085054.46857-1-liu.yun@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-31 13:55:41 -06:00
Damien Le Moal	b2a78fec34	zloop: add max_open_zones option Introduce the new max_open_zones option to allow specifying a limit on the maximum number of open zones of a zloop device. This change allows creating a zloop device that can more closely mimick the characteristics of a physical SMR drive. When set to a non zero value, only up to max_open_zones zones can be in the implicit open (BLK_ZONE_COND_IMP_OPEN) and explicit open (BLK_ZONE_COND_EXP_OPEN) conditions at any time. The transition to the implicit open condition of a zone on a write operation can result in an implicit close of an already implicitly open zone. This is handled in the function zloop_do_open_zone(). This function also handles transitions to the explicit open condition. Implicit close transitions are handled using an LRU ordered list of open zones which is managed using the helper functions zloop_lru_rotate_open_zone() and zloop_lru_remove_open_zone(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260326203245.946830-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-31 08:33:28 -06:00
Jackie Liu	2a2f520fda	block: fix zones_cond memory leak on zone revalidation error paths When blk_revalidate_disk_zones() fails after disk_revalidate_zone_resources() has allocated args.zones_cond, the memory is leaked because no error path frees it. Fixes: `6e945ffb65` ("block: use zone condition to determine conventional zones") Suggested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Link: https://patch.msgid.link/20260331111216.24242-1-liu.yun@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-31 07:05:49 -06:00
Daan De Meyer	267ec4d722	loop: fix partition scan race between udev and loop_reread_partitions() When LOOP_CONFIGURE is called with LO_FLAGS_PARTSCAN, the following sequence occurs: 1. disk_force_media_change() sets GD_NEED_PART_SCAN 2. Uevent suppression is lifted and a KOBJ_CHANGE uevent is sent 3. loop_global_unlock() releases the lock 4. loop_reread_partitions() calls bdev_disk_changed() to scan There is a race between steps 2 and 4: when udev receives the uevent and opens the device before loop_reread_partitions() runs, blkdev_get_whole() in bdev.c sees GD_NEED_PART_SCAN set and calls bdev_disk_changed() for a first scan. Then loop_reread_partitions() does a second scan. The open_mutex serializes these two scans, but does not prevent both from running. The second scan in bdev_disk_changed() drops all partition devices from the first scan (via blk_drop_partitions()) before re-adding them, causing partition block devices to briefly disappear. This breaks any systemd unit with BindsTo= on the partition device: systemd observes the device going dead, fails the dependent units, and does not retry them when the device reappears. Fix this by removing the GD_NEED_PART_SCAN set from disk_force_media_change() entirely. None of the current callers need the lazy on-open partition scan triggered by this flag: - floppy: sets GENHD_FL_NO_PART, so disk_has_partscan() is always false and GD_NEED_PART_SCAN has no effect. - loop (loop_configure, loop_change_fd): when LO_FLAGS_PARTSCAN is set, loop_reread_partitions() performs an explicit scan. When not set, GD_SUPPRESS_PART_SCAN prevents the lazy scan path. - loop (__loop_clr_fd): calls bdev_disk_changed() explicitly if LO_FLAGS_PARTSCAN is set. - nbd (nbd_clear_sock_ioctl): capacity is set to zero immediately after; nbd manages GD_NEED_PART_SCAN explicitly elsewhere. With GD_NEED_PART_SCAN no longer set by disk_force_media_change(), udev opening the loop device after the uevent no longer triggers a redundant scan in blkdev_get_whole(), and only the single explicit scan from loop_reread_partitions() runs. A regression test for this bug has been submitted to blktests: https://github.com/linux-blktests/blktests/pull/240. Fixes: `9f65c489b6` ("loop: raise media_change event") Signed-off-by: Daan De Meyer <daan@amutable.com> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://patch.msgid.link/20260331105130.1077599-1-daan@amutable.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-31 07:04:34 -06:00
Milan Broz	499d2d2f4c	sed-opal: Add STACK_RESET command The TCG Opal device could enter a state where no new session can be created, blocking even Discovery or PSID reset. While a power cycle or waiting for the timeout should work, there is another possibility for recovery: using the Stack Reset command. The Stack Reset command is defined in the TCG Storage Architecture Core Specification and is mandatory for all Opal devices (see Section 3.3.6 of the Opal SSC specification). This patch implements the Stack Reset command. Sending it should clear all active sessions immediately, allowing subsequent commands to run successfully. While it is a TCG transport layer command, the Linux kernel implements only Opal ioctls, so it makes sense to use the IOC_OPAL ioctl interface. The Stack Reset takes no arguments; the response can be success or pending. If the command reports a pending state, userspace can try to repeat it; in this case, the code returns -EBUSY. Signed-off-by: Milan Broz <gmazyland@gmail.com> Reviewed-by: Ondrej Kozina <okozina@redhat.com> Link: https://patch.msgid.link/20260310095349.411287-1-gmazyland@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-31 07:04:00 -06:00
Jens Axboe	75e75445b9	Merge tag 'nvme-7.1-2026-03-27' of git://git.infradead.org/nvme into for-7.1/block Pull NVMe updates from Keith: "- Fabrics authentication updates (Eric, Alistar) - Enanced block queue limits support (Caleb) - Workqueue usage updates (Marco) - A new write zeroes device quirk (Robert) - Tagset cleanup fix for loop device (Nilay)" * tag 'nvme-7.1-2026-03-27' of git://git.infradead.org/nvme: (41 commits) nvme-loop: do not cancel I/O and admin tagset during ctrl reset/shutdown nvme: add WQ_PERCPU to alloc_workqueue users nvmet-fc: add WQ_PERCPU to alloc_workqueue users nvmet: replace use of system_wq with system_percpu_wq nvme-auth: Don't propose NVME_AUTH_DHGROUP_NULL with SC_C nvme: Add the DHCHAP maximum HD IDs nvme-pci: add NVME_QUIRK_DISABLE_WRITE_ZEROES for Kingston OM3SGP4 nvme: respect NVME_QUIRK_DISABLE_WRITE_ZEROES when wzsl is set nvmet: report NPDGL and NPDAL nvmet: use NVME_NS_FEAT_OPTPERF_SHIFT nvme: set discard_granularity from NPDG/NPDA nvme: add from0based() helper nvme: always issue I/O Command Set specific Identify Namespace nvme: update nvme_id_ns OPTPERF constants nvme: fold nvme_config_discard() into nvme_update_disk_info() nvme: add preferred I/O size fields to struct nvme_id_ns_nvm nvme: Allow reauth from sysfs nvme: Expose the tls_configured sysfs for secure concat connections nvmet-tcp: Don't free SQ on authentication success nvmet-tcp: Don't error if TLS is enabed on a reset ...	2026-03-27 09:51:17 -06:00
Nilay Shroff	886f352015	nvme-loop: do not cancel I/O and admin tagset during ctrl reset/shutdown Cancelling the I/O and admin tagsets during nvme-loop controller reset or shutdown is unnecessary. The subsequent destruction of the I/O and admin queues already waits for all in-flight target operations to complete. Cancelling the tagsets first also opens a race window. After a request tag has been cancelled, a late completion from the target may still arrive before the queues are destroyed. In that case the completion path may access a request whose tag has already been cancelled or freed, which can lead to a kernel crash. Please see below the kernel crash encountered while running blktests nvme/040: run blktests nvme/040 at 2026-03-08 06:34:27 loop0: detected capacity change from 0 to 2097152 nvmet: adding nsid 1 to subsystem blktests-subsystem-1 nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349. nvme nvme6: creating 96 I/O queues. nvme nvme6: new ctrl: "blktests-subsystem-1" nvme_log_error: 1 callbacks suppressed block nvme6n1: no usable path - requeuing I/O nvme6c6n1: Read(0x2) @ LBA 2096384, 128 blocks, Host Aborted Command (sct 0x3 / sc 0x71) blk_print_req_error: 1 callbacks suppressed I/O error, dev nvme6c6n1, sector 2096384 op 0x0:(READ) flags 0x2880700 phys_seg 1 prio class 2 block nvme6n1: no usable path - requeuing I/O Kernel attempted to read user page (236) - exploit attempt? (uid: 0) BUG: Kernel NULL pointer dereference on read at 0x00000236 Faulting instruction address: 0xc000000000961274 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries Modules linked in: nvme_loop nvme_fabrics loop nvmet null_blk rpadlpar_io rpaphp xsk_diag bonding rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink pseries_rng dax_pmem vmx_crypto drm drm_panel_orientation_quirks xfs mlx5_core nvme bnx2x sd_mod nd_pmem nd_btt nvme_core sg papr_scm tls libnvdimm ibmvscsi ibmveth scsi_transport_srp nvme_keyring nvme_auth mdio hkdf pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: loop] CPU: 25 UID: 0 PID: 0 Comm: swapper/25 Kdump: loaded Not tainted 7.0.0-rc3+ #14 PREEMPT Hardware name: IBM,9043-MRX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1120.00 (RF1120_128) hv:phyp pSeries NIP: c000000000961274 LR: c008000009af1808 CTR: c00000000096124c REGS: c0000007ffc0f910 TRAP: 0300 Not tainted (7.0.0-rc3+) MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22222222 XER: 00000000 CFAR: c008000009af232c DAR: 0000000000000236 DSISR: 40000000 IRQMASK: 0 GPR00: c008000009af17fc c0000007ffc0fbb0 c000000001c78100 c0000000be05cc00 GPR04: 0000000000000001 0000000000000000 0000000000000007 0000000000000000 GPR08: 0000000000000000 0000000000000000 0000000000000002 c008000009af2318 GPR12: c00000000096124c c0000007ffdab880 0000000000000000 0000000000000000 GPR16: 0000000000000010 0000000000000000 0000000000000004 0000000000000000 GPR20: 0000000000000001 c000000002ca2b00 0000000100043bb2 000000000000000a GPR24: 000000000000000a 0000000000000000 0000000000000000 0000000000000000 GPR28: c000000084021d40 c000000084021d50 c0000000be05cd60 c0000000be05cc00 NIP [c000000000961274] blk_mq_complete_request_remote+0x28/0x2d4 LR [c008000009af1808] nvme_loop_queue_response+0x110/0x290 [nvme_loop] Call Trace: 0xc00000000502c640 (unreliable) nvme_loop_queue_response+0x104/0x290 [nvme_loop] __nvmet_req_complete+0x80/0x498 [nvmet] nvmet_req_complete+0x24/0xf8 [nvmet] nvmet_bio_done+0x58/0xcc [nvmet] bio_endio+0x250/0x390 blk_update_request+0x2e8/0x68c blk_mq_end_request+0x30/0x5c lo_complete_rq+0x94/0x110 [loop] blk_complete_reqs+0x78/0x98 handle_softirqs+0x148/0x454 do_softirq_own_stack+0x3c/0x50 __irq_exit_rcu+0x18c/0x1b4 irq_exit+0x1c/0x34 do_IRQ+0x114/0x278 hardware_interrupt_common_virt+0x28c/0x290 Since the queue teardown path already guarantees that all target-side operations have completed, cancelling the tagsets is redundant and unsafe. So avoid cancelling the I/O and admin tagsets during controller reset and shutdown. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:06 -07:00
Marco Crivellari	e8e1a4c0fb	nvme: add WQ_PERCPU to alloc_workqueue users This continues the effort to refactor workqueue APIs, which began with the introduction of new workqueues and a new alloc_workqueue flag in: commit `128ea9f6cc` ("workqueue: Add system_percpu_wq and system_dfl_wq") commit `930c2ea566` ("workqueue: Add new WQ_PERCPU flag") The refactoring is going to alter the default behavior of alloc_workqueue() to be unbound by default. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. For more details see the Link tag below. In order to keep alloc_workqueue() behavior identical, explicitly request WQ_PERCPU. Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Suggested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:06 -07:00
Marco Crivellari	12f5fb5ee1	nvmet-fc: add WQ_PERCPU to alloc_workqueue users This continues the effort to refactor workqueue APIs, which began with the introduction of new workqueues and a new alloc_workqueue flag in: commit `128ea9f6cc` ("workqueue: Add system_percpu_wq and system_dfl_wq") commit `930c2ea566` ("workqueue: Add new WQ_PERCPU flag") The refactoring is going to alter the default behavior of alloc_workqueue() to be unbound by default. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. For more details see the Link tag below. In order to keep alloc_workqueue() behavior identical, explicitly request WQ_PERCPU. Cc: Justin Tee <justin.tee@broadcom.com> Cc: Naresh Gottumukkala <nareshgottumukkala83@gmail.com> CC: Paul Ely <paul.ely@broadcom.com> Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Suggested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:05 -07:00
Marco Crivellari	3d553be6d2	nvmet: replace use of system_wq with system_percpu_wq This patch continues the effort to refactor workqueue APIs, which has begun with the changes introducing new workqueues and a new alloc_workqueue flag: commit `128ea9f6cc` ("workqueue: Add system_percpu_wq and system_dfl_wq") commit `930c2ea566` ("workqueue: Add new WQ_PERCPU flag") The point of the refactoring is to eventually alter the default behavior of workqueues to become unbound by default so that their workload placement is optimized by the scheduler. Before that to happen, workqueue users must be converted to the better named new workqueues with no intended behaviour changes: system_wq -> system_percpu_wq system_unbound_wq -> system_dfl_wq This way the old obsolete workqueues (system_wq, system_unbound_wq) can be removed in the future. Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Suggested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:05 -07:00
Alistair Francis	33eb451044	nvme-auth: Don't propose NVME_AUTH_DHGROUP_NULL with SC_C Section 8.3.4.5.2 of the NVMe 2.1 base spec states that """ The 00h identifier shall not be proposed in an AUTH_Negotiate message that requests secure channel concatenation (i.e., with the SC_C field set to a non-zero value). """ We need to ensure that we don't set the NVME_AUTH_DHGROUP_NULL idlist if SC_C is set. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chris Leech <cleech@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kamaljit Singh <kamaljit.singh@opensource.wdc.com> Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:05 -07:00
Alistair Francis	09e8f0f934	nvme: Add the DHCHAP maximum HD IDs In preperation for using DHCHAP length in upcoming host and target patches let's add the hash and diffie-hellman ID length macros. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Yunje Shin <ioerts@kookmin.ac.kr> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chris Leech <cleech@redhat.com> Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:05 -07:00
Robert Beckett	a8eebf9699	nvme-pci: add NVME_QUIRK_DISABLE_WRITE_ZEROES for Kingston OM3SGP4 The Kingston OM3SGP42048K2-A00 (PCI ID 2646:502f) firmware has a race condition when processing concurrent write zeroes and DSM (discard) commands, causing spurious "LBA Out of Range" errors and IOMMU page faults at address 0x0. The issue is reliably triggered by running two concurrent mkfs commands on different partitions of the same drive, which generates interleaved write zeroes and discard operations. Disable write zeroes for this device, matching the pattern used for other Kingston OM* drives that have similar firmware issues. Cc: stable@vger.kernel.org Signed-off-by: Robert Beckett <bob.beckett@collabora.com> Assisted-by: claude-opus-4-6-v1 Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:05 -07:00
Robert Beckett	40f0496b61	nvme: respect NVME_QUIRK_DISABLE_WRITE_ZEROES when wzsl is set The NVM Command Set Identify Controller data may report a non-zero Write Zeroes Size Limit (wzsl). When present, nvme_init_non_mdts_limits() unconditionally overrides max_zeroes_sectors from wzsl, even if NVME_QUIRK_DISABLE_WRITE_ZEROES previously set it to zero. This effectively re-enables write zeroes for devices that need it disabled, defeating the quirk. Several Kingston OM* drives rely on this quirk to avoid firmware issues with write zeroes commands. Check for the quirk before applying the wzsl override. Fixes: `5befc7c26e` ("nvme: implement non-mdts command limits") Cc: stable@vger.kernel.org Signed-off-by: Robert Beckett <bob.beckett@collabora.com> Assisted-by: claude-opus-4-6-v1 Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:05 -07:00
Caleb Sander Mateos	c4cfe8c328	nvmet: report NPDGL and NPDAL A block device with a very large discard_granularity queue limit may not be able to report it in the 16-bit NPDG and NPDA fields in the Identify Namespace data structure. For this reason, version 2.1 of the NVMe specs added 32-bit fields NPDGL and NPDAL to the NVM Command Set Specific Identify Namespace structure. So report the discard_granularity there too and set OPTPERF to 11b to indicate those fields are supported. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:05 -07:00
Caleb Sander Mateos	e0d56e7055	nvmet: use NVME_NS_FEAT_OPTPERF_SHIFT Use the NVME_NS_FEAT_OPTPERF_SHIFT constant in nvmet_bdev_set_limits() to set the OPTPERF bits of the nvme_id_ns NSFEAT field instead of the magic number 4. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:04 -07:00
Caleb Sander Mateos	1029298da3	nvme: set discard_granularity from NPDG/NPDA Currently, nvme_config_discard() always sets the discard_granularity queue limit to the logical block size. However, NVMe namespaces can advertise a larger preferred discard granularity in the NPDG or NPDA field of the Identify Namespace structure or the NPDGL or NPDAL fields of the I/O Command Set Specific Identify Namespace structure. Use these fields to compute the discard_granularity limit. The logic is somewhat involved. First, the fields are optional. NPDG is only reported if the low bit of OPTPERF is set in NSFEAT. NPDA is reported if any bit of OPTPERF is set. And NPDGL and NPDAL are reported if the high bit of OPTPERF is set. NPDGL and NPDAL can also each be set to 0 to opt out of reporting a limit. I/O Command Set Specific Identify Namespace may also not be supported by older NVMe controllers. Another complication is that multiple values may be reported among NPDG, NPDGL, NPDA, and NPDAL. The spec says to prefer the values reported in the L variants. The spec says NPDG should be a multiple of NPDA and NPDGL should be a multiple of NPDAL, but it doesn't specify a relationship between NPDG and NPDAL or NPDGL and NPDA. So use the maximum of the reported NPDG(L) and NPDA(L) values as the discard_granularity. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:04 -07:00
Caleb Sander Mateos	b465046c8c	nvme: add from0based() helper The NVMe specifications are big fans of "0's based"/"0-based" fields for encoding values that must be positive. The encoded value is 1 less than the value it represents. nvmet already provides a helper to0based() for encoding 0's based values, so add a corresponding helper to decode these fields on the host side. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:04 -07:00
Caleb Sander Mateos	823340b7e8	nvme: always issue I/O Command Set specific Identify Namespace Currently, the I/O Command Set specific Identify Namespace structure is only fetched for controllers that support extended LBA formats. This is because struct nvme_id_ns_nvm is only used by nvme_configure_pi_elbas(), which is only called when the ELBAS bit is set in the CTRATT field of the Identify Controller structure. However, the I/O Command Set specific Identify Namespace structure will soon be used in nvme_update_disk_info(), so always try to obtain it in nvme_update_ns_info_block(). This Identify structure is first defined in NVMe spec version 2.0, but controllers reporting older versions could still implement it. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:04 -07:00
Caleb Sander Mateos	d3c04a6ea5	nvme: update nvme_id_ns OPTPERF constants In NVMe verson 2.0 and below, OPTPERF comprises only bit 4 of NSFEAT in the Identify Namespace structure. Since version 2.1, OPTPERF includes both bits 4 and 5 of NSFEAT. Replace the NVME_NS_FEAT_IO_OPT constant with NVME_NS_FEAT_OPTPERF_SHIFT, NVME_NS_FEAT_OPTPERF_MASK, and NVME_NS_FEAT_OPTPERF_MASK_2_1, representing the first bit, pre-2.1 bit width, and post-2.1 bit width of OPTPERF. Update nvme_update_disk_info() to check both OPTPERF bits for controllers that report version 2.1 or newer, as NPWG and NOWS are supported even if only bit 5 is set. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:04 -07:00
Caleb Sander Mateos	9110b85244	nvme: fold nvme_config_discard() into nvme_update_disk_info() The choice of what queue limits are set in nvme_update_disk_info() vs. nvme_config_discard() seems a bit arbitrary. A subsequent commit will compute the discard_granularity limit using struct nvme_id_ns, which is only passed to nvme_update_disk_info() currently. So move the logic in nvme_config_discard() to nvme_update_disk_info(). Replace several instances of ns->ctrl in nvme_update_disk_info() with the ctrl variable brought from nvme_config_discard(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:04 -07:00
Caleb Sander Mateos	ac61e869be	nvme: add preferred I/O size fields to struct nvme_id_ns_nvm A subsequent change will use the NPDGL and NPDAL fields of the NVM Command Set Specific Identify Namespace structure, so add them (and the handful of intervening fields) to struct nvme_id_ns_nvm. Add an assertion that the size is still 4 KB. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:03 -07:00
Alistair Francis	ed6a9f7dab	nvme: Allow reauth from sysfs Allow userspace to trigger a reauth (REPLACETLSPSK) from sysfs. This can be done by writing a zero to the sysfs file. echo 0 > /sys/devices/virtual/nvme-fabrics/ctl/nvme0/tls_configured_key In order to use the new keys for the admin queue we call controller reset. This isn't ideal, but I can't find a simpler way to reset the admin queue TLS connection. Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:03 -07:00
Alistair Francis	56d25f1a6e	nvme: Expose the tls_configured sysfs for secure concat connections Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:03 -07:00
Alistair Francis	2e6eb6b277	nvmet-tcp: Don't free SQ on authentication success Curently after the host sends a REPLACETLSPSK we free the TLS keys as part of calling nvmet_auth_sq_free() on success. This means when the host sends a follow up REPLACETLSPSK we return CONCAT_MISMATCH as the check for !nvmet_queue_tls_keyid(req->sq) fails. This patch ensures we don't free the TLS key on success as we might need it again in the future. Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:03 -07:00
Alistair Francis	ecf4d2d883	nvmet-tcp: Don't error if TLS is enabed on a reset If the host sends a AUTH_Negotiate Message on the admin queue with REPLACETLSPSK set then we expect and require a TLS connection and shouldn't report an error if TLS is enabled. This change only enforces the nvmet_queue_tls_keyid() check if we aren't resetting the negotiation. Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:03 -07:00
Eric Biggers	6d888db2cf	crypto: remove HKDF library Remove crypto/hkdf.c, since it's no longer used. Originally it had two users, but now both of them just inline the needed HMAC computations using the HMAC library APIs. That ends up being better, since it eliminates all the complexity and performance issues associated with the crypto_shash abstraction and multi-step HMAC input formatting. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:03 -07:00
Eric Biggers	26c8c2ddde	nvme-auth: common: remove selections of no-longer used crypto modules Now that nvme-auth uses the crypto library instead of crypto_shash, remove obsolete selections from the NVME_AUTH kconfig option. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:03 -07:00
Eric Biggers	844d950bb2	nvme-auth: common: remove nvme_auth_digest_name() Since nvme_auth_digest_name() is no longer used, remove it and the associated data from the hash_map array. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:02 -07:00
Eric Biggers	16977e7755	nvme-auth: target: use crypto library in nvmet_auth_ctrl_hash() For the HMAC computation in nvmet_auth_ctrl_hash(), use the crypto library instead of crypto_shash. This is simpler, faster, and more reliable. Notably, this eliminates the crypto transformation object allocation for every call, which was very slow. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:02 -07:00
Eric Biggers	e501533f67	nvme-auth: target: use crypto library in nvmet_auth_host_hash() For the HMAC computation in nvmet_auth_host_hash(), use the crypto library instead of crypto_shash. This is simpler, faster, and more reliable. Notably, this eliminates the crypto transformation object allocation for every call, which was very slow. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:02 -07:00
Eric Biggers	efe8df9f9c	nvme-auth: target: remove obsolete crypto_has_shash() checks Since nvme-auth is now doing its HMAC computations using the crypto library, it's guaranteed that all the algorithms actually work. Therefore, remove the crypto_has_shash() checks which are now obsolete. However, the caller in nvmet_auth_negotiate() seems to have also been relying on crypto_has_shash(nvme_auth_hmac_name(host_hmac_id)) to validate the host_hmac_id. Therefore, make it validate the ID more directly by checking whether nvme_auth_hmac_hash_len() returns 0 or not. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:02 -07:00
Eric Biggers	ac9a49cf6e	nvme-auth: host: remove allocation of crypto_shash Now that the crypto_shash that is being allocated in nvme_auth_process_dhchap_challenge() and stored in the struct nvme_dhchap_queue_context is no longer used, remove it. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-03-27 07:35:02 -07:00

1 2 3 4 5 ...

1427994 Commits