Commit Graph

1428009 Commits

Author SHA1 Message Date
Chris Leech
bd7b7ce96d nvme-auth: Hash DH shared secret to create session key
The NVMe Base Specification 8.3.5.5.9 states that the session key Ks
shall be computed from the ephemeral DH key by applying the hash
function selected by the HashID parameter.

The current implementation stores the raw DH shared secret as the
session key without hashing it. This causes redundant hash operations:

1. Augmented challenge computation (section 8.3.5.5.4) requires
   Ca = HMAC(H(g^xy mod p), C). The code compensates by hashing the
   unhashed session key in nvme_auth_augmented_challenge() to produce
   the correct result.

2. PSK generation (section 8.3.5.5.9) requires PSK = HMAC(Ks, C1 || C2)
   where Ks should already be H(g^xy mod p). As the DH shared secret
   is always larger than the HMAC block size, HMAC internally hashes
   it before use, accidentally producing the correct result.

When using secure channel concatenation with bidirectional
authentication, this results in hashing the DH value three times: twice
for augmented challenge calculations and once during PSK generation.

Fix this by:
- Modifying nvme_auth_gen_shared_secret() to hash the DH shared secret
  once after computation: Ks = H(g^xy mod p)
- Removing the hash operation from nvme_auth_augmented_challenge()
  as the session key is now already hashed
- Updating session key buffer size from DH key size to hash output size
- Adding specification references in comments

This avoid storing the raw DH shared secret and reduces the number of
hash operations from three to one when using secure channel
concatenation.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Chris Leech <cleech@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-22 13:02:16 -07:00
Keith Busch
1cc4cdae2a nvme-pci: fix missed admin queue sq doorbell write
We can batch admin commands submitted through io_uring_cmd passthrough,
which means bd->last may be false and skips the doorbell write to
aggregate multiple commands per write. If a subsequent command can't be
dispatched for whatever reason, we have to provide the blk-mq ops'
commit_rqs callback in order to ensure we properly update the doorbell.

Fixes: 58e5bdeb9c ("nvme: enable uring-passthrough for admin commands")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-22 10:07:30 -07:00
Alistair Francis
5d10069e1a nvme-auth: Include SC_C in RVAL controller hash
Section 8.3.4.5.5 of the NVMe Base Specification 2.1 describes what is
included in the Response Value (RVAL) hash and SC_C should be included.
Currently we are hardcoding 0 instead of using the correct SC_C value.

Update the host and target code to use the SC_C when calculating the
RVAL instead of using 0.

Fixes: e88a7595b5 ("nvme-tcp: request secure channel concatenation")
Reviewed-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-22 10:07:30 -07:00
Chaitanya Kulkarni
26bb12b9ca nvme-tcp: teardown circular locking fixes
When a controller reset is triggered via sysfs (by writing to
/sys/class/nvme/<nvmedev>/reset_controller), the reset work tears down
and re-establishes all queues. The socket release using fput() defers
the actual cleanup to task_work delayed_fput workqueue. This deferred
cleanup can race with the subsequent queue re-allocation during reset,
potentially leading to use-after-free or resource conflicts.

Replace fput() with __fput_sync() to ensure synchronous socket release,
guaranteeing that all socket resources are fully cleaned up before the
function returns. This prevents races during controller reset where
new queue setup may begin before the old socket is fully released.

* Call chain during reset:
  nvme_reset_ctrl_work()
    -> nvme_tcp_teardown_ctrl()
      -> nvme_tcp_teardown_io_queues()
        -> nvme_tcp_free_io_queues()
          -> nvme_tcp_free_queue()       <-- fput() -> __fput_sync()
      -> nvme_tcp_teardown_admin_queue()
        -> nvme_tcp_free_admin_queue()
          -> nvme_tcp_free_queue()       <-- fput() -> __fput_sync()
    -> nvme_tcp_setup_ctrl()             <-- race with deferred fput

memalloc_noreclaim_save() sets PF_MEMALLOC which is intended for tasks
performing memory reclaim work that need reserve access. While PF_MEMALLOC
prevents the task from entering direct reclaim (causing __need_reclaim() to
return false), it does not strip __GFP_IO from gfp flags. The allocator can
therefore still trigger writeback I/O when __GFP_IO remains set, which is
unsafe when the caller holds block layer locks.

Switch to memalloc_noio_save() which sets PF_MEMALLOC_NOIO. This causes
current_gfp_context() to strip __GFP_IO|__GFP_FS from every allocation in
the scope, making it safe to allocate memory while holding elevator_lock and
set->srcu.

* The issue can be reproduced using blktests:

  nvme_trtype=tcp ./check nvme/005
blktests (master) # nvme_trtype=tcp ./check nvme/005
nvme/005 (tr=tcp) (reset local loopback target)              [failed]
    runtime  0.725s  ...  0.798s
    something found in dmesg:
    [  108.473940] run blktests nvme/005 at 2025-11-22 16:12:20

    [...]
    ...
    (See '/root/blktests/results/nodev_tr_tcp/nvme/005.dmesg' for the entire message)
blktests (master) # cat /root/blktests/results/nodev_tr_tcp/nvme/005.dmesg
[  108.473940] run blktests nvme/005 at 2025-11-22 16:12:20
[  108.526983] loop0: detected capacity change from 0 to 2097152
[  108.555606] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[  108.572531] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[  108.613061] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[  108.616832] nvme nvme0: creating 48 I/O queues.
[  108.630791] nvme nvme0: mapped 48/0/0 default/read/poll queues.
[  108.661892] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[  108.746639] nvmet: Created nvm controller 2 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[  108.748466] nvme nvme0: creating 48 I/O queues.
[  108.802984] nvme nvme0: mapped 48/0/0 default/read/poll queues.
[  108.829983] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
[  108.854288] block nvme0n1: no available path - failing I/O
[  108.854344] block nvme0n1: no available path - failing I/O
[  108.854373] Buffer I/O error on dev nvme0n1, logical block 1, async page read

[  108.891693] ======================================================
[  108.895912] WARNING: possible circular locking dependency detected
[  108.900184] 6.17.0nvme+ #3 Tainted: G                 N
[  108.903913] ------------------------------------------------------
[  108.908171] nvme/2734 is trying to acquire lock:
[  108.911957] ffff88810210e610 (set->srcu){.+.+}-{0:0}, at: __synchronize_srcu+0x17/0x170
[  108.917587]
               but task is already holding lock:
[  108.921570] ffff88813abea198 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0xa8/0x1c0
[  108.927361]
               which lock already depends on the new lock.

[  108.933018]
               the existing dependency chain (in reverse order) is:
[  108.938223]
               -> #4 (&q->elevator_lock){+.+.}-{4:4}:
[  108.942988]        __mutex_lock+0xa2/0x1150
[  108.945873]        elevator_change+0xa8/0x1c0
[  108.948925]        elv_iosched_store+0xdf/0x140
[  108.952043]        kernfs_fop_write_iter+0x16a/0x220
[  108.955367]        vfs_write+0x378/0x520
[  108.957598]        ksys_write+0x67/0xe0
[  108.959721]        do_syscall_64+0x76/0xbb0
[  108.962052]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  108.965145]
               -> #3 (&q->q_usage_counter(io)){++++}-{0:0}:
[  108.968923]        blk_alloc_queue+0x30e/0x350
[  108.972117]        blk_mq_alloc_queue+0x61/0xd0
[  108.974677]        scsi_alloc_sdev+0x2a0/0x3e0
[  108.977092]        scsi_probe_and_add_lun+0x1bd/0x430
[  108.979921]        __scsi_add_device+0x109/0x120
[  108.982504]        ata_scsi_scan_host+0x97/0x1c0
[  108.984365]        async_run_entry_fn+0x2d/0x130
[  108.986109]        process_one_work+0x20e/0x630
[  108.987830]        worker_thread+0x184/0x330
[  108.989473]        kthread+0x10a/0x250
[  108.990852]        ret_from_fork+0x297/0x300
[  108.992491]        ret_from_fork_asm+0x1a/0x30
[  108.994159]
               -> #2 (fs_reclaim){+.+.}-{0:0}:
[  108.996320]        fs_reclaim_acquire+0x99/0xd0
[  108.998058]        kmem_cache_alloc_node_noprof+0x4e/0x3c0
[  109.000123]        __alloc_skb+0x15f/0x190
[  109.002195]        tcp_send_active_reset+0x3f/0x1e0
[  109.004038]        tcp_disconnect+0x50b/0x720
[  109.005695]        __tcp_close+0x2b8/0x4b0
[  109.007227]        tcp_close+0x20/0x80
[  109.008663]        inet_release+0x31/0x60
[  109.010175]        __sock_release+0x3a/0xc0
[  109.011778]        sock_close+0x14/0x20
[  109.013263]        __fput+0xee/0x2c0
[  109.014673]        delayed_fput+0x31/0x50
[  109.016183]        process_one_work+0x20e/0x630
[  109.017897]        worker_thread+0x184/0x330
[  109.019543]        kthread+0x10a/0x250
[  109.020929]        ret_from_fork+0x297/0x300
[  109.022565]        ret_from_fork_asm+0x1a/0x30
[  109.024194]
               -> #1 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
[  109.026634]        lock_sock_nested+0x2e/0x70
[  109.028251]        tcp_sendmsg+0x1a/0x40
[  109.029783]        sock_sendmsg+0xed/0x110
[  109.031321]        nvme_tcp_try_send_cmd_pdu+0x13e/0x260 [nvme_tcp]
[  109.034263]        nvme_tcp_try_send+0xb3/0x330 [nvme_tcp]
[  109.036375]        nvme_tcp_queue_rq+0x342/0x3d0 [nvme_tcp]
[  109.038528]        blk_mq_dispatch_rq_list+0x297/0x800
[  109.040448]        __blk_mq_sched_dispatch_requests+0x3db/0x5f0
[  109.042677]        blk_mq_sched_dispatch_requests+0x29/0x70
[  109.044787]        blk_mq_run_work_fn+0x76/0x1b0
[  109.046535]        process_one_work+0x20e/0x630
[  109.048245]        worker_thread+0x184/0x330
[  109.049890]        kthread+0x10a/0x250
[  109.051331]        ret_from_fork+0x297/0x300
[  109.053024]        ret_from_fork_asm+0x1a/0x30
[  109.054740]
               -> #0 (set->srcu){.+.+}-{0:0}:
[  109.056850]        __lock_acquire+0x1468/0x2210
[  109.058614]        lock_sync+0xa5/0x110
[  109.060048]        __synchronize_srcu+0x49/0x170
[  109.061802]        elevator_switch+0xc9/0x330
[  109.063950]        elevator_change+0x128/0x1c0
[  109.065675]        elevator_set_none+0x4c/0x90
[  109.067316]        blk_unregister_queue+0xa8/0x110
[  109.069165]        __del_gendisk+0x14e/0x3c0
[  109.070824]        del_gendisk+0x75/0xa0
[  109.072328]        nvme_ns_remove+0xf2/0x230 [nvme_core]
[  109.074365]        nvme_remove_namespaces+0xf2/0x150 [nvme_core]
[  109.076652]        nvme_do_delete_ctrl+0x71/0x90 [nvme_core]
[  109.078775]        nvme_delete_ctrl_sync+0x3b/0x50 [nvme_core]
[  109.081009]        nvme_sysfs_delete+0x34/0x40 [nvme_core]
[  109.083082]        kernfs_fop_write_iter+0x16a/0x220
[  109.085009]        vfs_write+0x378/0x520
[  109.086539]        ksys_write+0x67/0xe0
[  109.087982]        do_syscall_64+0x76/0xbb0
[  109.089577]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  109.091665]
               other info that might help us debug this:

[  109.095478] Chain exists of:
                 set->srcu --> &q->q_usage_counter(io) --> &q->elevator_lock

[  109.099544]  Possible unsafe locking scenario:

[  109.101708]        CPU0                    CPU1
[  109.103402]        ----                    ----
[  109.105103]   lock(&q->elevator_lock);
[  109.106530]                                lock(&q->q_usage_counter(io));
[  109.109022]                                lock(&q->elevator_lock);
[  109.111391]   sync(set->srcu);
[  109.112586]
                *** DEADLOCK ***

[  109.114772] 5 locks held by nvme/2734:
[  109.116189]  #0: ffff888101925410 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x67/0xe0
[  109.119143]  #1: ffff88817a914e88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x10f/0x220
[  109.123141]  #2: ffff8881046313f8 (kn->active#185){++++}-{0:0}, at: sysfs_remove_file_self+0x26/0x50
[  109.126543]  #3: ffff88810470e1d0 (&set->update_nr_hwq_lock){++++}-{4:4}, at: del_gendisk+0x6d/0xa0
[  109.129891]  #4: ffff88813abea198 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0xa8/0x1c0
[  109.133149]
               stack backtrace:
[  109.134817] CPU: 6 UID: 0 PID: 2734 Comm: nvme Tainted: G                 N  6.17.0nvme+ #3 PREEMPT(voluntary)
[  109.134819] Tainted: [N]=TEST
[  109.134820] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[  109.134821] Call Trace:
[  109.134823]  <TASK>
[  109.134824]  dump_stack_lvl+0x75/0xb0
[  109.134828]  print_circular_bug+0x26a/0x330
[  109.134831]  check_noncircular+0x12f/0x150
[  109.134834]  __lock_acquire+0x1468/0x2210
[  109.134837]  ? __synchronize_srcu+0x17/0x170
[  109.134838]  lock_sync+0xa5/0x110
[  109.134840]  ? __synchronize_srcu+0x17/0x170
[  109.134842]  __synchronize_srcu+0x49/0x170
[  109.134843]  ? mark_held_locks+0x49/0x80
[  109.134845]  ? _raw_spin_unlock_irqrestore+0x2d/0x60
[  109.134847]  ? kvm_clock_get_cycles+0x14/0x30
[  109.134853]  ? ktime_get_mono_fast_ns+0x36/0xb0
[  109.134858]  elevator_switch+0xc9/0x330
[  109.134860]  elevator_change+0x128/0x1c0
[  109.134862]  ? kernfs_put.part.0+0x86/0x290
[  109.134864]  elevator_set_none+0x4c/0x90
[  109.134866]  blk_unregister_queue+0xa8/0x110
[  109.134868]  __del_gendisk+0x14e/0x3c0
[  109.134870]  del_gendisk+0x75/0xa0
[  109.134872]  nvme_ns_remove+0xf2/0x230 [nvme_core]
[  109.134879]  nvme_remove_namespaces+0xf2/0x150 [nvme_core]
[  109.134887]  nvme_do_delete_ctrl+0x71/0x90 [nvme_core]
[  109.134893]  nvme_delete_ctrl_sync+0x3b/0x50 [nvme_core]
[  109.134899]  nvme_sysfs_delete+0x34/0x40 [nvme_core]
[  109.134905]  kernfs_fop_write_iter+0x16a/0x220
[  109.134908]  vfs_write+0x378/0x520
[  109.134911]  ksys_write+0x67/0xe0
[  109.134913]  do_syscall_64+0x76/0xbb0
[  109.134915]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  109.134916] RIP: 0033:0x7fd68a737317
[  109.134917] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  109.134919] RSP: 002b:00007ffded1546d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  109.134920] RAX: ffffffffffffffda RBX: 000000000054f7e0 RCX: 00007fd68a737317
[  109.134921] RDX: 0000000000000001 RSI: 00007fd68a855719 RDI: 0000000000000003
[  109.134921] RBP: 0000000000000003 R08: 0000000030407850 R09: 00007fd68a7cd4e0
[  109.134922] R10: 00007fd68a65b130 R11: 0000000000000246 R12: 00007fd68a855719
[  109.134923] R13: 00000000304074c0 R14: 00000000304074c0 R15: 0000000030408660
[  109.134926]  </TASK>
[  109.962756] Key type psk unregistered

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-22 10:07:30 -07:00
Alistair Francis
5fc422951c nvmet-tcp: Don't clear tls_key when freeing sq
Curently after the host sends a REPLACETLSPSK we free the TLS keys as
part of calling nvmet_auth_sq_free() on success. This means when the
host sends a follow up REPLACETLSPSK we return CONCAT_MISMATCH as the
check for !nvmet_queue_tls_keyid(req->sq) fails.

A previous attempt to fix this involed not calling nvmet_auth_sq_free()
on successful connections, but that results in memory leaks. Instead we
should not clear `tls_key` in nvmet_auth_sq_free(), as that was
incorrectly wiping the tls keys which are used for the session.

This patch ensures we correctly free the ephemeral session key on
connection, yet we don't free the TLS key unless closing the connection.

Reviewed-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-22 10:07:30 -07:00
Alistair Francis
f920ebd03c Revert "nvmet-tcp: Don't free SQ on authentication success"
In an attempt to fix REPLACETLSPSK we stopped freeing the secrets on
successful connections. This resulted in memory leaks in the kernel, so
let's revert the commit. A improved fix is being developed to just avoid
clearing the tls_key variable.

This reverts commit 2e6eb6b277.

Closes: https://lore.kernel.org/linux-nvme/CAHj4cs-u3MWQR4idywptMfjEYi4YwObWFx4KVib35dZ5HMBDdw@mail.gmail.com
Reviewed-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-22 10:07:30 -07:00
Keith Busch
bddb911d28 nvme: skip trace completion for host path errors
The command was never dispatched for the driver's "host path error", so
the command was never actually initialized and there's no corresponding
submit trace for the completion.

Reported-by: Minsik Jeon <hmi.jeon@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-22 10:06:23 -07:00
Tao Jiang
cf92d78a4a nvme-pci: add quirk for Memblaze Pblaze5 (0x1c5f:0x0555)
The Memblaze Pblaze5 NVMe device (PCI ID 0x1c5f:0x0555)
is detected as a controller on recent kernels (tested on 5.15.85
and 6.8.4), but no namespace is exposed.

Tools like lsblk and fdisk do not report any block device.

dmesg shows:
  nvme nvme0: missing or invalid SUBNQN field.

The device works correctly on older kernels (e.g. 4.19), suggesting
a compatibility issue with newer namespace handling.

This indicates the device does not properly support the
Namespace Descriptor List feature.

Applying NVME_QUIRK_NO_NS_DESC_LIST allows the namespace to be
discovered correctly.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Tao Jiang <tanroame.kyle@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-16 15:11:14 -07:00
John Garry
3f150f0f01 nvme-multipath: put module reference when delayed removal work is canceled
The delayed disk removal work is canceled when a NS (re)appears. However,
we do not put the module reference grabbed in nvme_mpath_remove_disk(), so
fix that.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-16 15:08:38 -07:00
Daniel Wagner
20925812de nvme: expose TLS mode
It is not possible to determine the active TLS mode from the
presence or absence of sysfs attributes like tls_key,
tls_configured_key, or dhchap_secret.

With the introduction of the concat mode and optional DH-CHAP
authentication, different configurations can result in identical
sysfs state. This makes user space detection unreliable.

Expose the TLS mode explicitly to allow user space to
unambiguously identify the active configuration and avoid
fragile heuristics in nvme-cli.

Reviewed-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-16 15:05:14 -07:00
Fedor Pchelkin
ba9d308ccd nvme-apple: drop invalid put of admin queue reference count
Commit 03b3bcd319 ("nvme: fix admin request_queue lifetime") moved the
admin queue reference ->put call into nvme_free_ctrl() - a controller
device release callback performed for every nvme driver doing
nvme_init_ctrl().

nvme-apple sets refcount of the admin queue to 1 at allocation during the
probe function and then puts it twice now:

nvme_free_ctrl()
  blk_put_queue(ctrl->admin_q) // #1
  ->free_ctrl()
    apple_nvme_free_ctrl()
      blk_put_queue(anv->ctrl.admin_q) // #2

Note that there is a commit 941f7298c7 ("nvme-apple: remove an extra
queue reference") which intended to drop taking an extra admin queue
reference.  Looks like at that moment it accidentally fixed a refcount
leak, which existed since the driver's introduction.  There were two ->get
calls at driver's probe function and a single ->put inside
apple_nvme_free_ctrl().

However now after commit 03b3bcd319 ("nvme: fix admin request_queue
lifetime") the refcount is imbalanced again.  Fix it by removing extra
->put call from apple_nvme_free_ctrl().  anv->dev and ctrl->dev point to
the same device, so use ctrl->dev directly for simplification.  Compile
tested only.

Found by Linux Verification Center (linuxtesting.org).

Fixes: 03b3bcd319 ("nvme: fix admin request_queue lifetime")
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-16 15:05:06 -07:00
Flavio Suligoi
e80e39f255 nvme-core: fix parameter name in comment
In the declaration of the structure "core_quirks[]", in the comment
referred to the devices "Kioxia CD6-V Series / HPE PE8030", the
parameter "default_ps_max_latency_us" is reported in a wrong way:

nvme_core.default_ps_max_latency=0

The correct form is, instead:

nvme_core.default_ps_max_latency_us=0

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Flavio Suligoi <f.suligoi@asem.it>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-16 15:05:06 -07:00
Chaitanya Kulkarni
aade8abd8b nvmet: avoid recursive nvmet-wq flush in nvmet_ctrl_free
nvmet_tcp_release_queue_work() runs on nvmet-wq and can drop the
final controller reference through nvmet_cq_put(). If that triggers
nvmet_ctrl_free(), the teardown path flushes ctrl->async_event_work on
the same nvmet-wq.

Call chain:

 nvmet_tcp_schedule_release_queue()
   kref_put(&queue->kref, nvmet_tcp_release_queue)
     nvmet_tcp_release_queue()
       queue_work(nvmet_wq, &queue->release_work) <--- nvmet_wq
         process_one_work()
           nvmet_tcp_release_queue_work()
             nvmet_cq_put(&queue->nvme_cq)
               nvmet_cq_destroy()
                 nvmet_ctrl_put(cq->ctrl)
                   nvmet_ctrl_free()
                     flush_work(&ctrl->async_event_work) <--- nvmet_wq

                      Previously Scheduled by :-
		        nvmet_add_async_event
		          queue_work(nvmet_wq, &ctrl->async_event_work);

This trips lockdep with a possible recursive locking warning.

[ 5223.015876] run blktests nvme/003 at 2026-04-07 20:53:55
[ 5223.061801] loop0: detected capacity change from 0 to 2097152
[ 5223.072206] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 5223.088368] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 5223.126086] nvmet: Created discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 5223.128453] nvme nvme1: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 5233.199447] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"

[ 5233.227718] ============================================
[ 5233.231283] WARNING: possible recursive locking detected
[ 5233.234696] 7.0.0-rc3nvme+ #20 Tainted: G           O     N
[ 5233.238434] --------------------------------------------
[ 5233.241852] kworker/u192:6/2413 is trying to acquire lock:
[ 5233.245429] ffff888111632548 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x26/0x90
[ 5233.251438]
               but task is already holding lock:
[ 5233.255254] ffff888111632548 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x5cc/0x6e0
[ 5233.261125]
               other info that might help us debug this:
[ 5233.265333]  Possible unsafe locking scenario:

[ 5233.269217]        CPU0
[ 5233.270795]        ----
[ 5233.272436]   lock((wq_completion)nvmet-wq);
[ 5233.275241]   lock((wq_completion)nvmet-wq);
[ 5233.278020]
                *** DEADLOCK ***

[ 5233.281793]  May be due to missing lock nesting notation

[ 5233.286195] 3 locks held by kworker/u192:6/2413:
[ 5233.289192]  #0: ffff888111632548 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x5cc/0x6e0
[ 5233.294569]  #1: ffffc9000e2a7e40 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x1c5/0x6e0
[ 5233.300128]  #2: ffffffff82d7dc40 (rcu_read_lock){....}-{1:3}, at: __flush_work+0x62/0x530
[ 5233.304290]
               stack backtrace:
[ 5233.306520] CPU: 4 UID: 0 PID: 2413 Comm: kworker/u192:6 Tainted: G           O     N  7.0.0-rc3nvme+ #20 PREEMPT(full)
[ 5233.306524] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 5233.306525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[ 5233.306527] Workqueue: nvmet-wq nvmet_tcp_release_queue_work [nvmet_tcp]
[ 5233.306532] Call Trace:
[ 5233.306534]  <TASK>
[ 5233.306536]  dump_stack_lvl+0x73/0xb0
[ 5233.306552]  print_deadlock_bug+0x225/0x2f0
[ 5233.306556]  __lock_acquire+0x13f0/0x2290
[ 5233.306563]  lock_acquire+0xd0/0x300
[ 5233.306565]  ? touch_wq_lockdep_map+0x26/0x90
[ 5233.306571]  ? __flush_work+0x20b/0x530
[ 5233.306573]  ? touch_wq_lockdep_map+0x26/0x90
[ 5233.306577]  touch_wq_lockdep_map+0x3b/0x90
[ 5233.306580]  ? touch_wq_lockdep_map+0x26/0x90
[ 5233.306583]  ? __flush_work+0x20b/0x530
[ 5233.306585]  __flush_work+0x268/0x530
[ 5233.306588]  ? __pfx_wq_barrier_func+0x10/0x10
[ 5233.306594]  ? xen_error_entry+0x30/0x60
[ 5233.306600]  nvmet_ctrl_free+0x140/0x310 [nvmet]
[ 5233.306617]  nvmet_cq_put+0x74/0x90 [nvmet]
[ 5233.306629]  nvmet_tcp_release_queue_work+0x19f/0x360 [nvmet_tcp]
[ 5233.306634]  process_one_work+0x206/0x6e0
[ 5233.306640]  worker_thread+0x184/0x320
[ 5233.306643]  ? __pfx_worker_thread+0x10/0x10
[ 5233.306646]  kthread+0xf1/0x130
[ 5233.306648]  ? __pfx_kthread+0x10/0x10
[ 5233.306651]  ret_from_fork+0x355/0x450
[ 5233.306653]  ? __pfx_kthread+0x10/0x10
[ 5233.306656]  ret_from_fork_asm+0x1a/0x30
[ 5233.306664]  </TASK>

There is also no need to flush async_event_work from controller
teardown. The admin queue teardown already fails outstanding AER
requests before the final controller put :-

 nvmet_sq_destroy(admin sq)
    nvmet_async_events_failall(ctrl)

The controller has already been removed from the subsystem list before
nvmet_ctrl_free() quiesces outstanding work.

Replace flush_work() with cancel_work_sync() so a pending
async_event_work item is canceled and a running instance is waited on
without recursing into the same workqueue.

Fixes: 06406d81a2 ("nvmet: cancel fatal error and flush async work before free controller")
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-16 15:05:06 -07:00
John Garry
7d435caacd nvme-multipath: drop head pointer check in nvme_mpath_clear_current_path()
A NS will always have a head pointer, so drop the check. As proof in
practice, all the nvme_mpath_clear_current_path() callers also
dereference ns->head.

This check has endured since the original changes to support multipath.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-09 07:20:39 -07:00
Alan Cui
7f991e3f9b nvme: add quirk NVME_QUIRK_IGNORE_DEV_SUBNQN for 144d:a808 (Samsung PM981/983/970 EVO Plus )
The firmware for Samsung 970 Evo Plus / PM981 / PM983 does not support SUBNQN.
Make quirks to suppress warnings.

# nvme id-ctrl /dev/nvme1n1
NVME Identify Controller:
vid       : 0x144d
ssvid     : 0x144d
sn        : ***
mn        : Samsung SSD 970 EVO Plus 500GB
fr        : 2B2QEXM7

mcdqpc    : 0
subnqn    :
ioccsz    : 0

Signed-off-by: Alan Cui <me@alancui.cc>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-09 07:19:43 -07:00
Chaitanya Kulkarni
5293a8882c nvmet-tcp: fix race between ICReq handling and queue teardown
nvmet_tcp_handle_icreq() updates queue->state after sending an
Initialization Connection Response (ICResp), but it does so without
serializing against target-side queue teardown.

If an NVMe/TCP host sends an Initialization Connection Request
(ICReq) and immediately closes the connection, target-side teardown
may start in softirq context before io_work drains the already
buffered ICReq. In that case, nvmet_tcp_schedule_release_queue()
sets queue->state to NVMET_TCP_Q_DISCONNECTING and drops the queue
reference under state_lock.

If io_work later processes that ICReq, nvmet_tcp_handle_icreq() can
still overwrite the state back to NVMET_TCP_Q_LIVE. That defeats the
DISCONNECTING-state guard in nvmet_tcp_schedule_release_queue() and
allows a later socket state change to re-enter teardown and issue a
second kref_put() on an already released queue.

The ICResp send failure path has the same problem. If teardown has
already moved the queue to DISCONNECTING, a send error can still
overwrite the state with NVMET_TCP_Q_FAILED, again reopening the
window for a second teardown path to drop the queue reference.

Fix this by serializing both post-send state transitions with
state_lock and bailing out if teardown has already started.

Use -ESHUTDOWN as an internal sentinel for that bail-out path rather
than propagating it as a transport error like -ECONNRESET. Keep
nvmet_tcp_socket_error() setting rcv_state to NVMET_TCP_RECV_ERR before
honoring that sentinel so receive-side parsing stays quiesced until the
existing release path completes.

Fixes: c46a6465bac2 ("nvmet-tcp: add NVMe over TCP target driver")
Cc: stable@vger.kernel.org
Reported-by: Shivam Kumar <skumar47@syr.edu>
Tested-by: Shivam Kumar <kumar.shivam43666@gmail.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-09 07:19:32 -07:00
Maurizio Lombardi
bad44c9c31 nvmet-tcp: remove redundant calls to nvmet_tcp_fatal_error()
Executing nvmet_tcp_fatal_error() is generally the responsibility
of the caller (nvmet_tcp_try_recv); all other functions should
just return the error code.

Remove the nvmet_tcp_fatal_error() function, it's not needed
anymore.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-08 12:05:06 -07:00
Maurizio Lombardi
ea8e356acb nvmet-tcp: propagate nvmet_tcp_build_pdu_iovec() errors to its callers
Currently, when nvmet_tcp_build_pdu_iovec() detects an out-of-bounds
PDU length or offset, it triggers nvmet_tcp_fatal_error(cmd->queue)
and returns early. However, because the function returns void, the
callers are entirely unaware that a fatal error has occurred and
that the cmd->recv_msg.msg_iter was left uninitialized.

Callers such as nvmet_tcp_handle_h2c_data_pdu() proceed to blindly
overwrite the queue state with queue->rcv_state = NVMET_TCP_RECV_DATA
Consequently, the socket receiving loop may attempt to read incoming
network data into the uninitialized iterator.

Fix this by shifting the error handling responsibility to the callers.

Fixes: 52a0a98549 ("nvmet-tcp: add bounds checks in nvmet_tcp_build_pdu_iovec")
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yunje Shin <ioerts@kookmin.ac.kr>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-08 12:05:00 -07:00
Shivaji Kant
23528aa332 nvme: enable PCI P2PDMA support for RDMA transport
Enable BLK_FEAT_PCI_P2PDMA on the NVMe when the underlying
RDMA controller supports it.

Suggested-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Henrique Carvalho <henrique.carvalho@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shivaji Kant <shivajikant@google.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-08 08:25:51 -07:00
Aurelien Aptel
0a5a946486 nvmet: introduce new mdts configuration entry
Using this port configuration, one will be able to set the Maximum Data
Transfer Size (MDTS) for any controller that will be associated to the
configured port. The default value remains 0 (no limit).

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Aurelien Aptel <aaptel@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-08 08:13:46 -07:00
Geliang Tang
723277b15e nvme: add missing MODULE_ALIAS for fabrics transports
The generic fabrics layer uses request_module("nvme-%s", opts->transport)
to auto-load transport modules. Currently, the nvme-tcp, nvme-rdma, and
nvme-fc modules lack MODULE_ALIAS entries for these names, which prevents
the kernel from automatically finding and loading them when requested.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-08 08:12:45 -07:00
Shivam Kumar
4606467a75 nvmet-tcp: check INIT_FAILED before nvmet_req_uninit in digest error path
In nvmet_tcp_try_recv_ddgst(), when a data digest mismatch is detected,
nvmet_req_uninit() is called unconditionally. However, if the command
arrived via the nvmet_tcp_handle_req_failure() path, nvmet_req_init()
had returned false and percpu_ref_tryget_live() was never executed. The
unconditional percpu_ref_put() inside nvmet_req_uninit() then causes a
refcount underflow, leading to a WARNING in
percpu_ref_switch_to_atomic_rcu, a use-after-free diagnostic, and
eventually a permanent workqueue deadlock.

Check cmd->flags & NVMET_TCP_F_INIT_FAILED before calling
nvmet_req_uninit(), matching the existing pattern in
nvmet_tcp_execute_request().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shivam Kumar <kumar.shivam43666@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-07 07:25:50 -07:00
Yuto Ohnuki
e9b004ff83 blk-wbt: remove WARN_ON_ONCE from wbt_init_enable_default()
wbt_init_enable_default() uses WARN_ON_ONCE to check for failures from
wbt_alloc() and wbt_init(). However, both are expected failure paths:

- wbt_alloc() can return NULL under memory pressure (-ENOMEM)
- wbt_init() can fail with -EBUSY if wbt is already registered

syzbot triggers this by injecting memory allocation failures during MTD
partition creation via ioctl(BLKPG), causing a spurious warning.

wbt_init_enable_default() is a best-effort initialization called from
blk_register_queue() with a void return type. Failure simply means the
disk operates without writeback throttling, which is harmless.

Replace WARN_ON_ONCE with plain if-checks, consistent with how
wbt_set_lat() in the same file already handles these failures. Add a
pr_warn() for the wbt_init() failure to retain diagnostic information
without triggering a full stack trace.

Reported-by: syzbot+71fcf20f7c1e5043d78c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=71fcf20f7c1e5043d78c
Fixes: 41afaeeda5 ("blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter")
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://patch.msgid.link/20260316070358.65225-2-ytohnuki@amazon.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-06 12:54:03 -06:00
Uday Shankar
320f9b1c6a selftests: ublk: test that teardown after incomplete recovery completes
Before the fix, teardown of a ublk server that was attempting to recover
a device, but died when it had submitted a nonempty proper subset of the
fetch commands to any queue would loop forever. Add a test to verify
that, after the fix, teardown completes. This is done by:

- Adding a new argument to the fault_inject target that causes it die
  after fetching a nonempty proper subset of the IOs to a queue
- Using that argument in a new test while trying to recover an
  already-created device
- Attempting to delete the ublk device at the end of the test; this
  hangs forever if teardown from the fault-injected ublk server never
  completed.

It was manually verified that the test passes with the fix and hangs
without it.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260405-cancel-v2-2-02d711e643c2@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-06 08:37:48 -06:00
Uday Shankar
0842186d2c ublk: reset per-IO canceled flag on each fetch
If a ublk server starts recovering devices but dies before issuing fetch
commands for all IOs, cancellation of the fetch commands that were
successfully issued may never complete. This is because the per-IO
canceled flag can remain set even after the fetch for that IO has been
submitted - the per-IO canceled flags for all IOs in a queue are reset
together only once all IOs for that queue have been fetched. So if a
nonempty proper subset of the IOs for a queue are fetched when the ublk
server dies, the IOs in that subset will never successfully be canceled,
as their canceled flags remain set, and this prevents ublk_cancel_cmd
from actually calling io_uring_cmd_done on the commands, despite the
fact that they are outstanding.

Fix this by resetting the per-IO cancel flags immediately when each IO
is fetched instead of waiting for all IOs for the queue (which may never
happen).

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Fixes: 728cbac5fe ("ublk: move device reset into ublk_ch_release()")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: zhang, the-essence-of-life <zhangweize9@gmail.com>
Link: https://patch.msgid.link/20260405-cancel-v2-1-02d711e643c2@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-06 08:37:48 -06:00
Caleb Sander Mateos
8b155f2e4a block: remove unused BVEC_ITER_ALL_INIT
This macro no longer has any users, so remove it.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://patch.msgid.link/20260403184852.2140919-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-04 08:10:37 -06:00
Mingzhe Zou
20a8e451ec bcache: fix uninitialized closure object
In the previous patch ("bcache: fix cached_dev.sb_bio use-after-free and
crash"), we adopted a simple modification suggestion from AI to fix the
use-after-free.

But in actual testing, we found an extreme case where the device is
stopped before calling bch_write_bdev_super().

At this point, struct closure sb_write has not been initialized yet.
For this patch, we ensure that sb_bio has been completed via
sb_write_mutex.

Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@fnnas.com>
Link: https://patch.msgid.link/20260403042135.2221247-1-colyli@fnnas.com
Fixes: fec114a98b ("bcache: fix cached_dev.sb_bio use-after-free and crash")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-03 05:11:08 -06:00
Mingzhe Zou
fec114a98b bcache: fix cached_dev.sb_bio use-after-free and crash
In our production environment, we have received multiple crash reports
regarding libceph, which have caught our attention:

```
[6888366.280350] Call Trace:
[6888366.280452]  blk_update_request+0x14e/0x370
[6888366.280561]  blk_mq_end_request+0x1a/0x130
[6888366.280671]  rbd_img_handle_request+0x1a0/0x1b0 [rbd]
[6888366.280792]  rbd_obj_handle_request+0x32/0x40 [rbd]
[6888366.280903]  __complete_request+0x22/0x70 [libceph]
[6888366.281032]  osd_dispatch+0x15e/0xb40 [libceph]
[6888366.281164]  ? inet_recvmsg+0x5b/0xd0
[6888366.281272]  ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
[6888366.281405]  ceph_con_process_message+0x79/0x140 [libceph]
[6888366.281534]  ceph_con_v1_try_read+0x5d7/0xf30 [libceph]
[6888366.281661]  ceph_con_workfn+0x329/0x680 [libceph]
```

After analyzing the coredump file, we found that the address of
dc->sb_bio has been freed. We know that cached_dev is only freed when it
is stopped.

Since sb_bio is a part of struct cached_dev, rather than an alloc every
time.  If the device is stopped while writing to the superblock, the
released address will be accessed at endio.

This patch hopes to wait for sb_write to complete in cached_dev_free.

It should be noted that we analyzed the cause of the problem, then tell
all details to the QWEN and adopted the modifications it made.

Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Fixes: cafe563591 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Coly Li <colyli@fnnas.com>
Link: https://patch.msgid.link/20260322134102.480107-1-colyli@fnnas.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-03 05:08:15 -06:00
Thorsten Blum
a175ee8273 block: use sysfs_emit in sysfs show functions
Replace sprintf() with sysfs_emit() in sysfs show functions.
sysfs_emit() is preferred for formatting sysfs output because it
provides safer bounds checking.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260402164958.894879-4-thorsten.blum@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-03 05:07:46 -06:00
Christoph Hellwig
4e56428ed4 blk-crypto: fix name of the bio completion callback
Fix a simple naming issue in the documentation: the completion
routine is called bi_end_io and not bi_complete.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20260401135854.125109-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02 07:09:03 -06:00
Ming Lei
c691e4b0d8 bio: fix kmemleak false positives from percpu bio alloc cache
When a bio is allocated from the mempool with REQ_ALLOC_CACHE set and
later completed, bio_put() places it into the per-cpu bio_alloc_cache
via bio_put_percpu_cache() instead of freeing it back to the
mempool/slab. The slab allocation remains tracked by kmemleak, but the
only reference to the bio is through the percpu cache's free_list,
which kmemleak fails to trace through percpu memory. This causes
kmemleak to report the cached bios as unreferenced objects.

Use symmetric kmemleak_free()/kmemleak_alloc() calls to properly track
bios across percpu cache transitions:

 - bio_put_percpu_cache: call kmemleak_free() when a bio enters the
   cache, unregistering it from kmemleak tracking.

 - bio_alloc_percpu_cache: call kmemleak_alloc() when a bio is taken
   from the cache for reuse, re-registering it so that genuine leaks
   of reused bios remain detectable.

 - __bio_alloc_cache_prune: call kmemleak_alloc() before bio_free() so
   that kmem_cache_free()'s internal kmemleak_free() has a matching
   allocation to pair with.

Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260326144058.2392319-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02 07:07:03 -06:00
Jialin Wang
f91ffe89b2 blk-iocost: fix busy_level reset when no IOs complete
When a disk is saturated, it is common for no IOs to complete within a
timer period. Currently, in this case, rq_wait_pct and missed_ppm are
calculated as 0, the iocost incorrectly interprets this as meeting QoS
targets and resets busy_level to 0.

This reset prevents busy_level from reaching the threshold (4) needed
to reduce vrate. On certain cloud storage, such as Azure Premium SSD,
we observed that iocost may fail to reduce vrate for tens of seconds
during saturation, failing to mitigate noisy neighbor issues.

Fix this by tracking the number of IO completions (nr_done) in a period.
If nr_done is 0 and there are lagging IOs, the saturation status is
unknown, so we keep busy_level unchanged.

The issue is consistently reproducible on Azure Standard_D8as_v5 (Dasv5)
VMs with 512GB Premium SSD (P20) using the script below. It was not
observed on GCP n2d VMs (with 100G pd-ssd and 1.5T local-ssd), and no
regressions were found with this patch. In this script, cgA performs
large IOs with iodepth=128, while cgB performs small IOs with iodepth=1
rate_iops=100 rw=randrw. With iocost enabled, we expect it to throttle
cgA, the submission latency (slat) of cgA should be significantly higher,
cgB can reach 200 IOPS and the completion latency (clat) should below.

  BLK_DEVID="8:0"
  MODEL="rbps=173471131 rseqiops=3566 rrandiops=3566 wbps=173333269 wseqiops=3566 wrandiops=3566"
  QOS="rpct=90 rlat=3500 wpct=90 wlat=3500 min=80 max=10000"

  echo "$BLK_DEVID ctrl=user model=linear $MODEL" > /sys/fs/cgroup/io.cost.model
  echo "$BLK_DEVID enable=1 ctrl=user $QOS" > /sys/fs/cgroup/io.cost.qos

  CG_A="/sys/fs/cgroup/cgA"
  CG_B="/sys/fs/cgroup/cgB"

  FILE_A="/path/to/sda/A.fio.testfile"
  FILE_B="/path/to/sda/B.fio.testfile"
  RESULT_DIR="./iocost_results_$(date +%Y%m%d_%H%M%S)"

  mkdir -p "$CG_A" "$CG_B" "$RESULT_DIR"

  get_result() {
    local file=$1
    local label=$2

    local results=$(jq -r '
    .jobs[0].mixed |
    ( .iops | tonumber | round ) as $iops |
    ( .bw_bytes / 1024 / 1024 ) as $bps |
    ( .slat_ns.mean / 1000000 ) as $slat |
    ( .clat_ns.mean / 1000000 ) as $avg |
    ( .clat_ns.max / 1000000 ) as $max |
    ( .clat_ns.percentile["90.000000"] / 1000000 ) as $p90 |
    ( .clat_ns.percentile["99.000000"] / 1000000 ) as $p99 |
    ( .clat_ns.percentile["99.900000"] / 1000000 ) as $p999 |
    ( .clat_ns.percentile["99.990000"] / 1000000 ) as $p9999 |
    "\($iops)|\($bps)|\($slat)|\($avg)|\($max)|\($p90)|\($p99)|\($p999)|\($p9999)"
    ' "$file")

    IFS='|' read -r iops bps slat avg max p90 p99 p999 p9999 <<<"$results"
    printf "%-8s %-6s %-7.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f\n" \
           "$label" "$iops" "$bps" "$slat" "$avg" "$max" "$p90" "$p99" "$p999" "$p9999"
  }

  run_fio() {
    local cg_path=$1
    local filename=$2
    local name=$3
    local bs=$4
    local qd=$5
    local out=$6
    shift 6
    local extra=$@

    (
      pid=$(sh -c 'echo $PPID')
      echo $pid >"${cg_path}/cgroup.procs"
      fio --name="$name" --filename="$filename" --direct=1 --rw=randrw --rwmixread=50 \
          --ioengine=libaio --bs="$bs" --iodepth="$qd" --size=4G --runtime=10 \
          --time_based --group_reporting --unified_rw_reporting=mixed \
          --output-format=json --output="$out" $extra >/dev/null 2>&1
    ) &
  }

  echo "Starting Test ..."

  for bs_b in "4k" "32k" "256k"; do
    echo "Running iteration: BS=$bs_b"
    out_a="${RESULT_DIR}/cgA_1m.json"
    out_b="${RESULT_DIR}/cgB_${bs_b}.json"

    # cgA: Heavy background (BS 1MB, QD 128)
    run_fio "$CG_A" "$FILE_A" "cgA" "1m" 128 "$out_a"
    # cgB: Latency sensitive (Variable BS, QD 1, Read/Write IOPS limit 100)
    run_fio "$CG_B" "$FILE_B" "cgB" "$bs_b" 1 "$out_b" "--rate_iops=100"

    wait
    SUMMARY_DATA+="$(get_result "$out_a" "cgA-1m")"$'\n'
    SUMMARY_DATA+="$(get_result "$out_b" "cgB-$bs_b")"$'\n\n'
  done

  echo -e "\nFinal Results Summary:\n"

  printf "%-8s %-6s %-7s %-8s %-8s %-8s %-8s %-8s %-8s %-8s\n" \
          "" "" "" "slat" "clat" "clat" "clat" "clat" "clat" "clat"
  printf "%-8s %-6s %-7s %-8s %-8s %-8s %-8s %-8s %-8s %-8s\n\n" \
          "CGROUP" "IOPS" "MB/s" "avg(ms)" "avg(ms)" "max(ms)" "P90(ms)" "P99" "P99.9" "P99.99"
  echo "$SUMMARY_DATA"

  echo "Results saved in $RESULT_DIR"

Before:
                          slat     clat     clat     clat     clat     clat     clat
  CGROUP   IOPS   MB/s    avg(ms)  avg(ms)  max(ms)  P90(ms)  P99      P99.9    P99.99

  cgA-1m   166    166.37  3.44     748.95   1298.29  977.27   1233.13  1300.23  1300.23
  cgB-4k   5      0.02    0.02     181.74   761.32   742.39   759.17   759.17   759.17

  cgA-1m   167    166.51  1.98     748.68   1549.41  809.50   1451.23  1551.89  1551.89
  cgB-32k  6      0.18    0.02     169.98   761.76   742.39   759.17   759.17   759.17

  cgA-1m   166    165.55  2.89     750.89   1540.37  851.44   1451.23  1535.12  1535.12
  cgB-256k 5      1.30    0.02     191.35   759.51   750.78   759.17   759.17   759.17

After:
                          slat     clat     clat     clat     clat     clat     clat
  CGROUP   IOPS   MB/s    avg(ms)  avg(ms)  max(ms)  P90(ms)  P99      P99.9    P99.99

  cgA-1m   162    162.48  6.14     749.69   850.02   826.28   834.67   843.06   851.44
  cgB-4k   199    0.78    0.01     1.95     42.12    2.57     7.50     34.87    42.21

  cgA-1m   146    146.20  6.83     833.04   908.68   893.39   901.78   910.16   910.16
  cgB-32k  200    6.25    0.01     2.32     31.40    3.06     7.50     16.58    31.33

  cgA-1m   110    110.46  9.04     1082.67  1197.91  1182.79  1199.57  1199.57  1199.57
  cgB-256k 200    49.98   0.02     3.69     22.20    4.88     9.11     20.05    22.15

Signed-off-by: Jialin Wang <wjl.linux@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://patch.msgid.link/20260331100509.182882-1-wjl.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-31 13:56:38 -06:00
Jackie Liu
23308af722 blk-cgroup: fix disk reference leak in blkcg_maybe_throttle_current()
Add the missing put_disk() on the error path in
blkcg_maybe_throttle_current(). When blkcg lookup, blkg lookup, or
blkg_tryget() fails, the function jumps to the out label which only
calls rcu_read_unlock() but does not release the disk reference acquired
by blkcg_schedule_throttle() via get_device(). Since current->throttle_disk
is already set to NULL before the lookup, blkcg_exit() cannot release
this reference either, causing the disk to never be freed.

Restore the reference release that was present as blk_put_queue() in the
original code but was inadvertently dropped during the conversion from
request_queue to gendisk.

Fixes: f05837ed73 ("blk-cgroup: store a gendisk to throttle in struct task_struct")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260331085054.46857-1-liu.yun@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-31 13:55:41 -06:00
Damien Le Moal
b2a78fec34 zloop: add max_open_zones option
Introduce the new max_open_zones option to allow specifying a limit on
the maximum number of open zones of a zloop device. This change allows
creating a zloop device that can more closely mimick the characteristics
of a physical SMR drive.

When set to a non zero value, only up to max_open_zones zones can be in
the implicit open (BLK_ZONE_COND_IMP_OPEN) and explicit open
(BLK_ZONE_COND_EXP_OPEN) conditions at any time. The transition to the
implicit open condition of a zone on a write operation can result in an
implicit close of an already implicitly open zone. This is handled in
the function zloop_do_open_zone(). This function also handles
transitions to the explicit open condition. Implicit close transitions
are handled using an LRU ordered list of open zones which is managed
using the helper functions zloop_lru_rotate_open_zone() and
zloop_lru_remove_open_zone().

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260326203245.946830-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-31 08:33:28 -06:00
Jackie Liu
2a2f520fda block: fix zones_cond memory leak on zone revalidation error paths
When blk_revalidate_disk_zones() fails after disk_revalidate_zone_resources()
has allocated args.zones_cond, the memory is leaked because no error path
frees it.

Fixes: 6e945ffb65 ("block: use zone condition to determine conventional zones")
Suggested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Link: https://patch.msgid.link/20260331111216.24242-1-liu.yun@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-31 07:05:49 -06:00
Daan De Meyer
267ec4d722 loop: fix partition scan race between udev and loop_reread_partitions()
When LOOP_CONFIGURE is called with LO_FLAGS_PARTSCAN, the following
sequence occurs:

  1. disk_force_media_change() sets GD_NEED_PART_SCAN
  2. Uevent suppression is lifted and a KOBJ_CHANGE uevent is sent
  3. loop_global_unlock() releases the lock
  4. loop_reread_partitions() calls bdev_disk_changed() to scan

There is a race between steps 2 and 4: when udev receives the uevent
and opens the device before loop_reread_partitions() runs,
blkdev_get_whole() in bdev.c sees GD_NEED_PART_SCAN set and calls
bdev_disk_changed() for a first scan. Then loop_reread_partitions()
does a second scan. The open_mutex serializes these two scans, but
does not prevent both from running.

The second scan in bdev_disk_changed() drops all partition devices
from the first scan (via blk_drop_partitions()) before re-adding
them, causing partition block devices to briefly disappear. This
breaks any systemd unit with BindsTo= on the partition device: systemd
observes the device going dead, fails the dependent units, and does
not retry them when the device reappears.

Fix this by removing the GD_NEED_PART_SCAN set from
disk_force_media_change() entirely. None of the current callers need
the lazy on-open partition scan triggered by this flag:

  - floppy: sets GENHD_FL_NO_PART, so disk_has_partscan() is always
    false and GD_NEED_PART_SCAN has no effect.
  - loop (loop_configure, loop_change_fd): when LO_FLAGS_PARTSCAN is
    set, loop_reread_partitions() performs an explicit scan. When not
    set, GD_SUPPRESS_PART_SCAN prevents the lazy scan path.
  - loop (__loop_clr_fd): calls bdev_disk_changed() explicitly if
    LO_FLAGS_PARTSCAN is set.
  - nbd (nbd_clear_sock_ioctl): capacity is set to zero immediately
    after; nbd manages GD_NEED_PART_SCAN explicitly elsewhere.

With GD_NEED_PART_SCAN no longer set by disk_force_media_change(),
udev opening the loop device after the uevent no longer triggers a
redundant scan in blkdev_get_whole(), and only the single explicit
scan from loop_reread_partitions() runs.

A regression test for this bug has been submitted to blktests:
https://github.com/linux-blktests/blktests/pull/240.

Fixes: 9f65c489b6 ("loop: raise media_change event")
Signed-off-by: Daan De Meyer <daan@amutable.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://patch.msgid.link/20260331105130.1077599-1-daan@amutable.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-31 07:04:34 -06:00
Milan Broz
499d2d2f4c sed-opal: Add STACK_RESET command
The TCG Opal device could enter a state where no new session can be
created, blocking even Discovery or PSID reset. While a power cycle
or waiting for the timeout should work, there is another possibility
for recovery: using the Stack Reset command.

The Stack Reset command is defined in the TCG Storage Architecture Core
Specification and is mandatory for all Opal devices (see Section 3.3.6
of the Opal SSC specification).

This patch implements the Stack Reset command. Sending it should clear
all active sessions immediately, allowing subsequent commands to run
successfully. While it is a TCG transport layer command, the Linux
kernel implements only Opal ioctls, so it makes sense to use the
IOC_OPAL ioctl interface.

The Stack Reset takes no arguments; the response can be success or pending.
If the command reports a pending state, userspace can try to repeat it;
in this case, the code returns -EBUSY.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Ondrej Kozina <okozina@redhat.com>
Link: https://patch.msgid.link/20260310095349.411287-1-gmazyland@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-31 07:04:00 -06:00
Jens Axboe
75e75445b9 Merge tag 'nvme-7.1-2026-03-27' of git://git.infradead.org/nvme into for-7.1/block
Pull NVMe updates from Keith:

"- Fabrics authentication updates (Eric, Alistar)
 - Enanced block queue limits support (Caleb)
 - Workqueue usage updates (Marco)
 - A new write zeroes device quirk (Robert)
 - Tagset cleanup fix for loop device (Nilay)"

* tag 'nvme-7.1-2026-03-27' of git://git.infradead.org/nvme: (41 commits)
  nvme-loop: do not cancel I/O and admin tagset during ctrl reset/shutdown
  nvme: add WQ_PERCPU to alloc_workqueue users
  nvmet-fc: add WQ_PERCPU to alloc_workqueue users
  nvmet: replace use of system_wq with system_percpu_wq
  nvme-auth: Don't propose NVME_AUTH_DHGROUP_NULL with SC_C
  nvme: Add the DHCHAP maximum HD IDs
  nvme-pci: add NVME_QUIRK_DISABLE_WRITE_ZEROES for Kingston OM3SGP4
  nvme: respect NVME_QUIRK_DISABLE_WRITE_ZEROES when wzsl is set
  nvmet: report NPDGL and NPDAL
  nvmet: use NVME_NS_FEAT_OPTPERF_SHIFT
  nvme: set discard_granularity from NPDG/NPDA
  nvme: add from0based() helper
  nvme: always issue I/O Command Set specific Identify Namespace
  nvme: update nvme_id_ns OPTPERF constants
  nvme: fold nvme_config_discard() into nvme_update_disk_info()
  nvme: add preferred I/O size fields to struct nvme_id_ns_nvm
  nvme: Allow reauth from sysfs
  nvme: Expose the tls_configured sysfs for secure concat connections
  nvmet-tcp: Don't free SQ on authentication success
  nvmet-tcp: Don't error if TLS is enabed on a reset
  ...
2026-03-27 09:51:17 -06:00
Nilay Shroff
886f352015 nvme-loop: do not cancel I/O and admin tagset during ctrl reset/shutdown
Cancelling the I/O and admin tagsets during nvme-loop controller reset
or shutdown is unnecessary. The subsequent destruction of the I/O and
admin queues already waits for all in-flight target operations to
complete.

Cancelling the tagsets first also opens a race window. After a request
tag has been cancelled, a late completion from the target may still
arrive before the queues are destroyed. In that case the completion path
may access a request whose tag has already been cancelled or freed,
which can lead to a kernel crash. Please see below the kernel crash
encountered while running blktests nvme/040:

run blktests nvme/040 at 2026-03-08 06:34:27
loop0: detected capacity change from 0 to 2097152
nvmet: adding nsid 1 to subsystem blktests-subsystem-1
nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
nvme nvme6: creating 96 I/O queues.
nvme nvme6: new ctrl: "blktests-subsystem-1"
nvme_log_error: 1 callbacks suppressed
block nvme6n1: no usable path - requeuing I/O
nvme6c6n1: Read(0x2) @ LBA 2096384, 128 blocks, Host Aborted Command (sct 0x3 / sc 0x71)
blk_print_req_error: 1 callbacks suppressed
I/O error, dev nvme6c6n1, sector 2096384 op 0x0:(READ) flags 0x2880700 phys_seg 1 prio class 2
block nvme6n1: no usable path - requeuing I/O
Kernel attempted to read user page (236) - exploit attempt? (uid: 0)
BUG: Kernel NULL pointer dereference on read at 0x00000236
Faulting instruction address: 0xc000000000961274
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix  SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: nvme_loop nvme_fabrics loop nvmet null_blk rpadlpar_io rpaphp xsk_diag bonding rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink pseries_rng dax_pmem vmx_crypto drm drm_panel_orientation_quirks xfs mlx5_core nvme bnx2x sd_mod nd_pmem nd_btt nvme_core sg papr_scm tls libnvdimm ibmvscsi ibmveth scsi_transport_srp nvme_keyring nvme_auth mdio hkdf pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: loop]
CPU: 25 UID: 0 PID: 0 Comm: swapper/25 Kdump: loaded Not tainted 7.0.0-rc3+ #14 PREEMPT
Hardware name: IBM,9043-MRX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1120.00 (RF1120_128) hv:phyp pSeries
NIP:  c000000000961274 LR: c008000009af1808 CTR: c00000000096124c
REGS: c0000007ffc0f910 TRAP: 0300   Not tainted  (7.0.0-rc3+)
MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 22222222  XER: 00000000
CFAR: c008000009af232c DAR: 0000000000000236 DSISR: 40000000 IRQMASK: 0
GPR00: c008000009af17fc c0000007ffc0fbb0 c000000001c78100 c0000000be05cc00
GPR04: 0000000000000001 0000000000000000 0000000000000007 0000000000000000
GPR08: 0000000000000000 0000000000000000 0000000000000002 c008000009af2318
GPR12: c00000000096124c c0000007ffdab880 0000000000000000 0000000000000000
GPR16: 0000000000000010 0000000000000000 0000000000000004 0000000000000000
GPR20: 0000000000000001 c000000002ca2b00 0000000100043bb2 000000000000000a
GPR24: 000000000000000a 0000000000000000 0000000000000000 0000000000000000
GPR28: c000000084021d40 c000000084021d50 c0000000be05cd60 c0000000be05cc00
NIP [c000000000961274] blk_mq_complete_request_remote+0x28/0x2d4
LR [c008000009af1808] nvme_loop_queue_response+0x110/0x290 [nvme_loop]
Call Trace:
 0xc00000000502c640 (unreliable)
 nvme_loop_queue_response+0x104/0x290 [nvme_loop]
 __nvmet_req_complete+0x80/0x498 [nvmet]
 nvmet_req_complete+0x24/0xf8 [nvmet]
 nvmet_bio_done+0x58/0xcc [nvmet]
 bio_endio+0x250/0x390
 blk_update_request+0x2e8/0x68c
 blk_mq_end_request+0x30/0x5c
 lo_complete_rq+0x94/0x110 [loop]
 blk_complete_reqs+0x78/0x98
 handle_softirqs+0x148/0x454
 do_softirq_own_stack+0x3c/0x50
 __irq_exit_rcu+0x18c/0x1b4
 irq_exit+0x1c/0x34
 do_IRQ+0x114/0x278
 hardware_interrupt_common_virt+0x28c/0x290

Since the queue teardown path already guarantees that all target-side
operations have completed, cancelling the tagsets is redundant and
unsafe. So avoid cancelling the I/O and admin tagsets during controller
reset and shutdown.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:06 -07:00
Marco Crivellari
e8e1a4c0fb nvme: add WQ_PERCPU to alloc_workqueue users
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

   commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
   commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.

In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.

Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:06 -07:00
Marco Crivellari
12f5fb5ee1 nvmet-fc: add WQ_PERCPU to alloc_workqueue users
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

   commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
   commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.

In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.

Cc: Justin Tee <justin.tee@broadcom.com>
Cc: Naresh Gottumukkala <nareshgottumukkala83@gmail.com>
CC: Paul Ely <paul.ely@broadcom.com>
Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:05 -07:00
Marco Crivellari
3d553be6d2 nvmet: replace use of system_wq with system_percpu_wq
This patch continues the effort to refactor workqueue APIs, which has begun
with the changes introducing new workqueues and a new alloc_workqueue flag:

   commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
   commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

The point of the refactoring is to eventually alter the default behavior of
workqueues to become unbound by default so that their workload placement is
optimized by the scheduler.

Before that to happen, workqueue users must be converted to the better named
new workqueues with no intended behaviour changes:

   system_wq -> system_percpu_wq
   system_unbound_wq -> system_dfl_wq

This way the old obsolete workqueues (system_wq, system_unbound_wq) can be
removed in the future.

Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:05 -07:00
Alistair Francis
33eb451044 nvme-auth: Don't propose NVME_AUTH_DHGROUP_NULL with SC_C
Section 8.3.4.5.2 of the NVMe 2.1 base spec states that

"""
The 00h identifier shall not be proposed in an AUTH_Negotiate message
that requests secure channel concatenation (i.e., with the SC_C field
set to a non-zero value).
"""

We need to ensure that we don't set the NVME_AUTH_DHGROUP_NULL idlist if
SC_C is set.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kamaljit Singh <kamaljit.singh@opensource.wdc.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:05 -07:00
Alistair Francis
09e8f0f934 nvme: Add the DHCHAP maximum HD IDs
In preperation for using DHCHAP length in upcoming host and target
patches let's add the hash and diffie-hellman ID length macros.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yunje Shin <ioerts@kookmin.ac.kr>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chris Leech <cleech@redhat.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:05 -07:00
Robert Beckett
a8eebf9699 nvme-pci: add NVME_QUIRK_DISABLE_WRITE_ZEROES for Kingston OM3SGP4
The Kingston OM3SGP42048K2-A00 (PCI ID 2646:502f) firmware has a race
condition when processing concurrent write zeroes and DSM (discard)
commands, causing spurious "LBA Out of Range" errors and IOMMU page
faults at address 0x0.

The issue is reliably triggered by running two concurrent mkfs commands
on different partitions of the same drive, which generates interleaved
write zeroes and discard operations.

Disable write zeroes for this device, matching the pattern used for
other Kingston OM* drives that have similar firmware issues.

Cc: stable@vger.kernel.org
Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Assisted-by: claude-opus-4-6-v1
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:05 -07:00
Robert Beckett
40f0496b61 nvme: respect NVME_QUIRK_DISABLE_WRITE_ZEROES when wzsl is set
The NVM Command Set Identify Controller data may report a non-zero
Write Zeroes Size Limit (wzsl). When present, nvme_init_non_mdts_limits()
unconditionally overrides max_zeroes_sectors from wzsl, even if
NVME_QUIRK_DISABLE_WRITE_ZEROES previously set it to zero.

This effectively re-enables write zeroes for devices that need it
disabled, defeating the quirk. Several Kingston OM* drives rely on
this quirk to avoid firmware issues with write zeroes commands.

Check for the quirk before applying the wzsl override.

Fixes: 5befc7c26e ("nvme: implement non-mdts command limits")
Cc: stable@vger.kernel.org
Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Assisted-by: claude-opus-4-6-v1
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:05 -07:00
Caleb Sander Mateos
c4cfe8c328 nvmet: report NPDGL and NPDAL
A block device with a very large discard_granularity queue limit may not
be able to report it in the 16-bit NPDG and NPDA fields in the Identify
Namespace data structure. For this reason, version 2.1 of the NVMe specs
added 32-bit fields NPDGL and NPDAL to the NVM Command Set Specific
Identify Namespace structure. So report the discard_granularity there
too and set OPTPERF to 11b to indicate those fields are supported.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:05 -07:00
Caleb Sander Mateos
e0d56e7055 nvmet: use NVME_NS_FEAT_OPTPERF_SHIFT
Use the NVME_NS_FEAT_OPTPERF_SHIFT constant in nvmet_bdev_set_limits()
to set the OPTPERF bits of the nvme_id_ns NSFEAT field instead of the
magic number 4.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:04 -07:00
Caleb Sander Mateos
1029298da3 nvme: set discard_granularity from NPDG/NPDA
Currently, nvme_config_discard() always sets the discard_granularity
queue limit to the logical block size. However, NVMe namespaces can
advertise a larger preferred discard granularity in the NPDG or NPDA
field of the Identify Namespace structure or the NPDGL or NPDAL fields
of the I/O Command Set Specific Identify Namespace structure.

Use these fields to compute the discard_granularity limit. The logic is
somewhat involved. First, the fields are optional. NPDG is only reported
if the low bit of OPTPERF is set in NSFEAT. NPDA is reported if any bit
of OPTPERF is set. And NPDGL and NPDAL are reported if the high bit of
OPTPERF is set. NPDGL and NPDAL can also each be set to 0 to opt out of
reporting a limit. I/O Command Set Specific Identify Namespace may also
not be supported by older NVMe controllers. Another complication is that
multiple values may be reported among NPDG, NPDGL, NPDA, and NPDAL. The
spec says to prefer the values reported in the L variants. The spec says
NPDG should be a multiple of NPDA and NPDGL should be a multiple of
NPDAL, but it doesn't specify a relationship between NPDG and NPDAL or
NPDGL and NPDA. So use the maximum of the reported NPDG(L) and NPDA(L)
values as the discard_granularity.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:04 -07:00
Caleb Sander Mateos
b465046c8c nvme: add from0based() helper
The NVMe specifications are big fans of "0's based"/"0-based" fields for
encoding values that must be positive. The encoded value is 1 less than
the value it represents. nvmet already provides a helper to0based() for
encoding 0's based values, so add a corresponding helper to decode these
fields on the host side.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:04 -07:00