Commit Graph

1427967 Commits

Author SHA1 Message Date
Nilay Shroff
886f352015 nvme-loop: do not cancel I/O and admin tagset during ctrl reset/shutdown
Cancelling the I/O and admin tagsets during nvme-loop controller reset
or shutdown is unnecessary. The subsequent destruction of the I/O and
admin queues already waits for all in-flight target operations to
complete.

Cancelling the tagsets first also opens a race window. After a request
tag has been cancelled, a late completion from the target may still
arrive before the queues are destroyed. In that case the completion path
may access a request whose tag has already been cancelled or freed,
which can lead to a kernel crash. Please see below the kernel crash
encountered while running blktests nvme/040:

run blktests nvme/040 at 2026-03-08 06:34:27
loop0: detected capacity change from 0 to 2097152
nvmet: adding nsid 1 to subsystem blktests-subsystem-1
nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
nvme nvme6: creating 96 I/O queues.
nvme nvme6: new ctrl: "blktests-subsystem-1"
nvme_log_error: 1 callbacks suppressed
block nvme6n1: no usable path - requeuing I/O
nvme6c6n1: Read(0x2) @ LBA 2096384, 128 blocks, Host Aborted Command (sct 0x3 / sc 0x71)
blk_print_req_error: 1 callbacks suppressed
I/O error, dev nvme6c6n1, sector 2096384 op 0x0:(READ) flags 0x2880700 phys_seg 1 prio class 2
block nvme6n1: no usable path - requeuing I/O
Kernel attempted to read user page (236) - exploit attempt? (uid: 0)
BUG: Kernel NULL pointer dereference on read at 0x00000236
Faulting instruction address: 0xc000000000961274
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix  SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: nvme_loop nvme_fabrics loop nvmet null_blk rpadlpar_io rpaphp xsk_diag bonding rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink pseries_rng dax_pmem vmx_crypto drm drm_panel_orientation_quirks xfs mlx5_core nvme bnx2x sd_mod nd_pmem nd_btt nvme_core sg papr_scm tls libnvdimm ibmvscsi ibmveth scsi_transport_srp nvme_keyring nvme_auth mdio hkdf pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: loop]
CPU: 25 UID: 0 PID: 0 Comm: swapper/25 Kdump: loaded Not tainted 7.0.0-rc3+ #14 PREEMPT
Hardware name: IBM,9043-MRX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1120.00 (RF1120_128) hv:phyp pSeries
NIP:  c000000000961274 LR: c008000009af1808 CTR: c00000000096124c
REGS: c0000007ffc0f910 TRAP: 0300   Not tainted  (7.0.0-rc3+)
MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 22222222  XER: 00000000
CFAR: c008000009af232c DAR: 0000000000000236 DSISR: 40000000 IRQMASK: 0
GPR00: c008000009af17fc c0000007ffc0fbb0 c000000001c78100 c0000000be05cc00
GPR04: 0000000000000001 0000000000000000 0000000000000007 0000000000000000
GPR08: 0000000000000000 0000000000000000 0000000000000002 c008000009af2318
GPR12: c00000000096124c c0000007ffdab880 0000000000000000 0000000000000000
GPR16: 0000000000000010 0000000000000000 0000000000000004 0000000000000000
GPR20: 0000000000000001 c000000002ca2b00 0000000100043bb2 000000000000000a
GPR24: 000000000000000a 0000000000000000 0000000000000000 0000000000000000
GPR28: c000000084021d40 c000000084021d50 c0000000be05cd60 c0000000be05cc00
NIP [c000000000961274] blk_mq_complete_request_remote+0x28/0x2d4
LR [c008000009af1808] nvme_loop_queue_response+0x110/0x290 [nvme_loop]
Call Trace:
 0xc00000000502c640 (unreliable)
 nvme_loop_queue_response+0x104/0x290 [nvme_loop]
 __nvmet_req_complete+0x80/0x498 [nvmet]
 nvmet_req_complete+0x24/0xf8 [nvmet]
 nvmet_bio_done+0x58/0xcc [nvmet]
 bio_endio+0x250/0x390
 blk_update_request+0x2e8/0x68c
 blk_mq_end_request+0x30/0x5c
 lo_complete_rq+0x94/0x110 [loop]
 blk_complete_reqs+0x78/0x98
 handle_softirqs+0x148/0x454
 do_softirq_own_stack+0x3c/0x50
 __irq_exit_rcu+0x18c/0x1b4
 irq_exit+0x1c/0x34
 do_IRQ+0x114/0x278
 hardware_interrupt_common_virt+0x28c/0x290

Since the queue teardown path already guarantees that all target-side
operations have completed, cancelling the tagsets is redundant and
unsafe. So avoid cancelling the I/O and admin tagsets during controller
reset and shutdown.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:06 -07:00
Marco Crivellari
e8e1a4c0fb nvme: add WQ_PERCPU to alloc_workqueue users
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

   commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
   commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.

In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.

Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:06 -07:00
Marco Crivellari
12f5fb5ee1 nvmet-fc: add WQ_PERCPU to alloc_workqueue users
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

   commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
   commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.

In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.

Cc: Justin Tee <justin.tee@broadcom.com>
Cc: Naresh Gottumukkala <nareshgottumukkala83@gmail.com>
CC: Paul Ely <paul.ely@broadcom.com>
Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:05 -07:00
Marco Crivellari
3d553be6d2 nvmet: replace use of system_wq with system_percpu_wq
This patch continues the effort to refactor workqueue APIs, which has begun
with the changes introducing new workqueues and a new alloc_workqueue flag:

   commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
   commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

The point of the refactoring is to eventually alter the default behavior of
workqueues to become unbound by default so that their workload placement is
optimized by the scheduler.

Before that to happen, workqueue users must be converted to the better named
new workqueues with no intended behaviour changes:

   system_wq -> system_percpu_wq
   system_unbound_wq -> system_dfl_wq

This way the old obsolete workqueues (system_wq, system_unbound_wq) can be
removed in the future.

Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:05 -07:00
Alistair Francis
33eb451044 nvme-auth: Don't propose NVME_AUTH_DHGROUP_NULL with SC_C
Section 8.3.4.5.2 of the NVMe 2.1 base spec states that

"""
The 00h identifier shall not be proposed in an AUTH_Negotiate message
that requests secure channel concatenation (i.e., with the SC_C field
set to a non-zero value).
"""

We need to ensure that we don't set the NVME_AUTH_DHGROUP_NULL idlist if
SC_C is set.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kamaljit Singh <kamaljit.singh@opensource.wdc.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:05 -07:00
Alistair Francis
09e8f0f934 nvme: Add the DHCHAP maximum HD IDs
In preperation for using DHCHAP length in upcoming host and target
patches let's add the hash and diffie-hellman ID length macros.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yunje Shin <ioerts@kookmin.ac.kr>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chris Leech <cleech@redhat.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:05 -07:00
Robert Beckett
a8eebf9699 nvme-pci: add NVME_QUIRK_DISABLE_WRITE_ZEROES for Kingston OM3SGP4
The Kingston OM3SGP42048K2-A00 (PCI ID 2646:502f) firmware has a race
condition when processing concurrent write zeroes and DSM (discard)
commands, causing spurious "LBA Out of Range" errors and IOMMU page
faults at address 0x0.

The issue is reliably triggered by running two concurrent mkfs commands
on different partitions of the same drive, which generates interleaved
write zeroes and discard operations.

Disable write zeroes for this device, matching the pattern used for
other Kingston OM* drives that have similar firmware issues.

Cc: stable@vger.kernel.org
Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Assisted-by: claude-opus-4-6-v1
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:05 -07:00
Robert Beckett
40f0496b61 nvme: respect NVME_QUIRK_DISABLE_WRITE_ZEROES when wzsl is set
The NVM Command Set Identify Controller data may report a non-zero
Write Zeroes Size Limit (wzsl). When present, nvme_init_non_mdts_limits()
unconditionally overrides max_zeroes_sectors from wzsl, even if
NVME_QUIRK_DISABLE_WRITE_ZEROES previously set it to zero.

This effectively re-enables write zeroes for devices that need it
disabled, defeating the quirk. Several Kingston OM* drives rely on
this quirk to avoid firmware issues with write zeroes commands.

Check for the quirk before applying the wzsl override.

Fixes: 5befc7c26e ("nvme: implement non-mdts command limits")
Cc: stable@vger.kernel.org
Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Assisted-by: claude-opus-4-6-v1
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:05 -07:00
Caleb Sander Mateos
c4cfe8c328 nvmet: report NPDGL and NPDAL
A block device with a very large discard_granularity queue limit may not
be able to report it in the 16-bit NPDG and NPDA fields in the Identify
Namespace data structure. For this reason, version 2.1 of the NVMe specs
added 32-bit fields NPDGL and NPDAL to the NVM Command Set Specific
Identify Namespace structure. So report the discard_granularity there
too and set OPTPERF to 11b to indicate those fields are supported.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:05 -07:00
Caleb Sander Mateos
e0d56e7055 nvmet: use NVME_NS_FEAT_OPTPERF_SHIFT
Use the NVME_NS_FEAT_OPTPERF_SHIFT constant in nvmet_bdev_set_limits()
to set the OPTPERF bits of the nvme_id_ns NSFEAT field instead of the
magic number 4.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:04 -07:00
Caleb Sander Mateos
1029298da3 nvme: set discard_granularity from NPDG/NPDA
Currently, nvme_config_discard() always sets the discard_granularity
queue limit to the logical block size. However, NVMe namespaces can
advertise a larger preferred discard granularity in the NPDG or NPDA
field of the Identify Namespace structure or the NPDGL or NPDAL fields
of the I/O Command Set Specific Identify Namespace structure.

Use these fields to compute the discard_granularity limit. The logic is
somewhat involved. First, the fields are optional. NPDG is only reported
if the low bit of OPTPERF is set in NSFEAT. NPDA is reported if any bit
of OPTPERF is set. And NPDGL and NPDAL are reported if the high bit of
OPTPERF is set. NPDGL and NPDAL can also each be set to 0 to opt out of
reporting a limit. I/O Command Set Specific Identify Namespace may also
not be supported by older NVMe controllers. Another complication is that
multiple values may be reported among NPDG, NPDGL, NPDA, and NPDAL. The
spec says to prefer the values reported in the L variants. The spec says
NPDG should be a multiple of NPDA and NPDGL should be a multiple of
NPDAL, but it doesn't specify a relationship between NPDG and NPDAL or
NPDGL and NPDA. So use the maximum of the reported NPDG(L) and NPDA(L)
values as the discard_granularity.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:04 -07:00
Caleb Sander Mateos
b465046c8c nvme: add from0based() helper
The NVMe specifications are big fans of "0's based"/"0-based" fields for
encoding values that must be positive. The encoded value is 1 less than
the value it represents. nvmet already provides a helper to0based() for
encoding 0's based values, so add a corresponding helper to decode these
fields on the host side.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:04 -07:00
Caleb Sander Mateos
823340b7e8 nvme: always issue I/O Command Set specific Identify Namespace
Currently, the I/O Command Set specific Identify Namespace structure is
only fetched for controllers that support extended LBA formats. This is
because struct nvme_id_ns_nvm is only used by nvme_configure_pi_elbas(),
which is only called when the ELBAS bit is set in the CTRATT field of
the Identify Controller structure.

However, the I/O Command Set specific Identify Namespace structure will
soon be used in nvme_update_disk_info(), so always try to obtain it in
nvme_update_ns_info_block(). This Identify structure is first defined in
NVMe spec version 2.0, but controllers reporting older versions could
still implement it.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:04 -07:00
Caleb Sander Mateos
d3c04a6ea5 nvme: update nvme_id_ns OPTPERF constants
In NVMe verson 2.0 and below, OPTPERF comprises only bit 4 of NSFEAT in
the Identify Namespace structure. Since version 2.1, OPTPERF includes
both bits 4 and 5 of NSFEAT. Replace the NVME_NS_FEAT_IO_OPT constant
with NVME_NS_FEAT_OPTPERF_SHIFT, NVME_NS_FEAT_OPTPERF_MASK, and
NVME_NS_FEAT_OPTPERF_MASK_2_1, representing the first bit, pre-2.1 bit
width, and post-2.1 bit width of OPTPERF.

Update nvme_update_disk_info() to check both OPTPERF bits for
controllers that report version 2.1 or newer, as NPWG and NOWS are
supported even if only bit 5 is set.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:04 -07:00
Caleb Sander Mateos
9110b85244 nvme: fold nvme_config_discard() into nvme_update_disk_info()
The choice of what queue limits are set in nvme_update_disk_info() vs.
nvme_config_discard() seems a bit arbitrary. A subsequent commit will
compute the discard_granularity limit using struct nvme_id_ns, which is
only passed to nvme_update_disk_info() currently. So move the logic in
nvme_config_discard() to nvme_update_disk_info(). Replace several
instances of ns->ctrl in nvme_update_disk_info() with the ctrl variable
brought from nvme_config_discard().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:04 -07:00
Caleb Sander Mateos
ac61e869be nvme: add preferred I/O size fields to struct nvme_id_ns_nvm
A subsequent change will use the NPDGL and NPDAL fields of the NVM
Command Set Specific Identify Namespace structure, so add them (and the
handful of intervening fields) to struct nvme_id_ns_nvm. Add an
assertion that the size is still 4 KB.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:03 -07:00
Alistair Francis
ed6a9f7dab nvme: Allow reauth from sysfs
Allow userspace to trigger a reauth (REPLACETLSPSK) from sysfs.
This can be done by writing  a zero to the sysfs file.

echo 0 > /sys/devices/virtual/nvme-fabrics/ctl/nvme0/tls_configured_key

In order to use the new keys for the admin queue we call controller
reset. This isn't ideal, but I can't find a simpler way to reset the
admin queue TLS connection.

Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:03 -07:00
Alistair Francis
56d25f1a6e nvme: Expose the tls_configured sysfs for secure concat connections
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:03 -07:00
Alistair Francis
2e6eb6b277 nvmet-tcp: Don't free SQ on authentication success
Curently after the host sends a REPLACETLSPSK we free the TLS keys as
part of calling nvmet_auth_sq_free() on success. This means when the
host sends a follow up REPLACETLSPSK we return CONCAT_MISMATCH as the
check for !nvmet_queue_tls_keyid(req->sq) fails.

This patch ensures we don't free the TLS key on success as we might need
it again in the future.

Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:03 -07:00
Alistair Francis
ecf4d2d883 nvmet-tcp: Don't error if TLS is enabed on a reset
If the host sends a AUTH_Negotiate Message on the admin queue with
REPLACETLSPSK set then we expect and require a TLS connection and
shouldn't report an error if TLS is enabled.

This change only enforces the nvmet_queue_tls_keyid() check if we aren't
resetting the negotiation.

Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:03 -07:00
Eric Biggers
6d888db2cf crypto: remove HKDF library
Remove crypto/hkdf.c, since it's no longer used.  Originally it had two
users, but now both of them just inline the needed HMAC computations
using the HMAC library APIs.  That ends up being better, since it
eliminates all the complexity and performance issues associated with the
crypto_shash abstraction and multi-step HMAC input formatting.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:03 -07:00
Eric Biggers
26c8c2ddde nvme-auth: common: remove selections of no-longer used crypto modules
Now that nvme-auth uses the crypto library instead of crypto_shash,
remove obsolete selections from the NVME_AUTH kconfig option.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:03 -07:00
Eric Biggers
844d950bb2 nvme-auth: common: remove nvme_auth_digest_name()
Since nvme_auth_digest_name() is no longer used, remove it and the
associated data from the hash_map array.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:02 -07:00
Eric Biggers
16977e7755 nvme-auth: target: use crypto library in nvmet_auth_ctrl_hash()
For the HMAC computation in nvmet_auth_ctrl_hash(), use the crypto
library instead of crypto_shash.  This is simpler, faster, and more
reliable.  Notably, this eliminates the crypto transformation object
allocation for every call, which was very slow.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:02 -07:00
Eric Biggers
e501533f67 nvme-auth: target: use crypto library in nvmet_auth_host_hash()
For the HMAC computation in nvmet_auth_host_hash(), use the crypto
library instead of crypto_shash.  This is simpler, faster, and more
reliable.  Notably, this eliminates the crypto transformation object
allocation for every call, which was very slow.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:02 -07:00
Eric Biggers
efe8df9f9c nvme-auth: target: remove obsolete crypto_has_shash() checks
Since nvme-auth is now doing its HMAC computations using the crypto
library, it's guaranteed that all the algorithms actually work.
Therefore, remove the crypto_has_shash() checks which are now obsolete.

However, the caller in nvmet_auth_negotiate() seems to have also been
relying on crypto_has_shash(nvme_auth_hmac_name(host_hmac_id)) to
validate the host_hmac_id.  Therefore, make it validate the ID more
directly by checking whether nvme_auth_hmac_hash_len() returns 0 or not.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:02 -07:00
Eric Biggers
ac9a49cf6e nvme-auth: host: remove allocation of crypto_shash
Now that the crypto_shash that is being allocated in
nvme_auth_process_dhchap_challenge() and stored in the
struct nvme_dhchap_queue_context is no longer used, remove it.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:02 -07:00
Eric Biggers
c4f216c2a9 nvme-auth: host: use crypto library in nvme_auth_dhchap_setup_ctrl_response()
For the HMAC computation in nvme_auth_dhchap_setup_ctrl_response(), use
the crypto library instead of crypto_shash.  This is simpler, faster,
and more reliable.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:02 -07:00
Eric Biggers
6be8d3f043 nvme-auth: host: use crypto library in nvme_auth_dhchap_setup_host_response()
For the HMAC computation in nvme_auth_dhchap_setup_host_response(), use
the crypto library instead of crypto_shash.  This is simpler, faster,
and more reliable.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:02 -07:00
Eric Biggers
d126cbaa7d nvme-auth: common: use crypto library in nvme_auth_derive_tls_psk()
For the HKDF-Expand-Label computation in nvme_auth_derive_tls_psk(), use
the crypto library instead of crypto_shash and crypto/hkdf.c.

While this means the HKDF "helper" functions are no longer utilized,
they clearly weren't buying us much: it's simpler to just inline the
HMAC computations directly, and this code needs to be tested anyway.  (A
similar result was seen in fs/crypto/.  As a result, this eliminates the
last user of crypto/hkdf.c, which we'll be able to remove as well.)

As usual this is also a lot more efficient, eliminating the allocation
of a transformation object and multiple other dynamic allocations.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:01 -07:00
Eric Biggers
0002764c2f nvme-auth: common: use crypto library in nvme_auth_generate_digest()
For the HMAC computation in nvme_auth_generate_digest(), use the crypto
library instead of crypto_shash.  This is simpler, faster, and more
reliable.  Notably, this eliminates the crypto transformation object
allocation for every call, which was very slow.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:01 -07:00
Eric Biggers
be01b841d3 nvme-auth: common: use crypto library in nvme_auth_generate_psk()
For the HMAC computation in nvme_auth_generate_psk(), use the crypto
library instead of crypto_shash.  This is simpler, faster, and more
reliable.  Notably, this eliminates the crypto transformation object
allocation for every call, which was very slow.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:01 -07:00
Eric Biggers
a67d096fe9 nvme-auth: common: use crypto library in nvme_auth_augmented_challenge()
For the hash and HMAC computations in nvme_auth_augmented_challenge(),
use the crypto library instead of crypto_shash.  This is simpler,
faster, and more reliable.  Notably, this eliminates two crypto
transformation object allocations for every call, which was very slow.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:01 -07:00
Eric Biggers
092c05f8de nvme-auth: common: use crypto library in nvme_auth_transform_key()
For the HMAC computation in nvme_auth_transform_key(), use the crypto
library instead of crypto_shash.  This is simpler, faster, and more
reliable.  Notably, this eliminates the transformation object allocation
for every call, which was very slow.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:01 -07:00
Eric Biggers
4263ca1cae nvme-auth: common: add HMAC helper functions
Add some helper functions for computing HMAC-SHA256, HMAC-SHA384, or
HMAC-SHA512 values using the crypto library instead of crypto_shash.
These will enable some significant simplifications and performance
improvements in nvme-auth.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:01 -07:00
Eric Biggers
4454820b4e nvme-auth: common: explicitly verify psk_len == hash_len
nvme_auth_derive_tls_psk() is always called with psk_len == hash_len.
And based on the comments above nvme_auth_generate_psk() and
nvme_auth_derive_tls_psk(), this isn't an implementation choice but
rather just the length the spec uses.  Add a check which makes this
explicit, so that when cleaning up nvme_auth_derive_tls_psk() we don't
have to retain support for arbitrary values of psk_len.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:00 -07:00
Eric Biggers
0beeca72cf nvme-auth: rename nvme_auth_generate_key() to nvme_auth_parse_key()
This function does not generate a key.  It parses the key from the
string that the caller passes in.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:00 -07:00
Eric Biggers
f990ad67f0 nvme-auth: common: add KUnit tests for TLS key derivation
Unit-test the sequence of function calls that derive tls_psk, so that we
can be more confident that changes in the implementation don't break it.

Since the NVMe specification doesn't seem to include any test vectors
for this (nor does its description of the algorithm seem to match what
was actually implemented, for that matter), I just set the expected
values to the values that the code currently produces.  In the case
of SHA-512, nvme_auth_generate_digest() currently returns -EINVAL, so
for now the test tests for that too.  If it is later determined that
some other behavior is needed, the test can be updated accordingly.

Tested with:

    tools/testing/kunit/kunit.py run --kunitconfig drivers/nvme/common/

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:00 -07:00
Eric Biggers
bf0e2567a6 nvme-auth: use proper argument types
For input parameters, use pointer to const.  This makes it easier to
understand which parameters are inputs and which are outputs.

In addition, consistently use char for strings and u8 for binary.  This
makes it easier to understand what is a string and what is binary data.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:00 -07:00
Eric Biggers
e57406c07b nvme-auth: common: constify static data
Fully constify the dhgroup_map and hash_map arrays.  Remove 'const' from
individual fields, as it is now redundant.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:00 -07:00
Eric Biggers
9100a28c8b nvme-auth: add NVME_AUTH_MAX_DIGEST_SIZE constant
Define a NVME_AUTH_MAX_DIGEST_SIZE constant and use it in the
appropriate places.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-27 07:35:00 -07:00
Vasily Gorbik
67807fbaf1 block: fix bio_alloc_bioset slowpath GFP handling
bio_alloc_bioset() first strips __GFP_DIRECT_RECLAIM from the optimistic
fast allocation attempt with try_alloc_gfp(). If that fast path fails,
the slowpath checks saved_gfp to decide whether blocking allocation is
allowed, but then still calls mempool_alloc() with the stripped gfp mask.
That can lead to a NULL bio pointer being passed into bio_init().

Fix the slowpath by using saved_gfp for the bio and bvec mempool
allocations.

Fixes: b520c4eef8 ("block: split bio_alloc_bioset more clearly into a fast and slowpath")
Reported-by: syzbot+09ddb593eea76a158f42@syzkaller.appspotmail.com
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/p01.gc6e9ad5845ad.ttca29g@ub.hpns
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-23 07:58:32 -06:00
Ming Lei
24d4c90286 ublk: move cold paths out of __ublk_batch_dispatch() for icache efficiency
Mark ublk_filter_unused_tags() as noinline since it is only called from
the unlikely(needs_filter) branch. Extract the error-handling block from
__ublk_batch_dispatch() into a new noinline ublk_batch_dispatch_fail()
function to keep the hot path compact and icache-friendly. This also
makes __ublk_batch_dispatch() more readable by separating the error
recovery logic from the normal dispatch flow.

Before: __ublk_batch_dispatch is ~1419 bytes
After:  __ublk_batch_dispatch is ~1090 bytes (-329 bytes, -23%)

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260318014112.3125432-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-22 19:03:52 -06:00
Jens Axboe
713db70d6d Merge tag 'md-7.1-20260323' of git://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-7.1/block
Pull MD changes from Yu Kuia:

"Bug Fixes:
 - md: suppress spurious superblock update error message for dm-raid
   (Chen Cheng)
 - md/raid1: fix the comparing region of interval tree (Xiao Ni)
 - md/raid10: fix deadlock with check operation and nowait requests
   (Josh Hunt)
 - md/raid5: skip 2-failure compute when other disk is R5_LOCKED
   (FengWei Shih)
 - md/md-llbitmap: raise barrier before state machine transition
   (Yu Kuai)
 - md/md-llbitmap: skip reading rdevs that are not in_sync (Yu Kuai)

 Improvements:
 - md/raid5: set chunk_sectors to enable full stripe I/O splitting
   (Yu Kuai)

 Cleanups:
 - md: remove unused mddev argument from export_rdev (Chen Cheng)
 - md/raid5: remove stale md_raid5_kick_device() declaration
   (Chen Cheng)
 - md/raid5: move handle_stripe() comment to correct location
   (Chen Cheng)"

* tag 'md-7.1-20260323' of git://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
  md: remove unused mddev argument from export_rdev
  md/raid5: move handle_stripe() comment to correct location
  md/raid5: remove stale md_raid5_kick_device() declaration
  md/raid1: fix the comparing region of interval tree
  md/raid5: skip 2-failure compute when other disk is R5_LOCKED
  md/md-llbitmap: raise barrier before state machine transition
  md/md-llbitmap: skip reading rdevs that are not in_sync
  md/raid5: set chunk_sectors to enable full stripe I/O splitting
  md/raid10: fix deadlock with check operation and nowait requests
  md: suppress spurious superblock update error message for dm-raid
2026-03-22 13:37:45 -06:00
Chen Cheng
6f507eb2bb md: remove unused mddev argument from export_rdev
The mddev argument in export_rdev() is never used. Remove it to
simplify callers.

Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://lore.kernel.org/linux-raid/20260304111417.20777-1-chencheng@fnnas.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2026-03-23 02:15:11 +08:00
Chen Cheng
81c041260a md/raid5: move handle_stripe() comment to correct location
Move the handle_stripe() documentation comment from above
analyse_stripe() to directly above handle_stripe() where it belongs.

Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Link: https://lore.kernel.org/linux-raid/20260304111001.15767-1-chencheng@fnnas.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2026-03-23 02:15:11 +08:00
Chen Cheng
af5c99b8ea md/raid5: remove stale md_raid5_kick_device() declaration
Remove the unused md_raid5_kick_device() declaration from raid5.h -
no definition exists for this function.

Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Link: https://lore.kernel.org/linux-raid/20260304110919.15071-1-chencheng@fnnas.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2026-03-23 02:15:11 +08:00
Xiao Ni
de3544d2e5 md/raid1: fix the comparing region of interval tree
Interval tree uses [start, end] as a region which stores in the tree.
In raid1, it uses the wrong end value. For example:
bio(A,B) is too big and needs to be split to bio1(A,C-1), bio2(C,B).
The region of bio1 is [A,C] and the region of bio2 is [C,B]. So bio1 and
bio2 overlap which is not right.

Fix this problem by using right end value of the region.

Fixes: d0d2d8ba04 ("md/raid1: introduce wait_for_serialization")
Signed-off-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20260305011839.5118-2-xni@redhat.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2026-03-23 02:15:10 +08:00
FengWei Shih
52e4324935 md/raid5: skip 2-failure compute when other disk is R5_LOCKED
When skip_copy is enabled on a doubly-degraded RAID6, a device that is
being written to will be in R5_LOCKED state with R5_UPTODATE cleared.
If a new read triggers fetch_block() while the write is still in
flight, the 2-failure compute path may select this locked device as a
compute target because it is not R5_UPTODATE.

Because skip_copy makes the device page point directly to the bio page,
reconstructing data into it might be risky. Also, since the compute
marks the device R5_UPTODATE, it triggers WARN_ON in ops_run_io()
which checks that R5_SkipCopy and R5_UPTODATE are not both set.

This can be reproduced by running small-range concurrent read/write on
a doubly-degraded RAID6 with skip_copy enabled, for example:

  mdadm -C /dev/md0 -l6 -n6 -R -f /dev/loop[0-3] missing missing
  echo 1 > /sys/block/md0/md/skip_copy
  fio --filename=/dev/md0 --rw=randrw --bs=4k --numjobs=8 \
      --iodepth=32 --size=4M --runtime=30 --time_based --direct=1

Fix by checking R5_LOCKED before proceeding with the compute. The
compute will be retried once the lock is cleared on IO completion.

Signed-off-by: FengWei Shih <dannyshih@synology.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Link: https://lore.kernel.org/linux-raid/20260319053351.3676794-1-dannyshih@synology.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2026-03-22 09:57:33 +08:00
Kees Cook
c2d466b9fe block: partitions: Replace pp_buf with struct seq_buf
In preparation for removing the strlcat API[1], replace the char *pp_buf
with a struct seq_buf, which tracks the current write position and
remaining space internally. This allows for:

- Direct use of seq_buf_printf() in place of snprintf()+strlcat()
  pairs, eliminating local tmp buffers throughout.
- Adjacent strlcat() calls that build strings piece-by-piece
  (e.g., strlcat("["); strlcat(name); strlcat("]")) to be collapsed
  into single seq_buf_printf() calls.
- Simpler call sites: seq_buf_puts() takes only the buffer and string,
  with no need to pass PAGE_SIZE at every call.

The backing buffer allocation is unchanged (__get_free_page), and the
output path uses seq_buf_str() to NUL-terminate before passing to
printk().

Link: https://github.com/KSPP/linux/issues/370 [1]
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Josh Law <objecting@objecting.org>
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Josh Law <objecting@objecting.org>
Link: https://patch.msgid.link/20260321004840.work.670-kees@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-21 08:27:08 -06:00