Pull SCSI updates from James Bottomley:
"Usual driver updates (ufs, mpi3mr, lpfc, pm80xx, mpt3sas) plus
assorted cleanups and fixes.
The only core update is to sd.c and is mostly cosmetic"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (105 commits)
scsi: MAINTAINERS: Update FC element owners
scsi: mpt3sas: Update driver version to 54.100.00.00
scsi: mpt3sas: Add support for 22.5 Gbps SAS link rate
scsi: mpt3sas: Suppress unnecessary IOCLogInfo on CONFIG_INVALID_PAGE
scsi: mpt3sas: Fix crash in transport port remove by using ioc_info()
scsi: ufs: ufs-qcom: Add support for limiting HS gear and rate
scsi: ufs: pltfrm: Add DT support to limit HS gear and gear rate
scsi: ufs: ufs-qcom: Remove redundant re-assignment to hs_rate
scsi: ufs: dt-bindings: Document gear and rate limit properties
scsi: ufs: core: Fix data race in CPU latency PM QoS request handling
scsi: libfc: Fix potential buffer overflow in fc_ct_ms_fill()
scsi: storvsc: Remove redundant ternary operators
scsi: ufs: core: Change MCQ interrupt enable flow
scsi: smartpqi: Replace kmalloc() + copy_from_user() with memdup_user()
scsi: hpsa: Replace kmalloc() + copy_from_user() with memdup_user()
scsi: hpsa: Fix potential memory leak in hpsa_big_passthru_ioctl()
scsi: lpfc: Copyright updates for 14.4.0.11 patches
scsi: lpfc: Update lpfc version to 14.4.0.11
scsi: lpfc: Convert debugfs directory counts from atomic to unsigned int
scsi: lpfc: Clean up extraneous phba dentries
...
The sd_revalidate_disk() function currently returns 0 for both success
and memory allocation failure. Since none of its callers use the return
value, this return code is both unnecessary and potentially misleading.
Change the return type of sd_revalidate_disk() from int to void
and remove all return value handling. This makes the function
semantics clearer and avoids confusion about unused return codes.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Abinash Singh <abinashsinghlalotra@gmail.com>
Link: https://lore.kernel.org/r/20250825183940.13211-4-abinashsinghlalotra@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
A build warning was triggered due to excessive stack usage in
sd_revalidate_disk():
drivers/scsi/sd.c: In function ‘sd_revalidate_disk.isra’:
drivers/scsi/sd.c:3824:1: warning: the frame size of 1160 bytes is larger than 1024 bytes [-Wframe-larger-than=]
This is caused by a large local struct queue_limits (~400B) allocated on
the stack. Replacing it with a heap allocation using kmalloc()
significantly reduces frame usage. Kernel stack is limited (~8 KB), and
allocating large structs on the stack is discouraged. As the function
already performs heap allocations (e.g. for buffer), this change fits
well.
Fixes: 804e498e04 ("sd: convert to the atomic queue limits API")
Cc: stable@vger.kernel.org
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Abinash Singh <abinashsinghlalotra@gmail.com>
Link: https://lore.kernel.org/r/20250825183940.13211-2-abinashsinghlalotra@gmail.com
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Instances are happier that way and it makes more sense anyway -
the only part of the result that is related to partition we are given
is the start sector, and that has been filled in by the caller.
Everything else is a function of the disk. Only one instance
(DASD) is ever looking at anything other than bdev->bd_disk and
that one is trivial to adjust.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instances are passed struct block_device *bdev argument; the only thing
it is used for (if it's used in the first place) is bdev->bd_disk.
Might as well pass that in the first place...
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull more SCSI updates from James Bottomley:
"This is mostly fixes and cleanups and code reworks that trickled in
across the merge window and the weeks leading up. The only substantive
update is the Mediatek ufs driver which accounts for the bulk of the
additions"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (37 commits)
scsi: libsas: Use a bool for sas_deform_port() second argument
scsi: libsas: Move declarations of internal functions to sas_internal.h
scsi: libsas: Make sas_get_ata_info() static
scsi: libsas: Simplify sas_ata_wait_eh()
scsi: libsas: Refactor dev_is_sata()
scsi: sd: Make sd shutdown issue START STOP UNIT appropriately
scsi: arm64: dts: mediatek: mt8195: Add UFSHCI node
scsi: dt-bindings: mediatek,ufs: add MT8195 compatible and update clock nodes
scsi: dt-bindings: mediatek,ufs: Add ufs-disable-mcq flag for UFS host
scsi: ufs: ufs-mediatek: Add UFS host support for MT8195 SoC
scsi: ufs: ufs-pci: Remove control of UIC Completion interrupt for Intel MTL
scsi: ufs: core: Do not write interrupt enable register unnecessarily
scsi: ufs: core: Set and clear UIC Completion interrupt as needed
scsi: ufs: core: Remove duplicated code in ufshcd_send_bsg_uic_cmd()
scsi: ufs: core: Move ufshcd_enable_intr() and ufshcd_disable_intr()
scsi: ufs: ufs-pci: Remove UFS PCI driver's ->late_init() call back
scsi: ufs: ufs-pci: Fix default runtime and system PM levels
scsi: ufs: ufs-pci: Fix hibernate state transition for Intel MTL-like host controllers
scsi: ufs: host: mediatek: Support FDE (AES) clock scaling
scsi: ufs: host: mediatek: Support clock scaling with Vcore binding
...
Pull SCSI updates from James Bottomley:
"Smaller set of driver updates than usual (ufs, lpfc, mpi3mr).
The rest (including the core file changes) are doc updates and some
minor bug fixes"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (49 commits)
scsi: libiscsi: Initialize iscsi_conn->dd_data only if memory is allocated
scsi: scsi_transport_fc: Add comments to describe added 'rport' parameter
scsi: bfa: Double-free fix
scsi: isci: Fix dma_unmap_sg() nents value
scsi: mvsas: Fix dma_unmap_sg() nents value
scsi: elx: efct: Fix dma_unmap_sg() nents value
scsi: scsi_transport_fc: Change to use per-rport devloss_work_q
scsi: ufs: exynos: Fix programming of HCI_UTRL_NEXUS_TYPE
scsi: core: Fix kernel doc for scsi_track_queue_full()
scsi: ibmvscsi_tgt: Fix dma_unmap_sg() nents value
scsi: ibmvscsi_tgt: Fix typo in comment
scsi: mpi3mr: Update driver version to 8.14.0.5.50
scsi: mpi3mr: Serialize admin queue BAR writes on 32-bit systems
scsi: mpi3mr: Drop unnecessary volatile from __iomem pointers
scsi: mpi3mr: Fix race between config read submit and interrupt completion
scsi: ufs: ufs-qcom: Enable QUnipro Internal Clock Gating
scsi: ufs: core: Add ufshcd_dme_rmw() to modify DME attributes
scsi: ufs: ufs-qcom: Update esi_vec_mask for HW major version >= 6
scsi: core: Use scsi_cmd_priv() instead of open-coding it
scsi: qla2xxx: Remove firmware URL
...
Pull fallocate updates from Christian Brauner:
"fallocate() currently supports creating preallocated files
efficiently. However, on most filesystems fallocate() will preallocate
blocks in an unwriten state even if FALLOC_FL_ZERO_RANGE is specified.
The extent state must later be converted to a written state when the
user writes data into this range, which can trigger numerous metadata
changes and journal I/O. This may leads to significant write
amplification and performance degradation in synchronous write mode.
At the moment, the only method to avoid this is to create an empty
file and write zero data into it (for example, using 'dd' with a large
block size). However, this method is slow and consumes a considerable
amount of disk bandwidth.
Now that more and more flash-based storage devices are available it is
possible to efficiently write zeros to SSDs using the unmap write
zeroes command if the devices do not write physical zeroes to the
media.
For example, if SCSI SSDs support the UMMAP bit or NVMe SSDs support
the DEAC bit[1], the write zeroes command does not write actual data
to the device, instead, NVMe converts the zeroed range to a
deallocated state, which works fast and consumes almost no disk write
bandwidth.
This series implements the BLK_FEAT_WRITE_ZEROES_UNMAP feature and
BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED flag for SCSI, NVMe and
device-mapper drivers, and add the FALLOC_FL_WRITE_ZEROES and
STATX_ATTR_WRITE_ZEROES_UNMAP support for ext4 and raw bdev devices.
fallocate() is subsequently extended with the FALLOC_FL_WRITE_ZEROES
flag. FALLOC_FL_WRITE_ZEROES zeroes a specified file range in such a
way that subsequent writes to that range do not require further
changes to the file mapping metadata. This flag is beneficial for
subsequent pure overwriting within this range, as it can save on block
allocation and, consequently, significant metadata changes"
* tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
ext4: add FALLOC_FL_WRITE_ZEROES support
block: add FALLOC_FL_WRITE_ZEROES support
block: factor out common part in blkdev_fallocate()
fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
dm: clear unmap write zeroes limits when disabling write zeroes
scsi: sd: set max_hw_wzeroes_unmap_sectors if device supports SD_ZERO_*_UNMAP
nvmet: set WZDS and DRB if device enables unmap write zeroes operation
nvme: set max_hw_wzeroes_unmap_sectors if device supports DEAC bit
block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits
Commit aa3998dbeb ("ata: libata-scsi: Disable scsi device
manage_system_start_stop") enabled libata EH to manage device power mode
trasitions for system suspend/resume and removed the flag from
ata_scsi_dev_config. However, since the sd_shutdown() function still
relies on the manage_system_start_stop flag, a spin-down command is not
issued to the disk with command "echo 1 > /sys/block/sdb/device/delete"
sd_shutdown() can be called for both system/runtime start stop
operations, so utilize the manage_run_time_start_stop flag set in the
ata_scsi_dev_config and issue a spin-down command during disk removal
when the system is running. This is in addition to when the system is
powering off and manage_shutdown flag is set. The
manage_system_start_stop flag will still be used for drivers that still
set the flag.
Fixes: aa3998dbeb ("ata: libata-scsi: Disable scsi device manage_system_start_stop")
Signed-off-by: Salomon Dushimirimana <salomondush@google.com>
Link: https://lore.kernel.org/r/20250724214520.112927-1-salomondush@google.com
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
sd_read_block_limits_ext() currently assumes that vpd->len excludes the
size of the page header. However, vpd->len describes the size of the entire
VPD page, therefore the sanity check is incorrect.
In practice this is not really a problem since we don't attach VPD
pages unless they actually report data trailing the header. But fix
the length check regardless.
This issue was identified by Wukong-Agent (formerly Tencent Woodpecker), a
code security AI agent, through static code analysis.
[mkp: rewrote patch description]
Signed-off-by: jackysliu <1972843537@qq.com>
Link: https://lore.kernel.org/r/tencent_ADA5210D1317EEB6CD7F3DE9FE9DA4591D05@qq.com
Fixes: 96b171d6db ("scsi: core: Query the Block Limits Extension VPD page")
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
scsi_add_lun() tests the device vendor string of SCSI devices to detect
if a SCSI device is in fact an ATA device, in order to correctly handle
SATL power management. The function scsi_cdl_enable() also requires
knowing if a SCSI device is an ATA device to control the state of the
device CDL feature but this function does that by testing for the
presence of the VPD page 89h (ATA INFORMATION page).
sd_read_write_same() also has a similar test.
Simplify these different methods by adding the is_ata field to struct
scsi_device to remember that a SCSI device is in fact an ATA one based
on the device vendor name test. This field can also allow low level
SCSI host adapter drivers to take special actions for ATA devices
(e.g. to better handle ATA NCQ errors).
With this, simplify scsi_cdl_enable() and sd_read_write_same().
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250611093421.2901633-1-dlemoal@kernel.org
Reviewed-by: Igor Pylypiv <ipylypiv@google.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Having a variable length array at the end of scsi_stream_status_header
only causes problems. Remove it and switch sd_is_perm_stream(), which is
the only place that currently uses it, to use the scsi_stream_status
directly following it in the local buf structure.
Besides being a much better data structure design, this also avoids a
-Wflex-array-member-not-at-end warning.
Reported-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250505060640.3398500-1-hch@lst.de
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull io_uring updates from Jens Axboe:
"Not a lot in terms of features this time around, mostly just cleanups
and code consolidation:
- Support for PI meta data read/write via io_uring, with NVMe and
SCSI covered
- Cleanup the per-op structure caching, making it consistent across
various command types
- Consolidate the various user mapped features into a concept called
regions, making the various users of that consistent
- Various cleanups and fixes"
* tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux: (56 commits)
io_uring/fdinfo: fix io_uring_show_fdinfo() misuse of ->d_iname
io_uring: reuse io_should_terminate_tw() for cmds
io_uring: Factor out a function to parse restrictions
io_uring/rsrc: require cloned buffers to share accounting contexts
io_uring: simplify the SQPOLL thread check when cancelling requests
io_uring: expose read/write attribute capability
io_uring/rw: don't gate retry on completion context
io_uring/rw: handle -EAGAIN retry at IO completion time
io_uring/rw: use io_rw_recycle() from cleanup path
io_uring/rsrc: simplify the bvec iter count calculation
io_uring: ensure io_queue_deferred() is out-of-line
io_uring/rw: always clear ->bytes_done on io_async_rw setup
io_uring/rw: use NULL for rw->free_iovec assigment
io_uring/rw: don't mask in f_iocb_flags
io_uring/msg_ring: Drop custom destructor
io_uring: Move old async data allocation helper to header
io_uring/rw: Allocate async data through helper
io_uring/net: Allocate msghdr async data through helper
io_uring/uring_cmd: Allocate async data through generic helper
io_uring/poll: Allocate apoll with generic alloc_cache helper
...
Currently only stacked devices need to explicitly enable atomic writes by
setting BLK_FEAT_ATOMIC_WRITES_STACKED flag.
This does not work well for device mapper stacking devices, as there many
sets of limits are stacked and what is the 'bottom' and 'top' device can
swapped. This means that BLK_FEAT_ATOMIC_WRITES_STACKED needs to be set
for many queue limits, which is messy.
Generalize enabling atomic writes enabling by ensuring that all devices
must explicitly set a flag - that includes NVMe, SCSI sd, and md raid.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20250116170301.474130-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.
auto-generated by the following:
for i in `git grep -l -w asm/unaligned.h`; do
sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
Pull more SCSI updates from James Bottomley:
"These are mostly minor updates.
There are two drivers (lpfc and mpi3mr) which missed the initial
pull and a core change to retry a start/stop unit which affect
suspend/resume"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (32 commits)
scsi: lpfc: Update lpfc version to 14.4.0.5
scsi: lpfc: Support loopback tests with VMID enabled
scsi: lpfc: Revise TRACE_EVENT log flag severities from KERN_ERR to KERN_WARNING
scsi: lpfc: Ensure DA_ID handling completion before deleting an NPIV instance
scsi: lpfc: Fix kref imbalance on fabric ndlps from dev_loss_tmo handler
scsi: lpfc: Restrict support for 32 byte CDBs to specific HBAs
scsi: lpfc: Update phba link state conditional before sending CMF_SYNC_WQE
scsi: lpfc: Add ELS_RSP cmd to the list of WQEs to flush in lpfc_els_flush_cmd()
scsi: mpi3mr: Update driver version to 8.12.0.0.50
scsi: mpi3mr: Improve wait logic while controller transitions to READY state
scsi: mpi3mr: Update MPI Headers to revision 34
scsi: mpi3mr: Use firmware-provided timestamp update interval
scsi: mpi3mr: Enhance the Enable Controller retry logic
scsi: sd: Fix off-by-one error in sd_read_block_characteristics()
scsi: pm8001: Do not overwrite PCI queue mapping
scsi: scsi_debug: Remove a useless memset()
scsi: pmcraid: Convert comma to semicolon
scsi: sd: Retry START STOP UNIT commands
scsi: mpi3mr: A performance fix
scsi: ufs: qcom: Update MODE_MAX cfg_bw value
...
Pull SCSI updates from James Bottomley:
"Updates to the usual drivers (ufs, smartpqi, NCR5380, mac_scsi, lpfc,
mpi3mr).
There are no user visible core changes and a whole series of minor
updates and fixes. The largest core change is probably the
simplification of the workqueue allocation path"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (86 commits)
scsi: smartpqi: update driver version to 2.1.30-031
scsi: smartpqi: fix volume size updates
scsi: smartpqi: fix rare system hang during LUN reset
scsi: smartpqi: add new controller PCI IDs
scsi: smartpqi: add counter for parity write stream requests
scsi: smartpqi: correct stream detection
scsi: smartpqi: Add fw log to kdump
scsi: bnx2fc: Remove some unused fields in struct bnx2fc_rport
scsi: qla2xxx: Remove the unused 'del_list_entry' field in struct fc_port
scsi: ufs: core: Remove ufshcd_urgent_bkops()
scsi: core: Remove obsoleted declaration for scsi_driverbyte_string()
scsi: bnx2i: Remove unused declarations
scsi: core: Simplify an alloc_workqueue() invocation
scsi: ufs: Simplify alloc*_workqueue() invocation
scsi: stex: Simplify an alloc_ordered_workqueue() invocation
scsi: scsi_transport_fc: Simplify alloc_workqueue() invocations
scsi: snic: Simplify alloc_workqueue() invocations
scsi: qedi: Simplify an alloc_workqueue() invocation
scsi: qedf: Simplify alloc_workqueue() invocations
scsi: myrs: Simplify an alloc_ordered_workqueue() invocation
...
During system resume, sd_start_stop_device() submits a START STOP UNIT
command to the SCSI device that is being resumed. That command is not
retried in case of a unit attention and hence may fail. An example:
[16575.983359] sd 0:0:0:3: [sdd] Starting disk
[16575.983693] sd 0:0:0:3: [sdd] Start/Stop Unit failed: Result: hostbyte=0x00 driverbyte=DRIVER_OK
[16575.983712] sd 0:0:0:3: [sdd] Sense Key : 0x6
[16575.983730] sd 0:0:0:3: [sdd] ASC=0x29 ASCQ=0x0
[16575.983738] sd 0:0:0:3: PM: dpm_run_callback(): scsi_bus_resume+0x0/0xa0 returns -5
[16575.983783] sd 0:0:0:3: PM: failed to resume async: error -5
Make the SCSI core retry the START STOP UNIT command if the device reports
that it has been powered on or that it has been reset.
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240904210304.2947789-1-bvanassche@acm.org
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
If formatting a suspended disk (such as formatting with different DIF
type), the disk will be resuming first, and then the format command will
submit to the disk through SG_IO ioctl.
When the disk is processing the format command, the system does not
submit other commands to the disk. Therefore, the system attempts to
suspend the disk again and sends the SYNCHRONIZE CACHE command. However,
the SYNCHRONIZE CACHE command will fail because the disk is in the
formatting process. This will cause the runtime_status of the disk to
error and it is difficult for user to recover it. Error info like:
[ 669.925325] sd 6:0:6:0: [sdg] Synchronizing SCSI cache
[ 670.202371] sd 6:0:6:0: [sdg] Synchronize Cache(10) failed: Result: hostbyte=0x00 driverbyte=DRIVER_OK
[ 670.216300] sd 6:0:6:0: [sdg] Sense Key : 0x2 [current]
[ 670.221860] sd 6:0:6:0: [sdg] ASC=0x4 ASCQ=0x4
To solve the issue, ignore the error and return success/0 when format is
in progress.
Cc: stable@vger.kernel.org
Signed-off-by: Yihang Li <liyihang9@huawei.com>
Link: https://lore.kernel.org/r/20240819090934.2130592-1-liyihang9@huawei.com
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
There is a scenario where a large number of discard commands are issued
when the iscsi initiator connects to the target and then performs a session
rescan operation. There is a time window, most of the commands are in UNMAP
mode, and some discard commands become WRITE SAME with UNMAP.
The discard mode has been negotiated during the SCSI probe. If the mode is
temporarily changed from UNMAP to WRITE SAME with UNMAP, an I/O ERROR may
occur because the target may not implement WRITE SAME with UNMAP. Keep the
discard mode stable to fix this issue.
Signed-off-by: Li Feng <fengli@smartx.com>
Link: https://lore.kernel.org/r/20240718080751.313102-2-fengli@smartx.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Commit 804e498e04 ("sd: convert to the atomic queue limits API")
introduced pairs of function calls to queue_limits_start_update() and
queue_limits_commit_update(). These two functions lock and unlock
q->limits_lock. In sd_revalidate_disk(), sd_read_cpr() is called after
queue_limits_start_update() call and before queue_limits_commit_update()
call. sd_read_cpr() locks q->sysfs_dir_lock and &q->sysfs_lock. Then new
lock dependencies were created between q->limits_lock, q->sysfs_dir_lock
and q->sysfs_lock, as follows:
sd_revalidate_disk
queue_limits_start_update
mutex_lock(&q->limits_lock)
sd_read_cpr
disk_set_independent_access_ranges
mutex_lock(&q->sysfs_dir_lock)
mutex_lock(&q->sysfs_lock)
mutex_unlock(&q->sysfs_lock)
mutex_unlock(&q->sysfs_dir_lock)
queue_limits_commit_update
mutex_unlock(&q->limits_lock)
However, the three locks already had reversed dependencies in other
places. Then the new dependencies triggered the lockdep WARN "possible
circular locking dependency detected" [1]. This WARN was observed by
running the blktests test case srp/002.
To avoid the WARN, move the sd_read_cpr() call in sd_revalidate_disk()
after the queue_limits_commit_update() call. In other words, move the
sd_read_cpr() call out of the q->limits_lock region.
[1] https://lore.kernel.org/linux-scsi/vlmv53ni3ltwxplig5qnw4xsl2h6ccxijfbqzekx76vxoim5a5@dekv7q3es3tx/
Fixes: 804e498e04 ("sd: convert to the atomic queue limits API")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20240801054234.540532-1-shinichiro.kawasaki@wdc.com
Tested-by: Luca Coelho <luciano.coelho@intel.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
This reverts commit 7a6bbc2829.
The offending commit tried to suppress a double "Starting disk" message for
some drivers, but instead started spamming the log with bogus messages
every five seconds:
[ 311.798956] sd 0:0:0:0: [sda] Starting disk
[ 316.919103] sd 0:0:0:0: [sda] Starting disk
[ 322.040775] sd 0:0:0:0: [sda] Starting disk
[ 327.161140] sd 0:0:0:0: [sda] Starting disk
[ 332.281352] sd 0:0:0:0: [sda] Starting disk
[ 337.401878] sd 0:0:0:0: [sda] Starting disk
[ 342.521527] sd 0:0:0:0: [sda] Starting disk
[ 345.850401] sd 0:0:0:0: [sda] Starting disk
[ 350.967132] sd 0:0:0:0: [sda] Starting disk
[ 356.090454] sd 0:0:0:0: [sda] Starting disk
...
on machines that do not actually stop the disk on runtime suspend (e.g.
the Qualcomm sc8280xp CRD with UFS).
Let's just revert for now to address the regression.
Fixes: 7a6bbc2829 ("scsi: sd: Do not repeat the starting disk message")
Cc: stable@vger.kernel.org
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Link: https://lore.kernel.org/r/20240716161101.30692-1-johan+linaro@kernel.org
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull block integrity mapping updates from Jens Axboe:
"A set of cleanups and fixes for the block integrity support.
Sent separately from the main block changes from last week, as they
depended on later fixes in the 6.10-rc cycle"
* tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux:
block: don't free the integrity payload in bio_integrity_unmap_free_user
block: don't free submitter owned integrity payload on I/O completion
block: call bio_integrity_unmap_free_user from blk_rq_unmap_user
block: don't call bio_uninit from bio_endio
block: also return bio_integrity_payload * from stubs
block: split integrity support out of bio.h
Pull block updates from Jens Axboe:
- NVMe updates via Keith:
- Device initialization memory leak fixes (Keith)
- More constants defined (Weiwen)
- Target debugfs support (Hannes)
- PCIe subsystem reset enhancements (Keith)
- Queue-depth multipath policy (Redhat and PureStorage)
- Implement get_unique_id (Christoph)
- Authentication error fixes (Gaosheng)
- MD updates via Song
- sync_action fix and refactoring (Yu Kuai)
- Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu
Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li)
- Fix loop detach/open race (Gulam)
- Fix lower control limit for blk-throttle (Yu)
- Add module descriptions to various drivers (Jeff)
- Add support for atomic writes for block devices, and statx reporting
for same. Includes SCSI and NVMe (John, Prasad, Alan)
- Add IO priority information to block trace points (Dongliang)
- Various zone improvements and tweaks (Damien)
- mq-deadline tag reservation improvements (Bart)
- Ignore direct reclaim swap writes in writeback throttling (Baokun)
- Block integrity improvements and fixes (Anuj)
- Add basic support for rust based block drivers. Has a dummy null_blk
variant for now (Andreas)
- Series converting driver settings to queue limits, and cleanups and
fixes related to that (Christoph)
- Cleanup for poking too deeply into the bvec internals, in preparation
for DMA mapping API changes (Christoph)
- Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas,
Ming, Zhu, Damien, Christophe, Chaitanya)
* tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits)
floppy: add missing MODULE_DESCRIPTION() macro
loop: add missing MODULE_DESCRIPTION() macro
ublk_drv: add missing MODULE_DESCRIPTION() macro
xen/blkback: add missing MODULE_DESCRIPTION() macro
block/rnbd: Constify struct kobj_type
block: take offset into account in blk_bvec_map_sg again
block: fix get_max_segment_size() warning
loop: Don't bother validating blocksize
virtio_blk: Don't bother validating blocksize
null_blk: Don't bother validating blocksize
block: Validate logical block size in blk_validate_limits()
virtio_blk: Fix default logical block size fallback
nvmet-auth: fix nvmet_auth hash error handling
nvme: implement ->get_unique_id
block: pass a phys_addr_t to get_max_segment_size
block: add a bvec_phys helper
blk-lib: check for kill signal in ioctl BLKZEROOUT
block: limit the Write Zeroes to manually writing zeroes fallback
block: refacto blkdev_issue_zeroout
block: move read-only and supported checks into (__)blkdev_issue_zeroout
...
Pull SCSI fixes from James Bottomley:
"One core change that moves a disk start message to a location where it
will only be printed once instead of twice plus a couple of error
handling race fixes in the ufs driver"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: sd: Do not repeat the starting disk message
scsi: ufs: core: Fix ufshcd_abort_one racing issue
scsi: ufs: core: Fix ufshcd_clear_cmd racing issue
The SCSI disk message "Starting disk" to signal resuming of a suspended
disk is printed in both sd_resume() and sd_resume_common() which results
in this message being printed twice when resuming from e.g. autosuspend:
$ echo 5000 > /sys/block/sda/device/power/autosuspend_delay_ms
$ echo auto > /sys/block/sda/device/power/control
[ 4962.438293] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 4962.501121] sd 0:0:0:0: [sda] Stopping disk
$ echo on > /sys/block/sda/device/power/control
[ 4972.805851] sd 0:0:0:0: [sda] Starting disk
[ 4980.558806] sd 0:0:0:0: [sda] Starting disk
Fix this double print by removing the call to sd_printk() from sd_resume()
and moving the call to sd_printk() in sd_resume_common() earlier in the
function, before the check using sd_do_start_stop(). Doing so, the message
is printed once regardless if sd_resume_common() actually executes
sd_start_stop_device() (i.e. SCSI device case) or not (libsas and libata
managed ATA devices case).
Fixes: 0c76106cb9 ("scsi: sd: Fix TCG OPAL unlock on system resume")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240701215326.128067-1-dlemoal@kernel.org
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull in v6.10-rc6 to resolve a conflict for the integrity cleanups.
* tag 'v6.10-rc6': (778 commits)
Linux 6.10-rc6
ata: ahci: Clean up sysfs file on error
ata: libata-core: Fix double free on error
ata,scsi: libata-core: Do not leak memory for ata_port struct members
ata: libata-core: Fix null pointer dereference on error
x86-32: fix cmpxchg8b_emu build error with clang
x86: stop playing stack games in profile_pc()
i2c: testunit: discard write requests while old command is running
i2c: testunit: don't erase registers after STOP
tty: mxser: Remove __counted_by from mxser_board.ports[]
randomize_kstack: Remove non-functional per-arch entropy filtering
string: kunit: add missing MODULE_DESCRIPTION() macros
ata: libata-core: Add ATA_HORKAGE_NOLPM for all Crucial BX SSD1 models
MAINTAINERS: Update IOMMU tree location
tools/power turbostat: Add local build_bug.h header for snapshot target
tools/power turbostat: Fix unc freq columns not showing with '-q' or '-l'
tools/power turbostat: option '-n' is ambiguous
drm/drm_file: Fix pid refcounting race
kallsyms: rework symbol lookup return codes
gpiolib: cdev: Ignore reconfiguration without direction
...
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull SCSI fixes from James Bottomley:
"Two fixes: one in the ufs driver fixing an obvious memory leak and the
other (with a core flag based update) trying to prevent USB crashes by
stopping the core from issuing a request for the I/O Hints mode page"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: usb: uas: Do not query the IO Advice Hints Grouping mode page for USB/UAS devices
scsi: core: Introduce the BLIST_SKIP_IO_HINTS flag
scsi: ufs: core: Free memory allocated for model before reinit
Support is divided into two main areas:
- reading VPD pages and setting sdev request_queue limits
- support WRITE ATOMIC (16) command and tracing
The relevant block limits VPD page need to be read to allow the block layer
request_queue atomic write limits to be set. These VPD page limits are
described in sbc4r22 section 6.6.4 - Block limits VPD page.
There are five limits of interest:
- MAXIMUM ATOMIC TRANSFER LENGTH
- ATOMIC ALIGNMENT
- ATOMIC TRANSFER LENGTH GRANULARITY
- MAXIMUM ATOMIC TRANSFER LENGTH WITH BOUNDARY
- MAXIMUM ATOMIC BOUNDARY SIZE
MAXIMUM ATOMIC TRANSFER LENGTH is the maximum length for a WRITE ATOMIC
(16) command. It will not be greater than the device MAXIMUM TRANSFER
LENGTH.
ATOMIC ALIGNMENT and ATOMIC TRANSFER LENGTH GRANULARITY are the minimum
alignment and length values for an atomic write in terms of logical blocks.
Unlike NVMe, SCSI does not specify an LBA space boundary, but does specify
a per-IO boundary granularity. The maximum boundary size is specified in
MAXIMUM ATOMIC BOUNDARY SIZE. When used, this boundary value is set in the
WRITE ATOMIC (16) ATOMIC BOUNDARY field - layout for the WRITE_ATOMIC_16
command can be found in sbc4r22 section 5.48. This boundary value is the
granularity size at which the device may atomically write the data. A value
of zero in WRITE ATOMIC (16) ATOMIC BOUNDARY field means that all data must
be atomically written together.
MAXIMUM ATOMIC TRANSFER LENGTH WITH BOUNDARY is the maximum atomic write
length if a non-zero boundary value is set.
For atomic write support, the WRITE ATOMIC (16) boundary is not of much
interest, as the block layer expects each request submitted to be executed
atomically. However, the SCSI spec does leave itself open to a quirky
scenario where MAXIMUM ATOMIC TRANSFER LENGTH is zero, yet MAXIMUM ATOMIC
TRANSFER LENGTH WITH BOUNDARY and MAXIMUM ATOMIC BOUNDARY SIZE are both
non-zero. This case will be supported.
To set the block layer request_queue atomic write capabilities, sanitize
the VPD page limits and set limits as follows:
- atomic_write_unit_min is derived from granularity and alignment values.
If no granularity value is not set, use physical block size
- atomic_write_unit_max is derived from MAXIMUM ATOMIC TRANSFER LENGTH. In
the scenario where MAXIMUM ATOMIC TRANSFER LENGTH is zero and boundary
limits are non-zero, use MAXIMUM ATOMIC BOUNDARY SIZE for
atomic_write_unit_max. New flag scsi_disk.use_atomic_write_boundary is
set for this scenario.
- atomic_write_boundary_bytes is set to zero always
SCSI also supports a WRITE ATOMIC (32) command, which is for type 2
protection enabled. This is not going to be supported now, so check for
T10_PI_TYPE2_PROTECTION when setting any request_queue limits.
To handle an atomic write request, add support for WRITE ATOMIC (16)
command in handler sd_setup_atomic_cmnd(). Flag use_atomic_write_boundary
is checked here for encoding ATOMIC BOUNDARY field.
Trace info is also added for WRITE_ATOMIC_16 command.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-9-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the add_random flag into the queue_limits feature field so that it
can be set atomically with the queue frozen.
Note that this also removes code from dm to clear the flag based on
the underlying devices, which can't be reached as dm devices will
always start out without the flag set.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the nonrot flag into the queue_limits feature field so that it can
be set atomically with the queue frozen.
Use the chance to switch to defaulting to non-rotational and require
the driver to opt into rotational, which matches the polarity of the
sysfs interface.
For the z2ram, ps3vram, 2x memstick, ubiblock and dcssblk the new
rotational flag is not set as they clearly are not rotational despite
this being a behavior change. There are some other drivers that
unconditionally set the rotational flag to keep the existing behavior
as they arguably can be used on rotational devices even if that is
probably not their main use today (e.g. virtio_blk and drbd).
The flag is automatically inherited in blk_stack_limits matching the
existing behavior in dm and md.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the cache control settings into the queue_limits so that the flags
can be set atomically with the device queue frozen.
Add new features and flags field for the driver set flags, and internal
(usually sysfs-controlled) flags in the block layer. Note that we'll
eventually remove enough field from queue_limits to bring it back to the
previous size.
The disable flag is inverted compared to the previous meaning, which
means it now survives a rescan, similar to the max_sectors and
max_discard_sectors user limits.
The FLUSH and FUA flags are now inherited by blk_stack_limits, which
simplified the code in dm a lot, but also causes a slight behavior
change in that dm-switch and dm-unstripe now advertise a write cache
despite setting num_flush_bios to 0. The I/O path will handle this
gracefully, but as far as I can tell the lack of num_flush_bios
and thus flush support is a pre-existing data integrity bug in those
targets that really needs fixing, after which a non-zero num_flush_bios
should be required in dm for targets that map to underlying devices.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull SCSI fixes from James Bottomley:
"Three obvious driver fixes and two core fixes.
The two core fixes are to disable Command Duration Limits by default
to fix an inconsistency in SATA and some USB devices. The other is to
change the default read size for block zero to follow the device
preference (some USB bridges preferring 16 byte commands don't have a
translation for READ(10) and thus don't scan properly)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: mpi3mr: Fix ATA NCQ priority support
scsi: ufs: core: Quiesce request queues before checking pending cmds
scsi: core: Disable CDL by default
scsi: mpt3sas: Avoid test/set_bit() operating in non-allocated memory
scsi: sd: Use READ(16) when reading block zero on large capacity disks
Move the integrity information into the queue limits so that it can be
set atomically with other queue limits, and that the sysfs changes to
the read_verify and write_generate flags are properly synchronized.
This also allows to provide a more useful helper to stack the integrity
fields, although it still is separate from the main stacking function
as not all stackable devices want to inherit the integrity settings.
Even with that it greatly simplifies the code in md and dm.
Note that the integrity field is moved as-is into the queue limits.
While there are good arguments for removing the separate blk_integrity
structure, this would cause a lot of churn and might better be done at a
later time if desired. However the integrity field in the queue_limits
structure is now unconditional so that various ifdefs can be avoided or
replaced with IS_ENABLED(). Given that tiny size of it that seems like
a worthwhile trade off.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240613084839.1044015-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A few drivers optimistically try to support discard, write zeroes and
secure erase and disable the features from the I/O completion handler
if the hardware can't support them. This disable can't be done using
the atomic queue limits API because the I/O completion handlers can't
take sleeping locks or freeze the queue. Keep the existing clearing
of the relevant field to zero, but replace the old blk_queue_max_*
APIs with new disable APIs that force the value to 0.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240531074837.1648501-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>