Pull driver core updates from Greg KH:
"Here is a small set of driver core changes for 6.13-rc1.
Nothing major for this merge cycle, except for the two simple merge
conflicts are here just to make life interesting.
Included in here are:
- sysfs core changes and preparations for more sysfs api cleanups
that can come through all driver trees after -rc1 is out
- fw_devlink fixes based on many reports and debugging sessions
- list_for_each_reverse() removal, no one was using it!
- last-minute seq_printf() format string bug found and fixed in many
drivers all at once.
- minor bugfixes and changes full details in the shortlog"
* tag 'driver-core-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (35 commits)
Fix a potential abuse of seq_printf() format string in drivers
cpu: Remove spurious NULL in attribute_group definition
s390/con3215: Remove spurious NULL in attribute_group definition
perf: arm-ni: Remove spurious NULL in attribute_group definition
driver core: Constify bin_attribute definitions
sysfs: attribute_group: allow registration of const bin_attribute
firmware_loader: Fix possible resource leak in fw_log_firmware_info()
drivers: core: fw_devlink: Fix excess parameter description in docstring
driver core: class: Correct WARN() message in APIs class_(for_each|find)_device()
cacheinfo: Use of_property_present() for non-boolean properties
cdx: Fix cdx_mmap_resource() after constifying attr in ->mmap()
drivers: core: fw_devlink: Make the error message a bit more useful
phy: tegra: xusb: Set fwnode for xusb port devices
drm: display: Set fwnode for aux bus devices
driver core: fw_devlink: Stop trying to optimize cycle detection logic
driver core: Constify attribute arguments of binary attributes
sysfs: bin_attribute: add const read/write callback variants
sysfs: implement all BIN_ATTR_* macros in terms of __BIN_ATTR()
sysfs: treewide: constify attribute callback of bin_attribute::llseek()
sysfs: treewide: constify attribute callback of bin_attribute::mmap()
...
Pull SCSI updates from James Bottomley:
"Updates to the usual drivers (ufs, lpfc, hisi_sas, st).
Amazingly enough, no core changes with the biggest set of driver
changes being ufs (which conflicted with it's own fixes a bit, hence
the merges) and the rest being minor fixes and updates"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (97 commits)
scsi: st: New session only when Unit Attention for new tape
scsi: st: Add MTIOCGET and MTLOAD to ioctls allowed after device reset
scsi: st: Don't modify unknown block number in MTIOCGET
scsi: ufs: core: Restore SM8650 support
scsi: sun3: Mark driver struct with __refdata to prevent section mismatch
scsi: sg: Enable runtime power management
scsi: qedi: Fix a possible memory leak in qedi_alloc_and_init_sb()
scsi: qedf: Fix a possible memory leak in qedf_alloc_and_init_sb()
scsi: fusion: Remove unused variable 'rc'
scsi: bfa: Fix use-after-free in bfad_im_module_exit()
scsi: esas2r: Remove unused esas2r_build_cli_req()
scsi: target: Fix incorrect function name in pscsi_create_type_disk()
scsi: ufs: Replace deprecated PCI functions
scsi: Switch back to struct platform_driver::remove()
scsi: pm8001: Increase request sg length to support 4MiB requests
scsi: pm8001: Initialize devices in pm8001_alloc_dev()
scsi: pm8001: Use module param to set pcs event log severity
scsi: ufs: ufs-mediatek: Configure individual LU queue flags
scsi: MAINTAINERS: Update UFS Exynos entry
scsi: lpfc: Copyright updates for 14.4.0.6 patches
...
Pull interrupt subsystem updates from Thomas Gleixner:
"Tree wide:
- Make nr_irqs static to the core code and provide accessor functions
to remove existing and prevent future aliasing problems with local
variables or function arguments of the same name.
Core code:
- Prevent freeing an interrupt in the devres code which is not
managed by devres in the first place.
- Use seq_put_decimal_ull_width() for decimal values output in
/proc/interrupts which increases performance significantly as it
avoids parsing the format strings over and over.
- Optimize raising the timer and hrtimer soft interrupts by using the
'set bit only' variants instead of the combined version which
checks whether ksoftirqd should be woken up. The latter is a
pointless exercise as both soft interrupts are raised in the
context of the timer interrupt and therefore never wake up
ksoftirqd.
- Delegate timer/hrtimer soft interrupt processing to a dedicated
thread on RT.
Timer and hrtimer soft interrupts are always processed in ksoftirqd
on RT enabled kernels. This can lead to high latencies when other
soft interrupts are delegated to ksoftirqd as well.
The separate thread allows to run them seperately under a RT
scheduling policy to reduce the latency overhead.
Drivers:
- New drivers or extensions of existing drivers to support Renesas
RZ/V2H(P), Aspeed AST27XX, T-HEAD C900 and ATMEL sam9x7 interrupt
chips
- Support for multi-cluster GICs on MIPS.
MIPS CPUs can come with multiple CPU clusters, where each CPU
cluster has its own GIC (Generic Interrupt Controller). This
requires to access the GIC of a remote cluster through a redirect
register block.
This is encapsulated into a set of helper functions to keep the
complexity out of the actual code paths which handle the GIC
details.
- Support for encrypted guests in the ARM GICV3 ITS driver
The ITS page needs to be shared with the hypervisor and therefore
must be decrypted.
- Small cleanups and fixes all over the place"
* tag 'irq-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
irqchip/riscv-aplic: Prevent crash when MSI domain is missing
genirq/proc: Use seq_put_decimal_ull_width() for decimal values
softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.
timers: Use __raise_softirq_irqoff() to raise the softirq.
hrtimer: Use __raise_softirq_irqoff() to raise the softirq
riscv: defconfig: Enable T-HEAD C900 ACLINT SSWI drivers
irqchip: Add T-HEAD C900 ACLINT SSWI driver
dt-bindings: interrupt-controller: Add T-HEAD C900 ACLINT SSWI device
irqchip/stm32mp-exti: Use of_property_present() for non-boolean properties
irqchip/mips-gic: Fix selection of GENERIC_IRQ_EFFECTIVE_AFF_MASK
irqchip/mips-gic: Prevent indirect access to clusters without CPU cores
irqchip/mips-gic: Multi-cluster support
irqchip/mips-gic: Setup defaults in each cluster
irqchip/mips-gic: Support multi-cluster in for_each_online_cpu_gic()
irqchip/mips-gic: Replace open coded online CPU iterations
genirq/irqdesc: Use str_enabled_disabled() helper in wakeup_show()
genirq/devres: Don't free interrupt which is not managed by devres
irqchip/gic-v3-its: Fix over allocation in itt_alloc_pool()
irqchip/aspeed-intc: Add AST27XX INTC support
dt-bindings: interrupt-controller: Add support for ASPEED AST27XX INTC
...
Pull block updates from Jens Axboe:
- NVMe updates via Keith:
- Use uring_cmd helper (Pavel)
- Host Memory Buffer allocation enhancements (Christoph)
- Target persistent reservation support (Guixin)
- Persistent reservation tracing (Guixen)
- NVMe 2.1 specification support (Keith)
- Rotational Meta Support (Matias, Wang, Keith)
- Volatile cache detection enhancment (Guixen)
- MD updates via Song:
- Maintainers update
- raid5 sync IO fix
- Enhance handling of faulty and blocked devices
- raid5-ppl atomic improvement
- md-bitmap fix
- Support for manually defining embedded partition tables
- Zone append fixes and cleanups
- Stop sending the queued requests in the plug list to the driver
->queue_rqs() handle in reverse order.
- Zoned write plug cleanups
- Cleanups disk stats tracking and add support for disk stats for
passthrough IO
- Add preparatory support for file system atomic writes
- Add lockdep support for queue freezing. Already found a bunch of
issues, and some fixes for that are in here. More will be coming.
- Fix race between queue stopping/quiescing and IO queueing
- ublk recovery improvements
- Fix ublk mmap for 64k pages
- Various fixes and cleanups
* tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux: (118 commits)
MAINTAINERS: Update git tree for mdraid subsystem
block: make struct rq_list available for !CONFIG_BLOCK
block/genhd: use seq_put_decimal_ull for diskstats decimal values
block: don't reorder requests in blk_mq_add_to_batch
block: don't reorder requests in blk_add_rq_to_plug
block: add a rq_list type
block: remove rq_list_move
virtio_blk: reverse request order in virtio_queue_rqs
nvme-pci: reverse request order in nvme_queue_rqs
btrfs: validate queue limits
block: export blk_validate_limits
nvmet: add tracing of reservation commands
nvme: parse reservation commands's action and rtype to string
nvmet: report ns's vwc not present
md/raid5: Increase r5conf.cache_name size
block: remove the ioprio field from struct request
block: remove the write_hint field from struct request
nvme: check ns's volatile write cache not present
nvme: add rotational support
nvme: use command set independent id ns if available
...
max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using queue_limits_max_zone_append_sectors helper. This not
only adds (tiny) extra overhead to the I/O path, but also can be easily
forgotten in file system code.
Add a new max_hw_zone_append_sectors value to queue_limits which is
set by the driver, and calculate max_zone_append_sectors from that and
the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated to fix this.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20241108154657.845768-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull SCSI fixes from James Bottomley:
"Two small fixes, the drivers one in ufs simply delays running a work
queue and the generic one in zoned storage switches to a more correct
API that tries the standard buddy allocator first (for small
allocations); this fixes an allocation problem with small allocations
seen under memory pressure"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: core: Start the RTC update work later
scsi: sd_zbc: Use kvzalloc() to allocate REPORT ZONES buffer
Kai Mäkisara <Kai.Makisara@kolumbus.fi> says:
These three patches were developed in response to Bugzilla report
https://bugzilla.kernel.org/show_bug.cgi?id=219419
After device reset, the tape driver allows only operations that don't
write or read anything from tape. The reason for this is that many
(most ?) drives rewind the tape after reset and the subsequent reads
or writes would not be at the tape location the user expects. Reading
and writing is allowed again when the user does something to position the
tape (e.g., rewind).
The Bugzilla report considers the case when a user, after reset, tries
to read the drive status with MTIOCGET ioctl, but it fails. MTIOCGET
does not return much useful data after reset, but it can be allowed.
MTLOAD positions the tape and it should be allowed. The second patch
adds these to the set of allowed operations after device reset.
The first patch fixes a bug seen when developing the second patch.
V2: The third patch is added to fix a bug that resulted in not
blocking writes if reset occurs while the device file is not open.
Link: https://lore.kernel.org/r/20241106095723.63254-1-Kai.Makisara@kolumbus.fi
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
As described in the added code comment, a reference to .exit.text is ok
for drivers registered via module_platform_driver_probe(). Make this
explicit to prevent the following section mismatch warnings
WARNING: modpost: drivers/scsi/sun3_scsi: section mismatch in reference: sun3_scsi_driver+0x4 (section: .data) -> sun3_scsi_remove (section: .exit.text)
WARNING: modpost: drivers/scsi/sun3_scsi_vme: section mismatch in reference: sun3_scsi_driver+0x4 (section: .data) -> sun3_scsi_remove (section: .exit.text)
that trigger on a Sun 3 allmodconfig build.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/r/b2c56fa3556505befe9b4cb9a830d9e2a962e72c.1730831769.git.geert@linux-m68k.org
Acked-by: Finn Thain <fthain@linux-m68k.org>
Reviewed-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In 2010, runtime power management support was implemented in the SCSI
core. The description of patch "[SCSI] implement runtime Power
Management" mentions that the sg driver is skipped but not why. This
patch enables runtime power management even if an instance of the sg
driver is held open. Enabling runtime PM for the sg driver is safe
because all interactions of the sg driver with the SCSI device pass
through the block layer (blk_execute_rq_nowait()) and the block layer
already supports runtime PM.
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Fixes: bc4f24014d ("[SCSI] implement runtime Power Management")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241030220310.1373569-1-bvanassche@acm.org
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
After commit 0edb555a65 ("platform: Make platform_driver::remove()
return void") .remove() is (again) the right callback to implement for
platform drivers.
Convert all platform drivers below drivers/scsi to use .remove(), with
the eventual goal to drop struct platform_driver::remove_new(). As
.remove() and .remove_new() have the same prototypes, conversion is done
by just changing the structure member name in the driver initializer.
On the way do a few whitespace changes to make indention consistent.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Link: https://lore.kernel.org/r/20241028080754.429191-2-u.kleine-koenig@baylibre.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Increasing the per-request size maximum to 4MiB (8192 sectors x 512
bytes) runs into the per-device DMA scatter gather list limit
(max_segments) for users of the io vector system calls (e.g. readv and
writev).
Increase the max scatter gather list length to 1024 to enable kernel to
send 4MiB (1024 * 4KiB page size) requests.
Signed-off-by: Igor Pylypiv <ipylypiv@google.com>
Link: https://lore.kernel.org/r/20241025185009.3278297-1-ipylypiv@google.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Devices can be allocated and freed at runtime. For example during a soft
reset all devices are freed and reallocated upon discovery.
Currently the driver fully initializes devices once in pm8001_alloc().
Allows initialization steps to happen during runtime, avoiding any
leftover states from the device being freed.
Signed-off-by: Igor Pylypiv <ipylypiv@google.com>
Signed-off-by: Terrence Adams <tadamsjr@google.com>
Link: https://lore.kernel.org/r/20241021201828.1378858-1-tadamsjr@google.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Justin Tee <justintee8345@gmail.com> says:
Update lpfc to revision 14.4.0.6
This patch set contains bug fixes related to congestion handling,
accounting for internal remoteport objects, resource release during
HBA unload and reset, and clean up regarding the abuse of a global
spinlock.
The patches were cut against Martin's 6.13/scsi-queue tree.
Link: https://lore.kernel.org/r/20241031223219.152342-1-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using queue_limits_max_zone_append_sectors helper. This not
only adds (tiny) extra overhead to the I/O path, but also can be easily
forgotten in file system code.
Add a new max_hw_zone_append_sectors value to queue_limits which is
set by the driver, and calculate max_zone_append_sectors from that and
the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated to fix this.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
An RPI is tightly bound to an NDLP structure and is freed only upon
release of an NDLP object. As such, there should be no logic that frees
an RPI outside of the lpfc_nlp_release() routine. In order to reinforce
the original design usage of RPIs, remove the NLP_RELEASE_RPI flag and
related logic.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20241031223219.152342-9-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Current dev_loss_tmo handling checks whether there has been a previous
call to unregister with SCSI transport. If so, the NDLP kref count is
decremented a second time in dev_loss_tmo as the final kref release.
However, this can sometimes result in a reference count underflow if
there is also a race to unregister with NVMe transport as well. Add a
check for NVMe transport registration before decrementing the final
kref. If NVMe transport is still registered, then the NVMe transport
unregistration is designated as the final kref decrement.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20241031223219.152342-8-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
An HBA reset request that is executed when there are outstanding NVME-LS
commands can cause delays for the reset process to complete. Fix by
introducing a new routine called lpfc_nvmels_flush_cmd() that walks the
phba->nvmels_wq list and cancels outstanding submitted NVME-LS requests
speeding up the HBA reset process.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20241031223219.152342-7-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The lpfc_cmpl_ct_disc_fdmi() routine has incorrect logic that treats an
FDMI completion with error LOCAL_REJECT/SLI_ABORTED as a success status.
Under the erroneous assumption of successful completion, the routine
proceeds to issue follow up FDMI commands, which may never complete if
the HBA is in an errata state as indicated by the errored completion
status. Fix by freeing FDMI cmd resources and early return when the
LPFC_SLI_ACTIVE flag is not set and a LOCAL_REJECT/SLI_ABORTED or
SLI_DOWN status is received.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20241031223219.152342-6-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During firmware errata events, the lpfc_els_flush_cmd() routine is
responsible for the clean up of outstanding ELS and CT command
submissions. Thus, move the LPFC_SLI_ACTIVE flag check into the txcmplq
list walk and mark a piocb object for canceling if determined the HBA is
not active. Clean up should be regardless of application or driver
layer origin.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20241031223219.152342-5-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During initialization, the driver allocates wq->pring in lpfc_wq_create
and lpfc_sli4_queue_unset() is the only place where kfree(wq->pring) is
called.
There is a possible memory leak in lpfc_sli_brdrestart_s4() (restart)
and lpfc_pci_remove_one_s4() (rmmod) paths because there are no calls to
lpfc_sli4_queue_unset() to kfree() the wq->pring.
Fix by inserting a call to lpfc_sli4_queue_unset() in
lpfc_sli_brdrestart_s4() and lpfc_sli4_hba_unset() routines. Also, add
a check for the SLI_ACTIVE flag before issuing the Q_DESTROY mailbox
command. If not set, then the mailbox command will obviously fail. In
such cases, skip issuing the mailbox command and only execute the driver
resource clean up portions of the lpfc_*q_destroy routines.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20241031223219.152342-4-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Should an rport remain in the NOTPRESENT state it is possible that
stgt_delete_work is scheduled after dev_loss_tmo_callbk. In such cases,
dev_loss_tmo_callbk would have cleaned up the NDLP object resulting in
stale ndlp pointers in lpfc_terminate_rport_io().
Check for the DEVLOSS_CALLBK_DONE flag to know if dev_loss_tmo_callbk
has been called. This is a more reliable way to avoid dereferencing
stale NDLP pointers.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20241031223219.152342-3-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
CGN warning signals are currently statically fixed to a passed in driver
parameter called lpfc_fabric_cgn_frequency. However, CGN frequency
should be more correctly based on EDC responses from the fabric when
available. Otherwise, still allow the driver to fall back on user
configured lpfc_fabric_cgn_frequency driver parameter.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20241031223219.152342-2-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
We have two reports of failed memory allocation in btrfs' code which is
calling into report zones.
Both of these reports have the following signature coming from
__vmalloc_area_node():
kworker/u17:5: vmalloc error: size 0, failed to allocate pages, mode:0x10dc2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_NORETRY|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
Further debugging showed these where allocations of one sector (512
bytes) and at least one of the reporter's systems where low on memory,
so going through the overhead of allocating a vm area failed.
Switching the allocation from __vmalloc() to kvzalloc() avoids the
overhead of vmalloc() on small allocations and succeeds.
Note: the buffer is already freed using kvfree() so there's no need to
adjust the free path.
Cc: Qu Wenru <wqu@suse.com>
Cc: Naohiro Aota <naohiro.aota@wdc.com>
Link: https://github.com/kdave/btrfs-progs/issues/779
Link: https://github.com/kdave/btrfs-progs/issues/915
Fixes: 23a50861ad ("scsi: sd_zbc: Cleanup sd_zbc_alloc_report_buffer()")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241030110253.11718-1-jth@kernel.org
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull SCSI fixes from James Bottomley:
"Two small fixes, both in drivers (ufs and scsi_debug)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: core: Fix another deadlock during RTC update
scsi: scsi_debug: Fix do_device_access() handling of unexpected SG copy length
Pull SCSI fixes from James Bottomley:
"Fixes all in drivers. The largest is the mpi3mr which corrects a phy
count limit that should only apply to the controller but was being
incorrectly applied to expander phys"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: target: core: Fix null-ptr-deref in target_alloc_device()
scsi: mpi3mr: Validate SAS port assignments
scsi: ufs: core: Set SDEV_OFFLINE when UFS is shut down
scsi: ufs: core: Requeue aborted request
scsi: ufs: core: Fix the issue of ICU failure
Yihang Li <liyihang9@huawei.com> says:
This series contains some fixes including:
- Adjust priority of registering and exiting debugfs for security;
- Create trigger_dump at the end of the debugfs initialization;
- Add firmware information check;
- Enable all PHYs that are not disabled by user during controller reset;
- Reset PHY again if phyup timeout;
- Check usage count only when the runtime PM status is RPM_SUSPENDING;
- Add cond_resched() for no forced preemption model;
- Default enable interrupt coalescing;
- Update disk locked timeout to 7 seconds;
- Add time interval between two H2D FIS following soft reset spec;
- Update v3 hw STP_LINK_TIMER setting;
- Create all dump files during debugfs initialization;
- Add latest_dump for the debugfs dump;
Link: https://lore.kernel.org/r/20241008021822.2617339-1-liyihang9@huawei.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Before that, after the user triggers the dump, the latest dump
information can be viewed in the directory with the maximum number in
the dump directory.
After this series patch, the driver creates all debugfs directories and
files during initialization. Therefore, users cannot know the directory
where the latest dump information is stored. So, add latest_dump file to
notify users where the latest dump information is stored.
Signed-off-by: Yihang Li <liyihang9@huawei.com>
Link: https://lore.kernel.org/r/20241008021822.2617339-14-liyihang9@huawei.com
Reviewed-by: Xingui Yang <yangxingui@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The SATA disk will be locked after the disk sends the DMA Setup frame
until all data frame transmission is completed. The
CFG_ICT_TIMER_STEP_TRSH register is used for sata disk to configure the
step size of the timer which records the time when the disk is
locked. The unit is 1us and the default step size is 150ms. If the disk
is locked for more than 7 timer steps, the io to be sent to the disk
will end abnormally.
The current timeout is only about 1 second, it is easy to trigger IO
abnormal end when the SATA hard disk returns data slowly. Adjust the
timeout to 7 seconds based on ERC time of most disks.
Signed-off-by: Xingui Yang <yangxingui@huawei.com>
Link: https://lore.kernel.org/r/20241008021822.2617339-10-liyihang9@huawei.com
Reviewed-by: Xiang Chen <chenxiang66@hisilicon.com>
Reviewed-by: Yihang Li <liyihang9@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In the current interrupt reporting mode, each CQ entry reports an
interrupt. However, when there are a large number of I/O hardware
completion interrupts, the following issue may occur:
[ 4682.678657][ C129] irq 134: nobody cared (try booting with the "irqpoll" option)
[ 4682.708455][ C129] Call trace:
[ 4682.711589][ C129] dump_backtrace+0x0/0x1e4
[ 4682.715934][ C129] show_stack+0x20/0x2c
[ 4682.719933][ C129] dump_stack+0xd8/0x140
[ 4682.724017][ C129] __report_bad_irq+0x54/0x180
[ 4682.728625][ C129] note_interrupt+0x1ec/0x2f0
[ 4682.733143][ C129] handle_irq_event+0x118/0x1ac
[ 4682.737834][ C129] handle_fasteoi_irq+0xc8/0x200
[ 4682.742613][ C129] __handle_domain_irq+0x84/0xf0
[ 4682.747391][ C129] gic_handle_irq+0x88/0x2c0
[ 4682.751822][ C129] el1_irq+0xbc/0x140
[ 4682.755648][ C129] _find_next_bit.constprop.0+0x20/0x94
[ 4682.761036][ C129] cpumask_next+0x24/0x30
[ 4682.765208][ C129] gic_ipi_send_mask+0x48/0x170
[ 4682.769900][ C129] __ipi_send_mask+0x34/0x110
[ 4682.775720][ C129] smp_cross_call+0x3c/0xcc
[ 4682.780064][ C129] arch_send_call_function_single_ipi+0x38/0x44
[ 4682.786146][ C129] send_call_function_single_ipi+0xd0/0xe0
[ 4682.791794][ C129] generic_exec_single+0xb4/0x170
[ 4682.796659][ C129] smp_call_function_single_async+0x2c/0x40
[ 4682.802395][ C129] blk_mq_complete_request_remote.part.0+0xec/0x100
[ 4682.808822][ C129] blk_mq_complete_request+0x30/0x70
[ 4682.813950][ C129] scsi_mq_done+0x48/0xac
[ 4682.818128][ C129] sas_scsi_task_done+0xb0/0x150 [libsas]
[ 4682.823692][ C129] slot_complete_v3_hw+0x230/0x710 [hisi_sas_v3_hw]
[ 4682.830120][ C129] cq_thread_v3_hw+0xbc/0x190 [hisi_sas_v3_hw]
[ 4682.836114][ C129] irq_thread_fn+0x34/0xa4
[ 4682.840371][ C129] irq_thread+0xc4/0x130
[ 4682.844455][ C129] kthread+0x108/0x13c
[ 4682.848365][ C129] ret_from_fork+0x10/0x18
[ 4682.852621][ C129] handlers:
[ 4682.855577][ C129] [<00000000949e52bf>] cq_interrupt_v3_hw [hisi_sas_v3_hw] threaded [<000000005d8e3b68>] cq_thread_v3_hw [hisi_sas_v3_hw]
[ 4682.868084][ C129] Disabling IRQ #134
When the IRQ management layer processes each hardware interrupt, if the
return value of the interrupt handler is IRQ_WAKE_THREAD, it will wake
up the handler thread for this interrupt action and set IRQTF_RUNTHREAD
flag, wait for the interrupt handling thread to clear the
IRQTF_RUNTHREAD flag after execution. Later in note_interrupt(), use
irq_count to count hardware interrupts and irqs_unhandled to count
interrupts for which no thread handler is responsible. When irq_count
reaches 100000 and irqs_unhandled reaches 99000, irq will be disabled.
In the performance test scenario, I/O completion hardware interrupts are
continuously and quickly generated. As a result, the interrupt
processing thread is cyclically called in irq_thread() and does not
exit, this affects the response of the interrupt thread to the hardware
interrupt and causes irqs_unhandled to grow to 99000. Finally, the irq
is disabled.
Therefore, default enable interrupt coalescing to reduce the generation
of hardware interrupts, this helps interrupt processing threads to stop
calling in irq_thread().
For interrupt coalescing, according to the actual performance test, set
the count of CQ entries to 10 and the interrupt coalescing timeout
period to 10us based on the actual performance test.
Before and after interrupt coalescing is enabled, the 4K read/write
performance is improved by about 3%, and the 256K read/write performance
is basically the same.
Signed-off-by: Yihang Li <liyihang9@huawei.com>
Link: https://lore.kernel.org/r/20241008021822.2617339-9-liyihang9@huawei.com
Reviewed-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>