When lpfc is handling a solicited and unsolicited PLOGI with another
initiator, the remote initiator is never recovered. The node for the
initiator is erroneouosly removed and all resources released.
In lpfc_cmpl_els_plogi(), when lpfc_els_retry() returns a failure code, the
driver is calling the state machine with a device remove event because the
remote port is not currently registered with the SCSI or NVMe
transports. The issue is that on a PLOGI "collision" the driver correctly
aborts the solicited PLOGI and allows the unsolicited PLOGI to complete the
process, but this process is interrupted with a device_rm event.
Introduce logic in the PLOGI completion to capture the PLOGI collision
event and jump out of the routine. This will avoid removal of the node.
If there is no collision, the normal node removal will occur.
Fixes: 52edb2caf6 ("scsi: lpfc: Remove ndlp when a PLOGI/ADISC/PRLI/REG_RPI ultimately fails")
Cc: <stable@vger.kernel.org> # v5.11+
Link: https://lore.kernel.org/r/20210514195559.119853-6-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
An 'unexpected timeout' message may be seen in a point-2-point topology.
The message occurs when a PLOGI is received before the driver is notified
of FLOGI completion. The FLOGI completion failure causes discovery to be
triggered for a second time. The discovery timer is restarted but no new
discovery activity is initiated, thus the timeout message eventually
appears.
In point-2-point, when discovery has progressed before the FLOGI completion
is processed, it is not a failure. Add code to FLOGI completion to detect
that discovery has progressed and exit the FLOGI handling (noop'ing it).
Link: https://lore.kernel.org/r/20210514195559.119853-4-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When processing an NVMe ERSP IU which didn't match the optimized CQE-only
path, the status was being left to the WQE status. WQE status is non-zero
as it is indicating a non-optimized completion that needs to be handled by
the driver.
Fix by clearing the status field when falling into the non-optimized
case. Log message added to track optimized vs non-optimized debug.
Link: https://lore.kernel.org/r/20210514195559.119853-3-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
While testing NPIV and watching logins and used RPI levels, it was seen the
used RPI count was much higher than the number of remote ports discovered.
Code inspection showed that remote port removals on any NPIV instance are
releasing the RPI, but not performing an UNREG_RPI with the adapter thus
the reference counting never fully drops and the RPI is never fully
released. This was happening on NPIV nodes due to a log of fabric ELS's to
fabric addresses. This lack of UNREG_RPI was introduced by a prior node
rework patch that performed the UNREG_RPI as part of node cleanup.
To resolve the issue, do the following:
- Restore the RPI release code, but move the location to so that it is in
line with the new node cleanup design.
- NPIV ports now release the RPI and drop the node when the caller sets
the NLP_RELEASE_RPI flag.
- Set the NLP_RELEASE_RPI flag in node cleanup which will trigger a
release of RPI to free pool.
- Ensure there's an UNREG_RPI at LOGO completion so that RPI release is
completed.
- Stop offline_prep from skipping nodes that are UNUSED. The RPI may
not have been released.
- Stop the default RPI handling in lpfc_cmpl_els_rsp() for SLI4.
- Fixed up debugfs RPI displays for better debugging.
Fixes: a70e63eee1 ("scsi: lpfc: Fix NPIV Fabric Node reference counting")
Link: https://lore.kernel.org/r/20210514195559.119853-2-jsmart2021@gmail.com
Cc: <stable@vger.kernel.org> # v5.11+
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
If an RTPG fails, we can't infer anything wrt. the state of the ports in
the port group except that we were unable to reach the one port on which
the RTPG had failed. "offline" is just a secondary port state, which means
that we can't infer the state of any port in the PG from the failure (in
fact, even the failed port might still be in "active/optimized" primary
port access state).
Therefore, when we encounter an RTPG failure, we should retry the RTPG on a
different port. This avoids falsely setting port states to offline for
unreachable ports. To do this, ports on which an RTPG has failed are
temporarily set to "disabled" to avoid repeating the failed I/O on the same
target port. Once the RTPG has either succeeded on one port or failed on
all ports of the PG, the ports are enabled again.
Link: https://lore.kernel.org/r/20210514153214.5626-1-mwilck@suse.com
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Don't populate the const array granularity_tbl on the stack but instead
make it static. Makes the object code smaller by 190 bytes:
Before:
text data bss dec hex filename
25563 6908 0 32471 7ed7 ./drivers/scsi/ufs/ufs-exynos.o
After:
text data bss dec hex filename
25213 7068 0 32281 7e19 ./drivers/scsi/ufs/ufs-exynos.o
(gcc version 10.3.0)
Link: https://lore.kernel.org/r/20210505190104.70112-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
drivers/target/target_core_user.c:1424:9-10: WARNING: return of 0/1 in function 'tcmu_handle_completions' with return type bool
Return statements in functions returning bool should use
true/false instead of 1/0.
Generated by: scripts/coccinelle/misc/boolreturn.cocci
Link: https://lore.kernel.org/r/20210515230358.GA97544@60d1edce16e0
Fixes: 9814b55cde ("scsi: target: tcmu: Return from tcmu_handle_completions() if cmd_id not found")
CC: Bodo Stroesser <bostroesser@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Bodo Stroesser <bostroesser@gmail.com>
Signed-off-by: kernel test robot <lkp@intel.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
After commit 6c11dc0604 ("scsi: hisi_sas: Fix IRQ checks") we have the
error codes returned by platform_get_irq() ready for the propagation
upsream in interrupt_init_v1_hw() -- that will fix still broken deferred
probing. Let's propagate the error codes from devm_request_irq() as well
since I don't see the reason to override them with -ENOENT...
Link: https://lore.kernel.org/r/49ba93a3-d427-7542-d85a-b74fe1a33a73@omp.ru
Acked-by: John Garry <john.garry@huawei.com>
Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In the Linux kernel definitions of data structures should occur in .c
files. Hence move the exynos7_uic_attr definition from a .h into a .c
file. Additionally, declare exynos_ufs_drvs static. This patch fixes the
following two sparse warnings:
drivers/scsi/ufs/ufs-exynos.h:248:28: warning: symbol 'exynos_ufs_drvs' was not declared. Should it be static?
drivers/scsi/ufs/ufs-exynos.h:250:28: warning: symbol 'exynos7_uic_attr' was not declared. Should it be static?
Link: https://lore.kernel.org/r/20210509213817.4348-1-bvanassche@acm.org
Cc: Alim Akhtar <alim.akhtar@samsung.com>
Cc: Kiwoong Kim <kwmad.kim@samsung.com>
Reviewed-by: Alim Akhtar <alim.akhtar@samsung.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The controller expects all data it sends/receives to be little-endian.
Therefore, the packet struct definitions should use the __le16/32/64
types. Once those are correct, sparse reports several issues with the
driver code, which are fixed here as well.
The main issue observed was at the call to scsi_set_resid(), where the
byteswapped parameter would eventually trigger the alignment check at
drivers/scsi/sd.c:2009. At that point, the kernel would continuously
complain about an "Unaligned partial completion", and no further I/O could
occur.
This gets the controller working on big endian powerpc64.
Link: https://lore.kernel.org/r/20210427235915.39211-4-samuel@sholland.org
Signed-off-by: Samuel Holland <samuel@sholland.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Currently, all command packet structs used by this driver are packed.
However, only one (TW_SG_Entry) actually needs to be packed, because it
uses 64-bit addresses at 32-bit alignment. To improve the quality of
generated code, stop packing all of the other command packet structs. This
requires adjusting the type of one misaligned "reserved" member.
After this change, pahole reports that only one type had its layout change:
the tw_compat_info member of TW_Device_Extension is now naturally aligned.
Link: https://lore.kernel.org/r/20210427235915.39211-3-samuel@sholland.org
Signed-off-by: Samuel Holland <samuel@sholland.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In preparation for removing the "#pragma pack(1)" from the driver, fix all
instances where a trailing array member could be replaced by a flexible
array member. Since a flexible array member has zero size, it introduces no
padding, whether or not the struct is packed.
Link: https://lore.kernel.org/r/20210427235915.39211-2-samuel@sholland.org
Signed-off-by: Samuel Holland <samuel@sholland.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Extend the standard INQUIRY data to 96 bytes and fill in the VERSION
DESCRIPTOR fields.
The layout follows SPC-4:
- SCSI architecture standard
- SCSI transport protocol standard
- SCSI primary command set standard
- SCSI device type command set standard
All version descriptor values are defined as "no version claimed" because
some initiators fail to recognize anything else.
[mkp: whitespace]
Link: https://lore.kernel.org/r/20210513192804.1252142-3-k.shelekhin@yadro.com
Reviewed-by: Roman Bolshakov <r.bolshakov@yadro.com>
Signed-off-by: Konstantin Shelekhin <k.shelekhin@yadro.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Both the INQUIRY handling and the XCOPY implementation provide functions to
generate an NAA designator. In addition, these functions are poorly named:
- spc_parse_naa_6h_vendor_specific()
- target_xcopy_gen_naa_ieee()
Introduce a common NAA 6 designator generation function,
spc_gen_naa_6h_vendor_specific().
Link: https://lore.kernel.org/r/20210420185920.42431-2-s.samoylenko@yadro.com
Signed-off-by: Sergey Samoylenko <s.samoylenko@yadro.com>
Signed-off-by: Roman Bolshakov <r.bolshakov@yadro.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
If rport target discovery commands fail for some reason, they get retried
up to a set number of retries. Once the retry limit is exceeded, the target
is deleted. In order to delete the target, we either need to do an implicit
logout or a move login. In the move login case, if the move login fails, we
want to retry it. This ensures the retry counter gets reinitialized so the
move login will get retried.
Link: https://lore.kernel.org/r/1620756740-7045-4-git-send-email-brking@linux.vnet.ibm.com
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
If fast fail is enabled and we encounter a WWPN moving from one port id to
another port id with I/O outstanding, if we use the move login MAD,
although it will work, it will leave any outstanding I/O still outstanding
to the old port id. Eventually, the SCSI command timers will fire and we
will abort these commands, however, this is generally much longer than the
fast fail timeout, which can lead to I/O operations being outstanding for a
long time. This patch changes the behavior to avoid the move login if fast
fail is enabled. Once terminate_rport_io cleans up the rport, then we force
the target back through the delete process, which re-drives the implicit
logout, then kicks us back into discovery where we will discover the WWPN
at the new location and do a PLOGI to it.
Link: https://lore.kernel.org/r/1620756740-7045-3-git-send-email-brking@linux.vnet.ibm.com
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When service is being performed on an SVC with NPIV enabled, the WWPN of
the canister / node being serviced fails over to the another canister /
node. This looks to the ibmvfc driver as a WWPN moving from one SCSI ID to
another. The driver will first attempt to do an implicit logout of the old
SCSI ID. If this works, we simply delete the rport at the old location and
add an rport at the new location and the FC transport class handles
everything. However, if there is I/O outstanding, this implicit logout will
fail, in which case we will send a "move login" request to the VIOS. This
will cancel any outstanding I/O to that port, logout the port, and PLOGI
the new port. Recently we've encountered a scenario where the move login
fails. This was resulting in an attempted plogi to the new scsi id, without
the old scsi id getting logged out, which is a VIOS protocol violation. To
solve this, we want to keep tracking the old scsi id as the current scsi
id. That way, once terminate_rport_io cancels the outstanding i/o, it will
send us back through to do an implicit logout of the old scsi id, rather
than the new scsi id, and then we can plogi the new scsi id.
Link: https://lore.kernel.org/r/1620756740-7045-2-git-send-email-brking@linux.vnet.ibm.com
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The result of container_of() operations is never NULL unless the embedded
element is the first element of the structure, which is not the case here.
The NULL checks are therefore unnecessary and misleading. Remove them.
The changes in this patch were made automatically using the following
Coccinelle script.
@@
type t;
identifier v;
statement s;
@@
<+...
(
t v = container_of(...);
|
v = container_of(...);
)
...
when != v
- if (\( !v \| v == NULL \) ) s
...+>
Link: https://lore.kernel.org/r/20210510041211.2051325-1-linux@roeck-us.net
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The structure pointer passed to container_of() is never NULL; that was
already checked. That means that the result of container_of() operations
on it is also never NULL, even though se_node_acl is the first element
of the structure embedding it. On top of that, it is misleading to perform
a NULL check on the result of container_of() because the position of the
contained element could change, which would make the test invalid.
Remove the unnecessary NULL check.
As it turns out, the container_of operation was only made for the purpose
of the NULL check. If the container_of is actually needed, it is repeated
later. Remove the container_of operation as well.
The NULL check was identified and removed with the following Coccinelle
script.
@@
type t;
identifier v;
statement s;
@@
<+...
(
t v = container_of(...);
|
v = container_of(...);
)
...
when != v
- if (\( !v \| v == NULL \) ) s
...+>
Link: https://lore.kernel.org/r/20210510040817.2050266-1-linux@roeck-us.net
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Hou Pu <houpu@bytedance.com>
Cc: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code
should always use “flexible array members”[1] for these cases. The older
style of one-element or zero-length arrays should no longer be used[2].
Refactor the code according to the use of a flexible-array member in struct
aac_raw_io2 instead of one-element array, and use the struct_size() helper.
Also, this helps with the ongoing efforts to enable -Warray-bounds by
fixing the following warnings:
drivers/scsi/aacraid/aachba.c: In function ‘aac_build_sgraw2’:
drivers/scsi/aacraid/aachba.c:3970:18: warning: array subscript 1 is above array bounds of ‘struct sge_ieee1212[1]’ [-Warray-bounds]
3970 | if (rio2->sge[j].length % (i*PAGE_SIZE)) {
| ~~~~~~~~~^~~
drivers/scsi/aacraid/aachba.c:3974:27: warning: array subscript 1 is above array bounds of ‘struct sge_ieee1212[1]’ [-Warray-bounds]
3974 | nseg_new += (rio2->sge[j].length / (i*PAGE_SIZE));
| ~~~~~~~~~^~~
drivers/scsi/aacraid/aachba.c:4011:28: warning: array subscript 1 is above array bounds of ‘struct sge_ieee1212[1]’ [-Warray-bounds]
4011 | for (j = 0; j < rio2->sge[i].length / (pages * PAGE_SIZE); ++j) {
| ~~~~~~~~~^~~
drivers/scsi/aacraid/aachba.c:4012:24: warning: array subscript 1 is above array bounds of ‘struct sge_ieee1212[1]’ [-Warray-bounds]
4012 | addr_low = rio2->sge[i].addrLow + j * pages * PAGE_SIZE;
| ~~~~~~~~~^~~
drivers/scsi/aacraid/aachba.c:4014:33: warning: array subscript 1 is above array bounds of ‘struct sge_ieee1212[1]’ [-Warray-bounds]
4014 | sge[pos].addrHigh = rio2->sge[i].addrHigh;
| ~~~~~~~~~^~~
drivers/scsi/aacraid/aachba.c:4015:28: warning: array subscript 1 is above array bounds of ‘struct sge_ieee1212[1]’ [-Warray-bounds]
4015 | if (addr_low < rio2->sge[i].addrLow)
| ~~~~~~~~~^~~
[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9/process/deprecated.html#zero-length-and-one-element-arrays
Link: https://github.com/KSPP/linux/issues/79
Link: https://github.com/KSPP/linux/issues/109
Link: https://lore.kernel.org/lkml/60414244.ur4%2FkI+fBF1ohKZs%25lkp@intel.com/
Link: https://lore.kernel.org/r/20210421185611.GA105224@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Build-tested-by: kernel test robot <lkp@intel.com>
During runtime-suspend of ufs host, the SCSI devices are already suspended
and so are the queues associated with them. However, the ufs host sends SSU
(START_STOP_UNIT) to the wlun during runtime-suspend.
During the process blk_queue_enter() checks if the queue is not in suspended
state. If so, it waits for the queue to resume, and never comes out of
it. Commit 52abca64fd ("scsi: block: Do not accept any requests while
suspended") adds the check to see if the queue is in suspended state in
blk_queue_enter().
Call trace:
__switch_to+0x174/0x2c4
__schedule+0x478/0x764
schedule+0x9c/0xe0
blk_queue_enter+0x158/0x228
blk_mq_alloc_request+0x40/0xa4
blk_get_request+0x2c/0x70
__scsi_execute+0x60/0x1c4
ufshcd_set_dev_pwr_mode+0x124/0x1e4
ufshcd_suspend+0x208/0x83c
ufshcd_runtime_suspend+0x40/0x154
ufshcd_pltfrm_runtime_suspend+0x14/0x20
pm_generic_runtime_suspend+0x28/0x3c
__rpm_callback+0x80/0x2a4
rpm_suspend+0x308/0x614
rpm_idle+0x158/0x228
pm_runtime_work+0x84/0xac
process_one_work+0x1f0/0x470
worker_thread+0x26c/0x4c8
kthread+0x13c/0x320
ret_from_fork+0x10/0x18
Fix this by registering ufs device wlun as a SCSI driver and registering it
for block runtime-pm. Also make this a supplier for all other LUNs. This
way the wlun device suspends after all the consumers and resumes after HBA
resumes. This also registers a new SCSI driver for rpmb wlun. This new
driver is mostly used to clear rpmb uac.
[mkp: resolve merge conflict with 5.13-rc1 and fix doc warning]
Fixed smatch warnings:
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/4662c462e79e3e7f541f54f88f8993f421026d83.1619223249.git.asutoshd@codeaurora.org
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Co-developed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Asutosh Das <asutoshd@codeaurora.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>