The cpu io statistics were capped by a hard define limit of 128. This
effectively was a max number of CPUs, not an actual CPU count, nor actual
CPU numbers which can be even larger than both of those values. This made
stats off/misleading and on large CPU count systems, wrong.
Fix the stats so that all CPUs can have a stats struct. Fix the looping
such that it loops by hdwq, finds CPUs that used the hdwq, and sum the
stats, then display.
Link: https://lore.kernel.org/r/20200322181304.37655-9-jsmart2021@gmail.com
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Kernel is crashing with the following stacktrace:
BUG: unable to handle kernel NULL pointer dereference at
00000000000005bc
IP: lpfc_nvme_register_port+0x1a8/0x3a0 [lpfc]
...
Call Trace:
lpfc_nlp_state_cleanup+0x2b2/0x500 [lpfc]
lpfc_nlp_set_state+0xd7/0x1a0 [lpfc]
lpfc_cmpl_prli_prli_issue+0x1f7/0x450 [lpfc]
lpfc_disc_state_machine+0x7a/0x1e0 [lpfc]
lpfc_cmpl_els_prli+0x16f/0x1e0 [lpfc]
lpfc_sli_sp_handle_rspiocb+0x5b2/0x690 [lpfc]
lpfc_sli_handle_slow_ring_event_s4+0x182/0x230 [lpfc]
lpfc_do_work+0x87f/0x1570 [lpfc]
kthread+0x10d/0x130
ret_from_fork+0x35/0x40
During target side fault injections, it is possible to hit the
NLP_WAIT_FOR_UNREG case in lpfc_nvme_remoteport_delete. A prior commit
fixed a rebind and delete race condition, but called lpfc_nlp_put
unconditionally. This triggered a deletion and the crash.
Fix by movng nlp_put to inside the NLP_WAIT_FOR_UNREG case, where the nlp
will be being unregistered/removed. Leave the reference if the flag isn't
set.
Link: https://lore.kernel.org/r/20200322181304.37655-8-jsmart2021@gmail.com
Fixes: b15bd3e621 ("scsi: lpfc: Fix nvme remoteport registration race conditions")
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Injecting EEH on a 32GB card is causing kernel oops
The pci error handler is doing an IO flush and the offline code is also
doing an IO flush. When the 1st flush is complete the hdwq is destroyed
(freed), yet the second flush accesses the hdwq and crashes.
Added a check in lpfc_sli4_fush_io_rings to check both the HBA_IOQ_FLUSH
flag and the hdwq pointer to see if it is already set and not already
freed.
Link: https://lore.kernel.org/r/20200322181304.37655-6-jsmart2021@gmail.com
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
SCSI layer sends driver IOs with more s/g segments than driver can handle.
This results in "Too many sg segments from dma_map_sg. Config 64, seg_cnt
219" error messages from the lpfc_scsi_prep_dma_buf_s3() routine.
The was due to use the driver using individual templates for pport and
vport, host reset enabled or not, nvme vs scsi, etc. In the end, there was
a combination for a vport that didn't match the pport.
Rather than enumerating more templates and more discretionary assignments,
revert to a base template that is copied to a template specific to the
pport/vport. Then, based on role, attributes and sli type, modify the
fields that are different for that port. Added a log message to
lpfc_create_port to validate values.
Link: https://lore.kernel.org/r/20200322181304.37655-5-jsmart2021@gmail.com
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In lpfc_nvmet_prep_fcp_wqe() the line "rsp->sg_cnt = 0" is modifying the
transport's data structure. This may result in the transport believing the
s/g list was already freed, thus may not unmap/free it properly. Lpfc
driver should not modify the transport data structure.
The zeroing of the sg_cnt is to avoid use of the transport's sgl in a
subsequent loop where the driver builds the necessary requests for the
adapter firmware to complete the IO.
Change LLDD to use a local copy of the transport sg_cnt when building
requests to be passed to the adapter fw.
Link: https://lore.kernel.org/r/20200322181304.37655-4-jsmart2021@gmail.com
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The following lockdep error was reported when unloading the lpfc driver:
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
...
Call Trace:
dump_stack+0x96/0xe0
register_lock_class+0x8b8/0x8c0
? lockdep_hardirqs_on+0x190/0x280
? is_dynamic_key+0x150/0x150
? wait_for_completion_interruptible+0x2a0/0x2a0
? wake_up_q+0xd0/0xd0
__lock_acquire+0xda/0x21a0
? register_lock_class+0x8c0/0x8c0
? synchronize_rcu_expedited+0x500/0x500
? __call_rcu+0x850/0x850
lock_acquire+0xf3/0x1f0
? del_timer_sync+0x5/0xb0
del_timer_sync+0x3c/0xb0
? del_timer_sync+0x5/0xb0
lpfc_pci_remove_one.cold.102+0x8b7/0x935 [lpfc]
...
Unloading the driver resulted in a call to del_timer_sync for the
cpuhp_poll_timer. However the call to setup the timer had never been made,
so the timer structures used by lockdep checking were not initialized.
Unconditionally call setup_timer for the cpuhp_poll_timer during driver
initialization. Calls to start the timer remain "as needed".
Link: https://lore.kernel.org/r/20200322181304.37655-3-jsmart2021@gmail.com
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The following kasan bug was called out:
BUG: KASAN: slab-out-of-bounds in lpfc_unreg_login+0x7c/0xc0 [lpfc]
Read of size 2 at addr ffff889fc7c50a22 by task lpfc_worker_3/6676
...
Call Trace:
dump_stack+0x96/0xe0
? lpfc_unreg_login+0x7c/0xc0 [lpfc]
print_address_description.constprop.6+0x1b/0x220
? lpfc_unreg_login+0x7c/0xc0 [lpfc]
? lpfc_unreg_login+0x7c/0xc0 [lpfc]
__kasan_report.cold.9+0x37/0x7c
? lpfc_unreg_login+0x7c/0xc0 [lpfc]
kasan_report+0xe/0x20
lpfc_unreg_login+0x7c/0xc0 [lpfc]
lpfc_sli_def_mbox_cmpl+0x334/0x430 [lpfc]
...
When processing the completion of a "Reg Rpi" login mailbox command in
lpfc_sli_def_mbox_cmpl, a call may be made to lpfc_unreg_login. The vpi is
extracted from the completing mailbox context and passed as an input for
the next. However, the vpi stored in the mailbox command context is an
absolute vpi, which for SLI4 represents both base + offset. When used with
a non-zero base component, (function id > 0) this results in an
out-of-range access beyond the allocated phba->vpi_ids array.
Fix by subtracting the function's base value to get an accurate vpi number.
Link: https://lore.kernel.org/r/20200322181304.37655-2-jsmart2021@gmail.com
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
We were setting bActiveICCLevel attribute for UFS device only once but the
type of this attribute has changed from persistent to volatile since UFS
device specification v2.1. This attribute is set to the default value after
power cycle or hardware reset event. It isn't safe to rely on prefetched
data (only used for bActiveICCLevel attribute now). Hence this change
removes the code related to data prefetching and set this parameter on
every attempt to probe the UFS device.
Tested-by: Stanley Chu <stanley.chu@mediatek.com>
Reviewed-by: Stanley Chu <stanley.chu@mediatek.com>
Reviewed-by: Avri Altman <avri.altman@wdc.com>
Signed-off-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Correct race condition where ioaccel is re-enabled before the raid_map is
updated. For RAID_1, RAID_1ADM, and RAID 5/6 there is a BUG_ON called which
is bad.
- Change event thread to disable ioaccel only. Send all requests down the
RAID path instead.
- Have rescan thread handle offload_enable.
- Since there is only one rescan allowed at a time, turning
offload_enabled on/off should not be racy. Each handler queues up a
rescan if one is already in progress.
- For timing diagram, offload_enabled is initially off due to a change
(transformation: splitmirror/remirror), ...
otbe = offload_to_be_enabled
oe = offload_enabled
Time Event Rescan Completion Request
Worker Worker Thread Thread
---- ------ ------ ---------- -------
T0 | | + UA |
T1 | + rescan started | 0x3f |
T2 + Event | | 0x0e |
T3 + Ack msg | | |
T4 | + if (!dev[i]->oe && | |
T5 | | dev[i]->otbe) | |
T6 | | get_raid_map | |
T7 + otbe = 1 | | |
T8 | | | |
T9 | + oe = otbe | |
T10 | | | + ioaccel request
T11 * BUG_ON
T0 - I/O completion with UA 0x3f 0x0e sets rescan flag.
T1 - rescan worker thread starts a rescan.
T2 - event comes in
T3 - event thread starts and issues "Acknowledge" message
...
T6 - rescan thread has bypassed code to reload new raid map.
...
T7 - event thread runs and sets offload_to_be_enabled
...
T9 - rescan thread turns on offload_enabled.
T10- request comes in and goes down ioaccel path.
T11- BUG_ON.
- After the patch is applied, ioaccel_enabled can only be re-enabled in
the re-scan thread.
Link: https://lore.kernel.org/r/158472877894.14200.7077843399036368335.stgit@brunhilda
Reviewed-by: Scott Teel <scott.teel@microsemi.com>
Reviewed-by: Matt Perricone <matt.perricone@microsemi.com>
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The current codebase makes use of the zero-length array language extension
to the C90 standard, but the preferred mechanism to declare variable-length
types such as these ones is a flexible array member[1][2], introduced in
C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning in
case the flexible array does not occur last in the structure, which will
help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by this
change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Link: https://lore.kernel.org/r/20200319222533.GA20577@embeddedor.com
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Block layer RPM is enabled for the genernal UFS SCSI devices when they are
probed by their driver. However block layer RPM is not enabled for UFS
well-known SCSI devices.
As UFS SCSI devices have their corresponding BSG char devices, accessing a
BSG char device via IOCTL may send requests to its corresponding SCSI
device through its request queue. If BSG IOCTL sends a request to a
well-known SCSI device when HBA is not runtime active, due to block layer
RPM not being enabled for the well-known SCSI devices, the HBA, which is at
the top of a SCSI device's parent chain, will not be resumed.
This change enables block layer RPM for the well-known SCSI devices so that
block layer can handle RPM for the well-known SCSI devices just like for
the general SCSI devices.
Reviewed-by: Avri Altman <avri.altman@wdc.com>
Reviewed-by: Stanley Chu <stanley.chu@mediatek.com>
Signed-off-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
If an iSCSI connection happens to fail while the daemon isn't running (due
to a crash or for another reason), the kernel failure report is not
received. When the daemon restarts, there is insufficient kernel state in
sysfs for it to know that this happened. open-iscsi tries to reopen every
connection, but on different initiators, we'd like to know which
connections have failed.
There is session->state, but that has a different lifetime than an iSCSI
connection, so it doesn't directly reflect the connection state.
[mkp: typos]
Link: https://lore.kernel.org/r/20200317233422.532961-1-krisman@collabora.com
Cc: Khazhismel Kumykov <khazhy@google.com>
Suggested-by: Junho Ryu <jayr@google.com>
Reviewed-by: Lee Duncan <lduncan@suse.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Trace events target_sequencer_start and target_cmd_complete
(include/trace/events/target.h) are ready to show NAA identifier, LUN ID,
and many other important command details in the system log:
TP_printk("%s -> LUN %03u %s data_length %6u CDB %s (TA:%s C:%02x)",
However, it's still hard to identify command on the initiator and command
on the target in the real life output of system log. For that purpose SCSI
provides a command identifier or task tag (term used in previous
standards). This patch adds tag ID in the system log's output:
TP_printk("%s -> LUN %03u tag %#llx %s data_length %6u CDB %s (TA:%s C:%02x)",
kworker/1:1-35 [001] .... 1392.989452: target_sequencer_start:
naa.5001405ec1ba6364 -> LUN 001 tag 0x1
SERVICE_ACTION_IN_16 data_length 32
CDB 9e 10 00 00 00 00 00 00 00 00 00 00 00 20 00 00 (TA:SIMPLE C:00)
kworker/1:1-35 [001] .... 1392.989456: target_cmd_complete:
naa.5001405ec1ba6364 <- LUN 001 tag 0x1 status GOOD (sense len 0)
SERVICE_ACTION_IN_16 data_length 32
CDB 9e 10 00 00 00 00 00 00 00 00 00 00 00 20 00 00 (TA:SIMPLE C:00)
Link: https://lore.kernel.org/r/226e01deaa9baf46d6ff3b8698bc9fe881f7dfc1.camel@dubeyko.com
Reviewed-by: Roman Bolshakov <r.bolshakov@yadro.com>
Reviewed-by: Konstantin Shelekhin <k.shelekhin@yadro.com>
Reviewed-by: Bart van Assche <bvanassche@acm.org>
Signed-off-by: Viacheslav Dubeyko <v.dubeiko@yadro.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
iscsit_close_session() can only be called when nconn is zero (otherwise a
kernel panic is triggered). If nconn is zero then iscsit_stop_session()
does nothing and exits, so calling it makes no sense.
We still need to call iscsit_check_session_usage_count() because this
function will sleep if the session's refcount is not zero and we don't want
to destroy the session structure if it's still being referenced.
Link: https://lore.kernel.org/r/20200313170656.9716-4-mlombard@redhat.com
Tested-by: Rahul Kundu <rahul.kundu@chelsio.com>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
A number of hangs have been reported against the target driver; they are
due to the fact that multiple threads may try to destroy the iscsi session
at the same time. This may be reproduced for example when a "targetcli
iscsi/iqn.../tpg1 disable" command is executed while a logout operation is
underway.
When this happens, two or more threads may end up sleeping and waiting for
iscsit_close_connection() to execute "complete(session_wait_comp)". Only
one of the threads will wake up and proceed to destroy the session
structure, the remaining threads will hang forever.
Note that if the blocked threads are somehow forced to wake up with
complete_all(), they will try to free the same iscsi session structure
destroyed by the first thread, causing double frees, memory corruptions
etc...
With this patch, the threads that want to destroy the iscsi session will
increase the session refcount and will set the "session_close" flag to 1;
then they wait for the driver to close the remaining active connections.
When the last connection is closed, iscsit_close_connection() will wake up
all the threads and will wait for the session's refcount to reach zero;
when this happens, iscsit_close_connection() will destroy the session
structure because no one is referencing it anymore.
INFO: task targetcli:5971 blocked for more than 120 seconds.
Tainted: P OE 4.15.0-72-generic #81~16.04.1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
targetcli D 0 5971 1 0x00000080
Call Trace:
__schedule+0x3d6/0x8b0
? vprintk_func+0x44/0xe0
schedule+0x36/0x80
schedule_timeout+0x1db/0x370
? __dynamic_pr_debug+0x8a/0xb0
wait_for_completion+0xb4/0x140
? wake_up_q+0x70/0x70
iscsit_free_session+0x13d/0x1a0 [iscsi_target_mod]
iscsit_release_sessions_for_tpg+0x16b/0x1e0 [iscsi_target_mod]
iscsit_tpg_disable_portal_group+0xca/0x1c0 [iscsi_target_mod]
lio_target_tpg_enable_store+0x66/0xe0 [iscsi_target_mod]
configfs_write_file+0xb9/0x120
__vfs_write+0x1b/0x40
vfs_write+0xb8/0x1b0
SyS_write+0x5c/0xe0
do_syscall_64+0x73/0x130
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Link: https://lore.kernel.org/r/20200313170656.9716-3-mlombard@redhat.com
Reported-by: Matt Coleman <mcoleman@datto.com>
Tested-by: Matt Coleman <mcoleman@datto.com>
Tested-by: Rahul Kundu <rahul.kundu@chelsio.com>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Added the sysfs attribute for non fatal log so that management utility can
get the non fatal dump from driver. The non-fatal error is an error
condition or abnormal behavior detected by the host, or detected and
reported by the controller to the host.The non-fatal error does not stop
the controller firmware and enables it to still respond to host requests.
A typical example of a non-fatal error is an I/O timeout or an unusual
error notification from the controller. Since the firmware is operational,
the error dump information is pushed to host memory (by firmware) upon
request from the host.
Link: https://lore.kernel.org/r/20200316074906.9119-6-deepak.ukey@microchip.com
Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Deepak Ukey <deepak.ukey@microchip.com>
Signed-off-by: Viswas G <Viswas.G@microchip.com>
Signed-off-by: Radha Ramachandran <radha@google.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In pm80xx driver, the command mpi_set_phy_profile_req is sent by host
during boot to configure the phy profile such as analog setting page, rate
control page. However, the tag is not freed when its response is
received. As a result, 16 tags are missing for each HBA after boot. When
NCQ is enabled with queue depth 16, it needs at least, 15 * 16 = 240 tags
for each HBA to achieve the best performance. In current pm80xx driver with
setting CCB_MAX = 256, the total number of tags in each HBA is 255 for data
IO. Hence, without returning those tags to the pool after boot, some device
will finally be forced to non-ncq mode by ATA layer due to excessive errors
(i.e. LLDD cannot allocate tag for queued task).
Link: https://lore.kernel.org/r/20200316074906.9119-4-deepak.ukey@microchip.com
Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: yuuzheng <yuuzheng@google.com>
Signed-off-by: Deepak Ukey <deepak.ukey@microchip.com>
Signed-off-by: Viswas G <Viswas.G@microchip.com>
Signed-off-by: Radha Ramachandran <radha@google.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
A kexec reboot causes the controller fw to assert. This assertion shows up
in two ways, the controller doesn't show up as ready and an interrupt is
waiting as soon as the handler is registered. To resolve this added below
fix:
- Split the interrupt handling setup into two parts, setup and request.
- If the controller ready register indicates not-ready, but that the not
readiness is only on the IOC units we can still try a reset to bring the
system back to the pre-reboot state.
Link: https://lore.kernel.org/r/20200316074906.9119-3-deepak.ukey@microchip.com
Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Vikram Auradkar <auradkar@google.com>
Signed-off-by: Deepak Ukey <deepak.ukey@microchip.com>
Signed-off-by: Viswas G <Viswas.G@microchip.com>
Signed-off-by: Radha Ramachandran <radha@google.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Increasing the per-request size maximum (max_sectors_kb) runs into the
per-device DMA scatter gather list limit (max_segments) for users of the io
vector system calls (eg, readv and writev). This is because the kernel
combines io vectors into DMA segments when possible, but it doesn't work
for our user because the vectors in the buffer cache get scrambled. This
change bumps the advertised max scatter gather length to 528 to cover 2M w/
x86's 4k pages and some extra for the user checksum. It trims the size of
some of the tables we don't care about and exposes all of the command slots
upstream to the SCSI layer. Also reduced the PM8001_MAX_CCB to 256 as
pm8001 driver has memory limit depend on machine capability. If we increase
the sg length, we need to trade-off it by decreasing PM8001_MAX_CCB.
PM8001_MAX_CCB = 256 does not have any influence on normal use
Link: https://lore.kernel.org/r/20200316074906.9119-2-deepak.ukey@microchip.com
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Peter Chang <dpf@google.com>
Signed-off-by: Deepak Ukey <deepak.ukey@microchip.com>
Signed-off-by: Viswas G <Viswas.G@microchip.com>
Signed-off-by: Radha Ramachandran <radha@google.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>