iSCSI suffers from a deadlock in case a management command submitted via
the netlink socket sleeps on an allocation while holding the rx_queue_mutex
if that allocation causes a memory reclaim that writebacks to a failed
iSCSI device. The recovery procedure can never make progress to recover
the failed disk or abort outstanding IO operations to complete the reclaim
(since rx_queue_mutex is locked), thus locking the system.
Nevertheless, just marking all allocations under rx_queue_mutex as GFP_NOIO
(or locking the userspace process with something like PF_MEMALLOC_NOIO) is
not enough, since the iSCSI command code relies on other subsystems that
try to grab locked mutexes, whose threads are GFP_IO, leading to the same
deadlock. One instance where this situation can be observed is in the
backtraces below, stitched from multiple bugs reports, involving the kobj
uevent sent when a session is created.
The root of the problem is not the fact that iSCSI does GFP_IO allocations,
that is acceptable. The actual problem is that rx_queue_mutex has a very
large granularity, covering every unrelated netlink command execution at
the same time as the error recovery path.
The proposed fix leverages the recently added mechanism to stop failed
connections from the kernel, by enabling it to execute even though a
management command from the netlink socket is being run (rx_queue_mutex is
held), provided that the command is known to be safe. It splits the
rx_queue_mutex in two mutexes, one protecting from concurrent command
execution from the netlink socket, and one protecting stop_conn from racing
with other connection management operations that might conflict with it.
It is not very pretty, but it is the simplest way to resolve the deadlock.
I considered making it a lock per connection, but some external mutex would
still be needed to deal with iscsi_if_destroy_conn.
The patch was tested by forcing a memory shrinker (unrelated, but used
bufio/dm-verity) to reclaim iSCSI pages every time
ISCSI_UEVENT_CREATE_SESSION happens, which is reasonable to simulate
reclaims that might happen with GFP_KERNEL on that path. Then, a faulty
hung target causes a connection to fail during intensive IO, at the same
time a new session is added by iscsid.
The following stacktraces are stiches from several bug reports, showing a
case where the deadlock can happen.
iSCSI-write
holding: rx_queue_mutex
waiting: uevent_sock_mutex
kobject_uevent_env+0x1bd/0x419
kobject_uevent+0xb/0xd
device_add+0x48a/0x678
scsi_add_host_with_dma+0xc5/0x22d
iscsi_host_add+0x53/0x55
iscsi_sw_tcp_session_create+0xa6/0x129
iscsi_if_rx+0x100/0x1247
netlink_unicast+0x213/0x4f0
netlink_sendmsg+0x230/0x3c0
iscsi_fail iscsi_conn_failure
waiting: rx_queue_mutex
schedule_preempt_disabled+0x325/0x734
__mutex_lock_slowpath+0x18b/0x230
mutex_lock+0x22/0x40
iscsi_conn_failure+0x42/0x149
worker_thread+0x24a/0xbc0
EventManager_
holding: uevent_sock_mutex
waiting: dm_bufio_client->lock
dm_bufio_lock+0xe/0x10
shrink+0x34/0xf7
shrink_slab+0x177/0x5d0
do_try_to_free_pages+0x129/0x470
try_to_free_mem_cgroup_pages+0x14f/0x210
memcg_kmem_newpage_charge+0xa6d/0x13b0
__alloc_pages_nodemask+0x4a3/0x1a70
fallback_alloc+0x1b2/0x36c
__kmalloc_node_track_caller+0xb9/0x10d0
__alloc_skb+0x83/0x2f0
kobject_uevent_env+0x26b/0x419
dm_kobject_uevent+0x70/0x79
dev_suspend+0x1a9/0x1e7
ctl_ioctl+0x3e9/0x411
dm_ctl_ioctl+0x13/0x17
do_vfs_ioctl+0xb3/0x460
SyS_ioctl+0x5e/0x90
MemcgReclaimerD"
holding: dm_bufio_client->lock
waiting: stuck io to finish (needs iscsi_fail thread to progress)
schedule at ffffffffbd603618
io_schedule at ffffffffbd603ba4
do_io_schedule at ffffffffbdaf0d94
__wait_on_bit at ffffffffbd6008a6
out_of_line_wait_on_bit at ffffffffbd600960
wait_on_bit.constprop.10 at ffffffffbdaf0f17
__make_buffer_clean at ffffffffbdaf18ba
__cleanup_old_buffer at ffffffffbdaf192f
shrink at ffffffffbdaf19fd
do_shrink_slab at ffffffffbd6ec000
shrink_slab at ffffffffbd6ec24a
do_try_to_free_pages at ffffffffbd6eda09
try_to_free_mem_cgroup_pages at ffffffffbd6ede7e
mem_cgroup_resize_limit at ffffffffbd7024c0
mem_cgroup_write at ffffffffbd703149
cgroup_file_write at ffffffffbd6d9c6e
sys_write at ffffffffbd6662ea
system_call_fastpath at ffffffffbdbc34a2
Link: https://lore.kernel.org/r/20200520022959.1912856-1-krisman@collabora.com
Reported-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Lee Duncan <lduncan@suse.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Currently UFS host driver promises VCC supply if UFS device needs to do
WriteBooster flush during runtime suspend.
However the UFS specification mentions:
"While the flushing operation is in progress, the device is in Active power
mode."
Therefore UFS host driver needs to promise more: Keep UFS device as "Active
power mode", otherwise UFS device shall not do any flush if device enters
Sleep or PowerDown power mode. Similarly, the same promises shall be
applied if device needs urgent BKOP during runtime suspend.
Fix this by not changing device power mode if WriteBooster flush or urgent
BKOP is required in ufshcd_suspend().
Now, if device finishes its job but is not resumed for a very long time,
system will have unnecessary power drain because VCC is still supplied. A
method to re-check the threshold of keeping VCC supply is required to fix
the power drain. However, the threshold re-check needs to re-activate the
link first because the decision depends on the latest device status.
Also introduce a delayed work to force runtime resume after a certain delay
during runtime suspend. This makes threshold re-check happen natually in
the entry of the next runtime-suspend. The device can continue its
WriteBooster flush or urgent BKOP jobs soon after resumed if device has no
upcoming requests and link enters hibern8 state either by Auto-Hibern8 or
hibern8 during clk-gating scheme. This solution not only prevents power
drain but also makes as much use of time as possible for device's
background jobs.
Link: https://lore.kernel.org/r/20200522083212.4008-5-stanley.chu@mediatek.com
Reviewed-by: Asutosh Das <asutoshd@codeaurora.org>
Signed-off-by: Stanley Chu <stanley.chu@mediatek.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
ufs_qcom_dump_dbg_regs() uses usleep_range, a sleeping function, but can be
called from atomic context in the following flow:
ufshcd_intr -> ufshcd_sl_intr -> ufshcd_check_errors ->
ufshcd_print_host_regs -> ufshcd_vops_dbg_register_dump ->
ufs_qcom_dump_dbg_regs
This causes a boot crash on the Lenovo Miix 630 when the interrupt is
handled on the idle thread.
Fix the issue by switching to udelay().
Link: https://lore.kernel.org/r/20200525204125.46171-1-jeffrey.l.hugo@gmail.com
Fixes: 9c46b86762 ("scsi: ufs-qcom: dump additional testbus registers")
Reviewed-by: Bean Huo <beanhuo@micron.com>
Reviewed-by: Avri Altman <avri.altman@wdc.com>
Signed-off-by: Jeffrey Hugo <jeffrey.l.hugo@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
For non RDPQ mode, the driver allocates a single contiguous block of memory
pool for all reply descriptor post queues and passes down a single address
in the ReplyDescriptorPostQueueAddress field of the IOC Init Request
Message to the firmware. So reply_post queue will have only one entry which
holds the address of this single contiguous block of memory pool.
While allocating the reply descriptor post queue pool, driver should loop
only once in non-RDPQ mode. But the driver is looping for
ioc->reply_queue_count number of times even though reply_post queue's queue
depth is only one in non-RDPQ mode. This leads to 'BUG: KASAN:
use-after-free in base_alloc_rdpq_dma_pool'.
The fix is to loop only once while allocating memory for the reply
descriptor post queue in non-RDPQ mode
Fixes: 8012209eb2 ("scsi: mpt3sas: Handle RDPQ DMA allocation in same 4G region")
Link: https://lore.kernel.org/r/20200522103558.5710-1-suganath-prabu.subramani@broadcom.com
Reported-by: Tomas Henzl <thenzl@redhat.com>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In order to create or activate a new node, lpfc_els_unsol_buffer() invokes
lpfc_nlp_init() or lpfc_enable_node() or lpfc_nlp_get(), all of them will
return a reference of the specified lpfc_nodelist object to "ndlp" with
increased refcnt.
When lpfc_els_unsol_buffer() returns, local variable "ndlp" becomes
invalid, so the refcount should be decreased to keep refcount balanced.
The reference counting issue happens in one exception handling path of
lpfc_els_unsol_buffer(). When "ndlp" in DEV_LOSS, the function forgets to
decrease the refcnt increased by lpfc_nlp_init() or lpfc_enable_node() or
lpfc_nlp_get(), causing a refcnt leak.
Fix this issue by calling lpfc_nlp_put() when "ndlp" in DEV_LOSS.
Link: https://lore.kernel.org/r/1590416184-52592-1-git-send-email-xiyuyang19@fudan.edu.cn
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
vhost-scsi pre-allocates the maximum sg entries per command and if a
command requires more than VHOST_SCSI_PREALLOC_SGLS entries, then that
command is failed by it. This patch lets vhost communicate the max sg limit
when it registers vhost_scsi_ops with TCM. With this change, TCM would
report the max sg entries through "Block Limits" VPD page which will be
typically queried by the SCSI initiator during device discovery. By knowing
this limit, the initiator could ensure the maximum transfer length is less
than or equal to what is reported by vhost-scsi.
Link: https://lore.kernel.org/r/1590166317-953-1-git-send-email-sudhakar.panneerselvam@oracle.com
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Sudhakar Panneerselvam <sudhakar.panneerselvam@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
This patch is in response to a static analyser report from Dan Carpenter
titled: "[bug report] scsi: scsi_debug: Add per_host_store option". This
code may not clear the static analyzer reports, but may shed light on why
they occur. Amongst other things this driver has a table driven SCSI
command parser which also involves some C code. There are some invariants
between the table entries and the corresponding C code (i.e. the resp_*()
functions) that, if broken, may lead to a NULL dereference. And the report
is valid, at least in the case of the PRE-FETCH command. Alas, that is not
one of the cases that the static analyzer reported.
In this particular corner case: when the fake_rw flag is set and the table
entry for a "store"-accessing command does not have the required F_FAKE_RW
flag set, do the following. Call BUG_ON() in the devip2sip() very close to
a comment block explaining why it was called and how to fix it.
checkpatch.pl complains about the BUG_ON() but there is no reasonable
remedial action that can be taken at run time.
This change allows the code reported by the static analyzer to be
simplified. Comments were also added to the table flags (e.g. F_FAKE_RW)
so developers who add commands might be more inclined to use them
(properly).
Link: https://lore.kernel.org/r/20200513013943.25285-1-dgilbert@interlog.com
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
shost_for_each_device(sdev, shost) \
for ((sdev) = __scsi_iterate_devices((shost), NULL); \
(sdev); \
(sdev) = __scsi_iterate_devices((shost), (sdev)))
When terminating shost_for_each_device() iteration with break or return,
scsi_device_put() should be used to prevent stale scsi device references
from being left behind.
Link: https://lore.kernel.org/r/20200518074420.39275-1-yebin10@huawei.com
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Annotate members of FC protocol and firmware dump data structures as big
endian. Annotate members of RISC control structures as little endian.
Annotate mailbox registers as little endian. Annotate the mb[] arrays as
CPU-endian because communication of the mb[] values with the hardware
happens through the readw() and writew() functions. readw() converts from
__le16 to u16 and writew() converts from u16 to __le16. Annotate 'handles'
as CPU-endian because for the firmware these are opaque values.
Link: https://lore.kernel.org/r/20200518211712.11395-15-bvanassche@acm.org
CC: Hannes Reinecke <hare@suse.de>
Cc: Nilesh Javali <njavali@marvell.com>
Cc: Quinn Tran <qutran@marvell.com>
Cc: Martin Wilck <mwilck@suse.com>
Cc: Roman Bolshakov <r.bolshakov@yadro.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Instead of passing an argument to the firmware dumping functions that tells
these functions whether or not to obtain the hardware lock, obtain that
lock before calling these functions. This patch fixes the following
recently introduced C=2 build error:
CHECK drivers/scsi/qla2xxx/qla_tmpl.c
drivers/scsi/qla2xxx/qla_tmpl.c:1133:1: error: Expected ; at end of statement
drivers/scsi/qla2xxx/qla_tmpl.c:1133:1: error: got }
drivers/scsi/qla2xxx/qla_tmpl.h:247:0: error: Expected } at end of function
drivers/scsi/qla2xxx/qla_tmpl.h:247:0: error: got end-of-input
Link: https://lore.kernel.org/r/20200518211712.11395-4-bvanassche@acm.org
Fixes: cbb01c2f2f ("scsi: qla2xxx: Fix MPI failure AEN (8200) handling")
Cc: Arun Easi <aeasi@marvell.com>
Cc: Nilesh Javali <njavali@marvell.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Martin Wilck <mwilck@suse.com>
Cc: Roman Bolshakov <r.bolshakov@yadro.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Suppress the following two compiler warnings because these are not useful:
In file included from ./include/trace/define_trace.h:102,
from ./include/trace/events/qla.h:39,
from drivers/scsi/qla2xxx/qla_dbg.c:77:
./include/trace/events/qla.h: In function 'trace_event_raw_event_qla_log_event':
./include/trace/trace_events.h:691:9: warning: function 'trace_event_raw_event_qla_log_event' might be a candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format]
691 | struct trace_event_raw_##call *entry; \
| ^~~~~~~~~~~~~~~~
./include/trace/events/qla.h:12:1: note: in expansion of macro 'DECLARE_EVENT_CLASS'
12 | DECLARE_EVENT_CLASS(qla_log_event,
| ^~~~~~~~~~~~~~~~~~~
In file included from ./include/trace/define_trace.h:103,
from ./include/trace/events/qla.h:39,
from drivers/scsi/qla2xxx/qla_dbg.c:77:
./include/trace/events/qla.h: In function 'perf_trace_qla_log_event':
./include/trace/perf.h:41:9: warning: function 'perf_trace_qla_log_event' might be a candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format]
41 | struct hlist_head *head; \
| ^~~~~~~~~~
./include/trace/events/qla.h:12:1: note: in expansion of macro 'DECLARE_EVENT_CLASS'
Link: https://lore.kernel.org/r/20200518211712.11395-3-bvanassche@acm.org
Fixes: 598a90f200 ("scsi: qla2xxx: add ring buffer for tracing debug logs")
Cc: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Nilesh Javali <njavali@marvell.com>
Cc: Quinn Tran <qutran@marvell.com>
Cc: Martin Wilck <mwilck@suse.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Arun Easi <aeasi@marvell.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When tcmu queues a new command - no matter whether in command ring or in
qfull_queue - a cmd_id from IDR udev->commands is assigned to the command.
If userspace sends a wrong command completion containing the cmd_id of a
command on the qfull_queue, tcmu_handle_completions() finds the command in
the IDR and calls tcmu_handle_completion() for it. This might do some nasty
things because commands in qfull_queue do not have a valid dbi list.
To fix this bug, we no longer add queued commands to the idr. Instead the
cmd_id is assign when a command is written to the command ring.
Due to this change I had to adapt the source code at several places where
up to now an idr_for_each had been done.
[mkp: fix checkpatch warnings]
Link: https://lore.kernel.org/r/20200518164833.12775-1-bstroesser@ts.fujitsu.com
Acked-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Bodo Stroesser <bstroesser@ts.fujitsu.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
We found out that after phy up, the hardware reports another oob interrupt
but did not follow a phy up interrupt:
oob ready -> phy up -> DEV found -> oob read -> wait phy up -> timeout
We run link reset when wait phy up timeout, and it send a normal disk into
reset processing. So we made some circumvention action in the code, so that
this abnormal oob interrupt will not start the timer to wait for phy up.
Link: https://lore.kernel.org/r/1589552025-165012-2-git-send-email-john.garry@huawei.com
Signed-off-by: Luo Jiaxing <luojiaxing@huawei.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Export through sysfs as a scsi_disk attribute the zoned capabilities of a
disk ("zoned_cap" attribute file). This new attribute indicates in human
readable form (i.e. a string) the zoned block capabilities implemented by
the disk as found in the ZONED field of the disk block device
characteristics VPD page. The possible values are:
- "none": ZONED=00b (not reported), regular disk
- "host-aware": ZONED=01b, host-aware ZBC disk
- "drive-managed": ZONED=10b, drive-managed ZBC disk (regular disk
interface)
For completeness, also add the following value which is detected using the
device type rather than the ZONED field:
- "host-managed": device type = 0x14 (TYPE_ZBC), host-managed ZBC disk
This new sysfs attribute is purely informational and complementary to the
"zoned" device request queue sysfs attribute as it allows applications and
user daemons (e.g. udev) to easily differentiate regular disks from
drive-managed SMR disks without the need for direct access tools such as
provided by sg3utils.
Link: https://lore.kernel.org/r/20200515054856.1408575-1-damien.lemoal@wdc.com
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
At the moment we allocate and register the Scsi_Host object corresponding
to a zfcp adapter (FCP device) very early in the life cycle of the adapter
- even before we fully discover and initialize the underlying
firmware/hardware. This had the advantage that we could already use the
Scsi_Host object, and fill in all its information during said discover and
initialize.
Due to commit 737eb78e82 ("block: Delay default elevator initialization")
(first released in v5.4), we noticed a regression that would prevent us
from using any storage volume if zfcp is configured with support for DIF or
DIX (zfcp.dif=1 || zfcp.dix=1). Doing so would result in an illegal memory
access as soon as the first request is sent with such an configuration. As
example for a crash resulting from this:
scsi host0: scsi_eh_0: sleeping
scsi host0: zfcp
qdio: 0.0.1900 ZFCP on SC 4bd using AI:1 QEBSM:0 PRI:1 TDD:1 SIGA: W AP
scsi 0:0:0:0: scsi scan: INQUIRY pass 1 length 36
Unable to handle kernel pointer dereference in virtual kernel address space
Failing address: 0000000000000000 TEID: 0000000000000483
Fault in home space mode while using kernel ASCE.
AS:0000000035c7c007 R3:00000001effcc007 S:00000001effd1000 P:000000000000003d
Oops: 0004 ilc:3 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: ...
CPU: 1 PID: 783 Comm: kworker/u760:5 Kdump: loaded Not tainted 5.6.0-rc2-bb-next+ #1
Hardware name: ...
Workqueue: scsi_wq_0 fc_scsi_scan_rport [scsi_transport_fc]
Krnl PSW : 0704e00180000000 000003ff801fcdae (scsi_queue_rq+0x436/0x740 [scsi_mod])
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
Krnl GPRS: 0fffffffffffffff 0000000000000000 0000000187150120 0000000000000000
000003ff80223d20 000000000000018e 000000018adc6400 0000000187711000
000003e0062337e8 00000001ae719000 0000000187711000 0000000187150000
00000001ab808100 0000000187150120 000003ff801fcd74 000003e0062336a0
Krnl Code: 000003ff801fcd9e: e310a35c0012 lt %r1,860(%r10)
000003ff801fcda4: a7840010 brc 8,000003ff801fcdc4
#000003ff801fcda8: e310b2900004 lg %r1,656(%r11)
>000003ff801fcdae: d71710001000 xc 0(24,%r1),0(%r1)
000003ff801fcdb4: e310b2900004 lg %r1,656(%r11)
000003ff801fcdba: 41201018 la %r2,24(%r1)
000003ff801fcdbe: e32010000024 stg %r2,0(%r1)
000003ff801fcdc4: b904002b lgr %r2,%r11
Call Trace:
[<000003ff801fcdae>] scsi_queue_rq+0x436/0x740 [scsi_mod]
([<000003ff801fcd74>] scsi_queue_rq+0x3fc/0x740 [scsi_mod])
[<00000000349c9970>] blk_mq_dispatch_rq_list+0x390/0x680
[<00000000349d1596>] blk_mq_sched_dispatch_requests+0x196/0x1a8
[<00000000349c7a04>] __blk_mq_run_hw_queue+0x144/0x160
[<00000000349c7ab6>] __blk_mq_delay_run_hw_queue+0x96/0x228
[<00000000349c7d5a>] blk_mq_run_hw_queue+0xd2/0xe0
[<00000000349d194a>] blk_mq_sched_insert_request+0x192/0x1d8
[<00000000349c17b8>] blk_execute_rq_nowait+0x80/0x90
[<00000000349c1856>] blk_execute_rq+0x6e/0xb0
[<000003ff801f8ac2>] __scsi_execute+0xe2/0x1f0 [scsi_mod]
[<000003ff801fef98>] scsi_probe_and_add_lun+0x358/0x840 [scsi_mod]
[<000003ff8020001c>] __scsi_scan_target+0xc4/0x228 [scsi_mod]
[<000003ff80200254>] scsi_scan_target+0xd4/0x100 [scsi_mod]
[<000003ff802d8b96>] fc_scsi_scan_rport+0x96/0xc0 [scsi_transport_fc]
[<0000000034245ce8>] process_one_work+0x458/0x7d0
[<00000000342462a2>] worker_thread+0x242/0x448
[<0000000034250994>] kthread+0x15c/0x170
[<0000000034e1979c>] ret_from_fork+0x30/0x38
INFO: lockdep is turned off.
Last Breaking-Event-Address:
[<000003ff801fbc36>] scsi_add_cmd_to_list+0x9e/0xa8 [scsi_mod]
Kernel panic - not syncing: Fatal exception: panic_on_oops
While this issue is exposed by the commit named above, this is only by
accident. The real issue exists for longer already - basically since it's
possible to use blk-mq via scsi-mq, and blk-mq pre-allocates all requests
for a tag-set during initialization of the same. For a given Scsi_Host
object this is done when adding the object to the midlayer
(`scsi_add_host()` and such). In `scsi_mq_setup_tags()` the midlayer
calculates how much memory is required for a single scsi_cmnd, and its
additional data, which also might include space for additional protection
data - depending on whether the Scsi_Host has any form of protection
capabilities (`scsi_host_get_prot()`).
The problem is now thus, because zfcp does this step before we actually
know whether the firmware/hardware has these capabilities, we don't set any
protection capabilities in the Scsi_Host object. And so, no space is
allocated for additional protection data for requests in the Scsi_Host
tag-set.
Once we go through discover and initialize the FCP device firmware/hardware
fully (this is done via the firmware commands "Exchange Config Data" and
"Exchange Port Data") we find out whether it actually supports DIF and DIX,
and we set the corresponding capabilities in the Scsi_Host object (in
`zfcp_scsi_set_prot()`). Now the Scsi_Host potentially has protection
capabilities, but the already allocated requests in the tag-set don't have
any space allocated for that.
When we then trigger target scanning or add scsi_devices manually, the
midlayer will use requests from that tag-set, and before sending most
requests, it will also call `scsi_mq_prep_fn()`. To prepare the scsi_cmnd
this function will check again whether the used Scsi_Host has any
protection capabilities - and now it potentially has - and if so, it will
try to initialize the assumed to be preallocated structures and thus it
causes the crash, like shown above.
Before delaying the default elevator initialization with the commit named
above, we always would also allocate an elevator for any scsi_device before
ever sending any requests - in contrast to now, where we do it after
device-probing. That elevator in turn would have its own tag-set, and that
is initialized after we went through discovery and initialization of the
underlying firmware/hardware. So requests from that tag-set can be
allocated properly, and if used - unless the user changes/disabled the
default elevator - this would hide the underlying issue.
To fix this for any configuration - with or without an elevator - we move
the allocation and registration of the Scsi_Host object for a given FCP
device to after the first complete discovery and initialization of the
underlying firmware/hardware. By doing that we can make all basic
properties of the Scsi_Host known to the midlayer by the time we call
`scsi_add_host()`, including whether we have any protection capabilities.
To do that we have to delay all the accesses that we would have done in the
past during discovery and initialization, and do them instead once we are
finished with it. The previous patches ramp up to this by fencing and
factoring out all these accesses, and make it possible to re-do them later
on. In addition we make also use of the diagnostic buffers we recently
added with
commit 92953c6e0a ("scsi: zfcp: signal incomplete or error for sync exchange config/port data")
commit 7e418833e6 ("scsi: zfcp: diagnostics buffer caching and use for exchange port data")
commit 088210233e ("scsi: zfcp: add diagnostics buffer for exchange config data")
(first released in v5.5), because these already cache all the information
we need for that "re-do operation" - the information cached are always
updated during xconf or xport data, so it won't be stale.
In addition to the move and re-do, this patch also updates the
function-documentation of `zfcp_scsi_adapter_register()` and changes how it
reports if a Scsi_Host object already exists. In that case future
recovery-operations can skip this step completely and behave much like they
would do in the past - zfcp does not release a once allocated Scsi_Host
object unless the corresponding FCP device is deconstructed completely.
Link: https://lore.kernel.org/r/030dd6da318bbb529f0b5268ec65cebcd20fc0a3.1588956679.git.bblock@linux.ibm.com
Reviewed-by: Steffen Maier <maier@linux.ibm.com>
Signed-off-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When setting an adapter online for the first time, we also create a couple
of entries for it in the sysfs device tree. This is also true even if the
adapter has not yet ever gone successfully through exchange config and
exchange port data.
When moving the scsi host object allocation and registration to after the
first exchange config and exchange port data, this make the `port_rescan`
attribute susceptible to invalid pointer-dereferences of the shost field
before the adapter is fully initialized.
When written to, it schedules a `scan_work` item that will in turn make use
of the associated fibre channel host object to check the topology used for
this FCP device.
Because scanning for remote ports can't be done successfully without
completing exchange config and exchange port data first, we can simply
fence `port_rescan`, and so prevent the illegal access.
As with cases where we can't get a reference to the adapter, we also return
-ENODEV here. Applications need to handle that errno today already.
After a successful allocation of the scsi host object nothing changes in
the work flow.
Link: https://lore.kernel.org/r/ef65366d309993ca91b6917727590ca7ca166c8f.1588956679.git.bblock@linux.ibm.com
Reviewed-by: Steffen Maier <maier@linux.ibm.com>
Signed-off-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>