For shared sbitmap, if the call to blk_mq_tag_update_depth() was
successful for any hctx when hctx->sched_tags is not set, then it would be
successful for all (due to nature in which blk_mq_tag_update_depth()
fails).
As such, there is no need to call blk_mq_tag_resize_shared_sbitmap() for
each hctx. So relocate the call until after the hctx iteration under the
!q->elevator check, which is equivalent (to !hctx->sched_tags).
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-4-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In addition to reverting commit 7b05bf7710 ("Revert "block/mq-deadline:
Prioritize high-priority requests""), this patch uses 'jiffies' instead
of ktime_get() in the code for aging lower priority requests.
This patch has been tested as follows:
Measured QD=1/jobs=1 IOPS for nullb with the mq-deadline scheduler.
Result without and with this patch: 555 K IOPS.
Measured QD=1/jobs=8 IOPS for nullb with the mq-deadline scheduler.
Result without and with this patch: about 380 K IOPS.
Ran the following script:
set -e
scriptdir=$(dirname "$0")
if [ -e /sys/module/scsi_debug ]; then modprobe -r scsi_debug; fi
modprobe scsi_debug ndelay=1000000 max_queue=16
sd=''
while [ -z "$sd" ]; do
sd=$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
done
echo $((100*1000)) > "/sys/block/$sd/queue/iosched/prio_aging_expire"
if [ -e /sys/fs/cgroup/io.prio.class ]; then
cd /sys/fs/cgroup
echo restrict-to-be >io.prio.class
echo +io > cgroup.subtree_control
else
cd /sys/fs/cgroup/blkio/
echo restrict-to-be >blkio.prio.class
fi
echo $$ >cgroup.procs
mkdir -p hipri
cd hipri
if [ -e io.prio.class ]; then
echo none-to-rt >io.prio.class
else
echo none-to-rt >blkio.prio.class
fi
{ "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/low-pri.txt & }
echo $$ >cgroup.procs
"${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/hi-pri.txt
Result:
* 11000 IOPS for the high-priority job
* 40 IOPS for the low-priority job
If the prio aging expiry time is changed from 100s into 0, the IOPS results
change into 6712 and 6796 IOPS.
The max-iops script is a script that runs fio with the following arguments:
--bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
--norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
--iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
--iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
--filename=${positional_argument_1}
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20210927220328.1410161-5-bvanassche@acm.org
[axboe: @latest -> @latest_start]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Calculating the sum over all CPUs of per-CPU counters frequently is
inefficient. Hence switch from per-CPU to individual counters. Three
counters are protected by the mq-deadline spinlock since these are
only accessed from contexts that already hold that spinlock. The fourth
counter is atomic because protecting it with the mq-deadline spinlock
would trigger lock contention.
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210927220328.1410161-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The scheduler .insert_requests() callback is called when a request is
queued for the first time and also when it is requeued. Only count a
request the first time it is queued. Additionally, since the mq-deadline
scheduler only performs zone locking for requests that have been
inserted, skip the zone unlock code for requests that have not been
inserted into the mq-deadline scheduler.
Fixes: 38ba64d12d ("block/mq-deadline: Track I/O statistics")
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210927220328.1410161-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Except for the features passed to blk_queue_required_elevator_features,
elevator.h is only needed internally to the block layer. Move the
ELEVATOR_F_* definitions to blkdev.h, and the move elevator.h to
block/, dropping all the spurious includes outside of that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull libata fixes from Damien Le Moal:
"Two fixes for this cycle:
- Fix a null pointer dereference in ahci-platform driver (from Hai)
- Fix uninitialized variables in pata_legacy driver (from Dan)"
* tag 'libata-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
ata: ahci_platform: fix null-ptr-deref in ahci_platform_enable_regulators()
pata_legacy: fix a couple uninitialized variable bugs
Pull block fixes from Jens Axboe:
"Bigger than usual for this point in time, the majority is fixing some
issues around BDI lifetimes with the move from the request_queue to
the disk in this release. In detail:
- Series on draining fs IO for del_gendisk() (Christoph)
- NVMe pull request via Christoph:
- fix the abort command id (Keith Busch)
- nvme: fix per-namespace chardev deletion (Adam Manzanares)
- brd locking scope fix (Tetsuo)
- BFQ fix (Paolo)"
* tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-block:
block, bfq: reset last_bfqq_created on group change
block: warn when putting the final reference on a registered disk
brd: reduce the brd_devices_mutex scope
kyber: avoid q->disk dereferences in trace points
block: keep q_usage_counter in atomic mode after del_gendisk
block: drain file system I/O on del_gendisk
block: split bio_queue_enter from blk_queue_enter
block: factor out a blk_try_enter_queue helper
block: call submit_bio_checks under q_usage_counter
nvme: fix per-namespace chardev deletion
block/rnbd-clt-sysfs: fix a couple uninitialized variable bugs
nvme-pci: Fix abort command id
Pull io_uring fix from Jens Axboe:
"Just a single fix for a wrong condition for grabbing a lock, a
regression in this merge window"
* tag 'io_uring-5.15-2021-10-17' of git://git.kernel.dk/linux-block:
io_uring: fix wrong condition to grab uring lock
Pull virtio fixes from Michael Tsirkin:
"Fixes up some issues in rc5"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost-vdpa: Fix the wrong input in config_cb
VDUSE: fix documentation underline warning
Revert "virtio-blk: Add validation for block size in config space"
vhost_vdpa: unset vq irq before freeing irq
virtio: write back F_VERSION_1 before validate
Pull powerpc fixes from Michael Ellerman:
- Fix a bug where guests on P9 with interrupts passed through could get
stuck in synchronize_irq().
- Fix a bug in KVM on P8 where secondary threads entering a guest would
write outside their allocated stack.
- Fix a bug in KVM on P8 where secondary threads could confuse the host
offline code and cause the guest or host to crash.
Thanks to Cédric Le Goater.
* tag 'powerpc-5.15-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
KVM: PPC: Book3S HV: Make idle_kvm_start_guest() return 0 if it went to guest
KVM: PPC: Book3S HV: Fix stack handling in idle_kvm_start_guest()
powerpc/xive: Discard disabled interrupts in get_irqchip_state()
Pull objtool fixes from Borislav Petkov:
- Update section headers before the respective relocations to not
trigger a safety check in elftoolchain's implementation of libelf
- Do not add garbage data to the .rela.orc_unwind_ip section
* tag 'objtool_urgent_for_v5.15_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Update section header before relocations
objtool: Check for gelf_update_rel[a] failures
Pull EDAC fix from Borislav Petkov:
- Log the "correct" uncorrectable error count in the armada_xp driver
* tag 'edac_urgent_for_v5.15_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/armada-xp: Fix output of uncorrectable error counter
Pull perf fix from Borislav Petkov:
- Add Sapphire Rapids to the list of CPUs supporting the SMI count MSR
* tag 'perf_urgent_for_v5.15_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/msr: Add Sapphire Rapids CPU support
Pull EFI fixes from Borislav Petkov:
"Forwarded from Ard Biesheuvel through the tip tree. Ard will send
stuff directly in the near future.
Low priority fixes but fixes nonetheless:
- update stub diagnostic print that is no longer accurate
- avoid statically allocated buffer for CPER error record decoding
- avoid sleeping on the efi_runtime semaphore when calling the
ResetSystem EFI runtime service"
* tag 'efi-urgent-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: Change down_interruptible() in virt_efi_reset_system() to down_trylock()
efi/cper: use stack buffer for error record decoding
efi/libstub: Simplify "Exiting bootservices" message
Pull x86 fixes from Borislav Petkov:
- Do not enable AMD memory encryption in Kconfig by default due to
shortcomings of some platforms, leading to boot failures.
- Mask out invalid bits in the MXCSR for 32-bit kernels again because
Thomas and I don't know how to mask out bits properly. Third time's
the charm.
* tag 'x86_urgent_for_v5.15_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Mask out the invalid MXCSR bits properly
x86/Kconfig: Do not enable AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT automatically
Pull driver core fixes from Greg KH:
"Here are some small driver core fixes for 5.15-rc6, all of which have
been in linux-next for a while with no reported issues.
They include:
- kernfs negative dentry bugfix
- simple pm bus fixes to resolve reported issues"
* tag 'driver-core-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
drivers: bus: Delete CONFIG_SIMPLE_PM_BUS
drivers: bus: simple-pm-bus: Add support for probing simple bus only devices
driver core: Reject pointless SYNC_STATE_ONLY device links
kernfs: don't create a negative dentry if inactive node exists
Pull char/misc driver fixes from Greg KH:
"Here are some small char/misc driver fixes for 5.15-rc6 for reported
issues that include:
- habanalabs driver fixes
- mei driver fixes and new ids
- fpga new device ids
- MAINTAINER file updates for fpga subsystem
- spi module id table additions and fixes
- fastrpc locking fixes
- nvmem driver fix
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
eeprom: 93xx46: fix MODULE_DEVICE_TABLE
nvmem: Fix shift-out-of-bound (UBSAN) with byte size cells
mei: hbm: drop hbm responses on early shutdown
mei: me: add Ice Lake-N device id.
eeprom: 93xx46: Add SPI device ID table
eeprom: at25: Add SPI ID table
misc: HI6421V600_IRQ should depend on HAS_IOMEM
misc: fastrpc: Add missing lock before accessing find_vma()
cb710: avoid NULL pointer subtraction
misc: gehc: Add SPI ID table
MAINTAINERS: Drop outdated FPGA Manager website
MAINTAINERS: Add Hao and Yilun as maintainers
habanalabs: fix resetting args in wait for CS IOCTL
fpga: ice40-spi: Add SPI device ID table
Pull staging and IIO driver fixes from Greg KH:
"Here are a number of small IIO and staging driver fixes for 5.15-rc6.
They include:
- vc04_services bugfix for reported problem
- r8188eu array underflow fix
- iio driver fixes for a lot of tiny reported issues.
All of these have been in linux-next for a while with no reported
issues"
* tag 'staging-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: r8188eu: prevent array underflow in rtw_hal_update_ra_mask()
staging: vc04_services: shut up out-of-range warning
iio: light: opt3001: Fixed timeout error when 0 lux
iio: adis16480: fix devices that do not support sleep mode
iio: mtk-auxadc: fix case IIO_CHAN_INFO_PROCESSED
iio: adis16475: fix deadlock on frequency set
iio: ssp_sensors: add more range checking in ssp_parse_dataframe()
iio: ssp_sensors: fix error code in ssp_print_mcu_debug()
iio: adc: ad7793: Fix IRQ flag
iio: adc: ad7780: Fix IRQ flag
iio: adc: ad7192: Add IRQ flag
iio: adc: aspeed: set driver data when adc probe.
iio: adc: rzg2l_adc: add missing clk_disable_unprepare() in rzg2l_adc_pm_runtime_resume()
iio: adc: max1027: Fix the number of max1X31 channels
iio: adc: max1027: Fix wrong shift with 12-bit devices
iio: adc128s052: Fix the error handling path of 'adc128_probe()'
iio: adc: rzg2l_adc: Fix -EBUSY timeout error return
iio: accel: fxls8962af: return IRQ_HANDLED when fifo is flushed
iio: dac: ti-dac5571: fix an error code in probe()
Pull serial driver fix from Greg KH:
"Here is a single 8250 Kconfig fix for 5.15-rc6 that resolves a
regression that showed up in 5.15-rc1. It has been in linux-next for a
while with no reported issues"
* tag 'tty-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
serial: 8250: allow disabling of Freescale 16550 compile test
Pull USB fixes from Greg KH:
"Here are some small USB fixes that resolve a number of tiny issues.
They include:
- new USB serial driver ids
- xhci driver fixes for a bunch of issues
- musb error path fixes.
All of these have been in linux-next for a while with no reported
issues"
* tag 'usb-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: musb: dsps: Fix the probe error path
xhci: Enable trust tx length quirk for Fresco FL11 USB controller
xhci: Fix command ring pointer corruption while aborting a command
USB: xhci: dbc: fix tty registration race
xhci: add quirk for host controllers that don't update endpoint DCS
xhci: guard accesses to ep_state in xhci_endpoint_reset()
USB: serial: qcserial: add EM9191 QDL support
USB: serial: option: add Quectel EC200S-CN module support
USB: serial: option: add prod. id for Quectel EG91
USB: serial: option: add Telit LE910Cx composition 0x1204
Pull input fixes from Dmitry Torokhov:
- a new product ID for the xpad joystick driver
- fixes to resistive-adc-touch and snvs_pwrkey drivers
- a change to touchscreen helpers to make clang happier
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: touchscreen - avoid bitwise vs logical OR warning
Input: xpad - add support for another USB ID of Nacon GC-100
Input: resistive-adc-touch - fix division by zero error on z1 == 0
Input: snvs_pwrkey - add clk handling
Since commit 430a67f9d6 ("block, bfq: merge bursts of newly-created
queues"), BFQ maintains a per-group pointer to the last bfq_queue
created. If such a queue, say bfqq, happens to move to a different
group, then bfqq is no more a valid last bfq_queue created for its
previous group. That pointer must then be cleared. Not resetting such
a pointer may also cause UAF, if bfqq happens to also be freed after
being moved to a different group. This commit performs this missing
reset. As such it fixes commit 430a67f9d6 ("block, bfq: merge bursts
of newly-created queues").
Such a missing reset is most likely the cause of the crash reported in [1].
With some analysis, we found that this crash was due to the
above UAF. And such UAF did go away with this commit applied [1].
Anyway, before this commit, that crash happened to be triggered in
conjunction with commit 2d52c58b9c ("block, bfq: honor already-setup
queue merges"). The latter was then reverted by commit ebc69e897e
("Revert "block, bfq: honor already-setup queue merges""). Yet commit
2d52c58b9c ("block, bfq: honor already-setup queue merges") contains
no error related with the above UAF, and can then be restored.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=214503
Fixes: 430a67f9d6 ("block, bfq: merge bursts of newly-created queues")
Tested-by: Grzegorz Kowal <custos.mentis@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20211015144336.45894-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Warn when the last reference on a live disk is put without calling
del_gendisk first. There are some BDI related bug reports that look
like a case of this, so make sure we have the proper instrumentation
to catch it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014130231.1468538-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull perf tools fixes from Arnaldo Carvalho de Melo:
- Fix 'perf test evsel' build error on !x86 architectures
- Fix libperf's test_stat_cpu mixup of CPU numbers and CPU indexes
- Output offsets for decompressed records, not just useless zeros
* tag 'perf-tools-fixes-for-v5.15-2021-10-16' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
libperf tests: Fix test_stat_cpu
libperf test evsel: Fix build error on !x86 architectures
perf report: Output non-zero offset for decompressed records
Pull memblock fix from Mike Rapoport:
"Fix handling of NOMAP regions with kmemleak.
NOMAP regions don't have linear map entries so an attempt to scan
these areas in kmemleak would fault.
Prevent such faults by excluding NOMAP regions from kmemleak"
* tag 'fixes-2021-10-16' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
memblock: exclude NOMAP regions from kmemleak