Commit Graph

132747 Commits

Author SHA1 Message Date
Pavel Begunkov
17220ca5ce block: cache request queue in bdev
There are tons of places where we need to get a request_queue only
having bdev, which turns into bdev->bd_disk->queue. There are probably a
hundred of such places considering inline helpers, and enough of them
are in hot paths.

Cache queue pointer in struct block_device and make use of it in
bdev_get_queue().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a3bfaecdd28956f03629d0ca5c63ebc096e1c809.1634219547.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
3e08773c38 block: switch polling to be bio based
Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.

Polling for the bio itself leads to a few advantages:

 - the cookie construction can made entirely private in blk-mq.c
 - the caller does not need to remember the request_queue and cookie
   separately and thus sidesteps their lifetime issues
 - keeping the device and the cookie inside the bio allows to trivially
   support polling BIOs remapping by stacking drivers
 - a lot of code to propagate the cookie back up the submission path can
   be removed entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Ming Lei
19416123ab block: define 'struct bvec_iter' as packed
'struct bvec_iter' is embedded into 'struct bio', define it as packed
so that we can get one extra 4bytes for other uses without expanding
bio.

'struct bvec_iter' is often allocated on stack, so making it packed
doesn't affect performance. Also I have run io_uring on both
nvme/null_blk, and not observe performance effect in this way.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
6ce913fe3e block: rename REQ_HIPRI to REQ_POLLED
Unlike the RWF_HIPRI userspace ABI which is intentionally kept vague,
the bio flag is specific to the polling implementation, so rename and
document it properly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
d729cf9acb io_uring: don't sleep when polling for I/O
There is no point in sleeping for the expected I/O completion timeout
in the io_uring async polling model as we never poll for a specific
I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
ef99b2d376 block: replace the spin argument to blk_iopoll with a flags argument
Switch the boolean spin argument to blk_poll to passing a set of flags
instead.  This will allow to control polling behavior in a more fine
grained way.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-10-hch@lst.de
[axboe: adapt to changed io_uring iopoll]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
28a1ae6b9d blk-mq: remove blk_qc_t_valid
Move the trivial check into the only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
efbabbe121 blk-mq: remove blk_qc_t_to_tag and blk_qc_t_is_internal
Merge both functions into their only caller to keep the blk-mq tag to
blk_qc_t mapping as private as possible in blk-mq.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
f70299f0d5 blk-mq: factor out a blk_qc_to_hctx helper
Add a helper to get the hctx from a request_queue and cookie, and fold
the blk_qc_t_to_queue_num helper into it as no other callers are left.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Jens Axboe
9672b0d437 sbitmap: add __sbitmap_queue_get_batch()
The block layer tag allocation batching still calls into sbitmap to get
each tag, but we can improve on that. Add __sbitmap_queue_get_batch(),
which returns a mask of tags all at once, along with an offset for
those tags.

An example return would be 0xff, where bits 0..7 are set, with
tag_offset == 128. The valid tags in this case would be 128..135.

A batch is specific to an individual sbitmap_map, hence it cannot be
larger than that. The requested number of tags is automatically reduced
to the max that can be satisfied with a single map.

On failure, 0 is returned. Caller should fall back to single tag
allocation at that point/

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Christoph Hellwig
4f7ab09a1c block: mark bio_truncate static
bio_truncate is only used in bio.c, so mark it static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Christoph Hellwig
ff18d77b5f block: move bio_get_{first,last}_bvec out of bio.h
bio_get_first_bvec and bio_get_last_bvec are only used in blk-merge.c,
so move them there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Christoph Hellwig
9774b39175 block: mark __bio_try_merge_page static
Mark __bio_try_merge_page static and move it up a bit to avoid the need
for a forward declaration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Christoph Hellwig
9a6083becb block: move bio_full out of bio.h
bio_full is only used in bio.c, so move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Christoph Hellwig
b6559d8f9f block: fold bio_cur_bytes into blk_rq_cur_bytes
Fold bio_cur_bytes into the only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Christoph Hellwig
8addffd657 block: move bio_mergeable out of bio.h
bio_mergeable is only needed by I/O schedulers, so move it to
blk-mq-sched.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Christoph Hellwig
11d9cab1ca block: don't include <linux/ioprio.h> in <linux/bio.h>
bio.h doesn't need any of the definitions from ioprio.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:34 -06:00
Christoph Hellwig
9e8c0d0d4d block: remove BIO_BUG_ON
BIO_DEBUG is always defined, so just switch the two instances to use
BUG_ON directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:34 -06:00
Christoph Hellwig
84b8514b46 block: move the *blkdev_ioctl declarations out of blkdev.h
These are only used inside of block/.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104450.659013-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:34 -06:00
Jens Axboe
47c122e35d block: pre-allocate requests if plug is started and is a batch
The caller typically has a good (or even exact) idea of how many requests
it needs to submit. We can make the request/tag allocation a lot more
efficient if we just allocate N requests/tags upfront when we queue the
first bio from the batch.

Provide a new plug start helper that allows the caller to specify how many
IOs are expected. This sets plug->nr_ios, and we can use that for smarter
request allocation. The plug provides a holding spot for requests, and
request allocation will check it before calling into the normal request
allocation path.

The blk_finish_plug() is called, check if there are unused requests and
free them. This should not happen in normal operations. The exception is
if we get merging, then we may be left with requests that need freeing
when done.

This raises the per-core performance on my setup from ~5.8M to ~6.1M
IOPS.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:03 -06:00
Jens Axboe
ba0ffdd8ce block: bump max plugged deferred size from 16 to 32
Particularly for NVMe with efficient deferred submission for many
requests, there are nice benefits to be seen by bumping the default max
plug count from 16 to 32. This is especially true for virtualized setups,
where the submit part is more expensive. But can be noticed even on
native hardware.

Reduce the multiple queue factor from 4 to 2, since we're changing the
default size.

While changing it, move the defines into the block layer private header.
These aren't values that anyone outside of the block layer uses, or
should use.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:03 -06:00
John Garry
079a2e3e86 blk-mq: Change shared sbitmap naming to shared tags
Now that shared sbitmap support really means shared tags, rename symbols
to match that.

Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1633429419-228500-15-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:03 -06:00
John Garry
e155b0c238 blk-mq: Use shared tags for shared sbitmap support
Currently we use separate sbitmap pairs and active_queues atomic_t for
shared sbitmap support.

However a full sets of static requests are used per HW queue, which is
quite wasteful, considering that the total number of requests usable at
any given time across all HW queues is limited by the shared sbitmap depth.

As such, it is considerably more memory efficient in the case of shared
sbitmap to allocate a set of static rqs per tag set or request queue, and
not per HW queue.

So replace the sbitmap pairs and active_queues atomic_t with a shared
tags per tagset and request queue, which will hold a set of shared static
rqs.

Since there is now no valid HW queue index to be passed to the blk_mq_ops
.init and .exit_request callbacks, pass an invalid index token. This
changes the semantics of the APIs, such that the callback would need to
validate the HW queue index before using it. Currently no user of shared
sbitmap actually uses the HW queue index (as would be expected).

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-13-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:03 -06:00
John Garry
d2a27964e6 block: Rename BLKDEV_MAX_RQ -> BLKDEV_DEFAULT_RQ
It is a bit confusing that there is BLKDEV_MAX_RQ and MAX_SCHED_RQ, as
the name BLKDEV_MAX_RQ would imply the max requests always, which it is
not.

Rename to BLKDEV_MAX_RQ to BLKDEV_DEFAULT_RQ, matching its usage - that being
the default number of requests assigned when allocating a request queue.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
Christoph Hellwig
24b83deb29 block: move struct request to blk-mq.h
struct request is only used by blk-mq drivers, so move it and all
related declarations to blk-mq.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
Christoph Hellwig
fe45e630a1 block: move integrity handling out of <linux/blkdev.h>
Split the integrity/metadata handling definitions out into a new header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
Christoph Hellwig
badf7f6437 block: move a few merge helpers out of <linux/blkdev.h>
These are block-layer internal helpers, so move them to block/blk.h and
block/blk-merge.c.  Also update a comment a bit to use better grammar.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
Christoph Hellwig
b81e0c2372 block: drop unused includes in <linux/genhd.h>
Drop various include not actually used in genhd.h itself, and
move the remaning includes closer together.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
Christoph Hellwig
3ab0bc78e9 block: drop unused includes in <linux/blkdev.h>
Drop various include not actually used in blkdev.h itself.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Christoph Hellwig
2e9bc3465a block: move elevator.h to block/
Except for the features passed to blk_queue_required_elevator_features,
elevator.h is only needed internally to the block layer.  Move the
ELEVATOR_F_* definitions to blkdev.h, and the move elevator.h to
block/, dropping all the spurious includes outside of that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Christoph Hellwig
9778ac77c2 block: remove the struct blk_queue_ctx forward declaration
This type doesn't exist at all, so no need to forward declare it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Christoph Hellwig
713e4e1108 block: remove the cmd_size field from struct request_queue
Entirely unused.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Christoph Hellwig
90138237a5 block: remove the unused blk_queue_state enum
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Christoph Hellwig
1d9433cdd0 block: remove the unused rq_end_sector macro
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Christoph Hellwig
ccdf774189 mm: don't include <linux/blkdev.h> in <linux/backing-dev.h>
Move inode_to_bdi out of line to avoid having to include blkdev.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Christoph Hellwig
e41d12f539 mm: don't include <linux/blk-cgroup.h> in <linux/backing-dev.h>
There is no need to pull blk-cgroup.h and thus blkdev.h in here, so
break the include chain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Christoph Hellwig
348332e000 mm: don't include <linux/blk-cgroup.h> in <linux/writeback.h>
blk-cgroup.h pulls in blkdev.h and thus pretty much all the block
headers.  Break this dependency chain by turning wbc_blkcg_css into a
macro and dropping the blk-cgroup.h include.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Linus Torvalds
f2b3420b92 Merge tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Bigger than usual for this point in time, the majority is fixing some
  issues around BDI lifetimes with the move from the request_queue to
  the disk in this release. In detail:

   - Series on draining fs IO for del_gendisk() (Christoph)

   - NVMe pull request via Christoph:
        - fix the abort command id (Keith Busch)
        - nvme: fix per-namespace chardev deletion (Adam Manzanares)

   - brd locking scope fix (Tetsuo)

   - BFQ fix (Paolo)"

* tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-block:
  block, bfq: reset last_bfqq_created on group change
  block: warn when putting the final reference on a registered disk
  brd: reduce the brd_devices_mutex scope
  kyber: avoid q->disk dereferences in trace points
  block: keep q_usage_counter in atomic mode after del_gendisk
  block: drain file system I/O on del_gendisk
  block: split bio_queue_enter from blk_queue_enter
  block: factor out a blk_try_enter_queue helper
  block: call submit_bio_checks under q_usage_counter
  nvme: fix per-namespace chardev deletion
  block/rnbd-clt-sysfs: fix a couple uninitialized variable bugs
  nvme-pci: Fix abort command id
2021-10-17 19:25:20 -10:00
Linus Torvalds
e3572dff12 Merge tag 'char-misc-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
 "Here are some small char/misc driver fixes for 5.15-rc6 for reported
  issues that include:

   - habanalabs driver fixes

   - mei driver fixes and new ids

   - fpga new device ids

   - MAINTAINER file updates for fpga subsystem

   - spi module id table additions and fixes

   - fastrpc locking fixes

   - nvmem driver fix

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  eeprom: 93xx46: fix MODULE_DEVICE_TABLE
  nvmem: Fix shift-out-of-bound (UBSAN) with byte size cells
  mei: hbm: drop hbm responses on early shutdown
  mei: me: add Ice Lake-N device id.
  eeprom: 93xx46: Add SPI device ID table
  eeprom: at25: Add SPI ID table
  misc: HI6421V600_IRQ should depend on HAS_IOMEM
  misc: fastrpc: Add missing lock before accessing find_vma()
  cb710: avoid NULL pointer subtraction
  misc: gehc: Add SPI ID table
  MAINTAINERS: Drop outdated FPGA Manager website
  MAINTAINERS: Add Hao and Yilun as maintainers
  habanalabs: fix resetting args in wait for CS IOCTL
  fpga: ice40-spi: Add SPI device ID table
2021-10-17 17:14:00 -10:00
Christoph Hellwig
c41108049d kyber: avoid q->disk dereferences in trace points
q->disk becomes invalid after the gendisk is removed.  Work around this
by caching the dev_t for the tracepoints.  The real fix would be to
properly tear down the I/O schedulers with the gendisk, but that is
a much more invasive change.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012093301.GA27795@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15 21:02:57 -06:00
Christoph Hellwig
8e141f9eb8 block: drain file system I/O on del_gendisk
Instead of delaying draining of file system I/O related items like the
blk-qos queues, the integrity read workqueue and timeouts only when the
request_queue is removed, do that when del_gendisk is called.  This is
important for SCSI where the upper level drivers that control the gendisk
are separate entities, and the disk can be freed much earlier than the
request_queue, or can even be unbound without tearing down the queue.

Fixes: edb0872f44 ("block: move the bdi from the request_queue to the gendisk")
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-5-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15 21:02:50 -06:00
Linus Torvalds
985f6ab93f Merge tag 'spi-fix-v5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
 "A few small fixes.

  Mostly driver specific but there's one in the core which fixes a
  deadlock when adding devices on spi-mux that's triggered because
  spi-mux is a SPI device which is itself a SPI controller and so can
  instantiate devices when registered.

  We were using a global lock to protect against reusing chip selects
  but they're a per controller thing so moving the lock per controller
  resolves that"

* tag 'spi-fix-v5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi-mux: Fix false-positive lockdep splats
  spi: Fix deadlock when adding SPI controllers on SPI buses
  spi: bcm-qspi: clear MSPI spifie interrupt during probe
  spi: spi-nxp-fspi: don't depend on a specific node name erratum workaround
  spi: mediatek: skip delays if they are 0
  spi: atmel: Fix PDC transfer setup bug
  spi: spidev: Add SPI ID table
  spi: Use 'flash' node name instead of 'spi-flash' in example
2021-10-15 10:21:46 -04:00
Linus Torvalds
ec681c53f8 Merge tag 'net-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Quite calm.

  The noisy DSA driver (embedded switches) changes, and adjustment to
  IPv6 IOAM behavior add to diffstat's bottom line but are not scary.

  Current release - regressions:

   - af_unix: rename UNIX-DGRAM to UNIX to maintain backwards
     compatibility

   - procfs: revert "add seq_puts() statement for dev_mcast", minor
     format change broke user space

  Current release - new code bugs:

   - dsa: fix bridge_num not getting cleared after ports leaving the
     bridge, resource leak

   - dsa: tag_dsa: send packets with TX fwd offload from VLAN-unaware
     bridges using VID 0, prevent packet drops if pvid is removed

   - dsa: mv88e6xxx: keep the pvid at 0 when VLAN-unaware, prevent HW
     getting confused about station to VLAN mapping

  Previous releases - regressions:

   - virtio-net: fix for skb_over_panic inside big mode

   - phy: do not shutdown PHYs in READY state

   - dsa: mv88e6xxx: don't use PHY_DETECT on internal PHY's, fix link
     LED staying lit after ifdown

   - mptcp: fix possible infinite wait on recvmsg(MSG_WAITALL)

   - mqprio: Correct stats in mqprio_dump_class_stats()

   - ice: fix deadlock for Tx timestamp tracking flush

   - stmmac: fix feature detection on old hardware

  Previous releases - always broken:

   - sctp: account stream padding length for reconf chunk

   - icmp: fix icmp_ext_echo_iio parsing in icmp_build_probe()

   - isdn: cpai: check ctr->cnr to avoid array index out of bound

   - isdn: mISDN: fix sleeping function called from invalid context

   - nfc: nci: fix potential UAF of rf_conn_info object

   - dsa: microchip: prevent ksz_mib_read_work from kicking back in
     after it's canceled in .remove and crashing

   - dsa: mv88e6xxx: isolate the ATU databases of standalone and bridged
     ports

   - dsa: sja1105, ocelot: break circular dependency between switch and
     tag drivers

   - dsa: felix: improve timestamping in presence of packe loss

   - mlxsw: thermal: fix out-of-bounds memory accesses

  Misc:

   - ipv6: ioam: move the check for undefined bits to improve
     interoperability"

* tag 'net-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (60 commits)
  icmp: fix icmp_ext_echo_iio parsing in icmp_build_probe
  MAINTAINERS: Update the devicetree documentation path of imx fec driver
  sctp: account stream padding length for reconf chunk
  mlxsw: thermal: Fix out-of-bounds memory accesses
  ethernet: s2io: fix setting mac address during resume
  NFC: digital: fix possible memory leak in digital_in_send_sdd_req()
  NFC: digital: fix possible memory leak in digital_tg_listen_mdaa()
  nfc: fix error handling of nfc_proto_register()
  Revert "net: procfs: add seq_puts() statement for dev_mcast"
  net: encx24j600: check error in devm_regmap_init_encx24j600
  net: korina: select CRC32
  net: arc: select CRC32
  net: dsa: felix: break at first CPU port during init and teardown
  net: dsa: tag_ocelot_8021q: fix inability to inject STP BPDUs into BLOCKING ports
  net: dsa: felix: purge skb from TX timestamping queue if it cannot be sent
  net: dsa: tag_ocelot_8021q: break circular dependency with ocelot switch lib
  net: dsa: tag_ocelot: break circular dependency with ocelot switch lib driver
  net: mscc: ocelot: cross-check the sequence id from the timestamp FIFO with the skb PTP header
  net: mscc: ocelot: deny TX timestamping of non-PTP packets
  net: mscc: ocelot: warn when a PTP IRQ is raised for an unknown skb
  ...
2021-10-14 18:21:39 -04:00
Linus Torvalds
1626d9a35e Merge tag 'sound-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
 "This contains quite a few device-specific fixes for usual HD- and
  USB-audio in addition to a couple of ALSA core fixes (a UAF fix in
  sequencer and a fix for a misplaced PCM 32bit compat ioctl).

  Nothing really stands out"

* tag 'sound-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: usb-audio: Add quirk for VF0770
  ALSA: hda: avoid write to STATESTS if controller is in reset
  ALSA: hda/realtek: Fix the mic type detection issue for ASUS G551JW
  ALSA: pcm: Workaround for a wrong offset in SYNC_PTR compat ioctl
  ALSA: hda/realtek: Fix for quirk to enable speaker output on the Lenovo 13s Gen2
  ALSA: hda: intel: Allow repeatedly probing on codec configuration errors
  ALSA: hda/realtek: Add quirk for TongFang PHxTxX1
  ALSA: hda/realtek - ALC236 headset MIC recording issue
  ALSA: usb-audio: Enable rate validation for Scarlett devices
  ALSA: hda/realtek: Add quirk for Clevo X170KM-G
  ALSA: hda/realtek: Complete partial device name to avoid ambiguity
  ALSA: hda - Enable headphone mic on Dell Latitude laptops with ALC3254
  ALSA: seq: Fix a potential UAF by wrong private_free call order
  ALSA: hda/realtek: Enable 4-speaker output for Dell Precision 5560 laptop
  ALSA: usb-audio: Fix a missing error check in scarlett gen2 mixer
2021-10-14 09:53:36 -04:00
Mark Brown
6098475d4c spi: Fix deadlock when adding SPI controllers on SPI buses
Currently we have a global spi_add_lock which we take when adding new
devices so that we can check that we're not trying to reuse a chip
select that's already controlled.  This means that if the SPI device is
itself a SPI controller and triggers the instantiation of further SPI
devices we trigger a deadlock as we try to register and instantiate
those devices while in the process of doing so for the parent controller
and hence already holding the global spi_add_lock.  Since we only care
about concurrency within a single SPI bus move the lock to be per
controller, avoiding the deadlock.

This can be easily triggered in the case of spi-mux.

Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Mark Brown <broonie@kernel.org>
2021-10-14 12:47:29 +01:00
Jakub Kicinski
b70b152173 Merge tag 'mlx5-fixes-2021-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:

====================
mlx5 fixes 2021-10-12

* tag 'mlx5-fixes-2021-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5e: Fix division by 0 in mlx5e_select_queue for representors
  net/mlx5e: Mutually exclude RX-FCS and RX-port-timestamp
  net/mlx5e: Switchdev representors are not vlan challenged
  net/mlx5e: Fix memory leak in mlx5_core_destroy_cq() error path
  net/mlx5e: Allow only complete TXQs partition in MQPRIO channel mode
  net/mlx5: Fix cleanup of bridge delayed work
====================

Link: https://lore.kernel.org/r/20211012205323.20123-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-13 13:39:55 -07:00
Vladimir Oltean
49f885b2d9 net: dsa: tag_ocelot_8021q: break circular dependency with ocelot switch lib
Michael reported that when using the "ocelot-8021q" tagging protocol,
the switch driver module must be manually loaded before the tagging
protocol can be loaded/is available.

This appears to be the same problem described here:
https://lore.kernel.org/netdev/20210908220834.d7gmtnwrorhharna@skbuf/
where due to the fact that DSA tagging protocols make use of symbols
exported by the switch drivers, circular dependencies appear and this
breaks module autoloading.

The ocelot_8021q driver needs the ocelot_can_inject() and
ocelot_port_inject_frame() functions from the switch library. Previously
the wrong approach was taken to solve that dependency: shims were
provided for the case where the ocelot switch library was compiled out,
but that turns out to be insufficient, because the dependency when the
switch lib _is_ compiled is problematic too.

We cannot declare ocelot_can_inject() and ocelot_port_inject_frame() as
static inline functions, because these access I/O functions like
__ocelot_write_ix() which is called by ocelot_write_rix(). Making those
static inline basically means exposing the whole guts of the ocelot
switch library, not ideal...

We already have one tagging protocol driver which calls into the switch
driver during xmit but not using any exported symbol: sja1105_defer_xmit.
We can do the same thing here: create a kthread worker and one work item
per skb, and let the switch driver itself do the register accesses to
send the skb, and then consume it.

Fixes: 0a6f17c6ae ("net: dsa: tag_ocelot_8021q: add support for PTP timestamping")
Reported-by: Michael Walle <michael@walle.cc>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-12 17:35:18 -07:00
Vladimir Oltean
deab6b1cd9 net: dsa: tag_ocelot: break circular dependency with ocelot switch lib driver
As explained here:
https://lore.kernel.org/netdev/20210908220834.d7gmtnwrorhharna@skbuf/
DSA tagging protocol drivers cannot depend on symbols exported by switch
drivers, because this creates a circular dependency that breaks module
autoloading.

The tag_ocelot.c file depends on the ocelot_ptp_rew_op() function
exported by the common ocelot switch lib. This function looks at
OCELOT_SKB_CB(skb) and computes how to populate the REW_OP field of the
DSA tag, for PTP timestamping (the command: one-step/two-step, and the
TX timestamp identifier).

None of that requires deep insight into the driver, it is quite
stateless, as it only depends upon the skb->cb. So let's make it a
static inline function and put it in include/linux/dsa/ocelot.h, a
file that despite its name is used by the ocelot switch driver for
populating the injection header too - since commit 40d3f295b5 ("net:
mscc: ocelot: use common tag parsing code with DSA").

With that function declared as static inline, its body is expanded
inside each call site, so the dependency is broken and the DSA tagger
can be built without the switch library, upon which the felix driver
depends.

Fixes: 39e5308b32 ("net: mscc: ocelot: support PTP Sync one-step timestamping")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-12 17:35:18 -07:00
Vladimir Oltean
ebb4c6a990 net: mscc: ocelot: cross-check the sequence id from the timestamp FIFO with the skb PTP header
The sad reality is that when a PTP frame with a TX timestamping request
is transmitted, it isn't guaranteed that it will make it all the way to
the wire (due to congestion inside the switch), and that a timestamp
will be taken by the hardware and placed in the timestamp FIFO where an
IRQ will be raised for it.

The implication is that if enough PTP frames are silently dropped by the
hardware such that the timestamp ID has rolled over, it is possible to
match a timestamp to an old skb.

Furthermore, nobody will match on the real skb corresponding to this
timestamp, since we stupidly matched on a previous one that was stale in
the queue, and stopped there.

So PTP timestamping will be broken and there will be no way to recover.

It looks like the hardware parses the sequenceID from the PTP header,
and also provides that metadata for each timestamp. The driver currently
ignores this, but it shouldn't.

As an extra resiliency measure, do the following:

- check whether the PTP sequenceID also matches between the skb and the
  timestamp, treat the skb as stale otherwise and free it

- if we see a stale skb, don't stop there and try to match an skb one
  more time, chances are there's one more skb in the queue with the same
  timestamp ID, otherwise we wouldn't have ever found the stale one (it
  is by timestamp ID that we matched it).

While this does not prevent PTP packet drops, it at least prevents
the catastrophic consequences of incorrect timestamp matching.

Since we already call ptp_classify_raw in the TX path, save the result
in the skb->cb of the clone, and just use that result in the interrupt
code path.

Fixes: 4e3b0468e6 ("net: mscc: PTP Hardware Clock (PHC) support")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-12 17:35:18 -07:00
Vladimir Oltean
52849bcf00 net: mscc: ocelot: avoid overflowing the PTP timestamp FIFO
PTP packets with 2-step TX timestamp requests are matched to packets
based on the egress port number and a 6-bit timestamp identifier.
All PTP timestamps are held in a common FIFO that is 128 entry deep.

This patch ensures that back-to-back timestamping requests cannot exceed
the hardware FIFO capacity. If that happens, simply send the packets
without requesting a TX timestamp to be taken (in the case of felix,
since the DSA API has a void return code in ds->ops->port_txtstamp) or
drop them (in the case of ocelot).

I've moved the ts_id_lock from a per-port basis to a per-switch basis,
because we need separate accounting for both numbers of PTP frames in
flight. And since we need locking to inc/dec the per-switch counter,
that also offers protection for the per-port counter and hence there is
no reason to have a per-port counter anymore.

Fixes: 4e3b0468e6 ("net: mscc: PTP Hardware Clock (PHC) support")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-12 17:35:17 -07:00