Use task_work_add if it is available, since task_work_add can bring
up better performance, especially batching signaling ->ubq_daemon can
be done.
It is observed that task_work_add() can boost iops by +4% on random
4k io test. Also except for completing io command, all other code
paths are same with completing io command via
io_uring_cmd_complete_in_task.
Meantime add one flag of UBLK_F_URING_CMD_COMP_IN_TASK for comparing
the mode easily.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220713140711.97356-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is the driver part of userspace block driver(ublk driver), the other
part is userspace daemon part(ublksrv)[1].
The two parts communicate by io_uring's IORING_OP_URING_CMD with one
shared cmd buffer for storing io command, and the buffer is read only for
ublksrv, each io command is indexed by io request tag directly, and is
written by ublk driver.
For example, when one READ io request is submitted to ublk block driver,
ublk driver stores the io command into cmd buffer first, then completes
one IORING_OP_URING_CMD for notifying ublksrv, and the URING_CMD is issued
to ublk driver beforehand by ublksrv for getting notification of any new
io request, and each URING_CMD is associated with one io request by tag.
After ublksrv gets the io command, it translates and handles the ublk io
request, such as, for the ublk-loop target, ublksrv translates the request
into same request on another file or disk, like the kernel loop block
driver. In ublksrv's implementation, the io is still handled by io_uring,
and share same ring with IORING_OP_URING_CMD command. When the target io
request is done, the same IORING_OP_URING_CMD is issued to ublk driver for
both committing io request result and getting future notification of new
io request.
Another thing done by ublk driver is to copy data between kernel io
request and ublksrv's io buffer:
1) before ubsrv handles WRITE request, copy the request's data into
ublksrv's userspace io buffer, so that ublksrv can handle the write
request
2) after ubsrv handles READ request, copy ublksrv's userspace io buffer
into this READ request, then ublk driver can complete the READ request
Zero copy may be switched if mm is ready to support it.
ublk driver doesn't handle any logic of the specific user space driver,
so it is small/simple enough.
[1] ublksrv
https://github.com/ming1/ubdsrv
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220713140711.97356-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Set the queue dying flag and call blk_mq_exit_queue from del_gendisk for
all disks that do not have separately allocated queues, and thus remove
the need to call blk_cleanup_queue for them.
Rename blk_cleanup_disk to blk_mq_destroy_queue to make it clear that
this function is intended only for separately allocated blk-mq queues.
This saves an extra queue freeze for devices without a separately
allocated queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit e70344c059 ("block: fix default IO priority handling")
introduced an inconsistency in get_current_ioprio() that tasks without
IO context return IOPRIO_DEFAULT priority while tasks with freshly
allocated IO context will return 0 (IOPRIO_CLASS_NONE/0) IO priority.
Tasks without IO context used to be rare before 5a9d041ba2 ("block:
move io_context creation into where it's needed") but after this commit
they became common because now only BFQ IO scheduler setups task's IO
context. Similar inconsistency is there for get_task_ioprio() so this
inconsistency is now exposed to userspace and userspace will see
different IO priority for tasks operating on devices with BFQ compared
to devices without BFQ. Furthemore the changes done by commit
e70344c059 change the behavior when no IO priority is set for BFQ IO
scheduler which is also documented in ioprio_set(2) manpage:
"If no I/O scheduler has been set for a thread, then by default the I/O
priority will follow the CPU nice value (setpriority(2)). In Linux
kernels before version 2.6.24, once an I/O priority had been set using
ioprio_set(), there was no way to reset the I/O scheduling behavior to
the default. Since Linux 2.6.24, specifying ioprio as 0 can be used to
reset to the default I/O scheduling behavior."
So make sure we default to IOPRIO_CLASS_NONE as used to be the case
before commit e70344c059. Also cleanup alloc_io_context() to
explicitely set this IO priority for the allocated IO context to avoid
future surprises. Note that we tweak ioprio_best() to maintain
ioprio_get(2) behavior and make this commit easily backportable.
CC: stable@vger.kernel.org
Fixes: e70344c059 ("block: fix default IO priority handling")
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the address alignment requirements from the block_device for direct
io instead of requiring addresses be aligned to the block size. User
space can discover the alignment requirements from the dma_alignment
queue attribute.
User space can specify any hardware compatible DMA offset for each
segment, but every segment length is still required to be a multiple of
the block size.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-11-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The existing iov_iter_alignment() function returns the logical OR of
address and length. For cases where address and length need to be
considered separately, introduce a helper function that a caller can
specificy length and address masks that indicate if the iov is
unaligned.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-9-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull ARM SoC fixes from Arnd Bergmann:
"A number of fixes have accumulated, but they are largely for harmless
issues:
- Several OF node leak fixes
- A fix to the Exynos7885 UART clock description
- DTS fixes to prevent boot failures on TI AM64 and J721s2
- Bus probe error handling fixes for Baikal-T1
- A fixup to the way STM32 SoCs use separate dts files for different
firmware stacks
- Multiple code fixes for Arm SCMI firmware, all dealing with
robustness of the implementation
- Multiple NXP i.MX devicetree fixes, addressing incorrect data in DT
nodes
- Three updates to the MAINTAINERS file, including Florian Fainelli
taking over BCM283x/BCM2711 (Raspberry Pi) from Nicolas Saenz
Julienne"
* tag 'soc-fixes-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (29 commits)
ARM: dts: aspeed: nuvia: rename vendor nuvia to qcom
arm: mach-spear: Add missing of_node_put() in time.c
ARM: cns3xxx: Fix refcount leak in cns3xxx_init
MAINTAINERS: Update email address
arm64: dts: ti: k3-am64-main: Remove support for HS400 speed mode
arm64: dts: ti: k3-j721s2: Fix overlapping GICD memory region
ARM: dts: bcm2711-rpi-400: Fix GPIO line names
bus: bt1-axi: Don't print error on -EPROBE_DEFER
bus: bt1-apb: Don't print error on -EPROBE_DEFER
ARM: Fix refcount leak in axxia_boot_secondary
ARM: dts: stm32: move SCMI related nodes in a dedicated file for stm32mp15
soc: imx: imx8m-blk-ctrl: fix display clock for LCDIF2 power domain
ARM: dts: imx6qdl-colibri: Fix capacitive touch reset polarity
ARM: dts: imx6qdl: correct PU regulator ramp delay
firmware: arm_scmi: Fix incorrect error propagation in scmi_voltage_descriptors_get
firmware: arm_scmi: Avoid using extended string-buffers sizes if not necessary
firmware: arm_scmi: Fix SENSOR_AXIS_NAME_GET behaviour when unsupported
ARM: dts: imx7: Move hsic_phy power domain to HSIC PHY node
soc: bcm: brcmstb: pm: pm-arm: Fix refcount leak in brcmstb_pm_probe
MAINTAINERS: Update BCM2711/BCM2835 maintainer
...
Pull hotfixes from Andrew Morton:
"Minor things, mainly - mailmap updates, MAINTAINERS updates, etc.
Fixes for this merge window:
- fix for a damon boot hang, from SeongJae
- fix for a kfence warning splat, from Jason Donenfeld
- fix for zero-pfn pinning, from Alex Williamson
- fix for fallocate hole punch clearing, from Mike Kravetz
Fixes for previous releases:
- fix for a performance regression, from Marcelo
- fix for a hwpoisining BUG from zhenwei pi"
* tag 'mm-hotfixes-stable-2022-06-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mailmap: add entry for Christian Marangi
mm/memory-failure: disable unpoison once hw error happens
hugetlbfs: zero partial pages during fallocate hole punch
mm: memcontrol: reference to tools/cgroup/memcg_slabinfo.py
mm: re-allow pinning of zero pfns
mm/kfence: select random number before taking raw lock
MAINTAINERS: add maillist information for LoongArch
MAINTAINERS: update MM tree references
MAINTAINERS: update Abel Vesa's email
MAINTAINERS: add MEMORY HOT(UN)PLUG section and add David as reviewer
MAINTAINERS: add Miaohe Lin as a memory-failure reviewer
mailmap: add alias for jarkko@profian.com
mm/damon/reclaim: schedule 'damon_reclaim_timer' only after 'system_wq' is initialized
kthread: make it clear that kthread_create_on_node() might be terminated by any fatal signal
mm: lru_cache_disable: use synchronize_rcu_expedited
mm/page_isolation.c: fix one kernel-doc comment
Pull gpio fixes from Bartosz Golaszewski:
- make the irqchip immutable in gpio-realtek-otto
- fix error code propagation in gpio-winbond
- fix device removing in gpio-grgpio
- fix a typo in gpio-mxs which indicates the driver is for a different
model
- documentation fixes
- MAINTAINERS file updates
* tag 'gpio-fixes-for-v5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: mxs: Fix header comment
gpio: Fix kernel-doc comments to nested union
gpio: grgpio: Fix device removing
gpio: winbond: Fix error code in winbond_gpio_get()
gpio: realtek-otto: Make the irqchip immutable
docs: driver-api: gpio: Fix filename mismatch
MAINTAINERS: add include/dt-bindings/gpio to GPIO SUBSYSTEM
Pull ATA fix from Damien Le Moal:
- a single patch to fix tracing of command completion (Edward)
* tag 'ata-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
ata: libata: add qc->flags in ata_qc_complete_template tracepoint
Pull block fixes from Jens Axboe:
- Series fixing issues with sysfs locking and name reuse (Christoph)
- NVMe pull request via Christoph:
- Fix the mixed up CRIMS/CRWMS constants (Joel Granados)
- Add another broken identifier quirk (Leo Savernik)
- Fix up a quirk because Samsung reuses PCI IDs over different
products (Christoph Hellwig)
- Remove old WARN_ON() that doesn't apply anymore (Li)
- Fix for using a stale cached request value for rq-qos throttling
mechanisms that may schedule(), like iocost (me)
- Remove unused parameter to blk_independent_access_range() (Damien)
* tag 'block-5.19-2022-06-24' of git://git.kernel.dk/linux-block:
block: remove WARN_ON() from bd_link_disk_holder
nvme: move the Samsung X5 quirk entry to the core quirks
nvme: fix the CRIMS and CRWMS definitions to match the spec
nvme: add a bogus subsystem NQN quirk for Micron MTFDKBA2T0TFH
block: pop cached rq before potentially blocking rq_qos_throttle()
block: remove queue from struct blk_independent_access_range
block: freeze the queue earlier in del_gendisk
block: remove per-disk debugfs files in blk_unregister_queue
block: serialize all debugfs operations using q->debugfs_mutex
block: disable the elevator int del_gendisk
Pull io_uring fixes from Jens Axboe:
"A few fixes that should go into the 5.19 release. All are fixing
issues that either happened in this release, or going to stable.
In detail:
- A small series of fixlets for the poll handling, all destined for
stable (Pavel)
- Fix a merge error from myself that caused a potential -EINVAL for
the recv/recvmsg flag setting (me)
- Fix a kbuf recycling issue for partial IO (me)
- Use the original request for the inflight tracking (me)
- Fix an issue introduced this merge window with trace points using a
custom decoder function, which won't work for perf (Dylan)"
* tag 'io_uring-5.19-2022-06-24' of git://git.kernel.dk/linux-block:
io_uring: use original request task for inflight tracking
io_uring: move io_uring_get_opcode out of TP_printk
io_uring: fix double poll leak on repolling
io_uring: fix wrong arm_poll error handling
io_uring: fail links when poll fails
io_uring: fix req->apoll_events
io_uring: fix merge error in checking send/recv addr2 flags
io_uring: mark reissue requests with REQ_F_PARTIAL_IO
Pull printk kernel thread revert from Petr Mladek:
"Revert printk console kthreads.
The testing of 5.19 release candidates revealed issues that did not
happen when all consoles were serialized using the console semaphore.
More time is needed to check expectations of the existing console
drivers and be confident that they can be safely used in parallel"
* tag 'printk-for-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
Revert "printk: add functions to prefer direct printing"
Revert "printk: add kthread console printers"
Revert "printk: extend console_lock for per-console locking"
Revert "printk: remove @console_locked"
Revert "printk: Block console kthreads when direct printing will be required"
Revert "printk: Wait for the global console lock when the system is going down"
Pull random number generator fixes from Jason Donenfeld:
- A change to schedule the interrupt randomness mixing less often, yet
credit a little more each time, to reduce overhead during interrupt
storms.
- Squelch an undesired pr_warn() from __ratelimit(), which was causing
problems in the reporters' CI.
- A trivial comment fix.
* tag 'random-5.19-rc4-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
random: update comment from copy_to_user() -> copy_to_iter()
random: quiet urandom warning ratelimit suppression message
random: schedule mix_interrupt_randomness() less often
This reverts commit 2bb2b7b57f.
The testing of 5.19 release candidates revealed missing synchronization
between early and regular console functionality.
It would be possible to start the console kthreads later as a workaround.
But it is clear that console lock serialized console drivers between
each other. It opens a big area of possible problems that were not
considered by people involved in the development and review.
printk() is crucial for debugging kernel issues and console output is
very important part of it. The number of consoles is huge and a proper
review would take some time. As a result it need to be reverted for 5.19.
Link: https://lore.kernel.org/r/YrBdjVwBOVgLfHyb@alley
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220623145157.21938-7-pmladek@suse.com
This reverts commit 09c5ba0aa2.
This reverts commit b87f02307d.
The testing of 5.19 release candidates revealed missing synchronization
between early and regular console functionality.
It would be possible to start the console kthreads later as a workaround.
But it is clear that console lock serialized console drivers between
each other. It opens a big area of possible problems that were not
considered by people involved in the development and review.
printk() is crucial for debugging kernel issues and console output is
very important part of it. The number of consoles is huge and a proper
review would take some time. As a result it need to be reverted for 5.19.
Link: https://lore.kernel.org/r/YrBdjVwBOVgLfHyb@alley
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220623145157.21938-6-pmladek@suse.com
This reverts commit 8e27473211.
The testing of 5.19 release candidates revealed missing synchronization
between early and regular console functionality.
It would be possible to start the console kthreads later as a workaround.
But it is clear that console lock serialized console drivers between
each other. It opens a big area of possible problems that were not
considered by people involved in the development and review.
printk() is crucial for debugging kernel issues and console output is
very important part of it. The number of consoles is huge and a proper
review would take some time. As a result it need to be reverted for 5.19.
Link: https://lore.kernel.org/r/YrBdjVwBOVgLfHyb@alley
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220623145157.21938-5-pmladek@suse.com
This reverts commit b87f02307d.
The testing of 5.19 release candidates revealed missing synchronization
between early and regular console functionality.
It would be possible to start the console kthreads later as a workaround.
But it is clear that console lock serialized console drivers between
each other. It opens a big area of possible problems that were not
considered by people involved in the development and review.
printk() is crucial for debugging kernel issues and console output is
very important part of it. The number of consoles is huge and a proper
review would take some time. As a result it need to be reverted for 5.19.
Link: https://lore.kernel.org/r/YrBdjVwBOVgLfHyb@alley
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220623145157.21938-2-pmladek@suse.com
Pull networking fixes from Paolo Abeni:
"Including fixes from bpf and netfilter.
Current release - regressions:
- netfilter: cttimeout: fix slab-out-of-bounds read in
cttimeout_net_exit
Current release - new code bugs:
- bpf: ftrace: keep address offset in ftrace_lookup_symbols
- bpf: force cookies array to follow symbols sorting
Previous releases - regressions:
- ipv4: ping: fix bind address validity check
- tipc: fix use-after-free read in tipc_named_reinit
- eth: veth: add updating of trans_start
Previous releases - always broken:
- sock: redo the psock vs ULP protection check
- netfilter: nf_dup_netdev: fix skb_under_panic
- bpf: fix request_sock leak in sk lookup helpers
- eth: igb: fix a use-after-free issue in igb_clean_tx_ring
- eth: ice: prohibit improper channel config for DCB
- eth: at803x: fix null pointer dereference on AR9331 phy
- eth: virtio_net: fix xdp_rxq_info bug after suspend/resume
Misc:
- eth: hinic: replace memcpy() with direct assignment"
* tag 'net-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits)
net: openvswitch: fix parsing of nw_proto for IPv6 fragments
sock: redo the psock vs ULP protection check
Revert "net/tls: fix tls_sk_proto_close executed repeatedly"
virtio_net: fix xdp_rxq_info bug after suspend/resume
igb: Make DMA faster when CPU is active on the PCIe link
net: dsa: qca8k: reduce mgmt ethernet timeout
net: dsa: qca8k: reset cpu port on MTU change
MAINTAINERS: Add a maintainer for OCP Time Card
hinic: Replace memcpy() with direct assignment
Revert "drivers/net/ethernet/neterion/vxge: Fix a use-after-free bug in vxge-main.c"
net: phy: smsc: Disable Energy Detect Power-Down in interrupt mode
ice: ethtool: Prohibit improper channel config for DCB
ice: ethtool: advertise 1000M speeds properly
ice: Fix switchdev rules book keeping
ice: ignore protocol field in GTP offload
netfilter: nf_dup_netdev: add and use recursion counter
netfilter: nf_dup_netdev: do not push mac header a second time
selftests: netfilter: correct PKTGEN_SCRIPT_PATHS in nft_concat_range.sh
net/tls: fix tls_sk_proto_close executed repeatedly
erspan: do not assume transport header is always set
...
Adjust the values of NVME_CAP_CRMS_CRIMS and NVME_CAP_CRMS_CRWMS masks as
they are different from the ones in TP4084 - Time-to-ready.
Fixes: 354201c53e ("nvme: add support for TP4084 - Time-to-Ready Enhancements").
Signed-off-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Commit 8a59f9d1e3 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()")
has moved the inet_csk_has_ulp(sk) check from sk_psock_init() to
the new tcp_bpf_update_proto() function. I'm guessing that this
was done to allow creating psocks for non-inet sockets.
Unfortunately the destruction path for psock includes the ULP
unwind, so we need to fail the sk_psock_init() itself.
Otherwise if ULP is already present we'll notice that later,
and call tcp_update_ulp() with the sk_proto of the ULP
itself, which will most likely result in the ULP looping
its callbacks.
Fixes: 8a59f9d1e3 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20220620191353.1184629-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pull signature checking selftest from David Howells:
"The signature checking code, as used by module signing, kexec, etc.,
is non-FIPS compliant as there is no selftest.
For a kernel to be FIPS-compliant, signature checking would have to be
tested before being used, and the box would need to panic if it's not
available (probably reasonable as simply disabling signature checking
would prevent you from loading any driver modules).
Deal with this by adding a minimal test.
This is split into two patches: the first moves load_certificate_list()
to the same place as the X.509 code to make it more accessible
internally; the second adds a selftest"
* tag 'certs-20220621' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
certs: Add FIPS selftests
certs: Move load_certificate_list() to be with the asymmetric keys code