Merge in changes that went into 5.8-rc3. GIT will silently do the
merge, but we still need a tweak on top of that since
io_complete_rw_common() was modified to take a io_comp_state pointer.
The auto-merge fails on that, and we end up with something that
doesn't compile.
* io_uring-5.8:
io_uring: fix current->mm NULL dereference on exit
io_uring: fix hanging iopoll in case of -EAGAIN
io_uring: fix io_sq_thread no schedule when busy
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's easier to return next work from ->do_work() than
having an in-out argument. Looks nicer and easier to compile.
Also, merge io_wq_assign_next() into its only user.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently links are always done in an async fashion, unless we catch them
inline after we successfully complete a request without having to resort
to blocking. This isn't necessarily the most efficient approach, it'd be
more ideal if we could just use the task_work handling for this.
Outside of saving an async jump, we can also do less prep work for these
kinds of requests.
Running dependent links from the task_work handler yields some nice
performance benefits. As an example, examples/link-cp from the liburing
repository uses read+write links to implement a copy operation. Without
this patch, the a cache fold 4G file read from a VM runs in about 3
seconds:
$ time examples/link-cp /data/file /dev/null
real 0m2.986s
user 0m0.051s
sys 0m2.843s
and a subsequent cache hot run looks like this:
$ time examples/link-cp /data/file /dev/null
real 0m0.898s
user 0m0.069s
sys 0m0.797s
With this patch in place, the cold case takes about 2.4 seconds:
$ time examples/link-cp /data/file /dev/null
real 0m2.400s
user 0m0.020s
sys 0m2.366s
and the cache hot case looks like this:
$ time examples/link-cp /data/file /dev/null
real 0m0.676s
user 0m0.010s
sys 0m0.665s
As expected, the (mostly) cache hot case yields the biggest improvement,
running about 25% faster with this change, while the cache cold case
yields about a 20% increase in performance. Outside of the performance
increase, we're using less CPU as well, as we're not using the async
offload threads at all for this anymore.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A bit more surgery required here, as completions are generally done
through the kiocb->ki_complete() callback, even if they complete inline.
This enables the regular read/write path to use the io_comp_state
logic to batch inline completions.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Provide the completion state to the handlers that we know can complete
inline, so they can utilize this for batching completions.
Cap the max batch count at 32. This should be enough to provide a good
amortization of the cost of the lock+commit dance for completions, while
still being low enough not to cause any real latency issues for SQPOLL
applications.
Xuan Zhuo <xuanzhuo@linux.alibaba.com> reports that this changes his
profile from:
17.97% [kernel] [k] copy_user_generic_unrolled
13.92% [kernel] [k] io_commit_cqring
11.04% [kernel] [k] __io_cqring_fill_event
10.33% [kernel] [k] udp_recvmsg
5.94% [kernel] [k] skb_release_data
4.31% [kernel] [k] udp_rmem_release
2.68% [kernel] [k] __check_object_size
2.24% [kernel] [k] __slab_free
2.22% [kernel] [k] _raw_spin_lock_bh
2.21% [kernel] [k] kmem_cache_free
2.13% [kernel] [k] free_pcppages_bulk
1.83% [kernel] [k] io_submit_sqes
1.38% [kernel] [k] page_frag_free
1.31% [kernel] [k] inet_recvmsg
to
19.99% [kernel] [k] copy_user_generic_unrolled
11.63% [kernel] [k] skb_release_data
9.36% [kernel] [k] udp_rmem_release
8.64% [kernel] [k] udp_recvmsg
6.21% [kernel] [k] __slab_free
4.39% [kernel] [k] __check_object_size
3.64% [kernel] [k] free_pcppages_bulk
2.41% [kernel] [k] kmem_cache_free
2.00% [kernel] [k] io_submit_sqes
1.95% [kernel] [k] page_frag_free
1.54% [kernel] [k] io_put_req
[...]
0.07% [kernel] [k] io_commit_cqring
0.44% [kernel] [k] __io_cqring_fill_event
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No functional changes in this patch, just in preparation for having the
completion state be available on the issue side. Later on, this will
allow requests that complete inline to be completed in batches.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No functional changes in this patch, just in preparation for passing back
pending completions to the caller and completing them in a batched
fashion.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have lots of callers of:
io_cqring_add_event(req, result);
io_put_req(req);
Provide a helper that does this for us. It helps clean up the code, and
also provides a more convenient location for us to change the completion
handling.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__io_queue_sqe() tries to handle all request of a link,
so it's not enough to grab mm in io_sq_thread_acquire_mm()
based just on the head.
Don't check req->needs_mm and do it always.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
io_do_iopoll() won't do anything with a request unless
req->iopoll_completed is set. So io_complete_rw_iopoll() has to set
it, otherwise io_do_iopoll() will poll a file again and again even
though the request of interest was completed long time ago.
Also, remove -EAGAIN check from io_issue_sqe() as it races with
the changed lines. The request will take the long way and be
resubmitted from io_iopoll*().
io_kiocb's result and iopoll_completed")
Fixes: bbde017a32 ("io_uring: add memory barrier to synchronize
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the user consumes and generates sqe at a fast rate,
io_sqring_entries can always get sqe, and ret will not be equal to -EBUSY,
so that io_sq_thread will never call cond_resched or schedule, and then
we will get the following system error prompt:
rcu: INFO: rcu_sched self-detected stall on CPU
or
watchdog: BUG: soft lockup-CPU#23 stuck for 112s! [io_uring-sq:1863]
This patch checks whether need to call cond_resched() by checking
the need_resched() function every cycle.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After recent changes, io_submit_sqes() always passes valid submit state,
so kill leftovers checking it for NULL.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's a good practice to modify fields of a struct after but not before
it was initialised. Even though io_init_poll_iocb() doesn't touch
poll->file, call it first.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
REQ_F_MUST_PUNT may seem looking good and clear, but it's the same
as not having REQ_F_NOWAIT set. That rather creates more confusion.
Moreover, it doesn't even affect any behaviour (e.g. see the patch
removing it from io_{read,write}).
Kill theg flag and update already outdated comments.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_{read,write}() {
...
copy_iov: // prep async
if (!(flags & REQ_F_NOWAIT) && !file_can_poll(file))
flags |= REQ_F_MUST_PUNT;
}
REQ_F_MUST_PUNT there is pointless, because if it happens then
REQ_F_NOWAIT is known to be _not_ set, and the request will go
async path in __io_queue_sqe() anyway. file_can_poll() check
is also repeated in arm_poll*(), so don't need it.
Remove the mentioned assignment REQ_F_MUST_PUNT in preparation
for killing the flag.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull in async buffered reads branch.
* async-buffered.8:
io_uring: support true async buffered reads, if file provides it
mm: add kiocb_wait_page_queue_init() helper
btrfs: flag files as supporting buffered async reads
xfs: flag files as supporting buffered async reads
block: flag block devices as supporting IOCB_WAITQ
fs: add FMODE_BUF_RASYNC
mm: support async buffered reads in generic_file_buffered_read()
mm: add support for async page locking
mm: abstract out wake_page_match() from wake_page_function()
mm: allow read-ahead with IOCB_NOWAIT set
io_uring: re-issue block requests that failed because of resources
io_uring: catch -EIO from buffered issue request failure
io_uring: always plug for any number of IOs
block: provide plug based way of signaling forced no-wait semantics
If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt
the buffered read to an io-wq worker. Instead we can rely on page
unlocking callbacks to support retry based async IO. This is a lot more
efficient than doing async thread offload.
The retry is done similarly to how we handle poll based retry. From
the unlock callback, we simply queue the retry to a task_work based
handler.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Checks if the file supports it, and initializes the values that we need.
Caller passes in 'data' pointer, if any, and the callback function to
be used.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the async page locking infrastructure, if IOCB_WAITQ is set in the
passed in iocb. The caller must expect an -EIOCBQUEUED return value,
which means that IO is started but not done yet. This is similar to how
O_DIRECT signals the same operation. Once the callback is received by
the caller for IO completion, the caller must retry the operation.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Normally waiting for a page to become unlocked, or locking the page,
requires waiting for IO to complete. Add support for lock_page_async()
and wait_on_page_locked_async(), which are callback based instead. This
allows a caller to get notified when a page becomes unlocked, rather
than wait for it.
We add a new iocb field, ki_waitq, to pass in the necessary data for this
to happen. We can unionize this with ki_cookie, since that is only used
for polled IO. Polled IO can never co-exist with async callbacks, as it is
(by definition) polled completions. struct wait_page_key is made public,
and we define struct wait_page_async as the interface between the caller
and the core.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No functional changes in this patch, just in preparation for allowing
more callers.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The read-ahead shouldn't block, so allow it to be done even if
IOCB_NOWAIT is set in the kiocb.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mark the plug with nowait == true, which will cause requests to avoid
blocking on request allocation. If they do, we catch them and reissue
them from a task_work based handler.
Normally we can catch -EAGAIN directly, but the hard case is for split
requests. As an example, the application issues a 512KB request. The
block core will split this into 128KB if that's the max size for the
device. The first request issues just fine, but we run into -EAGAIN for
some latter splits for the same request. As the bio is split, we don't
get to see the -EAGAIN until one of the actual reads complete, and hence
we cannot handle it inline as part of submission.
This does potentially cause re-reads of parts of the range, as the whole
request is reissued. There's currently no better way to handle this.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-EIO bubbles up like -EAGAIN if we fail to allocate a request at the
lower level. Play it safe and treat it like -EAGAIN in terms of sync
retry, to avoid passing back an errant -EIO.
Catch some of these early for block based file, as non-mq devices
generally do not support NOWAIT. That saves us some overhead by
not first trying, then retrying from async context. We can go straight
to async punt instead.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently we only plug if we're doing more than two request. We're going
to be relying on always having the plug there to pass down information,
so plug unconditionally.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Provide a way for the caller to specify that IO should be marked
with REQ_NOWAIT to avoid blocking on allocation.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ring pages are not pinned so it is more appropriate to report them
as locked.
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
poll events should be 32-bits to cover EPOLLEXCLUSIVE.
Explicit word-swap the poll32_events for big endian to make sure the ABI
is not changed. We call this feature IORING_FEAT_POLL_32BITS,
applications who want to use EPOLLEXCLUSIVE should check the feature bit
first.
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull SELinux fixes from Paul Moore:
"Three small patches to fix problems in the SELinux code, all found via
clang.
Two patches fix potential double-free conditions and one fixes an
undefined return value"
* tag 'selinux-pr-20200621' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: fix undefined return of cond_evaluate_expr
selinux: fix a double free in cond_read_node()/cond_read_list()
selinux: fix double free
Pull pin control fixes from Linus Walleij:
"Some early fixes collected during the first week after the merge
window, all pretty self-evident, with the details below. The revert is
the crucial thing.
- Fix a warning on the Qualcomm SPMI GPIO chip being instatiated
twice without a unique irqchip struct
- Use the noirq variants of the suspend and resume callbacks in the
Tegra driver
- Clean up the errorpath on the MCP23s08 driver
- Revert the use of devm_of_iomap() in the Freescale driver as it was
regressing the platform
- Add some missing pins in the Qualcomm IPQ6018 driver
- Fix a simple documentation bug in the pinctrl-single driver"
* tag 'pinctrl-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: single: fix function name in documentation
pinctrl: qcom: ipq6018 Add missing pins in qpic pin group
Revert "pinctrl: freescale: imx: Use 'devm_of_iomap()' to avoid a resource leak in case of error in 'imx_pinctrl_probe()'"
pinctrl: mcp23s08: Split to three parts: fix ptr_ret.cocci warnings
pinctrl: tegra: Use noirq suspend/resume callbacks
pinctrl: qcom: spmi-gpio: fix warning about irq chip reusage
Pull Kbuild fixes from Masahiro Yamada:
- fix -gz=zlib compiler option test for CONFIG_DEBUG_INFO_COMPRESSED
- improve cc-option in scripts/Kbuild.include to clean up temp files
- improve cc-option in scripts/Kconfig.include for more reliable
compile option test
- do not copy modules.builtin by 'make install' because it would break
existing systems
- use 'userprogs' syntax for watch_queue sample
* tag 'kbuild-fixes-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
samples: watch_queue: build sample program for target architecture
Revert "Makefile: install modules.builtin even if CONFIG_MODULES=n"
scripts: Fix typo in headers_install.sh
kconfig: unify cc-option and as-option
kbuild: improve cc-option to clean up all temporary files
Makefile: Improve compressed debug info support detection
Pull powerpc fixes from Michael Ellerman:
- One fix for the interrupt rework we did last release which broke
KVM-PR
- Three commits fixing some fallout from the READ_ONCE() changes
interacting badly with our 8xx 16K pages support, which uses a pte_t
that is a structure of 4 actual PTEs
- A cleanup of the 8xx pte_update() to use the newly added pmd_off()
- A fix for a crash when handling an oops if CONFIG_DEBUG_VIRTUAL is
enabled
- A minor fix for the SPU syscall generation
Thanks to Aneesh Kumar K.V, Christian Zigotzky, Christophe Leroy, Mike
Rapoport, Nicholas Piggin.
* tag 'powerpc-5.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/8xx: Provide ptep_get() with 16k pages
mm: Allow arches to provide ptep_get()
mm/gup: Use huge_ptep_get() in gup_hugepte()
powerpc/syscalls: Use the number when building SPU syscall table
powerpc/8xx: use pmd_off() to access a PMD entry in pte_update()
powerpc/64s: Fix KVM interrupt using wrong save area
powerpc: Fix kernel crash in show_instructions() w/DEBUG_VIRTUAL
This userspace program includes UAPI headers exported to usr/include/.
'make headers' always works for the target architecture (i.e. the same
architecture as the kernel), so the sample program should be built for
the target as well. Kbuild now supports 'userprogs' for that.
I also guarded the CONFIG option by 'depends on CC_CAN_LINK' because
$(CC) may not provide libc.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
This reverts commit e0b250b57d,
which broke build systems that need to install files to a certain
path, but do not set INSTALL_MOD_PATH when invoking 'make install'.
$ make INSTALL_PATH=/tmp/destdir install
mkdir: cannot create directory ‘/lib/modules/5.8.0-rc1+/’: Permission denied
Makefile:1342: recipe for target '_builtin_inst_' failed
make: *** [_builtin_inst_] Error 1
While modules.builtin is useful also for CONFIG_MODULES=n, this change
in the behavior is quite unexpected. Maybe "make modules_install"
can install modules.builtin irrespective of CONFIG_MODULES as Jonas
originally suggested.
Anyway, that commit should be reverted ASAP.
Reported-by: Douglas Anderson <dianders@chromium.org>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: Jonas Karlman <jonas@kwiboo.se>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Pull SCSI fixes from James Bottomley:
"One minor fix and two patches reworking the ata dma drain for the
!CONFIG_LIBATA case. The latter is a 5.7 regression fix"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: Wire up ata_scsi_dma_need_drain for SAS HBA drivers
scsi: libata: Provide an ata_scsi_dma_need_drain stub for !CONFIG_ATA
scsi: ufs-bsg: Fix runtime PM imbalance on error
Pull i2c fixes from Wolfram Sang:
- a small collection of remaining API conversion patches (all acked)
which allow to finally remove the deprecated API
- some documentation fixes and a MAINTAINERS addition
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
MAINTAINERS: Add robert and myself as qcom i2c cci maintainers
i2c: smbus: Fix spelling mistake in the comments
Documentation/i2c: SMBus start signal is S not A
i2c: remove deprecated i2c_new_device API
Documentation: media: convert to use i2c_new_client_device()
video: backlight: tosa_lcd: convert to use i2c_new_client_device()
x86/platform/intel-mid: convert to use i2c_new_client_device()
drm: encoder_slave: use new I2C API
drm: encoder_slave: fix refcouting error for modules
Use the correct the function name in the documentation for
"pcs_parse_one_pinctrl_entry()".
"smux_parse_one_pinctrl_entry()" appears to be an artifact from the
development of a prior patch series ("simple pinmux driver") which
transformed into pinctrl-single.
Signed-off-by: Drew Fustini <drew@beagleboard.org>
Link: https://lore.kernel.org/r/20200612112758.GA3407886@x1
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Pull tracing fixes from Steven Rostedt:
- Have recordmcount work with > 64K sections (to support LTO)
- kprobe RCU fixes
- Correct a kprobe critical section with missing mutex
- Remove redundant arch_disarm_kprobe() call
- Fix lockup when kretprobe triggers within kprobe_flush_task()
- Fix memory leak in fetch_op_data operations
- Fix sleep in atomic in ftrace trace array sample code
- Free up memory on failure in sample trace array code
- Fix incorrect reporting of function_graph fields in format file
- Fix quote within quote parsing in bootconfig
- Fix return value of bootconfig tool
- Add testcases for bootconfig tool
- Fix maybe uninitialized warning in ftrace pid file code
- Remove unused variable in tracing_iter_reset()
- Fix some typos
* tag 'trace-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix maybe-uninitialized compiler warning
tools/bootconfig: Add testcase for show-command and quotes test
tools/bootconfig: Fix to return 0 if succeeded to show the bootconfig
tools/bootconfig: Fix to use correct quotes for value
proc/bootconfig: Fix to use correct quotes for value
tracing: Remove unused event variable in tracing_iter_reset
tracing/probe: Fix memleak in fetch_op_data operations
trace: Fix typo in allocate_ftrace_ops()'s comment
tracing: Make ftrace packed events have align of 1
sample-trace-array: Remove trace_array 'sample-instance'
sample-trace-array: Fix sleeping function called from invalid context
kretprobe: Prevent triggering kretprobe from within kprobe_flush_task
kprobes: Remove redundant arch_disarm_kprobe() call
kprobes: Fix to protect kick_kprobe_optimizer() by kprobe_mutex
kprobes: Use non RCU traversal APIs on kprobe_tables if possible
kprobes: Suppress the suspicious RCU warning on kprobes
recordmcount: support >64k sections