It is generally preferred that the macros from
include/linux/compiler_attributes.h are used, unless there is a reason not
to.
checkpatch currently checks __attribute__ for each of packed, aligned,
section, printf, scanf, and weak. Other declarations in
compiler_attributes.h are not handled.
Add a generic test to check the presence of such attributes. Some
attributes require more specific handling and are kept separate.
Also add fixes to the generic attributes check to substitute the correct
conversions.
New attributes which are now handled are:
__always_inline__
__assume_aligned__(a, ## __VA_ARGS__)
__cold__
__const__
__copy__(symbol)
__designated_init__
__externally_visible__
__gnu_inline__
__malloc__
__mode__(x)
__no_caller_saved_registers__
__noclone__
__noinline__
__nonstring__
__noreturn__
__pure__
__unused__
__used__
Declarations which contain multiple attributes like
__attribute__((__packed__, __cold__)) are also handled except when proper
conversions for one or more attributes of the list cannot be determined.
Link: https://lore.kernel.org/linux-kernel-mentees/3ec15b41754b01666d94b76ce51b9832c2dd577a.camel@perches.com/
Link: https://lkml.kernel.org/r/20201025193103.23223-1-dwaipayanray1@gmail.com
Signed-off-by: Dwaipayan Ray <dwaipayanray1@gmail.com>
Suggested-by: Joe Perches <joe@perches.com>
Acked-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are about 100,000 uses of 'static const <type>' but about 400 uses
of 'static <type> const' in the kernel where type is not a pointer.
The kernel almost always uses "static const" over "const static" as there
is a compiler warning for that declaration style.
But there is no compiler warning for "static <type> const".
So add a checkpatch warning for the atypical declaration uses of.
const static <type> <foo>
and
static <type> const <foo>
For example:
$ ./scripts/checkpatch.pl -f --emacs --quiet --nosummary -types=static_const arch/arm/crypto/aes-ce-glue.c
arch/arm/crypto/aes-ce-glue.c:75: WARNING: Move const after static - use 'static const u8'
#75: FILE: arch/arm/crypto/aes-ce-glue.c:75:
+ static u8 const rcon[] = {
Link: https://lkml.kernel.org/r/4b863be68e679546b40d50b97a4a806c03056a1c.camel@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Presence of hexadecimal address or symbol results in false warning
message by checkpatch.pl.
For example, running checkpatch on commit b8ad540dd4 ("mptcp: fix
memory leak in mptcp_subflow_create_socket()") results in warning:
WARNING:REPEATED_WORD: Possible repeated word: 'ff'
00 00 00 00 00 00 00 00 00 2f 30 0a 81 88 ff ff ........./0.....
Similarly, the presence of list command output in commit results in
an unnecessary warning.
For example, running checkpatch on commit 899e5ffbf2 ("perf record:
Introduce --switch-output-event") gives:
WARNING:REPEATED_WORD: Possible repeated word: 'root'
dr-xr-x---. 12 root root 4096 Apr 27 17:46 ..
Here, it reports 'ff' and 'root' to be repeated, but it is in fact part
of some address or code, where it has to be repeated.
In these cases, the intent of the warning to find stylistic issues in
commit messages is not met and the warning is just completely wrong in
this case.
To avoid these warnings, add an additional regex check for the directory
permission pattern and avoid checking the line for this class of
warning. Similarly, to avoid hex pattern, check if the word consists of
hex symbols and skip this warning if it is not among the common english
words formed using hex letters.
A quick evaluation on v5.6..v5.8 showed that this fix reduces
REPEATED_WORD warnings by the frequency of 1890.
A quick manual check found all cases are related to hex output or list
command outputs in commit messages.
Link: https://lkml.kernel.org/r/20201024102253.13614-1-yashsri421@gmail.com
Signed-off-by: Aditya Srivastava <yashsri421@gmail.com>
Acked-by: Joe Perches <joe@perches.com>
Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Recently, commit 4f6ad8aa1eac ("checkpatch: move repeated word test")
moved the repeated word test to check for more file types. But after
this, if checkpatch.pl is run on MAINTAINERS, it generates several
new warnings of the type:
WARNING: Possible repeated word: 'git'
For example:
WARNING: Possible repeated word: 'git'
+T: git git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml.git
So, the pattern "git git://..." is a false positive in this case.
There are several other combinations which may produce a wrong warning
message, such as "@size size", ":Begin begin", etc.
Extend repeated word check to compare the characters before and after
the word matches.
If there is a non whitespace character before the first word or a non
whitespace character excluding punctuation characters after the second
word, then the check is skipped and the warning is avoided.
Also add case insensitive word matching to the repeated word check.
Link: https://lore.kernel.org/linux-kernel-mentees/81b6a0bb2c7b9256361573f7a13201ebcd4876f1.camel@perches.com/
Link: https://lkml.kernel.org/r/20201017162732.152351-1-dwaipayanray1@gmail.com
Signed-off-by: Dwaipayan Ray <dwaipayanray1@gmail.com>
Suggested-by: Joe Perches <joe@perches.com>
Suggested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Acked-by: Joe Perches <joe@perches.com>
Cc: Aditya Srivastava <yashsri421@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
LZ4 final literal copy could be overlapped when doing
in-place decompression, so it's unsafe to just use memcpy()
on an optimized memcpy approach but memmove() instead.
Upstream LZ4 has updated this years ago [1] (and the impact
is non-sensible [2] plus only a few bytes remain), this commit
just synchronizes LZ4 upstream code to the kernel side as well.
It can be observed as EROFS in-place decompression failure
on specific files when X86_FEATURE_ERMS is unsupported,
memcpy() optimization of commit 59daa706fb ("x86, mem:
Optimize memcpy by avoiding memory false dependece") will
be enabled then.
Currently most modern x86-CPUs support ERMS, these CPUs just
use "rep movsb" approach so no problem at all. However, it can
still be verified with forcely disabling ERMS feature...
arch/x86/lib/memcpy_64.S:
ALTERNATIVE_2 "jmp memcpy_orig", "", X86_FEATURE_REP_GOOD, \
- "jmp memcpy_erms", X86_FEATURE_ERMS
+ "jmp memcpy_orig", X86_FEATURE_ERMS
We didn't observe any strange on arm64/arm/x86 platform before
since most memcpy() would behave in an increasing address order
("copy upwards" [3]) and it's the correct order of in-place
decompression but it really needs an update to memmove() for sure
considering it's an undefined behavior according to the standard
and some unique optimization already exists in the kernel.
[1] 33cb8518ac
[2] https://github.com/lz4/lz4/pull/717#issuecomment-497818921
[3] https://sourceware.org/bugzilla/show_bug.cgi?id=12518
Link: https://lkml.kernel.org/r/20201122030749.2698994-1-hsiangkao@redhat.com
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Nick Terrell <terrelln@fb.com>
Cc: Yann Collet <yann.collet.73@gmail.com>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Li Guifu <bluce.liguifu@huawei.com>
Cc: Guo Xuenan <guoxuenan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Fortify strscpy()", v7.
This patch implements a fortified version of strscpy() enabled by setting
CONFIG_FORTIFY_SOURCE=y. The new version ensures the following before
calling vanilla strscpy():
1. There is no read overflow because either size is smaller than src
length or we shrink size to src length by calling fortified strnlen().
2. There is no write overflow because we either failed during
compilation or at runtime by checking that size is smaller than dest
size. Note that, if src and dst size cannot be got, the patch defaults
to call vanilla strscpy().
The patches adds the following:
1. Implement the fortified version of strscpy().
2. Add a new LKDTM test to ensures the fortified version still returns
the same value as the vanilla one while panic'ing when there is a write
overflow.
3. Correct some typos in LKDTM related file.
I based my modifications on top of two patches from Daniel Axtens which
modify calls to __builtin_object_size, in fortified string functions, to
ensure the true size of char * are returned and not the surrounding
structure size.
About performance, I measured the slow down of fortified strscpy(), using
the vanilla one as baseline. The hardware I used is an Intel i3 2130 CPU
clocked at 3.4 GHz. I ran "Linux 5.10.0-rc4+ SMP PREEMPT" inside qemu
3.10 with 4 CPU cores. The following code, called through LKDTM, was used
as a benchmark:
#define TIMES 10000
char *src;
char dst[7];
int i;
ktime_t begin;
src = kstrdup("foobar", GFP_KERNEL);
if (src == NULL)
return;
begin = ktime_get();
for (i = 0; i < TIMES; i++)
strscpy(dst, src, strlen(src));
pr_info("%d fortified strscpy() tooks %lld", TIMES, ktime_get() - begin);
begin = ktime_get();
for (i = 0; i < TIMES; i++)
__real_strscpy(dst, src, strlen(src));
pr_info("%d vanilla strscpy() tooks %lld", TIMES, ktime_get() - begin);
kfree(src);
I called the above code 30 times to compute stats for each version (in ns,
round to int):
| version | mean | std | median | 95th |
| --------- | ------- | ------ | ------- | ------- |
| fortified | 245_069 | 54_657 | 216_230 | 331_122 |
| vanilla | 172_501 | 70_281 | 143_539 | 219_553 |
On average, fortified strscpy() is approximately 1.42 times slower than
vanilla strscpy(). For the 95th percentile, the fortified version is
about 1.50 times slower.
So, clearly the stats are not in favor of fortified strscpy(). But, the
fortified version loops the string twice (one in strnlen() and another in
vanilla strscpy()) while the vanilla one only loops once. This can
explain why fortified strscpy() is slower than the vanilla one.
This patch (of 5):
When the fortify feature was first introduced in commit 6974f0c455
("include/linux/string.h: add the option of fortified string.h
functions"), Daniel Micay observed:
* It should be possible to optionally use __builtin_object_size(x, 1) for
some functions (C strings) to detect intra-object overflows (like
glibc's _FORTIFY_SOURCE=2), but for now this takes the conservative
approach to avoid likely compatibility issues.
This is a case that often cannot be caught by KASAN. Consider:
struct foo {
char a[10];
char b[10];
}
void test() {
char *msg;
struct foo foo;
msg = kmalloc(16, GFP_KERNEL);
strcpy(msg, "Hello world!!");
// this copy overwrites foo.b
strcpy(foo.a, msg);
}
The questionable copy overflows foo.a and writes to foo.b as well. It
cannot be detected by KASAN. Currently it is also not detected by
fortify, because strcpy considers __builtin_object_size(x, 0), which
considers the size of the surrounding object (here, struct foo). However,
if we switch the string functions over to use __builtin_object_size(x, 1),
the compiler will measure the size of the closest surrounding subobject
(here, foo.a), rather than the size of the surrounding object as a whole.
See https://gcc.gnu.org/onlinedocs/gcc/Object-Size-Checking.html for more
info.
Only do this for string functions: we cannot use it on things like memcpy,
memmove, memcmp and memchr_inv due to code like this which purposefully
operates on multiple structure members: (arch/x86/kernel/traps.c)
/*
* regs->sp points to the failing IRET frame on the
* ESPFIX64 stack. Copy it to the entry stack. This fills
* in gpregs->ss through gpregs->ip.
*
*/
memmove(&gpregs->ip, (void *)regs->sp, 5*8);
This change passes an allyesconfig on powerpc and x86, and an x86 kernel
built with it survives running with syz-stress from syzkaller, so it seems
safe so far.
Link: https://lkml.kernel.org/r/20201122162451.27551-1-laniel_francis@privacyrequired.com
Link: https://lkml.kernel.org/r/20201122162451.27551-2-laniel_francis@privacyrequired.com
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Francis Laniel <laniel_francis@privacyrequired.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As discussed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97445 the
const_ilog2 macro generates a lot of code which interferes badly with GCC
inlining heuristics, until it can be proven that the ilog2 argument can or
can't be simplified into a constant.
It can be expressed using __builtin_clzll builtin which is supported by
GCC 3.4 and later and when used only in the __builtin_constant_p guarded
code it ought to always fold back to a constant. Other compilers support
the same builtin for many years too.
Other option would be to change the const_ilog2 macro, though as the
description says it is meant to be used also in C constant expressions,
and while GCC will fold it to constant with constant argument even in
those, perhaps it is better to avoid using extensions in that case.
[akpm@linux-foundation.org: coding style fixes]
Link: https://lkml.kernel.org/r/20201120125154.GB3040@hirez.programming.kicks-ass.net
Link: https://lkml.kernel.org/r/20201021132718.GB2176@tucnak
Signed-off-by: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On PREEMPT_RT the locks are quite different so they can't be tested as it
is done below. The alternative is to test for the waitlock within
rtmutex.
This is the bare minimun to get it compiled. Problems which exist on
PREEMP_RT:
- none of the locks (spinlock_t, rwlock_t, mutex_t, rw_semaphore) may
be acquired with disabled preemption or interrupts.
If I read the code correct the it is possible to acquire a mutex_t
with disabled interrupts.
I don't know how to obtain a lock pointer. Technically they are not
exported to userland.
- memory can not be allocated with disabled preemption or interrupts
even with GFP_ATOMIC.
Link: https://lkml.kernel.org/r/20201028181041.xyeothhkouc3p4md@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When building mpc885_ads_defconfig with gcc 10.1,
the function get_order() appears 50 times in vmlinux:
[linux]# ppc-linux-objdump -x vmlinux | grep get_order | wc -l
50
[linux]# size vmlinux
text data bss dec hex filename
3842620 675624 135160 4653404 47015c vmlinux
In the old days, marking a function 'static inline' was forcing GCC to
inline, but since commit ac7c3e4ff4 ("compiler: enable
CONFIG_OPTIMIZE_INLINING forcibly") GCC may decide to not inline a
function.
It looks like GCC 10 is taking poor decisions on this.
get_order() compiles into the following tiny function, occupying 20
bytes of text.
0000007c <get_order>:
7c: 38 63 ff ff addi r3,r3,-1
80: 54 63 a3 3e rlwinm r3,r3,20,12,31
84: 7c 63 00 34 cntlzw r3,r3
88: 20 63 00 20 subfic r3,r3,32
8c: 4e 80 00 20 blr
By forcing get_order() to be __always_inline, the size of text is
reduced by 1940 bytes, that is almost twice the space occupied by
50 times get_order()
[linux-powerpc]# size vmlinux
text data bss dec hex filename
3840680 675588 135176 4651444 46f9b4 vmlinux
Link: https://lkml.kernel.org/r/96c6172d619c51acc5c1c4884b80785c59af4102.1602949927.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Cc: Segher Boessenkool <segher@kernel.crashing.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 1fde6f21d9 ("proc: fix /proc/net/* after setns(2)") only forced
revalidation of regular files under /proc/net/
However, /proc/net/ is unusual in the sense of /proc/net/foo handlers
take netns pointer from parent directory which is old netns.
Steps to reproduce:
(void)open("/proc/net/sctp/snmp", O_RDONLY);
unshare(CLONE_NEWNET);
int fd = open("/proc/net/sctp/snmp", O_RDONLY);
read(fd, &c, 1);
Read will read wrong data from original netns.
Patch forces lookup on every directory under /proc/net .
Link: https://lkml.kernel.org/r/20201205160916.GA109739@localhost.localdomain
Fixes: 1da4d377f9 ("proc: revalidate misc dentries")
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reported-by: "Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@nokia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The scenario on which "Free swap = -4kB" happens in my system, which is caused
by several get_swap_pages racing with each other and show_swap_cache_info
happens simutaniously. No need to add a lock on get_swap_page_of_type as we
remove "Presub/PosAdd" here.
ProcessA ProcessB ProcessC
ngoals = 1 ngoals = 1
avail = nr_swap_pages(1) avail = nr_swap_pages(1)
nr_swap_pages(1) -= ngoals
nr_swap_pages(0) -= ngoals
nr_swap_pages = -1
Link: https://lkml.kernel.org/r/1607050340-4535-1-git-send-email-zhaoyang.huang@unisoc.com
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull exec-update-lock update from Eric Biederman:
"The key point of this is to transform exec_update_mutex into a
rw_semaphore so readers can be separated from writers.
This makes it easier to understand what the holders of the lock are
doing, and makes it harder to contend or deadlock on the lock.
The real deadlock fix wound up in perf_event_open"
* 'exec-update-lock-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
exec: Transform exec_update_mutex into a rw_semaphore
Pull execve updates from Eric Biederman:
"This set of changes ultimately fixes the interaction of posix file
lock and exec. Fundamentally most of the change is just moving where
unshare_files is called during exec, and tweaking the users of
files_struct so that the count of files_struct is not unnecessarily
played with.
Along the way fcheck and related helpers were renamed to more
accurately reflect what they do.
There were also many other small changes that fell out, as this is the
first time in a long time much of this code has been touched.
Benchmarks haven't turned up any practical issues but Al Viro has
observed a possibility for a lot of pounding on task_lock. So I have
some changes in progress to convert put_files_struct to always rcu
free files_struct. That wasn't ready for the merge window so that will
have to wait until next time"
* 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
exec: Move io_uring_task_cancel after the point of no return
coredump: Document coredump code exclusively used by cell spufs
file: Remove get_files_struct
file: Rename __close_fd_get_file close_fd_get_file
file: Replace ksys_close with close_fd
file: Rename __close_fd to close_fd and remove the files parameter
file: Merge __alloc_fd into alloc_fd
file: In f_dupfd read RLIMIT_NOFILE once.
file: Merge __fd_install into fd_install
proc/fd: In fdinfo seq_show don't use get_files_struct
bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu
proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
file: Implement task_lookup_next_fd_rcu
kcmp: In get_file_raw_ptr use task_lookup_fd_rcu
proc/fd: In tid_fd_mode use task_lookup_fd_rcu
file: Implement task_lookup_fd_rcu
file: Rename fcheck lookup_fd_rcu
file: Replace fcheck_files with files_lookup_fd_rcu
file: Factor files_lookup_fd_locked out of fcheck_files
file: Rename __fcheck_files to files_lookup_fd_raw
...
Pull signal cleanup from Eric Biederman:
"Remove a never used HP-UX compatibility from parisc headers and
consolidating the SA_* flags definitions into a generic header as much
as possible.
We only have 32 SA_* flag bits total, so we need to be careful. But as
this is the first addition in a decade or so I think we are fine for
the forseeable future"
* 'signal-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal/parisc: Remove parisc specific definition of __ARCH_UAPI_SA_FLAGS
Pull close_range/openat2 updates from Christian Brauner:
"This contains a fix for openat2() to make RESOLVE_BENEATH and
RESOLVE_IN_ROOT mutually exclusive. It doesn't make sense to specify
both at the same time. The openat2() selftests have been extended to
verify that these two flags can't be specified together.
This also adds the CLOSE_RANGE_CLOEXEC flag to close_range() which
allows to mark a range of file descriptors as close-on-exec without
actually closing them.
This is useful in general but the use-case that triggered the patch is
installing a seccomp profile in the calling task before exec. If the
seccomp profile wants to block the close_range() syscall it obviously
can't use it to close all fds before exec. If it calls close_range()
before installing the seccomp profile it needs to take care not to
close fds that it will still need before the exec meaning it would
have to call close_range() multiple times on different ranges and then
still fall back to closing fds one by one right before the exec.
CLOSE_RANGE_CLOEXEC allows to solve this problem relying on the exec
codepath to get rid of the unwanted fds. The close_range() tests have
been expanded to verify that CLOSE_RANGE_CLOEXEC works"
* tag 'close-range-openat2-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
selftests: core: add tests for CLOSE_RANGE_CLOEXEC
fs, close_range: add flag CLOSE_RANGE_CLOEXEC
selftests: openat2: add RESOLVE_ conflict test
openat2: reject RESOLVE_BENEATH|RESOLVE_IN_ROOT
Pull regset updates from Al Viro:
"Dead code removal, mostly.
The only exception is a bit of cleanups on itanic (getting rid of
redundant stack unwinds - each access_uarea() call does it and we call
that 7 times in a row in ptrace_[sg]etregs(), *after* having done it
ourselves in the caller; location where the user registers have been
spilled won't change under us, and we can bloody well just call
access_elf_reg() directly, giving it the unw_frame_info we'd
calculated for our own purposes)"
* 'regset.followup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
c6x: kill ELF_CORE_COPY_FPREGS
whack-a-mole: USE_ELF_CORE_DUMP
[ia64] ptrace_[sg]etregs(): use access_elf_reg() instead of access_uarea()
[ia64] missed cleanups from switch to regset coredumps
arm: kill dump_task_regs()
Pull epoll updates from Al Viro:
"Deal with epoll loop check/removal races sanely (among other things).
The solution merged last cycle (pinning a bunch of struct file
instances) had been forced by the wrong data structures; untangling
that takes a bunch of preparations, but it's worth doing - control
flow in there is ridiculously overcomplicated. Memory footprint has
also gone down, while we are at it.
This is not all I want to do in the area, but since I didn't get
around to posting the followups they'll have to wait for the next
cycle"
* 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (27 commits)
epoll: take epitem list out of struct file
epoll: massage the check list insertion
lift rcu_read_lock() into reverse_path_check()
convert ->f_ep_links/->fllink to hlist
ep_insert(): move creation of wakeup source past the fl_ep_links insertion
fold ep_read_events_proc() into the only caller
take the common part of ep_eventpoll_poll() and ep_item_poll() into helper
ep_insert(): we only need tep->mtx around the insertion itself
ep_insert(): don't open-code ep_remove() on failure exits
lift locking/unlocking ep->mtx out of ep_{start,done}_scan()
ep_send_events_proc(): fold into the caller
lift the calls of ep_send_events_proc() into the callers
lift the calls of ep_read_events_proc() into the callers
ep_scan_ready_list(): prepare to splitup
ep_loop_check_proc(): saner calling conventions
get rid of ep_push_nested()
ep_loop_check_proc(): lift pushing the cookie into callers
clean reverse_path_check_proc() a bit
reverse_path_check_proc(): don't bother with cookies
reverse_path_check_proc(): sane arguments
...
Pull erofs updates from Gao Xiang:
"This cycle we got rid of magical page->mapping type marks for
temporary pages which had some concern before, now such usage is
replaced with specific page->private.
Also switch to inplace I/O instead of allocating extra cached pages to
avoid direct reclaim under low memory scenario.
There are some bmap bugfix and minor cleanups as well"
* tag 'erofs-for-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: avoid using generic_block_bmap
erofs: force inplace I/O under low memory scenario
erofs: simplify try_to_claim_pcluster()
erofs: insert to managed cache after adding to pcl
erofs: get rid of magical Z_EROFS_MAPPING_STAGING
erofs: remove a void EROFS_VERSION macro set in Makefile
Pull nfsd updates from Chuck Lever:
"Several substantial changes this time around:
- Previously, exporting an NFS mount via NFSD was considered to be an
unsupported feature. With v5.11, the community has attempted to
make re-exporting a first-class feature of NFSD.
This would enable the Linux in-kernel NFS server to be used as an
intermediate cache for a remotely-located primary NFS server, for
example, even with other NFS server implementations, like a NetApp
filer, as the primary.
- A short series of patches brings support for multiple RPC/RDMA data
chunks per RPC transaction to the Linux NFS server's RPC/RDMA
transport implementation.
This is a part of the RPC/RDMA spec that the other premiere
NFS/RDMA implementation (Solaris) has had for a very long time, and
completes the implementation of RPC/RDMA version 1 in the Linux
kernel's NFS server.
- Long ago, NFSv4 support was introduced to NFSD using a series of C
macros that hid dprintk's and goto's. Over time, the kernel's XDR
implementation has been greatly improved, but these C macros have
remained and become fallow. A series of patches in this pull
request completely replaces those macros with the use of current
kernel XDR infrastructure. Benefits include:
- More robust input sanitization in NFSD's NFSv4 XDR decoders.
- Make it easier to use common kernel library functions that use
XDR stream APIs (for example, GSS-API).
- Align the structure of the source code with the RFCs so it is
easier to learn, verify, and maintain our XDR implementation.
- Removal of more than a hundred hidden dprintk() call sites.
- Removal of some explicit manipulation of pages to help make the
eventual transition to xdr->bvec smoother.
- On top of several related fixes in 5.10-rc, there are a few more
fixes to get the Linux NFSD implementation of NFSv4.2 inter-server
copy up to speed.
And as usual, there is a pinch of seasoning in the form of a
collection of unrelated minor bug fixes and clean-ups.
Many thanks to all who contributed this time around!"
* tag 'nfsd-5.11' of git://git.linux-nfs.org/projects/cel/cel-2.6: (131 commits)
nfsd: Record NFSv4 pre/post-op attributes as non-atomic
nfsd: Set PF_LOCAL_THROTTLE on local filesystems only
nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE
exportfs: Add a function to return the raw output from fh_to_dentry()
nfsd: close cached files prior to a REMOVE or RENAME that would replace target
nfsd: allow filesystems to opt out of subtree checking
nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
Revert "nfsd4: support change_attr_type attribute"
nfsd4: don't query change attribute in v2/v3 case
nfsd: minor nfsd4_change_attribute cleanup
nfsd: simplify nfsd4_change_info
nfsd: only call inode_query_iversion in the I_VERSION case
nfs_common: need lock during iterate through the list
NFSD: Fix 5 seconds delay when doing inter server copy
NFSD: Fix sparse warning in nfs4proc.c
SUNRPC: Remove XDRBUF_SPARSE_PAGES flag in gss_proxy upcall
sunrpc: clean-up cache downcall
nfsd: Fix message level for normal termination
NFSD: Remove macros that are no longer used
NFSD: Replace READ* macros in nfsd4_decode_compound()
...
Pull jfs updates from David Kleikamp:
"A few jfs fixes"
* tag 'jfs-5.11' of git://github.com/kleikamp/linux-shaggy:
jfs: Fix array index bounds check in dbAdjTree
jfs: Fix memleak in dbAdjCtl
jfs: delete duplicated words + other fixes
Pull dlm updates from David Teigland:
"This set includes more low level communication layer cleanups.
The main change is the listening socket is no longer handled as a
special case of node connection sockets. There is one small fix for
checking the number of local connections"
* tag 'dlm-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
fs: dlm: check on existing node address
fs: dlm: constify addr_compare
fs: dlm: fix check for multi-homed hosts
fs: dlm: listen socket out of connection hash
fs: dlm: refactor sctp sock parameter
fs: dlm: move shutdown action to node creation
fs: dlm: move connect callback in node creation
fs: dlm: add helper for init connection
fs: dlm: handle non blocked connect event
fs: dlm: flush othercon at close
fs: dlm: add get buffer error handling
fs: dlm: define max send buffer
fs: dlm: fix proper srcu api call
Pull btrfs updates from David Sterba:
"We have a mix of all kinds of changes, feature updates, core stuff,
performance improvements and lots of cleanups and preparatory changes.
User visible:
- export filesystem generation in sysfs
- new features for mount option 'rescue':
- what's currently supported is exported in sysfs
- 'ignorebadroots'/'ibadroots' - continue even if some essential
tree roots are not usable (extent, uuid, data reloc, device,
csum, free space)
- 'ignoredatacsums'/'idatacsums' - skip checksum verification on
data
- 'all' - now enables 'ignorebadroots' + 'ignoredatacsums' +
'nologreplay'
- export read mirror policy settings to sysfs, new policies will be
added in the future
- remove inode number cache feature (mount -o inode_cache), obsoleted
in 5.9
User visible fixes:
- async discard scheduling fixes on high loads
- update inode byte counter atomically so stat() does not report
wrong value in some cases
- free space tree fixes:
- correctly report status of v2 after remount
- clear v1 cache inodes when v2 is newly enabled after remount
Core:
- switch own tree lock implementation to standard rw semaphore:
- one-level lock nesting is not required anymore, the last use of
this was in free space that's now loaded asynchronously
- own implementation of adaptive spinning before taking mutex has
been part of rwsem
- performance seems to be better in general, much better (+tens
of percents) for some workloads
- lockdep does not complain
- finish direct IO conversion to iomap infrastructure, remove
temporary workaround for DSYNC after iomap API updates
- preparatory work to support data and metadata blocks smaller than
page:
- generalize code that assumes sectorsize == PAGE_SIZE, lots of
refactoring
- planned namely for 64K pages (eg. arm64, ppc64)
- scrub read-only support
- preparatory work for zoned allocation mode (SMR/ZBC/ZNS friendly):
- disable incompatible features
- round-robin superblock write
- free space cache (v1) is loaded asynchronously, remove tree path
recursion
- slightly improved time tacking for transaction kthread wake ups
Performance improvements (note that the numbers depend on load type or
other features and weren't run on the same machine):
- skip unnecessary work:
- do not start readahead for csum tree when scrubbing non-data
block groups
- do not start and wait for delalloc on snapshot roots on
transaction commit
- fix race when defragmenting leads to unnecessary IO
- dbench speedups (+throughput%/-max latency%):
- skip unnecessary searches for xattrs when logging an inode
(+10.8/-8.2)
- stop incrementing log batch when joining log transaction (1-2)
- unlock path before checking if extent is shared during nocow
writeback (+5.0/-20.5), on fio load +9.7% throughput/-9.8%
runtime
- several tree log improvements, eg. removing unnecessary
operations, fixing races that lead to additional work
(+12.7/-8.2)
- tree-checker error branches annotated with unlikely() (+3%
throughput)
Other:
- cleanups
- lockdep fixes
- more btrfs_inode conversions
- error variable cleanups"
* tag 'for-5.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (198 commits)
btrfs: scrub: allow scrub to work with subpage sectorsize
btrfs: scrub: support subpage data scrub
btrfs: scrub: support subpage tree block scrub
btrfs: scrub: always allocate one full page for one sector for RAID56
btrfs: scrub: reduce width of extent_len/stripe_len from 64 to 32 bits
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
btrfs: remove btrfs_find_ordered_sum call from btrfs_lookup_bio_sums
btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors
btrfs: update num_extent_pages to support subpage sized extent buffer
btrfs: don't allow tree block to cross page boundary for subpage support
btrfs: calculate inline extent buffer page size based on page size
btrfs: factor out btree page submission code to a helper
btrfs: make btrfs_verify_data_csum follow sector size
btrfs: pass bio_offset to check_data_csum() directly
btrfs: rename bio_offset of extent_submit_bio_start_t to dio_file_offset
btrfs: fix lockdep warning when creating free space tree
btrfs: skip space_cache v1 setup when not using it
btrfs: remove free space items when disabling space cache v1
btrfs: warn when remount will not change the free space tree
btrfs: use superblock state to print space_cache mount option
...
Pull file locking fixes from Jeff Layton:
"A fix for some undefined integer overflow behavior, a typo in a
comment header, and a fix for a potential deadlock involving internal
senders of SIGIO/SIGURG"
* tag 'locks-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fcntl: Fix potential deadlock in send_sig{io, urg}()
locks: fix a typo at a kernel-doc markup
locks: Fix UBSAN undefined behaviour in flock64_to_posix_lock
Pull rpmsg updates from Bjorn Andersson:
"This extracts the 'nameserver' previously used only by the virtio
rpmsg transport to work ontop of any rpmsg implementation and
clarifies the endianness of the data types used in rpmsg"
* tag 'rpmsg-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc:
rpmsg: Turn name service into a stand alone driver
rpmsg: Make rpmsg_{register|unregister}_device() public
rpmsg: virtio: Add rpmsg channel device ops
rpmsg: core: Add channel creation internal API
rpmsg: virtio: Rename rpmsg_create_channel
rpmsg: Move structure rpmsg_ns_msg to header file
rpmsg: virtio: Move from virtio to rpmsg byte conversion
rpmsg: Introduce __rpmsg{16|32|64} types
Pull hwspinlock updates from Bjorn Andersson:
"This contains a few minor cleanups and build warning fixes for the
sprd and sirf hwspinlock drivers"
* tag 'hwlock-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc:
hwspinlock: sirf: Remove the redundant 'of_match_ptr'
hwspinlock: sprd: fixed warning of unused variable 'sprd_hwspinlock_of_match'
hwspinlock: sprd: use module_platform_driver() instead postcore initcall
hwspinlock: sprd: Remove redundant header files
Pull remoteproc updates from Bjorn Andersson:
"This introduces support for controlling the TI PRU, adds hooks for
remoteproc drivers to override the default ELF based coredump format,
introduces a library function for coredumps using named sections (aka
the Qualcomm "minidump" format).
It also fixes a problem with inconsistent notifications sent by the
Qualcomm sysmon driver to the remote processors and it migrates the
Qualcomm MSS driver to use power-domains for resources that aren't
actually regulators.
Lastly it contains a number of fixes for minor bugs and build warnings
throughout the drivers"
* tag 'rproc-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc: (47 commits)
remoteproc/mediatek: read IPI buffer offset from FW
remoteproc/mediatek: unprepare clk if scp_before_load fails
remoteproc: qcom: Fix potential NULL dereference in adsp_init_mmio()
remoteproc/mediatek: Fix kernel test robot warning
remoteproc: k3-dsp: Fix return value check in k3_dsp_rproc_of_get_memories()
remoteproc: qcom: pas: fix error handling in adsp_pds_enable
remoteproc: qcom: fix reference leak in adsp_start
remoteproc: q6v5-mss: fix error handling in q6v5_pds_enable
remoteproc/mtk_scp: surround DT device IDs with CONFIG_OF
remoteproc: qcom: Add minidump id for sm8150 modem
remoteproc: qcom: Add capability to collect minidumps
remoteproc: coredump: Add minidump functionality
remoteproc: core: Add ops to enable custom coredump functionality
remoteproc/mediatek: change MT8192 CFG register base
remoteproc: pru: Add support for various PRU cores on K3 J721E SoCs
remoteproc: pru: Add support for various PRU cores on K3 AM65x SoCs
remoteproc: pru: Add pru-specific debugfs support
remoteproc: pru: Add support for PRU specific interrupt configuration
remoteproc: pru: Add a PRU remoteproc driver
dt-bindings: remoteproc: Add binding doc for PRU cores in the PRU-ICSS
...