Currently, the code responsible for initializing and poisoning memory in
post_alloc_hook() is scattered across two locations: kasan_alloc_pages()
hook for HW_TAGS KASAN and post_alloc_hook() itself. This is confusing.
This and a few following patches combine the code from these two
locations. Along the way, these patches do a step-by-step restructure the
many performed checks to make them easier to follow.
Replace the only caller of kasan_alloc_pages() with its implementation.
As kasan_has_integrated_init() is only true when CONFIG_KASAN_HW_TAGS is
enabled, moving the code does no functional changes.
Also move init and init_tags variables definitions out of
kasan_has_integrated_init() clause in post_alloc_hook(), as they have the
same values regardless of what the if condition evaluates to.
This patch is not useful by itself but makes the simplifications in the
following patches easier to follow.
Link: https://lkml.kernel.org/r/5ac7e0b30f5cbb177ec363ddd7878a3141289592.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "kasan, vmalloc, arm64: add vmalloc tagging support for SW/HW_TAGS", v6.
This patchset adds vmalloc tagging support for SW_TAGS and HW_TAGS
KASAN modes.
About half of patches are cleanups I went for along the way. None of them
seem to be important enough to go through stable, so I decided not to
split them out into separate patches/series.
The patchset is partially based on an early version of the HW_TAGS
patchset by Vincenzo that had vmalloc support. Thus, I added a
Co-developed-by tag into a few patches.
SW_TAGS vmalloc tagging support is straightforward. It reuses all of the
generic KASAN machinery, but uses shadow memory to store tags instead of
magic values. Naturally, vmalloc tagging requires adding a few
kasan_reset_tag() annotations to the vmalloc code.
HW_TAGS vmalloc tagging support stands out. HW_TAGS KASAN is based on Arm
MTE, which can only assigns tags to physical memory. As a result, HW_TAGS
KASAN only tags vmalloc() allocations, which are backed by page_alloc
memory. It ignores vmap() and others.
This patch (of 39):
Currently, should_skip_kasan_poison() has two definitions: one for when
CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, one for when it's not.
Instead of duplicating the checks, add a deferred_pages_enabled() helper
and use it in a single should_skip_kasan_poison() definition.
Also move should_skip_kasan_poison() closer to its caller and clarify all
conditions in the comment.
Link: https://lkml.kernel.org/r/cover.1643047180.git.andreyknvl@google.com
Link: https://lkml.kernel.org/r/658b79f5fb305edaf7dc16bc52ea870d3220d4a8.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NR_FILE_MAPPED accounting in mm/rmap.c (for /proc/meminfo "Mapped" and
/proc/vmstat "nr_mapped" and the memcg's memory.stat "mapped_file") is
slightly flawed for file or shmem huge pages.
It is well thought out, and looks convincing, but there's a racy case when
the careful counting in page_remove_file_rmap() (without page lock) gets
discarded. So that in a workload like two "make -j20" kernel builds under
memory pressure, with cc1 on hugepage text, "Mapped" can easily grow by a
spurious 5MB or more on each iteration, ending up implausibly bigger than
most other numbers in /proc/meminfo. And, hypothetically, might grow to
the point of seriously interfering in mm/vmscan.c's heuristics, which do
take NR_FILE_MAPPED into some consideration.
Fixed by moving the __mod_lruvec_page_state() down to where it will not be
missed before return (and I've grown a bit tired of that oft-repeated
but-not-everywhere comment on the __ness: it gets lost in the move here).
Does page_add_file_rmap() need the same change? I suspect not, because
page lock is held in all relevant cases, and its skipping case looks safe;
but it's much easier to be sure, if we do make the same change.
Link: https://lkml.kernel.org/r/e02e52a1-8550-a57c-ed29-f51191ea2375@google.com
Fixes: dd78fedde4 ("rmap: support file thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The PG_waiters bit is not included in PAGE_FLAGS_CHECK_AT_FREE, and
vmscan.c's free_unref_page_list() callers rely on that not to generate
bad_page() alerts. So __page_cache_release(), put_pages_list() and
release_pages() (and presumably copy-and-pasted free_zone_device_page())
are redundant and misleading to make a special point of clearing it (as
the "__" implies, it could only safely be used on the freeing path).
Delete __ClearPageWaiters(). Remark on this in one of the "possible"
comments in folio_wake_bit(), and delete the superfluous comments.
Link: https://lkml.kernel.org/r/3eafa969-5b1a-accf-88fe-318784c791a@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When viewing page owner information, we may want to cull blocks of
information with our own rules. So it is important to enhance culling
function to provide the support for customizing culling rules.
Therefore, following adjustments are made:
1. Add --cull option to support the culling of blocks of information
with user-defined culling rules.
./page_owner_sort <input> <output> --cull=<rules>
./page_owner_sort <input> <output> --cull <rules>
<rules> is a single argument in the form of a comma-separated list to
specify individual culling rules, by the sequence of keys k1,k2, ....
Mixed use of abbreviated and complete-form of keys is allowed.
For reference, please see the document(Documentation/vm/page_owner.rst).
Now, assuming two blocks in the input file are as follows:
Page allocated via order 0, mask xxxx, pid 1, tgid 1 (task_name_demo)
PFN xxxx
prep_new_page+0xd0/0xf8
get_page_from_freelist+0x4a0/0x1290
__alloc_pages+0x168/0x340
alloc_pages+0xb0/0x158
Page allocated via order 0, mask xxxx, pid 32, tgid 32 (task_name_demo)
PFN xxxx
prep_new_page+0xd0/0xf8
get_page_from_freelist+0x4a0/0x1290
__alloc_pages+0x168/0x340
alloc_pages+0xb0/0x158
If we want to cull the blocks by stacktrace and task command name, we can
use this command:
./page_owner_sort <input> <output> --cull=stacktrace,name
The output would be like:
2 times, 2 pages, task_comm_name: task_name_demo
prep_new_page+0xd0/0xf8
get_page_from_freelist+0x4a0/0x1290
__alloc_pages+0x168/0x340
alloc_pages+0xb0/0x158
As we can see, these two blocks are culled successfully, for they share
the same pid and task command name.
However, if we want to cull the blocks by pid, stacktrace and task command
name, we can this command:
./page_owner_sort <input> <output> --cull=stacktrace,name,pid
The output would be like:
1 times, 1 pages, PID 1, task_comm_name: task_name_demo
prep_new_page+0xd0/0xf8
get_page_from_freelist+0x4a0/0x1290
__alloc_pages+0x168/0x340
alloc_pages+0xb0/0x158
1 times, 1 pages, PID 32, task_comm_name: task_name_demo
prep_new_page+0xd0/0xf8
get_page_from_freelist+0x4a0/0x1290
__alloc_pages+0x168/0x340
alloc_pages+0xb0/0x158
As we can see, these two blocks are failed to cull, for their PIDs are
different.
2. Add explanations of --cull options to the document.
This work is coauthored by
Yixuan Cao
Shenghong Han
Yinan Zhang
Chongxi Zhao
Yuhong Feng
Link: https://lkml.kernel.org/r/20220312145834.624-1-yejiajian2018@email.szu.edu.cn
Signed-off-by: Jiajian Ye <yejiajian2018@email.szu.edu.cn>
Cc: Yixuan Cao <caoyixuan2019@email.szu.edu.cn>
Cc: Shenghong Han <hanshenghong2019@email.szu.edu.cn>
Cc: Yinan Zhang <zhangyinan2019@email.szu.edu.cn>
Cc: Chongxi Zhao <zhaochongxi2019@email.szu.edu.cn>
Cc: Yuhong Feng <yuhongf@szu.edu.cn>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Sean Anderson <seanga2@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When viewing page owner information, we may also need to select the blocks
by PID, TGID or task command name, which helps to get more accurate page
allocation information as needed.
Therefore, following adjustments are made:
1. Add three new options, including --pid, --tgid and --name, to support
the selection of information blocks by a specific pid, tgid and task
command name. In addtion, multiple options are allowed to be used at
the same time.
./page_owner_sort [input] [output] --pid <PID>
./page_owner_sort [input] [output] --tgid <TGID>
./page_owner_sort [input] [output] --name <TASK_COMMAND_NAME>
Assuming a scenario when a multi-threaded program, ./demo (PID =
5280), is running, and ./demo creates a child process (PID = 5281).
$ps
PID TTY TIME CMD
5215 pts/0 00:00:00 bash
5280 pts/0 00:00:00 ./demo
5281 pts/0 00:00:00 ./demo
5282 pts/0 00:00:00 ps
It would be better to filter out the records with tgid=5280 and the
task name "demo" when debugging the parent process, and the specific
usage is
./page_owner_sort [input] [output] --tgid 5280 --name demo
2. Add explanations of three new options, including --pid, --tgid and
--name, to the document.
This work is coauthored by
Shenghong Han <hanshenghong2019@email.szu.edu.cn>,
Yixuan Cao <caoyixuan2019@email.szu.edu.cn>,
Yinan Zhang <zhangyinan2019@email.szu.edu.cn>,
Chongxi Zhao <zhaochongxi2019@email.szu.edu.cn>,
Yuhong Feng <yuhongf@szu.edu.cn>.
Link: https://lkml.kernel.org/r/1646835223-7584-1-git-send-email-yejiajian2018@email.szu.edu.cn
Signed-off-by: Jiajian Ye <yejiajian2018@email.szu.edu.cn>
Cc: Sean Anderson <seanga2@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Zhenliang Wei <weizhenliang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In a single-threaded process, the pid in kernel task_struct is the same
as the tgid, which can mark the process of page allocation. But in a
multithreaded process, only the task_struct of the thread leader has the
same pid as tgid, and the pids of other threads are different from tgid.
Therefore, tgid is recorded to provide effective information for
debugging and data statistics of multithreaded programs.
This can also be achieved by observing the task name (executable file
name) for a specific process. However, when the same program is started
multiple times, the task name is the same and the tgid is different.
Therefore, in the debugging of multi-threaded programs, combined with
the task name and tgid, more accurate runtime information of a certain
run of the program can be obtained.
Link: https://lkml.kernel.org/r/20220219180450.2399-1-caoyixuan2019@email.szu.edu.cn
Signed-off-by: Yixuan Cao <caoyixuan2019@email.szu.edu.cn>
Cc: Waiman Long <longman@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/page_owner: Extend page_owner to show memcg information", v4.
While debugging the constant increase in percpu memory consumption on a
system that spawned large number of containers, it was found that a lot
of offline mem_cgroup structures remained in place without being freed.
Further investigation indicated that those mem_cgroup structures were
pinned by some pages.
In order to find out what those pages are, the existing page_owner
debugging tool is extended to show memory cgroup information and whether
those memcgs are offline or not. With the enhanced page_owner tool, the
following is a typical page that pinned the mem_cgroup structure in my
test case:
Page allocated via order 0, mask 0x1100cca(GFP_HIGHUSER_MOVABLE), pid 162970 (podman), ts 1097761405537 ns, free_ts 1097760838089 ns
PFN 1925700 type Movable Block 3761 type Movable Flags 0x17ffffc00c001c(uptodate|dirty|lru|reclaim|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
prep_new_page+0xac/0xe0
get_page_from_freelist+0x1327/0x14d0
__alloc_pages+0x191/0x340
alloc_pages_vma+0x84/0x250
shmem_alloc_page+0x3f/0x90
shmem_alloc_and_acct_page+0x76/0x1c0
shmem_getpage_gfp+0x281/0x940
shmem_write_begin+0x36/0xe0
generic_perform_write+0xed/0x1d0
__generic_file_write_iter+0xdc/0x1b0
generic_file_write_iter+0x5d/0xb0
new_sync_write+0x11f/0x1b0
vfs_write+0x1ba/0x2a0
ksys_write+0x59/0xd0
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Charged to offline memcg libpod-conmon-15e4f9c758422306b73b2dd99f9d50a5ea53cbb16b4a13a2c2308a4253cc0ec8.
So the page was not freed because it was part of a shmem segment. That
is useful information that can help users to diagnose similar problems.
With cgroup v1, /proc/cgroups can be read to find out the total number
of memory cgroups (online + offline). With cgroup v2, the cgroup.stat
of the root cgroup can be read to find the number of dying cgroups (most
likely pinned by dying memcgs).
The page_owner feature is not supposed to be enabled for production
system due to its memory overhead. However, if it is suspected that
dying memcgs are increasing over time, a test environment with
page_owner enabled can then be set up with appropriate workload for
further analysis on what may be causing the increasing number of dying
memcgs.
This patch (of 4):
For *scnprintf(), vsnprintf() is always called even if the input size is
0. That is a waste of time, so just return 0 in this case.
Note that vsnprintf() will never return -1 to indicate an error. So
skipping the call to vsnprintf() when size is 0 will have no functional
impact at all.
Link: https://lkml.kernel.org/r/20220202203036.744010-1-longman@redhat.com
Link: https://lkml.kernel.org/r/20220202203036.744010-2-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When viewing the page owner information, we expect that the information
can be sorted by PID, so that we can quickly combine PID with the program
to check the information together.
We also expect that the information can be sorted by time. Time sorting
helps to view the running status of the program according to the time
interval when the program hangs up.
Finally, we hope to pass the page_ owner_ Sort. C can reduce part of the
output and only output the plate information whose memory has not been
released, which can make us locate the problem of the program faster.
Therefore, the following adjustments have been made:
1. Add the static functions search_pattern and check_regcomp to
improve the cleanliness.
2. Add member attributes and their corresponding sorting methods. In
terms of comparison time, int will overflow because the data of ull is
too large, so the ternary operator is used
3. Add the -f parameter to filter out the information of blocks whose
memory has not been released
Link: https://lkml.kernel.org/r/20211206165653.5093-1-zhaochongxi2019@email.szu.edu.cn
Signed-off-by: Chongxi Zhao <zhaochongxi2019@email.szu.edu.cn>
Reviewed-by: Sean Anderson <seanga2@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Culling by comparing stacktrace would casue loss of some information. For
example, if there exists 2 blocks which have the same stacktrace and the
different head info
Page allocated via order 0, mask 0x108c48(...), pid 73696,
ts 1578829190639010 ns, free_ts 1576583851324450 ns
prep_new_page+0x80/0xb8
get_page_from_freelist+0x924/0xee8
__alloc_pages+0x138/0xc18
alloc_pages+0x80/0xf0
__page_cache_alloc+0x90/0xc8
Page allocated via order 0, mask 0x108c48(...), pid 61806,
ts 1354113726046100 ns, free_ts 1354104926841400 ns
prep_new_page+0x80/0xb8
get_page_from_freelist+0x924/0xee8
__alloc_pages+0x138/0xc18
alloc_pages+0x80/0xf0
__page_cache_alloc+0x90/0xc8
After culling, it would be like this
2 times, 2 pages:
Page allocated via order 0, mask 0x108c48(...), pid 73696,
ts 1578829190639010 ns, free_ts 1576583851324450 ns
prep_new_page+0x80/0xb8
get_page_from_freelist+0x924/0xee8
__alloc_pages+0x138/0xc18
alloc_pages+0x80/0xf0
__page_cache_alloc+0x90/0xc8
The info of second block missed. So, add -c to turn on culling by
stacktrace. By default, it will cull by txt.
Link: https://lkml.kernel.org/r/20211129145658.2491-1-zhangyinan2019@email.szu.edu.cn
Signed-off-by: Yinan Zhang <zhangyinan2019@email.szu.edu.cn>
Cc: Changhee Han <ch0.han@lge.com>
Cc: Sean Anderson <seanga2@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tang Bin <tangbin@cmss.chinamobile.com>
Cc: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Cc: Zhenliang Wei <weizhenliang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge more updates from Andrew Morton:
"Various misc subsystems, before getting into the post-linux-next
material.
41 patches.
Subsystems affected by this patch series: procfs, misc, core-kernel,
lib, checkpatch, init, pipe, minix, fat, cgroups, kexec, kdump,
taskstats, panic, kcov, resource, and ubsan"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits)
Revert "ubsan, kcsan: Don't combine sanitizer with kcov on clang"
kernel/resource: fix kfree() of bootmem memory again
kcov: properly handle subsequent mmap calls
kcov: split ioctl handling into locked and unlocked parts
panic: move panic_print before kmsg dumpers
panic: add option to dump all CPUs backtraces in panic_print
docs: sysctl/kernel: add missing bit to panic_print
taskstats: remove unneeded dead assignment
kasan: no need to unset panic_on_warn in end_report()
ubsan: no need to unset panic_on_warn in ubsan_epilogue()
panic: unset panic_on_warn inside panic()
docs: kdump: add scp example to write out the dump file
docs: kdump: update description about sysfs file system support
arm64: mm: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
x86/setup: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
riscv: mm: init: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
kexec: make crashk_res, crashk_low_res and crash_notes symbols always visible
cgroup: use irqsave in cgroup_rstat_flush_locked().
fat: use pointer to simple type in put_user()
minix: fix bug when opening a file with O_DIRECT
...
Pull networking updates from Jakub Kicinski:
"The sprinkling of SPI drivers is because we added a new one and Mark
sent us a SPI driver interface conversion pull request.
Core
----
- Introduce XDP multi-buffer support, allowing the use of XDP with
jumbo frame MTUs and combination with Rx coalescing offloads (LRO).
- Speed up netns dismantling (5x) and lower the memory cost a little.
Remove unnecessary per-netns sockets. Scope some lists to a netns.
Cut down RCU syncing. Use batch methods. Allow netdev registration
to complete out of order.
- Support distinguishing timestamp types (ingress vs egress) and
maintaining them across packet scrubbing points (e.g. redirect).
- Continue the work of annotating packet drop reasons throughout the
stack.
- Switch netdev error counters from an atomic to dynamically
allocated per-CPU counters.
- Rework a few preempt_disable(), local_irq_save() and busy waiting
sections problematic on PREEMPT_RT.
- Extend the ref_tracker to allow catching use-after-free bugs.
BPF
---
- Introduce "packing allocator" for BPF JIT images. JITed code is
marked read only, and used to be allocated at page granularity.
Custom allocator allows for more efficient memory use, lower iTLB
pressure and prevents identity mapping huge pages from getting
split.
- Make use of BTF type annotations (e.g. __user, __percpu) to enforce
the correct probe read access method, add appropriate helpers.
- Convert the BPF preload to use light skeleton and drop the
user-mode-driver dependency.
- Allow XDP BPF_PROG_RUN test infra to send real packets, enabling
its use as a packet generator.
- Allow local storage memory to be allocated with GFP_KERNEL if
called from a hook allowed to sleep.
- Introduce fprobe (multi kprobe) to speed up mass attachment (arch
bits to come later).
- Add unstable conntrack lookup helpers for BPF by using the BPF
kfunc infra.
- Allow cgroup BPF progs to return custom errors to user space.
- Add support for AF_UNIX iterator batching.
- Allow iterator programs to use sleepable helpers.
- Support JIT of add, and, or, xor and xchg atomic ops on arm64.
- Add BTFGen support to bpftool which allows to use CO-RE in kernels
without BTF info.
- Large number of libbpf API improvements, cleanups and deprecations.
Protocols
---------
- Micro-optimize UDPv6 Tx, gaining up to 5% in test on dummy netdev.
- Adjust TSO packet sizes based on min_rtt, allowing very low latency
links (data centers) to always send full-sized TSO super-frames.
- Make IPv6 flow label changes (AKA hash rethink) more configurable,
via sysctl and setsockopt. Distinguish between server and client
behavior.
- VxLAN support to "collect metadata" devices to terminate only
configured VNIs. This is similar to VLAN filtering in the bridge.
- Support inserting IPv6 IOAM information to a fraction of frames.
- Add protocol attribute to IP addresses to allow identifying where
given address comes from (kernel-generated, DHCP etc.)
- Support setting socket and IPv6 options via cmsg on ping6 sockets.
- Reject mis-use of ECN bits in IP headers as part of DSCP/TOS.
Define dscp_t and stop taking ECN bits into account in fib-rules.
- Add support for locked bridge ports (for 802.1X).
- tun: support NAPI for packets received from batched XDP buffs,
doubling the performance in some scenarios.
- IPv6 extension header handling in Open vSwitch.
- Support IPv6 control message load balancing in bonding, prevent
neighbor solicitation and advertisement from using the wrong port.
Support NS/NA monitor selection similar to existing ARP monitor.
- SMC
- improve performance with TCP_CORK and sendfile()
- support auto-corking
- support TCP_NODELAY
- MCTP (Management Component Transport Protocol)
- add user space tag control interface
- I2C binding driver (as specified by DMTF DSP0237)
- Multi-BSSID beacon handling in AP mode for WiFi.
- Bluetooth:
- handle MSFT Monitor Device Event
- add MGMT Adv Monitor Device Found/Lost events
- Multi-Path TCP:
- add support for the SO_SNDTIMEO socket option
- lots of selftest cleanups and improvements
- Increase the max PDU size in CAN ISOTP to 64 kB.
Driver API
----------
- Add HW counters for SW netdevs, a mechanism for devices which
offload packet forwarding to report packet statistics back to
software interfaces such as tunnels.
- Select the default NIC queue count as a fraction of number of
physical CPU cores, instead of hard-coding to 8.
- Expose devlink instance locks to drivers. Allow device layer of
drivers to use that lock directly instead of creating their own
which always runs into ordering issues in devlink callbacks.
- Add header/data split indication to guide user space enabling of
TCP zero-copy Rx.
- Allow configuring completion queue event size.
- Refactor page_pool to enable fragmenting after allocation.
- Add allocation and page reuse statistics to page_pool.
- Improve Multiple Spanning Trees support in the bridge to allow
reuse of topologies across VLANs, saving HW resources in switches.
- DSA (Distributed Switch Architecture):
- replay and offload of host VLAN entries
- offload of static and local FDB entries on LAG interfaces
- FDB isolation and unicast filtering
New hardware / drivers
----------------------
- Ethernet:
- LAN937x T1 PHYs
- Davicom DM9051 SPI NIC driver
- Realtek RTL8367S, RTL8367RB-VB switch and MDIO
- Microchip ksz8563 switches
- Netronome NFP3800 SmartNICs
- Fungible SmartNICs
- MediaTek MT8195 switches
- WiFi:
- mt76: MediaTek mt7916
- mt76: MediaTek mt7921u USB adapters
- brcmfmac: Broadcom BCM43454/6
- Mobile:
- iosm: Intel M.2 7360 WWAN card
Drivers
-------
- Convert many drivers to the new phylink API built for split PCS
designs but also simplifying other cases.
- Intel Ethernet NICs:
- add TTY for GNSS module for E810T device
- improve AF_XDP performance
- GTP-C and GTP-U filter offload
- QinQ VLAN support
- Mellanox Ethernet NICs (mlx5):
- support xdp->data_meta
- multi-buffer XDP
- offload tc push_eth and pop_eth actions
- Netronome Ethernet NICs (nfp):
- flow-independent tc action hardware offload (police / meter)
- AF_XDP
- Other Ethernet NICs:
- at803x: fiber and SFP support
- xgmac: mdio: preamble suppression and custom MDC frequencies
- r8169: enable ASPM L1.2 if system vendor flags it as safe
- macb/gem: ZynqMP SGMII
- hns3: add TX push mode
- dpaa2-eth: software TSO
- lan743x: multi-queue, mdio, SGMII, PTP
- axienet: NAPI and GRO support
- Mellanox Ethernet switches (mlxsw):
- source and dest IP address rewrites
- RJ45 ports
- Marvell Ethernet switches (prestera):
- basic routing offload
- multi-chain TC ACL offload
- NXP embedded Ethernet switches (ocelot & felix):
- PTP over UDP with the ocelot-8021q DSA tagging protocol
- basic QoS classification on Felix DSA switch using dcbnl
- port mirroring for ocelot switches
- Microchip high-speed industrial Ethernet (sparx5):
- offloading of bridge port flooding flags
- PTP Hardware Clock
- Other embedded switches:
- lan966x: PTP Hardward Clock
- qca8k: mdio read/write operations via crafted Ethernet packets
- Qualcomm 802.11ax WiFi (ath11k):
- add LDPC FEC type and 802.11ax High Efficiency data in radiotap
- enable RX PPDU stats in monitor co-exist mode
- Intel WiFi (iwlwifi):
- UHB TAS enablement via BIOS
- band disablement via BIOS
- channel switch offload
- 32 Rx AMPDU sessions in newer devices
- MediaTek WiFi (mt76):
- background radar detection
- thermal management improvements on mt7915
- SAR support for more mt76 platforms
- MBSSID and 6 GHz band on mt7915
- RealTek WiFi:
- rtw89: AP mode
- rtw89: 160 MHz channels and 6 GHz band
- rtw89: hardware scan
- Bluetooth:
- mt7921s: wake on Bluetooth, SCO over I2S, wide-band-speed (WBS)
- Microchip CAN (mcp251xfd):
- multiple RX-FIFOs and runtime configurable RX/TX rings
- internal PLL, runtime PM handling simplification
- improve chip detection and error handling after wakeup"
* tag 'net-next-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2521 commits)
llc: fix netdevice reference leaks in llc_ui_bind()
drivers: ethernet: cpsw: fix panic when interrupt coaleceing is set via ethtool
ice: don't allow to run ice_send_event_to_aux() in atomic ctx
ice: fix 'scheduling while atomic' on aux critical err interrupt
net/sched: fix incorrect vlan_push_eth dest field
net: bridge: mst: Restrict info size queries to bridge ports
net: marvell: prestera: add missing destroy_workqueue() in prestera_module_init()
drivers: net: xgene: Fix regression in CRC stripping
net: geneve: add missing netlink policy and size for IFLA_GENEVE_INNER_PROTO_INHERIT
net: dsa: fix missing host-filtered multicast addresses
net/mlx5e: Fix build warning, detected write beyond size of field
iwlwifi: mvm: Don't fail if PPAG isn't supported
selftests/bpf: Fix kprobe_multi test.
Revert "rethook: x86: Add rethook x86 implementation"
Revert "arm64: rethook: Add arm64 rethook implementation"
Revert "powerpc: Add rethook support"
Revert "ARM: rethook: Add rethook arm implementation"
netdevice: add missing dm_private kdoc
net: bridge: mst: prevent NULL deref in br_mst_info_size()
selftests: forwarding: Use same VRF for port and VLAN upper
...
Pull VFIO updates from Alex Williamson:
- Introduce new device migration uAPI and implement device specific
mlx5 vfio-pci variant driver supporting new protocol (Jason
Gunthorpe, Yishai Hadas, Leon Romanovsky)
- New HiSilicon acc vfio-pci variant driver, also supporting migration
interface (Shameer Kolothum, Longfang Liu)
- D3hot fixes for vfio-pci-core (Abhishek Sahu)
- Document new vfio-pci variant driver acceptance criteria
(Alex Williamson)
- Fix UML build unresolved ioport_{un}map() functions
(Alex Williamson)
- Fix MAINTAINERS due to header movement (Lukas Bulwahn)
* tag 'vfio-v5.18-rc1' of https://github.com/awilliam/linux-vfio: (31 commits)
vfio-pci: Provide reviewers and acceptance criteria for variant drivers
MAINTAINERS: adjust entry for header movement in hisilicon qm driver
hisi_acc_vfio_pci: Use its own PCI reset_done error handler
hisi_acc_vfio_pci: Add support for VFIO live migration
crypto: hisilicon/qm: Set the VF QM state register
hisi_acc_vfio_pci: Add helper to retrieve the struct pci_driver
hisi_acc_vfio_pci: Restrict access to VF dev BAR2 migration region
hisi_acc_vfio_pci: add new vfio_pci driver for HiSilicon ACC devices
hisi_acc_qm: Move VF PCI device IDs to common header
crypto: hisilicon/qm: Move few definitions to common header
crypto: hisilicon/qm: Move the QM header to include/linux
vfio/mlx5: Fix to not use 0 as NULL pointer
PCI/IOV: Fix wrong kernel-doc identifier
vfio/mlx5: Use its own PCI reset_done error handler
vfio/pci: Expose vfio_pci_core_aer_err_detected()
vfio/mlx5: Implement vfio_pci driver for mlx5 devices
vfio/mlx5: Expose migration commands over mlx5 device
vfio: Remove migration protocol v1 documentation
vfio: Extend the device migration protocol with RUNNING_P2P
vfio: Define device migration protocol v2
...
Pull hyperv updates from Wei Liu:
"Minor patches from various people"
* tag 'hyperv-next-signed-20220322' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
x86/hyperv: Output host build info as normal Windows version number
hv_balloon: rate-limit "Unhandled message" warning
drivers: hv: log when enabling crash_kexec_post_notifiers
hv_utils: Add comment about max VMbus packet size in VSS driver
Drivers: hv: Compare cpumasks and not their weights in init_vp_index()
Drivers: hv: Rename 'alloced' to 'allocated'
Drivers: hv: vmbus: Use struct_size() helper in kmalloc()
Pull kvm updates from Paolo Bonzini:
"ARM:
- Proper emulation of the OSLock feature of the debug architecture
- Scalibility improvements for the MMU lock when dirty logging is on
- New VMID allocator, which will eventually help with SVA in VMs
- Better support for PMUs in heterogenous systems
- PSCI 1.1 support, enabling support for SYSTEM_RESET2
- Implement CONFIG_DEBUG_LIST at EL2
- Make CONFIG_ARM64_ERRATUM_2077057 default y
- Reduce the overhead of VM exit when no interrupt is pending
- Remove traces of 32bit ARM host support from the documentation
- Updated vgic selftests
- Various cleanups, doc updates and spelling fixes
RISC-V:
- Prevent KVM_COMPAT from being selected
- Optimize __kvm_riscv_switch_to() implementation
- RISC-V SBI v0.3 support
s390:
- memop selftest
- fix SCK locking
- adapter interruptions virtualization for secure guests
- add Claudio Imbrenda as maintainer
- first step to do proper storage key checking
x86:
- Continue switching kvm_x86_ops to static_call(); introduce
static_call_cond() and __static_call_ret0 when applicable.
- Cleanup unused arguments in several functions
- Synthesize AMD 0x80000021 leaf
- Fixes and optimization for Hyper-V sparse-bank hypercalls
- Implement Hyper-V's enlightened MSR bitmap for nested SVM
- Remove MMU auditing
- Eager splitting of page tables (new aka "TDP" MMU only) when dirty
page tracking is enabled
- Cleanup the implementation of the guest PGD cache
- Preparation for the implementation of Intel IPI virtualization
- Fix some segment descriptor checks in the emulator
- Allow AMD AVIC support on systems with physical APIC ID above 255
- Better API to disable virtualization quirks
- Fixes and optimizations for the zapping of page tables:
- Zap roots in two passes, avoiding RCU read-side critical
sections that last too long for very large guests backed by 4
KiB SPTEs.
- Zap invalid and defunct roots asynchronously via
concurrency-managed work queue.
- Allowing yielding when zapping TDP MMU roots in response to the
root's last reference being put.
- Batch more TLB flushes with an RCU trick. Whoever frees the
paging structure now holds RCU as a proxy for all vCPUs running
in the guest, i.e. to prolongs the grace period on their behalf.
It then kicks the the vCPUs out of guest mode before doing
rcu_read_unlock().
Generic:
- Introduce __vcalloc and use it for very large allocations that need
memcg accounting"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits)
KVM: use kvcalloc for array allocations
KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
kvm: x86: Require const tsc for RT
KVM: x86: synthesize CPUID leaf 0x80000021h if useful
KVM: x86: add support for CPUID leaf 0x80000021
KVM: x86: do not use KVM_X86_OP_OPTIONAL_RET0 for get_mt_mask
Revert "KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()"
kvm: x86/mmu: Flush TLB before zap_gfn_range releases RCU
KVM: arm64: fix typos in comments
KVM: arm64: Generalise VM features into a set of flags
KVM: s390: selftests: Add error memop tests
KVM: s390: selftests: Add more copy memop tests
KVM: s390: selftests: Add named stages for memop test
KVM: s390: selftests: Add macro as abstraction for MEM_OP
KVM: s390: selftests: Split memop tests
KVM: s390x: fix SCK locking
RISC-V: KVM: Implement SBI HSM suspend call
RISC-V: KVM: Add common kvm_riscv_vcpu_wfi() function
RISC-V: Add SBI HSM suspend related defines
RISC-V: KVM: Implement SBI v0.3 SRST extension
...