commit 5478afc55a ("kmsan: fix memcpy tests") uses OPTIMIZER_HIDE_VAR()
to hide the uninitialized variable from compiler optimizations.
However, OPTIMIZER_HIDE_VAR(uninit) enforces an immediate check of
@uninit, so the memcpy tests did not actually check the behavior of
memcpy(): they always contained a KMSAN report.
Replace OPTIMIZER_HIDE_VAR() with a file-local macro that just clobbers
the memory with a barrier(), and add a test case for memcpy() that does
not expect an error report.
Also reflow kmsan_test.c with clang-format.
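A minimal sketch of such a clobbering macro (the DO_NOT_OPTIMIZE name and
the use of barrier_data() are assumptions for illustration, not
necessarily the exact code added):

  /*
   * Clobber memory via the variable's address without reading its value,
   * so the macro itself does not count as a use of uninitialized data.
   */
  #define DO_NOT_OPTIMIZE(var) barrier_data(&(var))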
Link: https://lkml.kernel.org/r/20230303141433.3422671-2-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use atomic_try_cmpxchg() instead of atomic_cmpxchg(*ptr, old, new) == old
in set_tlb_ubc_flush_pending(). The x86 CMPXCHG instruction returns
success in the ZF flag, so this change saves a compare after cmpxchg (and
the related move instruction in front of cmpxchg).
Also, try_cmpxchg() implicitly assigns the old *ptr value to "old" when
cmpxchg fails.
No functional change intended.
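A generic illustration of the idiom change (not the exact mm/rmap.c code;
compute_new() is a hypothetical helper):

  /* Before: compare the returned value, then re-read *ptr on failure. */
  static void update_before(atomic_t *ptr)
  {
          int old = atomic_read(ptr);

          while (atomic_cmpxchg(ptr, old, compute_new(old)) != old)
                  old = atomic_read(ptr);
  }

  /* After: atomic_try_cmpxchg() returns a bool (CMPXCHG's ZF on x86) and,
   * on failure, writes the current *ptr value back into 'old'. */
  static void update_after(atomic_t *ptr)
  {
          int old = atomic_read(ptr);

          while (!atomic_try_cmpxchg(ptr, &old, compute_new(old)))
                  ;       /* 'old' already holds the latest value */
  }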
Link: https://lkml.kernel.org/r/20230227214228.3533299-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The %pGp format is used to display the 'flags' field of a struct page.
However, some page flags (e.g. PG_buddy, see page-flags.h for more
details) are stored in the page_type field. To display human-readable
output of page_type, introduce the %pGt format.
It is important to note that the meaning of the bits is different in
page_type: if page_type is 0xffffffff, no flags are set, and setting the
PG_buddy (0x00000080) flag results in a page_type of 0xffffff7f. Clearing
a bit actually means setting a flag, so bits in page_type are inverted
when displaying type names.
Only values for which page_type_has_type() returns true are considered a
page_type, to avoid confusion with mapcount values. If it returns false,
only the raw value is displayed, not page type names.
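A hedged usage sketch (call sites in the kernel differ; the point is that
%pGt takes a pointer to the page_type value):

  unsigned int page_type = 0xffffff7f;   /* PG_buddy "set" (bit cleared) */

  pr_info("page_type: %pGt\n", &page_type); /* prints a readable name such as "buddy" */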
Link: https://lkml.kernel.org/r/20230130042514.2418-3-42.hyeyoo@gmail.com
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com> [vsprintf part]
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On big systems, the mm refcount can become highly contended when doing a
lot of context switching with threaded applications. The user<->idle
switch is one of the important cases. Abandoning lazy tlb entirely slows
this switching down quite a bit in the common uncontended case, so that
is not viable.
Implement a scheme where lazy tlb mm references do not contribute to the
refcount; instead, they are explicitly removed when the refcount reaches
zero.
The final mmdrop() sends IPIs to all CPUs in the mm_cpumask and they
switch away from this mm to init_mm if it was being used as the lazy tlb
mm. Enabling the shoot lazies option therefore requires that the arch
ensures that mm_cpumask contains all CPUs that could possibly be using mm.
A DEBUG_VM option IPIs every CPU in the system after this to ensure there
are no references remaining before the mm is freed.
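A rough, simplified sketch of the shoot-lazies IPI (illustrative only;
function names and details are assumptions, not the exact kernel code):

  static void do_shoot_lazy_tlb(void *arg)
  {
          struct mm_struct *mm = arg;

          /* Only lazy users (with no current->mm) switch away from it. */
          if (current->active_mm == mm) {
                  WARN_ON_ONCE(current->mm);
                  current->active_mm = &init_mm;
                  switch_mm(mm, &init_mm, current);
          }
  }

  static void cleanup_lazy_tlbs(struct mm_struct *mm)
  {
          /* Called from the final mmdrop() once the refcount hits zero. */
          on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, mm, 1);
  }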
The cost of the shootdown IPIs could be an issue, but they have not been
observed to be a serious problem with this scheme, because short-lived
processes tend not to migrate CPUs much, so they don't get much chance to
leave lazy tlb mm references on remote CPUs. There are a lot of options
to reduce them if necessary, described in comments.
The near-worst-case can be benchmarked with will-it-scale:
context_switch1_threads -t $(($(nproc) / 2))
This will create nproc threads (nproc / 2 switching pairs), all sharing
the same mm, spread over all CPUs so that each CPU does
thread->idle->thread switching.
[ Rik came up with basically the same idea a few years ago, so credit
to him for that. ]
Link: https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/
Link: https://lore.kernel.org/all/20180728215357.3249-11-riel@surriel.com/
Link: https://lkml.kernel.org/r/20230203071837.1136453-5-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The worst-case scenario when detecting same-element pages is that almost
all elements appear the same at first glance, but the last few elements
are different.
Since identical elements tend to be grouped from the beginning of a page,
checking the first element against the last element before looping
through all elements gives us a chance to quickly detect pages whose
elements are not all the same.
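A simplified sketch of the idea (assuming a page_same_filled()-style
helper; not the exact zswap code):

  static bool page_same_filled(void *ptr, unsigned long *value)
  {
          unsigned long *page = ptr;
          unsigned long val = page[0];
          unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;

          /* Cheap early exit: compare the first and last elements up front. */
          if (val != page[last_pos])
                  return false;

          for (pos = 1; pos < last_pos; pos++) {
                  if (val != page[pos])
                          return false;
          }

          *value = val;
          return true;
  }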
1. Test is done on an LG webOS TV (64-bit arch)
2. Dump the swapped-out pages (~819200 pages)
3. Analyze the pages offline with a simple test script which counts the
   iterations and measures the speed
On a 64-bit arch, the worst-case iteration count is PAGE_SIZE / 8 bytes
= 512. The speed is based on the time spent in the page_same_filled()
function only. The results, on average, are listed below:
                                   Num of Iter    Speed(MB/s)
  Looping-Forward (Orig)                38            99265
  Looping-Backward                      36           102725
  Last-element-check (This Patch)       33           125072
The results show that the average iteration count decreases by 13% and
the speed increases by 25% with this patch. This patch does not increase
the overall time complexity, though.
I also ran a simpler version which uses a backward loop. Just looping
backward also makes some improvement, but less than this patch.
A similar change has already been made to zram in 90f82cbfe5 ("zram: try
to avoid worst-case scenario on same element pages").
Link: https://lkml.kernel.org/r/20230205190036.1730134-1-taejoon.song@lge.com
Signed-off-by: Taejoon Song <taejoon.song@lge.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Taejoon Song <taejoon.song@lge.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <yjay.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Syzbot reports a warning in untrack_pfn(). Digging into the root cause,
we found that it is due to a memory allocation failure in
pmd_alloc_one(), and that failure is produced by failslab.
In copy_page_range(), the memory allocation for a pmd failed. During the
error handling in copy_page_range(), mmput() is called to remove all
vmas. When untrack_pfn() is invoked on this empty pfn range, the warning
is triggered.
Here's a simplified flow:
dup_mm
  dup_mmap
    copy_page_range
      copy_p4d_range
        copy_pud_range
          copy_pmd_range
            pmd_alloc
              __pmd_alloc
                pmd_alloc_one
                  page = alloc_pages(gfp, 0);
                  if (!page)
                    return NULL;
  mmput
    exit_mmap
      unmap_vmas
        unmap_single_vma
          untrack_pfn
            follow_phys
              WARN_ON_ONCE(1);
Since this vma was not set up successfully, we can clear the VM_PAT flag.
In that case, untrack_pfn() will not be called while cleaning up this
vma. Function untrack_pfn_moved() has also been renamed to fit the new
logic.
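A minimal sketch of the idea (untrack_pfn_clear() is an assumed helper
name and body, standing in for the renamed function mentioned above):

  /*
   * Clear VM_PAT on a vma that never finished setup, so the later vma
   * teardown skips untrack_pfn() for it.
   */
  static void untrack_pfn_clear(struct vm_area_struct *vma)
  {
          vm_flags_clear(vma, VM_PAT);
  }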
Link: https://lkml.kernel.org/r/20230217025615.1595558-1-mawupeng1@huawei.com
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reported-by: <syzbot+5f488e922d047d8f00cc@syzkaller.appspotmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mwriteprotect_range() errors out if [start, end) doesn't fall in one VMA.
We are facing a use case where multiple VMAs are present in one range of
interest. For example, the following pseudocode reproduces the error
which we are trying to fix:
- Allocate memory of size 16 pages with PROT_NONE with mmap
- Register userfaultfd
- Change protection of the first half (1 to 8 pages) of memory to
PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs.
- Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors
out.
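A hedged repro sketch of the steps above (error handling omitted; the
exact flags and userfaultfd feature handshake may differ):

  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/mman.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <linux/userfaultfd.h>

  int main(void)
  {
          size_t len = 16 * 4096;
          char *mem = mmap(NULL, len, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
          struct uffdio_api api = { .api = UFFD_API };
          ioctl(uffd, UFFDIO_API, &api);

          struct uffdio_register reg = {
                  .range = { .start = (unsigned long)mem, .len = len },
                  .mode  = UFFDIO_REGISTER_MODE_WP,
          };
          ioctl(uffd, UFFDIO_REGISTER, &reg);

          /* Splits the area into two VMAs. */
          mprotect(mem, len / 2, PROT_READ | PROT_WRITE);

          /* Used to error out because the range spans two VMAs. */
          struct uffdio_writeprotect wp = {
                  .range = { .start = (unsigned long)mem, .len = len },
                  .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
          };
          return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) ? 1 : 0;
  }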
This is a simple use case where the user may or may not know if the
memory area has been divided into multiple VMAs.
We need an implementation which doesn't disrupt the already present
users. So, keeping things simple, stop going over all the VMAs if any one
of the VMAs hasn't been registered in WP mode. While at it, remove the
unneeded error check as well.
[akpm@linux-foundation.org: s/VM_WARN_ON_ONCE/VM_WARN_ONCE/ to fix build]
Link: https://lkml.kernel.org/r/20230217105558.832710-1-usama.anjum@collabora.com
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reported-by: Paul Gofman <pgofman@codeweavers.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Historically, we have performed sanity checks on all struct pages being
allocated or freed, making sure they have no unexpected page flags or
certain field values. This can detect insufficient cleanup and some cases
of use-after-free, although on its own it can't always identify the
culprit. The result is a warning and the "bad page" being leaked.
The checks do need some cpu cycles, so in 4.7 with commits 479f854a20
("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
and 4db7548ccb ("mm, page_alloc: defer debugging checks of freed pages
until a PCP drain") they were no longer performed in the hot paths when
allocating and freeing from pcplists, but only when pcplists are bypassed,
refilled or drained. For debugging purposes, with CONFIG_DEBUG_VM enabled
the checks were instead still done in the hot paths and not when refilling
or draining pcplists.
With 4462b32c92 ("mm, page_alloc: more extensive free page checking with
debug_pagealloc"), enabling debug_pagealloc also moved the sanity checks
back to hot pahs. When both debug_pagealloc and CONFIG_DEBUG_VM are
enabled, the checks are done both in hotpaths and pcplist refill/drain.
Even though the non-debug default today might seem to be a sensible
tradeoff between overhead and ability to detect bad pages, on closer look
it's arguably not. As most allocations go through the pcplists, catching
any bad pages when refilling or draining pcplists has only a small chance,
insufficient for debugging or serious hardening purposes. On the other
hand the cost of the checks is concentrated in the already expensive
drain/refill batching operations, and those are done under the often
contended zone lock. That was recently identified as an issue for page
allocation, and the zone lock contention was reduced by moving the checks
outside of the locked section with the patch "mm: reduce lock contention
of pcp buffer refill", but the cost of the checks is still visible
compared to their removal [1]. In the pcplist draining path
free_pcppages_bulk() the checks are still done under zone->lock.
Thus, remove the checks from the pcplist refill and drain paths
completely. Introduce a static key check_pages_enabled to control checks
during page allocation and freeing (whether the pcplist is used or
bypassed); see the sketch after the list below. The static key is enabled
if any of the following is true:
- kernel is built with CONFIG_DEBUG_VM=y (debugging)
- debug_pagealloc or page poisoning is boot-time enabled (debugging)
- init_on_alloc or init_on_free is boot-time enabled (hardening)
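A minimal sketch of the static-key pattern (simplified; the helper names
are illustrative, not the exact mm/page_alloc.c code):

  static DEFINE_STATIC_KEY_MAYBE(CONFIG_DEBUG_VM, check_pages_enabled);

  static void __init maybe_enable_page_checks(void)
  {
          /* Debugging or hardening options turn the default-off key on. */
          if (debug_pagealloc_enabled() || page_poisoning_enabled() ||
              want_init_on_alloc(GFP_KERNEL) || want_init_on_free())
                  static_branch_enable(&check_pages_enabled);
  }

  static inline bool check_new_pages(struct page *page, unsigned int order)
  {
          if (static_branch_unlikely(&check_pages_enabled))
                  return __check_new_pages(page, order); /* illustrative helper */
          return false;
  }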
The resulting user visible changes:
- no checks when draining/refilling pcplists - less overhead, with
likely no practical reduction of ability to catch bad pages
- no checks when bypassing pcplists in default config (no
debugging/hardening) - less overhead etc. as above
- on typical hardened kernels [2], checks are now performed on each page
  allocation/free (previously only when bypassing/draining/refilling
  pcplists); having init_on_alloc/init_on_free enabled should be
  sufficient indication for preferring more costly alloc/free operations
  for hardening purposes, so we shouldn't need to introduce another toggle
- code (various wrappers) removal and simplification
[1] https://lore.kernel.org/all/68ba44d8-6899-c018-dcb3-36f3a96e6bea@sra.uni-hannover.de/
[2] https://lore.kernel.org/all/63ebc499.a70a0220.9ac51.29ea@mx.google.com/
[akpm@linux-foundation.org: coding-style cleanups]
[akpm@linux-foundation.org: make check_pages_enabled static]
Link: https://lkml.kernel.org/r/20230216095131.17336-1-vbabka@suse.cz
Reported-by: Alexander Halbuer <halbuer@sra.uni-hannover.de>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
rmqueue_bulk() batches the allocation of multiple elements to refill the
per-CPU buffers into a single hold of the zone lock. Each element is
allocated and checked using check_pcp_refill(). The check touches every
related struct page, which is especially expensive for higher-order
allocations (huge pages).
This patch reduces the time the zone lock is held by moving the check out
of the critical section, similar to rmqueue_buddy(), which allocates a
single element.
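A simplified sketch of the reordering (illustrative only, not the actual
rmqueue_bulk() code; check_new_pages() stands in for the deferred
checks):

  static int refill_pcp_sketch(struct zone *zone, unsigned int order,
                               int migratetype, unsigned int alloc_flags,
                               struct list_head *list, int count)
  {
          unsigned long flags;
          struct page *page, *tmp;
          int allocated = 0;

          spin_lock_irqsave(&zone->lock, flags);
          while (allocated < count) {
                  page = __rmqueue(zone, order, migratetype, alloc_flags);
                  if (!page)
                          break;
                  list_add_tail(&page->pcp_list, list);
                  allocated++;
          }
          spin_unlock_irqrestore(&zone->lock, flags);

          /* The per-page sanity checks now run with the zone lock dropped. */
          list_for_each_entry_safe(page, tmp, list, pcp_list) {
                  if (check_new_pages(page, order)) {
                          list_del(&page->pcp_list);
                          allocated--;
                  }
          }
          return allocated;
  }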
Measurements of parallel allocation-heavy workloads show a reduction of
the average huge page allocation latency of 50 percent for two cores and
nearly 90 percent for 24 cores.
Link: https://lkml.kernel.org/r/20230201162549.68384-1-halbuer@sra.uni-hannover.de
Signed-off-by: Alexander Halbuer <halbuer@sra.uni-hannover.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If the memory charge failed, instead of returning the hpage along with an
error, allow the function to clean up the folio properly, which is
normally what a function should do in this case: either return
successfully, or return an error with no side effects from partial runs.
This also avoids the caller calling mem_cgroup_uncharge() unnecessarily
in either the anon or the shmem path (even if it is safe to do so).
Link: https://lkml.kernel.org/r/20230222195247.791227-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Stevens <stevensd@chromium.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In a dedupe comparison iter loop, the length of the iomap_iter decreases,
because it holds the remaining length after each iteration.
The dedupe command will fail with -EIO if the range is larger than one
page size and not aligned to the page size. A warning is also reported in
dmesg:
[ 4338.498374] ------------[ cut here ]------------
[ 4338.498689] WARNING: CPU: 3 PID: 1415645 at fs/iomap/iter.c:16
...
The compare function should use the min length of the current iters,
not the total length.
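A minimal illustration of the fix's idea (field names assumed, not the
exact fsdax code):

  /* Compare only the length both iters still cover in this iteration. */
  u64 cmp_len = min(it_src->len, it_dest->len);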
Link: https://lkml.kernel.org/r/1679469958-2-1-git-send-email-ruansy.fnst@fujitsu.com
Fixes: 0e79e3736d ("fsdax: dedupe: iter two files at the same time")
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull USB / Thunderbolt driver fixes from Greg KH:
"Here are a small set of USB and Thunderbolt driver fixes for reported
problems and a documentation update, for 6.3-rc4.
Included in here are:
- documentation update for uvc gadget driver
- small thunderbolt driver fixes
- cdns3 driver fixes
- dwc3 driver fixes
- dwc2 driver fixes
- chipidea driver fixes
- typec driver fixes
- onboard_usb_hub device id updates
- quirk updates
All of these have been in linux-next with no reported problems"
* tag 'usb-6.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (30 commits)
usb: dwc2: fix a race, don't power off/on phy for dual-role mode
usb: dwc2: fix a devres leak in hw_enable upon suspend resume
usb: chipidea: core: fix possible concurrent when switch role
usb: chipdea: core: fix return -EINVAL if request role is the same with current role
thunderbolt: Rename shadowed variables bit to interrupt_bit and auto_clear_bit
thunderbolt: Disable interrupt auto clear for rings
thunderbolt: Use const qualifier for `ring_interrupt_index`
usb: gadget: Use correct endianness of the wLength field for WebUSB
uas: Add US_FL_NO_REPORT_OPCODES for JMicron JMS583Gen 2
usb: cdnsp: changes PCI Device ID to fix conflict with CNDS3 driver
usb: cdns3: Fix issue with using incorrect PCI device function
usb: cdnsp: Fixes issue with redundant Status Stage
MAINTAINERS: make me a reviewer of USB/IP
thunderbolt: Use scale field when allocating USB3 bandwidth
thunderbolt: Limit USB3 bandwidth of certain Intel USB4 host routers
thunderbolt: Call tb_check_quirks() after initializing adapters
thunderbolt: Add missing UNSET_INBOUND_SBTX for retimer access
thunderbolt: Fix memory leak in margining
usb: dwc2: drd: fix inconsistent mode if role-switch-default-mode="host"
docs: usb: Add documentation for the UVC Gadget
...
Pull scheduler fix from Borislav Petkov:
- Fix a corner case where vruntime of a task is not being sanitized
* tag 'sched_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Sanitize vruntime of entity being migrated
Pull perf fix from Borislav Petkov:
- Properly clear perf event status tracking in the AMD perf event
overflow handler
* tag 'perf_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/amd/core: Always clear status for idx
Pull core fixes from Borislav Petkov:
- Do the delayed RCU wakeup for kthreads in the proper order so that the
  former doesn't get ignored
- A noinstr warning fix
* tag 'core_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
entry/rcu: Check TIF_RESCHED _after_ delayed RCU wake-up
entry: Fix noinstr warning in __enter_from_user_mode()
Pull x86 fixes from Borislav Petkov:
- Add an AMX ptrace self test
- Prevent a false-positive warning when retrieving the (invalid)
address of dynamic FPU features in their init state which are not
saved in init_fpstate at all
- Randomize per-CPU entry areas only when KASLR is enabled
* tag 'x86_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
selftests/x86/amx: Add a ptrace test
x86/fpu/xstate: Prevent false-positive warning in __copy_xstate_uabi_buf()
x86/mm: Do not shuffle CPU entry areas without KASLR
Pull cifs client fixes from Steve French:
"Twelve cifs/smb3 client fixes (most also for stable)
- forced umount fix
- fix for two perf regressions
- reconnect fixes
- small debugging improvements
- multichannel fixes"
* tag 'smb3-client-fixes-6.3-rc3' of git://git.samba.org/sfrench/cifs-2.6:
smb3: fix unusable share after force unmount failure
cifs: fix dentry lookups in directory handle cache
smb3: lower default deferred close timeout to address perf regression
cifs: fix missing unload_nls() in smb2_reconnect()
cifs: avoid race conditions with parallel reconnects
cifs: append path to open_enter trace event
cifs: print session id while listing open files
cifs: dump pending mids for all channels in DebugData
cifs: empty interface list when server doesn't support query interfaces
cifs: do not poll server interfaces too regularly
cifs: lock chan_lock outside match_session
cifs: check only tcon status on tcon related functions
Pull nfsd fix from Chuck Lever:
- Fix a crash when using NFS with krb5p
* tag 'nfsd-6.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
SUNRPC: Fix a crash in gss_krb5_checksum()