Commit 0de56e38b3 ("maple_tree: use maple state end for write
operations") was broken by a later patch "maple_tree: do not preallocate
nodes for slot stores". But the later patch was scheduled ahead of
0de56e38b3, for 6.7-rc.
This fixlet undoes the damage.
Fixes: 0de56e38b3 ("maple_tree: use maple state end for write operations")
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Three memory-failure fixes".
I've been looking at the memory-failure code and I believe I have found
three bugs that need fixing -- one going all the way back to 2010! I'll
have more patches later to use folios more extensively but didn't want
these bugfixes to get caught up in that.
This patch (of 3):
Both collect_procs_anon() and collect_procs_file() iterate over the VMA
interval trees looking for a single pgoff, so it is wrong to look for the
pgoff of the head page as is currently done. However, it is also wrong to
look at page->mapping of the precise page as this is invalid for tail
pages. Clear up the confusion by passing both the folio and the precise
page to collect_procs().
Link: https://lkml.kernel.org/r/20231218135837.3310403-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20231218135837.3310403-2-willy@infradead.org
Fixes: 415c64c145 ("mm/memory-failure: split thp earlier in memory error handling")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mas_preallocate() defaults to requesting 1 node for preallocation and then
,depending on the type of store, will update the request variable. There
isn't a check for a slot store type, so slot stores are preallocating the
default 1 node. Slot stores do not require any additional nodes, so add a
check for the slot store case that will bypass node_count_gfp(). Update
the tests to reflect that slot stores do not require allocations.
User visible effects of this bug include increased memory usage from the
unneeded node that was allocated.
Link: https://lkml.kernel.org/r/20231213205058.386589-1-sidhartha.kumar@oracle.com
Fixes: 0b8bb544b1 ("maple_tree: update mas_preallocate() testing")
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: <stable@vger.kernel.org> [6.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The following concurrency may cause the data read to be inconsistent with
the data on disk:
cpu1 cpu2
------------------------------|------------------------------
// Buffered write 2048 from 0
ext4_buffered_write_iter
generic_perform_write
copy_page_from_iter_atomic
ext4_da_write_end
ext4_da_do_write_end
block_write_end
__block_commit_write
folio_mark_uptodate
// Buffered read 4096 from 0 smp_wmb()
ext4_file_read_iter set_bit(PG_uptodate, folio_flags)
generic_file_read_iter i_size_write // 2048
filemap_read unlock_page(page)
filemap_get_pages
filemap_get_read_batch
folio_test_uptodate(folio)
ret = test_bit(PG_uptodate, folio_flags)
if (ret)
smp_rmb();
// Ensure that the data in page 0-2048 is up-to-date.
// New buffered write 2048 from 2048
ext4_buffered_write_iter
generic_perform_write
copy_page_from_iter_atomic
ext4_da_write_end
ext4_da_do_write_end
block_write_end
__block_commit_write
folio_mark_uptodate
smp_wmb()
set_bit(PG_uptodate, folio_flags)
i_size_write // 4096
unlock_page(page)
isize = i_size_read(inode) // 4096
// Read the latest isize 4096, but without smp_rmb(), there may be
// Load-Load disorder resulting in the data in the 2048-4096 range
// in the page is not up-to-date.
copy_page_to_iter
// copyout 4096
In the concurrency above, we read the updated i_size, but there is no read
barrier to ensure that the data in the page is the same as the i_size at
this point, so we may copy the unsynchronized page out. Hence adding the
missing read memory barrier to fix this.
This is a Load-Load reordering issue, which only occurs on some weak
mem-ordering architectures (e.g. ARM64, ALPHA), but not on strong
mem-ordering architectures (e.g. X86). And theoretically the problem
doesn't only happen on ext4, filesystems that call filemap_read() but
don't hold inode lock (e.g. btrfs, f2fs, ubifs ...) will have this
problem, while filesystems with inode lock (e.g. xfs, nfs) won't have
this problem.
Link: https://lkml.kernel.org/r/20231213062324.739009-1-libaokun1@huawei.com
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: yangerkun <yangerkun@huawei.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Similar to commit 09c6304e38 ("kasan: test: fix compatibility with
FORTIFY_SOURCE") the kernel is panicing in kmalloc_oob_memset_*.
This is due to the `ptr` not being hidden from the optimizer which would
disable the runtime fortify string checker.
kernel BUG at lib/string_helpers.c:1048!
Call Trace:
[<00000000272502e2>] fortify_panic+0x2a/0x30
([<00000000272502de>] fortify_panic+0x26/0x30)
[<001bffff817045c4>] kmalloc_oob_memset_2+0x22c/0x230 [kasan_test]
Hide the `ptr` variable from the optimizer to fix the kernel panic. Also
define a memset_size variable and hide that as well. This cleans up the
code and follows the same convention as other tests.
[npache@redhat.com: address review comments from Andrey]
Link: https://lkml.kernel.org/r/20231214164423.6202-1-npache@redhat.com
Link: https://lkml.kernel.org/r/20231212232659.18839-1-npache@redhat.com
Signed-off-by: Nico Pache <npache@redhat.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The cleanup for the CONFIG_KEXEC Kconfig logic accidentally changed the
'depends on CRYPTO=y' dependency to a plain 'depends on CRYPTO', which
causes a link failure when all the crypto support is in a loadable module
and kexec_file support is built-in:
x86_64-linux-ld: vmlinux.o: in function `__x64_sys_kexec_file_load':
(.text+0x32e30a): undefined reference to `crypto_alloc_shash'
x86_64-linux-ld: (.text+0x32e58e): undefined reference to `crypto_shash_update'
x86_64-linux-ld: (.text+0x32e6ee): undefined reference to `crypto_shash_final'
Both s390 and x86 have this problem, while ppc64 and riscv have the
correct dependency already. On riscv, the dependency is only used for the
purgatory, not for the kexec_file code itself, which may be a bit
surprising as it means that with CONFIG_CRYPTO=m, it is possible to enable
KEXEC_FILE but then the purgatory code is silently left out.
Move this into the common Kconfig.kexec file in a way that is correct
everywhere, using the dependency on CRYPTO_SHA256=y only when the
purgatory code is available. This requires reversing the dependency
between ARCH_SUPPORTS_KEXEC_PURGATORY and KEXEC_FILE, but the effect
remains the same, other than making riscv behave like the other ones.
On s390, there is an additional dependency on CRYPTO_SHA256_S390, which
should technically not be required but gives better performance. Remove
this dependency here, noting that it was not present in the initial
Kconfig code but was brought in without an explanation in commit
71406883fd ("s390/kexec_file: Add kexec_file_load system call").
[arnd@arndb.de: fix riscv build]
Link: https://lkml.kernel.org/r/67ddd260-d424-4229-a815-e3fcfb864a77@app.fastmail.com
Link: https://lkml.kernel.org/r/20231023110308.1202042-1-arnd@kernel.org
Fixes: 6af5138083 ("x86/kexec: refactor for kernel/Kconfig.kexec")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Eric DeVolder <eric_devolder@yahoo.com>
Tested-by: Eric DeVolder <eric_devolder@yahoo.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Conor Dooley <conor@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
While investigating kswapd "consuming 100% CPU" [1] (also see "mm/mglru:
try to stop at high watermarks"), it was discovered that the memcg LRU can
breach the thrashing protection imposed by min_ttl_ms.
Before the memcg LRU:
kswapd()
shrink_node_memcgs()
mem_cgroup_iter()
inc_max_seq() // always hit a different memcg
lru_gen_age_node()
mem_cgroup_iter()
check the timestamp of the oldest generation
After the memcg LRU:
kswapd()
shrink_many()
restart:
iterate the memcg LRU:
inc_max_seq() // occasionally hit the same memcg
if raced with lru_gen_rotate_memcg():
goto restart
lru_gen_age_node()
mem_cgroup_iter()
check the timestamp of the oldest generation
Specifically, when the restart happens in shrink_many(), it needs to stick
with the (memcg LRU) generation it began with. In other words, it should
neither re-read memcg_lru->seq nor age an lruvec of a different
generation. Otherwise it can hit the same memcg multiple times without
giving lru_gen_age_node() a chance to check the timestamp of that memcg's
oldest generation (against min_ttl_ms).
[1] https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/
Link: https://lkml.kernel.org/r/20231208061407.2125867-3-yuzhao@google.com
Fixes: e4dde56cd2 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: T.J. Mercier <tjmercier@google.com>
Cc: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The initial MGLRU patchset didn't include the memcg LRU support, and it
relied on should_abort_scan(), added by commit f76c833788 ("mm:
multi-gen LRU: optimize multiple memcgs"), to "backoff to avoid
overshooting their aggregate reclaim target by too much".
Later on when the memcg LRU was added, should_abort_scan() was deemed
unnecessary, and the test results [1] showed no side effects after it was
removed by commit a579086c99 ("mm: multi-gen LRU: remove eviction
fairness safeguard").
However, that test used memory.reclaim, which sets nr_to_reclaim to
SWAP_CLUSTER_MAX. So it can overshoot only by SWAP_CLUSTER_MAX-1 pages,
i.e., from nr_reclaimed=nr_to_reclaim-1 to
nr_reclaimed=nr_to_reclaim+SWAP_CLUSTER_MAX-1. Compared with the batch
size kswapd sets to nr_to_reclaim, SWAP_CLUSTER_MAX is tiny. Therefore
that test isn't able to reproduce the worst case scenario, i.e., kswapd
overshooting GBs on large systems and "consuming 100% CPU" (see the Closes
tag).
Bring back a simplified version of should_abort_scan() on top of the memcg
LRU, so that kswapd stops when all eligible zones are above their
respective high watermarks plus a small delta to lower the chance of
KSWAPD_HIGH_WMARK_HIT_QUICKLY. Note that this only applies to order-0
reclaim, meaning compaction-induced reclaim can still run wild (which is a
different problem).
On Android, launching 55 apps sequentially:
Before After Change
pgpgin 838377172 802955040 -4%
pgpgout 38037080 34336300 -10%
[1] https://lore.kernel.org/20221222041905.2431096-1-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20231208061407.2125867-2-yuzhao@google.com
Fixes: a579086c99 ("mm: multi-gen LRU: remove eviction fairness safeguard")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Closes: https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/
Tested-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Unmapped folios accessed through file descriptors can be underprotected.
Those folios are added to the oldest generation based on:
1. The fact that they are less costly to reclaim (no need to walk the
rmap and flush the TLB) and have less impact on performance (don't
cause major PFs and can be non-blocking if needed again).
2. The observation that they are likely to be single-use. E.g., for
client use cases like Android, its apps parse configuration files
and store the data in heap (anon); for server use cases like MySQL,
it reads from InnoDB files and holds the cached data for tables in
buffer pools (anon).
However, the oldest generation can be very short lived, and if so, it
doesn't provide the PID controller with enough time to respond to a surge
of refaults. (Note that the PID controller uses weighted refaults and
those from evicted generations only take a half of the whole weight.) In
other words, for a short lived generation, the moving average smooths out
the spike quickly.
To fix the problem:
1. For folios that are already on LRU, if they can be beyond the
tracking range of tiers, i.e., five accesses through file
descriptors, move them to the second oldest generation to give them
more time to age. (Note that tiers are used by the PID controller
to statistically determine whether folios accessed multiple times
through file descriptors are worth protecting.)
2. When adding unmapped folios to LRU, adjust the placement of them so
that they are not too close to the tail. The effect of this is
similar to the above.
On Android, launching 55 apps sequentially:
Before After Change
workingset_refault_anon 25641024 25598972 0%
workingset_refault_file 115016834 106178438 -8%
Link: https://lkml.kernel.org/r/20231208061407.2125867-1-yuzhao@google.com
Fixes: ac35a49023 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Split folios during the second loop of shmem_undo_range. It's not
sufficient to only split folios when dealing with partial pages, since
it's possible for a THP to be faulted in after that point. Calling
truncate_inode_folio in that situation can result in throwing away data
outside of the range being targeted.
[akpm@linux-foundation.org: tidy up comment layout]
Link: https://lkml.kernel.org/r/20230418084031.3439795-1-stevensd@google.com
Fixes: b9a8a4195c ("truncate,shmem: Handle truncates that split large folios")
Signed-off-by: David Stevens <stevensd@chromium.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If crash_base is equal to CRASH_ADDR_LOW_MAX, it also indicates that
the crashkernel memory is allocated from high memory. However, the
current check only considers the case where crash_base is greater than
CRASH_ADDR_LOW_MAX. Fix it.
The runtime effects is that crashkernel high memory is successfully
reserved, whereas the crashkernel low memory is bypassed in this case,
then kdump kernel bootup will fail because of no low memory under 4G.
This patch also includes some minor cleanups.
Link: https://lkml.kernel.org/r/20231209141438.77233-1-ytcoode@gmail.com
Fixes: 0ab97169aa ("crash_core: add generic function to do reservation")
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The select of KEXEC for CRASH_DUMP in kernel/Kconfig.kexec will be
dropped, then compiling errors will be triggered if below config items are
set:
===
CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
CONFIG_CRASH_DUMP=y
===
--------------------------------------------------------------------
mipsel-linux-ld: kernel/kexec_core.o: in function `kimage_free':
kernel/kexec_core.c:(.text+0x2200): undefined reference to `machine_kexec_cleanup'
mipsel-linux-ld: kernel/kexec_core.o: in function `__crash_kexec':
kernel/kexec_core.c:(.text+0x2480): undefined reference to `machine_crash_shutdown'
mipsel-linux-ld: kernel/kexec_core.c:(.text+0x2488): undefined reference to `machine_kexec'
mipsel-linux-ld: kernel/kexec_core.o: in function `kernel_kexec':
kernel/kexec_core.c:(.text+0x29b8): undefined reference to `machine_shutdown'
mipsel-linux-ld: kernel/kexec_core.c:(.text+0x29c0): undefined reference to `machine_kexec'
--------------------------------------------------------------------
Here, change the dependency of building kexec_core related object files,
and the ifdeffery in mips from CONFIG_KEXEC to CONFIG_KEXEC_CORE.
Link: https://lkml.kernel.org/r/20231208073036.7884-4-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311302042.sn8cDPIX-lkp@intel.com/
Cc: Eric DeVolder <eric_devolder@yahoo.com>
Cc: Ignat Korchagin <ignat@cloudflare.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "kexec: fix the incorrect ifdeffery and dependency of
CONFIG_KEXEC".
The select of KEXEC for CRASH_DUMP in kernel/Kconfig.kexec will be
dropped, then compiling errors will be triggered if below config items are
set:
===
CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
CONFIG_CRASH_DUMP=y
===
E.g on mips, below link error are seen:
--------------------------------------------------------------------
mipsel-linux-ld: kernel/kexec_core.o: in function `kimage_free':
kernel/kexec_core.c:(.text+0x2200): undefined reference to `machine_kexec_cleanup'
mipsel-linux-ld: kernel/kexec_core.o: in function `__crash_kexec':
kernel/kexec_core.c:(.text+0x2480): undefined reference to `machine_crash_shutdown'
mipsel-linux-ld: kernel/kexec_core.c:(.text+0x2488): undefined reference to `machine_kexec'
mipsel-linux-ld: kernel/kexec_core.o: in function `kernel_kexec':
kernel/kexec_core.c:(.text+0x29b8): undefined reference to `machine_shutdown'
mipsel-linux-ld: kernel/kexec_core.c:(.text+0x29c0): undefined reference to `machine_kexec'
--------------------------------------------------------------------
Here, change the incorrect dependency of building kexec_core related
object files, and the ifdeffery on architectures from CONFIG_KEXEC to
CONFIG_KEXEC_CORE.
Testing:
========
Passed on mips and loognarch with the LKP reproducer.
This patch (of 5):
Currently, in arch/loongarch/kernel/Makefile, building machine_kexec.o
relocate_kernel.o depends on CONFIG_KEXEC.
Whereas, since we will drop the select of KEXEC for CRASH_DUMP in
kernel/Kconfig.kexec, compiling error will be triggered if below config
items are set:
===
CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
CONFIG_CRASH_DUMP=y
===
---------------------------------------------------------------
loongarch64-linux-ld: kernel/kexec_core.o: in function `.L209':
>> kexec_core.c:(.text+0x1660): undefined reference to `machine_kexec_cleanup'
loongarch64-linux-ld: kernel/kexec_core.o: in function `.L287':
>> kexec_core.c:(.text+0x1c5c): undefined reference to `machine_crash_shutdown'
>> loongarch64-linux-ld: kexec_core.c:(.text+0x1c64): undefined reference to `machine_kexec'
loongarch64-linux-ld: kernel/kexec_core.o: in function `.L2^B5':
>> kexec_core.c:(.text+0x2090): undefined reference to `machine_shutdown'
loongarch64-linux-ld: kexec_core.c:(.text+0x20a0): undefined reference to `machine_kexec'
---------------------------------------------------------------
Here, change the dependency of machine_kexec.o relocate_kernel.o to
CONFIG_KEXEC_CORE can fix above building error.
Link: https://lkml.kernel.org/r/20231208073036.7884-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20231208073036.7884-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311300946.kHE9Iu71-lkp@intel.com/
Cc: Eric DeVolder <eric_devolder@yahoo.com>
Cc: Ignat Korchagin <ignat@cloudflare.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The cleanup tasks of kdamond threads including reset of corresponding
DAMON context's ->kdamond field and decrease of global nr_running_ctxs
counter is supposed to be executed by kdamond_fn(). However, commit
0f91d13366 ("mm/damon: simplify stop mechanism") made neither
damon_start() nor damon_stop() ensure the corresponding kdamond has
started the execution of kdamond_fn().
As a result, the cleanup can be skipped if damon_stop() is called fast
enough after the previous damon_start(). Especially the skipped reset
of ->kdamond could cause a use-after-free.
Fix it by waiting for start of kdamond_fn() execution from
damon_start().
Link: https://lkml.kernel.org/r/20231208175018.63880-1-sj@kernel.org
Fixes: 0f91d13366 ("mm/damon: simplify stop mechanism")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reported-by: Jakub Acs <acsjakub@amazon.de>
Cc: Changbin Du <changbin.du@intel.com>
Cc: Jakub Acs <acsjakub@amazon.de>
Cc: <stable@vger.kernel.org> # 5.15.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Doing a ksft_print_msg() before the ksft_print_header() seems to confuse
the ksft framework in a strange way: running the test on the cmdline
results in the expected output.
But piping the output somewhere else, results in some odd output,
whereby we repeatedly get the same info printed:
# [INFO] detected THP size: 2048 KiB
# [INFO] detected hugetlb page size: 2048 KiB
# [INFO] detected hugetlb page size: 1048576 KiB
# [INFO] huge zeropage is enabled
TAP version 13
1..190
# [INFO] Anonymous memory tests in private mappings
# [RUN] Basic COW after fork() ... with base page
# [INFO] detected THP size: 2048 KiB
# [INFO] detected hugetlb page size: 2048 KiB
# [INFO] detected hugetlb page size: 1048576 KiB
# [INFO] huge zeropage is enabled
TAP version 13
1..190
# [INFO] Anonymous memory tests in private mappings
# [RUN] Basic COW after fork() ... with base page
ok 1 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped out base page
# [INFO] detected THP size: 2048 KiB
# [INFO] detected hugetlb page size: 2048 KiB
# [INFO] detected hugetlb page size: 1048576 KiB
# [INFO] huge zeropage is enabled
Doing the ksft_print_header() first seems to resolve that and gives us
the output we expect:
TAP version 13
# [INFO] detected THP size: 2048 KiB
# [INFO] detected hugetlb page size: 2048 KiB
# [INFO] detected hugetlb page size: 1048576 KiB
# [INFO] huge zeropage is enabled
1..190
# [INFO] Anonymous memory tests in private mappings
# [RUN] Basic COW after fork() ... with base page
ok 1 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped out base page
ok 2 No leak from parent into child
# [RUN] Basic COW after fork() ... with THP
ok 3 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped-out THP
ok 4 No leak from parent into child
# [RUN] Basic COW after fork() ... with PTE-mapped THP
ok 5 No leak from parent into child
Link: https://lkml.kernel.org/r/20231206103558.38040-1-david@redhat.com
Fixes: f4b5fd6946 ("selftests/vm: anon_cow: THP tests")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Nico Pache <npache@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When below config items are set, compiler complained:
--------------------
CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
CONFIG_CRASH_DUMP=y
......
-----------------------
-------------------------------------------------------------------
arch/riscv/kernel/crash_core.c: In function 'arch_crash_save_vmcoreinfo':
arch/riscv/kernel/crash_core.c:11:58: warning: format '%lx' expects argument of type 'long unsigned int', but argument 2 has type 'int' [-Wformat=]
11 | vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
| ~~^
| |
| long unsigned int
| %x
----------------------------------------------------------------------
This is because on riscv macro VMALLOC_START has different type when
CONFIG_MMU is set or unset.
arch/riscv/include/asm/pgtable.h:
--------------------------------------------------
Changing it to _AC(0, UL) in case CONFIG_MMU=n can fix the warning.
Link: https://lkml.kernel.org/r/ZW7OsX4zQRA3mO4+@MiWiFi-R3L-srv
Signed-off-by: Baoquan He <bhe@redhat.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Cc: Eric DeVolder <eric_devolder@yahoo.com>
Cc: Ignat Korchagin <ignat@cloudflare.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In commit f8ff23429c62 ("kernel/Kconfig.kexec: drop select of KEXEC for
CRASH_DUMP") we tried to fix a config regression, where CONFIG_CRASH_DUMP
required CONFIG_KEXEC.
However, it was not enough at least for arm64 platforms. While further
testing the patch with our arm64 config I noticed that CONFIG_CRASH_DUMP
is unavailable in menuconfig. This is because CONFIG_CRASH_DUMP still
depends on the new CONFIG_ARCH_SUPPORTS_KEXEC introduced in commit
91506f7e5d ("arm64/kexec: refactor for kernel/Kconfig.kexec") and on
arm64 CONFIG_ARCH_SUPPORTS_KEXEC requires CONFIG_PM_SLEEP_SMP=y, which in
turn requires either CONFIG_SUSPEND=y or CONFIG_HIBERNATION=y neither of
which are set in our config.
Given that we already established that CONFIG_KEXEC (which is a switch for
kexec system call itself) is not required for CONFIG_CRASH_DUMP drop
CONFIG_ARCH_SUPPORTS_KEXEC dependency as well. The arm64 kernel builds
just fine with CONFIG_CRASH_DUMP=y and with both CONFIG_KEXEC=n and
CONFIG_KEXEC_FILE=n after f8ff23429c62 ("kernel/Kconfig.kexec: drop select
of KEXEC for CRASH_DUMP") and this patch are applied given that the
necessary shared bits are included via CONFIG_KEXEC_CORE dependency.
[bhe@redhat.com: don't export some symbols when CONFIG_MMU=n]
Link: https://lkml.kernel.org/r/ZW03ODUKGGhP1ZGU@MiWiFi-R3L-srv
[bhe@redhat.com: riscv, kexec: fix dependency of two items]
Link: https://lkml.kernel.org/r/ZW04G/SKnhbE5mnX@MiWiFi-R3L-srv
Link: https://lkml.kernel.org/r/20231129220409.55006-1-ignat@cloudflare.com
Fixes: 91506f7e5d ("arm64/kexec: refactor for kernel/Kconfig.kexec")
Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: <stable@vger.kernel.org> # 6.6+: f8ff234: kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP
Cc: <stable@vger.kernel.org> # 6.6+
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Testing shows fast_isolate_freepages can blindly choose an unsuitable
pageblock from time to time particularly while the min mark is used from
XXX path:
if (!page) {
cc->fast_search_fail++;
if (scan_start) {
/*
* Use the highest PFN found above min. If one was
* not found, be pessimistic for direct compaction
* and use the min mark.
*/
if (highest >= min_pfn) {
page = pfn_to_page(highest);
cc->free_pfn = highest;
} else {
if (cc->direct_compaction && pfn_valid(min_pfn)) { /* XXX */
page = pageblock_pfn_to_page(min_pfn,
min(pageblock_end_pfn(min_pfn),
zone_end_pfn(cc->zone)),
cc->zone);
cc->free_pfn = min_pfn;
}
}
}
}
The reason is that no code is doing any check on the min_pfn
min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));
In contrast, slow path of isolate_freepages() is always skipping
unsuitable pageblocks in a decent way.
This issue doesn't happen quite often. When running 25 machines with
16GiB memory for one night, most of them can hit this unexpected code
path. However the frequency isn't like many times per second. It might
be one time in a couple of hours. Thus, it is very hard to measure the
visible performance impact in my machines though the affection of choosing
the unsuitable migration_target should be negative in theory.
I feel it's still worth fixing this to at least make the code
theoretically self-explanatory as it is quite odd an unsuitable
migration_target can be still migration_target.
Link: https://lkml.kernel.org/r/20231206110054.61617-1-v-songbaohua@oppo.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Reported-by: Zhanyuan Hu <huzhanyuan@oppo.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently enabling THP support (CONFIG_TRANSPARENT_HUGEPAGE) requires
enabling either CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS or
CONFIG_TRANSPARENT_HUGEPAGE_MADVISE, which both cause khugepaged starting
by default at kernel bootup. Add the third choice
CONFIG_TRANSPARENT_HUGEPAGE_NEVER, in line with the existing kernel
command line setting transparent_hugepage=never, to disable THP by default
(in particular, to prevent starting khugepaged by default) but still allow
enabling it at runtime via sysfs.
Rationale: khugepaged has its own non-negligible memory cost even if it is
not used by any applications, since it bumps up vm.min_free_kbytes to its
own required minimum in set_recommended_min_free_kbytes(). For example,
on a machine with 4GB RAM, with 3 mm zones and pageblock_order ==
MAX_ORDER, starting khugepaged causes vm.min_free_kbytes increase from 8MB
to 132MB.
So if we use THP on machines with e.g. >=8GB of memory for better
performance, but avoid using it on lower-memory machines to avoid its
memory overhead, then for the same reason we also want to avoid even
starting khugepaged on those <8GB machines. So with
CONFIG_TRANSPARENT_HUGEPAGE_NEVER we can use the same kernel image on both
>=8GB and <8GB machines, with THP support enabled but khugepaged not
started by default. The userspace can then decide to enable THP via sysfs
if needed, based on the total amount of memory.
This could also be achieved with the existing transparent_hugepage=never
setting in the kernel command line instead. But it seems cleaner to avoid
tweaking the command line for such a basic setting.
P.S. I see that CONFIG_TRANSPARENT_HUGEPAGE_NEVER was already proposed
in the past [1] but without an explanation of the purpose.
[1] https://lore.kernel.org/all/202211301651462590168@zte.com.cn/
Link: https://lkml.kernel.org/r/20231205170244.2746210-1-dmaluka@chromium.org
Link: https://lore.kernel.org/all/20231204163254.2636289-1-dmaluka@chromium.org/
Signed-off-by: Dmytro Maluka <dmaluka@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On systems with large number of CPUs, the following soft lockup splat
might sometimes happen:
[ 2656.001617] watchdog: BUG: soft lockup - CPU#364 stuck for 21s! [ksoftirqd/364:2206]
:
[ 2656.141194] RIP: 0010:_raw_spin_unlock_irqrestore+0x3d/0x70
:
2656.241214] Call Trace:
[ 2656.243971] <IRQ>
[ 2656.246237] ? show_trace_log_lvl+0x1c4/0x2df
[ 2656.251152] ? show_trace_log_lvl+0x1c4/0x2df
[ 2656.256066] ? kmemleak_free_percpu+0x11f/0x1f0
[ 2656.261173] ? watchdog_timer_fn+0x379/0x470
[ 2656.265984] ? __pfx_watchdog_timer_fn+0x10/0x10
[ 2656.271179] ? __hrtimer_run_queues+0x5f3/0xd00
[ 2656.276283] ? __pfx___hrtimer_run_queues+0x10/0x10
[ 2656.281783] ? ktime_get_update_offsets_now+0x95/0x2c0
[ 2656.287573] ? ktime_get_update_offsets_now+0xdd/0x2c0
[ 2656.293380] ? hrtimer_interrupt+0x2e9/0x780
[ 2656.298221] ? __sysvec_apic_timer_interrupt+0x184/0x640
[ 2656.304211] ? sysvec_apic_timer_interrupt+0x8e/0xc0
[ 2656.309807] </IRQ>
[ 2656.312169] <TASK>
[ 2656.326110] kmemleak_free_percpu+0x11f/0x1f0
[ 2656.331015] free_percpu.part.0+0x1b/0xe70
[ 2656.335635] free_vfsmnt+0xb9/0x100
[ 2656.339567] rcu_do_batch+0x3c8/0xe30
[ 2656.363693] rcu_core+0x3de/0x5a0
[ 2656.367433] __do_softirq+0x2d0/0x9a8
[ 2656.381119] run_ksoftirqd+0x36/0x60
[ 2656.385145] smpboot_thread_fn+0x556/0x910
[ 2656.394971] kthread+0x2a4/0x350
[ 2656.402826] ret_from_fork+0x29/0x50
[ 2656.406861] </TASK>
The issue is caused by kmemleak registering each per_cpu_ptr()
corresponding to the __percpu pointer. This is unnecessary since such
individual per-CPU pointers are not tracked anyway. Create a new
object_percpu_tree_root rbtree that stores a single __percpu pointer
together with an OBJECT_PERCPU flag for the kmemleak metadata. Scanning
needs to be done for all per_cpu_ptr() pointers with a cond_resched()
between each CPU iteration to avoid RCU stalls.
[catalin.marinas@arm.com: update comment]
Link: https://lkml.kernel.org/r/20231206114414.2085824-1-catalin.marinas@arm.com
Link: https://lore.kernel.org/r/20231127194153.289626-1-longman@redhat.comLink: https://lkml.kernel.org/r/20231201190829.825856-1-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Waiman Long <longman@redhat.com>
Closes: https://lore.kernel.org/r/20231127194153.289626-1-longman@redhat.com
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Running my yearly branch profiler to see where likely/unlikely annotation
may be added or removed, I discovered this:
correct incorrect % Function File Line
------- --------- - -------- ---- ----
0 457918 100 page_try_dup_anon_rmap rmap.h 264
[..]
458021 0 0 page_try_dup_anon_rmap rmap.h 265
I thought it was interesting that line 264 of rmap.h had a 100% incorrect
annotation, but the line directly below it was 100% correct. Looking at the
code:
if (likely(!is_device_private_page(page) &&
unlikely(page_needs_cow_for_dma(vma, page))))
It didn't make sense. The "likely()" was around the entire if statement
(not just the "!is_device_private_page(page)"), which also included the
"unlikely()" portion of that if condition.
If the unlikely portion is unlikely to be true, that would make the entire
if condition unlikely to be true, so it made no sense at all to say the
entire if condition is true.
What is more likely to be likely is just the first part of the if statement
before the && operation. It's likely to be a misplaced parenthesis. And
after making the if condition broken into a likely() && unlikely(), both
now appear to be correct!
Link: https://lkml.kernel.org/r/20231201145936.5ddfdb50@gandalf.local.home
Fixes:fb3d824d1a46c ("mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The THP machinery does not support order-1 folios because it requires meta
data spanning the first 3 `struct page`s. So order-2 is the smallest
large folio that we can safely create.
There was a theoretical bug whereby if ra->size was 2 or 3 pages (due to
the device-specific bdi->ra_pages being set that way), we could end up
with order = 1. Fix this by unconditionally checking if the preferred
order is 1 and if so, set it to 0. Previously this was done in a few
specific places, but with this refactoring it is done just once,
unconditionally, at the end of the calculation.
This is a theoretical bug found during review of the code; I have no
evidence to suggest this manifests in the real world (I expect all
device-specific ra_pages values are much bigger than 3).
Link: https://lkml.kernel.org/r/20231201161045.3962614-1-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>