Most compressors are actually CPU-based and won't sleep during compression
and decompression. We should remove the redundant memcpy for them.
This patch checks if the algorithm is sleepable by testing the
CRYPTO_ALG_ASYNC algorithm flag.
Generally speaking, async and sleepable are semantically similar but not
equal. But for compress drivers, they are basically equal, for at least
the following reasons.
Firstly, scompress drivers - crypto/deflate.c, lz4.c, zstd.c, lzo.c etc. -
never sleep. Secondly, zRAM has been using these scompress drivers in
atomic contexts for years without ever worrying about them sleeping.
One exception is that an async driver can sometimes still return
synchronously, per Herbert's clarification. In that case we still incur
a redundant memcpy, but we can't know whether one particular acomp
request will sleep unless the crypto layer exposes more details about
each specific request from offload drivers.
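In code, the check amounts to something like the following sketch
(assuming the acomp_is_async() helper introduced alongside this change;
the is_sleepable field and the bounce-copy site are illustrative, not
the literal hunk):

	/* At tfm setup time (sketch): remember whether the driver may
	 * sleep, judged by the CRYPTO_ALG_ASYNC algorithm flag. */
	acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);

	/* On the (de)compression path: only bounce through the
	 * pre-allocated buffer when the driver might sleep. */
	if (acomp_ctx->is_sleepable)
		memcpy(acomp_ctx->buffer, src, entry->length);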
Link: https://lkml.kernel.org/r/20240222081135.173040-3-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Chris Li <chrisl@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In a Copy-on-Write (CoW) scenario, the last subpage will reuse the entire
large folio, resulting in the waste of (nr_pages - 1) pages. This wasted
memory remains allocated until it is either unmapped or memory reclamation
occurs.
The following small program serves as evidence of this behavior:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SIZE (1024 * 1024 * 1024UL)	/* 1 GiB */

int main(void)
{
	void *p = malloc(SIZE);

	memset(p, 0x11, SIZE);		/* fault the region in */
	if (fork() == 0)		/* make every folio CoW-shared */
		_exit(0);
	memset(p, 0x12, SIZE);		/* write again: CoW every subpage */
	printf("done\n");
	while (1)
		;			/* keep the mappings alive */
}
For example, enable 1024KiB mTHP via:
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/enabled
(1) w/o the patch, it takes 2GiB:
Before running the test program,
/ # free -m
              total   used   free  shared  buff/cache  available
Mem:           5754     84   5692       0          17       5669
Swap:             0      0      0
/ # /a.out &
/ # done
After running the test program,
/ # free -m
              total   used   free  shared  buff/cache  available
Mem:           5754   2149   3627       0          19       3605
Swap:             0      0      0
(2) w/ the patch, it takes 1GiB only:
Before running the test program,
/ # free -m
              total   used   free  shared  buff/cache  available
Mem:           5754     89   5687       0          17       5664
Swap:             0      0      0
/ # /a.out &
/ # done
After running the test program,
/ # free -m
              total   used   free  shared  buff/cache  available
Mem:           5754   1122   4655       0          17       4632
Swap:             0      0      0
This patch migrates the last subpage to a small folio and immediately
returns the large folio to the system. It benefits both memory availability
and anti-fragmentation.
Link: https://lkml.kernel.org/r/20240308092721.144735-1-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Lance Yang <ioworker0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reverts one change in commit 924bd6a8c9 ("mm/x86: drop two
unnecessary pud_leaf() definitions").
One issue is that it broke nopmd builds for at least arm64 and riscv
(CONFIG_PGTABLE_LEVELS=2). The other issue is that the change was common
code rather than x86-specific, which the commit message of that commit
overlooked.
Normally there's no need for an empty definition of pXd_leaf() because
of the fallback functions. However, that logic may not apply to
pgtable-nopmd.h, because that header can be used even by arch *pgtable.h
headers, which may use the *_leaf() definitions _before_ the fallback
functions are defined. Leave the definition there so PGTABLE_LEVELS=2
builds pass.
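For reference, the restored stub is trivial (sketch):

	/* include/asm-generic/pgtable-nopmd.h: arch *pgtable.h headers
	 * may use pud_leaf() before the generic fallback is defined,
	 * so keep an explicit stub here. */
	#define pud_leaf(pud)	0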
Link: https://lkml.kernel.org/r/Ze8vFNV9YSdgC2S7@x1n
Fixes: 924bd6a8c9 ("mm/x86: drop two unnecessary pud_leaf() definitions")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202403090900.OwPqmRuI-lkp@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202403101607.a42gaLOS-lkp@intel.com/
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "selftests/mm: Improve Hugepage Test Handling in MM
Selftests", v2.
This series addresses issues related to hugepage requirements in the MM
selftests, ensuring tests are skipped rather than failing when the
necessary hugepage count is not met.
This adjustment allows more graceful handling on systems with
insufficient hugepages, preventing unnecessary test failures and
improving the overall robustness of the test suite.
This patch (of 3):
On systems that have large core counts and large page sizes, but limited
memory, the userfaultfd test hugepage requirement is too large.
Exiting early due to missing one test's requirements is a rather
aggressive strategy, and prevents a lot of other tests from running.
Remove the early exit to prevent this.
Link: https://lkml.kernel.org/r/20240306223714.320681-1-npache@redhat.com
Link: https://lkml.kernel.org/r/20240306223714.320681-2-npache@redhat.com
Fixes: ee00479d67 ("selftests: vm: Try harder to allocate huge pages")
Signed-off-by: Nico Pache <npache@redhat.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
With cache_trim_mode on, reclaim logic doesn't bother reclaiming anon
pages. However, the mode should be used more carefully, because it
prevents anon pages from being reclaimed even when there is a huge
number of cold anon pages that should be reclaimed. Even worse, it can
drive kswapd_failures up to MAX_RECLAIM_RETRIES, stopping kswapd from
functioning until direct reclaim eventually works and resumes it.
So kswapd needs to retry its scan priority loop with cache_trim_mode
off if the mode doesn't result in any reclaim progress.
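The shape of the retry, roughly (an illustrative sketch against
balance_pgdat(); the flag and progress-check names here are made up for
illustration and are not the literal patch):

restart:
	/* ... the usual scan priority loop runs here ... */

	if (!made_progress && used_cache_trim_mode && !no_cache_trim_mode) {
		/* Trimming file cache alone reclaimed nothing; retry
		 * the whole priority loop with cache_trim_mode off
		 * instead of bumping kswapd_failures. */
		no_cache_trim_mode = true;
		goto restart;
	}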
The problematic behavior can be reproduced by:
CONFIG_NUMA_BALANCING enabled
sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
numa node0 (8GB local memory, 16 CPUs)
numa node1 (8GB slow tier memory, no CPUs)
Sequence:
1) echo 3 > /proc/sys/vm/drop_caches
2) To emulate a system whose local DRAM is full of cold memory, run
the following dummy program and never touch the region:
mmap(0, 8 * 1024 * 1024 * 1024, PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1, 0);
3) Run any memory intensive work e.g. XSBench.
4) Check if numa balancing is working, i.e., promotion/demotion.
5) Iterate 1) ~ 4) until numa balancing stops.
With this, you could see that promotion/demotion are not working because
kswapd has stopped due to ->kswapd_failures >= MAX_RECLAIM_RETRIES.
The interesting vmstat deltas, before and after the patch, are:
+-----------------------+----------+---------+
| interesting vmstat    |   before |   after |
+-----------------------+----------+---------+
| nr_inactive_anon      |   321935 | 1664772 |
| nr_active_anon        |  1780700 |  437834 |
| nr_inactive_file      |    30425 |   40882 |
| nr_active_file        |    14961 |    3012 |
| pgpromote_success     |      356 | 1293122 |
| pgpromote_candidate   | 21953245 | 1824148 |
| pgactivate            |  1844523 | 3311907 |
| pgdeactivate          |    50634 | 1554069 |
| pgfault               | 31100294 | 6518806 |
| pgdemote_kswapd       |    30856 | 2230821 |
| pgscan_kswapd         |  1861981 | 7667629 |
| pgscan_anon           |  1822930 | 7610583 |
| pgscan_file           |    39051 |   57046 |
| pgsteal_anon          |      386 | 2192033 |
| pgsteal_file          |    30470 |   38788 |
| pageoutrun            |       30 |     412 |
| numa_hint_faults      | 27418279 | 2875955 |
| numa_pages_migrated   |      356 | 1293122 |
+-----------------------+----------+---------+
Link: https://lkml.kernel.org/r/20240304082118.20499-1-byungchul@sk.com
Signed-off-by: Byungchul Park <byungchul@sk.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Users of UFFDIO_CONTINUE may reasonably assume that a write memory barrier
is included as part of UFFDIO_CONTINUE. That is, a user may believe that
all writes it has done to a page that it is now UFFDIO_CONTINUE'ing are
guaranteed to be visible to anyone subsequently reading the page through
the newly mapped virtual memory region.
Today, such a user happens to be correct. mmget_not_zero(), for example,
is called as part of UFFDIO_CONTINUE (and comes before any PTE updates),
and it implicitly gives us a write barrier.
To be resilient against future changes, include an explicit smp_wmb().
While we're at it, optimize the smp_wmb() that is already incidentally
present for the HugeTLB case.
Merely making a syscall does not generally imply the memory ordering
constraints that we need (including on x86).
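The requirement, conceptually (a sketch; populate_page_contents() is a
stand-in for the user's earlier writes, not a real function):

	populate_page_contents(page);		/* writes to the page */
	smp_wmb();				/* publish contents first... */
	set_pte_at(mm, addr, ptep, entry);	/* ...then install the PTE */

Without the barrier, a thread that observes the new PTE could still
read stale page contents on a weakly ordered architecture.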
Link: https://lkml.kernel.org/r/20240307010250.3847179-1-jthoughton@google.com
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE) in
an application, leading to an unexpected issue [1].
This is caused by the PTE being temporarily cleared during a
read+clear/modify/write update of the PTE, e.g., in
do_numa_page()/change_pte_range().
For the data segment of a user-mode program, the global variable area
is a private mapping. After the pagecache is loaded and CoW is
triggered, a private anonymous page is generated. mlockall() can lock
the CoW pages (anonymous pages), but the original file pages cannot be
locked and may be reclaimed. If a global variable (private anon page)
is accessed while vmf->pte is zeroed during a NUMA fault, a file page
fault is triggered. At this point the original private file page may
already have been reclaimed; if the page cache is not available, a
major fault is triggered and the file is read, causing additional
overhead.
This issue affects our traffic analysis service, whose inbound traffic
is heavy. If a major fault occurs, I/O scheduling is triggered and the
original I/O is suspended. Typically, the I/O scheduling delay is about
0.7 ms; if other applications are operating on the disks, the system
may need to wait for more than 10 ms. Since the inbound traffic is
heavy and the NIC buffer is small, packet loss occurs, which the
traffic analysis service cannot tolerate.
Fix this by holding the PTL and rechecking the PTE in filemap_fault()
before triggering a major fault. The check is done only if the VMA is
VM_LOCKED, to reduce the performance impact in common scenarios.
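The recheck amounts to something like the following sketch (helper name
and exact flow illustrative, not the literal hunk):

static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	vm_fault_t ret = 0;
	pte_t *ptep;

	/* Only VM_LOCKED vmas can hit the transiently-cleared-PTE
	 * case; skip the extra work for everything else. */
	if (!(vma->vm_flags & VM_LOCKED))
		return 0;

	ptep = pte_offset_map_nolock(vma->vm_mm, vmf->pmd, vmf->address,
				     &vmf->ptl);
	if (unlikely(!ptep))
		return VM_FAULT_NOPAGE;

	if (unlikely(!pte_none(ptep_get_lockless(ptep)))) {
		ret = VM_FAULT_NOPAGE;	/* PTE was restored: just retry */
	} else {
		spin_lock(vmf->ptl);
		if (unlikely(!pte_none(ptep_get(ptep))))
			ret = VM_FAULT_NOPAGE;
		spin_unlock(vmf->ptl);
	}
	pte_unmap(ptep);
	return ret;
}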
In our production environment, there were 7 major faults every 12
hours. After the patch was applied, no major faults have been
triggered.
File page read and write page fault performance was tested on ext4 and
ramdisk using will-it-scale [2] on an x86 physical machine. The data is
the average change relative to mainline after the patch is applied.
The results are within the range of normal fluctuation; since the
check is done only for VM_LOCKED VMAs, no performance regression is
introduced for the most common cases.
The test results are as follows:
                             processes  processes_idle  threads  threads_idle
ext4 private file write:         0.22%           0.26%    1.21%        -0.15%
ext4 private file read:          0.03%           1.00%    1.39%         0.34%
ext4 shared file write:         -0.50%          -0.02%   -0.14%        -0.02%
ramdisk private file write:      0.07%           0.02%    0.53%         0.04%
ramdisk private file read:       0.01%           1.60%   -0.32%        -0.02%
[1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/
[2] https://github.com/antonblanchard/will-it-scale/
Link: https://lkml.kernel.org/r/20240306083809.1236634-1-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "page_owner: Fixup and cleanup".
This patchset consists of a fixup for an error reported by the Intel
kernel test robot: it seems that by the time page_owner gets
initialized, stackdepot has already depleted its allocation space and
returns 0-handles, which turn into NULL stack_records when trying to
retrieve the stack_record. I was not able to reproduce the error from
the reported config because it booted fine for me, but when
artificially setting e.g. dummy_handle to 0, I could see the same
error that was reported.
The second patch is a cleanup of code that could also lead to a
compilation warning.
This patch (of 2):
Although the stack_records for {dummy,failure}_handle are retrieved
when page_owner gets initialized, there seem to be situations where
stackdepot space has already been depleted by then, so we get
0-handles, which leave those stack_records NULL.
Be careful to 1) only bump the stack_record refcount and 2) only
access stack_record fields if we actually have a non-NULL stack_record
in hand.
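Concretely, the guard amounts to something like this (sketch; assumes
the __stack_depot_get_stack_record() helper used by this series, and
the field names are illustrative):

	/* mm/page_owner.c (sketch): a depleted stackdepot yields a
	 * 0-handle, for which no stack_record exists. */
	struct stack_record *stack_record =
			__stack_depot_get_stack_record(handle);

	if (!stack_record)
		return;
	/* only now is it safe to touch the refcount and other fields */
	refcount_inc(&stack_record->count);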
Link: https://lkml.kernel.org/r/20240306123217.29774-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20240306123217.29774-2-osalvador@suse.de
Fixes: 4bedfb314b ("mm,page_owner: implement the tracking of the stacks count")
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202403051032.e2f865a-lkp@intel.com
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There was previously a theoretical window where swapoff() could run and
teardown a swap_info_struct while a call to free_swap_and_cache() was
running in another thread. This could cause, amongst other bad
possibilities, swap_page_trans_huge_swapped() (called by
free_swap_and_cache()) to access the freed memory for swap_map.
This is a theoretical problem and I haven't been able to provoke it from a
test case. But there has been agreement based on code review that this is
possible (see link below).
Fix it by using get_swap_device()/put_swap_device(), which will stall
swapoff(). There was an extra check in _swap_info_get() to confirm that
the swap entry was not free. This isn't present in get_swap_device()
because it doesn't make sense in general due to the race between getting
the reference and swapoff. So I've added an equivalent check directly in
free_swap_and_cache().
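The shape of the fix, roughly (a sketch, not the literal hunk):

	/* free_swap_and_cache() (sketch): pin the swap device so that
	 * swapoff() cannot tear it down while we are using si. */
	si = get_swap_device(entry);
	if (!si)
		return;			/* raced with swapoff */

	/* get_swap_device() doesn't check that the entry is in use
	 * (that can't be done race-free in general), so recheck it
	 * here, like the old _swap_info_get() did. */
	if (WARN_ON(data_race(!si->swap_map[swp_offset(entry)])))
		goto out;

	/* ... existing freeing logic ... */
out:
	put_swap_device(si);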
Details of how to provoke one possible issue (thanks to David Hildenbrand
for deriving this):
--8<-----
__swap_entry_free() might be the last user and result in
"count == SWAP_HAS_CACHE".
swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.
So the question is: could someone reclaim the folio and turn
si->inuse_pages==0 before we completed swap_page_trans_huge_swapped()?
Imagine the following: a 2 MiB folio in the swapcache. Only 2 subpages
are still referenced by swap entries.
Process 1 still references subpage 0 via swap entry.
Process 2 still references subpage 1 via swap entry.
Process 1 quits. Calls free_swap_and_cache().
-> count == SWAP_HAS_CACHE
[then, preempted in the hypervisor etc.]
Process 2 quits. Calls free_swap_and_cache().
-> count == SWAP_HAS_CACHE
Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
__try_to_reclaim_swap().
__try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
swap_entry_free()->swap_range_free()->
...
WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
What stops swapoff from succeeding after process 2 reclaimed the swap
cache but before process 1 finished its call to
swap_page_trans_huge_swapped()?
--8<-----
Link: https://lkml.kernel.org/r/20240306140356.3974886-1-ryan.roberts@arm.com
Fixes: 7c00bafee8 ("mm/swap: free swap slots in batch")
Closes: https://lore.kernel.org/linux-mm/65a66eb9-41f8-4790-8db2-0c70ea15979f@redhat.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
rounddown_pow_of_two(0) is undefined, so val = 0 must not be allowed
in fault_around_bytes_set(); it leads to a shift-out-of-bounds:
UBSAN: shift-out-of-bounds in include/linux/log2.h:67:13
shift exponent 4294967295 is too large for 64-bit type 'long unsigned int'
CPU: 7 PID: 107 Comm: sh Not tainted 6.8.0-rc6-next-20240301 #294
Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
Call trace:
dump_backtrace+0x94/0xec
show_stack+0x18/0x24
dump_stack_lvl+0x78/0x90
dump_stack+0x18/0x24
ubsan_epilogue+0x10/0x44
__ubsan_handle_shift_out_of_bounds+0x98/0x134
fault_around_bytes_set+0xa4/0xb0
simple_attr_write_xsigned.isra.0+0xe4/0x1ac
simple_attr_write+0x18/0x24
debugfs_attr_write+0x4c/0x98
vfs_write+0xd0/0x4b0
ksys_write+0x6c/0xfc
__arm64_sys_write+0x1c/0x28
invoke_syscall+0x44/0x104
el0_svc_common.constprop.0+0x40/0xe0
do_el0_svc+0x1c/0x28
el0_svc+0x34/0xdc
el0t_64_sync_handler+0xc0/0xc4
el0t_64_sync+0x190/0x194
---[ end trace ]---
Fix it by setting the minimum val to PAGE_SIZE.
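That is, roughly (sketch; the PTRS_PER_PTE check is the pre-existing
bound, the clamp is the fix):

	/* mm/memory.c, fault_around_bytes_set() (sketch): clamp val
	 * before rounding so rounddown_pow_of_two() never sees 0. */
	if (val / PAGE_SIZE > PTRS_PER_PTE)
		return -EINVAL;

	val = max(val, PAGE_SIZE);	/* rounddown_pow_of_two(0) is UB */
	fault_around_pages = rounddown_pow_of_two(val) / PAGE_SIZE;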
Link: https://lkml.kernel.org/r/20240302064312.2358924-1-wangkefeng.wang@huawei.com
Fixes: 53d36a56d8 ("mm: prefer fault_around_pages to fault_around_bytes")
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reported-by: Yue Sun <samsun1006219@gmail.com>
Closes: https://lore.kernel.org/all/CAEkJfYPim6DQqW1GqCiHLdh2-eweqk1fGyXqs3JM+8e1qGge8w@mail.gmail.com/
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "PageFlags cleanups".
We have now successfully removed all of the uses of some of the PageFlags
from the kernel, but there's nothing to stop somebody reintroducing them.
By splitting out FOLIO_FLAGS from PAGEFLAGS, we can stop defining the old
flags; and we do that in some of the later patches.
After doing this, I realised that dump_page() was living dangerously; we
could end up calling folio_test_foo() on a pointer which no longer pointed
to a folio (as dump_page() is not necessarily called when the caller has a
reference to the page). So I fixed that up.
And then I realised that this was the key to making dump_page() take a
const argument, which means we can constify the page flags testing, which
means we can remove more cast-away-the-const bad code.
And here's where I ended up.
This patch (of 8):
We've progressed far enough with the folio transition that some flags are
now no longer checked on pages, but only on folios. To prevent new users
appearing, prepare to only define the folio versions of the flag
test/set/clear.
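The shape of the split looks roughly like this (a simplified sketch;
the real macros in include/linux/page-flags.h take a page-policy
argument and define more variants):

#define FOLIO_TEST_FLAG(name, page)					\
static __always_inline bool folio_test_##name(struct folio *folio)	\
{ return test_bit(PG_##name, folio_flags(folio, page)); }

#define FOLIO_SET_FLAG(name, page)					\
static __always_inline void folio_set_##name(struct folio *folio)	\
{ set_bit(PG_##name, folio_flags(folio, page)); }

/* FOLIO_FLAG generates only the folio accessors; no page_* twin is
 * defined, so new page-level users cannot appear. */
#define FOLIO_FLAG(name, page)						\
	FOLIO_TEST_FLAG(name, page)					\
	FOLIO_SET_FLAG(name, page)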
Link: https://lkml.kernel.org/r/20240227192337.757313-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20240227192337.757313-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hugetlb parallelization depends on PADATA, and PADATA depends on SMP.
PADATA consists of two distinct pieces of functionality: one part is
padata_do_multithreaded(), which disregards order and simply divides
tasks into several groups for parallel execution; hugetlb init
parallelization depends on this.
The other part is a set of APIs that, while handling data in an
out-of-order parallel manner, can eventually return the data in its
original ordered sequence. Currently only crypto/pcrypt.c uses them.
All non-SMP users of PADATA currently use only
padata_do_multithreaded(). It is easy to implement a serial version in
include/linux/padata.h, and there is no need to implement the other
functionality unless crypto/pcrypt.c, its only user, stops depending
on SMP in the future.
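That serial fallback is essentially the following (sketch of the
!CONFIG_PADATA stub):

	/* include/linux/padata.h (sketch): without CONFIG_PADATA, run
	 * the whole job range in the calling thread. */
	static inline void __init padata_do_multithreaded(struct padata_mt_job *job)
	{
		job->thread_fn(job->start, job->start + job->size, job->fn_arg);
	}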
Link: https://lkml.kernel.org/r/20240222140422.393911-6-gang.li@linux.dev
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>