Commit Graph

25197 Commits

Author SHA1 Message Date
Vlastimil Babka
2d517aa09b slab: add opt-in caching layer of percpu sheaves
Specifying a non-zero value for a new struct kmem_cache_args field
sheaf_capacity will set up a caching layer of percpu arrays called
sheaves of given capacity for the created cache.

Allocations from the cache will allocate via the percpu sheaves (main or
spare) as long as they have no NUMA node preference. Frees will also
put the object back into one of the sheaves.

When both percpu sheaves are found empty during an allocation, an empty
sheaf may be replaced with a full one from the per-node barn. If none
are available and the allocation is allowed to block, an empty sheaf is
refilled from slab(s) by an internal bulk alloc operation. When both
percpu sheaves are full during freeing, the barn can replace a full one
with an empty one, unless over a full sheaves limit. In that case a
sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
sheaves and barns is also wired to the existing cpu flushing and cache
shrinking operations.
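
The main/spare/barn interplay described above can be sketched as a
single-threaded userspace model. This is only an illustration under stated
assumptions, not the kernel implementation: the names model_alloc(),
model_free(), BARN_MAX_FULL and BARN_MAX_EMPTY are hypothetical, there is one
"cpu" and one "node", and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SHEAF_CAPACITY	4	/* stands in for kmem_cache_args.sheaf_capacity */
#define BARN_MAX_FULL	2	/* hypothetical full sheaves limit */
#define BARN_MAX_EMPTY	2	/* hypothetical empty sheaves limit */

struct sheaf {
	size_t size;			/* number of cached objects */
	void *objects[SHEAF_CAPACITY];
};

struct barn {				/* per NUMA node in the kernel */
	struct sheaf *full[BARN_MAX_FULL];
	struct sheaf *empty[BARN_MAX_EMPTY];
	size_t nr_full, nr_empty;
};

struct pcs {				/* per cpu in the kernel */
	struct sheaf *main, *spare;
};

void pcs_swap(struct pcs *pcs)
{
	struct sheaf *tmp = pcs->main;

	pcs->main = pcs->spare;
	pcs->spare = tmp;
}

/* Returns NULL when the caller would have to refill from slabs. */
void *model_alloc(struct pcs *pcs, struct barn *barn)
{
	assert(pcs->main && pcs->spare);

	if (pcs->main->size == 0) {
		if (pcs->spare->size > 0) {
			pcs_swap(pcs);
		} else if (barn->nr_full > 0 && barn->nr_empty < BARN_MAX_EMPTY) {
			/* Exchange the empty main sheaf for a full one. */
			struct sheaf *empty = pcs->main;

			pcs->main = barn->full[--barn->nr_full];
			barn->empty[barn->nr_empty++] = empty;
		} else {
			return NULL;
		}
	}
	return pcs->main->objects[--pcs->main->size];
}

/* Returns false when the caller would have to flush a sheaf to slabs. */
bool model_free(struct pcs *pcs, struct barn *barn, void *obj)
{
	assert(pcs->main && pcs->spare);

	if (pcs->main->size == SHEAF_CAPACITY) {
		if (pcs->spare->size < SHEAF_CAPACITY) {
			pcs_swap(pcs);
		} else if (barn->nr_empty > 0 && barn->nr_full < BARN_MAX_FULL) {
			/* Exchange the full main sheaf for an empty one. */
			struct sheaf *full = pcs->main;

			pcs->main = barn->empty[--barn->nr_empty];
			barn->full[barn->nr_full++] = full;
		} else {
			return false;
		}
	}
	pcs->main->objects[pcs->main->size++] = obj;
	return true;
}
```

In this model a NULL or false return marks the point where the real code
would fall back to the internal bulk alloc (refill) or bulk free (flush)
against the slabs.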

The sheaves do not distinguish NUMA locality of the cached objects. If
an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
the sheaves are bypassed.

The bulk operations exposed to slab users also try to utilize the
sheaves as long as the necessary (full or empty) sheaves are available
on the cpu or in the barn. Once depleted, they will fall back to bulk
alloc/free to slabs directly to avoid double copying.

The sheaf_capacity value is exported in sysfs for observability.

Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
count objects allocated or freed using the sheaves (and thus not
counting towards the other alloc/free path counters). Counters
sheaf_refill and sheaf_flush count objects filled or flushed from or to
slab pages, and can be used to assess how effective the caching is. The
refill and flush operations will also count towards the usual
alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
the backing slabs. For barn operations, barn_get and barn_put count how
many full sheaves were taken from or put to the barn; the _fail variants
count how many such requests could not be satisfied, mainly because the
barn was either empty or full. While the barn also holds empty sheaves
to make some operations easier, these are not critical enough to mandate
their own counters. Finally, there are sheaf_alloc/sheaf_free counters.

Access to the percpu sheaves is protected by local_trylock() when
potential callers include irq context, and local_lock() otherwise (such
as when we already know the gfp flags allow blocking). The trylock
failures should be rare and we can easily fall back. Each per-NUMA-node
barn has a spin_lock.

When slub_debug is enabled for a cache with sheaf_capacity also
specified, the latter is ignored so that allocations and frees reach the
slow path where debugging hooks are processed. Similarly, we ignore it
with CONFIG_SLUB_TINY which prefers low memory usage to performance.

[boot failure: https://lore.kernel.org/all/583eacf5-c971-451a-9f76-fed0e341b815@linux.ibm.com/ ]

Reported-and-tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-26 11:56:50 +02:00
Vlastimil Babka
9d4e6ab865 slab: simplify init_kmem_cache_nodes() error handling
We don't need to call free_kmem_cache_nodes() immediately when failing
to allocate a kmem_cache_node, because when we return 0,
do_kmem_cache_create() calls __kmem_cache_release() which also performs
free_kmem_cache_nodes().

Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-26 11:56:36 +02:00
Akinobu Mita
06195ee967 mm/damon/sysfs: do not ignore callback's return value in damon_sysfs_damon_call()
The callback return value is ignored in damon_sysfs_damon_call(), which
means that it is not possible to detect invalid user input when writing
commands such as 'commit' to
/sys/kernel/mm/damon/admin/kdamonds/<K>/state.  Fix it.

Link: https://lkml.kernel.org/r/20250920132546.5822-1-akinobu.mita@gmail.com
Fixes: f64539dcdb ("mm/damon/sysfs: use damon_call() for update_schemes_stats")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org>	[6.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-25 16:10:35 -07:00
Eric Biggers
85e1ff6106 kmsan: fix out-of-bounds access to shadow memory
Running sha224_kunit on a KMSAN-enabled kernel results in a crash in
kmsan_internal_set_shadow_origin():

    BUG: unable to handle page fault for address: ffffbc3840291000
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 1810067 P4D 1810067 PUD 192d067 PMD 3c17067 PTE 0
    Oops: 0000 [#1] SMP NOPTI
    CPU: 0 UID: 0 PID: 81 Comm: kunit_try_catch Tainted: G                 N  6.17.0-rc3 #10 PREEMPT(voluntary)
    Tainted: [N]=TEST
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
    RIP: 0010:kmsan_internal_set_shadow_origin+0x91/0x100
    [...]
    Call Trace:
    <TASK>
    __msan_memset+0xee/0x1a0
    sha224_final+0x9e/0x350
    test_hash_buffer_overruns+0x46f/0x5f0
    ? kmsan_get_shadow_origin_ptr+0x46/0xa0
    ? __pfx_test_hash_buffer_overruns+0x10/0x10
    kunit_try_run_case+0x198/0xa00

This occurs when memset() is called on a buffer that is not 4-byte aligned
and extends to the end of a guard page, i.e.  the next page is unmapped.

The bug is that the loop at the end of kmsan_internal_set_shadow_origin()
accesses the wrong shadow memory bytes when the address is not 4-byte
aligned.  Since each 4 bytes are associated with an origin, it rounds the
address and size so that it can access all the origins that contain the
buffer.  However, when it checks the corresponding shadow bytes for a
particular origin, it incorrectly uses the original unrounded shadow
address.  This results in reads from shadow memory beyond the end of the
buffer's shadow memory, which crashes when that memory is not mapped.

To fix this, correctly align the shadow address before accessing the 4
shadow bytes corresponding to each origin.
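
The alignment rule can be sketched in userspace as follows. This is a
minimal model, not the kernel code: shadow and origin memory are plain
arrays indexed by a byte offset, set_origins() is a hypothetical name, and
the condition for writing an origin (a non-zero origin, or all four shadow
bytes of the slot being zero) is an assumption based on the description
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ORIGIN_SIZE 4u	/* one origin covers 4 bytes of memory */

/* Check the 4 shadow bytes that belong to one origin slot. */
bool slot_shadow_is_zero(const uint8_t *shadow, size_t slot)
{
	for (size_t k = 0; k < ORIGIN_SIZE; k++)
		if (shadow[slot * ORIGIN_SIZE + k])
			return false;
	return true;
}

/*
 * Set the origin for every slot overlapping [addr, addr + size).
 * Returns the number of origin slots written.
 */
size_t set_origins(const uint8_t *shadow, uint32_t *origins,
		   size_t addr, size_t size, uint32_t origin)
{
	size_t pad = addr % ORIGIN_SIZE;
	size_t start = addr - pad;		/* round the address down */
	size_t len = (size + pad + ORIGIN_SIZE - 1) / ORIGIN_SIZE * ORIGIN_SIZE;
	size_t written = 0;

	assert(size > 0);

	for (size_t off = 0; off < len; off += ORIGIN_SIZE) {
		size_t slot = (start + off) / ORIGIN_SIZE;

		/*
		 * The shadow bytes are read at the aligned slot, so the
		 * last read ends at start + len and never beyond it.
		 */
		if (origin || slot_shadow_is_zero(shadow, slot)) {
			origins[slot] = origin;
			written++;
		}
	}
	return written;
}
```

Indexing the shadow by the unrounded address instead would shift every
4-byte read forward by pad bytes, which is the out-of-bounds access
described above.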

Link: https://lkml.kernel.org/r/20250911195858.394235-1-ebiggers@kernel.org
Fixes: 2ef3cec44c ("kmsan: do not wipe out origin when doing partial unpoisoning")
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Tested-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-25 16:10:34 -07:00
Jane Chu
14967a9c7d mm/hugetlb: fix copy_hugetlb_page_range() to use ->pt_share_count
commit 59d9094df3 ("mm: hugetlb: independent PMD page table shared
count") introduced ->pt_share_count dedicated to hugetlb PMD share count
tracking, but omitted fixing copy_hugetlb_page_range(), leaving the
function relying on page_count() for tracking that no longer works.

When lazy page table copy for hugetlb is disabled, that is, with commit
bcd51a3c67 ("hugetlb: lazy page table copies in fork()") reverted,
fork()'ing with hugetlb PMD sharing quickly locks up:

[  239.446559] watchdog: BUG: soft lockup - CPU#75 stuck for 27s!
[  239.446611] RIP: 0010:native_queued_spin_lock_slowpath+0x7e/0x2e0
[  239.446631] Call Trace:
[  239.446633]  <TASK>
[  239.446636]  _raw_spin_lock+0x3f/0x60
[  239.446639]  copy_hugetlb_page_range+0x258/0xb50
[  239.446645]  copy_page_range+0x22b/0x2c0
[  239.446651]  dup_mmap+0x3e2/0x770
[  239.446654]  dup_mm.constprop.0+0x5e/0x230
[  239.446657]  copy_process+0xd17/0x1760
[  239.446660]  kernel_clone+0xc0/0x3e0
[  239.446661]  __do_sys_clone+0x65/0xa0
[  239.446664]  do_syscall_64+0x82/0x930
[  239.446668]  ? count_memcg_events+0xd2/0x190
[  239.446671]  ? syscall_trace_enter+0x14e/0x1f0
[  239.446676]  ? syscall_exit_work+0x118/0x150
[  239.446677]  ? arch_exit_to_user_mode_prepare.constprop.0+0x9/0xb0
[  239.446681]  ? clear_bhb_loop+0x30/0x80
[  239.446684]  ? clear_bhb_loop+0x30/0x80
[  239.446686]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

There are two options to resolve the potential latent issue:
  1. warn against PMD sharing in copy_hugetlb_page_range(),
  2. fix it.
This patch opts for the second option.
While at it, simplify the comment; the details are not actually relevant
anymore.

Link: https://lkml.kernel.org/r/20250916004520.1604530-1-jane.chu@oracle.com
Fixes: 59d9094df3 ("mm: hugetlb: independent PMD page table shared count")
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-25 16:10:34 -07:00
SeongJae Park
8d009da32f mm/damon/sysfs: set damon_ctx->min_sz_region only for paddr use case
damon_ctx->addr_unit is respected only for physical address space
monitoring use case.  Meanwhile, damon_ctx->min_sz_region is used by the
core layer for aligning regions, regardless of whether it is set for
physical address space monitoring or virtual address spaces monitoring. 
And it is set as 'DAMON_MIN_REGION / damon_ctx->addr_unit'.  Hence, if
user sets ->addr_unit on virtual address spaces monitoring mode, regions
can be unexpectedly aligned in <PAGE_SIZE granularity.  It shouldn't cause
crash-like issues, but it makes monitoring and DAMOS behavior difficult to
understand.

Fix the unexpected behavior by setting ->min_sz_region only when it is
configured for physical address space monitoring.
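
The intended rule can be sketched as below. This is an illustration only:
enum model_ops and model_min_sz_region() are hypothetical names, the 4096
value assumes DAMON_MIN_REGION equals a 4 KiB PAGE_SIZE, and the clamp to 1
is an assumption added so the result is never zero.

```c
#include <assert.h>

#define MODEL_DAMON_MIN_REGION	4096UL	/* assumes a 4 KiB PAGE_SIZE */

enum model_ops { MODEL_OPS_VADDR, MODEL_OPS_PADDR };

/* Scale the minimum region size by addr_unit only for paddr monitoring. */
unsigned long model_min_sz_region(enum model_ops ops, unsigned long addr_unit)
{
	unsigned long sz;

	assert(addr_unit > 0);

	if (ops != MODEL_OPS_PADDR)
		return MODEL_DAMON_MIN_REGION;

	sz = MODEL_DAMON_MIN_REGION / addr_unit;
	return sz ? sz : 1;
}
```

With this rule, a stray addr_unit set in virtual address monitoring mode no
longer shrinks the alignment granularity below the page size.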

The issue was found from a result of Chris' experiments that he thankfully
shared with me off-list.

Link: https://lkml.kernel.org/r/20250917160041.53187-1-sj@kernel.org
Fixes: d8f867fa08 ("mm/damon: add damon_ctx->min_sz_region")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: ze zuo <zuoze1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-23 14:14:16 -07:00
Uladzislau Rezki (Sony)
7ef5268a90 mm/vmalloc: move resched point into alloc_vmap_area()
Currently vm_area_alloc_pages() contains two cond_resched() points. 
However, the page allocator already has its own in its slow path, so an extra
resched is not optimal because it delays the loops.

The place where CPU time can be consumed is in the VA-space search in
alloc_vmap_area(), especially if the space is really fragmented using
synthetic stress tests, after a fast path falls back to a slow one.

Move a single cond_resched() there, after dropping free_vmap_area_lock in
a slow path.  This keeps fairness where it matters while removing
redundant yields from the page-allocation path.

[akpm@linux-foundation.org: tweak comment grammar]
Link: https://lkml.kernel.org/r/20250917185906.1595454-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-23 14:14:16 -07:00
Matthew Wilcox (Oracle)
1b00ab4889 ksm: use a folio inside cmp_and_merge_page()
This removes the last call to page_stable_node(), so delete the wrapper. 
It also removes a call to trylock_page() and saves a call to
compound_head(), as well as removing a reference to folio->page.

Link: https://lkml.kernel.org/r/20250916181219.2400258-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Longlong Xia <xialonglong@kylinos.cn>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-23 14:14:16 -07:00
Johannes Weiner
19c5fb83f2 mm: page_alloc: avoid kswapd thrashing due to NUMA restrictions
On NUMA systems without bindings, allocations check all nodes for free
space, then wake up the kswapds on all nodes and retry. This ensures
all available space is evenly used before reclaim begins. However,
when one process or certain allocations have node restrictions, they
can cause kswapds on only a subset of nodes to be woken up.

Since kswapd hysteresis targets watermarks that are *higher* than
needed for allocation, even *unrestricted* allocations can now get
suckered onto such nodes that are already pressured. This ends up
concentrating all allocations on them, even when there are idle nodes
available for the unrestricted requests.

This was observed with two numa nodes, where node0 is normal and node1
is ZONE_MOVABLE to facilitate hotplugging: a kernel allocation wakes
kswapd on node0 only (since node1 is not eligible); once kswapd0 is
active, the watermarks hover between low and high, and then even the
movable allocations end up on node0, only to be kicked out again;
meanwhile node1 is empty and idle.

Similar behavior is possible when a process with NUMA bindings is
causing selective kswapd wakeups.

To fix this, on NUMA systems augment the (misleading) watermark test
with a check for whether kswapd is already active during the first
iteration through the zonelist. If this fails to place the request,
kswapd must be running everywhere already, and the watermark test is
good enough to decide placement.
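
A toy model of the two-pass placement may help. It is a sketch under
assumptions, not the page allocator code: struct model_node and
model_pick_node() are hypothetical, each node is reduced to a free page
count, one watermark and a kswapd flag, and node restrictions are a plain
bitmask.

```c
#include <assert.h>
#include <stdbool.h>

struct model_node {
	long free_pages;
	long watermark;
	bool kswapd_active;
};

/*
 * Returns the index of the node to allocate from, or -1 if no node
 * passes the watermark test. Bit i of allowed_mask permits node i.
 */
int model_pick_node(const struct model_node *nodes, int nr_nodes,
		    unsigned int allowed_mask)
{
	bool skipped_active = false;
	int i;

	assert(nr_nodes > 0 && nr_nodes <= 32);

	/* First pass: watermark test plus "kswapd not already active". */
	for (i = 0; i < nr_nodes; i++) {
		if (!(allowed_mask & (1u << i)))
			continue;
		if (nodes[i].free_pages <= nodes[i].watermark)
			continue;
		if (nodes[i].kswapd_active) {
			skipped_active = true;
			continue;
		}
		return i;
	}

	/* No retry if the kswapd check did not skip any node. */
	if (!skipped_active)
		return -1;

	/* Second pass: kswapd runs everywhere, the watermark decides. */
	for (i = 0; i < nr_nodes; i++) {
		if (!(allowed_mask & (1u << i)))
			continue;
		if (nodes[i].free_pages > nodes[i].watermark)
			return i;
	}
	return -1;
}
```

In the two-node scenario above, an unrestricted request (mask 0x3) skips
node0 while its kswapd is active and lands on node1, whereas a request
restricted to node0 (mask 0x1) still succeeds there on the second pass.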

With this patch, unrestricted requests successfully make use of node1,
even while kswapd is reclaiming node0 for restricted allocations.

[gourry@gourry.net: don't retry if no kswapds were active]
Link: https://lkml.kernel.org/r/20250919162134.1098208-1-hannes@cmpxchg.org
Signed-off-by: Gregory Price <gourry@gourry.net>
Tested-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-23 14:14:16 -07:00
Lorenzo Stoakes
fde591dad1 mm/oom_kill.c: fix inverted check
Fix an incorrect logic conversion in process_mrelease().

Link: https://lkml.kernel.org/r/3b7f0faf-4dbc-4d67-8a71-752fbcdf0906@lucifer.local
Fixes: 12e423ba4e ("mm: convert core mm to mm_flags_*() accessors")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Chris Mason <clm@meta.com>
Closes: https://lkml.kernel.org/r/c2e28e27-d84b-4671-8784-de5fe0d14f41@lucifer.local
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-23 14:14:16 -07:00
Kiryl Shutsemau
f8a01513f5 mm/khugepaged: do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
MADV_COLLAPSE on a file mapping behaves inconsistently depending on whether
a PMD page table is installed or not.

Consider the following example:

	p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
		 MAP_SHARED, fd, 0);
	err = madvise(p, 2UL << 20, MADV_COLLAPSE);

fd is a populated tmpfs file.

The result depends on the address that the kernel returns on mmap().  If
it is located in an existing PMD table, the madvise() will succeed. 
However, if the table does not exist, it will fail with -EINVAL.

This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when a
page table is missing, which causes collapse_pte_mapped_thp() to fail.

SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
tables as needed.

Link: https://lkml.kernel.org/r/v5ivpub6z2n2uyemlnxgbilzs52ep4lrary7lm7o6axxoneb75@yfacfl5rkzeh
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Cc: Barry Song <baohua@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-23 14:14:15 -07:00
Trond Myklebust
010054a530 filemap: Add a version of folio_end_writeback that ignores dropbehind
Filesystems such as NFS may need to defer dropbehind until after their
2-stage writes are done. This adds a helper
folio_end_writeback_no_dropbehind() that allows them to release the
writeback flag without immediately dropping the folio.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-23 13:29:50 -04:00
Trond Myklebust
24bbd533f5 filemap: Add a helper for filesystems implementing dropbehind
Add a helper to allow filesystems to attempt to free the 'dropbehind'
folio.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Link: https://lore.kernel.org/all/5588a06f6d5a2cf6746828e2d36e7ada668b1739.1745381692.git.trond.myklebust@hammerspace.com/
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-23 13:29:50 -04:00
Lorenzo Stoakes
f7a741c53b mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()
In commit bb666b7c27 ("mm: add mmap_prepare() compatibility layer for
nested file systems") we introduced the ability for stacked drivers and
file systems to correctly invoke the f_op->mmap_prepare() handler from an
f_op->mmap() handler via a compatibility layer implemented in
compat_vma_mmap_prepare().

This populates vm_area_desc fields according to those found in the (not
yet fully initialised) VMA passed to f_op->mmap().

However this function implicitly assumes that the struct file which we are
operating upon is equal to vma->vm_file.  This is not a safe assumption in
all cases.

The only really sane situation in which this matters would be something
like i915_gem_dmabuf_mmap(), which invokes vfs_mmap() against
obj->base.filp:

	ret = vfs_mmap(obj->base.filp, vma);
	if (ret)
		return ret;

And then sets the VMA's file to this, should the mmap operation succeed:

	vma_set_file(vma, obj->base.filp);

That is - it is the file that is intended to back the VMA mapping.

This is not an issue currently, as so far we have only implemented
f_op->mmap_prepare() handlers for some file systems and internal mm uses,
and the only stacked f_op->mmap() operations that can be performed upon
these are those in backing_file_mmap() and coda_file_mmap(), both of which
use vma->vm_file.

However, moving forward, as we convert drivers to using
f_op->mmap_prepare(), this will become a problem.

Resolve this issue by explicitly setting desc->file to the provided file
parameter and update callers accordingly.

Callers are expected to read desc->file and update desc->vm_file - the
former will be the file provided by the caller (if stacked, this may
differ from vma->vm_file).

If the caller needs to differentiate between the two they therefore now
can.

While we are here, also provide a variant of compat_vma_mmap_prepare()
that operates against a pointer to any file_operations struct and does not
assume that the file_operations struct we are interested in is file->f_op.

This function is __compat_vma_mmap_prepare() and we invoke it from
compat_vma_mmap_prepare() so that we share code between the two functions.

This is important, because some drivers provide hooks in a separate
struct, for instance struct drm_device provides an fops field for this
purpose.

Also update the VMA selftests accordingly.

Link: https://lkml.kernel.org/r/dd0c72df8a33e8ffaa243eeb9b01010b670610e9.1756920635.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-22 20:17:11 -07:00
Lorenzo Stoakes
af6703838e mm: specify separate file and vm_file params in vm_area_desc
Patch series "mm: do not assume file == vma->vm_file in
compat_vma_mmap_prepare()", v2.

As part of the efforts to eliminate the problematic f_op->mmap callback, a
new callback - f_op->mmap_prepare was provided.

While we are converting these callbacks, we must deal with 'stacked'
filesystems and drivers - those which in their own f_op->mmap callback
invoke an inner f_op->mmap callback.

To accommodate this, a compatibility layer is provided that, via
vfs_mmap(), detects if f_op->mmap_prepare is provided and if so, generates
a vm_area_desc containing the VMA's metadata and invokes the call.

So far, we have provided desc->file equal to vma->vm_file.  However this
is not necessarily valid, especially in the case of stacked drivers which
wish to assign a new file after the inner hook is invoked.

To account for this, we adjust vm_area_desc to have both file and vm_file
fields.  The .vm_file field is strictly set to vma->vm_file (or in the
case of a new mapping, what will become vma->vm_file).

However, .file is set to whichever file vfs_mmap() is invoked with when
using the compatibility layer.

Therefore, if the VMA's file needs to be updated in .mmap_prepare,
desc->vm_file should be assigned, whilst desc->file should be read.
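
The read/assign split can be illustrated with a reduced userspace model.
Everything here is hypothetical: struct file_model and struct desc_model are
stand-ins for struct file and vm_area_desc, and
model_stacked_mmap_prepare() is an invented hook, not an existing driver
callback.

```c
#include <assert.h>
#include <stddef.h>

struct file_model {
	const char *name;
};

struct desc_model {
	struct file_model *const file;	/* read-only: file the hook operates on */
	struct file_model *vm_file;	/* assignable: what becomes vma->vm_file */
};

/* A stacked hook that wants the VMA to be backed by another file. */
int model_stacked_mmap_prepare(struct desc_model *desc,
			       struct file_model *backing)
{
	assert(desc->file != NULL);

	/*
	 * desc->file = backing; would not compile, because the field is
	 * const. The hook reads desc->file and assigns desc->vm_file.
	 */
	desc->vm_file = backing;
	return 0;
}
```

Making the file field const turns an accidental assignment into a compile
error, which matches the const-ification mentioned in the cover letter.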

No current f_op->mmap_prepare users assign desc->file so this is safe to
do.

This makes the .mmap_prepare callback in the context of a stacked
filesystem or driver completely consistent with the existing .mmap
implementations.

While we're here, we do a few small cleanups, and ensure that we const-ify
things correctly in the vm_area_desc struct to avoid hooks accidentally
trying to assign fields they should not.


This patch (of 2):

Stacked filesystems and drivers may invoke mmap hooks with a struct file
pointer that differs from the overlying file.  We will make this
functionality possible in a subsequent patch.

In order to prepare for this, let's update vm_area_desc to separately
provide desc->file and desc->vm_file parameters.

The desc->file parameter is the file that the hook is expected to operate
upon, and is not assignable (though the hook may wish to, e.g., update the
file's accessed time).

The desc->vm_file defaults to what will become vma->vm_file and is what
the hook must reassign should it wish to change the VMA's vma->vm_file.

For now we keep desc->file, vm_file the same to remain consistent.

No f_op->mmap_prepare() callback sets a new vma->vm_file currently, so
this is safe to change.

While we're here, make the mm_struct that desc->mm points at immutable,
as well as the desc->mm field itself.

As part of this change, also update the single hook which this would
otherwise break - mlock_future_ok(), invoked by secretmem_mmap_prepare().

We additionally update set_vma_from_desc() to compare fields in a more
logical fashion, checking the (possibly) user-modified fields as the first
operand against the existing value as the second one.

Additionally, update VMA tests to accommodate changes.

Link: https://lkml.kernel.org/r/cover.1756920635.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/3fa15a861bb7419f033d22970598aa61850ea267.1756920635.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-22 20:17:11 -07:00
Dev Jain
473b73222f mm: drop all references of writable and SCAN_PAGE_RO
Now that all actionable outcomes from checking pte_write() are gone, drop
the related references.

Link: https://lkml.kernel.org/r/20250908075028.38431-3-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:40 -07:00
Dev Jain
62b98015d9 mm: enable khugepaged anonymous collapse on non-writable regions
Patch series "Expand scope of khugepaged anonymous collapse", v2.

Currently khugepaged does not collapse an anonymous region which does not
have a single writable pte.  This is wasteful since a region mapped with
non-writable ptes, for example, non-writable VMAs mapped by the
application, won't benefit from THP collapse.

An additional consequence of this constraint is that MADV_COLLAPSE does
not perform a collapse on a non-writable VMA, and this restriction is
nowhere to be found on the manpage - the restriction itself sounds wrong
to me since the user knows the protection of the memory it has mapped, so
collapsing read-only memory via madvise() should be a choice of the user
which shouldn't be overridden by the kernel.

Therefore, remove this constraint.

On an arm64 bare metal machine, comparing with vanilla 6.17-rc2, an
average of 5% improvement is seen on some mmtests benchmarks, particularly
hackbench, with a maximum improvement of 12%.  In the following table, (I)
denotes statistically significant improvement, (R) denotes statistically
significant regression.

+-------------------------+--------------------------------+---------------+
| mmtests/hackbench       | process-pipes-1 (seconds)      |        -0.06% |
|                         | process-pipes-4 (seconds)      |        -0.27% |
|                         | process-pipes-7 (seconds)      |   (I) -12.13% |
|                         | process-pipes-12 (seconds)     |    (I) -5.32% |
|                         | process-pipes-21 (seconds)     |    (I) -2.87% |
|                         | process-pipes-30 (seconds)     |    (I) -3.39% |
|                         | process-pipes-48 (seconds)     |    (I) -5.65% |
|                         | process-pipes-79 (seconds)     |    (I) -6.74% |
|                         | process-pipes-110 (seconds)    |    (I) -6.26% |
|                         | process-pipes-141 (seconds)    |    (I) -4.99% |
|                         | process-pipes-172 (seconds)    |    (I) -4.45% |
|                         | process-pipes-203 (seconds)    |    (I) -3.65% |
|                         | process-pipes-234 (seconds)    |    (I) -3.45% |
|                         | process-pipes-256 (seconds)    |    (I) -3.47% |
|                         | process-sockets-1 (seconds)    |         2.13% |
|                         | process-sockets-4 (seconds)    |         1.02% |
|                         | process-sockets-7 (seconds)    |        -0.26% |
|                         | process-sockets-12 (seconds)   |        -1.24% |
|                         | process-sockets-21 (seconds)   |         0.01% |
|                         | process-sockets-30 (seconds)   |        -0.15% |
|                         | process-sockets-48 (seconds)   |         0.15% |
|                         | process-sockets-79 (seconds)   |         1.45% |
|                         | process-sockets-110 (seconds)  |        -1.64% |
|                         | process-sockets-141 (seconds)  |    (I) -4.27% |
|                         | process-sockets-172 (seconds)  |         0.30% |
|                         | process-sockets-203 (seconds)  |        -1.71% |
|                         | process-sockets-234 (seconds)  |        -1.94% |
|                         | process-sockets-256 (seconds)  |        -0.71% |
|                         | thread-pipes-1 (seconds)       |         0.66% |
|                         | thread-pipes-4 (seconds)       |         1.66% |
|                         | thread-pipes-7 (seconds)       |        -0.17% |
|                         | thread-pipes-12 (seconds)      |    (I) -4.12% |
|                         | thread-pipes-21 (seconds)      |    (I) -2.13% |
|                         | thread-pipes-30 (seconds)      |    (I) -3.78% |
|                         | thread-pipes-48 (seconds)      |    (I) -5.77% |
|                         | thread-pipes-79 (seconds)      |    (I) -5.31% |
|                         | thread-pipes-110 (seconds)     |    (I) -6.12% |
|                         | thread-pipes-141 (seconds)     |    (I) -4.00% |
|                         | thread-pipes-172 (seconds)     |    (I) -3.01% |
|                         | thread-pipes-203 (seconds)     |    (I) -2.62% |
|                         | thread-pipes-234 (seconds)     |    (I) -2.00% |
|                         | thread-pipes-256 (seconds)     |    (I) -2.30% |
|                         | thread-sockets-1 (seconds)     |     (R) 2.39% |
+-------------------------+--------------------------------+---------------+

+-------------------------+--------------------------------+---------------+
| mmtests/sysbench-mutex  | sysbenchmutex-1 (usec)         |        -0.02% |
|                         | sysbenchmutex-4 (usec)         |        -0.02% |
|                         | sysbenchmutex-7 (usec)         |         0.00% |
|                         | sysbenchmutex-12 (usec)        |         0.12% |
|                         | sysbenchmutex-21 (usec)        |        -0.40% |
|                         | sysbenchmutex-30 (usec)        |         0.08% |
|                         | sysbenchmutex-48 (usec)        |         2.59% |
|                         | sysbenchmutex-79 (usec)        |        -0.80% |
|                         | sysbenchmutex-110 (usec)       |        -3.87% |
|                         | sysbenchmutex-128 (usec)       |    (I) -4.46% |
+-------------------------+--------------------------------+---------------+


This patch (of 2):

Currently khugepaged does not collapse an anonymous region which does not
have a single writable pte.  This is wasteful since a region mapped with
non-writable ptes, for example, non-writable VMAs mapped by the
application, won't benefit from THP collapse.

An additional consequence of this constraint is that MADV_COLLAPSE does
not perform a collapse on a non-writable VMA, and this restriction is
nowhere to be found on the manpage - the restriction itself sounds wrong
to me since the user knows the protection of the memory it has mapped, so
collapsing read-only memory via madvise() should be a choice of the user
which shouldn't be overridden by the kernel.

Therefore, remove this restriction by not honouring SCAN_PAGE_RO.

Link: https://lkml.kernel.org/r/20250908075028.38431-1-dev.jain@arm.com
Link: https://lkml.kernel.org/r/20250908075028.38431-2-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:39 -07:00
SeongJae Park
a983a26d52 mm/damon/stat: expose negative idle time
DAMON_STAT calculates the idle time of a region using the region's age if
the region's nr_accesses is zero.  If the nr_accesses value is non-zero
(positive), the idle time of the region becomes zero.

This means the users cannot know how warm and hot data is distributed,
using DAMON_STAT's memory_idle_ms_percentiles output.  The other stat,
namely estimated_memory_bandwidth, can help in understanding the overall
access temperature of the system, but it is still very rough
information.  On production systems, actually, a significant portion of
the system memory is observed with zero idle time, and we cannot break it
down based on its internal hotness distribution.

Define the idle time of the region using its age, similar to those having
zero nr_accesses, but multiplied by '-1' to distinguish it.  And expose that
using the same parameter interface, memory_idle_ms_percentiles.
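
The sign convention can be sketched as below.  This is a minimal userspace
illustration, not the DAMON_STAT code: the helper name region_idle_ms() and
its parameter types are assumptions made for this example, and the
age-times-interval formula is taken from the aggr_interval_us patch of the
same series.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): signed idle time of one
 * region in milliseconds.  'age' counts aggregation intervals and
 * 'aggr_interval_us' is the tuned aggregation interval in microseconds.
 */
static int64_t region_idle_ms(unsigned int nr_accesses, unsigned int age,
			      uint64_t aggr_interval_us)
{
	int64_t age_ms = (int64_t)((uint64_t)age * aggr_interval_us / 1000);

	/* Accessed regions report their age as a negative idle time. */
	return nr_accesses ? -age_ms : age_ms;
}
```

With this convention a reader of the percentiles can tell "idle for 1000 ms"
(positive) apart from "continuously accessed for 1000 ms" (negative).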

Link: https://lkml.kernel.org/r/20250916183127.65708-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:39 -07:00
SeongJae Park
cc7ceb1d14 mm/damon/stat: expose the current tuned aggregation interval
Patch series "mm/damon/stat: expose auto-tuned intervals and non-idle
ages".

DAMON_STAT intentionally provides limited information so that it is easy
to consume.  From production fleet level usage, the limitations below
were found, though.

The aggregation interval of DAMON_STAT represents the granularity of the
memory_idle_ms_percentiles.  But the interval is auto-tuned and not
exposed to users, so users cannot know the granularity.

All memory regions of non-zero (positive) nr_accesses are treated as
having zero idle time.  A significant portion of production systems have
such zero idle time.  Hence breakdown of warm and hot data is nearly
impossible.

Make the following changes to overcome the limitations.  Expose the
auto-tuned aggregation interval with a new parameter named
aggr_interval_us.  Expose the age of regions with non-zero nr_accesses (how
long the region has retained a >0 access frequency) as a negative idle time.


This patch (of 2):

DAMON_STAT calculates the idle time for a region as the region's age
multiplied by the aggregation interval.  That is, the aggregation interval
is the granularity of the idle time.  Since the aggregation interval is
auto-tuned and not exposed to users, however, users cannot easily know in
what granularity the stat is made.  Expose the tuned aggregation interval
in microseconds via a new parameter, aggr_interval_us.

Link: https://lkml.kernel.org/r/20250916183127.65708-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250916183127.65708-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:39 -07:00
SeongJae Park
c4a8e662c8 mm/damon/lru_sort: use damon_initialized()
DAMON_LRU_SORT assumes DAMON is ready to use at module_init time, and uses
its own hack to check whether that is the case.  Use damon_initialized()
instead of the hack; it is a more reliable and more maintainable way to see
if DAMON is ready to be used.

Link: https://lkml.kernel.org/r/20250916033511.116366-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:38 -07:00
SeongJae Park
3f7a914ab9 mm/damon/reclaim: use damon_initialized()
DAMON_RECLAIM assumes DAMON is ready to use at module_init time, and uses
its own hack to check whether that is the case.  Use damon_initialized()
instead of the hack; it is a more reliable and more maintainable way to see
if DAMON is ready to be used.

Link: https://lkml.kernel.org/r/20250916033511.116366-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:38 -07:00
SeongJae Park
b663f17b73 mm/damon/stat: use damon_initialized()
DAMON_STAT assumes DAMON is ready to use at module_init time, and uses its
own hack to check whether that is the case.  Use damon_initialized()
instead of the hack; it is a more reliable and more maintainable way to see
if DAMON is ready to be used.

Link: https://lkml.kernel.org/r/20250916033511.116366-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:37 -07:00
SeongJae Park
ab152db3ca mm/damon/core: implement damon_initialized() function
Patch series "mm/damon: define and use DAMON initialization check
function".

DAMON is initialized in subsystem initialization time, by damon_init().
If DAMON API functions are called before the initialization, the
system could crash.  Actually such issues happened and were fixed [1]
in the past.  For the fix, DAMON API callers have been updated to check
if DAMON is initialized or not, using their own hacks.  The hacks are
unnecessarily duplicated in every DAMON API caller and therefore it
would be difficult to reliably maintain them in the long term.

Make it reliable and easy to maintain.  For this, implement a new DAMON
core layer API function that returns if DAMON is successfully
initialized.  If it returns true, it means DAMON API functions are safe
to be used.  After the introduction of the new API, update DAMON API
callers to use the new function instead of their own hacks.


This patch (of 7):

If DAMON is used before it is successfully initialized, the caller could
crash.  DAMON core layer is not providing a reliable way to see if it is
successfully initialized and therefore ready to be used, though.  As a
result, DAMON API callers are implementing their own hacks to see it.  The
hacks simply assume DAMON should be ready on module init time.  That is not
reliable as DAMON initialization can indeed fail if KMEM_CACHE() fails, and
difficult to maintain as those are duplicates.  Implement a core layer API
function for better reliability and maintainability, to replace the hacks
with followup commits.
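
The intended caller pattern can be sketched in userspace as below.  Only the
name damon_initialized() and its boolean meaning come from the description
above; the damon_ready flag, the stub body, example_module_init() and the
-ENOMEM error code are assumptions made for illustration and do not reflect
the actual kernel implementation.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for DAMON core state; in the kernel, damon_init() sets it up. */
static bool damon_ready;

/* Userspace stub mirroring the core API function described above. */
static bool damon_initialized(void)
{
	return damon_ready;
}

/* Hypothetical module init: refuse to start if DAMON is not usable. */
static int example_module_init(void)
{
	if (!damon_initialized())
		return -ENOMEM;	/* error code chosen for illustration only */

	/* ... DAMON API functions are safe to call from here on ... */
	return 0;
}
```

Each caller then shares one check instead of keeping its own copy of the
readiness hack.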

Link: https://lkml.kernel.org/r/20250916033511.116366-2-sj@kernel.org
Link: https://lore.kernel.org/20250909022238.2989-1-sj@kernel.org [1]
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:37 -07:00
SeongJae Park
ac93e87c66 mm/damon/core: set effective quota on first charge window
The effective quota of a scheme is initialized to zero, which means there
is no quota.  It is set based on user-specified time/quota/quota goals.
But that later setting is done only from the second charge window.  As a
result, a scheme having a user-specified quota can work as not having the
quota (unexpectedly fast) for the first charge window.  In practical and
common use cases the quota interval is not too long, and the scheme's
target access pattern is restrictive.  Hence the issue should be modest.
That said, it is apparently an unintended misbehavior.  Fix the problem by
setting esz on the first charge window.

Link: https://lkml.kernel.org/r/20250916032339.115817-3-sj@kernel.org
Fixes: 1cd2430300 ("mm/damon/schemes: implement time quota") # 5.16.x
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:36 -07:00
SeongJae Park
2a05df14b3 mm/damon/core: reset age if nr_accesses changes between non-zero and zero
Patch series "mm/damon: misc fixups and improvements for 6.18", v2.

Misc fixes and improvements for DAMON that are not critical and therefore
aim to be merged into Linux 6.18-rc1.

The first patch improves DAMON's age counting for nr_accesses zero to/from
non-zero changes.

The second patch fixes an initial DAMOS apply interval delay issue that is
not realistic but still could happen on an odd setup.

The third and the fourth patches update DAMON community meetup description
and DAMON user-space tool example command for DAMOS usage, respectively.

Finally, the fifth patch updates MAINTAINERS section name for DAMON to
just DAMON.


This patch (of 5):

DAMON resets the age of a region if its nr_accesses value has
significantly changed.  Specifically, the threshold is calculated as 20%
of largest nr_accesses of the current snapshot.  This means that regions
changing the nr_accesses from zero to small non-zero value or from a small
non-zero value to zero will keep the age.  Since many users treat zero
nr_accesses regions specially, this can be confusing.  Kernel code including
DAMOS' regions priority calculation and DAMON_STAT's idle time calculation
also treats zero nr_accesses regions specially.  Make it less confusing by
resetting the age when the nr_accesses changes between zero and a non-zero
value.
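
The resulting reset rule can be sketched as below.  This is a simplified
userspace illustration under stated assumptions, not the DAMON core code:
should_reset_age() and its parameters are hypothetical names, and the exact
comparison against the 20% threshold (strictly greater than) is assumed.

```c
#include <stdbool.h>

/*
 * Hypothetical predicate: should a region's age be reset after its
 * nr_accesses moved from 'last_nr_accesses' to 'nr_accesses'?
 * 'max_nr_accesses' is the largest nr_accesses of the current snapshot.
 */
static bool should_reset_age(unsigned int last_nr_accesses,
			     unsigned int nr_accesses,
			     unsigned int max_nr_accesses)
{
	unsigned int threshold = max_nr_accesses / 5;	/* 20% of the largest */
	unsigned int diff = last_nr_accesses > nr_accesses ?
		last_nr_accesses - nr_accesses :
		nr_accesses - last_nr_accesses;

	/* Existing rule: a significant change resets the age. */
	if (diff > threshold)
		return true;

	/* New rule: any change between zero and non-zero resets the age. */
	return !last_nr_accesses != !nr_accesses;
}
```

For example, with a snapshot maximum of 100, a change from 0 to 1 now resets
the age even though the difference is far below the threshold of 20.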

Link: https://lkml.kernel.org/r/20250916032339.115817-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250916032339.115817-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:36 -07:00
Suren Baghdasaryan
b9e2f58ffb alloc_tag: mark inaccurate allocation counters in /proc/allocinfo output
While rare, memory allocation profiling can contain inaccurate counters if
slab object extension vector allocation fails.  That allocation might
succeed later but prior to that, slab allocations that would have used
that object extension vector will not be accounted for.  To indicate
incorrect counters, an "accurate:no" marker is appended to the call site
line in the /proc/allocinfo output.  Bump up /proc/allocinfo version to
reflect
the change in the file format and update documentation.

Example output with invalid counters:
allocinfo - version: 2.0
           0        0 arch/x86/kernel/kdebugfs.c:105 func:create_setup_data_nodes
           0        0 arch/x86/kernel/alternative.c:2090 func:alternatives_smp_module_add
           0        0 arch/x86/kernel/alternative.c:127 func:__its_alloc accurate:no
           0        0 arch/x86/kernel/fpu/regset.c:160 func:xstateregs_set
           0        0 arch/x86/kernel/fpu/xstate.c:1590 func:fpstate_realloc
           0        0 arch/x86/kernel/cpu/aperfmperf.c:379 func:arch_enable_hybrid_capacity_scale
           0        0 arch/x86/kernel/cpu/amd_cache_disable.c:258 func:init_amd_l3_attrs
       49152       48 arch/x86/kernel/cpu/mce/core.c:2709 func:mce_device_create accurate:no
       32768        1 arch/x86/kernel/cpu/mce/genpool.c:132 func:mce_gen_pool_create
           0        0 arch/x86/kernel/cpu/mce/amd.c:1341 func:mce_threshold_create_device
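
A userspace consumer could detect the marker as sketched below.  The helper
name allocinfo_line_is_accurate() is an assumption for this example; it only
relies on the marker being appended to the call site line as shown in the
output above, and it is not part of any kernel or tool API.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: returns false if a /proc/allocinfo call site line
 * carries the "accurate:no" marker, true otherwise.
 */
static bool allocinfo_line_is_accurate(const char *line)
{
	/* The marker is a separate, space-delimited field on the line. */
	return strstr(line, " accurate:no") == NULL;
}
```

A parser written against the older file format would need to tolerate the
extra field, which is why the version is bumped.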

[surenb@google.com: document new "accurate:no" marker]
  Fixes: 39d117e04d15 ("alloc_tag: mark inaccurate allocation counters in /proc/allocinfo output")
[akpm@linux-foundation.org: simplification per Usama, reflow text]
[akpm@linux-foundation.org: add newline to prevent docs warning, per Randy]
Link: https://lkml.kernel.org/r/20250915230224.4115531-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: David Wang <00107082@163.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:36 -07:00
zhongjinji
5e1953dc71 mm/oom_kill: the OOM reaper traverses the VMA maple tree in reverse order
Although the oom_reaper is delayed and gives the oom victim a chance to
clean up its address space, this might take a while, especially for
processes with a large address space footprint.  In those cases oom_reaper
might start racing with the dying task and compete for shared resources -
e.g.  page table lock contention has been observed.

Reduce those races by reaping the oom victim from the other end of the
address space.

It is also a significant improvement for process_mrelease().  When a
process is killed, process_mrelease is used to reap the killed process and
often runs concurrently with the dying task.  The test data shows that
after applying the patch, lock contention is greatly reduced during the
procedure of reaping the killed process.

The test is conducted on arm64.  The following basic perf numbers show
that applying this patch significantly reduces pte spin lock contention.

Without the patch:
|--99.57%-- oom_reaper
|    |--73.58%-- unmap_page_range
|    |    |--8.67%-- [hit in function]
|    |    |--41.59%-- __pte_offset_map_lock
|    |    |--29.47%-- folio_remove_rmap_ptes
|    |    |--16.11%-- tlb_flush_mmu
|    |--19.94%-- tlb_finish_mmu
|    |--3.21%-- folio_remove_rmap_ptes

With the patch:
|--99.53%-- oom_reaper
|    |--55.77%-- unmap_page_range
|    |    |--20.49%-- [hit in function]
|    |    |--58.30%-- folio_remove_rmap_ptes
|    |    |--11.48%-- tlb_flush_mmu
|    |    |--3.33%-- folio_mark_accessed
|    |--32.21%-- tlb_finish_mmu
|    |--6.93%-- folio_remove_rmap_ptes
|    |--0.69%-- __pte_offset_map_lock

Detailed breakdowns for both scenarios are provided below.  The cumulative
time for oom_reaper plus exit_mmap(victim) in both cases is also
summarized, making the performance improvements clear.

+----------------------------------------------------------------+
| Category                      | Applying patch | Without patch |
+-------------------------------+----------------+---------------+
| Total running time            |    132.6       |    167.1      |
|   (exit_mmap + reaper work)   |  72.4 + 60.2   |  90.7 + 76.4  |
+-------------------------------+----------------+---------------+
| Time waiting for pte spinlock |     1.0        |    33.1       |
|   (exit_mmap + reaper work)   |   0.4 + 0.6    |  10.0 + 23.1  |
+-------------------------------+----------------+---------------+
| folio_remove_rmap_ptes time   |    42.0        |    41.3       |
|   (exit_mmap + reaper work)   |  18.4 + 23.6   |  22.4 + 18.9  |
+----------------------------------------------------------------+

From this report, we can see that:

1. The reduction in total time comes mainly from the decrease in time
   spent on pte spinlock and other locks.

2. oom_reaper performs more work in some areas, but at the same time,
   exit_mmap also handles certain tasks more efficiently, such as
   folio_remove_rmap_ptes.

Here is a more detailed perf report. [1]

Link: https://lkml.kernel.org/r/20250915162946.5515-3-zhongjinji@honor.com
Link: https://lore.kernel.org/all/20250915162619.5133-1-zhongjinji@honor.com/ [1]
Signed-off-by: zhongjinji <zhongjinji@honor.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:35 -07:00
zhongjinji
59d4d36158 mm/oom_kill: thaw the entire OOM victim process
Patch series "Improvements to Victim Process Thawing and OOM Reaper
Traversal Order", v10.

This patch series focuses on optimizing victim process thawing and
refining the traversal order of the OOM reaper.  Since __thaw_task() is
used to thaw a single thread of the victim, thawing only one thread cannot
guarantee the exit of the OOM victim when it is frozen.  Patch 1 thaws the
entire process of the OOM victim to ensure that OOM victims are able to
terminate themselves.  Even if the oom_reaper is delayed, patch 2 is still
beneficial for reaping processes with a large address space footprint, and
it also greatly improves process_mrelease.


This patch (of 2):

OOM killer is a mechanism that selects and kills processes when the system
runs out of memory to reclaim resources and keep the system stable.  But
the oom victim cannot terminate on its own when it is frozen, even if the
OOM victim task is thawed through __thaw_task().  This is because
__thaw_task() can only thaw a single OOM victim thread, and cannot thaw
the entire OOM victim process.

In addition, freezing_slow_path() determines whether a task is an OOM
victim by checking the task's TIF_MEMDIE flag.  When a task is identified
as an OOM victim, the freezer bypasses both PM freezing and cgroup
freezing states to thaw it.

Historically, TIF_MEMDIE was a "this is the oom victim & it has access to
memory reserves" flag.  It has thread vs.  process problems, and
tsk_is_oom_victim() was introduced later to get rid of them and other
issues, as well as to guarantee that we can identify the oom victim's mm
reliably for the oom_reaper.

Therefore, thaw_process() is introduced to unfreeze all threads within the
OOM victim process, ensuring that every thread is properly thawed.  The
freezer now uses tsk_is_oom_victim() to determine OOM victim status,
allowing all victim threads to be unfrozen as necessary.

With this change, the entire OOM victim process will be thawed when an OOM
event occurs, ensuring that the victim can terminate on its own.

Link: https://lkml.kernel.org/r/20250915162946.5515-1-zhongjinji@honor.com
Link: https://lkml.kernel.org/r/20250915162946.5515-2-zhongjinji@honor.com
Signed-off-by: zhongjinji <zhongjinji@honor.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:35 -07:00
SeongJae Park
e18190b7e9 mm/damon/lru_sort: use param_ctx for damon_attrs staging
damon_lru_sort_apply_parameters() allocates a new DAMON context, stages
user-specified DAMON parameters on it, and commits to running DAMON
context at once, using damon_commit_ctx().  The code is, however, directly
updating the monitoring attributes of the running context.  And the
attributes are over-written by later damon_commit_ctx() call.  This means
that the monitoring attributes parameters are not really working.  Fix the
wrong use of the parameter context.

Link: https://lkml.kernel.org/r/20250916031549.115326-1-sj@kernel.org
Fixes: a309694364 ("mm/damon/lru_sort: use damon_commit_ctx()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: <stable@vger.kernel.org>	[6.11+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:34 -07:00
Quanmin Yan
7db551fcfb mm/damon/reclaim: support addr_unit for DAMON_RECLAIM
Implement a sysfs file to expose addr_unit for DAMON_RECLAIM users. 
During parameter application, use the configured addr_unit parameter to
perform the necessary initialization.  Similar to the core layer, prevent
setting addr_unit to zero.

It is worth noting that when monitor_region_start and monitor_region_end
are unset (i.e., 0), their values will later be set to biggest_system_ram.
At that point, addr_unit may not be the default value 1.  Although we
could divide the biggest_system_ram value by addr_unit, changing addr_unit
without setting monitor_region_start/end should be considered a user
misoperation.  And since biggest_system_ram is always within the
0~ULONG_MAX range, the system can clearly work correctly with addr_unit=1.
Therefore, if monitor_region_start/end are unset, always silently reset
addr_unit to 1.
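
The parameter handling described above can be sketched as below.  This is a
userspace illustration only: struct monitor_params, normalize_addr_unit() and
the -EINVAL return value are assumptions for this example, and the order of
the two checks is not taken from the DAMON_RECLAIM code.

```c
#include <errno.h>

/* Hypothetical container for the module parameters named above. */
struct monitor_params {
	unsigned long region_start;	/* monitor_region_start, 0 if unset */
	unsigned long region_end;	/* monitor_region_end, 0 if unset */
	unsigned long addr_unit;
};

/* Hypothetical helper applying the two rules from the description. */
static int normalize_addr_unit(struct monitor_params *p)
{
	/* Similar to the core layer, a zero addr_unit is rejected. */
	if (!p->addr_unit)
		return -EINVAL;

	/* Unset region: biggest_system_ram will be used, so force unit 1. */
	if (!p->region_start && !p->region_end)
		p->addr_unit = 1;

	return 0;
}
```

The silent reset keeps the later biggest_system_ram fallback correct without
dividing its bounds by a user-supplied unit.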

Link: https://lkml.kernel.org/r/20250910113221.1065764-3-yanquanmin1@huawei.com
Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: ze zuo <zuoze1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:33 -07:00
Quanmin Yan
2e0fe9245d mm/damon/lru_sort: support addr_unit for DAMON_LRU_SORT
Patch series "mm/damon: add addr_unit for DAMON_LRU_SORT and
DAMON_RECLAIM".

In DAMON_LRU_SORT and DAMON_RECLAIM, damon_ctx is independent of the core.
Add addr_unit to these modules to support systems like ARM32 with LPAE.


This patch (of 2):

Implement a sysfs file to expose addr_unit for DAMON_LRU_SORT users. 
During parameter application, use the configured addr_unit parameter to
perform the necessary initialization.  Similar to the core layer, prevent
setting addr_unit to zero.

It is worth noting that when monitor_region_start and monitor_region_end
are unset (i.e., 0), their values will later be set to biggest_system_ram.
At that point, addr_unit may not be the default value 1.  Although we
could divide the biggest_system_ram value by addr_unit, changing addr_unit
without setting monitor_region_start/end should be considered a user
misoperation.  And since biggest_system_ram is always within the
0~ULONG_MAX range, the system can clearly work correctly with addr_unit=1.
Therefore, if monitor_region_start/end are unset, always silently reset
addr_unit to 1.

Link: https://lkml.kernel.org/r/20250910113221.1065764-1-yanquanmin1@huawei.com
Link: https://lkml.kernel.org/r/20250910113221.1065764-2-yanquanmin1@huawei.com
Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: ze zuo <zuoze1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:33 -07:00
Matthew Wilcox (Oracle)
9d003dec97 mm: remove page->order
We already use page->private for storing the order of a page while it's in
the buddy allocator system; extend that to also storing the order while
it's in the pcp_llist.

Link: https://lkml.kernel.org/r/20250910142923.2465470-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:32 -07:00
Matthew Wilcox (Oracle)
6fd893a40e mm: remove redundant test in validate_page_before_insert()
The page_has_type() call would have included slab since commit
46df8e73a4 and now we don't even get that far because slab pages have a
zero refcount since commit 9aec2fb0fd.

Link: https://lkml.kernel.org/r/20250910142923.2465470-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:32 -07:00
Hugh Dickins
b33939383b mm: lru_add_drain_all() do local lru_add_drain() first
No numbers to back this up, but it seemed obvious to me, that if there are
competing lru_add_drain_all()ers, the work will be minimized if each
flushes its own local queues before locking and doing cross-CPU drains.

Link: https://lkml.kernel.org/r/33389bf8-f79d-d4dd-b7a4-680c4aa21b23@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Keir Fraser <keirf@google.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shivank Garg <shivankg@amd.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Xu <weixugc@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: yangge <yangge1116@126.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:32 -07:00
Suren Baghdasaryan
7ffc923e25 alloc_tag: avoid warnings when freeing non-compound "tail" pages
When freeing "tail" pages of a non-compound high-order page, we properly
subtract the allocation tag counters, however later when these pages are
released, alloc_tag_sub() will issue warnings because tags for these pages
are NULL.

This issue was originally anticipated by Vlastimil in his review [1] and
then recently reported by David.  Prevent warnings by marking the tags
empty.

Link: https://lkml.kernel.org/r/20250915212756.3998938-4-surenb@google.com
Link: https://lore.kernel.org/all/6db0f0c8-81cb-4d04-9560-ba73d63db4b8@suse.cz/ [1]
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: David Wang <00107082@163.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:31 -07:00
Jackie Liu
5919f12821 mm/shmem: remove unused entry_order after large swapin rework
After commit 93c0476e70 ("mm/shmem, swap: rework swap entry and index
calculation for large swapin"), xas_get_order() will never return a
non-zero value for `entry_order` in shmem_split_large_entry().  As a
result, the local variable `entry_order` is effectively unused.

Clean up the code by removing `entry_order` and directly using
`cur_order`.  This change is purely a refactor and has no functional
impact.

No functional change intended.

Link: https://lkml.kernel.org/r/20250908062614.89880-1-liu.yun@linux.dev
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:30 -07:00
Lance Yang
6ce3bc990c mm: skip mlocked THPs that are underused early in deferred_split_scan()
When we stumble over a fully-mapped mlocked THP in the deferred shrinker,
it does not make sense to try to detect whether it is underused, because
try_to_map_unused_to_zeropage(), called while splitting the folio, will
not actually replace any zeroed pages by the shared zeropage.

Splitting the folio in that case does not make any sense, so let's not
even scan to check if the folio is underused.

Link: https://lkml.kernel.org/r/20250908090741.61519-1-lance.yang@linux.dev
Signed-off-by: Lance Yang <lance.yang@linux.dev>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kiryl Shutsemau <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:30 -07:00
Francois Dugast
10b9feee2d mm/hmm: populate PFNs from PMD swap entry
Once support for THP migration of zone device pages is enabled, device
private swap entries will be found during the walk not only for PTEs but
also for PMDs.

Therefore, it is necessary to extend to PMDs the special handling which is
already in place for PTEs when device private pages are owned by the
caller: instead of faulting or skipping the range, the correct behavior is
to use the swap entry to populate HMM PFNs.

This change is a prerequisite to make use of device-private THP in drivers
using drivers/gpu/drm/drm_pagemap, such as xe.

Even though subsequent PFNs can be inferred when handling large order
PFNs, the PFN list is still fully populated because this is currently
expected by HMM users.  In case this changes in the future, that is, if
all HMM users support a sparsely populated PFN list, the for() loop can be
made to skip remaining PFNs for the current order.  A quick test shows the
loop takes about 10 ns, roughly 20 times faster than without this
optimization.

Link: https://lkml.kernel.org/r/20250908091052.612303-1-francois.dugast@intel.com
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:29 -07:00
David Hildenbrand
7cad96ae59 mm/gup: fix handling of errors from arch_make_folio_accessible() in follow_page_pte()
In case we call arch_make_folio_accessible() and it fails, we would
incorrectly return a value that is "!= 0" to the caller, indicating that
we pinned all requested pages and that the caller can keep going.

follow_page_pte() is not supposed to return error values, but instead "0"
on failure and "1" on success -- we'll clean that up separately.

In case we return "!= 0", the caller will just keep going pinning more
pages.  If we happen to pin a page afterwards, we're in trouble, because
we essentially skipped some pages in the requested range.

Staring at the arch_make_folio_accessible() implementation on s390x, I
assume it should actually never really fail unless something unexpected
happens (BUG?).  So let's not CC stable and just fix common code to do the
right thing.

Clean up the code a bit now that there is no reason to store the return
value of arch_make_folio_accessible().

Link: https://lkml.kernel.org/r/20250908094517.303409-1-david@redhat.com
Fixes: f28d43636d ("mm/gup/writeback: add callbacks for inaccessible pages")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:29 -07:00
Chanwon Park
e7a5f249e6 mm: re-enable kswapd when memory pressure subsides or demotion is toggled
If kswapd fails to reclaim pages from a node MAX_RECLAIM_RETRIES times in
a row, kswapd on that node gets disabled. That is, the system won't wake
up kswapd for that node until page reclamation is observed at least once.
That reclamation is mostly done by direct reclaim, which in turn
re-enables kswapd.

However, on systems with CXL memory nodes, workloads with high anon page
usage can disable kswapd indefinitely, without triggering direct
reclaim. This can be reproduced with the following steps:

   numa node 0   (32GB memory, 48 CPUs)
   numa node 2~5 (512GB CXL memory, 128GB each)
   (numa node 1 is disabled)
   swap space 8GB

   1) Set /sys/kernel/mm/demotion_enabled to 0.
   2) Set /proc/sys/kernel/numa_balancing to 0.
   3) Run a process that allocates and randomly accesses 500GB of anon
      pages.
   4) Let the process exit normally.

During 3), free memory on node 0 drops below the low watermark, and
kswapd runs and depletes swap space. Then, kswapd fails consecutively
and gets disabled. Subsequent allocations happen on CXL memory, so node
0 never sees enough memory pressure to trigger direct reclaim.

After 4), kswapd on node 0 remains disabled, and tasks running on that
node are unable to swap. If you turn on NUMA_BALANCING_MEMORY_TIERING
and demotion now, it won't work properly since kswapd is disabled.

To mitigate this problem, reset kswapd_failures to 0 under the following
conditions:

   a) The ZONE_BELOW_HIGH bit of a zone in a hopeless node with a
      fallback memory node gets cleared.
   b) demotion_enabled is changed from false to true.

Rationale for a):
   ZONE_BELOW_HIGH bit being cleared might be a sign that the node may
   be reclaimable afterwards. This won't help much if the memory-hungry
   process keeps running without freeing anything, but at least the node
   will go back to reclaimable state when the process exits.

Rationale for b):
   When demotion_enabled is false, kswapd can only reclaim anon pages by
   swapping them out to swap space. If demotion_enabled is turned on,
   kswapd can demote anon pages to another node for reclaiming. So, the
   original failure count for determining reclaimability is no longer
   valid.

Since a reset of kswapd_failures may be lost to a concurrent ++
operation, it is changed from int to atomic_t.
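
A minimal user-space sketch of why a plain ++ can lose a reset.  It uses
C11 <stdatomic.h> rather than the kernel's atomic_t, the interleaving is
simulated in a single thread for determinism, and the function names are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Non-atomic "failures++" split into its load and store halves. */
static int racy_increment_with_reset(int failures)
{
	int snapshot;

	assert(failures >= 0);
	snapshot = failures;		/* kswapd loads the old count */
	failures = 0;			/* another context resets it here */
	failures = snapshot + 1;	/* kswapd stores back: reset is lost */
	return failures;
}

/*
 * With an atomic counter the increment is indivisible, so the reset
 * lands either before or after it, never in the middle.
 */
static int atomic_increment_with_reset(int start, bool reset_first)
{
	atomic_int failures;

	assert(start >= 0);
	atomic_init(&failures, start);
	if (reset_first) {
		atomic_store(&failures, 0);
		atomic_fetch_add(&failures, 1);
	} else {
		atomic_fetch_add(&failures, 1);
		atomic_store(&failures, 0);
	}
	return atomic_load(&failures);
}
```

Starting from an arbitrary count of 16, the racy version ends at 17 as if
the reset never happened, while the atomic version ends at 0 or 1
depending on ordering.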

[akpm@linux-foundation.org: tweak whitespace]
Link: https://lkml.kernel.org/r/aL6qGi69jWXfPc4D@pcw-MS-7D22
Signed-off-by: Chanwon Park <flyinrm@gmail.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:29 -07:00
Jan Kara
6028372689 readahead: add trace points
Add a couple of trace points to make debugging readahead logic easier.

[jack@suse.cz: v2]
  Link: https://lkml.kernel.org/r/20250909145849.5090-2-jack@suse.cz
Link: https://lkml.kernel.org/r/20250908145533.31528-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Tested-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:28 -07:00
Stanislav Fort
72797d218b mm/memcg: v1: account event registrations and drop world-writable cgroup.event_control
In cgroup v1, the legacy cgroup.event_control file is world-writable and
allows unprivileged users to register unbounded events and thresholds. 
Each registration allocates kernel memory without capping or memcg
charging, which can be abused to exhaust kernel memory in affected
configurations.

Make the following minimal changes:
- Account allocations with __GFP_ACCOUNT in event and threshold registration.
- Remove CFTYPE_WORLD_WRITABLE from cgroup.event_control to make it
  owner-writable.

This does not affect cgroup v2.  Allocations are still subject to kmem
accounting being enabled, but this reduces unbounded global growth.

Link: https://lkml.kernel.org/r/20250905093851.80596-1-disclosure@aisle.com
Signed-off-by: Stanislav Fort <disclosure@aisle.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:26 -07:00
Kairui Song
f83938e418 mm, swap: use a single page for swap table when the size fits
We have a cluster size of 512 slots.  Each slot consumes 8 bytes in swap
table so the swap table size of each cluster is exactly one page (4K).

If that condition is true, allocate one page directly and disable the slab
cache to reduce the memory usage of swap table and avoid fragmentation.
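
The size condition can be checked with simple arithmetic.  The sketch
below is illustrative only: the macro and function names are hypothetical
and the values are taken from the text above, not from kernel headers.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values from the commit message; not kernel definitions. */
#define CLUSTER_SLOTS	512u	/* slots per swap cluster */
#define ENTRY_BYTES	8u	/* bytes per swap table entry */
#define PAGE_BYTES	4096u	/* 4K page */

/* Size in bytes of one cluster's swap table. */
static size_t swap_table_bytes(void)
{
	return (size_t)CLUSTER_SLOTS * ENTRY_BYTES;
}

/* True when a table fills exactly one page, so a bare page can back it. */
static int table_fits_one_page(size_t page_bytes)
{
	return swap_table_bytes() == page_bytes;
}

/* 512 slots * 8 bytes = 4096 bytes = one 4K page. */
static_assert(CLUSTER_SLOTS * ENTRY_BYTES == PAGE_BYTES,
	      "one swap table per 4K page");
```

With a larger page size (for example 64K) the condition does not hold,
and under this reading the slab cache would remain in use.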

Link: https://lkml.kernel.org/r/20250916160100.31545-16-ryncsn@gmail.com
Co-developed-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Suggested-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:25 -07:00
Kairui Song
07adc4cf1e mm, swap: implement dynamic allocation of swap table
Now that the swap table is cluster based, a free cluster can free its
table, since no one should modify it.

There could be speculative readers, such as swap cache lookups; protect
them by making the table RCU protected.  Every swap table must be filled
with null entries before it is freed, so such readers will see either a
NULL pointer or a null-filled table that is being lazily freed.

On allocation, allocate the table when a cluster is used by any order.

This way, we can reduce the memory usage of large swap devices
significantly.

This idea to dynamically release unused swap cluster data was initially
suggested by Chris Li while proposing the cluster swap allocator and it
suits the swap table idea very well.

Link: https://lkml.kernel.org/r/20250916160100.31545-15-ryncsn@gmail.com
Co-developed-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:25 -07:00
Kairui Song
685a17fbd3 mm, swap: remove contention workaround for swap cache
Swap cluster setup will try to shuffle the clusters on initialization.  It
was helpful to avoid contention for the swap cache space.  The cluster
size (2M) was much smaller than each swap cache space (64M), so shuffling
the cluster means the allocator will try to allocate swap slots that are
in different swap cache spaces for each CPU, reducing the chance of two
CPUs using the same swap cache space, and hence reducing the contention.
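
The relationship between clusters and the old swap cache spaces can be
sketched as below.  This assumes 4K swap slots, so that a 2M cluster holds
512 slots and a 64M space holds 32 clusters; the macro and function names
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SLOT_BYTES	4096ul		/* assumed: one slot per 4K page */
#define CLUSTER_BYTES	(2ul << 20)	/* 2M cluster */
#define SPACE_BYTES	(64ul << 20)	/* 64M swap cache space */

static_assert(SPACE_BYTES % CLUSTER_BYTES == 0,
	      "a space holds a whole number of clusters");

/* Index of the cluster that contains a given slot. */
static size_t cluster_of(size_t slot)
{
	return slot / (CLUSTER_BYTES / SLOT_BYTES);
}

/* Index of the 64M swap cache space that contains a given slot. */
static size_t space_of(size_t slot)
{
	return slot / (SPACE_BYTES / SLOT_BYTES);
}

/* First slot of a given cluster. */
static size_t first_slot(size_t cluster)
{
	return cluster * (CLUSTER_BYTES / SLOT_BYTES);
}
```

Clusters 0 through 31 all fall into space 0, so CPUs allocating from
neighbouring clusters would have shared one swap cache space; shuffling
spread them across spaces.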

Now that the swap cache is managed by swap clusters, this shuffle is
pointless.  Just remove it, and clean up the related macros.

This also improves the HDD swap performance as shuffling IO is a bad idea
for HDD, and now the shuffling is gone.  Tests have shown a ~40%
performance gain for HDD [1]:

Doing sequential swap-in of 8G data using 8 processes with usemem, average
of 3 test runs:

Before: 1270.91 KB/s per process
After:  1849.54 KB/s per process

Link: https://lore.kernel.org/linux-mm/CAMgjq7AdauQ8=X0zeih2r21QoV=-WWj1hyBxLWRzq74n-C=-Ng@mail.gmail.com/ [1]
Link: https://lkml.kernel.org/r/20250916160100.31545-14-ryncsn@gmail.com
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202504241621.f27743ec-lkp@intel.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:25 -07:00
Kairui Song
8b47299a41 mm, swap: mark swap address space ro and add context debug check
Swap cache is now backed by swap table, and the address space is not
holding any mutable data anymore.  And swap cache is now protected by the
swap cluster lock, instead of the XArray lock.  All accesses to the swap
cache are wrapped by swap cache helpers.  Locking is mostly handled
internally by swap cache helpers; only a few __swap_cache_* helpers
require the caller to lock the cluster themselves.

Worth noting that, unlike XArray, the cluster lock is not IRQ safe.  The
swap cache was very different compared to filemap, and now it's completely
separated from filemap.  Nothing wants to mark or change anything or do a
writeback callback in IRQ.

So explicitly document this and add a debug check to avoid further
potential misuse.  And mark the swap cache space as read-only to avoid any
user wrongly mixing unexpected filemap helpers with swap cache.

Link: https://lkml.kernel.org/r/20250916160100.31545-13-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:25 -07:00
Kairui Song
8578e0c00d mm, swap: use the swap table for the swap cache and switch API
Introduce basic swap table infrastructure, which for now is just a
fixed-size flat array inside each swap cluster, with access wrappers.

Each cluster contains a swap table of 512 entries.  Each table entry is an
opaque atomic long.  It can be one of 3 types: a shadow type (XA_VALUE), a
folio type (pointer), or NULL.
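
One way to tell the three entry types apart is by the low bit, in the
style of XArray value entries.  This is a user-space sketch under that
assumption, not the kernel's swap table code; every name in it is
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum entry_type { ENTRY_NULL, ENTRY_SHADOW, ENTRY_FOLIO };

/* Tag a shadow value by shifting it left and setting the low bit. */
static uintptr_t mk_shadow(uintptr_t v)
{
	return (v << 1) | 1;
}

/* Recover the value stored in a shadow entry. */
static uintptr_t shadow_value(uintptr_t entry)
{
	return entry >> 1;
}

/* Store a folio pointer as-is; aligned pointers have a clear low bit. */
static uintptr_t mk_folio(const void *folio)
{
	uintptr_t entry = (uintptr_t)folio;

	assert((entry & 1) == 0);
	return entry;
}

/* Classify an opaque entry as NULL, shadow, or folio pointer. */
static enum entry_type classify(uintptr_t entry)
{
	if (!entry)
		return ENTRY_NULL;
	if (entry & 1)
		return ENTRY_SHADOW;
	return ENTRY_FOLIO;
}
```

The encoding keeps each entry in a single machine word, which is what
allows it to be read and written as one atomic long.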

In this first step, it only supports storing a folio or shadow, and it is
a drop-in replacement for the current swap cache.  Convert all swap cache
users to use the new sets of APIs.  Chris Li has been suggesting using a
new infrastructure for swap cache for better performance, and that idea
combined well with the swap table as the new backing structure.  Now the
lock contention range is reduced to 2M clusters, which is much smaller
than the 64M address_space.  And we can also drop the multiple
address_space design.

All the internal work is done with swap_cache_get_* helpers.  Swap cache
lookup is still lock-less like before, and the helpers' contexts are the
same as those of the original swap cache helpers.  They still require a
pin on the swap device to prevent the backing data from being freed.

Swap cache updates are now protected by the swap cluster lock instead of
the XArray lock.  This is mostly handled internally, but new
__swap_cache_* helpers require the caller to lock the cluster.  So, a few
new cluster access and locking helpers are also introduced.

A fully cluster-based unified swap table can be implemented on top of this
to take care of all count tracking and synchronization work, with dynamic
allocation.  It should reduce the memory usage while making the
performance even better.

Link: https://lkml.kernel.org/r/20250916160100.31545-12-ryncsn@gmail.com
Co-developed-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Suggested-by: Chris Li <chrisl@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:24 -07:00
Kairui Song
094dc8b059 mm, swap: wrap swap cache replacement with a helper
There are currently three swap cache users that are trying to replace an
existing folio with a new one: huge memory splitting, migration, and shmem
replacement.  What they are doing is quite similar.

Introduce a common helper for this.  In later commits, this can be easily
switched to use the swap table by updating this helper.

The newly added helper also makes the swap cache API better defined, and
makes debugging easier by adding a few more debug checks.

Migration and shmem replace are meant to clone the folio, including
content, swap entry value, and flags.  And splitting will adjust each sub
folio's swap entry according to order, which could be non-uniform in the
future.  So document it clearly that it's the caller's responsibility to
set up the new folio's swap entries and flags before calling the helper. 
The helper will just follow the new folio's entry value.

This also prepares for replacing high-order folios in the swap cache. 
Currently, only splitting to order 0 is allowed for swap cache folios. 
Using the new helper, we can handle high-order folio splitting better.

Link: https://lkml.kernel.org/r/20250916160100.31545-11-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Chris Li <chrisl@kernel.org>
Suggested-by: Chris Li <chrisl@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:24 -07:00
Kairui Song
84a7a9823e mm/shmem, swap: remove redundant error handling for replacing folio
Shmem may replace a folio in the swap cache if the cached one doesn't fit
the swapin's GFP zone.  When doing so, shmem has already double checked
that the swap cache folio is locked, still has the swap cache flag set,
and contains the wanted swap entry.  So it is impossible to fail due to an
XArray mismatch.  There is even a comment for that.

Delete the defensive error handling path, and add a WARN_ON instead: if
that happens, something has broken the basic principle of how the swap
cache works, and we should catch and fix it.

Link: https://lkml.kernel.org/r/20250916160100.31545-10-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:24 -07:00
Kairui Song
fd8d4f862f mm, swap: cleanup swap cache API and add kerneldoc
In preparation for replacing the swap cache backend with the swap table,
clean up and add proper kernel doc for all swap cache APIs.  Now all swap
cache APIs are well-defined with consistent names.

No feature change, only renaming and documenting.

Link: https://lkml.kernel.org/r/20250916160100.31545-9-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:23 -07:00