Suppose xas is pointing somewhere near the end of the multi-entry batch.
Then it may happen that the computed slot already falls beyond the batch,
thus breaking the loop due to !xa_is_sibling(), and computing the wrong
order.
For example, suppose we have a shift-6 node having an order-9 entry => 8 -
1 = 7 siblings, so assume the slots are at offset 0 till 7 in this node.
If xas->xa_offset is 6, then the code will compute order as 1 +
xas->xa_node->shift = 7. Therefore, the order computation must start from
the beginning of the multi-slot entries, that is, the non-sibling entry.
Thus ensure that the caller is aware of this by triggering a BUG when the
entry is a sibling entry. Note that this BUG_ON() is only active while
running selftests, so there is no overhead in a running kernel.
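The effect of the starting position can be reproduced with a small user-space model of the order loop. This is only a sketch: `get_order()`, the `SIBLING` marker and the list-based node are simplifications assumed for illustration, not the kernel's implementation.

```python
# Simplified user-space model of the order computation described above.
# SIBLING, XA_CHUNK_SIZE and get_order() are illustrative stand-ins.
XA_CHUNK_SIZE = 64
SIBLING = "sibling"


def get_order(slots, offset, shift):
    """Grow the order while the slot just past the current span is a sibling."""
    order = 0
    while True:
        slot = offset + (1 << order)
        if slot >= XA_CHUNK_SIZE or slots[slot] != SIBLING:
            break
        order += 1
    return order + shift


# Shift-6 node holding one order-9 entry: slot 0 is the head, 1..7 are siblings.
slots = ["entry"] + [SIBLING] * 7 + [None] * (XA_CHUNK_SIZE - 8)

order_from_head = get_order(slots, 0, 6)     # starts at the non-sibling entry
order_from_sibling = get_order(slots, 6, 6)  # starts near the end of the batch
print(order_from_head, order_from_sibling)
```

Starting from the head slot yields the expected order 9, while starting from offset 6 yields 7, which is the miscalculation the new check guards against.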
Link: https://lkml.kernel.org/r/20250604041533.91198-1-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Certain madvise operations, especially MADV_DONTNEED, occur far more
frequently than other madvise options, particularly in native and Java
heaps for dynamic memory management.
Currently, the mmap_lock is always held during these operations, even when
unnecessary. This causes lock contention and can lead to severe priority
inversion, where low-priority threads, such as Android's
HeapTaskDaemon, hold the lock and block higher-priority threads.
This patch enables the use of per-VMA locks when the advised range lies
entirely within a single VMA, avoiding the need for full VMA traversal.
In practice, userspace heaps rarely issue MADV_DONTNEED across multiple
VMAs.
Tangquan's testing shows that over 99.5% of memory reclaimed by Android
benefits from this per-VMA lock optimization. After extended runtime,
217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
only 1,231 fell back to mmap_lock.
To simplify handling, the implementation falls back to the standard
mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
userfaultfd_remove().
Many thanks to Lorenzo's work[1] on "mm/madvise: support VMA read locks
for MADV_DONTNEED[_LOCKED]":
"Then use this mechanism to permit VMA locking to be done later in the
madvise() logic and also to allow altering of the locking mode to permit
falling back to an mmap read lock if required."
One important point, as pointed out by Jann[2], is that
untagged_addr_remote() requires holding mmap_lock. This is because
address tagging on x86 and RISC-V is quite complex.
Until untagged_addr_remote() becomes atomic—which seems unlikely in the
near future—we cannot support per-VMA locks for remote processes. So
for now, only local processes are supported.
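The conditions listed in this message can be summarised as a small decision function. The following is a hedged user-space sketch: `Vma`, `pick_lock()` and the address values are hypothetical names chosen for illustration and do not mirror the kernel's data structures.

```python
from dataclasses import dataclass


@dataclass
class Vma:
    start: int
    end: int  # exclusive
    has_userfaultfd: bool = False


def pick_lock(vmas, start, end, remote=False):
    """Return "vma" when the per-VMA path applies, else "mmap"."""
    if remote:
        # Remote processes need untagged_addr_remote(), hence mmap_lock.
        return "mmap"
    for vma in vmas:
        if vma.start <= start and end <= vma.end:
            # The advised range lies entirely within one VMA.
            return "mmap" if vma.has_userfaultfd else "vma"
    # The range spans several VMAs (or none): fall back.
    return "mmap"


heap = [Vma(0x1000, 0x9000), Vma(0x9000, 0xA000, has_userfaultfd=True)]
single = pick_lock(heap, 0x2000, 0x4000)
spanning = pick_lock(heap, 0x8000, 0x9800)
uffd = pick_lock(heap, 0x9000, 0xA000)
remote = pick_lock(heap, 0x2000, 0x4000, remote=True)
print(single, spanning, uffd, remote)
```

Only the first call takes the per-VMA path; the spanning, userfaultfd and remote cases all fall back to mmap_lock.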
Lance said:
: Just to put some numbers on it, I ran a micro-benchmark with 100
: parallel threads, where each thread calls madvise() on its own 1GiB
: chunk of 64KiB mTHP-backed memory. The performance gain is huge:
:
: 1) MADV_DONTNEED saw its average time drop from 0.0508s to 0.0270s
: (~47% faster)
:
: 2) MADV_FREE saw its average time drop from 0.3078s to 0.1095s (~64%
: faster)
[lorenzo.stoakes@oracle.com: avoid any chance of uninitialised pointer deref]
Link: https://lkml.kernel.org/r/309d22ca-6cd9-4601-8402-d441a07d9443@lucifer.local
Link: https://lore.kernel.org/all/0b96ce61-a52c-4036-b5b6-5c50783db51f@lucifer.local/ [1]
Link: https://lore.kernel.org/all/CAG48ez11zi-1jicHUZtLhyoNPGGVB+ROeAJCUw48bsjk4bbEkA@mail.gmail.com/ [2]
Link: https://lkml.kernel.org/r/20250607220150.2980-1-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Cc: Lance Yang <ioworker0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
BUG_ON() is deprecated [1]. Convert all the BUG_ON()s and VM_BUG_ON()s to
use VM_WARN_ON_ONCE().
There are a few additional cases that are converted or modified:
- Convert the printk(KERN_WARNING ...) in handle_userfault() to use
pr_warn().
- Convert the WARN_ON_ONCE()s in move_pages() to use VM_WARN_ON_ONCE(),
as the relevant conditions are already checked in validate_range() in
move_pages()'s caller.
- Convert the VM_WARN_ON()'s in move_pages() to VM_WARN_ON_ONCE(). These
cases should never happen and are similar to those in mfill_atomic()
and mfill_atomic_hugetlb(), which were previously BUG_ON()s.
move_pages() was added later than those functions and makes use of
VM_WARN_ON() as a replacement for the deprecated BUG_ON(), but
VM_WARN_ON_ONCE() is likely a better direct replacement.
- Convert the WARN_ON() for !VM_MAYWRITE in userfaultfd_unregister() and
userfaultfd_register_range() to VM_WARN_ON_ONCE(). This condition is
enforced in userfaultfd_register() so it should never happen, and can
be converted to a debug check.
[1] https://www.kernel.org/doc/html/v6.15/process/coding-style.html#use-warn-rather-than-bug
Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-3-a7274d3bd5e4@columbia.edu
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, a VMA registered with a uffd can be unregistered through a
different uffd associated with the same mm_struct.
The existing behavior is slightly broken and may incorrectly reject
unregistering some VMAs due to the following check:
    if (!vma_can_userfault(cur, cur->vm_flags, wp_async))
        goto out_unlock;
where wp_async is derived from ctx, not from cur. For example, a
file-backed VMA registered with wp_async enabled and UFFD_WP mode cannot
be unregistered through a uffd that does not have wp_async enabled.
Rather than fix this and maintain this odd behavior, make unregistration
stricter by requiring VMAs to be unregistered through the same uffd they
were registered with. Additionally, reorder the BUG() checks to avoid the
aforementioned wp_async issue in them. Convert the existing check to
VM_WARN_ON_ONCE() as BUG_ON() is deprecated.
This change slightly modifies the ABI. It should not be backported to
-stable. It is expected that no one depends on this behavior, and no such
cases are known.
While at it, correct the comment for the no userfaultfd case. This seems
to be a copy-paste artifact from the analogous userfaultfd_register()
check.
Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-2-a7274d3bd5e4@columbia.edu
Fixes: 86039bd3b4 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The function register_memory_blocks_under_node() is now only called from
the memory hotplug path, as register_memory_blocks_under_node_early()
handles registration during early boot. Therefore, the context argument
used to differentiate between early boot and hotplug is no longer needed
and was removed.
Since the function is only called from the hotplug path, we renamed
register_memory_blocks_under_node() to
register_memory_blocks_under_node_hotplug().
Link: https://lkml.kernel.org/r/907c22292b0ee4975107876efc875c75c11badd9.1748452242.git.donettom@linux.ibm.com
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "drivers/base/node.c: optimization and cleanups", v7.
This patch (of 7)
During node device initialization, `memory blocks` are registered under
each NUMA node. The `memory blocks` to be registered are identified using
the node's start and end PFNs, which are obtained from the node's pg_data.
However, not all PFNs within this range necessarily belong to the same
node—some may belong to other nodes. Additionally, due to the
discontiguous nature of physical memory, certain sections within a `memory
block` may be absent.
As a result, `memory blocks` that fall between a node's start and end PFNs
may span across multiple nodes, and some sections within those blocks may
be missing. `Memory blocks` have a fixed size, which is architecture
dependent.
Due to these considerations, the memory block registration is currently
performed as follows:
    for_each_online_node(nid):
        start_pfn = pgdat->node_start_pfn;
        end_pfn = pgdat->node_start_pfn + node_spanned_pages;
        for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
            mem_blk = memory_block_id(pfn_to_section_nr(pfn));
            pfn_mb_start = section_nr_to_pfn(mem_blk->start_section_nr)
            pfn_mb_end = pfn_mb_start + memory_block_pfns - 1
            for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++):
                if (get_nid_for_pfn(pfn) != nid):
                    continue;
                else
                    do_register_memory_block_under_node(nid, mem_blk,
                                                        MEMINIT_EARLY);
Here, we derive the start and end PFNs from the node's pg_data, then
determine the memory blocks that may belong to the node. For each `memory
block` in this range, we inspect all PFNs it contains and check their
associated NUMA node ID. If a PFN within the block matches the current
node, the memory block is registered under that node.
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
a binary search in the `memblock regions` to determine the NUMA node ID
for a given PFN. If it is not enabled, the node ID is retrieved directly
from the struct page.
On large systems, this process can become time-consuming, especially since
we iterate over each `memory block` and all PFNs within it until a match
is found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the
additional overhead of the binary search increases the execution time
significantly, potentially leading to soft lockups during boot.
In this patch, we iterate over the `memblock regions` to identify the `memory
blocks` that belong to the current NUMA node. `memblock regions` are
contiguous memory ranges, each associated with a single NUMA node, and
they do not span across multiple nodes.
    for_each_memory_region(r): // r => region
        if (!node_online(r->nid)):
            continue;
        else
            for_each_memory_block_between(r->base, r->base + r->size - 1):
                do_register_memory_block_under_node(r->nid, mem_blk, MEMINIT_EARLY);
We iterate over all memblock regions, and if the node associated with the
region is online, we calculate the start and end memory blocks based on
the region's start and end PFNs. We then register all the memory blocks
within that range under the region node.
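The two registration schemes can be compared with a toy user-space model. This is a sketch under stated assumptions: the region layout, the 4-PFN block size, and the helper names `nid_for_pfn()`, `register_by_node_span()` and `register_by_region()` are all made up for illustration, and the node-online check is omitted.

```python
BLOCK_PFNS = 4  # PFNs per memory block (illustrative; real size is arch dependent)

# (start_pfn, end_pfn_exclusive, nid); each region belongs to exactly one node.
REGIONS = [(0, 8, 0), (8, 16, 1), (16, 24, 0), (24, 30, 1)]

lookups = 0


def nid_for_pfn(pfn):
    """Stand-in for get_nid_for_pfn(); counts how often it is called."""
    global lookups
    lookups += 1
    for start, end, nid in REGIONS:
        if start <= pfn < end:
            return nid
    return None


def register_by_node_span():
    """Old scheme: walk each node's PFN span and probe PFNs block by block."""
    registered = set()
    for nid in sorted({r[2] for r in REGIONS}):
        start = min(r[0] for r in REGIONS if r[2] == nid)
        end = max(r[1] for r in REGIONS if r[2] == nid)
        for blk in range(start // BLOCK_PFNS, (end - 1) // BLOCK_PFNS + 1):
            first = blk * BLOCK_PFNS
            for pfn in range(first, first + BLOCK_PFNS):
                if nid_for_pfn(pfn) == nid:
                    registered.add((nid, blk))
                    break
    return registered


def register_by_region():
    """New scheme: every block touched by a region belongs to that region's node."""
    registered = set()
    for start, end, nid in REGIONS:
        for blk in range(start // BLOCK_PFNS, (end - 1) // BLOCK_PFNS + 1):
            registered.add((nid, blk))
    return registered


old = register_by_node_span()
old_lookups = lookups
new = register_by_region()
print(old == new, old_lookups, lookups - old_lookups)
```

In this model both schemes register the same (node, block) pairs, but only the span-based walk needs per-PFN node lookups, which is where the boot-time cost comes from on interleaved layouts.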
Test Results on My system with 32TB RAM
=======================================
1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
Without this patch
------------------
Startup finished in 1min 16.528s (kernel)
With this patch
---------------
Startup finished in 17.236s (kernel) - 78% Improvement
2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
Without this patch
------------------
Startup finished in 28.320s (kernel)
With this patch
---------------
Startup finished in 15.621s (kernel) - 46% Improvement
[donettom@linux.ibm.com: restore removed extra line]
Link: https://lkml.kernel.org/r/20250609140354.467908-1-donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/2a0a05c2dffc62a742bf1dd030098be4ce99be28.1748452241.git.donettom@linux.ibm.com
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Especially once we hit one of the assertions in
sanity_check_pinned_pages(), observing follow-up assertions failing in
other code can give good clues about what went wrong, so use
VM_WARN_ON_ONCE instead.
While at it, let's just convert all VM_BUG_ON to VM_WARN_ON_ONCE as well.
Add one comment for the pfn_valid() check.
We have to introduce VM_WARN_ON_ONCE_VMA() to make that fly.
Drop the BUG_ON after mmap_read_lock_killable(); if that ever returns
something > 0 we're in bigger trouble. Convert the other BUG_ONs into
VM_WARN_ON_ONCE as well; they are in a similar "should never happen"
domain, but are more reasonable to check for during early testing.
[david@redhat.com: use the _FOLIO variant where possible, per Lorenzo]
Link: https://lkml.kernel.org/r/844bd929-a551-48e3-a12e-285cd65ba580@redhat.com
Link: https://lkml.kernel.org/r/20250604140544.688711-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Knowing how much memory is how cold can be useful for understanding
coldness and utilization efficiency of memory. The raw form of DAMON's
monitoring results has the information. Convert the raw results into
per-byte idle time distributions and expose them as a percentiles metric to
users, as a read-only DAMON_STAT parameter.
In detail, the metrics are calculated as follows. First, DAMON's
per-region access frequency and age information is converted into per-byte
idle time. If access frequency of a region is higher than zero, every
byte of the region has zero idle time. If the access frequency of a
region is zero, every byte of the region has idle time as the age of the
region. Then the logic sorts the per-byte idle times and provides the
value at 0/100, 1/100, ..., 99/100 and 100/100 location of the sorted
array.
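The calculation described above can be sketched in user space as follows. This is a minimal illustration, not the module's code: the `(size, nr_accesses, age)` tuple layout, the function name and the tie-breaking at percentile boundaries are assumptions, and age is treated as an abstract time unit.

```python
def idle_time_percentiles(regions):
    """regions: iterable of (size_bytes, nr_accesses, age) tuples."""
    # Per-byte idle time: zero if the region was accessed, else the region's age.
    weighted = sorted((0 if nr > 0 else age, size) for size, nr, age in regions)
    total = sum(size for _, size in weighted)
    result = []
    for p in range(101):
        covered = 0
        for idle, size in weighted:
            covered += size
            # First region whose cumulative bytes reach p/100 of all bytes.
            if covered * 100 >= total * p:
                result.append(idle)
                break
    return result


# 100 bytes recently accessed, 50 bytes idle for 20 units, 50 bytes idle for 120.
pcts = idle_time_percentiles([(100, 5, 10), (50, 0, 20), (50, 0, 120)])
print(pcts[0], pcts[50], pcts[75], pcts[100])
```

Sorting regions by idle time and weighting them by size avoids materialising one array element per byte while giving the same values as the per-byte description.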
The metric can be easily aggregated and compared on large scale production
systems. For example, if the average 75th percentile idle time of
machines, collected at a similar time, is two minutes, it means 25 percent
of the systems' memory is not accessed at all for two minutes or more
on average. If a workload considers two minutes as unit work time, we can
conclude its working set size is only 75 percent of the memory. If the
system utilizes proactive reclamation and it supports coldness-based
thresholds like DAMON_RECLAIM, the idle time percentiles can be used to
find a safer or more aggressive coldness threshold for the targeted memory
saving.
Link: https://lkml.kernel.org/r/20250604183127.13968-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The raw form of DAMON's monitoring results captures many details of the
information. However, not every bit of the information is always required
for understanding practical access patterns. Especially on real-world
production systems with a large scale of time and size, the raw form is
difficult to aggregate and compare.
Convert the raw monitoring results into a single number metric, namely
estimated memory bandwidth, and expose it to users as a read-only
DAMON_STAT parameter. The metric represents access intensiveness
(hotness) of the system. It can easily be aggregated and compared for
high level understanding of the access pattern on large systems.
Link: https://lkml.kernel.org/r/20250604183127.13968-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: introduce DAMON_STAT for simple and practical
access monitoring", v2.
DAMON-based access monitoring is not simple due to the required DAMON
control and results visualization. Introduce a static kernel module to
make it simple. The module can be enabled without manual setup and
provides access pattern metrics that are easy to fetch and that make the
practical access pattern information easy to understand, namely estimated
memory bandwidth and memory idle time percentiles.
Background and Problems
=======================
DAMON can be used for monitoring data access patterns of the system and
workloads. Specifically, users can start DAMON to monitor access events
on specific address space with fine controls including address ranges to
monitor and time intervals between samplings and aggregations. The
resulting access information snapshot contains access frequency
(nr_accesses) and how long the frequency was kept (age) for each byte.
The monitoring usage is not simple and practical enough for production
usage. Users should first start DAMON with a number of parameters, and
wait until DAMON's monitoring results capture a reasonable amount of the
time data (age). In production, such manual start and wait is impractical
to capture useful information from a high number of machines in a timely
manner.
The monitoring result is also too detailed to be used in production
environments. The raw results are hard to aggregate and/or compare
for production environments with a large scale of time, space and
machine fleets.
Users have to implement and use their own automation of DAMON control and
results processing. It is repetitive and challenging since there is no
good reference or guideline for such automation.
Solution: DAMON_STAT
====================
Implement such automation in kernel space as a static kernel module,
namely DAMON_STAT. It can be enabled at build, boot, or run time via its
build configuration or module parameter. It monitors the entire physical
address space with monitoring intervals that are auto-tuned for a reasonable
amount of access observations and minimum overhead. It converts the raw
monitoring results into simpler metrics that can easily be aggregated and
compared, namely estimated memory bandwidth and idle time percentiles.
Understanding of the metrics and the user interface of DAMON_STAT is
essential. Refer to the commit messages of the second and the third
patches of this patch series for more details about the metrics. For the
user interface, the standard module parameters system is used. Refer to
the fourth patch of this patch series for details of the user interface.
Discussions
===========
The module aims to be useful on production environments constructed with a
large number of machines that run a long time. The auto-tuned monitoring
intervals ensure a reasonable quality of the outputs. The auto-tuning
also ensures its overhead is reasonable and low enough for it to be always
enabled in production. The simplified monitoring results metrics can
be useful for showing both coldness (idle time percentiles) and hotness
(memory bandwidth) of the system's access pattern. We expect the
information can be useful for assessing system memory utilization and
inspiring optimizations or investigations on both kernel and user space
memory management logics for large scale fleets.
We hence expect the module is good enough to be just used in most
environments. For special cases that require a custom access monitoring
automation, users will still benefit by using DAMON_STAT as a reference or
a guideline for their specialized automation.
This patch (of 4):
To use DAMON for monitoring access patterns of the system, users should
manually start DAMON via DAMON sysfs ABI with a number of parameters for
specifying the monitoring target address space, address ranges, and
monitoring intervals. After that, users should also wait until the desired
amount of time data is captured into DAMON's monitoring results. This is
bothersome and takes too long to be practical for access monitoring in
large fleet-level production environments.
For access-aware system operations use cases like proactive cold memory
reclamation, similar problems existed. We solved those by introducing
dedicated static kernel modules such as DAMON_RECLAIM.
Implement such a static kernel module for access monitoring, namely
DAMON_STAT. It monitors the entire physical address space with auto-tuned
monitoring intervals. The auto-tuning is set to capture 4% of observable
access events in each snapshot while keeping the sampling intervals at
least 5 milliseconds and at most 10 seconds. From a few production
environments, we confirmed this setup provides high quality monitoring
results with minimum overheads. The module therefore receives only one
user input, whether to enable or disable it. It can be set on build or
boot time via build configuration or kernel boot command line. It can
also be overridden at runtime.
Note that this commit only implements the DAMON control part of the
module. Users could get the monitoring results via damon:damon_aggregated
tracepoint, but that's of course not the recommended way. Following
commits will implement convenient and optimized ways for serving the
monitoring results to users.
[sj@kernel.org: use IS_ENABLED() for enabled initial value]
Link: https://lkml.kernel.org/r/20250604205619.18929-1-sj@kernel.org
[sj@kernel.org: reset enabled when DAMON start failed]
Link: https://lkml.kernel.org/r/20250706184750.36588-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250604183127.13968-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250604183127.13968-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If a user wishes to enable KSM mergeability for an entire process and all
fork/exec'd processes that come after it, they use the prctl()
PR_SET_MEMORY_MERGE operation.
This defaults all newly mapped VMAs to have the VM_MERGEABLE VMA flag set
(in order to indicate they are KSM mergeable), as well as setting this
flag for all existing VMAs and propagating this across fork/exec.
However it also breaks VMA merging for new VMAs, both in the process and
all forked (and fork/exec'd) child processes.
This is because when a new mapping is proposed, the flags specified will
never have VM_MERGEABLE set. However all adjacent VMAs will already have
VM_MERGEABLE set, rendering VMAs unmergeable by default.
To work around this, we try to set the VM_MERGEABLE flag prior to
attempting a merge. In the case of brk() this can always be done.
However on mmap() things are more complicated - while KSM is not supported
for MAP_SHARED file-backed mappings, it is supported for MAP_PRIVATE
file-backed mappings.
These mappings may have deprecated .mmap() callbacks specified which
could, in theory, adjust flags and thus KSM eligibility.
So we check to determine whether this is possible. If not, we set
VM_MERGEABLE prior to the merge attempt on mmap(), otherwise we retain the
previous behaviour.
This fixes VMA merging for all new anonymous mappings, which covers the
majority of real-world cases, so we should see a significant improvement
in VMA mergeability.
For MAP_PRIVATE file-backed mappings, those which implement the
.mmap_prepare() hook and shmem are both known to be safe, so we allow
these, disallowing all other cases.
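The merge problem and the fix can be illustrated with a toy model. This is a hedged sketch: `can_merge()` reduces VMA merging to a flag-equality check, `early_ksm_flags()` and its parameters are hypothetical names, and the flag values are used only for illustration.

```python
VM_READ, VM_WRITE = 0x1, 0x2
VM_MERGEABLE = 0x80000000  # value used here only for illustration


def can_merge(adjacent_flags, new_flags):
    """Toy rule: adjacent VMAs merge only when their flags are identical."""
    return adjacent_flags == new_flags


def early_ksm_flags(flags, anonymous, is_shmem=False, has_mmap_prepare=False):
    """Set VM_MERGEABLE before the merge attempt only when that is known safe."""
    if anonymous or is_shmem or has_mmap_prepare:
        return flags | VM_MERGEABLE
    return flags  # a legacy .mmap() hook may change flags: decide later


existing = VM_READ | VM_WRITE | VM_MERGEABLE  # VMA after PR_SET_MEMORY_MERGE
proposed = VM_READ | VM_WRITE                 # flags of a newly proposed mapping

before = can_merge(existing, proposed)
after_anon = can_merge(existing, early_ksm_flags(proposed, anonymous=True))
after_legacy = can_merge(existing, early_ksm_flags(proposed, anonymous=False))
print(before, after_anon, after_legacy)
```

Without the early flag the proposed mapping never matches its neighbours; with it, anonymous mappings merge again, while file-backed mappings with a legacy .mmap() hook keep the previous behaviour.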
Also add stubs for newly introduced function invocations to VMA userland
testing.
[lorenzo.stoakes@oracle.com: correctly invoke late KSM check after mmap hook]
Link: https://lkml.kernel.org/r/5861f8f6-cf5a-4d82-a062-139fb3f9cddb@lucifer.local
Link: https://lkml.kernel.org/r/3ba660af716d87a18ca5b4e635f2101edeb56340.1748537921.git.lorenzo.stoakes@oracle.com
Fixes: d7597f59d1 ("mm: add new api to enable ksm per process") # please no backport!
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Xu Xin <xu.xin16@zte.com.cn>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Stefan Roesch <shr@devkernel.io>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: ksm: prevent KSM from breaking merging of new VMAs", v3.
When KSM-by-default is established using prctl(PR_SET_MEMORY_MERGE), this
defaults all newly mapped VMAs to having VM_MERGEABLE set, and thus makes
them available to KSM for samepage merging. It also sets VM_MERGEABLE in
all existing VMAs.
However this causes an issue upon mapping of new VMAs - the initial flags
will never have VM_MERGEABLE set when attempting a merge with adjacent
VMAs (this is set later in the mmap() logic), and adjacent VMAs will
ALWAYS have VM_MERGEABLE set.
This renders all newly mapped VMAs unmergeable.
To avoid this, this series performs the check for PR_SET_MEMORY_MERGE far
earlier in the mmap() logic, prior to the merge being attempted.
However we run into complexity with the deprecated .mmap() callback - if
a driver hooks this, it might change flags which adjust KSM merge
eligibility.
We have to worry about this because, while KSM is only applicable to
private mappings, this includes both anonymous and MAP_PRIVATE-mapped
file-backed mappings.
This isn't a problem for brk(), where the VMA must be anonymous. However
in mmap() we must be conservative - if the VMA is anonymous then we can
always proceed, however if not, we permit only shmem mappings (whose .mmap
hook does not affect KSM eligibility) and drivers which implement
.mmap_prepare() (invoked prior to the KSM eligibility check).
If we can't be sure of the driver changing things, then we maintain the
same behaviour of performing the KSM check later in the mmap() logic (and
thus losing new VMA mergeability).
A great many use-cases for this logic will use anonymous mappings at any
rate, so this change should already cover the majority of actual KSM
use-cases.
This patch (of 4):
In subsequent commits we are going to determine KSM eligibility prior to a
VMA being constructed, at which point we will of course not yet have
access to a VMA pointer.
It is trivial to boil down the check logic to be parameterised on
mm_struct, file and VMA flags, so do so.
As a part of this change, additionally expose and use file_is_dax() to
determine whether a file is being mapped under a DAX inode.
Link: https://lkml.kernel.org/r/cover.1748537921.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/36ad13eb50cdbd8aac6dcfba22c65d5031667295.1748537921.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Xu Xin <xu.xin16@zte.com.cn>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Stefan Roesch <shr@devkernel.io>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce a new drgn script, `show_page_info.py`, which allows users
to analyze the state of a page given a process ID (PID) and a virtual
address (VADDR). This can help kernel developers or debuggers easily
inspect page-related information in a live kernel or vmcore.
The script extracts information such as the page flags, mapping, and
other metadata relevant to diagnosing memory issues.
Output example:
sudo ./show_page_info.py 1 0x7fc988181000
PID: 1 Comm: systemd mm: 0xffff8d22c4089700
RAW: 0017ffffc000416c fffff939062ff708 fffff939062ffe08 ffff8d23062a12a8
RAW: 0000000000000000 ffff8d2323438f60 0000002500000007 ffff8d23203ff500
Page Address: 0xfffff93905664e00
Page Flags: PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|
PG_private|PG_reported|PG_has_hwpoisoned
Page Size: 4096
Page PFN: 0x159938
Page Physical: 0x159938000
Page Virtual: 0xffff8d2319938000
Page Refcount: 37
Page Mapcount: 7
Page Index: 0x0
Page Memcg Data: 0xffff8d23203ff500
Memcg Name: init.scope
Memcg Path: /sys/fs/cgroup/memory/init.scope
Page Mapping: 0xffff8d23062a12a8
Page Anon/File: File
Page VMA: 0xffff8d22e06e0e40
VMA Start: 0x7fc988181000
VMA End: 0x7fc988185000
This page is part of a compound page.
This page is the head page of a compound page.
Head Page: 0xfffff93905664e00
Compound Order: 2
Number of Pages: 4
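Some fields in the output above are related by simple arithmetic, which can be cross-checked without drgn. The snippet below is only a sanity check on the sample values, assuming a page shift of 12 (consistent with the 4096-byte page size shown); it is not part of the script.

```python
PAGE_SHIFT = 12  # matches the 4096-byte page size shown in the output

pfn = 0x159938
compound_order = 2

physical = pfn << PAGE_SHIFT      # "Page Physical" derived from "Page PFN"
nr_pages = 1 << compound_order    # "Number of Pages" derived from "Compound Order"
print(hex(physical), nr_pages)
```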
Link: https://lkml.kernel.org/r/20250530055855.687067-1-ye.liu@linux.dev
Signed-off-by: Ye Liu <liuye@kylinos.cn>
Tested-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Omar Sandoval <osandov@osandov.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The scan implementation for MGLRU was missing proportional reclaim
pressure for memcg, which contradicts the description in
Documentation/admin-guide/cgroup-v2.rst (memory.{low,min} section).
This issue can be observed in kselftest cgroup:test_memcontrol
(specifically test_memcg_min and test_memcg_low). The following table
shows the actual values observed in my local test env (on xfs) and the
error "e", which is the symmetric absolute percentage error from the ideal
values of 29M for c[0] and 21M for c[1].
test_memcg_min
     | MGLRU enabled   | MGLRU enabled   | MGLRU disabled
     | Without patch   | With patch      |
-----|-----------------|-----------------|----------------
c[0] | 25964544 (e=8%) | 28770304 (e=3%) | 27820032 (e=4%)
c[1] | 26214400 (e=9%) | 23998464 (e=4%) | 24776704 (e=6%)
test_memcg_low
     | MGLRU enabled   | MGLRU enabled   | MGLRU disabled
     | Without patch   | With patch      |
-----|-----------------|-----------------|----------------
c[0] | 26214400 (e=7%) | 27930624 (e=4%) | 27688960 (e=5%)
c[1] | 26214400 (e=9%) | 24764416 (e=6%) | 24920064 (e=6%)
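The error column can be recomputed from the raw values. The sketch below assumes the error is |actual - ideal| / (actual + ideal), rounded to a whole percent; this form is an assumption that happens to reproduce the test_memcg_min figures, and the selftest itself may compute it differently.

```python
MB = 1024 * 1024
IDEAL = {"c[0]": 29 * MB, "c[1]": 21 * MB}


def sape_percent(actual, ideal):
    """Symmetric absolute percentage error, rounded to a whole percent."""
    return round(100 * abs(actual - ideal) / (actual + ideal))


# test_memcg_min values with MGLRU enabled, taken from the table above.
observed = {
    "without patch": {"c[0]": 25964544, "c[1]": 26214400},
    "with patch": {"c[0]": 28770304, "c[1]": 23998464},
}
errors = {
    label: {name: sape_percent(value, IDEAL[name]) for name, value in row.items()}
    for label, row in observed.items()
}
print(errors)
```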
Factor out the proportioning logic to a new function and have MGLRU reuse
it. While at it, update the eviction behavior via debugfs 'lru_gen'
interface ('-' command with an explicit 'nr_to_reclaim' parameter) to
ensure eviction is limited to the specified number.
Link: https://lkml.kernel.org/r/20250530162353.541882-1-den@valinux.co.jp
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Reviewed-by: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull /proc/sys dcache lookup fix from Al Viro:
"Fix for the breakage spotted by Neil in the interplay between
/proc/sys ->d_compare() weirdness and parallel lookups"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix proc_sys_compare() handling of in-lookup dentries
Pull scheduler fixes from Borislav Petkov:
- Fix the calculation of the deadline server task's runtime as this
mishap was preventing realtime tasks from running
- Avoid a race condition during migrate-swapping two tasks
- Fix the string reported for the "none" dynamic preemption option
* tag 'sched_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Fix dl_server runtime calculation formula
sched/core: Fix migrate_swap() vs. hotplug
sched: Fix preemption string of preempt_dynamic_none
Pull objtool fix from Borislav Petkov:
- Fix the compilation of an x86 kernel on a big endian machine due to a
missed endianness conversion
* tag 'objtool_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Add missing endian conversion to read_annotate()
Pull perf fixes from Borislav Petkov:
- Revert uprobes to using CAP_SYS_ADMIN again as currently they can
destructively modify kernel code from an unprivileged process
- Move a warning to where it belongs
* tag 'perf_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Revert to requiring CAP_SYS_ADMIN for uprobes
perf/core: Fix the WARN_ON_ONCE is out of lock protected region
Pull x86 fix from Borislav Petkov:
- Make sure AMD SEV guests using Secure TSC include a TSC_FACTOR, which
prevents their TSCs from going skewed from the hypervisor's
* tag 'x86_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev: Use TSC_FACTOR for Secure TSC frequency calculation
Pull locking fixes from Borislav Petkov:
- Disable FUTEX_PRIVATE_HASH for this cycle due to a performance
regression
- Add a selftests compilation product to the corresponding .gitignore
file
* tag 'locking_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
selftests/futex: Add futex_numa to .gitignore
futex: Temporary disable FUTEX_PRIVATE_HASH
Pull EDAC fix from Borislav Petkov:
- Initialize sysfs attributes properly to avoid lockdep complaining
about an uninitialized lock class
* tag 'edac_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC: Initialize EDAC features sysfs attributes
Pull RAS fixes from Borislav Petkov:
- Do not remove the MCE sysfs hierarchy if thresholding sysfs nodes
init fails due to new/unknown banks present, which in itself is not
fatal anyway; add default names for new banks
- Make sure MCE polling settings are honored after CMCI storms
- Make sure MCE threshold limit is reset after the thresholding
interrupt has been serviced
- Clean up properly and disable CMCI banks on shutdown so that a
second/kexec-ed kernel can rediscover those banks again
* tag 'ras_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Make sure CMCI banks are cleared during shutdown on Intel
x86/mce/amd: Fix threshold limit reset
x86/mce/amd: Add default names for MCA banks and blocks
x86/mce: Ensure user polling settings are honored when restarting timer
x86/mce: Don't remove sysfs if thresholding sysfs init fails
Pull irq fix from Borislav Petkov:
- Have irq-msi-lib select CONFIG_GENERIC_MSI_IRQ explicitly as it uses
its facilities
* tag 'irq_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/irq-msi-lib: Select CONFIG_GENERIC_MSI_IRQ