In 2008, we added through commit 48c906823f ("memory hotplug: allocate
usemap on the section with pgdat") quite some complexity to try allocating
memory for the "usemap" (storing pageblock information per memory section)
for a memory section close to the memory of the "pgdat" of the node.
The goal was to make memory hotunplug of boot memory more likely to
succeed. That commit also added some checks for circular dependencies
between two memory sections, whereby two memory sections would contain
each other's usemap, rendering both boot memory sections unremovable.
However, in 2010, commit a4322e1bad ("sparsemem: Put usemap for one node
together") started allocating the usemap for multiple memory sections on
the same node in one chunk, effectively grouping all usemap allocations of
the same node in a single memblock allocation.
We don't really give guarantees about memory hotunplug of boot memory, and
with the change in 2010, it is impossible in practice to get any circular
dependencies.
So let's simply remove this complexity.
Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-10-096addc8800d@kernel.org
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ever since commit f8f03eb5f0 ("mm: stop making SPARSEMEM_VMEMMAP
user-selectable"), an architecture that supports CONFIG_SPARSEMEM_VMEMMAP
(by selecting SPARSEMEM_VMEMMAP_ENABLE) can no longer enable
CONFIG_SPARSEMEM without CONFIG_SPARSEMEM_VMEMMAP.
Right now, CONFIG_MEMORY_HOTPLUG is guarded by CONFIG_SPARSEMEM.
However, CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG is only enabled by
* arm64: which selects SPARSEMEM_VMEMMAP_ENABLE
* loongarch: which selects SPARSEMEM_VMEMMAP_ENABLE
* powerpc (64bit): which selects SPARSEMEM_VMEMMAP_ENABLE
* riscv (64bit): which selects SPARSEMEM_VMEMMAP_ENABLE
* s390 with SPARSEMEM: which selects SPARSEMEM_VMEMMAP_ENABLE
* x86 (64bit): which selects SPARSEMEM_VMEMMAP_ENABLE
So, we can make CONFIG_MEMORY_HOTPLUG depend on CONFIG_SPARSEMEM_VMEMMAP
without affecting any setups.
Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-4-096addc8800d@kernel.org
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: memory hot(un)plug and SPARSEMEM cleanups", v2.
Some cleanups around memory hot(un)plug and SPARSEMEM. In essence, we can
limit CONFIG_MEMORY_HOTPLUG to CONFIG_SPARSEMEM_VMEMMAP, remove some dead
code, and move all the hotplug bits over to mm/sparse-vmemmap.c.
There are also some further/related cleanups around other unnecessary code
(memory hole handling and complicated usemap allocation).
I have some further sparse.c cleanups lying around, and I'm planning on
getting rid of bootmem_info.c entirely.
This patch (of 15):
If a hugetlb folio gets freed while we are in scan_movable_pages(),
folio_nr_pages() could return 0, so we would OR "0 - 1 = -1" into the
PFN, resulting in PFN = -1. We're not holding any locks or references
that would prevent that.
for_each_valid_pfn() would then search for the next valid PFN, and could
return a PFN that lies outside the originally requested range.
do_migrate_range() would then try to migrate quite a big range, which is
certainly undesirable.
To fix it, simply test for valid folio_nr_pages() values. While at it, as
PageHuge() really just does a page_folio() internally, we can just use
folio_test_hugetlb() on the folio directly.
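A minimal sketch of the check, following the description above (the
MAX_FOLIO_NR_PAGES upper bound and the exact loop shape are illustrative,
not the literal patch):

struct folio *folio = page_folio(page);

if (folio_test_hugetlb(folio)) {
        unsigned long nr_pages = folio_nr_pages(folio);

        /*
         * The folio may be getting freed concurrently; a stale
         * nr_pages of 0 would turn "pfn |= nr_pages - 1" into
         * PFN = -1.
         */
        if (nr_pages >= 1 && nr_pages <= MAX_FOLIO_NR_PAGES)
                pfn |= nr_pages - 1;
        continue;
}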
scan_movable_pages() is expected to be fast, and we try to avoid taking
locks or grabbing references. We cannot use folio_try_get() as that does
not work for free hugetlb folios. We could grab the hugetlb_lock, but
that just adds complexity.
The race is unlikely to trigger in practice, so we won't be CCing stable.
Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-0-096addc8800d@kernel.org
Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-1-096addc8800d@kernel.org
Fixes: 16540dae95 ("mm/hugetlb: mm/memory_hotplug: use a folio in scan_movable_pages()")
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
scan_slots_for_writeback() and scan_slots_for_recompress() work in a "best
effort" fashion: if they cannot allocate memory for a new pp-slot
candidate, they just return, and post-processing selects the slots that
were successfully scanned thus far. scan_slots functions never return errors
and their callers never check the return status, so convert them to return
void.
Link: https://lkml.kernel.org/r/20260317032349.753645-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Calling `LZ4_loadDict()` repeatedly in zram causes significant overhead
due to its internal dictionary pre-processing. This commit introduces a
template stream mechanism to pre-process the dictionary only once when the
dictionary is initially set or modified. It then efficiently copies this
state for subsequent compressions.
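A minimal sketch of the mechanism using the kernel's LZ4 streaming API
(the template variable and the surrounding buffer management are
illustrative, not the actual patch):

/* Once, when the dictionary is set or modified: pre-process it. */
static LZ4_stream_t template_stream;

LZ4_loadDict(&template_stream, dict, dict_len);

/* Per compression: copy the pre-processed state instead of re-loading. */
memcpy(stream, &template_stream, sizeof(*stream));
dst_len = LZ4_compress_fast_continue(stream, src, dst, PAGE_SIZE,
                                     dst_capacity, accel);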
Verification Test Items:
Test Platform: android16-6.12
1. Collect Anonymous Page Dataset
1) Apply the following patch:
static bool zram_meta_alloc(struct zram *zram, u64 disksize)
if (!huge_class_size)
- huge_class_size = zs_huge_class_size(zram->mem_pool);
+ huge_class_size = 0;
2) Install multiple apps and run monkey testing until SwapFree is close to 0.
3) Execute the following command to export the data:
dd if=/dev/block/zram0 of=/data/samples/zram_dump.img bs=4K
2. Train Dictionary
Since LZ4 does not have a dedicated dictionary training tool, the zstd
tool can be used for training[1]. The command is as follows:
zstd --train /data/samples/* --split=4096 --maxdict=64KB -o /vendor/etc/dict_data
3. Test Code
adb shell "dd if=/data/samples/zram_dump.img of=/dev/test_pattern bs=4096 count=131072 conv=fsync"
adb shell "swapoff /dev/block/zram0"
adb shell "echo 1 > /sys/block/zram0/reset"
adb shell "echo lz4 > /sys/block/zram0/comp_algorithm"
adb shell "echo dict=/vendor/etc/dict_data > /sys/block/zram0/algorithm_params"
adb shell "echo 6G > /sys/block/zram0/disksize"
echo "Start Compression"
adb shell "taskset 80 dd if=/dev/test_pattern of=/dev/block/zram0 bs=4096 count=131072 conv=fsync"
echo.
echo "Start Decompression"
adb shell "taskset 80 dd if=/dev/block/zram0 of=/dev/output_result bs=4096 count=131072 conv=fsync"
echo "mm_stat:"
adb shell "cat /sys/block/zram0/mm_stat"
echo.
Note: To ensure stable test results, it is best to lock the CPU frequency
before executing the test.
LZ4 supports dictionaries up to 64KB. Below are the test results for
compression speed at various dictionary sizes:
dict_size base patch
4 KB 156M/s 219M/s
8 KB 136M/s 217M/s
16KB 98M/s 214M/s
32KB 66M/s 225M/s
64KB 38M/s 224M/s
When an LZ4 compression dictionary is enabled, compression speed is
negatively impacted by the dictionary's size; larger dictionaries result
in slower compression. This patch eliminates the influence of dictionary
size on compression speed, ensuring consistent performance regardless of
dictionary scale.
Link: https://lkml.kernel.org/r/698181478c9c4b10aa21b4a847bdc706@honor.com
Link: https://github.com/lz4/lz4?tab=readme-ov-file [1]
Signed-off-by: gao xu <gaoxu2@honor.com>
Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: khugepaged cleanups and mTHP prerequisites", v4.
The following series contains cleanups and prerequisites for my work on
khugepaged mTHP support [1]. These have been separated out to ease
review.
The first patch in the series refactors the page fault folio-to-PTE
mapping and follows a similar convention to that defined by
map_anon_folio_pmd_(no)pf(). This not only cleans up the current
implementation of do_anonymous_page(), but will allow for reuse later in
the khugepaged mTHP implementation.
The second patch adds a small is_pmd_order() helper to check if an order
is the PMD order. This check is open-coded in a number of places. This
patch aims to clean this up and will be used more in the khugepaged mTHP
work. The third patch also adds a small define for (HPAGE_PMD_NR - 1),
which is used often across the khugepaged code.
The fourth and fifth patches come from the khugepaged mTHP patchset [1].
These two patches include the rename of function prefixes, and the
unification of khugepaged and madvise_collapse via a new
collapse_single_pmd function.
Patch 1: refactor do_anonymous_page into map_anon_folio_pte_(no)pf
Patch 2: add is_pmd_order helper
Patch 3: add define for (HPAGE_PMD_NR - 1)
Patch 4: refactor/rename hpage_collapse
Patch 5: refactor to combine madvise_collapse and khugepaged
A big thanks to everyone that has reviewed, tested, and participated in
the development process.
This patch (of 5):
The anonymous page fault handler in do_anonymous_page() open-codes the
sequence to map a newly allocated anonymous folio at the PTE level:
- construct the PTE entry
- add rmap
- add to LRU
- set the PTEs
- update the MMU cache.
Introduce two helpers to consolidate this duplicated logic, mirroring the
existing map_anon_folio_pmd_nopf() pattern for PMD-level mappings:
map_anon_folio_pte_nopf(): constructs the PTE entry, takes folio
references, and adds the anon rmap and LRU entries. This function also
handles the uffd-wp marker that can occur in the pf variant. The future
khugepaged mTHP code calls this to map the newly collapsed mTHP folio.
map_anon_folio_pte_pf(): extends the nopf variant to handle MM_ANONPAGES
counter updates, and mTHP fault allocation statistics for the page fault
path.
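A rough sketch of what the nopf variant consolidates, per the list above
(signature and details are illustrative, not the actual patch):

static void map_anon_folio_pte_nopf(struct folio *folio,
                struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
{
        unsigned long nr_pages = folio_nr_pages(folio);
        pte_t entry = folio_mk_pte(folio, vma->vm_page_prot);

        entry = maybe_mkwrite(pte_mkdirty(entry), vma);
        folio_ref_add(folio, nr_pages - 1);     /* take folio references */
        folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
        folio_add_lru_vma(folio, vma);          /* add to LRU */
        set_ptes(vma->vm_mm, addr, ptep, entry, nr_pages);
        update_mmu_cache_range(NULL, vma, addr, ptep, nr_pages);
}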
The zero-page read path in do_anonymous_page() is also untangled from the
shared setpte label, since it does not allocate a folio and should not
share the same mapping sequence as the write path. We can now leave
nr_pages undeclared at function initialization, and use the single-page
update_mmu_cache() function to handle the zero-page update.
This refactoring will also help reduce code duplication between
mm/memory.c and mm/khugepaged.c, and provides a clean API for PTE-level
anonymous folio mapping that can be reused by future callers (like
khugepaged mTHP support).
Link: https://lkml.kernel.org/r/20260325114022.444081-1-npache@redhat.com
Link: https://lkml.kernel.org/r/20260325114022.444081-2-npache@redhat.com
Link: https://lore.kernel.org/all/20260122192841.128719-1-npache@redhat.com
Signed-off-by: Nico Pache <npache@redhat.com>
Suggested-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rafael Aquini <raquini@redhat.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shivank Garg <shivankg@amd.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Takashi Iwai (SUSE) <tiwai@suse.de>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In the past, damon_set_region_biggest_system_ram_default(), which is the
core function for setting the default monitoring target region of
DAMON_LRU_SORT, didn't support addr_unit. Hence DAMON_LRU_SORT was
silently ignoring the user input for addr_unit when the user didn't
explicitly set the monitoring target regions and the default target
region was therefore being used. No real problem from that ignorance has
been reported so far, but the implicit rule only makes things confusing.
Also, the default target region setup function has now been updated to
support addr_unit, so there is no reason to keep ignoring it. Respect the
user-passed addr_unit for the default target monitoring region use case.
Link: https://lkml.kernel.org/r/20260311052927.93921-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Yang yingliang <yangyingliang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In the past, damon_set_region_biggest_system_ram_default(), which is the
core function for setting the default monitoring target region of
DAMON_RECLAIM, didn't support addr_unit. Hence DAMON_RECLAIM was silently
ignoring the user input for addr_unit when the user didn't explicitly set
the monitoring target regions and the default target region was therefore
being used. No real problem from that ignorance has been reported so far,
but the implicit rule only makes things confusing. Also, the default
target region setup function has now been updated to support addr_unit,
so there is no reason to keep ignoring it. Respect the user-passed
addr_unit for the default target monitoring region use case.
Link: https://lkml.kernel.org/r/20260311052927.93921-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Yang yingliang <yangyingliang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_find_biggest_system_ram() did not support addr_unit in the past.
Hence, its caller, damon_set_region_biggest_system_ram_default(), did not
support addr_unit either. The previous commit updated the inner function
to support addr_unit, so there is no longer any reason for
damon_set_region_biggest_system_ram_default() not to support it; keeping
the gap only creates unnecessary inconsistency in addr_unit support.
Update it to receive addr_unit and handle it inside.
Link: https://lkml.kernel.org/r/20260311052927.93921-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Yang yingliang <yangyingliang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_find_biggest_system_ram() sets an 'unsigned long' variable with a
'resource_size_t' value. This is fundamentally wrong. On environments
such as ARM 32 bit machines having LPAE (Large Physical Address
Extensions), which DAMON supports, the size of 'unsigned long' may be
smaller than that of 'resource_size_t'. It is safe, though, since we
restrict the walk to be done only up to ULONG_MAX.
DAMON supports this address size gap using 'addr_unit'. We didn't add
that support to this function, just to keep the initial support change
small. Now the support has reasonably settled, and this kind of gap only
makes the code inconsistent and confusing. Add 'addr_unit' support to the
function, by letting callers pass the 'addr_unit' and handling it in the
function. All callers pass an 'addr_unit' of 1 for now, though, to keep
the old behavior.
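As a sketch of the conversion involved (illustrative; DAMON core-space
addresses are real addresses divided by 'addr_unit'):

/* Map a physical byte address into DAMON core address space. */
static unsigned long damon_phys_to_core(resource_size_t phys,
                                        unsigned long addr_unit)
{
        /* div64_ul() from <linux/math64.h> handles 64/32-bit division */
        return (unsigned long)div64_ul(phys, addr_unit);
}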
[sj@kernel.org: verify found biggest system ram]
Link: https://lkml.kernel.org/r/20260317144725.88524-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20260311052927.93921-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Yang yingliang <yangyingliang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: support addr_unit on default monitoring targets
for modules".
DAMON_RECLAIM and DAMON_LRU_SORT support 'addr_unit' parameters only when
the monitoring target address range is explicitly set. This was
intentional, to keep the initial 'addr_unit' support change small. Now
'addr_unit' support has largely stabilized, and keeping this corner case
only makes the code inconsistent, governed by implicit rules. The
inconsistency makes it easy to confuse [1] readers. After all, there is
no real reason to keep 'addr_unit' support incomplete. Add support for
this case to improve readability and support 'addr_unit' more completely.
This series consists of five patches. The first one (patch 1) fixes a
small bug, found during this work, that mistakenly assigns an inclusive
end address to an open (exclusive) end address. The second and third ones
(patches 2 and 3) extend the default monitoring target setting functions
in the core layer one by one, to support the 'addr_unit' while making no
visible changes. The final two patches (patches 4 and 5) update
DAMON_RECLAIM and DAMON_LRU_SORT to support 'addr_unit' for the default
monitoring target address ranges, by passing the user input to the core
functions.
This patch (of 5):
'struct damon_addr_range' and 'struct resource' represent different types
of address ranges. 'damon_addr_range' is for end-open ranges ([start,
end)). 'resource' is for fully-closed ranges ([start, end]). But
walk_system_ram() is assigning resource->end to damon_addr_range->end
without the inclusiveness adjustment. As a result, the function returns
an address range that is missing the last byte.
The function is being used to find and set the biggest system ram as the
default monitoring target for DAMON_RECLAIM and DAMON_LRU_SORT. Missing
the last byte of the big range shouldn't be a real problem for real use
cases. That said, the loss is definitely unintended behavior. Do
the correct adjustment.
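A sketch of the adjustment in the walk_system_ram() callback (variable
names are illustrative):

/*
 * 'struct resource' ranges are fully closed while damon_addr_range is
 * end-open: convert the inclusive end to an exclusive one.
 */
ret->start = (unsigned long)res->start;
ret->end = (unsigned long)res->end + 1;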
Link: https://lkml.kernel.org/r/20260311052927.93921-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20260311052927.93921-2-sj@kernel.org
Link: https://lore.kernel.org/20260131015643.79158-1-sj@kernel.org [1]
Fixes: 43b0536cb4 ("mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Yang yingliang <yangyingliang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: improve map count checks".
Firstly, in mremap(), it appears that our map count checks have been overly
conservative - there is simply no reason to require that we have headroom
of 4 mappings prior to moving the VMA; we have only needed headroom of 2
VMAs since commit 659ace584e ("mmap: don't return ENOMEM when mapcount is
temporarily exceeded in munmap()").
Likely the original headroom of 4 mappings was a mistake, and 3 was
actually intended.
Next, we access sysctl_max_map_count in a number of places without being
all that careful about how we do so.
We introduce a simple helper that READ_ONCE()'s the field
(get_sysctl_max_map_count()) to ensure that the field is accessed
correctly. The WRITE_ONCE() side is already handled by the sysctl procfs
code in proc_int_conv().
We also move this field to internal.h as there's no reason for anybody
else to access it outside of mm. Unfortunately we have to maintain the
extern variable, as mmap.c implements the procfs code.
Finally, we are accessing current->mm->map_count without holding the mmap
write lock, which is also not correct, so this series ensures the lock is
held before we access it.
We also abstract the check to a helper function, and add ASCII diagrams to
explain why we're doing what we're doing.
This patch (of 3):
We currently check, on moving a VMA when doing mremap(), whether doing so
might violate the vm.max_map_count sysctl limit.
This was introduced in the mists of time prior to 2.6.12.
At this point in time, as now, the move_vma() operation would copy the VMA
(+1 mapping if not merged), then potentially split the source VMA upon
unmap.
Prior to commit 659ace584e ("mmap: don't return ENOMEM when mapcount is
temporarily exceeded in munmap()"), a VMA split would check whether
mm->map_count >= sysctl_max_map_count prior to a split before it ran.
On unmap of the source VMA, if we are moving a partial VMA, we might split
the VMA twice.
This would mean, on invocation of split_vma() (as it was then), we'd check whether
mm->map_count >= sysctl_max_map_count with a map count elevated by one,
then again with a map count elevated by two, ending up with a map count
elevated by three.
At this point we'd reduce the map count on unmap.
At the start of move_vma(), there was a check that has remained throughout
mremap()'s history: mm->map_count >= sysctl_max_map_count - 3 (which
implies mm->map_count + 4 > sysctl_max_map_count - that is, we must have
headroom for 4 additional mappings).
After mm->map_count is elevated by 3, it is decremented by one once the
unmap completes. The mmap write lock is held, so nothing else will observe
mm->map_count > sysctl_max_map_count.
It appears this check was always incorrect - it should have been either
'mm->map_count > sysctl_max_map_count - 3' or 'mm->map_count >=
sysctl_max_map_count - 2'.
After commit 659ace584e ("mmap: don't return ENOMEM when mapcount is
temporarily exceeded in munmap()"), the map count check on split is
eliminated in the newly introduced __split_vma(), which the unmap path
uses, and has that path check whether mm->map_count >=
sysctl_max_map_count.
This is valid since, net, an unmap can only cause an increase in map count
of 1 (split both sides, unmap middle).
Since we only copy a VMA and (if MREMAP_DONTUNMAP is not set) unmap
afterwards, the maximum number of additional mappings that will actually be
subject to any check will be 2.
Therefore, update the check to assert this corrected value. Additionally,
update the check introduced by commit ea2c3f6f55 ("mm,mremap: bail out
earlier in mremap_to under map pressure") to account for this.
While we're here, clean up the comment prior to that.
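A sketch of the corrected headroom requirement (the changelog's reasoning
in code form, not the literal diff; placement is illustrative):

/*
 * Moving a VMA adds at most 2 mappings: +1 for the copied VMA (if
 * not merged), and net +1 from splitting the source VMA on unmap.
 */
if (mm->map_count + 2 > sysctl_max_map_count)
        return -ENOMEM;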
Link: https://lkml.kernel.org/r/cover.1773249037.git.ljs@kernel.org
Link: https://lkml.kernel.org/r/73e218c67dcd197c5331840fb011e2c17155bfb0.1773249037.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Jianzhou Zhao <luckd0g@163.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kmalloc_large() was renamed kmalloc_large_noprof() by commit 7bd230a266
("mm/slab: enable slab allocation tagging for kmalloc and friends"), and
subsequently renamed __kmalloc_large_noprof() by commit a0a44d9175 ("mm,
slab: don't wrap internal functions with alloc_hooks()"), making it an
internal implementation detail.
Large kmalloc allocations are now performed through the public kmalloc()
interface directly, making the reference to KMALLOC_MAX_SIZE also stale
(KMALLOC_MAX_CACHE_SIZE would be more accurate). Remove the references to
kmalloc_large() and KMALLOC_MAX_SIZE, and rephrase the description for
large kmalloc allocations.
Link: https://lkml.kernel.org/r/20260312053812.1365-1-kexinsun@smail.nju.edu.cn
Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn>
Suggested-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Assisted-by: unnamed:deepseek-v3.2 coccinelle
Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
During folio migration, __folio_migrate_mapping() removes the source folio
from the deferred split queue, but the destination folio is never
re-queued. This causes underutilized THPs to escape the shrinker after
NUMA migration, since they silently drop off the deferred split list.
Fix this by recording whether the source folio was on the deferred split
queue and its partially mapped state before move_to_new_folio() unqueues
it, and re-queuing the destination folio after a successful migration if
it was.
By the time migrate_folio_move() runs, partially mapped folios without a
pin have already been split by migrate_pages_batch(). So only two cases
remain on the deferred list at this point:
1. Partially mapped folios with a pin (split failed).
2. Fully mapped but potentially underused folios.
The recorded partially_mapped state is forwarded to deferred_split_folio()
so that the destination folio is correctly re-queued in both cases.
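A sketch of the flow described above (based on the changelog; the exact
placement within migrate_folio_move() is illustrative):

/* Record the state before move_to_new_folio() unqueues the source. */
bool was_deferred = !list_empty(&src->_deferred_list);
bool partially_mapped = folio_test_partially_mapped(src);

rc = move_to_new_folio(dst, src, mode);

/* After a successful migration, re-queue the destination folio. */
if (rc == MIGRATEPAGE_SUCCESS && was_deferred)
        deferred_split_folio(dst, partially_mapped);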
Because THPs are removed from the deferred_list, the THP shrinker cannot
split the underutilized THPs in time. As a result, users will see less
free memory than before.
Link: https://lkml.kernel.org/r/20260312104723.1351321-1-usama.arif@linux.dev
Fixes: dafff3f4c8 ("mm: split underused THPs")
Signed-off-by: Usama Arif <usama.arif@linux.dev>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The test_memcg_sock test in memcontrol.c sets up an IPv6 socket, sends
data over it to consume memory, and verifies that the memory.stat.sock and
memory.current values are close.
On systems where IPv6 isn't enabled or isn't configured to support
SOCK_STREAM, the test_memcg_sock test always fails. When the socket()
call fails, there is no way we can test the memory consumption and verify
the above claim. I believe it is better to just skip the test in this
case instead of reporting a test failure hinting that there may be
something wrong with the memcg code.
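A sketch of the resulting early exit in test_memcg_sock() (error-path
details are illustrative; KSFT_SKIP is the standard kselftest skip code):

fd = socket(AF_INET6, SOCK_STREAM, 0);
if (fd < 0) {
        /* No usable IPv6 SOCK_STREAM support: skip rather than fail. */
        return KSFT_SKIP;
}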
Link: https://lkml.kernel.org/r/20260311200526.885899-1-longman@redhat.com
Fixes: 5f8f019380 ("selftests: cgroup/memcontrol: add basic test for socket accounting")
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: thp: reduce unnecessary start_stop_khugepaged()", v7.
Writing to /sys/kernel/mm/transparent_hugepage/enabled causes
start_stop_khugepaged() to be called regardless of whether anything
changed. start_stop_khugepaged() then spams the printk ring buffer with
the exact same message, even when nothing changes.
For instance, if you have a custom vm.min_free_kbytes, just touching
/sys/kernel/mm/transparent_hugepage/enabled causes a printk message.
Example:
# sysctl -w vm.min_free_kbytes=112382
# for i in $(seq 100); do echo never > /sys/kernel/mm/transparent_hugepage/enabled ; done
and you have 100 WARN messages like the following, which is pretty dull:
khugepaged: min_free_kbytes is not updated to 112381 because user defined value 112382 is preferred
A similar message shows up when setting thp to "always":
# for i in $(seq 100); do
# echo 1024 > /proc/sys/vm/min_free_kbytes
# echo always > /sys/kernel/mm/transparent_hugepage/enabled
# done
And then, we have 100 messages like:
khugepaged: raising min_free_kbytes from 1024 to 67584 to help transparent hugepage allocations
This is more common when you have a configuration management system that
writes the THP configuration without first reading it, assuming that
nothing will happen if the configuration does not change; instead, these
annoying messages get printed.
For instance, at Meta's fleet, ~10K servers were producing 3.5M of these
messages per day.
Fix this by making the sysfs _store helpers easier to digest and
ratelimiting the message.
This patch (of 4):
Make set_recommended_min_free_kbytes() callable from outside khugepaged.c
by removing the static qualifier and adding a declaration in
mm/internal.h.
This allows callers that change THP settings to recalculate watermarks
without going through start_stop_khugepaged().
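A sketch of the resulting declaration, assuming the function keeps its
existing signature:

/* mm/internal.h */
void set_recommended_min_free_kbytes(void);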
Link: https://lkml.kernel.org/r/20260317-thp_logs-v7-0-31eb98fa5a8b@debian.org
Link: https://lkml.kernel.org/r/20260317-thp_logs-v7-1-31eb98fa5a8b@debian.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Suggested-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a new DAMON sysfs interface file, namely 'goal_tuner' under the DAMOS
quotas directory. It is connected to the damos_quota->goal_tuner field.
Users can therefore select their preferred goal-based quota tuning
algorithm by writing the name of the tuner to the file. Reading the file
returns the name of the currently selected tuner.
Link: https://lkml.kernel.org/r/20260310010529.91162-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce a new goal-based DAMOS quota auto-tuning algorithm, namely
DAMOS_QUOTA_GOAL_TUNER_TEMPORAL ("temporal" in short). The algorithm aims
to trigger the DAMOS action only for a limited period, to achieve the goal
as soon as possible. During that period, it uses as much quota as
allowed. Once the goal is achieved, it sets the quota to zero,
effectively deactivating the scheme.
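Schematically, the tuner's decision looks like this (a sketch of the
intent, not the actual implementation; 'goal_achieved' and
'max_allowed_quota' are illustrative placeholders, 'esz' is the effective
size quota field of struct damos_quota):

if (goal_achieved)
        quota->esz = 0;                 /* goal met: deactivate the scheme */
else
        quota->esz = max_allowed_quota; /* use as much quota as allowed */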
Link: https://lkml.kernel.org/r/20260310010529.91162-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
User-explicit quotas (size and time quotas) having a zero value means the
quotas are unset. And the effective size quota is set to the minimum value
of the explicit quotas. When quota goals are set, the goal-based quota
tuner can make it lower. But the single existing tuner never sets the
effective size quota to zero. Relying on that fact, DAMON core assumes a
zero effective quota means the user has set no quota.
Multiple tuners are now allowed, though. In the future, some tuners might
want to set a zero effective size quota, and there is no reason to
restrict that. Meanwhile, under the current implementation, doing so would
only deactivate all quotas and make the scheme work at its full speed.
Introduce a dedicated function for checking whether no quota is set. The
function checks whether the user-set explicit quotas are zero and no goal
is installed. It is decoupled from the zero effective quota, and hence
allows future tuners to set a zero effective quota to intentionally
deactivate the scheme.
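A minimal sketch of such a predicate, assuming the existing struct
damos_quota fields ('ms', 'sz' and 'goals'); the actual name may differ:

static bool damos_quota_is_unset(const struct damos_quota *quota)
{
        /* Explicit quotas unset and no goal installed. */
        return !quota->ms && !quota->sz && list_empty(&quota->goals);
}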
Link: https://lkml.kernel.org/r/20260310010529.91162-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>