From 3091b61505028702438b3677820b2b786eda755b Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 4 Jun 2025 20:03:26 +0100 Subject: [PATCH 001/370] mm: restore documentation for __free_pages() The documentation was converted to be for ___free_pages(), which doesn't need documentation as it's static. Link: https://lkml.kernel.org/r/20250604190327.814086-1-willy@infradead.org Fixes: 8c57b687e833 (mm, bpf: Introduce free_pages_nolock()) Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Vlastimil Babka Cc: Alexei Starovoitov Signed-off-by: Andrew Morton --- mm/page_alloc.c | 42 +++++++++++++++++++++--------------------- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2ef3c07266b3..31041e2aa33a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5028,27 +5028,6 @@ unsigned long get_zeroed_page_noprof(gfp_t gfp_mask) } EXPORT_SYMBOL(get_zeroed_page_noprof); -/** - * ___free_pages - Free pages allocated with alloc_pages(). - * @page: The page pointer returned from alloc_pages(). - * @order: The order of the allocation. - * @fpi_flags: Free Page Internal flags. - * - * This function can free multi-page allocations that are not compound - * pages. It does not check that the @order passed in matches that of - * the allocation, so it is easy to leak memory. Freeing more memory - * than was allocated will probably emit a warning. - * - * If the last reference to this page is speculative, it will be released - * by put_page() which only frees the first page of a non-compound - * allocation. To prevent the remaining pages from being leaked, we free - * the subsequent pages here. If you want to use the page's reference - * count to decide when to free the allocation, you should allocate a - * compound page, and use put_page() instead of __free_pages(). - * - * Context: May be called in interrupt context or while holding a normal - * spinlock, but not in NMI context or while holding a raw spinlock. - */ static void ___free_pages(struct page *page, unsigned int order, fpi_t fpi_flags) { @@ -5066,6 +5045,27 @@ static void ___free_pages(struct page *page, unsigned int order, fpi_flags); } } + +/** + * __free_pages - Free pages allocated with alloc_pages(). + * @page: The page pointer returned from alloc_pages(). + * @order: The order of the allocation. + * + * This function can free multi-page allocations that are not compound + * pages. It does not check that the @order passed in matches that of + * the allocation, so it is easy to leak memory. Freeing more memory + * than was allocated will probably emit a warning. + * + * If the last reference to this page is speculative, it will be released + * by put_page() which only frees the first page of a non-compound + * allocation. To prevent the remaining pages from being leaked, we free + * the subsequent pages here. If you want to use the page's reference + * count to decide when to free the allocation, you should allocate a + * compound page, and use put_page() instead of __free_pages(). + * + * Context: May be called in interrupt context or while holding a normal + * spinlock, but not in NMI context or while holding a raw spinlock. 
+ */ void __free_pages(struct page *page, unsigned int order) { ___free_pages(page, order, FPI_NONE); From 80d1a8130934ae7ff64c40e5495169ac40f3a916 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Wed, 4 Jun 2025 19:03:08 +0100 Subject: [PATCH 002/370] docs/mm: expand vma doc to highlight pte freeing, non-vma traversal The process addresses documentation already contains a great deal of information about mmap/VMA locking and page table traversal and manipulation. However it waves it hands about non-VMA traversal. Add a section for this and explain the caveats around this kind of traversal. Additionally, commit 6375e95f381e ("mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)") caused zapping to also free empty PTE page tables. Highlight this. Link: https://lkml.kernel.org/r/20250604180308.137116-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Bagas Sanjaya Cc: Jann Horn Cc: Jonathan Corbet Cc: Liam Howlett Cc: Qi Zheng Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- Documentation/mm/process_addrs.rst | 54 ++++++++++++++++++++++++++---- 1 file changed, 48 insertions(+), 6 deletions(-) diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst index e6756e78b476..be49e2a269e4 100644 --- a/Documentation/mm/process_addrs.rst +++ b/Documentation/mm/process_addrs.rst @@ -303,7 +303,9 @@ There are four key operations typically performed on page tables: 1. **Traversing** page tables - Simply reading page tables in order to traverse them. This only requires that the VMA is kept stable, so a lock which establishes this suffices for traversal (there are also lockless variants - which eliminate even this requirement, such as :c:func:`!gup_fast`). + which eliminate even this requirement, such as :c:func:`!gup_fast`). There is + also a special case of page table traversal for non-VMA regions which we + consider separately below. 2. **Installing** page table mappings - Whether creating a new mapping or modifying an existing one in such a way as to change its identity. This requires that the VMA is kept stable via an mmap or VMA lock (explicitly not @@ -335,15 +337,13 @@ ahead and perform these operations on page tables (though internally, kernel operations that perform writes also acquire internal page table locks to serialise - see the page table implementation detail section for more details). +.. note:: We free empty PTE tables on zap under the RCU lock - this does not + change the aforementioned locking requirements around zapping. + When **installing** page table entries, the mmap or VMA lock must be held to keep the VMA stable. We explore why this is in the page table locking details section below. -.. warning:: Page tables are normally only traversed in regions covered by VMAs. - If you want to traverse page tables in areas that might not be - covered by VMAs, heavier locking is required. - See :c:func:`!walk_page_range_novma` for details. - **Freeing** page tables is an entirely internal memory management operation and has special requirements (see the page freeing section below for more details). @@ -355,6 +355,44 @@ has special requirements (see the page freeing section below for more details). from the reverse mappings, but no other VMAs can be permitted to be accessible and span the specified range. +Traversing non-VMA page tables +------------------------------ + +We've focused above on traversal of page tables belonging to VMAs. 
It is also +possible to traverse page tables which are not represented by VMAs. + +Kernel page table mappings themselves are generally managed but whatever part of +the kernel established them and the aforementioned locking rules do not apply - +for instance vmalloc has its own set of locks which are utilised for +establishing and tearing down page its page tables. + +However, for convenience we provide the :c:func:`!walk_kernel_page_table_range` +function which is synchronised via the mmap lock on the :c:macro:`!init_mm` +kernel instantiation of the :c:struct:`!struct mm_struct` metadata object. + +If an operation requires exclusive access, a write lock is used, but if not, a +read lock suffices - we assert only that at least a read lock has been acquired. + +Since, aside from vmalloc and memory hot plug, kernel page tables are not torn +down all that often - this usually suffices, however any caller of this +functionality must ensure that any additionally required locks are acquired in +advance. + +We also permit a truly unusual case is the traversal of non-VMA ranges in +**userland** ranges, as provided for by :c:func:`!walk_page_range_debug`. + +This has only one user - the general page table dumping logic (implemented in +:c:macro:`!mm/ptdump.c`) - which seeks to expose all mappings for debug purposes +even if they are highly unusual (possibly architecture-specific) and are not +backed by a VMA. + +We must take great care in this case, as the :c:func:`!munmap` implementation +detaches VMAs under an mmap write lock before tearing down page tables under a +downgraded mmap read lock. + +This means such an operation could race with this, and thus an mmap **write** +lock is required. + Lock ordering ------------- @@ -461,6 +499,10 @@ Locking Implementation Details Page table locking details -------------------------- +.. note:: This section explores page table locking requirements for page tables + encompassed by a VMA. See the above section on non-VMA page table + traversal for details on how we handle that case. + In addition to the locks described in the terminology section above, we have additional locks dedicated to page tables: From af827e0904899f14e0cd8e629fea6d55022e53a9 Mon Sep 17 00:00:00 2001 From: Koichiro Den Date: Sat, 31 May 2025 01:23:53 +0900 Subject: [PATCH 003/370] mm: vmscan: apply proportional reclaim pressure for memcg when MGLRU is enabled The scan implementation for MGLRU was missing proportional reclaim pressure for memcg, which contradicts the description in Documentation/admin-guide/cgroup-v2.rst (memory.{low,min} section). This issue can be observed in kselftest cgroup:test_memcontrol (specifically test_memcg_min and test_memcg_low). The following table shows the actual values observed in my local test env (on xfs) and the error "e", which is the symmetric absolute percentage error from the ideal values of 29M for c[0] and 21M for c[1]. test_memcg_min | MGLRU enabled | MGLRU enabled | MGLRU disabled | Without patch | With patch | -----|-----------------|-----------------|--------------- c[0] | 25964544 (e=8%) | 28770304 (e=3%) | 27820032 (e=4%) c[1] | 26214400 (e=9%) | 23998464 (e=4%) | 24776704 (e=6%) test_memcg_low | MGLRU enabled | MGLRU enabled | MGLRU disabled | Without patch | With patch | -----|-----------------|-----------------|--------------- c[0] | 26214400 (e=7%) | 27930624 (e=4%) | 27688960 (e=5%) c[1] | 26214400 (e=9%) | 24764416 (e=6%) | 24920064 (e=6%) Factor out the proportioning logic to a new function and have MGLRU reuse it. 
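To make the proportioning concrete, the following stand-alone sketch models the arithmetic that is being factored out into apply_proportional_protection(). The numbers are invented, SWAP_CLUSTER_MAX is hard-coded to the kernel's value of 32 only for the demo, and this is a userspace model of the math rather than the kernel function itself:

	#include <stdio.h>

	#define SWAP_CLUSTER_MAX 32UL	/* kernel value, hard-coded for the demo */

	/*
	 * Reduce scan pressure by the fraction of the cgroup's usage that
	 * sits under its protection threshold, never dropping below
	 * SWAP_CLUSTER_MAX so reclaim keeps making forward progress.
	 */
	static unsigned long scale_scan(unsigned long scan, unsigned long protection,
					unsigned long cgroup_size)
	{
		if (!protection)
			return scan;
		if (cgroup_size < protection)
			cgroup_size = protection;	/* mirror the TOCTOU guard */
		scan -= scan * protection / (cgroup_size + 1);
		return scan < SWAP_CLUSTER_MAX ? SWAP_CLUSTER_MAX : scan;
	}

	int main(void)
	{
		/* 10000 pages scannable, 5000 pages of usage, 3000 pages protected */
		printf("scaled scan: %lu pages\n", scale_scan(10000, 3000, 5000));
		return 0;
	}

With these made-up numbers roughly 60% of the scan pressure is dropped, which is the proportional behaviour documented in cgroup-v2.rst that the MGLRU path was missing.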
While at it, update the eviction behavior via debugfs 'lru_gen' interface ('-' command with an explicit 'nr_to_reclaim' parameter) to ensure eviction is limited to the specified number. Link: https://lkml.kernel.org/r/20250530162353.541882-1-den@valinux.co.jp Signed-off-by: Koichiro Den Reviewed-by: Yuanchu Xie Cc: Yu Zhao Cc: Axel Rasmussen Signed-off-by: Andrew Morton --- mm/vmscan.c | 149 ++++++++++++++++++++++++++++------------------------ 1 file changed, 79 insertions(+), 70 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index f8dfd2864bbf..ad47244179bb 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2482,6 +2482,69 @@ static inline void calculate_pressure_balance(struct scan_control *sc, *denominator = ap + fp; } +static unsigned long apply_proportional_protection(struct mem_cgroup *memcg, + struct scan_control *sc, unsigned long scan) +{ + unsigned long min, low; + + mem_cgroup_protection(sc->target_mem_cgroup, memcg, &min, &low); + + if (min || low) { + /* + * Scale a cgroup's reclaim pressure by proportioning + * its current usage to its memory.low or memory.min + * setting. + * + * This is important, as otherwise scanning aggression + * becomes extremely binary -- from nothing as we + * approach the memory protection threshold, to totally + * nominal as we exceed it. This results in requiring + * setting extremely liberal protection thresholds. It + * also means we simply get no protection at all if we + * set it too low, which is not ideal. + * + * If there is any protection in place, we reduce scan + * pressure by how much of the total memory used is + * within protection thresholds. + * + * There is one special case: in the first reclaim pass, + * we skip over all groups that are within their low + * protection. If that fails to reclaim enough pages to + * satisfy the reclaim goal, we come back and override + * the best-effort low protection. However, we still + * ideally want to honor how well-behaved groups are in + * that case instead of simply punishing them all + * equally. As such, we reclaim them based on how much + * memory they are using, reducing the scan pressure + * again by how much of the total memory used is under + * hard protection. + */ + unsigned long cgroup_size = mem_cgroup_size(memcg); + unsigned long protection; + + /* memory.low scaling, make sure we retry before OOM */ + if (!sc->memcg_low_reclaim && low > min) { + protection = low; + sc->memcg_low_skipped = 1; + } else { + protection = min; + } + + /* Avoid TOCTOU with earlier protection check */ + cgroup_size = max(cgroup_size, protection); + + scan -= scan * protection / (cgroup_size + 1); + + /* + * Minimally target SWAP_CLUSTER_MAX pages to keep + * reclaim moving forwards, avoiding decrementing + * sc->priority further than desirable. + */ + scan = max(scan, SWAP_CLUSTER_MAX); + } + return scan; +} + /* * Determine how aggressively the anon and file LRU lists should be * scanned. @@ -2560,70 +2623,10 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, for_each_evictable_lru(lru) { bool file = is_file_lru(lru); unsigned long lruvec_size; - unsigned long low, min; unsigned long scan; lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx); - mem_cgroup_protection(sc->target_mem_cgroup, memcg, - &min, &low); - - if (min || low) { - /* - * Scale a cgroup's reclaim pressure by proportioning - * its current usage to its memory.low or memory.min - * setting. 
- * - * This is important, as otherwise scanning aggression - * becomes extremely binary -- from nothing as we - * approach the memory protection threshold, to totally - * nominal as we exceed it. This results in requiring - * setting extremely liberal protection thresholds. It - * also means we simply get no protection at all if we - * set it too low, which is not ideal. - * - * If there is any protection in place, we reduce scan - * pressure by how much of the total memory used is - * within protection thresholds. - * - * There is one special case: in the first reclaim pass, - * we skip over all groups that are within their low - * protection. If that fails to reclaim enough pages to - * satisfy the reclaim goal, we come back and override - * the best-effort low protection. However, we still - * ideally want to honor how well-behaved groups are in - * that case instead of simply punishing them all - * equally. As such, we reclaim them based on how much - * memory they are using, reducing the scan pressure - * again by how much of the total memory used is under - * hard protection. - */ - unsigned long cgroup_size = mem_cgroup_size(memcg); - unsigned long protection; - - /* memory.low scaling, make sure we retry before OOM */ - if (!sc->memcg_low_reclaim && low > min) { - protection = low; - sc->memcg_low_skipped = 1; - } else { - protection = min; - } - - /* Avoid TOCTOU with earlier protection check */ - cgroup_size = max(cgroup_size, protection); - - scan = lruvec_size - lruvec_size * protection / - (cgroup_size + 1); - - /* - * Minimally target SWAP_CLUSTER_MAX pages to keep - * reclaim moving forwards, avoiding decrementing - * sc->priority further than desirable. - */ - scan = max(scan, SWAP_CLUSTER_MAX); - } else { - scan = lruvec_size; - } - + scan = apply_proportional_protection(memcg, sc, lruvec_size); scan >>= sc->priority; /* @@ -4554,8 +4557,9 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca return true; } -static int scan_folios(struct lruvec *lruvec, struct scan_control *sc, - int type, int tier, struct list_head *list) +static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec, + struct scan_control *sc, int type, int tier, + struct list_head *list) { int i; int gen; @@ -4564,7 +4568,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc, int scanned = 0; int isolated = 0; int skipped = 0; - int remaining = MAX_LRU_BATCH; + int remaining = min(nr_to_scan, MAX_LRU_BATCH); struct lru_gen_folio *lrugen = &lruvec->lrugen; struct mem_cgroup *memcg = lruvec_memcg(lruvec); @@ -4675,7 +4679,8 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness) return positive_ctrl_err(&sp, &pv); } -static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness, +static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec, + struct scan_control *sc, int swappiness, int *type_scanned, struct list_head *list) { int i; @@ -4687,7 +4692,7 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw *type_scanned = type; - scanned = scan_folios(lruvec, sc, type, tier, list); + scanned = scan_folios(nr_to_scan, lruvec, sc, type, tier, list); if (scanned) return scanned; @@ -4697,7 +4702,8 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw return 0; } -static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness) +static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec, + struct 
scan_control *sc, int swappiness) { int type; int scanned; @@ -4716,7 +4722,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap spin_lock_irq(&lruvec->lru_lock); - scanned = isolate_folios(lruvec, sc, swappiness, &type, &list); + scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list); scanned += try_to_inc_min_seq(lruvec, swappiness); @@ -4837,6 +4843,8 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s if (nr_to_scan && !mem_cgroup_online(memcg)) return nr_to_scan; + nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan); + /* try to get away with not aging at the default priority */ if (!success || sc->priority == DEF_PRIORITY) return nr_to_scan >> sc->priority; @@ -4889,7 +4897,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) if (nr_to_scan <= 0) break; - delta = evict_folios(lruvec, sc, swappiness); + delta = evict_folios(nr_to_scan, lruvec, sc, swappiness); if (!delta) break; @@ -5510,7 +5518,8 @@ static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_co if (sc->nr_reclaimed >= nr_to_reclaim) return 0; - if (!evict_folios(lruvec, sc, swappiness)) + if (!evict_folios(nr_to_reclaim - sc->nr_reclaimed, lruvec, sc, + swappiness)) return 0; cond_resched(); From fee8870a09c84347f1bd51ba094489869a0df33d Mon Sep 17 00:00:00 2001 From: Ye Liu Date: Fri, 30 May 2025 13:58:55 +0800 Subject: [PATCH 004/370] tools/mm: add script to display page state for a given PID and VADDR Introduces a new drgn script, `show_page_info.py`, which allows users to analyze the state of a page given a process ID (PID) and a virtual address (VADDR). This can help kernel developers or debuggers easily inspect page-related information in a live kernel or vmcore. The script extracts information such as the page flags, mapping, and other metadata relevant to diagnosing memory issues. Output example: sudo ./show_page_info.py 1 0x7fc988181000 PID: 1 Comm: systemd mm: 0xffff8d22c4089700 RAW: 0017ffffc000416c fffff939062ff708 fffff939062ffe08 ffff8d23062a12a8 RAW: 0000000000000000 ffff8d2323438f60 0000002500000007 ffff8d23203ff500 Page Address: 0xfffff93905664e00 Page Flags: PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active| PG_private|PG_reported|PG_has_hwpoisoned Page Size: 4096 Page PFN: 0x159938 Page Physical: 0x159938000 Page Virtual: 0xffff8d2319938000 Page Refcount: 37 Page Mapcount: 7 Page Index: 0x0 Page Memcg Data: 0xffff8d23203ff500 Memcg Name: init.scope Memcg Path: /sys/fs/cgroup/memory/init.scope Page Mapping: 0xffff8d23062a12a8 Page Anon/File: File Page VMA: 0xffff8d22e06e0e40 VMA Start: 0x7fc988181000 VMA End: 0x7fc988185000 This page is part of a compound page. This page is the head page of a compound page. Head Page: 0xfffff93905664e00 Compound Order: 2 Number of Pages: 4 Link: https://lkml.kernel.org/r/20250530055855.687067-1-ye.liu@linux.dev Signed-off-by: Ye Liu Tested-by: SeongJae Park Reviewed-by: Stephen Brennan Cc: Florian Weimer Cc: Omar Sandoval Cc: "Paul E . 
McKenney" Cc: Sweet Tea Dorminy Signed-off-by: Andrew Morton --- MAINTAINERS | 5 ++ tools/mm/show_page_info.py | 169 +++++++++++++++++++++++++++++++++++++ 2 files changed, 174 insertions(+) create mode 100644 tools/mm/show_page_info.py diff --git a/MAINTAINERS b/MAINTAINERS index fad6cb025a19..9dd4111d7d96 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -18818,6 +18818,11 @@ F: Documentation/mm/page_table_check.rst F: include/linux/page_table_check.h F: mm/page_table_check.c +PAGE STATE DEBUG SCRIPT +M: Ye Liu +S: Maintained +F: tools/mm/show_page_info.py + PANASONIC LAPTOP ACPI EXTRAS DRIVER M: Kenneth Chan L: platform-driver-x86@vger.kernel.org diff --git a/tools/mm/show_page_info.py b/tools/mm/show_page_info.py new file mode 100644 index 000000000000..c46d8ea283d7 --- /dev/null +++ b/tools/mm/show_page_info.py @@ -0,0 +1,169 @@ +#!/usr/bin/env drgn +# SPDX-License-Identifier: GPL-2.0-only +# Copyright (C) 2025 Ye Liu + +import argparse +import sys +from drgn import Object, FaultError, PlatformFlags, cast +from drgn.helpers.linux import find_task, follow_page, page_size +from drgn.helpers.linux.mm import ( + decode_page_flags, page_to_pfn, page_to_phys, page_to_virt, vma_find, + PageSlab, PageCompound, PageHead, PageTail, compound_head, compound_order, compound_nr +) +from drgn.helpers.linux.cgroup import cgroup_name, cgroup_path + +DESC = """ +This is a drgn script to show the page state. +For more info on drgn, visit https://github.com/osandov/drgn. +""" + +def format_page_data(page): + """ + Format raw page data into a readable hex dump with "RAW:" prefix. + + :param page: drgn.Object instance representing the page. + :return: Formatted string of memory contents. + """ + try: + address = page.value_() + size = prog.type("struct page").size + + if prog.platform.flags & PlatformFlags.IS_64_BIT: + word_size = 8 + else: + word_size = 4 + num_words = size // word_size + + values = [] + for i in range(num_words): + word_address = address + i * word_size + word = prog.read_word(word_address) + values.append(f"{word:0{word_size * 2}x}") + + lines = [f"RAW: {' '.join(values[i:i + 4])}" for i in range(0, len(values), 4)] + + return "\n".join(lines) + + except FaultError as e: + return f"Error reading memory: {e}" + except Exception as e: + return f"Unexpected error: {e}" + +def get_memcg_info(page): + """Retrieve memory cgroup information for a page.""" + try: + MEMCG_DATA_OBJEXTS = prog.constant("MEMCG_DATA_OBJEXTS").value_() + MEMCG_DATA_KMEM = prog.constant("MEMCG_DATA_KMEM").value_() + mask = prog.constant('__NR_MEMCG_DATA_FLAGS').value_() - 1 + memcg_data = page.memcg_data.read_() + if memcg_data & MEMCG_DATA_OBJEXTS: + slabobj_ext = cast("struct slabobj_ext *", memcg_data & ~mask) + memcg = slabobj_ext.objcg.memcg.value_() + elif memcg_data & MEMCG_DATA_KMEM: + objcg = cast("struct obj_cgroup *", memcg_data & ~mask) + memcg = objcg.memcg.value_() + else: + memcg = cast("struct mem_cgroup *", memcg_data & ~mask) + + if memcg.value_() == 0: + return "none", "/sys/fs/cgroup/memory/" + cgrp = memcg.css.cgroup + return cgroup_name(cgrp).decode(), f"/sys/fs/cgroup/memory{cgroup_path(cgrp).decode()}" + except FaultError as e: + return "unknown", f"Error retrieving memcg info: {e}" + except Exception as e: + return "unknown", f"Unexpected error: {e}" + +def show_page_state(page, addr, mm, pid, task): + """Display detailed information about a page.""" + try: + print(f'PID: {pid} Comm: {task.comm.string_().decode()} mm: {hex(mm)}') + try: + print(format_page_data(page)) + except FaultError as e: + 
print(f"Error reading page data: {e}") + fields = { + "Page Address": hex(page.value_()), + "Page Flags": decode_page_flags(page), + "Page Size": prog["PAGE_SIZE"].value_(), + "Page PFN": hex(page_to_pfn(page).value_()), + "Page Physical": hex(page_to_phys(page).value_()), + "Page Virtual": hex(page_to_virt(page).value_()), + "Page Refcount": page._refcount.counter.value_(), + "Page Mapcount": page._mapcount.counter.value_(), + "Page Index": hex(page.__folio_index.value_()), + "Page Memcg Data": hex(page.memcg_data.value_()), + } + + memcg_name, memcg_path = get_memcg_info(page) + fields["Memcg Name"] = memcg_name + fields["Memcg Path"] = memcg_path + fields["Page Mapping"] = hex(page.mapping.value_()) + fields["Page Anon/File"] = "Anon" if page.mapping.value_() & 0x1 else "File" + + try: + vma = vma_find(mm, addr) + fields["Page VMA"] = hex(vma.value_()) + fields["VMA Start"] = hex(vma.vm_start.value_()) + fields["VMA End"] = hex(vma.vm_end.value_()) + except FaultError as e: + fields["Page VMA"] = "Unavailable" + fields["VMA Start"] = "Unavailable" + fields["VMA End"] = "Unavailable" + print(f"Error retrieving VMA information: {e}") + + # Calculate the maximum field name length for alignment + max_field_len = max(len(field) for field in fields) + + # Print aligned fields + for field, value in fields.items(): + print(f"{field}:".ljust(max_field_len + 2) + f"{value}") + + # Additional information about the page + if PageSlab(page): + print("This page belongs to the slab allocator.") + + if PageCompound(page): + print("This page is part of a compound page.") + if PageHead(page): + print("This page is the head page of a compound page.") + if PageTail(page): + print("This page is the tail page of a compound page.") + print(f"{'Head Page:'.ljust(max_field_len + 2)}{hex(compound_head(page).value_())}") + print(f"{'Compound Order:'.ljust(max_field_len + 2)}{compound_order(page).value_()}") + print(f"{'Number of Pages:'.ljust(max_field_len + 2)}{compound_nr(page).value_()}") + else: + print("This page is not part of a compound page.") + except FaultError as e: + print(f"Error accessing page state: {e}") + except Exception as e: + print(f"Unexpected error: {e}") + +def main(): + """Main function to parse arguments and display page state.""" + parser = argparse.ArgumentParser(description=DESC, formatter_class=argparse.RawTextHelpFormatter) + parser.add_argument('pid', metavar='PID', type=int, help='Target process ID (PID)') + parser.add_argument('vaddr', metavar='VADDR', type=str, help='Target virtual address in hexadecimal format (e.g., 0x7fff1234abcd)') + args = parser.parse_args() + + try: + vaddr = int(args.vaddr, 16) + except ValueError: + sys.exit(f"Error: Invalid virtual address format: {args.vaddr}") + + try: + task = find_task(args.pid) + mm = task.mm + page = follow_page(mm, vaddr) + + if page: + show_page_state(page, vaddr, mm, args.pid, task) + else: + sys.exit(f"Address {hex(vaddr)} is not mapped.") + except FaultError as e: + sys.exit(f"Error accessing task or memory: {e}") + except Exception as e: + sys.exit(f"Unexpected error: {e}") + +if __name__ == "__main__": + main() From de195c67bfcbc770bb6327bbfd3010f1c371216a Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Thu, 29 May 2025 18:15:45 +0100 Subject: [PATCH 005/370] mm: ksm: have KSM VMA checks not require a VMA pointer Patch series "mm: ksm: prevent KSM from breaking merging of new VMAs", v3. 
When KSM-by-default is established using prctl(PR_SET_MEMORY_MERGE), this defaults all newly mapped VMAs to having VM_MERGEABLE set, and thus makes them available to KSM for samepage merging. It also sets VM_MERGEABLE in all existing VMAs. However this causes an issue upon mapping of new VMAs - the initial flags will never have VM_MERGEABLE set when attempting a merge with adjacent VMAs (this is set later in the mmap() logic), and adjacent VMAs will ALWAYS have VM_MERGEABLE set. This renders all newly mapped VMAs unmergeable. To avoid this, this series performs the check for PR_SET_MEMORY_MERGE far earlier in the mmap() logic, prior to the merge being attempted. However we run into complexity with the depreciated .mmap() callback - if a driver hooks this, it might change flags which adjust KSM merge eligibility. We have to worry about this because, while KSM is only applicable to private mappings, this includes both anonymous and MAP_PRIVATE-mapped file-backed mappings. This isn't a problem for brk(), where the VMA must be anonymous. However in mmap() we must be conservative - if the VMA is anonymous then we can always proceed, however if not, we permit only shmem mappings (whose .mmap hook does not affect KSM eligibility) and drivers which implement .mmap_prepare() (invoked prior to the KSM eligibility check). If we can't be sure of the driver changing things, then we maintain the same behaviour of performing the KSM check later in the mmap() logic (and thus losing new VMA mergeability). A great many use-cases for this logic will use anonymous mappings any rate, so this change should already cover the majority of actual KSM use-cases. This patch (of 4): In subsequent commits we are going to determine KSM eligibility prior to a VMA being constructed, at which point we will of course not yet have access to a VMA pointer. It is trivial to boil down the check logic to be parameterised on mm_struct, file and VMA flags, so do so. As a part of this change, additionally expose and use file_is_dax() to determine whether a file is being mapped under a DAX inode. Link: https://lkml.kernel.org/r/cover.1748537921.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/36ad13eb50cdbd8aac6dcfba22c65d5031667295.1748537921.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: David Hildenbrand Reviewed-by: Chengming Zhou Reviewed-by: Vlastimil Babka Reviewed-by: Xu Xin Reviewed-by: Liam R. Howlett Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Cc: Stefan Roesch Signed-off-by: Andrew Morton --- include/linux/fs.h | 7 ++++++- mm/ksm.c | 34 +++++++++++++++++++++------------- 2 files changed, 27 insertions(+), 14 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 040c0036320f..62634af97da6 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3726,9 +3726,14 @@ void setattr_copy(struct mnt_idmap *, struct inode *inode, extern int file_update_time(struct file *file); +static inline bool file_is_dax(const struct file *file) +{ + return file && IS_DAX(file->f_mapping->host); +} + static inline bool vma_is_dax(const struct vm_area_struct *vma) { - return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); + return file_is_dax(vma->vm_file); } static inline bool vma_is_fsdax(struct vm_area_struct *vma) diff --git a/mm/ksm.c b/mm/ksm.c index 8583fb91ef13..08d486f188ff 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -677,28 +677,33 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr, bool lock_v return (ret & VM_FAULT_OOM) ? 
-ENOMEM : 0; } -static bool vma_ksm_compatible(struct vm_area_struct *vma) +static bool ksm_compatible(const struct file *file, vm_flags_t vm_flags) { - if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE | VM_PFNMAP | - VM_IO | VM_DONTEXPAND | VM_HUGETLB | - VM_MIXEDMAP| VM_DROPPABLE)) + if (vm_flags & (VM_SHARED | VM_MAYSHARE | VM_PFNMAP | + VM_IO | VM_DONTEXPAND | VM_HUGETLB | + VM_MIXEDMAP | VM_DROPPABLE)) return false; /* just ignore the advice */ - if (vma_is_dax(vma)) + if (file_is_dax(file)) return false; #ifdef VM_SAO - if (vma->vm_flags & VM_SAO) + if (vm_flags & VM_SAO) return false; #endif #ifdef VM_SPARC_ADI - if (vma->vm_flags & VM_SPARC_ADI) + if (vm_flags & VM_SPARC_ADI) return false; #endif return true; } +static bool vma_ksm_compatible(struct vm_area_struct *vma) +{ + return ksm_compatible(vma->vm_file, vma->vm_flags); +} + static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm, unsigned long addr) { @@ -2696,14 +2701,17 @@ static int ksm_scan_thread(void *nothing) return 0; } +static bool __ksm_should_add_vma(const struct file *file, vm_flags_t vm_flags) +{ + if (vm_flags & VM_MERGEABLE) + return false; + + return ksm_compatible(file, vm_flags); +} + static void __ksm_add_vma(struct vm_area_struct *vma) { - unsigned long vm_flags = vma->vm_flags; - - if (vm_flags & VM_MERGEABLE) - return; - - if (vma_ksm_compatible(vma)) + if (__ksm_should_add_vma(vma->vm_file, vma->vm_flags)) vm_flags_set(vma, VM_MERGEABLE); } From b914c47d46da2ac66425521d304cbc8729153556 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Thu, 29 May 2025 18:15:46 +0100 Subject: [PATCH 006/370] mm: ksm: refer to special VMAs via VM_SPECIAL in ksm_compatible() There's no need to spell out all the special cases, also doing it this way makes it absolutely clear that we preclude unmergeable VMAs in general, and puts the other excluded flags in stark and clear contrast. Link: https://lkml.kernel.org/r/c8be5b055163b164c8824020164076ee3b9389bd.1748537921.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: David Hildenbrand Reviewed-by: Chengming Zhou Reviewed-by: Vlastimil Babka Reviewed-by: Xu Xin Reviewed-by: Liam R. Howlett Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Cc: Stefan Roesch Signed-off-by: Andrew Morton --- mm/ksm.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/mm/ksm.c b/mm/ksm.c index 08d486f188ff..d0c763abd499 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -679,9 +679,8 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr, bool lock_v static bool ksm_compatible(const struct file *file, vm_flags_t vm_flags) { - if (vm_flags & (VM_SHARED | VM_MAYSHARE | VM_PFNMAP | - VM_IO | VM_DONTEXPAND | VM_HUGETLB | - VM_MIXEDMAP | VM_DROPPABLE)) + if (vm_flags & (VM_SHARED | VM_MAYSHARE | VM_SPECIAL | + VM_HUGETLB | VM_DROPPABLE)) return false; /* just ignore the advice */ if (file_is_dax(file)) From cf7e7a3503df0b71afd68ee84e9a09d4514cc2dd Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Thu, 29 May 2025 18:15:47 +0100 Subject: [PATCH 007/370] mm: prevent KSM from breaking VMA merging for new VMAs If a user wishes to enable KSM mergeability for an entire process and all fork/exec'd processes that come after it, they use the prctl() PR_SET_MEMORY_MERGE operation. This defaults all newly mapped VMAs to have the VM_MERGEABLE VMA flag set (in order to indicate they are KSM mergeable), as well as setting this flag for all existing VMAs and propagating this across fork/exec. 
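For reference, opting a process into this mode is a single prctl() call, roughly as below. The fallback define is only needed with older userspace headers, and on kernels without KSM support the call fails with EINVAL (as the selftest later in this series relies on):

	#include <errno.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/prctl.h>

	#ifndef PR_SET_MEMORY_MERGE
	#define PR_SET_MEMORY_MERGE 67	/* value from the uapi prctl header */
	#endif

	int main(void)
	{
		/*
		 * Mark all compatible VMAs of this process, and of future
		 * children across fork/exec, as KSM-mergeable.
		 */
		if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0))
			fprintf(stderr, "PR_SET_MEMORY_MERGE: %s\n", strerror(errno));
		return 0;
	}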
However it also breaks VMA merging for new VMAs, both in the process and all forked (and fork/exec'd) child processes. This is because when a new mapping is proposed, the flags specified will never have VM_MERGEABLE set. However all adjacent VMAs will already have VM_MERGEABLE set, rendering VMAs unmergeable by default. To work around this, we try to set the VM_MERGEABLE flag prior to attempting a merge. In the case of brk() this can always be done. However on mmap() things are more complicated - while KSM is not supported for MAP_SHARED file-backed mappings, it is supported for MAP_PRIVATE file-backed mappings. These mappings may have deprecated .mmap() callbacks specified which could, in theory, adjust flags and thus KSM eligibility. So we check to determine whether this is possible. If not, we set VM_MERGEABLE prior to the merge attempt on mmap(), otherwise we retain the previous behaviour. This fixes VMA merging for all new anonymous mappings, which covers the majority of real-world cases, so we should see a significant improvement in VMA mergeability. For MAP_PRIVATE file-backed mappings, those which implement the .mmap_prepare() hook and shmem are both known to be safe, so we allow these, disallowing all other cases. Also add stubs for newly introduced function invocations to VMA userland testing. [lorenzo.stoakes@oracle.com: correctly invoke late KSM check after mmap hook] Link: https://lkml.kernel.org/r/5861f8f6-cf5a-4d82-a062-139fb3f9cddb@lucifer.local Link: https://lkml.kernel.org/r/3ba660af716d87a18ca5b4e635f2101edeb56340.1748537921.git.lorenzo.stoakes@oracle.com Fixes: d7597f59d1d3 ("mm: add new api to enable ksm per process") # please no backport! Signed-off-by: Lorenzo Stoakes Reviewed-by: Chengming Zhou Acked-by: David Hildenbrand Reviewed-by: Liam R. Howlett Reviewed-by: Vlastimil Babka Reviewed-by: Xu Xin Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Cc: Stefan Roesch Signed-off-by: Andrew Morton --- include/linux/ksm.h | 8 +++-- mm/ksm.c | 18 +++++++---- mm/vma.c | 51 ++++++++++++++++++++++++++++++-- tools/testing/vma/vma_internal.h | 11 +++++++ 4 files changed, 77 insertions(+), 11 deletions(-) diff --git a/include/linux/ksm.h b/include/linux/ksm.h index d73095b5cd96..51787f0b0208 100644 --- a/include/linux/ksm.h +++ b/include/linux/ksm.h @@ -17,8 +17,8 @@ #ifdef CONFIG_KSM int ksm_madvise(struct vm_area_struct *vma, unsigned long start, unsigned long end, int advice, unsigned long *vm_flags); - -void ksm_add_vma(struct vm_area_struct *vma); +vm_flags_t ksm_vma_flags(const struct mm_struct *mm, const struct file *file, + vm_flags_t vm_flags); int ksm_enable_merge_any(struct mm_struct *mm); int ksm_disable_merge_any(struct mm_struct *mm); int ksm_disable(struct mm_struct *mm); @@ -97,8 +97,10 @@ bool ksm_process_mergeable(struct mm_struct *mm); #else /* !CONFIG_KSM */ -static inline void ksm_add_vma(struct vm_area_struct *vma) +static inline vm_flags_t ksm_vma_flags(const struct mm_struct *mm, + const struct file *file, vm_flags_t vm_flags) { + return vm_flags; } static inline int ksm_disable(struct mm_struct *mm) diff --git a/mm/ksm.c b/mm/ksm.c index d0c763abd499..18b3690bb69a 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -2731,16 +2731,22 @@ static int __ksm_del_vma(struct vm_area_struct *vma) return 0; } /** - * ksm_add_vma - Mark vma as mergeable if compatible + * ksm_vma_flags - Update VMA flags to mark as mergeable if compatible * - * @vma: Pointer to vma + * @mm: Proposed VMA's mm_struct + * @file: Proposed VMA's file-backed mapping, if any. 
+ * @vm_flags: Proposed VMA"s flags. + * + * Returns: @vm_flags possibly updated to mark mergeable. */ -void ksm_add_vma(struct vm_area_struct *vma) +vm_flags_t ksm_vma_flags(const struct mm_struct *mm, const struct file *file, + vm_flags_t vm_flags) { - struct mm_struct *mm = vma->vm_mm; + if (test_bit(MMF_VM_MERGE_ANY, &mm->flags) && + __ksm_should_add_vma(file, vm_flags)) + vm_flags |= VM_MERGEABLE; - if (test_bit(MMF_VM_MERGE_ANY, &mm->flags)) - __ksm_add_vma(vma); + return vm_flags; } static void ksm_add_vmas(struct mm_struct *mm) diff --git a/mm/vma.c b/mm/vma.c index fef67a66a095..079540ebfb72 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -32,6 +32,9 @@ struct mmap_state { struct vma_munmap_struct vms; struct ma_state mas_detach; struct maple_tree mt_detach; + + /* Determine if we can check KSM flags early in mmap() logic. */ + bool check_ksm_early; }; #define MMAP_STATE(name, mm_, vmi_, addr_, len_, pgoff_, flags_, file_) \ @@ -2320,6 +2323,11 @@ static void vms_abort_munmap_vmas(struct vma_munmap_struct *vms, vms_complete_munmap_vmas(vms, mas_detach); } +static void update_ksm_flags(struct mmap_state *map) +{ + map->flags = ksm_vma_flags(map->mm, map->file, map->flags); +} + /* * __mmap_prepare() - Prepare to gather any overlapping VMAs that need to be * unmapped once the map operation is completed, check limits, account mapping @@ -2424,6 +2432,7 @@ static int __mmap_new_file_vma(struct mmap_state *map, !(map->flags & VM_MAYWRITE) && (vma->vm_flags & VM_MAYWRITE)); + map->file = vma->vm_file; map->flags = vma->vm_flags; return 0; @@ -2473,6 +2482,11 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap) if (error) goto free_iter_vma; + if (!map->check_ksm_early) { + update_ksm_flags(map); + vm_flags_init(vma, map->flags); + } + #ifdef CONFIG_SPARC64 /* TODO: Fix SPARC ADI! */ WARN_ON_ONCE(!arch_validate_flags(map->flags)); @@ -2490,7 +2504,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap) */ if (!vma_is_anonymous(vma)) khugepaged_enter_vma(vma, map->flags); - ksm_add_vma(vma); *vmap = vma; return 0; @@ -2593,6 +2606,35 @@ static void set_vma_user_defined_fields(struct vm_area_struct *vma, vma->vm_private_data = map->vm_private_data; } +/* + * Are we guaranteed no driver can change state such as to preclude KSM merging? + * If so, let's set the KSM mergeable flag early so we don't break VMA merging. + */ +static bool can_set_ksm_flags_early(struct mmap_state *map) +{ + struct file *file = map->file; + + /* Anonymous mappings have no driver which can change them. */ + if (!file) + return true; + + /* + * If .mmap_prepare() is specified, then the driver will have already + * manipulated state prior to updating KSM flags. So no need to worry + * about mmap callbacks modifying VMA flags after the KSM flag has been + * updated here, which could otherwise affect KSM eligibility. + */ + if (file->f_op->mmap_prepare) + return true; + + /* shmem is safe. */ + if (shmem_file(file)) + return true; + + /* Any other .mmap callback is not safe. 
*/ + return false; +} + static unsigned long __mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, struct list_head *uf) @@ -2604,12 +2646,17 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr, VMA_ITERATOR(vmi, mm, addr); MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file); + map.check_ksm_early = can_set_ksm_flags_early(&map); + error = __mmap_prepare(&map, uf); if (!error && have_mmap_prepare) error = call_mmap_prepare(&map); if (error) goto abort_munmap; + if (map.check_ksm_early) + update_ksm_flags(&map); + /* Attempt to merge with adjacent VMAs... */ if (map.prev || map.next) { VMG_MMAP_STATE(vmg, &map, /* vma = */ NULL); @@ -2721,6 +2768,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, * Note: This happens *after* clearing old mappings in some code paths. */ flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags; + flags = ksm_vma_flags(mm, NULL, flags); if (!may_expand_vm(mm, flags, len >> PAGE_SHIFT)) return -ENOMEM; @@ -2764,7 +2812,6 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, mm->map_count++; validate_mm(mm); - ksm_add_vma(vma); out: perf_event_mmap(vma); mm->total_vm += len >> PAGE_SHIFT; diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 14718ca23a05..0f013784da89 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -1484,4 +1484,15 @@ static inline void vma_set_file(struct vm_area_struct *vma, struct file *file) fput(file); } +static inline bool shmem_file(struct file *) +{ + return false; +} + +static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct file *, + vm_flags_t vm_flags) +{ + return vm_flags; +} + #endif /* __MM_VMA_INTERNAL_H */ From 4a1ff347e44ce45e8a5c15f466574583b019e4f9 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Thu, 29 May 2025 18:15:48 +0100 Subject: [PATCH 008/370] tools/testing/selftests: add VMA merge tests for KSM merge Add test to assert that we have now allowed merging of VMAs when KSM merging-by-default has been set by prctl(PR_SET_MEMORY_MERGE, ...). We simply perform a trivial mapping of adjacent VMAs expecting a merge, however prior to recent changes implementing this mode earlier than before, these merges would not have succeeded. Assert that we have fixed this! Link: https://lkml.kernel.org/r/6dec7aabf062c6b121cfac992c9c716cefdda00c.1748537921.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Chengming Zhou Tested-by: Chengming Zhou Reviewed-by: Liam R. Howlett Acked-by: Vlastimil Babka Cc: Al Viro Cc: Christian Brauner Cc: David Hildenbrand Cc: Jan Kara Cc: Jann Horn Cc: Stefan Roesch Cc: Xu Xin Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/merge.c | 78 ++++++++++++++++++++++++++++++ 1 file changed, 78 insertions(+) diff --git a/tools/testing/selftests/mm/merge.c b/tools/testing/selftests/mm/merge.c index cc26480098ae..150dd5baed2b 100644 --- a/tools/testing/selftests/mm/merge.c +++ b/tools/testing/selftests/mm/merge.c @@ -2,11 +2,13 @@ #define _GNU_SOURCE #include "../kselftest_harness.h" +#include #include #include #include #include #include +#include #include #include #include @@ -34,6 +36,11 @@ FIXTURE_TEARDOWN(merge) { ASSERT_EQ(munmap(self->carveout, 12 * self->page_size), 0); ASSERT_EQ(close_procmap(&self->procmap), 0); + /* + * Clear unconditionally, as some tests set this. It is no issue if this + * fails (KSM may be disabled for instance). 
+ */ + prctl(PR_SET_MEMORY_MERGE, 0, 0, 0, 0); } TEST_F(merge, mprotect_unfaulted_left) @@ -498,4 +505,75 @@ TEST_F(merge, handle_uprobe_upon_merged_vma) remove(probe_file); } +TEST_F(merge, ksm_merge) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + char *ptr, *ptr2; + int err; + + /* + * Map two R/W immediately adjacent to one another, they should + * trivially merge: + * + * |-----------|-----------| + * | R/W | R/W | + * |-----------|-----------| + * ptr ptr2 + */ + + ptr = mmap(&carveout[page_size], page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + ptr2 = mmap(&carveout[2 * page_size], page_size, + PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 2 * page_size); + + /* Unmap the second half of this merged VMA. */ + ASSERT_EQ(munmap(ptr2, page_size), 0); + + /* OK, now enable global KSM merge. We clear this on test teardown. */ + err = prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0); + if (err == -1) { + int errnum = errno; + + /* Only non-failure case... */ + ASSERT_EQ(errnum, EINVAL); + /* ...but indicates we should skip. */ + SKIP(return, "KSM memory merging not supported, skipping."); + } + + /* + * Now map a VMA adjacent to the existing that was just made + * VM_MERGEABLE, this should merge as well. + */ + ptr2 = mmap(&carveout[2 * page_size], page_size, + PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 2 * page_size); + + /* Now this VMA altogether. */ + ASSERT_EQ(munmap(ptr, 2 * page_size), 0); + + /* Try the same operation as before, asserting this also merges fine. */ + ptr = mmap(&carveout[page_size], page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + ptr2 = mmap(&carveout[2 * page_size], page_size, + PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 2 * page_size); +} + TEST_HARNESS_MAIN From cdf48aa83279d4369ec6195f716468950c4440ca Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 28 May 2025 15:20:13 -0400 Subject: [PATCH 009/370] mm/hugetlb: convert hugetlb_change_protection() to folios The for loop inside hugetlb_change_protection() increments by the huge page size: psize = huge_page_size(h); for (; address < end; address += psize) so we are operating on the head page of the huge pages between address and end. We can safely convert the struct page usage to struct folio. 
Link: https://lkml.kernel.org/r/20250528192013.91130-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Reviewed-by: Matthew Wilcox (Oracle) Reviewed-by: Oscar Salvador Cc: Muchun Song Cc: Sidhartha Kumar Signed-off-by: Andrew Morton --- mm/hugetlb.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 9dc95eac558c..7a7df0b2a561 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -7166,11 +7166,11 @@ long hugetlb_change_protection(struct vm_area_struct *vma, /* Nothing to do. */ } else if (unlikely(is_hugetlb_entry_migration(pte))) { swp_entry_t entry = pte_to_swp_entry(pte); - struct page *page = pfn_swap_entry_to_page(entry); + struct folio *folio = pfn_swap_entry_folio(entry); pte_t newpte = pte; if (is_writable_migration_entry(entry)) { - if (PageAnon(page)) + if (folio_test_anon(folio)) entry = make_readable_exclusive_migration_entry( swp_offset(entry)); else From bafa31a1ceab70c70df40ff09b88f359cd0117c4 Mon Sep 17 00:00:00 2001 From: Paul Menzel Date: Tue, 3 Jun 2025 08:13:03 +0200 Subject: [PATCH 010/370] mm: Kconfig: use verb *use* in plural form in description *workloads* is plural requiring the verb *use* in plural form. Link: https://lkml.kernel.org/r/20250603061303.479551-2-pmenzel@molgen.mpg.de Fixes: e13e7922d034 ("mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order") Signed-off-by: Paul Menzel Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Lorenzo Stoakes Reviewed-by: Zi Yan Signed-off-by: Andrew Morton --- mm/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/Kconfig b/mm/Kconfig index 781be3240e21..15716b1a5e21 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1022,7 +1022,7 @@ config PAGE_BLOCK_ORDER or MAX_PAGE_ORDER. Reducing pageblock order can negatively impact THP generation - success rate. If your workloads uses THP heavily, please use this + success rate. If your workloads use THP heavily, please use this option with caution. Don't change if unsure. From e399a07a8a527dd023ca720a338fc508bedd364f Mon Sep 17 00:00:00 2001 From: Caleb Sander Mateos Date: Fri, 11 Apr 2025 10:17:45 -0600 Subject: [PATCH 011/370] mm: remove unused mmap tracepoints The vma_mas_szero and vma_store tracepoints are unused since commit fbcc3104b843 ("mmap: convert __vma_adjust() to use vma iterator"). Remove them so they are no longer listed as available tracepoints. Link: https://lkml.kernel.org/r/20250411161746.1043239-1-csander@purestorage.com Signed-off-by: Caleb Sander Mateos Reported-by: Eric Mueller Reviewed-by: Liam R. 
Howlett Cc: Jann Horn Cc: Lorenzo Stoakes Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Steven Rostedt Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/trace/events/mmap.h | 52 ------------------------------------- 1 file changed, 52 deletions(-) diff --git a/include/trace/events/mmap.h b/include/trace/events/mmap.h index f8d61485de16..ee2843a5daef 100644 --- a/include/trace/events/mmap.h +++ b/include/trace/events/mmap.h @@ -43,58 +43,6 @@ TRACE_EVENT(vm_unmapped_area, __entry->align_offset) ); -TRACE_EVENT(vma_mas_szero, - TP_PROTO(struct maple_tree *mt, unsigned long start, - unsigned long end), - - TP_ARGS(mt, start, end), - - TP_STRUCT__entry( - __field(struct maple_tree *, mt) - __field(unsigned long, start) - __field(unsigned long, end) - ), - - TP_fast_assign( - __entry->mt = mt; - __entry->start = start; - __entry->end = end; - ), - - TP_printk("mt_mod %p, (NULL), SNULL, %lu, %lu,", - __entry->mt, - (unsigned long) __entry->start, - (unsigned long) __entry->end - ) -); - -TRACE_EVENT(vma_store, - TP_PROTO(struct maple_tree *mt, struct vm_area_struct *vma), - - TP_ARGS(mt, vma), - - TP_STRUCT__entry( - __field(struct maple_tree *, mt) - __field(struct vm_area_struct *, vma) - __field(unsigned long, vm_start) - __field(unsigned long, vm_end) - ), - - TP_fast_assign( - __entry->mt = mt; - __entry->vma = vma; - __entry->vm_start = vma->vm_start; - __entry->vm_end = vma->vm_end - 1; - ), - - TP_printk("mt_mod %p, (%p), STORE, %lu, %lu,", - __entry->mt, __entry->vma, - (unsigned long) __entry->vm_start, - (unsigned long) __entry->vm_end - ) -); - - TRACE_EVENT(exit_mmap, TP_PROTO(struct mm_struct *mm), From 369c415e60732b7c8ed33368891581246f580d7a Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Wed, 4 Jun 2025 11:31:24 -0700 Subject: [PATCH 012/370] mm/damon: introduce DAMON_STAT module Patch series "mm/damon: introduce DAMON_STAT for simple and practical access monitoring", v2. DAMON-based access monitoring is not simple due to required DAMON control and results visualizations. Introduce a static kernel module for making it simple. The module can be enabled without manual setup and provides access pattern metrics that easy to fetch and understand the practical access pattern information, namely estimated memory bandwidth and memory idle time percentiles. Background and Problems ======================= DAMON can be used for monitoring data access patterns of the system and workloads. Specifically, users can start DAMON to monitor access events on specific address space with fine controls including address ranges to monitor and time intervals between samplings and aggregations. The resulting access information snapshot contains access frequency (nr_accesses) and how long the frequency was kept (age) for each byte. The monitoring usage is not simple and practical enough for production usage. Users should first start DAMON with a number of parameters, and wait until DAMON's monitoring results capture a reasonable amount of the time data (age). In production, such manual start and wait is impractical to capture useful information from a high number of machines in a timely manner. The monitoring result is also too detailed to be used on production environments. The raw results are hard to be aggregated and/or compared for production environments having a large scale of time, space and machines fleet. Users have to implement and use their own automation of DAMON control and results processing. 
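As a concrete illustration of the kind of throwaway post-processing each user currently has to write, the sketch below models a raw snapshot as a few (size, nr_accesses, age) region records and reduces it to two crude numbers. The record layout and the reduction are purely illustrative; they are not DAMON's in-kernel structures or DAMON_STAT's actual metrics:

	#include <stdio.h>

	/* Illustrative stand-in for one region of a DAMON snapshot. */
	struct region_sample {
		unsigned long bytes;		/* size of the region */
		unsigned int nr_accesses;	/* sampling intervals that saw access */
		unsigned int age;		/* aggregations the pattern persisted */
	};

	int main(void)
	{
		/* invented snapshot: one hot region, one warm, two idle */
		struct region_sample snapshot[] = {
			{   4UL << 20, 18,  3 },
			{  16UL << 20,  2,  1 },
			{  64UL << 20,  0, 40 },
			{ 512UL << 20,  0,  7 },
		};
		unsigned long touched = 0, idle = 0;
		unsigned int i;

		for (i = 0; i < sizeof(snapshot) / sizeof(snapshot[0]); i++) {
			if (snapshot[i].nr_accesses)
				touched += snapshot[i].bytes;
			else
				idle += snapshot[i].bytes;
		}
		printf("recently touched: %lu MiB, idle: %lu MiB\n",
		       touched >> 20, idle >> 20);
		return 0;
	}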
It is repetitive and challenging since there is no good reference or guideline for such automation. Solution: DAMON_STAT ==================== Implement such automation in kernel space as a static kernel module, namely DAMON_STAT. It can be enabled at build, boot, or run time via its build configuration or module parameter. It monitors the entire physical address space with monitoring intervals that auto-tuned for a reasonable amount of access observations and minimum overhead. It converts the raw monitoring results into simpler metrics that can easily be aggregated and compared, namely estimated memory bandwidth and idle time percentiles. Understanding of the metrics and the user interface of DAMON_STAT is essential. Refer to the commit messages of the second and the third patches of this patch series for more details about the metrics. For the user interface, the standard module parameters system is used. Refer to the fourth patch of this patch series for details of the user interface. Discussions =========== The module aims to be useful on production environments constructed with a large number of machines that run a long time. The auto-tuned monitoring intervals ensure a reasonable quality of the outputs. The auto-tuning also ensures its overhead be reasonable and low enough to be enabled always on the production. The simplified monitoring results metrics can be useful for showing both coldness (idle time percentiles) and hotness (memory bandwidth) of the system's access pattern. We expect the information can be useful for assessing system memory utilization and inspiring optimizations or investigations on both kernel and user space memory management logics for large scale fleets. We hence expect the module is good enough to be just used in most environments. For special cases that require a custom access monitoring automation, users will still benefit by using DAMON_STAT as a reference or a guideline for their specialized automation. This patch (of 4): To use DAMON for monitoring access patterns of the system, users should manually start DAMON via DAMON sysfs ABI with a number of parameters for specifying the monitoring target address space, address ranges, and monitoring intervals. After that, users should also wait until desired amount of time data is captured into DAMON's monitoring results. It is bothersome and take a long time to be practical for access monitoring on large fleet level production environments. For access-aware system operations use cases like proactive cold memory reclamation, similar problems existed. We we solved those by introducing dedicated static kernel modules such as DAMON_RECLAIM. Implement such static kernel module for access monitoring, namely DAMON_STAT. It monitors the entire physical address space with auto-tuned monitoring intervals. The auto-tuning is set to capture 4 % of observable access events in each snapshot while keeping the sampling intervals 5 milliseconds in minimum and 10 seconds in maximum. From a few production environments, we confirmed this setup provides high quality monitoring results with minimum overheads. The module therefore receives only one user input, whether to enable or disable it. It can be set on build or boot time via build configuration or kernel boot command line. It can also be overridden at runtime. Note that this commit only implements the DAMON control part of the module. Users could get the monitoring results via damon:damon_aggregated tracepoint, but that's of course not the recommended way. 
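For completeness, overriding the parameter at run time is just a write to the module parameter file, as sketched below. The /sys/module/damon_stat/parameters/enabled path is assumed from the MODULE_PARAM_PREFIX and the 0600 'enabled' parameter added in this patch; at boot the equivalent is passing damon_stat.enabled=Y on the kernel command line:

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* Assumed path for the built-in module's parameter; needs root. */
		int fd = open("/sys/module/damon_stat/parameters/enabled", O_WRONLY);

		if (fd < 0) {
			perror("damon_stat enabled parameter");
			return 1;
		}
		if (write(fd, "Y", 1) != 1)
			perror("write");
		return close(fd);
	}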
Following commits will implement convenient and optimized ways for serving the monitoring results to users. [sj@kernel.org: use IS_ENABLED() for enabled initial value] Link: https://lkml.kernel.org/r/20250604205619.18929-1-sj@kernel.org [sj@kernel.org: reset enabled when DAMON start failed] Link: https://lkml.kernel.org/r/20250706184750.36588-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250604183127.13968-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250604183127.13968-2-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- mm/damon/Kconfig | 16 +++++ mm/damon/Makefile | 1 + mm/damon/stat.c | 146 ++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 163 insertions(+) create mode 100644 mm/damon/stat.c diff --git a/mm/damon/Kconfig b/mm/damon/Kconfig index c93d0c56b963..b3171f9406c1 100644 --- a/mm/damon/Kconfig +++ b/mm/damon/Kconfig @@ -94,4 +94,20 @@ config DAMON_LRU_SORT protect frequently accessed (hot) pages while rarely accessed (cold) pages reclaimed first under memory pressure. +config DAMON_STAT + bool "Build data access monitoring stat (DAMON_STAT)" + depends on DAMON_PADDR + help + This builds the DAMON-based access monitoring statistics subsystem. + It runs DAMON and expose access monitoring results in simple stat + metrics. + +config DAMON_STAT_ENABLED_DEFAULT + bool "Enable DAMON_STAT by default" + depends on DAMON_PADDR + default DAMON_STAT + help + Whether to enable DAMON_STAT by default. Users can disable it in + boot or runtime using its 'enabled' parameter. + endmenu diff --git a/mm/damon/Makefile b/mm/damon/Makefile index 8b49012ba8c3..d8d6bf5f8bff 100644 --- a/mm/damon/Makefile +++ b/mm/damon/Makefile @@ -6,3 +6,4 @@ obj-$(CONFIG_DAMON_PADDR) += ops-common.o paddr.o obj-$(CONFIG_DAMON_SYSFS) += sysfs-common.o sysfs-schemes.o sysfs.o obj-$(CONFIG_DAMON_RECLAIM) += modules-common.o reclaim.o obj-$(CONFIG_DAMON_LRU_SORT) += modules-common.o lru_sort.o +obj-$(CONFIG_DAMON_STAT) += modules-common.o stat.o diff --git a/mm/damon/stat.c b/mm/damon/stat.c new file mode 100644 index 000000000000..dd00f2384b95 --- /dev/null +++ b/mm/damon/stat.c @@ -0,0 +1,146 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Shows data access monitoring resutls in simple metrics. + */ + +#define pr_fmt(fmt) "damon-stat: " fmt + +#include +#include +#include +#include +#include + +#ifdef MODULE_PARAM_PREFIX +#undef MODULE_PARAM_PREFIX +#endif +#define MODULE_PARAM_PREFIX "damon_stat." 
+ +static int damon_stat_enabled_store( + const char *val, const struct kernel_param *kp); + +static const struct kernel_param_ops enabled_param_ops = { + .set = damon_stat_enabled_store, + .get = param_get_bool, +}; + +static bool enabled __read_mostly = IS_ENABLED( + CONFIG_DAMON_STAT_ENABLED_DEFAULT); +module_param_cb(enabled, &enabled_param_ops, &enabled, 0600); +MODULE_PARM_DESC(enabled, "Enable of disable DAMON_STAT"); + +static struct damon_ctx *damon_stat_context; + +static struct damon_ctx *damon_stat_build_ctx(void) +{ + struct damon_ctx *ctx; + struct damon_attrs attrs; + struct damon_target *target; + unsigned long start = 0, end = 0; + + ctx = damon_new_ctx(); + if (!ctx) + return NULL; + attrs = (struct damon_attrs) { + .sample_interval = 5 * USEC_PER_MSEC, + .aggr_interval = 100 * USEC_PER_MSEC, + .ops_update_interval = 60 * USEC_PER_MSEC * MSEC_PER_SEC, + .min_nr_regions = 10, + .max_nr_regions = 1000, + }; + /* + * auto-tune sampling and aggregation interval aiming 4% DAMON-observed + * accesses ratio, keeping sampling interval in [5ms, 10s] range. + */ + attrs.intervals_goal = (struct damon_intervals_goal) { + .access_bp = 400, .aggrs = 3, + .min_sample_us = 5000, .max_sample_us = 10000000, + }; + if (damon_set_attrs(ctx, &attrs)) + goto free_out; + + /* + * auto-tune sampling and aggregation interval aiming 4% DAMON-observed + * accesses ratio, keeping sampling interval in [5ms, 10s] range. + */ + ctx->attrs.intervals_goal = (struct damon_intervals_goal) { + .access_bp = 400, .aggrs = 3, + .min_sample_us = 5000, .max_sample_us = 10000000, + }; + if (damon_select_ops(ctx, DAMON_OPS_PADDR)) + goto free_out; + + target = damon_new_target(); + if (!target) + goto free_out; + damon_add_target(ctx, target); + if (damon_set_region_biggest_system_ram_default(target, &start, &end)) + goto free_out; + return ctx; +free_out: + damon_destroy_ctx(ctx); + return NULL; +} + +static int damon_stat_start(void) +{ + damon_stat_context = damon_stat_build_ctx(); + if (!damon_stat_context) + return -ENOMEM; + return damon_start(&damon_stat_context, 1, true); +} + +static void damon_stat_stop(void) +{ + damon_stop(&damon_stat_context, 1); + damon_destroy_ctx(damon_stat_context); +} + +static bool damon_stat_init_called; + +static int damon_stat_enabled_store( + const char *val, const struct kernel_param *kp) +{ + bool is_enabled = enabled; + int err; + + err = kstrtobool(val, &enabled); + if (err) + return err; + + if (is_enabled == enabled) + return 0; + + if (!damon_stat_init_called) + /* + * probably called from command line parsing (parse_args()). + * Cannot call damon_new_ctx(). Let damon_stat_init() handle. + */ + return 0; + + if (enabled) { + err = damon_stat_start(); + if (err) + enabled = false; + return err; + } + damon_stat_stop(); + return 0; +} + +static int __init damon_stat_init(void) +{ + int err = 0; + + damon_stat_init_called = true; + + /* probably set via command line */ + if (enabled) + err = damon_stat_start(); + + if (err && enabled) + enabled = false; + return err; +} + +module_init(damon_stat_init); From fabdd1e911da43cddbf17acf930171ba4b944477 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Wed, 4 Jun 2025 11:31:25 -0700 Subject: [PATCH 013/370] mm/damon/stat: calculate and expose estimated memory bandwidth The raw form of DAMON's monitoring results captures many details of the information. However, not every bit of the information is always required for understanding practical access patterns. 
Especially on real world production systems of high scale time and size, the raw form is difficult to be aggregated and compared. Convert the raw monitoring results into a single number metric, namely estimated memory bandwidth and expose it to users as a read-only DAMON_STAT parameter. The metric represents access intensiveness (hotness) of the system. It can easily be aggregated and compared for high level understanding of the access pattern on large systems. Link: https://lkml.kernel.org/r/20250604183127.13968-3-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- mm/damon/stat.c | 35 +++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+) diff --git a/mm/damon/stat.c b/mm/damon/stat.c index dd00f2384b95..3be86ca57c8b 100644 --- a/mm/damon/stat.c +++ b/mm/damon/stat.c @@ -29,8 +29,42 @@ static bool enabled __read_mostly = IS_ENABLED( module_param_cb(enabled, &enabled_param_ops, &enabled, 0600); MODULE_PARM_DESC(enabled, "Enable of disable DAMON_STAT"); +static unsigned long estimated_memory_bandwidth __read_mostly; +module_param(estimated_memory_bandwidth, ulong, 0400); +MODULE_PARM_DESC(estimated_memory_bandwidth, + "Estimated memory bandwidth usage in bytes per second"); + static struct damon_ctx *damon_stat_context; +static void damon_stat_set_estimated_memory_bandwidth(struct damon_ctx *c) +{ + struct damon_target *t; + struct damon_region *r; + unsigned long access_bytes = 0; + + damon_for_each_target(t, c) { + damon_for_each_region(r, t) + access_bytes += (r->ar.end - r->ar.start) * + r->nr_accesses; + } + estimated_memory_bandwidth = access_bytes * USEC_PER_MSEC * + MSEC_PER_SEC / c->attrs.aggr_interval; +} + +static int damon_stat_after_aggregation(struct damon_ctx *c) +{ + static unsigned long last_refresh_jiffies; + + /* avoid unnecessarily frequent stat update */ + if (time_before_eq(jiffies, last_refresh_jiffies + + msecs_to_jiffies(5 * MSEC_PER_SEC))) + return 0; + last_refresh_jiffies = jiffies; + + damon_stat_set_estimated_memory_bandwidth(c); + return 0; +} + static struct damon_ctx *damon_stat_build_ctx(void) { struct damon_ctx *ctx; @@ -76,6 +110,7 @@ static struct damon_ctx *damon_stat_build_ctx(void) damon_add_target(ctx, target); if (damon_set_region_biggest_system_ram_default(target, &start, &end)) goto free_out; + ctx->callback.after_aggregation = damon_stat_after_aggregation; return ctx; free_out: damon_destroy_ctx(ctx); From e5d2585d9e859df712e5c9308bf19f83927e0021 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Wed, 4 Jun 2025 11:31:26 -0700 Subject: [PATCH 014/370] mm/damon/stat: calculate and expose idle time percentiles Knowing how much memory is how cold can be useful for understanding coldness and utilization efficiency of memory. The raw form of DAMON's monitoring results has the information. Convert the raw results into the per-byte idle time distributions and expose it as percentiles metric to users, as a read-only DAMON_STAT parameter. In detail, the metrics are calculated as follows. First, DAMON's per-region access frequency and age information is converted into per-byte idle time. If access frequency of a region is higher than zero, every byte of the region has zero idle time. If the access frequency of a region is zero, every byte of the region has idle time as the age of the region. Then the logic sorts the per-byte idle times and provides the value at 0/100, 1/100, ..., 99/100 and 100/100 location of the sorted array. 
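As a hypothetical worked example (numbers invented for illustration): suppose a snapshot covers 100 GiB in three regions, and the aggregation interval is one second. 60 GiB has nr_accesses > 0, so every byte of it gets zero idle time. 30 GiB has nr_accesses == 0 with age 5 (idle for roughly 5 seconds), and 10 GiB has nr_accesses == 0 with age 59 (idle for roughly 59 seconds). The sorted per-byte idle times are then 0 s for the first 60 % of bytes, ~5 s for the next 30 %, and ~59 s for the last 10 %, so the exposed array reads roughly 0 s at the 0th-60th percentiles, ~5 s at the 61st-90th, and ~59 s at the 91st-100th. A 75th percentile of ~5 s therefore says that a quarter of the memory has been idle for ~5 seconds or more.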
The metric can be easily aggregated and compared on large scale production systems. For example, if an average of 75-th percentile idle time of machines that collected on similar time is two minutes, it means the system's 25 percent memory is not accessed at all for two minutes or more on average. If a workload considers two minutes as unit work time, we can conclude its working set size is only 75 percent of the memory. If the system utilizes proactive reclamation and it supports coldness-based thresholds like DAMON_RECLAIM, the idle time percentiles can be used to find a more safe or aggressive coldness threshold for aimed memory saving. Link: https://lkml.kernel.org/r/20250604183127.13968-4-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- mm/damon/stat.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 72 insertions(+) diff --git a/mm/damon/stat.c b/mm/damon/stat.c index 3be86ca57c8b..b75af871627e 100644 --- a/mm/damon/stat.c +++ b/mm/damon/stat.c @@ -34,6 +34,11 @@ module_param(estimated_memory_bandwidth, ulong, 0400); MODULE_PARM_DESC(estimated_memory_bandwidth, "Estimated memory bandwidth usage in bytes per second"); +static unsigned long memory_idle_ms_percentiles[101] __read_mostly = {0,}; +module_param_array(memory_idle_ms_percentiles, ulong, NULL, 0400); +MODULE_PARM_DESC(memory_idle_ms_percentiles, + "Memory idle time percentiles in milliseconds"); + static struct damon_ctx *damon_stat_context; static void damon_stat_set_estimated_memory_bandwidth(struct damon_ctx *c) @@ -51,6 +56,72 @@ static void damon_stat_set_estimated_memory_bandwidth(struct damon_ctx *c) MSEC_PER_SEC / c->attrs.aggr_interval; } +static unsigned int damon_stat_idletime(const struct damon_region *r) +{ + if (r->nr_accesses) + return 0; + return r->age + 1; +} + +static int damon_stat_cmp_regions(const void *a, const void *b) +{ + const struct damon_region *ra = *(const struct damon_region **)a; + const struct damon_region *rb = *(const struct damon_region **)b; + + return damon_stat_idletime(ra) - damon_stat_idletime(rb); +} + +static int damon_stat_sort_regions(struct damon_ctx *c, + struct damon_region ***sorted_ptr, int *nr_regions_ptr, + unsigned long *total_sz_ptr) +{ + struct damon_target *t; + struct damon_region *r; + struct damon_region **region_pointers; + unsigned int nr_regions = 0; + unsigned long total_sz = 0; + + damon_for_each_target(t, c) { + /* there is only one target */ + region_pointers = kmalloc_array(damon_nr_regions(t), + sizeof(*region_pointers), GFP_KERNEL); + if (!region_pointers) + return -ENOMEM; + damon_for_each_region(r, t) { + region_pointers[nr_regions++] = r; + total_sz += r->ar.end - r->ar.start; + } + } + sort(region_pointers, nr_regions, sizeof(*region_pointers), + damon_stat_cmp_regions, NULL); + *sorted_ptr = region_pointers; + *nr_regions_ptr = nr_regions; + *total_sz_ptr = total_sz; + return 0; +} + +static void damon_stat_set_idletime_percentiles(struct damon_ctx *c) +{ + struct damon_region **sorted_regions, *region; + int nr_regions; + unsigned long total_sz, accounted_bytes = 0; + int err, i, next_percentile = 0; + + err = damon_stat_sort_regions(c, &sorted_regions, &nr_regions, + &total_sz); + if (err) + return; + for (i = 0; i < nr_regions; i++) { + region = sorted_regions[i]; + accounted_bytes += region->ar.end - region->ar.start; + while (next_percentile <= accounted_bytes * 100 / total_sz) + memory_idle_ms_percentiles[next_percentile++] = + damon_stat_idletime(region) * + 
c->attrs.aggr_interval / USEC_PER_MSEC; + } + kfree(sorted_regions); +} + static int damon_stat_after_aggregation(struct damon_ctx *c) { static unsigned long last_refresh_jiffies; @@ -62,6 +133,7 @@ static int damon_stat_after_aggregation(struct damon_ctx *c) last_refresh_jiffies = jiffies; damon_stat_set_estimated_memory_bandwidth(c); + damon_stat_set_idletime_percentiles(c); return 0; } From 7c33c6c47456c55ca043bb95a34968ef022bf625 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Wed, 4 Jun 2025 11:31:27 -0700 Subject: [PATCH 015/370] Docs/admin-guide/mm/damon: add DAMON_STAT usage document Document DAMON_STAT usage and add a link to it on DAMON admin-guide page. Link: https://lkml.kernel.org/r/20250604183127.13968-5-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/damon/index.rst | 1 + Documentation/admin-guide/mm/damon/stat.rst | 69 ++++++++++++++++++++ 2 files changed, 70 insertions(+) create mode 100644 Documentation/admin-guide/mm/damon/stat.rst diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst index bc7e976120e0..3ce3164480c7 100644 --- a/Documentation/admin-guide/mm/damon/index.rst +++ b/Documentation/admin-guide/mm/damon/index.rst @@ -14,3 +14,4 @@ access monitoring and access-aware system operations. usage reclaim lru_sort + stat diff --git a/Documentation/admin-guide/mm/damon/stat.rst b/Documentation/admin-guide/mm/damon/stat.rst new file mode 100644 index 000000000000..4c517c2c219a --- /dev/null +++ b/Documentation/admin-guide/mm/damon/stat.rst @@ -0,0 +1,69 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================== +Data Access Monitoring Results Stat +=================================== + +Data Access Monitoring Results Stat (DAMON_STAT) is a static kernel module that +is aimed to be used for simple access pattern monitoring. It monitors accesses +on the system's entire physical memory using DAMON, and provides simplified +access monitoring results statistics, namely idle time percentiles and +estimated memory bandwidth. + +Monitoring Accuracy and Overhead +================================ + +DAMON_STAT uses monitoring intervals :ref:`auto-tuning +` to make its accuracy high and +overhead minimum. It auto-tunes the intervals aiming 4 % of observable access +events to be captured in each snapshot, while limiting the resulting sampling +events to be 5 milliseconds in minimum and 10 seconds in maximum. On a few +production server systems, it resulted in consuming only 0.x % single CPU time, +while capturing reasonable quality of access patterns. + +Interface: Module Parameters +============================ + +To use this feature, you should first ensure your system is running on a kernel +that is built with ``CONFIG_DAMON_STAT=y``. The feature can be enabled by +default at build time, by setting ``CONFIG_DAMON_STAT_ENABLED_DEFAULT`` true. + +To let sysadmins enable or disable it at boot and/or runtime, and read the +monitoring results, DAMON_STAT provides module parameters. Following +sections are descriptions of the parameters. + +enabled +------- + +Enable or disable DAMON_STAT. + +You can enable DAMON_STAT by setting the value of this parameter as ``Y``. +Setting it as ``N`` disables DAMON_STAT. The default value is set by +``CONFIG_DAMON_STAT_ENABLED_DEFAULT`` build config option. + +estimated_memory_bandwidth +-------------------------- + +Estimated memory bandwidth consumption (bytes per second) of the system. 
+ +DAMON_STAT reads observed access events on the current DAMON results snapshot +and converts it to memory bandwidth consumption estimation in bytes per second. +The resulting metric is exposed to user via this read-only parameter. Because +DAMON uses sampling, this is only an estimation of the access intensity rather +than accurate memory bandwidth. + +memory_idle_ms_percentiles +-------------------------- + +Per-byte idle time (milliseconds) percentiles of the system. + +DAMON_STAT calculates how long each byte of the memory was not accessed until +now (idle time), based on the current DAMON results snapshot. If DAMON found a +region of access frequency (nr_accesses) larger than zero, every byte of the +region gets zero idle time. If a region has zero access frequency +(nr_accesses), how long the region was keeping the zero access frequency (age) +becomes the idle time of every byte of the region. Then, DAMON_STAT exposes +the percentiles of the idle time values via this read-only parameter. Reading +the parameter returns 101 idle time values in milliseconds, separated by comma. +Each value represents 0-th, 1st, 2nd, 3rd, ..., 99th and 100th percentile idle +times. From 792b429db7e0217faf7bce9fe46e7708135cf83c Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 4 Jun 2025 16:05:44 +0200 Subject: [PATCH 016/370] mm/gup: remove (VM_)BUG_ONs Especially once we hit one of the assertions in sanity_check_pinned_pages(), observing follow-up assertions failing in other code can give good clues about what went wrong, so use VM_WARN_ON_ONCE instead. While at it, let's just convert all VM_BUG_ON to VM_WARN_ON_ONCE as well. Add one comment for the pfn_valid() check. We have to introduce VM_WARN_ON_ONCE_VMA() to make that fly. Drop the BUG_ON after mmap_read_lock_killable(), if that ever returns something > 0 we're in bigger trouble. Convert the other BUG_ON's into VM_WARN_ON_ONCE as well, they are in a similar domain "should never happen", but more reasonable to check for during early testing. [david@redhat.com: use the _FOLIO variant where possible, per Lorenzo] Link: https://lkml.kernel.org/r/844bd929-a551-48e3-a12e-285cd65ba580@redhat.com Link: https://lkml.kernel.org/r/20250604140544.688711-1-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Vlastimil Babka Reviewed-by: Suren Baghdasaryan Reviewed-by: Lorenzo Stoakes Acked-by: SeongJae Park Reviewed-by: Liam R. 
Howlett Acked-by: Michal Hocko Reviewed-by: John Hubbard Cc: Mike Rapoport Cc: Jason Gunthorpe Cc: Peter Xu Signed-off-by: Andrew Morton --- include/linux/mmdebug.h | 12 ++++++++++++ mm/gup.c | 41 +++++++++++++++++++---------------------- 2 files changed, 31 insertions(+), 22 deletions(-) diff --git a/include/linux/mmdebug.h b/include/linux/mmdebug.h index a0a3894900ed..14a45979cccc 100644 --- a/include/linux/mmdebug.h +++ b/include/linux/mmdebug.h @@ -89,6 +89,17 @@ void vma_iter_dump_tree(const struct vma_iterator *vmi); } \ unlikely(__ret_warn_once); \ }) +#define VM_WARN_ON_ONCE_VMA(cond, vma) ({ \ + static bool __section(".data..once") __warned; \ + int __ret_warn_once = !!(cond); \ + \ + if (unlikely(__ret_warn_once && !__warned)) { \ + dump_vma(vma); \ + __warned = true; \ + WARN_ON(1); \ + } \ + unlikely(__ret_warn_once); \ +}) #define VM_WARN_ON_VMG(cond, vmg) ({ \ int __ret_warn = !!(cond); \ \ @@ -115,6 +126,7 @@ void vma_iter_dump_tree(const struct vma_iterator *vmi); #define VM_WARN_ON_FOLIO(cond, folio) BUILD_BUG_ON_INVALID(cond) #define VM_WARN_ON_ONCE_FOLIO(cond, folio) BUILD_BUG_ON_INVALID(cond) #define VM_WARN_ON_ONCE_MM(cond, mm) BUILD_BUG_ON_INVALID(cond) +#define VM_WARN_ON_ONCE_VMA(cond, vma) BUILD_BUG_ON_INVALID(cond) #define VM_WARN_ON_VMG(cond, vmg) BUILD_BUG_ON_INVALID(cond) #define VM_WARN_ONCE(cond, format...) BUILD_BUG_ON_INVALID(cond) #define VM_WARN(cond, format...) BUILD_BUG_ON_INVALID(cond) diff --git a/mm/gup.c b/mm/gup.c index 3c39cbbeebef..7f2644a433a0 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -64,11 +64,11 @@ static inline void sanity_check_pinned_pages(struct page **pages, !folio_test_anon(folio)) continue; if (!folio_test_large(folio) || folio_test_hugetlb(folio)) - VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page); + VM_WARN_ON_ONCE_FOLIO(!PageAnonExclusive(&folio->page), folio); else /* Either a PTE-mapped or a PMD-mapped THP. */ - VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) && - !PageAnonExclusive(page), page); + VM_WARN_ON_ONCE_PAGE(!PageAnonExclusive(&folio->page) && + !PageAnonExclusive(page), page); } } @@ -760,8 +760,8 @@ static struct page *follow_huge_pmd(struct vm_area_struct *vma, if (!pmd_write(pmdval) && gup_must_unshare(vma, flags, page)) return ERR_PTR(-EMLINK); - VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) && - !PageAnonExclusive(page), page); + VM_WARN_ON_ONCE_PAGE((flags & FOLL_PIN) && PageAnon(page) && + !PageAnonExclusive(page), page); ret = try_grab_folio(page_folio(page), 1, flags); if (ret) @@ -899,8 +899,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, goto out; } - VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) && - !PageAnonExclusive(page), page); + VM_WARN_ON_ONCE_PAGE((flags & FOLL_PIN) && PageAnon(page) && + !PageAnonExclusive(page), page); /* try_grab_folio() does nothing unless FOLL_GET or FOLL_PIN is set. 
*/ ret = try_grab_folio(folio, 1, flags); @@ -1180,7 +1180,7 @@ static int faultin_page(struct vm_area_struct *vma, if (unshare) { fault_flags |= FAULT_FLAG_UNSHARE; /* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */ - VM_BUG_ON(fault_flags & FAULT_FLAG_WRITE); + VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_WRITE); } ret = handle_mm_fault(vma, address, fault_flags, NULL); @@ -1760,10 +1760,7 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm, } /* VM_FAULT_RETRY or VM_FAULT_COMPLETED cannot return errors */ - if (!*locked) { - BUG_ON(ret < 0); - BUG_ON(ret >= nr_pages); - } + VM_WARN_ON_ONCE(!*locked && (ret < 0 || ret >= nr_pages)); if (ret > 0) { nr_pages -= ret; @@ -1808,7 +1805,6 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm, ret = mmap_read_lock_killable(mm); if (ret) { - BUG_ON(ret > 0); if (!pages_done) pages_done = ret; break; @@ -1819,11 +1815,11 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm, pages, locked); if (!*locked) { /* Continue to retry until we succeeded */ - BUG_ON(ret != 0); + VM_WARN_ON_ONCE(ret != 0); goto retry; } if (ret != 1) { - BUG_ON(ret > 1); + VM_WARN_ON_ONCE(ret > 1); if (!pages_done) pages_done = ret; break; @@ -1885,10 +1881,10 @@ long populate_vma_page_range(struct vm_area_struct *vma, int gup_flags; long ret; - VM_BUG_ON(!PAGE_ALIGNED(start)); - VM_BUG_ON(!PAGE_ALIGNED(end)); - VM_BUG_ON_VMA(start < vma->vm_start, vma); - VM_BUG_ON_VMA(end > vma->vm_end, vma); + VM_WARN_ON_ONCE(!PAGE_ALIGNED(start)); + VM_WARN_ON_ONCE(!PAGE_ALIGNED(end)); + VM_WARN_ON_ONCE_VMA(start < vma->vm_start, vma); + VM_WARN_ON_ONCE_VMA(end > vma->vm_end, vma); mmap_assert_locked(mm); /* @@ -1957,8 +1953,8 @@ long faultin_page_range(struct mm_struct *mm, unsigned long start, int gup_flags; long ret; - VM_BUG_ON(!PAGE_ALIGNED(start)); - VM_BUG_ON(!PAGE_ALIGNED(end)); + VM_WARN_ON_ONCE(!PAGE_ALIGNED(start)); + VM_WARN_ON_ONCE(!PAGE_ALIGNED(end)); mmap_assert_locked(mm); /* @@ -2914,7 +2910,8 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, } else if (pte_special(pte)) goto pte_unmap; - VM_BUG_ON(!pfn_valid(pte_pfn(pte))); + /* If it's not marked as special it must have a valid memmap. */ + VM_WARN_ON_ONCE(!pfn_valid(pte_pfn(pte))); page = pte_page(pte); folio = try_grab_folio_fast(page, 1, flags); From 3800d55250976b7a4bd42c255b267bc242669709 Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Wed, 4 Jun 2025 17:14:27 -0400 Subject: [PATCH 017/370] mm: rename CONFIG_PAGE_BLOCK_ORDER to CONFIG_PAGE_BLOCK_MAX_ORDER The config is in fact an additional upper limit of pageblock_order, so rename it to avoid confusion. Link: https://lkml.kernel.org/r/20250604211427.1590859-1-ziy@nvidia.com Signed-off-by: Zi Yan Acked-by: David Hildenbrand Reviewed-by: Oscar Salvador Acked-by: Juan Yescas Reviewed-by: Anshuman Khandual Acked-by: Vlastimil Babka Cc: "Isaac J. Manjarres" Cc: Kalesh Singh Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Masahiro Yamada Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Suren Baghdasaryan Cc: T.J. 
Mercier Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 14 +++++++------- include/linux/pageblock-flags.h | 8 ++++---- mm/Kconfig | 15 ++++++++------- mm/mm_init.c | 2 +- 4 files changed, 20 insertions(+), 19 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 283913d42d7b..5bec8b1d0e66 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -38,19 +38,19 @@ #define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1) /* Defines the order for the number of pages that have a migrate type. */ -#ifndef CONFIG_PAGE_BLOCK_ORDER -#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER +#ifndef CONFIG_PAGE_BLOCK_MAX_ORDER +#define PAGE_BLOCK_MAX_ORDER MAX_PAGE_ORDER #else -#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER -#endif /* CONFIG_PAGE_BLOCK_ORDER */ +#define PAGE_BLOCK_MAX_ORDER CONFIG_PAGE_BLOCK_MAX_ORDER +#endif /* CONFIG_PAGE_BLOCK_MAX_ORDER */ /* * The MAX_PAGE_ORDER, which defines the max order of pages to be allocated - * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER, + * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_MAX_ORDER, * which defines the order for the number of pages that can have a migrate type */ -#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER) -#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER +#if (PAGE_BLOCK_MAX_ORDER > MAX_PAGE_ORDER) +#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_MAX_ORDER #endif /* diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h index e73a4292ef02..6297c6343c55 100644 --- a/include/linux/pageblock-flags.h +++ b/include/linux/pageblock-flags.h @@ -41,18 +41,18 @@ extern unsigned int pageblock_order; * Huge pages are a constant size, but don't exceed the maximum allocation * granularity. */ -#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER) +#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_MAX_ORDER) #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */ #elif defined(CONFIG_TRANSPARENT_HUGEPAGE) -#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) +#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_MAX_ORDER) #else /* CONFIG_TRANSPARENT_HUGEPAGE */ -/* If huge pages are not used, group by PAGE_BLOCK_ORDER */ -#define pageblock_order PAGE_BLOCK_ORDER +/* If huge pages are not used, group by PAGE_BLOCK_MAX_ORDER */ +#define pageblock_order PAGE_BLOCK_MAX_ORDER #endif /* CONFIG_HUGETLB_PAGE */ diff --git a/mm/Kconfig b/mm/Kconfig index 15716b1a5e21..3b2060a61c05 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1005,8 +1005,8 @@ config ARCH_FORCE_MAX_ORDER # the default page block order is MAX_PAGE_ORDER (10) as per # include/linux/mmzone.h. # -config PAGE_BLOCK_ORDER - int "Page Block Order" +config PAGE_BLOCK_MAX_ORDER + int "Page Block Order Upper Limit" range 1 10 if ARCH_FORCE_MAX_ORDER = 0 default 10 if ARCH_FORCE_MAX_ORDER = 0 range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0 @@ -1014,12 +1014,13 @@ config PAGE_BLOCK_ORDER help The page block order refers to the power of two number of pages that are physically contiguous and can have a migrate type associated to - them. The maximum size of the page block order is limited by - ARCH_FORCE_MAX_ORDER. + them. The maximum size of the page block order is at least limited by + ARCH_FORCE_MAX_ORDER/MAX_PAGE_ORDER. - This config allows overriding the default page block order when the - page block order is required to be smaller than ARCH_FORCE_MAX_ORDER - or MAX_PAGE_ORDER. 
+ This config adds a new upper limit of default page block + order when the page block order is required to be smaller than + ARCH_FORCE_MAX_ORDER/MAX_PAGE_ORDER or other limits + (see include/linux/pageblock-flags.h for details). Reducing pageblock order can negatively impact THP generation success rate. If your workloads use THP heavily, please use this diff --git a/mm/mm_init.c b/mm/mm_init.c index f2944748f526..02f41e2bdf60 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -1509,7 +1509,7 @@ static inline void setup_usemap(struct zone *zone) {} /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */ void __init set_pageblock_order(void) { - unsigned int order = PAGE_BLOCK_ORDER; + unsigned int order = PAGE_BLOCK_MAX_ORDER; /* Check that pageblock_nr_pages has not already been setup */ if (pageblock_order) From 453742ba5b3cf36af348c1313f2b32f7ea4c5897 Mon Sep 17 00:00:00 2001 From: Kairui Song Date: Tue, 27 May 2025 02:06:38 +0800 Subject: [PATCH 018/370] mm, list_lru: refactor the locking code Cocci is confused by the try lock then release RCU and return logic here. So separate the try lock part out into a standalone helper. The code is easier to follow too. No feature change, fixes: cocci warnings: (new ones prefixed by >>) >> mm/list_lru.c:82:3-9: preceding lock on line 77 >> mm/list_lru.c:82:3-9: preceding lock on line 77 mm/list_lru.c:82:3-9: preceding lock on line 75 mm/list_lru.c:82:3-9: preceding lock on line 75 Link: https://lkml.kernel.org/r/20250526180638.14609-1-ryncsn@gmail.com Signed-off-by: Kairui Song Reported-by: kernel test robot Reported-by: Julia Lawall Closes: https://lore.kernel.org/r/202505252043.pbT1tBHJ-lkp@intel.com/ Reviewed-by: Qi Zheng Reviewed-by: Muchun Song Reviewed-by: SeongJae Park Cc: Chengming Zhou Cc: Johannes Weiner Cc: Kairui Song Cc: Roman Gushchin Signed-off-by: Andrew Morton --- mm/list_lru.c | 34 +++++++++++++++++++--------------- 1 file changed, 19 insertions(+), 15 deletions(-) diff --git a/mm/list_lru.c b/mm/list_lru.c index 490473af3122..ec48b5dadf51 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -60,30 +60,34 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx) return &lru->node[nid].lru; } +static inline bool lock_list_lru(struct list_lru_one *l, bool irq) +{ + if (irq) + spin_lock_irq(&l->lock); + else + spin_lock(&l->lock); + if (unlikely(READ_ONCE(l->nr_items) == LONG_MIN)) { + if (irq) + spin_unlock_irq(&l->lock); + else + spin_unlock(&l->lock); + return false; + } + return true; +} + static inline struct list_lru_one * lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg, bool irq, bool skip_empty) { struct list_lru_one *l; - long nr_items; rcu_read_lock(); again: l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg)); - if (likely(l)) { - if (irq) - spin_lock_irq(&l->lock); - else - spin_lock(&l->lock); - nr_items = READ_ONCE(l->nr_items); - if (likely(nr_items != LONG_MIN)) { - rcu_read_unlock(); - return l; - } - if (irq) - spin_unlock_irq(&l->lock); - else - spin_unlock(&l->lock); + if (likely(l) && lock_list_lru(l, irq)) { + rcu_read_unlock(); + return l; } /* * Caller may simply bail out if raced with reparenting or From 86c4a946436e6ab3d9b0d02afd5d33ebf63528c5 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 10 Jun 2025 07:49:37 +0200 Subject: [PATCH 019/370] mm: split out a writeout helper from pageout Patch series "stop passing a writeback_control to swap/shmem writeout", v3. 
This series was intended to remove the last remaining users of AOP_WRITEPAGE_ACTIVATE after my other pending patches removed the rest, but spectacularly failed at that. But instead it nicely improves the code, and removes two pointers from struct writeback_control. This patch (of 6): Move the code to write back swap / shmem folios into a self-contained helper to keep prepare for refactoring it. Link: https://lkml.kernel.org/r/20250610054959.2057526-1-hch@lst.de Link: https://lkml.kernel.org/r/20250610054959.2057526-2-hch@lst.de Signed-off-by: Christoph Hellwig Reviewed-by: Baolin Wang Cc: Chengming Zhou Cc: Hugh Dickins Cc: Johannes Weiner Cc: Matthew Wilcox (Oracle) Cc: Nhat Pham Signed-off-by: Andrew Morton --- mm/vmscan.c | 94 +++++++++++++++++++++++++++-------------------------- 1 file changed, 48 insertions(+), 46 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index ad47244179bb..b6dd1708fe82 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -652,14 +652,55 @@ typedef enum { PAGE_CLEAN, } pageout_t; +static pageout_t writeout(struct folio *folio, struct address_space *mapping, + struct swap_iocb **plug, struct list_head *folio_list) +{ + struct writeback_control wbc = { + .sync_mode = WB_SYNC_NONE, + .nr_to_write = SWAP_CLUSTER_MAX, + .range_start = 0, + .range_end = LLONG_MAX, + .for_reclaim = 1, + .swap_plug = plug, + }; + int res; + + folio_set_reclaim(folio); + + /* + * The large shmem folio can be split if CONFIG_THP_SWAP is not enabled + * or we failed to allocate contiguous swap entries. + */ + if (shmem_mapping(mapping)) { + if (folio_test_large(folio)) + wbc.list = folio_list; + res = shmem_writeout(folio, &wbc); + } else { + res = swap_writeout(folio, &wbc); + } + + if (res < 0) + handle_write_error(mapping, folio, res); + if (res == AOP_WRITEPAGE_ACTIVATE) { + folio_clear_reclaim(folio); + return PAGE_ACTIVATE; + } + + /* synchronous write? */ + if (!folio_test_writeback(folio)) + folio_clear_reclaim(folio); + + trace_mm_vmscan_write_folio(folio); + node_stat_add_folio(folio, NR_VMSCAN_WRITE); + return PAGE_SUCCESS; +} + /* * pageout is called by shrink_folio_list() for each dirty folio. */ static pageout_t pageout(struct folio *folio, struct address_space *mapping, struct swap_iocb **plug, struct list_head *folio_list) { - int (*writeout)(struct folio *, struct writeback_control *); - /* * We no longer attempt to writeback filesystem folios here, other * than tmpfs/shmem. That's taken care of in page-writeback. @@ -690,51 +731,12 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping, } return PAGE_KEEP; } - if (shmem_mapping(mapping)) - writeout = shmem_writeout; - else if (folio_test_anon(folio)) - writeout = swap_writeout; - else + + if (!shmem_mapping(mapping) && !folio_test_anon(folio)) return PAGE_ACTIVATE; - - if (folio_clear_dirty_for_io(folio)) { - int res; - struct writeback_control wbc = { - .sync_mode = WB_SYNC_NONE, - .nr_to_write = SWAP_CLUSTER_MAX, - .range_start = 0, - .range_end = LLONG_MAX, - .for_reclaim = 1, - .swap_plug = plug, - }; - - /* - * The large shmem folio can be split if CONFIG_THP_SWAP is - * not enabled or contiguous swap entries are failed to - * allocate. 
- */ - if (shmem_mapping(mapping) && folio_test_large(folio)) - wbc.list = folio_list; - - folio_set_reclaim(folio); - res = writeout(folio, &wbc); - if (res < 0) - handle_write_error(mapping, folio, res); - if (res == AOP_WRITEPAGE_ACTIVATE) { - folio_clear_reclaim(folio); - return PAGE_ACTIVATE; - } - - if (!folio_test_writeback(folio)) { - /* synchronous write? */ - folio_clear_reclaim(folio); - } - trace_mm_vmscan_write_folio(folio); - node_stat_add_folio(folio, NR_VMSCAN_WRITE); - return PAGE_SUCCESS; - } - - return PAGE_CLEAN; + if (!folio_clear_dirty_for_io(folio)) + return PAGE_CLEAN; + return writeout(folio, mapping, plug, folio_list); } /* From 44b1b073eb36143ec65a918c0fbaa582f3ec2aa1 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 10 Jun 2025 07:49:38 +0200 Subject: [PATCH 020/370] mm: stop passing a writeback_control structure to shmem_writeout shmem_writeout only needs the swap_iocb cookie and the split folio list. Pass those explicitly and remove the now unused list member from struct writeback_control. Link: https://lkml.kernel.org/r/20250610054959.2057526-3-hch@lst.de Signed-off-by: Christoph Hellwig Cc: Baolin Wang Cc: Chengming Zhou Cc: Hugh Dickins Cc: Johannes Weiner Cc: Matthew Wilcox (Oracle) Cc: Nhat Pham Signed-off-by: Andrew Morton --- drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 2 +- drivers/gpu/drm/ttm/ttm_backup.c | 9 +------- include/linux/shmem_fs.h | 5 ++++- include/linux/writeback.h | 3 --- mm/shmem.c | 26 +++++++++++++---------- mm/vmscan.c | 12 +++++------ 6 files changed, 26 insertions(+), 31 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c index 19a3eb82dc6a..24d8daa4fdb3 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c @@ -317,7 +317,7 @@ void __shmem_writeback(size_t size, struct address_space *mapping) if (folio_mapped(folio)) folio_redirty_for_writepage(&wbc, folio); else - error = shmem_writeout(folio, &wbc); + error = shmem_writeout(folio, NULL, NULL); } } diff --git a/drivers/gpu/drm/ttm/ttm_backup.c b/drivers/gpu/drm/ttm/ttm_backup.c index ffaab68bd5dd..6f2e58be4f3e 100644 --- a/drivers/gpu/drm/ttm/ttm_backup.c +++ b/drivers/gpu/drm/ttm/ttm_backup.c @@ -112,15 +112,8 @@ ttm_backup_backup_page(struct file *backup, struct page *page, if (writeback && !folio_mapped(to_folio) && folio_clear_dirty_for_io(to_folio)) { - struct writeback_control wbc = { - .sync_mode = WB_SYNC_NONE, - .nr_to_write = SWAP_CLUSTER_MAX, - .range_start = 0, - .range_end = LLONG_MAX, - .for_reclaim = 1, - }; folio_set_reclaim(to_folio); - ret = shmem_writeout(to_folio, &wbc); + ret = shmem_writeout(to_folio, NULL, NULL); if (!folio_test_writeback(to_folio)) folio_clear_reclaim(to_folio); /* diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 5f03a39a26f7..6d0f9c599ff7 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -11,6 +11,8 @@ #include #include +struct swap_iocb; + /* inode in-kernel data */ #ifdef CONFIG_TMPFS_QUOTA @@ -107,7 +109,8 @@ static inline bool shmem_mapping(struct address_space *mapping) void shmem_unlock_mapping(struct address_space *mapping); struct page *shmem_read_mapping_page_gfp(struct address_space *mapping, pgoff_t index, gfp_t gfp_mask); -int shmem_writeout(struct folio *folio, struct writeback_control *wbc); +int shmem_writeout(struct folio *folio, struct swap_iocb **plug, + struct list_head *folio_list); void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end); int 
shmem_unuse(unsigned int type); diff --git a/include/linux/writeback.h b/include/linux/writeback.h index eda4b62511f7..82f217970092 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -79,9 +79,6 @@ struct writeback_control { */ struct swap_iocb **swap_plug; - /* Target list for splitting a large folio */ - struct list_head *list; - /* internal fields used by the ->writepages implementation: */ struct folio_batch fbatch; pgoff_t index; diff --git a/mm/shmem.c b/mm/shmem.c index 3a5a65b1f41a..ad8db487e721 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1540,11 +1540,13 @@ int shmem_unuse(unsigned int type) /** * shmem_writeout - Write the folio to swap * @folio: The folio to write - * @wbc: How writeback is to be done + * @plug: swap plug + * @folio_list: list to put back folios on split * * Move the folio from the page cache to the swap cache. */ -int shmem_writeout(struct folio *folio, struct writeback_control *wbc) +int shmem_writeout(struct folio *folio, struct swap_iocb **plug, + struct list_head *folio_list) { struct address_space *mapping = folio->mapping; struct inode *inode = mapping->host; @@ -1554,9 +1556,6 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc) int nr_pages; bool split = false; - if (WARN_ON_ONCE(!wbc->for_reclaim)) - goto redirty; - if ((info->flags & VM_LOCKED) || sbinfo->noswap) goto redirty; @@ -1583,7 +1582,7 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc) try_split: /* Ensure the subpages are still dirty */ folio_test_set_dirty(folio); - if (split_folio_to_list(folio, wbc->list)) + if (split_folio_to_list(folio, folio_list)) goto redirty; folio_clear_dirty(folio); } @@ -1636,13 +1635,21 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc) list_add(&info->swaplist, &shmem_swaplist); if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) { + struct writeback_control wbc = { + .sync_mode = WB_SYNC_NONE, + .nr_to_write = SWAP_CLUSTER_MAX, + .range_start = 0, + .range_end = LLONG_MAX, + .for_reclaim = 1, + .swap_plug = plug, + }; shmem_recalc_inode(inode, 0, nr_pages); swap_shmem_alloc(folio->swap, nr_pages); shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap)); mutex_unlock(&shmem_swaplist_mutex); BUG_ON(folio_mapped(folio)); - return swap_writeout(folio, wbc); + return swap_writeout(folio, &wbc); } if (!info->swapped) list_del_init(&info->swaplist); @@ -1651,10 +1658,7 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc) goto try_split; redirty: folio_mark_dirty(folio); - if (wbc->for_reclaim) - return AOP_WRITEPAGE_ACTIVATE; /* Return with folio locked */ - folio_unlock(folio); - return 0; + return AOP_WRITEPAGE_ACTIVATE; /* Return with folio locked */ } EXPORT_SYMBOL_GPL(shmem_writeout); diff --git a/mm/vmscan.c b/mm/vmscan.c index b6dd1708fe82..3cceb619a853 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -669,15 +669,13 @@ static pageout_t writeout(struct folio *folio, struct address_space *mapping, /* * The large shmem folio can be split if CONFIG_THP_SWAP is not enabled - * or we failed to allocate contiguous swap entries. + * or we failed to allocate contiguous swap entries, in which case + * the split out folios get added back to folio_list. 
*/ - if (shmem_mapping(mapping)) { - if (folio_test_large(folio)) - wbc.list = folio_list; - res = shmem_writeout(folio, &wbc); - } else { + if (shmem_mapping(mapping)) + res = shmem_writeout(folio, plug, folio_list); + else res = swap_writeout(folio, &wbc); - } if (res < 0) handle_write_error(mapping, folio, res); From 2d1844cdbe89d04d4f679a1a01e165fea1741bab Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 10 Jun 2025 07:49:39 +0200 Subject: [PATCH 021/370] mm: tidy up swap_writeout Use a goto label to consolidate the unlock folio and return pattern and don't bother with an else after a return / goto. Link: https://lkml.kernel.org/r/20250610054959.2057526-4-hch@lst.de Signed-off-by: Christoph Hellwig Cc: Baolin Wang Cc: Chengming Zhou Cc: Hugh Dickins Cc: Johannes Weiner Cc: Matthew Wilcox (Oracle) Cc: Nhat Pham Signed-off-by: Andrew Morton --- mm/page_io.c | 36 ++++++++++++++++++------------------ 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/mm/page_io.c b/mm/page_io.c index f7716b6569fa..c420b0aa0f22 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -239,12 +239,11 @@ static void swap_zeromap_folio_clear(struct folio *folio) */ int swap_writeout(struct folio *folio, struct writeback_control *wbc) { - int ret; + int ret = 0; + + if (folio_free_swap(folio)) + goto out_unlock; - if (folio_free_swap(folio)) { - folio_unlock(folio); - return 0; - } /* * Arch code may have to preserve more data than just the page * contents, e.g. memory tags. @@ -252,8 +251,7 @@ int swap_writeout(struct folio *folio, struct writeback_control *wbc) ret = arch_prepare_to_swap(folio); if (ret) { folio_mark_dirty(folio); - folio_unlock(folio); - return ret; + goto out_unlock; } /* @@ -264,20 +262,19 @@ int swap_writeout(struct folio *folio, struct writeback_control *wbc) */ if (is_folio_zero_filled(folio)) { swap_zeromap_folio_set(folio); - folio_unlock(folio); - return 0; - } else { - /* - * Clear bits this folio occupies in the zeromap to prevent - * zero data being read in from any previous zero writes that - * occupied the same swap entries. - */ - swap_zeromap_folio_clear(folio); + goto out_unlock; } + + /* + * Clear bits this folio occupies in the zeromap to prevent zero data + * being read in from any previous zero writes that occupied the same + * swap entries. + */ + swap_zeromap_folio_clear(folio); + if (zswap_store(folio)) { count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT); - folio_unlock(folio); - return 0; + goto out_unlock; } if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) { folio_mark_dirty(folio); @@ -286,6 +283,9 @@ int swap_writeout(struct folio *folio, struct writeback_control *wbc) __swap_writepage(folio, wbc); return 0; +out_unlock: + folio_unlock(folio); + return ret; } static inline void count_swpout_vm_event(struct folio *folio) From 2ba8ffcefe81241fab73aa0f59142a2c9ace575b Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 10 Jun 2025 07:49:40 +0200 Subject: [PATCH 022/370] mm: stop passing a writeback_control structure to __swap_writepage __swap_writepage only needs the swap_iocb cookie from the writeback_control structure, so pass it explicitly and remove the now unused swap_iocb member from struct writeback_control. 
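For illustration, a caller-side sketch of how the swap_iocb cookie batches writes, based only on the interfaces visible above (a hypothetical caller and folio_list; this is not code added by the patch):

        struct swap_iocb *plug = NULL;
        struct folio *folio;

        list_for_each_entry(folio, folio_list, lru) {
                /*
                 * For SWP_FS_OPS backends this may queue the I/O into the
                 * shared plug instead of submitting it immediately.
                 */
                __swap_writepage(folio, &plug);
        }
        /* submit whatever is still batched in the plug */
        if (plug)
                swap_write_unplug(plug);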
Link: https://lkml.kernel.org/r/20250610054959.2057526-5-hch@lst.de Signed-off-by: Christoph Hellwig Acked-by: Nhat Pham Cc: Baolin Wang Cc: Chengming Zhou Cc: Hugh Dickins Cc: Johannes Weiner Cc: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- mm/page_io.c | 33 ++++++++++++++------------------- mm/swap.h | 2 +- mm/zswap.c | 5 +---- 3 files changed, 16 insertions(+), 24 deletions(-) diff --git a/mm/page_io.c b/mm/page_io.c index c420b0aa0f22..fb52bedcc966 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -281,7 +281,7 @@ int swap_writeout(struct folio *folio, struct writeback_control *wbc) return AOP_WRITEPAGE_ACTIVATE; } - __swap_writepage(folio, wbc); + __swap_writepage(folio, wbc->swap_plug); return 0; out_unlock: folio_unlock(folio); @@ -371,9 +371,9 @@ static void sio_write_complete(struct kiocb *iocb, long ret) mempool_free(sio, sio_pool); } -static void swap_writepage_fs(struct folio *folio, struct writeback_control *wbc) +static void swap_writepage_fs(struct folio *folio, struct swap_iocb **swap_plug) { - struct swap_iocb *sio = NULL; + struct swap_iocb *sio = swap_plug ? *swap_plug : NULL; struct swap_info_struct *sis = swp_swap_info(folio->swap); struct file *swap_file = sis->swap_file; loff_t pos = swap_dev_pos(folio->swap); @@ -381,8 +381,6 @@ static void swap_writepage_fs(struct folio *folio, struct writeback_control *wbc count_swpout_vm_event(folio); folio_start_writeback(folio); folio_unlock(folio); - if (wbc->swap_plug) - sio = *wbc->swap_plug; if (sio) { if (sio->iocb.ki_filp != swap_file || sio->iocb.ki_pos + sio->len != pos) { @@ -401,22 +399,21 @@ static void swap_writepage_fs(struct folio *folio, struct writeback_control *wbc bvec_set_folio(&sio->bvec[sio->pages], folio, folio_size(folio), 0); sio->len += folio_size(folio); sio->pages += 1; - if (sio->pages == ARRAY_SIZE(sio->bvec) || !wbc->swap_plug) { + if (sio->pages == ARRAY_SIZE(sio->bvec) || !swap_plug) { swap_write_unplug(sio); sio = NULL; } - if (wbc->swap_plug) - *wbc->swap_plug = sio; + if (swap_plug) + *swap_plug = sio; } static void swap_writepage_bdev_sync(struct folio *folio, - struct writeback_control *wbc, struct swap_info_struct *sis) + struct swap_info_struct *sis) { struct bio_vec bv; struct bio bio; - bio_init(&bio, sis->bdev, &bv, 1, - REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc)); + bio_init(&bio, sis->bdev, &bv, 1, REQ_OP_WRITE | REQ_SWAP); bio.bi_iter.bi_sector = swap_folio_sector(folio); bio_add_folio_nofail(&bio, folio, folio_size(folio), 0); @@ -431,13 +428,11 @@ static void swap_writepage_bdev_sync(struct folio *folio, } static void swap_writepage_bdev_async(struct folio *folio, - struct writeback_control *wbc, struct swap_info_struct *sis) + struct swap_info_struct *sis) { struct bio *bio; - bio = bio_alloc(sis->bdev, 1, - REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc), - GFP_NOIO); + bio = bio_alloc(sis->bdev, 1, REQ_OP_WRITE | REQ_SWAP, GFP_NOIO); bio->bi_iter.bi_sector = swap_folio_sector(folio); bio->bi_end_io = end_swap_bio_write; bio_add_folio_nofail(bio, folio, folio_size(folio), 0); @@ -449,7 +444,7 @@ static void swap_writepage_bdev_async(struct folio *folio, submit_bio(bio); } -void __swap_writepage(struct folio *folio, struct writeback_control *wbc) +void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug) { struct swap_info_struct *sis = swp_swap_info(folio->swap); @@ -460,16 +455,16 @@ void __swap_writepage(struct folio *folio, struct writeback_control *wbc) * is safe. 
*/ if (data_race(sis->flags & SWP_FS_OPS)) - swap_writepage_fs(folio, wbc); + swap_writepage_fs(folio, swap_plug); /* * ->flags can be updated non-atomicially (scan_swap_map_slots), * but that will never affect SWP_SYNCHRONOUS_IO, so the data_race * is safe. */ else if (data_race(sis->flags & SWP_SYNCHRONOUS_IO)) - swap_writepage_bdev_sync(folio, wbc, sis); + swap_writepage_bdev_sync(folio, sis); else - swap_writepage_bdev_async(folio, wbc, sis); + swap_writepage_bdev_async(folio, sis); } void swap_write_unplug(struct swap_iocb *sio) diff --git a/mm/swap.h b/mm/swap.h index 9096082a915e..911817f051c2 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -21,7 +21,7 @@ static inline void swap_read_unplug(struct swap_iocb *plug) } void swap_write_unplug(struct swap_iocb *sio); int swap_writeout(struct folio *folio, struct writeback_control *wbc); -void __swap_writepage(struct folio *folio, struct writeback_control *wbc); +void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug); /* linux/mm/swap_state.c */ /* One swap address space for each 64M swap space */ diff --git a/mm/zswap.c b/mm/zswap.c index 455e9425c5f5..3c0fd8a13718 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1070,9 +1070,6 @@ static int zswap_writeback_entry(struct zswap_entry *entry, struct mempolicy *mpol; bool folio_was_allocated; struct swap_info_struct *si; - struct writeback_control wbc = { - .sync_mode = WB_SYNC_NONE, - }; int ret = 0; /* try to allocate swap cache folio */ @@ -1134,7 +1131,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry, folio_set_reclaim(folio); /* start writeback */ - __swap_writepage(folio, &wbc); + __swap_writepage(folio, NULL); out: if (ret && ret != -EEXIST) { From 624043dbd5be03cc5a2b9175c3934e6fb0ef7c70 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 10 Jun 2025 07:49:41 +0200 Subject: [PATCH 023/370] mm: stop passing a writeback_control structure to swap_writeout swap_writeout only needs the swap_iocb cookie from the writeback_control structure, so pass it explicitly. Link: https://lkml.kernel.org/r/20250610054959.2057526-6-hch@lst.de Signed-off-by: Christoph Hellwig Cc: Baolin Wang Cc: Chengming Zhou Cc: Hugh Dickins Cc: Johannes Weiner Cc: Matthew Wilcox (Oracle) Cc: Nhat Pham Signed-off-by: Andrew Morton --- include/linux/writeback.h | 7 ------- mm/page_io.c | 4 ++-- mm/shmem.c | 10 +--------- mm/swap.h | 7 +++++-- mm/vmscan.c | 10 +--------- 5 files changed, 9 insertions(+), 29 deletions(-) diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 82f217970092..9e960f2faf79 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -72,13 +72,6 @@ struct writeback_control { */ unsigned no_cgroup_owner:1; - /* To enable batching of swap writes to non-block-device backends, - * "plug" can be set point to a 'struct swap_iocb *'. When all swap - * writes have been submitted, if with swap_iocb is not NULL, - * swap_write_unplug() should be called. - */ - struct swap_iocb **swap_plug; - /* internal fields used by the ->writepages implementation: */ struct folio_batch fbatch; pgoff_t index; diff --git a/mm/page_io.c b/mm/page_io.c index fb52bedcc966..a2056a5ecb13 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -237,7 +237,7 @@ static void swap_zeromap_folio_clear(struct folio *folio) * We may have stale swap cache pages in memory: notice * them here and get rid of the unnecessary final write. 
-int swap_writeout(struct folio *folio, struct writeback_control *wbc) +int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug) { int ret = 0; @@ -281,7 +281,7 @@ int swap_writeout(struct folio *folio, struct writeback_control *wbc) return AOP_WRITEPAGE_ACTIVATE; } - __swap_writepage(folio, wbc->swap_plug); + __swap_writepage(folio, swap_plug); return 0; out_unlock: folio_unlock(folio); diff --git a/mm/shmem.c b/mm/shmem.c index ad8db487e721..eda35be2a8d9 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1635,21 +1635,13 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug, list_add(&info->swaplist, &shmem_swaplist); if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) { - struct writeback_control wbc = { - .sync_mode = WB_SYNC_NONE, - .nr_to_write = SWAP_CLUSTER_MAX, - .range_start = 0, - .range_end = LLONG_MAX, - .for_reclaim = 1, - .swap_plug = plug, - }; shmem_recalc_inode(inode, 0, nr_pages); swap_shmem_alloc(folio->swap, nr_pages); shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap)); mutex_unlock(&shmem_swaplist_mutex); BUG_ON(folio_mapped(folio)); - return swap_writeout(folio, &wbc); + return swap_writeout(folio, plug); } if (!info->swapped) list_del_init(&info->swaplist); diff --git a/mm/swap.h b/mm/swap.h index 911817f051c2..911ad5ff0f89 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -3,6 +3,8 @@ #define _MM_SWAP_H struct mempolicy; +struct swap_iocb; + extern int page_cluster; #ifdef CONFIG_SWAP @@ -20,7 +22,7 @@ static inline void swap_read_unplug(struct swap_iocb *plug) __swap_read_unplug(plug); } void swap_write_unplug(struct swap_iocb *sio); -int swap_writeout(struct folio *folio, struct writeback_control *wbc); +int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug); void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug); /* linux/mm/swap_state.c */ @@ -160,7 +162,8 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask, return NULL; } -static inline int swap_writeout(struct folio *f, struct writeback_control *wbc) +static inline int swap_writeout(struct folio *folio, + struct swap_iocb **swap_plug) { return 0; } diff --git a/mm/vmscan.c b/mm/vmscan.c index 3cceb619a853..a93a1ba9009e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -655,14 +655,6 @@ typedef enum { static pageout_t writeout(struct folio *folio, struct address_space *mapping, struct swap_iocb **plug, struct list_head *folio_list) { - struct writeback_control wbc = { - .sync_mode = WB_SYNC_NONE, - .nr_to_write = SWAP_CLUSTER_MAX, - .range_start = 0, - .range_end = LLONG_MAX, - .for_reclaim = 1, - .swap_plug = plug, - }; int res; folio_set_reclaim(folio); @@ -675,7 +667,7 @@ static pageout_t writeout(struct folio *folio, struct address_space *mapping, if (shmem_mapping(mapping)) res = shmem_writeout(folio, plug, folio_list); else - res = swap_writeout(folio, &wbc); + res = swap_writeout(folio, plug); if (res < 0) handle_write_error(mapping, folio, res); From a8fb49c6abbbe5c71e1a8a888ef2c4b3e341d169 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 10 Jun 2025 07:49:42 +0200 Subject: [PATCH 024/370] mm: remove the for_reclaim field from struct writeback_control This field is now only set to one in the i915 gem code that only calls writeback_iter on it, which ignores the flag. All other checks are thus dead code and the field can be removed.
Link: https://lkml.kernel.org/r/20250610054959.2057526-7-hch@lst.de Signed-off-by: Christoph Hellwig Cc: Baolin Wang Cc: Chengming Zhou Cc: Hugh Dickins Cc: Johannes Weiner Cc: Matthew Wilcox (Oracle) Cc: Nhat Pham Signed-off-by: Andrew Morton --- drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 1 - fs/fuse/file.c | 11 ----------- fs/nfs/write.c | 2 +- include/linux/writeback.h | 1 - include/trace/events/btrfs.h | 7 ++----- include/trace/events/writeback.h | 8 ++------ 6 files changed, 5 insertions(+), 25 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c index 24d8daa4fdb3..f263615f6ece 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c @@ -302,7 +302,6 @@ void __shmem_writeback(size_t size, struct address_space *mapping) .nr_to_write = SWAP_CLUSTER_MAX, .range_start = 0, .range_end = LLONG_MAX, - .for_reclaim = 1, }; struct folio *folio = NULL; int error = 0; diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 47006d0753f1..95a657a57786 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1927,17 +1927,6 @@ int fuse_write_inode(struct inode *inode, struct writeback_control *wbc) struct fuse_file *ff; int err; - /* - * Inode is always written before the last reference is dropped and - * hence this should not be reached from reclaim. - * - * Writing back the inode from reclaim can deadlock if the request - * processing itself needs an allocation. Allocations triggering - * reclaim while serving a request can't be prevented, because it can - * involve any number of unrelated userspace processes. - */ - WARN_ON(wbc->for_reclaim); - ff = __fuse_write_file_get(fi); err = fuse_flush_times(inode, ff); if (ff) diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 374fc6b34c79..cf1d720b8251 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -720,7 +720,7 @@ int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc) nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES); if (!(mntflags & NFS_MOUNT_WRITE_EAGER) || wbc->for_kupdate || - wbc->for_background || wbc->for_sync || wbc->for_reclaim) { + wbc->for_background || wbc->for_sync) { ioc = nfs_io_completion_alloc(GFP_KERNEL); if (ioc) nfs_io_completion_init(ioc, nfs_io_completion_commit, diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 9e960f2faf79..a2848d731a46 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -59,7 +59,6 @@ struct writeback_control { unsigned for_kupdate:1; /* A kupdate writeback */ unsigned for_background:1; /* A background writeback */ unsigned tagged_writepages:1; /* tag-and-write to avoid livelock */ - unsigned for_reclaim:1; /* Invoked from the page allocator */ unsigned range_cyclic:1; /* range_start is cyclic */ unsigned for_sync:1; /* sync(2) WB_SYNC_ALL writeback */ unsigned unpinned_netfs_wb:1; /* Cleared I_PINNING_NETFS_WB */ diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index bebc252db865..0adc40f5e72b 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -686,7 +686,6 @@ DECLARE_EVENT_CLASS(btrfs__writepage, __field( loff_t, range_start ) __field( loff_t, range_end ) __field( char, for_kupdate ) - __field( char, for_reclaim ) __field( char, range_cyclic ) __field( unsigned long, writeback_index ) __field( u64, root_objectid ) @@ -700,7 +699,6 @@ DECLARE_EVENT_CLASS(btrfs__writepage, __entry->range_start = wbc->range_start; __entry->range_end = wbc->range_end; __entry->for_kupdate = wbc->for_kupdate; - 
__entry->for_reclaim = wbc->for_reclaim; __entry->range_cyclic = wbc->range_cyclic; __entry->writeback_index = inode->i_mapping->writeback_index; __entry->root_objectid = btrfs_root_id(BTRFS_I(inode)->root); @@ -709,13 +707,12 @@ DECLARE_EVENT_CLASS(btrfs__writepage, TP_printk_btrfs("root=%llu(%s) ino=%llu page_index=%lu " "nr_to_write=%ld pages_skipped=%ld range_start=%llu " "range_end=%llu for_kupdate=%d " - "for_reclaim=%d range_cyclic=%d writeback_index=%lu", + "range_cyclic=%d writeback_index=%lu", show_root_type(__entry->root_objectid), __entry->ino, __entry->index, __entry->nr_to_write, __entry->pages_skipped, __entry->range_start, __entry->range_end, - __entry->for_kupdate, - __entry->for_reclaim, __entry->range_cyclic, + __entry->for_kupdate, __entry->range_cyclic, __entry->writeback_index) ); diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 0ff388131fc9..1e23919c0da9 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -459,7 +459,6 @@ DECLARE_EVENT_CLASS(wbc_class, __field(int, sync_mode) __field(int, for_kupdate) __field(int, for_background) - __field(int, for_reclaim) __field(int, range_cyclic) __field(long, range_start) __field(long, range_end) @@ -473,23 +472,20 @@ DECLARE_EVENT_CLASS(wbc_class, __entry->sync_mode = wbc->sync_mode; __entry->for_kupdate = wbc->for_kupdate; __entry->for_background = wbc->for_background; - __entry->for_reclaim = wbc->for_reclaim; __entry->range_cyclic = wbc->range_cyclic; __entry->range_start = (long)wbc->range_start; __entry->range_end = (long)wbc->range_end; __entry->cgroup_ino = __trace_wbc_assign_cgroup(wbc); ), - TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d " - "bgrd=%d reclm=%d cyclic=%d " - "start=0x%lx end=0x%lx cgroup_ino=%lu", + TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d bgrd=%d " + "cyclic=%d start=0x%lx end=0x%lx cgroup_ino=%lu", __entry->name, __entry->nr_to_write, __entry->pages_skipped, __entry->sync_mode, __entry->for_kupdate, __entry->for_background, - __entry->for_reclaim, __entry->range_cyclic, __entry->range_start, __entry->range_end, From 99edea30058bb5dd6c2fee6fa0c703fa4bc4b140 Mon Sep 17 00:00:00 2001 From: Barry Song Date: Thu, 5 Jun 2025 20:31:44 +1200 Subject: [PATCH 025/370] mm: madvise: use walk_page_range_vma() instead of walk_page_range() We've already found the VMA within madvise_walk_vmas() before calling specific madvise behavior functions like madvise_free_single_vma(). So calling walk_page_range() and doing find_vma() again seems unnecessary. It also prevents potential optimizations in those madvise callbacks, particularly the use of dedicated per-VMA locking. [v-songbaohua@oppo.com: revert the walk_page_range_vma change for MADV_GUARD_INSTALL] Link: https://lkml.kernel.org/r/20250609105513.10901-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20250605083144.43046-1-21cnbao@gmail.com Signed-off-by: Barry Song Reviewed-by: Anshuman Khandual Reviewed-by: Lorenzo Stoakes Acked-by: David Hildenbrand Reviewed-by: Oscar Salvador Reviewed-by: Harry Yoo Reviewed-by: Dev Jain Reviewed-by: Vlastimil Babka Reviewed-by: Ryan Roberts Tested-by: Lorenzo Stoakes Cc: "Liam R. 
Howlett" Cc: Jann Horn Cc: Suren Baghdasaryan Cc: Lokesh Gidra Cc: Tangquan Zheng Signed-off-by: Andrew Morton --- mm/madvise.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index 1d44a35ae85c..f543ef45f6a4 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -282,7 +282,7 @@ static long madvise_willneed(struct vm_area_struct *vma, *prev = vma; #ifdef CONFIG_SWAP if (!file) { - walk_page_range(vma->vm_mm, start, end, &swapin_walk_ops, vma); + walk_page_range_vma(vma, start, end, &swapin_walk_ops, vma); lru_add_drain(); /* Push any new pages onto the LRU now */ return 0; } @@ -582,7 +582,7 @@ static void madvise_cold_page_range(struct mmu_gather *tlb, }; tlb_start_vma(tlb, vma); - walk_page_range(vma->vm_mm, addr, end, &cold_walk_ops, &walk_private); + walk_page_range_vma(vma, addr, end, &cold_walk_ops, &walk_private); tlb_end_vma(tlb, vma); } @@ -620,7 +620,7 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb, }; tlb_start_vma(tlb, vma); - walk_page_range(vma->vm_mm, addr, end, &cold_walk_ops, &walk_private); + walk_page_range_vma(vma, addr, end, &cold_walk_ops, &walk_private); tlb_end_vma(tlb, vma); } @@ -827,7 +827,7 @@ static int madvise_free_single_vma(struct madvise_behavior *madv_behavior, mmu_notifier_invalidate_range_start(&range); tlb_start_vma(tlb, vma); - walk_page_range(vma->vm_mm, range.start, range.end, + walk_page_range_vma(vma, range.start, range.end, &madvise_free_walk_ops, tlb); tlb_end_vma(tlb, vma); mmu_notifier_invalidate_range_end(&range); @@ -1246,7 +1246,7 @@ static long madvise_guard_remove(struct vm_area_struct *vma, if (!is_valid_guard_vma(vma, /* allow_locked = */true)) return -EINVAL; - return walk_page_range(vma->vm_mm, start, end, + return walk_page_range_vma(vma, start, end, &guard_remove_walk_ops, NULL); } From 08e21e241210a3a124cfafd33db3dbda49b6f43d Mon Sep 17 00:00:00 2001 From: Richard Chang Date: Thu, 5 Jun 2025 07:25:32 +0000 Subject: [PATCH 026/370] mm/cma: pair the trace_cma_alloc_start/finish In the bad input validation cases, there is no trace_cma_alloc_finish to match the trace_cma_alloc_start. Move the trace_cma_alloc_start event after the validations. Link: https://lkml.kernel.org/r/20250605072532.972081-1-richardycc@google.com Signed-off-by: Richard Chang Acked-by: David Hildenbrand Cc: Kalesh Singh Cc: Martin Liu Cc: Minchan Kim Signed-off-by: Andrew Morton --- mm/cma.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/cma.c b/mm/cma.c index 397567883a10..bd3772773736 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -854,8 +854,6 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count, unsigned long i; const char *name = cma ? cma->name : NULL; - trace_cma_alloc_start(name, count, align); - if (!cma || !cma->count) return page; @@ -865,6 +863,8 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count, if (!count) return page; + trace_cma_alloc_start(name, count, align); + for (r = 0; r < cma->nranges; r++) { page = NULL; From bbcaee20e03ecaeeecba32a703816a0d4502b6c4 Mon Sep 17 00:00:00 2001 From: Chi Zhiling Date: Thu, 5 Jun 2025 13:49:35 +0800 Subject: [PATCH 027/370] readahead: fix return value of page_cache_next_miss() when no hole is found max_scan in page_cache_next_miss always decreases to zero when no hole is found, causing the return value to be index + 0. Fix this by preserving the max_scan value throughout the loop. 
Jan said "From what I know and have seen in the past, wrong responses from page_cache_next_miss() can lead to readahead window reduction and thus reduced read speeds." Link: https://lkml.kernel.org/r/20250605054935.2323451-1-chizhiling@163.com Fixes: 901a269ff3d5 ("filemap: fix page_cache_next_miss() when no hole found") Signed-off-by: Chi Zhiling Reviewed-by: Jan Kara Cc: Josef Bacik Cc: Matthew Wilcox (Oracle) Cc: Signed-off-by: Andrew Morton --- mm/filemap.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/filemap.c b/mm/filemap.c index bada249b9fb7..a6459874bb2a 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1778,8 +1778,9 @@ pgoff_t page_cache_next_miss(struct address_space *mapping, pgoff_t index, unsigned long max_scan) { XA_STATE(xas, &mapping->i_pages, index); + unsigned long nr = max_scan; - while (max_scan--) { + while (nr--) { void *entry = xas_next(&xas); if (!entry || xa_is_value(entry)) return xas.xa_index; From 4f745def815d86eece3614aa0cd10a25cf94fed4 Mon Sep 17 00:00:00 2001 From: Donet Tom Date: Wed, 28 May 2025 12:18:00 -0500 Subject: [PATCH 028/370] drivers/base/node: optimize memory block registration to reduce boot time MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "drivers/base/node.c: optimization and cleanups", v7. This patch (of 7) During node device initialization, `memory blocks` are registered under each NUMA node. The `memory blocks` to be registered are identified using the node's start and end PFNs, which are obtained from the node's pg_data However, not all PFNs within this range necessarily belong to the same node—some may belong to other nodes. Additionally, due to the discontiguous nature of physical memory, certain sections within a `memory block` may be absent. As a result, `memory blocks` that fall between a node's start and end PFNs may span across multiple nodes, and some sections within those blocks may be missing. `Memory blocks` have a fixed size, which is architecture dependent. Due to these considerations, the memory block registration is currently performed as follows: for_each_online_node(nid): start_pfn = pgdat->node_start_pfn; end_pfn = pgdat->node_start_pfn + node_spanned_pages; for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn)) mem_blk = memory_block_id(pfn_to_section_nr(pfn)); pfn_mb_start=section_nr_to_pfn(mem_blk->start_section_nr) pfn_mb_end = pfn_start + memory_block_pfns - 1 for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++): if (get_nid_for_pfn(pfn) != nid): continue; else do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY); Here, we derive the start and end PFNs from the node's pg_data, then determine the memory blocks that may belong to the node. For each `memory block` in this range, we inspect all PFNs it contains and check their associated NUMA node ID. If a PFN within the block matches the current node, the memory block is registered under that node. If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs a binary search in the `memblock regions` to determine the NUMA node ID for a given PFN. If it is not enabled, the node ID is retrieved directly from the struct page. On large systems, this process can become time-consuming, especially since we iterate over each `memory block` and all PFNs within it until a match is found. 
When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional overhead of the binary search increases the execution time significantly, potentially leading to soft lockups during boot. In this patch, we iterate over `memblock region` to identify the `memory blocks` that belong to the current NUMA node. `memblock regions` are contiguous memory ranges, each associated with a single NUMA node, and they do not span across multiple nodes. for_each_memory_region(r): // r => region if (!node_online(r->nid)): continue; else for_each_memory_block_between(r->base, r->base + r->size - 1): do_register_memory_block_under_node(r->nid, mem_blk, MEMINIT_EARLY); We iterate over all memblock regions, and if the node associated with the region is online, we calculate the start and end memory blocks based on the region's start and end PFNs. We then register all the memory blocks within that range under the region node. Test Results on My system with 32TB RAM ======================================= 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled. Without this patch ------------------ Startup finished in 1min 16.528s (kernel) With this patch --------------- Startup finished in 17.236s (kernel) - 78% Improvement 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled. Without this patch ------------------ Startup finished in 28.320s (kernel) With this patch --------------- Startup finished in 15.621s (kernel) - 46% Improvement [donettom@linux.ibm.com: restore removed extra line] Link: https://lkml.kernel.org/r/20250609140354.467908-1-donettom@linux.ibm.com Link: https://lkml.kernel.org/r/2a0a05c2dffc62a742bf1dd030098be4ce99be28.1748452241.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/2a0a05c2dffc62a742bf1dd030098be4ce99be28.1748452241.git.donettom@linux.ibm.com Signed-off-by: Donet Tom Acked-by: David Hildenbrand Acked-by: Oscar Salvador Acked-by: Mike Rapoport (Microsoft) Acked-by: Zi Yan Signed-off-by: Andrew Morton --- drivers/base/memory.c | 21 ++++----------------- drivers/base/node.c | 35 +++++++++++++++++++++++++++++++++-- include/linux/memory.h | 18 ++++++++++++++++++ include/linux/node.h | 3 +++ 4 files changed, 58 insertions(+), 19 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index ed3e69dc785c..5c6c1d6bb59f 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include @@ -48,22 +49,8 @@ int mhp_online_type_from_str(const char *str) #define to_memory_block(dev) container_of(dev, struct memory_block, dev) -static int sections_per_block; - -static inline unsigned long memory_block_id(unsigned long section_nr) -{ - return section_nr / sections_per_block; -} - -static inline unsigned long pfn_to_block_id(unsigned long pfn) -{ - return memory_block_id(pfn_to_section_nr(pfn)); -} - -static inline unsigned long phys_to_block_id(unsigned long phys) -{ - return pfn_to_block_id(PFN_DOWN(phys)); -} +int sections_per_block; +EXPORT_SYMBOL(sections_per_block); static int memory_subsys_online(struct device *dev); static int memory_subsys_offline(struct device *dev); @@ -683,7 +670,7 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn) * * Called under device_hotplug_lock. 
*/ -static struct memory_block *find_memory_block_by_id(unsigned long block_id) +struct memory_block *find_memory_block_by_id(unsigned long block_id) { struct memory_block *mem; diff --git a/drivers/base/node.c b/drivers/base/node.c index c19094481630..a854d04cea5d 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -21,6 +21,7 @@ #include #include #include +#include static const struct bus_type node_subsys = { .name = "node", @@ -859,6 +860,34 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk) kobject_name(&node_devices[mem_blk->nid]->dev.kobj)); } +/* register all memory blocks under the corresponding nodes */ +static void register_memory_blocks_under_nodes(void) +{ + struct memblock_region *r; + + for_each_mem_region(r) { + const unsigned long start_block_id = phys_to_block_id(r->base); + const unsigned long end_block_id = phys_to_block_id(r->base + r->size - 1); + const int nid = memblock_get_region_node(r); + unsigned long block_id; + + if (!node_online(nid)) + continue; + + for (block_id = start_block_id; block_id <= end_block_id; block_id++) { + struct memory_block *mem; + + mem = find_memory_block_by_id(block_id); + if (!mem) + continue; + + do_register_memory_block_under_node(nid, mem, MEMINIT_EARLY); + put_device(&mem->dev); + } + + } +} + void register_memory_blocks_under_node(int nid, unsigned long start_pfn, unsigned long end_pfn, enum meminit_context context) @@ -980,11 +1009,13 @@ void __init node_dev_init(void) /* * Create all node devices, which will properly link the node - * to applicable memory block devices and already created cpu devices. + * to already created cpu devices. */ for_each_online_node(i) { - ret = register_one_node(i); + ret = __register_one_node(i); if (ret) panic("%s() failed to add node: %d\n", __func__, ret); } + + register_memory_blocks_under_nodes(); } diff --git a/include/linux/memory.h b/include/linux/memory.h index 5ec4e6d209b9..bd4440bc4a57 100644 --- a/include/linux/memory.h +++ b/include/linux/memory.h @@ -179,12 +179,30 @@ struct memory_group *memory_group_find_by_id(int mgid); typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *); int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func, struct memory_group *excluded, void *arg); +struct memory_block *find_memory_block_by_id(unsigned long block_id); #define hotplug_memory_notifier(fn, pri) ({ \ static __meminitdata struct notifier_block fn##_mem_nb =\ { .notifier_call = fn, .priority = pri };\ register_memory_notifier(&fn##_mem_nb); \ }) +extern int sections_per_block; + +static inline unsigned long memory_block_id(unsigned long section_nr) +{ + return section_nr / sections_per_block; +} + +static inline unsigned long pfn_to_block_id(unsigned long pfn) +{ + return memory_block_id(pfn_to_section_nr(pfn)); +} + +static inline unsigned long phys_to_block_id(unsigned long phys) +{ + return pfn_to_block_id(PFN_DOWN(phys)); +} + #ifdef CONFIG_NUMA void memory_block_add_nid(struct memory_block *mem, int nid, enum meminit_context context); diff --git a/include/linux/node.h b/include/linux/node.h index 2b7517892230..485370f3bc17 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -120,6 +120,9 @@ static inline void register_memory_blocks_under_node(int nid, unsigned long star enum meminit_context context) { } +static inline void register_memory_blocks_under_nodes(void) +{ +} #endif extern void unregister_node(struct node *node); From 69e944b1606a0b6ba4663dc4fd9f43c2b85cdf54 Mon Sep 17 00:00:00 2001 From: Donet Tom Date: Wed, 
28 May 2025 12:18:01 -0500 Subject: [PATCH 029/370] drivers/base/node: remove register_mem_block_under_node_early() The function register_mem_block_under_node_early() is no longer used, as register_memory_blocks_under_node_early() now handles memory block registration during early boot. Removed register_mem_block_under_node_early() and get_nid_for_pfn(), as the latter was only used by the former. Link: https://lkml.kernel.org/r/22e0c5d20f1d33a91d0436ad22d96628cf084d1b.1748452242.git.donettom@linux.ibm.com Signed-off-by: Donet Tom Acked-by: Oscar Salvador Acked-by: Mike Rapoport (Microsoft) Acked-by: David Hildenbrand Acked-by: Zi Yan Signed-off-by: Andrew Morton --- drivers/base/node.c | 58 +-------------------------------------------- 1 file changed, 1 insertion(+), 57 deletions(-) diff --git a/drivers/base/node.c b/drivers/base/node.c index a854d04cea5d..24c32cdbe97d 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -757,15 +757,6 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid) } #ifdef CONFIG_MEMORY_HOTPLUG -static int __ref get_nid_for_pfn(unsigned long pfn) -{ -#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT - if (system_state < SYSTEM_RUNNING) - return early_pfn_to_nid(pfn); -#endif - return pfn_to_nid(pfn); -} - static void do_register_memory_block_under_node(int nid, struct memory_block *mem_blk, enum meminit_context context) @@ -792,46 +783,6 @@ static void do_register_memory_block_under_node(int nid, ret); } -/* register memory section under specified node if it spans that node */ -static int register_mem_block_under_node_early(struct memory_block *mem_blk, - void *arg) -{ - unsigned long memory_block_pfns = memory_block_size_bytes() / PAGE_SIZE; - unsigned long start_pfn = section_nr_to_pfn(mem_blk->start_section_nr); - unsigned long end_pfn = start_pfn + memory_block_pfns - 1; - int nid = *(int *)arg; - unsigned long pfn; - - for (pfn = start_pfn; pfn <= end_pfn; pfn++) { - int page_nid; - - /* - * memory block could have several absent sections from start. - * skip pfn range from absent section - */ - if (!pfn_in_present_section(pfn)) { - pfn = round_down(pfn + PAGES_PER_SECTION, - PAGES_PER_SECTION) - 1; - continue; - } - - /* - * We need to check if page belongs to nid only at the boot - * case because node's ranges can be interleaved. - */ - page_nid = get_nid_for_pfn(pfn); - if (page_nid < 0) - continue; - if (page_nid != nid) - continue; - - do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY); - return 0; - } - /* mem section does not span the specified node */ - return 0; -} - /* * During hotplug we know that all pages in the memory block belong to the same * node.
@@ -892,15 +843,8 @@ void register_memory_blocks_under_node(int nid, unsigned long start_pfn, unsigned long end_pfn, enum meminit_context context) { - walk_memory_blocks_func_t func; - - if (context == MEMINIT_HOTPLUG) - func = register_mem_block_under_node_hotplug; - else - func = register_mem_block_under_node_early; - walk_memory_blocks(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn - start_pfn), - (void *)&nid, func); + (void *)&nid, register_mem_block_under_node_hotplug); return; } #endif /* CONFIG_MEMORY_HOTPLUG */ From ac24f6cd87d88150fc6c1fef904794571f62dc5e Mon Sep 17 00:00:00 2001 From: Donet Tom Date: Wed, 28 May 2025 12:18:02 -0500 Subject: [PATCH 030/370] drivers/base/node: remove register_memory_blocks_under_node() function call from register_one_node MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit register_one_node() is now only called via cpu_up() → __try_online_node() during CPU hotplug operations to online a node. At this stage, the node has not yet had any memory added. As a result, there are no memory blocks to walk or register, so calling register_memory_blocks_under_node() is unnecessary. Therefore, the call to register_memory_blocks_under_node() has been removed from register_one_node(). Link: https://lkml.kernel.org/r/ecf07075b1a41015fcf58823997d5c2ed7b8c18f.1748452242.git.donettom@linux.ibm.com Signed-off-by: Donet Tom Acked-by: Oscar Salvador Acked-by: Mike Rapoport (Microsoft) Acked-by: David Hildenbrand Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/node.h | 16 +--------------- 1 file changed, 1 insertion(+), 15 deletions(-) diff --git a/include/linux/node.h b/include/linux/node.h index 485370f3bc17..b15de78e0408 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -134,21 +134,7 @@ extern int __register_one_node(int nid); /* Registers an online node */ static inline int register_one_node(int nid) { - int error = 0; - - if (node_online(nid)) { - struct pglist_data *pgdat = NODE_DATA(nid); - unsigned long start_pfn = pgdat->node_start_pfn; - unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages; - - error = __register_one_node(nid); - if (error) - return error; - register_memory_blocks_under_node(nid, start_pfn, end_pfn, - MEMINIT_EARLY); - } - - return error; + return __register_one_node(nid); } extern void unregister_one_node(int nid); From 10f09d82f8b7c25bbd0b6d5142ff6df6e634132d Mon Sep 17 00:00:00 2001 From: Donet Tom Date: Wed, 28 May 2025 12:18:03 -0500 Subject: [PATCH 031/370] drivers/base/node: rename register_memory_blocks_under_node() and remove context argument The function register_memory_blocks_under_node() is now only called from the memory hotplug path, as register_memory_blocks_under_node_early() handles registration during early boot. Therefore, the context argument used to differentiate between early boot and hotplug is no longer needed and was removed. 
Since the function is only called from the hotplug path, we renamed register_memory_blocks_under_node() to register_memory_blocks_under_node_hotplug() Link: https://lkml.kernel.org/r/907c22292b0ee4975107876efc875c75c11badd9.1748452242.git.donettom@linux.ibm.com Signed-off-by: Donet Tom Acked-by: Oscar Salvador Acked-by: Mike Rapoport (Microsoft) Acked-by: David Hildenbrand Acked-by: Zi Yan Signed-off-by: Andrew Morton --- drivers/base/node.c | 5 ++--- include/linux/node.h | 11 +++++------ mm/memory_hotplug.c | 5 ++--- 3 files changed, 9 insertions(+), 12 deletions(-) diff --git a/drivers/base/node.c b/drivers/base/node.c index 24c32cdbe97d..302cb99853dd 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -839,9 +839,8 @@ static void register_memory_blocks_under_nodes(void) } } -void register_memory_blocks_under_node(int nid, unsigned long start_pfn, - unsigned long end_pfn, - enum meminit_context context) +void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn, + unsigned long end_pfn) { walk_memory_blocks(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn - start_pfn), (void *)&nid, register_mem_block_under_node_hotplug); diff --git a/include/linux/node.h b/include/linux/node.h index b15de78e0408..75b036a100d2 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -111,13 +111,12 @@ struct memory_block; extern struct node *node_devices[]; #if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_NUMA) -void register_memory_blocks_under_node(int nid, unsigned long start_pfn, - unsigned long end_pfn, - enum meminit_context context); +void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn, + unsigned long end_pfn); #else -static inline void register_memory_blocks_under_node(int nid, unsigned long start_pfn, - unsigned long end_pfn, - enum meminit_context context) +static inline void register_memory_blocks_under_node_hotplug(int nid, + unsigned long start_pfn, + unsigned long end_pfn) { } static inline void register_memory_blocks_under_nodes(void) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index b1caedbade5b..c6ce1aba64e6 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1575,9 +1575,8 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) BUG_ON(ret); } - register_memory_blocks_under_node(nid, PFN_DOWN(start), - PFN_UP(start + size - 1), - MEMINIT_HOTPLUG); + register_memory_blocks_under_node_hotplug(nid, PFN_DOWN(start), + PFN_UP(start + size - 1)); /* create new memmap entry */ if (!strcmp(res->name, "System RAM")) From a5352f8a40a8d1b385abeca0b2cff5d2468e31a1 Mon Sep 17 00:00:00 2001 From: Donet Tom Date: Wed, 28 May 2025 12:18:04 -0500 Subject: [PATCH 032/370] drivers/base/node: rename __register_one_node() to register_one_node() The register_one_node() function was a simple wrapper around __register_one_node(). To simplify the code, register_one_node() has been removed, and __register_one_node() has been renamed to register_one_node(). 
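For reference, the wrapper being folded away had become a pure pass-through by this point in the series; a sketch of the pre-patch shape (cf. the include/linux/node.h hunk below):

	/* before: two names for one operation */
	static inline int register_one_node(int nid)
	{
		return __register_one_node(nid);
	}

	/* after: the core function carries the shorter name directly */
	extern int register_one_node(int nid);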
Link: https://lkml.kernel.org/r/8262cd0f44eeb048a1fcd3ac8382760d7f7dea60.1748452242.git.donettom@linux.ibm.com Signed-off-by: Donet Tom Acked-by: David Hildenbrand Cc: Mike Rapoport (Microsoft) Cc: Oscar Salvador Cc: Zi Yan Signed-off-by: Andrew Morton --- arch/powerpc/platforms/pseries/pci_dlpar.c | 2 +- drivers/base/node.c | 4 ++-- include/linux/node.h | 13 +------------ mm/memory_hotplug.c | 2 +- 4 files changed, 5 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c b/arch/powerpc/platforms/pseries/pci_dlpar.c index 52e2623a741d..aeb8633a3d00 100644 --- a/arch/powerpc/platforms/pseries/pci_dlpar.c +++ b/arch/powerpc/platforms/pseries/pci_dlpar.c @@ -29,7 +29,7 @@ struct pci_controller *init_phb_dynamic(struct device_node *dn) nid = of_node_to_nid(dn); if (likely((nid) >= 0)) { if (!node_online(nid)) { - if (__register_one_node(nid)) { + if (register_one_node(nid)) { pr_err("PCI: Failed to register node %d\n", nid); } else { update_numa_distance(dn); diff --git a/drivers/base/node.c b/drivers/base/node.c index 302cb99853dd..6cbeca45c451 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -848,7 +848,7 @@ void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn, } #endif /* CONFIG_MEMORY_HOTPLUG */ -int __register_one_node(int nid) +int register_one_node(int nid) { int error; int cpu; @@ -955,7 +955,7 @@ void __init node_dev_init(void) * to already created cpu devices. */ for_each_online_node(i) { - ret = __register_one_node(i); + ret = register_one_node(i); if (ret) panic("%s() failed to add node: %d\n", __func__, ret); } diff --git a/include/linux/node.h b/include/linux/node.h index 75b036a100d2..88bceebcbfa5 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -128,14 +128,7 @@ extern void unregister_node(struct node *node); #ifdef CONFIG_NUMA extern void node_dev_init(void); /* Core of the node registration - only memory hotplug should use this */ -extern int __register_one_node(int nid); - -/* Registers an online node */ -static inline int register_one_node(int nid) -{ - return __register_one_node(nid); -} - +extern int register_one_node(int nid); extern void unregister_one_node(int nid); extern int register_cpu_under_node(unsigned int cpu, unsigned int nid); extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid); @@ -148,10 +141,6 @@ extern int register_memory_node_under_compute_node(unsigned int mem_nid, static inline void node_dev_init(void) { } -static inline int __register_one_node(int nid) -{ - return 0; -} static inline int register_one_node(int nid) { return 0; diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index c6ce1aba64e6..bec20a91e757 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1571,7 +1571,7 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) * We online node here. We can't roll back from here. */ node_set_online(nid); - ret = __register_one_node(nid); + ret = register_one_node(nid); BUG_ON(ret); } From 7208cc6497c2615ed5a334b52c92ae98bda91198 Mon Sep 17 00:00:00 2001 From: Tal Zussman Date: Thu, 19 Jun 2025 21:24:23 -0400 Subject: [PATCH 033/370] userfaultfd: correctly prevent registering VM_DROPPABLE regions Patch series "mm: userfaultfd: assorted fixes and cleanups", v3. Two fixes and two cleanups for userfaultfd. Note that the third patch yields a small change in the ABI, but we seem to have concluded that it is acceptable in this case. 
This patch (of 4): vma_can_userfault() masks off non-userfaultfd VM flags from vm_flags. The vm_flags & VM_DROPPABLE test will then always be false, incorrectly allowing VM_DROPPABLE regions to be registered with userfaultfd. Additionally, vm_flags is not guaranteed to correspond to the actual VMA's flags. Fix this test by checking the VMA's flags directly. Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-0-a7274d3bd5e4@columbia.edu Link: https://lore.kernel.org/linux-mm/5a875a3a-2243-4eab-856f-bc53ccfec3ea@redhat.com/ Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-1-a7274d3bd5e4@columbia.edu Fixes: 9651fcedf7b9 ("mm: add MAP_DROPPABLE for designating always lazily freeable mappings") Signed-off-by: Tal Zussman Acked-by: David Hildenbrand Acked-by: Peter Xu Acked-by: Jason A. Donenfeld Cc: Al Viro Cc: Andrea Arcangeli Cc: Christian Brauner Cc: Jan Kara Signed-off-by: Andrew Morton --- include/linux/userfaultfd_k.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 75342022d144..f3b3d2c9dd5e 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -218,7 +218,7 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma, { vm_flags &= __VM_UFFD_FLAGS; - if (vm_flags & VM_DROPPABLE) + if (vma->vm_flags & VM_DROPPABLE) return false; if ((vm_flags & VM_UFFD_MINOR) && From 23ec90eb122faaf0468f450c5d5857a794956c75 Mon Sep 17 00:00:00 2001 From: Tal Zussman Date: Thu, 19 Jun 2025 21:24:24 -0400 Subject: [PATCH 034/370] userfaultfd: prevent unregistering VMAs through a different userfaultfd Currently, a VMA registered with a uffd can be unregistered through a different uffd associated with the same mm_struct. The existing behavior is slightly broken and may incorrectly reject unregistering some VMAs due to the following check: if (!vma_can_userfault(cur, cur->vm_flags, wp_async)) goto out_unlock; where wp_async is derived from ctx, not from cur. For example, a file-backed VMA registered with wp_async enabled and UFFD_WP mode cannot be unregistered through a uffd that does not have wp_async enabled. Rather than fix this and maintain this odd behavior, make unregistration stricter by requiring VMAs to be unregistered through the same uffd they were registered with. Additionally, reorder the BUG() checks to avoid the aforementioned wp_async issue in them. Convert the existing check to VM_WARN_ON_ONCE() as BUG_ON() is deprecated. This change slightly modifies the ABI. It should not be backported to -stable. It is expected that no one depends on this behavior, and no such cases are known. While at it, correct the comment for the no userfaultfd case. This seems to be a copy-paste artifact from the analogous userfaultfd_register() check. Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-2-a7274d3bd5e4@columbia.edu Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization") Signed-off-by: Tal Zussman Acked-by: David Hildenbrand Cc: Al Viro Cc: Andrea Arcangeli Cc: Christian Brauner Cc: Jan Kara Cc: Jason A. 
Donenfeld Cc: Peter Xu Signed-off-by: Andrew Morton --- fs/userfaultfd.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 22f4bf956ba1..8e7fb2a7a6aa 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1467,6 +1467,14 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^ !!(cur->vm_flags & __VM_UFFD_FLAGS)); + /* + * Prevent unregistering through a different userfaultfd than + * the one used for registration. + */ + if (cur->vm_userfaultfd_ctx.ctx && + cur->vm_userfaultfd_ctx.ctx != ctx) + goto out_unlock; + /* * Check not compatible vmas, not strictly required * here as not compatible vmas cannot have an @@ -1490,15 +1498,12 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, for_each_vma_range(vmi, vma, end) { cond_resched(); - BUG_ON(!vma_can_userfault(vma, vma->vm_flags, wp_async)); - - /* - * Nothing to do: this vma is already registered into this - * userfaultfd and with the right tracking mode too. - */ + /* VMA not registered with userfaultfd. */ if (!vma->vm_userfaultfd_ctx.ctx) goto skip; + VM_WARN_ON_ONCE(vma->vm_userfaultfd_ctx.ctx != ctx); + VM_WARN_ON_ONCE(!vma_can_userfault(vma, vma->vm_flags, wp_async)); WARN_ON(!(vma->vm_flags & VM_MAYWRITE)); if (vma->vm_start > start) From 31defc3b01d907e3e899de9e7002a6a63998f07a Mon Sep 17 00:00:00 2001 From: Tal Zussman Date: Thu, 19 Jun 2025 21:24:25 -0400 Subject: [PATCH 035/370] userfaultfd: remove (VM_)BUG_ON()s BUG_ON() is deprecated [1]. Convert all the BUG_ON()s and VM_BUG_ON()s to use VM_WARN_ON_ONCE(). There are a few additional cases that are converted or modified: - Convert the printk(KERN_WARNING ...) in handle_userfault() to use pr_warn(). - Convert the WARN_ON_ONCE()s in move_pages() to use VM_WARN_ON_ONCE(), as the relevant conditions are already checked in validate_range() in move_pages()'s caller. - Convert the VM_WARN_ON()'s in move_pages() to VM_WARN_ON_ONCE(). These cases should never happen and are similar to those in mfill_atomic() and mfill_atomic_hugetlb(), which were previously BUG_ON()s. move_pages() was added later than those functions and makes use of VM_WARN_ON() as a replacement for the deprecated BUG_ON(), but VM_WARN_ON_ONCE() is likely a better direct replacement. - Convert the WARN_ON() for !VM_MAYWRITE in userfaultfd_unregister() and userfaultfd_register_range() to VM_WARN_ON_ONCE(). This condition is enforced in userfaultfd_register() so it should never happen, and can be converted to a debug check. [1] https://www.kernel.org/doc/html/v6.15/process/coding-style.html#use-warn-rather-than-bug Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-3-a7274d3bd5e4@columbia.edu Signed-off-by: Tal Zussman Cc: Al Viro Cc: Andrea Arcangeli Cc: Christian Brauner Cc: David Hildenbrand Cc: Jan Kara Cc: Jason A.
Donenfeld Cc: Peter Xu Signed-off-by: Andrew Morton --- fs/userfaultfd.c | 59 +++++++++++++++++++++-------------------- mm/userfaultfd.c | 68 +++++++++++++++++++++++------------------------- 2 files changed, 62 insertions(+), 65 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 8e7fb2a7a6aa..771e81ea4ef6 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -165,14 +165,14 @@ static void userfaultfd_ctx_get(struct userfaultfd_ctx *ctx) static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx) { if (refcount_dec_and_test(&ctx->refcount)) { - VM_BUG_ON(spin_is_locked(&ctx->fault_pending_wqh.lock)); - VM_BUG_ON(waitqueue_active(&ctx->fault_pending_wqh)); - VM_BUG_ON(spin_is_locked(&ctx->fault_wqh.lock)); - VM_BUG_ON(waitqueue_active(&ctx->fault_wqh)); - VM_BUG_ON(spin_is_locked(&ctx->event_wqh.lock)); - VM_BUG_ON(waitqueue_active(&ctx->event_wqh)); - VM_BUG_ON(spin_is_locked(&ctx->fd_wqh.lock)); - VM_BUG_ON(waitqueue_active(&ctx->fd_wqh)); + VM_WARN_ON_ONCE(spin_is_locked(&ctx->fault_pending_wqh.lock)); + VM_WARN_ON_ONCE(waitqueue_active(&ctx->fault_pending_wqh)); + VM_WARN_ON_ONCE(spin_is_locked(&ctx->fault_wqh.lock)); + VM_WARN_ON_ONCE(waitqueue_active(&ctx->fault_wqh)); + VM_WARN_ON_ONCE(spin_is_locked(&ctx->event_wqh.lock)); + VM_WARN_ON_ONCE(waitqueue_active(&ctx->event_wqh)); + VM_WARN_ON_ONCE(spin_is_locked(&ctx->fd_wqh.lock)); + VM_WARN_ON_ONCE(waitqueue_active(&ctx->fd_wqh)); mmdrop(ctx->mm); kmem_cache_free(userfaultfd_ctx_cachep, ctx); } @@ -383,12 +383,12 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason) if (!ctx) goto out; - BUG_ON(ctx->mm != mm); + VM_WARN_ON_ONCE(ctx->mm != mm); /* Any unrecognized flag is a bug. */ - VM_BUG_ON(reason & ~__VM_UFFD_FLAGS); + VM_WARN_ON_ONCE(reason & ~__VM_UFFD_FLAGS); /* 0 or > 1 flags set is a bug; we expect exactly 1. */ - VM_BUG_ON(!reason || (reason & (reason - 1))); + VM_WARN_ON_ONCE(!reason || (reason & (reason - 1))); if (ctx->features & UFFD_FEATURE_SIGBUS) goto out; @@ -411,12 +411,11 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason) * to be sure not to return SIGBUS erroneously on * nowait invocations. 
*/ - BUG_ON(vmf->flags & FAULT_FLAG_RETRY_NOWAIT); + VM_WARN_ON_ONCE(vmf->flags & FAULT_FLAG_RETRY_NOWAIT); #ifdef CONFIG_DEBUG_VM if (printk_ratelimit()) { - printk(KERN_WARNING - "FAULT_FLAG_ALLOW_RETRY missing %x\n", - vmf->flags); + pr_warn("FAULT_FLAG_ALLOW_RETRY missing %x\n", + vmf->flags); dump_stack(); } #endif @@ -602,7 +601,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx, */ out: atomic_dec(&ctx->mmap_changing); - VM_BUG_ON(atomic_read(&ctx->mmap_changing) < 0); + VM_WARN_ON_ONCE(atomic_read(&ctx->mmap_changing) < 0); userfaultfd_ctx_put(ctx); } @@ -710,7 +709,7 @@ void dup_userfaultfd_fail(struct list_head *fcs) struct userfaultfd_ctx *ctx = fctx->new; atomic_dec(&octx->mmap_changing); - VM_BUG_ON(atomic_read(&octx->mmap_changing) < 0); + VM_WARN_ON_ONCE(atomic_read(&octx->mmap_changing) < 0); userfaultfd_ctx_put(octx); userfaultfd_ctx_put(ctx); @@ -1317,8 +1316,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, do { cond_resched(); - BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^ - !!(cur->vm_flags & __VM_UFFD_FLAGS)); + VM_WARN_ON_ONCE(!!cur->vm_userfaultfd_ctx.ctx ^ + !!(cur->vm_flags & __VM_UFFD_FLAGS)); /* check not compatible vmas */ ret = -EINVAL; @@ -1372,7 +1371,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, found = true; } for_each_vma_range(vmi, cur, end); - BUG_ON(!found); + VM_WARN_ON_ONCE(!found); ret = userfaultfd_register_range(ctx, vma, vm_flags, start, end, wp_async); @@ -1464,8 +1463,8 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, do { cond_resched(); - BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^ - !!(cur->vm_flags & __VM_UFFD_FLAGS)); + VM_WARN_ON_ONCE(!!cur->vm_userfaultfd_ctx.ctx ^ + !!(cur->vm_flags & __VM_UFFD_FLAGS)); /* * Prevent unregistering through a different userfaultfd than @@ -1487,7 +1486,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, found = true; } for_each_vma_range(vmi, cur, end); - BUG_ON(!found); + VM_WARN_ON_ONCE(!found); vma_iter_set(&vmi, start); prev = vma_prev(&vmi); @@ -1504,7 +1503,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, VM_WARN_ON_ONCE(vma->vm_userfaultfd_ctx.ctx != ctx); VM_WARN_ON_ONCE(!vma_can_userfault(vma, vma->vm_flags, wp_async)); - WARN_ON(!(vma->vm_flags & VM_MAYWRITE)); + VM_WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE)); if (vma->vm_start > start) start = vma->vm_start; @@ -1569,7 +1568,7 @@ static int userfaultfd_wake(struct userfaultfd_ctx *ctx, * len == 0 means wake all and we don't want to wake all here, * so check it again to be sure. 
*/ - VM_BUG_ON(!range.len); + VM_WARN_ON_ONCE(!range.len); wake_userfault(ctx, &range); ret = 0; @@ -1626,7 +1625,7 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx, return -EFAULT; if (ret < 0) goto out; - BUG_ON(!ret); + VM_WARN_ON_ONCE(!ret); /* len == 0 would wake all */ range.len = ret; if (!(uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE)) { @@ -1681,7 +1680,7 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx, if (ret < 0) goto out; /* len == 0 would wake all */ - BUG_ON(!ret); + VM_WARN_ON_ONCE(!ret); range.len = ret; if (!(uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_DONTWAKE)) { range.start = uffdio_zeropage.range.start; @@ -1793,7 +1792,7 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg) goto out; /* len == 0 would wake all */ - BUG_ON(!ret); + VM_WARN_ON_ONCE(!ret); range.len = ret; if (!(uffdio_continue.mode & UFFDIO_CONTINUE_MODE_DONTWAKE)) { range.start = uffdio_continue.range.start; @@ -1850,7 +1849,7 @@ static inline int userfaultfd_poison(struct userfaultfd_ctx *ctx, unsigned long goto out; /* len == 0 would wake all */ - BUG_ON(!ret); + VM_WARN_ON_ONCE(!ret); range.len = ret; if (!(uffdio_poison.mode & UFFDIO_POISON_MODE_DONTWAKE)) { range.start = uffdio_poison.range.start; @@ -2111,7 +2110,7 @@ static int new_userfaultfd(int flags) struct file *file; int fd; - BUG_ON(!current->mm); + VM_WARN_ON_ONCE(!current->mm); /* Check the UFFD_* constants for consistency. */ BUILD_BUG_ON(UFFD_USER_MODE_ONLY & UFFD_SHARED_FCNTL_FLAGS); diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 8253978ee0fb..9ff970980496 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -561,7 +561,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb( } while (src_addr < src_start + len) { - BUG_ON(dst_addr >= dst_start + len); + VM_WARN_ON_ONCE(dst_addr >= dst_start + len); /* * Serialize via vma_lock and hugetlb_fault_mutex. @@ -602,7 +602,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb( if (unlikely(err == -ENOENT)) { up_read(&ctx->map_changing_lock); uffd_mfill_unlock(dst_vma); - BUG_ON(!folio); + VM_WARN_ON_ONCE(!folio); err = copy_folio_from_user(folio, (const void __user *)src_addr, true); @@ -614,7 +614,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb( dst_vma = NULL; goto retry; } else - BUG_ON(folio); + VM_WARN_ON_ONCE(folio); if (!err) { dst_addr += vma_hpagesize; @@ -635,9 +635,9 @@ static __always_inline ssize_t mfill_atomic_hugetlb( out: if (folio) folio_put(folio); - BUG_ON(copied < 0); - BUG_ON(err > 0); - BUG_ON(!copied && !err); + VM_WARN_ON_ONCE(copied < 0); + VM_WARN_ON_ONCE(err > 0); + VM_WARN_ON_ONCE(!copied && !err); return copied ? copied : err; } #else /* !CONFIG_HUGETLB_PAGE */ @@ -711,12 +711,12 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, /* * Sanitize the command parameters: */ - BUG_ON(dst_start & ~PAGE_MASK); - BUG_ON(len & ~PAGE_MASK); + VM_WARN_ON_ONCE(dst_start & ~PAGE_MASK); + VM_WARN_ON_ONCE(len & ~PAGE_MASK); /* Does the address range wrap, or is the span zero-sized? 
*/ - BUG_ON(src_start + len <= src_start); - BUG_ON(dst_start + len <= dst_start); + VM_WARN_ON_ONCE(src_start + len <= src_start); + VM_WARN_ON_ONCE(dst_start + len <= dst_start); src_addr = src_start; dst_addr = dst_start; @@ -775,7 +775,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, while (src_addr < src_start + len) { pmd_t dst_pmdval; - BUG_ON(dst_addr >= dst_start + len); + VM_WARN_ON_ONCE(dst_addr >= dst_start + len); dst_pmd = mm_alloc_pmd(dst_mm, dst_addr); if (unlikely(!dst_pmd)) { @@ -818,7 +818,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, up_read(&ctx->map_changing_lock); uffd_mfill_unlock(dst_vma); - BUG_ON(!folio); + VM_WARN_ON_ONCE(!folio); kaddr = kmap_local_folio(folio, 0); err = copy_from_user(kaddr, @@ -832,7 +832,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, flush_dcache_folio(folio); goto retry; } else - BUG_ON(folio); + VM_WARN_ON_ONCE(folio); if (!err) { dst_addr += PAGE_SIZE; @@ -852,9 +852,9 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, out: if (folio) folio_put(folio); - BUG_ON(copied < 0); - BUG_ON(err > 0); - BUG_ON(!copied && !err); + VM_WARN_ON_ONCE(copied < 0); + VM_WARN_ON_ONCE(err > 0); + VM_WARN_ON_ONCE(!copied && !err); return copied ? copied : err; } @@ -940,11 +940,11 @@ int mwriteprotect_range(struct userfaultfd_ctx *ctx, unsigned long start, /* * Sanitize the command parameters: */ - BUG_ON(start & ~PAGE_MASK); - BUG_ON(len & ~PAGE_MASK); + VM_WARN_ON_ONCE(start & ~PAGE_MASK); + VM_WARN_ON_ONCE(len & ~PAGE_MASK); /* Does the address range wrap, or is the span zero-sized? */ - BUG_ON(start + len <= start); + VM_WARN_ON_ONCE(start + len <= start); mmap_read_lock(dst_mm); @@ -1738,15 +1738,13 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start, ssize_t moved = 0; /* Sanitize the command parameters. */ - if (WARN_ON_ONCE(src_start & ~PAGE_MASK) || - WARN_ON_ONCE(dst_start & ~PAGE_MASK) || - WARN_ON_ONCE(len & ~PAGE_MASK)) - goto out; + VM_WARN_ON_ONCE(src_start & ~PAGE_MASK); + VM_WARN_ON_ONCE(dst_start & ~PAGE_MASK); + VM_WARN_ON_ONCE(len & ~PAGE_MASK); /* Does the address range wrap, or is the span zero-sized? */ - if (WARN_ON_ONCE(src_start + len <= src_start) || - WARN_ON_ONCE(dst_start + len <= dst_start)) - goto out; + VM_WARN_ON_ONCE(src_start + len < src_start); + VM_WARN_ON_ONCE(dst_start + len < dst_start); err = uffd_move_lock(mm, dst_start, src_start, &dst_vma, &src_vma); if (err) @@ -1896,9 +1894,9 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start, up_read(&ctx->map_changing_lock); uffd_move_unlock(dst_vma, src_vma); out: - VM_WARN_ON(moved < 0); - VM_WARN_ON(err > 0); - VM_WARN_ON(!moved && !err); + VM_WARN_ON_ONCE(moved < 0); + VM_WARN_ON_ONCE(err > 0); + VM_WARN_ON_ONCE(!moved && !err); return moved ? 
moved : err; } @@ -1985,10 +1983,10 @@ int userfaultfd_register_range(struct userfaultfd_ctx *ctx, for_each_vma_range(vmi, vma, end) { cond_resched(); - BUG_ON(!vma_can_userfault(vma, vm_flags, wp_async)); - BUG_ON(vma->vm_userfaultfd_ctx.ctx && - vma->vm_userfaultfd_ctx.ctx != ctx); - WARN_ON(!(vma->vm_flags & VM_MAYWRITE)); + VM_WARN_ON_ONCE(!vma_can_userfault(vma, vm_flags, wp_async)); + VM_WARN_ON_ONCE(vma->vm_userfaultfd_ctx.ctx && + vma->vm_userfaultfd_ctx.ctx != ctx); + VM_WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE)); /* * Nothing to do: this vma is already registered into this @@ -2064,8 +2062,8 @@ void userfaultfd_release_all(struct mm_struct *mm, prev = NULL; for_each_vma(vmi, vma) { cond_resched(); - BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^ - !!(vma->vm_flags & __VM_UFFD_FLAGS)); + VM_WARN_ON_ONCE(!!vma->vm_userfaultfd_ctx.ctx ^ + !!(vma->vm_flags & __VM_UFFD_FLAGS)); if (vma->vm_userfaultfd_ctx.ctx != ctx) { prev = vma; continue; From 5e00e31867d16e235bb693b900c85e86dc2c3464 Mon Sep 17 00:00:00 2001 From: Tal Zussman Date: Thu, 19 Jun 2025 21:24:26 -0400 Subject: [PATCH 036/370] userfaultfd: remove UFFD_CLOEXEC, UFFD_NONBLOCK, and UFFD_FLAGS_SET UFFD_CLOEXEC, UFFD_NONBLOCK, and UFFD_FLAGS_SET have been unused since they were added in commit 932b18e0aec6 ("userfaultfd: linux/userfaultfd_k.h"). Remove them and the associated BUILD_BUG_ON() checks. Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-4-a7274d3bd5e4@columbia.edu Signed-off-by: Tal Zussman Acked-by: David Hildenbrand Acked-by: Peter Xu Cc: Al Viro Cc: Andrea Arcangeli Cc: Christian Brauner Cc: Jan Kara Cc: Jason A. Donenfeld Signed-off-by: Andrew Morton --- fs/userfaultfd.c | 2 -- include/linux/userfaultfd_k.h | 4 ---- 2 files changed, 6 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 771e81ea4ef6..a2928b0aec6f 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -2114,8 +2114,6 @@ static int new_userfaultfd(int flags) /* Check the UFFD_* constants for consistency. */ BUILD_BUG_ON(UFFD_USER_MODE_ONLY & UFFD_SHARED_FCNTL_FLAGS); - BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC); - BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK); if (flags & ~(UFFD_SHARED_FCNTL_FLAGS | UFFD_USER_MODE_ONLY)) return -EINVAL; diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index f3b3d2c9dd5e..ccad58602846 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -30,11 +30,7 @@ * from userfaultfd, in order to leave a free define-space for * shared O_* flags. */ -#define UFFD_CLOEXEC O_CLOEXEC -#define UFFD_NONBLOCK O_NONBLOCK - #define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK) -#define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS) /* * Start with fault_pending_wqh and fault_wqh so they're more likely From a6fde7add78d122f5e09cb6280f99c4b5ead7d56 Mon Sep 17 00:00:00 2001 From: Barry Song Date: Sun, 8 Jun 2025 10:01:50 +1200 Subject: [PATCH 037/370] mm: use per_vma lock for MADV_DONTNEED MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Certain madvise operations, especially MADV_DONTNEED, occur far more frequently than other madvise options, particularly in native and Java heaps for dynamic memory management. Currently, the mmap_lock is always held during these operations, even when unnecessary. This causes lock contention and can lead to severe priority inversion, where low-priority threads—such as Android's HeapTaskDaemon— hold the lock and block higher-priority threads. 
This patch enables the use of per-VMA locks when the advised range lies entirely within a single VMA, avoiding the need for full VMA traversal. In practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs. Tangquan's testing shows that over 99.5% of memory reclaimed by Android benefits from this per-VMA lock optimization. After extended runtime, 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while only 1,231 fell back to mmap_lock. To simplify handling, the implementation falls back to the standard mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of userfaultfd_remove(). Many thanks to Lorenzo's work[1] on "mm/madvise: support VMA read locks for MADV_DONTNEED[_LOCKED]": "Then use this mechanism to permit VMA locking to be done later in the madvise() logic and also to allow altering of the locking mode to permit falling back to an mmap read lock if required." One important point, as pointed out by Jann[2], is that untagged_addr_remote() requires holding mmap_lock. This is because address tagging on x86 and RISC-V is quite complex. Until untagged_addr_remote() becomes atomic—which seems unlikely in the near future—we cannot support per-VMA locks for remote processes. So for now, only local processes are supported. Lance said: : Just to put some numbers on it, I ran a micro-benchmark with 100 : parallel threads, where each thread calls madvise() on its own 1GiB : chunk of 64KiB mTHP-backed memory. The performance gain is huge: : : 1) MADV_DONTNEED saw its average time drop from 0.0508s to 0.0270s : (~47% faster) : : 2) MADV_FREE saw its average time drop from 0.3078s to 0.1095s (~64% : faster) [lorenzo.stoakes@oracle.com: avoid any chance of uninitialised pointer deref] Link: https://lkml.kernel.org/r/309d22ca-6cd9-4601-8402-d441a07d9443@lucifer.local Link: https://lore.kernel.org/all/0b96ce61-a52c-4036-b5b6-5c50783db51f@lucifer.local/ [1] Link: https://lore.kernel.org/all/CAG48ez11zi-1jicHUZtLhyoNPGGVB+ROeAJCUw48bsjk4bbEkA@mail.gmail.com/ [2] Link: https://lkml.kernel.org/r/20250607220150.2980-1-21cnbao@gmail.com Signed-off-by: Barry Song Signed-off-by: Lorenzo Stoakes Reviewed-by: Lorenzo Stoakes Acked-by: Qi Zheng Cc: "Liam R. Howlett" Cc: David Hildenbrand Cc: Vlastimil Babka Cc: Jann Horn Cc: Suren Baghdasaryan Cc: Lokesh Gidra Cc: Tangquan Zheng Cc: Lance Yang Signed-off-by: Andrew Morton --- mm/madvise.c | 202 ++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 152 insertions(+), 50 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index f543ef45f6a4..7d78d4b5fb18 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -48,38 +48,19 @@ struct madvise_walk_private { bool pageout; }; +enum madvise_lock_mode { + MADVISE_NO_LOCK, + MADVISE_MMAP_READ_LOCK, + MADVISE_MMAP_WRITE_LOCK, + MADVISE_VMA_READ_LOCK, +}; + struct madvise_behavior { int behavior; struct mmu_gather *tlb; + enum madvise_lock_mode lock_mode; }; -/* - * Any behaviour which results in changes to the vma->vm_flags needs to - * take mmap_lock for writing. Others, which simply traverse vmas, need - * to only take it for reading. - */ -static int madvise_need_mmap_write(int behavior) -{ - switch (behavior) { - case MADV_REMOVE: - case MADV_WILLNEED: - case MADV_DONTNEED: - case MADV_DONTNEED_LOCKED: - case MADV_COLD: - case MADV_PAGEOUT: - case MADV_FREE: - case MADV_POPULATE_READ: - case MADV_POPULATE_WRITE: - case MADV_COLLAPSE: - case MADV_GUARD_INSTALL: - case MADV_GUARD_REMOVE: - return 0; - default: - /* be safe, default to 1.
list exceptions explicitly */ - return 1; - } -} - #ifdef CONFIG_ANON_VMA_NAME struct anon_vma_name *anon_vma_name_alloc(const char *name) { @@ -1339,6 +1320,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, return madvise_guard_remove(vma, prev, start, end); } + /* We cannot provide prev in this lock mode. */ + VM_WARN_ON_ONCE(arg->lock_mode == MADVISE_VMA_READ_LOCK); anon_name = anon_vma_name(vma); anon_vma_name_get(anon_name); error = madvise_update_vma(vma, prev, start, end, new_flags, @@ -1488,6 +1471,44 @@ static bool process_madvise_remote_valid(int behavior) } } +/* + * Try to acquire a VMA read lock if possible. + * + * We only support this lock over a single VMA, which the input range must + * span either partially or fully. + * + * This function always returns with an appropriate lock held. If a VMA read + * lock could be acquired, we return the locked VMA. + * + * If a VMA read lock could not be acquired, we return NULL and expect caller to + * fallback to mmap lock behaviour. + */ +static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm, + struct madvise_behavior *madv_behavior, + unsigned long start, unsigned long end) +{ + struct vm_area_struct *vma; + + vma = lock_vma_under_rcu(mm, start); + if (!vma) + goto take_mmap_read_lock; + /* + * Must span only a single VMA; uffd and remote processes are + * unsupported. + */ + if (end > vma->vm_end || current->mm != mm || + userfaultfd_armed(vma)) { + vma_end_read(vma); + goto take_mmap_read_lock; + } + return vma; + +take_mmap_read_lock: + mmap_read_lock(mm); + madv_behavior->lock_mode = MADVISE_MMAP_READ_LOCK; + return NULL; +} + /* * Walk the vmas in range [start,end), and call the visit function on each one. * The visit function will get start and end parameters that cover the overlap @@ -1498,7 +1519,8 @@ static bool process_madvise_remote_valid(int behavior) */ static int madvise_walk_vmas(struct mm_struct *mm, unsigned long start, - unsigned long end, void *arg, + unsigned long end, struct madvise_behavior *madv_behavior, + void *arg, int (*visit)(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end, void *arg)) @@ -1507,6 +1529,21 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start, struct vm_area_struct *prev; unsigned long tmp; int unmapped_error = 0; + int error; + + /* + * If VMA read lock is supported, apply madvise to a single VMA + * tentatively, avoiding walking VMAs. + */ + if (madv_behavior && madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) { + vma = try_vma_read_lock(mm, madv_behavior, start, end); + if (vma) { + prev = vma; + error = visit(vma, &prev, start, end, arg); + vma_end_read(vma); + return error; + } + } /* * If the interval [start,end) covers some unmapped address @@ -1518,8 +1555,6 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start, prev = vma; for (;;) { - int error; - /* Still start < end. */ if (!vma) return -ENOMEM; @@ -1600,34 +1635,86 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start, if (end == start) return 0; - return madvise_walk_vmas(mm, start, end, anon_name, + return madvise_walk_vmas(mm, start, end, NULL, anon_name, madvise_vma_anon_name); } #endif /* CONFIG_ANON_VMA_NAME */ -static int madvise_lock(struct mm_struct *mm, int behavior) -{ - if (is_memory_failure(behavior)) - return 0; - if (madvise_need_mmap_write(behavior)) { +/* + * Any behaviour which results in changes to the vma->vm_flags needs to + * take mmap_lock for writing. 
Others, which simply traverse vmas, need + * to only take it for reading. + */ +static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavior) +{ + int behavior = madv_behavior->behavior; + + if (is_memory_failure(behavior)) + return MADVISE_NO_LOCK; + + switch (behavior) { + case MADV_REMOVE: + case MADV_WILLNEED: + case MADV_COLD: + case MADV_PAGEOUT: + case MADV_FREE: + case MADV_POPULATE_READ: + case MADV_POPULATE_WRITE: + case MADV_COLLAPSE: + case MADV_GUARD_INSTALL: + case MADV_GUARD_REMOVE: + return MADVISE_MMAP_READ_LOCK; + case MADV_DONTNEED: + case MADV_DONTNEED_LOCKED: + return MADVISE_VMA_READ_LOCK; + default: + return MADVISE_MMAP_WRITE_LOCK; + } +} + +static int madvise_lock(struct mm_struct *mm, + struct madvise_behavior *madv_behavior) +{ + enum madvise_lock_mode lock_mode = get_lock_mode(madv_behavior); + + switch (lock_mode) { + case MADVISE_NO_LOCK: + break; + case MADVISE_MMAP_WRITE_LOCK: if (mmap_write_lock_killable(mm)) return -EINTR; - } else { + break; + case MADVISE_MMAP_READ_LOCK: mmap_read_lock(mm); + break; + case MADVISE_VMA_READ_LOCK: + /* We will acquire the lock per-VMA in madvise_walk_vmas(). */ + break; } + + madv_behavior->lock_mode = lock_mode; return 0; } -static void madvise_unlock(struct mm_struct *mm, int behavior) +static void madvise_unlock(struct mm_struct *mm, + struct madvise_behavior *madv_behavior) { - if (is_memory_failure(behavior)) + switch (madv_behavior->lock_mode) { + case MADVISE_NO_LOCK: return; - - if (madvise_need_mmap_write(behavior)) + case MADVISE_MMAP_WRITE_LOCK: mmap_write_unlock(mm); - else + break; + case MADVISE_MMAP_READ_LOCK: mmap_read_unlock(mm); + break; + case MADVISE_VMA_READ_LOCK: + /* We will drop the lock per-VMA in madvise_walk_vmas(). */ + break; + } + + madv_behavior->lock_mode = MADVISE_NO_LOCK; } static bool madvise_batch_tlb_flush(int behavior) @@ -1712,6 +1799,21 @@ static bool is_madvise_populate(int behavior) } } +/* + * untagged_addr_remote() assumes mmap_lock is already held. On + * architectures like x86 and RISC-V, tagging is tricky because each + * mm may have a different tagging mask. However, we might only hold + * the per-VMA lock (currently only local processes are supported), + * so untagged_addr is used to avoid the mmap_lock assertion for + * local processes. + */ +static inline unsigned long get_untagged_addr(struct mm_struct *mm, + unsigned long start) +{ + return current->mm == mm ? 
untagged_addr(start) : + untagged_addr_remote(mm, start); +} + static int madvise_do_behavior(struct mm_struct *mm, unsigned long start, size_t len_in, struct madvise_behavior *madv_behavior) @@ -1723,7 +1825,7 @@ static int madvise_do_behavior(struct mm_struct *mm, if (is_memory_failure(behavior)) return madvise_inject_error(behavior, start, start + len_in); - start = untagged_addr_remote(mm, start); + start = get_untagged_addr(mm, start); end = start + PAGE_ALIGN(len_in); blk_start_plug(&plug); @@ -1731,7 +1833,7 @@ if (is_madvise_populate(behavior)) error = madvise_populate(mm, start, end, behavior); else error = madvise_walk_vmas(mm, start, end, madv_behavior, - madvise_vma_behavior); + madv_behavior, madvise_vma_behavior); blk_finish_plug(&plug); return error; } @@ -1819,13 +1921,13 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh if (madvise_should_skip(start, len_in, behavior, &error)) return error; - error = madvise_lock(mm, behavior); + error = madvise_lock(mm, &madv_behavior); if (error) return error; madvise_init_tlb(&madv_behavior, mm); error = madvise_do_behavior(mm, start, len_in, &madv_behavior); madvise_finish_tlb(&madv_behavior); - madvise_unlock(mm, behavior); + madvise_unlock(mm, &madv_behavior); return error; } @@ -1849,7 +1951,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter, total_len = iov_iter_count(iter); - ret = madvise_lock(mm, behavior); + ret = madvise_lock(mm, &madv_behavior); if (ret) return ret; madvise_init_tlb(&madv_behavior, mm); @@ -1882,8 +1984,8 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter, /* Drop and reacquire lock to unwind race. */ madvise_finish_tlb(&madv_behavior); - madvise_unlock(mm, behavior); - ret = madvise_lock(mm, behavior); + madvise_unlock(mm, &madv_behavior); + ret = madvise_lock(mm, &madv_behavior); if (ret) goto out; madvise_init_tlb(&madv_behavior, mm); @@ -1894,7 +1996,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter, iov_iter_advance(iter, iter_iov_len(iter)); } madvise_finish_tlb(&madv_behavior); - madvise_unlock(mm, behavior); + madvise_unlock(mm, &madv_behavior); out: ret = (total_len - iov_iter_count(iter)) ? : ret; From ff7ec8dc1b646296f8d94c39339e8d3833d16c05 Mon Sep 17 00:00:00 2001 From: wangzijie Date: Sat, 7 Jun 2025 10:13:53 +0800 Subject: [PATCH 038/370] proc: use the same treatment to check proc_lseek as ones for proc_read_iter et al. Checking pde->proc_ops->proc_lseek directly may cause a UAF in the rmmod scenario. It's a gap in proc_reg_open() after commit 654b33ada4ab ("proc: fix UAF in proc_get_inode()"). Following Al Viro's suggestion, fix it in the same manner. Link: https://lkml.kernel.org/r/20250607021353.1127963-1-wangzijie1@honor.com Fixes: 3f61631d47f1 ("take care to handle NULL ->proc_lseek()") Signed-off-by: wangzijie Reviewed-by: Alexey Dobriyan Cc: Alexei Starovoitov Cc: Al Viro Cc: "Edgecombe, Rick P" Cc: Kirill A.
Shutemov Signed-off-by: Andrew Morton --- fs/proc/generic.c | 2 ++ fs/proc/inode.c | 2 +- fs/proc/internal.h | 5 +++++ include/linux/proc_fs.h | 1 + 4 files changed, 9 insertions(+), 1 deletion(-) diff --git a/fs/proc/generic.c b/fs/proc/generic.c index a3e22803cddf..e0e50914ab25 100644 --- a/fs/proc/generic.c +++ b/fs/proc/generic.c @@ -569,6 +569,8 @@ static void pde_set_flags(struct proc_dir_entry *pde) if (pde->proc_ops->proc_compat_ioctl) pde->flags |= PROC_ENTRY_proc_compat_ioctl; #endif + if (pde->proc_ops->proc_lseek) + pde->flags |= PROC_ENTRY_proc_lseek; } struct proc_dir_entry *proc_create_data(const char *name, umode_t mode, diff --git a/fs/proc/inode.c b/fs/proc/inode.c index 3604b616311c..129490151be1 100644 --- a/fs/proc/inode.c +++ b/fs/proc/inode.c @@ -473,7 +473,7 @@ static int proc_reg_open(struct inode *inode, struct file *file) typeof_member(struct proc_ops, proc_open) open; struct pde_opener *pdeo; - if (!pde->proc_ops->proc_lseek) + if (!pde_has_proc_lseek(pde)) file->f_mode &= ~FMODE_LSEEK; if (pde_is_permanent(pde)) { diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 96122e91c645..3d48ffe72583 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -99,6 +99,11 @@ static inline bool pde_has_proc_compat_ioctl(const struct proc_dir_entry *pde) #endif } +static inline bool pde_has_proc_lseek(const struct proc_dir_entry *pde) +{ + return pde->flags & PROC_ENTRY_proc_lseek; +} + extern struct kmem_cache *proc_dir_entry_cache; void pde_free(struct proc_dir_entry *pde); diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h index ea62201c74c4..703d0c76cc9a 100644 --- a/include/linux/proc_fs.h +++ b/include/linux/proc_fs.h @@ -27,6 +27,7 @@ enum { PROC_ENTRY_proc_read_iter = 1U << 1, PROC_ENTRY_proc_compat_ioctl = 1U << 2, + PROC_ENTRY_proc_lseek = 1U << 3, }; struct proc_ops { From 1e6b17b4237dacb02e9cfeaed35d889bbc9e8a84 Mon Sep 17 00:00:00 2001 From: Dev Jain Date: Wed, 4 Jun 2025 09:45:33 +0530 Subject: [PATCH 039/370] xarray: add a BUG_ON() to ensure caller is not sibling Suppose xas is pointing somewhere near the end of the multi-entry batch. Then it may happen that the computed slot already falls beyond the batch, thus breaking the loop due to !xa_is_sibling(), and computing the wrong order. For example, suppose we have a shift-6 node having an order-9 entry => 2^(9-6) = 8 slots, i.e. 8 - 1 = 7 siblings, so assume the slots are at offsets 0 through 7 in this node. If xas->xa_offset is 6, then the code will compute order as 1 + xas->xa_node->shift = 7. Therefore, the order computation must start from the beginning of the multi-slot entries, that is, the non-sibling entry. Thus ensure that the caller is aware of this by triggering a BUG when the entry is a sibling entry. Note that this BUG_ON() is only active while running selftests, so there is no overhead in a running kernel. Link: https://lkml.kernel.org/r/20250604041533.91198-1-dev.jain@arm.com Signed-off-by: Dev Jain Acked-by: Zi Yan Cc: "Aneesh Kumar K.V" Cc: Anshuman Khandual Cc: David Hildenbrand Cc: Matthew Wilcox (Oracle) Cc: Ryan Roberts Signed-off-by: Andrew Morton --- lib/xarray.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/lib/xarray.c b/lib/xarray.c index 76dde3a1cacf..ae3d80f4b4ee 100644 --- a/lib/xarray.c +++ b/lib/xarray.c @@ -1910,6 +1910,7 @@ EXPORT_SYMBOL(xa_store_range); * @xas: XArray operation state. * * Called after xas_load, the xas should not be in an error state. + * The xas should not be pointing to a sibling entry. * * Return: A number between 0 and 63 indicating the order of the entry.
*/ @@ -1920,6 +1921,8 @@ int xas_get_order(struct xa_state *xas) if (!xas->xa_node) return 0; + XA_NODE_BUG_ON(xas->xa_node, xa_is_sibling(xa_entry(xas->xa, + xas->xa_node, xas->xa_offset))); for (;;) { unsigned int slot = xas->xa_offset + (1 << order); From 1ec8a6e30e9c4ad06ea5e94536623c59c4ae2400 Mon Sep 17 00:00:00 2001 From: Joshua Hahn Date: Mon, 2 Jun 2025 09:23:40 -0700 Subject: [PATCH 040/370] mm/mempolicy: skip unnecessary synchronize_rcu() By unconditionally setting wi_state to NULL and conditionally calling synchronize_rcu(), we can save an unnecessary call when there is no old_wi_state. Link: https://lkml.kernel.org/r/20250602162345.2595696-2-joshua.hahnjy@gmail.com Signed-off-by: Joshua Hahn Suggested-by: David Hildenbrand Reviewed-by: Huang Ying Cc: Alistair Popple Cc: Byungchul Park Cc: Gregory Price Cc: "Huang, Ying" Cc: kernel test robot Cc: Matthew Brost Cc: Rakie Kim Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/mempolicy.c | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 3b1dfd08338b..b0619d0020c9 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -3703,18 +3703,15 @@ static void wi_state_free(void) struct weighted_interleave_state *old_wi_state; mutex_lock(&wi_state_lock); - old_wi_state = rcu_dereference_protected(wi_state, lockdep_is_held(&wi_state_lock)); - if (!old_wi_state) { - mutex_unlock(&wi_state_lock); - return; - } - rcu_assign_pointer(wi_state, NULL); mutex_unlock(&wi_state_lock); - synchronize_rcu(); - kfree(old_wi_state); + + if (old_wi_state) { + synchronize_rcu(); + kfree(old_wi_state); + } } static struct kobj_attribute wi_auto_attr = From bdb86f6b87633cc020f8225ae09d336da7826724 Mon Sep 17 00:00:00 2001 From: Ryan Roberts Date: Mon, 9 Jun 2025 10:27:23 +0100 Subject: [PATCH 041/370] mm/readahead: honour new_order in page_cache_ra_order() Patch series "Readahead tweaks for larger folios", v5. This series adds some tweaks to readahead so that it does a better job of ramping up folio sizes as readahead extends further into the file. And it additionally special-cases executable mappings to allow the arch to request a preferred folio size for text. This patch (of 5): page_cache_ra_order() takes a parameter called new_order, which is intended to express the preferred order of the folios that will be allocated for the readahead operation. Most callers indeed call this with their preferred new order. But page_cache_async_ra() calls it with the preferred order of the previous readahead request (actually the order of the folio that had the readahead marker, which may be smaller when alignment comes into play). And despite the parameter name, page_cache_ra_order() always treats it as the old order, adding 2 to it on entry. As a result, a cold readahead always starts with order-2 folios. Let's fix this behaviour by always passing in the *new* order.
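To make the intended ramp concrete, the order progression can be sketched as follows (a minimal standalone C sketch, not kernel code; ORDER_CLAMP stands in for the ra_pages/ilog2(ra->size) clamping done in page_cache_ra_order()):

#include <stdio.h>

#define ORDER_CLAMP 5	/* stand-in for the ra_pages-derived clamp */

int main(void)
{
	unsigned int order = 0;		/* a cold readahead now starts at order-0 */

	for (int round = 0; round < 6; round++) {
		printf("readahead round %d: order %u\n", round, order);
		order += 2;		/* page_cache_async_ra() bumps by 2 per round */
		if (order > ORDER_CLAMP)
			order = ORDER_CLAMP;
	}
	return 0;
}

This prints orders 0, 2, 4, 5, 5, ..., matching the post-change worked example below; before the change, even the very first round already used order-2 folios.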
Worked example: Prior to the change, mmapping an 8MB file and touching each page sequentially resulted in the following, where we start with order-2 folios for the first 128K, then ramp up to order-4 for the next 128K, then get clamped to order-5 for the rest of the file because ra_pages is limited to 128K:

TYPE   STARTOFFS   ENDOFFS       SIZE  STARTPG  ENDPG  NRPG  ORDER
-----  ----------  ----------  ------  -------  -----  ----  -----
FOLIO  0x00000000  0x00004000   16384        0      4     4      2
FOLIO  0x00004000  0x00008000   16384        4      8     4      2
FOLIO  0x00008000  0x0000c000   16384        8     12     4      2
FOLIO  0x0000c000  0x00010000   16384       12     16     4      2
FOLIO  0x00010000  0x00014000   16384       16     20     4      2
FOLIO  0x00014000  0x00018000   16384       20     24     4      2
FOLIO  0x00018000  0x0001c000   16384       24     28     4      2
FOLIO  0x0001c000  0x00020000   16384       28     32     4      2
FOLIO  0x00020000  0x00030000   65536       32     48    16      4
FOLIO  0x00030000  0x00040000   65536       48     64    16      4
FOLIO  0x00040000  0x00060000  131072       64     96    32      5
FOLIO  0x00060000  0x00080000  131072       96    128    32      5
FOLIO  0x00080000  0x000a0000  131072      128    160    32      5
FOLIO  0x000a0000  0x000c0000  131072      160    192    32      5
...

After the change, the same operation results in the first 128K being order-0, then we start ramping up to order-2, -4, and finally get clamped at order-5:

TYPE   STARTOFFS   ENDOFFS       SIZE  STARTPG  ENDPG  NRPG  ORDER
-----  ----------  ----------  ------  -------  -----  ----  -----
FOLIO  0x00000000  0x00001000    4096        0      1     1      0
FOLIO  0x00001000  0x00002000    4096        1      2     1      0
FOLIO  0x00002000  0x00003000    4096        2      3     1      0
FOLIO  0x00003000  0x00004000    4096        3      4     1      0
FOLIO  0x00004000  0x00005000    4096        4      5     1      0
FOLIO  0x00005000  0x00006000    4096        5      6     1      0
FOLIO  0x00006000  0x00007000    4096        6      7     1      0
FOLIO  0x00007000  0x00008000    4096        7      8     1      0
FOLIO  0x00008000  0x00009000    4096        8      9     1      0
FOLIO  0x00009000  0x0000a000    4096        9     10     1      0
FOLIO  0x0000a000  0x0000b000    4096       10     11     1      0
FOLIO  0x0000b000  0x0000c000    4096       11     12     1      0
FOLIO  0x0000c000  0x0000d000    4096       12     13     1      0
FOLIO  0x0000d000  0x0000e000    4096       13     14     1      0
FOLIO  0x0000e000  0x0000f000    4096       14     15     1      0
FOLIO  0x0000f000  0x00010000    4096       15     16     1      0
FOLIO  0x00010000  0x00011000    4096       16     17     1      0
FOLIO  0x00011000  0x00012000    4096       17     18     1      0
FOLIO  0x00012000  0x00013000    4096       18     19     1      0
FOLIO  0x00013000  0x00014000    4096       19     20     1      0
FOLIO  0x00014000  0x00015000    4096       20     21     1      0
FOLIO  0x00015000  0x00016000    4096       21     22     1      0
FOLIO  0x00016000  0x00017000    4096       22     23     1      0
FOLIO  0x00017000  0x00018000    4096       23     24     1      0
FOLIO  0x00018000  0x00019000    4096       24     25     1      0
FOLIO  0x00019000  0x0001a000    4096       25     26     1      0
FOLIO  0x0001a000  0x0001b000    4096       26     27     1      0
FOLIO  0x0001b000  0x0001c000    4096       27     28     1      0
FOLIO  0x0001c000  0x0001d000    4096       28     29     1      0
FOLIO  0x0001d000  0x0001e000    4096       29     30     1      0
FOLIO  0x0001e000  0x0001f000    4096       30     31     1      0
FOLIO  0x0001f000  0x00020000    4096       31     32     1      0
FOLIO  0x00020000  0x00024000   16384       32     36     4      2
FOLIO  0x00024000  0x00028000   16384       36     40     4      2
FOLIO  0x00028000  0x0002c000   16384       40     44     4      2
FOLIO  0x0002c000  0x00030000   16384       44     48     4      2
FOLIO  0x00030000  0x00034000   16384       48     52     4      2
FOLIO  0x00034000  0x00038000   16384       52     56     4      2
FOLIO  0x00038000  0x0003c000   16384       56     60     4      2
FOLIO  0x0003c000  0x00040000   16384       60     64     4      2
FOLIO  0x00040000  0x00050000   65536       64     80    16      4
FOLIO  0x00050000  0x00060000   65536       80     96    16      4
FOLIO  0x00060000  0x00080000  131072       96    128    32      5
FOLIO  0x00080000  0x000a0000  131072      128    160    32      5
FOLIO  0x000a0000  0x000c0000  131072      160    192    32      5
FOLIO  0x000c0000  0x000e0000  131072      192    224    32      5
...
Link: https://lkml.kernel.org/r/20250609092729.274960-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20250609092729.274960-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts Acked-by: David Hildenbrand Reviewed-by: Jan Kara Tested-by: Chaitanya S Prakash Cc: Will Deacon Signed-off-by: Andrew Morton --- mm/readahead.c | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/mm/readahead.c b/mm/readahead.c index 20d36d6b055e..973de2551efe 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -468,20 +468,12 @@ void page_cache_ra_order(struct readahead_control *ractl, unsigned int nofs; int err = 0; gfp_t gfp = readahead_gfp_mask(mapping); - unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping)); - /* - * Fallback when size < min_nrpages as each folio should be - * at least min_nrpages anyway. - */ - if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size) + if (!mapping_large_folio_support(mapping)) goto fallback; limit = min(limit, index + ra->size - 1); - if (new_order < mapping_max_folio_order(mapping)) - new_order += 2; - new_order = min(mapping_max_folio_order(mapping), new_order); new_order = min_t(unsigned int, new_order, ilog2(ra->size)); new_order = max(new_order, min_order); @@ -683,6 +675,7 @@ void page_cache_async_ra(struct readahead_control *ractl, ra->size = get_next_ra_size(ra, max_pages); ra->async_size = ra->size; readit: + order += 2; ractl->_index = ra->start; page_cache_ra_order(ractl, ra, order); } From 18ebe55a9236b33d519fcc8669730cd02386a2dc Mon Sep 17 00:00:00 2001 From: Ryan Roberts Date: Mon, 9 Jun 2025 10:27:24 +0100 Subject: [PATCH 042/370] mm/readahead: terminate async readahead on natural boundary Previously asynchronous readahead would read ra_pages (usually 128K) directly after the end of the synchronous readahead, and given that the synchronous readahead portion had no alignment guarantees (beyond page boundaries), it was possible (and likely) that the end of the initial 128K region would not fall on a natural boundary for the folio size being used. Therefore smaller folios were used to align down to the required boundary, both at the end of the previous readahead block and at the start of the new one. In the worst cases, this can result in never properly ramping up the folio size, and instead getting stuck oscillating between order-0, -1 and -2 folios. The next readahead will try to use folios whose order is +2 bigger than the folio that had the readahead marker. But because of the alignment requirements, that folio (the first one in the readahead block) can end up being order-0 in some cases. There will be 2 modifications to solve this issue: 1) Calculate the readahead size so the end is aligned to a folio boundary. This prevents needing to allocate small folios to align down at the end of the window and fixes the oscillation problem. 2) Remember the "preferred folio order" in the ra state instead of inferring it from the folio with the readahead marker. This solves the slow ramp-up problem (discussed in a subsequent patch). This patch addresses (1) only. A subsequent patch will address (2). Worked example: The following shows the previous pathological behaviour when the initial synchronous readahead is unaligned. We start reading at page 17 in the file and read sequentially from there. I'm showing a dump of the pages in the page cache just after we read the first page of the folio with the readahead marker.
Initially there are no pages in the page cache: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00800000 8388608 0 2048 2048 Then we access page 17, causing synchonous read-around of 128K with a readahead marker set up at page 25. So far, all as expected: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00001000 4096 0 1 1 FOLIO 0x00001000 0x00002000 4096 1 2 1 0 FOLIO 0x00002000 0x00003000 4096 2 3 1 0 FOLIO 0x00003000 0x00004000 4096 3 4 1 0 FOLIO 0x00004000 0x00005000 4096 4 5 1 0 FOLIO 0x00005000 0x00006000 4096 5 6 1 0 FOLIO 0x00006000 0x00007000 4096 6 7 1 0 FOLIO 0x00007000 0x00008000 4096 7 8 1 0 FOLIO 0x00008000 0x00009000 4096 8 9 1 0 FOLIO 0x00009000 0x0000a000 4096 9 10 1 0 FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0 FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0 FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0 FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0 FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0 FOLIO 0x0000f000 0x00010000 4096 15 16 1 0 FOLIO 0x00010000 0x00011000 4096 16 17 1 0 FOLIO 0x00011000 0x00012000 4096 17 18 1 0 FOLIO 0x00012000 0x00013000 4096 18 19 1 0 FOLIO 0x00013000 0x00014000 4096 19 20 1 0 FOLIO 0x00014000 0x00015000 4096 20 21 1 0 FOLIO 0x00015000 0x00016000 4096 21 22 1 0 FOLIO 0x00016000 0x00017000 4096 22 23 1 0 FOLIO 0x00017000 0x00018000 4096 23 24 1 0 FOLIO 0x00018000 0x00019000 4096 24 25 1 0 FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 Y FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0 FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0 FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0 FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0 FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0 FOLIO 0x0001f000 0x00020000 4096 31 32 1 0 FOLIO 0x00020000 0x00021000 4096 32 33 1 0 HOLE 0x00021000 0x00800000 8253440 33 2048 2015 Now access pages 18-25 inclusive. This causes an asynchronous 128K readahead starting at page 33. 
But since we are unaligned, even though the preferred folio order is 2, the first folio in this batch (the one with the new readahead marker) is order-0: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00001000 4096 0 1 1 FOLIO 0x00001000 0x00002000 4096 1 2 1 0 FOLIO 0x00002000 0x00003000 4096 2 3 1 0 FOLIO 0x00003000 0x00004000 4096 3 4 1 0 FOLIO 0x00004000 0x00005000 4096 4 5 1 0 FOLIO 0x00005000 0x00006000 4096 5 6 1 0 FOLIO 0x00006000 0x00007000 4096 6 7 1 0 FOLIO 0x00007000 0x00008000 4096 7 8 1 0 FOLIO 0x00008000 0x00009000 4096 8 9 1 0 FOLIO 0x00009000 0x0000a000 4096 9 10 1 0 FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0 FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0 FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0 FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0 FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0 FOLIO 0x0000f000 0x00010000 4096 15 16 1 0 FOLIO 0x00010000 0x00011000 4096 16 17 1 0 FOLIO 0x00011000 0x00012000 4096 17 18 1 0 FOLIO 0x00012000 0x00013000 4096 18 19 1 0 FOLIO 0x00013000 0x00014000 4096 19 20 1 0 FOLIO 0x00014000 0x00015000 4096 20 21 1 0 FOLIO 0x00015000 0x00016000 4096 21 22 1 0 FOLIO 0x00016000 0x00017000 4096 22 23 1 0 FOLIO 0x00017000 0x00018000 4096 23 24 1 0 FOLIO 0x00018000 0x00019000 4096 24 25 1 0 FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0 FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0 FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0 FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0 FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0 FOLIO 0x0001f000 0x00020000 4096 31 32 1 0 FOLIO 0x00020000 0x00021000 4096 32 33 1 0 FOLIO 0x00021000 0x00022000 4096 33 34 1 0 Y FOLIO 0x00022000 0x00024000 8192 34 36 2 1 FOLIO 0x00024000 0x00028000 16384 36 40 4 2 FOLIO 0x00028000 0x0002c000 16384 40 44 4 2 FOLIO 0x0002c000 0x00030000 16384 44 48 4 2 FOLIO 0x00030000 0x00034000 16384 48 52 4 2 FOLIO 0x00034000 0x00038000 16384 52 56 4 2 FOLIO 0x00038000 0x0003c000 16384 56 60 4 2 FOLIO 0x0003c000 0x00040000 16384 60 64 4 2 FOLIO 0x00040000 0x00041000 4096 64 65 1 0 HOLE 0x00041000 0x00800000 8122368 65 2048 1983 Which means that when we now read pages 26-33 and readahead is kicked off again, the new preferred order is 2 (0 + 2), not 4 as we intended: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00001000 4096 0 1 1 FOLIO 0x00001000 0x00002000 4096 1 2 1 0 FOLIO 0x00002000 0x00003000 4096 2 3 1 0 FOLIO 0x00003000 0x00004000 4096 3 4 1 0 FOLIO 0x00004000 0x00005000 4096 4 5 1 0 FOLIO 0x00005000 0x00006000 4096 5 6 1 0 FOLIO 0x00006000 0x00007000 4096 6 7 1 0 FOLIO 0x00007000 0x00008000 4096 7 8 1 0 FOLIO 0x00008000 0x00009000 4096 8 9 1 0 FOLIO 0x00009000 0x0000a000 4096 9 10 1 0 FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0 FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0 FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0 FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0 FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0 FOLIO 0x0000f000 0x00010000 4096 15 16 1 0 FOLIO 0x00010000 0x00011000 4096 16 17 1 0 FOLIO 0x00011000 0x00012000 4096 17 18 1 0 FOLIO 0x00012000 0x00013000 4096 18 19 1 0 FOLIO 0x00013000 0x00014000 4096 19 20 1 0 FOLIO 0x00014000 0x00015000 4096 20 21 1 0 FOLIO 0x00015000 0x00016000 4096 21 22 1 0 FOLIO 0x00016000 0x00017000 4096 22 23 1 0 FOLIO 0x00017000 0x00018000 4096 23 24 1 0 FOLIO 0x00018000 0x00019000 4096 24 25 1 0 FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 FOLIO 0x0001a000 0x0001b000 
4096 26 27 1 0 FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0 FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0 FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0 FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0 FOLIO 0x0001f000 0x00020000 4096 31 32 1 0 FOLIO 0x00020000 0x00021000 4096 32 33 1 0 FOLIO 0x00021000 0x00022000 4096 33 34 1 0 FOLIO 0x00022000 0x00024000 8192 34 36 2 1 FOLIO 0x00024000 0x00028000 16384 36 40 4 2 FOLIO 0x00028000 0x0002c000 16384 40 44 4 2 FOLIO 0x0002c000 0x00030000 16384 44 48 4 2 FOLIO 0x00030000 0x00034000 16384 48 52 4 2 FOLIO 0x00034000 0x00038000 16384 52 56 4 2 FOLIO 0x00038000 0x0003c000 16384 56 60 4 2 FOLIO 0x0003c000 0x00040000 16384 60 64 4 2 FOLIO 0x00040000 0x00041000 4096 64 65 1 0 FOLIO 0x00041000 0x00042000 4096 65 66 1 0 Y FOLIO 0x00042000 0x00044000 8192 66 68 2 1 FOLIO 0x00044000 0x00048000 16384 68 72 4 2 FOLIO 0x00048000 0x0004c000 16384 72 76 4 2 FOLIO 0x0004c000 0x00050000 16384 76 80 4 2 FOLIO 0x00050000 0x00054000 16384 80 84 4 2 FOLIO 0x00054000 0x00058000 16384 84 88 4 2 FOLIO 0x00058000 0x0005c000 16384 88 92 4 2 FOLIO 0x0005c000 0x00060000 16384 92 96 4 2 FOLIO 0x00060000 0x00061000 4096 96 97 1 0 HOLE 0x00061000 0x00800000 7991296 97 2048 1951 This ramp up from order-0 with smaller orders at the edges for alignment cycle continues all the way to the end of the file (not shown). After the change, we round down the end boundary to the order boundary so we no longer get stuck in the cycle and can ramp up the order over time. Note that the rate of the ramp up is still not as we would expect it. We will fix that next. Here we are touching pages 17-256 sequentially: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00001000 4096 0 1 1 FOLIO 0x00001000 0x00002000 4096 1 2 1 0 FOLIO 0x00002000 0x00003000 4096 2 3 1 0 FOLIO 0x00003000 0x00004000 4096 3 4 1 0 FOLIO 0x00004000 0x00005000 4096 4 5 1 0 FOLIO 0x00005000 0x00006000 4096 5 6 1 0 FOLIO 0x00006000 0x00007000 4096 6 7 1 0 FOLIO 0x00007000 0x00008000 4096 7 8 1 0 FOLIO 0x00008000 0x00009000 4096 8 9 1 0 FOLIO 0x00009000 0x0000a000 4096 9 10 1 0 FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0 FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0 FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0 FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0 FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0 FOLIO 0x0000f000 0x00010000 4096 15 16 1 0 FOLIO 0x00010000 0x00011000 4096 16 17 1 0 FOLIO 0x00011000 0x00012000 4096 17 18 1 0 FOLIO 0x00012000 0x00013000 4096 18 19 1 0 FOLIO 0x00013000 0x00014000 4096 19 20 1 0 FOLIO 0x00014000 0x00015000 4096 20 21 1 0 FOLIO 0x00015000 0x00016000 4096 21 22 1 0 FOLIO 0x00016000 0x00017000 4096 22 23 1 0 FOLIO 0x00017000 0x00018000 4096 23 24 1 0 FOLIO 0x00018000 0x00019000 4096 24 25 1 0 FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0 FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0 FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0 FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0 FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0 FOLIO 0x0001f000 0x00020000 4096 31 32 1 0 FOLIO 0x00020000 0x00021000 4096 32 33 1 0 FOLIO 0x00021000 0x00022000 4096 33 34 1 0 FOLIO 0x00022000 0x00024000 8192 34 36 2 1 FOLIO 0x00024000 0x00028000 16384 36 40 4 2 FOLIO 0x00028000 0x0002c000 16384 40 44 4 2 FOLIO 0x0002c000 0x00030000 16384 44 48 4 2 FOLIO 0x00030000 0x00034000 16384 48 52 4 2 FOLIO 0x00034000 0x00038000 16384 52 56 4 2 FOLIO 0x00038000 0x0003c000 16384 56 60 4 2 FOLIO 0x0003c000 0x00040000 16384 60 64 4 2 FOLIO 
0x00040000 0x00044000 16384 64 68 4 2 FOLIO 0x00044000 0x00048000 16384 68 72 4 2 FOLIO 0x00048000 0x0004c000 16384 72 76 4 2 FOLIO 0x0004c000 0x00050000 16384 76 80 4 2 FOLIO 0x00050000 0x00054000 16384 80 84 4 2 FOLIO 0x00054000 0x00058000 16384 84 88 4 2 FOLIO 0x00058000 0x0005c000 16384 88 92 4 2 FOLIO 0x0005c000 0x00060000 16384 92 96 4 2 FOLIO 0x00060000 0x00070000 65536 96 112 16 4 FOLIO 0x00070000 0x00080000 65536 112 128 16 4 FOLIO 0x00080000 0x000a0000 131072 128 160 32 5 FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5 FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5 FOLIO 0x000e0000 0x00100000 131072 224 256 32 5 FOLIO 0x00100000 0x00120000 131072 256 288 32 5 FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y HOLE 0x00140000 0x00800000 7077888 320 2048 1728 Link: https://lkml.kernel.org/r/20250609092729.274960-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts Reviewed-by: Jan Kara Cc: Chaitanya S Prakash Cc: David Hildenbrand Cc: Will Deacon Signed-off-by: Andrew Morton --- mm/readahead.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/mm/readahead.c b/mm/readahead.c index 973de2551efe..87be20ae00d0 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -620,7 +620,7 @@ void page_cache_async_ra(struct readahead_control *ractl, unsigned long max_pages; struct file_ra_state *ra = ractl->ra; pgoff_t index = readahead_index(ractl); - pgoff_t expected, start; + pgoff_t expected, start, end, aligned_end, align; unsigned int order = folio_order(folio); /* no readahead */ if (!ra->ra_pages) @@ -652,7 +652,6 @@ void page_cache_async_ra(struct readahead_control *ractl, * the readahead window. */ ra->size = max(ra->size, get_next_ra_size(ra, max_pages)); - ra->async_size = ra->size; goto readit; } @@ -673,9 +672,14 @@ void page_cache_async_ra(struct readahead_control *ractl, ra->size = start - index; /* old async_size */ ra->size += req_count; ra->size = get_next_ra_size(ra, max_pages); - ra->async_size = ra->size; readit: order += 2; + align = 1UL << min(order, ffs(max_pages) - 1); + end = ra->start + ra->size; + aligned_end = round_down(end, align); + if (aligned_end > ra->start) + ra->size -= end - aligned_end; + ra->async_size = ra->size; ractl->_index = ra->start; page_cache_ra_order(ractl, ra, order); } From f5e8b140cd1324cf2c9c17487b8f444098624797 Mon Sep 17 00:00:00 2001 From: Ryan Roberts Date: Mon, 9 Jun 2025 10:27:25 +0100 Subject: [PATCH 043/370] mm/readahead: make space in struct file_ra_state We need to be able to store the preferred folio order associated with a readahead request in the struct file_ra_state so that we can more accurately increase the order across subsequent readahead requests. But struct file_ra_state is per-struct file, so we don't really want to increase its size. mmap_miss is currently 32 bits but it is only counted up to 10 * MMAP_LOTSAMISS, which is currently defined as 1000. So 16 bits should be plenty. Redefine it to unsigned short, making room for an unsigned short order field in a follow-up commit.
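As a quick standalone sanity check of the narrowing (ordinary userspace C, not kernel code; MMAP_LOTSAMISS taken as the 1000 cited above):

#include <limits.h>
#include <stdio.h>

#define MMAP_LOTSAMISS 1000	/* value quoted in this commit message */

int main(void)
{
	unsigned int ceiling = 10 * MMAP_LOTSAMISS;	/* mmap_miss is capped here */

	/* 10000 <= USHRT_MAX (65535), so an unsigned short is plenty. */
	printf("ceiling=%u USHRT_MAX=%d\n", ceiling, USHRT_MAX);
	return 0;
}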
Link: https://lkml.kernel.org/r/20250609092729.274960-4-ryan.roberts@arm.com Signed-off-by: Ryan Roberts Acked-by: David Hildenbrand Reviewed-by: Jan Kara Cc: Chaitanya S Prakash Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/fs.h | 2 +- mm/filemap.c | 11 ++++++----- 2 files changed, 7 insertions(+), 6 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 62634af97da6..ef819b232d66 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1054,7 +1054,7 @@ struct file_ra_state { unsigned int size; unsigned int async_size; unsigned int ra_pages; - unsigned int mmap_miss; + unsigned short mmap_miss; loff_t prev_pos; }; diff --git a/mm/filemap.c b/mm/filemap.c index a6459874bb2a..7bb4ffca8487 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3217,7 +3217,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf) DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff); struct file *fpin = NULL; unsigned long vm_flags = vmf->vma->vm_flags; - unsigned int mmap_miss; + unsigned short mmap_miss; #ifdef CONFIG_TRANSPARENT_HUGEPAGE /* Use the readahead code, even if readahead is disabled */ @@ -3285,7 +3285,7 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf, struct file_ra_state *ra = &file->f_ra; DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, vmf->pgoff); struct file *fpin = NULL; - unsigned int mmap_miss; + unsigned short mmap_miss; /* If we don't want any read-ahead, don't bother */ if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages) @@ -3605,7 +3605,7 @@ static struct folio *next_uptodate_folio(struct xa_state *xas, static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf, struct folio *folio, unsigned long start, unsigned long addr, unsigned int nr_pages, - unsigned long *rss, unsigned int *mmap_miss) + unsigned long *rss, unsigned short *mmap_miss) { vm_fault_t ret = 0; struct page *page = folio_page(folio, start); @@ -3667,7 +3667,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf, static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf, struct folio *folio, unsigned long addr, - unsigned long *rss, unsigned int *mmap_miss) + unsigned long *rss, unsigned short *mmap_miss) { vm_fault_t ret = 0; struct page *page = &folio->page; @@ -3709,7 +3709,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, struct folio *folio; vm_fault_t ret = 0; unsigned long rss = 0; - unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type; + unsigned int nr_pages = 0, folio_type; + unsigned short mmap_miss = 0, mmap_miss_saved; rcu_read_lock(); folio = next_uptodate_folio(&xas, mapping, end_pgoff); From c4602f9fa77fc6bb956ca51a23e7a39439e75cb6 Mon Sep 17 00:00:00 2001 From: Ryan Roberts Date: Mon, 9 Jun 2025 10:27:26 +0100 Subject: [PATCH 044/370] mm/readahead: store folio order in struct file_ra_state Previously the folio order of the previous readahead request was inferred from the folio who's readahead marker was hit. But due to the way we have to round to non-natural boundaries sometimes, this first folio in the readahead block is often smaller than the preferred order for that request. This means that for cases where the initial sync readahead is poorly aligned, the folio order will ramp up much more slowly. So instead, let's store the order in struct file_ra_state so we are not affected by any required alignment. We previously made enough room in the struct for a 16 order field. 
This should be plenty big enough since we are limited to MAX_PAGECACHE_ORDER anyway, which is certainly never larger than ~20. Since we now pass order in struct file_ra_state, page_cache_ra_order() no longer needs it's new_order parameter, so let's remove that. Worked example: Here we are touching pages 17-256 sequentially just as we did in the previous commit, but now that we are remembering the preferred order explicitly, we no longer have the slow ramp up problem. Note specifically that we no longer have 2 rounds (2x ~128K) of order-2 folios: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00001000 4096 0 1 1 FOLIO 0x00001000 0x00002000 4096 1 2 1 0 FOLIO 0x00002000 0x00003000 4096 2 3 1 0 FOLIO 0x00003000 0x00004000 4096 3 4 1 0 FOLIO 0x00004000 0x00005000 4096 4 5 1 0 FOLIO 0x00005000 0x00006000 4096 5 6 1 0 FOLIO 0x00006000 0x00007000 4096 6 7 1 0 FOLIO 0x00007000 0x00008000 4096 7 8 1 0 FOLIO 0x00008000 0x00009000 4096 8 9 1 0 FOLIO 0x00009000 0x0000a000 4096 9 10 1 0 FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0 FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0 FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0 FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0 FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0 FOLIO 0x0000f000 0x00010000 4096 15 16 1 0 FOLIO 0x00010000 0x00011000 4096 16 17 1 0 FOLIO 0x00011000 0x00012000 4096 17 18 1 0 FOLIO 0x00012000 0x00013000 4096 18 19 1 0 FOLIO 0x00013000 0x00014000 4096 19 20 1 0 FOLIO 0x00014000 0x00015000 4096 20 21 1 0 FOLIO 0x00015000 0x00016000 4096 21 22 1 0 FOLIO 0x00016000 0x00017000 4096 22 23 1 0 FOLIO 0x00017000 0x00018000 4096 23 24 1 0 FOLIO 0x00018000 0x00019000 4096 24 25 1 0 FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0 FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0 FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0 FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0 FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0 FOLIO 0x0001f000 0x00020000 4096 31 32 1 0 FOLIO 0x00020000 0x00021000 4096 32 33 1 0 FOLIO 0x00021000 0x00022000 4096 33 34 1 0 FOLIO 0x00022000 0x00024000 8192 34 36 2 1 FOLIO 0x00024000 0x00028000 16384 36 40 4 2 FOLIO 0x00028000 0x0002c000 16384 40 44 4 2 FOLIO 0x0002c000 0x00030000 16384 44 48 4 2 FOLIO 0x00030000 0x00034000 16384 48 52 4 2 FOLIO 0x00034000 0x00038000 16384 52 56 4 2 FOLIO 0x00038000 0x0003c000 16384 56 60 4 2 FOLIO 0x0003c000 0x00040000 16384 60 64 4 2 FOLIO 0x00040000 0x00050000 65536 64 80 16 4 FOLIO 0x00050000 0x00060000 65536 80 96 16 4 FOLIO 0x00060000 0x00080000 131072 96 128 32 5 FOLIO 0x00080000 0x000a0000 131072 128 160 32 5 FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5 FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5 FOLIO 0x000e0000 0x00100000 131072 224 256 32 5 FOLIO 0x00100000 0x00120000 131072 256 288 32 5 FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y HOLE 0x00140000 0x00800000 7077888 320 2048 1728 Link: https://lkml.kernel.org/r/20250609092729.274960-5-ryan.roberts@arm.com Signed-off-by: Ryan Roberts Reviewed-by: Jan Kara Cc: Chaitanya S Prakash Cc: David Hildenbrand Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/fs.h | 2 ++ mm/filemap.c | 6 ++++-- mm/internal.h | 3 +-- mm/readahead.c | 21 +++++++++++++-------- 4 files changed, 20 insertions(+), 12 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index ef819b232d66..e14e9d11ca0f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1043,6 +1043,7 @@ struct fown_struct { * and so were/are genuinely "ahead". 
Start next readahead when * the first of these pages is accessed. * @ra_pages: Maximum size of a readahead request, copied from the bdi. + * @order: Preferred folio order used for most recent readahead. * @mmap_miss: How many mmap accesses missed in the page cache. * @prev_pos: The last byte in the most recent read request. * @@ -1054,6 +1055,7 @@ struct file_ra_state { unsigned int size; unsigned int async_size; unsigned int ra_pages; + unsigned short order; unsigned short mmap_miss; loff_t prev_pos; }; diff --git a/mm/filemap.c b/mm/filemap.c index 7bb4ffca8487..4b5c8d69f04c 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3232,7 +3232,8 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf) if (!(vm_flags & VM_RAND_READ)) ra->size *= 2; ra->async_size = HPAGE_PMD_NR; - page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER); + ra->order = HPAGE_PMD_ORDER; + page_cache_ra_order(&ractl, ra); return fpin; } #endif @@ -3268,8 +3269,9 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf) ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2); ra->size = ra->ra_pages; ra->async_size = ra->ra_pages / 4; + ra->order = 0; ractl._index = ra->start; - page_cache_ra_order(&ractl, ra, 0); + page_cache_ra_order(&ractl, ra); return fpin; } diff --git a/mm/internal.h b/mm/internal.h index 6b8ed2017743..f91688e2894f 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -436,8 +436,7 @@ void zap_page_range_single_batched(struct mmu_gather *tlb, int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio, gfp_t gfp); -void page_cache_ra_order(struct readahead_control *, struct file_ra_state *, - unsigned int order); +void page_cache_ra_order(struct readahead_control *, struct file_ra_state *); void force_page_cache_ra(struct readahead_control *, unsigned long nr); static inline void force_page_cache_readahead(struct address_space *mapping, struct file *file, pgoff_t index, unsigned long nr_to_read) diff --git a/mm/readahead.c b/mm/readahead.c index 87be20ae00d0..95a24f12d1e7 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -457,7 +457,7 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index, } void page_cache_ra_order(struct readahead_control *ractl, - struct file_ra_state *ra, unsigned int new_order) + struct file_ra_state *ra) { struct address_space *mapping = ractl->mapping; pgoff_t start = readahead_index(ractl); @@ -468,9 +468,12 @@ void page_cache_ra_order(struct readahead_control *ractl, unsigned int nofs; int err = 0; gfp_t gfp = readahead_gfp_mask(mapping); + unsigned int new_order = ra->order; - if (!mapping_large_folio_support(mapping)) + if (!mapping_large_folio_support(mapping)) { + ra->order = 0; goto fallback; + } limit = min(limit, index + ra->size - 1); @@ -478,6 +481,8 @@ void page_cache_ra_order(struct readahead_control *ractl, new_order = min_t(unsigned int, new_order, ilog2(ra->size)); new_order = max(new_order, min_order); + ra->order = new_order; + /* See comment in page_cache_ra_unbounded() */ nofs = memalloc_nofs_save(); filemap_invalidate_lock_shared(mapping); @@ -609,8 +614,9 @@ void page_cache_sync_ra(struct readahead_control *ractl, ra->size = min(contig_count + req_count, max_pages); ra->async_size = 1; readit: + ra->order = 0; ractl->_index = ra->start; - page_cache_ra_order(ractl, ra, 0); + page_cache_ra_order(ractl, ra); } EXPORT_SYMBOL_GPL(page_cache_sync_ra); @@ -621,7 +627,6 @@ void page_cache_async_ra(struct readahead_control *ractl, struct file_ra_state *ra = ractl->ra; pgoff_t index = 
readahead_index(ractl); pgoff_t expected, start, end, aligned_end, align; - unsigned int order = folio_order(folio); /* no readahead */ if (!ra->ra_pages) @@ -644,7 +649,7 @@ void page_cache_async_ra(struct readahead_control *ractl, * Ramp up sizes, and push forward the readahead window. */ expected = round_down(ra->start + ra->size - ra->async_size, - 1UL << order); + 1UL << folio_order(folio)); if (index == expected) { ra->start += ra->size; /* @@ -673,15 +678,15 @@ void page_cache_async_ra(struct readahead_control *ractl, ra->size += req_count; ra->size = get_next_ra_size(ra, max_pages); readit: - order += 2; - align = 1UL << min(order, ffs(max_pages) - 1); + ra->order += 2; + align = 1UL << min(ra->order, ffs(max_pages) - 1); end = ra->start + ra->size; aligned_end = round_down(end, align); if (aligned_end > ra->start) ra->size -= end - aligned_end; ra->async_size = ra->size; ractl->_index = ra->start; - page_cache_ra_order(ractl, ra, order); + page_cache_ra_order(ractl, ra); } EXPORT_SYMBOL_GPL(page_cache_async_ra); From 38b0ece6d76374b989928021b5d310be11b99b5c Mon Sep 17 00:00:00 2001 From: Ryan Roberts Date: Mon, 9 Jun 2025 10:27:27 +0100 Subject: [PATCH 045/370] mm/filemap: allow arch to request folio size for exec memory Change the readahead config so that if it is being requested for an executable mapping, do a synchronous read into a set of folios with an arch-specified order and in a naturally aligned manner. We no longer center the read on the faulting page but simply align it down to the previous natural boundary. Additionally, we don't bother with an asynchronous part. On arm64 if memory is physically contiguous and naturally aligned to the "contpte" size, we can use contpte mappings, which improves utilization of the TLB. When paired with the "multi-size THP" feature, this works well to reduce dTLB pressure. However iTLB pressure is still high due to executable mappings having a low likelihood of being in the required folio size and mapping alignment, even when the filesystem supports readahead into large folios (e.g. XFS). The reason for the low likelihood is that the current readahead algorithm starts with an order-0 folio and increases the folio order by 2 every time the readahead mark is hit. But most executable memory tends to be accessed randomly and so the readahead mark is rarely hit and most executable folios remain order-0. So let's special-case the read(ahead) logic for executable mappings. The trade-off is performance improvement (due to more efficient storage of the translations in iTLB) vs potential for making reclaim more difficult (due to the folios being larger so if a part of the folio is hot the whole thing is considered hot). But executable memory is a small portion of the overall system memory so I doubt this will even register from a reclaim perspective. I've chosen 64K folio size for arm64 which benefits both the 4K and 16K base page size configs. Crucially the same amount of data is still read (usually 128K) so I'm not expecting any read amplification issues. I don't anticipate any write amplification because text is always RO. Note that the text region of an ELF file could be populated into the page cache for other reasons than taking a fault in a mmapped area. The most common case is due to the loader read()ing the header which can be shared with the beginning of text. So some text will still remain in small folios, but this simple, best effort change provides good performance improvements as is. 
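For reference, the arithmetic behind that 64K choice can be checked with a standalone sketch (ordinary userspace C, not the kernel's ilog2()/exec_folio_order() implementation):

#include <stdio.h>

/* ilog2(folio_size / page_size), computed the long way for clarity. */
static unsigned int order_for(unsigned long folio_size, unsigned long page_size)
{
	unsigned int order = 0;

	for (unsigned long n = folio_size / page_size; n > 1; n >>= 1)
		order++;
	return order;
}

int main(void)
{
	/* 4K base pages: 64K is order 4, i.e. 16 pages -> 1 contpte iTLB entry */
	printf("4K base pages:  order %u\n", order_for(64 * 1024, 4 * 1024));
	/* 16K base pages: 64K is order 2, i.e. 4 pages coalesced by HPA */
	printf("16K base pages: order %u\n", order_for(64 * 1024, 16 * 1024));
	return 0;
}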
Confine this special-case approach to the bounds of the VMA. This prevents wasting memory for any padding that might exist in the file between sections. Previously the padding would have been contained in order-0 folios and would be easy to reclaim. But now it would be part of a larger folio so more difficult to reclaim. Solve this by simply not reading it into memory in the first place.

Benchmarking
============

The below shows pgbench and redis benchmarks on a Graviton3 arm64 system.

First, confirmation that this patch causes more text to be contained in 64K folios:

+----------------------+---------------+---------------+---------------+
| File-backed folios by| system boot   |    pgbench    |     redis     |
| size as percentage of+-------+-------+-------+-------+-------+-------+
| all mapped text mem  |before | after |before | after |before | after |
+======================+=======+=======+=======+=======+=======+=======+
| base-page-4kB        |   78% |   30% |   78% |   11% |   73% |   14% |
| thp-aligned-8kB      |    1% |    0% |    0% |    0% |    1% |    0% |
| thp-aligned-16kB     |   17% |    4% |   17% |    3% |   20% |    4% |
| thp-aligned-32kB     |    1% |    1% |    1% |    2% |    1% |    1% |
| thp-aligned-64kB     |    3% |   63% |    3% |   81% |    4% |   77% |
| thp-aligned-128kB    |    0% |    1% |    1% |    1% |    1% |    2% |
| thp-unaligned-64kB   |    0% |    0% |    0% |    1% |    0% |    1% |
| thp-unaligned-128kB  |    0% |    1% |    0% |    0% |    0% |    0% |
| thp-partial          |    0% |    0% |    0% |    1% |    0% |    1% |
+----------------------+-------+-------+-------+-------+-------+-------+
| cont-aligned-64kB    |    4% |   65% |    4% |   83% |    6% |   79% |
+----------------------+-------+-------+-------+-------+-------+-------+

The above shows that for both workloads (each isolated with cgroups) as well as the general system state after boot, the amount of text backed by 4K and 16K folios reduces and the amount backed by 64K folios increases significantly. And the amount of text that is contpte-mapped significantly increases (see last row).

And this is reflected in performance improvement. "(I)" indicates a statistically significant improvement.
Note TPS and Reqs/sec are rates, so bigger is better; ms is time, so smaller is better:

+-------------+-------------------------------------------+------------+
| Benchmark   | Result Class                              | Improvement|
+=============+===========================================+============+
| pts/pgbench | Scale: 1 Clients: 1 RO (TPS)              | (I) 3.47%  |
|             | Scale: 1 Clients: 1 RO - Latency (ms)     | -2.88%     |
|             | Scale: 1 Clients: 250 RO (TPS)            | (I) 5.02%  |
|             | Scale: 1 Clients: 250 RO - Latency (ms)   | (I) -4.79% |
|             | Scale: 1 Clients: 1000 RO (TPS)           | (I) 6.16%  |
|             | Scale: 1 Clients: 1000 RO - Latency (ms)  | (I) -5.82% |
|             | Scale: 100 Clients: 1 RO (TPS)            | 2.51%      |
|             | Scale: 100 Clients: 1 RO - Latency (ms)   | -3.51%     |
|             | Scale: 100 Clients: 250 RO (TPS)          | (I) 4.75%  |
|             | Scale: 100 Clients: 250 RO - Latency (ms) | (I) -4.44% |
|             | Scale: 100 Clients: 1000 RO (TPS)         | (I) 6.34%  |
|             | Scale: 100 Clients: 1000 RO - Latency (ms)| (I) -5.95% |
+-------------+-------------------------------------------+------------+
| pts/redis   | Test: GET Connections: 50 (Reqs/sec)      | (I) 3.20%  |
|             | Test: GET Connections: 1000 (Reqs/sec)    | (I) 2.55%  |
|             | Test: LPOP Connections: 50 (Reqs/sec)     | (I) 4.59%  |
|             | Test: LPOP Connections: 1000 (Reqs/sec)   | (I) 4.81%  |
|             | Test: LPUSH Connections: 50 (Reqs/sec)    | (I) 5.31%  |
|             | Test: LPUSH Connections: 1000 (Reqs/sec)  | (I) 4.36%  |
|             | Test: SADD Connections: 50 (Reqs/sec)     | (I) 2.64%  |
|             | Test: SADD Connections: 1000 (Reqs/sec)   | (I) 4.15%  |
|             | Test: SET Connections: 50 (Reqs/sec)      | (I) 3.11%  |
|             | Test: SET Connections: 1000 (Reqs/sec)    | (I) 3.36%  |
+-------------+-------------------------------------------+------------+

[ryan.roberts@arm.com: fix use-after-free]
Link: https://lkml.kernel.org/r/ea7f9da7-9a9f-4b85-9d0a-35b320f5ed25@arm.com
[ryan.roberts@arm.com: use the vma_pages() helper instead of open-coding]
Link: https://lkml.kernel.org/r/0e0f674b-3b7e-494f-ae7a-fc9dbb98dad4@arm.com
Link: https://lkml.kernel.org/r/20250609092729.274960-6-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts Reviewed-by: Jan Kara Acked-by: Will Deacon Cc: Chaitanya S Prakash Cc: David Hildenbrand Signed-off-by: Andrew Morton --- arch/arm64/include/asm/pgtable.h | 8 ++++++ include/linux/pgtable.h | 11 ++++++++ mm/filemap.c | 48 ++++++++++++++++++++++++++------ 3 files changed, 58 insertions(+), 9 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 192d86e1cc76..e511f909f63c 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -1643,6 +1643,14 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf, */ #define arch_wants_old_prefaulted_pte cpu_has_hw_af +/* + * Request exec memory is read into pagecache in at least 64K folios. This size + * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB + * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base + * pages are in use. + */ +#define exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT) + static inline bool pud_sect_supported(void) { return PAGE_SIZE == SZ_4K; diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 0b6e1f781d86..e4a3895c043b 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -456,6 +456,17 @@ static inline bool arch_has_hw_pte_young(void) } #endif +#ifndef exec_folio_order +/* + * Returns preferred minimum folio order for executable file-backed memory. Must + * be in range [0, PMD_ORDER). Defaults to order-0.
+ */ +static inline unsigned int exec_folio_order(void) +{ + return 0; +} +#endif + #ifndef arch_check_zapped_pte static inline void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte) diff --git a/mm/filemap.c b/mm/filemap.c index 4b5c8d69f04c..3cf955740148 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3238,8 +3238,11 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf) } #endif - /* If we don't want any read-ahead, don't bother */ - if (vm_flags & VM_RAND_READ) + /* + * If we don't want any read-ahead, don't bother. VM_EXEC case below is + * already intended for random access. + */ + if ((vm_flags & (VM_RAND_READ | VM_EXEC)) == VM_RAND_READ) return fpin; if (!ra->ra_pages) return fpin; @@ -3262,14 +3265,41 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf) if (mmap_miss > MMAP_LOTSAMISS) return fpin; - /* - * mmap read-around - */ + if (vm_flags & VM_EXEC) { + /* + * Allow arch to request a preferred minimum folio order for + * executable memory. This can often be beneficial to + * performance if (e.g.) arm64 can contpte-map the folio. + * Executable memory rarely benefits from readahead, due to its + * random access nature, so set async_size to 0. + * + * Limit to the boundaries of the VMA to avoid reading in any + * pad that might exist between sections, which would be a waste + * of memory. + */ + struct vm_area_struct *vma = vmf->vma; + unsigned long start = vma->vm_pgoff; + unsigned long end = start + vma_pages(vma); + unsigned long ra_end; + + ra->order = exec_folio_order(); + ra->start = round_down(vmf->pgoff, 1UL << ra->order); + ra->start = max(ra->start, start); + ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order); + ra_end = min(ra_end, end); + ra->size = ra_end - ra->start; + ra->async_size = 0; + } else { + /* + * mmap read-around + */ + ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2); + ra->size = ra->ra_pages; + ra->async_size = ra->ra_pages / 4; + ra->order = 0; + } + fpin = maybe_unlock_mmap_for_io(vmf, fpin); - ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2); - ra->size = ra->ra_pages; - ra->async_size = ra->ra_pages / 4; - ra->order = 0; ractl._index = ra->start; page_cache_ra_order(&ractl, ra); return fpin; From 7e43195c609f0499e46c6bfa9472e39c76af445b Mon Sep 17 00:00:00 2001 From: Casey Chen Date: Tue, 10 Jun 2025 10:22:58 -0600 Subject: [PATCH 046/370] alloc_tag: remove empty module tag section The empty MOD_CODETAG_SECTIONS() macro added an incomplete .data section in module linker script, which caused symbol lookup tools like gdb to misinterpret symbol addresses e.g., __ib_process_cq incorrectly mapping to unrelated functions like below. (gdb) disas __ib_process_cq Dump of assembler code for function trace_event_fields_cq_schedule: Removing the empty section restores proper symbol resolution and layout, ensuring .data placement behaves as expected. 
Link: https://lkml.kernel.org/r/20250610162258.324645-1-cachen@purestorage.com Fixes: 0db6f8d7820a ("alloc_tag: load module tags into separate contiguous memory") Fixes: 22d407b164ff ("lib: add allocation tagging support for memory allocation profiling") Signed-off-by: Casey Chen Reviewed-by: Yuanyuan Zhong Acked-by: Suren Baghdasaryan Cc: Arnd Bergmann Cc: Kent Overstreet Cc: Luis Chamberlain Cc: Pasha Tatashin Signed-off-by: Andrew Morton --- include/asm-generic/codetag.lds.h | 6 ------ scripts/module.lds.S | 5 ----- 2 files changed, 11 deletions(-) diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h index 372c320c5043..a45fe3d141a1 100644 --- a/include/asm-generic/codetag.lds.h +++ b/include/asm-generic/codetag.lds.h @@ -11,12 +11,6 @@ #define CODETAG_SECTIONS() \ SECTION_WITH_BOUNDARIES(alloc_tags) -/* - * Module codetags which aren't used after module unload, therefore have the - * same lifespan as the module and can be safely unloaded with the module. - */ -#define MOD_CODETAG_SECTIONS() - #define MOD_SEPARATE_CODETAG_SECTION(_name) \ .codetag.##_name : { \ SECTION_WITH_BOUNDARIES(_name) \ diff --git a/scripts/module.lds.S b/scripts/module.lds.S index 450f1088d5fd..ee79c41059f3 100644 --- a/scripts/module.lds.S +++ b/scripts/module.lds.S @@ -52,17 +52,12 @@ SECTIONS { .data : { *(.data .data.[0-9a-zA-Z_]*) *(.data..L*) - MOD_CODETAG_SECTIONS() } .rodata : { *(.rodata .rodata.[0-9a-zA-Z_]*) *(.rodata..L*) } -#else - .data : { - MOD_CODETAG_SECTIONS() - } #endif MOD_SEPARATE_CODETAG_SECTIONS() } From be3d3343d467905b14fe62cb1dafa68906e4d5ed Mon Sep 17 00:00:00 2001 From: Mark Brown Date: Tue, 10 Jun 2025 15:13:54 +0100 Subject: [PATCH 047/370] kselftest/mm: clarify errors for pipe() Patch series "selftests/mm: Tweaks to the cow test". A collection of non-functional updates from David Hildenbrand's review. This patch (of 4): Make it clear that errors reported from pipe() are the result of pipe() failing. Link: https://lkml.kernel.org/r/20250610-selftest-mm-cow-tweaks-v1-0-43cd7457500f@kernel.org Link: https://lkml.kernel.org/r/20250610-selftest-mm-cow-tweaks-v1-1-43cd7457500f@kernel.org Signed-off-by: Mark Brown Suggested-by: David Hildenbrand Acked-by: David Hildenbrand Cc: Mark Brown Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/cow.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c index dbbcc5eb3dce..0adc0ab142c1 100644 --- a/tools/testing/selftests/mm/cow.c +++ b/tools/testing/selftests/mm/cow.c @@ -113,11 +113,11 @@ struct comm_pipes { static int setup_comm_pipes(struct comm_pipes *comm_pipes) { if (pipe(comm_pipes->child_ready) < 0) { - ksft_perror("pipe()"); + ksft_perror("pipe() failed"); return -errno; } if (pipe(comm_pipes->parent_ready) < 0) { - ksft_perror("pipe()"); + ksft_perror("pipe() failed"); close(comm_pipes->child_ready[0]); close(comm_pipes->child_ready[1]); return -errno; From 5fbfb1f39da0a5cbfe66e9ab67790e607b8f6e58 Mon Sep 17 00:00:00 2001 From: Mark Brown Date: Tue, 10 Jun 2025 15:13:55 +0100 Subject: [PATCH 048/370] selftests/mm: convert some cow error reports to ksft_perror() This prints the errno and a string decode of it.
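For context, the difference is roughly the following (a standalone C sketch of the observable behaviour, assuming the kselftest helpers' usual output style rather than quoting their exact implementation):

#include <errno.h>
#include <stdio.h>
#include <string.h>

/* ksft_print_msg()-style report: only the fixed text. */
static void print_msg_style(const char *msg)
{
	printf("# %s\n", msg);
}

/* ksft_perror()-style report: fixed text plus the errno decode. */
static void perror_style(const char *msg)
{
	printf("# %s: %s (%d)\n", msg, strerror(errno), errno);
}

int main(void)
{
	errno = ENOMEM;				/* pretend malloc() just failed */
	print_msg_style("malloc() failed");	/* "# malloc() failed" */
	perror_style("malloc() failed");	/* "# malloc() failed: Cannot allocate memory (12)" */
	return 0;
}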
Link: https://lkml.kernel.org/r/20250610-selftest-mm-cow-tweaks-v1-2-43cd7457500f@kernel.org Signed-off-by: Mark Brown Acked-by: David Hildenbrand Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/cow.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c index 0adc0ab142c1..ff80329313e4 100644 --- a/tools/testing/selftests/mm/cow.c +++ b/tools/testing/selftests/mm/cow.c @@ -332,7 +332,7 @@ static void do_test_vmsplice_in_parent(char *mem, size_t size, if (before_fork) { transferred = vmsplice(fds[1], &iov, 1, 0); if (transferred <= 0) { - ksft_print_msg("vmsplice() failed\n"); + ksft_perror("vmsplice() failed\n"); log_test_result(KSFT_FAIL); goto close_pipe; } @@ -562,7 +562,7 @@ static void do_test_iouring(char *mem, size_t size, bool use_fork) while (total < size) { cur = pread(fd, tmp + total, size - total, total); if (cur < 0) { - ksft_print_msg("pread() failed\n"); + ksft_perror("pread() failed\n"); log_test_result(KSFT_FAIL); goto quit_child; } @@ -628,7 +628,7 @@ static void do_test_ro_pin(char *mem, size_t size, enum ro_pin_test test, tmp = malloc(size); if (!tmp) { - ksft_print_msg("malloc() failed\n"); + ksft_perror("malloc() failed\n"); log_test_result(KSFT_FAIL); return; } From 32dc2d5e3f0de7a62dd85b2ec2be2964449c4796 Mon Sep 17 00:00:00 2001 From: Mark Brown Date: Tue, 10 Jun 2025 15:13:56 +0100 Subject: [PATCH 049/370] selftests/mm: don't compare return values to 0 in cow Tweak the coding style for checking for non-zero return values. While we're at it, also remove a now redundant ORing of the madvise() return code. Link: https://lkml.kernel.org/r/20250610-selftest-mm-cow-tweaks-v1-3-43cd7457500f@kernel.org Signed-off-by: Mark Brown Suggested-by: David Hildenbrand Acked-by: David Hildenbrand Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/cow.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c index ff80329313e4..48fcf03aa4cd 100644 --- a/tools/testing/selftests/mm/cow.c +++ b/tools/testing/selftests/mm/cow.c @@ -1613,13 +1613,13 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc) smem = (char *)(((uintptr_t)mmap_smem + pmdsize) & ~(pmdsize - 1)); ret = madvise(mem, pmdsize, MADV_HUGEPAGE); - if (ret != 0) { + if (ret) { ksft_perror("madvise()"); log_test_result(KSFT_FAIL); goto munmap; } - ret |= madvise(smem, pmdsize, MADV_HUGEPAGE); - if (ret != 0) { + ret = madvise(smem, pmdsize, MADV_HUGEPAGE); + if (ret) { ksft_perror("madvise()"); log_test_result(KSFT_FAIL); goto munmap; From 4ff52d4a2de1c2adc77b2cca82b542cd916b42a9 Mon Sep 17 00:00:00 2001 From: Mark Brown Date: Tue, 10 Jun 2025 15:13:57 +0100 Subject: [PATCH 050/370] selftests/mm: add messages about test errors to the cow tests It is not sufficiently clear what the individual tests in the cow test program are checking, so add messages for the failure cases.
Link: https://lkml.kernel.org/r/20250610-selftest-mm-cow-tweaks-v1-4-43cd7457500f@kernel.org Signed-off-by: Mark Brown Suggested-by: David Hildenbrand Acked-by: David Hildenbrand Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/cow.c | 28 ++++++++++++++++++++-------- 1 file changed, 20 insertions(+), 8 deletions(-) diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c index 48fcf03aa4cd..29767e2afdd0 100644 --- a/tools/testing/selftests/mm/cow.c +++ b/tools/testing/selftests/mm/cow.c @@ -268,8 +268,10 @@ static void do_test_cow_in_parent(char *mem, size_t size, bool do_mprotect, * fail because (a) harder to fix and (b) nobody really cares. * Flag them as expected failure for now. */ + ksft_print_msg("Leak from parent into child\n"); log_test_result(KSFT_XFAIL); } else { + ksft_print_msg("Leak from parent into child\n"); log_test_result(KSFT_FAIL); } close_comm_pipes: @@ -397,8 +399,10 @@ static void do_test_vmsplice_in_parent(char *mem, size_t size, * fail because (a) harder to fix and (b) nobody really cares. * Flag them as expected failure for now. */ + ksft_print_msg("Leak from child into parent\n"); log_test_result(KSFT_XFAIL); } else { + ksft_print_msg("Leak from child into parent\n"); log_test_result(KSFT_FAIL); } close_pipe: @@ -570,10 +574,12 @@ static void do_test_iouring(char *mem, size_t size, bool use_fork) } /* Finally, check if we read what we expected. */ - if (!memcmp(mem, tmp, size)) + if (!memcmp(mem, tmp, size)) { log_test_result(KSFT_PASS); - else + } else { + ksft_print_msg("Longterm R/W pin is not reliable\n"); log_test_result(KSFT_FAIL); + } quit_child: if (use_fork) { @@ -725,10 +731,12 @@ static void do_test_ro_pin(char *mem, size_t size, enum ro_pin_test test, ksft_perror("PIN_LONGTERM_TEST_READ failed"); log_test_result(KSFT_FAIL); } else { - if (!memcmp(mem, tmp, size)) + if (!memcmp(mem, tmp, size)) { log_test_result(KSFT_PASS); - else + } else { + ksft_print_msg("Longterm R/O pin is not reliable\n"); log_test_result(KSFT_FAIL); + } } ret = ioctl(gup_fd, PIN_LONGTERM_TEST_STOP); @@ -1417,10 +1425,12 @@ static void do_test_anon_thp_collapse(char *mem, size_t size, else ret = -EINVAL; - if (!ret) + if (!ret) { log_test_result(KSFT_PASS); - else + } else { + ksft_print_msg("Leak from parent into child\n"); log_test_result(KSFT_FAIL); + } close_comm_pipes: close_comm_pipes(&comm_pipes); } @@ -1528,10 +1538,12 @@ static void test_cow(char *mem, const char *smem, size_t size) memset(mem, 0xff, size); /* See if we still read the old values via the other mapping. */ - if (!memcmp(smem, old, size)) + if (!memcmp(smem, old, size)) { log_test_result(KSFT_PASS); - else + } else { + ksft_print_msg("Other mapping modified\n"); log_test_result(KSFT_FAIL); + } free(old); }

From ba78585585d9f5ec4dd32de7b3027a86fe33cee0 Mon Sep 17 00:00:00 2001 From: Mark Brown Date: Tue, 10 Jun 2025 15:07:44 +0100 Subject: [PATCH 051/370] selftests/mm: check for YAMA ptrace_scope configuration before modifying it

When running the memfd_secret test, run_vmtests.sh unconditionally tries to configure the YAMA LSM's ptrace_scope configuration, leading to an error if YAMA is not in the running kernel:

# ./run_vmtests.sh: line 432: /proc/sys/kernel/yama/ptrace_scope: No such file or directory
# # ----------------------
# # running ./memfd_secret
# # ----------------------

Check that this file is present before trying to write to it. The indentation here is a bit odd, and it doesn't seem great that we configure but don't restore ptrace_scope.
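As a possible follow-up (not part of this patch), the script could save and restore the value; a rough sketch in the script's own style, where saved_scope is a hypothetical variable:

	if [ -f /proc/sys/kernel/yama/ptrace_scope ]; then
		saved_scope=$(cat /proc/sys/kernel/yama/ptrace_scope)
		(echo 0 > /proc/sys/kernel/yama/ptrace_scope 2>&1) | tap_prefix
	fi
	CATEGORY="memfd_secret" run_test ./memfd_secret
	if [ -n "$saved_scope" ]; then
		echo "$saved_scope" > /proc/sys/kernel/yama/ptrace_scope
	fi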
Link: https://lkml.kernel.org/r/20250610-selftest-mm-enable-yama-v1-1-0097b6713116@kernel.org Signed-off-by: Mark Brown Acked-by: Mike Rapoport (Microsoft) Acked-by: David Hildenbrand Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/run_vmtests.sh | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh index dddd1dd8af14..33fc7fafa8f9 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -429,7 +429,9 @@ CATEGORY="vma_merge" run_test ./merge if [ -x ./memfd_secret ] then -(echo 0 > /proc/sys/kernel/yama/ptrace_scope 2>&1) | tap_prefix +if [ -f /proc/sys/kernel/yama/ptrace_scope ]; then + (echo 0 > /proc/sys/kernel/yama/ptrace_scope 2>&1) | tap_prefix +fi CATEGORY="memfd_secret" run_test ./memfd_secret fi From 6046a3bed1c2b028e692f7606e3450d1c93e8fdd Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Tue, 10 Jun 2025 11:21:50 +0200 Subject: [PATCH 052/370] lib/test_hmm: reduce stack usage The various test ioctl handlers use arrays of 64 integers that add up to 1KiB of stack data, which in turn leads to exceeding the warning limit in some configurations: lib/test_hmm.c:935:12: error: stack frame size (1408) exceeds limit (1280) in 'dmirror_migrate_to_device' [-Werror,-Wframe-larger-than] Use half the size for these arrays, in order to stay under the warning limits. The code can already deal with arbitrary lengths, but this may be a little less efficient. Link: https://lkml.kernel.org/r/20250610092159.2639515-1-arnd@kernel.org Signed-off-by: Arnd Bergmann Cc: Alistair Popple Cc: David Hildenbrand Cc: Jason Gunthorpe Cc: Jeff Johnson Cc: Jerome Glisse Cc: Thorsten Blum Signed-off-by: Andrew Morton --- lib/test_hmm.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 5b144bc5c4ec..761725bc713c 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -330,7 +330,7 @@ static int dmirror_fault(struct dmirror *dmirror, unsigned long start, { struct mm_struct *mm = dmirror->notifier.mm; unsigned long addr; - unsigned long pfns[64]; + unsigned long pfns[32]; struct hmm_range range = { .notifier = &dmirror->notifier, .hmm_pfns = pfns, @@ -879,8 +879,8 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror, unsigned long size = cmd->npages << PAGE_SHIFT; struct mm_struct *mm = dmirror->notifier.mm; struct vm_area_struct *vma; - unsigned long src_pfns[64] = { 0 }; - unsigned long dst_pfns[64] = { 0 }; + unsigned long src_pfns[32] = { 0 }; + unsigned long dst_pfns[32] = { 0 }; struct migrate_vma args = { 0 }; unsigned long next; int ret; @@ -939,8 +939,8 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror, unsigned long size = cmd->npages << PAGE_SHIFT; struct mm_struct *mm = dmirror->notifier.mm; struct vm_area_struct *vma; - unsigned long src_pfns[64] = { 0 }; - unsigned long dst_pfns[64] = { 0 }; + unsigned long src_pfns[32] = { 0 }; + unsigned long dst_pfns[32] = { 0 }; struct dmirror_bounce bounce; struct migrate_vma args = { 0 }; unsigned long next; @@ -1144,8 +1144,8 @@ static int dmirror_snapshot(struct dmirror *dmirror, unsigned long size = cmd->npages << PAGE_SHIFT; unsigned long addr; unsigned long next; - unsigned long pfns[64]; - unsigned char perm[64]; + unsigned long pfns[32]; + unsigned char perm[32]; char __user *uptr; struct hmm_range range = { .hmm_pfns = pfns, From 03dfefdacfe7127ff9c8a5167b0e72f6ab9fb9db Mon Sep 17 00:00:00 
2001 From: Ye Liu Date: Tue, 10 Jun 2025 16:37:30 +0800 Subject: [PATCH 053/370] mm/memfd: clarify error handling labels in memfd_create()

Rename the error handling labels so they say what gets freed:

err_name --> err_free_name (fd allocation failure case)
err_fd --> err_free_fd (file allocation failure case)

Link: https://lkml.kernel.org/r/20250610083730.527619-1-ye.liu@linux.dev Signed-off-by: Ye Liu Signed-off-by: Andrew Morton --- mm/memfd.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/memfd.c b/mm/memfd.c index ab367e61553d..4fc98abe6627 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -475,22 +475,22 @@ SYSCALL_DEFINE2(memfd_create, fd = get_unused_fd_flags((flags & MFD_CLOEXEC) ? O_CLOEXEC : 0); if (fd < 0) { error = fd; - goto err_name; + goto err_free_name; } file = alloc_file(name, flags); if (IS_ERR(file)) { error = PTR_ERR(file); - goto err_fd; + goto err_free_fd; } fd_install(fd, file); kfree(name); return fd; -err_fd: +err_free_fd: put_unused_fd(fd); -err_name: +err_free_name: kfree(name); return error; }

From 96d81e4766f9e88b66a0502b5a7f34a4c20ac754 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Thu, 5 Jun 2025 14:51:04 +0100 Subject: [PATCH 054/370] mm/pagewalk: split walk_page_range_novma() into kernel/user parts

walk_page_range_novma() is rather confusing - it supports two modes, one used often, the other used only for debugging. The first mode is the common case of traversal of kernel page tables, which is what nearly all callers use this for. Secondly, it provides an unusual debugging interface that allows for the traversal of page tables in a userland range of memory even for that memory which is not described by a VMA. It is far from certain that such page tables should even exist, but perhaps this is precisely why it is useful as a debugging mechanism. As a result, this is utilised by ptdump only. Historically, things were reversed - ptdump was the only user, and other parts of the kernel evolved to use the kernel page table walking here.

Since we have some complicated and confusing locking rules for the novma case, it makes sense to separate the two usages into their own functions. Doing this also provides self-documentation as to the intent of the caller - are they doing something rather unusual, or are they simply doing a standard kernel page table walk?

We therefore establish two separate functions - walk_page_range_debug() for this single usage, and walk_kernel_page_table_range() for general kernel page table walking. The walk_page_range_debug() function is currently used to traverse both userland and kernel mappings, so we maintain this, and in the case of kernel mappings being traversed, we have walk_page_range_debug() invoke walk_kernel_page_table_range() internally. We additionally make walk_page_range_debug() internal to mm.
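To summarise the resulting split, a sketch of the two call patterns; my_ops, priv and the lock choices mirror the callers converted in the diff below, but the ops table itself is hypothetical:

	/* 1) Common case: walk kernel page tables. The caller holds
	 *    init_mm's mmap lock (read is usually sufficient; see the
	 *    memory hot-remove caveat in the diff below). */
	mmap_read_lock(&init_mm);
	ret = walk_kernel_page_table_range(start, end, &my_ops, NULL, priv);
	mmap_read_unlock(&init_mm);

	/* 2) Debugging only (ptdump): may traverse user ranges that have
	 *    no VMA, so the mmap lock must be held for write. */
	mmap_write_lock(mm);
	ret = walk_page_range_debug(mm, start, end, &my_ops, pgd, priv);
	mmap_write_unlock(mm);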
Link: https://lkml.kernel.org/r/20250605135104.90720-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: Mike Rapoport (Microsoft) Acked-by: Qi Zheng Reviewed-by: Oscar Salvador Reviewed-by: Suren Baghdasaryan Reviewed-by: Vlastimil Babka Acked-by: David Hildenbrand Cc: Albert Ou Cc: Alexandre Ghiti Cc: Barry Song Cc: Huacai Chen Cc: Jann Horn Cc: Jonas Bonn Cc: Liam Howlett Cc: Michal Hocko Cc: Muchun Song Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Stafford Horne Cc: Stefan Kristiansson Cc: WANG Xuerui Signed-off-by: Andrew Morton --- arch/loongarch/mm/pageattr.c | 2 +- arch/openrisc/kernel/dma.c | 4 +- arch/riscv/mm/pageattr.c | 8 ++-- include/linux/pagewalk.h | 7 ++-- mm/hugetlb_vmemmap.c | 2 +- mm/internal.h | 3 ++ mm/pagewalk.c | 77 +++++++++++++++++++++++++----------- mm/ptdump.c | 3 +- 8 files changed, 71 insertions(+), 35 deletions(-) diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c index 99165903908a..f5e910b68229 100644 --- a/arch/loongarch/mm/pageattr.c +++ b/arch/loongarch/mm/pageattr.c @@ -118,7 +118,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, pgp return 0; mmap_write_lock(&init_mm); - ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL, &masks); + ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, &masks); mmap_write_unlock(&init_mm); flush_tlb_kernel_range(start, end); diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c index 3a7b5baaa450..af932a4ad306 100644 --- a/arch/openrisc/kernel/dma.c +++ b/arch/openrisc/kernel/dma.c @@ -72,7 +72,7 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size) * them and setting the cache-inhibit bit. */ mmap_write_lock(&init_mm); - error = walk_page_range_novma(&init_mm, va, va + size, + error = walk_kernel_page_table_range(va, va + size, &set_nocache_walk_ops, NULL, NULL); mmap_write_unlock(&init_mm); @@ -87,7 +87,7 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size) mmap_write_lock(&init_mm); /* walk_page_range shouldn't be able to fail here */ - WARN_ON(walk_page_range_novma(&init_mm, va, va + size, + WARN_ON(walk_kernel_page_table_range(va, va + size, &clear_nocache_walk_ops, NULL, NULL)); mmap_write_unlock(&init_mm); } diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c index d815448758a1..3f76db3d2769 100644 --- a/arch/riscv/mm/pageattr.c +++ b/arch/riscv/mm/pageattr.c @@ -299,7 +299,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, if (ret) goto unlock; - ret = walk_page_range_novma(&init_mm, lm_start, lm_end, + ret = walk_kernel_page_table_range(lm_start, lm_end, &pageattr_ops, NULL, &masks); if (ret) goto unlock; @@ -317,13 +317,13 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, if (ret) goto unlock; - ret = walk_page_range_novma(&init_mm, lm_start, lm_end, + ret = walk_kernel_page_table_range(lm_start, lm_end, &pageattr_ops, NULL, &masks); if (ret) goto unlock; } - ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL, + ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, &masks); unlock: @@ -335,7 +335,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, */ flush_tlb_all(); #else - ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL, + ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, &masks); mmap_write_unlock(&init_mm); diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h index 9700a29f8afb..8ac2f6d6d2a3 
100644 --- a/include/linux/pagewalk.h +++ b/include/linux/pagewalk.h @@ -129,10 +129,9 @@ struct mm_walk { int walk_page_range(struct mm_struct *mm, unsigned long start, unsigned long end, const struct mm_walk_ops *ops, void *private); -int walk_page_range_novma(struct mm_struct *mm, unsigned long start, - unsigned long end, const struct mm_walk_ops *ops, - pgd_t *pgd, - void *private); +int walk_kernel_page_table_range(unsigned long start, + unsigned long end, const struct mm_walk_ops *ops, + pgd_t *pgd, void *private); int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start, unsigned long end, const struct mm_walk_ops *ops, void *private); diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c index 27245e86df25..ba0fb1b6a5a8 100644 --- a/mm/hugetlb_vmemmap.c +++ b/mm/hugetlb_vmemmap.c @@ -166,7 +166,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end, VM_BUG_ON(!PAGE_ALIGNED(start | end)); mmap_read_lock(&init_mm); - ret = walk_page_range_novma(&init_mm, start, end, &vmemmap_remap_ops, + ret = walk_kernel_page_table_range(start, end, &vmemmap_remap_ops, NULL, walk); mmap_read_unlock(&init_mm); if (ret) diff --git a/mm/internal.h b/mm/internal.h index f91688e2894f..2c0d9f197d81 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1604,6 +1604,9 @@ static inline void accept_page(struct page *page) int walk_page_range_mm(struct mm_struct *mm, unsigned long start, unsigned long end, const struct mm_walk_ops *ops, void *private); +int walk_page_range_debug(struct mm_struct *mm, unsigned long start, + unsigned long end, const struct mm_walk_ops *ops, + pgd_t *pgd, void *private); /* pt_reclaim.c */ bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval); diff --git a/mm/pagewalk.c b/mm/pagewalk.c index e478777c86e1..ff5299eca687 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -585,8 +585,7 @@ int walk_page_range(struct mm_struct *mm, unsigned long start, } /** - * walk_page_range_novma - walk a range of pagetables not backed by a vma - * @mm: mm_struct representing the target process of page table walk + * walk_kernel_page_table_range - walk a range of kernel pagetables. * @start: start address of the virtual address range * @end: end address of the virtual address range * @ops: operation to call during the walk @@ -596,17 +595,61 @@ int walk_page_range(struct mm_struct *mm, unsigned long start, * Similar to walk_page_range() but can walk any page tables even if they are * not backed by VMAs. Because 'unusual' entries may be walked this function * will also not lock the PTEs for the pte_entry() callback. This is useful for - * walking the kernel pages tables or page tables for firmware. + * walking kernel pages tables or page tables for firmware. * * Note: Be careful to walk the kernel pages tables, the caller may be need to * take other effective approaches (mmap lock may be insufficient) to prevent * the intermediate kernel page tables belonging to the specified address range * from being freed (e.g. memory hot-remove). 
*/ -int walk_page_range_novma(struct mm_struct *mm, unsigned long start, +int walk_kernel_page_table_range(unsigned long start, unsigned long end, + const struct mm_walk_ops *ops, pgd_t *pgd, void *private) +{ + struct mm_struct *mm = &init_mm; + struct mm_walk walk = { + .ops = ops, + .mm = mm, + .pgd = pgd, + .private = private, + .no_vma = true + }; + + if (start >= end) + return -EINVAL; + if (!check_ops_valid(ops)) + return -EINVAL; + + /* + * Kernel intermediate page tables are usually not freed, so the mmap + * read lock is sufficient. But there are some exceptions. + * E.g. memory hot-remove. In which case, the mmap lock is insufficient + * to prevent the intermediate kernel pages tables belonging to the + * specified address range from being freed. The caller should take + * other actions to prevent this race. + */ + mmap_assert_locked(mm); + + return walk_pgd_range(start, end, &walk); +} + +/** + * walk_page_range_debug - walk a range of pagetables not backed by a vma + * @mm: mm_struct representing the target process of page table walk + * @start: start address of the virtual address range + * @end: end address of the virtual address range + * @ops: operation to call during the walk + * @pgd: pgd to walk if different from mm->pgd + * @private: private data for callbacks' usage + * + * Similar to walk_page_range() but can walk any page tables even if they are + * not backed by VMAs. Because 'unusual' entries may be walked this function + * will also not lock the PTEs for the pte_entry() callback. + * + * This is for debugging purposes ONLY. + */ +int walk_page_range_debug(struct mm_struct *mm, unsigned long start, unsigned long end, const struct mm_walk_ops *ops, - pgd_t *pgd, - void *private) + pgd_t *pgd, void *private) { struct mm_walk walk = { .ops = ops, @@ -616,34 +659,24 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start, .no_vma = true }; + /* For convenience, we allow traversal of kernel mappings. */ + if (mm == &init_mm) + return walk_kernel_page_table_range(start, end, ops, + pgd, private); if (start >= end || !walk.mm) return -EINVAL; if (!check_ops_valid(ops)) return -EINVAL; /* - * 1) For walking the user virtual address space: - * * The mmap lock protects the page walker from changes to the page * tables during the walk. However a read lock is insufficient to * protect those areas which don't have a VMA as munmap() detaches * the VMAs before downgrading to a read lock and actually tearing * down PTEs/page tables. In which case, the mmap write lock should - * be hold. - * - * 2) For walking the kernel virtual address space: - * - * The kernel intermediate page tables usually do not be freed, so - * the mmap map read lock is sufficient. But there are some exceptions. - * E.g. memory hot-remove. In which case, the mmap lock is insufficient - * to prevent the intermediate kernel pages tables belonging to the - * specified address range from being freed. The caller should take - * other actions to prevent this race. + * be held. 
*/ - if (mm == &init_mm) - mmap_assert_locked(walk.mm); - else - mmap_assert_write_locked(walk.mm); + mmap_assert_write_locked(mm); return walk_pgd_range(start, end, &walk); } diff --git a/mm/ptdump.c b/mm/ptdump.c index 9374f29cdc6f..61a352aa12ed 100644 --- a/mm/ptdump.c +++ b/mm/ptdump.c @@ -4,6 +4,7 @@ #include #include #include +#include "internal.h" #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS) /* @@ -177,7 +178,7 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd) mmap_write_lock(mm); while (range->start != range->end) { - walk_page_range_novma(mm, range->start, range->end, + walk_page_range_debug(mm, range->start, range->end, &ptdump_ops, pgd, st); range++; }

From a03db236aebfaeadf79396dbd570896b870bda01 Mon Sep 17 00:00:00 2001 From: Li Zhe Date: Fri, 6 Jun 2025 10:37:42 +0800 Subject: [PATCH 055/370] gup: optimize longterm pin_user_pages() for large folio

In the current implementation of longterm pin_user_pages(), we invoke collect_longterm_unpinnable_folios(). This function iterates through the list to check whether each folio belongs to the "longterm_unpinnable" category. The folios in this list essentially correspond to a contiguous region of userspace addresses, with each folio representing a physical address in increments of PAGESIZE. If this userspace address range is mapped with large folios, we can optimize the performance of collect_longterm_unpinnable_folios() by reducing the use of the READ_ONCE() invoked in pofs_get_folio()->page_folio()->_compound_head().

Also, we can simplify the logic of collect_longterm_unpinnable_folios(). Instead of comparing with prev_folio after calling pofs_get_folio(), we can check whether the next page is within the same folio.

The performance test results, based on v6.15 and obtained through the gup_test tool from the kernel source tree, are as follows. We achieve an improvement of over 66% for large folios with pagesize=2M. For small folios, we have observed only a very slight degradation in performance.
Without this patch:
[root@localhost ~]# ./gup_test -HL -m 8192 -n 512
TAP version 13
1..1
# PIN_LONGTERM_BENCHMARK: Time: get:14391 put:10858 us
# ok 1 ioctl status 0
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
[root@localhost ~]# ./gup_test -LT -m 8192 -n 512
TAP version 13
1..1
# PIN_LONGTERM_BENCHMARK: Time: get:130538 put:31676 us
# ok 1 ioctl status 0
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0

With this patch:
[root@localhost ~]# ./gup_test -HL -m 8192 -n 512
TAP version 13
1..1
# PIN_LONGTERM_BENCHMARK: Time: get:4867 put:10516 us
# ok 1 ioctl status 0
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
[root@localhost ~]# ./gup_test -LT -m 8192 -n 512
TAP version 13
1..1
# PIN_LONGTERM_BENCHMARK: Time: get:131798 put:31328 us
# ok 1 ioctl status 0
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0

[lizhe.67@bytedance.com: whitespace fix, per David] Link: https://lkml.kernel.org/r/20250606091917.91384-1-lizhe.67@bytedance.com Link: https://lkml.kernel.org/r/20250606023742.58344-1-lizhe.67@bytedance.com Signed-off-by: Li Zhe Cc: David Hildenbrand Cc: Dev Jain Cc: Jason Gunthorpe Cc: John Hubbard Cc: Muchun Song Cc: Peter Xu Signed-off-by: Andrew Morton --- mm/gup.c | 38 ++++++++++++++++++++++++++++++-------- 1 file changed, 30 insertions(+), 8 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 7f2644a433a0..cbe8e4b9845b 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2296,6 +2296,31 @@ static void pofs_unpin(struct pages_or_folios *pofs) unpin_user_pages(pofs->pages, pofs->nr_entries); } +static struct folio *pofs_next_folio(struct folio *folio, + struct pages_or_folios *pofs, long *index_ptr) +{ + long i = *index_ptr + 1; + + if (!pofs->has_folios && folio_test_large(folio)) { + const unsigned long start_pfn = folio_pfn(folio); + const unsigned long end_pfn = start_pfn + folio_nr_pages(folio); + + for (; i < pofs->nr_entries; i++) { + unsigned long pfn = page_to_pfn(pofs->pages[i]); + + /* Is this page part of this folio? */ + if (pfn < start_pfn || pfn >= end_pfn) + break; + } + } + + if (unlikely(i == pofs->nr_entries)) + return NULL; + *index_ptr = i; + + return pofs_get_folio(pofs, i); +} + /* * Returns the number of collected folios. Return value is always >= 0. */ @@ -2303,16 +2328,13 @@ static unsigned long collect_longterm_unpinnable_folios( struct list_head *movable_folio_list, struct pages_or_folios *pofs) { - unsigned long i, collected = 0; - struct folio *prev_folio = NULL; + unsigned long collected = 0; bool drain_allow = true; + struct folio *folio; + long i = 0; - for (i = 0; i < pofs->nr_entries; i++) { - struct folio *folio = pofs_get_folio(pofs, i); - - if (folio == prev_folio) - continue; - prev_folio = folio; + for (folio = pofs_get_folio(pofs, i); folio; + folio = pofs_next_folio(folio, pofs, &i)) { if (folio_is_longterm_pinnable(folio)) continue;

From b0da7709c28c35e0a51d4b1b350c9028358dfb14 Mon Sep 17 00:00:00 2001 From: David Wang <00107082@163.com> Date: Mon, 9 Jun 2025 14:42:00 +0800 Subject: [PATCH 056/370] alloc_tag: add sequence number for module and iterator

The codetag iterator uses an <id, address> pair to guarantee validity. But both the id and the address can be reused: there is a theoretical possibility that, when a module is inserted right after another module is removed, kmalloc() returns the same address that was kfree()d by the previous module, and the IDR reuses the recently removed key.

Add a sequence number to codetag_module and codetag_iterator; the sequence number is strictly incremented whenever a module is loaded.
An iterator is valid if and only if its sequence number matches codetag_module's.

Link: https://lkml.kernel.org/r/20250609064200.112639-1-00107082@163.com Signed-off-by: David Wang <00107082@163.com> Acked-by: Suren Baghdasaryan Cc: Kent Overstreet Cc: Tim Chen Signed-off-by: Andrew Morton --- include/linux/codetag.h | 1 + lib/codetag.c | 17 ++++++++++++++--- 2 files changed, 15 insertions(+), 3 deletions(-) diff --git a/include/linux/codetag.h b/include/linux/codetag.h index 5f2b9a1f722c..457ed8fd3214 100644 --- a/include/linux/codetag.h +++ b/include/linux/codetag.h @@ -54,6 +54,7 @@ struct codetag_iterator { struct codetag_module *cmod; unsigned long mod_id; struct codetag *ct; + unsigned long mod_seq; }; #ifdef MODULE diff --git a/lib/codetag.c b/lib/codetag.c index 650d54d7e14d..545911cebd25 100644 --- a/lib/codetag.c +++ b/lib/codetag.c @@ -11,8 +11,14 @@ struct codetag_type { struct list_head link; unsigned int count; struct idr mod_idr; - struct rw_semaphore mod_lock; /* protects mod_idr */ + /* + * protects mod_idr, next_mod_seq, + * iter->mod_seq and cmod->mod_seq + */ + struct rw_semaphore mod_lock; struct codetag_type_desc desc; + /* generates unique sequence number for module load */ + unsigned long next_mod_seq; }; struct codetag_range { @@ -23,6 +29,7 @@ struct codetag_module { struct module *mod; struct codetag_range range; + unsigned long mod_seq; }; static DEFINE_MUTEX(codetag_lock); @@ -48,6 +55,7 @@ struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype) .cmod = NULL, .mod_id = 0, .ct = NULL, + .mod_seq = 0, }; return iter; @@ -91,11 +99,13 @@ struct codetag *codetag_next_ct(struct codetag_iterator *iter) if (!cmod) break; - if (cmod != iter->cmod) { + if (!iter->cmod || iter->mod_seq != cmod->mod_seq) { iter->cmod = cmod; + iter->mod_seq = cmod->mod_seq; ct = get_first_module_ct(cmod); - } else + } else { ct = get_next_module_ct(iter); + } if (ct) break; @@ -191,6 +201,7 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod) cmod->range = range; down_write(&cttype->mod_lock); + cmod->mod_seq = ++cttype->next_mod_seq; mod_id = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL); if (mod_id >= 0) { if (cttype->desc.module_load) {

From 9f44df50fee4d2f6cb374177244ccfa9f0a5cc95 Mon Sep 17 00:00:00 2001 From: David Wang <00107082@163.com> Date: Mon, 9 Jun 2025 14:44:08 +0800 Subject: [PATCH 057/370] alloc_tag: keep codetag iterator active between read()

When reading /proc/allocinfo, seq_file invokes the start/stop callbacks for each read syscall. In the start callback, memory is allocated to store the iterator, and the iterator walks linearly from the beginning to the current read position. A seq_file read() returns at most 4096 bytes even when called with a larger user-space buffer, so tens of read syscalls are needed to read out all of /proc/allocinfo. For example, a 306036-byte allocinfo file needs 76 reads:

$ sudo cat /proc/allocinfo | wc
3964 16678 306036
$ sudo strace -T -e read cat /proc/allocinfo
...
read(3, " 4096 1 arch/x86/k"..., 131072) = 4063 <0.000062>
...
read(3, " 0 0 sound/core"..., 131072) = 4021 <0.000150>
...

For those n=3964 lines, each read takes about m=3964/76=52 lines. Since the iterator restarts from the beginning for each read(), it moves forward m steps on the 1st read, 2*m steps on the 2nd read, 3*m steps on the 3rd read, ... and n steps on the last read. As the reads progress, those linear seek steps make the read() calls slower and slower.
Adding those up, the codetag iterator moves about O(n*n/m) steps, making data structure traversal a significant part of the whole read. Profiling when stress-reading /proc/allocinfo confirms it:

vfs_read(99.959% 1677299/1677995)
  proc_reg_read_iter(99.856% 1674881/1677299)
    seq_read_iter(99.959% 1674191/1674881)
      allocinfo_start(75.664% 1266755/1674191)
        codetag_next_ct(79.217% 1003487/1266755)  <---
        srso_return_thunk(1.264% 16011/1266755)
        __kmalloc_cache_noprof(0.102% 1296/1266755)
        ...
      allocinfo_show(21.287% 356378/1674191)
      allocinfo_next(1.530% 25621/1674191)

codetag_next_ct() takes the major part.

Private data allocated at open() time can be used to keep the iterator alive across read() calls, avoiding the per-read memory allocation and iterator reset. This way, only one memory allocation and O(n) iteration steps are needed, and `time` shows the read time improving from ~7ms to ~4ms. Profiling with the change:

vfs_read(99.865% 1581073/1583214)
  proc_reg_read_iter(99.485% 1572934/1581073)
    seq_read_iter(99.846% 1570519/1572934)
      allocinfo_show(87.428% 1373074/1570519)
        seq_buf_printf(83.695% 1149196/1373074)
        seq_buf_putc(1.917% 26321/1373074)
        _find_next_bit(1.531% 21023/1373074)
        ...
        codetag_to_text(0.490% 6727/1373074)
        ...
      allocinfo_next(6.275% 98543/1570519)
      ...
      allocinfo_start(0.369% 5790/1570519)
      ...

Now seq_buf_printf() takes the major part.

Link: https://lkml.kernel.org/r/20250609064408.112783-1-00107082@163.com Signed-off-by: David Wang <00107082@163.com> Acked-by: Suren Baghdasaryan Cc: Kent Overstreet Cc: Tim Chen Signed-off-by: Andrew Morton --- lib/alloc_tag.c | 29 ++++++++++------------------- 1 file changed, 10 insertions(+), 19 deletions(-) diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c index 3a74d63a959e..36f07dc95069 100644 --- a/lib/alloc_tag.c +++ b/lib/alloc_tag.c @@ -46,21 +46,16 @@ struct allocinfo_private { static void *allocinfo_start(struct seq_file *m, loff_t *pos) { struct allocinfo_private *priv; - struct codetag *ct; loff_t node = *pos; - priv = kzalloc(sizeof(*priv), GFP_KERNEL); - m->private = priv; - if (!priv) - return NULL; - - priv->print_header = (node == 0); + priv = (struct allocinfo_private *)m->private; codetag_lock_module_list(alloc_tag_cttype, true); - priv->iter = codetag_get_ct_iter(alloc_tag_cttype); - while ((ct = codetag_next_ct(&priv->iter)) != NULL && node) - node--; - - return ct ? priv : NULL; + if (node == 0) { + priv->print_header = true; + priv->iter = codetag_get_ct_iter(alloc_tag_cttype); + codetag_next_ct(&priv->iter); + } + return priv->iter.ct ?
priv : NULL; } static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos) @@ -77,12 +72,7 @@ static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos) static void allocinfo_stop(struct seq_file *m, void *arg) { - struct allocinfo_private *priv = (struct allocinfo_private *)m->private; - - if (priv) { - codetag_lock_module_list(alloc_tag_cttype, false); - kfree(priv); - } + codetag_lock_module_list(alloc_tag_cttype, false); } static void print_allocinfo_header(struct seq_buf *buf) @@ -817,7 +807,8 @@ static int __init alloc_tag_init(void) return 0; } - if (!proc_create_seq(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op)) { + if (!proc_create_seq_private(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op, + sizeof(struct allocinfo_private), NULL)) { pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME); shutdown_mem_profiling(false); return -ENOMEM;

From cce35103135c7ffc7bebc32ebfc74fe1f2c3cb5d Mon Sep 17 00:00:00 2001 From: Li Zhijian Date: Tue, 10 Jun 2025 14:27:51 +0800 Subject: [PATCH 058/370] mm/memory-tier: fix abstract distance calculation overflow

In mt_perf_to_adistance(), the calculation of abstract distance (adist) involves multiplying several int values including MEMTIER_ADISTANCE_DRAM.

*adist = MEMTIER_ADISTANCE_DRAM *
	(perf->read_latency + perf->write_latency) /
	(default_dram_perf.read_latency + default_dram_perf.write_latency) *
	(default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) /
	(perf->read_bandwidth + perf->write_bandwidth);

Since these values can be large, the multiplication may exceed the maximum value of an int (INT_MAX) and overflow (our platform did), leading to an incorrect adist.

User-visible impact: The memory tiering subsystem will misinterpret slow memory (like CXL) as faster than DRAM, causing inappropriate demotion of pages from CXL (slow memory) to DRAM (fast memory). For example, we will see the following demotion chains in the dmesg, where Node0,1 are DRAM and Node2,3 are CXL nodes:

Demotion targets for Node 0: null
Demotion targets for Node 1: null
Demotion targets for Node 2: preferred: 0-1, fallback: 0-1
Demotion targets for Node 3: preferred: 0-1, fallback: 0-1

Change MEMTIER_ADISTANCE_DRAM to be a long constant by writing it with the 'L' suffix. This prevents the overflow because the multiplication will then be done in the long type, which has a larger range.

Link: https://lkml.kernel.org/r/20250611023439.2845785-1-lizhijian@fujitsu.com Link: https://lkml.kernel.org/r/20250610062751.2365436-1-lizhijian@fujitsu.com Fixes: 3718c02dbd4c ("acpi, hmat: calculate abstract distance with HMAT") Signed-off-by: Li Zhijian Reviewed-by: Huang Ying Acked-by: Balbir Singh Reviewed-by: Donet Tom Reviewed-by: Oscar Salvador Cc: Signed-off-by: Andrew Morton --- include/linux/memory-tiers.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 0dc0cf2863e2..7a805796fcfd 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -18,7 +18,7 @@ * adistance value (slightly faster) than default DRAM adistance to be part of * the same memory tier.
*/ -#define MEMTIER_ADISTANCE_DRAM ((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1)) +#define MEMTIER_ADISTANCE_DRAM ((4L * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1)) struct memory_tier; struct memory_dev_type {

From 94dab12d86cf77ff0b8f667dc98af6d997422cb4 Mon Sep 17 00:00:00 2001 From: Dev Jain Date: Tue, 10 Jun 2025 09:20:42 +0530 Subject: [PATCH 059/370] mm: call pointers to ptes as ptep

Patch series "Optimize mremap() for large folios", v4.

Currently move_ptes() iterates through ptes one by one. If the underlying folio mapped by the ptes is large, we can process those ptes in a batch using folio_pte_batch(), thus clearing and setting the PTEs in one go. For arm64 specifically, this results in a 16x reduction in the number of ptep_get() calls (since on a contig block, ptep_get() on arm64 will iterate through all 16 entries to collect a/d bits), and we also elide extra TLBIs through get_and_clear_full_ptes, replacing ptep_get_and_clear.

Mapping 1M of memory with 64K folios, memsetting it, remapping it to src + 1M, and munmapping it 10,000 times, the average execution time drops from 1.9 to 1.2 seconds on Apple M3 (arm64), a 37% improvement. No regression is observed for small folios.

Test program for reference (the headers were lost in transit; the ones below are the functionally required set):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define SIZE (1UL << 20) // 1M

int main(void)
{
	void *new_addr, *addr;

	for (int i = 0; i < 10000; ++i) {
		addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (addr == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		memset(addr, 0xAA, SIZE);
		new_addr = mremap(addr, SIZE, SIZE,
				  MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE);
		if (new_addr != (addr + SIZE)) {
			perror("mremap");
			return 1;
		}
		munmap(new_addr, SIZE);
	}
}

This patch (of 2): Avoid confusion between pte_t* and pte_t data types by suffixing pointer type variables with p. No functional change.

Link: https://lkml.kernel.org/r/20250610035043.75448-1-dev.jain@arm.com Link: https://lkml.kernel.org/r/20250610035043.75448-2-dev.jain@arm.com Signed-off-by: Dev Jain Reviewed-by: Barry Song Reviewed-by: Anshuman Khandual Reviewed-by: Lorenzo Stoakes Acked-by: David Hildenbrand Reviewed-by: Pedro Falcato Cc: Bang Li Cc: Baolin Wang Cc: bibo mao Cc: Hugh Dickins Cc: Ingo Molnar Cc: Jann Horn Cc: Lance Yang Cc: Liam Howlett Cc: Matthew Wilcox (Oracle) Cc: Peter Xu Cc: Qi Zheng Cc: Ryan Roberts Cc: Vlastimil Babka Cc: Yang Shi Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/mremap.c | 31 ++++++++++++++++--------------- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/mm/mremap.c b/mm/mremap.c index 60f6b8d0d5f0..180b12225368 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -176,7 +176,8 @@ static int move_ptes(struct pagetable_move_control *pmc, struct vm_area_struct *vma = pmc->old; bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma); struct mm_struct *mm = vma->vm_mm; - pte_t *old_pte, *new_pte, pte; + pte_t *old_ptep, *new_ptep; + pte_t pte; pmd_t dummy_pmdval; spinlock_t *old_ptl, *new_ptl; bool force_flush = false; @@ -211,8 +212,8 @@ static int move_ptes(struct pagetable_move_control *pmc, * We don't have to worry about the ordering of src and dst * pte locks because exclusive mmap_lock prevents deadlock.
*/ - old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl); - if (!old_pte) { + old_ptep = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl); + if (!old_ptep) { err = -EAGAIN; goto out; } @@ -223,10 +224,10 @@ static int move_ptes(struct pagetable_move_control *pmc, * mmap_lock, so this new_pte page is stable, so there is no need to get * pmdval and do pmd_same() check. */ - new_pte = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval, + new_ptep = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval, &new_ptl); - if (!new_pte) { - pte_unmap_unlock(old_pte, old_ptl); + if (!new_ptep) { + pte_unmap_unlock(old_ptep, old_ptl); err = -EAGAIN; goto out; } @@ -235,14 +236,14 @@ static int move_ptes(struct pagetable_move_control *pmc, flush_tlb_batched_pending(vma->vm_mm); arch_enter_lazy_mmu_mode(); - for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE, - new_pte++, new_addr += PAGE_SIZE) { - VM_WARN_ON_ONCE(!pte_none(*new_pte)); + for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE, + new_ptep++, new_addr += PAGE_SIZE) { + VM_WARN_ON_ONCE(!pte_none(*new_ptep)); - if (pte_none(ptep_get(old_pte))) + if (pte_none(ptep_get(old_ptep))) continue; - pte = ptep_get_and_clear(mm, old_addr, old_pte); + pte = ptep_get_and_clear(mm, old_addr, old_ptep); /* * If we are remapping a valid PTE, make sure * to flush TLB before we drop the PTL for the @@ -260,7 +261,7 @@ static int move_ptes(struct pagetable_move_control *pmc, pte = move_soft_dirty_pte(pte); if (need_clear_uffd_wp && pte_marker_uffd_wp(pte)) - pte_clear(mm, new_addr, new_pte); + pte_clear(mm, new_addr, new_ptep); else { if (need_clear_uffd_wp) { if (pte_present(pte)) @@ -268,7 +269,7 @@ static int move_ptes(struct pagetable_move_control *pmc, else if (is_swap_pte(pte)) pte = pte_swp_clear_uffd_wp(pte); } - set_pte_at(mm, new_addr, new_pte, pte); + set_pte_at(mm, new_addr, new_ptep, pte); } } @@ -277,8 +278,8 @@ static int move_ptes(struct pagetable_move_control *pmc, flush_tlb_range(vma, old_end - len, old_end); if (new_ptl != old_ptl) spin_unlock(new_ptl); - pte_unmap(new_pte - 1); - pte_unmap_unlock(old_pte - 1, old_ptl); + pte_unmap(new_ptep - 1); + pte_unmap_unlock(old_ptep - 1, old_ptl); out: if (pmc->need_rmap_locks) drop_rmap_locks(vma); From f822a9a81a31311d67f260aea96005540b18ab07 Mon Sep 17 00:00:00 2001 From: Dev Jain Date: Tue, 10 Jun 2025 09:20:43 +0530 Subject: [PATCH 060/370] mm: optimize mremap() by PTE batching Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes are painted with the contig bit, then ptep_get() will iterate through all 16 entries to collect a/d bits. Hence this optimization will result in a 16x reduction in the number of ptep_get() calls. Next, ptep_get_and_clear() will eventually call contpte_try_unfold() on every contig block, thus flushing the TLB for the complete large folio range. Instead, use get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only do them on the starting and ending contig block. For split folios, there will be no pte batching; nr_ptes will be 1. For pagetable splitting, the ptes will still point to the same large folio; for arm64, this results in the optimization described above, and for other arches (including the general case), a minor improvement is expected due to a reduction in the number of function calls. 
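To make the claimed reduction concrete, a back-of-the-envelope example using the cover letter's configuration (assumptions: 4K base pages, the arm64 contig bit spanning 16 PTEs, i.e. 64K folios):

	/*
	 * 1M range / 4K page             = 256 PTEs to move
	 * contig block (= batch) size    = 16 PTEs
	 * loop iterations after batching = 256 / 16 = 16   (was 256)
	 *
	 * get_and_clear_full_ptes() additionally elides the TLBI on each
	 * intermediate contig block, leaving them only at the starting
	 * and ending contig block of the batch.
	 */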
Link: https://lkml.kernel.org/r/20250610035043.75448-3-dev.jain@arm.com Signed-off-by: Dev Jain Reviewed-by: Barry Song Reviewed-by: Lorenzo Stoakes Reviewed-by: Pedro Falcato Cc: Anshuman Khandual Cc: Bang Li Cc: Baolin Wang Cc: bibo mao Cc: David Hildenbrand Cc: Hugh Dickins Cc: Ingo Molnar Cc: Jann Horn Cc: Lance Yang Cc: Liam Howlett Cc: Matthew Wilcox (Oracle) Cc: Peter Xu Cc: Qi Zheng Cc: Ryan Roberts Cc: Vlastimil Babka Cc: Yang Shi Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/mremap.c | 39 ++++++++++++++++++++++++++++++++------- 1 file changed, 32 insertions(+), 7 deletions(-) diff --git a/mm/mremap.c b/mm/mremap.c index 180b12225368..18b215521ada 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte) return pte; } +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr, + pte_t *ptep, pte_t pte, int max_nr) +{ + const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY; + struct folio *folio; + + if (max_nr == 1) + return 1; + + folio = vm_normal_folio(vma, addr, pte); + if (!folio || !folio_test_large(folio)) + return 1; + + return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL, + NULL, NULL); +} + static int move_ptes(struct pagetable_move_control *pmc, unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd) { @@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc, bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma); struct mm_struct *mm = vma->vm_mm; pte_t *old_ptep, *new_ptep; - pte_t pte; + pte_t old_pte, pte; pmd_t dummy_pmdval; spinlock_t *old_ptl, *new_ptl; bool force_flush = false; @@ -185,6 +202,8 @@ static int move_ptes(struct pagetable_move_control *pmc, unsigned long new_addr = pmc->new_addr; unsigned long old_end = old_addr + extent; unsigned long len = old_end - old_addr; + int max_nr_ptes; + int nr_ptes; int err = 0; /* @@ -236,14 +255,16 @@ static int move_ptes(struct pagetable_move_control *pmc, flush_tlb_batched_pending(vma->vm_mm); arch_enter_lazy_mmu_mode(); - for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE, - new_ptep++, new_addr += PAGE_SIZE) { + for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE, + new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) { VM_WARN_ON_ONCE(!pte_none(*new_ptep)); - if (pte_none(ptep_get(old_ptep))) + nr_ptes = 1; + max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT; + old_pte = ptep_get(old_ptep); + if (pte_none(old_pte)) continue; - pte = ptep_get_and_clear(mm, old_addr, old_ptep); /* * If we are remapping a valid PTE, make sure * to flush TLB before we drop the PTL for the @@ -255,8 +276,12 @@ static int move_ptes(struct pagetable_move_control *pmc, * the TLB entry for the old mapping has been * flushed. 
*/ - if (pte_present(pte)) + if (pte_present(old_pte)) { + nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep, + old_pte, max_nr_ptes); force_flush = true; + } + pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0); pte = move_pte(pte, old_addr, new_addr); pte = move_soft_dirty_pte(pte); @@ -269,7 +294,7 @@ static int move_ptes(struct pagetable_move_control *pmc, else if (is_swap_pte(pte)) pte = pte_swp_clear_uffd_wp(pte); } - set_pte_at(mm, new_addr, new_ptep, pte); + set_ptes(mm, new_addr, new_ptep, pte, nr_ptes); } } From 4f8ba33bbdfc475fdc1ac5de6a93f5de93203ed5 Mon Sep 17 00:00:00 2001 From: Barry Song Date: Wed, 11 Jun 2025 22:47:45 +1200 Subject: [PATCH 061/370] mm: madvise: use per_vma lock for MADV_FREE MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MADV_FREE is another option, besides MADV_DONTNEED, for dynamic memory freeing in user-space native or Java heap memory management. For example, jemalloc can be configured to use MADV_FREE, and recent versions of the Android Java heap have also increasingly adopted MADV_FREE. Supporting per-VMA locking for MADV_FREE thus appears increasingly necessary. We have replaced walk_page_range() with walk_page_range_vma(). Along with the proposed madvise_lock_mode by Lorenzo, the necessary infrastructure is now in place to begin exploring per-VMA locking support for MADV_FREE and potentially other madvise using walk_page_range_vma(). This patch adds support for the PGWALK_VMA_RDLOCK walk_lock mode in walk_page_range_vma(), and leverages madvise_lock_mode from madv_behavior to select the appropriate walk_lock—either mmap_lock or per-VMA lock—based on the context. Because we now dynamically update the walk_ops->walk_lock field, we must ensure this is thread-safe. The madvise_free_walk_ops is now defined as a stack variable instead of a global constant. Link: https://lkml.kernel.org/r/20250611104745.57405-1-21cnbao@gmail.com Signed-off-by: Barry Song Acked-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Acked-by: SeongJae Park Cc: "Liam R. 
Howlett" Cc: Vlastimil Babka Cc: Jann Horn Cc: Suren Baghdasaryan Cc: Lokesh Gidra Cc: Mike Rapoport Cc: Michal Hocko Cc: Tangquan Zheng Cc: Qi Zheng Signed-off-by: Andrew Morton --- include/linux/pagewalk.h | 2 ++ mm/madvise.c | 25 +++++++++++++++++++------ mm/pagewalk.c | 5 ++++- 3 files changed, 25 insertions(+), 7 deletions(-) diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h index 8ac2f6d6d2a3..682472c15495 100644 --- a/include/linux/pagewalk.h +++ b/include/linux/pagewalk.h @@ -14,6 +14,8 @@ enum page_walk_lock { PGWALK_WRLOCK = 1, /* vma is expected to be already write-locked during the walk */ PGWALK_WRLOCK_VERIFY = 2, + /* vma is expected to be already read-locked during the walk */ + PGWALK_VMA_RDLOCK_VERIFY = 3, }; /** diff --git a/mm/madvise.c b/mm/madvise.c index 7d78d4b5fb18..d451438af999 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -777,10 +777,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, return 0; } -static const struct mm_walk_ops madvise_free_walk_ops = { - .pmd_entry = madvise_free_pte_range, - .walk_lock = PGWALK_RDLOCK, -}; +static inline enum page_walk_lock get_walk_lock(enum madvise_lock_mode mode) +{ + switch (mode) { + case MADVISE_VMA_READ_LOCK: + return PGWALK_VMA_RDLOCK_VERIFY; + case MADVISE_MMAP_READ_LOCK: + return PGWALK_RDLOCK; + default: + /* Other modes don't require fixing up the walk_lock */ + WARN_ON_ONCE(1); + return PGWALK_RDLOCK; + } +} static int madvise_free_single_vma(struct madvise_behavior *madv_behavior, struct vm_area_struct *vma, @@ -789,6 +798,9 @@ static int madvise_free_single_vma(struct madvise_behavior *madv_behavior, struct mm_struct *mm = vma->vm_mm; struct mmu_notifier_range range; struct mmu_gather *tlb = madv_behavior->tlb; + struct mm_walk_ops walk_ops = { + .pmd_entry = madvise_free_pte_range, + }; /* MADV_FREE works for only anon vma at the moment */ if (!vma_is_anonymous(vma)) @@ -808,8 +820,9 @@ static int madvise_free_single_vma(struct madvise_behavior *madv_behavior, mmu_notifier_invalidate_range_start(&range); tlb_start_vma(tlb, vma); + walk_ops.walk_lock = get_walk_lock(madv_behavior->lock_mode); walk_page_range_vma(vma, range.start, range.end, - &madvise_free_walk_ops, tlb); + &walk_ops, tlb); tlb_end_vma(tlb, vma); mmu_notifier_invalidate_range_end(&range); return 0; @@ -1658,7 +1671,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi case MADV_WILLNEED: case MADV_COLD: case MADV_PAGEOUT: - case MADV_FREE: case MADV_POPULATE_READ: case MADV_POPULATE_WRITE: case MADV_COLLAPSE: @@ -1667,6 +1679,7 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi return MADVISE_MMAP_READ_LOCK; case MADV_DONTNEED: case MADV_DONTNEED_LOCKED: + case MADV_FREE: return MADVISE_VMA_READ_LOCK; default: return MADVISE_MMAP_WRITE_LOCK; diff --git a/mm/pagewalk.c b/mm/pagewalk.c index ff5299eca687..a214a2b40ab9 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -422,7 +422,7 @@ static inline void process_mm_walk_lock(struct mm_struct *mm, { if (walk_lock == PGWALK_RDLOCK) mmap_assert_locked(mm); - else + else if (walk_lock != PGWALK_VMA_RDLOCK_VERIFY) mmap_assert_write_locked(mm); } @@ -437,6 +437,9 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma, case PGWALK_WRLOCK_VERIFY: vma_assert_write_locked(vma); break; + case PGWALK_VMA_RDLOCK_VERIFY: + vma_assert_locked(vma); + break; case PGWALK_RDLOCK: /* PGWALK_RDLOCK is handled by process_mm_walk_lock */ break; From a788b6e571f3cb1aceb1611e97010be7334cbd63 Mon Sep 17 
00:00:00 2001 From: Pu Lehui Date: Wed, 11 Jun 2025 10:01:06 +0000 Subject: [PATCH 062/370] selftests/mm: use generic read_sysfs in thuge-gen test

As a generic read_sysfs is available in vm_utils, let's use it in the thuge-gen test.

Link: https://lkml.kernel.org/r/20250611100106.1331197-1-pulehui@huaweicloud.com Signed-off-by: Pu Lehui Reviewed-by: Lorenzo Stoakes Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/thuge-gen.c | 44 ++++++++------------------ 1 file changed, 13 insertions(+), 31 deletions(-) diff --git a/tools/testing/selftests/mm/thuge-gen.c b/tools/testing/selftests/mm/thuge-gen.c index 95b6f043a3cb..8e2b08dc5762 100644 --- a/tools/testing/selftests/mm/thuge-gen.c +++ b/tools/testing/selftests/mm/thuge-gen.c @@ -77,38 +77,18 @@ void show(unsigned long ps) system(buf); } -unsigned long thuge_read_sysfs(int warn, char *fmt, ...) -{ - char *line = NULL; - size_t linelen = 0; - char buf[100]; - FILE *f; - va_list ap; - unsigned long val = 0; - - va_start(ap, fmt); - vsnprintf(buf, sizeof buf, fmt, ap); - va_end(ap); - - f = fopen(buf, "r"); - if (!f) { - if (warn) - ksft_print_msg("missing %s\n", buf); - return 0; - } - if (getline(&line, &linelen, f) > 0) { - sscanf(line, "%lu", &val); - } - fclose(f); - free(line); - return val; -} - unsigned long read_free(unsigned long ps) { - return thuge_read_sysfs(ps != getpagesize(), - "/sys/kernel/mm/hugepages/hugepages-%lukB/free_hugepages", - ps >> 10); + unsigned long val = 0; + char buf[100]; + + snprintf(buf, sizeof(buf), + "/sys/kernel/mm/hugepages/hugepages-%lukB/free_hugepages", + ps >> 10); + if (read_sysfs(buf, &val) && ps != getpagesize()) + ksft_print_msg("missing %s\n", buf); + + return val; } void test_mmap(unsigned long size, unsigned flags) @@ -173,6 +153,7 @@ void test_shmget(unsigned long size, unsigned flags) void find_pagesizes(void) { unsigned long largest = getpagesize(); + unsigned long shmmax_val = 0; int i; glob_t g; @@ -195,7 +176,8 @@ void find_pagesizes(void) } globfree(&g); - if (thuge_read_sysfs(0, "/proc/sys/kernel/shmmax") < NUM_PAGES * largest) + read_sysfs("/proc/sys/kernel/shmmax", &shmmax_val); + if (shmmax_val < NUM_PAGES * largest) ksft_exit_fail_msg("Please do echo %lu > /proc/sys/kernel/shmmax", largest * NUM_PAGES);

From a984f16fba2cbadfd8f3310cf506a484dc5bdeb6 Mon Sep 17 00:00:00 2001 From: Shivank Garg Date: Wed, 11 Jun 2025 05:27:07 +0000 Subject: [PATCH 063/370] mm: use folio_expected_ref_count() helper for reference counting

Replace open-coded folio reference count calculations with folio_expected_ref_count(). No functional changes intended.

Link: https://lkml.kernel.org/r/20250611052706.515408-2-shivankg@amd.com Signed-off-by: Shivank Garg Acked-by: David Hildenbrand Reviewed-by: Oscar Salvador Cc: Adrian Hunter Cc: Alexander Shishkin Cc: Alistair Popple Cc: Arnaldo Carvalho de Melo Cc: Ian Rogers Cc: Ingo Molnar Cc: Jiri Olsa Cc: Kan Liang Cc: Mark Rutland Cc: "Masami Hiramatsu (Google)" Cc: Matthew Wilcox (Oracle) Cc: Namhyung Kim Cc: Oleg Nesterov Cc: Peter Zijlstra Cc: Steven Rostedt Signed-off-by: Andrew Morton --- kernel/events/uprobes.c | 3 +-- mm/memfd.c | 3 +-- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 4c965ba77f9f..8a601df87072 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -436,8 +436,7 @@ static int __uprobe_write_opcode(struct vm_area_struct *vma, * there are no unexpected folio references ...
*/ if (is_register || userfaultfd_missing(vma) || - (folio_ref_count(folio) != folio_mapcount(folio) + 1 + - folio_test_swapcache(folio) * folio_nr_pages(folio))) + (folio_ref_count(folio) != folio_expected_ref_count(folio) + 1)) goto remap; /* diff --git a/mm/memfd.c b/mm/memfd.c index 4fc98abe6627..65a107f72e39 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -32,8 +32,7 @@ static bool memfd_folio_has_extra_refs(struct folio *folio) { - return folio_ref_count(folio) - folio_mapcount(folio) != - folio_nr_pages(folio); + return folio_ref_count(folio) != folio_expected_ref_count(folio); } static void memfd_tag_pins(struct xa_state *xas) From ff20487308f4124fcbe806616773c88672b738f1 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Thu, 12 Jun 2025 15:34:37 +0100 Subject: [PATCH 064/370] bio: use memzero_page() in bio_truncate() Patch series "Remove zero_user()". The zero_user() API is almost unused these days. Finish the job of removing it. This patch (of 5): memzero_page() is the new name for zero_user(). Link: https://lkml.kernel.org/r/20250612143443.2848197-1-willy@infradead.org Link: https://lkml.kernel.org/r/20250612143443.2848197-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Alex Markuze Cc: Christoph Hellwig Cc: Ira Weiny Cc: Jens Axboe Cc: Ilya Dryomov Cc: Xiubo Li Cc: Dan Carpenter Cc: Viacheslav Dubeyko Signed-off-by: Andrew Morton --- block/bio.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/block/bio.c b/block/bio.c index 3c0a558c90f5..ce16c34ec6de 100644 --- a/block/bio.c +++ b/block/bio.c @@ -653,13 +653,13 @@ static void bio_truncate(struct bio *bio, unsigned new_size) bio_for_each_segment(bv, bio, iter) { if (done + bv.bv_len > new_size) { - unsigned offset; + size_t offset; if (!truncated) offset = new_size - done; else offset = 0; - zero_user(bv.bv_page, bv.bv_offset + offset, + memzero_page(bv.bv_page, bv.bv_offset + offset, bv.bv_len - offset); truncated = true; } From 1a80ff0f8896750156f22dbf2d4591d79bb2a155 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Thu, 12 Jun 2025 15:34:38 +0100 Subject: [PATCH 065/370] null_blk: use memzero_page() memzero_page() is the new name for zero_user(). Link: https://lkml.kernel.org/r/20250612143443.2848197-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Alex Markuze Cc: Christoph Hellwig Cc: Ilya Dryomov Cc: Ira Weiny Cc: Jens Axboe Cc: Xiubo Li Cc: Dan Carpenter Cc: Viacheslav Dubeyko Signed-off-by: Andrew Morton --- drivers/block/null_blk/main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c index aa163ae9b2aa..91642c9a3b29 100644 --- a/drivers/block/null_blk/main.c +++ b/drivers/block/null_blk/main.c @@ -1179,7 +1179,7 @@ static int copy_from_nullb(struct nullb *nullb, struct page *dest, memcpy_page(dest, off + count, t_page->page, offset, temp); else - zero_user(dest, off + count, temp); + memzero_page(dest, off + count, temp); count += temp; sector += temp >> SECTOR_SHIFT; From 88b478e55ce42c62819a3cde948f85508bdb9adb Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Thu, 12 Jun 2025 15:34:39 +0100 Subject: [PATCH 066/370] direct-io: use memzero_page() memzero_page() is the new name for zero_user(). 
Link: https://lkml.kernel.org/r/20250612143443.2848197-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Alex Markuze Cc: Christoph Hellwig Cc: Ilya Dryomov Cc: Ira Weiny Cc: Jens Axboe Cc: Xiubo Li Cc: Dan Carpenter Cc: Viacheslav Dubeyko Signed-off-by: Andrew Morton --- fs/direct-io.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index bbd05f1a2145..111958634def 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -996,7 +996,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio, dio_unpin_page(dio, page); goto out; } - zero_user(page, from, 1 << blkbits); + memzero_page(page, from, 1 << blkbits); sdio->block_in_file++; from += 1 << blkbits; dio->result += 1 << blkbits; From 7431f3a201b8954550ac930944870100b6627ac8 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Thu, 12 Jun 2025 15:34:40 +0100 Subject: [PATCH 067/370] ceph: convert ceph_zero_partial_page() to use a folio Retrieve a folio from the pagecache instead of a page and operate on it. Removes several hidden calls to compound_head() along with calls to deprecated functions like wait_on_page_writeback() and find_lock_page(). [dan.carpenter@linaro.org: fix NULL vs IS_ERR() bug in ceph_zero_partial_page()] Link: https://lkml.kernel.org/r/685c1424.050a0220.baa8.d6a1@mx.google.com Link: https://lkml.kernel.org/r/20250612143443.2848197-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Dan Carpenter Cc: Christoph Hellwig Cc: Ilya Dryomov Cc: Ira Weiny Cc: Jens Axboe Cc: Xiubo Li Cc: Dan Carpenter Cc: Alex Markuze Cc: Viacheslav Dubeyko Signed-off-by: Andrew Morton --- fs/ceph/file.c | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/fs/ceph/file.c b/fs/ceph/file.c index a7254cab44cc..f6e63265c516 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -2530,19 +2530,19 @@ static loff_t ceph_llseek(struct file *file, loff_t offset, int whence) return generic_file_llseek(file, offset, whence); } -static inline void ceph_zero_partial_page( - struct inode *inode, loff_t offset, unsigned size) +static inline void ceph_zero_partial_page(struct inode *inode, + loff_t offset, size_t size) { - struct page *page; - pgoff_t index = offset >> PAGE_SHIFT; + struct folio *folio; - page = find_lock_page(inode->i_mapping, index); - if (page) { - wait_on_page_writeback(page); - zero_user(page, offset & (PAGE_SIZE - 1), size); - unlock_page(page); - put_page(page); - } + folio = filemap_lock_folio(inode->i_mapping, offset >> PAGE_SHIFT); + if (IS_ERR(folio)) + return; + + folio_wait_writeback(folio); + folio_zero_range(folio, offset_in_folio(folio, offset), size); + folio_unlock(folio); + folio_put(folio); } static void ceph_zero_pagecache_range(struct inode *inode, loff_t offset, From 234dda7a49ff94154b527784687f549b9f1417c1 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Thu, 12 Jun 2025 15:34:41 +0100 Subject: [PATCH 068/370] mm: remove zero_user() All users have now been converted to either memzero_page() or folio_zero_range(). 
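For reference, the removed helper (see the diff below) was a thin wrapper, so the conversions in this series are mechanical; a sketch of the equivalence:

	/* Old helper, as removed from include/linux/highmem.h:
	 *   zero_user(page, start, size)
	 *     == zero_user_segments(page, start, start + size, 0, 0)
	 */
	zero_user(page, start, size);      /* old */
	memzero_page(page, start, size);   /* new, same semantics */

	/* Or, when a folio is at hand (as in the ceph conversion): */
	folio_zero_range(folio, offset_in_folio(folio, pos), size);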
Link: https://lkml.kernel.org/r/20250612143443.2848197-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Alex Markuze Cc: Christoph Hellwig Cc: Ilya Dryomov Cc: Ira Weiny Cc: Jens Axboe Cc: Xiubo Li Cc: Dan Carpenter Cc: Viacheslav Dubeyko Signed-off-by: Andrew Morton --- include/linux/highmem.h | 6 ------ 1 file changed, 6 deletions(-) diff --git a/include/linux/highmem.h b/include/linux/highmem.h index e48d7f27b0b9..a30526cc53a7 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -292,12 +292,6 @@ static inline void zero_user_segment(struct page *page, zero_user_segments(page, start, end, 0, 0); } -static inline void zero_user(struct page *page, - unsigned start, unsigned size) -{ - zero_user_segments(page, start, start + size, 0, 0); -} - #ifndef __HAVE_ARCH_COPY_USER_HIGHPAGE static inline void copy_user_highpage(struct page *to, struct page *from, From 7962e05a835f5a40fed4f13f415d6a48b00a93c3 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Fri, 13 Jun 2025 09:49:19 +0800 Subject: [PATCH 069/370] selftests: khugepaged: fix the shmem collapse failure When running the khugepaged selftest for shmem (./khugepaged all:shmem), I encountered the following test failures: : Run test: collapse_full (khugepaged:shmem) : Collapse multiple fully populated PTE table.... Fail : ... : Run test: collapse_single_pte_entry (khugepaged:shmem) : Collapse PTE table with single PTE entry present.... Fail : ... : Run test: collapse_full_of_compound (khugepaged:shmem) : Allocate huge page... OK : Split huge page leaving single PTE page table full of compound pages... OK : Collapse PTE table full of compound pages.... Fail The reason for the failure is that it will set MADV_NOHUGEPAGE to prevent khugepaged from continuing to scan shmem VMA after khugepaged finishes scanning in the wait_for_scan() function. Moreover, shmem requires a refault to establish PMD mappings. However, after commit 2b0f922323cc ("mm: don't install PMD mappings when THPs are disabled by the hw/process/vma"), PMD mappings are prevented if the VMA is set with MADV_NOHUGEPAGE flag, so shmem cannot establish PMD mappings during refault. One way to fix this issue is to move the MADV_NOHUGEPAGE setting after the shmem refault. After shmem refault and check huge, the test case will unmap the shmem immediately. So it seems unnecessary to set the MADV_NOHUGEPAGE. Then we can simply drop the MADV_NOHUGEPAGE setting, and all khugepaged test cases passed. 
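Schematically, the interaction being removed (userspace sketch, for
illustration only):

    #include <sys/mman.h>

    /* Since commit 2b0f922323cc, a VMA carrying MADV_NOHUGEPAGE never
     * receives PMD mappings, so after this call the shmem refault can
     * only re-establish PTE mappings and the collapse check fails: */
    madvise(p, nr_hpages * hpage_pmd_size, MADV_NOHUGEPAGE);
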
Link: https://lkml.kernel.org/r/d8502fc50d0304c2afd27ced062b1d636b7a872e.1749779183.git.baolin.wang@linux.alibaba.com Fixes: 2b0f922323cc ("mm: don't install PMD mappings when THPs are disabled by the hw/process/vma") Signed-off-by: Baolin Wang Reviewed-by: Zi Yan Acked-by: David Hildenbrand Reviewed-by: Dev Jain Tested-by: Dev Jain Tested-by: Mario Casquero Cc: Barry Song Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Mariano Pache Cc: Ryan Roberts Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/khugepaged.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c index 8a4d34cce36b..4341ce6b3b38 100644 --- a/tools/testing/selftests/mm/khugepaged.c +++ b/tools/testing/selftests/mm/khugepaged.c @@ -561,8 +561,6 @@ static bool wait_for_scan(const char *msg, char *p, int nr_hpages, usleep(TICK); } - madvise(p, nr_hpages * hpage_pmd_size, MADV_NOHUGEPAGE); - return timeout == -1; } From f081a460bbac54c5ec460aac42079974ee413bd6 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Fri, 13 Jun 2025 09:49:20 +0800 Subject: [PATCH 070/370] selftests: mm: add shmem collapse as a default test item Currently, we only test anonymous memory collapse by default. We should also add shmem collapse as a default test item to catch issues that could break the test cases. Link: https://lkml.kernel.org/r/a30b1529b399f2e649b5a05c3d352f41a68faeae.1749779183.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Reviewed-by: Dev Jain Tested-by: Dev Jain Acked-by: David Hildenbrand Acked-by: Zi Yan Cc: Barry Song Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Mariano Pache Cc: Ryan Roberts Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/run_vmtests.sh | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh index 33fc7fafa8f9..a38c984103ce 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -485,6 +485,10 @@ CATEGORY="thp" run_test ./khugepaged CATEGORY="thp" run_test ./khugepaged -s 2 +CATEGORY="thp" run_test ./khugepaged all:shmem + +CATEGORY="thp" run_test ./khugepaged -s 4 all:shmem + CATEGORY="thp" run_test ./transhuge-stress -d 20 # Try to create XFS if not provided From 09fefdca80aebd1023e827cb0ee174983d829d18 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 13 Jun 2025 11:27:00 +0200 Subject: [PATCH 071/370] mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud() Patch series "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes", v3. While working on improving vm_normal_page() and friends, I stumbled over this issues: refcounted "normal" folios must not be marked using pmd_special() / pud_special(). Otherwise, we're effectively telling the system that these folios are no "normal", violating the rules we documented for vm_normal_page(). Fortunately, there are not many pmd_special()/pud_special() users yet. So far there doesn't seem to be serious damage. Tested using the ndctl tests ("ndctl:dax" suite). This patch (of 3): We set up the cache mode but ... don't forward the updated pgprot to insert_pfn_pud(). Only a problem on x86-64 PAT when mapping PFNs using PUDs that require a special cachemode. Fix it by using the proper pgprot where the cachemode was setup. It is unclear in which configurations we would get the cachemode wrong: through vfio seems possible. Getting cachemodes wrong is usually ... bad. 
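Schematically, the call site (simplified from the diff below):

    pgprot_t pgprot = vma->vm_page_prot;

    /* May rewrite the cachemode bits in pgprot (x86-64 PAT): */
    pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);

    /* The buggy version re-read vma->vm_page_prot inside
     * insert_pfn_pud(), silently discarding the update; the fix
     * forwards the updated pgprot: */
    insert_pfn_pud(vma, addr, vmf->pud, pfn, pgprot, write);
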
As the fix is easy, let's backport it to stable. Identified by code inspection. Link: https://lkml.kernel.org/r/20250613092702.1943533-1-david@redhat.com Link: https://lkml.kernel.org/r/20250613092702.1943533-2-david@redhat.com Fixes: 7b806d229ef1 ("mm: remove vmf_insert_pfn_xxx_prot() for huge page-table entries") Signed-off-by: David Hildenbrand Reviewed-by: Dan Williams Reviewed-by: Lorenzo Stoakes Reviewed-by: Jason Gunthorpe Reviewed-by: Oscar Salvador Tested-by: Dan Williams Cc: Alistair Popple Cc: Baolin Wang Cc: Dev Jain Cc: Liam Howlett Cc: Mariano Pache Cc: Michal Hocko Cc: Mike Rapoport Cc: Ryan Roberts Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Zi Yan Cc: Signed-off-by: Andrew Morton --- mm/huge_memory.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d3e66136e41a..49b98082c540 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1516,10 +1516,9 @@ static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma) } static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr, - pud_t *pud, pfn_t pfn, bool write) + pud_t *pud, pfn_t pfn, pgprot_t prot, bool write) { struct mm_struct *mm = vma->vm_mm; - pgprot_t prot = vma->vm_page_prot; pud_t entry; if (!pud_none(*pud)) { @@ -1581,7 +1580,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write) pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot); ptl = pud_lock(vma->vm_mm, vmf->pud); - insert_pfn_pud(vma, addr, vmf->pud, pfn, write); + insert_pfn_pud(vma, addr, vmf->pud, pfn, pgprot, write); spin_unlock(ptl); return VM_FAULT_NOPAGE; @@ -1625,7 +1624,7 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR); } insert_pfn_pud(vma, addr, vmf->pud, pfn_to_pfn_t(folio_pfn(folio)), - write); + vma->vm_page_prot, write); spin_unlock(ptl); return VM_FAULT_NOPAGE; From c4297465d4ca27f1f7c9e6c698b13c77d9ed50b9 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 13 Jun 2025 11:27:01 +0200 Subject: [PATCH 072/370] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pmd() Marking PMDs that map a "normal" refcounted folios as special is against our rules documented for vm_normal_page(): normal (refcounted) folios shall never have the page table mapping marked as special. Fortunately, there are not that many pmd_special() check that can be mislead, and most vm_normal_page_pmd()/vm_normal_folio_pmd() users that would get this wrong right now are rather harmless: e.g., none so far bases decisions whether to grab a folio reference on that decision. Well, and GUP-fast will fallback to GUP-slow. All in all, so far no big implications as it seems. Getting this right will get more important as we use folio_normal_page_pmd() in more places. Fix it by teaching insert_pfn_pmd() to properly handle folios and pfns -- moving refcount/mapcount/etc handling in there, renaming it to insert_pmd(), and distinguishing between both cases using a new simple "struct folio_or_pfn" structure. Use folio_mk_pmd() to create a pmd for a folio cleanly. 
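For clarity, how the two classes of caller populate the new structure
(condensed from the diff below):

    struct folio_or_pfn {
            union {
                    struct folio *folio;
                    pfn_t pfn;
            };
            bool is_folio;
    };

    /* pfn-based caller, e.g. vmf_insert_pfn_pmd(): */
    struct folio_or_pfn fop_pfn = { .pfn = pfn };

    /* folio-based caller, e.g. vmf_insert_folio_pmd(): */
    struct folio_or_pfn fop_folio = { .folio = folio, .is_folio = true };
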
Link: https://lkml.kernel.org/r/20250613092702.1943533-3-david@redhat.com Fixes: 6c88f72691f8 ("mm/huge_memory: add vmf_insert_folio_pmd()") Signed-off-by: David Hildenbrand Reviewed-by: Jason Gunthorpe Reviewed-by: Lorenzo Stoakes Reviewed-by: Dan Williams Tested-by: Dan Williams Reviewed-by: Oscar Salvador Cc: Alistair Popple Cc: Baolin Wang Cc: Dev Jain Cc: Liam Howlett Cc: Mariano Pache Cc: Michal Hocko Cc: Mike Rapoport Cc: Ryan Roberts Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/huge_memory.c | 59 ++++++++++++++++++++++++++++++++---------------- 1 file changed, 40 insertions(+), 19 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 49b98082c540..d1e3e253c714 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1372,9 +1372,17 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) return __do_huge_pmd_anonymous_page(vmf); } -static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, - pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write, - pgtable_t pgtable) +struct folio_or_pfn { + union { + struct folio *folio; + pfn_t pfn; + }; + bool is_folio; +}; + +static int insert_pmd(struct vm_area_struct *vma, unsigned long addr, + pmd_t *pmd, struct folio_or_pfn fop, pgprot_t prot, + bool write, pgtable_t pgtable) { struct mm_struct *mm = vma->vm_mm; pmd_t entry; @@ -1382,8 +1390,11 @@ static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, lockdep_assert_held(pmd_lockptr(mm, pmd)); if (!pmd_none(*pmd)) { + const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) : + pfn_t_to_pfn(fop.pfn); + if (write) { - if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) { + if (pmd_pfn(*pmd) != pfn) { WARN_ON_ONCE(!is_huge_zero_pmd(*pmd)); return -EEXIST; } @@ -1396,11 +1407,20 @@ static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, return -EEXIST; } - entry = pmd_mkhuge(pfn_t_pmd(pfn, prot)); - if (pfn_t_devmap(pfn)) - entry = pmd_mkdevmap(entry); - else - entry = pmd_mkspecial(entry); + if (fop.is_folio) { + entry = folio_mk_pmd(fop.folio, vma->vm_page_prot); + + folio_get(fop.folio); + folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma); + add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR); + } else { + entry = pmd_mkhuge(pfn_t_pmd(fop.pfn, prot)); + + if (pfn_t_devmap(fop.pfn)) + entry = pmd_mkdevmap(entry); + else + entry = pmd_mkspecial(entry); + } if (write) { entry = pmd_mkyoung(pmd_mkdirty(entry)); entry = maybe_pmd_mkwrite(entry, vma); @@ -1431,6 +1451,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write) unsigned long addr = vmf->address & PMD_MASK; struct vm_area_struct *vma = vmf->vma; pgprot_t pgprot = vma->vm_page_prot; + struct folio_or_pfn fop = { + .pfn = pfn, + }; pgtable_t pgtable = NULL; spinlock_t *ptl; int error; @@ -1458,8 +1481,8 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write) pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot); ptl = pmd_lock(vma->vm_mm, vmf->pmd); - error = insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write, - pgtable); + error = insert_pmd(vma, addr, vmf->pmd, fop, pgprot, write, + pgtable); spin_unlock(ptl); if (error && pgtable) pte_free(vma->vm_mm, pgtable); @@ -1474,6 +1497,10 @@ vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio, struct vm_area_struct *vma = vmf->vma; unsigned long addr = vmf->address & PMD_MASK; struct mm_struct *mm = vma->vm_mm; + struct folio_or_pfn fop = { + .folio = folio, + .is_folio = true, + }; spinlock_t *ptl; 
pgtable_t pgtable = NULL; int error; @@ -1491,14 +1518,8 @@ vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio, } ptl = pmd_lock(mm, vmf->pmd); - if (pmd_none(*vmf->pmd)) { - folio_get(folio); - folio_add_file_rmap_pmd(folio, &folio->page, vma); - add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR); - } - error = insert_pfn_pmd(vma, addr, vmf->pmd, - pfn_to_pfn_t(folio_pfn(folio)), vma->vm_page_prot, - write, pgtable); + error = insert_pmd(vma, addr, vmf->pmd, fop, vma->vm_page_prot, + write, pgtable); spin_unlock(ptl); if (error && pgtable) pte_free(mm, pgtable); From 02825c0925fbd53c085e88a1b7603eee8c9c6751 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 13 Jun 2025 11:27:02 +0200 Subject: [PATCH 073/370] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pud() Marking PUDs that map a "normal" refcounted folios as special is against our rules documented for vm_normal_page(). normal (refcounted) folios shall never have the page table mapping marked as special. Fortunately, there are not that many pud_special() check that can be mislead and are right now rather harmless: e.g., none so far bases decisions whether to grab a folio reference on that decision. Well, and GUP-fast will fallback to GUP-slow. All in all, so far no big implications as it seems. Getting this right will get more important as we introduce folio_normal_page_pud() and start using it in more place where we currently special-case based on other VMA flags. Fix it just like we fixed vmf_insert_folio_pmd(). Add folio_mk_pud() to mimic what we do with folio_mk_pmd(). Link: https://lkml.kernel.org/r/20250613092702.1943533-4-david@redhat.com Fixes: dbe54153296d ("mm/huge_memory: add vmf_insert_folio_pud()") Signed-off-by: David Hildenbrand Reviewed-by: Dan Williams Reviewed-by: Lorenzo Stoakes Reviewed-by: Jason Gunthorpe Tested-by: Dan Williams Reviewed-by: Oscar Salvador Cc: Alistair Popple Cc: Baolin Wang Cc: Dev Jain Cc: Liam Howlett Cc: Mariano Pache Cc: Michal Hocko Cc: Mike Rapoport Cc: Ryan Roberts Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/mm.h | 19 ++++++++++++++++- mm/huge_memory.c | 52 ++++++++++++++++++++++++++-------------------- 2 files changed, 47 insertions(+), 24 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 0ef2ba0c667a..b7e2abd8ce0d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1816,7 +1816,24 @@ static inline pmd_t folio_mk_pmd(struct folio *folio, pgprot_t pgprot) { return pmd_mkhuge(pfn_pmd(folio_pfn(folio), pgprot)); } -#endif + +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD +/** + * folio_mk_pud - Create a PUD for this folio + * @folio: The folio to create a PUD for + * @pgprot: The page protection bits to use + * + * Create a page table entry for the first page of this folio. + * This is suitable for passing to set_pud_at(). + * + * Return: A page table entry suitable for mapping this folio. 
+ */ +static inline pud_t folio_mk_pud(struct folio *folio, pgprot_t pgprot) +{ + return pud_mkhuge(pfn_pud(folio_pfn(folio), pgprot)); +} +#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #endif /* CONFIG_MMU */ static inline bool folio_has_pincount(const struct folio *folio) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d1e3e253c714..bbc1dab98f2f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1536,15 +1536,18 @@ static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma) return pud; } -static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr, - pud_t *pud, pfn_t pfn, pgprot_t prot, bool write) +static void insert_pud(struct vm_area_struct *vma, unsigned long addr, + pud_t *pud, struct folio_or_pfn fop, pgprot_t prot, bool write) { struct mm_struct *mm = vma->vm_mm; pud_t entry; if (!pud_none(*pud)) { + const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) : + pfn_t_to_pfn(fop.pfn); + if (write) { - if (WARN_ON_ONCE(pud_pfn(*pud) != pfn_t_to_pfn(pfn))) + if (WARN_ON_ONCE(pud_pfn(*pud) != pfn)) return; entry = pud_mkyoung(*pud); entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma); @@ -1554,11 +1557,20 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr, return; } - entry = pud_mkhuge(pfn_t_pud(pfn, prot)); - if (pfn_t_devmap(pfn)) - entry = pud_mkdevmap(entry); - else - entry = pud_mkspecial(entry); + if (fop.is_folio) { + entry = folio_mk_pud(fop.folio, vma->vm_page_prot); + + folio_get(fop.folio); + folio_add_file_rmap_pud(fop.folio, &fop.folio->page, vma); + add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PUD_NR); + } else { + entry = pud_mkhuge(pfn_t_pud(fop.pfn, prot)); + + if (pfn_t_devmap(fop.pfn)) + entry = pud_mkdevmap(entry); + else + entry = pud_mkspecial(entry); + } if (write) { entry = pud_mkyoung(pud_mkdirty(entry)); entry = maybe_pud_mkwrite(entry, vma); @@ -1582,6 +1594,9 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write) unsigned long addr = vmf->address & PUD_MASK; struct vm_area_struct *vma = vmf->vma; pgprot_t pgprot = vma->vm_page_prot; + struct folio_or_pfn fop = { + .pfn = pfn, + }; spinlock_t *ptl; /* @@ -1601,7 +1616,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write) pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot); ptl = pud_lock(vma->vm_mm, vmf->pud); - insert_pfn_pud(vma, addr, vmf->pud, pfn, pgprot, write); + insert_pud(vma, addr, vmf->pud, fop, pgprot, write); spin_unlock(ptl); return VM_FAULT_NOPAGE; @@ -1623,6 +1638,10 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, unsigned long addr = vmf->address & PUD_MASK; pud_t *pud = vmf->pud; struct mm_struct *mm = vma->vm_mm; + struct folio_or_pfn fop = { + .folio = folio, + .is_folio = true, + }; spinlock_t *ptl; if (addr < vma->vm_start || addr >= vma->vm_end) @@ -1632,20 +1651,7 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, return VM_FAULT_SIGBUS; ptl = pud_lock(mm, pud); - - /* - * If there is already an entry present we assume the folio is - * already mapped, hence no need to take another reference. We - * still call insert_pfn_pud() though in case the mapping needs - * upgrading to writeable. 
- */ - if (pud_none(*vmf->pud)) { - folio_get(folio); - folio_add_file_rmap_pud(folio, &folio->page, vma); - add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR); - } - insert_pfn_pud(vma, addr, vmf->pud, pfn_to_pfn_t(folio_pfn(folio)), - vma->vm_page_prot, write); + insert_pud(vma, addr, vmf->pud, fop, vma->vm_page_prot, write); spin_unlock(ptl); return VM_FAULT_NOPAGE; From 32925ee63beb7e1a3fad25e2f54cefa2d32612de Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Fri, 13 Jun 2025 20:47:43 +0100 Subject: [PATCH 074/370] secretmem: remove uses of struct page Use filemap_lock_folio() instead of find_lock_page() to retrieve a folio from the page cache. [lorenzo.stoakes@oracle.com: fix check of filemap_lock_folio() return value] Link: https://lkml.kernel.org/r/fdbca1d0-01a3-4653-85ed-cf257bb848be@lucifer.local Link: https://lkml.kernel.org/r/20250613194744.3175157-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Lorenzo Stoakes Reviewed-by: Mike Rapoport (Microsoft) Acked-by: David Hildenbrand Signed-off-by: Andrew Morton --- mm/secretmem.c | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/mm/secretmem.c b/mm/secretmem.c index 9a11a38a6770..62ae71907fe4 100644 --- a/mm/secretmem.c +++ b/mm/secretmem.c @@ -54,7 +54,6 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf) pgoff_t offset = vmf->pgoff; gfp_t gfp = vmf->gfp_mask; unsigned long addr; - struct page *page; struct folio *folio; vm_fault_t ret; int err; @@ -65,16 +64,15 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf) filemap_invalidate_lock_shared(mapping); retry: - page = find_lock_page(mapping, offset); - if (!page) { + folio = filemap_lock_folio(mapping, offset); + if (IS_ERR(folio)) { folio = folio_alloc(gfp | __GFP_ZERO, 0); if (!folio) { ret = VM_FAULT_OOM; goto out; } - page = &folio->page; - err = set_direct_map_invalid_noflush(page); + err = set_direct_map_invalid_noflush(folio_page(folio, 0)); if (err) { folio_put(folio); ret = vmf_error(err); @@ -90,7 +88,7 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf) * already happened when we marked the page invalid * which guarantees that this call won't fail */ - set_direct_map_default_noflush(page); + set_direct_map_default_noflush(folio_page(folio, 0)); if (err == -EEXIST) goto retry; @@ -98,11 +96,11 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf) goto out; } - addr = (unsigned long)page_address(page); + addr = (unsigned long)folio_address(folio); flush_tlb_kernel_range(addr, addr + PAGE_SIZE); } - vmf->page = page; + vmf->page = folio_file_page(folio, vmf->pgoff); ret = VM_FAULT_LOCKED; out: @@ -154,7 +152,7 @@ static int secretmem_migrate_folio(struct address_space *mapping, static void secretmem_free_folio(struct folio *folio) { - set_direct_map_default_noflush(&folio->page); + set_direct_map_default_noflush(folio_page(folio, 0)); folio_zero_segment(folio, 0, folio_size(folio)); } From 9e82db9c0cdafa068969373b4bad140f72988e32 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Fri, 13 Jun 2025 20:48:23 +0100 Subject: [PATCH 075/370] highmem: remove a use of folio->page Call folio_address() instead of page_address(). 
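An illustrative note: folio_address() is the folio-native accessor, so
callers no longer reach into folio->page themselves:

    #include <linux/mm.h>

    /* Equivalent today, but only the second form keeps working once
     * struct folio no longer embeds a struct page directly: */
    void *a = page_address(&folio->page) + offset;
    void *b = folio_address(folio) + offset;
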
Link: https://lkml.kernel.org/r/20250613194825.3175276-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: David Hildenbrand Signed-off-by: Andrew Morton --- include/linux/highmem-internal.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h index 9a7683d79a4b..36053c3d6d64 100644 --- a/include/linux/highmem-internal.h +++ b/include/linux/highmem-internal.h @@ -195,7 +195,7 @@ static inline void *kmap_local_page_try_from_panic(struct page *page) static inline void *kmap_local_folio(struct folio *folio, size_t offset) { - return page_address(&folio->page) + offset; + return folio_address(folio) + offset; } static inline void *kmap_local_page_prot(struct page *page, pgprot_t prot) From 4535cb331cfbdbc73d42887330e46e87a4589e6d Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 13 Jun 2025 19:48:07 +0100 Subject: [PATCH 076/370] mm/vma: use vmg->target to specify target VMA for new VMA merge In commit 3a75ccba047b ("mm: simplify vma merge structure and expand comments") we introduced the vmg->target field to make the merging of existing VMAs simpler - clarifying precisely which VMA would eventually become the merged VMA once the merge operation was complete. New VMA merging did not get quite the same treatment, retaining the rather confusing convention of storing the target VMA in vmg->middle. This patch corrects this state of affairs, utilising vmg->target for this purpose for both vma_merge_new_range() and also for vma_expand(). We retain the WARN_ON for vmg->middle being specified in vma_merge_new_range() as doing so would make no sense, but add an additional debug assert for setting vmg->target. This patch additionally updates VMA userland testing to account for this change. [lorenzo.stoakes@oracle.com: make comment consistent in vma_expand()] Link: https://lkml.kernel.org/r/c54f45e3-a6ac-4749-93c0-cc9e3080ee37@lucifer.local Link: https://lkml.kernel.org/r/20250613184807.108089-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Cc: Jann Horn Cc: Kees Cook Cc: Liam Howlett Signed-off-by: Andrew Morton --- mm/vma.c | 36 +++++++++++++++++++----------------- mm/vma_exec.c | 2 +- tools/testing/vma/vma.c | 6 +++--- 3 files changed, 23 insertions(+), 21 deletions(-) diff --git a/mm/vma.c b/mm/vma.c index 079540ebfb72..4b6d0be9ba39 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -1048,6 +1048,7 @@ struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg) mmap_assert_write_locked(vmg->mm); VM_WARN_ON_VMG(vmg->middle, vmg); + VM_WARN_ON_VMG(vmg->target, vmg); /* vmi must point at or before the gap. */ VM_WARN_ON_VMG(vma_iter_addr(vmg->vmi) > end, vmg); @@ -1063,13 +1064,13 @@ struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg) /* If we can merge with the next VMA, adjust vmg accordingly. */ if (can_merge_right) { vmg->end = next->vm_end; - vmg->middle = next; + vmg->target = next; } /* If we can merge with the previous VMA, adjust vmg accordingly. */ if (can_merge_left) { vmg->start = prev->vm_start; - vmg->middle = prev; + vmg->target = prev; vmg->pgoff = prev->vm_pgoff; /* @@ -1091,10 +1092,10 @@ struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg) * Now try to expand adjacent VMA(s). This takes care of removing the * following VMA if we have VMAs on both sides. 
*/ - if (vmg->middle && !vma_expand(vmg)) { - khugepaged_enter_vma(vmg->middle, vmg->flags); + if (vmg->target && !vma_expand(vmg)) { + khugepaged_enter_vma(vmg->target, vmg->flags); vmg->state = VMA_MERGE_SUCCESS; - return vmg->middle; + return vmg->target; } return NULL; @@ -1106,27 +1107,29 @@ struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg) * @vmg: Describes a VMA expansion operation. * * Expand @vma to vmg->start and vmg->end. Can expand off the start and end. - * Will expand over vmg->next if it's different from vmg->middle and vmg->end == - * vmg->next->vm_end. Checking if the vmg->middle can expand and merge with + * Will expand over vmg->next if it's different from vmg->target and vmg->end == + * vmg->next->vm_end. Checking if the vmg->target can expand and merge with * vmg->next needs to be handled by the caller. * * Returns: 0 on success. * * ASSUMPTIONS: - * - The caller must hold a WRITE lock on vmg->middle->mm->mmap_lock. - * - The caller must have set @vmg->middle and @vmg->next. + * - The caller must hold a WRITE lock on the mm_struct->mmap_lock. + * - The caller must have set @vmg->target and @vmg->next. */ int vma_expand(struct vma_merge_struct *vmg) { struct vm_area_struct *anon_dup = NULL; bool remove_next = false; - struct vm_area_struct *middle = vmg->middle; + struct vm_area_struct *target = vmg->target; struct vm_area_struct *next = vmg->next; + VM_WARN_ON_VMG(!target, vmg); + mmap_assert_write_locked(vmg->mm); - vma_start_write(middle); - if (next && (middle != next) && (vmg->end == next->vm_end)) { + vma_start_write(target); + if (next && (target != next) && (vmg->end == next->vm_end)) { int ret; remove_next = true; @@ -1137,19 +1140,18 @@ int vma_expand(struct vma_merge_struct *vmg) * In this case we don't report OOM, so vmg->give_up_on_mm is * safe. */ - ret = dup_anon_vma(middle, next, &anon_dup); + ret = dup_anon_vma(target, next, &anon_dup); if (ret) return ret; } /* Not merging but overwriting any part of next is not handled. 
*/ VM_WARN_ON_VMG(next && !remove_next && - next != middle && vmg->end > next->vm_start, vmg); + next != target && vmg->end > next->vm_start, vmg); /* Only handles expanding */ - VM_WARN_ON_VMG(middle->vm_start < vmg->start || - middle->vm_end > vmg->end, vmg); + VM_WARN_ON_VMG(target->vm_start < vmg->start || + target->vm_end > vmg->end, vmg); - vmg->target = middle; if (remove_next) vmg->__remove_next = true; diff --git a/mm/vma_exec.c b/mm/vma_exec.c index 2dffb02ed6a2..922ee51747a6 100644 --- a/mm/vma_exec.c +++ b/mm/vma_exec.c @@ -54,7 +54,7 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift) /* * cover the whole range: [new_start, old_end) */ - vmg.middle = vma; + vmg.target = vma; if (vma_expand(&vmg)) return -ENOMEM; diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 2be7597a2ac2..7fec5b3de83f 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -400,7 +400,7 @@ static bool test_simple_expand(void) VMA_ITERATOR(vmi, &mm, 0); struct vma_merge_struct vmg = { .vmi = &vmi, - .middle = vma, + .target = vma, .start = 0, .end = 0x3000, .pgoff = 0, @@ -1318,7 +1318,7 @@ static bool test_dup_anon_vma(void) vma_next->anon_vma = &dummy_anon_vma; vmg_set_range(&vmg, 0, 0x5000, 0, flags); - vmg.middle = vma_prev; + vmg.target = vma_prev; vmg.next = vma_next; ASSERT_EQ(expand_existing(&vmg), 0); @@ -1501,7 +1501,7 @@ static bool test_vmi_prealloc_fail(void) vma->anon_vma = &dummy_anon_vma; vmg_set_range(&vmg, 0, 0x5000, 3, flags); - vmg.middle = vma_prev; + vmg.target = vma_prev; vmg.next = vma; fail_prealloc = true; From 3e49aa8e6510be458c1120d6d2a9deac9bc40253 Mon Sep 17 00:00:00 2001 From: Mark Brown Date: Fri, 13 Jun 2025 12:44:07 +0100 Subject: [PATCH 077/370] selftest/mm: skip if fallocate() is unsupported in gup_longterm Currently gup_longterm assumes that filesystems support fallocate() and uses that to allocate space in files, however this is an optional feature and is in particular not implemented by NFSv3 which is commonly used in CI systems leading to spurious failures. Check for lack of support and report a skip instead for that case. Link: https://lkml.kernel.org/r/20250613-selftest-mm-gup-longterm-fallocate-nfs-v1-1-758a104c175f@kernel.org Signed-off-by: Mark Brown Cc: David Hildenbrand Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/gup_longterm.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/mm/gup_longterm.c b/tools/testing/selftests/mm/gup_longterm.c index 29047d2e0c49..268dadb8ce43 100644 --- a/tools/testing/selftests/mm/gup_longterm.c +++ b/tools/testing/selftests/mm/gup_longterm.c @@ -114,7 +114,15 @@ static void do_test(int fd, size_t size, enum test_type type, bool shared) } if (fallocate(fd, 0, 0, size)) { - if (size == pagesize) { + /* + * Some filesystems (eg, NFSv3) don't support + * fallocate(), report this as a skip rather than a + * test failure. + */ + if (errno == EOPNOTSUPP) { + ksft_print_msg("fallocate() not supported by filesystem\n"); + result = KSFT_SKIP; + } else if (size == pagesize) { ksft_print_msg("fallocate() failed (%s)\n", strerror(errno)); result = KSFT_FAIL; } else { From fa493f50df3adaddf9707d14b2f972ddd388f452 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Fri, 13 Jun 2025 17:12:19 +0800 Subject: [PATCH 078/370] mm: huge_memory: fix the check for allowed huge orders in shmem Shmem already supports mTHP, and shmem_allowable_huge_orders() will return the huge orders allowed by shmem. 
However, there is no check against the 'orders' parameter passed by __thp_vma_allowable_orders(), which can lead to incorrect check results for __thp_vma_allowable_orders(). For example, when a user wants to check if shmem supports PMD-sized THP by thp_vma_allowable_order(), if shmem only enables 64K mTHP, the current logic would cause thp_vma_allowable_order() to return true, implying that shmem allows PMD-sized THP allocation, which it actually does not. I don't think this will cause a significant impact on users, and this will only have some impact on the shmem THP collapse. That is to say, even though the shmem sysfs setting does not enable the PMD-sized THP, the thp_vma_allowable_order() still indicates that shmem allows PMD-sized collapse, meaning it might successfully collapse into THP, or it might not (for example, thp_vma_suitable_order() check failed in the collapse process). However, this still does not align with the shmem sysfs configuration, fix it. Link: https://lkml.kernel.org/r/529affb3220153d0d5a542960b535cdfc33f51d7.1749804835.git.baolin.wang@linux.alibaba.com Fixes: 26c7d8413aaf ("mm: thp: support "THPeligible" semantics for mTHP with anonymous shmem") Signed-off-by: Baolin Wang Reviewed-by: Lorenzo Stoakes Acked-by: Zi Yan Acked-by: David Hildenbrand Cc: Barry Song Cc: Dev Jain Cc: Liam Howlett Cc: Mariano Pache Cc: Ryan Roberts Signed-off-by: Andrew Morton --- mm/huge_memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index bbc1dab98f2f..1b31985cef11 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -166,7 +166,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, * own flags. */ if (!in_pf && shmem_file(vma->vm_file)) - return shmem_allowable_huge_orders(file_inode(vma->vm_file), + return orders & shmem_allowable_huge_orders(file_inode(vma->vm_file), vma, vma->vm_pgoff, 0, !enforce_sysfs); From f687fd5af8bf0358259278319f5bc18fc8c547cb Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Mon, 16 Jun 2025 14:45:19 -0400 Subject: [PATCH 079/370] testing/radix-tree/maple: increase readers and reduce delay for faster machines Faster machines may not see the initial or updated value in the race condition. Reduce the delay so that faster machines are less likely to fail testing of the race conditions. Link: https://lkml.kernel.org/r/20250616184521.3382795-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Hailong Liu Cc: Matthew Wilcox (Oracle) Cc: Peng Zhang Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- tools/testing/radix-tree/maple.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c index 2c0b38301253..f6f923c9dc10 100644 --- a/tools/testing/radix-tree/maple.c +++ b/tools/testing/radix-tree/maple.c @@ -35062,7 +35062,7 @@ void run_check_rcu_slowread(struct maple_tree *mt, struct rcu_test_struct *vals) int i; void *(*function)(void *); - pthread_t readers[20]; + pthread_t readers[30]; unsigned int index = vals->index; mt_set_in_rcu(mt); @@ -35080,14 +35080,14 @@ void run_check_rcu_slowread(struct maple_tree *mt, struct rcu_test_struct *vals) } } - usleep(5); /* small yield to ensure all threads are at least started. */ + usleep(3); /* small yield to ensure all threads are at least started. */ while (index <= vals->last) { mtree_store(mt, index, (index % 2 ? 
vals->entry2 : vals->entry3), GFP_KERNEL); index++; - usleep(5); + usleep(2); } while (i--) @@ -35098,6 +35098,7 @@ void run_check_rcu_slowread(struct maple_tree *mt, struct rcu_test_struct *vals) MT_BUG_ON(mt, !vals->seen_entry3); MT_BUG_ON(mt, !vals->seen_both); } + static noinline void __init check_rcu_simulated(struct maple_tree *mt) { unsigned long i, nr_entries = 1000; From ec3681e873132831ba2a9c0de43f2839c9533e94 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Mon, 16 Jun 2025 14:45:21 -0400 Subject: [PATCH 080/370] tools/testing/radix-tree: test maple tree chaining mas_preallocate() calls Testing calling multiple mas_preallocate() calls in a row after adjusting the maple state. Ensures new calls to mas_preallocate() will change the number of allocated nodes. Link: https://lkml.kernel.org/r/20250616184521.3382795-4-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Acked-by: Lorenzo Stoakes Cc: Hailong Liu Cc: "Liam R. Howlett" Cc: Matthew Wilcox (Oracle) Cc: Peng Zhang Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- tools/testing/radix-tree/maple.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c index f6f923c9dc10..172700fb7784 100644 --- a/tools/testing/radix-tree/maple.c +++ b/tools/testing/radix-tree/maple.c @@ -35669,6 +35669,18 @@ static noinline void __init check_prealloc(struct maple_tree *mt) allocated = mas_allocated(&mas); height = mas_mt_height(&mas); MT_BUG_ON(mt, allocated != 0); + + /* Chaining multiple preallocations */ + mt_set_in_rcu(mt); + mas_set_range(&mas, 800, 805); /* Slot store, should be 0 allocations */ + MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); + allocated = mas_allocated(&mas); + MT_BUG_ON(mt, allocated != 0); + mas.last = 809; /* Node store */ + MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); + allocated = mas_allocated(&mas); + MT_BUG_ON(mt, allocated != 1); + mas_store_prealloc(&mas, ptr); } /* End of preallocation testing */ From b435415eed536d7a35516a82f5995d87035cc830 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 16 Jun 2025 10:23:44 -0700 Subject: [PATCH 081/370] mm/damon/paddr: use alloc_migartion_target() with no migration fallback nodemask Patch series "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD}". DAMOS_MIGRATE_{HOT,COLD} implementation resembles that for demotion, and hence the behavior is also similar to that. But, since those are not only for demotion but general migrations, it would be better to match with that for move_pages() system call. Make the implementation and the behavior more similar to move_pages() by not setting migration fallback nodes, and using alloc_migration_target() instead of alloc_migrate_folio(). alloc_migrate_folio() was renamed from alloc_demote_folio() and been non-static function, to let DAMOS_MIGRATE_{HOT,COLD} call it. As alloc_migration_target() is called instead, the renaming and de-static changes are no more required but could only make future code readers be confused. Revert the changes, too. This patch (of 3): DAMOS_MIGRATE_{HOT,COLD} implementation resembles that for demote_folio_list(). Because those are not only for demotion but general folio migrations, it makes more sense to behave similarly to move_pages() system call. Make the behavior more similar to move_pages(), by using alloc_migration_target() instead of alloc_migrate_folio(), without fallback nodemask. 
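The semantic difference, in miniature (condensed from the diff that
follows):

    struct migration_target_control mtc = {
            .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
                        __GFP_NOWARN | __GFP_NOMEMALLOC | GFP_NOWAIT,
            .nid = target_nid,
            /* .nmask deliberately left NULL: if allocation on
             * target_nid fails, the migration fails, matching
             * move_pages(), instead of falling back to other allowed
             * nodes as the demotion path does. */
    };

    migrate_pages(migrate_folios, alloc_migration_target, NULL,
                  (unsigned long)&mtc, MIGRATE_ASYNC, MR_DAMON,
                  &nr_succeeded);
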
Link: https://lkml.kernel.org/r/20250616172346.67659-2-sj@kernel.org Signed-off-by: SeongJae Park Reviewed-by: Joshua Hahn Cc: David Hildenbrand Cc: Honggyu Kim Cc: Johannes Weiner Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Qi Zheng Cc: Shakeel Butt Signed-off-by: Andrew Morton --- mm/damon/paddr.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c index 4102a8c5f992..fcab148e6865 100644 --- a/mm/damon/paddr.c +++ b/mm/damon/paddr.c @@ -386,7 +386,6 @@ static unsigned int __damon_pa_migrate_folio_list( int target_nid) { unsigned int nr_succeeded = 0; - nodemask_t allowed_mask = NODE_MASK_NONE; struct migration_target_control mtc = { /* * Allocate from 'node', or fail quickly and quietly. @@ -396,7 +395,6 @@ static unsigned int __damon_pa_migrate_folio_list( .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | __GFP_NOMEMALLOC | GFP_NOWAIT, .nid = target_nid, - .nmask = &allowed_mask }; if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE) @@ -406,7 +404,7 @@ static unsigned int __damon_pa_migrate_folio_list( return 0; /* Migration ignores all cpuset and mempolicy settings */ - migrate_pages(migrate_folios, alloc_migrate_folio, NULL, + migrate_pages(migrate_folios, alloc_migration_target, NULL, (unsigned long)&mtc, MIGRATE_ASYNC, MR_DAMON, &nr_succeeded); From 29ea04095b9697951621dd5a7843108948d056b8 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 16 Jun 2025 10:23:45 -0700 Subject: [PATCH 082/370] Revert "mm: rename alloc_demote_folio to alloc_migrate_folio" This reverts commit 8f75267d22bdf8e3baf70f2fa7092d8c2f58da71. Commit 8f75267d22bd ("mm: rename alloc_demote_folio to alloc_migrate_folio") was to reflect the fact the function is called for not only demotion, but also general migrations from DAMOS_MIGRATE_{HOT,COLD}. The previous commit made the DAMOS actions to not use alloc_migrate_folio(), though. So, demote_folio_list() is the only caller of alloc_migrate_folio(), and the name could now be rather confusing. Revert the renaming commit. 
Link: https://lkml.kernel.org/r/20250616172346.67659-3-sj@kernel.org Signed-off-by: SeongJae Park Reviewed-by: Joshua Hahn Cc: David Hildenbrand Cc: Honggyu Kim Cc: Johannes Weiner Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Qi Zheng Cc: Shakeel Butt Signed-off-by: Andrew Morton --- mm/internal.h | 2 +- mm/vmscan.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 2c0d9f197d81..ce37834bcdce 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1226,7 +1226,7 @@ extern unsigned long __must_check vm_mmap_pgoff(struct file *, unsigned long, unsigned long, unsigned long); extern void set_pageblock_order(void); -struct folio *alloc_migrate_folio(struct folio *src, unsigned long private); +struct folio *alloc_demote_folio(struct folio *src, unsigned long private); unsigned long reclaim_pages(struct list_head *folio_list); unsigned int reclaim_clean_pages_from_list(struct zone *zone, struct list_head *folio_list); diff --git a/mm/vmscan.c b/mm/vmscan.c index a93a1ba9009e..6bebc91cbf2f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1006,7 +1006,7 @@ static void folio_check_dirty_writeback(struct folio *folio, mapping->a_ops->is_dirty_writeback(folio, dirty, writeback); } -struct folio *alloc_migrate_folio(struct folio *src, unsigned long private) +struct folio *alloc_demote_folio(struct folio *src, unsigned long private) { struct folio *dst; nodemask_t *allowed_mask; @@ -1069,7 +1069,7 @@ static unsigned int demote_folio_list(struct list_head *demote_folios, node_get_allowed_targets(pgdat, &allowed_mask); /* Demotion ignores all cpuset and mempolicy settings */ - migrate_pages(demote_folios, alloc_migrate_folio, NULL, + migrate_pages(demote_folios, alloc_demote_folio, NULL, (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, &nr_succeeded); From e1b1fe45573aa9919c18c13fcf6ec688534f92e3 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 16 Jun 2025 10:23:46 -0700 Subject: [PATCH 083/370] Revert "mm: make alloc_demote_folio externally invokable for migration" This reverts commit a00ce85af2a1be494d3b0c9457e8e81cdcce2a89. Commit a00ce85af2a1 ("mm: make alloc_demote_folio externally invokable for migration") was made to let DAMOS_MIGRATE_{HOT,COLD} call the function. But a previous commit made DAMOS_MIGRATE_{HOT,COLD} call alloc_migration_target() instead. Hence there are no more callers of the function outside of vmscan.c. Revert the commit to make the function static again. 
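After the two reverts, the allocator split is (schematic):

    /* Demotion keeps its dedicated, nodemask-aware allocator: */
    demote_folio_list()             -> alloc_demote_folio()  /* static in vmscan.c */

    /* DAMOS migration uses the generic, move_pages()-style one: */
    __damon_pa_migrate_folio_list() -> alloc_migration_target()
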
Link: https://lkml.kernel.org/r/20250616172346.67659-4-sj@kernel.org Signed-off-by: SeongJae Park Reviewed-by: Joshua Hahn Cc: David Hildenbrand Cc: Honggyu Kim Cc: Johannes Weiner Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Qi Zheng Cc: Shakeel Butt Signed-off-by: Andrew Morton --- mm/internal.h | 1 - mm/vmscan.c | 3 ++- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index ce37834bcdce..3eb51c31a041 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1226,7 +1226,6 @@ extern unsigned long __must_check vm_mmap_pgoff(struct file *, unsigned long, unsigned long, unsigned long); extern void set_pageblock_order(void); -struct folio *alloc_demote_folio(struct folio *src, unsigned long private); unsigned long reclaim_pages(struct list_head *folio_list); unsigned int reclaim_clean_pages_from_list(struct zone *zone, struct list_head *folio_list); diff --git a/mm/vmscan.c b/mm/vmscan.c index 6bebc91cbf2f..620dce753b64 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1006,7 +1006,8 @@ static void folio_check_dirty_writeback(struct folio *folio, mapping->a_ops->is_dirty_writeback(folio, dirty, writeback); } -struct folio *alloc_demote_folio(struct folio *src, unsigned long private) +static struct folio *alloc_demote_folio(struct folio *src, + unsigned long private) { struct folio *dst; nodemask_t *allowed_mask; From 78ddaa358ec4cdd60bd0e243ced1c83a52c30241 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Wed, 18 Jun 2025 20:42:52 +0100 Subject: [PATCH 084/370] mm: change vm_get_page_prot() to accept vm_flags_t argument Patch series "use vm_flags_t consistently". The VMA flags field vma->vm_flags is of type vm_flags_t. Right now this is exactly equivalent to unsigned long, but it should not be assumed to be. Much code that references vma->vm_flags already correctly uses vm_flags_t, but a fairly large chunk of code simply uses unsigned long and assumes that the two are equivalent. This series corrects that and has us use vm_flags_t consistently. This series is motivated by the desire to, in a future series, adjust vm_flags_t to be a u64 regardless of whether the kernel is 32-bit or 64-bit in order to deal with the VMA flag exhaustion issue and avoid all the various problems that arise from it (being unable to use certain features in 32-bit, being unable to add new flags except for 64-bit, etc.) This is therefore a critical first step towards that goal. At any rate, using the correct type is of value regardless. We additionally take the opportunity to refer to VMA flags as vm_flags where possible to make clear what we're referring to. Overall, this series does not introduce any functional change. This patch (of 3): We abstract the type of the VMA flags to vm_flags_t, however in may places it is simply assumed this is unsigned long, which is simply incorrect. At the moment this is simply an incongruity, however in future we plan to change this type and therefore this change is a critical requirement for doing so. Overall, this patch does not introduce any functional change. 
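Illustratively, callers should now spell the type out rather than
assuming it is unsigned long:

    #include <linux/mm.h>

    vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_SHARED;
    pgprot_t prot = vm_get_page_prot(vm_flags);

    /* If vm_flags_t is later widened to u64 on 32-bit kernels, this
     * compiles and behaves unchanged; an `unsigned long` local would
     * silently truncate the high flag bits. */
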
[lorenzo.stoakes@oracle.com: add missing vm_get_page_prot() instance, remove include] Link: https://lkml.kernel.org/r/552f88e1-2df8-4e95-92b8-812f7c8db829@lucifer.local Link: https://lkml.kernel.org/r/cover.1750274467.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/a12769720a2743f235643b158c4f4f0a9911daf0.1750274467.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: Mike Rapoport (Microsoft) Acked-by: Christian Brauner Reviewed-by: Vlastimil Babka Reviewed-by: Oscar Salvador Reviewed-by: Pedro Falcato Acked-by: Catalin Marinas [arm64] Acked-by: Zi Yan Acked-by: David Hildenbrand Reviewed-by: Anshuman Khandual Cc: Liam R. Howlett Cc: Lorenzo Stoakes Cc: Jann Horn Cc: Kees Cook Cc: Jan Kara Cc: Jarkko Sakkinen Signed-off-by: Andrew Morton --- arch/arm64/mm/mmap.c | 2 +- arch/powerpc/include/asm/book3s/64/pkeys.h | 2 +- arch/powerpc/mm/book3s64/pgtable.c | 2 +- arch/sparc/mm/init_64.c | 2 +- arch/x86/mm/pgprot.c | 2 +- include/linux/mm.h | 4 ++-- include/linux/pgtable.h | 2 +- tools/testing/vma/vma_internal.h | 2 +- 8 files changed, 9 insertions(+), 9 deletions(-) diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c index c86c348857c4..08ee177432c2 100644 --- a/arch/arm64/mm/mmap.c +++ b/arch/arm64/mm/mmap.c @@ -81,7 +81,7 @@ static int __init adjust_protection_map(void) } arch_initcall(adjust_protection_map); -pgprot_t vm_get_page_prot(unsigned long vm_flags) +pgprot_t vm_get_page_prot(vm_flags_t vm_flags) { ptdesc_t prot; diff --git a/arch/powerpc/include/asm/book3s/64/pkeys.h b/arch/powerpc/include/asm/book3s/64/pkeys.h index 5b178139f3c0..ff911b4251d9 100644 --- a/arch/powerpc/include/asm/book3s/64/pkeys.h +++ b/arch/powerpc/include/asm/book3s/64/pkeys.h @@ -5,7 +5,7 @@ #include -static inline u64 vmflag_to_pte_pkey_bits(u64 vm_flags) +static inline u64 vmflag_to_pte_pkey_bits(vm_flags_t vm_flags) { if (!mmu_has_feature(MMU_FTR_PKEY)) return 0x0UL; diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c index 0db01e10a3f8..a89ef89101fc 100644 --- a/arch/powerpc/mm/book3s64/pgtable.c +++ b/arch/powerpc/mm/book3s64/pgtable.c @@ -644,7 +644,7 @@ unsigned long memremap_compat_align(void) EXPORT_SYMBOL_GPL(memremap_compat_align); #endif -pgprot_t vm_get_page_prot(unsigned long vm_flags) +pgprot_t vm_get_page_prot(vm_flags_t vm_flags) { unsigned long prot; diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c index 25ae4c897aae..7ed58bf3aaca 100644 --- a/arch/sparc/mm/init_64.c +++ b/arch/sparc/mm/init_64.c @@ -3201,7 +3201,7 @@ void copy_highpage(struct page *to, struct page *from) } EXPORT_SYMBOL(copy_highpage); -pgprot_t vm_get_page_prot(unsigned long vm_flags) +pgprot_t vm_get_page_prot(vm_flags_t vm_flags) { unsigned long prot = pgprot_val(protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]); diff --git a/arch/x86/mm/pgprot.c b/arch/x86/mm/pgprot.c index c84bd9540b16..dc1afd5c839d 100644 --- a/arch/x86/mm/pgprot.c +++ b/arch/x86/mm/pgprot.c @@ -32,7 +32,7 @@ void add_encrypt_protection_map(void) protection_map[i] = pgprot_encrypted(protection_map[i]); } -pgprot_t vm_get_page_prot(unsigned long vm_flags) +pgprot_t vm_get_page_prot(vm_flags_t vm_flags) { unsigned long val = pgprot_val(protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]); diff --git a/include/linux/mm.h b/include/linux/mm.h index b7e2abd8ce0d..78bb177ba55f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3489,10 +3489,10 @@ static inline bool range_in_vma(struct vm_area_struct *vma, } #ifdef CONFIG_MMU -pgprot_t 
vm_get_page_prot(unsigned long vm_flags); +pgprot_t vm_get_page_prot(vm_flags_t vm_flags); void vma_set_page_prot(struct vm_area_struct *vma); #else -static inline pgprot_t vm_get_page_prot(unsigned long vm_flags) +static inline pgprot_t vm_get_page_prot(vm_flags_t vm_flags) { return __pgprot(0); } diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e4a3895c043b..d05e35a0facf 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -2016,7 +2016,7 @@ typedef unsigned int pgtbl_mod_mask; * x: (yes) yes */ #define DECLARE_VM_GET_PAGE_PROT \ -pgprot_t vm_get_page_prot(unsigned long vm_flags) \ +pgprot_t vm_get_page_prot(vm_flags_t vm_flags) \ { \ return protection_map[vm_flags & \ (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED)]; \ diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 0f013784da89..3b1b45256d56 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -576,7 +576,7 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot) return __pgprot(pgprot_val(oldprot) | pgprot_val(newprot)); } -static inline pgprot_t vm_get_page_prot(unsigned long vm_flags) +static inline pgprot_t vm_get_page_prot(vm_flags_t vm_flags) { return __pgprot(vm_flags); } From bfbe71109fa40e8cc05a0f99e6734b7d76ee00b0 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Wed, 18 Jun 2025 20:42:53 +0100 Subject: [PATCH 085/370] mm: update core kernel code to use vm_flags_t consistently The core kernel code is currently very inconsistent in its use of vm_flags_t vs. unsigned long. This prevents us from changing the type of vm_flags_t in the future and is simply not correct, so correct this. While this results in rather a lot of churn, it is a critical pre-requisite for a future planned change to VMA flag type. Additionally, update VMA userland tests to account for the changes. To make review easier and to break things into smaller parts, driver and architecture-specific changes is left for a subsequent commit. The code has been adjusted to cascade the changes across all calling code as far as is needed. We will adjust architecture-specific and driver code in a subsequent patch. Overall, this patch does not introduce any functional change. Link: https://lkml.kernel.org/r/d1588e7bb96d1ea3fe7b9df2c699d5b4592d901d.1750274467.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: Kees Cook Acked-by: Mike Rapoport (Microsoft) Acked-by: Jan Kara Acked-by: Christian Brauner Reviewed-by: Vlastimil Babka Acked-by: Oscar Salvador Reviewed-by: Pedro Falcato Acked-by: Zi Yan Acked-by: David Hildenbrand Reviewed-by: Anshuman Khandual Cc: Jann Horn Cc: Liam R. 
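The recurring pattern being corrected, in miniature (illustrative):

    /* Before: quietly assumes vm_flags_t is unsigned long. */
    unsigned long vm_flags = vma->vm_flags;

    /* After: correct whatever vm_flags_t becomes. */
    vm_flags_t vm_flags = vma->vm_flags;
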
Howlett Cc: Catalin Marinas Cc: Jarkko Sakkinen Signed-off-by: Andrew Morton --- fs/exec.c | 2 +- fs/userfaultfd.c | 2 +- include/linux/coredump.h | 2 +- include/linux/huge_mm.h | 12 +- include/linux/khugepaged.h | 4 +- include/linux/ksm.h | 4 +- include/linux/memfd.h | 4 +- include/linux/mm.h | 6 +- include/linux/mm_types.h | 2 +- include/linux/mman.h | 4 +- include/linux/rmap.h | 4 +- include/linux/userfaultfd_k.h | 4 +- include/trace/events/fs_dax.h | 6 +- mm/debug.c | 2 +- mm/execmem.c | 8 +- mm/filemap.c | 2 +- mm/gup.c | 2 +- mm/huge_memory.c | 2 +- mm/hugetlb.c | 4 +- mm/internal.h | 4 +- mm/khugepaged.c | 4 +- mm/ksm.c | 2 +- mm/madvise.c | 4 +- mm/mapping_dirty_helpers.c | 2 +- mm/memfd.c | 8 +- mm/memory.c | 4 +- mm/mmap.c | 16 +- mm/mprotect.c | 8 +- mm/mremap.c | 2 +- mm/nommu.c | 12 +- mm/rmap.c | 4 +- mm/shmem.c | 6 +- mm/userfaultfd.c | 14 +- mm/vma.c | 80 +++++----- mm/vma.h | 16 +- mm/vmscan.c | 4 +- tools/testing/vma/vma.c | 266 +++++++++++++++---------------- tools/testing/vma/vma_internal.h | 8 +- 38 files changed, 270 insertions(+), 270 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index ba400aafd640..9faf9052bed9 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -604,7 +604,7 @@ int setup_arg_pages(struct linux_binprm *bprm, struct mm_struct *mm = current->mm; struct vm_area_struct *vma = bprm->vma; struct vm_area_struct *prev = NULL; - unsigned long vm_flags; + vm_flags_t vm_flags; unsigned long stack_base; unsigned long stack_size; unsigned long stack_expand; diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index a2928b0aec6f..48e82e19d831 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1242,7 +1242,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, int ret; struct uffdio_register uffdio_register; struct uffdio_register __user *user_uffdio_register; - unsigned long vm_flags; + vm_flags_t vm_flags; bool found; bool basic_ioctls; unsigned long start, end; diff --git a/include/linux/coredump.h b/include/linux/coredump.h index 76e41805b92d..c504b0faecc2 100644 --- a/include/linux/coredump.h +++ b/include/linux/coredump.h @@ -10,7 +10,7 @@ #ifdef CONFIG_COREDUMP struct core_vma_metadata { unsigned long start, end; - unsigned long flags; + vm_flags_t flags; unsigned long dump_size; unsigned long pgoff; struct file *file; diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 2f190c90192d..7753daac49f7 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -261,7 +261,7 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma, } unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, - unsigned long vm_flags, + vm_flags_t vm_flags, unsigned long tva_flags, unsigned long orders); @@ -282,7 +282,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, */ static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma, - unsigned long vm_flags, + vm_flags_t vm_flags, unsigned long tva_flags, unsigned long orders) { @@ -317,7 +317,7 @@ struct thpsize { (1<vmi, vmg->vmi ? vma_iter_addr(vmg->vmi) : 0, vmg->vmi ? 
vma_iter_end(vmg->vmi) : 0, vmg->prev, vmg->middle, vmg->next, vmg->target, - vmg->start, vmg->end, vmg->flags, + vmg->start, vmg->end, vmg->vm_flags, vmg->file, vmg->anon_vma, vmg->policy, #ifdef CONFIG_USERFAULTFD vmg->uffd_ctx.ctx, diff --git a/mm/execmem.c b/mm/execmem.c index 2b683e7d864d..627e6cf64f4f 100644 --- a/mm/execmem.c +++ b/mm/execmem.c @@ -26,7 +26,7 @@ static struct execmem_info default_execmem_info __ro_after_init; #ifdef CONFIG_MMU static void *execmem_vmalloc(struct execmem_range *range, size_t size, - pgprot_t pgprot, unsigned long vm_flags) + pgprot_t pgprot, vm_flags_t vm_flags) { bool kasan = range->flags & EXECMEM_KASAN_SHADOW; gfp_t gfp_flags = GFP_KERNEL | __GFP_NOWARN; @@ -82,7 +82,7 @@ struct vm_struct *execmem_vmap(size_t size) } #else static void *execmem_vmalloc(struct execmem_range *range, size_t size, - pgprot_t pgprot, unsigned long vm_flags) + pgprot_t pgprot, vm_flags_t vm_flags) { return vmalloc(size); } @@ -256,7 +256,7 @@ static void *__execmem_cache_alloc(struct execmem_range *range, size_t size) static int execmem_cache_populate(struct execmem_range *range, size_t size) { - unsigned long vm_flags = VM_ALLOW_HUGE_VMAP; + vm_flags_t vm_flags = VM_ALLOW_HUGE_VMAP; struct vm_struct *vm; size_t alloc_size; int err = -ENOMEM; @@ -373,7 +373,7 @@ void *execmem_alloc(enum execmem_type type, size_t size) { struct execmem_range *range = &execmem_info->ranges[type]; bool use_cache = range->flags & EXECMEM_ROX_CACHE; - unsigned long vm_flags = VM_FLUSH_RESET_PERMS; + vm_flags_t vm_flags = VM_FLUSH_RESET_PERMS; pgprot_t pgprot = range->pgprot; void *p; diff --git a/mm/filemap.c b/mm/filemap.c index 3cf955740148..0d0369fb5fa1 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3216,7 +3216,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf) struct address_space *mapping = file->f_mapping; DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff); struct file *fpin = NULL; - unsigned long vm_flags = vmf->vma->vm_flags; + vm_flags_t vm_flags = vmf->vma->vm_flags; unsigned short mmap_miss; #ifdef CONFIG_TRANSPARENT_HUGEPAGE diff --git a/mm/gup.c b/mm/gup.c index cbe8e4b9845b..c08b97e0d344 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2044,7 +2044,7 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start, { struct vm_area_struct *vma; bool must_unlock = false; - unsigned long vm_flags; + vm_flags_t vm_flags; long i; if (!nr_pages) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 1b31985cef11..6411f3107af1 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -99,7 +99,7 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma) } unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, - unsigned long vm_flags, + vm_flags_t vm_flags, unsigned long tva_flags, unsigned long orders) { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 7a7df0b2a561..c7ba95030241 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -7465,8 +7465,8 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma, unsigned long s_end = sbase + PUD_SIZE; /* Allow segments to share if only one is marked locked */ - unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED_MASK; - unsigned long svm_flags = svma->vm_flags & ~VM_LOCKED_MASK; + vm_flags_t vm_flags = vma->vm_flags & ~VM_LOCKED_MASK; + vm_flags_t svm_flags = svma->vm_flags & ~VM_LOCKED_MASK; /* * match the virtual addresses, permission and the alignment of the diff --git a/mm/internal.h b/mm/internal.h index 3eb51c31a041..fe83dfca3c72 100644 --- a/mm/internal.h +++ 
b/mm/internal.h @@ -928,7 +928,7 @@ extern long populate_vma_page_range(struct vm_area_struct *vma, unsigned long start, unsigned long end, int *locked); extern long faultin_page_range(struct mm_struct *mm, unsigned long start, unsigned long end, bool write, int *locked); -extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags, +extern bool mlock_future_ok(struct mm_struct *mm, vm_flags_t vm_flags, unsigned long bytes); /* @@ -1358,7 +1358,7 @@ int migrate_device_coherent_folio(struct folio *folio); struct vm_struct *__get_vm_area_node(unsigned long size, unsigned long align, unsigned long shift, - unsigned long flags, unsigned long start, + vm_flags_t vm_flags, unsigned long start, unsigned long end, int node, gfp_t gfp_mask, const void *caller); diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 15203ea7d007..6b09b09c8f82 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -347,7 +347,7 @@ struct attribute_group khugepaged_attr_group = { #endif /* CONFIG_SYSFS */ int hugepage_madvise(struct vm_area_struct *vma, - unsigned long *vm_flags, int advice) + vm_flags_t *vm_flags, int advice) { switch (advice) { case MADV_HUGEPAGE: @@ -470,7 +470,7 @@ void __khugepaged_enter(struct mm_struct *mm) } void khugepaged_enter_vma(struct vm_area_struct *vma, - unsigned long vm_flags) + vm_flags_t vm_flags) { if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) && hugepage_pmd_enabled()) { diff --git a/mm/ksm.c b/mm/ksm.c index 18b3690bb69a..ef73b25fd65a 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -2840,7 +2840,7 @@ int ksm_disable(struct mm_struct *mm) } int ksm_madvise(struct vm_area_struct *vma, unsigned long start, - unsigned long end, int advice, unsigned long *vm_flags) + unsigned long end, int advice, vm_flags_t *vm_flags) { struct mm_struct *mm = vma->vm_mm; int err; diff --git a/mm/madvise.c b/mm/madvise.c index d451438af999..92f427b1b330 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -130,7 +130,7 @@ static int replace_anon_vma_name(struct vm_area_struct *vma, */ static int madvise_update_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, - unsigned long end, unsigned long new_flags, + unsigned long end, vm_flags_t new_flags, struct anon_vma_name *anon_name) { struct mm_struct *mm = vma->vm_mm; @@ -1258,7 +1258,7 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, int behavior = arg->behavior; int error; struct anon_vma_name *anon_name; - unsigned long new_flags = vma->vm_flags; + vm_flags_t new_flags = vma->vm_flags; if (unlikely(!can_modify_vma_madv(vma, behavior))) return -EPERM; diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c index 2f8829b3541a..dc1692ff9e58 100644 --- a/mm/mapping_dirty_helpers.c +++ b/mm/mapping_dirty_helpers.c @@ -218,7 +218,7 @@ static void wp_clean_post_vma(struct mm_walk *walk) static int wp_clean_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk) { - unsigned long vm_flags = READ_ONCE(walk->vma->vm_flags); + vm_flags_t vm_flags = READ_ONCE(walk->vma->vm_flags); /* Skip non-applicable VMAs */ if ((vm_flags & (VM_SHARED | VM_MAYWRITE | VM_HUGETLB)) != diff --git a/mm/memfd.c b/mm/memfd.c index 65a107f72e39..b558c4c3bd27 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -332,10 +332,10 @@ static inline bool is_write_sealed(unsigned int seals) return seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE); } -static int check_write_seal(unsigned long *vm_flags_ptr) +static int check_write_seal(vm_flags_t *vm_flags_ptr) { - unsigned long vm_flags = *vm_flags_ptr; - unsigned long mask = 
vm_flags & (VM_SHARED | VM_WRITE); + vm_flags_t vm_flags = *vm_flags_ptr; + vm_flags_t mask = vm_flags & (VM_SHARED | VM_WRITE); /* If a private mapping then writability is irrelevant. */ if (!(mask & VM_SHARED)) @@ -357,7 +357,7 @@ static int check_write_seal(unsigned long *vm_flags_ptr) return 0; } -int memfd_check_seals_mmap(struct file *file, unsigned long *vm_flags_ptr) +int memfd_check_seals_mmap(struct file *file, vm_flags_t *vm_flags_ptr) { int err = 0; unsigned int *seals_ptr = memfd_file_seals_ptr(file); diff --git a/mm/memory.c b/mm/memory.c index b0cda5aab398..833426fa5fe0 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -797,7 +797,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, int *rss) { - unsigned long vm_flags = dst_vma->vm_flags; + vm_flags_t vm_flags = dst_vma->vm_flags; pte_t orig_pte = ptep_get(src_pte); pte_t pte = orig_pte; struct folio *folio; @@ -6128,7 +6128,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, .gfp_mask = __get_fault_gfp_mask(vma), }; struct mm_struct *mm = vma->vm_mm; - unsigned long vm_flags = vma->vm_flags; + vm_flags_t vm_flags = vma->vm_flags; pgd_t *pgd; p4d_t *p4d; vm_fault_t ret; diff --git a/mm/mmap.c b/mm/mmap.c index 09c563c95112..8f92cf10b656 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -80,7 +80,7 @@ core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644); /* Update vma->vm_page_prot to reflect vma->vm_flags. */ void vma_set_page_prot(struct vm_area_struct *vma) { - unsigned long vm_flags = vma->vm_flags; + vm_flags_t vm_flags = vma->vm_flags; pgprot_t vm_page_prot; vm_page_prot = vm_pgprot_modify(vma->vm_page_prot, vm_flags); @@ -228,12 +228,12 @@ static inline unsigned long round_hint_to_min(unsigned long hint) return hint; } -bool mlock_future_ok(struct mm_struct *mm, unsigned long flags, +bool mlock_future_ok(struct mm_struct *mm, vm_flags_t vm_flags, unsigned long bytes) { unsigned long locked_pages, limit_pages; - if (!(flags & VM_LOCKED) || capable(CAP_IPC_LOCK)) + if (!(vm_flags & VM_LOCKED) || capable(CAP_IPC_LOCK)) return true; locked_pages = bytes >> PAGE_SHIFT; @@ -1207,7 +1207,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, return ret; } -int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags) +int vm_brk_flags(unsigned long addr, unsigned long request, vm_flags_t vm_flags) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma = NULL; @@ -1224,7 +1224,7 @@ int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags) return 0; /* Until we need other flags, refuse anything except VM_EXEC. 
*/ - if ((flags & (~VM_EXEC)) != 0) + if ((vm_flags & (~VM_EXEC)) != 0) return -EINVAL; if (mmap_write_lock_killable(mm)) @@ -1239,7 +1239,7 @@ int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags) goto munmap_failed; vma = vma_prev(&vmi); - ret = do_brk_flags(&vmi, vma, addr, len, flags); + ret = do_brk_flags(&vmi, vma, addr, len, vm_flags); populate = ((mm->def_flags & VM_LOCKED) != 0); mmap_write_unlock(mm); userfaultfd_unmap_complete(mm, &uf); @@ -1444,7 +1444,7 @@ static vm_fault_t special_mapping_fault(struct vm_fault *vmf) static struct vm_area_struct *__install_special_mapping( struct mm_struct *mm, unsigned long addr, unsigned long len, - unsigned long vm_flags, void *priv, + vm_flags_t vm_flags, void *priv, const struct vm_operations_struct *ops) { int ret; @@ -1496,7 +1496,7 @@ bool vma_is_special_mapping(const struct vm_area_struct *vma, struct vm_area_struct *_install_special_mapping( struct mm_struct *mm, unsigned long addr, unsigned long len, - unsigned long vm_flags, const struct vm_special_mapping *spec) + vm_flags_t vm_flags, const struct vm_special_mapping *spec) { return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec, &special_mapping_vmops); diff --git a/mm/mprotect.c b/mm/mprotect.c index 88608d0dc2c2..b873b98ab705 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -596,10 +596,10 @@ static const struct mm_walk_ops prot_none_walk_ops = { int mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb, struct vm_area_struct *vma, struct vm_area_struct **pprev, - unsigned long start, unsigned long end, unsigned long newflags) + unsigned long start, unsigned long end, vm_flags_t newflags) { struct mm_struct *mm = vma->vm_mm; - unsigned long oldflags = READ_ONCE(vma->vm_flags); + vm_flags_t oldflags = READ_ONCE(vma->vm_flags); long nrpages = (end - start) >> PAGE_SHIFT; unsigned int mm_cp_flags = 0; unsigned long charged = 0; @@ -774,8 +774,8 @@ static int do_mprotect_pkey(unsigned long start, size_t len, nstart = start; tmp = vma->vm_start; for_each_vma_range(vmi, vma, end) { - unsigned long mask_off_old_flags; - unsigned long newflags; + vm_flags_t mask_off_old_flags; + vm_flags_t newflags; int new_vma_pkey; if (vma->vm_start != tmp) { diff --git a/mm/mremap.c b/mm/mremap.c index 18b215521ada..7e93d3344828 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -1025,7 +1025,7 @@ static unsigned long prep_move_vma(struct vma_remap_struct *vrm) struct vm_area_struct *vma = vrm->vma; unsigned long old_addr = vrm->addr; unsigned long old_len = vrm->old_len; - unsigned long dummy = vma->vm_flags; + vm_flags_t dummy = vma->vm_flags; /* * We'd prefer to avoid failure later on in do_munmap: diff --git a/mm/nommu.c b/mm/nommu.c index b624acec6d2e..87e1acab0d64 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -126,7 +126,7 @@ void *vrealloc_noprof(const void *p, size_t size, gfp_t flags) void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align, unsigned long start, unsigned long end, gfp_t gfp_mask, - pgprot_t prot, unsigned long vm_flags, int node, + pgprot_t prot, vm_flags_t vm_flags, int node, const void *caller) { return __vmalloc_noprof(size, gfp_mask); @@ -844,12 +844,12 @@ static int validate_mmap_request(struct file *file, * we've determined that we can make the mapping, now translate what we * now know into VMA flags */ -static unsigned long determine_vm_flags(struct file *file, - unsigned long prot, - unsigned long flags, - unsigned long capabilities) +static vm_flags_t determine_vm_flags(struct file *file, + unsigned 
long prot, + unsigned long flags, + unsigned long capabilities) { - unsigned long vm_flags; + vm_flags_t vm_flags; vm_flags = calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(file, flags); diff --git a/mm/rmap.c b/mm/rmap.c index fb63d9256f09..a312cae16bb5 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -839,7 +839,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address) struct folio_referenced_arg { int mapcount; int referenced; - unsigned long vm_flags; + vm_flags_t vm_flags; struct mem_cgroup *memcg; }; @@ -984,7 +984,7 @@ static bool invalid_folio_referenced_vma(struct vm_area_struct *vma, void *arg) * the function bailed out due to rmap lock contention. */ int folio_referenced(struct folio *folio, int is_locked, - struct mem_cgroup *memcg, unsigned long *vm_flags) + struct mem_cgroup *memcg, vm_flags_t *vm_flags) { bool we_locked = false; struct folio_referenced_arg pra = { diff --git a/mm/shmem.c b/mm/shmem.c index eda35be2a8d9..334b7b4a61a0 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -615,7 +615,7 @@ static unsigned int shmem_get_orders_within_size(struct inode *inode, static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index, loff_t write_end, bool shmem_huge_force, struct vm_area_struct *vma, - unsigned long vm_flags) + vm_flags_t vm_flags) { unsigned int maybe_pmd_order = HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER ? 0 : BIT(HPAGE_PMD_ORDER); @@ -862,7 +862,7 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo, static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index, loff_t write_end, bool shmem_huge_force, struct vm_area_struct *vma, - unsigned long vm_flags) + vm_flags_t vm_flags) { return 0; } @@ -1753,7 +1753,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode, { unsigned long mask = READ_ONCE(huge_shmem_orders_always); unsigned long within_size_orders = READ_ONCE(huge_shmem_orders_within_size); - unsigned long vm_flags = vma ? vma->vm_flags : 0; + vm_flags_t vm_flags = vma ? vma->vm_flags : 0; unsigned int global_orders; if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags))) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 9ff970980496..95dd8dea6ee4 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -1901,11 +1901,11 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start, } static void userfaultfd_set_vm_flags(struct vm_area_struct *vma, - vm_flags_t flags) + vm_flags_t vm_flags) { - const bool uffd_wp_changed = (vma->vm_flags ^ flags) & VM_UFFD_WP; + const bool uffd_wp_changed = (vma->vm_flags ^ vm_flags) & VM_UFFD_WP; - vm_flags_reset(vma, flags); + vm_flags_reset(vma, vm_flags); /* * For shared mappings, we want to enable writenotify while * userfaultfd-wp is enabled (see vma_wants_writenotify()). We'll simply @@ -1917,12 +1917,12 @@ static void userfaultfd_set_vm_flags(struct vm_area_struct *vma, static void userfaultfd_set_ctx(struct vm_area_struct *vma, struct userfaultfd_ctx *ctx, - unsigned long flags) + vm_flags_t vm_flags) { vma_start_write(vma); vma->vm_userfaultfd_ctx = (struct vm_userfaultfd_ctx){ctx}; userfaultfd_set_vm_flags(vma, - (vma->vm_flags & ~__VM_UFFD_FLAGS) | flags); + (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags); } void userfaultfd_reset_ctx(struct vm_area_struct *vma) @@ -1968,14 +1968,14 @@ struct vm_area_struct *userfaultfd_clear_vma(struct vma_iterator *vmi, /* Assumes mmap write lock taken, and mm_struct pinned. 
*/ int userfaultfd_register_range(struct userfaultfd_ctx *ctx, struct vm_area_struct *vma, - unsigned long vm_flags, + vm_flags_t vm_flags, unsigned long start, unsigned long end, bool wp_async) { VMA_ITERATOR(vmi, ctx->mm, start); struct vm_area_struct *prev = vma_prev(&vmi); unsigned long vma_end; - unsigned long new_flags; + vm_flags_t new_flags; if (vma->vm_start < start) prev = vma; diff --git a/mm/vma.c b/mm/vma.c index 4b6d0be9ba39..b3d880652359 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -15,7 +15,7 @@ struct mmap_state { unsigned long end; pgoff_t pgoff; unsigned long pglen; - unsigned long flags; + vm_flags_t vm_flags; struct file *file; pgprot_t page_prot; @@ -37,7 +37,7 @@ struct mmap_state { bool check_ksm_early; }; -#define MMAP_STATE(name, mm_, vmi_, addr_, len_, pgoff_, flags_, file_) \ +#define MMAP_STATE(name, mm_, vmi_, addr_, len_, pgoff_, vm_flags_, file_) \ struct mmap_state name = { \ .mm = mm_, \ .vmi = vmi_, \ @@ -45,9 +45,9 @@ struct mmap_state { .end = (addr_) + (len_), \ .pgoff = pgoff_, \ .pglen = PHYS_PFN(len_), \ - .flags = flags_, \ + .vm_flags = vm_flags_, \ .file = file_, \ - .page_prot = vm_get_page_prot(flags_), \ + .page_prot = vm_get_page_prot(vm_flags_), \ } #define VMG_MMAP_STATE(name, map_, vma_) \ @@ -56,7 +56,7 @@ struct mmap_state { .vmi = (map_)->vmi, \ .start = (map_)->addr, \ .end = (map_)->end, \ - .flags = (map_)->flags, \ + .vm_flags = (map_)->vm_flags, \ .pgoff = (map_)->pgoff, \ .file = (map_)->file, \ .prev = (map_)->prev, \ @@ -95,7 +95,7 @@ static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_nex * the kernel to generate new VMAs when old one could be * extended instead. */ - if ((vma->vm_flags ^ vmg->flags) & ~VM_SOFTDIRTY) + if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_SOFTDIRTY) return false; if (vma->vm_file != vmg->file) return false; @@ -843,7 +843,7 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( * furthermost left or right side of the VMA, then we have no chance of * merging and should abort. */ - if (vmg->flags & VM_SPECIAL || (!left_side && !right_side)) + if (vmg->vm_flags & VM_SPECIAL || (!left_side && !right_side)) return NULL; if (left_side) @@ -973,7 +973,7 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( if (err || commit_merge(vmg)) goto abort; - khugepaged_enter_vma(vmg->target, vmg->flags); + khugepaged_enter_vma(vmg->target, vmg->vm_flags); vmg->state = VMA_MERGE_SUCCESS; return vmg->target; @@ -1055,7 +1055,7 @@ struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg) vmg->state = VMA_MERGE_NOMERGE; /* Special VMAs are unmergeable, also if no prev/next. */ - if ((vmg->flags & VM_SPECIAL) || (!prev && !next)) + if ((vmg->vm_flags & VM_SPECIAL) || (!prev && !next)) return NULL; can_merge_left = can_vma_merge_left(vmg); @@ -1093,7 +1093,7 @@ struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg) * following VMA if we have VMAs on both sides. 
*/ if (vmg->target && !vma_expand(vmg)) { - khugepaged_enter_vma(vmg->target, vmg->flags); + khugepaged_enter_vma(vmg->target, vmg->vm_flags); vmg->state = VMA_MERGE_SUCCESS; return vmg->target; } @@ -1640,11 +1640,11 @@ static struct vm_area_struct *vma_modify(struct vma_merge_struct *vmg) struct vm_area_struct *vma_modify_flags( struct vma_iterator *vmi, struct vm_area_struct *prev, struct vm_area_struct *vma, unsigned long start, unsigned long end, - unsigned long new_flags) + vm_flags_t vm_flags) { VMG_VMA_STATE(vmg, vmi, prev, vma, start, end); - vmg.flags = new_flags; + vmg.vm_flags = vm_flags; return vma_modify(&vmg); } @@ -1655,12 +1655,12 @@ struct vm_area_struct struct vm_area_struct *vma, unsigned long start, unsigned long end, - unsigned long new_flags, + vm_flags_t vm_flags, struct anon_vma_name *new_name) { VMG_VMA_STATE(vmg, vmi, prev, vma, start, end); - vmg.flags = new_flags; + vmg.vm_flags = vm_flags; vmg.anon_name = new_name; return vma_modify(&vmg); @@ -1685,13 +1685,13 @@ struct vm_area_struct struct vm_area_struct *prev, struct vm_area_struct *vma, unsigned long start, unsigned long end, - unsigned long new_flags, + vm_flags_t vm_flags, struct vm_userfaultfd_ctx new_ctx, bool give_up_on_oom) { VMG_VMA_STATE(vmg, vmi, prev, vma, start, end); - vmg.flags = new_flags; + vmg.vm_flags = vm_flags; vmg.uffd_ctx = new_ctx; if (give_up_on_oom) vmg.give_up_on_oom = true; @@ -2327,7 +2327,7 @@ static void vms_abort_munmap_vmas(struct vma_munmap_struct *vms, static void update_ksm_flags(struct mmap_state *map) { - map->flags = ksm_vma_flags(map->mm, map->file, map->flags); + map->vm_flags = ksm_vma_flags(map->mm, map->file, map->vm_flags); } /* @@ -2372,11 +2372,11 @@ static int __mmap_prepare(struct mmap_state *map, struct list_head *uf) } /* Check against address space limit. */ - if (!may_expand_vm(map->mm, map->flags, map->pglen - vms->nr_pages)) + if (!may_expand_vm(map->mm, map->vm_flags, map->pglen - vms->nr_pages)) return -ENOMEM; /* Private writable mapping: check memory availability. */ - if (accountable_mapping(map->file, map->flags)) { + if (accountable_mapping(map->file, map->vm_flags)) { map->charged = map->pglen; map->charged -= vms->nr_accounted; if (map->charged) { @@ -2386,7 +2386,7 @@ static int __mmap_prepare(struct mmap_state *map, struct list_head *uf) } vms->nr_accounted = 0; - map->flags |= VM_ACCOUNT; + map->vm_flags |= VM_ACCOUNT; } /* @@ -2430,12 +2430,12 @@ static int __mmap_new_file_vma(struct mmap_state *map, * Drivers should not permit writability when previously it was * disallowed. 
*/ - VM_WARN_ON_ONCE(map->flags != vma->vm_flags && - !(map->flags & VM_MAYWRITE) && + VM_WARN_ON_ONCE(map->vm_flags != vma->vm_flags && + !(map->vm_flags & VM_MAYWRITE) && (vma->vm_flags & VM_MAYWRITE)); map->file = vma->vm_file; - map->flags = vma->vm_flags; + map->vm_flags = vma->vm_flags; return 0; } @@ -2466,7 +2466,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap) vma_iter_config(vmi, map->addr, map->end); vma_set_range(vma, map->addr, map->end, map->pgoff); - vm_flags_init(vma, map->flags); + vm_flags_init(vma, map->vm_flags); vma->vm_page_prot = map->page_prot; if (vma_iter_prealloc(vmi, vma)) { @@ -2476,7 +2476,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap) if (map->file) error = __mmap_new_file_vma(map, vma); - else if (map->flags & VM_SHARED) + else if (map->vm_flags & VM_SHARED) error = shmem_zero_setup(vma); else vma_set_anonymous(vma); @@ -2486,12 +2486,12 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap) if (!map->check_ksm_early) { update_ksm_flags(map); - vm_flags_init(vma, map->flags); + vm_flags_init(vma, map->vm_flags); } #ifdef CONFIG_SPARC64 /* TODO: Fix SPARC ADI! */ - WARN_ON_ONCE(!arch_validate_flags(map->flags)); + WARN_ON_ONCE(!arch_validate_flags(map->vm_flags)); #endif /* Lock the VMA since it is modified after insertion into VMA tree */ @@ -2505,7 +2505,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap) * call covers the non-merge case. */ if (!vma_is_anonymous(vma)) - khugepaged_enter_vma(vma, map->flags); + khugepaged_enter_vma(vma, map->vm_flags); *vmap = vma; return 0; @@ -2526,7 +2526,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap) static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma) { struct mm_struct *mm = map->mm; - unsigned long vm_flags = vma->vm_flags; + vm_flags_t vm_flags = vma->vm_flags; perf_event_mmap(vma); @@ -2579,7 +2579,7 @@ static int call_mmap_prepare(struct mmap_state *map) .pgoff = map->pgoff, .file = map->file, - .vm_flags = map->flags, + .vm_flags = map->vm_flags, .page_prot = map->page_prot, }; @@ -2591,7 +2591,7 @@ static int call_mmap_prepare(struct mmap_state *map) /* Update fields permitted to be changed. */ map->pgoff = desc.pgoff; map->file = desc.file; - map->flags = desc.vm_flags; + map->vm_flags = desc.vm_flags; map->page_prot = desc.page_prot; /* User-defined fields. */ map->vm_ops = desc.vm_ops; @@ -2754,14 +2754,14 @@ unsigned long mmap_region(struct file *file, unsigned long addr, * @addr: The start address * @len: The length of the increase * @vma: The vma, - * @flags: The VMA Flags + * @vm_flags: The VMA Flags * * Extend the brk VMA from addr to addr + len. If the VMA is NULL or the flags * do not match then create a new anonymous VMA. Eventually we may be able to * do some brk-specific accounting here. */ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, - unsigned long addr, unsigned long len, unsigned long flags) + unsigned long addr, unsigned long len, vm_flags_t vm_flags) { struct mm_struct *mm = current->mm; @@ -2769,9 +2769,9 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, * Check against address space limits by the changed size * Note: This happens *after* clearing old mappings in some code paths. 
*/ - flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags; - flags = ksm_vma_flags(mm, NULL, flags); - if (!may_expand_vm(mm, flags, len >> PAGE_SHIFT)) + vm_flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags; + vm_flags = ksm_vma_flags(mm, NULL, vm_flags); + if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) return -ENOMEM; if (mm->map_count > sysctl_max_map_count) @@ -2785,7 +2785,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, * occur after forking, so the expand will only happen on new VMAs. */ if (vma && vma->vm_end == addr) { - VMG_STATE(vmg, mm, vmi, addr, addr + len, flags, PHYS_PFN(addr)); + VMG_STATE(vmg, mm, vmi, addr, addr + len, vm_flags, PHYS_PFN(addr)); vmg.prev = vma; /* vmi is positioned at prev, which this mode expects. */ @@ -2806,8 +2806,8 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, vma_set_anonymous(vma); vma_set_range(vma, addr, addr + len, addr >> PAGE_SHIFT); - vm_flags_init(vma, flags); - vma->vm_page_prot = vm_get_page_prot(flags); + vm_flags_init(vma, vm_flags); + vma->vm_page_prot = vm_get_page_prot(vm_flags); vma_start_write(vma); if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL)) goto mas_store_fail; @@ -2818,7 +2818,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, perf_event_mmap(vma); mm->total_vm += len >> PAGE_SHIFT; mm->data_vm += len >> PAGE_SHIFT; - if (flags & VM_LOCKED) + if (vm_flags & VM_LOCKED) mm->locked_vm += (len >> PAGE_SHIFT); vm_flags_set(vma, VM_SOFTDIRTY); return 0; diff --git a/mm/vma.h b/mm/vma.h index f47112a352db..cf6e3a6371b6 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -98,7 +98,7 @@ struct vma_merge_struct { unsigned long end; pgoff_t pgoff; - unsigned long flags; + vm_flags_t vm_flags; struct file *file; struct anon_vma *anon_vma; struct mempolicy *policy; @@ -164,13 +164,13 @@ static inline pgoff_t vma_pgoff_offset(struct vm_area_struct *vma, return vma->vm_pgoff + PHYS_PFN(addr - vma->vm_start); } -#define VMG_STATE(name, mm_, vmi_, start_, end_, flags_, pgoff_) \ +#define VMG_STATE(name, mm_, vmi_, start_, end_, vm_flags_, pgoff_) \ struct vma_merge_struct name = { \ .mm = mm_, \ .vmi = vmi_, \ .start = start_, \ .end = end_, \ - .flags = flags_, \ + .vm_flags = vm_flags_, \ .pgoff = pgoff_, \ .state = VMA_MERGE_START, \ } @@ -184,7 +184,7 @@ static inline pgoff_t vma_pgoff_offset(struct vm_area_struct *vma, .next = NULL, \ .start = start_, \ .end = end_, \ - .flags = vma_->vm_flags, \ + .vm_flags = vma_->vm_flags, \ .pgoff = vma_pgoff_offset(vma_, start_), \ .file = vma_->vm_file, \ .anon_vma = vma_->anon_vma, \ @@ -288,7 +288,7 @@ __must_check struct vm_area_struct *vma_modify_flags(struct vma_iterator *vmi, struct vm_area_struct *prev, struct vm_area_struct *vma, unsigned long start, unsigned long end, - unsigned long new_flags); + vm_flags_t vm_flags); /* We are about to modify the VMA's flags and/or anon_name. */ __must_check struct vm_area_struct @@ -297,7 +297,7 @@ __must_check struct vm_area_struct struct vm_area_struct *vma, unsigned long start, unsigned long end, - unsigned long new_flags, + vm_flags_t vm_flags, struct anon_vma_name *new_name); /* We are about to modify the VMA's memory policy. 
*/ @@ -314,7 +314,7 @@ __must_check struct vm_area_struct struct vm_area_struct *prev, struct vm_area_struct *vma, unsigned long start, unsigned long end, - unsigned long new_flags, + vm_flags_t vm_flags, struct vm_userfaultfd_ctx new_ctx, bool give_up_on_oom); @@ -375,7 +375,7 @@ static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma } #ifdef CONFIG_MMU -static inline pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags) +static inline pgprot_t vm_pgprot_modify(pgprot_t oldprot, vm_flags_t vm_flags) { return pgprot_modify(oldprot, vm_get_page_prot(vm_flags)); } diff --git a/mm/vmscan.c b/mm/vmscan.c index 620dce753b64..56d540b8a1d0 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -907,7 +907,7 @@ static enum folio_references folio_check_references(struct folio *folio, struct scan_control *sc) { int referenced_ptes, referenced_folio; - unsigned long vm_flags; + vm_flags_t vm_flags; referenced_ptes = folio_referenced(folio, 1, sc->target_mem_cgroup, &vm_flags); @@ -2120,7 +2120,7 @@ static void shrink_active_list(unsigned long nr_to_scan, { unsigned long nr_taken; unsigned long nr_scanned; - unsigned long vm_flags; + vm_flags_t vm_flags; LIST_HEAD(l_hold); /* The folios which were snipped off */ LIST_HEAD(l_active); LIST_HEAD(l_inactive); diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 7fec5b3de83f..656e1c75b711 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -65,7 +65,7 @@ static struct vm_area_struct *alloc_vma(struct mm_struct *mm, unsigned long start, unsigned long end, pgoff_t pgoff, - vm_flags_t flags) + vm_flags_t vm_flags) { struct vm_area_struct *ret = vm_area_alloc(mm); @@ -75,7 +75,7 @@ static struct vm_area_struct *alloc_vma(struct mm_struct *mm, ret->vm_start = start; ret->vm_end = end; ret->vm_pgoff = pgoff; - ret->__vm_flags = flags; + ret->__vm_flags = vm_flags; vma_assert_detached(ret); return ret; @@ -103,9 +103,9 @@ static struct vm_area_struct *alloc_and_link_vma(struct mm_struct *mm, unsigned long start, unsigned long end, pgoff_t pgoff, - vm_flags_t flags) + vm_flags_t vm_flags) { - struct vm_area_struct *vma = alloc_vma(mm, start, end, pgoff, flags); + struct vm_area_struct *vma = alloc_vma(mm, start, end, pgoff, vm_flags); if (vma == NULL) return NULL; @@ -172,7 +172,7 @@ static int expand_existing(struct vma_merge_struct *vmg) * specified new range. */ static void vmg_set_range(struct vma_merge_struct *vmg, unsigned long start, - unsigned long end, pgoff_t pgoff, vm_flags_t flags) + unsigned long end, pgoff_t pgoff, vm_flags_t vm_flags) { vma_iter_set(vmg->vmi, start); @@ -184,7 +184,7 @@ static void vmg_set_range(struct vma_merge_struct *vmg, unsigned long start, vmg->start = start; vmg->end = end; vmg->pgoff = pgoff; - vmg->flags = flags; + vmg->vm_flags = vm_flags; vmg->just_expand = false; vmg->__remove_middle = false; @@ -195,10 +195,10 @@ static void vmg_set_range(struct vma_merge_struct *vmg, unsigned long start, /* Helper function to set both the VMG range and its anon_vma. 
*/ static void vmg_set_range_anon_vma(struct vma_merge_struct *vmg, unsigned long start, - unsigned long end, pgoff_t pgoff, vm_flags_t flags, + unsigned long end, pgoff_t pgoff, vm_flags_t vm_flags, struct anon_vma *anon_vma) { - vmg_set_range(vmg, start, end, pgoff, flags); + vmg_set_range(vmg, start, end, pgoff, vm_flags); vmg->anon_vma = anon_vma; } @@ -211,12 +211,12 @@ static void vmg_set_range_anon_vma(struct vma_merge_struct *vmg, unsigned long s static struct vm_area_struct *try_merge_new_vma(struct mm_struct *mm, struct vma_merge_struct *vmg, unsigned long start, unsigned long end, - pgoff_t pgoff, vm_flags_t flags, + pgoff_t pgoff, vm_flags_t vm_flags, bool *was_merged) { struct vm_area_struct *merged; - vmg_set_range(vmg, start, end, pgoff, flags); + vmg_set_range(vmg, start, end, pgoff, vm_flags); merged = merge_new(vmg); if (merged) { @@ -229,7 +229,7 @@ static struct vm_area_struct *try_merge_new_vma(struct mm_struct *mm, ASSERT_EQ(vmg->state, VMA_MERGE_NOMERGE); - return alloc_and_link_vma(mm, start, end, pgoff, flags); + return alloc_and_link_vma(mm, start, end, pgoff, vm_flags); } /* @@ -301,17 +301,17 @@ static void vma_set_dummy_anon_vma(struct vm_area_struct *vma, static bool test_simple_merge(void) { struct vm_area_struct *vma; - unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; - struct vm_area_struct *vma_left = alloc_vma(&mm, 0, 0x1000, 0, flags); - struct vm_area_struct *vma_right = alloc_vma(&mm, 0x2000, 0x3000, 2, flags); + struct vm_area_struct *vma_left = alloc_vma(&mm, 0, 0x1000, 0, vm_flags); + struct vm_area_struct *vma_right = alloc_vma(&mm, 0x2000, 0x3000, 2, vm_flags); VMA_ITERATOR(vmi, &mm, 0x1000); struct vma_merge_struct vmg = { .mm = &mm, .vmi = &vmi, .start = 0x1000, .end = 0x2000, - .flags = flags, + .vm_flags = vm_flags, .pgoff = 1, }; @@ -324,7 +324,7 @@ static bool test_simple_merge(void) ASSERT_EQ(vma->vm_start, 0); ASSERT_EQ(vma->vm_end, 0x3000); ASSERT_EQ(vma->vm_pgoff, 0); - ASSERT_EQ(vma->vm_flags, flags); + ASSERT_EQ(vma->vm_flags, vm_flags); detach_free_vma(vma); mtree_destroy(&mm.mm_mt); @@ -335,9 +335,9 @@ static bool test_simple_merge(void) static bool test_simple_modify(void) { struct vm_area_struct *vma; - unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; - struct vm_area_struct *init_vma = alloc_vma(&mm, 0, 0x3000, 0, flags); + struct vm_area_struct *init_vma = alloc_vma(&mm, 0, 0x3000, 0, vm_flags); VMA_ITERATOR(vmi, &mm, 0x1000); ASSERT_FALSE(attach_vma(&mm, init_vma)); @@ -394,9 +394,9 @@ static bool test_simple_modify(void) static bool test_simple_expand(void) { - unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; - struct vm_area_struct *vma = alloc_vma(&mm, 0, 0x1000, 0, flags); + struct vm_area_struct *vma = alloc_vma(&mm, 0, 0x1000, 0, vm_flags); VMA_ITERATOR(vmi, &mm, 0); struct vma_merge_struct vmg = { .vmi = &vmi, @@ -422,9 +422,9 @@ static bool test_simple_expand(void) static bool test_simple_shrink(void) { - unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; - struct vm_area_struct *vma = alloc_vma(&mm, 0, 0x3000, 0, flags); + struct vm_area_struct *vma = 
alloc_vma(&mm, 0, 0x3000, 0, vm_flags); VMA_ITERATOR(vmi, &mm, 0); ASSERT_FALSE(attach_vma(&mm, vma)); @@ -443,7 +443,7 @@ static bool test_simple_shrink(void) static bool test_merge_new(void) { - unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; VMA_ITERATOR(vmi, &mm, 0); struct vma_merge_struct vmg = { @@ -473,18 +473,18 @@ static bool test_merge_new(void) * 0123456789abc * AA B CC */ - vma_a = alloc_and_link_vma(&mm, 0, 0x2000, 0, flags); + vma_a = alloc_and_link_vma(&mm, 0, 0x2000, 0, vm_flags); ASSERT_NE(vma_a, NULL); /* We give each VMA a single avc so we can test anon_vma duplication. */ INIT_LIST_HEAD(&vma_a->anon_vma_chain); list_add(&dummy_anon_vma_chain_a.same_vma, &vma_a->anon_vma_chain); - vma_b = alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, flags); + vma_b = alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, vm_flags); ASSERT_NE(vma_b, NULL); INIT_LIST_HEAD(&vma_b->anon_vma_chain); list_add(&dummy_anon_vma_chain_b.same_vma, &vma_b->anon_vma_chain); - vma_c = alloc_and_link_vma(&mm, 0xb000, 0xc000, 0xb, flags); + vma_c = alloc_and_link_vma(&mm, 0xb000, 0xc000, 0xb, vm_flags); ASSERT_NE(vma_c, NULL); INIT_LIST_HEAD(&vma_c->anon_vma_chain); list_add(&dummy_anon_vma_chain_c.same_vma, &vma_c->anon_vma_chain); @@ -495,7 +495,7 @@ static bool test_merge_new(void) * 0123456789abc * AA B ** CC */ - vma_d = try_merge_new_vma(&mm, &vmg, 0x7000, 0x9000, 7, flags, &merged); + vma_d = try_merge_new_vma(&mm, &vmg, 0x7000, 0x9000, 7, vm_flags, &merged); ASSERT_NE(vma_d, NULL); INIT_LIST_HEAD(&vma_d->anon_vma_chain); list_add(&dummy_anon_vma_chain_d.same_vma, &vma_d->anon_vma_chain); @@ -510,7 +510,7 @@ static bool test_merge_new(void) */ vma_a->vm_ops = &vm_ops; /* This should have no impact. */ vma_b->anon_vma = &dummy_anon_vma; - vma = try_merge_new_vma(&mm, &vmg, 0x2000, 0x3000, 2, flags, &merged); + vma = try_merge_new_vma(&mm, &vmg, 0x2000, 0x3000, 2, vm_flags, &merged); ASSERT_EQ(vma, vma_a); /* Merge with A, delete B. */ ASSERT_TRUE(merged); @@ -527,7 +527,7 @@ static bool test_merge_new(void) * 0123456789abc * AAAA* DD CC */ - vma = try_merge_new_vma(&mm, &vmg, 0x4000, 0x5000, 4, flags, &merged); + vma = try_merge_new_vma(&mm, &vmg, 0x4000, 0x5000, 4, vm_flags, &merged); ASSERT_EQ(vma, vma_a); /* Extend A. */ ASSERT_TRUE(merged); @@ -546,7 +546,7 @@ static bool test_merge_new(void) */ vma_d->anon_vma = &dummy_anon_vma; vma_d->vm_ops = &vm_ops; /* This should have no impact. */ - vma = try_merge_new_vma(&mm, &vmg, 0x6000, 0x7000, 6, flags, &merged); + vma = try_merge_new_vma(&mm, &vmg, 0x6000, 0x7000, 6, vm_flags, &merged); ASSERT_EQ(vma, vma_d); /* Prepend. */ ASSERT_TRUE(merged); @@ -564,7 +564,7 @@ static bool test_merge_new(void) * AAAAA*DDD CC */ vma_d->vm_ops = NULL; /* This would otherwise degrade the merge. */ - vma = try_merge_new_vma(&mm, &vmg, 0x5000, 0x6000, 5, flags, &merged); + vma = try_merge_new_vma(&mm, &vmg, 0x5000, 0x6000, 5, vm_flags, &merged); ASSERT_EQ(vma, vma_a); /* Merge with A, delete D. */ ASSERT_TRUE(merged); @@ -582,7 +582,7 @@ static bool test_merge_new(void) * AAAAAAAAA *CC */ vma_c->anon_vma = &dummy_anon_vma; - vma = try_merge_new_vma(&mm, &vmg, 0xa000, 0xb000, 0xa, flags, &merged); + vma = try_merge_new_vma(&mm, &vmg, 0xa000, 0xb000, 0xa, vm_flags, &merged); ASSERT_EQ(vma, vma_c); /* Prepend C. 
*/ ASSERT_TRUE(merged); @@ -599,7 +599,7 @@ static bool test_merge_new(void) * 0123456789abc * AAAAAAAAA*CCC */ - vma = try_merge_new_vma(&mm, &vmg, 0x9000, 0xa000, 0x9, flags, &merged); + vma = try_merge_new_vma(&mm, &vmg, 0x9000, 0xa000, 0x9, vm_flags, &merged); ASSERT_EQ(vma, vma_a); /* Extend A and delete C. */ ASSERT_TRUE(merged); @@ -639,7 +639,7 @@ static bool test_merge_new(void) static bool test_vma_merge_special_flags(void) { - unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; VMA_ITERATOR(vmi, &mm, 0); struct vma_merge_struct vmg = { @@ -661,7 +661,7 @@ static bool test_vma_merge_special_flags(void) * 01234 * AAA */ - vma_left = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma_left = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); ASSERT_NE(vma_left, NULL); /* 1. Set up new VMA with special flag that would otherwise merge. */ @@ -672,12 +672,12 @@ static bool test_vma_merge_special_flags(void) * * This should merge if not for the VM_SPECIAL flag. */ - vmg_set_range(&vmg, 0x3000, 0x4000, 3, flags); + vmg_set_range(&vmg, 0x3000, 0x4000, 3, vm_flags); for (i = 0; i < ARRAY_SIZE(special_flags); i++) { vm_flags_t special_flag = special_flags[i]; - vma_left->__vm_flags = flags | special_flag; - vmg.flags = flags | special_flag; + vma_left->__vm_flags = vm_flags | special_flag; + vmg.vm_flags = vm_flags | special_flag; vma = merge_new(&vmg); ASSERT_EQ(vma, NULL); ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); @@ -691,15 +691,15 @@ static bool test_vma_merge_special_flags(void) * * Create a VMA to modify. */ - vma = alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, vm_flags); ASSERT_NE(vma, NULL); vmg.middle = vma; for (i = 0; i < ARRAY_SIZE(special_flags); i++) { vm_flags_t special_flag = special_flags[i]; - vma_left->__vm_flags = flags | special_flag; - vmg.flags = flags | special_flag; + vma_left->__vm_flags = vm_flags | special_flag; + vmg.vm_flags = vm_flags | special_flag; vma = merge_existing(&vmg); ASSERT_EQ(vma, NULL); ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); @@ -711,7 +711,7 @@ static bool test_vma_merge_special_flags(void) static bool test_vma_merge_with_close(void) { - unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; VMA_ITERATOR(vmi, &mm, 0); struct vma_merge_struct vmg = { @@ -791,11 +791,11 @@ static bool test_vma_merge_with_close(void) * PPPPPPNNN */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); - vma_next = alloc_and_link_vma(&mm, 0x5000, 0x9000, 5, flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x5000, 0x9000, 5, vm_flags); vma_next->vm_ops = &vm_ops; - vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); + vmg_set_range(&vmg, 0x3000, 0x5000, 3, vm_flags); ASSERT_EQ(merge_new(&vmg), vma_prev); ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); ASSERT_EQ(vma_prev->vm_start, 0); @@ -816,11 +816,11 @@ static bool test_vma_merge_with_close(void) * proceed. 
*/ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); - vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, vm_flags); vma->vm_ops = &vm_ops; - vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); + vmg_set_range(&vmg, 0x3000, 0x5000, 3, vm_flags); vmg.prev = vma_prev; vmg.middle = vma; @@ -844,11 +844,11 @@ static bool test_vma_merge_with_close(void) * proceed. */ - vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); - vma_next = alloc_and_link_vma(&mm, 0x5000, 0x9000, 5, flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x5000, 0x9000, 5, vm_flags); vma->vm_ops = &vm_ops; - vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); + vmg_set_range(&vmg, 0x3000, 0x5000, 3, vm_flags); vmg.middle = vma; ASSERT_EQ(merge_existing(&vmg), NULL); /* @@ -872,12 +872,12 @@ static bool test_vma_merge_with_close(void) * PPPVVNNNN */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); - vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); - vma_next = alloc_and_link_vma(&mm, 0x5000, 0x9000, 5, flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x5000, 0x9000, 5, vm_flags); vma->vm_ops = &vm_ops; - vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); + vmg_set_range(&vmg, 0x3000, 0x5000, 3, vm_flags); vmg.prev = vma_prev; vmg.middle = vma; @@ -898,12 +898,12 @@ static bool test_vma_merge_with_close(void) * PPPPPNNNN */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); - vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); - vma_next = alloc_and_link_vma(&mm, 0x5000, 0x9000, 5, flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x5000, 0x9000, 5, vm_flags); vma_next->vm_ops = &vm_ops; - vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); + vmg_set_range(&vmg, 0x3000, 0x5000, 3, vm_flags); vmg.prev = vma_prev; vmg.middle = vma; @@ -920,15 +920,15 @@ static bool test_vma_merge_with_close(void) static bool test_vma_merge_new_with_close(void) { - unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; VMA_ITERATOR(vmi, &mm, 0); struct vma_merge_struct vmg = { .mm = &mm, .vmi = &vmi, }; - struct vm_area_struct *vma_prev = alloc_and_link_vma(&mm, 0, 0x2000, 0, flags); - struct vm_area_struct *vma_next = alloc_and_link_vma(&mm, 0x5000, 0x7000, 5, flags); + struct vm_area_struct *vma_prev = alloc_and_link_vma(&mm, 0, 0x2000, 0, vm_flags); + struct vm_area_struct *vma_next = alloc_and_link_vma(&mm, 0x5000, 0x7000, 5, vm_flags); const struct vm_operations_struct vm_ops = { .close = dummy_close, }; @@ -958,7 +958,7 @@ static bool test_vma_merge_new_with_close(void) vma_prev->vm_ops = &vm_ops; vma_next->vm_ops = &vm_ops; - vmg_set_range(&vmg, 0x2000, 0x5000, 2, flags); + vmg_set_range(&vmg, 0x2000, 0x5000, 2, vm_flags); vma = merge_new(&vmg); ASSERT_NE(vma, NULL); ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); @@ -975,7 +975,7 @@ static bool test_vma_merge_new_with_close(void) static bool test_merge_existing(void) { - unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; VMA_ITERATOR(vmi, 
&mm, 0); struct vm_area_struct *vma, *vma_prev, *vma_next; @@ -998,11 +998,11 @@ static bool test_merge_existing(void) * 0123456789 * VNNNNNN */ - vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, flags); + vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags); vma->vm_ops = &vm_ops; /* This should have no impact. */ - vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, flags); + vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags); vma_next->vm_ops = &vm_ops; /* This should have no impact. */ - vmg_set_range_anon_vma(&vmg, 0x3000, 0x6000, 3, flags, &dummy_anon_vma); + vmg_set_range_anon_vma(&vmg, 0x3000, 0x6000, 3, vm_flags, &dummy_anon_vma); vmg.middle = vma; vmg.prev = vma; vma_set_dummy_anon_vma(vma, &avc); @@ -1032,10 +1032,10 @@ static bool test_merge_existing(void) * 0123456789 * NNNNNNN */ - vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, flags); - vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, flags); + vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags); vma_next->vm_ops = &vm_ops; /* This should have no impact. */ - vmg_set_range_anon_vma(&vmg, 0x2000, 0x6000, 2, flags, &dummy_anon_vma); + vmg_set_range_anon_vma(&vmg, 0x2000, 0x6000, 2, vm_flags, &dummy_anon_vma); vmg.middle = vma; vma_set_dummy_anon_vma(vma, &avc); ASSERT_EQ(merge_existing(&vmg), vma_next); @@ -1060,11 +1060,11 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPV */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ - vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags); vma->vm_ops = &vm_ops; /* This should have no impact. */ - vmg_set_range_anon_vma(&vmg, 0x3000, 0x6000, 3, flags, &dummy_anon_vma); + vmg_set_range_anon_vma(&vmg, 0x3000, 0x6000, 3, vm_flags, &dummy_anon_vma); vmg.prev = vma_prev; vmg.middle = vma; vma_set_dummy_anon_vma(vma, &avc); @@ -1094,10 +1094,10 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPP */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ - vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, flags); - vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, flags, &dummy_anon_vma); + vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags); + vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma); vmg.prev = vma_prev; vmg.middle = vma; vma_set_dummy_anon_vma(vma, &avc); @@ -1123,11 +1123,11 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPPPPP */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); vma_prev->vm_ops = &vm_ops; /* This should have no impact. 
*/ - vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, flags); - vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, flags); - vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, flags, &dummy_anon_vma); + vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, vm_flags); + vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma); vmg.prev = vma_prev; vmg.middle = vma; vma_set_dummy_anon_vma(vma, &avc); @@ -1158,41 +1158,41 @@ static bool test_merge_existing(void) * PPPVVVVVNNN */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); - vma = alloc_and_link_vma(&mm, 0x3000, 0x8000, 3, flags); - vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x8000, 3, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, vm_flags); - vmg_set_range(&vmg, 0x4000, 0x5000, 4, flags); + vmg_set_range(&vmg, 0x4000, 0x5000, 4, vm_flags); vmg.prev = vma; vmg.middle = vma; ASSERT_EQ(merge_existing(&vmg), NULL); ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); - vmg_set_range(&vmg, 0x5000, 0x6000, 5, flags); + vmg_set_range(&vmg, 0x5000, 0x6000, 5, vm_flags); vmg.prev = vma; vmg.middle = vma; ASSERT_EQ(merge_existing(&vmg), NULL); ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); - vmg_set_range(&vmg, 0x6000, 0x7000, 6, flags); + vmg_set_range(&vmg, 0x6000, 0x7000, 6, vm_flags); vmg.prev = vma; vmg.middle = vma; ASSERT_EQ(merge_existing(&vmg), NULL); ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); - vmg_set_range(&vmg, 0x4000, 0x7000, 4, flags); + vmg_set_range(&vmg, 0x4000, 0x7000, 4, vm_flags); vmg.prev = vma; vmg.middle = vma; ASSERT_EQ(merge_existing(&vmg), NULL); ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); - vmg_set_range(&vmg, 0x4000, 0x6000, 4, flags); + vmg_set_range(&vmg, 0x4000, 0x6000, 4, vm_flags); vmg.prev = vma; vmg.middle = vma; ASSERT_EQ(merge_existing(&vmg), NULL); ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); - vmg_set_range(&vmg, 0x5000, 0x6000, 5, flags); + vmg_set_range(&vmg, 0x5000, 0x6000, 5, vm_flags); vmg.prev = vma; vmg.middle = vma; ASSERT_EQ(merge_existing(&vmg), NULL); @@ -1205,7 +1205,7 @@ static bool test_merge_existing(void) static bool test_anon_vma_non_mergeable(void) { - unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; VMA_ITERATOR(vmi, &mm, 0); struct vm_area_struct *vma, *vma_prev, *vma_next; @@ -1229,9 +1229,9 @@ static bool test_anon_vma_non_mergeable(void) * 0123456789 * PPPPPPPNNN */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); - vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, flags); - vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, vm_flags); /* * Give both prev and next single anon_vma_chain fields, so they will @@ -1239,7 +1239,7 @@ static bool test_anon_vma_non_mergeable(void) * * However, when prev is compared to next, the merge should fail. 
*/ - vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, flags, NULL); + vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, NULL); vmg.prev = vma_prev; vmg.middle = vma; vma_set_dummy_anon_vma(vma_prev, &dummy_anon_vma_chain_1); @@ -1267,10 +1267,10 @@ static bool test_anon_vma_non_mergeable(void) * 0123456789 * PPPPPPPNNN */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); - vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, vm_flags); - vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, flags, NULL); + vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, NULL); vmg.prev = vma_prev; vma_set_dummy_anon_vma(vma_prev, &dummy_anon_vma_chain_1); __vma_set_dummy_anon_vma(vma_next, &dummy_anon_vma_chain_2, &dummy_anon_vma_2); @@ -1292,7 +1292,7 @@ static bool test_anon_vma_non_mergeable(void) static bool test_dup_anon_vma(void) { - unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; VMA_ITERATOR(vmi, &mm, 0); struct vma_merge_struct vmg = { @@ -1313,11 +1313,11 @@ static bool test_dup_anon_vma(void) * This covers new VMA merging, as these operations amount to a VMA * expand. */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); - vma_next = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, vm_flags); vma_next->anon_vma = &dummy_anon_vma; - vmg_set_range(&vmg, 0, 0x5000, 0, flags); + vmg_set_range(&vmg, 0, 0x5000, 0, vm_flags); vmg.target = vma_prev; vmg.next = vma_next; @@ -1339,16 +1339,16 @@ static bool test_dup_anon_vma(void) * extend delete delete */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); - vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); - vma_next = alloc_and_link_vma(&mm, 0x5000, 0x8000, 5, flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x5000, 0x8000, 5, vm_flags); /* Initialise avc so mergeability check passes. 
*/ INIT_LIST_HEAD(&vma_next->anon_vma_chain); list_add(&dummy_anon_vma_chain.same_vma, &vma_next->anon_vma_chain); vma_next->anon_vma = &dummy_anon_vma; - vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); + vmg_set_range(&vmg, 0x3000, 0x5000, 3, vm_flags); vmg.prev = vma_prev; vmg.middle = vma; @@ -1372,12 +1372,12 @@ static bool test_dup_anon_vma(void) * extend delete delete */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); - vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); - vma_next = alloc_and_link_vma(&mm, 0x5000, 0x8000, 5, flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x5000, 0x8000, 5, vm_flags); vmg.anon_vma = &dummy_anon_vma; vma_set_dummy_anon_vma(vma, &dummy_anon_vma_chain); - vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); + vmg_set_range(&vmg, 0x3000, 0x5000, 3, vm_flags); vmg.prev = vma_prev; vmg.middle = vma; @@ -1401,11 +1401,11 @@ static bool test_dup_anon_vma(void) * extend shrink/delete */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); - vma = alloc_and_link_vma(&mm, 0x3000, 0x8000, 3, flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x8000, 3, vm_flags); vma_set_dummy_anon_vma(vma, &dummy_anon_vma_chain); - vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); + vmg_set_range(&vmg, 0x3000, 0x5000, 3, vm_flags); vmg.prev = vma_prev; vmg.middle = vma; @@ -1429,11 +1429,11 @@ static bool test_dup_anon_vma(void) * shrink/delete extend */ - vma = alloc_and_link_vma(&mm, 0, 0x5000, 0, flags); - vma_next = alloc_and_link_vma(&mm, 0x5000, 0x8000, 5, flags); + vma = alloc_and_link_vma(&mm, 0, 0x5000, 0, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x5000, 0x8000, 5, vm_flags); vma_set_dummy_anon_vma(vma, &dummy_anon_vma_chain); - vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); + vmg_set_range(&vmg, 0x3000, 0x5000, 3, vm_flags); vmg.prev = vma; vmg.middle = vma; @@ -1452,7 +1452,7 @@ static bool test_dup_anon_vma(void) static bool test_vmi_prealloc_fail(void) { - unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; VMA_ITERATOR(vmi, &mm, 0); struct vma_merge_struct vmg = { @@ -1468,11 +1468,11 @@ static bool test_vmi_prealloc_fail(void) * the duplicated anon_vma is unlinked. */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); - vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, vm_flags); vma->anon_vma = &dummy_anon_vma; - vmg_set_range_anon_vma(&vmg, 0x3000, 0x5000, 3, flags, &dummy_anon_vma); + vmg_set_range_anon_vma(&vmg, 0x3000, 0x5000, 3, vm_flags, &dummy_anon_vma); vmg.prev = vma_prev; vmg.middle = vma; vma_set_dummy_anon_vma(vma, &avc); @@ -1496,11 +1496,11 @@ static bool test_vmi_prealloc_fail(void) * performed in this case too. 
*/ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); - vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, vm_flags); vma->anon_vma = &dummy_anon_vma; - vmg_set_range(&vmg, 0, 0x5000, 3, flags); + vmg_set_range(&vmg, 0, 0x5000, 3, vm_flags); vmg.target = vma_prev; vmg.next = vma; @@ -1518,13 +1518,13 @@ static bool test_vmi_prealloc_fail(void) static bool test_merge_extend(void) { - unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; VMA_ITERATOR(vmi, &mm, 0x1000); struct vm_area_struct *vma; - vma = alloc_and_link_vma(&mm, 0, 0x1000, 0, flags); - alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, flags); + vma = alloc_and_link_vma(&mm, 0, 0x1000, 0, vm_flags); + alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, vm_flags); /* * Extend a VMA into the gap between itself and the following VMA. @@ -1548,7 +1548,7 @@ static bool test_merge_extend(void) static bool test_copy_vma(void) { - unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; bool need_locks = false; VMA_ITERATOR(vmi, &mm, 0); @@ -1556,7 +1556,7 @@ static bool test_copy_vma(void) /* Move backwards and do not merge. */ - vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, vm_flags); vma_new = copy_vma(&vma, 0, 0x2000, 0, &need_locks); ASSERT_NE(vma_new, vma); ASSERT_EQ(vma_new->vm_start, 0); @@ -1568,8 +1568,8 @@ static bool test_copy_vma(void) /* Move a VMA into position next to another and merge the two. */ - vma = alloc_and_link_vma(&mm, 0, 0x2000, 0, flags); - vma_next = alloc_and_link_vma(&mm, 0x6000, 0x8000, 6, flags); + vma = alloc_and_link_vma(&mm, 0, 0x2000, 0, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x6000, 0x8000, 6, vm_flags); vma_new = copy_vma(&vma, 0x4000, 0x2000, 4, &need_locks); vma_assert_attached(vma_new); @@ -1581,11 +1581,11 @@ static bool test_copy_vma(void) static bool test_expand_only_mode(void) { - unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; VMA_ITERATOR(vmi, &mm, 0); struct vm_area_struct *vma_prev, *vma; - VMG_STATE(vmg, &mm, &vmi, 0x5000, 0x9000, flags, 5); + VMG_STATE(vmg, &mm, &vmi, 0x5000, 0x9000, vm_flags, 5); /* * Place a VMA prior to the one we're expanding so we assert that we do @@ -1593,14 +1593,14 @@ static bool test_expand_only_mode(void) * have, through the use of the just_expand flag, indicated we do not * need to do so. */ - alloc_and_link_vma(&mm, 0, 0x2000, 0, flags); + alloc_and_link_vma(&mm, 0, 0x2000, 0, vm_flags); /* * We will be positioned at the prev VMA, but looking to expand to * 0x9000. 
*/ vma_iter_set(&vmi, 0x3000); - vma_prev = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); + vma_prev = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, vm_flags); vmg.prev = vma_prev; vmg.just_expand = true; diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 3b1b45256d56..f684649b1008 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -1084,7 +1084,7 @@ static inline bool mpol_equal(struct mempolicy *, struct mempolicy *) } static inline void khugepaged_enter_vma(struct vm_area_struct *vma, - unsigned long vm_flags) + vm_flags_t vm_flags) { (void)vma; (void)vm_flags; @@ -1200,7 +1200,7 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); /* Update vma->vm_page_prot to reflect vma->vm_flags. */ static inline void vma_set_page_prot(struct vm_area_struct *vma) { - unsigned long vm_flags = vma->vm_flags; + vm_flags_t vm_flags = vma->vm_flags; pgprot_t vm_page_prot; /* testing: we inline vm_pgprot_modify() to avoid clash with vma.h. */ @@ -1280,12 +1280,12 @@ static inline bool capable(int cap) return true; } -static inline bool mlock_future_ok(struct mm_struct *mm, unsigned long flags, +static inline bool mlock_future_ok(struct mm_struct *mm, vm_flags_t vm_flags, unsigned long bytes) { unsigned long locked_pages, limit_pages; - if (!(flags & VM_LOCKED) || capable(CAP_IPC_LOCK)) + if (!(vm_flags & VM_LOCKED) || capable(CAP_IPC_LOCK)) return true; locked_pages = bytes >> PAGE_SHIFT;

From d75fa3c9475048356e50357e8a71745e5c1bd740 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes
Date: Wed, 18 Jun 2025 20:42:54 +0100
Subject: [PATCH 086/370] mm: update architecture and driver code to use vm_flags_t

In future we intend to change the vm_flags_t type, so it isn't correct for architecture and driver code to assume it is unsigned long. Correct this assumption across the board.

Overall, this patch does not introduce any functional change.

Link: https://lkml.kernel.org/r/b6eb1894abc5555ece80bb08af5c022ef780c8bc.1750274467.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes
Acked-by: Mike Rapoport (Microsoft)
Acked-by: Christian Brauner
Reviewed-by: Vlastimil Babka
Reviewed-by: Oscar Salvador
Reviewed-by: Pedro Falcato
Acked-by: Catalin Marinas [arm64]
Acked-by: Zi Yan
Acked-by: David Hildenbrand
Reviewed-by: Jarkko Sakkinen
Reviewed-by: Anshuman Khandual
Cc: Jann Horn
Cc: Kees Cook
Cc: Liam R. Howlett
Cc: Jan Kara
Signed-off-by: Andrew Morton
---
 arch/arm/mm/fault.c | 2 +-
 arch/arm64/include/asm/mman.h | 10 +++++-----
 arch/arm64/mm/fault.c | 2 +-
 arch/arm64/mm/mmu.c | 2 +-
 arch/powerpc/include/asm/mman.h | 2 +-
 arch/powerpc/include/asm/pkeys.h | 4 ++--
 arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
 arch/sparc/include/asm/mman.h | 4 ++--
 arch/x86/kernel/cpu/sgx/encl.c | 8 ++++----
 arch/x86/kernel/cpu/sgx/encl.h | 2 +-
 tools/testing/vma/vma_internal.h | 2 +-
 11 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index ab01b51de559..46169fe42c61 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -268,7 +268,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) int sig, code; vm_fault_t fault; unsigned int flags = FAULT_FLAG_DEFAULT; - unsigned long vm_flags = VM_ACCESS_FLAGS; + vm_flags_t vm_flags = VM_ACCESS_FLAGS; if (kprobe_page_fault(regs, fsr)) return 0; diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h index 21df8bbd2668..8770c7ee759f 100644 --- a/arch/arm64/include/asm/mman.h +++ b/arch/arm64/include/asm/mman.h @@ -11,10 +11,10 @@ #include #include -static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot, +static inline vm_flags_t arch_calc_vm_prot_bits(unsigned long prot, unsigned long pkey) { - unsigned long ret = 0; + vm_flags_t ret = 0; if (system_supports_bti() && (prot & PROT_BTI)) ret |= VM_ARM64_BTI; @@ -34,8 +34,8 @@ static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot, } #define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey) -static inline unsigned long arch_calc_vm_flag_bits(struct file *file, - unsigned long flags) +static inline vm_flags_t arch_calc_vm_flag_bits(struct file *file, + unsigned long flags) { /* * Only allow MTE on anonymous mappings as these are guaranteed to be @@ -68,7 +68,7 @@ static inline bool arch_validate_prot(unsigned long prot, } #define arch_validate_prot(prot, addr) arch_validate_prot(prot, addr) -static inline bool arch_validate_flags(unsigned long vm_flags) +static inline bool arch_validate_flags(vm_flags_t vm_flags) { if (system_supports_mte()) { /* diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index ec0a337891dd..24be3e632f79 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -549,7 +549,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, const struct fault_info *inf; struct mm_struct *mm = current->mm; vm_fault_t fault; - unsigned long vm_flags; + vm_flags_t vm_flags; unsigned int mm_flags = FAULT_FLAG_DEFAULT; unsigned long addr = untagged_addr(far); struct vm_area_struct *vma; diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 00ab1d648db6..3d5fb37424ab 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -720,7 +720,7 @@ void mark_rodata_ro(void) static void __init declare_vma(struct vm_struct *vma, void *va_start, void *va_end, - unsigned long vm_flags) + vm_flags_t vm_flags) { phys_addr_t pa_start = __pa_symbol(va_start); unsigned long size = va_end - va_start; diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h index 42a51a993d94..912f78a956a1 100644 --- a/arch/powerpc/include/asm/mman.h +++ b/arch/powerpc/include/asm/mman.h @@ -14,7 +14,7 @@ #include #include -static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot, +static inline vm_flags_t arch_calc_vm_prot_bits(unsigned long prot, unsigned long pkey) { #ifdef CONFIG_PPC_MEM_KEYS diff --git 
a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h index 59a2c7dbc78f..28e752138996 100644 --- a/arch/powerpc/include/asm/pkeys.h +++ b/arch/powerpc/include/asm/pkeys.h @@ -30,9 +30,9 @@ extern u32 reserved_allocation_mask; /* bits set for reserved keys */ #endif -static inline u64 pkey_to_vmflag_bits(u16 pkey) +static inline vm_flags_t pkey_to_vmflag_bits(u16 pkey) { - return (((u64)pkey << VM_PKEY_SHIFT) & ARCH_VM_PKEY_FLAGS); + return (((vm_flags_t)pkey << VM_PKEY_SHIFT) & ARCH_VM_PKEY_FLAGS); } static inline int vma_pkey(struct vm_area_struct *vma) diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c index 3a6592a31a10..03f8c34fa0a2 100644 --- a/arch/powerpc/kvm/book3s_hv_uvmem.c +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c @@ -393,7 +393,7 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm, { unsigned long gfn = memslot->base_gfn; unsigned long end, start = gfn_to_hva(kvm, gfn); - unsigned long vm_flags; + vm_flags_t vm_flags; int ret = 0; struct vm_area_struct *vma; int merge_flag = (merge) ? MADV_MERGEABLE : MADV_UNMERGEABLE; diff --git a/arch/sparc/include/asm/mman.h b/arch/sparc/include/asm/mman.h index af9c10c83dc5..3e4bac33be81 100644 --- a/arch/sparc/include/asm/mman.h +++ b/arch/sparc/include/asm/mman.h @@ -28,7 +28,7 @@ static inline void ipi_set_tstate_mcde(void *arg) } #define arch_calc_vm_prot_bits(prot, pkey) sparc_calc_vm_prot_bits(prot) -static inline unsigned long sparc_calc_vm_prot_bits(unsigned long prot) +static inline vm_flags_t sparc_calc_vm_prot_bits(unsigned long prot) { if (adi_capable() && (prot & PROT_ADI)) { struct pt_regs *regs; @@ -58,7 +58,7 @@ static inline int sparc_validate_prot(unsigned long prot, unsigned long addr) /* arch_validate_flags() - Ensure combination of flags is valid for a * VMA. 
*/ -static inline bool arch_validate_flags(unsigned long vm_flags) +static inline bool arch_validate_flags(vm_flags_t vm_flags) { /* If ADI is being enabled on this VMA, check for ADI * capability on the platform and ensure VMA is suitable diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c index 279148e72459..308dbbae6c6e 100644 --- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -279,7 +279,7 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl, static struct sgx_encl_page *sgx_encl_load_page_in_vma(struct sgx_encl *encl, unsigned long addr, - unsigned long vm_flags) + vm_flags_t vm_flags) { unsigned long vm_prot_bits = vm_flags & VM_ACCESS_FLAGS; struct sgx_encl_page *entry; @@ -520,9 +520,9 @@ static void sgx_vma_open(struct vm_area_struct *vma) * Return: 0 on success, -EACCES otherwise */ int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start, - unsigned long end, unsigned long vm_flags) + unsigned long end, vm_flags_t vm_flags) { - unsigned long vm_prot_bits = vm_flags & VM_ACCESS_FLAGS; + vm_flags_t vm_prot_bits = vm_flags & VM_ACCESS_FLAGS; struct sgx_encl_page *page; unsigned long count = 0; int ret = 0; @@ -605,7 +605,7 @@ static int sgx_encl_debug_write(struct sgx_encl *encl, struct sgx_encl_page *pag */ static struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl, unsigned long addr, - unsigned long vm_flags) + vm_flags_t vm_flags) { struct sgx_encl_page *entry; diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h index f94ff14c9486..8ff47f6652b9 100644 --- a/arch/x86/kernel/cpu/sgx/encl.h +++ b/arch/x86/kernel/cpu/sgx/encl.h @@ -101,7 +101,7 @@ static inline int sgx_encl_find(struct mm_struct *mm, unsigned long addr, } int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start, - unsigned long end, unsigned long vm_flags); + unsigned long end, vm_flags_t vm_flags); bool current_is_ksgxd(void); void sgx_encl_release(struct kref *ref); diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index f684649b1008..991022e9e0d3 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -1215,7 +1215,7 @@ static inline void vma_set_page_prot(struct vm_area_struct *vma) WRITE_ONCE(vma->vm_page_prot, vm_page_prot); } -static inline bool arch_validate_flags(unsigned long) +static inline bool arch_validate_flags(vm_flags_t) { return true; } From f9550e1fcf3bef802c18fadc5c65485b66b28a63 Mon Sep 17 00:00:00 2001 From: Nathan Gao Date: Wed, 18 Jun 2025 09:33:31 -0700 Subject: [PATCH 087/370] mm/damon: fix minor typos in damon header Fix typos in include/linux/damon.h. Link: https://lkml.kernel.org/r/20250618163331.54910-1-sj@kernel.org Signed-off-by: Nathan Gao Reviewed-by: SeongJae Park Signed-off-by: Andrew Morton --- include/linux/damon.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/damon.h b/include/linux/damon.h index a4011726cb3b..bb58e36f019e 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -450,7 +450,7 @@ struct damos_access_pattern { /** * struct damos - Represents a Data Access Monitoring-based Operation Scheme. * @pattern: Access pattern of target regions. - * @action: &damo_action to be applied to the target regions. + * @action: &damos_action to be applied to the target regions. * @apply_interval_us: The time between applying the @action. * @quota: Control the aggressiveness of this scheme. 
* @wmarks: Watermarks for automated (in)activation of this scheme. @@ -656,7 +656,7 @@ struct damon_call_control { * struct damon_intervals_goal - Monitoring intervals auto-tuning goal. * * @access_bp: Access events observation ratio to achieve in bp. - * @aggrs: Number of aggregations to acheive @access_bp within. + * @aggrs: Number of aggregations to achieve @access_bp within. * @min_sample_us: Minimum resulting sampling interval in microseconds. * @max_sample_us: Maximum resulting sampling interval in microseconds. * From d29d64afa2b20d9bd210c01bfff78545675b5135 Mon Sep 17 00:00:00 2001 From: Petr Pavlu Date: Wed, 18 Jun 2025 14:50:35 +0200 Subject: [PATCH 088/370] codetag: avoid unused alloc_tags sections/symbols With CONFIG_MEM_ALLOC_PROFILING=n, vmlinux and all modules unnecessarily contain the symbols __start_alloc_tags and __stop_alloc_tags, which define an empty range. In the case of modules, the presence of these symbols also forces the linker to create an empty .codetag.alloc_tags section. Update codetag.lds.h to make the data conditional on CONFIG_MEM_ALLOC_PROFILING. Link: https://lkml.kernel.org/r/20250618125037.53182-1-petr.pavlu@suse.com Signed-off-by: Petr Pavlu Reviewed-by: Kent Overstreet Reviewed-by: Suren Baghdasaryan Cc: Arnd Bergmann Cc: Casey Chen Cc: Petr Pavlu Signed-off-by: Andrew Morton --- include/asm-generic/codetag.lds.h | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h index a45fe3d141a1..a14f4bdafdda 100644 --- a/include/asm-generic/codetag.lds.h +++ b/include/asm-generic/codetag.lds.h @@ -2,6 +2,12 @@ #ifndef __ASM_GENERIC_CODETAG_LDS_H #define __ASM_GENERIC_CODETAG_LDS_H +#ifdef CONFIG_MEM_ALLOC_PROFILING +#define IF_MEM_ALLOC_PROFILING(...) __VA_ARGS__ +#else +#define IF_MEM_ALLOC_PROFILING(...) +#endif + #define SECTION_WITH_BOUNDARIES(_name) \ . = ALIGN(8); \ __start_##_name = .; \ @@ -9,7 +15,7 @@ __stop_##_name = .; #define CODETAG_SECTIONS() \ - SECTION_WITH_BOUNDARIES(alloc_tags) + IF_MEM_ALLOC_PROFILING(SECTION_WITH_BOUNDARIES(alloc_tags)) #define MOD_SEPARATE_CODETAG_SECTION(_name) \ .codetag.##_name : { \ @@ -22,6 +28,6 @@ * unload them individually once unused. */ #define MOD_SEPARATE_CODETAG_SECTIONS() \ - MOD_SEPARATE_CODETAG_SECTION(alloc_tags) + IF_MEM_ALLOC_PROFILING(MOD_SEPARATE_CODETAG_SECTION(alloc_tags)) #endif /* __ASM_GENERIC_CODETAG_LDS_H */ From 986f5f2b4be3b7eab9ecd85c472d03a2191d6fc0 Mon Sep 17 00:00:00 2001 From: Vivek Kasireddy Date: Tue, 17 Jun 2025 22:30:53 -0700 Subject: [PATCH 089/370] mm/hugetlb: make hugetlb_reserve_pages() return nr of entries updated Patch series "mm/memfd: Reserve hugetlb folios before allocation", v4. There are cases when we try to pin a folio but discover that it has not been faulted-in. So, we try to allocate it in memfd_alloc_folio() but the allocation request may not succeed if there are no active reservations in the system at that instant. Therefore, making a reservation (by calling hugetlb_reserve_pages()) associated with the allocation will ensure that our request would not fail due to lack of reservations. This will also ensure that proper region/subpool accounting is done with our allocation. This patch (of 3): Currently, hugetlb_reserve_pages() returns a bool to indicate whether the reservation map update for the range [from, to] was successful or not. This is not sufficient for the case where the caller needs to determine how many entries were updated for the range. 
Therefore, have hugetlb_reserve_pages() return the number of entries updated in the reservation map associated with the range [from, to]. Also, update the callers of hugetlb_reserve_pages() to handle the new return value. Link: https://lkml.kernel.org/r/20250618053415.1036185-1-vivek.kasireddy@intel.com Link: https://lkml.kernel.org/r/20250618053415.1036185-2-vivek.kasireddy@intel.com Signed-off-by: Vivek Kasireddy Cc: Steve Sistare Cc: Muchun Song Cc: David Hildenbrand Cc: Gerd Hoffmann Cc: Oscar Salvador Signed-off-by: Andrew Morton --- fs/hugetlbfs/inode.c | 8 ++++---- include/linux/hugetlb.h | 2 +- mm/hugetlb.c | 19 +++++++++++++------ 3 files changed, 18 insertions(+), 11 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index e4de5425838d..00b2d1a032fd 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -150,10 +150,10 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma) if (inode->i_flags & S_PRIVATE) vm_flags |= VM_NORESERVE; - if (!hugetlb_reserve_pages(inode, + if (hugetlb_reserve_pages(inode, vma->vm_pgoff >> huge_page_order(h), len >> huge_page_shift(h), vma, - vm_flags)) + vm_flags) < 0) goto out; ret = 0; @@ -1561,9 +1561,9 @@ struct file *hugetlb_file_setup(const char *name, size_t size, inode->i_size = size; clear_nlink(inode); - if (!hugetlb_reserve_pages(inode, 0, + if (hugetlb_reserve_pages(inode, 0, size >> huge_page_shift(hstate_inode(inode)), NULL, - acctflag)) + acctflag) < 0) file = ERR_PTR(-ENOMEM); else file = alloc_file_pseudo(inode, mnt, name, O_RDWR, diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 42f374e828a2..d8310b0f36dd 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -149,7 +149,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, uffd_flags_t flags, struct folio **foliop); #endif /* CONFIG_USERFAULTFD */ -bool hugetlb_reserve_pages(struct inode *inode, long from, long to, +long hugetlb_reserve_pages(struct inode *inode, long from, long to, struct vm_area_struct *vma, vm_flags_t vm_flags); long hugetlb_unreserve_pages(struct inode *inode, long start, long end, diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c7ba95030241..e1570735012a 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -7244,8 +7244,15 @@ long hugetlb_change_protection(struct vm_area_struct *vma, return pages > 0 ? (pages << h->order) : pages; } -/* Return true if reservation was successful, false otherwise. */ -bool hugetlb_reserve_pages(struct inode *inode, +/* + * Update the reservation map for the range [from, to]. + * + * Returns the number of entries that would be added to the reservation map + * associated with the range [from, to]. This number is greater or equal to + * zero. -EINVAL or -ENOMEM is returned in case of any errors. 
+ */ + +long hugetlb_reserve_pages(struct inode *inode, long from, long to, struct vm_area_struct *vma, vm_flags_t vm_flags) @@ -7260,7 +7267,7 @@ bool hugetlb_reserve_pages(struct inode *inode, /* This should never happen */ if (from > to) { VM_WARN(1, "%s called with a negative range\n", __func__); - return false; + return -EINVAL; } /* @@ -7275,7 +7282,7 @@ bool hugetlb_reserve_pages(struct inode *inode, * without using reserves */ if (vm_flags & VM_NORESERVE) - return true; + return 0; /* * Shared mappings base their reservation on the number of pages that @@ -7382,7 +7389,7 @@ bool hugetlb_reserve_pages(struct inode *inode, hugetlb_cgroup_put_rsvd_cgroup(h_cg); } } - return true; + return chg; out_put_pages: spool_resv = chg - gbl_reserve; @@ -7410,7 +7417,7 @@ bool hugetlb_reserve_pages(struct inode *inode, kref_put(&resv_map->refs, resv_map_release); set_vma_resv_map(vma, NULL); } - return false; + return chg < 0 ? chg : add < 0 ? add : -EINVAL; } long hugetlb_unreserve_pages(struct inode *inode, long start, long end, From 717cf9357325055ab6b41c4e0581f4d67601eb58 Mon Sep 17 00:00:00 2001 From: Vivek Kasireddy Date: Tue, 17 Jun 2025 22:30:54 -0700 Subject: [PATCH 090/370] mm/memfd: reserve hugetlb folios before allocation When we try to allocate a folio via alloc_hugetlb_folio_reserve(), we need to ensure that there is an active reservation associated with the allocation. Otherwise, our allocation request would fail if there are no active reservations made at that moment against any other allocations. This is because alloc_hugetlb_folio_reserve() checks h->resv_huge_pages before proceeding with the allocation. Therefore, to address this issue, we just need to make a reservation (by calling hugetlb_reserve_pages()) before we try to allocate the folio. This will also ensure that proper region/subpool accounting is done associated with our allocation. 
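For illustration, the resulting flow in memfd_alloc_folio() condenses to the sketch below (error handling and unrelated details simplified; the hunks that follow are authoritative):

	nr_resv = hugetlb_reserve_pages(inode, idx, idx + 1, NULL, 0);
	if (nr_resv < 0)
		return ERR_PTR(nr_resv);

	folio = alloc_hugetlb_folio_reserve(h, numa_node_id(), NULL, gfp_mask);
	if (!folio) {
		/* back out the reservation we just made */
		if (nr_resv > 0)
			hugetlb_unreserve_pages(inode, idx, idx + 1, 0);
		return ERR_PTR(-ENOMEM);
	}
	hugetlb_set_folio_subpool(folio, subpool_inode(inode));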
Link: https://lkml.kernel.org/r/20250618053415.1036185-3-vivek.kasireddy@intel.com Signed-off-by: Vivek Kasireddy Cc: Steve Sistare Cc: Muchun Song Cc: David Hildenbrand Cc: Gerd Hoffmann Cc: Oscar Salvador Signed-off-by: Andrew Morton --- include/linux/hugetlb.h | 5 +++++ mm/hugetlb.c | 5 ----- mm/memfd.c | 17 ++++++++++++++--- 3 files changed, 19 insertions(+), 8 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index d8310b0f36dd..c6c87eae4a8d 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -740,6 +740,11 @@ extern unsigned int default_hstate_idx; #define default_hstate (hstates[default_hstate_idx]) +static inline struct hugepage_subpool *subpool_inode(struct inode *inode) +{ + return HUGETLBFS_SB(inode->i_sb)->spool; +} + static inline struct hugepage_subpool *hugetlb_folio_subpool(struct folio *folio) { return folio->_hugetlb_subpool; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index e1570735012a..da9f10710c27 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -284,11 +284,6 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool, return ret; } -static inline struct hugepage_subpool *subpool_inode(struct inode *inode) -{ - return HUGETLBFS_SB(inode->i_sb)->spool; -} - static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma) { return subpool_inode(file_inode(vma->vm_file)); diff --git a/mm/memfd.c b/mm/memfd.c index b558c4c3bd27..32fa6bfe57d1 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -70,7 +70,6 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) #ifdef CONFIG_HUGETLB_PAGE struct folio *folio; gfp_t gfp_mask; - int err; if (is_file_hugepages(memfd)) { /* @@ -79,12 +78,19 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) * alloc from. Also, the folio will be pinned for an indefinite * amount of time, so it is not expected to be migrated away. */ + struct inode *inode = file_inode(memfd); struct hstate *h = hstate_file(memfd); + int err = -ENOMEM; + long nr_resv; gfp_mask = htlb_alloc_mask(h); gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE); idx >>= huge_page_order(h); + nr_resv = hugetlb_reserve_pages(inode, idx, idx + 1, NULL, 0); + if (nr_resv < 0) + return ERR_PTR(nr_resv); + folio = alloc_hugetlb_folio_reserve(h, numa_node_id(), NULL, @@ -95,12 +101,17 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) idx); if (err) { folio_put(folio); - return ERR_PTR(err); + goto err_unresv; } + + hugetlb_set_folio_subpool(folio, subpool_inode(inode)); folio_unlock(folio); return folio; } - return ERR_PTR(-ENOMEM); +err_unresv: + if (nr_resv > 0) + hugetlb_unreserve_pages(inode, idx, idx + 1, 0); + return ERR_PTR(err); } #endif return shmem_read_folio(memfd->f_mapping, idx); From cf34cfbf1784be7ce9f22789aa157535eff2195f Mon Sep 17 00:00:00 2001 From: Vivek Kasireddy Date: Tue, 17 Jun 2025 22:30:55 -0700 Subject: [PATCH 091/370] selftests/udmabuf: add a test to pin first before writing to memfd Unlike the existing tests, this new test will create a memfd (backed by hugetlb) and pin the folios in it (a small subset) before writing/ populating it with data. This is a valid use-case that invokes the memfd_alloc_folio() kernel API and is expected to work unless there aren't enough hugetlb folios to satisfy the allocation needs. 
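For illustration, the ordering the new test case exercises is, in outline (helper names are those already defined in this selftest; the hunk below is authoritative):

	memfd = create_memfd_with_seals(size, true);
	buf = create_udmabuf_list(devfd, memfd, size);	/* pin the folios first */
	addr1 = mmap_fd(memfd, size);
	write_to_memfd(addr1, size, 'a');		/* populate only after pinning */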
Link: https://lkml.kernel.org/r/20250618053415.1036185-4-vivek.kasireddy@intel.com Signed-off-by: Vivek Kasireddy Cc: Gerd Hoffmann Cc: Steve Sistare Cc: Muchun Song Cc: David Hildenbrand Cc: Oscar Salvador Signed-off-by: Andrew Morton --- .../selftests/drivers/dma-buf/udmabuf.c | 20 ++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/drivers/dma-buf/udmabuf.c b/tools/testing/selftests/drivers/dma-buf/udmabuf.c index 6062723a172e..77aa2897e79f 100644 --- a/tools/testing/selftests/drivers/dma-buf/udmabuf.c +++ b/tools/testing/selftests/drivers/dma-buf/udmabuf.c @@ -138,7 +138,7 @@ int main(int argc, char *argv[]) void *addr1, *addr2; ksft_print_header(); - ksft_set_plan(6); + ksft_set_plan(7); devfd = open("/dev/udmabuf", O_RDWR); if (devfd < 0) { @@ -248,6 +248,24 @@ int main(int argc, char *argv[]) else ksft_test_result_pass("%s: [PASS,test-6]\n", TEST_PREFIX); + close(buf); + close(memfd); + + /* same test as above but we pin first before writing to memfd */ + page_size = getpagesize() * 512; /* 2 MB */ + size = MEMFD_SIZE * page_size; + memfd = create_memfd_with_seals(size, true); + buf = create_udmabuf_list(devfd, memfd, size); + addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize()); + addr1 = mmap_fd(memfd, size); + write_to_memfd(addr1, size, 'a'); + write_to_memfd(addr1, size, 'b'); + ret = compare_chunks(addr1, addr2, size); + if (ret < 0) + ksft_test_result_fail("%s: [FAIL,test-7]\n", TEST_PREFIX); + else + ksft_test_result_pass("%s: [PASS,test-7]\n", TEST_PREFIX); + close(buf); close(memfd); close(devfd); From 59b5ed409d03bc8b7bb153d78afcd7cea9d7bbfa Mon Sep 17 00:00:00 2001 From: Hao Ge Date: Wed, 18 Jun 2025 09:58:09 +0800 Subject: [PATCH 092/370] mm/percpu: conditionally define _shared_alloc_tag via CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU Recently discovered this entry while checking kallsyms on ARM64: ffff800083e509c0 D _shared_alloc_tag If ARCH_NEEDS_WEAK_PER_CPU is not defined (it is only defined for the s390 and alpha architectures), there's no need to statically define the percpu variable _shared_alloc_tag. Therefore, we need to isolate its definition so it is only emitted when required. When building the core kernel code for s390 or alpha architectures, ARCH_NEEDS_WEAK_PER_CPU remains undefined (as it is gated by #if defined(MODULE)). However, when building modules for these architectures, the macro is explicitly defined. Therefore, we remove all instances of ARCH_NEEDS_WEAK_PER_CPU from the code and introduce CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU to replace the relevant logic. We can now conditionally define the percpu variable _shared_alloc_tag based on CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU. This allows architectures (such as s390/alpha) that require weak definitions for percpu variables in modules to include the definition, while others can omit it via compile-time exclusion.
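For illustration, the end result in lib/alloc_tag.c is simply a config-gated definition (a condensed sketch; the hunks below are authoritative):

#ifdef CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU
/* Only architectures needing weak per-cpu definitions in modules get this. */
DEFINE_PER_CPU(struct alloc_tag_counters, _shared_alloc_tag);
EXPORT_SYMBOL(_shared_alloc_tag);
#endif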
Link: https://lkml.kernel.org/r/20250618015809.1235761-1-hao.ge@linux.dev Signed-off-by: Hao Ge Suggested-by: Suren Baghdasaryan Acked-by: Alexander Gordeev [s390] Acked-by: Mike Rapoport (Microsoft) Cc: Chistoph Lameter Cc: Christian Borntraeger Cc: David Hildenbrand Cc: Dennis Zhou Cc: Heiko Carstens Cc: Kent Overstreet Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Matt Turner Cc: Richard Henderson Cc: Sven Schnelle Cc: Tejun Heo Cc: Vasily Gorbik Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/alpha/Kconfig | 1 + arch/alpha/include/asm/percpu.h | 5 ++--- arch/s390/Kconfig | 1 + arch/s390/include/asm/percpu.h | 5 ++--- include/linux/alloc_tag.h | 6 +++--- include/linux/percpu-defs.h | 7 ++++--- lib/alloc_tag.c | 2 ++ mm/Kconfig | 7 +++++++ 8 files changed, 22 insertions(+), 12 deletions(-) diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig index 109a4cddcd13..80367f2cf821 100644 --- a/arch/alpha/Kconfig +++ b/arch/alpha/Kconfig @@ -7,6 +7,7 @@ config ALPHA select ARCH_HAS_DMA_OPS if PCI select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO + select ARCH_MODULE_NEEDS_WEAK_PER_CPU if SMP select ARCH_NO_PREEMPT select ARCH_NO_SG_CHAIN select ARCH_USE_CMPXCHG_LOCKREF diff --git a/arch/alpha/include/asm/percpu.h b/arch/alpha/include/asm/percpu.h index 6923249f2d49..4383d66341dc 100644 --- a/arch/alpha/include/asm/percpu.h +++ b/arch/alpha/include/asm/percpu.h @@ -9,10 +9,9 @@ * way above 4G. * * Always use weak definitions for percpu variables in modules. + * Therefore, we have enabled CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU + * in the Kconfig. */ -#if defined(MODULE) && defined(CONFIG_SMP) -#define ARCH_NEEDS_WEAK_PER_CPU -#endif #include diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 0c16dc443e2f..b652cb952f31 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -132,6 +132,7 @@ config S390 select ARCH_INLINE_WRITE_UNLOCK_IRQ select ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE + select ARCH_MODULE_NEEDS_WEAK_PER_CPU select ARCH_STACKWALK select ARCH_SUPPORTS_ATOMIC_RMW select ARCH_SUPPORTS_DEBUG_PAGEALLOC diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h index 84f6b8357b45..96af7d964014 100644 --- a/arch/s390/include/asm/percpu.h +++ b/arch/s390/include/asm/percpu.h @@ -16,10 +16,9 @@ * For 64 bit module code, the module may be more than 4G above the * per cpu area, use weak definitions to force the compiler to * generate external references. + * Therefore, we have enabled CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU + * in the Kconfig. */ -#if defined(MODULE) -#define ARCH_NEEDS_WEAK_PER_CPU -#endif /* * We use a compare-and-swap loop since that uses less cpu cycles than diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h index 8f7931eb7d16..9ef2633e2c08 100644 --- a/include/linux/alloc_tag.h +++ b/include/linux/alloc_tag.h @@ -88,7 +88,7 @@ static inline struct alloc_tag *ct_to_alloc_tag(struct codetag *ct) return container_of(ct, struct alloc_tag, ct); } -#ifdef ARCH_NEEDS_WEAK_PER_CPU +#if defined(CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU) && defined(MODULE) /* * When percpu variables are required to be defined as weak, static percpu * variables can't be used inside a function (see comments for DECLARE_PER_CPU_SECTION). 
@@ -102,7 +102,7 @@ DECLARE_PER_CPU(struct alloc_tag_counters, _shared_alloc_tag); .ct = CODE_TAG_INIT, \ .counters = &_shared_alloc_tag }; -#else /* ARCH_NEEDS_WEAK_PER_CPU */ +#else /* CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU && MODULE */ #ifdef MODULE @@ -123,7 +123,7 @@ DECLARE_PER_CPU(struct alloc_tag_counters, _shared_alloc_tag); #endif /* MODULE */ -#endif /* ARCH_NEEDS_WEAK_PER_CPU */ +#endif /* CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU && MODULE */ DECLARE_STATIC_KEY_MAYBE(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT, mem_alloc_profiling_key); diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h index c16cdeaa505e..12d90360f6db 100644 --- a/include/linux/percpu-defs.h +++ b/include/linux/percpu-defs.h @@ -63,14 +63,15 @@ * 1. The symbol must be globally unique, even the static ones. * 2. Static percpu variables cannot be defined inside a function. * - * Archs which need weak percpu definitions should define - * ARCH_NEEDS_WEAK_PER_CPU in asm/percpu.h when necessary. + * Archs which need weak percpu definitions should set + * CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU when necessary. * * To ensure that the generic code observes the above two * restrictions, if CONFIG_DEBUG_FORCE_WEAK_PER_CPU is set weak * definition is used for all cases. */ -#if defined(ARCH_NEEDS_WEAK_PER_CPU) || defined(CONFIG_DEBUG_FORCE_WEAK_PER_CPU) +#if (defined(CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU) && defined(MODULE)) || \ + defined(CONFIG_DEBUG_FORCE_WEAK_PER_CPU) /* * __pcpu_scope_* dummy variable is used to enforce scope. It * receives the static modifier when it's used in front of diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c index 36f07dc95069..41ccfb035b7b 100644 --- a/lib/alloc_tag.c +++ b/lib/alloc_tag.c @@ -25,8 +25,10 @@ static bool mem_profiling_support; static struct codetag_type *alloc_tag_cttype; +#ifdef CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU DEFINE_PER_CPU(struct alloc_tag_counters, _shared_alloc_tag); EXPORT_SYMBOL(_shared_alloc_tag); +#endif DEFINE_STATIC_KEY_MAYBE(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT, mem_alloc_profiling_key); diff --git a/mm/Kconfig b/mm/Kconfig index 3b2060a61c05..065b1f19dd99 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -933,6 +933,13 @@ config ARCH_SUPPORTS_PUD_PFNMAP def_bool y depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD +# +# Architectures that always use weak definitions for percpu +# variables in modules should set this. +# +config ARCH_MODULE_NEEDS_WEAK_PER_CPU + bool + # # UP and nommu archs use km based percpu allocator # From 0544f3f78da3d7d2baf546e70773a8d66cdb127e Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Thu, 19 Jun 2025 18:57:53 +1000 Subject: [PATCH 093/370] mm: convert pXd_devmap checks to vma_is_dax MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "mm: Remove pXX_devmap page table bit and pfn_t type", v3. All users of dax now require a ZONE_DEVICE page which is properly refcounted. This means there is no longer any need for the PFN_DEV, PFN_MAP and PFN_SPECIAL flags. Furthermore the PFN_SG_CHAIN and PFN_SG_LAST flags never appear to have been used. It is therefore possible to remove the pfn_t type and replace any usage with raw pfns. The remaining users of PFN_DEV have simply passed this to vmf_insert_mixed() to create pte_devmap() mappings. It is unclear why this was the case but presumably to ensure vm_normal_page() does not return these pages. 
These users can be trivially converted to raw pfns and creating a pXX_special() mapping to ensure vm_normal_page() still doesn't return these pages. Now that there are no users of PFN_DEV we can remove the devmap page table bit and all associated functions and macros, freeing up a software page table bit. This patch (of 14): Currently dax is the only user of pmd and pud mapped ZONE_DEVICE pages. Therefore page walkers that want to exclude DAX pages can check pmd_devmap or pud_devmap. However soon dax will no longer set PFN_DEV, meaning dax pages are mapped as normal pages. Ensure page walkers that currently use pXd_devmap to skip DAX pages continue to do so by adding explicit checks of the VMA instead. Remove vma_is_dax() check from mm/userfaultfd.c as validate_move_areas() will already skip DAX VMA's on account of them not being anonymous. The check in userfaultfd_must_wait() is also redundant as vma_can_userfault() should have already filtered out dax vma's. For HMM the pud_devmap check seems unnecessary as there is no reason we shouldn't be able to handle any leaf PUD here so remove it also. Link: https://lkml.kernel.org/r/cover.176965585864cb8d2cf41464b44dcc0471e643a0.1750323463.git-series.apopple@nvidia.com Link: https://lkml.kernel.org/r/f0611f6f475f48fcdf34c65084a359aefef4cccc.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple Reviewed-by: Jason Gunthorpe Reviewed-by: Dan Williams Cc: Balbir Singh Cc: Björn Töpel Cc: Christoph Hellwig Cc: Chunyan Zhang Cc: David Hildenbrand Cc: Deepak Gupta Cc: Gerald Schaefer Cc: Inki Dae Cc: John Groves Cc: John Hubbard Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Björn Töpel Cc: Will Deacon Signed-off-by: Andrew Morton --- fs/userfaultfd.c | 2 +- mm/hmm.c | 2 +- mm/userfaultfd.c | 6 ------ 3 files changed, 2 insertions(+), 8 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 48e82e19d831..2a644aa1a510 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -304,7 +304,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx, goto out; ret = false; - if (!pmd_present(_pmd) || pmd_devmap(_pmd)) + if (!pmd_present(_pmd)) goto out; if (pmd_trans_huge(_pmd)) { diff --git a/mm/hmm.c b/mm/hmm.c index feac86196a65..d0c5c35807a2 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -441,7 +441,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end, return hmm_vma_walk_hole(start, end, -1, walk); } - if (pud_leaf(pud) && pud_devmap(pud)) { + if (pud_leaf(pud)) { unsigned long i, npages, pfn; unsigned int required_fault; unsigned long *hmm_pfns; diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 95dd8dea6ee4..dd2a25fafb82 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -1818,12 +1818,6 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start, ptl = pmd_trans_huge_lock(src_pmd, src_vma); if (ptl) { - if (pmd_devmap(*src_pmd)) { - spin_unlock(ptl); - err = -ENOENT; - break; - } - /* Check if we can move the pmd without splitting it. */ if (move_splits_huge_pmd(dst_addr, src_addr, src_start + len) || !pmd_none(dst_pmdval)) { From 6b4a80e424cd2f99ca7262a8c0c725ec82e57b8a Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Thu, 19 Jun 2025 18:57:54 +1000 Subject: [PATCH 094/370] mm: filter zone device pages returned from folio_walk_start() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Previously dax pages were skipped by the pagewalk code as pud_special() or vm_normal_page{_pmd}() would be false for DAX pages. 
Now that dax pages are refcounted normally, that is no longer the case, so the pagewalk code will start returning them. Most callers already explicitly filter for DAX or zone device pages, so they don't need updating. However some don't, so add checks to those callers. Link: https://lkml.kernel.org/r/4ecb7b357fc5b435588024770b88bbb695c30090.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple Cc: Balbir Singh Cc: Björn Töpel Cc: Björn Töpel Cc: Christoph Hellwig Cc: Chunyan Zhang Cc: Dan Williams Cc: David Hildenbrand Cc: Deepak Gupta Cc: Gerald Schaefer Cc: Inki Dae Cc: Jason Gunthorpe Cc: John Groves Cc: John Hubbard Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Will Deacon Signed-off-by: Andrew Morton --- kernel/events/uprobes.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 8a601df87072..f774367c8e71 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -539,7 +539,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma, } ret = 0; - if (unlikely(!folio_test_anon(folio))) { + if (unlikely(!folio_test_anon(folio) || folio_is_zone_device(folio))) { VM_WARN_ON_ONCE(is_register); folio_put(folio); goto out; From 79065255abc4b0fae843f539f7f5f0fe9fcac1e6 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Thu, 19 Jun 2025 18:57:55 +1000 Subject: [PATCH 095/370] mm: remove remaining uses of PFN_DEV MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit PFN_DEV was used by callers of dax_direct_access() to figure out if the returned PFN is associated with a page using pfn_t_has_page() or not. However, all DAX PFNs now require an associated ZONE_DEVICE page, so callers can assume a page exists. Other users of PFN_DEV were setting it before calling vmf_insert_mixed(). This is unnecessary as it is no longer checked, instead relying on pfn_valid() to determine if there is an associated page or not.
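For illustration, a typical call-site conversion looks like this (a condensed sketch; the hunks below are authoritative):

	/* before: PFN_DEV requested a devmap mapping */
	vmf_insert_mixed(vma, address, __pfn_to_pfn_t(pfn, PFN_DEV));
	/* after: no flag needed, pfn_valid() decides whether a struct page exists */
	vmf_insert_mixed(vma, address, __pfn_to_pfn_t(pfn, 0));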
Link: https://lkml.kernel.org/r/74b293aebc21b941090bc3e7aeafa91b57c821a5.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple Reviewed-by: Christoph Hellwig Reviewed-by: Jason Gunthorpe Reviewed-by: Dan Williams Cc: Balbir Singh Cc: Björn Töpel Cc: Björn Töpel Cc: Chunyan Zhang Cc: David Hildenbrand Cc: Deepak Gupta Cc: Gerald Schaefer Cc: Inki Dae Cc: John Groves Cc: John Hubbard Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Will Deacon Signed-off-by: Andrew Morton --- drivers/gpu/drm/gma500/fbdev.c | 2 +- drivers/gpu/drm/omapdrm/omap_gem.c | 5 ++--- drivers/s390/block/dcssblk.c | 3 +-- drivers/vfio/pci/vfio_pci_core.c | 6 ++---- fs/cramfs/inode.c | 2 +- mm/memory.c | 2 +- 6 files changed, 8 insertions(+), 12 deletions(-) diff --git a/drivers/gpu/drm/gma500/fbdev.c b/drivers/gpu/drm/gma500/fbdev.c index 8edefea2ef59..109efdc96ac5 100644 --- a/drivers/gpu/drm/gma500/fbdev.c +++ b/drivers/gpu/drm/gma500/fbdev.c @@ -33,7 +33,7 @@ static vm_fault_t psb_fbdev_vm_fault(struct vm_fault *vmf) vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); for (i = 0; i < page_num; ++i) { - err = vmf_insert_mixed(vma, address, __pfn_to_pfn_t(pfn, PFN_DEV)); + err = vmf_insert_mixed(vma, address, __pfn_to_pfn_t(pfn, 0)); if (unlikely(err & VM_FAULT_ERROR)) break; address += PAGE_SIZE; diff --git a/drivers/gpu/drm/omapdrm/omap_gem.c b/drivers/gpu/drm/omapdrm/omap_gem.c index b9c67e4ca360..9df05b2b7ba0 100644 --- a/drivers/gpu/drm/omapdrm/omap_gem.c +++ b/drivers/gpu/drm/omapdrm/omap_gem.c @@ -371,8 +371,7 @@ static vm_fault_t omap_gem_fault_1d(struct drm_gem_object *obj, VERB("Inserting %p pfn %lx, pa %lx", (void *)vmf->address, pfn, pfn << PAGE_SHIFT); - return vmf_insert_mixed(vma, vmf->address, - __pfn_to_pfn_t(pfn, PFN_DEV)); + return vmf_insert_mixed(vma, vmf->address, __pfn_to_pfn_t(pfn, 0)); } /* Special handling for the case of faulting in 2d tiled buffers */ @@ -468,7 +467,7 @@ static vm_fault_t omap_gem_fault_2d(struct drm_gem_object *obj, for (i = n; i > 0; i--) { ret = vmf_insert_mixed(vma, - vaddr, __pfn_to_pfn_t(pfn, PFN_DEV)); + vaddr, __pfn_to_pfn_t(pfn, 0)); if (ret & VM_FAULT_ERROR) break; pfn += priv->usergart[fmt].stride_pfn; diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c index cdc7b2f16b88..249ae403f698 100644 --- a/drivers/s390/block/dcssblk.c +++ b/drivers/s390/block/dcssblk.c @@ -923,8 +923,7 @@ __dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff, if (kaddr) *kaddr = __va(dev_info->start + offset); if (pfn) - *pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), - PFN_DEV); + *pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), 0); return (dev_sz - offset) / PAGE_SIZE; } diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 6328c3a05bcd..3f2ad5fb4c17 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -1669,14 +1669,12 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf, break; #ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP case PMD_ORDER: - ret = vmf_insert_pfn_pmd(vmf, - __pfn_to_pfn_t(pfn, PFN_DEV), false); + ret = vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn, 0), false); break; #endif #ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP case PUD_ORDER: - ret = vmf_insert_pfn_pud(vmf, - __pfn_to_pfn_t(pfn, PFN_DEV), false); + ret = vmf_insert_pfn_pud(vmf, __pfn_to_pfn_t(pfn, 0), false); break; #endif default: diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c index b84d1747a020..820a664cfec7 100644 --- a/fs/cramfs/inode.c +++ 
b/fs/cramfs/inode.c @@ -412,7 +412,7 @@ static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma) for (i = 0; i < pages && !ret; i++) { vm_fault_t vmf; unsigned long off = i * PAGE_SIZE; - pfn_t pfn = phys_to_pfn_t(address + off, PFN_DEV); + pfn_t pfn = phys_to_pfn_t(address + off, 0); vmf = vmf_insert_mixed(vma, vma->vm_start + off, pfn); if (vmf & VM_FAULT_ERROR) ret = vm_fault_to_errno(vmf, 0); diff --git a/mm/memory.c b/mm/memory.c index 833426fa5fe0..ea1388d2a87a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2557,7 +2557,7 @@ vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr, pfnmap_setup_cachemode_pfn(pfn, &pgprot); - return insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot, + return insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, 0), pgprot, false); } EXPORT_SYMBOL(vmf_insert_pfn_prot); From 4b1d3145c1045d4b0db4622cd2c224a247f25a20 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Thu, 19 Jun 2025 18:57:56 +1000 Subject: [PATCH 096/370] mm: convert vmf_insert_mixed() from using pte_devmap to pte_special MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit DAX no longer requires device PTEs as it always has a ZONE_DEVICE page associated with the PTE that can be reference counted normally. Other users of pte_devmap are drivers that set PFN_DEV when calling vmf_insert_mixed() which ensures vm_normal_page() returns NULL for these entries. There is no reason to distinguish these pte_devmap users so in order to free up a PTE bit use pte_special instead for entries created with vmf_insert_mixed(). This will ensure vm_normal_page() will continue to return NULL for these pages. Architectures that don't support pte_special also don't support pte_devmap so those will continue to rely on pfn_valid() to determine if the page can be mapped. Link: https://lkml.kernel.org/r/93086bd446e7bf8e4c85345613ac18f706b01f60.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple Reviewed-by: Jason Gunthorpe Reviewed-by: Dan Williams Cc: Balbir Singh Cc: Björn Töpel Cc: Björn Töpel Cc: Christoph Hellwig Cc: Chunyan Zhang Cc: David Hildenbrand Cc: Deepak Gupta Cc: Gerald Schaefer Cc: Inki Dae Cc: John Groves Cc: John Hubbard Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Will Deacon Signed-off-by: Andrew Morton --- mm/hmm.c | 3 --- mm/memory.c | 20 ++------------------ mm/vmscan.c | 2 +- 3 files changed, 3 insertions(+), 22 deletions(-) diff --git a/mm/hmm.c b/mm/hmm.c index d0c5c35807a2..14914da98416 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -302,13 +302,10 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, goto fault; /* - * Bypass devmap pte such as DAX page when all pfn requested - * flags(pfn_req_flags) are fulfilled. * Since each architecture defines a struct page for the zero page, just * fall through and treat it like a normal page. 
*/ if (!vm_normal_page(walk->vma, addr, pte) && - !pte_devmap(pte) && !is_zero_pfn(pte_pfn(pte))) { if (hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0)) { pte_unmap(ptep); diff --git a/mm/memory.c b/mm/memory.c index ea1388d2a87a..01d51bd95197 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -598,16 +598,6 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, return NULL; if (is_zero_pfn(pfn)) return NULL; - if (pte_devmap(pte)) - /* - * NOTE: New users of ZONE_DEVICE will not set pte_devmap() - * and will have refcounts incremented on their struct pages - * when they are inserted into PTEs, thus they are safe to - * return here. Legacy ZONE_DEVICE pages that set pte_devmap() - * do not have refcounts. Example of legacy ZONE_DEVICE is - * MEMORY_DEVICE_FS_DAX type in pmem or virtio_fs drivers. - */ - return NULL; print_bad_pte(vma, addr, pte, NULL); return NULL; @@ -2483,10 +2473,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr, } /* Ok, finally just insert the thing.. */ - if (pfn_t_devmap(pfn)) - entry = pte_mkdevmap(pfn_t_pte(pfn, prot)); - else - entry = pte_mkspecial(pfn_t_pte(pfn, prot)); + entry = pte_mkspecial(pfn_t_pte(pfn, prot)); if (mkwrite) { entry = pte_mkyoung(entry); @@ -2597,8 +2584,6 @@ static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn, bool mkwrite) /* these checks mirror the abort conditions in vm_normal_page */ if (vma->vm_flags & VM_MIXEDMAP) return true; - if (pfn_t_devmap(pfn)) - return true; if (pfn_t_special(pfn)) return true; if (is_zero_pfn(pfn_t_to_pfn(pfn))) @@ -2630,8 +2615,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma, * than insert_pfn). If a zero_pfn were inserted into a VM_MIXEDMAP * without pte special, it would there be refcounted as a normal page. */ - if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && - !pfn_t_devmap(pfn) && pfn_t_valid(pfn)) { + if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pfn_t_valid(pfn)) { struct page *page; /* diff --git a/mm/vmscan.c b/mm/vmscan.c index 56d540b8a1d0..6698fadf5d04 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3425,7 +3425,7 @@ static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned if (!pte_present(pte) || is_zero_pfn(pfn)) return -1; - if (WARN_ON_ONCE(pte_devmap(pte) || pte_special(pte))) + if (WARN_ON_ONCE(pte_special(pte))) return -1; if (!pte_young(pte) && !mm_has_notifiers(vma->vm_mm)) From fd2825b0760a10a9626986cca64f5664302ffdfc Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Thu, 19 Jun 2025 18:57:57 +1000 Subject: [PATCH 097/370] mm/gup: remove pXX_devmap usage from get_user_pages() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit GUP uses pXX_devmap() calls to see if it needs to get a reference on the associated pgmap data structure to ensure the pages won't go away. However, it is a driver's responsibility to ensure that if pages are mapped (i.e. discoverable by GUP) they are not offlined or removed from the memmap, so there is no need to hold a reference on the pgmap data structure to ensure this. Furthermore, mappings with PFN_DEV are no longer created, hence this is effectively dead code anyway and can be removed.
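For illustration, with the devmap/pgmap handling gone the GUP fast-path PTE check reduces to (a condensed sketch; the hunks below are authoritative):

	if (!pte_access_permitted(pte, flags & FOLL_WRITE))
		goto pte_unmap;
	if (pte_special(pte))
		goto pte_unmap;
	/* If it's not marked as special it must have a valid memmap. */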
Link: https://lkml.kernel.org/r/708b2be76876659ec5261fe5d059b07268b98b36.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple Reviewed-by: Jason Gunthorpe Reviewed-by: Dan Williams Cc: Balbir Singh Cc: Björn Töpel Cc: Björn Töpel Cc: Christoph Hellwig Cc: Chunyan Zhang Cc: David Hildenbrand Cc: Deepak Gupta Cc: Gerald Schaefer Cc: Inki Dae Cc: John Groves Cc: John Hubbard Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/huge_mm.h | 3 - mm/gup.c | 160 ++-------------------------------------- mm/huge_memory.c | 40 ---------- 3 files changed, 5 insertions(+), 198 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 7753daac49f7..a2df2308cb2c 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -473,9 +473,6 @@ static inline bool folio_test_pmd_mappable(struct folio *folio) return folio_order(folio) >= HPAGE_PMD_ORDER; } -struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, - pmd_t *pmd, int flags, struct dev_pagemap **pgmap); - vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf); extern struct folio *huge_zero_folio; diff --git a/mm/gup.c b/mm/gup.c index c08b97e0d344..30d320719fa2 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -679,31 +679,9 @@ static struct page *follow_huge_pud(struct vm_area_struct *vma, return NULL; pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT; - - if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) && - pud_devmap(pud)) { - /* - * device mapped pages can only be returned if the caller - * will manage the page reference count. - * - * At least one of FOLL_GET | FOLL_PIN must be set, so - * assert that here: - */ - if (!(flags & (FOLL_GET | FOLL_PIN))) - return ERR_PTR(-EEXIST); - - if (flags & FOLL_TOUCH) - touch_pud(vma, addr, pudp, flags & FOLL_WRITE); - - ctx->pgmap = get_dev_pagemap(pfn, ctx->pgmap); - if (!ctx->pgmap) - return ERR_PTR(-EFAULT); - } - page = pfn_to_page(pfn); - if (!pud_devmap(pud) && !pud_write(pud) && - gup_must_unshare(vma, flags, page)) + if (!pud_write(pud) && gup_must_unshare(vma, flags, page)) return ERR_PTR(-EMLINK); ret = try_grab_folio(page_folio(page), 1, flags); @@ -857,8 +835,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, page = vm_normal_page(vma, address, pte); /* - * We only care about anon pages in can_follow_write_pte() and don't - * have to worry about pte_devmap() because they are never anon. + * We only care about anon pages in can_follow_write_pte(). */ if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, page, vma, flags)) { @@ -866,18 +843,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, goto out; } - if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) { - /* - * Only return device mapping pages in the FOLL_GET or FOLL_PIN - * case since they are only valid while holding the pgmap - * reference. 
- */ - *pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap); - if (*pgmap) - page = pte_page(pte); - else - goto no_page; - } else if (unlikely(!page)) { + if (unlikely(!page)) { if (flags & FOLL_DUMP) { /* Avoid special (like zero) pages in core dumps */ page = ERR_PTR(-EFAULT); @@ -959,14 +925,6 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, return no_page_table(vma, flags, address); if (!pmd_present(pmdval)) return no_page_table(vma, flags, address); - if (pmd_devmap(pmdval)) { - ptl = pmd_lock(mm, pmd); - page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap); - spin_unlock(ptl); - if (page) - return page; - return no_page_table(vma, flags, address); - } if (likely(!pmd_leaf(pmdval))) return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap); @@ -2896,7 +2854,7 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, int *nr) { struct dev_pagemap *pgmap = NULL; - int nr_start = *nr, ret = 0; + int ret = 0; pte_t *ptep, *ptem; ptem = ptep = pte_offset_map(&pmd, addr); @@ -2920,16 +2878,7 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, if (!pte_access_permitted(pte, flags & FOLL_WRITE)) goto pte_unmap; - if (pte_devmap(pte)) { - if (unlikely(flags & FOLL_LONGTERM)) - goto pte_unmap; - - pgmap = get_dev_pagemap(pte_pfn(pte), pgmap); - if (unlikely(!pgmap)) { - gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages); - goto pte_unmap; - } - } else if (pte_special(pte)) + if (pte_special(pte)) goto pte_unmap; /* If it's not marked as special it must have a valid memmap. */ @@ -3001,91 +2950,6 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, } #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */ -#if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE) -static int gup_fast_devmap_leaf(unsigned long pfn, unsigned long addr, - unsigned long end, unsigned int flags, struct page **pages, int *nr) -{ - int nr_start = *nr; - struct dev_pagemap *pgmap = NULL; - - do { - struct folio *folio; - struct page *page = pfn_to_page(pfn); - - pgmap = get_dev_pagemap(pfn, pgmap); - if (unlikely(!pgmap)) { - gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages); - break; - } - - folio = try_grab_folio_fast(page, 1, flags); - if (!folio) { - gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages); - break; - } - folio_set_referenced(folio); - pages[*nr] = page; - (*nr)++; - pfn++; - } while (addr += PAGE_SIZE, addr != end); - - put_dev_pagemap(pgmap); - return addr == end; -} - -static int gup_fast_devmap_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr, - unsigned long end, unsigned int flags, struct page **pages, - int *nr) -{ - unsigned long fault_pfn; - int nr_start = *nr; - - fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); - if (!gup_fast_devmap_leaf(fault_pfn, addr, end, flags, pages, nr)) - return 0; - - if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) { - gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages); - return 0; - } - return 1; -} - -static int gup_fast_devmap_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr, - unsigned long end, unsigned int flags, struct page **pages, - int *nr) -{ - unsigned long fault_pfn; - int nr_start = *nr; - - fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); - if (!gup_fast_devmap_leaf(fault_pfn, addr, end, flags, pages, nr)) - return 0; - - if (unlikely(pud_val(orig) != pud_val(*pudp))) { - gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages); - return 0; - } - return 1; -} -#else -static int 
gup_fast_devmap_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr, - unsigned long end, unsigned int flags, struct page **pages, - int *nr) -{ - BUILD_BUG(); - return 0; -} - -static int gup_fast_devmap_pud_leaf(pud_t pud, pud_t *pudp, unsigned long addr, - unsigned long end, unsigned int flags, struct page **pages, - int *nr) -{ - BUILD_BUG(); - return 0; -} -#endif - static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) @@ -3100,13 +2964,6 @@ static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr, if (pmd_special(orig)) return 0; - if (pmd_devmap(orig)) { - if (unlikely(flags & FOLL_LONGTERM)) - return 0; - return gup_fast_devmap_pmd_leaf(orig, pmdp, addr, end, flags, - pages, nr); - } - page = pmd_page(orig); refs = record_subpages(page, PMD_SIZE, addr, end, pages + *nr); @@ -3147,13 +3004,6 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr, if (pud_special(orig)) return 0; - if (pud_devmap(orig)) { - if (unlikely(flags & FOLL_LONGTERM)) - return 0; - return gup_fast_devmap_pud_leaf(orig, pudp, addr, end, flags, - pages, nr); - } - page = pud_page(orig); refs = record_subpages(page, PUD_SIZE, addr, end, pages + *nr); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 6411f3107af1..7434d177b97c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1672,46 +1672,6 @@ void touch_pmd(struct vm_area_struct *vma, unsigned long addr, update_mmu_cache_pmd(vma, addr, pmd); } -struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, - pmd_t *pmd, int flags, struct dev_pagemap **pgmap) -{ - unsigned long pfn = pmd_pfn(*pmd); - struct mm_struct *mm = vma->vm_mm; - struct page *page; - int ret; - - assert_spin_locked(pmd_lockptr(mm, pmd)); - - if (flags & FOLL_WRITE && !pmd_write(*pmd)) - return NULL; - - if (pmd_present(*pmd) && pmd_devmap(*pmd)) - /* pass */; - else - return NULL; - - if (flags & FOLL_TOUCH) - touch_pmd(vma, addr, pmd, flags & FOLL_WRITE); - - /* - * device mapped pages can only be returned if the - * caller will manage the page reference count. - */ - if (!(flags & (FOLL_GET | FOLL_PIN))) - return ERR_PTR(-EEXIST); - - pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT; - *pgmap = get_dev_pagemap(pfn, *pgmap); - if (!*pgmap) - return ERR_PTR(-EFAULT); - page = pfn_to_page(pfn); - ret = try_grab_folio(page_folio(page), 1, flags); - if (ret) - page = ERR_PTR(ret); - - return page; -} - int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) From 7b2ae3c47f65bc75985caaaba9f2fb601eebca20 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Thu, 19 Jun 2025 18:57:58 +1000 Subject: [PATCH 098/370] mm/huge_memory: remove pXd_devmap usage from insert_pXd_pfn() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Nothing uses PFN_DEV anymore so no need to create devmap pXd's when mapping a PFN. Instead special mappings will be created which ensures vm_normal_page_pXd() will not return pages which don't have an associated page. This could change behaviour slightly on architectures where pXd_devmap() does not imply pXd_special() as the normal page checks would have fallen through to checking VM_PFNMAP/MIXEDMAP instead, which in theory at least could have returned a page. 
However vm_normal_page_pXd() should never have been returning pages for pXd_devmap() entries anyway, so anything relying on that would have been a bug. Link: https://lkml.kernel.org/r/cd8658f9ff10afcfffd8b145a39d98bf1c595ffa.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple Cc: Balbir Singh Cc: Björn Töpel Cc: Björn Töpel Cc: Christoph Hellwig Cc: Chunyan Zhang Cc: Dan Williams Cc: David Hildenbrand Cc: Deepak Gupta Cc: Gerald Schaefer Cc: Inki Dae Cc: Jason Gunthorpe Cc: John Groves Cc: John Hubbard Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Will Deacon Signed-off-by: Andrew Morton --- mm/huge_memory.c | 12 ++---------- 1 file changed, 2 insertions(+), 10 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 7434d177b97c..54b5c37d9515 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1415,11 +1415,7 @@ static int insert_pmd(struct vm_area_struct *vma, unsigned long addr, add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR); } else { entry = pmd_mkhuge(pfn_t_pmd(fop.pfn, prot)); - - if (pfn_t_devmap(fop.pfn)) - entry = pmd_mkdevmap(entry); - else - entry = pmd_mkspecial(entry); + entry = pmd_mkspecial(entry); } if (write) { entry = pmd_mkyoung(pmd_mkdirty(entry)); @@ -1565,11 +1561,7 @@ static void insert_pud(struct vm_area_struct *vma, unsigned long addr, add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PUD_NR); } else { entry = pud_mkhuge(pfn_t_pud(fop.pfn, prot)); - - if (pfn_t_devmap(fop.pfn)) - entry = pud_mkdevmap(entry); - else - entry = pud_mkspecial(entry); + entry = pud_mkspecial(entry); } if (write) { entry = pud_mkyoung(pud_mkdirty(entry)); From 8a6a984c2e0ea406459b445a3910a454bece3aa1 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Thu, 19 Jun 2025 18:57:59 +1000 Subject: [PATCH 099/370] mm: remove redundant pXd_devmap calls MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit DAX was the only thing that created pmd_devmap and pud_devmap entries however it no longer does as DAX pages are now refcounted normally and pXd_trans_huge() returns true for those. Therefore checking both pXd_devmap and pXd_trans_huge() is redundant and the former can be removed without changing behaviour as it will always be false. Link: https://lkml.kernel.org/r/d58f089dc16b7feb7c6728164f37dea65d64a0d3.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple Cc: Balbir Singh Cc: Björn Töpel Cc: Björn Töpel Cc: Christoph Hellwig Cc: Chunyan Zhang Cc: Dan Williams Cc: David Hildenbrand Cc: Deepak Gupta Cc: Gerald Schaefer Cc: Inki Dae Cc: Jason Gunthorpe Cc: John Groves Cc: John Hubbard Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Will Deacon Signed-off-by: Andrew Morton --- fs/dax.c | 5 ++--- include/linux/huge_mm.h | 10 ++++------ include/linux/pgtable.h | 2 +- mm/hmm.c | 4 ++-- mm/huge_memory.c | 23 +++++++++-------------- mm/mapping_dirty_helpers.c | 4 ++-- mm/memory.c | 15 ++++++--------- mm/migrate_device.c | 2 +- mm/mprotect.c | 2 +- mm/mremap.c | 5 ++--- mm/page_vma_mapped.c | 5 ++--- mm/pagewalk.c | 8 +++----- mm/pgtable-generic.c | 7 +++---- mm/userfaultfd.c | 4 ++-- mm/vmscan.c | 3 --- 15 files changed, 40 insertions(+), 59 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index ea0c35794bf9..7d4ecb9d23af 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1937,7 +1937,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp, * the PTE we need to set up. If so just return and the fault will be * retried. 
*/ - if (pmd_trans_huge(*vmf->pmd) || pmd_devmap(*vmf->pmd)) { + if (pmd_trans_huge(*vmf->pmd)) { ret = VM_FAULT_NOPAGE; goto unlock_entry; } @@ -2060,8 +2060,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp, * the PMD we need to set up. If so just return and the fault will be * retried. */ - if (!pmd_none(*vmf->pmd) && !pmd_trans_huge(*vmf->pmd) && - !pmd_devmap(*vmf->pmd)) { + if (!pmd_none(*vmf->pmd) && !pmd_trans_huge(*vmf->pmd)) { ret = 0; goto unlock_entry; } diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index a2df2308cb2c..26607f2c65fb 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -400,8 +400,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, #define split_huge_pmd(__vma, __pmd, __address) \ do { \ pmd_t *____pmd = (__pmd); \ - if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd) \ - || pmd_devmap(*____pmd)) \ + if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd)) \ __split_huge_pmd(__vma, __pmd, __address, \ false); \ } while (0) @@ -426,8 +425,7 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, #define split_huge_pud(__vma, __pud, __address) \ do { \ pud_t *____pud = (__pud); \ - if (pud_trans_huge(*____pud) \ - || pud_devmap(*____pud)) \ + if (pud_trans_huge(*____pud)) \ __split_huge_pud(__vma, __pud, __address); \ } while (0) @@ -450,7 +448,7 @@ static inline int is_swap_pmd(pmd_t pmd) static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma) { - if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) + if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd)) return __pmd_trans_huge_lock(pmd, vma); else return NULL; @@ -458,7 +456,7 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd, static inline spinlock_t *pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma) { - if (pud_trans_huge(*pud) || pud_devmap(*pud)) + if (pud_trans_huge(*pud)) return __pud_trans_huge_lock(pud, vma); else return NULL; diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index d05e35a0facf..ffcd966cf2d4 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1672,7 +1672,7 @@ static inline int pud_trans_unstable(pud_t *pud) defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) pud_t pudval = READ_ONCE(*pud); - if (pud_none(pudval) || pud_trans_huge(pudval) || pud_devmap(pudval)) + if (pud_none(pudval) || pud_trans_huge(pudval)) return 1; if (unlikely(pud_bad(pudval))) { pud_clear_bad(pud); diff --git a/mm/hmm.c b/mm/hmm.c index 14914da98416..62d3082dc55c 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -360,7 +360,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); } - if (pmd_devmap(pmd) || pmd_trans_huge(pmd)) { + if (pmd_trans_huge(pmd)) { /* * No need to take pmd_lock here, even if some other thread * is splitting the huge pmd we will get that event through @@ -371,7 +371,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, * values. */ pmd = pmdp_get_lockless(pmdp); - if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd)) + if (!pmd_trans_huge(pmd)) goto again; return hmm_vma_handle_pmd(walk, addr, end, hmm_pfns, pmd); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 54b5c37d9515..cf808b2eea29 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1459,8 +1459,7 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write) * but we need to be consistent with PTEs and architectures that * can't support a 'special' bit. 
*/ - BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) && - !pfn_t_devmap(pfn)); + BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))); BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) == (VM_PFNMAP|VM_MIXEDMAP)); BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags)); @@ -1596,8 +1595,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write) * but we need to be consistent with PTEs and architectures that * can't support a 'special' bit. */ - BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) && - !pfn_t_devmap(pfn)); + BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))); BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) == (VM_PFNMAP|VM_MIXEDMAP)); BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags)); @@ -1815,7 +1813,7 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm, ret = -EAGAIN; pud = *src_pud; - if (unlikely(!pud_trans_huge(pud) && !pud_devmap(pud))) + if (unlikely(!pud_trans_huge(pud))) goto out_unlock; /* @@ -2677,8 +2675,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma) { spinlock_t *ptl; ptl = pmd_lock(vma->vm_mm, pmd); - if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || - pmd_devmap(*pmd))) + if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd))) return ptl; spin_unlock(ptl); return NULL; @@ -2695,7 +2692,7 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma) spinlock_t *ptl; ptl = pud_lock(vma->vm_mm, pud); - if (likely(pud_trans_huge(*pud) || pud_devmap(*pud))) + if (likely(pud_trans_huge(*pud))) return ptl; spin_unlock(ptl); return NULL; @@ -2747,7 +2744,7 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud, VM_BUG_ON(haddr & ~HPAGE_PUD_MASK); VM_BUG_ON_VMA(vma->vm_start > haddr, vma); VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PUD_SIZE, vma); - VM_BUG_ON(!pud_trans_huge(*pud) && !pud_devmap(*pud)); + VM_BUG_ON(!pud_trans_huge(*pud)); count_vm_event(THP_SPLIT_PUD); @@ -2780,7 +2777,7 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, (address & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE); mmu_notifier_invalidate_range_start(&range); ptl = pud_lock(vma->vm_mm, pud); - if (unlikely(!pud_trans_huge(*pud) && !pud_devmap(*pud))) + if (unlikely(!pud_trans_huge(*pud))) goto out; __split_huge_pud_locked(vma, pud, range.start); @@ -2853,8 +2850,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, VM_BUG_ON(haddr & ~HPAGE_PMD_MASK); VM_BUG_ON_VMA(vma->vm_start > haddr, vma); VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma); - VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd) - && !pmd_devmap(*pmd)); + VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)); count_vm_event(THP_SPLIT_PMD); @@ -3062,8 +3058,7 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, bool freeze) { VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE)); - if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || - is_pmd_migration_entry(*pmd)) + if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd)) __split_huge_pmd_locked(vma, pmd, address, freeze); } diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c index dc1692ff9e58..c193de6cb23a 100644 --- a/mm/mapping_dirty_helpers.c +++ b/mm/mapping_dirty_helpers.c @@ -129,7 +129,7 @@ static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end, pmd_t pmdval = pmdp_get_lockless(pmd); /* Do not split a huge pmd, present or migrated */ - if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval)) 
{ + if (pmd_trans_huge(pmdval)) { WARN_ON(pmd_write(pmdval) || pmd_dirty(pmdval)); walk->action = ACTION_CONTINUE; } @@ -152,7 +152,7 @@ static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end, pud_t pudval = READ_ONCE(*pud); /* Do not split a huge pud */ - if (pud_trans_huge(pudval) || pud_devmap(pudval)) { + if (pud_trans_huge(pudval)) { WARN_ON(pud_write(pudval) || pud_dirty(pudval)); walk->action = ACTION_CONTINUE; } diff --git a/mm/memory.c b/mm/memory.c index 01d51bd95197..150bb62855b1 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -675,8 +675,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr, } } - if (pmd_devmap(pmd)) - return NULL; if (is_huge_zero_pmd(pmd)) return NULL; if (unlikely(pfn > highest_memmap_pfn)) @@ -1240,8 +1238,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, src_pmd = pmd_offset(src_pud, addr); do { next = pmd_addr_end(addr, end); - if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd) - || pmd_devmap(*src_pmd)) { + if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)) { int err; VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma); err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd, @@ -1277,7 +1274,7 @@ copy_pud_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, src_pud = pud_offset(src_p4d, addr); do { next = pud_addr_end(addr, end); - if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) { + if (pud_trans_huge(*src_pud)) { int err; VM_BUG_ON_VMA(next-addr != HPAGE_PUD_SIZE, src_vma); @@ -1791,7 +1788,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, pmd = pmd_offset(pud, addr); do { next = pmd_addr_end(addr, end); - if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) { + if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd)) { if (next - addr != HPAGE_PMD_SIZE) __split_huge_pmd(vma, pmd, addr, false); else if (zap_huge_pmd(tlb, vma, pmd, addr)) { @@ -1833,7 +1830,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb, pud = pud_offset(p4d, addr); do { next = pud_addr_end(addr, end); - if (pud_trans_huge(*pud) || pud_devmap(*pud)) { + if (pud_trans_huge(*pud)) { if (next - addr != HPAGE_PUD_SIZE) { mmap_assert_locked(tlb->mm); split_huge_pud(vma, pud, addr); @@ -6136,7 +6133,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, pud_t orig_pud = *vmf.pud; barrier(); - if (pud_trans_huge(orig_pud) || pud_devmap(orig_pud)) { + if (pud_trans_huge(orig_pud)) { /* * TODO once we support anonymous PUDs: NUMA case and @@ -6177,7 +6174,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, pmd_migration_entry_wait(mm, vmf.pmd); return 0; } - if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) { + if (pmd_trans_huge(vmf.orig_pmd)) { if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) return do_huge_pmd_numa_page(&vmf); diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 3158afe7eb23..e05e14d6eacd 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -615,7 +615,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, pmdp = pmd_alloc(mm, pudp, addr); if (!pmdp) goto abort; - if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp)) + if (pmd_trans_huge(*pmdp)) goto abort; if (pte_alloc(mm, pmdp)) goto abort; diff --git a/mm/mprotect.c b/mm/mprotect.c index b873b98ab705..88709c01177b 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -376,7 +376,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb, goto next; _pmd = pmdp_get_lockless(pmd); - if 
(is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) { + if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd)) { if ((next - addr != HPAGE_PMD_SIZE) || pgtable_split_needed(vma, cp_flags)) { __split_huge_pmd(vma, pmd, addr, false); diff --git a/mm/mremap.c b/mm/mremap.c index 7e93d3344828..36585041c760 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -820,7 +820,7 @@ unsigned long move_page_tables(struct pagetable_move_control *pmc) new_pud = alloc_new_pud(mm, pmc->new_addr); if (!new_pud) break; - if (pud_trans_huge(*old_pud) || pud_devmap(*old_pud)) { + if (pud_trans_huge(*old_pud)) { if (extent == HPAGE_PUD_SIZE) { move_pgt_entry(pmc, HPAGE_PUD, old_pud, new_pud); /* We ignore and continue on error? */ @@ -839,8 +839,7 @@ unsigned long move_page_tables(struct pagetable_move_control *pmc) if (!new_pmd) break; again: - if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) || - pmd_devmap(*old_pmd)) { + if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) { if (extent == HPAGE_PMD_SIZE && move_pgt_entry(pmc, HPAGE_PMD, old_pmd, new_pmd)) continue; diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index e463c3be934a..e981a1a292d2 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -246,8 +246,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) */ pmde = pmdp_get_lockless(pvmw->pmd); - if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde) || - (pmd_present(pmde) && pmd_devmap(pmde))) { + if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) { pvmw->ptl = pmd_lock(mm, pvmw->pmd); pmde = *pvmw->pmd; if (!pmd_present(pmde)) { @@ -262,7 +261,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) return not_found(pvmw); return true; } - if (likely(pmd_trans_huge(pmde) || pmd_devmap(pmde))) { + if (likely(pmd_trans_huge(pmde))) { if (pvmw->flags & PVMW_MIGRATION) return not_found(pvmw); if (!check_pmd(pmd_pfn(pmde), pvmw)) diff --git a/mm/pagewalk.c b/mm/pagewalk.c index a214a2b40ab9..648038247a8d 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -143,8 +143,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, * We are ONLY installing, so avoid unnecessarily * splitting a present huge page. */ - if (pmd_present(*pmd) && - (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))) + if (pmd_present(*pmd) && pmd_trans_huge(*pmd)) continue; } @@ -210,8 +209,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, * We are ONLY installing, so avoid unnecessarily * splitting a present huge page. */ - if (pud_present(*pud) && - (pud_trans_huge(*pud) || pud_devmap(*pud))) + if (pud_present(*pud) && pud_trans_huge(*pud)) continue; } @@ -908,7 +906,7 @@ struct folio *folio_walk_start(struct folio_walk *fw, * TODO: FW_MIGRATION support for PUD migration entries * once there are relevant users. 
 	 */
-	if (!pud_present(pud) || pud_devmap(pud) || pud_special(pud)) {
+	if (!pud_present(pud) || pud_special(pud)) {
 		spin_unlock(ptl);
 		goto not_found;
 	} else if (!pud_leaf(pud)) {
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 5a882f2b10f9..567e2d084071 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -139,8 +139,7 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
 {
 	pmd_t pmd;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-	VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
-		  !pmd_devmap(*pmdp));
+	VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp));
 	pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
 	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	return pmd;
@@ -153,7 +152,7 @@ pud_t pudp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
 	pud_t pud;
 
 	VM_BUG_ON(address & ~HPAGE_PUD_MASK);
-	VM_BUG_ON(!pud_trans_huge(*pudp) && !pud_devmap(*pudp));
+	VM_BUG_ON(!pud_trans_huge(*pudp));
 	pud = pudp_huge_get_and_clear(vma->vm_mm, address, pudp);
 	flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
 	return pud;
@@ -293,7 +292,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 		*pmdvalp = pmdval;
 	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
 		goto nomap;
-	if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
+	if (unlikely(pmd_trans_huge(pmdval)))
 		goto nomap;
 	if (unlikely(pmd_bad(pmdval))) {
 		pmd_clear_bad(pmd);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index dd2a25fafb82..cbed91b09640 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -795,8 +795,8 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 		 * (This includes the case where the PMD used to be THP and
 		 * changed back to none after __pte_alloc().)
 		 */
-		if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) ||
-			     pmd_devmap(dst_pmdval))) {
+		if (unlikely(!pmd_present(dst_pmdval) ||
+			     pmd_trans_huge(dst_pmdval))) {
 			err = -EEXIST;
 			break;
 		}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6698fadf5d04..c86a2495138a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3450,9 +3450,6 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned
 	if (!pmd_present(pmd) || is_huge_zero_pmd(pmd))
 		return -1;
 
-	if (WARN_ON_ONCE(pmd_devmap(pmd)))
-		return -1;
-
 	if (!pmd_young(pmd) && !mm_has_notifiers(vma->vm_mm))
 		return -1;
 

From 2f4e882d955b19ff7b216d6a29a77fcba045b7cc Mon Sep 17 00:00:00 2001
From: Alistair Popple
Date: Thu, 19 Jun 2025 18:58:00 +1000
Subject: [PATCH 100/370] mm/khugepaged: remove redundant pmd_devmap() check
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The pmd_devmap() check in check_pmd_state() is redundant because the
only users of pmd_devmap() were device DAX and FS DAX. However, all
callers of check_pmd_state() first call thp_vma_allowable_order() to
check if the VMA should be scanned. Except when called from a page
fault, this always returns 0 for DAX VMAs, hence we would never expect
to see a pmd_devmap entry.
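The resulting dead branch can be modelled in plain userspace C. The
following is an illustrative sketch only, not kernel code: the
"_model" names, the bool parameters and the SCAN_* values are
stand-ins. Once nothing can produce a devmap PMD, the pmd_devmap()
test can never fire, so deleting it cannot change any return value:

	#include <stdbool.h>
	#include <stdio.h>

	enum scan_result { SCAN_SUCCEED, SCAN_PMD_NULL, SCAN_PMD_MAPPED };

	/* With no producers left, a devmap PMD can never be observed,
	 * so the model of pmd_devmap() is a constant. */
	static bool pmd_devmap_model(void) { return false; }

	static enum scan_result check_pmd_state_model(bool trans_huge, bool bad)
	{
		if (trans_huge)
			return SCAN_PMD_MAPPED;
		if (pmd_devmap_model())	/* dead branch being removed */
			return SCAN_PMD_NULL;
		if (bad)
			return SCAN_PMD_NULL;
		return SCAN_SUCCEED;
	}

	int main(void)
	{
		/* All reachable cases; the devmap branch is never taken. */
		printf("%d %d %d\n",
		       check_pmd_state_model(true, false),
		       check_pmd_state_model(false, true),
		       check_pmd_state_model(false, false));
		return 0;
	}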
Link: https://lkml.kernel.org/r/a68175fd3a37e9b72cc82c1d63fd8b69691a85b5.1750323463.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple
Acked-by: David Hildenbrand
Reviewed-by: Jason Gunthorpe
Cc: Balbir Singh
Cc: Björn Töpel
Cc: Björn Töpel
Cc: Christoph Hellwig
Cc: Chunyan Zhang
Cc: Dan Williams
Cc: Deepak Gupta
Cc: Gerald Schaefer
Cc: Inki Dae
Cc: John Groves
Cc: John Hubbard
Cc: Lorenzo Stoakes
Cc: Matthew Wilcox (Oracle)
Cc: Will Deacon
Signed-off-by: Andrew Morton
---
 mm/khugepaged.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6b09b09c8f82..3495a20cef5e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -945,8 +945,6 @@ static inline int check_pmd_state(pmd_t *pmd)
 		return SCAN_PMD_NULL;
 	if (pmd_trans_huge(pmde))
 		return SCAN_PMD_MAPPED;
-	if (pmd_devmap(pmde))
-		return SCAN_PMD_NULL;
 	if (pmd_bad(pmde))
 		return SCAN_PMD_NULL;
 	return SCAN_SUCCEED;

From bea0cc7cf4a51e8d3a0042212df3b83b49df58a7 Mon Sep 17 00:00:00 2001
From: Alistair Popple
Date: Thu, 19 Jun 2025 18:58:01 +1000
Subject: [PATCH 101/370] powerpc: remove checks for devmap pages and
 PMDs/PUDs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

PFN_DEV no longer exists. This means no devmap PMDs or PUDs will be
created, so checking for them is redundant. Instead, mappings of pages
that would previously have returned true for pXd_devmap() will return
true for pXd_trans_huge().

Link: https://lkml.kernel.org/r/31f63cc8dd518f9e2ec300681fe302eb4adf49b4.1750323463.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple
Reviewed-by: Jason Gunthorpe
Acked-by: David Hildenbrand
Cc: Balbir Singh
Cc: Björn Töpel
Cc: Björn Töpel
Cc: Christoph Hellwig
Cc: Chunyan Zhang
Cc: Dan Williams
Cc: Deepak Gupta
Cc: Gerald Schaefer
Cc: Inki Dae
Cc: John Groves
Cc: John Hubbard
Cc: Lorenzo Stoakes
Cc: Matthew Wilcox (Oracle)
Cc: Will Deacon
Signed-off-by: Andrew Morton
---
 arch/powerpc/mm/book3s64/hash_hugepage.c |  2 +-
 arch/powerpc/mm/book3s64/hash_pgtable.c  |  3 +--
 arch/powerpc/mm/book3s64/hugetlbpage.c   |  2 +-
 arch/powerpc/mm/book3s64/pgtable.c       | 10 ++++------
 arch/powerpc/mm/book3s64/radix_pgtable.c |  5 ++---
 arch/powerpc/mm/pgtable.c                |  2 +-
 6 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_hugepage.c b/arch/powerpc/mm/book3s64/hash_hugepage.c
index 15d6f3ea7178..cdfd4fe75edb 100644
--- a/arch/powerpc/mm/book3s64/hash_hugepage.c
+++ b/arch/powerpc/mm/book3s64/hash_hugepage.c
@@ -54,7 +54,7 @@ int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
 	/*
 	 * Make sure this is thp or devmap entry
 	 */
-	if (!(old_pmd & (H_PAGE_THP_HUGE | _PAGE_DEVMAP)))
+	if (!(old_pmd & H_PAGE_THP_HUGE))
 		return 0;
 
 	rflags = htab_convert_pte_flags(new_pmd, flags);
diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
index 988948d69bc1..82d31177630b 100644
--- a/arch/powerpc/mm/book3s64/hash_pgtable.c
+++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
@@ -195,7 +195,7 @@ unsigned long hash__pmd_hugepage_update(struct mm_struct *mm, unsigned long addr
 	unsigned long old;
 
 #ifdef CONFIG_DEBUG_VM
-	WARN_ON(!hash__pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+	WARN_ON(!hash__pmd_trans_huge(*pmdp));
 	assert_spin_locked(pmd_lockptr(mm, pmdp));
 #endif
 
@@ -227,7 +227,6 @@ pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addres
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 	VM_BUG_ON(pmd_trans_huge(*pmdp));
-	VM_BUG_ON(pmd_devmap(*pmdp));
 
 	pmd = *pmdp;
 	pmd_clear(pmdp);
diff --git a/arch/powerpc/mm/book3s64/hugetlbpage.c b/arch/powerpc/mm/book3s64/hugetlbpage.c index 83c3361b358b..2bcbbf9d85ac 100644 --- a/arch/powerpc/mm/book3s64/hugetlbpage.c +++ b/arch/powerpc/mm/book3s64/hugetlbpage.c @@ -74,7 +74,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid, } while(!pte_xchg(ptep, __pte(old_pte), __pte(new_pte))); /* Make sure this is a hugetlb entry */ - if (old_pte & (H_PAGE_THP_HUGE | _PAGE_DEVMAP)) + if (old_pte & H_PAGE_THP_HUGE) return 0; rflags = htab_convert_pte_flags(new_pte, flags); diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c index a89ef89101fc..c9431ae7f78a 100644 --- a/arch/powerpc/mm/book3s64/pgtable.c +++ b/arch/powerpc/mm/book3s64/pgtable.c @@ -62,7 +62,7 @@ int pmdp_set_access_flags(struct vm_area_struct *vma, unsigned long address, { int changed; #ifdef CONFIG_DEBUG_VM - WARN_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp)); + WARN_ON(!pmd_trans_huge(*pmdp)); assert_spin_locked(pmd_lockptr(vma->vm_mm, pmdp)); #endif changed = !pmd_same(*(pmdp), entry); @@ -82,7 +82,6 @@ int pudp_set_access_flags(struct vm_area_struct *vma, unsigned long address, { int changed; #ifdef CONFIG_DEBUG_VM - WARN_ON(!pud_devmap(*pudp)); assert_spin_locked(pud_lockptr(vma->vm_mm, pudp)); #endif changed = !pud_same(*(pudp), entry); @@ -204,8 +203,8 @@ pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma, { pmd_t pmd; VM_BUG_ON(addr & ~HPAGE_PMD_MASK); - VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) && - !pmd_devmap(*pmdp)) || !pmd_present(*pmdp)); + VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp)) || + !pmd_present(*pmdp)); pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp); /* * if it not a fullmm flush, then we can possibly end up converting @@ -223,8 +222,7 @@ pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma, pud_t pud; VM_BUG_ON(addr & ~HPAGE_PMD_MASK); - VM_BUG_ON((pud_present(*pudp) && !pud_devmap(*pudp)) || - !pud_present(*pudp)); + VM_BUG_ON(!pud_present(*pudp)); pud = pudp_huge_get_and_clear(vma->vm_mm, addr, pudp); /* * if it not a fullmm flush, then we can possibly end up converting diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index 9f764bc42b8c..877870d0046c 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -1426,7 +1426,7 @@ unsigned long radix__pmd_hugepage_update(struct mm_struct *mm, unsigned long add unsigned long old; #ifdef CONFIG_DEBUG_VM - WARN_ON(!radix__pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp)); + WARN_ON(!radix__pmd_trans_huge(*pmdp)); assert_spin_locked(pmd_lockptr(mm, pmdp)); #endif @@ -1443,7 +1443,7 @@ unsigned long radix__pud_hugepage_update(struct mm_struct *mm, unsigned long add unsigned long old; #ifdef CONFIG_DEBUG_VM - WARN_ON(!pud_devmap(*pudp)); + WARN_ON(!pud_trans_huge(*pudp)); assert_spin_locked(pud_lockptr(mm, pudp)); #endif @@ -1461,7 +1461,6 @@ pmd_t radix__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addre VM_BUG_ON(address & ~HPAGE_PMD_MASK); VM_BUG_ON(radix__pmd_trans_huge(*pmdp)); - VM_BUG_ON(pmd_devmap(*pmdp)); /* * khugepaged calls this for normal pmd */ diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index 61df5aed7989..dfaa9fd86f7e 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -509,7 +509,7 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea, return NULL; #endif - if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) { + if 
(pmd_trans_huge(pmd)) { if (is_thp) *is_thp = true; ret_pte = (pte_t *)pmdp; From 28dc88c39ecfe7de5033fa05cdd24fd1a9f8267d Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Thu, 19 Jun 2025 18:58:02 +1000 Subject: [PATCH 102/370] fs/dax: remove FS_DAX_LIMITED config option MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The dcssblk driver was the last user of FS_DAX_LIMITED. That was marked broken by 653d7825c149 ("dcssblk: mark DAX broken, remove FS_DAX_LIMITED support") to allow removal of PFN_SPECIAL. However the FS_DAX_LIMITED config option itself was not removed, so do that now. Link: https://lkml.kernel.org/r/b47bf164b4a1013d736fa1a3d501bc8b8e71311f.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple Acked-by: David Hildenbrand Cc: Balbir Singh Cc: Björn Töpel Cc: Björn Töpel Cc: Christoph Hellwig Cc: Chunyan Zhang Cc: Dan Williams Cc: Deepak Gupta Cc: Gerald Schaefer Cc: Inki Dae Cc: Jason Gunthorpe Cc: John Groves Cc: John Hubbard Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Will Deacon Signed-off-by: Andrew Morton --- fs/Kconfig | 9 +-------- fs/dax.c | 12 ------------ mm/memremap.c | 4 ---- 3 files changed, 1 insertion(+), 24 deletions(-) diff --git a/fs/Kconfig b/fs/Kconfig index 44b6cdd36dc1..ccdf371a62ed 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -59,7 +59,7 @@ endif # BLOCK config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU - depends on ZONE_DEVICE || FS_DAX_LIMITED + depends on ZONE_DEVICE select FS_IOMAP select DAX help @@ -95,13 +95,6 @@ config FS_DAX_PMD depends on ZONE_DEVICE depends on TRANSPARENT_HUGEPAGE -# Selected by DAX drivers that do not expect filesystem DAX to support -# get_user_pages() of DAX mappings. I.e. "limited" indicates no support -# for fork() of processes with MAP_SHARED mappings or support for -# direct-I/O to a DAX mapping. -config FS_DAX_LIMITED - bool - # Posix ACL utility routines # # Note: Posix ACLs can be implemented without these helpers. Never use diff --git a/fs/dax.c b/fs/dax.c index 7d4ecb9d23af..f4ffb6982270 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -449,9 +449,6 @@ static void dax_associate_entry(void *entry, struct address_space *mapping, if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) return; - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) - return; - index = linear_page_index(vma, address & ~(size - 1)); if (shared && (folio->mapping || dax_folio_is_shared(folio))) { if (folio->mapping) @@ -474,9 +471,6 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping, { struct folio *folio = dax_to_folio(entry); - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) - return; - if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) return; @@ -768,12 +762,6 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping, pgoff_t end_idx; XA_STATE(xas, &mapping->i_pages, start_idx); - /* - * In the 'limited' case get_user_pages() for dax is disabled. 
- */ - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) - return NULL; - if (!dax_mapping(mapping)) return NULL; diff --git a/mm/memremap.c b/mm/memremap.c index c417c843e9b1..c17e0a69cace 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -332,10 +332,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) } break; case MEMORY_DEVICE_FS_DAX: - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) { - WARN(1, "File system DAX not supported\n"); - return ERR_PTR(-EINVAL); - } params.pgprot = pgprot_decrypted(params.pgprot); break; case MEMORY_DEVICE_GENERIC: From d438d273417055241ebaaf1ba3be23459fc27cba Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Thu, 19 Jun 2025 18:58:03 +1000 Subject: [PATCH 103/370] mm: remove devmap related functions and page table bits MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Now that DAX and all other reference counts to ZONE_DEVICE pages are managed normally there is no need for the special devmap PTE/PMD/PUD page table bits. So drop all references to these, freeing up a software defined page table bit on architectures supporting it. Link: https://lkml.kernel.org/r/6389398c32cc9daa3dfcaa9f79c7972525d310ce.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple Acked-by: Will Deacon # arm64 Acked-by: David Hildenbrand Suggested-by: Chunyan Zhang Reviewed-by: Björn Töpel Reviewed-by: Jason Gunthorpe Cc: Balbir Singh Cc: Björn Töpel Cc: Christoph Hellwig Cc: Dan Williams Cc: Deepak Gupta Cc: Gerald Schaefer Cc: Inki Dae Cc: John Groves Cc: John Hubbard Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- Documentation/mm/arch_pgtable_helpers.rst | 6 -- arch/arm64/Kconfig | 1 - arch/arm64/include/asm/pgtable-prot.h | 1 - arch/arm64/include/asm/pgtable.h | 24 -------- arch/loongarch/Kconfig | 1 - arch/loongarch/include/asm/pgtable-bits.h | 6 +- arch/loongarch/include/asm/pgtable.h | 19 ------ arch/powerpc/Kconfig | 1 - arch/powerpc/include/asm/book3s/64/hash-4k.h | 6 -- arch/powerpc/include/asm/book3s/64/hash-64k.h | 7 +-- arch/powerpc/include/asm/book3s/64/pgtable.h | 53 +---------------- arch/powerpc/include/asm/book3s/64/radix.h | 14 +---- arch/riscv/Kconfig | 1 - arch/riscv/include/asm/pgtable-64.h | 16 ----- arch/riscv/include/asm/pgtable-bits.h | 1 - arch/riscv/include/asm/pgtable.h | 22 ------- arch/x86/Kconfig | 1 - arch/x86/include/asm/pgtable.h | 51 +--------------- arch/x86/include/asm/pgtable_types.h | 5 +- include/linux/mm.h | 7 --- include/linux/pgtable.h | 19 +----- mm/Kconfig | 4 -- mm/debug_vm_pgtable.c | 59 ------------------- mm/hmm.c | 3 +- mm/madvise.c | 8 +-- 25 files changed, 17 insertions(+), 319 deletions(-) diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst index af245161d8e7..c88c7fa665d6 100644 --- a/Documentation/mm/arch_pgtable_helpers.rst +++ b/Documentation/mm/arch_pgtable_helpers.rst @@ -30,8 +30,6 @@ PTE Page Table Helpers +---------------------------+--------------------------------------------------+ | pte_protnone | Tests a PROT_NONE PTE | +---------------------------+--------------------------------------------------+ -| pte_devmap | Tests a ZONE_DEVICE mapped PTE | -+---------------------------+--------------------------------------------------+ | pte_soft_dirty | Tests a soft dirty PTE | +---------------------------+--------------------------------------------------+ | pte_swp_soft_dirty | Tests a soft dirty swapped PTE | @@ -104,8 +102,6 @@ PMD Page Table Helpers 
+---------------------------+--------------------------------------------------+ | pmd_protnone | Tests a PROT_NONE PMD | +---------------------------+--------------------------------------------------+ -| pmd_devmap | Tests a ZONE_DEVICE mapped PMD | -+---------------------------+--------------------------------------------------+ | pmd_soft_dirty | Tests a soft dirty PMD | +---------------------------+--------------------------------------------------+ | pmd_swp_soft_dirty | Tests a soft dirty swapped PMD | @@ -177,8 +173,6 @@ PUD Page Table Helpers +---------------------------+--------------------------------------------------+ | pud_write | Tests a writable PUD | +---------------------------+--------------------------------------------------+ -| pud_devmap | Tests a ZONE_DEVICE mapped PUD | -+---------------------------+--------------------------------------------------+ | pud_mkyoung | Creates a young PUD | +---------------------------+--------------------------------------------------+ | pud_mkold | Creates an old PUD | diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 55fc331af337..94b48b1dae71 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -44,7 +44,6 @@ config ARM64 select ARCH_HAS_NONLEAF_PMD_YOUNG if ARM64_HAFT select ARCH_HAS_PREEMPT_LAZY select ARCH_HAS_PTDUMP - select ARCH_HAS_PTE_DEVMAP select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_HW_PTE_YOUNG select ARCH_HAS_SETUP_DMA_OPS diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h index 7830d031742e..85dceb1c66f4 100644 --- a/arch/arm64/include/asm/pgtable-prot.h +++ b/arch/arm64/include/asm/pgtable-prot.h @@ -17,7 +17,6 @@ #define PTE_SWP_EXCLUSIVE (_AT(pteval_t, 1) << 2) /* only for swp ptes */ #define PTE_DIRTY (_AT(pteval_t, 1) << 55) #define PTE_SPECIAL (_AT(pteval_t, 1) << 56) -#define PTE_DEVMAP (_AT(pteval_t, 1) << 57) /* * PTE_PRESENT_INVALID=1 & PTE_VALID=0 indicates that the pte's fields should be diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index e511f909f63c..ba63c8736666 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -190,7 +190,6 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys) #define pte_user(pte) (!!(pte_val(pte) & PTE_USER)) #define pte_user_exec(pte) (!(pte_val(pte) & PTE_UXN)) #define pte_cont(pte) (!!(pte_val(pte) & PTE_CONT)) -#define pte_devmap(pte) (!!(pte_val(pte) & PTE_DEVMAP)) #define pte_tagged(pte) ((pte_val(pte) & PTE_ATTRINDX_MASK) == \ PTE_ATTRINDX(MT_NORMAL_TAGGED)) @@ -372,11 +371,6 @@ static inline pmd_t pmd_mkcont(pmd_t pmd) return __pmd(pmd_val(pmd) | PMD_SECT_CONT); } -static inline pte_t pte_mkdevmap(pte_t pte) -{ - return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL)); -} - #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP static inline int pte_uffd_wp(pte_t pte) { @@ -653,14 +647,6 @@ static inline pmd_t pmd_mkhuge(pmd_t pmd) return __pmd((pmd_val(pmd) & ~mask) | val); } -#ifdef CONFIG_TRANSPARENT_HUGEPAGE -#define pmd_devmap(pmd) pte_devmap(pmd_pte(pmd)) -#endif -static inline pmd_t pmd_mkdevmap(pmd_t pmd) -{ - return pte_pmd(set_pte_bit(pmd_pte(pmd), __pgprot(PTE_DEVMAP))); -} - #ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP #define pmd_special(pte) (!!((pmd_val(pte) & PTE_SPECIAL))) static inline pmd_t pmd_mkspecial(pmd_t pmd) @@ -1302,16 +1288,6 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma, return __ptep_set_access_flags(vma, address, (pte_t *)pmdp, pmd_pte(entry), dirty); } - -static inline int pud_devmap(pud_t pud) -{ - 
return 0; -} - -static inline int pgd_devmap(pgd_t pgd) -{ - return 0; -} #endif #ifdef CONFIG_PAGE_TABLE_CHECK diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig index 4b19f93379a1..edb3db230bac 100644 --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -25,7 +25,6 @@ config LOONGARCH select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE select ARCH_HAS_PREEMPT_LAZY - select ARCH_HAS_PTE_DEVMAP select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_SET_MEMORY select ARCH_HAS_SET_DIRECT_MAP diff --git a/arch/loongarch/include/asm/pgtable-bits.h b/arch/loongarch/include/asm/pgtable-bits.h index 7bbfb04a54cc..2fc3789220ac 100644 --- a/arch/loongarch/include/asm/pgtable-bits.h +++ b/arch/loongarch/include/asm/pgtable-bits.h @@ -22,7 +22,6 @@ #define _PAGE_PFN_SHIFT 12 #define _PAGE_SWP_EXCLUSIVE_SHIFT 23 #define _PAGE_PFN_END_SHIFT 48 -#define _PAGE_DEVMAP_SHIFT 59 #define _PAGE_PRESENT_INVALID_SHIFT 60 #define _PAGE_NO_READ_SHIFT 61 #define _PAGE_NO_EXEC_SHIFT 62 @@ -36,7 +35,6 @@ #define _PAGE_MODIFIED (_ULCAST_(1) << _PAGE_MODIFIED_SHIFT) #define _PAGE_PROTNONE (_ULCAST_(1) << _PAGE_PROTNONE_SHIFT) #define _PAGE_SPECIAL (_ULCAST_(1) << _PAGE_SPECIAL_SHIFT) -#define _PAGE_DEVMAP (_ULCAST_(1) << _PAGE_DEVMAP_SHIFT) /* We borrow bit 23 to store the exclusive marker in swap PTEs. */ #define _PAGE_SWP_EXCLUSIVE (_ULCAST_(1) << _PAGE_SWP_EXCLUSIVE_SHIFT) @@ -76,8 +74,8 @@ #define __READABLE (_PAGE_VALID) #define __WRITEABLE (_PAGE_DIRTY | _PAGE_WRITE) -#define _PAGE_CHG_MASK (_PAGE_MODIFIED | _PAGE_SPECIAL | _PAGE_DEVMAP | _PFN_MASK | _CACHE_MASK | _PAGE_PLV) -#define _HPAGE_CHG_MASK (_PAGE_MODIFIED | _PAGE_SPECIAL | _PAGE_DEVMAP | _PFN_MASK | _CACHE_MASK | _PAGE_PLV | _PAGE_HUGE) +#define _PAGE_CHG_MASK (_PAGE_MODIFIED | _PAGE_SPECIAL | _PFN_MASK | _CACHE_MASK | _PAGE_PLV) +#define _HPAGE_CHG_MASK (_PAGE_MODIFIED | _PAGE_SPECIAL | _PFN_MASK | _CACHE_MASK | _PAGE_PLV | _PAGE_HUGE) #define PAGE_NONE __pgprot(_PAGE_PROTNONE | _PAGE_NO_READ | \ _PAGE_USER | _CACHE_CC) diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h index f2aeff544cee..bd128696e96d 100644 --- a/arch/loongarch/include/asm/pgtable.h +++ b/arch/loongarch/include/asm/pgtable.h @@ -409,9 +409,6 @@ static inline int pte_special(pte_t pte) { return pte_val(pte) & _PAGE_SPECIAL; static inline pte_t pte_mkspecial(pte_t pte) { pte_val(pte) |= _PAGE_SPECIAL; return pte; } #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */ -static inline int pte_devmap(pte_t pte) { return !!(pte_val(pte) & _PAGE_DEVMAP); } -static inline pte_t pte_mkdevmap(pte_t pte) { pte_val(pte) |= _PAGE_DEVMAP; return pte; } - #define pte_accessible pte_accessible static inline unsigned long pte_accessible(struct mm_struct *mm, pte_t a) { @@ -540,17 +537,6 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd) return pmd; } -static inline int pmd_devmap(pmd_t pmd) -{ - return !!(pmd_val(pmd) & _PAGE_DEVMAP); -} - -static inline pmd_t pmd_mkdevmap(pmd_t pmd) -{ - pmd_val(pmd) |= _PAGE_DEVMAP; - return pmd; -} - static inline struct page *pmd_page(pmd_t pmd) { if (pmd_trans_huge(pmd)) @@ -606,11 +592,6 @@ static inline long pmd_protnone(pmd_t pmd) #define pmd_leaf(pmd) ((pmd_val(pmd) & _PAGE_HUGE) != 0) #define pud_leaf(pud) ((pud_val(pud) & _PAGE_HUGE) != 0) -#ifdef CONFIG_TRANSPARENT_HUGEPAGE -#define pud_devmap(pud) (0) -#define pgd_devmap(pgd) (0) -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ - /* * We provide our own get_unmapped area to cope with the virtual aliasing * constraints placed on us by the cache 
architecture. diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index c3e0cc83f120..7a555c1d3171 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -149,7 +149,6 @@ config PPC select ARCH_HAS_PMEM_API select ARCH_HAS_PREEMPT_LAZY select ARCH_HAS_PTDUMP - select ARCH_HAS_PTE_DEVMAP if PPC_BOOK3S_64 select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64 select ARCH_HAS_SET_MEMORY diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h index aa90a048f319..7132392fa7cd 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h @@ -168,12 +168,6 @@ extern pmd_t hash__pmdp_huge_get_and_clear(struct mm_struct *mm, extern int hash__has_transparent_hugepage(void); #endif -static inline pmd_t hash__pmd_mkdevmap(pmd_t pmd) -{ - BUG(); - return pmd; -} - #endif /* !__ASSEMBLY__ */ #endif /* _ASM_POWERPC_BOOK3S_64_HASH_4K_H */ diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h index 0bf6fd0bf42a..0fb5b7da9478 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h @@ -259,7 +259,7 @@ static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array, */ static inline int hash__pmd_trans_huge(pmd_t pmd) { - return !!((pmd_val(pmd) & (_PAGE_PTE | H_PAGE_THP_HUGE | _PAGE_DEVMAP)) == + return !!((pmd_val(pmd) & (_PAGE_PTE | H_PAGE_THP_HUGE)) == (_PAGE_PTE | H_PAGE_THP_HUGE)); } @@ -281,11 +281,6 @@ extern pmd_t hash__pmdp_huge_get_and_clear(struct mm_struct *mm, extern int hash__has_transparent_hugepage(void); #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ -static inline pmd_t hash__pmd_mkdevmap(pmd_t pmd) -{ - return __pmd(pmd_val(pmd) | (_PAGE_PTE | H_PAGE_THP_HUGE | _PAGE_DEVMAP)); -} - #endif /* __ASSEMBLY__ */ #endif /* _ASM_POWERPC_BOOK3S_64_HASH_64K_H */ diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index a2ddcbb3fcb9..c19800365315 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -88,7 +88,6 @@ #define _PAGE_SOFT_DIRTY _RPAGE_SW3 /* software: software dirty tracking */ #define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */ -#define _PAGE_DEVMAP _RPAGE_SW1 /* software: ZONE_DEVICE page */ /* * Drivers request for cache inhibited pte mapping using _PAGE_NO_CACHE @@ -109,7 +108,7 @@ */ #define _HPAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_HPTEFLAGS | _PAGE_DIRTY | \ _PAGE_ACCESSED | H_PAGE_THP_HUGE | _PAGE_PTE | \ - _PAGE_SOFT_DIRTY | _PAGE_DEVMAP) + _PAGE_SOFT_DIRTY) /* * user access blocked by key */ @@ -123,7 +122,7 @@ */ #define _PAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_HPTEFLAGS | _PAGE_DIRTY | \ _PAGE_ACCESSED | _PAGE_SPECIAL | _PAGE_PTE | \ - _PAGE_SOFT_DIRTY | _PAGE_DEVMAP) + _PAGE_SOFT_DIRTY) /* * We define 2 sets of base prot bits, one for basic pages (ie, @@ -609,24 +608,6 @@ static inline pte_t pte_mkhuge(pte_t pte) return pte; } -static inline pte_t pte_mkdevmap(pte_t pte) -{ - return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_SPECIAL | _PAGE_DEVMAP)); -} - -/* - * This is potentially called with a pmd as the argument, in which case it's not - * safe to check _PAGE_DEVMAP unless we also confirm that _PAGE_PTE is set. - * That's because the bit we use for _PAGE_DEVMAP is not reserved for software - * use in page directory entries (ie. non-ptes). 
- */ -static inline int pte_devmap(pte_t pte) -{ - __be64 mask = cpu_to_be64(_PAGE_DEVMAP | _PAGE_PTE); - - return (pte_raw(pte) & mask) == mask; -} - static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { /* FIXME!! check whether this need to be a conditional */ @@ -1379,36 +1360,6 @@ static inline bool arch_needs_pgtable_deposit(void) } extern void serialize_against_pte_lookup(struct mm_struct *mm); - -static inline pmd_t pmd_mkdevmap(pmd_t pmd) -{ - if (radix_enabled()) - return radix__pmd_mkdevmap(pmd); - return hash__pmd_mkdevmap(pmd); -} - -static inline pud_t pud_mkdevmap(pud_t pud) -{ - if (radix_enabled()) - return radix__pud_mkdevmap(pud); - BUG(); - return pud; -} - -static inline int pmd_devmap(pmd_t pmd) -{ - return pte_devmap(pmd_pte(pmd)); -} - -static inline int pud_devmap(pud_t pud) -{ - return pte_devmap(pud_pte(pud)); -} - -static inline int pgd_devmap(pgd_t pgd) -{ - return 0; -} #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h index 8f55ff74bb68..df23a8267e4d 100644 --- a/arch/powerpc/include/asm/book3s/64/radix.h +++ b/arch/powerpc/include/asm/book3s/64/radix.h @@ -264,7 +264,7 @@ static inline int radix__p4d_bad(p4d_t p4d) static inline int radix__pmd_trans_huge(pmd_t pmd) { - return (pmd_val(pmd) & (_PAGE_PTE | _PAGE_DEVMAP)) == _PAGE_PTE; + return (pmd_val(pmd) & _PAGE_PTE) == _PAGE_PTE; } static inline pmd_t radix__pmd_mkhuge(pmd_t pmd) @@ -274,7 +274,7 @@ static inline pmd_t radix__pmd_mkhuge(pmd_t pmd) static inline int radix__pud_trans_huge(pud_t pud) { - return (pud_val(pud) & (_PAGE_PTE | _PAGE_DEVMAP)) == _PAGE_PTE; + return (pud_val(pud) & _PAGE_PTE) == _PAGE_PTE; } static inline pud_t radix__pud_mkhuge(pud_t pud) @@ -315,16 +315,6 @@ static inline int radix__has_transparent_pud_hugepage(void) } #endif -static inline pmd_t radix__pmd_mkdevmap(pmd_t pmd) -{ - return __pmd(pmd_val(pmd) | (_PAGE_PTE | _PAGE_DEVMAP)); -} - -static inline pud_t radix__pud_mkdevmap(pud_t pud) -{ - return __pud(pud_val(pud) | (_PAGE_PTE | _PAGE_DEVMAP)); -} - struct vmem_altmap; struct dev_pagemap; extern int __meminit radix__vmemmap_create_mapping(unsigned long start, diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index d71ea0f4466f..23df26f39472 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -46,7 +46,6 @@ config RISCV select ARCH_HAS_PREEMPT_LAZY select ARCH_HAS_PREPARE_SYNC_CORE_CMD select ARCH_HAS_PTDUMP if MMU - select ARCH_HAS_PTE_DEVMAP if 64BIT && MMU select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_SET_DIRECT_MAP if MMU select ARCH_HAS_SET_MEMORY if MMU diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h index 7de05db7d3bd..1018d2216901 100644 --- a/arch/riscv/include/asm/pgtable-64.h +++ b/arch/riscv/include/asm/pgtable-64.h @@ -397,24 +397,8 @@ static inline struct page *pgd_page(pgd_t pgd) p4d_t *p4d_offset(pgd_t *pgd, unsigned long address); #ifdef CONFIG_TRANSPARENT_HUGEPAGE -static inline int pte_devmap(pte_t pte); static inline pte_t pmd_pte(pmd_t pmd); static inline pte_t pud_pte(pud_t pud); - -static inline int pmd_devmap(pmd_t pmd) -{ - return pte_devmap(pmd_pte(pmd)); -} - -static inline int pud_devmap(pud_t pud) -{ - return pte_devmap(pud_pte(pud)); -} - -static inline int pgd_devmap(pgd_t pgd) -{ - return 0; -} #endif #endif /* _ASM_RISCV_PGTABLE_64_H */ diff --git a/arch/riscv/include/asm/pgtable-bits.h b/arch/riscv/include/asm/pgtable-bits.h index 
a8f5205cea54..179bd4afece4 100644 --- a/arch/riscv/include/asm/pgtable-bits.h +++ b/arch/riscv/include/asm/pgtable-bits.h @@ -19,7 +19,6 @@ #define _PAGE_SOFT (3 << 8) /* Reserved for software */ #define _PAGE_SPECIAL (1 << 8) /* RSW: 0x1 */ -#define _PAGE_DEVMAP (1 << 9) /* RSW, devmap */ #define _PAGE_TABLE _PAGE_PRESENT /* diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h index 5bd5aae60d53..91697fbf1f90 100644 --- a/arch/riscv/include/asm/pgtable.h +++ b/arch/riscv/include/asm/pgtable.h @@ -409,13 +409,6 @@ static inline int pte_special(pte_t pte) return pte_val(pte) & _PAGE_SPECIAL; } -#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP -static inline int pte_devmap(pte_t pte) -{ - return pte_val(pte) & _PAGE_DEVMAP; -} -#endif - /* static inline pte_t pte_rdprotect(pte_t pte) */ static inline pte_t pte_wrprotect(pte_t pte) @@ -457,11 +450,6 @@ static inline pte_t pte_mkspecial(pte_t pte) return __pte(pte_val(pte) | _PAGE_SPECIAL); } -static inline pte_t pte_mkdevmap(pte_t pte) -{ - return __pte(pte_val(pte) | _PAGE_DEVMAP); -} - static inline pte_t pte_mkhuge(pte_t pte) { return pte; @@ -790,11 +778,6 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd) return pte_pmd(pte_mkdirty(pmd_pte(pmd))); } -static inline pmd_t pmd_mkdevmap(pmd_t pmd) -{ - return pte_pmd(pte_mkdevmap(pmd_pte(pmd))); -} - #ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP static inline bool pmd_special(pmd_t pmd) { @@ -946,11 +929,6 @@ static inline pud_t pud_mkhuge(pud_t pud) return pud; } -static inline pud_t pud_mkdevmap(pud_t pud) -{ - return pte_pud(pte_mkdevmap(pud_pte(pud))); -} - static inline int pudp_set_access_flags(struct vm_area_struct *vma, unsigned long address, pud_t *pudp, pud_t entry, int dirty) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 71019b3b54ea..bb9b63d76a19 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -101,7 +101,6 @@ config X86 select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE select ARCH_HAS_PMEM_API if X86_64 select ARCH_HAS_PREEMPT_LAZY - select ARCH_HAS_PTE_DEVMAP if X86_64 select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_HW_PTE_YOUNG select ARCH_HAS_NONLEAF_PMD_YOUNG if PGTABLE_LEVELS > 2 diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 97954c936c54..e33df3da6980 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -301,16 +301,15 @@ static inline bool pmd_leaf(pmd_t pte) } #ifdef CONFIG_TRANSPARENT_HUGEPAGE -/* NOTE: when predicate huge page, consider also pmd_devmap, or use pmd_leaf */ static inline int pmd_trans_huge(pmd_t pmd) { - return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE; + return (pmd_val(pmd) & _PAGE_PSE) == _PAGE_PSE; } #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD static inline int pud_trans_huge(pud_t pud) { - return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE; + return (pud_val(pud) & _PAGE_PSE) == _PAGE_PSE; } #endif @@ -320,24 +319,6 @@ static inline int has_transparent_hugepage(void) return boot_cpu_has(X86_FEATURE_PSE); } -#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP -static inline int pmd_devmap(pmd_t pmd) -{ - return !!(pmd_val(pmd) & _PAGE_DEVMAP); -} - -#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD -static inline int pud_devmap(pud_t pud) -{ - return !!(pud_val(pud) & _PAGE_DEVMAP); -} -#else -static inline int pud_devmap(pud_t pud) -{ - return 0; -} -#endif - #ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP static inline bool pmd_special(pmd_t pmd) { @@ -361,12 +342,6 @@ static inline pud_t pud_mkspecial(pud_t pud) return pud_set_flags(pud, _PAGE_SPECIAL); } #endif /* 
CONFIG_ARCH_SUPPORTS_PUD_PFNMAP */ - -static inline int pgd_devmap(pgd_t pgd) -{ - return 0; -} -#endif #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ static inline pte_t pte_set_flags(pte_t pte, pteval_t set) @@ -527,11 +502,6 @@ static inline pte_t pte_mkspecial(pte_t pte) return pte_set_flags(pte, _PAGE_SPECIAL); } -static inline pte_t pte_mkdevmap(pte_t pte) -{ - return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP); -} - /* See comments above mksaveddirty_shift() */ static inline pmd_t pmd_mksaveddirty(pmd_t pmd) { @@ -603,11 +573,6 @@ static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd) return pmd_set_flags(pmd, _PAGE_DIRTY); } -static inline pmd_t pmd_mkdevmap(pmd_t pmd) -{ - return pmd_set_flags(pmd, _PAGE_DEVMAP); -} - static inline pmd_t pmd_mkhuge(pmd_t pmd) { return pmd_set_flags(pmd, _PAGE_PSE); @@ -673,11 +638,6 @@ static inline pud_t pud_mkdirty(pud_t pud) return pud_mksaveddirty(pud); } -static inline pud_t pud_mkdevmap(pud_t pud) -{ - return pud_set_flags(pud, _PAGE_DEVMAP); -} - static inline pud_t pud_mkhuge(pud_t pud) { return pud_set_flags(pud, _PAGE_PSE); @@ -1008,13 +968,6 @@ static inline int pte_present(pte_t a) return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE); } -#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP -static inline int pte_devmap(pte_t a) -{ - return (pte_flags(a) & _PAGE_DEVMAP) == _PAGE_DEVMAP; -} -#endif - #define pte_accessible pte_accessible static inline bool pte_accessible(struct mm_struct *mm, pte_t a) { diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index b74ec5c3643b..f63ae8d0aac8 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -34,7 +34,6 @@ #define _PAGE_BIT_UFFD_WP _PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */ #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */ #define _PAGE_BIT_KERNEL_4K _PAGE_BIT_SOFTW3 /* page must not be converted to large */ -#define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4 #ifdef CONFIG_X86_64 #define _PAGE_BIT_SAVED_DIRTY _PAGE_BIT_SOFTW5 /* Saved Dirty bit (leaf) */ @@ -121,11 +120,9 @@ #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) #define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX) -#define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP) #define _PAGE_SOFTW4 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW4) #else #define _PAGE_NX (_AT(pteval_t, 0)) -#define _PAGE_DEVMAP (_AT(pteval_t, 0)) #define _PAGE_SOFTW4 (_AT(pteval_t, 0)) #endif @@ -154,7 +151,7 @@ #define _COMMON_PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \ _PAGE_SPECIAL | _PAGE_ACCESSED | \ _PAGE_DIRTY_BITS | _PAGE_SOFT_DIRTY | \ - _PAGE_DEVMAP | _PAGE_CC | _PAGE_UFFD_WP) + _PAGE_CC | _PAGE_UFFD_WP) #define _PAGE_CHG_MASK (_COMMON_PAGE_CHG_MASK | _PAGE_PAT) #define _HPAGE_CHG_MASK (_COMMON_PAGE_CHG_MASK | _PAGE_PSE | _PAGE_PAT_LARGE) diff --git a/include/linux/mm.h b/include/linux/mm.h index fc365420dfa8..4d833f159988 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2704,13 +2704,6 @@ static inline pud_t pud_mkspecial(pud_t pud) } #endif /* CONFIG_ARCH_SUPPORTS_PUD_PFNMAP */ -#ifndef CONFIG_ARCH_HAS_PTE_DEVMAP -static inline int pte_devmap(pte_t pte) -{ - return 0; -} -#endif - extern pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr, spinlock_t **ptl); static inline pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr, diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index ffcd966cf2d4..cf1515c163e2 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1643,21 +1643,6 @@ static inline int 
pud_write(pud_t pud) } #endif /* pud_write */ -#if !defined(CONFIG_ARCH_HAS_PTE_DEVMAP) || !defined(CONFIG_TRANSPARENT_HUGEPAGE) -static inline int pmd_devmap(pmd_t pmd) -{ - return 0; -} -static inline int pud_devmap(pud_t pud) -{ - return 0; -} -static inline int pgd_devmap(pgd_t pgd) -{ - return 0; -} -#endif - #if !defined(CONFIG_TRANSPARENT_HUGEPAGE) || \ !defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) static inline int pud_trans_huge(pud_t pud) @@ -1912,8 +1897,8 @@ typedef unsigned int pgtbl_mod_mask; * - It should contain a huge PFN, which points to a huge page larger than * PAGE_SIZE of the platform. The PFN format isn't important here. * - * - It should cover all kinds of huge mappings (e.g., pXd_trans_huge(), - * pXd_devmap(), or hugetlb mappings). + * - It should cover all kinds of huge mappings (i.e. pXd_trans_huge() + * or hugetlb mappings). */ #ifndef pgd_leaf #define pgd_leaf(x) false diff --git a/mm/Kconfig b/mm/Kconfig index 065b1f19dd99..d5d4eca947a6 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1117,9 +1117,6 @@ config ARCH_HAS_CURRENT_STACK_POINTER register alias named "current_stack_pointer", this config can be selected. -config ARCH_HAS_PTE_DEVMAP - bool - config ARCH_HAS_ZONE_DMA_SET bool @@ -1137,7 +1134,6 @@ config ZONE_DEVICE depends on MEMORY_HOTPLUG depends on MEMORY_HOTREMOVE depends on SPARSEMEM_VMEMMAP - depends on ARCH_HAS_PTE_DEVMAP select XARRAY_MULTI help diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 7731b238b534..d84d0c49012f 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -348,12 +348,6 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args) vaddr &= HPAGE_PUD_MASK; pud = pfn_pud(args->pud_pfn, args->page_prot); - /* - * Some architectures have debug checks to make sure - * huge pud mapping are only found with devmap entries - * For now test with only devmap entries. 
- */ - pud = pud_mkdevmap(pud); set_pud_at(args->mm, vaddr, args->pudp, pud); flush_dcache_page(page); pudp_set_wrprotect(args->mm, vaddr, args->pudp); @@ -366,7 +360,6 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args) WARN_ON(!pud_none(pud)); #endif /* __PAGETABLE_PMD_FOLDED */ pud = pfn_pud(args->pud_pfn, args->page_prot); - pud = pud_mkdevmap(pud); pud = pud_wrprotect(pud); pud = pud_mkclean(pud); set_pud_at(args->mm, vaddr, args->pudp, pud); @@ -384,7 +377,6 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args) #endif /* __PAGETABLE_PMD_FOLDED */ pud = pfn_pud(args->pud_pfn, args->page_prot); - pud = pud_mkdevmap(pud); pud = pud_mkyoung(pud); set_pud_at(args->mm, vaddr, args->pudp, pud); flush_dcache_page(page); @@ -693,53 +685,6 @@ static void __init pmd_protnone_tests(struct pgtable_debug_args *args) static void __init pmd_protnone_tests(struct pgtable_debug_args *args) { } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ -#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP -static void __init pte_devmap_tests(struct pgtable_debug_args *args) -{ - pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot); - - pr_debug("Validating PTE devmap\n"); - WARN_ON(!pte_devmap(pte_mkdevmap(pte))); -} - -#ifdef CONFIG_TRANSPARENT_HUGEPAGE -static void __init pmd_devmap_tests(struct pgtable_debug_args *args) -{ - pmd_t pmd; - - if (!has_transparent_hugepage()) - return; - - pr_debug("Validating PMD devmap\n"); - pmd = pfn_pmd(args->fixed_pmd_pfn, args->page_prot); - WARN_ON(!pmd_devmap(pmd_mkdevmap(pmd))); -} - -#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD -static void __init pud_devmap_tests(struct pgtable_debug_args *args) -{ - pud_t pud; - - if (!has_transparent_pud_hugepage()) - return; - - pr_debug("Validating PUD devmap\n"); - pud = pfn_pud(args->fixed_pud_pfn, args->page_prot); - WARN_ON(!pud_devmap(pud_mkdevmap(pud))); -} -#else /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ -static void __init pud_devmap_tests(struct pgtable_debug_args *args) { } -#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ -#else /* CONFIG_TRANSPARENT_HUGEPAGE */ -static void __init pmd_devmap_tests(struct pgtable_debug_args *args) { } -static void __init pud_devmap_tests(struct pgtable_debug_args *args) { } -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ -#else -static void __init pte_devmap_tests(struct pgtable_debug_args *args) { } -static void __init pmd_devmap_tests(struct pgtable_debug_args *args) { } -static void __init pud_devmap_tests(struct pgtable_debug_args *args) { } -#endif /* CONFIG_ARCH_HAS_PTE_DEVMAP */ - static void __init pte_soft_dirty_tests(struct pgtable_debug_args *args) { pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot); @@ -1333,10 +1278,6 @@ static int __init debug_vm_pgtable(void) pte_protnone_tests(&args); pmd_protnone_tests(&args); - pte_devmap_tests(&args); - pmd_devmap_tests(&args); - pud_devmap_tests(&args); - pte_soft_dirty_tests(&args); pmd_soft_dirty_tests(&args); pte_swap_soft_dirty_tests(&args); diff --git a/mm/hmm.c b/mm/hmm.c index 62d3082dc55c..f2415b4b2cdd 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -405,8 +405,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, return 0; } -#if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && \ - defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) +#if defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) static inline unsigned long pud_to_hmm_pfn_flags(struct hmm_range *range, pud_t pud) { diff --git a/mm/madvise.c b/mm/madvise.c index 92f427b1b330..070132f9842b 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1069,7 +1069,7 @@ 
static int guard_install_pud_entry(pud_t *pud, unsigned long addr,
 	pud_t pudval = pudp_get(pud);
 
 	/* If huge return >0 so we abort the operation + zap. */
-	return pud_trans_huge(pudval) || pud_devmap(pudval);
+	return pud_trans_huge(pudval);
 }
 
 static int guard_install_pmd_entry(pmd_t *pmd, unsigned long addr,
@@ -1078,7 +1078,7 @@ static int guard_install_pmd_entry(pmd_t *pmd, unsigned long addr,
 	pmd_t pmdval = pmdp_get(pmd);
 
 	/* If huge return >0 so we abort the operation + zap. */
-	return pmd_trans_huge(pmdval) || pmd_devmap(pmdval);
+	return pmd_trans_huge(pmdval);
 }
 
 static int guard_install_pte_entry(pte_t *pte, unsigned long addr,
@@ -1189,7 +1189,7 @@ static int guard_remove_pud_entry(pud_t *pud, unsigned long addr,
 	pud_t pudval = pudp_get(pud);
 
 	/* If huge, cannot have guard pages present, so no-op - skip. */
-	if (pud_trans_huge(pudval) || pud_devmap(pudval))
+	if (pud_trans_huge(pudval))
 		walk->action = ACTION_CONTINUE;
 
 	return 0;
@@ -1201,7 +1201,7 @@ static int guard_remove_pmd_entry(pmd_t *pmd, unsigned long addr,
 	pmd_t pmdval = pmdp_get(pmd);
 
 	/* If huge, cannot have guard pages present, so no-op - skip. */
-	if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval))
+	if (pmd_trans_huge(pmdval))
 		walk->action = ACTION_CONTINUE;
 
 	return 0;

From 984921edea68bf24bcc87e1317bfc90451ff46c6 Mon Sep 17 00:00:00 2001
From: Alistair Popple
Date: Thu, 19 Jun 2025 18:58:04 +1000
Subject: [PATCH 104/370] mm: remove PFN_DEV, PFN_MAP, PFN_SPECIAL,
 PFN_SG_CHAIN and PFN_SG_LAST
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The PFN_MAP flag is no longer used for anything, so remove it. The
PFN_SG_CHAIN and PFN_SG_LAST flags never appear to have been used, so
also remove them.

The last user of PFN_SPECIAL was removed by 653d7825c149 ("dcssblk:
mark DAX broken, remove FS_DAX_LIMITED support").

Users of PFN_DEV were removed earlier in this series by "mm: Remove
remaining uses of PFN_DEV".
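For background, the flag encoding can be shown with a small userspace
model. This is an illustrative sketch only, not kernel code; it
assumes 4K pages and 64-bit longs, and the "_model" names are
stand-ins. The bit positions and the old pfn_t_has_page() logic are
taken from the removed code: with PFN_DEV and PFN_MAP gone, the check
is unconditionally true, which is what this patch makes explicit:

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	#define BITS_PER_LONG_LONG	64

	/* The removed flags lived in the top bits of the 64-bit value. */
	#define PFN_SG_CHAIN	(1ULL << (BITS_PER_LONG_LONG - 1))
	#define PFN_SG_LAST	(1ULL << (BITS_PER_LONG_LONG - 2))
	#define PFN_DEV		(1ULL << (BITS_PER_LONG_LONG - 3))
	#define PFN_MAP		(1ULL << (BITS_PER_LONG_LONG - 4))
	#define PFN_SPECIAL	(1ULL << (BITS_PER_LONG_LONG - 5))

	typedef struct { uint64_t val; } pfn_t_model;

	/* Old logic: a struct page exists if PFN_MAP is set or PFN_DEV
	 * is clear.  With no caller able to set either flag any more,
	 * this always returns true. */
	static bool pfn_t_has_page_model(pfn_t_model pfn)
	{
		return (pfn.val & PFN_MAP) == PFN_MAP ||
		       (pfn.val & PFN_DEV) == 0;
	}

	int main(void)
	{
		pfn_t_model pfn = { .val = 0x1234 };	/* plain pfn, no flags */

		printf("has page: %d\n", pfn_t_has_page_model(pfn));
		return 0;
	}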
Link: https://lkml.kernel.org/r/670b3950d70b4d97b905bb597dadfd3633de4314.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple Acked-by: David Hildenbrand Reviewed-by: Christoph Hellwig Reviewed-by: Jason Gunthorpe Reviewed-by: Dan Williams Cc: Balbir Singh Cc: Björn Töpel Cc: Björn Töpel Cc: Chunyan Zhang Cc: Deepak Gupta Cc: Gerald Schaefer Cc: Inki Dae Cc: John Groves Cc: John Hubbard Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/pfn_t.h | 50 ++----------------------------- mm/memory.c | 2 -- tools/testing/nvdimm/test/iomap.c | 4 --- 3 files changed, 2 insertions(+), 54 deletions(-) diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h index 2d9148221e9a..2c002936a0f6 100644 --- a/include/linux/pfn_t.h +++ b/include/linux/pfn_t.h @@ -5,26 +5,11 @@ /* * PFN_FLAGS_MASK - mask of all the possible valid pfn_t flags - * PFN_SG_CHAIN - pfn is a pointer to the next scatterlist entry - * PFN_SG_LAST - pfn references a page and is the last scatterlist entry - * PFN_DEV - pfn is not covered by system memmap by default - * PFN_MAP - pfn has a dynamic page mapping established by a device driver - * PFN_SPECIAL - for CONFIG_FS_DAX_LIMITED builds to allow XIP, but not - * get_user_pages */ #define PFN_FLAGS_MASK (((u64) (~PAGE_MASK)) << (BITS_PER_LONG_LONG - PAGE_SHIFT)) -#define PFN_SG_CHAIN (1ULL << (BITS_PER_LONG_LONG - 1)) -#define PFN_SG_LAST (1ULL << (BITS_PER_LONG_LONG - 2)) -#define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3)) -#define PFN_MAP (1ULL << (BITS_PER_LONG_LONG - 4)) -#define PFN_SPECIAL (1ULL << (BITS_PER_LONG_LONG - 5)) #define PFN_FLAGS_TRACE \ - { PFN_SPECIAL, "SPECIAL" }, \ - { PFN_SG_CHAIN, "SG_CHAIN" }, \ - { PFN_SG_LAST, "SG_LAST" }, \ - { PFN_DEV, "DEV" }, \ - { PFN_MAP, "MAP" } + { } static inline pfn_t __pfn_to_pfn_t(unsigned long pfn, u64 flags) { @@ -46,7 +31,7 @@ static inline pfn_t phys_to_pfn_t(phys_addr_t addr, u64 flags) static inline bool pfn_t_has_page(pfn_t pfn) { - return (pfn.val & PFN_MAP) == PFN_MAP || (pfn.val & PFN_DEV) == 0; + return true; } static inline unsigned long pfn_t_to_pfn(pfn_t pfn) @@ -97,35 +82,4 @@ static inline pud_t pfn_t_pud(pfn_t pfn, pgprot_t pgprot) #endif #endif -#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP -static inline bool pfn_t_devmap(pfn_t pfn) -{ - const u64 flags = PFN_DEV|PFN_MAP; - - return (pfn.val & flags) == flags; -} -#else -static inline bool pfn_t_devmap(pfn_t pfn) -{ - return false; -} -pte_t pte_mkdevmap(pte_t pte); -pmd_t pmd_mkdevmap(pmd_t pmd); -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \ - defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) -pud_t pud_mkdevmap(pud_t pud); -#endif -#endif /* CONFIG_ARCH_HAS_PTE_DEVMAP */ - -#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL -static inline bool pfn_t_special(pfn_t pfn) -{ - return (pfn.val & PFN_SPECIAL) == PFN_SPECIAL; -} -#else -static inline bool pfn_t_special(pfn_t pfn) -{ - return false; -} -#endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */ #endif /* _LINUX_PFN_T_H_ */ diff --git a/mm/memory.c b/mm/memory.c index 150bb62855b1..e932a007af4c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2581,8 +2581,6 @@ static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn, bool mkwrite) /* these checks mirror the abort conditions in vm_normal_page */ if (vma->vm_flags & VM_MIXEDMAP) return true; - if (pfn_t_special(pfn)) - return true; if (is_zero_pfn(pfn_t_to_pfn(pfn))) return true; return false; diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c index e4313726fae3..ddceb04b4a9a 100644 
--- a/tools/testing/nvdimm/test/iomap.c +++ b/tools/testing/nvdimm/test/iomap.c @@ -137,10 +137,6 @@ EXPORT_SYMBOL_GPL(__wrap_devm_memremap_pages); pfn_t __wrap_phys_to_pfn_t(phys_addr_t addr, unsigned long flags) { - struct nfit_test_resource *nfit_res = get_nfit_res(addr); - - if (nfit_res) - flags &= ~PFN_MAP; return phys_to_pfn_t(addr, flags); } EXPORT_SYMBOL(__wrap_phys_to_pfn_t); From 21aa65bf82a78c1e70447a45a85e533689b7f1a7 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Thu, 19 Jun 2025 18:58:05 +1000 Subject: [PATCH 105/370] mm: remove callers of pfn_t functionality MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit All PFN_* pfn_t flags have been removed. Therefore there is no longer a need for the pfn_t type and all uses can be replaced with normal pfns. Link: https://lkml.kernel.org/r/bbedfa576c9822f8032494efbe43544628698b1f.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple Reviewed-by: Christoph Hellwig Reviewed-by: Jason Gunthorpe Acked-by: David Hildenbrand Cc: Balbir Singh Cc: Björn Töpel Cc: Björn Töpel Cc: Chunyan Zhang Cc: Dan Williams Cc: Deepak Gupta Cc: Gerald Schaefer Cc: Inki Dae Cc: John Groves Cc: John Hubbard Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/mm/pat/memtype.c | 1 - drivers/dax/device.c | 23 +++---- drivers/dax/hmem/hmem.c | 1 - drivers/dax/kmem.c | 1 - drivers/dax/pmem.c | 1 - drivers/dax/super.c | 3 +- drivers/gpu/drm/exynos/exynos_drm_gem.c | 1 - drivers/gpu/drm/gma500/fbdev.c | 3 +- drivers/gpu/drm/i915/gem/i915_gem_mman.c | 1 - drivers/gpu/drm/msm/msm_gem.c | 1 - drivers/gpu/drm/omapdrm/omap_gem.c | 6 +- drivers/gpu/drm/v3d/v3d_bo.c | 1 - drivers/hwtracing/intel_th/msu.c | 3 +- drivers/md/dm-linear.c | 2 +- drivers/md/dm-log-writes.c | 2 +- drivers/md/dm-stripe.c | 2 +- drivers/md/dm-target.c | 2 +- drivers/md/dm-writecache.c | 11 ++- drivers/md/dm.c | 2 +- drivers/nvdimm/pmem.c | 8 +-- drivers/nvdimm/pmem.h | 4 +- drivers/s390/block/dcssblk.c | 9 ++- drivers/vfio/pci/vfio_pci_core.c | 5 +- fs/cramfs/inode.c | 5 +- fs/dax.c | 50 +++++++------- fs/ext4/file.c | 2 +- fs/fuse/dax.c | 3 +- fs/fuse/virtio_fs.c | 5 +- fs/xfs/xfs_file.c | 2 +- include/linux/dax.h | 9 +-- include/linux/device-mapper.h | 2 +- include/linux/huge_mm.h | 6 +- include/linux/mm.h | 4 +- include/linux/pfn.h | 9 --- include/linux/pfn_t.h | 85 ------------------------ mm/debug_vm_pgtable.c | 1 - mm/huge_memory.c | 21 +++--- mm/memory.c | 31 +++++---- mm/memremap.c | 1 - mm/migrate.c | 1 - tools/testing/nvdimm/pmem-dax.c | 6 +- tools/testing/nvdimm/test/iomap.c | 7 -- tools/testing/nvdimm/test/nfit_test.h | 1 - 43 files changed, 109 insertions(+), 235 deletions(-) delete mode 100644 include/linux/pfn_t.h diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c index 2e7923844afe..c09284302dd3 100644 --- a/arch/x86/mm/pat/memtype.c +++ b/arch/x86/mm/pat/memtype.c @@ -36,7 +36,6 @@ #include #include #include -#include #include #include #include diff --git a/drivers/dax/device.c b/drivers/dax/device.c index 328231cfb028..2bb40a6060af 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -4,7 +4,6 @@ #include #include #include -#include #include #include #include @@ -73,7 +72,7 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, return -1; } -static void dax_set_mapping(struct vm_fault *vmf, pfn_t pfn, +static void dax_set_mapping(struct vm_fault *vmf, unsigned long pfn, unsigned long fault_size) { unsigned long i, 
nr_pages = fault_size / PAGE_SIZE; @@ -89,7 +88,7 @@ static void dax_set_mapping(struct vm_fault *vmf, pfn_t pfn, ALIGN_DOWN(vmf->address, fault_size)); for (i = 0; i < nr_pages; i++) { - struct folio *folio = pfn_folio(pfn_t_to_pfn(pfn) + i); + struct folio *folio = pfn_folio(pfn + i); if (folio->mapping) continue; @@ -104,7 +103,7 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax, { struct device *dev = &dev_dax->dev; phys_addr_t phys; - pfn_t pfn; + unsigned long pfn; unsigned int fault_size = PAGE_SIZE; if (check_vma(dev_dax, vmf->vma, __func__)) @@ -125,11 +124,11 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax, return VM_FAULT_SIGBUS; } - pfn = phys_to_pfn_t(phys, 0); + pfn = PHYS_PFN(phys); dax_set_mapping(vmf, pfn, fault_size); - return vmf_insert_page_mkwrite(vmf, pfn_t_to_page(pfn), + return vmf_insert_page_mkwrite(vmf, pfn_to_page(pfn), vmf->flags & FAULT_FLAG_WRITE); } @@ -140,7 +139,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax, struct device *dev = &dev_dax->dev; phys_addr_t phys; pgoff_t pgoff; - pfn_t pfn; + unsigned long pfn; unsigned int fault_size = PMD_SIZE; if (check_vma(dev_dax, vmf->vma, __func__)) @@ -169,11 +168,11 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax, return VM_FAULT_SIGBUS; } - pfn = phys_to_pfn_t(phys, 0); + pfn = PHYS_PFN(phys); dax_set_mapping(vmf, pfn, fault_size); - return vmf_insert_folio_pmd(vmf, page_folio(pfn_t_to_page(pfn)), + return vmf_insert_folio_pmd(vmf, page_folio(pfn_to_page(pfn)), vmf->flags & FAULT_FLAG_WRITE); } @@ -185,7 +184,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax, struct device *dev = &dev_dax->dev; phys_addr_t phys; pgoff_t pgoff; - pfn_t pfn; + unsigned long pfn; unsigned int fault_size = PUD_SIZE; @@ -215,11 +214,11 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax, return VM_FAULT_SIGBUS; } - pfn = phys_to_pfn_t(phys, 0); + pfn = PHYS_PFN(phys); dax_set_mapping(vmf, pfn, fault_size); - return vmf_insert_folio_pud(vmf, page_folio(pfn_t_to_page(pfn)), + return vmf_insert_folio_pud(vmf, page_folio(pfn_to_page(pfn)), vmf->flags & FAULT_FLAG_WRITE); } #else diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c index 5e7c53f18491..c18451a37e4f 100644 --- a/drivers/dax/hmem/hmem.c +++ b/drivers/dax/hmem/hmem.c @@ -2,7 +2,6 @@ #include #include #include -#include #include #include "../bus.h" diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 584c70a34b52..c036e4d0b610 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -5,7 +5,6 @@ #include #include #include -#include #include #include #include diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c index c8ebf4e281f2..bee93066a849 100644 --- a/drivers/dax/pmem.c +++ b/drivers/dax/pmem.c @@ -2,7 +2,6 @@ /* Copyright(c) 2016 - 2018 Intel Corporation. All rights reserved. */ #include #include -#include #include "../nvdimm/pfn.h" #include "../nvdimm/nd.h" #include "bus.h" diff --git a/drivers/dax/super.c b/drivers/dax/super.c index e16d1d40d773..54c480e874cb 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -7,7 +7,6 @@ #include #include #include -#include #include #include #include @@ -148,7 +147,7 @@ enum dax_device_flags { * pages accessible at the device relative @pgoff. 
*/ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages, - enum dax_access_mode mode, void **kaddr, pfn_t *pfn) + enum dax_access_mode mode, void **kaddr, unsigned long *pfn) { long avail; diff --git a/drivers/gpu/drm/exynos/exynos_drm_gem.c b/drivers/gpu/drm/exynos/exynos_drm_gem.c index d44401a695e2..e3fbb45f37a2 100644 --- a/drivers/gpu/drm/exynos/exynos_drm_gem.c +++ b/drivers/gpu/drm/exynos/exynos_drm_gem.c @@ -7,7 +7,6 @@ #include -#include #include #include diff --git a/drivers/gpu/drm/gma500/fbdev.c b/drivers/gpu/drm/gma500/fbdev.c index 109efdc96ac5..68b825fc056e 100644 --- a/drivers/gpu/drm/gma500/fbdev.c +++ b/drivers/gpu/drm/gma500/fbdev.c @@ -6,7 +6,6 @@ **************************************************************************/ #include -#include #include #include @@ -33,7 +32,7 @@ static vm_fault_t psb_fbdev_vm_fault(struct vm_fault *vmf) vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); for (i = 0; i < page_num; ++i) { - err = vmf_insert_mixed(vma, address, __pfn_to_pfn_t(pfn, 0)); + err = vmf_insert_mixed(vma, address, pfn); if (unlikely(err & VM_FAULT_ERROR)) break; address += PAGE_SIZE; diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/i915_gem_mman.c index f6d37dff320d..75f5b0e871ef 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c @@ -5,7 +5,6 @@ #include #include -#include #include #include diff --git a/drivers/gpu/drm/msm/msm_gem.c b/drivers/gpu/drm/msm/msm_gem.c index 2995e80fec3b..20bf31fe799b 100644 --- a/drivers/gpu/drm/msm/msm_gem.c +++ b/drivers/gpu/drm/msm/msm_gem.c @@ -9,7 +9,6 @@ #include #include #include -#include #include #include diff --git a/drivers/gpu/drm/omapdrm/omap_gem.c b/drivers/gpu/drm/omapdrm/omap_gem.c index 9df05b2b7ba0..381552bfb409 100644 --- a/drivers/gpu/drm/omapdrm/omap_gem.c +++ b/drivers/gpu/drm/omapdrm/omap_gem.c @@ -8,7 +8,6 @@ #include #include #include -#include #include #include @@ -371,7 +370,7 @@ static vm_fault_t omap_gem_fault_1d(struct drm_gem_object *obj, VERB("Inserting %p pfn %lx, pa %lx", (void *)vmf->address, pfn, pfn << PAGE_SHIFT); - return vmf_insert_mixed(vma, vmf->address, __pfn_to_pfn_t(pfn, 0)); + return vmf_insert_mixed(vma, vmf->address, pfn); } /* Special handling for the case of faulting in 2d tiled buffers */ @@ -466,8 +465,7 @@ static vm_fault_t omap_gem_fault_2d(struct drm_gem_object *obj, pfn, pfn << PAGE_SHIFT); for (i = n; i > 0; i--) { - ret = vmf_insert_mixed(vma, - vaddr, __pfn_to_pfn_t(pfn, 0)); + ret = vmf_insert_mixed(vma, vaddr, pfn); if (ret & VM_FAULT_ERROR) break; pfn += priv->usergart[fmt].stride_pfn; diff --git a/drivers/gpu/drm/v3d/v3d_bo.c b/drivers/gpu/drm/v3d/v3d_bo.c index bb7815599435..c41476ddde68 100644 --- a/drivers/gpu/drm/v3d/v3d_bo.c +++ b/drivers/gpu/drm/v3d/v3d_bo.c @@ -16,7 +16,6 @@ */ #include -#include #include #include "v3d_drv.h" diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index 7163950eb371..f3a13b300835 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -19,7 +19,6 @@ #include #include #include -#include #ifdef CONFIG_X86 #include @@ -1618,7 +1617,7 @@ static vm_fault_t msc_mmap_fault(struct vm_fault *vmf) return VM_FAULT_SIGBUS; get_page(page); - return vmf_insert_mixed(vmf->vma, vmf->address, page_to_pfn_t(page)); + return vmf_insert_mixed(vmf->vma, vmf->address, page_to_pfn(page)); } static const struct vm_operations_struct msc_mmap_ops = { diff --git a/drivers/md/dm-linear.c 
b/drivers/md/dm-linear.c index 15538ec58f8e..73bf290af181 100644 --- a/drivers/md/dm-linear.c +++ b/drivers/md/dm-linear.c @@ -170,7 +170,7 @@ static struct dax_device *linear_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff) static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, long nr_pages, enum dax_access_mode mode, void **kaddr, - pfn_t *pfn) + unsigned long *pfn) { struct dax_device *dax_dev = linear_dax_pgoff(ti, &pgoff); diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c index d484e8e1d48a..679b07dee229 100644 --- a/drivers/md/dm-log-writes.c +++ b/drivers/md/dm-log-writes.c @@ -893,7 +893,7 @@ static struct dax_device *log_writes_dax_pgoff(struct dm_target *ti, static long log_writes_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, long nr_pages, enum dax_access_mode mode, void **kaddr, - pfn_t *pfn) + unsigned long *pfn) { struct dax_device *dax_dev = log_writes_dax_pgoff(ti, &pgoff); diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c index a7dc04bd55e5..366f46159785 100644 --- a/drivers/md/dm-stripe.c +++ b/drivers/md/dm-stripe.c @@ -316,7 +316,7 @@ static struct dax_device *stripe_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff) static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, long nr_pages, enum dax_access_mode mode, void **kaddr, - pfn_t *pfn) + unsigned long *pfn) { struct dax_device *dax_dev = stripe_dax_pgoff(ti, &pgoff); diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c index 652627aea11b..2af5a9514c05 100644 --- a/drivers/md/dm-target.c +++ b/drivers/md/dm-target.c @@ -255,7 +255,7 @@ static void io_err_io_hints(struct dm_target *ti, struct queue_limits *limits) static long io_err_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, long nr_pages, enum dax_access_mode mode, void **kaddr, - pfn_t *pfn) + unsigned long *pfn) { return -EIO; } diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c index a428e1cacf07..d8de4a3076a1 100644 --- a/drivers/md/dm-writecache.c +++ b/drivers/md/dm-writecache.c @@ -13,7 +13,6 @@ #include #include #include -#include #include #include #include "dm-io-tracker.h" @@ -256,7 +255,7 @@ static int persistent_memory_claim(struct dm_writecache *wc) int r; loff_t s; long p, da; - pfn_t pfn; + unsigned long pfn; int id; struct page **pages; sector_t offset; @@ -290,7 +289,7 @@ static int persistent_memory_claim(struct dm_writecache *wc) r = da; goto err2; } - if (!pfn_t_has_page(pfn)) { + if (!pfn_valid(pfn)) { wc->memory_map = NULL; r = -EOPNOTSUPP; goto err2; @@ -314,13 +313,13 @@ static int persistent_memory_claim(struct dm_writecache *wc) r = daa ? 
daa : -EINVAL; goto err3; } - if (!pfn_t_has_page(pfn)) { + if (!pfn_valid(pfn)) { r = -EOPNOTSUPP; goto err3; } while (daa-- && i < p) { - pages[i++] = pfn_t_to_page(pfn); - pfn.val++; + pages[i++] = pfn_to_page(pfn); + pfn++; if (!(i & 15)) cond_resched(); } diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 1726f0f828cc..4b9415f718e3 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1218,7 +1218,7 @@ static struct dm_target *dm_dax_get_live_target(struct mapped_device *md, static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages, enum dax_access_mode mode, void **kaddr, - pfn_t *pfn) + unsigned long *pfn) { struct mapped_device *md = dax_get_private(dax_dev); sector_t sector = pgoff * PAGE_SECTORS; diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index aa50006b7616..05785ff21a8b 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -20,7 +20,6 @@ #include #include #include -#include #include #include #include @@ -242,7 +241,7 @@ static void pmem_submit_bio(struct bio *bio) /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, long nr_pages, enum dax_access_mode mode, void **kaddr, - pfn_t *pfn) + unsigned long *pfn) { resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset; sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT; @@ -254,7 +253,7 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, if (kaddr) *kaddr = pmem->virt_addr + offset; if (pfn) - *pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags); + *pfn = PHYS_PFN(pmem->phys_addr + offset); if (bb->count && badblocks_check(bb, sector, num, &first_bad, &num_bad)) { @@ -303,7 +302,7 @@ static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff, static long pmem_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages, enum dax_access_mode mode, - void **kaddr, pfn_t *pfn) + void **kaddr, unsigned long *pfn) { struct pmem_device *pmem = dax_get_private(dax_dev); @@ -513,7 +512,6 @@ static int pmem_attach_disk(struct device *dev, pmem->disk = disk; pmem->pgmap.owner = pmem; - pmem->pfn_flags = 0; if (is_nd_pfn(dev)) { pmem->pgmap.type = MEMORY_DEVICE_FS_DAX; pmem->pgmap.ops = &fsdax_pagemap_ops; diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h index 392b0b38acb9..a48509f90196 100644 --- a/drivers/nvdimm/pmem.h +++ b/drivers/nvdimm/pmem.h @@ -5,7 +5,6 @@ #include #include #include -#include #include enum dax_access_mode; @@ -16,7 +15,6 @@ struct pmem_device { phys_addr_t phys_addr; /* when non-zero this device is hosting a 'pfn' instance */ phys_addr_t data_offset; - u64 pfn_flags; void *virt_addr; /* immutable base size of the namespace */ size_t size; @@ -31,7 +29,7 @@ struct pmem_device { long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, long nr_pages, enum dax_access_mode mode, void **kaddr, - pfn_t *pfn); + unsigned long *pfn); #ifdef CONFIG_MEMORY_FAILURE static inline bool test_and_clear_pmem_poison(struct page *page) diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c index 249ae403f698..94fa5edecadd 100644 --- a/drivers/s390/block/dcssblk.c +++ b/drivers/s390/block/dcssblk.c @@ -17,7 +17,6 @@ #include #include #include -#include #include #include #include @@ -33,7 +32,7 @@ static void dcssblk_release(struct gendisk *disk); static void dcssblk_submit_bio(struct bio *bio); static long dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long 
nr_pages, enum dax_access_mode mode, void **kaddr, - pfn_t *pfn); + unsigned long *pfn); static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0"; @@ -914,7 +913,7 @@ dcssblk_submit_bio(struct bio *bio) static long __dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff, - long nr_pages, void **kaddr, pfn_t *pfn) + long nr_pages, void **kaddr, unsigned long *pfn) { resource_size_t offset = pgoff * PAGE_SIZE; unsigned long dev_sz; @@ -923,7 +922,7 @@ __dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff, if (kaddr) *kaddr = __va(dev_info->start + offset); if (pfn) - *pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), 0); + *pfn = PFN_DOWN(dev_info->start + offset); return (dev_sz - offset) / PAGE_SIZE; } @@ -931,7 +930,7 @@ __dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff, static long dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages, enum dax_access_mode mode, void **kaddr, - pfn_t *pfn) + unsigned long *pfn) { struct dcssblk_dev_info *dev_info = dax_get_private(dax_dev); diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 3f2ad5fb4c17..31bdb9110cc0 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -20,7 +20,6 @@ #include #include #include -#include #include #include #include @@ -1669,12 +1668,12 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf, break; #ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP case PMD_ORDER: - ret = vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn, 0), false); + ret = vmf_insert_pfn_pmd(vmf, pfn, false); break; #endif #ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP case PUD_ORDER: - ret = vmf_insert_pfn_pud(vmf, __pfn_to_pfn_t(pfn, 0), false); + ret = vmf_insert_pfn_pud(vmf, pfn, false); break; #endif default: diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c index 820a664cfec7..b002e9b734f9 100644 --- a/fs/cramfs/inode.c +++ b/fs/cramfs/inode.c @@ -17,7 +17,6 @@ #include #include #include -#include #include #include #include @@ -412,8 +411,8 @@ static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma) for (i = 0; i < pages && !ret; i++) { vm_fault_t vmf; unsigned long off = i * PAGE_SIZE; - pfn_t pfn = phys_to_pfn_t(address + off, 0); - vmf = vmf_insert_mixed(vma, vma->vm_start + off, pfn); + vmf = vmf_insert_mixed(vma, vma->vm_start + off, + address + off); if (vmf & VM_FAULT_ERROR) ret = vm_fault_to_errno(vmf, 0); } diff --git a/fs/dax.c b/fs/dax.c index f4ffb6982270..4229513806be 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -20,7 +20,6 @@ #include #include #include -#include #include #include #include @@ -76,9 +75,9 @@ static struct folio *dax_to_folio(void *entry) return page_folio(pfn_to_page(dax_to_pfn(entry))); } -static void *dax_make_entry(pfn_t pfn, unsigned long flags) +static void *dax_make_entry(unsigned long pfn, unsigned long flags) { - return xa_mk_value(flags | (pfn_t_to_pfn(pfn) << DAX_SHIFT)); + return xa_mk_value(flags | (pfn << DAX_SHIFT)); } static bool dax_is_locked(void *entry) @@ -713,7 +712,7 @@ static void *grab_mapping_entry(struct xa_state *xas, if (order > 0) flags |= DAX_PMD; - entry = dax_make_entry(pfn_to_pfn_t(0), flags); + entry = dax_make_entry(0, flags); dax_lock_entry(xas, entry); if (xas_error(xas)) goto out_unlock; @@ -1041,7 +1040,7 @@ static bool dax_fault_is_synchronous(const struct iomap_iter *iter, * appropriate. 
*/ static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf, - const struct iomap_iter *iter, void *entry, pfn_t pfn, + const struct iomap_iter *iter, void *entry, unsigned long pfn, unsigned long flags) { struct address_space *mapping = vmf->vma->vm_file->f_mapping; @@ -1239,7 +1238,7 @@ int dax_writeback_mapping_range(struct address_space *mapping, EXPORT_SYMBOL_GPL(dax_writeback_mapping_range); static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos, - size_t size, void **kaddr, pfn_t *pfnp) + size_t size, void **kaddr, unsigned long *pfnp) { pgoff_t pgoff = dax_iomap_pgoff(iomap, pos); int id, rc = 0; @@ -1257,7 +1256,7 @@ static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos, rc = -EINVAL; if (PFN_PHYS(length) < size) goto out; - if (pfn_t_to_pfn(*pfnp) & (PHYS_PFN(size)-1)) + if (*pfnp & (PHYS_PFN(size)-1)) goto out; rc = 0; @@ -1361,12 +1360,12 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf, { struct inode *inode = iter->inode; unsigned long vaddr = vmf->address; - pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr)); + unsigned long pfn = my_zero_pfn(vaddr); vm_fault_t ret; *entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE); - ret = vmf_insert_page_mkwrite(vmf, pfn_t_to_page(pfn), false); + ret = vmf_insert_page_mkwrite(vmf, pfn_to_page(pfn), false); trace_dax_load_hole(inode, vmf, ret); return ret; } @@ -1383,14 +1382,14 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf, struct folio *zero_folio; spinlock_t *ptl; pmd_t pmd_entry; - pfn_t pfn; + unsigned long pfn; zero_folio = mm_get_huge_zero_folio(vmf->vma->vm_mm); if (unlikely(!zero_folio)) goto fallback; - pfn = page_to_pfn_t(&zero_folio->page); + pfn = page_to_pfn(&zero_folio->page); *entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_PMD | DAX_ZERO_PAGE); @@ -1779,7 +1778,8 @@ static vm_fault_t dax_fault_return(int error) * insertion for now and return the pfn so that caller can insert it after the * fsync is done. */ -static vm_fault_t dax_fault_synchronous_pfnp(pfn_t *pfnp, pfn_t pfn) +static vm_fault_t dax_fault_synchronous_pfnp(unsigned long *pfnp, + unsigned long pfn) { if (WARN_ON_ONCE(!pfnp)) return VM_FAULT_SIGBUS; @@ -1827,7 +1827,7 @@ static vm_fault_t dax_fault_cow_page(struct vm_fault *vmf, * @pmd: distinguish whether it is a pmd fault */ static vm_fault_t dax_fault_iter(struct vm_fault *vmf, - const struct iomap_iter *iter, pfn_t *pfnp, + const struct iomap_iter *iter, unsigned long *pfnp, struct xa_state *xas, void **entry, bool pmd) { const struct iomap *iomap = &iter->iomap; @@ -1838,7 +1838,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf, unsigned long entry_flags = pmd ? 
DAX_PMD : 0; struct folio *folio; int ret, err = 0; - pfn_t pfn; + unsigned long pfn; void *kaddr; if (!pmd && vmf->cow_page) @@ -1875,16 +1875,15 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf, folio_ref_inc(folio); if (pmd) - ret = vmf_insert_folio_pmd(vmf, pfn_folio(pfn_t_to_pfn(pfn)), - write); + ret = vmf_insert_folio_pmd(vmf, pfn_folio(pfn), write); else - ret = vmf_insert_page_mkwrite(vmf, pfn_t_to_page(pfn), write); + ret = vmf_insert_page_mkwrite(vmf, pfn_to_page(pfn), write); folio_put(folio); return ret; } -static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp, +static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, unsigned long *pfnp, int *iomap_errp, const struct iomap_ops *ops) { struct address_space *mapping = vmf->vma->vm_file->f_mapping; @@ -1996,7 +1995,7 @@ static bool dax_fault_check_fallback(struct vm_fault *vmf, struct xa_state *xas, return false; } -static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp, +static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, unsigned long *pfnp, const struct iomap_ops *ops) { struct address_space *mapping = vmf->vma->vm_file->f_mapping; @@ -2077,7 +2076,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp, return ret; } #else -static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp, +static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, unsigned long *pfnp, const struct iomap_ops *ops) { return VM_FAULT_FALLBACK; @@ -2098,7 +2097,8 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp, * successfully. */ vm_fault_t dax_iomap_fault(struct vm_fault *vmf, unsigned int order, - pfn_t *pfnp, int *iomap_errp, const struct iomap_ops *ops) + unsigned long *pfnp, int *iomap_errp, + const struct iomap_ops *ops) { if (order == 0) return dax_iomap_pte_fault(vmf, pfnp, iomap_errp, ops); @@ -2118,8 +2118,8 @@ EXPORT_SYMBOL_GPL(dax_iomap_fault); * This function inserts a writeable PTE or PMD entry into the page tables * for an mmaped DAX file. It also marks the page cache entry as dirty. */ -static vm_fault_t -dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order) +static vm_fault_t dax_insert_pfn_mkwrite(struct vm_fault *vmf, + unsigned long pfn, unsigned int order) { struct address_space *mapping = vmf->vma->vm_file->f_mapping; XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, order); @@ -2141,7 +2141,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order) xas_set_mark(&xas, PAGECACHE_TAG_DIRTY); dax_lock_entry(&xas, entry); xas_unlock_irq(&xas); - folio = pfn_folio(pfn_t_to_pfn(pfn)); + folio = pfn_folio(pfn); folio_ref_inc(folio); if (order == 0) ret = vmf_insert_page_mkwrite(vmf, &folio->page, true); @@ -2168,7 +2168,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order) * table entry. 
*/ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, unsigned int order, - pfn_t pfn) + unsigned long pfn) { int err; loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT; diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 21df81347147..e6e962982319 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -747,7 +747,7 @@ static vm_fault_t ext4_dax_huge_fault(struct vm_fault *vmf, unsigned int order) bool write = (vmf->flags & FAULT_FLAG_WRITE) && (vmf->vma->vm_flags & VM_SHARED); struct address_space *mapping = vmf->vma->vm_file->f_mapping; - pfn_t pfn; + unsigned long pfn; if (write) { sb_start_pagefault(sb); diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c index 0502bf3cdf6a..ac6d4c1064cc 100644 --- a/fs/fuse/dax.c +++ b/fs/fuse/dax.c @@ -10,7 +10,6 @@ #include #include #include -#include #include #include @@ -757,7 +756,7 @@ static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf, unsigned int order, vm_fault_t ret; struct inode *inode = file_inode(vmf->vma->vm_file); struct super_block *sb = inode->i_sb; - pfn_t pfn; + unsigned long pfn; int error = 0; struct fuse_conn *fc = get_fuse_conn(inode); struct fuse_conn_dax *fcd = fc->dax; diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index 53c2626e90e7..aac914b2cd50 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -9,7 +9,6 @@ #include #include #include -#include #include #include #include @@ -1008,7 +1007,7 @@ static void virtio_fs_cleanup_vqs(struct virtio_device *vdev) */ static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages, enum dax_access_mode mode, - void **kaddr, pfn_t *pfn) + void **kaddr, unsigned long *pfn) { struct virtio_fs *fs = dax_get_private(dax_dev); phys_addr_t offset = PFN_PHYS(pgoff); @@ -1017,7 +1016,7 @@ static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, if (kaddr) *kaddr = fs->window_kaddr + offset; if (pfn) - *pfn = phys_to_pfn_t(fs->window_phys_addr + offset, 0); + *pfn = fs->window_phys_addr + offset; return nr_pages > max_nr_pages ? max_nr_pages : nr_pages; } diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 0b41b18debf3..314a9d9dd7db 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1730,7 +1730,7 @@ xfs_dax_fault_locked( bool write_fault) { vm_fault_t ret; - pfn_t pfn; + unsigned long pfn; if (!IS_ENABLED(CONFIG_FS_DAX)) { ASSERT(0); diff --git a/include/linux/dax.h b/include/linux/dax.h index dcc9fcdf14e4..29eec755430b 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -26,7 +26,7 @@ struct dax_operations { * number of pages available for DAX at that pfn. */ long (*direct_access)(struct dax_device *, pgoff_t, long, - enum dax_access_mode, void **, pfn_t *); + enum dax_access_mode, void **, unsigned long *); /* zero_page_range: required operation. 
Zero page range */ int (*zero_page_range)(struct dax_device *, pgoff_t, size_t); /* @@ -241,7 +241,7 @@ static inline void dax_break_layout_final(struct inode *inode) bool dax_alive(struct dax_device *dax_dev); void *dax_get_private(struct dax_device *dax_dev); long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages, - enum dax_access_mode mode, void **kaddr, pfn_t *pfn); + enum dax_access_mode mode, void **kaddr, unsigned long *pfn); size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i); size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr, @@ -255,9 +255,10 @@ void dax_flush(struct dax_device *dax_dev, void *addr, size_t size); ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter, const struct iomap_ops *ops); vm_fault_t dax_iomap_fault(struct vm_fault *vmf, unsigned int order, - pfn_t *pfnp, int *errp, const struct iomap_ops *ops); + unsigned long *pfnp, int *errp, + const struct iomap_ops *ops); vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, - unsigned int order, pfn_t pfn); + unsigned int order, unsigned long pfn); int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index); void dax_delete_mapping_range(struct address_space *mapping, loff_t start, loff_t end); diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h index cb95951547ab..84fdc3a6a19a 100644 --- a/include/linux/device-mapper.h +++ b/include/linux/device-mapper.h @@ -156,7 +156,7 @@ typedef int (*dm_busy_fn) (struct dm_target *ti); */ typedef long (*dm_dax_direct_access_fn) (struct dm_target *ti, pgoff_t pgoff, long nr_pages, enum dax_access_mode node, void **kaddr, - pfn_t *pfn); + unsigned long *pfn); typedef int (*dm_dax_zero_page_range_fn)(struct dm_target *ti, pgoff_t pgoff, size_t nr_pages); diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 26607f2c65fb..8f1b15213f61 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -37,8 +37,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, pgprot_t newprot, unsigned long cp_flags); -vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write); -vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write); +vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn, + bool write); +vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, unsigned long pfn, + bool write); vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio, bool write); vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, diff --git a/include/linux/mm.h b/include/linux/mm.h index 4d833f159988..dccebf0abf06 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3522,9 +3522,9 @@ vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr, vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, pgprot_t pgprot); vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma, unsigned long addr, - pfn_t pfn); + unsigned long pfn); vm_fault_t vmf_insert_mixed_mkwrite(struct vm_area_struct *vma, - unsigned long addr, pfn_t pfn); + unsigned long addr, unsigned long pfn); int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len); static inline vm_fault_t vmf_insert_page(struct vm_area_struct *vma, diff --git a/include/linux/pfn.h b/include/linux/pfn.h index 14bc053c53d8..b90ca0b6c331 100644 --- 
a/include/linux/pfn.h +++ b/include/linux/pfn.h @@ -4,15 +4,6 @@ #ifndef __ASSEMBLY__ #include - -/* - * pfn_t: encapsulates a page-frame number that is optionally backed - * by memmap (struct page). Whether a pfn_t has a 'struct page' - * backing is indicated by flags in the high bits of the value. - */ -typedef struct { - u64 val; -} pfn_t; #endif #define PFN_ALIGN(x) (((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK) diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h deleted file mode 100644 index 2c002936a0f6..000000000000 --- a/include/linux/pfn_t.h +++ /dev/null @@ -1,85 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef _LINUX_PFN_T_H_ -#define _LINUX_PFN_T_H_ -#include - -/* - * PFN_FLAGS_MASK - mask of all the possible valid pfn_t flags - */ -#define PFN_FLAGS_MASK (((u64) (~PAGE_MASK)) << (BITS_PER_LONG_LONG - PAGE_SHIFT)) - -#define PFN_FLAGS_TRACE \ - { } - -static inline pfn_t __pfn_to_pfn_t(unsigned long pfn, u64 flags) -{ - pfn_t pfn_t = { .val = pfn | (flags & PFN_FLAGS_MASK), }; - - return pfn_t; -} - -/* a default pfn to pfn_t conversion assumes that @pfn is pfn_valid() */ -static inline pfn_t pfn_to_pfn_t(unsigned long pfn) -{ - return __pfn_to_pfn_t(pfn, 0); -} - -static inline pfn_t phys_to_pfn_t(phys_addr_t addr, u64 flags) -{ - return __pfn_to_pfn_t(addr >> PAGE_SHIFT, flags); -} - -static inline bool pfn_t_has_page(pfn_t pfn) -{ - return true; -} - -static inline unsigned long pfn_t_to_pfn(pfn_t pfn) -{ - return pfn.val & ~PFN_FLAGS_MASK; -} - -static inline struct page *pfn_t_to_page(pfn_t pfn) -{ - if (pfn_t_has_page(pfn)) - return pfn_to_page(pfn_t_to_pfn(pfn)); - return NULL; -} - -static inline phys_addr_t pfn_t_to_phys(pfn_t pfn) -{ - return PFN_PHYS(pfn_t_to_pfn(pfn)); -} - -static inline pfn_t page_to_pfn_t(struct page *page) -{ - return pfn_to_pfn_t(page_to_pfn(page)); -} - -static inline int pfn_t_valid(pfn_t pfn) -{ - return pfn_valid(pfn_t_to_pfn(pfn)); -} - -#ifdef CONFIG_MMU -static inline pte_t pfn_t_pte(pfn_t pfn, pgprot_t pgprot) -{ - return pfn_pte(pfn_t_to_pfn(pfn), pgprot); -} -#endif - -#ifdef CONFIG_TRANSPARENT_HUGEPAGE -static inline pmd_t pfn_t_pmd(pfn_t pfn, pgprot_t pgprot) -{ - return pfn_pmd(pfn_t_to_pfn(pfn), pgprot); -} - -#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD -static inline pud_t pfn_t_pud(pfn_t pfn, pgprot_t pgprot) -{ - return pfn_pud(pfn_t_to_pfn(pfn), pgprot); -} -#endif -#endif - -#endif /* _LINUX_PFN_T_H_ */ diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index d84d0c49012f..bd8f9317b025 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -20,7 +20,6 @@ #include #include #include -#include #include #include #include diff --git a/mm/huge_memory.c b/mm/huge_memory.c index cf808b2eea29..ce130225a8e5 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -22,7 +22,6 @@ #include #include #include -#include #include #include #include @@ -1375,7 +1374,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) struct folio_or_pfn { union { struct folio *folio; - pfn_t pfn; + unsigned long pfn; }; bool is_folio; }; @@ -1391,7 +1390,7 @@ static int insert_pmd(struct vm_area_struct *vma, unsigned long addr, if (!pmd_none(*pmd)) { const unsigned long pfn = fop.is_folio ? 
folio_pfn(fop.folio) : - pfn_t_to_pfn(fop.pfn); + fop.pfn; if (write) { if (pmd_pfn(*pmd) != pfn) { @@ -1414,7 +1413,7 @@ static int insert_pmd(struct vm_area_struct *vma, unsigned long addr, folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma); add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR); } else { - entry = pmd_mkhuge(pfn_t_pmd(fop.pfn, prot)); + entry = pmd_mkhuge(pfn_pmd(fop.pfn, prot)); entry = pmd_mkspecial(entry); } if (write) { @@ -1442,7 +1441,8 @@ static int insert_pmd(struct vm_area_struct *vma, unsigned long addr, * * Return: vm_fault_t value. */ -vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write) +vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn, + bool write) { unsigned long addr = vmf->address & PMD_MASK; struct vm_area_struct *vma = vmf->vma; @@ -1473,7 +1473,7 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write) return VM_FAULT_OOM; } - pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot); + pfnmap_setup_cachemode_pfn(pfn, &pgprot); ptl = pmd_lock(vma->vm_mm, vmf->pmd); error = insert_pmd(vma, addr, vmf->pmd, fop, pgprot, write, @@ -1539,7 +1539,7 @@ static void insert_pud(struct vm_area_struct *vma, unsigned long addr, if (!pud_none(*pud)) { const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) : - pfn_t_to_pfn(fop.pfn); + fop.pfn; if (write) { if (WARN_ON_ONCE(pud_pfn(*pud) != pfn)) @@ -1559,7 +1559,7 @@ static void insert_pud(struct vm_area_struct *vma, unsigned long addr, folio_add_file_rmap_pud(fop.folio, &fop.folio->page, vma); add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PUD_NR); } else { - entry = pud_mkhuge(pfn_t_pud(fop.pfn, prot)); + entry = pud_mkhuge(pfn_pud(fop.pfn, prot)); entry = pud_mkspecial(entry); } if (write) { @@ -1580,7 +1580,8 @@ static void insert_pud(struct vm_area_struct *vma, unsigned long addr, * * Return: vm_fault_t value. */ -vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write) +vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, unsigned long pfn, + bool write) { unsigned long addr = vmf->address & PUD_MASK; struct vm_area_struct *vma = vmf->vma; @@ -1603,7 +1604,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write) if (addr < vma->vm_start || addr >= vma->vm_end) return VM_FAULT_SIGBUS; - pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot); + pfnmap_setup_cachemode_pfn(pfn, &pgprot); ptl = pud_lock(vma->vm_mm, vmf->pud); insert_pud(vma, addr, vmf->pud, fop, pgprot, write); diff --git a/mm/memory.c b/mm/memory.c index e932a007af4c..0f9b32a20e5b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -57,7 +57,6 @@ #include #include #include -#include #include #include #include @@ -2435,7 +2434,7 @@ int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages, EXPORT_SYMBOL(vm_map_pages_zero); static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr, - pfn_t pfn, pgprot_t prot, bool mkwrite) + unsigned long pfn, pgprot_t prot, bool mkwrite) { struct mm_struct *mm = vma->vm_mm; pte_t *pte, entry; @@ -2457,7 +2456,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr, * allocation and mapping invalidation so just skip the * update. */ - if (pte_pfn(entry) != pfn_t_to_pfn(pfn)) { + if (pte_pfn(entry) != pfn) { WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry))); goto out_unlock; } @@ -2470,7 +2469,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr, } /* Ok, finally just insert the thing.. 
*/ - entry = pte_mkspecial(pfn_t_pte(pfn, prot)); + entry = pte_mkspecial(pfn_pte(pfn, prot)); if (mkwrite) { entry = pte_mkyoung(entry); @@ -2541,8 +2540,7 @@ vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr, pfnmap_setup_cachemode_pfn(pfn, &pgprot); - return insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, 0), pgprot, - false); + return insert_pfn(vma, addr, pfn, pgprot, false); } EXPORT_SYMBOL(vmf_insert_pfn_prot); @@ -2573,21 +2571,22 @@ vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr, } EXPORT_SYMBOL(vmf_insert_pfn); -static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn, bool mkwrite) +static bool vm_mixed_ok(struct vm_area_struct *vma, unsigned long pfn, + bool mkwrite) { - if (unlikely(is_zero_pfn(pfn_t_to_pfn(pfn))) && + if (unlikely(is_zero_pfn(pfn)) && (mkwrite || !vm_mixed_zeropage_allowed(vma))) return false; /* these checks mirror the abort conditions in vm_normal_page */ if (vma->vm_flags & VM_MIXEDMAP) return true; - if (is_zero_pfn(pfn_t_to_pfn(pfn))) + if (is_zero_pfn(pfn)) return true; return false; } static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma, - unsigned long addr, pfn_t pfn, bool mkwrite) + unsigned long addr, unsigned long pfn, bool mkwrite) { pgprot_t pgprot = vma->vm_page_prot; int err; @@ -2598,9 +2597,9 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma, if (addr < vma->vm_start || addr >= vma->vm_end) return VM_FAULT_SIGBUS; - pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot); + pfnmap_setup_cachemode_pfn(pfn, &pgprot); - if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot)) + if (!pfn_modify_allowed(pfn, pgprot)) return VM_FAULT_SIGBUS; /* @@ -2610,7 +2609,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma, * than insert_pfn). If a zero_pfn were inserted into a VM_MIXEDMAP * without pte special, it would there be refcounted as a normal page. */ - if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pfn_t_valid(pfn)) { + if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pfn_valid(pfn)) { struct page *page; /* @@ -2618,7 +2617,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma, * regardless of whether the caller specified flags that * result in pfn_t_has_page() == false. */ - page = pfn_to_page(pfn_t_to_pfn(pfn)); + page = pfn_to_page(pfn); err = insert_page(vma, addr, page, pgprot, mkwrite); } else { return insert_pfn(vma, addr, pfn, pgprot, mkwrite); @@ -2653,7 +2652,7 @@ vm_fault_t vmf_insert_page_mkwrite(struct vm_fault *vmf, struct page *page, EXPORT_SYMBOL_GPL(vmf_insert_page_mkwrite); vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma, unsigned long addr, - pfn_t pfn) + unsigned long pfn) { return __vm_insert_mixed(vma, addr, pfn, false); } @@ -2665,7 +2664,7 @@ EXPORT_SYMBOL(vmf_insert_mixed); * the same entry was actually inserted. 
*/ vm_fault_t vmf_insert_mixed_mkwrite(struct vm_area_struct *vma, - unsigned long addr, pfn_t pfn) + unsigned long addr, unsigned long pfn) { return __vm_insert_mixed(vma, addr, pfn, true); } diff --git a/mm/memremap.c b/mm/memremap.c index c17e0a69cace..044a4550671a 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -5,7 +5,6 @@ #include #include #include -#include #include #include #include diff --git a/mm/migrate.c b/mm/migrate.c index 8cf0f9c9599d..ea8c74d99659 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -35,7 +35,6 @@ #include #include #include -#include #include #include #include diff --git a/tools/testing/nvdimm/pmem-dax.c b/tools/testing/nvdimm/pmem-dax.c index c1ec099a3b1d..05e763aab104 100644 --- a/tools/testing/nvdimm/pmem-dax.c +++ b/tools/testing/nvdimm/pmem-dax.c @@ -10,7 +10,7 @@ long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, long nr_pages, enum dax_access_mode mode, void **kaddr, - pfn_t *pfn) + unsigned long *pfn) { resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset; @@ -29,7 +29,7 @@ long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, *kaddr = pmem->virt_addr + offset; page = vmalloc_to_page(pmem->virt_addr + offset); if (pfn) - *pfn = page_to_pfn_t(page); + *pfn = page_to_pfn(page); pr_debug_ratelimited("%s: pmem: %p pgoff: %#lx pfn: %#lx\n", __func__, pmem, pgoff, page_to_pfn(page)); @@ -39,7 +39,7 @@ long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, if (kaddr) *kaddr = pmem->virt_addr + offset; if (pfn) - *pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags); + *pfn = PHYS_PFN(pmem->phys_addr + offset); /* * If badblocks are present, limit known good range to the diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c index ddceb04b4a9a..f7e7bfe9bb85 100644 --- a/tools/testing/nvdimm/test/iomap.c +++ b/tools/testing/nvdimm/test/iomap.c @@ -8,7 +8,6 @@ #include #include #include -#include #include #include #include @@ -135,12 +134,6 @@ void *__wrap_devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) } EXPORT_SYMBOL_GPL(__wrap_devm_memremap_pages); -pfn_t __wrap_phys_to_pfn_t(phys_addr_t addr, unsigned long flags) -{ - return phys_to_pfn_t(addr, flags); -} -EXPORT_SYMBOL(__wrap_phys_to_pfn_t); - void *__wrap_memremap(resource_size_t offset, size_t size, unsigned long flags) { diff --git a/tools/testing/nvdimm/test/nfit_test.h b/tools/testing/nvdimm/test/nfit_test.h index b00583d1eace..b9047fb8ea4a 100644 --- a/tools/testing/nvdimm/test/nfit_test.h +++ b/tools/testing/nvdimm/test/nfit_test.h @@ -212,7 +212,6 @@ void __iomem *__wrap_devm_ioremap(struct device *dev, void *__wrap_devm_memremap(struct device *dev, resource_size_t offset, size_t size, unsigned long flags); void *__wrap_devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); -pfn_t __wrap_phys_to_pfn_t(phys_addr_t addr, unsigned long flags); void *__wrap_memremap(resource_size_t offset, size_t size, unsigned long flags); void __wrap_devm_memunmap(struct device *dev, void *addr); From 5d26b5bdc64625049d98efc0f5bfddd83709c154 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Thu, 19 Jun 2025 18:58:06 +1000 Subject: [PATCH 106/370] mm/memremap: remove unused devmap_managed_key MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit It's no longer used so remove it. 
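For context, the code deleted below is an instance of the kernel's
refcounted static-branch idiom. A minimal sketch of that idiom under
hypothetical names (this shows the general pattern, not the removed
code itself):

  #include <linux/jump_label.h>

  /* Hypothetical key; devmap_managed_key followed the same shape. */
  DEFINE_STATIC_KEY_FALSE(demo_key);

  static void demo_enable(void)
  {
  	/* inc/dec nest, so the key acts as an enable refcount */
  	static_branch_inc(&demo_key);
  }

  static void demo_disable(void)
  {
  	static_branch_dec(&demo_key);
  }

  static bool demo_fast_path(void)
  {
  	/* patched to straight-line code while the key is disabled */
  	return static_branch_unlikely(&demo_key);
  }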
Link: https://lkml.kernel.org/r/11516e39f33f809292ffccab1d46062f9bc248b3.1750323463.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple
Reviewed-by: Jason Gunthorpe
Acked-by: David Hildenbrand
Cc: Balbir Singh
Cc: Björn Töpel
Cc: Björn Töpel
Cc: Christoph Hellwig
Cc: Chunyan Zhang
Cc: Dan Williams
Cc: Deepak Gupta
Cc: Gerald Schaefer
Cc: Inki Dae
Cc: John Groves
Cc: John Hubbard
Cc: Lorenzo Stoakes
Cc: Matthew Wilcox (Oracle)
Cc: Will Deacon
Signed-off-by: Andrew Morton
---
 mm/memremap.c | 27 ---------------------------
 1 file changed, 27 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 044a4550671a..f75078c14839 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -38,30 +38,6 @@ unsigned long memremap_compat_align(void)
 EXPORT_SYMBOL_GPL(memremap_compat_align);
 #endif
 
-#ifdef CONFIG_FS_DAX
-DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
-EXPORT_SYMBOL(devmap_managed_key);
-
-static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
-{
-	if (pgmap->type == MEMORY_DEVICE_FS_DAX)
-		static_branch_dec(&devmap_managed_key);
-}
-
-static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
-	if (pgmap->type == MEMORY_DEVICE_FS_DAX)
-		static_branch_inc(&devmap_managed_key);
-}
-#else
-static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
-}
-static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
-{
-}
-#endif /* CONFIG_FS_DAX */
-
 static void pgmap_array_delete(struct range *range)
 {
 	xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end),
@@ -150,7 +126,6 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 	percpu_ref_exit(&pgmap->ref);
 	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
-	devmap_managed_enable_put(pgmap);
 }
 EXPORT_SYMBOL_GPL(memunmap_pages);
 
@@ -349,8 +324,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 	if (error)
 		return ERR_PTR(error);
 
-	devmap_managed_enable_get(pgmap);
-
 	/*
 	 * Clear the pgmap nr_range as it will be incremented for each
 	 * successfully processed range. This communicates how many

From dfa3cf0bc018ff15e02ce19906be00a287ff2395 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Thu, 19 Jun 2025 11:36:08 -0700
Subject: [PATCH 107/370] selftests/damon: add a test for memcg_path leak

There was a memory leak bug in the DAMOS sysfs memcg_path file. Add a
selftest to ensure the bug never comes back.
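As a usage note (one possible invocation, assuming a kernel source tree
with kselftest support; this is not part of the patch itself):

  # From the top of the kernel tree, run only the DAMON selftests,
  # which now include this leak reproducer:
  make -C tools/testing/selftests TARGETS=damon run_tests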
Link: https://lkml.kernel.org/r/20250619183608.6647-3-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/Makefile | 1 + .../selftests/damon/sysfs_memcg_path_leak.sh | 43 +++++++++++++++++++ 2 files changed, 44 insertions(+) create mode 100755 tools/testing/selftests/damon/sysfs_memcg_path_leak.sh diff --git a/tools/testing/selftests/damon/Makefile b/tools/testing/selftests/damon/Makefile index ff21524be458..e888455e3cf8 100644 --- a/tools/testing/selftests/damon/Makefile +++ b/tools/testing/selftests/damon/Makefile @@ -15,6 +15,7 @@ TEST_PROGS += reclaim.sh lru_sort.sh # regression tests (reproducers of previously found bugs) TEST_PROGS += sysfs_update_removed_scheme_dir.sh TEST_PROGS += sysfs_update_schemes_tried_regions_hang.py +TEST_PROGS += sysfs_memcg_path_leak.sh EXTRA_CLEAN = __pycache__ diff --git a/tools/testing/selftests/damon/sysfs_memcg_path_leak.sh b/tools/testing/selftests/damon/sysfs_memcg_path_leak.sh new file mode 100755 index 000000000000..64c5d8c518a4 --- /dev/null +++ b/tools/testing/selftests/damon/sysfs_memcg_path_leak.sh @@ -0,0 +1,43 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +if [ $EUID -ne 0 ] +then + echo "Run as root" + exit $ksft_skip +fi + +damon_sysfs="/sys/kernel/mm/damon/admin" +if [ ! -d "$damon_sysfs" ] +then + echo "damon sysfs not found" + exit $ksft_skip +fi + +# ensure filter directory +echo 1 > "$damon_sysfs/kdamonds/nr_kdamonds" +echo 1 > "$damon_sysfs/kdamonds/0/contexts/nr_contexts" +echo 1 > "$damon_sysfs/kdamonds/0/contexts/0/schemes/nr_schemes" +echo 1 > "$damon_sysfs/kdamonds/0/contexts/0/schemes/0/filters/nr_filters" + +filter_dir="$damon_sysfs/kdamonds/0/contexts/0/schemes/0/filters/0" + +before_kb=$(grep Slab /proc/meminfo | awk '{print $2}') + +# try to leak 3000 KiB +for i in {1..102400}; +do + echo "012345678901234567890123456789" > "$filter_dir/memcg_path" +done + +after_kb=$(grep Slab /proc/meminfo | awk '{print $2}') +# expect up to 1500 KiB free from other tasks memory +expected_after_kb_max=$((before_kb + 1500)) + +if [ "$after_kb" -gt "$expected_after_kb_max" ] +then + echo "maybe memcg_path are leaking: $before_kb -> $after_kb" + exit 1 +else + exit 0 +fi From 592b939b59b43a817ce6d79900793982d452bb5d Mon Sep 17 00:00:00 2001 From: Dev Jain Date: Tue, 24 Jun 2025 13:37:48 +0530 Subject: [PATCH 108/370] maple tree: use goto label to simplify code Use the underflow goto label to set the status to ma_underflow and return NULL, as is being done elsewhere. [akpm@linux-foundation.org: add newline, per Liam (and remove one, per akpm)] Link: https://lkml.kernel.org/r/20250624080748.4855-1-dev.jain@arm.com Signed-off-by: Dev Jain Reviewed-by: Liam R. 
Howlett
Reviewed-by: Wei Yang
Signed-off-by: Andrew Morton
---
 lib/maple_tree.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 00524e55a21e..34b84b14985e 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -4560,15 +4560,12 @@ static void *mas_prev_slot(struct ma_state *mas, unsigned long min, bool empty)
 	if (unlikely(mas_rewalk_if_dead(mas, node, save_point)))
 		goto retry;
-
 	if (likely(entry))
 		return entry;
 
 	if (!empty) {
-		if (mas->index <= min) {
-			mas->status = ma_underflow;
-			return NULL;
-		}
+		if (mas->index <= min)
+			goto underflow;
 
 		goto again;
 	}

From ab7ed56a03ce0196260bc29ab844eb4e54afee66 Mon Sep 17 00:00:00 2001
From: Peter Xu
Date: Fri, 20 Jun 2025 11:00:58 -0400
Subject: [PATCH 109/370] selftests/mm: reduce uffd-unit-test poison test to minimum

The test will still generate quite a few unwanted MCE error messages in
the syslog. There was an old proposal to rate-limit the MCE messages
from the kernel, but that carries the risk of hiding genuinely useful
information on production systems. We can at least reduce the test to a
minimum so as not to over-pollute dmesg, while trying not to lose too
much of its coverage.

[peterx@redhat.com: reduce uffd-unit-test poison test to minimum]
Link: https://lkml.kernel.org/r/aF2RSsjuEOtzXcUa@x1.local
Link: https://lkml.kernel.org/r/20250620150058.1729489-1-peterx@redhat.com
Signed-off-by: Peter Xu
Cc: Axel Rasmussen
Cc: Brendan Jackman
Cc: David Hildenbrand
Cc: Shuah Khan
Signed-off-by: Andrew Morton
---
 tools/testing/selftests/mm/uffd-unit-tests.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/mm/uffd-unit-tests.c b/tools/testing/selftests/mm/uffd-unit-tests.c
index c73fd5d455c8..50501b38e34e 100644
--- a/tools/testing/selftests/mm/uffd-unit-tests.c
+++ b/tools/testing/selftests/mm/uffd-unit-tests.c
@@ -1027,6 +1027,9 @@ static void uffd_poison_handle_fault(
 		do_uffdio_poison(uffd, offset);
 }
 
+/* Make sure to cover odd/even, and minimum duplications */
+#define UFFD_POISON_TEST_NPAGES 4
+
 static void uffd_poison_test(uffd_test_args_t *targs)
 {
 	pthread_t uffd_mon;
@@ -1034,12 +1037,17 @@ static void uffd_poison_test(uffd_test_args_t *targs)
 	struct uffd_args args = { 0 };
 	struct sigaction act = { 0 };
 	unsigned long nr_sigbus = 0;
-	unsigned long nr;
+	unsigned long nr, poison_pages = UFFD_POISON_TEST_NPAGES;
+
+	if (nr_pages < poison_pages) {
+		uffd_test_skip("Too few pages for POISON test");
+		return;
+	}
 
 	fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK);
-	uffd_register_poison(uffd, area_dst, nr_pages * page_size);
-	memset(area_src, 0, nr_pages * page_size);
+	uffd_register_poison(uffd, area_dst, poison_pages * page_size);
+	memset(area_src, 0, poison_pages * page_size);
 
 	args.handle_fault = uffd_poison_handle_fault;
 	if (pthread_create(&uffd_mon, NULL, uffd_poll_thread, &args))
@@ -1051,7 +1059,7 @@ static void uffd_poison_test(uffd_test_args_t *targs)
 	if (sigaction(SIGBUS, &act, 0))
 		err("sigaction");
 
-	for (nr = 0; nr < nr_pages; ++nr) {
+	for (nr = 0; nr < poison_pages; ++nr) {
 		unsigned long offset = nr * page_size;
 		const char *bytes = (const char *) area_dst + offset;
 		const char *i;
@@ -1078,9 +1086,9 @@ static void uffd_poison_test(uffd_test_args_t *targs)
 	if (pthread_join(uffd_mon, NULL))
 		err("pthread_join()");
 
-	if (nr_sigbus != nr_pages / 2)
+	if (nr_sigbus != poison_pages / 2)
 		err("expected to receive %lu SIGBUS, actually received %lu",
-		    nr_pages / 2, nr_sigbus);
+		    poison_pages / 2, nr_sigbus);
 
 	uffd_test_pass();
 }

From
59305202c67fea50378dcad0cc199dbc13a0e99a Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Fri, 20 Jun 2025 10:54:27 +0530 Subject: [PATCH 110/370] mm/ptdump: take the memory hotplug lock inside ptdump_walk_pgd() Memory hot remove unmaps and tears down various kernel page table regions as required. The ptdump code can race with concurrent modifications of the kernel page tables. When leaf entries are modified concurrently, the dump code may log stale or inconsistent information for a VA range, but this is otherwise not harmful. But when intermediate levels of kernel page table are freed, the dump code will continue to use memory that has been freed and potentially reallocated for another purpose. In such cases, the ptdump code may dereference bogus addresses, leading to a number of potential problems. To avoid the above mentioned race condition, platforms such as arm64, riscv and s390 take memory hotplug lock, while dumping kernel page table via the sysfs interface /sys/kernel/debug/kernel_page_tables. Similar race condition exists while checking for pages that might have been marked W+X via /sys/kernel/debug/kernel_page_tables/check_wx_pages which in turn calls ptdump_check_wx(). Instead of solving this race condition again, let's just move the memory hotplug lock inside generic ptdump_check_wx() which will benefit both the scenarios. Drop get_online_mems() and put_online_mems() combination from all existing platform ptdump code paths. Link: https://lkml.kernel.org/r/20250620052427.2092093-1-anshuman.khandual@arm.com Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove") Signed-off-by: Anshuman Khandual Acked-by: David Hildenbrand Reviewed-by: Dev Jain Acked-by: Alexander Gordeev [s390] Cc: Catalin Marinas Cc: Will Deacon Cc: Ryan Roberts Cc: Paul Walmsley Cc: Palmer Dabbelt Cc: Alexander Gordeev Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Vasily Gorbik Cc: Christian Borntraeger Cc: Sven Schnelle Cc: Signed-off-by: Andrew Morton --- arch/arm64/mm/ptdump_debugfs.c | 3 --- arch/riscv/mm/ptdump.c | 3 --- arch/s390/mm/dump_pagetables.c | 2 -- mm/ptdump.c | 2 ++ 4 files changed, 2 insertions(+), 8 deletions(-) diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c index 68bf1a125502..1e308328c079 100644 --- a/arch/arm64/mm/ptdump_debugfs.c +++ b/arch/arm64/mm/ptdump_debugfs.c @@ -1,6 +1,5 @@ // SPDX-License-Identifier: GPL-2.0 #include -#include #include #include @@ -9,9 +8,7 @@ static int ptdump_show(struct seq_file *m, void *v) { struct ptdump_info *info = m->private; - get_online_mems(); ptdump_walk(m, info); - put_online_mems(); return 0; } DEFINE_SHOW_ATTRIBUTE(ptdump); diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c index 32922550a50a..3b51690cc876 100644 --- a/arch/riscv/mm/ptdump.c +++ b/arch/riscv/mm/ptdump.c @@ -6,7 +6,6 @@ #include #include #include -#include #include #include @@ -413,9 +412,7 @@ bool ptdump_check_wx(void) static int ptdump_show(struct seq_file *m, void *v) { - get_online_mems(); ptdump_walk(m, m->private); - put_online_mems(); return 0; } diff --git a/arch/s390/mm/dump_pagetables.c b/arch/s390/mm/dump_pagetables.c index ac604b176660..9af2aae0a515 100644 --- a/arch/s390/mm/dump_pagetables.c +++ b/arch/s390/mm/dump_pagetables.c @@ -247,11 +247,9 @@ static int ptdump_show(struct seq_file *m, void *v) .marker = markers, }; - get_online_mems(); mutex_lock(&cpa_mutex); ptdump_walk_pgd(&st.ptdump, &init_mm, NULL); mutex_unlock(&cpa_mutex); - put_online_mems(); return 0; } DEFINE_SHOW_ATTRIBUTE(ptdump); diff --git a/mm/ptdump.c 
b/mm/ptdump.c index 61a352aa12ed..b600c7f864b8 100644 --- a/mm/ptdump.c +++ b/mm/ptdump.c @@ -176,6 +176,7 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd) { const struct ptdump_range *range = st->range; + get_online_mems(); mmap_write_lock(mm); while (range->start != range->end) { walk_page_range_debug(mm, range->start, range->end, @@ -183,6 +184,7 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd) range++; } mmap_write_unlock(mm); + put_online_mems(); /* Flush out the last page */ st->note_page_flush(st); From b7482f91ea1d3ff3288c3678c37dcfe70124d26c Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 22 Jun 2025 14:37:55 -0700 Subject: [PATCH 111/370] mm/damon/sysfs-schemes: decouple from damos_quota_goal_metric Patch series "mm/damon: decouple sysfs from core". DAMON sysfs interface is coupled with core layer. It maintains some of its keywords arrays be synchronized with matching DAMON core API enums. It is unnecessary coupling that makes separated changes for different layers difficult. Decouple the layers by introducing new data structure for the mappings on DAMON sysfs interface. This patch (of 5): Decouple DAMOS sysfs interface from damos_quota_goal_metric. For this, define and use new sysfs-schemes internal data structure that maps the user-space keywords and damos_quota_goal_metric, instead of having the implicit and unflexible array index rule. Link: https://lkml.kernel.org/r/20250622213759.50930-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250622213759.50930-2-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/sysfs-schemes.c | 55 ++++++++++++++++++++++++++++++---------- 1 file changed, 41 insertions(+), 14 deletions(-) diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c index 30ae7518ffbf..3747dc6678f2 100644 --- a/mm/damon/sysfs-schemes.c +++ b/mm/damon/sysfs-schemes.c @@ -941,27 +941,51 @@ struct damos_sysfs_quota_goal { int nid; }; -/* This should match with enum damos_quota_goal_metric */ -static const char * const damos_sysfs_quota_goal_metric_strs[] = { - "user_input", - "some_mem_psi_us", - "node_mem_used_bp", - "node_mem_free_bp", -}; - static struct damos_sysfs_quota_goal *damos_sysfs_quota_goal_alloc(void) { return kzalloc(sizeof(struct damos_sysfs_quota_goal), GFP_KERNEL); } +struct damos_sysfs_qgoal_metric_name { + enum damos_quota_goal_metric metric; + char *name; +}; + +static +struct damos_sysfs_qgoal_metric_name damos_sysfs_qgoal_metric_names[] = { + { + .metric = DAMOS_QUOTA_USER_INPUT, + .name = "user_input", + }, + { + .metric = DAMOS_QUOTA_SOME_MEM_PSI_US, + .name = "some_mem_psi_us", + }, + { + .metric = DAMOS_QUOTA_NODE_MEM_USED_BP, + .name = "node_mem_used_bp", + }, + { + .metric = DAMOS_QUOTA_NODE_MEM_FREE_BP, + .name = "node_mem_free_bp", + }, +}; + static ssize_t target_metric_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { struct damos_sysfs_quota_goal *goal = container_of(kobj, struct damos_sysfs_quota_goal, kobj); + int i; - return sysfs_emit(buf, "%s\n", - damos_sysfs_quota_goal_metric_strs[goal->metric]); + for (i = 0; i < ARRAY_SIZE(damos_sysfs_qgoal_metric_names); i++) { + struct damos_sysfs_qgoal_metric_name *metric_name; + + metric_name = &damos_sysfs_qgoal_metric_names[i]; + if (metric_name->metric == goal->metric) + return sysfs_emit(buf, "%s\n", metric_name->name); + } + return -EINVAL; } static ssize_t target_metric_store(struct kobject *kobj, @@ -969,11 +993,14 @@ static ssize_t 
target_metric_store(struct kobject *kobj, { struct damos_sysfs_quota_goal *goal = container_of(kobj, struct damos_sysfs_quota_goal, kobj); - enum damos_quota_goal_metric m; + int i; - for (m = 0; m < NR_DAMOS_QUOTA_GOAL_METRICS; m++) { - if (sysfs_streq(buf, damos_sysfs_quota_goal_metric_strs[m])) { - goal->metric = m; + for (i = 0; i < ARRAY_SIZE(damos_sysfs_qgoal_metric_names); i++) { + struct damos_sysfs_qgoal_metric_name *metric_name; + + metric_name = &damos_sysfs_qgoal_metric_names[i]; + if (sysfs_streq(buf, metric_name->name)) { + goal->metric = metric_name->metric; return count; } } From 2bbf41ee98565ca36d80975614b8dcfc32ea3c58 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 22 Jun 2025 14:37:56 -0700 Subject: [PATCH 112/370] mm/damon/sysfs-schemes: decouple from damos_action Decouple DAMOS sysfs interface from damos_action. For this, define and use new sysfs-schemes internal data structure that maps the user-space keywords and damos_action, instead of having the implicit and unflexible array index rule. [akpm@linux-foundation.org: make damos_sysfs_action_names static] Closes: https://lore.kernel.org/oe-kbuild-all/202506271655.b8yfEZIT-lkp@intel.com/ Link: https://lkml.kernel.org/r/20250622213759.50930-3-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/sysfs-schemes.c | 80 +++++++++++++++++++++++++++++++--------- 1 file changed, 62 insertions(+), 18 deletions(-) diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c index 3747dc6678f2..8a00a76a713f 100644 --- a/mm/damon/sysfs-schemes.c +++ b/mm/damon/sysfs-schemes.c @@ -1614,18 +1614,52 @@ struct damon_sysfs_scheme { int target_nid; }; -/* This should match with enum damos_action */ -static const char * const damon_sysfs_damos_action_strs[] = { - "willneed", - "cold", - "pageout", - "hugepage", - "nohugepage", - "lru_prio", - "lru_deprio", - "migrate_hot", - "migrate_cold", - "stat", +struct damos_sysfs_action_name { + enum damos_action action; + char *name; +}; + +static struct damos_sysfs_action_name damos_sysfs_action_names[] = { + { + .action = DAMOS_WILLNEED, + .name = "willneed", + }, + { + .action = DAMOS_COLD, + .name = "cold", + }, + { + .action = DAMOS_PAGEOUT, + .name = "pageout", + }, + { + .action = DAMOS_HUGEPAGE, + .name = "hugepage", + }, + { + .action = DAMOS_NOHUGEPAGE, + .name = "nohugepage", + }, + { + .action = DAMOS_LRU_PRIO, + .name = "lru_prio", + }, + { + .action = DAMOS_LRU_DEPRIO, + .name = "lru_deprio", + }, + { + .action = DAMOS_MIGRATE_HOT, + .name = "migrate_hot", + }, + { + .action = DAMOS_MIGRATE_COLD, + .name = "migrate_cold", + }, + { + .action = DAMOS_STAT, + .name = "stat", + }, }; static struct damon_sysfs_scheme *damon_sysfs_scheme_alloc( @@ -1862,9 +1896,16 @@ static ssize_t action_show(struct kobject *kobj, struct kobj_attribute *attr, { struct damon_sysfs_scheme *scheme = container_of(kobj, struct damon_sysfs_scheme, kobj); + int i; - return sysfs_emit(buf, "%s\n", - damon_sysfs_damos_action_strs[scheme->action]); + for (i = 0; i < ARRAY_SIZE(damos_sysfs_action_names); i++) { + struct damos_sysfs_action_name *action_name; + + action_name = &damos_sysfs_action_names[i]; + if (action_name->action == scheme->action) + return sysfs_emit(buf, "%s\n", action_name->name); + } + return -EINVAL; } static ssize_t action_store(struct kobject *kobj, struct kobj_attribute *attr, @@ -1872,11 +1913,14 @@ static ssize_t action_store(struct kobject *kobj, struct kobj_attribute *attr, { struct damon_sysfs_scheme *scheme = container_of(kobj, struct 
damon_sysfs_scheme, kobj); - enum damos_action action; + int i; - for (action = 0; action < NR_DAMOS_ACTIONS; action++) { - if (sysfs_streq(buf, damon_sysfs_damos_action_strs[action])) { - scheme->action = action; + for (i = 0; i < ARRAY_SIZE(damos_sysfs_action_names); i++) { + struct damos_sysfs_action_name *action_name; + + action_name = &damos_sysfs_action_names[i]; + if (sysfs_streq(buf, action_name->name)) { + scheme->action = action_name->action; return count; } } From 041f54604f2c5fc3a363cdf0002389bb98b1e560 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 22 Jun 2025 14:37:57 -0700 Subject: [PATCH 113/370] mm/damon/sysfs-schemes: decouple from damos_wmark_metric Decouple DAMOS sysfs interface from damos_wmark_metric. For this, define and use new sysfs-schemes internal data structure that maps the user-space keywords and damos_wmark_metric, instead of having the implicit and unflexible array index rule. Link: https://lkml.kernel.org/r/20250622213759.50930-4-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/sysfs-schemes.c | 41 ++++++++++++++++++++++++++++++---------- 1 file changed, 31 insertions(+), 10 deletions(-) diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c index 8a00a76a713f..62150d32d30a 100644 --- a/mm/damon/sysfs-schemes.c +++ b/mm/damon/sysfs-schemes.c @@ -785,10 +785,21 @@ static struct damon_sysfs_watermarks *damon_sysfs_watermarks_alloc( return watermarks; } -/* Should match with enum damos_wmark_metric */ -static const char * const damon_sysfs_wmark_metric_strs[] = { - "none", - "free_mem_rate", +struct damos_sysfs_wmark_metric_name { + enum damos_wmark_metric metric; + char *name; +}; + +static const struct damos_sysfs_wmark_metric_name +damos_sysfs_wmark_metric_names[] = { + { + .metric = DAMOS_WMARK_NONE, + .name = "none", + }, + { + .metric = DAMOS_WMARK_FREE_MEM_RATE, + .name = "free_mem_rate", + }, }; static ssize_t metric_show(struct kobject *kobj, struct kobj_attribute *attr, @@ -796,9 +807,16 @@ static ssize_t metric_show(struct kobject *kobj, struct kobj_attribute *attr, { struct damon_sysfs_watermarks *watermarks = container_of(kobj, struct damon_sysfs_watermarks, kobj); + int i; - return sysfs_emit(buf, "%s\n", - damon_sysfs_wmark_metric_strs[watermarks->metric]); + for (i = 0; i < ARRAY_SIZE(damos_sysfs_wmark_metric_names); i++) { + const struct damos_sysfs_wmark_metric_name *metric_name; + + metric_name = &damos_sysfs_wmark_metric_names[i]; + if (metric_name->metric == watermarks->metric) + return sysfs_emit(buf, "%s\n", metric_name->name); + } + return -EINVAL; } static ssize_t metric_store(struct kobject *kobj, struct kobj_attribute *attr, @@ -806,11 +824,14 @@ static ssize_t metric_store(struct kobject *kobj, struct kobj_attribute *attr, { struct damon_sysfs_watermarks *watermarks = container_of(kobj, struct damon_sysfs_watermarks, kobj); - enum damos_wmark_metric metric; + int i; - for (metric = 0; metric < NR_DAMOS_WMARK_METRICS; metric++) { - if (sysfs_streq(buf, damon_sysfs_wmark_metric_strs[metric])) { - watermarks->metric = metric; + for (i = 0; i < ARRAY_SIZE(damos_sysfs_wmark_metric_names); i++) { + const struct damos_sysfs_wmark_metric_name *metric_name; + + metric_name = &damos_sysfs_wmark_metric_names[i]; + if (sysfs_streq(buf, metric_name->name)) { + watermarks->metric = metric_name->metric; return count; } } From a39346daecc302e82a2c930c65f26519c106573e Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 22 Jun 2025 14:37:58 -0700 Subject: [PATCH 114/370] 
mm/damon/sysfs-schemes: decouple from damos_filter_type Decouple DAMOS sysfs interface from damos_filter_type. For this, define and use new sysfs-schemes internal data structure that maps the user-space keywords and damos_filter_type, instead of having the implicit and unflexible array index rule. Link: https://lkml.kernel.org/r/20250622213759.50930-5-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/sysfs-schemes.c | 75 ++++++++++++++++++++++++++++++---------- 1 file changed, 57 insertions(+), 18 deletions(-) diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c index 62150d32d30a..601360e9b521 100644 --- a/mm/damon/sysfs-schemes.c +++ b/mm/damon/sysfs-schemes.c @@ -341,16 +341,45 @@ static struct damon_sysfs_scheme_filter *damon_sysfs_scheme_filter_alloc( return filter; } -/* Should match with enum damos_filter_type */ -static const char * const damon_sysfs_scheme_filter_type_strs[] = { - "anon", - "active", - "memcg", - "young", - "hugepage_size", - "unmapped", - "addr", - "target", +struct damos_sysfs_filter_type_name { + enum damos_filter_type type; + char *name; +}; + +static const struct damos_sysfs_filter_type_name +damos_sysfs_filter_type_names[] = { + { + .type = DAMOS_FILTER_TYPE_ANON, + .name = "anon", + }, + { + .type = DAMOS_FILTER_TYPE_ACTIVE, + .name = "active", + }, + { + .type = DAMOS_FILTER_TYPE_MEMCG, + .name = "memcg", + }, + { + .type = DAMOS_FILTER_TYPE_YOUNG, + .name = "young", + }, + { + .type = DAMOS_FILTER_TYPE_HUGEPAGE_SIZE, + .name = "hugepage_size", + }, + { + .type = DAMOS_FILTER_TYPE_UNMAPPED, + .name = "unmapped", + }, + { + .type = DAMOS_FILTER_TYPE_ADDR, + .name = "addr", + }, + { + .type = DAMOS_FILTER_TYPE_TARGET, + .name = "target", + }, }; static ssize_t type_show(struct kobject *kobj, @@ -358,9 +387,16 @@ static ssize_t type_show(struct kobject *kobj, { struct damon_sysfs_scheme_filter *filter = container_of(kobj, struct damon_sysfs_scheme_filter, kobj); + int i; - return sysfs_emit(buf, "%s\n", - damon_sysfs_scheme_filter_type_strs[filter->type]); + for (i = 0; i < ARRAY_SIZE(damos_sysfs_filter_type_names); i++) { + const struct damos_sysfs_filter_type_name *type_name; + + type_name = &damos_sysfs_filter_type_names[i]; + if (type_name->type == filter->type) + return sysfs_emit(buf, "%s\n", type_name->name); + } + return -EINVAL; } static bool damos_sysfs_scheme_filter_valid_type( @@ -385,16 +421,19 @@ static ssize_t type_store(struct kobject *kobj, { struct damon_sysfs_scheme_filter *filter = container_of(kobj, struct damon_sysfs_scheme_filter, kobj); - enum damos_filter_type type; ssize_t ret = -EINVAL; + int i; - for (type = 0; type < NR_DAMOS_FILTER_TYPES; type++) { - if (sysfs_streq(buf, damon_sysfs_scheme_filter_type_strs[ - type])) { + for (i = 0; i < ARRAY_SIZE(damos_sysfs_filter_type_names); i++) { + const struct damos_sysfs_filter_type_name *type_name; + + type_name = &damos_sysfs_filter_type_names[i]; + if (sysfs_streq(buf, type_name->name)) { if (!damos_sysfs_scheme_filter_valid_type( - filter->handle_layer, type)) + filter->handle_layer, + type_name->type)) break; - filter->type = type; + filter->type = type_name->type; ret = count; break; } From d1600be2f68a6f79a1ea3993b7ab84fcbb824879 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 22 Jun 2025 14:37:59 -0700 Subject: [PATCH 115/370] mm/damon/sysfs: decouple from damon_ops_id Decouple DAMON sysfs interface from damon_ops_id. 
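The fragility being removed is easiest to see in sketch form: the old scheme keeps a bare string array whose order must track the enum by hand, so an enum value inserted or reordered upstream silently shifts every keyword after it (illustrative, mirroring the arrays this series deletes):

        /* This should match with enum damon_ops_id -- by hand, with no compiler help */
        static const char * const damon_sysfs_ops_strs[] = {
                "vaddr",
                "fvaddr",
                "paddr",
        };

        /* Lookup is purely positional, so a stale array returns the wrong keyword. */
        return sysfs_emit(buf, "%s\n", damon_sysfs_ops_strs[context->ops_id]);
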
For this, define and use new mm/damon/sysfs.c internal data structure that maps the user-space keywords and damon_ops_id, instead of having the implicit and unflexible array index rule. Link: https://lkml.kernel.org/r/20250622213759.50930-6-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/sysfs.c | 56 +++++++++++++++++++++++++++++++++++------------- 1 file changed, 41 insertions(+), 15 deletions(-) diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c index 1af6aff35d84..1b1476b79cdb 100644 --- a/mm/damon/sysfs.c +++ b/mm/damon/sysfs.c @@ -811,11 +811,24 @@ static const struct kobj_type damon_sysfs_attrs_ktype = { * context directory */ -/* This should match with enum damon_ops_id */ -static const char * const damon_sysfs_ops_strs[] = { - "vaddr", - "fvaddr", - "paddr", +struct damon_sysfs_ops_name { + enum damon_ops_id ops_id; + char *name; +}; + +static const struct damon_sysfs_ops_name damon_sysfs_ops_names[] = { + { + .ops_id = DAMON_OPS_VADDR, + .name = "vaddr", + }, + { + .ops_id = DAMON_OPS_FVADDR, + .name = "fvaddr", + }, + { + .ops_id = DAMON_OPS_PADDR, + .name = "paddr", + }, }; struct damon_sysfs_context { @@ -934,14 +947,16 @@ static void damon_sysfs_context_rm_dirs(struct damon_sysfs_context *context) static ssize_t avail_operations_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - enum damon_ops_id id; int len = 0; + int i; - for (id = 0; id < NR_DAMON_OPS; id++) { - if (!damon_is_registered_ops(id)) + for (i = 0; i < ARRAY_SIZE(damon_sysfs_ops_names); i++) { + const struct damon_sysfs_ops_name *ops_name; + + ops_name = &damon_sysfs_ops_names[i]; + if (!damon_is_registered_ops(ops_name->ops_id)) continue; - len += sysfs_emit_at(buf, len, "%s\n", - damon_sysfs_ops_strs[id]); + len += sysfs_emit_at(buf, len, "%s\n", ops_name->name); } return len; } @@ -951,8 +966,16 @@ static ssize_t operations_show(struct kobject *kobj, { struct damon_sysfs_context *context = container_of(kobj, struct damon_sysfs_context, kobj); + int i; - return sysfs_emit(buf, "%s\n", damon_sysfs_ops_strs[context->ops_id]); + for (i = 0; i < ARRAY_SIZE(damon_sysfs_ops_names); i++) { + const struct damon_sysfs_ops_name *ops_name; + + ops_name = &damon_sysfs_ops_names[i]; + if (ops_name->ops_id == context->ops_id) + return sysfs_emit(buf, "%s\n", ops_name->name); + } + return -EINVAL; } static ssize_t operations_store(struct kobject *kobj, @@ -960,11 +983,14 @@ static ssize_t operations_store(struct kobject *kobj, { struct damon_sysfs_context *context = container_of(kobj, struct damon_sysfs_context, kobj); - enum damon_ops_id id; + int i; - for (id = 0; id < NR_DAMON_OPS; id++) { - if (sysfs_streq(buf, damon_sysfs_ops_strs[id])) { - context->ops_id = id; + for (i = 0; i < ARRAY_SIZE(damon_sysfs_ops_names); i++) { + const struct damon_sysfs_ops_name *ops_name; + + ops_name = &damon_sysfs_ops_names[i]; + if (sysfs_streq(buf, ops_name->name)) { + context->ops_id = ops_name->ops_id; return count; } } From 2e728505494b21b874fa87fce233c63b43d74434 Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Mon, 23 Jun 2025 20:40:34 +0200 Subject: [PATCH 116/370] lib/test_vmalloc.c: use late_initcall() if built-in for init ordering When the vmalloc test code is compiled as a built-in, use late_initcall() instead of module_init() to defer a vmalloc test execution until most subsystems are up and running. It avoids interfering with components that may not yet be initialized at module_init() time. 
For example, there was a recent report of memory profiling infrastructure not being ready early enough leading to kernel crash. By using late_initcall() in the built-in case, we ensure the tests are run at a safer point during a boot sequence. Link: https://lkml.kernel.org/r/20250623184035.581229-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) Reviewed-by: Baoquan He Cc: Harry Yoo Cc: Suren Baghdasaryan Cc: David Wang <00107082@163.com> Signed-off-by: Andrew Morton --- lib/test_vmalloc.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c index 1b0b59549aaf..7264781750c9 100644 --- a/lib/test_vmalloc.c +++ b/lib/test_vmalloc.c @@ -598,7 +598,11 @@ static int __init vmalloc_test_init(void) return IS_BUILTIN(CONFIG_TEST_VMALLOC) ? 0:-EAGAIN; } +#ifdef MODULE module_init(vmalloc_test_init) +#else +late_initcall(vmalloc_test_init); +#endif MODULE_LICENSE("GPL"); MODULE_AUTHOR("Uladzislau Rezki"); From d8e77a0b636485364d70b86addf0c76bf9bccc4f Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Mon, 23 Jun 2025 20:40:35 +0200 Subject: [PATCH 117/370] lib/test_vmalloc.c: restrict default test mask to avoid test warnings When the vmalloc test is built into the kernel, it runs automatically during the boot. The current-default "run_test_mask" includes all test cases, including those which are designed to fail and which trigger kernel warnings. These kernel splats can be misinterpreted as actual kernel bugs, leading to false alarms and unnecessary reports. To address this, limit the default test mask to only the first few tests which are expected to pass cleanly. These tests are safe and should not generate any warnings unless there is a real bug. Users who wish to explicitly run specific test cases have to pass the run_test_mask as a boot parameter or at module load time. Link: https://lkml.kernel.org/r/20250623184035.581229-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) Reviewed-by: Baoquan He Cc: Harry Yoo Cc: Suren Baghdasaryan Cc: David Wang <00107082@163.com> Signed-off-by: Andrew Morton --- lib/test_vmalloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c index 7264781750c9..c1966cf72ab8 100644 --- a/lib/test_vmalloc.c +++ b/lib/test_vmalloc.c @@ -41,7 +41,7 @@ __param(int, nr_pages, 0, __param(bool, use_huge, false, "Use vmalloc_huge in fix_size_alloc_test"); -__param(int, run_test_mask, INT_MAX, +__param(int, run_test_mask, 7, "Set tests specified in the mask.\n\n" "\t\tid: 1, name: fix_size_alloc_test\n" "\t\tid: 2, name: full_fit_alloc_test\n" From d2ef92cd2a31ba7c0d0eb0dd5c1acf381f161fcd Mon Sep 17 00:00:00 2001 From: Sabyrzhan Tasbolatov Date: Sun, 22 Jun 2025 10:19:06 +0500 Subject: [PATCH 118/370] mm: unexport globally copy_to_kernel_nofault MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit copy_to_kernel_nofault() is an internal helper which should not be visible to loadable modules – exporting it would give exploit code a cheap oracle to probe kernel addresses. Instead, keep the helper un-exported and compile the kunit case that exercises it only when mm/kasan/kasan_test.o is linked into vmlinux. 
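The oracle concern is concrete: the helper answers, without faulting, whether an arbitrary kernel address is writable. A minimal sketch of the contract (illustrative fragment; the real definition lives in mm/maccess.c, and dst here stands for the kernel address being probed):

        long ret;
        int value = 42;

        /* Returns 0 on success, -EFAULT if dst is not a writable kernel address. */
        ret = copy_to_kernel_nofault(dst, &value, sizeof(value));

Exported, that yes/no answer would let a loadable module probe the kernel address space one address at a time; kept un-exported, only built-in code such as the KUnit case below can reach it.
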
[snovitoll@gmail.com: add a brief comment to `#ifndef MODULE`] Link: https://lkml.kernel.org/r/20250622141142.79332-1-snovitoll@gmail.com Link: https://lkml.kernel.org/r/20250622051906.67374-1-snovitoll@gmail.com Fixes: ca79a00bb9a8 ("kasan: migrate copy_user_test to kunit") Signed-off-by: Sabyrzhan Tasbolatov Suggested-by: Christoph Hellwig Suggested-by: Marco Elver Acked-by: David Hildenbrand Reviewed-by: Andrey Konovalov Cc: Alexander Potapenko Cc: Andrey Ryabinin Cc: Arnd Bergmann Cc: Dmitriy Vyukov Cc: Vincenzo Frascino Signed-off-by: Andrew Morton --- mm/kasan/kasan_test_c.c | 8 ++++++++ mm/maccess.c | 1 - 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/mm/kasan/kasan_test_c.c b/mm/kasan/kasan_test_c.c index 5f922dd38ffa..2aa12dfa427a 100644 --- a/mm/kasan/kasan_test_c.c +++ b/mm/kasan/kasan_test_c.c @@ -1977,6 +1977,11 @@ static void rust_uaf(struct kunit *test) KUNIT_EXPECT_KASAN_FAIL(test, kasan_test_rust_uaf()); } +/* + * copy_to_kernel_nofault() is an internal helper available when + * kasan_test is built-in, so it must not be visible to loadable modules. + */ +#ifndef MODULE static void copy_to_kernel_nofault_oob(struct kunit *test) { char *ptr; @@ -2011,6 +2016,7 @@ static void copy_to_kernel_nofault_oob(struct kunit *test) kfree(ptr); } +#endif /* !MODULE */ static void copy_user_test_oob(struct kunit *test) { @@ -2131,7 +2137,9 @@ static struct kunit_case kasan_kunit_test_cases[] = { KUNIT_CASE(match_all_not_assigned), KUNIT_CASE(match_all_ptr_tag), KUNIT_CASE(match_all_mem_tag), +#ifndef MODULE KUNIT_CASE(copy_to_kernel_nofault_oob), +#endif KUNIT_CASE(rust_uaf), KUNIT_CASE(copy_user_test_oob), {} diff --git a/mm/maccess.c b/mm/maccess.c index 831b4dd7296c..486559d68858 100644 --- a/mm/maccess.c +++ b/mm/maccess.c @@ -82,7 +82,6 @@ long copy_to_kernel_nofault(void *dst, const void *src, size_t size) pagefault_enable(); return -EFAULT; } -EXPORT_SYMBOL_GPL(copy_to_kernel_nofault); long strncpy_from_kernel_nofault(char *dst, const void *unsafe_addr, long count) { From 67c94320571f37efd6d88b8c23a3a69ece63fb7f Mon Sep 17 00:00:00 2001 From: Moon Hee Lee Date: Wed, 25 Jun 2025 19:07:58 -0700 Subject: [PATCH 119/370] selftests/mm: remove duplicate .gitignore entries Remove redundant entries in .gitignore confirmed by: $ sort tools/testing/selftests/mm/.gitignore | uniq -d hugetlb_dio pkey_sighandler_tests_32 pkey_sighandler_tests_64 These entries were originally added by [1], and later duplicated by [2]. [1] https://lore.kernel.org/all/20240924185911.117937-1-lorenzo.stoakes@oracle.com/ [2] https://lore.kernel.org/all/20241125064036.413536-1-lizhijian@fujitsu.com/ Link: https://lkml.kernel.org/r/20250626020758.163243-1-moonhee.lee.ca@gmail.com Signed-off-by: Moon Hee Lee Reviewed-by: Dev Jain Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/.gitignore | 3 --- 1 file changed, 3 deletions(-) diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index 824266982aa3..f2dafa0b700b 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -38,9 +38,6 @@ map_fixed_noreplace write_to_hugetlbfs hmm-tests memfd_secret -hugetlb_dio -pkey_sighandler_tests_32 -pkey_sighandler_tests_64 soft-dirty split_huge_page_test ksm_tests From 12f1d7931283106306a8822c5279013b8a0f1242 Mon Sep 17 00:00:00 2001 From: "Liam R. 
Howlett" Date: Tue, 24 Jun 2025 11:48:22 -0400 Subject: [PATCH 120/370] maple_tree: fix status setup on restore to active During the initial call with a maple state, an error status may be set before a valid node is populated into the maple state node. Subsequent calls with the maple state may restore the state into an active state with no node set. This was masked by the mas_walk() always resetting the status to ma_start and result in an extra walk in this rare scenario. Don't restore the state to active unless there is a value in the structs node. This also allows mas_walk() to be fixed to use the active state without exposing an issue. User visible results are marginal performance improvements when an active state can be restored and used instead of rewalking the tree. Stable is not Cc'ed because the existing code is stable and the performance gains are not worth the risk. Link: https://lore.kernel.org/all/20250611011253.19515-1-richard.weiyang@gmail.com/ Link: https://lore.kernel.org/all/20250407231354.11771-1-richard.weiyang@gmail.com/ Link: https://lore.kernel.org/all/202506191556.6bfc7b93-lkp@intel.com/ Link: https://lkml.kernel.org/r/20250624154823.52221-1-Liam.Howlett@oracle.com Fixes: a8091f039c1e ("maple_tree: add MAS_UNDERFLOW and MAS_OVERFLOW states") Signed-off-by: Liam R. Howlett Reported-by: Wei Yang Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202506191556.6bfc7b93-lkp@intel.com Reviewed-by: Wei Yang Signed-off-by: Andrew Morton --- lib/maple_tree.c | 25 ++++++++++++++++++------- 1 file changed, 18 insertions(+), 7 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 34b84b14985e..7601c7c2bc09 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -4927,7 +4927,7 @@ void *mas_walk(struct ma_state *mas) { void *entry; - if (!mas_is_active(mas) || !mas_is_start(mas)) + if (!mas_is_active(mas) && !mas_is_start(mas)) mas->status = ma_start; retry: entry = mas_state_walk(mas); @@ -5655,6 +5655,17 @@ int mas_expected_entries(struct ma_state *mas, unsigned long nr_entries) } EXPORT_SYMBOL_GPL(mas_expected_entries); +static void mas_may_activate(struct ma_state *mas) +{ + if (!mas->node) { + mas->status = ma_start; + } else if (mas->index > mas->max || mas->index < mas->min) { + mas->status = ma_start; + } else { + mas->status = ma_active; + } +} + static bool mas_next_setup(struct ma_state *mas, unsigned long max, void **entry) { @@ -5678,11 +5689,11 @@ static bool mas_next_setup(struct ma_state *mas, unsigned long max, break; case ma_overflow: /* Overflowed before, but the max changed */ - mas->status = ma_active; + mas_may_activate(mas); break; case ma_underflow: /* The user expects the mas to be one before where it is */ - mas->status = ma_active; + mas_may_activate(mas); *entry = mas_walk(mas); if (*entry) return true; @@ -5803,11 +5814,11 @@ static bool mas_prev_setup(struct ma_state *mas, unsigned long min, void **entry break; case ma_underflow: /* underflowed before but the min changed */ - mas->status = ma_active; + mas_may_activate(mas); break; case ma_overflow: /* User expects mas to be one after where it is */ - mas->status = ma_active; + mas_may_activate(mas); *entry = mas_walk(mas); if (*entry) return true; @@ -5972,7 +5983,7 @@ static __always_inline bool mas_find_setup(struct ma_state *mas, unsigned long m return true; } - mas->status = ma_active; + mas_may_activate(mas); *entry = mas_walk(mas); if (*entry) return true; @@ -5981,7 +5992,7 @@ static __always_inline bool mas_find_setup(struct ma_state *mas, unsigned long m if 
(unlikely(mas->last >= max)) return true; - mas->status = ma_active; + mas_may_activate(mas); *entry = mas_walk(mas); if (*entry) return true; From 1f0bce2fa8c6bfd65cf78ad6ef6e0948fc55c7bb Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Tue, 24 Jun 2025 11:48:23 -0400 Subject: [PATCH 121/370] maple_tree: add testing for restoring maple state to active Restoring maple status to ma_active on overflow/underflow when mas->node was NULL could have happened in the past, but was masked by a bug in mas_walk(). Add test cases that triggered the bug when the node was mas->node prior to fixing the maple state setup. Add a few extra tests around restoring the active maple status. Link: https://lore.kernel.org/all/202506191556.6bfc7b93-lkp@intel.com/ Link: https://lkml.kernel.org/r/20250624154823.52221-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Wei Yang Signed-off-by: Andrew Morton --- lib/test_maple_tree.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c index 13e2a10d7554..cb3936595b0d 100644 --- a/lib/test_maple_tree.c +++ b/lib/test_maple_tree.c @@ -3177,6 +3177,7 @@ static noinline void __init check_state_handling(struct maple_tree *mt) void *entry, *ptr = (void *) 0x1234500; void *ptr2 = &ptr; void *ptr3 = &ptr2; + unsigned long index; /* Check MAS_ROOT First */ mtree_store_range(mt, 0, 0, ptr, GFP_KERNEL); @@ -3706,6 +3707,37 @@ static noinline void __init check_state_handling(struct maple_tree *mt) MT_BUG_ON(mt, mas.last != 0x1fff); MT_BUG_ON(mt, !mas_is_active(&mas)); + mas_unlock(&mas); + mtree_destroy(mt); + + mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE); + mas_lock(&mas); + for (int count = 0; count < 30; count++) { + mas_set(&mas, count); + mas_store_gfp(&mas, xa_mk_value(count), GFP_KERNEL); + } + + /* Ensure mas_find works with MA_UNDERFLOW */ + mas_set(&mas, 0); + entry = mas_walk(&mas); + mas_set(&mas, 0); + mas_prev(&mas, 0); + MT_BUG_ON(mt, mas.status != ma_underflow); + MT_BUG_ON(mt, mas_find(&mas, ULONG_MAX) != entry); + + /* Restore active on mas_next */ + entry = mas_next(&mas, ULONG_MAX); + index = mas.index; + mas_prev(&mas, index); + MT_BUG_ON(mt, mas.status != ma_underflow); + MT_BUG_ON(mt, mas_next(&mas, ULONG_MAX) != entry); + + /* Ensure overflow -> active works */ + mas_prev(&mas, 0); + mas_next(&mas, index - 1); + MT_BUG_ON(mt, mas.status != ma_overflow); + MT_BUG_ON(mt, mas_next(&mas, ULONG_MAX) != entry); + mas_unlock(&mas); } From 98db4b5472ed8a64e418b895054c5fb123c038bf Mon Sep 17 00:00:00 2001 From: Li Wang Date: Tue, 24 Jun 2025 12:24:11 +0800 Subject: [PATCH 122/370] selftests/mm: fix UFFDIO_API usage with proper two-step feature negotiation The current implementation of test_unmerge_uffd_wp() explicitly sets `uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP` before calling UFFDIO_API. 
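In sketch form, the problematic single-shot negotiation looks like this (illustrative fragment, using the uffdio_api layout from <linux/userfaultfd.h>; uffd is the already-created userfaultfd descriptor):

        struct uffdio_api uffdio_api;

        uffdio_api.api = UFFD_API;
        /* WP support is demanded up front instead of being queried first. */
        uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
        if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0)
                return -1;
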
This can cause the ioctl() call to fail with EINVAL on kernels that do not support UFFD-WP, leading the test to fail unnecessarily: # ------------------------------ # running ./ksm_functional_tests # ------------------------------ # TAP version 13 # 1..9 # # [RUN] test_unmerge # ok 1 Pages were unmerged # # [RUN] test_unmerge_zero_pages # ok 2 KSM zero pages were unmerged # # [RUN] test_unmerge_discarded # ok 3 Pages were unmerged # # [RUN] test_unmerge_uffd_wp # not ok 4 UFFDIO_API failed <----- # # [RUN] test_prot_none # ok 5 Pages were unmerged # # [RUN] test_prctl # ok 6 Setting/clearing PR_SET_MEMORY_MERGE works # # [RUN] test_prctl_fork # # No pages got merged # # [RUN] test_prctl_fork_exec # ok 7 PR_SET_MEMORY_MERGE value is inherited # # [RUN] test_prctl_unmerge # ok 8 Pages were unmerged # Bail out! 1 out of 8 tests failed # # Planned tests != run tests (9 != 8) # # Totals: pass:7 fail:1 xfail:0 xpass:0 skip:0 error:0 # [FAIL] This patch improves compatibility and robustness of the UFFD-WP test (test_unmerge_uffd_wp) by correctly implementing the UFFDIO_API two-step handshake as recommended by the userfaultfd(2) man page. Key changes: 1. Use features=0 in the initial UFFDIO_API call to query supported feature bits, rather than immediately requesting WP support. 2. Skip the test gracefully if: - UFFDIO_API fails with EINVAL (e.g. unsupported API version), or - UFFD_FEATURE_PAGEFAULT_FLAG_WP is not advertised by the kernel. 3. Close the initial userfaultfd and create a new one before enabling the required feature, since UFFDIO_API can only be called once per fd. 4. Improve diagnostics by distinguishing between expected and unexpected failures, using strerror() to report errors. This ensures the test behaves correctly across a wider range of kernel versions and configurations, while preserving the intended behavior on kernels that support UFFD-WP. [liwang@redhat.com: fail the test if sys_userfaultfd() fails, per David] Link: https://lkml.kernel.org/r/20250625004645.400520-1-liwang@redhat.com Link: https://lkml.kernel.org/r/20250624042411.395285-1-liwang@redhat.com Signed-off-by: Li Wang Suggested-by: David Hildenbrand Acked-by: David Hildenbrand Cc: Aruna Ramakrishna Cc: Bagas Sanjaya Cc: Catalin Marinas Cc: Dave Hansen Cc: Joey Gouly Cc: Johannes Weiner Cc: Keith Lucas Cc: Ryan Roberts Cc: Shuah Khan Cc: Axel Rasmussen Cc: Li Wang Cc: Mike Rapoport Cc: Nadav Amit Cc: Peter Xu Signed-off-by: Andrew Morton --- .../selftests/mm/ksm_functional_tests.c | 28 +++++++++++++++++-- 1 file changed, 26 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/mm/ksm_functional_tests.c b/tools/testing/selftests/mm/ksm_functional_tests.c index b61803e36d1c..d8bd1911dfc0 100644 --- a/tools/testing/selftests/mm/ksm_functional_tests.c +++ b/tools/testing/selftests/mm/ksm_functional_tests.c @@ -393,9 +393,13 @@ static void test_unmerge_uffd_wp(void) /* See if UFFD-WP is around. */ uffdio_api.api = UFFD_API; - uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP; + uffdio_api.features = 0; if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) { - ksft_test_result_fail("UFFDIO_API failed\n"); + if (errno == EINVAL) + ksft_test_result_skip("The API version requested is not supported\n"); + else + ksft_test_result_fail("UFFDIO_API failed: %s\n", strerror(errno)); + goto close_uffd; } if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) { @@ -403,6 +407,26 @@ static void test_unmerge_uffd_wp(void) goto close_uffd; } + /* + * UFFDIO_API must only be called once to enable features. 
+ * So we close the old userfaultfd and create a new one to + * actually enable UFFD_FEATURE_PAGEFAULT_FLAG_WP. + */ + close(uffd); + uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); + if (uffd < 0) { + ksft_test_result_fail("__NR_userfaultfd failed\n"); + goto unmap; + } + + /* Now, enable it ("two-step handshake") */ + uffdio_api.api = UFFD_API; + uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP; + if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) { + ksft_test_result_fail("UFFDIO_API failed: %s\n", strerror(errno)); + goto close_uffd; + } + /* Register UFFD-WP, no need for an actual handler. */ if (uffd_register(uffd, map, size, false, true, false)) { ksft_test_result_fail("UFFDIO_REGISTER_MODE_WP failed\n"); From fdc3bc3497946c6b416a17628907581657102e60 Mon Sep 17 00:00:00 2001 From: Li Wang Date: Tue, 24 Jun 2025 11:27:48 +0800 Subject: [PATCH 123/370] ksm_tests: skip hugepage test when Transparent Hugepages are disabled Some systems (e.g. minimal or real-time kernels) may not enable Transparent Hugepages (THP), causing MADV_HUGEPAGE to return EINVAL. This patch introduces a runtime check using the existing THP sysfs interface and skips the hugepage merging test (`-H`) when THP is not available. To avoid those failures: # ----------------------------- # running ./ksm_tests -H -s 100 # ----------------------------- # ksm_tests: MADV_HUGEPAGE: Invalid argument # [FAIL] not ok 1 ksm_tests -H -s 100 # exit=2 # -------------------- # running ./khugepaged # -------------------- # Reading PMD pagesize failed# [FAIL] not ok 1 khugepaged # exit=1 # -------------------- # running ./soft-dirty # -------------------- # TAP version 13 # 1..15 # ok 1 Test test_simple # ok 2 Test test_vma_reuse dirty bit of allocated page # ok 3 Test test_vma_reuse dirty bit of reused address page # Bail out! Reading PMD pagesize failed# Planned tests != run tests (15 != 3) # # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0 # [FAIL] not ok 1 soft-dirty # exit=1 # SUMMARY: PASS=0 SKIP=0 FAIL=1 # ------------------- # running ./migration # ------------------- # TAP version 13 # 1..3 # # Starting 3 tests from 1 test cases. # # RUN migration.private_anon ... # # OK migration.private_anon # ok 1 migration.private_anon # # RUN migration.shared_anon ... # # OK migration.shared_anon # ok 2 migration.shared_anon # # RUN migration.private_anon_thp ... # # migration.c:196:private_anon_thp:Expected madvise(ptr, TWOMEG, MADV_HUGEPAGE) (-1) == 0 (0) # # private_anon_thp: Test terminated by assertion # # FAIL migration.private_anon_thp # not ok 3 migration.private_anon_thp # # FAILED: 2 / 3 tests passed. # # Totals: pass:2 fail:1 xfail:0 xpass:0 skip:0 error:0 # [FAIL] not ok 1 migration # exit=1 It's true that CONFIG_TRANSPARENT_HUGEPAGE=y is explicitly enabled in tools/testing/selftests/mm/config, so ideally the runtime environment should also support THP. However, in practice, we've found that on some systems: - THP is disabled at boot time (transparent_hugepage=never) - Or manually disabled via sysfs - Or unavailable in RT kernels, containers, or minimal CI environments In these cases, the test will fail with EINVAL on madvise(MADV_HUGEPAGE), even though the kernel config is correct. To make the test suite more robust and avoid false negatives, this patch adds a runtime check for /sys/kernel/mm/transparent_hugepage/enabled. If THP is not available, the hugepage test (-H) is skipped with a clear message. 
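For reference, the runtime check reduces to parsing the bracketed mode in the sysfs file. A minimal standalone sketch (thp_available() is a hypothetical name for illustration; the selftests themselves use the shared thp_is_enabled() helper added to thp_settings.c below):

        #include <stdbool.h>
        #include <stdio.h>
        #include <string.h>

        /* The active mode is bracketed, e.g. "always madvise [never]". */
        static bool thp_available(void)
        {
                FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
                char buf[128];

                if (!f)
                        return false;   /* THP not compiled in */
                if (!fgets(buf, sizeof(buf), f)) {
                        fclose(f);
                        return false;
                }
                fclose(f);
                return strstr(buf, "[never]") == NULL;
        }
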
Link: https://lkml.kernel.org/r/20250624032748.393836-1-liwang@redhat.com Signed-off-by: Li Wang Cc: Aruna Ramakrishna Cc: Bagas Sanjaya Cc: Catalin Marinas Cc: Dave Hansen Cc: David Hildenbrand Cc: Joey Gouly Cc: Johannes Weiner Cc: Keith Lucas Cc: Ryan Roberts Cc: Shuah Khan Cc: Lorenzo Stoakes Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/khugepaged.c | 5 +++++ tools/testing/selftests/mm/ksm_tests.c | 6 ++++++ tools/testing/selftests/mm/migration.c | 8 ++++++++ tools/testing/selftests/mm/soft-dirty.c | 9 ++++++++- tools/testing/selftests/mm/thp_settings.c | 11 +++++++++++ tools/testing/selftests/mm/thp_settings.h | 2 ++ 6 files changed, 40 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c index 4341ce6b3b38..a18c50d51141 100644 --- a/tools/testing/selftests/mm/khugepaged.c +++ b/tools/testing/selftests/mm/khugepaged.c @@ -1188,6 +1188,11 @@ int main(int argc, char **argv) .read_ahead_kb = 0, }; + if (!thp_is_enabled()) { + printf("Transparent Hugepages not available\n"); + return KSFT_SKIP; + } + parse_test_type(argc, argv); setbuf(stdout, NULL); diff --git a/tools/testing/selftests/mm/ksm_tests.c b/tools/testing/selftests/mm/ksm_tests.c index e80deac1436b..b77462b5c240 100644 --- a/tools/testing/selftests/mm/ksm_tests.c +++ b/tools/testing/selftests/mm/ksm_tests.c @@ -15,6 +15,7 @@ #include "../kselftest.h" #include #include "vm_util.h" +#include "thp_settings.h" #define KSM_SYSFS_PATH "/sys/kernel/mm/ksm/" #define KSM_FP(s) (KSM_SYSFS_PATH s) @@ -527,6 +528,11 @@ static int ksm_merge_hugepages_time(int merge_type, int mapping, int prot, unsigned long scan_time_ns; int pagemap_fd, n_normal_pages, n_huge_pages; + if (!thp_is_enabled()) { + printf("Transparent Hugepages not available\n"); + return KSFT_SKIP; + } + map_size *= MB; size_t len = map_size; diff --git a/tools/testing/selftests/mm/migration.c b/tools/testing/selftests/mm/migration.c index 1e3a595fbf01..a306f8bab087 100644 --- a/tools/testing/selftests/mm/migration.c +++ b/tools/testing/selftests/mm/migration.c @@ -5,6 +5,8 @@ */ #include "../kselftest_harness.h" +#include "thp_settings.h" + #include #include #include @@ -185,6 +187,9 @@ TEST_F_TIMEOUT(migration, private_anon_thp, 2*RUNTIME) uint64_t *ptr; int i; + if (!thp_is_enabled()) + SKIP(return, "Transparent Hugepages not available"); + if (self->nthreads < 2 || self->n1 < 0 || self->n2 < 0) SKIP(return, "Not enough threads or NUMA nodes available"); @@ -214,6 +219,9 @@ TEST_F_TIMEOUT(migration, shared_anon_thp, 2*RUNTIME) uint64_t *ptr; int i; + if (!thp_is_enabled()) + SKIP(return, "Transparent Hugepages not available"); + if (self->nthreads < 2 || self->n1 < 0 || self->n2 < 0) SKIP(return, "Not enough threads or NUMA nodes available"); diff --git a/tools/testing/selftests/mm/soft-dirty.c b/tools/testing/selftests/mm/soft-dirty.c index 8e1462ce0532..8a3f2b4b2186 100644 --- a/tools/testing/selftests/mm/soft-dirty.c +++ b/tools/testing/selftests/mm/soft-dirty.c @@ -6,8 +6,10 @@ #include #include #include + #include "../kselftest.h" #include "vm_util.h" +#include "thp_settings.h" #define PAGEMAP_FILE_PATH "/proc/self/pagemap" #define TEST_ITERATIONS 10000 @@ -78,8 +80,13 @@ static void test_hugepage(int pagemap_fd, int pagesize) { char *map; int i, ret; - size_t hpage_len = read_pmd_pagesize(); + if (!thp_is_enabled()) { + ksft_test_result_skip("Transparent Hugepages not available\n"); + return; + } + + size_t hpage_len = read_pmd_pagesize(); if (!hpage_len) ksft_exit_fail_msg("Reading 
PMD pagesize failed"); diff --git a/tools/testing/selftests/mm/thp_settings.c b/tools/testing/selftests/mm/thp_settings.c index ad872af1c81a..bad60ac52874 100644 --- a/tools/testing/selftests/mm/thp_settings.c +++ b/tools/testing/selftests/mm/thp_settings.c @@ -381,3 +381,14 @@ unsigned long thp_shmem_supported_orders(void) { return __thp_supported_orders(true); } + +bool thp_is_enabled(void) +{ + if (access(THP_SYSFS, F_OK) != 0) + return false; + + int mode = thp_read_string("enabled", thp_enabled_strings); + + /* THP is considered enabled if it's either "always" or "madvise" */ + return mode == 1 || mode == 3; +} diff --git a/tools/testing/selftests/mm/thp_settings.h b/tools/testing/selftests/mm/thp_settings.h index fc131d23d593..6c07f70beee9 100644 --- a/tools/testing/selftests/mm/thp_settings.h +++ b/tools/testing/selftests/mm/thp_settings.h @@ -84,4 +84,6 @@ void thp_set_read_ahead_path(char *path); unsigned long thp_supported_orders(void); unsigned long thp_shmem_supported_orders(void); +bool thp_is_enabled(void); + #endif /* __THP_SETTINGS_H__ */ From 58fc12f77eb962360ecf80abd3e1c2790b50786d Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 20 Jun 2025 16:33:01 +0100 Subject: [PATCH 124/370] mm/madvise: remove the visitor pattern and thread anon_vma state Patch series "madvise cleanup", v2. This is a series of patches that helps address a number of historic problems in the madvise() implementation: * Eliminate the visitor pattern and having the code which is implemented for both the anon_vma_name implementation and ordinary madvise() operations use the same madvise_vma_behavior() implementation. * Thread state through the madvise_behavior state object - this object, very usefully introduced by SJ, is already used to transmit state through operations. This series extends this by having all madvise() operations use this, including anon_vma_name. * Thread range, VMA state through madvise_behavior - This helps avoid a lot of the confusing code around range and VMA state and again keeps things consistent and with a single 'source of truth'. * Addressing the very strange behaviour around the passed around struct vm_area_struct **prev pointer - all read-only users do absolutely nothing with the prev pointer. The only function that uses it is madvise_update_vma(), and in all cases prev is always reset to VMA. Fix this by no longer having aything but madvise_update_vma() reference prev, and having madvise_walk_vmas() update prev in each instance. Additionally make it clear that the meaningful change in vma state is when madvise_update_vma() potentially merges a VMA, so explicitly retrieve the VMA in this case. * Update and clarify the madvise_walk_vmas() function - this is a source of a great deal of confusion, so simplify, stop using prev = NULL to signify that the mmap lock has been dropped (!) and make that explicit, and add some comments to explain what's going on. This patch (of 5): Now we have the madvise_behavior helper struct we no longer need to mess around with void* pointers in order to propagate anon_vma_name, and this means we can get rid of the confusing and inconsistent visitor pattern implementation in madvise_vma_anon_name(). This means we now have a single state object that threads through most of madvise()'s logic and a single code path which executes the majority of madvise() behaviour (we maintain separate logic for failure injection and memory population for the time being). 
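In sketch form, the walker's signature collapses from a visitor callback into a direct call (abridged from the diff below):

        /* Before: behaviour dispatched through a function pointer and a void *arg. */
        static int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
                        unsigned long end, struct madvise_behavior *madv_behavior,
                        void *arg,
                        int (*visit)(struct vm_area_struct *vma,
                                struct vm_area_struct **prev, unsigned long start,
                                unsigned long end, void *arg));

        /* After: one state object, one implementation (madvise_vma_behavior). */
        static int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
                        unsigned long end, struct madvise_behavior *madv_behavior);
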
We are able to remove the visitor pattern by handling the anon_vma_name setting logic via an internal madvise flag - __MADV_SET_ANON_VMA_NAME. This uses a negative value so it isn't reasonable that we will ever add this as a UAPI flag. Additionally, the madvise_behavior_valid() check ensures that user-specified behaviours are strictly only those we permit which, of course, this flag will be excluded from. We are able to propagate the anon_vma_name object through use of the madvise_behavior helper struct. Doing this results in a can_modify_vma_madv() check for anonymous VMA name changes, however this will cause no issues as this operation is not prohibited. We can also then reuse more code and drop the redundant madvise_vma_anon_name() function altogether. Additionally separate out behaviours that update VMAs from those that do not. Link: https://lkml.kernel.org/r/cover.1750433500.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/c5094bfccb41ecd19d4e9bcaa1c4a11e00158bba.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Zi Yan Reviewed-by: SeongJae Park Reviewed-by: Barry Song Acked-by: David Hildenbrand Cc: Baolin Wang Cc: Dev Jain Cc: Jann Horn Cc: Lance Yang Cc: Liam Howlett Cc: Mariano Pache Cc: Ryan Roberts Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/madvise.c | 166 +++++++++++++++++++++++++-------------------------- 1 file changed, 83 insertions(+), 83 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index 070132f9842b..93837b980cc2 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -37,6 +37,8 @@ #include "internal.h" #include "swap.h" +#define __MADV_SET_ANON_VMA_NAME (-1) + /* * Maximum number of attempts we make to install guard pages before we give up * and return -ERESTARTNOINTR to have userspace try again. 
@@ -59,9 +61,13 @@ struct madvise_behavior { int behavior; struct mmu_gather *tlb; enum madvise_lock_mode lock_mode; + struct anon_vma_name *anon_name; }; #ifdef CONFIG_ANON_VMA_NAME +static int madvise_walk_vmas(struct mm_struct *mm, unsigned long start, + unsigned long end, struct madvise_behavior *madv_behavior); + struct anon_vma_name *anon_vma_name_alloc(const char *name) { struct anon_vma_name *anon_name; @@ -112,6 +118,35 @@ static int replace_anon_vma_name(struct vm_area_struct *vma, return 0; } + +int madvise_set_anon_name(struct mm_struct *mm, unsigned long start, + unsigned long len_in, struct anon_vma_name *anon_name) +{ + unsigned long end; + unsigned long len; + struct madvise_behavior madv_behavior = { + .behavior = __MADV_SET_ANON_VMA_NAME, + .lock_mode = MADVISE_MMAP_WRITE_LOCK, + .anon_name = anon_name, + }; + + if (start & ~PAGE_MASK) + return -EINVAL; + len = (len_in + ~PAGE_MASK) & PAGE_MASK; + + /* Check to see whether len was rounded up from small -ve to zero */ + if (len_in && !len) + return -EINVAL; + + end = start + len; + if (end < start) + return -EINVAL; + + if (end == start) + return 0; + + return madvise_walk_vmas(mm, start, end, &madv_behavior); +} #else /* CONFIG_ANON_VMA_NAME */ static int replace_anon_vma_name(struct vm_area_struct *vma, struct anon_vma_name *anon_name) @@ -1252,13 +1287,13 @@ static long madvise_guard_remove(struct vm_area_struct *vma, static int madvise_vma_behavior(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end, - void *behavior_arg) + struct madvise_behavior *madv_behavior) { - struct madvise_behavior *arg = behavior_arg; - int behavior = arg->behavior; - int error; - struct anon_vma_name *anon_name; + int behavior = madv_behavior->behavior; + struct anon_vma_name *anon_name = madv_behavior->anon_name; vm_flags_t new_flags = vma->vm_flags; + bool set_new_anon_name = behavior == __MADV_SET_ANON_VMA_NAME; + int error; if (unlikely(!can_modify_vma_madv(vma, behavior))) return -EPERM; @@ -1275,7 +1310,17 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, case MADV_FREE: case MADV_DONTNEED: case MADV_DONTNEED_LOCKED: - return madvise_dontneed_free(vma, prev, start, end, arg); + return madvise_dontneed_free(vma, prev, start, end, + madv_behavior); + case MADV_COLLAPSE: + return madvise_collapse(vma, prev, start, end); + case MADV_GUARD_INSTALL: + return madvise_guard_install(vma, prev, start, end); + case MADV_GUARD_REMOVE: + return madvise_guard_remove(vma, prev, start, end); + + /* The below behaviours update VMAs via madvise_update_vma(). */ + case MADV_NORMAL: new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ; break; @@ -1325,21 +1370,29 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, if (error) goto out; break; - case MADV_COLLAPSE: - return madvise_collapse(vma, prev, start, end); - case MADV_GUARD_INSTALL: - return madvise_guard_install(vma, prev, start, end); - case MADV_GUARD_REMOVE: - return madvise_guard_remove(vma, prev, start, end); + case __MADV_SET_ANON_VMA_NAME: + /* Only anonymous mappings can be named */ + if (vma->vm_file && !vma_is_anon_shmem(vma)) + return -EBADF; + break; } /* We cannot provide prev in this lock mode. 
*/ - VM_WARN_ON_ONCE(arg->lock_mode == MADVISE_VMA_READ_LOCK); - anon_name = anon_vma_name(vma); - anon_vma_name_get(anon_name); + VM_WARN_ON_ONCE(madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK); + + /* + * madvise_update_vma() might cause a VMA merge which could put an + * anon_vma_name, so we must hold an additional reference on the + * anon_vma_name so it doesn't disappear from under us. + */ + if (!set_new_anon_name) { + anon_name = anon_vma_name(vma); + anon_vma_name_get(anon_name); + } error = madvise_update_vma(vma, prev, start, end, new_flags, anon_name); - anon_vma_name_put(anon_name); + if (!set_new_anon_name) + anon_vma_name_put(anon_name); out: /* @@ -1523,20 +1576,17 @@ static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm, } /* - * Walk the vmas in range [start,end), and call the visit function on each one. - * The visit function will get start and end parameters that cover the overlap - * between the current vma and the original range. Any unmapped regions in the - * original range will result in this function returning -ENOMEM while still - * calling the visit function on all of the existing vmas in the range. - * Must be called with the mmap_lock held for reading or writing. + * Walk the vmas in range [start,end), and call the madvise_vma_behavior + * function on each one. The function will get start and end parameters that + * cover the overlap between the current vma and the original range. Any + * unmapped regions in the original range will result in this function returning + * -ENOMEM while still calling the madvise_vma_behavior function on all of the + * existing vmas in the range. Must be called with the mmap_lock held for + * reading or writing. */ static int madvise_walk_vmas(struct mm_struct *mm, unsigned long start, - unsigned long end, struct madvise_behavior *madv_behavior, - void *arg, - int (*visit)(struct vm_area_struct *vma, - struct vm_area_struct **prev, unsigned long start, - unsigned long end, void *arg)) + unsigned long end, struct madvise_behavior *madv_behavior) { struct vm_area_struct *vma; struct vm_area_struct *prev; @@ -1548,11 +1598,12 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start, * If VMA read lock is supported, apply madvise to a single VMA * tentatively, avoiding walking VMAs. */ - if (madv_behavior && madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) { + if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) { vma = try_vma_read_lock(mm, madv_behavior, start, end); if (vma) { prev = vma; - error = visit(vma, &prev, start, end, arg); + error = madvise_vma_behavior(vma, &prev, start, end, + madv_behavior); vma_end_read(vma); return error; } @@ -1586,7 +1637,8 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start, tmp = end; /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). 
*/ - error = visit(vma, &prev, start, tmp, arg); + error = madvise_vma_behavior(vma, &prev, start, tmp, + madv_behavior); if (error) return error; start = tmp; @@ -1603,57 +1655,6 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start, return unmapped_error; } -#ifdef CONFIG_ANON_VMA_NAME -static int madvise_vma_anon_name(struct vm_area_struct *vma, - struct vm_area_struct **prev, - unsigned long start, unsigned long end, - void *anon_name) -{ - int error; - - /* Only anonymous mappings can be named */ - if (vma->vm_file && !vma_is_anon_shmem(vma)) - return -EBADF; - - error = madvise_update_vma(vma, prev, start, end, vma->vm_flags, - anon_name); - - /* - * madvise() returns EAGAIN if kernel resources, such as - * slab, are temporarily unavailable. - */ - if (error == -ENOMEM) - error = -EAGAIN; - return error; -} - -int madvise_set_anon_name(struct mm_struct *mm, unsigned long start, - unsigned long len_in, struct anon_vma_name *anon_name) -{ - unsigned long end; - unsigned long len; - - if (start & ~PAGE_MASK) - return -EINVAL; - len = (len_in + ~PAGE_MASK) & PAGE_MASK; - - /* Check to see whether len was rounded up from small -ve to zero */ - if (len_in && !len) - return -EINVAL; - - end = start + len; - if (end < start) - return -EINVAL; - - if (end == start) - return 0; - - return madvise_walk_vmas(mm, start, end, NULL, anon_name, - madvise_vma_anon_name); -} -#endif /* CONFIG_ANON_VMA_NAME */ - - /* * Any behaviour which results in changes to the vma->vm_flags needs to * take mmap_lock for writing. Others, which simply traverse vmas, need @@ -1845,8 +1846,7 @@ static int madvise_do_behavior(struct mm_struct *mm, if (is_madvise_populate(behavior)) error = madvise_populate(mm, start, end, behavior); else - error = madvise_walk_vmas(mm, start, end, madv_behavior, - madv_behavior, madvise_vma_behavior); + error = madvise_walk_vmas(mm, start, end, madv_behavior); blk_finish_plug(&plug); return error; } From 20d3aea927808b0283dff8d3a59a722801b92664 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 20 Jun 2025 16:33:02 +0100 Subject: [PATCH 125/370] mm/madvise: thread mm_struct through madvise_behavior There's no need to thread a pointer to the mm_struct nor have different functions signatures for each behaviour, instead store state in the struct madvise_behavior object consistently and use it for all madvise() actions. 
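The shape of the change, abridged from the diff below:

        struct madvise_behavior {
                struct mm_struct *mm;   /* new: the mm rides in the state object */
                int behavior;
                struct mmu_gather *tlb;
                enum madvise_lock_mode lock_mode;
                struct anon_vma_name *anon_name;
        };

        /* Callees shrink to the range plus the one state object, e.g.: */
        static long madvise_populate(unsigned long start, unsigned long end,
                        struct madvise_behavior *madv_behavior)
        {
                struct mm_struct *mm = madv_behavior->mm;
                /* ... */
        }
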
Link: https://lkml.kernel.org/r/a47d850b0111735e026d438c3300c0e3b7f439f4.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Reviewed-by: Zi Yan Reviewed-by: SeongJae Park Reviewed-by: Barry Song Cc: Baolin Wang Cc: David Hildenbrand Cc: Dev Jain Cc: Jann Horn Cc: Lance Yang Cc: Liam Howlett Cc: Mariano Pache Cc: Ryan Roberts Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- mm/madvise.c | 110 ++++++++++++++++++++++++++------------------------- 1 file changed, 57 insertions(+), 53 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index 93837b980cc2..f1109c2c63a4 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -58,6 +58,7 @@ enum madvise_lock_mode { }; struct madvise_behavior { + struct mm_struct *mm; int behavior; struct mmu_gather *tlb; enum madvise_lock_mode lock_mode; @@ -65,8 +66,8 @@ struct madvise_behavior { }; #ifdef CONFIG_ANON_VMA_NAME -static int madvise_walk_vmas(struct mm_struct *mm, unsigned long start, - unsigned long end, struct madvise_behavior *madv_behavior); +static int madvise_walk_vmas(unsigned long start, unsigned long end, + struct madvise_behavior *madv_behavior); struct anon_vma_name *anon_vma_name_alloc(const char *name) { @@ -125,6 +126,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start, unsigned long end; unsigned long len; struct madvise_behavior madv_behavior = { + .mm = mm, .behavior = __MADV_SET_ANON_VMA_NAME, .lock_mode = MADVISE_MMAP_WRITE_LOCK, .anon_name = anon_name, @@ -145,7 +147,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start, if (end == start) return 0; - return madvise_walk_vmas(mm, start, end, &madv_behavior); + return madvise_walk_vmas(start, end, &madv_behavior); } #else /* CONFIG_ANON_VMA_NAME */ static int replace_anon_vma_name(struct vm_area_struct *vma, @@ -991,10 +993,11 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, return -EINVAL; } -static long madvise_populate(struct mm_struct *mm, unsigned long start, - unsigned long end, int behavior) +static long madvise_populate(unsigned long start, unsigned long end, + struct madvise_behavior *madv_behavior) { - const bool write = behavior == MADV_POPULATE_WRITE; + struct mm_struct *mm = madv_behavior->mm; + const bool write = madv_behavior->behavior == MADV_POPULATE_WRITE; int locked = 1; long pages; @@ -1408,15 +1411,14 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, /* * Error injection support for memory error handling. 
*/ -static int madvise_inject_error(int behavior, - unsigned long start, unsigned long end) +static int madvise_inject_error(unsigned long start, unsigned long end, + struct madvise_behavior *madv_behavior) { unsigned long size; if (!capable(CAP_SYS_ADMIN)) return -EPERM; - for (; start < end; start += size) { unsigned long pfn; struct page *page; @@ -1434,7 +1436,7 @@ static int madvise_inject_error(int behavior, */ size = page_size(compound_head(page)); - if (behavior == MADV_SOFT_OFFLINE) { + if (madv_behavior->behavior == MADV_SOFT_OFFLINE) { pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n", pfn, start); ret = soft_offline_page(pfn, MF_COUNT_INCREASED); @@ -1453,9 +1455,9 @@ static int madvise_inject_error(int behavior, return 0; } -static bool is_memory_failure(int behavior) +static bool is_memory_failure(struct madvise_behavior *madv_behavior) { - switch (behavior) { + switch (madv_behavior->behavior) { case MADV_HWPOISON: case MADV_SOFT_OFFLINE: return true; @@ -1466,13 +1468,13 @@ static bool is_memory_failure(int behavior) #else -static int madvise_inject_error(int behavior, - unsigned long start, unsigned long end) +static int madvise_inject_error(unsigned long start, unsigned long end, + struct madvise_behavior *madv_behavior) { return 0; } -static bool is_memory_failure(int behavior) +static bool is_memory_failure(struct madvise_behavior *madv_behavior) { return false; } @@ -1549,10 +1551,11 @@ static bool process_madvise_remote_valid(int behavior) * If a VMA read lock could not be acquired, we return NULL and expect caller to * fallback to mmap lock behaviour. */ -static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm, +static struct vm_area_struct *try_vma_read_lock( struct madvise_behavior *madv_behavior, unsigned long start, unsigned long end) { + struct mm_struct *mm = madv_behavior->mm; struct vm_area_struct *vma; vma = lock_vma_under_rcu(mm, start); @@ -1585,9 +1588,10 @@ static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm, * reading or writing. */ static -int madvise_walk_vmas(struct mm_struct *mm, unsigned long start, - unsigned long end, struct madvise_behavior *madv_behavior) +int madvise_walk_vmas(unsigned long start, unsigned long end, + struct madvise_behavior *madv_behavior) { + struct mm_struct *mm = madv_behavior->mm; struct vm_area_struct *vma; struct vm_area_struct *prev; unsigned long tmp; @@ -1599,7 +1603,7 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start, * tentatively, avoiding walking VMAs. 
*/ if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) { - vma = try_vma_read_lock(mm, madv_behavior, start, end); + vma = try_vma_read_lock(madv_behavior, start, end); if (vma) { prev = vma; error = madvise_vma_behavior(vma, &prev, start, end, @@ -1662,12 +1666,10 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start, */ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavior) { - int behavior = madv_behavior->behavior; - - if (is_memory_failure(behavior)) + if (is_memory_failure(madv_behavior)) return MADVISE_NO_LOCK; - switch (behavior) { + switch (madv_behavior->behavior) { case MADV_REMOVE: case MADV_WILLNEED: case MADV_COLD: @@ -1687,9 +1689,9 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi } } -static int madvise_lock(struct mm_struct *mm, - struct madvise_behavior *madv_behavior) +static int madvise_lock(struct madvise_behavior *madv_behavior) { + struct mm_struct *mm = madv_behavior->mm; enum madvise_lock_mode lock_mode = get_lock_mode(madv_behavior); switch (lock_mode) { @@ -1711,9 +1713,10 @@ static int madvise_lock(struct mm_struct *mm, return 0; } -static void madvise_unlock(struct mm_struct *mm, - struct madvise_behavior *madv_behavior) +static void madvise_unlock(struct madvise_behavior *madv_behavior) { + struct mm_struct *mm = madv_behavior->mm; + switch (madv_behavior->lock_mode) { case MADVISE_NO_LOCK: return; @@ -1743,11 +1746,10 @@ static bool madvise_batch_tlb_flush(int behavior) } } -static void madvise_init_tlb(struct madvise_behavior *madv_behavior, - struct mm_struct *mm) +static void madvise_init_tlb(struct madvise_behavior *madv_behavior) { if (madvise_batch_tlb_flush(madv_behavior->behavior)) - tlb_gather_mmu(madv_behavior->tlb, mm); + tlb_gather_mmu(madv_behavior->tlb, madv_behavior->mm); } static void madvise_finish_tlb(struct madvise_behavior *madv_behavior) @@ -1802,9 +1804,9 @@ static bool madvise_should_skip(unsigned long start, size_t len_in, return false; } -static bool is_madvise_populate(int behavior) +static bool is_madvise_populate(struct madvise_behavior *madv_behavior) { - switch (behavior) { + switch (madv_behavior->behavior) { case MADV_POPULATE_READ: case MADV_POPULATE_WRITE: return true; @@ -1828,25 +1830,26 @@ static inline unsigned long get_untagged_addr(struct mm_struct *mm, untagged_addr_remote(mm, start); } -static int madvise_do_behavior(struct mm_struct *mm, - unsigned long start, size_t len_in, +static int madvise_do_behavior(unsigned long start, size_t len_in, struct madvise_behavior *madv_behavior) { - int behavior = madv_behavior->behavior; struct blk_plug plug; unsigned long end; int error; - if (is_memory_failure(behavior)) - return madvise_inject_error(behavior, start, start + len_in); - start = get_untagged_addr(mm, start); + if (is_memory_failure(madv_behavior)) { + end = start + len_in; + return madvise_inject_error(start, end, madv_behavior); + } + + start = get_untagged_addr(madv_behavior->mm, start); end = start + PAGE_ALIGN(len_in); blk_start_plug(&plug); - if (is_madvise_populate(behavior)) - error = madvise_populate(mm, start, end, behavior); + if (is_madvise_populate(madv_behavior)) + error = madvise_populate(start, end, madv_behavior); else - error = madvise_walk_vmas(mm, start, end, madv_behavior); + error = madvise_walk_vmas(start, end, madv_behavior); blk_finish_plug(&plug); return error; } @@ -1928,19 +1931,20 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh int error; struct mmu_gather tlb; struct 
madvise_behavior madv_behavior = { + .mm = mm, .behavior = behavior, .tlb = &tlb, }; if (madvise_should_skip(start, len_in, behavior, &error)) return error; - error = madvise_lock(mm, &madv_behavior); + error = madvise_lock(&madv_behavior); if (error) return error; - madvise_init_tlb(&madv_behavior, mm); - error = madvise_do_behavior(mm, start, len_in, &madv_behavior); + madvise_init_tlb(&madv_behavior); + error = madvise_do_behavior(start, len_in, &madv_behavior); madvise_finish_tlb(&madv_behavior); - madvise_unlock(mm, &madv_behavior); + madvise_unlock(&madv_behavior); return error; } @@ -1958,16 +1962,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter, size_t total_len; struct mmu_gather tlb; struct madvise_behavior madv_behavior = { + .mm = mm, .behavior = behavior, .tlb = &tlb, }; total_len = iov_iter_count(iter); - ret = madvise_lock(mm, &madv_behavior); + ret = madvise_lock(&madv_behavior); if (ret) return ret; - madvise_init_tlb(&madv_behavior, mm); + madvise_init_tlb(&madv_behavior); while (iov_iter_count(iter)) { unsigned long start = (unsigned long)iter_iov_addr(iter); @@ -1977,8 +1982,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter, if (madvise_should_skip(start, len_in, behavior, &error)) ret = error; else - ret = madvise_do_behavior(mm, start, len_in, - &madv_behavior); + ret = madvise_do_behavior(start, len_in, &madv_behavior); /* * An madvise operation is attempting to restart the syscall, * but we cannot proceed as it would not be correct to repeat @@ -1997,11 +2001,11 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter, /* Drop and reacquire lock to unwind race. */ madvise_finish_tlb(&madv_behavior); - madvise_unlock(mm, &madv_behavior); - ret = madvise_lock(mm, &madv_behavior); + madvise_unlock(&madv_behavior); + ret = madvise_lock(&madv_behavior); if (ret) goto out; - madvise_init_tlb(&madv_behavior, mm); + madvise_init_tlb(&madv_behavior); continue; } if (ret < 0) @@ -2009,7 +2013,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter, iov_iter_advance(iter, iter_iov_len(iter)); } madvise_finish_tlb(&madv_behavior); - madvise_unlock(mm, &madv_behavior); + madvise_unlock(&madv_behavior); out: ret = (total_len - iov_iter_count(iter)) ? : ret; From c0f611507a7aa4d8a401ec6ce8bf4e8abc0a1515 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 20 Jun 2025 16:33:03 +0100 Subject: [PATCH 126/370] mm/madvise: thread VMA range state through madvise_behavior Rather than updating start and a confusing local parameter 'tmp' in madvise_walk_vmas(), instead store the current range being operated upon in the struct madvise_behavior helper object in a range pair and use this consistently in all operations. This makes it clearer what is going on and opens the door to further cleanup now we store state regarding what is currently being operated upon here. 
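A minimal standalone sketch of the same idea (invented names, not the kernel code): the caller records the whole request once, and the walk clamps ctx.range to each region in turn rather than tracking a separate 'tmp' end variable:

  #include <stdio.h>

  struct range { unsigned long start, end; };   /* stand-in for madvise_behavior_range */

  struct ctx {
          struct range range;                   /* the range currently operated on */
  };

  static int apply_one(struct ctx *ctx)
  {
          printf("apply [%#lx,%#lx)\n", ctx->range.start, ctx->range.end);
          return 0;
  }

  int main(void)
  {
          /* Three fake "VMAs" overlapping parts of the request. */
          struct range vmas[] = { {0x1000, 0x3000}, {0x3000, 0x5000}, {0x6000, 0x9000} };
          struct ctx ctx = { .range = { 0x2000, 0x8000 } };
          unsigned long last_end = ctx.range.end;   /* end of the whole request */
          int err = 0;

          for (unsigned i = 0; i < 3 && !err; i++) {
                  if (ctx.range.start < vmas[i].start)
                          ctx.range.start = vmas[i].start;    /* skip a gap */
                  ctx.range.end = vmas[i].end < last_end ? vmas[i].end : last_end;
                  err = apply_one(&ctx);                      /* sees the clamped range */
                  ctx.range.start = ctx.range.end;            /* advance */
                  if (ctx.range.start >= last_end)
                          break;
          }
          return err;
  }

As in madvise_walk_vmas() after this patch, last_end remembers the end of the entire request while the stored range is narrowed to the overlap with the current VMA before each callback.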
Link: https://lkml.kernel.org/r/518480ceb48553d3c280bc2b0b5e77bbad840147.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Reviewed-by: Zi Yan Reviewed-by: SeongJae Park Reviewed-by: Barry Song Cc: Baolin Wang Cc: David Hildenbrand Cc: Dev Jain Cc: Jann Horn Cc: Lance Yang Cc: Liam Howlett Cc: Mariano Pache Cc: Ryan Roberts Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- mm/madvise.c | 103 ++++++++++++++++++++++++++++----------------------- 1 file changed, 57 insertions(+), 46 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index f1109c2c63a4..764a532ff62a 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -57,17 +57,27 @@ enum madvise_lock_mode { MADVISE_VMA_READ_LOCK, }; +struct madvise_behavior_range { + unsigned long start; + unsigned long end; +}; + struct madvise_behavior { struct mm_struct *mm; int behavior; struct mmu_gather *tlb; enum madvise_lock_mode lock_mode; struct anon_vma_name *anon_name; + + /* + * The range over which the behaviour is currently being applied. If + * traversing multiple VMAs, this is updated for each. + */ + struct madvise_behavior_range range; }; #ifdef CONFIG_ANON_VMA_NAME -static int madvise_walk_vmas(unsigned long start, unsigned long end, - struct madvise_behavior *madv_behavior); +static int madvise_walk_vmas(struct madvise_behavior *madv_behavior); struct anon_vma_name *anon_vma_name_alloc(const char *name) { @@ -147,7 +157,9 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start, if (end == start) return 0; - return madvise_walk_vmas(start, end, &madv_behavior); + madv_behavior.range.start = start; + madv_behavior.range.end = end; + return madvise_walk_vmas(&madv_behavior); } #else /* CONFIG_ANON_VMA_NAME */ static int replace_anon_vma_name(struct vm_area_struct *vma, @@ -993,12 +1005,13 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, return -EINVAL; } -static long madvise_populate(unsigned long start, unsigned long end, - struct madvise_behavior *madv_behavior) +static long madvise_populate(struct madvise_behavior *madv_behavior) { struct mm_struct *mm = madv_behavior->mm; const bool write = madv_behavior->behavior == MADV_POPULATE_WRITE; int locked = 1; + unsigned long start = madv_behavior->range.start; + unsigned long end = madv_behavior->range.end; long pages; while (start < end) { @@ -1289,12 +1302,13 @@ static long madvise_guard_remove(struct vm_area_struct *vma, */ static int madvise_vma_behavior(struct vm_area_struct *vma, struct vm_area_struct **prev, - unsigned long start, unsigned long end, struct madvise_behavior *madv_behavior) { int behavior = madv_behavior->behavior; struct anon_vma_name *anon_name = madv_behavior->anon_name; vm_flags_t new_flags = vma->vm_flags; + unsigned long start = madv_behavior->range.start; + unsigned long end = madv_behavior->range.end; bool set_new_anon_name = behavior == __MADV_SET_ANON_VMA_NAME; int error; @@ -1411,10 +1425,11 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, /* * Error injection support for memory error handling. 
*/ -static int madvise_inject_error(unsigned long start, unsigned long end, - struct madvise_behavior *madv_behavior) +static int madvise_inject_error(struct madvise_behavior *madv_behavior) { unsigned long size; + unsigned long start = madv_behavior->range.start; + unsigned long end = madv_behavior->range.end; if (!capable(CAP_SYS_ADMIN)) return -EPERM; @@ -1468,8 +1483,7 @@ static bool is_memory_failure(struct madvise_behavior *madv_behavior) #else -static int madvise_inject_error(unsigned long start, unsigned long end, - struct madvise_behavior *madv_behavior) +static int madvise_inject_error(struct madvise_behavior *madv_behavior) { return 0; } @@ -1551,21 +1565,20 @@ static bool process_madvise_remote_valid(int behavior) * If a VMA read lock could not be acquired, we return NULL and expect caller to * fallback to mmap lock behaviour. */ -static struct vm_area_struct *try_vma_read_lock( - struct madvise_behavior *madv_behavior, - unsigned long start, unsigned long end) +static +struct vm_area_struct *try_vma_read_lock(struct madvise_behavior *madv_behavior) { struct mm_struct *mm = madv_behavior->mm; struct vm_area_struct *vma; - vma = lock_vma_under_rcu(mm, start); + vma = lock_vma_under_rcu(mm, madv_behavior->range.start); if (!vma) goto take_mmap_read_lock; /* * Must span only a single VMA; uffd and remote processes are * unsupported. */ - if (end > vma->vm_end || current->mm != mm || + if (madv_behavior->range.end > vma->vm_end || current->mm != mm || userfaultfd_armed(vma)) { vma_end_read(vma); goto take_mmap_read_lock; @@ -1588,13 +1601,14 @@ static struct vm_area_struct *try_vma_read_lock( * reading or writing. */ static -int madvise_walk_vmas(unsigned long start, unsigned long end, - struct madvise_behavior *madv_behavior) +int madvise_walk_vmas(struct madvise_behavior *madv_behavior) { struct mm_struct *mm = madv_behavior->mm; + struct madvise_behavior_range *range = &madv_behavior->range; + /* range is updated to span each VMA, so store end of entire range. */ + unsigned long last_end = range->end; struct vm_area_struct *vma; struct vm_area_struct *prev; - unsigned long tmp; int unmapped_error = 0; int error; @@ -1603,11 +1617,10 @@ int madvise_walk_vmas(unsigned long start, unsigned long end, * tentatively, avoiding walking VMAs. */ if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) { - vma = try_vma_read_lock(madv_behavior, start, end); + vma = try_vma_read_lock(madv_behavior); if (vma) { prev = vma; - error = madvise_vma_behavior(vma, &prev, start, end, - madv_behavior); + error = madvise_vma_behavior(vma, &prev, madv_behavior); vma_end_read(vma); return error; } @@ -1618,8 +1631,8 @@ int madvise_walk_vmas(unsigned long start, unsigned long end, * ranges, just ignore them, but return -ENOMEM at the end. * - different from the way of handling in mlock etc. */ - vma = find_vma_prev(mm, start, &prev); - if (vma && start > vma->vm_start) + vma = find_vma_prev(mm, range->start, &prev); + if (vma && range->start > vma->vm_start) prev = vma; for (;;) { @@ -1627,33 +1640,30 @@ int madvise_walk_vmas(unsigned long start, unsigned long end, if (!vma) return -ENOMEM; - /* Here start < (end|vma->vm_end). */ - if (start < vma->vm_start) { + /* Here start < (last_end|vma->vm_end). 
*/ + if (range->start < vma->vm_start) { unmapped_error = -ENOMEM; - start = vma->vm_start; - if (start >= end) + range->start = vma->vm_start; + if (range->start >= last_end) break; } - /* Here vma->vm_start <= start < (end|vma->vm_end) */ - tmp = vma->vm_end; - if (end < tmp) - tmp = end; + /* Here vma->vm_start <= range->start < (last_end|vma->vm_end) */ + range->end = min(vma->vm_end, last_end); - /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */ - error = madvise_vma_behavior(vma, &prev, start, tmp, - madv_behavior); + /* Here vma->vm_start <= range->start < range->end <= (last_end|vma->vm_end). */ + error = madvise_vma_behavior(vma, &prev, madv_behavior); if (error) return error; - start = tmp; - if (prev && start < prev->vm_end) - start = prev->vm_end; - if (start >= end) + range->start = range->end; + if (prev && range->start < prev->vm_end) + range->start = prev->vm_end; + if (range->start >= last_end) break; if (prev) vma = find_vma(mm, prev->vm_end); else /* madvise_remove dropped mmap_lock */ - vma = find_vma(mm, start); + vma = find_vma(mm, range->start); } return unmapped_error; @@ -1834,22 +1844,23 @@ static int madvise_do_behavior(unsigned long start, size_t len_in, struct madvise_behavior *madv_behavior) { struct blk_plug plug; - unsigned long end; int error; + struct madvise_behavior_range *range = &madv_behavior->range; if (is_memory_failure(madv_behavior)) { - end = start + len_in; - return madvise_inject_error(start, end, madv_behavior); + range->start = start; + range->end = start + len_in; + return madvise_inject_error(madv_behavior); } - start = get_untagged_addr(madv_behavior->mm, start); - end = start + PAGE_ALIGN(len_in); + range->start = get_untagged_addr(madv_behavior->mm, start); + range->end = range->start + PAGE_ALIGN(len_in); blk_start_plug(&plug); if (is_madvise_populate(madv_behavior)) - error = madvise_populate(start, end, madv_behavior); + error = madvise_populate(madv_behavior); else - error = madvise_walk_vmas(start, end, madv_behavior); + error = madvise_walk_vmas(madv_behavior); blk_finish_plug(&plug); return error; } From 946fc11af061c3cd08943a96f6ea8c7e6fec95e1 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 20 Jun 2025 16:33:04 +0100 Subject: [PATCH 127/370] mm/madvise: thread all madvise state through madv_behavior Doing so means we can get rid of all the weird struct vm_area_struct **prev stuff, everything becomes consistent and in future if we want to make change to behaviour there's a single place where all relevant state is stored. This also allows us to update try_vma_read_lock() to be a little more succinct and set up state for us, as well as cleaning up madvise_update_vma(). We also update the debug assertion prior to madvise_update_vma() to assert that this is a write operation as correctly pointed out by Barry in the relevant thread. We can't reasonably update the madvise functions that live outside of mm/madvise.c so we leave those as-is. 
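To show the shape of this in isolation (simplified userspace C; vma_stub, try_lock_single() and apply() are invented for the sketch), the targeted region is recorded in the shared state object and the try-lock helper reports success with a bool rather than handing back a pointer for the caller to stash:

  #include <stdbool.h>
  #include <stdio.h>

  struct vma_stub { unsigned long start, end; };   /* stand-in for vm_area_struct */

  struct ctx {
          unsigned long start, end;    /* requested range */
          struct vma_stub *vma;        /* the region currently targeted */
  };

  /* Succeeds only if one region covers the whole request, and records that
   * region in the context so later steps need no extra arguments. */
  static bool try_lock_single(struct ctx *ctx, struct vma_stub *candidate)
  {
          if (ctx->start < candidate->start || ctx->end > candidate->end)
                  return false;        /* spans more than one region: fall back */
          ctx->vma = candidate;
          return true;
  }

  static int apply(struct ctx *ctx)
  {
          printf("apply to region [%#lx,%#lx)\n", ctx->vma->start, ctx->vma->end);
          return 0;
  }

  int main(void)
  {
          struct vma_stub vma = { 0x1000, 0x9000 };
          struct ctx ctx = { .start = 0x2000, .end = 0x3000 };

          if (try_lock_single(&ctx, &vma))
                  return apply(&ctx);
          return 1;
  }

madvise_vma_behavior() can then read the VMA and range straight from the behaviour object instead of taking them as parameters.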
Link: https://lkml.kernel.org/r/7b345ab82ef51e551f8bc0c4f7be25712871629d.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: Zi Yan Reviewed-by: Vlastimil Babka Reviewed-by: SeongJae Park Cc: Baolin Wang Cc: Barry Song Cc: David Hildenbrand Cc: Dev Jain Cc: Jann Horn Cc: Lance Yang Cc: Liam Howlett Cc: Mariano Pache Cc: Ryan Roberts Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- mm/madvise.c | 281 ++++++++++++++++++++++++++------------------------- 1 file changed, 145 insertions(+), 136 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index 764a532ff62a..f04b8165e2ab 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -74,6 +74,8 @@ struct madvise_behavior { * traversing multiple VMAs, this is updated for each. */ struct madvise_behavior_range range; + /* The VMA and VMA preceding it (if applicable) currently targeted. */ + struct vm_area_struct *prev, *vma; }; #ifdef CONFIG_ANON_VMA_NAME @@ -177,26 +179,27 @@ static int replace_anon_vma_name(struct vm_area_struct *vma, * Caller should ensure anon_name stability by raising its refcount even when * anon_name belongs to a valid vma because this function might free that vma. */ -static int madvise_update_vma(struct vm_area_struct *vma, - struct vm_area_struct **prev, unsigned long start, - unsigned long end, vm_flags_t new_flags, - struct anon_vma_name *anon_name) +static int madvise_update_vma(vm_flags_t new_flags, + struct madvise_behavior *madv_behavior) { - struct mm_struct *mm = vma->vm_mm; int error; - VMA_ITERATOR(vmi, mm, start); + struct vm_area_struct *vma = madv_behavior->vma; + struct madvise_behavior_range *range = &madv_behavior->range; + struct anon_vma_name *anon_name = madv_behavior->anon_name; + VMA_ITERATOR(vmi, madv_behavior->mm, range->start); if (new_flags == vma->vm_flags && anon_vma_name_eq(anon_vma_name(vma), anon_name)) { - *prev = vma; + madv_behavior->prev = vma; return 0; } - vma = vma_modify_flags_name(&vmi, *prev, vma, start, end, new_flags, - anon_name); + vma = vma_modify_flags_name(&vmi, madv_behavior->prev, vma, + range->start, range->end, new_flags, anon_name); if (IS_ERR(vma)) return PTR_ERR(vma); - *prev = vma; + madv_behavior->vma = vma; + madv_behavior->prev = vma; /* vm_flags is protected by the mmap_lock held in write mode. */ vma_start_write(vma); @@ -301,15 +304,16 @@ static void shmem_swapin_range(struct vm_area_struct *vma, /* * Schedule all required I/O operations. Do not wait for completion. */ -static long madvise_willneed(struct vm_area_struct *vma, - struct vm_area_struct **prev, - unsigned long start, unsigned long end) +static long madvise_willneed(struct madvise_behavior *madv_behavior) { - struct mm_struct *mm = vma->vm_mm; + struct vm_area_struct *vma = madv_behavior->vma; + struct mm_struct *mm = madv_behavior->mm; struct file *file = vma->vm_file; + unsigned long start = madv_behavior->range.start; + unsigned long end = madv_behavior->range.end; loff_t offset; - *prev = vma; + madv_behavior->prev = vma; #ifdef CONFIG_SWAP if (!file) { walk_page_range_vma(vma, start, end, &swapin_walk_ops, vma); @@ -338,7 +342,7 @@ static long madvise_willneed(struct vm_area_struct *vma, * vma's reference to the file) can go away as soon as we drop * mmap_lock. 
*/ - *prev = NULL; /* tell sys_madvise we drop mmap_lock */ + madv_behavior->prev = NULL; /* tell sys_madvise we drop mmap_lock */ get_file(file); offset = (loff_t)(start - vma->vm_start) + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); @@ -603,16 +607,19 @@ static const struct mm_walk_ops cold_walk_ops = { }; static void madvise_cold_page_range(struct mmu_gather *tlb, - struct vm_area_struct *vma, - unsigned long addr, unsigned long end) + struct madvise_behavior *madv_behavior) + { + struct vm_area_struct *vma = madv_behavior->vma; + struct madvise_behavior_range *range = &madv_behavior->range; struct madvise_walk_private walk_private = { .pageout = false, .tlb = tlb, }; tlb_start_vma(tlb, vma); - walk_page_range_vma(vma, addr, end, &cold_walk_ops, &walk_private); + walk_page_range_vma(vma, range->start, range->end, &cold_walk_ops, + &walk_private); tlb_end_vma(tlb, vma); } @@ -621,28 +628,26 @@ static inline bool can_madv_lru_vma(struct vm_area_struct *vma) return !(vma->vm_flags & (VM_LOCKED|VM_PFNMAP|VM_HUGETLB)); } -static long madvise_cold(struct vm_area_struct *vma, - struct vm_area_struct **prev, - unsigned long start_addr, unsigned long end_addr) +static long madvise_cold(struct madvise_behavior *madv_behavior) { - struct mm_struct *mm = vma->vm_mm; + struct vm_area_struct *vma = madv_behavior->vma; struct mmu_gather tlb; - *prev = vma; + madv_behavior->prev = vma; if (!can_madv_lru_vma(vma)) return -EINVAL; lru_add_drain(); - tlb_gather_mmu(&tlb, mm); - madvise_cold_page_range(&tlb, vma, start_addr, end_addr); + tlb_gather_mmu(&tlb, madv_behavior->mm); + madvise_cold_page_range(&tlb, madv_behavior); tlb_finish_mmu(&tlb); return 0; } static void madvise_pageout_page_range(struct mmu_gather *tlb, - struct vm_area_struct *vma, - unsigned long addr, unsigned long end) + struct vm_area_struct *vma, + struct madvise_behavior_range *range) { struct madvise_walk_private walk_private = { .pageout = true, @@ -650,18 +655,17 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb, }; tlb_start_vma(tlb, vma); - walk_page_range_vma(vma, addr, end, &cold_walk_ops, &walk_private); + walk_page_range_vma(vma, range->start, range->end, &cold_walk_ops, + &walk_private); tlb_end_vma(tlb, vma); } -static long madvise_pageout(struct vm_area_struct *vma, - struct vm_area_struct **prev, - unsigned long start_addr, unsigned long end_addr) +static long madvise_pageout(struct madvise_behavior *madv_behavior) { - struct mm_struct *mm = vma->vm_mm; struct mmu_gather tlb; + struct vm_area_struct *vma = madv_behavior->vma; - *prev = vma; + madv_behavior->prev = vma; if (!can_madv_lru_vma(vma)) return -EINVAL; @@ -676,8 +680,8 @@ static long madvise_pageout(struct vm_area_struct *vma, return 0; lru_add_drain(); - tlb_gather_mmu(&tlb, mm); - madvise_pageout_page_range(&tlb, vma, start_addr, end_addr); + tlb_gather_mmu(&tlb, madv_behavior->mm); + madvise_pageout_page_range(&tlb, vma, &madv_behavior->range); tlb_finish_mmu(&tlb); return 0; @@ -840,11 +844,12 @@ static inline enum page_walk_lock get_walk_lock(enum madvise_lock_mode mode) } } -static int madvise_free_single_vma(struct madvise_behavior *madv_behavior, - struct vm_area_struct *vma, - unsigned long start_addr, unsigned long end_addr) +static int madvise_free_single_vma(struct madvise_behavior *madv_behavior) { - struct mm_struct *mm = vma->vm_mm; + struct mm_struct *mm = madv_behavior->mm; + struct vm_area_struct *vma = madv_behavior->vma; + unsigned long start_addr = madv_behavior->range.start; + unsigned long end_addr = madv_behavior->range.end; 
struct mmu_notifier_range range; struct mmu_gather *tlb = madv_behavior->tlb; struct mm_walk_ops walk_ops = { @@ -896,25 +901,28 @@ static int madvise_free_single_vma(struct madvise_behavior *madv_behavior, * An interface that causes the system to free clean pages and flush * dirty pages is already available as msync(MS_INVALIDATE). */ -static long madvise_dontneed_single_vma(struct madvise_behavior *madv_behavior, - struct vm_area_struct *vma, - unsigned long start, unsigned long end) +static long madvise_dontneed_single_vma(struct madvise_behavior *madv_behavior) + { + struct madvise_behavior_range *range = &madv_behavior->range; struct zap_details details = { .reclaim_pt = true, .even_cows = true, }; zap_page_range_single_batched( - madv_behavior->tlb, vma, start, end - start, &details); + madv_behavior->tlb, madv_behavior->vma, range->start, + range->end - range->start, &details); return 0; } -static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma, - unsigned long start, - unsigned long *end, - int behavior) +static +bool madvise_dontneed_free_valid_vma(struct madvise_behavior *madv_behavior) { + struct vm_area_struct *vma = madv_behavior->vma; + int behavior = madv_behavior->behavior; + struct madvise_behavior_range *range = &madv_behavior->range; + if (!is_vm_hugetlb_page(vma)) { unsigned int forbidden = VM_PFNMAP; @@ -926,7 +934,7 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma, if (behavior != MADV_DONTNEED && behavior != MADV_DONTNEED_LOCKED) return false; - if (start & ~huge_page_mask(hstate_vma(vma))) + if (range->start & ~huge_page_mask(hstate_vma(vma))) return false; /* @@ -935,41 +943,40 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma, * Avoid unexpected data loss by rounding down the number of * huge pages freed. */ - *end = ALIGN_DOWN(*end, huge_page_size(hstate_vma(vma))); + range->end = ALIGN_DOWN(range->end, huge_page_size(hstate_vma(vma))); return true; } -static long madvise_dontneed_free(struct vm_area_struct *vma, - struct vm_area_struct **prev, - unsigned long start, unsigned long end, - struct madvise_behavior *madv_behavior) +static long madvise_dontneed_free(struct madvise_behavior *madv_behavior) { + struct mm_struct *mm = madv_behavior->mm; + struct madvise_behavior_range *range = &madv_behavior->range; int behavior = madv_behavior->behavior; - struct mm_struct *mm = vma->vm_mm; - *prev = vma; - if (!madvise_dontneed_free_valid_vma(vma, start, &end, behavior)) + madv_behavior->prev = madv_behavior->vma; + if (!madvise_dontneed_free_valid_vma(madv_behavior)) return -EINVAL; - if (start == end) + if (range->start == range->end) return 0; - if (!userfaultfd_remove(vma, start, end)) { - *prev = NULL; /* mmap_lock has been dropped, prev is stale */ + if (!userfaultfd_remove(madv_behavior->vma, range->start, range->end)) { + struct vm_area_struct *vma; + + madv_behavior->prev = NULL; /* mmap_lock has been dropped, prev is stale */ mmap_read_lock(mm); - vma = vma_lookup(mm, start); + madv_behavior->vma = vma = vma_lookup(mm, range->start); if (!vma) return -ENOMEM; /* * Potential end adjustment for hugetlb vma is OK as * the check below keeps end within vma. */ - if (!madvise_dontneed_free_valid_vma(vma, start, &end, - behavior)) + if (!madvise_dontneed_free_valid_vma(madv_behavior)) return -EINVAL; - if (end > vma->vm_end) { + if (range->end > vma->vm_end) { /* * Don't fail if end > vma->vm_end. 
If the old * vma was split while the mmap_lock was @@ -982,7 +989,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, * end-vma->vm_end range, but the manager can * handle a repetition fine. */ - end = vma->vm_end; + range->end = vma->vm_end; } /* * If the memory region between start and end was @@ -991,16 +998,15 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, * the adjustment for hugetlb vma above may have rounded * end down to the start address. */ - if (start == end) + if (range->start == range->end) return 0; - VM_WARN_ON(start > end); + VM_WARN_ON(range->start > range->end); } if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED) - return madvise_dontneed_single_vma( - madv_behavior, vma, start, end); + return madvise_dontneed_single_vma(madv_behavior); else if (behavior == MADV_FREE) - return madvise_free_single_vma(madv_behavior, vma, start, end); + return madvise_free_single_vma(madv_behavior); else return -EINVAL; } @@ -1048,16 +1054,17 @@ static long madvise_populate(struct madvise_behavior *madv_behavior) * Application wants to free up the pages and associated backing store. * This is effectively punching a hole into the middle of a file. */ -static long madvise_remove(struct vm_area_struct *vma, - struct vm_area_struct **prev, - unsigned long start, unsigned long end) +static long madvise_remove(struct madvise_behavior *madv_behavior) { loff_t offset; int error; struct file *f; - struct mm_struct *mm = vma->vm_mm; + struct mm_struct *mm = madv_behavior->mm; + struct vm_area_struct *vma = madv_behavior->vma; + unsigned long start = madv_behavior->range.start; + unsigned long end = madv_behavior->range.end; - *prev = NULL; /* tell sys_madvise we drop mmap_lock */ + madv_behavior->prev = NULL; /* tell sys_madvise we drop mmap_lock */ if (vma->vm_flags & VM_LOCKED) return -EINVAL; @@ -1169,14 +1176,14 @@ static const struct mm_walk_ops guard_install_walk_ops = { .walk_lock = PGWALK_RDLOCK, }; -static long madvise_guard_install(struct vm_area_struct *vma, - struct vm_area_struct **prev, - unsigned long start, unsigned long end) +static long madvise_guard_install(struct madvise_behavior *madv_behavior) { + struct vm_area_struct *vma = madv_behavior->vma; + struct madvise_behavior_range *range = &madv_behavior->range; long err; int i; - *prev = vma; + madv_behavior->prev = vma; if (!is_valid_guard_vma(vma, /* allow_locked = */false)) return -EINVAL; @@ -1207,13 +1214,14 @@ static long madvise_guard_install(struct vm_area_struct *vma, unsigned long nr_pages = 0; /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */ - err = walk_page_range_mm(vma->vm_mm, start, end, + err = walk_page_range_mm(vma->vm_mm, range->start, range->end, &guard_install_walk_ops, &nr_pages); if (err < 0) return err; if (err == 0) { - unsigned long nr_expected_pages = PHYS_PFN(end - start); + unsigned long nr_expected_pages = + PHYS_PFN(range->end - range->start); VM_WARN_ON(nr_pages != nr_expected_pages); return 0; @@ -1223,7 +1231,8 @@ static long madvise_guard_install(struct vm_area_struct *vma, * OK some of the range have non-guard pages mapped, zap * them. This leaves existing guard pages in place. 
*/ - zap_page_range_single(vma, start, end - start, NULL); + zap_page_range_single(vma, range->start, + range->end - range->start, NULL); } /* @@ -1279,11 +1288,12 @@ static const struct mm_walk_ops guard_remove_walk_ops = { .walk_lock = PGWALK_RDLOCK, }; -static long madvise_guard_remove(struct vm_area_struct *vma, - struct vm_area_struct **prev, - unsigned long start, unsigned long end) +static long madvise_guard_remove(struct madvise_behavior *madv_behavior) { - *prev = vma; + struct vm_area_struct *vma = madv_behavior->vma; + struct madvise_behavior_range *range = &madv_behavior->range; + + madv_behavior->prev = vma; /* * We're ok with removing guards in mlock()'d ranges, as this is a * non-destructive action. @@ -1291,7 +1301,7 @@ static long madvise_guard_remove(struct vm_area_struct *vma, if (!is_valid_guard_vma(vma, /* allow_locked = */true)) return -EINVAL; - return walk_page_range_vma(vma, start, end, + return walk_page_range_vma(vma, range->start, range->end, &guard_remove_walk_ops, NULL); } @@ -1300,41 +1310,38 @@ static long madvise_guard_remove(struct vm_area_struct *vma, * will handle splitting a vm area into separate areas, each area with its own * behavior. */ -static int madvise_vma_behavior(struct vm_area_struct *vma, - struct vm_area_struct **prev, - struct madvise_behavior *madv_behavior) +static int madvise_vma_behavior(struct madvise_behavior *madv_behavior) { int behavior = madv_behavior->behavior; - struct anon_vma_name *anon_name = madv_behavior->anon_name; + struct vm_area_struct *vma = madv_behavior->vma; vm_flags_t new_flags = vma->vm_flags; - unsigned long start = madv_behavior->range.start; - unsigned long end = madv_behavior->range.end; bool set_new_anon_name = behavior == __MADV_SET_ANON_VMA_NAME; + struct madvise_behavior_range *range = &madv_behavior->range; int error; - if (unlikely(!can_modify_vma_madv(vma, behavior))) + if (unlikely(!can_modify_vma_madv(madv_behavior->vma, behavior))) return -EPERM; switch (behavior) { case MADV_REMOVE: - return madvise_remove(vma, prev, start, end); + return madvise_remove(madv_behavior); case MADV_WILLNEED: - return madvise_willneed(vma, prev, start, end); + return madvise_willneed(madv_behavior); case MADV_COLD: - return madvise_cold(vma, prev, start, end); + return madvise_cold(madv_behavior); case MADV_PAGEOUT: - return madvise_pageout(vma, prev, start, end); + return madvise_pageout(madv_behavior); case MADV_FREE: case MADV_DONTNEED: case MADV_DONTNEED_LOCKED: - return madvise_dontneed_free(vma, prev, start, end, - madv_behavior); + return madvise_dontneed_free(madv_behavior); case MADV_COLLAPSE: - return madvise_collapse(vma, prev, start, end); + return madvise_collapse(vma, &madv_behavior->prev, + range->start, range->end); case MADV_GUARD_INSTALL: - return madvise_guard_install(vma, prev, start, end); + return madvise_guard_install(madv_behavior); case MADV_GUARD_REMOVE: - return madvise_guard_remove(vma, prev, start, end); + return madvise_guard_remove(madv_behavior); /* The below behaviours update VMAs via madvise_update_vma(). */ @@ -1351,18 +1358,18 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, new_flags |= VM_DONTCOPY; break; case MADV_DOFORK: - if (vma->vm_flags & VM_IO) + if (new_flags & VM_IO) return -EINVAL; new_flags &= ~VM_DONTCOPY; break; case MADV_WIPEONFORK: /* MADV_WIPEONFORK is only supported on anonymous memory. 
*/ - if (vma->vm_file || vma->vm_flags & VM_SHARED) + if (vma->vm_file || new_flags & VM_SHARED) return -EINVAL; new_flags |= VM_WIPEONFORK; break; case MADV_KEEPONFORK: - if (vma->vm_flags & VM_DROPPABLE) + if (new_flags & VM_DROPPABLE) return -EINVAL; new_flags &= ~VM_WIPEONFORK; break; @@ -1370,14 +1377,15 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, new_flags |= VM_DONTDUMP; break; case MADV_DODUMP: - if ((!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) || - (vma->vm_flags & VM_DROPPABLE)) + if ((!is_vm_hugetlb_page(vma) && (new_flags & VM_SPECIAL)) || + (new_flags & VM_DROPPABLE)) return -EINVAL; new_flags &= ~VM_DONTDUMP; break; case MADV_MERGEABLE: case MADV_UNMERGEABLE: - error = ksm_madvise(vma, start, end, behavior, &new_flags); + error = ksm_madvise(vma, range->start, range->end, + behavior, &new_flags); if (error) goto out; break; @@ -1394,8 +1402,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, break; } - /* We cannot provide prev in this lock mode. */ - VM_WARN_ON_ONCE(madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK); + /* This is a write operation.*/ + VM_WARN_ON_ONCE(madv_behavior->lock_mode != MADVISE_MMAP_WRITE_LOCK); /* * madvise_update_vma() might cause a VMA merge which could put an @@ -1403,14 +1411,12 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, * anon_vma_name so it doesn't disappear from under us. */ if (!set_new_anon_name) { - anon_name = anon_vma_name(vma); - anon_vma_name_get(anon_name); + madv_behavior->anon_name = anon_vma_name(vma); + anon_vma_name_get(madv_behavior->anon_name); } - error = madvise_update_vma(vma, prev, start, end, new_flags, - anon_name); + error = madvise_update_vma(new_flags, madv_behavior); if (!set_new_anon_name) - anon_vma_name_put(anon_name); - + anon_vma_name_put(madv_behavior->anon_name); out: /* * madvise() returns EAGAIN if kernel resources, such as @@ -1560,13 +1566,13 @@ static bool process_madvise_remote_valid(int behavior) * span either partially or fully. * * This function always returns with an appropriate lock held. If a VMA read - * lock could be acquired, we return the locked VMA. + * lock could be acquired, we return true and set madv_behavior state + * accordingly. * - * If a VMA read lock could not be acquired, we return NULL and expect caller to + * If a VMA read lock could not be acquired, we return false and expect caller to * fallback to mmap lock behaviour. */ -static -struct vm_area_struct *try_vma_read_lock(struct madvise_behavior *madv_behavior) +static bool try_vma_read_lock(struct madvise_behavior *madv_behavior) { struct mm_struct *mm = madv_behavior->mm; struct vm_area_struct *vma; @@ -1583,12 +1589,14 @@ struct vm_area_struct *try_vma_read_lock(struct madvise_behavior *madv_behavior) vma_end_read(vma); goto take_mmap_read_lock; } - return vma; + madv_behavior->prev = vma; /* Not currently required. */ + madv_behavior->vma = vma; + return true; take_mmap_read_lock: mmap_read_lock(mm); madv_behavior->lock_mode = MADVISE_MMAP_READ_LOCK; - return NULL; + return false; } /* @@ -1607,23 +1615,19 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior) struct madvise_behavior_range *range = &madv_behavior->range; /* range is updated to span each VMA, so store end of entire range. */ unsigned long last_end = range->end; - struct vm_area_struct *vma; - struct vm_area_struct *prev; int unmapped_error = 0; int error; + struct vm_area_struct *vma; /* * If VMA read lock is supported, apply madvise to a single VMA * tentatively, avoiding walking VMAs. 
*/ - if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) { - vma = try_vma_read_lock(madv_behavior); - if (vma) { - prev = vma; - error = madvise_vma_behavior(vma, &prev, madv_behavior); - vma_end_read(vma); - return error; - } + if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK && + try_vma_read_lock(madv_behavior)) { + error = madvise_vma_behavior(madv_behavior); + vma_end_read(madv_behavior->vma); + return error; } /* @@ -1631,11 +1635,13 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior) * ranges, just ignore them, but return -ENOMEM at the end. * - different from the way of handling in mlock etc. */ - vma = find_vma_prev(mm, range->start, &prev); + vma = find_vma_prev(mm, range->start, &madv_behavior->prev); if (vma && range->start > vma->vm_start) - prev = vma; + madv_behavior->prev = vma; for (;;) { + struct vm_area_struct *prev; + /* Still start < end. */ if (!vma) return -ENOMEM; @@ -1652,9 +1658,12 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior) range->end = min(vma->vm_end, last_end); /* Here vma->vm_start <= range->start < range->end <= (last_end|vma->vm_end). */ - error = madvise_vma_behavior(vma, &prev, madv_behavior); + madv_behavior->vma = vma; + error = madvise_vma_behavior(madv_behavior); if (error) return error; + prev = madv_behavior->prev; + range->start = range->end; if (prev && range->start < prev->vm_end) range->start = prev->vm_end; From e24d552a17e92714d4f62e112d536babd6428acb Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 20 Jun 2025 16:33:05 +0100 Subject: [PATCH 128/370] mm/madvise: eliminate very confusing manipulation of prev VMA The madvise code has for the longest time had very confusing code around the 'prev' VMA pointer passed around various functions which, in all cases except madvise_update_vma(), is unused and instead simply updated as soon as the function is invoked. To compound the confusion, the prev pointer is also used to indicate to the caller that the mmap lock has been dropped and that we can therefore not safely access the end of the current VMA (which might have been updated by madvise_update_vma()). Clear up this confusion by not setting prev = vma anywhere except in madvise_walk_vmas(), update all references to prev which will always be equal to vma after madvise_vma_behavior() is invoked, and adding a flag to indicate that the lock has been dropped to make this explicit. Additionally, drop a redundant BUG_ON() from madvise_collapse(), which is simply reiterating the BUG_ON(mmap_locked) above it (note that BUG_ON() is not appropriate here, but we leave existing code as-is). We finally adjust the madvise_walk_vmas() logic to be a little clearer - delaying the assignment of the end of the range to the start of the new range until the last moment and handling the lock being dropped scenario immediately. Additionally add some explanatory comments. 
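A small self-contained sketch of the explicit-flag approach (invented names; this is not the kernel code): instead of signalling "the lock was dropped" by nulling a prev pointer, the operation sets a dedicated flag that the walker checks after each call:

  #include <stdbool.h>
  #include <stdio.h>

  struct ctx {
          bool lock_dropped;   /* explicit flag instead of overloading a pointer */
  };

  /* Some operations must drop the (pretend) lock; they say so explicitly. */
  static int op_that_drops_lock(struct ctx *ctx)
  {
          ctx->lock_dropped = true;
          return 0;
  }

  static int op_plain(struct ctx *ctx)
  {
          (void)ctx;
          return 0;
  }

  int main(void)
  {
          struct ctx ctx = { .lock_dropped = false };
          int (*ops[])(struct ctx *) = { op_plain, op_that_drops_lock, op_plain };

          for (unsigned i = 0; i < 3; i++) {
                  int err = ops[i](&ctx);

                  if (err)
                          return err;
                  if (ctx.lock_dropped) {
                          /* Cached state is stale: re-validate instead of trusting it. */
                          printf("op %u dropped the lock, re-looking up state\n", i);
                          ctx.lock_dropped = false;
                  }
          }
          return 0;
  }

This keeps the meaning of the cached VMA pointer simple: if the flag is set, the walker discards it and looks the VMA up again rather than dereferencing stale state.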
[lorenzo.stoakes@oracle.com: fix very subtle bug] Link: https://lkml.kernel.org/r/dca94cde-8afb-4eab-8e57-3f508624d670@lucifer.local Link: https://lkml.kernel.org/r/63d281c5df2e64225ab5b4bda398b45e22818701.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Reviewed-by: Zi Yan Reviewed-by: SeongJae Park Cc: Baolin Wang Cc: Barry Song Cc: David Hildenbrand Cc: Dev Jain Cc: Jann Horn Cc: Lance Yang Cc: Liam Howlett Cc: Mariano Pache Cc: Ryan Roberts Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- include/linux/huge_mm.h | 9 +++-- mm/khugepaged.c | 9 ++--- mm/madvise.c | 77 +++++++++++++++++++++-------------------- 3 files changed, 47 insertions(+), 48 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 8f1b15213f61..4d5bb67dc4ec 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -433,9 +433,8 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, int hugepage_madvise(struct vm_area_struct *vma, vm_flags_t *vm_flags, int advice); -int madvise_collapse(struct vm_area_struct *vma, - struct vm_area_struct **prev, - unsigned long start, unsigned long end); +int madvise_collapse(struct vm_area_struct *vma, unsigned long start, + unsigned long end, bool *lock_dropped); void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start, unsigned long end, struct vm_area_struct *next); spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma); @@ -596,8 +595,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma, } static inline int madvise_collapse(struct vm_area_struct *vma, - struct vm_area_struct **prev, - unsigned long start, unsigned long end) + unsigned long start, + unsigned long end, bool *lock_dropped) { return -EINVAL; } diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 3495a20cef5e..1aa7ca67c756 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -2727,8 +2727,8 @@ static int madvise_collapse_errno(enum scan_result r) } } -int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, - unsigned long start, unsigned long end) +int madvise_collapse(struct vm_area_struct *vma, unsigned long start, + unsigned long end, bool *lock_dropped) { struct collapse_control *cc; struct mm_struct *mm = vma->vm_mm; @@ -2739,8 +2739,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, BUG_ON(vma->vm_start > start); BUG_ON(vma->vm_end < end); - *prev = vma; - if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER)) return -EINVAL; @@ -2788,7 +2786,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, &mmap_locked, cc); } if (!mmap_locked) - *prev = NULL; /* Tell caller we dropped mmap_lock */ + *lock_dropped = true; handle_result: switch (result) { @@ -2798,7 +2796,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, break; case SCAN_PTE_MAPPED_HUGEPAGE: BUG_ON(mmap_locked); - BUG_ON(*prev); mmap_read_lock(mm); result = collapse_pte_mapped_thp(mm, addr, true); mmap_read_unlock(mm); diff --git a/mm/madvise.c b/mm/madvise.c index f04b8165e2ab..c467ee42596f 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -75,7 +75,9 @@ struct madvise_behavior { */ struct madvise_behavior_range range; /* The VMA and VMA preceding it (if applicable) currently targeted. 
*/ - struct vm_area_struct *prev, *vma; + struct vm_area_struct *prev; + struct vm_area_struct *vma; + bool lock_dropped; }; #ifdef CONFIG_ANON_VMA_NAME @@ -188,10 +190,8 @@ static int madvise_update_vma(vm_flags_t new_flags, struct anon_vma_name *anon_name = madv_behavior->anon_name; VMA_ITERATOR(vmi, madv_behavior->mm, range->start); - if (new_flags == vma->vm_flags && anon_vma_name_eq(anon_vma_name(vma), anon_name)) { - madv_behavior->prev = vma; + if (new_flags == vma->vm_flags && anon_vma_name_eq(anon_vma_name(vma), anon_name)) return 0; - } vma = vma_modify_flags_name(&vmi, madv_behavior->prev, vma, range->start, range->end, new_flags, anon_name); @@ -199,7 +199,6 @@ static int madvise_update_vma(vm_flags_t new_flags, return PTR_ERR(vma); madv_behavior->vma = vma; - madv_behavior->prev = vma; /* vm_flags is protected by the mmap_lock held in write mode. */ vma_start_write(vma); @@ -301,6 +300,12 @@ static void shmem_swapin_range(struct vm_area_struct *vma, } #endif /* CONFIG_SWAP */ +static void mark_mmap_lock_dropped(struct madvise_behavior *madv_behavior) +{ + VM_WARN_ON_ONCE(madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK); + madv_behavior->lock_dropped = true; +} + /* * Schedule all required I/O operations. Do not wait for completion. */ @@ -313,7 +318,6 @@ static long madvise_willneed(struct madvise_behavior *madv_behavior) unsigned long end = madv_behavior->range.end; loff_t offset; - madv_behavior->prev = vma; #ifdef CONFIG_SWAP if (!file) { walk_page_range_vma(vma, start, end, &swapin_walk_ops, vma); @@ -342,7 +346,7 @@ static long madvise_willneed(struct madvise_behavior *madv_behavior) * vma's reference to the file) can go away as soon as we drop * mmap_lock. */ - madv_behavior->prev = NULL; /* tell sys_madvise we drop mmap_lock */ + mark_mmap_lock_dropped(madv_behavior); get_file(file); offset = (loff_t)(start - vma->vm_start) + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); @@ -633,7 +637,6 @@ static long madvise_cold(struct madvise_behavior *madv_behavior) struct vm_area_struct *vma = madv_behavior->vma; struct mmu_gather tlb; - madv_behavior->prev = vma; if (!can_madv_lru_vma(vma)) return -EINVAL; @@ -665,7 +668,6 @@ static long madvise_pageout(struct madvise_behavior *madv_behavior) struct mmu_gather tlb; struct vm_area_struct *vma = madv_behavior->vma; - madv_behavior->prev = vma; if (!can_madv_lru_vma(vma)) return -EINVAL; @@ -954,7 +956,6 @@ static long madvise_dontneed_free(struct madvise_behavior *madv_behavior) struct madvise_behavior_range *range = &madv_behavior->range; int behavior = madv_behavior->behavior; - madv_behavior->prev = madv_behavior->vma; if (!madvise_dontneed_free_valid_vma(madv_behavior)) return -EINVAL; @@ -964,8 +965,7 @@ static long madvise_dontneed_free(struct madvise_behavior *madv_behavior) if (!userfaultfd_remove(madv_behavior->vma, range->start, range->end)) { struct vm_area_struct *vma; - madv_behavior->prev = NULL; /* mmap_lock has been dropped, prev is stale */ - + mark_mmap_lock_dropped(madv_behavior); mmap_read_lock(mm); madv_behavior->vma = vma = vma_lookup(mm, range->start); if (!vma) @@ -1064,7 +1064,7 @@ static long madvise_remove(struct madvise_behavior *madv_behavior) unsigned long start = madv_behavior->range.start; unsigned long end = madv_behavior->range.end; - madv_behavior->prev = NULL; /* tell sys_madvise we drop mmap_lock */ + mark_mmap_lock_dropped(madv_behavior); if (vma->vm_flags & VM_LOCKED) return -EINVAL; @@ -1183,7 +1183,6 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior) long err; int i; - 
madv_behavior->prev = vma; if (!is_valid_guard_vma(vma, /* allow_locked = */false)) return -EINVAL; @@ -1293,7 +1292,6 @@ static long madvise_guard_remove(struct madvise_behavior *madv_behavior) struct vm_area_struct *vma = madv_behavior->vma; struct madvise_behavior_range *range = &madv_behavior->range; - madv_behavior->prev = vma; /* * We're ok with removing guards in mlock()'d ranges, as this is a * non-destructive action. @@ -1336,8 +1334,8 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior) case MADV_DONTNEED_LOCKED: return madvise_dontneed_free(madv_behavior); case MADV_COLLAPSE: - return madvise_collapse(vma, &madv_behavior->prev, - range->start, range->end); + return madvise_collapse(vma, range->start, range->end, + &madv_behavior->lock_dropped); case MADV_GUARD_INSTALL: return madvise_guard_install(madv_behavior); case MADV_GUARD_REMOVE: @@ -1589,7 +1587,6 @@ static bool try_vma_read_lock(struct madvise_behavior *madv_behavior) vma_end_read(vma); goto take_mmap_read_lock; } - madv_behavior->prev = vma; /* Not currently required. */ madv_behavior->vma = vma; return true; @@ -1617,7 +1614,7 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior) unsigned long last_end = range->end; int unmapped_error = 0; int error; - struct vm_area_struct *vma; + struct vm_area_struct *prev, *vma; /* * If VMA read lock is supported, apply madvise to a single VMA @@ -1630,24 +1627,23 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior) return error; } - /* - * If the interval [start,end) covers some unmapped address - * ranges, just ignore them, but return -ENOMEM at the end. - * - different from the way of handling in mlock etc. - */ - vma = find_vma_prev(mm, range->start, &madv_behavior->prev); + vma = find_vma_prev(mm, range->start, &prev); if (vma && range->start > vma->vm_start) - madv_behavior->prev = vma; + prev = vma; for (;;) { - struct vm_area_struct *prev; - /* Still start < end. */ if (!vma) return -ENOMEM; /* Here start < (last_end|vma->vm_end). */ if (range->start < vma->vm_start) { + /* + * This indicates a gap between VMAs in the input + * range. This does not cause the operation to abort, + * rather we simply return -ENOMEM to indicate that this + * has happened, but carry on. + */ unmapped_error = -ENOMEM; range->start = vma->vm_start; if (range->start >= last_end) @@ -1658,21 +1654,28 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior) range->end = min(vma->vm_end, last_end); /* Here vma->vm_start <= range->start < range->end <= (last_end|vma->vm_end). */ + madv_behavior->prev = prev; madv_behavior->vma = vma; error = madvise_vma_behavior(madv_behavior); if (error) return error; - prev = madv_behavior->prev; + if (madv_behavior->lock_dropped) { + /* We dropped the mmap lock, we can't ref the VMA. */ + prev = NULL; + vma = NULL; + madv_behavior->lock_dropped = false; + } else { + vma = madv_behavior->vma; + prev = vma; + } - range->start = range->end; - if (prev && range->start < prev->vm_end) - range->start = prev->vm_end; - if (range->start >= last_end) + if (vma && range->end < vma->vm_end) + range->end = vma->vm_end; + if (range->end >= last_end) break; - if (prev) - vma = find_vma(mm, prev->vm_end); - else /* madvise_remove dropped mmap_lock */ - vma = find_vma(mm, range->start); + + vma = find_vma(mm, vma ? 
vma->vm_end : range->end); + range->start = range->end; } return unmapped_error; From 980d05955835bfdc054243e5ad95de803d2c19a7 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Tue, 24 Jun 2025 15:03:45 +0200 Subject: [PATCH 129/370] mm, madvise: simplify anon_name handling Patch series "madvise anon_name cleanups", v2. While reviewing Lorenzo's madvise cleanups I've noticed that we can handle anon_name in madvise code much better, so sending that as patch 1. Initially I wanted to do first move the existing logic from madvise_vma_behavior() to madvise_update_vma() as a separate patch before the actual simplification but that would require adding anon_vma_name_put() in error handling paths only to be removed again, so it's a single patch to avoid churn. It's also an opportunity to move some mm code from prctl under mm, hence patch 2. After code moving preparation in patch 3, also unify madvise lock handling for madvise_set_anon_name() in patch 4. This patch (of 4): Since the introduction in 9a10064f5625 ("mm: add a field to store names for private anonymous memory") the code to set anon_name on a vma has been using madvise_update_vma() to call replace_anon_vma_name(). Since the former is called also by a number of other madvise behaviours that do not set a new anon_name, they have been passing the existing anon_name of the vma to make replace_anon_vma_name() a no-op. This is rather wasteful as it needs anon_vma_name_eq() to determine the no-op situations, and checks for when replace_anon_vma_name() is allowed (the vma is anon/shmem) duplicate the checks already done earlier in madvise_vma_behavior(). It has also lead to commit 942341dcc574 ("mm: fix use-after-free when anon vma name is used after vma is freed") adding anon_name refcount get/put operations exactly to the cases that actually do not change anon_name - just so the replace_anon_vma_name() can keep safely determining it has nothing to do. The recent madvise cleanups made this suboptimal handling very obvious, but happily also allow for an easy fix. madvise_update_vma() now has the complete information whether it's been called to set a new anon_name, so stop passing it the existing vma's name and doing the refcount get/put in its only caller madvise_vma_behavior(). In madvise_update_vma() itself, limit calling of replace_anon_vma_name() only to cases where we are setting a new name, otherwise we know it's a no-op. We can rely solely on the __MADV_SET_ANON_VMA_NAME behaviour and can remove the duplicate checks for vma being anon/shmem that were done already in madvise_vma_behavior(). Additionally, by using vma_modify_flags() when not modifying the anon_name, avoid explicitly passing the existing vma's anon_name and storing a pointer to it in struct madv_behavior or a local variable. This prevents the danger of accessing a freed anon_name after vma merging, previously fixed by commit 942341dcc574. 
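As a rough illustration of the resulting control flow (standalone C with invented types; struct region and the modify_*() helpers stand in for the VMA and the vma_modify_flags*() calls), the name comparison and the name-aware modify call happen only on the path that is actually setting a new name:

  #include <stdbool.h>
  #include <stdio.h>
  #include <string.h>

  struct region {
          unsigned flags;
          char name[32];
  };

  /* Flags-only update: the name is never touched, so nothing name-related
   * needs to be checked or referenced on this path. */
  static int modify_flags(struct region *r, unsigned new_flags)
  {
          r->flags = new_flags;
          return 0;
  }

  /* Name-setting update: only this path looks at the name at all. */
  static int modify_flags_name(struct region *r, unsigned new_flags, const char *name)
  {
          r->flags = new_flags;
          snprintf(r->name, sizeof(r->name), "%s", name);
          return 0;
  }

  static int update_region(struct region *r, unsigned new_flags,
                           bool set_new_name, const char *name)
  {
          /* No-op check: comparing names is only relevant when setting one. */
          if (new_flags == r->flags &&
              (!set_new_name || strcmp(r->name, name) == 0))
                  return 0;

          return set_new_name ? modify_flags_name(r, new_flags, name)
                              : modify_flags(r, new_flags);
  }

  int main(void)
  {
          struct region r = { .flags = 0x1, .name = "old" };

          update_region(&r, 0x3, false, NULL);        /* plain madvise-style update */
          update_region(&r, 0x3, true, "heap-arena"); /* name-setting update */
          printf("flags=%#x name=%s\n", r.flags, r.name);
          return 0;
  }

Every other madvise behaviour takes the flags-only path, so no anon_name reference needs to be taken or dropped around the update.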
Link: https://lkml.kernel.org/r/20250624-anon_name_cleanup-v2-0-600075462a11@suse.cz Link: https://lkml.kernel.org/r/20250624-anon_name_cleanup-v2-1-600075462a11@suse.cz Signed-off-by: Vlastimil Babka Acked-by: David Hildenbrand Reviewed-by: Suren Baghdasaryan Reviewed-by: Lorenzo Stoakes Tested-by: Lorenzo Stoakes Cc: Colin Cross Cc: Jann Horn Cc: Liam Howlett Cc: Michal Hocko Cc: Mike Rapoport Cc: SeongJae Park Signed-off-by: Andrew Morton --- mm/madvise.c | 37 +++++++++++++------------------------ 1 file changed, 13 insertions(+), 24 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index c467ee42596f..0ca405017b37 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -176,25 +176,29 @@ static int replace_anon_vma_name(struct vm_area_struct *vma, } #endif /* CONFIG_ANON_VMA_NAME */ /* - * Update the vm_flags on region of a vma, splitting it or merging it as - * necessary. Must be called with mmap_lock held for writing; - * Caller should ensure anon_name stability by raising its refcount even when - * anon_name belongs to a valid vma because this function might free that vma. + * Update the vm_flags and/or anon_name on region of a vma, splitting it or + * merging it as necessary. Must be called with mmap_lock held for writing. */ static int madvise_update_vma(vm_flags_t new_flags, struct madvise_behavior *madv_behavior) { - int error; struct vm_area_struct *vma = madv_behavior->vma; struct madvise_behavior_range *range = &madv_behavior->range; struct anon_vma_name *anon_name = madv_behavior->anon_name; + bool set_new_anon_name = madv_behavior->behavior == __MADV_SET_ANON_VMA_NAME; VMA_ITERATOR(vmi, madv_behavior->mm, range->start); - if (new_flags == vma->vm_flags && anon_vma_name_eq(anon_vma_name(vma), anon_name)) + if (new_flags == vma->vm_flags && (!set_new_anon_name || + anon_vma_name_eq(anon_vma_name(vma), anon_name))) return 0; - vma = vma_modify_flags_name(&vmi, madv_behavior->prev, vma, + if (set_new_anon_name) + vma = vma_modify_flags_name(&vmi, madv_behavior->prev, vma, range->start, range->end, new_flags, anon_name); + else + vma = vma_modify_flags(&vmi, madv_behavior->prev, vma, + range->start, range->end, new_flags); + if (IS_ERR(vma)) return PTR_ERR(vma); @@ -203,11 +207,8 @@ static int madvise_update_vma(vm_flags_t new_flags, /* vm_flags is protected by the mmap_lock held in write mode. */ vma_start_write(vma); vm_flags_reset(vma, new_flags); - if (!vma->vm_file || vma_is_anon_shmem(vma)) { - error = replace_anon_vma_name(vma, anon_name); - if (error) - return error; - } + if (set_new_anon_name) + return replace_anon_vma_name(vma, anon_name); return 0; } @@ -1313,7 +1314,6 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior) int behavior = madv_behavior->behavior; struct vm_area_struct *vma = madv_behavior->vma; vm_flags_t new_flags = vma->vm_flags; - bool set_new_anon_name = behavior == __MADV_SET_ANON_VMA_NAME; struct madvise_behavior_range *range = &madv_behavior->range; int error; @@ -1403,18 +1403,7 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior) /* This is a write operation.*/ VM_WARN_ON_ONCE(madv_behavior->lock_mode != MADVISE_MMAP_WRITE_LOCK); - /* - * madvise_update_vma() might cause a VMA merge which could put an - * anon_vma_name, so we must hold an additional reference on the - * anon_vma_name so it doesn't disappear from under us. 
- */
-	if (!set_new_anon_name) {
-		madv_behavior->anon_name = anon_vma_name(vma);
-		anon_vma_name_get(madv_behavior->anon_name);
-	}
 	error = madvise_update_vma(new_flags, madv_behavior);
-	if (!set_new_anon_name)
-		anon_vma_name_put(madv_behavior->anon_name);
 out:
 	/*
 	 * madvise() returns EAGAIN if kernel resources, such as

From 6b233784b198e0d6dbfd526341b6ec51ffd30020 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka
Date: Tue, 24 Jun 2025 15:03:46 +0200
Subject: [PATCH 130/370] mm, madvise: extract mm code from prctl_set_vma() to
 mm/madvise.c

Setting anon_name is done via madvise_set_anon_name() and behaves a lot
like other madvise operations. However, apparently because madvise() has
lacked the 4th argument and prctl() has not, the userspace entry point
has been implemented via prctl(PR_SET_VMA, ...) and handled first by
prctl_set_vma().

Currently prctl_set_vma() lives in kernel/sys.c but setting the
vma->anon_name is mm-specific code, so extract it to a new
set_anon_vma_name() function under mm. mm/madvise.c seems to be the most
straightforward place as that's where madvise_set_anon_name() lives.
Stop declaring the latter in mm.h and instead declare set_anon_vma_name().

Link: https://lkml.kernel.org/r/20250624-anon_name_cleanup-v2-2-600075462a11@suse.cz
Signed-off-by: Vlastimil Babka
Acked-by: David Hildenbrand
Tested-by: Lorenzo Stoakes
Reviewed-by: Suren Baghdasaryan
Reviewed-by: Lorenzo Stoakes
Cc: Colin Cross
Cc: Jann Horn
Cc: Liam Howlett
Cc: Michal Hocko
Cc: Mike Rapoport
Cc: SeongJae Park
Signed-off-by: Andrew Morton
---
 include/linux/mm.h | 14 ++++++-------
 kernel/sys.c       | 50 +-------------------------------------------
 mm/madvise.c       | 52 ++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 58 insertions(+), 58 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e0549f3d681..ef40f68c1183 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4059,14 +4059,14 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping,
 #endif

 #ifdef CONFIG_ANON_VMA_NAME
-int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
-			  unsigned long len_in,
-			  struct anon_vma_name *anon_name);
+int set_anon_vma_name(unsigned long addr, unsigned long size,
+		      const char __user *uname);
 #else
-static inline int
-madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
-		      unsigned long len_in, struct anon_vma_name *anon_name) {
-	return 0;
+static inline
+int set_anon_vma_name(unsigned long addr, unsigned long size,
+		      const char __user *uname)
+{
+	return -EINVAL;
 }
 #endif

diff --git a/kernel/sys.c b/kernel/sys.c
index adc0de0aa364..b153fb345ada 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2343,54 +2343,14 @@ int __weak arch_lock_shadow_stack_status(struct task_struct *t, unsigned long st

 #define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)

-#ifdef CONFIG_ANON_VMA_NAME
-
-#define ANON_VMA_NAME_MAX_LEN		80
-#define ANON_VMA_NAME_INVALID_CHARS	"\\`$[]"
-
-static inline bool is_valid_name_char(char ch)
-{
-	/* printable ascii characters, excluding ANON_VMA_NAME_INVALID_CHARS */
-	return ch > 0x1f && ch < 0x7f &&
-		!strchr(ANON_VMA_NAME_INVALID_CHARS, ch);
-}
-
 static int prctl_set_vma(unsigned long opt, unsigned long addr,
 			 unsigned long size, unsigned long arg)
 {
-	struct mm_struct *mm = current->mm;
-	const char __user *uname;
-	struct anon_vma_name *anon_name = NULL;
 	int error;

 	switch (opt) {
 	case PR_SET_VMA_ANON_NAME:
-		uname = (const char __user *)arg;
-		if (uname) {
-			char *name, *pch;
-
-			name = strndup_user(uname,
ANON_VMA_NAME_MAX_LEN);
-			if (IS_ERR(name))
-				return PTR_ERR(name);
-
-			for (pch = name; *pch != '\0'; pch++) {
-				if (!is_valid_name_char(*pch)) {
-					kfree(name);
-					return -EINVAL;
-				}
-			}
-			/* anon_vma has its own copy */
-			anon_name = anon_vma_name_alloc(name);
-			kfree(name);
-			if (!anon_name)
-				return -ENOMEM;
-
-		}
-
-		mmap_write_lock(mm);
-		error = madvise_set_anon_name(mm, addr, size, anon_name);
-		mmap_write_unlock(mm);
-		anon_vma_name_put(anon_name);
+		error = set_anon_vma_name(addr, size, (const char __user *)arg);
 		break;
 	default:
 		error = -EINVAL;
@@ -2399,14 +2359,6 @@ static int prctl_set_vma(unsigned long opt, unsigned long addr,
 	return error;
 }

-#else /* CONFIG_ANON_VMA_NAME */
-static int prctl_set_vma(unsigned long opt, unsigned long start,
-			 unsigned long size, unsigned long arg)
-{
-	return -EINVAL;
-}
-#endif /* CONFIG_ANON_VMA_NAME */
-
 static inline unsigned long get_current_mdwe(void)
 {
 	unsigned long ret = 0;

diff --git a/mm/madvise.c b/mm/madvise.c
index 0ca405017b37..a2294bc1cc7b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -134,8 +134,8 @@ static int replace_anon_vma_name(struct vm_area_struct *vma,
 	return 0;
 }

-int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
-			  unsigned long len_in, struct anon_vma_name *anon_name)
+static int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
+		unsigned long len_in, struct anon_vma_name *anon_name)
 {
 	unsigned long end;
 	unsigned long len;
@@ -2096,3 +2096,51 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 out:
 	return ret;
 }
+
+#ifdef CONFIG_ANON_VMA_NAME
+
+#define ANON_VMA_NAME_MAX_LEN		80
+#define ANON_VMA_NAME_INVALID_CHARS	"\\`$[]"
+
+static inline bool is_valid_name_char(char ch)
+{
+	/* printable ascii characters, excluding ANON_VMA_NAME_INVALID_CHARS */
+	return ch > 0x1f && ch < 0x7f &&
+		!strchr(ANON_VMA_NAME_INVALID_CHARS, ch);
+}
+
+int set_anon_vma_name(unsigned long addr, unsigned long size,
+		      const char __user *uname)
+{
+	struct anon_vma_name *anon_name = NULL;
+	struct mm_struct *mm = current->mm;
+	int error;
+
+	if (uname) {
+		char *name, *pch;
+
+		name = strndup_user(uname, ANON_VMA_NAME_MAX_LEN);
+		if (IS_ERR(name))
+			return PTR_ERR(name);
+
+		for (pch = name; *pch != '\0'; pch++) {
+			if (!is_valid_name_char(*pch)) {
+				kfree(name);
+				return -EINVAL;
+			}
+		}
+		/* anon_vma has its own copy */
+		anon_name = anon_vma_name_alloc(name);
+		kfree(name);
+		if (!anon_name)
+			return -ENOMEM;
+	}
+
+	mmap_write_lock(mm);
+	error = madvise_set_anon_name(mm, addr, size, anon_name);
+	mmap_write_unlock(mm);
+	anon_vma_name_put(anon_name);
+
+	return error;
+}
+#endif

From 986738ce446ab0cb0b6ed71509b19ace69d22bf7 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka
Date: Tue, 24 Jun 2025 15:03:47 +0200
Subject: [PATCH 131/370] mm, madvise: move madvise_set_anon_name() down the
 file

Preparatory change so that we can use madvise_lock()/unlock() in the
function without forward declarations or more thorough shuffling. No
functional change. Moving it in a separate commit helps git heuristics
detect the move properly.
Link: https://lkml.kernel.org/r/20250624-anon_name_cleanup-v2-3-600075462a11@suse.cz Signed-off-by: Vlastimil Babka Tested-by: Lorenzo Stoakes Acked-by: David Hildenbrand Reviewed-by: Suren Baghdasaryan Reviewed-by: Lorenzo Stoakes Cc: Colin Cross Cc: Jann Horn Cc: Liam Howlett Cc: Michal Hocko Cc: Mike Rapoport Cc: SeongJae Park Signed-off-by: Andrew Morton --- mm/madvise.c | 64 ++++++++++++++++++++++++++-------------------------- 1 file changed, 32 insertions(+), 32 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index a2294bc1cc7b..cb7d74d56deb 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -133,38 +133,6 @@ static int replace_anon_vma_name(struct vm_area_struct *vma, return 0; } - -static int madvise_set_anon_name(struct mm_struct *mm, unsigned long start, - unsigned long len_in, struct anon_vma_name *anon_name) -{ - unsigned long end; - unsigned long len; - struct madvise_behavior madv_behavior = { - .mm = mm, - .behavior = __MADV_SET_ANON_VMA_NAME, - .lock_mode = MADVISE_MMAP_WRITE_LOCK, - .anon_name = anon_name, - }; - - if (start & ~PAGE_MASK) - return -EINVAL; - len = (len_in + ~PAGE_MASK) & PAGE_MASK; - - /* Check to see whether len was rounded up from small -ve to zero */ - if (len_in && !len) - return -EINVAL; - - end = start + len; - if (end < start) - return -EINVAL; - - if (end == start) - return 0; - - madv_behavior.range.start = start; - madv_behavior.range.end = end; - return madvise_walk_vmas(&madv_behavior); -} #else /* CONFIG_ANON_VMA_NAME */ static int replace_anon_vma_name(struct vm_area_struct *vma, struct anon_vma_name *anon_name) @@ -2109,6 +2077,38 @@ static inline bool is_valid_name_char(char ch) !strchr(ANON_VMA_NAME_INVALID_CHARS, ch); } +static int madvise_set_anon_name(struct mm_struct *mm, unsigned long start, + unsigned long len_in, struct anon_vma_name *anon_name) +{ + unsigned long end; + unsigned long len; + struct madvise_behavior madv_behavior = { + .mm = mm, + .behavior = __MADV_SET_ANON_VMA_NAME, + .lock_mode = MADVISE_MMAP_WRITE_LOCK, + .anon_name = anon_name, + }; + + if (start & ~PAGE_MASK) + return -EINVAL; + len = (len_in + ~PAGE_MASK) & PAGE_MASK; + + /* Check to see whether len was rounded up from small -ve to zero */ + if (len_in && !len) + return -EINVAL; + + end = start + len; + if (end < start) + return -EINVAL; + + if (end == start) + return 0; + + madv_behavior.range.start = start; + madv_behavior.range.end = end; + return madvise_walk_vmas(&madv_behavior); +} + int set_anon_vma_name(unsigned long addr, unsigned long size, const char __user *uname) { From 9992554c9ca3eb929daf8ccb02cfbcb3dbc27d4f Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Tue, 24 Jun 2025 15:03:48 +0200 Subject: [PATCH 132/370] mm, madvise: use standard madvise locking in madvise_set_anon_name() Use madvise_lock()/madvise_unlock() in madvise_set_anon_name() in the same way as in do_madvise(). This narrows the lock scope a bit and reuses existing functionality. get_lock_mode() already picks the correct MADVISE_MMAP_WRITE_LOCK mode for __MADV_SET_ANON_VMA_NAME so we can just remove the explicit assignment. There is a user visible change in that the prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME...) might now return -EINTR. 
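For reference, a minimal userspace sketch (hypothetical helper; the
PR_SET_VMA / PR_SET_VMA_ANON_NAME constants come from <linux/prctl.h> and
need a kernel with CONFIG_ANON_VMA_NAME) of the affected call, with the
retry loop that the possible -EINTR makes advisable:

	#include <errno.h>
	#include <sys/prctl.h>
	#include <linux/prctl.h>

	/* Hypothetical helper: name an anonymous mapping, retrying on EINTR. */
	static int name_anon_region(void *addr, unsigned long len, const char *name)
	{
		int ret;

		do {
			ret = prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
				    (unsigned long)addr, len, (unsigned long)name);
		} while (ret == -1 && errno == EINTR);

		/* On success the region shows as "[anon:<name>]" in /proc/<pid>/maps. */
		return ret;
	}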
Link: https://lkml.kernel.org/r/20250624-anon_name_cleanup-v2-4-600075462a11@suse.cz
Signed-off-by: Vlastimil Babka
Tested-by: Lorenzo Stoakes
Acked-by: David Hildenbrand
Reviewed-by: Suren Baghdasaryan
Reviewed-by: Lorenzo Stoakes
Cc: Colin Cross
Cc: Jann Horn
Cc: Liam Howlett
Cc: Michal Hocko
Cc: Mike Rapoport
Cc: SeongJae Park
Signed-off-by: Andrew Morton
---
 mm/madvise.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index cb7d74d56deb..a34c2c89a53b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -2082,10 +2082,10 @@ static int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 {
 	unsigned long end;
 	unsigned long len;
+	int error;
 	struct madvise_behavior madv_behavior = {
 		.mm = mm,
 		.behavior = __MADV_SET_ANON_VMA_NAME,
-		.lock_mode = MADVISE_MMAP_WRITE_LOCK,
 		.anon_name = anon_name,
 	};

@@ -2106,7 +2106,14 @@ static int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,

 	madv_behavior.range.start = start;
 	madv_behavior.range.end = end;
-	return madvise_walk_vmas(&madv_behavior);
+
+	error = madvise_lock(&madv_behavior);
+	if (error)
+		return error;
+	error = madvise_walk_vmas(&madv_behavior);
+	madvise_unlock(&madv_behavior);
+
+	return error;
 }

 int set_anon_vma_name(unsigned long addr, unsigned long size,
@@ -2136,9 +2143,7 @@ int set_anon_vma_name(unsigned long addr, unsigned long size,
 			return -ENOMEM;
 	}

-	mmap_write_lock(mm);
 	error = madvise_set_anon_name(mm, addr, size, anon_name);
-	mmap_write_unlock(mm);
 	anon_vma_name_put(anon_name);

 	return error;

From 1bf47d4195e4518973024cfc0777491ff796e883 Mon Sep 17 00:00:00 2001
From: Oscar Salvador
Date: Mon, 16 Jun 2025 15:51:44 +0200
Subject: [PATCH 133/370] mm,slub: do not special case N_NORMAL nodes for
 slab_nodes

Patch series "Implement numa node notifier", v7.

The memory notifier is a tool that allows consumers to get notified
whenever memory gets onlined or offlined in the system. Currently, there
are 10 consumers of that, but 5 out of those 10 consumers are only
interested in getting notifications when a numa node changes its memory
state, that is, going from memoryless to memory-aware or vice versa.
This means that for every {online,offline}_pages operation they get
notified even though the numa node might not have changed its state.
This is suboptimal, and we want to decouple numa node state changes from
memory state changes.

While we are doing this, remove status_change_nid_normal, as the only
current user (slub) does not really need it. This allows us to further
simplify and clean up the code.

The first patch gets rid of status_change_nid_normal in slub. The second
patch implements a numa node notifier that does just that, and has those
consumers register there, so they get notified only when they are
interested. The third patch replaces the 'status_change_nid{_normal}'
fields within memory_notify with a 'nid', as that is all we need for the
memory notifier, and updates the only user of it (page_ext).

Consumers that are only interested in numa node state changes are:

 - memory-tier
 - slub
 - cpuset
 - hmat
 - cxl
 - autoweight-mempolicy

This patch (of 11):

Currently, slab_mem_going_online_callback() checks whether the node has
N_NORMAL memory in order to be set in slab_nodes. While it is true that
getting rid of that enforcement would mean ending up with movable nodes
in slab_nodes, the memory waste that comes with that is negligible.
So stop checking for status_change_nid_normal and just use
status_change_nid instead, which works for both types of memory.

Also, once we allocate the kmem_cache_node cache for the node in
slab_mem_online_callback(), we never deallocate it in
slab_mem_offline_callback() when the node goes memoryless, so we can just
get rid of it.

The side effects are that we will stop clearing the node from slab_nodes,
and also that newly created kmem caches after node hotremove will now
allocate their kmem_cache_node for the node(s) that were hotremoved, but
these should be negligible.

Link: https://lkml.kernel.org/r/20250616135158.450136-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20250616135158.450136-2-osalvador@suse.de
Signed-off-by: Oscar Salvador
Suggested-by: David Hildenbrand
Reviewed-by: Vlastimil Babka
Reviewed-by: Harry Yoo
Acked-by: David Hildenbrand
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Jonathan Cameron
Cc: Rakie Kim
Signed-off-by: Andrew Morton
---
 mm/slub.c | 34 +++------------------------------
 1 file changed, 3 insertions(+), 31 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 31e11ef256f9..db3280cc678d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -447,7 +447,7 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)

 /*
  * Tracks for which NUMA nodes we have kmem_cache_nodes allocated.
- * Corresponds to node_state[N_NORMAL_MEMORY], but can temporarily
+ * Corresponds to node_state[N_MEMORY], but can temporarily
  * differ during memory hotplug/hotremove operations.
  * Protected by slab_mutex.
  */
@@ -6163,36 +6163,12 @@ static int slab_mem_going_offline_callback(void *arg)
 	return 0;
 }

-static void slab_mem_offline_callback(void *arg)
-{
-	struct memory_notify *marg = arg;
-	int offline_node;
-
-	offline_node = marg->status_change_nid_normal;
-
-	/*
-	 * If the node still has available memory. we need kmem_cache_node
-	 * for it yet.
-	 */
-	if (offline_node < 0)
-		return;
-
-	mutex_lock(&slab_mutex);
-	node_clear(offline_node, slab_nodes);
-	/*
-	 * We no longer free kmem_cache_node structures here, as it would be
-	 * racy with all get_node() users, and infeasible to protect them with
-	 * slab_mutex.
-	 */
-	mutex_unlock(&slab_mutex);
-}
-
 static int slab_mem_going_online_callback(void *arg)
 {
 	struct kmem_cache_node *n;
 	struct kmem_cache *s;
 	struct memory_notify *marg = arg;
-	int nid = marg->status_change_nid_normal;
+	int nid = marg->status_change_nid;
 	int ret = 0;

 	/*
@@ -6250,10 +6226,6 @@ static int slab_memory_callback(struct notifier_block *self,
 	case MEM_GOING_OFFLINE:
 		ret = slab_mem_going_offline_callback(arg);
 		break;
-	case MEM_OFFLINE:
-	case MEM_CANCEL_ONLINE:
-		slab_mem_offline_callback(arg);
-		break;
 	case MEM_ONLINE:
 	case MEM_CANCEL_OFFLINE:
 		break;
@@ -6324,7 +6296,7 @@ void __init kmem_cache_init(void)
 	 * Initialize the nodemask for which we will allocate per node
 	 * structures. Here we don't need taking slab_mutex yet.
 	 */
-	for_each_node_state(node, N_NORMAL_MEMORY)
+	for_each_node_state(node, N_MEMORY)
 		node_set(node, slab_nodes);

 	create_boot_cache(kmem_cache_node, "kmem_cache_node",

From 8d2882a8edb8621d37fd8931e0686070cc6cc189 Mon Sep 17 00:00:00 2001
From: Oscar Salvador
Date: Mon, 16 Jun 2025 15:51:45 +0200
Subject: [PATCH 134/370] mm,memory_hotplug: remove status_change_nid_normal
 and update documentation

Now that the last user of status_change_nid_normal is gone, we can remove
it. Update documentation accordingly.
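For contrast with the conversions later in the series, a minimal sketch
(hypothetical consumer, not from the tree) of the memory-notifier
boilerplate that node-level consumers have needed so far:

	#include <linux/memory.h>
	#include <linux/notifier.h>

	static int foo_memory_callback(struct notifier_block *self,
				       unsigned long action, void *arg)
	{
		struct memory_notify *mn = arg;

		/* Fires on every online/offline; filter out non-node changes. */
		if (mn->status_change_nid < 0)
			return NOTIFY_OK;

		switch (action) {
		case MEM_ONLINE:
			/* the node just gained its first memory */
			break;
		case MEM_OFFLINE:
			/* the node just lost its last memory */
			break;
		}
		return NOTIFY_OK;
	}

	static int __init foo_init(void)
	{
		hotplug_memory_notifier(foo_memory_callback, DEFAULT_CALLBACK_PRI);
		return 0;
	}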
Link: https://lkml.kernel.org/r/20250616135158.450136-3-osalvador@suse.de Signed-off-by: Oscar Salvador Reviewed-by: Vlastimil Babka Acked-by: David Hildenbrand Cc: Harry Yoo Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joanthan Cameron Cc: Rakie Kim Signed-off-by: Andrew Morton --- Documentation/core-api/memory-hotplug.rst | 3 --- .../translations/zh_CN/core-api/memory-hotplug.rst | 3 --- include/linux/memory.h | 1 - mm/memory_hotplug.c | 12 ------------ 4 files changed, 19 deletions(-) diff --git a/Documentation/core-api/memory-hotplug.rst b/Documentation/core-api/memory-hotplug.rst index 682259ee633a..d1b8eb9add8a 100644 --- a/Documentation/core-api/memory-hotplug.rst +++ b/Documentation/core-api/memory-hotplug.rst @@ -56,14 +56,11 @@ The third argument (arg) passes a pointer of struct memory_notify:: struct memory_notify { unsigned long start_pfn; unsigned long nr_pages; - int status_change_nid_normal; int status_change_nid; } - start_pfn is start_pfn of online/offline memory. - nr_pages is # of pages of online/offline memory. -- status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask - is (will be) set/clear, if this is -1, then nodemask status is not changed. - status_change_nid is set node id when N_MEMORY of nodemask is (will be) set/clear. It means a new(memoryless) node gets new memory by online and a node loses all memory. If this is -1, then nodemask status is not changed. diff --git a/Documentation/translations/zh_CN/core-api/memory-hotplug.rst b/Documentation/translations/zh_CN/core-api/memory-hotplug.rst index 9b2841fb9a5f..c2a4122ae221 100644 --- a/Documentation/translations/zh_CN/core-api/memory-hotplug.rst +++ b/Documentation/translations/zh_CN/core-api/memory-hotplug.rst @@ -62,7 +62,6 @@ memory_notify结构体的指针:: struct memory_notify { unsigned long start_pfn; unsigned long nr_pages; - int status_change_nid_normal; int status_change_nid; } @@ -70,8 +69,6 @@ memory_notify结构体的指针:: - nr_pages是在线/离线内存的页数。 -- status_change_nid_normal是当nodemask的N_NORMAL_MEMORY被设置/清除时设置节 - 点id,如果是-1,则nodemask状态不改变。 - status_change_nid是当nodemask的N_MEMORY被(将)设置/清除时设置的节点id。这 意味着一个新的(没上线的)节点通过联机获得新的内存,而一个节点失去了所有的内 diff --git a/include/linux/memory.h b/include/linux/memory.h index bd4440bc4a57..a8284d41e452 100644 --- a/include/linux/memory.h +++ b/include/linux/memory.h @@ -109,7 +109,6 @@ struct memory_notify { unsigned long altmap_nr_pages; unsigned long start_pfn; unsigned long nr_pages; - int status_change_nid_normal; int status_change_nid; }; diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index bec20a91e757..d11278c3840d 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -706,19 +706,13 @@ static void node_states_check_changes_online(unsigned long nr_pages, int nid = zone_to_nid(zone); arg->status_change_nid = NUMA_NO_NODE; - arg->status_change_nid_normal = NUMA_NO_NODE; if (!node_state(nid, N_MEMORY)) arg->status_change_nid = nid; - if (zone_idx(zone) <= ZONE_NORMAL && !node_state(nid, N_NORMAL_MEMORY)) - arg->status_change_nid_normal = nid; } static void node_states_set_node(int node, struct memory_notify *arg) { - if (arg->status_change_nid_normal >= 0) - node_set_state(node, N_NORMAL_MEMORY); - if (arg->status_change_nid >= 0) node_set_state(node, N_MEMORY); } @@ -1894,7 +1888,6 @@ static void node_states_check_changes_offline(unsigned long nr_pages, enum zone_type zt; arg->status_change_nid = NUMA_NO_NODE; - arg->status_change_nid_normal = NUMA_NO_NODE; /* * Check whether node_states[N_NORMAL_MEMORY] will be changed. 
@@ -1906,8 +1899,6 @@ static void node_states_check_changes_offline(unsigned long nr_pages,
 	 */
 	for (zt = 0; zt <= ZONE_NORMAL; zt++)
 		present_pages += pgdat->node_zones[zt].present_pages;
-	if (zone_idx(zone) <= ZONE_NORMAL && nr_pages >= present_pages)
-		arg->status_change_nid_normal = zone_to_nid(zone);

 	/*
 	 * We have accounted the pages from [0..ZONE_NORMAL); ZONE_HIGHMEM
@@ -1926,9 +1917,6 @@ static void node_states_check_changes_offline(unsigned long nr_pages,

 static void node_states_clear_node(int node, struct memory_notify *arg)
 {
-	if (arg->status_change_nid_normal >= 0)
-		node_clear_state(node, N_NORMAL_MEMORY);
-
 	if (arg->status_change_nid >= 0)
 		node_clear_state(node, N_MEMORY);
 }

From 67929de108479dbb78496b61af5c24072fc16d8d Mon Sep 17 00:00:00 2001
From: Oscar Salvador
Date: Mon, 16 Jun 2025 15:51:46 +0200
Subject: [PATCH 135/370] mm,memory_hotplug: implement numa node notifier

There are at least six consumers of hotplug_memory_notifier whose real
interest is whether any numa node changed its state, e.g. going from
having memory to not having memory and vice versa.

Implement a specific notifier for numa nodes when their state gets
changed, which will later be used by those consumers that are only
interested in numa node state changes. Add documentation as well.

[dan.carpenter@linaro.org: set failure reason in offline_pages()]
Link: https://lkml.kernel.org/r/be4fd31b-7d09-46b0-8329-6d0464ffa7a5@sabinyo.mountain
Link: https://lkml.kernel.org/r/20250616135158.450136-4-osalvador@suse.de
Signed-off-by: Oscar Salvador
Signed-off-by: Dan Carpenter
Reviewed-by: Jonathan Cameron
Reviewed-by: Harry Yoo
Reviewed-by: Vlastimil Babka
Acked-by: David Hildenbrand
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Rakie Kim
Signed-off-by: Andrew Morton
---
 Documentation/core-api/memory-hotplug.rst |  83 ++++++++++++
 drivers/base/node.c                       |  21 ++++
 include/linux/node.h                      |  40 ++++++
 mm/memory_hotplug.c                       | 146 ++++++++++------------
 4 files changed, 210 insertions(+), 80 deletions(-)

diff --git a/Documentation/core-api/memory-hotplug.rst b/Documentation/core-api/memory-hotplug.rst
index d1b8eb9add8a..fb84e78968b2 100644
--- a/Documentation/core-api/memory-hotplug.rst
+++ b/Documentation/core-api/memory-hotplug.rst
@@ -9,6 +9,9 @@ Memory hotplug event notifier

 Hotplugging events are sent to a notification queue.

+Memory notifier
+----------------
+
 There are six types of notification defined in ``include/linux/memory.h``:

 MEM_GOING_ONLINE
@@ -68,6 +71,14 @@ The third argument (arg) passes a pointer of struct memory_notify::
 If status_changed_nid* >= 0, callback should create/discard structures for the
 node if necessary.

+It is possible to get notified for MEM_CANCEL_ONLINE without having been notified
+for MEM_GOING_ONLINE, and the same applies to MEM_CANCEL_OFFLINE and
+MEM_GOING_OFFLINE.
+This can happen when a consumer fails, meaning we break the callchain and we
+stop calling the remaining consumers of the notifier.
+It is then important that users of memory_notify make no assumptions and get
+prepared to handle such cases.
+
 The callback routine shall return one of the values
 NOTIFY_DONE, NOTIFY_OK, NOTIFY_BAD, NOTIFY_STOP
 defined in ``include/linux/notifier.h``
@@ -80,6 +91,78 @@ further processing of the notification queue.

 NOTIFY_STOP stops further processing of the notification queue.
+Numa node notifier
+------------------
+
+There are six types of notification defined in ``include/linux/node.h``:
+
+NODE_ADDING_FIRST_MEMORY
+  Generated before memory becomes available to this node for the first time.
+
+NODE_CANCEL_ADDING_FIRST_MEMORY
+  Generated if NODE_ADDING_FIRST_MEMORY fails.
+
+NODE_ADDED_FIRST_MEMORY
+  Generated when memory has become available for this node for the first time.
+
+NODE_REMOVING_LAST_MEMORY
+  Generated when the last memory available to this node is about to be offlined.
+
+NODE_CANCEL_REMOVING_LAST_MEMORY
+  Generated if NODE_REMOVING_LAST_MEMORY fails.
+
+NODE_REMOVED_LAST_MEMORY
+  Generated when the last memory available to this node has been offlined.
+
+A callback routine can be registered by calling::
+
+  hotplug_node_notifier(callback_func, priority)
+
+Callback functions with higher values of priority are called before callback
+functions with lower values.
+
+A callback function must have the following prototype::
+
+  int callback_func(
+    struct notifier_block *self, unsigned long action, void *arg);
+
+The first argument of the callback function (self) is a pointer to the block
+of the notifier chain that points to the callback function itself.
+The second argument (action) is one of the event types described above.
+The third argument (arg) passes a pointer of struct node_notify::
+
+  struct node_notify {
+    int nid;
+  }
+
+- nid is the node we are adding memory to or removing memory from.
+
+It is possible to get notified for NODE_CANCEL_ADDING_FIRST_MEMORY without
+having been notified for NODE_ADDING_FIRST_MEMORY, and the same applies to
+NODE_CANCEL_REMOVING_LAST_MEMORY and NODE_REMOVING_LAST_MEMORY.
+This can happen when a consumer fails, meaning we break the callchain and we
+stop calling the remaining consumers of the notifier.
+It is then important that users of node_notify make no assumptions and get
+prepared to handle such cases.
+
+The callback routine shall return one of the values
+NOTIFY_DONE, NOTIFY_OK, NOTIFY_BAD, NOTIFY_STOP
+defined in ``include/linux/notifier.h``
+
+NOTIFY_DONE and NOTIFY_OK have no effect on the further processing.
+
+NOTIFY_BAD is used as response to the NODE_ADDING_FIRST_MEMORY,
+NODE_REMOVING_LAST_MEMORY, NODE_ADDED_FIRST_MEMORY or
+NODE_REMOVED_LAST_MEMORY action to cancel hotplugging.
+It stops further processing of the notification queue.
+
+NOTIFY_STOP stops further processing of the notification queue.
+
+Please note that we should not fail for NODE_ADDED_FIRST_MEMORY /
+NODE_REMOVED_LAST_MEMORY, as memory_hotplug code cannot roll back at that
+point anymore.
+ Locking Internals ================= diff --git a/drivers/base/node.c b/drivers/base/node.c index 6cbeca45c451..6d66382dae65 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -112,6 +112,27 @@ static const struct attribute_group *node_access_node_groups[] = { NULL, }; +#ifdef CONFIG_MEMORY_HOTPLUG +static BLOCKING_NOTIFIER_HEAD(node_chain); + +int register_node_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_register(&node_chain, nb); +} +EXPORT_SYMBOL(register_node_notifier); + +void unregister_node_notifier(struct notifier_block *nb) +{ + blocking_notifier_chain_unregister(&node_chain, nb); +} +EXPORT_SYMBOL(unregister_node_notifier); + +int node_notify(unsigned long val, void *v) +{ + return blocking_notifier_call_chain(&node_chain, val, v); +} +#endif + static void node_remove_accesses(struct node *node) { struct node_access_nodes *c, *cnext; diff --git a/include/linux/node.h b/include/linux/node.h index 88bceebcbfa5..2c7529335b21 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -125,6 +125,46 @@ static inline void register_memory_blocks_under_nodes(void) #endif extern void unregister_node(struct node *node); + +struct node_notify { + int nid; +}; + +#define NODE_ADDING_FIRST_MEMORY (1<<0) +#define NODE_ADDED_FIRST_MEMORY (1<<1) +#define NODE_CANCEL_ADDING_FIRST_MEMORY (1<<2) +#define NODE_REMOVING_LAST_MEMORY (1<<3) +#define NODE_REMOVED_LAST_MEMORY (1<<4) +#define NODE_CANCEL_REMOVING_LAST_MEMORY (1<<5) + +#if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_NUMA) +extern int register_node_notifier(struct notifier_block *nb); +extern void unregister_node_notifier(struct notifier_block *nb); +extern int node_notify(unsigned long val, void *v); + +#define hotplug_node_notifier(fn, pri) ({ \ + static __meminitdata struct notifier_block fn##_node_nb =\ + { .notifier_call = fn, .priority = pri };\ + register_node_notifier(&fn##_node_nb); \ +}) +#else +static inline int register_node_notifier(struct notifier_block *nb) +{ + return 0; +} +static inline void unregister_node_notifier(struct notifier_block *nb) +{ +} +static inline int node_notify(unsigned long val, void *v) +{ + return 0; +} +static inline int hotplug_node_notifier(notifier_fn_t fn, int pri) +{ + return 0; +} +#endif + #ifdef CONFIG_NUMA extern void node_dev_init(void); /* Core of the node registration - only memory hotplug should use this */ diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index d11278c3840d..0d54a01985ba 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -35,6 +35,7 @@ #include #include #include +#include #include @@ -699,24 +700,6 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages) online_mem_sections(start_pfn, end_pfn); } -/* check which state of node_states will be changed when online memory */ -static void node_states_check_changes_online(unsigned long nr_pages, - struct zone *zone, struct memory_notify *arg) -{ - int nid = zone_to_nid(zone); - - arg->status_change_nid = NUMA_NO_NODE; - - if (!node_state(nid, N_MEMORY)) - arg->status_change_nid = nid; -} - -static void node_states_set_node(int node, struct memory_notify *arg) -{ - if (arg->status_change_nid >= 0) - node_set_state(node, N_MEMORY); -} - static void __meminit resize_zone_range(struct zone *zone, unsigned long start_pfn, unsigned long nr_pages) { @@ -1167,11 +1150,18 @@ void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages) int online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *zone, struct memory_group 
*group) { - unsigned long flags; - int need_zonelists_rebuild = 0; + struct memory_notify mem_arg = { + .start_pfn = pfn, + .nr_pages = nr_pages, + .status_change_nid = NUMA_NO_NODE, + }; + struct node_notify node_arg = { + .nid = NUMA_NO_NODE, + }; const int nid = zone_to_nid(zone); + int need_zonelists_rebuild = 0; + unsigned long flags; int ret; - struct memory_notify arg; /* * {on,off}lining is constrained to full memory sections (or more @@ -1188,11 +1178,17 @@ int online_pages(unsigned long pfn, unsigned long nr_pages, /* associate pfn range with the zone */ move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE); - arg.start_pfn = pfn; - arg.nr_pages = nr_pages; - node_states_check_changes_online(nr_pages, zone, &arg); + if (!node_state(nid, N_MEMORY)) { + /* Adding memory to the node for the first time */ + node_arg.nid = nid; + mem_arg.status_change_nid = nid; + ret = node_notify(NODE_ADDING_FIRST_MEMORY, &node_arg); + ret = notifier_to_errno(ret); + if (ret) + goto failed_addition; + } - ret = memory_notify(MEM_GOING_ONLINE, &arg); + ret = memory_notify(MEM_GOING_ONLINE, &mem_arg); ret = notifier_to_errno(ret); if (ret) goto failed_addition; @@ -1218,7 +1214,8 @@ int online_pages(unsigned long pfn, unsigned long nr_pages, online_pages_range(pfn, nr_pages); adjust_present_page_count(pfn_to_page(pfn), group, nr_pages); - node_states_set_node(nid, &arg); + if (node_arg.nid >= 0) + node_set_state(nid, N_MEMORY); if (need_zonelists_rebuild) build_all_zonelists(NULL); @@ -1239,16 +1236,22 @@ int online_pages(unsigned long pfn, unsigned long nr_pages, kswapd_run(nid); kcompactd_run(nid); + if (node_arg.nid >= 0) + /* First memory added successfully. Notify consumers. */ + node_notify(NODE_ADDED_FIRST_MEMORY, &node_arg); + writeback_set_ratelimit(); - memory_notify(MEM_ONLINE, &arg); + memory_notify(MEM_ONLINE, &mem_arg); return 0; failed_addition: pr_debug("online_pages [mem %#010llx-%#010llx] failed\n", (unsigned long long) pfn << PAGE_SHIFT, (((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1); - memory_notify(MEM_CANCEL_ONLINE, &arg); + memory_notify(MEM_CANCEL_ONLINE, &mem_arg); + if (node_arg.nid != NUMA_NO_NODE) + node_notify(NODE_CANCEL_ADDING_FIRST_MEMORY, &node_arg); remove_pfn_range_from_zone(zone, pfn, nr_pages); return ret; } @@ -1879,48 +1882,6 @@ static int __init cmdline_parse_movable_node(char *p) } early_param("movable_node", cmdline_parse_movable_node); -/* check which state of node_states will be changed when offline memory */ -static void node_states_check_changes_offline(unsigned long nr_pages, - struct zone *zone, struct memory_notify *arg) -{ - struct pglist_data *pgdat = zone->zone_pgdat; - unsigned long present_pages = 0; - enum zone_type zt; - - arg->status_change_nid = NUMA_NO_NODE; - - /* - * Check whether node_states[N_NORMAL_MEMORY] will be changed. - * If the memory to be offline is within the range - * [0..ZONE_NORMAL], and it is the last present memory there, - * the zones in that range will become empty after the offlining, - * thus we can determine that we need to clear the node from - * node_states[N_NORMAL_MEMORY]. - */ - for (zt = 0; zt <= ZONE_NORMAL; zt++) - present_pages += pgdat->node_zones[zt].present_pages; - - /* - * We have accounted the pages from [0..ZONE_NORMAL); ZONE_HIGHMEM - * does not apply as we don't support 32bit. - * Here we count the possible pages from ZONE_MOVABLE. 
- * If after having accounted all the pages, we see that the nr_pages - * to be offlined is over or equal to the accounted pages, - * we know that the node will become empty, and so, we can clear - * it for N_MEMORY as well. - */ - present_pages += pgdat->node_zones[ZONE_MOVABLE].present_pages; - - if (nr_pages >= present_pages) - arg->status_change_nid = zone_to_nid(zone); -} - -static void node_states_clear_node(int node, struct memory_notify *arg) -{ - if (arg->status_change_nid >= 0) - node_clear_state(node, N_MEMORY); -} - static int count_system_ram_pages_cb(unsigned long start_pfn, unsigned long nr_pages, void *data) { @@ -1936,11 +1897,19 @@ static int count_system_ram_pages_cb(unsigned long start_pfn, int offline_pages(unsigned long start_pfn, unsigned long nr_pages, struct zone *zone, struct memory_group *group) { - const unsigned long end_pfn = start_pfn + nr_pages; unsigned long pfn, managed_pages, system_ram_pages = 0; + const unsigned long end_pfn = start_pfn + nr_pages; + struct pglist_data *pgdat = zone->zone_pgdat; const int node = zone_to_nid(zone); + struct memory_notify mem_arg = { + .start_pfn = start_pfn, + .nr_pages = nr_pages, + .status_change_nid = NUMA_NO_NODE, + }; + struct node_notify node_arg = { + .nid = NUMA_NO_NODE, + }; unsigned long flags; - struct memory_notify arg; char *reason; int ret; @@ -1999,11 +1968,23 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages, goto failed_removal_pcplists_disabled; } - arg.start_pfn = start_pfn; - arg.nr_pages = nr_pages; - node_states_check_changes_offline(nr_pages, zone, &arg); + /* + * Check whether the node will have no present pages after we offline + * 'nr_pages' more. If so, we know that the node will become empty, and + * so we will clear N_MEMORY for it. + */ + if (nr_pages >= pgdat->node_present_pages) { + node_arg.nid = node; + mem_arg.status_change_nid = node; + ret = node_notify(NODE_REMOVING_LAST_MEMORY, &node_arg); + ret = notifier_to_errno(ret); + if (ret) { + reason = "node notifier failure"; + goto failed_removal_isolated; + } + } - ret = memory_notify(MEM_GOING_OFFLINE, &arg); + ret = memory_notify(MEM_GOING_OFFLINE, &mem_arg); ret = notifier_to_errno(ret); if (ret) { reason = "notifier failure"; @@ -2083,27 +2064,32 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages, * Make sure to mark the node as memory-less before rebuilding the zone * list. Otherwise this node would still appear in the fallback lists. */ - node_states_clear_node(node, &arg); + if (node_arg.nid >= 0) + node_clear_state(node, N_MEMORY); if (!populated_zone(zone)) { zone_pcp_reset(zone); build_all_zonelists(NULL); } - if (arg.status_change_nid >= 0) { + if (node_arg.nid >= 0) { kcompactd_stop(node); kswapd_stop(node); + /* Node went memoryless. 
Notify consumers */
+		node_notify(NODE_REMOVED_LAST_MEMORY, &node_arg);
 	}

 	writeback_set_ratelimit();
-	memory_notify(MEM_OFFLINE, &arg);
+	memory_notify(MEM_OFFLINE, &mem_arg);
 	remove_pfn_range_from_zone(zone, start_pfn, nr_pages);
 	return 0;

 failed_removal_isolated:
 	/* pushback to free area */
 	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
-	memory_notify(MEM_CANCEL_OFFLINE, &arg);
+	memory_notify(MEM_CANCEL_OFFLINE, &mem_arg);
+	if (node_arg.nid != NUMA_NO_NODE)
+		node_notify(NODE_CANCEL_REMOVING_LAST_MEMORY, &node_arg);
 failed_removal_pcplists_disabled:
 	lru_cache_enable();
 	zone_pcp_enable(zone);

From 5a20c096a165b8533c714aa9f1db06fee6fe4739 Mon Sep 17 00:00:00 2001
From: Oscar Salvador
Date: Mon, 16 Jun 2025 15:51:47 +0200
Subject: [PATCH 136/370] mm,slub: use node-notifier instead of
 memory-notifier

slub is only concerned when a numa node changes its memory state, so stop
using the memory notifier and use the new numa node notifier instead.

[akpm@linux-foundation.org: slub.c needs node.h for struct node_notify]
Link: https://lore.kernel.org/oe-kbuild-all/202506202144.dGkFxasv-lkp@intel.com/
Link: https://lkml.kernel.org/r/20250616135158.450136-5-osalvador@suse.de
Signed-off-by: Oscar Salvador
Reviewed-by: Jonathan Cameron
Reviewed-by: Harry Yoo
Reviewed-by: Vlastimil Babka
Acked-by: David Hildenbrand
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Rakie Kim
Signed-off-by: Andrew Morton
---
 mm/slub.c | 29 ++++++++++-------------------
 1 file changed, 10 insertions(+), 19 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index db3280cc678d..c4b64821e680 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -23,6 +23,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -6149,7 +6150,7 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	return __kmem_cache_do_shrink(s);
 }

-static int slab_mem_going_offline_callback(void *arg)
+static int slab_mem_going_offline_callback(void)
 {
 	struct kmem_cache *s;

@@ -6163,21 +6164,12 @@ static int slab_mem_going_offline_callback(void *arg)
 	return 0;
 }

-static int slab_mem_going_online_callback(void *arg)
+static int slab_mem_going_online_callback(int nid)
 {
 	struct kmem_cache_node *n;
 	struct kmem_cache *s;
-	struct memory_notify *marg = arg;
-	int nid = marg->status_change_nid;
 	int ret = 0;

-	/*
-	 * If the node's memory is already available, then kmem_cache_node is
-	 * already created. Nothing to do.
-	 */
-	if (nid < 0)
-		return 0;
-
 	/*
 	 * We are bringing a node online. No memory is available yet.
We must
	 * allocate a kmem_cache_node structure in order to bring the node
@@ -6217,17 +6209,16 @@ static int slab_memory_callback(struct notifier_block *self,
 				unsigned long action, void *arg)
 {
+	struct node_notify *nn = arg;
+	int nid = nn->nid;
 	int ret = 0;

 	switch (action) {
-	case MEM_GOING_ONLINE:
-		ret = slab_mem_going_online_callback(arg);
+	case NODE_ADDING_FIRST_MEMORY:
+		ret = slab_mem_going_online_callback(nid);
 		break;
-	case MEM_GOING_OFFLINE:
-		ret = slab_mem_going_offline_callback(arg);
-		break;
-	case MEM_OFFLINE:
-	case MEM_CANCEL_ONLINE:
+	case NODE_REMOVING_LAST_MEMORY:
+		ret = slab_mem_going_offline_callback();
 		break;
 	}
 	if (ret)
@@ -6303,7 +6294,7 @@ void __init kmem_cache_init(void)
 				sizeof(struct kmem_cache_node),
 				SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);

-	hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
+	hotplug_node_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);

 	/* Able to allocate the per node structures */
 	slab_state = PARTIAL;

From 265ab0869783d99b9dfbfda1697ce6989441aa04 Mon Sep 17 00:00:00 2001
From: Oscar Salvador
Date: Mon, 16 Jun 2025 15:51:48 +0200
Subject: [PATCH 137/370] mm,memory-tiers: use node-notifier instead of
 memory-notifier

memory-tier is only concerned when a numa node changes its memory state,
because it then needs to re-create the demotion list. So stop using the
memory notifier and use the new numa node notifier instead.

Link: https://lkml.kernel.org/r/20250616135158.450136-6-osalvador@suse.de
Signed-off-by: Oscar Salvador
Reviewed-by: Jonathan Cameron
Reviewed-by: Harry Yoo
Reviewed-by: Vlastimil Babka
Acked-by: David Hildenbrand
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Rakie Kim
Signed-off-by: Andrew Morton
---
 mm/memory-tiers.c | 19 ++++++-------------
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index fc14fe53e9b7..0382b6942b8b 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -872,25 +872,18 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self,
 					      unsigned long action, void *_arg)
 {
 	struct memory_tier *memtier;
-	struct memory_notify *arg = _arg;
-
-	/*
-	 * Only update the node migration order when a node is
-	 * changing status, like online->offline.
- */
-	if (arg->status_change_nid < 0)
-		return notifier_from_errno(0);
+	struct node_notify *nn = _arg;

 	switch (action) {
-	case MEM_OFFLINE:
+	case NODE_REMOVED_LAST_MEMORY:
 		mutex_lock(&memory_tier_lock);
-		if (clear_node_memory_tier(arg->status_change_nid))
+		if (clear_node_memory_tier(nn->nid))
 			establish_demotion_targets();
 		mutex_unlock(&memory_tier_lock);
 		break;
-	case MEM_ONLINE:
+	case NODE_ADDED_FIRST_MEMORY:
 		mutex_lock(&memory_tier_lock);
-		memtier = set_node_memory_tier(arg->status_change_nid);
+		memtier = set_node_memory_tier(nn->nid);
 		if (!IS_ERR(memtier))
 			establish_demotion_targets();
 		mutex_unlock(&memory_tier_lock);
@@ -929,7 +922,7 @@ static int __init memory_tier_init(void)
 	nodes_and(default_dram_nodes, node_states[N_MEMORY],
 		  node_states[N_CPU]);

-	hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI);
+	hotplug_node_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI);
 	return 0;
 }
 subsys_initcall(memory_tier_init);

From 41a9344bb732cf9af5d7a004a836754fa0e7cf56 Mon Sep 17 00:00:00 2001
From: Oscar Salvador
Date: Mon, 16 Jun 2025 15:51:49 +0200
Subject: [PATCH 138/370] drivers,cxl: use node-notifier instead of
 memory-notifier

The cxl driver is only concerned when a numa node changes its memory
state, specifically when a numa node with memory comes into play for the
first time, because it needs to get its performance attributes to build a
proper demotion chain. So stop using the memory notifier and use the new
numa node notifier instead.

Link: https://lkml.kernel.org/r/20250616135158.450136-7-osalvador@suse.de
Signed-off-by: Oscar Salvador
Reviewed-by: Jonathan Cameron
Reviewed-by: Harry Yoo
Reviewed-by: Vlastimil Babka
Acked-by: David Hildenbrand
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Rakie Kim
Signed-off-by: Andrew Morton
---
 drivers/cxl/core/region.c | 16 ++++++++--------
 drivers/cxl/cxl.h         |  4 ++--
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 6e5e1460068d..ba42259c3701 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2451,12 +2451,12 @@ static int cxl_region_perf_attrs_callback(struct notifier_block *nb,
 					  unsigned long action, void *arg)
 {
 	struct cxl_region *cxlr = container_of(nb, struct cxl_region,
-					       memory_notifier);
-	struct memory_notify *mnb = arg;
-	int nid = mnb->status_change_nid;
+					       node_notifier);
+	struct node_notify *nn = arg;
+	int nid = nn->nid;
 	int region_nid;

-	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
+	if (action != NODE_ADDED_FIRST_MEMORY)
 		return NOTIFY_DONE;

 	/*
@@ -3527,7 +3527,7 @@ static void shutdown_notifiers(void *_cxlr)
 {
 	struct cxl_region *cxlr = _cxlr;

-	unregister_memory_notifier(&cxlr->memory_notifier);
+	unregister_node_notifier(&cxlr->node_notifier);
 	unregister_mt_adistance_algorithm(&cxlr->adist_notifier);
 }

@@ -3566,9 +3566,9 @@ static int cxl_region_probe(struct device *dev)
 	if (rc)
 		return rc;

-	cxlr->memory_notifier.notifier_call = cxl_region_perf_attrs_callback;
-	cxlr->memory_notifier.priority = CXL_CALLBACK_PRI;
-	register_memory_notifier(&cxlr->memory_notifier);
+	cxlr->node_notifier.notifier_call = cxl_region_perf_attrs_callback;
+	cxlr->node_notifier.priority = CXL_CALLBACK_PRI;
+	register_node_notifier(&cxlr->node_notifier);

 	cxlr->adist_notifier.notifier_call = cxl_region_calculate_adistance;
 	cxlr->adist_notifier.priority = 100;

diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 3f1695c96abc..c487445214cc 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -513,7 +513,7 @@ enum
cxl_partition_mode {
  * @flags: Region state flags
  * @params: active + config params for the region
  * @coord: QoS access coordinates for the region
- * @memory_notifier: notifier for setting the access coordinates to node
+ * @node_notifier: notifier for setting the access coordinates to node
  * @adist_notifier: notifier for calculating the abstract distance of node
  */
 struct cxl_region {
@@ -526,7 +526,7 @@ struct cxl_region {
 	unsigned long flags;
 	struct cxl_region_params params;
 	struct access_coordinate coord[ACCESS_COORDINATE_MAX];
-	struct notifier_block memory_notifier;
+	struct notifier_block node_notifier;
 	struct notifier_block adist_notifier;
 };

From 487d45d1abeee242a4b12795015584fa8d8f2e67 Mon Sep 17 00:00:00 2001
From: Oscar Salvador
Date: Mon, 16 Jun 2025 15:51:50 +0200
Subject: [PATCH 139/370] drivers,hmat: use node-notifier instead of
 memory-notifier

The hmat driver is only concerned when a numa node changes its memory
state, specifically when a numa node with memory comes into play for the
first time, because it will register the memory_targets belonging to that
numa node. So stop using the memory notifier and use the new numa node
notifier instead.

Link: https://lkml.kernel.org/r/20250616135158.450136-8-osalvador@suse.de
Signed-off-by: Oscar Salvador
Reviewed-by: Jonathan Cameron
Reviewed-by: Harry Yoo
Reviewed-by: Vlastimil Babka
Acked-by: David Hildenbrand
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Rakie Kim
Signed-off-by: Andrew Morton
---
 drivers/acpi/numa/hmat.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index 9d9052258e92..4958301f5417 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -962,10 +962,10 @@ static int hmat_callback(struct notifier_block *self,
 			 unsigned long action, void *arg)
 {
 	struct memory_target *target;
-	struct memory_notify *mnb = arg;
-	int pxm, nid = mnb->status_change_nid;
+	struct node_notify *nn = arg;
+	int pxm, nid = nn->nid;

-	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
+	if (action != NODE_ADDED_FIRST_MEMORY)
 		return NOTIFY_OK;

 	pxm = node_to_pxm(nid);
@@ -1118,7 +1118,7 @@ static __init int hmat_init(void)
 	hmat_register_targets();

 	/* Keep the table and structures if the notifier may use them */
-	if (hotplug_memory_notifier(hmat_callback, HMAT_CALLBACK_PRI))
+	if (hotplug_node_notifier(hmat_callback, HMAT_CALLBACK_PRI))
 		goto out_put;

 	if (!hmat_set_default_dram_perf())

From 8e1bf051c524c6e95194cebf64a839285301d735 Mon Sep 17 00:00:00 2001
From: Oscar Salvador
Date: Mon, 16 Jun 2025 15:51:51 +0200
Subject: [PATCH 140/370] kernel,cpuset: use node-notifier instead of
 memory-notifier

cpuset is only concerned when a numa node changes its memory state, as it
needs to know the current numa nodes with memory to keep an updated
mems_allowed mask. So stop using the memory notifier and use the new numa
node notifier instead.
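The converted callbacks all end up with the same shape; a minimal sketch
(hypothetical consumer, not from the tree) of the new-style registration:

	#include <linux/node.h>
	#include <linux/notifier.h>

	static int foo_node_callback(struct notifier_block *self,
				     unsigned long action, void *arg)
	{
		struct node_notify *nn = arg;

		/* Invoked only for node-level transitions; nn->nid is always valid. */
		switch (action) {
		case NODE_ADDED_FIRST_MEMORY:
			/* node nn->nid stopped being memoryless */
			break;
		case NODE_REMOVED_LAST_MEMORY:
			/* node nn->nid went memoryless */
			break;
		}
		return NOTIFY_OK;
	}

	static int __init foo_init(void)
	{
		hotplug_node_notifier(foo_node_callback, DEFAULT_CALLBACK_PRI);
		return 0;
	}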
Link: https://lkml.kernel.org/r/20250616135158.450136-9-osalvador@suse.de
Signed-off-by: Oscar Salvador
Reviewed-by: Jonathan Cameron
Reviewed-by: Harry Yoo
Reviewed-by: Vlastimil Babka
Acked-by: David Hildenbrand
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Rakie Kim
Signed-off-by: Andrew Morton
---
 kernel/cgroup/cpuset.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 3bc4301466f3..f74d04429a29 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4051,7 +4051,7 @@ void __init cpuset_init_smp(void)
 	cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask);
 	top_cpuset.effective_mems = node_states[N_MEMORY];

-	hotplug_memory_notifier(cpuset_track_online_nodes, CPUSET_CALLBACK_PRI);
+	hotplug_node_notifier(cpuset_track_online_nodes, CPUSET_CALLBACK_PRI);

 	cpuset_migrate_mm_wq = alloc_ordered_workqueue("cpuset_migrate_mm", 0);
 	BUG_ON(!cpuset_migrate_mm_wq);

From cf0b61adf23f2cfaa360cd7d81d224c0bde7f413 Mon Sep 17 00:00:00 2001
From: Oscar Salvador
Date: Mon, 16 Jun 2025 15:51:52 +0200
Subject: [PATCH 141/370] mm,mempolicy: use node-notifier instead of
 memory-notifier

mempolicy is only concerned when a numa node changes its memory state,
because it needs to take this node into account for the auto-weighted
memory policy system. So stop using the memory notifier and use the new
numa node notifier instead.
Link: https://lkml.kernel.org/r/20250616135158.450136-10-osalvador@suse.de
Signed-off-by: Oscar Salvador
Reviewed-by: Jonathan Cameron
Reviewed-by: Harry Yoo
Reviewed-by: Vlastimil Babka
Reviewed-by: Rakie Kim
Acked-by: David Hildenbrand
Reviewed-by: Gregory Price
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton
---
 mm/mempolicy.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b0619d0020c9..1ff7b2174eb7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -3788,20 +3788,17 @@ static int wi_node_notifier(struct notifier_block *nb,
 			    unsigned long action, void *data)
 {
 	int err;
-	struct memory_notify *arg = data;
-	int nid = arg->status_change_nid;
-
-	if (nid < 0)
-		return NOTIFY_OK;
+	struct node_notify *nn = data;
+	int nid = nn->nid;

 	switch (action) {
-	case MEM_ONLINE:
+	case NODE_ADDED_FIRST_MEMORY:
 		err = sysfs_wi_node_add(nid);
 		if (err)
 			pr_err("failed to add sysfs for node%d during hotplug: %d\n",
 			       nid, err);
 		break;
-	case MEM_OFFLINE:
+	case NODE_REMOVED_LAST_MEMORY:
 		sysfs_wi_node_delete(nid);
 		break;
 	}
@@ -3840,7 +3837,7 @@ static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj)
 		}
 	}

-	hotplug_memory_notifier(wi_node_notifier, DEFAULT_CALLBACK_PRI);
+	hotplug_node_notifier(wi_node_notifier, DEFAULT_CALLBACK_PRI);
 	return 0;

 err_cleanup_kobj:

From 1a19c91b9706625684dc109e3fb0d0b2a003c7c5 Mon Sep 17 00:00:00 2001
From: Oscar Salvador
Date: Mon, 16 Jun 2025 15:51:53 +0200
Subject: [PATCH 142/370] mm,page_ext: derive the node from the pfn

page_ext is the only user of 'status_change_nid', which is set in
online/offline operations, to know which node we are adding/removing
memory to/from.

Prior to calling any notifiers, the memmap is initialized, which among
other things sets the node the pages belong to on all corresponding
pages. This means that there is no need to keep using
'status_change_nid', since we can derive the node from the pfn. This
will allow us to finally drop 'status_change_nid' from the memory_notify
struct.
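As a sketch of the idea (simplified, not the patched function): the memmap
covering the range is initialized before any notifier runs, so the node is
recoverable from any pfn in it:

	/*
	 * Sketch: valid once the memmap for start_pfn is initialized, which is
	 * the case by the time MEM_GOING_ONLINE callbacks are invoked.
	 */
	static int nid_of_range(unsigned long start_pfn)
	{
		int nid = pfn_to_nid(start_pfn);

		VM_BUG_ON(!node_online(nid));
		return nid;
	}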
It is possible to get notified for MEM_CANCEL_ONLINE without having been notified for MEM_GOING_ONLINE, and the same applies to MEM_CANCEL_OFFLINE and diff --git a/include/linux/memory.h b/include/linux/memory.h index a8284d41e452..40eb70ccb09d 100644 --- a/include/linux/memory.h +++ b/include/linux/memory.h @@ -109,7 +109,6 @@ struct memory_notify { unsigned long altmap_nr_pages; unsigned long start_pfn; unsigned long nr_pages; - int status_change_nid; }; struct notifier_block; diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 0d54a01985ba..a371d59cc718 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1153,7 +1153,6 @@ int online_pages(unsigned long pfn, unsigned long nr_pages, struct memory_notify mem_arg = { .start_pfn = pfn, .nr_pages = nr_pages, - .status_change_nid = NUMA_NO_NODE, }; struct node_notify node_arg = { .nid = NUMA_NO_NODE, @@ -1181,7 +1180,6 @@ int online_pages(unsigned long pfn, unsigned long nr_pages, if (!node_state(nid, N_MEMORY)) { /* Adding memory to the node for the first time */ node_arg.nid = nid; - mem_arg.status_change_nid = nid; ret = node_notify(NODE_ADDING_FIRST_MEMORY, &node_arg); ret = notifier_to_errno(ret); if (ret) @@ -1904,7 +1902,6 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages, struct memory_notify mem_arg = { .start_pfn = start_pfn, .nr_pages = nr_pages, - .status_change_nid = NUMA_NO_NODE, }; struct node_notify node_arg = { .nid = NUMA_NO_NODE, @@ -1975,7 +1972,6 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages, */ if (nr_pages >= pgdat->node_present_pages) { node_arg.nid = node; - mem_arg.status_change_nid = node; ret = node_notify(NODE_REMOVING_LAST_MEMORY, &node_arg); ret = notifier_to_errno(ret); if (ret) { From 42f46ed99ac6c07adf7f3bcbe9040b0c52d62d0f Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Mon, 16 Jun 2025 22:11:09 -0400 Subject: [PATCH 144/370] mm/page_alloc: pageblock flags functions clean up Patch series "Make MIGRATE_ISOLATE a standalone bit", v10. This patchset moves MIGRATE_ISOLATE to a standalone bit to avoid it being overwritten during the pageblock isolation process. Currently, MIGRATE_ISOLATE is part of enum migratetype (in include/linux/mmzone.h), thus, setting a pageblock to MIGRATE_ISOLATE overwrites its original migratetype. This causes pageblock migratetype loss during alloc_contig_range() and memory offline, especially when the process fails due to a failed pageblock isolation and the code tries to undo the finished pageblock isolations. In terms of performance for changing pageblock types, no performance change is observed: 1. I used perf to collect stats of offlining and onlining all memory of a 40GB VM 10 times and see that get_pfnblock_flags_mask() and set_pfnblock_flags_mask() take about 0.12% and 0.02% of the whole process respectively with and without this patchset across 3 runs. 2. I used perf to collect stats of dd from /dev/random to a 40GB tmpfs file and find get_pfnblock_flags_mask() takes about 0.05% of the process with and without this patchset across 3 runs. This patch (of 6): No functional change is intended. 1. Add __NR_PAGEBLOCK_BITS for the number of pageblock flag bits and use roundup_pow_of_two(__NR_PAGEBLOCK_BITS) as NR_PAGEBLOCK_BITS to take the right amount of bits for pageblock flags. 2. Rename PB_migrate_skip to PB_compact_skip. 3. Add {get,set,clear}_pfnblock_bit() to operate on a standalone bit, like PB_compact_skip. 4.
Make {get,set}_pfnblock_flags_mask() internal functions and use {get,set}_pfnblock_migratetype() for pageblock migratetype operations. 5. Move pageblock flags common code to get_pfnblock_bitmap_bitidx(). 6. Use MIGRATETYPE_MASK to get the migratetype of a pageblock from its flags. 7. Use PB_migrate_end in the definition of MIGRATETYPE_MASK instead of PB_migrate_bits. 8. Add a comment on is_migrate_cma_folio() to prevent anyone from changing it to use get_pageblock_migratetype() and causing issues. Link: https://lkml.kernel.org/r/20250617021115.2331563-1-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250617021115.2331563-2-ziy@nvidia.com Signed-off-by: Zi Yan Reviewed-by: Vlastimil Babka Acked-by: David Hildenbrand Cc: Baolin Wang Cc: Brendan Jackman Cc: Johannes Weiner Cc: Kirill A. Shuemov Cc: Mel Gorman Cc: Michal Hocko Cc: Oscar Salvador Cc: Richard Chang Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- Documentation/mm/physical_memory.rst | 2 +- include/linux/mmzone.h | 18 +-- include/linux/page-isolation.h | 2 +- include/linux/pageblock-flags.h | 34 +++--- mm/memory_hotplug.c | 2 +- mm/page_alloc.c | 171 +++++++++++++++++++++------ 6 files changed, 162 insertions(+), 67 deletions(-) diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst index d3ac106e6b14..9af11b5bd145 100644 --- a/Documentation/mm/physical_memory.rst +++ b/Documentation/mm/physical_memory.rst @@ -584,7 +584,7 @@ Compaction control ``compact_blockskip_flush`` Set to true when compaction migration scanner and free scanner meet, which - means the ``PB_migrate_skip`` bits should be cleared. + means the ``PB_compact_skip`` bits should be cleared. ``contiguous`` Set to true when the zone is contiguous (in other words, no hole). diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 5bec8b1d0e66..76d66c07b673 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -92,8 +92,12 @@ extern const char * const migratetype_names[MIGRATE_TYPES]; #ifdef CONFIG_CMA # define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA) # define is_migrate_cma_page(_page) (get_pageblock_migratetype(_page) == MIGRATE_CMA) -# define is_migrate_cma_folio(folio, pfn) (MIGRATE_CMA == \ - get_pfnblock_flags_mask(&folio->page, pfn, MIGRATETYPE_MASK)) +/* + * __dump_folio() in mm/debug.c passes a folio pointer to on-stack struct folio, + * so folio_pfn() cannot be used and pfn is needed.
+ */ +# define is_migrate_cma_folio(folio, pfn) \ + (get_pfnblock_migratetype(&folio->page, pfn) == MIGRATE_CMA) #else # define is_migrate_cma(migratetype) false # define is_migrate_cma_page(_page) false @@ -122,14 +126,12 @@ static inline bool migratetype_is_mergeable(int mt) extern int page_group_by_mobility_disabled; -#define MIGRATETYPE_MASK ((1UL << PB_migratetype_bits) - 1) +#define get_pageblock_migratetype(page) \ + get_pfnblock_migratetype(page, page_to_pfn(page)) -#define get_pageblock_migratetype(page) \ - get_pfnblock_flags_mask(page, page_to_pfn(page), MIGRATETYPE_MASK) +#define folio_migratetype(folio) \ + get_pageblock_migratetype(&folio->page) -#define folio_migratetype(folio) \ - get_pfnblock_flags_mask(&folio->page, folio_pfn(folio), \ - MIGRATETYPE_MASK) struct free_area { struct list_head free_list[MIGRATE_TYPES]; unsigned long nr_free; diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h index 898bb788243b..277d8d92980c 100644 --- a/include/linux/page-isolation.h +++ b/include/linux/page-isolation.h @@ -25,7 +25,7 @@ static inline bool is_migrate_isolate(int migratetype) #define MEMORY_OFFLINE 0x1 #define REPORT_FAILURE 0x2 -void set_pageblock_migratetype(struct page *page, int migratetype); +void set_pageblock_migratetype(struct page *page, enum migratetype migratetype); bool move_freepages_block_isolate(struct zone *zone, struct page *page, int migratetype); diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h index 6297c6343c55..c240c7a1fb03 100644 --- a/include/linux/pageblock-flags.h +++ b/include/linux/pageblock-flags.h @@ -19,15 +19,19 @@ enum pageblock_bits { PB_migrate, PB_migrate_end = PB_migrate + PB_migratetype_bits - 1, /* 3 bits required for migrate types */ - PB_migrate_skip,/* If set the block is skipped by compaction */ + PB_compact_skip,/* If set the block is skipped by compaction */ /* * Assume the bits will always align on a word. If this assumption * changes then get/set pageblock needs updating. */ - NR_PAGEBLOCK_BITS + __NR_PAGEBLOCK_BITS }; +#define NR_PAGEBLOCK_BITS (roundup_pow_of_two(__NR_PAGEBLOCK_BITS)) + +#define MIGRATETYPE_MASK ((1UL << (PB_migrate_end + 1)) - 1) + #if defined(CONFIG_HUGETLB_PAGE) #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE @@ -65,27 +69,23 @@ extern unsigned int pageblock_order; /* Forward declaration */ struct page; -unsigned long get_pfnblock_flags_mask(const struct page *page, - unsigned long pfn, - unsigned long mask); - -void set_pfnblock_flags_mask(struct page *page, - unsigned long flags, - unsigned long pfn, - unsigned long mask); +enum migratetype get_pfnblock_migratetype(const struct page *page, + unsigned long pfn); +bool get_pfnblock_bit(const struct page *page, unsigned long pfn, + enum pageblock_bits pb_bit); +void set_pfnblock_bit(const struct page *page, unsigned long pfn, + enum pageblock_bits pb_bit); +void clear_pfnblock_bit(const struct page *page, unsigned long pfn, + enum pageblock_bits pb_bit); /* Declarations for getting and setting flags. 
See mm/page_alloc.c */ #ifdef CONFIG_COMPACTION #define get_pageblock_skip(page) \ - get_pfnblock_flags_mask(page, page_to_pfn(page), \ - (1 << (PB_migrate_skip))) + get_pfnblock_bit(page, page_to_pfn(page), PB_compact_skip) #define clear_pageblock_skip(page) \ - set_pfnblock_flags_mask(page, 0, page_to_pfn(page), \ - (1 << PB_migrate_skip)) + clear_pfnblock_bit(page, page_to_pfn(page), PB_compact_skip) #define set_pageblock_skip(page) \ - set_pfnblock_flags_mask(page, (1 << PB_migrate_skip), \ - page_to_pfn(page), \ - (1 << PB_migrate_skip)) + set_pfnblock_bit(page, page_to_pfn(page), PB_compact_skip) #else static inline bool get_pageblock_skip(struct page *page) { diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index a371d59cc718..403221982c2e 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -774,7 +774,7 @@ void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, /* * TODO now we have a visible range of pages which are not associated - * with their zone properly. Not nice but set_pfnblock_flags_mask + * with their zone properly. Not nice but set_pfnblock_migratetype() * expects the zone spans the pfn range. All the pages in the range * are reserved so nobody should be touching them so we should be safe */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 31041e2aa33a..43a4c1532721 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -353,81 +353,174 @@ static inline int pfn_to_bitidx(const struct page *page, unsigned long pfn) return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS; } +static __always_inline bool is_standalone_pb_bit(enum pageblock_bits pb_bit) +{ + return pb_bit > PB_migrate_end && pb_bit < __NR_PAGEBLOCK_BITS; +} + +static __always_inline void +get_pfnblock_bitmap_bitidx(const struct page *page, unsigned long pfn, + unsigned long **bitmap_word, unsigned long *bitidx) +{ + unsigned long *bitmap; + unsigned long word_bitidx; + + BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4); + BUILD_BUG_ON(MIGRATE_TYPES > (1 << PB_migratetype_bits)); + VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page); + + bitmap = get_pageblock_bitmap(page, pfn); + *bitidx = pfn_to_bitidx(page, pfn); + word_bitidx = *bitidx / BITS_PER_LONG; + *bitidx &= (BITS_PER_LONG - 1); + *bitmap_word = &bitmap[word_bitidx]; +} + + /** - * get_pfnblock_flags_mask - Return the requested group of flags for the pageblock_nr_pages block of pages + * __get_pfnblock_flags_mask - Return the requested group of flags for + * a pageblock_nr_pages block of pages * @page: The page within the block of interest * @pfn: The target page frame number * @mask: mask of bits that the caller is interested in * * Return: pageblock_bits flags */ -unsigned long get_pfnblock_flags_mask(const struct page *page, - unsigned long pfn, unsigned long mask) +static unsigned long __get_pfnblock_flags_mask(const struct page *page, + unsigned long pfn, + unsigned long mask) { - unsigned long *bitmap; - unsigned long bitidx, word_bitidx; + unsigned long *bitmap_word; + unsigned long bitidx; unsigned long word; - bitmap = get_pageblock_bitmap(page, pfn); - bitidx = pfn_to_bitidx(page, pfn); - word_bitidx = bitidx / BITS_PER_LONG; - bitidx &= (BITS_PER_LONG-1); + get_pfnblock_bitmap_bitidx(page, pfn, &bitmap_word, &bitidx); /* - * This races, without locks, with set_pfnblock_flags_mask(). Ensure + * This races, without locks, with set_pfnblock_migratetype(). Ensure * a consistent read of the memory array, so that results, even though * racy, are not corrupted. 
*/ - word = READ_ONCE(bitmap[word_bitidx]); + word = READ_ONCE(*bitmap_word); return (word >> bitidx) & mask; } -static __always_inline int get_pfnblock_migratetype(const struct page *page, - unsigned long pfn) +/** + * get_pfnblock_bit - Check if a standalone bit of a pageblock is set + * @page: The page within the block of interest + * @pfn: The target page frame number + * @pb_bit: pageblock bit to check + * + * Return: true if the bit is set, otherwise false + */ +bool get_pfnblock_bit(const struct page *page, unsigned long pfn, + enum pageblock_bits pb_bit) { - return get_pfnblock_flags_mask(page, pfn, MIGRATETYPE_MASK); + unsigned long *bitmap_word; + unsigned long bitidx; + + if (WARN_ON_ONCE(!is_standalone_pb_bit(pb_bit))) + return false; + + get_pfnblock_bitmap_bitidx(page, pfn, &bitmap_word, &bitidx); + + return test_bit(bitidx + pb_bit, bitmap_word); } /** - * set_pfnblock_flags_mask - Set the requested group of flags for a pageblock_nr_pages block of pages + * get_pfnblock_migratetype - Return the migratetype of a pageblock * @page: The page within the block of interest - * @flags: The flags to set * @pfn: The target page frame number + * + * Return: The migratetype of the pageblock + * + * Use get_pfnblock_migratetype() if caller already has both @page and @pfn + * to save a call to page_to_pfn(). + */ +__always_inline enum migratetype +get_pfnblock_migratetype(const struct page *page, unsigned long pfn) +{ + return __get_pfnblock_flags_mask(page, pfn, MIGRATETYPE_MASK); +} + +/** + * __set_pfnblock_flags_mask - Set the requested group of flags for + * a pageblock_nr_pages block of pages + * @page: The page within the block of interest + * @pfn: The target page frame number + * @flags: The flags to set * @mask: mask of bits that the caller is interested in */ -void set_pfnblock_flags_mask(struct page *page, unsigned long flags, - unsigned long pfn, - unsigned long mask) +static void __set_pfnblock_flags_mask(struct page *page, unsigned long pfn, + unsigned long flags, unsigned long mask) { - unsigned long *bitmap; - unsigned long bitidx, word_bitidx; + unsigned long *bitmap_word; + unsigned long bitidx; unsigned long word; - BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4); - BUILD_BUG_ON(MIGRATE_TYPES > (1 << PB_migratetype_bits)); - - bitmap = get_pageblock_bitmap(page, pfn); - bitidx = pfn_to_bitidx(page, pfn); - word_bitidx = bitidx / BITS_PER_LONG; - bitidx &= (BITS_PER_LONG-1); - - VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page); + get_pfnblock_bitmap_bitidx(page, pfn, &bitmap_word, &bitidx); mask <<= bitidx; flags <<= bitidx; - word = READ_ONCE(bitmap[word_bitidx]); + word = READ_ONCE(*bitmap_word); do { - } while (!try_cmpxchg(&bitmap[word_bitidx], &word, (word & ~mask) | flags)); + } while (!try_cmpxchg(bitmap_word, &word, (word & ~mask) | flags)); } -void set_pageblock_migratetype(struct page *page, int migratetype) +/** + * set_pfnblock_bit - Set a standalone bit of a pageblock + * @page: The page within the block of interest + * @pfn: The target page frame number + * @pb_bit: pageblock bit to set + */ +void set_pfnblock_bit(const struct page *page, unsigned long pfn, + enum pageblock_bits pb_bit) +{ + unsigned long *bitmap_word; + unsigned long bitidx; + + if (WARN_ON_ONCE(!is_standalone_pb_bit(pb_bit))) + return; + + get_pfnblock_bitmap_bitidx(page, pfn, &bitmap_word, &bitidx); + + set_bit(bitidx + pb_bit, bitmap_word); +} + +/** + * clear_pfnblock_bit - Clear a standalone bit of a pageblock + * @page: The page within the block of interest + * @pfn: The target 
page frame number + * @pb_bit: pageblock bit to clear + */ +void clear_pfnblock_bit(const struct page *page, unsigned long pfn, + enum pageblock_bits pb_bit) +{ + unsigned long *bitmap_word; + unsigned long bitidx; + + if (WARN_ON_ONCE(!is_standalone_pb_bit(pb_bit))) + return; + + get_pfnblock_bitmap_bitidx(page, pfn, &bitmap_word, &bitidx); + + clear_bit(bitidx + pb_bit, bitmap_word); +} + +/** + * set_pageblock_migratetype - Set the migratetype of a pageblock + * @page: The page within the block of interest + * @migratetype: migratetype to set + */ +__always_inline void set_pageblock_migratetype(struct page *page, + enum migratetype migratetype) { if (unlikely(page_group_by_mobility_disabled && migratetype < MIGRATE_PCPTYPES)) migratetype = MIGRATE_UNMOVABLE; - set_pfnblock_flags_mask(page, (unsigned long)migratetype, - page_to_pfn(page), MIGRATETYPE_MASK); + __set_pfnblock_flags_mask(page, page_to_pfn(page), + (unsigned long)migratetype, MIGRATETYPE_MASK); } #ifdef CONFIG_DEBUG_VM @@ -667,7 +760,7 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone, int nr_pages = 1 << order; VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype, - "page type is %lu, passed migratetype is %d (nr=%d)\n", + "page type is %d, passed migratetype is %d (nr=%d)\n", get_pageblock_migratetype(page), migratetype, nr_pages); if (tail) @@ -693,7 +786,7 @@ static inline void move_to_free_list(struct page *page, struct zone *zone, /* Free page moving can fail, so it happens before the type update */ VM_WARN_ONCE(get_pageblock_migratetype(page) != old_mt, - "page type is %lu, passed migratetype is %d (nr=%d)\n", + "page type is %d, passed migratetype is %d (nr=%d)\n", get_pageblock_migratetype(page), old_mt, nr_pages); list_move_tail(&page->buddy_list, &area->free_list[new_mt]); @@ -715,7 +808,7 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon int nr_pages = 1 << order; VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype, - "page type is %lu, passed migratetype is %d (nr=%d)\n", + "page type is %d, passed migratetype is %d (nr=%d)\n", get_pageblock_migratetype(page), migratetype, nr_pages); /* clear reported state and update reported page count */ @@ -3123,7 +3216,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone, /* * Do not instrument rmqueue() with KMSAN. This function may call - * __msan_poison_alloca() through a call to set_pfnblock_flags_mask(). + * __msan_poison_alloca() through a call to set_pfnblock_migratetype(). * If __msan_poison_alloca() attempts to allocate pages for the stack depot, it * may call rmqueue() again, which will result in a deadlock. */ From e904bce2d9d43e0f370e238457a13847d161570b Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Mon, 16 Jun 2025 22:11:10 -0400 Subject: [PATCH 145/370] mm/page_isolation: make page isolation a standalone bit During page isolation, the original migratetype is overwritten, since MIGRATE_* are enums and stored in pageblock bitmaps. Change MIGRATE_ISOLATE to be stored as a standalone bit, PB_migrate_isolate, like PB_compact_skip, so that migratetype is not lost during pageblock isolation. Link: https://lkml.kernel.org/r/20250617021115.2331563-3-ziy@nvidia.com Signed-off-by: Zi Yan Reviewed-by: Vlastimil Babka Acked-by: David Hildenbrand Cc: Baolin Wang Cc: Brendan Jackman Cc: Johannes Weiner Cc: Kirill A.
Shuemov Cc: Mel Gorman Cc: Michal Hocko Cc: Oscar Salvador Cc: Richard Chang Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 3 +++ include/linux/page-isolation.h | 16 ++++++++++++++++ include/linux/pageblock-flags.h | 14 ++++++++++++++ mm/page_alloc.c | 27 ++++++++++++++++++++++++--- 4 files changed, 57 insertions(+), 3 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 76d66c07b673..1d1bb2b7f40d 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -79,6 +79,9 @@ enum migratetype { * __free_pageblock_cma() function. */ MIGRATE_CMA, + __MIGRATE_TYPE_END = MIGRATE_CMA, +#else + __MIGRATE_TYPE_END = MIGRATE_HIGHATOMIC, #endif #ifdef CONFIG_MEMORY_ISOLATION MIGRATE_ISOLATE, /* can't allocate from here */ diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h index 277d8d92980c..fc021d3f95ca 100644 --- a/include/linux/page-isolation.h +++ b/include/linux/page-isolation.h @@ -11,6 +11,12 @@ static inline bool is_migrate_isolate(int migratetype) { return migratetype == MIGRATE_ISOLATE; } +#define get_pageblock_isolate(page) \ + get_pfnblock_bit(page, page_to_pfn(page), PB_migrate_isolate) +#define clear_pageblock_isolate(page) \ + clear_pfnblock_bit(page, page_to_pfn(page), PB_migrate_isolate) +#define set_pageblock_isolate(page) \ + set_pfnblock_bit(page, page_to_pfn(page), PB_migrate_isolate) #else static inline bool is_migrate_isolate_page(struct page *page) { @@ -20,6 +26,16 @@ static inline bool is_migrate_isolate(int migratetype) { return false; } +static inline bool get_pageblock_isolate(struct page *page) +{ + return false; +} +static inline void clear_pageblock_isolate(struct page *page) +{ +} +static inline void set_pageblock_isolate(struct page *page) +{ +} #endif #define MEMORY_OFFLINE 0x1 diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h index c240c7a1fb03..6a44be0f39f4 100644 --- a/include/linux/pageblock-flags.h +++ b/include/linux/pageblock-flags.h @@ -21,6 +21,13 @@ enum pageblock_bits { /* 3 bits required for migrate types */ PB_compact_skip,/* If set the block is skipped by compaction */ +#ifdef CONFIG_MEMORY_ISOLATION + /* + * Pageblock isolation is represented with a separate bit, so that + * the migratetype of a block is not overwritten by isolation. + */ + PB_migrate_isolate, /* If set the block is isolated */ +#endif /* * Assume the bits will always align on a word. If this assumption * changes then get/set pageblock needs updating. 
@@ -32,6 +39,13 @@ enum pageblock_bits { #define MIGRATETYPE_MASK ((1UL << (PB_migrate_end + 1)) - 1) +#ifdef CONFIG_MEMORY_ISOLATION +#define MIGRATETYPE_AND_ISO_MASK \ + (((1UL << (PB_migrate_end + 1)) - 1) | BIT(PB_migrate_isolate)) +#else +#define MIGRATETYPE_AND_ISO_MASK MIGRATETYPE_MASK +#endif + #if defined(CONFIG_HUGETLB_PAGE) #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 43a4c1532721..61dd34102c14 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -365,8 +365,12 @@ get_pfnblock_bitmap_bitidx(const struct page *page, unsigned long pfn, unsigned long *bitmap; unsigned long word_bitidx; +#ifdef CONFIG_MEMORY_ISOLATION + BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 8); +#else BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4); - BUILD_BUG_ON(MIGRATE_TYPES > (1 << PB_migratetype_bits)); +#endif + BUILD_BUG_ON(__MIGRATE_TYPE_END >= (1 << PB_migratetype_bits)); VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page); bitmap = get_pageblock_bitmap(page, pfn); @@ -439,7 +443,16 @@ bool get_pfnblock_bit(const struct page *page, unsigned long pfn, __always_inline enum migratetype get_pfnblock_migratetype(const struct page *page, unsigned long pfn) { - return __get_pfnblock_flags_mask(page, pfn, MIGRATETYPE_MASK); + unsigned long mask = MIGRATETYPE_AND_ISO_MASK; + unsigned long flags; + + flags = __get_pfnblock_flags_mask(page, pfn, mask); + +#ifdef CONFIG_MEMORY_ISOLATION + if (flags & BIT(PB_migrate_isolate)) + return MIGRATE_ISOLATE; +#endif + return flags & MIGRATETYPE_MASK; } /** @@ -519,8 +532,16 @@ __always_inline void set_pageblock_migratetype(struct page *page, migratetype < MIGRATE_PCPTYPES)) migratetype = MIGRATE_UNMOVABLE; +#ifdef CONFIG_MEMORY_ISOLATION + if (migratetype == MIGRATE_ISOLATE) { + set_pfnblock_bit(page, page_to_pfn(page), PB_migrate_isolate); + return; + } + /* MIGRATETYPE_AND_ISO_MASK clears PB_migrate_isolate if it is set */ +#endif __set_pfnblock_flags_mask(page, page_to_pfn(page), - (unsigned long)migratetype, MIGRATETYPE_MASK); + (unsigned long)migratetype, + MIGRATETYPE_AND_ISO_MASK); } #ifdef CONFIG_DEBUG_VM From 1bc3587a88d291a37dab12d6c14aa7da53304251 Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Mon, 16 Jun 2025 22:11:11 -0400 Subject: [PATCH 146/370] mm/page_alloc: add support for initializing pageblock as isolated MIGRATE_ISOLATE is a standalone bit, so a pageblock cannot be initialized to just MIGRATE_ISOLATE. Add init_pageblock_migratetype() to allow initializing a pageblock with a migratetype and, optionally, the isolated bit set. Link: https://lkml.kernel.org/r/20250617021115.2331563-4-ziy@nvidia.com Signed-off-by: Zi Yan Reviewed-by: Vlastimil Babka Acked-by: David Hildenbrand Cc: Baolin Wang Cc: Brendan Jackman Cc: Johannes Weiner Cc: Kirill A.
Shuemov Cc: Mel Gorman Cc: Michal Hocko Cc: Oscar Salvador Cc: Richard Chang Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- include/linux/memory_hotplug.h | 3 ++- include/linux/page-isolation.h | 3 +++ kernel/kexec_handover.c | 4 ++-- mm/hugetlb.c | 4 ++-- mm/internal.h | 3 ++- mm/memory_hotplug.c | 12 ++++++++---- mm/memremap.c | 2 +- mm/mm_init.c | 24 +++++++++++++++--------- mm/page_alloc.c | 26 ++++++++++++++++++++++++++ 9 files changed, 61 insertions(+), 20 deletions(-) diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index eaac5ae8c05c..23f038a16231 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -314,7 +314,8 @@ extern int add_memory_driver_managed(int nid, u64 start, u64 size, mhp_t mhp_flags); extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, unsigned long nr_pages, - struct vmem_altmap *altmap, int migratetype); + struct vmem_altmap *altmap, int migratetype, + bool isolate_pageblock); extern void remove_pfn_range_from_zone(struct zone *zone, unsigned long start_pfn, unsigned long nr_pages); diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h index fc021d3f95ca..14c6a5f691c2 100644 --- a/include/linux/page-isolation.h +++ b/include/linux/page-isolation.h @@ -41,6 +41,9 @@ static inline void set_pageblock_isolate(struct page *page) #define MEMORY_OFFLINE 0x1 #define REPORT_FAILURE 0x2 +void __meminit init_pageblock_migratetype(struct page *page, + enum migratetype migratetype, + bool isolate); void set_pageblock_migratetype(struct page *page, enum migratetype migratetype); bool move_freepages_block_isolate(struct zone *zone, struct page *page, diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c index 5a21dbe17950..49634cc3fb43 100644 --- a/kernel/kexec_handover.c +++ b/kernel/kexec_handover.c @@ -1100,8 +1100,8 @@ static void __init kho_release_scratch(void) ulong pfn; for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) - set_pageblock_migratetype(pfn_to_page(pfn), - MIGRATE_CMA); + init_pageblock_migratetype(pfn_to_page(pfn), + MIGRATE_CMA, false); } } diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c03896375749..11d5668ff6e7 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3297,8 +3297,8 @@ static void __init hugetlb_bootmem_init_migratetype(struct folio *folio, if (folio_test_hugetlb_cma(folio)) init_cma_pageblock(folio_page(folio, i)); else - set_pageblock_migratetype(folio_page(folio, i), - MIGRATE_MOVABLE); + init_pageblock_migratetype(folio_page(folio, i), + MIGRATE_MOVABLE, false); } } diff --git a/mm/internal.h b/mm/internal.h index fe83dfca3c72..22a95a2b7fa1 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -820,7 +820,8 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t align, int nid, bool exact_nid); void memmap_init_range(unsigned long, int, unsigned long, unsigned long, - unsigned long, enum meminit_context, struct vmem_altmap *, int); + unsigned long, enum meminit_context, struct vmem_altmap *, int, + bool); #if defined CONFIG_COMPACTION || defined CONFIG_CMA diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 403221982c2e..a3c2b0784070 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -747,7 +747,8 @@ static inline void section_taint_zone_device(unsigned long pfn) */ void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, unsigned long nr_pages, - struct vmem_altmap *altmap, int migratetype) + struct vmem_altmap *altmap, int migratetype, + bool isolate_pageblock) { struct 
pglist_data *pgdat = zone->zone_pgdat; int nid = pgdat->node_id; @@ -779,7 +780,8 @@ void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, * are reserved so nobody should be touching them so we should be safe */ memmap_init_range(nr_pages, nid, zone_idx(zone), start_pfn, 0, - MEMINIT_HOTPLUG, altmap, migratetype); + MEMINIT_HOTPLUG, altmap, migratetype, + isolate_pageblock); set_zone_contiguous(zone); } @@ -1104,7 +1106,8 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages, if (mhp_off_inaccessible) page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages); - move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE); + move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE, + false); for (i = 0; i < nr_pages; i++) { struct page *page = pfn_to_page(pfn + i); @@ -1175,7 +1178,8 @@ int online_pages(unsigned long pfn, unsigned long nr_pages, /* associate pfn range with the zone */ - move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE); + move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_MOVABLE, + true); if (!node_state(nid, N_MEMORY)) { /* Adding memory to the node for the first time */ diff --git a/mm/memremap.c b/mm/memremap.c index f75078c14839..b0ce0d8254bd 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -228,7 +228,7 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params, zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE]; move_pfn_range_to_zone(zone, PHYS_PFN(range->start), PHYS_PFN(range_len(range)), params->altmap, - MIGRATE_MOVABLE); + MIGRATE_MOVABLE, false); } mem_hotplug_done(); diff --git a/mm/mm_init.c b/mm/mm_init.c index 02f41e2bdf60..5c21b3af216b 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -685,7 +685,8 @@ void __meminit __init_page_from_nid(unsigned long pfn, int nid) __init_single_page(pfn_to_page(pfn), pfn, zid, nid); if (pageblock_aligned(pfn)) - set_pageblock_migratetype(pfn_to_page(pfn), MIGRATE_MOVABLE); + init_pageblock_migratetype(pfn_to_page(pfn), MIGRATE_MOVABLE, + false); } #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT @@ -874,7 +875,8 @@ static void __init init_unavailable_range(unsigned long spfn, void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone, unsigned long start_pfn, unsigned long zone_end_pfn, enum meminit_context context, - struct vmem_altmap *altmap, int migratetype) + struct vmem_altmap *altmap, int migratetype, + bool isolate_pageblock) { unsigned long pfn, end_pfn = start_pfn + size; struct page *page; @@ -931,7 +933,8 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone * over the place during system boot. 
*/ if (pageblock_aligned(pfn)) { - set_pageblock_migratetype(page, migratetype); + init_pageblock_migratetype(page, migratetype, + isolate_pageblock); cond_resched(); } pfn++; @@ -954,7 +957,8 @@ static void __init memmap_init_zone_range(struct zone *zone, return; memmap_init_range(end_pfn - start_pfn, nid, zone_id, start_pfn, - zone_end_pfn, MEMINIT_EARLY, NULL, MIGRATE_MOVABLE); + zone_end_pfn, MEMINIT_EARLY, NULL, MIGRATE_MOVABLE, + false); if (*hole_pfn < start_pfn) init_unavailable_range(*hole_pfn, start_pfn, zone_id, nid); @@ -1035,7 +1039,7 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn, * because this is done early in section_activate() */ if (pageblock_aligned(pfn)) { - set_pageblock_migratetype(page, MIGRATE_MOVABLE); + init_pageblock_migratetype(page, MIGRATE_MOVABLE, false); cond_resched(); } @@ -1996,7 +2000,8 @@ static void __init deferred_free_pages(unsigned long pfn, /* Free a large naturally-aligned chunk if possible */ if (nr_pages == MAX_ORDER_NR_PAGES && IS_MAX_ORDER_ALIGNED(pfn)) { for (i = 0; i < nr_pages; i += pageblock_nr_pages) - set_pageblock_migratetype(page + i, MIGRATE_MOVABLE); + init_pageblock_migratetype(page + i, MIGRATE_MOVABLE, + false); __free_pages_core(page, MAX_PAGE_ORDER, MEMINIT_EARLY); return; } @@ -2006,7 +2011,8 @@ static void __init deferred_free_pages(unsigned long pfn, for (i = 0; i < nr_pages; i++, page++, pfn++) { if (pageblock_aligned(pfn)) - set_pageblock_migratetype(page, MIGRATE_MOVABLE); + init_pageblock_migratetype(page, MIGRATE_MOVABLE, + false); __free_pages_core(page, 0, MEMINIT_EARLY); } } @@ -2305,7 +2311,7 @@ void __init init_cma_reserved_pageblock(struct page *page) set_page_count(p, 0); } while (++p, --i); - set_pageblock_migratetype(page, MIGRATE_CMA); + init_pageblock_migratetype(page, MIGRATE_CMA, false); set_page_refcounted(page); /* pages were reserved and not allocated */ clear_page_tag_ref(page); @@ -2319,7 +2325,7 @@ void __init init_cma_reserved_pageblock(struct page *page) */ void __init init_cma_pageblock(struct page *page) { - set_pageblock_migratetype(page, MIGRATE_CMA); + init_pageblock_migratetype(page, MIGRATE_CMA, false); adjust_managed_page_count(page, pageblock_nr_pages); page_zone(page)->cma_pages += pageblock_nr_pages; } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 61dd34102c14..c7730264bf5f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -544,6 +544,32 @@ __always_inline void set_pageblock_migratetype(struct page *page, MIGRATETYPE_AND_ISO_MASK); } +void __meminit init_pageblock_migratetype(struct page *page, + enum migratetype migratetype, + bool isolate) +{ + unsigned long flags; + + if (unlikely(page_group_by_mobility_disabled && + migratetype < MIGRATE_PCPTYPES)) + migratetype = MIGRATE_UNMOVABLE; + + flags = migratetype; + +#ifdef CONFIG_MEMORY_ISOLATION + if (migratetype == MIGRATE_ISOLATE) { + VM_WARN_ONCE( + 1, + "Set isolate=true to isolate pageblock with a migratetype"); + return; + } + if (isolate) + flags |= BIT(PB_migrate_isolate); +#endif + __set_pfnblock_flags_mask(page, page_to_pfn(page), flags, + MIGRATETYPE_AND_ISO_MASK); +} + #ifdef CONFIG_DEBUG_VM static int page_outside_zone_boundaries(struct zone *zone, struct page *page) { From b1df9c5713dc41229667aa44eaea2399e8de9470 Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Mon, 16 Jun 2025 22:11:12 -0400 Subject: [PATCH 147/370] mm/page_isolation: remove migratetype from move_freepages_block_isolate() Since migratetype is no longer overwritten during pageblock isolation, moving a pageblock out of 
MIGRATE_ISOLATE no longer needs a new migratetype. Add pageblock_isolate_and_move_free_pages() and pageblock_unisolate_and_move_free_pages() to be explicit about the page isolation operations. Both share the common code in __move_freepages_block_isolate(), which is renamed from move_freepages_block_isolate(). Add toggle_pageblock_isolate() to flip the pageblock isolation bit in __move_freepages_block_isolate(). Make set_pageblock_migratetype() only accept non-MIGRATE_ISOLATE types, so that one should use set_pageblock_isolate() to isolate pageblocks. As a result, move pageblock migratetype code out of __move_freepages_block(). Link: https://lkml.kernel.org/r/20250617021115.2331563-5-ziy@nvidia.com Signed-off-by: Zi Yan Reviewed-by: Vlastimil Babka Acked-by: David Hildenbrand Cc: Baolin Wang Cc: Brendan Jackman Cc: Johannes Weiner Cc: Kirill A. Shuemov Cc: Mel Gorman Cc: Michal Hocko Cc: Oscar Salvador Cc: Richard Chang Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- include/linux/page-isolation.h | 5 +-- mm/page_alloc.c | 80 ++++++++++++++++++++++++++-------- mm/page_isolation.c | 21 +++++---- 3 files changed, 75 insertions(+), 31 deletions(-) diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h index 14c6a5f691c2..7241a6719618 100644 --- a/include/linux/page-isolation.h +++ b/include/linux/page-isolation.h @@ -44,10 +44,9 @@ static inline void set_pageblock_isolate(struct page *page) void __meminit init_pageblock_migratetype(struct page *page, enum migratetype migratetype, bool isolate); -void set_pageblock_migratetype(struct page *page, enum migratetype migratetype); -bool move_freepages_block_isolate(struct zone *zone, struct page *page, - int migratetype); +bool pageblock_isolate_and_move_free_pages(struct zone *zone, struct page *page); +bool pageblock_unisolate_and_move_free_pages(struct zone *zone, struct page *page); int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, int migratetype, int flags); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c7730264bf5f..938b01bed1f6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -525,8 +525,8 @@ void clear_pfnblock_bit(const struct page *page, unsigned long pfn, * @page: The page within the block of interest * @migratetype: migratetype to set */ -__always_inline void set_pageblock_migratetype(struct page *page, - enum migratetype migratetype) +static void set_pageblock_migratetype(struct page *page, + enum migratetype migratetype) { if (unlikely(page_group_by_mobility_disabled && migratetype < MIGRATE_PCPTYPES)) @@ -534,9 +534,13 @@ __always_inline void set_pageblock_migratetype(struct page *page, #ifdef CONFIG_MEMORY_ISOLATION if (migratetype == MIGRATE_ISOLATE) { - set_pfnblock_bit(page, page_to_pfn(page), PB_migrate_isolate); + VM_WARN_ONCE(1, + "Use set_pageblock_isolate() for pageblock isolation"); return; } + VM_WARN_ONCE(get_pfnblock_bit(page, page_to_pfn(page), + PB_migrate_isolate), + "Use clear_pageblock_isolate() to unisolate pageblock"); /* MIGRATETYPE_AND_ISO_MASK clears PB_migrate_isolate if it is set */ #endif __set_pfnblock_flags_mask(page, page_to_pfn(page), @@ -1921,8 +1925,8 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone, #endif /* - * Change the type of a block and move all its free pages to that - * type's freelist. + * Move all free pages of a block to new type's freelist. Caller needs to + * change the block type.
*/ static int __move_freepages_block(struct zone *zone, unsigned long start_pfn, int old_mt, int new_mt) @@ -1954,8 +1958,6 @@ static int __move_freepages_block(struct zone *zone, unsigned long start_pfn, pages_moved += 1 << order; } - set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt); - return pages_moved; } @@ -2013,11 +2015,16 @@ static int move_freepages_block(struct zone *zone, struct page *page, int old_mt, int new_mt) { unsigned long start_pfn; + int res; if (!prep_move_freepages_block(zone, page, &start_pfn, NULL, NULL)) return -1; - return __move_freepages_block(zone, start_pfn, old_mt, new_mt); + res = __move_freepages_block(zone, start_pfn, old_mt, new_mt); + set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt); + + return res; + } #ifdef CONFIG_MEMORY_ISOLATION @@ -2045,11 +2052,19 @@ static unsigned long find_large_buddy(unsigned long start_pfn) return start_pfn; } +static inline void toggle_pageblock_isolate(struct page *page, bool isolate) +{ + if (isolate) + set_pfnblock_bit(page, page_to_pfn(page), PB_migrate_isolate); + else + clear_pfnblock_bit(page, page_to_pfn(page), PB_migrate_isolate); +} + /** - * move_freepages_block_isolate - move free pages in block for page isolation + * __move_freepages_block_isolate - move free pages in block for page isolation * @zone: the zone * @page: the pageblock page - * @migratetype: migratetype to set on the pageblock + * @isolate: to isolate the given pageblock or unisolate it * * This is similar to move_freepages_block(), but handles the special * case encountered in page isolation, where the block of interest @@ -2064,10 +2079,18 @@ static unsigned long find_large_buddy(unsigned long start_pfn) * * Returns %true if pages could be moved, %false otherwise. */ -bool move_freepages_block_isolate(struct zone *zone, struct page *page, - int migratetype) +static bool __move_freepages_block_isolate(struct zone *zone, + struct page *page, bool isolate) { unsigned long start_pfn, pfn; + int from_mt; + int to_mt; + + if (isolate == get_pageblock_isolate(page)) { + VM_WARN_ONCE(1, "%s a pageblock that is already in that state", + isolate ? 
"Isolate" : "Unisolate"); + return false; + } if (!prep_move_freepages_block(zone, page, &start_pfn, NULL, NULL)) return false; @@ -2084,7 +2107,7 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page, del_page_from_free_list(buddy, zone, order, get_pfnblock_migratetype(buddy, pfn)); - set_pageblock_migratetype(page, migratetype); + toggle_pageblock_isolate(page, isolate); split_large_buddy(zone, buddy, pfn, order, FPI_NONE); return true; } @@ -2095,16 +2118,38 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page, del_page_from_free_list(page, zone, order, get_pfnblock_migratetype(page, pfn)); - set_pageblock_migratetype(page, migratetype); + toggle_pageblock_isolate(page, isolate); split_large_buddy(zone, page, pfn, order, FPI_NONE); return true; } move: - __move_freepages_block(zone, start_pfn, - get_pfnblock_migratetype(page, start_pfn), - migratetype); + /* Use MIGRATETYPE_MASK to get non-isolate migratetype */ + if (isolate) { + from_mt = __get_pfnblock_flags_mask(page, page_to_pfn(page), + MIGRATETYPE_MASK); + to_mt = MIGRATE_ISOLATE; + } else { + from_mt = MIGRATE_ISOLATE; + to_mt = __get_pfnblock_flags_mask(page, page_to_pfn(page), + MIGRATETYPE_MASK); + } + + __move_freepages_block(zone, start_pfn, from_mt, to_mt); + toggle_pageblock_isolate(pfn_to_page(start_pfn), isolate); + return true; } + +bool pageblock_isolate_and_move_free_pages(struct zone *zone, struct page *page) +{ + return __move_freepages_block_isolate(zone, page, true); +} + +bool pageblock_unisolate_and_move_free_pages(struct zone *zone, struct page *page) +{ + return __move_freepages_block_isolate(zone, page, false); +} + #endif /* CONFIG_MEMORY_ISOLATION */ static void change_pageblock_range(struct page *pageblock_page, @@ -2296,6 +2341,7 @@ try_to_claim_block(struct zone *zone, struct page *page, if (free_pages + alike_pages >= (1 << (pageblock_order-1)) || page_group_by_mobility_disabled) { __move_freepages_block(zone, start_pfn, block_type, start_type); + set_pageblock_migratetype(pfn_to_page(start_pfn), start_type); return __rmqueue_smallest(zone, order, start_type); } diff --git a/mm/page_isolation.c b/mm/page_isolation.c index b2fc5266e3d2..08f627a5032f 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -188,7 +188,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ unmovable = has_unmovable_pages(check_unmovable_start, check_unmovable_end, migratetype, isol_flags); if (!unmovable) { - if (!move_freepages_block_isolate(zone, page, MIGRATE_ISOLATE)) { + if (!pageblock_isolate_and_move_free_pages(zone, page)) { spin_unlock_irqrestore(&zone->lock, flags); return -EBUSY; } @@ -209,7 +209,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ return -EBUSY; } -static void unset_migratetype_isolate(struct page *page, int migratetype) +static void unset_migratetype_isolate(struct page *page) { struct zone *zone; unsigned long flags; @@ -262,10 +262,10 @@ static void unset_migratetype_isolate(struct page *page, int migratetype) * Isolating this block already succeeded, so this * should not fail on zone boundaries. 
*/ - WARN_ON_ONCE(!move_freepages_block_isolate(zone, page, migratetype)); + WARN_ON_ONCE(!pageblock_unisolate_and_move_free_pages(zone, page)); } else { - set_pageblock_migratetype(page, migratetype); - __putback_isolated_page(page, order, migratetype); + clear_pageblock_isolate(page); + __putback_isolated_page(page, order, get_pageblock_migratetype(page)); } zone->nr_isolate_pageblock--; out: @@ -383,7 +383,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, if (PageBuddy(page)) { int order = buddy_order(page); - /* move_freepages_block_isolate() handled this */ + /* pageblock_isolate_and_move_free_pages() handled this */ VM_WARN_ON_ONCE(pfn + (1 << order) > boundary_pfn); pfn += 1UL << order; @@ -433,7 +433,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, failed: /* restore the original migratetype */ if (!skip_isolation) - unset_migratetype_isolate(pfn_to_page(isolate_pageblock), migratetype); + unset_migratetype_isolate(pfn_to_page(isolate_pageblock)); return -EBUSY; } @@ -504,7 +504,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, ret = isolate_single_pageblock(isolate_end, flags, true, skip_isolation, migratetype); if (ret) { - unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype); + unset_migratetype_isolate(pfn_to_page(isolate_start)); return ret; } @@ -517,8 +517,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, start_pfn, end_pfn)) { undo_isolate_page_range(isolate_start, pfn, migratetype); unset_migratetype_isolate( - pfn_to_page(isolate_end - pageblock_nr_pages), - migratetype); + pfn_to_page(isolate_end - pageblock_nr_pages)); return -EBUSY; } } @@ -548,7 +547,7 @@ void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, page = __first_valid_page(pfn, pageblock_nr_pages); if (!page || !is_migrate_isolate_page(page)) continue; - unset_migratetype_isolate(page, migratetype); + unset_migratetype_isolate(page); } } /* From 7a3324eb66f616408fdaaff8a1289c0a9b333748 Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Mon, 16 Jun 2025 22:11:13 -0400 Subject: [PATCH 148/370] mm/page_isolation: remove migratetype from undo_isolate_page_range() Since migratetype is no longer overwritten during pageblock isolation, undoing pageblock isolation no longer needs to know which migratetype to restore. Link: https://lkml.kernel.org/r/20250617021115.2331563-6-ziy@nvidia.com Signed-off-by: Zi Yan Acked-by: David Hildenbrand Reviewed-by: Vlastimil Babka Cc: Baolin Wang Cc: Brendan Jackman Cc: Johannes Weiner Cc: Kirill A.
Shuemov Cc: Mel Gorman Cc: Michal Hocko Cc: Oscar Salvador Cc: Richard Chang Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- include/linux/page-isolation.h | 3 +-- mm/memory_hotplug.c | 4 ++-- mm/page_alloc.c | 2 +- mm/page_isolation.c | 9 +++------ 4 files changed, 7 insertions(+), 11 deletions(-) diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h index 7241a6719618..7a681a49e73c 100644 --- a/include/linux/page-isolation.h +++ b/include/linux/page-isolation.h @@ -51,8 +51,7 @@ bool pageblock_unisolate_and_move_free_pages(struct zone *zone, struct page *pag int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, int migratetype, int flags); -void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, - int migratetype); +void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn); int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn, int isol_flags); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index a3c2b0784070..6e3380ad1bf5 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1222,7 +1222,7 @@ int online_pages(unsigned long pfn, unsigned long nr_pages, build_all_zonelists(NULL); /* Basic onlining is complete, allow allocation of onlined pages. */ - undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE); + undo_isolate_page_range(pfn, pfn + nr_pages); /* * Freshly onlined pages aren't shuffled (e.g., all pages are placed to @@ -2086,7 +2086,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages, failed_removal_isolated: /* pushback to free area */ - undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE); + undo_isolate_page_range(start_pfn, end_pfn); memory_notify(MEM_CANCEL_OFFLINE, &mem_arg); if (node_arg.nid != NUMA_NO_NODE) node_notify(NODE_CANCEL_REMOVING_LAST_MEMORY, &node_arg); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 938b01bed1f6..8d4b3f12b9a6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6984,7 +6984,7 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, start, end, outer_start, outer_end); } done: - undo_isolate_page_range(start, end, migratetype); + undo_isolate_page_range(start, end); return ret; } EXPORT_SYMBOL(alloc_contig_range_noprof); diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 08f627a5032f..1edfef408faf 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -515,7 +515,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, page = __first_valid_page(pfn, pageblock_nr_pages); if (page && set_migratetype_isolate(page, migratetype, flags, start_pfn, end_pfn)) { - undo_isolate_page_range(isolate_start, pfn, migratetype); + undo_isolate_page_range(isolate_start, pfn); unset_migratetype_isolate( pfn_to_page(isolate_end - pageblock_nr_pages)); return -EBUSY; @@ -528,13 +528,10 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, * undo_isolate_page_range - undo effects of start_isolate_page_range() * @start_pfn: The first PFN of the isolated range * @end_pfn: The last PFN of the isolated range - * @migratetype: New migrate type to set on the range * - * This finds every MIGRATE_ISOLATE page block in the given range - * and switches it to @migratetype. 
+ * This finds and unsets every MIGRATE_ISOLATE page block in the given range */ -void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, - int migratetype) +void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn) { unsigned long pfn; struct page *page; From d1554fb6302093d353c8bf4601f9bf994b836904 Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Mon, 16 Jun 2025 22:11:14 -0400 Subject: [PATCH 149/370] mm/page_isolation: remove migratetype parameter from more functions Since migratetype is no longer overwritten during pageblock isolation, start_isolate_page_range(), has_unmovable_pages(), and set_migratetype_isolate() no longer need to know which migratetype to restore on isolation failure. For has_unmovable_pages(), it needs to know if the isolation is for CMA allocation, so the new PB_ISOLATE_MODE_CMA_ALLOC provides that information. At the same time, change the isolation flags to enum pb_isolate_mode (PB_ISOLATE_MODE_MEM_OFFLINE, PB_ISOLATE_MODE_CMA_ALLOC, PB_ISOLATE_MODE_OTHER). Remove REPORT_FAILURE and check for PB_ISOLATE_MODE_MEM_OFFLINE instead, since only PB_ISOLATE_MODE_MEM_OFFLINE reports isolation failures. alloc_contig_range() no longer needs migratetype. Replace it with a newly defined acr_flags_t to tell if an allocation is for CMA. So does __alloc_contig_migrate_range(). Add ACR_FLAGS_NONE (set to 0) to indicate ordinary allocations. Link: https://lkml.kernel.org/r/20250617021115.2331563-7-ziy@nvidia.com Signed-off-by: Zi Yan Reviewed-by: Vlastimil Babka Acked-by: David Hildenbrand Cc: Baolin Wang Cc: Brendan Jackman Cc: Johannes Weiner Cc: Kirill A. Shuemov Cc: Mel Gorman Cc: Michal Hocko Cc: Oscar Salvador Cc: Richard Chang Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- drivers/virtio/virtio_mem.c | 2 +- include/linux/gfp.h | 7 +++- include/linux/page-isolation.h | 20 ++++++++-- include/trace/events/kmem.h | 14 ++++--- mm/cma.c | 2 +- mm/memory_hotplug.c | 6 +-- mm/page_alloc.c | 27 ++++++------- mm/page_isolation.c | 70 +++++++++++++++------------------- 8 files changed, 80 insertions(+), 68 deletions(-) diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index 56d0dbe62163..1688ecd69a04 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -1243,7 +1243,7 @@ static int virtio_mem_fake_offline(struct virtio_mem *vm, unsigned long pfn, if (atomic_read(&vm->config_changed)) return -EAGAIN; - rc = alloc_contig_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE, + rc = alloc_contig_range(pfn, pfn + nr_pages, ACR_FLAGS_NONE, GFP_KERNEL); if (rc == -ENOMEM) /* whoops, out of memory */ diff --git a/include/linux/gfp.h b/include/linux/gfp.h index be160e8d8bcb..5ebf26fcdcfa 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -423,9 +423,14 @@ static inline bool gfp_compaction_allowed(gfp_t gfp_mask) extern gfp_t vma_thp_gfp_mask(struct vm_area_struct *vma); #ifdef CONFIG_CONTIG_ALLOC + +typedef unsigned int __bitwise acr_flags_t; +#define ACR_FLAGS_NONE ((__force acr_flags_t)0) // ordinary allocation request +#define ACR_FLAGS_CMA ((__force acr_flags_t)BIT(0)) // allocate for CMA + /* The below functions must be run on a range from a single zone. */ extern int alloc_contig_range_noprof(unsigned long start, unsigned long end, - unsigned migratetype, gfp_t gfp_mask); + acr_flags_t alloc_flags, gfp_t gfp_mask); #define alloc_contig_range(...)
alloc_hooks(alloc_contig_range_noprof(__VA_ARGS__)) extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask, diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h index 7a681a49e73c..3e2f960e166c 100644 --- a/include/linux/page-isolation.h +++ b/include/linux/page-isolation.h @@ -38,8 +38,20 @@ static inline void set_pageblock_isolate(struct page *page) } #endif -#define MEMORY_OFFLINE 0x1 -#define REPORT_FAILURE 0x2 +/* + * Pageblock isolation modes: + * PB_ISOLATE_MODE_MEM_OFFLINE - isolate to offline (!allocate) memory + * e.g., skip over PageHWPoison() pages and + * PageOffline() pages. Unmovable pages will be + * reported in this mode. + * PB_ISOLATE_MODE_CMA_ALLOC - isolate for CMA allocations + * PB_ISOLATE_MODE_OTHER - isolate for other purposes + */ +enum pb_isolate_mode { + PB_ISOLATE_MODE_MEM_OFFLINE, + PB_ISOLATE_MODE_CMA_ALLOC, + PB_ISOLATE_MODE_OTHER, +}; void __meminit init_pageblock_migratetype(struct page *page, enum migratetype migratetype, @@ -49,10 +61,10 @@ bool pageblock_isolate_and_move_free_pages(struct zone *zone, struct page *page) bool pageblock_unisolate_and_move_free_pages(struct zone *zone, struct page *page); int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, - int migratetype, int flags); + enum pb_isolate_mode mode); void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn); int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn, - int isol_flags); + enum pb_isolate_mode mode); #endif diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h index f74925a6cf69..efffcf578217 100644 --- a/include/trace/events/kmem.h +++ b/include/trace/events/kmem.h @@ -304,6 +304,7 @@ TRACE_EVENT(mm_page_alloc_extfrag, __entry->change_ownership) ); +#ifdef CONFIG_CONTIG_ALLOC TRACE_EVENT(mm_alloc_contig_migrate_range_info, TP_PROTO(unsigned long start, @@ -311,9 +312,9 @@ TRACE_EVENT(mm_alloc_contig_migrate_range_info, unsigned long nr_migrated, unsigned long nr_reclaimed, unsigned long nr_mapped, - int migratetype), + acr_flags_t alloc_flags), - TP_ARGS(start, end, nr_migrated, nr_reclaimed, nr_mapped, migratetype), + TP_ARGS(start, end, nr_migrated, nr_reclaimed, nr_mapped, alloc_flags), TP_STRUCT__entry( __field(unsigned long, start) @@ -321,7 +322,7 @@ TRACE_EVENT(mm_alloc_contig_migrate_range_info, __field(unsigned long, nr_migrated) __field(unsigned long, nr_reclaimed) __field(unsigned long, nr_mapped) - __field(int, migratetype) + __field(acr_flags_t, alloc_flags) ), TP_fast_assign( @@ -330,17 +331,18 @@ TRACE_EVENT(mm_alloc_contig_migrate_range_info, __entry->nr_migrated = nr_migrated; __entry->nr_reclaimed = nr_reclaimed; __entry->nr_mapped = nr_mapped; - __entry->migratetype = migratetype; + __entry->alloc_flags = alloc_flags; ), - TP_printk("start=0x%lx end=0x%lx migratetype=%d nr_migrated=%lu nr_reclaimed=%lu nr_mapped=%lu", + TP_printk("start=0x%lx end=0x%lx alloc_flags=%d nr_migrated=%lu nr_reclaimed=%lu nr_mapped=%lu", __entry->start, __entry->end, - __entry->migratetype, + __entry->alloc_flags, __entry->nr_migrated, __entry->nr_reclaimed, __entry->nr_mapped) ); +#endif TRACE_EVENT(mm_setup_per_zone_wmarks, diff --git a/mm/cma.c b/mm/cma.c index bd3772773736..c0b2630a1b81 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -822,7 +822,7 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit); mutex_lock(&cma->alloc_mutex); - ret = alloc_contig_range(pfn, pfn + count, 
MIGRATE_CMA, gfp); + ret = alloc_contig_range(pfn, pfn + count, ACR_FLAGS_CMA, gfp); mutex_unlock(&cma->alloc_mutex); if (ret == 0) { page = pfn_to_page(pfn); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 6e3380ad1bf5..e4009a44f883 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1962,8 +1962,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages, /* set above range as isolated */ ret = start_isolate_page_range(start_pfn, end_pfn, - MIGRATE_MOVABLE, - MEMORY_OFFLINE | REPORT_FAILURE); + PB_ISOLATE_MODE_MEM_OFFLINE); if (ret) { reason = "failure to isolate range"; goto failed_removal_pcplists_disabled; @@ -2033,7 +2032,8 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages, goto failed_removal_isolated; } - ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE); + ret = test_pages_isolated(start_pfn, end_pfn, + PB_ISOLATE_MODE_MEM_OFFLINE); } while (ret); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8d4b3f12b9a6..4f55f8ed65c7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6693,11 +6693,12 @@ static void alloc_contig_dump_pages(struct list_head *page_list) /* * [start, end) must belong to a single zone. - * @migratetype: using migratetype to filter the type of migration in + * @alloc_flags: using acr_flags_t to filter the type of migration in * trace_mm_alloc_contig_migrate_range_info. */ static int __alloc_contig_migrate_range(struct compact_control *cc, - unsigned long start, unsigned long end, int migratetype) + unsigned long start, unsigned long end, + acr_flags_t alloc_flags) { /* This function is based on compact_zone() from compaction.c. */ unsigned int nr_reclaimed; @@ -6769,7 +6770,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, putback_movable_pages(&cc->migratepages); } - trace_mm_alloc_contig_migrate_range_info(start, end, migratetype, + trace_mm_alloc_contig_migrate_range_info(start, end, alloc_flags, total_migrated, total_reclaimed, total_mapped); @@ -6840,10 +6841,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask) * alloc_contig_range() -- tries to allocate given range of pages * @start: start PFN to allocate * @end: one-past-the-last PFN to allocate - * @migratetype: migratetype of the underlying pageblocks (either - * #MIGRATE_MOVABLE or #MIGRATE_CMA). All pageblocks - * in range must have the same migratetype and it must - * be either of the two. + * @alloc_flags: allocation information * @gfp_mask: GFP mask. Node/zone/placement hints are ignored; only some * action and reclaim modifiers are supported. Reclaim modifiers * control allocation behavior during compaction/migration/reclaim. @@ -6860,7 +6858,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask) * need to be freed with free_contig_range(). */ int alloc_contig_range_noprof(unsigned long start, unsigned long end, - unsigned migratetype, gfp_t gfp_mask) + acr_flags_t alloc_flags, gfp_t gfp_mask) { unsigned long outer_start, outer_end; int ret = 0; @@ -6875,6 +6873,9 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, .alloc_contig = true, }; INIT_LIST_HEAD(&cc.migratepages); + enum pb_isolate_mode mode = (alloc_flags & ACR_FLAGS_CMA) ? 
+ PB_ISOLATE_MODE_CMA_ALLOC : + PB_ISOLATE_MODE_OTHER; gfp_mask = current_gfp_context(gfp_mask); if (__alloc_contig_verify_gfp_mask(gfp_mask, (gfp_t *)&cc.gfp_mask)) @@ -6901,7 +6902,7 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, * put back to page allocator so that buddy can use them. */ - ret = start_isolate_page_range(start, end, migratetype, 0); + ret = start_isolate_page_range(start, end, mode); if (ret) goto done; @@ -6917,7 +6918,7 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, * allocated. So, if we fall through be sure to clear ret so that * -EBUSY is not accidentally used or returned to caller. */ - ret = __alloc_contig_migrate_range(&cc, start, end, migratetype); + ret = __alloc_contig_migrate_range(&cc, start, end, alloc_flags); if (ret && ret != -EBUSY) goto done; @@ -6951,7 +6952,7 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, outer_start = find_large_buddy(start); /* Make sure the range is really isolated. */ - if (test_pages_isolated(outer_start, end, 0)) { + if (test_pages_isolated(outer_start, end, mode)) { ret = -EBUSY; goto done; } @@ -6994,8 +6995,8 @@ static int __alloc_contig_pages(unsigned long start_pfn, { unsigned long end_pfn = start_pfn + nr_pages; - return alloc_contig_range_noprof(start_pfn, end_pfn, MIGRATE_MOVABLE, - gfp_mask); + return alloc_contig_range_noprof(start_pfn, end_pfn, ACR_FLAGS_NONE, + gfp_mask); } static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn, diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 1edfef408faf..ece3bfc56bcd 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -31,7 +31,7 @@ * */ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long end_pfn, - int migratetype, int flags) + enum pb_isolate_mode mode) { struct page *page = pfn_to_page(start_pfn); struct zone *zone = page_zone(page); @@ -46,7 +46,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e * isolate CMA pageblocks even when they are not movable in fact * so consider them movable here. */ - if (is_migrate_cma(migratetype)) + if (mode == PB_ISOLATE_MODE_CMA_ALLOC) return NULL; return page; @@ -117,7 +117,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e * The HWPoisoned page may be not in buddy system, and * page_count() is not 0. */ - if ((flags & MEMORY_OFFLINE) && PageHWPoison(page)) + if ((mode == PB_ISOLATE_MODE_MEM_OFFLINE) && PageHWPoison(page)) continue; /* @@ -130,7 +130,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e * move these pages that still have a reference count > 0. * (false negatives in this function only) */ - if ((flags & MEMORY_OFFLINE) && PageOffline(page)) + if ((mode == PB_ISOLATE_MODE_MEM_OFFLINE) && PageOffline(page)) continue; if (__PageMovable(page) || PageLRU(page)) @@ -151,7 +151,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e * present in [start_pfn, end_pfn). The pageblock must intersect with * [start_pfn, end_pfn). 
*/ -static int set_migratetype_isolate(struct page *page, int migratetype, int isol_flags, +static int set_migratetype_isolate(struct page *page, enum pb_isolate_mode mode, unsigned long start_pfn, unsigned long end_pfn) { struct zone *zone = page_zone(page); @@ -186,7 +186,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ end_pfn); unmovable = has_unmovable_pages(check_unmovable_start, check_unmovable_end, - migratetype, isol_flags); + mode); if (!unmovable) { if (!pageblock_isolate_and_move_free_pages(zone, page)) { spin_unlock_irqrestore(&zone->lock, flags); @@ -198,7 +198,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ } spin_unlock_irqrestore(&zone->lock, flags); - if (isol_flags & REPORT_FAILURE) { + if (mode == PB_ISOLATE_MODE_MEM_OFFLINE) { /* * printk() with zone->lock held will likely trigger a * lockdep splat, so defer it here. @@ -292,11 +292,10 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages) * isolate_single_pageblock() -- tries to isolate a pageblock that might be * within a free or in-use page. * @boundary_pfn: pageblock-aligned pfn that a page might cross - * @flags: isolation flags + * @mode: isolation mode * @isolate_before: isolate the pageblock before the boundary_pfn * @skip_isolation: the flag to skip the pageblock isolation in second * isolate_single_pageblock() - * @migratetype: migrate type to set in error recovery. * * Free and in-use pages can be as big as MAX_PAGE_ORDER and contain more than one * pageblock. When not all pageblocks within a page are isolated at the same @@ -311,8 +310,9 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages) * either. The function handles this by splitting the free page or migrating * the in-use page then splitting the free page. */ -static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, - bool isolate_before, bool skip_isolation, int migratetype) +static int isolate_single_pageblock(unsigned long boundary_pfn, + enum pb_isolate_mode mode, bool isolate_before, + bool skip_isolation) { unsigned long start_pfn; unsigned long isolate_pageblock; @@ -338,12 +338,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, zone->zone_start_pfn); if (skip_isolation) { - int mt __maybe_unused = get_pageblock_migratetype(pfn_to_page(isolate_pageblock)); - - VM_BUG_ON(!is_migrate_isolate(mt)); + VM_BUG_ON(!get_pageblock_isolate(pfn_to_page(isolate_pageblock))); } else { - ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), migratetype, - flags, isolate_pageblock, isolate_pageblock + pageblock_nr_pages); + ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), + mode, isolate_pageblock, + isolate_pageblock + pageblock_nr_pages); if (ret) return ret; @@ -441,14 +440,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, * start_isolate_page_range() - mark page range MIGRATE_ISOLATE * @start_pfn: The first PFN of the range to be isolated. * @end_pfn: The last PFN of the range to be isolated. - * @migratetype: Migrate type to set in error recovery. - * @flags: The following flags are allowed (they can be combined in - * a bit mask) - * MEMORY_OFFLINE - isolate to offline (!allocate) memory - * e.g., skip over PageHWPoison() pages - * and PageOffline() pages. 
- * REPORT_FAILURE - report details about the failure to - * isolate the range + * @mode: isolation mode * * Making page-allocation-type to be MIGRATE_ISOLATE means free pages in * the range will never be allocated. Any free pages and pages freed in the @@ -481,7 +473,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, * Return: 0 on success and -EBUSY if any part of range cannot be isolated. */ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, - int migratetype, int flags) + enum pb_isolate_mode mode) { unsigned long pfn; struct page *page; @@ -492,8 +484,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, bool skip_isolation = false; /* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */ - ret = isolate_single_pageblock(isolate_start, flags, false, - skip_isolation, migratetype); + ret = isolate_single_pageblock(isolate_start, mode, false, + skip_isolation); if (ret) return ret; @@ -501,8 +493,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, skip_isolation = true; /* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */ - ret = isolate_single_pageblock(isolate_end, flags, true, - skip_isolation, migratetype); + ret = isolate_single_pageblock(isolate_end, mode, true, skip_isolation); if (ret) { unset_migratetype_isolate(pfn_to_page(isolate_start)); return ret; @@ -513,8 +504,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, pfn < isolate_end - pageblock_nr_pages; pfn += pageblock_nr_pages) { page = __first_valid_page(pfn, pageblock_nr_pages); - if (page && set_migratetype_isolate(page, migratetype, flags, - start_pfn, end_pfn)) { + if (page && set_migratetype_isolate(page, mode, start_pfn, + end_pfn)) { undo_isolate_page_range(isolate_start, pfn); unset_migratetype_isolate( pfn_to_page(isolate_end - pageblock_nr_pages)); @@ -556,7 +547,7 @@ void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn) */ static unsigned long __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn, - int flags) + enum pb_isolate_mode mode) { struct page *page; @@ -569,11 +560,12 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn, * simple way to verify that as VM_BUG_ON(), though. */ pfn += 1 << buddy_order(page); - else if ((flags & MEMORY_OFFLINE) && PageHWPoison(page)) + else if ((mode == PB_ISOLATE_MODE_MEM_OFFLINE) && + PageHWPoison(page)) /* A HWPoisoned page cannot be also PageBuddy */ pfn++; - else if ((flags & MEMORY_OFFLINE) && PageOffline(page) && - !page_count(page)) + else if ((mode == PB_ISOLATE_MODE_MEM_OFFLINE) && + PageOffline(page) && !page_count(page)) /* * The responsible driver agreed to skip PageOffline() * pages when offlining memory by dropping its @@ -591,11 +583,11 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn, * test_pages_isolated - check if pageblocks in range are isolated * @start_pfn: The first PFN of the isolated range * @end_pfn: The first PFN *after* the isolated range - * @isol_flags: Testing mode flags + * @mode: Testing mode * * This tests if all in the specified range are free. * - * If %MEMORY_OFFLINE is specified in @flags, it will consider + * If %PB_ISOLATE_MODE_MEM_OFFLINE specified in @mode, it will consider * poisoned and offlined pages free as well. * * Caller must ensure the requested range doesn't span zones. 
@@ -603,7 +595,7 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
 * Returns 0 if true, -EBUSY if one or more pages are in use.
 */
 int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
- int isol_flags)
+ enum pb_isolate_mode mode)
 {
 unsigned long pfn, flags;
 struct page *page;
@@ -639,7 +631,7 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
 /* Check all pages are free or marked as ISOLATED */
 zone = page_zone(page);
 spin_lock_irqsave(&zone->lock, flags);
- pfn = __test_page_isolated_in_pageblock(start_pfn, end_pfn, isol_flags);
+ pfn = __test_page_isolated_in_pageblock(start_pfn, end_pfn, mode);
 spin_unlock_irqrestore(&zone->lock, flags);

 ret = pfn < end_pfn ? -EBUSY : 0;

From c5e67d40a10234541e220750297304df79aaedd0 Mon Sep 17 00:00:00 2001
From: Yunjeong Mun
Date: Fri, 27 Jun 2025 09:33:29 -0700
Subject: [PATCH 150/370] samples/damon/mtier: add parameters for node0 memory
 usage

Change the hard-coded quota goal metric values into sysfs knobs:
`node0_mem_used_bp` and `node0_mem_free_bp`. These knobs represent the
used and free memory ratio of node0 in basis points (bp, where 1 bp =
0.01%). As mentioned in [1], this patch is developed under the
assumption that node0 is always the fast tier in a two-tier memory
setup.

[1] https://lore.kernel.org/linux-mm/20250420194030.75838-8-sj@kernel.org/

Link: https://lkml.kernel.org/r/20250627163329.50997-1-sj@kernel.org
Link: https://patch.msgid.link/20250619050313.1535-1-yunjeong.mun@sk.com
Signed-off-by: Yunjeong Mun
Signed-off-by: SeongJae Park
Suggested-by: Honggyu Kim
Reviewed-by: SeongJae Park
Signed-off-by: Andrew Morton
---
 samples/damon/mtier.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/samples/damon/mtier.c b/samples/damon/mtier.c
index c94254b77fc9..97892ade7f31 100644
--- a/samples/damon/mtier.c
+++ b/samples/damon/mtier.c
@@ -24,6 +24,12 @@ module_param(node1_start_addr, ulong, 0600);
 static unsigned long node1_end_addr __read_mostly;
 module_param(node1_end_addr, ulong, 0600);

+static unsigned long node0_mem_used_bp __read_mostly = 9970;
+module_param(node0_mem_used_bp, ulong, 0600);
+
+static unsigned long node0_mem_free_bp __read_mostly = 50;
+module_param(node0_mem_free_bp, ulong, 0600);
+
 static int damon_sample_mtier_enable_store(
 const char *val, const struct kernel_param *kp);
@@ -112,7 +118,7 @@ static struct damon_ctx *damon_sample_mtier_build_ctx(bool promote)
 quota_goal = damos_new_quota_goal(
 promote ? DAMOS_QUOTA_NODE_MEM_USED_BP :
 DAMOS_QUOTA_NODE_MEM_FREE_BP,
- promote ? 9970 : 50);
+ promote ? node0_mem_used_bp : node0_mem_free_bp);
 if (!quota_goal)
 goto free_out;
 quota_goal->nid = 0;

From eff41389d8249a1a5a67faa440255ed8e526803a Mon Sep 17 00:00:00 2001
From: Peter Xu
Date: Fri, 27 Jun 2025 12:07:07 -0400
Subject: [PATCH 151/370] mm/hugetlb: remove prepare_hugepage_range()

Only mips and loongarch implemented this API, and all it did was check
len and addr against stack overflow. That is already done in the archs'
arch_get_unmapped_area*() functions, even though the checks may not be
100% identical. For example, for both of the architectures there will be
a trivial difference in how the stack top is defined: the old code uses
STACK_TOP, which may be slightly smaller than TASK_SIZE on either of
them, but the hope is that this shouldn't be a problem.

This means the whole API is now obsolete; remove it completely.
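
For illustration, the check being removed boils down to the following
bounds test (a simplified sketch; hugepage_range_ok() is a hypothetical
name, not code from any arch, and both implementations used
task_size = STACK_TOP):

	static inline bool hugepage_range_ok(unsigned long addr,
					     unsigned long len,
					     unsigned long task_size)
	{
		if (len > task_size)
			return false;	/* the old code returned -ENOMEM */
		if (task_size - len < addr)
			return false;	/* the old code returned -EINVAL */
		return true;
	}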
Link: https://lkml.kernel.org/r/20250627160707.2124580-1-peterx@redhat.com Signed-off-by: Peter Xu Reviewed-by: Jason Gunthorpe Reviewed-by: Oscar Salvador Acked-by: David Hildenbrand Reviewed-by: Liam R. Howlett Reviewed-by: Anshuman Khandual Cc: Huacai Chen Cc: Thomas Bogendoerfer Cc: Muchun Song Cc: Jann Horn Cc: Lorenzo Stoakes Cc: Pedro Falcato Cc: Vlastimil Babka Cc: Zi Yan Signed-off-by: Andrew Morton --- arch/loongarch/include/asm/hugetlb.h | 14 -------------- arch/mips/include/asm/hugetlb.h | 14 -------------- fs/hugetlbfs/inode.c | 8 ++------ include/asm-generic/hugetlb.h | 8 -------- include/linux/hugetlb.h | 6 ------ 5 files changed, 2 insertions(+), 48 deletions(-) diff --git a/arch/loongarch/include/asm/hugetlb.h b/arch/loongarch/include/asm/hugetlb.h index 4dc4b3e04225..ab68b594f889 100644 --- a/arch/loongarch/include/asm/hugetlb.h +++ b/arch/loongarch/include/asm/hugetlb.h @@ -10,20 +10,6 @@ uint64_t pmd_to_entrylo(unsigned long pmd_val); -#define __HAVE_ARCH_PREPARE_HUGEPAGE_RANGE -static inline int prepare_hugepage_range(struct file *file, - unsigned long addr, - unsigned long len) -{ - unsigned long task_size = STACK_TOP; - - if (len > task_size) - return -ENOMEM; - if (task_size - len < addr) - return -EINVAL; - return 0; -} - #define __HAVE_ARCH_HUGE_PTE_CLEAR static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep, unsigned long sz) diff --git a/arch/mips/include/asm/hugetlb.h b/arch/mips/include/asm/hugetlb.h index fbc71ddcf0f6..8c460ce01ffe 100644 --- a/arch/mips/include/asm/hugetlb.h +++ b/arch/mips/include/asm/hugetlb.h @@ -11,20 +11,6 @@ #include -#define __HAVE_ARCH_PREPARE_HUGEPAGE_RANGE -static inline int prepare_hugepage_range(struct file *file, - unsigned long addr, - unsigned long len) -{ - unsigned long task_size = STACK_TOP; - - if (len > task_size) - return -ENOMEM; - if (task_size - len < addr) - return -EINVAL; - return 0; -} - #define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep, diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 00b2d1a032fd..81a6acddd690 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -179,12 +179,8 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr, if (len & ~huge_page_mask(h)) return -EINVAL; - if (flags & MAP_FIXED) { - if (addr & ~huge_page_mask(h)) - return -EINVAL; - if (prepare_hugepage_range(file, addr, len)) - return -EINVAL; - } + if ((flags & MAP_FIXED) && (addr & ~huge_page_mask(h))) + return -EINVAL; if (addr) addr0 = ALIGN(addr, huge_page_size(h)); diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h index 3e0a8fe9b108..4bce4f07f44f 100644 --- a/include/asm-generic/hugetlb.h +++ b/include/asm-generic/hugetlb.h @@ -114,14 +114,6 @@ static inline int huge_pte_none_mostly(pte_t pte) } #endif -#ifndef __HAVE_ARCH_PREPARE_HUGEPAGE_RANGE -static inline int prepare_hugepage_range(struct file *file, - unsigned long addr, unsigned long len) -{ - return 0; -} -#endif - #ifndef __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT static inline void huge_ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index c6c87eae4a8d..474de8e2a8f2 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -359,12 +359,6 @@ static inline void hugetlb_show_meminfo_node(int nid) { } -static inline int prepare_hugepage_range(struct file *file, - unsigned long addr, unsigned long 
len)
-{
- return -EINVAL;
-}
-
 static inline void hugetlb_vma_lock_read(struct vm_area_struct *vma)
 {
 }

From dd3d25f055e86a7f5958381beba99d67927d4e65 Mon Sep 17 00:00:00 2001
From: Peter Xu
Date: Fri, 27 Jun 2025 12:07:39 -0400
Subject: [PATCH 152/370] mm: deduplicate mm_get_unmapped_area()

mm_get_unmapped_area() is essentially mm_get_unmapped_area_vmflags()
with vm_flags==0. Use the helper instead to dedup the lines.

Link: https://lkml.kernel.org/r/20250627160739.2124768-1-peterx@redhat.com
Signed-off-by: Peter Xu
Reviewed-by: Jason Gunthorpe
Reviewed-by: Oscar Salvador
Reviewed-by: Zi Yan
Reviewed-by: Lorenzo Stoakes
Reviewed-by: Pedro Falcato
Acked-by: David Hildenbrand
Reviewed-by: Dev Jain
Reviewed-by: Anshuman Khandual
Cc: "Liam R. Howlett"
Cc: Vlastimil Babka
Cc: Jann Horn
Cc: Huacai Chen
Cc: Muchun Song
Cc: Thomas Bogendoerfer
Signed-off-by: Andrew Morton
---
 mm/mmap.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 8f92cf10b656..74072369e8fd 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -871,9 +871,8 @@ mm_get_unmapped_area(struct mm_struct *mm, struct file *file,
 unsigned long addr, unsigned long len,
 unsigned long pgoff, unsigned long flags)
 {
- if (test_bit(MMF_TOPDOWN, &mm->flags))
- return arch_get_unmapped_area_topdown(file, addr, len, pgoff, flags, 0);
- return arch_get_unmapped_area(file, addr, len, pgoff, flags, 0);
+ return mm_get_unmapped_area_vmflags(mm, file, addr, len,
+ pgoff, flags, 0);
 }
 EXPORT_SYMBOL(mm_get_unmapped_area);

From f3e8e1e51362680f221f98626924098c8b3c6634 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Sat, 28 Jun 2025 09:04:23 -0700
Subject: [PATCH 153/370] selftests/damon: add drgn script for extracting
 damon status

Patch series "selftests/damon: add python and drgn based DAMON sysfs
functionality tests".

DAMON sysfs interface is the bridge between the user space and the
kernel space for DAMON parameters. There is no good and simple test to
see if the parameters are set as expected. Existing DAMON selftests
therefore test end-to-end features. For example, damos_quota_goal.py
runs a DAMOS scheme with a quota goal set against a test program running
an artificial access pattern, and sees if the result is as expected.
Such tests cover only a few parts of DAMON. Adding more tests is also
complicated. Finally, the reliability of the tests themselves on
different systems is poor.

'drgn' is a tool that can extract kernel internal data structures like
DAMON parameters. Add a test that passes specific DAMON parameters via
DAMON sysfs reusing _damon_sysfs.py, extracts the resulting DAMON
parameters via 'drgn', and compares them.

Note that this test is not adding exhaustive tests of all DAMON
parameters and input combinations but very basic things. Advancing the
test infrastructure and adding more tests are future works.

This patch (of 6):

'drgn' is a useful tool for extracting kernel internal data structures
such as DAMON's parameters and running status. Add a 'drgn' script that
extracts such DAMON internal data at runtime, for use as a tool for
seeing if a test input has produced the expected results in the kernel.

The script saves or prints out the DAMON internal data as a json file or
string. This is so that making use of the output does not depend too
heavily on 'drgn'. If 'drgn' is not available on a test setup and we
find alternative tools for doing that, the json-based tests can be
updated to use an alternative tool in future.

Note that the script is tested with 'drgn v0.0.22'.
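
As a usage sketch (not part of the patch): assuming a kdamond is running
and its pid is known, e.g. from the kdamond's sysfs 'pid' file, the
script takes the kdamond pid and an output destination, where 'stdout'
prints the json string instead of saving it to a file:

    # ./drgn_dump_damon_status.py <kdamond_pid> stdout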
Link: https://lkml.kernel.org/r/20250628160428.53115-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250628160428.53115-2-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- .../selftests/damon/drgn_dump_damon_status.py | 161 ++++++++++++++++++ 1 file changed, 161 insertions(+) create mode 100755 tools/testing/selftests/damon/drgn_dump_damon_status.py diff --git a/tools/testing/selftests/damon/drgn_dump_damon_status.py b/tools/testing/selftests/damon/drgn_dump_damon_status.py new file mode 100755 index 000000000000..333a0d0c4bff --- /dev/null +++ b/tools/testing/selftests/damon/drgn_dump_damon_status.py @@ -0,0 +1,161 @@ +#!/usr/bin/env drgn +# SPDX-License-Identifier: GPL-2.0 + +''' +Read DAMON context data and dump as a json string. +''' +import drgn +from drgn import FaultError, NULL, Object, cast, container_of, execscript, offsetof, reinterpret, sizeof +from drgn.helpers.common import * +from drgn.helpers.linux import * + +import json +import sys + +if "prog" not in globals(): + try: + prog = drgn.get_default_prog() + except drgn.NoDefaultProgramError: + prog = drgn.program_from_kernel() + drgn.set_default_prog(prog) + +def to_dict(object, attr_name_converter): + d = {} + for attr_name, converter in attr_name_converter: + d[attr_name] = converter(getattr(object, attr_name)) + return d + +def intervals_goal_to_dict(goal): + return to_dict(goal, [ + ['access_bp', int], + ['aggrs', int], + ['min_sample_us', int], + ['max_sample_us', int], + ]) + +def attrs_to_dict(attrs): + return to_dict(attrs, [ + ['sample_interval', int], + ['aggr_interval', int], + ['ops_update_interval', int], + ['intervals_goal', intervals_goal_to_dict], + ['min_nr_regions', int], + ['max_nr_regions', int], + ]) + +def addr_range_to_dict(addr_range): + return to_dict(addr_range, [ + ['start', int], + ['end', int], + ]) + +def region_to_dict(region): + return to_dict(region, [ + ['ar', addr_range_to_dict], + ['sampling_addr', int], + ['nr_accesses', int], + ['nr_accesses_bp', int], + ['age', int], + ]) + +def regions_to_list(regions): + return [region_to_dict(r) + for r in list_for_each_entry( + 'struct damon_region', regions.address_of_(), 'list')] + +def target_to_dict(target): + return to_dict(target, [ + ['pid', int], + ['nr_regions', int], + ['regions_list', regions_to_list], + ]) + +def targets_to_list(targets): + return [target_to_dict(t) + for t in list_for_each_entry( + 'struct damon_target', targets.address_of_(), 'list')] + +def damos_access_pattern_to_dict(pattern): + return to_dict(pattern, [ + ['min_sz_region', int], + ['max_sz_region', int], + ['min_nr_accesses', int], + ['max_nr_accesses', int], + ['min_age_region', int], + ['max_age_region', int], + ]) + +def damos_quota_goal_to_dict(goal): + return to_dict(goal, [ + ['metric', int], + ['target_value', int], + ['current_value', int], + ['last_psi_total', int], + ['nid', int], + ]) + +def damos_quota_goals_to_list(goals): + return [damos_quota_goal_to_dict(g) + for g in list_for_each_entry( + 'struct damos_quota_goal', goals.address_of_(), 'list')] + +def damos_quota_to_dict(quota): + return to_dict(quota, [ + ['reset_interval', int], + ['ms', int], ['sz', int], + ['goals', damos_quota_goals_to_list], + ['esz', int], + ['weight_sz', int], + ['weight_nr_accesses', int], + ['weight_age', int], + ]) + +def damos_watermarks_to_dict(watermarks): + return to_dict(watermarks, [ + ['metric', int], + ['interval', int], + ['high', int], ['mid', int], ['low', int], + ]) + +def scheme_to_dict(scheme): + return 
to_dict(scheme, [
+ ['pattern', damos_access_pattern_to_dict],
+ ['action', int],
+ ['apply_interval_us', int],
+ ['quota', damos_quota_to_dict],
+ ['wmarks', damos_watermarks_to_dict],
+ ['target_nid', int],
+ ])
+
+def schemes_to_list(schemes):
+ return [scheme_to_dict(s)
+ for s in list_for_each_entry(
+ 'struct damos', schemes.address_of_(), 'list')]
+
+def damon_ctx_to_dict(ctx):
+ return to_dict(ctx, [
+ ['attrs', attrs_to_dict],
+ ['adaptive_targets', targets_to_list],
+ ['schemes', schemes_to_list],
+ ])
+
+def main():
+ if len(sys.argv) < 3:
+ print('Usage: %s <pid> <file>' % sys.argv[0])
+ exit(1)
+
+ pid = int(sys.argv[1])
+ file_to_store = sys.argv[2]
+
+ kthread_data = cast('struct kthread *',
+ find_task(prog, pid).worker_private).data
+ ctx = cast('struct damon_ctx *', kthread_data)
+ status = {'contexts': [damon_ctx_to_dict(ctx)]}
+ if file_to_store == 'stdout':
+ print(json.dumps(status, indent=4))
+ else:
+ with open(file_to_store, 'w') as f:
+ json.dump(status, f, indent=4)
+
+if __name__ == '__main__':
+ main()

From e227472ebf00b6b5187915826c41258c472edb0a Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Sat, 28 Jun 2025 09:04:24 -0700
Subject: [PATCH 154/370] selftests/damon/_damon_sysfs: set Kdamond.pid in
 start()

_damon_sysfs.py is a Python module for reading and writing DAMON sysfs
for testing. It does not read the resulting kdamond pids. Read and
update those when starting the kdamonds.

Link: https://lkml.kernel.org/r/20250628160428.53115-3-sj@kernel.org
Signed-off-by: SeongJae Park
Cc: Shuah Khan
Signed-off-by: Andrew Morton
---
 tools/testing/selftests/damon/_damon_sysfs.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/testing/selftests/damon/_damon_sysfs.py b/tools/testing/selftests/damon/_damon_sysfs.py
index 5b1cb6b3ce4e..f587e117472e 100644
--- a/tools/testing/selftests/damon/_damon_sysfs.py
+++ b/tools/testing/selftests/damon/_damon_sysfs.py
@@ -408,6 +408,9 @@ class Kdamond:
 if err is not None:
 return err
 err = write_file(os.path.join(self.sysfs_dir(), 'state'), 'on')
+ if err is not None:
+ return err
+ self.pid, err = read_file(os.path.join(self.sysfs_dir(), 'pid'))
 return err

 def stop(self):

From 4ece01897627ddeefcede4ac709cd99763994dc4 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Sat, 28 Jun 2025 09:04:25 -0700
Subject: [PATCH 155/370] selftests/damon: add python and drgn-based DAMON
 sysfs test

Add a python-written DAMON sysfs functionality selftest. It sets DAMON
parameters using the Python module _damon_sysfs, reads the updated
kernel internal DAMON status and parameters using a 'drgn' script,
namely drgn_dump_damon_status.py, and compares whether the resulting
DAMON internal status is as expected. The test is very minimal at the
moment.
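
For reference (an assumed invocation, not from the patch): with 'drgn'
installed and root privileges, the test can be run directly, e.g.:

    # cd tools/testing/selftests/damon; ./sysfs.py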
Link: https://lkml.kernel.org/r/20250628160428.53115-4-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/Makefile | 1 + tools/testing/selftests/damon/sysfs.py | 42 ++++++++++++++++++++++++++ 2 files changed, 43 insertions(+) create mode 100755 tools/testing/selftests/damon/sysfs.py diff --git a/tools/testing/selftests/damon/Makefile b/tools/testing/selftests/damon/Makefile index e888455e3cf8..5b230deb19e8 100644 --- a/tools/testing/selftests/damon/Makefile +++ b/tools/testing/selftests/damon/Makefile @@ -7,6 +7,7 @@ TEST_FILES = _damon_sysfs.py # functionality tests TEST_PROGS += sysfs.sh +TEST_PROGS += sysfs.py TEST_PROGS += sysfs_update_schemes_tried_regions_wss_estimation.py TEST_PROGS += damos_quota.py damos_quota_goal.py damos_apply_interval.py TEST_PROGS += damos_tried_regions.py damon_nr_regions.py diff --git a/tools/testing/selftests/damon/sysfs.py b/tools/testing/selftests/damon/sysfs.py new file mode 100755 index 000000000000..4ff99db0d247 --- /dev/null +++ b/tools/testing/selftests/damon/sysfs.py @@ -0,0 +1,42 @@ +#!/usr/bin/env python3 +# SPDX-License-Identifier: GPL-2.0 + +import json +import os +import subprocess + +import _damon_sysfs + +def dump_damon_status_dict(pid): + file_dir = os.path.dirname(os.path.abspath(__file__)) + dump_script = os.path.join(file_dir, 'drgn_dump_damon_status.py') + rc = subprocess.call(['drgn', dump_script, pid, 'damon_dump_output'], + stderr=subprocess.DEVNULL) + if rc != 0: + return None, 'drgn fail' + try: + with open('damon_dump_output', 'r') as f: + return json.load(f), None + except Exception as e: + return None, 'json.load fail (%s)' % e + +def main(): + kdamonds = _damon_sysfs.Kdamonds( + [_damon_sysfs.Kdamond(contexts=[_damon_sysfs.DamonCtx()])]) + err = kdamonds.start() + if err is not None: + print('kdamond start failed: %s' % err) + exit(1) + + status, err = dump_damon_status_dict(kdamonds.kdamonds[0].pid) + if err is not None: + print(err) + exit(1) + + if len(status['contexts']) != 1: + print('number of contexts: %d' % len(status['contexts'])) + exit(1) + kdamonds.stop() + +if __name__ == '__main__': + main() From ae3ab07e0d0488925008c19f03ba02d7818c85af Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 28 Jun 2025 09:04:26 -0700 Subject: [PATCH 156/370] selftests/damon/sysfs.py: test monitoring attribute parameters Add DAMON sysfs interface functionality tests for DAMON monitoring attribute parameters, including intervals, intervals tuning goals, and min/max number of regions. 
Link: https://lkml.kernel.org/r/20250628160428.53115-5-sj@kernel.org
Signed-off-by: SeongJae Park
Cc: Shuah Khan
Signed-off-by: Andrew Morton
---
 tools/testing/selftests/damon/sysfs.py | 34 ++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/damon/sysfs.py b/tools/testing/selftests/damon/sysfs.py
index 4ff99db0d247..a721901a880d 100755
--- a/tools/testing/selftests/damon/sysfs.py
+++ b/tools/testing/selftests/damon/sysfs.py
@@ -20,6 +20,11 @@ def dump_damon_status_dict(pid):
 except Exception as e:
 return None, 'json.load fail (%s)' % e

+def fail(expectation, status):
+ print('unexpected %s' % expectation)
+ print(json.dumps(status, indent=4))
+ exit(1)
+
 def main():
 kdamonds = _damon_sysfs.Kdamonds(
 [_damon_sysfs.Kdamond(contexts=[_damon_sysfs.DamonCtx()])])
@@ -34,8 +39,33 @@ def main():
 exit(1)

 if len(status['contexts']) != 1:
- print('number of contexts: %d' % len(status['contexts']))
- exit(1)
+ fail('number of contexts', status)
+
+ ctx = status['contexts'][0]
+ attrs = ctx['attrs']
+ if attrs['sample_interval'] != 5000:
+ fail('sample interval', status)
+ if attrs['aggr_interval'] != 100000:
+ fail('aggr interval', status)
+ if attrs['ops_update_interval'] != 1000000:
+ fail('ops update interval', status)
+
+ if attrs['intervals_goal'] != {
+ 'access_bp': 0, 'aggrs': 0,
+ 'min_sample_us': 0, 'max_sample_us': 0}:
+ fail('intervals goal', status)
+
+ if attrs['min_nr_regions'] != 10:
+ fail('min_nr_regions', status)
+ if attrs['max_nr_regions'] != 1000:
+ fail('max_nr_regions', status)
+
+ if ctx['adaptive_targets'] != []:
+ fail('adaptive_targets')
+
+ if ctx['schemes'] != []:
+ fail('schemes')
+
 kdamonds.stop()

 if __name__ == '__main__':

From 7e6bcf354f3ea5cc409a7210a4b3834585e61954 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Sat, 28 Jun 2025 09:04:27 -0700
Subject: [PATCH 157/370] selftests/damon/sysfs.py: test adaptive targets
 parameter

Add DAMON sysfs interface functionality tests for setup of basic
adaptive targets parameters.

Link: https://lkml.kernel.org/r/20250628160428.53115-6-sj@kernel.org
Signed-off-by: SeongJae Park
Cc: Shuah Khan
Signed-off-by: Andrew Morton
---
 tools/testing/selftests/damon/sysfs.py | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/damon/sysfs.py b/tools/testing/selftests/damon/sysfs.py
index a721901a880d..3b085268f342 100755
--- a/tools/testing/selftests/damon/sysfs.py
+++ b/tools/testing/selftests/damon/sysfs.py
@@ -27,7 +27,9 @@ def main():
 kdamonds = _damon_sysfs.Kdamonds(
- [_damon_sysfs.Kdamond(contexts=[_damon_sysfs.DamonCtx()])])
+ [_damon_sysfs.Kdamond(
+ contexts=[_damon_sysfs.DamonCtx(
+ targets=[_damon_sysfs.DamonTarget(pid=-1)])])])
 err = kdamonds.start()
 if err is not None:
 print('kdamond start failed: %s' % err)
@@ -60,8 +62,9 @@ def main():
 if attrs['max_nr_regions'] != 1000:
 fail('max_nr_regions', status)

- if ctx['adaptive_targets'] != []:
- fail('adaptive_targets')
+ if ctx['adaptive_targets'] != [
+ { 'pid': 0, 'nr_regions': 0, 'regions_list': []}]:
+ fail('adaptive targets', status)

 if ctx['schemes'] != []:
 fail('schemes')

From 603cb4aa09a14157ba412ba4db9dffebb79eb598 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Sat, 28 Jun 2025 09:04:28 -0700
Subject: [PATCH 158/370] selftests/damon/sysfs.py: test DAMOS schemes
 parameters setup

Add DAMON sysfs interface functionality tests for basic DAMOS schemes
parameters setup.
Link: https://lkml.kernel.org/r/20250628160428.53115-7-sj@kernel.org
Signed-off-by: SeongJae Park
Cc: Shuah Khan
Signed-off-by: Andrew Morton
---
 tools/testing/selftests/damon/sysfs.py | 46 ++++++++++++++++++++++++--
 1 file changed, 43 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/damon/sysfs.py b/tools/testing/selftests/damon/sysfs.py
index 3b085268f342..e67008fd055d 100755
--- a/tools/testing/selftests/damon/sysfs.py
+++ b/tools/testing/selftests/damon/sysfs.py
@@ -29,7 +29,9 @@ def main():
 kdamonds = _damon_sysfs.Kdamonds(
 [_damon_sysfs.Kdamond(
 contexts=[_damon_sysfs.DamonCtx(
- targets=[_damon_sysfs.DamonTarget(pid=-1)])])])
+ targets=[_damon_sysfs.DamonTarget(pid=-1)],
+ schemes=[_damon_sysfs.Damos()],
+ )])])
 err = kdamonds.start()
 if err is not None:
 print('kdamond start failed: %s' % err)
@@ -66,8 +68,46 @@ def main():
 { 'pid': 0, 'nr_regions': 0, 'regions_list': []}]:
 fail('adaptive targets', status)

- if ctx['schemes'] != []:
- fail('schemes')
+ if len(ctx['schemes']) != 1:
+ fail('number of schemes', status)
+
+ scheme = ctx['schemes'][0]
+ if scheme['pattern'] != {
+ 'min_sz_region': 0,
+ 'max_sz_region': 2**64 - 1,
+ 'min_nr_accesses': 0,
+ 'max_nr_accesses': 2**32 - 1,
+ 'min_age_region': 0,
+ 'max_age_region': 2**32 - 1,
+ }:
+ fail('damos pattern', status)
+ if scheme['action'] != 9: # stat
+ fail('damos action', status)
+ if scheme['apply_interval_us'] != 0:
+ fail('damos apply interval', status)
+ if scheme['target_nid'] != -1:
+ fail('damos target nid', status)
+
+ if scheme['quota'] != {
+ 'reset_interval': 0,
+ 'ms': 0,
+ 'sz': 0,
+ 'goals': [],
+ 'esz': 0,
+ 'weight_sz': 0,
+ 'weight_nr_accesses': 0,
+ 'weight_age': 0,
+ }:
+ fail('damos quota', status)
+
+ if scheme['wmarks'] != {
+ 'metric': 0,
+ 'interval': 0,
+ 'high': 0,
+ 'mid': 0,
+ 'low': 0,
+ }:
+ fail('damos wmarks', status)

 kdamonds.stop()

From b112a4e0a1af5745f987a1692553bf02c890babc Mon Sep 17 00:00:00 2001
From: Jeongjun Park
Date: Thu, 3 Jul 2025 15:56:00 +0900
Subject: [PATCH 159/370] mm/percpu: prevent concurrency problem for
 pcpu_nr_populated read with spin lock

pcpu_nr_pages() reads pcpu_nr_populated without any protection, which
causes a data race between read/write. However, since this is an
intended race, we should add a data_race annotation instead of adding a
spin lock.
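
To illustrate the idiom (a generic sketch with a hypothetical counter,
not code from the patch): the writer publishes with WRITE_ONCE() under
its own serialisation, and the reader marks the racy load with
data_race() so KCSAN knows the race is intended:

	static unsigned long nr_things;	/* hypothetical statistic */

	/* Writer side: updates are serialised by the caller. */
	static void nr_things_inc(void)
	{
		WRITE_ONCE(nr_things, nr_things + 1);
	}

	/* Reader side: a stale value is acceptable here. */
	static unsigned long nr_things_read(void)
	{
		return data_race(READ_ONCE(nr_things));
	}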
[akpm@linux-foundation.org: move pcpu_nr_units multiplication outside data_race(), per Vlastimil] Link: https://lkml.kernel.org/r/20250703065600.132221-1-aha310510@gmail.com Fixes: 7e8a6304d541 ("/proc/meminfo: add percpu populated pages count") Signed-off-by: Jeongjun Park Reported-by: syzbot+e5bd32b79413e86f389e@syzkaller.appspotmail.com Suggested-by: Shakeel Butt Cc: Christoph Lameter (Ampere) Cc: David Rientjes Cc: Dennis Zhou Cc: Roman Gushchin Cc: Tejun Heo Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/percpu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/percpu.c b/mm/percpu.c index b35494c8ede2..d9cbaee92b60 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -3355,7 +3355,7 @@ void __init setup_per_cpu_areas(void) */ unsigned long pcpu_nr_pages(void) { - return pcpu_nr_populated * pcpu_nr_units; + return data_race(READ_ONCE(pcpu_nr_populated)) * pcpu_nr_units; } /* From 2a83529026c23fc7d8bbf544c6c8d52df7cb9d93 Mon Sep 17 00:00:00 2001 From: Thorsten Blum Date: Mon, 30 Jun 2025 19:18:26 +0200 Subject: [PATCH 160/370] mm/hugetlb: use str_plural() in report_hugepages() Use the string choice helper function str_plural() to simplify the code and to fix the following Coccinelle/coccicheck warning reported by string_choices.cocci: opportunity for str_plural(nrinvalid) Link: https://lkml.kernel.org/r/20250630171826.114008-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum Acked-by: David Hildenbrand Acked-by: Oscar Salvador Reviewed-by: Dev Jain Reviewed-by: Anshuman Khandual Cc: Muchun Song Signed-off-by: Andrew Morton --- mm/hugetlb.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 11d5668ff6e7..db53ead8ac43 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include #include @@ -3722,7 +3723,7 @@ static void __init report_hugepages(void) buf, h->nr_huge_pages); if (nrinvalid) pr_info("HugeTLB: %s page size: %lu invalid page%s discarded\n", - buf, nrinvalid, nrinvalid > 1 ? "s" : ""); + buf, nrinvalid, str_plural(nrinvalid)); pr_info("HugeTLB: %d KiB vmemmap can be freed for a %s page\n", hugetlb_vmemmap_optimizable_size(h) / SZ_1K, buf); } From c26ad45ba538434e87290c7db5a93fe11263f593 Mon Sep 17 00:00:00 2001 From: Gerald Schaefer Date: Mon, 30 Jun 2025 18:47:25 +0200 Subject: [PATCH 161/370] mm/debug_vm_pgtable: use a swp_entry_t input value for swap tests The various __pte/pmd_to_swp_entry and __swp_entry_to_pte/pmd helper functions are expected to operate on swap PTE/PMD entries, not on present and mapped entries. Reflect this in the swap tests by using a swp_entry_t as input value, and convert it to a swap PTE/PMD for testing, similar to how it is already done in pte_swap_exclusive_tests(). Move the swap entry creation from there to init_args() and store it in args, so it can also be used in other functions. The pte/pmd_swap_tests() are also changed to compare entries instead of pfn values, again similar to pte_swap_exclusive_tests(). pte/pmd_pfn() helpers are also not expected to operate on swap PTE/PMD entries at all. Also update documentation, to reflect that the helpers operate on swap PTE/PMD entries and not present and mapped entries, and use correct names, i.e. __swp_to_pte/pmd_entry -> __swp_entry_to_pte/pmd. For consistency, also change pte/pmd_swap_soft_dirty_tests() to use args->swp_entry instead of a present and mapped PTE/PMD. 
Link: https://lore.kernel.org/all/20250623184321.927418-1-gerald.schaefer@linux.ibm.com Link: https://lkml.kernel.org/r/20250630164726.930405-1-gerald.schaefer@linux.ibm.com Signed-off-by: Gerald Schaefer Acked-by: David Hildenbrand Reviewed-by: Anshuman Khandual Cc: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- Documentation/mm/arch_pgtable_helpers.rst | 8 ++-- mm/debug_vm_pgtable.c | 53 ++++++++++++++--------- 2 files changed, 36 insertions(+), 25 deletions(-) diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst index c88c7fa665d6..ba2f658bc241 100644 --- a/Documentation/mm/arch_pgtable_helpers.rst +++ b/Documentation/mm/arch_pgtable_helpers.rst @@ -236,13 +236,13 @@ SWAP Page Table Helpers ======================== +---------------------------+--------------------------------------------------+ -| __pte_to_swp_entry | Creates a swapped entry (arch) from a mapped PTE | +| __pte_to_swp_entry | Creates a swp_entry_t (arch) from a swap PTE | +---------------------------+--------------------------------------------------+ -| __swp_to_pte_entry | Creates a mapped PTE from a swapped entry (arch) | +| __swp_entry_to_pte | Creates a swap PTE from a swp_entry_t (arch) | +---------------------------+--------------------------------------------------+ -| __pmd_to_swp_entry | Creates a swapped entry (arch) from a mapped PMD | +| __pmd_to_swp_entry | Creates a swp_entry_t (arch) from a swap PMD | +---------------------------+--------------------------------------------------+ -| __swp_to_pmd_entry | Creates a mapped PMD from a swapped entry (arch) | +| __swp_entry_to_pmd | Creates a swap PMD from a swp_entry_t (arch) | +---------------------------+--------------------------------------------------+ | is_migration_entry | Tests a migration (read or write) swapped entry | +-------------------------------+----------------------------------------------+ diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index bd8f9317b025..d19031f275a3 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -72,6 +72,8 @@ struct pgtable_debug_args { unsigned long fixed_pud_pfn; unsigned long fixed_pmd_pfn; unsigned long fixed_pte_pfn; + + swp_entry_t swp_entry; }; static void __init pte_basic_tests(struct pgtable_debug_args *args, int idx) @@ -698,12 +700,15 @@ static void __init pte_soft_dirty_tests(struct pgtable_debug_args *args) static void __init pte_swap_soft_dirty_tests(struct pgtable_debug_args *args) { - pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot); + pte_t pte; if (!IS_ENABLED(CONFIG_MEM_SOFT_DIRTY)) return; pr_debug("Validating PTE swap soft dirty\n"); + pte = swp_entry_to_pte(args->swp_entry); + WARN_ON(!is_swap_pte(pte)); + WARN_ON(!pte_swp_soft_dirty(pte_swp_mksoft_dirty(pte))); WARN_ON(pte_swp_soft_dirty(pte_swp_clear_soft_dirty(pte))); } @@ -737,7 +742,9 @@ static void __init pmd_swap_soft_dirty_tests(struct pgtable_debug_args *args) return; pr_debug("Validating PMD swap soft dirty\n"); - pmd = pfn_pmd(args->fixed_pmd_pfn, args->page_prot); + pmd = swp_entry_to_pmd(args->swp_entry); + WARN_ON(!is_swap_pmd(pmd)); + WARN_ON(!pmd_swp_soft_dirty(pmd_swp_mksoft_dirty(pmd))); WARN_ON(pmd_swp_soft_dirty(pmd_swp_clear_soft_dirty(pmd))); } @@ -748,17 +755,11 @@ static void __init pmd_swap_soft_dirty_tests(struct pgtable_debug_args *args) { static void __init pte_swap_exclusive_tests(struct pgtable_debug_args *args) { - unsigned long max_swap_offset; swp_entry_t entry, entry2; pte_t pte; pr_debug("Validating PTE swap 
exclusive\n");
-
- /* See generic_max_swapfile_size(): probe the maximum offset */
- max_swap_offset = swp_offset(pte_to_swp_entry(swp_entry_to_pte(swp_entry(0, ~0UL))));
-
- /* Create a swp entry with all possible bits set */
- entry = swp_entry((1 << MAX_SWAPFILES_SHIFT) - 1, max_swap_offset);
+ entry = args->swp_entry;

 pte = swp_entry_to_pte(entry);
 WARN_ON(pte_swp_exclusive(pte));
@@ -782,30 +783,34 @@ static void __init pte_swap_exclusive_tests(struct pgtable_debug_args *args)

 static void __init pte_swap_tests(struct pgtable_debug_args *args)
 {
- swp_entry_t swp;
- pte_t pte;
+ swp_entry_t arch_entry;
+ pte_t pte1, pte2;

 pr_debug("Validating PTE swap\n");
- pte = pfn_pte(args->fixed_pte_pfn, args->page_prot);
- swp = __pte_to_swp_entry(pte);
- pte = __swp_entry_to_pte(swp);
- WARN_ON(args->fixed_pte_pfn != pte_pfn(pte));
+ pte1 = swp_entry_to_pte(args->swp_entry);
+ WARN_ON(!is_swap_pte(pte1));
+
+ arch_entry = __pte_to_swp_entry(pte1);
+ pte2 = __swp_entry_to_pte(arch_entry);
+ WARN_ON(memcmp(&pte1, &pte2, sizeof(pte1)));
 }

 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 static void __init pmd_swap_tests(struct pgtable_debug_args *args)
 {
- swp_entry_t swp;
- pmd_t pmd;
+ swp_entry_t arch_entry;
+ pmd_t pmd1, pmd2;

 if (!has_transparent_hugepage())
 return;

 pr_debug("Validating PMD swap\n");
- pmd = pfn_pmd(args->fixed_pmd_pfn, args->page_prot);
- swp = __pmd_to_swp_entry(pmd);
- pmd = __swp_entry_to_pmd(swp);
- WARN_ON(args->fixed_pmd_pfn != pmd_pfn(pmd));
+ pmd1 = swp_entry_to_pmd(args->swp_entry);
+ WARN_ON(!is_swap_pmd(pmd1));
+
+ arch_entry = __pmd_to_swp_entry(pmd1);
+ pmd2 = __swp_entry_to_pmd(arch_entry);
+ WARN_ON(memcmp(&pmd1, &pmd2, sizeof(pmd1)));
 }
 #else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
 static void __init pmd_swap_tests(struct pgtable_debug_args *args) { }
@@ -1110,6 +1115,7 @@ static void __init init_fixed_pfns(struct pgtable_debug_args *args)

 static int __init init_args(struct pgtable_debug_args *args)
 {
+ unsigned long max_swap_offset;
 struct page *page = NULL;
 int ret = 0;
@@ -1192,6 +1198,11 @@ static int __init init_args(struct pgtable_debug_args *args)

 init_fixed_pfns(args);

+ /* See generic_max_swapfile_size(): probe the maximum offset */
+ max_swap_offset = swp_offset(pte_to_swp_entry(swp_entry_to_pte(swp_entry(0, ~0UL))));
+ /* Create a swp entry with all possible bits set */
+ args->swp_entry = swp_entry((1 << MAX_SWAPFILES_SHIFT) - 1, max_swap_offset);
+
 /*
 * Allocate (huge) pages because some of the tests need to access
 * the data in the pages. The corresponding tests will be skipped

From cced784d2cb2a708cdfe4784514c2a10af3af37d Mon Sep 17 00:00:00 2001
From: Oscar Salvador
Date: Mon, 30 Jun 2025 16:42:08 +0200
Subject: [PATCH 162/370] mm,hugetlb: change mechanism to detect a COW on
 private mapping

Patch series "Misc rework on hugetlb faulting path", v4.

This patchset aims to give some love to the hugetlb faulting path, doing
so by removing obsolete comments that are no longer true, sorting out the
folio lock, and changing the mechanism we use to determine whether we are
COWing a private mapping already.

The most important patch of the series is #1, as it fixes a deadlock that
was described in [1], where two processes were holding the same lock for
the folio in the pagecache, and then deadlocked in the mutex. Note that
this can also happen for anonymous folios.
This has been tested using the reproducer below.

Looking up and locking the folio in the pagecache was done to check
whether that folio was the same folio we had mapped in our pagetables,
meaning that if it was different we knew that we already mapped that
folio privately, so any further CoW would be made on a private mapping,
which led us to the question: __Was the reservation for that address
consumed?__ That is all we care about, because if it was indeed consumed
and we are the owner and we cannot allocate more folios, we need to
unmap the folio from the processes' pagetables and make it exclusive for
us.

We figured we do not need to look up the folio at all, and it is just
enough to check whether the folio we have mapped is anonymous, which
means we mapped it privately, so the reservation was indeed consumed.

Patch#2 sorts out folio locking in the faulting path, reducing its
scope, only taking it when we are dealing with an anonymous folio, and
documents it. More details in the patch.

Patch#3-5 are cleanups.

Here is the reproducer:

 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <sys/mman.h>
 #include <sys/wait.h>

 #define PROTECTION (PROT_READ | PROT_WRITE)
 #define LENGTH (2UL*1024*1024)

 #define ADDR (void *)(0x0UL)
 #define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB)

 void __read(char *addr)
 {
 int i = 0;

 printf("a[%d]: %c\n", i, addr[i]);
 }

 void fill(char *addr)
 {
 addr[0] = 'd';
 printf("addr: %c\n", addr[0]);
 }

 int main(void)
 {
 void *addr;
 pid_t pid, wpid;
 int status;

 addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, -1, 0);
 if (addr == MAP_FAILED) {
 perror("mmap");
 return -1;
 }

 printf("Parent faulting in RO\n");
 __read(addr);

 sleep (10);
 printf("Forking\n");

 pid = fork();
 switch (pid) {
 case -1:
 perror("fork");
 break;
 case 0:
 sleep (4);
 printf("Child: Faulting in\n");
 fill(addr);
 exit(0);
 break;
 default:
 printf("Parent: Faulting in\n");
 fill(addr);
 while((wpid = wait(&status)) > 0);
 if (munmap(addr, LENGTH))
 perror("munmap");
 }

 return 0;
 }

You will also have to add a delay in hugetlb_wp, after releasing the
mutex and before unmapping, so the window is large enough to reproduce
it reliably.

: --- a/mm/hugetlb.c
: +++ b/mm/hugetlb.c
: @@ -38,6 +38,7 @@
:  #include
:  #include
:  #include
: +#include <linux/delay.h>
:
:  #include
:  #include
: @@ -6261,6 +6262,8 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
:  hugetlb_vma_unlock_read(vma);
:  mutex_unlock(&hugetlb_fault_mutex_table[hash]);
:
: + mdelay(8000);
: +
:  unmap_ref_private(mm, vma, old_folio, vmf->address);
:
:  mutex_lock(&hugetlb_fault_mutex_table[hash]);

This patch (of 5):

hugetlb_wp() checks whether the process is trying to COW on a private
mapping in order to know whether the reservation for that address was
already consumed. If it was consumed and we are the owner of the
mapping, the folio will have to be unmapped from the other processes.

Currently, that check is done by looking up the folio in the pagecache
and comparing it to the folio which is mapped in our pagetables. If it
differs, it means we already mapped it privately before, consuming a
reservation on the way. All we are interested in is whether the mapped
folio is anonymous, so we can simplify and check for that instead.
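
Condensed, the new ownership check reduces to the following predicate (a
sketch distilled from the hugetlb_wp() hunk below; the helper name is
hypothetical, as the patch open-codes the condition):

	static bool cow_consumed_reservation(struct vm_area_struct *vma,
					     struct folio *old_folio)
	{
		/*
		 * In a MAP_PRIVATE mapping whose reservation we own, an
		 * anonymous mapped folio means the reservation for this
		 * address was consumed when it was first mapped privately.
		 */
		return is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
		       folio_test_anon(old_folio);
	}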
Link: https://lkml.kernel.org/r/20250630144212.156938-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20250627102904.107202-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20250627102904.107202-2-osalvador@suse.de
Link: https://lore.kernel.org/lkml/20250513093448.592150-1-gavinguo@igalia.com/ [1]
Link: https://lkml.kernel.org/r/20250630144212.156938-2-osalvador@suse.de
Fixes: 40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
Signed-off-by: Oscar Salvador
Reported-by: Gavin Guo
Closes: https://lore.kernel.org/lkml/20250513093448.592150-1-gavinguo@igalia.com/
Suggested-by: Peter Xu
Acked-by: David Hildenbrand
Cc: Muchun Song
Signed-off-by: Andrew Morton
---
 mm/hugetlb.c | 88 ++++++++++++++++++++--------------------------------
 1 file changed, 34 insertions(+), 54 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index db53ead8ac43..cf5d5ad5bbe9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6131,8 +6131,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 * cannot race with other handlers or page migration.
 * Keep the pte_same checks anyway to make transition from the mutex easier.
 */
-static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
- struct vm_fault *vmf)
+static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
 {
 struct vm_area_struct *vma = vmf->vma;
 struct mm_struct *mm = vma->vm_mm;
@@ -6194,16 +6193,17 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
 PageAnonExclusive(&old_folio->page), &old_folio->page);

 /*
- * If the process that created a MAP_PRIVATE mapping is about to
- * perform a COW due to a shared page count, attempt to satisfy
- * the allocation without using the existing reserves. The pagecache
- * page is used to determine if the reserve at this address was
- * consumed or not. If reserves were used, a partial faulted mapping
- * at the time of fork() could consume its reserves on COW instead
- * of the full address range.
+ * If the process that created a MAP_PRIVATE mapping is about to perform
+ * a COW due to a shared page count, attempt to satisfy the allocation
+ * without using the existing reserves.
+ * In order to determine whether this is a COW on a MAP_PRIVATE mapping it
+ * is enough to check whether the old_folio is anonymous. This means that
+ * the reserve for this address was consumed. If reserves were used, a
+ * partial faulted mapping at the time of fork() could consume its reserves
+ * on COW instead of the full address range.
*/ if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) && - old_folio != pagecache_folio) + folio_test_anon(old_folio)) cow_from_owner = true; folio_get(old_folio); @@ -6582,7 +6582,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping, hugetlb_count_add(pages_per_huge_page(h), mm); if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) { /* Optimization, do the COW without a second fault */ - ret = hugetlb_wp(folio, vmf); + ret = hugetlb_wp(vmf); } spin_unlock(vmf->ptl); @@ -6650,10 +6650,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, vm_fault_t ret; u32 hash; struct folio *folio = NULL; - struct folio *pagecache_folio = NULL; struct hstate *h = hstate_vma(vma); struct address_space *mapping; - int need_wait_lock = 0; + bool need_wait_lock = false; struct vm_fault vmf = { .vma = vma, .address = address & huge_page_mask(h), @@ -6748,8 +6747,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, * If we are going to COW/unshare the mapping later, we examine the * pending reservations for this page now. This will ensure that any * allocations necessary to record that reservation occur outside the - * spinlock. Also lookup the pagecache page now as it is used to - * determine if a reservation has been consumed. + * spinlock. */ if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) && !(vma->vm_flags & VM_MAYSHARE) && !huge_pte_write(vmf.orig_pte)) { @@ -6759,11 +6757,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, } /* Just decrements count, does not deallocate */ vma_end_reservation(h, vma, vmf.address); - - pagecache_folio = filemap_lock_hugetlb_folio(h, mapping, - vmf.pgoff); - if (IS_ERR(pagecache_folio)) - pagecache_folio = NULL; } vmf.ptl = huge_pte_lock(h, mm, vmf.pte); @@ -6777,10 +6770,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, (flags & FAULT_FLAG_WRITE) && !huge_pte_write(vmf.orig_pte)) { if (!userfaultfd_wp_async(vma)) { spin_unlock(vmf.ptl); - if (pagecache_folio) { - folio_unlock(pagecache_folio); - folio_put(pagecache_folio); - } hugetlb_vma_unlock_read(vma); mutex_unlock(&hugetlb_fault_mutex_table[hash]); return handle_userfault(&vmf, VM_UFFD_WP); @@ -6792,24 +6781,19 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, /* Fallthrough to CoW */ } - /* - * hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) and - * pagecache_folio, so here we need take the former one - * when folio != pagecache_folio or !pagecache_folio. 
- */
- folio = page_folio(pte_page(vmf.orig_pte));
- if (folio != pagecache_folio)
- if (!folio_trylock(folio)) {
- need_wait_lock = 1;
- goto out_ptl;
- }
-
- folio_get(folio);
-
 if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
 if (!huge_pte_write(vmf.orig_pte)) {
- ret = hugetlb_wp(pagecache_folio, &vmf);
- goto out_put_page;
+ /* hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) */
+ folio = page_folio(pte_page(vmf.orig_pte));
+ if (!folio_trylock(folio)) {
+ need_wait_lock = true;
+ goto out_ptl;
+ }
+ folio_get(folio);
+ ret = hugetlb_wp(&vmf);
+ folio_unlock(folio);
+ folio_put(folio);
+ goto out_ptl;
 } else if (likely(flags & FAULT_FLAG_WRITE)) {
 vmf.orig_pte = huge_pte_mkdirty(vmf.orig_pte);
 }
@@ -6818,17 +6802,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 if (huge_ptep_set_access_flags(vma, vmf.address, vmf.pte, vmf.orig_pte,
 flags & FAULT_FLAG_WRITE))
 update_mmu_cache(vma, vmf.address, vmf.pte);
-out_put_page:
- if (folio != pagecache_folio)
- folio_unlock(folio);
- folio_put(folio);
 out_ptl:
 spin_unlock(vmf.ptl);
-
- if (pagecache_folio) {
- folio_unlock(pagecache_folio);
- folio_put(pagecache_folio);
- }
 out_mutex:
 hugetlb_vma_unlock_read(vma);
@@ -6841,11 +6816,16 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 mutex_unlock(&hugetlb_fault_mutex_table[hash]);

 /*
- * Generally it's safe to hold refcount during waiting page lock. But
- * here we just wait to defer the next page fault to avoid busy loop and
- * the page is not used after unlocked before returning from the current
- * page fault. So we are safe from accessing freed page, even if we wait
- * here without taking refcount.
+ * hugetlb_wp drops all the locks but the folio lock, before trying to
+ * unmap the folio from other processes. During that window, if another
+ * process mapping that folio faults in, it will take the mutex and then
+ * it will wait on folio_lock, causing an ABBA deadlock.
+ * Use trylock instead and bail out if we fail.
+ *
+ * Ideally, we should hold a refcount on the folio we wait for, but we do
+ * not want to use the folio after it becomes unlocked, but rather just
+ * wait for it to become unlocked, so hopefully the next fault succeeds on
+ * the trylock.
 */
 if (need_wait_lock)
 folio_wait_locked(folio);

From 9293fb4765527c0d2375eb441d045a5a75f5210d Mon Sep 17 00:00:00 2001
From: Oscar Salvador
Date: Mon, 30 Jun 2025 16:42:09 +0200
Subject: [PATCH 163/370] mm,hugetlb: sort out folio locking in the faulting
 path

Recent conversations showed that there was a misunderstanding about why
we were locking the folio prior to calling hugetlb_wp().

In fact, as soon as we have the folio mapped into the pagetables, we no
longer need to hold it locked, because we know that no concurrent
truncation could have happened.

There is only one case where the folio needs to be locked, and that is
when we are handling an anonymous folio, because hugetlb_wp() will check
whether it can re-use it exclusively for the process that is faulting it
in.

So, pass the folio locked to hugetlb_wp() when that is the case.
Link: https://lkml.kernel.org/r/20250627102904.107202-3-osalvador@suse.de
Link: https://lkml.kernel.org/r/20250630144212.156938-3-osalvador@suse.de
Signed-off-by: Oscar Salvador
Suggested-by: David Hildenbrand
Acked-by: David Hildenbrand
Cc: Gavin Guo
Cc: Muchun Song
Cc: Peter Xu
Signed-off-by: Andrew Morton
---
 mm/hugetlb.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cf5d5ad5bbe9..68a260e4f4c7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6416,6 +6416,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 pte_t new_pte;
 bool new_folio, new_pagecache_folio = false;
 u32 hash = hugetlb_fault_mutex_hash(mapping, vmf->pgoff);
+ bool folio_locked = true;

 /*
 * Currently, we are forced to kill the process in the event the
@@ -6581,6 +6582,14 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 hugetlb_count_add(pages_per_huge_page(h), mm);
 if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
+ /*
+ * No need to keep file folios locked. See comment in
+ * hugetlb_fault().
+ */
+ if (!anon_rmap) {
+ folio_locked = false;
+ folio_unlock(folio);
+ }
 /* Optimization, do the COW without a second fault */
 ret = hugetlb_wp(vmf);
 }
@@ -6595,7 +6604,8 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 if (new_folio)
 folio_set_hugetlb_migratable(folio);

- folio_unlock(folio);
+ if (folio_locked)
+ folio_unlock(folio);
 out:
 hugetlb_vma_unlock_read(vma);
@@ -6783,15 +6793,20 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,

 if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
 if (!huge_pte_write(vmf.orig_pte)) {
- /* hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) */
+ /*
+ * Anonymous folios need to be locked since hugetlb_wp()
+ * checks whether we can re-use the folio exclusively
+ * for us in case we are the only user of it.
+ */
 folio = page_folio(pte_page(vmf.orig_pte));
- if (!folio_trylock(folio)) {
+ if (folio_test_anon(folio) && !folio_trylock(folio)) {
 need_wait_lock = true;
 goto out_ptl;
 }
 folio_get(folio);
 ret = hugetlb_wp(&vmf);
- folio_unlock(folio);
+ if (folio_test_anon(folio))
+ folio_unlock(folio);
 folio_put(folio);
 goto out_ptl;
 } else if (likely(flags & FAULT_FLAG_WRITE)) {

From d531fd2ccf6b5ad95b718b5748e086f8d4aacf56 Mon Sep 17 00:00:00 2001
From: Oscar Salvador
Date: Mon, 30 Jun 2025 16:42:10 +0200
Subject: [PATCH 164/370] mm,hugetlb: rename anon_rmap to new_anon_folio and
 make it boolean

anon_rmap is used to determine whether the newly allocated folio is
anonymous. Rename it to something more meaningful like new_anon_folio
and make it boolean, as we use it like that.

While we are at it, drop 'new_pagecache_folio' as 'new_anon_folio' is
enough to check whether we need to restore the consumed reservation.
Link: https://lkml.kernel.org/r/20250627102904.107202-4-osalvador@suse.de Link: https://lkml.kernel.org/r/20250630144212.156938-4-osalvador@suse.de Signed-off-by: Oscar Salvador Acked-by: David Hildenbrand Cc: Gavin Guo Cc: Muchun Song Cc: Peter Xu Signed-off-by: Andrew Morton --- mm/hugetlb.c | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 68a260e4f4c7..3e5cefd5e1d9 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6406,17 +6406,16 @@ static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm, unsigned static vm_fault_t hugetlb_no_page(struct address_space *mapping, struct vm_fault *vmf) { + u32 hash = hugetlb_fault_mutex_hash(mapping, vmf->pgoff); + bool new_folio, new_anon_folio = false; struct vm_area_struct *vma = vmf->vma; struct mm_struct *mm = vma->vm_mm; struct hstate *h = hstate_vma(vma); vm_fault_t ret = VM_FAULT_SIGBUS; - int anon_rmap = 0; - unsigned long size; - struct folio *folio; - pte_t new_pte; - bool new_folio, new_pagecache_folio = false; - u32 hash = hugetlb_fault_mutex_hash(mapping, vmf->pgoff); bool folio_locked = true; + struct folio *folio; + unsigned long size; + pte_t new_pte; /* * Currently, we are forced to kill the process in the event the @@ -6515,10 +6514,9 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping, ret = VM_FAULT_SIGBUS; goto out; } - new_pagecache_folio = true; } else { + new_anon_folio = true; folio_lock(folio); - anon_rmap = 1; } } else { /* @@ -6567,7 +6565,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping, if (!pte_same(huge_ptep_get(mm, vmf->address, vmf->pte), vmf->orig_pte)) goto backout; - if (anon_rmap) + if (new_anon_folio) hugetlb_add_new_anon_rmap(folio, vma, vmf->address); else hugetlb_add_file_rmap(folio); @@ -6586,7 +6584,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping, * No need to keep file folios locked. See comment in * hugetlb_fault(). */ - if (!anon_rmap) { + if (!new_anon_folio) { folio_locked = false; folio_unlock(folio); } @@ -6622,7 +6620,8 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping, backout: spin_unlock(vmf->ptl); backout_unlocked: - if (new_folio && !new_pagecache_folio) + /* We only need to restore reservations for private mappings */ + if (new_anon_folio) restore_reserve_on_error(h, vma, vmf->address, folio); folio_unlock(folio); From cced784d2cb2a708cdfe4784514c2a10af3af37d Mon Sep 17 00:00:00 2001 From: Oscar Salvador Date: Mon, 30 Jun 2025 16:42:11 +0200 Subject: [PATCH 165/370] mm,hugetlb: drop obsolete comment about non-present pte and second faults There is a comment in hugetlb_fault() that does not hold anymore. This one: /* * vmf.orig_pte could be a migration/hwpoison vmf.orig_pte at this * point, so this check prevents the kernel from going below assuming * that we have an active hugepage in pagecache. This goto expects * the 2nd page fault, and is_hugetlb_entry_(migration|hwpoisoned) * check will properly handle it. */ This was written because back in the day we used to do: hugetlb_fault () { ptep = huge_pte_offset(...) if (ptep) { entry = huge_ptep_get(ptep) if (unlikely(is_hugetlb_entry_migration(entry)) ... else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) ... } ... ... /* * entry could be a migration/hwpoison entry at this point, so this * check prevents the kernel from going below assuming that we have * a active hugepage in pagecache. 
This goto expects the 2nd page fault, * and is_hugetlb_entry_(migration|hwpoisoned) check will properly * handle it. */ if (!pte_present(entry)) goto out_mutex; ... } The code was designed to check for hwpoisoned/migration entries upfront, and then bail out further down if the pte was not present anymore, relying on the second fault to properly handle migration/hwpoison entries that time around. The way we handle this is different nowadays, so drop the misleading comment. Link: https://lkml.kernel.org/r/20250627102904.107202-5-osalvador@suse.de Link: https://lkml.kernel.org/r/20250630144212.156938-5-osalvador@suse.de Signed-off-by: Oscar Salvador Acked-by: David Hildenbrand Cc: Gavin Guo Cc: Muchun Song Cc: Peter Xu Signed-off-by: Andrew Morton --- mm/hugetlb.c | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 3e5cefd5e1d9..eead921831bf 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6727,13 +6727,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, ret = 0; - /* - * vmf.orig_pte could be a migration/hwpoison vmf.orig_pte at this - * point, so this check prevents the kernel from going below assuming - * that we have an active hugepage in pagecache. This goto expects - * the 2nd page fault, and is_hugetlb_entry_(migration|hwpoisoned) - * check will properly handle it. - */ + /* Not present, either a migration or a hwpoisoned entry */ if (!pte_present(vmf.orig_pte)) { if (unlikely(is_hugetlb_entry_migration(vmf.orig_pte))) { /* From 1c0841140b5bd09f03e568d4d9341c874c396511 Mon Sep 17 00:00:00 2001 From: Oscar Salvador Date: Mon, 30 Jun 2025 16:42:12 +0200 Subject: [PATCH 166/370] mm,hugetlb: drop unlikelys from hugetlb_fault The unlikely annotations date back to an era when we were checking for hwpoisoned/migration entries prior to checking whether the pte was present. Currently, we check whether the pte is a migration/hwpoison entry only after we have checked that it is not present, so it must be either one or the other.
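As a reminder of what the annotation buys (a generic illustration, not the hugetlb code; err and handle_error() are made-up names): unlikely() is only a branch-prediction hint, and it only helps when one outcome clearly dominates. Once !pte_present() is known to hold, the entry must be either a migration or a hwpoison entry, so neither branch is rare relative to the other and the hint carries no information.

	/* In the kernel, roughly: */
	#define unlikely(x) __builtin_expect(!!(x), 0)

	if (unlikely(err))	/* worthwhile: errors are rare here */
		return handle_error(err);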
Link: https://lkml.kernel.org/r/20250627102904.107202-6-osalvador@suse.de Link: https://lkml.kernel.org/r/20250630144212.156938-6-osalvador@suse.de Signed-off-by: Oscar Salvador Acked-by: David Hildenbrand Cc: Gavin Guo Cc: Muchun Song Cc: Peter Xu Signed-off-by: Andrew Morton --- mm/hugetlb.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index eead921831bf..f13fa5aa6624 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6729,7 +6729,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, /* Not present, either a migration or a hwpoisoned entry */ if (!pte_present(vmf.orig_pte)) { - if (unlikely(is_hugetlb_entry_migration(vmf.orig_pte))) { + if (is_hugetlb_entry_migration(vmf.orig_pte)) { /* * Release the hugetlb fault lock now, but retain * the vma lock, because it is needed to guard the @@ -6740,7 +6740,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, mutex_unlock(&hugetlb_fault_mutex_table[hash]); migration_entry_wait_huge(vma, vmf.address, vmf.pte); return 0; - } else if (unlikely(is_hugetlb_entry_hwpoisoned(vmf.orig_pte))) + } else if (is_hugetlb_entry_hwpoisoned(vmf.orig_pte)) ret = VM_FAULT_HWPOISON_LARGE | VM_FAULT_SET_HINDEX(hstate_index(h)); goto out_mutex; From 11f45931ccfd14f2f3ca3cf9cafd0e9317e9008a Mon Sep 17 00:00:00 2001 From: Thorsten Blum Date: Mon, 30 Jun 2025 15:23:18 +0200 Subject: [PATCH 167/370] mm/cma: use str_plural() in cma_declare_contiguous_multi() Use the string choice helper function str_plural() to simplify the code and to fix the following Coccinelle/coccicheck warning reported by string_choices.cocci: opportunity for str_plural(nr) Link: https://lkml.kernel.org/r/20250630132318.41339-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum Reviewed-by: Dev Jain Tested-by: Dev Jain Reviewed-by: Anshuman Khandual Acked-by: David Hildenbrand Signed-off-by: Andrew Morton --- mm/cma.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/cma.c b/mm/cma.c index c0b2630a1b81..c40d53298801 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include #include @@ -548,8 +549,7 @@ int __init cma_declare_contiguous_multi(phys_addr_t total_size, (unsigned long)total_size / SZ_1M); else pr_info("Reserved %lu MiB in %d range%s\n", - (unsigned long)total_size / SZ_1M, nr, - nr > 1 ? 
"s" : ""); + (unsigned long)total_size / SZ_1M, nr, str_plural(nr)); return ret; } From 5bd3b163e374462c05c055ff091582d757929d3f Mon Sep 17 00:00:00 2001 From: Xuanye Liu Date: Wed, 2 Jul 2025 15:12:35 +0800 Subject: [PATCH 168/370] mm: fix spelling issue in swap.h recliam -> reclaim Link: https://lkml.kernel.org/r/20250702071235.212794-1-liuqiye2025@163.com Signed-off-by: Xuanye Liu Signed-off-by: Andrew Morton --- include/linux/swap.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index bc0e1c275fc0..a49be950c485 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -415,7 +415,7 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, #define MIN_SWAPPINESS 0 #define MAX_SWAPPINESS 200 -/* Just recliam from anon folios in proactive memory reclaim */ +/* Just reclaim from anon folios in proactive memory reclaim */ #define SWAPPINESS_ANON_ONLY (MAX_SWAPPINESS + 1) extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, From f63f7e9bfbacf8d948e41a481a14d5a14bbbeb64 Mon Sep 17 00:00:00 2001 From: Xuanye Liu Date: Wed, 2 Jul 2025 14:23:59 +0800 Subject: [PATCH 169/370] mm: remove outdated filename comment in percpu-stats.c The comment had the old filename. It's also unnecessary, so drop it. Link: https://lkml.kernel.org/r/20250702062400.207619-1-liuqiye2025@163.com Signed-off-by: Xuanye Liu Signed-off-by: Andrew Morton --- mm/percpu-stats.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c index dd3590dfc23d..9b9d5d6accae 100644 --- a/mm/percpu-stats.c +++ b/mm/percpu-stats.c @@ -1,6 +1,5 @@ // SPDX-License-Identifier: GPL-2.0-only /* - * mm/percpu-debug.c * * Copyright (C) 2017 Facebook Inc. * Copyright (C) 2017 Dennis Zhou From 20089ebd756c3b13c6f24650ab1d518f76ad173d Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Thu, 3 Jul 2025 21:47:09 +0300 Subject: [PATCH 170/370] cma: move __cma_declare_contiguous_nid() before its usage Patch series "cma: factor out allocation logic from __cma_declare_contiguous_nid", v2. We've discussed earlier that HIGHMEM related logic is spread all over __cma_declare_contiguous_nid(). These patches decouple it into helper functions. This patch (of 3): Move __cma_declare_contiguous_nid() before its usage and kill forward declaration. 
Link: https://lkml.kernel.org/r/20250703184711.3485940-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20250703184711.3485940-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Acked-by: Oscar Salvador Acked-by: David Hildenbrand Cc: Alexandre Ghiti Cc: Pratyush Yadav Signed-off-by: Andrew Morton --- mm/cma.c | 294 +++++++++++++++++++++++++++---------------------------- 1 file changed, 144 insertions(+), 150 deletions(-) diff --git a/mm/cma.c b/mm/cma.c index c40d53298801..19d0371abff8 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -36,12 +36,6 @@ struct cma cma_areas[MAX_CMA_AREAS]; unsigned int cma_area_count; -static int __init __cma_declare_contiguous_nid(phys_addr_t *basep, - phys_addr_t size, phys_addr_t limit, - phys_addr_t alignment, unsigned int order_per_bit, - bool fixed, const char *name, struct cma **res_cma, - int nid); - phys_addr_t cma_get_base(const struct cma *cma) { WARN_ON_ONCE(cma->nranges != 1); @@ -359,6 +353,150 @@ static void __init list_insert_sorted( } } +static int __init __cma_declare_contiguous_nid(phys_addr_t *basep, + phys_addr_t size, phys_addr_t limit, + phys_addr_t alignment, unsigned int order_per_bit, + bool fixed, const char *name, struct cma **res_cma, + int nid) +{ + phys_addr_t memblock_end = memblock_end_of_DRAM(); + phys_addr_t highmem_start, base = *basep; + int ret; + + /* + * We can't use __pa(high_memory) directly, since high_memory + * isn't a valid direct map VA, and DEBUG_VIRTUAL will (validly) + * complain. Find the boundary by adding one to the last valid + * address. + */ + if (IS_ENABLED(CONFIG_HIGHMEM)) + highmem_start = __pa(high_memory - 1) + 1; + else + highmem_start = memblock_end_of_DRAM(); + pr_debug("%s(size %pa, base %pa, limit %pa alignment %pa)\n", + __func__, &size, &base, &limit, &alignment); + + if (cma_area_count == ARRAY_SIZE(cma_areas)) { + pr_err("Not enough slots for CMA reserved regions!\n"); + return -ENOSPC; + } + + if (!size) + return -EINVAL; + + if (alignment && !is_power_of_2(alignment)) + return -EINVAL; + + if (!IS_ENABLED(CONFIG_NUMA)) + nid = NUMA_NO_NODE; + + /* Sanitise input arguments. */ + alignment = max_t(phys_addr_t, alignment, CMA_MIN_ALIGNMENT_BYTES); + if (fixed && base & (alignment - 1)) { + pr_err("Region at %pa must be aligned to %pa bytes\n", + &base, &alignment); + return -EINVAL; + } + base = ALIGN(base, alignment); + size = ALIGN(size, alignment); + limit &= ~(alignment - 1); + + if (!base) + fixed = false; + + /* size should be aligned with order_per_bit */ + if (!IS_ALIGNED(size >> PAGE_SHIFT, 1 << order_per_bit)) + return -EINVAL; + + /* + * If allocating at a fixed base the request region must not cross the + * low/high memory boundary. + */ + if (fixed && base < highmem_start && base + size > highmem_start) { + pr_err("Region at %pa defined on low/high memory boundary (%pa)\n", + &base, &highmem_start); + return -EINVAL; + } + + /* + * If the limit is unspecified or above the memblock end, its effective + * value will be the memblock end. Set it explicitly to simplify further + * checks. + */ + if (limit == 0 || limit > memblock_end) + limit = memblock_end; + + if (base + size > limit) { + pr_err("Size (%pa) of region at %pa exceeds limit (%pa)\n", + &size, &base, &limit); + return -EINVAL; + } + + /* Reserve memory */ + if (fixed) { + if (memblock_is_region_reserved(base, size) || + memblock_reserve(base, size) < 0) { + return -EBUSY; + } + } else { + phys_addr_t addr = 0; + + /* + * If there is enough memory, try a bottom-up allocation first. 
+ * It will place the new cma area close to the start of the node + * and guarantee that the compaction is moving pages out of the + * cma area and not into it. + * Avoid using first 4GB to not interfere with constrained zones + * like DMA/DMA32. + */ +#ifdef CONFIG_PHYS_ADDR_T_64BIT + if (!memblock_bottom_up() && memblock_end >= SZ_4G + size) { + memblock_set_bottom_up(true); + addr = memblock_alloc_range_nid(size, alignment, SZ_4G, + limit, nid, true); + memblock_set_bottom_up(false); + } +#endif + + /* + * All pages in the reserved area must come from the same zone. + * If the requested region crosses the low/high memory boundary, + * try allocating from high memory first and fall back to low + * memory in case of failure. + */ + if (!addr && base < highmem_start && limit > highmem_start) { + addr = memblock_alloc_range_nid(size, alignment, + highmem_start, limit, nid, true); + limit = highmem_start; + } + + if (!addr) { + addr = memblock_alloc_range_nid(size, alignment, base, + limit, nid, true); + if (!addr) + return -ENOMEM; + } + + /* + * kmemleak scans/reads tracked objects for pointers to other + * objects but this address isn't mapped and accessible + */ + kmemleak_ignore_phys(addr); + base = addr; + } + + ret = cma_init_reserved_mem(base, size, order_per_bit, name, res_cma); + if (ret) { + memblock_phys_free(base, size); + return ret; + } + + (*res_cma)->nid = nid; + *basep = base; + + return 0; +} + /* * Create CMA areas with a total size of @total_size. A normal allocation * for one area is tried first. If that fails, the biggest memblock @@ -593,150 +731,6 @@ int __init cma_declare_contiguous_nid(phys_addr_t base, return ret; } -static int __init __cma_declare_contiguous_nid(phys_addr_t *basep, - phys_addr_t size, phys_addr_t limit, - phys_addr_t alignment, unsigned int order_per_bit, - bool fixed, const char *name, struct cma **res_cma, - int nid) -{ - phys_addr_t memblock_end = memblock_end_of_DRAM(); - phys_addr_t highmem_start, base = *basep; - int ret; - - /* - * We can't use __pa(high_memory) directly, since high_memory - * isn't a valid direct map VA, and DEBUG_VIRTUAL will (validly) - * complain. Find the boundary by adding one to the last valid - * address. - */ - if (IS_ENABLED(CONFIG_HIGHMEM)) - highmem_start = __pa(high_memory - 1) + 1; - else - highmem_start = memblock_end_of_DRAM(); - pr_debug("%s(size %pa, base %pa, limit %pa alignment %pa)\n", - __func__, &size, &base, &limit, &alignment); - - if (cma_area_count == ARRAY_SIZE(cma_areas)) { - pr_err("Not enough slots for CMA reserved regions!\n"); - return -ENOSPC; - } - - if (!size) - return -EINVAL; - - if (alignment && !is_power_of_2(alignment)) - return -EINVAL; - - if (!IS_ENABLED(CONFIG_NUMA)) - nid = NUMA_NO_NODE; - - /* Sanitise input arguments. */ - alignment = max_t(phys_addr_t, alignment, CMA_MIN_ALIGNMENT_BYTES); - if (fixed && base & (alignment - 1)) { - pr_err("Region at %pa must be aligned to %pa bytes\n", - &base, &alignment); - return -EINVAL; - } - base = ALIGN(base, alignment); - size = ALIGN(size, alignment); - limit &= ~(alignment - 1); - - if (!base) - fixed = false; - - /* size should be aligned with order_per_bit */ - if (!IS_ALIGNED(size >> PAGE_SHIFT, 1 << order_per_bit)) - return -EINVAL; - - /* - * If allocating at a fixed base the request region must not cross the - * low/high memory boundary. 
- */ - if (fixed && base < highmem_start && base + size > highmem_start) { - pr_err("Region at %pa defined on low/high memory boundary (%pa)\n", - &base, &highmem_start); - return -EINVAL; - } - - /* - * If the limit is unspecified or above the memblock end, its effective - * value will be the memblock end. Set it explicitly to simplify further - * checks. - */ - if (limit == 0 || limit > memblock_end) - limit = memblock_end; - - if (base + size > limit) { - pr_err("Size (%pa) of region at %pa exceeds limit (%pa)\n", - &size, &base, &limit); - return -EINVAL; - } - - /* Reserve memory */ - if (fixed) { - if (memblock_is_region_reserved(base, size) || - memblock_reserve(base, size) < 0) { - return -EBUSY; - } - } else { - phys_addr_t addr = 0; - - /* - * If there is enough memory, try a bottom-up allocation first. - * It will place the new cma area close to the start of the node - * and guarantee that the compaction is moving pages out of the - * cma area and not into it. - * Avoid using first 4GB to not interfere with constrained zones - * like DMA/DMA32. - */ -#ifdef CONFIG_PHYS_ADDR_T_64BIT - if (!memblock_bottom_up() && memblock_end >= SZ_4G + size) { - memblock_set_bottom_up(true); - addr = memblock_alloc_range_nid(size, alignment, SZ_4G, - limit, nid, true); - memblock_set_bottom_up(false); - } -#endif - - /* - * All pages in the reserved area must come from the same zone. - * If the requested region crosses the low/high memory boundary, - * try allocating from high memory first and fall back to low - * memory in case of failure. - */ - if (!addr && base < highmem_start && limit > highmem_start) { - addr = memblock_alloc_range_nid(size, alignment, - highmem_start, limit, nid, true); - limit = highmem_start; - } - - if (!addr) { - addr = memblock_alloc_range_nid(size, alignment, base, - limit, nid, true); - if (!addr) - return -ENOMEM; - } - - /* - * kmemleak scans/reads tracked objects for pointers to other - * objects but this address isn't mapped and accessible - */ - kmemleak_ignore_phys(addr); - base = addr; - } - - ret = cma_init_reserved_mem(base, size, order_per_bit, name, res_cma); - if (ret) { - memblock_phys_free(base, size); - return ret; - } - - (*res_cma)->nid = nid; - *basep = base; - - return 0; -} - static void cma_debug_show_areas(struct cma *cma) { unsigned long next_zero_bit, next_set_bit, nr_zero; From bef5871662efe251b9dae937d35b6d0db9fa127e Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Thu, 3 Jul 2025 21:47:10 +0300 Subject: [PATCH 171/370] cma: split reservation of fixed area into a helper function Move the check that verifies that reservation of a fixed area does not cross the HIGHMEM boundary, and the actual memblock_reserve() call, into a helper function. This makes the code more readable and decouples logic related to CONFIG_HIGHMEM from the core functionality of __cma_declare_contiguous_nid().
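The decoupling leans on IS_ENABLED() evaluating to a compile-time constant: the HIGHMEM branch is discarded by dead-code elimination on !CONFIG_HIGHMEM builds while still being parsed and type-checked, unlike an #ifdef block. A generic sketch of the idiom (check_crosses_boundary() is a made-up placeholder):

	if (IS_ENABLED(CONFIG_HIGHMEM)) {
		phys_addr_t highmem_start = __pa(high_memory - 1) + 1;

		/* compiled out entirely when CONFIG_HIGHMEM is unset */
		if (check_crosses_boundary(base, size, highmem_start))
			return -EINVAL;
	}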
Link: https://lkml.kernel.org/r/20250703184711.3485940-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Acked-by: Oscar Salvador Acked-by: David Hildenbrand Cc: Alexandre Ghiti Cc: Pratyush Yadav Signed-off-by: Andrew Morton --- mm/cma.c | 40 +++++++++++++++++++++++++++------------- 1 file changed, 27 insertions(+), 13 deletions(-) diff --git a/mm/cma.c b/mm/cma.c index 19d0371abff8..8521db8768ee 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -353,6 +353,30 @@ static void __init list_insert_sorted( } } +static int __init cma_fixed_reserve(phys_addr_t base, phys_addr_t size) +{ + if (IS_ENABLED(CONFIG_HIGHMEM)) { + phys_addr_t highmem_start = __pa(high_memory - 1) + 1; + + /* + * If allocating at a fixed base the requested region must not + * cross the low/high memory boundary. + */ + if (base < highmem_start && base + size > highmem_start) { + pr_err("Region at %pa defined on low/high memory boundary (%pa)\n", + &base, &highmem_start); + return -EINVAL; + } + } + + if (memblock_is_region_reserved(base, size) || + memblock_reserve(base, size) < 0) { + return -EBUSY; + } + + return 0; +} + static int __init __cma_declare_contiguous_nid(phys_addr_t *basep, phys_addr_t size, phys_addr_t limit, phys_addr_t alignment, unsigned int order_per_bit, @@ -408,15 +432,6 @@ static int __init __cma_declare_contiguous_nid(phys_addr_t *basep, if (!IS_ALIGNED(size >> PAGE_SHIFT, 1 << order_per_bit)) return -EINVAL; - /* - * If allocating at a fixed base the request region must not cross the - * low/high memory boundary. - */ - if (fixed && base < highmem_start && base + size > highmem_start) { - pr_err("Region at %pa defined on low/high memory boundary (%pa)\n", - &base, &highmem_start); - return -EINVAL; - } /* * If the limit is unspecified or above the memblock end, its effective @@ -434,10 +449,9 @@ static int __init __cma_declare_contiguous_nid(phys_addr_t *basep, if (limit == 0 || limit > memblock_end) limit = memblock_end; /* Reserve memory */ if (fixed) { - if (memblock_is_region_reserved(base, size) || - memblock_reserve(base, size) < 0) { - return -EBUSY; - } + ret = cma_fixed_reserve(base, size); + if (ret) + return ret; } else { phys_addr_t addr = 0; From 8aa2c0bf0aa92044c8f20ba250448356da509859 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Thu, 3 Jul 2025 21:47:11 +0300 Subject: [PATCH 172/370] cma: move memory allocation to a helper function __cma_declare_contiguous_nid() tries to allocate memory in several ways: * on systems with 64-bit physical addresses and enough memory, it first attempts to allocate memory just above 4GiB * if that fails, on systems with HIGHMEM the next attempt is from high memory * and at last, if none of the previous attempts succeeded, or none was even tried because of an incompatible configuration, the memory is allocated anywhere within the specified limits. Move all the allocation logic to a helper function to make these steps more obvious.
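Condensed, the fallback chain looks like this (a sketch: alloc_above_4g() and alloc_highmem_first() are made-up placeholders for the two conditional steps; the real cma_alloc_mem() in the diff below also adjusts the limit when falling back from highmem):

	/* 1) 64-bit phys and enough memory: bottom-up above 4GiB */
	addr = alloc_above_4g(size, align, limit, nid);

	/* 2) region crosses the low/high boundary: try highmem first */
	if (!addr)
		addr = alloc_highmem_first(base, size, align, limit, nid);

	/* 3) last resort: anywhere in [base, limit] */
	if (!addr)
		addr = memblock_alloc_range_nid(size, align, base, limit,
						nid, true);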
Link: https://lkml.kernel.org/r/20250703184711.3485940-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Acked-by: David Hildenbrand Acked-by: Oscar Salvador Cc: Alexandre Ghiti Cc: Pratyush Yadav Signed-off-by: Andrew Morton --- mm/cma.c | 104 +++++++++++++++++++++++++++++-------------------------- 1 file changed, 54 insertions(+), 50 deletions(-) diff --git a/mm/cma.c b/mm/cma.c index 8521db8768ee..0cd6f061021f 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -377,6 +377,55 @@ static int __init cma_fixed_reserve(phys_addr_t base, phys_addr_t size) return 0; } +static phys_addr_t __init cma_alloc_mem(phys_addr_t base, phys_addr_t size, + phys_addr_t align, phys_addr_t limit, int nid) +{ + phys_addr_t addr = 0; + + /* + * If there is enough memory, try a bottom-up allocation first. + * It will place the new cma area close to the start of the node + * and guarantee that the compaction is moving pages out of the + * cma area and not into it. + * Avoid using first 4GB to not interfere with constrained zones + * like DMA/DMA32. + */ +#ifdef CONFIG_PHYS_ADDR_T_64BIT + if (!memblock_bottom_up() && limit >= SZ_4G + size) { + memblock_set_bottom_up(true); + addr = memblock_alloc_range_nid(size, align, SZ_4G, limit, + nid, true); + memblock_set_bottom_up(false); + } +#endif + + /* + * On systems with HIGHMEM try allocating from there before consuming + * memory in lower zones. + */ + if (!addr && IS_ENABLED(CONFIG_HIGHMEM)) { + phys_addr_t highmem = __pa(high_memory - 1) + 1; + + /* + * All pages in the reserved area must come from the same zone. + * If the requested region crosses the low/high memory boundary, + * try allocating from high memory first and fall back to low + * memory in case of failure. + */ + if (base < highmem && limit > highmem) { + addr = memblock_alloc_range_nid(size, align, highmem, + limit, nid, true); + limit = highmem; + } + } + + if (!addr) + addr = memblock_alloc_range_nid(size, align, base, limit, nid, + true); + + return addr; +} + static int __init __cma_declare_contiguous_nid(phys_addr_t *basep, phys_addr_t size, phys_addr_t limit, phys_addr_t alignment, unsigned int order_per_bit, @@ -384,19 +433,9 @@ static int __init __cma_declare_contiguous_nid(phys_addr_t *basep, int nid) { phys_addr_t memblock_end = memblock_end_of_DRAM(); - phys_addr_t highmem_start, base = *basep; + phys_addr_t base = *basep; int ret; - /* - * We can't use __pa(high_memory) directly, since high_memory - * isn't a valid direct map VA, and DEBUG_VIRTUAL will (validly) - * complain. Find the boundary by adding one to the last valid - * address. - */ - if (IS_ENABLED(CONFIG_HIGHMEM)) - highmem_start = __pa(high_memory - 1) + 1; - else - highmem_start = memblock_end_of_DRAM(); pr_debug("%s(size %pa, base %pa, limit %pa alignment %pa)\n", __func__, &size, &base, &limit, &alignment); @@ -453,50 +492,15 @@ static int __init __cma_declare_contiguous_nid(phys_addr_t *basep, if (ret) return ret; } else { - phys_addr_t addr = 0; - - /* - * If there is enough memory, try a bottom-up allocation first. - * It will place the new cma area close to the start of the node - * and guarantee that the compaction is moving pages out of the - * cma area and not into it. - * Avoid using first 4GB to not interfere with constrained zones - * like DMA/DMA32. 
- */ -#ifdef CONFIG_PHYS_ADDR_T_64BIT - if (!memblock_bottom_up() && memblock_end >= SZ_4G + size) { - memblock_set_bottom_up(true); - addr = memblock_alloc_range_nid(size, alignment, SZ_4G, - limit, nid, true); - memblock_set_bottom_up(false); - } -#endif - - /* - * All pages in the reserved area must come from the same zone. - * If the requested region crosses the low/high memory boundary, - * try allocating from high memory first and fall back to low - * memory in case of failure. - */ - if (!addr && base < highmem_start && limit > highmem_start) { - addr = memblock_alloc_range_nid(size, alignment, - highmem_start, limit, nid, true); - limit = highmem_start; - } - - if (!addr) { - addr = memblock_alloc_range_nid(size, alignment, base, - limit, nid, true); - if (!addr) - return -ENOMEM; - } + base = cma_alloc_mem(base, size, alignment, limit, nid); + if (!base) + return -ENOMEM; /* * kmemleak scans/reads tracked objects for pointers to other * objects but this address isn't mapped and accessible */ - kmemleak_ignore_phys(addr); - base = addr; + kmemleak_ignore_phys(base); } ret = cma_init_reserved_mem(base, size, order_per_bit, name, res_cma); From 526f36f3f47b9ad29ffb1bf668b7f295287ee11b Mon Sep 17 00:00:00 2001 From: Dev Jain Date: Thu, 3 Jul 2025 12:03:38 +0530 Subject: [PATCH 173/370] maple tree: add some comments Add comments explaining the fields for maple_metadata, since "end" is ambiguous and "gap" can be confused as the largest gap, whereas it is actually the offset of the largest gap. Add comment for mas_ascend() to explain, whose min and max we are trying to find. Explain that, for example, if we are already on offset zero, then the parent min is mas->min, otherwise we need to walk up to find the implied pivot min. Link: https://lkml.kernel.org/r/20250703063338.51509-1-dev.jain@arm.com Signed-off-by: Dev Jain Reviewed-by: Liam R. Howlett Cc: Wei Yang Signed-off-by: Andrew Morton --- include/linux/maple_tree.h | 4 ++-- lib/maple_tree.c | 8 +++++++- 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h index 9ef129038224..bafe143b1f78 100644 --- a/include/linux/maple_tree.h +++ b/include/linux/maple_tree.h @@ -75,8 +75,8 @@ * searching for gaps or any other code that needs to find the end of the data. */ struct maple_metadata { - unsigned char end; - unsigned char gap; + unsigned char end; /* end of data */ + unsigned char gap; /* offset of largest gap */ }; /* diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 0e85e92c5375..b4ee2d29d7a9 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -1053,7 +1053,7 @@ static inline void mte_set_gap(const struct maple_enode *mn, * mas_ascend() - Walk up a level of the tree. * @mas: The maple state * - * Sets the @mas->max and @mas->min to the correct values when walking up. This + * Sets the @mas->max and @mas->min for the parent node of mas->node. This * may cause several levels of walking up to find the correct min and max. * May find a dead node which will cause a premature return. * Return: 1 on dead node, 0 otherwise @@ -1098,6 +1098,12 @@ static int mas_ascend(struct ma_state *mas) min = 0; max = ULONG_MAX; + + /* + * !mas->offset implies that parent node min == mas->min. + * mas->offset > 0 implies that we need to walk up to find the + * implied pivot min. 
+ */ if (!mas->offset) { min = mas->min; set_min = true; From 4e25f85b9f85d238fe012cb18cd040172344d7e6 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Wed, 2 Jul 2025 09:47:17 +0100 Subject: [PATCH 174/370] tools/testing/selftests: add mremap() unfaulted/faulted test cases Assert that mremap() behaviour is as expected when moving around unfaulted VMAs immediately adjacent to faulted ones, as well as moving around faulted VMAs and placing them back immediately adjacent to the VMA from which they were moved. This also introduces a shared helper for the syscall version of mremap() so we don't encounter any issues with libc filtering parameters. Link: https://lkml.kernel.org/r/20250702084717.21360-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Cc: Jann Horn Cc: Liam Howlett Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/merge.c | 599 ++++++++++++++++++++++++++- tools/testing/selftests/mm/vm_util.c | 8 + tools/testing/selftests/mm/vm_util.h | 3 + 3 files changed, 608 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/mm/merge.c b/tools/testing/selftests/mm/merge.c index 150dd5baed2b..cc4253f47f10 100644 --- a/tools/testing/selftests/mm/merge.c +++ b/tools/testing/selftests/mm/merge.c @@ -13,6 +13,7 @@ #include #include #include "vm_util.h" +#include FIXTURE(merge) { @@ -25,7 +26,7 @@ FIXTURE_SETUP(merge) { self->page_size = psize(); /* Carve out PROT_NONE region to map over. */ - self->carveout = mmap(NULL, 12 * self->page_size, PROT_NONE, + self->carveout = mmap(NULL, 30 * self->page_size, PROT_NONE, MAP_ANON | MAP_PRIVATE, -1, 0); ASSERT_NE(self->carveout, MAP_FAILED); /* Setup PROCMAP_QUERY interface. */ @@ -34,7 +35,7 @@ FIXTURE_SETUP(merge) FIXTURE_TEARDOWN(merge) { - ASSERT_EQ(munmap(self->carveout, 12 * self->page_size), 0); + ASSERT_EQ(munmap(self->carveout, 30 * self->page_size), 0); ASSERT_EQ(close_procmap(&self->procmap), 0); /* * Clear unconditionally, as some tests set this. It is no issue if this @@ -576,4 +577,598 @@ TEST_F(merge, ksm_merge) ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 2 * page_size); } +TEST_F(merge, mremap_unfaulted_to_faulted) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + char *ptr, *ptr2; + + /* + * Map two distinct areas: + * + * |-----------| |-----------| + * | unfaulted | | unfaulted | + * |-----------| |-----------| + * ptr ptr2 + */ + ptr = mmap(&carveout[page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + ptr2 = mmap(&carveout[7 * page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + + /* Offset ptr2 further away. 
*/ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000); + ASSERT_NE(ptr2, MAP_FAILED); + + /* + * Fault in ptr: + * \ + * |-----------| / |-----------| + * | faulted | \ | unfaulted | + * |-----------| / |-----------| + * ptr \ ptr2 + */ + ptr[0] = 'x'; + + /* + * Now move ptr2 adjacent to ptr: + * + * |-----------|-----------| + * | faulted | unfaulted | + * |-----------|-----------| + * ptr ptr2 + * + * It should merge: + * + * |----------------------| + * | faulted | + * |----------------------| + * ptr + */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[5 * page_size]); + ASSERT_NE(ptr2, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 10 * page_size); +} + +TEST_F(merge, mremap_unfaulted_behind_faulted) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + char *ptr, *ptr2; + + /* + * Map two distinct areas: + * + * |-----------| |-----------| + * | unfaulted | | unfaulted | + * |-----------| |-----------| + * ptr ptr2 + */ + ptr = mmap(&carveout[6 * page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + ptr2 = mmap(&carveout[14 * page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + + /* Offset ptr2 further away. */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000); + ASSERT_NE(ptr2, MAP_FAILED); + + /* + * Fault in ptr: + * \ + * |-----------| / |-----------| + * | faulted | \ | unfaulted | + * |-----------| / |-----------| + * ptr \ ptr2 + */ + ptr[0] = 'x'; + + /* + * Now move ptr2 adjacent, but behind, ptr: + * + * |-----------|-----------| + * | unfaulted | faulted | + * |-----------|-----------| + * ptr2 ptr + * + * It should merge: + * + * |----------------------| + * | faulted | + * |----------------------| + * ptr2 + */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &carveout[page_size]); + ASSERT_NE(ptr2, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr2)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr2); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr2 + 10 * page_size); +} + +TEST_F(merge, mremap_unfaulted_between_faulted) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + char *ptr, *ptr2, *ptr3; + + /* + * Map three distinct areas: + * + * |-----------| |-----------| |-----------| + * | unfaulted | | unfaulted | | unfaulted | + * |-----------| |-----------| |-----------| + * ptr ptr2 ptr3 + */ + ptr = mmap(&carveout[page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + ptr2 = mmap(&carveout[7 * page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + ptr3 = mmap(&carveout[14 * page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr3, MAP_FAILED); + + /* Offset ptr3 further away. 
*/ + ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr3 + page_size * 2000); + ASSERT_NE(ptr3, MAP_FAILED); + + /* Offset ptr2 further away. */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000); + ASSERT_NE(ptr2, MAP_FAILED); + + /* + * Fault in ptr, ptr3: + * \ \ + * |-----------| / |-----------| / |-----------| + * | faulted | \ | unfaulted | \ | faulted | + * |-----------| / |-----------| / |-----------| + * ptr \ ptr2 \ ptr3 + */ + ptr[0] = 'x'; + ptr3[0] = 'x'; + + /* + * Move ptr3 back into place, leaving a place for ptr2: + * \ + * |-----------| |-----------| / |-----------| + * | faulted | | faulted | \ | unfaulted | + * |-----------| |-----------| / |-----------| + * ptr ptr3 \ ptr2 + */ + ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[10 * page_size]); + ASSERT_NE(ptr3, MAP_FAILED); + + /* + * Finally, move ptr2 into place: + * + * |-----------|-----------|-----------| + * | faulted | unfaulted | faulted | + * |-----------|-----------|-----------| + * ptr ptr2 ptr3 + * + * It should merge, but only ptr, ptr2: + * + * |-----------------------|-----------| + * | faulted | unfaulted | + * |-----------------------|-----------| + */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[5 * page_size]); + ASSERT_NE(ptr2, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 10 * page_size); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr3)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr3); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr3 + 5 * page_size); +} + +TEST_F(merge, mremap_unfaulted_between_faulted_unfaulted) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + char *ptr, *ptr2, *ptr3; + + /* + * Map three distinct areas: + * + * |-----------| |-----------| |-----------| + * | unfaulted | | unfaulted | | unfaulted | + * |-----------| |-----------| |-----------| + * ptr ptr2 ptr3 + */ + ptr = mmap(&carveout[page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + ptr2 = mmap(&carveout[7 * page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + ptr3 = mmap(&carveout[14 * page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr3, MAP_FAILED); + + /* Offset ptr3 further away. */ + ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr3 + page_size * 2000); + ASSERT_NE(ptr3, MAP_FAILED); + + + /* Offset ptr2 further away. 
*/ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000); + ASSERT_NE(ptr2, MAP_FAILED); + + /* + * Fault in ptr: + * \ \ + * |-----------| / |-----------| / |-----------| + * | faulted | \ | unfaulted | \ | unfaulted | + * |-----------| / |-----------| / |-----------| + * ptr \ ptr2 \ ptr3 + */ + ptr[0] = 'x'; + + /* + * Move ptr3 back into place, leaving a place for ptr2: + * \ + * |-----------| |-----------| / |-----------| + * | faulted | | unfaulted | \ | unfaulted | + * |-----------| |-----------| / |-----------| + * ptr ptr3 \ ptr2 + */ + ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[10 * page_size]); + ASSERT_NE(ptr3, MAP_FAILED); + + /* + * Finally, move ptr2 into place: + * + * |-----------|-----------|-----------| + * | faulted | unfaulted | unfaulted | + * |-----------|-----------|-----------| + * ptr ptr2 ptr3 + * + * It should merge: + * + * |-----------------------------------| + * | faulted | + * |-----------------------------------| + */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[5 * page_size]); + ASSERT_NE(ptr2, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size); +} + +TEST_F(merge, mremap_unfaulted_between_correctly_placed_faulted) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + char *ptr, *ptr2; + + /* + * Map one larger area: + * + * |-----------------------------------| + * | unfaulted | + * |-----------------------------------| + */ + ptr = mmap(&carveout[page_size], 15 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* + * Fault in ptr: + * + * |-----------------------------------| + * | faulted | + * |-----------------------------------| + */ + ptr[0] = 'x'; + + /* + * Unmap middle: + * + * |-----------| |-----------| + * | faulted | | faulted | + * |-----------| |-----------| + * + * Now the faulted areas are compatible with each other (anon_vma the + * same, vma->vm_pgoff equal to virtual page offset). 
+ */ + ASSERT_EQ(munmap(&ptr[5 * page_size], 5 * page_size), 0); + + /* + * Map a new area, ptr2: + * \ + * |-----------| |-----------| / |-----------| + * | faulted | | faulted | \ | unfaulted | + * |-----------| |-----------| / |-----------| + * ptr \ ptr2 + */ + ptr2 = mmap(&carveout[20 * page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + + /* + * Finally, move ptr2 into place: + * + * |-----------|-----------|-----------| + * | faulted | unfaulted | faulted | + * |-----------|-----------|-----------| + * ptr ptr2 ptr3 + * + * It should merge: + * + * |-----------------------------------| + * | faulted | + * |-----------------------------------| + */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[5 * page_size]); + ASSERT_NE(ptr2, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size); +} + +TEST_F(merge, mremap_correct_placed_faulted) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + char *ptr, *ptr2, *ptr3; + + /* + * Map one larger area: + * + * |-----------------------------------| + * | unfaulted | + * |-----------------------------------| + */ + ptr = mmap(&carveout[page_size], 15 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* + * Fault in ptr: + * + * |-----------------------------------| + * | faulted | + * |-----------------------------------| + */ + ptr[0] = 'x'; + + /* + * Offset the final and middle 5 pages further away: + * \ \ + * |-----------| / |-----------| / |-----------| + * | faulted | \ | faulted | \ | faulted | + * |-----------| / |-----------| / |-----------| + * ptr \ ptr2 \ ptr3 + */ + ptr3 = &ptr[10 * page_size]; + ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr3 + page_size * 2000); + ASSERT_NE(ptr3, MAP_FAILED); + ptr2 = &ptr[5 * page_size]; + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000); + ASSERT_NE(ptr2, MAP_FAILED); + + /* + * Move ptr2 into its correct place: + * \ + * |-----------|-----------| / |-----------| + * | faulted | faulted | \ | faulted | + * |-----------|-----------| / |-----------| + * ptr ptr2 \ ptr3 + * + * It should merge: + * \ + * |-----------------------| / |-----------| + * | faulted | \ | faulted | + * |-----------------------| / |-----------| + * ptr \ ptr3 + */ + + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[5 * page_size]); + ASSERT_NE(ptr2, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 10 * page_size); + + /* + * Now move ptr out of place: + * \ \ + * |-----------| / |-----------| / |-----------| + * | faulted | \ | faulted | \ | faulted | + * |-----------| / |-----------| / |-----------| + * ptr2 \ ptr \ ptr3 + */ + ptr = sys_mremap(ptr, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr + page_size * 1000); + ASSERT_NE(ptr, MAP_FAILED); + + /* + * Now move ptr back into place: + * \ + * |-----------|-----------| / |-----------| + * | faulted | faulted | \ | faulted | + * |-----------|-----------| / |-----------| + * 
ptr ptr2 \ ptr3 + * + * It should merge: + * \ + * |-----------------------| / |-----------| + * | faulted | \ | faulted | + * |-----------------------| / |-----------| + * ptr \ ptr3 + */ + ptr = sys_mremap(ptr, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &carveout[page_size]); + ASSERT_NE(ptr, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 10 * page_size); + + /* + * Now move ptr out of place again: + * \ \ + * |-----------| / |-----------| / |-----------| + * | faulted | \ | faulted | \ | faulted | + * |-----------| / |-----------| / |-----------| + * ptr2 \ ptr \ ptr3 + */ + ptr = sys_mremap(ptr, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr + page_size * 1000); + ASSERT_NE(ptr, MAP_FAILED); + + /* + * Now move ptr3 back into place: + * \ + * |-----------|-----------| / |-----------| + * | faulted | faulted | \ | faulted | + * |-----------|-----------| / |-----------| + * ptr2 ptr3 \ ptr + * + * It should merge: + * \ + * |-----------------------| / |-----------| + * | faulted | \ | faulted | + * |-----------------------| / |-----------| + * ptr2 \ ptr + */ + ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr2[5 * page_size]); + ASSERT_NE(ptr3, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr2)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr2); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr2 + 10 * page_size); + + /* + * Now move ptr back into place: + * + * |-----------|-----------------------| + * | faulted | faulted | + * |-----------|-----------------------| + * ptr ptr2 + * + * It should merge: + * + * |-----------------------------------| + * | faulted | + * |-----------------------------------| + * ptr + */ + ptr = sys_mremap(ptr, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &carveout[page_size]); + ASSERT_NE(ptr, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size); + + /* + * Now move ptr2 out of the way: + * \ + * |-----------| |-----------| / |-----------| + * | faulted | | faulted | \ | faulted | + * |-----------| |-----------| / |-----------| + * ptr ptr3 \ ptr2 + */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000); + ASSERT_NE(ptr2, MAP_FAILED); + + /* + * Now move it back: + * + * |-----------|-----------|-----------| + * | faulted | faulted | faulted | + * |-----------|-----------|-----------| + * ptr ptr2 ptr3 + * + * It should merge: + * + * |-----------------------------------| + * | faulted | + * |-----------------------------------| + * ptr + */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[5 * page_size]); + ASSERT_NE(ptr2, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size); + + /* + * Move ptr3 out of place: + * \ + * |-----------------------| / |-----------| + * | faulted | \ | faulted | + * |-----------------------| / |-----------| + * ptr \ ptr3 + */ + ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr3 + page_size * 1000); + ASSERT_NE(ptr3, MAP_FAILED); + + /* + * Now move it 
back: + * + * |-----------|-----------|-----------| + * | faulted | faulted | faulted | + * |-----------|-----------|-----------| + * ptr ptr2 ptr3 + * + * It should merge: + * + * |-----------------------------------| + * | faulted | + * |-----------------------------------| + * ptr + */ + ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[10 * page_size]); + ASSERT_NE(ptr3, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size); +} + TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c index 5492e3f784df..1d434772fa54 100644 --- a/tools/testing/selftests/mm/vm_util.c +++ b/tools/testing/selftests/mm/vm_util.c @@ -524,3 +524,11 @@ int read_sysfs(const char *file_path, unsigned long *val) return 0; } + +void *sys_mremap(void *old_address, unsigned long old_size, + unsigned long new_size, int flags, void *new_address) +{ + return (void *)syscall(__NR_mremap, (unsigned long)old_address, + old_size, new_size, flags, + (unsigned long)new_address); +} diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h index b8136d12a0f8..797c24215b17 100644 --- a/tools/testing/selftests/mm/vm_util.h +++ b/tools/testing/selftests/mm/vm_util.h @@ -117,6 +117,9 @@ static inline void log_test_result(int result) ksft_test_result_report(result, "%s\n", test_name); } +void *sys_mremap(void *old_address, unsigned long old_size, + unsigned long new_size, int flags, void *new_address); + /* * On ppc64 this will only work with radix 2M hugepage size */ From fb05f992b6bbb4702307d96f00703ee637b24dbf Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:24:55 +0200 Subject: [PATCH 175/370] mm/balloon_compaction: we cannot have isolated pages in the balloon list MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "mm/migration: rework movable_ops page migration (part 1)", v2. In the future, as we decouple "struct page" from "struct folio", pages that support "non-lru page migration" -- movable_ops page migration such as memory balloons and zsmalloc -- will no longer be folios. They will not have ->mapping, ->lru, and likely no refcount and no page lock. But they will have a type and flags 🙂 This is the first part (other parts not written yet) of decoupling movable_ops page migration from folio migration. In this series, we get rid of the ->mapping usage, and start cleaning up the code + separating it from folio migration. Migration core will have to be further reworked to not treat movable_ops pages like folios. This is the first step into that direction. This patch (of 29): The core will set PG_isolated only after mops->isolate_page() was called. In case of the balloon, that is where we will remove it from the balloon list. So we cannot have isolated pages in the balloon list. Let's drop this unnecessary check. 
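For reference, the ordering that makes the dropped check dead code, in a simplified sketch of the migration core (see isolate_movable_page(); this is not the literal implementation):

	if (mops->isolate_page(page, mode)) {
		/*
		 * For the balloon, ->isolate_page() has already removed
		 * the page from b_dev_info->pages; only now does the
		 * core mark the page isolated.
		 */
		SetPageIsolated(page);
	}

A page still on the balloon list can therefore never have PG_isolated set, which is why balloon_page_list_dequeue() does not need to re-check it.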
Link: https://lkml.kernel.org/r/20250704102524.326966-2-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Zi Yan Reviewed-by: Lorenzo Stoakes Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Cc: Harry Yoo Signed-off-by: Andrew Morton --- mm/balloon_compaction.c | 6 ------ 1 file changed, 6 deletions(-) diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c index d3e00731e262..fcb60233aa35 100644 --- a/mm/balloon_compaction.c +++ b/mm/balloon_compaction.c @@ -94,12 +94,6 @@ size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info, if (!trylock_page(page)) continue; - if (IS_ENABLED(CONFIG_BALLOON_COMPACTION) && - PageIsolated(page)) { - /* raced with isolation */ - unlock_page(page); - continue; - } balloon_page_delete(page); __count_vm_event(BALLOON_DEFLATE); list_add(&page->lru, pages); From 15504b1163007bbfbd9a63460d5c14737c16e96d Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:24:56 +0200 Subject: [PATCH 176/370] mm/balloon_compaction: convert balloon_page_delete() to balloon_page_finalize() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Let's move the removal of the page from the balloon list into the single caller, to remove the dependency on the PG_isolated flag and clarify locking requirements. Note that for now, balloon_page_delete() was used on two paths: (1) Removing a page from the balloon for deflation through balloon_page_list_dequeue() (2) Removing an isolated page from the balloon for migration in the per-driver migration handlers. Isolated pages were already removed from the balloon list during isolation. So instead of relying on the flag, we can just distinguish both cases directly and handle it accordingly in the caller. We'll shuffle the operations a bit such that they logically make more sense (e.g., remove from the list before clearing flags). In balloon migration functions we can now move the balloon_page_finalize() out of the balloon lock and perform the finalization just before dropping the balloon reference. Document that the page lock is currently required when modifying the movability aspects of a page; hopefully we can soon decouple this from the page lock. 
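Condensed, the two call sites end up with the following shape (a sketch; the real code is in the diff below):

	/* (1) deflation, balloon_page_list_dequeue(), pages_lock held: */
	list_del(&page->lru);
	balloon_page_finalize(page);	/* page lock held */

	/*
	 * (2) migration handlers: isolation already removed the page
	 * from the balloon list, so only finalize and drop the ref.
	 */
	balloon_page_finalize(page);	/* outside the balloon lock */
	put_page(page);			/* balloon list reference */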
Link: https://lkml.kernel.org/r/20250704102524.326966-3-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: Harry Yoo Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Cc: Zi Yan Signed-off-by: Andrew Morton --- arch/powerpc/platforms/pseries/cmm.c | 2 +- drivers/misc/vmw_balloon.c | 3 +- drivers/virtio/virtio_balloon.c | 4 +-- include/linux/balloon_compaction.h | 43 +++++++++++----------------- mm/balloon_compaction.c | 3 +- 5 files changed, 21 insertions(+), 34 deletions(-) diff --git a/arch/powerpc/platforms/pseries/cmm.c b/arch/powerpc/platforms/pseries/cmm.c index 5f4037c1d7fe..5e0a718d1be7 100644 --- a/arch/powerpc/platforms/pseries/cmm.c +++ b/arch/powerpc/platforms/pseries/cmm.c @@ -532,7 +532,6 @@ static int cmm_migratepage(struct balloon_dev_info *b_dev_info, spin_lock_irqsave(&b_dev_info->pages_lock, flags); balloon_page_insert(b_dev_info, newpage); - balloon_page_delete(page); b_dev_info->isolated_pages--; spin_unlock_irqrestore(&b_dev_info->pages_lock, flags); @@ -542,6 +541,7 @@ static int cmm_migratepage(struct balloon_dev_info *b_dev_info, */ plpar_page_set_active(page); + balloon_page_finalize(page); /* balloon page list reference */ put_page(page); diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c index c817d8c21641..6653fc53c951 100644 --- a/drivers/misc/vmw_balloon.c +++ b/drivers/misc/vmw_balloon.c @@ -1778,8 +1778,7 @@ static int vmballoon_migratepage(struct balloon_dev_info *b_dev_info, * @pages_lock . We keep holding @comm_lock since we will need it in a * second. 
*/ - balloon_page_delete(page); - + balloon_page_finalize(page); put_page(page); /* Inflate */ diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 89da052f4f68..e299e18346a3 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -866,15 +866,13 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info, tell_host(vb, vb->inflate_vq); /* balloon's page migration 2nd step -- deflate "page" */ - spin_lock_irqsave(&vb_dev_info->pages_lock, flags); - balloon_page_delete(page); - spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags); vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; set_page_pfns(vb, vb->pfns, page); tell_host(vb, vb->deflate_vq); mutex_unlock(&vb->balloon_lock); + balloon_page_finalize(page); put_page(page); /* balloon reference */ return MIGRATEPAGE_SUCCESS; diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index 5ca2d5699620..b9f19da37b08 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -97,27 +97,6 @@ static inline void balloon_page_insert(struct balloon_dev_info *balloon, list_add(&page->lru, &balloon->pages); } -/* - * balloon_page_delete - delete a page from balloon's page list and clear - * the page->private assignement accordingly. - * @page : page to be released from balloon's page list - * - * Caller must ensure the page is locked and the spin_lock protecting balloon - * pages list is held before deleting a page from the balloon device. - */ -static inline void balloon_page_delete(struct page *page) -{ - __ClearPageOffline(page); - __ClearPageMovable(page); - set_page_private(page, 0); - /* - * No touch page.lru field once @page has been isolated - * because VM is using the field. - */ - if (!PageIsolated(page)) - list_del(&page->lru); -} - /* * balloon_page_device - get the b_dev_info descriptor for the balloon device * that enqueues the given page. @@ -141,12 +120,6 @@ static inline void balloon_page_insert(struct balloon_dev_info *balloon, list_add(&page->lru, &balloon->pages); } -static inline void balloon_page_delete(struct page *page) -{ - __ClearPageOffline(page); - list_del(&page->lru); -} - static inline gfp_t balloon_mapping_gfp_mask(void) { return GFP_HIGHUSER; @@ -154,6 +127,22 @@ static inline gfp_t balloon_mapping_gfp_mask(void) #endif /* CONFIG_BALLOON_COMPACTION */ +/* + * balloon_page_finalize - prepare a balloon page that was removed from the + * balloon list for release to the page allocator + * @page: page to be released to the page allocator + * + * Caller must ensure that the page is locked. + */ +static inline void balloon_page_finalize(struct page *page) +{ + if (IS_ENABLED(CONFIG_BALLOON_COMPACTION)) { + __ClearPageMovable(page); + set_page_private(page, 0); + } + __ClearPageOffline(page); +} + /* * balloon_page_push - insert a page into a page list. 
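The invariants being re-checked are established by the caller before any movable_ops callback runs (a simplified sketch of the migration core, not the literal code):

	lock_page(page);
	if (mops->isolate_page(page, mode))
		SetPageIsolated(page);
	/*
	 * ->migrate_page() and ->putback_page() are only ever invoked on
	 * pages that passed isolation above, so on entry the page is
	 * guaranteed to be locked and PG_isolated to be set.
	 */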
* @head : pointer to list diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c index fcb60233aa35..ec176bdb8a78 100644 --- a/mm/balloon_compaction.c +++ b/mm/balloon_compaction.c @@ -94,7 +94,8 @@ size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info, if (!trylock_page(page)) continue; - balloon_page_delete(page); + list_del(&page->lru); + balloon_page_finalize(page); __count_vm_event(BALLOON_DEFLATE); list_add(&page->lru, pages); unlock_page(page); From e22a58a2143f86f55f8d44ccdcf4ed443dde18ad Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:24:57 +0200 Subject: [PATCH 177/370] mm/zsmalloc: drop PageIsolated() related VM_BUG_ONs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Let's drop these checks; these are conditions the core migration code must make sure will hold either way, no need to double check. Link: https://lkml.kernel.org/r/20250704102524.326966-4-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Zi Yan Reviewed-by: Sergey Senozhatsky Acked-by: Harry Yoo Reviewed-by: Lorenzo Stoakes Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- mm/zpdesc.h | 5 ----- mm/zsmalloc.c | 5 ----- 2 files changed, 10 deletions(-) diff --git a/mm/zpdesc.h b/mm/zpdesc.h index d3df316e5bb7..5cb7e3de4395 100644 --- a/mm/zpdesc.h +++ b/mm/zpdesc.h @@ -168,11 +168,6 @@ static inline void __zpdesc_clear_zsmalloc(struct zpdesc *zpdesc) __ClearPageZsmalloc(zpdesc_page(zpdesc)); } -static inline bool zpdesc_is_isolated(struct zpdesc *zpdesc) -{ - return PageIsolated(zpdesc_page(zpdesc)); -} - static inline struct zone *zpdesc_zone(struct zpdesc *zpdesc) { return page_zone(zpdesc_page(zpdesc)); diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 999b513c7fdf..7f1431f2be98 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -1719,8 +1719,6 @@ static bool zs_page_isolate(struct page *page, isolate_mode_t mode) * Page is locked so zspage couldn't be destroyed. For detail, look at * lock_zspage in free_zspage. 
*/ - VM_BUG_ON_PAGE(PageIsolated(page), page); - return true; } @@ -1739,8 +1737,6 @@ static int zs_page_migrate(struct page *newpage, struct page *page, unsigned long old_obj, new_obj; unsigned int obj_idx; - VM_BUG_ON_PAGE(!zpdesc_is_isolated(zpdesc), zpdesc_page(zpdesc)); - /* The page is locked, so this pointer must remain valid */ zspage = get_zspage(zpdesc); pool = zspage->pool; @@ -1811,7 +1807,6 @@ static int zs_page_migrate(struct page *newpage, struct page *page, static void zs_page_putback(struct page *page) { - VM_BUG_ON_PAGE(!PageIsolated(page), page); } static const struct movable_operations zsmalloc_mops = { From 2dfcd1608f3a96364f10de7fcfe28727c0292e5d Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:24:58 +0200 Subject: [PATCH 178/370] mm/page_alloc: let page freeing clear any set page type MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently, any user of page types must clear that type before freeing a page back to the buddy, otherwise we'll run into mapcount-related sanity checks (because the page type currently overlays the page mapcount). Let's allow page type users to skip clearing the page type, by letting the buddy handle it instead. We'll focus on having a page type set on the first page of a larger allocation only. With this change, we can reliably identify typed folios even though they might be in the process of getting freed, which will come in handy in migration code (at least in the transition phase). In the future we might want to warn on some page types. Instead of having an "allow list", let's rather wait until we know about one that should go on such a "disallow list". Link: https://lkml.kernel.org/r/20250704102524.326966-5-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Acked-by: Harry Yoo Reviewed-by: Lorenzo Stoakes Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pérez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S.
Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- mm/page_alloc.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4f55f8ed65c7..6318c85d678e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1380,6 +1380,10 @@ __always_inline bool free_pages_prepare(struct page *page, mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1); page->mapping = NULL; } + if (unlikely(page_has_type(page))) + /* Reset the page_type (which overlays _mapcount) */ + page->page_type = UINT_MAX; + if (is_check_pages_enabled()) { if (free_page_is_bad(page)) bad++; From 65aabd88dffda68639808e0827cfef624a1cd55f Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:24:59 +0200 Subject: [PATCH 179/370] mm/balloon_compaction: make PageOffline sticky until the page is freed MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Let the page freeing code handle clearing the page type. Being able to identify balloon pages until actually freed is a requirement for upcoming movable_ops migration changes. Link: https://lkml.kernel.org/r/20250704102524.326966-6-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Zi Yan Acked-by: Harry Yoo Reviewed-by: Lorenzo Stoakes Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- include/linux/balloon_compaction.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index b9f19da37b08..bfc6e50bd004 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -140,7 +140,7 @@ static inline void balloon_page_finalize(struct page *page) __ClearPageMovable(page); set_page_private(page, 0); } - __ClearPageOffline(page); + /* PageOffline is sticky until the page is freed to the buddy. */ } /* From 5ec3583309ef94fcceed6807aed93b50e801b84a Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:00 +0200 Subject: [PATCH 180/370] mm/zsmalloc: make PageZsmalloc() sticky until the page is freed MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Let the page freeing code handle clearing the page type. Being able to identify balloon pages until actually freed is a requirement for upcoming movable_ops migration changes. 
Link: https://lkml.kernel.org/r/20250704102524.326966-7-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Zi Yan Reviewed-by: Sergey Senozhatsky Acked-by: Harry Yoo Reviewed-by: Lorenzo Stoakes Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- mm/zpdesc.h | 5 ----- mm/zsmalloc.c | 4 ++-- 2 files changed, 2 insertions(+), 7 deletions(-) diff --git a/mm/zpdesc.h b/mm/zpdesc.h index 5cb7e3de4395..5763f3603973 100644 --- a/mm/zpdesc.h +++ b/mm/zpdesc.h @@ -163,11 +163,6 @@ static inline void __zpdesc_set_zsmalloc(struct zpdesc *zpdesc) __SetPageZsmalloc(zpdesc_page(zpdesc)); } -static inline void __zpdesc_clear_zsmalloc(struct zpdesc *zpdesc) -{ - __ClearPageZsmalloc(zpdesc_page(zpdesc)); -} - static inline struct zone *zpdesc_zone(struct zpdesc *zpdesc) { return page_zone(zpdesc_page(zpdesc)); diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 7f1431f2be98..626f09fb2713 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -244,6 +244,7 @@ static inline void free_zpdesc(struct zpdesc *zpdesc) { struct page *page = zpdesc_page(zpdesc); + /* PageZsmalloc is sticky until the page is freed to the buddy. */ __free_page(page); } @@ -880,7 +881,7 @@ static void reset_zpdesc(struct zpdesc *zpdesc) ClearPagePrivate(page); zpdesc->zspage = NULL; zpdesc->next = NULL; - __ClearPageZsmalloc(page); + /* PageZsmalloc is sticky until the page is freed to the buddy. */ } static int trylock_zspage(struct zspage *zspage) @@ -1055,7 +1056,6 @@ static struct zspage *alloc_zspage(struct zs_pool *pool, if (!zpdesc) { while (--i >= 0) { zpdesc_dec_zone_page_state(zpdescs[i]); - __zpdesc_clear_zsmalloc(zpdescs[i]); free_zpdesc(zpdescs[i]); } cache_free_zspage(pool, zspage); From 6ef0c1976b8fab938e732c2fb751fa8965153b2e Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:01 +0200 Subject: [PATCH 181/370] mm/migrate: rename isolate_movable_page() to isolate_movable_ops_page() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ... and start moving back to per-page things that will absolutely not be folio things in the future. Add documentation and a comment that the remaining folio stuff (lock, refcount) will have to be reworked as well. While at it, convert the VM_BUG_ON() into a WARN_ON_ONCE() and handle it gracefully (relevant with further changes), and convert a WARN_ON_ONCE() into a VM_WARN_ON_ONCE_PAGE(). Note that we will leave anything that needs a rework (lock, refcount, ->lru) to be using folios for now: that perfectly highlights the problematic bits. 
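For callers the contract is unchanged by the rename; as a usage sketch (ours, not from this patch, modelled on isolate_folio_to_list(); "movable_pages" is a hypothetical list):

	/* Usage sketch: isolate a movable_ops page and queue it. */
	static bool try_isolate_for_migration(struct page *page,
			struct list_head *movable_pages)
	{
		if (!isolate_movable_ops_page(page, ISOLATE_UNEVICTABLE))
			return false;	/* not movable_ops, isolated, or freed */
		/* Once isolated, the page cannot get freed until it is
		 * putback or migrated. */
		list_add(&page_folio(page)->lru, movable_pages);
		return true;
	}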
Link: https://lkml.kernel.org/r/20250704102524.326966-8-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Harry Yoo Reviewed-by: Lorenzo Stoakes Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pérez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- include/linux/migrate.h | 4 ++-- mm/compaction.c | 2 +- mm/migrate.c | 39 +++++++++++++++++++++++++++++---------- 3 files changed, 32 insertions(+), 13 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index aaa2114498d6..c0ec7422837b 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -69,7 +69,7 @@ int migrate_pages(struct list_head *l, new_folio_t new, free_folio_t free, unsigned long private, enum migrate_mode mode, int reason, unsigned int *ret_succeeded); struct folio *alloc_migration_target(struct folio *src, unsigned long private); -bool isolate_movable_page(struct page *page, isolate_mode_t mode); +bool isolate_movable_ops_page(struct page *page, isolate_mode_t mode); bool isolate_folio_to_list(struct folio *folio, struct list_head *list); int migrate_huge_page_move_mapping(struct address_space *mapping, @@ -90,7 +90,7 @@ static inline int migrate_pages(struct list_head *l, new_folio_t new, static inline struct folio *alloc_migration_target(struct folio *src, unsigned long private) { return NULL; } -static inline bool isolate_movable_page(struct page *page, isolate_mode_t mode) +static inline bool isolate_movable_ops_page(struct page *page, isolate_mode_t mode) { return false; } static inline bool isolate_folio_to_list(struct folio *folio, struct list_head *list) { return false; } diff --git a/mm/compaction.c b/mm/compaction.c index 3925cb61dbb8..17455c5a4be0 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1093,7 +1093,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, locked = NULL; } - if (isolate_movable_page(page, mode)) { + if (isolate_movable_ops_page(page, mode)) { folio = page_folio(page); goto isolate_success; } diff --git a/mm/migrate.c b/mm/migrate.c index 208d2d4a2f8d..2e648d75248e 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -51,8 +51,26 @@ #include "internal.h" #include "swap.h" -bool isolate_movable_page(struct page *page, isolate_mode_t mode) +/** + * isolate_movable_ops_page - isolate a movable_ops page for migration + * @page: The page. + * @mode: The isolation mode. + * + * Try to isolate a movable_ops page for migration. Will fail if the page is + * not a movable_ops page, if the page is already isolated for migration + * or if the page was just released by its owner. + * + * Once isolated, the page cannot get freed until it is either putback + * or migrated. + * + * Returns true if isolation succeeded, otherwise false.
+ */ +bool isolate_movable_ops_page(struct page *page, isolate_mode_t mode) { + /* + * TODO: these pages will not be folios in the future. All + * folio dependencies will have to be removed. + */ struct folio *folio = folio_get_nontail_page(page); const struct movable_operations *mops; @@ -73,7 +91,7 @@ bool isolate_movable_page(struct page *page, isolate_mode_t mode) * we use non-atomic bitops on newly allocated page flags so * unconditionally grabbing the lock ruins page's owner side. */ - if (unlikely(!__folio_test_movable(folio))) + if (unlikely(!__PageMovable(page))) goto out_putfolio; /* @@ -90,18 +108,19 @@ bool isolate_movable_page(struct page *page, isolate_mode_t mode) if (unlikely(!folio_trylock(folio))) goto out_putfolio; - if (!folio_test_movable(folio) || folio_test_isolated(folio)) + if (!PageMovable(page) || PageIsolated(page)) goto out_no_isolated; - mops = folio_movable_ops(folio); - VM_BUG_ON_FOLIO(!mops, folio); + mops = page_movable_ops(page); + if (WARN_ON_ONCE(!mops)) + goto out_no_isolated; - if (!mops->isolate_page(&folio->page, mode)) + if (!mops->isolate_page(page, mode)) goto out_no_isolated; /* Driver shouldn't use the isolated flag */ - WARN_ON_ONCE(folio_test_isolated(folio)); - folio_set_isolated(folio); + VM_WARN_ON_ONCE_PAGE(PageIsolated(page), page); + SetPageIsolated(page); folio_unlock(folio); return true; @@ -175,8 +194,8 @@ bool isolate_folio_to_list(struct folio *folio, struct list_head *list) if (lru) isolated = folio_isolate_lru(folio); else - isolated = isolate_movable_page(&folio->page, - ISOLATE_UNEVICTABLE); + isolated = isolate_movable_ops_page(&folio->page, + ISOLATE_UNEVICTABLE); if (!isolated) return false; From d808f1f672a17441f7ef0eff2f0f387d6267ed7c Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:02 +0200 Subject: [PATCH 182/370] mm/migrate: rename putback_movable_folio() to putback_movable_ops_page() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ... and factor the complete handling of movable_ops pages out. Convert it similar to isolate_movable_ops_page(). While at it, convert the VM_BUG_ON_FOLIO() into a VM_WARN_ON_PAGE(). Link: https://lkml.kernel.org/r/20250704102524.326966-9-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Reviewed-by: Harry Yoo Reviewed-by: Zi Yan Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. 
Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- mm/migrate.c | 35 +++++++++++++++++++++++------------ 1 file changed, 23 insertions(+), 12 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 2e648d75248e..c3cd66b05fe2 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -133,12 +133,30 @@ bool isolate_movable_ops_page(struct page *page, isolate_mode_t mode) return false; } -static void putback_movable_folio(struct folio *folio) +/** + * putback_movable_ops_page - putback an isolated movable_ops page + * @page: The isolated page. + * + * Putback an isolated movable_ops page. + * + * After the page was putback, it might get freed instantly. + */ +static void putback_movable_ops_page(struct page *page) { - const struct movable_operations *mops = folio_movable_ops(folio); + /* + * TODO: these pages will not be folios in the future. All + * folio dependencies will have to be removed. + */ + struct folio *folio = page_folio(page); - mops->putback_page(&folio->page); - folio_clear_isolated(folio); + VM_WARN_ON_ONCE_PAGE(!PageIsolated(page), page); + folio_lock(folio); + /* If the page was released by it's owner, there is nothing to do. */ + if (PageMovable(page)) + page_movable_ops(page)->putback_page(page); + ClearPageIsolated(page); + folio_unlock(folio); + folio_put(folio); } /* @@ -166,14 +184,7 @@ void putback_movable_pages(struct list_head *l) * have PAGE_MAPPING_MOVABLE. */ if (unlikely(__folio_test_movable(folio))) { - VM_BUG_ON_FOLIO(!folio_test_isolated(folio), folio); - folio_lock(folio); - if (folio_test_movable(folio)) - putback_movable_folio(folio); - else - folio_clear_isolated(folio); - folio_unlock(folio); - folio_put(folio); + putback_movable_ops_page(&folio->page); } else { node_stat_mod_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio), -folio_nr_pages(folio)); From b9ed00483d4cbacca04edb11984d8daf09e9ae22 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:03 +0200 Subject: [PATCH 183/370] mm/migrate: factor out movable_ops page handling into migrate_movable_ops_page() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Let's factor it out, simplifying the calling code. Before this change, we would have called flush_dcache_folio() also on movable_ops pages. As documented in Documentation/core-api/cachetlb.rst: "This routine need only be called for page cache pages which can potentially ever be mapped into the address space of a user process." So don't do it for movable_ops pages. If there would ever be such a movable_ops page user, it should do the flushing itself after performing the copy. Note that we can now change folio_mapping_flags() to folio_test_anon() to make it clearer, because movable_ops pages will never take that path. 
[akpm@linux-foundation.org: fix kerneldoc] Link: https://lkml.kernel.org/r/20250704102524.326966-10-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Lorenzo Stoakes Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: Harry Yoo Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- mm/migrate.c | 84 +++++++++++++++++++++++++++++----------------------- 1 file changed, 47 insertions(+), 37 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index c3cd66b05fe2..8c4d5837db53 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -159,6 +159,47 @@ static void putback_movable_ops_page(struct page *page) folio_put(folio); } +/** + * migrate_movable_ops_page - migrate an isolated movable_ops page + * @dst: The destination page. + * @src: The source page. + * @mode: The migration mode. + * + * Migrate an isolated movable_ops page. + * + * If the src page was already released by its owner, the src page is + * un-isolated (putback) and migration succeeds; the migration core will be the + * owner of both pages. + * + * If the src page was not released by its owner and the migration was + * successful, the owner of the src page and the dst page are swapped and + * the src page is un-isolated. + * + * If migration fails, the ownership stays unmodified and the src page + * remains isolated: migration may be retried later or the page can be putback. + * + * TODO: migration core will treat both pages as folios and lock them before + * this call to unlock them after this call. Further, the folio refcounts on + * src and dst are also released by migration core. These pages will not be + * folios in the future, so that must be reworked. + * + * Returns MIGRATEPAGE_SUCCESS on success, otherwise a negative error + * code. + */ +static int migrate_movable_ops_page(struct page *dst, struct page *src, + enum migrate_mode mode) +{ + int rc = MIGRATEPAGE_SUCCESS; + + VM_WARN_ON_ONCE_PAGE(!PageIsolated(src), src); + /* If the page was released by it's owner, there is nothing to do. */ + if (PageMovable(src)) + rc = page_movable_ops(src)->migrate_page(dst, src, mode); + if (rc == MIGRATEPAGE_SUCCESS) + ClearPageIsolated(src); + return rc; +} + /* * Put previously isolated pages back onto the appropriate lists * from where they were once taken off for compaction/migration. @@ -1023,51 +1064,20 @@ static int move_to_new_folio(struct folio *dst, struct folio *src, mode); else rc = fallback_migrate_folio(mapping, dst, src, mode); - } else { - const struct movable_operations *mops; - /* - * In case of non-lru page, it could be released after - * isolation step. In that case, we shouldn't try migration. 
- */ - VM_BUG_ON_FOLIO(!folio_test_isolated(src), src); - if (!folio_test_movable(src)) { - rc = MIGRATEPAGE_SUCCESS; - folio_clear_isolated(src); + if (rc != MIGRATEPAGE_SUCCESS) goto out; - } - - mops = folio_movable_ops(src); - rc = mops->migrate_page(&dst->page, &src->page, mode); - WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS && - !folio_test_isolated(src)); - } - - /* - * When successful, old pagecache src->mapping must be cleared before - * src is freed; but stats require that PageAnon be left as PageAnon. - */ - if (rc == MIGRATEPAGE_SUCCESS) { - if (__folio_test_movable(src)) { - VM_BUG_ON_FOLIO(!folio_test_isolated(src), src); - - /* - * We clear PG_movable under page_lock so any compactor - * cannot try to migrate this page. - */ - folio_clear_isolated(src); - } - /* - * Anonymous and movable src->mapping will be cleared by - * free_pages_prepare so don't reset it here for keeping - * the type to work PageAnon, for example. + * For pagecache folios, src->mapping must be cleared before src + * is freed. Anonymous folios must stay anonymous until freed. */ - if (!folio_mapping_flags(src)) + if (!folio_test_anon(src)) src->mapping = NULL; if (likely(!folio_is_zone_device(dst))) flush_dcache_folio(dst); + } else { + rc = migrate_movable_ops_page(&dst->page, &src->page, mode); } out: return rc; From 07e5355eeead3a715bb48ffe13499dd3d0178e52 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:04 +0200 Subject: [PATCH 184/370] mm/migrate: remove folio_test_movable() and folio_movable_ops() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Folios will have nothing to do with movable_ops page migration. These functions are now unused, so let's remove them. Note that __folio_test_movable() and friends will be removed separately next, after more rework. Link: https://lkml.kernel.org/r/20250704102524.326966-11-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Lorenzo Stoakes Reviewed-by: Harry Yoo Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. 
Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- include/linux/migrate.h | 14 -------------- 1 file changed, 14 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index c0ec7422837b..c99a00d4ca27 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -118,20 +118,6 @@ static inline void __ClearPageMovable(struct page *page) } #endif -static inline bool folio_test_movable(struct folio *folio) -{ - return PageMovable(&folio->page); -} - -static inline -const struct movable_operations *folio_movable_ops(struct folio *folio) -{ - VM_BUG_ON(!__folio_test_movable(folio)); - - return (const struct movable_operations *) - ((unsigned long)folio->mapping - PAGE_MAPPING_MOVABLE); -} - static inline const struct movable_operations *page_movable_ops(struct page *page) { From be4a3e9c185264e9ad0fe02c1c5d81b8386bd50c Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:05 +0200 Subject: [PATCH 185/370] mm/migrate: move movable_ops page handling out of move_to_new_folio() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Let's move that handling directly into migrate_folio_move(), so we can simplify move_to_new_folio(). While at it, fixup the documentation a bit. Note that unmap_and_move_huge_page() does not care, because it only deals with actual folios. (we only support migration of individual movable_ops pages) Link: https://lkml.kernel.org/r/20250704102524.326966-12-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Harry Yoo Reviewed-by: Lorenzo Stoakes Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- mm/migrate.c | 63 +++++++++++++++++++++++++--------------------------- 1 file changed, 30 insertions(+), 33 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 8c4d5837db53..61e98ed46f13 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1026,11 +1026,12 @@ static int fallback_migrate_folio(struct address_space *mapping, } /* - * Move a page to a newly allocated page - * The page is locked and all ptes have been successfully removed. + * Move a src folio to a newly allocated dst folio. * - * The new page will have replaced the old page if this function - * is successful. + * The src and dst folios are locked and the src folios was unmapped from + * the page tables. + * + * On success, the src folio was replaced by the dst folio. 
* * Return value: * < 0 - error code @@ -1039,34 +1040,30 @@ static int fallback_migrate_folio(struct address_space *mapping, static int move_to_new_folio(struct folio *dst, struct folio *src, enum migrate_mode mode) { + struct address_space *mapping = folio_mapping(src); int rc = -EAGAIN; - bool is_lru = !__folio_test_movable(src); VM_BUG_ON_FOLIO(!folio_test_locked(src), src); VM_BUG_ON_FOLIO(!folio_test_locked(dst), dst); - if (likely(is_lru)) { - struct address_space *mapping = folio_mapping(src); + if (!mapping) + rc = migrate_folio(mapping, dst, src, mode); + else if (mapping_inaccessible(mapping)) + rc = -EOPNOTSUPP; + else if (mapping->a_ops->migrate_folio) + /* + * Most folios have a mapping and most filesystems + * provide a migrate_folio callback. Anonymous folios + * are part of swap space which also has its own + * migrate_folio callback. This is the most common path + * for page migration. + */ + rc = mapping->a_ops->migrate_folio(mapping, dst, src, + mode); + else + rc = fallback_migrate_folio(mapping, dst, src, mode); - if (!mapping) - rc = migrate_folio(mapping, dst, src, mode); - else if (mapping_inaccessible(mapping)) - rc = -EOPNOTSUPP; - else if (mapping->a_ops->migrate_folio) - /* - * Most folios have a mapping and most filesystems - * provide a migrate_folio callback. Anonymous folios - * are part of swap space which also has its own - * migrate_folio callback. This is the most common path - * for page migration. - */ - rc = mapping->a_ops->migrate_folio(mapping, dst, src, - mode); - else - rc = fallback_migrate_folio(mapping, dst, src, mode); - - if (rc != MIGRATEPAGE_SUCCESS) - goto out; + if (rc == MIGRATEPAGE_SUCCESS) { /* * For pagecache folios, src->mapping must be cleared before src * is freed. Anonymous folios must stay anonymous until freed. @@ -1076,10 +1073,7 @@ static int move_to_new_folio(struct folio *dst, struct folio *src, if (likely(!folio_is_zone_device(dst))) flush_dcache_folio(dst); - } else { - rc = migrate_movable_ops_page(&dst->page, &src->page, mode); } -out: return rc; } @@ -1330,20 +1324,23 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private, int rc; int old_page_state = 0; struct anon_vma *anon_vma = NULL; - bool is_lru = !__folio_test_movable(src); struct list_head *prev; __migrate_folio_extract(dst, &old_page_state, &anon_vma); prev = dst->lru.prev; list_del(&dst->lru); + if (unlikely(__folio_test_movable(src))) { + rc = migrate_movable_ops_page(&dst->page, &src->page, mode); + if (rc) + goto out; + goto out_unlock_both; + } + rc = move_to_new_folio(dst, src, mode); if (rc) goto out; - if (unlikely(!is_lru)) - goto out_unlock_both; - /* * When successful, push dst to LRU immediately: so that if it * turns out to be an mlocked page, remove_migration_ptes() will From a109262734c527296f0a945825f568c78dc804f2 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:06 +0200 Subject: [PATCH 186/370] mm/zsmalloc: stop using __ClearPageMovable() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Instead, let's check in the callbacks if the page was already destroyed, which can be checked by looking at zpdesc->zspage (see reset_zpdesc()). If we detect that the page was destroyed: (1) Fail isolation, just like the migration core would (2) Fake migration success just like the migration core would In the putback case there is nothing to do, as we don't do anything just like the migration core would do. 
In the future, we should look into not letting these pages get destroyed while they are isolated -- and instead delaying that to the putback/migration call. Add a TODO for that. Link: https://lkml.kernel.org/r/20250704102524.326966-13-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Harry Yoo Reviewed-by: Lorenzo Stoakes Reviewed-by: Sergey Senozhatsky Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/zsmalloc.c | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 626f09fb2713..b12250e219bb 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -877,7 +877,6 @@ static void reset_zpdesc(struct zpdesc *zpdesc) { struct page *page = zpdesc_page(zpdesc); - __ClearPageMovable(page); ClearPagePrivate(page); zpdesc->zspage = NULL; zpdesc->next = NULL; @@ -1716,10 +1715,11 @@ static void replace_sub_page(struct size_class *class, struct zspage *zspage, static bool zs_page_isolate(struct page *page, isolate_mode_t mode) { /* - * Page is locked so zspage couldn't be destroyed. For detail, look at - * lock_zspage in free_zspage. + * Page is locked so zspage can't be destroyed concurrently + * (see free_zspage()). But if the page was already destroyed + * (see reset_zpdesc()), refuse isolation here. */ - return true; + return page_zpdesc(page)->zspage; } static int zs_page_migrate(struct page *newpage, struct page *page, @@ -1737,6 +1737,16 @@ static int zs_page_migrate(struct page *newpage, struct page *page, unsigned long old_obj, new_obj; unsigned int obj_idx; + /* + * TODO: nothing prevents a zspage from getting destroyed while + * it is isolated for migration, as the page lock is temporarily + * dropped after zs_page_isolate() succeeded: we should rework that + * and defer destroying such pages once they are un-isolated (putback) + * instead. + */ + if (!zpdesc->zspage) + return MIGRATEPAGE_SUCCESS; + /* The page is locked, so this pointer must remain valid */ zspage = get_zspage(zpdesc); pool = zspage->pool; From 3544c4faccb8f0867bc65f8007ee70bfb5054305 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:07 +0200 Subject: [PATCH 187/370] mm/balloon_compaction: stop using __ClearPageMovable() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit We can just look at the balloon device (stored in page->private), to see if the page is still part of the balloon. As isolated balloon pages cannot get released (they are taken off the balloon list while isolated), we don't have to worry about this case in the putback and migration callback. Add a WARN_ON_ONCE for now. 
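Concretely, every balloon callback now derives liveness from page->private; a sketch of the common guard (the helper name is ours, the check mirrors the hunks below):

	/* Sketch: is the page still part of a balloon device? */
	static bool balloon_page_still_inflated(struct page *page)
	{
		/* balloon_page_device() reads page->private, which
		 * balloon_page_finalize() clears on deflate. */
		return balloon_page_device(page) != NULL;
	}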
Link: https://lkml.kernel.org/r/20250704102524.326966-14-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pérez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: Harry Yoo Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/balloon_compaction.h | 4 +--- mm/balloon_compaction.c | 11 +++++++++++ 2 files changed, 12 insertions(+), 3 deletions(-) diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index bfc6e50bd004..9bce8e9f5018 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -136,10 +136,8 @@ static inline gfp_t balloon_mapping_gfp_mask(void) */ static inline void balloon_page_finalize(struct page *page) { - if (IS_ENABLED(CONFIG_BALLOON_COMPACTION)) { - __ClearPageMovable(page); + if (IS_ENABLED(CONFIG_BALLOON_COMPACTION)) set_page_private(page, 0); - } /* PageOffline is sticky until the page is freed to the buddy. */ } diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c index ec176bdb8a78..e4f1a122d786 100644 --- a/mm/balloon_compaction.c +++ b/mm/balloon_compaction.c @@ -206,6 +206,9 @@ static bool balloon_page_isolate(struct page *page, isolate_mode_t mode) struct balloon_dev_info *b_dev_info = balloon_page_device(page); unsigned long flags; + if (!b_dev_info) + return false; + spin_lock_irqsave(&b_dev_info->pages_lock, flags); list_del(&page->lru); b_dev_info->isolated_pages++; @@ -219,6 +222,10 @@ static void balloon_page_putback(struct page *page) struct balloon_dev_info *b_dev_info = balloon_page_device(page); unsigned long flags; + /* Isolated balloon pages cannot get deflated. */ + if (WARN_ON_ONCE(!b_dev_info)) + return; + spin_lock_irqsave(&b_dev_info->pages_lock, flags); list_add(&page->lru, &b_dev_info->pages); b_dev_info->isolated_pages--; @@ -234,6 +241,10 @@ static int balloon_page_migrate(struct page *newpage, struct page *page, VM_BUG_ON_PAGE(!PageLocked(page), page); VM_BUG_ON_PAGE(!PageLocked(newpage), newpage); + /* Isolated balloon pages cannot get deflated. */ + if (WARN_ON_ONCE(!balloon)) + return -EAGAIN; + return balloon->migratepage(balloon, newpage, page, mode); } From 34727dee04994c8ceb1bb8a927af0a88e52e103c Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:08 +0200 Subject: [PATCH 188/370] mm/migrate: remove __ClearPageMovable() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Unused, let's remove it. The Chinese docs in Documentation/translations/zh_CN/mm/page_migration.rst still mention it, but that whole document is bound to become outdated and should be updated by somebody who actually speaks that language.
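With the clearing side gone, the movable_ops marking simply follows the page's lifetime: set once at setup, implicitly undone on free. A lifecycle sketch (ours, not from this patch; error handling omitted):

	/* Mark once under the page lock, as __SetPageMovable() requires. */
	static void example_adopt_page(struct page *page,
			const struct movable_operations *mops)
	{
		lock_page(page);
		__SetPageMovable(page, mops);
		unlock_page(page);
	}

	/* No explicit clear on teardown: freeing the page (e.g. plain
	 * __free_page(page)) resets page->mapping and the page type. */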
Link: https://lkml.kernel.org/r/20250704102524.326966-15-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Harry Yoo Reviewed-by: Lorenzo Stoakes Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- include/linux/migrate.h | 8 ++------ mm/compaction.c | 11 ----------- 2 files changed, 2 insertions(+), 17 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index c99a00d4ca27..6eeda8eb1e0d 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -35,8 +35,8 @@ struct migration_target_control; * @src page. The driver should copy the contents of the * @src page to the @dst page and set up the fields of @dst page. * Both pages are locked. - * If page migration is successful, the driver should call - * __ClearPageMovable(@src) and return MIGRATEPAGE_SUCCESS. + * If page migration is successful, the driver should + * return MIGRATEPAGE_SUCCESS. * If the driver cannot migrate the page at the moment, it can return * -EAGAIN. The VM interprets this as a temporary migration failure and * will retry it later. Any other error value is a permanent migration @@ -106,16 +106,12 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #ifdef CONFIG_COMPACTION bool PageMovable(struct page *page); void __SetPageMovable(struct page *page, const struct movable_operations *ops); -void __ClearPageMovable(struct page *page); #else static inline bool PageMovable(struct page *page) { return false; } static inline void __SetPageMovable(struct page *page, const struct movable_operations *ops) { } -static inline void __ClearPageMovable(struct page *page) -{ -} #endif static inline diff --git a/mm/compaction.c b/mm/compaction.c index 17455c5a4be0..889ec696ba96 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -137,17 +137,6 @@ void __SetPageMovable(struct page *page, const struct movable_operations *mops) } EXPORT_SYMBOL(__SetPageMovable); -void __ClearPageMovable(struct page *page) -{ - VM_BUG_ON_PAGE(!PageMovable(page), page); - /* - * This page still has the type of a movable page, but it's - * actually not movable any more. 
- */ - page->mapping = (void *)PAGE_MAPPING_MOVABLE; -} -EXPORT_SYMBOL(__ClearPageMovable); - /* Do not skip compaction more than 64 times */ #define COMPACT_MAX_DEFER_SHIFT 6 From 22d103aef090dc688a88881fb955376dec1228d5 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:09 +0200 Subject: [PATCH 189/370] mm/migration: remove PageMovable() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Previously, if __ClearPageMovable() were invoked on a page, this would cause __PageMovable() to return false, but due to the continued existence of page movable ops, PageMovable() would have returned true. With __ClearPageMovable() gone, the two are exactly equivalent. So we can replace PageMovable() checks by __PageMovable(). In fact, __PageMovable() cannot change until a page is freed, so we can turn some PageMovable() into sanity checks for __PageMovable(). Link: https://lkml.kernel.org/r/20250704102524.326966-16-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Lorenzo Stoakes Reviewed-by: Harry Yoo Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- include/linux/migrate.h | 2 -- mm/compaction.c | 15 --------------- mm/migrate.c | 18 ++++++++++-------- 3 files changed, 10 insertions(+), 25 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 6eeda8eb1e0d..25659a685e2a 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -104,10 +104,8 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #endif /* CONFIG_MIGRATION */ #ifdef CONFIG_COMPACTION -bool PageMovable(struct page *page); void __SetPageMovable(struct page *page, const struct movable_operations *ops); #else -static inline bool PageMovable(struct page *page) { return false; } static inline void __SetPageMovable(struct page *page, const struct movable_operations *ops) { diff --git a/mm/compaction.c b/mm/compaction.c index 889ec696ba96..5c3737301701 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -114,21 +114,6 @@ static unsigned long release_free_list(struct list_head *freepages) } #ifdef CONFIG_COMPACTION -bool PageMovable(struct page *page) -{ - const struct movable_operations *mops; - - VM_BUG_ON_PAGE(!PageLocked(page), page); - if (!__PageMovable(page)) - return false; - - mops = page_movable_ops(page); - if (mops) - return true; - - return false; -} - void __SetPageMovable(struct page *page, const struct movable_operations *mops) { VM_BUG_ON_PAGE(!PageLocked(page), page); diff --git a/mm/migrate.c b/mm/migrate.c index 61e98ed46f13..1f07c8f1fb74 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -87,9 +87,12 @@ bool isolate_movable_ops_page(struct page *page, isolate_mode_t mode) goto out; /* - * Check 
movable flag before taking the page lock because + * Check for movable_ops pages before taking the page lock because * we use non-atomic bitops on newly allocated page flags so * unconditionally grabbing the lock ruins page's owner side. + * + * Note that once a page has movable_ops, it will stay that way + * until the page was freed. */ if (unlikely(!__PageMovable(page))) goto out_putfolio; @@ -108,7 +111,8 @@ bool isolate_movable_ops_page(struct page *page, isolate_mode_t mode) if (unlikely(!folio_trylock(folio))) goto out_putfolio; - if (!PageMovable(page) || PageIsolated(page)) + VM_WARN_ON_ONCE_PAGE(!__PageMovable(page), page); + if (PageIsolated(page)) goto out_no_isolated; mops = page_movable_ops(page); @@ -149,11 +153,10 @@ static void putback_movable_ops_page(struct page *page) */ struct folio *folio = page_folio(page); + VM_WARN_ON_ONCE_PAGE(!__PageMovable(page), page); VM_WARN_ON_ONCE_PAGE(!PageIsolated(page), page); folio_lock(folio); - /* If the page was released by it's owner, there is nothing to do. */ - if (PageMovable(page)) - page_movable_ops(page)->putback_page(page); + page_movable_ops(page)->putback_page(page); ClearPageIsolated(page); folio_unlock(folio); folio_put(folio); @@ -191,10 +194,9 @@ static int migrate_movable_ops_page(struct page *dst, struct page *src, { int rc = MIGRATEPAGE_SUCCESS; + VM_WARN_ON_ONCE_PAGE(!__PageMovable(src), src); VM_WARN_ON_ONCE_PAGE(!PageIsolated(src), src); - /* If the page was released by it's owner, there is nothing to do. */ - if (PageMovable(src)) - rc = page_movable_ops(src)->migrate_page(dst, src, mode); + rc = page_movable_ops(src)->migrate_page(dst, src, mode); if (rc == MIGRATEPAGE_SUCCESS) ClearPageIsolated(src); return rc; From d4fb4587bd73b6b773397f5fed52a5e4bd4dec8b Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:10 +0200 Subject: [PATCH 190/370] mm: rename __PageMovable() to page_has_movable_ops() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Let's make it clearer that we are talking about movable_ops pages. While at it, convert a VM_BUG_ON to a VM_WARN_ON_ONCE_PAGE. Link: https://lkml.kernel.org/r/20250704102524.326966-17-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Lorenzo Stoakes Reviewed-by: Harry Yoo Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. 
Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- include/linux/migrate.h | 2 +- include/linux/page-flags.h | 2 +- mm/compaction.c | 7 ++----- mm/memory-failure.c | 4 ++-- mm/memory_hotplug.c | 10 ++++------ mm/migrate.c | 8 ++++---- mm/page_alloc.c | 2 +- mm/page_isolation.c | 10 +++++----- 8 files changed, 20 insertions(+), 25 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 25659a685e2a..e04035f70e36 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -115,7 +115,7 @@ static inline void __SetPageMovable(struct page *page, static inline const struct movable_operations *page_movable_ops(struct page *page) { - VM_BUG_ON(!__PageMovable(page)); + VM_WARN_ON_ONCE_PAGE(!page_has_movable_ops(page), page); return (const struct movable_operations *) ((unsigned long)page->mapping - PAGE_MAPPING_MOVABLE); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 4fe5ee67535b..c67163b73c5e 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -750,7 +750,7 @@ static __always_inline bool __folio_test_movable(const struct folio *folio) PAGE_MAPPING_MOVABLE; } -static __always_inline bool __PageMovable(const struct page *page) +static __always_inline bool page_has_movable_ops(const struct page *page) { return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_MOVABLE; diff --git a/mm/compaction.c b/mm/compaction.c index 5c3737301701..41fd6a1fe9a3 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1056,11 +1056,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, * Skip any other type of page */ if (!PageLRU(page)) { - /* - * __PageMovable can return false positive so we need - * to verify it under page_lock. - */ - if (unlikely(__PageMovable(page)) && + /* Isolation code will deal with any races. */ + if (unlikely(page_has_movable_ops(page)) && !PageIsolated(page)) { if (locked) { unlock_page_lruvec_irqrestore(locked, flags); diff --git a/mm/memory-failure.c b/mm/memory-failure.c index b91a33fb6c69..9e2cff199934 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1388,8 +1388,8 @@ static inline bool HWPoisonHandlable(struct page *page, unsigned long flags) if (PageSlab(page)) return false; - /* Soft offline could migrate non-LRU movable pages */ - if ((flags & MF_SOFT_OFFLINE) && __PageMovable(page)) + /* Soft offline could migrate movable_ops pages */ + if ((flags & MF_SOFT_OFFLINE) && page_has_movable_ops(page)) return true; return PageLRU(page) || is_free_buddy_page(page); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index e4009a44f883..1f15af712bc3 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1739,8 +1739,8 @@ bool mhp_range_allowed(u64 start, u64 size, bool need_mapping) #ifdef CONFIG_MEMORY_HOTREMOVE /* - * Scan pfn range [start,end) to find movable/migratable pages (LRU pages, - * non-lru movable pages and hugepages). Will skip over most unmovable + * Scan pfn range [start,end) to find movable/migratable pages (LRU and + * hugetlb folio, movable_ops pages). Will skip over most unmovable * pages (esp., pages that can be skipped when offlining), but bail out on * definitely unmovable pages. 
* @@ -1759,13 +1759,11 @@ static int scan_movable_pages(unsigned long start, unsigned long end, struct folio *folio; page = pfn_to_page(pfn); - if (PageLRU(page)) - goto found; - if (__PageMovable(page)) + if (PageLRU(page) || page_has_movable_ops(page)) goto found; /* - * PageOffline() pages that are not marked __PageMovable() and + * PageOffline() pages that do not have movable_ops and * have a reference count > 0 (after MEM_GOING_OFFLINE) are * definitely unmovable. If their reference count would be 0, * they could at least be skipped when offlining memory. diff --git a/mm/migrate.c b/mm/migrate.c index 1f07c8f1fb74..bf9cfdafc54c 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -94,7 +94,7 @@ bool isolate_movable_ops_page(struct page *page, isolate_mode_t mode) * Note that once a page has movable_ops, it will stay that way * until the page was freed. */ - if (unlikely(!__PageMovable(page))) + if (unlikely(!page_has_movable_ops(page))) goto out_putfolio; /* @@ -111,7 +111,7 @@ bool isolate_movable_ops_page(struct page *page, isolate_mode_t mode) if (unlikely(!folio_trylock(folio))) goto out_putfolio; - VM_WARN_ON_ONCE_PAGE(!__PageMovable(page), page); + VM_WARN_ON_ONCE_PAGE(!page_has_movable_ops(page), page); if (PageIsolated(page)) goto out_no_isolated; @@ -153,7 +153,7 @@ static void putback_movable_ops_page(struct page *page) */ struct folio *folio = page_folio(page); - VM_WARN_ON_ONCE_PAGE(!__PageMovable(page), page); + VM_WARN_ON_ONCE_PAGE(!page_has_movable_ops(page), page); VM_WARN_ON_ONCE_PAGE(!PageIsolated(page), page); folio_lock(folio); page_movable_ops(page)->putback_page(page); @@ -194,7 +194,7 @@ static int migrate_movable_ops_page(struct page *dst, struct page *src, { int rc = MIGRATEPAGE_SUCCESS; - VM_WARN_ON_ONCE_PAGE(!__PageMovable(src), src); + VM_WARN_ON_ONCE_PAGE(!page_has_movable_ops(src), src); VM_WARN_ON_ONCE_PAGE(!PageIsolated(src), src); rc = page_movable_ops(src)->migrate_page(dst, src, mode); if (rc == MIGRATEPAGE_SUCCESS) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6318c85d678e..036d9b7b01c0 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2006,7 +2006,7 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page, * migration are movable. But we don't actually try * isolating, as that would be expensive. */ - if (PageLRU(page) || __PageMovable(page)) + if (PageLRU(page) || page_has_movable_ops(page)) (*num_movable)++; pfn++; } diff --git a/mm/page_isolation.c b/mm/page_isolation.c index ece3bfc56bcd..b97b965b3ed0 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -21,9 +21,9 @@ * consequently belong to a single zone. * * PageLRU check without isolation or lru_lock could race so that - * MIGRATE_MOVABLE block might include unmovable pages. And __PageMovable - * check without lock_page also may miss some movable non-lru pages at - * race condition. So you can't expect this function should be exact. + * MIGRATE_MOVABLE block might include unmovable pages. Similarly, pages + * with movable_ops can only be identified some time after they were + * allocated. So you can't expect this function should be exact. * * Returns a page without holding a reference. 
If the caller wants to * dereference that page (e.g., dumping), it has to make sure that it @@ -133,7 +133,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e if ((mode == PB_ISOLATE_MODE_MEM_OFFLINE) && PageOffline(page)) continue; - if (__PageMovable(page) || PageLRU(page)) + if (PageLRU(page) || page_has_movable_ops(page)) continue; /* @@ -421,7 +421,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, * proper free and split handling for them. */ VM_WARN_ON_ONCE_PAGE(PageLRU(page), page); - VM_WARN_ON_ONCE_PAGE(__PageMovable(page), page); + VM_WARN_ON_ONCE_PAGE(page_has_movable_ops(page), page); goto failed; } From 9f56c08a89222ddee6d8d574d69cbd500405fb46 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:11 +0200 Subject: [PATCH 191/370] mm/page_isolation: drop __folio_test_movable() check for large folios MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently, we only support migration of individual movable_ops pages, so we can not run into that. Link: https://lkml.kernel.org/r/20250704102524.326966-18-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Lorenzo Stoakes Reviewed-by: Harry Yoo Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- mm/page_isolation.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_isolation.c b/mm/page_isolation.c index b97b965b3ed0..f72b6cd38b95 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -92,7 +92,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e h = size_to_hstate(folio_size(folio)); if (h && !hugepage_migration_supported(h)) return page; - } else if (!folio_test_lru(folio) && !__folio_test_movable(folio)) { + } else if (!folio_test_lru(folio)) { return page; } From 457d7b3adb11576ce5f3ae0d9a4987ace213bed2 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:12 +0200 Subject: [PATCH 192/370] mm: remove __folio_test_movable() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Convert to page_has_movable_ops(). While at it, cleanup relevant code a bit. The data_race() in migrate_folio_unmap() is questionable: we already hold a page reference, and concurrent modifications can no longer happen (iow: __ClearPageMovable() no longer exists). Drop it for now, we'll rework page_has_movable_ops() soon either way to no longer rely on page->mapping. Wherever we cast from folio to page now is a clear sign that this code has to be decoupled. 
Link: https://lkml.kernel.org/r/20250704102524.326966-19-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Lorenzo Stoakes Reviewed-by: Harry Yoo Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- include/linux/page-flags.h | 6 ------ mm/migrate.c | 43 ++++++++++++-------------------------- mm/vmscan.c | 6 ++++-- 3 files changed, 17 insertions(+), 38 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index c67163b73c5e..4c27ebb689e3 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -744,12 +744,6 @@ static __always_inline bool PageAnon(const struct page *page) return folio_test_anon(page_folio(page)); } -static __always_inline bool __folio_test_movable(const struct folio *folio) -{ - return ((unsigned long)folio->mapping & PAGE_MAPPING_FLAGS) == - PAGE_MAPPING_MOVABLE; -} - static __always_inline bool page_has_movable_ops(const struct page *page) { return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) == diff --git a/mm/migrate.c b/mm/migrate.c index bf9cfdafc54c..aec0774d3da3 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -221,12 +221,7 @@ void putback_movable_pages(struct list_head *l) continue; } list_del(&folio->lru); - /* - * We isolated non-lru movable folio so here we can use - * __folio_test_movable because LRU folio's mapping cannot - * have PAGE_MAPPING_MOVABLE. - */ - if (unlikely(__folio_test_movable(folio))) { + if (unlikely(page_has_movable_ops(&folio->page))) { putback_movable_ops_page(&folio->page); } else { node_stat_mod_folio(folio, NR_ISOLATED_ANON + @@ -239,26 +234,20 @@ void putback_movable_pages(struct list_head *l) /* Must be called with an elevated refcount on the non-hugetlb folio */ bool isolate_folio_to_list(struct folio *folio, struct list_head *list) { - bool isolated, lru; - if (folio_test_hugetlb(folio)) return folio_isolate_hugetlb(folio, list); - lru = !__folio_test_movable(folio); - if (lru) - isolated = folio_isolate_lru(folio); - else - isolated = isolate_movable_ops_page(&folio->page, - ISOLATE_UNEVICTABLE); - - if (!isolated) - return false; - - list_add(&folio->lru, list); - if (lru) + if (page_has_movable_ops(&folio->page)) { + if (!isolate_movable_ops_page(&folio->page, + ISOLATE_UNEVICTABLE)) + return false; + } else { + if (!folio_isolate_lru(folio)) + return false; node_stat_add_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio)); - + } + list_add(&folio->lru, list); return true; } @@ -1142,12 +1131,7 @@ static void migrate_folio_undo_dst(struct folio *dst, bool locked, static void migrate_folio_done(struct folio *src, enum migrate_reason reason) { - /* - * Compaction can migrate also non-LRU pages which are - * not accounted to NR_ISOLATED_*. 
They can be recognized - * as __folio_test_movable - */ - if (likely(!__folio_test_movable(src)) && reason != MR_DEMOTION) + if (likely(!page_has_movable_ops(&src->page)) && reason != MR_DEMOTION) mod_node_page_state(folio_pgdat(src), NR_ISOLATED_ANON + folio_is_file_lru(src), -folio_nr_pages(src)); @@ -1166,7 +1150,6 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, int rc = -EAGAIN; int old_page_state = 0; struct anon_vma *anon_vma = NULL; - bool is_lru = data_race(!__folio_test_movable(src)); bool locked = false; bool dst_locked = false; @@ -1267,7 +1250,7 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, goto out; dst_locked = true; - if (unlikely(!is_lru)) { + if (unlikely(page_has_movable_ops(&src->page))) { __migrate_folio_record(dst, old_page_state, anon_vma); return MIGRATEPAGE_UNMAP; } @@ -1332,7 +1315,7 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private, prev = dst->lru.prev; list_del(&dst->lru); - if (unlikely(__folio_test_movable(src))) { + if (unlikely(page_has_movable_ops(&src->page))) { rc = migrate_movable_ops_page(&dst->page, &src->page, mode); if (rc) goto out; diff --git a/mm/vmscan.c b/mm/vmscan.c index c86a2495138a..b1b999734ee4 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1651,9 +1651,11 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone, unsigned int noreclaim_flag; list_for_each_entry_safe(folio, next, folio_list, lru) { + /* TODO: these pages should not even appear in this list. */ + if (page_has_movable_ops(&folio->page)) + continue; if (!folio_test_hugetlb(folio) && folio_is_file_lru(folio) && - !folio_test_dirty(folio) && !__folio_test_movable(folio) && - !folio_test_unevictable(folio)) { + !folio_test_dirty(folio) && !folio_test_unevictable(folio)) { folio_clear_active(folio); list_move(&folio->lru, &clean_folios); } From 84caf98838a3e5f4bdb344c13679e1067ffbf094 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:13 +0200 Subject: [PATCH 193/370] mm: stop storing migration_ops in page->mapping MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ... instead, look them up statically based on the page type. Maybe in the future we want a registration interface? At least for now, it can be easily handled using the two page types that actually support page migration. The remaining usage of page->mapping is to flag such pages as actually being movable (having movable_ops), which we will change next. Link: https://lkml.kernel.org/r/20250704102524.326966-20-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Lorenzo Stoakes Reviewed-by: Harry Yoo Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. 
Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- include/linux/balloon_compaction.h | 2 +- include/linux/migrate.h | 14 ++------------ include/linux/zsmalloc.h | 2 ++ mm/balloon_compaction.c | 1 - mm/compaction.c | 5 ++--- mm/migrate.c | 23 +++++++++++++++++++++++ mm/zpdesc.h | 5 ++--- mm/zsmalloc.c | 8 +++----- 8 files changed, 35 insertions(+), 25 deletions(-) diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index 9bce8e9f5018..a8a1706cc56f 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -92,7 +92,7 @@ static inline void balloon_page_insert(struct balloon_dev_info *balloon, struct page *page) { __SetPageOffline(page); - __SetPageMovable(page, &balloon_mops); + __SetPageMovable(page); set_page_private(page, (unsigned long)balloon); list_add(&page->lru, &balloon->pages); } diff --git a/include/linux/migrate.h b/include/linux/migrate.h index e04035f70e36..6aece3f3c8be 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -104,23 +104,13 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #endif /* CONFIG_MIGRATION */ #ifdef CONFIG_COMPACTION -void __SetPageMovable(struct page *page, const struct movable_operations *ops); +void __SetPageMovable(struct page *page); #else -static inline void __SetPageMovable(struct page *page, - const struct movable_operations *ops) +static inline void __SetPageMovable(struct page *page) { } #endif -static inline -const struct movable_operations *page_movable_ops(struct page *page) -{ - VM_WARN_ON_ONCE_PAGE(!page_has_movable_ops(page), page); - - return (const struct movable_operations *) - ((unsigned long)page->mapping - PAGE_MAPPING_MOVABLE); -} - #ifdef CONFIG_NUMA_BALANCING int migrate_misplaced_folio_prepare(struct folio *folio, struct vm_area_struct *vma, int node); diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h index 13e9cc5490f7..f3ccff2d966c 100644 --- a/include/linux/zsmalloc.h +++ b/include/linux/zsmalloc.h @@ -46,4 +46,6 @@ void zs_obj_read_end(struct zs_pool *pool, unsigned long handle, void zs_obj_write(struct zs_pool *pool, unsigned long handle, void *handle_mem, size_t mem_len); +extern const struct movable_operations zsmalloc_mops; + #endif diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c index e4f1a122d786..2a4a649805c1 100644 --- a/mm/balloon_compaction.c +++ b/mm/balloon_compaction.c @@ -253,6 +253,5 @@ const struct movable_operations balloon_mops = { .isolate_page = balloon_page_isolate, .putback_page = balloon_page_putback, }; -EXPORT_SYMBOL_GPL(balloon_mops); #endif /* CONFIG_BALLOON_COMPACTION */ diff --git a/mm/compaction.c b/mm/compaction.c index 41fd6a1fe9a3..348eb754cb22 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -114,11 +114,10 @@ static unsigned long release_free_list(struct list_head *freepages) } #ifdef CONFIG_COMPACTION -void __SetPageMovable(struct page *page, const struct movable_operations *mops) +void __SetPageMovable(struct page *page) { VM_BUG_ON_PAGE(!PageLocked(page), page); - VM_BUG_ON_PAGE((unsigned long)mops & PAGE_MAPPING_MOVABLE, page); - page->mapping = (void *)((unsigned long)mops | PAGE_MAPPING_MOVABLE); + page->mapping = (void *)(PAGE_MAPPING_MOVABLE); } 
EXPORT_SYMBOL(__SetPageMovable); diff --git a/mm/migrate.c b/mm/migrate.c index aec0774d3da3..90ddf906d706 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -43,6 +43,8 @@ #include #include #include +#include +#include #include @@ -51,6 +53,27 @@ #include "internal.h" #include "swap.h" +static const struct movable_operations *page_movable_ops(struct page *page) +{ + VM_WARN_ON_ONCE_PAGE(!page_has_movable_ops(page), page); + + /* + * If we enable page migration for a page of a certain type by marking + * it as movable, the page type must be sticky until the page gets freed + * back to the buddy. + */ +#ifdef CONFIG_BALLOON_COMPACTION + if (PageOffline(page)) + /* Only balloon compaction sets PageOffline pages movable. */ + return &balloon_mops; +#endif /* CONFIG_BALLOON_COMPACTION */ +#if defined(CONFIG_ZSMALLOC) && defined(CONFIG_COMPACTION) + if (PageZsmalloc(page)) + return &zsmalloc_mops; +#endif /* defined(CONFIG_ZSMALLOC) && defined(CONFIG_COMPACTION) */ + return NULL; +} + /** * isolate_movable_ops_page - isolate a movable_ops page for migration * @page: The page. diff --git a/mm/zpdesc.h b/mm/zpdesc.h index 5763f3603973..6855d9e2732d 100644 --- a/mm/zpdesc.h +++ b/mm/zpdesc.h @@ -152,10 +152,9 @@ static inline struct zpdesc *pfn_zpdesc(unsigned long pfn) return page_zpdesc(pfn_to_page(pfn)); } -static inline void __zpdesc_set_movable(struct zpdesc *zpdesc, - const struct movable_operations *mops) +static inline void __zpdesc_set_movable(struct zpdesc *zpdesc) { - __SetPageMovable(zpdesc_page(zpdesc), mops); + __SetPageMovable(zpdesc_page(zpdesc)); } static inline void __zpdesc_set_zsmalloc(struct zpdesc *zpdesc) diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index b12250e219bb..4aaff7c26ea9 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -1685,8 +1685,6 @@ static void lock_zspage(struct zspage *zspage) #ifdef CONFIG_COMPACTION -static const struct movable_operations zsmalloc_mops; - static void replace_sub_page(struct size_class *class, struct zspage *zspage, struct zpdesc *newzpdesc, struct zpdesc *oldzpdesc) { @@ -1709,7 +1707,7 @@ static void replace_sub_page(struct size_class *class, struct zspage *zspage, set_first_obj_offset(newzpdesc, first_obj_offset); if (unlikely(ZsHugePage(zspage))) newzpdesc->handle = oldzpdesc->handle; - __zpdesc_set_movable(newzpdesc, &zsmalloc_mops); + __zpdesc_set_movable(newzpdesc); } static bool zs_page_isolate(struct page *page, isolate_mode_t mode) @@ -1819,7 +1817,7 @@ static void zs_page_putback(struct page *page) { } -static const struct movable_operations zsmalloc_mops = { +const struct movable_operations zsmalloc_mops = { .isolate_page = zs_page_isolate, .migrate_page = zs_page_migrate, .putback_page = zs_page_putback, @@ -1882,7 +1880,7 @@ static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage) do { WARN_ON(!zpdesc_trylock(zpdesc)); - __zpdesc_set_movable(zpdesc, &zsmalloc_mops); + __zpdesc_set_movable(zpdesc); zpdesc_unlock(zpdesc); } while ((zpdesc = get_next_zpdesc(zpdesc)) != NULL); } From 3d388584d59985e95f5bfb9dbd9776fa1bb1ec8a Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:14 +0200 Subject: [PATCH 194/370] mm: convert "movable" flag in page->mapping to a page flag MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Instead, let's use a page flag. As the page flag can result in false-positives, glue it to the page types for which we support/implement movable_ops page migration. 
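Concretely, the glue is the combined test added to page-flags.h later in this patch, quoted here for reference: the raw flag only counts as "movable_ops" for page types that actually support it.

	static inline bool page_has_movable_ops(const struct page *page)
	{
		return PageMovableOps(page) &&
		       (PageOffline(page) || PageZsmalloc(page));
	}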
We are reusing PG_uptodate, that is for example used to track file system state and does not apply to movable_ops pages. So warning in case it is set in page_has_movable_ops() on other page types could result in false-positive warnings. Likely we could set the bit using a non-atomic update: in contrast to page->mapping, we could have others trying to update the flags concurrently when trying to lock the folio. In isolate_movable_ops_page(), we already take care of that by checking if the page has movable_ops before locking it. Let's start with the atomic variant, we could later switch to the non-atomic variant once we are sure other cases are similarly fine. Once we perform the switch, we'll have to introduce __SETPAGEFLAG_NOOP(). [david@redhat.com: add missing `:' in kerneldoc] Link: https://lkml.kernel.org/r/d96e2916-2c43-462c-b6a1-2375ef397d8b@redhat.com Link: https://lkml.kernel.org/r/20250704102524.326966-21-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Lorenzo Stoakes Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: Harry Yoo Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- include/linux/balloon_compaction.h | 2 +- include/linux/migrate.h | 8 ----- include/linux/page-flags.h | 54 ++++++++++++++++++++++++------ mm/compaction.c | 6 ---- mm/zpdesc.h | 2 +- 5 files changed, 46 insertions(+), 26 deletions(-) diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index a8a1706cc56f..b222b0737c46 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -92,7 +92,7 @@ static inline void balloon_page_insert(struct balloon_dev_info *balloon, struct page *page) { __SetPageOffline(page); - __SetPageMovable(page); + SetPageMovableOps(page); set_page_private(page, (unsigned long)balloon); list_add(&page->lru, &balloon->pages); } diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 6aece3f3c8be..acadd41e0b5c 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -103,14 +103,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #endif /* CONFIG_MIGRATION */ -#ifdef CONFIG_COMPACTION -void __SetPageMovable(struct page *page); -#else -static inline void __SetPageMovable(struct page *page) -{ -} -#endif - #ifdef CONFIG_NUMA_BALANCING int migrate_misplaced_folio_prepare(struct folio *folio, struct vm_area_struct *vma, int node); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 4c27ebb689e3..a02a12e51915 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -170,6 +170,11 @@ enum pageflags { /* non-lru isolated movable page */ PG_isolated = PG_reclaim, +#ifdef CONFIG_MIGRATION + /* this is a movable_ops page (for selected typed pages only) */ + PG_movable_ops = PG_uptodate, 
+#endif + /* Only valid for buddy pages. Used to track pages that are reported */ PG_reported = PG_uptodate, @@ -698,9 +703,6 @@ PAGEFLAG_FALSE(VmemmapSelfHosted, vmemmap_self_hosted) * bit; and then folio->mapping points, not to an anon_vma, but to a private * structure which KSM associates with that merged page. See ksm.h. * - * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is used for non-lru movable - * page and then folio->mapping points to a struct movable_operations. - * * Please note that, confusingly, "folio_mapping" refers to the inode * address_space which maps the folio from disk; whereas "folio_mapped" * refers to user virtual address space into which the folio is mapped. @@ -743,13 +745,6 @@ static __always_inline bool PageAnon(const struct page *page) { return folio_test_anon(page_folio(page)); } - -static __always_inline bool page_has_movable_ops(const struct page *page) -{ - return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) == - PAGE_MAPPING_MOVABLE; -} - #ifdef CONFIG_KSM /* * A KSM page is one of those write-protected "shared pages" or "merged pages" @@ -1133,6 +1128,45 @@ bool is_free_buddy_page(const struct page *page); PAGEFLAG(Isolated, isolated, PF_ANY); +#ifdef CONFIG_MIGRATION +/* + * This page is migratable through movable_ops (for selected typed pages + * only). + * + * Page migration of such pages might fail, for example, if the page is + * already isolated by somebody else, or if the page is about to get freed. + * + * While a subsystem might set selected typed pages that support page migration + * as being movable through movable_ops, it must never clear this flag. + * + * This flag is only cleared when the page is freed back to the buddy. + * + * Only selected page types support this flag (see page_movable_ops()) and + * the flag might be used in other context for other pages. Always use + * page_has_movable_ops() instead. + */ +TESTPAGEFLAG(MovableOps, movable_ops, PF_NO_TAIL); +SETPAGEFLAG(MovableOps, movable_ops, PF_NO_TAIL); +#else /* !CONFIG_MIGRATION */ +TESTPAGEFLAG_FALSE(MovableOps, movable_ops); +SETPAGEFLAG_NOOP(MovableOps, movable_ops); +#endif /* CONFIG_MIGRATION */ + +/** + * page_has_movable_ops - test for a movable_ops page + * @page: The page to test. + * + * Test whether this is a movable_ops page. Such pages will stay that + * way until freed. + * + * Returns true if this is a movable_ops page, otherwise false. 
+ */ +static inline bool page_has_movable_ops(const struct page *page) +{ + return PageMovableOps(page) && + (PageOffline(page) || PageZsmalloc(page)); +} + static __always_inline int PageAnonExclusive(const struct page *page) { VM_BUG_ON_PGFLAGS(!PageAnon(page), page); diff --git a/mm/compaction.c b/mm/compaction.c index 348eb754cb22..349f4ea0ec3e 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -114,12 +114,6 @@ static unsigned long release_free_list(struct list_head *freepages) } #ifdef CONFIG_COMPACTION -void __SetPageMovable(struct page *page) -{ - VM_BUG_ON_PAGE(!PageLocked(page), page); - page->mapping = (void *)(PAGE_MAPPING_MOVABLE); -} -EXPORT_SYMBOL(__SetPageMovable); /* Do not skip compaction more than 64 times */ #define COMPACT_MAX_DEFER_SHIFT 6 diff --git a/mm/zpdesc.h b/mm/zpdesc.h index 6855d9e2732d..25bf5ea0beb8 100644 --- a/mm/zpdesc.h +++ b/mm/zpdesc.h @@ -154,7 +154,7 @@ static inline struct zpdesc *pfn_zpdesc(unsigned long pfn) static inline void __zpdesc_set_movable(struct zpdesc *zpdesc) { - __SetPageMovable(zpdesc_page(zpdesc)); + SetPageMovableOps(zpdesc_page(zpdesc)); } static inline void __zpdesc_set_zsmalloc(struct zpdesc *zpdesc) From 92f091769fde42509ca7685e67c9951f2350ceb7 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:15 +0200 Subject: [PATCH 195/370] mm: rename PG_isolated to PG_movable_ops_isolated MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Let's rename the flag to make it clearer where it applies (not folios ...). While at it, define the flag only with CONFIG_MIGRATION. Link: https://lkml.kernel.org/r/20250704102524.326966-22-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Lorenzo Stoakes Reviewed-by: Harry Yoo Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- include/linux/page-flags.h | 16 +++++++++++----- mm/compaction.c | 2 +- mm/migrate.c | 14 +++++++------- 3 files changed, 19 insertions(+), 13 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index a02a12e51915..92f7152a445e 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -167,10 +167,9 @@ enum pageflags { /* Remapped by swiotlb-xen. 
*/ PG_xen_remapped = PG_owner_priv_1, - /* non-lru isolated movable page */ - PG_isolated = PG_reclaim, - #ifdef CONFIG_MIGRATION + /* movable_ops page that is isolated for migration */ + PG_movable_ops_isolated = PG_reclaim, /* this is a movable_ops page (for selected typed pages only) */ PG_movable_ops = PG_uptodate, #endif @@ -1126,8 +1125,6 @@ static inline bool folio_contain_hwpoisoned_page(struct folio *folio) bool is_free_buddy_page(const struct page *page); -PAGEFLAG(Isolated, isolated, PF_ANY); - #ifdef CONFIG_MIGRATION /* * This page is migratable through movable_ops (for selected typed pages @@ -1147,9 +1144,18 @@ PAGEFLAG(Isolated, isolated, PF_ANY); */ TESTPAGEFLAG(MovableOps, movable_ops, PF_NO_TAIL); SETPAGEFLAG(MovableOps, movable_ops, PF_NO_TAIL); +/* + * A movable_ops page has this flag set while it is isolated for migration. + * This flag primarily protects against concurrent migration attempts. + * + * Once migration ended (success or failure), the flag is cleared. The + * flag is managed by the migration core. + */ +PAGEFLAG(MovableOpsIsolated, movable_ops_isolated, PF_NO_TAIL); #else /* !CONFIG_MIGRATION */ TESTPAGEFLAG_FALSE(MovableOps, movable_ops); SETPAGEFLAG_NOOP(MovableOps, movable_ops); +PAGEFLAG_FALSE(MovableOpsIsolated, movable_ops_isolated); #endif /* CONFIG_MIGRATION */ /** diff --git a/mm/compaction.c b/mm/compaction.c index 349f4ea0ec3e..bf021b31c7ec 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1051,7 +1051,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, if (!PageLRU(page)) { /* Isolation code will deal with any races. */ if (unlikely(page_has_movable_ops(page)) && - !PageIsolated(page)) { + !PageMovableOpsIsolated(page)) { if (locked) { unlock_page_lruvec_irqrestore(locked, flags); locked = NULL; diff --git a/mm/migrate.c b/mm/migrate.c index 90ddf906d706..e4e05a98c84e 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -135,7 +135,7 @@ bool isolate_movable_ops_page(struct page *page, isolate_mode_t mode) goto out_putfolio; VM_WARN_ON_ONCE_PAGE(!page_has_movable_ops(page), page); - if (PageIsolated(page)) + if (PageMovableOpsIsolated(page)) goto out_no_isolated; mops = page_movable_ops(page); @@ -146,8 +146,8 @@ bool isolate_movable_ops_page(struct page *page, isolate_mode_t mode) goto out_no_isolated; /* Driver shouldn't use the isolated flag */ - VM_WARN_ON_ONCE_PAGE(PageIsolated(page), page); - SetPageIsolated(page); + VM_WARN_ON_ONCE_PAGE(PageMovableOpsIsolated(page), page); + SetPageMovableOpsIsolated(page); folio_unlock(folio); return true; @@ -177,10 +177,10 @@ static void putback_movable_ops_page(struct page *page) struct folio *folio = page_folio(page); VM_WARN_ON_ONCE_PAGE(!page_has_movable_ops(page), page); - VM_WARN_ON_ONCE_PAGE(!PageIsolated(page), page); + VM_WARN_ON_ONCE_PAGE(!PageMovableOpsIsolated(page), page); folio_lock(folio); page_movable_ops(page)->putback_page(page); - ClearPageIsolated(page); + ClearPageMovableOpsIsolated(page); folio_unlock(folio); folio_put(folio); } @@ -218,10 +218,10 @@ static int migrate_movable_ops_page(struct page *dst, struct page *src, int rc = MIGRATEPAGE_SUCCESS; VM_WARN_ON_ONCE_PAGE(!page_has_movable_ops(src), src); - VM_WARN_ON_ONCE_PAGE(!PageIsolated(src), src); + VM_WARN_ON_ONCE_PAGE(!PageMovableOpsIsolated(src), src); rc = page_movable_ops(src)->migrate_page(dst, src, mode); if (rc == MIGRATEPAGE_SUCCESS) - ClearPageIsolated(src); + ClearPageMovableOpsIsolated(src); return rc; } From bd56d30242039826160d95a65cb14c1cd65e6488 Mon Sep 17 00:00:00 2001 From: 
David Hildenbrand Date: Fri, 4 Jul 2025 12:25:16 +0200 Subject: [PATCH 196/370] mm/page-flags: rename PAGE_MAPPING_MOVABLE to PAGE_MAPPING_ANON_KSM MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit KSM is the only remaining user, let's rename the flag. While at it, adjust to remaining page -> folio in the doc. Link: https://lkml.kernel.org/r/20250704102524.326966-23-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Lorenzo Stoakes Reviewed-by: Harry Yoo Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- include/linux/page-flags.h | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 92f7152a445e..22767499fb21 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -697,10 +697,10 @@ PAGEFLAG_FALSE(VmemmapSelfHosted, vmemmap_self_hosted) * folio->mapping points to its anon_vma, not to a struct address_space; * with the PAGE_MAPPING_ANON bit set to distinguish it. See rmap.h. * - * On an anonymous page in a VM_MERGEABLE area, if CONFIG_KSM is enabled, - * the PAGE_MAPPING_MOVABLE bit may be set along with the PAGE_MAPPING_ANON + * On an anonymous folio in a VM_MERGEABLE area, if CONFIG_KSM is enabled, + * the PAGE_MAPPING_ANON_KSM bit may be set along with the PAGE_MAPPING_ANON * bit; and then folio->mapping points, not to an anon_vma, but to a private - * structure which KSM associates with that merged page. See ksm.h. + * structure which KSM associates with that merged folio. See ksm.h. * * Please note that, confusingly, "folio_mapping" refers to the inode * address_space which maps the folio from disk; whereas "folio_mapped" @@ -714,9 +714,9 @@ PAGEFLAG_FALSE(VmemmapSelfHosted, vmemmap_self_hosted) * See mm/slab.h. */ #define PAGE_MAPPING_ANON 0x1 -#define PAGE_MAPPING_MOVABLE 0x2 -#define PAGE_MAPPING_KSM (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE) -#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE) +#define PAGE_MAPPING_ANON_KSM 0x2 +#define PAGE_MAPPING_KSM (PAGE_MAPPING_ANON | PAGE_MAPPING_ANON_KSM) +#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_ANON_KSM) static __always_inline bool folio_mapping_flags(const struct folio *folio) { From 5799c0ed0aff65989b04ca3f517401e4ee94da10 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:17 +0200 Subject: [PATCH 197/370] mm/page-alloc: remove PageMappingFlags() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit As PageMappingFlags() now only indicates anon (incl. KSM) folios, we can now simply check for PageAnon() and remove PageMappingFlags(). ... 
and while at it, use the folio instead and operate on folio->mapping. Link: https://lkml.kernel.org/r/20250704102524.326966-24-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Lorenzo Stoakes Reviewed-by: Harry Yoo Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- include/linux/page-flags.h | 5 ----- mm/page_alloc.c | 7 +++---- 2 files changed, 3 insertions(+), 9 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 22767499fb21..a2151f6d96ae 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -723,11 +723,6 @@ static __always_inline bool folio_mapping_flags(const struct folio *folio) return ((unsigned long)folio->mapping & PAGE_MAPPING_FLAGS) != 0; } -static __always_inline bool PageMappingFlags(const struct page *page) -{ - return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) != 0; -} - static __always_inline bool folio_test_anon(const struct folio *folio) { return ((unsigned long)folio->mapping & PAGE_MAPPING_ANON) != 0; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 036d9b7b01c0..fa09154a799c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1375,10 +1375,9 @@ __always_inline bool free_pages_prepare(struct page *page, (page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; } } - if (PageMappingFlags(page)) { - if (PageAnon(page)) - mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1); - page->mapping = NULL; + if (folio_test_anon(folio)) { + mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1); + folio->mapping = NULL; } if (unlikely(page_has_type(page))) /* Reset the page_type (which overlays _mapcount) */ From beb2cdeed673150cdfad653dd2e7c9999c230f57 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:18 +0200 Subject: [PATCH 198/370] mm/page-flags: remove folio_mapping_flags() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit It's unused and the page counterpart is gone, so let's remove it. Link: https://lkml.kernel.org/r/20250704102524.326966-25-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Lorenzo Stoakes Reviewed-by: Harry Yoo Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. 
Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- include/linux/page-flags.h | 5 ----- 1 file changed, 5 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index a2151f6d96ae..ae2b80fcea6a 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -718,11 +718,6 @@ PAGEFLAG_FALSE(VmemmapSelfHosted, vmemmap_self_hosted) #define PAGE_MAPPING_KSM (PAGE_MAPPING_ANON | PAGE_MAPPING_ANON_KSM) #define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_ANON_KSM) -static __always_inline bool folio_mapping_flags(const struct folio *folio) -{ - return ((unsigned long)folio->mapping & PAGE_MAPPING_FLAGS) != 0; -} - static __always_inline bool folio_test_anon(const struct folio *folio) { return ((unsigned long)folio->mapping & PAGE_MAPPING_ANON) != 0; From 78cb1a13c42a6d843e21389f74d1edb90ed07288 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:19 +0200 Subject: [PATCH 199/370] mm: simplify folio_expected_ref_count() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Now that PAGE_MAPPING_MOVABLE is gone, we can simplify and rely on the folio_test_anon() test only. ... but staring at the users, this function should never even have been called on movable_ops pages. E.g., * __buffer_migrate_folio() does not make sense for them * folio_migrate_mapping() does not make sense for them * migrate_huge_page_move_mapping() does not make sense for them * __migrate_folio() does not make sense for them * ... and khugepaged should never stumble over them Let's simply refuse typed pages (which includes slab) except hugetlb, and WARN. Link: https://lkml.kernel.org/r/20250704102524.326966-26-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Lorenzo Stoakes Reviewed-by: Harry Yoo Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- include/linux/mm.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index ef40f68c1183..805108d7bbc3 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2167,13 +2167,13 @@ static inline int folio_expected_ref_count(const struct folio *folio) const int order = folio_order(folio); int ref_count = 0; - if (WARN_ON_ONCE(folio_test_slab(folio))) + if (WARN_ON_ONCE(page_has_type(&folio->page) && !folio_test_hugetlb(folio))) return 0; if (folio_test_anon(folio)) { /* One reference per page from the swapcache. 
*/ ref_count += folio_test_swapcache(folio) << order; - } else if (!((unsigned long)folio->mapping & PAGE_MAPPING_FLAGS)) { + } else { /* One reference per page from the pagecache. */ ref_count += !!folio->mapping << order; /* One reference from PG_private. */ From df25569d401e36327b339c3f5b3265d74eae90f2 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:20 +0200 Subject: [PATCH 200/370] mm: rename PAGE_MAPPING_* to FOLIO_MAPPING_* MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Now that the mapping flags are only used for folios, let's rename the defines. Link: https://lkml.kernel.org/r/20250704102524.326966-27-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Lorenzo Stoakes Reviewed-by: Harry Yoo Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- fs/proc/page.c | 4 ++-- include/linux/fs.h | 2 +- include/linux/mm_types.h | 1 - include/linux/page-flags.h | 20 ++++++++++---------- include/linux/pagemap.h | 2 +- mm/gup.c | 4 ++-- mm/internal.h | 2 +- mm/ksm.c | 4 ++-- mm/rmap.c | 16 ++++++++-------- mm/util.c | 6 +++--- 10 files changed, 30 insertions(+), 31 deletions(-) diff --git a/fs/proc/page.c b/fs/proc/page.c index 999af26c7298..0cdc78c0d23f 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -149,7 +149,7 @@ u64 stable_page_flags(const struct page *page) k = folio->flags; mapping = (unsigned long)folio->mapping; - is_anon = mapping & PAGE_MAPPING_ANON; + is_anon = mapping & FOLIO_MAPPING_ANON; /* * pseudo flags for the well known (anonymous) memory mapped pages @@ -158,7 +158,7 @@ u64 stable_page_flags(const struct page *page) u |= 1 << KPF_MMAP; if (is_anon) { u |= 1 << KPF_ANON; - if (mapping & PAGE_MAPPING_KSM) + if (mapping & FOLIO_MAPPING_KSM) u |= 1 << KPF_KSM; } diff --git a/include/linux/fs.h b/include/linux/fs.h index e14e9d11ca0f..d3e7ad6941a8 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -526,7 +526,7 @@ struct address_space { /* * On most architectures that alignment is already the case; but * must be enforced here for CRIS, to let the least significant bit - * of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON. + * of struct folio's "mapping" pointer be used for FOLIO_MAPPING_ANON. */ /* XArray tags, for tagging dirty and writeback pages in the pagecache. */ diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 804d269a4f5e..1ec273b06691 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -105,7 +105,6 @@ struct page { unsigned int order; }; }; - /* See page-flags.h for PAGE_MAPPING_FLAGS */ struct address_space *mapping; union { pgoff_t __folio_index; /* Our offset within mapping. 
*/ diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index ae2b80fcea6a..8e4d6eda8a8d 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -695,10 +695,10 @@ PAGEFLAG_FALSE(VmemmapSelfHosted, vmemmap_self_hosted) /* * On an anonymous folio mapped into a user virtual memory area, * folio->mapping points to its anon_vma, not to a struct address_space; - * with the PAGE_MAPPING_ANON bit set to distinguish it. See rmap.h. + * with the FOLIO_MAPPING_ANON bit set to distinguish it. See rmap.h. * * On an anonymous folio in a VM_MERGEABLE area, if CONFIG_KSM is enabled, - * the PAGE_MAPPING_ANON_KSM bit may be set along with the PAGE_MAPPING_ANON + * the FOLIO_MAPPING_ANON_KSM bit may be set along with the FOLIO_MAPPING_ANON * bit; and then folio->mapping points, not to an anon_vma, but to a private * structure which KSM associates with that merged folio. See ksm.h. * @@ -713,21 +713,21 @@ PAGEFLAG_FALSE(VmemmapSelfHosted, vmemmap_self_hosted) * false before calling the following functions (e.g., folio_test_anon). * See mm/slab.h. */ -#define PAGE_MAPPING_ANON 0x1 -#define PAGE_MAPPING_ANON_KSM 0x2 -#define PAGE_MAPPING_KSM (PAGE_MAPPING_ANON | PAGE_MAPPING_ANON_KSM) -#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_ANON_KSM) +#define FOLIO_MAPPING_ANON 0x1 +#define FOLIO_MAPPING_ANON_KSM 0x2 +#define FOLIO_MAPPING_KSM (FOLIO_MAPPING_ANON | FOLIO_MAPPING_ANON_KSM) +#define FOLIO_MAPPING_FLAGS (FOLIO_MAPPING_ANON | FOLIO_MAPPING_ANON_KSM) static __always_inline bool folio_test_anon(const struct folio *folio) { - return ((unsigned long)folio->mapping & PAGE_MAPPING_ANON) != 0; + return ((unsigned long)folio->mapping & FOLIO_MAPPING_ANON) != 0; } static __always_inline bool PageAnonNotKsm(const struct page *page) { unsigned long flags = (unsigned long)page_folio(page)->mapping; - return (flags & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_ANON; + return (flags & FOLIO_MAPPING_FLAGS) == FOLIO_MAPPING_ANON; } static __always_inline bool PageAnon(const struct page *page) @@ -743,8 +743,8 @@ static __always_inline bool PageAnon(const struct page *page) */ static __always_inline bool folio_test_ksm(const struct folio *folio) { - return ((unsigned long)folio->mapping & PAGE_MAPPING_FLAGS) == - PAGE_MAPPING_KSM; + return ((unsigned long)folio->mapping & FOLIO_MAPPING_FLAGS) == + FOLIO_MAPPING_KSM; } #else FOLIO_TEST_FLAG_FALSE(ksm) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index e63fbfbd5b0f..10a222e68b85 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -502,7 +502,7 @@ static inline pgoff_t mapping_align_index(struct address_space *mapping, static inline bool mapping_large_folio_support(struct address_space *mapping) { /* AS_FOLIO_ORDER is only reasonable for pagecache folios */ - VM_WARN_ONCE((unsigned long)mapping & PAGE_MAPPING_ANON, + VM_WARN_ONCE((unsigned long)mapping & FOLIO_MAPPING_ANON, "Anonymous mapping always supports large folio"); return mapping_max_folio_order(mapping) > 0; diff --git a/mm/gup.c b/mm/gup.c index 30d320719fa2..adffe663594d 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2804,9 +2804,9 @@ static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags) return false; /* Anonymous folios pose no problem. 
*/ - mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS; + mapping_flags = (unsigned long)mapping & FOLIO_MAPPING_FLAGS; if (mapping_flags) - return mapping_flags & PAGE_MAPPING_ANON; + return mapping_flags & FOLIO_MAPPING_ANON; /* * At this point, we know the mapping is non-null and points to an diff --git a/mm/internal.h b/mm/internal.h index 22a95a2b7fa1..2e235740128a 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -149,7 +149,7 @@ static inline void *folio_raw_mapping(const struct folio *folio) { unsigned long mapping = (unsigned long)folio->mapping; - return (void *)(mapping & ~PAGE_MAPPING_FLAGS); + return (void *)(mapping & ~FOLIO_MAPPING_FLAGS); } /* diff --git a/mm/ksm.c b/mm/ksm.c index ef73b25fd65a..2b0210d41c55 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -893,7 +893,7 @@ static struct folio *ksm_get_folio(struct ksm_stable_node *stable_node, unsigned long kpfn; expected_mapping = (void *)((unsigned long)stable_node | - PAGE_MAPPING_KSM); + FOLIO_MAPPING_KSM); again: kpfn = READ_ONCE(stable_node->kpfn); /* Address dependency. */ folio = pfn_folio(kpfn); @@ -1070,7 +1070,7 @@ static inline void folio_set_stable_node(struct folio *folio, struct ksm_stable_node *stable_node) { VM_WARN_ON_FOLIO(folio_test_anon(folio) && PageAnonExclusive(&folio->page), folio); - folio->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM); + folio->mapping = (void *)((unsigned long)stable_node | FOLIO_MAPPING_KSM); } #ifdef CONFIG_SYSFS diff --git a/mm/rmap.c b/mm/rmap.c index bd83724d14b6..4b1a2a33e39f 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -503,12 +503,12 @@ struct anon_vma *folio_get_anon_vma(const struct folio *folio) rcu_read_lock(); anon_mapping = (unsigned long)READ_ONCE(folio->mapping); - if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON) + if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON) goto out; if (!folio_mapped(folio)) goto out; - anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); + anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON); if (!atomic_inc_not_zero(&anon_vma->refcount)) { anon_vma = NULL; goto out; @@ -550,12 +550,12 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio, retry: rcu_read_lock(); anon_mapping = (unsigned long)READ_ONCE(folio->mapping); - if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON) + if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON) goto out; if (!folio_mapped(folio)) goto out; - anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); + anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON); root_anon_vma = READ_ONCE(anon_vma->root); if (down_read_trylock(&root_anon_vma->rwsem)) { /* @@ -1334,9 +1334,9 @@ void folio_move_anon_rmap(struct folio *folio, struct vm_area_struct *vma) VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_VMA(!anon_vma, vma); - anon_vma += PAGE_MAPPING_ANON; + anon_vma += FOLIO_MAPPING_ANON; /* - * Ensure that anon_vma and the PAGE_MAPPING_ANON bit are written + * Ensure that anon_vma and the FOLIO_MAPPING_ANON bit are written * simultaneously, so a concurrent reader (eg folio_referenced()'s * folio_test_anon()) will not see one without the other. */ @@ -1367,10 +1367,10 @@ static void __folio_set_anon(struct folio *folio, struct vm_area_struct *vma, /* * page_idle does a lockless/optimistic rmap scan on folio->mapping. 
* Make sure the compiler doesn't split the stores of anon_vma and - * the PAGE_MAPPING_ANON type identifier, otherwise the rmap code + * the FOLIO_MAPPING_ANON type identifier, otherwise the rmap code * could mistake the mapping for a struct address_space and crash. */ - anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; + anon_vma = (void *) anon_vma + FOLIO_MAPPING_ANON; WRITE_ONCE(folio->mapping, (struct address_space *) anon_vma); folio->index = linear_page_index(vma, address); } diff --git a/mm/util.c b/mm/util.c index 0b270c43d7d1..20bbfe4ce1b8 100644 --- a/mm/util.c +++ b/mm/util.c @@ -670,9 +670,9 @@ struct anon_vma *folio_anon_vma(const struct folio *folio) { unsigned long mapping = (unsigned long)folio->mapping; - if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON) + if ((mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON) return NULL; - return (void *)(mapping - PAGE_MAPPING_ANON); + return (void *)(mapping - FOLIO_MAPPING_ANON); } /** @@ -699,7 +699,7 @@ struct address_space *folio_mapping(struct folio *folio) return swap_address_space(folio->swap); mapping = folio->mapping; - if ((unsigned long)mapping & PAGE_MAPPING_FLAGS) + if ((unsigned long)mapping & FOLIO_MAPPING_FLAGS) return NULL; return mapping; From 677e0e35d6cdd972cc6e5de65869fe698856267b Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:21 +0200 Subject: [PATCH 201/370] docs/mm: convert from "Non-LRU page migration" to "movable_ops page migration" MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Let's bring the docs up-to-date. Link: https://lkml.kernel.org/r/20250704102524.326966-28-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Lorenzo Stoakes Reviewed-by: Harry Yoo Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Signed-off-by: Andrew Morton --- Documentation/mm/page_migration.rst | 35 ++++++++++++++++++++--------- 1 file changed, 25 insertions(+), 10 deletions(-) diff --git a/Documentation/mm/page_migration.rst b/Documentation/mm/page_migration.rst index 519b35a4caf5..34602b254aa6 100644 --- a/Documentation/mm/page_migration.rst +++ b/Documentation/mm/page_migration.rst @@ -146,18 +146,33 @@ Steps: 18. The new page is moved to the LRU and can be scanned by the swapper, etc. again. -Non-LRU page migration -====================== +movable_ops page migration +========================== -Although migration originally aimed for reducing the latency of memory -accesses for NUMA, compaction also uses migration to create high-order -pages. For compaction purposes, it is also useful to be able to move -non-LRU pages, such as zsmalloc and virtio-balloon pages. 
+Selected typed, non-folio pages (e.g., pages inflated in a memory balloon, +zsmalloc pages) can be migrated using the movable_ops migration framework. -If a driver wants to make its pages movable, it should define a struct -movable_operations. It then needs to call __SetPageMovable() on each -page that it may be able to move. This uses the ``page->mapping`` field, -so this field is not available for the driver to use for other purposes. +The "struct movable_operations" provide callbacks specific to a page type +for isolating, migrating and un-isolating (putback) these pages. + +Once a page is indicated as having movable_ops, that condition must not +change until the page was freed back to the buddy. This includes not +changing/clearing the page type and not changing/clearing the +PG_movable_ops page flag. + +Arbitrary drivers cannot currently make use of this framework, as it +requires: + +(a) a page type +(b) indicating them as possibly having movable_ops in page_has_movable_ops() + based on the page type +(c) returning the movable_ops from page_movable_ops() based on the page + type +(d) not reusing the PG_movable_ops and PG_movable_ops_isolated page flags + for other purposes + +For example, balloon drivers can make use of this framework through the +balloon-compaction infrastructure residing in the core kernel. Monitoring Migration ===================== From f5e43012b86aa6a0b0ad5881cb68bfb872826c22 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:22 +0200 Subject: [PATCH 202/370] mm/balloon_compaction: "movable_ops" doc updates MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Let's bring the docs up-to-date. Setting PG_movable_ops + page->private very likely still requires to be performed under documented locks: it's complicated. We will rework this in the future, as we will try avoiding using the page lock. Link: https://lkml.kernel.org/r/20250704102524.326966-29-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: Harry Yoo Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/balloon_compaction.h | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index b222b0737c46..2fecfead91d2 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -4,12 +4,13 @@ * * Common interface definitions for making balloon pages movable by compaction. * - * Balloon page migration makes use of the general non-lru movable page + * Balloon page migration makes use of the general "movable_ops page migration" * feature. * * page->private is used to reference the responsible balloon device. 
- * page->mapping is used in context of non-lru page migration to reference - * the address space operations for page isolation/migration/compaction. + * That these pages have movable_ops, and which movable_ops apply, + * is derived from the page type (PageOffline()) combined with the + * PG_movable_ops flag (PageMovableOps()). * * As the page isolation scanning step a compaction thread does is a lockless * procedure (from a page standpoint), it might bring some racy situations while @@ -17,12 +18,10 @@ * and safely perform balloon's page compaction and migration we must, always, * ensure following these simple rules: * - * i. when updating a balloon's page ->mapping element, strictly do it under - * the following lock order, independently of the far superior - * locking scheme (lru_lock, balloon_lock): + * i. Setting the PG_movable_ops flag and page->private with the following + * lock order * +-page_lock(page); * +--spin_lock_irq(&b_dev_info->pages_lock); - * ... page->mapping updates here ... * * ii. isolation or dequeueing procedure must remove the page from balloon * device page list under b_dev_info->pages_lock. From 9640b17a89a86f40df47bfc831a8afeff0c7eabc Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 4 Jul 2025 12:25:23 +0200 Subject: [PATCH 203/370] mm/balloon_compaction: provide single balloon_page_insert() and balloon_mapping_gfp_mask() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Let's just special-case based on IS_ENABLED(CONFIG_BALLOON_COMPACTION) like we did for balloon_page_finalize(). Link: https://lkml.kernel.org/r/20250704102524.326966-30-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Cc: Alistair Popple Cc: Al Viro Cc: Arnd Bergmann Cc: Brendan Jackman Cc: Byungchul Park Cc: Chengming Zhou Cc: Christian Brauner Cc: Christophe Leroy Cc: Eugenio Pé rez Cc: Greg Kroah-Hartman Cc: Gregory Price Cc: Harry Yoo Cc: "Huang, Ying" Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Wang Cc: Jerrin Shaji George Cc: Johannes Weiner Cc: John Hubbard Cc: Jonathan Corbet Cc: Joshua Hahn Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Mathew Brost Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Cc: Michael Ellerman Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Rapoport Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Qi Zheng Cc: Rakie Kim Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: xu xin Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/balloon_compaction.h | 42 +++++++++++------------------- 1 file changed, 15 insertions(+), 27 deletions(-) diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index 2fecfead91d2..7cfe48769239 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -77,6 +77,15 @@ static inline void balloon_devinfo_init(struct balloon_dev_info *balloon) #ifdef CONFIG_BALLOON_COMPACTION extern const struct movable_operations balloon_mops; +/* + * balloon_page_device - get the b_dev_info descriptor for the balloon device + * that enqueues the given page. 
+ */ +static inline struct balloon_dev_info *balloon_page_device(struct page *page) +{ + return (struct balloon_dev_info *)page_private(page); +} +#endif /* CONFIG_BALLOON_COMPACTION */ /* * balloon_page_insert - insert a page into the balloon's page list and make @@ -91,41 +100,20 @@ static inline void balloon_page_insert(struct balloon_dev_info *balloon, struct page *page) { __SetPageOffline(page); - SetPageMovableOps(page); - set_page_private(page, (unsigned long)balloon); - list_add(&page->lru, &balloon->pages); -} - -/* - * balloon_page_device - get the b_dev_info descriptor for the balloon device - * that enqueues the given page. - */ -static inline struct balloon_dev_info *balloon_page_device(struct page *page) -{ - return (struct balloon_dev_info *)page_private(page); -} - -static inline gfp_t balloon_mapping_gfp_mask(void) -{ - return GFP_HIGHUSER_MOVABLE; -} - -#else /* !CONFIG_BALLOON_COMPACTION */ - -static inline void balloon_page_insert(struct balloon_dev_info *balloon, - struct page *page) -{ - __SetPageOffline(page); + if (IS_ENABLED(CONFIG_BALLOON_COMPACTION)) { + SetPageMovableOps(page); + set_page_private(page, (unsigned long)balloon); + } list_add(&page->lru, &balloon->pages); } static inline gfp_t balloon_mapping_gfp_mask(void) { + if (IS_ENABLED(CONFIG_BALLOON_COMPACTION)) + return GFP_HIGHUSER_MOVABLE; return GFP_HIGHUSER; } -#endif /* CONFIG_BALLOON_COMPACTION */ - /* * balloon_page_finalize - prepare a balloon page that was removed from the * balloon list for release to the page allocator From ee58e38489772f356c1ac79e0724183497e43249 Mon Sep 17 00:00:00 2001 From: Raghavendra K T Date: Wed, 2 Jul 2025 06:43:19 +0000 Subject: [PATCH 204/370] lib/test_vmalloc.c: introduce xfail for failing tests The test align_shift_alloc_test is expected to fail. Reporting the test as fail confuses to be a genuine failure. Introduce widely used xfail sematics to address the issue. Note: a warn_alloc dump similar to below is still expected: Call Trace: dump_stack_lvl+0x64/0x80 warn_alloc+0x137/0x1b0 ? __get_vm_area_node+0x134/0x140 Snippet of dmesg after change: Summary: random_size_align_alloc_test passed: 1 failed: 0 xfailed: 0 .. Summary: align_shift_alloc_test passed: 0 failed: 0 xfailed: 1 .. Summary: pcpu_alloc_test passed: 1 failed: 0 xfailed: 0 .. 
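For reference, the classification this patch introduces boils down to a small truth table. The sketch below is illustration only (plain C, not kernel code) and mirrors the pass/fail/xfail accounting added to test_func(): an expected failure counts as xfailed, while an unexpected pass or unexpected failure counts as failed.

  #include <stdbool.h>

  enum outcome { PASSED, FAILED, XFAILED };

  /* ret is the test function's return value, xfail its expectation flag. */
  static enum outcome classify(int ret, bool xfail)
  {
          if (!ret && !xfail)
                  return PASSED;          /* succeeded, and was expected to */
          if (ret && xfail)
                  return XFAILED;         /* failed, but that was expected */
          return FAILED;                  /* unexpected pass or unexpected failure */
  }
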
Link: https://lkml.kernel.org/r/20250702064319.885-1-raghavendra.kt@amd.com Signed-off-by: Raghavendra K T Reviewed-by: "Uladzislau Rezki (Sony)" Cc: Dev Jain Signed-off-by: Andrew Morton --- lib/test_vmalloc.c | 36 +++++++++++++++++++++--------------- 1 file changed, 21 insertions(+), 15 deletions(-) diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c index c1966cf72ab8..2815658ccc37 100644 --- a/lib/test_vmalloc.c +++ b/lib/test_vmalloc.c @@ -396,25 +396,27 @@ vm_map_ram_test(void) struct test_case_desc { const char *test_name; int (*test_func)(void); + bool xfail; }; static struct test_case_desc test_case_array[] = { - { "fix_size_alloc_test", fix_size_alloc_test }, - { "full_fit_alloc_test", full_fit_alloc_test }, - { "long_busy_list_alloc_test", long_busy_list_alloc_test }, - { "random_size_alloc_test", random_size_alloc_test }, - { "fix_align_alloc_test", fix_align_alloc_test }, - { "random_size_align_alloc_test", random_size_align_alloc_test }, - { "align_shift_alloc_test", align_shift_alloc_test }, - { "pcpu_alloc_test", pcpu_alloc_test }, - { "kvfree_rcu_1_arg_vmalloc_test", kvfree_rcu_1_arg_vmalloc_test }, - { "kvfree_rcu_2_arg_vmalloc_test", kvfree_rcu_2_arg_vmalloc_test }, - { "vm_map_ram_test", vm_map_ram_test }, + { "fix_size_alloc_test", fix_size_alloc_test, }, + { "full_fit_alloc_test", full_fit_alloc_test, }, + { "long_busy_list_alloc_test", long_busy_list_alloc_test, }, + { "random_size_alloc_test", random_size_alloc_test, }, + { "fix_align_alloc_test", fix_align_alloc_test, }, + { "random_size_align_alloc_test", random_size_align_alloc_test, }, + { "align_shift_alloc_test", align_shift_alloc_test, true }, + { "pcpu_alloc_test", pcpu_alloc_test, }, + { "kvfree_rcu_1_arg_vmalloc_test", kvfree_rcu_1_arg_vmalloc_test, }, + { "kvfree_rcu_2_arg_vmalloc_test", kvfree_rcu_2_arg_vmalloc_test, }, + { "vm_map_ram_test", vm_map_ram_test, }, /* Add a new test case here. */ }; struct test_case_data { int test_failed; + int test_xfailed; int test_passed; u64 time; }; @@ -444,7 +446,7 @@ static int test_func(void *private) { struct test_driver *t = private; int random_array[ARRAY_SIZE(test_case_array)]; - int index, i, j; + int index, i, j, ret; ktime_t kt; u64 delta; @@ -468,11 +470,14 @@ static int test_func(void *private) */ if (!((run_test_mask & (1 << index)) >> index)) continue; - kt = ktime_get(); for (j = 0; j < test_repeat_count; j++) { - if (!test_case_array[index].test_func()) + ret = test_case_array[index].test_func(); + + if (!ret && !test_case_array[index].xfail) t->data[index].test_passed++; + else if (ret && test_case_array[index].xfail) + t->data[index].test_xfailed++; else t->data[index].test_failed++; } @@ -576,10 +581,11 @@ static void do_concurrent_test(void) continue; pr_info( - "Summary: %s passed: %d failed: %d repeat: %d loops: %d avg: %llu usec\n", + "Summary: %s passed: %d failed: %d xfailed: %d repeat: %d loops: %d avg: %llu usec\n", test_case_array[j].test_name, t->data[j].test_passed, t->data[j].test_failed, + t->data[j].test_xfailed, test_repeat_count, test_loop_count, t->data[j].time); } From 7f810385fde490a831fdba36978a52a927b23922 Mon Sep 17 00:00:00 2001 From: Dev Jain Date: Fri, 4 Jul 2025 09:34:17 +0530 Subject: [PATCH 205/370] khugepaged: reduce race probability between migration and khugepaged Suppose a folio is under migration, and khugepaged is also trying to collapse it. 
collapse_pte_mapped_thp() will retrieve the folio from the page cache via filemap_lock_folio(), thus taking a reference on the folio and sleeping on the folio lock, since the lock is held by the migration path. Migration will then fail in __folio_migrate_mapping -> folio_ref_freeze. Reduce the probability of such a race happening (leading to migration failure) by bailing out if we detect a PMD is marked with a migration entry. This fixes the migration-shared-anon-thp testcase failure on Apple M3. Note that, this is not a "fix" since it only reduces the chance of interference of khugepaged with migration, wherein both the kernel functionalities are deemed "best-effort". Link: https://lkml.kernel.org/r/20250704040417.63826-1-dev.jain@arm.com Signed-off-by: Dev Jain Acked-by: David Hildenbrand Acked-by: Oscar Salvador Reviewed-by: Anshuman Khandual Reviewed-by: Zi Yan Reviewed-by: Baolin Wang Cc: Barry Song Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Mariano Pache Cc: Ryan Roberts Signed-off-by: Andrew Morton --- mm/khugepaged.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 1aa7ca67c756..a55fb1dcd224 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -941,6 +941,14 @@ static inline int check_pmd_state(pmd_t *pmd) if (pmd_none(pmde)) return SCAN_PMD_NONE; + + /* + * The folio may be under migration when khugepaged is trying to + * collapse it. Migration success or failure will eventually end + * up with a present PMD mapping a folio again. + */ + if (is_pmd_migration_entry(pmde)) + return SCAN_PMD_MAPPED; if (!pmd_present(pmde)) return SCAN_PMD_NULL; if (pmd_trans_huge(pmde)) From 214db70287277096e77d804284066ce1c07297dd Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Fri, 4 Jul 2025 15:14:07 -0700 Subject: [PATCH 206/370] mm/damon: add trace event for auto-tuned monitoring intervals Patch series "mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota". The aim-oriented auto-tuning features for monitoring intervals and DAMOS quota are important and recommended. Add tracepoints for observabilities of those tuned values and the tuning itself. This patch (of 2): Aim-oriented monitoring intervals auto-tuning is an important and recommended feature for DAMON users. Add a trace event for the observability of the tuned intervals and tuning itself. 
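For context, TRACE_EVENT(damon_monitor_intervals_tune, ...) makes the kernel generate a trace_damon_monitor_intervals_tune() emitter plus a trace_damon_monitor_intervals_tune_enabled() static-key check. A minimal sketch of a call site follows; the helper name is made up for illustration, and the _enabled() guard is optional here since the emitter already compiles to a no-op while the event is off:

  #include <trace/events/damon.h>

  /* Hypothetical helper, shown only to illustrate the generated API. */
  static void report_tuned_interval(unsigned long sample_us)
  {
          /* Guarding is only worthwhile if preparing arguments were costly. */
          if (trace_damon_monitor_intervals_tune_enabled())
                  trace_damon_monitor_intervals_tune(sample_us);
  }
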
Link: https://lkml.kernel.org/r/20250704221408.38510-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250704221408.38510-2-sj@kernel.org Signed-off-by: SeongJae Park Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Steven Rostedt Cc: kernel test robot Signed-off-by: Andrew Morton --- include/trace/events/damon.h | 17 +++++++++++++++++ mm/damon/core.c | 1 + 2 files changed, 18 insertions(+) diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h index da4bd9fd1162..32c611076023 100644 --- a/include/trace/events/damon.h +++ b/include/trace/events/damon.h @@ -48,6 +48,23 @@ TRACE_EVENT_CONDITION(damos_before_apply, __entry->nr_accesses, __entry->age) ); +TRACE_EVENT(damon_monitor_intervals_tune, + + TP_PROTO(unsigned long sample_us), + + TP_ARGS(sample_us), + + TP_STRUCT__entry( + __field(unsigned long, sample_us) + ), + + TP_fast_assign( + __entry->sample_us = sample_us; + ), + + TP_printk("sample_us=%lu", __entry->sample_us) +); + TRACE_EVENT(damon_aggregated, TP_PROTO(unsigned int target_id, struct damon_region *r, diff --git a/mm/damon/core.c b/mm/damon/core.c index 979b29e16ef4..57a1ace4d10d 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -1490,6 +1490,7 @@ static void kdamond_tune_intervals(struct damon_ctx *c) new_attrs.sample_interval); new_attrs.aggr_interval = new_attrs.sample_interval * c->attrs.aggr_samples; + trace_damon_monitor_intervals_tune(new_attrs.sample_interval); damon_set_attrs(c, &new_attrs); } From a86d695193bfab3f130f9275c275e4e143dcd2e3 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Fri, 4 Jul 2025 15:14:08 -0700 Subject: [PATCH 207/370] mm/damon: add trace event for effective size quota Aim-oriented DAMOS quota auto-tuning is an important and recommended feature for DAMOS users. Add a trace event for the observability of the tuned quota and tuning itself. 
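Both new events live under the 'damon' trace system, so they can be watched from user space through tracefs. A rough sketch, assuming tracefs is mounted at /sys/kernel/tracing and following the usual events/<system>/<event>/enable layout:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          char buf[4096];
          ssize_t n;
          int fd;

          /* Enable the event added by this patch. */
          fd = open("/sys/kernel/tracing/events/damon/damos_esz/enable", O_WRONLY);
          if (fd < 0 || write(fd, "1", 1) != 1)
                  return 1;
          close(fd);

          /* Stream the formatted records as they are emitted. */
          fd = open("/sys/kernel/tracing/trace_pipe", O_RDONLY);
          if (fd < 0)
                  return 1;
          while ((n = read(fd, buf, sizeof(buf))) > 0)
                  fwrite(buf, 1, n, stdout);
          close(fd);
          return 0;
  }
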
[sj@kernel.org: initialize sidx in damos_trace_esz()] Link: https://lkml.kernel.org/r/20250705172003.52324-1-sj@kernel.org [sj@kernel.org: make damos_esz unconditional trace event] Link: https://lkml.kernel.org/r/20250709182843.35812-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250704221408.38510-3-sj@kernel.org Signed-off-by: SeongJae Park Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Steven Rostedt Cc: kernel test robot Signed-off-by: Andrew Morton --- include/trace/events/damon.h | 24 ++++++++++++++++++++++++ mm/damon/core.c | 20 +++++++++++++++++++- 2 files changed, 43 insertions(+), 1 deletion(-) diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h index 32c611076023..852d725afea2 100644 --- a/include/trace/events/damon.h +++ b/include/trace/events/damon.h @@ -9,6 +9,30 @@ #include #include +TRACE_EVENT(damos_esz, + + TP_PROTO(unsigned int context_idx, unsigned int scheme_idx, + unsigned long esz), + + TP_ARGS(context_idx, scheme_idx, esz), + + TP_STRUCT__entry( + __field(unsigned int, context_idx) + __field(unsigned int, scheme_idx) + __field(unsigned long, esz) + ), + + TP_fast_assign( + __entry->context_idx = context_idx; + __entry->scheme_idx = scheme_idx; + __entry->esz = esz; + ), + + TP_printk("ctx_idx=%u scheme_idx=%u esz=%lu", + __entry->context_idx, __entry->scheme_idx, + __entry->esz) +); + TRACE_EVENT_CONDITION(damos_before_apply, TP_PROTO(unsigned int context_idx, unsigned int scheme_idx, diff --git a/mm/damon/core.c b/mm/damon/core.c index 57a1ace4d10d..e8036254cc98 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -2011,12 +2011,26 @@ static void damos_set_effective_quota(struct damos_quota *quota) quota->esz = esz; } +static void damos_trace_esz(struct damon_ctx *c, struct damos *s, + struct damos_quota *quota) +{ + unsigned int cidx = 0, sidx = 0; + struct damos *siter; + + damon_for_each_scheme(siter, c) { + if (siter == s) + break; + sidx++; + } + trace_damos_esz(cidx, sidx, quota->esz); +} + static void damos_adjust_quota(struct damon_ctx *c, struct damos *s) { struct damos_quota *quota = &s->quota; struct damon_target *t; struct damon_region *r; - unsigned long cumulated_sz; + unsigned long cumulated_sz, cached_esz; unsigned int score, max_score = 0; if (!quota->ms && !quota->sz && list_empty("a->goals)) @@ -2030,7 +2044,11 @@ static void damos_adjust_quota(struct damon_ctx *c, struct damos *s) quota->total_charged_sz += quota->charged_sz; quota->charged_from = jiffies; quota->charged_sz = 0; + if (trace_damos_esz_enabled()) + cached_esz = quota->esz; damos_set_effective_quota(quota); + if (trace_damos_esz_enabled() && quota->esz != cached_esz) + damos_trace_esz(c, s, quota); } if (!c->ops.get_scheme_score) From 0ed1165c37277822b519f519d0982d36efc30006 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 6 Jul 2025 12:32:02 -0700 Subject: [PATCH 208/370] samples/damon/wsse: fix boot time enable handling Patch series "mm/damon: fix misc bugs in DAMON modules". From manual code review, I found below bugs in DAMON modules. DAMON sample modules crash if those are enabled at boot time, via kernel command line. A similar issue was found and fixed on DAMON non-sample modules in the past, but we didn't check that for sample modules. DAMON non-sample modules are not setting 'enabled' parameters accordingly when real enabling is failed. Honggyu found and fixed[1] this type of bugs in DAMON sample modules, and my inspection was motivated by the great work. Kudos to Honggyu. 
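The first bug class (crash when 'enable' is set on the kernel command line) and the second ('enabled' left reporting Y after a failed start) come down to the same module-parameter pattern. A condensed sketch of the shape the fixes converge on, with parameter registration elided; the example_* names are placeholders, not functions from any of the modules:

  #include <linux/kstrtox.h>
  #include <linux/module.h>

  static bool enabled;
  static bool init_called;

  static int example_start(void);
  static int example_stop(void);

  static int example_enable_store(const char *val, const struct kernel_param *kp)
  {
          int err = kstrtobool(val, &enabled);

          if (err)
                  return err;
          /*
           * Set from the kernel command line: module_init() has not run yet
           * and the slab allocator may not even be up, so only record the
           * request here and act on it from the init function below.
           */
          if (!init_called)
                  return 0;
          return enabled ? example_start() : example_stop();
  }

  static int __init example_init(void)
  {
          int err = 0;

          init_called = true;
          if (enabled) {
                  err = example_start();
                  if (err)
                          enabled = false;        /* keep the parameter truthful */
          }
          return err;
  }
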
Finally, DAMON_RECLIAM is mistakenly losing scheme internal status due to misuse of damon_commit_ctx(). DAMON_LRU_SORT has a similar misuse, but fortunately it is not causing real status loss. Fix the bugs. Since these are similar patterns of bugs that were found in the past, it would be better to add tests or refactor the code, in future. This patch (of 6): If 'enable' parameter of the 'wsse' DAMON sample module is set at boot time via the kernel command line, memory allocation is tried before the slab is initialized. As a result kernel NULL pointer dereference BUG can happen. Fix it by checking the initialization status. Link: https://lkml.kernel.org/r/20250706193207.39810-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250706193207.39810-2-sj@kernel.org Link: https://lore.kernel.org/20250702000205.1921-1-honggyu.kim@sk.com [1] Fixes: b757c6cfc696 ("samples/damon/wsse: start and stop DAMON as the user requests") Signed-off-by: SeongJae Park Cc: Signed-off-by: Andrew Morton --- samples/damon/wsse.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/samples/damon/wsse.c b/samples/damon/wsse.c index e20238a249e7..e941958b1032 100644 --- a/samples/damon/wsse.c +++ b/samples/damon/wsse.c @@ -89,6 +89,8 @@ static void damon_sample_wsse_stop(void) put_pid(target_pidp); } +static bool init_called; + static int damon_sample_wsse_enable_store( const char *val, const struct kernel_param *kp) { @@ -114,7 +116,15 @@ static int damon_sample_wsse_enable_store( static int __init damon_sample_wsse_init(void) { - return 0; + int err = 0; + + init_called = true; + if (enable) { + err = damon_sample_wsse_start(); + if (err) + enable = false; + } + return err; } module_init(damon_sample_wsse_init); From 2780505ec2b42c07853b34640bc63279ac2bb53b Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 6 Jul 2025 12:32:03 -0700 Subject: [PATCH 209/370] samples/damon/prcl: fix boot time enable crash If 'enable' parameter of the 'prcl' DAMON sample module is set at boot time via the kernel command line, memory allocation is tried before the slab is initialized. As a result kernel NULL pointer dereference BUG can happen. Fix it by checking the initialization status. Link: https://lkml.kernel.org/r/20250706193207.39810-3-sj@kernel.org Fixes: 2aca254620a8 ("samples/damon: introduce a skeleton of a smaple DAMON module for proactive reclamation") Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- samples/damon/prcl.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/samples/damon/prcl.c b/samples/damon/prcl.c index 5597e6a08ab2..a9d7629d70f0 100644 --- a/samples/damon/prcl.c +++ b/samples/damon/prcl.c @@ -109,6 +109,8 @@ static void damon_sample_prcl_stop(void) put_pid(target_pidp); } +static bool init_called; + static int damon_sample_prcl_enable_store( const char *val, const struct kernel_param *kp) { @@ -134,6 +136,14 @@ static int damon_sample_prcl_enable_store( static int __init damon_sample_prcl_init(void) { + int err = 0; + + init_called = true; + if (enable) { + err = damon_sample_prcl_start(); + if (err) + enable = false; + } return 0; } From 964314344eab7bc43e38a32be281c5ea0609773b Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 6 Jul 2025 12:32:04 -0700 Subject: [PATCH 210/370] samples/damon/mtier: support boot time enable setup If 'enable' parameter of the 'mtier' DAMON sample module is set at boot time via the kernel command line, memory allocation is tried before the slab is initialized. 
As a result kernel NULL pointer dereference BUG can happen. Fix it by checking the initialization status. Link: https://lkml.kernel.org/r/20250706193207.39810-4-sj@kernel.org Fixes: 82a08bde3cf7 ("samples/damon: implement a DAMON module for memory tiering") Signed-off-by: SeongJae Park Cc: Signed-off-by: Andrew Morton --- samples/damon/mtier.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/samples/damon/mtier.c b/samples/damon/mtier.c index 97892ade7f31..bc9d8195da87 100644 --- a/samples/damon/mtier.c +++ b/samples/damon/mtier.c @@ -157,6 +157,8 @@ static void damon_sample_mtier_stop(void) damon_destroy_ctx(ctxs[1]); } +static bool init_called; + static int damon_sample_mtier_enable_store( const char *val, const struct kernel_param *kp) { @@ -182,6 +184,14 @@ static int damon_sample_mtier_enable_store( static int __init damon_sample_mtier_init(void) { + int err = 0; + + init_called = true; + if (enable) { + err = damon_sample_mtier_start(); + if (err) + enable = false; + } return 0; } From 737e40d5eb2f6507113285cb52c50a4a6b01f44a Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 6 Jul 2025 12:32:05 -0700 Subject: [PATCH 211/370] mm/damon/reclaim: reset enabled when DAMON start failed When the startup fails, 'enabled' parameter is not reset. As a result, users show the parameter 'Y' while it is not really working. Fix it by resetting 'enabled' to 'false' when the work is failed. Link: https://lkml.kernel.org/r/20250706193207.39810-5-sj@kernel.org Fixes: 04e98764befa ("mm/damon/reclaim: enable and disable synchronously") Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/reclaim.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c index a675150965e0..c91098d8aa51 100644 --- a/mm/damon/reclaim.c +++ b/mm/damon/reclaim.c @@ -329,7 +329,7 @@ static int __init damon_reclaim_init(void) int err = damon_modules_new_paddr_ctx_target(&ctx, &target); if (err) - return err; + goto out; ctx->callback.after_wmarks_check = damon_reclaim_after_wmarks_check; ctx->callback.after_aggregation = damon_reclaim_after_aggregation; @@ -338,6 +338,9 @@ static int __init damon_reclaim_init(void) if (enabled) err = damon_reclaim_turn(true); +out: + if (err && enabled) + enabled = false; return err; } From b91b82e241822ef1f287c3a8214f7bdeef4e85b2 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 6 Jul 2025 12:32:06 -0700 Subject: [PATCH 212/370] mm/damon/lru_sort: reset enabled when DAMON start failed When the startup fails, 'enabled' parameter is not reset. As a result, users show the parameter 'Y' while it is not really working. Fix it by resetting 'enabled' to 'false' when the work is failed. 
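This fix and the preceding damon_reclaim one share the same init-path shape. A condensed sketch, with callback wiring elided and example_turn() standing in for damon_reclaim_turn()/damon_lru_sort_turn():

  static int __init example_module_init(void)
  {
          int err = damon_modules_new_paddr_ctx_target(&ctx, &target);

          if (err)
                  goto out;

          /* ... wire up callbacks and the monitoring target ... */

          if (enabled)
                  err = example_turn(true);
  out:
          /*
           * A failure anywhere above must also clear 'enabled', otherwise
           * reading the module parameter keeps reporting 'Y' for a module
           * that never actually started.
           */
          if (err && enabled)
                  enabled = false;
          return err;
  }
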
Link: https://lkml.kernel.org/r/20250706193207.39810-6-sj@kernel.org Fixes: 7a034fbba336 ("mm/damon/lru_sort: enable and disable synchronously") Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/lru_sort.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c index 4af8fd4a390b..9bd8a1a115e0 100644 --- a/mm/damon/lru_sort.c +++ b/mm/damon/lru_sort.c @@ -325,7 +325,7 @@ static int __init damon_lru_sort_init(void) int err = damon_modules_new_paddr_ctx_target(&ctx, &target); if (err) - return err; + goto out; ctx->callback.after_wmarks_check = damon_lru_sort_after_wmarks_check; ctx->callback.after_aggregation = damon_lru_sort_after_aggregation; @@ -334,6 +334,9 @@ static int __init damon_lru_sort_init(void) if (enabled) err = damon_lru_sort_turn(true); +out: + if (err && enabled) + enabled = false; return err; } From fed48693bdfeca666f7536ba88a05e9a4e5523b6 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 6 Jul 2025 12:32:07 -0700 Subject: [PATCH 213/370] mm/damon/reclaim: use parameter context correctly damon_reclaim_apply_parameters() allocates a new DAMON context, stages user-specified DAMON parameters on it, and commits to running DAMON context at once, using damon_commit_ctx(). The code is mistakenly over-writing the monitoring attributes and the reclaim scheme on the running context. It is not causing a real problem for monitoring attributes, but the scheme overwriting can remove scheme's internal status such as charged quota. Fix the wrong use of the parameter context. Link: https://lkml.kernel.org/r/20250706193207.39810-7-sj@kernel.org Fixes: 11ddcfc257a3 ("mm/damon/reclaim: use damon_commit_ctx()") Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/reclaim.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c index c91098d8aa51..0fe8996328b8 100644 --- a/mm/damon/reclaim.c +++ b/mm/damon/reclaim.c @@ -194,7 +194,7 @@ static int damon_reclaim_apply_parameters(void) if (err) return err; - err = damon_set_attrs(ctx, &damon_reclaim_mon_attrs); + err = damon_set_attrs(param_ctx, &damon_reclaim_mon_attrs); if (err) goto out; @@ -202,7 +202,7 @@ static int damon_reclaim_apply_parameters(void) scheme = damon_reclaim_new_scheme(); if (!scheme) goto out; - damon_set_schemes(ctx, &scheme, 1); + damon_set_schemes(param_ctx, &scheme, 1); if (quota_mem_pressure_us) { goal = damos_new_quota_goal(DAMOS_QUOTA_SOME_MEM_PSI_US, From e89d5bf3a4f92f038817d77339a9c04904da8ad9 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Fri, 4 Jul 2025 11:19:26 +0800 Subject: [PATCH 214/370] mm: fault in complete folios instead of individual pages for tmpfs After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"), tmpfs can also support large folio allocation (not just PMD-sized large folios). However, when accessing tmpfs via mmap(), although tmpfs supports large folios, we still establish mappings at the base page granularity, which is unreasonable. We can map multiple consecutive pages of tmpfs folios at once according to the size of the large folio. On one hand, this can reduce the overhead of page faults; on the other hand, it can leverage hardware architecture optimizations to reduce TLB misses, such as contiguous PTEs on the ARM architecture. Moreover, tmpfs mount will use the 'huge=' option to control large folio allocation explicitly. 
So it can be understood that the process's RSS statistics might increase, and I think this will not cause any obvious effects for users. Performance test: I created a 1G tmpfs file, populated with 64K large folios, and write-accessed it sequentially via mmap(). I observed a significant performance improvement: Before the patch: real 0m0.158s user 0m0.008s sys 0m0.150s After the patch: real 0m0.021s user 0m0.004s sys 0m0.017s Link: https://lkml.kernel.org/r/440940e78aeb7430c5cc8b6d2088ae98265b9809.1751599072.git.baolin.wang@linux.alibaba.com Fixes: acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs") Signed-off-by: Baolin Wang Acked-by: David Hildenbrand Acked-by: Zi Yan Reviewed-by: Vishal Moola (Oracle) Reviewed-by: Barry Song Reviewed-by: Lorenzo Stoakes Cc: Baolin Wang Cc: Dev Jain Cc: Hugh Dickins Cc: Liam Howlett Cc: Mariano Pache Cc: Michal Hocko Cc: Mike Rapoport Cc: Ryan Roberts Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/memory.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 0f9b32a20e5b..9944380e947d 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5383,10 +5383,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf) /* * Using per-page fault to maintain the uffd semantics, and same - * approach also applies to non-anonymous-shmem faults to avoid + * approach also applies to non shmem/tmpfs faults to avoid * inflating the RSS of the process. */ - if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma)) || + if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) || unlikely(needs_fallback)) { nr_pages = 1; } else if (nr_pages > 1) { From c1db0cb157c6ddd6e24eb62673022b665ca688b9 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 5 Jul 2025 10:49:55 -0700 Subject: [PATCH 215/370] samples/damon/wsse: rename to have damon_sample_ prefix Patch series "mm/damon: misc cleanups". Yet another round of miscellaneous DAMON cleanups. This patch (of 6): DAMON sample module, wsse has its name 'wsse'. It could conflict with future modules, and not very easy to identify it by name. Use a prefix, "damon_sample_" for the name. Note that this could break users if they depend on the old name. But it is just a sample, so no such usage is expected, or known. Even if such usage exists, updating it for the new name should be straightforward. Link: https://lkml.kernel.org/r/20250705175000.56259-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250705175000.56259-2-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- samples/damon/wsse.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/samples/damon/wsse.c b/samples/damon/wsse.c index e941958b1032..89fc76f31d5e 100644 --- a/samples/damon/wsse.c +++ b/samples/damon/wsse.c @@ -12,6 +12,11 @@ #include #include +#ifdef MODULE_PARAM_PREFIX +#undef MODULE_PARAM_PREFIX +#endif +#define MODULE_PARAM_PREFIX "damon_sample_wsse." + static int target_pid __read_mostly; module_param(target_pid, int, 0600); From 6a52ac0b60317337d57c57c45e6abd4936c386c3 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 5 Jul 2025 10:49:56 -0700 Subject: [PATCH 216/370] samples/damon/prcl: rename to have damon_sample_ prefix DAMON sample module, prcl has its name 'prcl'. It could conflict with future modules, and not very easy to identify it by name. Use a prefix, "damon_sample_" for the name. Note that this could break users if they depend on the old name. But it is just a sample, so no such usage is expected, or known. 
Even if such usage exists, updating it for the new name should be straightforward. Link: https://lkml.kernel.org/r/20250705175000.56259-3-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- samples/damon/prcl.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/samples/damon/prcl.c b/samples/damon/prcl.c index a9d7629d70f0..4b0e47718c62 100644 --- a/samples/damon/prcl.c +++ b/samples/damon/prcl.c @@ -11,6 +11,11 @@ #include #include +#ifdef MODULE_PARAM_PREFIX +#undef MODULE_PARAM_PREFIX +#endif +#define MODULE_PARAM_PREFIX "damon_sample_prcl." + static int target_pid __read_mostly; module_param(target_pid, int, 0600); From 775c96714dca287f1130a7028890254d426a2b5d Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 5 Jul 2025 10:49:57 -0700 Subject: [PATCH 217/370] samples/damon/mtier: rename to have damon_sample_ prefix DAMON sample module, mtier has its name 'mtier'. It could conflict with future modules, and not very easy to identify it by name. Use a prefix, "damon_sample_" for the name. Note that this could break users if they depend on the old name. But it is just a sample, so no such usage is expected, or known. Even if such usage exists, updating it for the new name should be straightforward. Link: https://lkml.kernel.org/r/20250705175000.56259-4-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- samples/damon/mtier.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/samples/damon/mtier.c b/samples/damon/mtier.c index bc9d8195da87..537bc6eae9f7 100644 --- a/samples/damon/mtier.c +++ b/samples/damon/mtier.c @@ -12,6 +12,11 @@ #include #include +#ifdef MODULE_PARAM_PREFIX +#undef MODULE_PARAM_PREFIX +#endif +#define MODULE_PARAM_PREFIX "damon_sample_mtier." + static unsigned long node0_start_addr __read_mostly; module_param(node0_start_addr, ulong, 0600); From d2b5be741a5045272b9d711908eab017632ac022 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 5 Jul 2025 10:49:58 -0700 Subject: [PATCH 218/370] mm/damon/sysfs: use DAMON core API damon_is_running() DAMON core implements a static function to see if a given DAMON context is running. DAMON sysfs interface is implementing the same one on its own. Make the core function non-static and reuse it from the DAMON sysfs interface. Link: https://lkml.kernel.org/r/20250705175000.56259-5-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- include/linux/damon.h | 1 + mm/damon/core.c | 8 +++++++- mm/damon/sysfs.c | 14 ++------------ 3 files changed, 10 insertions(+), 13 deletions(-) diff --git a/include/linux/damon.h b/include/linux/damon.h index bb58e36f019e..e1fea3119538 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -934,6 +934,7 @@ static inline unsigned int damon_max_nr_accesses(const struct damon_attrs *attrs int damon_start(struct damon_ctx **ctxs, int nr_ctxs, bool exclusive); int damon_stop(struct damon_ctx **ctxs, int nr_ctxs); +bool damon_is_running(struct damon_ctx *ctx); int damon_call(struct damon_ctx *ctx, struct damon_call_control *control); int damos_walk(struct damon_ctx *ctx, struct damos_walk_control *control); diff --git a/mm/damon/core.c b/mm/damon/core.c index e8036254cc98..c66583869e95 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -1311,7 +1311,13 @@ int damon_stop(struct damon_ctx **ctxs, int nr_ctxs) return err; } -static bool damon_is_running(struct damon_ctx *ctx) +/** + * damon_is_running() - Returns if a given DAMON context is running. + * @ctx: The DAMON context to see if running. 
+ * + * Return: true if @ctx is running, false otherwise. + */ +bool damon_is_running(struct damon_ctx *ctx) { bool running; diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c index 1b1476b79cdb..79d65dcc9dd0 100644 --- a/mm/damon/sysfs.c +++ b/mm/damon/sysfs.c @@ -1189,16 +1189,6 @@ static void damon_sysfs_kdamond_rm_dirs(struct damon_sysfs_kdamond *kdamond) kobject_put(&kdamond->contexts->kobj); } -static bool damon_sysfs_ctx_running(struct damon_ctx *ctx) -{ - bool running; - - mutex_lock(&ctx->kdamond_lock); - running = ctx->kdamond != NULL; - mutex_unlock(&ctx->kdamond_lock); - return running; -} - /* * enum damon_sysfs_cmd - Commands for a specific kdamond. */ @@ -1275,7 +1265,7 @@ static ssize_t state_show(struct kobject *kobj, struct kobj_attribute *attr, if (!ctx) running = false; else - running = damon_sysfs_ctx_running(ctx); + running = damon_is_running(ctx); return sysfs_emit(buf, "%s\n", running ? damon_sysfs_cmd_strs[DAMON_SYSFS_CMD_ON] : @@ -1429,7 +1419,7 @@ static inline bool damon_sysfs_kdamond_running( struct damon_sysfs_kdamond *kdamond) { return kdamond->damon_ctx && - damon_sysfs_ctx_running(kdamond->damon_ctx); + damon_is_running(kdamond->damon_ctx); } static int damon_sysfs_apply_inputs(struct damon_ctx *ctx, From e000df9ff183e7adc938db8a739690de112dd961 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 5 Jul 2025 10:49:59 -0700 Subject: [PATCH 219/370] mm/damon/sysfs: don't hold kdamond_lock in before_terminate() damon_sysfs_before_terminate() is a DAMON callback that is executed from the kdamond's context. Hence it is safe to access DAMON context internal data. But the function is unnecessarily holding kdamond_lock of the context. It is just unnecessary. Remove the locking code. Link: https://lkml.kernel.org/r/20250705175000.56259-6-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/sysfs.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c index 79d65dcc9dd0..c0193de6fb9a 100644 --- a/mm/damon/sysfs.c +++ b/mm/damon/sysfs.c @@ -1387,12 +1387,10 @@ static void damon_sysfs_before_terminate(struct damon_ctx *ctx) if (!damon_target_has_pid(ctx)) return; - mutex_lock(&ctx->kdamond_lock); damon_for_each_target_safe(t, next, ctx) { put_pid(t->pid); damon_destroy_target(t); } - mutex_unlock(&ctx->kdamond_lock); } /* From 0b473f9e6eace01bf5d450fd696119e827c5fa5d Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 5 Jul 2025 10:50:00 -0700 Subject: [PATCH 220/370] Docs/mm/damon/maintainer-profile: update for mm-new tree Recently a new mm tree for new patches, namely mm-new, has been added. Update DAMON maintainer's profile doc for DAMON patches life cycle, which depend on those of mm trees. Link: https://lkml.kernel.org/r/20250705175000.56259-7-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- Documentation/mm/damon/maintainer-profile.rst | 33 ++++++++++--------- 1 file changed, 18 insertions(+), 15 deletions(-) diff --git a/Documentation/mm/damon/maintainer-profile.rst b/Documentation/mm/damon/maintainer-profile.rst index ce3e98458339..5cd07905a193 100644 --- a/Documentation/mm/damon/maintainer-profile.rst +++ b/Documentation/mm/damon/maintainer-profile.rst @@ -7,9 +7,9 @@ The DAMON subsystem covers the files that are listed in 'DATA ACCESS MONITOR' section of 'MAINTAINERS' file. The mailing lists for the subsystem are damon@lists.linux.dev and -linux-mm@kvack.org. 
Patches should be made against the `mm-unstable tree -`_ whenever possible and posted -to the mailing lists. +linux-mm@kvack.org. Patches should be made against the `mm-new tree +`_ whenever possible and posted to the +mailing lists. SCM Trees --------- @@ -17,17 +17,19 @@ SCM Trees There are multiple Linux trees for DAMON development. Patches under development or testing are queued in `damon/next `_ by the DAMON maintainer. -Sufficiently reviewed patches will be queued in `mm-unstable -`_ by the memory management -subsystem maintainer. After more sufficient tests, the patches will be queued -in `mm-stable `_, and finally -pull-requested to the mainline by the memory management subsystem maintainer. +Sufficiently reviewed patches will be queued in `mm-new +`_ by the memory management subsystem +maintainer. As more sufficient tests are done, the patches will move to +`mm-unstable `_ and then to +`mm-stable `_. And finally those +will be pull-requested to the mainline by the memory management subsystem +maintainer. -Note again the patches for `mm-unstable tree -`_ are queued by the memory -management subsystem maintainer. If the patches requires some patches in -`damon/next tree `_ which not yet merged -in mm-unstable, please make sure the requirement is clearly specified. +Note again the patches for `mm-new tree +`_ are queued by the memory management +subsystem maintainer. If the patches requires some patches in `damon/next tree +`_ which not yet merged in mm-new, +please make sure the requirement is clearly specified. Submit checklist addendum ------------------------- @@ -53,8 +55,9 @@ Further doing below and putting the results will be helpful. Key cycle dates --------------- -Patches can be sent anytime. Key cycle dates of the `mm-unstable -`_ and `mm-stable +Patches can be sent anytime. Key cycle dates of the `mm-new +`_, `mm-unstable +`_and `mm-stable `_ trees depend on the memory management subsystem maintainer. From f73858d5ef93cb880aca64d826b4e3ca8b55365a Mon Sep 17 00:00:00 2001 From: Muhammad Usama Anjum Date: Mon, 7 Jul 2025 12:33:20 +0500 Subject: [PATCH 221/370] selftests/mm: pagemap_scan ioctl: add PFN ZERO test cases Add test cases to test the correctness of PFN ZERO flag of pagemap_scan ioctl. Test with normal pages backed memory and huge pages backed memory. 
Link: https://lkml.kernel.org/r/20250707073321.106431-1-usama.anjum@collabora.com Signed-off-by: Muhammad Usama Anjum Cc: David Hildenbrand Cc: Muhammad Usama Anjum Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/cow.c | 27 +------- tools/testing/selftests/mm/pagemap_ioctl.c | 72 +++++++++++++++++++++- tools/testing/selftests/mm/vm_util.c | 23 +++++++ tools/testing/selftests/mm/vm_util.h | 2 + 4 files changed, 95 insertions(+), 29 deletions(-) diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c index 29767e2afdd0..788e82b00b75 100644 --- a/tools/testing/selftests/mm/cow.c +++ b/tools/testing/selftests/mm/cow.c @@ -72,31 +72,6 @@ static int detect_thp_sizes(size_t sizes[], int max) return count; } -static void detect_huge_zeropage(void) -{ - int fd = open("/sys/kernel/mm/transparent_hugepage/use_zero_page", - O_RDONLY); - size_t enabled = 0; - char buf[15]; - int ret; - - if (fd < 0) - return; - - ret = pread(fd, buf, sizeof(buf), 0); - if (ret > 0 && ret < sizeof(buf)) { - buf[ret] = 0; - - enabled = strtoul(buf, NULL, 10); - if (enabled == 1) { - has_huge_zeropage = true; - ksft_print_msg("[INFO] huge zeropage is enabled\n"); - } - } - - close(fd); -} - static bool range_is_swapped(void *addr, size_t size) { for (; size; addr += pagesize, size -= pagesize) @@ -1903,7 +1878,7 @@ int main(int argc, char **argv) } nr_hugetlbsizes = detect_hugetlb_page_sizes(hugetlbsizes, ARRAY_SIZE(hugetlbsizes)); - detect_huge_zeropage(); + has_huge_zeropage = detect_huge_zeropage(); ksft_set_plan(ARRAY_SIZE(anon_test_cases) * tests_per_anon_test_case() + ARRAY_SIZE(anon_thp_test_cases) * tests_per_anon_thp_test_case() + diff --git a/tools/testing/selftests/mm/pagemap_ioctl.c b/tools/testing/selftests/mm/pagemap_ioctl.c index b07acc86f4f0..c2dcda78ad31 100644 --- a/tools/testing/selftests/mm/pagemap_ioctl.c +++ b/tools/testing/selftests/mm/pagemap_ioctl.c @@ -1,4 +1,5 @@ // SPDX-License-Identifier: GPL-2.0 + #define _GNU_SOURCE #include #include @@ -34,8 +35,8 @@ #define PAGEMAP "/proc/self/pagemap" int pagemap_fd; int uffd; -unsigned long page_size; -unsigned int hpage_size; +size_t page_size; +size_t hpage_size; const char *progname; #define LEN(region) ((region.end - region.start)/page_size) @@ -1480,6 +1481,68 @@ static void transact_test(int page_size) extra_thread_faults); } +void zeropfn_tests(void) +{ + unsigned long long mem_size; + struct page_region vec; + int i, ret; + char *mmap_mem, *mem; + + /* Test with normal memory */ + mem_size = 10 * page_size; + mem = mmap(NULL, mem_size, PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0); + if (mem == MAP_FAILED) + ksft_exit_fail_msg("error nomem\n"); + + /* Touch each page to ensure it's mapped */ + for (i = 0; i < mem_size; i += page_size) + (void)((volatile char *)mem)[i]; + + ret = pagemap_ioctl(mem, mem_size, &vec, 1, 0, + (mem_size / page_size), PAGE_IS_PFNZERO, 0, 0, PAGE_IS_PFNZERO); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == 1 && LEN(vec) == (mem_size / page_size), + "%s all pages must have PFNZERO set\n", __func__); + + munmap(mem, mem_size); + + /* Test with huge page if user_zero_page is set to 1 */ + if (!detect_huge_zeropage()) { + ksft_test_result_skip("%s use_zero_page not supported or set to 1\n", __func__); + return; + } + + mem_size = 2 * hpage_size; + mmap_mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (mmap_mem == MAP_FAILED) + ksft_exit_fail_msg("error nomem\n"); + + 
/* We need a THP-aligned memory area. */ + mem = (char *)(((uintptr_t)mmap_mem + hpage_size) & ~(hpage_size - 1)); + + ret = madvise(mem, hpage_size, MADV_HUGEPAGE); + if (!ret) { + char tmp = *mem; + + asm volatile("" : "+r" (tmp)); + + ret = pagemap_ioctl(mem, hpage_size, &vec, 1, 0, + 0, PAGE_IS_PFNZERO, 0, 0, PAGE_IS_PFNZERO); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == 1 && LEN(vec) == (hpage_size / page_size), + "%s all huge pages must have PFNZERO set\n", __func__); + } else { + ksft_test_result_skip("%s huge page not supported\n", __func__); + } + + munmap(mmap_mem, mem_size); +} + int main(int __attribute__((unused)) argc, char *argv[]) { int shmid, buf_size, fd, i, ret; @@ -1494,7 +1557,7 @@ int main(int __attribute__((unused)) argc, char *argv[]) if (init_uffd()) ksft_exit_pass(); - ksft_set_plan(115); + ksft_set_plan(117); page_size = getpagesize(); hpage_size = read_pmd_pagesize(); @@ -1669,6 +1732,9 @@ int main(int __attribute__((unused)) argc, char *argv[]) /* 16. Userfaultfd tests */ userfaultfd_tests(); + /* 17. ZEROPFN tests */ + zeropfn_tests(); + close(pagemap_fd); ksft_exit_pass(); } diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c index 1d434772fa54..9dafa7669ef9 100644 --- a/tools/testing/selftests/mm/vm_util.c +++ b/tools/testing/selftests/mm/vm_util.c @@ -532,3 +532,26 @@ void *sys_mremap(void *old_address, unsigned long old_size, old_size, new_size, flags, (unsigned long)new_address); } + +bool detect_huge_zeropage(void) +{ + int fd = open("/sys/kernel/mm/transparent_hugepage/use_zero_page", + O_RDONLY); + bool enabled = 0; + char buf[15]; + int ret; + + if (fd < 0) + return 0; + + ret = pread(fd, buf, sizeof(buf), 0); + if (ret > 0 && ret < sizeof(buf)) { + buf[ret] = 0; + + if (strtoul(buf, NULL, 10) == 1) + enabled = 1; + } + + close(fd); + return enabled; +} diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h index 797c24215b17..2b154c287591 100644 --- a/tools/testing/selftests/mm/vm_util.h +++ b/tools/testing/selftests/mm/vm_util.h @@ -44,6 +44,8 @@ static inline unsigned int pshift(void) return __page_shift; } +bool detect_huge_zeropage(void); + /* * Plan 9 FS has bugs (at least on QEMU) where certain operations fail with * ENOENT on unlinked files. See From 7765794810c2ff6eafbbde30f343f53bbc0f979a Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Mon, 7 Jul 2025 06:57:11 +0000 Subject: [PATCH 222/370] mm/migrate: remove the -EEXIST conversion for move_pages() The -EEXIST conversion is introduced in commit 65462462ffb2 ("mm/gup: follow_pfn_pte(): -EEXIST cleanup"), since follow_page() may call follow_pfn_pte() which may return -EEXIST. But after commit 7dff875c9436 ("mm/migrate: convert add_page_for_migration() from follow_page() to folio_walk"), it use folio_walk instead. This limit the error code and won't return -EEXIST. Remove the error code conversion here. 
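For reference, a minimal user-space sketch of the interface in question; it uses the move_pages(2) wrapper from libnuma (link with -lnuma) and only illustrates the per-page status convention the man page documents:

  #define _GNU_SOURCE
  #include <numaif.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          long page = sysconf(_SC_PAGESIZE);
          void *buf = aligned_alloc(page, page);
          void *pages[1] = { buf };
          int nodes[1] = { 0 };           /* request node 0 */
          int status[1];
          long ret;

          memset(buf, 1, page);           /* make sure the page is populated */
          ret = move_pages(0 /* self */, 1, pages, nodes, status, MPOL_MF_MOVE);
          /*
           * ret < 0 reports a call-wide error; per-page results land in
           * status[]: the node the page ended up on, or a negative errno.
           * -EEXIST is not among the per-page errors the man page lists,
           * which is why the now-removed conversion to -EFAULT existed.
           */
          printf("ret=%ld status[0]=%d\n", ret, status[0]);
          return 0;
  }
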
Link: https://lkml.kernel.org/r/20250707065711.18056-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Acked-by: David Hildenbrand Reviewed-by: Joshua Hahn Reviewed-by: Zi Yan Cc: John Hubbard Cc: Matthew Brost Cc: Rakie Kim Cc: Byungchul Park Cc: Gregory Price Cc: Ying Huang Cc: Alistair Popple Signed-off-by: Andrew Morton --- mm/migrate.c | 7 ------- 1 file changed, 7 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index e4e05a98c84e..36b2764204b6 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2362,13 +2362,6 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes, continue; } - /* - * The move_pages() man page does not have an -EEXIST choice, so - * use -EFAULT instead. - */ - if (err == -EEXIST) - err = -EFAULT; - /* * If the page is already on the target node (!err), store the * node, otherwise, store the err. From e66d7a4f55f44aca39cc74e8c7b4602faf26b4f7 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 2 Jul 2025 12:49:23 +0200 Subject: [PATCH 223/370] mm: convert FPB_IGNORE_* into FPB_RESPECT_* Patch series "mm: folio_pte_batch() improvements", v2. Ever since we added folio_pte_batch() for fork() + munmap() purposes, a lot more users appeared (and more are being proposed), and more functionality was added. Most of the users only need basic functionality, and could benefit from a non-inlined version. So let's clean up folio_pte_batch() and split it into a basic folio_pte_batch() (no flags) and a more advanced folio_pte_batch_ext(). Using either variant will now look much cleaner. This series will likely conflict with some changes in some (old+new) folio_pte_batch() users, but conflicts should be trivial to resolve. This patch (of 4): Respecting these PTE bits is the exception, so let's invert the meaning. With this change, most callers don't have to pass any flags. This is a preparation for splitting folio_pte_batch() into a non-inlined variant that doesn't consume any flags. Long-term, we want folio_pte_batch() to probably ignore most common PTE bits (e.g., write/dirty/young/soft-dirty) that are not relevant for most page table walkers: uffd-wp and protnone might be bits to consider in the future. Only walkers that care about them can opt-in to respect them. No functional change intended. Link: https://lkml.kernel.org/r/20250702104926.212243-2-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Lance Yang Reviewed-by: Zi Yan Reviewed-by: Oscar Salvador Reviewed-by: Dev Jain Cc: Alistair Popple Cc: Byungchul Park Cc: Gregory Price Cc: "Huang, Ying" Cc: Jann Horn Cc: Joshua Hahn Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Mathew Brost Cc: Michal Hocko Cc: Mike Rapoport Cc: Rakie Kim Cc: Rik van Riel Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/internal.h | 16 ++++++++-------- mm/madvise.c | 3 +-- mm/memory.c | 11 +++++------ mm/mempolicy.c | 4 +--- mm/mlock.c | 3 +-- mm/mremap.c | 3 +-- mm/rmap.c | 3 +-- 7 files changed, 18 insertions(+), 25 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 2e235740128a..e530809ef7d2 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -202,17 +202,17 @@ static inline void vma_close(struct vm_area_struct *vma) /* Flags for folio_pte_batch(). */ typedef int __bitwise fpb_t; -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */ -#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0)) +/* Compare PTEs respecting the dirty bit. */ +#define FPB_RESPECT_DIRTY ((__force fpb_t)BIT(0)) -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. 
*/ -#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1)) +/* Compare PTEs respecting the soft-dirty bit. */ +#define FPB_RESPECT_SOFT_DIRTY ((__force fpb_t)BIT(1)) static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) { - if (flags & FPB_IGNORE_DIRTY) + if (!(flags & FPB_RESPECT_DIRTY)) pte = pte_mkclean(pte); - if (likely(flags & FPB_IGNORE_SOFT_DIRTY)) + if (likely(!(flags & FPB_RESPECT_SOFT_DIRTY))) pte = pte_clear_soft_dirty(pte); return pte_wrprotect(pte_mkold(pte)); } @@ -236,8 +236,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) * pages of the same large folio. * * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN, - * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and - * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY). + * the accessed bit, writable bit, dirty bit (unless FPB_RESPECT_DIRTY is set) + * and soft-dirty bit (unless FPB_RESPECT_SOFT_DIRTY is set). * * start_ptep must map any page of the folio. max_nr must be at least one and * must be limited by the caller so scanning cannot exceed a single page table. diff --git a/mm/madvise.c b/mm/madvise.c index a34c2c89a53b..e7f1d4caad81 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -346,10 +346,9 @@ static inline int madvise_folio_pte_batch(unsigned long addr, unsigned long end, pte_t pte, bool *any_young, bool *any_dirty) { - const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY; int max_nr = (end - addr) / PAGE_SIZE; - return folio_pte_batch(folio, addr, ptep, pte, max_nr, fpb_flags, NULL, + return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL, any_young, any_dirty); } diff --git a/mm/memory.c b/mm/memory.c index 9944380e947d..a03f1964db33 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -990,10 +990,10 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma * by keeping the batching logic separate. */ if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) { - if (src_vma->vm_flags & VM_SHARED) - flags |= FPB_IGNORE_DIRTY; - if (!vma_soft_dirty_enabled(src_vma)) - flags |= FPB_IGNORE_SOFT_DIRTY; + if (!(src_vma->vm_flags & VM_SHARED)) + flags |= FPB_RESPECT_DIRTY; + if (vma_soft_dirty_enabled(src_vma)) + flags |= FPB_RESPECT_SOFT_DIRTY; nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags, &any_writable, NULL, NULL); @@ -1535,7 +1535,6 @@ static inline int zap_present_ptes(struct mmu_gather *tlb, struct zap_details *details, int *rss, bool *force_flush, bool *force_break, bool *any_skipped) { - const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY; struct mm_struct *mm = tlb->mm; struct folio *folio; struct page *page; @@ -1565,7 +1564,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb, * by keeping the batching logic separate. 
*/ if (unlikely(folio_test_large(folio) && max_nr != 1)) { - nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags, + nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, 0, NULL, NULL, NULL); zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr, diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 1ff7b2174eb7..2a25eedc3b1c 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -675,7 +675,6 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk) static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { - const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY; struct vm_area_struct *vma = walk->vma; struct folio *folio; struct queue_pages *qp = walk->private; @@ -713,8 +712,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr, continue; if (folio_test_large(folio) && max_nr != 1) nr = folio_pte_batch(folio, addr, pte, ptent, - max_nr, fpb_flags, - NULL, NULL, NULL); + max_nr, 0, NULL, NULL, NULL); /* * vm_normal_folio() filters out zero pages, but there might * still be reserved folios to skip, perhaps in a VDSO. diff --git a/mm/mlock.c b/mm/mlock.c index 3cb72b579ffd..2238cdc5eb1c 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -307,14 +307,13 @@ void munlock_folio(struct folio *folio) static inline unsigned int folio_mlock_step(struct folio *folio, pte_t *pte, unsigned long addr, unsigned long end) { - const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY; unsigned int count = (end - addr) >> PAGE_SHIFT; pte_t ptent = ptep_get(pte); if (!folio_test_large(folio)) return 1; - return folio_pte_batch(folio, addr, pte, ptent, count, fpb_flags, NULL, + return folio_pte_batch(folio, addr, pte, ptent, count, 0, NULL, NULL, NULL); } diff --git a/mm/mremap.c b/mm/mremap.c index 36585041c760..d4d3ffc93150 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -173,7 +173,6 @@ static pte_t move_soft_dirty_pte(pte_t pte) static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t pte, int max_nr) { - const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY; struct folio *folio; if (max_nr == 1) @@ -183,7 +182,7 @@ static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr if (!folio || !folio_test_large(folio)) return 1; - return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL, + return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL, NULL, NULL); } diff --git a/mm/rmap.c b/mm/rmap.c index 4b1a2a33e39f..366e66651c88 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1849,7 +1849,6 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio, struct page_vma_mapped_walk *pvmw, enum ttu_flags flags, pte_t pte) { - const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY; unsigned long end_addr, addr = pvmw->address; struct vm_area_struct *vma = pvmw->vma; unsigned int max_nr; @@ -1869,7 +1868,7 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio, if (pte_unused(pte)) return 1; - return folio_pte_batch(folio, addr, pvmw->pte, pte, max_nr, fpb_flags, + return folio_pte_batch(folio, addr, pvmw->pte, pte, max_nr, 0, NULL, NULL, NULL); } From 233e28e2a76e6ffcbe33ee7813f98536fe0690b5 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 2 Jul 2025 12:49:24 +0200 Subject: [PATCH 224/370] mm: smaller folio_pte_batch() improvements Let's clean up a bit: (1) No need for start_ptep vs. ptep anymore, we can simply use ptep. (2) Let's switch to "unsigned int" for everything. 
Negative values do not make sense. (3) We can simplify the code by leaving the pte unchanged after the pte_same() check. (4) Clarify that we should never exceed a single VMA; it indicates a problem in the caller. No functional change intended. Link: https://lkml.kernel.org/r/20250702104926.212243-3-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Lance Yang Reviewed-by: Lorenzo Stoakes Reviewed-by: Oscar Salvador Reviewed-by: Dev Jain Cc: Alistair Popple Cc: Byungchul Park Cc: Gregory Price Cc: "Huang, Ying" Cc: Jann Horn Cc: Joshua Hahn Cc: Liam Howlett Cc: Mathew Brost Cc: Michal Hocko Cc: Mike Rapoport Cc: Rakie Kim Cc: Rik van Riel Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/internal.h | 37 +++++++++++++++---------------------- 1 file changed, 15 insertions(+), 22 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index e530809ef7d2..40ee7200e510 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -221,7 +221,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) * folio_pte_batch - detect a PTE batch for a large folio * @folio: The large folio to detect a PTE batch for. * @addr: The user virtual address the first page is mapped at. - * @start_ptep: Page table pointer for the first entry. + * @ptep: Page table pointer for the first entry. * @pte: Page table entry for the first page. * @max_nr: The maximum number of table entries to consider. * @flags: Flags to modify the PTE batch semantics. @@ -233,24 +233,24 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) * first one is dirty. * * Detect a PTE batch: consecutive (present) PTEs that map consecutive - * pages of the same large folio. + * pages of the same large folio in a single VMA and a single page table. * * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN, * the accessed bit, writable bit, dirty bit (unless FPB_RESPECT_DIRTY is set) * and soft-dirty bit (unless FPB_RESPECT_SOFT_DIRTY is set). * - * start_ptep must map any page of the folio. max_nr must be at least one and - * must be limited by the caller so scanning cannot exceed a single page table. + * @ptep must map any page of the folio. max_nr must be at least one and + * must be limited by the caller so scanning cannot exceed a single VMA and + * a single page table. * * Return: the number of table entries in the batch. 
*/ -static inline int folio_pte_batch(struct folio *folio, unsigned long addr, - pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags, +static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long addr, + pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags, bool *any_writable, bool *any_young, bool *any_dirty) { - pte_t expected_pte, *ptep; - bool writable, young, dirty; - int nr, cur_nr; + unsigned int nr, cur_nr; + pte_t expected_pte; if (any_writable) *any_writable = false; @@ -267,29 +267,22 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, max_nr = min_t(unsigned long, max_nr, folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte)); - nr = pte_batch_hint(start_ptep, pte); + nr = pte_batch_hint(ptep, pte); expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags); - ptep = start_ptep + nr; + ptep = ptep + nr; while (nr < max_nr) { pte = ptep_get(ptep); - if (any_writable) - writable = !!pte_write(pte); - if (any_young) - young = !!pte_young(pte); - if (any_dirty) - dirty = !!pte_dirty(pte); - pte = __pte_batch_clear_ignored(pte, flags); - if (!pte_same(pte, expected_pte)) + if (!pte_same(__pte_batch_clear_ignored(pte, flags), expected_pte)) break; if (any_writable) - *any_writable |= writable; + *any_writable |= pte_write(pte); if (any_young) - *any_young |= young; + *any_young |= pte_young(pte); if (any_dirty) - *any_dirty |= dirty; + *any_dirty |= pte_dirty(pte); cur_nr = pte_batch_hint(ptep, pte); expected_pte = pte_advance_pfn(expected_pte, cur_nr); From dd80cfd4878bafc74f2a386c51b5398a12ffeb8c Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 2 Jul 2025 12:49:25 +0200 Subject: [PATCH 225/370] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_flags() Many users (including upcoming ones) don't really need the flags etc, and can live with the possible overhead of a function call. So let's provide a basic, non-inlined folio_pte_batch(), to avoid code bloat while still providing a variant that optimizes out all flag checks at runtime. folio_pte_batch_flags() will get inlined into folio_pte_batch(), optimizing out any conditionals that depend on input flags. folio_pte_batch() will behave like folio_pte_batch_flags() when no flags are specified. It's okay to add new users of folio_pte_batch_flags(), but using folio_pte_batch() if applicable is preferred. So, before this change, folio_pte_batch() was inlined into the C file optimized by propagating constants within the resulting object file. With this change, we now also have a folio_pte_batch() that is optimized by propagating all constants. But instead of having one instance per object file, we have a single shared one. In zap_present_ptes(), where we care about performance, the compiler already seem to generate a call to a common inlined folio_pte_batch() variant, shared with fork() code. So calling the new non-inlined variant should not make a difference. While at it, drop the "addr" parameter that is unused. 
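Per that description, the non-inlined variant is essentially a thin wrapper that passes no flags and no output pointers; a sketch of its likely shape in mm/util.c:

  unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
                  unsigned int max_nr)
  {
          /* Basic variant: no flags, no writable/young/dirty reporting. */
          return folio_pte_batch_flags(folio, ptep, pte, max_nr, 0,
                                       NULL, NULL, NULL);
  }
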
Link: https://lkml.kernel.org/r/20250702104926.212243-4-david@redhat.com Signed-off-by: David Hildenbrand Suggested-by: Andrew Morton Link: https://lore.kernel.org/linux-mm/20250503182858.5a02729fcffd6d4723afcfc2@linux-foundation.org/ Reviewed-by: Oscar Salvador Reviewed-by: Zi Yan Reviewed-by: Dev Jain Cc: Alistair Popple Cc: Byungchul Park Cc: Gregory Price Cc: "Huang, Ying" Cc: Jann Horn Cc: Joshua Hahn Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Mathew Brost Cc: Michal Hocko Cc: Mike Rapoport Cc: Rakie Kim Cc: Rik van Riel Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/internal.h | 11 ++++++++--- mm/madvise.c | 4 ++-- mm/memory.c | 8 +++----- mm/mempolicy.c | 3 +-- mm/mlock.c | 3 +-- mm/mremap.c | 3 +-- mm/rmap.c | 3 +-- mm/util.c | 29 +++++++++++++++++++++++++++++ 8 files changed, 46 insertions(+), 18 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 40ee7200e510..c7d18f608c3f 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -218,9 +218,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) } /** - * folio_pte_batch - detect a PTE batch for a large folio + * folio_pte_batch_flags - detect a PTE batch for a large folio * @folio: The large folio to detect a PTE batch for. - * @addr: The user virtual address the first page is mapped at. * @ptep: Page table pointer for the first entry. * @pte: Page table entry for the first page. * @max_nr: The maximum number of table entries to consider. @@ -243,9 +242,12 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) * must be limited by the caller so scanning cannot exceed a single VMA and * a single page table. * + * This function will be inlined to optimize based on the input parameters; + * consider using folio_pte_batch() instead if applicable. + * * Return: the number of table entries in the batch. 
*/ -static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long addr, +static inline unsigned int folio_pte_batch_flags(struct folio *folio, pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags, bool *any_writable, bool *any_young, bool *any_dirty) { @@ -293,6 +295,9 @@ static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long ad return min(nr, max_nr); } +unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte, + unsigned int max_nr); + /** * pte_move_swp_offset - Move the swap entry offset field of a swap pte * forward or backward by delta diff --git a/mm/madvise.c b/mm/madvise.c index e7f1d4caad81..7c4958f694b4 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -348,8 +348,8 @@ static inline int madvise_folio_pte_batch(unsigned long addr, unsigned long end, { int max_nr = (end - addr) / PAGE_SIZE; - return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL, - any_young, any_dirty); + return folio_pte_batch_flags(folio, ptep, pte, max_nr, 0, NULL, + any_young, any_dirty); } static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, diff --git a/mm/memory.c b/mm/memory.c index a03f1964db33..042088340b73 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -995,8 +995,8 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma if (vma_soft_dirty_enabled(src_vma)) flags |= FPB_RESPECT_SOFT_DIRTY; - nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags, - &any_writable, NULL, NULL); + nr = folio_pte_batch_flags(folio, src_pte, pte, max_nr, flags, + &any_writable, NULL, NULL); folio_ref_add(folio, nr); if (folio_test_anon(folio)) { if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page, @@ -1564,9 +1564,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb, * by keeping the batching logic separate. */ if (unlikely(folio_test_large(folio) && max_nr != 1)) { - nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, 0, - NULL, NULL, NULL); - + nr = folio_pte_batch(folio, pte, ptent, max_nr); zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr, addr, details, rss, force_flush, force_break, any_skipped); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 2a25eedc3b1c..eb83cff7db8c 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -711,8 +711,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr, if (!folio || folio_is_zone_device(folio)) continue; if (folio_test_large(folio) && max_nr != 1) - nr = folio_pte_batch(folio, addr, pte, ptent, - max_nr, 0, NULL, NULL, NULL); + nr = folio_pte_batch(folio, pte, ptent, max_nr); /* * vm_normal_folio() filters out zero pages, but there might * still be reserved folios to skip, perhaps in a VDSO. 
diff --git a/mm/mlock.c b/mm/mlock.c index 2238cdc5eb1c..a1d93ad33c6d 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -313,8 +313,7 @@ static inline unsigned int folio_mlock_step(struct folio *folio, if (!folio_test_large(folio)) return 1; - return folio_pte_batch(folio, addr, pte, ptent, count, 0, NULL, - NULL, NULL); + return folio_pte_batch(folio, pte, ptent, count); } static inline bool allow_mlock_munlock(struct folio *folio, diff --git a/mm/mremap.c b/mm/mremap.c index d4d3ffc93150..1f5bebbb9c0c 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -182,8 +182,7 @@ static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr if (!folio || !folio_test_large(folio)) return 1; - return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL, - NULL, NULL); + return folio_pte_batch(folio, ptep, pte, max_nr); } static int move_ptes(struct pagetable_move_control *pmc, diff --git a/mm/rmap.c b/mm/rmap.c index 366e66651c88..4c833b43fef9 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1868,8 +1868,7 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio, if (pte_unused(pte)) return 1; - return folio_pte_batch(folio, addr, pvmw->pte, pte, max_nr, 0, - NULL, NULL, NULL); + return folio_pte_batch(folio, pvmw->pte, pte, max_nr); } /* diff --git a/mm/util.c b/mm/util.c index 20bbfe4ce1b8..f134cefc9062 100644 --- a/mm/util.c +++ b/mm/util.c @@ -1171,3 +1171,32 @@ int compat_vma_mmap_prepare(struct file *file, struct vm_area_struct *vma) return 0; } EXPORT_SYMBOL(compat_vma_mmap_prepare); + +#ifdef CONFIG_MMU +/** + * folio_pte_batch - detect a PTE batch for a large folio + * @folio: The large folio to detect a PTE batch for. + * @ptep: Page table pointer for the first entry. + * @pte: Page table entry for the first page. + * @max_nr: The maximum number of table entries to consider. + * + * This is a simplified variant of folio_pte_batch_flags(). + * + * Detect a PTE batch: consecutive (present) PTEs that map consecutive + * pages of the same large folio in a single VMA and a single page table. + * + * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN, + * the accessed bit, writable bit, dirty bit and soft-dirty bit. + * + * ptep must map any page of the folio. max_nr must be at least one and + * must be limited by the caller so scanning cannot exceed a single VMA and + * a single page table. + * + * Return: the number of table entries in the batch. + */ +unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte, + unsigned int max_nr) +{ + return folio_pte_batch_flags(folio, ptep, pte, max_nr, 0, NULL, NULL, NULL); +} +#endif /* CONFIG_MMU */ From 7ae7e811f0a6817b6deeb4f68eb44be0ec3b8e07 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 2 Jul 2025 12:49:26 +0200 Subject: [PATCH 226/370] mm: remove boolean output parameters from folio_pte_batch_flags() Instead, let's just allow for specifying through flags whether we want to have bits merged into the original PTE. For the madvise() case, simplify by having only a single parameter for merging young+dirty. For madvise_cold_or_pageout_pte_range() merging the dirty bit is not required, but also not harmful. This code is not that performance critical after all to really force all micro-optimizations. As we now have two pte_t * parameters, use PageTable() to make sure we are actually given a pointer to a copy of the PTE, not a pointer into an actual page table.
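A minimal sketch of the resulting calling convention (editorial illustration with assumed local names; the authoritative call sites are in the hunks below):

	pte_t ptent = ptep_get(ptep);	/* a stack COPY of the first entry */

	/* Merge the write bit of the whole batch into ptent. A VM_WARN_ON
	 * using PageTable() catches callers that pass a pointer into an
	 * actual page table instead of a copy. */
	nr = folio_pte_batch_flags(folio, vma, ptep, &ptent, max_nr,
				   FPB_MERGE_WRITE);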
Link: https://lkml.kernel.org/r/20250702104926.212243-5-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Dev Jain Reviewed-by: Oscar Salvador Cc: Alistair Popple Cc: Byungchul Park Cc: Gregory Price Cc: "Huang, Ying" Cc: Jann Horn Cc: Joshua Hahn Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Mathew Brost Cc: Michal Hocko Cc: Mike Rapoport Cc: Rakie Kim Cc: Rik van Riel Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/internal.h | 65 +++++++++++++++++++++++++++++++++------------------ mm/madvise.c | 26 ++++----------------- mm/memory.c | 8 ++----- mm/util.c | 2 +- 4 files changed, 50 insertions(+), 51 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index c7d18f608c3f..91773a0ef305 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -208,6 +208,18 @@ typedef int __bitwise fpb_t; /* Compare PTEs respecting the soft-dirty bit. */ #define FPB_RESPECT_SOFT_DIRTY ((__force fpb_t)BIT(1)) +/* + * Merge PTE write bits: if any PTE in the batch is writable, modify the + * PTE at @ptentp to be writable. + */ +#define FPB_MERGE_WRITE ((__force fpb_t)BIT(2)) + +/* + * Merge PTE young and dirty bits: if any PTE in the batch is young or dirty, + * modify the PTE at @ptentp to be young or dirty, respectively. + */ +#define FPB_MERGE_YOUNG_DIRTY ((__force fpb_t)BIT(3)) + static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) { if (!(flags & FPB_RESPECT_DIRTY)) @@ -220,16 +232,12 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) /** * folio_pte_batch_flags - detect a PTE batch for a large folio * @folio: The large folio to detect a PTE batch for. + * @vma: The VMA. Only relevant with FPB_MERGE_WRITE, otherwise can be NULL. * @ptep: Page table pointer for the first entry. - * @pte: Page table entry for the first page. + * @ptentp: Pointer to a COPY of the first page table entry whose flags this + * function updates based on @flags if appropriate. * @max_nr: The maximum number of table entries to consider. * @flags: Flags to modify the PTE batch semantics. - * @any_writable: Optional pointer to indicate whether any entry except the - * first one is writable. - * @any_young: Optional pointer to indicate whether any entry except the - * first one is young. - * @any_dirty: Optional pointer to indicate whether any entry except the - * first one is dirty. * * Detect a PTE batch: consecutive (present) PTEs that map consecutive * pages of the same large folio in a single VMA and a single page table. @@ -242,28 +250,32 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) * must be limited by the caller so scanning cannot exceed a single VMA and * a single page table. * + * Depending on the FPB_MERGE_* flags, the pte stored at @ptentp will + * be updated: it's crucial that a pointer to a COPY of the first + * page table entry, obtained through ptep_get(), is provided as @ptentp. + * * This function will be inlined to optimize based on the input parameters; * consider using folio_pte_batch() instead if applicable. * * Return: the number of table entries in the batch. 
*/ static inline unsigned int folio_pte_batch_flags(struct folio *folio, - pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags, - bool *any_writable, bool *any_young, bool *any_dirty) + struct vm_area_struct *vma, pte_t *ptep, pte_t *ptentp, + unsigned int max_nr, fpb_t flags) { + bool any_writable = false, any_young = false, any_dirty = false; + pte_t expected_pte, pte = *ptentp; unsigned int nr, cur_nr; - pte_t expected_pte; - - if (any_writable) - *any_writable = false; - if (any_young) - *any_young = false; - if (any_dirty) - *any_dirty = false; VM_WARN_ON_FOLIO(!pte_present(pte), folio); VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio); VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio); + /* + * Ensure this is a pointer to a copy not a pointer into a page table. + * If this is a stack value, it won't be a valid virtual address, but + * that's fine because it also cannot be pointing into the page table. + */ + VM_WARN_ON(virt_addr_valid(ptentp) && PageTable(virt_to_page(ptentp))); /* Limit max_nr to the actual remaining PFNs in the folio we could batch. */ max_nr = min_t(unsigned long, max_nr, @@ -279,12 +291,12 @@ static inline unsigned int folio_pte_batch_flags(struct folio *folio, if (!pte_same(__pte_batch_clear_ignored(pte, flags), expected_pte)) break; - if (any_writable) - *any_writable |= pte_write(pte); - if (any_young) - *any_young |= pte_young(pte); - if (any_dirty) - *any_dirty |= pte_dirty(pte); + if (flags & FPB_MERGE_WRITE) + any_writable |= pte_write(pte); + if (flags & FPB_MERGE_YOUNG_DIRTY) { + any_young |= pte_young(pte); + any_dirty |= pte_dirty(pte); + } cur_nr = pte_batch_hint(ptep, pte); expected_pte = pte_advance_pfn(expected_pte, cur_nr); @@ -292,6 +304,13 @@ static inline unsigned int folio_pte_batch_flags(struct folio *folio, nr += cur_nr; } + if (any_writable) + *ptentp = pte_mkwrite(*ptentp, vma); + if (any_young) + *ptentp = pte_mkyoung(*ptentp); + if (any_dirty) + *ptentp = pte_mkdirty(*ptentp); + return min(nr, max_nr); } diff --git a/mm/madvise.c b/mm/madvise.c index 7c4958f694b4..1c30031ab035 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -343,13 +343,12 @@ static inline bool can_do_file_pageout(struct vm_area_struct *vma) static inline int madvise_folio_pte_batch(unsigned long addr, unsigned long end, struct folio *folio, pte_t *ptep, - pte_t pte, bool *any_young, - bool *any_dirty) + pte_t *ptentp) { int max_nr = (end - addr) / PAGE_SIZE; - return folio_pte_batch_flags(folio, ptep, pte, max_nr, 0, NULL, - any_young, any_dirty); + return folio_pte_batch_flags(folio, NULL, ptep, ptentp, max_nr, + FPB_MERGE_YOUNG_DIRTY); } static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, @@ -487,13 +486,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, * next pte in the range. */ if (folio_test_large(folio)) { - bool any_young; - - nr = madvise_folio_pte_batch(addr, end, folio, pte, - ptent, &any_young, NULL); - if (any_young) - ptent = pte_mkyoung(ptent); - + nr = madvise_folio_pte_batch(addr, end, folio, pte, &ptent); if (nr < folio_nr_pages(folio)) { int err; @@ -723,11 +716,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, * next pte in the range. 
*/ if (folio_test_large(folio)) { - bool any_young, any_dirty; - - nr = madvise_folio_pte_batch(addr, end, folio, pte, - ptent, &any_young, &any_dirty); - + nr = madvise_folio_pte_batch(addr, end, folio, pte, &ptent); if (nr < folio_nr_pages(folio)) { int err; @@ -752,11 +741,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, nr = 0; continue; } - - if (any_young) - ptent = pte_mkyoung(ptent); - if (any_dirty) - ptent = pte_mkdirty(ptent); } if (folio_test_swapcache(folio) || folio_test_dirty(folio)) { diff --git a/mm/memory.c b/mm/memory.c index 042088340b73..4619bf5874af 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -972,10 +972,9 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr, int max_nr, int *rss, struct folio **prealloc) { + fpb_t flags = FPB_MERGE_WRITE; struct page *page; struct folio *folio; - bool any_writable; - fpb_t flags = 0; int err, nr; page = vm_normal_page(src_vma, addr, pte); @@ -995,8 +994,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma if (vma_soft_dirty_enabled(src_vma)) flags |= FPB_RESPECT_SOFT_DIRTY; - nr = folio_pte_batch_flags(folio, src_pte, pte, max_nr, flags, - &any_writable, NULL, NULL); + nr = folio_pte_batch_flags(folio, src_vma, src_pte, &pte, max_nr, flags); folio_ref_add(folio, nr); if (folio_test_anon(folio)) { if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page, @@ -1010,8 +1008,6 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma folio_dup_file_rmap_ptes(folio, page, nr, dst_vma); rss[mm_counter_file(folio)] += nr; } - if (any_writable) - pte = pte_mkwrite(pte, src_vma); __copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte, addr, nr); return nr; diff --git a/mm/util.c b/mm/util.c index f134cefc9062..68ea833ba25f 100644 --- a/mm/util.c +++ b/mm/util.c @@ -1197,6 +1197,6 @@ EXPORT_SYMBOL(compat_vma_mmap_prepare); unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte, unsigned int max_nr) { - return folio_pte_batch_flags(folio, ptep, pte, max_nr, 0, NULL, NULL, NULL); + return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr, 0); } #endif /* CONFIG_MMU */ From fdc5001b002eaca9989b5f3563245662fd1e4d40 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Wed, 4 Jun 2025 12:51:11 +0300 Subject: [PATCH 227/370] mm/vmstat: make MEMCG select VM_EVENT_COUNTERS The vmstat_text array contains labels for counters displayed in /proc/vmstat. It is important to keep the labels in sync with the counters. There is a BUILD_BUG_ON() check in vmstat_start() that ensures the size of the vmstat_text array is not smaller than NR_VMSTAT_ITEMS. This helps to catch cases where a new counter is added but the label is not. However, it does not help if a counter is removed but the label remains. It would be nice to make the BUILD_BUG_ON() check more strict to catch such cases. However, when compiling with MEMCG enabled but VM_EVENT_COUNTERS disabled, the vmstat_text array is larger than NR_VMSTAT_ITEMS. This issue arises because some elements of the vmstat_text array are present when either MEMCG or VM_EVENT_COUNTERS is enabled, but NR_VMSTAT_ITEMS only accounts for these elements if VM_EVENT_COUNTERS is enabled. Instead of adjusting the NR_VMSTAT_ITEMS definition to account for MEMCG, make MEMCG select VM_EVENT_COUNTERS. VM_EVENT_COUNTERS is enabled in most configurations anyway.
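The mismatch, sketched (an editorial simplification of the preprocessor structure changed below):

	const char * const vmstat_text[] = {
		...
	#if defined(CONFIG_VM_EVENT_COUNTERS) || defined(CONFIG_MEMCG)
		"pgpgin",	/* present with MEMCG=y even if VM_EVENT_COUNTERS=n */
		...
	#endif
	};
	/* ...while NR_VMSTAT_ITEMS counts the vm_event_item labels only when
	 * CONFIG_VM_EVENT_COUNTERS=y, so MEMCG=y && VM_EVENT_COUNTERS=n leaves
	 * ARRAY_SIZE(vmstat_text) > NR_VMSTAT_ITEMS, and a strict equality
	 * check would fail to build. */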
Link: https://lkml.kernel.org/r/20250604095111.533783-1-kirill.shutemov@linux.intel.com Fixes: ebc5d83d0443 ("mm/memcontrol: use vmstat names for printing statistics") Signed-off-by: Kirill A. Shutemov Reported-by: Randy Dunlap Acked-by: Vlastimil Babka Tested-by: Randy Dunlap Acked-by: Randy Dunlap Acked-by: Shakeel Butt Cc: Konstantin Khlebnikov Cc: David Hildenbrand Cc: Johannes Weiner Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Mike Rapoport Cc: Muchun Song Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- include/linux/vmstat.h | 4 ++-- init/Kconfig | 1 + mm/vmstat.c | 4 ++-- 3 files changed, 5 insertions(+), 4 deletions(-) diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index b2ccb6845595..c287998908bf 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -507,7 +507,7 @@ static inline const char *lru_list_name(enum lru_list lru) return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_" } -#if defined(CONFIG_VM_EVENT_COUNTERS) || defined(CONFIG_MEMCG) +#if defined(CONFIG_VM_EVENT_COUNTERS) static inline const char *vm_event_name(enum vm_event_item item) { return vmstat_text[NR_VM_ZONE_STAT_ITEMS + @@ -516,7 +516,7 @@ static inline const char *vm_event_name(enum vm_event_item item) NR_VM_STAT_ITEMS + item]; } -#endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */ +#endif /* CONFIG_VM_EVENT_COUNTERS */ #ifdef CONFIG_MEMCG diff --git a/init/Kconfig b/init/Kconfig index 666783eb50ab..e9b88d527aeb 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -992,6 +992,7 @@ config MEMCG select PAGE_COUNTER select EVENTFD select SLAB_OBJ_EXT + select VM_EVENT_COUNTERS help Provides control over the memory footprint of tasks in a cgroup. diff --git a/mm/vmstat.c b/mm/vmstat.c index a78d70ddeacd..01d76216d65a 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1287,7 +1287,7 @@ const char * const vmstat_text[] = { "nr_memmap_pages", "nr_memmap_boot_pages", -#if defined(CONFIG_VM_EVENT_COUNTERS) || defined(CONFIG_MEMCG) +#if defined(CONFIG_VM_EVENT_COUNTERS) /* enum vm_event_item counters */ "pgpgin", "pgpgout", @@ -1475,7 +1475,7 @@ const char * const vmstat_text[] = { "kstack_rest", #endif #endif -#endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */ +#endif /* CONFIG_VM_EVENT_COUNTERS */ }; #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */ From 8a63ff68e5ead16ab384dcbed6103f9ad371e246 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Thu, 29 May 2025 14:05:41 +0300 Subject: [PATCH 228/370] mm: strictly check vmstat_text array size /proc/vmstat displays counters from various sources. It is easy to forget to add or remove a label when a new counter is added or removed. There is a BUILD_BUG_ON() function that catches missing labels. However, for some reason, it ignores extra labels. Let's make the check strict. This would help to catch issues when a counter is removed. Link: https://lkml.kernel.org/r/20250529110541.2960330-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov Reviewed-by: Vlastimil Babka Acked-by: Michal Hocko Reviewed-by: David Hildenbrand Cc: Konstantin Khlebnikov Cc: Kirill A. 
Shutemov Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Mike Rapoport Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- mm/vmstat.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmstat.c b/mm/vmstat.c index 01d76216d65a..41f6a64e5c39 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1867,7 +1867,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos) if (*pos >= NR_VMSTAT_ITEMS) return NULL; - BUILD_BUG_ON(ARRAY_SIZE(vmstat_text) < NR_VMSTAT_ITEMS); + BUILD_BUG_ON(ARRAY_SIZE(vmstat_text) != NR_VMSTAT_ITEMS); fold_vm_numa_events(); v = kmalloc_array(NR_VMSTAT_ITEMS, sizeof(unsigned long), GFP_KERNEL); m->private = v; From ed6a9068a0fc01c27f49f97a84b3b80b0fda2787 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Tue, 3 Jun 2025 11:45:56 +0300 Subject: [PATCH 229/370] mm/vmstat: utilize designated initializers for the vmstat_text array The vmstat_text array defines labels for counters displayed in /proc/vmstat. The current definition of the array implies a specific order of the counters in their enums, making it fragile. To make it clear which counter the label is for, use designated initializers. Link: https://lkml.kernel.org/r/20250603084556.113975-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov Reviewed-by: Christoph Hellwig Acked-by: Vlastimil Babka Acked-by: Michal Hocko Cc: David Hildenbrand Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Mike Rapoport Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- mm/vmstat.c | 439 +++++++++++++++++++++++++++------------------------- 1 file changed, 230 insertions(+), 209 deletions(-) diff --git a/mm/vmstat.c b/mm/vmstat.c index 41f6a64e5c39..c250eeba0f16 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1163,318 +1163,339 @@ int fragmentation_index(struct zone *zone, unsigned int order) #if defined(CONFIG_PROC_FS) || defined(CONFIG_SYSFS) || \ defined(CONFIG_NUMA) || defined(CONFIG_MEMCG) #ifdef CONFIG_ZONE_DMA -#define TEXT_FOR_DMA(xx) xx "_dma", +#define TEXT_FOR_DMA(xx, yy) [xx##_DMA] = yy "_dma", #else -#define TEXT_FOR_DMA(xx) +#define TEXT_FOR_DMA(xx, yy) #endif #ifdef CONFIG_ZONE_DMA32 -#define TEXT_FOR_DMA32(xx) xx "_dma32", +#define TEXT_FOR_DMA32(xx, yy) [xx##_DMA32] = yy "_dma32", #else -#define TEXT_FOR_DMA32(xx) +#define TEXT_FOR_DMA32(xx, yy) #endif #ifdef CONFIG_HIGHMEM -#define TEXT_FOR_HIGHMEM(xx) xx "_high", +#define TEXT_FOR_HIGHMEM(xx, yy) [xx##_HIGH] = yy "_high", #else -#define TEXT_FOR_HIGHMEM(xx) +#define TEXT_FOR_HIGHMEM(xx, yy) #endif #ifdef CONFIG_ZONE_DEVICE -#define TEXT_FOR_DEVICE(xx) xx "_device", +#define TEXT_FOR_DEVICE(xx, yy) [xx##_DEVICE] = yy "_device", #else -#define TEXT_FOR_DEVICE(xx) +#define TEXT_FOR_DEVICE(xx, yy) #endif -#define TEXTS_FOR_ZONES(xx) TEXT_FOR_DMA(xx) TEXT_FOR_DMA32(xx) xx "_normal", \ - TEXT_FOR_HIGHMEM(xx) xx "_movable", \ - TEXT_FOR_DEVICE(xx) +#define TEXTS_FOR_ZONES(xx, yy) \ + TEXT_FOR_DMA(xx, yy) \ + TEXT_FOR_DMA32(xx, yy) \ + [xx##_NORMAL] = yy "_normal", \ + TEXT_FOR_HIGHMEM(xx, yy) \ + [xx##_MOVABLE] = yy "_movable", \ + TEXT_FOR_DEVICE(xx, yy) const char * const vmstat_text[] = { /* enum zone_stat_item counters */ - "nr_free_pages", - "nr_free_pages_blocks", - "nr_zone_inactive_anon", - "nr_zone_active_anon", - "nr_zone_inactive_file", - "nr_zone_active_file", - "nr_zone_unevictable", - "nr_zone_write_pending", - "nr_mlock", +#define I(x) (x) + [I(NR_FREE_PAGES)] = "nr_free_pages", + [I(NR_FREE_PAGES_BLOCKS)] = "nr_free_pages_blocks", + [I(NR_ZONE_INACTIVE_ANON)] = "nr_zone_inactive_anon", + [I(NR_ZONE_ACTIVE_ANON)] =
"nr_zone_active_anon", + [I(NR_ZONE_INACTIVE_FILE)] = "nr_zone_inactive_file", + [I(NR_ZONE_ACTIVE_FILE)] = "nr_zone_active_file", + [I(NR_ZONE_UNEVICTABLE)] = "nr_zone_unevictable", + [I(NR_ZONE_WRITE_PENDING)] = "nr_zone_write_pending", + [I(NR_MLOCK)] = "nr_mlock", #if IS_ENABLED(CONFIG_ZSMALLOC) - "nr_zspages", + [I(NR_ZSPAGES)] = "nr_zspages", #endif - "nr_free_cma", + [I(NR_FREE_CMA_PAGES)] = "nr_free_cma", #ifdef CONFIG_UNACCEPTED_MEMORY - "nr_unaccepted", + [I(NR_UNACCEPTED)] = "nr_unaccepted", #endif +#undef I /* enum numa_stat_item counters */ +#define I(x) (NR_VM_ZONE_STAT_ITEMS + x) #ifdef CONFIG_NUMA - "numa_hit", - "numa_miss", - "numa_foreign", - "numa_interleave", - "numa_local", - "numa_other", + [I(NUMA_HIT)] = "numa_hit", + [I(NUMA_MISS)] = "numa_miss", + [I(NUMA_FOREIGN)] = "numa_foreign", + [I(NUMA_INTERLEAVE_HIT)] = "numa_interleave", + [I(NUMA_LOCAL)] = "numa_local", + [I(NUMA_OTHER)] = "numa_other", #endif +#undef I /* enum node_stat_item counters */ - "nr_inactive_anon", - "nr_active_anon", - "nr_inactive_file", - "nr_active_file", - "nr_unevictable", - "nr_slab_reclaimable", - "nr_slab_unreclaimable", - "nr_isolated_anon", - "nr_isolated_file", - "workingset_nodes", - "workingset_refault_anon", - "workingset_refault_file", - "workingset_activate_anon", - "workingset_activate_file", - "workingset_restore_anon", - "workingset_restore_file", - "workingset_nodereclaim", - "nr_anon_pages", - "nr_mapped", - "nr_file_pages", - "nr_dirty", - "nr_writeback", - "nr_writeback_temp", - "nr_shmem", - "nr_shmem_hugepages", - "nr_shmem_pmdmapped", - "nr_file_hugepages", - "nr_file_pmdmapped", - "nr_anon_transparent_hugepages", - "nr_vmscan_write", - "nr_vmscan_immediate_reclaim", - "nr_dirtied", - "nr_written", - "nr_throttled_written", - "nr_kernel_misc_reclaimable", - "nr_foll_pin_acquired", - "nr_foll_pin_released", - "nr_kernel_stack", +#define I(x) (NR_VM_ZONE_STAT_ITEMS + NR_VM_NUMA_EVENT_ITEMS + x) + [I(NR_INACTIVE_ANON)] = "nr_inactive_anon", + [I(NR_ACTIVE_ANON)] = "nr_active_anon", + [I(NR_INACTIVE_FILE)] = "nr_inactive_file", + [I(NR_ACTIVE_FILE)] = "nr_active_file", + [I(NR_UNEVICTABLE)] = "nr_unevictable", + [I(NR_SLAB_RECLAIMABLE_B)] = "nr_slab_reclaimable", + [I(NR_SLAB_UNRECLAIMABLE_B)] = "nr_slab_unreclaimable", + [I(NR_ISOLATED_ANON)] = "nr_isolated_anon", + [I(NR_ISOLATED_FILE)] = "nr_isolated_file", + [I(WORKINGSET_NODES)] = "workingset_nodes", + [I(WORKINGSET_REFAULT_ANON)] = "workingset_refault_anon", + [I(WORKINGSET_REFAULT_FILE)] = "workingset_refault_file", + [I(WORKINGSET_ACTIVATE_ANON)] = "workingset_activate_anon", + [I(WORKINGSET_ACTIVATE_FILE)] = "workingset_activate_file", + [I(WORKINGSET_RESTORE_ANON)] = "workingset_restore_anon", + [I(WORKINGSET_RESTORE_FILE)] = "workingset_restore_file", + [I(WORKINGSET_NODERECLAIM)] = "workingset_nodereclaim", + [I(NR_ANON_MAPPED)] = "nr_anon_pages", + [I(NR_FILE_MAPPED)] = "nr_mapped", + [I(NR_FILE_PAGES)] = "nr_file_pages", + [I(NR_FILE_DIRTY)] = "nr_dirty", + [I(NR_WRITEBACK)] = "nr_writeback", + [I(NR_WRITEBACK_TEMP)] = "nr_writeback_temp", + [I(NR_SHMEM)] = "nr_shmem", + [I(NR_SHMEM_THPS)] = "nr_shmem_hugepages", + [I(NR_SHMEM_PMDMAPPED)] = "nr_shmem_pmdmapped", + [I(NR_FILE_THPS)] = "nr_file_hugepages", + [I(NR_FILE_PMDMAPPED)] = "nr_file_pmdmapped", + [I(NR_ANON_THPS)] = "nr_anon_transparent_hugepages", + [I(NR_VMSCAN_WRITE)] = "nr_vmscan_write", + [I(NR_VMSCAN_IMMEDIATE)] = "nr_vmscan_immediate_reclaim", + [I(NR_DIRTIED)] = "nr_dirtied", + [I(NR_WRITTEN)] = "nr_written", + [I(NR_THROTTLED_WRITTEN)] = 
"nr_throttled_written", + [I(NR_KERNEL_MISC_RECLAIMABLE)] = "nr_kernel_misc_reclaimable", + [I(NR_FOLL_PIN_ACQUIRED)] = "nr_foll_pin_acquired", + [I(NR_FOLL_PIN_RELEASED)] = "nr_foll_pin_released", + [I(NR_KERNEL_STACK_KB)] = "nr_kernel_stack", #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK) - "nr_shadow_call_stack", + [I(NR_KERNEL_SCS_KB)] = "nr_shadow_call_stack", #endif - "nr_page_table_pages", - "nr_sec_page_table_pages", + [I(NR_PAGETABLE)] = "nr_page_table_pages", + [I(NR_SECONDARY_PAGETABLE)] = "nr_sec_page_table_pages", #ifdef CONFIG_IOMMU_SUPPORT - "nr_iommu_pages", + [I(NR_IOMMU_PAGES)] = "nr_iommu_pages", #endif #ifdef CONFIG_SWAP - "nr_swapcached", + [I(NR_SWAPCACHE)] = "nr_swapcached", #endif #ifdef CONFIG_NUMA_BALANCING - "pgpromote_success", - "pgpromote_candidate", + [I(PGPROMOTE_SUCCESS)] = "pgpromote_success", + [I(PGPROMOTE_CANDIDATE)] = "pgpromote_candidate", #endif - "pgdemote_kswapd", - "pgdemote_direct", - "pgdemote_khugepaged", - "pgdemote_proactive", + [I(PGDEMOTE_KSWAPD)] = "pgdemote_kswapd", + [I(PGDEMOTE_DIRECT)] = "pgdemote_direct", + [I(PGDEMOTE_KHUGEPAGED)] = "pgdemote_khugepaged", + [I(PGDEMOTE_PROACTIVE)] = "pgdemote_proactive", #ifdef CONFIG_HUGETLB_PAGE - "nr_hugetlb", + [I(NR_HUGETLB)] = "nr_hugetlb", #endif - "nr_balloon_pages", + [I(NR_BALLOON_PAGES)] = "nr_balloon_pages", +#undef I + /* system-wide enum vm_stat_item counters */ - "nr_dirty_threshold", - "nr_dirty_background_threshold", - "nr_memmap_pages", - "nr_memmap_boot_pages", +#define I(x) (NR_VM_ZONE_STAT_ITEMS + NR_VM_NUMA_EVENT_ITEMS + \ + NR_VM_NODE_STAT_ITEMS + x) + [I(NR_DIRTY_THRESHOLD)] = "nr_dirty_threshold", + [I(NR_DIRTY_BG_THRESHOLD)] = "nr_dirty_background_threshold", + [I(NR_MEMMAP_PAGES)] = "nr_memmap_pages", + [I(NR_MEMMAP_BOOT_PAGES)] = "nr_memmap_boot_pages", +#undef I #if defined(CONFIG_VM_EVENT_COUNTERS) /* enum vm_event_item counters */ - "pgpgin", - "pgpgout", - "pswpin", - "pswpout", +#define I(x) (NR_VM_ZONE_STAT_ITEMS + NR_VM_NUMA_EVENT_ITEMS + \ + NR_VM_NODE_STAT_ITEMS + NR_VM_STAT_ITEMS + x) - TEXTS_FOR_ZONES("pgalloc") - TEXTS_FOR_ZONES("allocstall") - TEXTS_FOR_ZONES("pgskip") + [I(PGPGIN)] = "pgpgin", + [I(PGPGOUT)] = "pgpgout", + [I(PSWPIN)] = "pswpin", + [I(PSWPOUT)] = "pswpout", - "pgfree", - "pgactivate", - "pgdeactivate", - "pglazyfree", +#define OFF (NR_VM_ZONE_STAT_ITEMS + NR_VM_NUMA_EVENT_ITEMS + \ + NR_VM_NODE_STAT_ITEMS + NR_VM_STAT_ITEMS) + TEXTS_FOR_ZONES(OFF+PGALLOC, "pgalloc") + TEXTS_FOR_ZONES(OFF+ALLOCSTALL, "allocstall") + TEXTS_FOR_ZONES(OFF+PGSCAN_SKIP, "pgskip") +#undef OFF - "pgfault", - "pgmajfault", - "pglazyfreed", + [I(PGFREE)] = "pgfree", + [I(PGACTIVATE)] = "pgactivate", + [I(PGDEACTIVATE)] = "pgdeactivate", + [I(PGLAZYFREE)] = "pglazyfree", - "pgrefill", - "pgreuse", - "pgsteal_kswapd", - "pgsteal_direct", - "pgsteal_khugepaged", - "pgsteal_proactive", - "pgscan_kswapd", - "pgscan_direct", - "pgscan_khugepaged", - "pgscan_proactive", - "pgscan_direct_throttle", - "pgscan_anon", - "pgscan_file", - "pgsteal_anon", - "pgsteal_file", + [I(PGFAULT)] = "pgfault", + [I(PGMAJFAULT)] = "pgmajfault", + [I(PGLAZYFREED)] = "pglazyfreed", + + [I(PGREFILL)] = "pgrefill", + [I(PGREUSE)] = "pgreuse", + [I(PGSTEAL_KSWAPD)] = "pgsteal_kswapd", + [I(PGSTEAL_DIRECT)] = "pgsteal_direct", + [I(PGSTEAL_KHUGEPAGED)] = "pgsteal_khugepaged", + [I(PGSTEAL_PROACTIVE)] = "pgsteal_proactive", + [I(PGSCAN_KSWAPD)] = "pgscan_kswapd", + [I(PGSCAN_DIRECT)] = "pgscan_direct", + [I(PGSCAN_KHUGEPAGED)] = "pgscan_khugepaged", + [I(PGSCAN_PROACTIVE)] = "pgscan_proactive", + 
[I(PGSCAN_DIRECT_THROTTLE)] = "pgscan_direct_throttle", + [I(PGSCAN_ANON)] = "pgscan_anon", + [I(PGSCAN_FILE)] = "pgscan_file", + [I(PGSTEAL_ANON)] = "pgsteal_anon", + [I(PGSTEAL_FILE)] = "pgsteal_file", #ifdef CONFIG_NUMA - "zone_reclaim_success", - "zone_reclaim_failed", + [I(PGSCAN_ZONE_RECLAIM_SUCCESS)] = "zone_reclaim_success", + [I(PGSCAN_ZONE_RECLAIM_FAILED)] = "zone_reclaim_failed", #endif - "pginodesteal", - "slabs_scanned", - "kswapd_inodesteal", - "kswapd_low_wmark_hit_quickly", - "kswapd_high_wmark_hit_quickly", - "pageoutrun", + [I(PGINODESTEAL)] = "pginodesteal", + [I(SLABS_SCANNED)] = "slabs_scanned", + [I(KSWAPD_INODESTEAL)] = "kswapd_inodesteal", + [I(KSWAPD_LOW_WMARK_HIT_QUICKLY)] = "kswapd_low_wmark_hit_quickly", + [I(KSWAPD_HIGH_WMARK_HIT_QUICKLY)] = "kswapd_high_wmark_hit_quickly", + [I(PAGEOUTRUN)] = "pageoutrun", - "pgrotated", + [I(PGROTATED)] = "pgrotated", - "drop_pagecache", - "drop_slab", - "oom_kill", + [I(DROP_PAGECACHE)] = "drop_pagecache", + [I(DROP_SLAB)] = "drop_slab", + [I(OOM_KILL)] = "oom_kill", #ifdef CONFIG_NUMA_BALANCING - "numa_pte_updates", - "numa_huge_pte_updates", - "numa_hint_faults", - "numa_hint_faults_local", - "numa_pages_migrated", + [I(NUMA_PTE_UPDATES)] = "numa_pte_updates", + [I(NUMA_HUGE_PTE_UPDATES)] = "numa_huge_pte_updates", + [I(NUMA_HINT_FAULTS)] = "numa_hint_faults", + [I(NUMA_HINT_FAULTS_LOCAL)] = "numa_hint_faults_local", + [I(NUMA_PAGE_MIGRATE)] = "numa_pages_migrated", #endif #ifdef CONFIG_MIGRATION - "pgmigrate_success", - "pgmigrate_fail", - "thp_migration_success", - "thp_migration_fail", - "thp_migration_split", + [I(PGMIGRATE_SUCCESS)] = "pgmigrate_success", + [I(PGMIGRATE_FAIL)] = "pgmigrate_fail", + [I(THP_MIGRATION_SUCCESS)] = "thp_migration_success", + [I(THP_MIGRATION_FAIL)] = "thp_migration_fail", + [I(THP_MIGRATION_SPLIT)] = "thp_migration_split", #endif #ifdef CONFIG_COMPACTION - "compact_migrate_scanned", - "compact_free_scanned", - "compact_isolated", - "compact_stall", - "compact_fail", - "compact_success", - "compact_daemon_wake", - "compact_daemon_migrate_scanned", - "compact_daemon_free_scanned", + [I(COMPACTMIGRATE_SCANNED)] = "compact_migrate_scanned", + [I(COMPACTFREE_SCANNED)] = "compact_free_scanned", + [I(COMPACTISOLATED)] = "compact_isolated", + [I(COMPACTSTALL)] = "compact_stall", + [I(COMPACTFAIL)] = "compact_fail", + [I(COMPACTSUCCESS)] = "compact_success", + [I(KCOMPACTD_WAKE)] = "compact_daemon_wake", + [I(KCOMPACTD_MIGRATE_SCANNED)] = "compact_daemon_migrate_scanned", + [I(KCOMPACTD_FREE_SCANNED)] = "compact_daemon_free_scanned", #endif #ifdef CONFIG_HUGETLB_PAGE - "htlb_buddy_alloc_success", - "htlb_buddy_alloc_fail", + [I(HTLB_BUDDY_PGALLOC)] = "htlb_buddy_alloc_success", + [I(HTLB_BUDDY_PGALLOC_FAIL)] = "htlb_buddy_alloc_fail", #endif #ifdef CONFIG_CMA - "cma_alloc_success", - "cma_alloc_fail", + [I(CMA_ALLOC_SUCCESS)] = "cma_alloc_success", + [I(CMA_ALLOC_FAIL)] = "cma_alloc_fail", #endif - "unevictable_pgs_culled", - "unevictable_pgs_scanned", - "unevictable_pgs_rescued", - "unevictable_pgs_mlocked", - "unevictable_pgs_munlocked", - "unevictable_pgs_cleared", - "unevictable_pgs_stranded", + [I(UNEVICTABLE_PGCULLED)] = "unevictable_pgs_culled", + [I(UNEVICTABLE_PGSCANNED)] = "unevictable_pgs_scanned", + [I(UNEVICTABLE_PGRESCUED)] = "unevictable_pgs_rescued", + [I(UNEVICTABLE_PGMLOCKED)] = "unevictable_pgs_mlocked", + [I(UNEVICTABLE_PGMUNLOCKED)] = "unevictable_pgs_munlocked", + [I(UNEVICTABLE_PGCLEARED)] = "unevictable_pgs_cleared", + [I(UNEVICTABLE_PGSTRANDED)] = 
"unevictable_pgs_stranded", #ifdef CONFIG_TRANSPARENT_HUGEPAGE - "thp_fault_alloc", - "thp_fault_fallback", - "thp_fault_fallback_charge", - "thp_collapse_alloc", - "thp_collapse_alloc_failed", - "thp_file_alloc", - "thp_file_fallback", - "thp_file_fallback_charge", - "thp_file_mapped", - "thp_split_page", - "thp_split_page_failed", - "thp_deferred_split_page", - "thp_underused_split_page", - "thp_split_pmd", - "thp_scan_exceed_none_pte", - "thp_scan_exceed_swap_pte", - "thp_scan_exceed_share_pte", + [I(THP_FAULT_ALLOC)] = "thp_fault_alloc", + [I(THP_FAULT_FALLBACK)] = "thp_fault_fallback", + [I(THP_FAULT_FALLBACK_CHARGE)] = "thp_fault_fallback_charge", + [I(THP_COLLAPSE_ALLOC)] = "thp_collapse_alloc", + [I(THP_COLLAPSE_ALLOC_FAILED)] = "thp_collapse_alloc_failed", + [I(THP_FILE_ALLOC)] = "thp_file_alloc", + [I(THP_FILE_FALLBACK)] = "thp_file_fallback", + [I(THP_FILE_FALLBACK_CHARGE)] = "thp_file_fallback_charge", + [I(THP_FILE_MAPPED)] = "thp_file_mapped", + [I(THP_SPLIT_PAGE)] = "thp_split_page", + [I(THP_SPLIT_PAGE_FAILED)] = "thp_split_page_failed", + [I(THP_DEFERRED_SPLIT_PAGE)] = "thp_deferred_split_page", + [I(THP_UNDERUSED_SPLIT_PAGE)] = "thp_underused_split_page", + [I(THP_SPLIT_PMD)] = "thp_split_pmd", + [I(THP_SCAN_EXCEED_NONE_PTE)] = "thp_scan_exceed_none_pte", + [I(THP_SCAN_EXCEED_SWAP_PTE)] = "thp_scan_exceed_swap_pte", + [I(THP_SCAN_EXCEED_SHARED_PTE)] = "thp_scan_exceed_share_pte", #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD - "thp_split_pud", + [I(THP_SPLIT_PUD)] = "thp_split_pud", #endif - "thp_zero_page_alloc", - "thp_zero_page_alloc_failed", - "thp_swpout", - "thp_swpout_fallback", + [I(THP_ZERO_PAGE_ALLOC)] = "thp_zero_page_alloc", + [I(THP_ZERO_PAGE_ALLOC_FAILED)] = "thp_zero_page_alloc_failed", + [I(THP_SWPOUT)] = "thp_swpout", + [I(THP_SWPOUT_FALLBACK)] = "thp_swpout_fallback", #endif #ifdef CONFIG_MEMORY_BALLOON - "balloon_inflate", - "balloon_deflate", + [I(BALLOON_INFLATE)] = "balloon_inflate", + [I(BALLOON_DEFLATE)] = "balloon_deflate", #ifdef CONFIG_BALLOON_COMPACTION - "balloon_migrate", + [I(BALLOON_MIGRATE)] = "balloon_migrate", #endif #endif /* CONFIG_MEMORY_BALLOON */ #ifdef CONFIG_DEBUG_TLBFLUSH - "nr_tlb_remote_flush", - "nr_tlb_remote_flush_received", - "nr_tlb_local_flush_all", - "nr_tlb_local_flush_one", + [I(NR_TLB_REMOTE_FLUSH)] = "nr_tlb_remote_flush", + [I(NR_TLB_REMOTE_FLUSH_RECEIVED)] = "nr_tlb_remote_flush_received", + [I(NR_TLB_LOCAL_FLUSH_ALL)] = "nr_tlb_local_flush_all", + [I(NR_TLB_LOCAL_FLUSH_ONE)] = "nr_tlb_local_flush_one", #endif /* CONFIG_DEBUG_TLBFLUSH */ #ifdef CONFIG_SWAP - "swap_ra", - "swap_ra_hit", - "swpin_zero", - "swpout_zero", + [I(SWAP_RA)] = "swap_ra", + [I(SWAP_RA_HIT)] = "swap_ra_hit", + [I(SWPIN_ZERO)] = "swpin_zero", + [I(SWPOUT_ZERO)] = "swpout_zero", #ifdef CONFIG_KSM - "ksm_swpin_copy", + [I(KSM_SWPIN_COPY)] = "ksm_swpin_copy", #endif #endif #ifdef CONFIG_KSM - "cow_ksm", + [I(COW_KSM)] = "cow_ksm", #endif #ifdef CONFIG_ZSWAP - "zswpin", - "zswpout", - "zswpwb", + [I(ZSWPIN)] = "zswpin", + [I(ZSWPOUT)] = "zswpout", + [I(ZSWPWB)] = "zswpwb", #endif #ifdef CONFIG_X86 - "direct_map_level2_splits", - "direct_map_level3_splits", - "direct_map_level2_collapses", - "direct_map_level3_collapses", + [I(DIRECT_MAP_LEVEL2_SPLIT)] = "direct_map_level2_splits", + [I(DIRECT_MAP_LEVEL3_SPLIT)] = "direct_map_level3_splits", + [I(DIRECT_MAP_LEVEL2_COLLAPSE)] = "direct_map_level2_collapses", + [I(DIRECT_MAP_LEVEL3_COLLAPSE)] = "direct_map_level3_collapses", #endif #ifdef CONFIG_PER_VMA_LOCK_STATS - "vma_lock_success", - 
"vma_lock_abort", - "vma_lock_retry", - "vma_lock_miss", + [I(VMA_LOCK_SUCCESS)] = "vma_lock_success", + [I(VMA_LOCK_ABORT)] = "vma_lock_abort", + [I(VMA_LOCK_RETRY)] = "vma_lock_retry", + [I(VMA_LOCK_MISS)] = "vma_lock_miss", #endif #ifdef CONFIG_DEBUG_STACK_USAGE - "kstack_1k", + [I(KSTACK_1K)] = "kstack_1k", #if THREAD_SIZE > 1024 - "kstack_2k", + [I(KSTACK_2K)] = "kstack_2k", #endif #if THREAD_SIZE > 2048 - "kstack_4k", + [I(KSTACK_4K)] = "kstack_4k", #endif #if THREAD_SIZE > 4096 - "kstack_8k", + [I(KSTACK_8K)] = "kstack_8k", #endif #if THREAD_SIZE > 8192 - "kstack_16k", + [I(KSTACK_16K)] = "kstack_16k", #endif #if THREAD_SIZE > 16384 - "kstack_32k", + [I(KSTACK_32K)] = "kstack_32k", #endif #if THREAD_SIZE > 32768 - "kstack_64k", + [I(KSTACK_64K)] = "kstack_64k", #endif #if THREAD_SIZE > 65536 - "kstack_rest", + [I(KSTACK_REST)] = "kstack_rest", #endif #endif +#undef I #endif /* CONFIG_VM_EVENT_COUNTERS */ }; #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */ From 8356a5a3b078ca89c526dd6d71e9a76fec571c37 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Wed, 25 Jun 2025 17:51:52 +0200 Subject: [PATCH 230/370] mm, vmstat: remove the NR_WRITEBACK_TEMP node_stat_item counter The only user of the counter (FUSE) was removed in commit 0c58a97f919c ("fuse: remove tmp folio for writebacks and internal rb tree") so follow the established pattern of removing the counter and hardcoding 0 in meminfo output, as done recently with NR_BOUNCE. Update documentation for procfs, including for the value for Bounce that was missed when removing its counter. Also remove the mention of NR_WRITEBACK_TEMP implications from a comment in wb_position_ratio(). The rest of the comment there about fuse setting bdi->max_ratio to 1% is still correct. [vbabka@suse.cz: v2] Link: https://lkml.kernel.org/r/5a848e15-6a57-4ecb-a015-d4f358b8a5d3@suse.cz Link: https://lkml.kernel.org/r/20250625-nr_writeback_removal-v1-1-7f2a0df70faa@suse.cz Signed-off-by: Vlastimil Babka Cc: Brendan Jackman Cc: Danilo Krummrich Cc: Greg Kroah-Hartman Cc: Jan Kara Cc: Jeff Layton Cc: Jeffle Xu Cc: Jens Axboe Cc: Joanne Koong Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Kirill A. Shuemov Cc: Matthew Wilcox (Oracle) Cc: Maxim Patlasov Cc: Michal Hocko Cc: Miklos Szeredi Cc: Suren Baghdasaryan Cc: Tejun Heo Cc: Zach O'Keefe Cc: Zi Yan Signed-off-by: Andrew Morton --- Documentation/filesystems/proc.rst | 8 +++++--- drivers/base/node.c | 2 +- fs/proc/meminfo.c | 3 +-- include/linux/mmzone.h | 1 - mm/page-writeback.c | 4 +--- mm/show_mem.c | 2 -- mm/vmstat.c | 1 - 7 files changed, 8 insertions(+), 13 deletions(-) diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 5236cb52e357..2971551b7235 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -1196,12 +1196,14 @@ SecPageTables Memory consumed by secondary page tables, this currently includes KVM mmu and IOMMU allocations on x86 and arm64. NFS_Unstable - Always zero. Previous counted pages which had been written to + Always zero. Previously counted pages which had been written to the server, but has not been committed to stable storage. Bounce - Memory used for block device "bounce buffers" + Always zero. Previously memory used for block device + "bounce buffers". WritebackTmp - Memory used by FUSE for temporary writeback buffers + Always zero. Previously memory used by FUSE for temporary + writeback buffers. 
CommitLimit Based on the overcommit ratio ('vm.overcommit_ratio'), this is the total amount of memory currently available to diff --git a/drivers/base/node.c b/drivers/base/node.c index 6d66382dae65..e434cb260e61 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -500,7 +500,7 @@ static ssize_t node_read_meminfo(struct device *dev, nid, K(node_page_state(pgdat, NR_SECONDARY_PAGETABLE)), nid, 0UL, nid, 0UL, - nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), + nid, 0UL, nid, K(sreclaimable + node_page_state(pgdat, NR_KERNEL_MISC_RECLAIMABLE)), nid, K(sreclaimable + sunreclaimable), diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index bc2bc60c36cc..a458f1e112fd 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -121,8 +121,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v) show_val_kb(m, "NFS_Unstable: ", 0); show_val_kb(m, "Bounce: ", 0); - show_val_kb(m, "WritebackTmp: ", - global_node_page_state(NR_WRITEBACK_TEMP)); + show_val_kb(m, "WritebackTmp: ", 0); show_val_kb(m, "CommitLimit: ", vm_commit_limit()); show_val_kb(m, "Committed_AS: ", committed); seq_printf(m, "VmallocTotal: %8lu kB\n", diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 1d1bb2b7f40d..0c5da9141983 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -206,7 +206,6 @@ enum node_stat_item { NR_FILE_PAGES, NR_FILE_DIRTY, NR_WRITEBACK, - NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */ NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */ NR_SHMEM_THPS, NR_SHMEM_PMDMAPPED, diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 72b0ff0d4bae..3e248d1c3969 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1101,9 +1101,7 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc) * such filesystems balance_dirty_pages always checks wb counters * against wb limits. Even if global "nr_dirty" is under "freerun". * This is especially important for fuse which sets bdi->max_ratio to - * 1% by default. Without strictlimit feature, fuse writeback may - * consume arbitrary amount of RAM because it is accounted in - * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty". + * 1% by default. * * Here, in wb_position_ratio(), we calculate pos_ratio based on * two values: wb_dirty and wb_thresh. 
Let's consider an example: diff --git a/mm/show_mem.c b/mm/show_mem.c index 0cf8bf5d832d..41999e94a56d 100644 --- a/mm/show_mem.c +++ b/mm/show_mem.c @@ -246,7 +246,6 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z " shmem_pmdmapped:%lukB" " anon_thp:%lukB" #endif - " writeback_tmp:%lukB" " kernel_stack:%lukB" #ifdef CONFIG_SHADOW_CALL_STACK " shadow_call_stack:%lukB" @@ -273,7 +272,6 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)), K(node_page_state(pgdat, NR_ANON_THPS)), #endif - K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), node_page_state(pgdat, NR_KERNEL_STACK_KB), #ifdef CONFIG_SHADOW_CALL_STACK node_page_state(pgdat, NR_KERNEL_SCS_KB), diff --git a/mm/vmstat.c b/mm/vmstat.c index c250eeba0f16..71cd1ceba191 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1251,7 +1251,6 @@ const char * const vmstat_text[] = { [I(NR_FILE_PAGES)] = "nr_file_pages", [I(NR_FILE_DIRTY)] = "nr_dirty", [I(NR_WRITEBACK)] = "nr_writeback", - [I(NR_WRITEBACK_TEMP)] = "nr_writeback_temp", [I(NR_SHMEM)] = "nr_shmem", [I(NR_SHMEM_THPS)] = "nr_shmem_hugepages", [I(NR_SHMEM_PMDMAPPED)] = "nr_shmem_pmdmapped", From 793020545cea0c9e2a79de6ad5c9746ec4f5bd7e Mon Sep 17 00:00:00 2001 From: Honggyu Kim Date: Mon, 7 Jul 2025 11:45:47 +0900 Subject: [PATCH 231/370] samples/damon: change enable parameters to enabled The damon_{lru_sort,reclaim,stat} kernel modules use "enabled" parameter knobs as follows. /sys/module/damon_lru_sort/parameters/enabled /sys/module/damon_reclaim/parameters/enabled /sys/module/damon_stat/parameters/enabled However, other sample modules of damon use "enable" parameter knobs, so it'd be better to rename them from "enable" to "enabled" to keep consistency with the other damon modules. Before: /sys/module/damon_sample_wsse/parameters/enable /sys/module/damon_sample_prcl/parameters/enable /sys/module/damon_sample_mtier/parameters/enable After: /sys/module/damon_sample_wsse/parameters/enabled /sys/module/damon_sample_prcl/parameters/enabled /sys/module/damon_sample_mtier/parameters/enabled There are no functional changes in this patch.
Link: https://lkml.kernel.org/r/20250707024548.1964-1-honggyu.kim@sk.com Signed-off-by: Honggyu Kim Reviewed-by: SeongJae Park Signed-off-by: Andrew Morton --- samples/damon/mtier.c | 22 +++++++++++----------- samples/damon/prcl.c | 22 +++++++++++----------- samples/damon/wsse.c | 22 +++++++++++----------- 3 files changed, 33 insertions(+), 33 deletions(-) diff --git a/samples/damon/mtier.c b/samples/damon/mtier.c index 537bc6eae9f7..42e14df16f44 100644 --- a/samples/damon/mtier.c +++ b/samples/damon/mtier.c @@ -38,14 +38,14 @@ module_param(node0_mem_free_bp, ulong, 0600); static int damon_sample_mtier_enable_store( const char *val, const struct kernel_param *kp); -static const struct kernel_param_ops enable_param_ops = { +static const struct kernel_param_ops enabled_param_ops = { .set = damon_sample_mtier_enable_store, .get = param_get_bool, }; -static bool enable __read_mostly; -module_param_cb(enable, &enable_param_ops, &enable, 0600); -MODULE_PARM_DESC(enable, "Enable of disable DAMON_SAMPLE_MTIER"); +static bool enabled __read_mostly; +module_param_cb(enabled, &enabled_param_ops, &enabled, 0600); +MODULE_PARM_DESC(enabled, "Enable or disable DAMON_SAMPLE_MTIER"); static struct damon_ctx *ctxs[2]; @@ -167,20 +167,20 @@ static bool init_called; static int damon_sample_mtier_enable_store( const char *val, const struct kernel_param *kp) { - bool enabled = enable; + bool is_enabled = enabled; int err; - err = kstrtobool(val, &enable); + err = kstrtobool(val, &enabled); if (err) return err; - if (enable == enabled) + if (enabled == is_enabled) return 0; - if (enable) { + if (enabled) { err = damon_sample_mtier_start(); if (err) - enable = false; + enabled = false; return err; } damon_sample_mtier_stop(); @@ -192,10 +192,10 @@ static int __init damon_sample_mtier_init(void) int err = 0; init_called = true; - if (enable) { + if (enabled) { err = damon_sample_mtier_start(); if (err) - enable = false; + enabled = false; } return 0; } diff --git a/samples/damon/prcl.c b/samples/damon/prcl.c index 4b0e47718c62..8a312dba7691 100644 --- a/samples/damon/prcl.c +++ b/samples/damon/prcl.c @@ -22,14 +22,14 @@ module_param(target_pid, int, 0600); static int damon_sample_prcl_enable_store( const char *val, const struct kernel_param *kp); -static const struct kernel_param_ops enable_param_ops = { +static const struct kernel_param_ops enabled_param_ops = { .set = damon_sample_prcl_enable_store, .get = param_get_bool, }; -static bool enable __read_mostly; -module_param_cb(enable, &enable_param_ops, &enable, 0600); -MODULE_PARM_DESC(enable, "Enable of disable DAMON_SAMPLE_WSSE"); +static bool enabled __read_mostly; +module_param_cb(enabled, &enabled_param_ops, &enabled, 0600); +MODULE_PARM_DESC(enabled, "Enable or disable DAMON_SAMPLE_PRCL"); static struct damon_ctx *ctx; static struct pid *target_pidp; @@ -119,20 +119,20 @@ static bool init_called; static int damon_sample_prcl_enable_store( const char *val, const struct kernel_param *kp) { - bool enabled = enable; + bool is_enabled = enabled; int err; - err = kstrtobool(val, &enable); + err = kstrtobool(val, &enabled); if (err) return err; - if (enable == enabled) + if (enabled == is_enabled) return 0; - if (enable) { + if (enabled) { err = damon_sample_prcl_start(); if (err) - enable = false; + enabled = false; return err; } damon_sample_prcl_stop(); @@ -144,10 +144,10 @@ static int __init damon_sample_prcl_init(void) int err = 0; init_called = true; - if (enable) { + if (enabled) { err = damon_sample_prcl_start(); if (err) - enable = false; + enabled = 
false; } return 0; } diff --git a/samples/damon/wsse.c b/samples/damon/wsse.c index 89fc76f31d5e..d87b3b0801d2 100644 --- a/samples/damon/wsse.c +++ b/samples/damon/wsse.c @@ -23,14 +23,14 @@ module_param(target_pid, int, 0600); static int damon_sample_wsse_enable_store( const char *val, const struct kernel_param *kp); -static const struct kernel_param_ops enable_param_ops = { +static const struct kernel_param_ops enabled_param_ops = { .set = damon_sample_wsse_enable_store, .get = param_get_bool, }; -static bool enable __read_mostly; -module_param_cb(enable, &enable_param_ops, &enable, 0600); -MODULE_PARM_DESC(enable, "Enable or disable DAMON_SAMPLE_WSSE"); +static bool enabled __read_mostly; +module_param_cb(enabled, &enabled_param_ops, &enabled, 0600); +MODULE_PARM_DESC(enabled, "Enable or disable DAMON_SAMPLE_WSSE"); static struct damon_ctx *ctx; static struct pid *target_pidp; @@ -99,20 +99,20 @@ static bool init_called; static int damon_sample_wsse_enable_store( const char *val, const struct kernel_param *kp) { - bool enabled = enable; + bool is_enabled = enabled; int err; - err = kstrtobool(val, &enable); + err = kstrtobool(val, &enabled); if (err) return err; - if (enable == enabled) + if (enabled == is_enabled) return 0; - if (enable) { + if (enabled) { err = damon_sample_wsse_start(); if (err) - enable = false; + enabled = false; return err; } damon_sample_wsse_stop(); @@ -124,10 +124,10 @@ static int __init damon_sample_wsse_init(void) int err = 0; init_called = true; - if (enable) { + if (enabled) { err = damon_sample_wsse_start(); if (err) - enable = false; + enabled = false; } return err; } From 749cc6533b662166fb387486e408e581a87b2a34 Mon Sep 17 00:00:00 2001 From: Yunjeong Mun Date: Tue, 8 Jul 2025 08:59:18 +0900 Subject: [PATCH 232/370] samples/damon: support automatic node address detection This patch adds a new knob `detect_node_addresses`, which determines whether the physical address range is set manually using the existing knobs or automatically by the mtier module. When `detect_node_addresses` is set to 'Y', mtier automatically converts node0 and node1 to their physical addresses. If set to 'N', it uses the existing 'node#_start_addr' and 'node#_end_addr' to define regions as before.
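As a worked example (hypothetical layout, assuming 4 KiB pages so that PFN_PHYS(pfn) is pfn << 12): if node 1 spans PFNs 0x400000 through 0x800000, then nid_to_phys(1, &range) in the hunk below yields range.start == 0x400000000 and range.end == 0x800000000, i.e. the 16 GiB to 32 GiB physical window, with no manual node1_start_addr/node1_end_addr settings required.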
Link: https://lkml.kernel.org/r/20250707235919.513-1-yunjeong.mun@sk.com Signed-off-by: Yunjeong Mun Suggested-by: Honggyu Kim Reviewed-by: SeongJae Park Signed-off-by: Andrew Morton --- samples/damon/mtier.c | 37 ++++++++++++++++++++++++++++++++++--- 1 file changed, 34 insertions(+), 3 deletions(-) diff --git a/samples/damon/mtier.c b/samples/damon/mtier.c index 42e14df16f44..7ebd352138e4 100644 --- a/samples/damon/mtier.c +++ b/samples/damon/mtier.c @@ -47,8 +47,29 @@ static bool enabled __read_mostly; module_param_cb(enabled, &enabled_param_ops, &enabled, 0600); MODULE_PARM_DESC(enabled, "Enable or disable DAMON_SAMPLE_MTIER"); +static bool detect_node_addresses __read_mostly; +module_param(detect_node_addresses, bool, 0600); + static struct damon_ctx *ctxs[2]; +struct region_range { + phys_addr_t start; + phys_addr_t end; +}; + +static int nid_to_phys(int target_node, struct region_range *range) +{ + if (!node_online(target_node)) { + pr_err("NUMA node %d is not online\n", target_node); + return -EINVAL; + } + + range->start = PFN_PHYS(node_start_pfn(target_node)); + range->end = PFN_PHYS(node_end_pfn(target_node)); + + return 0; +} + static struct damon_ctx *damon_sample_mtier_build_ctx(bool promote) { struct damon_ctx *ctx; @@ -58,6 +79,8 @@ static struct damon_ctx *damon_sample_mtier_build_ctx(bool promote) struct damos *scheme; struct damos_quota_goal *quota_goal; struct damos_filter *filter; + struct region_range addr; + int ret; ctx = damon_new_ctx(); if (!ctx) @@ -87,9 +110,17 @@ static struct damon_ctx *damon_sample_mtier_build_ctx(bool promote) if (!target) goto free_out; damon_add_target(ctx, target); - region = damon_new_region( - promote ? node1_start_addr : node0_start_addr, - promote ? node1_end_addr : node0_end_addr); + + if (detect_node_addresses) { + ret = promote ? nid_to_phys(1, &addr) : nid_to_phys(0, &addr); + if (ret) + goto free_out; + } else { + addr.start = promote ? node1_start_addr : node0_start_addr; + addr.end = promote ? node1_end_addr : node0_end_addr; + } + + region = damon_new_region(addr.start, addr.end); if (!region) goto free_out; damon_add_region(region, target); From 579bd5006fe7f4a7abb32da0160d376476cab67d Mon Sep 17 00:00:00 2001 From: Bijan Tabatabai Date: Tue, 8 Jul 2025 19:47:29 -0500 Subject: [PATCH 233/370] mm/damon/core: commit damos->target_nid When committing new scheme parameters from the sysfs, the target_nid field of the damos struct would not be copied. This would cause the target_nid field to retain its original value, despite being updated in the sysfs interface. This patch fixes this issue by copying target_nid in damos_commit().
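The one-line fix below also illustrates the wider bug class: commit helpers that copy parameters field by field silently keep a stale value for any field they forget (abridged excerpt of the fixed helper):

	static int damos_commit(struct damos *dst, struct damos *src)
	{
		...
		dst->wmarks = src->wmarks;
		dst->target_nid = src->target_nid;	/* previously missing */
		...
	}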
Link: https://lkml.kernel.org/r/20250709004729.17252-1-bijan311@gmail.com Fixes: 83dc7bbaecae ("mm/damon/sysfs: use damon_commit_ctx()") Signed-off-by: Bijan Tabatabai Reviewed-by: SeongJae Park Cc: Jonathan Corbet Cc: Ravi Shankar Jonnalagadda Cc: Signed-off-by: Andrew Morton --- mm/damon/core.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/damon/core.c b/mm/damon/core.c index c66583869e95..04e01e08253a 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -978,6 +978,7 @@ static int damos_commit(struct damos *dst, struct damos *src) return err; dst->wmarks = src->wmarks; + dst->target_nid = src->target_nid; err = damos_commit_filters(dst, src); return err; From a2c24eae5a15f79673eba2913d87d658a04830cf Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 8 Jul 2025 19:59:31 -0500 Subject: [PATCH 234/370] mm/damon: add struct damos_migrate_dests Patch series "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions", v4. A recent patchset automatically sets the interleave weight for each node according to the node's maximum bandwidth [1]. In another thread, the patch set's author, Joshua Hahn, wondered if/how these weights should be changed if the bandwidth utilization of the system changes [2]. This patch set adds the mechanism for dynamically changing how application data is interleaved across nodes while leaving the policy of what the interleave weights should be to userspace. It does this by having the migrate_{hot,cold} operating schemes interleave application data according to the list of migration nodes and weights passed in via the DAMON sysfs interface. This functionality can be used to dynamically adjust how folios are interleaved by having a userspace process adjust those weights. If no specific destination nodes or weights are provided, the migrate_{hot,cold} actions will only migrate folios to damos->target_nid as before. The algorithm used to interleave the folios is similar to the one used for the weighted interleave mempolicy [3]. It uses the offset from which a folio is mapped into a VMA to determine the node the folio should be placed in. This method is convenient because for a given set of interleave weights, a folio has only one valid node it can be placed in, limiting the amount of unnecessary data movement. However, finding out how a folio is mapped inside of a VMA requires a costly rmap walk when using a paddr scheme. As such, we have decided that this functionality makes more sense as a vaddr scheme [4]. To this end, this patch set also adds vaddr versions of the migrate_{hot,cold} actions. Motivation ========== There have been prior discussions about how changing the interleave weights in response to the system's bandwidth utilization can be beneficial [2]. However, currently the interleave weights are only applied when data is allocated. Migrating already allocated pages according to the dynamically changing weights will better help balance the bandwidth utilization across nodes. As a toy example, imagine some application that uses 75% of the local bandwidth. Assuming sufficient capacity, when running alone, we want to keep that application's data in local memory. However, if a second instance of that application begins, using the same amount of bandwidth, it would be best to interleave the data of both processes to alleviate the bandwidth pressure from the local node. Likewise, when one of the processes ends, the data should be moved back to local memory.
We imagine there would be a userspace application that would monitor
system performance characteristics, such as bandwidth utilization or
memory access latency, and use that information to tune the interleave
weights.  Others seem to have come to a similar conclusion in previous
discussions [5].  We are currently working on a userspace program that
does this, but it is not quite ready to be published yet.

After the userspace application tunes the interleave weights, there must
be some mechanism that actually migrates pages to be consistent with
those weights.  This patchset is what provides this mechanism.

We believe DAMON is the correct venue for the interleaving mechanism for
a few reasons.  First, we noticed that we don't have to migrate all of
the application's pages to improve performance.  We just need to migrate
the frequently accessed pages.  DAMON's existing hotness tracking is very
useful for this.  Second, DAMON's quota system can be used to ensure we
are not using too much bandwidth for migrations.  Finally, as Ying
pointed out [6], a complete solution must also handle when a memory node
is at capacity.  The existing migrate_cold action can be used in
conjunction with the functionality added in this patch set to provide
that complete solution.

Functionality Test
==================

Below is an example of this new functionality in use to confirm that
these patches behave as intended.

In this example, the user starts an application, alloc_data, which
allocates 1GB using the default memory policy (i.e. allocate to local
memory) then sleeps.  Afterwards, we start DAMON to interleave the data
at a 1:1 ratio.  Using numastat, we show that DAMON has migrated the
application's data to match the new interleave ratio.

For this example, I modified the userspace damo tool [8] to write to the
migration_dest sysfs files.  I plan to upstream these changes when these
patches are merged.

$ # Allocate the data initially
$ ./alloc_data 1G &
[1] 6587
$ numastat -c -p alloc_data

Per-node process memory usage (in MBs) for PID 6587 (alloc_data)
         Node 0 Node 1 Total
         ------ ------ -----
Huge          0      0     0
Heap          0      0     0
Stack         0      0     0
Private    1027      0  1027
-------  ------ ------ -----
Total      1027      0  1027

$ # Start DAMON to interleave data at a 1:1 ratio
$ cat ./interleave_vaddr.yaml
kdamonds:
- contexts:
  - ops: vaddr
    addr_unit: null
    targets:
    - pid: 6587
      regions: []
    intervals:
      sample_us: 500 ms
      aggr_us: 5 s
      ops_update_us: 20 s
      intervals_goal:
        access_bp: 0 %
        aggrs: '0'
        min_sample_us: 0 ns
        max_sample_us: 0 ns
    nr_regions:
      min: '20'
      max: '50'
    schemes:
    - action: migrate_hot
      dests:
      - nid: 0
        weight: 1
      - nid: 1
        weight: 1
      access_pattern:
        sz_bytes:
          min: 0 B
          max: max
        nr_accesses:
          min: 0 %
          max: 100 %
        age:
          min: 0 ns
          max: max

$ sudo ./damo/damo interleave_vaddr.yaml

$ # Verify that DAMON has migrated data to match the 1:1 ratio
$ numastat -c -p alloc_data

Per-node process memory usage (in MBs) for PID 6587 (alloc_data)
         Node 0 Node 1 Total
         ------ ------ -----
Huge          0      0     0
Heap          0      0     0
Stack         0      0     0
Private     514    514  1027
-------  ------ ------ -----
Total       514    514  1027

Performance Test
================

Below is a simple example showing that interleaving application data
using these patches can improve application performance.

To do this, we run a bandwidth-intensive embedding reduction application
[7].  This workload is useful for this test because it reports the time
it takes each iteration to run, and each iteration reuses the same
allocation, allowing us to see the benefits of the migration.
We evaluate this on a 128 core/256 thread AMD CPU with 72 GB/s of local
DDR bandwidth and 26 GB/s of CXL bandwidth.  Before we start the
workload, the system bandwidth utilization is low, so we start with
interleave weights of 1:0, i.e. allocating all data to local memory.
When the workload begins, it saturates the local bandwidth, making the
page placement suboptimal.  To alleviate this, we modify the interleave
weights, triggering DAMON to migrate the workload's data.

We use the same interleave_vaddr.yaml file to set up DAMON, except we
configure it to begin with a 1:0 interleave ratio, and attach it to the
shell and its child processes.

$ sudo ./damo/damo start interleave_vaddr.yaml --include_child_tasks &
$ /eval_baseline -d amazon_All -c 255 -r 100
Eval Phase 3: Running Baseline...

REPEAT # 0 Baseline Total time : 7323.54 ms
REPEAT # 1 Baseline Total time : 7624.56 ms
REPEAT # 2 Baseline Total time : 7619.61 ms
REPEAT # 3 Baseline Total time : 7617.12 ms
REPEAT # 4 Baseline Total time : 7638.64 ms
REPEAT # 5 Baseline Total time : 7611.27 ms
REPEAT # 6 Baseline Total time : 7629.32 ms
REPEAT # 7 Baseline Total time : 7695.63 ms

# Interleave weights set to 3:1

REPEAT # 8 Baseline Total time : 7077.5 ms
REPEAT # 9 Baseline Total time : 5633.23 ms
REPEAT # 10 Baseline Total time : 5644.6 ms
REPEAT # 11 Baseline Total time : 5627.66 ms
REPEAT # 12 Baseline Total time : 5629.76 ms
REPEAT # 13 Baseline Total time : 5633.05 ms
REPEAT # 14 Baseline Total time : 5641.24 ms
REPEAT # 15 Baseline Total time : 5631.18 ms
REPEAT # 16 Baseline Total time : 5631.33 ms

Updating the interleave weights and having DAMON migrate the workload
data according to the weights resulted in an approximately 25% speedup.

Patches Sequence
================

Patches 1-7 extend the DAMON API to specify multiple destination nodes
and weights for the migrate_{hot,cold} actions.  These patches are from
SJ's RFC [8].

Patches 8-10 add a vaddr implementation of the migrate_{hot,cold}
schemes.

Patch 11 modifies the vaddr migrate_{hot,cold} schemes to interleave data
according to the weights provided by damos->migrate_dests.

Patches 12-13 allow the vaddr migrate_{hot,cold} implementation to filter
out folios like the paddr version.

This patch (of 13):

Introduce a new struct, namely damos_migrate_dests, for specifying
multiple DAMOS' migration destination nodes and their weights.
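For illustration only (not part of this patch), a hypothetical DAMON API
caller could fill the struct like below to request a 3:1 interleave
between nodes 0 and 1.  The arrays are caller-allocated; a later patch in
this series frees them in damon_destroy_scheme():

	/* Illustrative sketch; example_set_dests() is a hypothetical caller. */
	static int example_set_dests(struct damos_migrate_dests *dests)
	{
		dests->node_id_arr = kmalloc_array(2,
				sizeof(*dests->node_id_arr), GFP_KERNEL);
		dests->weight_arr = kmalloc_array(2,
				sizeof(*dests->weight_arr), GFP_KERNEL);
		if (!dests->node_id_arr || !dests->weight_arr)
			return -ENOMEM;	/* partial allocs freed with the scheme */
		dests->node_id_arr[0] = 0;	/* migrate to node 0 ... */
		dests->weight_arr[0] = 3;	/* ... three quarters of the time */
		dests->node_id_arr[1] = 1;	/* and to node 1 ... */
		dests->weight_arr[1] = 1;	/* ... one quarter of the time */
		dests->nr_dests = 2;
		return 0;
	}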
Link: https://lkml.kernel.org/r/20250709005952.17776-1-bijan311@gmail.com Link: https://lkml.kernel.org/r/20250709005952.17776-2-bijan311@gmail.com Link: https://lore.kernel.org/linux-mm/20250520141236.2987309-1-joshua.hahnjy@gmail.com/ [1] Link: https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/ [2] Link: https://elixir.bootlin.com/linux/v6.15.4/source/mm/mempolicy.c#L2015 [3] Link: https://lore.kernel.org/damon/20250624223310.55786-1-sj@kernel.org/ [4] Link: https://lore.kernel.org/linux-mm/20250314151137.892379-1-joshua.hahnjy@gmail.com/ [5] Link: https://lore.kernel.org/linux-mm/87frjfx6u4.fsf@DESKTOP-5N7EMDA/ [6] Link: https://github.com/SNU-ARC/MERCI [7] Link: https://lore.kernel.org/damon/20250702051558.54138-1-sj@kernel.org/ [8] Signed-off-by: SeongJae Park Signed-off-by: Bijan Tabatabai Reviewed-by: SeongJae Park Cc: Jonathan Corbet Cc: Ravi Shankar Jonnalagadda Signed-off-by: Andrew Morton --- include/linux/damon.h | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/include/linux/damon.h b/include/linux/damon.h index e1fea3119538..07cee590ff09 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -447,6 +447,22 @@ struct damos_access_pattern { unsigned int max_age_region; }; +/** + * struct damos_migrate_dests - Migration destination nodes and their weights. + * @node_id_arr: Array of migration destination node ids. + * @weight_arr: Array of migration weights for @node_id_arr. + * @nr_dests: Length of the @node_id_arr and @weight_arr arrays. + * + * @node_id_arr is an array of the ids of migration destination nodes. + * @weight_arr is an array of the weights for those. The weights in + * @weight_arr are for nodes in @node_id_arr of same array index. + */ +struct damos_migrate_dests { + unsigned int *node_id_arr; + unsigned int *weight_arr; + size_t nr_dests; +}; + /** * struct damos - Represents a Data Access Monitoring-based Operation Scheme. * @pattern: Access pattern of target regions. From aabc85ee33c883243f2c506a5d88963f2456faa6 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 8 Jul 2025 19:59:32 -0500 Subject: [PATCH 235/370] mm/damon/core: add damos->migrate_dests field Add a new field to 'struct damos', namely migrate_dests, to allow DAMON API callers specify multiple migration destination nodes and their weights. Also update 'struct damos' creation and destruction functions accordingly to initialize the new field and free up the API caller-allocated buffers on those, respectively. Link: https://lkml.kernel.org/r/20250709005952.17776-3-bijan311@gmail.com Signed-off-by: SeongJae Park Signed-off-by: Bijan Tabatabai Reviewed-by: SeongJae Park Cc: Jonathan Corbet Cc: Ravi Shankar Jonnalagadda Signed-off-by: Andrew Morton --- include/linux/damon.h | 13 ++++++++++--- mm/damon/core.c | 4 ++++ 2 files changed, 14 insertions(+), 3 deletions(-) diff --git a/include/linux/damon.h b/include/linux/damon.h index 07cee590ff09..1f425d830bb9 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -470,6 +470,7 @@ struct damos_migrate_dests { * @apply_interval_us: The time between applying the @action. * @quota: Control the aggressiveness of this scheme. * @wmarks: Watermarks for automated (in)activation of this scheme. + * @migrate_dests: Destination nodes if @action is "migrate_{hot,cold}". * @target_nid: Destination node if @action is "migrate_{hot,cold}". * @filters: Additional set of &struct damos_filter for &action. * @ops_filters: ops layer handling &struct damos_filter objects list. 
@@ -488,9 +489,12 @@ struct damos_migrate_dests {
  * monitoring context are inactive, DAMON stops monitoring either, and just
  * repeatedly checks the watermarks.
  *
+ * @migrate_dests specifies multiple migration target nodes with different
+ * weights for migrate_hot or migrate_cold actions.  @target_nid is ignored if
+ * this is set.
+ *
  * @target_nid is used to set the migration target node for migrate_hot or
- * migrate_cold actions, which means it's only meaningful when @action is either
- * "migrate_hot" or "migrate_cold".
+ * migrate_cold actions, and @migrate_dests is unset.
  *
  * Before applying the &action to a memory region, &struct damon_operations
  * implementation could check pages of the region and skip &action to respect
@@ -533,7 +537,10 @@ struct damos {
 	struct damos_quota quota;
 	struct damos_watermarks wmarks;
 	union {
-		int target_nid;
+		struct {
+			int target_nid;
+			struct damos_migrate_dests migrate_dests;
+		};
 	};
 	struct list_head filters;
 	struct list_head ops_filters;
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 04e01e08253a..6c8170d4f695 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -407,6 +407,7 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
 	scheme->wmarks = *wmarks;
 	scheme->wmarks.activated = true;
 
+	scheme->migrate_dests = (struct damos_migrate_dests){};
 	scheme->target_nid = target_nid;
 
 	return scheme;
@@ -449,6 +450,9 @@ void damon_destroy_scheme(struct damos *s)
 
 	damos_for_each_filter_safe(f, next, s)
 		damos_destroy_filter(f);
+
+	kfree(s->migrate_dests.node_id_arr);
+	kfree(s->migrate_dests.weight_arr);
 	damon_del_scheme(s);
 	damon_free_scheme(s);
 }

From 2cd0bf85a203e21f7bc93628441f136fc2d0a2b8 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Tue, 8 Jul 2025 19:59:33 -0500
Subject: [PATCH 236/370] mm/damon/sysfs-schemes: implement DAMOS action
 destinations directory

DAMOS_MIGRATE_{HOT,COLD} can have multiple action destinations and their
weights.  Implement a sysfs directory named 'dests' under each scheme
directory to let DAMON sysfs ABI users utilize the feature.

The interface is similar to other multi-entry directories such as
kdamonds or filters.  The directory initially contains only the nr_dests
file.  Writing the number of desired destinations to nr_dests creates
that many directories.  Each of the created directories has two files
named id and weight.  Users can then write the destination's identifier
(node id in case of DAMOS_MIGRATE_*) and weight to the files.
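For illustration, the intended userspace workflow would look roughly like
the following sketch (assuming kdamond/context/scheme index 0; the full
paths follow the ABI document added later in this series):

	$ cd /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0
	$ echo 2 > dests/nr_dests	# creates dests/0 and dests/1
	$ echo 0 > dests/0/id		# first destination: node 0 ...
	$ echo 3 > dests/0/weight	# ... with weight 3
	$ echo 1 > dests/1/id		# second destination: node 1 ...
	$ echo 1 > dests/1/weight	# ... with weight 1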
Link: https://lkml.kernel.org/r/20250709005952.17776-4-bijan311@gmail.com Signed-off-by: SeongJae Park Cc: Bijan Tabatabai Cc: Jonathan Corbet Cc: Ravi Shankar Jonnalagadda Signed-off-by: Andrew Morton --- mm/damon/sysfs-schemes.c | 225 ++++++++++++++++++++++++++++++++++++++- 1 file changed, 224 insertions(+), 1 deletion(-) diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c index 601360e9b521..b9434cdaacdc 100644 --- a/mm/damon/sysfs-schemes.c +++ b/mm/damon/sysfs-schemes.c @@ -1655,6 +1655,204 @@ static const struct kobj_type damon_sysfs_access_pattern_ktype = { .default_groups = damon_sysfs_access_pattern_groups, }; +/* + * dest (action destination) directory + */ + +struct damos_sysfs_dest { + struct kobject kobj; + unsigned int id; + unsigned int weight; +}; + +static struct damos_sysfs_dest *damos_sysfs_dest_alloc(void) +{ + return kzalloc(sizeof(struct damos_sysfs_dest), GFP_KERNEL); +} + +static ssize_t id_show( + struct kobject *kobj, struct kobj_attribute *attr, char *buf) +{ + struct damos_sysfs_dest *dest = container_of(kobj, + struct damos_sysfs_dest, kobj); + + return sysfs_emit(buf, "%u\n", dest->id); +} + +static ssize_t id_store(struct kobject *kobj, + struct kobj_attribute *attr, const char *buf, size_t count) +{ + struct damos_sysfs_dest *dest = container_of(kobj, + struct damos_sysfs_dest, kobj); + int err = kstrtouint(buf, 0, &dest->id); + + return err ? err : count; +} + +static ssize_t weight_show( + struct kobject *kobj, struct kobj_attribute *attr, char *buf) +{ + struct damos_sysfs_dest *dest = container_of(kobj, + struct damos_sysfs_dest, kobj); + + return sysfs_emit(buf, "%u\n", dest->weight); +} + +static ssize_t weight_store(struct kobject *kobj, + struct kobj_attribute *attr, const char *buf, size_t count) +{ + struct damos_sysfs_dest *dest = container_of(kobj, + struct damos_sysfs_dest, kobj); + int err = kstrtouint(buf, 0, &dest->weight); + + return err ? 
err : count; +} + +static void damos_sysfs_dest_release(struct kobject *kobj) +{ + struct damos_sysfs_dest *dest = container_of(kobj, + struct damos_sysfs_dest, kobj); + kfree(dest); +} + +static struct kobj_attribute damos_sysfs_dest_id_attr = + __ATTR_RW_MODE(id, 0600); + +static struct kobj_attribute damos_sysfs_dest_weight_attr = + __ATTR_RW_MODE(weight, 0600); + +static struct attribute *damos_sysfs_dest_attrs[] = { + &damos_sysfs_dest_id_attr.attr, + &damos_sysfs_dest_weight_attr.attr, + NULL, +}; +ATTRIBUTE_GROUPS(damos_sysfs_dest); + +static const struct kobj_type damos_sysfs_dest_ktype = { + .release = damos_sysfs_dest_release, + .sysfs_ops = &kobj_sysfs_ops, + .default_groups = damos_sysfs_dest_groups, +}; + +/* + * dests (action destinations) directory + */ + +struct damos_sysfs_dests { + struct kobject kobj; + struct damos_sysfs_dest **dests_arr; + int nr; +}; + +static struct damos_sysfs_dests * +damos_sysfs_dests_alloc(void) +{ + return kzalloc(sizeof(struct damos_sysfs_dests), GFP_KERNEL); +} + +static void damos_sysfs_dests_rm_dirs( + struct damos_sysfs_dests *dests) +{ + struct damos_sysfs_dest **dests_arr = dests->dests_arr; + int i; + + for (i = 0; i < dests->nr; i++) + kobject_put(&dests_arr[i]->kobj); + dests->nr = 0; + kfree(dests_arr); + dests->dests_arr = NULL; +} + +static int damos_sysfs_dests_add_dirs( + struct damos_sysfs_dests *dests, int nr_dests) +{ + struct damos_sysfs_dest **dests_arr, *dest; + int err, i; + + damos_sysfs_dests_rm_dirs(dests); + if (!nr_dests) + return 0; + + dests_arr = kmalloc_array(nr_dests, sizeof(*dests_arr), + GFP_KERNEL | __GFP_NOWARN); + if (!dests_arr) + return -ENOMEM; + dests->dests_arr = dests_arr; + + for (i = 0; i < nr_dests; i++) { + dest = damos_sysfs_dest_alloc(); + if (!dest) { + damos_sysfs_dests_rm_dirs(dests); + return -ENOMEM; + } + + err = kobject_init_and_add(&dest->kobj, + &damos_sysfs_dest_ktype, + &dests->kobj, "%d", i); + if (err) { + kobject_put(&dest->kobj); + damos_sysfs_dests_rm_dirs(dests); + return err; + } + + dests_arr[i] = dest; + dests->nr++; + } + return 0; +} + +static ssize_t nr_dests_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct damos_sysfs_dests *dests = container_of(kobj, + struct damos_sysfs_dests, kobj); + + return sysfs_emit(buf, "%d\n", dests->nr); +} + +static ssize_t nr_dests_store(struct kobject *kobj, + struct kobj_attribute *attr, const char *buf, size_t count) +{ + struct damos_sysfs_dests *dests; + int nr, err = kstrtoint(buf, 0, &nr); + + if (err) + return err; + if (nr < 0) + return -EINVAL; + + dests = container_of(kobj, struct damos_sysfs_dests, kobj); + + if (!mutex_trylock(&damon_sysfs_lock)) + return -EBUSY; + err = damos_sysfs_dests_add_dirs(dests, nr); + mutex_unlock(&damon_sysfs_lock); + if (err) + return err; + + return count; +} + +static void damos_sysfs_dests_release(struct kobject *kobj) +{ + kfree(container_of(kobj, struct damos_sysfs_dests, kobj)); +} + +static struct kobj_attribute damos_sysfs_dests_nr_attr = + __ATTR_RW_MODE(nr_dests, 0600); + +static struct attribute *damos_sysfs_dests_attrs[] = { + &damos_sysfs_dests_nr_attr.attr, + NULL, +}; +ATTRIBUTE_GROUPS(damos_sysfs_dests); + +static const struct kobj_type damos_sysfs_dests_ktype = { + .release = damos_sysfs_dests_release, + .sysfs_ops = &kobj_sysfs_ops, + .default_groups = damos_sysfs_dests_groups, +}; + /* * scheme directory */ @@ -1672,6 +1870,7 @@ struct damon_sysfs_scheme { struct damon_sysfs_stats *stats; struct damon_sysfs_scheme_regions *tried_regions; int target_nid; 
+ struct damos_sysfs_dests *dests; }; struct damos_sysfs_action_name { @@ -1762,6 +1961,22 @@ static int damon_sysfs_scheme_set_access_pattern( return err; } +static int damos_sysfs_set_dests(struct damon_sysfs_scheme *scheme) +{ + struct damos_sysfs_dests *dests = damos_sysfs_dests_alloc(); + int err; + + if (!dests) + return -ENOMEM; + err = kobject_init_and_add(&dests->kobj, &damos_sysfs_dests_ktype, + &scheme->kobj, "dests"); + if (err) + kobject_put(&dests->kobj); + else + scheme->dests = dests; + return err; +} + static int damon_sysfs_scheme_set_quotas(struct damon_sysfs_scheme *scheme) { struct damon_sysfs_quotas *quotas = damon_sysfs_quotas_alloc(); @@ -1894,9 +2109,12 @@ static int damon_sysfs_scheme_add_dirs(struct damon_sysfs_scheme *scheme) err = damon_sysfs_scheme_set_access_pattern(scheme); if (err) return err; - err = damon_sysfs_scheme_set_quotas(scheme); + err = damos_sysfs_set_dests(scheme); if (err) goto put_access_pattern_out; + err = damon_sysfs_scheme_set_quotas(scheme); + if (err) + goto put_dests_out; err = damon_sysfs_scheme_set_watermarks(scheme); if (err) goto put_quotas_access_pattern_out; @@ -1927,6 +2145,9 @@ static int damon_sysfs_scheme_add_dirs(struct damon_sysfs_scheme *scheme) put_quotas_access_pattern_out: kobject_put(&scheme->quotas->kobj); scheme->quotas = NULL; +put_dests_out: + kobject_put(&scheme->dests->kobj); + scheme->dests = NULL; put_access_pattern_out: kobject_put(&scheme->access_pattern->kobj); scheme->access_pattern = NULL; @@ -1937,6 +2158,8 @@ static void damon_sysfs_scheme_rm_dirs(struct damon_sysfs_scheme *scheme) { damon_sysfs_access_pattern_rm_dirs(scheme->access_pattern); kobject_put(&scheme->access_pattern->kobj); + kobject_put(&scheme->dests->kobj); + damos_sysfs_dests_rm_dirs(scheme->dests); damon_sysfs_quotas_rm_dirs(scheme->quotas); kobject_put(&scheme->quotas->kobj); kobject_put(&scheme->watermarks->kobj); From 9106d467533db0f042f2ddbf304620fa3fd54b1d Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 8 Jul 2025 19:59:34 -0500 Subject: [PATCH 237/370] mm/damon/sysfs-schemes: set damos->migrate_dests Pass user-specified multiple DAMOS action destinations and their weights to DAMON core API, so that user requests can really work. 
Link: https://lkml.kernel.org/r/20250709005952.17776-5-bijan311@gmail.com Signed-off-by: SeongJae Park Signed-off-by: Bijan Tabatabai Reviewed-by: SeongJae Park Cc: Jonathan Corbet Cc: Ravi Shankar Jonnalagadda Signed-off-by: Andrew Morton --- mm/damon/sysfs-schemes.c | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c index b9434cdaacdc..74056bcd6a2c 100644 --- a/mm/damon/sysfs-schemes.c +++ b/mm/damon/sysfs-schemes.c @@ -2576,6 +2576,29 @@ void damos_sysfs_update_effective_quotas( } } +static int damos_sysfs_add_migrate_dest(struct damos *scheme, + struct damos_sysfs_dests *sysfs_dests) +{ + struct damos_migrate_dests *dests = &scheme->migrate_dests; + int i; + + dests->node_id_arr = kmalloc_array(sysfs_dests->nr, + sizeof(*dests->node_id_arr), GFP_KERNEL); + if (!dests->node_id_arr) + return -ENOMEM; + dests->weight_arr = kmalloc_array(sysfs_dests->nr, + sizeof(*dests->weight_arr), GFP_KERNEL); + if (!dests->weight_arr) + /* ->node_id_arr will be freed by scheme destruction */ + return -ENOMEM; + for (i = 0; i < sysfs_dests->nr; i++) { + dests->node_id_arr[i] = sysfs_dests->dests_arr[i]->id; + dests->weight_arr[i] = sysfs_dests->dests_arr[i]->weight; + } + dests->nr_dests = sysfs_dests->nr; + return 0; +} + static struct damos *damon_sysfs_mk_scheme( struct damon_sysfs_scheme *sysfs_scheme) { @@ -2638,6 +2661,11 @@ static struct damos *damon_sysfs_mk_scheme( damon_destroy_scheme(scheme); return NULL; } + err = damos_sysfs_add_migrate_dest(scheme, sysfs_scheme->dests); + if (err) { + damon_destroy_scheme(scheme); + return NULL; + } return scheme; } From b9dfe8af511d901d6e6cac2ca999e390d7bf2a41 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 8 Jul 2025 19:59:35 -0500 Subject: [PATCH 238/370] Docs/ABI/damon: document schemes dests directory Document the new DAMOS action destinations sysfs directories on ABI doc. Link: https://lkml.kernel.org/r/20250709005952.17776-6-bijan311@gmail.com Signed-off-by: SeongJae Park Cc: Bijan Tabatabai Cc: Jonathan Corbet Cc: Ravi Shankar Jonnalagadda Signed-off-by: Andrew Morton --- .../ABI/testing/sysfs-kernel-mm-damon | 22 +++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon index 5697ab154c1f..e98974dfac7a 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-damon +++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon @@ -431,6 +431,28 @@ Description: Directory for DAMON operations set layer-handled DAMOS filters. /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//filters directory. +What: /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//dests/nr_dests +Date: Jul 2025 +Contact: SeongJae Park +Description: Writing a number 'N' to this file creates the number of + directories for setting action destinations of the scheme named + '0' to 'N-1' under the dests/ directory. + +What: /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//dests//id +Date: Jul 2025 +Contact: SeongJae Park +Description: Writing to and reading from this file sets and gets the id of + the DAMOS action destination. For DAMOS_MIGRATE_{HOT,COLD} + actions, the destination node's node id can be written and + read. 
+
+What:		/sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//dests//weight
+Date:		Jul 2025
+Contact:	SeongJae Park
+Description:	Writing to and reading from this file sets and gets the weight
+		of the DAMOS action destination to select as the destination of
+		each action among the destinations.
+
 What:		/sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//stats/nr_tried
 Date:		Mar 2022
 Contact:	SeongJae Park

From 3a4785c2d3bee59a0e39ad94b672f6e1a1180981 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Tue, 8 Jul 2025 19:59:36 -0500
Subject: [PATCH 239/370] Docs/admin-guide/mm/damon/usage: document dests
 directory

Document the newly added DAMOS action destination directory of the DAMON
sysfs interface on the usage document.

Link: https://lkml.kernel.org/r/20250709005952.17776-7-bijan311@gmail.com
Signed-off-by: SeongJae Park
Cc: Bijan Tabatabai
Cc: Jonathan Corbet
Cc: Ravi Shankar Jonnalagadda
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/mm/damon/usage.rst | 33 +++++++++++++++++---
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index d960aba72b82..fc5c962353ed 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -85,6 +85,8 @@ comma (",").
     │ │ │ │ │ │ │ :ref:`watermarks `/metric,interval_us,high,mid,low
     │ │ │ │ │ │ │ :ref:`{core_,ops_,}filters `/nr_filters
     │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx,min,max
+    │ │ │ │ │ │ │ :ref:`dests `/nr_dests
+    │ │ │ │ │ │ │ │ 0/id,weight
     │ │ │ │ │ │ │ :ref:`stats `/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds
     │ │ │ │ │ │ │ :ref:`tried_regions `/total_bytes
     │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age,sz_filter_passed
@@ -307,10 +309,10 @@ to ``N-1``.  Each directory represents each DAMON-based operation scheme.
 schemes//
 ------------
 
-In each scheme directory, seven directories (``access_pattern``, ``quotas``,
-``watermarks``, ``core_filters``, ``ops_filters``, ``filters``, ``stats``, and
-``tried_regions``) and three files (``action``, ``target_nid`` and
-``apply_interval``) exist.
+In each scheme directory, eight directories (``access_pattern``, ``quotas``,
+``watermarks``, ``core_filters``, ``ops_filters``, ``filters``, ``dests``,
+``stats``, and ``tried_regions``) and three files (``action``, ``target_nid``
+and ``apply_interval``) exist.
 
 The ``action`` file is for setting and getting the scheme's
 :ref:`action `.  The keywords that can be written to and read
@@ -484,6 +486,29 @@ Refer to the :ref:`DAMOS filters design documentation
 of different ``allow`` works, when each of the filters are supported, and
 differences on stats.
 
+.. _damon_sysfs_dests:
+
+schemes//dests/
+------------------
+
+Directory for specifying the destinations of the given DAMON-based operation
+scheme's action.  This directory is ignored if the action of the given scheme
+does not support multiple destinations.  Only ``DAMOS_MIGRATE_{HOT,COLD}``
+actions support multiple destinations.
+
+In the beginning, the directory has only one file, ``nr_dests``.  Writing a
+number (``N``) to the file creates that many child directories, named ``0``
+to ``N-1``.  Each directory represents one action destination.
+
+Each destination directory contains two files, namely ``id`` and ``weight``.
+Users can write and read the identifier of the destination to the ``id`` file.
+For ``DAMOS_MIGRATE_{HOT,COLD}`` actions, the migrate destination node's node
+id should be written to the ``id`` file.  Users can write and read the weight
+of the destination among the given destinations to the ``weight`` file.  The
+weight can be an arbitrary integer.  When DAMOS applies the action to each
+entity of the memory region, it will select the destination of the action
+based on the relative weights of the destinations.
+
 .. _sysfs_schemes_stats:
 
 schemes//stats/

From cbc4eea4ffb5c9858b8f4dc7cae5897b01102a3f Mon Sep 17 00:00:00 2001
From: Bijan Tabatabai
Date: Tue, 8 Jul 2025 19:59:37 -0500
Subject: [PATCH 240/370] mm/damon/core: commit damos->migrate_dests

When committing new scheme parameters from the sysfs, copy the
migrate_dests struct of the source schemes into the destination schemes.
Link: https://lkml.kernel.org/r/20250709005952.17776-9-bijan311@gmail.com Co-developed-by: Ravi Shankar Jonnalagadda Signed-off-by: Ravi Shankar Jonnalagadda Signed-off-by: Bijan Tabatabai Reviewed-by: SeongJae Park Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- mm/damon/ops-common.c | 120 +++++++++++++++++++++++++++++++++++++++++ mm/damon/ops-common.h | 2 + mm/damon/paddr.c | 122 +----------------------------------------- 3 files changed, 123 insertions(+), 121 deletions(-) diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c index b43620fee6bb..918158ef3d99 100644 --- a/mm/damon/ops-common.c +++ b/mm/damon/ops-common.c @@ -5,6 +5,7 @@ * Author: SeongJae Park */ +#include #include #include #include @@ -12,6 +13,7 @@ #include #include +#include "../internal.h" #include "ops-common.h" /* @@ -138,3 +140,121 @@ int damon_cold_score(struct damon_ctx *c, struct damon_region *r, /* Return coldness of the region */ return DAMOS_MAX_SCORE - hotness; } + +static unsigned int __damon_migrate_folio_list( + struct list_head *migrate_folios, struct pglist_data *pgdat, + int target_nid) +{ + unsigned int nr_succeeded = 0; + struct migration_target_control mtc = { + /* + * Allocate from 'node', or fail quickly and quietly. + * When this happens, 'page' will likely just be discarded + * instead of migrated. + */ + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | + __GFP_NOWARN | __GFP_NOMEMALLOC | GFP_NOWAIT, + .nid = target_nid, + }; + + if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE) + return 0; + + if (list_empty(migrate_folios)) + return 0; + + /* Migration ignores all cpuset and mempolicy settings */ + migrate_pages(migrate_folios, alloc_migration_target, NULL, + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DAMON, + &nr_succeeded); + + return nr_succeeded; +} + +static unsigned int damon_migrate_folio_list(struct list_head *folio_list, + struct pglist_data *pgdat, + int target_nid) +{ + unsigned int nr_migrated = 0; + struct folio *folio; + LIST_HEAD(ret_folios); + LIST_HEAD(migrate_folios); + + while (!list_empty(folio_list)) { + struct folio *folio; + + cond_resched(); + + folio = lru_to_folio(folio_list); + list_del(&folio->lru); + + if (!folio_trylock(folio)) + goto keep; + + /* Relocate its contents to another node. */ + list_add(&folio->lru, &migrate_folios); + folio_unlock(folio); + continue; +keep: + list_add(&folio->lru, &ret_folios); + } + /* 'folio_list' is always empty here */ + + /* Migrate folios selected for migration */ + nr_migrated += __damon_migrate_folio_list( + &migrate_folios, pgdat, target_nid); + /* + * Folios that could not be migrated are still in @migrate_folios. 
Add + * those back on @folio_list + */ + if (!list_empty(&migrate_folios)) + list_splice_init(&migrate_folios, folio_list); + + try_to_unmap_flush(); + + list_splice(&ret_folios, folio_list); + + while (!list_empty(folio_list)) { + folio = lru_to_folio(folio_list); + list_del(&folio->lru); + folio_putback_lru(folio); + } + + return nr_migrated; +} + +unsigned long damon_migrate_pages(struct list_head *folio_list, int target_nid) +{ + int nid; + unsigned long nr_migrated = 0; + LIST_HEAD(node_folio_list); + unsigned int noreclaim_flag; + + if (list_empty(folio_list)) + return nr_migrated; + + noreclaim_flag = memalloc_noreclaim_save(); + + nid = folio_nid(lru_to_folio(folio_list)); + do { + struct folio *folio = lru_to_folio(folio_list); + + if (nid == folio_nid(folio)) { + list_move(&folio->lru, &node_folio_list); + continue; + } + + nr_migrated += damon_migrate_folio_list(&node_folio_list, + NODE_DATA(nid), + target_nid); + nid = folio_nid(lru_to_folio(folio_list)); + } while (!list_empty(folio_list)); + + nr_migrated += damon_migrate_folio_list(&node_folio_list, + NODE_DATA(nid), + target_nid); + + memalloc_noreclaim_restore(noreclaim_flag); + + return nr_migrated; +} diff --git a/mm/damon/ops-common.h b/mm/damon/ops-common.h index cc9f5da9c012..54209a7e70e6 100644 --- a/mm/damon/ops-common.h +++ b/mm/damon/ops-common.h @@ -16,3 +16,5 @@ int damon_cold_score(struct damon_ctx *c, struct damon_region *r, struct damos *s); int damon_hot_score(struct damon_ctx *c, struct damon_region *r, struct damos *s); + +unsigned long damon_migrate_pages(struct list_head *folio_list, int target_nid); diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c index fcab148e6865..48e3e6fed636 100644 --- a/mm/damon/paddr.c +++ b/mm/damon/paddr.c @@ -13,7 +13,6 @@ #include #include #include -#include #include #include "../internal.h" @@ -381,125 +380,6 @@ static unsigned long damon_pa_deactivate_pages(struct damon_region *r, sz_filter_passed); } -static unsigned int __damon_pa_migrate_folio_list( - struct list_head *migrate_folios, struct pglist_data *pgdat, - int target_nid) -{ - unsigned int nr_succeeded = 0; - struct migration_target_control mtc = { - /* - * Allocate from 'node', or fail quickly and quietly. - * When this happens, 'page' will likely just be discarded - * instead of migrated. - */ - .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | - __GFP_NOWARN | __GFP_NOMEMALLOC | GFP_NOWAIT, - .nid = target_nid, - }; - - if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE) - return 0; - - if (list_empty(migrate_folios)) - return 0; - - /* Migration ignores all cpuset and mempolicy settings */ - migrate_pages(migrate_folios, alloc_migration_target, NULL, - (unsigned long)&mtc, MIGRATE_ASYNC, MR_DAMON, - &nr_succeeded); - - return nr_succeeded; -} - -static unsigned int damon_pa_migrate_folio_list(struct list_head *folio_list, - struct pglist_data *pgdat, - int target_nid) -{ - unsigned int nr_migrated = 0; - struct folio *folio; - LIST_HEAD(ret_folios); - LIST_HEAD(migrate_folios); - - while (!list_empty(folio_list)) { - struct folio *folio; - - cond_resched(); - - folio = lru_to_folio(folio_list); - list_del(&folio->lru); - - if (!folio_trylock(folio)) - goto keep; - - /* Relocate its contents to another node. 
*/
-		list_add(&folio->lru, &migrate_folios);
-		folio_unlock(folio);
-		continue;
-keep:
-		list_add(&folio->lru, &ret_folios);
-	}
-	/* 'folio_list' is always empty here */
-
-	/* Migrate folios selected for migration */
-	nr_migrated += __damon_pa_migrate_folio_list(
-			&migrate_folios, pgdat, target_nid);
-	/*
-	 * Folios that could not be migrated are still in @migrate_folios.  Add
-	 * those back on @folio_list
-	 */
-	if (!list_empty(&migrate_folios))
-		list_splice_init(&migrate_folios, folio_list);
-
-	try_to_unmap_flush();
-
-	list_splice(&ret_folios, folio_list);
-
-	while (!list_empty(folio_list)) {
-		folio = lru_to_folio(folio_list);
-		list_del(&folio->lru);
-		folio_putback_lru(folio);
-	}
-
-	return nr_migrated;
-}
-
-static unsigned long damon_pa_migrate_pages(struct list_head *folio_list,
-					    int target_nid)
-{
-	int nid;
-	unsigned long nr_migrated = 0;
-	LIST_HEAD(node_folio_list);
-	unsigned int noreclaim_flag;
-
-	if (list_empty(folio_list))
-		return nr_migrated;
-
-	noreclaim_flag = memalloc_noreclaim_save();
-
-	nid = folio_nid(lru_to_folio(folio_list));
-	do {
-		struct folio *folio = lru_to_folio(folio_list);
-
-		if (nid == folio_nid(folio)) {
-			list_move(&folio->lru, &node_folio_list);
-			continue;
-		}
-
-		nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
-							   NODE_DATA(nid),
-							   target_nid);
-		nid = folio_nid(lru_to_folio(folio_list));
-	} while (!list_empty(folio_list));
-
-	nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
-						   NODE_DATA(nid),
-						   target_nid);
-
-	memalloc_noreclaim_restore(noreclaim_flag);
-
-	return nr_migrated;
-}
-
 static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
 		unsigned long *sz_filter_passed)
 {
@@ -527,7 +407,7 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
 		addr += folio_size(folio);
 		folio_put(folio);
 	}
-	applied = damon_pa_migrate_pages(&folio_list, s->target_nid);
+	applied = damon_migrate_pages(&folio_list, s->target_nid);
 	cond_resched();
 	s->last_applied = folio;
 	return applied * PAGE_SIZE;

From 256b0c7faa84e042ceb809cf57f6593c25bd9c58 Mon Sep 17 00:00:00 2001
From: Bijan Tabatabai
Date: Tue, 8 Jul 2025 19:59:39 -0500
Subject: [PATCH 242/370] mm/damon/vaddr: add vaddr versions of
 migrate_{hot,cold}

migrate_{hot,cold} are paddr schemes that are used to migrate hot/cold
data to a specified node.  However, these schemes are only available when
doing physical address monitoring.

This patch adds an implementation of them for virtual address monitoring
as well.
Link: https://lkml.kernel.org/r/20250709005952.17776-10-bijan311@gmail.com Co-developed-by: Ravi Shankar Jonnalagadda Signed-off-by: Ravi Shankar Jonnalagadda Signed-off-by: Bijan Tabatabai Reviewed-by: SeongJae Park Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- mm/damon/vaddr.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 98 insertions(+) diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c index 46554e49a478..b244ac056416 100644 --- a/mm/damon/vaddr.c +++ b/mm/damon/vaddr.c @@ -15,6 +15,7 @@ #include #include +#include "../internal.h" #include "ops-common.h" #ifdef CONFIG_DAMON_VADDR_KUNIT_TEST @@ -610,6 +611,68 @@ static unsigned int damon_va_check_accesses(struct damon_ctx *ctx) return max_nr_accesses; } +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static int damos_va_migrate_pmd_entry(pmd_t *pmd, unsigned long addr, + unsigned long next, struct mm_walk *walk) +{ + struct list_head *migration_list = walk->private; + struct folio *folio; + spinlock_t *ptl; + pmd_t pmde; + + ptl = pmd_lock(walk->mm, pmd); + pmde = pmdp_get(pmd); + + if (!pmd_present(pmde) || !pmd_trans_huge(pmde)) + goto unlock; + + /* Tell page walk code to not split the PMD */ + walk->action = ACTION_CONTINUE; + + folio = damon_get_folio(pmd_pfn(pmde)); + if (!folio) + goto unlock; + + if (!folio_isolate_lru(folio)) + goto put_folio; + + list_add(&folio->lru, migration_list); + +put_folio: + folio_put(folio); +unlock: + spin_unlock(ptl); + return 0; +} +#else +#define damos_va_migrate_pmd_entry NULL +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ + +static int damos_va_migrate_pte_entry(pte_t *pte, unsigned long addr, + unsigned long next, struct mm_walk *walk) +{ + struct list_head *migration_list = walk->private; + struct folio *folio; + pte_t ptent; + + ptent = ptep_get(pte); + if (pte_none(ptent) || !pte_present(ptent)) + return 0; + + folio = damon_get_folio(pte_pfn(ptent)); + if (!folio) + return 0; + + if (!folio_isolate_lru(folio)) + goto out; + + list_add(&folio->lru, migration_list); + +out: + folio_put(folio); + return 0; +} + /* * Functions for the target validity check and cleanup */ @@ -653,6 +716,34 @@ static unsigned long damos_madvise(struct damon_target *target, } #endif /* CONFIG_ADVISE_SYSCALLS */ +static unsigned long damos_va_migrate(struct damon_target *target, + struct damon_region *r, struct damos *s, + unsigned long *sz_filter_passed) +{ + LIST_HEAD(folio_list); + struct mm_struct *mm; + unsigned long applied = 0; + struct mm_walk_ops walk_ops = { + .pmd_entry = damos_va_migrate_pmd_entry, + .pte_entry = damos_va_migrate_pte_entry, + .walk_lock = PGWALK_RDLOCK, + }; + + mm = damon_get_mm(target); + if (!mm) + return 0; + + mmap_read_lock(mm); + walk_page_range(mm, r->ar.start, r->ar.end, &walk_ops, &folio_list); + mmap_read_unlock(mm); + mmput(mm); + + applied = damon_migrate_pages(&folio_list, s->target_nid); + cond_resched(); + + return applied * PAGE_SIZE; +} + static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx, struct damon_target *t, struct damon_region *r, struct damos *scheme, unsigned long *sz_filter_passed) @@ -675,6 +766,9 @@ static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx, case DAMOS_NOHUGEPAGE: madv_action = MADV_NOHUGEPAGE; break; + case DAMOS_MIGRATE_HOT: + case DAMOS_MIGRATE_COLD: + return damos_va_migrate(t, r, scheme, sz_filter_passed); case DAMOS_STAT: return 0; default: @@ -695,6 +789,10 @@ static int damon_va_scheme_score(struct damon_ctx *context, switch (scheme->action) { case DAMOS_PAGEOUT: return damon_cold_score(context, r, 
scheme);
+	case DAMOS_MIGRATE_HOT:
+		return damon_hot_score(context, r, scheme);
+	case DAMOS_MIGRATE_COLD:
+		return damon_cold_score(context, r, scheme);
 	default:
 		break;
 	}

From 0af934b1312cc03bb977e8f345c694f3bef9529e Mon Sep 17 00:00:00 2001
From: Bijan Tabatabai
Date: Tue, 8 Jul 2025 19:59:40 -0500
Subject: [PATCH 243/370] Docs/mm/damon/design: document vaddr support for
 migrate_{hot,cold}

Document that the migrate_{hot,cold} schemes can be used by the vaddr
operations set.

Link: https://lkml.kernel.org/r/20250709005952.17776-11-bijan311@gmail.com
Co-developed-by: Ravi Shankar Jonnalagadda
Signed-off-by: Ravi Shankar Jonnalagadda
Signed-off-by: Bijan Tabatabai
Reviewed-by: SeongJae Park
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
---
 Documentation/mm/damon/design.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index ddc50db3afa4..03f8137256f5 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -452,9 +452,9 @@ that supports each action are as below.
 - ``lru_deprio``: Deprioritize the region on its LRU lists.
   Supported by ``paddr`` operations set.
 - ``migrate_hot``: Migrate the regions prioritizing warmer regions.
-  Supported by ``paddr`` operations set.
+  Supported by ``vaddr``, ``fvaddr`` and ``paddr`` operations set.
 - ``migrate_cold``: Migrate the regions prioritizing colder regions.
-  Supported by ``paddr`` operations set.
+  Supported by ``vaddr``, ``fvaddr`` and ``paddr`` operations set.
 - ``stat``: Do nothing but count the statistics.
   Supported by all operations sets.

From 19c1dc15c859929f8f084d0bfa2fafd1ee876cdd Mon Sep 17 00:00:00 2001
From: Bijan Tabatabai
Date: Tue, 8 Jul 2025 19:59:41 -0500
Subject: [PATCH 244/370] mm/damon/vaddr: use damos->migrate_dests in
 migrate_{hot,cold}

damos->migrate_dests provides a list of nodes the migrate_{hot,cold}
actions should migrate to, as well as the weights that specify the ratio
of pages that should be migrated to each destination node.

This patch interleaves pages in the migrate_{hot,cold} actions according
to the information provided in damos->migrate_dests if it is used.  The
interleaving algorithm used is similar to the one used in
weighted_interleave_nid().  If damos->migrate_dests is not provided, the
actions migrate pages to the node specified in damos->target_nid as
before.

Link: https://lkml.kernel.org/r/20250709005952.17776-12-bijan311@gmail.com
Co-developed-by: Ravi Shankar Jonnalagadda
Signed-off-by: Ravi Shankar Jonnalagadda
Signed-off-by: Bijan Tabatabai
Reviewed-by: SeongJae Park
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
---
 mm/damon/vaddr.c | 110 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 93 insertions(+), 17 deletions(-)

diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index b244ac056416..47d5f33f89c8 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -611,11 +611,69 @@ static unsigned int damon_va_check_accesses(struct damon_ctx *ctx)
 	return max_nr_accesses;
 }
 
+struct damos_va_migrate_private {
+	struct list_head *migration_lists;
+	struct damos_migrate_dests *dests;
+};
+
+/*
+ * Place the given folio in the migration_list corresponding to where the folio
+ * should be migrated.
+ * + * The algorithm used here is similar to weighted_interleave_nid() + */ +static void damos_va_migrate_dests_add(struct folio *folio, + struct vm_area_struct *vma, unsigned long addr, + struct damos_migrate_dests *dests, + struct list_head *migration_lists) +{ + pgoff_t ilx; + int order; + unsigned int target; + unsigned int weight_total = 0; + int i; + + /* + * If dests is empty, there is only one migration list corresponding + * to s->target_nid. + */ + if (!dests->nr_dests) { + i = 0; + goto isolate; + } + + order = folio_order(folio); + ilx = vma->vm_pgoff >> order; + ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order); + + for (i = 0; i < dests->nr_dests; i++) + weight_total += dests->weight_arr[i]; + + /* If the total weights are somehow 0, don't migrate at all */ + if (!weight_total) + return; + + target = ilx % weight_total; + for (i = 0; i < dests->nr_dests; i++) { + if (target < dests->weight_arr[i]) + break; + target -= dests->weight_arr[i]; + } + +isolate: + if (!folio_isolate_lru(folio)) + return; + + list_add(&folio->lru, &migration_lists[i]); +} + #ifdef CONFIG_TRANSPARENT_HUGEPAGE static int damos_va_migrate_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk) { - struct list_head *migration_list = walk->private; + struct damos_va_migrate_private *priv = walk->private; + struct list_head *migration_lists = priv->migration_lists; + struct damos_migrate_dests *dests = priv->dests; struct folio *folio; spinlock_t *ptl; pmd_t pmde; @@ -633,12 +691,9 @@ static int damos_va_migrate_pmd_entry(pmd_t *pmd, unsigned long addr, if (!folio) goto unlock; - if (!folio_isolate_lru(folio)) - goto put_folio; + damos_va_migrate_dests_add(folio, walk->vma, addr, dests, + migration_lists); - list_add(&folio->lru, migration_list); - -put_folio: folio_put(folio); unlock: spin_unlock(ptl); @@ -651,7 +706,9 @@ static int damos_va_migrate_pmd_entry(pmd_t *pmd, unsigned long addr, static int damos_va_migrate_pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk) { - struct list_head *migration_list = walk->private; + struct damos_va_migrate_private *priv = walk->private; + struct list_head *migration_lists = priv->migration_lists; + struct damos_migrate_dests *dests = priv->dests; struct folio *folio; pte_t ptent; @@ -663,12 +720,9 @@ static int damos_va_migrate_pte_entry(pte_t *pte, unsigned long addr, if (!folio) return 0; - if (!folio_isolate_lru(folio)) - goto out; + damos_va_migrate_dests_add(folio, walk->vma, addr, dests, + migration_lists); - list_add(&folio->lru, migration_list); - -out: folio_put(folio); return 0; } @@ -721,26 +775,48 @@ static unsigned long damos_va_migrate(struct damon_target *target, unsigned long *sz_filter_passed) { LIST_HEAD(folio_list); + struct damos_va_migrate_private priv; struct mm_struct *mm; + int nr_dests; + int nid; + bool use_target_nid; unsigned long applied = 0; + struct damos_migrate_dests *dests = &s->migrate_dests; struct mm_walk_ops walk_ops = { .pmd_entry = damos_va_migrate_pmd_entry, .pte_entry = damos_va_migrate_pte_entry, .walk_lock = PGWALK_RDLOCK, }; - mm = damon_get_mm(target); - if (!mm) + use_target_nid = dests->nr_dests == 0; + nr_dests = use_target_nid ? 
1 : dests->nr_dests; + priv.dests = dests; + priv.migration_lists = kmalloc_array(nr_dests, + sizeof(*priv.migration_lists), GFP_KERNEL); + if (!priv.migration_lists) return 0; + for (int i = 0; i < nr_dests; i++) + INIT_LIST_HEAD(&priv.migration_lists[i]); + + + mm = damon_get_mm(target); + if (!mm) + goto free_lists; + mmap_read_lock(mm); - walk_page_range(mm, r->ar.start, r->ar.end, &walk_ops, &folio_list); + walk_page_range(mm, r->ar.start, r->ar.end, &walk_ops, &priv); mmap_read_unlock(mm); mmput(mm); - applied = damon_migrate_pages(&folio_list, s->target_nid); - cond_resched(); + for (int i = 0; i < nr_dests; i++) { + nid = use_target_nid ? s->target_nid : dests->node_id_arr[i]; + applied += damon_migrate_pages(&priv.migration_lists[i], nid); + cond_resched(); + } +free_lists: + kfree(priv.migration_lists); return applied * PAGE_SIZE; } From 0a707d6b04e01490f6c246fa4b6e643cc33b40a1 Mon Sep 17 00:00:00 2001 From: Bijan Tabatabai Date: Tue, 8 Jul 2025 19:59:42 -0500 Subject: [PATCH 245/370] mm/damon: move folio filtering from paddr to ops-common This patch moves damos_pa_filter_match and the functions it calls to ops-common, renaming it to damos_folio_filter_match. Doing so allows us to share the filtering logic for the vaddr version of the migrate_{hot,cold} schemes. Link: https://lkml.kernel.org/r/20250709005952.17776-13-bijan311@gmail.com Co-developed-by: Ravi Shankar Jonnalagadda Signed-off-by: Ravi Shankar Jonnalagadda Signed-off-by: Bijan Tabatabai Reviewed-by: SeongJae Park Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- mm/damon/ops-common.c | 150 +++++++++++++++++++++++++++++++++++++++++ mm/damon/ops-common.h | 3 + mm/damon/paddr.c | 153 +----------------------------------------- 3 files changed, 154 insertions(+), 152 deletions(-) diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c index 918158ef3d99..6a9797d1d7ff 100644 --- a/mm/damon/ops-common.c +++ b/mm/damon/ops-common.c @@ -141,6 +141,156 @@ int damon_cold_score(struct damon_ctx *c, struct damon_region *r, return DAMOS_MAX_SCORE - hotness; } +static bool damon_folio_mkold_one(struct folio *folio, + struct vm_area_struct *vma, unsigned long addr, void *arg) +{ + DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, addr, 0); + + while (page_vma_mapped_walk(&pvmw)) { + addr = pvmw.address; + if (pvmw.pte) + damon_ptep_mkold(pvmw.pte, vma, addr); + else + damon_pmdp_mkold(pvmw.pmd, vma, addr); + } + return true; +} + +void damon_folio_mkold(struct folio *folio) +{ + struct rmap_walk_control rwc = { + .rmap_one = damon_folio_mkold_one, + .anon_lock = folio_lock_anon_vma_read, + }; + bool need_lock; + + if (!folio_mapped(folio) || !folio_raw_mapping(folio)) { + folio_set_idle(folio); + return; + } + + need_lock = !folio_test_anon(folio) || folio_test_ksm(folio); + if (need_lock && !folio_trylock(folio)) + return; + + rmap_walk(folio, &rwc); + + if (need_lock) + folio_unlock(folio); + +} + +static bool damon_folio_young_one(struct folio *folio, + struct vm_area_struct *vma, unsigned long addr, void *arg) +{ + bool *accessed = arg; + DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, addr, 0); + pte_t pte; + + *accessed = false; + while (page_vma_mapped_walk(&pvmw)) { + addr = pvmw.address; + if (pvmw.pte) { + pte = ptep_get(pvmw.pte); + + /* + * PFN swap PTEs, such as device-exclusive ones, that + * actually map pages are "old" from a CPU perspective. + * The MMU notifier takes care of any device aspects. 
+ */ + *accessed = (pte_present(pte) && pte_young(pte)) || + !folio_test_idle(folio) || + mmu_notifier_test_young(vma->vm_mm, addr); + } else { +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + *accessed = pmd_young(pmdp_get(pvmw.pmd)) || + !folio_test_idle(folio) || + mmu_notifier_test_young(vma->vm_mm, addr); +#else + WARN_ON_ONCE(1); +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ + } + if (*accessed) { + page_vma_mapped_walk_done(&pvmw); + break; + } + } + + /* If accessed, stop walking */ + return *accessed == false; +} + +bool damon_folio_young(struct folio *folio) +{ + bool accessed = false; + struct rmap_walk_control rwc = { + .arg = &accessed, + .rmap_one = damon_folio_young_one, + .anon_lock = folio_lock_anon_vma_read, + }; + bool need_lock; + + if (!folio_mapped(folio) || !folio_raw_mapping(folio)) { + if (folio_test_idle(folio)) + return false; + else + return true; + } + + need_lock = !folio_test_anon(folio) || folio_test_ksm(folio); + if (need_lock && !folio_trylock(folio)) + return false; + + rmap_walk(folio, &rwc); + + if (need_lock) + folio_unlock(folio); + + return accessed; +} + +bool damos_folio_filter_match(struct damos_filter *filter, struct folio *folio) +{ + bool matched = false; + struct mem_cgroup *memcg; + size_t folio_sz; + + switch (filter->type) { + case DAMOS_FILTER_TYPE_ANON: + matched = folio_test_anon(folio); + break; + case DAMOS_FILTER_TYPE_ACTIVE: + matched = folio_test_active(folio); + break; + case DAMOS_FILTER_TYPE_MEMCG: + rcu_read_lock(); + memcg = folio_memcg_check(folio); + if (!memcg) + matched = false; + else + matched = filter->memcg_id == mem_cgroup_id(memcg); + rcu_read_unlock(); + break; + case DAMOS_FILTER_TYPE_YOUNG: + matched = damon_folio_young(folio); + if (matched) + damon_folio_mkold(folio); + break; + case DAMOS_FILTER_TYPE_HUGEPAGE_SIZE: + folio_sz = folio_size(folio); + matched = filter->sz_range.min <= folio_sz && + folio_sz <= filter->sz_range.max; + break; + case DAMOS_FILTER_TYPE_UNMAPPED: + matched = !folio_mapped(folio) || !folio_raw_mapping(folio); + break; + default: + break; + } + + return matched == filter->matching; +} + static unsigned int __damon_migrate_folio_list( struct list_head *migrate_folios, struct pglist_data *pgdat, int target_nid) diff --git a/mm/damon/ops-common.h b/mm/damon/ops-common.h index 54209a7e70e6..61ad54aaf256 100644 --- a/mm/damon/ops-common.h +++ b/mm/damon/ops-common.h @@ -11,10 +11,13 @@ struct folio *damon_get_folio(unsigned long pfn); void damon_ptep_mkold(pte_t *pte, struct vm_area_struct *vma, unsigned long addr); void damon_pmdp_mkold(pmd_t *pmd, struct vm_area_struct *vma, unsigned long addr); +void damon_folio_mkold(struct folio *folio); +bool damon_folio_young(struct folio *folio); int damon_cold_score(struct damon_ctx *c, struct damon_region *r, struct damos *s); int damon_hot_score(struct damon_ctx *c, struct damon_region *r, struct damos *s); +bool damos_folio_filter_match(struct damos_filter *filter, struct folio *folio); unsigned long damon_migrate_pages(struct list_head *folio_list, int target_nid); diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c index 48e3e6fed636..53a55c5114fb 100644 --- a/mm/damon/paddr.c +++ b/mm/damon/paddr.c @@ -18,45 +18,6 @@ #include "../internal.h" #include "ops-common.h" -static bool damon_folio_mkold_one(struct folio *folio, - struct vm_area_struct *vma, unsigned long addr, void *arg) -{ - DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, addr, 0); - - while (page_vma_mapped_walk(&pvmw)) { - addr = pvmw.address; - if (pvmw.pte) - damon_ptep_mkold(pvmw.pte, vma, addr); - else - 
damon_pmdp_mkold(pvmw.pmd, vma, addr); - } - return true; -} - -static void damon_folio_mkold(struct folio *folio) -{ - struct rmap_walk_control rwc = { - .rmap_one = damon_folio_mkold_one, - .anon_lock = folio_lock_anon_vma_read, - }; - bool need_lock; - - if (!folio_mapped(folio) || !folio_raw_mapping(folio)) { - folio_set_idle(folio); - return; - } - - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio); - if (need_lock && !folio_trylock(folio)) - return; - - rmap_walk(folio, &rwc); - - if (need_lock) - folio_unlock(folio); - -} - static void damon_pa_mkold(unsigned long paddr) { struct folio *folio = damon_get_folio(PHYS_PFN(paddr)); @@ -86,75 +47,6 @@ static void damon_pa_prepare_access_checks(struct damon_ctx *ctx) } } -static bool damon_folio_young_one(struct folio *folio, - struct vm_area_struct *vma, unsigned long addr, void *arg) -{ - bool *accessed = arg; - DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, addr, 0); - pte_t pte; - - *accessed = false; - while (page_vma_mapped_walk(&pvmw)) { - addr = pvmw.address; - if (pvmw.pte) { - pte = ptep_get(pvmw.pte); - - /* - * PFN swap PTEs, such as device-exclusive ones, that - * actually map pages are "old" from a CPU perspective. - * The MMU notifier takes care of any device aspects. - */ - *accessed = (pte_present(pte) && pte_young(pte)) || - !folio_test_idle(folio) || - mmu_notifier_test_young(vma->vm_mm, addr); - } else { -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - *accessed = pmd_young(pmdp_get(pvmw.pmd)) || - !folio_test_idle(folio) || - mmu_notifier_test_young(vma->vm_mm, addr); -#else - WARN_ON_ONCE(1); -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ - } - if (*accessed) { - page_vma_mapped_walk_done(&pvmw); - break; - } - } - - /* If accessed, stop walking */ - return *accessed == false; -} - -static bool damon_folio_young(struct folio *folio) -{ - bool accessed = false; - struct rmap_walk_control rwc = { - .arg = &accessed, - .rmap_one = damon_folio_young_one, - .anon_lock = folio_lock_anon_vma_read, - }; - bool need_lock; - - if (!folio_mapped(folio) || !folio_raw_mapping(folio)) { - if (folio_test_idle(folio)) - return false; - else - return true; - } - - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio); - if (need_lock && !folio_trylock(folio)) - return false; - - rmap_walk(folio, &rwc); - - if (need_lock) - folio_unlock(folio); - - return accessed; -} - static bool damon_pa_young(unsigned long paddr, unsigned long *folio_sz) { struct folio *folio = damon_get_folio(PHYS_PFN(paddr)); @@ -205,49 +97,6 @@ static unsigned int damon_pa_check_accesses(struct damon_ctx *ctx) return max_nr_accesses; } -static bool damos_pa_filter_match(struct damos_filter *filter, - struct folio *folio) -{ - bool matched = false; - struct mem_cgroup *memcg; - size_t folio_sz; - - switch (filter->type) { - case DAMOS_FILTER_TYPE_ANON: - matched = folio_test_anon(folio); - break; - case DAMOS_FILTER_TYPE_ACTIVE: - matched = folio_test_active(folio); - break; - case DAMOS_FILTER_TYPE_MEMCG: - rcu_read_lock(); - memcg = folio_memcg_check(folio); - if (!memcg) - matched = false; - else - matched = filter->memcg_id == mem_cgroup_id(memcg); - rcu_read_unlock(); - break; - case DAMOS_FILTER_TYPE_YOUNG: - matched = damon_folio_young(folio); - if (matched) - damon_folio_mkold(folio); - break; - case DAMOS_FILTER_TYPE_HUGEPAGE_SIZE: - folio_sz = folio_size(folio); - matched = filter->sz_range.min <= folio_sz && - folio_sz <= filter->sz_range.max; - break; - case DAMOS_FILTER_TYPE_UNMAPPED: - matched = !folio_mapped(folio) || !folio_raw_mapping(folio); - 
break; - default: - break; - } - - return matched == filter->matching; -} - /* * damos_pa_filter_out - Return true if the page should be filtered out. */ @@ -259,7 +108,7 @@ static bool damos_pa_filter_out(struct damos *scheme, struct folio *folio) return false; damos_for_each_ops_filter(filter, scheme) { - if (damos_pa_filter_match(filter, folio)) + if (damos_folio_filter_match(filter, folio)) return !filter->allow; } return scheme->ops_filters_default_reject; From db87a4e236424249c7def768ba50b88699af2c0d Mon Sep 17 00:00:00 2001 From: Bijan Tabatabai Date: Tue, 8 Jul 2025 19:59:43 -0500 Subject: [PATCH 246/370] mm/damon/vaddr: apply filters in migrate_{hot/cold} The paddr versions of migrate_{hot/cold} filter out folios from migration based on the scheme's filters. This patch does the same for the vaddr versions of those schemes. The filtering code is mostly the same for the paddr and vaddr versions. The exception is the young filter. paddr determines if a page is young by doing a folio rmap walk to find the page table entries corresponding to the folio. However, vaddr schemes have easier access to the page tables, so we add some logic to avoid the extra work. Link: https://lkml.kernel.org/r/20250709005952.17776-14-bijan311@gmail.com Co-developed-by: Ravi Shankar Jonnalagadda Signed-off-by: Ravi Shankar Jonnalagadda Signed-off-by: Bijan Tabatabai Reviewed-by: SeongJae Park Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- mm/damon/vaddr.c | 69 +++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 65 insertions(+), 4 deletions(-) diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c index 47d5f33f89c8..7f5dc9c221a0 100644 --- a/mm/damon/vaddr.c +++ b/mm/damon/vaddr.c @@ -611,9 +611,60 @@ static unsigned int damon_va_check_accesses(struct damon_ctx *ctx) return max_nr_accesses; } +static bool damos_va_filter_young_match(struct damos_filter *filter, + struct folio *folio, struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep, pmd_t *pmdp) +{ + bool young = false; + + if (ptep) + young = pte_young(ptep_get(ptep)); + else if (pmdp) + young = pmd_young(pmdp_get(pmdp)); + + young = young || !folio_test_idle(folio) || + mmu_notifier_test_young(vma->vm_mm, addr); + + if (young && ptep) + damon_ptep_mkold(ptep, vma, addr); + else if (young && pmdp) + damon_pmdp_mkold(pmdp, vma, addr); + + return young == filter->matching; +} + +static bool damos_va_filter_out(struct damos *scheme, struct folio *folio, + struct vm_area_struct *vma, unsigned long addr, + pte_t *ptep, pmd_t *pmdp) +{ + struct damos_filter *filter; + bool matched; + + if (scheme->core_filters_allowed) + return false; + + damos_for_each_ops_filter(filter, scheme) { + /* + * damos_folio_filter_match checks the young filter by doing an + * rmap on the folio to find its page table. However, being the + * vaddr scheme, we have direct access to the page tables, so + * use that instead. 
+ */ + if (filter->type == DAMOS_FILTER_TYPE_YOUNG) + matched = damos_va_filter_young_match(filter, folio, + vma, addr, ptep, pmdp); + else + matched = damos_folio_filter_match(filter, folio); + + if (matched) + return !filter->allow; + } + return scheme->ops_filters_default_reject; +} + struct damos_va_migrate_private { struct list_head *migration_lists; - struct damos_migrate_dests *dests; + struct damos *scheme; }; /* @@ -673,7 +724,8 @@ static int damos_va_migrate_pmd_entry(pmd_t *pmd, unsigned long addr, { struct damos_va_migrate_private *priv = walk->private; struct list_head *migration_lists = priv->migration_lists; - struct damos_migrate_dests *dests = priv->dests; + struct damos *s = priv->scheme; + struct damos_migrate_dests *dests = &s->migrate_dests; struct folio *folio; spinlock_t *ptl; pmd_t pmde; @@ -691,9 +743,13 @@ static int damos_va_migrate_pmd_entry(pmd_t *pmd, unsigned long addr, if (!folio) goto unlock; + if (damos_va_filter_out(s, folio, walk->vma, addr, NULL, pmd)) + goto put_folio; + damos_va_migrate_dests_add(folio, walk->vma, addr, dests, migration_lists); +put_folio: folio_put(folio); unlock: spin_unlock(ptl); @@ -708,7 +764,8 @@ static int damos_va_migrate_pte_entry(pte_t *pte, unsigned long addr, { struct damos_va_migrate_private *priv = walk->private; struct list_head *migration_lists = priv->migration_lists; - struct damos_migrate_dests *dests = priv->dests; + struct damos *s = priv->scheme; + struct damos_migrate_dests *dests = &s->migrate_dests; struct folio *folio; pte_t ptent; @@ -720,9 +777,13 @@ static int damos_va_migrate_pte_entry(pte_t *pte, unsigned long addr, if (!folio) return 0; + if (damos_va_filter_out(s, folio, walk->vma, addr, pte, NULL)) + goto put_folio; + damos_va_migrate_dests_add(folio, walk->vma, addr, dests, migration_lists); +put_folio: folio_put(folio); return 0; } @@ -790,7 +851,7 @@ static unsigned long damos_va_migrate(struct damon_target *target, use_target_nid = dests->nr_dests == 0; nr_dests = use_target_nid ? 1 : dests->nr_dests; - priv.dests = dests; + priv.scheme = s; priv.migration_lists = kmalloc_array(nr_dests, sizeof(*priv.migration_lists), GFP_KERNEL); if (!priv.migration_lists) From 0ec5eea20821287d9d9d4ec477f7cd0e15315518 Mon Sep 17 00:00:00 2001 From: "Vishal Moola (Oracle)" Date: Wed, 9 Jul 2025 12:40:16 -0700 Subject: [PATCH 247/370] mm/memory.c: use folios in __copy_remote_vm_str() Patch series "Remove unmap_and_put_page()". This patchset uses folios in both the callers of unmap_and_put_page(), saving a couple calls to compound_head() wrappers. This patch (of 3): Use kmap_local_folio() instead of kmap_local_page(). Replaces 2 calls to compound_head() from unmap_and_put_page() with one. This prepares us for the removal of unmap_and_put_page(). 
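
For reference, the conversion pattern looks roughly like this (a minimal
sketch; kmap_local_folio() takes the byte offset of the target page
within the folio, and folio_release_kmap() pairs the unmap with the
folio reference drop):

	struct folio *folio = page_folio(page);	/* one compound_head() */
	void *maddr;

	maddr = kmap_local_folio(folio,
			folio_page_idx(folio, page) * PAGE_SIZE);
	/* ... access maddr + offset ... */
	folio_release_kmap(folio, maddr);	/* kunmap_local() + folio_put() */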
Link: https://lkml.kernel.org/r/20250709194017.927978-3-vishal.moola@gmail.com Link: https://lkml.kernel.org/r/20250709194017.927978-4-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) Acked-by: David Hildenbrand Cc: Jordan Rome Cc: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- mm/memory.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 4619bf5874af..cb2f1296854a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -6815,6 +6815,7 @@ static int __copy_remote_vm_str(struct mm_struct *mm, unsigned long addr, while (len) { int bytes, offset, retval; void *maddr; + struct folio *folio; struct page *page; struct vm_area_struct *vma = NULL; @@ -6830,17 +6831,18 @@ static int __copy_remote_vm_str(struct mm_struct *mm, unsigned long addr, goto out; } + folio = page_folio(page); bytes = len; offset = addr & (PAGE_SIZE - 1); if (bytes > PAGE_SIZE - offset) bytes = PAGE_SIZE - offset; - maddr = kmap_local_page(page); + maddr = kmap_local_folio(folio, folio_page_idx(folio, page) * PAGE_SIZE); retval = strscpy(buf, maddr + offset, bytes); if (retval >= 0) { /* Found the end of the string */ buf += retval; - unmap_and_put_page(page, maddr); + folio_release_kmap(folio, maddr); break; } @@ -6858,7 +6860,7 @@ static int __copy_remote_vm_str(struct mm_struct *mm, unsigned long addr, } len -= bytes; - unmap_and_put_page(page, maddr); + folio_release_kmap(folio, maddr); } out: From 0c092fef53f08f1e461b180c61dddc30491ec855 Mon Sep 17 00:00:00 2001 From: "Vishal Moola (Oracle)" Date: Wed, 9 Jul 2025 12:40:17 -0700 Subject: [PATCH 248/370] mm/memory.c: use folios in __access_remote_vm() Use kmap_local_folio() instead of kmap_local_page(). Replaces 2 calls to compound_head() with one. This prepares us for the removal of unmap_and_put_page(). Link: https://lkml.kernel.org/r/20250709194017.927978-5-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) Acked-by: David Hildenbrand Cc: Jordan Rome Cc: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- mm/memory.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index cb2f1296854a..bc27b1990fcb 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -6691,6 +6691,7 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr, while (len) { int bytes, offset; void *maddr; + struct folio *folio; struct vm_area_struct *vma = NULL; struct page *page = get_user_page_vma_remote(mm, addr, gup_flags, &vma); @@ -6722,21 +6723,22 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr, if (bytes <= 0) break; } else { + folio = page_folio(page); bytes = len; offset = addr & (PAGE_SIZE-1); if (bytes > PAGE_SIZE-offset) bytes = PAGE_SIZE-offset; - maddr = kmap_local_page(page); + maddr = kmap_local_folio(folio, folio_page_idx(folio, page) * PAGE_SIZE); if (write) { copy_to_user_page(vma, page, addr, maddr + offset, buf, bytes); - set_page_dirty_lock(page); + folio_mark_dirty_lock(folio); } else { copy_from_user_page(vma, page, addr, buf, maddr + offset, bytes); } - unmap_and_put_page(page, maddr); + folio_release_kmap(folio, maddr); } len -= bytes; buf += bytes; From e05d3a6014fd4ae224b44b89fbeacfaa4ace0a8e Mon Sep 17 00:00:00 2001 From: "Vishal Moola (Oracle)" Date: Wed, 9 Jul 2025 12:40:18 -0700 Subject: [PATCH 249/370] mm: remove unmap_and_put_page() There are no callers of unmap_and_put_page() left. Remove it. 
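
As the removed definition below shows, the helper was a thin folio
wrapper, so any remaining caller can convert mechanically:

	/* before */
	unmap_and_put_page(page, maddr);
	/* after */
	folio_release_kmap(page_folio(page), maddr);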
Link: https://lkml.kernel.org/r/20250709194017.927978-6-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle)
Acked-by: David Hildenbrand
Cc: Jordan Rome
Cc: Matthew Wilcox (Oracle)
Signed-off-by: Andrew Morton
---
 include/linux/highmem.h | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index a30526cc53a7..6234f316468c 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -682,10 +682,4 @@ static inline void folio_release_kmap(struct folio *folio, void *addr)
 	kunmap_local(addr);
 	folio_put(folio);
 }
-
-static inline void unmap_and_put_page(struct page *page, void *addr)
-{
-	folio_release_kmap(page_folio(page), addr);
-}
-
 #endif /* _LINUX_HIGHMEM_H */

From 7a92f4f591770cf77de2e6550f4a68957483b739 Mon Sep 17 00:00:00 2001
From: Davidlohr Bueso
Date: Mon, 23 Jun 2025 11:58:48 -0700
Subject: [PATCH 250/370] mm/vmscan: respect psi_memstall region in node
 reclaim

Patch series "mm: per-node proactive reclaim", v2.

This adds support for allowing proactive reclaim in general on a NUMA
system.  A per-node interface extends support beyond the memcg-specific
interface, respecting the current semantics of memory.reclaim: respecting
the aging LRU and not supporting artificially triggering eviction on
nodes belonging to non-bottom tiers.

This patch allows userspace to do:

     echo 512M swappiness=10 > /sys/devices/system/node/nodeX/reclaim

One of the premises for this is to semantically align as best as possible
with memory.reclaim.  During a brief time memcg did support nodemask
until 55ab834a86a9 (Revert "mm: add nodes= arg to memory.reclaim"), for
which semantics around reclaim (eviction) vs demotion were not clear,
leaving charging expectations broken.

With this approach:

1. Users who do not use memcg can benefit from proactive reclaim.

2. Proactive reclaim on top tiers will trigger demotion, for which
   memory is still byte-addressable.  Reclaiming on the bottom nodes will
   trigger evicting to swap (the traditional sense of reclaim).  This
   follows the semantics of what is today part of the aging process on
   tiered memory, mirroring what every other form of reclaim does
   (reactive and memcg proactive reclaim).  Furthermore, per-node
   proactive reclaim is not as susceptible to the memcg charging problem
   mentioned above.

3. Unlike memcg, there should be no surprises for callers expecting
   reclaim but instead getting a demotion.  Essentially relying on the
   behavior of shrink_folio_list() after 6b426d071419 ("mm: disable
   top-tier fallback to reclaim on proactive reclaim"), without the
   expectations of try_to_free_mem_cgroup_pages().

4. Unlike the nodes= arg, this interface avoids confusing semantics,
   such as what exactly the user wants when mixing top-tier and low-tier
   nodes in the nodemask.  Further, the per-node interface is less
   exposed to "free up memory in my container" usecases, where eviction
   is intended.

5. Users that *really* want to free up memory can use proactive reclaim
   on nodes known to be on the bottom tiers to force eviction in a
   natural way - higher access latencies are still better than swap.  If
   compelled, though with no guarantees and perhaps not worth the effort,
   users could also potentially follow a ladder-like approach to
   eventually free up the memory.  Alternatively, perhaps an 'evict'
   option could be added to the parameters for both memory.reclaim and
   per-node interfaces to force this action unconditionally.

This patch (of 4):

... rather benign, but keep the proper ending order.
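
Concretely, the exit path should unwind in the reverse order of the
entry path (a sketch of the intended nesting; the hunk below only swaps
the last two calls):

	psi_memstall_enter(&pflags);
	delayacct_freepages_start();
	/* ... reclaim ... */
	delayacct_freepages_end();
	psi_memstall_leave(&pflags);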
Link: https://lkml.kernel.org/r/20250623185851.830632-1-dave@stgolabs.net Link: https://lkml.kernel.org/r/20250623185851.830632-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso Acked-by: Shakeel Butt Reviewed-by: Roman Gushchin Cc: Johannes Weiner Cc: Michal Hocko Cc: Roman Gushchin Cc: Yosry Ahmed Signed-off-by: Andrew Morton --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index b1b999734ee4..85ffff3b4d24 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -7653,8 +7653,8 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in set_task_reclaim_state(p, NULL); memalloc_noreclaim_restore(noreclaim_flag); fs_reclaim_release(sc.gfp_mask); - psi_memstall_leave(&pflags); delayacct_freepages_end(); + psi_memstall_leave(&pflags); trace_mm_vmscan_node_reclaim_end(sc.nr_reclaimed); From 2b7226af730cc9a8818ff3b39aabcd76861913dd Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Mon, 23 Jun 2025 11:58:49 -0700 Subject: [PATCH 251/370] mm/memcg: make memory.reclaim interface generic This adds a general call for both parsing as well as the common reclaim semantics. memcg is still the only user and no change in semantics. [akpm@linux-foundation.org: fix CONFIG_NUMA=n build] Link: https://lkml.kernel.org/r/20250623185851.830632-3-dave@stgolabs.net Signed-off-by: Davidlohr Bueso Acked-by: Shakeel Butt Cc: Johannes Weiner Cc: Michal Hocko Cc: Roman Gushchin Cc: Yosry Ahmed Signed-off-by: Andrew Morton --- mm/internal.h | 10 +++++ mm/memcontrol.c | 77 ++------------------------------------ mm/vmscan.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 112 insertions(+), 73 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 91773a0ef305..5b0f71e5434b 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -533,6 +533,16 @@ extern unsigned long highest_memmap_pfn; bool folio_isolate_lru(struct folio *folio); void folio_putback_lru(struct folio *folio); extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason); +#ifdef CONFIG_NUMA +int user_proactive_reclaim(char *buf, + struct mem_cgroup *memcg, pg_data_t *pgdat); +#else +static inline int user_proactive_reclaim(char *buf, + struct mem_cgroup *memcg, pg_data_t *pgdat) +{ + return 0; +} +#endif /* * in mm/rmap.c: diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 70fdeda1120b..235c66d2161b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -51,7 +51,6 @@ #include #include #include -#include #include #include #include @@ -4564,83 +4563,15 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, return nbytes; } -enum { - MEMORY_RECLAIM_SWAPPINESS = 0, - MEMORY_RECLAIM_SWAPPINESS_MAX, - MEMORY_RECLAIM_NULL, -}; - -static const match_table_t tokens = { - { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"}, - { MEMORY_RECLAIM_SWAPPINESS_MAX, "swappiness=max"}, - { MEMORY_RECLAIM_NULL, NULL }, -}; - static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); - unsigned int nr_retries = MAX_RECLAIM_RETRIES; - unsigned long nr_to_reclaim, nr_reclaimed = 0; - int swappiness = -1; - unsigned int reclaim_options; - char *old_buf, *start; - substring_t args[MAX_OPT_ARGS]; + int ret; - buf = strstrip(buf); - - old_buf = buf; - nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE; - if (buf == old_buf) - return -EINVAL; - - buf = strstrip(buf); - - while ((start = strsep(&buf, " ")) != NULL) { - if (!strlen(start)) - continue; - switch 
(match_token(start, tokens, args)) { - case MEMORY_RECLAIM_SWAPPINESS: - if (match_int(&args[0], &swappiness)) - return -EINVAL; - if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS) - return -EINVAL; - break; - case MEMORY_RECLAIM_SWAPPINESS_MAX: - swappiness = SWAPPINESS_ANON_ONLY; - break; - default: - return -EINVAL; - } - } - - reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE; - while (nr_reclaimed < nr_to_reclaim) { - /* Will converge on zero, but reclaim enforces a minimum */ - unsigned long batch_size = (nr_to_reclaim - nr_reclaimed) / 4; - unsigned long reclaimed; - - if (signal_pending(current)) - return -EINTR; - - /* - * This is the final attempt, drain percpu lru caches in the - * hope of introducing more evictable pages for - * try_to_free_mem_cgroup_pages(). - */ - if (!nr_retries) - lru_add_drain_all(); - - reclaimed = try_to_free_mem_cgroup_pages(memcg, - batch_size, GFP_KERNEL, - reclaim_options, - swappiness == -1 ? NULL : &swappiness); - - if (!reclaimed && !nr_retries--) - return -EAGAIN; - - nr_reclaimed += reclaimed; - } + ret = user_proactive_reclaim(buf, memcg, NULL); + if (ret) + return ret; return nbytes; } diff --git a/mm/vmscan.c b/mm/vmscan.c index 85ffff3b4d24..9702ee5aa65d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -57,6 +57,7 @@ #include #include #include +#include #include #include @@ -6714,6 +6715,15 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, return nr_reclaimed; } +#else +unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, + unsigned long nr_pages, + gfp_t gfp_mask, + unsigned int reclaim_options, + int *swappiness) +{ + return 0; +} #endif static void kswapd_age_node(struct pglist_data *pgdat, struct scan_control *sc) @@ -7708,6 +7718,94 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order) return ret; } + +enum { + MEMORY_RECLAIM_SWAPPINESS = 0, + MEMORY_RECLAIM_SWAPPINESS_MAX, + MEMORY_RECLAIM_NULL, +}; +static const match_table_t tokens = { + { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"}, + { MEMORY_RECLAIM_SWAPPINESS_MAX, "swappiness=max"}, + { MEMORY_RECLAIM_NULL, NULL }, +}; + +int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat) +{ + unsigned int nr_retries = MAX_RECLAIM_RETRIES; + unsigned long nr_to_reclaim, nr_reclaimed = 0; + int swappiness = -1; + char *old_buf, *start; + substring_t args[MAX_OPT_ARGS]; + + if (!buf || (!memcg && !pgdat)) + return -EINVAL; + + buf = strstrip(buf); + + old_buf = buf; + nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE; + if (buf == old_buf) + return -EINVAL; + + buf = strstrip(buf); + + while ((start = strsep(&buf, " ")) != NULL) { + if (!strlen(start)) + continue; + switch (match_token(start, tokens, args)) { + case MEMORY_RECLAIM_SWAPPINESS: + if (match_int(&args[0], &swappiness)) + return -EINVAL; + if (swappiness < MIN_SWAPPINESS || + swappiness > MAX_SWAPPINESS) + return -EINVAL; + break; + case MEMORY_RECLAIM_SWAPPINESS_MAX: + swappiness = SWAPPINESS_ANON_ONLY; + break; + default: + return -EINVAL; + } + } + + while (nr_reclaimed < nr_to_reclaim) { + /* Will converge on zero, but reclaim enforces a minimum */ + unsigned long batch_size = (nr_to_reclaim - nr_reclaimed) / 4; + unsigned long reclaimed; + + if (signal_pending(current)) + return -EINTR; + + /* + * This is the final attempt, drain percpu lru caches in the + * hope of introducing more evictable pages. 
+ */ + if (!nr_retries) + lru_add_drain_all(); + + if (memcg) { + unsigned int reclaim_options; + + reclaim_options = MEMCG_RECLAIM_MAY_SWAP | + MEMCG_RECLAIM_PROACTIVE; + reclaimed = try_to_free_mem_cgroup_pages(memcg, + batch_size, GFP_KERNEL, + reclaim_options, + swappiness == -1 ? NULL : &swappiness); + } else { + return -EINVAL; + } + + if (!reclaimed && !nr_retries--) + return -EAGAIN; + + nr_reclaimed += reclaimed; + } + + return 0; +} + #endif /** From 57972c78e6780564710e20f0b2fad45114c93461 Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Mon, 23 Jun 2025 11:58:50 -0700 Subject: [PATCH 252/370] mm/vmscan: make __node_reclaim() more generic As this will be called from non page allocator paths for proactive reclaim, allow users to pass the sc and nr of pages, and adjust the return value as well. No change in semantics. Link: https://lkml.kernel.org/r/20250623185851.830632-4-dave@stgolabs.net Signed-off-by: Davidlohr Bueso Reviewed-by: Roman Gushchin Acked-by: Shakeel Butt Cc: Johannes Weiner Cc: Michal Hocko Cc: Yosry Ahmed Signed-off-by: Andrew Morton --- mm/vmscan.c | 90 +++++++++++++++++++++++++++-------------------------- 1 file changed, 46 insertions(+), 44 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 9702ee5aa65d..d165b66da796 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -7618,12 +7618,54 @@ static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat) /* * Try to free up some pages from this node through reclaim. */ -static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order) +static unsigned long __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, + unsigned long nr_pages, + struct scan_control *sc) { - /* Minimum pages needed in order to stay on node */ - const unsigned long nr_pages = 1 << order; struct task_struct *p = current; unsigned int noreclaim_flag; + unsigned long pflags; + + trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, sc->order, + sc->gfp_mask); + + cond_resched(); + psi_memstall_enter(&pflags); + delayacct_freepages_start(); + fs_reclaim_acquire(sc->gfp_mask); + /* + * We need to be able to allocate from the reserves for RECLAIM_UNMAP + */ + noreclaim_flag = memalloc_noreclaim_save(); + set_task_reclaim_state(p, &sc->reclaim_state); + + if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages || + node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B) > pgdat->min_slab_pages) { + /* + * Free memory by calling shrink node with increasing + * priorities until we have enough memory freed. 
+ */ + do { + shrink_node(pgdat, sc); + } while (sc->nr_reclaimed < nr_pages && --sc->priority >= 0); + } + + set_task_reclaim_state(p, NULL); + memalloc_noreclaim_restore(noreclaim_flag); + fs_reclaim_release(sc->gfp_mask); + delayacct_freepages_end(); + psi_memstall_leave(&pflags); + + trace_mm_vmscan_node_reclaim_end(sc->nr_reclaimed); + + return sc->nr_reclaimed; +} + +int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order) +{ + int ret; + /* Minimum pages needed in order to stay on node */ + const unsigned long nr_pages = 1 << order; struct scan_control sc = { .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX), .gfp_mask = current_gfp_context(gfp_mask), @@ -7634,46 +7676,6 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in .may_swap = 1, .reclaim_idx = gfp_zone(gfp_mask), }; - unsigned long pflags; - - trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order, - sc.gfp_mask); - - cond_resched(); - psi_memstall_enter(&pflags); - delayacct_freepages_start(); - fs_reclaim_acquire(sc.gfp_mask); - /* - * We need to be able to allocate from the reserves for RECLAIM_UNMAP - */ - noreclaim_flag = memalloc_noreclaim_save(); - set_task_reclaim_state(p, &sc.reclaim_state); - - if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages || - node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B) > pgdat->min_slab_pages) { - /* - * Free memory by calling shrink node with increasing - * priorities until we have enough memory freed. - */ - do { - shrink_node(pgdat, &sc); - } while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0); - } - - set_task_reclaim_state(p, NULL); - memalloc_noreclaim_restore(noreclaim_flag); - fs_reclaim_release(sc.gfp_mask); - delayacct_freepages_end(); - psi_memstall_leave(&pflags); - - trace_mm_vmscan_node_reclaim_end(sc.nr_reclaimed); - - return sc.nr_reclaimed >= nr_pages; -} - -int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order) -{ - int ret; /* * Node reclaim reclaims unmapped file backed pages and @@ -7708,7 +7710,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order) if (test_and_set_bit_lock(PGDAT_RECLAIM_LOCKED, &pgdat->flags)) return NODE_RECLAIM_NOSCAN; - ret = __node_reclaim(pgdat, gfp_mask, order); + ret = __node_reclaim(pgdat, gfp_mask, nr_pages, &sc) >= nr_pages; clear_bit_unlock(PGDAT_RECLAIM_LOCKED, &pgdat->flags); if (ret) From b980077899ea49cc747afe003e01ca303b00d463 Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Mon, 23 Jun 2025 11:58:51 -0700 Subject: [PATCH 253/370] mm: introduce per-node proactive reclaim interface This adds support for allowing proactive reclaim in general on a NUMA system. A per-node interface extends support for beyond a memcg-specific interface, respecting the current semantics of memory.reclaim: respecting aging LRU and not supporting artificially triggering eviction on nodes belonging to non-bottom tiers. This patch allows userspace to do: echo "512M swappiness=10" > /sys/devices/system/node/nodeX/reclaim One of the premises for this is to semantically align as best as possible with memory.reclaim. During a brief time memcg did support nodemask until 55ab834a86a9 (Revert "mm: add nodes= arg to memory.reclaim"), for which semantics around reclaim (eviction) vs demotion were not clear, rendering charging expectations to be broken. With this approach: 1. Users who do not use memcg can benefit from proactive reclaim. 
The memcg interface is not NUMA aware and there are use cases that
   focus on NUMA balancing rather than workload memory footprint.

2. Proactive reclaim on top tiers will trigger demotion, for which
   memory is still byte-addressable.  Reclaiming on the bottom nodes will
   trigger evicting to swap (the traditional sense of reclaim).  This
   follows the semantics of what is today part of the aging process on
   tiered memory, mirroring what every other form of reclaim does
   (reactive and memcg proactive reclaim).  Furthermore, per-node
   proactive reclaim is not as susceptible to the memcg charging problem
   mentioned above.

3. Unlike the nodes= arg, this interface avoids confusing semantics,
   such as what exactly the user wants when mixing top-tier and low-tier
   nodes in the nodemask.  Further, the per-node interface is less
   exposed to "free up memory in my container" usecases, where eviction
   is intended.

4. Users that *really* want to free up memory can use proactive reclaim
   on nodes known to be on the bottom tiers to force eviction in a
   natural way - higher access latencies are still better than swap.  If
   compelled, though with no guarantees and perhaps not worth the effort,
   users could also potentially follow a ladder-like approach to
   eventually free up the memory.  Alternatively, perhaps an 'evict'
   option could be added to the parameters for both memory.reclaim and
   per-node interfaces to force this action unconditionally.

[akpm@linux-foundation.org: user_proactive_reclaim(): return -EBUSY on
  PGDAT_RECLAIM_LOCKED contention, per Roman]
[dave@stgolabs.net: memcg && node is also a bogus case, per Shakeel]
Link: https://lkml.kernel.org/r/20250717235604.2atyx2aobwowpge3@offworld
Link: https://lkml.kernel.org/r/20250623185851.830632-5-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso
Acked-by: Shakeel Butt
Acked-by: Roman Gushchin
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Yosry Ahmed
Signed-off-by: Andrew Morton
---
 Documentation/ABI/stable/sysfs-devices-node |  9 ++++
 drivers/base/node.c                         |  2 +
 include/linux/swap.h                        | 16 ++++++
 mm/vmscan.c                                 | 55 ++++++++++++++++++---
 4 files changed, 75 insertions(+), 7 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index a02707cb7cbc..2d0e023f22a7 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -227,3 +227,12 @@ Contact: Jiaqi Yan
 Description:
 	Of the raw poisoned pages on a NUMA node, how many pages are
 	recovered by memory error recovery attempt.
+
+What:		/sys/devices/system/node/nodeX/reclaim
+Date:		June 2025
+Contact:	Linux Memory Management list
+Description:
+	Perform user-triggered proactive reclaim on a NUMA node.
+	This interface is equivalent to the memcg variant.
+ + See Documentation/admin-guide/cgroup-v2.rst diff --git a/drivers/base/node.c b/drivers/base/node.c index e434cb260e61..bef84f01712f 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -659,6 +659,7 @@ static int register_node(struct node *node, int num) } else { hugetlb_register_node(node); compaction_register_node(node); + reclaim_register_node(node); } return error; @@ -675,6 +676,7 @@ void unregister_node(struct node *node) { hugetlb_unregister_node(node); compaction_unregister_node(node); + reclaim_unregister_node(node); node_remove_accesses(node); node_remove_caches(node); device_unregister(&node->dev); diff --git a/include/linux/swap.h b/include/linux/swap.h index a49be950c485..95c6061fa1dc 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -431,6 +431,22 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages); extern int vm_swappiness; long remove_mapping(struct address_space *mapping, struct folio *folio); +#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA) +extern int reclaim_register_node(struct node *node); +extern void reclaim_unregister_node(struct node *node); + +#else + +static inline int reclaim_register_node(struct node *node) +{ + return 0; +} + +static inline void reclaim_unregister_node(struct node *node) +{ +} +#endif /* CONFIG_SYSFS && CONFIG_NUMA */ + #ifdef CONFIG_NUMA extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; diff --git a/mm/vmscan.c b/mm/vmscan.c index d165b66da796..19bfce93b373 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -94,10 +94,8 @@ struct scan_control { unsigned long anon_cost; unsigned long file_cost; -#ifdef CONFIG_MEMCG /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */ int *proactive_swappiness; -#endif /* Can active folios be deactivated as part of reclaim? */ #define DEACTIVATE_ANON 1 @@ -121,7 +119,7 @@ struct scan_control { /* Has cache_trim_mode failed at least once? */ unsigned int cache_trim_mode_failed:1; - /* Proactive reclaim invoked by userspace through memory.reclaim */ + /* Proactive reclaim invoked by userspace */ unsigned int proactive:1; /* @@ -7732,15 +7730,17 @@ static const match_table_t tokens = { { MEMORY_RECLAIM_NULL, NULL }, }; -int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat) +int user_proactive_reclaim(char *buf, + struct mem_cgroup *memcg, pg_data_t *pgdat) { unsigned int nr_retries = MAX_RECLAIM_RETRIES; unsigned long nr_to_reclaim, nr_reclaimed = 0; int swappiness = -1; char *old_buf, *start; substring_t args[MAX_OPT_ARGS]; + gfp_t gfp_mask = GFP_KERNEL; - if (!buf || (!memcg && !pgdat)) + if (!buf || (!memcg && !pgdat) || (memcg && pgdat)) return -EINVAL; buf = strstrip(buf); @@ -7792,11 +7792,29 @@ int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE; reclaimed = try_to_free_mem_cgroup_pages(memcg, - batch_size, GFP_KERNEL, + batch_size, gfp_mask, reclaim_options, swappiness == -1 ? NULL : &swappiness); } else { - return -EINVAL; + struct scan_control sc = { + .gfp_mask = current_gfp_context(gfp_mask), + .reclaim_idx = gfp_zone(gfp_mask), + .proactive_swappiness = swappiness == -1 ? 
NULL : &swappiness,
+			.priority = DEF_PRIORITY,
+			.may_writepage = !laptop_mode,
+			.nr_to_reclaim = max(batch_size, SWAP_CLUSTER_MAX),
+			.may_unmap = 1,
+			.may_swap = 1,
+			.proactive = 1,
+		};
+
+		if (test_and_set_bit_lock(PGDAT_RECLAIM_LOCKED,
+					  &pgdat->flags))
+			return -EBUSY;
+
+		reclaimed = __node_reclaim(pgdat, gfp_mask,
+					   batch_size, &sc);
+		clear_bit_unlock(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
 		}

 		if (!reclaimed && !nr_retries--)
@@ -7855,3 +7873,26 @@ void check_move_unevictable_folios(struct folio_batch *fbatch)
 	}
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_folios);
+
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+static ssize_t reclaim_store(struct device *dev,
+			     struct device_attribute *attr,
+			     const char *buf, size_t count)
+{
+	int ret, nid = dev->id;
+
+	ret = user_proactive_reclaim((char *)buf, NULL, NODE_DATA(nid));
+	return ret ? -EAGAIN : count;
+}
+
+static DEVICE_ATTR_WO(reclaim);
+int reclaim_register_node(struct node *node)
+{
+	return device_create_file(&node->dev, &dev_attr_reclaim);
+}
+
+void reclaim_unregister_node(struct node *node)
+{
+	return device_remove_file(&node->dev, &dev_attr_reclaim);
+}
+#endif

From 188cb385bbf04d486df3e52f28c47b3961f5f0c0 Mon Sep 17 00:00:00 2001
From: Andy Shevchenko
Date: Thu, 10 Jul 2025 11:23:53 +0300
Subject: [PATCH 254/370] mm/hmm: move pmd_to_hmm_pfn_flags() to the
 respective #ifdeffery

When pmd_to_hmm_pfn_flags() is unused, it prevents kernel builds with
clang, `make W=1` and CONFIG_TRANSPARENT_HUGEPAGE=n:

mm/hmm.c:186:29: warning: unused function 'pmd_to_hmm_pfn_flags' [-Wunused-function]

Fix this by moving the function to the respective existing ifdeffery
for its only user.

See also: 6863f5643dd7 ("kbuild: allow Clang to find unused static
inline functions for W=1 build")

Link: https://lkml.kernel.org/r/20250710082403.664093-1-andriy.shevchenko@linux.intel.com
Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem")
Signed-off-by: Andy Shevchenko
Reviewed-by: Leon Romanovsky
Reviewed-by: Alistair Popple
Cc: Andriy Shevchenko
Cc: Bill Wendling
Cc: Jerome Glisse
Cc: Justin Stitt
Cc: Nathan Chancellor
Cc:
Signed-off-by: Andrew Morton
---
 mm/hmm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index f2415b4b2cdd..d545e2494994 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -183,6 +183,7 @@ static inline unsigned long hmm_pfn_flags_order(unsigned long order)
 	return order << HMM_PFN_ORDER_SHIFT;
 }

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline unsigned long pmd_to_hmm_pfn_flags(struct hmm_range *range,
 						 pmd_t pmd)
 {
@@ -193,7 +194,6 @@ static inline unsigned long pmd_to_hmm_pfn_flags(struct hmm_range *range,
 			hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
 }

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr,
 		unsigned long end, unsigned long hmm_pfns[],
 		pmd_t pmd)

From c809579374f47c8273eaf22a40674ae547f39254 Mon Sep 17 00:00:00 2001
From: Chi Zhiling
Date: Thu, 10 Jul 2025 14:04:51 +0800
Subject: [PATCH 255/370] readahead: use folio_nr_pages() instead of shift
 operation

folio_nr_pages() is a faster helper function to get the number of pages
when NR_PAGES_IN_LARGE_FOLIO is enabled.
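
Both expressions yield the same value; the difference is that the shift
recomputes the size from the folio order, while folio_nr_pages() can
return a cached count when NR_PAGES_IN_LARGE_FOLIO is enabled (a sketch
of the equivalence, not a new API):

	nr = 1UL << folio_order(folio);	/* recompute from the order */
	nr = folio_nr_pages(folio);	/* may read a cached field */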
Link: https://lkml.kernel.org/r/20250710060451.3535957-1-chizhiling@163.com
Signed-off-by: Chi Zhiling
Acked-by: David Hildenbrand
Reviewed-by: Ryan Roberts
Cc: Matthew Wilcox (Oracle)
Signed-off-by: Andrew Morton
---
 mm/readahead.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 95a24f12d1e7..406756d34309 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -649,7 +649,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
 	 * Ramp up sizes, and push forward the readahead window.
 	 */
 	expected = round_down(ra->start + ra->size - ra->async_size,
-			1UL << folio_order(folio));
+			folio_nr_pages(folio));
 	if (index == expected) {
 		ra->start += ra->size;

 		/*

From e001ef9652c2d931a1126c9d173ec5ea95e9847a Mon Sep 17 00:00:00 2001
From: Xuanye Liu
Date: Thu, 10 Jul 2025 10:58:58 +0800
Subject: [PATCH 256/370] mm: simplify min_brk handling in brk()

Set min_brk to mm->start_brk by default, and override it with
mm->end_data only when CONFIG_COMPAT_BRK is enabled and brk_randomized
is false.  This makes the logic clearer with no functional change.

Link: https://lkml.kernel.org/r/20250710025859.926355-1-liuqiye2025@163.com
Signed-off-by: Xuanye Liu
Reviewed-by: Pedro Falcato
Reviewed-by: Lorenzo Stoakes
Cc: Jann Horn
Cc: Liam Howlett
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
---
 mm/mmap.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 74072369e8fd..1287e92ccdda 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -127,18 +127,15 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)

 	origbrk = mm->brk;

+	min_brk = mm->start_brk;
 #ifdef CONFIG_COMPAT_BRK
 	/*
 	 * CONFIG_COMPAT_BRK can still be overridden by setting
 	 * randomize_va_space to 2, which will still cause mm->start_brk
 	 * to be arbitrarily shifted
 	 */
-	if (current->brk_randomized)
-		min_brk = mm->start_brk;
-	else
+	if (!current->brk_randomized)
 		min_brk = mm->end_data;
-#else
-	min_brk = mm->start_brk;
 #endif
 	if (brk < min_brk)
 		goto out;

From 004ded6bee11b8ed463cdc54b89a4390f4b64f6d Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Sat, 12 Jul 2025 12:50:03 -0700
Subject: [PATCH 257/370] mm/damon: accept parallel damon_call() requests

Patch series "mm/damon: remove damon_callback".

damon_callback was the only way of communicating with DAMON for contexts
running on its worker thread.  The interface is flexible and simple.
But as DAMON evolves with more features, damon_callback has become
somewhat dated.  With runtime parameter updates, for example, its lack
of synchronization support was found to be inconvenient.  Arguably it is
also not easy to use correctly, since callers must understand when each
callback is invoked, and the implications of the return values from the
callbacks.

To replace it, damon_call() and damos_walk() are introduced, and those
have replaced a few damon_callback use cases.  Some use cases of
damon_callback, such as parallel or repetitive DAMON internal data
reading and additional cleanups, cannot simply be replaced by
damon_call() and damos_walk(), though.

To make those replaceable as well, extend damon_call() for parallel
and/or repeated callbacks and modify the core/ops layers for additional
resource cleanup.  With the updates, replace the remaining
damon_callback usages and finally say goodbye to damon_callback.

This patch (of 14):

Calling damon_call() while it is already serving another thread
immediately fails with -EBUSY, and the caller has to call it again
later.  Having each caller implement such retry logic would be
redundant.
Accept parallel damon_call() requests and do the wait on behalf of the
caller.

Link: https://lkml.kernel.org/r/20250712195016.151108-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250712195016.151108-2-sj@kernel.org
Signed-off-by: SeongJae Park
Signed-off-by: Andrew Morton
---
 include/linux/damon.h |  7 +++++--
 mm/damon/core.c       | 49 ++++++++++++++++++++++---------------------
 2 files changed, 30 insertions(+), 26 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 1f425d830bb9..562c7876ba88 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -673,6 +673,8 @@ struct damon_call_control {
 	struct completion completion;
 	/* informs if the kdamond canceled @fn invocation */
 	bool canceled;
+	/* List head for siblings. */
+	struct list_head list;
 };

 /**
@@ -798,8 +800,9 @@ struct damon_ctx {
 	/* for scheme quotas prioritization */
 	unsigned long *regions_score_histogram;

-	struct damon_call_control *call_control;
-	struct mutex call_control_lock;
+	/* lists of &struct damon_call_control */
+	struct list_head call_controls;
+	struct mutex call_controls_lock;

 	struct damos_walk_control *walk_control;
 	struct mutex walk_control_lock;

diff --git a/mm/damon/core.c b/mm/damon/core.c
index 2714a7a023db..b0a0b98f6889 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -533,7 +533,8 @@ struct damon_ctx *damon_new_ctx(void)
 	ctx->next_ops_update_sis = 0;

 	mutex_init(&ctx->kdamond_lock);
-	mutex_init(&ctx->call_control_lock);
+	INIT_LIST_HEAD(&ctx->call_controls);
+	mutex_init(&ctx->call_controls_lock);
 	mutex_init(&ctx->walk_control_lock);

 	ctx->attrs.min_nr_regions = 10;
@@ -1393,14 +1394,11 @@ int damon_call(struct damon_ctx *ctx, struct damon_call_control *control)
 {
 	init_completion(&control->completion);
 	control->canceled = false;
+	INIT_LIST_HEAD(&control->list);

-	mutex_lock(&ctx->call_control_lock);
-	if (ctx->call_control) {
-		mutex_unlock(&ctx->call_control_lock);
-		return -EBUSY;
-	}
-	ctx->call_control = control;
-	mutex_unlock(&ctx->call_control_lock);
+	mutex_lock(&ctx->call_controls_lock);
+	list_add_tail(&control->list, &ctx->call_controls);
+	mutex_unlock(&ctx->call_controls_lock);
 	if (!damon_is_running(ctx))
 		return -EINVAL;
 	wait_for_completion(&control->completion);
@@ -2419,11 +2417,11 @@ static void kdamond_usleep(unsigned long usecs)
 }

 /*
- * kdamond_call() - handle damon_call_control.
+ * kdamond_call() - handle damon_call_control objects.
  * @ctx:	The &struct damon_ctx of the kdamond.
  * @cancel:	Whether to cancel the invocation of the function.
  *
- * If there is a &struct damon_call_control request that registered via
+ * If there are &struct damon_call_control requests that were registered via
  * &damon_call() on @ctx, do or cancel the invocation of the function depending
  * on @cancel.  @cancel is set when the kdamond is already out of the main loop
  * and therefore will be terminated.
@@ -2433,21 +2431,24 @@ static void kdamond_call(struct damon_ctx *ctx, bool cancel) struct damon_call_control *control; int ret = 0; - mutex_lock(&ctx->call_control_lock); - control = ctx->call_control; - mutex_unlock(&ctx->call_control_lock); - if (!control) - return; - if (cancel) { - control->canceled = true; - } else { - ret = control->fn(control->data); - control->return_code = ret; + while (true) { + mutex_lock(&ctx->call_controls_lock); + control = list_first_entry_or_null(&ctx->call_controls, + struct damon_call_control, list); + mutex_unlock(&ctx->call_controls_lock); + if (!control) + return; + if (cancel) { + control->canceled = true; + } else { + ret = control->fn(control->data); + control->return_code = ret; + } + mutex_lock(&ctx->call_controls_lock); + list_del(&control->list); + mutex_unlock(&ctx->call_controls_lock); + complete(&control->completion); } - complete(&control->completion); - mutex_lock(&ctx->call_control_lock); - ctx->call_control = NULL; - mutex_unlock(&ctx->call_control_lock); } /* Returns negative error code if it's not activated but should return */ From 43df7676e5508895c33b846d37cff9cf3b52674c Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 12 Jul 2025 12:50:04 -0700 Subject: [PATCH 258/370] mm/damon/core: introduce repeat mode damon_call() damon_call() can be useful for reading or writing DAMON internal data for one time. A common pattern of DAMON core usage from DAMON modules is doing such reads and writes repeatedly, for example, to periodically update the DAMOS stats. To do that with damon_call(), callers should call damon_call() repeatedly, with their own delay loop. Each caller doing that is repetitive. Introduce a repeat mode damon_call(). Callers can use the mode by setting a new field in damon_call_control. If the mode is turned on, damon_call() returns success immediately, and DAMON repeats invoking the callback function inside the kdamond main loop. Link: https://lkml.kernel.org/r/20250712195016.151108-3-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- include/linux/damon.h | 2 ++ mm/damon/core.c | 25 ++++++++++++++++++++----- 2 files changed, 22 insertions(+), 5 deletions(-) diff --git a/include/linux/damon.h b/include/linux/damon.h index 562c7876ba88..b83987275ff9 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -659,6 +659,7 @@ struct damon_callback { * * @fn: Function to be called back. * @data: Data that will be passed to @fn. + * @repeat: Repeat invocations. * @return_code: Return code from @fn invocation. * * Control damon_call(), which requests specific kdamond to invoke a given @@ -667,6 +668,7 @@ struct damon_callback { struct damon_call_control { int (*fn)(void *data); void *data; + bool repeat; int return_code; /* private: internal use only */ /* informs if the kdamond finished handling of the request */ diff --git a/mm/damon/core.c b/mm/damon/core.c index b0a0b98f6889..ffb87497dbb5 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -1379,8 +1379,9 @@ bool damon_is_running(struct damon_ctx *ctx) * * Ask DAMON worker thread (kdamond) of @ctx to call a function with an * argument data that respectively passed via &damon_call_control->fn and - * &damon_call_control->data of @control, and wait until the kdamond finishes - * handling of the request. + * &damon_call_control->data of @control. If &damon_call_control->repeat of + * @control is set, further wait until the kdamond finishes handling of the + * request. Otherwise, return as soon as the request is made. 
* * The kdamond executes the function with the argument in the main loop, just * after a sampling of the iteration is finished. The function can hence @@ -1392,7 +1393,8 @@ bool damon_is_running(struct damon_ctx *ctx) */ int damon_call(struct damon_ctx *ctx, struct damon_call_control *control) { - init_completion(&control->completion); + if (!control->repeat) + init_completion(&control->completion); control->canceled = false; INIT_LIST_HEAD(&control->list); @@ -1401,6 +1403,8 @@ int damon_call(struct damon_ctx *ctx, struct damon_call_control *control) mutex_unlock(&ctx->call_controls_lock); if (!damon_is_running(ctx)) return -EINVAL; + if (control->repeat) + return 0; wait_for_completion(&control->completion); if (control->canceled) return -ECANCELED; @@ -2429,6 +2433,7 @@ static void kdamond_usleep(unsigned long usecs) static void kdamond_call(struct damon_ctx *ctx, bool cancel) { struct damon_call_control *control; + LIST_HEAD(repeat_controls); int ret = 0; while (true) { @@ -2437,7 +2442,7 @@ static void kdamond_call(struct damon_ctx *ctx, bool cancel) struct damon_call_control, list); mutex_unlock(&ctx->call_controls_lock); if (!control) - return; + break; if (cancel) { control->canceled = true; } else { @@ -2447,8 +2452,18 @@ static void kdamond_call(struct damon_ctx *ctx, bool cancel) mutex_lock(&ctx->call_controls_lock); list_del(&control->list); mutex_unlock(&ctx->call_controls_lock); - complete(&control->completion); + if (!control->repeat) + complete(&control->completion); + else + list_add(&control->list, &repeat_controls); } + control = list_first_entry_or_null(&repeat_controls, + struct damon_call_control, list); + if (!control || cancel) + return; + mutex_lock(&ctx->call_controls_lock); + list_add_tail(&control->list, &ctx->call_controls); + mutex_unlock(&ctx->call_controls_lock); } /* Returns negative error code if it's not activated but should return */ From 405f61996d9d2e9d497cd9f6b66f41dc28d3d1d8 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 12 Jul 2025 12:50:05 -0700 Subject: [PATCH 259/370] mm/damon/stat: use damon_call() repeat mode instead of damon_callback DAMON_STAT uses damon_callback for periodically reading DAMON internal data. Use its alternative, damon_call() repeat mode. 
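
The conversion pattern, used here and in the following patches, looks
like this (a minimal sketch; my_repeat_fn is a placeholder for the
module's periodic callback):

	static struct damon_call_control call_control = {
		.fn = my_repeat_fn,	/* hypothetical callback */
		.repeat = true,
	};

	err = damon_start(&ctx, 1, true);
	if (err)
		return err;
	call_control.data = ctx;
	/* registers the request and returns immediately */
	return damon_call(ctx, &call_control);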
Link: https://lkml.kernel.org/r/20250712195016.151108-4-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/stat.c | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/mm/damon/stat.c b/mm/damon/stat.c index b75af871627e..87bcd8866d4b 100644 --- a/mm/damon/stat.c +++ b/mm/damon/stat.c @@ -122,8 +122,9 @@ static void damon_stat_set_idletime_percentiles(struct damon_ctx *c) kfree(sorted_regions); } -static int damon_stat_after_aggregation(struct damon_ctx *c) +static int damon_stat_damon_call_fn(void *data) { + struct damon_ctx *c = data; static unsigned long last_refresh_jiffies; /* avoid unnecessarily frequent stat update */ @@ -182,19 +183,29 @@ static struct damon_ctx *damon_stat_build_ctx(void) damon_add_target(ctx, target); if (damon_set_region_biggest_system_ram_default(target, &start, &end)) goto free_out; - ctx->callback.after_aggregation = damon_stat_after_aggregation; return ctx; free_out: damon_destroy_ctx(ctx); return NULL; } +static struct damon_call_control call_control = { + .fn = damon_stat_damon_call_fn, + .repeat = true, +}; + static int damon_stat_start(void) { + int err; + damon_stat_context = damon_stat_build_ctx(); if (!damon_stat_context) return -ENOMEM; - return damon_start(&damon_stat_context, 1, true); + err = damon_start(&damon_stat_context, 1, true); + if (err) + return err; + call_control.data = damon_stat_context; + return damon_call(damon_stat_context, &call_control); } static void damon_stat_stop(void) From 5da7c7031853373f4f3208a1102bd8b27b44f8b7 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 12 Jul 2025 12:50:06 -0700 Subject: [PATCH 260/370] mm/damon/reclaim: use damon_call() repeat mode instead of damon_callback DAMON_RECLAIM uses damon_callback for periodically reading and writing DAMON internal data and parameters. Use its alternative, damon_call() repeat mode. 
Link: https://lkml.kernel.org/r/20250712195016.151108-5-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/reclaim.c | 62 +++++++++++++++++++++++----------------------- 1 file changed, 31 insertions(+), 31 deletions(-) diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c index 0fe8996328b8..3c71b4596676 100644 --- a/mm/damon/reclaim.c +++ b/mm/damon/reclaim.c @@ -238,6 +238,35 @@ static int damon_reclaim_apply_parameters(void) return err; } +static int damon_reclaim_handle_commit_inputs(void) +{ + int err; + + if (!commit_inputs) + return 0; + + err = damon_reclaim_apply_parameters(); + commit_inputs = false; + return err; +} + +static int damon_reclaim_damon_call_fn(void *arg) +{ + struct damon_ctx *c = arg; + struct damos *s; + + /* update the stats parameter */ + damon_for_each_scheme(s, c) + damon_reclaim_stat = s->stat; + + return damon_reclaim_handle_commit_inputs(); +} + +static struct damon_call_control call_control = { + .fn = damon_reclaim_damon_call_fn, + .repeat = true, +}; + static int damon_reclaim_turn(bool on) { int err; @@ -257,7 +286,7 @@ static int damon_reclaim_turn(bool on) if (err) return err; kdamond_pid = ctx->kdamond->pid; - return 0; + return damon_call(ctx, &call_control); } static int damon_reclaim_enabled_store(const char *val, @@ -296,34 +325,6 @@ module_param_cb(enabled, &enabled_param_ops, &enabled, 0600); MODULE_PARM_DESC(enabled, "Enable or disable DAMON_RECLAIM (default: disabled)"); -static int damon_reclaim_handle_commit_inputs(void) -{ - int err; - - if (!commit_inputs) - return 0; - - err = damon_reclaim_apply_parameters(); - commit_inputs = false; - return err; -} - -static int damon_reclaim_after_aggregation(struct damon_ctx *c) -{ - struct damos *s; - - /* update the stats parameter */ - damon_for_each_scheme(s, c) - damon_reclaim_stat = s->stat; - - return damon_reclaim_handle_commit_inputs(); -} - -static int damon_reclaim_after_wmarks_check(struct damon_ctx *c) -{ - return damon_reclaim_handle_commit_inputs(); -} - static int __init damon_reclaim_init(void) { int err = damon_modules_new_paddr_ctx_target(&ctx, &target); @@ -331,8 +332,7 @@ static int __init damon_reclaim_init(void) if (err) goto out; - ctx->callback.after_wmarks_check = damon_reclaim_after_wmarks_check; - ctx->callback.after_aggregation = damon_reclaim_after_aggregation; + call_control.data = ctx; /* 'enabled' has set before this function, probably via command line */ if (enabled) From 9cc8f00e527b37f001e6003ce26108fe111ec418 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 12 Jul 2025 12:50:07 -0700 Subject: [PATCH 261/370] mm/damon/lru_sort: use damon_call() repeat mode instead of damon_callback DAMON_LRU_SORT uses damon_callback for periodically reading and writing DAMON internal data and parameters. Use its alternative, damon_call() repeat mode. 
Link: https://lkml.kernel.org/r/20250712195016.151108-6-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/lru_sort.c | 70 ++++++++++++++++++++++----------------------- 1 file changed, 35 insertions(+), 35 deletions(-) diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c index 9bd8a1a115e0..151a9de5ad8b 100644 --- a/mm/damon/lru_sort.c +++ b/mm/damon/lru_sort.c @@ -230,6 +230,39 @@ static int damon_lru_sort_apply_parameters(void) return err; } +static int damon_lru_sort_handle_commit_inputs(void) +{ + int err; + + if (!commit_inputs) + return 0; + + err = damon_lru_sort_apply_parameters(); + commit_inputs = false; + return err; +} + +static int damon_lru_sort_damon_call_fn(void *arg) +{ + struct damon_ctx *c = arg; + struct damos *s; + + /* update the stats parameter */ + damon_for_each_scheme(s, c) { + if (s->action == DAMOS_LRU_PRIO) + damon_lru_sort_hot_stat = s->stat; + else if (s->action == DAMOS_LRU_DEPRIO) + damon_lru_sort_cold_stat = s->stat; + } + + return damon_lru_sort_handle_commit_inputs(); +} + +static struct damon_call_control call_control = { + .fn = damon_lru_sort_damon_call_fn, + .repeat = true, +}; + static int damon_lru_sort_turn(bool on) { int err; @@ -249,7 +282,7 @@ static int damon_lru_sort_turn(bool on) if (err) return err; kdamond_pid = ctx->kdamond->pid; - return 0; + return damon_call(ctx, &call_control); } static int damon_lru_sort_enabled_store(const char *val, @@ -288,38 +321,6 @@ module_param_cb(enabled, &enabled_param_ops, &enabled, 0600); MODULE_PARM_DESC(enabled, "Enable or disable DAMON_LRU_SORT (default: disabled)"); -static int damon_lru_sort_handle_commit_inputs(void) -{ - int err; - - if (!commit_inputs) - return 0; - - err = damon_lru_sort_apply_parameters(); - commit_inputs = false; - return err; -} - -static int damon_lru_sort_after_aggregation(struct damon_ctx *c) -{ - struct damos *s; - - /* update the stats parameter */ - damon_for_each_scheme(s, c) { - if (s->action == DAMOS_LRU_PRIO) - damon_lru_sort_hot_stat = s->stat; - else if (s->action == DAMOS_LRU_DEPRIO) - damon_lru_sort_cold_stat = s->stat; - } - - return damon_lru_sort_handle_commit_inputs(); -} - -static int damon_lru_sort_after_wmarks_check(struct damon_ctx *c) -{ - return damon_lru_sort_handle_commit_inputs(); -} - static int __init damon_lru_sort_init(void) { int err = damon_modules_new_paddr_ctx_target(&ctx, &target); @@ -327,8 +328,7 @@ static int __init damon_lru_sort_init(void) if (err) goto out; - ctx->callback.after_wmarks_check = damon_lru_sort_after_wmarks_check; - ctx->callback.after_aggregation = damon_lru_sort_after_aggregation; + call_control.data = ctx; /* 'enabled' has set before this function, probably via command line */ if (enabled) From a6c33f1054e3c717ef001d292b6f0350c6388308 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 12 Jul 2025 12:50:08 -0700 Subject: [PATCH 262/370] samples/damon/prcl: use damon_call() repeat mode instead of damon_callback prcl uses damon_callback for periodically reading DAMON internal data. Use its alternative, damon_call() repeat mode. 
Link: https://lkml.kernel.org/r/20250712195016.151108-7-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- samples/damon/prcl.c | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/samples/damon/prcl.c b/samples/damon/prcl.c index 8a312dba7691..25a751a67b2d 100644 --- a/samples/damon/prcl.c +++ b/samples/damon/prcl.c @@ -34,8 +34,9 @@ MODULE_PARM_DESC(enabled, "Enable or disable DAMON_SAMPLE_PRCL"); static struct damon_ctx *ctx; static struct pid *target_pidp; -static int damon_sample_prcl_after_aggregate(struct damon_ctx *c) +static int damon_sample_prcl_repeat_call_fn(void *data) { + struct damon_ctx *c = data; struct damon_target *t; damon_for_each_target(t, c) { @@ -51,10 +52,16 @@ static int damon_sample_prcl_after_aggregate(struct damon_ctx *c) return 0; } +static struct damon_call_control repeat_call_control = { + .fn = damon_sample_prcl_repeat_call_fn, + .repeat = true, +}; + static int damon_sample_prcl_start(void) { struct damon_target *target; struct damos *scheme; + int err; pr_info("start\n"); @@ -79,8 +86,6 @@ static int damon_sample_prcl_start(void) } target->pid = target_pidp; - ctx->callback.after_aggregation = damon_sample_prcl_after_aggregate; - scheme = damon_new_scheme( &(struct damos_access_pattern) { .min_sz_region = PAGE_SIZE, @@ -100,7 +105,12 @@ static int damon_sample_prcl_start(void) } damon_set_schemes(ctx, &scheme, 1); - return damon_start(&ctx, 1, true); + err = damon_start(&ctx, 1, true); + if (err) + return err; + + repeat_call_control.data = ctx; + return damon_call(ctx, &repeat_call_control); } static void damon_sample_prcl_stop(void) From cc9c1b8c205beda5915ce935463f15cc05ff9f49 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 12 Jul 2025 12:50:09 -0700 Subject: [PATCH 263/370] samples/damon/wsse: use damon_call() repeat mode instead of damon_callback wsse uses damon_callback for periodically reading DAMON internal data. Use its alternative, damon_call() repeat mode. 
Link: https://lkml.kernel.org/r/20250712195016.151108-8-sj@kernel.org
Signed-off-by: SeongJae Park
Signed-off-by: Andrew Morton
---
 samples/damon/wsse.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/samples/damon/wsse.c b/samples/damon/wsse.c
index d87b3b0801d2..a250e86b24a5 100644
--- a/samples/damon/wsse.c
+++ b/samples/damon/wsse.c
@@ -35,8 +35,9 @@ MODULE_PARM_DESC(enabled, "Enable or disable DAMON_SAMPLE_WSSE");
 static struct damon_ctx *ctx;
 static struct pid *target_pidp;

-static int damon_sample_wsse_after_aggregate(struct damon_ctx *c)
+static int damon_sample_wsse_repeat_call_fn(void *data)
 {
+	struct damon_ctx *c = data;
 	struct damon_target *t;

 	damon_for_each_target(t, c) {
@@ -52,9 +53,15 @@ static int damon_sample_wsse_after_aggregate(struct damon_ctx *c)
 	return 0;
 }

+static struct damon_call_control repeat_call_control = {
+	.fn = damon_sample_wsse_repeat_call_fn,
+	.repeat = true,
+};
+
 static int damon_sample_wsse_start(void)
 {
 	struct damon_target *target;
+	int err;

 	pr_info("start\n");

@@ -79,8 +86,11 @@ static int damon_sample_wsse_start(void)
 	}
 	target->pid = target_pidp;

-	ctx->callback.after_aggregation = damon_sample_wsse_after_aggregate;
-	return damon_start(&ctx, 1, true);
+	err = damon_start(&ctx, 1, true);
+	if (err)
+		return err;
+	repeat_call_control.data = ctx;
+	return damon_call(ctx, &repeat_call_control);
 }

 static void damon_sample_wsse_stop(void)

From d4614161fb6d4a56ead3e8e3e9cafa83a54747e4 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Sat, 12 Jul 2025 12:50:10 -0700
Subject: [PATCH 264/370] mm/damon/core: do not call ops.cleanup() when
 destroying targets

damon_operations.cleanup() is documented to be called on kdamond
termination, but it is also called on target destruction, which happens
for any damon_ctx destruction.  Nobody is using the callback for now,
though.  Remove the cleanup() call from the target destruction path.

Link: https://lkml.kernel.org/r/20250712195016.151108-9-sj@kernel.org
Signed-off-by: SeongJae Park
Signed-off-by: Andrew Morton
---
 mm/damon/core.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/mm/damon/core.c b/mm/damon/core.c
index ffb87497dbb5..b82a838b5a0e 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -550,11 +550,6 @@ static void damon_destroy_targets(struct damon_ctx *ctx)
 {
 	struct damon_target *t, *next_t;

-	if (ctx->ops.cleanup) {
-		ctx->ops.cleanup(ctx);
-		return;
-	}
-
 	damon_for_each_target_safe(t, next_t, ctx)
 		damon_destroy_target(t);
 }

From 7114bc5e01cf393e1fdc97e10399eb9451b6af45 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Sat, 12 Jul 2025 12:50:11 -0700
Subject: [PATCH 265/370] mm/damon/core: add cleanup_target() ops callback

Some DAMON operation sets may need additional cleanup per target.  For
example, [f]vaddr needs to put the pid of each target.  Each user and
the core logic are doing that redundantly.  Add another DAMON ops
callback that will be used for doing such cleanups in the operations
set layer.
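
As a rough sketch of the intended use (hypothetical here; the actual
vaddr wiring comes later in the series), an operations set whose targets
hold pid references could release them like this:

	static void damon_va_cleanup_target(struct damon_target *t)
	{
		put_pid(t->pid);	/* drop the per-target pid reference */
	}

	/* in its struct damon_operations initializer: */
	.cleanup_target = damon_va_cleanup_target,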
[sj@kernel.org: add kernel-doc comment for damon_operations->cleanup_target] Link: https://lkml.kernel.org/r/20250715185239.89152-2-sj@kernel.org [sj@kernel.org: remove damon_ctx->callback kernel-doc comment] Link: https://lkml.kernel.org/r/20250715185239.89152-3-sj@kernel.org Link: https://lkml.kernel.org/r/20250712195016.151108-10-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- include/linux/damon.h | 6 ++++-- mm/damon/core.c | 12 ++++++++---- mm/damon/sysfs.c | 4 ++-- mm/damon/tests/core-kunit.h | 4 ++-- mm/damon/tests/vaddr-kunit.h | 2 +- 5 files changed, 17 insertions(+), 11 deletions(-) diff --git a/include/linux/damon.h b/include/linux/damon.h index b83987275ff9..8c765e36623a 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -576,6 +576,7 @@ enum damon_ops_id { * @get_scheme_score: Get the score of a region for a scheme. * @apply_scheme: Apply a DAMON-based operation scheme. * @target_valid: Determine if the target is valid. + * @cleanup_target: Clean up each target before deallocation. * @cleanup: Clean up the context. * * DAMON can be extended for various address spaces and usages. For this, @@ -608,6 +609,7 @@ enum damon_ops_id { * filters (&struct damos_filter) that handled by itself. * @target_valid should check whether the target is still valid for the * monitoring. + * @cleanup_target is called before the target will be deallocated. * @cleanup is called from @kdamond just before its termination. */ struct damon_operations { @@ -623,6 +625,7 @@ struct damon_operations { struct damon_target *t, struct damon_region *r, struct damos *scheme, unsigned long *sz_filter_passed); bool (*target_valid)(struct damon_target *t); + void (*cleanup_target)(struct damon_target *t); void (*cleanup)(struct damon_ctx *context); }; @@ -771,7 +774,6 @@ struct damon_attrs { * Accesses to other fields must be protected by themselves. * * @ops: Set of monitoring operations for given use cases. - * @callback: Set of callbacks for monitoring events notifications. * * @adaptive_targets: Head of monitoring targets (&damon_target) list. * @schemes: Head of schemes (&damos) list. 
@@ -933,7 +935,7 @@ struct damon_target *damon_new_target(void); void damon_add_target(struct damon_ctx *ctx, struct damon_target *t); bool damon_targets_empty(struct damon_ctx *ctx); void damon_free_target(struct damon_target *t); -void damon_destroy_target(struct damon_target *t); +void damon_destroy_target(struct damon_target *t, struct damon_ctx *ctx); unsigned int damon_nr_regions(struct damon_target *t); struct damon_ctx *damon_new_ctx(void); diff --git a/mm/damon/core.c b/mm/damon/core.c index b82a838b5a0e..678c9b4e038c 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -502,8 +502,12 @@ void damon_free_target(struct damon_target *t) kfree(t); } -void damon_destroy_target(struct damon_target *t) +void damon_destroy_target(struct damon_target *t, struct damon_ctx *ctx) { + + if (ctx && ctx->ops.cleanup_target) + ctx->ops.cleanup_target(t); + damon_del_target(t); damon_free_target(t); } @@ -551,7 +555,7 @@ static void damon_destroy_targets(struct damon_ctx *ctx) struct damon_target *t, *next_t; damon_for_each_target_safe(t, next_t, ctx) - damon_destroy_target(t); + damon_destroy_target(t, ctx); } void damon_destroy_ctx(struct damon_ctx *ctx) @@ -1137,7 +1141,7 @@ static int damon_commit_targets( if (damon_target_has_pid(dst)) put_pid(dst_target->pid); - damon_destroy_target(dst_target); + damon_destroy_target(dst_target, dst); damon_for_each_scheme(s, dst) { if (s->quota.charge_target_from == dst_target) { s->quota.charge_target_from = NULL; @@ -1156,7 +1160,7 @@ static int damon_commit_targets( err = damon_commit_target(new_target, false, src_target, damon_target_has_pid(src)); if (err) { - damon_destroy_target(new_target); + damon_destroy_target(new_target, NULL); return err; } damon_add_target(dst, new_target); diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c index c0193de6fb9a..f2f9f756f5a2 100644 --- a/mm/damon/sysfs.c +++ b/mm/damon/sysfs.c @@ -1303,7 +1303,7 @@ static void damon_sysfs_destroy_targets(struct damon_ctx *ctx) damon_for_each_target_safe(t, next, ctx) { if (has_pid) put_pid(t->pid); - damon_destroy_target(t); + damon_destroy_target(t, ctx); } } @@ -1389,7 +1389,7 @@ static void damon_sysfs_before_terminate(struct damon_ctx *ctx) damon_for_each_target_safe(t, next, ctx) { put_pid(t->pid); - damon_destroy_target(t); + damon_destroy_target(t, ctx); } } diff --git a/mm/damon/tests/core-kunit.h b/mm/damon/tests/core-kunit.h index 298c67557fae..dfedfff19940 100644 --- a/mm/damon/tests/core-kunit.h +++ b/mm/damon/tests/core-kunit.h @@ -58,7 +58,7 @@ static void damon_test_target(struct kunit *test) damon_add_target(c, t); KUNIT_EXPECT_EQ(test, 1u, nr_damon_targets(c)); - damon_destroy_target(t); + damon_destroy_target(t, c); KUNIT_EXPECT_EQ(test, 0u, nr_damon_targets(c)); damon_destroy_ctx(c); @@ -310,7 +310,7 @@ static void damon_test_set_regions(struct kunit *test) KUNIT_EXPECT_EQ(test, r->ar.start, expects[expect_idx++]); KUNIT_EXPECT_EQ(test, r->ar.end, expects[expect_idx++]); } - damon_destroy_target(t); + damon_destroy_target(t, NULL); } static void damon_test_nr_accesses_to_accesses_bp(struct kunit *test) diff --git a/mm/damon/tests/vaddr-kunit.h b/mm/damon/tests/vaddr-kunit.h index 7cd944266a92..d2b37ccf2cc0 100644 --- a/mm/damon/tests/vaddr-kunit.h +++ b/mm/damon/tests/vaddr-kunit.h @@ -149,7 +149,7 @@ static void damon_do_test_apply_three_regions(struct kunit *test, KUNIT_EXPECT_EQ(test, r->ar.end, expected[i * 2 + 1]); } - damon_destroy_target(t); + damon_destroy_target(t, NULL); } /* From ff01aba6e45833395a3739fa7e8bd42251be0254 Mon Sep 17 00:00:00 2001 
From: SeongJae Park Date: Sat, 12 Jul 2025 12:50:12 -0700 Subject: [PATCH 266/370] mm/damon/vaddr: put pid in cleanup_target() Implement cleanup_target() callback for [f]vaddr, which calls put_pid() for each target that will be destroyed. Also remove redundant put_pid() calls in core, sysfs and sample modules, which were required to be done redundantly due to the lack of such self cleanup in vaddr. Link: https://lkml.kernel.org/r/20250712195016.151108-11-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/core.c | 2 -- mm/damon/sysfs.c | 10 ++-------- mm/damon/vaddr.c | 6 ++++++ samples/damon/prcl.c | 2 -- samples/damon/wsse.c | 2 -- 5 files changed, 8 insertions(+), 14 deletions(-) diff --git a/mm/damon/core.c b/mm/damon/core.c index 678c9b4e038c..9554743dc992 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -1139,8 +1139,6 @@ static int damon_commit_targets( } else { struct damos *s; - if (damon_target_has_pid(dst)) - put_pid(dst_target->pid); damon_destroy_target(dst_target, dst); damon_for_each_scheme(s, dst) { if (s->quota.charge_target_from == dst_target) { diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c index f2f9f756f5a2..5eba6ac53939 100644 --- a/mm/damon/sysfs.c +++ b/mm/damon/sysfs.c @@ -1298,13 +1298,9 @@ static int damon_sysfs_set_attrs(struct damon_ctx *ctx, static void damon_sysfs_destroy_targets(struct damon_ctx *ctx) { struct damon_target *t, *next; - bool has_pid = damon_target_has_pid(ctx); - damon_for_each_target_safe(t, next, ctx) { - if (has_pid) - put_pid(t->pid); + damon_for_each_target_safe(t, next, ctx) damon_destroy_target(t, ctx); - } } static int damon_sysfs_set_regions(struct damon_target *t, @@ -1387,10 +1383,8 @@ static void damon_sysfs_before_terminate(struct damon_ctx *ctx) if (!damon_target_has_pid(ctx)) return; - damon_for_each_target_safe(t, next, ctx) { - put_pid(t->pid); + damon_for_each_target_safe(t, next, ctx) damon_destroy_target(t, ctx); - } } /* diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c index 7f5dc9c221a0..94af19c4dfed 100644 --- a/mm/damon/vaddr.c +++ b/mm/damon/vaddr.c @@ -805,6 +805,11 @@ static bool damon_va_target_valid(struct damon_target *t) return false; } +static void damon_va_cleanup_target(struct damon_target *t) +{ + put_pid(t->pid); +} + #ifndef CONFIG_ADVISE_SYSCALLS static unsigned long damos_madvise(struct damon_target *target, struct damon_region *r, int behavior) @@ -946,6 +951,7 @@ static int __init damon_va_initcall(void) .prepare_access_checks = damon_va_prepare_access_checks, .check_accesses = damon_va_check_accesses, .target_valid = damon_va_target_valid, + .cleanup_target = damon_va_cleanup_target, .cleanup = NULL, .apply_scheme = damon_va_apply_scheme, .get_scheme_score = damon_va_scheme_score, diff --git a/samples/damon/prcl.c b/samples/damon/prcl.c index 25a751a67b2d..1b839c06a612 100644 --- a/samples/damon/prcl.c +++ b/samples/damon/prcl.c @@ -120,8 +120,6 @@ static void damon_sample_prcl_stop(void) damon_stop(&ctx, 1); damon_destroy_ctx(ctx); } - if (target_pidp) - put_pid(target_pidp); } static bool init_called; diff --git a/samples/damon/wsse.c b/samples/damon/wsse.c index a250e86b24a5..da052023b099 100644 --- a/samples/damon/wsse.c +++ b/samples/damon/wsse.c @@ -100,8 +100,6 @@ static void damon_sample_wsse_stop(void) damon_stop(&ctx, 1); damon_destroy_ctx(ctx); } - if (target_pidp) - put_pid(target_pidp); } static bool init_called; From f59ae147abb7905cc4656cee5c4c6ae7b53db75a Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 12 Jul 2025 12:50:13 -0700 Subject: 
[PATCH 267/370] mm/damon/sysfs: remove damon_sysfs_destroy_targets() The function was introduced for putting pids and deallocating unnecessary targets. Hence it is called before damon_destroy_ctx(). Now vaddr puts pid for each target destruction (cleanup_target()). damon_destroy_ctx() deallocates the targets anyway. So damon_sysfs_destroy_targets() has no reason to exist. Remove it. Link: https://lkml.kernel.org/r/20250712195016.151108-12-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/sysfs.c | 23 +++-------------------- 1 file changed, 3 insertions(+), 20 deletions(-) diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c index 5eba6ac53939..b0f7c60d655a 100644 --- a/mm/damon/sysfs.c +++ b/mm/damon/sysfs.c @@ -1295,14 +1295,6 @@ static int damon_sysfs_set_attrs(struct damon_ctx *ctx, return damon_set_attrs(ctx, &attrs); } -static void damon_sysfs_destroy_targets(struct damon_ctx *ctx) -{ - struct damon_target *t, *next; - - damon_for_each_target_safe(t, next, ctx) - damon_destroy_target(t, ctx); -} - static int damon_sysfs_set_regions(struct damon_target *t, struct damon_sysfs_regions *sysfs_regions) { @@ -1337,7 +1329,6 @@ static int damon_sysfs_add_target(struct damon_sysfs_target *sys_target, struct damon_ctx *ctx) { struct damon_target *t = damon_new_target(); - int err = -EINVAL; if (!t) return -ENOMEM; @@ -1345,16 +1336,10 @@ static int damon_sysfs_add_target(struct damon_sysfs_target *sys_target, if (damon_target_has_pid(ctx)) { t->pid = find_get_pid(sys_target->pid); if (!t->pid) - goto destroy_targets_out; + /* caller will destroy targets */ + return -EINVAL; } - err = damon_sysfs_set_regions(t, sys_target->regions); - if (err) - goto destroy_targets_out; - return 0; - -destroy_targets_out: - damon_sysfs_destroy_targets(ctx); - return err; + return damon_sysfs_set_regions(t, sys_target->regions); } static int damon_sysfs_add_targets(struct damon_ctx *ctx, @@ -1458,13 +1443,11 @@ static int damon_sysfs_commit_input(void *data) test_ctx = damon_new_ctx(); err = damon_commit_ctx(test_ctx, param_ctx); if (err) { - damon_sysfs_destroy_targets(test_ctx); damon_destroy_ctx(test_ctx); goto out; } err = damon_commit_ctx(kdamond->damon_ctx, param_ctx); out: - damon_sysfs_destroy_targets(param_ctx); damon_destroy_ctx(param_ctx); return err; } From 3a69f1635769e976151625798cc6597301150296 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 12 Jul 2025 12:50:14 -0700 Subject: [PATCH 268/370] mm/damon/core: destroy targets when kdamond_fn() finish When kdamond_fn() completes, the targets are kept. Those are kept to let callers do additional cleanups if they need. There are no such additional cleanups though. DAMON sysfs interface deallocates those in before_terminate() callback, to reduce unnecessary memory usage, for [f]vaddr use case. Just destroy the targets for every case in the core layer. This saves more memory and simplifies the logic. 
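The resulting termination flow is simply the following (a sketch of the tail of kdamond_fn() after the one-line change below; surrounding code elided):

	static int kdamond_fn(void *data)
	{
		/* ... monitoring loop and existing cleanups ... */
		running_exclusive_ctxs = false;
		mutex_unlock(&damon_lock);

		damon_destroy_targets(ctx);	/* now done in every case */
		return 0;
	}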
Link: https://lkml.kernel.org/r/20250712195016.151108-13-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/core.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/damon/core.c b/mm/damon/core.c index 9554743dc992..ffd1a061c2cb 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -2657,6 +2657,7 @@ static int kdamond_fn(void *data) running_exclusive_ctxs = false; mutex_unlock(&damon_lock); + damon_destroy_targets(ctx); return 0; } From 0c96decca508f681209d18163f06809331746280 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 12 Jul 2025 12:50:15 -0700 Subject: [PATCH 269/370] mm/damon/sysfs: remove damon_sysfs_before_terminate() DAMON core layer does target cleanup on its own. Remove duplicated and unnecessarily selective cleanup attempts in DAMON sysfs interface. Link: https://lkml.kernel.org/r/20250712195016.151108-14-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/sysfs.c | 12 ------------ 1 file changed, 12 deletions(-) diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c index b0f7c60d655a..cce2c8a296e2 100644 --- a/mm/damon/sysfs.c +++ b/mm/damon/sysfs.c @@ -1361,17 +1361,6 @@ static int damon_sysfs_add_targets(struct damon_ctx *ctx, return 0; } -static void damon_sysfs_before_terminate(struct damon_ctx *ctx) -{ - struct damon_target *t, *next; - - if (!damon_target_has_pid(ctx)) - return; - - damon_for_each_target_safe(t, next, ctx) - damon_destroy_target(t, ctx); -} - /* * damon_sysfs_upd_schemes_stats() - Update schemes stats sysfs files. * @data: The kobject wrapper that associated to the kdamond thread. @@ -1516,7 +1505,6 @@ static struct damon_ctx *damon_sysfs_build_ctx( return ERR_PTR(err); } - ctx->callback.before_terminate = damon_sysfs_before_terminate; return ctx; } From 5add26c0a18636e8e9fe409d4591c8a36e1bf695 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 12 Jul 2025 12:50:16 -0700 Subject: [PATCH 270/370] mm/damon/core: remove damon_callback All damon_callback usages are replicated by damon_call() and damos_walk(). Time to say goodbye. Remove damon_callback. Link: https://lkml.kernel.org/r/20250712195016.151108-15-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- include/linux/damon.h | 31 +------------------------------ mm/damon/core.c | 26 +++++++------------------- 2 files changed, 8 insertions(+), 49 deletions(-) diff --git a/include/linux/damon.h b/include/linux/damon.h index 8c765e36623a..f13664c62ddd 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -629,34 +629,6 @@ struct damon_operations { void (*cleanup)(struct damon_ctx *context); }; -/** - * struct damon_callback - Monitoring events notification callbacks. - * - * @after_wmarks_check: Called after each schemes' watermarks check. - * @after_aggregation: Called after each aggregation. - * @before_terminate: Called before terminating the monitoring. - * - * The monitoring thread (&damon_ctx.kdamond) calls @before_terminate just - * before finishing the monitoring. - * - * The monitoring thread calls @after_wmarks_check after each DAMON-based - * operation schemes' watermarks check. If users need to make changes to the - * attributes of the monitoring context while it's deactivated due to the - * watermarks, this is the good place to do. - * - * The monitoring thread calls @after_aggregation for each of the aggregation - * intervals. Therefore, users can safely access the monitoring results - * without additional protection. 
For the reason, users are recommended to use - * these callback for the accesses to the results. - * - * If any callback returns non-zero, monitoring stops. - */ -struct damon_callback { - int (*after_wmarks_check)(struct damon_ctx *context); - int (*after_aggregation)(struct damon_ctx *context); - void (*before_terminate)(struct damon_ctx *context); -}; - /* * struct damon_call_control - Control damon_call(). * @@ -727,7 +699,7 @@ struct damon_intervals_goal { * ``mmap()`` calls from the application, in case of virtual memory monitoring) * and applies the changes for each @ops_update_interval. All time intervals * are in micro-seconds. Please refer to &struct damon_operations and &struct - * damon_callback for more detail. + * damon_call_control for more detail. */ struct damon_attrs { unsigned long sample_interval; @@ -816,7 +788,6 @@ struct damon_ctx { struct mutex kdamond_lock; struct damon_operations ops; - struct damon_callback callback; struct list_head adaptive_targets; struct list_head schemes; diff --git a/mm/damon/core.c b/mm/damon/core.c index ffd1a061c2cb..f3ec3bd736ec 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -680,9 +680,7 @@ static bool damon_valid_intervals_goal(struct damon_attrs *attrs) * @attrs: monitoring attributes * * This function should be called while the kdamond is not running, an access - * check results aggregation is not ongoing (e.g., from &struct - * damon_callback->after_aggregation or &struct - * damon_callback->after_wmarks_check callbacks), or from damon_call(). + * check results aggregation is not ongoing (e.g., from damon_call(). * * Every time interval is in micro-seconds. * @@ -778,7 +776,7 @@ static void damos_commit_quota_goal( * DAMON contexts, instead of manual in-place updates. * * This function should be called from parameters-update safe context, like - * DAMON callbacks. + * damon_call(). */ int damos_commit_quota_goals(struct damos_quota *dst, struct damos_quota *src) { @@ -1177,7 +1175,7 @@ static int damon_commit_targets( * in-place updates. * * This function should be called from parameters-update safe context, like - * DAMON callbacks. + * damon_call(). */ int damon_commit_ctx(struct damon_ctx *dst, struct damon_ctx *src) { @@ -2484,9 +2482,6 @@ static int kdamond_wait_activation(struct damon_ctx *ctx) kdamond_usleep(min_wait_time); - if (ctx->callback.after_wmarks_check && - ctx->callback.after_wmarks_check(ctx)) - break; kdamond_call(ctx, false); damos_walk_cancel(ctx); } @@ -2543,10 +2538,9 @@ static int kdamond_fn(void *data) while (!kdamond_need_stop(ctx)) { /* * ctx->attrs and ctx->next_{aggregation,ops_update}_sis could - * be changed from after_wmarks_check() or after_aggregation() - * callbacks. Read the values here, and use those for this - * iteration. That is, damon_set_attrs() updated new values - * are respected from next iteration. + * be changed from kdamond_call(). Read the values here, and + * use those for this iteration. That is, damon_set_attrs() + * updated new values are respected from next iteration. 
*/ unsigned long next_aggregation_sis = ctx->next_aggregation_sis; unsigned long next_ops_update_sis = ctx->next_ops_update_sis; @@ -2564,14 +2558,10 @@ static int kdamond_fn(void *data) if (ctx->ops.check_accesses) max_nr_accesses = ctx->ops.check_accesses(ctx); - if (ctx->passed_sample_intervals >= next_aggregation_sis) { + if (ctx->passed_sample_intervals >= next_aggregation_sis) kdamond_merge_regions(ctx, max_nr_accesses / 10, sz_limit); - if (ctx->callback.after_aggregation && - ctx->callback.after_aggregation(ctx)) - break; - } /* * do kdamond_call() and kdamond_apply_schemes() after @@ -2637,8 +2627,6 @@ static int kdamond_fn(void *data) damon_destroy_region(r, t); } - if (ctx->callback.before_terminate) - ctx->callback.before_terminate(ctx); if (ctx->ops.cleanup) ctx->ops.cleanup(ctx); kfree(ctx->regions_score_histogram); From 5bd88fef6a0858bc80723de2e63168965054705d Mon Sep 17 00:00:00 2001 From: Thorsten Blum Date: Sat, 12 Jul 2025 19:45:17 +0200 Subject: [PATCH 271/370] mm/memfd: replace deprecated strcpy() with memcpy() in alloc_name() strcpy() is deprecated; use memcpy() instead. Not copying the NUL terminator is safe because strncpy_from_user() would overwrite it anyway by appending uname to the destination buffer at index MFD_NAME_PREFIX_LEN. No functional changes intended. Link: https://github.com/KSPP/linux/issues/88 Link: https://lkml.kernel.org/r/20250712174516.64243-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum Cc: Baolin Wang Cc: Hugh Dickins Signed-off-by: Andrew Morton --- mm/memfd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memfd.c b/mm/memfd.c index 32fa6bfe57d1..bbe679895ef6 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -411,7 +411,7 @@ static char *alloc_name(const char __user *uname) if (!name) return ERR_PTR(-ENOMEM); - strcpy(name, MFD_NAME_PREFIX); + memcpy(name, MFD_NAME_PREFIX, MFD_NAME_PREFIX_LEN); /* returned length does not include terminating zero */ len = strncpy_from_user(&name[MFD_NAME_PREFIX_LEN], uname, MFD_NAME_MAX_LEN + 1); if (len < 0) { From 9989db9f230542cfc097be3291b9457173371eb1 Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Fri, 11 Jul 2025 10:59:10 -0400 Subject: [PATCH 272/370] mm/page_owner: convert set_page_owner_migrate_reason() to folios Both callers of set_page_owner_migrate_reason() use folios. Convert the function to take a folio directly and move the &folio->page conversion inside __set_page_owner_migrate_reason(). 
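The caller-visible change is minimal; roughly (illustrative, matching the hunks below, where dst is the destination folio of a migration):

	/* before: callers had to reach into the folio */
	set_page_owner_migrate_reason(&dst->page, reason);

	/* after: the folio is passed directly */
	folio_set_owner_migrate_reason(dst, reason);

The single remaining &folio->page conversion now sits next to the page_ext lookup inside page_owner itself.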
Link: https://lkml.kernel.org/r/20250711145910.90135-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Reviewed-by: Matthew Wilcox (Oracle) Acked-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Vishal Moola (Oracle) Reviewed-by: Oscar Salvador Cc: Muchun Song Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/page_owner.h | 8 ++++---- mm/hugetlb.c | 2 +- mm/migrate.c | 2 +- mm/page_owner.c | 4 ++-- 4 files changed, 8 insertions(+), 8 deletions(-) diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h index debdc25f08b9..3328357f6dba 100644 --- a/include/linux/page_owner.h +++ b/include/linux/page_owner.h @@ -14,7 +14,7 @@ extern void __set_page_owner(struct page *page, extern void __split_page_owner(struct page *page, int old_order, int new_order); extern void __folio_copy_owner(struct folio *newfolio, struct folio *old); -extern void __set_page_owner_migrate_reason(struct page *page, int reason); +extern void __folio_set_owner_migrate_reason(struct folio *folio, int reason); extern void __dump_page_owner(const struct page *page); extern void pagetypeinfo_showmixedcount_print(struct seq_file *m, pg_data_t *pgdat, struct zone *zone); @@ -43,10 +43,10 @@ static inline void folio_copy_owner(struct folio *newfolio, struct folio *old) if (static_branch_unlikely(&page_owner_inited)) __folio_copy_owner(newfolio, old); } -static inline void set_page_owner_migrate_reason(struct page *page, int reason) +static inline void folio_set_owner_migrate_reason(struct folio *folio, int reason) { if (static_branch_unlikely(&page_owner_inited)) - __set_page_owner_migrate_reason(page, reason); + __folio_set_owner_migrate_reason(folio, reason); } static inline void dump_page_owner(const struct page *page) { @@ -68,7 +68,7 @@ static inline void split_page_owner(struct page *page, int old_order, static inline void folio_copy_owner(struct folio *newfolio, struct folio *folio) { } -static inline void set_page_owner_migrate_reason(struct page *page, int reason) +static inline void folio_set_owner_migrate_reason(struct folio *folio, int reason) { } static inline void dump_page_owner(const struct page *page) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index f13fa5aa6624..753f99b4c718 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -7835,7 +7835,7 @@ void move_hugetlb_state(struct folio *old_folio, struct folio *new_folio, int re struct hstate *h = folio_hstate(old_folio); hugetlb_cgroup_migrate(old_folio, new_folio); - set_page_owner_migrate_reason(&new_folio->page, reason); + folio_set_owner_migrate_reason(new_folio, reason); /* * transfer temporary state of the new hugetlb folio. 
This is diff --git a/mm/migrate.c b/mm/migrate.c index 36b2764204b6..425401b2d4e1 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1367,7 +1367,7 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private, out_unlock_both: folio_unlock(dst); - set_page_owner_migrate_reason(&dst->page, reason); + folio_set_owner_migrate_reason(dst, reason); /* * If migration is successful, decrease refcount of dst, * which will not free the page because new page owner increased diff --git a/mm/page_owner.c b/mm/page_owner.c index 9928c9ac8c31..c3ca21132c2c 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -333,9 +333,9 @@ noinline void __set_page_owner(struct page *page, unsigned short order, inc_stack_record_count(handle, gfp_mask, 1 << order); } -void __set_page_owner_migrate_reason(struct page *page, int reason) +void __folio_set_owner_migrate_reason(struct folio *folio, int reason) { - struct page_ext *page_ext = page_ext_get(page); + struct page_ext *page_ext = page_ext_get(&folio->page); struct page_owner *page_owner; if (unlikely(!page_ext)) From 526660b950a425b9d32798703b839e3c08cdf81b Mon Sep 17 00:00:00 2001 From: Hao Jia Date: Thu, 3 Jul 2025 10:39:46 +0800 Subject: [PATCH 273/370] mm/mglru: stop try_to_inc_min_seq() if min_seq[type] has not increased In try_to_inc_min_seq(), if min_seq[type] has not increased - in other words, if min_seq[type] == lrugen->min_seq[type] - we should return directly to avoid unnecessary overhead later. Corollary: If min_seq[type] of neither anonymous nor file has increased, try_to_inc_min_seq() will fail. Proof: It is known that min_seq[type] has not increased, that is, min_seq[type] is equal to lrugen->min_seq[type]. Then: case 1: min_seq[type] is not reassigned before the min_seq[type] <= lrugen->min_seq[type] check. The subsequent min_seq[type] <= lrugen->min_seq[type] check will then always be true. case 2: min_seq[type] is reassigned to seq before the min_seq[type] <= lrugen->min_seq[type] check. The reassignment only happens when min_seq[type] > seq; since min_seq[type] equals lrugen->min_seq[type], this means lrugen->min_seq[type] > seq, and after min_seq[type] = seq the following min_seq[type] (now seq) <= lrugen->min_seq[type] check is again always true. Therefore, in try_to_inc_min_seq(), if min_seq[type] of neither anonymous nor file has increased, we can return false directly to avoid unnecessary overhead. Link: https://lkml.kernel.org/r/20250703023946.65315-1-jiahao.kernel@gmail.com Signed-off-by: Hao Jia Suggested-by: Yuanchu Xie Reviewed-by: Axel Rasmussen Cc: David Hildenbrand Cc: Greg Thelen Cc: Johannes Weiner Cc: Kinsey Ho Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Qi Zheng Cc: Shakeel Butt Cc: Yu Zhao Signed-off-by: Andrew Morton --- mm/vmscan.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/mm/vmscan.c b/mm/vmscan.c index 19bfce93b373..0d6b8e1e95e5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3921,6 +3921,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness) { int gen, type, zone; bool success = false; + bool seq_inc_flag = false; struct lru_gen_folio *lrugen = &lruvec->lrugen; DEFINE_MIN_SEQ(lruvec); @@ -3937,11 +3938,20 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness) } min_seq[type]++; + seq_inc_flag = true; } next: ; } + /* + * If min_seq[type] of both anonymous and file is not increased, + * we can directly return false to avoid unnecessary checking + * overhead later. 
+ */ + if (!seq_inc_flag) + return success; + /* see the comment on lru_gen_folio */ if (swappiness && swappiness <= MAX_SWAPPINESS) { unsigned long seq = lrugen->max_seq - MIN_NR_GENS; From 3865301dc58aec2ab77651bdbf1e55352f50a608 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Sun, 13 Jul 2025 12:57:18 -0700 Subject: [PATCH 274/370] mm: optimize lru_note_cost() by adding lru_note_cost_unlock_irq() Dropping a lock, just to demand it again for an afterthought, cannot be good if contended: convert lru_note_cost() to lru_note_cost_unlock_irq(). [hughd@google.com: delete unneeded comment] Link: https://lkml.kernel.org/r/dbf9352a-1ed9-a021-c0c7-9309ac73e174@google.com Link: https://lkml.kernel.org/r/21100102-51b6-79d5-03db-1bb7f97fa94c@google.com Signed-off-by: Hugh Dickins Acked-by: Johannes Weiner Reviewed-by: Roman Gushchin Tested-by: Roman Gushchin Reviewed-by: Shakeel Butt Cc: David Hildenbrand Cc: Lorenzo Stoakes Cc: Michal Hocko Signed-off-by: Andrew Morton --- include/linux/swap.h | 5 +++-- mm/swap.c | 33 +++++++++++++++++++-------------- mm/vmscan.c | 8 +++----- 3 files changed, 25 insertions(+), 21 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 95c6061fa1dc..2fe6ed2cc3fd 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -376,8 +376,9 @@ extern unsigned long totalreserve_pages; /* linux/mm/swap.c */ -void lru_note_cost(struct lruvec *lruvec, bool file, - unsigned int nr_io, unsigned int nr_rotated); +void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file, + unsigned int nr_io, unsigned int nr_rotated) + __releases(lruvec->lru_lock); void lru_note_cost_refault(struct folio *); void folio_add_lru(struct folio *); void folio_add_lru_vma(struct folio *, struct vm_area_struct *); diff --git a/mm/swap.c b/mm/swap.c index 4fc322f7111a..3632dd061beb 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -237,8 +237,9 @@ void folio_rotate_reclaimable(struct folio *folio) folio_batch_add_and_move(folio, lru_move_tail, true); } -void lru_note_cost(struct lruvec *lruvec, bool file, - unsigned int nr_io, unsigned int nr_rotated) +void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file, + unsigned int nr_io, unsigned int nr_rotated) + __releases(lruvec->lru_lock) { unsigned long cost; @@ -250,18 +251,14 @@ void lru_note_cost(struct lruvec *lruvec, bool file, * different between them, adjust scan balance for CPU work. */ cost = nr_io * SWAP_CLUSTER_MAX + nr_rotated; + if (!cost) { + spin_unlock_irq(&lruvec->lru_lock); + return; + } - do { + for (;;) { unsigned long lrusize; - /* - * Hold lruvec->lru_lock is safe here, since - * 1) The pinned lruvec in reclaim, or - * 2) From a pre-LRU page during refault (which also holds the - * rcu lock, so would be safe even if the page was on the LRU - * and could move simultaneously to a new lruvec). 
- */ - spin_lock_irq(&lruvec->lru_lock); /* Record cost event */ if (file) lruvec->file_cost += cost; @@ -285,14 +282,22 @@ void lru_note_cost(struct lruvec *lruvec, bool file, lruvec->file_cost /= 2; lruvec->anon_cost /= 2; } + spin_unlock_irq(&lruvec->lru_lock); - } while ((lruvec = parent_lruvec(lruvec))); + lruvec = parent_lruvec(lruvec); + if (!lruvec) + break; + spin_lock_irq(&lruvec->lru_lock); + } } void lru_note_cost_refault(struct folio *folio) { - lru_note_cost(folio_lruvec(folio), folio_is_file_lru(folio), - folio_nr_pages(folio), 0); + struct lruvec *lruvec; + + lruvec = folio_lruvec_lock_irq(folio); + lru_note_cost_unlock_irq(lruvec, folio_is_file_lru(folio), + folio_nr_pages(folio), 0); } static void lru_activate(struct lruvec *lruvec, struct folio *folio) diff --git a/mm/vmscan.c b/mm/vmscan.c index 0d6b8e1e95e5..d71aeabdc725 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2053,9 +2053,9 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, __count_vm_events(item, nr_reclaimed); count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed); __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed); - spin_unlock_irq(&lruvec->lru_lock); - lru_note_cost(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed); + lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout, + nr_scanned - nr_reclaimed); /* * If dirty folios are scanned that are not queued for IO, it @@ -2201,10 +2201,8 @@ static void shrink_active_list(unsigned long nr_to_scan, count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate); __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); - spin_unlock_irq(&lruvec->lru_lock); - if (nr_rotated) - lru_note_cost(lruvec, file, 0, nr_rotated); + lru_note_cost_unlock_irq(lruvec, file, 0, nr_rotated); trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate, nr_deactivate, nr_rotated, sc->priority, file); } From cfea89210a888d3e51607a99ecc69b3f0c958dda Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Mon, 14 Jul 2025 14:58:39 +0100 Subject: [PATCH 275/370] mm/vma: refactor vma_modify_flags_name() to vma_modify_name() The single instance in which we use this function doesn't actually need to change VMA flags, so remove this parameter and update the caller accordingly. [lorenzo.stoakes@oracle.com: correct comment] Link: https://lkml.kernel.org/r/77f45b2e-a748-4635-9381-a5051091087f@lucifer.local Link: https://lkml.kernel.org/r/20250714135839.178032-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Acked-by: David Hildenbrand Reviewed-by: Liam R. Howlett Cc: Jann Horn Signed-off-by: Andrew Morton --- mm/madvise.c | 8 ++++---- mm/vma.c | 4 +--- mm/vma.h | 15 +++++++-------- 3 files changed, 12 insertions(+), 15 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index 1c30031ab035..2bf80989d5b6 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -144,8 +144,8 @@ static int replace_anon_vma_name(struct vm_area_struct *vma, } #endif /* CONFIG_ANON_VMA_NAME */ /* - * Update the vm_flags and/or anon_name on region of a vma, splitting it or - * merging it as necessary. Must be called with mmap_lock held for writing. + * Update the vm_flags or anon_name on region of a vma, splitting it or merging + * it as necessary. Must be called with mmap_lock held for writing. 
*/ static int madvise_update_vma(vm_flags_t new_flags, struct madvise_behavior *madv_behavior) @@ -161,8 +161,8 @@ static int madvise_update_vma(vm_flags_t new_flags, return 0; if (set_new_anon_name) - vma = vma_modify_flags_name(&vmi, madv_behavior->prev, vma, - range->start, range->end, new_flags, anon_name); + vma = vma_modify_name(&vmi, madv_behavior->prev, vma, + range->start, range->end, anon_name); else vma = vma_modify_flags(&vmi, madv_behavior->prev, vma, range->start, range->end, new_flags); diff --git a/mm/vma.c b/mm/vma.c index b3d880652359..fc502b741dcf 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -1650,17 +1650,15 @@ struct vm_area_struct *vma_modify_flags( } struct vm_area_struct -*vma_modify_flags_name(struct vma_iterator *vmi, +*vma_modify_name(struct vma_iterator *vmi, struct vm_area_struct *prev, struct vm_area_struct *vma, unsigned long start, unsigned long end, - vm_flags_t vm_flags, struct anon_vma_name *new_name) { VMG_VMA_STATE(vmg, vmi, prev, vma, start, end); - vmg.vm_flags = vm_flags; vmg.anon_name = new_name; return vma_modify(&vmg); diff --git a/mm/vma.h b/mm/vma.h index cf6e3a6371b6..acdcc515c459 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -290,15 +290,14 @@ __must_check struct vm_area_struct unsigned long start, unsigned long end, vm_flags_t vm_flags); -/* We are about to modify the VMA's flags and/or anon_name. */ +/* We are about to modify the VMA's anon_name. */ __must_check struct vm_area_struct -*vma_modify_flags_name(struct vma_iterator *vmi, - struct vm_area_struct *prev, - struct vm_area_struct *vma, - unsigned long start, - unsigned long end, - vm_flags_t vm_flags, - struct anon_vma_name *new_name); +*vma_modify_name(struct vma_iterator *vmi, + struct vm_area_struct *prev, + struct vm_area_struct *vma, + unsigned long start, + unsigned long end, + struct anon_vma_name *new_name); /* We are about to modify the VMA's memory policy. */ __must_check struct vm_area_struct From 000c0691ec6a1804244049b7908911aa5d6e866c Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Thu, 17 Jul 2025 17:55:51 +0100 Subject: [PATCH 276/370] mm/mremap: perform some simple cleanups Patch series "mm/mremap: permit mremap() move of multiple VMAs", v4. Historically we've made it a uAPI requirement that mremap() may only operate on a single VMA at a time. For instances where VMAs need to be resized, this makes sense, as it becomes very difficult to determine what a user actually wants should they indicate a desire to expand or shrink the size of multiple VMAs (truncate? Adjust sizes individually? Some other strategy?). However, in instances where a user is moving VMAs, it is restrictive to disallow this. This is especially the case when an anonymous mapping remap may or may not be mergeable depending on whether VMAs have or have not been faulted due to anon_vma assignment and folio index alignment with vma->vm_pgoff. Often this can result in surprising impact where a moved region is faulted, then moved back and a user fails to observe a merge from otherwise compatible, adjacent VMAs. This change allows such cases to work without the user having to be cognizant of whether a prior mremap() move or other VMA operations have resulted in VMA fragmentation. In order to do this, this series performs a large amount of refactoring, most pertinently: grouping sanity checks together, separating those that check input parameters from those relating to VMAs. We also simplify the post-mmap lock drop processing for uffd and mlock()'d VMAs.
With this done, we can then fairly straightforwardly implement this functionality. This works exclusively for mremap() invocations which specify MREMAP_FIXED. It is not compatible with VMAs which use userfaultfd, as the notification of the userland fault handler would require us to drop the mmap lock. It is also not compatible with file-backed mappings with customised get_unmapped_area() handlers as these may not honour MREMAP_FIXED. The input and output address ranges must not overlap. We carefully account for moves which would result in VMA iterator invalidation. While there can be gaps between VMAs in the input range, there can be no gap before the first VMA in the range. This patch (of 10): We const-ify the vrm flags parameter to indicate this will never change. We rename resize_is_valid() to remap_is_valid(), as this function does not only apply to cases where we resize, so it's simply confusing to refer to that here. We remove the BUG() from mremap_at(), as we should not BUG() unless we are certain it'll result in system instability. We rename vrm_charge() to vrm_calc_charge() to make it clear this simply calculates the charged number of pages rather than actually adjusting any state. We update the comment for vrm_implies_new_addr() to explain that MREMAP_DONTUNMAP does not require a set address, but the mapping will always be moved. Additionally consistently use 'res' rather than 'ret' for result values. No functional change intended. Link: https://lkml.kernel.org/r/cover.1752770784.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/d35ad8ce6b2c33b2f2f4ef7ec415f04a35cba34f.1752770784.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Cc: Liam Howlett Cc: Peter Xu Cc: Rik van Riel Signed-off-by: Andrew Morton --- mm/mremap.c | 55 +++++++++++++++++++++++++++++++---------------------- 1 file changed, 32 insertions(+), 23 deletions(-) diff --git a/mm/mremap.c b/mm/mremap.c index 1f5bebbb9c0c..65c7f29b6116 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -52,7 +52,7 @@ struct vma_remap_struct { unsigned long addr; /* User-specified address from which we remap. */ unsigned long old_len; /* Length of range being remapped. */ unsigned long new_len; /* Desired new length of mapping. */ - unsigned long flags; /* user-specified MREMAP_* flags. */ + const unsigned long flags; /* user-specified MREMAP_* flags. */ unsigned long new_addr; /* Optionally, desired new address. */ /* uffd state. */ @@ -909,7 +909,11 @@ static bool vrm_overlaps(struct vma_remap_struct *vrm) return false; } -/* Do the mremap() flags require that the new_addr parameter be specified? */ +/* + * Will a new address definitely be assigned? This is so either if the user + * specifies it via MREMAP_FIXED, or if MREMAP_DONTUNMAP is used, indicating + * we will always determine a target address. + */ static bool vrm_implies_new_addr(struct vma_remap_struct *vrm) { return vrm->flags & (MREMAP_FIXED | MREMAP_DONTUNMAP); @@ -955,7 +959,7 @@ static unsigned long vrm_set_new_addr(struct vma_remap_struct *vrm) * * Returns true on success, false if insufficient memory to charge. */ -static bool vrm_charge(struct vma_remap_struct *vrm) +static bool vrm_calc_charge(struct vma_remap_struct *vrm) { unsigned long charged; @@ -1260,8 +1264,11 @@ static unsigned long move_vma(struct vma_remap_struct *vrm) if (err) return err; - /* If accounted, charge the number of bytes the operation will use. 
*/ - if (!vrm_charge(vrm)) + /* + * If accounted, determine the number of bytes the operation will + * charge. + */ + if (!vrm_calc_charge(vrm)) return -ENOMEM; /* We don't want racing faults. */ @@ -1300,12 +1307,12 @@ static unsigned long move_vma(struct vma_remap_struct *vrm) } /* - * resize_is_valid() - Ensure the vma can be resized to the new length at the give - * address. + * remap_is_valid() - Ensure the VMA can be moved or resized to the new length, + * at the given address. * * Return 0 on success, error otherwise. */ -static int resize_is_valid(struct vma_remap_struct *vrm) +static int remap_is_valid(struct vma_remap_struct *vrm) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma = vrm->vma; @@ -1444,7 +1451,7 @@ static unsigned long mremap_to(struct vma_remap_struct *vrm) vrm->old_len = vrm->new_len; } - err = resize_is_valid(vrm); + err = remap_is_valid(vrm); if (err) return err; @@ -1569,7 +1576,7 @@ static unsigned long expand_vma_in_place(struct vma_remap_struct *vrm) struct vm_area_struct *vma = vrm->vma; VMA_ITERATOR(vmi, mm, vma->vm_end); - if (!vrm_charge(vrm)) + if (!vrm_calc_charge(vrm)) return -ENOMEM; /* @@ -1630,7 +1637,7 @@ static unsigned long expand_vma(struct vma_remap_struct *vrm) unsigned long err; unsigned long addr = vrm->addr; - err = resize_is_valid(vrm); + err = remap_is_valid(vrm); if (err) return err; @@ -1703,18 +1710,20 @@ static unsigned long mremap_at(struct vma_remap_struct *vrm) return expand_vma(vrm); } - BUG(); + /* Should not be possible. */ + WARN_ON_ONCE(1); + return -EINVAL; } static unsigned long do_mremap(struct vma_remap_struct *vrm) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; - unsigned long ret; + unsigned long res; - ret = check_mremap_params(vrm); - if (ret) - return ret; + res = check_mremap_params(vrm); + if (res) + return res; vrm->old_len = PAGE_ALIGN(vrm->old_len); vrm->new_len = PAGE_ALIGN(vrm->new_len); @@ -1726,41 +1735,41 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm) vma = vrm->vma = vma_lookup(mm, vrm->addr); if (!vma) { - ret = -EFAULT; + res = -EFAULT; goto out; } /* If mseal()'d, mremap() is prohibited. */ if (!can_modify_vma(vma)) { - ret = -EPERM; + res = -EPERM; goto out; } /* Align to hugetlb page size, if required. */ if (is_vm_hugetlb_page(vma) && !align_hugetlb(vrm)) { - ret = -EINVAL; + res = -EINVAL; goto out; } vrm->remap_type = vrm_remap_type(vrm); /* Actually execute mremap. */ - ret = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm); + res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm); out: if (vrm->mmap_locked) { mmap_write_unlock(mm); vrm->mmap_locked = false; - if (!offset_in_page(ret) && vrm->mlocked && vrm->new_len > vrm->old_len) + if (!offset_in_page(res) && vrm->mlocked && vrm->new_len > vrm->old_len) mm_populate(vrm->new_addr + vrm->old_len, vrm->delta); } userfaultfd_unmap_complete(mm, vrm->uf_unmap_early); - mremap_userfaultfd_complete(vrm->uf, vrm->addr, ret, vrm->old_len); + mremap_userfaultfd_complete(vrm->uf, vrm->addr, res, vrm->old_len); userfaultfd_unmap_complete(mm, vrm->uf_unmap); - return ret; + return res; } /* From 3215eaceca87625ac5ae4cc5dabfe88ba6e9b183 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Thu, 17 Jul 2025 17:55:52 +0100 Subject: [PATCH 277/370] mm/mremap: refactor initial parameter sanity checks We are currently checking some things later, and some things immediately. Aggregate the checks and avoid ones that need not be made. Simplify things by aligning lengths immediately. 
Defer setting the delta parameter until later, which removes some duplicate code in the hugetlb case. We can safely perform the checks moved from mremap_to() to check_mremap_params() because: * If we set a new address via vrm_set_new_addr(), then this is guaranteed to not overlap nor to position the new VMA past TASK_SIZE, so there's no need to check these later. * We can simply page align lengths immediately. We do not need to check for overlap nor TASK_SIZE sanity after hugetlb alignment as this asserts addresses are huge-aligned, then huge-aligns lengths, rounding down. This means any existing overlap would have already been caught. Moving things around like this lays the groundwork for subsequent changes to permit operations on batches of VMAs. No functional change intended. Link: https://lkml.kernel.org/r/c862d625c98b1abd861c406f2bfad8baf3287f83.1752770784.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Cc: Liam Howlett Cc: Peter Xu Cc: Rik van Riel Signed-off-by: Andrew Morton --- mm/mremap.c | 29 ++++++++++++++--------------- 1 file changed, 14 insertions(+), 15 deletions(-) diff --git a/mm/mremap.c b/mm/mremap.c index 65c7f29b6116..9ce20c238ffd 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -1413,14 +1413,6 @@ static unsigned long mremap_to(struct vma_remap_struct *vrm) struct mm_struct *mm = current->mm; unsigned long err; - /* Is the new length or address silly? */ - if (vrm->new_len > TASK_SIZE || - vrm->new_addr > TASK_SIZE - vrm->new_len) - return -EINVAL; - - if (vrm_overlaps(vrm)) - return -EINVAL; - if (vrm->flags & MREMAP_FIXED) { /* * In mremap_to(). @@ -1525,7 +1517,12 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm) * for DOS-emu "duplicate shm area" thing. But * a zero new-len is nonsensical. */ - if (!PAGE_ALIGN(vrm->new_len)) + if (!vrm->new_len) + return -EINVAL; + + /* Is the new length or address silly? */ + if (vrm->new_len > TASK_SIZE || + vrm->new_addr > TASK_SIZE - vrm->new_len) return -EINVAL; /* Remainder of checks are for cases with specific new_addr. */ @@ -1544,6 +1541,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm) if (flags & MREMAP_DONTUNMAP && vrm->old_len != vrm->new_len) return -EINVAL; + /* Target VMA must not overlap source VMA. */ + if (vrm_overlaps(vrm)) + return -EINVAL; + /* * move_vma() need us to stay 4 maps below the threshold, otherwise * it will bail out at the very beginning. @@ -1620,8 +1621,6 @@ static bool align_hugetlb(struct vma_remap_struct *vrm) if (vrm->new_len > vrm->old_len) return false; - vrm_set_delta(vrm); - return true; } @@ -1721,14 +1720,13 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm) struct vm_area_struct *vma; unsigned long res; + vrm->old_len = PAGE_ALIGN(vrm->old_len); + vrm->new_len = PAGE_ALIGN(vrm->new_len); + res = check_mremap_params(vrm); if (res) return res; - vrm->old_len = PAGE_ALIGN(vrm->old_len); - vrm->new_len = PAGE_ALIGN(vrm->new_len); - vrm_set_delta(vrm); - if (mmap_write_lock_killable(mm)) return -EINTR; vrm->mmap_locked = true; @@ -1751,6 +1749,7 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm) goto out; } + vrm_set_delta(vrm); vrm->remap_type = vrm_remap_type(vrm); /* Actually execute mremap. 
*/ From f256a7a4ca1aad688773fec1bd082a200395a234 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Thu, 17 Jul 2025 17:55:53 +0100 Subject: [PATCH 278/370] mm/mremap: put VMA check and prep logic into helper function Rather than lumping everything together in do_mremap(), add a new helper function, check_prep_vma(), to do the work relating to each VMA. This further lays groundwork for subsequent patches which will allow for batched VMA mremap(). Additionally, if we set vrm->new_addr == vrm->addr when prepping the VMA, this avoids us needing to do so in the expand VMA mlocked case. No functional change intended. Link: https://lkml.kernel.org/r/15efa3c57935f7f8894094b94c1803c2f322c511.1752770784.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Cc: Liam Howlett Cc: Peter Xu Cc: Rik van Riel Signed-off-by: Andrew Morton --- mm/mremap.c | 58 ++++++++++++++++++++++++++--------------------------- 1 file changed, 28 insertions(+), 30 deletions(-) diff --git a/mm/mremap.c b/mm/mremap.c index 9ce20c238ffd..60eb0ac8634b 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -1634,7 +1634,6 @@ static bool align_hugetlb(struct vma_remap_struct *vrm) static unsigned long expand_vma(struct vma_remap_struct *vrm) { unsigned long err; - unsigned long addr = vrm->addr; err = remap_is_valid(vrm); if (err) @@ -1649,16 +1648,8 @@ static unsigned long expand_vma(struct vma_remap_struct *vrm) if (err) return err; - /* - * We want to populate the newly expanded portion of the VMA to - * satisfy the expectation that mlock()'ing a VMA maintains all - * of its pages in memory. - */ - if (vrm->mlocked) - vrm->new_addr = addr; - /* OK we're done! */ - return addr; + return vrm->addr; } /* @@ -1714,10 +1705,33 @@ static unsigned long mremap_at(struct vma_remap_struct *vrm) return -EINVAL; } +static int check_prep_vma(struct vma_remap_struct *vrm) +{ + struct vm_area_struct *vma = vrm->vma; + + if (!vma) + return -EFAULT; + + /* If mseal()'d, mremap() is prohibited. */ + if (!can_modify_vma(vma)) + return -EPERM; + + /* Align to hugetlb page size, if required. */ + if (is_vm_hugetlb_page(vma) && !align_hugetlb(vrm)) + return -EINVAL; + + vrm_set_delta(vrm); + vrm->remap_type = vrm_remap_type(vrm); + /* For convenience, we set new_addr even if VMA won't move. */ + if (!vrm_implies_new_addr(vrm)) + vrm->new_addr = vrm->addr; + + return 0; +} + static unsigned long do_mremap(struct vma_remap_struct *vrm) { struct mm_struct *mm = current->mm; - struct vm_area_struct *vma; unsigned long res; vrm->old_len = PAGE_ALIGN(vrm->old_len); @@ -1731,26 +1745,10 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm) return -EINTR; vrm->mmap_locked = true; - vma = vrm->vma = vma_lookup(mm, vrm->addr); - if (!vma) { - res = -EFAULT; + vrm->vma = vma_lookup(current->mm, vrm->addr); + res = check_prep_vma(vrm); + if (res) goto out; - } - - /* If mseal()'d, mremap() is prohibited. */ - if (!can_modify_vma(vma)) { - res = -EPERM; - goto out; - } - - /* Align to hugetlb page size, if required. */ - if (is_vm_hugetlb_page(vma) && !align_hugetlb(vrm)) { - res = -EINVAL; - goto out; - } - - vrm_set_delta(vrm); - vrm->remap_type = vrm_remap_type(vrm); /* Actually execute mremap. */ res = vrm_implies_new_addr(vrm) ? 
mremap_to(vrm) : mremap_at(vrm); From e49e76c20ba10e04b257f522e5a086db43d8f1c1 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Thu, 17 Jul 2025 17:55:54 +0100 Subject: [PATCH 279/370] mm/mremap: cleanup post-processing stage of mremap Separate out the uffd bits so it's clear what's happening. Don't bother setting vrm->mmap_locked after unlocking, because after this we are done anyway. The only time we drop the mmap lock is on VMA shrink, at which point vrm->new_len will be < vrm->old_len and the operation will not be performed anyway, so move this code out of the if (vrm->mmap_locked) block. All addresses returned by mremap() are page-aligned, so the offset_in_page() check on ret seems only to be incorrectly trying to detect whether an error occurred - explicitly check for this. No functional change intended. Link: https://lkml.kernel.org/r/ebb8f29650b8e343fe98fefc67b3a61a24d1e0f1.1752770784.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Cc: Liam Howlett Cc: Peter Xu Cc: Rik van Riel Signed-off-by: Andrew Morton --- mm/mremap.c | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) diff --git a/mm/mremap.c b/mm/mremap.c index 60eb0ac8634b..53447761e55d 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -1729,6 +1729,15 @@ static int check_prep_vma(struct vma_remap_struct *vrm) return 0; } +static void notify_uffd(struct vma_remap_struct *vrm, unsigned long to) +{ + struct mm_struct *mm = current->mm; + + userfaultfd_unmap_complete(mm, vrm->uf_unmap_early); + mremap_userfaultfd_complete(vrm->uf, vrm->addr, to, vrm->old_len); + userfaultfd_unmap_complete(mm, vrm->uf_unmap); +} + static unsigned long do_mremap(struct vma_remap_struct *vrm) { struct mm_struct *mm = current->mm; @@ -1754,18 +1763,13 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm) res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm); out: - if (vrm->mmap_locked) { + if (vrm->mmap_locked) mmap_write_unlock(mm); - vrm->mmap_locked = false; - if (!offset_in_page(res) && vrm->mlocked && vrm->new_len > vrm->old_len) - mm_populate(vrm->new_addr + vrm->old_len, vrm->delta); - } - - userfaultfd_unmap_complete(mm, vrm->uf_unmap_early); - mremap_userfaultfd_complete(vrm->uf, vrm->addr, res, vrm->old_len); - userfaultfd_unmap_complete(mm, vrm->uf_unmap); + if (!IS_ERR_VALUE(res) && vrm->mlocked && vrm->new_len > vrm->old_len) + mm_populate(vrm->new_addr + vrm->old_len, vrm->delta); + notify_uffd(vrm, res); return res; } From f9f11398d4dac3c85507f31192e318b20b19af61 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Thu, 17 Jul 2025 17:55:55 +0100 Subject: [PATCH 280/370] mm/mremap: use an explicit uffd failure path for mremap Right now it appears that the code is relying upon the returned destination address having bits outside PAGE_MASK to indicate whether an error value is specified, and decrementing the increased refcount on the uffd ctx if so. This is not a safe means of determining an error value, so instead, be specific. It makes far more sense to do so in a dedicated error path, so add mremap_userfaultfd_fail() for this purpose and use this when an error arises. A vm_userfaultfd_ctx is not established until we are at the point where mremap_userfaultfd_prep() is invoked in copy_vma_and_data(), so this is a no-op until this happens. That is - uffd remap notification only occurs if the VMA is actually moved - at which point a UFFD_EVENT_REMAP event is raised. 
No errors can occur after this point currently, though it's certainly not guaranteed this will always remain the case, and we mustn't rely on this. However, the reason for needing to handle this case is that, when an error arises on a VMA move at the point of adjusting page tables, we revert this operation, and propagate the error. At this point, it is not correct to raise a uffd remap event, and we must handle it. This refactoring makes it abundantly clear what we are doing. We assume vrm->new_addr is always valid, which a prior change made the case even for mremap() invocations which don't move the VMA, however given no uffd context would be set up in this case it's immaterial to this change anyway. No functional change intended. Link: https://lkml.kernel.org/r/a70e8a1f7bce9f43d1431065b414e0f212297297.1752770784.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Cc: Liam Howlett Cc: Peter Xu Cc: Rik van Riel Signed-off-by: Andrew Morton --- fs/userfaultfd.c | 15 ++++++++++----- include/linux/userfaultfd_k.h | 5 +++++ mm/mremap.c | 16 ++++++++++++---- 3 files changed, 27 insertions(+), 9 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 2a644aa1a510..54c6cc7fe9c6 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -750,11 +750,6 @@ void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx *vm_ctx, if (!ctx) return; - if (to & ~PAGE_MASK) { - userfaultfd_ctx_put(ctx); - return; - } - msg_init(&ewq.msg); ewq.msg.event = UFFD_EVENT_REMAP; @@ -765,6 +760,16 @@ void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx *vm_ctx, userfaultfd_event_wait_completion(ctx, &ewq); } +void mremap_userfaultfd_fail(struct vm_userfaultfd_ctx *vm_ctx) +{ + struct userfaultfd_ctx *ctx = vm_ctx->ctx; + + if (!ctx) + return; + + userfaultfd_ctx_put(ctx); +} + bool userfaultfd_remove(struct vm_area_struct *vma, unsigned long start, unsigned long end) { diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index df85330bcfa6..c0e716aec26a 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -259,6 +259,7 @@ extern void mremap_userfaultfd_prep(struct vm_area_struct *, extern void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx *, unsigned long from, unsigned long to, unsigned long len); +void mremap_userfaultfd_fail(struct vm_userfaultfd_ctx *); extern bool userfaultfd_remove(struct vm_area_struct *vma, unsigned long start, @@ -371,6 +372,10 @@ static inline void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx *ctx, { } +static inline void mremap_userfaultfd_fail(struct vm_userfaultfd_ctx *ctx) +{ +} + static inline bool userfaultfd_remove(struct vm_area_struct *vma, unsigned long start, unsigned long end) diff --git a/mm/mremap.c b/mm/mremap.c index 53447761e55d..db7e773d0884 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -1729,12 +1729,17 @@ static int check_prep_vma(struct vma_remap_struct *vrm) return 0; } -static void notify_uffd(struct vma_remap_struct *vrm, unsigned long to) +static void notify_uffd(struct vma_remap_struct *vrm, bool failed) { struct mm_struct *mm = current->mm; + /* Regardless of success/failure, we always notify of any unmaps. 
*/ userfaultfd_unmap_complete(mm, vrm->uf_unmap_early); - mremap_userfaultfd_complete(vrm->uf, vrm->addr, to, vrm->old_len); + if (failed) + mremap_userfaultfd_fail(vrm->uf); + else + mremap_userfaultfd_complete(vrm->uf, vrm->addr, + vrm->new_addr, vrm->old_len); userfaultfd_unmap_complete(mm, vrm->uf_unmap); } @@ -1742,6 +1747,7 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm) { struct mm_struct *mm = current->mm; unsigned long res; + bool failed; vrm->old_len = PAGE_ALIGN(vrm->old_len); vrm->new_len = PAGE_ALIGN(vrm->new_len); @@ -1763,13 +1769,15 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm) res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm); out: + failed = IS_ERR_VALUE(res); + if (vrm->mmap_locked) mmap_write_unlock(mm); - if (!IS_ERR_VALUE(res) && vrm->mlocked && vrm->new_len > vrm->old_len) + if (!failed && vrm->mlocked && vrm->new_len > vrm->old_len) mm_populate(vrm->new_addr + vrm->old_len, vrm->delta); - notify_uffd(vrm, res); + notify_uffd(vrm, failed); return res; } From a85dc37186a5842580963496e9718f1ec5bcc0e0 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Thu, 17 Jul 2025 17:55:56 +0100 Subject: [PATCH 281/370] mm/mremap: check remap conditions earlier When we expand or move a VMA, this requires a number of additional checks to be performed. Make it really obvious under what circumstances these checks must be performed and aggregate all the checks in one place by invoking this in check_prep_vma(). We have to adjust the checks to account for shrink + move operations by checking new_len <= old_len rather than new_len == old_len. No functional change intended. [lorenzo.stoakes@oracle.com: allow undocumented mremap() shrink behaviour] Link: https://lkml.kernel.org/r/8fc92a38-c636-465e-9a2f-2c6ac9cb49b8@lucifer.local Link: https://lkml.kernel.org/r/8b4161ce074901e00602a446d81f182db92b0430.1752770784.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Cc: Liam Howlett Cc: Peter Xu Cc: Rik van Riel Signed-off-by: Andrew Morton --- mm/mremap.c | 33 +++++++++++++++++++++++++-------- 1 file changed, 25 insertions(+), 8 deletions(-) diff --git a/mm/mremap.c b/mm/mremap.c index db7e773d0884..11a8321a90b8 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -1339,6 +1339,13 @@ static int remap_is_valid(struct vma_remap_struct *vrm) (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))) return -EINVAL; + /* + * We permit crossing of boundaries for the range being unmapped due to + * a shrink. + */ + if (vrm->remap_type == MREMAP_SHRINK) + old_len = new_len; + /* We can't remap across vm area boundaries */ if (old_len > vma->vm_end - addr) return -EFAULT; @@ -1443,10 +1450,6 @@ static unsigned long mremap_to(struct vma_remap_struct *vrm) vrm->old_len = vrm->new_len; } - err = remap_is_valid(vrm); - if (err) - return err; - /* MREMAP_DONTUNMAP expands by old_len since old_len == new_len */ if (vrm->flags & MREMAP_DONTUNMAP) { vm_flags_t vm_flags = vrm->vma->vm_flags; @@ -1635,10 +1638,6 @@ static unsigned long expand_vma(struct vma_remap_struct *vrm) { unsigned long err; - err = remap_is_valid(vrm); - if (err) - return err; - /* * [addr, old_len) spans precisely to the end of the VMA, so try to * expand it in-place. @@ -1705,6 +1704,21 @@ static unsigned long mremap_at(struct vma_remap_struct *vrm) return -EINVAL; } +/* + * Will this operation result in the VMA being expanded or moved and thus need + * to map a new portion of virtual address space? 
+ */ +static bool vrm_will_map_new(struct vma_remap_struct *vrm) +{ + if (vrm->remap_type == MREMAP_EXPAND) + return true; + + if (vrm_implies_new_addr(vrm)) + return true; + + return false; +} + static int check_prep_vma(struct vma_remap_struct *vrm) { struct vm_area_struct *vma = vrm->vma; @@ -1726,6 +1740,9 @@ static int check_prep_vma(struct vma_remap_struct *vrm) if (!vrm_implies_new_addr(vrm)) vrm->new_addr = vrm->addr; + if (vrm_will_map_new(vrm)) + return remap_is_valid(vrm); + return 0; } From 9b2301bf8d65ede6038353086a24399386e2d815 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Thu, 17 Jul 2025 17:55:57 +0100 Subject: [PATCH 282/370] mm/mremap: move remap_is_valid() into check_prep_vma() Group parameter check logic together, moving check_mremap_params() next to it. This puts all such checks into a single place, and invokes them early so we can simply bail out as soon as we are aware that a condition is not met. No functional change intended. Link: https://lkml.kernel.org/r/4d0669c23531629d8ead42aa701c6237bd6bf012.1752770784.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Cc: Liam Howlett Cc: Peter Xu Cc: Rik van Riel Signed-off-by: Andrew Morton --- mm/mremap.c | 289 +++++++++++++++++++++++++--------------------------- 1 file changed, 139 insertions(+), 150 deletions(-) diff --git a/mm/mremap.c b/mm/mremap.c index 11a8321a90b8..7b688bce9002 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -1306,71 +1306,6 @@ static unsigned long move_vma(struct vma_remap_struct *vrm) return err ? (unsigned long)err : vrm->new_addr; } -/* - * remap_is_valid() - Ensure the VMA can be moved or resized to the new length, - * at the given address. - * - * Return 0 on success, error otherwise. - */ -static int remap_is_valid(struct vma_remap_struct *vrm) -{ - struct mm_struct *mm = current->mm; - struct vm_area_struct *vma = vrm->vma; - unsigned long addr = vrm->addr; - unsigned long old_len = vrm->old_len; - unsigned long new_len = vrm->new_len; - unsigned long pgoff; - - /* - * !old_len is a special case where an attempt is made to 'duplicate' - * a mapping. This makes no sense for private mappings as it will - * instead create a fresh/new mapping unrelated to the original. This - * is contrary to the basic idea of mremap which creates new mappings - * based on the original. There are no known use cases for this - * behavior. As a result, fail such attempts. - */ - if (!old_len && !(vma->vm_flags & (VM_SHARED | VM_MAYSHARE))) { - pr_warn_once("%s (%d): attempted to duplicate a private mapping with mremap. This is not supported.\n", - current->comm, current->pid); - return -EINVAL; - } - - if ((vrm->flags & MREMAP_DONTUNMAP) && - (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))) - return -EINVAL; - - /* - * We permit crossing of boundaries for the range being unmapped due to - * a shrink. 
- */ - if (vrm->remap_type == MREMAP_SHRINK) - old_len = new_len; - - /* We can't remap across vm area boundaries */ - if (old_len > vma->vm_end - addr) - return -EFAULT; - - if (new_len == old_len) - return 0; - - /* Need to be careful about a growing mapping */ - pgoff = (addr - vma->vm_start) >> PAGE_SHIFT; - pgoff += vma->vm_pgoff; - if (pgoff + (new_len >> PAGE_SHIFT) < pgoff) - return -EINVAL; - - if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)) - return -EFAULT; - - if (!mlock_future_ok(mm, vma->vm_flags, vrm->delta)) - return -EAGAIN; - - if (!may_expand_vm(mm, vma->vm_flags, vrm->delta >> PAGE_SHIFT)) - return -ENOMEM; - - return 0; -} - /* * The user has requested that the VMA be shrunk (i.e., old_len > new_len), so * execute this, optionally dropping the mmap lock when we do so. @@ -1497,77 +1432,6 @@ static bool vrm_can_expand_in_place(struct vma_remap_struct *vrm) return true; } -/* - * Are the parameters passed to mremap() valid? If so return 0, otherwise return - * error. - */ -static unsigned long check_mremap_params(struct vma_remap_struct *vrm) - -{ - unsigned long addr = vrm->addr; - unsigned long flags = vrm->flags; - - /* Ensure no unexpected flag values. */ - if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP)) - return -EINVAL; - - /* Start address must be page-aligned. */ - if (offset_in_page(addr)) - return -EINVAL; - - /* - * We allow a zero old-len as a special case - * for DOS-emu "duplicate shm area" thing. But - * a zero new-len is nonsensical. - */ - if (!vrm->new_len) - return -EINVAL; - - /* Is the new length or address silly? */ - if (vrm->new_len > TASK_SIZE || - vrm->new_addr > TASK_SIZE - vrm->new_len) - return -EINVAL; - - /* Remainder of checks are for cases with specific new_addr. */ - if (!vrm_implies_new_addr(vrm)) - return 0; - - /* The new address must be page-aligned. */ - if (offset_in_page(vrm->new_addr)) - return -EINVAL; - - /* A fixed address implies a move. */ - if (!(flags & MREMAP_MAYMOVE)) - return -EINVAL; - - /* MREMAP_DONTUNMAP does not allow resizing in the process. */ - if (flags & MREMAP_DONTUNMAP && vrm->old_len != vrm->new_len) - return -EINVAL; - - /* Target VMA must not overlap source VMA. */ - if (vrm_overlaps(vrm)) - return -EINVAL; - - /* - * move_vma() need us to stay 4 maps below the threshold, otherwise - * it will bail out at the very beginning. - * That is a problem if we have already unmaped the regions here - * (new_addr, and old_addr), because userspace will not know the - * state of the vma's after it gets -ENOMEM. - * So, to avoid such scenario we can pre-compute if the whole - * operation has high chances to success map-wise. - * Worst-scenario case is when both vma's (new_addr and old_addr) get - * split in 3 before unmapping it. - * That means 2 more maps (1 for each) to the ones we already hold. - * Check whether current map count plus 2 still leads us to 4 maps below - * the threshold, otherwise return -ENOMEM here to be more safe. - */ - if ((current->mm->map_count + 2) >= sysctl_max_map_count - 3) - return -ENOMEM; - - return 0; -} - /* * We know we can expand the VMA in-place by delta pages, so do so. * @@ -1719,9 +1583,26 @@ static bool vrm_will_map_new(struct vma_remap_struct *vrm) return false; } +static void notify_uffd(struct vma_remap_struct *vrm, bool failed) +{ + struct mm_struct *mm = current->mm; + + /* Regardless of success/failure, we always notify of any unmaps. 
*/ + userfaultfd_unmap_complete(mm, vrm->uf_unmap_early); + if (failed) + mremap_userfaultfd_fail(vrm->uf); + else + mremap_userfaultfd_complete(vrm->uf, vrm->addr, + vrm->new_addr, vrm->old_len); + userfaultfd_unmap_complete(mm, vrm->uf_unmap); +} + static int check_prep_vma(struct vma_remap_struct *vrm) { struct vm_area_struct *vma = vrm->vma; + struct mm_struct *mm = current->mm; + unsigned long addr = vrm->addr; + unsigned long old_len, new_len, pgoff; if (!vma) return -EFAULT; @@ -1738,26 +1619,134 @@ static int check_prep_vma(struct vma_remap_struct *vrm) vrm->remap_type = vrm_remap_type(vrm); /* For convenience, we set new_addr even if VMA won't move. */ if (!vrm_implies_new_addr(vrm)) - vrm->new_addr = vrm->addr; + vrm->new_addr = addr; - if (vrm_will_map_new(vrm)) - return remap_is_valid(vrm); + /* Below only meaningful if we expand or move a VMA. */ + if (!vrm_will_map_new(vrm)) + return 0; + + old_len = vrm->old_len; + new_len = vrm->new_len; + + /* + * !old_len is a special case where an attempt is made to 'duplicate' + * a mapping. This makes no sense for private mappings as it will + * instead create a fresh/new mapping unrelated to the original. This + * is contrary to the basic idea of mremap which creates new mappings + * based on the original. There are no known use cases for this + * behavior. As a result, fail such attempts. + */ + if (!old_len && !(vma->vm_flags & (VM_SHARED | VM_MAYSHARE))) { + pr_warn_once("%s (%d): attempted to duplicate a private mapping with mremap. This is not supported.\n", + current->comm, current->pid); + return -EINVAL; + } + + if ((vrm->flags & MREMAP_DONTUNMAP) && + (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))) + return -EINVAL; + + /* + * We permit crossing of boundaries for the range being unmapped due to + * a shrink. + */ + if (vrm->remap_type == MREMAP_SHRINK) + old_len = new_len; + + /* We can't remap across vm area boundaries */ + if (old_len > vma->vm_end - addr) + return -EFAULT; + + if (new_len == old_len) + return 0; + + /* Need to be careful about a growing mapping */ + pgoff = (addr - vma->vm_start) >> PAGE_SHIFT; + pgoff += vma->vm_pgoff; + if (pgoff + (new_len >> PAGE_SHIFT) < pgoff) + return -EINVAL; + + if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)) + return -EFAULT; + + if (!mlock_future_ok(mm, vma->vm_flags, vrm->delta)) + return -EAGAIN; + + if (!may_expand_vm(mm, vma->vm_flags, vrm->delta >> PAGE_SHIFT)) + return -ENOMEM; return 0; } -static void notify_uffd(struct vma_remap_struct *vrm, bool failed) -{ - struct mm_struct *mm = current->mm; +/* + * Are the parameters passed to mremap() valid? If so return 0, otherwise return + * error. + */ +static unsigned long check_mremap_params(struct vma_remap_struct *vrm) - /* Regardless of success/failure, we always notify of any unmaps. */ - userfaultfd_unmap_complete(mm, vrm->uf_unmap_early); - if (failed) - mremap_userfaultfd_fail(vrm->uf); - else - mremap_userfaultfd_complete(vrm->uf, vrm->addr, - vrm->new_addr, vrm->old_len); - userfaultfd_unmap_complete(mm, vrm->uf_unmap); +{ + unsigned long addr = vrm->addr; + unsigned long flags = vrm->flags; + + /* Ensure no unexpected flag values. */ + if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP)) + return -EINVAL; + + /* Start address must be page-aligned. */ + if (offset_in_page(addr)) + return -EINVAL; + + /* + * We allow a zero old-len as a special case + * for DOS-emu "duplicate shm area" thing. But + * a zero new-len is nonsensical. 
+ */ + if (!vrm->new_len) + return -EINVAL; + + /* Is the new length or address silly? */ + if (vrm->new_len > TASK_SIZE || + vrm->new_addr > TASK_SIZE - vrm->new_len) + return -EINVAL; + + /* Remainder of checks are for cases with specific new_addr. */ + if (!vrm_implies_new_addr(vrm)) + return 0; + + /* The new address must be page-aligned. */ + if (offset_in_page(vrm->new_addr)) + return -EINVAL; + + /* A fixed address implies a move. */ + if (!(flags & MREMAP_MAYMOVE)) + return -EINVAL; + + /* MREMAP_DONTUNMAP does not allow resizing in the process. */ + if (flags & MREMAP_DONTUNMAP && vrm->old_len != vrm->new_len) + return -EINVAL; + + /* Target VMA must not overlap source VMA. */ + if (vrm_overlaps(vrm)) + return -EINVAL; + + /* + * move_vma() need us to stay 4 maps below the threshold, otherwise + * it will bail out at the very beginning. + * That is a problem if we have already unmaped the regions here + * (new_addr, and old_addr), because userspace will not know the + * state of the vma's after it gets -ENOMEM. + * So, to avoid such scenario we can pre-compute if the whole + * operation has high chances to success map-wise. + * Worst-scenario case is when both vma's (new_addr and old_addr) get + * split in 3 before unmapping it. + * That means 2 more maps (1 for each) to the ones we already hold. + * Check whether current map count plus 2 still leads us to 4 maps below + * the threshold, otherwise return -ENOMEM here to be more safe. + */ + if ((current->mm->map_count + 2) >= sysctl_max_map_count - 3) + return -ENOMEM; + + return 0; } static unsigned long do_mremap(struct vma_remap_struct *vrm) From 2cf442d74216bbd441c9446edfefb137804e1739 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Thu, 17 Jul 2025 17:55:58 +0100 Subject: [PATCH 283/370] mm/mremap: clean up mlock populate behaviour When an mlock()'d VMA is expanded, we need to populate the expanded region to maintain the contract that all mlock()'d memory is present (albeit - with some period after mmap unlock where the expanded part of the mapping remains unfaulted). The current implementation is very unclear, so make it absolutely explicit under what circumstances we do this. Link: https://lkml.kernel.org/r/2358b0006baa9cab83db4259817794f16fe1992e.1752770784.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Cc: Liam Howlett Cc: Peter Xu Cc: Rik van Riel Signed-off-by: Andrew Morton --- mm/mremap.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/mm/mremap.c b/mm/mremap.c index 7b688bce9002..3cd90b52b750 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -65,7 +65,7 @@ struct vma_remap_struct { /* Internal state, determined in do_mremap(). */ unsigned long delta; /* Absolute delta of old_len,new_len. */ - bool mlocked; /* Was the VMA mlock()'d? */ + bool populate_expand; /* mlock()'d expanded, must populate. */ enum mremap_type remap_type; /* expand, shrink, etc. */ bool mmap_locked; /* Is mm currently write-locked? */ unsigned long charged; /* If VM_ACCOUNT, # pages to account. 
 */
@@ -1010,10 +1010,8 @@ static void vrm_stat_account(struct vma_remap_struct *vrm,
 	struct vm_area_struct *vma = vrm->vma;
 
 	vm_stat_account(mm, vma->vm_flags, pages);
-	if (vma->vm_flags & VM_LOCKED) {
+	if (vma->vm_flags & VM_LOCKED)
 		mm->locked_vm += pages;
-		vrm->mlocked = true;
-	}
 }
 
 /*
@@ -1660,6 +1658,10 @@ static int check_prep_vma(struct vma_remap_struct *vrm)
 	if (new_len == old_len)
 		return 0;
 
+	/* We are expanding and the VMA is mlock()'d so we need to populate. */
+	if (vma->vm_flags & VM_LOCKED)
+		vrm->populate_expand = true;
+
 	/* Need to be careful about a growing mapping */
 	pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
 	pgoff += vma->vm_pgoff;
@@ -1780,7 +1782,8 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm)
 	if (vrm->mmap_locked)
 		mmap_write_unlock(mm);
 
-	if (!failed && vrm->mlocked && vrm->new_len > vrm->old_len)
+	/* VMA mlock'd + was expanded, so populate the expanded region. */
+	if (!failed && vrm->populate_expand)
 		mm_populate(vrm->new_addr + vrm->old_len, vrm->delta);
 
 	notify_uffd(vrm, failed);

From d23cb648e3651b47007d4b17376d0af1fa98515e Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes
Date: Thu, 17 Jul 2025 17:55:59 +0100
Subject: [PATCH 284/370] mm/mremap: permit mremap() move of multiple VMAs

Historically we've made it a uAPI requirement that mremap() may only
operate on a single VMA at a time.

For instances where VMAs need to be resized, this makes sense, as it
becomes very difficult to determine what a user actually wants should
they indicate a desire to expand or shrink the size of multiple VMAs
(truncate?  Adjust sizes individually?  Some other strategy?).

However, in instances where a user is moving VMAs, it is restrictive to
disallow this.

This is especially the case when anonymous mapping remap may or may not
be mergeable depending on whether VMAs have or have not been faulted due
to anon_vma assignment and folio index alignment with vma->vm_pgoff.

Often this can result in surprising impact where a moved region is
faulted, then moved back and a user fails to observe a merge from
otherwise compatible, adjacent VMAs.

This change allows such cases to work without the user having to be
cognizant of whether a prior mremap() move or other VMA operations have
resulted in VMA fragmentation.

We only permit this for mremap() operations that do NOT change the size
of the VMA and DO specify MREMAP_MAYMOVE | MREMAP_FIXED.

Should no VMA exist in the range, -EFAULT is returned as usual.

If a VMA move spans a single VMA - then there is no functional change.

Otherwise, we place additional requirements upon VMAs:

* They must not have a userfaultfd context associated with them - this
  requires dropping the lock to notify users, and we want to perform the
  operation with the mmap write lock held throughout.

* If file-backed, they cannot have a custom get_unmapped_area handler -
  this might result in MREMAP_FIXED not being honoured, which could
  result in unexpected positioning of VMAs in the moved region.

There may be gaps in the range of VMAs that are moved:

      X  Y                            X  Y
    <--->  <->                      <--->  <->
  |-------|  |-----| |-----|      |-------|  |-----| |-----|
  |   A   |  |  B  | |  C  | ---> |   A'  |  |  B' | |  C' |
  |-------|  |-----| |-----|      |-------|  |-----| |-----|
     addr                            new_addr

The move will preserve the gaps between each VMA.

Note that any failures encountered will result in a partial move.  Since
an mremap() can fail at any time, this might result in only some of the
VMAs being moved.
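
As a concrete illustration, a hypothetical userspace sketch of the newly
permitted operation (src, dst and len are invented names, with src
assumed to span several VMAs separated by gaps):

	#define _GNU_SOURCE
	#include <sys/mman.h>

	/*
	 * Move [src, src + len), which may span multiple VMAs and the
	 * gaps between them, to the fixed address dst. The size must be
	 * unchanged - resizing multiple VMAs remains disallowed.
	 */
	static void *move_range(void *src, void *dst, size_t len)
	{
		return mremap(src, len, len,
			      MREMAP_MAYMOVE | MREMAP_FIXED, dst);
	}

On failure MAP_FAILED is returned and, as described above, a partial
move may remain.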
Note that failures are very rare and typically require an out-of-memory
condition or a mapping limit condition to be hit, assuming the VMAs being
moved are valid.

We don't try to assess ahead of time whether VMAs are valid according to
the multi VMA rules, as it would be rather unusual for a user to mix
uffd-enabled VMAs and/or VMAs which map unusual driver mappings that
specify custom get_unmapped_area() handlers in an aggregate operation.

So we optimise for the far, far more likely case of the operation being
entirely permissible.

In the case of the move of a single VMA, the above conditions are
permitted.  This makes the behaviour identical for a single VMA as before.

Link: https://lkml.kernel.org/r/8cab2f2c202c4208bdfdb562635748bea6eb37bf.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes
Reviewed-by: Vlastimil Babka
Cc: Al Viro
Cc: Christian Brauner
Cc: Jan Kara
Cc: Jann Horn
Cc: Liam Howlett
Cc: Peter Xu
Cc: Rik van Riel
Signed-off-by: Andrew Morton
---
 mm/mremap.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 152 insertions(+), 7 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 3cd90b52b750..e15cf2e444c7 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -69,6 +69,7 @@ struct vma_remap_struct {
 	enum mremap_type remap_type;	/* expand, shrink, etc. */
 	bool mmap_locked;		/* Is mm currently write-locked? */
 	unsigned long charged;		/* If VM_ACCOUNT, # pages to account. */
+	bool vmi_needs_invalidate;	/* Is the VMA iterator invalidated? */
 };
 
 static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
@@ -1111,6 +1112,7 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
 	err = do_vmi_munmap(&vmi, mm, addr, len, vrm->uf_unmap,
 			    /* unlock= */false);
 	vrm->vma = NULL; /* Invalidated. */
+	vrm->vmi_needs_invalidate = true;
 	if (err) {
 		/* OOM: unable to split vma, just get accounts right */
 		vm_acct_memory(len >> PAGE_SHIFT);
@@ -1186,6 +1188,10 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
 		*new_vma_ptr = NULL;
 		return -ENOMEM;
 	}
+	/* By merging, we may have invalidated any iterator in use. */
+	if (vma != vrm->vma)
+		vrm->vmi_needs_invalidate = true;
+
 	vrm->vma = vma;
 	pmc.old = vma;
 	pmc.new = new_vma;
@@ -1362,6 +1368,7 @@ static unsigned long mremap_to(struct vma_remap_struct *vrm)
 		err = do_munmap(mm, vrm->new_addr, vrm->new_len,
 				vrm->uf_unmap_early);
 		vrm->vma = NULL; /* Invalidated. */
+		vrm->vmi_needs_invalidate = true;
 		if (err)
 			return err;
 
@@ -1581,6 +1588,18 @@ static bool vrm_will_map_new(struct vma_remap_struct *vrm)
 	return false;
 }
 
+/* Does this remap ONLY move mappings? */
+static bool vrm_move_only(struct vma_remap_struct *vrm)
+{
+	if (!(vrm->flags & MREMAP_FIXED))
+		return false;
+
+	if (vrm->old_len != vrm->new_len)
+		return false;
+
+	return true;
+}
+
 static void notify_uffd(struct vma_remap_struct *vrm, bool failed)
 {
 	struct mm_struct *mm = current->mm;
@@ -1595,6 +1614,32 @@ static void notify_uffd(struct vma_remap_struct *vrm, bool failed)
 	userfaultfd_unmap_complete(mm, vrm->uf_unmap);
 }
 
+static bool vma_multi_allowed(struct vm_area_struct *vma)
+{
+	struct file *file;
+
+	/*
+	 * We can't support moving multiple uffd VMAs as notify requires
+	 * mmap lock to be dropped.
+	 */
+	if (userfaultfd_armed(vma))
+		return false;
+
+	/*
+	 * Custom get unmapped area might result in MREMAP_FIXED not
+	 * being obeyed.
+ */ + file = vma->vm_file; + if (file && !vma_is_shmem(vma) && !is_vm_hugetlb_page(vma)) { + const struct file_operations *fop = file->f_op; + + if (fop->get_unmapped_area) + return false; + } + + return true; +} + static int check_prep_vma(struct vma_remap_struct *vrm) { struct vm_area_struct *vma = vrm->vma; @@ -1651,7 +1696,19 @@ static int check_prep_vma(struct vma_remap_struct *vrm) if (vrm->remap_type == MREMAP_SHRINK) old_len = new_len; - /* We can't remap across vm area boundaries */ + /* + * We can't remap across the end of VMAs, as another VMA may be + * adjacent: + * + * addr vma->vm_end + * |-----.----------| + * | . | + * |-----.----------| + * .<--------->xxx> + * old_len + * + * We also require that vma->vm_start <= addr < vma->vm_end. + */ if (old_len > vma->vm_end - addr) return -EFAULT; @@ -1751,6 +1808,90 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm) return 0; } +static unsigned long remap_move(struct vma_remap_struct *vrm) +{ + struct vm_area_struct *vma; + unsigned long start = vrm->addr; + unsigned long end = vrm->addr + vrm->old_len; + unsigned long new_addr = vrm->new_addr; + bool allowed = true, seen_vma = false; + unsigned long target_addr = new_addr; + unsigned long res = -EFAULT; + unsigned long last_end; + VMA_ITERATOR(vmi, current->mm, start); + + /* + * When moving VMAs we allow for batched moves across multiple VMAs, + * with all VMAs in the input range [addr, addr + old_len) being moved + * (and split as necessary). + */ + for_each_vma_range(vmi, vma, end) { + /* Account for start, end not aligned with VMA start, end. */ + unsigned long addr = max(vma->vm_start, start); + unsigned long len = min(end, vma->vm_end) - addr; + unsigned long offset, res_vma; + + if (!allowed) + return -EFAULT; + + /* No gap permitted at the start of the range. */ + if (!seen_vma && start < vma->vm_start) + return -EFAULT; + + /* + * To sensibly move multiple VMAs, accounting for the fact that + * get_unmapped_area() may align even MAP_FIXED moves, we simply + * attempt to move such that the gaps between source VMAs remain + * consistent in destination VMAs, e.g.: + * + * X Y X Y + * <---> <-> <---> <-> + * |-------| |-----| |-----| |-------| |-----| |-----| + * | A | | B | | C | ---> | A' | | B' | | C' | + * |-------| |-----| |-----| |-------| |-----| |-----| + * new_addr + * + * So we map B' at A'->vm_end + X, and C' at B'->vm_end + Y. + */ + offset = seen_vma ? vma->vm_start - last_end : 0; + last_end = vma->vm_end; + + vrm->vma = vma; + vrm->addr = addr; + vrm->new_addr = target_addr + offset; + vrm->old_len = vrm->new_len = len; + + allowed = vma_multi_allowed(vma); + if (seen_vma && !allowed) + return -EFAULT; + + res_vma = check_prep_vma(vrm); + if (!res_vma) + res_vma = mremap_to(vrm); + if (IS_ERR_VALUE(res_vma)) + return res_vma; + + if (!seen_vma) { + VM_WARN_ON_ONCE(allowed && res_vma != new_addr); + res = res_vma; + } + + /* mmap lock is only dropped on shrink. */ + VM_WARN_ON_ONCE(!vrm->mmap_locked); + /* This is a move, no expand should occur. 
*/ + VM_WARN_ON_ONCE(vrm->populate_expand); + + if (vrm->vmi_needs_invalidate) { + vma_iter_invalidate(&vmi); + vrm->vmi_needs_invalidate = false; + } + seen_vma = true; + target_addr = res_vma + vrm->new_len; + } + + return res; +} + static unsigned long do_mremap(struct vma_remap_struct *vrm) { struct mm_struct *mm = current->mm; @@ -1768,13 +1909,17 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm) return -EINTR; vrm->mmap_locked = true; - vrm->vma = vma_lookup(current->mm, vrm->addr); - res = check_prep_vma(vrm); - if (res) - goto out; + if (vrm_move_only(vrm)) { + res = remap_move(vrm); + } else { + vrm->vma = vma_lookup(current->mm, vrm->addr); + res = check_prep_vma(vrm); + if (res) + goto out; - /* Actually execute mremap. */ - res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm); + /* Actually execute mremap. */ + res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm); + } out: failed = IS_ERR_VALUE(res); From d53f248258e11e145b773a925ad1a8590ce4618e Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Thu, 17 Jul 2025 17:56:00 +0100 Subject: [PATCH 285/370] tools/testing/selftests: extend mremap_test to test multi-VMA mremap Now that we have added the ability to move multiple VMAs at once, assert that this functions correctly, both overwriting VMAs and moving backwards and forwards with merge and VMA invalidation. Additionally assert that page tables are correctly propagated by setting random data and reading it back. Link: https://lkml.kernel.org/r/139074a24a011ca4ed52498a7fa2080024b43917.1752770784.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Cc: Liam Howlett Cc: Peter Xu Cc: Rik van Riel Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/mremap_test.c | 146 ++++++++++++++++++++++- 1 file changed, 145 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/mm/mremap_test.c b/tools/testing/selftests/mm/mremap_test.c index bb84476a177f..0a49be11e614 100644 --- a/tools/testing/selftests/mm/mremap_test.c +++ b/tools/testing/selftests/mm/mremap_test.c @@ -380,6 +380,149 @@ static void mremap_move_within_range(unsigned int pattern_seed, char *rand_addr) ksft_test_result_fail("%s\n", test_name); } +static bool is_multiple_vma_range_ok(unsigned int pattern_seed, + char *ptr, unsigned long page_size) +{ + int i; + + srand(pattern_seed); + for (i = 0; i <= 10; i += 2) { + int j; + char *buf = &ptr[i * page_size]; + size_t size = i == 4 ? 
2 * page_size : page_size;
+
+		for (j = 0; j < size; j++) {
+			char chr = rand();
+
+			if (chr != buf[j]) {
+				ksft_print_msg("page %d offset %d corrupted, expected %d got %d\n",
+					       i, j, chr, buf[j]);
+				return false;
+			}
+		}
+	}
+
+	return true;
+}
+
+static void mremap_move_multiple_vmas(unsigned int pattern_seed,
+				      unsigned long page_size)
+{
+	char *test_name = "mremap move multiple vmas";
+	const size_t size = 11 * page_size;
+	bool success = true;
+	char *ptr, *tgt_ptr;
+	int i;
+
+	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+		   MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (ptr == MAP_FAILED) {
+		perror("mmap");
+		success = false;
+		goto out;
+	}
+
+	tgt_ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
+		       MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (tgt_ptr == MAP_FAILED) {
+		perror("mmap");
+		success = false;
+		goto out;
+	}
+	if (munmap(tgt_ptr, 2 * size)) {
+		perror("munmap");
+		success = false;
+		goto out_unmap;
+	}
+
+	/*
+	 * Unmap so we end up with:
+	 *
+	 *  0  2  4 5 6  8  10 offset in buffer
+	 * |*| |*| |*****| |*| |*|
+	 * |*| |*| |*****| |*| |*|
+	 *  0  1  2 3 4  5  6 pattern offset
+	 */
+	for (i = 1; i < 10; i += 2) {
+		if (i == 5)
+			continue;
+
+		if (munmap(&ptr[i * page_size], page_size)) {
+			perror("munmap");
+			success = false;
+			goto out_unmap;
+		}
+	}
+
+	srand(pattern_seed);
+
+	/* Set up random patterns. */
+	for (i = 0; i <= 10; i += 2) {
+		int j;
+		size_t size = i == 4 ? 2 * page_size : page_size;
+		char *buf = &ptr[i * page_size];
+
+		for (j = 0; j < size; j++)
+			buf[j] = rand();
+	}
+
+	/* First, just move the whole thing. */
+	if (mremap(ptr, size, size,
+		   MREMAP_MAYMOVE | MREMAP_FIXED, tgt_ptr) == MAP_FAILED) {
+		perror("mremap");
+		success = false;
+		goto out_unmap;
+	}
+	/* Check move was ok. */
+	if (!is_multiple_vma_range_ok(pattern_seed, tgt_ptr, page_size)) {
+		success = false;
+		goto out_unmap;
+	}
+
+	/* Move next to itself. */
+	if (mremap(tgt_ptr, size, size,
+		   MREMAP_MAYMOVE | MREMAP_FIXED, &tgt_ptr[size]) == MAP_FAILED) {
+		perror("mremap");
+		success = false;
+		goto out_unmap;
+	}
+	/* Check that the move is ok. */
+	if (!is_multiple_vma_range_ok(pattern_seed, &tgt_ptr[size], page_size)) {
+		success = false;
+		goto out_unmap;
+	}
+
+	/* Map a range to overwrite. */
+	if (mmap(tgt_ptr, size, PROT_NONE,
+		 MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0) == MAP_FAILED) {
+		perror("mmap tgt");
+		success = false;
+		goto out_unmap;
+	}
+	/* Move and overwrite. */
+	if (mremap(&tgt_ptr[size], size, size,
+		   MREMAP_MAYMOVE | MREMAP_FIXED, tgt_ptr) == MAP_FAILED) {
+		perror("mremap");
+		success = false;
+		goto out_unmap;
+	}
+	/* Check that the move is ok. */
+	if (!is_multiple_vma_range_ok(pattern_seed, tgt_ptr, page_size)) {
+		success = false;
+		goto out_unmap;
+	}
+
+out_unmap:
+	if (munmap(tgt_ptr, 2 * size))
+		perror("munmap tgt");
+	if (munmap(ptr, size))
+		perror("munmap src");
+
+out:
+	if (success)
+		ksft_test_result_pass("%s\n", test_name);
+	else
+		ksft_test_result_fail("%s\n", test_name);
+}
+
 /* Returns the time taken for the remap on success else returns -1.
*/ static long long remap_region(struct config c, unsigned int threshold_mb, char *rand_addr) @@ -721,7 +864,7 @@ int main(int argc, char **argv) char *rand_addr; size_t rand_size; int num_expand_tests = 2; - int num_misc_tests = 2; + int num_misc_tests = 3; struct test test_cases[MAX_TEST] = {}; struct test perf_test_cases[MAX_PERF_TEST]; int page_size; @@ -848,6 +991,7 @@ int main(int argc, char **argv) mremap_move_within_range(pattern_seed, rand_addr); mremap_move_1mb_from_start(pattern_seed, rand_addr); + mremap_move_multiple_vmas(pattern_seed, page_size); if (run_perf_tests) { ksft_print_msg("\n%s\n", From ea693aaa5ce5ad9fb124788bcb41d4d24a1d7a02 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Wed, 16 Jul 2025 01:05:50 -0700 Subject: [PATCH 286/370] mm/shmem: hold shmem_swaplist spinlock (not mutex) much less A flamegraph (from an MGLRU load) showed shmem_writeout()'s use of the global shmem_swaplist_mutex worryingly hot: improvement is long overdue. 3.1 commit 6922c0c7abd3 ("tmpfs: convert shmem_writepage and enable swap") apologized for extending shmem_swaplist_mutex across add_to_swap_cache(), and hoped to find another way: yes, there may be lots of work to allocate radix tree nodes in there. Then 6.15 commit b487a2da3575 ("mm, swap: simplify folio swap allocation") will have made it worse, by moving shmem_writeout()'s swap allocation under that mutex too (but the worrying flamegraph was observed even before that change). There's a useful comment about pagelock no longer protecting from eviction once moved to swap cache: but it's good till shmem_delete_from_page_cache() replaces page pointer by swap entry, so move the swaplist add between them. We would much prefer to take the global lock once per inode than once per page: given the possible races with shmem_unuse() pruning when !swapped (and other tasks racing to swap other pages out or in), try the swaplist add whenever swapped was incremented from 0 (but inode may already be on the list - only unuse and evict bother to remove it). This technique is more subtle than it looks (we're avoiding the very lock which would make it easy), but works: whereas an unlocked list_empty() check runs a risk of the inode being unqueued and left off the swaplist forever, swapoff only completing when the page is faulted in or removed. The need for a sleepable mutex went away in 5.1 commit b56a2d8af914 ("mm: rid swapoff of quadratic complexity"): a spinlock works better now. This commit is certain to take shmem_swaplist_mutex out of contention, and has been seen to make a practical improvement (but there is likely to have been an underlying issue which made its contention so visible). 
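
Condensed, the resulting pattern is (a sketch of what the diff below
implements, not a new interface):

	bool first_swapped;

	spin_lock(&info->lock);
	first_swapped = (info->swapped == 0);
	info->swapped += nr_pages;
	spin_unlock(&info->lock);

	if (first_swapped) {
		/* Rarely taken: only on the 0 -> swapped transition. */
		spin_lock(&shmem_swaplist_lock);
		if (list_empty(&info->swaplist))
			list_add(&info->swaplist, &shmem_swaplist);
		spin_unlock(&shmem_swaplist_lock);
	}

so the global lock is taken roughly once per swapped-out inode rather
than once per page written out.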
Link: https://lkml.kernel.org/r/87beaec6-a3b0-ce7a-c892-1e1e5bd57aa3@google.com Signed-off-by: Hugh Dickins Reviewed-by: Baolin Wang Tested-by: Baolin Wang Tested-by: David Rientjes Reviewed-by: Kairui Song Cc: Baoquan He Cc: Barry Song <21cnbao@gmail.com> Cc: Chris Li Cc: Kemeng Shi Cc: Shakeel Butt Signed-off-by: Andrew Morton --- mm/shmem.c | 59 ++++++++++++++++++++++++++++++------------------------ 1 file changed, 33 insertions(+), 26 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index 334b7b4a61a0..cdee7273ea8c 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -292,7 +292,7 @@ bool vma_is_shmem(struct vm_area_struct *vma) } static LIST_HEAD(shmem_swaplist); -static DEFINE_MUTEX(shmem_swaplist_mutex); +static DEFINE_SPINLOCK(shmem_swaplist_lock); #ifdef CONFIG_TMPFS_QUOTA @@ -432,10 +432,13 @@ static void shmem_free_inode(struct super_block *sb, size_t freed_ispace) * * But normally info->alloced == inode->i_mapping->nrpages + info->swapped * So mm freed is info->alloced - (inode->i_mapping->nrpages + info->swapped) + * + * Return: true if swapped was incremented from 0, for shmem_writeout(). */ -static void shmem_recalc_inode(struct inode *inode, long alloced, long swapped) +static bool shmem_recalc_inode(struct inode *inode, long alloced, long swapped) { struct shmem_inode_info *info = SHMEM_I(inode); + bool first_swapped = false; long freed; spin_lock(&info->lock); @@ -450,8 +453,11 @@ static void shmem_recalc_inode(struct inode *inode, long alloced, long swapped) * to stop a racing shmem_recalc_inode() from thinking that a page has * been freed. Compensate here, to avoid the need for a followup call. */ - if (swapped > 0) + if (swapped > 0) { + if (info->swapped == swapped) + first_swapped = true; freed += swapped; + } if (freed > 0) info->alloced -= freed; spin_unlock(&info->lock); @@ -459,6 +465,7 @@ static void shmem_recalc_inode(struct inode *inode, long alloced, long swapped) /* The quota case may block */ if (freed > 0) shmem_inode_unacct_blocks(inode, freed); + return first_swapped; } bool shmem_charge(struct inode *inode, long pages) @@ -1375,11 +1382,11 @@ static void shmem_evict_inode(struct inode *inode) /* Wait while shmem_unuse() is scanning this inode... */ wait_var_event(&info->stop_eviction, !atomic_read(&info->stop_eviction)); - mutex_lock(&shmem_swaplist_mutex); + spin_lock(&shmem_swaplist_lock); /* ...but beware of the race if we peeked too early */ if (!atomic_read(&info->stop_eviction)) list_del_init(&info->swaplist); - mutex_unlock(&shmem_swaplist_mutex); + spin_unlock(&shmem_swaplist_lock); } } @@ -1502,7 +1509,7 @@ int shmem_unuse(unsigned int type) if (list_empty(&shmem_swaplist)) return 0; - mutex_lock(&shmem_swaplist_mutex); + spin_lock(&shmem_swaplist_lock); start_over: list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) { if (!info->swapped) { @@ -1516,12 +1523,12 @@ int shmem_unuse(unsigned int type) * (igrab() would protect from unlink, but not from unmount). 
*/ atomic_inc(&info->stop_eviction); - mutex_unlock(&shmem_swaplist_mutex); + spin_unlock(&shmem_swaplist_lock); error = shmem_unuse_inode(&info->vfs_inode, type); cond_resched(); - mutex_lock(&shmem_swaplist_mutex); + spin_lock(&shmem_swaplist_lock); if (atomic_dec_and_test(&info->stop_eviction)) wake_up_var(&info->stop_eviction); if (error) @@ -1532,7 +1539,7 @@ int shmem_unuse(unsigned int type) if (!info->swapped) list_del_init(&info->swaplist); } - mutex_unlock(&shmem_swaplist_mutex); + spin_unlock(&shmem_swaplist_lock); return error; } @@ -1622,30 +1629,30 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug, folio_mark_uptodate(folio); } - /* - * Add inode to shmem_unuse()'s list of swapped-out inodes, - * if it's not already there. Do it now before the folio is - * moved to swap cache, when its pagelock no longer protects - * the inode from eviction. But don't unlock the mutex until - * we've incremented swapped, because shmem_unuse_inode() will - * prune a !swapped inode from the swaplist under this mutex. - */ - mutex_lock(&shmem_swaplist_mutex); - if (list_empty(&info->swaplist)) - list_add(&info->swaplist, &shmem_swaplist); - if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) { - shmem_recalc_inode(inode, 0, nr_pages); + bool first_swapped = shmem_recalc_inode(inode, 0, nr_pages); + + /* + * Add inode to shmem_unuse()'s list of swapped-out inodes, + * if it's not already there. Do it now before the folio is + * removed from page cache, when its pagelock no longer + * protects the inode from eviction. And do it now, after + * we've incremented swapped, because shmem_unuse() will + * prune a !swapped inode from the swaplist. + */ + if (first_swapped) { + spin_lock(&shmem_swaplist_lock); + if (list_empty(&info->swaplist)) + list_add(&info->swaplist, &shmem_swaplist); + spin_unlock(&shmem_swaplist_lock); + } + swap_shmem_alloc(folio->swap, nr_pages); shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap)); - mutex_unlock(&shmem_swaplist_mutex); BUG_ON(folio_mapped(folio)); return swap_writeout(folio, plug); } - if (!info->swapped) - list_del_init(&info->swaplist); - mutex_unlock(&shmem_swaplist_mutex); if (nr_pages > 1) goto try_split; redirty: From 6344a6d9ce13ae29e3ddf280fd8a1109a77b9996 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Wed, 16 Jul 2025 01:08:39 -0700 Subject: [PATCH 287/370] mm/shmem: writeout free swap if swap_writeout() reactivates If swap_writeout() returns AOP_WRITEPAGE_ACTIVATE (for example, because zswap cannot compress and memcg disables writeback), there is no virtue in keeping that folio in swap cache and holding the swap allocation: shmem_writeout() switch it back to shmem page cache before returning. Folio lock is held, and folio->memcg_data remains set throughout, so there is no need to get into any memcg or memsw charge complications: swap_free_nr() and delete_from_swap_cache() do as much as is needed (but beware the race with shmem_free_swap() when inode truncated or evicted). Doing the same for an anonymous folio is harder, since it will usually have been unmapped, with references to the swap left in the page tables. Adding a function to remap the folio would be fun, but not worthwhile unless it has other uses, or an urgent bug with anon is demonstrated. 
[hughd@google.com: use shmem_recalc_inode() rather than open coding, per Baolin] Link: https://lkml.kernel.org/r/101a7d89-290c-545d-8a6d-b1174ed8b1e5@google.com Link: https://lkml.kernel.org/r/5c911f7a-af7a-5029-1dd4-2e00b66d565c@google.com Signed-off-by: Hugh Dickins Reviewed-by: Baolin Wang Tested-by: David Rientjes Cc: Baoquan He Cc: Barry Song <21cnbao@gmail.com> Cc: Chris Li Cc: Kairui Song Cc: Kemeng Shi Cc: Shakeel Butt Signed-off-by: Andrew Morton --- mm/shmem.c | 31 ++++++++++++++++++++++++++++++- 1 file changed, 30 insertions(+), 1 deletion(-) diff --git a/mm/shmem.c b/mm/shmem.c index cdee7273ea8c..7570a24e0ae4 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1631,6 +1631,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug, if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) { bool first_swapped = shmem_recalc_inode(inode, 0, nr_pages); + int error; /* * Add inode to shmem_unuse()'s list of swapped-out inodes, @@ -1651,7 +1652,35 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug, shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap)); BUG_ON(folio_mapped(folio)); - return swap_writeout(folio, plug); + error = swap_writeout(folio, plug); + if (error != AOP_WRITEPAGE_ACTIVATE) { + /* folio has been unlocked */ + return error; + } + + /* + * The intention here is to avoid holding on to the swap when + * zswap was unable to compress and unable to writeback; but + * it will be appropriate if other reactivate cases are added. + */ + error = shmem_add_to_page_cache(folio, mapping, index, + swp_to_radix_entry(folio->swap), + __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN); + /* Swap entry might be erased by racing shmem_free_swap() */ + if (!error) { + shmem_recalc_inode(inode, 0, -nr_pages); + swap_free_nr(folio->swap, nr_pages); + } + + /* + * The delete_from_swap_cache() below could be left for + * shrink_folio_list()'s folio_free_swap() to dispose of; + * but I'm a little nervous about letting this folio out of + * shmem_writeout() in a hybrid half-tmpfs-half-swap state + * e.g. folio_mapping(folio) might give an unexpected answer. + */ + delete_from_swap_cache(folio); + goto redirty; } if (nr_pages > 1) goto try_split; From d0813985a23c222989136d3156c2ccf91fa03e0e Mon Sep 17 00:00:00 2001 From: Anthony Yznaga Date: Tue, 15 Jul 2025 18:26:09 -0700 Subject: [PATCH 288/370] sparc64: remove hugetlb_free_pgd_range() Patch series "drop hugetlb_free_pgd_range()". For all architectures that support hugetlb except for sparc, hugetlb_free_pgd_range() just calls free_pgd_range(). It turns out the sparc implementation is essentially identical to free_pgd_range() and can be removed. Remove it and update free_pgtables() to treat hugetlb VMAs the same as others. This patch (of 3): The sparc implementation of hugetlb_free_pgd_range() is identical to free_pgd_range() with the exception of checking for and skipping possible leaf entries at the PUD and PMD levels. These checks are unnecessary because any huge pages have been freed and their PTEs cleared by the time page tables needed to map them are freed. While some huge page sizes do populate the page table with multiple PTEs, they are correctly cleared by huge_ptep_get_and_clear(). To verify this, libhugetlbfs tests were run for 64K, 8M, and 256M page sizes with an instrumented kernel on a qemu guest modified to support the 256M page size. The same tests were used to verify no regressions after applying this patch and were also run on x86 for both 2M and 1G page sizes. 
Link: https://lkml.kernel.org/r/20250716012611.10369-1-anthony.yznaga@oracle.com Link: https://lkml.kernel.org/r/20250716012611.10369-2-anthony.yznaga@oracle.com Signed-off-by: Anthony Yznaga Acked-by: Mike Rapoport (Microsoft) Acked-by: Oscar Salvador Cc: Alexander Gordeev Cc: Alexandre Ghiti Cc: Andreas Larsson Cc: Anshuman Khandual Cc: Arnd Bergmann Cc: Christophe Leroy Cc: David Hildenbrand Cc: David S. Miller Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Muchun Song Cc: Oscar Salvador Cc: Ryan Roberts Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/sparc/include/asm/hugetlb.h | 5 -- arch/sparc/mm/hugetlbpage.c | 119 ------------------------------- 2 files changed, 124 deletions(-) diff --git a/arch/sparc/include/asm/hugetlb.h b/arch/sparc/include/asm/hugetlb.h index e7a9cdd498dc..d3bc16fbcbbd 100644 --- a/arch/sparc/include/asm/hugetlb.h +++ b/arch/sparc/include/asm/hugetlb.h @@ -50,11 +50,6 @@ static inline int huge_ptep_set_access_flags(struct vm_area_struct *vma, return changed; } -#define __HAVE_ARCH_HUGETLB_FREE_PGD_RANGE -void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr, - unsigned long end, unsigned long floor, - unsigned long ceiling); - #include #endif /* _ASM_SPARC64_HUGETLB_H */ diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c index 80504148d8a5..4b9431311e05 100644 --- a/arch/sparc/mm/hugetlbpage.c +++ b/arch/sparc/mm/hugetlbpage.c @@ -295,122 +295,3 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, return entry; } - -static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, - unsigned long addr) -{ - pgtable_t token = pmd_pgtable(*pmd); - - pmd_clear(pmd); - pte_free_tlb(tlb, token, addr); - mm_dec_nr_ptes(tlb->mm); -} - -static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud, - unsigned long addr, unsigned long end, - unsigned long floor, unsigned long ceiling) -{ - pmd_t *pmd; - unsigned long next; - unsigned long start; - - start = addr; - pmd = pmd_offset(pud, addr); - do { - next = pmd_addr_end(addr, end); - if (pmd_none(*pmd)) - continue; - if (is_hugetlb_pmd(*pmd)) - pmd_clear(pmd); - else - hugetlb_free_pte_range(tlb, pmd, addr); - } while (pmd++, addr = next, addr != end); - - start &= PUD_MASK; - if (start < floor) - return; - if (ceiling) { - ceiling &= PUD_MASK; - if (!ceiling) - return; - } - if (end - 1 > ceiling - 1) - return; - - pmd = pmd_offset(pud, start); - pud_clear(pud); - pmd_free_tlb(tlb, pmd, start); - mm_dec_nr_pmds(tlb->mm); -} - -static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d, - unsigned long addr, unsigned long end, - unsigned long floor, unsigned long ceiling) -{ - pud_t *pud; - unsigned long next; - unsigned long start; - - start = addr; - pud = pud_offset(p4d, addr); - do { - next = pud_addr_end(addr, end); - if (pud_none_or_clear_bad(pud)) - continue; - if (is_hugetlb_pud(*pud)) - pud_clear(pud); - else - hugetlb_free_pmd_range(tlb, pud, addr, next, floor, - ceiling); - } while (pud++, addr = next, addr != end); - - start &= PGDIR_MASK; - if (start < floor) - return; - if (ceiling) { - ceiling &= PGDIR_MASK; - if (!ceiling) - return; - } - if (end - 1 > ceiling - 1) - return; - - pud = pud_offset(p4d, start); - p4d_clear(p4d); - pud_free_tlb(tlb, pud, start); - mm_dec_nr_puds(tlb->mm); -} - -void hugetlb_free_pgd_range(struct mmu_gather *tlb, - unsigned long addr, unsigned long end, - unsigned long floor, unsigned long ceiling) -{ - pgd_t *pgd; - p4d_t *p4d; - 
unsigned long next; - - addr &= PMD_MASK; - if (addr < floor) { - addr += PMD_SIZE; - if (!addr) - return; - } - if (ceiling) { - ceiling &= PMD_MASK; - if (!ceiling) - return; - } - if (end - 1 > ceiling - 1) - end -= PMD_SIZE; - if (addr > end - 1) - return; - - pgd = pgd_offset(tlb->mm, addr); - p4d = p4d_offset(pgd, addr); - do { - next = p4d_addr_end(addr, end); - if (p4d_none_or_clear_bad(p4d)) - continue; - hugetlb_free_pud_range(tlb, p4d, addr, next, floor, ceiling); - } while (p4d++, addr = next, addr != end); -} From 1c7bf6c54572f91c92b7adc1e24f492970f7020d Mon Sep 17 00:00:00 2001 From: Anthony Yznaga Date: Tue, 15 Jul 2025 18:26:10 -0700 Subject: [PATCH 289/370] mm: remove call to hugetlb_free_pgd_range() With the removal of the last arch-specific implementation of hugetlb_free_pgd_range(), hugetlb VMAs no longer need special handling when freeing page tables. Link: https://lkml.kernel.org/r/20250716012611.10369-3-anthony.yznaga@oracle.com Signed-off-by: Anthony Yznaga Acked-by: Mike Rapoport (Microsoft) Acked-by: Oscar Salvador Cc: Alexander Gordeev Cc: Alexandre Ghiti Cc: Andreas Larsson Cc: Anshuman Khandual Cc: Arnd Bergmann Cc: Christophe Leroy Cc: David Hildenbrand Cc: David S. Miller Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Muchun Song Cc: Ryan Roberts Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- mm/memory.c | 42 ++++++++++++++++++------------------------ 1 file changed, 18 insertions(+), 24 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index bc27b1990fcb..b4fb559dd0c6 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -379,32 +379,26 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, vma_start_write(vma); unlink_anon_vmas(vma); - if (is_vm_hugetlb_page(vma)) { - unlink_file_vma(vma); - hugetlb_free_pgd_range(tlb, addr, vma->vm_end, - floor, next ? next->vm_start : ceiling); - } else { - unlink_file_vma_batch_init(&vb); - unlink_file_vma_batch_add(&vb, vma); + unlink_file_vma_batch_init(&vb); + unlink_file_vma_batch_add(&vb, vma); - /* - * Optimization: gather nearby vmas into one call down - */ - while (next && next->vm_start <= vma->vm_end + PMD_SIZE - && !is_vm_hugetlb_page(next)) { - vma = next; - next = mas_find(mas, ceiling - 1); - if (unlikely(xa_is_zero(next))) - next = NULL; - if (mm_wr_locked) - vma_start_write(vma); - unlink_anon_vmas(vma); - unlink_file_vma_batch_add(&vb, vma); - } - unlink_file_vma_batch_final(&vb); - free_pgd_range(tlb, addr, vma->vm_end, - floor, next ? next->vm_start : ceiling); + /* + * Optimization: gather nearby vmas into one call down + */ + while (next && next->vm_start <= vma->vm_end + PMD_SIZE) { + vma = next; + next = mas_find(mas, ceiling - 1); + if (unlikely(xa_is_zero(next))) + next = NULL; + if (mm_wr_locked) + vma_start_write(vma); + unlink_anon_vmas(vma); + unlink_file_vma_batch_add(&vb, vma); } + unlink_file_vma_batch_final(&vb); + + free_pgd_range(tlb, addr, vma->vm_end, + floor, next ? next->vm_start : ceiling); vma = next; } while (vma); } From 441413d2a99d1d23bea2df2497493024b00ace57 Mon Sep 17 00:00:00 2001 From: Anthony Yznaga Date: Tue, 15 Jul 2025 18:26:11 -0700 Subject: [PATCH 290/370] mm: drop hugetlb_free_pgd_range() There are no longer any callers of hugetlb_free_pgd_range(). 
Link: https://lkml.kernel.org/r/20250716012611.10369-4-anthony.yznaga@oracle.com Signed-off-by: Anthony Yznaga Acked-by: Mike Rapoport (Microsoft) Acked-by: Oscar Salvador Cc: Alexander Gordeev Cc: Alexandre Ghiti Cc: Andreas Larsson Cc: Anshuman Khandual Cc: Arnd Bergmann Cc: Christophe Leroy Cc: David Hildenbrand Cc: David S. Miller Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Muchun Song Cc: Ryan Roberts Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- include/asm-generic/hugetlb.h | 9 --------- include/linux/hugetlb.h | 7 ------- 2 files changed, 16 deletions(-) diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h index 4bce4f07f44f..dcb8727f2b82 100644 --- a/include/asm-generic/hugetlb.h +++ b/include/asm-generic/hugetlb.h @@ -66,15 +66,6 @@ static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr, } #endif -#ifndef __HAVE_ARCH_HUGETLB_FREE_PGD_RANGE -static inline void hugetlb_free_pgd_range(struct mmu_gather *tlb, - unsigned long addr, unsigned long end, - unsigned long floor, unsigned long ceiling) -{ - free_pgd_range(tlb, addr, end, floor, ceiling); -} -#endif - #ifndef __HAVE_ARCH_HUGE_SET_HUGE_PTE_AT static inline void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte, unsigned long sz) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 474de8e2a8f2..526d27e88b3b 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -390,13 +390,6 @@ static inline int is_hugepage_only_range(struct mm_struct *mm, return 0; } -static inline void hugetlb_free_pgd_range(struct mmu_gather *tlb, - unsigned long addr, unsigned long end, - unsigned long floor, unsigned long ceiling) -{ - BUG(); -} - #ifdef CONFIG_USERFAULTFD static inline int hugetlb_mfill_atomic_pte(pte_t *dst_pte, struct vm_area_struct *dst_vma, From a9e056de661ccab70476c3002be2a6666b40ca4b Mon Sep 17 00:00:00 2001 From: Ryan Roberts Date: Mon, 9 Jun 2025 11:31:30 +0100 Subject: [PATCH 291/370] mm: remove arch_flush_tlb_batched_pending() arch helper Since commit 4b634918384c ("arm64/mm: Close theoretical race where stale TLB entry remains valid"), all arches that use tlbbatch for reclaim (arm64, riscv, x86) implement arch_flush_tlb_batched_pending() with a flush_tlb_mm(). So let's simplify by removing the unnecessary abstraction and doing the flush_tlb_mm() directly in flush_tlb_batched_pending(). This effectively reverts commit db6c1f6f236d ("mm/tlbbatch: introduce arch_flush_tlb_batched_pending()"). Link: https://lkml.kernel.org/r/20250609103132.447370-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts Suggested-by: Will Deacon Reviewed-by: Lorenzo Stoakes Acked-by: David Hildenbrand Reviewed-by: Anshuman Khandual Acked-by: Will Deacon Cc: Albert Ou Cc: Alexandre Ghiti Cc: Borislav Betkov Cc: Catalin Marinas Cc: "H. 
Peter Anvin" Cc: Ingo Molnar Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Rik van Riel Cc: Ryan Roberts Cc: Thomas Gleinxer Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/arm64/include/asm/tlbflush.h | 11 ----------- arch/riscv/include/asm/tlbflush.h | 1 - arch/riscv/mm/tlbflush.c | 5 ----- arch/x86/include/asm/tlbflush.h | 5 ----- mm/rmap.c | 2 +- 5 files changed, 1 insertion(+), 23 deletions(-) diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h index aa9efee17277..18a5dc0c9a54 100644 --- a/arch/arm64/include/asm/tlbflush.h +++ b/arch/arm64/include/asm/tlbflush.h @@ -322,17 +322,6 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm) return true; } -/* - * If mprotect/munmap/etc occurs during TLB batched flushing, we need to ensure - * all the previously issued TLBIs targeting mm have completed. But since we - * can be executing on a remote CPU, a DSB cannot guarantee this like it can - * for arch_tlbbatch_flush(). Our only option is to flush the entire mm. - */ -static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm) -{ - flush_tlb_mm(mm); -} - /* * To support TLB batched flush for multiple pages unmapping, we only send * the TLBI for each page in arch_tlbbatch_add_pending() and wait for the diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h index 1a20dd746a49..eed0abc40514 100644 --- a/arch/riscv/include/asm/tlbflush.h +++ b/arch/riscv/include/asm/tlbflush.h @@ -63,7 +63,6 @@ void flush_pud_tlb_range(struct vm_area_struct *vma, unsigned long start, bool arch_tlbbatch_should_defer(struct mm_struct *mm); void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch, struct mm_struct *mm, unsigned long start, unsigned long end); -void arch_flush_tlb_batched_pending(struct mm_struct *mm); void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch); extern unsigned long tlb_flush_all_threshold; diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c index e737ba7949b1..8404530ec00f 100644 --- a/arch/riscv/mm/tlbflush.c +++ b/arch/riscv/mm/tlbflush.c @@ -234,11 +234,6 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch, mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end); } -void arch_flush_tlb_batched_pending(struct mm_struct *mm) -{ - flush_tlb_mm(mm); -} - void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) { __flush_tlb_range(NULL, &batch->cpumask, diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index e9b81876ebe4..00daedfefc1b 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -356,11 +356,6 @@ static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *b mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL); } -static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm) -{ - flush_tlb_mm(mm); -} - extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch); static inline bool pte_flags_need_flush(unsigned long oldflags, diff --git a/mm/rmap.c b/mm/rmap.c index 4c833b43fef9..f93ce27132ab 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -746,7 +746,7 @@ void flush_tlb_batched_pending(struct mm_struct *mm) int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT; if (pending != flushed) { - arch_flush_tlb_batched_pending(mm); + flush_tlb_mm(mm); /* * If the new TLB flushing is pending during flushing, leave * mm->tlb_flush_batched as is, to avoid losing flushing. 
From 378bdb97405a00bf03d8d993435c83add1688e36 Mon Sep 17 00:00:00 2001 From: Kuniyuki Iwashima Date: Thu, 17 Jul 2025 19:46:43 +0000 Subject: [PATCH 292/370] memcg: convert memcg->socket_pressure to u64 memcg->socket_pressure is initialised with jiffies when the memcg is created. Once vmpressure detects that the cgroup is under memory pressure, the field is updated with jiffies + HZ to signal the fact to the socket layer and suppress memory allocation for one second. Otherwise, the field is not updated. mem_cgroup_under_socket_pressure() uses time_before() to check if jiffies is less than memcg->socket_pressure, and this has a bug on 32-bit kernels. if (time_before(jiffies, memcg->socket_pressure)) return true; As time_before() casts the final result to long, the acceptable delta between two timestamps is 2 ^ (BITS_PER_LONG - 1). On a 32-bit kernel with CONFIG_HZ=1000, this is about 24 days. >>> (2 ** 31) / 1000 / 60 / 60 / 24 24.855134814814818 Once 24 days have passed since the last update of socket_pressure, mem_cgroup_under_socket_pressure() starts to lie until the next 24 days pass. We don't need to worry about this on 64-bit machines unless they serve for 300 million years. >>> (2 ** 63) / 1000 / 60 / 60 / 24 / 365 292471208.6775361 Let's convert memcg->socket_pressure to u64. Performance testing: I don't have a real 32-bit machine so this is a result on QEMU, but with/without the u64 jiffies patch, the time spent in mem_cgroup_under_socket_pressure() was 1~5us and I didn't see any measurable delta. no patch applied: iperf3 273 [000] 137.296248: probe:mem_cgroup_under_socket_pressure: (c13660d0) c13660d1 mem_cgroup_under_socket_pressure+0x1 ([kernel.kallsyms]) iperf3 273 [000] 137.296249: probe:mem_cgroup_under_socket_pressure__return: (c13660d0 <- c1d8fd7f) iperf3 273 [000] 137.296251: probe:mem_cgroup_under_socket_pressure: (c13660d0) c13660d1 mem_cgroup_under_socket_pressure+0x1 ([kernel.kallsyms]) iperf3 273 [000] 137.296253: probe:mem_cgroup_under_socket_pressure__return: (c13660d0 <- c1d8fd7f) u64 jiffies patch applied: iperf3 308 [001] 330.669370: probe:mem_cgroup_under_socket_pressure: (c12ddba0) c12ddba1 mem_cgroup_under_socket_pressure+0x1 ([kernel.kallsyms]) iperf3 308 [001] 330.669371: probe:mem_cgroup_under_socket_pressure__return: (c12ddba0 <- c1ce98bf) iperf3 308 [001] 330.669382: probe:mem_cgroup_under_socket_pressure: (c12ddba0) c12ddba1 mem_cgroup_under_socket_pressure+0x1 ([kernel.kallsyms]) iperf3 308 [001] 330.669384: probe:mem_cgroup_under_socket_pressure__return: (c12ddba0 <- c1ce98bf) So the u64 approach is good enough. Link: https://lkml.kernel.org/r/20250717194645.1096500-1-kuniyu@google.com Fixes: 8e8ae645249b ("mm: memcontrol: hook up vmpressure to socket pressure") Signed-off-by: Kuniyuki Iwashima Reported-by: Neal Cardwell Suggested-by: Andrew Morton Acked-by: Shakeel Butt Acked-by: Johannes Weiner Cc: David S. Miller Cc: Eric Dumazet Cc: Johannes Weiner Cc: Vladimir Davydov Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 44 +++++++++++++++++++++++++++++++++++--- mm/memcontrol.c | 5 ++++- mm/vmpressure.c | 2 +- 3 files changed, 46 insertions(+), 5 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 87b6688f124a..785173aa0739 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -251,8 +251,10 @@ struct mem_cgroup { * that this indicator should NOT be used in legacy cgroup mode * where socket memory is accounted/charged separately.
*/ - unsigned long socket_pressure; - + u64 socket_pressure; +#if BITS_PER_LONG < 64 + seqlock_t socket_pressure_seqlock; +#endif int kmemcg_id; /* * memcg->objcg is wiped out as a part of the objcg repaprenting @@ -1602,6 +1604,42 @@ extern struct static_key_false memcg_sockets_enabled_key; #define mem_cgroup_sockets_enabled static_branch_unlikely(&memcg_sockets_enabled_key) void mem_cgroup_sk_alloc(struct sock *sk); void mem_cgroup_sk_free(struct sock *sk); + +#if BITS_PER_LONG < 64 +static inline void mem_cgroup_set_socket_pressure(struct mem_cgroup *memcg) +{ + u64 val = get_jiffies_64() + HZ; + unsigned long flags; + + write_seqlock_irqsave(&memcg->socket_pressure_seqlock, flags); + memcg->socket_pressure = val; + write_sequnlock_irqrestore(&memcg->socket_pressure_seqlock, flags); +} + +static inline u64 mem_cgroup_get_socket_pressure(struct mem_cgroup *memcg) +{ + unsigned int seq; + u64 val; + + do { + seq = read_seqbegin(&memcg->socket_pressure_seqlock); + val = memcg->socket_pressure; + } while (read_seqretry(&memcg->socket_pressure_seqlock, seq)); + + return val; +} +#else +static inline void mem_cgroup_set_socket_pressure(struct mem_cgroup *memcg) +{ + WRITE_ONCE(memcg->socket_pressure, jiffies + HZ); +} + +static inline u64 mem_cgroup_get_socket_pressure(struct mem_cgroup *memcg) +{ + return READ_ONCE(memcg->socket_pressure); +} +#endif + static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg) { #ifdef CONFIG_MEMCG_V1 @@ -1609,7 +1647,7 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg) return !!memcg->tcpmem_pressure; #endif /* CONFIG_MEMCG_V1 */ do { - if (time_before(jiffies, READ_ONCE(memcg->socket_pressure))) + if (time_before64(get_jiffies_64(), mem_cgroup_get_socket_pressure(memcg))) return true; } while ((memcg = parent_mem_cgroup(memcg))); return false; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 235c66d2161b..de7d737fe011 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3754,7 +3754,10 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent) INIT_LIST_HEAD(&memcg->memory_peaks); INIT_LIST_HEAD(&memcg->swap_peaks); spin_lock_init(&memcg->peaks_lock); - memcg->socket_pressure = jiffies; + memcg->socket_pressure = get_jiffies_64(); +#if BITS_PER_LONG < 64 + seqlock_init(&memcg->socket_pressure_seqlock); +#endif memcg1_memcg_init(memcg); memcg->kmemcg_id = -1; INIT_LIST_HEAD(&memcg->objcg_list); diff --git a/mm/vmpressure.c b/mm/vmpressure.c index bd5183dfd879..c197ed47bcc4 100644 --- a/mm/vmpressure.c +++ b/mm/vmpressure.c @@ -316,7 +316,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree, * asserted for a second in which subsequent * pressure events can occur. */ - WRITE_ONCE(memcg->socket_pressure, jiffies + HZ); + mem_cgroup_set_socket_pressure(memcg); } } } From b907494768e5390ee8282dc138d9d6ba9b971af1 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Wed, 16 Jul 2025 22:54:45 -0700 Subject: [PATCH 293/370] mm/damon/sysfs: implement refresh_ms file under kdamond directory Patch series "mm/damon/sysfs: support periodic and automated stats update". The DAMON sysfs interface provides files for reading DAMON internal status including auto-tuned monitoring intervals, DAMOS stats, DAMOS action applied regions, and auto-tuned DAMOS effective quota. Among those, auto-tuned monitoring intervals, DAMOS stats and auto-tuned DAMOS effective quota are essential for common DAMON/S use cases. The content of the files is not automatically updated, though.
Users should manually request updates of the contents by writing a special command to the 'state' file of each kdamond directory. This interface is good for minimizing overhead, but it causes the problems below. First, the usage is cumbersome. This is arguably not a big problem, since the user-space tool (damo) can do this instead of the user. Second, it can be too slow. The update request is not directly handled by the sysfs interface but by the kdamond thread, and kdamond threads wake up only once per sampling interval. Hence if the sampling interval is not short, each update request could take a long time. The recommended sampling interval setup is asking DAMON to automatically tune it, within a range between 5 milliseconds and 10 seconds. On production systems it is not very rare to end up with a few seconds of sampling interval as a result of the auto-tuning, so this can disturb observing DAMON internal status. Finally, parallel update requests can conflict with each other. When parallel update requests are received, the DAMON sysfs interface simply returns -EBUSY to one of the requests. The DAMON user-space tool hence implements its own backoff mechanism, but this can make the operation even slower. Introduce a new sysfs file, namely refresh_ms, for asking the DAMON sysfs interface to repeat the update of the above mentioned essential contents with a user-specified time delay. If a non-zero value is written to the file, the DAMON sysfs interface does the updates for essential DAMON internal status including auto-tuned monitoring intervals, DAMOS stats, and auto-tuned DAMOS quotas, using the user-written value as the time delay. In other words, it is similar to periodically writing the 'update_schemes_stats', 'update_schemes_effective_quotas', and 'update_tuned_intervals' keywords to the 'state' file. If zero is written to the file, the automatic refresh is disabled. This patch (of 4): Implement a new DAMON sysfs file named 'refresh_ms' under each kdamond directory. The file will be used as a control knob for automatic refresh of a few DAMON internal status files. This commit implements only the minimum file operations, though. The automatic refresh feature will be implemented by the following commit.
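As an illustration of the intended usage, a hypothetical userspace consumer of the new knob could look like the sketch below (illustrative only; the path follows the sysfs layout documented later in this series, and error handling is trimmed):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		const char *path =
			"/sys/kernel/mm/damon/admin/kdamonds/0/refresh_ms";
		int fd = open(path, O_WRONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* Ask for an automatic refresh of the essential status files
		 * every 500 milliseconds. Writing "0" would disable it again. */
		if (write(fd, "500", 3) != 3)
			perror("write");
		close(fd);
		return 0;
	}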
Link: https://lkml.kernel.org/r/20250717055448.56976-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250717055448.56976-2-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- mm/damon/sysfs.c | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+) diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c index cce2c8a296e2..4296dc201f4d 100644 --- a/mm/damon/sysfs.c +++ b/mm/damon/sysfs.c @@ -1155,6 +1155,7 @@ struct damon_sysfs_kdamond { struct kobject kobj; struct damon_sysfs_contexts *contexts; struct damon_ctx *damon_ctx; + unsigned int refresh_ms; }; static struct damon_sysfs_kdamond *damon_sysfs_kdamond_alloc(void) @@ -1690,6 +1691,30 @@ static ssize_t pid_show(struct kobject *kobj, return sysfs_emit(buf, "%d\n", pid); } +static ssize_t refresh_ms_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct damon_sysfs_kdamond *kdamond = container_of(kobj, + struct damon_sysfs_kdamond, kobj); + + return sysfs_emit(buf, "%u\n", kdamond->refresh_ms); +} + +static ssize_t refresh_ms_store(struct kobject *kobj, + struct kobj_attribute *attr, const char *buf, size_t count) +{ + struct damon_sysfs_kdamond *kdamond = container_of(kobj, + struct damon_sysfs_kdamond, kobj); + unsigned int nr; + int err = kstrtouint(buf, 0, &nr); + + if (err) + return err; + + kdamond->refresh_ms = nr; + return count; +} + static void damon_sysfs_kdamond_release(struct kobject *kobj) { struct damon_sysfs_kdamond *kdamond = container_of(kobj, @@ -1706,9 +1731,13 @@ static struct kobj_attribute damon_sysfs_kdamond_state_attr = static struct kobj_attribute damon_sysfs_kdamond_pid_attr = __ATTR_RO_MODE(pid, 0400); +static struct kobj_attribute damon_sysfs_kdamond_refresh_ms_attr = + __ATTR_RW_MODE(refresh_ms, 0600); + static struct attribute *damon_sysfs_kdamond_attrs[] = { &damon_sysfs_kdamond_state_attr.attr, &damon_sysfs_kdamond_pid_attr.attr, + &damon_sysfs_kdamond_refresh_ms_attr.attr, NULL, }; ATTRIBUTE_GROUPS(damon_sysfs_kdamond); From d809a7c64ba8229286b333c0cba03b1cdfb50238 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Wed, 16 Jul 2025 22:54:46 -0700 Subject: [PATCH 294/370] mm/damon/sysfs: implement refresh_ms file internal work Only the minimum file operations for the refresh_ms file are implemented so far. Further implement its designed behavior, the periodic update of the essential files' content, using repeat-mode damon_call(). If a non-zero value is written to the file, update the DAMON sysfs files for auto-tuned monitoring intervals, DAMOS stats, and auto-tuned DAMOS quota values, which are essential to monitor in most DAMON use cases. The user-written non-zero value becomes the time delay between the updates. If zero is written to the file, the periodic refresh is disabled.
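The internal work follows the common jiffies-based rate-limiting idiom, roughly sketched below (simplified, with hypothetical names; the actual callback and field names appear in the diff that follows):

	/* Called back repeatedly; does the refresh at most once per interval. */
	static unsigned int refresh_interval_ms;	/* user-written refresh_ms */

	static int repeat_refresh_fn(void *data)
	{
		static unsigned long next_update_jiffies;

		if (!refresh_interval_ms)
			return 0;	/* zero disables the periodic refresh */
		if (time_before(jiffies, next_update_jiffies))
			return 0;	/* delay has not elapsed yet, skip */
		next_update_jiffies = jiffies +
				msecs_to_jiffies(refresh_interval_ms);
		/* ... refresh the sysfs-visible status here ... */
		return 0;
	}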
Link: https://lkml.kernel.org/r/20250717055448.56976-3-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- mm/damon/sysfs.c | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+) diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c index 4296dc201f4d..6d2b0dab50cb 100644 --- a/mm/damon/sysfs.c +++ b/mm/damon/sysfs.c @@ -1509,6 +1509,32 @@ static struct damon_ctx *damon_sysfs_build_ctx( return ctx; } +static int damon_sysfs_repeat_call_fn(void *data) +{ + struct damon_sysfs_kdamond *sysfs_kdamond = data; + static unsigned long next_update_jiffies; + + if (!sysfs_kdamond->refresh_ms) + return 0; + if (time_before(jiffies, next_update_jiffies)) + return 0; + next_update_jiffies = jiffies + + msecs_to_jiffies(sysfs_kdamond->refresh_ms); + + if (!mutex_trylock(&damon_sysfs_lock)) + return 0; + damon_sysfs_upd_tuned_intervals(sysfs_kdamond); + damon_sysfs_upd_schemes_stats(sysfs_kdamond); + damon_sysfs_upd_schemes_effective_quotas(sysfs_kdamond); + mutex_unlock(&damon_sysfs_lock); + return 0; +} + +static struct damon_call_control damon_sysfs_repeat_call_control = { + .fn = damon_sysfs_repeat_call_fn, + .repeat = true, +}; + static int damon_sysfs_turn_damon_on(struct damon_sysfs_kdamond *kdamond) { struct damon_ctx *ctx; @@ -1533,6 +1559,9 @@ static int damon_sysfs_turn_damon_on(struct damon_sysfs_kdamond *kdamond) return err; } kdamond->damon_ctx = ctx; + + damon_sysfs_repeat_call_control.data = kdamond; + damon_call(ctx, &damon_sysfs_repeat_call_control); return err; } From e85e965bdbec7ebee62c9a843d31175e6cf60052 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Wed, 16 Jul 2025 22:54:47 -0700 Subject: [PATCH 295/370] Docs/admin-guide/mm/damon/usage: document refresh_ms file Document the new DAMON sysfs file, refresh_ms, on the usage document. Link: https://lkml.kernel.org/r/20250717055448.56976-4-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/damon/usage.rst | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index fc5c962353ed..ff3a2dda1f02 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -59,7 +59,7 @@ comma (","). :ref:`/sys/kernel/mm/damon `/admin │ :ref:`kdamonds `/nr_kdamonds - │ │ :ref:`0 `/state,pid + │ │ :ref:`0 `/state,pid,refresh_ms │ │ │ :ref:`contexts `/nr_contexts │ │ │ │ :ref:`0 `/avail_operations,operations │ │ │ │ │ :ref:`monitoring_attrs `/ @@ -123,8 +123,8 @@ kdamond. kdamonds// ------------- -In each kdamond directory, two files (``state`` and ``pid``) and one directory -(``contexts``) exist. +In each kdamond directory, three files (``state``, ``pid`` and ``refresh_ms``) +and one directory (``contexts``) exist. Reading ``state`` returns ``on`` if the kdamond is currently running, or ``off`` if it is not running. @@ -161,6 +161,13 @@ Users can write below commands for the kdamond to the ``state`` file. If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread. +Users can ask the kernel to periodically update files showing auto-tuned +parameters and DAMOS stats instead of manually writing +``update_tuned_intervals`` like keywords to ``state`` file. For this, users +should write the desired update time interval in milliseconds to ``refresh_ms`` +file. If the interval is zero, the periodic update is disabled. 
Reading the +file shows currently set time interval. + ``contexts`` directory contains files for controlling the monitoring contexts that this kdamond will execute. From cc1c6724ebeab32e48cc927405c3d436a6eadc65 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Wed, 16 Jul 2025 22:54:48 -0700 Subject: [PATCH 296/370] Docs/ABI/damon: update for refresh_ms Document the new DAMON sysfs file, refresh_ms, on the ABI document. Link: https://lkml.kernel.org/r/20250717055448.56976-5-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- Documentation/ABI/testing/sysfs-kernel-mm-damon | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon index e98974dfac7a..6791d879759e 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-damon +++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon @@ -44,6 +44,13 @@ Contact: SeongJae Park Description: Reading this file returns the pid of the kdamond if it is running. +What: /sys/kernel/mm/damon/admin/kdamonds//refresh_ms +Date: Jul 2025 +Contact: SeongJae Park +Description: Writing a value to this file sets the time interval for + automatic DAMON status file contents update. Writing '0' + disables the update. Reading this file returns the value. + What: /sys/kernel/mm/damon/admin/kdamonds//contexts/nr_contexts Date: Mar 2022 Contact: SeongJae Park From 4f78252da887ee7e9d1875dd6e07d9baa936c04f Mon Sep 17 00:00:00 2001 From: Kemeng Shi Date: Thu, 22 May 2025 20:25:51 +0800 Subject: [PATCH 297/370] mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc() Patch series "Some random fixes and cleanups to swapfile". Patches 0-3 are some random fixes. Patch 4 is a cleanup. More details can be found in the respective patches. This patch (of 4): When folio_alloc_swap() encounters a failure in either mem_cgroup_try_charge_swap() or add_to_swap_cache(), the nr_swap_pages counter is not decremented for the allocated entry. However, the following put_swap_folio() will increase the nr_swap_pages counter without a paired decrement and lead to an imbalance. Move the nr_swap_pages decrement from folio_alloc_swap() to swap_range_alloc() to pair the nr_swap_pages counting.
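The imbalance can be reproduced with a toy model (standalone userspace C with hypothetical numbers; nr_swap_pages stands in for the kernel's atomic counter):

	#include <stdio.h>

	static long nr_swap_pages = 100;	/* pages believed to be free */

	/* Pre-fix behavior: the decrement happened only after the last failure
	 * point, while the error path still "returned" the entry to the pool. */
	static int alloc_swap_buggy(long size, int charge_fails)
	{
		if (charge_fails) {
			nr_swap_pages += size;	/* put_swap_folio()-style free... */
			return -1;		/* ...with no paired decrement */
		}
		nr_swap_pages -= size;		/* decrement on success only */
		return 0;
	}

	int main(void)
	{
		alloc_swap_buggy(8, 1);
		/* Prints 108: the counter now overstates free swap by 8 pages. */
		printf("nr_swap_pages = %ld\n", nr_swap_pages);
		return 0;
	}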
Link: https://lkml.kernel.org/r/20250522122554.12209-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20250522122554.12209-2-shikemeng@huaweicloud.com Fixes: 0ff67f990bd4 ("mm, swap: remove swap slot cache") Signed-off-by: Kemeng Shi Reviewed-by: Kairui Song Reviewed-by: Baoquan He Cc: Johannes Weiner Cc: Signed-off-by: Andrew Morton --- mm/swapfile.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 68ce283e84be..b8464373b7ae 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1115,6 +1115,7 @@ static void swap_range_alloc(struct swap_info_struct *si, if (vm_swap_full()) schedule_work(&si->reclaim_work); } + atomic_long_sub(nr_entries, &nr_swap_pages); } static void swap_range_free(struct swap_info_struct *si, unsigned long offset, @@ -1313,7 +1314,6 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp) if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL)) goto out_free; - atomic_long_sub(size, &nr_swap_pages); return 0; out_free: From 255116c5b0fa2145ede28c2f7b248df5e73834d1 Mon Sep 17 00:00:00 2001 From: Kemeng Shi Date: Thu, 22 May 2025 20:25:52 +0800 Subject: [PATCH 298/370] mm: swap: correctly use maxpages in swapon syscall to avoid potential deadloop We use maxpages from read_swap_header() to initialize swap_info_struct, however the maxpages might be reduced in setup_swap_extents(), and si->max is assigned the reduced maxpages from setup_swap_extents(). Obviously, this could lead to memory waste as we allocated memory based on the larger maxpages; besides, it could lead to a potential deadloop as follows: 1) When calling setup_clusters() with the larger maxpages, unavailable pages within the range [si->max, larger maxpages) are not accounted for with inc_cluster_info_page(). As a result, these pages are assumed available but cannot be allocated. The cluster containing these pages can be moved to the frag_clusters list after all of its available pages were allocated. 2) When the cluster mentioned in 1) is the only cluster in the frag_clusters list, cluster_alloc_swap_entry() assumes an order-0 allocation will never fail and enters a deadloop, repeatedly trying to allocate a page from the only cluster in frag_clusters, which contains no actually available pages. Call setup_swap_extents() to get the final maxpages before swap_info_struct initialization to fix the issue. After this change, span will include badblocks and will become a larger value, which I think is the correct value: In summary, there are two kinds of swapfile_activate operations. 1. Filesystem style: Treat all blocks as logically contiguous and find usable physical extents in the logical range. In this way, si->pages will be the actual usable physical blocks and span will be "1 + highest_block - lowest_block". 2. Block device style: Treat all blocks as physically contiguous; only one single extent is added. In this way, si->pages will be si->max and span will be "si->pages - 1". Actually, si->pages and si->max are only used in block device style, and the span value is set from si->pages. As a result, the span value in block device style will become a larger value, as you mentioned. I think the larger value is correct based on: 1. The span value in filesystem style is "1 + highest_block - lowest_block", which is the range covering all possible physical blocks including the badblocks. 2. For block device style, si->pages is the actual usable block number and is already in pr_info. The original span value before this patch also refers to the usable block number, which is redundant in pr_info.
[shikemeng@huaweicloud.com: ensure si->pages == si->max - 1 after setup_swap_extents()] Link: https://lkml.kernel.org/r/20250718065139.61989-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20250522122554.12209-3-shikemeng@huaweicloud.com Fixes: 661383c6111a ("mm: swap: relaim the cached parts that got scanned") Signed-off-by: Kemeng Shi Reviewed-by: Baoquan He Cc: Johannes Weiner Cc: Kairui Song Cc: Signed-off-by: Andrew Morton --- mm/swapfile.c | 53 +++++++++++++++++++++++++-------------------------- 1 file changed, 26 insertions(+), 27 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index b8464373b7ae..119aa706e92d 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3141,43 +3141,30 @@ static unsigned long read_swap_header(struct swap_info_struct *si, return maxpages; } -static int setup_swap_map_and_extents(struct swap_info_struct *si, - union swap_header *swap_header, - unsigned char *swap_map, - unsigned long maxpages, - sector_t *span) +static int setup_swap_map(struct swap_info_struct *si, + union swap_header *swap_header, + unsigned char *swap_map, + unsigned long maxpages) { - unsigned int nr_good_pages; unsigned long i; - int nr_extents; - - nr_good_pages = maxpages - 1; /* omit header page */ + swap_map[0] = SWAP_MAP_BAD; /* omit header page */ for (i = 0; i < swap_header->info.nr_badpages; i++) { unsigned int page_nr = swap_header->info.badpages[i]; if (page_nr == 0 || page_nr > swap_header->info.last_page) return -EINVAL; if (page_nr < maxpages) { swap_map[page_nr] = SWAP_MAP_BAD; - nr_good_pages--; + si->pages--; } } - if (nr_good_pages) { - swap_map[0] = SWAP_MAP_BAD; - si->max = maxpages; - si->pages = nr_good_pages; - nr_extents = setup_swap_extents(si, span); - if (nr_extents < 0) - return nr_extents; - nr_good_pages = si->pages; - } - if (!nr_good_pages) { + if (!si->pages) { pr_warn("Empty swap-file\n"); return -EINVAL; } - return nr_extents; + return 0; } #define SWAP_CLUSTER_INFO_COLS \ @@ -3217,7 +3204,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si, * Mark unusable pages as unavailable. The clusters aren't * marked free yet, so no list operations are involved yet. * - * See setup_swap_map_and_extents(): header page, bad pages, + * See setup_swap_map(): header page, bad pages, * and the EOF part of the last cluster.
*/ inc_cluster_info_page(si, cluster_info, 0); @@ -3363,6 +3350,21 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) goto bad_swap_unlock_inode; } + si->max = maxpages; + si->pages = maxpages - 1; + nr_extents = setup_swap_extents(si, &span); + if (nr_extents < 0) { + error = nr_extents; + goto bad_swap_unlock_inode; + } + if (si->pages != si->max - 1) { + pr_err("swap:%u != (max:%u - 1)\n", si->pages, si->max); + error = -EINVAL; + goto bad_swap_unlock_inode; + } + + maxpages = si->max; + /* OK, set up the swap map and apply the bad block list */ swap_map = vzalloc(maxpages); if (!swap_map) { @@ -3374,12 +3376,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) if (error) goto bad_swap_unlock_inode; - nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map, - maxpages, &span); - if (unlikely(nr_extents < 0)) { - error = nr_extents; + error = setup_swap_map(si, swap_header, swap_map, maxpages); + if (error) goto bad_swap_unlock_inode; - } /* * Use kvmalloc_array instead of bitmap_zalloc as the allocation order might From 152c1339dc13ad46f1b136e8693de15980750835 Mon Sep 17 00:00:00 2001 From: Kemeng Shi Date: Thu, 22 May 2025 20:25:53 +0800 Subject: [PATCH 299/370] mm: swap: fix potential buffer overflow in setup_clusters() In setup_swap_map(), we only ensure badpages are in range (0, last_page]. As maxpages might be < last_page, setup_clusters() will encounter a buffer overflow when a badpage is >= maxpages. Only call inc_cluster_info_page() for a badpage which is < maxpages to fix the issue. Link: https://lkml.kernel.org/r/20250522122554.12209-4-shikemeng@huaweicloud.com Fixes: b843786b0bd0 ("mm: swapfile: fix SSD detection with swapfile on btrfs") Signed-off-by: Kemeng Shi Reviewed-by: Baoquan He Cc: Johannes Weiner Cc: Kairui Song Cc: Signed-off-by: Andrew Morton --- mm/swapfile.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 119aa706e92d..4f47ec9118f8 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3208,9 +3208,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si, * and the EOF part of the last cluster. */ inc_cluster_info_page(si, cluster_info, 0); - for (i = 0; i < swap_header->info.nr_badpages; i++) - inc_cluster_info_page(si, cluster_info, - swap_header->info.badpages[i]); + for (i = 0; i < swap_header->info.nr_badpages; i++) { + unsigned int page_nr = swap_header->info.badpages[i]; + + if (page_nr >= maxpages) + continue; + inc_cluster_info_page(si, cluster_info, page_nr); + } for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++) inc_cluster_info_page(si, cluster_info, i); From 8678d1faf1d4c9c6a87237203fc23c9389e8f143 Mon Sep 17 00:00:00 2001 From: Kemeng Shi Date: Thu, 22 May 2025 20:25:54 +0800 Subject: [PATCH 300/370] mm: swap: remove stale comment in cluster_alloc_swap_entry() As cluster_next_cpu was already dropped, the associated comment is stale now.
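As an aside on the buffer overflow fixed in the preceding patch, the failure mode is easy to see in a toy model (standalone userspace C, hypothetical sizes):

	#include <stdio.h>

	#define MAXPAGES   8	/* capacity the cluster array was sized for */
	#define LAST_PAGE 12	/* swap header value; may exceed MAXPAGES */

	static int cluster_info[MAXPAGES];

	int main(void)
	{
		unsigned int badpages[] = { 3, 10 };	/* both <= LAST_PAGE */

		for (int i = 0; i < 2; i++) {
			if (badpages[i] >= MAXPAGES)	/* the added bounds check */
				continue;
			cluster_info[badpages[i]]++;	/* index 10 would be OOB */
		}
		printf("cluster_info[3] = %d\n", cluster_info[3]);
		return 0;
	}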
Link: https://lkml.kernel.org/r/20250522122554.12209-5-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi Reviewed-by: Kairui Song Reviewed-by: Baoquan He Cc: Johannes Weiner Signed-off-by: Andrew Morton --- mm/swapfile.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 4f47ec9118f8..b4f3cc712580 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -956,9 +956,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o } /* - * We don't have free cluster but have some clusters in - * discarding, do discard now and reclaim them, then - * reread cluster_next_cpu since we dropped si->lock + * We don't have free cluster but have some clusters in discarding, + * do discard now and reclaim them. */ if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si)) goto new_cluster; From 92c99fc614737eaa99125283c306d2ebb13b101a Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 14 Jul 2025 09:16:51 -0400 Subject: [PATCH 301/370] mm/memory: introduce is_huge_zero_pfn() and use it in vm_normal_page_pmd() Patch series "mm: introduce snapshot_page()", v3. This series introduces snapshot_page(), a helper function that can be used to create a snapshot of a struct page and its associated struct folio. This function is intended to help callers with a consistent view of a folio while reducing the chance of encountering partially updated or inconsistent state, such as during folio splitting, which could lead to crashes and BUG_ON()s being triggered. This patch (of 4): Let's avoid working with the PMD when not required. If vm_normal_page_pmd() would be called on something that is not a present pmd, it would already be a bug (pfn possibly garbage). While at it, let's support passing in any pfn covered by the huge zero folio by masking off PFN bits -- which should be rather cheap.
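The masking is a single AND, hence "rather cheap". A worked example (standalone userspace C with hypothetical pfn values, assuming HPAGE_PMD_NR == 512, i.e. 2MB huge pages on 4KB base pages):

	#include <stdio.h>

	#define HPAGE_PMD_NR 512UL

	int main(void)
	{
		unsigned long huge_zero_pfn = 0x1200;	/* hypothetical, 512-aligned */
		unsigned long pfn = 0x13ab;		/* some pfn inside that folio */

		/* pfn & ~(HPAGE_PMD_NR - 1) rounds down to the folio's head pfn. */
		printf("head = %#lx, match = %d\n",
		       pfn & ~(HPAGE_PMD_NR - 1),
		       (pfn & ~(HPAGE_PMD_NR - 1)) == huge_zero_pfn);	/* 0x1200, 1 */
		return 0;
	}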
Link: https://lkml.kernel.org/r/cover.1752499009.git.luizcap@redhat.com Link: https://lkml.kernel.org/r/4940826e99f0c709a7cf7beb94f53288320aea5a.1752499009.git.luizcap@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador Signed-off-by: Luiz Capitulino Reviewed-by: Shivank Garg Tested-by: Harry Yoo Cc: Matthew Wilcox (Oracle) Cc: SeongJae Park Signed-off-by: Andrew Morton --- include/linux/huge_mm.h | 12 +++++++++++- mm/memory.c | 2 +- 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 4d5bb67dc4ec..7748489fde1b 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -482,9 +482,14 @@ static inline bool is_huge_zero_folio(const struct folio *folio) return READ_ONCE(huge_zero_folio) == folio; } +static inline bool is_huge_zero_pfn(unsigned long pfn) +{ + return READ_ONCE(huge_zero_pfn) == (pfn & ~(HPAGE_PMD_NR - 1)); +} + static inline bool is_huge_zero_pmd(pmd_t pmd) { - return pmd_present(pmd) && READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd); + return pmd_present(pmd) && is_huge_zero_pfn(pmd_pfn(pmd)); } struct folio *mm_get_huge_zero_folio(struct mm_struct *mm); @@ -632,6 +637,11 @@ static inline bool is_huge_zero_folio(const struct folio *folio) return false; } +static inline bool is_huge_zero_pfn(unsigned long pfn) +{ + return false; +} + static inline bool is_huge_zero_pmd(pmd_t pmd) { return false; diff --git a/mm/memory.c b/mm/memory.c index b4fb559dd0c6..92fd18a5d8d1 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -668,7 +668,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr, } } - if (is_huge_zero_pmd(pmd)) + if (is_huge_zero_pfn(pfn)) return NULL; if (unlikely(pfn > highest_memmap_pfn)) return NULL; From d863a12108f24c5f0b49f99f328e33371bd7c69d Mon Sep 17 00:00:00 2001 From: Luiz Capitulino Date: Mon, 14 Jul 2025 09:16:52 -0400 Subject: [PATCH 302/370] mm/util: introduce snapshot_page() This commit refactors __dump_page() into snapshot_page(). snapshot_page() tries to take a faithful snapshot of a page and its folio representation. The snapshot is returned in the struct page_snapshot parameter along with additional flags that are best retrieved at snapshot creation time to reduce race windows. This function is intended to be used by callers that need a stable representation of a struct page and struct folio so that pointers or page information doesn't change while working on a page. The idea and original implementation of snapshot_page() come from Matthew Wilcox with suggestions for improvements from David Hildenbrand. All bugs and misconceptions are mine.
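A rough sketch of the intended calling pattern (hypothetical kernel-context caller; the API names match the patch below):

	static void inspect_page(const struct page *page)
	{
		struct page_snapshot ps;

		snapshot_page(&ps, page);
		if (!snapshot_page_is_faithful(&ps))
			pr_warn("page does not match folio\n");
		/*
		 * Work on ps.page_snapshot / ps.folio_snapshot from here on:
		 * they are local copies, so a concurrent folio split can no
		 * longer change them underneath us.
		 */
	}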
[luizcap@redhat.com: fix set_ps_flags() commentary] Link: https://lkml.kernel.org/r/d5c75701-b353-4536-a306-187fab0655b3@redhat.com Link: https://lkml.kernel.org/r/637a03a05cb2e3df88f84ff9e9f9642374ef813a.1752499009.git.luizcap@redhat.com Signed-off-by: Luiz Capitulino Reviewed-by: Shivank Garg Tested-by: Harry Yoo Acked-by: David Hildenbrand Cc: Matthew Wilcox (Oracle) Cc: Oscar Salvador Cc: SeongJae Park Signed-off-by: Andrew Morton --- include/linux/mm.h | 19 +++++++++++ mm/debug.c | 42 +++--------------------- mm/util.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 104 insertions(+), 38 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 805108d7bbc3..8e3a4c5b78ff 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4199,4 +4199,23 @@ static inline bool page_pool_page_is_pp(struct page *page) } #endif +#define PAGE_SNAPSHOT_FAITHFUL (1 << 0) +#define PAGE_SNAPSHOT_PG_BUDDY (1 << 1) +#define PAGE_SNAPSHOT_PG_IDLE (1 << 2) + +struct page_snapshot { + struct folio folio_snapshot; + struct page page_snapshot; + unsigned long pfn; + unsigned long idx; + unsigned long flags; +}; + +static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps) +{ + return ps->flags & PAGE_SNAPSHOT_FAITHFUL; +} + +void snapshot_page(struct page_snapshot *ps, const struct page *page); + #endif /* _LINUX_MM_H */ diff --git a/mm/debug.c b/mm/debug.c index e2973e1b3812..b4388f4dcd4d 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -129,47 +129,13 @@ static void __dump_folio(struct folio *folio, struct page *page, static void __dump_page(const struct page *page) { - struct folio *foliop, folio; - struct page precise; - unsigned long head; - unsigned long pfn = page_to_pfn(page); - unsigned long idx, nr_pages = 1; - int loops = 5; + struct page_snapshot ps; -again: - memcpy(&precise, page, sizeof(*page)); - head = precise.compound_head; - if ((head & 1) == 0) { - foliop = (struct folio *)&precise; - idx = 0; - if (!folio_test_large(foliop)) - goto dump; - foliop = (struct folio *)page; - } else { - foliop = (struct folio *)(head - 1); - idx = folio_page_idx(foliop, page); - } - - if (idx < MAX_FOLIO_NR_PAGES) { - memcpy(&folio, foliop, 2 * sizeof(struct page)); - nr_pages = folio_nr_pages(&folio); - if (nr_pages > 1) - memcpy(&folio.__page_2, &foliop->__page_2, - sizeof(struct page)); - foliop = &folio; - } - - if (idx > nr_pages) { - if (loops-- > 0) - goto again; + snapshot_page(&ps, page); + if (!snapshot_page_is_faithful(&ps)) pr_warn("page does not match folio\n"); - precise.compound_head &= ~1UL; - foliop = (struct folio *)&precise; - idx = 0; - } -dump: - __dump_folio(foliop, &precise, pfn, idx); + __dump_folio(&ps.folio_snapshot, &ps.page_snapshot, ps.pfn, ps.idx); } void dump_page(const struct page *page, const char *reason) diff --git a/mm/util.c b/mm/util.c index 68ea833ba25f..f814e6a59ab1 100644 --- a/mm/util.c +++ b/mm/util.c @@ -25,6 +25,7 @@ #include #include #include +#include #include @@ -1172,6 +1173,86 @@ int compat_vma_mmap_prepare(struct file *file, struct vm_area_struct *vma) } EXPORT_SYMBOL(compat_vma_mmap_prepare); +static void set_ps_flags(struct page_snapshot *ps, const struct folio *folio, + const struct page *page) +{ + /* + * Only the first page of a high-order buddy page has PageBuddy() set. + * So we have to check manually whether this page is part of a high- + * order buddy page. 
+ */ + if (PageBuddy(page)) + ps->flags |= PAGE_SNAPSHOT_PG_BUDDY; + else if (page_count(page) == 0 && is_free_buddy_page(page)) + ps->flags |= PAGE_SNAPSHOT_PG_BUDDY; + + if (folio_test_idle(folio)) + ps->flags |= PAGE_SNAPSHOT_PG_IDLE; +} + +/** + * snapshot_page() - Create a snapshot of a struct page + * @ps: Pointer to a struct page_snapshot to store the page snapshot + * @page: The page to snapshot + * + * Create a snapshot of the page and store both its struct page and struct + * folio representations in @ps. + * + * A snapshot is marked as "faithful" if the compound state of @page was + * stable and allowed safe reconstruction of the folio representation. In + * rare cases where this is not possible (e.g. due to folio splitting), + * snapshot_page() falls back to treating @page as a single page and the + * snapshot is marked as "unfaithful". The snapshot_page_is_faithful() + * helper can be used to check for this condition. + */ +void snapshot_page(struct page_snapshot *ps, const struct page *page) +{ + unsigned long head, nr_pages = 1; + struct folio *foliop; + int loops = 5; + + ps->pfn = page_to_pfn(page); + ps->flags = PAGE_SNAPSHOT_FAITHFUL; + +again: + memset(&ps->folio_snapshot, 0, sizeof(struct folio)); + memcpy(&ps->page_snapshot, page, sizeof(*page)); + head = ps->page_snapshot.compound_head; + if ((head & 1) == 0) { + ps->idx = 0; + foliop = (struct folio *)&ps->page_snapshot; + if (!folio_test_large(foliop)) { + set_ps_flags(ps, page_folio(page), page); + memcpy(&ps->folio_snapshot, foliop, + sizeof(struct page)); + return; + } + foliop = (struct folio *)page; + } else { + foliop = (struct folio *)(head - 1); + ps->idx = folio_page_idx(foliop, page); + } + + if (ps->idx < MAX_FOLIO_NR_PAGES) { + memcpy(&ps->folio_snapshot, foliop, 2 * sizeof(struct page)); + nr_pages = folio_nr_pages(&ps->folio_snapshot); + if (nr_pages > 1) + memcpy(&ps->folio_snapshot.__page_2, &foliop->__page_2, + sizeof(struct page)); + set_ps_flags(ps, foliop, page); + } + + if (ps->idx > nr_pages) { + if (loops-- > 0) + goto again; + clear_compound_head(&ps->page_snapshot); + foliop = (struct folio *)&ps->page_snapshot; + memcpy(&ps->folio_snapshot, foliop, sizeof(struct page)); + ps->flags = 0; + ps->idx = 0; + } +} + #ifdef CONFIG_MMU /** * folio_pte_batch - detect a PTE batch for a large folio From 71f2a2c4ff62c9bb1daa0e47b588220178a5ad12 Mon Sep 17 00:00:00 2001 From: Luiz Capitulino Date: Mon, 14 Jul 2025 09:16:53 -0400 Subject: [PATCH 303/370] proc: kpagecount: use snapshot_page() Currently, the call to folio_precise_page_mapcount() from kpage_read() can race with a folio split. When the race happens we trigger a VM_BUG_ON_FOLIO() in folio_entire_mapcount() (see splat below). This commit fixes this race by using snapshot_page() so that we retrieve the folio mapcount using a folio snapshot. [ 2356.558576] page: refcount:1 mapcount:1 mapping:0000000000000000 index:0xffff85200 pfn:0x6f7c00 [ 2356.558748] memcg:ffff000651775780 [ 2356.558763] anon flags: 0xafffff60020838(uptodate|dirty|lru|owner_2|swapbacked|node=1|zone=2|lastcpupid=0xfffff) [ 2356.558796] raw: 00afffff60020838 fffffdffdb5d0048 fffffdffdadf7fc8 ffff00064c1629c1 [ 2356.558817] raw: 0000000ffff85200 0000000000000000 0000000100000000 ffff000651775780 [ 2356.558839] page dumped because: VM_BUG_ON_FOLIO(!folio_test_large(folio)) [ 2356.558882] ------------[ cut here ]------------ [ 2356.558897] kernel BUG at ./include/linux/mm.h:1103! 
[ 2356.558982] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP [ 2356.564729] CPU: 8 UID: 0 PID: 1864 Comm: folio-split-rac Tainted: G S W 6.15.0+ #3 PREEMPT(voluntary) [ 2356.566196] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN [ 2356.566814] Hardware name: Red Hat KVM, BIOS edk2-20241117-3.el9 11/17/2024 [ 2356.567684] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 2356.568563] pc : kpage_read.constprop.0+0x26c/0x290 [ 2356.569605] lr : kpage_read.constprop.0+0x26c/0x290 [ 2356.569992] sp : ffff80008fb739b0 [ 2356.570263] x29: ffff80008fb739b0 x28: ffff00064aa69580 x27: 00000000ff000000 [ 2356.570842] x26: 0000fffffffffff8 x25: ffff00064aa69580 x24: ffff80008fb73ae0 [ 2356.571411] x23: 0000000000000001 x22: 0000ffff86c6e8b8 x21: 0000000000000008 [ 2356.571978] x20: 00000000006f7c00 x19: 0000ffff86c6e8b8 x18: 0000000000000000 [ 2356.572581] x17: 3630303066666666 x16: 0000000000000003 x15: 0000000000001000 [ 2356.573217] x14: 00000000ffffffff x13: 0000000000000004 x12: 00aaaaaa00aaaaaa [ 2356.577674] x11: 0000000000000000 x10: 00aaaaaa00aaaaaa x9 : ffffbf3afca6c300 [ 2356.578332] x8 : 0000000000000002 x7 : 0000000000000001 x6 : 0000000000000001 [ 2356.578984] x5 : ffff000c79812408 x4 : 0000000000000000 x3 : 0000000000000000 [ 2356.579635] x2 : 0000000000000000 x1 : ffff00064aa69580 x0 : 000000000000003e [ 2356.580286] Call trace: [ 2356.580524] kpage_read.constprop.0+0x26c/0x290 (P) [ 2356.580982] kpagecount_read+0x28/0x40 [ 2356.581336] proc_reg_read+0x38/0x100 [ 2356.581681] vfs_read+0xcc/0x320 [ 2356.581992] ksys_read+0x74/0x118 [ 2356.582306] __arm64_sys_read+0x24/0x38 [ 2356.582668] invoke_syscall+0x70/0x100 [ 2356.583022] el0_svc_common.constprop.0+0x48/0xf8 [ 2356.583456] do_el0_svc+0x28/0x40 [ 2356.583930] el0_svc+0x38/0x118 [ 2356.584328] el0t_64_sync_handler+0x144/0x168 [ 2356.584883] el0t_64_sync+0x19c/0x1a0 [ 2356.585350] Code: aa0103e0 9003a541 91082021 97f813fc (d4210000) [ 2356.586130] ---[ end trace 0000000000000000 ]--- [ 2356.587377] note: folio-split-rac[1864] exited with irqs disabled [ 2356.588050] note: folio-split-rac[1864] exited with preempt_count 1 Link: https://lkml.kernel.org/r/1c05cc725b90962d56323ff2e28e9cc3ae397b68.1752499009.git.luizcap@redhat.com Signed-off-by: Luiz Capitulino Reported-by: syzbot+3d7dc5eaba6b932f8535@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67812fbd.050a0220.d0267.0030.GAE@google.com/ Reviewed-by: Shivank Garg Tested-by: Harry Yoo Acked-by: David Hildenbrand Cc: Matthew Wilcox (Oracle) Cc: Oscar Salvador Cc: SeongJae Park Signed-off-by: Andrew Morton --- fs/proc/page.c | 21 +++++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/fs/proc/page.c b/fs/proc/page.c index 0cdc78c0d23f..21e69935c28b 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -43,6 +43,22 @@ static inline unsigned long get_max_dump_pfn(void) #endif } +static u64 get_kpage_count(const struct page *page) +{ + struct page_snapshot ps; + u64 ret; + + snapshot_page(&ps, page); + + if (IS_ENABLED(CONFIG_PAGE_MAPCOUNT)) + ret = folio_precise_page_mapcount(&ps.folio_snapshot, + &ps.page_snapshot); + else + ret = folio_average_page_mapcount(&ps.folio_snapshot); + + return ret; +} + static ssize_t kpage_read(struct file *file, char __user *buf, size_t count, loff_t *ppos, enum kpage_operation op) @@ -75,10 +91,7 @@ static ssize_t kpage_read(struct file *file, char __user *buf, info = stable_page_flags(page); break; case KPAGE_COUNT: - if (IS_ENABLED(CONFIG_PAGE_MAPCOUNT)) - info =
folio_precise_page_mapcount(page_folio(page), page); - else - info = folio_average_page_mapcount(page_folio(page)); + info = get_kpage_count(page); break; case KPAGE_CGROUP: info = page_cgroup_ino(page); From 476d87d6a06146125e8f16edbe845a7bcf6a2e57 Mon Sep 17 00:00:00 2001 From: Luiz Capitulino Date: Mon, 14 Jul 2025 09:16:54 -0400 Subject: [PATCH 304/370] fs: stable_page_flags(): use snapshot_page() A race condition is possible in stable_page_flags() where user-space is reading /proc/kpageflags concurrently with a folio split. This may lead to oopses or BUG_ON()s being triggered. To fix this, this commit uses snapshot_page() in stable_page_flags() so that stable_page_flags() works with stable page and folio snapshots instead. Note that stable_page_flags() makes use of some functions that require the original page or folio pointer to work properly (eg. is_free_buddy_page() and folio_test_idle()). Since those functions can't be used on the page snapshot, we replace their usage with flags that were set by snapshot_page() for this purpose. Link: https://lkml.kernel.org/r/52c16c0f00995a812a55980c2f26848a999a34ab.1752499009.git.luizcap@redhat.com Signed-off-by: Luiz Capitulino Reviewed-by: Shivank Garg Tested-by: Harry Yoo Acked-by: David Hildenbrand Cc: Matthew Wilcox (Oracle) Cc: Oscar Salvador Cc: SeongJae Park Signed-off-by: Andrew Morton --- fs/proc/page.c | 29 +++++++++++++---------------- 1 file changed, 13 insertions(+), 16 deletions(-) diff --git a/fs/proc/page.c b/fs/proc/page.c index 21e69935c28b..ba3568e97fd1 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -147,6 +147,7 @@ static inline u64 kpf_copy_bit(u64 kflags, int ubit, int kbit) u64 stable_page_flags(const struct page *page) { const struct folio *folio; + struct page_snapshot ps; unsigned long k; unsigned long mapping; bool is_anon; @@ -158,7 +159,9 @@ u64 stable_page_flags(const struct page *page) */ if (!page) return 1 << KPF_NOPAGE; - folio = page_folio(page); + + snapshot_page(&ps, page); + folio = &ps.folio_snapshot; k = folio->flags; mapping = (unsigned long)folio->mapping; @@ -167,7 +170,7 @@ u64 stable_page_flags(const struct page *page) /* * pseudo flags for the well known (anonymous) memory mapped pages */ - if (page_mapped(page)) + if (folio_mapped(folio)) u |= 1 << KPF_MMAP; if (is_anon) { u |= 1 << KPF_ANON; @@ -179,7 +182,7 @@ u64 stable_page_flags(const struct page *page) * compound pages: export both head/tail info * they together define a compound page's start/end pos and order */ - if (page == &folio->page) + if (ps.idx == 0) u |= kpf_copy_bit(k, KPF_COMPOUND_HEAD, PG_head); else u |= 1 << KPF_COMPOUND_TAIL; @@ -189,25 +192,19 @@ u64 stable_page_flags(const struct page *page) folio_test_large_rmappable(folio)) { /* Note: we indicate any THPs here, not just PMD-sized ones */ u |= 1 << KPF_THP; - } else if (is_huge_zero_folio(folio)) { + } else if (is_huge_zero_pfn(ps.pfn)) { u |= 1 << KPF_ZERO_PAGE; u |= 1 << KPF_THP; - } else if (is_zero_folio(folio)) { + } else if (is_zero_pfn(ps.pfn)) { u |= 1 << KPF_ZERO_PAGE; } - /* - * Caveats on high order pages: PG_buddy and PG_slab will only be set - * on the head page.
- */ - if (PageBuddy(page)) - u |= 1 << KPF_BUDDY; - else if (page_count(page) == 0 && is_free_buddy_page(page)) + if (ps.flags & PAGE_SNAPSHOT_PG_BUDDY) u |= 1 << KPF_BUDDY; - if (PageOffline(page)) + if (folio_test_offline(folio)) u |= 1 << KPF_OFFLINE; - if (PageTable(page)) + if (folio_test_pgtable(folio)) u |= 1 << KPF_PGTABLE; if (folio_test_slab(folio)) u |= 1 << KPF_SLAB; @@ -215,7 +212,7 @@ u64 stable_page_flags(const struct page *page) #if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT) u |= kpf_copy_bit(k, KPF_IDLE, PG_idle); #else - if (folio_test_idle(folio)) + if (ps.flags & PAGE_SNAPSHOT_PG_IDLE) u |= 1 << KPF_IDLE; #endif @@ -241,7 +238,7 @@ u64 stable_page_flags(const struct page *page) if (u & (1 << KPF_HUGE)) u |= kpf_copy_bit(k, KPF_HWPOISON, PG_hwpoison); else - u |= kpf_copy_bit(page->flags, KPF_HWPOISON, PG_hwpoison); + u |= kpf_copy_bit(ps.page_snapshot.flags, KPF_HWPOISON, PG_hwpoison); #endif u |= kpf_copy_bit(k, KPF_RESERVED, PG_reserved); From 77c50f9147eabcf726f62c73167bb9d9e8621a43 Mon Sep 17 00:00:00 2001 From: "Yury Norov (NVIDIA)" Date: Sat, 19 Jul 2025 16:53:59 -0400 Subject: [PATCH 305/370] mm: cma: simplify cma_debug_show_areas() The function opencodes for_each_clear_bitrange(). Fix that and drop most of housekeeping code. Link: https://lkml.kernel.org/r/20250719205401.399475-2-yury.norov@gmail.com Signed-off-by: Yury Norov (NVIDIA) Acked-by: David Hildenbrand Signed-off-by: Andrew Morton --- mm/cma.c | 19 ++++--------------- 1 file changed, 4 insertions(+), 15 deletions(-) diff --git a/mm/cma.c b/mm/cma.c index 0cd6f061021f..2ffa4befb99a 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -751,8 +751,7 @@ int __init cma_declare_contiguous_nid(phys_addr_t base, static void cma_debug_show_areas(struct cma *cma) { - unsigned long next_zero_bit, next_set_bit, nr_zero; - unsigned long start; + unsigned long start, end; unsigned long nr_part; unsigned long nbits; int r; @@ -763,22 +762,12 @@ static void cma_debug_show_areas(struct cma *cma) for (r = 0; r < cma->nranges; r++) { cmr = &cma->ranges[r]; - start = 0; nbits = cma_bitmap_maxno(cma, cmr); pr_info("range %d: ", r); - for (;;) { - next_zero_bit = find_next_zero_bit(cmr->bitmap, - nbits, start); - if (next_zero_bit >= nbits) - break; - next_set_bit = find_next_bit(cmr->bitmap, nbits, - next_zero_bit); - nr_zero = next_set_bit - next_zero_bit; - nr_part = nr_zero << cma->order_per_bit; - pr_cont("%s%lu@%lu", start ? "+" : "", nr_part, - next_zero_bit); - start = next_zero_bit + nr_zero; + for_each_clear_bitrange(start, end, cmr->bitmap, nbits) { + nr_part = (end - start) << cma->order_per_bit; + pr_cont("%s%lu@%lu", start ? "+" : "", nr_part, start); } pr_info("\n"); } From c79147e4b02fa3679aebf34121b12f5097731d8a Mon Sep 17 00:00:00 2001 From: "Yury Norov (NVIDIA)" Date: Sat, 19 Jul 2025 16:54:00 -0400 Subject: [PATCH 306/370] mm: cma: simplify cma_maxchunk_get() The function opencodes for_each_clear_bitrange(). Fix that and drop most of housekeeping code. 
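What the iterator computes can be demonstrated in isolation (standalone userspace C with a toy 16-bit bitmap; the open-coded inner loops mirror the find_next_zero_bit()/find_next_bit() pair that the kernel macro replaces):

	#include <stdio.h>

	int main(void)
	{
		unsigned int bitmap = 0xC787;	/* 1100011110000111, bit 0 = LSB */
		unsigned int nbits = 16, start = 0, maxchunk = 0;

		while (start < nbits) {
			unsigned int end;

			while (start < nbits && (bitmap >> start) & 1)
				start++;		/* find_next_zero_bit() */
			end = start;
			while (end < nbits && !((bitmap >> end) & 1))
				end++;			/* find_next_bit() */
			if (end - start > maxchunk)
				maxchunk = end - start;
			start = end;
		}
		printf("max clear run = %u\n", maxchunk);	/* 4 (bits 3..6) */
		return 0;
	}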
Link: https://lkml.kernel.org/r/20250719205401.399475-3-yury.norov@gmail.com Signed-off-by: Yury Norov (NVIDIA) Acked-by: David Hildenbrand Signed-off-by: Andrew Morton --- mm/cma_debug.c | 10 +--------- 1 file changed, 1 insertion(+), 9 deletions(-) diff --git a/mm/cma_debug.c b/mm/cma_debug.c index fdf899532ca0..8c7d7f8e8fbd 100644 --- a/mm/cma_debug.c +++ b/mm/cma_debug.c @@ -56,16 +56,8 @@ static int cma_maxchunk_get(void *data, u64 *val) for (r = 0; r < cma->nranges; r++) { cmr = &cma->ranges[r]; bitmap_maxno = cma_bitmap_maxno(cma, cmr); - end = 0; - for (;;) { - start = find_next_zero_bit(cmr->bitmap, - bitmap_maxno, end); - if (start >= bitmap_maxno) - break; - end = find_next_bit(cmr->bitmap, bitmap_maxno, - start); + for_each_clear_bitrange(start, end, cmr->bitmap, bitmap_maxno) maxchunk = max(end - start, maxchunk); - } } spin_unlock_irq(&cma->lock); *val = (u64)maxchunk << cma->order_per_bit; From beb69e81724634063b9dbae4bc79e2e011fdeeb1 Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Sat, 19 Jul 2025 11:28:49 -0700 Subject: [PATCH 307/370] selftests/proc: add /proc/pid/maps tearing from vma split test MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "use per-vma locks for /proc/pid/maps reads", v8. Reading /proc/pid/maps requires read-locking mmap_lock, which prevents any other task from concurrently modifying the address space. This guarantees coherent reporting of virtual address ranges; however, it can block important updates from happening. Oftentimes /proc/pid/maps readers are low priority monitoring tasks, and having them block high priority tasks results in priority inversion. Locking the entire address space is required to present a fully coherent picture of the address space; however, even the current implementation does not strictly guarantee that, as it outputs vmas in page-size chunks and drops mmap_lock in between each chunk. Address space modifications are possible while mmap_lock is dropped, and userspace reading the content is expected to deal with possible concurrent address space modifications. Considering these relaxed rules, holding mmap_lock is not strictly needed as long as we can guarantee that a concurrently modified vma is reported either in its original form or after it was modified. This patchset switches from holding mmap_lock while reading /proc/pid/maps to taking per-vma locks as we walk the vma tree. This reduces the contention with tasks modifying the address space because they would have to contend for the same vma as opposed to the entire address space. The previous version of this patchset [1] tried to perform /proc/pid/maps reading under RCU; however, its implementation is quite complex and the results are worse than the new version because it still relied on mmap_lock speculation, which retries if any part of the address space gets modified. The new implementation is both simpler and results in less contention. Note that a similar approach would not work for /proc/pid/smaps reading as it also walks the page table and that's not RCU-safe. Paul McKenney designed a test [2] to measure mmap/munmap latencies while concurrently reading /proc/pid/maps. The test has a pair of processes scanning /proc/PID/maps, and another process unmapping and remapping 4K pages from a 128MB range of anonymous memory. At the end of each 10 second run, the latency of each mmap() or munmap() operation is measured, and for each run the maximum and mean latency is printed.
The map/unmap process is started first, its PID is passed to the scanners, and then the map/unmap process waits until both scanners are running before starting its timed test. The scanners keep scanning until the specified /proc/PID/maps file disappears. The latest results from Paul: With stock mm-unstable, all of the runs had maximum latencies in excess of 0.5 milliseconds, with 80% of the runs' latencies exceeding a full millisecond and ranging up beyond 4 full milliseconds. In contrast, 99% of the runs with this patch series applied had maximum latencies of less than 0.5 milliseconds, with the single outlier at only 0.608 milliseconds. From a median-performance (as opposed to maximum-latency) viewpoint, this patch series also looks good, with stock mm weighing in at 11 microseconds and the patch series at 6 microseconds, better than a 2x improvement. Before the change: ./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2 0.011 0.008 0.521 0.011 0.008 0.552 0.011 0.008 0.590 0.011 0.008 0.660 ... 0.011 0.015 2.987 0.011 0.015 3.038 0.011 0.016 3.431 0.011 0.016 4.707 After the change: ./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2 0.006 0.005 0.026 0.006 0.005 0.029 0.006 0.005 0.034 0.006 0.005 0.035 ... 0.006 0.006 0.421 0.006 0.006 0.423 0.006 0.006 0.439 0.006 0.006 0.608 The patchset also adds a number of tests to check for /proc/pid/maps data coherency. They are designed to detect any unexpected data tearing while performing some common address space modifications (vma split, resize and remap). Even before these changes, reading /proc/pid/maps might have inconsistent data because the file is read page-by-page with mmap_lock being dropped between the pages. An example of user-visible inconsistency can be that the same vma is printed twice: once before it was modified and then after the modifications. For example, if a vma was extended, it might be found and reported twice. What is not expected is to see a gap where there should have been a vma both before and after modification. This patchset increases the chances of such tearing; therefore it's even more important now to test for unexpected inconsistencies. In [3] Lorenzo identified the following possible vma merging/splitting scenarios: Merges with changes to existing vmas: 1. Merge both - mapping a vma over another one and between two vmas which can be merged after this replacement; 2. Merge left full - mapping a vma at the end of an existing one and completely over its right neighbor; 3. Merge left partial - mapping a vma at the end of an existing one and partially over its right neighbor; 4. Merge right full - mapping a vma before the start of an existing one and completely over its left neighbor; 5. Merge right partial - mapping a vma before the start of an existing one and partially over its left neighbor; Merges without changes to existing vmas: 6. Merge both - mapping a vma into a gap between two vmas which can be merged after the insertion; 7. Merge left - mapping a vma at the end of an existing one; 8. Merge right - mapping a vma before the start of an existing one; Splits: 9. Split with new vma at the lower address; 10.
Split with new vma at the higher address; If such merges or splits happen concurrently with the /proc/maps reading, we might report a vma twice, once before the modification and once after it is modified: Case 1 might report the overwritten and previous vmas along with the final merged vma; Case 2 might report the previous and the final merged vma; Case 3 might cause us to retry once we detect the temporary gap caused by shrinking of the right neighbor; Case 4 might report the overwritten and the final merged vma; Case 5 might cause us to retry once we detect the temporary gap caused by shrinking of the left neighbor; Case 6 might report the previous vma and the gap along with the final merged vma; Case 7 might report the previous and the final merged vma; Case 8 might report the original gap and the final merged vma covering the gap; Case 9 might cause us to retry once we detect the temporary gap caused by shrinking of the original vma at the vma start; Case 10 might cause us to retry once we detect the temporary gap caused by shrinking of the original vma at the vma end; In all these cases the retry mechanism prevents us from reporting possible temporary gaps. [1] https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/ [2] https://github.com/paulmckrcu/proc-mmap_sem-test [3] https://lore.kernel.org/all/e1863f40-39ab-4e5b-984a-c48765ffde1c@lucifer.local/ The /proc/pid/maps file is generated page by page, with the mmap_lock released between pages. This can lead to inconsistent reads if the underlying vmas are concurrently modified. For instance, if a vma split or merge occurs at a page boundary while /proc/pid/maps is being read, the same vma might be seen twice: once before and once after the change. This duplication is considered acceptable for userspace handling. However, observing a "hole" where a vma should be (e.g., due to a vma being replaced and the space temporarily being empty) is unacceptable. Implement a test that: 1. Forks a child process which continuously modifies its address space, specifically targeting a vma at the boundary between two pages. 2. The parent process repeatedly reads the child's /proc/pid/maps. 3. The parent process checks the last vma of the first page and the first vma of the second page for consistency, looking for the effects of vma splits or merges. The test duration is configurable via the DURATION environment variable, expressed in seconds. The default test duration is 5 seconds. Example Command: DURATION=10 ./proc-maps-race Link: https://lkml.kernel.org/r/20250719182854.3166724-1-surenb@google.com Link: https://lkml.kernel.org/r/20250719182854.3166724-2-surenb@google.com Signed-off-by: Suren Baghdasaryan Cc: Alexey Dobriyan Cc: Andrii Nakryiko Cc: Christian Brauner Cc: Christophe Leroy Cc: David Hildenbrand Cc: Jann Horn Cc: Jeongjun Park Cc: Johannes Weiner Cc: Josef Bacik Cc: Kalesh Singh Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Oscar Salvador Cc: "Paul E. McKenney" Cc: Peter Xu Cc: Ryan Roberts Cc: Shuah Khan Cc: Thomas Weißschuh Cc: T.J.
Mercier Cc: Vlastimil Babka Cc: Ye Bin Signed-off-by: Andrew Morton --- tools/testing/selftests/proc/.gitignore | 1 + tools/testing/selftests/proc/Makefile | 1 + tools/testing/selftests/proc/proc-maps-race.c | 447 ++++++++++++++++++ 3 files changed, 449 insertions(+) create mode 100644 tools/testing/selftests/proc/proc-maps-race.c diff --git a/tools/testing/selftests/proc/.gitignore b/tools/testing/selftests/proc/.gitignore index 973968f45bba..19bb333e2485 100644 --- a/tools/testing/selftests/proc/.gitignore +++ b/tools/testing/selftests/proc/.gitignore @@ -5,6 +5,7 @@ /proc-2-is-kthread /proc-fsconfig-hidepid /proc-loadavg-001 +/proc-maps-race /proc-multiple-procfs /proc-empty-vm /proc-pid-vm diff --git a/tools/testing/selftests/proc/Makefile b/tools/testing/selftests/proc/Makefile index b12921b9794b..50aba102201a 100644 --- a/tools/testing/selftests/proc/Makefile +++ b/tools/testing/selftests/proc/Makefile @@ -9,6 +9,7 @@ TEST_GEN_PROGS += fd-002-posix-eq TEST_GEN_PROGS += fd-003-kthread TEST_GEN_PROGS += proc-2-is-kthread TEST_GEN_PROGS += proc-loadavg-001 +TEST_GEN_PROGS += proc-maps-race TEST_GEN_PROGS += proc-empty-vm TEST_GEN_PROGS += proc-pid-vm TEST_GEN_PROGS += proc-self-map-files-001 diff --git a/tools/testing/selftests/proc/proc-maps-race.c b/tools/testing/selftests/proc/proc-maps-race.c new file mode 100644 index 000000000000..5b28dda08b7d --- /dev/null +++ b/tools/testing/selftests/proc/proc-maps-race.c @@ -0,0 +1,447 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2022 Google LLC. + * Author: Suren Baghdasaryan + * + * Permission to use, copy, modify, and distribute this software for any + * purpose with or without fee is hereby granted, provided that the above + * copyright notice and this permission notice appear in all copies. + * + * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF + * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. + */ +/* + * Fork a child that concurrently modifies address space while the main + * process is reading /proc/$PID/maps and verifying the results. 
Address + * space modifications include: + * VMA splitting and merging + * + */ +#define _GNU_SOURCE +#include "../kselftest_harness.h" +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* /proc/pid/maps parsing routines */ +struct page_content { + char *data; + ssize_t size; +}; + +#define LINE_MAX_SIZE 256 + +struct line_content { + char text[LINE_MAX_SIZE]; + unsigned long start_addr; + unsigned long end_addr; +}; + +enum test_state { + INIT, + CHILD_READY, + PARENT_READY, + SETUP_READY, + SETUP_MODIFY_MAPS, + SETUP_MAPS_MODIFIED, + SETUP_RESTORE_MAPS, + SETUP_MAPS_RESTORED, + TEST_READY, + TEST_DONE, +}; + +struct vma_modifier_info; + +FIXTURE(proc_maps_race) +{ + struct vma_modifier_info *mod_info; + struct page_content page1; + struct page_content page2; + struct line_content last_line; + struct line_content first_line; + unsigned long duration_sec; + int shared_mem_size; + int page_size; + int vma_count; + int maps_fd; + pid_t pid; +}; + +typedef bool (*vma_modifier_op)(FIXTURE_DATA(proc_maps_race) *self); +typedef bool (*vma_mod_result_check_op)(struct line_content *mod_last_line, + struct line_content *mod_first_line, + struct line_content *restored_last_line, + struct line_content *restored_first_line); + +struct vma_modifier_info { + int vma_count; + void *addr; + int prot; + void *next_addr; + vma_modifier_op vma_modify; + vma_modifier_op vma_restore; + vma_mod_result_check_op vma_mod_check; + pthread_mutex_t sync_lock; + pthread_cond_t sync_cond; + enum test_state curr_state; + bool exit; + void *child_mapped_addr[]; +}; + + +static bool read_two_pages(FIXTURE_DATA(proc_maps_race) *self) +{ + ssize_t bytes_read; + + if (lseek(self->maps_fd, 0, SEEK_SET) < 0) + return false; + + bytes_read = read(self->maps_fd, self->page1.data, self->page_size); + if (bytes_read <= 0) + return false; + + self->page1.size = bytes_read; + + bytes_read = read(self->maps_fd, self->page2.data, self->page_size); + if (bytes_read <= 0) + return false; + + self->page2.size = bytes_read; + + return true; +} + +static void copy_first_line(struct page_content *page, char *first_line) +{ + char *pos = strchr(page->data, '\n'); + + strncpy(first_line, page->data, pos - page->data); + first_line[pos - page->data] = '\0'; +} + +static void copy_last_line(struct page_content *page, char *last_line) +{ + /* Get the last line in the first page */ + const char *end = page->data + page->size - 1; + /* skip last newline */ + const char *pos = end - 1; + + /* search previous newline */ + while (pos[-1] != '\n') + pos--; + strncpy(last_line, pos, end - pos); + last_line[end - pos] = '\0'; +} + +/* Read the last line of the first page and the first line of the second page */ +static bool read_boundary_lines(FIXTURE_DATA(proc_maps_race) *self, + struct line_content *last_line, + struct line_content *first_line) +{ + if (!read_two_pages(self)) + return false; + + copy_last_line(&self->page1, last_line->text); + copy_first_line(&self->page2, first_line->text); + + return sscanf(last_line->text, "%lx-%lx", &last_line->start_addr, + &last_line->end_addr) == 2 && + sscanf(first_line->text, "%lx-%lx", &first_line->start_addr, + &first_line->end_addr) == 2; +} + +/* Thread synchronization routines */ +static void wait_for_state(struct vma_modifier_info *mod_info, enum test_state state) +{ + pthread_mutex_lock(&mod_info->sync_lock); + while (mod_info->curr_state != state) + pthread_cond_wait(&mod_info->sync_cond, &mod_info->sync_lock); + 
pthread_mutex_unlock(&mod_info->sync_lock); +} + +static void signal_state(struct vma_modifier_info *mod_info, enum test_state state) +{ + pthread_mutex_lock(&mod_info->sync_lock); + mod_info->curr_state = state; + pthread_cond_signal(&mod_info->sync_cond); + pthread_mutex_unlock(&mod_info->sync_lock); +} + +static void stop_vma_modifier(struct vma_modifier_info *mod_info) +{ + wait_for_state(mod_info, SETUP_READY); + mod_info->exit = true; + signal_state(mod_info, SETUP_MODIFY_MAPS); +} + +static bool capture_mod_pattern(FIXTURE_DATA(proc_maps_race) *self, + struct line_content *mod_last_line, + struct line_content *mod_first_line, + struct line_content *restored_last_line, + struct line_content *restored_first_line) +{ + signal_state(self->mod_info, SETUP_MODIFY_MAPS); + wait_for_state(self->mod_info, SETUP_MAPS_MODIFIED); + + /* Copy last line of the first page and first line of the last page */ + if (!read_boundary_lines(self, mod_last_line, mod_first_line)) + return false; + + signal_state(self->mod_info, SETUP_RESTORE_MAPS); + wait_for_state(self->mod_info, SETUP_MAPS_RESTORED); + + /* Copy last line of the first page and first line of the last page */ + if (!read_boundary_lines(self, restored_last_line, restored_first_line)) + return false; + + if (!self->mod_info->vma_mod_check(mod_last_line, mod_first_line, + restored_last_line, restored_first_line)) + return false; + + /* + * The content of these lines after modify+resore should be the same + * as the original. + */ + return strcmp(restored_last_line->text, self->last_line.text) == 0 && + strcmp(restored_first_line->text, self->first_line.text) == 0; +} + +static inline bool split_vma(FIXTURE_DATA(proc_maps_race) *self) +{ + return mmap(self->mod_info->addr, self->page_size, self->mod_info->prot | PROT_EXEC, + MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) != MAP_FAILED; +} + +static inline bool merge_vma(FIXTURE_DATA(proc_maps_race) *self) +{ + return mmap(self->mod_info->addr, self->page_size, self->mod_info->prot, + MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) != MAP_FAILED; +} + +static inline bool check_split_result(struct line_content *mod_last_line, + struct line_content *mod_first_line, + struct line_content *restored_last_line, + struct line_content *restored_first_line) +{ + /* Make sure vmas at the boundaries are changing */ + return strcmp(mod_last_line->text, restored_last_line->text) != 0 && + strcmp(mod_first_line->text, restored_first_line->text) != 0; +} + +FIXTURE_SETUP(proc_maps_race) +{ + const char *duration = getenv("DURATION"); + struct vma_modifier_info *mod_info; + pthread_mutexattr_t mutex_attr; + pthread_condattr_t cond_attr; + unsigned long duration_sec; + char fname[32]; + + self->page_size = (unsigned long)sysconf(_SC_PAGESIZE); + duration_sec = duration ? atol(duration) : 0; + self->duration_sec = duration_sec ? duration_sec : 5UL; + + /* + * Have to map enough vmas for /proc/pid/maps to contain more than one + * page worth of vmas. 
Assume at least 32 bytes per line in maps output + */ + self->vma_count = self->page_size / 32 + 1; + self->shared_mem_size = sizeof(struct vma_modifier_info) + self->vma_count * sizeof(void *); + + /* map shared memory for communication with the child process */ + self->mod_info = (struct vma_modifier_info *)mmap(NULL, self->shared_mem_size, + PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); + ASSERT_NE(self->mod_info, MAP_FAILED); + mod_info = self->mod_info; + + /* Initialize shared members */ + pthread_mutexattr_init(&mutex_attr); + pthread_mutexattr_setpshared(&mutex_attr, PTHREAD_PROCESS_SHARED); + ASSERT_EQ(pthread_mutex_init(&mod_info->sync_lock, &mutex_attr), 0); + pthread_condattr_init(&cond_attr); + pthread_condattr_setpshared(&cond_attr, PTHREAD_PROCESS_SHARED); + ASSERT_EQ(pthread_cond_init(&mod_info->sync_cond, &cond_attr), 0); + mod_info->vma_count = self->vma_count; + mod_info->curr_state = INIT; + mod_info->exit = false; + + self->pid = fork(); + if (!self->pid) { + /* Child process modifying the address space */ + int prot = PROT_READ | PROT_WRITE; + int i; + + for (i = 0; i < mod_info->vma_count; i++) { + mod_info->child_mapped_addr[i] = mmap(NULL, self->page_size * 3, prot, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + ASSERT_NE(mod_info->child_mapped_addr[i], MAP_FAILED); + /* change protection in adjacent maps to prevent merging */ + prot ^= PROT_WRITE; + } + signal_state(mod_info, CHILD_READY); + wait_for_state(mod_info, PARENT_READY); + while (true) { + signal_state(mod_info, SETUP_READY); + wait_for_state(mod_info, SETUP_MODIFY_MAPS); + if (mod_info->exit) + break; + + ASSERT_TRUE(mod_info->vma_modify(self)); + signal_state(mod_info, SETUP_MAPS_MODIFIED); + wait_for_state(mod_info, SETUP_RESTORE_MAPS); + ASSERT_TRUE(mod_info->vma_restore(self)); + signal_state(mod_info, SETUP_MAPS_RESTORED); + + wait_for_state(mod_info, TEST_READY); + while (mod_info->curr_state != TEST_DONE) { + ASSERT_TRUE(mod_info->vma_modify(self)); + ASSERT_TRUE(mod_info->vma_restore(self)); + } + } + for (i = 0; i < mod_info->vma_count; i++) + munmap(mod_info->child_mapped_addr[i], self->page_size * 3); + + exit(0); + } + + sprintf(fname, "/proc/%d/maps", self->pid); + self->maps_fd = open(fname, O_RDONLY); + ASSERT_NE(self->maps_fd, -1); + + /* Wait for the child to map the VMAs */ + wait_for_state(mod_info, CHILD_READY); + + /* Read first two pages */ + self->page1.data = malloc(self->page_size); + ASSERT_NE(self->page1.data, NULL); + self->page2.data = malloc(self->page_size); + ASSERT_NE(self->page2.data, NULL); + + ASSERT_TRUE(read_boundary_lines(self, &self->last_line, &self->first_line)); + + /* + * Find the addresses corresponding to the last line in the first page + * and the first line in the last page. 
+ */ + mod_info->addr = NULL; + mod_info->next_addr = NULL; + for (int i = 0; i < mod_info->vma_count; i++) { + if (mod_info->child_mapped_addr[i] == (void *)self->last_line.start_addr) { + mod_info->addr = mod_info->child_mapped_addr[i]; + mod_info->prot = PROT_READ; + /* Even VMAs have write permission */ + if ((i % 2) == 0) + mod_info->prot |= PROT_WRITE; + } else if (mod_info->child_mapped_addr[i] == (void *)self->first_line.start_addr) { + mod_info->next_addr = mod_info->child_mapped_addr[i]; + } + + if (mod_info->addr && mod_info->next_addr) + break; + } + ASSERT_TRUE(mod_info->addr && mod_info->next_addr); + + signal_state(mod_info, PARENT_READY); + +} + +FIXTURE_TEARDOWN(proc_maps_race) +{ + int status; + + stop_vma_modifier(self->mod_info); + + free(self->page2.data); + free(self->page1.data); + + for (int i = 0; i < self->vma_count; i++) + munmap(self->mod_info->child_mapped_addr[i], self->page_size); + close(self->maps_fd); + waitpid(self->pid, &status, 0); + munmap(self->mod_info, self->shared_mem_size); +} + +TEST_F(proc_maps_race, test_maps_tearing_from_split) +{ + struct vma_modifier_info *mod_info = self->mod_info; + + struct line_content split_last_line; + struct line_content split_first_line; + struct line_content restored_last_line; + struct line_content restored_first_line; + + wait_for_state(mod_info, SETUP_READY); + + /* re-read the file to avoid using stale data from previous test */ + ASSERT_TRUE(read_boundary_lines(self, &self->last_line, &self->first_line)); + + mod_info->vma_modify = split_vma; + mod_info->vma_restore = merge_vma; + mod_info->vma_mod_check = check_split_result; + + ASSERT_TRUE(capture_mod_pattern(self, &split_last_line, &split_first_line, + &restored_last_line, &restored_first_line)); + + /* Now start concurrent modifications for self->duration_sec */ + signal_state(mod_info, TEST_READY); + + struct line_content new_last_line; + struct line_content new_first_line; + struct timespec start_ts, end_ts; + + clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts); + do { + bool last_line_changed; + bool first_line_changed; + + ASSERT_TRUE(read_boundary_lines(self, &new_last_line, &new_first_line)); + + /* Check if we read vmas after split */ + if (!strcmp(new_last_line.text, split_last_line.text)) { + /* + * The vmas should be consistent with split results, + * however if vma was concurrently restored after a + * split, it can be reported twice (first the original + * split one, then the same vma but extended after the + * merge) because we found it as the next vma again. + * In that case new first line will be the same as the + * last restored line. + */ + ASSERT_FALSE(strcmp(new_first_line.text, split_first_line.text) && + strcmp(new_first_line.text, restored_last_line.text)); + } else { + /* The vmas should be consistent with merge results */ + ASSERT_FALSE(strcmp(new_last_line.text, restored_last_line.text)); + ASSERT_FALSE(strcmp(new_first_line.text, restored_first_line.text)); + } + /* + * First and last lines should change in unison. If the last + * line changed then the first line should change as well and + * vice versa. 
+ */ + last_line_changed = strcmp(new_last_line.text, self->last_line.text) != 0; + first_line_changed = strcmp(new_first_line.text, self->first_line.text) != 0; + ASSERT_EQ(last_line_changed, first_line_changed); + + clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts); + } while (end_ts.tv_sec - start_ts.tv_sec < self->duration_sec); + + /* Signal the modifyer thread to stop and wait until it exits */ + signal_state(mod_info, TEST_DONE); +} + +TEST_HARNESS_MAIN From b11d9e2d7883031afb570545cb9657a5c457349a Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Sat, 19 Jul 2025 11:28:50 -0700 Subject: [PATCH 308/370] selftests/proc: extend /proc/pid/maps tearing test to include vma resizing MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Test that /proc/pid/maps does not report unexpected holes in the address space when a vma at the edge of the page is being concurrently remapped. This remapping results in the vma shrinking and expanding from under the reader. We should always see either shrunk or expanded (original) version of the vma. Link: https://lkml.kernel.org/r/20250719182854.3166724-3-surenb@google.com Signed-off-by: Suren Baghdasaryan Cc: Alexey Dobriyan Cc: Andrii Nakryiko Cc: Christian Brauner Cc: Christophe Leroy Cc: David Hildenbrand Cc: Jann Horn Cc: Jeongjun Park Cc: Johannes Weiner Cc: Josef Bacik Cc: Kalesh Singh Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Oscar Salvador Cc: "Paul E . McKenney" Cc: Peter Xu Cc: Ryan Roberts Cc: Shuah Khan Cc: Thomas Weißschuh Cc: T.J. Mercier Cc: Vlastimil Babka Cc: Ye Bin Signed-off-by: Andrew Morton --- tools/testing/selftests/proc/proc-maps-race.c | 79 +++++++++++++++++++ 1 file changed, 79 insertions(+) diff --git a/tools/testing/selftests/proc/proc-maps-race.c b/tools/testing/selftests/proc/proc-maps-race.c index 5b28dda08b7d..19028bd3b85c 100644 --- a/tools/testing/selftests/proc/proc-maps-race.c +++ b/tools/testing/selftests/proc/proc-maps-race.c @@ -242,6 +242,28 @@ static inline bool check_split_result(struct line_content *mod_last_line, strcmp(mod_first_line->text, restored_first_line->text) != 0; } +static inline bool shrink_vma(FIXTURE_DATA(proc_maps_race) *self) +{ + return mremap(self->mod_info->addr, self->page_size * 3, + self->page_size, 0) != MAP_FAILED; +} + +static inline bool expand_vma(FIXTURE_DATA(proc_maps_race) *self) +{ + return mremap(self->mod_info->addr, self->page_size, + self->page_size * 3, 0) != MAP_FAILED; +} + +static inline bool check_shrink_result(struct line_content *mod_last_line, + struct line_content *mod_first_line, + struct line_content *restored_last_line, + struct line_content *restored_first_line) +{ + /* Make sure only the last vma of the first page is changing */ + return strcmp(mod_last_line->text, restored_last_line->text) != 0 && + strcmp(mod_first_line->text, restored_first_line->text) == 0; +} + FIXTURE_SETUP(proc_maps_race) { const char *duration = getenv("DURATION"); @@ -444,4 +466,61 @@ TEST_F(proc_maps_race, test_maps_tearing_from_split) signal_state(mod_info, TEST_DONE); } +TEST_F(proc_maps_race, test_maps_tearing_from_resize) +{ + struct vma_modifier_info *mod_info = self->mod_info; + + struct line_content shrunk_last_line; + struct line_content shrunk_first_line; + struct line_content restored_last_line; + struct line_content restored_first_line; + + wait_for_state(mod_info, SETUP_READY); + + /* re-read the file to avoid using stale data from previous test */ + ASSERT_TRUE(read_boundary_lines(self, 
&self->last_line, &self->first_line)); + + mod_info->vma_modify = shrink_vma; + mod_info->vma_restore = expand_vma; + mod_info->vma_mod_check = check_shrink_result; + + ASSERT_TRUE(capture_mod_pattern(self, &shrunk_last_line, &shrunk_first_line, + &restored_last_line, &restored_first_line)); + + /* Now start concurrent modifications for self->duration_sec */ + signal_state(mod_info, TEST_READY); + + struct line_content new_last_line; + struct line_content new_first_line; + struct timespec start_ts, end_ts; + + clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts); + do { + ASSERT_TRUE(read_boundary_lines(self, &new_last_line, &new_first_line)); + + /* Check if we read vmas after shrinking it */ + if (!strcmp(new_last_line.text, shrunk_last_line.text)) { + /* + * The vmas should be consistent with shrunk results, + * however if the vma was concurrently restored, it + * can be reported twice (first as shrunk one, then + * as restored one) because we found it as the next vma + * again. In that case new first line will be the same + * as the last restored line. + */ + ASSERT_FALSE(strcmp(new_first_line.text, shrunk_first_line.text) && + strcmp(new_first_line.text, restored_last_line.text)); + } else { + /* The vmas should be consistent with the original/resored state */ + ASSERT_FALSE(strcmp(new_last_line.text, restored_last_line.text)); + ASSERT_FALSE(strcmp(new_first_line.text, restored_first_line.text)); + } + + clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts); + } while (end_ts.tv_sec - start_ts.tv_sec < self->duration_sec); + + /* Signal the modifyer thread to stop and wait until it exits */ + signal_state(mod_info, TEST_DONE); +} + TEST_HARNESS_MAIN From 6a45336b9b6fcc1d9ef3c6fe9a43252f9a20084b Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Sat, 19 Jul 2025 11:28:51 -0700 Subject: [PATCH 309/370] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Test that /proc/pid/maps does not report unexpected holes in the address space when we concurrently remap a part of a vma into the middle of another vma. This remapping results in the destination vma being split into three parts and the part in the middle being patched back from, all done concurrently from under the reader. We should always see either original vma or the split one with no holes. Link: https://lkml.kernel.org/r/20250719182854.3166724-4-surenb@google.com Signed-off-by: Suren Baghdasaryan Cc: Alexey Dobriyan Cc: Andrii Nakryiko Cc: Christian Brauner Cc: Christophe Leroy Cc: David Hildenbrand Cc: Jann Horn Cc: Jeongjun Park Cc: Johannes Weiner Cc: Josef Bacik Cc: Kalesh Singh Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Oscar Salvador Cc: "Paul E . McKenney" Cc: Peter Xu Cc: Ryan Roberts Cc: Shuah Khan Cc: Thomas Weißschuh Cc: T.J. 
Mercier Cc: Vlastimil Babka Cc: Ye Bin Signed-off-by: Andrew Morton --- tools/testing/selftests/proc/proc-maps-race.c | 86 +++++++++++++++++++ 1 file changed, 86 insertions(+) diff --git a/tools/testing/selftests/proc/proc-maps-race.c b/tools/testing/selftests/proc/proc-maps-race.c index 19028bd3b85c..bc614a2d944a 100644 --- a/tools/testing/selftests/proc/proc-maps-race.c +++ b/tools/testing/selftests/proc/proc-maps-race.c @@ -264,6 +264,35 @@ static inline bool check_shrink_result(struct line_content *mod_last_line, strcmp(mod_first_line->text, restored_first_line->text) == 0; } +static inline bool remap_vma(FIXTURE_DATA(proc_maps_race) *self) +{ + /* + * Remap the last page of the next vma into the middle of the vma. + * This splits the current vma and the first and middle parts (the + * parts at lower addresses) become the last vma objserved in the + * first page and the first vma observed in the last page. + */ + return mremap(self->mod_info->next_addr + self->page_size * 2, self->page_size, + self->page_size, MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP, + self->mod_info->addr + self->page_size) != MAP_FAILED; +} + +static inline bool patch_vma(FIXTURE_DATA(proc_maps_race) *self) +{ + return mprotect(self->mod_info->addr + self->page_size, self->page_size, + self->mod_info->prot) == 0; +} + +static inline bool check_remap_result(struct line_content *mod_last_line, + struct line_content *mod_first_line, + struct line_content *restored_last_line, + struct line_content *restored_first_line) +{ + /* Make sure vmas at the boundaries are changing */ + return strcmp(mod_last_line->text, restored_last_line->text) != 0 && + strcmp(mod_first_line->text, restored_first_line->text) != 0; +} + FIXTURE_SETUP(proc_maps_race) { const char *duration = getenv("DURATION"); @@ -523,4 +552,61 @@ TEST_F(proc_maps_race, test_maps_tearing_from_resize) signal_state(mod_info, TEST_DONE); } +TEST_F(proc_maps_race, test_maps_tearing_from_remap) +{ + struct vma_modifier_info *mod_info = self->mod_info; + + struct line_content remapped_last_line; + struct line_content remapped_first_line; + struct line_content restored_last_line; + struct line_content restored_first_line; + + wait_for_state(mod_info, SETUP_READY); + + /* re-read the file to avoid using stale data from previous test */ + ASSERT_TRUE(read_boundary_lines(self, &self->last_line, &self->first_line)); + + mod_info->vma_modify = remap_vma; + mod_info->vma_restore = patch_vma; + mod_info->vma_mod_check = check_remap_result; + + ASSERT_TRUE(capture_mod_pattern(self, &remapped_last_line, &remapped_first_line, + &restored_last_line, &restored_first_line)); + + /* Now start concurrent modifications for self->duration_sec */ + signal_state(mod_info, TEST_READY); + + struct line_content new_last_line; + struct line_content new_first_line; + struct timespec start_ts, end_ts; + + clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts); + do { + ASSERT_TRUE(read_boundary_lines(self, &new_last_line, &new_first_line)); + + /* Check if we read vmas after remapping it */ + if (!strcmp(new_last_line.text, remapped_last_line.text)) { + /* + * The vmas should be consistent with remap results, + * however if the vma was concurrently restored, it + * can be reported twice (first as split one, then + * as restored one) because we found it as the next vma + * again. In that case new first line will be the same + * as the last restored line. 
+ */ + ASSERT_FALSE(strcmp(new_first_line.text, remapped_first_line.text) && + strcmp(new_first_line.text, restored_last_line.text)); + } else { + /* The vmas should be consistent with the original/resored state */ + ASSERT_FALSE(strcmp(new_last_line.text, restored_last_line.text)); + ASSERT_FALSE(strcmp(new_first_line.text, restored_first_line.text)); + } + + clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts); + } while (end_ts.tv_sec - start_ts.tv_sec < self->duration_sec); + + /* Signal the modifyer thread to stop and wait until it exits */ + signal_state(mod_info, TEST_DONE); +} + TEST_HARNESS_MAIN From aadc099c480f98817849cd44585904402160fe21 Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Sat, 19 Jul 2025 11:28:52 -0700 Subject: [PATCH 310/370] selftests/proc: add verbose mode for /proc/pid/maps tearing tests MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add verbose mode to the /proc/pid/maps tearing tests to print debugging information. VERBOSE environment variable is used to enable it. Usage example: VERBOSE=1 ./proc-maps-race Link: https://lkml.kernel.org/r/20250719182854.3166724-5-surenb@google.com Signed-off-by: Suren Baghdasaryan Cc: Alexey Dobriyan Cc: Andrii Nakryiko Cc: Christian Brauner Cc: Christophe Leroy Cc: David Hildenbrand Cc: Jann Horn Cc: Jeongjun Park Cc: Johannes Weiner Cc: Josef Bacik Cc: Kalesh Singh Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Oscar Salvador Cc: "Paul E . McKenney" Cc: Peter Xu Cc: Ryan Roberts Cc: Shuah Khan Cc: Thomas Weißschuh Cc: T.J. Mercier Cc: Vlastimil Babka Cc: Ye Bin Signed-off-by: Andrew Morton --- tools/testing/selftests/proc/proc-maps-race.c | 153 ++++++++++++++++-- 1 file changed, 141 insertions(+), 12 deletions(-) diff --git a/tools/testing/selftests/proc/proc-maps-race.c b/tools/testing/selftests/proc/proc-maps-race.c index bc614a2d944a..66773685a047 100644 --- a/tools/testing/selftests/proc/proc-maps-race.c +++ b/tools/testing/selftests/proc/proc-maps-race.c @@ -77,6 +77,7 @@ FIXTURE(proc_maps_race) int shared_mem_size; int page_size; int vma_count; + bool verbose; int maps_fd; pid_t pid; }; @@ -188,12 +189,104 @@ static void stop_vma_modifier(struct vma_modifier_info *mod_info) signal_state(mod_info, SETUP_MODIFY_MAPS); } +static void print_first_lines(char *text, int nr) +{ + const char *end = text; + + while (nr && (end = strchr(end, '\n')) != NULL) { + nr--; + end++; + } + + if (end) { + int offs = end - text; + + text[offs] = '\0'; + printf(text); + text[offs] = '\n'; + printf("\n"); + } else { + printf(text); + } +} + +static void print_last_lines(char *text, int nr) +{ + const char *start = text + strlen(text); + + nr++; /* to ignore the last newline */ + while (nr) { + while (start > text && *start != '\n') + start--; + nr--; + start--; + } + printf(start); +} + +static void print_boundaries(const char *title, FIXTURE_DATA(proc_maps_race) *self) +{ + if (!self->verbose) + return; + + printf("%s", title); + /* Print 3 boundary lines from each page */ + print_last_lines(self->page1.data, 3); + printf("-----------------page boundary-----------------\n"); + print_first_lines(self->page2.data, 3); +} + +static bool print_boundaries_on(bool condition, const char *title, + FIXTURE_DATA(proc_maps_race) *self) +{ + if (self->verbose && condition) + print_boundaries(title, self); + + return condition; +} + +static void report_test_start(const char *name, bool verbose) +{ + if (verbose) + printf("==== %s ====\n", name); +} + +static struct 
timespec print_ts; + +static void start_test_loop(struct timespec *ts, bool verbose) +{ + if (verbose) + print_ts.tv_sec = ts->tv_sec; +} + +static void end_test_iteration(struct timespec *ts, bool verbose) +{ + if (!verbose) + return; + + /* Update every second */ + if (print_ts.tv_sec == ts->tv_sec) + return; + + printf("."); + fflush(stdout); + print_ts.tv_sec = ts->tv_sec; +} + +static void end_test_loop(bool verbose) +{ + if (verbose) + printf("\n"); +} + static bool capture_mod_pattern(FIXTURE_DATA(proc_maps_race) *self, struct line_content *mod_last_line, struct line_content *mod_first_line, struct line_content *restored_last_line, struct line_content *restored_first_line) { + print_boundaries("Before modification", self); + signal_state(self->mod_info, SETUP_MODIFY_MAPS); wait_for_state(self->mod_info, SETUP_MAPS_MODIFIED); @@ -201,6 +294,8 @@ static bool capture_mod_pattern(FIXTURE_DATA(proc_maps_race) *self, if (!read_boundary_lines(self, mod_last_line, mod_first_line)) return false; + print_boundaries("After modification", self); + signal_state(self->mod_info, SETUP_RESTORE_MAPS); wait_for_state(self->mod_info, SETUP_MAPS_RESTORED); @@ -208,6 +303,8 @@ static bool capture_mod_pattern(FIXTURE_DATA(proc_maps_race) *self, if (!read_boundary_lines(self, restored_last_line, restored_first_line)) return false; + print_boundaries("After restore", self); + if (!self->mod_info->vma_mod_check(mod_last_line, mod_first_line, restored_last_line, restored_first_line)) return false; @@ -295,6 +392,7 @@ static inline bool check_remap_result(struct line_content *mod_last_line, FIXTURE_SETUP(proc_maps_race) { + const char *verbose = getenv("VERBOSE"); const char *duration = getenv("DURATION"); struct vma_modifier_info *mod_info; pthread_mutexattr_t mutex_attr; @@ -303,6 +401,7 @@ FIXTURE_SETUP(proc_maps_race) char fname[32]; self->page_size = (unsigned long)sysconf(_SC_PAGESIZE); + self->verbose = verbose && !strncmp(verbose, "1", 1); duration_sec = duration ? atol(duration) : 0; self->duration_sec = duration_sec ? duration_sec : 5UL; @@ -444,6 +543,7 @@ TEST_F(proc_maps_race, test_maps_tearing_from_split) mod_info->vma_restore = merge_vma; mod_info->vma_mod_check = check_split_result; + report_test_start("Tearing from split", self->verbose); ASSERT_TRUE(capture_mod_pattern(self, &split_last_line, &split_first_line, &restored_last_line, &restored_first_line)); @@ -455,6 +555,7 @@ TEST_F(proc_maps_race, test_maps_tearing_from_split) struct timespec start_ts, end_ts; clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts); + start_test_loop(&start_ts, self->verbose); do { bool last_line_changed; bool first_line_changed; @@ -472,12 +573,18 @@ TEST_F(proc_maps_race, test_maps_tearing_from_split) * In that case new first line will be the same as the * last restored line. 
*/ - ASSERT_FALSE(strcmp(new_first_line.text, split_first_line.text) && - strcmp(new_first_line.text, restored_last_line.text)); + ASSERT_FALSE(print_boundaries_on( + strcmp(new_first_line.text, split_first_line.text) && + strcmp(new_first_line.text, restored_last_line.text), + "Split result invalid", self)); } else { /* The vmas should be consistent with merge results */ - ASSERT_FALSE(strcmp(new_last_line.text, restored_last_line.text)); - ASSERT_FALSE(strcmp(new_first_line.text, restored_first_line.text)); + ASSERT_FALSE(print_boundaries_on( + strcmp(new_last_line.text, restored_last_line.text), + "Merge result invalid", self)); + ASSERT_FALSE(print_boundaries_on( + strcmp(new_first_line.text, restored_first_line.text), + "Merge result invalid", self)); } /* * First and last lines should change in unison. If the last @@ -489,7 +596,9 @@ TEST_F(proc_maps_race, test_maps_tearing_from_split) ASSERT_EQ(last_line_changed, first_line_changed); clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts); + end_test_iteration(&end_ts, self->verbose); } while (end_ts.tv_sec - start_ts.tv_sec < self->duration_sec); + end_test_loop(self->verbose); /* Signal the modifyer thread to stop and wait until it exits */ signal_state(mod_info, TEST_DONE); @@ -513,6 +622,7 @@ TEST_F(proc_maps_race, test_maps_tearing_from_resize) mod_info->vma_restore = expand_vma; mod_info->vma_mod_check = check_shrink_result; + report_test_start("Tearing from resize", self->verbose); ASSERT_TRUE(capture_mod_pattern(self, &shrunk_last_line, &shrunk_first_line, &restored_last_line, &restored_first_line)); @@ -524,6 +634,7 @@ TEST_F(proc_maps_race, test_maps_tearing_from_resize) struct timespec start_ts, end_ts; clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts); + start_test_loop(&start_ts, self->verbose); do { ASSERT_TRUE(read_boundary_lines(self, &new_last_line, &new_first_line)); @@ -537,16 +648,24 @@ TEST_F(proc_maps_race, test_maps_tearing_from_resize) * again. In that case new first line will be the same * as the last restored line. 
*/ - ASSERT_FALSE(strcmp(new_first_line.text, shrunk_first_line.text) && - strcmp(new_first_line.text, restored_last_line.text)); + ASSERT_FALSE(print_boundaries_on( + strcmp(new_first_line.text, shrunk_first_line.text) && + strcmp(new_first_line.text, restored_last_line.text), + "Shrink result invalid", self)); } else { /* The vmas should be consistent with the original/resored state */ - ASSERT_FALSE(strcmp(new_last_line.text, restored_last_line.text)); - ASSERT_FALSE(strcmp(new_first_line.text, restored_first_line.text)); + ASSERT_FALSE(print_boundaries_on( + strcmp(new_last_line.text, restored_last_line.text), + "Expand result invalid", self)); + ASSERT_FALSE(print_boundaries_on( + strcmp(new_first_line.text, restored_first_line.text), + "Expand result invalid", self)); } clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts); + end_test_iteration(&end_ts, self->verbose); } while (end_ts.tv_sec - start_ts.tv_sec < self->duration_sec); + end_test_loop(self->verbose); /* Signal the modifyer thread to stop and wait until it exits */ signal_state(mod_info, TEST_DONE); @@ -570,6 +689,7 @@ TEST_F(proc_maps_race, test_maps_tearing_from_remap) mod_info->vma_restore = patch_vma; mod_info->vma_mod_check = check_remap_result; + report_test_start("Tearing from remap", self->verbose); ASSERT_TRUE(capture_mod_pattern(self, &remapped_last_line, &remapped_first_line, &restored_last_line, &restored_first_line)); @@ -581,6 +701,7 @@ TEST_F(proc_maps_race, test_maps_tearing_from_remap) struct timespec start_ts, end_ts; clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts); + start_test_loop(&start_ts, self->verbose); do { ASSERT_TRUE(read_boundary_lines(self, &new_last_line, &new_first_line)); @@ -594,16 +715,24 @@ TEST_F(proc_maps_race, test_maps_tearing_from_remap) * again. In that case new first line will be the same * as the last restored line. */ - ASSERT_FALSE(strcmp(new_first_line.text, remapped_first_line.text) && - strcmp(new_first_line.text, restored_last_line.text)); + ASSERT_FALSE(print_boundaries_on( + strcmp(new_first_line.text, remapped_first_line.text) && + strcmp(new_first_line.text, restored_last_line.text), + "Remap result invalid", self)); } else { /* The vmas should be consistent with the original/resored state */ - ASSERT_FALSE(strcmp(new_last_line.text, restored_last_line.text)); - ASSERT_FALSE(strcmp(new_first_line.text, restored_first_line.text)); + ASSERT_FALSE(print_boundaries_on( + strcmp(new_last_line.text, restored_last_line.text), + "Remap restore result invalid", self)); + ASSERT_FALSE(print_boundaries_on( + strcmp(new_first_line.text, restored_first_line.text), + "Remap restore result invalid", self)); } clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts); + end_test_iteration(&end_ts, self->verbose); } while (end_ts.tv_sec - start_ts.tv_sec < self->duration_sec); + end_test_loop(self->verbose); /* Signal the modifyer thread to stop and wait until it exits */ signal_state(mod_info, TEST_DONE); From 03d98703f7e172778786ebd7c5f2471d0f65d3a6 Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Sat, 19 Jul 2025 11:28:53 -0700 Subject: [PATCH 311/370] fs/proc/task_mmu: remove conversion of seq_file position to unsigned MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Back in 2.6 era, last_addr used to be stored in seq_file->version variable, which was unsigned long. As a result, sentinels to represent gate vma and end of all vmas used unsigned values. 
In more recent kernels we don't used seq_file->version anymore and therefore conversion from loff_t into unsigned type is not needed. Similarly, sentinel values don't need to be unsigned. Remove type conversion for set_file position and change sentinel values to signed. While at it, change the hardcoded sentinel values with named definitions for better documentation. Link: https://lkml.kernel.org/r/20250719182854.3166724-6-surenb@google.com Signed-off-by: Suren Baghdasaryan Reviewed-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Acked-by: David Hildenbrand Cc: Alexey Dobriyan Cc: Andrii Nakryiko Cc: Christian Brauner Cc: Christophe Leroy Cc: Jann Horn Cc: Jeongjun Park Cc: Johannes Weiner Cc: Josef Bacik Cc: Kalesh Singh Cc: Liam Howlett Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Oscar Salvador Cc: "Paul E . McKenney" Cc: Peter Xu Cc: Ryan Roberts Cc: Shuah Khan Cc: Thomas Weißschuh Cc: T.J. Mercier Cc: Ye Bin Signed-off-by: Andrew Morton --- fs/proc/task_mmu.c | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 751479eb128f..90237df1ed33 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -29,6 +29,9 @@ #include #include "internal.h" +#define SENTINEL_VMA_END -1 +#define SENTINEL_VMA_GATE -2 + #define SEQ_PUT_DEC(str, val) \ seq_put_decimal_ull_width(m, str, (val) << (PAGE_SHIFT-10), 8) void task_mem(struct seq_file *m, struct mm_struct *mm) @@ -135,7 +138,7 @@ static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv, if (vma) { *ppos = vma->vm_start; } else { - *ppos = -2UL; + *ppos = SENTINEL_VMA_GATE; vma = get_gate_vma(priv->mm); } @@ -145,11 +148,11 @@ static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv, static void *m_start(struct seq_file *m, loff_t *ppos) { struct proc_maps_private *priv = m->private; - unsigned long last_addr = *ppos; + loff_t last_addr = *ppos; struct mm_struct *mm; /* See m_next(). Zero at the start or after lseek. */ - if (last_addr == -1UL) + if (last_addr == SENTINEL_VMA_END) return NULL; priv->task = get_proc_task(priv->inode); @@ -170,9 +173,9 @@ static void *m_start(struct seq_file *m, loff_t *ppos) return ERR_PTR(-EINTR); } - vma_iter_init(&priv->iter, mm, last_addr); + vma_iter_init(&priv->iter, mm, (unsigned long)last_addr); hold_task_mempolicy(priv); - if (last_addr == -2UL) + if (last_addr == SENTINEL_VMA_GATE) return get_gate_vma(mm); return proc_get_vma(priv, ppos); @@ -180,8 +183,8 @@ static void *m_start(struct seq_file *m, loff_t *ppos) static void *m_next(struct seq_file *m, void *v, loff_t *ppos) { - if (*ppos == -2UL) { - *ppos = -1UL; + if (*ppos == SENTINEL_VMA_GATE) { + *ppos = SENTINEL_VMA_END; return NULL; } return proc_get_vma(m->private, ppos); From 5631da56c9a87ea41d69d1bbbc1cee327eb9354b Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Sat, 19 Jul 2025 11:28:54 -0700 Subject: [PATCH 312/370] fs/proc/task_mmu: read proc/pid/maps under per-vma lock MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit With maple_tree supporting vma tree traversal under RCU and per-vma locks, /proc/pid/maps can be read while holding individual vma locks instead of locking the entire address space. A completely lockless approach (walking vma tree under RCU) would be quite complex with the main issue being get_vma_name() using callbacks which might not work correctly with a stable vma copy, requiring original (unstable) vma - see special_mapping_name() for example. 
When per-vma lock acquisition fails, we take the mmap_lock for reading, lock the vma, release the mmap_lock and continue. This fallback to mmap read lock guarantees the reader to make forward progress even during lock contention. This will interfere with the writer but for a very short time while we are acquiring the per-vma lock and only when there was contention on the vma reader is interested in. We shouldn't see a repeated fallback to mmap read locks in practice, as this require a very unlikely series of lock contentions (for instance due to repeated vma split operations). However even if this did somehow happen, we would still progress. One case requiring special handling is when a vma changes between the time it was found and the time it got locked. A problematic case would be if a vma got shrunk so that its vm_start moved higher in the address space and a new vma was installed at the beginning: reader found: |--------VMA A--------| VMA is modified: |-VMA B-|----VMA A----| reader locks modified VMA A reader reports VMA A: | gap |----VMA A----| This would result in reporting a gap in the address space that does not exist. To prevent this we retry the lookup after locking the vma, however we do that only when we identify a gap and detect that the address space was changed after we found the vma. This change is designed to reduce mmap_lock contention and prevent a process reading /proc/pid/maps files (often a low priority task, such as monitoring/data collection services) from blocking address space updates. Note that this change has a userspace visible disadvantage: it allows for sub-page data tearing as opposed to the previous mechanism where data tearing could happen only between pages of generated output data. Since current userspace considers data tearing between pages to be acceptable, we assume is will be able to handle sub-page data tearing as well. Link: https://lkml.kernel.org/r/20250719182854.3166724-7-surenb@google.com Signed-off-by: Suren Baghdasaryan Reviewed-by: Vlastimil Babka Cc: Alexey Dobriyan Cc: Andrii Nakryiko Cc: Christian Brauner Cc: Christophe Leroy Cc: David Hildenbrand Cc: Jann Horn Cc: Jeongjun Park Cc: Johannes Weiner Cc: Josef Bacik Cc: Kalesh Singh Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Oscar Salvador Cc: "Paul E . McKenney" Cc: Peter Xu Cc: Ryan Roberts Cc: Shuah Khan Cc: Thomas Weißschuh Cc: T.J. 
Mercier Cc: Ye Bin Signed-off-by: Andrew Morton --- fs/proc/internal.h | 5 ++ fs/proc/task_mmu.c | 143 +++++++++++++++++++++++++++++++++++--- include/linux/mmap_lock.h | 11 +++ mm/madvise.c | 3 +- mm/mmap_lock.c | 93 +++++++++++++++++++++++++ 5 files changed, 245 insertions(+), 10 deletions(-) diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 3d48ffe72583..7c235451c5ea 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -384,6 +384,11 @@ struct proc_maps_private { struct task_struct *task; struct mm_struct *mm; struct vma_iterator iter; + loff_t last_pos; +#ifdef CONFIG_PER_VMA_LOCK + bool mmap_locked; + struct vm_area_struct *locked_vma; +#endif #ifdef CONFIG_NUMA struct mempolicy *task_mempolicy; #endif diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 90237df1ed33..3d6d8a9f13fc 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -130,13 +130,132 @@ static void release_task_mempolicy(struct proc_maps_private *priv) } #endif -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv, - loff_t *ppos) -{ - struct vm_area_struct *vma = vma_next(&priv->iter); +#ifdef CONFIG_PER_VMA_LOCK +static void unlock_vma(struct proc_maps_private *priv) +{ + if (priv->locked_vma) { + vma_end_read(priv->locked_vma); + priv->locked_vma = NULL; + } +} + +static const struct seq_operations proc_pid_maps_op; + +static inline bool lock_vma_range(struct seq_file *m, + struct proc_maps_private *priv) +{ + /* + * smaps and numa_maps perform page table walk, therefore require + * mmap_lock but maps can be read with locking just the vma and + * walking the vma tree under rcu read protection. + */ + if (m->op != &proc_pid_maps_op) { + if (mmap_read_lock_killable(priv->mm)) + return false; + + priv->mmap_locked = true; + } else { + rcu_read_lock(); + priv->locked_vma = NULL; + priv->mmap_locked = false; + } + + return true; +} + +static inline void unlock_vma_range(struct proc_maps_private *priv) +{ + if (priv->mmap_locked) { + mmap_read_unlock(priv->mm); + } else { + unlock_vma(priv); + rcu_read_unlock(); + } +} + +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv, + loff_t last_pos) +{ + struct vm_area_struct *vma; + + if (priv->mmap_locked) + return vma_next(&priv->iter); + + unlock_vma(priv); + vma = lock_next_vma(priv->mm, &priv->iter, last_pos); + if (!IS_ERR_OR_NULL(vma)) + priv->locked_vma = vma; + + return vma; +} + +static inline bool fallback_to_mmap_lock(struct proc_maps_private *priv, + loff_t pos) +{ + if (priv->mmap_locked) + return false; + + rcu_read_unlock(); + mmap_read_lock(priv->mm); + /* Reinitialize the iterator after taking mmap_lock */ + vma_iter_set(&priv->iter, pos); + priv->mmap_locked = true; + + return true; +} + +#else /* CONFIG_PER_VMA_LOCK */ + +static inline bool lock_vma_range(struct seq_file *m, + struct proc_maps_private *priv) +{ + return mmap_read_lock_killable(priv->mm) == 0; +} + +static inline void unlock_vma_range(struct proc_maps_private *priv) +{ + mmap_read_unlock(priv->mm); +} + +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv, + loff_t last_pos) +{ + return vma_next(&priv->iter); +} + +static inline bool fallback_to_mmap_lock(struct proc_maps_private *priv, + loff_t pos) +{ + return false; +} + +#endif /* CONFIG_PER_VMA_LOCK */ + +static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos) +{ + struct proc_maps_private *priv = m->private; + struct vm_area_struct *vma; + +retry: + vma = get_next_vma(priv, *ppos); + /* EINTR of EAGAIN is possible */ + if 
(IS_ERR(vma)) { + if (PTR_ERR(vma) == -EAGAIN && fallback_to_mmap_lock(priv, *ppos)) + goto retry; + + return vma; + } + + /* Store previous position to be able to restart if needed */ + priv->last_pos = *ppos; if (vma) { - *ppos = vma->vm_start; + /* + * Track the end of the reported vma to ensure position changes + * even if previous vma was merged with the next vma and we + * found the extended vma with the same vm_start. + */ + *ppos = vma->vm_end; } else { *ppos = SENTINEL_VMA_GATE; vma = get_gate_vma(priv->mm); @@ -166,19 +285,25 @@ static void *m_start(struct seq_file *m, loff_t *ppos) return NULL; } - if (mmap_read_lock_killable(mm)) { + if (!lock_vma_range(m, priv)) { mmput(mm); put_task_struct(priv->task); priv->task = NULL; return ERR_PTR(-EINTR); } + /* + * Reset current position if last_addr was set before + * and it's not a sentinel. + */ + if (last_addr > 0) + *ppos = last_addr = priv->last_pos; vma_iter_init(&priv->iter, mm, (unsigned long)last_addr); hold_task_mempolicy(priv); if (last_addr == SENTINEL_VMA_GATE) return get_gate_vma(mm); - return proc_get_vma(priv, ppos); + return proc_get_vma(m, ppos); } static void *m_next(struct seq_file *m, void *v, loff_t *ppos) @@ -187,7 +312,7 @@ static void *m_next(struct seq_file *m, void *v, loff_t *ppos) *ppos = SENTINEL_VMA_END; return NULL; } - return proc_get_vma(m->private, ppos); + return proc_get_vma(m, ppos); } static void m_stop(struct seq_file *m, void *v) @@ -199,7 +324,7 @@ static void m_stop(struct seq_file *m, void *v) return; release_task_mempolicy(priv); - mmap_read_unlock(mm); + unlock_vma_range(priv); mmput(mm); put_task_struct(priv->task); priv->task = NULL; diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h index 5da384bd0a26..1f4f44951abe 100644 --- a/include/linux/mmap_lock.h +++ b/include/linux/mmap_lock.h @@ -309,6 +309,17 @@ void vma_mark_detached(struct vm_area_struct *vma); struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, unsigned long address); +/* + * Locks next vma pointed by the iterator. Confirms the locked vma has not + * been modified and will retry under mmap_lock protection if modification + * was detected. Should be called from read RCU section. + * Returns either a valid locked VMA, NULL if no more VMAs or -EINTR if the + * process was interrupted. 
+ */ +struct vm_area_struct *lock_next_vma(struct mm_struct *mm, + struct vma_iterator *iter, + unsigned long address); + #else /* CONFIG_PER_VMA_LOCK */ static inline void mm_lock_seqcount_init(struct mm_struct *mm) {} diff --git a/mm/madvise.c b/mm/madvise.c index 2bf80989d5b6..bb80fc5ea08f 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -108,7 +108,8 @@ void anon_vma_name_free(struct kref *kref) struct anon_vma_name *anon_vma_name(struct vm_area_struct *vma) { - mmap_assert_locked(vma->vm_mm); + if (!rwsem_is_locked(&vma->vm_mm->mmap_lock)) + vma_assert_locked(vma); return vma->anon_name; } diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c index 5f725cc67334..729fb7d0dd59 100644 --- a/mm/mmap_lock.c +++ b/mm/mmap_lock.c @@ -178,6 +178,99 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, count_vm_vma_lock_event(VMA_LOCK_ABORT); return NULL; } + +static struct vm_area_struct *lock_next_vma_under_mmap_lock(struct mm_struct *mm, + struct vma_iterator *vmi, + unsigned long from_addr) +{ + struct vm_area_struct *vma; + int ret; + + ret = mmap_read_lock_killable(mm); + if (ret) + return ERR_PTR(ret); + + /* Lookup the vma at the last position again under mmap_read_lock */ + vma_iter_set(vmi, from_addr); + vma = vma_next(vmi); + if (vma) { + /* Very unlikely vma->vm_refcnt overflow case */ + if (unlikely(!vma_start_read_locked(vma))) + vma = ERR_PTR(-EAGAIN); + } + + mmap_read_unlock(mm); + + return vma; +} + +struct vm_area_struct *lock_next_vma(struct mm_struct *mm, + struct vma_iterator *vmi, + unsigned long from_addr) +{ + struct vm_area_struct *vma; + unsigned int mm_wr_seq; + bool mmap_unlocked; + + RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu read lock held"); +retry: + /* Start mmap_lock speculation in case we need to verify the vma later */ + mmap_unlocked = mmap_lock_speculate_try_begin(mm, &mm_wr_seq); + vma = vma_next(vmi); + if (!vma) + return NULL; + + vma = vma_start_read(mm, vma); + if (IS_ERR_OR_NULL(vma)) { + /* + * Retry immediately if the vma gets detached from under us. + * Infinite loop should not happen because the vma we find will + * have to be constantly knocked out from under us. + */ + if (PTR_ERR(vma) == -EAGAIN) { + /* reset to search from the last address */ + vma_iter_set(vmi, from_addr); + goto retry; + } + + goto fallback; + } + + /* + * Verify the vma we locked belongs to the same address space and it's + * not behind of the last search position. + */ + if (unlikely(vma->vm_mm != mm || from_addr >= vma->vm_end)) + goto fallback_unlock; + + /* + * vma can be ahead of the last search position but we need to verify + * it was not shrunk after we found it and another vma has not been + * installed ahead of it. Otherwise we might observe a gap that should + * not be there. + */ + if (from_addr < vma->vm_start) { + /* Verify only if the address space might have changed since vma lookup. */ + if (!mmap_unlocked || mmap_lock_speculate_retry(mm, mm_wr_seq)) { + vma_iter_set(vmi, from_addr); + if (vma != vma_next(vmi)) + goto fallback_unlock; + } + } + + return vma; + +fallback_unlock: + vma_end_read(vma); +fallback: + rcu_read_unlock(); + vma = lock_next_vma_under_mmap_lock(mm, vmi, from_addr); + rcu_read_lock(); + /* Reinitialize the iterator after re-entering rcu read section */ + vma_iter_set(vmi, IS_ERR_OR_NULL(vma) ? 
from_addr : vma->vm_end); + + return vma; +} #endif /* CONFIG_PER_VMA_LOCK */ #ifdef CONFIG_LOCK_MM_AND_FIND_VMA From a5867a218d7ca15359e762fa067d867d738689e2 Mon Sep 17 00:00:00 2001 From: Yadan Fan Date: Sat, 19 Jul 2025 06:06:51 +0800 Subject: [PATCH 313/370] mm: mempool: fix wake-up edge case bug for zero-minimum pools The mempool wake-up path has a edge case bug that affects pools created with min_nr=0. When a thread blocks waiting for memory from an empty pool (curr_nr == 0), subsequent mempool_free() calls fail to wake the waiting thread because the condition "curr_nr < min_nr" evaluates to "0 < 0" which is false, this can cause threads to sleep indefinitely according to the code logic. There is at least 2 places where the mempool created with min_nr=0: 1. lib/btree.c:191: mempool_create(0, btree_alloc, btree_free, NULL) 2. drivers/md/dm-verity-fec.c:791: mempool_init_slab_pool(&f->extra_pool, 0, f->cache) Add an explicit check in mempool_free() to handle the min_nr=0 case: when the pool has zero minimum reserves, is currently empty, and has active waiters, allocate the element then wake up the sleeper. Link: https://lkml.kernel.org/r/f28a81ba-615c-481e-86fb-c0bf4115ec89@suse.com Signed-off-by: Yadan Fan Signed-off-by: Andrew Morton --- mm/mempool.c | 34 +++++++++++++++++++++++++++++++++- 1 file changed, 33 insertions(+), 1 deletion(-) diff --git a/mm/mempool.c b/mm/mempool.c index 3223337135d0..204a216b6418 100644 --- a/mm/mempool.c +++ b/mm/mempool.c @@ -540,11 +540,43 @@ void mempool_free(void *element, mempool_t *pool) if (likely(pool->curr_nr < pool->min_nr)) { add_element(pool, element); spin_unlock_irqrestore(&pool->lock, flags); - wake_up(&pool->wait); + if (wq_has_sleeper(&pool->wait)) + wake_up(&pool->wait); return; } spin_unlock_irqrestore(&pool->lock, flags); } + + /* + * Handle the min_nr = 0 edge case: + * + * For zero-minimum pools, curr_nr < min_nr (0 < 0) never succeeds, + * so waiters sleeping on pool->wait would never be woken by the + * wake-up path of previous test. This explicit check ensures the + * allocation of element when both min_nr and curr_nr are 0, and + * any active waiters are properly awakened. + * + * Inline the same logic as previous test, add_element() cannot be + * directly used here since it has BUG_ON to deny if min_nr equals + * curr_nr, so here picked rest of add_element() to use without + * BUG_ON check. + */ + if (unlikely(pool->min_nr == 0 && + READ_ONCE(pool->curr_nr) == 0)) { + spin_lock_irqsave(&pool->lock, flags); + if (likely(pool->curr_nr == 0)) { + /* Inline the logic of add_element() */ + poison_element(pool, element); + if (kasan_poison_element(pool, element)) + pool->elements[pool->curr_nr++] = element; + spin_unlock_irqrestore(&pool->lock, flags); + if (wq_has_sleeper(&pool->wait)) + wake_up(&pool->wait); + return; + } + spin_unlock_irqrestore(&pool->lock, flags); + } + pool->free(element, pool->pool_data); } EXPORT_SYMBOL(mempool_free); From 6470fb2bb1812cac0a80a290bb193a9d866dd39d Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Fri, 11 Jul 2025 15:59:34 +0530 Subject: [PATCH 314/370] fs/Kconfig: enable HUGETLBFS only if ARCH_SUPPORTS_HUGETLBFS Enable HUGETLBFS only when platform subscrbes via ARCH_SUPPORTS_HUGETLBFS. Hence select ARCH_SUPPORTS_HUGETLBFS on existing x86 and sparc for their continuing HUGETLBFS support. While here also just drop existing 'BROKEN' dependency. Link: https://lkml.kernel.org/r/20250711102934.2399533-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual Cc: "David S. 
Miller" Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Christian Brauner Signed-off-by: Andrew Morton --- arch/sparc/Kconfig | 1 + arch/x86/Kconfig | 1 + fs/Kconfig | 2 +- 3 files changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig index 0f88123925a4..68549eedfe6d 100644 --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -97,6 +97,7 @@ config SPARC64 select HAVE_ARCH_AUDITSYSCALL select ARCH_SUPPORTS_ATOMIC_RMW select ARCH_SUPPORTS_DEBUG_PAGEALLOC + select ARCH_SUPPORTS_HUGETLBFS select HAVE_NMI select HAVE_REGS_AND_STACK_ACCESS_API select ARCH_USE_QUEUED_RWLOCKS diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index bb9b63d76a19..0ce86e14ab5e 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -125,6 +125,7 @@ config X86 select ARCH_SUPPORTS_ACPI select ARCH_SUPPORTS_ATOMIC_RMW select ARCH_SUPPORTS_DEBUG_PAGEALLOC + select ARCH_SUPPORTS_HUGETLBFS select ARCH_SUPPORTS_PAGE_TABLE_CHECK if X86_64 select ARCH_SUPPORTS_NUMA_BALANCING if X86_64 select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP if NR_CPUS <= 4096 diff --git a/fs/Kconfig b/fs/Kconfig index ccdf371a62ed..c654a3642897 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -249,7 +249,7 @@ config ARCH_SUPPORTS_HUGETLBFS menuconfig HUGETLBFS bool "HugeTLB file system support" - depends on X86 || SPARC64 || ARCH_SUPPORTS_HUGETLBFS || BROKEN + depends on ARCH_SUPPORTS_HUGETLBFS depends on (SYSFS || SYSCTL) select MEMFD_CREATE select PADATA if SMP From 6c7de9c83be68bf40131e75de42aae8868ea43b1 Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Fri, 18 Jul 2025 14:37:15 -0400 Subject: [PATCH 315/370] mm/huge_memory: move unrelated code out of __split_unmapped_folio() Patch series "__folio_split() clean up", v5. This patchset refactors __folio_split() and __split_unmapped_folio() to: 1. make __split_unmapped_folio() reusable for splitting unmapped folios. It avoids the need for a new boolean unmapped parameter to guard mapping-related code when __split_unmapped_folio() is reused to split unmapped folios. 2. improve code readability and prevent smatch/coverity checkers from complaining about NULL mapping referencing. An additional benefit for __split_unmapped_folio() refactoring is that __split_unmapped_folio() could be called on after-split folios by __folio_split(). It can enable new split methods. For example, at deferred split time, unmapped subpages can scatter arbitrarily within a large folio, neither uniform nor non-uniform split can maximize after-split folio orders for mapped subpages. The hope is that by calling __split_unmapped_folio() multiple times, a better split result can be achieved. This patch (of 6): remap(), folio_ref_unfreeze(), lru_add_split_folio() are not relevant to splitting unmapped folio operations. Move them out to __folio_split() so that __split_unmapped_folio() only handles unmapped folio splits. This makes __split_unmapped_folio() reusable. Remove the swapcache folio split check code before __split_unmapped_folio() call, since it is already checked at the beginning of __folio_split() in uniform_split_supported() and non_uniform_split_supported(). Along with the code move, there are some variable renames: 1. release is renamed to new_folio, 2. origin_folio is now folio, since __folio_split() has folio pointing to the original folio already. 
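As an editor's illustration of the "uniform" versus "buddy allocator like" (non-uniform) split modes referenced above: the following is a minimal standalone sketch in plain userspace C - not kernel code - of the folio orders a non-uniform split leaves behind, assuming (per my reading of the __split_unmapped_folio() loop below) that each step splits off half of the folio that still contains the target page until it reaches @new_order.

/*
 * Illustrative only: print the after-split folio orders produced by a
 * buddy-allocator-like (non-uniform) split.  Assumes each step halves
 * the folio still containing the target page down to new_order.
 */
#include <stdio.h>

static void non_uniform_split_orders(int order, int new_order)
{
	printf("order-%d folio ->", order);
	/* one sibling folio of every order from order - 1 down to new_order ... */
	for (int split_order = order - 1; split_order >= new_order; split_order--)
		printf(" order-%d", split_order);
	/* ... plus the folio that still contains the target page */
	printf(" order-%d (contains the target page)\n", new_order);
}

int main(void)
{
	/* e.g. splitting an order-9 folio (2MB with 4K pages) down to base pages */
	non_uniform_split_orders(9, 0);
	return 0;
}

A uniform split of the same folio would instead produce 2^(order - new_order) folios, all of order new_order.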
Link: https://lkml.kernel.org/r/20250718023000.4044406-1-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250718023000.4044406-2-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250718183720.4054515-1-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250718183720.4054515-2-ziy@nvidia.com Signed-off-by: Zi Yan Acked-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Cc: Antonio Quartulli Cc: Balbir Singh Cc: Baolin Wang Cc: Barry Song Cc: Dan Carpenter Cc: Dev Jain Cc: Hugh Dickins Cc: Kirill A. Shutemov Cc: Liam Howlett Cc: Mariano Pache Cc: Mathew Brost Cc: Ryan Roberts Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/huge_memory.c | 270 +++++++++++++++++++++++------------------------ 1 file changed, 133 insertions(+), 137 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ce130225a8e5..63eebca07628 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3385,10 +3385,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order, * order - 1 to new_order). * @split_at: in buddy allocator like split, the folio containing @split_at * will be split until its order becomes @new_order. - * @lock_at: the folio containing @lock_at is left locked for caller. - * @list: the after split folios will be added to @list if it is not NULL, - * otherwise to LRU lists. - * @end: the end of the file @folio maps to. -1 if @folio is anonymous memory. * @xas: xa_state pointing to folio->mapping->i_pages and locked by caller * @mapping: @folio->mapping * @uniform_split: if the split is uniform or not (buddy allocator like split) @@ -3414,52 +3410,26 @@ static void __split_folio_to_order(struct folio *folio, int old_order, * @page, which is split in next for loop. * * After splitting, the caller's folio reference will be transferred to the - * folio containing @page. The other folios may be freed if they are not mapped. - * - * In terms of locking, after splitting, - * 1. uniform split leaves @page (or the folio contains it) locked; - * 2. buddy allocator like (non-uniform) split leaves @folio locked. - * + * folio containing @page. The caller needs to unlock and/or free after-split + * folios if necessary. * * For !uniform_split, when -ENOMEM is returned, the original folio might be * split. The caller needs to check the input folio. */ static int __split_unmapped_folio(struct folio *folio, int new_order, - struct page *split_at, struct page *lock_at, - struct list_head *list, pgoff_t end, - struct xa_state *xas, struct address_space *mapping, - bool uniform_split) + struct page *split_at, struct xa_state *xas, + struct address_space *mapping, bool uniform_split) { - struct lruvec *lruvec; - struct address_space *swap_cache = NULL; - struct folio *origin_folio = folio; - struct folio *next_folio = folio_next(folio); - struct folio *new_folio; - struct folio *next; int order = folio_order(folio); - int split_order; int start_order = uniform_split ? 
new_order : order - 1; - int nr_dropped = 0; - int ret = 0; bool stop_split = false; - - if (folio_test_swapcache(folio)) { - VM_BUG_ON(mapping); - - /* a swapcache folio can only be uniformly split to order-0 */ - if (!uniform_split || new_order != 0) - return -EINVAL; - - swap_cache = swap_address_space(folio->swap); - xa_lock(&swap_cache->i_pages); - } + struct folio *next; + int split_order; + int ret = 0; if (folio_test_anon(folio)) mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1); - /* lock lru list/PageCompound, ref frozen by page_ref_freeze */ - lruvec = folio_lruvec_lock(folio); - folio_clear_has_hwpoisoned(folio); /* @@ -3469,9 +3439,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order, for (split_order = start_order; split_order >= new_order && !stop_split; split_order--) { - int old_order = folio_order(folio); - struct folio *release; struct folio *end_folio = folio_next(folio); + int old_order = folio_order(folio); + struct folio *new_folio; /* order-1 anonymous folio is not supported */ if (folio_test_anon(folio) && split_order == 1) @@ -3506,113 +3476,32 @@ static int __split_unmapped_folio(struct folio *folio, int new_order, after_split: /* - * Iterate through after-split folios and perform related - * operations. But in buddy allocator like split, the folio + * Iterate through after-split folios and update folio stats. + * But in buddy allocator like split, the folio * containing the specified page is skipped until its order * is new_order, since the folio will be worked on in next * iteration. */ - for (release = folio; release != end_folio; release = next) { - next = folio_next(release); + for (new_folio = folio; new_folio != end_folio; new_folio = next) { + next = folio_next(new_folio); /* - * for buddy allocator like split, the folio containing - * page will be split next and should not be released, - * until the folio's order is new_order or stop_split - * is set to true by the above xas_split() failure. + * for buddy allocator like split, new_folio containing + * @split_at page could be split again, thus do not + * change stats yet. Wait until new_folio's order is + * @new_order or stop_split is set to true by the above + * xas_split() failure. */ - if (release == page_folio(split_at)) { - folio = release; + if (new_folio == page_folio(split_at)) { + folio = new_folio; if (split_order != new_order && !stop_split) continue; } - if (folio_test_anon(release)) { - mod_mthp_stat(folio_order(release), - MTHP_STAT_NR_ANON, 1); - } - - /* - * origin_folio should be kept frozon until page cache - * entries are updated with all the other after-split - * folios to prevent others seeing stale page cache - * entries. - */ - if (release == origin_folio) - continue; - - folio_ref_unfreeze(release, 1 + - ((mapping || swap_cache) ? 
- folio_nr_pages(release) : 0)); - - lru_add_split_folio(origin_folio, release, lruvec, - list); - - /* Some pages can be beyond EOF: drop them from cache */ - if (release->index >= end) { - if (shmem_mapping(mapping)) - nr_dropped += folio_nr_pages(release); - else if (folio_test_clear_dirty(release)) - folio_account_cleaned(release, - inode_to_wb(mapping->host)); - __filemap_remove_folio(release, NULL); - folio_put_refs(release, folio_nr_pages(release)); - } else if (mapping) { - __xa_store(&mapping->i_pages, - release->index, release, 0); - } else if (swap_cache) { - __xa_store(&swap_cache->i_pages, - swap_cache_index(release->swap), - release, 0); - } + if (folio_test_anon(new_folio)) + mod_mthp_stat(folio_order(new_folio), + MTHP_STAT_NR_ANON, 1); } } - /* - * Unfreeze origin_folio only after all page cache entries, which used - * to point to it, have been updated with new folios. Otherwise, - * a parallel folio_try_get() can grab origin_folio and its caller can - * see stale page cache entries. - */ - folio_ref_unfreeze(origin_folio, 1 + - ((mapping || swap_cache) ? folio_nr_pages(origin_folio) : 0)); - - unlock_page_lruvec(lruvec); - - if (swap_cache) - xa_unlock(&swap_cache->i_pages); - if (mapping) - xa_unlock(&mapping->i_pages); - - /* Caller disabled irqs, so they are still disabled here */ - local_irq_enable(); - - if (nr_dropped) - shmem_uncharge(mapping->host, nr_dropped); - - remap_page(origin_folio, 1 << order, - folio_test_anon(origin_folio) ? - RMP_USE_SHARED_ZEROPAGE : 0); - - /* - * At this point, folio should contain the specified page. - * For uniform split, it is left for caller to unlock. - * For buddy allocator like split, the first after-split folio is left - * for caller to unlock. - */ - for (new_folio = origin_folio; new_folio != next_folio; new_folio = next) { - next = folio_next(new_folio); - if (new_folio == page_folio(lock_at)) - continue; - - folio_unlock(new_folio); - /* - * Subpages may be freed if there wasn't any mapping - * like if add_to_swap() is running on a lru page that - * had its mapping zapped. And freeing these pages - * requires taking the lru_lock so we do the put_page - * of the tail pages after the split is complete. - */ - free_folio_and_swap_cache(new_folio); - } return ret; } @@ -3686,6 +3575,11 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order, * It is in charge of checking whether the split is supported or not and * preparing @folio for __split_unmapped_folio(). * + * After splitting, the after-split folio containing @lock_at remains locked + * and others are unlocked: + * 1. for uniform split, @lock_at points to one of @folio's subpages; + * 2. for buddy allocator like (non-uniform) split, @lock_at points to @folio. 
+ * * return: 0: successful, <0 failed (if -ENOMEM is returned, @folio might be * split but not to @new_order, the caller needs to check) */ @@ -3695,10 +3589,12 @@ static int __folio_split(struct folio *folio, unsigned int new_order, { struct deferred_split *ds_queue = get_deferred_split_queue(folio); XA_STATE(xas, &folio->mapping->i_pages, folio->index); + struct folio *end_folio = folio_next(folio); bool is_anon = folio_test_anon(folio); struct address_space *mapping = NULL; struct anon_vma *anon_vma = NULL; int order = folio_order(folio); + struct folio *new_folio, *next; int extra_pins, ret; pgoff_t end; bool is_hzp; @@ -3829,6 +3725,10 @@ static int __folio_split(struct folio *folio, unsigned int new_order, /* Prevent deferred_split_scan() touching ->_refcount */ spin_lock(&ds_queue->split_queue_lock); if (folio_ref_freeze(folio, 1 + extra_pins)) { + struct address_space *swap_cache = NULL; + int nr_dropped = 0; + struct lruvec *lruvec; + if (folio_order(folio) > 1 && !list_empty(&folio->_deferred_list)) { ds_queue->split_queue_len--; @@ -3862,9 +3762,105 @@ static int __folio_split(struct folio *folio, unsigned int new_order, } } - ret = __split_unmapped_folio(folio, new_order, - split_at, lock_at, list, end, &xas, mapping, - uniform_split); + if (folio_test_swapcache(folio)) { + VM_BUG_ON(mapping); + + swap_cache = swap_address_space(folio->swap); + xa_lock(&swap_cache->i_pages); + } + + /* lock lru list/PageCompound, ref frozen by page_ref_freeze */ + lruvec = folio_lruvec_lock(folio); + + ret = __split_unmapped_folio(folio, new_order, split_at, &xas, + mapping, uniform_split); + + /* + * Unfreeze after-split folios and put them back to the right + * list. @folio should be kept frozon until page cache + * entries are updated with all the other after-split folios + * to prevent others seeing stale page cache entries. + * As a result, new_folio starts from the next folio of + * @folio. + */ + for (new_folio = folio_next(folio); new_folio != end_folio; + new_folio = next) { + next = folio_next(new_folio); + + folio_ref_unfreeze( + new_folio, + 1 + ((mapping || swap_cache) ? + folio_nr_pages(new_folio) : + 0)); + + lru_add_split_folio(folio, new_folio, lruvec, list); + + /* Some pages can be beyond EOF: drop them from cache */ + if (new_folio->index >= end) { + if (shmem_mapping(mapping)) + nr_dropped += folio_nr_pages(new_folio); + else if (folio_test_clear_dirty(new_folio)) + folio_account_cleaned( + new_folio, + inode_to_wb(mapping->host)); + __filemap_remove_folio(new_folio, NULL); + folio_put_refs(new_folio, + folio_nr_pages(new_folio)); + } else if (mapping) { + __xa_store(&mapping->i_pages, new_folio->index, + new_folio, 0); + } else if (swap_cache) { + __xa_store(&swap_cache->i_pages, + swap_cache_index(new_folio->swap), + new_folio, 0); + } + } + /* + * Unfreeze @folio only after all page cache entries, which + * used to point to it, have been updated with new folios. + * Otherwise, a parallel folio_try_get() can grab @folio + * and its caller can see stale page cache entries. + */ + folio_ref_unfreeze(folio, 1 + + ((mapping || swap_cache) ? folio_nr_pages(folio) : 0)); + + unlock_page_lruvec(lruvec); + + if (swap_cache) + xa_unlock(&swap_cache->i_pages); + if (mapping) + xas_unlock(&xas); + + local_irq_enable(); + + if (nr_dropped) + shmem_uncharge(mapping->host, nr_dropped); + + remap_page(folio, 1 << order, + !ret && folio_test_anon(folio) ? + RMP_USE_SHARED_ZEROPAGE : + 0); + + /* + * Unlock all after-split folios except the one containing + * @lock_at page. 
If @folio is not split, it will be kept locked. + */ + for (new_folio = folio; new_folio != end_folio; + new_folio = next) { + next = folio_next(new_folio); + if (new_folio == page_folio(lock_at)) + continue; + + folio_unlock(new_folio); + /* + * Subpages may be freed if there wasn't any mapping + * like if add_to_swap() is running on a lru page that + * had its mapping zapped. And freeing these pages + * requires taking the lru_lock so we do the put_page + * of the tail pages after the split is complete. + */ + free_folio_and_swap_cache(new_folio); + } } else { spin_unlock(&ds_queue->split_queue_lock); fail: From 3653fc1bdba1f0fc2f0b4715ad98a3ba62689ab5 Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Fri, 18 Jul 2025 14:37:16 -0400 Subject: [PATCH 316/370] mm/huge_memory: remove after_split label in __split_unmapped_folio() Check stop_split instead to avoid the goto statement. Link: https://lkml.kernel.org/r/20250718183720.4054515-3-ziy@nvidia.com Signed-off-by: Zi Yan Acked-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Cc: Antonio Quartulli Cc: Balbir Singh Cc: Baolin Wang Cc: Barry Song Cc: Dan Carpenter Cc: Dev Jain Cc: Hugh Dickins Cc: Kirill A. Shutemov Cc: Liam Howlett Cc: Mariano Pache Cc: Mathew Brost Cc: Ryan Roberts Signed-off-by: Andrew Morton --- mm/huge_memory.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 63eebca07628..e01359008b13 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3463,18 +3463,18 @@ static int __split_unmapped_folio(struct folio *folio, int new_order, if (xas_error(xas)) { ret = xas_error(xas); stop_split = true; - goto after_split; } } } - folio_split_memcg_refs(folio, old_order, split_order); - split_page_owner(&folio->page, old_order, split_order); - pgalloc_tag_split(folio, old_order, split_order); + if (!stop_split) { + folio_split_memcg_refs(folio, old_order, split_order); + split_page_owner(&folio->page, old_order, split_order); + pgalloc_tag_split(folio, old_order, split_order); - __split_folio_to_order(folio, old_order, split_order); + __split_folio_to_order(folio, old_order, split_order); + } -after_split: /* * Iterate through after-split folios and update folio stats. * But in buddy allocator like split, the folio From 391dc7f40590d793f0e5214b6e9324b1af8fa40d Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Fri, 18 Jul 2025 14:37:17 -0400 Subject: [PATCH 317/370] mm/huge_memory: deduplicate code in __folio_split() xas unlock, remap_page(), local_irq_enable() are moved out of if branches to deduplicate the code. While at it, add remap_flags to clean up remap_page() call site. nr_dropped is renamed to nr_shmem_dropped, as it becomes a variable at __folio_split() scope. Link: https://lkml.kernel.org/r/20250718183720.4054515-4-ziy@nvidia.com Signed-off-by: Zi Yan Acked-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Cc: Antonio Quartulli Cc: Balbir Singh Cc: Baolin Wang Cc: Barry Song Cc: Dan Carpenter Cc: Dev Jain Cc: Hugh Dickins Cc: Kirill A. 
Shutemov Cc: Liam Howlett Cc: Mariano Pache Cc: Mathew Brost Cc: Ryan Roberts Signed-off-by: Andrew Morton --- mm/huge_memory.c | 79 +++++++++++++++++++++++------------------------- 1 file changed, 38 insertions(+), 41 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e01359008b13..d36f7bdaeb38 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3595,6 +3595,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order, struct anon_vma *anon_vma = NULL; int order = folio_order(folio); struct folio *new_folio, *next; + int nr_shmem_dropped = 0; + int remap_flags = 0; int extra_pins, ret; pgoff_t end; bool is_hzp; @@ -3718,15 +3720,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order, */ xas_lock(&xas); xas_reset(&xas); - if (xas_load(&xas) != folio) + if (xas_load(&xas) != folio) { + ret = -EAGAIN; goto fail; + } } /* Prevent deferred_split_scan() touching ->_refcount */ spin_lock(&ds_queue->split_queue_lock); if (folio_ref_freeze(folio, 1 + extra_pins)) { struct address_space *swap_cache = NULL; - int nr_dropped = 0; struct lruvec *lruvec; if (folio_order(folio) > 1 && @@ -3798,7 +3801,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order, /* Some pages can be beyond EOF: drop them from cache */ if (new_folio->index >= end) { if (shmem_mapping(mapping)) - nr_dropped += folio_nr_pages(new_folio); + nr_shmem_dropped += folio_nr_pages(new_folio); else if (folio_test_clear_dirty(new_folio)) folio_account_cleaned( new_folio, @@ -3828,48 +3831,42 @@ static int __folio_split(struct folio *folio, unsigned int new_order, if (swap_cache) xa_unlock(&swap_cache->i_pages); - if (mapping) - xas_unlock(&xas); - - local_irq_enable(); - - if (nr_dropped) - shmem_uncharge(mapping->host, nr_dropped); - - remap_page(folio, 1 << order, - !ret && folio_test_anon(folio) ? - RMP_USE_SHARED_ZEROPAGE : - 0); - - /* - * Unlock all after-split folios except the one containing - * @lock_at page. If @folio is not split, it will be kept locked. - */ - for (new_folio = folio; new_folio != end_folio; - new_folio = next) { - next = folio_next(new_folio); - if (new_folio == page_folio(lock_at)) - continue; - - folio_unlock(new_folio); - /* - * Subpages may be freed if there wasn't any mapping - * like if add_to_swap() is running on a lru page that - * had its mapping zapped. And freeing these pages - * requires taking the lru_lock so we do the put_page - * of the tail pages after the split is complete. - */ - free_folio_and_swap_cache(new_folio); - } } else { spin_unlock(&ds_queue->split_queue_lock); -fail: - if (mapping) - xas_unlock(&xas); - local_irq_enable(); - remap_page(folio, folio_nr_pages(folio), 0); ret = -EAGAIN; } +fail: + if (mapping) + xas_unlock(&xas); + + local_irq_enable(); + + if (nr_shmem_dropped) + shmem_uncharge(mapping->host, nr_shmem_dropped); + + if (!ret && is_anon) + remap_flags = RMP_USE_SHARED_ZEROPAGE; + remap_page(folio, 1 << order, remap_flags); + + /* + * Unlock all after-split folios except the one containing + * @lock_at page. If @folio is not split, it will be kept locked. + */ + for (new_folio = folio; new_folio != end_folio; new_folio = next) { + next = folio_next(new_folio); + if (new_folio == page_folio(lock_at)) + continue; + + folio_unlock(new_folio); + /* + * Subpages may be freed if there wasn't any mapping + * like if add_to_swap() is running on a lru page that + * had its mapping zapped. 
And freeing these pages + * requires taking the lru_lock so we do the put_page + * of the tail pages after the split is complete. + */ + free_folio_and_swap_cache(new_folio); + } out_unlock: if (anon_vma) { From 714b056c8321066f5a20d3045a95398092f7c1b9 Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Thu, 17 Jul 2025 22:29:58 -0400 Subject: [PATCH 318/370] mm/huge_memory: convert VM_BUG* to VM_WARN* in __folio_split These VM_BUG* can be handled gracefully without crashing kernel. Link: https://lkml.kernel.org/r/20250718023000.4044406-5-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250718183720.4054515-5-ziy@nvidia.com Signed-off-by: Zi Yan Reviewed-by: Lorenzo Stoakes Acked-by: David Hildenbrand Cc: Antonio Quartulli Cc: Balbir Singh Cc: Baolin Wang Cc: Barry Song Cc: Dan Carpenter Cc: Dev Jain Cc: Hugh Dickins Cc: Kirill A. Shutemov Cc: Liam Howlett Cc: Mariano Pache Cc: Mathew Brost Cc: Ryan Roberts Signed-off-by: Andrew Morton --- mm/huge_memory.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d36f7bdaeb38..d98283164eda 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3601,8 +3601,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order, pgoff_t end; bool is_hzp; - VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); - VM_BUG_ON_FOLIO(!folio_test_large(folio), folio); + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio); if (folio != page_folio(split_at) || folio != page_folio(lock_at)) return -EINVAL; @@ -3766,7 +3766,11 @@ static int __folio_split(struct folio *folio, unsigned int new_order, } if (folio_test_swapcache(folio)) { - VM_BUG_ON(mapping); + if (mapping) { + VM_WARN_ON_ONCE_FOLIO(mapping, folio); + ret = -EINVAL; + goto fail; + } swap_cache = swap_address_space(folio->swap); xa_lock(&swap_cache->i_pages); From a3871560ffc5755e561b75e257d2b15b19395608 Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Fri, 18 Jul 2025 14:37:19 -0400 Subject: [PATCH 319/370] mm/huge_memory: get frozen folio refcount with folio_expected_ref_count() Instead of open coding the refcount calculation, use folio_expected_ref_count() to calculate frozen folio refcount. Because: 1. __folio_split() does not split a folio with PG_private, so no elevated refcount from PG_private; 2. a frozen folio in __folio_split() is fully unmapped, so folio_mapcount() in folio_expected_ref_count() is always 0; 3. (mapping || swap_cache) ? folio_nr_pages(folio) is taken care of by folio_expected_ref_count() too. Link: https://lkml.kernel.org/r/20250718023000.4044406-6-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250718183720.4054515-6-ziy@nvidia.com Signed-off-by: Zi Yan Suggested-by: David Hildenbrand Acked-by: Balbir Singh Acked-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Cc: Antonio Quartulli Cc: Baolin Wang Cc: Barry Song Cc: Dan Carpenter Cc: Dev Jain Cc: Hugh Dickins Cc: Kirill A. 
Shutemov Cc: Liam Howlett Cc: Mariano Pache Cc: Mathew Brost Cc: Ryan Roberts Signed-off-by: Andrew Morton --- mm/huge_memory.c | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d98283164eda..19e69704fcff 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3731,6 +3731,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order, if (folio_ref_freeze(folio, 1 + extra_pins)) { struct address_space *swap_cache = NULL; struct lruvec *lruvec; + int expected_refs; if (folio_order(folio) > 1 && !list_empty(&folio->_deferred_list)) { @@ -3794,11 +3795,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order, new_folio = next) { next = folio_next(new_folio); - folio_ref_unfreeze( - new_folio, - 1 + ((mapping || swap_cache) ? - folio_nr_pages(new_folio) : - 0)); + expected_refs = folio_expected_ref_count(new_folio) + 1; + folio_ref_unfreeze(new_folio, expected_refs); lru_add_split_folio(folio, new_folio, lruvec, list); @@ -3828,8 +3826,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order, * Otherwise, a parallel folio_try_get() can grab @folio * and its caller can see stale page cache entries. */ - folio_ref_unfreeze(folio, 1 + - ((mapping || swap_cache) ? folio_nr_pages(folio) : 0)); + expected_refs = folio_expected_ref_count(folio) + 1; + folio_ref_unfreeze(folio, expected_refs); unlock_page_lruvec(lruvec); From fde47708f9bc7f7babe4f48284f19d92faa06891 Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Fri, 18 Jul 2025 14:37:20 -0400 Subject: [PATCH 320/370] mm/huge_memory: refactor after-split (page) cache code Smatch/coverity checkers report NULL mapping referencing issues[1][2][3] every time the code is modified, because they do not understand that mapping cannot be NULL when a folio is in page cache in the code. Refactor the code to make it explicit. Remove "end = -1" for anonymous folios, since after code refactoring, end is no longer used by anonymous folio handling code. No functional change is intended. Link: https://lkml.kernel.org/r/20250718023000.4044406-7-ziy@nvidia.com Link: https://lore.kernel.org/linux-mm/2afe3d59-aca5-40f7-82a3-a6d976fb0f4f@stanley.mountain/ [1] Link: https://lore.kernel.org/oe-kbuild/64b54034-f311-4e7d-b935-c16775dbb642@suswa.mountain/ [2] Link: https://lore.kernel.org/linux-mm/20250716145804.4836-1-antonio@mandelbit.com/ [3] Link: https://lkml.kernel.org/r/20250718183720.4054515-7-ziy@nvidia.com Signed-off-by: Zi Yan Suggested-by: David Hildenbrand Acked-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Cc: Balbir Singh Cc: Baolin Wang Cc: Barry Song Cc: Dan Carpenter Cc: Dev Jain Cc: Hugh Dickins Cc: Kirill A. 
Shutemov Cc: Liam Howlett Cc: Mariano Pache Cc: Mathew Brost Cc: Ryan Roberts Signed-off-by: Andrew Morton --- mm/huge_memory.c | 44 ++++++++++++++++++++++++++++---------------- 1 file changed, 28 insertions(+), 16 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 19e69704fcff..9c38a95e9f09 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3640,7 +3640,6 @@ static int __folio_split(struct folio *folio, unsigned int new_order, ret = -EBUSY; goto out; } - end = -1; mapping = NULL; anon_vma_lock_write(anon_vma); } else { @@ -3793,6 +3792,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order, */ for (new_folio = folio_next(folio); new_folio != end_folio; new_folio = next) { + unsigned long nr_pages = folio_nr_pages(new_folio); + next = folio_next(new_folio); expected_refs = folio_expected_ref_count(new_folio) + 1; @@ -3800,25 +3801,36 @@ static int __folio_split(struct folio *folio, unsigned int new_order, lru_add_split_folio(folio, new_folio, lruvec, list); - /* Some pages can be beyond EOF: drop them from cache */ - if (new_folio->index >= end) { - if (shmem_mapping(mapping)) - nr_shmem_dropped += folio_nr_pages(new_folio); - else if (folio_test_clear_dirty(new_folio)) - folio_account_cleaned( - new_folio, - inode_to_wb(mapping->host)); - __filemap_remove_folio(new_folio, NULL); - folio_put_refs(new_folio, - folio_nr_pages(new_folio)); - } else if (mapping) { - __xa_store(&mapping->i_pages, new_folio->index, - new_folio, 0); - } else if (swap_cache) { + /* + * Anonymous folio with swap cache. + * NOTE: shmem in swap cache is not supported yet. + */ + if (swap_cache) { __xa_store(&swap_cache->i_pages, swap_cache_index(new_folio->swap), new_folio, 0); + continue; } + + /* Anonymous folio without swap cache */ + if (!mapping) + continue; + + /* Add the new folio to the page cache. */ + if (new_folio->index < end) { + __xa_store(&mapping->i_pages, new_folio->index, + new_folio, 0); + continue; + } + + /* Drop folio beyond EOF: ->index >= end */ + if (shmem_mapping(mapping)) + nr_shmem_dropped += nr_pages; + else if (folio_test_clear_dirty(new_folio)) + folio_account_cleaned( + new_folio, inode_to_wb(mapping->host)); + __filemap_remove_folio(new_folio, NULL); + folio_put_refs(new_folio, nr_pages); } /* * Unfreeze @folio only after all page cache entries, which From b9bf6c2872c530776852b295eb399a23626e9611 Mon Sep 17 00:00:00 2001 From: Dev Jain Date: Fri, 18 Jul 2025 14:32:38 +0530 Subject: [PATCH 321/370] mm: refactor MM_CP_PROT_NUMA skipping case into new function Patch series "Optimize mprotect() for large folios", v5. Use folio_pte_batch() to optimize change_pte_range(). On arm64, if the ptes are painted with the contig bit, then ptep_get() will iterate through all 16 entries to collect a/d bits. Hence this optimization will result in a 16x reduction in the number of ptep_get() calls. Next, ptep_modify_prot_start() will eventually call contpte_try_unfold() on every contig block, thus flushing the TLB for the complete large folio range. Instead, use get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only do them on the starting and ending contig block. For split folios, there will be no pte batching; the batch size returned by folio_pte_batch() will be 1. For pagetable split folios, the ptes will still point to the same large folio; for arm64, this results in the optimization described above, and for other arches, a minor improvement is expected due to a reduction in the number of function calls. mm-selftests pass on arm64. 
I have some failing tests on my x86 VM already; no new tests fail as a result of this patchset. We use the following test cases to measure performance, mprotect()'ing the mapped memory to read-only then read-write 40 times: Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then pte-mapping those THPs Test case 2: Mapping 1G of memory with 64K mTHPs Test case 3: Mapping 1G of memory with 4K pages Average execution time on arm64, Apple M3: Before the patchset: T1: 2.1 seconds T2: 2 seconds T3: 1 second After the patchset: T1: 0.65 seconds T2: 0.7 seconds T3: 1.1 seconds Observing T1/T2 and T3 before the patchset, we also remove the regression introduced by ptep_get() on a contpte block. And, for large folios we get an almost 74% performance improvement, with the trade-off being a slight degradation in the small folio case. For x86: Before the patchset: T1: 3.75 seconds T2: 3.7 seconds T3: 3.85 seconds After the patchset: T1: 3.7 seconds T2: 3.7 seconds T3: 3.9 seconds So there is a minor improvement due to a reduction in the number of function calls, and a slight degradation in the small folio case due to the overhead of vm_normal_folio() + folio_test_large(). Here is the test program: #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <sys/mman.h> #define SIZE (1024*1024*1024) unsigned long pmdsize = (1UL << 21); unsigned long pagesize = (1UL << 12); static void pte_map_thps(char *mem, size_t size) { size_t offs; int ret = 0; /* PTE-map each THP by temporarily splitting the VMAs. */ for (offs = 0; offs < size; offs += pmdsize) { ret |= madvise(mem + offs, pagesize, MADV_DONTFORK); ret |= madvise(mem + offs, pagesize, MADV_DOFORK); } if (ret) { fprintf(stderr, "ERROR: madvise() failed\n"); exit(1); } } int main(int argc, char *argv[]) { char *p; int ret = 0; p = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (p != (void *)(1UL << 30)) { perror("mmap"); return 1; } memset(p, 0, SIZE); if (madvise(p, SIZE, MADV_NOHUGEPAGE)) perror("madvise"); explicit_bzero(p, SIZE); pte_map_thps(p, SIZE); for (int loops = 0; loops < 40; loops++) { if (mprotect(p, SIZE, PROT_READ)) perror("mprotect"), exit(1); if (mprotect(p, SIZE, PROT_READ|PROT_WRITE)) perror("mprotect"), exit(1); explicit_bzero(p, SIZE); } } This patch (of 7): Reduce indentation by refactoring the prot_numa case into a new function. No functional change intended.
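Timings like those quoted above can be collected with a minimal, self-contained harness along the following lines (an illustrative sketch covering only the 4K-page case; the THP setup from the test program above is omitted):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>

#define SIZE (1024UL * 1024 * 1024)

int main(void)
{
	struct timespec t0, t1;
	char *p;
	int i;

	p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 0, SIZE);	/* fault all pages in up front */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < 40; i++) {
		if (mprotect(p, SIZE, PROT_READ) ||
		    mprotect(p, SIZE, PROT_READ | PROT_WRITE)) {
			perror("mprotect");
			return 1;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.2f seconds\n", (t1.tv_sec - t0.tv_sec) +
	       (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}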
Link: https://lkml.kernel.org/r/20250718090244.21092-1-dev.jain@arm.com Link: https://lkml.kernel.org/r/20250718090244.21092-2-dev.jain@arm.com Signed-off-by: Dev Jain Reviewed-by: Lorenzo Stoakes Reviewed-by: Barry Song Reviewed-by: Zi Yan Cc: Anshuman Khandual Cc: Catalin Marinas Cc: Christophe Leroy Cc: David Hildenbrand Cc: Hugh Dickins Cc: Jann Horn Cc: Joey Gouly Cc: Kevin Brodsky Cc: Lance Yang Cc: Liam Howlett Cc: Matthew Wilcox (Oracle) Cc: Peter Xu Cc: Ryan Roberts Cc: Vlastimil Babka Cc: Will Deacon Cc: Yang Shi Cc: Yicong Yang Cc: Zhenhua Huang Signed-off-by: Andrew Morton --- mm/mprotect.c | 101 +++++++++++++++++++++++++++----------------------- 1 file changed, 55 insertions(+), 46 deletions(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 88709c01177b..2a9c73bd0778 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -83,6 +83,59 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, return pte_dirty(pte); } +static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, + pte_t oldpte, pte_t *pte, int target_node) +{ + struct folio *folio; + bool toptier; + int nid; + + /* Avoid TLB flush if possible */ + if (pte_protnone(oldpte)) + return true; + + folio = vm_normal_folio(vma, addr, oldpte); + if (!folio) + return true; + + if (folio_is_zone_device(folio) || folio_test_ksm(folio)) + return true; + + /* Also skip shared copy-on-write pages */ + if (is_cow_mapping(vma->vm_flags) && + (folio_maybe_dma_pinned(folio) || folio_maybe_mapped_shared(folio))) + return true; + + /* + * While migration can move some dirty pages, + * it cannot move them all from MIGRATE_ASYNC + * context. + */ + if (folio_is_file_lru(folio) && folio_test_dirty(folio)) + return true; + + /* + * Don't mess with PTEs if page is already on the node + * a single-threaded process is running on. + */ + nid = folio_nid(folio); + if (target_node == nid) + return true; + + toptier = node_is_toptier(nid); + + /* + * Skip scanning top tier node if normal numa + * balancing is disabled + */ + if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier) + return true; + + if (folio_use_access_time(folio)) + folio_xchg_access_time(folio, jiffies_to_msecs(jiffies)); + return false; +} + static long change_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, pgprot_t newprot, unsigned long cp_flags) @@ -117,53 +170,9 @@ static long change_pte_range(struct mmu_gather *tlb, * pages. See similar comment in change_huge_pmd. */ if (prot_numa) { - struct folio *folio; - int nid; - bool toptier; - - /* Avoid TLB flush if possible */ - if (pte_protnone(oldpte)) + if (prot_numa_skip(vma, addr, oldpte, pte, + target_node)) continue; - - folio = vm_normal_folio(vma, addr, oldpte); - if (!folio || folio_is_zone_device(folio) || - folio_test_ksm(folio)) - continue; - - /* Also skip shared copy-on-write pages */ - if (is_cow_mapping(vma->vm_flags) && - (folio_maybe_dma_pinned(folio) || - folio_maybe_mapped_shared(folio))) - continue; - - /* - * While migration can move some dirty pages, - * it cannot move them all from MIGRATE_ASYNC - * context. - */ - if (folio_is_file_lru(folio) && - folio_test_dirty(folio)) - continue; - - /* - * Don't mess with PTEs if page is already on the node - * a single-threaded process is running on. 
- */ - nid = folio_nid(folio); - if (target_node == nid) - continue; - toptier = node_is_toptier(nid); - - /* - * Skip scanning top tier node if normal numa - * balancing is disabled - */ - if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && - toptier) - continue; - if (folio_use_access_time(folio)) - folio_xchg_access_time(folio, - jiffies_to_msecs(jiffies)); } oldpte = ptep_modify_prot_start(vma, addr, pte); From 1d40f4e3d9d6a2d6807daa22d40ff27fc0c3d0f5 Mon Sep 17 00:00:00 2001 From: Dev Jain Date: Fri, 18 Jul 2025 14:32:39 +0530 Subject: [PATCH 322/370] mm: optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs For the MM_CP_PROT_NUMA skipping case, observe that, if we skip an iteration due to the underlying folio satisfying any of the skip conditions, then for all subsequent ptes which map the same folio, the iteration will be skipped for them too. Therefore, we can optimize by using folio_pte_batch() to batch skip the iterations. Use prot_numa_skip() introduced in the previous patch to determine whether we need to skip the iteration. Change its signature to have a double pointer to a folio, which will be used by mprotect_folio_pte_batch() to determine the number of iterations we can safely skip. Link: https://lkml.kernel.org/r/20250718090244.21092-3-dev.jain@arm.com Signed-off-by: Dev Jain Reviewed-by: Lorenzo Stoakes Reviewed-by: Ryan Roberts Reviewed-by: Zi Yan Cc: Anshuman Khandual Cc: Barry Song Cc: Catalin Marinas Cc: Christophe Leroy Cc: David Hildenbrand Cc: Hugh Dickins Cc: Jann Horn Cc: Joey Gouly Cc: Kevin Brodsky Cc: Lance Yang Cc: Liam Howlett Cc: Matthew Wilcox (Oracle) Cc: Peter Xu Cc: Vlastimil Babka Cc: Will Deacon Cc: Yang Shi Cc: Yicong Yang Cc: Zhenhua Huang Signed-off-by: Andrew Morton --- mm/mprotect.c | 57 ++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 43 insertions(+), 14 deletions(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 2a9c73bd0778..97adc62c50ab 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -83,28 +83,43 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, return pte_dirty(pte); } -static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, - pte_t oldpte, pte_t *pte, int target_node) +static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, + pte_t pte, int max_nr_ptes) { - struct folio *folio; + /* No underlying folio, so cannot batch */ + if (!folio) + return 1; + + if (!folio_test_large(folio)) + return 1; + + return folio_pte_batch(folio, ptep, pte, max_nr_ptes); +} + +static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, + pte_t oldpte, pte_t *pte, int target_node, + struct folio **foliop) +{ + struct folio *folio = NULL; + bool ret = true; bool toptier; int nid; /* Avoid TLB flush if possible */ if (pte_protnone(oldpte)) - return true; + goto skip; folio = vm_normal_folio(vma, addr, oldpte); if (!folio) - return true; + goto skip; if (folio_is_zone_device(folio) || folio_test_ksm(folio)) - return true; + goto skip; /* Also skip shared copy-on-write pages */ if (is_cow_mapping(vma->vm_flags) && (folio_maybe_dma_pinned(folio) || folio_maybe_mapped_shared(folio))) - return true; + goto skip; /* * While migration can move some dirty pages, @@ -112,7 +127,7 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, * context. 
*/ if (folio_is_file_lru(folio) && folio_test_dirty(folio)) - return true; + goto skip; /* * Don't mess with PTEs if page is already on the node @@ -120,7 +135,7 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, */ nid = folio_nid(folio); if (target_node == nid) - return true; + goto skip; toptier = node_is_toptier(nid); @@ -129,11 +144,15 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, * balancing is disabled */ if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier) - return true; + goto skip; + ret = false; if (folio_use_access_time(folio)) folio_xchg_access_time(folio, jiffies_to_msecs(jiffies)); - return false; + +skip: + *foliop = folio; + return ret; } static long change_pte_range(struct mmu_gather *tlb, @@ -147,6 +166,7 @@ static long change_pte_range(struct mmu_gather *tlb, bool prot_numa = cp_flags & MM_CP_PROT_NUMA; bool uffd_wp = cp_flags & MM_CP_UFFD_WP; bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE; + int nr_ptes; tlb_change_page_size(tlb, PAGE_SIZE); pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); @@ -161,8 +181,11 @@ static long change_pte_range(struct mmu_gather *tlb, flush_tlb_batched_pending(vma->vm_mm); arch_enter_lazy_mmu_mode(); do { + nr_ptes = 1; oldpte = ptep_get(pte); if (pte_present(oldpte)) { + int max_nr_ptes = (end - addr) >> PAGE_SHIFT; + struct folio *folio; pte_t ptent; /* @@ -170,9 +193,15 @@ static long change_pte_range(struct mmu_gather *tlb, * pages. See similar comment in change_huge_pmd. */ if (prot_numa) { - if (prot_numa_skip(vma, addr, oldpte, pte, - target_node)) + int ret = prot_numa_skip(vma, addr, oldpte, pte, + target_node, &folio); + if (ret) { + + /* determine batch to skip */ + nr_ptes = mprotect_folio_pte_batch(folio, + pte, oldpte, max_nr_ptes); continue; + } } oldpte = ptep_modify_prot_start(vma, addr, pte); @@ -289,7 +318,7 @@ static long change_pte_range(struct mmu_gather *tlb, pages++; } } - } while (pte++, addr += PAGE_SIZE, addr != end); + } while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end); arch_leave_lazy_mmu_mode(); pte_unmap_unlock(pte - 1, ptl); From 0aa3657df3ec713fca1f00a57a063b28f2a78147 Mon Sep 17 00:00:00 2001 From: Dev Jain Date: Fri, 18 Jul 2025 14:32:40 +0530 Subject: [PATCH 323/370] mm: add batched versions of ptep_modify_prot_start/commit Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect, implementing them as a simple loop over the corresponding single pte helpers. Architecture may override these helpers. 
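Concretely, an architecture overrides the generic loops by defining the corresponding macros and supplying its own implementations; arm64 does exactly this later in this series (excerpted from patch 327 below):

#define modify_prot_start_ptes modify_prot_start_ptes
extern pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
				    unsigned long addr, pte_t *ptep,
				    unsigned int nr);

#define modify_prot_commit_ptes modify_prot_commit_ptes
extern void modify_prot_commit_ptes(struct vm_area_struct *vma,
				    unsigned long addr, pte_t *ptep,
				    pte_t old_pte, pte_t pte,
				    unsigned int nr);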
Link: https://lkml.kernel.org/r/20250718090244.21092-4-dev.jain@arm.com Signed-off-by: Dev Jain Reviewed-by: Lorenzo Stoakes Reviewed-by: Barry Song Reviewed-by: Ryan Roberts Reviewed-by: Zi Yan Cc: Anshuman Khandual Cc: Catalin Marinas Cc: Christophe Leroy Cc: David Hildenbrand Cc: Hugh Dickins Cc: Jann Horn Cc: Joey Gouly Cc: Kevin Brodsky Cc: Lance Yang Cc: Liam Howlett Cc: Matthew Wilcox (Oracle) Cc: Peter Xu Cc: Vlastimil Babka Cc: Will Deacon Cc: Yang Shi Cc: Yicong Yang Cc: Zhenhua Huang Signed-off-by: Andrew Morton --- include/linux/pgtable.h | 84 ++++++++++++++++++++++++++++++++++++++++- mm/mprotect.c | 4 +- 2 files changed, 85 insertions(+), 3 deletions(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index cf1515c163e2..e3b99920be05 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1331,7 +1331,9 @@ static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma, /* * Commit an update to a pte, leaving any hardware-controlled bits in - * the PTE unmodified. + * the PTE unmodified. The pte returned from ptep_modify_prot_start() may + * additionally have young and/or dirty bits set where previously they were not, + * so the updated pte may have these additional changes. */ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, @@ -1340,6 +1342,86 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma, __ptep_modify_prot_commit(vma, addr, ptep, pte); } #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */ + +/** + * modify_prot_start_ptes - Start a pte protection read-modify-write transaction + * over a batch of ptes, which protects against asynchronous hardware + * modifications to the ptes. The intention is not to prevent the hardware from + * making pte updates, but to prevent any updates it may make from being lost. + * Please see the comment above ptep_modify_prot_start() for full description. + * + * @vma: The virtual memory area the pages are mapped into. + * @addr: Address the first page is mapped at. + * @ptep: Page table pointer for the first entry. + * @nr: Number of entries. + * + * May be overridden by the architecture; otherwise, implemented as a simple + * loop over ptep_modify_prot_start(), collecting the a/d bits from each pte + * in the batch. + * + * Note that PTE bits in the PTE batch besides the PFN can differ. + * + * Context: The caller holds the page table lock. The PTEs map consecutive + * pages that belong to the same folio. All other PTE bits must be identical for + * all PTEs in the batch except for young and dirty bits. The PTEs are all in + * the same PMD. + */ +#ifndef modify_prot_start_ptes +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep, unsigned int nr) +{ + pte_t pte, tmp_pte; + + pte = ptep_modify_prot_start(vma, addr, ptep); + while (--nr) { + ptep++; + addr += PAGE_SIZE; + tmp_pte = ptep_modify_prot_start(vma, addr, ptep); + if (pte_dirty(tmp_pte)) + pte = pte_mkdirty(pte); + if (pte_young(tmp_pte)) + pte = pte_mkyoung(pte); + } + return pte; +} +#endif + +/** + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any + * hardware-controlled bits in the PTE unmodified. + * + * @vma: The virtual memory area the pages are mapped into. + * @addr: Address the first page is mapped at. + * @ptep: Page table pointer for the first entry. + * @old_pte: Old page table entry (for the first entry) which is now cleared. + * @pte: New page table entry to be set. 
+ * @nr: Number of entries. + * + * May be overridden by the architecture; otherwise, implemented as a simple + * loop over ptep_modify_prot_commit(). + * + * Context: The caller holds the page table lock. The PTEs are all in the same + * PMD. On exit, the set ptes in the batch map the same folio. The ptes set by + * ptep_modify_prot_start() may additionally have young and/or dirty bits set + * where previously they were not, so the updated ptes may have these + * additional changes. + */ +#ifndef modify_prot_commit_ptes +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr, + pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr) +{ + int i; + + for (i = 0; i < nr; ++i, ++ptep, addr += PAGE_SIZE) { + ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte); + + /* Advance PFN only, set same prot */ + old_pte = pte_next_pfn(old_pte); + pte = pte_next_pfn(pte); + } +} +#endif + #endif /* CONFIG_MMU */ /* diff --git a/mm/mprotect.c b/mm/mprotect.c index 97adc62c50ab..4977f198168e 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -204,7 +204,7 @@ static long change_pte_range(struct mmu_gather *tlb, } } - oldpte = ptep_modify_prot_start(vma, addr, pte); + oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes); ptent = pte_modify(oldpte, newprot); if (uffd_wp) @@ -230,7 +230,7 @@ static long change_pte_range(struct mmu_gather *tlb, can_change_pte_writable(vma, addr, ptent)) ptent = pte_mkwrite(ptent, vma); - ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent); + modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes); if (pte_needs_flush(oldpte, ptent)) tlb_flush_pte_range(tlb, addr, PAGE_SIZE); pages++; From 57fae936b40cba55f36bb8e3296f271696c2bb67 Mon Sep 17 00:00:00 2001 From: Dev Jain Date: Fri, 18 Jul 2025 14:32:41 +0530 Subject: [PATCH 324/370] mm: introduce FPB_RESPECT_WRITE for PTE batching infrastructure Patch 6 ("mm: Optimize mprotect() by PTE batching") optimizes mprotect() by batch clearing the ptes, masking in the new protections, and batch setting the ptes. Suppose that the first pte of the batch is writable - with the current implementation of folio_pte_batch(), it is not guaranteed that the other ptes in the batch are already writable too, so we may incorrectly end up setting the writable bit on all ptes via modify_prot_commit_ptes(). Therefore, introduce FPB_RESPECT_WRITE so that all ptes in the batch are writable or not. Link: https://lkml.kernel.org/r/20250718090244.21092-5-dev.jain@arm.com Signed-off-by: Dev Jain Reviewed-by: Lorenzo Stoakes Reviewed-by: Ryan Roberts Reviewed-by: Zi Yan Cc: Anshuman Khandual Cc: Barry Song Cc: Catalin Marinas Cc: Christophe Leroy Cc: David Hildenbrand Cc: Hugh Dickins Cc: Jann Horn Cc: Joey Gouly Cc: Kevin Brodsky Cc: Lance Yang Cc: Liam Howlett Cc: Matthew Wilcox (Oracle) Cc: Peter Xu Cc: Vlastimil Babka Cc: Will Deacon Cc: Yang Shi Cc: Yicong Yang Cc: Zhenhua Huang Signed-off-by: Andrew Morton --- mm/internal.h | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 5b0f71e5434b..28d2d5b051df 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -208,17 +208,20 @@ typedef int __bitwise fpb_t; /* Compare PTEs respecting the soft-dirty bit. */ #define FPB_RESPECT_SOFT_DIRTY ((__force fpb_t)BIT(1)) +/* Compare PTEs respecting the writable bit. */ +#define FPB_RESPECT_WRITE ((__force fpb_t)BIT(2)) + /* * Merge PTE write bits: if any PTE in the batch is writable, modify the * PTE at @ptentp to be writable. 
*/ -#define FPB_MERGE_WRITE ((__force fpb_t)BIT(2)) +#define FPB_MERGE_WRITE ((__force fpb_t)BIT(3)) /* * Merge PTE young and dirty bits: if any PTE in the batch is young or dirty, * modify the PTE at @ptentp to be young or dirty, respectively. */ -#define FPB_MERGE_YOUNG_DIRTY ((__force fpb_t)BIT(3)) +#define FPB_MERGE_YOUNG_DIRTY ((__force fpb_t)BIT(4)) static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) { @@ -226,7 +229,9 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) pte = pte_mkclean(pte); if (likely(!(flags & FPB_RESPECT_SOFT_DIRTY))) pte = pte_clear_soft_dirty(pte); - return pte_wrprotect(pte_mkold(pte)); + if (likely(!(flags & FPB_RESPECT_WRITE))) + pte = pte_wrprotect(pte); + return pte_mkold(pte); } /** From 45199f715b7455a2e4054dbc5dab0c3b65e2abc1 Mon Sep 17 00:00:00 2001 From: Dev Jain Date: Fri, 18 Jul 2025 14:32:42 +0530 Subject: [PATCH 325/370] mm: split can_change_pte_writable() into private and shared parts In preparation for patch 6 and modularizing the code in general, split can_change_pte_writable() into private and shared VMA parts. No functional change intended. Link: https://lkml.kernel.org/r/20250718090244.21092-6-dev.jain@arm.com Signed-off-by: Dev Jain Suggested-by: Lorenzo Stoakes Reviewed-by: Lorenzo Stoakes Reviewed-by: Zi Yan Cc: Anshuman Khandual Cc: Barry Song Cc: Catalin Marinas Cc: Christophe Leroy Cc: David Hildenbrand Cc: Hugh Dickins Cc: Jann Horn Cc: Joey Gouly Cc: Kevin Brodsky Cc: Lance Yang Cc: Liam Howlett Cc: Matthew Wilcox (Oracle) Cc: Peter Xu Cc: Ryan Roberts Cc: Vlastimil Babka Cc: Will Deacon Cc: Yang Shi Cc: Yicong Yang Cc: Zhenhua Huang Signed-off-by: Andrew Morton --- mm/mprotect.c | 50 ++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 36 insertions(+), 14 deletions(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 4977f198168e..a1c7d8a4648d 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -40,11 +40,8 @@ #include "internal.h" -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, - pte_t pte) +static bool maybe_change_pte_writable(struct vm_area_struct *vma, pte_t pte) { - struct page *page; - if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE))) return false; @@ -60,16 +57,32 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, if (userfaultfd_pte_wp(vma, pte)) return false; - if (!(vma->vm_flags & VM_SHARED)) { - /* - * Writable MAP_PRIVATE mapping: We can only special-case on - * exclusive anonymous pages, because we know that our - * write-fault handler similarly would map them writable without - * any additional checks while holding the PT lock. - */ - page = vm_normal_page(vma, addr, pte); - return page && PageAnon(page) && PageAnonExclusive(page); - } + return true; +} + +static bool can_change_private_pte_writable(struct vm_area_struct *vma, + unsigned long addr, pte_t pte) +{ + struct page *page; + + if (!maybe_change_pte_writable(vma, pte)) + return false; + + /* + * Writable MAP_PRIVATE mapping: We can only special-case on + * exclusive anonymous pages, because we know that our + * write-fault handler similarly would map them writable without + * any additional checks while holding the PT lock. 
+ */ + page = vm_normal_page(vma, addr, pte); + return page && PageAnon(page) && PageAnonExclusive(page); +} + +static bool can_change_shared_pte_writable(struct vm_area_struct *vma, + pte_t pte) +{ + if (!maybe_change_pte_writable(vma, pte)) + return false; VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte)); @@ -83,6 +96,15 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, return pte_dirty(pte); } +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, + pte_t pte) +{ + if (!(vma->vm_flags & VM_SHARED)) + return can_change_private_pte_writable(vma, addr, pte); + + return can_change_shared_pte_writable(vma, pte); +} + static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte, int max_nr_ptes) { From cac1db8c3aad97d6ffb56ced8868d6cbbbd2bfbe Mon Sep 17 00:00:00 2001 From: Dev Jain Date: Fri, 18 Jul 2025 14:32:43 +0530 Subject: [PATCH 326/370] mm: optimize mprotect() by PTE batching Use folio_pte_batch to batch process a large folio. Note that, PTE batching here will save a few function calls, and this strategy in certain cases (not this one) batches atomic operations in general, so we have a performance win for all arches. This patch paves the way for patch 7 which will help us elide the TLBI per contig block on arm64. The correctness of this patch lies on the correctness of setting the new ptes based upon information only from the first pte of the batch (which may also have accumulated a/d bits via modify_prot_start_ptes()). Observe that the flag combination we pass to mprotect_folio_pte_batch() guarantees that the batch is uniform w.r.t the soft-dirty bit and the writable bit. Therefore, the only bits which may differ are the a/d bits. So we only need to worry about code which is concerned about the a/d bits of the PTEs. Setting extra a/d bits on the new ptes where previously they were not set, is fine - setting access bit when it was not set is not an incorrectness problem but will only possibly delay the reclaim of the page mapped by the pte (which is in fact intended because the kernel just operated on this region via mprotect()!). Setting dirty bit when it was not set is again not an incorrectness problem but will only possibly force an unnecessary writeback. So now we need to reason whether something can go wrong via can_change_pte_writable(). The pte_protnone, pte_needs_soft_dirty_wp, and userfaultfd_pte_wp cases are solved due to uniformity in the corresponding bits guaranteed by the flag combination. The ptes all belong to the same VMA (since callers guarantee that [start, end) will lie within the VMA) therefore the conditional based on the VMA is also safe to batch around. Since the dirty bit on the PTE really is just an indication that the folio got written to - even if the PTE is not actually dirty but one of the PTEs in the batch is, the wp-fault optimization can be made. Therefore, it is safe to batch around pte_dirty() in can_change_shared_pte_writable() (in fact this is better since without batching, it may happen that some ptes aren't changed to writable just because they are not dirty, even though the other ptes mapping the same large folio are dirty). To batch around the PageAnonExclusive case, we must check the corresponding condition for every single page. Therefore, from the large folio batch, we process sub batches of ptes mapping pages with the same PageAnonExclusive condition, and process that sub batch, then determine and process the next sub batch, and so on. 
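The sub-batch scan itself is straightforward; the following standalone userspace program (illustrative only) mirrors the shape of page_anon_exclusive_sub_batch() below, with the PageAnonExclusive() test replaced by a plain boolean array:

#include <stdio.h>
#include <stdbool.h>

/* Length of the run of entries sharing excl[start]'s value. */
static int sub_batch_len(const bool *excl, int start, int max_len)
{
	int idx;

	for (idx = start + 1; idx < start + max_len; ++idx)
		if (excl[idx] != excl[start])
			break;
	return idx - start;
}

int main(void)
{
	bool excl[8] = { true, true, false, false, false, true, true, true };
	int i = 0, nr = 8;

	/* Walk the batch once, emitting uniform sub-batches. */
	while (nr) {
		int len = sub_batch_len(excl, i, nr);

		printf("sub-batch at %d, len %d, exclusive=%d\n",
		       i, len, excl[i]);
		i += len;
		nr -= len;
	}
	return 0;
}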
Note that this does not cause any extra overhead; if suppose the size of the folio batch is 512, then the sub batch processing in total will take 512 iterations, which is the same as what we would have done before. For pte_needs_flush(): ppc does not care about the a/d bits. For x86, PAGE_SAVED_DIRTY is ignored. We will flush only when a/d bits get cleared; since we can only have extra a/d bits due to batching, we will only have an extra flush, not a case where we elide a flush due to batching when we shouldn't have. Link: https://lkml.kernel.org/r/20250718090244.21092-7-dev.jain@arm.com Signed-off-by: Dev Jain Reviewed-by: Lorenzo Stoakes Reviewed-by: Zi Yan Cc: Anshuman Khandual Cc: Barry Song Cc: Catalin Marinas Cc: Christophe Leroy Cc: David Hildenbrand Cc: Hugh Dickins Cc: Jann Horn Cc: Joey Gouly Cc: Kevin Brodsky Cc: Lance Yang Cc: Liam Howlett Cc: Matthew Wilcox (Oracle) Cc: Peter Xu Cc: Ryan Roberts Cc: Vlastimil Babka Cc: Will Deacon Cc: Yang Shi Cc: Yicong Yang Cc: Zhenhua Huang Signed-off-by: Andrew Morton --- mm/mprotect.c | 125 +++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 113 insertions(+), 12 deletions(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index a1c7d8a4648d..2ddd37b2f462 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -106,7 +106,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, } static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, - pte_t pte, int max_nr_ptes) + pte_t pte, int max_nr_ptes, fpb_t flags) { /* No underlying folio, so cannot batch */ if (!folio) @@ -115,7 +115,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, if (!folio_test_large(folio)) return 1; - return folio_pte_batch(folio, ptep, pte, max_nr_ptes); + return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags); } static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, @@ -177,6 +177,102 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, return ret; } +/* Set nr_ptes number of ptes, starting from idx */ +static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr, + pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, + int idx, bool set_write, struct mmu_gather *tlb) +{ + /* + * Advance the position in the batch by idx; note that if idx > 0, + * then the nr_ptes passed here is <= batch size - idx. + */ + addr += idx * PAGE_SIZE; + ptep += idx; + oldpte = pte_advance_pfn(oldpte, idx); + ptent = pte_advance_pfn(ptent, idx); + + if (set_write) + ptent = pte_mkwrite(ptent, vma); + + modify_prot_commit_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes); + if (pte_needs_flush(oldpte, ptent)) + tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE); +} + +/* + * Get max length of consecutive ptes pointing to PageAnonExclusive() pages or + * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce + * that the ptes point to consecutive pages of the same anon large folio. + */ +static int page_anon_exclusive_sub_batch(int start_idx, int max_len, + struct page *first_page, bool expected_anon_exclusive) +{ + int idx; + + for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) { + if (expected_anon_exclusive != PageAnonExclusive(first_page + idx)) + break; + } + return idx - start_idx; +} + +/* + * This function is a result of trying our very best to retain the + * "avoid the write-fault handler" optimization. 
In can_change_pte_writable(), + * if the vma is a private vma, and we cannot determine whether to change + * the pte to writable just from the vma and the pte, we then need to look + * at the actual page pointed to by the pte. Unfortunately, if we have a + * batch of ptes pointing to consecutive pages of the same anon large folio, + * the anon-exclusivity (or the negation) of the first page does not guarantee + * the anon-exclusivity (or the negation) of the other pages corresponding to + * the pte batch; hence in this case it is incorrect to decide to change or + * not change the ptes to writable just by using information from the first + * pte of the batch. Therefore, we must individually check all pages and + * retrieve sub-batches. + */ +static void commit_anon_folio_batch(struct vm_area_struct *vma, + struct folio *folio, unsigned long addr, pte_t *ptep, + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) +{ + struct page *first_page = folio_page(folio, 0); + bool expected_anon_exclusive; + int sub_batch_idx = 0; + int len; + + while (nr_ptes) { + expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx); + len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes, + first_page, expected_anon_exclusive); + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len, + sub_batch_idx, expected_anon_exclusive, tlb); + sub_batch_idx += len; + nr_ptes -= len; + } +} + +static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, + struct folio *folio, unsigned long addr, pte_t *ptep, + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) +{ + bool set_write; + + if (vma->vm_flags & VM_SHARED) { + set_write = can_change_shared_pte_writable(vma, ptent); + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes, + /* idx = */ 0, set_write, tlb); + return; + } + + set_write = maybe_change_pte_writable(vma, ptent) && + (folio && folio_test_anon(folio)); + if (!set_write) { + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes, + /* idx = */ 0, set_write, tlb); + return; + } + commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, nr_ptes, tlb); +} + static long change_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, pgprot_t newprot, unsigned long cp_flags) @@ -206,8 +302,9 @@ static long change_pte_range(struct mmu_gather *tlb, nr_ptes = 1; oldpte = ptep_get(pte); if (pte_present(oldpte)) { + const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE; int max_nr_ptes = (end - addr) >> PAGE_SHIFT; - struct folio *folio; + struct folio *folio = NULL; pte_t ptent; /* @@ -221,11 +318,16 @@ static long change_pte_range(struct mmu_gather *tlb, /* determine batch to skip */ nr_ptes = mprotect_folio_pte_batch(folio, - pte, oldpte, max_nr_ptes); + pte, oldpte, max_nr_ptes, /* flags = */ 0); continue; } } + if (!folio) + folio = vm_normal_folio(vma, addr, oldpte); + + nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags); + oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes); ptent = pte_modify(oldpte, newprot); @@ -248,14 +350,13 @@ static long change_pte_range(struct mmu_gather *tlb, * COW or special handling is required. 
*/ if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && - !pte_write(ptent) && - can_change_pte_writable(vma, addr, ptent)) - ptent = pte_mkwrite(ptent, vma); - - modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes); - if (pte_needs_flush(oldpte, ptent)) - tlb_flush_pte_range(tlb, addr, PAGE_SIZE); - pages++; + !pte_write(ptent)) + set_write_prot_commit_flush_ptes(vma, folio, + addr, pte, oldpte, ptent, nr_ptes, tlb); + else + prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent, + nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb); + pages += nr_ptes; } else if (is_swap_pte(oldpte)) { swp_entry_t entry = pte_to_swp_entry(oldpte); pte_t newpte; From 7efa1cd5f89b53e15659f84efb2c7a21076df04f Mon Sep 17 00:00:00 2001 From: Dev Jain Date: Fri, 18 Jul 2025 14:32:44 +0530 Subject: [PATCH 327/370] arm64: add batched versions of ptep_modify_prot_start/commit Override the generic definition of modify_prot_start_ptes() to use get_and_clear_full_ptes(). This helper does a TLBI only for the starting and ending contpte block of the range, whereas the current implementation will call ptep_get_and_clear() for every contpte block, thus doing a TLBI on every contpte block. Therefore, we have a performance win. The arm64 definition of pte_accessible() allows us to batch in the errata specific case: #define pte_accessible(mm, pte) \ (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte)) All ptes are obviously present in the folio batch, and they are also valid. Override the generic definition of modify_prot_commit_ptes() to simply use set_ptes() to map the new ptes into the pagetable. Link: https://lkml.kernel.org/r/20250718090244.21092-8-dev.jain@arm.com Signed-off-by: Dev Jain Reviewed-by: Ryan Roberts Cc: Anshuman Khandual Cc: Barry Song Cc: Catalin Marinas Cc: Christophe Leroy Cc: David Hildenbrand Cc: Hugh Dickins Cc: Jann Horn Cc: Joey Gouly Cc: Kevin Brodsky Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Peter Xu Cc: Vlastimil Babka Cc: Will Deacon Cc: Yang Shi Cc: Yicong Yang Cc: Zhenhua Huang Cc: Zi Yan Signed-off-by: Andrew Morton --- arch/arm64/include/asm/pgtable.h | 10 ++++++++++ arch/arm64/mm/mmu.c | 28 +++++++++++++++++++++++----- 2 files changed, 33 insertions(+), 5 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index ba63c8736666..abd2dee416b3 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -1643,6 +1643,16 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t old_pte, pte_t new_pte); +#define modify_prot_start_ptes modify_prot_start_ptes +extern pte_t modify_prot_start_ptes(struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep, + unsigned int nr); + +#define modify_prot_commit_ptes modify_prot_commit_ptes +extern void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr, + pte_t *ptep, pte_t old_pte, pte_t pte, + unsigned int nr); + #ifdef CONFIG_ARM64_CONTPTE /* diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 3d5fb37424ab..abd9725796e9 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -26,6 +26,7 @@ #include #include #include +#include #include #include @@ -1524,24 +1525,41 @@ static int __init prevent_bootmem_remove_init(void) early_initcall(prevent_bootmem_remove_init); #endif -pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep) +pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr, + pte_t 
*ptep, unsigned int nr) { + pte_t pte = get_and_clear_full_ptes(vma->vm_mm, addr, ptep, nr, /* full = */ 0); + if (alternative_has_cap_unlikely(ARM64_WORKAROUND_2645198)) { /* * Break-before-make (BBM) is required for all user space mappings * when the permission changes from executable to non-executable * in cases where cpu is affected with errata #2645198. */ - if (pte_user_exec(ptep_get(ptep))) - return ptep_clear_flush(vma, addr, ptep); + if (pte_accessible(vma->vm_mm, pte) && pte_user_exec(pte)) + __flush_tlb_range(vma, addr, nr * PAGE_SIZE, + PAGE_SIZE, true, 3); } - return ptep_get_and_clear(vma->vm_mm, addr, ptep); + + return pte; +} + +pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep) +{ + return modify_prot_start_ptes(vma, addr, ptep, 1); +} + +void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr, + pte_t *ptep, pte_t old_pte, pte_t pte, + unsigned int nr) +{ + set_ptes(vma->vm_mm, addr, ptep, pte, nr); } void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t old_pte, pte_t pte) { - set_pte_at(vma->vm_mm, addr, ptep, pte); + modify_prot_commit_ptes(vma, addr, ptep, old_pte, pte, 1); } /* From 3f6bfd4789a0f396b8c0dfb8713c1f3eeed3b2d7 Mon Sep 17 00:00:00 2001 From: wang lian Date: Thu, 17 Jul 2025 21:18:56 +0800 Subject: [PATCH 328/370] selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" Patch series "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup", v2. This series introduces a common FORCE_READ() macro to replace the cryptic asm volatile("" : "+r" (variable)); construct used in several mm selftests. This improves code readability and maintainability by removing duplicated, hard-to-understand code. This patch (of 2): Several mm selftests use the `asm volatile("" : "+r" (variable));` construct to force a read of a variable, preventing the compiler from optimizing away the memory access. This idiom is cryptic and duplicated across multiple test files. Following a suggestion from David[1], this patch refactors this common pattern into a FORCE_READ() macro. Link: https://lkml.kernel.org/r/20250717131857.59909-1-lianux.mm@gmail.com Link: https://lkml.kernel.org/r/20250717131857.59909-2-lianux.mm@gmail.com Link: https://lore.kernel.org/lkml/4a3e0759-caa1-4cfa-bc3f-402593f1eee3@redhat.com/ [1] Signed-off-by: wang lian Reviewed-by: Lorenzo Stoakes Acked-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Wei Yang Cc: Christian Brauner Cc: Jann Horn Cc: Kairui Song Cc: Liam Howlett Cc: Mark Brown Cc: SeongJae Park Cc: Shuah Khan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/cow.c | 30 +++++++++---------- tools/testing/selftests/mm/guard-regions.c | 7 ----- tools/testing/selftests/mm/hugetlb-madvise.c | 5 +--- tools/testing/selftests/mm/migration.c | 13 ++++---- tools/testing/selftests/mm/pagemap_ioctl.c | 4 +-- .../selftests/mm/split_huge_page_test.c | 4 +-- tools/testing/selftests/mm/vm_util.h | 7 +++++ 7 files changed, 31 insertions(+), 39 deletions(-) diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c index 788e82b00b75..d30625c18259 100644 --- a/tools/testing/selftests/mm/cow.c +++ b/tools/testing/selftests/mm/cow.c @@ -1534,7 +1534,7 @@ static void test_ro_fast_pin(char *mem, const char *smem, size_t size) static void run_with_zeropage(non_anon_test_fn fn, const char *desc) { - char *mem, *smem, tmp; + char *mem, *smem; log_test_start("%s ... 
with shared zeropage", desc); @@ -1554,8 +1554,8 @@ static void run_with_zeropage(non_anon_test_fn fn, const char *desc) } /* Read from the page to populate the shared zeropage. */ - tmp = *mem + *smem; - asm volatile("" : "+r" (tmp)); + FORCE_READ(mem); + FORCE_READ(smem); fn(mem, smem, pagesize); munmap: @@ -1566,7 +1566,7 @@ static void run_with_zeropage(non_anon_test_fn fn, const char *desc) static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc) { - char *mem, *smem, *mmap_mem, *mmap_smem, tmp; + char *mem, *smem, *mmap_mem, *mmap_smem; size_t mmap_size; int ret; @@ -1617,8 +1617,8 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc) * the first sub-page and test if we get another sub-page populated * automatically. */ - tmp = *mem + *smem; - asm volatile("" : "+r" (tmp)); + FORCE_READ(mem); + FORCE_READ(smem); if (!pagemap_is_populated(pagemap_fd, mem + pagesize) || !pagemap_is_populated(pagemap_fd, smem + pagesize)) { ksft_test_result_skip("Did not get THPs populated\n"); @@ -1634,7 +1634,7 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc) static void run_with_memfd(non_anon_test_fn fn, const char *desc) { - char *mem, *smem, tmp; + char *mem, *smem; int fd; log_test_start("%s ... with memfd", desc); @@ -1668,8 +1668,8 @@ static void run_with_memfd(non_anon_test_fn fn, const char *desc) } /* Fault the page in. */ - tmp = *mem + *smem; - asm volatile("" : "+r" (tmp)); + FORCE_READ(mem); + FORCE_READ(smem); fn(mem, smem, pagesize); munmap: @@ -1682,7 +1682,7 @@ static void run_with_memfd(non_anon_test_fn fn, const char *desc) static void run_with_tmpfile(non_anon_test_fn fn, const char *desc) { - char *mem, *smem, tmp; + char *mem, *smem; FILE *file; int fd; @@ -1724,8 +1724,8 @@ static void run_with_tmpfile(non_anon_test_fn fn, const char *desc) } /* Fault the page in. */ - tmp = *mem + *smem; - asm volatile("" : "+r" (tmp)); + FORCE_READ(mem); + FORCE_READ(smem); fn(mem, smem, pagesize); munmap: @@ -1740,7 +1740,7 @@ static void run_with_memfd_hugetlb(non_anon_test_fn fn, const char *desc, size_t hugetlbsize) { int flags = MFD_HUGETLB; - char *mem, *smem, tmp; + char *mem, *smem; int fd; log_test_start("%s ... with memfd hugetlb (%zu kB)", desc, @@ -1778,8 +1778,8 @@ static void run_with_memfd_hugetlb(non_anon_test_fn fn, const char *desc, } /* Fault the page in. */ - tmp = *mem + *smem; - asm volatile("" : "+r" (tmp)); + FORCE_READ(mem); + FORCE_READ(smem); fn(mem, smem, hugetlbsize); munmap: diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c index 93af3d3760f9..4b76e72e7053 100644 --- a/tools/testing/selftests/mm/guard-regions.c +++ b/tools/testing/selftests/mm/guard-regions.c @@ -35,13 +35,6 @@ static volatile sig_atomic_t signal_jump_set; static sigjmp_buf signal_jmp_buf; -/* - * Ignore the checkpatch warning, we must read from x but don't want to do - * anything with it in order to trigger a read page fault. We therefore must use - * volatile to stop the compiler from optimising this away. - */ -#define FORCE_READ(x) (*(volatile typeof(x) *)x) - /* * How is the test backing the mapping being tested? 
*/ diff --git a/tools/testing/selftests/mm/hugetlb-madvise.c b/tools/testing/selftests/mm/hugetlb-madvise.c index e74107185324..1afe14b9dc0c 100644 --- a/tools/testing/selftests/mm/hugetlb-madvise.c +++ b/tools/testing/selftests/mm/hugetlb-madvise.c @@ -47,14 +47,11 @@ void write_fault_pages(void *addr, unsigned long nr_pages) void read_fault_pages(void *addr, unsigned long nr_pages) { - volatile unsigned long dummy = 0; unsigned long i; for (i = 0; i < nr_pages; i++) { - dummy += *((unsigned long *)(addr + (i * huge_page_size))); - /* Prevent the compiler from optimizing out the entire loop: */ - asm volatile("" : "+r" (dummy)); + FORCE_READ(((unsigned long *)(addr + (i * huge_page_size)))); } } diff --git a/tools/testing/selftests/mm/migration.c b/tools/testing/selftests/mm/migration.c index a306f8bab087..c5a73617796a 100644 --- a/tools/testing/selftests/mm/migration.c +++ b/tools/testing/selftests/mm/migration.c @@ -16,6 +16,7 @@ #include #include #include +#include "vm_util.h" #define TWOMEG (2<<20) #define RUNTIME (20) @@ -103,15 +104,13 @@ int migrate(uint64_t *ptr, int n1, int n2) void *access_mem(void *ptr) { - volatile uint64_t y = 0; - volatile uint64_t *x = ptr; - while (1) { pthread_testcancel(); - y += *x; - - /* Prevent the compiler from optimizing out the writes to y: */ - asm volatile("" : "+r" (y)); + /* Force a read from the memory pointed to by ptr. This ensures + * the memory access actually happens and prevents the compiler + * from optimizing away this entire loop. + */ + FORCE_READ((uint64_t *)ptr); } return NULL; diff --git a/tools/testing/selftests/mm/pagemap_ioctl.c b/tools/testing/selftests/mm/pagemap_ioctl.c index c2dcda78ad31..0d4209eef0c3 100644 --- a/tools/testing/selftests/mm/pagemap_ioctl.c +++ b/tools/testing/selftests/mm/pagemap_ioctl.c @@ -1525,9 +1525,7 @@ void zeropfn_tests(void) ret = madvise(mem, hpage_size, MADV_HUGEPAGE); if (!ret) { - char tmp = *mem; - - asm volatile("" : "+r" (tmp)); + FORCE_READ(mem); ret = pagemap_ioctl(mem, hpage_size, &vec, 1, 0, 0, PAGE_IS_PFNZERO, 0, 0, PAGE_IS_PFNZERO); diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c index aa7400ed0e99..6640748ff221 100644 --- a/tools/testing/selftests/mm/split_huge_page_test.c +++ b/tools/testing/selftests/mm/split_huge_page_test.c @@ -398,7 +398,6 @@ int create_pagecache_thp_and_fd(const char *testfile, size_t fd_size, int *fd, char **addr) { size_t i; - int dummy = 0; unsigned char buf[1024]; srand(time(NULL)); @@ -440,8 +439,7 @@ int create_pagecache_thp_and_fd(const char *testfile, size_t fd_size, int *fd, madvise(*addr, fd_size, MADV_HUGEPAGE); for (size_t i = 0; i < fd_size; i++) - dummy += *(*addr + i); - asm volatile("" : "+r" (dummy)); + FORCE_READ((*addr + i)); if (!check_huge_file(*addr, fd_size / pmd_pagesize, pmd_pagesize)) { ksft_print_msg("No large pagecache folio generated, please provide a filesystem supporting large folio\n"); diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h index 2b154c287591..c20298ae98ea 100644 --- a/tools/testing/selftests/mm/vm_util.h +++ b/tools/testing/selftests/mm/vm_util.h @@ -18,6 +18,13 @@ #define PM_SWAP BIT_ULL(62) #define PM_PRESENT BIT_ULL(63) +/* + * Ignore the checkpatch warning, we must read from x but don't want to do + * anything with it in order to trigger a read page fault. We therefore must use + * volatile to stop the compiler from optimising this away. 
+ */ +#define FORCE_READ(x) (*(volatile typeof(x) *)x) + extern unsigned int __page_size; extern unsigned int __page_shift; From 6f1cc9fb476941f9cf8ac7ca0d8343b425b1446d Mon Sep 17 00:00:00 2001 From: wang lian Date: Thu, 17 Jul 2025 21:18:57 +0800 Subject: [PATCH 329/370] selftests/mm: guard-regions: Use SKIP() instead of ksft_exit_skip() To ensure only the current test is skipped on permission failure, instead of terminating the entire test binary. Link: https://lkml.kernel.org/r/20250717131857.59909-3-lianux.mm@gmail.com Signed-off-by: wang lian Reviewed-by: Lorenzo Stoakes Acked-by: David Hildenbrand Reviewed-by: Mark Brown Reviewed-by: Zi Yan Reviewed-by: Wei Yang Cc: Christian Brauner Cc: Jann Horn Cc: Kairui Song Cc: Liam Howlett Cc: SeongJae Park Cc: Shuah Khan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/guard-regions.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c index 4b76e72e7053..b0d42eb04e3a 100644 --- a/tools/testing/selftests/mm/guard-regions.c +++ b/tools/testing/selftests/mm/guard-regions.c @@ -575,7 +575,7 @@ TEST_F(guard_regions, process_madvise) /* OK we don't have permission to do this, skip. */ if (count == -1 && errno == EPERM) - ksft_exit_skip("No process_madvise() permissions, try running as root.\n"); + SKIP(return, "No process_madvise() permissions, try running as root.\n"); /* Returns the number of bytes advised. */ ASSERT_EQ(count, 6 * page_size); From 10aed7dac451850a8e18128ee90222ab67d8a7e5 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Mon, 21 Jul 2025 18:33:25 +0100 Subject: [PATCH 330/370] tools/testing/selftests: add mremap() shrink test for multiple VMAs Patch series "tools/testing: expand mremap testing". Expand our mremap() testing to further assert that behaviour is as expected. There is a poorly documented mremap() feature whereby it is possible to mremap() multiple VMAs (even with gaps) when shrinking, as long as the resultant shrunk range spans only a single VMA. So we start by asserting this behaviour functions correctly both with an in-place shrink and a shrink/move. Next, we further test the newly introduced ability to mremap() multiple VMAs when performing a MAP_FIXED move (that is without the size being changed), firstly by asserting that MREMAP_DONTUNMAP has no bearing on this behaviour. Finally, we explicitly test that such moves, when splitting source VMAs, function correctly. This patch (of 3): There is an apparently little-known feature of mremap() whereby, in stark contrast to other modes (other than the recently introduced capacity to move multiple VMAs), the input source range may span multiple VMAs with gaps between them. That is, when shrinking - whether moving the range or not - if the shrink would reduce the range to a single VMA, the operation is permitted, as the shrink is actioned by an unmap. This patch adds tests to assert that this behaves as expected. 
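To make the semantics under test concrete, here is a minimal standalone userspace sketch of the behaviour (a hypothetical illustration, not part of the patch; error handling trimmed):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
            size_t size = 10 * page_size, i;
            char *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            /* Punch holes so the range comprises five VMAs with gaps. */
            for (i = 1; i < 10; i += 2)
                    munmap(ptr + i * page_size, page_size);

            /*
             * Shrink the whole gappy range to a single page. The source
             * spans several VMAs, but because the result is one VMA and
             * the shrink is actioned by an unmap, this succeeds.
             */
            if (mremap(ptr, size, page_size, 0) == MAP_FAILED)
                    perror("mremap");
            else
                    printf("multi-VMA shrink succeeded\n");
            return 0;
    }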
Link: https://lkml.kernel.org/r/cover.1753119043.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/f08122893a26092a2bec6e69443e87f468ffdbed.1753119043.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Cc: Jann Horn Cc: Liam Howlett Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/mremap_test.c | 83 +++++++++++++++++++++++- 1 file changed, 82 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/mm/mremap_test.c b/tools/testing/selftests/mm/mremap_test.c index 0a49be11e614..141a9032414e 100644 --- a/tools/testing/selftests/mm/mremap_test.c +++ b/tools/testing/selftests/mm/mremap_test.c @@ -523,6 +523,85 @@ static void mremap_move_multiple_vmas(unsigned int pattern_seed, ksft_test_result_fail("%s\n", test_name); } +static void mremap_shrink_multiple_vmas(unsigned long page_size, + bool inplace) +{ + char *test_name = "mremap shrink multiple vmas"; + const size_t size = 10 * page_size; + bool success = true; + char *ptr, *tgt_ptr; + void *res; + int i; + + ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANON, -1, 0); + if (ptr == MAP_FAILED) { + perror("mmap"); + success = false; + goto out; + } + + tgt_ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANON, -1, 0); + if (tgt_ptr == MAP_FAILED) { + perror("mmap"); + success = false; + goto out; + } + if (munmap(tgt_ptr, size)) { + perror("munmap"); + success = false; + goto out_unmap; + } + + /* + * Unmap so we end up with: + * + * 0 2 4 6 8 10 offset in buffer + * |*| |*| |*| |*| |*| |*| + * |*| |*| |*| |*| |*| |*| + */ + for (i = 1; i < 10; i += 2) { + if (munmap(&ptr[i * page_size], page_size)) { + perror("munmap"); + success = false; + goto out_unmap; + } + } + + /* + * Shrink in-place across multiple VMAs and gaps so we end up with: + * + * 0 + * |*| + * |*| + */ + if (inplace) + res = mremap(ptr, size, page_size, 0); + else + res = mremap(ptr, size, page_size, MREMAP_MAYMOVE | MREMAP_FIXED, + tgt_ptr); + + if (res == MAP_FAILED) { + perror("mremap"); + success = false; + goto out_unmap; + } + +out_unmap: + if (munmap(tgt_ptr, size)) + perror("munmap tgt"); + if (munmap(ptr, size)) + perror("munmap src"); +out: + if (success) + ksft_test_result_pass("%s%s\n", test_name, + inplace ? " [inplace]" : ""); + else + ksft_test_result_fail("%s%s\n", test_name, + inplace ? " [inplace]" : ""); +} + /* Returns the time taken for the remap on success else returns -1. 
*/ static long long remap_region(struct config c, unsigned int threshold_mb, char *rand_addr) @@ -864,7 +943,7 @@ int main(int argc, char **argv) char *rand_addr; size_t rand_size; int num_expand_tests = 2; - int num_misc_tests = 3; + int num_misc_tests = 5; struct test test_cases[MAX_TEST] = {}; struct test perf_test_cases[MAX_PERF_TEST]; int page_size; @@ -992,6 +1071,8 @@ int main(int argc, char **argv) mremap_move_within_range(pattern_seed, rand_addr); mremap_move_1mb_from_start(pattern_seed, rand_addr); mremap_move_multiple_vmas(pattern_seed, page_size); + mremap_shrink_multiple_vmas(page_size, /* inplace= */true); + mremap_shrink_multiple_vmas(page_size, /* inplace= */false); if (run_perf_tests) { ksft_print_msg("\n%s\n", From 7062387ed690f76fd486963663f2c7c8d5762764 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Mon, 21 Jul 2025 18:33:26 +0100 Subject: [PATCH 331/370] tools/testing/selftests: test MREMAP_DONTUNMAP on multiple VMA move We support MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_DONTUNMAP for moving multiple VMAs via mremap(), so assert that the tests pass with both MREMAP_DONTUNMAP set and not set. Additionally, add success = false settings when mremap() fails. This is something that cannot realistically happen, so in no way impacted test outcome, but it is incorrect to indicate a test pass when something has failed. Link: https://lkml.kernel.org/r/d7359941981e4e44c774753b3e364d1c54928e6a.1753119043.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Cc: Jann Horn Cc: Liam Howlett Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/mremap_test.c | 29 ++++++++++++++++-------- 1 file changed, 19 insertions(+), 10 deletions(-) diff --git a/tools/testing/selftests/mm/mremap_test.c b/tools/testing/selftests/mm/mremap_test.c index 141a9032414e..d3b4772cd8c9 100644 --- a/tools/testing/selftests/mm/mremap_test.c +++ b/tools/testing/selftests/mm/mremap_test.c @@ -406,14 +406,19 @@ static bool is_multiple_vma_range_ok(unsigned int pattern_seed, } static void mremap_move_multiple_vmas(unsigned int pattern_seed, - unsigned long page_size) + unsigned long page_size, + bool dont_unmap) { + int mremap_flags = MREMAP_FIXED | MREMAP_MAYMOVE; char *test_name = "mremap move multiple vmas"; const size_t size = 11 * page_size; bool success = true; char *ptr, *tgt_ptr; int i; + if (dont_unmap) + mremap_flags |= MREMAP_DONTUNMAP; + ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0); if (ptr == MAP_FAILED) { @@ -467,8 +472,7 @@ static void mremap_move_multiple_vmas(unsigned int pattern_seed, } /* First, just move the whole thing. */ - if (mremap(ptr, size, size, - MREMAP_MAYMOVE | MREMAP_FIXED, tgt_ptr) == MAP_FAILED) { + if (mremap(ptr, size, size, mremap_flags, tgt_ptr) == MAP_FAILED) { perror("mremap"); success = false; goto out_unmap; @@ -480,9 +484,10 @@ static void mremap_move_multiple_vmas(unsigned int pattern_seed, } /* Move next to itself. */ - if (mremap(tgt_ptr, size, size, - MREMAP_MAYMOVE | MREMAP_FIXED, &tgt_ptr[size]) == MAP_FAILED) { + if (mremap(tgt_ptr, size, size, mremap_flags, + &tgt_ptr[size]) == MAP_FAILED) { perror("mremap"); + success = false; goto out_unmap; } /* Check that the move is ok. */ @@ -500,8 +505,9 @@ static void mremap_move_multiple_vmas(unsigned int pattern_seed, } /* Move and overwrite. 
*/ if (mremap(&tgt_ptr[size], size, size, - MREMAP_MAYMOVE | MREMAP_FIXED, tgt_ptr) == MAP_FAILED) { + mremap_flags, tgt_ptr) == MAP_FAILED) { perror("mremap"); + success = false; goto out_unmap; } /* Check that the move is ok. */ @@ -518,9 +524,11 @@ static void mremap_move_multiple_vmas(unsigned int pattern_seed, out: if (success) - ksft_test_result_pass("%s\n", test_name); + ksft_test_result_pass("%s%s\n", test_name, + dont_unmap ? " [dontunnmap]" : ""); else - ksft_test_result_fail("%s\n", test_name); + ksft_test_result_fail("%s%s\n", test_name, + dont_unmap ? " [dontunnmap]" : ""); } static void mremap_shrink_multiple_vmas(unsigned long page_size, @@ -943,7 +951,7 @@ int main(int argc, char **argv) char *rand_addr; size_t rand_size; int num_expand_tests = 2; - int num_misc_tests = 5; + int num_misc_tests = 6; struct test test_cases[MAX_TEST] = {}; struct test perf_test_cases[MAX_PERF_TEST]; int page_size; @@ -1070,9 +1078,10 @@ int main(int argc, char **argv) mremap_move_within_range(pattern_seed, rand_addr); mremap_move_1mb_from_start(pattern_seed, rand_addr); - mremap_move_multiple_vmas(pattern_seed, page_size); mremap_shrink_multiple_vmas(page_size, /* inplace= */true); mremap_shrink_multiple_vmas(page_size, /* inplace= */false); + mremap_move_multiple_vmas(pattern_seed, page_size, /* dontunmap= */ false); + mremap_move_multiple_vmas(pattern_seed, page_size, /* dontunmap= */ true); if (run_perf_tests) { ksft_print_msg("\n%s\n", From 7d6597dfef1183b2c198151c2740c4d9c73d3dfc Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Mon, 21 Jul 2025 18:33:27 +0100 Subject: [PATCH 332/370] tools/testing/selftests: explicitly test split multi VMA mremap move Check that moving a range of VMAs where we are offset into the first and last VMAs works correctly. This results in the VMAs being split at these points at which we are offset into VMAs. We explicitly test both the ordinary MREMAP_FIXED multi VMA move case and the MREMAP_DONTUNMAP multi VMA move case. Link: https://lkml.kernel.org/r/b04920bb6c09dc86c207c251eab8ec670fbbcaef.1753119043.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Cc: Jann Horn Cc: Liam Howlett Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/mremap_test.c | 127 ++++++++++++++++++++++- 1 file changed, 126 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/mm/mremap_test.c b/tools/testing/selftests/mm/mremap_test.c index d3b4772cd8c9..fccf9e797a0c 100644 --- a/tools/testing/selftests/mm/mremap_test.c +++ b/tools/testing/selftests/mm/mremap_test.c @@ -610,6 +610,129 @@ static void mremap_shrink_multiple_vmas(unsigned long page_size, inplace ? 
" [inplace]" : ""); } +static void mremap_move_multiple_vmas_split(unsigned int pattern_seed, + unsigned long page_size, + bool dont_unmap) +{ + char *test_name = "mremap move multiple vmas split"; + int mremap_flags = MREMAP_FIXED | MREMAP_MAYMOVE; + const size_t size = 10 * page_size; + bool success = true; + char *ptr, *tgt_ptr; + int i; + + if (dont_unmap) + mremap_flags |= MREMAP_DONTUNMAP; + + ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANON, -1, 0); + if (ptr == MAP_FAILED) { + perror("mmap"); + success = false; + goto out; + } + + tgt_ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANON, -1, 0); + if (tgt_ptr == MAP_FAILED) { + perror("mmap"); + success = false; + goto out; + } + if (munmap(tgt_ptr, size)) { + perror("munmap"); + success = false; + goto out_unmap; + } + + /* + * Unmap so we end up with: + * + * 0 1 2 3 4 5 6 7 8 9 10 offset in buffer + * |**********| |*******| + * |**********| |*******| + * 0 1 2 3 4 5 6 7 8 9 pattern offset + */ + if (munmap(&ptr[5 * page_size], page_size)) { + perror("munmap"); + success = false; + goto out_unmap; + } + + /* Set up random patterns. */ + srand(pattern_seed); + for (i = 0; i < 10; i++) { + int j; + char *buf = &ptr[i * page_size]; + + if (i == 5) + continue; + + for (j = 0; j < page_size; j++) + buf[j] = rand(); + } + + /* + * Move the below: + * + * <-------------> + * 0 1 2 3 4 5 6 7 8 9 10 offset in buffer + * |**********| |*******| + * |**********| |*******| + * 0 1 2 3 4 5 6 7 8 9 pattern offset + * + * Into: + * + * 0 1 2 3 4 5 6 7 offset in buffer + * |*****| |*****| + * |*****| |*****| + * 2 3 4 5 6 7 pattern offset + */ + if (mremap(&ptr[2 * page_size], size - 3 * page_size, size - 3 * page_size, + mremap_flags, tgt_ptr) == MAP_FAILED) { + perror("mremap"); + success = false; + goto out_unmap; + } + + /* Offset into random pattern. */ + srand(pattern_seed); + for (i = 0; i < 2 * page_size; i++) + rand(); + + /* Check pattern. */ + for (i = 0; i < 7; i++) { + int j; + char *buf = &tgt_ptr[i * page_size]; + + if (i == 3) + continue; + + for (j = 0; j < page_size; j++) { + char chr = rand(); + + if (chr != buf[j]) { + ksft_print_msg("page %d offset %d corrupted, expected %d got %d\n", + i, j, chr, buf[j]); + goto out_unmap; + } + } + } + +out_unmap: + if (munmap(tgt_ptr, size)) + perror("munmap tgt"); + if (munmap(ptr, size)) + perror("munmap src"); +out: + if (success) + ksft_test_result_pass("%s%s\n", test_name, + dont_unmap ? " [dontunnmap]" : ""); + else + ksft_test_result_fail("%s%s\n", test_name, + dont_unmap ? " [dontunnmap]" : ""); +} + /* Returns the time taken for the remap on success else returns -1. 
*/ static long long remap_region(struct config c, unsigned int threshold_mb, char *rand_addr) @@ -951,7 +1074,7 @@ int main(int argc, char **argv) char *rand_addr; size_t rand_size; int num_expand_tests = 2; - int num_misc_tests = 6; + int num_misc_tests = 8; struct test test_cases[MAX_TEST] = {}; struct test perf_test_cases[MAX_PERF_TEST]; int page_size; @@ -1082,6 +1205,8 @@ int main(int argc, char **argv) mremap_shrink_multiple_vmas(page_size, /* inplace= */false); mremap_move_multiple_vmas(pattern_seed, page_size, /* dontunmap= */ false); mremap_move_multiple_vmas(pattern_seed, page_size, /* dontunmap= */ true); + mremap_move_multiple_vmas_split(pattern_seed, page_size, /* dontunmap= */ false); + mremap_move_multiple_vmas_split(pattern_seed, page_size, /* dontunmap= */ true); if (run_perf_tests) { ksft_print_msg("\n%s\n", From a27848a03504f140ea14b6b0f711c1a7c722f698 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Mon, 21 Jul 2025 16:55:30 +0100 Subject: [PATCH 333/370] docs: update THP documentation to clarify sysfs "never" setting Rather confusingly, setting all Transparent Huge Page sysfs settings to "never" does not in fact result in THP being globally disabled. Rather, it results in khugepaged being disabled, but one can still obtain THP pages using madvise(..., MADV_COLLAPSE). This is something that has remained poorly documented for some time, and it is likely the received wisdom of most users of THP that never does, in fact, mean never. It is therefore important to highlight, very clearly, that this is not the case. [lorenzo.stoakes@oracle.com: update transhuge page to mention MADV_COLLAPSE] Link: https://lkml.kernel.org/r/d54d1dfb-f06d-4979-983b-73998f05867e@lucifer.local Link: https://lkml.kernel.org/r/20250721155530.75944-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: SeongJae Park Reviewed-by: Zi Yan Reviewed-by: Baolin Wang Reviewed-by: Barry Song Acked-by: David Hildenbrand Cc: Dev Jain Cc: Jonathan Corbet Cc: Liam Howlett Cc: Mariano Pache Cc: Ryan Roberts Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/transhuge.rst | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index dff8d5985f0f..370fba113460 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -107,7 +107,7 @@ sysfs Global THP controls ------------------- -Transparent Hugepage Support for anonymous memory can be entirely disabled +Transparent Hugepage Support for anonymous memory can be disabled (mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to avoid the risk of consuming more memory resources) or enabled system wide. This can be achieved per-supported-THP-size with one of:: @@ -119,6 +119,11 @@ system wide. This can be achieved per-supported-THP-size with one of:: where is the hugepage size being addressed, the available sizes for which vary by system. +.. note:: Setting "never" in all sysfs THP controls does **not** disable + Transparent Huge Pages globally. This is because ``madvise(..., + MADV_COLLAPSE)`` ignores these settings and collapses ranges to + PMD-sized huge pages unconditionally. + For example:: echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled @@ -187,7 +192,9 @@ madvise behaviour. never - should be self-explanatory. + should be self-explanatory. 
Note that ``madvise(..., + MADV_COLLAPSE)`` can still cause transparent huge pages to be + obtained even if this mode is specified everywhere. By default kernel tries to use huge, PMD-mappable zero page on read page fault to anonymous mapping. It's possible to disable huge zero @@ -378,7 +385,9 @@ always Attempt to allocate huge pages every time we need a new page; never - Do not allocate huge pages; + Do not allocate huge pages. Note that ``madvise(..., MADV_COLLAPSE)`` + can still cause transparent huge pages to be obtained even if this mode + is specified everywhere; within_size Only allocate huge page if it will be fully within i_size. @@ -434,7 +443,9 @@ inherit have enabled="inherit" and all other hugepage sizes have enabled="never"; never - Do not allocate huge pages; + Do not allocate huge pages. Note that ``madvise(..., + MADV_COLLAPSE)`` can still cause transparent huge pages to be obtained + even if this mode is specified everywhere; within_size Only allocate huge page if it will be fully within i_size. From 7e6c3130690a01076efdf45aa02ba5d5c16849a0 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 11:58:22 -0700 Subject: [PATCH 334/370] mm/damon/ops-common: ignore migration request to invalid nodes damon_migrate_pages() tries migration even if the target node is invalid. If users mistakenly make such invalid requests via DAMOS_MIGRATE_{HOT,COLD} action, the below kernel BUG can happen. [ 7831.883495] BUG: unable to handle page fault for address: 0000000000001f48 [ 7831.884160] #PF: supervisor read access in kernel mode [ 7831.884681] #PF: error_code(0x0000) - not-present page [ 7831.885203] PGD 0 P4D 0 [ 7831.885468] Oops: Oops: 0000 [#1] SMP PTI [ 7831.885852] CPU: 31 UID: 0 PID: 94202 Comm: kdamond.0 Not tainted 6.16.0-rc5-mm-new-damon+ #93 PREEMPT(voluntary) [ 7831.886913] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-4.el9 04/01/2014 [ 7831.887777] RIP: 0010:__alloc_frozen_pages_noprof (include/linux/mmzone.h:1724 include/linux/mmzone.h:1750 mm/page_alloc.c:4936 mm/page_alloc.c:5137) [...] [ 7831.895953] Call Trace: [ 7831.896195] [ 7831.896397] __folio_alloc_noprof (mm/page_alloc.c:5183 mm/page_alloc.c:5192) [ 7831.896787] migrate_pages_batch (mm/migrate.c:1189 mm/migrate.c:1851) [ 7831.897228] ? __pfx_alloc_migration_target (mm/migrate.c:2137) [ 7831.897735] migrate_pages (mm/migrate.c:2078) [ 7831.898141] ? __pfx_alloc_migration_target (mm/migrate.c:2137) [ 7831.898664] damon_migrate_folio_list (mm/damon/ops-common.c:321 mm/damon/ops-common.c:354) [ 7831.899140] damon_migrate_pages (mm/damon/ops-common.c:405) [...] Add a target node validity check in damon_migrate_pages(). The validity check is stolen from that of do_pages_move(), which is being used for the move_pages() system call. 
Link: https://lkml.kernel.org/r/20250720185822.1451-1-sj@kernel.org Fixes: b51820ebea65 ("mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion") [6.11.x] Signed-off-by: SeongJae Park Reviewed-by: Joshua Hahn Cc: Honggyu Kim Cc: Hyeongtak Ji Cc: Signed-off-by: Andrew Morton --- mm/damon/ops-common.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c index 6a9797d1d7ff..99321ff5cb92 100644 --- a/mm/damon/ops-common.c +++ b/mm/damon/ops-common.c @@ -383,6 +383,10 @@ unsigned long damon_migrate_pages(struct list_head *folio_list, int target_nid) if (list_empty(folio_list)) return nr_migrated; + if (target_nid < 0 || target_nid >= MAX_NUMNODES || + !node_state(target_nid, N_MEMORY)) + return nr_migrated; + noreclaim_flag = memalloc_noreclaim_save(); nid = folio_nid(lru_to_folio(folio_list)); From 45cd52c44e85453c9147d41b715cd03c53325cf4 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 21 Jul 2025 21:46:18 +0100 Subject: [PATCH 335/370] mm: remove grab_cache_page() All callers have been converted to use filemap_grab_folio(). Link: https://lkml.kernel.org/r/20250721204619.163883-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- include/linux/pagemap.h | 12 ++---------- 1 file changed, 2 insertions(+), 10 deletions(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 10a222e68b85..a5ef64a15f96 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -878,7 +878,8 @@ static inline struct page *find_or_create_page(struct address_space *mapping, * @mapping: target address_space * @index: the page index * - * Same as grab_cache_page(), but do not wait if the page is unavailable. + * Returns locked page at given index in given cache, creating it if + * needed, but do not wait if the page is locked or to reclaim memory. * This is intended for speculative data generators, where the data can * be regenerated if the page couldn't be grabbed. This routine should * be safe to call while holding the lock for another page. @@ -942,15 +943,6 @@ unsigned filemap_get_folios_contig(struct address_space *mapping, unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t *start, pgoff_t end, xa_mark_t tag, struct folio_batch *fbatch); -/* - * Returns locked page at given index in given cache, creating it if needed. - */ -static inline struct page *grab_cache_page(struct address_space *mapping, - pgoff_t index) -{ - return find_or_create_page(mapping, index, mapping_gfp_mask(mapping)); -} - struct folio *read_cache_folio(struct address_space *, pgoff_t index, filler_t *filler, struct file *file); struct folio *mapping_read_folio_gfp(struct address_space *, pgoff_t index, From cf20cb9ad1eb3bf59ede1f93a0640f06e6b9d028 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 21 Jul 2025 23:03:30 -0700 Subject: [PATCH 336/370] selftests/damon/sysfs.py: stop DAMON for dumping failures Commit 4ece01897627 ("selftests/damon: add python and drgn-based DAMON sysfs test") in mm-stable tree introduced sysfs.py that runs drgn for dumping DAMON status. When the DAMON status dumping fails for reasons including drgn not being installed, the test fails without stopping DAMON. DAMON selftests that run afterwards and assume DAMON is not running therefore fail. Catch dumping failures and stop DAMON for that case. 
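An equivalent and arguably more future-proof shape would be a try/finally around the whole checking sequence, so that any early exit stops DAMON (a hypothetical alternative, not what this patch does):

    # kdamonds has been started beforehand, as in sysfs.py's main()
    try:
        status, err = dump_damon_status_dict(kdamonds.kdamonds[0].pid)
        if err is not None:
            print(err)
            exit(1)
        # ... remaining checks ...
    finally:
        kdamonds.stop()  # also runs on the SystemExit raised by exit(1)

The patch below takes the smaller, targeted fix instead.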
Link: https://lkml.kernel.org/r/20250722060330.56068-1-sj@kernel.org Fixes: 4ece01897627 ("selftests/damon: add python and drgn-based DAMON sysfs test") Signed-off-by: SeongJae Park Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202507220707.9c5d6247-lkp@intel.com Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/sysfs.py | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/tools/testing/selftests/damon/sysfs.py b/tools/testing/selftests/damon/sysfs.py index e67008fd055d..fdc04f12be54 100755 --- a/tools/testing/selftests/damon/sysfs.py +++ b/tools/testing/selftests/damon/sysfs.py @@ -8,6 +8,10 @@ import subprocess import _damon_sysfs def dump_damon_status_dict(pid): + try: + subprocess.check_output(['which', 'drgn'], stderr=subprocess.DEVNULL) + except: + return None, 'drgn not found' file_dir = os.path.dirname(os.path.abspath(__file__)) dump_script = os.path.join(file_dir, 'drgn_dump_damon_status.py') rc = subprocess.call(['drgn', dump_script, pid, 'damon_dump_output'], @@ -40,6 +44,7 @@ def main(): status, err = dump_damon_status_dict(kdamonds.kdamonds[0].pid) if err is not None: print(err) + kdamonds.stop() exit(1) if len(status['contexts']) != 1: From 6da5e2961f3a1f350ec974b0c56374015b1c8fc1 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:31 -0700 Subject: [PATCH 337/370] selftests/damon/_damon_sysfs: support DAMOS watermarks setup Patch series "selftests/damon/sysfs.py: test all parameters". sysfs.py tests whether the DAMON sysfs interface passes the user-requested parameters to DAMON as expected. But only the default (minimum) parameters are being tested. This is partially because _damon_sysfs.py, which is the library for making the parameter requests, does not support all of the parameters. The internal DAMON status dump script (drgn_dump_damon_status.py) also does not dump all of the parameters. Extend the test coverage by updating the parameter-input and status-dumping scripts to support all parameters, and by writing additional tests using those. This increased test coverage actually found one real bug (https://lore.kernel.org/20250719181932.72944-1-sj@kernel.org). First seven patches (1-7) extend _damon_sysfs.py for all parameters setup. The eighth patch (8) fixes _damon_sysfs.py to use correct max nr_accesses and age values for their type. Following three patches (9-11) extend drgn_dump_damon_status.py to dump full DAMON parameters. Following nine patches (12-20) refactor sysfs.py for general testing code reuse, and extend it for full parameters check. Finally, two patches (21 and 22) add test cases in sysfs.py for full parameters testing. This patch (of 22): _damon_sysfs.py contains code for test-purpose DAMON sysfs interface control. Add support of DAMOS watermarks setup for more tests. 
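As a usage sketch (a hypothetical test snippet; 'free_mem_rate' is assumed to be an accepted metric keyword of the sysfs file):

    wmarks = DamosWatermarks(metric='free_mem_rate', interval=1000000,
                             high=500, mid=400, low=300)
    scheme = Damos(action='stat', watermarks=wmarks)
    # scheme.stage() now writes metric, interval_us, high, mid and low
    # under .../schemes/<idx>/watermarks/ rather than disabling them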
Link: https://lkml.kernel.org/r/20250720171652.92309-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250720171652.92309-2-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/_damon_sysfs.py | 46 +++++++++++++++++-- 1 file changed, 42 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/damon/_damon_sysfs.py b/tools/testing/selftests/damon/_damon_sysfs.py index f587e117472e..d81aa11e3d32 100644 --- a/tools/testing/selftests/damon/_damon_sysfs.py +++ b/tools/testing/selftests/damon/_damon_sysfs.py @@ -165,6 +165,42 @@ class DamosQuota: return err return None +class DamosWatermarks: + metric = None + interval = None + high = None + mid = None + low = None + scheme = None # owner scheme + + def __init__(self, metric='none', interval=0, high=0, mid=0, low=0): + self.metric = metric + self.interval = interval + self.high = high + self.mid = mid + self.low = low + + def sysfs_dir(self): + return os.path.join(self.scheme.sysfs_dir(), 'watermarks') + + def stage(self): + err = write_file(os.path.join(self.sysfs_dir(), 'metric'), self.metric) + if err is not None: + return err + err = write_file(os.path.join(self.sysfs_dir(), 'interval_us'), + self.interval) + if err is not None: + return err + err = write_file(os.path.join(self.sysfs_dir(), 'high'), self.high) + if err is not None: + return err + err = write_file(os.path.join(self.sysfs_dir(), 'mid'), self.mid) + if err is not None: + return err + err = write_file(os.path.join(self.sysfs_dir(), 'low'), self.low) + if err is not None: + return err + class DamosStats: nr_tried = None sz_tried = None @@ -190,6 +226,7 @@ class Damos: action = None access_pattern = None quota = None + watermarks = None apply_interval_us = None # todo: Support watermarks, stats idx = None @@ -199,12 +236,15 @@ class Damos: tried_regions = None def __init__(self, action='stat', access_pattern=DamosAccessPattern(), - quota=DamosQuota(), apply_interval_us=0): + quota=DamosQuota(), watermarks=DamosWatermarks(), + apply_interval_us=0): self.action = action self.access_pattern = access_pattern self.access_pattern.scheme = self self.quota = quota self.quota.scheme = self + self.watermarks = watermarks + self.watermarks.scheme = self self.apply_interval_us = apply_interval_us def sysfs_dir(self): @@ -227,9 +267,7 @@ class Damos: if err is not None: return err - # disable watermarks - err = write_file( - os.path.join(self.sysfs_dir(), 'watermarks', 'metric'), 'none') + err = self.watermarks.stage() if err is not None: return err From 9d93c103edf0cdf7f1f251b943799a7fee56a580 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:32 -0700 Subject: [PATCH 338/370] selftests/damon/_damon_sysfs: support DAMOS filters setup _damon_sysfs.py contains code for test-purpose DAMON sysfs interface control. Add support of DAMOS filters setup for more tests. 
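A usage sketch, using only the constructor arguments added here (hypothetical test snippet):

    # a filter rejecting anonymous pages, staged in the scheme's
    # ops_filters directory
    anon_reject = DamosFilter(type_='anon', matching=True, allow=False)
    scheme = Damos(action='stat', ops_filters=[anon_reject])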
Link: https://lkml.kernel.org/r/20250720171652.92309-3-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/_damon_sysfs.py | 115 +++++++++++++++++- 1 file changed, 111 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/damon/_damon_sysfs.py b/tools/testing/selftests/damon/_damon_sysfs.py index d81aa11e3d32..f853af6ad926 100644 --- a/tools/testing/selftests/damon/_damon_sysfs.py +++ b/tools/testing/selftests/damon/_damon_sysfs.py @@ -201,6 +201,96 @@ class DamosWatermarks: if err is not None: return err +class DamosFilter: + type_ = None + matching = None + allow = None + memcg_path = None + addr_start = None + addr_end = None + target_idx = None + min_ = None + max_ = None + idx = None + filters = None # owner filters + + def __init__(self, type_='anon', matching=False, allow=False, + memcg_path='', addr_start=0, addr_end=0, target_idx=0, min_=0, + max_=0): + self.type_ = type_ + self.matching = matching + self.allow = allow + self.memcg_path = memcg_path, + self.addr_start = addr_start + self.addr_end = addr_end + self.target_idx = target_idx + self.min_ = min_ + self.max_ = max_ + + def sysfs_dir(self): + return os.path.join(self.filters.sysfs_dir(), '%d' % self.idx) + + def stage(self): + err = write_file(os.path.join(self.sysfs_dir(), 'type'), self.type_) + if err is not None: + return err + err = write_file(os.path.join(self.sysfs_dir(), 'matching'), + self.matching) + if err is not None: + return err + err = write_file(os.path.join(self.sysfs_dir(), 'allow'), self.allow) + if err is not None: + return err + err = write_file(os.path.join(self.sysfs_dir(), 'memcg_path'), + self.memcg_path) + if err is not None: + return err + err = write_file(os.path.join(self.sysfs_dir(), 'addr_start'), + self.addr_start) + if err is not None: + return err + err = write_file(os.path.join(self.sysfs_dir(), 'addr_end'), + self.addr_end) + if err is not None: + return err + err = write_file(os.path.join(self.sysfs_dir(), 'damon_target_idx'), + self.target_idx) + if err is not None: + return err + err = write_file(os.path.join(self.sysfs_dir(), 'min'), self.min_) + if err is not None: + return err + err = write_file(os.path.join(self.sysfs_dir(), 'max'), self.max_) + if err is not None: + return err + return None + +class DamosFilters: + name = None + filters = None + scheme = None # owner scheme + + def __init__(self, name, filters=[]): + self.name = name + self.filters = filters + for idx, filter_ in enumerate(self.filters): + filter_.idx = idx + filter_.filters = self + + def sysfs_dir(self): + return os.path.join(self.scheme.sysfs_dir(), self.name) + + def stage(self): + err = write_file(os.path.join(self.sysfs_dir(), 'nr_filters'), + len(self.filters)) + if err is not None: + return err + for filter_ in self.filters: + err = filter_.stage() + if err is not None: + return err + return None + class DamosStats: nr_tried = None sz_tried = None @@ -227,8 +317,10 @@ class Damos: access_pattern = None quota = None watermarks = None + core_filters = None + ops_filters = None + filters = None apply_interval_us = None - # todo: Support watermarks, stats idx = None context = None tried_bytes = None @@ -237,6 +329,7 @@ class Damos: def __init__(self, action='stat', access_pattern=DamosAccessPattern(), quota=DamosQuota(), watermarks=DamosWatermarks(), + core_filters=[], ops_filters=[], filters=[], apply_interval_us=0): self.action = action self.access_pattern = access_pattern @@ -245,6 +338,16 @@ class Damos: self.quota.scheme = 
self self.watermarks = watermarks self.watermarks.scheme = self + + self.core_filters = DamosFilters(name='core_filters', + filters=core_filters) + self.core_filters.scheme = self + self.ops_filters = DamosFilters(name='ops_filters', + filters=ops_filters) + self.ops_filters.scheme = self + self.filters = DamosFilters(name='filters', filters=filters) + self.filters.scheme = self + self.apply_interval_us = apply_interval_us def sysfs_dir(self): @@ -271,9 +374,13 @@ class Damos: if err is not None: return err - # disable filters - err = write_file( - os.path.join(self.sysfs_dir(), 'filters', 'nr_filters'), '0') + err = self.core_filters.stage() + if err is not None: + return err + err = self.ops_filters.stage() + if err is not None: + return err + err = self.filters.stage() if err is not None: return err From b436dfaad2ba7bf5061017248ee595f299074610 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:33 -0700 Subject: [PATCH 339/370] selftests/damon/_damon_sysfs: support monitoring intervals goal setup _damon_sysfs.py contains code for test-purpose DAMON sysfs interface control. Add support of the monitoring intervals auto-tune goal setup for more tests. Link: https://lkml.kernel.org/r/20250720171652.92309-4-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/_damon_sysfs.py | 43 ++++++++++++++++++- 1 file changed, 42 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/damon/_damon_sysfs.py b/tools/testing/selftests/damon/_damon_sysfs.py index f853af6ad926..ec6230929d36 100644 --- a/tools/testing/selftests/damon/_damon_sysfs.py +++ b/tools/testing/selftests/damon/_damon_sysfs.py @@ -405,18 +405,56 @@ class DamonTarget: return write_file( os.path.join(self.sysfs_dir(), 'pid_target'), self.pid) +class IntervalsGoal: + access_bp = None + aggrs = None + min_sample_us = None + max_sample_us = None + attrs = None # owner DamonAttrs + + def __init__(self, access_bp=0, aggrs=0, min_sample_us=0, max_sample_us=0): + self.access_bp = access_bp + self.aggrs = aggrs + self.min_sample_us = min_sample_us + self.max_sample_us = max_sample_us + + def sysfs_dir(self): + return os.path.join(self.attrs.interval_sysfs_dir(), 'intervals_goal') + + def stage(self): + err = write_file( + os.path.join(self.sysfs_dir(), 'access_bp'), self.access_bp) + if err is not None: + return err + err = write_file(os.path.join(self.sysfs_dir(), 'aggrs'), self.aggrs) + if err is not None: + return err + err = write_file(os.path.join(self.sysfs_dir(), 'min_sample_us'), + self.min_sample_us) + if err is not None: + return err + err = write_file(os.path.join(self.sysfs_dir(), 'max_sample_us'), + self.max_sample_us) + if err is not None: + return err + return None + class DamonAttrs: sample_us = None aggr_us = None + intervals_goal = None update_us = None min_nr_regions = None max_nr_regions = None context = None - def __init__(self, sample_us=5000, aggr_us=100000, update_us=1000000, + def __init__(self, sample_us=5000, aggr_us=100000, + intervals_goal=IntervalsGoal(), update_us=1000000, min_nr_regions=10, max_nr_regions=1000): self.sample_us = sample_us self.aggr_us = aggr_us + self.intervals_goal = intervals_goal + self.intervals_goal.attrs = self self.update_us = update_us self.min_nr_regions = min_nr_regions self.max_nr_regions = max_nr_regions @@ -436,6 +474,9 @@ class DamonAttrs: return err err = write_file(os.path.join(self.interval_sysfs_dir(), 'aggr_us'), self.aggr_us) + if err is not None: + return err + err = 
self.intervals_goal.stage() if err is not None: return err err = write_file(os.path.join(self.interval_sysfs_dir(), 'update_us'), From ff5aae307bfb789c9e9c4564bc19d5393aa2fc43 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:34 -0700 Subject: [PATCH 340/370] selftests/damon/_damon_sysfs: support DAMOS quota weights setup _damon_sysfs.py contains code for test-purpose DAMON sysfs interface control. Add support of DAMOS quotas prioritization weights setup for more tests. Link: https://lkml.kernel.org/r/20250720171652.92309-5-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/_damon_sysfs.py | 24 ++++++++++++++++++- 1 file changed, 23 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/damon/_damon_sysfs.py b/tools/testing/selftests/damon/_damon_sysfs.py index ec6230929d36..12d076260b2b 100644 --- a/tools/testing/selftests/damon/_damon_sysfs.py +++ b/tools/testing/selftests/damon/_damon_sysfs.py @@ -125,12 +125,20 @@ class DamosQuota: ms = None # time quota goals = None # quota goals reset_interval_ms = None # quota reset interval + weight_sz_permil = None + weight_nr_accesses_permil = None + weight_age_permil = None scheme = None # owner scheme - def __init__(self, sz=0, ms=0, goals=None, reset_interval_ms=0): + def __init__(self, sz=0, ms=0, goals=None, reset_interval_ms=0, + weight_sz_permil=0, weight_nr_accesses_permil=0, + weight_age_permil=0): self.sz = sz self.ms = ms self.reset_interval_ms = reset_interval_ms + self.weight_sz_permil = weight_sz_permil + self.weight_nr_accesses_permil = weight_nr_accesses_permil + self.weight_age_permil = weight_age_permil self.goals = goals if goals is not None else [] for idx, goal in enumerate(self.goals): goal.idx = idx @@ -151,6 +159,20 @@ class DamosQuota: if err is not None: return err + err = write_file(os.path.join( + self.sysfs_dir(), 'weights', 'sz_permil'), self.weight_sz_permil) + if err is not None: + return err + err = write_file(os.path.join( + self.sysfs_dir(), 'weights', 'nr_accesses_permil'), + self.weight_nr_accesses_permil) + if err is not None: + return err + err = write_file(os.path.join( + self.sysfs_dir(), 'weights', 'age_permil'), self.weight_age_permil) + if err is not None: + return err + nr_goals_file = os.path.join(self.sysfs_dir(), 'goals', 'nr_goals') content, err = read_file(nr_goals_file) if err is not None: From 229b0af6640704bb709fae356108c11d6e63058d Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:35 -0700 Subject: [PATCH 341/370] selftests/damon/_damon_sysfs: support DAMOS quota goal nid setup _damon_sysfs.py contains code for test-purpose DAMON sysfs interface control. Add support of DAMOS quota goal nid setup for more tests. 
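For example (a hypothetical snippet; 'node_mem_used_bp' is assumed as the node-aware quota goal metric keyword for which nid is meaningful):

    goal = DamosQuotaGoal(metric='node_mem_used_bp', target_value=9970,
                          nid=0)
    quota = DamosQuota(reset_interval_ms=1000, goals=[goal])
    # DamosQuotaGoal.stage() now also writes goal.nid to
    # .../goals/<idx>/nid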
Link: https://lkml.kernel.org/r/20250720171652.92309-6-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/_damon_sysfs.py | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/damon/_damon_sysfs.py b/tools/testing/selftests/damon/_damon_sysfs.py index 12d076260b2b..23de9202b4e3 100644 --- a/tools/testing/selftests/damon/_damon_sysfs.py +++ b/tools/testing/selftests/damon/_damon_sysfs.py @@ -93,14 +93,16 @@ class DamosQuotaGoal: metric = None target_value = None current_value = None + nid = None effective_bytes = None quota = None # owner quota idx = None - def __init__(self, metric, target_value=10000, current_value=0): + def __init__(self, metric, target_value=10000, current_value=0, nid=0): self.metric = metric self.target_value = target_value self.current_value = current_value + self.nid = nid def sysfs_dir(self): return os.path.join(self.quota.sysfs_dir(), 'goals', '%d' % self.idx) @@ -118,6 +120,10 @@ class DamosQuotaGoal: self.current_value) if err is not None: return err + err = write_file(os.path.join(self.sysfs_dir(), 'nid'), self.nid) + if err is not None: + return err + return None class DamosQuota: From fca6ddf44df477d7ecaac4840b15da635798f1eb Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:36 -0700 Subject: [PATCH 342/370] selftests/damon/_damon_sysfs: support DAMOS action dests setup _damon_sysfs.py contains code for test-purpose DAMON sysfs interface control. Add support of DAMOS action destinations setup for more tests. Link: https://lkml.kernel.org/r/20250720171652.92309-7-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/_damon_sysfs.py | 56 ++++++++++++++++++- 1 file changed, 55 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/damon/_damon_sysfs.py b/tools/testing/selftests/damon/_damon_sysfs.py index 23de9202b4e3..2d95ab564885 100644 --- a/tools/testing/selftests/damon/_damon_sysfs.py +++ b/tools/testing/selftests/damon/_damon_sysfs.py @@ -319,6 +319,52 @@ class DamosFilters: return err return None +class DamosDest: + id = None + weight = None + idx = None + dests = None # owner dests + + def __init__(self, id=0, weight=0): + self.id = id + self.weight = weight + + def sysfs_dir(self): + return os.path.join(self.dests.sysfs_dir(), '%d' % self.idx) + + def stage(self): + err = write_file(os.path.join(self.sysfs_dir(), 'id'), self.id) + if err is not None: + return err + err = write_file(os.path.join(self.sysfs_dir(), 'weight'), self.weight) + if err is not None: + return err + return None + +class DamosDests: + dests = None + scheme = None # owner scheme + + def __init__(self, dests=[]): + self.dests = dests + for idx, dest in enumerate(self.dests): + dest.idx = idx + dest.dests = self + + def sysfs_dir(self): + return os.path.join(self.scheme.sysfs_dir(), 'dests') + + def stage(self): + err = write_file(os.path.join(self.sysfs_dir(), 'nr_dests'), + len(self.dests)) + if err is not None: + return err + for dest in self.dests: + err = dest.stage() + if err is not None: + return err + return None + class DamosStats: nr_tried = None sz_tried = None @@ -349,6 +395,7 @@ class Damos: ops_filters = None filters = None apply_interval_us = None + dests = None idx = None context = None tried_bytes = None @@ -358,7 +405,7 @@ class Damos: def __init__(self, action='stat', access_pattern=DamosAccessPattern(), quota=DamosQuota(), watermarks=DamosWatermarks(), 
core_filters=[], ops_filters=[], filters=[], - apply_interval_us=0): + dests=DamosDests(), apply_interval_us=0): self.action = action self.access_pattern = access_pattern self.access_pattern.scheme = self @@ -376,6 +423,9 @@ class Damos: self.filters = DamosFilters(name='filters', filters=filters) self.filters.scheme = self + self.dests = dests + self.dests.scheme = self + self.apply_interval_us = apply_interval_us def sysfs_dir(self): @@ -412,6 +462,10 @@ class Damos: if err is not None: return err + err = self.dests.stage() + if err is not None: + return err + class DamonTarget: pid = None # todo: Support target regions if test is made From 86e541f0be477d4656a02c5438c5034eae814c23 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:37 -0700 Subject: [PATCH 343/370] selftests/damon/_damon_sysfs: support DAMOS target_nid setup _damon_sysfs.py contains code for test-purpose DAMON sysfs interface control. Add support of DAMOS action destination target_nid setup for more tests. Link: https://lkml.kernel.org/r/20250720171652.92309-8-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/_damon_sysfs.py | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/damon/_damon_sysfs.py b/tools/testing/selftests/damon/_damon_sysfs.py index 2d95ab564885..70860d925503 100644 --- a/tools/testing/selftests/damon/_damon_sysfs.py +++ b/tools/testing/selftests/damon/_damon_sysfs.py @@ -395,6 +395,7 @@ class Damos: ops_filters = None filters = None apply_interval_us = None + target_nid = None dests = None idx = None context = None @@ -404,7 +405,7 @@ class Damos: def __init__(self, action='stat', access_pattern=DamosAccessPattern(), quota=DamosQuota(), watermarks=DamosWatermarks(), - core_filters=[], ops_filters=[], filters=[], + core_filters=[], ops_filters=[], filters=[], target_nid=0, dests=DamosDests(), apply_interval_us=0): self.action = action self.access_pattern = access_pattern @@ -423,6 +424,7 @@ class Damos: self.filters = DamosFilters(name='filters', filters=filters) self.filters.scheme = self + self.target_nid = target_nid self.dests = dests self.dests.scheme = self @@ -462,6 +464,11 @@ class Damos: if err is not None: return err + err = write_file(os.path.join(self.sysfs_dir(), 'target_nid'), '%d' % + self.target_nid) + if err is not None: + return err + err = self.dests.stage() if err is not None: return err From 80d4e381075f5e454482c76aeaaafbbd76994aeb Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:38 -0700 Subject: [PATCH 344/370] selftests/damon/_damon_sysfs: use 2**32 - 1 as max nr_accesses and age nr_accesses and age are unsigned int. Use the proper max value. 
Link: https://lkml.kernel.org/r/20250720171652.92309-9-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/_damon_sysfs.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/damon/_damon_sysfs.py b/tools/testing/selftests/damon/_damon_sysfs.py index 70860d925503..a0e6290833fb 100644 --- a/tools/testing/selftests/damon/_damon_sysfs.py +++ b/tools/testing/selftests/damon/_damon_sysfs.py @@ -52,9 +52,9 @@ class DamosAccessPattern: if self.size is None: self.size = [0, 2**64 - 1] if self.nr_accesses is None: - self.nr_accesses = [0, 2**64 - 1] + self.nr_accesses = [0, 2**32 - 1] if self.age is None: - self.age = [0, 2**64 - 1] + self.age = [0, 2**32 - 1] def sysfs_dir(self): return os.path.join(self.scheme.sysfs_dir(), 'access_pattern') From c1a69589571270f70f1995d88e41b2e262ad05ab Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:39 -0700 Subject: [PATCH 345/370] selftests/damon/drgn_dump_damon_status: dump damos->migrate_dests drgn_dump_damon_status.py is a script for dumping DAMON internal status in json format. It is being used for seeing if DAMON parameters that are set using _damon_sysfs.py are actually passed to DAMON in the kernel space. It is, however, not dumping full DAMON internal status, and it makes increasing test coverage difficult. Add damos->migrate_dests dumping for more tests. Link: https://lkml.kernel.org/r/20250720171652.92309-10-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- .../selftests/damon/drgn_dump_damon_status.py | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/tools/testing/selftests/damon/drgn_dump_damon_status.py b/tools/testing/selftests/damon/drgn_dump_damon_status.py index 333a0d0c4bff..8db081f965f5 100755 --- a/tools/testing/selftests/damon/drgn_dump_damon_status.py +++ b/tools/testing/selftests/damon/drgn_dump_damon_status.py @@ -117,6 +117,19 @@ def damos_watermarks_to_dict(watermarks): ['high', int], ['mid', int], ['low', int], ]) +def damos_migrate_dests_to_dict(dests): + nr_dests = int(dests.nr_dests) + node_id_arr = [] + weight_arr = [] + for i in range(nr_dests): + node_id_arr.append(int(dests.node_id_arr[i])) + weight_arr.append(int(dests.weight_arr[i])) + return { + 'node_id_arr': node_id_arr, + 'weight_arr': weight_arr, + 'nr_dests': nr_dests, + } + def scheme_to_dict(scheme): return to_dict(scheme, [ ['pattern', damos_access_pattern_to_dict], @@ -125,6 +138,7 @@ def scheme_to_dict(scheme): ['quota', damos_quota_to_dict], ['wmarks', damos_watermarks_to_dict], ['target_nid', int], + ['migrate_dests', damos_migrate_dests_to_dict], ]) def schemes_to_list(schemes): From eb413daaf222b1d545f6b46253f0ace97195ccd4 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:40 -0700 Subject: [PATCH 346/370] selftests/damon/drgn_dump_damon_status: dump ctx->ops.id drgn_dump_damon_status.py is a script for dumping DAMON internal status in json format. It is being used for seeing if DAMON parameters that are set using _damon_sysfs.py are actually passed to DAMON in the kernel space. It is, however, not dumping full DAMON internal status, and it makes increasing test coverage difficult. Add ctx->ops.id dumping for more tests. 
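As a hedged sketch (assuming the vaddr=0, fvaddr=1, paddr=2 numbering that the sysfs.py changes later in this series also rely on), a test can translate the dumped id back into an operations-set name:

    # illustrative: map the dumped ops id to an operations-set name
    ops_name = {0: 'vaddr', 1: 'fvaddr', 2: 'paddr'}
    def dumped_ops_name(ctx_dump):
        return ops_name.get(int(ctx_dump['ops']['id']), 'unknown')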
Link: https://lkml.kernel.org/r/20250720171652.92309-11-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/drgn_dump_damon_status.py | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/tools/testing/selftests/damon/drgn_dump_damon_status.py b/tools/testing/selftests/damon/drgn_dump_damon_status.py index 8db081f965f5..cf5d492670d8 100755 --- a/tools/testing/selftests/damon/drgn_dump_damon_status.py +++ b/tools/testing/selftests/damon/drgn_dump_damon_status.py @@ -25,6 +25,11 @@ def to_dict(object, attr_name_converter): d[attr_name] = converter(getattr(object, attr_name)) return d +def ops_to_dict(ops): + return to_dict(ops, [ + ['id', int], + ]) + def intervals_goal_to_dict(goal): return to_dict(goal, [ ['access_bp', int], @@ -148,6 +153,7 @@ def schemes_to_list(schemes): def damon_ctx_to_dict(ctx): return to_dict(ctx, [ + ['ops', ops_to_dict], ['attrs', attrs_to_dict], ['adaptive_targets', targets_to_list], ['schemes', schemes_to_list], From a1d52cd0302d7aefe26edbbccd1d1ae70cca50c9 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:41 -0700 Subject: [PATCH 347/370] selftests/damon/drgn_dump_damon_status: dump DAMOS filters drgn_dump_damon_status.py is a script for dumping DAMON internal status in json format. It is being used for seeing if DAMON parameters that are set using _damon_sysfs.py are actually passed to DAMON in the kernel space. It is, however, not dumping full DAMON internal status, and it makes increasing test coverage difficult. Add damos filters dumping for more tests. Link: https://lkml.kernel.org/r/20250720171652.92309-12-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- .../selftests/damon/drgn_dump_damon_status.py | 43 ++++++++++++++++++- 1 file changed, 42 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/damon/drgn_dump_damon_status.py b/tools/testing/selftests/damon/drgn_dump_damon_status.py index cf5d492670d8..7233369a3a44 100755 --- a/tools/testing/selftests/damon/drgn_dump_damon_status.py +++ b/tools/testing/selftests/damon/drgn_dump_damon_status.py @@ -135,8 +135,37 @@ def damos_migrate_dests_to_dict(dests): 'nr_dests': nr_dests, } +def damos_filter_to_dict(damos_filter): + filter_type_keyword = { + 0: 'anon', + 1: 'active', + 2: 'memcg', + 3: 'young', + 4: 'hugepage_size', + 5: 'unmapped', + 6: 'addr', + 7: 'target' + } + dict_ = { + 'type': filter_type_keyword[int(damos_filter.type)], + 'matching': bool(damos_filter.matching), + 'allow': bool(damos_filter.allow), + } + type_ = dict_['type'] + if type_ == 'memcg': + dict_['memcg_id'] = int(damos_filter.memcg_id) + elif type_ == 'addr': + dict_['addr_range'] = [int(damos_filter.addr_range.start), + int(damos_filter.addr_range.end)] + elif type_ == 'target': + dict_['target_idx'] = int(damos_filter.target_idx) + elif type_ == 'hugeapge_size': + dict_['sz_range'] = [int(damos_filter.sz_range.min), + int(damos_filter.sz_range.max)] + return dict_ + def scheme_to_dict(scheme): - return to_dict(scheme, [ + dict_ = to_dict(scheme, [ ['pattern', damos_access_pattern_to_dict], ['action', int], ['apply_interval_us', int], @@ -145,6 +174,18 @@ def scheme_to_dict(scheme): ['target_nid', int], ['migrate_dests', damos_migrate_dests_to_dict], ]) + filters = [] + for f in list_for_each_entry( + 'struct damos_filter', scheme.filters.address_of_(), 'list'): + filters.append(damos_filter_to_dict(f)) + dict_['filters'] = filters + ops_filters = [] + for f in list_for_each_entry( + 
'struct damos_filter', scheme.ops_filters.address_of_(), 'list'): + ops_filters.append(damos_filter_to_dict(f)) + dict_['ops_filters'] = ops_filters + + return dict_ def schemes_to_list(schemes): return [scheme_to_dict(s) From b50c48de6111bc70731ba375e1b45a65e63e287f Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:42 -0700 Subject: [PATCH 348/370] selftests/damon/sysfs.py: generalize DAMOS Watermarks commit assertion DamosWatermarks commitment assertion is hard-coded for a specific test case. Split it out into a general version that can be reused for different test cases. Link: https://lkml.kernel.org/r/20250720171652.92309-13-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/sysfs.py | 26 ++++++++++++++++++-------- 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/tools/testing/selftests/damon/sysfs.py b/tools/testing/selftests/damon/sysfs.py index fdc04f12be54..cb34444323e1 100755 --- a/tools/testing/selftests/damon/sysfs.py +++ b/tools/testing/selftests/damon/sysfs.py @@ -29,6 +29,22 @@ def fail(expectation, status): print(json.dumps(status, indent=4)) exit(1) +def assert_true(condition, expectation, status): + if condition is not True: + fail(expectation, status) + +def assert_watermarks_committed(watermarks, dump): + wmark_metric_val = { + 'none': 0, + 'free_mem_rate': 1, + } + assert_true(dump['metric'] == wmark_metric_val[watermarks.metric], + 'metric', dump) + assert_true(dump['interval'] == watermarks.interval, 'interval', dump) + assert_true(dump['high'] == watermarks.high, 'high', dump) + assert_true(dump['mid'] == watermarks.mid, 'mid', dump) + assert_true(dump['low'] == watermarks.low, 'low', dump) + def main(): kdamonds = _damon_sysfs.Kdamonds( [_damon_sysfs.Kdamond( @@ -105,14 +121,8 @@ def main(): }: fail('damos quota', status) - if scheme['wmarks'] != { - 'metric': 0, - 'interval': 0, - 'high': 0, - 'mid': 0, - 'low': 0, - }: - fail('damos wmarks', status) + assert_watermarks_committed(_damon_sysfs.DamosWatermarks(), + scheme['wmarks']) kdamonds.stop() From f797e709f741d5a28e4f3ca2592e5cc1d7e21e3b Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:43 -0700 Subject: [PATCH 349/370] selftests/damon/sysfs.py: generalize DamosQuota commit assertion DamosQuota commitment assertion is hard-coded for a specific test case. Split it out into a general version that can be reused for different test cases. 
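A usage sketch (quota values borrowed from the non-default test case added later in this series) of how the generalized helper can be reused beyond the default case:

    # illustrative reuse with a non-default quota
    quota = _damon_sysfs.DamosQuota(sz=100 * 1024 * 1024, ms=100,
                                    reset_interval_ms=1500)
    assert_quota_committed(quota, scheme['quota'])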
Link: https://lkml.kernel.org/r/20250720171652.92309-14-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/sysfs.py | 31 +++++++++++++++++--------- 1 file changed, 20 insertions(+), 11 deletions(-) diff --git a/tools/testing/selftests/damon/sysfs.py b/tools/testing/selftests/damon/sysfs.py index cb34444323e1..5f46b5f73896 100755 --- a/tools/testing/selftests/damon/sysfs.py +++ b/tools/testing/selftests/damon/sysfs.py @@ -45,6 +45,18 @@ def assert_watermarks_committed(watermarks, dump): assert_true(dump['mid'] == watermarks.mid, 'mid', dump) assert_true(dump['low'] == watermarks.low, 'low', dump) +def assert_quota_committed(quota, dump): + assert_true(dump['reset_interval'] == quota.reset_interval_ms, + 'reset_interval', dump) + assert_true(dump['ms'] == quota.ms, 'ms', dump) + assert_true(dump['sz'] == quota.sz, 'sz', dump) + # TODO: assert goals are committed + assert_true(dump['weight_sz'] == quota.weight_sz_permil, 'weight_sz', dump) + assert_true(dump['weight_nr_accesses'] == quota.weight_nr_accesses_permil, + 'weight_nr_accesses', dump) + assert_true( + dump['weight_age'] == quota.weight_age_permil, 'weight_age', dump) + def main(): kdamonds = _damon_sysfs.Kdamonds( [_damon_sysfs.Kdamond( @@ -109,18 +121,15 @@ def main(): if scheme['target_nid'] != -1: fail('damos target nid', status) - if scheme['quota'] != { - 'reset_interval': 0, - 'ms': 0, - 'sz': 0, - 'goals': [], - 'esz': 0, - 'weight_sz': 0, - 'weight_nr_accesses': 0, - 'weight_age': 0, - }: - fail('damos quota', status) + migrate_dests = scheme['migrate_dests'] + if migrate_dests['nr_dests'] != 0: + fail('nr_dests', status) + if migrate_dests['node_id_arr'] != []: + fail('node_id_arr', status) + if migrate_dests['weight_arr'] != []: + fail('weight_arr', status) + assert_quota_committed(_damon_sysfs.DamosQuota(), scheme['quota']) assert_watermarks_committed(_damon_sysfs.DamosWatermarks(), scheme['wmarks']) From 84dc442bd552f3e0b9abc6f1531431beaea68518 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:44 -0700 Subject: [PATCH 350/370] selftests/damon/sysfs.py: test quota goal commitment Current DAMOS quota commitment assertion is not testing quota goal commitment. Add the test. 
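For illustration (goal parameters taken from the non-default test case added later in this series), the new assertion covers goals such as:

    # illustrative: verify a committed node_mem_used_bp quota goal
    qgoal = _damon_sysfs.DamosQuotaGoal(metric='node_mem_used_bp',
                                        target_value=9950, nid=1)
    assert_quota_goal_committed(qgoal, scheme['quota']['goals'][0])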
Link: https://lkml.kernel.org/r/20250720171652.92309-15-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/sysfs.py | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/damon/sysfs.py b/tools/testing/selftests/damon/sysfs.py index 5f46b5f73896..8d859fee1141 100755 --- a/tools/testing/selftests/damon/sysfs.py +++ b/tools/testing/selftests/damon/sysfs.py @@ -45,12 +45,28 @@ def assert_watermarks_committed(watermarks, dump): assert_true(dump['mid'] == watermarks.mid, 'mid', dump) assert_true(dump['low'] == watermarks.low, 'low', dump) +def assert_quota_goal_committed(qgoal, dump): + metric_val = { + 'user_input': 0, + 'some_mem_psi_us': 1, + 'node_mem_used_bp': 2, + 'node_mem_free_bp': 3, + } + assert_true(dump['metric'] == metric_val[qgoal.metric], 'metric', dump) + assert_true(dump['target_value'] == qgoal.target_value, 'target_value', + dump) + if qgoal.metric == 'user_input': + assert_true(dump['current_value'] == qgoal.current_value, + 'current_value', dump) + assert_true(dump['nid'] == qgoal.nid, 'nid', dump) + def assert_quota_committed(quota, dump): assert_true(dump['reset_interval'] == quota.reset_interval_ms, 'reset_interval', dump) assert_true(dump['ms'] == quota.ms, 'ms', dump) assert_true(dump['sz'] == quota.sz, 'sz', dump) - # TODO: assert goals are committed + for idx, qgoal in enumerate(quota.goals): + assert_quota_goal_committed(qgoal, dump['goals'][idx]) assert_true(dump['weight_sz'] == quota.weight_sz_permil, 'weight_sz', dump) assert_true(dump['weight_nr_accesses'] == quota.weight_nr_accesses_permil, 'weight_nr_accesses', dump) From bd0487a7745ea84fb0ece7fd796ad51a92a12c70 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:45 -0700 Subject: [PATCH 351/370] selftests/damon/sysfs.py: test DAMOS destinations commitment Current DAMOS commitment assertion is not testing quota destinations commitment. Add the test. 
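A short sketch (destination values borrowed from the non-default test case later in this series) of what the new assertion verifies:

    # illustrative: verify committed migration destinations
    dests = _damon_sysfs.DamosDests(
            dests=[_damon_sysfs.DamosDest(id=1, weight=30),
                   _damon_sysfs.DamosDest(id=0, weight=70)])
    assert_migrate_dests_committed(dests, scheme['migrate_dests'])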
Link: https://lkml.kernel.org/r/20250720171652.92309-16-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/sysfs.py | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/tools/testing/selftests/damon/sysfs.py b/tools/testing/selftests/damon/sysfs.py index 8d859fee1141..5ab07a88ab7f 100755 --- a/tools/testing/selftests/damon/sysfs.py +++ b/tools/testing/selftests/damon/sysfs.py @@ -73,6 +73,13 @@ def assert_quota_committed(quota, dump): assert_true( dump['weight_age'] == quota.weight_age_permil, 'weight_age', dump) + +def assert_migrate_dests_committed(dests, dump): + assert_true(dump['nr_dests'] == len(dests.dests), 'nr_dests', dump) + for idx, dest in enumerate(dests.dests): + assert_true(dump['node_id_arr'][idx] == dest.id, 'node_id', dump) + assert_true(dump['weight_arr'][idx] == dest.weight, 'weight', dump) + def main(): kdamonds = _damon_sysfs.Kdamonds( [_damon_sysfs.Kdamond( @@ -137,14 +144,8 @@ def main(): if scheme['target_nid'] != -1: fail('damos target nid', status) - migrate_dests = scheme['migrate_dests'] - if migrate_dests['nr_dests'] != 0: - fail('nr_dests', status) - if migrate_dests['node_id_arr'] != []: - fail('node_id_arr', status) - if migrate_dests['weight_arr'] != []: - fail('weight_arr', status) - + assert_migrate_dests_committed(_damon_sysfs.DamosDests(), + scheme['migrate_dests']) assert_quota_committed(_damon_sysfs.DamosQuota(), scheme['quota']) assert_watermarks_committed(_damon_sysfs.DamosWatermarks(), scheme['wmarks']) From f22ff7b5a5baf7dc3326948e50781067f9aebb3c Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:46 -0700 Subject: [PATCH 352/370] selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion DAMOS scheme commitment assertion is hard-coded for a specific test case. Split it out into a general version that can be reused for different test cases. 
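A hedged usage sketch showing how the generalized helper can then be fed a non-default scheme (arguments as used later in this series) instead of the hard-coded expectations:

    # illustrative reuse with a non-default scheme
    scheme = _damon_sysfs.Damos(action='pageout', apply_interval_us=1000000)
    assert_scheme_committed(scheme, ctx['schemes'][0])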
Link: https://lkml.kernel.org/r/20250720171652.92309-17-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/sysfs.py | 59 ++++++++++++++++---------- 1 file changed, 37 insertions(+), 22 deletions(-) diff --git a/tools/testing/selftests/damon/sysfs.py b/tools/testing/selftests/damon/sysfs.py index 5ab07a88ab7f..6326f62733f6 100755 --- a/tools/testing/selftests/damon/sysfs.py +++ b/tools/testing/selftests/damon/sysfs.py @@ -80,6 +80,42 @@ def assert_migrate_dests_committed(dests, dump): assert_true(dump['node_id_arr'][idx] == dest.id, 'node_id', dump) assert_true(dump['weight_arr'][idx] == dest.weight, 'weight', dump) +def assert_access_pattern_committed(pattern, dump): + assert_true(dump['min_sz_region'] == pattern.size[0], 'min_sz_region', + dump) + assert_true(dump['max_sz_region'] == pattern.size[1], 'max_sz_region', + dump) + assert_true(dump['min_nr_accesses'] == pattern.nr_accesses[0], + 'min_nr_accesses', dump) + assert_true(dump['max_nr_accesses'] == pattern.nr_accesses[1], + 'max_nr_accesses', dump) + assert_true(dump['min_age_region'] == pattern.age[0], 'min_age_region', + dump) + assert_true(dump['max_age_region'] == pattern.age[1], 'miaxage_region', + dump) + +def assert_scheme_committed(scheme, dump): + assert_access_pattern_committed(scheme.access_pattern, dump['pattern']) + action_val = { + 'willneed': 0, + 'cold': 1, + 'pageout': 2, + 'hugepage': 3, + 'nohugeapge': 4, + 'lru_prio': 5, + 'lru_deprio': 6, + 'migrate_hot': 7, + 'migrate_cold': 8, + 'stat': 9, + } + assert_true(dump['action'] == action_val[scheme.action], 'action', dump) + assert_true(dump['apply_interval_us'] == scheme. apply_interval_us, + 'apply_interval_us', dump) + assert_true(dump['target_nid'] == scheme.target_nid, 'target_nid', dump) + assert_migrate_dests_committed(scheme.dests, dump['migrate_dests']) + assert_quota_committed(scheme.quota, dump['quota']) + assert_watermarks_committed(scheme.watermarks, dump['wmarks']) + def main(): kdamonds = _damon_sysfs.Kdamonds( [_damon_sysfs.Kdamond( @@ -127,28 +163,7 @@ def main(): if len(ctx['schemes']) != 1: fail('number of schemes', status) - scheme = ctx['schemes'][0] - if scheme['pattern'] != { - 'min_sz_region': 0, - 'max_sz_region': 2**64 - 1, - 'min_nr_accesses': 0, - 'max_nr_accesses': 2**32 - 1, - 'min_age_region': 0, - 'max_age_region': 2**32 - 1, - }: - fail('damos pattern', status) - if scheme['action'] != 9: # stat - fail('damos action', status) - if scheme['apply_interval_us'] != 0: - fail('damos apply interval', status) - if scheme['target_nid'] != -1: - fail('damos target nid', status) - - assert_migrate_dests_committed(_damon_sysfs.DamosDests(), - scheme['migrate_dests']) - assert_quota_committed(_damon_sysfs.DamosQuota(), scheme['quota']) - assert_watermarks_committed(_damon_sysfs.DamosWatermarks(), - scheme['wmarks']) + assert_scheme_committed(_damon_sysfs.Damos(), ctx['schemes'][0]) kdamonds.stop() From 53f800581f79555e84aeabff81a90cf954803b83 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:47 -0700 Subject: [PATCH 353/370] selftests/damon/sysfs.py: test DAMOS filters commitment Current DAMOS scheme commitment assertion is not testing DAMOS filters. Add the test. 
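For illustration (filter arguments taken from the non-default test case later in this series), the added assertion checks entries such as:

    # illustrative: verify a committed address-range filter
    f = _damon_sysfs.DamosFilter(type_='addr', matching=True, allow=False,
                                 addr_start=42, addr_end=4242)
    assert_filter_committed(f, ctx['schemes'][0]['filters'][0])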
Link: https://lkml.kernel.org/r/20250720171652.92309-18-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/sysfs.py | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/tools/testing/selftests/damon/sysfs.py b/tools/testing/selftests/damon/sysfs.py index 6326f62733f6..6471cb8c6de2 100755 --- a/tools/testing/selftests/damon/sysfs.py +++ b/tools/testing/selftests/damon/sysfs.py @@ -80,6 +80,21 @@ def assert_migrate_dests_committed(dests, dump): assert_true(dump['node_id_arr'][idx] == dest.id, 'node_id', dump) assert_true(dump['weight_arr'][idx] == dest.weight, 'weight', dump) +def assert_filter_committed(filter_, dump): + assert_true(filter_.type_ == dump['type'], 'type', dump) + assert_true(filter_.matching == dump['matching'], 'matching', dump) + assert_true(filter_.allow == dump['allow'], 'allow', dump) + # TODO: check memcg_path and memcg_id if type is memcg + if filter_.type_ == 'addr': + assert_true([filter_.addr_start, filter_.addr_end] == + dump['addr_range'], 'addr_range', dump) + elif filter_.type_ == 'target': + assert_true(filter_.target_idx == dump['target_idx'], 'target_idx', + dump) + elif filter_.type_ == 'hugepage_size': + assert_true([filter_.min_, filter_.max_] == dump['sz_range'], + 'sz_range', dump) + def assert_access_pattern_committed(pattern, dump): assert_true(dump['min_sz_region'] == pattern.size[0], 'min_sz_region', dump) @@ -115,6 +130,11 @@ def assert_scheme_committed(scheme, dump): assert_migrate_dests_committed(scheme.dests, dump['migrate_dests']) assert_quota_committed(scheme.quota, dump['quota']) assert_watermarks_committed(scheme.watermarks, dump['wmarks']) + # TODO: test filters directory + for idx, f in enumerate(scheme.core_filters.filters): + assert_filter_committed(f, dump['filters'][idx]) + for idx, f in enumerate(scheme.ops_filters.filters): + assert_filter_committed(f, dump['ops_filters'][idx]) def main(): kdamonds = _damon_sysfs.Kdamonds( From 771d7754ab28597c0370a588887f50db229d1c0d Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:48 -0700 Subject: [PATCH 354/370] selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion DAMOS schemes commitment assertion is hard-coded for a specific test case. Split it out into a general version that can be reused for different test cases. 
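A brief sketch of the intended reuse, assuming a kdamond configured with more than one scheme:

    # illustrative: length and per-scheme checks in one call
    schemes = [_damon_sysfs.Damos(action='stat'),
               _damon_sysfs.Damos(action='pageout')]
    assert_schemes_committed(schemes, ctx['schemes'])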
Link: https://lkml.kernel.org/r/20250720171652.92309-19-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/sysfs.py | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/damon/sysfs.py b/tools/testing/selftests/damon/sysfs.py index 6471cb8c6de2..4d13da00733e 100755 --- a/tools/testing/selftests/damon/sysfs.py +++ b/tools/testing/selftests/damon/sysfs.py @@ -136,6 +136,11 @@ def assert_scheme_committed(scheme, dump): for idx, f in enumerate(scheme.ops_filters.filters): assert_filter_committed(f, dump['ops_filters'][idx]) +def assert_schemes_committed(schemes, dump): + assert_true(len(schemes) == len(dump), 'len_schemes', dump) + for idx, scheme in enumerate(schemes): + assert_scheme_committed(scheme, dump[idx]) + def main(): kdamonds = _damon_sysfs.Kdamonds( [_damon_sysfs.Kdamond( @@ -180,10 +185,7 @@ def main(): { 'pid': 0, 'nr_regions': 0, 'regions_list': []}]: fail('adaptive targets', status) - if len(ctx['schemes']) != 1: - fail('number of schemes', status) - - assert_scheme_committed(_damon_sysfs.Damos(), ctx['schemes'][0]) + assert_schemes_committed([_damon_sysfs.Damos()], ctx['schemes']) kdamonds.stop() From a4027b5f24282b4aab5cee1b63b4267d27b6c686 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:49 -0700 Subject: [PATCH 355/370] selftests/damon/sysfs.py: generalize monitoring attributes commit assertion DAMON monitoring attributes commitment assertion is hard-coded for a specific test case. Split it out into a general version that can be reused for different test cases. Link: https://lkml.kernel.org/r/20250720171652.92309-20-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/sysfs.py | 40 +++++++++++++++----------- 1 file changed, 24 insertions(+), 16 deletions(-) diff --git a/tools/testing/selftests/damon/sysfs.py b/tools/testing/selftests/damon/sysfs.py index 4d13da00733e..2144a47d9827 100755 --- a/tools/testing/selftests/damon/sysfs.py +++ b/tools/testing/selftests/damon/sysfs.py @@ -141,6 +141,29 @@ def assert_schemes_committed(schemes, dump): for idx, scheme in enumerate(schemes): assert_scheme_committed(scheme, dump[idx]) +def assert_monitoring_attrs_committed(attrs, dump): + assert_true(dump['sample_interval'] == attrs.sample_us, 'sample_interval', + dump) + assert_true(dump['aggr_interval'] == attrs.aggr_us, 'aggr_interval', dump) + assert_true(dump['intervals_goal']['access_bp'] == + attrs.intervals_goal.access_bp, 'access_bp', + dump['intervals_goal']) + assert_true(dump['intervals_goal']['aggrs'] == attrs.intervals_goal.aggrs, + 'aggrs', dump['intervals_goal']) + assert_true(dump['intervals_goal']['min_sample_us'] == + attrs.intervals_goal.min_sample_us, 'min_sample_us', + dump['intervals_goal']) + assert_true(dump['intervals_goal']['max_sample_us'] == + attrs.intervals_goal.max_sample_us, 'max_sample_us', + dump['intervals_goal']) + + assert_true(dump['ops_update_interval'] == attrs.update_us, + 'ops_update_interval', dump) + assert_true(dump['min_nr_regions'] == attrs.min_nr_regions, + 'min_nr_regions', dump) + assert_true(dump['max_nr_regions'] == attrs.max_nr_regions, + 'max_nr_regions', dump) + def main(): kdamonds = _damon_sysfs.Kdamonds( [_damon_sysfs.Kdamond( @@ -163,23 +186,8 @@ def main(): fail('number of contexts', status) ctx = status['contexts'][0] - attrs = ctx['attrs'] - if attrs['sample_interval'] != 5000: - fail('sample interval', status) - if 
attrs['aggr_interval'] != 100000: - fail('aggr interval', status) - if attrs['ops_update_interval'] != 1000000: - fail('ops updte interval', status) - if attrs['intervals_goal'] != { - 'access_bp': 0, 'aggrs': 0, - 'min_sample_us': 0, 'max_sample_us': 0}: - fail('intervals goal') - - if attrs['min_nr_regions'] != 10: - fail('min_nr_regions') - if attrs['max_nr_regions'] != 1000: - fail('max_nr_regions') + assert_monitoring_attrs_committed(_damon_sysfs.DamonAttrs(), ctx['attrs']) if ctx['adaptive_targets'] != [ { 'pid': 0, 'nr_regions': 0, 'regions_list': []}]: From 16797a55aab118c5fa9236604aa2142f1121f136 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:50 -0700 Subject: [PATCH 356/370] selftests/damon/sysfs.py: generalize DAMON context commit assertion DAMON context commitment assertion is hard-coded for a specific test case. Split it out into a general version that can be reused for different test cases. Link: https://lkml.kernel.org/r/20250720171652.92309-21-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/sysfs.py | 28 +++++++++++++++----------- 1 file changed, 16 insertions(+), 12 deletions(-) diff --git a/tools/testing/selftests/damon/sysfs.py b/tools/testing/selftests/damon/sysfs.py index 2144a47d9827..845fc177f8a7 100755 --- a/tools/testing/selftests/damon/sysfs.py +++ b/tools/testing/selftests/damon/sysfs.py @@ -164,6 +164,21 @@ def assert_monitoring_attrs_committed(attrs, dump): assert_true(dump['max_nr_regions'] == attrs.max_nr_regions, 'max_nr_regions', dump) +def assert_ctx_committed(ctx, dump): + ops_val = { + 'vaddr': 0, + 'fvaddr': 1, + 'paddr': 2, + } + assert_true(dump['ops']['id'] == ops_val[ctx.ops], 'ops_id', dump) + assert_monitoring_attrs_committed(ctx.monitoring_attrs, dump['attrs']) + assert_schemes_committed(ctx.schemes, dump['schemes']) + +def assert_ctxs_committed(ctxs, dump): + assert_true(len(ctxs) == len(dump), 'ctxs length', dump) + for idx, ctx in enumerate(ctxs): + assert_ctx_committed(ctx, dump[idx]) + def main(): kdamonds = _damon_sysfs.Kdamonds( [_damon_sysfs.Kdamond( @@ -182,18 +197,7 @@ def main(): kdamonds.stop() exit(1) - if len(status['contexts']) != 1: - fail('number of contexts', status) - - ctx = status['contexts'][0] - - assert_monitoring_attrs_committed(_damon_sysfs.DamonAttrs(), ctx['attrs']) - - if ctx['adaptive_targets'] != [ - { 'pid': 0, 'nr_regions': 0, 'regions_list': []}]: - fail('adaptive targets', status) - - assert_schemes_committed([_damon_sysfs.Damos()], ctx['schemes']) + assert_ctxs_committed(kdamonds.kdamonds[0].contexts, status['contexts']) kdamonds.stop() From 62b7b1ffa2dd242253480ffcd12c96c4b7ef8fb6 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:51 -0700 Subject: [PATCH 357/370] selftests/damon/sysfs.py: test non-default parameters runtime commit sysfs.py is testing only the default and minimum DAMON parameters. Add another test case for more non-default additional DAMON parameters commitment on runtime. 
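At a high level the added test (sketched below for orientation; the full parameter set is in the diff that follows) builds a DamonCtx with non-default values, attaches it to the already-running kdamond, commits it, and re-compares against the dumped status:

    # illustrative outline of the runtime commit check
    context.idx = 0
    context.kdamond = kdamonds.kdamonds[0]
    kdamonds.kdamonds[0].contexts = [context]
    kdamonds.kdamonds[0].commit()
    status, err = dump_damon_status_dict(kdamonds.kdamonds[0].pid)
    assert_ctxs_committed(kdamonds.kdamonds[0].contexts, status['contexts'])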
Link: https://lkml.kernel.org/r/20250720171652.92309-22-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/sysfs.py | 53 ++++++++++++++++++++++++++ 1 file changed, 53 insertions(+) diff --git a/tools/testing/selftests/damon/sysfs.py b/tools/testing/selftests/damon/sysfs.py index 845fc177f8a7..1bf9508cd44d 100755 --- a/tools/testing/selftests/damon/sysfs.py +++ b/tools/testing/selftests/damon/sysfs.py @@ -199,6 +199,59 @@ def main(): assert_ctxs_committed(kdamonds.kdamonds[0].contexts, status['contexts']) + context = _damon_sysfs.DamonCtx( + monitoring_attrs=_damon_sysfs.DamonAttrs( + sample_us=100000, aggr_us=2000000, + intervals_goal=_damon_sysfs.IntervalsGoal( + access_bp=400, aggrs=3, min_sample_us=5000, + max_sample_us=10000000), + update_us=2000000), + schemes=[_damon_sysfs.Damos( + action='pageout', + access_pattern=_damon_sysfs.DamosAccessPattern( + size=[4096, 2**10], + nr_accesses=[3, 317], + age=[5,71]), + quota=_damon_sysfs.DamosQuota( + sz=100*1024*1024, ms=100, + goals=[_damon_sysfs.DamosQuotaGoal( + metric='node_mem_used_bp', + target_value=9950, + nid=1)], + reset_interval_ms=1500, + weight_sz_permil=20, + weight_nr_accesses_permil=200, + weight_age_permil=1000), + watermarks=_damon_sysfs.DamosWatermarks( + metric = 'free_mem_rate', interval = 500000, # 500 ms + high = 500, mid = 400, low = 50), + target_nid=1, + apply_interval_us=1000000, + dests=_damon_sysfs.DamosDests( + dests=[_damon_sysfs.DamosDest(id=1, weight=30), + _damon_sysfs.DamosDest(id=0, weight=70)]), + core_filters=[ + _damon_sysfs.DamosFilter(type_='addr', matching=True, + allow=False, addr_start=42, + addr_end=4242), + ], + ops_filters=[ + _damon_sysfs.DamosFilter(type_='anon', matching=True, + allow=True), + ], + )]) + context.idx = 0 + context.kdamond = kdamonds.kdamonds[0] + kdamonds.kdamonds[0].contexts = [context] + kdamonds.kdamonds[0].commit() + + status, err = dump_damon_status_dict(kdamonds.kdamonds[0].pid) + if err is not None: + print(err) + exit(1) + + assert_ctxs_committed(kdamonds.kdamonds[0].contexts, status['contexts']) + kdamonds.stop() if __name__ == '__main__': From da5973a0b8e0370f9e309c60e35a4d4d4dfe32fc Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Jul 2025 10:16:52 -0700 Subject: [PATCH 358/370] selftests/damon/sysfs.py: test runtime reduction of DAMON parameters sysfs.py is testing if non-default additional parameters can be committed. Add a test case for further reducing the parameters to the default set. Link: https://lkml.kernel.org/r/20250720171652.92309-23-sj@kernel.org Signed-off-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/sysfs.py | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/tools/testing/selftests/damon/sysfs.py b/tools/testing/selftests/damon/sysfs.py index 1bf9508cd44d..2666c6f0f1a5 100755 --- a/tools/testing/selftests/damon/sysfs.py +++ b/tools/testing/selftests/damon/sysfs.py @@ -252,6 +252,20 @@ def main(): assert_ctxs_committed(kdamonds.kdamonds[0].contexts, status['contexts']) + # test online commitment of minimum context. 
+ context = _damon_sysfs.DamonCtx() + context.idx = 0 + context.kdamond = kdamonds.kdamonds[0] + kdamonds.kdamonds[0].contexts = [context] + kdamonds.kdamonds[0].commit() + + status, err = dump_damon_status_dict(kdamonds.kdamonds[0].pid) + if err is not None: + print(err) + exit(1) + + assert_ctxs_committed(kdamonds.kdamonds[0].contexts, status['contexts']) + kdamonds.stop() if __name__ == '__main__': From 511914506d194be931de012a5f2ab55841b9b708 Mon Sep 17 00:00:00 2001 From: Enze Li Date: Fri, 18 Jul 2025 14:42:17 +0800 Subject: [PATCH 359/370] selftests/damon: introduce _common.sh to host shared function The current test scripts contain duplicated root permission checks in multiple locations. This patch consolidates these checks into _common.sh to eliminate code redundancy. Link: https://lkml.kernel.org/r/20250718064217.299300-1-lienze@kylinos.cn Signed-off-by: Enze Li Reviewed-by: SeongJae Park Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/_common.sh | 11 +++++++++++ tools/testing/selftests/damon/lru_sort.sh | 8 +++----- tools/testing/selftests/damon/reclaim.sh | 8 +++----- tools/testing/selftests/damon/sysfs.sh | 11 ++--------- .../damon/sysfs_update_removed_scheme_dir.sh | 8 +++----- 5 files changed, 22 insertions(+), 24 deletions(-) create mode 100644 tools/testing/selftests/damon/_common.sh diff --git a/tools/testing/selftests/damon/_common.sh b/tools/testing/selftests/damon/_common.sh new file mode 100644 index 000000000000..0279698f733e --- /dev/null +++ b/tools/testing/selftests/damon/_common.sh @@ -0,0 +1,11 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +check_dependencies() +{ + if [ $EUID -ne 0 ] + then + echo "Run as root" + exit $ksft_skip + fi +} diff --git a/tools/testing/selftests/damon/lru_sort.sh b/tools/testing/selftests/damon/lru_sort.sh index 61b80197c896..1e4849db78a9 100755 --- a/tools/testing/selftests/damon/lru_sort.sh +++ b/tools/testing/selftests/damon/lru_sort.sh @@ -1,14 +1,12 @@ #!/bin/bash # SPDX-License-Identifier: GPL-2.0 +source _common.sh + # Kselftest framework requirement - SKIP code is 4. ksft_skip=4 -if [ $EUID -ne 0 ] -then - echo "Run as root" - exit $ksft_skip -fi +check_dependencies damon_lru_sort_enabled="/sys/module/damon_lru_sort/parameters/enabled" if [ ! -f "$damon_lru_sort_enabled" ] diff --git a/tools/testing/selftests/damon/reclaim.sh b/tools/testing/selftests/damon/reclaim.sh index 78dbc2334cbe..e56ceb035129 100755 --- a/tools/testing/selftests/damon/reclaim.sh +++ b/tools/testing/selftests/damon/reclaim.sh @@ -1,14 +1,12 @@ #!/bin/bash # SPDX-License-Identifier: GPL-2.0 +source _common.sh + # Kselftest framework requirement - SKIP code is 4. ksft_skip=4 -if [ $EUID -ne 0 ] -then - echo "Run as root" - exit $ksft_skip -fi +check_dependencies damon_reclaim_enabled="/sys/module/damon_reclaim/parameters/enabled" if [ ! -f "$damon_reclaim_enabled" ] diff --git a/tools/testing/selftests/damon/sysfs.sh b/tools/testing/selftests/damon/sysfs.sh index e9a976d296e2..83e3b7f63d81 100755 --- a/tools/testing/selftests/damon/sysfs.sh +++ b/tools/testing/selftests/damon/sysfs.sh @@ -1,6 +1,8 @@ #!/bin/bash # SPDX-License-Identifier: GPL-2.0 +source _common.sh + # Kselftest frmework requirement - SKIP code is 4. 
ksft_skip=4 @@ -364,14 +366,5 @@ test_damon_sysfs() test_kdamonds "$damon_sysfs/kdamonds" } -check_dependencies() -{ - if [ $EUID -ne 0 ] - then - echo "Run as root" - exit $ksft_skip - fi -} - check_dependencies test_damon_sysfs "/sys/kernel/mm/damon/admin" diff --git a/tools/testing/selftests/damon/sysfs_update_removed_scheme_dir.sh b/tools/testing/selftests/damon/sysfs_update_removed_scheme_dir.sh index ade35576e748..35fc32beeaf7 100755 --- a/tools/testing/selftests/damon/sysfs_update_removed_scheme_dir.sh +++ b/tools/testing/selftests/damon/sysfs_update_removed_scheme_dir.sh @@ -1,14 +1,12 @@ #!/bin/bash # SPDX-License-Identifier: GPL-2.0 +source _common.sh + # Kselftest framework requirement - SKIP code is 4. ksft_skip=4 -if [ $EUID -ne 0 ] -then - echo "Run as root" - exit $ksft_skip -fi +check_dependencies damon_sysfs="/sys/kernel/mm/damon/admin" if [ ! -d "$damon_sysfs" ] From 48e6561b667e7f0623da3ca34e2b93b7ae2a5d8d Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Tue, 22 Jul 2025 15:46:49 -0400 Subject: [PATCH 360/370] mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info() The trace event has not recorded the right data since it was introduced at commit c8b360031218 ("mm: add alloc_contig_migrate_range allocation statistics"). Remove it. Link: https://lkml.kernel.org/r/20250722194649.4135191-1-ziy@nvidia.com Signed-off-by: Zi Yan Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202507220742.P3SaKlI6-lkp@intel.com/ Acked-by: David Hildenbrand Acked-by: Vlastimil Babka Cc: Brendan Jackman Cc: David Rientjes Cc: Johannes Weiner Cc: Martin Liu Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Michal Hocko Cc: Richard Chang Cc: Steven Rostedt Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- include/trace/events/kmem.h | 40 ------------------------------------- mm/page_alloc.c | 32 +++-------------------------- 2 files changed, 3 insertions(+), 69 deletions(-) diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h index efffcf578217..474358773abe 100644 --- a/include/trace/events/kmem.h +++ b/include/trace/events/kmem.h @@ -304,46 +304,6 @@ TRACE_EVENT(mm_page_alloc_extfrag, __entry->change_ownership) ); -#ifdef CONFIG_CONTIG_ALLOC -TRACE_EVENT(mm_alloc_contig_migrate_range_info, - - TP_PROTO(unsigned long start, - unsigned long end, - unsigned long nr_migrated, - unsigned long nr_reclaimed, - unsigned long nr_mapped, - acr_flags_t alloc_flags), - - TP_ARGS(start, end, nr_migrated, nr_reclaimed, nr_mapped, alloc_flags), - - TP_STRUCT__entry( - __field(unsigned long, start) - __field(unsigned long, end) - __field(unsigned long, nr_migrated) - __field(unsigned long, nr_reclaimed) - __field(unsigned long, nr_mapped) - __field(acr_flags_t, alloc_flags) - ), - - TP_fast_assign( - __entry->start = start; - __entry->end = end; - __entry->nr_migrated = nr_migrated; - __entry->nr_reclaimed = nr_reclaimed; - __entry->nr_mapped = nr_mapped; - __entry->alloc_flags = alloc_flags; - ), - - TP_printk("start=0x%lx end=0x%lx alloc_flags=%d nr_migrated=%lu nr_reclaimed=%lu nr_mapped=%lu", - __entry->start, - __entry->end, - __entry->alloc_flags, - __entry->nr_migrated, - __entry->nr_reclaimed, - __entry->nr_mapped) -); -#endif - TRACE_EVENT(mm_setup_per_zone_wmarks, TP_PROTO(struct zone *zone), diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fa09154a799c..d1d037f97c5f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6694,14 +6694,9 @@ static void alloc_contig_dump_pages(struct list_head *page_list) } } -/* - * [start, end) must 
belong to a single zone. - * @alloc_flags: using acr_flags_t to filter the type of migration in - * trace_mm_alloc_contig_migrate_range_info. - */ +/* [start, end) must belong to a single zone. */ static int __alloc_contig_migrate_range(struct compact_control *cc, - unsigned long start, unsigned long end, - acr_flags_t alloc_flags) + unsigned long start, unsigned long end) { /* This function is based on compact_zone() from compaction.c. */ unsigned int nr_reclaimed; @@ -6713,10 +6708,6 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, .gfp_mask = cc->gfp_mask, .reason = MR_CONTIG_RANGE, }; - struct page *page; - unsigned long total_mapped = 0; - unsigned long total_migrated = 0; - unsigned long total_reclaimed = 0; lru_cache_disable(); @@ -6742,22 +6733,9 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, &cc->migratepages); cc->nr_migratepages -= nr_reclaimed; - if (trace_mm_alloc_contig_migrate_range_info_enabled()) { - total_reclaimed += nr_reclaimed; - list_for_each_entry(page, &cc->migratepages, lru) { - struct folio *folio = page_folio(page); - - total_mapped += folio_mapped(folio) * - folio_nr_pages(folio); - } - } - ret = migrate_pages(&cc->migratepages, alloc_migration_target, NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE, NULL); - if (trace_mm_alloc_contig_migrate_range_info_enabled() && !ret) - total_migrated += cc->nr_migratepages; - /* * On -ENOMEM, migrate_pages() bails out right away. It is pointless * to retry again over this error, so do the same here. @@ -6773,10 +6751,6 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, putback_movable_pages(&cc->migratepages); } - trace_mm_alloc_contig_migrate_range_info(start, end, alloc_flags, - total_migrated, - total_reclaimed, - total_mapped); return (ret < 0) ? ret : 0; } @@ -6921,7 +6895,7 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, * allocated. So, if we fall through be sure to clear ret so that * -EBUSY is not accidentally used or returned to caller. */ - ret = __alloc_contig_migrate_range(&cc, start, end, alloc_flags); + ret = __alloc_contig_migrate_range(&cc, start, end); if (ret && ret != -EBUSY) goto done; From 44d10df2007a3081ae45bbf81a96b077b48db6a2 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Tue, 22 Jul 2025 18:10:23 +0100 Subject: [PATCH 361/370] MAINTAINERS: add missing percpu-internal.h file to per-cpu section This file seems to most appropriately belong to the PER-CPU MEMORY ALLOCATOR section, so place it there. 
Link: https://lkml.kernel.org/r/20250722171023.139777-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Cc: Alistair Popple Cc: Christoph Lameter (Ampere) Cc: Dave Chinner Cc: David Hildenbrand Cc: Dennis Zhou Cc: Jann Horn Cc: Liam Howlett Cc: Muchun Song Cc: Nico Pache Cc: Oscar Salvador Cc: Qi Zheng Cc: Roman Gushchin Cc: Tejun Heo Cc: Vlastimil Babka Cc: Zi Yan Cc: Dev Jain Cc: Michal Hocko Cc: Mike Rapoport Cc: Pedro Falcato Cc: Suren Baghdasaryan Cc: Shakeel Butt Cc: Joshua Hahn Signed-off-by: Andrew Morton --- MAINTAINERS | 1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index 9dd4111d7d96..13aa534a5bb8 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -19459,6 +19459,7 @@ F: arch/*/include/asm/percpu.h F: include/linux/percpu*.h F: lib/percpu*.c F: mm/percpu*.c +F: mm/percpu-internal.h PER-TASK DELAY ACCOUNTING M: Balbir Singh From 85c16ee6faa1f12289b9b84ab552f55dc4aad89c Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Tue, 22 Jul 2025 18:15:28 +0100 Subject: [PATCH 362/370] MAINTAINERS: add missing interval_tree.c to memory mapping section This seems to be the best place for this file. Link: https://lkml.kernel.org/r/20250722171528.141083-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: Vlastimil Babka Acked-by: Pedro Falcato Cc: Alistair Popple Cc: Christoph Lameter (Ampere) Cc: Dave Chinner Cc: David Hildenbrand Cc: Dennis Zhou Cc: Jann Horn Cc: Liam Howlett Cc: Muchun Song Cc: Nico Pache Cc: Oscar Salvador Cc: Qi Zheng Cc: Roman Gushchin Cc: Tejun Heo Cc: Zi Yan Cc: Dev Jain Cc: Michal Hocko Cc: Mike Rapoport Cc: Suren Baghdasaryan Cc: Shakeel Butt Cc: Joshua Hahn Signed-off-by: Andrew Morton --- MAINTAINERS | 1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index 13aa534a5bb8..b95bad402591 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -16007,6 +16007,7 @@ S: Maintained W: http://www.linux-mm.org T: git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm F: include/trace/events/mmap.h +F: mm/interval_tree.c F: mm/mincore.c F: mm/mlock.c F: mm/mmap.c From 651ad43d56d1bae6aa37d313339ce756b5303a67 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Tue, 22 Jul 2025 18:19:04 +0100 Subject: [PATCH 363/370] MAINTAINERS: add missing mm_slot.h file THP section This seems to be the most appropriate place for this file. 
[lorenzo.stoakes@oracle.com: also add mm_slot.h to KSM section] Link: https://lkml.kernel.org/r/685747e2-a8cb-4620-a0c0-5cd9048d69b8@lucifer.local Link: https://lkml.kernel.org/r/20250722171904.142306-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: Zi Yan Acked-by: Nico Pache Acked-by: David Hildenbrand Acked-by: Dev Jain Acked-by: Qi Zheng Cc: Alistair Popple Cc: Christoph Lameter (Ampere) Cc: Dave Chinner Cc: Dennis Zhou Cc: Jann Horn Cc: Liam Howlett Cc: Muchun Song Cc: Oscar Salvador Cc: Roman Gushchin Cc: Tejun Heo Cc: Vlastimil Babka Cc: Michal Hocko Cc: Mike Rapoport Cc: Pedro Falcato Cc: Suren Baghdasaryan Cc: Shakeel Butt Cc: Joshua Hahn Signed-off-by: Andrew Morton --- MAINTAINERS | 2 ++ 1 file changed, 2 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index b95bad402591..92d79ce8fd39 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15820,6 +15820,7 @@ F: Documentation/mm/ksm.rst F: include/linux/ksm.h F: include/trace/events/ksm.h F: mm/ksm.c +F: mm/mm_slot.h MEMORY MANAGEMENT - MEMORY POLICY AND MIGRATION M: Andrew Morton @@ -15965,6 +15966,7 @@ F: include/linux/khugepaged.h F: include/trace/events/huge_memory.h F: mm/huge_memory.c F: mm/khugepaged.c +F: mm/mm_slot.h F: tools/testing/selftests/mm/khugepaged.c F: tools/testing/selftests/mm/split_huge_page_test.c F: tools/testing/selftests/mm/transhuge-stress.c From 2011011ad6aee2d4366402d91a856a9c9f377252 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Tue, 22 Jul 2025 18:22:58 +0100 Subject: [PATCH 364/370] MAINTAINERS: move memremap.[ch] to hotplug section This seems to be the most appropriate place for these files. Link: https://lkml.kernel.org/r/20250722172258.143488-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: David Hildenbrand Cc: Alistair Popple Cc: Oscar Salvador Cc: Christoph Lameter (Ampere) Cc: Dennis Zhou Cc: Tejun Heo Cc: Jann Horn Cc: Liam Howlett Cc: Vlastimil Babka Cc: Dave Chinner Cc: Muchun Song Cc: Nico Pache Cc: Qi Zheng Cc: Roman Gushchin Cc: Zi Yan Cc: Dev Jain Cc: Michal Hocko Cc: Mike Rapoport Cc: Pedro Falcato Cc: Suren Baghdasaryan Cc: Shakeel Butt Cc: Joshua Hahn Signed-off-by: Andrew Morton --- MAINTAINERS | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 92d79ce8fd39..889e53c6e8b9 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15728,6 +15728,8 @@ F: Documentation/admin-guide/mm/memory-hotplug.rst F: Documentation/core-api/memory-hotplug.rst F: drivers/base/memory.c F: include/linux/memory_hotplug.h +F: include/linux/memremap.h +F: mm/memremap.c F: mm/memory_hotplug.c F: tools/testing/selftests/memory-hotplug/ @@ -15746,7 +15748,6 @@ F: include/linux/memory_hotplug.h F: include/linux/memory-tiers.h F: include/linux/mempolicy.h F: include/linux/mempool.h -F: include/linux/memremap.h F: include/linux/mmzone.h F: include/linux/mmu_notifier.h F: include/linux/pagewalk.h From c3ef2cc69596f2cfb1546d6428ca906dd2cc13ea Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Tue, 22 Jul 2025 18:34:36 +0100 Subject: [PATCH 365/370] MAINTAINERS: add missing shrinker files The mm/list_lru.[ch] files implement a shrinker-specific data structure so seem most suited to the SHRINKER section. 
Link: https://lkml.kernel.org/r/20250722173436.145526-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: Qi Zheng Cc: Dave Chinner Cc: Muchun Song Cc: Roman Gushchin Cc: Alistair Popple Cc: Christoph Lameter (Ampere) Cc: David Hildenbrand Cc: Dennis Zhou Cc: Jann Horn Cc: Liam Howlett Cc: Nico Pache Cc: Oscar Salvador Cc: Tejun Heo Cc: Vlastimil Babka Cc: Zi Yan Cc: Dev Jain Cc: Michal Hocko Cc: Mike Rapoport Cc: Pedro Falcato Cc: Suren Baghdasaryan Cc: Shakeel Butt Cc: Joshua Hahn Signed-off-by: Andrew Morton --- MAINTAINERS | 2 ++ 1 file changed, 2 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 889e53c6e8b9..421e1c68cfd4 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -22635,7 +22635,9 @@ R: Muchun Song L: linux-mm@kvack.org S: Maintained F: Documentation/admin-guide/mm/shrinker_debugfs.rst +F: include/linux/list_lru.h F: include/linux/shrinker.h +F: mm/list_lru.c F: mm/shrinker.c F: mm/shrinker_debug.c From 2656a75ca140710b7cc78f3c495dd9660f78a2c3 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Tue, 22 Jul 2025 18:41:43 +0100 Subject: [PATCH 366/370] MAINTAINERS: add missing files to page alloc section There are a couple of mm/-specific header files that were accidentally missed previously, and some page ref debug code also that ought to live here. Link: https://lkml.kernel.org/r/20250722174143.147143-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: Zi Yan Acked-by: David Hildenbrand Acked-by: Vlastimil Babka Cc: Alistair Popple Cc: Christoph Lameter (Ampere) Cc: Dave Chinner Cc: Dennis Zhou Cc: Jann Horn Cc: Liam Howlett Cc: Muchun Song Cc: Nico Pache Cc: Oscar Salvador Cc: Qi Zheng Cc: Roman Gushchin Cc: Tejun Heo Cc: Dev Jain Cc: Michal Hocko Cc: Mike Rapoport Cc: Pedro Falcato Cc: Suren Baghdasaryan Cc: Shakeel Butt Cc: Joshua Hahn Signed-off-by: Andrew Morton --- MAINTAINERS | 3 +++ 1 file changed, 3 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 421e1c68cfd4..3d0e66d5385d 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15880,6 +15880,7 @@ F: include/linux/gfp.h F: include/linux/page-isolation.h F: mm/compaction.c F: mm/debug_page_alloc.c +F: mm/debug_page_ref.c F: mm/fail_page_alloc.c F: mm/page_alloc.c F: mm/page_ext.c @@ -15888,8 +15889,10 @@ F: mm/page_isolation.c F: mm/page_owner.c F: mm/page_poison.c F: mm/page_reporting.c +F: mm/page_reporting.h F: mm/show_mem.c F: mm/shuffle.c +F: mm/shuffle.h MEMORY MANAGEMENT - RECLAIM M: Andrew Morton From a5c9fcb18c5a94932a50e2ce1549c8c2396530c4 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Tue, 22 Jul 2025 19:18:27 +0100 Subject: [PATCH 367/370] MAINTAINERS: add missing zsmalloc file The mm/zpdesc.h file is only included by mm/zsmalloc.c so the zsmalloc section seems the most appropriate place for this file. 
Link: https://lkml.kernel.org/r/20250722181827.156035-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Cc: Alistair Popple Cc: Christoph Lameter (Ampere) Cc: Dave Chinner Cc: David Hildenbrand Cc: Dennis Zhou Cc: Dev Jain Cc: Jann Horn Cc: Liam Howlett Cc: Michal Hocko Cc: Mike Rapoport Cc: Muchun Song Cc: Nico Pache Cc: Oscar Salvador Cc: Pedro Falcato Cc: Qi Zheng Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Tejun Heo Cc: Vlastimil Babka Cc: Zi Yan Cc: Shakeel Butt Cc: Joshua Hahn Signed-off-by: Andrew Morton --- MAINTAINERS | 1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index 3d0e66d5385d..e593d5af0a80 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -27465,6 +27465,7 @@ L: linux-mm@kvack.org S: Maintained F: Documentation/mm/zsmalloc.rst F: include/linux/zsmalloc.h +F: mm/zpdesc.h F: mm/zsmalloc.c ZSTD From e23210425c594b0d58c5bae4a955346c2a7b6b1c Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Thu, 24 Jul 2025 14:33:56 +0100 Subject: [PATCH 368/370] MAINTAINERS: add MM MISC section, add missing files to MISC and CORE Add a MEMORY MANAGEMENT - MISC section to contain files that are not described by other sections, moving all but the catch-all mm/ and tools/mm/ from MEMORY MANAGEMENT to MEMORY MANAGEMENT - CORE and MEMORY MANAGEMENT - MISC as appropriate. In both sections add remaining missing files. At this point, with the other recent MAINTAINERS changes, this should now mean that every memory management-related file has a section and assigned maintainers/reviewers. Finally, we copy across the maintainers/reviewers from MEMORY MANAGEMENT - CORE to MEMORY MANAGEMENT - MISC, as it seems the two are sufficiently related for this to be sensible. Link: https://lkml.kernel.org/r/20250724133356.49487-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: Mike Rapoport (Microsoft) Acked-by: Vlastimil Babka Cc: David Hildenbrand Cc: Liam Howlett Cc: Michal Hocko Cc: Suren Baghdasaryan Cc: Alistair Popple Cc: Christoph Lameter (Ampere) Cc: Dave Chinner Cc: Dennis Zhou Cc: Dev Jain Cc: Jann Horn Cc: Muchun Song Cc: Nico Pache Cc: Oscar Salvador Cc: Pedro Falcato Cc: Qi Zheng Cc: Roman Gushchin Cc: Shakeel Butt Cc: Tejun Heo Cc: Zi Yan Cc: Joshua Hahn Signed-off-by: Andrew Morton --- MAINTAINERS | 72 ++++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 58 insertions(+), 14 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index e593d5af0a80..f17cc8d8f8e8 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15740,22 +15740,8 @@ S: Maintained W: http://www.linux-mm.org T: git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm T: quilt git://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new -F: Documentation/admin-guide/mm/ -F: Documentation/mm/ -F: include/linux/gfp.h -F: include/linux/gfp_types.h -F: include/linux/memory_hotplug.h -F: include/linux/memory-tiers.h -F: include/linux/mempolicy.h -F: include/linux/mempool.h -F: include/linux/mmzone.h -F: include/linux/mmu_notifier.h -F: include/linux/pagewalk.h -F: include/trace/events/ksm.h F: mm/ F: tools/mm/ -F: tools/testing/selftests/mm/ -N: include/linux/page[-_]* MEMORY MANAGEMENT - CORE M: Andrew Morton @@ -15770,18 +15756,40 @@ L: linux-mm@kvack.org S: Maintained W: http://www.linux-mm.org T: git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm +F: include/linux/gfp.h +F: include/linux/gfp_types.h +F: include/linux/highmem.h F: include/linux/memory.h F: include/linux/mm.h F: include/linux/mm_*.h +F: include/linux/mmzone.h F: include/linux/mmdebug.h +F: 
include/linux/mmu_notifier.h F: include/linux/pagewalk.h +F: include/linux/pgtable.h +F: include/linux/ptdump.h +F: include/linux/vmpressure.h +F: include/linux/vmstat.h F: kernel/fork.c F: mm/Kconfig F: mm/debug.c +F: mm/folio-compat.c +F: mm/highmem.c F: mm/init-mm.c +F: mm/internal.h +F: mm/maccess.c F: mm/memory.c +F: mm/mmu_notifier.c +F: mm/mmzone.c F: mm/pagewalk.c +F: mm/pgtable-generic.c +F: mm/ptdump.c +F: mm/sparse-vmemmap.c +F: mm/sparse.c F: mm/util.c +F: mm/vmpressure.c +F: mm/vmstat.c +N: include/linux/page[-_]* MEMORY MANAGEMENT - EXECMEM M: Andrew Morton @@ -15844,6 +15852,42 @@ F: mm/mempolicy.c F: mm/migrate.c F: mm/migrate_device.c +MEMORY MANAGEMENT - MISC +M: Andrew Morton +M: David Hildenbrand +R: Lorenzo Stoakes +R: Liam R. Howlett +R: Vlastimil Babka +R: Mike Rapoport +R: Suren Baghdasaryan +R: Michal Hocko +L: linux-mm@kvack.org +S: Maintained +W: http://www.linux-mm.org +T: git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm +F: Documentation/admin-guide/mm/ +F: Documentation/mm/ +F: include/linux/cma.h +F: include/linux/dmapool.h +F: include/linux/ioremap.h +F: include/linux/memory-tiers.h +F: include/linux/page_idle.h +F: mm/backing-dev.c +F: mm/cma.c +F: mm/cma_debug.c +F: mm/cma_sysfs.c +F: mm/dmapool.c +F: mm/dmapool_test.c +F: mm/early_ioremap.c +F: mm/fadvise.c +F: mm/ioremap.c +F: mm/mapping_dirty_helpers.c +F: mm/memory-tiers.c +F: mm/page_idle.c +F: mm/pgalloc-track.h +F: mm/process_vm_access.c +F: tools/testing/selftests/mm/ + MEMORY MANAGEMENT - NUMA MEMBLOCKS AND NUMA EMULATION M: Andrew Morton M: Mike Rapoport From 1729003f284d5f8f1bdd0c7e591b94003ebfb1dd Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Thu, 24 Jul 2025 14:54:21 +0100 Subject: [PATCH 369/370] MAINTAINERS: add missing file to cgroup section The page_counter files seems most appropriately placed here. Link: https://lkml.kernel.org/r/20250724135421.54510-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: Shakeel Butt Cc: Alistair Popple Cc: Christoph Lameter (Ampere) Cc: Dave Chinner Cc: David Hildenbrand Cc: Dennis Zhou Cc: Dev Jain Cc: Jann Horn Cc: Liam Howlett Cc: Michal Hocko Cc: Mike Rapoport (Microsoft) Cc: Muchun Song Cc: Nico Pache Cc: Oscar Salvador Cc: Pedro Falcato Cc: Qi Zheng Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Tejun Heo Cc: Vlastimil Babka Cc: Zi Yan Cc: Joshua Hahn Signed-off-by: Andrew Morton --- MAINTAINERS | 2 ++ 1 file changed, 2 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index f17cc8d8f8e8..3f7522f0d279 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -6163,9 +6163,11 @@ L: cgroups@vger.kernel.org L: linux-mm@kvack.org S: Maintained F: include/linux/memcontrol.h +F: include/linux/page_counter.h F: mm/memcontrol.c F: mm/memcontrol-v1.c F: mm/memcontrol-v1.h +F: mm/page_counter.c F: mm/swap_cgroup.c F: samples/cgroup/* F: tools/testing/selftests/cgroup/memcg_protection.m From af915c3c13b64d196d1c305016092f5da20942c4 Mon Sep 17 00:00:00 2001 From: Joshua Hahn Date: Fri, 25 Jul 2025 10:56:15 -0700 Subject: [PATCH 370/370] MAINTAINERS: add missing headers to mempory policy & migration section These two files currently do not belong to any section. The memory policy & migration section seems to be a good home for them! 
Link: https://lkml.kernel.org/r/20250725175616.2397031-1-joshua.hahnjy@gmail.com Signed-off-by: Joshua Hahn Acked-by: David Hildenbrand Cc: Alistair Popple Cc: Byungchul Park Cc: Gregory Price Cc: "Huang, Ying" Cc: Lorenzo Stoakes Cc: Mathew Brost Cc: Rakie Kim Cc: Zi Yan Signed-off-by: Andrew Morton --- MAINTAINERS | 2 ++ 1 file changed, 2 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 3f7522f0d279..1a4af7cd2b16 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15849,7 +15849,9 @@ S: Maintained W: http://www.linux-mm.org T: git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm F: include/linux/mempolicy.h +F: include/uapi/linux/mempolicy.h F: include/linux/migrate.h +F: include/linux/migrate_mode.h F: mm/mempolicy.c F: mm/migrate.c F: mm/migrate_device.c