From 36eed5400805b294f1df39b0e3ebc5b7971b3c16 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Sun, 30 Mar 2025 17:20:48 +0100 Subject: [PATCH 01/31] mm/mremap: do not set vrm->vma NULL immediately prior to checking it

This seems rather unwise. If we cannot merge to extend, then we need to recall the original VMA to see if we need to uncharge. If we do need to, do so.

Link: https://lkml.kernel.org/r/b2fb6b9c-376d-4e9b-905e-26d847fd3865@lucifer.local Fixes: d5c8aec0542e ("mm/mremap: initial refactor of move_vma()") Signed-off-by: Lorenzo Stoakes Reported-by: "Lai, Yi" Closes: https://lore.kernel.org/linux-mm/Z+lcvEIHMLiKVR1i@ly-workstation/ Cc: Liam R. Howlett Cc: Vlastimil Babka Cc: Harry Yoo Signed-off-by: Andrew Morton --- mm/mremap.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/mremap.c b/mm/mremap.c index 0865387531ed..7db9da609c84 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -1561,11 +1561,12 @@ static unsigned long expand_vma_in_place(struct vma_remap_struct *vrm) * adjacent to the expanded vma and otherwise * compatible. */ - vma = vrm->vma = vma_merge_extend(&vmi, vma, vrm->delta); + vma = vma_merge_extend(&vmi, vma, vrm->delta); if (!vma) { vrm_uncharge(vrm); return -ENOMEM; } + vrm->vma = vma; vrm_stat_account(vrm, vrm->delta);

From 7a95a05f15d570e6087fea59280fe267fe809100 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Sat, 22 Mar 2025 19:21:45 -0400 Subject: [PATCH 02/31] mm: page_alloc: fix defrag_mode's retry & OOM path

Brendan points out that defrag_mode doesn't properly clear ALLOC_NOFRAGMENT on its last-ditch attempt to allocate. But looking closer, the problem is actually more severe: it doesn't actually *check* whether it's already retried, and keeps looping. This means the OOM path is never taken, and the thread can loop indefinitely. This is verified with an intentional OOM test on defrag_mode=1, which results in the machine hanging. After this patch, it triggers the OOM kill reliably and recovers. Clear ALLOC_NOFRAGMENT properly, and only retry once.

Link: https://lkml.kernel.org/r/20250401041231.GA2117727@cmpxchg.org Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode") Signed-off-by: Johannes Weiner Reported-by: Brendan Jackman Signed-off-by: Andrew Morton --- mm/page_alloc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f51aa6051a99..37d111184eee 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4604,8 +4604,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, goto retry; /* Reclaim/compaction failed to prevent the fallback */ - if (defrag_mode) { - alloc_flags &= ALLOC_NOFRAGMENT; + if (defrag_mode && (alloc_flags & ALLOC_NOFRAGMENT)) { + alloc_flags &= ~ALLOC_NOFRAGMENT; goto retry; }

From 7fa46cdfffd29459f9ebf6ed891a4c721db06a33 Mon Sep 17 00:00:00 2001 From: Harry Yoo Date: Tue, 18 Mar 2025 10:59:26 +0900 Subject: [PATCH 03/31] mm/kasan: use SLAB_NO_MERGE flag instead of an empty constructor

Use the SLAB_NO_MERGE flag to prevent merging instead of providing an empty constructor. Using an empty constructor in this manner is an abuse of the slab interface. The SLAB_NO_MERGE flag should be used with caution, but in this case, it is acceptable as the cache is intended solely for debugging purposes. No functional changes intended.
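For illustration only (not part of the patch; the cache names are hypothetical): slab merging is what made the empty constructor necessary in the first place, because with default flags and no constructor the allocator may alias compatible caches to a single merged cache:

	/*
	 * Illustrative sketch: with no constructor and default flags, the
	 * slab allocator is free to merge compatible caches, so the second
	 * create may return the very same cache, and kmem_cache_destroy()
	 * would then merely drop a reference rather than destroy it.
	 */
	struct kmem_cache *a = kmem_cache_create("cache_a", 200, 0, 0, NULL);
	struct kmem_cache *b = kmem_cache_create("cache_b", 200, 0, 0, NULL);
	/* a == b is possible here when slab merging is active */

	/* SLAB_NO_MERGE guarantees a distinct cache, as the test requires: */
	struct kmem_cache *c = kmem_cache_create("cache_c", 200, 0,
						 SLAB_NO_MERGE, NULL);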
Link: https://lkml.kernel.org/r/20250318015926.1629748-1-harry.yoo@oracle.com Signed-off-by: Harry Yoo Reviewed-by: Alexander Potapenko Reviewed-by: Andrey Konovalov Acked-by: Andrey Ryabinin Cc: Dmitriy Vyukov Cc: Vincenzo Frascino Signed-off-by: Andrew Morton --- mm/kasan/kasan_test_c.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/mm/kasan/kasan_test_c.c b/mm/kasan/kasan_test_c.c index 59d673400085..3ea317837c2d 100644 --- a/mm/kasan/kasan_test_c.c +++ b/mm/kasan/kasan_test_c.c @@ -1073,14 +1073,11 @@ static void kmem_cache_rcu_uaf(struct kunit *test) kmem_cache_destroy(cache); } -static void empty_cache_ctor(void *object) { } - static void kmem_cache_double_destroy(struct kunit *test) { struct kmem_cache *cache; - /* Provide a constructor to prevent cache merging. */ - cache = kmem_cache_create("test_cache", 200, 0, 0, empty_cache_ctor); + cache = kmem_cache_create("test_cache", 200, 0, SLAB_NO_MERGE, NULL); KUNIT_ASSERT_NOT_ERR_OR_NULL(test, cache); kmem_cache_destroy(cache); KUNIT_EXPECT_KASAN_FAIL(test, kmem_cache_destroy(cache)); From 7f29070f4c8599dfe7582415ef162474588fd462 Mon Sep 17 00:00:00 2001 From: Taotao Chen Date: Thu, 20 Mar 2025 18:44:00 +0800 Subject: [PATCH 04/31] mm/damon/core: simplify control flow in damon_register_ops() The function logic is not complex, so using goto is unnecessary. Replace it with a straightforward if-else to simplify control flow and improve readability. Link: https://lkml.kernel.org/r/Z9vxcPCw8tDsjKw1@OneApple Signed-off-by: Taotao Chen Reviewed-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/core.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/mm/damon/core.c b/mm/damon/core.c index fc1eba3da419..f0c1676f0599 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -76,14 +76,13 @@ int damon_register_ops(struct damon_operations *ops) if (ops->id >= NR_DAMON_OPS) return -EINVAL; + mutex_lock(&damon_ops_lock); /* Fail for already registered ops */ - if (__damon_is_registered_ops(ops->id)) { + if (__damon_is_registered_ops(ops->id)) err = -EINVAL; - goto out; - } - damon_registered_ops[ops->id] = *ops; -out: + else + damon_registered_ops[ops->id] = *ops; mutex_unlock(&damon_ops_lock); return err; } From bd145bdd26c6845b5403d47b3b094bb5d020c6ef Mon Sep 17 00:00:00 2001 From: Ye Liu Date: Thu, 20 Mar 2025 14:33:46 +0800 Subject: [PATCH 05/31] mm/page_alloc: replace flag check with PageHWPoison() in check_new_page_bad() This patch replaces the direct check for the __PG_HWPOISON flag with the PageHWPoison() macro, improving code readability and maintaining consistency with other parts of the memory management code. 
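For reference, a sketch of why the two forms test the same bit (assuming CONFIG_MEMORY_FAILURE=y; with it disabled, PageHWPoison() is a stub returning false and __PG_HWPOISON is 0, so the behavior still matches):

	/* Sketch only, not part of the patch: */
	#define __PG_HWPOISON	(1UL << PG_hwpoison)

	/* Open-coded flag test removed by the patch: */
	bool bad_old = page->flags & __PG_HWPOISON;

	/* Generated page-flag helper used instead: */
	bool bad_new = PageHWPoison(page);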
Link: https://lkml.kernel.org/r/20250320063346.489030-1-ye.liu@linux.dev Signed-off-by: Ye Liu Reviewed-by: Sidhartha Kumar Reviewed-by: Anshuman Khandual Signed-off-by: Andrew Morton --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 37d111184eee..e892a55b471c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1593,7 +1593,7 @@ static __always_inline void page_del_and_expand(struct zone *zone, static void check_new_page_bad(struct page *page) { - if (unlikely(page->flags & __PG_HWPOISON)) { + if (unlikely(PageHWPoison(page))) { /* Don't complain about hwpoisoned pages */ if (PageBuddy(page)) __ClearPageBuddy(page); From 4a0cb631447fdcfb870a0b56950272cf25c0a6ee Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Fri, 21 Mar 2025 20:21:24 -0400 Subject: [PATCH 06/31] MAINTAINERS: add peterx as userfaultfd reviewer Add an entry for userfaultfd and make myself a reviewer of it, just in case it helps people manage the cc list. I named it MEMORY USERFAULTFD, could be a bad name, but then it can be together with the MEMORY* entries when everything is in alphabetic order, which is definitely a benefit. The line may not change much on how I'd work with userfaultfd; I think I'll do the same as before.. But maybe it still, more or less, adds some responsibility on top, indeed. [akpm@linux-foundation.org: add include/linux/userfaultfd_k.h, per Mike] [akpm@linux-foundation.org: fix misordering] Link: https://lkml.kernel.org/r/20250322002124.131736-1-peterx@redhat.com Signed-off-by: Peter Xu Cc: Andrea Arcangeli Suggested-by: Andrew Morton Acked-by: Liam R. Howlett Acked-by: Lorenzo Stoakes Cc: Mike Rapoport Signed-off-by: Andrew Morton --- MAINTAINERS | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index b1fe0b766cec..3ef706b4b01a 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15517,6 +15517,18 @@ F: mm/vma.h F: mm/vma_internal.h F: tools/testing/vma/ +MEMORY USERFAULTFD +M: Andrew Morton +R: Peter Xu +S: Maintained +F: Documentation/admin-guide/mm/userfaultfd.rst +F: fs/userfaultfd.c +F: include/asm-generic/pgtable_uffd.h +F: include/linux/userfaultfd_k.h +F: include/uapi/linux/userfaultfd.h +F: mm/userfaultfd.c +F: tools/testing/selftests/mm/uffd-*.[ch] + MEMORY TECHNOLOGY DEVICES (MTD) M: Miquel Raynal M: Richard Weinberger From 2ebc3b68ac400444d8cc2ec1f4460c37aa7d28da Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Tue, 25 Mar 2025 13:49:27 +0200 Subject: [PATCH 07/31] mm/mm_init: init holes in the end of the memory map for FLATMEM Patch series "mm: fixes for fallouts from mem_init() cleanup". These are the fixes for fallouts from mem_init() cleanup reported by Nathan Chancellor and kbuild. The details are in the commit messages. This patch (of 2): Kernel test robot reports the following crash on 32-bit system with FLATMEM and DEBUG_VM_PGFLAGS enabled: [ 0.478822][ T0] kernel BUG at include/linux/page-flags.h:536! 
[ 0.479312][ T0] Oops: invalid opcode: 0000 [#1] PREEMPT SMP [ 0.479768][ T0] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.14.0-rc6-00357-g8268af309d07 #1 [ 0.480470][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 0.481260][ T0] EIP: reserve_bootmem_region (include/linux/page-flags.h:536) [ 0.481683][ T0] Code: 5d c3 01 f1 89 c8 ba e1 38 f4 c3 e8 1e 37 8e fc 0f 0b b8 90 e2 62 c4 e8 e2 05 5e fc 01 f1 89 c8 ba be 85 f7 c3 e8 04 37 8e fc <0f> 0b b8 80 e2 62 c4 e8 c8 05 5e fc 55 89 e5 53 57 56 83 ec 10 89 [ 0.483177][ T0] EAX: 00000000 EBX: c425df50 ECX: 00000000 EDX: 00000000 [ 0.483712][ T0] ESI: 017ffc00 EDI: ffffffff EBP: c425df34 ESP: c425df2c [ 0.484248][ T0] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046 [ 0.484846][ T0] CR0: 80050033 CR2: 00000000 CR3: 04b48000 CR4: 00000090 [ 0.485376][ T0] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 0.485907][ T0] DR6: fffe0ff0 DR7: 00000400 [ 0.486253][ T0] Call Trace: [ 0.486494][ T0] ? __die_body (arch/x86/kernel/dumpstack.c:478) [ 0.486822][ T0] ? die (arch/x86/kernel/dumpstack.c:?) [ 0.487099][ T0] ? do_trap (arch/x86/kernel/traps.c:? arch/x86/kernel/traps.c:197) [ 0.487409][ T0] ? do_error_trap (arch/x86/kernel/traps.c:217) [ 0.487752][ T0] ? reserve_bootmem_region (include/linux/page-flags.h:536) [ 0.488153][ T0] ? exc_overflow (arch/x86/kernel/traps.c:301) [ 0.488490][ T0] ? handle_invalid_op (arch/x86/kernel/traps.c:254) [ 0.488869][ T0] ? reserve_bootmem_region (include/linux/page-flags.h:536) [ 0.489271][ T0] ? exc_invalid_op (arch/x86/kernel/traps.c:316) [ 0.489619][ T0] ? handle_exception (arch/x86/entry/entry_32.S:1055) [ 0.489996][ T0] ? exc_overflow (arch/x86/kernel/traps.c:301) [ 0.490332][ T0] ? reserve_bootmem_region (include/linux/page-flags.h:536) [ 0.490733][ T0] ? exc_overflow (arch/x86/kernel/traps.c:301) [ 0.491068][ T0] ? reserve_bootmem_region (include/linux/page-flags.h:536) [ 0.491470][ T0] memmap_init_reserved_pages (mm/memblock.c:2203) [ 0.491887][ T0] free_low_memory_core_early (mm/memblock.c:?) [ 0.492302][ T0] memblock_free_all (mm/memblock.c:2272 include/linux/atomic/atomic-arch-fallback.h:546 include/linux/atomic/atomic-long.h:123 include/linux/atomic/atomic-instrumented.h:3261 include/linux/mm.h:67 mm/memblock.c:2273) [ 0.492659][ T0] mem_init (arch/x86/mm/init_32.c:735) [ 0.492952][ T0] mm_core_init (mm/mm_init.c:2730) [ 0.493271][ T0] start_kernel (init/main.c:958) [ 0.493604][ T0] i386_start_kernel (arch/x86/kernel/head32.c:79) [ 0.493969][ T0] startup_32_smp (arch/x86/kernel/head_32.S:292) The crash happens because after commit 8268af309d07 ("arch, mm: set max_mapnr when allocating memory map for FLATMEM") max_mapnr is rounded up to MAX_ORDER_NR_PAGES and the pages in the end of the memory map are passing pfn_valid() check in reserve_bootmem_region(). Make sure that that pages in the end of the memory map are initialized, just like the pages in the end of the last section for SPARSEMEM. Link: https://lkml.kernel.org/r/20250325114928.1791109-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20250325114928.1791109-2-rppt@kernel.org Fixes: 8268af309d07 ("arch, mm: set max_mapnr when allocating memory map for FLATMEM") Signed-off-by: Mike Rapoport (Microsoft) Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202503241424.d16223ec-lkp@intel.com Cc: Andy Lutomirski Cc: Borislav Betkov Cc: Dave Hansen Cc: "H. 
Peter Anvin" Cc: Ingo Molnar Cc: Jiaxun Yang Cc: Nathan Chancellor Cc: Thomas Bogendoerfer Cc: Thomas Gleinxer Signed-off-by: Andrew Morton --- mm/mm_init.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/mm/mm_init.c b/mm/mm_init.c index a38a1909b407..84f14fa12d0d 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -984,19 +984,19 @@ static void __init memmap_init(void) } } -#ifdef CONFIG_SPARSEMEM /* * Initialize the memory map for hole in the range [memory_end, - * section_end]. + * section_end] for SPARSEMEM and in the range [memory_end, memmap_end] + * for FLATMEM. * Append the pages in this hole to the highest zone in the last * node. - * The call to init_unavailable_range() is outside the ifdef to - * silence the compiler warining about zone_id set but not used; - * for FLATMEM it is a nop anyway */ +#ifdef CONFIG_SPARSEMEM end_pfn = round_up(end_pfn, PAGES_PER_SECTION); - if (hole_pfn < end_pfn) +#else + end_pfn = round_up(end_pfn, MAX_ORDER_NR_PAGES); #endif + if (hole_pfn < end_pfn) init_unavailable_range(hole_pfn, end_pfn, zone_id, nid); } From 7790c9c9265eed6d1ae1f03e688b880f409e835d Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Tue, 25 Mar 2025 13:49:28 +0200 Subject: [PATCH 08/31] memblock: don't release high memory to page allocator when HIGHMEM is off Nathan Chancellor reports the following crash on a MIPS system with CONFIG_HIGHMEM=n: Linux version 6.14.0-rc6-00359-g6faea3422e3b (nathan@ax162) (mips-linux-gcc (GCC) 14.2.0, GNU ld (GNU Binutils) 2.42) #1 SMP Fri Mar 21 08:12:02 MST 2025 earlycon: uart8250 at I/O port 0x3f8 (options '38400n8') printk: legacy bootconsole [uart8250] enabled Config serial console: console=ttyS0,38400n8r CPU0 revision is: 00019300 (MIPS 24Kc) FPU revision is: 00739300 MIPS: machine is mti,malta Software DMA cache coherency enabled Initial ramdisk at: 0x8fad0000 (5360128 bytes) OF: reserved mem: Reserved memory: No reserved-memory node in the DT Primary instruction cache 2kB, VIPT, 2-way, linesize 16 bytes. Primary data cache 2kB, 2-way, VIPT, no aliases, linesize 16 bytes Zone ranges: DMA [mem 0x0000000000000000-0x0000000000ffffff] Normal [mem 0x0000000001000000-0x000000001fffffff] Movable zone start for each node Early memory node ranges node 0: [mem 0x0000000000000000-0x000000000fffffff] node 0: [mem 0x0000000090000000-0x000000009fffffff] Initmem setup node 0 [mem 0x0000000000000000-0x000000009fffffff] On node 0, zone Normal: 16384 pages in unavailable ranges random: crng init done percpu: Embedded 3 pages/cpu s18832 r8192 d22128 u49152 Kernel command line: rd_start=0xffffffff8fad0000 rd_size=5360128 console=ttyS0,38400n8r printk: log buffer data + meta data: 32768 + 102400 = 135168 bytes Dentry cache hash table entries: 65536 (order: 4, 262144 bytes, linear) Inode-cache hash table entries: 32768 (order: 3, 131072 bytes, linear) Writing ErrCtl register=00000000 Readback ErrCtl register=00000000 Built 1 zonelists, mobility grouping on. 
Total pages: 16384 mem auto-init: stack:all(zero), heap alloc:off, heap free:off Unhandled kernel unaligned access[#1]: CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.14.0-rc6-00359-g6faea3422e3b #1 Hardware name: mti,malta $ 0 : 00000000 00000001 81cb0880 00129027 $ 4 : 00000001 0000000a 00000002 00129026 $ 8 : ffffdfff 80101e00 00000002 00000000 $12 : 81c9c224 81c63e68 00000002 00000000 $16 : 805b1e00 00025800 81cb0880 00000002 $20 : 00000000 81c63e64 0000000a 81f10000 $24 : 81c63e64 81c63e60 $28 : 81c60000 81c63de0 00000001 81cc9d20 Hi : 00000000 Lo : 00000000 epc : 814a227c __free_pages_ok+0x144/0x3c0 ra : 81cc9d20 memblock_free_all+0x1d4/0x27c Status: 10000002 KERNEL EXL Cause : 00800410 (ExcCode 04) BadVA : 00129026 PrId : 00019300 (MIPS 24Kc) Modules linked in: Process swapper (pid: 0, threadinfo=(ptrval), task=(ptrval), tls=00000000) Stack : 81f10000 805a9e00 81c80000 00000000 00000002 814aa240 000003ff 00000400 00000000 81f10000 81c9c224 00003b1f 81c80000 81c63e60 81ca0000 81c63e64 81f10000 0000000a 0000001f 81cc9d20 81f10000 81cc96d8 00000000 81c80000 81c9c224 81c63e60 81c63e64 00000000 81f10000 00024000 00028000 00025c00 90000000 a0000000 00000002 00000017 00000000 00000000 81f10000 81f10000 ... Call Trace: [<814a227c>] __free_pages_ok+0x144/0x3c0 [<81cc9d20>] memblock_free_all+0x1d4/0x27c [<81cc6764>] mm_core_init+0x100/0x138 [<81cb4ba4>] start_kernel+0x4a0/0x6e4 Code: 1080ffd5 02003825 2467ffff <8ce30000> 7c630500 1060ffd4 00000000 8ce30000 7c630180 The crash happens because commit 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing") too eagerly frees high memory to the page allocator even when HIGHMEM is disabled. Make sure that when CONFIG_HIGHMEM=n the high memory is not released to the page allocator. Link: https://lore.kernel.org/all/20250323190647.GA1009914@ax162 Link: https://lkml.kernel.org/r/20250325114928.1791109-3-rppt@kernel.org Reported-by: Nathan Chancellor Fixes: 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing") Signed-off-by: Mike Rapoport (Microsoft) Tested-by: Nathan Chancellor Cc: Andy Lutomirski Cc: Borislav Betkov Cc: Dave Hansen Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jiaxun Yang Cc: Thomas Bogendoerfer Cc: Thomas Gleinxer Signed-off-by: Andrew Morton --- mm/memblock.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/memblock.c b/mm/memblock.c index 284154445409..0a53db4d9f7b 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -2167,6 +2167,9 @@ static unsigned long __init __free_memory_core(phys_addr_t start, unsigned long start_pfn = PFN_UP(start); unsigned long end_pfn = PFN_DOWN(end); + if (!IS_ENABLED(CONFIG_HIGHMEM) && end_pfn > max_low_pfn) + end_pfn = max_low_pfn; + if (start_pfn >= end_pfn) return 0; From 983e760bcdb6f41ff3cf59e70e32a529537d6ef2 Mon Sep 17 00:00:00 2001 From: Li Wang Date: Thu, 27 Mar 2025 19:48:13 +0800 Subject: [PATCH 09/31] selftest/mm: va_high_addr_switch: add ppc64 support check Add PPC64 Radix MMU support to the va_high_addr_switch.sh by introducing check_supported_ppc64(). The function verifies: - 5-level paging (PGTABLE_LEVELS >= 5) enable in kernel config - Radix MMU (required for PPC64 5-level translation) - HugePages availability (needed for some tests) If any check fails, the test is skipped (ksft_skip). This ensures compatibility with Power9/Power10 systems running in Radix MMU mode. 
Avoid failures on 4-level paging system: # mmap(NULL, MAP_HUGETLB): 0xffffffffffffffff - FAILED # mmap(LOW_ADDR, MAP_HUGETLB): 0xffffffffffffffff - FAILED # mmap(HIGH_ADDR, MAP_HUGETLB): 0xffffffffffffffff - FAILED # mmap(HIGH_ADDR, MAP_HUGETLB) again: 0xffffffffffffffff - FAILED # mmap(HIGH_ADDR, MAP_FIXED | MAP_HUGETLB): 0xffffffffffffffff - FAILED # mmap(-1, MAP_HUGETLB): 0xffffffffffffffff - FAILED # mmap(-1, MAP_HUGETLB) again: 0xffffffffffffffff - FAILED # mmap(ADDR_SWITCH_HINT - PAGE_SIZE, 2*HUGETLB_SIZE, MAP_HUGETLB): 0xffffffffffffffff - FAILED # mmap(ADDR_SWITCH_HINT , 2*HUGETLB_SIZE, MAP_FIXED | MAP_HUGETLB): 0xffffffffffffffff - FAILED

Link: https://lkml.kernel.org/r/20250327114813.25980-1-liwang@redhat.com Signed-off-by: Li Wang Cc: Anshuman Khandual Cc: Dev Jain Cc: Kirill A. Shutemov Cc: Shuah Khan Signed-off-by: Andrew Morton --- .../selftests/mm/va_high_addr_switch.sh | 28 +++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/tools/testing/selftests/mm/va_high_addr_switch.sh b/tools/testing/selftests/mm/va_high_addr_switch.sh index 2c725773cd79..1f92e8caceac 100755 --- a/tools/testing/selftests/mm/va_high_addr_switch.sh +++ b/tools/testing/selftests/mm/va_high_addr_switch.sh @@ -41,6 +41,31 @@ check_supported_x86_64() fi } +check_supported_ppc64() +{ + local config="/proc/config.gz" + [[ -f "${config}" ]] || config="/boot/config-$(uname -r)" + [[ -f "${config}" ]] || fail "Cannot find kernel config in /proc or /boot" + + local pg_table_levels=$(gzip -dcfq "${config}" | grep PGTABLE_LEVELS | cut -d'=' -f 2) + if [[ "${pg_table_levels}" -lt 5 ]]; then + echo "$0: PGTABLE_LEVELS=${pg_table_levels}, must be >= 5 to run this test" + exit $ksft_skip + fi + + local mmu_support=$(grep -m1 "mmu" /proc/cpuinfo | awk '{print $3}') + if [[ "$mmu_support" != "radix" ]]; then + echo "$0: System does not use Radix MMU, required for 5-level paging" + exit $ksft_skip + fi + + local hugepages_total=$(awk '/HugePages_Total/ {print $2}' /proc/meminfo) + if [[ "${hugepages_total}" -eq 0 ]]; then + echo "$0: HugePages are not enabled, required for some tests" + exit $ksft_skip + fi +} + check_test_requirements() { # The test supports x86_64 and powerpc64. We currently have no useful @@ -50,6 +75,9 @@ check_test_requirements() "x86_64") check_supported_x86_64 ;; + "ppc64le"|"ppc64") + check_supported_ppc64 + ;; *) return 0 ;;

From 59aa44d1ee5c73a230ad11a9c3610631ea49982b Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 26 Mar 2025 23:55:38 +0200 Subject: [PATCH 10/31] MAINTAINERS: fixup USERFAULTFD entry

Patch series "MAINTAINERS: add my sub-entries to MM part."

Following discussion at LSF/MM/BPF I'm adding execmem, secretmem and numa memblocks sub-entries for MEMORY MANAGEMENT in MAINTAINERS.
This patch (of 4): Change title to "MEMORY MANAGEMENT - USERFAULTFD" and make it sub-topic in memory management and add missing include/linux/userfaultfd_k.h and mailing list Link: https://lkml.kernel.org/r/20250326215541.1809379-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20250326215541.1809379-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Acked-by: Michal Hocko Acked-by: Oscar Salvador Acked-by: Vlastimil Babka Cc: Dan Willaims Cc: "Mike Rapoport (IBM)" Cc: Peter Xu Signed-off-by: Andrew Morton --- MAINTAINERS | 25 +++++++++++++------------ 1 file changed, 13 insertions(+), 12 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index 3ef706b4b01a..d143d3f7ca62 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15497,6 +15497,19 @@ F: tools/mm/ F: tools/testing/selftests/mm/ N: include/linux/page[-_]* +MEMORY MANAGEMENT - USERFAULTFD +M: Andrew Morton +R: Peter Xu +L: linux-mm@kvack.org +S: Maintained +F: Documentation/admin-guide/mm/userfaultfd.rst +F: fs/userfaultfd.c +F: include/asm-generic/pgtable_uffd.h +F: include/linux/userfaultfd_k.h +F: include/uapi/linux/userfaultfd.h +F: mm/userfaultfd.c +F: tools/testing/selftests/mm/uffd-*.[ch] + MEMORY MAPPING M: Andrew Morton M: Liam R. Howlett @@ -15517,18 +15530,6 @@ F: mm/vma.h F: mm/vma_internal.h F: tools/testing/vma/ -MEMORY USERFAULTFD -M: Andrew Morton -R: Peter Xu -S: Maintained -F: Documentation/admin-guide/mm/userfaultfd.rst -F: fs/userfaultfd.c -F: include/asm-generic/pgtable_uffd.h -F: include/linux/userfaultfd_k.h -F: include/uapi/linux/userfaultfd.h -F: mm/userfaultfd.c -F: tools/testing/selftests/mm/uffd-*.[ch] - MEMORY TECHNOLOGY DEVICES (MTD) M: Miquel Raynal M: Richard Weinberger From 8871b533ef995ddf0fe45cdfe08665d3edd84777 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 26 Mar 2025 23:55:39 +0200 Subject: [PATCH 11/31] MAINTAINERS: mm: add entry for execmem Link: https://lkml.kernel.org/r/20250326215541.1809379-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Acked-by: Michal Hocko Acked-by: Oscar Salvador Acked-by: Vlastimil Babka Cc: Dan Willaims Cc: Peter Xu Signed-off-by: Andrew Morton --- MAINTAINERS | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index d143d3f7ca62..c6c5b78da2c4 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15497,6 +15497,14 @@ F: tools/mm/ F: tools/testing/selftests/mm/ N: include/linux/page[-_]* +MEMORY MANAGEMENT - EXECMEM +M: Andrew Morton +M: Mike Rapoport +L: linux-mm@kvack.org +S: Maintained +F: include/linux/execmem.h +F: mm/execmem.c + MEMORY MANAGEMENT - USERFAULTFD M: Andrew Morton R: Peter Xu From 6985850f3e0bfd36e027ea2c0ec08b5deb5b371a Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 26 Mar 2025 23:55:40 +0200 Subject: [PATCH 12/31] MAINTAINERS: mm: add entry for numa memblocks and numa emulation Link: https://lkml.kernel.org/r/20250326215541.1809379-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Acked-by: Michal Hocko Acked-by: Oscar Salvador Acked-by: Vlastimil Babka Cc: Dan Willaims Cc: Peter Xu Signed-off-by: Andrew Morton --- MAINTAINERS | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index c6c5b78da2c4..4166bcaa2230 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15505,6 +15505,16 @@ S: Maintained F: include/linux/execmem.h F: mm/execmem.c +MEMORY MANAGEMENT - NUMA MEMBLOCKS AND NUMA EMULATION +M: Andrew Morton +M: Mike Rapoport +L: linux-mm@kvack.org +S: Maintained +F: include/linux/numa_memblks.h +F: mm/numa.c +F: 
mm/numa_emulation.c +F: mm/numa_memblks.c + MEMORY MANAGEMENT - USERFAULTFD M: Andrew Morton R: Peter Xu From 38c5ecaaddd09e6e80028b132266e375b22e9b4b Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 26 Mar 2025 23:55:41 +0200 Subject: [PATCH 13/31] MAINTAINERS: mm: add entry for secretmem Link: https://lkml.kernel.org/r/20250326215541.1809379-5-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Acked-by: Michal Hocko Acked-by: Oscar Salvador Acked-by: Vlastimil Babka Cc: Dan Willaims Cc: Peter Xu Signed-off-by: Andrew Morton --- MAINTAINERS | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 4166bcaa2230..193d7e216d79 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15515,6 +15515,14 @@ F: mm/numa.c F: mm/numa_emulation.c F: mm/numa_memblks.c +MEMORY MANAGEMENT - SECRETMEM +M: Andrew Morton +M: Mike Rapoport +L: linux-mm@kvack.org +S: Maintained +F: include/linux/secretmem.h +F: mm/secretmem.c + MEMORY MANAGEMENT - USERFAULTFD M: Andrew Morton R: Peter Xu From 9342bc134ae73a0b3fddec17075ccf75781a3a70 Mon Sep 17 00:00:00 2001 From: Jinjiang Tu Date: Mon, 24 Mar 2025 21:17:50 +0800 Subject: [PATCH 14/31] mm/memory_hotplug: fix call folio_test_large with tail page in do_migrate_range We triggered the below BUG: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x2 pfn:0x240402 head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x1ffffe0000000040(head|node=1|zone=3|lastcpupid=0x1ffff) page_type: f4(hugetlb) page dumped because: VM_BUG_ON_PAGE(page->compound_head & 1) ------------[ cut here ]------------ kernel BUG at ./include/linux/page-flags.h:310! Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP Modules linked in: CPU: 7 UID: 0 PID: 166 Comm: sh Not tainted 6.14.0-rc7-dirty #374 Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : const_folio_flags+0x3c/0x58 lr : const_folio_flags+0x3c/0x58 Call trace: const_folio_flags+0x3c/0x58 (P) do_migrate_range+0x164/0x720 offline_pages+0x63c/0x6fc memory_subsys_offline+0x190/0x1f4 device_offline+0xc0/0x13c state_store+0x90/0xd8 dev_attr_store+0x18/0x2c sysfs_kf_write+0x44/0x54 kernfs_fop_write_iter+0x120/0x1cc vfs_write+0x240/0x378 ksys_write+0x70/0x108 __arm64_sys_write+0x1c/0x28 invoke_syscall+0x48/0x10c el0_svc_common.constprop.0+0x40/0xe0 When allocating a hugetlb folio, between the folio is taken from buddy and prep_compound_page() is called, start_isolate_page_range() and do_migrate_range() is called. When do_migrate_range() scans the head page of the hugetlb folio, the compound_head field isn't set, so scans the tail page next. And at this time, the compound_head field of tail page is set, folio_test_large() is called by tail page, thus triggers VM_BUG_ON(). To fix it, get folio refcount before calling folio_test_large(). 
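In short, the fix pins the folio before reading any of its metadata. A condensed sketch of the corrected ordering (mirroring the hunk below):

	/* do_migrate_range() loop body, condensed: pin first, revalidate,
	 * and only then read folio metadata such as its size. */
	folio = page_folio(page);
	if (!folio_try_get(folio))
		continue;
	if (unlikely(page_folio(page) != folio))
		goto put_folio;		/* raced with a split/free; revisit later */
	if (folio_test_large(folio))	/* now safe: we hold a reference */
		pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;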
Link: https://lkml.kernel.org/r/20250324131750.1551884-1-tujinjiang@huawei.com Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports thp migration") Fixes: b62b51d2d159 ("mm: memory_hotplug: remove head variable in do_migrate_range()") Signed-off-by: Jinjiang Tu Acked-by: Oscar Salvador Acked-by: David Hildenbrand Cc: Kefeng Wang Cc: Nanyong Sun Cc: Naoya Horiguchi Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/memory_hotplug.c | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 75401866fb76..8305483de38b 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1813,21 +1813,15 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) page = pfn_to_page(pfn); folio = page_folio(page); - /* - * No reference or lock is held on the folio, so it might - * be modified concurrently (e.g. split). As such, - * folio_nr_pages() may read garbage. This is fine as the outer - * loop will revisit the split folio later. - */ - if (folio_test_large(folio)) - pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1; - if (!folio_try_get(folio)) continue; if (unlikely(page_folio(page) != folio)) goto put_folio; + if (folio_test_large(folio)) + pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1; + if (folio_contain_hwpoisoned_page(folio)) { if (WARN_ON(folio_test_lru(folio))) folio_isolate_lru(folio); From 1b3d3e9f4a32b06ca73b084aa526a352318c74ac Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 28 Mar 2025 01:01:36 +0000 Subject: [PATCH 15/31] microblaze/mm: put mm_cmdline_setup() in .init.text section As reported by lkp, there is a section mismatch of mm_cmdline_setup() and memblock. The reason is we don't specify the section of mm_cmdline_setup() and gcc put it into .text.unlikely. As mm_cmdline_setup() is only used in mmu_init(), which is in .init.text section, put mm_cmdline_setup() into it too. Link: https://lkml.kernel.org/r/20250328010136.13139-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202503241259.kJV3U7Xj-lkp@intel.com/ Reviewed-by: Oscar Salvador Cc: Masahiro Yamada Cc: Michal Simek Cc: Wei Yang Signed-off-by: Andrew Morton --- arch/microblaze/mm/init.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/microblaze/mm/init.c b/arch/microblaze/mm/init.c index 65f0d1fb8a2a..31d475cdb1c5 100644 --- a/arch/microblaze/mm/init.c +++ b/arch/microblaze/mm/init.c @@ -118,7 +118,7 @@ int page_is_ram(unsigned long pfn) /* * Check for command-line options that affect what MMU_init will do. */ -static void mm_cmdline_setup(void) +static void __init mm_cmdline_setup(void) { unsigned long maxmem = 0; char *p = cmd_line; From f21bb37afbba0878c8d417cd861e43d014119845 Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Tue, 25 Feb 2025 11:45:51 +0800 Subject: [PATCH 16/31] mm: pgtable: make generic tlb_remove_table() use struct ptdesc Patch series "remove tlb_remove_page_ptdesc()", v2. As suggested by Peter Zijlstra below [1], this series aims to remove tlb_remove_page_ptdesc(). : Fundamentally tlb_remove_page() is about removing *pages* as from a PTE, : there should not be a page-table anywhere near here *ever*. : : Yes, some architectures use tlb_remove_page() for page-tables too, but : that is more or less an implementation detail that can be fixed. After this series, all architectures use tlb_remove_table() or tlb_remove_ptdesc() to remove the page table pages. 
In the future, once all architectures using tlb_remove_table() have also converted to using struct ptdesc (eg. powerpc), it may be possible to use only tlb_remove_ptdesc(). [1] https://lore.kernel.org/linux-mm/20250103111457.GC22934@noisy.programming.kicks-ass.net/ This patch (of 6): Now only arm will call tlb_remove_ptdesc()/tlb_remove_table() when CONFIG_MMU_GATHER_TABLE_FREE is disabled. In this case, the type of the table parameter is actually struct ptdesc * instead of struct page *. Since struct ptdesc still overlaps with struct page and has not been separated from it, forcing the table parameter to struct page * will not cause any problems at this time. But this is definitely incorrect and needs to be fixed. So just like the generic __tlb_remove_table(), let generic tlb_remove_table() use struct ptdesc by default when CONFIG_MMU_GATHER_TABLE_FREE is disabled. Link: https://lkml.kernel.org/r/cover.1740454179.git.zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/5be8c3ab7bd68510bf0db4cf84010f4dfe372917.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng Reviewed-by: Kevin Brodsky Cc: Alexandre Ghiti Cc: "Aneesh Kumar K.V" Cc: Arnd Bergmann Cc: Dave Hansen Cc: David Hildenbrand Cc: Hugh Dickens Cc: Jann Horn Cc: Matthew Wilcow (Oracle) Cc: "Mike Rapoport (IBM)" Cc: Muchun Song Cc: Nicholas Piggin Cc: Peter Zijlstra (Intel) Cc: Rik van Riel Cc: Vishal Moola (Oracle) Cc: Will Deacon Cc: Yu Zhao Signed-off-by: Andrew Morton --- include/asm-generic/tlb.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index d1adfba8387e..e27d24dd8d5e 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -227,10 +227,10 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page); */ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table) { - struct page *page = (struct page *)table; + struct ptdesc *ptdesc = (struct ptdesc *)table; - pagetable_dtor(page_ptdesc(page)); - tlb_remove_page(tlb, page); + pagetable_dtor(ptdesc); + tlb_remove_page(tlb, ptdesc_page(ptdesc)); } #endif /* CONFIG_MMU_GATHER_TABLE_FREE */ From 1a03c275a3ad4e47d479a63037b72f2b305f0c13 Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Tue, 25 Feb 2025 11:45:52 +0800 Subject: [PATCH 17/31] mm: pgtable: change pt parameter of tlb_remove_ptdesc() to struct ptdesc* All callers of tlb_remove_ptdesc() pass it a pointer of struct ptdesc, so let's change the pt parameter from void * to struct ptdesc * to perform a type safety check. 
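A hypothetical caller, for illustration of what the stricter prototype buys (not from the patch):

	/* With the old 'void *pt' parameter both calls compiled silently;
	 * with 'struct ptdesc *pt' the first one now triggers an
	 * incompatible-pointer-type diagnostic. */
	struct page *page = alloc_page(GFP_KERNEL);

	tlb_remove_ptdesc(tlb, page);			/* wrong: caught at compile time */
	tlb_remove_ptdesc(tlb, page_ptdesc(page));	/* correct */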
Link: https://lkml.kernel.org/r/60bb44299cf2d731df6592e446e7f694054d0dbe.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng Originally-by: Peter Zijlstra (Intel) Reviewed-by: Kevin Brodsky Cc: Alexandre Ghiti Cc: "Aneesh Kumar K.V" Cc: Arnd Bergmann Cc: Dave Hansen Cc: David Hildenbrand Cc: Hugh Dickens Cc: Jann Horn Cc: Matthew Wilcow (Oracle) Cc: "Mike Rapoport (IBM)" Cc: Muchun Song Cc: Nicholas Piggin Cc: Rik van Riel Cc: Vishal Moola (Oracle) Cc: Will Deacon Cc: Yu Zhao Signed-off-by: Andrew Morton --- include/asm-generic/tlb.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index e27d24dd8d5e..bf845019af36 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -493,7 +493,7 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page) return tlb_remove_page_size(tlb, page, PAGE_SIZE); } -static inline void tlb_remove_ptdesc(struct mmu_gather *tlb, void *pt) +static inline void tlb_remove_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt) { tlb_remove_table(tlb, pt); } From e3ecf7c7d0829a00a4fb02531338ab9e7e75ea0d Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Tue, 25 Feb 2025 11:45:53 +0800 Subject: [PATCH 18/31] mm: pgtable: convert some architectures to use tlb_remove_ptdesc() Now, the nine architectures of csky, hexagon, loongarch, m68k, mips, nios2, openrisc, sh and um do not select CONFIG_MMU_GATHER_RCU_TABLE_FREE, and just call pagetable_dtor() + tlb_remove_page_ptdesc() (the wrapper of tlb_remove_page()). This is the same as the implementation of tlb_remove_{ptdesc|table}() under !CONFIG_MMU_GATHER_TABLE_FREE, so convert these architectures to use tlb_remove_ptdesc(). The ultimate goal is to make the architecture only use tlb_remove_ptdesc() or tlb_remove_table() for page table pages. 
[zhengqi.arch@bytedance.com: v2] Link: https://lkml.kernel.org/r/20250303072603.45423-1-zhengqi.arch@bytedance.com [akpm@linux-foundation.org: remove trailing semi in arch/loongarch/include/asm/pgalloc.h] Link: https://lkml.kernel.org/r/19db3e8673b67bad2f1df1ab37f1c89d99eacfea.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng Suggested-by: Peter Zijlstra (Intel) Reviewed-by: Kevin Brodsky Acked-by: Geert Uytterhoeven [m68k] Cc: Alexandre Ghiti Cc: "Aneesh Kumar K.V" Cc: Arnd Bergmann Cc: Dave Hansen Cc: David Hildenbrand Cc: Hugh Dickens Cc: Jann Horn Cc: Matthew Wilcow (Oracle) Cc: "Mike Rapoport (IBM)" Cc: Muchun Song Cc: Nicholas Piggin Cc: Rik van Riel Cc: Vishal Moola (Oracle) Cc: Will Deacon Cc: Yu Zhao Signed-off-by: Andrew Morton --- arch/csky/include/asm/pgalloc.h | 7 ++----- arch/hexagon/include/asm/pgalloc.h | 7 ++----- arch/loongarch/include/asm/pgalloc.h | 7 ++----- arch/m68k/include/asm/sun3_pgalloc.h | 7 ++----- arch/mips/include/asm/pgalloc.h | 7 ++----- arch/nios2/include/asm/pgalloc.h | 7 ++----- arch/openrisc/include/asm/pgalloc.h | 7 ++----- arch/sh/include/asm/pgalloc.h | 7 ++----- arch/um/include/asm/pgalloc.h | 21 ++++++--------------- 9 files changed, 22 insertions(+), 55 deletions(-) diff --git a/arch/csky/include/asm/pgalloc.h b/arch/csky/include/asm/pgalloc.h index bf8400c28b5a..11055c574968 100644 --- a/arch/csky/include/asm/pgalloc.h +++ b/arch/csky/include/asm/pgalloc.h @@ -61,11 +61,8 @@ static inline pgd_t *pgd_alloc(struct mm_struct *mm) return ret; } -#define __pte_free_tlb(tlb, pte, address) \ -do { \ - pagetable_dtor(page_ptdesc(pte)); \ - tlb_remove_page_ptdesc(tlb, page_ptdesc(pte)); \ -} while (0) +#define __pte_free_tlb(tlb, pte, address) \ + tlb_remove_ptdesc((tlb), page_ptdesc(pte)) extern void pagetable_init(void); extern void mmu_init(unsigned long min_pfn, unsigned long max_pfn); diff --git a/arch/hexagon/include/asm/pgalloc.h b/arch/hexagon/include/asm/pgalloc.h index 1ee5f5f157ca..937a11ef4c33 100644 --- a/arch/hexagon/include/asm/pgalloc.h +++ b/arch/hexagon/include/asm/pgalloc.h @@ -87,10 +87,7 @@ static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, max_kernel_seg = pmdindex; } -#define __pte_free_tlb(tlb, pte, addr) \ -do { \ - pagetable_dtor((page_ptdesc(pte))); \ - tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte))); \ -} while (0) +#define __pte_free_tlb(tlb, pte, addr) \ + tlb_remove_ptdesc((tlb), page_ptdesc(pte)) #endif diff --git a/arch/loongarch/include/asm/pgalloc.h b/arch/loongarch/include/asm/pgalloc.h index 7211dff8c969..b58f587f0f0a 100644 --- a/arch/loongarch/include/asm/pgalloc.h +++ b/arch/loongarch/include/asm/pgalloc.h @@ -55,11 +55,8 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm) return pte; } -#define __pte_free_tlb(tlb, pte, address) \ -do { \ - pagetable_dtor(page_ptdesc(pte)); \ - tlb_remove_page_ptdesc((tlb), page_ptdesc(pte)); \ -} while (0) +#define __pte_free_tlb(tlb, pte, address) \ + tlb_remove_ptdesc((tlb), page_ptdesc(pte)) #ifndef __PAGETABLE_PMD_FOLDED diff --git a/arch/m68k/include/asm/sun3_pgalloc.h b/arch/m68k/include/asm/sun3_pgalloc.h index 80afc3a18724..1e21c758b774 100644 --- a/arch/m68k/include/asm/sun3_pgalloc.h +++ b/arch/m68k/include/asm/sun3_pgalloc.h @@ -17,11 +17,8 @@ extern const char bad_pmd_string[]; -#define __pte_free_tlb(tlb, pte, addr) \ -do { \ - pagetable_dtor(page_ptdesc(pte)); \ - tlb_remove_page_ptdesc((tlb), page_ptdesc(pte)); \ -} while (0) +#define __pte_free_tlb(tlb, pte, addr) \ + tlb_remove_ptdesc((tlb), page_ptdesc(pte)) 
static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte) { diff --git a/arch/mips/include/asm/pgalloc.h b/arch/mips/include/asm/pgalloc.h index 26c7a6ede983..bbca420c96d3 100644 --- a/arch/mips/include/asm/pgalloc.h +++ b/arch/mips/include/asm/pgalloc.h @@ -48,11 +48,8 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd) extern void pgd_init(void *addr); extern pgd_t *pgd_alloc(struct mm_struct *mm); -#define __pte_free_tlb(tlb, pte, address) \ -do { \ - pagetable_dtor(page_ptdesc(pte)); \ - tlb_remove_page_ptdesc((tlb), page_ptdesc(pte)); \ -} while (0) +#define __pte_free_tlb(tlb, pte, address) \ + tlb_remove_ptdesc((tlb), page_ptdesc(pte)) #ifndef __PAGETABLE_PMD_FOLDED diff --git a/arch/nios2/include/asm/pgalloc.h b/arch/nios2/include/asm/pgalloc.h index 12a536b7bfbd..db122b093a8b 100644 --- a/arch/nios2/include/asm/pgalloc.h +++ b/arch/nios2/include/asm/pgalloc.h @@ -28,10 +28,7 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, extern pgd_t *pgd_alloc(struct mm_struct *mm); -#define __pte_free_tlb(tlb, pte, addr) \ - do { \ - pagetable_dtor(page_ptdesc(pte)); \ - tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte))); \ - } while (0) +#define __pte_free_tlb(tlb, pte, addr) \ + tlb_remove_ptdesc((tlb), page_ptdesc(pte)) #endif /* _ASM_NIOS2_PGALLOC_H */ diff --git a/arch/openrisc/include/asm/pgalloc.h b/arch/openrisc/include/asm/pgalloc.h index 3372f4e6ab4b..3f110931d8f6 100644 --- a/arch/openrisc/include/asm/pgalloc.h +++ b/arch/openrisc/include/asm/pgalloc.h @@ -64,10 +64,7 @@ extern inline pgd_t *pgd_alloc(struct mm_struct *mm) extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm); -#define __pte_free_tlb(tlb, pte, addr) \ -do { \ - pagetable_dtor(page_ptdesc(pte)); \ - tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte))); \ -} while (0) +#define __pte_free_tlb(tlb, pte, addr) \ + tlb_remove_ptdesc((tlb), page_ptdesc(pte)) #endif diff --git a/arch/sh/include/asm/pgalloc.h b/arch/sh/include/asm/pgalloc.h index 96d938fdf224..6fe7123d38fa 100644 --- a/arch/sh/include/asm/pgalloc.h +++ b/arch/sh/include/asm/pgalloc.h @@ -32,10 +32,7 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, set_pmd(pmd, __pmd((unsigned long)page_address(pte))); } -#define __pte_free_tlb(tlb, pte, addr) \ -do { \ - pagetable_dtor(page_ptdesc(pte)); \ - tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte))); \ -} while (0) +#define __pte_free_tlb(tlb, pte, addr) \ + tlb_remove_ptdesc((tlb), page_ptdesc(pte)) #endif /* __ASM_SH_PGALLOC_H */ diff --git a/arch/um/include/asm/pgalloc.h b/arch/um/include/asm/pgalloc.h index f0af23c3aeb2..826ec44b58cd 100644 --- a/arch/um/include/asm/pgalloc.h +++ b/arch/um/include/asm/pgalloc.h @@ -25,27 +25,18 @@ */ extern pgd_t *pgd_alloc(struct mm_struct *); -#define __pte_free_tlb(tlb, pte, address) \ -do { \ - pagetable_dtor(page_ptdesc(pte)); \ - tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte))); \ -} while (0) +#define __pte_free_tlb(tlb, pte, address) \ + tlb_remove_ptdesc((tlb), page_ptdesc(pte)) #if CONFIG_PGTABLE_LEVELS > 2 -#define __pmd_free_tlb(tlb, pmd, address) \ -do { \ - pagetable_dtor(virt_to_ptdesc(pmd)); \ - tlb_remove_page_ptdesc((tlb), virt_to_ptdesc(pmd)); \ -} while (0) +#define __pmd_free_tlb(tlb, pmd, address) \ + tlb_remove_ptdesc((tlb), virt_to_ptdesc(pmd)) #if CONFIG_PGTABLE_LEVELS > 3 -#define __pud_free_tlb(tlb, pud, address) \ -do { \ - pagetable_dtor(virt_to_ptdesc(pud)); \ - tlb_remove_page_ptdesc((tlb), virt_to_ptdesc(pud)); \ -} while (0) +#define __pud_free_tlb(tlb, 
pud, address) \ + tlb_remove_ptdesc((tlb), virt_to_ptdesc(pud)) #endif #endif

From 4239c198e8410b2b5a2b638daf8777e6407d9fc7 Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Tue, 25 Feb 2025 11:45:54 +0800 Subject: [PATCH 19/31] riscv: pgtable: unconditionally use tlb_remove_ptdesc()

To support fast gup, commit 69be3fb111e7 ("riscv: enable MMU_GATHER_RCU_TABLE_FREE for SMP && MMU") did the following: 1) use tlb_remove_page_ptdesc() for those platforms which use IPI to perform TLB shootdown 2) use tlb_remove_ptdesc() for those platforms which use SBI to perform TLB shootdown

tlb_remove_page_ptdesc() is a wrapper around tlb_remove_page(). By design, tlb_remove_page() should be used to remove a normal page from a page table entry, and should not be used for page table pages. tlb_remove_ptdesc() is a wrapper around tlb_remove_table(), which is designed specifically for freeing page table pages. If CONFIG_MMU_GATHER_TABLE_FREE is enabled, tlb_remove_table() will use semi-RCU to free page table pages, that is: - batch table freeing: asynchronous free by RCU - single table freeing: IPI + synchronous free If CONFIG_MMU_GATHER_TABLE_FREE is disabled, tlb_remove_table() will fall back to pagetable_dtor() + tlb_remove_page().

For case 1), since we need to perform TLB shootdown before freeing the page table page, the local_irq_save() in fast gup can block the freeing and protect the fast gup page walker. Therefore we can ensure safety by just using tlb_remove_page_ptdesc(). In addition, we can also use tlb_remove_ptdesc()/tlb_remove_table() to achieve this, and it doesn't matter whether CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected. And in theory, the performance of freeing pages asynchronously via RCU will not be lower than that of synchronous freeing.

For case 2), since local_irq_save() only disables the S-privilege IPI irq but not the M-privilege one, which is used by the SBI implementation to perform TLB shootdown, we must select CONFIG_MMU_GATHER_RCU_TABLE_FREE and use tlb_remove_ptdesc() to ensure safety. riscv selects this config for SMP && MMU, and CONFIG_RISCV_SBI depends on MMU. Therefore, only a UP system may have the situation where CONFIG_MMU_GATHER_RCU_TABLE_FREE is disabled but CONFIG_RISCV_SBI is enabled. But there is no freeing vs. fast gup race on a UP system.

So, in summary, we can use tlb_remove_ptdesc() to support fast gup in all cases, and this interface is specifically designed for page table pages. So let's use it unconditionally.
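Putting this together with the generic definitions from earlier in the series, a sketch of what the unconditional call reduces to in each configuration:

	/* Always a thin wrapper (see patch 17): */
	static inline void tlb_remove_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt)
	{
		tlb_remove_table(tlb, pt);
	}

	/*
	 * With CONFIG_MMU_GATHER_TABLE_FREE, tlb_remove_table() frees tables
	 * via semi-RCU (RCU for batched tables, IPI + synchronous free for a
	 * single table). Otherwise the generic fallback (see patch 16) is:
	 */
	static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
	{
		struct ptdesc *ptdesc = (struct ptdesc *)table;

		pagetable_dtor(ptdesc);
		tlb_remove_page(tlb, ptdesc_page(ptdesc));
	}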
Link: https://lkml.kernel.org/r/9025595e895515515c95e48db54b29afa489c41d.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng Suggested-by: Peter Zijlstra (Intel) Cc: Alexandre Ghiti Cc: "Aneesh Kumar K.V" Cc: Arnd Bergmann Cc: Dave Hansen Cc: David Hildenbrand Cc: Hugh Dickens Cc: Jann Horn Cc: Kevin Brodsky Cc: Matthew Wilcow (Oracle) Cc: "Mike Rapoport (IBM)" Cc: Muchun Song Cc: Nicholas Piggin Cc: Rik van Riel Cc: Vishal Moola (Oracle) Cc: Will Deacon Cc: Yu Zhao Signed-off-by: Andrew Morton --- arch/riscv/include/asm/pgalloc.h | 26 ++++---------------------- 1 file changed, 4 insertions(+), 22 deletions(-) diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h index 3e2aebea6312..770ce18a7328 100644 --- a/arch/riscv/include/asm/pgalloc.h +++ b/arch/riscv/include/asm/pgalloc.h @@ -15,24 +15,6 @@ #define __HAVE_ARCH_PUD_FREE #include -/* - * While riscv platforms with riscv_ipi_for_rfence as true require an IPI to - * perform TLB shootdown, some platforms with riscv_ipi_for_rfence as false use - * SBI to perform TLB shootdown. To keep software pagetable walkers safe in this - * case we switch to RCU based table free (MMU_GATHER_RCU_TABLE_FREE). See the - * comment below 'ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE' in include/asm-generic/tlb.h - * for more details. - */ -static inline void riscv_tlb_remove_ptdesc(struct mmu_gather *tlb, void *pt) -{ - if (riscv_use_sbi_for_rfence()) { - tlb_remove_ptdesc(tlb, pt); - } else { - pagetable_dtor(pt); - tlb_remove_page_ptdesc(tlb, pt); - } -} - static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte) { @@ -108,14 +90,14 @@ static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pud, unsigned long addr) { if (pgtable_l4_enabled) - riscv_tlb_remove_ptdesc(tlb, virt_to_ptdesc(pud)); + tlb_remove_ptdesc(tlb, virt_to_ptdesc(pud)); } static inline void __p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d, unsigned long addr) { if (pgtable_l5_enabled) - riscv_tlb_remove_ptdesc(tlb, virt_to_ptdesc(p4d)); + tlb_remove_ptdesc(tlb, virt_to_ptdesc(p4d)); } #endif /* __PAGETABLE_PMD_FOLDED */ @@ -143,7 +125,7 @@ static inline pgd_t *pgd_alloc(struct mm_struct *mm) static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd, unsigned long addr) { - riscv_tlb_remove_ptdesc(tlb, virt_to_ptdesc(pmd)); + tlb_remove_ptdesc(tlb, virt_to_ptdesc(pmd)); } #endif /* __PAGETABLE_PMD_FOLDED */ @@ -151,7 +133,7 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd, static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte, unsigned long addr) { - riscv_tlb_remove_ptdesc(tlb, page_ptdesc(pte)); + tlb_remove_ptdesc(tlb, page_ptdesc(pte)); } #endif /* CONFIG_MMU */ From f1fdec956f63f7cafc1706e5d907bc9c4d241083 Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Tue, 25 Feb 2025 11:45:55 +0800 Subject: [PATCH 20/31] x86: pgtable: convert to use tlb_remove_ptdesc() The x86 has already been converted to use struct ptdesc, so convert it to use tlb_remove_ptdesc() instead of tlb_remove_table(). 
Link: https://lkml.kernel.org/r/36ad56b7e06fa4b17fb23c4fc650e8e0d72bb3cd.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng Cc: Alexandre Ghiti Cc: "Aneesh Kumar K.V" Cc: Arnd Bergmann Cc: Dave Hansen Cc: David Hildenbrand Cc: Hugh Dickens Cc: Jann Horn Cc: Kevin Brodsky Cc: Matthew Wilcow (Oracle) Cc: "Mike Rapoport (IBM)" Cc: Muchun Song Cc: Nicholas Piggin Cc: Peter Zijlstra (Intel) Cc: Rik van Riel Cc: Vishal Moola (Oracle) Cc: Will Deacon Cc: Yu Zhao Signed-off-by: Andrew Morton --- arch/x86/mm/pgtable.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index cec321fb74f2..a05fcddfc811 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -20,7 +20,7 @@ pgtable_t pte_alloc_one(struct mm_struct *mm) void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte) { paravirt_release_pte(page_to_pfn(pte)); - tlb_remove_table(tlb, page_ptdesc(pte)); + tlb_remove_ptdesc(tlb, page_ptdesc(pte)); } #if CONFIG_PGTABLE_LEVELS > 2 @@ -34,21 +34,21 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd) #ifdef CONFIG_X86_PAE tlb->need_flush_all = 1; #endif - tlb_remove_table(tlb, virt_to_ptdesc(pmd)); + tlb_remove_ptdesc(tlb, virt_to_ptdesc(pmd)); } #if CONFIG_PGTABLE_LEVELS > 3 void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud) { paravirt_release_pud(__pa(pud) >> PAGE_SHIFT); - tlb_remove_table(tlb, virt_to_ptdesc(pud)); + tlb_remove_ptdesc(tlb, virt_to_ptdesc(pud)); } #if CONFIG_PGTABLE_LEVELS > 4 void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d) { paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT); - tlb_remove_table(tlb, virt_to_ptdesc(p4d)); + tlb_remove_ptdesc(tlb, virt_to_ptdesc(p4d)); } #endif /* CONFIG_PGTABLE_LEVELS > 4 */ #endif /* CONFIG_PGTABLE_LEVELS > 3 */ From 02d9e1a2048e47d39733f0ced71ce8e8fee3e56d Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Tue, 25 Feb 2025 11:45:56 +0800 Subject: [PATCH 21/31] mm: pgtable: remove tlb_remove_page_ptdesc() The tlb_remove_ptdesc()/tlb_remove_table() is specially designed for page table pages, and now all architectures have been converted to use it to remove page table pages. So let's remove tlb_remove_page_ptdesc(), it currently has no users and should not be used for page table pages. Link: https://lkml.kernel.org/r/3df04c8494339073b71be4acb2d92e108ecd1b60.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng Suggested-by: Peter Zijlstra (Intel) Reviewed-by: Kevin Brodsky Cc: Alexandre Ghiti Cc: "Aneesh Kumar K.V" Cc: Arnd Bergmann Cc: Dave Hansen Cc: David Hildenbrand Cc: Hugh Dickens Cc: Jann Horn Cc: Matthew Wilcow (Oracle) Cc: "Mike Rapoport (IBM)" Cc: Muchun Song Cc: Nicholas Piggin Cc: Rik van Riel Cc: Vishal Moola (Oracle) Cc: Will Deacon Cc: Yu Zhao Signed-off-by: Andrew Morton --- include/asm-generic/tlb.h | 6 ------ 1 file changed, 6 deletions(-) diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index bf845019af36..88a42973fa47 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -498,12 +498,6 @@ static inline void tlb_remove_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt) tlb_remove_table(tlb, pt); } -/* Like tlb_remove_ptdesc, but for page-like page directories. 
*/ -static inline void tlb_remove_page_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt) -{ - tlb_remove_page(tlb, ptdesc_page(pt)); -} - static inline void tlb_change_page_size(struct mmu_gather *tlb, unsigned int page_size) {

From 5796d3967c0956734bd1249f76989ca80da0225b Mon Sep 17 00:00:00 2001 From: Jeff Xu Date: Wed, 5 Mar 2025 02:17:05 +0000 Subject: [PATCH 22/31] mseal sysmap: kernel config and header change MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit

Patch series "mseal system mappings", v9.

As discussed during the mseal() upstream process [1], mseal() protects the VMAs of a given virtual memory range against modifications, such as the read/write (RW) and no-execute (NX) bits. For complete descriptions of memory sealing, please see mseal.rst [2].

mseal() is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. The system mappings are read-only; memory sealing can protect them from ever becoming writable or being unmapped/remapped with different attributes.

System mappings such as vdso, vvar, vvar_vclock, vectors (arm compat-mode) and sigpage (arm compat-mode) are created by the kernel during program initialization, and could be sealed after creation. Unlike the aforementioned mappings, the uprobe mapping is not established during program startup. However, its lifetime is the same as the process's lifetime [3]. It could be sealed from creation.

The vsyscall on x86-64 uses a special address (0xffffffffff600000), which is outside the mm managed range. This means mprotect, munmap, and mremap won't work on the vsyscall. Since sealing doesn't enhance the vsyscall's security, it is skipped in this patch. If we ever seal the vsyscall, it is probably only for decorative purposes, i.e. showing the 'sl' flag in /proc/pid/smaps. For this patch, it is ignored.

It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may alter the system mappings during restore operations. UML (User Mode Linux), gVisor and rr are also known to change the vdso/vvar mappings. Consequently, this feature cannot be universally enabled across all systems. As such, CONFIG_MSEAL_SYSTEM_MAPPINGS is disabled by default.

To support mseal of system mappings, architectures must define CONFIG_ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS and update their special mapping calls to pass the mseal flag. Additionally, architectures must confirm they do not unmap/remap system mappings during the process lifetime. The existence of this flag for an architecture implies that it does not require the remapping of these system mappings during process lifetime, so sealing these mappings is safe from a kernel perspective.

This version covers the x86-64 and arm64 architectures as the minimum viable feature. While no specific CPU hardware features are required to enable this feature on an architecture, memory sealing requires a 64-bit kernel. Other architectures can choose whether or not to adopt this feature.

Currently, I'm not aware of any instances in the kernel code that actively munmap/mremap a system mapping without a request from userspace. PPC does call munmap when _install_special_mapping fails for vdso; however, it's uncertain if this will ever fail for PPC - this needs to be investigated by PPC in the future [4].
The UML kernel can add this support when KUnit tests require it [5].

In this version, we've improved the handling of system mapping sealing from previous versions: instead of modifying the _install_special_mapping function itself, which would affect all architectures, we now call _install_special_mapping with a sealing flag only within the specific architecture that requires it. This targeted approach offers two key advantages: 1) It limits the code change's impact to the necessary architectures, and 2) It aligns with the software architecture by keeping the core memory management within the mm layer, while delegating the decision of sealing system mappings to the individual architecture, which is particularly relevant since 32-bit architectures never require sealing.

Prior to this patch series, we explored sealing special mappings from userspace using glibc's dynamic linker. This approach revealed several issues: - The PT_LOAD header may report an incorrect length for the vdso (smaller than its actual size). The dynamic linker, which relies on PT_LOAD information to determine mapping size, would then split and partially seal the vdso mapping. Since each architecture has its own vdso/vvar code, fixing this in the kernel would require going through each architecture. Our initial goal was to enable sealing of read-only mappings, e.g. .text, across all architectures; sealing the vdso from the kernel at creation appears to be simpler than sealing the vdso in glibc. - The [vvar] mapping header only contains address information, not length information. Similar issues might exist for other special mappings. - Mappings like uprobe are not covered by the dynamic linker, and there is no effective solution for them.

This feature's security enhancements will benefit ChromeOS, Android, and other high-security systems.

Testing: This feature was tested on ChromeOS and Android for both x86-64 and ARM64. - Enable sealing and verify vdso/vvar, sigpage, vectors are sealed properly, i.e. "sl" shown in the smaps for those mappings, and mremap is blocked. - Passing various automation tests (e.g. pre-checkin) on ChromeOS and Android to ensure the sealing doesn't affect the functionality of Chromebooks and Android phones.

I also tested the feature on Ubuntu on x86-64: - With the config disabled, vdso/vvar is not sealed, - With the config enabled, vdso/vvar is sealed, booting up Ubuntu is OK, and normal operations such as browsing the web and opening/editing documents are OK.

Link: https://lore.kernel.org/all/20240415163527.626541-1-jeffxu@chromium.org/ [1] Link: Documentation/userspace-api/mseal.rst [2] Link: https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/ [3] Link: https://lore.kernel.org/all/CABi2SkV6JJwJeviDLsq9N4ONvQ=EFANsiWkgiEOjyT9TQSt+HA@mail.gmail.com/ [4] Link: https://lore.kernel.org/all/202502251035.239B85A93@keescook/ [5]

This patch (of 7): Provide infrastructure to mseal system mappings. Establish two kernel configs (CONFIG_MSEAL_SYSTEM_MAPPINGS, ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS) and the VM_SEALED_SYSMAP macro for future patches.

Link: https://lkml.kernel.org/r/20250305021711.3867874-1-jeffxu@google.com Link: https://lkml.kernel.org/r/20250305021711.3867874-2-jeffxu@google.com Signed-off-by: Jeff Xu Reviewed-by: Kees Cook Reviewed-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Adhemerval Zanella Cc: Alexander Mikhalitsyn Cc: Alexey Dobriyan Cc: Andrei Vagin Cc: Anna-Maria Behnsen Cc: Ard Biesheuvel Cc: Benjamin Berg Cc: Christoph Hellwig Cc: Dave Hansen Cc: David Rientjes Cc: David S.
Cc: Elliott Hughes
Cc: Florian Fainelli
Cc: Greg Ungerer
Cc: Guenter Roeck
Cc: Heiko Carstens
Cc: Helge Deller
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Ingo Molnar
Cc: Jann Horn
Cc: Jason A. Donenfeld
Cc: Johannes Berg
Cc: Jorge Lucangeli Obes
Cc: Linus Walleij
Cc: Mark Rutland
Cc: Matthew Wilcox (Oracle)
Cc: Michael Ellerman
Cc: Michal Hocko
Cc: Miguel Ojeda
Cc: Mike Rapoport
Cc: Oleg Nesterov
Cc: Pedro Falcato
Cc: Peter Xu
Cc: Randy Dunlap
Cc: Stephen Röttger
Cc: Thomas Weißschuh
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
---
 include/linux/mm.h | 10 ++++++++++
 init/Kconfig       | 22 ++++++++++++++++++++++
 security/Kconfig   | 21 +++++++++++++++++++++
 3 files changed, 53 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 32ba0e33422b..778f5de6a12e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4236,4 +4236,14 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
 int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
 int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
 
+
+/*
+ * mseal of userspace process's system mappings.
+ */
+#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS
+#define VM_SEALED_SYSMAP	VM_SEALED
+#else
+#define VM_SEALED_SYSMAP	VM_NONE
+#endif
+
 #endif /* _LINUX_MM_H */

diff --git a/init/Kconfig b/init/Kconfig
index 681f38ee68db..18717967fc8c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1888,6 +1888,28 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
 config ARCH_HAS_MEMBARRIER_SYNC_CORE
 	bool
 
+config ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
+	bool
+	help
+	  Control MSEAL_SYSTEM_MAPPINGS access based on architecture.
+
+	  A 64-bit kernel is required for the memory sealing feature.
+	  No specific hardware features from the CPU are needed.
+
+	  To enable this feature, the architecture needs to update its
+	  special mapping calls to include the sealing flag and confirm
+	  that it doesn't unmap/remap system mappings during the lifetime
+	  of the process. The existence of this flag for an architecture
+	  implies that it does not require the remapping of the system
+	  mappings during process lifetime, so sealing these mappings is
+	  safe from a kernel perspective.
+
+	  After the architecture enables this, a distribution can set
+	  CONFIG_MSEAL_SYSTEM_MAPPINGS to manage access to the feature.
+
+	  For complete descriptions of memory sealing, please see
+	  Documentation/userspace-api/mseal.rst
+
 config HAVE_PERF_EVENTS
 	bool
 	help

diff --git a/security/Kconfig b/security/Kconfig
index 536061cf33a9..4816fc74f81e 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -51,6 +51,27 @@ config PROC_MEM_NO_FORCE
 
 endchoice
 
+config MSEAL_SYSTEM_MAPPINGS
+	bool "mseal system mappings"
+	depends on 64BIT
+	depends on ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
+	depends on !CHECKPOINT_RESTORE
+	help
+	  Apply mseal on system mappings.
+	  The system mappings include vdso, vvar, vvar_vclock,
+	  vectors (arm compat-mode), sigpage (arm compat-mode), uprobes.
+
+	  A 64-bit kernel is required for the memory sealing feature.
+	  No specific hardware features from the CPU are needed.
+
+	  WARNING: This feature breaks programs which rely on relocating
+	  or unmapping system mappings. Known broken software at the time
+	  of writing includes CHECKPOINT_RESTORE, UML, gVisor, rr. Therefore
+	  this config can't be enabled universally.
+
+	  For complete descriptions of memory sealing, please see
+	  Documentation/userspace-api/mseal.rst
+
 config SECURITY
 	bool "Enable different security models"
 	depends on SYSFS

From 7b0141daf34c5d9b3c665e609d293e37cc692734 Mon Sep 17 00:00:00 2001
From: Jeff Xu
Date: Wed, 5 Mar 2025 02:17:06 +0000
Subject: [PATCH 23/31] selftests: x86: test_mremap_vdso: skip if vdso is msealed
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add code to detect whether the vdso is memory sealed, and skip the test
if it is.

Link: https://lkml.kernel.org/r/20250305021711.3867874-3-jeffxu@google.com
Signed-off-by: Jeff Xu
Reviewed-by: Kees Cook
Reviewed-by: Lorenzo Stoakes
Reviewed-by: Liam R. Howlett
Cc: Adhemerval Zanella
Cc: Alexander Mikhalitsyn
Cc: Alexey Dobriyan
Cc: Andrei Vagin
Cc: Anna-Maria Behnsen
Cc: Ard Biesheuvel
Cc: Benjamin Berg
Cc: Christoph Hellwig
Cc: Dave Hansen
Cc: David Rientjes
Cc: David S. Miller
Cc: Elliott Hughes
Cc: Florian Fainelli
Cc: Greg Ungerer
Cc: Guenter Roeck
Cc: Heiko Carstens
Cc: Helge Deller
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Ingo Molnar
Cc: Jann Horn
Cc: Jason A. Donenfeld
Cc: Johannes Berg
Cc: Jorge Lucangeli Obes
Cc: Linus Walleij
Cc: Mark Rutland
Cc: Matthew Wilcox (Oracle)
Cc: Michael Ellerman
Cc: Michal Hocko
Cc: Miguel Ojeda
Cc: Mike Rapoport
Cc: Oleg Nesterov
Cc: Pedro Falcato
Cc: Peter Xu
Cc: Randy Dunlap
Cc: Stephen Röttger
Cc: Thomas Weißschuh
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
---
 .../testing/selftests/x86/test_mremap_vdso.c  | 43 +++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/tools/testing/selftests/x86/test_mremap_vdso.c b/tools/testing/selftests/x86/test_mremap_vdso.c
index d53959e03593..94bee6e0c813 100644
--- a/tools/testing/selftests/x86/test_mremap_vdso.c
+++ b/tools/testing/selftests/x86/test_mremap_vdso.c
@@ -14,6 +14,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -55,13 +56,55 @@ static int try_to_remap(void *vdso_addr, unsigned long size)
 
 }
 
+#define VDSO_NAME "[vdso]"
+#define VMFLAGS "VmFlags:"
+#define MSEAL_FLAGS "sl"
+#define MAX_LINE_LEN 512
+
+bool vdso_sealed(FILE *maps)
+{
+	char line[MAX_LINE_LEN];
+	bool has_vdso = false;
+
+	while (fgets(line, sizeof(line), maps)) {
+		if (strstr(line, VDSO_NAME))
+			has_vdso = true;
+
+		if (has_vdso && !strncmp(line, VMFLAGS, strlen(VMFLAGS))) {
+			if (strstr(line, MSEAL_FLAGS))
+				return true;
+
+			return false;
+		}
+	}
+
+	return false;
+}
+
 int main(int argc, char **argv, char **envp)
 {
 	pid_t child;
+	FILE *maps;
 
 	ksft_print_header();
 	ksft_set_plan(1);
 
+	maps = fopen("/proc/self/smaps", "r");
+	if (!maps) {
+		ksft_test_result_skip(
+			"Could not open /proc/self/smaps, errno=%d\n",
+			errno);
+
+		return 0;
+	}
+
+	if (vdso_sealed(maps)) {
+		ksft_test_result_skip("vdso is sealed\n");
+		return 0;
+	}
+
+	fclose(maps);
+
 	child = fork();
 	if (child == -1)
 		ksft_exit_fail_msg("failed to fork (%d): %m\n", errno);

From 1d6fad7b844cffc85024a362fdcc3bef696dfe2e Mon Sep 17 00:00:00 2001
From: Heiko Carstens
Date: Tue, 11 Mar 2025 13:33:25 +0100
Subject: [PATCH 24/31] mseal sysmap: generic vdso vvar mapping
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

With the introduction of the generic vdso data storage, the
VM_SEALED_SYSMAP vm flag must be moved from the architecture-specific
_install_special_mapping() calls [1] [2], which map the vvar mapping,
to generic code.
[1] https://lkml.kernel.org/r/20250305021711.3867874-4-jeffxu@google.com
[2] https://lkml.kernel.org/r/20250305021711.3867874-5-jeffxu@google.com

Link: https://lkml.kernel.org/r/20250311123326.2686682-2-hca@linux.ibm.com
Signed-off-by: Heiko Carstens
Reviewed-by: Lorenzo Stoakes
Cc: Alexander Gordeev
Cc: Christian Borntraeger
Cc: Jeff Xu
Cc: Liam Howlett
Cc: Sven Schnelle
Cc: Thomas Weißschuh
Cc: Vasily Gorbik
Signed-off-by: Andrew Morton
---
 lib/vdso/datastore.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lib/vdso/datastore.c b/lib/vdso/datastore.c
index c715e217ec65..3693c6caf2c4 100644
--- a/lib/vdso/datastore.c
+++ b/lib/vdso/datastore.c
@@ -99,7 +99,8 @@ const struct vm_special_mapping vdso_vvar_mapping = {
 struct vm_area_struct *vdso_install_vvar_mapping(struct mm_struct *mm, unsigned long addr)
 {
 	return _install_special_mapping(mm, addr, VDSO_NR_PAGES * PAGE_SIZE,
-					VM_READ | VM_MAYREAD | VM_IO | VM_DONTDUMP | VM_PFNMAP,
+					VM_READ | VM_MAYREAD | VM_IO | VM_DONTDUMP |
+					VM_PFNMAP | VM_SEALED_SYSMAP,
 					&vdso_vvar_mapping);
 }

From 3049def198481f1d7dfe29c79658e2a0e297a565 Mon Sep 17 00:00:00 2001
From: Jeff Xu
Date: Wed, 5 Mar 2025 02:17:07 +0000
Subject: [PATCH 25/31] mseal sysmap: enable x86-64
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on x86-64, covering
the vdso, vvar and vvar_vclock mappings.

Production release testing passes on Android and Chrome OS.

Link: https://lkml.kernel.org/r/20250305021711.3867874-4-jeffxu@google.com
Signed-off-by: Jeff Xu
Reviewed-by: Lorenzo Stoakes
Reviewed-by: Liam R. Howlett
Reviewed-by: Kees Cook
Cc: Adhemerval Zanella
Cc: Alexander Mikhalitsyn
Cc: Alexey Dobriyan
Cc: Andrei Vagin
Cc: Anna-Maria Behnsen
Cc: Ard Biesheuvel
Cc: Benjamin Berg
Cc: Christoph Hellwig
Cc: Dave Hansen
Cc: David Rientjes
Cc: David S. Miller
Cc: Elliott Hughes
Cc: Florian Fainelli
Cc: Greg Ungerer
Cc: Guenter Roeck
Cc: Heiko Carstens
Cc: Helge Deller
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Ingo Molnar
Cc: Jann Horn
Cc: Jason A. Donenfeld
Cc: Johannes Berg
Cc: Jorge Lucangeli Obes
Cc: Linus Walleij
Cc: Mark Rutland
Cc: Matthew Wilcox (Oracle)
Cc: Michael Ellerman
Cc: Michal Hocko
Cc: Miguel Ojeda
Cc: Mike Rapoport
Cc: Oleg Nesterov
Cc: Pedro Falcato
Cc: Peter Xu
Cc: Randy Dunlap
Cc: Stephen Röttger
Cc: Thomas Weißschuh
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
---
 arch/x86/Kconfig          | 1 +
 arch/x86/entry/vdso/vma.c | 5 +++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9395ec37bb64..1502fd0c3c06 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,6 +27,7 @@ config X86_64
 	# Options that are inherently 64-bit kernel only:
 	select ARCH_HAS_GIGANTIC_PAGE
 	select ARCH_HAS_PTDUMP
+	select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
 	select ARCH_SUPPORTS_PER_VMA_LOCK
 	select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 9518bf1ddf35..adb299d3b6a1 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -162,7 +162,8 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
 			text_start,
 			image->size,
 			VM_READ|VM_EXEC|
-			VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
+			VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC|
+			VM_SEALED_SYSMAP,
 			&vdso_mapping);
 
 	if (IS_ERR(vma)) {
@@ -181,7 +182,7 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
 			VDSO_VCLOCK_PAGES_START(addr),
 			VDSO_NR_VCLOCK_PAGES * PAGE_SIZE,
 			VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|
-			VM_PFNMAP,
+			VM_PFNMAP|VM_SEALED_SYSMAP,
 			&vvar_vclock_mapping);
 
 	if (IS_ERR(vma)) {

From 0061b6e162adaaedb84093cd6908ddf8c85d5b47 Mon Sep 17 00:00:00 2001
From: Jeff Xu
Date: Wed, 5 Mar 2025 02:17:08 +0000
Subject: [PATCH 26/31] mseal sysmap: enable arm64
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on arm64, covering the
vdso, vvar, and compat-mode vectors and sigpage mappings.

Production release testing passes on Android and Chrome OS.

Link: https://lkml.kernel.org/r/20250305021711.3867874-5-jeffxu@google.com
Signed-off-by: Jeff Xu
Reviewed-by: Lorenzo Stoakes
Reviewed-by: Liam R. Howlett
Reviewed-by: Kees Cook
Cc: Adhemerval Zanella
Cc: Alexander Mikhalitsyn
Cc: Alexey Dobriyan
Cc: Andrei Vagin
Cc: Anna-Maria Behnsen
Cc: Ard Biesheuvel
Cc: Benjamin Berg
Cc: Christoph Hellwig
Cc: Dave Hansen
Cc: David Rientjes
Cc: David S. Miller
Cc: Elliott Hughes
Cc: Florian Fainelli
Cc: Greg Ungerer
Cc: Guenter Roeck
Cc: Heiko Carstens
Cc: Helge Deller
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Ingo Molnar
Cc: Jann Horn
Cc: Jason A. Donenfeld
Cc: Johannes Berg
Cc: Jorge Lucangeli Obes
Cc: Linus Walleij
Cc: Mark Rutland
Cc: Matthew Wilcox (Oracle)
Cc: Michael Ellerman
Cc: Michal Hocko
Cc: Miguel Ojeda
Cc: Mike Rapoport
Cc: Oleg Nesterov
Cc: Pedro Falcato
Cc: Peter Xu
Cc: Randy Dunlap
Cc: Stephen Röttger
Cc: Thomas Weißschuh
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
---
 arch/arm64/Kconfig       | 1 +
 arch/arm64/kernel/vdso.c | 9 ++++++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 748c34dc953c..a182295e6f08 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -38,6 +38,7 @@ config ARM64
 	select ARCH_HAS_KEEPINITRD
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
 	select ARCH_HAS_MEM_ENCRYPT
+	select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
 	select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_NONLEAF_PMD_YOUNG if ARM64_HAFT

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 887ac0b05961..78ddf6bdecad 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -130,7 +130,8 @@ static int __setup_additional_pages(enum vdso_abi abi,
 	mm->context.vdso = (void *)vdso_base;
 	ret = _install_special_mapping(mm, vdso_base, vdso_text_len,
 				       VM_READ|VM_EXEC|gp_flags|
-				       VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
+				       VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC|
+				       VM_SEALED_SYSMAP,
 				       vdso_info[abi].cm);
 	if (IS_ERR(ret))
 		goto up_fail;
@@ -256,7 +257,8 @@ static int aarch32_kuser_helpers_setup(struct mm_struct *mm)
 	 */
 	ret = _install_special_mapping(mm, AARCH32_VECTORS_BASE, PAGE_SIZE,
 				       VM_READ | VM_EXEC |
-				       VM_MAYREAD | VM_MAYEXEC,
+				       VM_MAYREAD | VM_MAYEXEC |
+				       VM_SEALED_SYSMAP,
 				       &aarch32_vdso_maps[AA32_MAP_VECTORS]);
 
 	return PTR_ERR_OR_ZERO(ret);
@@ -279,7 +281,8 @@ static int aarch32_sigreturn_setup(struct mm_struct *mm)
 	 */
 	ret = _install_special_mapping(mm, addr, PAGE_SIZE,
 				       VM_READ | VM_EXEC | VM_MAYREAD |
-				       VM_MAYWRITE | VM_MAYEXEC,
+				       VM_MAYWRITE | VM_MAYEXEC |
+				       VM_SEALED_SYSMAP,
 				       &aarch32_vdso_maps[AA32_MAP_SIGPAGE]);
 	if (IS_ERR(ret))
 		goto out;

From 3d38922abff330ec2ec8d0d6d38b647d121a0be9 Mon Sep 17 00:00:00 2001
From: Jeff Xu
Date: Wed, 5 Mar 2025 02:17:09 +0000
Subject: [PATCH 27/31] mseal sysmap: uprobe mapping
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Provide support to mseal the uprobe mapping.

Unlike other system mappings, the uprobe mapping is not established
during program startup. However, its lifetime is the same as the
process's lifetime. It could be sealed from creation.

Testing was done with the perf tool, observing that the uprobe mapping
is sealed.

Link: https://lkml.kernel.org/r/20250305021711.3867874-6-jeffxu@google.com
Signed-off-by: Jeff Xu
Reviewed-by: Oleg Nesterov
Reviewed-by: Lorenzo Stoakes
Reviewed-by: Liam R. Howlett
Reviewed-by: Kees Cook
Cc: Adhemerval Zanella
Cc: Alexander Mikhalitsyn
Cc: Alexey Dobriyan
Cc: Andrei Vagin
Cc: Anna-Maria Behnsen
Cc: Ard Biesheuvel
Cc: Benjamin Berg
Cc: Christoph Hellwig
Cc: Dave Hansen
Cc: David Rientjes
Cc: David S. Miller
Cc: Elliott Hughes
Cc: Florian Fainelli
Cc: Greg Ungerer
Cc: Guenter Roeck
Cc: Heiko Carstens
Cc: Helge Deller
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Ingo Molnar
Cc: Jann Horn
Cc: Jason A. Donenfeld
Cc: Johannes Berg
Cc: Jorge Lucangeli Obes
Cc: Linus Walleij
Cc: Mark Rutland
Cc: Matthew Wilcox (Oracle)
Cc: Michael Ellerman
Cc: Michal Hocko
Cc: Miguel Ojeda
Cc: Mike Rapoport
Cc: Pedro Falcato
Cc: Peter Xu
Cc: Randy Dunlap
Cc: Stephen Röttger
Cc: Thomas Weißschuh
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
---
 kernel/events/uprobes.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2746791ce1e2..615b4e6d22c7 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1703,7 +1703,8 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	}
 
 	vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
-				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO,
+				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|
+				VM_SEALED_SYSMAP,
 				&xol_mapping);
 	if (IS_ERR(vma)) {
 		ret = PTR_ERR(vma);

From a8c15bb4008cb79dc6095a6f6e67441e36fb4e99 Mon Sep 17 00:00:00 2001
From: Jeff Xu
Date: Wed, 5 Mar 2025 02:17:10 +0000
Subject: [PATCH 28/31] mseal sysmap: update mseal.rst
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Update the memory sealing documentation to include details about system
mappings.

Link: https://lkml.kernel.org/r/20250305021711.3867874-7-jeffxu@google.com
Signed-off-by: Jeff Xu
Reviewed-by: Kees Cook
Reviewed-by: Lorenzo Stoakes
Reviewed-by: Liam R. Howlett
Cc: Adhemerval Zanella
Cc: Alexander Mikhalitsyn
Cc: Alexey Dobriyan
Cc: Andrei Vagin
Cc: Anna-Maria Behnsen
Cc: Ard Biesheuvel
Cc: Benjamin Berg
Cc: Christoph Hellwig
Cc: Dave Hansen
Cc: David Rientjes
Cc: David S. Miller
Cc: Elliott Hughes
Cc: Florian Fainelli
Cc: Greg Ungerer
Cc: Guenter Roeck
Cc: Heiko Carstens
Cc: Helge Deller
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Ingo Molnar
Cc: Jann Horn
Cc: Jason A. Donenfeld
Cc: Johannes Berg
Cc: Jorge Lucangeli Obes
Cc: Linus Walleij
Cc: Mark Rutland
Cc: Matthew Wilcox (Oracle)
Cc: Michael Ellerman
Cc: Michal Hocko
Cc: Miguel Ojeda
Cc: Mike Rapoport
Cc: Oleg Nesterov
Cc: Pedro Falcato
Cc: Peter Xu
Cc: Randy Dunlap
Cc: Stephen Röttger
Cc: Thomas Weißschuh
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
---
 Documentation/userspace-api/mseal.rst | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
index 41102f74c5e2..56aee46a9307 100644
--- a/Documentation/userspace-api/mseal.rst
+++ b/Documentation/userspace-api/mseal.rst
@@ -130,6 +130,26 @@ Use cases
 
 - Chrome browser: protect some security sensitive data structures.
 
+- System mappings:
+  The system mappings are created by the kernel and include vdso, vvar,
+  vvar_vclock, vectors (arm compat-mode), sigpage (arm compat-mode), uprobes.
+
+  Those system mappings are read-only or execute-only; memory sealing can
+  protect them from ever being made writable, or unmapped/remapped with
+  different attributes. This is useful to mitigate memory corruption issues
+  where a corrupted pointer is passed to a memory management system.
+
+  If supported by an architecture (CONFIG_ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS),
+  the CONFIG_MSEAL_SYSTEM_MAPPINGS seals all system mappings of this
+  architecture.
+
+  The following architectures currently support this feature: x86-64 and arm64.
+
+  WARNING: This feature breaks programs which rely on relocating
+  or unmapping system mappings. Known broken software at the time
+  of writing includes CHECKPOINT_RESTORE, UML, gVisor, rr. Therefore
+  this config can't be enabled universally.
+
 When not to use mseal
 =====================
 Applications can apply sealing to any virtual memory region from userspace,

From b481341e4cfbc89bbb87cd0a24abef29ebfb49c7 Mon Sep 17 00:00:00 2001
From: Jeff Xu
Date: Wed, 5 Mar 2025 02:17:11 +0000
Subject: [PATCH 29/31] selftest: test system mappings are sealed
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add sysmap_is_sealed.c to test that system mappings are sealed.

Note: CONFIG_MSEAL_SYSTEM_MAPPINGS must be set, as indicated in the
config file.

Link: https://lkml.kernel.org/r/20250305021711.3867874-8-jeffxu@google.com
Signed-off-by: Jeff Xu
Reviewed-by: Lorenzo Stoakes
Reviewed-by: Kees Cook
Cc: Adhemerval Zanella
Cc: Alexander Mikhalitsyn
Cc: Alexey Dobriyan
Cc: Andrei Vagin
Cc: Anna-Maria Behnsen
Cc: Ard Biesheuvel
Cc: Benjamin Berg
Cc: Christoph Hellwig
Cc: Dave Hansen
Cc: David Rientjes
Cc: David S. Miller
Cc: Elliott Hughes
Cc: Florian Fainelli
Cc: Greg Ungerer
Cc: Guenter Roeck
Cc: Heiko Carstens
Cc: Helge Deller
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Ingo Molnar
Cc: Jann Horn
Cc: Jason A. Donenfeld
Cc: Johannes Berg
Cc: Jorge Lucangeli Obes
Cc: Liam R. Howlett
Cc: Linus Walleij
Cc: Mark Rutland
Cc: Matthew Wilcox (Oracle)
Cc: Michael Ellerman
Cc: Michal Hocko
Cc: Miguel Ojeda
Cc: Mike Rapoport
Cc: Oleg Nesterov
Cc: Pedro Falcato
Cc: Peter Xu
Cc: Randy Dunlap
Cc: Stephen Röttger
Cc: Thomas Weißschuh
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
---
 tools/testing/selftests/Makefile              |   1 +
 .../mseal_system_mappings/.gitignore          |   2 +
 .../selftests/mseal_system_mappings/Makefile  |   6 +
 .../selftests/mseal_system_mappings/config    |   1 +
 .../mseal_system_mappings/sysmap_is_sealed.c  | 119 ++++++++++++++++++
 5 files changed, 129 insertions(+)
 create mode 100644 tools/testing/selftests/mseal_system_mappings/.gitignore
 create mode 100644 tools/testing/selftests/mseal_system_mappings/Makefile
 create mode 100644 tools/testing/selftests/mseal_system_mappings/config
 create mode 100644 tools/testing/selftests/mseal_system_mappings/sysmap_is_sealed.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 2694344274bf..c77c8c8e3d9b 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -62,6 +62,7 @@ TARGETS += mount
 TARGETS += mount_setattr
 TARGETS += move_mount_set_group
 TARGETS += mqueue
+TARGETS += mseal_system_mappings
 TARGETS += nci
 TARGETS += net
 TARGETS += net/af_unix

diff --git a/tools/testing/selftests/mseal_system_mappings/.gitignore b/tools/testing/selftests/mseal_system_mappings/.gitignore
new file mode 100644
index 000000000000..319c497a595e
--- /dev/null
+++ b/tools/testing/selftests/mseal_system_mappings/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+sysmap_is_sealed

diff --git a/tools/testing/selftests/mseal_system_mappings/Makefile b/tools/testing/selftests/mseal_system_mappings/Makefile
new file mode 100644
index 000000000000..2b4504e2f52f
--- /dev/null
+++ b/tools/testing/selftests/mseal_system_mappings/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0-only
+CFLAGS += -std=c99 -pthread -Wall $(KHDR_INCLUDES)
+
+TEST_GEN_PROGS := sysmap_is_sealed
+
+include ../lib.mk

diff --git a/tools/testing/selftests/mseal_system_mappings/config b/tools/testing/selftests/mseal_system_mappings/config
new file mode 100644
index 000000000000..675cb9f37b86
--- /dev/null
+++ b/tools/testing/selftests/mseal_system_mappings/config
@@ -0,0 +1 @@
+CONFIG_MSEAL_SYSTEM_MAPPINGS=y
diff --git a/tools/testing/selftests/mseal_system_mappings/sysmap_is_sealed.c b/tools/testing/selftests/mseal_system_mappings/sysmap_is_sealed.c
new file mode 100644
index 000000000000..0d2af30c3bf5
--- /dev/null
+++ b/tools/testing/selftests/mseal_system_mappings/sysmap_is_sealed.c
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * test that system mappings are sealed when
+ * CONFIG_MSEAL_SYSTEM_MAPPINGS=y
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+#include "../kselftest_harness.h"
+
+#define VMFLAGS "VmFlags:"
+#define MSEAL_FLAGS "sl"
+#define MAX_LINE_LEN 512
+
+bool has_mapping(char *name, FILE *maps)
+{
+	char line[MAX_LINE_LEN];
+
+	while (fgets(line, sizeof(line), maps)) {
+		if (strstr(line, name))
+			return true;
+	}
+
+	return false;
+}
+
+bool mapping_is_sealed(char *name, FILE *maps)
+{
+	char line[MAX_LINE_LEN];
+
+	while (fgets(line, sizeof(line), maps)) {
+		if (!strncmp(line, VMFLAGS, strlen(VMFLAGS))) {
+			if (strstr(line, MSEAL_FLAGS))
+				return true;
+
+			return false;
+		}
+	}
+
+	return false;
+}
+
+FIXTURE(basic) {
+	FILE *maps;
+};
+
+FIXTURE_SETUP(basic)
+{
+	self->maps = fopen("/proc/self/smaps", "r");
+	if (!self->maps)
+		SKIP(return, "Could not open /proc/self/smaps, errno=%d",
+		     errno);
+};
+
+FIXTURE_TEARDOWN(basic)
+{
+	if (self->maps)
+		fclose(self->maps);
+};
+
+FIXTURE_VARIANT(basic)
+{
+	char *name;
+	bool sealed;
+};
+
+FIXTURE_VARIANT_ADD(basic, vdso) {
+	.name = "[vdso]",
+	.sealed = true,
+};
+
+FIXTURE_VARIANT_ADD(basic, vvar) {
+	.name = "[vvar]",
+	.sealed = true,
+};
+
+FIXTURE_VARIANT_ADD(basic, vvar_vclock) {
+	.name = "[vvar_vclock]",
+	.sealed = true,
+};
+
+FIXTURE_VARIANT_ADD(basic, sigpage) {
+	.name = "[sigpage]",
+	.sealed = true,
+};
+
+FIXTURE_VARIANT_ADD(basic, vectors) {
+	.name = "[vectors]",
+	.sealed = true,
+};
+
+FIXTURE_VARIANT_ADD(basic, uprobes) {
+	.name = "[uprobes]",
+	.sealed = true,
+};
+
+FIXTURE_VARIANT_ADD(basic, stack) {
+	.name = "[stack]",
+	.sealed = false,
+};
+
+TEST_F(basic, check_sealed)
+{
+	if (!has_mapping(variant->name, self->maps)) {
+		SKIP(return, "could not find the mapping, %s",
+		     variant->name);
+	}
+
+	EXPECT_EQ(variant->sealed,
+		  mapping_is_sealed(variant->name, self->maps));
+};
+
+TEST_HARNESS_MAIN

From 24e3f9fbbd5d93afb41af495c008e76a9005dd06 Mon Sep 17 00:00:00 2001
From: Heiko Carstens
Date: Tue, 11 Mar 2025 13:33:26 +0100
Subject: [PATCH 30/31] mseal sysmap: enable s390
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on s390, covering the
vdso.

[hca@linux.ibm.com: update supported architectures]
  Link: https://lkml.kernel.org/r/20250317131917.1332402-1-hca@linux.ibm.com
Link: https://lkml.kernel.org/r/20250311123326.2686682-3-hca@linux.ibm.com
Signed-off-by: Heiko Carstens
Reviewed-by: Lorenzo Stoakes
Cc: Alexander Gordeev
Cc: Christian Borntraeger
Cc: Jeff Xu
Cc: Liam Howlett
Cc: Sven Schnelle
Cc: Thomas Weißschuh
Cc: Vasily Gorbik
Signed-off-by: Andrew Morton
---
 Documentation/userspace-api/mseal.rst | 3 ++-
 arch/s390/Kconfig                     | 1 +
 arch/s390/kernel/vdso.c               | 2 +-
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
index 56aee46a9307..1dabfc29be0d 100644
--- a/Documentation/userspace-api/mseal.rst
+++ b/Documentation/userspace-api/mseal.rst
@@ -143,7 +143,8 @@ Use cases
   the CONFIG_MSEAL_SYSTEM_MAPPINGS seals all system mappings of this
   architecture.
 
-  The following architectures currently support this feature: x86-64 and arm64.
+  The following architectures currently support this feature: x86-64, arm64,
+  and s390.
 
   WARNING: This feature breaks programs which rely on relocating
   or unmapping system mappings. Known broken software at the time

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index c809c486d136..b8fa367c1fc9 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -137,6 +137,7 @@ config S390
 	select ARCH_SUPPORTS_DEBUG_PAGEALLOC
 	select ARCH_SUPPORTS_HUGETLBFS
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && CC_IS_CLANG
+	select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_SUPPORTS_PER_VMA_LOCK
 	select ARCH_USE_BUILTIN_BSWAP

diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index 70c8f9ad13cd..430feb1a5013 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -80,7 +80,7 @@ static int map_vdso(unsigned long addr, unsigned long vdso_mapping_len)
 	vdso_text_start = vvar_start + VDSO_NR_PAGES * PAGE_SIZE;
 	/* VM_MAYWRITE for COW so gdb can set breakpoints */
 	vma = _install_special_mapping(mm, vdso_text_start, vdso_text_len,
-				       VM_READ|VM_EXEC|
+				       VM_READ|VM_EXEC|VM_SEALED_SYSMAP|
 				       VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
 				       vdso_mapping);
 	if (IS_ERR(vma)) {

From e20706d5385b10a6f6a2fe5ad6b1333dad2d1416 Mon Sep 17 00:00:00 2001
From: Jeff Xu
Date: Fri, 21 Mar 2025 03:26:27 +0000
Subject: [PATCH 31/31] mseal sysmap: add arch-support txt
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add Documentation/features/core/mseal_sys_mappings/arch-support.txt.

N/A means the architecture is 32-bit only; mseal is not supported on
32-bit kernels, so the entry is N/A (until mseal becomes available on
32-bit kernels).

[jeffxu@chromium.org: update to v3]
  Link: https://lkml.kernel.org/r/20250324151537.1106542-2-jeffxu@google.com
Link: https://lkml.kernel.org/r/20250321032627.4147562-2-jeffxu@google.com
Signed-off-by: Jeff Xu
Cc: Alexander Gordeev
Cc: Christian Borntraeger
Cc: Eric Dumazet
Cc: Geert Uytterhoeven
Cc: guoweikang
Cc: Heiko Carstens
Cc: Kevin Brodsky
Cc: Liam Howlett
Cc: Lorenzo Stoakes
Cc: Meghana Malladi
Cc: Qi Zheng
Cc: Sven Schnelle
Cc: Thomas Weißschuh
Cc: Vasily Gorbik
Signed-off-by: Andrew Morton
---
 .../core/mseal_sys_mappings/arch-support.txt  | 30 +++++++++++++++++++
 1 file changed, 30 insertions(+)
 create mode 100644 Documentation/features/core/mseal_sys_mappings/arch-support.txt

diff --git a/Documentation/features/core/mseal_sys_mappings/arch-support.txt b/Documentation/features/core/mseal_sys_mappings/arch-support.txt
new file mode 100644
index 000000000000..c6cab9760d57
--- /dev/null
+++ b/Documentation/features/core/mseal_sys_mappings/arch-support.txt
@@ -0,0 +1,30 @@
+#
+# Feature name:          mseal-system-mappings
+#         Kconfig:       ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
+#         description:   arch supports mseal system mappings
+#
+    -----------------------
+    |         arch |status|
+    -----------------------
+    |       alpha: | TODO |
+    |         arc: |  N/A |
+    |         arm: |  N/A |
+    |       arm64: |  ok  |
+    |        csky: |  N/A |
+    |     hexagon: |  N/A |
+    |   loongarch: | TODO |
+    |        m68k: |  N/A |
+    |  microblaze: |  N/A |
+    |        mips: | TODO |
+    |       nios2: |  N/A |
+    |    openrisc: |  N/A |
+    |      parisc: | TODO |
+    |     powerpc: | TODO |
+    |       riscv: | TODO |
+    |        s390: |  ok  |
+    |          sh: |  N/A |
+    |       sparc: | TODO |
+    |          um: | TODO |
+    |         x86: |  ok  |
+    |      xtensa: |  N/A |
+    -----------------------
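To close, a quick way to cross-check the arch-support matrix above on a
live kernel is to try to modify the vdso directly. This is an
illustrative sketch only, not part of the series; it assumes a glibc
system where getauxval(AT_SYSINFO_EHDR) returns the vdso base address:

	#define _GNU_SOURCE
	#include <errno.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/auxv.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		void *vdso = (void *)getauxval(AT_SYSINFO_EHDR);
		long page = sysconf(_SC_PAGESIZE);

		if (!vdso) {
			printf("no vdso found\n");
			return 1;
		}

		/*
		 * With CONFIG_MSEAL_SYSTEM_MAPPINGS=y the vdso VMA is
		 * sealed, so this must fail with EPERM. Without sealing,
		 * PROT_WRITE is normally permitted because the vdso text
		 * is mapped with VM_MAYWRITE (for gdb COW breakpoints).
		 */
		if (mprotect(vdso, page, PROT_READ | PROT_WRITE)) {
			printf("mprotect(vdso): %s%s\n", strerror(errno),
			       errno == EPERM ? " (likely sealed)" : "");
			return 0;
		}

		/* Not sealed: restore the original r-x protection. */
		mprotect(vdso, page, PROT_READ | PROT_EXEC);
		printf("vdso is not sealed\n");
		return 0;
	}

Checking for the "sl" flag on the VmFlags line of the mapping in
/proc/self/smaps, as the selftests above do, is the non-destructive
alternative.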