Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Synced 2026-05-16 13:41:48 -04:00 at commit 75c5ae05e3d98d2cb4eeef40bf1467f2edc14bd2 (1428990 commits)
75c5ae05e3
mm/memory: inline unmap_mapping_range_vma() into unmap_mapping_range_tree()
Let's reduce the number of unmap-related functions that cause confusion by inlining unmap_mapping_range_vma() into its single caller. The end result looks pretty readable. Link: https://lkml.kernel.org/r/20260227200848.114019-4-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
de008c9ba5
mm/memory: remove "zap_details" parameter from zap_page_range_single()
Nobody except memory.c should really set that parameter to non-NULL. So let's just drop it and make unmap_mapping_range_vma() use zap_page_range_single_batched() instead. [david@kernel.org: format on a single line] Link: https://lkml.kernel.org/r/8a27e9ac-2025-4724-a46d-0a7c90894ba7@kernel.org Link: https://lkml.kernel.org/r/20260227200848.114019-3-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: Puranjay Mohan <puranjay@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
c48ad5a4b8
mm/madvise: drop range checks in madvise_free_single_vma()
Patch series "mm: cleanups around unmapping / zapping". A bunch of cleanups around unmapping and zapping. Mostly simplifications, code movements, documentation and renaming of zapping functions. With this series, we'll have the following high-level zap/unmap functions (excluding high-level folio zapping): * unmap_vmas() for actual unmapping (vmas will go away) * zap_vma(): zap all page table entries in a vma * zap_vma_for_reaping(): zap_vma() that must not block * zap_vma_range(): zap a range of page table entries * zap_vma_range_batched(): zap_vma_range() with more options and batching * zap_special_vma_range(): limited zap_vma_range() for modules * __zap_vma_range(): internal helper Patch #1 is not about unmapping/zapping, but I stumbled over it while verifying MADV_DONTNEED range handling. Patch #16 is related to [1], but makes sense even independent of that. This patch (of 16): madvise_vma_behavior()-> madvise_dontneed_free()->madvise_free_single_vma() is only called from madvise_walk_vmas() (a) After try_vma_read_lock() confirmed that the whole range falls into a single VMA (see is_vma_lock_sufficient()). (b) After adjusting the range to the VMA in the loop afterwards. madvise_dontneed_free() might drop the MM lock when handling userfaultfd, but it properly looks up the VMA again to adjust the range. So in madvise_free_single_vma(), the given range should always fall into a single VMA and should also span at least one page. Let's drop the error checks. The code now matches what we do in madvise_dontneed_single_vma(), where we call zap_vma_range_batched() that documents: "The range must fit into one VMA.". Although that function still adjusts that range, we'll change that soon. Link: https://lkml.kernel.org/r/20260227200848.114019-1-david@kernel.org Link: https://lkml.kernel.org/r/20260227200848.114019-2-david@kernel.org Link: https://lore.kernel.org/r/aYSKyr7StGpGKNqW@google.com [1] Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. 
Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
9de209c183
kasan: docs: SLUB is the only remaining slab implementation
We have only the SLUB implementation left in the kernel (referred to as "slab"). Therefore, there is nothing special regarding KASAN modes when it comes to the slab allocator anymore. Drop the stale comment regarding differing SLUB vs. SLAB support. Link: https://lkml.kernel.org/r/20260303120416.62580-1-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
3caedb3b99
vmalloc: support __GFP_RETRY_MAYFAIL and __GFP_NORETRY
__GFP_RETRY_MAYFAIL and __GFP_NORETRY haven't been supported so far because their semantics (i.e. not triggering the OOM killer) are not possible with the existing vmalloc page table allocation, which allows for the OOM killer. Example: __vmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL); <snip> vmalloc_test/55 invoked oom-killer: gfp_mask=0x40dc0(GFP_KERNEL|__GFP_ZERO|__GFP_COMP), order=0, oom_score_adj=0 active_anon:0 inactive_anon:0 isolated_anon:0 active_file:0 inactive_file:0 isolated_file:0 unevictable:0 dirty:0 writeback:0 slab_reclaimable:700 slab_unreclaimable:33708 mapped:0 shmem:0 pagetables:5174 sec_pagetables:0 bounce:0 kernel_misc_reclaimable:0 free:850 free_pcp:319 free_cma:0 CPU: 4 UID: 0 PID: 639 Comm: vmalloc_test/55 ... Hardware name: QEMU Standard PC (i440FX + PIIX, ... Call Trace: <TASK> dump_stack_lvl+0x5d/0x80 dump_header+0x43/0x1b3 out_of_memory.cold+0x8/0x78 __alloc_pages_slowpath.constprop.0+0xef5/0x1130 __alloc_frozen_pages_noprof+0x312/0x330 alloc_pages_mpol+0x7d/0x160 alloc_pages_noprof+0x50/0xa0 __pte_alloc_kernel+0x1e/0x1f0 ... <snip> There are use cases for these modifiers when a large allocation request should rather fail than trigger the OOM killer, which wouldn't be able to handle the situation anyway [1]. While we cannot change the existing page table allocation code easily, we can piggyback on the scoped NOWAIT allocation for them that we already have in place. The rationale is that the bulk of the consumed memory is sitting in the pages backing the vmalloc allocation. Page tables account for only a tiny fraction. Moreover, page tables for virtually allocated areas are never reclaimed, so the longer the system runs, the less likely it is that new page table allocations will be needed. It makes sense to allow an approximation of __GFP_RETRY_MAYFAIL and __GFP_NORETRY even if the page table allocation part is much weaker. This doesn't break the failure mode while it allows for the no-OOM semantic. [1] https://lore.kernel.org/all/32bd9bed-a939-69c4-696d-f7f9a5fe31d8@redhat.com/T/#u Link: https://lkml.kernel.org/r/20260302114740.2668450-2-urezki@gmail.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
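A minimal caller-side sketch of what this enables (the fallback handling is illustrative, not part of the patch):

	#include <linux/vmalloc.h>
	#include <linux/printk.h>

	/* Prefer failing a large allocation over invoking the OOM killer. */
	static void *alloc_big_buffer(unsigned long size)
	{
		void *p = __vmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);

		if (!p)
			pr_warn("large buffer allocation failed, falling back\n");
		return p;	/* NULL: caller degrades gracefully, no OOM kill */
	}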
0edd78cd4d
mm/vmalloc: fix incorrect size reporting on allocation failure
When __vmalloc_area_node() fails to allocate pages, the failure message may report an incorrect allocation size, for example: vmalloc error: size 0, failed to allocate pages, ... This happens because the warning prints area->nr_pages * PAGE_SIZE. At this point, area->nr_pages may be zero or only partly populated, so it is not a valid size to report. Report the originally requested allocation size instead by using nr_small_pages * PAGE_SIZE, which reflects the actual number of pages requested by the user. Link: https://lkml.kernel.org/r/20260302114740.2668450-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
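A sketch of the reporting change (arguments abbreviated; warn_alloc() is the existing failure-path helper):

	/* before: area->nr_pages may still be 0 when allocation fails early */
	/* after: report what was actually requested */
	warn_alloc(gfp_mask, NULL,
		   "vmalloc error: size %lu, failed to allocate pages",
		   nr_small_pages * PAGE_SIZE);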
7a197d346a
Documentation: fix a hugetlbfs reservation statement
Documentation/mm/hugetlbfs_reserv.rst has

	if (resv_needed <= (resv_huge_pages - free_huge_pages))
		resv_huge_pages += resv_needed;

which describes this code in gather_surplus_pages():

	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
	if (needed <= 0) {
		h->resv_huge_pages += delta;
		return 0;
	}

which means: if there are enough free hugepages to account for the new
reservation, simply update the global reservation count without
further action.
But the description is backwards; it should be

	if (resv_needed <= (free_huge_pages - resv_huge_pages))

instead.
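A worked example with invented numbers makes the direction of the subtraction clear:

	free_huge_pages = 10; resv_huge_pages = 4; resv_needed = 5;
	/* 6 of the free pages are not yet reserved, and 5 <= 6, so: */
	resv_huge_pages += resv_needed;	/* now 9, no surplus allocation needed */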
Link: https://lkml.kernel.org/r/20260302201015.1824798-1-jane.chu@oracle.com
Fixes:
28266ac94a
mm: make ref_unless functions unless_zero only
There are no users of (folio/page)_ref_add_unless(page, nr, u) with u != 0
[1] and all current users are "internal" to the page refcounting API. This
allows us to safely drop this parameter and reduce the function semantics to
the "unless zero" case only.
If needed, these functions for the u != 0 cases can be trivially
reintroduced later using the same atomic_add_unless operations as before.
[1]: The last user was dropped in v5.18 kernel, commit
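A sketch of the reduced form (signature and placement illustrative; whether the helpers keep their names or gain an "_unless_zero" suffix is not shown here):

	/* With u fixed to zero, the helper is just atomic_add_unless()
	 * against a zero refcount. */
	static inline bool folio_ref_add_unless(struct folio *folio, int nr)
	{
		return atomic_add_unless(&folio->_refcount, nr, 0);
	}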
e9c01915ae
mm/page_alloc: remove pcpu_spin_* wrappers
We only ever use pcpu_spin_trylock()/unlock() with struct per_cpu_pages so refactor the helpers to remove the generic layer. No functional change intended. Link: https://lkml.kernel.org/r/20260227-b4-pcp-locking-cleanup-v1-3-f7e22e603447@kernel.org Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Suggested-by: Matthew Wilcox <willy@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
0a2c52a9a2
mm/page_alloc: remove IRQ saving/restoring from pcp locking
Effectively revert commit
a373f37116
mm/page_alloc: effectively disable pcp with CONFIG_SMP=n
Patch series "mm/page_alloc: pcp locking cleanup". This is a followup to the hotfix |
ca6969e074
mm/damon/test/core-kunit: add damon_apply_min_nr_regions() test
Add a kunit test for the functionality of damon_apply_min_nr_regions(). Link: https://lkml.kernel.org/r/20260228222831.7232-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
442d87c7db
mm/damon/vaddr: do not split regions for min_nr_regions
The previous commit made DAMON core split regions at the beginning for min_nr_regions. The virtual address space operation set (vaddr) does similar work on its own, for the case where the user delegates the entire initial monitoring regions setup to vaddr. It is unnecessary now, as DAMON core will do similar work for any case. Remove the duplicated work in vaddr. Also, remove a helper function that was being used only for that work, and the test code of the helper function. Link: https://lkml.kernel.org/r/20260228222831.7232-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
b1029f29eb
mm/damon/core: split regions for min_nr_regions
Patch series "mm/damon: strictly respect min_nr_regions". DAMON core respects min_nr_regions only at merge operation. DAMON API callers are therefore responsible to respect or ignore that. Only vaddr ops is respecting that, but only for initial start time. DAMON sysfs interface allows users to setup the initial regions that DAMON core also respects. But, again, it works for only the initial time. Users setting the regions for min_nr_regions can be difficult and inefficient, when the min_nr_regions value is high. There was actually a report [1] from a user. The use case was page granular access monitoring with a large aggregation interval. Make the following three changes for resolving the issue. First (patch 1), make DAMON core split regions at the beginning and every aggregation interval, to respect the min_nr_regions. Second (patch 2), drop the vaddr's split operations and related code that are no more needed. Third (patch 3), add a kunit test for the newly introduced function. This patch (of 3): DAMON core layer respects the min_nr_regions parameter by setting the maximum size of each region as total monitoring region size divided by the parameter value. And the limit is applied by preventing merge of regions that result in a region larger than the maximum size. The limit is updated per ops update interval, because vaddr updates the monitoring regions on the ops update callback. It does nothing for the beginning state. That's because the users can set the initial monitoring regions as they want. That is, if the users really care about the min_nr_regions, they are supposed to set the initial monitoring regions to have more than min_nr_regions regions. The virtual address space operation set, vaddr, has an exceptional case. Users can ask the ops set to configure the initial regions on its own. For the case, vaddr sets up the initial regions to meet the min_nr_regions. So, vaddr has exceptional support, but basically users are required to set the regions on their own if they want min_nr_regions to be respected. When 'min_nr_regions' is high, such initial setup is difficult. If DAMON sysfs interface is used for that, the memory for saving the initial setup is also a waste. Even if the user forgives the setup, DAMON will eventually make more than min_nr_regions regions by splitting operations. But it will take time. If the aggregation interval is long, the delay could be problematic. There was actually a report [1] of the case. The reporter wanted to do page granular monitoring with a large aggregation interval. Also, DAMON is doing nothing for online changes on monitoring regions and min_nr_regions. For example, the user can remove a monitoring region or increase min_nr_regions while DAMON is running. Split regions larger than the size at the beginning of the kdamond main loop, to fix the initial setup issue. Also do the split every aggregation interval, for online changes. This means the behavior is slightly changed. It is difficult to imagine a use case that actually depends on the old behavior, though. So this change is arguably fine. Note that the size limit is aligned by damon_ctx->min_region_sz and cannot be zero. That is, if min_nr_region is larger than the total size of monitoring regions divided by ->min_region_sz, that cannot be respected. 
Link: https://lkml.kernel.org/r/20260228222831.7232-1-sj@kernel.org Link: https://lkml.kernel.org/r/20260228222831.7232-2-sj@kernel.org Link: https://lore.kernel.org/CAC5umyjmJE9SBqjbetZZecpY54bHpn2AvCGNv3aF6J=1cfoPXQ@mail.gmail.com [1] Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
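A sketch of the size-limit arithmetic described above (helper name hypothetical; the alignment behavior follows the last paragraph):

	/* Cap regions at total/min_nr_regions, aligned to the minimum
	 * region size; the limit must never become zero. */
	static unsigned long damon_max_region_sz(unsigned long total_sz,
			unsigned long min_nr_regions,
			unsigned long min_region_sz)
	{
		unsigned long limit = total_sz / min_nr_regions;

		limit = ALIGN_DOWN(limit, min_region_sz);
		return limit ? limit : min_region_sz;
	}

Regions larger than this limit would then be split at kdamond start and once per aggregation interval.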
51d8c78be0
mm/kasan: fix double free for kasan pXds
kasan_free_pxd() assumes the page table is always page aligned. But
that's not always the case for all architectures. E.g., in the case of
powerpc with a 64K page size, the PUD table (of size 4096) comes from a
slab cache named pgtable-2^9. Hence, instead of page_to_virt(pxd_page()),
let's just directly pass the start of the pxd table, which is passed as
the 1st argument.
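The shape of the fix, as a sketch with illustrative names (the real helper operates on the pXd tables generically):

	static void kasan_free_pud(pud_t *pud_start)
	{
		/* Buggy: assumed the table is its own page, so on
		 * powerpc/64K the 4K slab-allocated PUD table resolved
		 * to the wrong object:
		 *
		 *	pud_free(&init_mm, page_to_virt(virt_to_page(pud_start)));
		 *
		 * Fixed: free exactly the table start we were given. */
		pud_free(&init_mm, pud_start);
	}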
This fixes the below double free kasan issue seen with PMEM:
radix-mmu: Mapped 0x0000047d10000000-0x0000047f90000000 with 2.00 MiB pages
==================================================================
BUG: KASAN: double-free in kasan_remove_zero_shadow+0x9c4/0xa20
Free of addr c0000003c38e0000 by task ndctl/2164
CPU: 34 UID: 0 PID: 2164 Comm: ndctl Not tainted 6.19.0-rc1-00048-gea1013c15392 #157 VOLUNTARY
Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries
Call Trace:
dump_stack_lvl+0x88/0xc4 (unreliable)
print_report+0x214/0x63c
kasan_report_invalid_free+0xe4/0x110
check_slab_allocation+0x100/0x150
kmem_cache_free+0x128/0x6e0
kasan_remove_zero_shadow+0x9c4/0xa20
memunmap_pages+0x2b8/0x5c0
devm_action_release+0x54/0x70
release_nodes+0xc8/0x1a0
devres_release_all+0xe0/0x140
device_unbind_cleanup+0x30/0x120
device_release_driver_internal+0x3e4/0x450
unbind_store+0xfc/0x110
drv_attr_store+0x78/0xb0
sysfs_kf_write+0x114/0x140
kernfs_fop_write_iter+0x264/0x3f0
vfs_write+0x3bc/0x7d0
ksys_write+0xa4/0x190
system_call_exception+0x190/0x480
system_call_vectored_common+0x15c/0x2ec
---- interrupt: 3000 at 0x7fff93b3d3f4
NIP: 00007fff93b3d3f4 LR: 00007fff93b3d3f4 CTR: 0000000000000000
REGS: c0000003f1b07e80 TRAP: 3000 Not tainted (6.19.0-rc1-00048-gea1013c15392)
MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 48888208 XER: 00000000
<...>
NIP [00007fff93b3d3f4] 0x7fff93b3d3f4
LR [00007fff93b3d3f4] 0x7fff93b3d3f4
---- interrupt: 3000
The buggy address belongs to the object at c0000003c38e0000
which belongs to the cache pgtable-2^9 of size 4096
The buggy address is located 0 bytes inside of
4096-byte region [c0000003c38e0000, c0000003c38e1000)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x3c38c
head: order:2 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
memcg:c0000003bfd63e01
flags: 0x63ffff800000040(head|node=6|zone=0|lastcpupid=0x7ffff)
page_type: f5(slab)
raw: 063ffff800000040 c000000140058980 5deadbeef0000122 0000000000000000
raw: 0000000000000000 0000000080200020 00000000f5000000 c0000003bfd63e01
head: 063ffff800000040 c000000140058980 5deadbeef0000122 0000000000000000
head: 0000000000000000 0000000080200020 00000000f5000000 c0000003bfd63e01
head: 063ffff800000002 c00c000000f0e301 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000004
page dumped because: kasan: bad access detected
[ 138.953636] [ T2164] Memory state around the buggy address:
[ 138.953643] [ T2164] c0000003c38dff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 138.953652] [ T2164] c0000003c38dff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 138.953661] [ T2164] >c0000003c38e0000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 138.953669] [ T2164] ^
[ 138.953675] [ T2164] c0000003c38e0080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 138.953684] [ T2164] c0000003c38e0100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 138.953692] [ T2164] ==================================================================
[ 138.953701] [ T2164] Disabling lock debugging due to kernel taint
Link: https://lkml.kernel.org/r/2f9135c7866c6e0d06e960993b8a5674a9ebc7ec.1771938394.git.ritesh.list@gmail.com
Fixes:
3d56d7317b
mm: replace READ_ONCE() in pud_trans_unstable()
Replace READ_ONCE() with the existing standard page table accessor for PUD aka pudp_get() in pud_trans_unstable(). This does not create any functional change for platforms that do not override pudp_get(), which still defaults to READ_ONCE(). Link: https://lkml.kernel.org/r/20260227040300.2091901-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
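For reference, a sketch of the generic fallback pattern, which is why non-overriding platforms see no change:

	#ifndef pudp_get
	static inline pud_t pudp_get(pud_t *pudp)
	{
		return READ_ONCE(*pudp);
	}
	#endif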
4d267106ab
mm/debug_vm_pgtable: replace WRITE_ONCE() with pxd_clear()
Replace WRITE_ONCE() with the generic pxd_clear() to clear out the page table entries as required. Besides, this does not cause any functional change. Link: https://lkml.kernel.org/r/20260227061204.2215395-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Suggested-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: SeongJae Park <sj@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
99573ef4ac
mm/pagewalk: drop FW_MIGRATION
We removed the last user of FW_MIGRATION in commit
22aa332199
khugepaged: remove redundant index check for pmd-folios
Claim: folio_order(folio) == HPAGE_PMD_ORDER => folio->index == start. Proof: Both loops in hpage_collapse_scan_file and collapse_file, which iterate on the xarray, have the invariant that start <= folio->index < start + HPAGE_PMD_NR ... (i) A folio is always naturally aligned in the pagecache, therefore folio_order == HPAGE_PMD_ORDER => IS_ALIGNED(folio->index, HPAGE_PMD_NR) == true ... (ii) thp_vma_allowable_order -> thp_vma_suitable_order requires that the virtual offsets in the VMA are aligned to the order, => IS_ALIGNED(start, HPAGE_PMD_NR) == true ... (iii) Combining (i), (ii) and (iii), the claim is proven. Therefore, remove this check. While at it, simplify the comments. Link: https://lkml.kernel.org/r/20260227143501.1488110-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
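Restated compactly (notation mine): with N = HPAGE_PMD_NR, s = start and i = folio->index, invariants (i)-(iii) give s <= i < s + N with N dividing both i and s; the only multiple of N inside [s, s + N) when N divides s is s itself, hence i = s.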
1745ccbd29
mm/damon/core: do non-safe region walk on kdamond_apply_schemes()
kdamond_apply_schemes() is using damon_for_each_region_safe(), which is safe for deallocation of the region inside the loop. However, the loop's internal logic does not deallocate regions. Hence it only wastes the next pointer. Also, it causes a problem. When an address filter is applied, and there is a region that intersects with the filter, the filter splits the region on the filter boundary. The intention is to let DAMOS apply the action to only filtered-in address ranges. However, it is using damon_for_each_region_safe(), which sets the next region before the execution of the iteration. Hence, the region that was split off, and now sits next to the previous region, is simply ignored. As a result, DAMOS applies the action to target regions a bit slower than expected, when the address filter is used. Shouldn't be a big problem but definitely better to be fixed. damos_skip_charged_region() was working around the issue using a double pointer hack. Use damon_for_each_region(), which is safe for this use case. And drop the workaround in damos_skip_charged_region(). Link: https://lkml.kernel.org/r/20260227170623.95384-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
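A sketch of why the _safe variant misses the split-off half (loop bodies illustrative):

	damon_for_each_region_safe(r, next, t) {
		apply_scheme(r);	/* may split r, inserting the new half
					 * right after it; 'next' was cached
					 * before the split, so that half is
					 * skipped on this pass */
	}
	damon_for_each_region(r, t) {
		apply_scheme(r);	/* successor is re-read each step, so
					 * the split-off half is visited next */
	}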
e7e1a26b8d
mm/damon/core: set quota-score histogram with core filters
Patch series "mm/damon/core: improve DAMOS quota efficiency for core layer
filters".
Improve the two below problematic behaviors of DAMOS that make it less
efficient when core layer filters are used.
DAMOS generates the under-quota regions prioritization-purpose access
temperature histogram [1] using only the scheme target access pattern. The
DAMOS filters are ignored on the histogram, and this can result in the
scheme not being applied to eligible regions. For working around this, users
had to use separate DAMON contexts. The memory tiering approaches are
such examples.
DAMOS splits regions that intersect with address filters, so that only the
filtered-out part of the region is skipped. But the implementation is
skipping the other part of the region, the one that is not filtered out,
too. As a result, DAMOS can work slower than expected.
Improve the two inefficient behaviors with two patches, respectively.
Read the patches for more details about the problem and how those are
fixed.
This patch (of 2):
The histogram for under-quota region prioritization [1] is made for all
regions that are eligible for the DAMOS target access pattern. When there
are DAMOS filters, the prioritization-threshold access temperature that is
generated from the histogram could be inaccurate.
For example, suppose there are three regions. Each region is 1 GiB. The
access temperature of the regions are 100, 50, and 0. And a DAMOS scheme
that targets _any_ access temperature with quota 2 GiB is being used. The
histogram will look like below:
	temperature	size of regions having >= temperature
	0		3 GiB
	50		2 GiB
	100		1 GiB
Based on the histogram and the quota (2 GiB), DAMOS applies the action to
only the regions having >=50 temperature. This is all good.
Let's suppose the region of temperature 50 is excluded by a DAMOS filter.
Regardless of the filter, DAMOS will try to apply the action on only
regions having >=50 temperature. Because the region of temperature 50 is
filtered out, the action is applied to only the region of temperature 100.
Worse yet, suppose the filter is excluding regions of temperature 50 and
100. Then no action is really applied to any region, while the region of
temperature 0 is there.
People used to work around this by utilizing multiple contexts, instead of
the core layer DAMOS filters. For example, DAMON-based memory tiering
approaches including the quota auto-tuning based one [2] are using a DAMON
context per NUMA node. If the above explained issue is effectively
alleviated, those can be configured again to run with a single context and
DAMOS filters for applying the promotion and demotion to only specific
NUMA nodes.
Alleviate the problem by checking core DAMOS filters when generating the
histogram. The reason to check only core filters is the overhead. While
core filters are usually for coarse-grained filtering (e.g.,
target/address filters for process, NUMA, zone level filtering), operation
layer filters are usually for fine-grained filtering (e.g., for anon
page). Doing this for operation layer filters would cause significant
overhead. There is no known use case that is affected by the operation
layer filters-distorted histogram problem, though. Do this for only core
filters for now. We will revisit this for operation layer filters in the
future. We might be able to apply a sort of sampling-based operation
layer filtering.
After this fix is applied, for the first case that there is a DAMOS filter
excluding the region of temperature 50, the histogram will be like below:
	temperature	size of regions having >= temperature
	0		2 GiB
	100		1 GiB
And DAMOS will set the temperature threshold as 0, allowing both regions
of temperatures 0 and 100 to be applied.
For the second case that there is a DAMOS filter excluding the regions of
temperature 50 and 100, the histogram will be like below:
	temperature	size of regions having >= temperature
	0		1 GiB
And DAMOS will set the temperature threshold as 0, allowing the region of
temperature 0 to be applied.
[1] 'Prioritization' section of Documentation/mm/damon/design.rst
[2] commit
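A sketch of the thresholding this histogram feeds (names hypothetical): walk buckets from hottest to coldest and stop once the accumulated size would exceed the quota.

	static int damos_quota_threshold(const unsigned long *sz_per_temp,
			int nr_temps, unsigned long quota)
	{
		unsigned long total = 0;
		int t;

		for (t = nr_temps - 1; t >= 0; t--) {
			total += sz_per_temp[t];
			if (total > quota)
				break;
		}
		return t + 1;	/* apply the action to regions >= this temperature */
	}

With filtered-out regions excluded from sz_per_temp[], the returned threshold can only drop, so eligible colder regions are no longer starved.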
8231e4c040
mm/slab: use compound_head() in page_slab()
page_slab() contained an open-coded implementation of compound_head(). Replace the duplicated code with a direct call to compound_head(). Link: https://lkml.kernel.org/r/20260227194302.274384-19-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
fed8676ca2
hugetlb: update vmemmap_dedup.rst
Update the documentation regarding vmemmap optimization for hugetlb to reflect the changes in how the kernel maps the tail pages. Fake heads no longer exist. Remove their description. [kas@kernel.org: update vmemmap_dedup.rst] Link: https://lkml.kernel.org/r/20260302105630.303492-1-kas@kernel.org Link: https://lkml.kernel.org/r/20260227194302.274384-18-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
66b2a3d9ae
mm: remove the branch from compound_head()
The compound_head() function is a hot path. For example, the zap path calls it for every leaf page table entry. Rewrite the helper function in a branchless manner to eliminate the risk of CPU branch misprediction. Link: https://lkml.kernel.org/r/20260227194302.274384-17-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
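A branchless sketch under the mask encoding introduced earlier in this series (bit 0 tags a tail; the kernel's actual helper may differ in detail):

	static inline struct page *compound_head_sketch(struct page *page)
	{
		unsigned long info = READ_ONCE(page->compound_info);
		unsigned long tail = info & 1;
		/* tail == 1: keep only the mask bits;
		 * tail == 0: all-ones, i.e. the identity mask */
		unsigned long mask = (info & ~1UL) | (tail - 1UL);

		return (struct page *)((unsigned long)page & mask);
	}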
da3e2d1ca4
mm/hugetlb: remove hugetlb_optimize_vmemmap_key static key
The hugetlb_optimize_vmemmap_key static key was used to guard fake head detection in compound_head() and related functions. It allowed skipping the fake head checks entirely when HVO was not in use. With fake heads eliminated and the detection code removed, the static key serves no purpose. Remove its definition and all increment/decrement calls. Link: https://lkml.kernel.org/r/20260227194302.274384-16-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
01b1d0ffb6
hugetlb: remove VMEMMAP_SYNCHRONIZE_RCU
The VMEMMAP_SYNCHRONIZE_RCU flag triggered synchronize_rcu() calls to prevent a race between HVO remapping and page_ref_add_unless(). The race could occur when a speculative PFN walker tried to modify the refcount on a struct page that was in the process of being remapped to a fake head. With fake heads eliminated, page_ref_add_unless() no longer needs RCU protection. Remove the flag and synchronize_rcu() calls. Link: https://lkml.kernel.org/r/20260227194302.274384-15-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
32c440d67e
mm: drop fake head checks
With fake head pages eliminated in the previous commit, remove the
supporting infrastructure:
- page_fixed_fake_head(): no longer needed to detect fake heads;
- page_is_fake_head(): no longer needed;
- page_count_writable(): no longer needed for RCU protection;
- RCU read_lock in page_ref_add_unless(): no longer needed;
This substantially simplifies compound_head() and page_ref_add_unless(),
removing both branches and RCU overhead from these hot paths.
RCU was required to serialize allocation of a hugetlb page against
get_page_unless_zero() and prevent writing to a read-only fake head. It is
redundant without fake heads.
See
622026e87c
mm/hugetlb: remove fake head pages
HugeTLB Vmemmap Optimization (HVO) reduces memory usage by freeing most vmemmap pages for huge pages and remapping the freed range to a single page containing the struct page metadata. With the new mask-based compound_info encoding (for power-of-2 struct page sizes), all tail pages of the same order are now identical regardless of which compound page they belong to. This means the tail pages can be truly shared without fake heads. Allocate a single page of initialized tail struct pages per zone per order in the vmemmap_tails[] array in struct zone. All huge pages of that order in the zone share this tail page, mapped read-only into their vmemmap. The head page remains unique per huge page. Redefine MAX_FOLIO_ORDER using ilog2(). The define has to produce a compile-time constant as it is used to specify the vmemmap_tails array size. For some reason, the compiler is not able to resolve get_order() at compile time, but ilog2() works. Avoid using PUD_ORDER to define MAX_FOLIO_ORDER, as it adds a dependency on <linux/pgtable.h>, which generates a hard-to-break include loop. This eliminates fake heads while maintaining the same memory savings, and simplifies compound_head() by removing fake head detection. Link: https://lkml.kernel.org/r/20260227194302.274384-13-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
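A sketch of the data structure this adds (field shape inferred from the text above):

	struct zone {
		/* ... existing fields ... */
	#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
		/* one shared, read-only page of tail struct pages per
		 * order; every HVO huge page of that order in the zone
		 * maps it into its vmemmap */
		struct page *vmemmap_tails[MAX_FOLIO_ORDER + 1];
	#endif
	};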
76351f2f0c
x86/vdso: undefine CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP for vdso32
The 32-bit VDSO build on x86_64 uses fake_32bit_build.h to undefine various kernel configuration options that are not suitable for the VDSO context or may cause build issues when including kernel headers. Undefine CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP in fake_32bit_build.h to prepare for change in HugeTLB Vmemmap Optimization. Link: https://lkml.kernel.org/r/20260227194302.274384-12-kas@kernel.org Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
c0b495b91a
mm/hugetlb: refactor code around vmemmap_walk
To prepare for removing fake head pages, the vmemmap_walk code is being reworked. The reuse_page and reuse_addr variables are being eliminated. There will no longer be an expectation regarding the reuse address in relation to the operated range. Instead, the caller will provide head and tail vmemmap pages. Currently, vmemmap_head and vmemmap_tail are set to the same page, but this will change in the future. The only functional change is that __hugetlb_vmemmap_optimize_folio() will abandon optimization if memory allocation fails. Link: https://lkml.kernel.org/r/20260227194302.274384-11-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Hildenbrand (arm) <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
209e6d9eb1
mm/hugetlb: defer vmemmap population for bootmem hugepages
Currently, the vmemmap for bootmem-allocated gigantic pages is populated early in hugetlb_vmemmap_init_early(). However, the zone information is only available after zones are initialized. If it is later discovered that a page spans multiple zones, the HVO mapping must be undone and replaced with a normal mapping using vmemmap_undo_hvo(). Defer the actual vmemmap population to hugetlb_vmemmap_init_late(). At this stage, zones are already initialized, so it can be checked if the page is valid for HVO before deciding how to populate the vmemmap. This allows us to remove vmemmap_undo_hvo() and the complex logic required to rollback HVO mappings. In hugetlb_vmemmap_init_late(), if HVO population fails or if the zones are invalid, fall back to a normal vmemmap population. Postponing population until hugetlb_vmemmap_init_late() also makes zone information available from within vmemmap_populate_hvo(). Link: https://lkml.kernel.org/r/20260227194302.274384-10-kas@kernel.org Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
9f94db4c7e
mm/sparse: check memmap alignment for compound_info_has_mask()
If page->compound_info encodes a mask, it is expected that vmemmap to be naturally aligned to the maximum folio size. Add a VM_WARN_ON_ONCE() to check the alignment. Link: https://lkml.kernel.org/r/20260227194302.274384-9-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Hildenbrand (arm) <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
8c846c879e
mm: rework compound_head() for power-of-2 sizeof(struct page)
For tail pages, the kernel uses the 'compound_info' field to get to the head page. The bit 0 of the field indicates whether the page is a tail page, and if set, the remaining bits represent a pointer to the head page. For cases when size of struct page is power-of-2, change the encoding of compound_info to store a mask that can be applied to the virtual address of the tail page in order to access the head page. It is possible because struct page of the head page is naturally aligned with regards to order of the page. The significant impact of this modification is that all tail pages of the same order will now have identical 'compound_info', regardless of the compound page they are associated with. This paves the way for eliminating fake heads. The HugeTLB Vmemmap Optimization (HVO) creates fake heads and it is only applied when the sizeof(struct page) is power-of-2. Having identical tail pages allows the same page to be mapped into the vmemmap of all pages, maintaining memory savings without fake heads. If sizeof(struct page) is not power-of-2, there is no functional changes. Limit mask usage to HugeTLB vmemmap optimization (HVO) where it makes a difference. The approach with mask would work in the wider set of conditions, but it requires validating that struct pages are naturally aligned for all orders up to the MAX_FOLIO_ORDER, which can be tricky. Link: https://lkml.kernel.org/r/20260227194302.274384-8-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
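A sketch of the encoding for the power-of-2 case (helper shape illustrative): a compound page of order N covers sizeof(struct page) << N bytes of naturally aligned vmemmap, so a mask recovers the head from any tail's address.

	static inline void set_compound_info_mask(struct page *tail,
			unsigned int order)
	{
		unsigned long span = sizeof(struct page) << order;

		/* bit 0 marks a tail; the high bits mask any tail's
		 * address down to the head's address */
		WRITE_ONCE(tail->compound_info, ~(span - 1) | 1UL);
	}

All tails of a given order get the same value, which is what later allows the shared tail pages.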
2969b42c8f
LoongArch/mm: align vmemmap to maximal folio size
The upcoming change to the HugeTLB vmemmap optimization (HVO) requires struct pages of the head page to be naturally aligned with regard to the folio size. Align vmemmap to MAX_FOLIO_VMEMMAP_ALIGN. Link: https://lkml.kernel.org/r/20260227194302.274384-7-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Hildenbrand (arm) <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
476849b0fb
riscv/mm: align vmemmap to maximal folio size
The upcoming change to the HugeTLB vmemmap optimization (HVO) requires struct pages of the head page to be naturally aligned with regard to the folio size. Align vmemmap to the newly introduced MAX_FOLIO_VMEMMAP_ALIGN. Link: https://lkml.kernel.org/r/20260227194302.274384-6-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Hildenbrand (arm) <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
67c79a5af0
mm: move set/clear_compound_head() next to compound_head()
Move set_compound_head() and clear_compound_head() to be adjacent to the compound_head() function in page-flags.h. These functions encode and decode the same compound_info field, so keeping them together makes it easier to verify their logic is consistent, especially when the encoding changes. Link: https://lkml.kernel.org/r/20260227194302.274384-5-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (arm) <david@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
d50569612c |
mm: rename the 'compound_head' field in the 'struct page' to 'compound_info'
The 'compound_head' field in the 'struct page' encodes whether the page is a tail and where to locate the head page. Bit 0 is set if the page is a tail, and the remaining bits in the field point to the head page. As preparation for changing how the field encodes information about the head page, rename the field to 'compound_info'. Link: https://lkml.kernel.org/r/20260227194302.274384-4-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (arm) <david@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
f0369fb136 |
mm: change the interface of prep_compound_tail()
Instead of passing down the head page and tail page index, pass the tail and head pages directly, as well as the order of the compound page. This is a preparation for changing how the head position is encoded in the tail page. Link: https://lkml.kernel.org/r/20260227194302.274384-3-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (arm) <david@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
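A before/after sketch of the prototype change, inferred from the changelog above (the exact parameter names and types in the patch may differ):

```c
struct page;	/* opaque stand-in for the kernel type */

/* Before: the tail was derived from the head page plus an index. */
void prep_compound_tail_old(struct page *head, int tail_idx);

/* After (inferred): the tail and head pages are passed directly,
 * together with the order of the compound page. */
void prep_compound_tail_new(struct page *tail, struct page *head,
			    unsigned int order);
```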
a2c77ec320 |
mm: move MAX_FOLIO_ORDER definition to mmzone.h
Patch series "mm: Eliminate fake head pages from vmemmap optimization", v7. This series removes "fake head pages" from the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page. It simplifies compound_head() and page_ref_add_unless(). Both are in the hot path. Background ========== HVO reduces memory overhead by freeing vmemmap pages for HugeTLB pages and remapping the freed virtual addresses to a single physical page. Previously, all tail page vmemmap entries were remapped to the first vmemmap page (containing the head struct page), creating "fake heads" - tail pages that appear to have PG_head set when accessed through the deduplicated vmemmap. This required special handling in compound_head() to detect and work around fake heads, adding complexity and overhead to a very hot path. New Approach ============ For architectures/configs where sizeof(struct page) is a power of 2 (the common case), this series changes how position of the head page is encoded in the tail pages. Instead of storing a pointer to the head page, the ->compound_info (renamed from ->compound_head) now stores a mask. The mask can be applied to any tail page's virtual address to compute the head page address. Critically, all tail pages of the same order now have identical compound_info values, regardless of which compound page they belong to. The key insight is that all tail pages of the same order now have identical compound_info values, regardless of which compound page they belong to. In v7, these shared tail pages are allocated per-zone. This ensures that zone information (stored in page->flags) is correct even for shared tail pages, removing the need for the special-casing in page_zonenum() proposed in earlier versions. To support per-zone shared pages for boot-allocated gigantic pages, the vmemmap population is deferred until zones are initialized. This simplifies the logic significantly and allows the removal of vmemmap_undo_hvo(). Benefits ======== 1. Simplified compound_head(): No fake head detection needed, can be implemented in a branchless manner. 2. Simplified page_ref_add_unless(): RCU protection removed since there's no race with fake head remapping. 3. Cleaner architecture: The shared tail pages are truly read-only and contain valid tail page metadata. If sizeof(struct page) is not power-of-2, there are no functional changes. HVO is not supported in this configuration. I had hoped to see performance improvement, but my testing thus far has shown either no change or only a slight improvement within the noise. Series Organization =================== Patch 1: Move MAX_FOLIO_ORDER definition to mmzone.h. Patches 2-4: Refactoring of field names and interfaces. Patches 5-6: Architecture alignment for LoongArch and RISC-V. Patch 7: Mask-based compound_head() implementation. Patch 8: Add memmap alignment checks. Patch 9: Branchless compound_head() optimization. Patch 10: Defer vmemmap population for bootmem hugepages. Patch 11: Refactor vmemmap_walk. Patch 12: x86 vDSO build fix. Patch 13: Eliminate fake heads with per-zone shared tail pages. Patches 14-16: Cleanup of fake head infrastructure. Patch 17: Documentation update. Patch 18: Use compound_head() in page_slab(). This patch (of 17): Move MAX_FOLIO_ORDER definition from mm.h to mmzone.h. This is preparation for adding the vmemmap_tails array to struct zone, which requires MAX_FOLIO_ORDER to be available in mmzone.h. 
Link: https://lkml.kernel.org/r/20260227194302.274384-1-kas@kernel.org Link: https://lkml.kernel.org/r/20260227194302.274384-2-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Muchun Song <muchun.song@linux.dev> Acked-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
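A minimal userspace sketch of the mask-based encoding described in the series above; the struct layout, the low tail bit, and head_of() are illustrative assumptions, since the kernel's real logic lives behind compound_head() and may differ in detail:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy stand-in: the mask trick needs sizeof(struct page) to be a power of 2. */
struct page {
	uintptr_t compound_info;	/* bit 0 = tail marker, rest = mask */
	uintptr_t pad;			/* pad the struct to 16 bytes */
};

static struct page *head_of(struct page *p)
{
	uintptr_t info = p->compound_info;

	if (!(info & 1))
		return p;	/* not a tail: this is the head itself */
	/* Apply the stored mask to the tail's own address. */
	return (struct page *)((uintptr_t)p & (info & ~(uintptr_t)1));
}

int main(void)
{
	unsigned int order = 4;	/* a 16-page compound page */
	size_t n = 1UL << order, bytes = n * sizeof(struct page);
	/*
	 * Natural alignment of the whole block is what the vmemmap
	 * alignment patches in this series guarantee for the real memmap.
	 */
	struct page *head = aligned_alloc(bytes, bytes);
	uintptr_t mask = ~(uintptr_t)(bytes - 1);

	assert(head);
	head->compound_info = 0;	/* head: tail bit clear */
	/* Every tail of this order stores the same value: mask | 1. */
	for (size_t i = 1; i < n; i++)
		head[i].compound_info = mask | 1;

	assert(head_of(&head[7]) == head);
	printf("head recovered from an arbitrary tail: OK\n");
	free(head);
	return 0;
}
```

Because every tail of a given order stores the same value, the tail struct pages can be shared read-only across all folios of that order, which is what allows the later patches in the series to drop the fake heads entirely.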
c09fb53d29 |
zram: use statically allocated compression algorithm names
Currently, zram dynamically allocates memory for compressor algorithm names when they are set by the user. This requires careful memory management, including explicit `kfree` calls and special handling to avoid freeing statically defined default compressor names. This patch refactors the way zram handles compression algorithm names. Instead of storing dynamically allocated copies, `zram->comp_algs` will now store pointers directly to the static name strings defined within the `zcomp_ops` backend structures, thereby removing the need for conditional `kfree` calls. Link: https://lkml.kernel.org/r/5bb2e9318d124dbcb2b743dcdce6a950@honor.com Signed-off-by: gao xu <gaoxu2@honor.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
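The ownership change in miniature (a hedged sketch: zcomp_ops exists in zram, but the field names and setup here are simplified stand-ins for the real structures):

```c
#include <stdio.h>

/* Each compression backend carries its name as a static string. */
struct zcomp_ops {
	const char *name;
};

static const struct zcomp_ops lzo_ops  = { .name = "lzo"  };
static const struct zcomp_ops zstd_ops = { .name = "zstd" };

struct zram_like {
	/*
	 * Points directly at a backend's static name: nothing to
	 * strdup() when set, nothing to free() when changed or torn down.
	 */
	const char *comp_alg;
};

static void set_alg(struct zram_like *z, const struct zcomp_ops *ops)
{
	z->comp_alg = ops->name;
}

int main(void)
{
	struct zram_like z = { .comp_alg = lzo_ops.name };	/* default */

	set_alg(&z, &zstd_ops);	/* user picked zstd: just repoint */
	printf("compressor: %s\n", z.comp_alg);
	return 0;
}
```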
511f04aac4 |
folio_batch: rename PAGEVEC_SIZE to FOLIO_BATCH_SIZE
struct pagevec no longer exists. Rename the macro appropriately. Link: https://lkml.kernel.org/r/20260225-pagevec_cleanup-v2-4-716868cc2d11@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
4e1d77a8f3 |
folio_batch: rename pagevec.h to folio_batch.h
struct pagevec was removed in commit
|
||
|
|
ab5193e919 |
fs: remove unnecessary pagevec.h includes
Remove unused pagevec.h includes from .c files. These were found with the following command:

    grep -rl '#include.*pagevec\.h' --include='*.c' | while read f; do
        grep -qE 'PAGEVEC_SIZE|folio_batch' "$f" || echo "$f"
    done

There are probably more removal candidates in .h files, but those are more complex to analyze.

Link: https://lkml.kernel.org/r/20260225-pagevec_cleanup-v2-2-716868cc2d11@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
cbf56f9981 |
mm: remove stray references to struct pagevec
Patch series "mm: Remove stray references to pagevec", v2. struct pagevec was removed in commit |
||
|
|
019fc36872 |
kho: fix KASAN support for restored vmalloc regions
Restored vmalloc regions are currently not properly marked for KASAN,
causing KASAN to treat accesses to these regions as out-of-bounds.
Fix this by properly unpoisoning the restored vmalloc area using
kasan_unpoison_vmalloc(). This requires setting the VM_UNINITIALIZED flag
during the initial area allocation and clearing it after the pages have
been mapped and unpoisoned, using the clear_vm_uninitialized_flag()
helper.
Link: https://lkml.kernel.org/r/20260225223857.1714801-3-pasha.tatashin@soleen.com
Fixes:
|
||
|
|
ec10636539 |
mm/vmalloc: export clear_vm_uninitialized_flag()
Patch series "Fix KASAN support for KHO restored vmalloc regions". When KHO restores a vmalloc area, it maps existing physical pages into a newly allocated virtual memory area. However, because these areas were not properly unpoisoned, KASAN would treat any access to the restored region as out-of-bounds, as seen in the following trace: BUG: KASAN: vmalloc-out-of-bounds in kho_test_restore_data.isra.0+0x17b/0x2cd Read of size 8 at addr ffffc90000025000 by task swapper/0/1 [...] Call Trace: [...] kasan_report+0xe8/0x120 kho_test_restore_data.isra.0+0x17b/0x2cd kho_test_init+0x15a/0x1f0 do_one_initcall+0xd5/0x4b0 The fix involves deferring KASAN's default poisoning by using the VM_UNINITIALIZED flag during allocation, manually unpoisoning the memory once it is correctly mapped, and then clearing the uninitialized flag using a newly exported helper. This patch (of 2): Make clear_vm_uninitialized_flag() available to other parts of the kernel that need to manage vmalloc areas manually, such as KHO for restoring vmallocs. Link: https://lkml.kernel.org/r/20260225220223.1695350-1-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20260225223857.1714801-2-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
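Putting the two patches together, the restore path described in the changelog looks roughly like this (a kernel-side sketch, not the patch's actual code; the KASAN flag and the helper prototypes are assumptions based on mainline vmalloc/KASAN interfaces):

```c
/* Sketch of the deferred-poisoning restore sequence described above. */
static void *restore_vmalloc_region(unsigned long size)
{
	struct vm_struct *area;
	void *addr;

	/* 1. Allocate the virtual area with KASAN poisoning deferred. */
	area = get_vm_area(size, VM_ALLOC | VM_UNINITIALIZED);
	if (!area)
		return NULL;
	addr = area->addr;

	/* 2. ... map the preserved physical pages into addr ... */

	/* 3. Unpoison the now-valid mapping so KASAN accepts accesses. */
	addr = kasan_unpoison_vmalloc(addr, size, KASAN_VMALLOC_PROT_NORMAL);

	/* 4. Mark the area fully initialized via the newly exported helper. */
	clear_vm_uninitialized_flag(area);

	return addr;
}
```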
da735962d0 |
kfence: add kfence.fault parameter
Add kfence.fault parameter to control the behavior when a KFENCE error is detected (similar in spirit to kasan.fault=<mode>). The supported modes for kfence.fault=<mode> are:

- report: print the error report and continue (default).
- oops: print the error report and oops.
- panic: print the error report and panic.

In particular, the 'oops' mode offers a trade-off between no mitigation on report and panicking outright (if panic_on_oops is not set).

Link: https://lkml.kernel.org/r/20260225203639.3159463-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <kees@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
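As a usage note (an illustration, not from the patch text): the new mode is selected like other KFENCE boot parameters, e.g. by appending kfence.fault=oops to the kernel command line next to an existing option such as kfence.sample_interval=100. With panic_on_oops unset, a detected error is then reported and the offending context oopses without taking down the whole machine.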
3efb980055 |
mm: do not map the shadow stack as THP
The default shadow stack size allocated on first prctl() for the main thread or subsequently on clone() is either half of RLIMIT_STACK or half of a thread's stack size (for arm64). Both of these are likely to be suitable for a THP allocation and the kernel is more aggressive in creating such mappings. However, it does not make much sense to use a huge page. It didn't make sense for the normal stacks either, see commit |
||
|
|
a515ffc9de |
x86: shstk: use the new common vm_mmap_shadow_stack() helper
Replace part of the x86 alloc_shstk() content with a call to vm_mmap_shadow_stack(). There is no functional change. Link: https://lkml.kernel.org/r/20260225161404.3157851-5-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Deepak Gupta <debug@rivosinc.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <pjw@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
fecd446f0c |
riscv: shstk: use the new common vm_mmap_shadow_stack() helper
Replace part of the allocate_shadow_stack() content with a call to vm_mmap_shadow_stack(). There is no functional change. Link: https://lkml.kernel.org/r/20260225161404.3157851-4-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Deepak Gupta <debug@rivosinc.com> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Paul Walmsley <pjw@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |