Reduce 22 declarations of empty_zero_page to 3 and 23 declarations of
ZERO_PAGE() to 4.
Every architecture defines empty_zero_page one way or another, but for most
of them it is always a page-aligned page in BSS, and most definitions of
ZERO_PAGE() do virt_to_page(empty_zero_page).
Move the Linus-vetted x86 definition of empty_zero_page and ZERO_PAGE() to
the core MM and drop these definitions from the architectures that do not
implement a colored zero page (i.e. everything except MIPS and s390).
ZERO_PAGE() remains a macro because turning it into a wrapper for a static
inline causes severe pain in header dependencies.
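A minimal sketch of what the generic definitions amount to, modeled on the
x86 variant mentioned above (the exact placement and attributes in the real
patch may differ):

    /* A page of zeros, page aligned, living in BSS. */
    unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
            __page_aligned_bss;
    EXPORT_SYMBOL(empty_zero_page);

    /* Kept a macro to avoid the header dependency pain noted above. */
    #define ZERO_PAGE(vaddr)    virt_to_page(empty_zero_page)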
For the most part the change is mechanical; the noteworthy exceptions are:
* alpha: aliased empty_zero_page with ZERO_PGE, which was also used for boot
parameters. Switching to the generic empty_zero_page removes the aliasing
and keeps ZERO_PGE for boot parameters only
* arm64: uses __pa_symbol() in ZERO_PAGE(), so its definition of
ZERO_PAGE() is kept intact.
* m68k/parisc/um: allocated empty_zero_page from memblock even though they
do not support zero page coloring; having it in BSS works fine.
* sparc64: can have empty_zero_page in BSS rather than allocating it, but it
can't use virt_to_page() for BSS. Keep its definition of ZERO_PAGE(),
but instead of allocating the page, make mem_map_zero point to
empty_zero_page.
* sh: used empty_zero_page for boot parameters during very early boot.
Rename the parameters page to boot_params_page and let sh use the generic
empty_zero_page.
* hexagon: had an amusing comment about empty_zero_page
/* A handy thing to have if one has the RAM. Declared in head.S */
that unfortunately had to go :)
Link: https://lkml.kernel.org/r/20260211103141.3215197-4-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Helge Deller <deller@gmx.de> [parisc]
Tested-by: Helge Deller <deller@gmx.de> [parisc]
Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Magnus Lindholm <linmag7@gmail.com> [alpha]
Acked-by: Dinh Nguyen <dinguyen@kernel.org> [nios2]
Acked-by: Andreas Larsson <andreas@gaisler.com> [sparc]
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add test_zswap_incompressible() to verify that the zswap_incomp memcg stat
correctly tracks incompressible pages.
The test allocates memory filled with random data from /dev/urandom, which
cannot be effectively compressed by zswap. When this data is swapped out
to zswap, it should be stored as-is and tracked by the zswap_incomp
counter.
The test verifies that:
1. Pages are swapped out to zswap (zswpout increases)
2. Incompressible pages are tracked (zswap_incomp increases)
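A minimal standalone sketch of the core idea (this is not the actual
selftest code, which uses the cgroup selftest helpers; MADV_PAGEOUT is only
used here to push the buffer into zswap):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #ifndef MADV_PAGEOUT
    #define MADV_PAGEOUT 21             /* reclaim these pages */
    #endif

    int main(void)
    {
            size_t size = 4UL << 20;    /* 4 MiB */
            size_t off = 0;
            char *mem;
            int fd;

            mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (mem == MAP_FAILED)
                    return 1;

            /* Random data compresses poorly, so zswap stores it as-is. */
            fd = open("/dev/urandom", O_RDONLY);
            if (fd < 0)
                    return 1;
            while (off < size) {
                    ssize_t r = read(fd, mem + off, size - off);
                    if (r <= 0)
                            return 1;
                    off += r;
            }
            close(fd);

            /* Push the pages out; zswpout and zswap_incomp should rise. */
            if (madvise(mem, size, MADV_PAGEOUT))
                    perror("madvise(MADV_PAGEOUT)");

            munmap(mem, size);
            return 0;
    }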
test:
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo Y > /sys/module/zswap/parameters/enabled
./test_zswap
TAP version 13
1..8
ok 1 test_zswap_usage
ok 2 test_swapin_nozswap
ok 3 test_zswapin
ok 4 test_zswap_writeback_enabled
ok 5 test_zswap_writeback_disabled
ok 6 test_no_kmem_bypass
ok 7 test_no_invasive_cgroup_shrink
ok 8 test_zswap_incompressible
Totals: pass:8 fail:0 xfail:0 xpass:0 skip:0 error:0
Link: https://lkml.kernel.org/r/20260213071827.5688-3-jiayuan.chen@linux.dev
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: zswap: add per-memcg stat for incompressible pages", v3.
In containerized environments, knowing which cgroup is contributing
incompressible pages to zswap is essential for effective resource
management. This series adds a new per-memcg stat 'zswap_incomp' to track
incompressible pages, along with a selftest.
This patch (of 2):
The global zswap_stored_incompressible_pages counter was added in commit
dca4437a58 ("mm/zswap: store <PAGE_SIZE compression failed page as-is")
to track how many pages are stored in raw (uncompressed) form in zswap.
However, in containerized environments, knowing which cgroup is
contributing incompressible pages is essential for effective resource
management [1].
Add a new memcg stat 'zswap_incomp' to track incompressible pages per
cgroup. This helps administrators and orchestrators to:
1. Identify workloads that produce incompressible data (e.g., encrypted
data, already-compressed media, random data) and may not benefit from
zswap.
2. Make informed decisions about workload placement - moving
incompressible workloads to nodes with larger swap backing devices
rather than relying on zswap.
3. Debug zswap efficiency issues at the cgroup level without needing to
correlate global stats with individual cgroups.
While the compression ratio can be estimated from existing stats
(zswap / (zswapped * PAGE_SIZE)), this doesn't distinguish between "uniformly
poor compression" and "a few completely incompressible pages mixed with
highly compressible ones": a 50% overall ratio could mean every page
compresses to half its size, or that half the pages are stored raw while the
rest compress very well. The zswap_incomp stat provides direct visibility
into the latter case.
Link: https://lkml.kernel.org/r/20260213071827.5688-1-jiayuan.chen@linux.dev
Link: https://lkml.kernel.org/r/20260213071827.5688-2-jiayuan.chen@linux.dev
Link: https://lore.kernel.org/linux-mm/CAF8kJuONDFj4NAksaR4j_WyDbNwNGYLmTe-o76rqU17La=nkOw@mail.gmail.com/ [1]
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We currently have two different sets of helpers for getting or putting the
private IDs' refcount for order 0 and large folios. This is redundant.
Just use one set: always acquire the refcount by the swapped-out folio's
size unless it is zero, and put the refcount using the same folio size if
the charge fails, since the folio size can't change in between. Then there
is no need to update the refcount for tail pages.
The same applies to freeing, so only one pair of get/put helpers is needed
now.
The performance might be slightly better, too: both "inc unless zero" and
"add unless zero" use the same cmpxchg implementation. For large folios, we
save an atomic operation, and for both order 0 and large folios, we save a
branch.
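A hedged sketch of the single get/put pair described above; the real helper
names and signatures in mm/memcontrol.c may differ:

    /*
     * "add unless zero": the same cmpxchg loop covers order-0 and
     * large folios, so tail pages never need separate updates.
     */
    static bool swap_priv_id_tryget(atomic_t *ref, unsigned int nr_pages)
    {
            return atomic_add_unless(ref, nr_pages, 0);
    }

    /*
     * The folio size cannot change between get and put, so the same
     * nr_pages undoes the get if the charge fails.
     */
    static void swap_priv_id_put(atomic_t *ref, unsigned int nr_pages)
    {
            atomic_sub(nr_pages, ref);
    }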
Link: https://lkml.kernel.org/r/20260213-memcg-privid-v1-1-d8cb7afcf831@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Acked-by: Shakeel Butt <shakeel.butt@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
File seals are used on memfds to make shared memory communication with
untrusted peers safer and simpler. Seals provide a guarantee that certain
operations, such as writes or truncations, won't be allowed on the file.
Maintaining these guarantees across a live update will help keep such use
cases secure.
These guarantees will also be needed for IOMMUFD preservation with LUO.
Normally when IOMMUFD maps a memfd, it pins all its pages to make sure any
truncation operations on the memfd don't lead to IOMMUFD using freed
memory. This doesn't work with LUO since the preserved memfd might have
completely different pages after a live update, and mapping them back to
the IOMMUFD will cause all sorts of problems. Using and preserving the
seals allows IOMMUFD preservation logic to trust the memfd.
Since the uABI defines seals as an int, preserve them by introducing a new
u32 field. There are currently only 6 possible seals, so the extra bits
are unused and provide room for future expansion. Since the seals are
uABI, it is safe to use them directly in the ABI. While at it, also add a
u32 flags field. It makes sure the struct is nicely aligned, and can be
used later to support things like MFD_CLOEXEC.
Since the serialization structure is changed, bump the version number to
"memfd-v2".
It is important to note that the memfd-v2 version only supports seals that
existed when this version was defined. This set is defined by
MEMFD_LUO_ALL_SEALS. Any new seal might bring completely different
semantics with it, and the parser for memfd-v2 cannot be expected to deal
with that. If any new seals are added in the future, they will need another
version bump.
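A hedged sketch of the memfd-v2 serialization layout described above; the
field names and ordering here are assumptions, not the exact upstream
structure:

    struct memfd_luo_v2 {
            __u32 seals;    /* F_SEAL_* bits, stored with their uABI values */
            __u32 flags;    /* keeps the struct aligned; room for e.g. MFD_CLOEXEC */
            /* ... the existing memfd preservation fields follow ... */
    };

    /*
     * Only the seals that existed when memfd-v2 was defined may be
     * preserved; any new seal requires another version bump.
     */
    #define MEMFD_LUO_ALL_SEALS     (F_SEAL_SEAL | F_SEAL_SHRINK |     \
                                     F_SEAL_GROW | F_SEAL_WRITE |      \
                                     F_SEAL_FUTURE_WRITE | F_SEAL_EXEC)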
Link: https://lkml.kernel.org/r/20260216185946.1215770-3-pratyush@kernel.org
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Tested-by: Samiullah Khawaja <skhawaja@google.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Clean up and simplify how we check whether a folio is swapped. The helper
already requires the folio to be in the swap cache and locked. That's enough
to keep the swap cluster from being freed, so there is no need to lock
anything else to avoid a UAF.
Besides, swap operations are now cleaned up and defined to be mostly folio
based, and the only place a folio can have any of its swap slots' count
increased from 0 to 1 is folio_dup_swap(), which also requires the folio
lock. So while we hold the folio lock here, a folio can't change its swap
status from not swapped (all swap slots have a count of 0) to swapped (some
slot has a count larger than 0).
So there won't be any false negatives from this helper if we simply depend
on the folio lock to stabilize the cluster.
We only use this helper to determine whether we can and should release the
swap cache, so false positives are completely harmless, and they could
already occur before this change: depending on the timing, a racing thread
could release the swap count right after the ci lock was dropped and before
this helper returned. In any case, the worst that can happen is that we
leave behind a clean swap cache, which will still be reclaimed just fine
when under pressure.
So, in conclusion, the check can be made much simpler and lockless. Also,
rename it to folio_maybe_swapped() to reflect the design.
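A hedged sketch of what the resulting helper boils down to;
swap_slot_count() is a placeholder for the real swap-table accessor, and
the actual code may differ:

    static bool folio_maybe_swapped(struct folio *folio)
    {
            swp_entry_t entry = folio->swap;
            long i, nr = folio_nr_pages(folio);

            /* Folio lock + swap cache pin the cluster; no other lock. */
            VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
            VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);

            for (i = 0; i < nr; i++)
                    if (swap_slot_count(swp_entry(swp_type(entry),
                                                  swp_offset(entry) + i)))
                            return true;
            return false;
    }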
Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-11-f4e34be021a7@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now that all the infrastructure is ready, switch to using the swap table
only. This is unfortunately a large patch because the whole old counting
mechanism, especially SWP_CONTINUED, has to go and be replaced by the new
mechanism in one step, with no intermediate steps available.
The swap table is capable of holding up to SWP_TB_COUNT_MAX - 1 counts in
the higher bits of each table entry, so using that, the swap_map can be
completely dropped.
swap_map also had a limit of SWAP_CONT_MAX; any value beyond that limit
required a COUNT_CONTINUED page. COUNT_CONTINUED is a bit complex to
maintain, so for the swap table, a simpler approach is used: when the
count goes beyond SWP_TB_COUNT_MAX - 1, the cluster will have an
extend_table allocated, which is a swap cluster-sized array of unsigned
int. The counting is basically offloaded there until the count drops
below SWP_TB_COUNT_MAX again.
Both the swap table and the extend table are cluster-based, so they
exhibit good performance and sparsity.
To make the switch from swap_map to swap table clean, this commit cleans
up and introduces a new set of functions based on the swap table design,
for manipulating swap counts:
- __swap_cluster_dup_entry, __swap_cluster_put_entry,
__swap_cluster_alloc_entry, __swap_cluster_free_entry:
Increase/decrease the count of a swap slot, or alloc / free a swap
slot. This is the internal routine that does the counting work based
on the swap table and handles all the complexities. The caller will
need to lock the cluster before calling them.
All swap count-related update operations are wrapped by these four
helpers.
- swap_dup_entries_cluster, swap_put_entries_cluster:
Increase/decrease the swap count of one or a set of swap slots in the
same cluster range. These two helpers serve as the common routines for
folio_dup_swap & swap_dup_entry_direct, or
folio_put_swap & swap_put_entries_direct.
Use these helpers to replace all existing callers. This simplifies the
count tracking by a lot, and the swap_map is gone.
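A rough sketch of the count overflow handling described earlier; the
signature of __swap_cluster_dup_entry(), the accessor names
swp_tb_get_count()/swp_tb_set_count(), and the allocation details are
placeholders, not the actual code:

    /* Called with the cluster locked. */
    static int __swap_cluster_dup_entry(struct swap_cluster_info *ci,
                                        unsigned int ci_off)
    {
            unsigned int count = swp_tb_get_count(ci, ci_off);

            /* Counts up to SWP_TB_COUNT_MAX - 1 fit in the table entry. */
            if (count < SWP_TB_COUNT_MAX - 1) {
                    swp_tb_set_count(ci, ci_off, count + 1);
                    return 0;
            }

            /* Saturated: offload to a cluster-sized array of unsigned int. */
            if (!ci->extend_table) {
                    ci->extend_table = kzalloc(SWAPFILE_CLUSTER *
                                               sizeof(unsigned int),
                                               GFP_ATOMIC);
                    if (!ci->extend_table)
                            return -ENOMEM;
            }
            ci->extend_table[ci_off]++;
            return 0;
    }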
[ryncsn@gmail.com: fix build]
Link: https://lkml.kernel.org/r/aZWuLZi-vYi3vAWe@KASONG-MC4
Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-9-f4e34be021a7@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The swap table entry will need 4 bits reserved for the swap count in the
shadow, so the anon shadow must keep its leading 4 bits zero.
This should be OK for the foreseeable future. Take 52 bits of physical
address space as an example: with 4K pages, there would be at most 40 bits
of addressable pages. Currently, we have 36 bits available (64 - 1 - 16 -
10 - 1, where XA_VALUE takes 1 bit for the marker, MEM_CGROUP_ID_SHIFT takes
16 bits, NODES_SHIFT takes <=10 bits, and the WORKINGSET flag takes 1 bit).
So in the worst case, we previously needed to pack the 40 bits of address
into a 36-bit field using a 64K bucket (bucket_order = 4). After this
change, the bucket grows to 1M, which should be fine: on such large
machines, the working set size will be way larger than the bucket size.
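Spelling out the arithmetic for the 52-bit example above:

    bits available = 64 - 1 (XA_VALUE) - 16 (MEM_CGROUP_ID_SHIFT)
                     - 10 (NODES_SHIFT) - 1 (workingset flag) = 36
    before: bucket_order = 40 - 36 = 4  ->  2^4 pages * 4K =  64K bucket
    after:  36 - 4 (swap count bits) = 32 usable bits
            bucket_order = 40 - 32 = 8  ->  2^8 pages * 4K =   1M bucket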
And for MGLRU's gen number tracking, this is even more than enough:
MGLRU's gen number (max_seq) increments much more slowly than the eviction
counter (nonresident_age).
After all, both the refault distance and the gen distance are only hints
that can tolerate inaccuracy just fine.
The 4 bits can be shrunk to 3, or extended to a higher value, if needed
later.
Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-5-f4e34be021a7@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Slightly clean up the swapon process. Add comments about what swap_lock
protects, introduce and rename helpers that wrap the swap_map and
cluster_info setup, and do that setup outside of swap_lock.
The lock protection is not needed for the swap_map and cluster_info setup
because all swap users must either hold the percpu ref or hold a stable,
allocated swap entry (e.g., by locking a folio in the swap cache) before
accessing them. So before the swap device is exposed by enable_swap_info(),
nothing can use the device's map or cluster info.
So it is safe to allocate and set up the swap data freely first, then expose
the swap device and set the SWP_WRITEOK flag.
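A hedged sketch of the resulting ordering in swapon(); the setup helper
names below are placeholders for the wrappers this patch introduces, and
the real code differs in detail:

    /*
     * No swap_lock needed yet: the device is not reachable by any
     * swap user until enable_swap_info() exposes it.
     */
    error = setup_swap_map(si, swap_header);            /* placeholder name */
    if (!error)
            error = setup_clusters(si, swap_header);    /* placeholder name */
    if (error)
            goto bad_swap;

    /* Only publishing the device and setting SWP_WRITEOK needs locking. */
    enable_swap_info(si, prio, swap_map, cluster_info);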
Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-2-f4e34be021a7@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm, swap: swap table phase III: remove swap_map", v3.
This series removes the static swap_map and uses the swap table for the
swap count directly. This saves about 30% of the memory used for static
swap metadata; for example, it saves 256MB of memory when mounting a 1TB
swap device. Performance is slightly better too, since the double update of
the swap table and swap_map is now gone.
Test results:
Mounting a swap device:
=======================
Mount a 1TB brd device as SWAP, just to verify the memory savings:
`free -m` before:
total used free shared buff/cache available
Mem: 1465 1051 417 1 61 413
Swap: 1054435 0 1054435
`free -m` after:
total used free shared buff/cache available
Mem: 1465 795 672 1 62 670
Swap: 1054435 0 1054435
Idle memory usage is reduced by ~256MB, just as expected. Following this
design, we should be able to save another ~512MB in a later phase.
Build kernel test:
==================
Test using ZSWAP with NVMe SWAP, make -j48, defconfig, in an x86_64 VM
with 5G RAM, under global pressure, avg of 32 test runs:
                     Before        After
    System time:     1038.97s      1013.75s (-2.4%)
Test using ZRAM as SWAP, make -j12, tinyconfig, in an ARM64 VM with 1.5G
RAM, under global pressure, avg of 32 test runs:
                     Before        After
    System time:     67.75s        66.65s (-1.6%)
The result is slightly better.
Redis / Valkey benchmark:
=========================
Test using ZRAM as SWAP, in an ARM64 VM with 1.5G RAM, under global
pressure, avg of 64 test runs:
Server: valkey-server --maxmemory 2560M
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
                 no persistence              with BGSAVE
    Before:      472705.71 RPS               369451.68 RPS
    After:       481197.93 RPS (+1.8%)       374922.32 RPS (+1.5%)
In conclusion, performance is better in all cases, and memory usage is
much lower.
The swap cgroup array will also be merged into the swap table in a later
phase, saving the other ~60% of the static swap metadata and making all the
swap metadata dynamic. The improved API for swap operations also reduces
lock contention and makes more batching operations possible.
This patch (of 12):
/proc/swaps uses si->swap_map as the indicator to check if the swap
device is mounted. swap_map will be removed soon, so change it to use
si->swap_file instead because:
- si->swap_file is the only dynamic content that /proc/swaps is
interested in. Previously, it checked si->swap_map just to ensure
si->swap_file is available: si->swap_map is set under mutex protection,
and after si->swap_file is set, so having si->swap_map set guarantees
si->swap_file is set.
- Checking si->flags doesn't work here. SWP_WRITEOK is cleared during
swapoff, but /proc/swaps is supposed to show a device undergoing swapoff
too, to report the swapoff progress. And SWP_USED is set even before the
device has been properly set up.
We could add another flag, but the easier way is to just check
si->swap_file directly. So protect the setting of si->swap_file with the
mutex, and set si->swap_file only when the swap device is truly enabled.
/proc/swaps is only interested in si->swap_file and a few static fields;
only si->swap_file needs protection, and reading the other static fields is
always fine.
Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-0-f4e34be021a7@tencent.com
Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-1-f4e34be021a7@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There are situations where reclaim kicks in on a system with free memory.
One possible cause is a NUMA imbalance scenario where one or more nodes
are under pressure. It would help if we could easily identify such nodes.
Move the pgscan, pgsteal, and pgrefill counters from vm_event_item to
node_stat_item to provide per-node reclaim visibility. With these
counters as node stats, the values are now displayed in the per-node
section of /proc/zoneinfo, which allows for quick identification of the
affected nodes.
/proc/vmstat continues to report the same counters, aggregated across all
nodes. But the ordering of these items within the readout changes as they
move from the vm events section to the node stats section.
Memcg accounting of these counters is preserved. The relocated counters
remain visible in memory.stat alongside the existing aggregate pgscan and
pgsteal counters.
However, this change affects how the global counters are accumulated.
Previously, the global event count update was gated on !cgroup_reclaim(),
excluding memcg-based reclaim from /proc/vmstat. Now that
mod_lruvec_state() is being used to update the counters, the global
counters will include all reclaim. This is consistent with how pgdemote
counters are already tracked.
Finally, the virtio_balloon driver is updated to use
global_node_page_state() to fetch the counters, as they are no longer
accessible through the vm_events array.
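For instance, the driver would now fetch the reclaim counters roughly like
this (a sketch, not the exact diff; the node_stat_item names are assumed to
stay the same after the move):

    update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_SCAN,
                pages_to_bytes(global_node_page_state(PGSCAN_KSWAPD)));
    update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_SCAN,
                pages_to_bytes(global_node_page_state(PGSCAN_DIRECT)));
    update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_RECLAIM,
                pages_to_bytes(global_node_page_state(PGSTEAL_KSWAPD)));
    update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_RECLAIM,
                pages_to_bytes(global_node_page_state(PGSTEAL_DIRECT)));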
Link: https://lkml.kernel.org/r/20260219235846.161910-1-jp.kobryn@linux.dev
Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stop using the maple big node for rebalance operations by changing the code
to align more closely with the spanning store. The rebalance operation needs
its own data calculation in rebalance_data().
In the event of too much data, the rebalance tries to push the data using
push_data_sib(). If there is insufficient data, the rebalance operation
will rebalance against a sibling (found with rebalance_sib()).
The rebalance starts at the leaf and works its way upward in the tree using
rebalance_ascend(). Most of the code is shared with the spanning store,
such as the copy node having a new root, but rebalance is fundamentally
different in that the data must come from a sibling.
A parent maple state is used to track the parent location to avoid
multiple mas_ascend() calls. The maple state tree location is copied from
the parent to the mas (child) in the ascend step. Ascending itself is
done in the main loop.
Link: https://lkml.kernel.org/r/20260130205935.2559335-23-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stop using the maple subtree state and big node in favour of using three
destinations in the maple copy node. That is, expand the way leaves were
handled to all levels of the tree and use the maple copy node to track the
new nodes.
Extract the sibling initialisation into the data calculation, since this is
where insufficient data can be detected. The remainder of the sibling code,
which shifts the next iteration, is moved to the spanning_ascend() function,
since it is not always needed.
Next introduce the dst_setup() function which will decide how many nodes
are needed to contain the data at this level. Using the destination
count, populate the copy node's dst array with the new nodes and set
d_count to the correct value. Note that this can be tricky in the case of
a leaf node with exactly enough room because of the rule against NULLs at
the end of leaves.
Once the destinations are ready, copy the data by altering the
cp_data_write() function to copy from the sources to the destinations
directly. This eliminates the use of the big node in this code path. On
node completion, node_finalise() will zero out the remaining area and set
the metadata, if necessary.
spanning_ascend() is used to decide if the operation is complete. It may
create a new root, converge into one destination, or continue upwards by
ascending the left and right write maple states.
One test case setup needed to be tweaked so that the targeted node was
surrounded by full nodes.
[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20260130205935.2559335-18-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce an in-memory-only node type called maple_copy to facilitate
internal copy operations. Use it in mas_spanning_rebalance() for just the
leaf nodes. Initially, the maple_copy node is used to configure the source
nodes and copy the data into the big_node.
The maple_copy node contains a list of source entries with start and end
offsets. One of the sources can be the maple_copy node itself, with an
offset of 0 to 2, representing the data of the store that partially or
fully overwrites existing entries. The side effect is that the source nodes
no longer have to worry about partially copying an existing offset when it
is not fully overwritten.
This is in preparation for the removal of the maple big_node; for the time
being the data is still copied into the big node to limit the size of the
change.
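A hedged sketch of the shape of the new node type; the field names here are
illustrative only, not the actual lib/maple_tree.c definition:

    /* An in-memory-only scratch node: it is never written to the tree. */
    struct maple_copy_sketch {
            struct {
                    struct maple_node *node; /* source node, or the copy itself */
                    unsigned char start;     /* first offset to copy */
                    unsigned char end;       /* last offset to copy */
            } src[4];
            unsigned char s_count;           /* number of sources */
            /* the store's own data lives at offsets 0 to 2 */
            void __rcu *slot[3];
            unsigned long pivot[3];
    };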
Link: https://lkml.kernel.org/r/20260130205935.2559335-12-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>