linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-14 11:11:22 -04:00

Author	SHA1	Message	Date
Suren Baghdasaryan	9e8a0bbb12	alloc_tag: prevent enabling memory profiling if it was shut down Memory profiling can be shut down due to reasons like a failure during initialization. When this happens, the user should not be able to re-enable it. Current sysctrl interface does not handle this properly and will allow re-enabling memory profiling. Fix this by checking for this condition during sysctrl write operation. Link: https://lkml.kernel.org/r/20250915212756.3998938-3-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Usama Arif <usamaarif642@gmail.com> Cc: David Wang <00107082@163.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:30 -07:00
Suren Baghdasaryan	123bcf2842	alloc_tag: use release_pages() in the cleanup path Patch series "Minor fixes for memory allocation profiling", v2. Over the last couple months I gathered a few reports of minor issues in memory allocation profiling which are addressed in this patchset. This patch (of 2): When bulk-freeing an array of pages use release_pages() instead of freeing them page-by-page. Link: https://lkml.kernel.org/r/20250915212756.3998938-1-surenb@google.com Link: https://lkml.kernel.org/r/20250915212756.3998938-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Suggested-by: Usama Arif <usamaarif642@gmail.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Usama Arif <usamaarif642@gmail.com> Cc: David Wang <00107082@163.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:30 -07:00
Jackie Liu	5919f12821	mm/shmem: remove unused entry_order after large swapin rework After commit `93c0476e70` ("mm/shmem, swap: rework swap entry and index calculation for large swapin"), xas_get_order() will never return a non-zero value for `entry_order` in shmem_split_large_entry(). As a result, the local variable `entry_order` is effectively unused. Clean up the code by removing `entry_order` and directly using `cur_order`. This change is purely a refactor and has no functional impact. No functional change intended. Link: https://lkml.kernel.org/r/20250908062614.89880-1-liu.yun@linux.dev Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:30 -07:00
Lance Yang	6ce3bc990c	mm: skip mlocked THPs that are underused early in deferred_split_scan() When we stumble over a fully-mapped mlocked THP in the deferred shrinker, it does not make sense to try to detect whether it is underused, because try_to_map_unused_to_zeropage(), called while splitting the folio, will not actually replace any zeroed pages by the shared zeropage. Splitting the folio in that case does not make any sense, so let's not even scan to check if the folio is underused. Link: https://lkml.kernel.org/r/20250908090741.61519-1-lance.yang@linux.dev Signed-off-by: Lance Yang <lance.yang@linux.dev> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Kiryl Shutsemau <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:30 -07:00
Francois Dugast	10b9feee2d	mm/hmm: populate PFNs from PMD swap entry Once support for THP migration of zone device pages is enabled, device private swap entries will be found during the walk not only for PTEs but also for PMDs. Therefore, it is necessary to extend to PMDs the special handling which is already in place for PTEs when device private pages are owned by the caller: instead of faulting or skipping the range, the correct behavior is to use the swap entry to populate HMM PFNs. This change is a prerequisite to make use of device-private THP in drivers using drivers/gpu/drm/drm_pagemap, such as xe. Even though subsequent PFNs can be inferred when handling large order PFNs, the PFN list is still fully populated because this is currently expected by HMM users. In case this changes in the future, that is all HMM users support a sparsely populated PFN list, the for() loop can be made to skip remaining PFNs for the current order. A quick test shows the loop takes about 10 ns, roughly 20 times faster than without this optimization. Link: https://lkml.kernel.org/r/20250908091052.612303-1-francois.dugast@intel.com Signed-off-by: Francois Dugast <francois.dugast@intel.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Leon Romanovsky <leonro@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: David Airlie <airlied@gmail.com> Cc: Christian König <christian.koenig@amd.com> Cc: Mika Penttilä <mpenttil@redhat.com> Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:29 -07:00
David Hildenbrand	7cad96ae59	mm/gup: fix handling of errors from arch_make_folio_accessible() in follow_page_pte() In case we call arch_make_folio_accessible() and it fails, we would incorrectly return a value that is "!= 0" to the caller, indicating that we pinned all requested pages and that the caller can keep going. follow_page_pte() is not supposed to return error values, but instead "0" on failure and "1" on success -- we'll clean that up separately. In case we return "!= 0", the caller will just keep going pinning more pages. If we happen to pin a page afterwards, we're in trouble, because we essentially skipped some pages in the requested range. Staring at the arch_make_folio_accessible() implementation on s390x, I assume it should actually never really fail unless something unexpected happens (BUG?). So let's not CC stable and just fix common code to do the right thing. Clean up the code a bit now that there is no reason to store the return value of arch_make_folio_accessible(). Link: https://lkml.kernel.org/r/20250908094517.303409-1-david@redhat.com Fixes: `f28d43636d` ("mm/gup/writeback: add callbacks for inaccessible pages") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:29 -07:00
Chanwon Park	e7a5f249e6	mm: re-enable kswapd when memory pressure subsides or demotion is toggled If kswapd fails to reclaim pages from a node MAX_RECLAIM_RETRIES in a row, kswapd on that node gets disabled. That is, the system won't wakeup kswapd for that node until page reclamation is observed at least once. That reclamation is mostly done by direct reclaim, which in turn enables kswapd back. However, on systems with CXL memory nodes, workloads with high anon page usage can disable kswapd indefinitely, without triggering direct reclaim. This can be reproduced with following steps: numa node 0 (32GB memory, 48 CPUs) numa node 2~5 (512GB CXL memory, 128GB each) (numa node 1 is disabled) swap space 8GB 1) Set /sys/kernel/mm/demotion_enabled to 0. 2) Set /proc/sys/kernel/numa_balancing to 0. 3) Run a process that allocates and random accesses 500GB of anon pages. 4) Let the process exit normally. During 3), free memory on node 0 gets lower than low watermark, and kswapd runs and depletes swap space. Then, kswapd fails consecutively and gets disabled. Allocation afterwards happens on CXL memory, so node 0 never gains more memory pressure to trigger direct reclaim. After 4), kswapd on node 0 remains disabled, and tasks running on that node are unable to swap. If you turn on NUMA_BALANCING_MEMORY_TIERING and demotion now, it won't work properly since kswapd is disabled. To mitigate this problem, reset kswapd_failures to 0 on following conditions: a) ZONE_BELOW_HIGH bit of a zone in hopeless node with a fallback memory node gets cleared. b) demotion_enabled is changed from false to true. Rationale for a): ZONE_BELOW_HIGH bit being cleared might be a sign that the node may be reclaimable afterwards. This won't help much if the memory-hungry process keeps running without freeing anything, but at least the node will go back to reclaimable state when the process exits. Rationale for b): When demotion_enabled is false, kswapd can only reclaim anon pages by swapping them out to swap space. If demotion_enabled is turned on, kswapd can demote anon pages to another node for reclaiming. So, the original failure count for determining reclaimability is no longer valid. Since kswapd_failures resets may be missed by ++ operation, it is changed from int to atomic_t. [akpm@linux-foundation.org: tweak whitespace] Link: https://lkml.kernel.org/r/aL6qGi69jWXfPc4D@pcw-MS-7D22 Signed-off-by: Chanwon Park <flyinrm@gmail.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:29 -07:00
Chunyu Hu	c56325259a	selftests/mm: fix va_high_addr_switch.sh failure on x86_64 The test will fail as below on x86_64 with cpu la57 support (will skip if no la57 support). Note, the test requries nr_hugepages to be set first. # running bash ./va_high_addr_switch.sh # ------------------------------------- # mmap(addr_switch_hint - pagesize, pagesize): 0x7f55b60fa000 - OK # mmap(addr_switch_hint - pagesize, (2 * pagesize)): 0x7f55b60f9000 - OK # mmap(addr_switch_hint, pagesize): 0x800000000000 - OK # mmap(addr_switch_hint, 2 * pagesize, MAP_FIXED): 0x800000000000 - OK # mmap(NULL): 0x7f55b60f9000 - OK # mmap(low_addr): 0x40000000 - OK # mmap(high_addr): 0x1000000000000 - OK # mmap(high_addr) again: 0xffff55b6136000 - OK # mmap(high_addr, MAP_FIXED): 0x1000000000000 - OK # mmap(-1): 0xffff55b6134000 - OK # mmap(-1) again: 0xffff55b6132000 - OK # mmap(addr_switch_hint - pagesize, pagesize): 0x7f55b60fa000 - OK # mmap(addr_switch_hint - pagesize, 2 * pagesize): 0x7f55b60f9000 - OK # mmap(addr_switch_hint - pagesize/2 , 2 * pagesize): 0x7f55b60f7000 - OK # mmap(addr_switch_hint, pagesize): 0x800000000000 - OK # mmap(addr_switch_hint, 2 * pagesize, MAP_FIXED): 0x800000000000 - OK # mmap(NULL, MAP_HUGETLB): 0x7f55b5c00000 - OK # mmap(low_addr, MAP_HUGETLB): 0x40000000 - OK # mmap(high_addr, MAP_HUGETLB): 0x1000000000000 - OK # mmap(high_addr, MAP_HUGETLB) again: 0xffff55b5e00000 - OK # mmap(high_addr, MAP_FIXED \| MAP_HUGETLB): 0x1000000000000 - OK # mmap(-1, MAP_HUGETLB): 0x7f55b5c00000 - OK # mmap(-1, MAP_HUGETLB) again: 0x7f55b5a00000 - OK # mmap(addr_switch_hint - pagesize, 2hugepagesize, MAP_HUGETLB): 0x800000000000 - FAILED # mmap(addr_switch_hint , 2hugepagesize, MAP_FIXED \| MAP_HUGETLB): 0x800000000000 - OK # [FAIL] addr_switch_hint is defined as DFEFAULT_MAP_WINDOW in the failed test (for x86_64, DFEFAULT_MAP_WINDOW is defined as (1UL<<47) - pagesize) in 64 bit. Before commit `cc92882ee2` ("mm: drop hugetlb_get_unmapped_area{_} functions"), for x86_64 hugetlb_get_unmapped_area() is handled in arch code arch/x86/mm/hugetlbpage.c and addr is checked with map_address_hint_valid() after align with 'addr &= huge_page_mask(h)' which is a round down way, and it will fail the check because the addr is within the DEFAULT_MAP_WINDOW but (addr + len) is above the DFEFAULT_MAP_WINDOW. So it wil go through the hugetlb_get_unmmaped_area_top_down() to find an area within the DFEFAULT_MAP_WINDOW. After commit `cc92882ee2` ("mm: drop hugetlb_get_unmapped_area{_} functions"). The addr hint for hugetlb_get_unmmaped_area() will be rounded up and aligned to hugepage size with ALIGN() for all arches. And after the align, the addr will be above the default MAP_DEFAULT_WINDOW, and the map_addresshint_valid() check will pass because both aligned addr (addr0) and (addr + len) are above the DEFAULT_MAP_WINDOW, and the aligned hint address (0x800000000000) is returned as an suitable gap is found there, in arch_get_unmapped_area_topdown(). To still cover the case that addr is within the DEFAULT_MAP_WINDOW, and addr + len is above the DFEFAULT_MAP_WINDOW, change to choose the last hugepage aligned address within the DEFAULT_MAP_WINDOW as the hint addr, and the addr + len (2 hugepages) will be one hugepage above the DEFAULT_MAP_WINDOW. An aligned address won't be affected by the page round up or round down from kernel, so it's determistic. Link: https://lkml.kernel.org/r/20250912013711.3002969-4-chuhu@redhat.com Fixes: `cc92882ee2` ("mm: drop hugetlb_get_unmapped_area{_*} functions") Signed-off-by: Chunyu Hu <chuhu@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:29 -07:00
Chunyu Hu	d9d957bd7b	selftests/mm: alloc hugepages in va_high_addr_switch test Alloc hugepages in the test internally, so we don't fully rely on the run_vmtests.sh. If run_vmtests.sh does that great, free hugepages is enough for being used to run the test, leave it as it is, otherwise setup the hugepages in the test. Save the original nr_hugepages value and restore it after test finish, so leave a stable test envronment. Link: https://lkml.kernel.org/r/20250912013711.3002969-3-chuhu@redhat.com Signed-off-by: Chunyu Hu <chuhu@redhat.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:28 -07:00
Chunyu Hu	6e296bcf29	selftests/mm: fix hugepages cleanup too early Patch series "Fix va_high_addr_switch.sh test failure", v3. These three patches fix the va_high_addr_switch.sh test failure on x86_64. Patch 1 fixes the hugepage setup issue that nr_hugepages is reset too early in run_vmtests.sh and break the later va_high_addr_switch testing. Patch 2 adds hugepage setup in va_high_addr_switch test, so that it can still work if vm_runtests.sh changes the hugepage setup someday. Patch 3 fixes the test failure caused by the hint addr align method change in hugetlb_get_unmapped_area(). This patch (of 3): The nr_hugepgs variable is used to keep the original nr_hugepages at the hugepage setup step at test beginning. After userfaultfd test, a cleaup is executed, both /sys/kernel/mm/hugepages/hugepages-*/nr_hugepages and /proc/sys//vm/nr_hugepages are reset to 'original' value before userfaultfd test starts. Issue here is the value used to restore /proc/sys/vm/nr_hugepages is nr_hugepgs which is the initial value before the vm_runtests.sh runs, not the value before userfaultfd test starts. 'va_high_addr_swith.sh' tests runs after that will possibly see no hugepages available for test, and got EINVAL when mmap(HUGETLB), making the result invalid. And before pkey tests, nr_hugepgs is changed to be used as a temp variable to save nr_hugepages before pkey test, and restore it after pkey tests finish. The original nr_hugepages value is not tracked anymore, so no way to restore it after all tests finish. Add a new variable orig_nr_hugepgs to save the original nr_hugepages, and and restore it to nr_hugepages after all tests finish. And change to use the nr_hugepgs variable to save the /proc/sys/vm/nr_hugeages after hugepage setup, it's also the value before userfaultfd test starts, and the correct value to be restored after userfaultfd finishes. The va_high_addr_switch.sh broken will be resolved. Link: https://lkml.kernel.org/r/20250912013711.3002969-1-chuhu@redhat.com Link: https://lkml.kernel.org/r/20250912013711.3002969-2-chuhu@redhat.com Signed-off-by: Chunyu Hu <chuhu@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:28 -07:00
Jan Kara	6028372689	readahead: add trace points Add a couple of trace points to make debugging readahead logic easier. [jack@suse.cz: v2] Link: https://lkml.kernel.org/r/20250909145849.5090-2-jack@suse.cz Link: https://lkml.kernel.org/r/20250908145533.31528-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:28 -07:00
Matthieu Baerts (NGI0)	e1831e8dd1	scripts/decode_stacktrace.sh: code: preserve alignment With lines having a code to decode, the alignment was not preserved for the first line. With this sample ... [ 52.238089][ T55] RIP: 0010:__ip_queue_xmit+0x127c/0x1820 [ 52.238401][ T55] Code: c1 83 e0 07 48 c1 e9 03 83 c0 03 (...) ... the script was producing the following output: [ 52.238089][ T55] RIP: 0010:__ip_queue_xmit (...) [ 52.238401][ T55] Code: c1 83 e0 07 48 c1 e9 03 83 c0 03 (...) That's because scripts/decodecode doesn't preserve the alignment. No need to modify it, it is enough to give only the "Code: (...)" part to this script, and print the prefix without modifications. With the same sample, we now have: [ 52.238089][ T55] RIP: 0010:__ip_queue_xmit (...) [ 52.238401][ T55] Code: c1 83 e0 07 48 c1 e9 03 83 c0 03 (...) Link: https://lkml.kernel.org/r/20250908-decode_strace_indent-v1-3-28e5e4758080@kernel.org Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Tested-by: Carlos Llamas <cmllamas@google.com> Cc: Breno Leitao <leitao@debian.org> Cc: Elliot Berman <quic_eberman@quicinc.com> Cc: Luca Ceresoli <luca.ceresoli@bootlin.com> Cc: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:28 -07:00
Matthieu Baerts (NGI0)	4a2fc4897b	scripts/decode_stacktrace.sh: symbol: preserve alignment With lines having a symbol to decode, the script was only trying to preserve the alignment for the timestamps, but not the rest, nor when the caller was set (CONFIG_PRINTK_CALLER=y). With this sample ... [ 52.080924] Call Trace: [ 52.080926] <TASK> [ 52.080931] dump_stack_lvl+0x6f/0xb0 ... the script was producing the following output: [ 52.080924] Call Trace: [ 52.080926] <TASK> [ 52.080931] dump_stack_lvl (arch/x86/include/asm/irqflags.h:19) (dump_stack_lvl is no longer aligned with <TASK>: one missing space) With this other sample ... [ 52.080924][ T48] Call Trace: [ 52.080926][ T48] <TASK> [ 52.080931][ T48] dump_stack_lvl+0x6f/0xb0 ... the script was producing the following output: [ 52.080924][ T48] Call Trace: [ 52.080926][ T48] <TASK> [ 52.080931][ T48] dump_stack_lvl (arch/x86/include/asm/irqflags.h:19) (the misalignment is clearer here) That's because the script had a workaround for CONFIG_PRINTK_TIME=y only, see the previous comment called "Format timestamps with tabs". To always preserve spaces, they need to be recorded along the words. That is what is now done with the new 'spaces' array. Some notes: - 'extglob' is needed only for this operation, and that's why it is set in a dedicated subshell. - 'read' is used with '-r' not to treat a <backslash> character in any special way, e.g. when followed by a space. - When a word is removed from the 'words' array, the corresponding space needs to be removed from the 'spaces' array as well. With the last sample, we now have: [ 52.080924][ T48] Call Trace: [ 52.080926][ T48] <TASK> [ 52.080931][ T48] dump_stack_lvl (arch/x86/include/asm/irqflags.h:19) (the alignment is preserved) Link: https://lkml.kernel.org/r/20250908-decode_strace_indent-v1-2-28e5e4758080@kernel.org Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Tested-by: Carlos Llamas <cmllamas@google.com> Cc: Breno Leitao <leitao@debian.org> Cc: Elliot Berman <quic_eberman@quicinc.com> Cc: Luca Ceresoli <luca.ceresoli@bootlin.com> Cc: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:27 -07:00
Matthieu Baerts (NGI0)	d322f6a24e	scripts/decode_stacktrace.sh: symbol: avoid trailing whitespaces A few patches slightly improving the output generated by decode_stacktrace.sh. This patch (of 3): Lines having a symbol to decode might not always have info after this symbol. It means ${info_str} might not be set, but it will always be printed after a space, causing trailing whitespaces. That's a detail, but when the output is opened with an editor marking these trailing whitespaces, that's a bit disturbing. It is easy to remove them by printing this variable with a space only if it is set. While at it, do the same with ${module} and print everything in one line. Link: https://lkml.kernel.org/r/20250908-decode_strace_indent-v1-0-28e5e4758080@kernel.org Link: https://lkml.kernel.org/r/20250908-decode_strace_indent-v1-1-28e5e4758080@kernel.org Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Carlos Llamas <cmllamas@google.com> Reviewed-by: Breno Leitao <leitao@debian.org> Reviewed-by: Luca Ceresoli <luca.ceresoli@bootlin.com> Cc: Carlos Llamas <cmllamas@google.com> Cc: Elliot Berman <quic_eberman@quicinc.com> Cc: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:27 -07:00
Matthew Wilcox (Oracle)	90ec2df9dd	ptdesc: remove ptdesc_to_virt() This has the same effect as ptdesc_address() so convert the callers to use that and delete the function. Add kernel-doc for ptdesc_address(). Link: https://lkml.kernel.org/r/20250908171104.2409217-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:27 -07:00
Matthew Wilcox (Oracle)	f0c92726e8	ptdesc: remove references to folios from __pagetable_ctor() and pagetable_dtor() In preparation for splitting struct ptdesc from struct page and struct folio, remove mentions of struct folio from these functions. Introduce ptdesc_nr_pages() to avoid using lruvec_stat_add/sub_folio() Link: https://lkml.kernel.org/r/20250908171104.2409217-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:27 -07:00
Matthew Wilcox (Oracle)	522abd9227	ptdesc: convert __page_flags to pt_flags Patch series "Some ptdesc cleanups". The first two patches here are preparation for splitting struct ptdesc from struct page and struct folio. I think their only dependency is on the memdesc_flags_t patches from August which is in mm-new. The third patch is just something I noticed while working on the code. This patch (of 3): Use the new memdesc_flags_t type to show that these are the same bits as page/folio/slab and thesefore have the zone/node/section information in them. Remove a use of ptdesc_folio() by converting pagetable_is_reserved() to use test_bit() directly. Link: https://lkml.kernel.org/r/20250908171104.2409217-1-willy@infradead.org Link: https://lkml.kernel.org/r/20250908171104.2409217-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:26 -07:00
Alice Ryhl	6106864b87	maple_tree: remove lockdep_map_p typedef Having the ma_external_lock field exist when CONFIG_LOCKDEP=n isn't used anywhere, so just get rid of it. This also avoids generating a typedef called lockdep_map_p that could overlap with typedefs in other header files. Link: https://lkml.kernel.org/r/20250902-maple-lockdep-p-v1-1-3ae5a398a379@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Danilo Krummrich <dakr@kernel.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:26 -07:00
zhang jiao	152d42584a	samples/cgroup: rm unused MEMCG_EVENTS macro MEMCG_EVENTS is never referenced in the code. Just remove it. Link: https://lkml.kernel.org/r/20250903073100.2477-1-zhangjiao2@cmss.chinamobile.com Signed-off-by: zhang jiao <zhangjiao2@cmss.chinamobile.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:26 -07:00
Stanislav Fort	72797d218b	mm/memcg: v1: account event registrations and drop world-writable cgroup.event_control In cgroup v1, the legacy cgroup.event_control file is world-writable and allows unprivileged users to register unbounded events and thresholds. Each registration allocates kernel memory without capping or memcg charging, which can be abused to exhaust kernel memory in affected configurations. Make the following minimal changes: - Account allocations with __GFP_ACCOUNT in event and threshold registration. - Remove CFTYPE_WORLD_WRITABLE from cgroup.event_control to make it owner-writable. This does not affect cgroup v2. Allocations are still subject to kmem accounting being enabled, but this reduces unbounded global growth. Link: https://lkml.kernel.org/r/20250905093851.80596-1-disclosure@aisle.com Signed-off-by: Stanislav Fort <disclosure@aisle.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:26 -07:00
Kairui Song	f83938e418	mm, swap: use a single page for swap table when the size fits We have a cluster size of 512 slots. Each slot consumes 8 bytes in swap table so the swap table size of each cluster is exactly one page (4K). If that condition is true, allocate one page direct and disable the slab cache to reduce the memory usage of swap table and avoid fragmentation. Link: https://lkml.kernel.org/r/20250916160100.31545-16-ryncsn@gmail.com Co-developed-by: Chris Li <chrisl@kernel.org> Signed-off-by: Chris Li <chrisl@kernel.org> Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Suggested-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:25 -07:00
Kairui Song	07adc4cf1e	mm, swap: implement dynamic allocation of swap table Now swap table is cluster based, which means free clusters can free its table since no one should modify it. There could be speculative readers, like swap cache look up, protect them by making them RCU protected. All swap table should be filled with null entries before free, so such readers will either see a NULL pointer or a null filled table being lazy freed. On allocation, allocate the table when a cluster is used by any order. This way, we can reduce the memory usage of large swap device significantly. This idea to dynamically release unused swap cluster data was initially suggested by Chris Li while proposing the cluster swap allocator and it suits the swap table idea very well. Link: https://lkml.kernel.org/r/20250916160100.31545-15-ryncsn@gmail.com Co-developed-by: Chris Li <chrisl@kernel.org> Signed-off-by: Chris Li <chrisl@kernel.org> Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:25 -07:00
Kairui Song	685a17fbd3	mm, swap: remove contention workaround for swap cache Swap cluster setup will try to shuffle the clusters on initialization. It was helpful to avoid contention for the swap cache space. The cluster size (2M) was much smaller than each swap cache space (64M), so shuffling the cluster means the allocator will try to allocate swap slots that are in different swap cache spaces for each CPU, reducing the chance of two CPUs using the same swap cache space, and hence reducing the contention. Now, swap cache is managed by swap clusters, this shuffle is pointless. Just remove it, and clean up related macros. This also improves the HDD swap performance as shuffling IO is a bad idea for HDD, and now the shuffling is gone. Test have shown a ~40% performance gain for HDD [1]: Doing sequential swap in of 8G data using 8 processes with usemem, average of 3 test runs: Before: 1270.91 KB/s per process After: 1849.54 KB/s per process Link: https://lore.kernel.org/linux-mm/CAMgjq7AdauQ8=X0zeih2r21QoV=-WWj1hyBxLWRzq74n-C=-Ng@mail.gmail.com/ [1] Link: https://lkml.kernel.org/r/20250916160100.31545-14-ryncsn@gmail.com Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202504241621.f27743ec-lkp@intel.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:25 -07:00
Kairui Song	8b47299a41	mm, swap: mark swap address space ro and add context debug check Swap cache is now backed by swap table, and the address space is not holding any mutable data anymore. And swap cache is now protected by the swap cluster lock, instead of the XArray lock. All access to swap cache are wrapped by swap cache helpers. Locking is mostly handled internally by swap cache helpers, only a few __swap_cache_* helpers require the caller to lock the cluster by themselves. Worth noting that, unlike XArray, the cluster lock is not IRQ safe. The swap cache was very different compared to filemap, and now it's completely separated from filemap. Nothing wants to mark or change anything or do a writeback callback in IRQ. So explicitly document this and add a debug check to avoid further potential misuse. And mark the swap cache space as read-only to avoid any user wrongly mixing unexpected filemap helpers with swap cache. Link: https://lkml.kernel.org/r/20250916160100.31545-13-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:25 -07:00
Kairui Song	8578e0c00d	mm, swap: use the swap table for the swap cache and switch API Introduce basic swap table infrastructures, which are now just a fixed-sized flat array inside each swap cluster, with access wrappers. Each cluster contains a swap table of 512 entries. Each table entry is an opaque atomic long. It could be in 3 types: a shadow type (XA_VALUE), a folio type (pointer), or NULL. In this first step, it only supports storing a folio or shadow, and it is a drop-in replacement for the current swap cache. Convert all swap cache users to use the new sets of APIs. Chris Li has been suggesting using a new infrastructure for swap cache for better performance, and that idea combined well with the swap table as the new backing structure. Now the lock contention range is reduced to 2M clusters, which is much smaller than the 64M address_space. And we can also drop the multiple address_space design. All the internal works are done with swap_cache_get_* helpers. Swap cache lookup is still lock-less like before, and the helper's contexts are same with original swap cache helpers. They still require a pin on the swap device to prevent the backing data from being freed. Swap cache updates are now protected by the swap cluster lock instead of the XArray lock. This is mostly handled internally, but new __swap_cache_* helpers require the caller to lock the cluster. So, a few new cluster access and locking helpers are also introduced. A fully cluster-based unified swap table can be implemented on top of this to take care of all count tracking and synchronization work, with dynamic allocation. It should reduce the memory usage while making the performance even better. Link: https://lkml.kernel.org/r/20250916160100.31545-12-ryncsn@gmail.com Co-developed-by: Chris Li <chrisl@kernel.org> Signed-off-by: Chris Li <chrisl@kernel.org> Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:24 -07:00
Kairui Song	094dc8b059	mm, swap: wrap swap cache replacement with a helper There are currently three swap cache users that are trying to replace an existing folio with a new one: huge memory splitting, migration, and shmem replacement. What they are doing is quite similar. Introduce a common helper for this. In later commits, this can be easily switched to use the swap table by updating this helper. The newly added helper also makes the swap cache API better defined, and make debugging easier by adding a few more debug checks. Migration and shmem replace are meant to clone the folio, including content, swap entry value, and flags. And splitting will adjust each sub folio's swap entry according to order, which could be non-uniform in the future. So document it clearly that it's the caller's responsibility to set up the new folio's swap entries and flags before calling the helper. The helper will just follow the new folio's entry value. This also prepares for replacing high-order folios in the swap cache. Currently, only splitting to order 0 is allowed for swap cache folios. Using the new helper, we can handle high-order folio splitting better. Link: https://lkml.kernel.org/r/20250916160100.31545-11-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Chris Li <chrisl@kernel.org> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:24 -07:00
Kairui Song	84a7a9823e	mm/shmem, swap: remove redundant error handling for replacing folio Shmem may replace a folio in the swap cache if the cached one doesn't fit the swapin's GFP zone. When doing so, shmem has already double checked that the swap cache folio is locked, still has the swap cache flag set, and contains the wanted swap entry. So it is impossible to fail due to an XArray mismatch. There is even a comment for that. Delete the defensive error handling path, and add a WARN_ON instead: if that happened, something has broken the basic principle of how the swap cache works, we should catch and fix that. Link: https://lkml.kernel.org/r/20250916160100.31545-10-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:24 -07:00
Kairui Song	fd8d4f862f	mm, swap: cleanup swap cache API and add kerneldoc In preparation for replacing the swap cache backend with the swap table, clean up and add proper kernel doc for all swap cache APIs. Now all swap cache APIs are well-defined with consistent names. No feature change, only renaming and documenting. Link: https://lkml.kernel.org/r/20250916160100.31545-9-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:23 -07:00
Kairui Song	0fcf8ef4fd	mm, swap: tidy up swap device and cluster info helpers swp_swap_info is the most commonly used helper for retrieving swap info. It has an internal check that may lead to a NULL return value, but almost none of its caller checks the return value, making the internal check pointless. In fact, most of these callers already ensured the entry is valid and never expect a NULL value. Tidy this up and improve the function names. If the caller can make sure the swap entry/type is valid and the device is pinned, use the new introduced __swap_entry_to_info/__swap_type_to_info instead. They have more debug sanity checks and lower overhead as they are inlined. Callers that may expect a NULL value should use swap_entry_to_info/swap_type_to_info instead. No feature change. The rearranged codes should have had no effect, or they should have been hitting NULL de-ref bugs already. Only some new sanity checks are added so potential issues may show up in debug build. The new helpers will be frequently used with swap table later when working with swap cache folios. A locked swap cache folio ensures the entries are valid and stable so these helpers are very helpful. Link: https://lkml.kernel.org/r/20250916160100.31545-8-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:23 -07:00
Kairui Song	4522aed4ff	mm, swap: rename and move some swap cluster definition and helpers No feature change, move cluster related definitions and helpers to mm/swap.h, also tidy up and add a "swap_" prefix for cluster lock/unlock helpers, so they can be used outside of swap files. And while at it, add kerneldoc. Link: https://lkml.kernel.org/r/20250916160100.31545-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:23 -07:00
Kairui Song	ae38eb2105	mm, swap: always lock and check the swap cache folio before use Swap cache lookup only increases the reference count of the returned folio. That's not enough to ensure a folio is stable in the swap cache, so the folio could be removed from the swap cache at any time. The caller should always lock and check the folio before using it. We have just documented this in kerneldoc, now introduce a helper for swap cache folio verification with proper sanity checks. Also, sanitize a few current users to use this convention and the new helper for easier debugging. They were not having observable problems yet, only trivial issues like wasted CPU cycles on swapoff or reclaiming. They would fail in some other way, but it is still better to always follow this convention to make things robust and make later commits easier to do. Link: https://lkml.kernel.org/r/20250916160100.31545-6-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:23 -07:00
Kairui Song	3518b931df	mm, swap: check page poison flag after locking it Instead of checking the poison flag only in the fast swap cache lookup path, always check the poison flags after locking a swap cache folio. There are two reasons to do so. The folio is unstable and could be removed from the swap cache anytime, so it's totally possible that the folio is no longer the backing folio of a swap entry, and could be an irrelevant poisoned folio. We might mistakenly kill a faulting process. And it's totally possible or even common for the slow swap in path (swapin_readahead) to bring in a cached folio. The cache folio could be poisoned, too. Only checking the poison flag in the fast path will miss such folios. The race window is tiny, so it's very unlikely to happen, though. While at it, also add a unlikely prefix. Link: https://lkml.kernel.org/r/20250916160100.31545-5-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:22 -07:00
Kairui Song	a733d8de7f	mm, swap: fix swap cache index error when retrying reclaim The allocator will reclaim cached slots while scanning. Currently, it will try again if reclaim found a folio that is already removed from the swap cache due to a race. But the following lookup will be using the wrong index. It won't cause any OOB issue since the swap cache index is truncated upon lookup, but it may lead to reclaiming of an irrelevant folio. This should not cause a measurable issue, but we should fix it. Link: https://lkml.kernel.org/r/20250916160100.31545-4-ryncsn@gmail.com Fixes: `fae8595505` ("mm, swap: avoid reclaiming irrelevant swap cache") Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:22 -07:00
Kairui Song	f28124617f	mm, swap: use unified helper for swap cache look up The swap cache lookup helper swap_cache_get_folio currently does readahead updates as well, so callers that are not doing swapin from any VMA or mapping are forced to reuse filemap helpers instead, and have to access the swap cache space directly. So decouple readahead update with swap cache lookup. Move the readahead update part into a standalone helper. Let the caller call the readahead update helper if they do readahead. And convert all swap cache lookups to use swap_cache_get_folio. After this commit, there are only three special cases for accessing swap cache space now: huge memory splitting, migration, and shmem replacing, because they need to lock the XArray. The following commits will wrap their accesses to the swap cache too, with special helpers. And worth noting, currently dropbehind is not supported for anon folio, and we will never see a dropbehind folio in swap cache. The unified helper can be updated later to handle that. While at it, add proper kernedoc for touched helpers. No functional change. Link: https://lkml.kernel.org/r/20250916160100.31545-3-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:22 -07:00
Chris Li	87cc51571a	docs/mm: add document for swap table Patch series "mm, swap: introduce swap table as swap cache (phase I)", v4. This is the first phase of the bigger series implementing basic infrastructures for the Swap Table idea proposed at the LSF/MM/BPF topic "Integrate swap cache, swap maps with swap allocator" [1]. To give credit where it is due, this is based on Chris Li's idea and a prototype of using cluster size atomic arrays to implement swap cache. This phase I contains 15 patches, introduces the swap table infrastructure and uses it as the swap cache backend. By doing so, we have up to ~5-20% performance gain in throughput, RPS or build time for benchmark and workload tests. The speed up is due to less contention on the swap cache access and shallower swap cache lookup path. The cluster size is much finer-grained than the 64M address space split, which is removed in this phase I. It also unifies and cleans up the swap code base. Each swap cluster will dynamically allocate the swap table, which is an atomic array to cover every swap slot in the cluster. It replaces the swap cache backed by XArray. In phase I, the static allocated swap_map still co-exists with the swap table. The memory usage is about the same as the original on average. A few exception test cases show about 1% higher in memory usage. In the following phases of the series, swap_map will merge into the swap table without additional memory allocation. It will result in net memory reduction compared to the original swap cache. Testing has shown that phase I has a significant performance improvement from 8c/1G ARM machine to 48c96t/128G x86_64 servers in many practical workloads. The full picture with a summary can be found at [2]. An older bigger series of 28 patches is posted at [3]. vm-scability test: ================== Test with: usemem --init-time -O -y -x -n 31 1G (4G memcg, PMEM as swap) Before: After: System time: 219.12s 158.16s (-27.82%) Sum Throughput: 4767.13 MB/s 6128.59 MB/s (+28.55%) Single process Throughput: 150.21 MB/s 196.52 MB/s (+30.83%) Free latency: 175047.58 us 131411.87 us (-24.92%) usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure, PMEM as swap) Before: After: System time: 356.16s 284.68s (-20.06%) Sum Throughput: 4648.35 MB/s 5453.52 MB/s (+17.32%) Single process Throughput: 141.63 MB/s 168.35 MB/s (+18.86%) Free latency: 499907.71 us 484977.03 us (-2.99%) This shows an improvement of more than 20% improvement in most readings. Build kernel test: ================== The following result matrix is from building kernel with defconfig on tmpfs with ZSWAP / ZRAM, using different memory pressure and setups. Measuring sys and real time in seconds, less is better (user time is almost identical as expected): -j<NR> / Mem \| Sys before / after \| Real before / after Using 16G ZRAM with memcg limit: 6 / 192M \| 9686 / 9472 -2.21% \| 2130 / 2096 -1.59% 12 / 256M \| 6610 / 6451 -2.41% \| 827 / 812 -1.81% 24 / 384M \| 5938 / 5701 -3.37% \| 414 / 405 -2.17% 48 / 768M \| 4696 / 4409 -6.11% \| 188 / 182 -3.19% With 64k folio: 24 / 512M \| 4222 / 4162 -1.42% \| 326 / 321 -1.53% 48 / 1G \| 3688 / 3622 -1.79% \| 151 / 149 -1.32% With ZSWAP with 3G memcg (using higher limit due to kmem account): 48 / 3G \| 603 / 581 -3.65% \| 81 / 80 -1.23% Testing extremely high global memory and schedule pressure: Using ZSWAP with 32G NVMEs in a 48c VM that has 4G memory, no memcg limit, system components take up about 1.5G already, using make -j48 to build defconfig: Before: sys time: 2069.53s real time: 135.76s After: sys time: 2021.13s (-2.34%) real time: 134.23s (-1.12%) On another 48c 4G memory VM, using 16G ZRAM as swap, testing make -j48 with same config: Before: sys time: 1756.96s real time: 111.01s After: sys time: 1715.90s (-2.34%) real time: 109.51s (-1.35%) All cases are more or less faster, and no regression even under extremely heavy global memory pressure. Redis / Valkey bench: ===================== The test machine is a ARM64 VM with 1536M memory 12 cores, Redis is set to use 2500M memory, and ZRAM swap size is set to 5G: Testing with: redis-benchmark -r 2000000 -n 2000000 -d 1024 -c 12 -P 32 -t get no BGSAVE with BGSAVE Before: 487576.06 RPS 280016.02 RPS After: 487541.76 RPS (-0.01%) 300155.32 RPS (+7.19%) Testing with: redis-benchmark -r 2500000 -n 2500000 -d 1024 -c 12 -P 32 -t get no BGSAVE with BGSAVE Before: 466789.59 RPS 281213.92 RPS After: 466402.89 RPS (-0.08%) 298411.84 RPS (+6.12%) With BGSAVE enabled, most Redis memory will have a swap count > 1 so swap cache is heavily in use. We can see a about 6% performance gain. No BGSAVE is very slightly slower (<0.1%) due to the higher memory pressure of the co-existence of swap_map and swap table. This will be optimzed into a net gain and up to 20% gain in BGSAVE case in the following phases. HDD swap is also ~40% faster with usemem because we removed an old contention workaround. This patch (of 15): Swap table is the new swap cache. [chrisl@kernel.org: move swap table document, redo swap table size sentence] Link: https://lkml.kernel.org/r/CACePvbXjaUyzB_9RSSSgR6BNvz+L9anvn0vcNf_J0jD7-4Yy6Q@mail.gmail.com Link: https://lkml.kernel.org/r/20250916160100.31545-1-ryncsn@gmail.com Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3] Link: https://lkml.kernel.org/r/20250916160100.31545-2-ryncsn@gmail.com Link: https://lore.kernel.org/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com [1] Link: https://github.com/ryncsn/linux/tree/kasong/devel/swap-table [2] Signed-off-by: Chris Li <chrisl@kernel.org> Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:22 -07:00
Alistair Popple	614d850efd	mm/memremap: remove unused get_dev_pagemap() parameter GUP no longer uses get_dev_pagemap(). As it was the only user of the get_dev_pagemap() pgmap caching feature it can be removed. Link: https://lkml.kernel.org/r/20250903225926.34702-2-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:21 -07:00
Alistair Popple	d3f7922b92	mm/gup: remove dead pgmap refcounting code Prior to commit `aed877c2b4` ("device/dax: properly refcount device dax pages when mapping") ZONE_DEVICE pages were not fully reference counted when mapped into user page tables. Instead GUP would take a reference on the associated pgmap to ensure the results of pfn_to_page() remained valid. This is no longer required and most of the code was removed by commit `fd2825b076` ("mm/gup: remove pXX_devmap usage from get_user_pages()"). Finish cleaning this up by removing the dead calls to put_dev_pagemap() and the temporary context struct. Link: https://lkml.kernel.org/r/20250903225926.34702-1-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:21 -07:00
Wei Yang	4805ef3707	mm/page_alloc: check the correct buddy if it is a starting block find_large_buddy() search buddy based on start_pfn, which maybe different from page's pfn, e.g. when page is not pageblock aligned, because prep_move_freepages_block() always align start_pfn to pageblock. This means when we found a starting block at start_pfn, it may check on the wrong page theoretically. And not split the free page as it is supposed to, causing a freelist migratetype mismatch. The good news is the page passed to __move_freepages_block_isolate() has only two possible cases: * page is pageblock aligned * page is __first_valid_page() of this block So it is safe for the first case, and it won't get a buddy larger than pageblock for the second case. To fix the issue, check the returned pfn of find_large_buddy() to decide whether to split the free page: 1. if it is not a PageBuddy pfn, no split; 2. if it is a PageBuddy pfn but order <= pageblock_order, no split; 3. if it is a PageBuddy pfn with order > pageblock_order, start_pfn is either in the starting block or tail block, split the PageBuddy at pageblock_order level. Link: https://lkml.kernel.org/r/20250905140358.28849-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:21 -07:00
Miaohe Lin	5ce1dbfdd8	mm/hwpoison: decouple hwpoison_filter from mm/memory-failure.c mm/memory-failure.c defines and uses hwpoison_filter_* parameters but the values of those parameters can only be modified via mm/hwpoison-inject.c from userspace. They have a potentially different life time. Decouple those parameters from mm/memory-failure.c to fix this broken layering. Link: https://lkml.kernel.org/r/20250904062258.3336092-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:21 -07:00
Pankaj Raghav	a488ba3124	huge_memory: return -EINVAL in folio split functions when THP is disabled split_huge_page_to_list_[to_order](), split_huge_page() and try_folio_split() return 0 on success and error codes on failure. When THP is disabled, these functions return 0 indicating success even though an error code should be returned as it is not possible to split a folio when THP is disabled. Make all these functions return -EINVAL to indicate failure instead of 0. As large folios depend on CONFIG_THP, issue warning as this function should not be called without a large folio. Link: https://lkml.kernel.org/r/20250905150012.93714-1-kernel@pankajraghav.com Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202509051753.riCeG7LC-lkp@intel.com/ Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:20 -07:00
Jinjiang Tu	0faa77afe7	filemap: optimize folio refount update in filemap_map_pages There are two meaningless folio refcount update for order0 folio in filemap_map_pages(). First, filemap_map_order0_folio() adds folio refcount after the folio is mapped to pte. And then, filemap_map_pages() drops a refcount grabbed by next_uptodate_folio(). We could remain the refcount unchanged in this case. As Matthew metenioned in [1], it is safe to call folio_unlock() before calling folio_put() here, because the folio is in page cache with refcount held, and truncation will wait for the unlock. Optimize filemap_map_folio_range() with the same method too. With this patch, we can get 8% performance gain for lmbench testcase 'lat_pagefault -P 1 file' in order0 folio case, the size of file is 512M. Link: https://lkml.kernel.org/r/20250904132737.1250368-1-tujinjiang@huawei.com Link: https://lore.kernel.org/all/aKcU-fzxeW3xT5Wv@casper.infradead.org/ [1] Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:20 -07:00
David Hildenbrand	24a3c7af3b	selftests/mm: split_huge_page_test: cleanups for split_pte_mapped_thp test There is room for improvement, so let's clean up a bit: (1) Define "4" as a constant. (2) SKIP if we fail to allocate all THPs (e.g., fragmented) and add recovery code for all other failure cases: no need to exit the test. (3) Rename "len" to thp_area_size, and "one_page" to "thp_area". (4) Allocate a new area "page_area" into which we will mremap the pages; add "page_area_size". Now we can easily merge the two mremap instances into a single one. (5) Iterate THPs instead of bytes when checking for missed THPs after mremap. (6) Rename "pte_mapped2" to "tmp", used to verify mremap(MAP_FIXED) result. (7) Split the corruption test from the failed-split test, so we can just iterate bytes vs. thps naturally. (8) Extend comments and clarify why we are using mremap in the first place. Link: https://lkml.kernel.org/r/20250903070253.34556-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:20 -07:00
David Hildenbrand	0d0e03d5b8	selftests/mm: split_huge_page_test: fix occasional is_backed_by_folio() wrong results Patch series "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements", v2. One fix for occasional failures I found while testing and a bunch of cleanups that should make that test easier to digest. This patch (of 2): When checking for actual tail or head pages of a folio, we must make sure that the KPF_COMPOUND_HEAD/KPF_COMPOUND_TAIL flag is paired with KPF_THP. For example, if we have another large folio after our large folio in physical memory, our "pfn_flags & (KPF_THP \| KPF_COMPOUND_TAIL)" would trigger even though it's actually a head page of the next folio. If is_backed_by_folio() returns a wrong result, split_pte_mapped_thp() can fail with "Some THPs are missing during mremap". Fix it by checking for head/tail pages of folios properly. Add folio_tail_flags/folio_head_flags to improve readability and use these masks also when just testing for any compound page. Link: https://lkml.kernel.org/r/20250903070253.34556-1-david@redhat.com Link: https://lkml.kernel.org/r/20250903070253.34556-2-david@redhat.com Fixes: 169b456b0162 ("selftests/mm: reimplement is_backed_by_thp() with more precise check") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:20 -07:00
Baolin Wang	69e0a3b490	mm: shmem: fix the strategy for the tmpfs 'huge=' options After commit `acd7ccb284` ("mm: shmem: add large folio support for tmpfs"), we have extended tmpfs to allow any sized large folios, rather than just PMD-sized large folios. The strategy discussed previously was: : Considering that tmpfs already has the 'huge=' option to control the : PMD-sized large folios allocation, we can extend the 'huge=' option to : allow any sized large folios. The semantics of the 'huge=' mount option : are: : : huge=never: no any sized large folios : huge=always: any sized large folios : huge=within_size: like 'always' but respect the i_size : huge=advise: like 'always' if requested with madvise() : : Note: for tmpfs mmap() faults, due to the lack of a write size hint, still : allocate the PMD-sized huge folios if huge=always/within_size/advise is : set. : : Moreover, the 'deny' and 'force' testing options controlled by : '/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the same : semantics. The 'deny' can disable any sized large folios for tmpfs, while : the 'force' can enable PMD sized large folios for tmpfs. This means that when tmpfs is mounted with 'huge=always' or 'huge=within_size', tmpfs will allow getting a highest order hint based on the size of write() and fallocate() paths. It will then try each allowable large order, rather than continually attempting to allocate PMD-sized large folios as before. However, this might break some user scenarios for those who want to use PMD-sized large folios, such as the i915 driver which did not supply a write size hint when allocating shmem [1]. Moreover, Hugh also complained that this will cause a regression in userspace with 'huge=always' or 'huge=within_size'. So, let's revisit the strategy for tmpfs large page allocation. A simple fix would be to always try PMD-sized large folios first, and if that fails, fall back to smaller large folios. This approach differs from the strategy for large folio allocation used by other file systems, however, tmpfs is somewhat different from other file systems, as quoted from David's opinion: : There were opinions in the past that tmpfs should just behave like any : other fs, and I think that's what we tried to satisfy here: use the write : size as an indication. : : I assume there will be workloads where either approach will be beneficial. : I also assume that workloads that use ordinary fs'es could benefit from : the same strategy (start with PMD), while others will clearly not. Link: https://lkml.kernel.org/r/10e7ac6cebe6535c137c064d5c5a235643eebb4a.1756888965.git.baolin.wang@linux.alibaba.com Link: https://lore.kernel.org/lkml/0d734549d5ed073c80b11601da3abdd5223e1889.1753689802.git.baolin.wang@linux.alibaba.com/ [1] Fixes: `acd7ccb284` ("mm: shmem: add large folio support for tmpfs") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:19 -07:00
Alice Ryhl	56b1852e82	rust: maple_tree: add MapleTreeAlloc To support allocation trees, we introduce a new type MapleTreeAlloc for the case where the tree is created using MT_FLAGS_ALLOC_RANGE. To ensure that you can only call mtree_alloc_range on an allocation tree, we restrict thta method to the new MapleTreeAlloc type. However, all methods on MapleTree remain accessible to MapleTreeAlloc as allocation trees can use the other methods without issues. Link: https://lkml.kernel.org/r/20250902-maple-tree-v3-3-fb5c8958fb1e@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com> Reviewed-by: Danilo Krummrich <dakr@kernel.org> Cc: Andreas Hindborg <a.hindborg@kernel.org> Cc: Andrew Ballance <andrewjballance@gmail.com> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Gary Guo <gary@garyguo.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Trevor Gross <tmgross@umich.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:19 -07:00
Alice Ryhl	01422da19c	rust: maple_tree: add lock guard for maple tree To load a value, one must be careful to hold the lock while accessing it. To enable this, we add a lock() method so that you can perform operations on the value before the spinlock is released. This adds a MapleGuard type without using the existing SpinLock type. This ensures that the MapleGuard type is not unnecessarily large, and that it is easy to swap out the type of lock in case the C maple tree is changed to use a different kind of lock. There are two ways of using the lock guard: You can call load() directly to load a value under the lock, or you can create an MaState to iterate the tree with find(). The find() method does not have the mas_ prefix since it's a method on MaState, and being a method on that struct serves a similar purpose to the mas_ prefix in C. Link: https://lkml.kernel.org/r/20250902-maple-tree-v3-2-fb5c8958fb1e@google.com Co-developed-by: Andrew Ballance <andrewjballance@gmail.com> Signed-off-by: Andrew Ballance <andrewjballance@gmail.com> Reviewed-by: Andrew Ballance <andrewjballance@gmail.com> Reviewed-by: Danilo Krummrich <dakr@kernel.org> Signed-off-by: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@kernel.org> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Daniel Almeida <daniel.almeida@collabora.com> Cc: Gary Guo <gary@garyguo.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Trevor Gross <tmgross@umich.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:19 -07:00
Alice Ryhl	da939ef4c4	rust: maple_tree: add MapleTree Patch series "Add Rust abstraction for Maple Trees", v3. This will be used in the Tyr driver [1] to allocate from the GPU's VA space that is not owned by userspace, but by the kernel, for kernel GPU mappings. Danilo tells me that in nouveau, the maple tree is used for keeping track of "VM regions" on top of GPUVM, and that he will most likely end up doing the same in the Rust Nova driver as well. These abstractions intentionally do not expose any way to make use of external locking. You are required to use the internal spinlock. For now, we do not support loads that only utilize rcu for protection. This contains some parts taken from Andrew Ballance's RFC [2] from April. However, it has also been reworked significantly compared to that RFC taking the use-cases in Tyr into account. This patch (of 3): The maple tree will be used in the Tyr driver to allocate and keep track of GPU allocations created internally (i.e. not by userspace). It will likely also be used in the Nova driver eventually. This adds the simplest methods for additional and removal that do not require any special care with respect to concurrency. This implementation is based on the RFC by Andrew but with significant changes to simplify the implementation. [ojeda@kernel.org: fix intra-doc links] Link: https://lkml.kernel.org/r/20250910140212.997771-1-ojeda@kernel.org Link: https://lkml.kernel.org/r/20250902-maple-tree-v3-0-fb5c8958fb1e@google.com Link: https://lkml.kernel.org/r/20250902-maple-tree-v3-1-fb5c8958fb1e@google.com Link: https://lore.kernel.org/r/20250627-tyr-v1-1-cb5f4c6ced46@collabora.com [1] Link: https://lore.kernel.org/r/20250405060154.1550858-1-andrewjballance@gmail.com [2] Co-developed-by: Andrew Ballance <andrewjballance@gmail.com> Signed-off-by: Andrew Ballance <andrewjballance@gmail.com> Signed-off-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Danilo Krummrich <dakr@kernel.org> Cc: Andreas Hindborg <a.hindborg@kernel.org> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Daniel Almeida <daniel.almeida@collabora.com> Cc: Gary Guo <gary@garyguo.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Trevor Gross <tmgross@umich.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:19 -07:00
Yueyang Pan	8147bc15b4	mm/show_mem: add trylock while printing alloc info In production, show_mem() can be called concurrently from two different entities, for example one from oom_kill_process() another from __alloc_pages_slowpath from another kthread. This patch adds a spinlock and invokes trylock before printing out the kernel alloc info in show_mem(). This way two alloc info won't interleave with each other, which then makes parsing easier. Link: https://lkml.kernel.org/r/4ed91296e0c595d945a38458f7a8d9611b0c1e52.1756897825.git.pyyjason@gmail.com Signed-off-by: Yueyang Pan <pyyjason@gmail.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:18 -07:00
Yueyang Pan	9abd8bd4c6	mm/show_mem: dump the status of the mem alloc profiling before printing This patchset fixes two issues we saw in production rollout. The first issue is that we saw all zero output of memory allocation profiling information from show_mem() if CONFIG_MEM_ALLOC_PROFILING is set and sysctl.vm.mem_profiling=0. This cause ambiguity as we don't know what 0B actually means in the output. It can mean either memory allocation profiling is temporary disabled or the allocation at that position is actually 0. Such ambiguity will make further parsing harder as we cannot differentiate between two case. The second issue is that multiple entities can call show_mem() which messed up the allocation info in dmesg. We saw outputs like this: 327 MiB 83635 mm/compaction.c:1880 func:compaction_alloc 48.4 GiB 12684937 mm/memory.c:1061 func:folio_prealloc 7.48 GiB 10899 mm/huge_memory.c:1159 func:vma_alloc_anon_folio_pmd 298 MiB 95216 kernel/fork.c:318 func:alloc_thread_stack_node 250 MiB 63901 mm/zsmalloc.c:987 func:alloc_zspage 1.42 GiB 372527 mm/memory.c:1063 func:folio_prealloc 1.17 GiB 95693 mm/slub.c:2424 func:alloc_slab_page 651 MiB 166732 mm/readahead.c:270 func:page_cache_ra_unbounded 419 MiB 107261 net/core/page_pool.c:572 func:__page_pool_alloc_pages_slow 404 MiB 103425 arch/x86/mm/pgtable.c:25 func:pte_alloc_one The above example is because one kthread invokes show_mem() from __alloc_pages_slowpath while kernel itself calls oom_kill_process() This patch (of 2): This patch prints the status of the memory allocation profiling before __show_mem actually prints the detailed allocation info. This way will let us know the `0B` we saw in allocation info is because the profiling is disabled or the allocation is actually 0B. Link: https://lkml.kernel.org/r/cover.1756897825.git.pyyjason@gmail.com Link: https://lkml.kernel.org/r/d7998ea0ddc2ea1a78bb6e89adf530526f76679a.1756897825.git.pyyjason@gmail.com Signed-off-by: Yueyang Pan <pyyjason@gmail.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:18 -07:00
Vishal Moola (Oracle)	d75d36547d	virtio_balloon: stop calling page_address() in free_pages() free_pages() should be used when we only have a virtual address. We should call __free_pages() directly on our page instead. Link: https://lkml.kernel.org/r/20250903185921.1785167-8-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Justin Sanders <justin@coraid.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:18 -07:00

1 2 3 4 5 ...

1383045 Commits