swp_swap_info is the most commonly used helper for retrieving swap info.
It has an internal check that may lead to a NULL return value, but almost
none of its caller checks the return value, making the internal check
pointless. In fact, most of these callers already ensured the entry is
valid and never expect a NULL value.
Tidy this up and improve the function names. If the caller can make sure
the swap entry/type is valid and the device is pinned, use the new
introduced __swap_entry_to_info/__swap_type_to_info instead. They have
more debug sanity checks and lower overhead as they are inlined.
Callers that may expect a NULL value should use
swap_entry_to_info/swap_type_to_info instead.
No feature change. The rearranged codes should have had no effect, or
they should have been hitting NULL de-ref bugs already. Only some new
sanity checks are added so potential issues may show up in debug build.
The new helpers will be frequently used with swap table later when working
with swap cache folios. A locked swap cache folio ensures the entries are
valid and stable so these helpers are very helpful.
Link: https://lkml.kernel.org/r/20250916160100.31545-8-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The swap cache lookup helper swap_cache_get_folio currently does readahead
updates as well, so callers that are not doing swapin from any VMA or
mapping are forced to reuse filemap helpers instead, and have to access
the swap cache space directly.
So decouple readahead update with swap cache lookup. Move the readahead
update part into a standalone helper. Let the caller call the readahead
update helper if they do readahead. And convert all swap cache lookups to
use swap_cache_get_folio.
After this commit, there are only three special cases for accessing swap
cache space now: huge memory splitting, migration, and shmem replacing,
because they need to lock the XArray. The following commits will wrap
their accesses to the swap cache too, with special helpers.
And worth noting, currently dropbehind is not supported for anon folio,
and we will never see a dropbehind folio in swap cache. The unified
helper can be updated later to handle that.
While at it, add proper kernedoc for touched helpers.
No functional change.
Link: https://lkml.kernel.org/r/20250916160100.31545-3-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Chris Li <chrisl@kernel.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Prior to commit aed877c2b4 ("device/dax: properly refcount device dax
pages when mapping") ZONE_DEVICE pages were not fully reference counted
when mapped into user page tables. Instead GUP would take a reference on
the associated pgmap to ensure the results of pfn_to_page() remained
valid.
This is no longer required and most of the code was removed by commit
fd2825b076 ("mm/gup: remove pXX_devmap usage from get_user_pages()").
Finish cleaning this up by removing the dead calls to put_dev_pagemap()
and the temporary context struct.
Link: https://lkml.kernel.org/r/20250903225926.34702-1-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
find_large_buddy() search buddy based on start_pfn, which maybe different
from page's pfn, e.g. when page is not pageblock aligned, because
prep_move_freepages_block() always align start_pfn to pageblock.
This means when we found a starting block at start_pfn, it may check on
the wrong page theoretically. And not split the free page as it is
supposed to, causing a freelist migratetype mismatch.
The good news is the page passed to __move_freepages_block_isolate() has
only two possible cases:
* page is pageblock aligned
* page is __first_valid_page() of this block
So it is safe for the first case, and it won't get a buddy larger than
pageblock for the second case.
To fix the issue, check the returned pfn of find_large_buddy() to decide
whether to split the free page:
1. if it is not a PageBuddy pfn, no split;
2. if it is a PageBuddy pfn but order <= pageblock_order, no split;
3. if it is a PageBuddy pfn with order > pageblock_order, start_pfn is
either in the starting block or tail block, split the PageBuddy at
pageblock_order level.
Link: https://lkml.kernel.org/r/20250905140358.28849-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There are two meaningless folio refcount update for order0 folio in
filemap_map_pages(). First, filemap_map_order0_folio() adds folio
refcount after the folio is mapped to pte. And then, filemap_map_pages()
drops a refcount grabbed by next_uptodate_folio(). We could remain the
refcount unchanged in this case.
As Matthew metenioned in [1], it is safe to call folio_unlock() before
calling folio_put() here, because the folio is in page cache with refcount
held, and truncation will wait for the unlock.
Optimize filemap_map_folio_range() with the same method too.
With this patch, we can get 8% performance gain for lmbench testcase
'lat_pagefault -P 1 file' in order0 folio case, the size of file is 512M.
Link: https://lkml.kernel.org/r/20250904132737.1250368-1-tujinjiang@huawei.com
Link: https://lore.kernel.org/all/aKcU-fzxeW3xT5Wv@casper.infradead.org/ [1]
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
After commit acd7ccb284 ("mm: shmem: add large folio support for
tmpfs"), we have extended tmpfs to allow any sized large folios, rather
than just PMD-sized large folios.
The strategy discussed previously was:
: Considering that tmpfs already has the 'huge=' option to control the
: PMD-sized large folios allocation, we can extend the 'huge=' option to
: allow any sized large folios. The semantics of the 'huge=' mount option
: are:
:
: huge=never: no any sized large folios
: huge=always: any sized large folios
: huge=within_size: like 'always' but respect the i_size
: huge=advise: like 'always' if requested with madvise()
:
: Note: for tmpfs mmap() faults, due to the lack of a write size hint, still
: allocate the PMD-sized huge folios if huge=always/within_size/advise is
: set.
:
: Moreover, the 'deny' and 'force' testing options controlled by
: '/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the same
: semantics. The 'deny' can disable any sized large folios for tmpfs, while
: the 'force' can enable PMD sized large folios for tmpfs.
This means that when tmpfs is mounted with 'huge=always' or
'huge=within_size', tmpfs will allow getting a highest order hint based on
the size of write() and fallocate() paths. It will then try each
allowable large order, rather than continually attempting to allocate
PMD-sized large folios as before.
However, this might break some user scenarios for those who want to use
PMD-sized large folios, such as the i915 driver which did not supply a
write size hint when allocating shmem [1].
Moreover, Hugh also complained that this will cause a regression in userspace
with 'huge=always' or 'huge=within_size'.
So, let's revisit the strategy for tmpfs large page allocation. A simple fix
would be to always try PMD-sized large folios first, and if that fails, fall
back to smaller large folios. This approach differs from the strategy for
large folio allocation used by other file systems, however, tmpfs is somewhat
different from other file systems, as quoted from David's opinion:
: There were opinions in the past that tmpfs should just behave like any
: other fs, and I think that's what we tried to satisfy here: use the write
: size as an indication.
:
: I assume there will be workloads where either approach will be beneficial.
: I also assume that workloads that use ordinary fs'es could benefit from
: the same strategy (start with PMD), while others will clearly not.
Link: https://lkml.kernel.org/r/10e7ac6cebe6535c137c064d5c5a235643eebb4a.1756888965.git.baolin.wang@linux.alibaba.com
Link: https://lore.kernel.org/lkml/0d734549d5ed073c80b11601da3abdd5223e1889.1753689802.git.baolin.wang@linux.alibaba.com/ [1]
Fixes: acd7ccb284 ("mm: shmem: add large folio support for tmpfs")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patchset fixes two issues we saw in production rollout.
The first issue is that we saw all zero output of memory allocation
profiling information from show_mem() if CONFIG_MEM_ALLOC_PROFILING is set
and sysctl.vm.mem_profiling=0. This cause ambiguity as we don't know what
0B actually means in the output. It can mean either memory allocation
profiling is temporary disabled or the allocation at that position is
actually 0. Such ambiguity will make further parsing harder as we cannot
differentiate between two case.
The second issue is that multiple entities can call show_mem() which
messed up the allocation info in dmesg. We saw outputs like this:
327 MiB 83635 mm/compaction.c:1880 func:compaction_alloc
48.4 GiB 12684937 mm/memory.c:1061 func:folio_prealloc
7.48 GiB 10899 mm/huge_memory.c:1159 func:vma_alloc_anon_folio_pmd
298 MiB 95216 kernel/fork.c:318 func:alloc_thread_stack_node
250 MiB 63901 mm/zsmalloc.c:987 func:alloc_zspage
1.42 GiB 372527 mm/memory.c:1063 func:folio_prealloc
1.17 GiB 95693 mm/slub.c:2424 func:alloc_slab_page
651 MiB 166732 mm/readahead.c:270 func:page_cache_ra_unbounded
419 MiB 107261 net/core/page_pool.c:572 func:__page_pool_alloc_pages_slow
404 MiB 103425 arch/x86/mm/pgtable.c:25 func:pte_alloc_one
The above example is because one kthread invokes show_mem() from
__alloc_pages_slowpath while kernel itself calls oom_kill_process()
This patch (of 2):
This patch prints the status of the memory allocation profiling before
__show_mem actually prints the detailed allocation info. This way will
let us know the `0B` we saw in allocation info is because the profiling is
disabled or the allocation is actually 0B.
Link: https://lkml.kernel.org/r/cover.1756897825.git.pyyjason@gmail.com
Link: https://lkml.kernel.org/r/d7998ea0ddc2ea1a78bb6e89adf530526f76679a.1756897825.git.pyyjason@gmail.com
Signed-off-by: Yueyang Pan <pyyjason@gmail.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On XFS systems with pagesize=4K, blocksize=16K, and
CONFIG_TRANSPARENT_HUGEPAGE enabled, We observed the following readahead
behaviors:
# echo 3 > /proc/sys/vm/drop_caches
# dd if=test of=/dev/null bs=64k count=1
# ./tools/mm/page-types -r -L -f /mnt/xfs/test
foffset offset flags
0 136d4c __RU_l_________H______t_________________F_1
1 136d4d __RU_l__________T_____t_________________F_1
2 136d4e __RU_l__________T_____t_________________F_1
3 136d4f __RU_l__________T_____t_________________F_1
...
c 136bb8 __RU_l_________H______t_________________F_1
d 136bb9 __RU_l__________T_____t_________________F_1
e 136bba __RU_l__________T_____t_________________F_1
f 136bbb __RU_l__________T_____t_________________F_1 <-- first read
10 13c2cc ___U_l_________H______t______________I__F_1 <-- readahead flag
11 13c2cd ___U_l__________T_____t______________I__F_1
12 13c2ce ___U_l__________T_____t______________I__F_1
13 13c2cf ___U_l__________T_____t______________I__F_1
...
1c 1405d4 ___U_l_________H______t_________________F_1
1d 1405d5 ___U_l__________T_____t_________________F_1
1e 1405d6 ___U_l__________T_____t_________________F_1
1f 1405d7 ___U_l__________T_____t_________________F_1
[ra_size = 32, req_count = 16, async_size = 16]
# echo 3 > /proc/sys/vm/drop_caches
# dd if=test of=/dev/null bs=60k count=1
# ./page-types -r -L -f /mnt/xfs/test
foffset offset flags
0 136048 __RU_l_________H______t_________________F_1
...
c 110a40 __RU_l_________H______t_________________F_1
d 110a41 __RU_l__________T_____t_________________F_1
e 110a42 __RU_l__________T_____t_________________F_1 <-- first read
f 110a43 __RU_l__________T_____t_________________F_1 <-- first readahead flag
10 13e7a8 ___U_l_________H______t_________________F_1
...
20 137a00 ___U_l_________H______t_______P______I__F_1 <-- second readahead flag (20 - 2f)
21 137a01 ___U_l__________T_____t_______P______I__F_1
...
3f 10d4af ___U_l__________T_____t_______P_________F_1
[first readahead: ra_size = 32, req_count = 15, async_size = 17]
When reading 64k data (same for 61-63k range, where last_index is
page-aligned in filemap_get_pages()), 128k readahead is triggered via
page_cache_sync_ra() and the PG_readahead flag is set on the next folio
(the one containing 0x10 page).
When reading 60k data, 128k readahead is also triggered via
page_cache_sync_ra(). However, in this case the readahead flag is set on
the 0xf page. Although the requested read size (req_count) is 60k, the
actual read will be aligned to folio size (64k), which triggers the
readahead flag and initiates asynchronous readahead via
page_cache_async_ra(). This results in two readahead operations totaling
256k.
The root cause is that when the requested size is smaller than the actual
read size (due to folio alignment), it triggers asynchronous readahead.
By changing last_index alignment from page size to folio size, we ensure
the requested size matches the actual read size, preventing the case where
a single read operation triggers two readahead operations.
After applying the patch:
# echo 3 > /proc/sys/vm/drop_caches
# dd if=test of=/dev/null bs=60k count=1
# ./page-types -r -L -f /mnt/xfs/test
foffset offset flags
0 136d4c __RU_l_________H______t_________________F_1
1 136d4d __RU_l__________T_____t_________________F_1
2 136d4e __RU_l__________T_____t_________________F_1
3 136d4f __RU_l__________T_____t_________________F_1
...
c 136bb8 __RU_l_________H______t_________________F_1
d 136bb9 __RU_l__________T_____t_________________F_1
e 136bba __RU_l__________T_____t_________________F_1 <-- first read
f 136bbb __RU_l__________T_____t_________________F_1
10 13c2cc ___U_l_________H______t______________I__F_1 <-- readahead flag
11 13c2cd ___U_l__________T_____t______________I__F_1
12 13c2ce ___U_l__________T_____t______________I__F_1
13 13c2cf ___U_l__________T_____t______________I__F_1
...
1c 1405d4 ___U_l_________H______t_________________F_1
1d 1405d5 ___U_l__________T_____t_________________F_1
1e 1405d6 ___U_l__________T_____t_________________F_1
1f 1405d7 ___U_l__________T_____t_________________F_1
[ra_size = 32, req_count = 16, async_size = 16]
The same phenomenon will occur when reading from 49k to 64k. Set the
readahead flag to the next folio.
Because the minimum order of folio in address_space equals the block size
(at least in xfs and bcachefs that already support bs > ps), having
request_count aligned to block size will not cause overread.
[klarasmodin@gmail.com: fix overflow on 32-bit]
Link: https://lkml.kernel.org/r/yru7qf5gvyzccq5ohhpylvxug5lr5tf54omspbjh4sm6pcdb2r@fpjgj2pxw7va
[akpm@linux-foundation.org: update it for Max's constification efforts]
Link: https://lkml.kernel.org/r/20250711055509.91587-1-youling.tang@linux.dev
Co-developed-by: Chi Zhiling <chizhiling@kylinos.cn>
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Klara Modin <klarasmodin@gmail.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Youling Tang <youling.tang@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Klara Modin <klarasmodin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In cloud environments with massive hugepage reservations (95%+ of system
RAM), single-attempt allocation during early boot often fails due to
memory pressure.
Commit 91f386bf07 ("hugetlb: batch freeing of vmemmap pages")
intensified this by deferring page frees, increase peak memory usage
during allocation.
Introduce a retry mechanism that leverages vmemmap optimization reclaim
(~1.6% memory) when available. Upon initial allocation failure, the
system retries until successful or no further progress is made, ensuring
reliable hugepage allocation while preserving batched vmemmap freeing
benefits.
Testing on a 256G machine allocating 252G of hugepages:
Before: 128056/129024 hugepages allocated
After: Successfully allocated all 129024 hugepages
Link: https://lkml.kernel.org/r/20250901082052.3247-1-lirongqing@baidu.com
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "introduce kasan.write_only option in hw-tags", v8.
Hardware tag based KASAN is implemented using the Memory Tagging Extension
(MTE) feature.
MTE is built on top of the ARMv8.0 virtual address tagging TBI (Top Byte
Ignore) feature and allows software to access a 4-bit allocation tag for
each 16-byte granule in the physical address space. A logical tag is
derived from bits 59-56 of the virtual address used for the memory access.
A CPU with MTE enabled will compare the logical tag against the
allocation tag and potentially raise an tag check fault on mismatch,
subject to system registers configuration.
Since ARMv8.9, FEAT_MTE_STORE_ONLY can be used to restrict raise of tag
check fault on store operation only.
Using this feature (FEAT_MTE_STORE_ONLY), introduce KASAN write-only mode
which restricts KASAN check write (store) operation only. This mode omits
KASAN check for read (fetch/load) operation. Therefore, it might be used
not only debugging purpose but also in normal environment.
This patch (of 2):
Since Armv8.9, FEATURE_MTE_STORE_ONLY feature is introduced to restrict
raise of tag check fault on store operation only. Introduce KASAN write
only mode based on this feature.
KASAN write only mode restricts KASAN checks operation for write only and
omits the checks for fetch/read operations when accessing memory. So it
might be used not only debugging enviroment but also normal enviroment to
check memory safty.
This features can be controlled with "kasan.write_only" arguments. When
"kasan.write_only=on", KASAN checks write operation only otherwise KASAN
checks all operations.
This changes the MTE_STORE_ONLY feature as BOOT_CPU_FEATURE like
ARM64_MTE_ASYMM so that makes it initialise in kasan_init_hw_tags() with
other function together.
Link: https://lkml.kernel.org/r/20250916222755.466009-1-yeoreum.yun@arm.com
Link: https://lkml.kernel.org/r/20250916222755.466009-2-yeoreum.yun@arm.com
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Breno Leitao <leitao@debian.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: D Scott Phillips <scott@os.amperecomputing.com>
Cc: Hardevsinh Palaniya <hardevsinh.palaniya@siliconsignals.io>
Cc: James Morse <james.morse@arm.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: levi.yun <yeoreum.yun@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Pankaj Gupta <pankaj.gupta@amd.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We want to get rid of nth_page(), and kfence init code is the last user.
Unfortunately, we might actually walk a PFN range where the pages are not
contiguous, because we might be allocating an area from memblock that
could span memory sections in problematic kernel configs (SPARSEMEM
without SPARSEMEM_VMEMMAP).
We could check whether the page range is contiguous using
page_range_contiguous() and failing kfence init, or making kfence
incompatible these problemtic kernel configs.
Let's keep it simple and simply use pfn_to_page() by iterating PFNs.
Link: https://lkml.kernel.org/r/20250901150359.867252-36-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The expectation is that there is currently no user that would pass in
non-contigous page ranges: no allocator, not even VMA, will hand these
out.
The only problematic part would be if someone would provide a range
obtained directly from memblock, or manually merge problematic ranges. If
we find such cases, we should fix them to create separate SG entries.
Let's check in sg_set_page() that this is really the case. No need to
check in sg_set_folio(), as pages in a folio are guaranteed to be
contiguous. As sg_set_page() gets inlined into modules, we have to export
the page_range_contiguous() helper -- use EXPORT_SYMBOL, there is nothing
special about this helper such that we would want to enforce GPL-only
modules.
We can now drop the nth_page() usage in sg_page_iter_page().
Link: https://lkml.kernel.org/r/20250901150359.867252-25-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Let's disallow handing out PFN ranges with non-contiguous pages, so we can
remove the nth-page usage in __cma_alloc(), and so any callers don't have
to worry about that either when wanting to blindly iterate pages.
This is really only a problem in configs with SPARSEMEM but without
SPARSEMEM_VMEMMAP, and only when we would cross memory sections in some
cases.
Will this cause harm? Probably not, because it's mostly 32bit that does
not support SPARSEMEM_VMEMMAP. If this ever becomes a problem we could
look into allocating the memmap for the memory sections spanned by a
single CMA region in one go from memblock.
[david@redhat.com: we can have NUMMU configs with SPARSEMEM enabled]
Link: https://lkml.kernel.org/r/6ec933b1-b3f7-41c0-95d8-e518bb87375e@redhat.com
Link: https://lkml.kernel.org/r/20250901150359.867252-23-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We can now safely iterate over all pages in a folio, so no need for the
pfn_to_page().
Also, as we already force the refcount in __init_single_page() to 1
through init_page_count(), we can just set the refcount to 0 and avoid
page_ref_freeze() + VM_BUG_ON. Likely, in the future, we would just want
to tell __init_single_page() to which value to initialize the refcount.
Further, adjust the comments to highlight that we are dealing with an
open-coded prep_compound_page() variant, and add another comment
explaining why we really need the __init_single_page() only on the tail
pages.
Note that the current code was likely problematic, but we never ran into
it: prep_compound_tail() would have been called with an offset that might
exceed a memory section, and prep_compound_tail() would have simply added
that offset to the page pointer -- which would not have done the right
thing on sparsemem without vmemmap.
Link: https://lkml.kernel.org/r/20250901150359.867252-14-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Grepping for "prep_compound_page" leaves on clueless how devdax gets its
compound pages initialized.
Let's add a comment that might help finding this open-coded
prep_compound_page() initialization more easily.
Further, let's be less smart about the ordering of initialization and just
perform the prep_compound_head() call after all tail pages were
initialized: just like prep_compound_page() does.
No need for a comment to describe the initialization order: again, just
like prep_compound_page().
Link: https://lkml.kernel.org/r/20250901150359.867252-10-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Let's check that no hstate that corresponds to an unreasonable folio size
is registered by an architecture. If we were to succeed registering, we
could later try allocating an unsupported gigantic folio size.
Further, let's add a BUILD_BUG_ON() for checking that HUGETLB_PAGE_ORDER
is sane at build time. As HUGETLB_PAGE_ORDER is dynamic on powerpc, we
have to use a BUILD_BUG_ON_INVALID() to make it compile.
No existing kernel configuration should be able to trigger this check:
either SPARSEMEM without SPARSEMEM_VMEMMAP cannot be configured or
gigantic folios will not exceed a memory section (the case on sparse).
Link: https://lkml.kernel.org/r/20250901150359.867252-9-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Let's reject unreasonable folio sizes early, where we can still fail.
We'll add sanity checks to prepare_compound_head/prepare_compound_page
next.
Is there a way to configure a system such that unreasonable folio sizes
would be possible? It would already be rather questionable.
If so, we'd probably want to bail out earlier, where we can avoid a WARN
and just report a proper error message that indicates where something went
wrong such that we messed up.
Link: https://lkml.kernel.org/r/20250901150359.867252-8-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: remove zpool".
zpool is an indirection layer for zswap to switch between multiple
allocator backends at runtime. Since 6.15, zsmalloc is the only allocator
left in-tree, so there is no point in keeping zpool around.
This patch (of 3):
zswap goes through the zpool layer to enable runtime-switching of
allocator backends for compressed data. However, since zbud and z3fold
were removed in 6.15, zsmalloc has been the only option available.
As such, the zpool indirection is unnecessary. Make zswap deal with
zsmalloc directly. This is comparable to zram, which also directly
interacts with zsmalloc and has never supported a different backend.
Note that this does not preclude future improvements and experiments with
different allocation strategies. Should it become necessary, it's
possible to provide an alternate implementation for the zsmalloc API,
selectable at compile time. However, zsmalloc is also rather mature and
feature rich, with years of widespread production exposure; it's
encouraged to make incremental improvements rather than fork it.
In any case, the complexity of runtime pluggability seems excessive and
unjustified at this time. Switch zswap to zsmalloc to remove the last
user of the zpool API.
[hannes@cmpxchg.org: fix default compressr test]
Link: https://lkml.kernel.org/r/20250915153640.GA828739@cmpxchg.org
Link: https://lkml.kernel.org/r/20250829162212.208258-1-hannes@cmpxchg.org
Link: https://lkml.kernel.org/r/20250829162212.208258-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Nacked-by: Vitaly Wool <vitaly.wool@konsulko.se>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "kasan: unify kasan_enabled() and remove arch-specific
implementations", v6.
This patch series addresses the fragmentation in KASAN initialization
across architectures by introducing a unified approach that eliminates
duplicate static keys and arch-specific kasan_arch_is_ready()
implementations.
The core issue is that different architectures have inconsistent approaches
to KASAN readiness tracking:
- PowerPC, LoongArch, and UML arch, each implement own kasan_arch_is_ready()
- Only HW_TAGS mode had a unified static key (kasan_flag_enabled)
- Generic and SW_TAGS modes relied on arch-specific solutions
or always-on behavior
This patch (of 2):
Introduce CONFIG_ARCH_DEFER_KASAN to identify architectures [1] that need
to defer KASAN initialization until shadow memory is properly set up, and
unify the static key infrastructure across all KASAN modes.
[1] PowerPC, UML, LoongArch selects ARCH_DEFER_KASAN.
The core issue is that different architectures haveinconsistent approaches
to KASAN readiness tracking:
- PowerPC, LoongArch, and UML arch, each implement own
kasan_arch_is_ready()
- Only HW_TAGS mode had a unified static key (kasan_flag_enabled)
- Generic and SW_TAGS modes relied on arch-specific solutions or always-on
behavior
This patch addresses the fragmentation in KASAN initialization across
architectures by introducing a unified approach that eliminates duplicate
static keys and arch-specific kasan_arch_is_ready() implementations.
Let's replace kasan_arch_is_ready() with existing kasan_enabled() check,
which examines the static key being enabled if arch selects
ARCH_DEFER_KASAN or has HW_TAGS mode support. For other arch,
kasan_enabled() checks the enablement during compile time.
Now KASAN users can use a single kasan_enabled() check everywhere.
Link: https://lkml.kernel.org/r/20250810125746.1105476-1-snovitoll@gmail.com
Link: https://lkml.kernel.org/r/20250810125746.1105476-2-snovitoll@gmail.com
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217049
Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> #powerpc
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Gow <davidgow@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@loongson.cn>
Cc: Marco Elver <elver@google.com>
Cc: Qing Zhang <zhangqing@loongson.cn>
Cc: Sabyrzhan Tasbolatov <snovitoll@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>