mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2026-05-08 08:02:59 -04:00
0d2a2605237341f9dfde99cd0ed3c2d003322464
1337496 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
0d2a260523 |
mm: remove references to folio in __memcg_kmem_uncharge_page()
This use of folios is misleading because these pages are not part of a folio. Remove an unnecessary call to page_folio(), saving 58 bytes of text in a Debian kernel build. Link: https://lkml.kernel.org/r/20250314133617.138071-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
8492936abb |
mm: simplify folio_memcg_charged()
There's no need to check which kind of pointer is in the memcg_data field, all we actually care about is whether it's zero or not. Saves 70 bytes in workingset_activation() with the Debian config. Link: https://lkml.kernel.org/r/20250314133617.138071-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
7cc57ecae4 |
mm: remove references to folio in split_page_memcg()
We know that the passed in page is not part of a folio (it's a plain page allocated with GFP_ACCOUNT), so we should get rid of the misleading references to folios. Introduce page_objcg() and page_set_objcg() helpers to make things more clear. Link: https://lkml.kernel.org/r/20250314133617.138071-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
1506c25508 |
mm: simplify split_page_memcg()
The last argument to split_page_memcg() is now always 0, so remove it,
effectively reverting commit
|
||
|
|
fa23a338de |
mm: separate folio_split_memcg_refs() from split_page_memcg()
Patch series "Minor memcg cleanups & prep for memdescs", v2. Separate the handling of accounted folios and GFP_ACCOUNT pages for easier to understand code. For more detail, see https://lore.kernel.org/linux-mm/Z9LwTOudOlCGny3f@casper.infradead.org/ This patch (of 5): Folios always use memcg_data to refer to the mem_cgroup while pages allocated with GFP_ACCOUNT have a pointer to the obj_cgroup. Since the caller already knows what it has, split the function into two and then we don't need to check. Move the assignment of split folio memcg_data to the point where we set up the other parts of the new folio. That leaves folio_split_memcg_refs() just handling the memcg accounting. Link: https://lkml.kernel.org/r/20250314133617.138071-1-willy@infradead.org Link: https://lkml.kernel.org/r/20250314133617.138071-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
cb44821e1f |
memcg: move do_memsw_account() to CONFIG_MEMCG_V1
The do_memsw_account() is used to enable or disable legacy memory+swap accounting in memory cgroup. However with disabled CONFIG_MEMCG_V1, we don't need to keep checking it. So, let's always return false for !CONFIG_MEMCG_V1 configs. Before the patch: $ size mm/memcontrol.o text data bss dec hex filename 49928 10736 4172 64836 fd44 mm/memcontrol.o After the patch: $ size mm/memcontrol.o text data bss dec hex filename 49430 10480 4172 64082 fa52 mm/memcontrol.o Link: https://lkml.kernel.org/r/20250312222552.3284173-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
20d6c17252 |
memcg: avoid refill_stock for root memcg
We never charge the page counters of root memcg, so there is no need to put root memcg in the memcg stock. At the moment, refill_stock() can be called from try_charge_memcg(), obj_cgroup_uncharge_pages() and mem_cgroup_uncharge_skmem(). The try_charge_memcg() and mem_cgroup_uncharge_skmem() are never called with root memcg, so those are fine. However obj_cgroup_uncharge_pages() can potentially call refill_stock() with root memcg if the objcg object has been reparented over to the root memcg. Let's just avoid refill_stock() from obj_cgroup_uncharge_pages() for root memcg. Link: https://lkml.kernel.org/r/20250313054812.2185900-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhockoc@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
b4f65dbdf8 |
mm/mm_init: rename init_reserved_page to init_deferred_page
When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, init_reserved_page() function performs initialization of a struct page that would have been deferred normally. Rename it to init_deferred_page() to better reflect what the function does. Link: https://lkml.kernel.org/r/20250225083017.567649-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
09bdc4fe70 |
mm/mm_init: rename __init_reserved_page_zone to __init_page_from_nid
__init_reserved_page_zone() function finds the zone for pfn and nid and performs initialization of a struct page with that zone and nid. There is nothing in that function about reserved pages and it is misnamed. Rename it to __init_page_from_nid() to better reflect what the function does. Link: https://lkml.kernel.org/r/20250225083017.567649-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
24ac6fb6e3 |
mm/cma: using per-CMA locks to improve concurrent allocation performance
For different CMAs, concurrent allocation of CMA memory ideally should not require synchronization using locks. Currently, a global cma_mutex lock is employed to synchronize all CMA allocations, which can impact the performance of concurrent allocations across different CMAs. To test the performance impact, follow these steps: 1. Boot the kernel with the command line argument hugetlb_cma=30G to allocate a 30GB CMA area specifically for huge page allocations. (note: on my machine, which has 3 nodes, each node is initialized with 10G of CMA) 2. Use the dd command with parameters if=/dev/zero of=/dev/shm/file bs=1G count=30 to fully utilize the CMA area by writing zeroes to a file in /dev/shm. 3. Open three terminals and execute the following commands simultaneously: (Note: Each of these commands attempts to allocate 10GB [ |
||
|
|
a211c6550e |
mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
The previous patch added pageblock_order reclaim to kswapd/kcompactd,
which helps, but produces only one block at a time. Allocation stalls and
THP failure rates are still higher than they could be.
To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change the
watermarking for kswapd & kcompactd: instead of targeting the high
watermark in order-0 pages and checking for one suitable block, simply
require that the high watermark is entirely met in pageblocks.
To this end, track the number of free pages within contiguous pageblocks,
then change pgdat_balanced() and compact_finished() to check watermarks
against this new value.
This further reduces THP latencies and allocation stalls, and improves THP
success rates against the previous patch:
DEFRAGMODE-ASYNC DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean 34300.36 ( +0.00%) 28904.00 ( -15.73%)
Hugealloc Time stddev 36390.42 ( +0.00%) 33464.37 ( -8.04%)
Kbuild Real time 196.13 ( +0.00%) 196.59 ( +0.23%)
Kbuild User time 1234.74 ( +0.00%) 1231.67 ( -0.25%)
Kbuild System time 62.62 ( +0.00%) 59.10 ( -5.54%)
THP fault alloc 57054.53 ( +0.00%) 63223.67 ( +10.81%)
THP fault fallback 11581.40 ( +0.00%) 5412.47 ( -53.26%)
Direct compact fail 107.80 ( +0.00%) 59.07 ( -44.79%)
Direct compact success 4.53 ( +0.00%) 2.80 ( -31.33%)
Direct compact success rate % 3.20 ( +0.00%) 3.99 ( +18.66%)
Compact daemon scanned migrate 5461033.93 ( +0.00%) 2267500.33 ( -58.48%)
Compact daemon scanned free 5824897.93 ( +0.00%) 2339773.00 ( -59.83%)
Compact direct scanned migrate 58336.93 ( +0.00%) 47659.93 ( -18.30%)
Compact direct scanned free 32791.87 ( +0.00%) 40729.67 ( +24.21%)
Compact total migrate scanned 5519370.87 ( +0.00%) 2315160.27 ( -58.05%)
Compact total free scanned 5857689.80 ( +0.00%) 2380502.67 ( -59.36%)
Alloc stall 2424.60 ( +0.00%) 638.87 ( -73.62%)
Pages kswapd scanned 2657018.33 ( +0.00%) 4002186.33 ( +50.63%)
Pages kswapd reclaimed 559583.07 ( +0.00%) 718577.80 ( +28.41%)
Pages direct scanned 722094.07 ( +0.00%) 355172.73 ( -50.81%)
Pages direct reclaimed 107257.80 ( +0.00%) 31162.80 ( -70.95%)
Pages total scanned 3379112.40 ( +0.00%) 4357359.07 ( +28.95%)
Pages total reclaimed 666840.87 ( +0.00%) 749740.60 ( +12.43%)
Swap out 77238.20 ( +0.00%) 110084.33 ( +42.53%)
Swap in 11712.80 ( +0.00%) 24457.00 ( +108.80%)
File refaults 143438.80 ( +0.00%) 188226.93 ( +31.22%)
Also of note is that compaction work overall is reduced. The reason for
this is that when free pageblocks are more readily available, allocations
are also much more likely to get physically placed in LRU order, instead
of being forced to scavenge free space here and there. This means that
reclaim by itself has better chances of freeing up whole blocks, and the
system relies less on compaction.
Comparing all changes to the vanilla kernel:
VANILLA DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean 52739.45 ( +0.00%) 28904.00 ( -45.19%)
Hugealloc Time stddev 56541.26 ( +0.00%) 33464.37 ( -40.81%)
Kbuild Real time 197.47 ( +0.00%) 196.59 ( -0.44%)
Kbuild User time 1240.49 ( +0.00%) 1231.67 ( -0.71%)
Kbuild System time 70.08 ( +0.00%) 59.10 ( -15.45%)
THP fault alloc 46727.07 ( +0.00%) 63223.67 ( +35.30%)
THP fault fallback 21910.60 ( +0.00%) 5412.47 ( -75.29%)
Direct compact fail 195.80 ( +0.00%) 59.07 ( -69.48%)
Direct compact success 7.93 ( +0.00%) 2.80 ( -57.46%)
Direct compact success rate % 3.51 ( +0.00%) 3.99 ( +10.49%)
Compact daemon scanned migrate 3369601.27 ( +0.00%) 2267500.33 ( -32.71%)
Compact daemon scanned free 5075474.47 ( +0.00%) 2339773.00 ( -53.90%)
Compact direct scanned migrate 161787.27 ( +0.00%) 47659.93 ( -70.54%)
Compact direct scanned free 163467.53 ( +0.00%) 40729.67 ( -75.08%)
Compact total migrate scanned 3531388.53 ( +0.00%) 2315160.27 ( -34.44%)
Compact total free scanned 5238942.00 ( +0.00%) 2380502.67 ( -54.56%)
Alloc stall 2371.07 ( +0.00%) 638.87 ( -73.02%)
Pages kswapd scanned 2160926.73 ( +0.00%) 4002186.33 ( +85.21%)
Pages kswapd reclaimed 533191.07 ( +0.00%) 718577.80 ( +34.77%)
Pages direct scanned 400450.33 ( +0.00%) 355172.73 ( -11.31%)
Pages direct reclaimed 94441.73 ( +0.00%) 31162.80 ( -67.00%)
Pages total scanned 2561377.07 ( +0.00%) 4357359.07 ( +70.12%)
Pages total reclaimed 627632.80 ( +0.00%) 749740.60 ( +19.46%)
Swap out 47959.53 ( +0.00%) 110084.33 ( +129.53%)
Swap in 7276.00 ( +0.00%) 24457.00 ( +236.10%)
File refaults 138043.00 ( +0.00%) 188226.93 ( +36.35%)
THP allocation latencies and %sys time are down dramatically.
THP allocation failures are down from nearly 50% to 8.5%. And to recall
previous data points, the success rates are steady and reliable without
the cumulative deterioration of fragmentation events.
Compaction work is down overall. Direct compaction work especially is
drastically reduced. As an aside, its success rate of 4% indicates there
is room for improvement. For now it's good to rely on it less.
Reclaim work is up overall, however direct reclaim work is down. Part of
the increase can be attributed to a higher use of THPs, which due to
internal fragmentation increase the memory footprint. This is not
necessarily an unexpected side-effect for users of THP.
However, taken both points together, there may well be some opportunities
for fine tuning in the reclaim/compaction coordination.
[hannes@cmpxchg.org: fix squawks from rebasing]
Link: https://lkml.kernel.org/r/20250314210558.GD1316033@cmpxchg.org
Link: https://lkml.kernel.org/r/20250313210647.1314586-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
101f9d666e |
mm: page_alloc: defrag_mode kswapd/kcompactd assistance
When defrag_mode is enabled, allocation fallbacks strongly prefer whole
block conversions instead of polluting or stealing partially used blocks.
This means there is a demand for pageblocks even from sub-block requests.
Let kswapd/kcompactd help produce them.
By the time kswapd gets woken up, normal rmqueue and block conversion
fallbacks have been attempted and failed. So always wake kswapd with the
block order; it will take care of producing a suitable compaction gap and
then chain-wake kcompactd with the block order when its done.
VANILLA DEFRAGMODE-ASYNC
Hugealloc Time mean 52739.45 ( +0.00%) 34300.36 ( -34.96%)
Hugealloc Time stddev 56541.26 ( +0.00%) 36390.42 ( -35.64%)
Kbuild Real time 197.47 ( +0.00%) 196.13 ( -0.67%)
Kbuild User time 1240.49 ( +0.00%) 1234.74 ( -0.46%)
Kbuild System time 70.08 ( +0.00%) 62.62 ( -10.50%)
THP fault alloc 46727.07 ( +0.00%) 57054.53 ( +22.10%)
THP fault fallback 21910.60 ( +0.00%) 11581.40 ( -47.14%)
Direct compact fail 195.80 ( +0.00%) 107.80 ( -44.72%)
Direct compact success 7.93 ( +0.00%) 4.53 ( -38.06%)
Direct compact success rate % 3.51 ( +0.00%) 3.20 ( -6.89%)
Compact daemon scanned migrate 3369601.27 ( +0.00%) 5461033.93 ( +62.07%)
Compact daemon scanned free 5075474.47 ( +0.00%) 5824897.93 ( +14.77%)
Compact direct scanned migrate 161787.27 ( +0.00%) 58336.93 ( -63.94%)
Compact direct scanned free 163467.53 ( +0.00%) 32791.87 ( -79.94%)
Compact total migrate scanned 3531388.53 ( +0.00%) 5519370.87 ( +56.29%)
Compact total free scanned 5238942.00 ( +0.00%) 5857689.80 ( +11.81%)
Alloc stall 2371.07 ( +0.00%) 2424.60 ( +2.26%)
Pages kswapd scanned 2160926.73 ( +0.00%) 2657018.33 ( +22.96%)
Pages kswapd reclaimed 533191.07 ( +0.00%) 559583.07 ( +4.95%)
Pages direct scanned 400450.33 ( +0.00%) 722094.07 ( +80.32%)
Pages direct reclaimed 94441.73 ( +0.00%) 107257.80 ( +13.57%)
Pages total scanned 2561377.07 ( +0.00%) 3379112.40 ( +31.93%)
Pages total reclaimed 627632.80 ( +0.00%) 666840.87 ( +6.25%)
Swap out 47959.53 ( +0.00%) 77238.20 ( +61.05%)
Swap in 7276.00 ( +0.00%) 11712.80 ( +60.97%)
File refaults 138043.00 ( +0.00%) 143438.80 ( +3.91%)
With this patch, defrag_mode=1 beats the vanilla kernel in THP success
rates and allocation latencies. The trend holds over time:
thp_fault_alloc
VANILLA DEFRAGMODE-ASYNC
61988 52066
56474 58844
57258 58233
50187 58476
52388 54516
55409 59938
52925 57204
47648 60238
43669 55733
40621 56211
36077 59861
41721 57771
36685 58579
34641 51868
33215 56280
DEFRAGMODE-ASYNC also wins on %sys as ~3/4 of the direct compaction work
is shifted to kcompactd.
Reclaim activity is higher. Part of that is simply due to the increased
memory footprint from higher THP use. The other aspect is that *direct*
reclaim/compaction are still going for requested orders rather than
targeting the page blocks required for fallbacks, which is less efficient
than it could be. However, this is already a useful tradeoff to make, as
in many environments peak periods are short and retaining the ability to
produce THP through them is more important.
Link: https://lkml.kernel.org/r/20250313210647.1314586-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
e3aa7df331 |
mm: page_alloc: defrag_mode
The page allocator groups requests by migratetype to stave off
fragmentation. However, in practice this is routinely defeated by the
fact that it gives up *before* invoking reclaim and compaction - which may
well produce suitable pages. As a result, fragmentation of physical
memory is a common ongoing process in many load scenarios.
Fragmentation deteriorates compaction's ability to produce huge pages.
Depending on the lifetime of the fragmenting allocations, those effects
can be long-lasting or even permanent, requiring drastic measures like
forcible idle states or even reboots as the only reliable ways to recover
the address space for THP production.
In a kernel build test with supplemental THP pressure, the THP allocation
rate steadily declines over 15 runs:
thp_fault_alloc
61988
56474
57258
50187
52388
55409
52925
47648
43669
40621
36077
41721
36685
34641
33215
This is a hurdle in adopting THP in any environment where hosts are shared
between multiple overlapping workloads (cloud environments), and rarely
experience true idle periods. To make THP a reliable and predictable
optimization, there needs to be a stronger guarantee to avoid such
fragmentation.
Introduce defrag_mode. When enabled, reclaim/compaction is invoked to its
full extent *before* falling back. Specifically, ALLOC_NOFRAGMENT is
enforced on the allocator fastpath and the reclaiming slowpath.
For now, fallbacks are permitted to avert OOMs. There is a plan to add
defrag_mode=2 to prefer OOMs over fragmentation, but this requires
additional prep work in compaction and the reserve management to make it
ready for all possible allocation contexts.
The following test results are from a kernel build with periodic bursts of
THP allocations, over 15 runs:
vanilla defrag_mode=1
@claimer[unmovable]: 189 103
@claimer[movable]: 92 103
@claimer[reclaimable]: 207 61
@pollute[unmovable from movable]: 25 0
@pollute[unmovable from reclaimable]: 28 0
@pollute[movable from unmovable]: 38835 0
@pollute[movable from reclaimable]: 147136 0
@pollute[reclaimable from unmovable]: 178 0
@pollute[reclaimable from movable]: 33 0
@steal[unmovable from movable]: 11 0
@steal[unmovable from reclaimable]: 5 0
@steal[reclaimable from unmovable]: 107 0
@steal[reclaimable from movable]: 90 0
@steal[movable from reclaimable]: 354 0
@steal[movable from unmovable]: 130 0
Both types of polluting fallbacks are eliminated in this workload.
Interestingly, whole block conversions are reduced as well. This is
because once a block is claimed for a type, its empty space remains
available for future allocations, instead of being padded with fallbacks;
this allows the native type to group up instead of spreading out to new
blocks. The assumption in the allocator has been that pollution from
movable allocations is less harmful than from other types, since they can
be reclaimed or migrated out should the space be needed. However, since
fallbacks occur *before* reclaim/compaction is invoked, movable pollution
will still cause non-movable allocations to spread out and claim more
blocks.
Without fragmentation, THP rates hold steady with defrag_mode=1:
thp_fault_alloc
32478
20725
45045
32130
14018
21711
40791
29134
34458
45381
28305
17265
22584
28454
30850
While the downward trend is eliminated, the keen reader will of course
notice that the baseline rate is much smaller than the vanilla kernel's to
begin with. This is due to deficiencies in how reclaim and compaction are
currently driven: ALLOC_NOFRAGMENT increases the extent to which smaller
allocations are competing with THPs for pageblocks, while making no effort
themselves to reclaim or compact beyond their own request size. This
effect already exists with the current usage of ALLOC_NOFRAGMENT, but is
amplified by defrag_mode insisting on whole block stealing much more
strongly.
Subsequent patches will address defrag_mode reclaim strategy to raise the
THP success baseline above the vanilla kernel.
Link: https://lkml.kernel.org/r/20250313210647.1314586-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
f46012c0ec |
mm: page_alloc: trace type pollution from compaction capturing
When the page allocator places pages of a certain migratetype into blocks of another type, it has lasting effects on the ability to compact and defragment down the line. For improving placement and compaction, visibility into such events is crucial. The most common case, allocator fallbacks, is already annotated, but compaction capturing is also allowed to grab pages of a different type. Extend the tracepoint to cover this case. Link: https://lkml.kernel.org/r/20250313210647.1314586-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
67914ac086 |
mm: compaction: push watermark into compaction_suitable() callers
Patch series "mm: reliable huge page allocator". This series makes changes to the allocator and reclaim/compaction code to try harder to avoid fragmentation. As a result, this makes huge page allocations cheaper, more reliable and more sustainable. It's a subset of the huge page allocator RFC initially proposed here: https://lore.kernel.org/lkml/20230418191313.268131-1-hannes@cmpxchg.org/ The following results are from a kernel build test, with additional concurrent bursts of THP allocations on a memory-constrained system. Comparing before and after the changes over 15 runs: before after Hugealloc Time mean 52739.45 ( +0.00%) 28904.00 ( -45.19%) Hugealloc Time stddev 56541.26 ( +0.00%) 33464.37 ( -40.81%) Kbuild Real time 197.47 ( +0.00%) 196.59 ( -0.44%) Kbuild User time 1240.49 ( +0.00%) 1231.67 ( -0.71%) Kbuild System time 70.08 ( +0.00%) 59.10 ( -15.45%) THP fault alloc 46727.07 ( +0.00%) 63223.67 ( +35.30%) THP fault fallback 21910.60 ( +0.00%) 5412.47 ( -75.29%) Direct compact fail 195.80 ( +0.00%) 59.07 ( -69.48%) Direct compact success 7.93 ( +0.00%) 2.80 ( -57.46%) Direct compact success rate % 3.51 ( +0.00%) 3.99 ( +10.49%) Compact daemon scanned migrate 3369601.27 ( +0.00%) 2267500.33 ( -32.71%) Compact daemon scanned free 5075474.47 ( +0.00%) 2339773.00 ( -53.90%) Compact direct scanned migrate 161787.27 ( +0.00%) 47659.93 ( -70.54%) Compact direct scanned free 163467.53 ( +0.00%) 40729.67 ( -75.08%) Compact total migrate scanned 3531388.53 ( +0.00%) 2315160.27 ( -34.44%) Compact total free scanned 5238942.00 ( +0.00%) 2380502.67 ( -54.56%) Alloc stall 2371.07 ( +0.00%) 638.87 ( -73.02%) Pages kswapd scanned 2160926.73 ( +0.00%) 4002186.33 ( +85.21%) Pages kswapd reclaimed 533191.07 ( +0.00%) 718577.80 ( +34.77%) Pages direct scanned 400450.33 ( +0.00%) 355172.73 ( -11.31%) Pages direct reclaimed 94441.73 ( +0.00%) 31162.80 ( -67.00%) Pages total scanned 2561377.07 ( +0.00%) 4357359.07 ( +70.12%) Pages total reclaimed 627632.80 ( +0.00%) 749740.60 ( +19.46%) Swap out 47959.53 ( +0.00%) 110084.33 ( +129.53%) Swap in 7276.00 ( +0.00%) 24457.00 ( +236.10%) File refaults 138043.00 ( +0.00%) 188226.93 ( +36.35%) THP latencies are cut in half, and failure rates are cut by 75%. These metrics also hold up over time, while the vanilla kernel sees a steady downward trend in success rates with each subsequent run, owed to the cumulative effects of fragmentation. A more detailed discussion of results is in the patch changelogs. The patches first introduce a vm.defrag_mode sysctl, which enforces the existing ALLOC_NOFRAGMENT alloc flag until after reclaim and compaction have run. They then change kswapd and kcompactd to target pageblocks, which boosts success in the ALLOC_NOFRAGMENT hotpaths. Patches #1 and #2 are somewhat unrelated cleanups, but touch the same code and so are included here to avoid conflicts from re-ordering. This patch (of 5): compaction_suitable() hardcodes the min watermark, with a boost to the low watermark for costly orders. However, compaction_ready() requires order-0 at the high watermark. It currently checks the marks twice. Make the watermark a parameter to compaction_suitable() and have the callers pass in what they require: - compaction_zonelist_suitable() is used by the direct reclaim path, so use the min watermark. - compact_suit_allocation_order() has a watermark in context derived from cc->alloc_flags. The only quirk is that kcompactd doesn't initialize cc->alloc_flags explicitly. There is a direct check in kcompactd_do_work() that passes ALLOC_WMARK_MIN, but there is another check downstack in compact_zone() that ends up passing the unset alloc_flags. Since they default to 0, and that coincides with ALLOC_WMARK_MIN, it is correct. But it's subtle. Set cc->alloc_flags explicitly. - should_continue_reclaim() is direct reclaim, use the min watermark. - Finally, consolidate the two checks in compaction_ready() to a single compaction_suitable() call passing the high watermark. There is a tiny change in behavior: before, compaction_suitable() would check order-0 against min or low, depending on costly order. Then there'd be another high watermark check. Now, the high watermark is passed to compaction_suitable(), and the costly order-boost (low - min) is added on top. This means compaction_ready() sets a marginally higher target for free pages. In a kernelbuild + THP pressure test, though, this didn't show any measurable negative effects on memory pressure or reclaim rates. As the comment above the check says, reclaim is usually stopped short on should_continue_reclaim(), and this just defines the worst-case reclaim cutoff in case compaction is not making any headway. [hughd@google.com: stop oops on out-of-range highest_zoneidx] Link: https://lkml.kernel.org/r/005ace8b-07fa-01d4-b54b-394a3e029c07@google.com Link: https://lkml.kernel.org/r/20250313210647.1314586-1-hannes@cmpxchg.org Link: https://lkml.kernel.org/r/20250313210647.1314586-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
8defffa4c7 |
mm: convert lru_add_page_tail() to lru_add_split_folio()
Remove three hidden calls to compound_head() and accesses to page->lru. Link: https://lkml.kernel.org/r/20250313151458.4145978-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
f841ad9ca5 |
selftests/mm/cow: fix the incorrect error handling
Error handling doesn't check the correct return value. This patch will
fix it.
Link: https://lkml.kernel.org/r/20250312043840.71799-1-cyan.yang@sifive.com
Fixes:
|
||
|
|
456620c5cb |
mm/debug: add line breaks
Missing a newline character at the end of the format string. Link: https://lkml.kernel.org/r/20250312093717.364031-1-liuye@kylinos.cn Signed-off-by: Liu Ye <liuye@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
ce50f4bc42 |
MAINTAINERS: adjust file entry in MAPLE TREE
Commit 0f3b602e1bad ("tools: separate out shared radix-tree components")
moves files from radix-tree/linux to shared/linux in the ./tools/testing/
directory, but misses to adjust a file entry in MAPLE TREE. Hence,
./scripts/get_maintainer.pl --self-test=patterns complains about a broken
reference.
Adjust the file entry in MAPLE TREE.
Link: https://lkml.kernel.org/r/20250312105245.216302-1-lukas.bulwahn@redhat.com
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
d2734f044f |
mm: memory-failure: enhance comments for return value of memory_failure()
The comments for the return value of memory_failure are not complete, supplement the comments. Link: https://lkml.kernel.org/r/20250312112852.82415-4-xueshuai@linux.alibaba.com Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Tested-by: Tony Luck <tony.luck@intel.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ruidong Tian <tianruidong@linux.alibaba.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
aaf99ac2ce |
mm/hwpoison: do not send SIGBUS to processes with recovered clean pages
When an uncorrected memory error is consumed there is a race between the CMCI from the memory controller reporting an uncorrected error with a UCNA signature, and the core reporting and SRAR signature machine check when the data is about to be consumed. - Background: why *UN*corrected errors tied to *C*MCI in Intel platform [1] Prior to Icelake memory controllers reported patrol scrub events that detected a previously unseen uncorrected error in memory by signaling a broadcast machine check with an SRAO (Software Recoverable Action Optional) signature in the machine check bank. This was overkill because it's not an urgent problem that no core is on the verge of consuming that bad data. It's also found that multi SRAO UCE may cause nested MCE interrupts and finally become an IERR. Hence, Intel downgrades the machine check bank signature of patrol scrub from SRAO to UCNA (Uncorrected, No Action required), and signal changed to #CMCI. Just to add to the confusion, Linux does take an action (in uc_decode_notifier()) to try to offline the page despite the UC*NA* signature name. - Background: why #CMCI and #MCE race when poison is consuming in Intel platform [1] Having decided that CMCI/UCNA is the best action for patrol scrub errors, the memory controller uses it for reads too. But the memory controller is executing asynchronously from the core, and can't tell the difference between a "real" read and a speculative read. So it will do CMCI/UCNA if an error is found in any read. Thus: 1) Core is clever and thinks address A is needed soon, issues a speculative read. 2) Core finds it is going to use address A soon after sending the read request 3) The CMCI from the memory controller is in a race with MCE from the core that will soon try to retire the load from address A. Quite often (because speculation has got better) the CMCI from the memory controller is delivered before the core is committed to the instruction reading address A, so the interrupt is taken, and Linux offlines the page (marking it as poison). - Why user process is killed for instr case Commit |
||
|
|
1a15bb8303 |
x86/mce: use is_copy_from_user() to determine copy-from-user context
Patch series "mm/hwpoison: Fix regressions in memory failure handling",
v4.
## 1. What am I trying to do:
This patchset resolves two critical regressions related to memory failure
handling that have appeared in the upstream kernel since version 5.17, as
compared to 5.10 LTS.
- copyin case: poison found in user page while kernel copying from user space
- instr case: poison found while instruction fetching in user space
## 2. What is the expected outcome and why
- For copyin case:
Kernel can recover from poison found where kernel is doing get_user() or
copy_from_user() if those places get an error return and the kernel return
-EFAULT to the process instead of crashing. More specifily, MCE handler
checks the fixup handler type to decide whether an in kernel #MC can be
recovered. When EX_TYPE_UACCESS is found, the PC jumps to recovery code
specified in _ASM_EXTABLE_FAULT() and return a -EFAULT to user space.
- For instr case:
If a poison found while instruction fetching in user space, full recovery
is possible. User process takes #PF, Linux allocates a new page and fills
by reading from storage.
## 3. What actually happens and why
- For copyin case: kernel panic since v5.17
Commit
|
||
|
|
ca868cd770 |
mm: lock PGDAT_RECLAIM_LOCKED with acquire memory ordering
The PGDAT_RECLAIM_LOCKED bit is used to provide mutual exclusion of node reclaim for struct pglist_data using a single bit. Use test_and_set_bit_lock rather than test_and_set_bit to test-and-set PGDAT_RECLAIM_LOCKED with an acquire memory ordering semantic. This changes the "lock" acquisition from a full barrier to an acquire memory ordering, which is weaker. The acquire semi-permeable barrier paired with the release on unlock is sufficient for this mutual exclusion use-case. No behavior change intended other than to reduce overhead by using the appropriate barrier. Link: https://lkml.kernel.org/r/20250312141014.129725-2-mathieu.desnoyers@efficios.com Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: Jade Alglave <j.alglave@ucl.ac.uk> Cc: Luc Maranget <luc.maranget@inria.fr> Cc: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
c0ebbb3841 |
mm: add missing release barrier on PGDAT_RECLAIM_LOCKED unlock
The PGDAT_RECLAIM_LOCKED bit is used to provide mutual exclusion of node
reclaim for struct pglist_data using a single bit.
It is "locked" with a test_and_set_bit (similarly to a try lock) which
provides full ordering with respect to loads and stores done within
__node_reclaim().
It is "unlocked" with clear_bit(), which does not provide any ordering
with respect to loads and stores done before clearing the bit.
The lack of clear_bit() memory ordering with respect to stores within
__node_reclaim() can cause a subsequent CPU to fail to observe stores from
a prior node reclaim. This is not an issue in practice on TSO (e.g.
x86), but it is an issue on weakly-ordered architectures (e.g. arm64).
Fix this by using clear_bit_unlock rather than clear_bit to clear
PGDAT_RECLAIM_LOCKED with a release memory ordering semantic.
This provides stronger memory ordering (release rather than relaxed).
Link: https://lkml.kernel.org/r/20250312141014.129725-1-mathieu.desnoyers@efficios.com
Fixes:
|
||
|
|
be9258a6bf |
mm/madvise: remove len parameter of madvise_do_behavior()
Because madise_should_skip() logic is factored out, making madvise_do_behavior() calculates 'len' on its own rather then receiving it as a parameter makes code simpler. Remove the parameter. Link: https://lkml.kernel.org/r/20250312164750.59215-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Liam R. Howlett <howlett@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
0a6ffacb3b |
mm/madvise: deduplicate madvise_do_behavior() skip case handlings
The logic for checking if a given madvise() request for a single memory range can skip real work, namely madvise_do_behavior(), is duplicated in do_madvise() and vector_madvise(). Split out the logic to a function and reuse it. Link: https://lkml.kernel.org/r/20250312164750.59215-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Liam R. Howlett <howlett@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
f4a578d345 |
mm/madvise: split out populate behavior check logic
madvise_do_behavior() has a long open-coded 'behavior' check for
MADV_POPULATE_{READ,WRITE}. It adds multiple layers[1] and make the code
arguably take longer time to read. Like is_memory_failure(), split out
the check to a separate function. This is not technically removing the
additional layer but discourage further extending the switch-case. Also
it makes madvise_do_behavior() code shorter and therefore easier to read.
[1] https://lore.kernel.org/bd6d0bf1-c79e-46bd-a810-9791efb9ad73@lucifer.local
Link: https://lkml.kernel.org/r/20250312164750.59215-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam R. Howlett <howlett@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
9ecd2f839b |
mm/madvise: use is_memory_failure() from madvise_do_behavior()
Patch series "mm/madvise: cleanup requests validations and classifications".
Cleanup madvise entry level code for cleaner request validations and
classifications.
This patch (of 4):
To reduce redundant open-coded checks of CONFIG_MEMORY_FAILURE and
MADV_{HWPOISON,SOFT_OFFLINE} in madvise_[un]lock(), is_memory_failure() is
introduced. madvise_do_behavior() is still doing the same open-coded
check, though. Use is_memory_failure() instead.
To avoid build failure on !CONFIG_MEMORY_FAILURE case, implement an empty
madvise_inject_error() under the config. Also move the definition of
is_memory_failure() inside #ifdef CONFIG_MEMORY_FAILURE clause for
madvise_inject_error() definition, to reduce duplicated ifdef clauses.
Link: https://lkml.kernel.org/r/20250312164750.59215-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250312164750.59215-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam R. Howlett <howlett@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
15766485e4 |
mm/page_alloc: add trace event for totalreserve_pages calculation
This commit introduces a new trace event, `mm_calculate_totalreserve_pages`, which reports the new reserve value at the exact time when it takes effect. The `totalreserve_pages` value represents the total amount of memory reserved across all zones and nodes in the system. This reserved memory is crucial for ensuring that critical kernel operations have access to sufficient memory, even under memory pressure. By tracing the `totalreserve_pages` value, developers can gain insights that how the total reserved memory changes over time. Link: https://lkml.kernel.org/r/20250308034606.2036033-4-liumartin@google.com Signed-off-by: Martin Liu <liumartin@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
a293aba4a5 |
mm/page_alloc: add trace event for per-zone lowmem reserve setup
This commit introduces the `mm_setup_per_zone_lowmem_reserve` trace event,which provides detailed insights into the kernel's per-zone lowmem reserve configuration. The trace event provides precise timestamps, allowing developers to 1. Correlate lowmem reserve changes with specific kernel events and able to diagnose unexpected kswapd or direct reclaim behavior triggered by dynamic changes in lowmem reserve. 2. Know memory allocation failures that occur due to insufficient lowmem reserve, by precisely correlating allocation attempts with reserve adjustments. Link: https://lkml.kernel.org/r/20250308034606.2036033-3-liumartin@google.com Signed-off-by: Martin Liu <liumartin@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
8c02048d1c |
mm/page_alloc: add trace event for per-zone watermark setup
Patch series "Add tracepoints for lowmem reserves, watermarks and totalreserve_pages", v2. This patchset introduces tracepoints to track changes in the lowmem reserves, watermarks and totalreserve_pages. This helps to track the exact timing of such changes and understand their relation to reclaim activities. The tracepoints added are: mm_setup_per_zone_lowmem_reserve mm_setup_per_zone_wmarks mm_calculate_totalreserve_pagesi This patch (of 3): This commit introduces the `mm_setup_per_zone_wmarks` trace event, which provides detailed insights into the kernel's per-zone watermark configuration, offering precise timing and the ability to correlate watermark changes with specific kernel events. While `/proc/zoneinfo` provides some information about zone watermarks, this trace event offers: 1. The ability to link watermark changes to specific kernel events and logic. 2. The ability to capture rapid or short-lived changes in watermarks that may be missed by user-space polling 3. Diagnosing unexpected kswapd activity or excessive direct reclaim triggered by rapidly changing watermarks. Link: https://lkml.kernel.org/r/20250308034606.2036033-1-liumartin@google.com Link: https://lkml.kernel.org/r/20250308034606.2036033-2-liumartin@google.com Signed-off-by: Martin Liu <liumartin@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Martin Liu <liumartin@google.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
116eb46895 |
mm/shmem: fix functions documentation
Add missing parenthesis in @name parameter description. Link: https://lkml.kernel.org/r/20250310112535.84754-1-enrico.bravi@polito.it Signed-off-by: Enrico Bravi <enrico.bravi@polito.it> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
5d89666bd9 |
mm: use ptep_get() instead of directly dereferencing pte_t*
It is best practice for all pte accesses to go via the arch helpers, to ensure non-torn values and to allow the arch to intervene where needed (contpte for arm64 for example). While in this case it was probably safe to directly dereference, let's tidy it up for consistency. Link: https://lkml.kernel.org/r/20250310140418.1737409-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
1a24776fca |
drivers/base/memory: correct the field name in the header
Replace @blocks with @memory_blocks to match with the definition of struct memory_group. Link: https://lkml.kernel.org/r/20250311233045.148943-3-gshan@redhat.com Signed-off-by: Gavin Shan <gshan@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
61659efdb3 |
drivers/base/memory: improve add_boot_memory_block()
Patch series "drivers/base/memory: Two cleanups", v3. Two cleanups to drivers/base/memory. This patch (of 2)L It's unnecessary to count the present sections for the specified block since the block will be added if any section in the block is present. Besides, for_each_present_section_nr() can be reused as Andrew Morton suggested. Improve by using for_each_present_section_nr() and dropping the unnecessary @section_count. No functional changes intended. Link: https://lkml.kernel.org/r/20250311233045.148943-1-gshan@redhat.com Link: https://lkml.kernel.org/r/20250311233045.148943-2-gshan@redhat.com Signed-off-by: Gavin Shan <gshan@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
c637c61c9e |
mm/damon/sysfs-schemes: avoid Wformat-security warning on damon_sysfs_access_pattern_add_range_dir()
When -Wformat-security is given, compiler warns as a potential security
issue on damon_sysfs_access_pattern_add_range_dir() as below:
mm/damon/sysfs-schemes.c: In function `damon_sysfs_access_pattern_add_range_dir':
mm/damon/sysfs-schemes.c:1503:25: warning: format not a string literal and no format arguments [-Wformat-security]
1503 | &access_pattern->kobj, name);
| ^
Fix it by using "%s" as the format and the name as the argument.
Link: https://lkml.kernel.org/r/20250310165009.652491-1-sj@kernel.org
Fixes:
|
||
|
|
d53c78fffe |
mm/shmem: use xas_try_split() in shmem_split_large_entry()
During shmem_split_large_entry(), large swap entries are covering n slots and an order-0 folio needs to be inserted. Instead of splitting all n slots, only the 1 slot covered by the folio need to be split and the remaining n-1 shadow entries can be retained with orders ranging from 0 to n-1. This method only requires (n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) * (n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original xas_split_alloc() + xas_split() one. For example, to split an order-9 large swap entry (assuming XA_CHUNK_SHIFT is 6), 1 xa_node is needed instead of 8. xas_try_split_min_order() is used to reduce the number of calls to xas_try_split() during split. Link: https://lkml.kernel.org/r/20250314222113.711703-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Mattew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
200a89c159 |
mm/filemap: use xas_try_split() in __filemap_add_folio()
Patch series "Minimize xa_node allocation during xarry split", v3.
When splitting a multi-index entry in XArray from order-n to order-m,
existing xas_split_alloc()+xas_split() approach requires 2^(n %
XA_CHUNK_SHIFT) xa_node allocations. But its callers,
__filemap_add_folio() and shmem_split_large_entry(), use at most 1
xa_node. To minimize xa_node allocation and remove the limitation of no
split from order-12 (or above) to order-0 (or anything between 0 and
5)[1], xas_try_split() was added[2], which allocates (n / XA_CHUNK_SHIFT -
m / XA_CHUNK_SHIFT) xa_node. It is used for non-uniform folio split, but
can be used by __filemap_add_folio() and shmem_split_large_entry().
xas_split_alloc() and xas_split() split an order-9 to order-0:
---------------------------------
| | | | | | | | |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| | | | | | | | |
---------------------------------
| | | |
------- --- --- -------
| | ... | |
V V V V
----------- ----------- ----------- -----------
| xa_node | | xa_node | ... | xa_node | | xa_node |
----------- ----------- ----------- -----------
xas_try_split() splits an order-9 to order-0:
---------------------------------
| | | | | | | | |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| | | | | | | | |
---------------------------------
|
|
V
-----------
| xa_node |
-----------
xas_try_split() is designed to be called iteratively with n = m + 1.
xas_try_split_mini_order() is added to minmize the number of calls to
xas_try_split() by telling the caller the next minimal order to split to
instead of n - 1. Splitting order-n to order-m when m= l * XA_CHUNK_SHIFT
does not require xa_node allocation and requires 1 xa_node when n=l *
XA_CHUNK_SHIFT and m = n - 1, so it is OK to use xas_try_split() with n >
m + 1 when no new xa_node is needed.
xfstests quick group test passed on xfs and tmpfs.
[1] https://lore.kernel.org/linux-mm/Z6YX3RznGLUD07Ao@casper.infradead.org/
[2] https://lore.kernel.org/linux-mm/20250226210032.2044041-1-ziy@nvidia.com/
This patch (of 2):
During __filemap_add_folio(), a shadow entry is covering n slots and a
folio covers m slots with m < n is to be added. Instead of splitting all
n slots, only the m slots covered by the folio need to be split and the
remaining n-m shadow entries can be retained with orders ranging from m to
n-1. This method only requires
(n/XA_CHUNK_SHIFT) - (m/XA_CHUNK_SHIFT)
new xa_nodes instead of
(n % XA_CHUNK_SHIFT) * ((n/XA_CHUNK_SHIFT) - (m/XA_CHUNK_SHIFT))
new xa_nodes, compared to the original xas_split_alloc() + xas_split()
one. For example, to insert an order-0 folio when an order-9 shadow entry
is present (assuming XA_CHUNK_SHIFT is 6), 1 xa_node is needed instead of
8.
xas_try_split_min_order() is introduced to reduce the number of calls to
xas_try_split() during split.
Link: https://lkml.kernel.org/r/20250314222113.711703-1-ziy@nvidia.com
Link: https://lkml.kernel.org/r/20250314222113.711703-2-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mattew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
80a5c494c8 |
selftests/mm: add tests for folio_split(), buddy allocator like split
It splits page cache folios to orders from 0 to 8 at different in-folio offset. Link: https://lkml.kernel.org/r/20250307174001.242794-9-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
7460b470a1 |
mm/truncate: use folio_split() in truncate operation
Instead of splitting the large folio uniformly during truncation, try to use buddy allocator like folio_split() at the start and the end of a truncation range to minimize the number of resulting folios if it is supported. try_folio_split() is introduced to use folio_split() if supported and it falls back to uniform split otherwise. For example, to truncate a order-4 folio [0, 1, 2, 3, 4, 5, ..., 15] between [3, 10] (inclusive), folio_split() splits the folio at 3 to [0,1], [2], [3], [4..7], [8..15] and [3], [4..7] can be dropped and [8..15] is kept with zeros in [8..10], then another folio_split() is done at 10, so [8..10] can be dropped. One possible optimization is to make folio_split() to split a folio based on a given range, like [3..10] above. But that complicates folio_split(), so it will be investigated when necessary. Link: https://lkml.kernel.org/r/20250226210032.2044041-8-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250307174001.242794-8-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
4b94c18d15 |
mm/huge_memory: add folio_split() to debugfs testing interface
This allows to test folio_split() by specifying an additional in folio page offset parameter to split_huge_page debugfs interface. Link: https://lkml.kernel.org/r/20250307174001.242794-7-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
1f43d5aa24 |
mm/huge_memory: remove the old, unused __split_huge_page()
Now split_huge_page_to_list_to_order() uses the new backend split code in __split_unmapped_folio(), the old __split_huge_page() and __split_huge_page_tail() can be removed. Link: https://lkml.kernel.org/r/20250307174001.242794-6-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
58729c04cf |
mm/huge_memory: add buddy allocator like (non-uniform) folio_split()
folio_split() splits a large folio in the same way as buddy allocator splits a large free page for allocation. The purpose is to minimize the number of folios after the split. For example, if user wants to free the 3rd subpage in a order-9 folio, folio_split() will split the order-9 folio as: O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon, since anon folio does not support order-1 yet. ----------------------------------------------------------------- | | | | | | | | | |O-0|O-0|O-0|O-0| O-2 |...| O-7 | O-8 | | | | | | | | | | ----------------------------------------------------------------- O-1, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-9 if it is pagecache --------------------------------------------------------------- | | | | | | | | | O-1 |O-0|O-0| O-2 |...| O-7 | O-8 | | | | | | | | | --------------------------------------------------------------- It generates fewer folios (i.e., 11 or 10) than existing page split approach, which splits the order-9 to 512 order-0 folios. It also reduces the number of new xa_node needed during a pagecache folio split from 8 to 1, potentially decreasing the folio split failure rate due to memory constraints. folio_split() and existing split_huge_page_to_list_to_order() share the folio unmapping and remapping code in __folio_split() and the common backend split code in __split_unmapped_folio() using uniform_split variable to distinguish their operations. uniform_split_supported() and non_uniform_split_supported() are added to factor out check code and will be used outside __folio_split() in the following commit. Link: https://lkml.kernel.org/r/20250307174001.242794-5-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
6384dd1d18 |
mm/huge_memory: move folio split common code to __folio_split()
This is a preparation patch for folio_split(). In the upcoming patch folio_split() will share folio unmapping and remapping code with split_huge_page_to_list_to_order(), so move the code to a common function __folio_split() first. Add a TODO for splitting large shmem folio in swap cache. Link: https://lkml.kernel.org/r/20250307174001.242794-4-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
00527733d0 |
mm/huge_memory: add two new (not yet used) functions for folio_split()
This is a preparation patch, both added functions are not used yet. The added __split_unmapped_folio() is able to split a folio with its mapping removed in two manners: 1) uniform split (the existing way), and 2) buddy allocator like (or non-uniform) split. The added __split_folio_to_order() can split a folio into any lower order. For uniform split, __split_unmapped_folio() calls it once to split the given folio to the new order. For buddy allocator like (non-uniform) split, __split_unmapped_folio() calls it (folio_order - new_order) times and each time splits the folio containing the given page to one lower order. [ziy@nvidia.com: unfreeze head folio after page cache entries are updated] Link: https://lkml.kernel.org/r/0F15DA7F-1977-412F-9A3E-F06B515D4BD2@nvidia.com [ziy@nvidia.com: use NULL instead of 0 for folio->private assignment] Link: https://lkml.kernel.org/r/1E11B9DD-3A87-4C9C-8FB4-E1324FB6A21A@nvidia.com Link: https://lkml.kernel.org/r/20250307174001.242794-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
3fec86f8aa |
xarray: add xas_try_split() to split a multi-index entry
Patch series "Buddy allocator like (or non-uniform) folio split", v10.
This patchset adds a new buddy allocator like (or non-uniform) large folio
split from a order-n folio to order-m with m < n. It reduces
1. the total number of after-split folios from 2^(n-m) to n-m+1;
2. the amount of memory needed for multi-index xarray split from 2^(n/6-m/6) to
n/6-m/6, assuming XA_CHUNK_SHIFT=6;
3. keep more large folios after a split from all order-m folios to
order-(n-1) to order-m folios.
For example, to split an order-9 to order-0, folio split generates 10 (or
11 for anonymous memory) folios instead of 512, allocates 1 xa_node
instead of 8, and leaves 1 order-8, 1 order-7, ..., 1 order-1 and 2
order-0 folios (or 4 order-0 for anonymous memory) instead of 512 order-0
folios.
Instead of duplicating existing split_huge_page*() code, __folio_split()
is introduced as the shared backend code for both
split_huge_page_to_list_to_order() and folio_split(). __folio_split() can
support both uniform split and buddy allocator like (or non-uniform)
split. All existing split_huge_page*() users can be gradually converted
to use folio_split() if possible. In this patchset, I converted
truncate_inode_partial_folio() to use folio_split().
xfstests quick group passed for both tmpfs and xfs. I also
semi-replicated Hugh's test[12] and ran it without any issue for almost 24
hours.
This patch (of 8):
A preparation patch for non-uniform folio split, which always split a
folio into half iteratively, and minimal xarray entry split.
Currently, xas_split_alloc() and xas_split() always split all slots from a
multi-index entry. They cost the same number of xa_node as the
to-be-split slots. For example, to split an order-9 entry, which takes
2^(9-6)=8 slots, assuming XA_CHUNK_SHIFT is 6 (!CONFIG_BASE_SMALL), 8
xa_node are needed. Instead xas_try_split() is intended to be used
iteratively to split the order-9 entry into 2 order-8 entries, then split
one order-8 entry, based on the given index, to 2 order-7 entries, ...,
and split one order-1 entry to 2 order-0 entries. When splitting the
order-6 entry and a new xa_node is needed, xas_try_split() will try to
allocate one if possible. As a result, xas_try_split() would only need 1
xa_node instead of 8.
When a new xa_node is needed during the split, xas_try_split() can try to
allocate one but no more. -ENOMEM will be return if a node cannot be
allocated. -EINVAL will be return if a sibling node is split or cascade
split happens, where two or more new nodes are needed, and these are not
supported by xas_try_split().
xas_split_alloc() and xas_split() split an order-9 to order-0:
---------------------------------
| | | | | | | | |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| | | | | | | | |
---------------------------------
| | | |
------- --- --- -------
| | ... | |
V V V V
----------- ----------- ----------- -----------
| xa_node | | xa_node | ... | xa_node | | xa_node |
----------- ----------- ----------- -----------
xas_try_split() splits an order-9 to order-0:
---------------------------------
| | | | | | | | |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| | | | | | | | |
---------------------------------
|
|
V
-----------
| xa_node |
-----------
Link: https://lkml.kernel.org/r/20250307174001.242794-1-ziy@nvidia.com
Link: https://lkml.kernel.org/r/20250307174001.242794-2-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
fcc09f5b56 |
hugetlb: convert adjust_range_hwpoison() to take a folio
Remove a use of folio->page by passing the folio into adjust_range_hwpoison(). We need to convert to a page eventually, but that can happen inside adjust_range_hwpoison(). Link: https://lkml.kernel.org/r/20250226163131.3795869-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
fa17ad58f8 |
hugetlb: convert hugetlb_vma_maps_page() to hugetlb_vma_maps_pfn()
pte_page() is more expensive than pte_pfn() (often it's defined as pfn_to_page(pte_pfn())), so it makes sense to do the conversion to pfn once (by calling folio_pfn()) rather than convert the pfn to a page each time. While this is a very small advantage, the main motivation is removing a reference to folio->page. Link: https://lkml.kernel.org/r/20250226163131.3795869-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
d9a04a2615 |
mm: swap_cgroup: remove double initialization of locals
Fixes:
|
||
|
|
f0e11a997a |
mm/vmalloc: refactor __vmalloc_node_range_noprof()
According to the code logic, the first parameter of the sub-function __get_vm_area_node() should be size instead of real_size. Then in __get_vm_area_node(), the size will be aligned, so the redundant alignment operation is deleted. The use of the real_size variable causes code redundancy, so it is removed to simplify the code. The real prefix is generally used to indicate the adjusted value of a parameter, but according to the code logic, it should indicate the original value, so it is recommended to rename it to original_align. Link: https://lkml.kernel.org/r/20250306072131.800499-1-liuye@kylinos.cn Signed-off-by: Liu Ye <liuye@kylinos.cn> Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Christop Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |