Commit Graph

1428889 Commits

Author SHA1 Message Date
JP Kobryn (Meta)
e4f4fc7aa8 mm: move pgscan, pgsteal, pgrefill to node stats
There are situations where reclaim kicks in on a system with free memory. 
One possible cause is a NUMA imbalance scenario where one or more nodes
are under pressure.  It would help if we could easily identify such nodes.

Move the pgscan, pgsteal, and pgrefill counters from vm_event_item to
node_stat_item to provide per-node reclaim visibility.  With these
counters as node stats, the values are now displayed in the per-node
section of /proc/zoneinfo, which allows for quick identification of the
affected nodes.

/proc/vmstat continues to report the same counters, aggregated across all
nodes.  But the ordering of these items within the readout changes as they
move from the vm events section to the node stats section.

Memcg accounting of these counters is preserved.  The relocated counters
remain visible in memory.stat alongside the existing aggregate pgscan and
pgsteal counters.

However, this change affects how the global counters are accumulated. 
Previously, the global event count update was gated on !cgroup_reclaim(),
excluding memcg-based reclaim from /proc/vmstat.  Now that
mod_lruvec_state() is being used to update the counters, the global
counters will include all reclaim.  This is consistent with how pgdemote
counters are already tracked.

Finally, the virtio_balloon driver is updated to use
global_node_page_state() to fetch the counters, as they are no longer
accessible through the vm_events array.

Link: https://lkml.kernel.org/r/20260219235846.161910-1-jp.kobryn@linux.dev
Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:58 -07:00
AnishMulay
54218f10df selftests/mm: skip migration tests if NUMA is unavailable
Currently, the migration test asserts that numa_available() returns 0.  On
systems where NUMA is not available (returning -1), such as certain ARM64
configurations or single-node systems, this assertion fails and crashes
the test.

Update the test to check the return value of numa_available().  If it is
less than 0, skip the test gracefully instead of failing.

This aligns the behavior with other MM selftests (like rmap) that skip
when NUMA support is missing.

Link: https://lkml.kernel.org/r/20260218163941.13499-1-anishm7030@gmail.com
Fixes: 0c2d087284 ("mm: add selftests for migration entries")
Signed-off-by: AnishMulay <anishm7030@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Tested-by: Sayali Patil <sayalip@linux.ibm.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:57 -07:00
Seongsu Park
3d443691ed mm/pkeys: remove unused tsk parameter from arch_set_user_pkey_access()
The tsk parameter in arch_set_user_pkey_access() is never used in the
function implementations across all architectures (arm64, powerpc, x86).

Link: https://lkml.kernel.org/r/20260219063506.545148-1-sgsu.park@samsung.com
Signed-off-by: Seongsu Park <sgsu.park@samsung.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:57 -07:00
Liam R. Howlett
0e8cf9a31a maple_tree: clean up mas_wr_node_store()
The new_end does not need to be passed in as the data is already being
checked.  This allows for other areas to skip getting the node new_end in
the calling function.

The type was incorrectly void * instead of void __rcu *, which isn't an
issue but is technically incorrect.

Move the variable assignment to after the declarations to clean up the
initial setup.

Ensure there is something to copy before calling memcpy().

Link: https://lkml.kernel.org/r/20260130205935.2559335-31-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:57 -07:00
Liam R. Howlett
b82f4c811e maple_tree: don't pass end to mas_wr_append()
Figure out the end internally.  This is necessary for future cleanups.

Link: https://lkml.kernel.org/r/20260130205935.2559335-30-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:57 -07:00
Liam R. Howlett
2969241fa2 maple_tree: pass maple copy node to mas_wmb_replace()
mas_wmb_replace() is called in three places with the same setup, move the
setup into the function itself.  The function needs to be relocated as it
calls mtree_range_walk().

Link: https://lkml.kernel.org/r/20260130205935.2559335-29-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:57 -07:00
Liam R. Howlett
b8852ef30c maple_tree: remove maple big node and subtree structs
Now that no one uses the structures and functions, drop the dead code.

Link: https://lkml.kernel.org/r/20260130205935.2559335-28-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:57 -07:00
Liam R. Howlett
280b792cac maple_tree: use maple copy node for mas_wr_split()
Instead of using the maple big node, use the maple copy node for reduced
stack usage and aligning with mas_wr_rebalance() and
mas_wr_spanning_store().

Splitting a node is similar to rebalancing, but a new evaluation of when
to ascend is needed.  The only other difference is that the data is pushed
and never rebalanced at each level.

The testing must also align with the changes to this commit to ensure the
test suite continues to pass.

Link: https://lkml.kernel.org/r/20260130205935.2559335-27-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:56 -07:00
Liam R. Howlett
11e7f22f5e maple_tree: add cp_converged() helper
When the maple copy node converges into a single entry, then certain
operations can stop ascending the tree.

This is used more later.

Link: https://lkml.kernel.org/r/20260130205935.2559335-26-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:56 -07:00
Liam R. Howlett
0abff20819 maple_tree: add copy_tree_location() helper
Extract the copying of the tree location from one maple state to another
into its own function.  This is used more later.

Link: https://lkml.kernel.org/r/20260130205935.2559335-25-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:56 -07:00
Liam R. Howlett
ebfee00c0b maple_tree: add test for rebalance calculation off-by-one
During the big node removal, an incorrect rebalance step went too far up
the tree causing insufficient nodes.  Test the faulty condition by
recreating the scenario in the userspace testing.

Link: https://lkml.kernel.org/r/20260130205935.2559335-24-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:56 -07:00
Liam R. Howlett
971f0db159 maple_tree: use maple copy node for mas_wr_rebalance() operation
Stop using the maple big node for rebalance operations by changing to more
align with spanning store.  The rebalance operation needs its own data
calculation in rebalance_data().

In the event of too much data, the rebalance tries to push the data using
push_data_sib().  If there is insufficient data, the rebalance operation
will rebalance against a sibling (found with rebalance_sib()).

The rebalance starts at the leaf and works its way upward in the tree
using rebalance_ascend().  Most of the code is shared with spanning store
such as the copy node having a new root, but is fundamentally different in
that the data must come from a sibling.

A parent maple state is used to track the parent location to avoid
multiple mas_ascend() calls.  The maple state tree location is copied from
the parent to the mas (child) in the ascend step.  Ascending itself is
done in the main loop.

Link: https://lkml.kernel.org/r/20260130205935.2559335-23-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:56 -07:00
Liam R. Howlett
b00a1804e6 maple_tree: add cp_is_new_root() helper
Add a helper to do what is needed when the maple copy node contains a new
root node.  This is useful for future commits and is self-documenting
code.

[Liam.Howlett@oracle.com: remove warnings on older compilers]
  Link: https://lkml.kernel.org/r/malwmirqnpuxqkqrobcmzfkmmxipoyzwfs2nwc5fbpxlt2r2ej@wchmjtaljvw3
[akpm@linux-foundation.org: s/cp->slot[0]/&cp->slot[0]/, per Liam]
Link: https://lkml.kernel.org/r/20260130205935.2559335-22-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:56 -07:00
Liam R. Howlett
62e9d349af maple_tree: separate wr_split_store and wr_rebalance store type code path
The split and rebalance store types both go through the same function that
uses the big node.  Separate the code paths so that each can be updated
independently.

No functional change intended

Link: https://lkml.kernel.org/r/20260130205935.2559335-21-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:56 -07:00
Liam R. Howlett
448ec8c0a4 maple_tree: remove unnecessary return statements
Functions do not need to state return at the end, unless skipping unwind. 
These can safely be dropped.

Link: https://lkml.kernel.org/r/20260130205935.2559335-20-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:55 -07:00
Liam R. Howlett
3578d61c1c maple_tree: inline mas_wr_spanning_rebalance()
Now that the spanning rebalance is small, fully inline it in
mas_wr_spanning_store().

No functional change.

Link: https://lkml.kernel.org/r/20260130205935.2559335-19-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:55 -07:00
Liam R. Howlett
a9c6716e08 maple_tree: start using maple copy node for destination
Stop using the maple subtree state and big node in favour of using three
destinations in the maple copy node.  That is, expand the way leaves were
handled to all levels of the tree and use the maple copy node to track the
new nodes.

Extract out the sibling init into the data calculation since this is where
the insufficient data can be detected.  The remainder of the sibling code
to shift the next iteration is moved to the spanning_ascend() function,
since it is not always needed.

Next introduce the dst_setup() function which will decide how many nodes
are needed to contain the data at this level.  Using the destination
count, populate the copy node's dst array with the new nodes and set
d_count to the correct value.  Note that this can be tricky in the case of
a leaf node with exactly enough room because of the rule against NULLs at
the end of leaves.

Once the destinations are ready, copy the data by altering the
cp_data_write() function to copy from the sources to the destinations
directly.  This eliminates the use of the big node in this code path.  On
node completion, node_finalise() will zero out the remaining area and set
the metadata, if necessary.

spanning_ascend() is used to decide if the operation is complete.  It may
create a new root, converge into one destination, or continue upwards by
ascending the left and right write maple states.

One test case setup needed to be tweaked so that the targeted node was
surrounded by full nodes.

[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20260130205935.2559335-18-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:55 -07:00
Liam R. Howlett
20b20162e1 maple_tree: add gap support, slot and pivot sizes for maple copy
Add plumbing work for using maple copy as a normal node for a source of
copy operations.  This is needed later.

Link: https://lkml.kernel.org/r/20260130205935.2559335-17-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:55 -07:00
Liam R. Howlett
de7f3ed37c maple_tree: introduce ma_leaf_max_gap()
This is the same as mas_leaf_max_gap(), but the information necessary is
known without a maple state in future code.  Adding this function now
simplifies the review for a subsequent patch.

Link: https://lkml.kernel.org/r/20260130205935.2559335-16-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:55 -07:00
Liam R. Howlett
6953038cab maple_tree: change initial big node setup in mas_wr_spanning_rebalance()
Instead of copying the data into the big node and finding out that the
data may need to be moved or appended to, calculate the data space up
front (in the maple copy node) and set up another source for the copy.

The additional copy source is tracked in the maple state sib (short for
sibling), and is put into the maple write states for future operations
after the data is in the big node.

To facilitate the newly moved node, some initial setup of the maple
subtree state are relocated after the potential shift caused by the new
way of rebalancing against a sibling.

Link: https://lkml.kernel.org/r/20260130205935.2559335-15-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:55 -07:00
Liam R. Howlett
f141d56643 maple_tree: inline mas_spanning_rebalance_loop() into mas_wr_spanning_rebalance()
Just copy the code and replace count with height.  This is done to avoid
affecting other code paths into mas_spanning_rebalance_loop() for the next
change.

No functional change intended.

Link: https://lkml.kernel.org/r/20260130205935.2559335-14-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:55 -07:00
Liam R. Howlett
b14ffd2c6c maple_tree: testing update for spanning store
Spanning store had some corner cases which showed up during rcu stress
testing.  Add explicit tests for those cases.

At the same time add some locking for easier visibility of the rcu stress
testing.  Only a single dump of the tree will happen on the first detected
issue instead of flooding the console with output.

Link: https://lkml.kernel.org/r/20260130205935.2559335-13-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:54 -07:00
Liam R. Howlett
9ec1e972c3 maple_tree: introduce maple_copy node and use it in mas_spanning_rebalance()
Introduce an internal-memory only node type called maple_copy to
facilitate internal copy operations.  Use it in mas_spanning_rebalance()
for just the leaf nodes.  Initially, the maple_copy node is used to
configure the source nodes and copy the data into the big_node.

The maple_copy contains a list of source entries with start and end
offsets.  One of the maple_copy entries can be itself with an offset of 0
to 2, representing the data where the store partially overwrites entries,
or fully overwrites the entry.  The side effect is that the source nodes
no longer have to worry about partially copying the existing offset if it
is not fully overwritten.

This is in preparation of removal of the maple big_node, but for the time
being the data is copied to the big node to limit the change size.

Link: https://lkml.kernel.org/r/20260130205935.2559335-12-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:54 -07:00
Liam R. Howlett
6b74d44b62 maple_tree: correct right ma_wr_state end pivot in mas_wr_spanning_store()
The end_piv will be needed in the next patch set and has not been set
correctly in this code path.  Correct the oversight before using it.

Link: https://lkml.kernel.org/r/20260130205935.2559335-11-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:54 -07:00
Liam R. Howlett
2fce1c3c47 maple_tree: move maple_subtree_state from mas_wr_spanning_store to mas_wr_spanning_rebalance
Moving the maple_subtree_state is necessary for future cleanups and is
only set up in mas_wr_spanning_rebalance() but never used.

Link: https://lkml.kernel.org/r/20260130205935.2559335-10-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:54 -07:00
Liam R. Howlett
3680159527 maple_tree: don't pass through height in mas_wr_spanning_store
Height is not used locally in the function, so call the height argument
closer to where it is passed in the next level.

Link: https://lkml.kernel.org/r/20260130205935.2559335-9-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:54 -07:00
Liam R. Howlett
41bcc348f2 maple_tree: remove l_wr_mas from mas_wr_spanning_rebalance
Use the wr_mas instead of creating another variable on the stack.  Take
the opportunity to remove l_mas from being used anywhere but in the
maple_subtree_state.

Link: https://lkml.kernel.org/r/20260130205935.2559335-8-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:54 -07:00
Liam R. Howlett
3dd3dbaac1 maple_tree: make ma_wr_states reliable for reuse in spanning store
mas_extend_spanning_null() was not modifying the range min and range max
of the resulting store operation.  The result was that the maple write
state no longer matched what the write was doing.  This was not an issue
as the values were previously not used, but to make the ma_wr_state usable
in future changes, the range min/max stored in the ma_wr_state for left
and right need to be consistent with the operation.

Link: https://lkml.kernel.org/r/20260130205935.2559335-7-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:54 -07:00
Liam R. Howlett
6f2e522186 maple_tree: inline mas_spanning_rebalance() into mas_wr_spanning_rebalance()
Copy the contents of mas_spanning_rebalance() into
mas_wr_spanning_rebalance(), in preparation of removing initial big node
use.

No functional changes intended.

Link: https://lkml.kernel.org/r/20260130205935.2559335-6-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:53 -07:00
Liam R. Howlett
a2ac9935d3 maple_tree: remove unnecessary assignment of orig_l index
The index value is already a copy of the maple state so there is no need
to set it again.

Link: https://lkml.kernel.org/r/20260130205935.2559335-5-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:53 -07:00
Liam R. Howlett
df11f9ee8f maple_tree: extract use of big node from mas_wr_spanning_store()
Isolate big node to use in its own function.

No functional changes intended.

Link: https://lkml.kernel.org/r/20260130205935.2559335-4-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:53 -07:00
Liam R. Howlett
3e302560b9 maple_tree: move mas_spanning_rebalance loop to function
Move the loop over the tree levels to its own function.

No intended functional changes.

Link: https://lkml.kernel.org/r/20260130205935.2559335-3-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:53 -07:00
Liam R. Howlett
6884832472 maple_tree: fix mas_dup_alloc() sparse warning
Patch series "maple_tree: Replace big node with maple copy", v3.

The big node struct was created for simplicity of splitting, rebalancing,
and spanning store operations by using a copy buffer to create the data
necessary prior to breaking it up into 256B nodes.  Certain operations
were rather tricky due to the restriction of keeping NULL entries together
and never at the end of a node (except the right-most node).

The big node struct is incompatible with future features that are
currently in development.  Specifically different node types and different
data type sizes for pivots.

The big node struct was also a stack variable, which caused issues with
certain configurations of kernel build.

This series removes big node by introducing another node type which will
never be written to the tree: maple_copy.  The maple copy node operates
more like a scatter/gather operation with a number of sources and
destinations of allocated nodes.

The sources are copied to the destinations, in turn, until the sources are
exhausted.  The destination is changed if it is filled or the split
location is reached prior to the source data end.

New data is inserted by using the maple copy node itself as a source with
up to 3 slots and pivots.  The data in the maple copy node is the data
being written to the tree along with any fragment of the range(s) being
overwritten.

As with all nodes, the maple copy node is of size 256B.  Using a node type
allows for the copy operation to treat the new data stored in the maple
copy node the same as any other source node.

Analysis of the runtime shows no regression or benefit of removing the
larger stack structure.  The motivation is the ground work to use new node
types and to help those with odd configurations that have had issues.

The change was tested by myself using mm_tests on amd64 and by Suren on
android (arm64).  Limited testing on s390 qemu was also performed using
stress-ng on the virtual memory, which should cover many corner cases.


This patch (of 30):

Use RCU_INIT_POINTER to initialize an rcu pointer to an initial value
since there are no readers within the tree being created during
duplication.  There is no risk of readers seeing the initialized or
uninitialized value until after the synchronization call in
mas_dup_buld().

Link: https://lkml.kernel.org/r/20260130205935.2559335-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20260130205935.2559335-2-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:53 -07:00
Kevin Lourenco
0fd66c343c mm/fadvise: validate offset in generic_fadvise
When converted to (u64) for page calculations, a negative offset can
produce extremely large page indices.  This may lead to issues in certain
advice modes (excessive readahead or cache invalidation).

Reject negative offsets with -EINVAL for consistent argument validation
and to avoid silent misbehavior.

POSIX and the man page do not clearly define behavior for negative
offset/len.  FreeBSD rejects negative offsets as well, so failing with
-EINVAL is consistent with existing practice.  The man page can be updated
separately to document the Linux behavior.

Link: https://lkml.kernel.org/r/20260208135738.18992-1-klourencodev@gmail.com
Link: https://lkml.kernel.org/r/20251222141817.13335-1-klourencodev@gmail.com
Signed-off-by: Kevin Lourenco <k.lourenco@criteo.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:53 -07:00
xu xin
318d87b8fa ksm: initialize the addr only once in rmap_walk_ksm
This is a minor performance optimization, especially when there are many
for-loop iterations, because the addr variable doesn't change across
iterations.

Therefore, it only needs to be initialized once before the loop.

Link: https://lkml.kernel.org/r/20260212192820223O_r2NQzSEPG_C56cs-z4l@zte.com.cn
Link: https://lkml.kernel.org/r/20260212192932941MSsJEAyoRW4YdLBN7_myn@zte.com.cn
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Hugh Dickins <hughd@google.com>
Cc: Wang Yaxin <wang.yaxin@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:52:35 -07:00
Jiaqi Yan
34ca46cc6f fs: hugetlb: simplify remove_inode_hugepages() return type
When remove_inode_hugepages() was introduced in commit c86272287b
("hugetlb: create remove_inode_single_folio to remove single file folio")
it used to return a boolean to indicate if it bailed out due to race with
page faults.  However, since the race is already solved by [1],
remove_inode_hugepages() doesn't have any path to return false anymore.

Simplify remove_inode_hugepages() return type to void, remove the
unnecessary ret variable, and adjust the call site in
remove_inode_hugepages().  No functional change in this commit.

Link: https://lkml.kernel.org/r/20260204214741.3161520-1-jiaqiyan@google.com
Link: https://lore.kernel.org/all/20220914221810.95771-10-mike.kravetz@oracle.com [1]
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Suggested-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Acked-by: David Hildenbrand (arm) <david@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-24 14:38:26 -07:00
Altan Hacigumus
260d70819c mm/shrinker: fix refcount leak in shrink_slab_memcg()
When kmem is disabled for memcg, slab-backed shrinkers are skipped. 
However, shrink_slab_memcg() doesn't drop the reference acquired via
shrinker_try_get() before continuing.

Add the missing shrinker_put().

Also, since memcg_kmem_online() and shrinker flags cannot change
dynamically, remove the shrinker from the bitmap to avoid unnecessary
future scans.

Link: https://lkml.kernel.org/r/20260204033553.50039-1-ahacigu.linux@gmail.com
Fixes: 50d09da8e1 ("mm: shrinker: make memcg slab shrink lockless")
Signed-off-by: Altan Hacigumus <ahacigu.linux@gmail.com>
Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>
  Link: https://lore.kernel.org/r/20260203073757.135088-1-ahacigu.linux@gmail.com
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-24 14:38:26 -07:00
qinyu
b8a4b08838 mm/damon/ops-common: remove redudnant mmu notifier call in pmdp mkold
Currently, mmu_notifier_clear_young() is called immediately after
pmdp_clear_young_notify(), which already calls mmu_notifier_clear_young()
internally.  This results in a redundant notifier call.

Replace pmdp_clear_young_notify() with the non-notify variant to avoid the
duplicate call and make the pmdp path consistent with the corresponding
ptep_mkold() code.

Link: https://lkml.kernel.org/r/20260203095400.2465255-1-qin.yuA@h3c.com
Signed-off-by: qinyu <qin.yuA@h3c.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-24 14:38:25 -07:00
Shengming Hu
7e74dd0316 mm/page_alloc: avoid overcounting bulk alloc in watermark check
alloc_pages_bulk_noprof() only fills NULL slots and already tracks how
many entries are pre-populated via nr_populated.

The fast watermark check was adding nr_pages unconditionally, which can
overestimate the demand.  Use (nr_pages - nr_populated) instead, as an
upper bound on the remaining pages this call can still allocate without
scanning the whole array.

Link: https://lkml.kernel.org/r/tencent_F36C5B5FB4DED98C79D9BDEE1210CD338C06@qq.com
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-24 14:38:25 -07:00
Kairui Song
396f57b572 mm, swap: speed up hibernation allocation and writeout
Since commit 0ff67f990b ("mm, swap: remove swap slot cache"),
hibernation has been using the swap slot slow allocation path for
simplification, which turns out might cause regression for some devices
because the allocator now rotates clusters too often, leading to slower
allocation and more random distribution of data.

Fast allocation is not complex, so implement hibernation support as well.

Test result with Samsung SSD 830 Series (SATA II, 3.0 Gbps) shows the
performance is several times better [1]:
6.19:               324 seconds
After this series:  35 seconds

Link: https://lkml.kernel.org/r/20260216-hibernate-perf-v4-1-1ba9f0bf1ec9@tencent.com
Link: https://lore.kernel.org/linux-mm/8b4bdcfa-ce3f-4e23-839f-31367df7c18f@gmx.de/ [1]
Signed-off-by: Kairui Song <kasong@tencent.com>
Fixes: 0ff67f990b ("mm, swap: remove swap slot cache")
Reported-by: Carsten Grohmann <mail@carstengrohmann.de>
Closes: https://lore.kernel.org/linux-mm/20260206121151.dea3633d1f0ded7bbf49c22e@linux-foundation.org/
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-24 14:38:25 -07:00
Linus Torvalds
24f9515de8 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
 "ARM:

   - Clear the pending exception state from a vcpu coming out of reset,
     as it could otherwise affect the first instruction executed in the
     guest

   - Fix pointer arithmetic in address translation emulation, so that
     the Hardware Access bit is set on the correct PTE instead of some
     other location

  s390:

   - Fix deadlock in new memory management

   - Properly handle kernel faults on donated memory

   - Fix bounds checking for irq routing, with selftest

   - Fix invalid machine checks and log all of them"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: arm64: Fix the descriptor address in __kvm_at_swap_desc()
  KVM: s390: vsie: Avoid injecting machine check on signal
  KVM: s390: log machine checks more aggressively
  KVM: s390: selftests: Add IRQ routing address offset tests
  KVM: s390: Limit adapter indicator access to mapped page
  s390/mm: Add missing secure storage access fixups for donated memory
  KVM: arm64: Discard PC update state on vcpu reset
  KVM: s390: Fix a deadlock
2026-03-24 13:11:26 -07:00
Linus Torvalds
45f667ebb0 Merge tag 'cxl-fixes-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull Compute Express Link (CXL) fixes from Dave Jiang:

 - Adjust the startup priority of cxl_pmem to be higher than that of
   cxl_acpi

 - Use proper endpoint validity check upon sanitize

 - Avoid incorrect DVSEC fallback when HDM decoders are enabled

 - Fix CXL_ACPI and CXL_PMEM Kconfig tristate mismatch

 - Fix leakage in __construct_region()

 - Fix use after free of parent_port in cxl_detach_ep()

* tag 'cxl-fixes-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl: Adjust the startup priority of cxl_pmem to be higher than that of cxl_acpi
  cxl/mbox: Use proper endpoint validity check upon sanitize
  cxl/hdm: Avoid incorrect DVSEC fallback when HDM decoders are enabled
  cxl/acpi: Fix CXL_ACPI and CXL_PMEM Kconfig tristate mismatch
  cxl/region: Fix leakage in __construct_region()
  cxl/port: Fix use after free of parent_port in cxl_detach_ep()
2026-03-24 12:41:29 -07:00
Paolo Bonzini
52dad81e4b Merge tag 'kvmarm-fixes-7.0-4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 7.0, take #4

- Clear the pending exception state from a vcpu coming out of
  reset, as it could otherwise affect the first instruction
  executed in the guest.

- Fix the address translation emulation icode to set the Hardware
  Access bit on the correct PTE instead of some other location.
2026-03-24 17:32:30 +01:00
Paolo Bonzini
12fd965871 Merge tag 'kvm-s390-master-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: Fixes for 7.0

- fix deadlock in new memory management
- handle kernel faults on donated memory properly
- fix bounds checking for irq routing + selftest
- fix invalid machine checks + logging
2026-03-24 17:32:13 +01:00
Linus Torvalds
e3c33bc767 Merge tag 'mm-hotfixes-stable-2026-03-23-17-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM fixes from Andrew Morton:
 "6 hotfixes.  2 are cc:stable.  All are for MM.

  All are singletons - please see the changelogs for details"

* tag 'mm-hotfixes-stable-2026-03-23-17-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/damon/stat: monitor all System RAM resources
  mm/zswap: add missing kunmap_local()
  mailmap: update email address for Muhammad Usama Anjum
  zram: do not slot_free() written-back slots
  mm/damon/core: avoid use of half-online-committed context
  mm/rmap: clear vma->anon_vma on error
2026-03-24 09:12:45 -07:00
Linus Torvalds
26a01984dd Merge tag 'perf-tools-fixes-for-v7.0-2-2026-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf tools fixes from Arnaldo Carvalho de Melo:

 - Fix parsing 'overwrite' in command line event definitions in
   big-endian machines by writing correct union member

 - Fix finding default metric in 'perf stat'

 - Fix relative paths for including headers in 'perf kvm stat'

 - Sync header copies with the kernel sources: msr-index.h, kvm,
   build_bug.h

* tag 'perf-tools-fixes-for-v7.0-2-2026-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools:
  tools headers: Synchronize linux/build_bug.h with the kernel sources
  tools headers UAPI: Sync x86's asm/kvm.h with the kernel sources
  tools headers UAPI: Sync linux/kvm.h with the kernel sources
  tools arch x86: Sync the msr-index.h copy with the kernel sources
  perf kvm stat: Fix relative paths for including headers
  perf parse-events: Fix big-endian 'overwrite' by writing correct union member
  perf metricgroup: Fix metricgroup__has_metric_or_groups()
  tools headers: Skip arm64 cputype.h check
2026-03-24 08:58:38 -07:00
Linus Torvalds
97a48d1aab Merge tag 'media/v7.0-5' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:

 - rkvdec: fix stack usage with clang and improve handling missing
   short/long term RPS

 - synopsys: fix a Kconfig issue and an out-of-bounds check

 - verisilicon: Fix kernel panic due to __initconst misuse

 - media core: serialize REINIT and REQBUFS with req_queue_mutex

* tag 'media/v7.0-5' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  media: verisilicon: Fix kernel panic due to __initconst misuse
  media: rkvdec: reduce stack usage in rkvdec_init_v4l2_vp9_count_tbl()
  media: rkvdec: reduce excessive stack usage in assemble_hw_pps()
  media: rkvdec: Improve handling missing short/long term RPS
  media: mc, v4l2: serialize REINIT and REQBUFS with req_queue_mutex
  media: synopsys: csi2rx: add missing kconfig dependency
  media: synopsys: csi2rx: fix out-of-bounds check for formats array
2026-03-24 08:56:36 -07:00
Linus Torvalds
a0124352d5 Merge tag 'xsa482-7.0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
 "Restrict the xen privcmd driver in unprivileged domU to only allow
  hypercalls to target domain when using secure boot"

* tag 'xsa482-7.0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/privcmd: add boot control for restricted usage in domU
  xen/privcmd: restrict usage in unprivileged domU
2026-03-23 21:30:14 -07:00
SeongJae Park
84481e705a mm/damon/stat: monitor all System RAM resources
DAMON_STAT usage document (Documentation/admin-guide/mm/damon/stat.rst)
says it monitors the system's entire physical memory.  But, it is
monitoring only the biggest System RAM resource of the system.  When there
are multiple System RAM resources, this results in monitoring only an
unexpectedly small fraction of the physical memory.  For example, suppose
the system has a 500 GiB System RAM, 10 MiB non-System RAM, and 500 GiB
System RAM resources in order on the physical address space.  DAMON_STAT
will monitor only the first 500 GiB System RAM.  This situation is
particularly common on NUMA systems.

Select a physical address range that covers all System RAM areas of the
system, to fix this issue and make it work as documented.

[sj@kernel.org: return error if monitoring target region is invalid]
  Link: https://lkml.kernel.org/r/20260317053631.87907-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20260316235118.873-1-sj@kernel.org
Fixes: 369c415e60 ("mm/damon: introduce DAMON_STAT module")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org>	[6.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-23 09:35:05 -07:00
Lorenzo Stoakes (Oracle)
631c111150 mm/zswap: add missing kunmap_local()
Commit e2c3b6b21c ("mm: zswap: use SG list decompression APIs from
zsmalloc") updated zswap_decompress() to use the scatterwalk API to copy
data for uncompressed pages.

In doing so, it mapped kernel memory locally for 32-bit kernels using
kmap_local_folio(), however it never unmapped this memory.

This resulted in the linked syzbot report where a BUG_ON() is triggered
due to leaking the kmap slot.

This patch fixes the issue by explicitly unmapping the established kmap.


Also, add flush_dcache_folio() after the kunmap_local() call

I had assumed that a new folio here combined with the flush that is done at
the point of setting the PTE would suffice, but it doesn't seem that's
actually the case, as update_mmu_cache() will in many archtectures only
actually flush entries where a dcache flush was done on a range previously.

I had also wondered whether kunmap_local() might suffice, but it doesn't
seem to be the case.

Some arches do seem to actually dcache flush on unmap, parisc does it if
CONFIG_HIGHMEM is not set by setting ARCH_HAS_FLUSH_ON_KUNMAP and calling
kunmap_flush_on_unmap() from __kunmap_local(), otherwise non-CONFIG_HIGHMEM
callers do nothing here.

Otherwise arch_kmap_local_pre_unmap() is called which does:

* sparc - flush_cache_all()
* arm - if VIVT, __cpuc_flush_dcache_area()
* otherwise - nothing

Also arch_kmap_local_post_unmap() is called which does:

* arm - local_flush_tlb_kernel_page()
* csky - kmap_flush_tlb()
* microblaze, ppc - local_flush_tlb_page()
* mips - local_flush_tlb_one()
* sparc - flush_tlb_all() (again)
* x86 - arch_flush_lazy_mmu_mode()
* otherwise - nothing

But this is only if it's high memory, and doesn't cover all architectures,
so is presumably intended to handle other cache consistency concerns.

In any case, VIPT is problematic here whether low or high memory (in spite
of what the documentation claims, see [0] - 'the kernel did write to a page
that is in the page cache page and / or in high memory'), because dirty
cache lines may exist at the set indexed by the kernel direct mapping,
which won't exist in the set indexed by any subsequent userland mapping,
meaning userland might read stale data from L2 cache.

Even if the documentation is correct and low memory is fine not to be
flushed here, we can't be sure as to whether the memory is low or high
(kmap_local_folio() will be a no-op if low), and this call should be
harmless if it is low.

VIVT would require more work if the memory were shared and already mapped,
but this isn't the case here, and would anyway be handled by the dcache
flush call.

In any case, we definitely need this flush as far as I can tell.

And we should probably consider updating the documentation unless it turns
out there's somehow dcache synchronisation that happens for low
memory/64-bit kernels elsewhere?

[ljs@kernel.org: add flush_dcache_folio() after the kunmap_local() call]
  Link: https://lkml.kernel.org/r/13e09a99-181f-45ac-a18d-057faf94bccb@lucifer.local
Link: https://lkml.kernel.org/r/20260316140122.339697-1-ljs@kernel.org
Link: https://docs.kernel.org/core-api/cachetlb.html [0]
Fixes: e2c3b6b21c ("mm: zswap: use SG list decompression APIs from zsmalloc")
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reported-by: syzbot+fe426bef95363177631d@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69b75e2c.050a0220.12d28.015a.GAE@google.com
Acked-by: Yosry Ahmed <yosry@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Acked-by: Yosry Ahmed <yosry@kernel.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-23 09:35:05 -07:00