This patch is based on the commit 5da226dbfce3("mm: skip CMA pages when
they are not available") which skips cma pages reclaim when they are not
eligible for the current allocation context. In mglru, such pages are
added to the tail of the immediate generation to maintain better LRU
order, which is unlike the case of conventional LRU where such pages are
directly added to the head of the LRU list(akin to adding to head of the
youngest generation in mglru).
No observable issue without this patch on MGLRU, but logically it make
sense to skip the CMA page reclaim when those pages can't be satisfied for
the current allocation context.
Link: https://lkml.kernel.org/r/1691568344-13475-1-git-send-email-quic_charante@quicinc.com
Fixes: ac35a49023 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Radix vmemmap mapping can map things correctly at the PMD level or PTE
level based on different device boundary checks. Hence we skip the
restrictions w.r.t vmemmap size to be multiple of PMD_SIZE. This also
makes the feature widely useful because to use PMD_SIZE vmemmap area we
require a memory block size of 2GiB
We can also use MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY to that the feature can
work with a memory block size of 256MB. Using altmap.reserve feature to
align things correctly at pageblock granularity. We can end up losing
some pages in memory with this. For ex: with a 256MiB memory block size,
we require 4 pages to map vmemmap pages, In order to align things
correctly we end up adding a reserve of 28 pages. ie, for every 4096
pages 28 pages get reserved.
Link: https://lkml.kernel.org/r/20230808091501.287660-6-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, memmap_on_memory feature is only supported with memory block
sizes that result in vmemmap pages covering full page blocks. This is
because memory onlining/offlining code requires applicable ranges to be
pageblock-aligned, for example, to set the migratetypes properly.
This patch helps to lift that restriction by reserving more pages than
required for vmemmap space. This helps the start address to be page block
aligned with different memory block sizes. Using this facility implies
the kernel will be reserving some pages for every memoryblock. This
allows the memmap on memory feature to be widely useful with different
memory block size values.
For ex: with 64K page size and 256MiB memory block size, we require 4
pages to map vmemmap pages, To align things correctly we end up adding a
reserve of 28 pages. ie, for every 4096 pages 28 pages get reserved.
Link: https://lkml.kernel.org/r/20230808091501.287660-5-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
vma_prepare() is currently the central place where vmas are being locked
before vma_complete() applies changes to them. While this is convenient,
it also obscures vma locking and makes it harder to follow the locking
rules. Move vma locking out of vma_prepare() and take vma locks
explicitly at the locations where vmas are being modified. Move vma
locking and replace it with an assertion inside dup_anon_vma() to further
clarify the locking pattern inside vma_merge().
Link: https://lkml.kernel.org/r/20230804152724.3090321-7-surenb@google.com
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Implicit vma locking inside vm_flags_reset() and vm_flags_reset_once() is
not obvious and makes it hard to understand where vma locking is happening.
Also in some cases (like in dup_userfaultfd()) vma should be locked earlier
than vma_flags modification. To make locking more visible, change these
functions to assert that the vma write lock is taken and explicitly lock
the vma beforehand. Fix userfaultfd functions which should lock the vma
earlier.
Link: https://lkml.kernel.org/r/20230804152724.3090321-5-surenb@google.com
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
For system with kernelcore=mirror enabled while no mirrored memory is
reported by efi. This could lead to kernel OOM during startup since all
memory beside zone DMA are in the movable zone and this prevents the
kernel to use it.
Zone DMA/DMA32 initialization is independent of mirrored memory and their
max pfn is set in zone_sizes_init(). Since kernel can fallback to zone
DMA/DMA32 if there is no memory in zone Normal, these zones are seen as
mirrored memory no mather their memory attributes are.
To solve this problem, disable kernelcore=mirror when there is no real
mirrored memory exists.
Link: https://lkml.kernel.org/r/20230802072328.2107981-1-mawupeng1@huawei.com
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Suggested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Levi Yun <ppbuk5246@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
After 90ed667c03 ("Revert "Revert "mm/compaction: fix set skip in
fast_find_migrateblock"""), we remove skip set in fast_find_migrateblock.
Correct comment that fast_find_block is used to avoid isolation_suitable
check for pageblock returned from fast_find_migrateblock because
fast_find_migrateblock will mark found pageblock skipped.
Instead, comment that fast_find_block is used to avoid a redundant check
of fast found pageblock which is already checked skip flag inside
fast_find_migrateblock.
Link: https://lkml.kernel.org/r/20230804110454.2935878-5-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Move migrate_pfn to page block end when block is marked skip to avoid
unnecessary scan retry of that block from upper caller. For example,
compact_zone may wrongly rescan skip page block with finish_pageblock
set as following:
1. cc->migrate point to the start of page block
2. compact_zone record last_migrated_pfn to cc->migrate
3. compact_zone->isolate_migratepages->isolate_migratepages_block
tries to scan the block. The low_pfn maybe moved forward to middle of
block because of free pages at beginning of block.
4. we find first lru page could be isolated but block was exclusive
marked skip.
5. abort isolate_migratepages_block and make cc->migrate_pfn point to
found lru page at middle of block.
6. compact_zone find cc->migrate_pfn and last_migrated_pfn are in the
same block and wrongly rescan the block with finish_pageblock set.
Link: https://lkml.kernel.org/r/20230804110454.2935878-4-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We record start pfn of last isolated page block with last_migrated_pfn. And
then:
1. We check if we mark the page block skip for exclusive access in
isolate_migratepages_block by test if next migrate pfn is still in last
isolated page block. If so, we will set finish_pageblock to do the
rescan.
2. We check if a full cc->order block is scanned by test if last scan
range passes the cc->order block boundary. If so, we flush the pages
were freed.
We treat cc->migrate_pfn before isolate_migratepages as the start pfn of
last isolated page range. However, we always align migrate_pfn to page
block or move to another page block in fast_find_migrateblock or in
linearly scan forward in isolate_migratepages before do page isolation in
isolate_migratepages_block.
Update last_migrated_pfn with pageblock_start_pfn(cc->migrate_pfn - 1)
after scan to correctly set start pfn of last isolated page range. To
avoid that:
1. Miss a rescan with finish_pageblock set as last_migrate_pfn does
not point to right pageblock and the migrate will not be in pageblock
of last_migrate_pfn as it should be.
2. Wrongly issue flush by test cc->order block boundary with wrong
last_migrate_pfn.
Link: https://lkml.kernel.org/r/20230804110454.2935878-3-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reorder the operations for split and spanning stores so that new data is
placed in the tree prior to marking the old data as dead. This will limit
re-walks on dead data to just once instead of a retry loop.
The order of operations is as follows: Create the new data, put the new
data in place, mark the top node of the old data as dead.
Then repair parent links in the reused nodes through all levels of the
tree, following the new nodes downwards. Finally walk the top dead node
looking for nodes that are no longer used, or subtrees that should be
destroyed (marked dead throughout then freed), follow the partially used
nodes downwards to discover other dead nodes and subtrees.
Link: https://lkml.kernel.org/r/20230804165951.2661157-7-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Replacing nodes may cause a live lock-up if CPU resources are saturated by
write operations on the tree by continuously retrying on dead nodes. To
avoid the continuous retry scenario, ensure the new node is inserted into
the tree prior to marking the old data as dead. This will define a window
where old and new data is swapped.
When reusing lower level nodes, ensure the parent pointer is updated after
the parent is marked dead. This ensures that the child is still reachable
from the top of the tree, but walking up to a dead node will result in a
single retry that will start a fresh walk from the top down through the
new node.
Link: https://lkml.kernel.org/r/20230804165951.2661157-3-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "maple_tree: Change replacement strategy".
The maple tree marks nodes dead as soon as they are going to be replaced.
This could be problematic when used in the RCU context since the writer
may be starved of CPU time by the readers. This patch set addresses the
issue by switching the data replacement strategy to one that will only
mark data as dead once the new data is available.
This series changes the ordering of the node replacement so that the new
data is live before the old data is marked 'dead'. When readers hit
'dead' nodes, they will restart from the top of the tree and end up in the
new data.
In more complex scenarios, the replacement strategy means a subtree is
built and graphed into the tree leaving some nodes to point to the old
parent. The view of tasks into the old data will either remain with the
old data, or see the new data once the old data is marked 'dead'.
Iterators will see the 'dead' node and restart on their own and switch to
the new data. There is no risk of the reader seeing old data in these
cases.
The 'dead' subtree of data is then fully marked dead, but reused nodes
will still point to the dead nodes until the parent pointer is updated.
Walking up to a 'dead' node will cause a re-walk from the top of the tree
and enter the new data area where old data is not reachable.
Once the parent pointers are fully up to date in the active data, the
'dead' subtree is iterated to collect entirely 'dead' subtrees, and dead
nodes (nodes that partially contained reused data).
This patch (of 6):
When dumping the tree, honour formatting request to output hex for the
maple node type arange64.
Link: https://lkml.kernel.org/r/20230804165951.2661157-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20230804165951.2661157-2-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>