Without deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in "struct page"es are never changed prior to
first initializing struct pages by going through __init_single_page().
With deferred struct page feature enabled there is a case where we set
some fields prior to initializing:
mem_init() {
register_page_bootmem_info();
free_all_bootmem();
...
}
When register_page_bootmem_info() is called only non-deferred struct
pages are initialized. But, this function goes through some reserved
pages which might be part of the deferred, and thus are not yet
initialized.
mem_init
register_page_bootmem_info
register_page_bootmem_info_node
get_page_bootmem
.. setting fields here ..
such as: page->freelist = (void *)type;
free_all_bootmem()
free_low_memory_core_early()
for_each_reserved_mem_region()
reserve_bootmem_region()
init_reserved_page() <- Only if this is deferred reserved page
__init_single_pfn()
__init_single_page()
memset(0) <-- Loose the set fields here
We end up with similar issue as in the previous patch, where currently
we do not observe problem as memory is zeroed. But, if flag asserts are
changed we can start hitting issues.
Also, because in this patch series we will stop zeroing struct page
memory during allocation, we must make sure that struct pages are
properly initialized prior to using them.
The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to switch the above calls.
Link: http://lkml.kernel.org/r/20171013173214.27300-4-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Without deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in "struct page"es are never changed prior to
first initializing struct pages by going through __init_single_page().
With deferred struct page feature enabled, however, we set fields in
register_page_bootmem_info that are subsequently clobbered right after
in free_all_bootmem:
mem_init() {
register_page_bootmem_info();
free_all_bootmem();
...
}
When register_page_bootmem_info() is called only non-deferred struct
pages are initialized. But, this function goes through some reserved
pages which might be part of the deferred, and thus are not yet
initialized.
mem_init
register_page_bootmem_info
register_page_bootmem_info_node
get_page_bootmem
.. setting fields here ..
such as: page->freelist = (void *)type;
free_all_bootmem()
free_low_memory_core_early()
for_each_reserved_mem_region()
reserve_bootmem_region()
init_reserved_page() <- Only if this is deferred reserved page
__init_single_pfn()
__init_single_page()
memset(0) <-- Loose the set fields here
We end up with issue where, currently we do not observe problem as
memory is explicitly zeroed. But, if flag asserts are changed we can
start hitting issues.
Also, because in this patch series we will stop zeroing struct page
memory during allocation, we must make sure that struct pages are
properly initialized prior to using them.
The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to switch the above calls.
Link: http://lkml.kernel.org/r/20171013173214.27300-3-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Tested-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "complete deferred page initialization", v12.
SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config
option, which defers initializing struct pages until all cpus have been
started so it can be done in parallel.
However, this feature is sub-optimal, because the deferred page
initialization code expects that the struct pages have already been
zeroed, and the zeroing is done early in boot with a single thread only.
Also, we access that memory and set flags before struct pages are
initialized. All of this is fixed in this patchset.
In this work we do the following:
- Never read access struct page until it was initialized
- Never set any fields in struct pages before they are initialized
- Zero struct page at the beginning of struct page initialization
==========================================================================
Performance improvements on x86 machine with 8 nodes:
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:
TIME SPEED UP
base no deferred: 95.796233s
fix no deferred: 79.978956s 19.77%
base deferred: 77.254713s
fix deferred: 55.050509s 40.34%
==========================================================================
SPARC M6 3600 MHz with 15T of memory
TIME SPEED UP
base no deferred: 358.335727s
fix no deferred: 302.320936s 18.52%
base deferred: 237.534603s
fix deferred: 182.103003s 30.44%
==========================================================================
Raw dmesg output with timestamps:
x86 base no deferred: https://hastebin.com/ofunepurit.scala
x86 base deferred: https://hastebin.com/ifazegeyas.scala
x86 fix no deferred: https://hastebin.com/pegocohevo.scala
x86 fix deferred: https://hastebin.com/ofupevikuk.scala
sparc base no deferred: https://hastebin.com/ibobeteken.go
sparc base deferred: https://hastebin.com/fariqimiyu.go
sparc fix no deferred: https://hastebin.com/muhegoheyi.go
sparc fix deferred: https://hastebin.com/xadinobutu.go
This patch (of 11):
deferred_init_memmap() is called when struct pages are initialized later
in boot by slave CPUs. This patch simplifies and optimizes this
function, and also fixes a couple issues (described below).
The main change is that now we are iterating through free memblock areas
instead of all configured memory. Thus, we do not have to check if the
struct page has already been initialized.
=====
In deferred_init_memmap() where all deferred struct pages are
initialized we have a check like this:
if (page->flags) {
VM_BUG_ON(page_zone(page) != zone);
goto free_range;
}
This way we are checking if the current deferred page has already been
initialized. It works, because memory for struct pages has been zeroed,
and the only way flags are not zero if it went through
__init_single_page() before. But, once we change the current behavior
and won't zero the memory in memblock allocator, we cannot trust
anything inside "struct page"es until they are initialized. This patch
fixes this.
The deferred_init_memmap() is re-written to loop through only free
memory ranges provided by memblock.
Note, this first issue is relevant only when the following change is
merged:
=====
This patch fixes another existing issue on systems that have holes in
zones i.e CONFIG_HOLES_IN_ZONE is defined.
In for_each_mem_pfn_range() we have code like this:
if (!pfn_valid_within(pfn)
goto free_range;
Note: 'page' is not set to NULL and is not incremented but 'pfn'
advances. Thus means if deferred struct pages are enabled on systems
with these kind of holes, linux would get memory corruptions. I have
fixed this issue by defining a new macro that performs all the necessary
operations when we free the current set of pages.
[pasha.tatashin@oracle.com: buddy page accessed before initialized]
Link: http://lkml.kernel.org/r/20171102170221.7401-2-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20171013173214.27300-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Tested-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The allocations from filp cache can be directly triggered by userspace
applications. A buggy application can consume a significant amount of
unaccounted system memory. Though we have not noticed such buggy
applications in our production but upon close inspection, we found that
a lot of machines spend very significant amount of memory on these
caches.
One way to limit allocations from filp cache is to set system level
limit of maximum number of open files. However this limit is shared
between different users on the system and one user can hog this
resource. To cater that, we can charge filp to kmemcg and set the
maximum limit very high and let the memory limit of each user limit the
number of files they can open and indirectly limiting their allocations
from filp cache.
One side effect of this change is that it will allow _sysctl() to return
ENOMEM and the man page of _sysctl() does not specify that. However the
man page also discourages to use _sysctl() at all.
Link: http://lkml.kernel.org/r/20171011190359.34926-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, we account page tables separately for each page table level,
but that's redundant -- we only make use of total memory allocated to
page tables for oom_badness calculation. We also provide the
information to userspace, but it has dubious value there too.
This patch switches page table accounting to single counter.
mm->pgtables_bytes is now used to account all page table levels. We use
bytes, because page table size for different levels of page table tree
may be different.
The change has user-visible effect: we don't have VmPMD and VmPUD
reported in /proc/[pid]/status. Not sure if anybody uses them. (As
alternative, we can always report 0 kB for them.)
OOM-killer report is also slightly changed: we now report pgtables_bytes
instead of nr_ptes, nr_pmd, nr_puds.
Apart from reducing number of counters per-mm, the benefit is that we
now calculate oom_badness() more correctly for machines which have
different size of page tables depending on level or where page tables
are less than a page in size.
The only downside can be debuggability because we do not know which page
table level could leak. But I do not remember many bugs that would be
caught by separate counters so I wouldn't lose sleep over this.
[akpm@linux-foundation.org: fix mm/huge_memory.c]
Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
[kirill.shutemov@linux.intel.com: fix build]
Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memory allocations can happen before the swap_slots cache initialization
is completed during cpu bring up. If we are low on memory, we could
call get_swap_page() and access swap_slots_cache before it is fully
initialized.
Add a check in get_swap_page() for initialized swap_slots_cache to
prevent this condition. Similar check already exists in free_swap_slot.
Also annotate the checks to indicate the likely condition.
We also added a memory barrier to make sure that the locks
initialization are done before the assignment of cache->slots and
cache->slots_ret pointers. This ensures the assumption that it is safe
to acquire the slots cache locks and use the slots cache when the
corresponding cache->slots or cache->slots_ret pointers are non null.
[akpm@linux-foundation.org: tidy up comment]
[akpm@linux-foundation.org: fix spello in comment]
Link: http://lkml.kernel.org/r/65a9d0f133f63e66bba37b53b2fd0464b7cae771.1500677066.git.tim.c.chen@linux.intel.com
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Reported-by: Wenwei Tao <wenwei.tww@alibaba-inc.com>
Acked-by: Ying Huang <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is an optimization patch that only affect mmu_notifier users which
rely on the invalidate_range() callback. This patch avoids calling that
callback twice in a row from inside __mmu_notifier_invalidate_range_end
Existing pattern (before this patch):
mmu_notifier_invalidate_range_start()
pte/pmd/pud_clear_flush_notify()
mmu_notifier_invalidate_range()
mmu_notifier_invalidate_range_end()
mmu_notifier_invalidate_range()
New pattern (after this patch):
mmu_notifier_invalidate_range_start()
pte/pmd/pud_clear_flush_notify()
mmu_notifier_invalidate_range()
mmu_notifier_invalidate_range_only_end()
We call the invalidate_range callback after clearing the page table
under the page table lock and we skip the call to invalidate_range
inside the __mmu_notifier_invalidate_range_end() function.
Idea from Andrea Arcangeli
Link: http://lkml.kernel.org/r/20171017031003.7481-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch only affects users of mmu_notifier->invalidate_range callback
which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ...
and it is an optimization for those users. Everyone else is unaffected
by it.
When clearing a pte/pmd we are given a choice to notify the event under
the page table lock (notify version of *_clear_flush helpers do call the
mmu_notifier_invalidate_range). But that notification is not necessary
in all cases.
This patch removes almost all cases where it is useless to have a call
to mmu_notifier_invalidate_range before
mmu_notifier_invalidate_range_end. It also adds documentation in all
those cases explaining why.
Below is a more in depth analysis of why this is fine to do this:
For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when
device use thing like ATS/PASID to get the IOMMU to walk the CPU page
table to access a process virtual address space). There is only 2 cases
when you need to notify those secondary TLB while holding page table
lock when clearing a pte/pmd:
A) page backing address is free before mmu_notifier_invalidate_range_end
B) a page table entry is updated to point to a new page (COW, write fault
on zero page, __replace_page(), ...)
Case A is obvious you do not want to take the risk for the device to write
to a page that might now be used by something completely different.
Case B is more subtle. For correctness it requires the following sequence
to happen:
- take page table lock
- clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
- set page table entry to point to new page
If clearing the page table entry is not followed by a notify before setting
the new pte/pmd value then you can break memory model like C11 or C++11 for
the device.
Consider the following scenario (device use a feature similar to ATS/
PASID):
Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we
assume they are write protected for COW (other case of B apply too).
[Time N] -----------------------------------------------------------------
CPU-thread-0 {try to write to addrA}
CPU-thread-1 {try to write to addrB}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {read addrA and populate device TLB}
DEV-thread-2 {read addrB and populate device TLB}
[Time N+1] ---------------------------------------------------------------
CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+2] ---------------------------------------------------------------
CPU-thread-0 {COW_step1: {update page table point to new page for addrA}}
CPU-thread-1 {COW_step1: {update page table point to new page for addrB}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {preempted}
CPU-thread-2 {write to addrA which is a write to new page}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {preempted}
CPU-thread-2 {}
CPU-thread-3 {write to addrB which is a write to new page}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+4] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+5] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {read addrA from old page}
DEV-thread-2 {read addrB from new page}
So here because at time N+2 the clear page table entry was not pair with a
notification to invalidate the secondary TLB, the device see the new value
for addrB before seing the new value for addrA. This break total memory
ordering for the device.
When changing a pte to write protect or to point to a new write protected
page with same content (KSM) it is ok to delay invalidate_range callback
to mmu_notifier_invalidate_range_end() outside the page table lock. This
is true even if the thread doing page table update is preempted right
after releasing page table lock before calling
mmu_notifier_invalidate_range_end
Thanks to Andrea for thinking of a problematic scenario for COW.
[jglisse@redhat.com: v2]
Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com
Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use BUG_ON(in_interrupt()) in zs_map_object(). This is not a new
BUG_ON(), it's always been there, but was recently changed to
VM_BUG_ON(). There are several problems there. First, we use use
per-CPU mappings both in zsmalloc and in zram, and interrupt may easily
corrupt those buffers. Second, and more importantly, we believe it's
possible to start leaking sensitive information. Consider the following
case:
-> process P
swap out
zram
per-cpu mapping CPU1
compress page A
-> IRQ
swap out
zram
per-cpu mapping CPU1
compress page B
write page from per-cpu mapping CPU1 to zsmalloc pool
iret
-> process P
write page from per-cpu mapping CPU1 to zsmalloc pool [*]
return
* so we store overwritten data that actually belongs to another
page (task) and potentially contains sensitive data. And when
process P will page fault it's going to read (swap in) that
other task's data.
Link: http://lkml.kernel.org/r/20170929045140.4055-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
for_each_memblock_type macro function relies on idx variable defined in
the caller context. Silent macro arguments are almost always wrong
thing to do. They make code harder to read and easier to get wrong.
Let's use an explicit iterator parameter for for_each_memblock_type and
make the code more obious. This patch is a mere cleanup and it
shouldn't introduce any functional change.
Link: http://lkml.kernel.org/r/20170913133029.28911-1-gi-oh.kim@profitbricks.com
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have a hardcoded 120s timeout after which the memory offline fails
basically since the hot remove has been introduced. This is essentially
a policy implemented in the kernel. Moreover there is no way to adjust
the timeout and so we are sometimes facing memory offline failures if
the system is under a heavy memory pressure or very intensive CPU
workload on large machines.
It is not very clear what purpose the timeout actually serves. The
offline operation is interruptible by a signal so if userspace wants
some timeout based termination this can be done trivially by sending a
signal.
If there is a strong usecase to do this from the kernel then we should
do it properly and have a it tunable from the userspace with the timeout
disabled by default along with the explanation who uses it and for what
purporse.
Link: http://lkml.kernel.org/r/20170918070834.13083-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm, memory_hotplug: redefine memory offline retry logic", v2.
While testing memory hotplug on a large 4TB machine we have noticed that
memory offlining is just too eager to fail. The primary reason is that
the retry logic is just too easy to give up. We have 4 ways out of the
offline
- we have a permanent failure (isolation or memory notifiers fail,
or hugetlb pages cannot be dropped)
- userspace sends a signal
- a hardcoded 120s timeout expires
- page migration fails 5 times
This is way too convoluted and it doesn't scale very well. We have seen
both temporary migration failures as well as 120s being triggered.
After removing those restrictions we were able to pass stress testing
during memory hot remove without any other negative side effects
observed. Therefore I suggest dropping both hard coded policies. I
couldn't have found any specific reason for them in the changelog. I
neither didn't get any response [1] from Kamezawa. If we need some
upper bound - e.g. timeout based - then we should have a proper and
user defined policy for that. In any case there should be a clear use
case when introducing it.
This patch (of 2):
Memory offlining can fail too eagerly under heavy memory pressure.
page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
page dumped because: isolation failed
page->mem_cgroup:ffff8801cd662000
memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
Isolation has failed here because the page is not on LRU. Most probably
because it was on the pcp LRU cache or it has been removed from the LRU
already but it hasn't been freed yet. In both cases the page doesn't
look non-migrable so retrying more makes sense.
__offline_pages seems rather cluttered when it comes to the retry logic.
We have 5 retries at maximum and a timeout. We could argue whether the
timeout makes sense but failing just because of a race when somebody
isoltes a page from LRU or puts it on a pcp LRU lists is just wrong. It
only takes it to race with a process which unmaps some pages and remove
them from the LRU list and we can fail the whole offline because of
something that is a temporary condition and actually not harmful for the
offline.
Please note that unmovable pages should be already excluded during
start_isolate_page_range. We could argue that has_unmovable_pages is
racy and MIGRATE_MOVABLE check doesn't provide any hard guarantee either
but kernel zones (aka < ZONE_MOVABLE) will very likely detect unmovable
pages in most cases and movable zone shouldn't contain unmovable pages
at all. Some of those pages might be pinned but not for ever because
that would be a bug on its own. In any case the context is still
interruptible and so the userspace can easily bail out when the
operation takes too long. This is certainly better behavior than a
hardcoded retry loop which is racy.
Fix this by removing the max retry count and only rely on the timeout
resp. interruption by a signal from the userspace. Also retry rather
than fail when check_pages_isolated sees some !free pages because those
could be a result of the race as well.
Link: http://lkml.kernel.org/r/20170918070834.13083-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo has noticed that "mm: drop migrate type checks from
has_unmovable_pages" would break CMA allocator because it relies on
has_unmovable_pages returning false even for CMA pageblocks which in
fact don't have to be movable:
alloc_contig_range
start_isolate_page_range
set_migratetype_isolate
has_unmovable_pages
This is a result of the code sharing between CMA and memory hotplug
while each one has a different idea of what has_unmovable_pages should
return. This is unfortunate but fixing it properly would require a lot
of code duplication.
Fix the issue by introducing the requested migrate type argument and
special case MIGRATE_CMA case where CMA page blocks are handled
properly. This will work for memory hotplug because it requires
MIGRATE_MOVABLE.
Link: http://lkml.kernel.org/r/20171019122118.y6cndierwl2vnguj@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ran Wang <ran.wang_1@nxp.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michael has noticed that the memory offline tries to migrate kernel code
pages when doing
echo 0 > /sys/devices/system/memory/memory0/online
The current implementation will fail the operation after several failed
page migration attempts but we shouldn't even attempt to migrate that
memory and fail right away because this memory is clearly not
migrateable. This will become a real problem when we drop the retry
loop counter resp. timeout.
The real problem is in has_unmovable_pages in fact. We should fail if
there are any non migrateable pages in the area. In orther to guarantee
that remove the migrate type checks because MIGRATE_MOVABLE is not
guaranteed to contain only migrateable pages. It is merely a heuristic.
Similarly MIGRATE_CMA does guarantee that the page allocator doesn't
allocate any non-migrateable pages from the block but CMA allocations
themselves are unlikely to migrateable. Therefore remove both checks.
[akpm@linux-foundation.org: remove unused local `mt']
Link: http://lkml.kernel.org/r/20171013120013.698-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ran Wang <ran.wang_1@nxp.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>