memblock_reserve() would add a new range to memblock.reserved in case
the new range is not totally covered by any of the current
memblock.reserved range. If the memblock.reserved is full and can't
resize, memblock_reserve() would fail.
This doesn't happen in real world now, I observed this during code
review. While theoretically, it has the chance to happen. And if it
happens, others would think this range of memory is still available and
may corrupt the memory.
This patch checks the return value and goto "done" after it succeeds.
Link: http://lkml.kernel.org/r/1482363033-24754-3-git-send-email-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Without a memory barrier, the following race can occur with a high-order
allocation:
wakeup_kcompactd(order == 1) kcompactd()
[L] waitqueue_active(kcompactd_wait)
[S] prepare_to_wait_event(kcompactd_wait)
[L] (kcompactd_max_order == 0)
[S] kcompactd_max_order = order; schedule()
Where the waitqueue_active() check is speculatively re-ordered to before
setting the actual condition (max_order), not seeing the threads that's
going to block; making us miss a wakeup. There are a couple of options
to fix this, including calling wq_has_sleepers() which adds a full
barrier, or unconditionally doing the wake_up_interruptible() and
serialize on the q->lock. However, to make use of the control
dependency, we just need to add L->L guarantees.
While this bug is theoretical, there have been other offenders of the
lockless waitqueue_active() in the past -- this is also documented in
the call itself.
Link: http://lkml.kernel.org/r/1483975528-24342-1-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When using a sparse memory model memmap_init_zone() when invoked with
the MEMMAP_EARLY context will skip over pages which aren't valid - ie.
which aren't in a populated region of the sparse memory map. However if
the memory map is extremely sparse then it can spend a long time
linearly checking each PFN in a large non-populated region of the memory
map & skipping it in turn.
When CONFIG_HAVE_MEMBLOCK_NODE_MAP is enabled, we have sufficient
information to quickly discover the next valid PFN given an invalid one
by searching through the list of memory regions & skipping forwards to
the first PFN covered by the memory region to the right of the
non-populated region. Implement this in order to speed up
memmap_init_zone() for systems with extremely sparse memory maps.
James said "I have tested this patch on a virtual model of a Samurai CPU
with a sparse memory map. The kernel boot time drops from 109 to
62 seconds. "
Link: http://lkml.kernel.org/r/20161125185518.29885-1-paul.burton@imgtec.com
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Tested-by: James Hartley <james.hartley@imgtec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A "compact_daemon_wake" vmstat exists that represents the number of
times kcompactd has woken up. This doesn't represent how much work it
actually did, though.
It's useful to understand how much compaction work is being done by
kcompactd versus other methods such as direct compaction and explicitly
triggered per-node (or system) compaction.
This adds two new vmstats: "compact_daemon_migrate_scanned" and
"compact_daemon_free_scanned" to represent the number of pages kcompactd
has scanned as part of its migration scanner and freeing scanner,
respectively.
These values are still accounted for in the general
"compact_migrate_scanned" and "compact_free_scanned" for compatibility.
It could be argued that explicitly triggered compaction could also be
tracked separately, and that could be added if others find it useful.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612071749390.69852@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 682a3385e7 ("mm, page_alloc: inline the fast path of the
zonelist iterator") changed how next_zones_zonelist() is called, by
adding a static inline function to do the fast path. This function
adds:
if (likely(!nodes && zonelist_zone_idx(z) <= highest_zoneidx))
return z;
return __next_zones_zonelist(z, highest_zoneidx, nodes);
Where __next_zones_zonelist() is only called when nodes is not NULL or
zonelist_zone_idx(z) is less than highest_zoneidx.
The original next_zone_zonelist() was converted to __next_zones_zonelist()
but it still maintained:
if (likely(nodes == NULL))
Which is now actually a very unlikely, as it is only called with nodes
equal to NULL when zonelist_zone_idx(z) is greater than highest_zoneidx.
Before this commit, this if had this statistic:
correct incorrect % Function File Line
------- --------- - -------- ---- ----
837895 446078 34 next_zones_zonelist mmzone.c 63
After this commit, it has:
correct incorrect % Function File Line
------- --------- - -------- ---- ----
10 173840 99 __next_zones_zonelist mmzone.c 63
Thus, the if statement is now much more unlikely than it ever was as a
likely.
Link: http://lkml.kernel.org/r/20170105200102.77989567@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm_vmscan_lru_shrink_inactive will currently report the number of
scanned and reclaimed pages. This doesn't give us an idea how the
reclaim went except for the overall effectiveness though. Export and
show other counters which will tell us why we couldn't reclaim some
pages.
- nr_dirty, nr_writeback, nr_congested and nr_immediate tells
us how many pages are blocked due to IO
- nr_activate tells us how many pages were moved to the active
list
- nr_ref_keep reports how many pages are kept on the LRU due
to references (mostly for the file pages which are about to
go for another round through the inactive list)
- nr_unmap_fail - how many pages failed to unmap
All these are rather low level so they might change in future but the
tracepoint is already implementation specific so no tools should be
depending on its stability.
Link: http://lkml.kernel.org/r/20170104101942.4860-7-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
shrink_page_list returns quite some counters back to its caller.
Extract the existing 5 into struct reclaim_stat because this makes the
code easier to follow and also allows further counters to be returned.
While we are at it, make all of them unsigned rather than unsigned long
as we do not really need full 64b for them (we never scan more than
SWAP_CLUSTER_MAX pages at once). This should reduce some stack space.
This patch shouldn't introduce any functional change.
Link: http://lkml.kernel.org/r/20170104101942.4860-6-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm_vmscan_lru_isolate shows the number of requested, scanned and taken
pages. This is mostly OK but on 32b systems the number of scanned pages
is quite misleading because it includes both the scanned and skipped
pages. Moreover the skipped part is scaled based on the number of taken
pages. Let's report the exact numbers without any additional logic and
add the number of skipped pages.
This should make the reported data much more easier to interpret.
Link: http://lkml.kernel.org/r/20170104101942.4860-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Our reclaim process has several tracepoints to tell us more about how
things are progressing. We are, however, missing a tracepoint to track
active list aging. Introduce mm_vmscan_lru_shrink_active which reports
the number of
- nr_taken is number of isolated pages from the active list
- nr_referenced pages which tells us that we are hitting referenced
pages which are deactivated. If this is a large part of the
reported nr_deactivated pages then we might be hitting into
the active list too early because they might be still part of
the working set. This might help to debug performance issues.
- nr_active pages which tells us how many pages are kept on the
active list - mostly exec file backed pages. A high number can
indicate that we might be trashing on executables.
[mhocko@suse.com: update]
Link: http://lkml.kernel.org/r/20170104135244.GJ25453@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170104101942.4860-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "vm, vmscan: enahance vmscan tracepoints", v2.
While debugging [2] I've realized that there is some room for
improvements in the tracepoints set we offer currently. I had hard
times to make any conclusion from the existing ones. The resulting
problem turned out to be active list aging [3] and we are missing at
least two tracepoints to debug such a problem.
Some existing tracepoints could export more information to see _why_ the
reclaim progress cannot be made not only _how much_ we could reclaim.
The later could be seen quite reasonably from the vmstat counters
already. It can be argued that we are showing too many implementation
details in those tracepoints but I consider them way too lowlevel
already to be usable by any kernel independent userspace. I would be
_really_ surprised if anything but debugging tools have used them.
Any feedback is highly appreciated.
[1] http://lkml.kernel.org/r/20161228153032.10821-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20161215225702.GA27944@boerne.fritz.box
[3] http://lkml.kernel.org/r/20161223105157.GB23109@dhcp22.suse.cz
This patch (of 8):
The trace point is not used since 925b7673cc ("mm: make per-memcg LRU
lists exclusive") so it can be removed.
Link: http://lkml.kernel.org/r/20170104101942.4860-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the non atomic version of __SetPageUptodate while the page is still
private and not visible to lookup operations. Using the non atomic
version after the page is already visible to lookups is unsafe as there
would be concurrent lock_page operation modifying the page->flags while
it runs.
This solves a lockup in find_lock_entry with the userfaultfd_shmem
selftest.
userfaultfd_shm D14296 691 1 0x00000004
Call Trace:
schedule+0x3d/0x90
schedule_timeout+0x228/0x420
io_schedule_timeout+0xa4/0x110
__lock_page+0x12d/0x170
find_lock_entry+0xa4/0x190
shmem_getpage_gfp+0xb9/0xc30
shmem_fault+0x70/0x1c0
__do_fault+0x21/0x150
handle_mm_fault+0xec9/0x1490
__do_page_fault+0x20d/0x520
trace_do_page_fault+0x61/0x270
do_async_page_fault+0x19/0x80
async_page_fault+0x25/0x30
Link: http://lkml.kernel.org/r/20170116180408.12184-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>