Let's support multiple registered callbacks, making sure that
registering vmcore callbacks cannot fail. Make the callback return a
bool instead of an int, handling how to deal with errors internally.
Drop unused HAVE_OLDMEM_PFN_IS_RAM.
We soon want to make use of this infrastructure from other drivers:
virtio-mem, registering one callback for each virtio-mem device, to
prevent reading unplugged virtio-mem memory.
Handle it via a generic vmcore_cb structure, prepared for future
extensions: for example, once we support virtio-mem on s390x where the
vmcore is completely constructed in the second kernel, we want to detect
and add plugged virtio-mem memory ranges to the vmcore in order for them
to get dumped properly.
Handle corner cases that are unexpected and shouldn't happen in sane
setups: registering a callback after the vmcore has already been opened
(warn only) and unregistering a callback after the vmcore has already been
opened (warn and essentially read only zeroes from that point on).
Link: https://lkml.kernel.org/r/20211005121430.30136-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After removing /dev/kmem, sanitizing /proc/kcore and handling /dev/mem,
this series tackles the last sane way how a VM could accidentially
access logically unplugged memory managed by a virtio-mem device:
/proc/vmcore
When dumping memory via "makedumpfile", PG_offline pages, used by
virtio-mem to flag logically unplugged memory, are already properly
excluded; however, especially when accessing/copying /proc/vmcore "the
usual way", we can still end up reading logically unplugged memory part
of a virtio-mem device.
Patch #1-#3 are cleanups. Patch #4 extends the existing
oldmem_pfn_is_ram mechanism. Patch #5-#7 are virtio-mem refactorings
for patch #8, which implements the virtio-mem logic to query the state
of device blocks.
Patch #8:
"Although virtio-mem currently supports reading unplugged memory in the
hypervisor, this will change in the future, indicated to the device
via a new feature flag. We similarly sanitized /proc/kcore access
recently.
[...]
Distributions that support virtio-mem+kdump have to make sure that the
virtio_mem module will be part of the kdump kernel or the kdump
initrd; dracut was recently [2] extended to include virtio-mem in the
generated initrd. As long as no special kdump kernels are used, this
will automatically make sure that virtio-mem will be around in the
kdump initrd and sanitize /proc/vmcore access -- with dracut"
This is the last remaining bit to support
VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE [3] in the Linux implementation of
virtio-mem.
Note: this is best-effort. We'll never be able to control what runs
inside the second kernel, really, but we also don't have to care: we
only care about sane setups where we don't want our VM getting zapped
once we touch the wrong memory location while dumping. While we usually
expect sane setups to use "makedumfile", nothing really speaks against
just copying /proc/vmcore, especially in environments where HWpoisioning
isn't typically expected. Also, we really don't want to put all our
trust completely on the memmap, so sanitizing also makes sense when just
using "makedumpfile".
[1] https://lkml.kernel.org/r/20210526093041.8800-1-david@redhat.com
[2] https://github.com/dracutdevs/dracut/pull/1157
[3] https://lists.oasis-open.org/archives/virtio-comment/202109/msg00021.html
This patch (of 9):
The callback is only used for the vmcore nowadays.
Link: https://lkml.kernel.org/r/20211005121430.30136-1-david@redhat.com
Link: https://lkml.kernel.org/r/20211005121430.30136-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Boris Ostrovsky <boris.ostrvsky@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Historically (pre-2.5), the inode shrinker used to reclaim only empty
inodes and skip over those that still contained page cache. This caused
problems on highmem hosts: struct inode could put fill lowmem zones
before the cache was getting reclaimed in the highmem zones.
To address this, the inode shrinker started to strip page cache to
facilitate reclaiming lowmem. However, this comes with its own set of
problems: the shrinkers may drop actively used page cache just because
the inodes are not currently open or dirty - think working with a large
git tree. It further doesn't respect cgroup memory protection settings
and can cause priority inversions between containers.
Nowadays, the page cache also holds non-resident info for evicted cache
pages in order to detect refaults. We've come to rely heavily on this
data inside reclaim for protecting the cache workingset and driving swap
behavior. We also use it to quantify and report workload health through
psi. The latter in turn is used for fleet health monitoring, as well as
driving automated memory sizing of workloads and containers, proactive
reclaim and memory offloading schemes.
The consequences of dropping page cache prematurely is that we're seeing
subtle and not-so-subtle failures in all of the above-mentioned
scenarios, with the workload generally entering unexpected thrashing
states while losing the ability to reliably detect it.
To fix this on non-highmem systems at least, going back to rotating
inodes on the LRU isn't feasible. We've tried (commit a76cf1a474
("mm: don't reclaim inodes with many attached pages")) and failed
(commit 69056ee6a8 ("Revert "mm: don't reclaim inodes with many
attached pages"")).
The issue is mostly that shrinker pools attract pressure based on their
size, and when objects get skipped the shrinkers remember this as
deferred reclaim work. This accumulates excessive pressure on the
remaining inodes, and we can quickly eat into heavily used ones, or
dirty ones that require IO to reclaim, when there potentially is plenty
of cold, clean cache around still.
Instead, this patch keeps populated inodes off the inode LRU in the
first place - just like an open file or dirty state would. An otherwise
clean and unused inode then gets queued when the last cache entry
disappears. This solves the problem without reintroducing the reclaim
issues, and generally is a bit more scalable than having to wade through
potentially hundreds of thousands of busy inodes.
Locking is a bit tricky because the locks protecting the inode state
(i_lock) and the inode LRU (lru_list.lock) don't nest inside the
irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are
serialized through i_lock, taken before the i_pages lock, to make sure
depopulated inodes are queued reliably. Additions may race with
deletions, but we'll check again in the shrinker. If additions race
with the shrinker itself, we're protected by the i_lock: if find_inode()
or iput() win, the shrinker will bail on the elevated i_count or
I_REFERENCED; if the shrinker wins and goes ahead with the inode, it
will set I_FREEING and inhibit further igets(), which will cause the
other side to create a new instance of the inode instead.
Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some descriptions of page flags in 'pagemap.rst' are written in
assumption of none-rst, which respects every new line, as below:
7 - SLAB
page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
When compound page is used, SLUB/SLQB will only set this flag on the head
Because rst ignores the new line between the first sentence and second
sentence, resulting html looks a little bit weird, as below.
7 - SLAB
page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator When
^
compound page is used, SLUB/SLQB will only set this flag on the head
page; SLOB will not flag it at all.
This change makes it more natural and consistent with other parts in the
rendered version.
Link: https://lkml.kernel.org/r/20211022090311.3856-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Information in 'TL; DR' section of 'Getting Started' is duplicated in
other parts of the doc. It is also asking readers to visit the access
pattern visualizations gallery web site to show the results of example
visualization commands, while the users of the commands can use terminal
output.
To make the doc simple, this removes the duplicated 'TL; DR' section and
replaces the visualization example commands with versions using terminal
outputs.
Link: https://lkml.kernel.org/r/20211022090311.3856-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the ctx->adaptive_targets list is empty, I did some test on
monitor_on interface like this.
# cat /sys/kernel/debug/damon/target_ids
#
# echo on > /sys/kernel/debug/damon/monitor_on
# damon: kdamond (5390) starts
Though the ctx->adaptive_targets list is empty, but the kthread_run
still be called, and the kdamond.x thread still be created, this is
meaningless.
So there adds a judgment in 'dbgfs_monitor_on_write', if the
ctx->adaptive_targets list is empty, return -EINVAL.
Link: https://lkml.kernel.org/r/0a60a6e8ec9d71989e0848a4dc3311996ca3b5d4.1634720326.git.xhao@linux.alibaba.com
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
DAMON-based operation schemes need to be manually turned on and off. In
some use cases, however, the condition for turning a scheme on and off
would depend on the system's situation. For example, schemes for
proactive pages reclamation would need to be turned on when some memory
pressure is detected, and turned off when the system has enough free
memory.
For easier control of schemes activation based on the system situation,
this introduces a watermarks-based mechanism. The client can describe
the watermark metric (e.g., amount of free memory in the system),
watermark check interval, and three watermarks, namely high, mid, and
low. If the scheme is deactivated, it only gets the metric and compare
that to the three watermarks for every check interval. If the metric is
higher than the high watermark, the scheme is deactivated. If the
metric is between the mid watermark and the low watermark, the scheme is
activated. If the metric is lower than the low watermark, the scheme is
deactivated again. This is to allow users fall back to traditional
page-granularity mechanisms.
Link: https://lkml.kernel.org/r/20211019150731.16699-12-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduction
============
This patchset 1) makes the engine for general data access
pattern-oriented memory management (DAMOS) be more useful for production
environments, and 2) implements a static kernel module for lightweight
proactive reclamation using the engine.
Proactive Reclamation
---------------------
On general memory over-committed systems, proactively reclaiming cold
pages helps saving memory and reducing latency spikes that incurred by
the direct reclaim or the CPU consumption of kswapd, while incurring
only minimal performance degradation[2].
A Free Pages Reporting[8] based memory over-commit virtualization system
would be one more specific use case. In the system, the guest VMs
reports their free memory to host, and the host reallocates the reported
memory to other guests. As a result, the system's memory utilization
can be maximized. However, the guests could be not so memory-frugal,
because some kernel subsystems and user-space applications are designed
to use as much memory as available. Then, guests would report only
small amount of free memory to host, results in poor memory utilization.
Running the proactive reclamation in such guests could help mitigating
this problem.
Google has also implemented this idea and using it in their data center.
They further proposed upstreaming it in LSFMM'19, and "the general
consensus was that, while this sort of proactive reclaim would be useful
for a number of users, the cost of this particular solution was too high
to consider merging it upstream"[3]. The cost mainly comes from the
coldness tracking. Roughly speaking, the implementation periodically
scans the 'Accessed' bit of each page. For the reason, the overhead
linearly increases as the size of the memory and the scanning frequency
grows. As a result, Google is known to dedicating one CPU for the work.
That's a reasonable option to someone like Google, but it wouldn't be so
to some others.
DAMON and DAMOS: An engine for data access pattern-oriented memory management
-----------------------------------------------------------------------------
DAMON[4] is a framework for general data access monitoring. Its
adaptive monitoring overhead control feature minimizes its monitoring
overhead. It also let the upper-bound of the overhead be configurable
by clients, regardless of the size of the monitoring target memory.
While monitoring 70 GiB memory of a production system every 5
milliseconds, it consumes less than 1% single CPU time. For this, it
could sacrify some of the quality of the monitoring results.
Nevertheless, the lower-bound of the quality is configurable, and it
uses a best-effort algorithm for better quality. Our test results[5]
show the quality is practical enough. From the production system
monitoring, we were able to find a 4 KiB region in the 70 GiB memory
that shows highest access frequency.
We normally don't monitor the data access pattern just for fun but to
improve something like memory management. Proactive reclamation is one
such usage. For such general cases, DAMON provides a feature called
DAMon-based Operation Schemes (DAMOS)[6]. It makes DAMON an engine for
general data access pattern oriented memory management. Using this,
clients can ask DAMON to find memory regions of specific data access
pattern and apply some memory management action (e.g., page out, move to
head of the LRU list, use huge page, ...). We call the request
'scheme'.
Proactive Reclamation on top of DAMON/DAMOS
-------------------------------------------
Therefore, by using DAMON for the cold pages detection, the proactive
reclamation's monitoring overhead issue can be solved. Actually, we
previously implemented a version of proactive reclamation using DAMOS
and achieved noticeable improvements with our evaluation setup[5].
Nevertheless, it more for a proof-of-concept, rather than production
uses. It supports only virtual address spaces of processes, and require
additional tuning efforts for given workloads and the hardware. For the
tuning, we introduced a simple auto-tuning user space tool[8]. Google
is also known to using a ML-based similar approach for their fleets[2].
But, making it just works with intuitive knobs in the kernel would be
helpful for general users.
To this end, this patchset improves DAMOS to be ready for such
production usages, and implements another version of the proactive
reclamation, namely DAMON_RECLAIM, on top of it.
DAMOS Improvements: Aggressiveness Control, Prioritization, and Watermarks
--------------------------------------------------------------------------
First of all, the current version of DAMOS supports only virtual address
spaces. This patchset makes it supports the physical address space for
the page out action.
Next major problem of the current version of DAMOS is the lack of the
aggressiveness control, which can results in arbitrary overhead. For
example, if huge memory regions having the data access pattern of
interest are found, applying the requested action to all of the regions
could incur significant overhead. It can be controlled by tuning the
target data access pattern with manual or automated approaches[2,7].
But, some people would prefer the kernel to just work with only
intuitive tuning or default values.
For such cases, this patchset implements a safeguard, namely time/size
quota. Using this, the clients can specify up to how much time can be
used for applying the action, and/or up to how much memory regions the
action can be applied within a user-specified time duration. A followup
question is, to which memory regions should the action applied within
the limits? We implement a simple regions prioritization mechanism for
each action and make DAMOS to apply the action to high priority regions
first. It also allows clients tune the prioritization mechanism to use
different weights for size, access frequency, and age of memory regions.
This means we could use not only LRU but also LFU or some fancy
algorithms like CAR[9] with lightweight overhead.
Though DAMON is lightweight, someone would want to remove even the cold
pages monitoring overhead when it is unnecessary. Currently, it should
manually turned on and off by clients, but some clients would simply
want to turn it on and off based on some metrics like free memory ratio
or memory fragmentation. For such cases, this patchset implements a
watermarks-based automatic activation feature. It allows the clients
configure the metric of their interest, and three watermarks of the
metric. If the metric is higher than the high watermark or lower than
the low watermark, the scheme is deactivated. If the metric is lower
than the mid watermark but higher than the low watermark, the scheme is
activated.
DAMON-based Reclaim
-------------------
Using the improved version of DAMOS, this patchset implements a static
kernel module called 'damon_reclaim'. It finds memory regions that
didn't accessed for specific time duration and page out. Consuming too
much CPU for the paging out operations, or doing pageout too frequently
can be critical for systems configuring their swap devices with
software-defined in-memory block devices like zram/zswap or total number
of writes limited devices like SSDs, respectively. To avoid the
problems, the time/size quotas can be configured. Under the quotas, it
pages out memory regions that didn't accessed longer first. Also, to
remove the monitoring overhead under peaceful situation, and to fall
back to the LRU-list based page granularity reclamation when it doesn't
make progress, the three watermarks based activation mechanism is used,
with the free memory ratio as the watermark metric.
For convenient configurations, it provides several module parameters.
Using these, sysadmins can enable/disable it, and tune its parameters
including the coldness identification time threshold, the time/size
quotas and the three watermarks.
Evaluation
==========
In short, DAMON_RECLAIM with 50ms/s time quota and regions
prioritization on v5.15-rc5 Linux kernel with ZRAM swap device achieves
38.58% memory saving with only 1.94% runtime overhead. For this,
DAMON_RECLAIM consumes only 4.97% of single CPU time.
Setup
-----
We evaluate DAMON_RECLAIM to show how each of the DAMOS improvements
make effect. For this, we measure DAMON_RECLAIM's CPU consumption,
entire system memory footprint, total number of major page faults, and
runtime of 24 realistic workloads in PARSEC3 and SPLASH-2X benchmark
suites on my QEMU/KVM based virtual machine. The virtual machine runs
on an i3.metal AWS instance, has 130GiB memory, and runs a linux kernel
built on latest -mm tree[1] plus this patchset. It also utilizes a 4
GiB ZRAM swap device. We repeats the measurement 5 times and use
averages.
[1] https://github.com/hnaz/linux-mm/tree/v5.15-rc5-mmots-2021-10-13-19-55
Detailed Results
----------------
The results are summarized in the below table.
With coldness identification threshold of 5 seconds, DAMON_RECLAIM
without the time quota-based speed limit achieves 47.21% memory saving,
but incur 4.59% runtime slowdown to the workloads on average. For this,
DAMON_RECLAIM consumes about 11.28% single CPU time.
Applying time quotas of 200ms/s, 50ms/s, and 10ms/s without the regions
prioritization reduces the slowdown to 4.89%, 2.65%, and 1.5%,
respectively. Time quota of 200ms/s (20%) makes no real change compared
to the quota unapplied version, because the quota unapplied version
consumes only 11.28% CPU time. DAMON_RECLAIM's CPU utilization also
similarly reduced: 11.24%, 5.51%, and 2.01% of single CPU time. That
is, the overhead is proportional to the speed limit. Nevertheless, it
also reduces the memory saving because it becomes less aggressive. In
detail, the three variants show 48.76%, 37.83%, and 7.85% memory saving,
respectively.
Applying the regions prioritization (page out regions that not accessed
longer first within the time quota) further reduces the performance
degradation. Runtime slowdowns and total number of major page faults
increase has been 4.89%/218,690% -> 4.39%/166,136% (200ms/s),
2.65%/111,886% -> 1.94%/59,053% (50ms/s), and 1.5%/34,973.40% ->
2.08%/8,781.75% (10ms/s). The runtime under 10ms/s time quota has
increased with prioritization, but apparently that's under the margin of
error.
time quota prioritization memory_saving cpu_util slowdown pgmajfaults overhead
N N 47.21% 11.28% 4.59% 194,802%
200ms/s N 48.76% 11.24% 4.89% 218,690%
50ms/s N 37.83% 5.51% 2.65% 111,886%
10ms/s N 7.85% 2.01% 1.5% 34,793.40%
200ms/s Y 50.08% 10.38% 4.39% 166,136%
50ms/s Y 38.58% 4.97% 1.94% 59,053%
10ms/s Y 3.63% 1.73% 2.08% 8,781.75%
Baseline and Complete Git Trees
===============================
The patches are based on the latest -mm tree
(v5.15-rc5-mmots-2021-10-13-19-55). You can also clone the complete git tree
from:
$ git clone git://github.com/sjp38/linux -b damon_reclaim/patches/v1
The web is also available:
https://git.kernel.org/pub/scm/linux/kernel/git/sj/linux.git/tag/?h=damon_reclaim/patches/v1
Sequence Of Patches
===================
The first patch makes DAMOS support the physical address space for the
page out action. Following five patches (patches 2-6) implement the
time/size quotas. Next four patches (patches 7-10) implement the memory
regions prioritization within the limit. Then, three following patches
(patches 11-13) implement the watermarks-based schemes activation.
Finally, the last two patches (patches 14-15) implement and document the
DAMON-based reclamation using the advanced DAMOS.
[1] https://www.kernel.org/doc/html/v5.15-rc1/vm/damon/index.html
[2] https://research.google/pubs/pub48551/
[3] https://lwn.net/Articles/787611/
[4] https://damonitor.github.io
[5] https://damonitor.github.io/doc/html/latest/vm/damon/eval.html
[6] https://lore.kernel.org/linux-mm/20211001125604.29660-1-sj@kernel.org/
[7] https://github.com/awslabs/damoos
[8] https://www.kernel.org/doc/html/latest/vm/free_page_reporting.html
[9] https://www.usenix.org/conference/fast-04/car-clock-adaptive-replacement
This patch (of 15):
This makes the DAMON primitives for physical address space support the
pageout action for DAMON-based Operation Schemes. With this commit,
hence, users can easily implement system-level data access-aware
reclamations using DAMOS.
[sj@kernel.org: fix missing-prototype build warning]
Link: https://lkml.kernel.org/r/20211025064220.13904-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211019150731.16699-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211019150731.16699-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Marco Elver <elver@google.com>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Greg Thelen <gthelen@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes the 'damon-dbgfs' to support the physical memory monitoring,
in addition to the virtual memory monitoring.
Users can do the physical memory monitoring by writing a special
keyword, 'paddr' to the 'target_ids' debugfs file. Then, DAMON will
check the special keyword and configure the monitoring context to run
with the primitives for the physical address space.
Unlike the virtual memory monitoring, the monitoring target region will
not be automatically set. Therefore, users should also set the
monitoring target address region using the 'init_regions' debugfs file.
Also, note that the physical memory monitoring will not automatically
terminated. The user should explicitly turn off the monitoring by
writing 'off' to the 'monitor_on' debugfs file.
Link: https://lkml.kernel.org/r/20211012205711.29216-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rienjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "DAMON: Support Physical Memory Address Space Monitoring:.
DAMON currently supports only virtual address spaces monitoring. It can
be easily extended for various use cases and address spaces by
configuring its monitoring primitives layer to use appropriate
primitives implementations, though. This patchset implements monitoring
primitives for the physical address space monitoring using the
structure.
The first 3 patches allow the user space users manually set the
monitoring regions. The 1st patch implements the feature in the
'damon-dbgfs'. Then, patches for adding a unit tests (the 2nd patch)
and updating the documentation (the 3rd patch) follow.
Following 4 patches implement the physical address space monitoring
primitives. The 4th patch makes some primitive functions for the
virtual address spaces primitives reusable. The 5th patch implements
the physical address space monitoring primitives. The 6th patch links
the primitives to the 'damon-dbgfs'. Finally, 7th patch documents this
new features.
This patch (of 7):
Some 'damon-dbgfs' users would want to monitor only a part of the entire
virtual memory address space. The program interface users in the kernel
space could use '->before_start()' callback or set the regions inside
the context struct as they want, but 'damon-dbgfs' users cannot.
For that reason, this introduces a new debugfs file called
'init_region'. 'damon-dbgfs' users can specify which initial monitoring
target address regions they want by writing special input to the file.
The input should describe each region in each line in the below form:
<pid> <start address> <end address>
Note that the regions will be updated to cover entire memory mapped
regions after a 'regions update interval' is passed. If you want the
regions to not be updated after the initial setting, you could set the
interval as a very long time, say, a few decades.
Link: https://lkml.kernel.org/r/20211012205711.29216-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211012205711.29216-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Marco Elver <elver@google.com>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Greg Thelen <gthelen@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: David Rienjes <rientjes@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To tune the DAMON-based operation schemes, knowing how many and how
large regions are affected by each of the schemes will be helful. Those
stats could be used for not only the tuning, but also monitoring of the
working set size and the number of regions, if the scheme does not
change the program behavior too much.
For the reason, this implements the statistics for the schemes. The
total number and size of the regions that each scheme is applied are
exported to users via '->stat_count' and '->stat_sz' of 'struct damos'.
Admins can also check the number by reading 'schemes' debugfs file. The
last two integers now represents the stats. To allow collecting the
stats without changing the program behavior, this also adds new scheme
action, 'DAMOS_STAT'. Note that 'DAMOS_STAT' is not only making no
memory operation actions, but also does not reset the age of regions.
Link: https://lkml.kernel.org/r/20211001125604.29660-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rienjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In many cases, users might use DAMON for simple data access aware memory
management optimizations such as applying an operation scheme to a
memory region of a specific size having a specific access frequency for
a specific time. For example, "page out a memory region larger than 100
MiB but having a low access frequency more than 10 minutes", or "Use THP
for a memory region larger than 2 MiB having a high access frequency for
more than 2 seconds".
Most simple form of the solution would be doing offline data access
pattern profiling using DAMON and modifying the application source code
or system configuration based on the profiling results. Or, developing
a daemon constructed with two modules (one for access monitoring and the
other for applying memory management actions via mlock(), madvise(),
sysctl, etc) is imaginable.
To avoid users spending their time for implementation of such simple
data access monitoring-based operation schemes, this makes DAMON to
handle such schemes directly. With this change, users can simply
specify their desired schemes to DAMON. Then, DAMON will automatically
apply the schemes to the user-specified target processes.
Each of the schemes is composed with conditions for filtering of the
target memory regions and desired memory management action for the
target. Specifically, the format is::
<min/max size> <min/max access frequency> <min/max age> <action>
The filtering conditions are size of memory region, number of accesses
to the region monitored by DAMON, and the age of the region. The age of
region is incremented periodically but reset when its addresses or
access frequency has significantly changed or the action of a scheme was
applied. For the action, current implementation supports a few of
madvise()-like hints, ``WILLNEED``, ``COLD``, ``PAGEOUT``, ``HUGEPAGE``,
and ``NOHUGEPAGE``.
Because DAMON supports various address spaces and application of the
actions to a monitoring target region is dependent to the type of the
target address space, the application code should be implemented by each
primitives and registered to the framework. Note that this only
implements the framework part. Following commit will implement the
action applications for virtual address spaces primitives.
Link: https://lkml.kernel.org/r/20211001125604.29660-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rienjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>