If the bitmap block that manages the inode allocation status is corrupted,
nilfs_ifile_create_inode() may allocate a new inode from the reserved
inode area where it should not be allocated.
Previous fix commit d325dc6eb7 ("nilfs2: fix use-after-free bug of
struct nilfs_root"), fixed the problem that reserved inodes with inode
numbers less than NILFS_USER_INO (=11) were incorrectly reallocated due to
bitmap corruption, but since the start number of non-reserved inodes is
read from the super block and may change, in which case inode allocation
may occur from the extended reserved inode area.
If that happens, access to that inode will cause an IO error, causing the
file system to degrade to an error state.
Fix this potential issue by adding a wraparound option to the common
metadata object allocation routine and by modifying
nilfs_ifile_create_inode() to disable the option so that it only allocates
inodes with inode numbers greater than or equal to the inode number read
in "nilfs->ns_first_ino", regardless of the bitmap status of reserved
inodes.
Link: https://lkml.kernel.org/r/20240623051135.4180-4-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "nilfs2: fix potential issues related to reserved inodes".
This series fixes one use-after-free issue reported by syzbot, caused by
nilfs2's internal inode being exposed in the namespace on a corrupted
filesystem, and a couple of flaws that cause problems if the starting
number of non-reserved inodes written in the on-disk super block is
intentionally (or corruptly) changed from its default value.
This patch (of 3):
In the current implementation of nilfs2, "nilfs->ns_first_ino", which
gives the first non-reserved inode number, is read from the superblock,
but its lower limit is not checked.
As a result, if a number that overlaps with the inode number range of
reserved inodes such as the root directory or metadata files is set in the
super block parameter, the inode number test macros (NILFS_MDT_INODE and
NILFS_VALID_INODE) will not function properly.
In addition, these test macros use left bit-shift calculations using with
the inode number as the shift count via the BIT macro, but the result of a
shift calculation that exceeds the bit width of an integer is undefined in
the C specification, so if "ns_first_ino" is set to a large value other
than the default value NILFS_USER_INO (=11), the macros may potentially
malfunction depending on the environment.
Fix these issues by checking the lower bound of "nilfs->ns_first_ino" and
by preventing bit shifts equal to or greater than the NILFS_USER_INO
constant in the inode number test macros.
Also, change the type of "ns_first_ino" from signed integer to unsigned
integer to avoid the need for type casting in comparisons such as the
lower bound check introduced this time.
Link: https://lkml.kernel.org/r/20240623051135.4180-1-konishi.ryusuke@gmail.com
Link: https://lkml.kernel.org/r/20240623051135.4180-2-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The dirty throttling logic is interspersed with assumptions that dirty
limits in PAGE_SIZE units fit into 32-bit (so that various multiplications
fit into 64-bits). If limits end up being larger, we will hit overflows,
possible divisions by 0 etc. Fix these problems by never allowing so
large dirty limits as they have dubious practical value anyway. For
dirty_bytes / dirty_background_bytes interfaces we can just refuse to set
so large limits. For dirty_ratio / dirty_background_ratio it isn't so
simple as the dirty limit is computed from the amount of available memory
which can change due to memory hotplug etc. So when converting dirty
limits from ratios to numbers of pages, we just don't allow the result to
exceed UINT_MAX.
This is root-only triggerable problem which occurs when the operator
sets dirty limits to >16 TB.
Link: https://lkml.kernel.org/r/20240621144246.11148-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-By: Zach O'Keefe <zokeefe@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: Avoid possible overflows in dirty throttling".
Dirty throttling logic assumes dirty limits in page units fit into
32-bits. This patch series makes sure this is true (see patch 2/2 for
more details).
This patch (of 2):
This reverts commit 9319b64790.
The commit is broken in several ways. Firstly, the removed (u64) cast
from the multiplication will introduce a multiplication overflow on 32-bit
archs if wb_thresh * bg_thresh >= 1<<32 (which is actually common - the
default settings with 4GB of RAM will trigger this). Secondly, the
div64_u64() is unnecessarily expensive on 32-bit archs. We have
div64_ul() in case we want to be safe & cheap. Thirdly, if dirty
thresholds are larger than 1<<32 pages, then dirty balancing is going to
blow up in many other spectacular ways anyway so trying to fix one
possible overflow is just moot.
Link: https://lkml.kernel.org/r/20240621144017.30993-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20240621144246.11148-1-jack@suse.cz
Fixes: 9319b64790 ("mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-By: Zach O'Keefe <zokeefe@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON_LRU_SORT manually manipulates the DAMON context struct for online
parameters update. Since the struct contains not only input parameters
but also internal status and operation results, it is not that simple.
Indeed, we found and fixed a few bugs in the code. Now DAMON core layer
provides a function for the usage, namely damon_commit_ctx(). Replace the
manual manipulation logic with the function. The core layer function
could have its own bugs, but this change removes a source of bugs.
Link: https://lkml.kernel.org/r/20240618181809.82078-12-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON_RECLAIM manually manipulates the DAMON context struct for online
parameters update. Since the struct contains not only input parameters
but also internal status and operation results, it is not that simple.
Indeed, we found and fixed a few bugs in the code. Now DAMON core layer
provides a function for the usage, namely damon_commit_ctx(). Replace the
manual manipulation logic with the function. The core layer function
could have its own bugs, but this change removes a source of bugs.
Link: https://lkml.kernel.org/r/20240618181809.82078-10-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The functions were for updating DAMON structs that may or may not be
partially populated. Hence it was not for only adding items, but also
removing unnecessary items and updating items in-place. A previous commit
has changed the functions to assume the structs are not partially
populated, and do only adding items. Make the names better explain the
behavior.
Link: https://lkml.kernel.org/r/20240618181809.82078-9-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon/sysfs-schemes.c contains code for handling of online DAMON
parameters update edge cases. The logics are no more necessary since
damon_commit_ctx() and damon_commit_quota_goals() takes care of the cases.
Remove the unnecessary code.
Link: https://lkml.kernel.org/r/20240618181809.82078-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The function was for updating DAMON structs that may or may not be
partially populated. Hence it was not for only adding items, but also
removing unnecessary items and updating items in-place. A previous commit
has changed the function to assume the structs are not partially
populated, and do only adding items. Make the function name better
explain the behavior.
Link: https://lkml.kernel.org/r/20240618181809.82078-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON_SYSFS manually manipulates the DAMOS quota structs for online quotal
goals parameter update. Since the struct contains not only input
parameters but also internal status and operation results, it is not that
simple. Now DAMON core layer provides a function for the usage, namely
damon_commit_quota_goals(). Replace the manual manipulation logic with
the function. The core layer function could have its own bugs, but this
change removes a source of bugs.
Link: https://lkml.kernel.org/r/20240618181809.82078-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON_SYSFS manually manipulates DAMON context structs for online
parameters update. Since the struct contains not only input parameters
but also internal status and operation results, it is not that simple.
Indeed, we found and fixed a few bugs in the code. Now DAMON core layer
provides a function for the usage, namely damon_commit_ctx(). Replace the
manual manipulation logic with the function. The core layer function
could have its own bugs, but this change removes a source of bugs.
Link: https://lkml.kernel.org/r/20240618181809.82078-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Implement functions for supporting online DAMON context level parameters
update. The function receives two DAMON context structs. One is the
struct that currently being used by a kdamond and therefore to be updated.
The other one contains the parameters to be applied to the first one.
The function applies the new parameters to the destination struct while
keeping/updating the internal status and operation results. The function
should be called from DAMON context-update-safe place, like DAMON
callbacks.
Link: https://lkml.kernel.org/r/20240618181809.82078-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: introduce DAMON parameters online commit function".
DAMON context struct (damon_ctx) contains user requests (parameters),
internal status, and operation results. For flexible usages, DAMON API
users are encouraged to manually manipulate the struct. That works well
for simple use cases. However, it has turned out that it is not that
simple at least for online parameters udpate. It is easy to forget
properly maintaining internal status and operation results. Also, such
manual manipulation for online tuning is implemented multiple times on
DAMON API users including DAMON sysfs interface, DAMON_RECLAIM and
DAMON_LRU_SORT. As a result, we have multiple sources of bugs for same
problem. Actually we found and fixed a few bugs from online parameter
updating of DAMON API users.
Implement a function for online DAMON parameters update in core layer, and
replace DAMON API users' manual manipulation code for the use case. The
core layer function could still have bugs, but this change reduces the
source of bugs for the problem to one place.
This patch (of 12):
Implement functions for supporting online DAMOS quota goals parameters
update. The function receives two DAMOS quota structs. One is the struct
that currently being used by a kdamond and therefore to be updated. The
other one contains the parameters to be applied to the first one. The
function applies the new parameters to the destination struct while
keeping/updating the internal status. The function should be called from
parameters-update safe place, like DAMON callbacks.
Link: https://lkml.kernel.org/r/20240618181809.82078-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20240618181809.82078-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch introduces DAMOS_MIGRATE_HOT action, which is similar to
DAMOS_MIGRATE_COLD, but proritizes hot pages.
It migrates pages inside the given region to the 'target_nid' NUMA node
in the sysfs.
Here is one of the example usage of this 'migrate_hot' action.
$ cd /sys/kernel/mm/damon/admin/kdamonds/<N>
$ cat contexts/<N>/schemes/<N>/action
migrate_hot
$ echo 0 > contexts/<N>/schemes/<N>/target_nid
$ echo commit > state
$ numactl -p 2 ./hot_cold 500M 600M &
$ numastat -c -p hot_cold
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Node 2 Total
-------------- ------ ------ ------ -----
701 (hot_cold) 501 0 601 1101
Link: https://lkml.kernel.org/r/20240614030010.751-7-honggyu.kim@sk.com
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Gregory Price <gregory.price@memverge.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch introduces DAMOS_MIGRATE_COLD action, which is similar to
DAMOS_PAGEOUT, but migrate folios to the given 'target_nid' in the sysfs
instead of swapping them out.
The 'target_nid' sysfs knob informs the migration target node ID.
Here is one of the example usage of this 'migrate_cold' action.
$ cd /sys/kernel/mm/damon/admin/kdamonds/<N>
$ cat contexts/<N>/schemes/<N>/action
migrate_cold
$ echo 2 > contexts/<N>/schemes/<N>/target_nid
$ echo commit > state
$ numactl -p 0 ./hot_cold 500M 600M &
$ numastat -c -p hot_cold
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Node 2 Total
-------------- ------ ------ ------ -----
701 (hot_cold) 501 0 601 1101
Since there are some common routines with pageout, many functions have
similar logics between pageout and migrate cold.
damon_pa_migrate_folio_list() is a minimized version of
shrink_folio_list().
Link: https://lkml.kernel.org/r/20240614030010.751-6-honggyu.kim@sk.com
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Gregory Price <gregory.price@memverge.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "DAMON based tiered memory management for CXL memory", v6.
Introduction
============
With the advent of CXL/PCIe attached DRAM, which will be called simply as
CXL memory in this cover letter, some systems are becoming more
heterogeneous having memory systems with different latency and bandwidth
characteristics. They are usually handled as different NUMA nodes in
separate memory tiers and CXL memory is used as slow tiers because of its
protocol overhead compared to local DRAM.
In this kind of systems, we need to be careful placing memory pages on
proper NUMA nodes based on the memory access frequency. Otherwise, some
frequently accessed pages might reside on slow tiers and it makes
performance degradation unexpectedly. Moreover, the memory access
patterns can be changed at runtime.
To handle this problem, we need a way to monitor the memory access
patterns and migrate pages based on their access temperature. The
DAMON(Data Access MONitor) framework and its DAMOS(DAMON-based Operation
Schemes) can be useful features for monitoring and migrating pages. DAMOS
provides multiple actions based on DAMON monitoring results and it can be
used for proactive reclaim, which means swapping cold pages out with
DAMOS_PAGEOUT action, but it doesn't support migration actions such as
demotion and promotion between tiered memory nodes.
This series supports two new DAMOS actions; DAMOS_MIGRATE_HOT for
promotion from slow tiers and DAMOS_MIGRATE_COLD for demotion from fast
tiers. This prevents hot pages from being stuck on slow tiers, which
makes performance degradation and cold pages can be proactively demoted to
slow tiers so that the system can increase the chance to allocate more hot
pages to fast tiers.
The DAMON provides various tuning knobs but we found that the proactive
demotion for cold pages is especially useful when the system is running
out of memory on its fast tier nodes.
Our evaluation result shows that it reduces the performance slowdown
compared to the default memory policy from 11% to 3~5% when the system
runs under high memory pressure on its fast tier DRAM nodes.
DAMON configuration
===================
The specific DAMON configuration doesn't have to be in the scope of this
patch series, but some rough idea is better to be shared to explain the
evaluation result.
The DAMON provides many knobs for fine tuning but its configuration file
is generated by HMSDK[3]. It includes gen_config.py script that generates
a json file with the full config of DAMON knobs and it creates multiple
kdamonds for each NUMA node when the DAMON is enabled so that it can run
hot/cold based migration for tiered memory.
Evaluation Workload
===================
The performance evaluation is done with redis[4], which is a widely used
in-memory database and the memory access patterns are generated via
YCSB[5]. We have measured two different workloads with zipfian and latest
distributions but their configs are slightly modified to make memory usage
higher and execution time longer for better evaluation.
The idea of evaluation using these migrate_{hot,cold} actions covers
system-wide memory management rather than partitioning hot/cold pages of a
single workload. The default memory allocation policy creates pages to
the fast tier DRAM node first, then allocates newly created pages to the
slow tier CXL node when the DRAM node has insufficient free space. Once
the page allocation is done then those pages never move between NUMA
nodes. It's not true when using numa balancing, but it is not the scope
of this DAMON based tiered memory management support.
If the working set of redis can be fit fully into the DRAM node, then the
redis will access the fast DRAM only. Since the performance of DRAM only
is faster than partially accessing CXL memory in slow tiers, this
environment is not useful to evaluate this patch series.
To make pages of redis be distributed across fast DRAM node and slow CXL
node to evaluate our migrate_{hot,cold} actions, we pre-allocate some cold
memory externally using mmap and memset before launching redis-server. We
assumed that there are enough amount of cold memory in datacenters as
TMO[6] and TPP[7] papers mentioned.
The evaluation sequence is as follows.
1. Turn on DAMON with DAMOS_MIGRATE_COLD action for DRAM node and
DAMOS_MIGRATE_HOT action for CXL node. It demotes cold pages on DRAM
node and promotes hot pages on CXL node in a regular interval.
2. Allocate a huge block of cold memory by calling mmap and memset at
the fast tier DRAM node, then make the process sleep to make the fast
tier has insufficient space for redis-server.
3. Launch redis-server and load prebaked snapshot image, dump.rdb. The
redis-server consumes 52GB of anon pages and 33GB of file pages, but
due to the cold memory allocated at 2, it fails allocating the entire
memory of redis-server on the fast tier DRAM node so it partially
allocates the remaining on the slow tier CXL node. The ratio of
DRAM:CXL depends on the size of the pre-allocated cold memory.
4. Run YCSB to make zipfian or latest distribution of memory accesses to
redis-server, then measure its execution time when it's completed.
5. Repeat 4 over 50 times to measure the average execution time for each
run.
6. Increase the cold memory size then repeat goes to 2.
For each test at 4 took about a minute so repeating it 50 times almost
took about 1 hour for each test with a specific cold memory from 440GB to
500GB in 10GB increments for each evaluation. So it took about more than
10 hours for both zipfian and latest workloads to get the entire
evaluation results. Repeating the same test set multiple times doesn't
show much difference so I think it might be enough to make the result
reliable.
Evaluation Results
==================
All the result values are normalized to DRAM-only execution time because
the workload cannot be faster than DRAM-only unless the workload hits the
peak bandwidth but our redis test doesn't go beyond the bandwidth limit.
So the DRAM-only execution time is the ideal result without affected by
the gap between DRAM and CXL performance difference. The NUMA node
environment is as follows.
node0 - local DRAM, 512GB with a CPU socket (fast tier)
node1 - disabled
node2 - CXL DRAM, 96GB, no CPU attached (slow tier)
The following is the result of generating zipfian distribution to
redis-server and the numbers are averaged by 50 times of execution.
1. YCSB zipfian distribution read only workload
memory pressure with cold memory on node0 with 512GB of local DRAM.
====================+================================================+=========
| cold memory occupied by mmap and memset |
| 0G 440G 450G 460G 470G 480G 490G 500G |
====================+================================================+=========
Execution time normalized to DRAM-only values | GEOMEAN
--------------------+------------------------------------------------+---------
DRAM-only | 1.00 - - - - - - - | 1.00
CXL-only | 1.19 - - - - - - - | 1.19
default | - 1.00 1.05 1.08 1.12 1.14 1.18 1.18 | 1.11
DAMON tiered | - 1.03 1.03 1.03 1.03 1.03 1.07 *1.05 | 1.04
DAMON lazy | - 1.04 1.03 1.04 1.05 1.06 1.06 *1.06 | 1.05
====================+================================================+=========
CXL usage of redis-server in GB | AVERAGE
--------------------+------------------------------------------------+---------
DRAM-only | 0.0 - - - - - - - | 0.0
CXL-only | 51.4 - - - - - - - | 51.4
default | - 0.6 10.6 20.5 30.5 40.5 47.6 50.4 | 28.7
DAMON tiered | - 0.6 0.5 0.4 0.7 0.8 7.1 5.6 | 2.2
DAMON lazy | - 0.5 3.0 4.5 5.4 6.4 9.4 9.1 | 5.5
====================+================================================+=========
Each test result is based on the execution environment as follows.
DRAM-only: redis-server uses only local DRAM memory.
CXL-only: redis-server uses only CXL memory.
default: default memory policy(MPOL_DEFAULT).
numa balancing disabled.
DAMON tiered: DAMON enabled with DAMOS_MIGRATE_COLD for DRAM
nodes and DAMOS_MIGRATE_HOT for CXL nodes.
DAMON lazy: same as DAMON tiered, but turn on DAMON just
before making memory access request via YCSB.
The above result shows the "default" execution time goes up as the size of
cold memory is increased from 440G to 500G because the more cold memory
used, the more CXL memory is used for the target redis workload and this
makes the execution time increase.
However, "DAMON tiered" and other DAMON results show less slowdown because
the DAMOS_MIGRATE_COLD action at DRAM node proactively demotes
pre-allocated cold memory to CXL node and this free space at DRAM
increases more chance to allocate hot or warm pages of redis-server to
fast DRAM node. Moreover, DAMOS_MIGRATE_HOT action at CXL node also
promotes hot pages of redis-server to DRAM node actively.
As a result, it makes more memory of redis-server stay in DRAM node
compared to "default" memory policy and this makes the performance
improvement.
Please note that the result numbers of "DAMON tiered" and "DAMON lazy" at
500G are marked with * stars, which means their test results are replaced
with reproduced tests that didn't have OOM issue.
That was needed because sometimes the test processes get OOM when DRAM has
insufficient space. The DAMOS_MIGRATE_HOT doesn't kick reclaim but just
gives up migration when there is not enough space at DRAM side. The
problem happens when there is competition between normal allocation and
migration and the migration is done before normal allocation, then the
completely unrelated normal allocation can trigger reclaim, which incurs
OOM.
Because of this issue, I have also tested more cases with
"demotion_enabled" flag enabled to make such reclaim doesn't trigger OOM,
but just demote reclaimed pages. The following test results show more
tests with "kswapd" marked.
2. YCSB zipfian distribution read only workload (with demotion_enabled true)
memory pressure with cold memory on node0 with 512GB of local DRAM.
====================+================================================+=========
| cold memory occupied by mmap and memset |
| 0G 440G 450G 460G 470G 480G 490G 500G |
====================+================================================+=========
Execution time normalized to DRAM-only values | GEOMEAN
--------------------+------------------------------------------------+---------
DAMON tiered | - 1.03 1.03 1.03 1.03 1.03 1.07 1.05 | 1.04
DAMON lazy | - 1.04 1.03 1.04 1.05 1.06 1.06 1.06 | 1.05
DAMON tiered kswapd | - 1.03 1.03 1.03 1.03 1.02 1.02 1.03 | 1.03
DAMON lazy kswapd | - 1.04 1.04 1.04 1.03 1.05 1.04 1.05 | 1.04
====================+================================================+=========
CXL usage of redis-server in GB | AVERAGE
--------------------+------------------------------------------------+---------
DAMON tiered | - 0.6 0.5 0.4 0.7 0.8 7.1 5.6 | 2.2
DAMON lazy | - 0.5 3.0 4.5 5.4 6.4 9.4 9.1 | 5.5
DAMON tiered kswapd | - 0.0 0.0 0.4 0.5 0.1 0.8 1.0 | 0.4
DAMON lazy kswapd | - 4.2 4.6 5.3 1.7 6.8 8.1 5.8 | 5.2
====================+================================================+=========
Each test result is based on the exeuction environment as follows.
DAMON tiered: same as before
DAMON lazy: same as before
DAMON tiered kswapd: same as DAMON tiered, but turn on
/sys/kernel/mm/numa/demotion_enabled to make
kswapd or direct reclaim does demotion.
DAMON lazy kswapd: same as DAMON lazy, but turn on
/sys/kernel/mm/numa/demotion_enabled to make
kswapd or direct reclaim does demotion.
The "DAMON tiered kswapd" and "DAMON lazy kswapd" didn't trigger OOM at
all unlike other tests because kswapd and direct reclaim from DRAM node
can demote reclaimed pages to CXL node independently from DAMON actions
and their results are slightly better than without having
"demotion_enabled".
In summary, the evaluation results show that DAMON memory management with
DAMOS_MIGRATE_{HOT,COLD} actions reduces the performance slowdown compared
to the "default" memory policy from 11% to 3~5% when the system runs with
high memory pressure on its fast tier DRAM nodes.
Having these DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD actions can make
tiered memory systems run more efficiently under high memory pressures.
This patch (of 7):
The alloc_demote_folio can be used out of vmscan.c so it'd be better to
remove static keyword from it.
Link: https://lkml.kernel.org/r/20240614030010.751-1-honggyu.kim@sk.com
Link: https://lkml.kernel.org/r/20240614030010.751-2-honggyu.kim@sk.com
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Gregory Price <gregory.price@memverge.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Hyeongtak Ji <hyeongtak.ji@sk.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Unlike other cgroup subsystems, the hugetlb cgroup does not provide a
static array of cftype that explicitly displays the properties, handling
functions, etc., of each file. Instead, it dynamically creates the
attribute of cftypes based on the hstate during the startup procedure.
This reduces the readability of the code.
To fix this issue, introduce two templates of cftypes, and rebuild the
attributes according to the hstate to make it ready to be added to cgroup
framework.
Link: https://lkml.kernel.org/r/20240612092409.2027592-3-xiujianfeng@huawei.com
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: kernel test robot <oliver.sang@intel.com>
From: Xiu Jianfeng <xiujianfeng@huawei.com>
Subject: mm/hugetlb_cgroup: register lockdep key for cftype
Date: Tue, 18 Jun 2024 07:19:22 +0000
When CONFIG_DEBUG_LOCK_ALLOC is enabled, the following commands can
trigger a bug,
mount -t cgroup2 none /sys/fs/cgroup
cd /sys/fs/cgroup
echo "+hugetlb" > cgroup.subtree_control
The log is as below:
BUG: key ffff8880046d88d8 has not been registered!
------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(1)
WARNING: CPU: 3 PID: 226 at kernel/locking/lockdep.c:4945 lockdep_init_map_type+0x185/0x220
Modules linked in:
CPU: 3 PID: 226 Comm: bash Not tainted 6.10.0-rc4-next-20240617-g76db4c64526c #544
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:lockdep_init_map_type+0x185/0x220
Code: 00 85 c0 0f 84 6c ff ff ff 8b 3d 6a d1 85 01 85 ff 0f 85 5e ff ff ff 48 c7 c6 21 99 4a 82 48 c7 c7 60 29 49 82 e8 3b 2e f5
RSP: 0018:ffffc9000083fc30 EFLAGS: 00000282
RAX: 0000000000000000 RBX: ffffffff828dd820 RCX: 0000000000000027
RDX: ffff88803cd9cac8 RSI: 0000000000000001 RDI: ffff88803cd9cac0
RBP: ffff88800674fbb0 R08: ffffffff828ce248 R09: 00000000ffffefff
R10: ffffffff8285e260 R11: ffffffff828b8eb8 R12: ffff8880046d88d8
R13: 0000000000000000 R14: 0000000000000000 R15: ffff8880067281c0
FS: 00007f68601ea740(0000) GS:ffff88803cd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005614f3ebc740 CR3: 000000000773a000 CR4: 00000000000006f0
Call Trace:
<TASK>
? __warn+0x77/0xd0
? lockdep_init_map_type+0x185/0x220
? report_bug+0x189/0x1a0
? handle_bug+0x3c/0x70
? exc_invalid_op+0x18/0x70
? asm_exc_invalid_op+0x1a/0x20
? lockdep_init_map_type+0x185/0x220
__kernfs_create_file+0x79/0x100
cgroup_addrm_files+0x163/0x380
? find_held_lock+0x2b/0x80
? find_held_lock+0x2b/0x80
? find_held_lock+0x2b/0x80
css_populate_dir+0x73/0x180
cgroup_apply_control_enable+0x12f/0x3a0
cgroup_subtree_control_write+0x30b/0x440
kernfs_fop_write_iter+0x13a/0x1f0
vfs_write+0x341/0x450
ksys_write+0x64/0xe0
do_syscall_64+0x4b/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f68602d9833
Code: 8b 15 61 26 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 08
RSP: 002b:00007fff9bbdf8e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007f68602d9833
RDX: 0000000000000009 RSI: 00005614f3ebc740 RDI: 0000000000000001
RBP: 00005614f3ebc740 R08: 000000000000000a R09: 0000000000000008
R10: 00005614f3db6ba0 R11: 0000000000000246 R12: 0000000000000009
R13: 00007f68603bd6a0 R14: 0000000000000009 R15: 00007f68603b8880
For lockdep, there is a sanity check in lockdep_init_map_type(), the
lock-class key must either have been allocated statically or must
have been registered as a dynamic key. However the commit e18df2889ff9
("mm/hugetlb_cgroup: prepare cftypes based on template") has changed
the cftypes from static allocated objects to dynamic allocated objects,
so the cft->lockdep_key must be registered proactively.
[xiujianfeng@huawei.com: fix BUG()]
Link: https://lkml.kernel.org/r/20240619015527.2212698-1-xiujianfeng@huawei.com
Link: https://lkml.kernel.org/r/20240618071922.2127289-1-xiujianfeng@huawei.com
Link: https://lore.kernel.org/all/602186b3-5ce3-41b3-90a3-134792cc2a48@samsung.com/
Fixes: e18df2889ff9 ("mm/hugetlb_cgroup: prepare cftypes based on template")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202406181046.8d8b2492-oliver.sang@intel.com
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: SeongJae Park <sj@kernel.org>
Closes: https://lore.kernel.org/20240618233608.400367-1-sj@kernel.org
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Today, we do not have any observability of per-page metadata and how much
it takes away from the machine capacity. Thus, we want to describe the
amount of memory that is going towards per-page metadata, which can vary
depending on build configuration, machine architecture, and system use.
This patch adds 2 fields to /proc/vmstat that can used as shown below:
Accounting per-page metadata allocated by boot-allocator:
/proc/vmstat:nr_memmap_boot * PAGE_SIZE
Accounting per-page metadata allocated by buddy-allocator:
/proc/vmstat:nr_memmap * PAGE_SIZE
Accounting total Perpage metadata allocated on the machine:
(/proc/vmstat:nr_memmap_boot +
/proc/vmstat:nr_memmap) * PAGE_SIZE
Utility for userspace:
Observability: Describe the amount of memory overhead that is going to
per-page metadata on the system at any given time since this overhead is
not currently observable.
Debugging: Tracking the changes or absolute value in struct pages can help
detect anomalies as they can be correlated with other metrics in the
machine (e.g., memtotal, number of huge pages, etc).
page_ext overheads: Some kernel features such as page_owner
page_table_check that use page_ext can be optionally enabled via kernel
parameters. Having the total per-page metadata information helps users
precisely measure impact. Furthermore, page-metadata metrics will reflect
the amount of struct pages reliquished (or overhead reduced) when
hugetlbfs pages are reserved which will vary depending on whether hugetlb
vmemmap optimization is enabled or not.
For background and results see:
lore.kernel.org/all/20240220214558.3377482-1-souravpanda@google.com
Link: https://lkml.kernel.org/r/20240605222751.1406125-1-souravpanda@google.com
Signed-off-by: Sourav Panda <souravpanda@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Chen Linxuan <chenlinxuan@uniontech.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tomas Mudrunka <tomas.mudrunka@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Xu <weixugc@google.com>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
thuge-gen.c defines SHM_HUGE_* macros that are provided by the uapi since
4.14. These macros get redefined when compiling with Android's bionic
because its sys/shm.h will import the uapi definitions.
However if linux/shm.h is included, with glibc, sys/shm.h will clash on
some struct definitions:
/usr/include/linux/shm.h:26:8: error: redefinition of ‘struct shmid_ds’
26 | struct shmid_ds {
| ^~~~~~~~
In file included from /usr/include/x86_64-linux-gnu/bits/shm.h:45,
from /usr/include/x86_64-linux-gnu/sys/shm.h:30:
/usr/include/x86_64-linux-gnu/bits/types/struct_shmid_ds.h:24:8: note: originally defined here
24 | struct shmid_ds
| ^~~~~~~~
For now, guard the SHM_HUGE_* defines with ifndef to prevent redefinition
warnings on Android bionic.
Link: https://lkml.kernel.org/r/20240605223637.1374969-3-edliaw@google.com
Signed-off-by: Edward Liaw <edliaw@google.com>
Reviewed-by: Carlos Llamas <cmllamas@google.com>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Bill Wendling <morbo@google.com>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>