mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2026-05-04 08:04:24 -04:00
82a08bde3cf7bbbe90de57baa181bebf676582c7
1353111 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
82a08bde3c |
samples/damon: implement a DAMON module for memory tiering
Implement a sample DAMON module that shows how self-tuned DAMON-based memory tiering can be written. It is a sample since the purpose is to give an idea about how it can be implemented and perform, rather than be used on general production setups. Especially, it supports only two tiers memory setup having only one CPU-attached NUMA node. [sj@kernel.org: fix wrong DAMON attrs setting] Link: https://lkml.kernel.org/r/20250510220932.47722-1-sj@kernel.org [sj@kernel.org: trigger build even if only mtier is enabled] Link: https://lkml.kernel.org/r/20250426184054.11437-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250420194030.75838-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Yunjeong Mun <yunjeong.mun@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
f77cb46226 |
Docs/ABI/damon: document nid file
Add a description of 'nid' file, which is optionally used for specific
DAMOS quota goal metrics such as node_mem_{used,free}_bp on the DAMON
sysfs ABI document.
Link: https://lkml.kernel.org/r/20250420194030.75838-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Yunjeong Mun <yunjeong.mun@sk.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
a7bb1e7545 |
Docs/admin-guide/mm/damon/usage: document 'nid' file
Add description of 'nid' file, which is optionally used for specific DAMOS
quota goal metrics such as node_mem_{used,free}_bp on DAMON usage
document.
Link: https://lkml.kernel.org/r/20250420194030.75838-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Yunjeong Mun <yunjeong.mun@sk.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
b3b95a3594 |
Docs/mm/damon/design: document node_mem_{used,free}_bp
Add description of DAMOS quota goal metrics for NUMA node utilization on the DAMON deesign document. Link: https://lkml.kernel.org/r/20250420194030.75838-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Yunjeong Mun <yunjeong.mun@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
85fcf0ffc4 |
mm/damon/sysfs-schemes: connect damos_quota_goal nid with core layer
DAMON sysfs interface file for DAMOS quota goal's node id argument is not passed to core layer. Implement the link. Link: https://lkml.kernel.org/r/20250420194030.75838-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Yunjeong Mun <yunjeong.mun@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
0fbd59379d |
mm/damon/sysfs-schemes: implement file for quota goal nid parameter
DAMOS_QUOTA_NODE_MEM_{USED,FREE}_BP DAMOS quota goal metrics require the
node id parameter. However, there is no DAMON user ABI for setting it.
Implement a DAMON sysfs file for that with name 'nid', under the quota
goal directory.
Link: https://lkml.kernel.org/r/20250420194030.75838-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Yunjeong Mun <yunjeong.mun@sk.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
0e1c773b50 |
mm/damon/core: introduce damos quota goal metrics for memory node utilization
Patch series "mm/damon: auto-tune DAMOS for NUMA setups including tiered
memory".
Utilizing DAMON for memory tiering usually requires manual tuning and/or
tedious controls. Let it self-tune hotness and coldness thresholds for
promotion and demotion aiming high utilization of high memory tiers, by
introducing new DAMOS quota goal metrics representing the used and the
free memory ratios of specific NUMA nodes. And introduce a sample DAMON
module that demonstrates how the new feature can be used for memory
tiering use cases.
Backgrounds
===========
A type of tiered memory system exposes the memory tiers as NUMA nodes. A
straightforward pages placement strategy for such systems is placing
access-hot and cold pages on upper and lower tiers, reespectively,
pursuing higher utilization of upper tiers. Since access temperature can
be dynamic, periodically finding and migrating hot pages and cold pages to
proper tiers (promoting and demoting) is also required. Linux kernel
provides several features for such dynamic and transparent pages
placement.
Page Faults and LRU
-------------------
One widely known way is using NUMA balancing in tiering mode (a.k.a
NUMAB-2) and reclaim-based demotion features. In the setup, NUMAB-2 finds
hot pages using access check-purpose page faults (a.k.a prot_none) and
promote those inside each process' context, until there is no more pages
to promote, or the upper tier is filled up and memory pressure happens.
In the latter case, LRU-based reclaim logic wakes up as a response to the
memory pressure and demotes cold pages to lower tiers in asynchronous
(kswapd) and/or synchronous ways (direct reclaim).
DAMON
-----
Yet another available solution is using DAMOS with migrate_hot and
migrate_cold DAMOS actions for promotions and demotions, respectively. To
make it optimum, users need to specify aggressiveness and access
temperature thresholds for promotions and demotions in a good balance that
results in high utilization of upper tiers. The number of parameters is
not small, and optimum parameter values depend on characteristics of the
underlying hardware and the workload. As a result, it often requires
manual, time consuming and repetitive tuning of the DAMOS schemes for
given workloads and systems combinations.
Self-tuned DAMON-based Memory Tiering
=====================================
To solve such manual tuning problems, DAMOS provides aim-oriented
feedback-driven quotas self-tuning. Using the feature, we design a
self-tuned DAMON-based memory tiering for general multi-tier memory
systems.
For each memory tier node, if it has a lower tier, run a DAMOS scheme that
demotes cold pages of the node, auto-tuning the aggressiveness aiming an
amount of free space of the node. The free space is for keeping the
headroom that avoids significant memory pressure during upper tier memory
usage spike, and promoting hot pages from the lower tier.
For each memory tier node, if it has an upper tier, run a DAMOS scheme
that promotes hot pages of the current node to the upper tier, auto-tuning
the aggressiveness aiming a high utilization ratio of the upper tier. The
target ratio is to ensure higher tiers are utilized as much as possible.
It should match with the headroom for demotion scheme, but have slight
overlap, to ensure promotion and demotion are not entirely stopped.
The aim-oriented aggressiveness auto-tuning of DAMOS is already available.
Hence, to make such tiering solution implementation, only new quota goal
metrics for utilization and free space ratio of specific NUMA node need to
be developed.
Discussions
===========
The design imposes below discussion points.
Expected Behaviors
------------------
The system will let upper tier memory node accommodates as many hot data
as possible. If total amount of the data is less than the top tier
memory's promotion/demotion target utilization, entire data will be just
placed on the top tier. Promotion scheme will do nothing since there is
no data to promote. Demotion scheme will also do nothing since the free
space ratio of the top tier is higher than the goal.
Only if the amount of data is larger than the top tier's utilization
ratio, demotion scheme will demote cold pages and ensure the headroom free
space. Since the promotion and demotion schemes for a single node has
small overlap at their target utilization and free space goals, promotions
and demotions will continue working with a moderate aggressiveness level.
It will keep all data is placed on access hotness under dynamic access
pattern, while minimizing the migration overhead.
In any case, each node will keep headroom free space and as many upper
tiers are utilized as possible.
Ease of Use
-----------
Users still need to set the target utilization and free space ratio, but
it will be easier to set. We argue 99.7 % utilization and 0.5 % free
space ratios can be good default values. It can be easily adjusted based
on desired headroom size of given use case. Users are also still required
to answer the minimum coldness and hotness thresholds. Together with
monitoring intervals auto-tuning[2], DAMON will always show meaningful
amount of hot and cold memory. And DAMOS quota's prioritization mechanism
will make good decision as long as the source information is that
colorful. Hence, users can very naively set the minimum criterias. We
believe any access observation and no access observation within last one
aggregation interval is enough for minimum hot and cold regions criterias.
General Tiered Memory Setup Applicability
-----------------------------------------
The design can be applied to any number of tiers having any performance
characteristics, as long as they can be hierarchical. Hence, applying the
system to different tiered memory system will be straightforward. Note
that this assumes only single CPU NUMA node case. Because today's DAMON
is not aware of which CPU made each access, applying this on systems
having multiple CPU NUMA nodes can be complicated. We are planning to
extend DAMON for the use case, but that's out of the scope of this patch
series.
How To Use
----------
Users can implement the auto-tuned DAMON-based memory tiering using DAMON
sysfs interface. It can be easily done using DAMON user-space tool like
user-space tool. Below evaluation results section shows an example DAMON
user-space tool command for that.
For wider and simpler deployment, having a kernel module that sets up and
run the DAMOS schemes via DAMON kernel API can be useful. The module can
enable the memory tiering at boot time via kernel command line parameter
or at run time with single command. This patch series implements a sample
DAMON kernel module that shows how such module can be implemented.
Comparison To Page Faults and LRU-based Approaches
--------------------------------------------------
The existing page faults based promotion (NUMAB-2) does hot pages
detection and migration in the process context. When there are many pages
to promote, it can block the progress of the application's real works.
DAMOS works in asynchronous worker thread, so it doesn't block the real
works.
NUMAB-2 doesn't provide a way to control aggressiveness of promotion other
than the maximum amount of pages to promote per given time widnow. If hot
pages are found, promotions can happen in the upper-bound speed,
regardless of upper tier's memory pressure. If the maximum speed is not
well set for the given workload, it can result in slow promotion or
unnecessary memory pressure. Self-tuned DAMON-based memory tiering
alleviates the problem by adjusting the speed based on current utilization
of the upper tier.
LRU-based demotion can be triggered in both asynchronous (kswapd) and
synchronous (direct reclaim) ways. Other than the way of finding cold
pages, asynchronous LRU-based demotion and DAMON-based demotion has no big
difference. DAMON-based demotion can make a better balancing with
DAMON-based promotion, though. The LRU-based demotion can do better than
DAMON-based demotion when the tier is having significant memory pressure.
It would be wise to use DAMON-based demotion as a proactive and primary
one, but utilizing LRU-based demotions together as a fast backup solution.
Evaluation
==========
In short, under a setup that requires fast and frequent promotions,
self-tuned DAMON-based memory tiering's hot pages promotion improves
performance about 4.42 %. We believe this shows self-tuned DAMON-based
promotion's effectiveness. Meanwhile, NUMAB-2's hot pages promotion
degrades the performance about 7.34 %. We suspect the degradation is
mostly due to NUMAB-2's synchronous nature that can block the
application's progress, which highlights the advantage of DAMON-based
solution's asynchronous nature.
Note that the test was done with the RFC version of this patch series. We
don't run it again since this patch series got no meaningful change after
the RFC, while the test takes pretty long time.
Setup
-----
Hardware. Use a machine that equips 250 GiB DRAM memory tier and 50 GiB
CXL memory tier. The tiers are exposed as NUMA nodes 0 and 1,
respectively.
Kernel. Use Linux kernel v6.13 that modified as following. Add all DAMON
patches that available on mm tree of 2025-03-15, and this patch series.
Also modify it to ignore mempolicy() system calls, to avoid bad effects
from application's traditional NUMA systems assumed optimizations.
Workload. Use a modified version of Taobench benchmark[3] that available
on DCPerf benchmark suite. It represents an in-memory caching workload.
We set its 'memsize', 'warmup_time', and 'test_time' parameter as 340 GiB,
2,500 seconds and 1,440 seconds. The parameters are chosen to ensure the
workload uses more than DRAM memory tier. Its RSS under the parameter
grows to 270 GiB within the warmup time.
It turned out the workload has a very static access pattrn. Only about 13
% of the RSS is frequently accessed from the beginning to end. Hence
promotion shows no meaningful performance difference regardless of
different design and implementations. We therefore modify the kernel to
periodically demote up to 10 GiB hot pages and promote up to 10 GiB cold
pages once per minute. The intention is to simulate periodic access
pattern changes. The hotness and coldness threshold is very naively set
so that it is more like random access pattern change rather than strict
hot/cold pages exchange. This is why we call the workload as "modified".
It is implemented as two DAMOS schemes each running on an asynchronous
thread. It can be reproduced with DAMON user-space tool like below.
# ./damo start \
--ops paddr --numa_node 0 --monitoring_intervals 10s 200s 200s \
--damos_action migrate_hot 1 \
--damos_quota_interval 60s --damos_quota_space 10G \
--ops paddr --numa_node 1 --monitoring_intervals 10s 200s 200s \
--damos_action migrate_cold 0 \
--damos_quota_interval 60s --damos_quota_space 10G \
--nr_schemes 1 1 --nr_targets 1 1 --nr_ctxs 1 1
System configurations. Use below variant system configurations.
- Baseline. No memory tiering features are turned on.
- Numab_tiering. On the baseline, enable NUMAB-2 and relcaim-based
demotion. In detail, following command is executed:
echo 2 > /proc/sys/kernel/numa_balancing;
echo 1 > /sys/kernel/mm/numa/demotion_enabled;
echo 7 > /proc/sys/vm/zone_reclaim_mode
- DAMON_tiering. On the baseline, utilize self-tuned DAMON-based memory
tiering implementation via DAMON user-space tool. It utilizes two
kernel threads, namely promotion thread and demotion thread. Demotion
thread monitors access pattern of DRAM node using DAMON with
auto-tuned monitoring intervals aiming 4% DAMON-observed access ratio,
and demote coldest pages up to 200 MiB per second aiming 0.5% free
space of DRAM node. Promotion thread monitors CXL node using same
intervals auto-tuning, and promote hot pages in same way but aiming
for 99.7% utilization of DRAM node. Because DAMON provides only
best-effort accuracy, add young page DAMOS filters to allow only and
reject all young pages at promoting and demoting, respectively. It
can be reproduced with DAMON user-space tool like below.
# ./damo start \
--numa_node 0 --monitoring_intervals_goal 4% 3 5ms 10s \
--damos_action migrate_cold 1 --damos_access_rate 0% 0% \
--damos_apply_interval 1s \
--damos_quota_interval 1s --damos_quota_space 200MB \
--damos_quota_goal node_mem_free_bp 0.5% 0 \
--damos_filter reject young \
--numa_node 1 --monitoring_intervals_goal 4% 3 5ms 10s \
--damos_action migrate_hot 0 --damos_access_rate 5% max \
--damos_apply_interval 1s \
--damos_quota_interval 1s --damos_quota_space 200MB \
--damos_quota_goal node_mem_used_bp 99.7% 0 \
--damos_filter allow young \
--damos_nr_quota_goals 1 1 --damos_nr_filters 1 1 \
--nr_targets 1 1 --nr_schemes 1 1 --nr_ctxs 1 1
Measurment Results
------------------
On each system configuration, run the modified version of Taobench and
collect 'score'. 'score' is a metric that calculated and provided by
Taobench to represents the performance of the run on the system. To
handle the measurement errors, repeat the measurement five times. The
results are as below.
Config Score Stdev (%) Normalized
Baseline 1.6165 0.0319 1.9764 1.0000
Numab_tiering 1.4976 0.0452 3.0209 0.9264
DAMON_tiering 1.6881 0.0249 1.4767 1.0443
'Config' column shows the system config of the measurement. 'Score'
column shows the 'score' measurement in average of the five runs on the
system config. 'Stdev' column shows the standsard deviation of the five
measurements of the scores. '(%)' column shows the 'Stdev' to 'Score'
ratio in percentage. Finally, 'Normalized' column shows the averaged
score values of the configs that normalized to that of 'Baseline'.
The periodic hot pages demotion and cold pages promotion that was
conducted to simulate dynamic access pattern was started from the
beginning of the workload. It resulted in the DRAM tier utilization
always under the watermark, and hence no real demotion was happened for
all test runs. This means the above results show no difference between
LRU-based and DAMON-based demotions. Only difference between NUMAB-2 and
DAMON-based promotions are represented on the results.
Numab_tiering config degraded the performance about 7.36 %. We suspect
this happened because NUMAB-2's synchronous promotion was blocking the
Taobench's real work progress.
DAMON_tiering config improved the performance about 4.43 %. We believe
this shows effectiveness of DAMON-based promotion that didn't block
Taobench's real work progress due to its asynchronous nature. Also this
means DAMON's monitoring results are accurate enough to provide visible
amount of improvement.
Evaluation Limitations
----------------------
As mentioned above, this evaluation shows only comparison of promotion
mechanisms. DAMON-based tiering is recommended to be used together with
reclaim-based demotion as a faster backup under significant memory
pressure, though.
From some perspective, the modified version of Taobench may seems making
the picture distorted too much. It would be better to evaluate with more
realistic workload, or more finely tuned micro benchmarks.
Patch Sequence
==============
The first patch (patch 1) implements two new quota goal metrics on core
layer and expose it to DAMON core kernel API. The second and third ones
(patches 2 and 3) further link it to DAMON sysfs interface. Three
following patches (patches 4-6) document the new feature and sysfs file on
design, usage, and ABI documents. The final one (patch 7) implements a
working version of a self-tuned DAMON-based memory tiering solution in an
incomplete but easy to understand form as a kernel module under
samples/damon/ directory.
References
==========
[1] https://lore.kernel.org/20231112195602.61525-1-sj@kernel.org/
[2] https://lore.kernel.org/20250303221726.484227-1-sj@kernel.org
[3] https://github.com/facebookresearch/DCPerf/blob/main/packages/tao_bench/README.md
This patch (of 7):
Used and free space ratios for specific NUMA nodes can be useful inputs
for NUMA-specific DAMOS schemes' aggressiveness self-tuning feedback loop.
Implement DAMOS quota goal metrics for such self-tuned schemes.
Link: https://lkml.kernel.org/r/20250420194030.75838-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250420194030.75838-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Yunjeong Mun <yunjeong.mun@sk.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
dec92bf95f |
mm/mempolicy: support memory hotplug in weighted interleave
The weighted interleave policy distributes page allocations across multiple NUMA nodes based on their performance weight, thereby improving memory bandwidth utilization. The weight values for each node are configured through sysfs. Previously, sysfs entries for configuring weighted interleave were created for all possible nodes (N_POSSIBLE) at initialization, including nodes that might not have memory. However, not all nodes in N_POSSIBLE are usable at runtime, as some may remain memoryless or offline. This led to sysfs entries being created for unusable nodes, causing potential misconfiguration issues. To address this issue, this patch modifies the sysfs creation logic to: 1) Limit sysfs entries to nodes that are online and have memory, avoiding the creation of sysfs entries for nodes that cannot be used. 2) Support memory hotplug by dynamically adding and removing sysfs entries based on whether a node transitions into or out of the N_MEMORY state. Additionally, the patch ensures that sysfs attributes are properly managed when nodes go offline, preventing stale or redundant entries from persisting in the system. By making these changes, the weighted interleave policy now manages its sysfs entries more efficiently, ensuring that only relevant nodes are considered for interleaving, and dynamically adapting to memory hotplug events. [dan.carpenter@linaro.org: fix error code in sysfs_wi_node_add()] Link: https://lkml.kernel.org/r/aBjL7Bwc0QBzgajK@stanley.mountain Link: https://lkml.kernel.org/r/20250417072839.711-4-rakie.kim@sk.com Co-developed-by: Honggyu Kim <honggyu.kim@sk.com> Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Co-developed-by: Yunjeong Mun <yunjeong.mun@sk.com> Signed-off-by: Yunjeong Mun <yunjeong.mun@sk.com> Signed-off-by: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
cf8cecf2bc |
mm/mempolicy: prepare weighted interleave sysfs for memory hotplug
Previously, the weighted interleave sysfs structure was statically managed during initialization. This prevented new nodes from being recognized when memory hotplug events occurred, limiting the ability to update or extend sysfs entries dynamically at runtime. To address this, this patch refactors the sysfs infrastructure and encapsulates it within a new structure, `sysfs_wi_group`, which holds both the kobject and an array of node attribute pointers. By allocating this group structure globally, the per-node sysfs attributes can be managed beyond initialization time, enabling external modules to insert or remove node entries in response to events such as memory hotplug or node online/offline transitions. Instead of allocating all per-node sysfs attributes at once, the initialization path now uses the existing sysfs_wi_node_add() and sysfs_wi_node_delete() helpers. This refactoring makes it possible to modularly manage per-node sysfs entries and ensures the infrastructure is ready for runtime extension. Link: https://lkml.kernel.org/r/20250417072839.711-3-rakie.kim@sk.com Signed-off-by: Rakie Kim <rakie.kim@sk.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yunjeong Mun <yunjeong.mun@sk.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
bb52e89d8b |
mm/mempolicy: fix memory leaks in weighted interleave sysfs
Patch series "Enhance sysfs handling for memory hotplug in weighted
interleave", v9.
The following patch series enhances the weighted interleave policy in the
memory management subsystem by improving sysfs handling, fixing memory
leaks, and introducing dynamic sysfs updates for memory hotplug support.
This patch (of 3):
Memory leaks occurred when removing sysfs attributes for weighted
interleave. Improper kobject deallocation led to unreleased memory when
initialization failed or when nodes were removed.
The risk of leak is low because it only appears to trigger if setup
fails. Setup only fails due to -ENOMEM which is unlikely to happen
from a late_initcall() when memory pressure is low.
This patch resolves the issue by replacing unnecessary `kfree()` calls
with proper `kobject_del()` and `kobject_put()` sequences, ensuring
correct teardown and preventing memory leaks.
By explicitly calling `kobject_del()` before `kobject_put()`, the release
function is now invoked safely, and internal sysfs state is correctly
cleaned up. This guarantees that the memory associated with the kobject
is fully released and avoids resource leaks, thereby improving system
stability.
Additionally, sysfs_remove_file() is no longer called from the release
function to avoid accessing invalid sysfs state after kobject_del(). All
attribute removals are now done before kobject_del(), preventing WARN_ON()
in kernfs and ensuring safe and consistent cleanup of sysfs entries.
Link: https://lkml.kernel.org/r/20250417072839.711-1-rakie.kim@sk.com
Link: https://lkml.kernel.org/r/20250417072839.711-2-rakie.kim@sk.com
Fixes:
|
||
|
|
6e14fd33f1 |
mm: memcontrol: remove unnecessary NULL check before free_percpu()
free_percpu() checks for NULL pointers internally. Remove unneeded NULL check here. Link: https://lkml.kernel.org/r/20250417084330.937380-1-nichen@iscas.ac.cn Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
d096612048 |
vmalloc: align nr_vmalloc_pages and vmap_lazy_nr
Currently both atomics share one cache-line: <snip> ... ffffffff83eab400 b vmap_lazy_nr ffffffff83eab408 b nr_vmalloc_pages ... <snip> those are global variables and they are only 8 bytes apart. Since they are modified by different threads this causes a false sharing. This can lead to a performance drop due to unnecessary cache invalidations. After this patch it is aligned to a cache line boundary: <snip> ... ffffffff8260a600 d vmap_lazy_nr ffffffff8260a640 d nr_vmalloc_pages ... <snip> Link: https://lkml.kernel.org/r/20250417161216.88318-4-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Adrian Huang <ahuang12@lenovo.com> Tested-by: Adrian Huang <ahuang12@lenovo.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Christop Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
7a6fe58777 |
MAINTAINERS: add test_vmalloc.c to VMALLOC section
A vmalloc subsystem includes "lib/test_vmalloc.c" test suite. Add an "F:" entry under VMALLOC section to track this file as part of the subsystem. Link: https://lkml.kernel.org/r/20250417161216.88318-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christop Hellwig <hch@infradead.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
2d76e79315 |
lib/test_vmalloc.c: allow built-in execution
Remove the dependency on module loading ("m") for the vmalloc test suite,
enabling it to be built directly into the kernel, so both ("=m") and
("=y") are supported.
Motivation:
- Faster debugging/testing of vmalloc code;
- It allows to configure the test via kernel-boot parameters.
Configuration example:
test_vmalloc.nr_threads=64
test_vmalloc.run_test_mask=7
test_vmalloc.sequential_test_order=1
Link: https://lkml.kernel.org/r/20250417161216.88318-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Adrian Huang <ahuang12@lenovo.com>
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Cc: Christop Hellwig <hch@infradead.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
7a73348e5d |
lib/test_vmalloc.c: replace RWSEM to SRCU for setup
The test has the initialization step during which threads are created. To prevent the workers from starting prematurely a write lock was previously used by the main setup thread, while each worker would block on a read lock. Replace this RWSEM based synchronization with a simpler SRCU based approach. Which does two basic steps: - Main thread wraps the setup phase in an SRCU read-side critical section. Pair of srcu_read_lock()/srcu_read_unlock(). - Each worker calls synchronize_srcu() on entry, ensuring it waits for the initialization phase to be completed. This patch eliminates the need for down_read()/up_read() and down_write()/up_write() pairs thus simplifying the logic and improving clarity. [urezki@gmail.com: fix compile error with CONFIG_TINY_RCU] Link: https://lkml.kernel.org/r/20250420142029.103169-1-urezki@gmail.com Link: https://lkml.kernel.org/r/20250417161216.88318-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Adrian Huang <ahuang12@lenovo.com> Tested-by: Adrian Huang <ahuang12@lenovo.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christop Hellwig <hch@infradead.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
b05f8d7e07 |
Documentation: zram: update IDLE pages tracking documentation
Move IDLE pages tracking into a separate chapter because there are
multiple features that use (or depend on) it either in built-in variant
("mark all") or in extended variant (ac-time tracking).
In addition, recompression doesn't require memory tracking to be enabled
in order to be able to perform idle recompression.
Link: https://lkml.kernel.org/r/20250416042833.3858827-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reported-by: Shin Kawamura <kawasin@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
4a34c584d8 |
mempolicy: optimize queue_folios_pte_range by PTE batching
After the check for queue_folio_required(), the code only cares about the folio in the for loop, i.e the PTEs are redundant. Therefore, optimize this loop by skipping over a PTE batch mapping the same folio. With a test program migrating pages of the calling process, which includes a mapped VMA of size 4GB with pte-mapped large folios of order-9, and migrating once back and forth node-0 and node-1, the average execution time reduces from 7.5 to 4 seconds, giving an approx 47% speedup. Link: https://lkml.kernel.org/r/20250416053048.96479-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
75404e0766 |
mm: move mmap/vma locking logic into specific files
Currently the VMA and mmap locking logic is entangled in two of the most overwrought files in mm - include/linux/mm.h and mm/memory.c. Separate this logic out so we can more easily make changes and create an appropriate MAINTAINERS entry that spans only the logic relating to locking. This should have no functional change. Care is taken to avoid dependency loops, we must regrettably keep release_fault_lock() and assert_fault_locked() in mm.h as a result due to the dependence on the vm_fault type. Additionally we must declare rcuwait_wake_up() manually to avoid a dependency cycle on linux/rcuwait.h. Additionally move the nommu implementatino of lock_mm_and_find_vma() to mmap_lock.c so everything lock-related is in one place. Link: https://lkml.kernel.org/r/bec6c8e29fa8de9267a811a10b1bdae355d67ed4.1744799282.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
f735eebe55 |
memcg: multi-memcg percpu charge cache
Memory cgroup accounting is expensive and to reduce the cost, the kernel maintains per-cpu charge cache for a single memcg. So, if a charge request comes for a different memcg, the kernel will flush the old memcg's charge cache and then charge the newer memcg a fixed amount (64 pages), subtracts the charge request amount and stores the remaining in the per-cpu charge cache for the newer memcg. This mechanism is based on the assumption that the kernel, for locality, keep a process on a CPU for long period of time and most of the charge requests from that process will be served by that CPU's local charge cache. However this assumption breaks down for incoming network traffic in a multi-tenant machine. We are in the process of running multiple workloads on a single machine and if such workloads are network heavy, we are seeing very high network memory accounting cost. We have observed multiple CPUs spending almost 100% of their time in net_rx_action and almost all of that time is spent in memcg accounting of the network traffic. More precisely, net_rx_action is serving packets from multiple workloads and is observing/serving mix of packets of these workloads. The memcg switch of per-cpu cache is very expensive and we are observing a lot of memcg switches on the machine. Almost all the time is being spent on charging new memcg and flushing older memcg cache. So, definitely we need per-cpu cache that support multiple memcgs for this scenario. This patch implements a simple (and dumb) multiple memcg percpu charge cache. Actually we started with more sophisticated LRU based approach but the dumb one was always better than the sophisticated one by 1% to 3%, so going with the simple approach. Some of the design choices are: 1. Fit all caches memcgs in a single cacheline. 2. The cache array can be mix of empty slots or memcg charged slots, so the kernel has to traverse the full array. 3. The cache drain from the reclaim will drain all cached memcgs to keep things simple. To evaluate the impact of this optimization, on a 72 CPUs machine, we ran the following workload where each netperf client runs in a different cgroup. The next-20250415 kernel is used as base. $ netserver -6 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K number of clients | Without patch | With patch 6 | 42584.1 Mbps | 48603.4 Mbps (14.13% improvement) 12 | 30617.1 Mbps | 47919.7 Mbps (56.51% improvement) 18 | 25305.2 Mbps | 45497.3 Mbps (79.79% improvement) 24 | 20104.1 Mbps | 37907.7 Mbps (88.55% improvement) 30 | 14702.4 Mbps | 30746.5 Mbps (109.12% improvement) 36 | 10801.5 Mbps | 26476.3 Mbps (145.11% improvement) The results show drastic improvement for network intensive workloads. [shakeel.butt@linux.dev: add BUILD_BUG_ON() for MEMCG_CHARGE_BATCH] Link: https://lkml.kernel.org/r/rlsgeosg3j7v5nihhbxxxbv3xfy4ejvigihj7lkkbt3n6imyne@2apxx2jm2e57 [shakeel.butt@linux.dev: simplify refill_stock] Link: https://lkml.kernel.org/r/as5cdsm4lraxupg3t6onep2ixql72za25hvd4x334dsoyo4apr@zyzl4vkuevuv [hughd@google.com: it's better to stock nr_pages than the uninitialized stock_pages] Link: https://lkml.kernel.org/r/d542d18f-1caa-6fea-e2c3-3555c87bcf64@google.com [shakeel.butt@linux.dev: add comment per Michal and use DEFINE_PER_CPU_ALIGNED instead of DEFINE_PER_CPU per Vlastimil] Link: https://lkml.kernel.org/r/dieeei3squ2gcnqxdjayvxbvzldr266rhnvtl3vjzsqevxkevf@ckui5vjzl2qg Link: https://lkml.kernel.org/r/20250416180229.2902751-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Eric Dumaze <edumazet@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
06340b9270 |
mm: convert free_page_and_swap_cache() to free_folio_and_swap_cache()
free_page_and_swap_cache() takes a struct page pointer as input parameter, but it will immediately convert it to folio and all operations following within use folio instead of page. It makes more sense to pass in folio directly. Convert free_page_and_swap_cache() to free_folio_and_swap_cache() to consume folio directly. Link: https://lkml.kernel.org/r/20250416201720.41678-1-nifan.cxl@gmail.com Signed-off-by: Fan Ni <fan.ni@samsung.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Adam Manzanares <a.manzanares@samsung.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
4f219913c1 |
mm: add nr_free_highatomic in show_free_areas
commit c928807f6f6b6("mm/page_alloc: keep track of free highatomic")
adds a new variable nr_free_highatomic, which is useful for analyzing low
mem issues. add nr_free_highatomic in show_free_areas.
Signed-off-by: gao xu <gaoxu2@honor.com>
Link: https://lkml.kernel.org/r/d92eeff74f7a4578a14ac777cfe3603a@honor.com
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
df2bbc47e7 |
mm/vmscan: modify the assignment logic of the scan and total_scan variables
The scan and total_scan variables can be initialized to 0 when they are defined, replacing the separate assignment statements. Link: https://lkml.kernel.org/r/20250417092422.1333620-1-hao.ge@linux.dev Signed-off-by: Hao Ge <gehao@kylinos.cn> Acked-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
1477b8cd26 |
samples/damon/prcl: fix a comment typo
This patch just fixes a typo in the comment. Link: https://lkml.kernel.org/r/20250411073800.1444481-1-lienze@kylinos.cn Signed-off-by: Enze Li <lienze@kylinos.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
a7797e74bd |
mm/gup: clean up codes in fault_in_xxx() functions
The code style in fault_in_readable() and fault_in_writable() is a little inconsistent with fault_in_safe_writeable(). In fault_in_readable() and fault_in_writable(), it uses 'uaddr' passed in as loop cursor. While in fault_in_safe_writeable(), local variable 'start' is used as loop cursor. This may mislead people when reading code or making change in these codes. Here define explicit loop cursor and use for loop to simplify codes in these three functions. These cleanup can make them be consistent in code style and improve readability. [bhe@redhat.com: address minor concerns from David] Link: https://lkml.kernel.org/r/Z/sbv3EmLXWgEE7+@MiWiFi-R3L-srv Link: https://lkml.kernel.org/r/20250410035717.473207-5-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yanjun.Zhu <yanjun.zhu@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
339122abb5 |
mm/gup: remove gup_fast_pgd_leaf() and clean up the relevant codes
In the current kernel, only pud huge page is supported in some architectures. P4d and pgd huge pages haven't been supported yet. And in mm/gup.c, there's no pgd huge page handling in the follow_page_mask() code path. Hence it doesn't make sense to only have gup_fast_pgd_leaf() in gup_fast code path. Here remove gup_fast_pgd_leaf() and clean up the relevant codes. Link: https://lkml.kernel.org/r/20250410035717.473207-4-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Yanjun.Zhu <yanjun.zhu@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
ede27b7ee2 |
mm/gup: remove unneeded checking in follow_page_pte()
Patch series "mm/gup: Minor fix, cleanup and improvements", v4. These were made from code inspection in mm/gup.c. This patch (of 3): In __get_user_pages(), it will traverse page table and take a reference to the page the given user address corresponds to if GUP_GET or GUP_PIN is set. However, it's not supported both GUP_GET and GUP_PIN are set. Even though this check need be done, it should be done earlier, but not doing it till entering into follow_page_pte() and failed. Furthermore, this checking has been done in is_valid_gup_args() and all external users of __get_user_pages() will call is_valid_gup_args() to catch the illegal setting. We don't need to worry about internal users of __get_user_pages() because the gup_flags are set by MM code correctly. Here remove the checking in follow_page_pte(), and add VM_WARN_ON_ONCE() to catch the possible exceptional setting just in case. And also change the VM_BUG_ON to VM_WARN_ON_ONCE() for checking (!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN))) because the checking has been done in is_valid_gup_args() for external users of __get_user_pages(). Link: https://lkml.kernel.org/r/20250410035717.473207-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20250410035717.473207-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Yanjun.Zhu <yanjun.zhu@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
e7a446030b |
mm,hugetlb: allocate frozen pages in alloc_buddy_hugetlb_folio
alloc_buddy_hugetlb_folio() allocates a rmappable folio, then strips the rmappable part and freezes it. We can simplify all that by allocating frozen pages directly. Link: https://lkml.kernel.org/r/20250411132359.312708-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
0272d07ef6 |
vmalloc: use atomic_long_add_return_relaxed()
Switch from the atomic_long_add_return() to its relaxed version.
We do not need a full memory barrier or any memory ordering during
increasing the "vmap_lazy_nr" variable. What we only need is to do it
atomically. This is what atomic_long_add_return_relaxed() guarantees.
AARCH64:
<snip>
Default:
40ec: d34cfe94 lsr x20, x20, #12
40f0: 14000044 b 4200 <free_vmap_area_noflush+0x19c>
40f4: 94000000 bl 0 <__sanitizer_cov_trace_pc>
40f8: 90000000 adrp x0, 0 <__traceiter_alloc_vmap_area>
40fc: 91000000 add x0, x0, #0x0
4100: f8f40016 ldaddal x20, x22, [x0]
4104: 8b160296 add x22, x20, x22
Relaxed:
40ec: d34cfe94 lsr x20, x20, #12
40f0: 14000044 b 4200 <free_vmap_area_noflush+0x19c>
40f4: 94000000 bl 0 <__sanitizer_cov_trace_pc>
40f8: 90000000 adrp x0, 0 <__traceiter_alloc_vmap_area>
40fc: 91000000 add x0, x0, #0x0
4100: f8340016 ldadd x20, x22, [x0]
4104: 8b160296 add x22, x20, x22
<snip>
Link: https://lkml.kernel.org/r/20250415112646.113091-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Christop Hellwig <hch@infradead.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
00ccf40ae2 |
mm, hugetlb: avoid passing a null nodemask when there is mbind policy
Before trying to allocate a page, gather_surplus_pages() sets up a nodemask for the nodes we can allocate from, but instead of passing the nodemask down the road to the page allocator, it iterates over the nodes within that nodemask right there, meaning that the page allocator will receive a preferred_nid and a null nodemask. This is a problem when using a memory policy, because it might be that the page allocator ends up using a node as a fallback which is not represented in the policy. Avoid that by passing the nodemask directly to the page allocator, so it can filter out fallback nodes that are not part of the nodemask. Link: https://lkml.kernel.org/r/20250415121503.376811-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
f736953e2b |
selftests/damon: remove the remaining test scripts for DAMON debugfs interface
DAMON has dropped debugfs support; therefore, remove these unused scripts.
Link: https://lkml.kernel.org/r/20250411024332.1373861-1-enze.li@linux.dev
Fixes:
|
||
|
|
60cada258d |
memcg: optimize memcg_rstat_updated
Currently the kernel maintains the stats updates per-memcg which is needed to implement stats flushing threshold. On the update side, the update is added to the per-cpu per-memcg update of the given memcg and all of its ancestors. However when the given memcg has passed the flushing threshold, all of its ancestors should have passed the threshold as well. There is no need to traverse up the memcg tree to maintain the stats updates. Perf profile collected from our fleet shows that memcg_rstat_updated is one of the most expensive memcg function i.e. a lot of cumulative CPU is being spent on it. So, even small micro optimizations matter a lot. This patch is microbenchmarked with multiple instances of netperf on a single machine with locally running netserver and we see couple of percentage of improvement. Link: https://lkml.kernel.org/r/20250410025752.92159-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
585a914588 |
selftests/mm: restore default nr_hugepages value during cleanup in hugetlb_reparenting_test.sh
During cleanup, the value of /proc/sys/vm/nr_hugepages is currently being
set to 0. At the end of the test, if all tests pass, the original
nr_hugepages value is restored. However, if any test fails, it remains
set to 0.
With this patch, we ensure that the original nr_hugepages value is
restored during cleanup, regardless of whether the test passes or fails.
Link: https://lkml.kernel.org/r/20250410100748.2310-1-donettom@linux.ibm.com
Fixes:
|
||
|
|
2a6ed1b411 |
maple_tree: reorder mas->store_type case statements
Move the unlikely case that mas->store_type is invalid to be the last evaluated case and put liklier cases higher up. Link: https://lkml.kernel.org/r/20250410191446.2474640-7-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Suggested-by: Liam R. Howlett <liam.howlett@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
271152a973 |
maple_tree: add sufficient height
In order to support rebalancing and spanning stores using less than the worst case number of nodes, we need to track more than just the vacant height. Using only vacant height to reduce the worst case maple node allocation count can lead to a shortcoming of nodes in the following scenarios. For rebalancing writes, when a leaf node becomes insufficient, it may be combined with a sibling into a single node. This means that the parent node which has entries for this children will lose one entry. If this parent node was just meeting the minimum entries, losing one entry will now cause this parent node to be insufficient. This leads to a cascading operation of rebalancing at different levels and can lead to more node allocations than simply using vacant height can return. For spanning writes, a similar situation occurs. At the location at which a spanning write is detected, the number of ancestor nodes may similarly need to rebalanced into a smaller number of nodes and the same cascading situation could occur. To use less than the full height of the tree for the number of allocations, we also need to track the height at which a non-leaf node cannot become insufficient. This means even if a rebalance occurs to a child of this node, it currently has enough entries that it can lose one without any further action. This field is stored in the maple write state as sufficient height. In mas_prealloc_calc() when figuring out how many nodes to allocate, we check if the vacant node is lower in the tree than a sufficient node (has a larger value). If it is, we cannot use the vacant height and must use the difference in the height and sufficient height as the basis for the number of nodes needed. An off by one bug was also discovered in mast_overflow() where it is using >= rather than >. This caused extra iterations of the mas_spanning_rebalance() loop and lead to unneeded allocations. A test is also added to check the number of allocations is correct. Link: https://lkml.kernel.org/r/20250410191446.2474640-6-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
300a5b4ffe |
maple_tree: break on convergence in mas_spanning_rebalance()
This allows support for using the vacant height to calculate the worst case number of nodes needed for wr_rebalance operation. mas_spanning_rebalance() was seen to perform unnecessary node allocations. We can reduce allocations by breaking early during the rebalancing loop once we realize that we have ascended to a common ancestor. Link: https://lkml.kernel.org/r/20250410191446.2474640-5-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Suggested-by: Liam Howlett <liam.howlett@oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
ad88fc17d2 |
maple_tree: use vacant nodes to reduce worst case allocations
In order to determine the store type for a maple tree operation, a walk of the tree is done through mas_wr_walk(). This function descends the tree until a spanning write is detected or we reach a leaf node. While descending, keep track of the height at which we encounter a node with available space. This is done by checking if mas->end is less than the number of slots a given node type can fit. Now that the height of the vacant node is tracked, we can use the difference between the height of the tree and the height of the vacant node to know how many levels we will have to propagate creating new nodes. Update mas_prealloc_calc() to consider the vacant height and reduce the number of worst-case allocations. Rebalancing and spanning stores are not supported and fall back to using the full height of the tree for allocations. Update preallocation testing assertions to take into account vacant height. Link: https://lkml.kernel.org/r/20250410191446.2474640-4-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
f9d3a963fe |
maple_tree: use height and depth consistently
For the maple tree, the root node is defined to have a depth of 0 with a height of 1. Each level down from the node, these values are incremented by 1. Various code paths define a root with depth 1 which is inconsisent with the definition. Modify the code to be consistent with this definition. In mas_spanning_rebalance(), l_mas.depth was being used to track the height based on the number of iterations done in the main loop. This information was then used in mas_put_in_tree() to set the height. Rather than overload the l_mas.depth field to track height, simply keep track of height in the local variable new_height and directly pass this to mas_wmb_replace() which will be passed into mas_put_in_tree(). This allows up to remove writes to l_mas.depth. Link: https://lkml.kernel.org/r/20250410191446.2474640-3-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
28092a652f |
maple_tree: convert mas_prealloc_calc() to take in a maple write state
Patch series "Track node vacancy to reduce worst case allocation counts", v5. ================ overview ======================== Currently, the maple tree preallocates the worst case number of nodes for given store type by taking into account the whole height of the tree. This comes from a worst case scenario of every node in the tree being full and having to propagate node allocation upwards until we reach the root of the tree. This can be optimized if there are vacancies in nodes that are at a lower depth than the root node. This series implements tracking the level at which there is a vacant node so we only need to allocate until this level is reached, rather than always using the full height of the tree. The ma_wr_state struct is modified to add a field which keeps track of the vacant height and is updated during walks of the tree. This value is then read in mas_prealloc_calc() when we decide how many nodes to allocate. For rebalancing and spanning stores, we also need to track the lowest height at which a node has 1 more entry than the minimum sufficient number of entries. This is because rebalancing can cause a parent node to become insufficient which results in further node allocations. In this case, we need to use the sufficient height as the worst case rather than the vacant height. patch 1-2: preparatory patches patch 3: implement vacant height tracking + update the tests patch 4: support vacant height tracking for rebalancing writes patch 5: implement sufficient height tracking patch 6: reorder switch case statements ================ results ========================= Bpftrace was used to profile the allocation path for requesting new maple nodes while running stress-ng mmap 120s. The histograms below represent requests to kmem_cache_alloc_bulk() and show the count argument. This represnts how many maple nodes the caller is requesting in kmem_cache_alloc_bulk() command: stress-ng --mmap 4 --timeout 120 mm-unstable @bulk_alloc_req: [3, 4) 4 | | [4, 5) 54170 |@ | [5, 6) 0 | | [6, 7) 893057 |@@@@@@@@@@@@@@@@@@@@ | [7, 8) 4 | | [8, 9) 2230287 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [9, 10) 55811 |@ | [10, 11) 77834 |@ | [11, 12) 0 | | [12, 13) 1368684 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [13, 14) 0 | | [14, 15) 0 | | [15, 16) 367197 |@@@@@@@@ | @maple_node_total: 46,630,160 @total_vmas: 46184591 mm-unstable + this series @bulk_alloc_req: [2, 3) 198 | | [3, 4) 4 | | [4, 5) 43 | | [5, 6) 0 | | [6, 7) 1069503 |@@@@@@@@@@@@@@@@@@@@@ | [7, 8) 4 | | [8, 9) 2597268 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [9, 10) 472191 |@@@@@@@@@ | [10, 11) 191904 |@@@ | [11, 12) 0 | | [12, 13) 247316 |@@@@ | [13, 14) 0 | | [14, 15) 0 | | [15, 16) 98769 |@ | @maple_node_total: 37,813,856 @total_vmas: 43493287 This represents a ~19% reduction in the number of bulk maple nodes allocated. For more reproducible results, a historgram of the return value of mas_prealloc_calc() is displayed while running the maple_tree_tests whcih have a deterministic store pattern mas_prealloc_calc() return value mm-unstable 1 : (12068) 3 : (11836) 5 : ***** (271192) 7 : ************************************************** (2329329) 9 : *********** (534186) 10 : (435) 11 : *************** (704306) 13 : ******** (409781) mas_prealloc_calc() return value mm-unstable + this series 1 : (12070) 3 : ************************************************** (3548777) 5 : ******** (633458) 7 : (65081) 9 : (11224) 10 : (341) 11 : (2973) 13 : (68) do_mmap latency was also measured for regressions: command: stress-ng --mmap 4 --timeout 120 mm-unstable: avg = 7162 nsecs, total: 16101821292 nsecs, count: 2248034 mm-unstable + this series: avg = 6689 nsecs, total: 15135391764 nsecs, count: 2262726 stress-ng --mmap4 --timeout 120 with vacant_height: stress-ng: info: [257] 21526312 Maple Tree Read 0.176 M/sec stress-ng: info: [257] 339979348 Maple Tree Write 2.774 M/sec without vacant_height: stress-ng: info: [8228] 20968900 Maple Tree Read 0.171 M/sec stress-ng: info: [8228] 312214648 Maple Tree Write 2.547 M/sec This represents an increase of ~3% read throughput and ~9% increase in write throughput. This patch (of 6): In a subsequent patch, mas_prealloc_calc() will need to access fields only in the ma_wr_state. Convert the function to take in a ma_wr_state and modify all callers. There is no functional change. Link: https://lkml.kernel.org/r/20250410191446.2474640-1-sidhartha.kumar@oracle.com Link: https://lkml.kernel.org/r/20250410191446.2474640-2-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
43c4cfde7e |
mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED]
MADV_DONTNEED[_LOCKED] handling for [process_]madvise() flushes tlb for each vma of each address range. Update the logic to do tlb flushes in a batched way. Initialize an mmu_gather object from do_madvise() and vector_madvise(), which are the entry level functions for [process_]madvise(), respectively. And pass those objects to the function for per-vma work, via madvise_behavior struct. Make the per-vma logic not flushes tlb on their own but just saves the tlb entries to the received mmu_gather object. For this internal logic change, make zap_page_range_single_batched() non-static and use it directly from madvise_dontneed_single_vma(). Finally, the entry level functions flush the tlb entries that gathered for the entire user request, at once. Link: https://lkml.kernel.org/r/20250410000022.1901-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
de8efdf8cd |
mm/memory: split non-tlb flushing part from zap_page_range_single()
Some of zap_page_range_single() callers such as [process_]madvise() with
MADV_DONTNEED[_LOCKED] cannot batch tlb flushes because
zap_page_range_single() flushes tlb for each invocation. Split out the
body of zap_page_range_single() except mmu_gather object initialization
and gathered tlb entries flushing for such batched tlb flushing usage.
To avoid hugetlb pages allocation failures from concurrent page faults,
the tlb flush should be done before hugetlb faults unlocking, though. Do
the flush and the unlock inside the split out function in the order for
hugetlb vma case. Refer to commit
|
||
|
|
01bef02bf9 |
mm/madvise: batch tlb flushes for MADV_FREE
MADV_FREE handling for [process_]madvise() flushes tlb for each vma of each address range. Update the logic to do tlb flushes in a batched way. Initialize an mmu_gather object from do_madvise() and vector_madvise(), which are the entry level functions for [process_]madvise(), respectively. And pass those objects to the function for per-vma work, via madvise_behavior struct. Make the per-vma logic not flushes tlb on their own but just saves the tlb entries to the received mmu_gather object. Finally, the entry level functions flush the tlb entries that gathered for the entire user request, at once. Link: https://lkml.kernel.org/r/20250410000022.1901-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
066c770437 |
mm/madvise: define and use madvise_behavior struct for madvise_do_behavior()
Patch series "mm/madvise: batch tlb flushes for MADV_DONTNEED and
MADV_FREE", v3.
When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or MADV_FREE
with multiple address ranges, tlb flushes happen for each of the given
address ranges. Because such tlb flushes are for the same process, doing
those in a batch is more efficient while still being safe. Modify
process_madvise() entry level code path to do such batched tlb flushes,
while the internal unmap logic do only gathering of the tlb entries to
flush.
In more detail, modify the entry functions to initialize an mmu_gather
object and pass it to the internal logic. And make the internal logic do
only gathering of the tlb entries to flush into the received mmu_gather
object. After all internal function calls are done, the entry functions
flush the gathered tlb entries at once.
Because process_madvise() and madvise() share the internal unmap logic,
make same change to madvise() entry code together, to make code consistent
and cleaner. It is only for keeping the code clean, and shouldn't degrade
madvise(). It could rather provide a potential tlb flushes reduction
benefit for a case that there are multiple vmas for the given address
range. It is only a side effect from an effort to keep code clean, so we
don't measure it separately.
Similar optimizations might be applicable to other madvise behavior such
as MADV_COLD and MADV_PAGEOUT. Those are simply out of the scope of this
patch series, though.
Patches Sequence
================
The first patch defines a new data structure for managing information that
is required for batched tlb flushes (mmu_gather and behavior), and update
code paths for MADV_DONTNEED[_LOCKED] and MADV_FREE handling internal
logic to receive it.
The second patch batches tlb flushes for MADV_FREE handling for both
madvise() and process_madvise().
Remaining two patches are for MADV_DONTNEED[_LOCKED] tlb flushes batching.
The third patch splits zap_page_range_single() for batching of
MADV_DONTNEED[_LOCKED] handling. The fourth patch batches tlb flushes for
the hint using the sub-logic that the third patch split out, and the
helpers for batched tlb flushes that introduced for the MADV_FREE case, by
the second patch.
Test Results
============
I measured the latency to apply MADV_DONTNEED advice to 256 MiB memory
using multiple process_madvise() calls. I apply the advice in 4 KiB sized
regions granularity, but with varying batch size per process_madvise()
call (vlen) from 1 to 1024. The source code for the measurement is
available at GitHub[1]. To reduce measurement errors, I did the
measurement five times.
The measurement results are as below. 'sz_batch' column shows the batch
size of process_madvise() calls. 'Before' and 'After' columns show the
average of latencies in nanoseconds that measured five times on kernels
that built without and with the tlb flushes batching of this series
(patches 3 and 4), respectively. For the baseline, mm-new tree of
2025-04-09[2] has been used, after reverting the second version of this
patch series and adding a temporal fix for !CONFIG_DEBUG_VM build
failure[3]. 'B-stdev' and 'A-stdev' columns show ratios of latency
measurements standard deviation to average in percent for 'Before' and
'After', respectively. 'Latency_reduction' shows the reduction of the
latency that the 'After' has achieved compared to 'Before', in percent.
Higher 'Latency_reduction' values mean more efficiency improvements.
sz_batch Before B-stdev After A-stdev Latency_reduction
1 146386348 2.78 111327360.6 3.13 23.95
2 108222130 1.54 72131173.6 2.39 33.35
4 93617846.8 2.76 51859294.4 2.50 44.61
8 80555150.4 2.38 44328790 1.58 44.97
16 77272777 1.62 37489433.2 1.16 51.48
32 76478465.2 2.75 33570506 3.48 56.10
64 75810266.6 1.15 27037652.6 1.61 64.34
128 73222748 3.86 25517629.4 3.30 65.15
256 72534970.8 2.31 25002180.4 0.94 65.53
512 71809392 5.12 24152285.4 2.41 66.37
1024 73281170.2 4.53 24183615 2.09 67.00
Unexpectedly the latency has reduced (improved) even with batch size one.
I think some of compiler optimizations have affected that, like also
observed with the first version of this patch series.
So, please focus on the proportion between the improvement and the batch
size. As expected, tlb flushes batching provides latency reduction that
proportional to the batch size. The efficiency gain ranges from about 33
percent with batch size 2, and up to 67 percent with batch size 1,024.
Please note that this is a very simple microbenchmark, so real efficiency
gain on real workload could be very different.
This patch (of 4):
To implement batched tlb flushes for MADV_DONTNEED[_LOCKED] and MADV_FREE,
an mmu_gather object in addition to the behavior integer need to be passed
to the internal logics. Using a struct can make it easy without
increasing the number of parameters of all code paths towards the internal
logic. Define a struct for the purpose and use it on the code path that
starts from madvise_do_behavior() and ends on madvise_dontneed_free().
Note that this changes madvise_walk_vmas() visitor type signature, too.
Specifically, it changes its 'arg' type from 'unsigned long' to the new
struct pointer.
Link: https://lkml.kernel.org/r/20250410000022.1901-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250410000022.1901-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam R. Howlett <howlett@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
10d483f198 |
mm: huge_memory: add folio_mark_accessed() when zapping file THP
When investigating performance issues during file folio unmap, I noticed some behavioral differences in handling non-PMD-sized folios and PMD-sized folios. For non-PMD-sized file folios, it will call folio_mark_accessed() to mark the folio as having seen activity, but this is not done for PMD-sized folios. This might not cause obvious issues, but a potential problem could be that, it might lead to reclaim of hot file folios under memory pressure, as quoted from Johannes: : Sometimes file contents are only accessed through relatively short-lived : mappings. But they can nevertheless be accessed a lot and be hot. It's : important to not lose that information on unmap, and end up kicking out a : frequently used cache page. Therefore, we should also add folio_mark_accessed() for PMD-sized file folios when unmapping. [baolin.wang@linux.alibaba.com: add comment] Link: https://lkml.kernel.org/r/23fdc11d-e983-4627-89a8-79e9ecf9a45a@linux.alibaba.com Link: https://lkml.kernel.org/r/fc117f60d7b686f87067f36a0ef7cdbc3a78109c.1744190345.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Barry Song <21cnbao@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
10d288964d |
tools/testing/selftests: assert that anon merge cases behave as expected
Prior to the recently applied commit that permits this merge,
mprotect()'ing a faulted VMA, adjacent to an unfaulted VMA, such that the
two share characteristics would fail to merge due to what appear to be
unintended consequences of commit
|
||
|
|
bd23f293a0 |
tools/testing: add PROCMAP_QUERY helper functions in mm self tests
The PROCMAP_QUERY ioctl() is very useful - it allows for binary access to /proc/$pid/[s]maps data and thus convenient lookup of data contained there. This patch exposes this for convenient use by mm self tests so the state of VMAs can easily be queried. Link: https://lkml.kernel.org/r/ce83d877093d1fc594762cf4b82f0c27963030ee.1744104124.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
879bca0a2c |
mm/vma: fix incorrectly disallowed anonymous VMA merges
Patch series "fix incorrectly disallowed anonymous VMA merges", v2.
It appears that we have been incorrectly rejecting merge cases for 15
years, apparently by mistake.
Imagine a range of anonymous mapped momemory divided into two VMAs like
this, with incompatible protection bits:
RW RWX
unfaulted faulted
|-----------|-----------|
| prev | vma |
|-----------|-----------|
mprotect(RW)
Now imagine mprotect()'ing vma so it is RW. This appears as if it should
merge, it does not.
Neither does this case, again mprotect()'ing vma RW:
RWX RW
faulted unfaulted
|-----------|-----------|
| vma | next |
|-----------|-----------|
mprotect(RW)
Nor:
RW RWX RW
unfaulted faulted unfaulted
|-----------|-----------|-----------|
| prev | vma | next |
|-----------|-----------|-----------|
mprotect(RW)
What's going on here?
In commit
|
||
|
|
af8251dd45 |
mm: rust: add MEMORY MANAGEMENT [RUST]
We have introduced Rust bindings for core mm abstractions as part of this series, so add an entry in MAINTAINERS to be explicit about who maintains this. Patches are anticipated to be taken through the mm tree as usual with other mm code. Link: https://rust-for-linux.com/rust-kernel-policy#how-is-rust-introduced-in-a-subsystem Link: https://lore.kernel.org/all/33e64b12-aa07-4e78-933a-b07c37ff1d84@lucifer.local/ Link: https://lkml.kernel.org/r/20250408-vma-v16-9-d8b446e885d9@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Andreas Hindborg <a.hindborg@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Gary Guo <gary@garyguo.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
6acb75ad7b |
task: rust: rework how current is accessed
Introduce a new type called `CurrentTask` that lets you perform various
operations that are only safe on the `current` task. Use the new type to
provide a way to access the current mm without incrementing its refcount.
With this change, you can write stuff such as
let vma = current!().mm().lock_vma_under_rcu(addr);
without incrementing any refcounts.
This replaces the existing abstractions for accessing the current pid
namespace. With the old approach, every field access to current involves
both a macro and a unsafe helper function. The new approach simplifies
that to a single safe function on the `CurrentTask` type. This makes it
less heavy-weight to add additional current accessors in the future.
That said, creating a `CurrentTask` type like the one in this patch
requires that we are careful to ensure that it cannot escape the current
task or otherwise access things after they are freed. To do this, I
declared that it cannot escape the current "task context" where I defined
a "task context" as essentially the region in which `current` remains
unchanged. So e.g., release_task() or begin_new_exec() would leave the
task context.
If a userspace thread returns to userspace and later makes another
syscall, then I consider the two syscalls to be different task contexts.
This allows values stored in that task to be modified between syscalls,
even if they're guaranteed to be immutable during a syscall.
Ensuring correctness of `CurrentTask` is slightly tricky if we also want
the ability to have a safe `kthread_use_mm()` implementation in Rust. To
support that safely, there are two patterns we need to ensure are safe:
// Case 1: current!() called inside the scope.
let mm;
kthread_use_mm(some_mm, || {
mm = current!().mm();
});
drop(some_mm);
mm.do_something(); // UAF
and:
// Case 2: current!() called before the scope.
let mm;
let task = current!();
kthread_use_mm(some_mm, || {
mm = task.mm();
});
drop(some_mm);
mm.do_something(); // UAF
The existing `current!()` abstraction already natively prevents the first
case: The `&CurrentTask` would be tied to the inner scope, so the
borrow-checker ensures that no reference derived from it can escape the
scope.
Fixing the second case is a bit more tricky. The solution is to
essentially pretend that the contents of the scope execute on an different
thread, which means that only thread-safe types can cross the boundary.
Since `CurrentTask` is marked `NotThreadSafe`, attempts to move it to
another thread will fail, and this includes our fake pretend thread
boundary.
This has the disadvantage that other types that aren't thread-safe for
reasons unrelated to `current` also cannot be moved across the
`kthread_use_mm()` boundary. I consider this an acceptable tradeoff.
Link: https://lkml.kernel.org/r/20250408-vma-v16-8-d8b446e885d9@google.com
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Reviewed-by: Gary Guo <gary@garyguo.net>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: Björn Roy Baron <bjorn3_gh@protonmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jann Horn <jannh@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Trevor Gross <tmgross@umich.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
f8c7819881 |
rust: miscdevice: add mmap support
Add the ability to write a file_operations->mmap hook in Rust when using the miscdevice abstraction. The `vma` argument to the `mmap` hook uses the `VmaNew` type from the previous commit; this type provides the correct set of operations for a file_operations->mmap hook. Link: https://lkml.kernel.org/r/20250408-vma-v16-7-d8b446e885d9@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Gary Guo <gary@garyguo.net> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
dcb81aeab4 |
mm: rust: add VmaNew for f_ops->mmap()
This type will be used when setting up a new vma in an f_ops->mmap() hook. Using a separate type from VmaRef allows us to have a separate set of operations that you are only able to use during the mmap() hook. For example, the VM_MIXEDMAP flag must not be changed after the initial setup that happens during the f_ops->mmap() hook. To avoid setting invalid flag values, the methods for clearing VM_MAYWRITE and similar involve a check of VM_WRITE, and return an error if VM_WRITE is set. Trying to use `try_clear_maywrite` without checking the return value results in a compilation error because the `Result` type is marked #[must_use]. For now, there's only a method for VM_MIXEDMAP and not VM_PFNMAP. When we add a VM_PFNMAP method, we will need some way to prevent you from setting both VM_MIXEDMAP and VM_PFNMAP on the same vma. Link: https://lkml.kernel.org/r/20250408-vma-v16-6-d8b446e885d9@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Gary Guo <gary@garyguo.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |