All zswap entries will take a reference of zswap_pool when zswap_store(),
and drop it when free. Change it to use the percpu_ref is better for
scalability performance.
Although percpu_ref use a bit more memory which should be ok for our use
case, since we almost have only one zswap_pool to be using. The
performance gain is for zswap_store/load hotpath.
Testing kernel build (32 threads) in tmpfs with memory.max=2GB. (zswap
shrinker and writeback enabled with one 50GB swapfile, on a 128 CPUs
x86-64 machine, below is the average of 5 runs)
mm-unstable zswap-global-lru
real 63.20 63.12
user 1061.75 1062.95
sys 268.74 264.44
[chengming.zhou@linux.dev: fix zswap_pools_lock usages after changing to percpu_ref]
Link: https://lkml.kernel.org/r/20240228154954.3028626-1-chengming.zhou@linux.dev
Link: https://lkml.kernel.org/r/20240210-zswap-global-lru-v3-2-200495333595@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/zswap: optimize for dynamic zswap_pools", v3.
Dynamic pool creation has been supported for a long time, which maybe not
used so much in practice. But with the per-memcg lru merged, the current
structure of zswap_pool's lru and shrinker become less optimal.
In the current structure, each zswap_pool has its own lru, shrinker and
shrink_work, but only the latest zswap_pool will be the current used.
1. When memory has pressure, all shrinkers of zswap_pools will try to
shrink its lru list, there is no order between them.
2. When zswap limit hit, only the last zswap_pool's shrink_work will
try to shrink its own lru, which is inefficient.
A more natural way is to have a global zswap lru shared between all
zswap_pools, and so is the shrinker. The code becomes much simpler too.
Another optimization is changing zswap_pool kref to percpu_ref, which will
be taken reference by every zswap entry. So the scalability is better.
Testing kernel build (32 threads) in tmpfs with memory.max=2GB. (zswap
shrinker and writeback enabled with one 50GB swapfile, on a 128 CPUs
x86-64 machine, below is the average of 5 runs)
mm-unstable zswap-global-lru
real 63.20 63.12
user 1061.75 1062.95
sys 268.74 264.44
This patch (of 3):
Dynamic zswap_pool creation may create/reuse to have multiple zswap_pools
in a list, only the first will be current used.
Each zswap_pool has its own lru and shrinker, which is not necessary and
has its problem:
1. When memory has pressure, all shrinker of zswap_pools will
try to shrink its own lru, there is no order between them.
2. When zswap limit hit, only the last zswap_pool's shrink_work
will try to shrink its lru list. The rationale here was to
try and empty the old pool first so that we can completely
drop it. However, since we only support exclusive loads now,
the LRU ordering should be entirely decided by the order of
stores, so the oldest entries on the LRU will naturally be
from the oldest pool.
Anyway, having a global lru and shrinker shared by all zswap_pools is
better and efficient.
Link: https://lkml.kernel.org/r/20240210-zswap-global-lru-v3-0-200495333595@bytedance.com
Link: https://lkml.kernel.org/r/20240210-zswap-global-lru-v3-1-200495333595@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Refactor the code left in write_cache_pages into an iterator that the file
system can call to get the next folio for a writeback operation:
struct folio *folio = NULL;
while ((folio = writeback_iter(mapping, wbc, folio, &error))) {
error = <do per-folio writeback>;
}
The twist here is that the error value is passed by reference, so that the
iterator can restore it when breaking out of the loop.
Handling of the magic AOP_WRITEPAGE_ACTIVATE value stays outside the
iterator and needs is just kept in the write_cache_pages legacy wrapper.
in preparation for eventually killing it off.
Heavily based on a for_each* based iterator from Matthew Wilcox.
Link: https://lkml.kernel.org/r/20240215063649.2164017-14-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Collapse the two nested loops into one. This is needed as a step towards
turning this into an iterator.
Note that this drops the "index <= end" check in the previous outer loop
and just relies on filemap_get_folios_tag() to return 0 entries when index
> end. This actually has a subtle implication when end == -1 because then
the returned index will be -1 as well and thus if there is page present on
index -1, we could be looping indefinitely. But as the comment in
filemap_get_folios_tag documents this as already broken anyway we should
not worry about it here either. The fix for that would probably a change
to the filemap_get_folios_tag() calling convention.
[hch@lst.de: update the commit log per Jan]
Link: https://lkml.kernel.org/r/20240215063649.2164017-10-hch@lst.de
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rework the way we deal with the cleanup after the writepage call.
First handle the magic AOP_WRITEPAGE_ACTIVATE separately from real error
returns to get it out of the way of the actual error handling path.
The split the handling on intgrity vs non-integrity branches first, and
return early using a goto for the non-ingegrity early loop condition to
remove the need for the done and done_index local variables, and for
assigning the error to ret when we can just return error directly.
Link: https://lkml.kernel.org/r/20240215063649.2164017-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "convert write_cache_pages() to an iterator", v8.
This is an evolution of the series Matthew Wilcox originally sent in June
2023, which has changed quite a bit since and now has a while based
iterator.
This patch (of 14):
mapping_set_error should only be called on 0 returns (which it ignores) or
a negative error code.
writepage_cb ends up being able to call writepage_cb on the magic
AOP_WRITEPAGE_ACTIVATE return value from ->writepage which means success
but the caller needs to unlock the page. Ignore that and just call
mapping_set_error on negative errors.
(no fixes tag as this goes back more than 20 years over various renames
and refactors so I've given up chasing down the original introduction)
Link: https://lkml.kernel.org/r/20240215063649.2164017-1-hch@lst.de
Link: https://lkml.kernel.org/r/20240215063649.2164017-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The purpose is stopping splitting large folios whose mapcount are 2 or
above. Folios whose estimated_shares = 0 should be still perfect and even
better candidates than estimated_shares = 1.
Consider a pte-mapped large folio with 16 subpages, if we unmap 1-15, the
current code will split folios and reclaim them while madvise goes on this
folio; but if we unmap subpage 0, we will keep this folio and break. This
is weird.
For pmd-mapped large folios, we can still use "= 1" as the condition as
anyway we have the entire map for it. So this patch doesn't change the
condition for pmd-mapped large folios. This also explains why we had been
using "= 1" for both pmd-mapped and pte-mapped large folios before commit
07e8c82b5e ("madvise: convert madvise_cold_or_pageout_pte_range() to use
folios"), because in the past, we used the mapcount of the specific
subpage, since the subpage had pte present, its mapcount wouldn't be 0.
The problem can be quite easily reproduced by writing a small program,
unmapping the first subpage of a pte-mapped large folio vs. unmapping
anyone other than the first subpage.
Link: https://lkml.kernel.org/r/20240221085036.105621-1-21cnbao@gmail.com
Fixes: 2f406263e3 ("madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check")
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The code is quite hard to read, we are still writing swap_map after
errors happen. Though the written value is as before,
has_cache = count & SWAP_HAS_CACHE;
count &= ~SWAP_HAS_CACHE;
[snipped]
WRITE_ONCE(p->swap_map[offset], count | has_cache);
It would be better to entirely drop the WRITE_ONCE for both
performance and readability.
[akpm@linux-foundation.org: avoid using goto]
Link: https://lkml.kernel.org/r/20240221091028.123122-1-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Before last commit, memory compaction only migrates order-0 folios and
skips >0 order folios. Last commit splits all >0 order folios during
compaction. This commit migrates >0 order folios during compaction by
keeping isolated free pages at their original size without splitting them
into order-0 pages and using them directly during migration process.
What is different from the prior implementation:
1. All isolated free pages are kept in a NR_PAGE_ORDERS array of page
lists, where each page list stores free pages in the same order.
2. All free pages are not post_alloc_hook() processed nor buddy pages,
although their orders are stored in first page's private like buddy
pages.
3. During migration, in new page allocation time (i.e., in
compaction_alloc()), free pages are then processed by post_alloc_hook().
When migration fails and a new page is returned (i.e., in
compaction_free()), free pages are restored by reversing the
post_alloc_hook() operations using newly added
free_pages_prepare_fpi_none().
Step 3 is done for a latter optimization that splitting and/or merging
free pages during compaction becomes easier.
Note: without splitting free pages, compaction can end prematurely due to
migration will return -ENOMEM even if there is free pages. This happens
when no order-0 free page exist and compaction_alloc() return NULL.
Link: https://lkml.kernel.org/r/20240220183220.1451315-4-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Adam Manzanares <a.manzanares@samsung.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
No need to continue try compact memory if pending fatal signal, allow loop
termination earlier in compact_nodes().
The existing fatal_signal_pending() check does make compact_zone()
break out of the while loop, but it still enters the next zone/next
nid, and some unnecessary functions(eg, lru_add_drain) are called.
There was no observable benefit from the new test, it is just found
from code inspection when refactoring compact_node().
Link: https://lkml.kernel.org/r/20240208022508.1771534-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We used to rely on the returned -ENOSPC of zpool_malloc() to increase
reject_compress_poor. But the code wouldn't get to there after commit
744e188592 ("crypto: scomp - fix req->dst buffer overflow") as the new
code will goto out immediately after the special compression case happens.
So there might be no longer a chance to execute zpool_malloc now. We are
incorrectly increasing zswap_reject_compress_fail instead. Thus, we need
to fix the counters handling right after compressions return ENOSPC. This
patch also centralizes the counters handling for all of compress_poor,
compress_fail and alloc_fail.
Link: https://lkml.kernel.org/r/20240219211935.72394-1-21cnbao@gmail.com
Fixes: 744e188592 ("crypto: scomp - fix req->dst buffer overflow")
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The comment is confusing that Pool lock should be held as this function
accesses first_num above the __encode_handle() because first_num is the
element of z3fold_header which is protected by z3fold_header->page_lock.
I found the same comment for encode_handle() in zbud.c by accident ,Pool
lock should be held as this function accesses first|last_chunks, which is
the element of zbud_header and it does not have any lock, so pool lock
should be held.
Z3fold is based on zbud, maybe the comment come from zbud, but it was
wrong, so fix it.
Link: https://lkml.kernel.org/r/20240219024453.2240147-1-hezhongkun.hzk@bytedance.com
Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Support the PSI-driven quota self-tuning from DAMON_RECLAIM by introducing
yet another parameter, 'quota_mem_pressure_us'. Users can set the desired
amount of memory pressure stall time per each quota reset interval using
the parameter. Then DAMON_RECLAIM monitor the memory pressure stall time,
specifically system-wide memory 'some' PSI value that increased during the
given time interval, and self-tune the quota using the DAMOS core logic.
Link: https://lkml.kernel.org/r/20240219194431.159606-20-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMOS supports user-feedback driven quota auto-tuning, but only DAMON
sysfs interface is using it. Add support of the feature on DAMON_RECLAIM
by adding one more input parameter, namely 'quota_autotune_feedback', for
providing the user feedback to DAMON_RECLAIM. It assumes the target value
of the feedback is 10,000.
Link: https://lkml.kernel.org/r/20240219194431.159606-19-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Extend DAMON sysfs interface to support the PSI-based quota auto-tuning by
adding a new file, 'target_metric' under the quota goal directory. Old
users don't get any behavioral changes since the default value of the
metric is 'user input'.
Link: https://lkml.kernel.org/r/20240219194431.159606-15-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Extend DAMOS quota goal metric with system wide memory pressure stall
time. Specifically, the system level 'some' PSI for memory is used. The
target value can be set in microseconds. DAMOS measures the increased
amount of the PSI metric in last quota_reset_interval and use the ratio of
it versus the user-specified target PSI value as the score for the
auto-tuning feedback loop.
Link: https://lkml.kernel.org/r/20240219194431.159606-14-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMOS quota auto-tuning asks users to assess the current tuned quota and
provide the feedback in a manual and repeated way. It allows users
generate the feedback from a source that the kernel cannot access, and
writing a script or a function for doing the manual and repeated feeding
is not a big deal. However, additional works are additional works, and it
could be more efficient if DAMOS could do the fetch itself, especially in
case of DAMON sysfs interface use case, since it can avoid the context
switches between the user-space and the kernel-space, though the overhead
would be only trivial in most cases. Also in many cases, feedbacks could
be made from kernel-accessible sources, such as PSI, CPU usage, etc. Make
the quota goal to support multiple types of metrics including such ones.
Link: https://lkml.kernel.org/r/20240219194431.159606-13-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMOS quota auto-tuning feature let users to set the goal by providing a
function for getting the current score of the tuned quota. It allows
flexible goal setup, but only simple user-set quota is currently being
used. As a result, the only user of the DAMOS quota auto-tuning is using
a silly void pointer casting based score value passing function. Simplify
the interface and the user code by letting user directly set the target
and the current value.
Link: https://lkml.kernel.org/r/20240219194431.159606-12-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMOS quota auto-tuning feature supports static signle goal and dynamic
multiple goals via DAMON kernel API, specifically via ->goal and ->goals
fields of damos_quota struct, respectively. All in-tree DAMOS kernel API
users are using only the dynamic multiple goals now. Remove the unsued
static single goal interface.
Link: https://lkml.kernel.org/r/20240219194431.159606-11-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON sysfs interface implements multiple quota auto-tuning goals on its
level since the DAMOS core logic was supporting only single goal. Now the
core logic supports multiple goals on its level. Update DAMON sysfs
interface to reuse the core logic and drop unnecessary duplicated multiple
goals implementation.
Link: https://lkml.kernel.org/r/20240219194431.159606-10-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The feedback-driven DAMOS quota auto-tuning feature allows only single
goal to the DAMON kernel API users. The API users could implement
multiple goals for the end-users on their level, and that's what DAMON
sysfs interface is doing. More DAMON kernel API users such as
DAMON_RECLAIM would need to do similar work. To reduce unnecessary future
duplciated efforts, support multiple goals from DAMOS core layer. To make
the support in minimum non-destructive change, keep the old single goal
setup interface, and add multiple goals setup. The single goal will
treated as one of the multiple goals, so old API users are not required to
make any change.
Link: https://lkml.kernel.org/r/20240219194431.159606-9-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The comments and definition of 'struct damos_quota' lists a few fields for
effective quota generation first, fields for regions prioritization under
the quota, and then remaining fields for effective quota generation.
Readers' should unnecesssarily switch their context in the middle. List
all the fields for the effective quota first, and then fields for the
prioritization for making it easier to read.
Link: https://lkml.kernel.org/r/20240219194431.159606-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>