When calculating the hotness threshold for the lru_prio scheme of
DAMON_LRU_SORT, the module divides some values by the maximum nr_accesses.
However, due to the types of the related variables, the simple
division-based calculation of the divisor can return zero, so a
divide-by-zero is possible. Fix it by using damon_max_nr_accesses(),
which handles the case.
Link: https://lkml.kernel.org/r/20231019194924.100347-5-sj@kernel.org
Fixes: 40e983cca9 ("mm/damon: introduce DAMON-based LRU-lists Sorting")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reported-by: Jakub Acs <acsjakub@amazon.de>
Cc: <stable@vger.kernel.org> [6.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When calculating the hotness of each region for the under-quota regions
prioritization, DAMON divides some values by the maximum nr_accesses.
However, due to the types of the related variables, the simple
division-based calculation of the divisor can return zero, so a
divide-by-zero is possible. Fix it by using damon_max_nr_accesses(),
which handles the case.
Link: https://lkml.kernel.org/r/20231019194924.100347-4-sj@kernel.org
Fixes: 198f0f4c58 ("mm/damon/vaddr,paddr: support pageout prioritization")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reported-by: Jakub Acs <acsjakub@amazon.de>
Cc: <stable@vger.kernel.org> [5.16+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When monitoring attributes are changed, DAMON updates the access rate of
the monitoring results accordingly. For that, it divides some values by
the maximum nr_accesses. However, due to the types of the related
variables, the simple division-based calculation of the divisor can
return zero, so a divide-by-zero is possible. Fix it by using
damon_max_nr_accesses(), which handles the case.
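As a rough illustration of the pattern these fixes apply (a sketch based
on the description above; the surrounding conversion helper is
approximated, not quoted from the patch):

        /* before (sketch): truncating the unsigned long quotient into an
         * unsigned int divisor can yield 0, so the next line divides by
         * zero */
        unsigned int max_nr_accesses =
                attrs->aggr_interval / attrs->sample_interval;
        return nr_accesses * 10000 / max_nr_accesses;

        /* after (sketch): the helper computes the divisor without the
         * truncation, so it cannot become zero here */
        return nr_accesses * 10000 / damon_max_nr_accesses(attrs);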
Link: https://lkml.kernel.org/r/20231019194924.100347-3-sj@kernel.org
Fixes: 2f5bef5a59 ("mm/damon/core: update monitoring results for new monitoring attributes")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reported-by: Jakub Acs <acsjakub@amazon.de>
Cc: <stable@vger.kernel.org> [6.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "avoid divide-by-zero due to max_nr_accesses overflow".
The maximum nr_accesses of a given DAMON context can be calculated by
dividing the aggregation interval by the sampling interval. Some logic in
DAMON uses the maximum nr_accesses as a divisor, so the value should
never be zero. That case is normally avoided, since DAMON refuses to set
the aggregation interval smaller than the sampling interval. However,
since nr_accesses is unsigned int while the intervals are unsigned long,
the maximum nr_accesses can become zero when the quotient is cast to
unsigned int.
Avoid the divide-by-zero by implementing a function that handles the
corner case (first patch), and by replacing the vulnerable direct max
nr_accesses calculations with calls to it (remaining patches).
Note that the replacement patches are split per broken commit, to make
backporting to the required trees easier. In particular, the last patch
is for a commit that is not yet merged into the mainline but is in the mm
tree.
This patch (of 4):
The maximum nr_accesses of a given DAMON context can be calculated by
dividing the aggregation interval by the sampling interval. Some logic in
DAMON uses the maximum nr_accesses as a divisor, so the value should
never be zero. That case is normally avoided, since DAMON refuses to set
the aggregation interval smaller than the sampling interval. However,
since nr_accesses is unsigned int while the intervals are unsigned long,
the maximum nr_accesses can become zero when the quotient is cast to
unsigned int. Implement a function that handles the corner case.
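A minimal sketch of such a helper, assuming it does the division in
unsigned long and clamps the result to UINT_MAX instead of letting it
truncate (the exact implementation may differ):

        static inline unsigned int damon_max_nr_accesses(
                        const struct damon_attrs *attrs)
        {
                /* divide in unsigned long so the quotient cannot wrap,
                 * then clamp it into the unsigned int range */
                unsigned long max_nr_accesses =
                        attrs->aggr_interval / attrs->sample_interval;

                return min(max_nr_accesses, (unsigned long)UINT_MAX);
        }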
Note that this commit does not fix the real issue by itself, since it
only introduces the safe function that will replace the problematic
divisions. The replacements will be made by follow-up commits, to make
backporting to stable series easier.
Link: https://lkml.kernel.org/r/20231019194924.100347-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20231019194924.100347-2-sj@kernel.org
Fixes: 198f0f4c58 ("mm/damon/vaddr,paddr: support pageout prioritization")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reported-by: Jakub Acs <acsjakub@amazon.de>
Cc: <stable@vger.kernel.org> [5.16+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since commit dc68badced ("mm: mlock: update mlock_pte_range to handle
large folio") I've just occasionally seen VM_WARN_ON_FOLIO(folio_test_ksm)
warnings from folio_within_range(), in a splurge after testing with KSM
hyperactive.
folio_referenced_one()'s use of folio_within_vma() is safe because it
checks folio_test_large() first; but allow_mlock_munlock() needs to do the
same to avoid those warnings (or check !folio_test_ksm() itself? Or move
either check into folio_within_range()? Hard to tell without more
examples of its use).
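A minimal sketch of the guarded check in allow_mlock_munlock(), following
the description above rather than the exact diff:

        /* consult folio_within_range() only for large folios, as
         * folio_referenced_one() already does; small (including KSM)
         * folios then never hit the VM_WARN_ON_FOLIO(folio_test_ksm) */
        if (folio_test_large(folio) &&
            !folio_within_range(folio, vma, start, end))
                return false;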
Link: https://lkml.kernel.org/r/23852f6a-5bfa-1ffd-30db-30c5560ad426@google.com
Fixes: dc68badced ("mm: mlock: update mlock_pte_range to handle large folio")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Stefan Roesch <shr@devkernel.io>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since commit e509ad4d77 ("ext4: use bdev_getblk() to avoid memory
reclaim in readahead path") rightly replaced GFP_NOFAIL allocations by
GFP_NOWAIT allocations, I've occasionally been seeing "page allocation
failure: order:0" warnings under load: all with
ext4_sb_breadahead_unmovable() in the stack. I don't think those warnings
are of any interest: suppress them with __GFP_NOWARN.
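The resulting readahead allocation would look roughly like this (a
sketch, not the verbatim diff):

        /* readahead is opportunistic: don't warn when the NOWAIT
         * allocation fails under memory pressure */
        struct buffer_head *bh = bdev_getblk(sb->s_bdev, block,
                        sb->s_blocksize, GFP_NOWAIT | __GFP_NOWARN);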
Link: https://lkml.kernel.org/r/7bc6ad16-9a4d-dd90-202e-47d6cbb5a136@google.com
Fixes: e509ad4d77 ("ext4: use bdev_getblk() to avoid memory reclaim in readahead path")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Hui Zhu <teawater@antgroup.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When doing compaction, I found that lru_add_drain() is an obvious hotspot
when migrating pages. The distribution of this hotspot is as follows:
   - 18.75% compact_zone
      - 17.39% migrate_pages
         - 13.79% migrate_pages_batch
            - 11.66% migrate_folio_move
               - 7.02% lru_add_drain
                  + 7.02% lru_add_drain_cpu
               + 3.00% move_to_new_folio
                 1.23% rmap_walk
            + 1.92% migrate_folio_unmap
         + 3.20% migrate_pages_sync
      + 0.90% isolate_migratepages
The lru_add_drain() was added by commit c3096e6782 ("mm/migrate:
__unmap_and_move() push good newpage to LRU") to drain the newpage to the
LRU immediately, to help build up the correct newpage->mlock_count in
remove_migration_ptes() for mlocked pages. However, if no mlocked pages
are being migrated, then we can avoid this lru drain operation, especially
in heavy concurrent scenarios.
So we can record the source pages' mlocked status in
migrate_folio_unmap(), and only drain the lru list when the mlocked status
is set in migrate_folio_move().
In addition, the page is already isolated from the LRU while migrating,
so the mlocked status checked by folio_test_mlocked() in
migrate_folio_unmap() is stable.
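A rough sketch of the idea (the flag and parameter names here are
approximations taken from the description above, not quoted from the
patch):

        /* in migrate_folio_unmap(): the folio is isolated from the LRU,
         * so its mlocked state cannot change under us */
        if (folio_test_mlocked(src))
                *old_page_state |= PAGE_WAS_MLOCKED;

        /* in migrate_folio_move(): drain only when it is actually needed
         * for building up newpage->mlock_count */
        if (old_page_state & PAGE_WAS_MLOCKED)
                lru_add_drain();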
After this patch, I can see the lru_add_drain() hotspot is gone:
   - 9.41% migrate_pages_batch
      - 6.15% migrate_folio_move
         - 3.64% move_to_new_folio
            + 1.80% migrate_folio_extra
            + 1.70% buffer_migrate_folio
         + 1.41% rmap_walk
         + 0.62% folio_add_lru
      + 3.07% migrate_folio_unmap
Meanwhile, the compaction latency shows some improvements when running
thpscale:
                               base                  patched
Amean  fault-both-1       1131.22 (   0.00%)     1112.55 *   1.65%*
Amean  fault-both-3       2489.75 (   0.00%)     2324.15 *   6.65%*
Amean  fault-both-5       3257.37 (   0.00%)     3183.18 *   2.28%*
Amean  fault-both-7       4257.99 (   0.00%)     4079.04 *   4.20%*
Amean  fault-both-12      6614.02 (   0.00%)     6075.60 *   8.14%*
Amean  fault-both-18     10607.78 (   0.00%)     8978.86 *  15.36%*
Amean  fault-both-24     14911.65 (   0.00%)    11619.55 *  22.08%*
Amean  fault-both-30     14954.67 (   0.00%)    14925.66 *   0.19%*
Amean  fault-both-32     16654.87 (   0.00%)    15580.31 *   6.45%*
Link: https://lkml.kernel.org/r/06e9153a7a4850352ec36602df3a3a844de45698.1697859741.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
For compound pages, the head page sets the PG_head flag and the tail
pages set compound_head to point to the head page. If a user allocates a
compound page and frees it with a different order, the compound page
information will not be properly initialized. To detect this problem,
compound_order(page) and the order argument are compared, but the check
is skipped when the order argument is zero. The error should be checked
regardless of the order.
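A minimal sketch of moving the check out of the order != 0 branch in the
page freeing path (approximated from the description above):

        bool compound = PageCompound(page);

        /* catch a compound page freed with the wrong order even when the
         * order argument is zero */
        VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);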
Link: https://lkml.kernel.org/r/20231023083217.1866451-1-hyesoo.yu@samsung.com
Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
nr_failed was missing the large folio splits from migrate_pages_batch(),
which can cause a mismatch between the migrate_pages() return value and
the number of not-migrated pages, i.e., the return value of
migrate_pages() can be 0 while there are still pages left in the from
page list.
This happens when a non-PMD THP large folio fails to migrate due to
-ENOMEM and is split successfully, but not all of the split pages are
then migrated: migrate_pages_batch() returns non-zero, but
astats.nr_thp_split = 0. nr_failed would be 0 and returned to the caller
of migrate_pages(), while the not-migrated pages are left in the from
page list without being added back to the LRU lists.
Fix it by adding a new nr_split counter for large folio splits and adding
it to nr_failed in migrate_pages_sync() after migrate_pages_batch() is
done.
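Roughly, the batch statistics gain one more counter, which
migrate_pages_sync() folds into nr_failed once the batched attempt
returns (a sketch; field placement and the surrounding code are
illustrative only):

        struct migrate_pages_stats {
                int nr_succeeded;
                int nr_failed_pages;
                int nr_thp_succeeded;
                int nr_thp_failed;
                int nr_thp_split;
                int nr_split;   /* splits of any large folio, not only THP */
        };

        /* in migrate_pages_sync(), after migrate_pages_batch() (sketch) */
        nr_failed += astats.nr_split;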
Link: https://lkml.kernel.org/r/20231017163129.2025214-1-zi.yan@sent.com
Fixes: 2ef7dbb269 ("migrate_pages: try migrate in batch asynchronously firstly")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: Huang Ying <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
delete_object_part() can be called by multiple callers at the same time.
If one caller finds and removes an object, and then another caller tries
to find it too, the second lookup fails and it returns directly. The
memory block then stays recorded by kmemleak even though it has already
been freed to the buddy allocator. With DEBUG on, kmemleak will report
the following warning,
kmemleak: Partially freeing unknown object at 0xa1af86000 (size 4096)
CPU: 0 PID: 742 Comm: test_huge Not tainted 6.6.0-rc3kmemleak+ #54
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x37/0x50
kmemleak_free_part_phys+0x50/0x60
hugetlb_vmemmap_optimize+0x172/0x290
? __pfx_vmemmap_remap_pte+0x10/0x10
__prep_new_hugetlb_folio+0xe/0x30
prep_new_hugetlb_folio.isra.0+0xe/0x40
alloc_fresh_hugetlb_folio+0xc3/0xd0
alloc_surplus_hugetlb_folio.constprop.0+0x6e/0xd0
hugetlb_acct_memory.part.0+0xe6/0x2a0
hugetlb_reserve_pages+0x110/0x2c0
hugetlbfs_file_mmap+0x11d/0x1b0
mmap_region+0x248/0x9a0
? hugetlb_get_unmapped_area+0x15c/0x2d0
do_mmap+0x38b/0x580
vm_mmap_pgoff+0xe6/0x190
ksys_mmap_pgoff+0x18a/0x1f0
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Expand __create_object() and move __alloc_object() to the beginning. Then
use kmemleak_lock to protect __find_and_remove_object() and
__link_object() as a whole, which guarantees that all objects are
processed sequentially.
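As a sketch of the resulting shape of delete_object_part() (names follow
the description above; argument lists and the re-linking details are
approximated):

        /* allocate the objects for the surviving left/right parts up
         * front, outside of kmemleak_lock */
        object_l = __alloc_object(gfp);
        object_r = __alloc_object(gfp);

        raw_spin_lock_irqsave(&kmemleak_lock, flags);
        object = __find_and_remove_object(ptr, 1, is_phys);
        if (object) {
                /* __link_object() the left/right remainders here, still
                 * under the lock, so concurrent callers always see either
                 * the old object or the new parts, never neither */
        }
        raw_spin_unlock_irqrestore(&kmemleak_lock, flags);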
Link: https://lkml.kernel.org/r/20231018102952.3339837-8-liushixin2@huawei.com
Fixes: 53238a60dd ("kmemleak: Allow partial freeing of memory blocks")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Patrick Wang <patrick.wang.shcn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
__create_object() consists of two parts: the first part allocates a
kmemleak object and initializes it, and the second part inserts it into
the object tree. The whole function takes kmemleak_lock, but actually
only the second part needs the lock.
Split it into two functions: the first function, __alloc_object(), only
allocates a kmemleak object, and the second function, __link_object(),
initializes the object and inserts it into the object tree. Use
kmemleak_lock to protect __link_object() only.
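A minimal sketch of the resulting split (parameters simplified, not the
verbatim patch):

        static void __create_object(unsigned long ptr, size_t size,
                                    int min_count, gfp_t gfp)
        {
                struct kmemleak_object *object;
                unsigned long flags;

                object = __alloc_object(gfp);   /* no lock needed here */
                if (!object)
                        return;

                raw_spin_lock_irqsave(&kmemleak_lock, flags);
                /* initialize the object and insert it into the object tree */
                __link_object(object, ptr, size, min_count);
                raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
        }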
[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20231018102952.3339837-5-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Patrick Wang <patrick.wang.shcn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
With 0x%p, the pointer is hashed and printed as (____ptrval____) instead.
And with 0x%pa, the pointer can be printed successfully, but with a
duplicate prefix, which looks like:
kmemleak: kmemleak_free(0x(____ptrval____))
kmemleak: kmemleak_free_percpu(0x(____ptrval____))
kmemleak: kmemleak_free_part_phys(0x0x0000000a1af86000)
Use 0x%px instead of 0x%p or 0x%pa to print the pointer. Then the output
looks like:
kmemleak: kmemleak_free(0xffff9111c145b020)
kmemleak: kmemleak_free_percpu(0x00000000000333b0)
kmemleak: kmemleak_free_part_phys(0x0000000a1af80000)
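For instance, the kmemleak_free() entry print becomes roughly (a sketch):

        pr_debug("%s(0x%px)\n", __func__, ptr);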
Link: https://lkml.kernel.org/r/20231018102952.3339837-4-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Patrick Wang <patrick.wang.shcn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: convert page cpupid functions to folios", v3.
The cpupid (or access time) used by NUMA balancing is stored in the page
flags or in page->_last_cpupid (if LAST_CPUPID_NOT_IN_PAGE_FLAGS). This
series converts the page cpupid to a folio cpupid: a new _last_cpupid
field is added to struct folio, which lets us use folio->_last_cpupid
directly, and the page cpupid functions are converted to folio ones:
page_cpupid_last() -> folio_last_cpupid()
xchg_page_access_time() -> folio_xchg_access_time()
page_cpupid_xchg_last() -> folio_xchg_last_cpupid()
This patch (of 19):
If WANT_PAGE_VIRTUAL and LAST_CPUPID_NOT_IN_PAGE_FLAGS are defined, the
'virtual' and '_last_cpupid' fields are in struct page. Since
_last_cpupid is used by the NUMA balancing feature, it is better to move
it in front of the KMSAN metadata in struct page, and to also add both
fields to struct folio so that they can be accessed from the folio
directly.
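Concretely, the idea is roughly the following fields in struct folio (a
sketch, not the full definition):

        struct folio {
                /* ... existing fields mirroring struct page ... */
        #ifdef WANT_PAGE_VIRTUAL
                void *virtual;
        #endif
        #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
                int _last_cpupid;
        #endif
        };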
Link: https://lkml.kernel.org/r/20231018140806.2783514-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20231018140806.2783514-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>