A softlockup occurs in scan free swap slot under huge memory pressure.
The test scenario is: 64 CPU cores, 64GB memory, and 28 zram devices, the
disksize of each zram device is 50MB.
LATENCY_LIMIT is used to prevent softlockups in scan_swap_map_slots(), but
the real loop number would more than LATENCY_LIMIT because of "goto checks
and goto scan" repeatly without decreasing latency limit.
In order to fix it, decrease latency_ration in advance.
There is also a suspicious place that will cause softlockups in
get_swap_pages(). In this function, the "goto start_over" may result in
continuous scanning of the swap partition. If there is no cond_sched in
scan_swap_map_slots(), it would cause a softlockup (I am not sure about
this).
WARN: soft lockup - CPU#11 stuck for 11s! [kswapd0:466]
CPU: 11 PID: 466 Comm: kswapd@ Kdump: loaded Tainted: G
dump backtrace+0x0/0x1le4
show stack+0x20/@x2c
dump_stack+0xd8/0x140
watchdog print_info+0x48/0x54
watchdog_process_before_softlockup+0x98/0xa0
watchdog_timer_fn+0xlac/0x2d0
hrtimer_rum_queues+0xb0/0x130
hrtimer_interrupt+0x13c/0x3c0
arch_timer_handler_virt+0x3c/0x50
handLe_percpu_devid_irq+0x90/0x1f4
handle domain irq+0x84/0x100
gic_handle_irq+0x88/0x2b0
e11 ira+0xhB/Bx140
scan_swap_map_slots+0x678/0x890
get_swap_pages+0x29c/0x440
get_swap_page+0x120/0x2e0
add_to_swap+UX2U/0XyC
shrink_page_list+0x5d0/0x152c
shrink_inactive_list+0xl6c/Bx500
shrink_lruvec+0x270/0x304
WARN: soft lockup - CPU#32 stuck for 11s! [stress-ng:309915]
watchdog_timer_fn+0x1ac/0x2d0
__run_hrtimer+0x98/0x2a0
__hrtimer_run_queues+0xb0/0x130
hrtimer_interrupt+0x13c/0x3c0
arch_timer_handler_virt+0x3c/0x50
handle_percpu_devid_irq+0x90/0x1f4
__handle_domain_irq+0x84/0x100
gic_handle_irq+0x88/0x2b0
el1_irq+0xb8/0x140
get_swap_pages+0x1e8/0x440
get_swap_page+0x1c8/0x2e0
add_to_swap+0x20/0x9c
shrink_page_list+0x5d0/0x152c
reclaim_pages+0x160/0x310
madvise_cold_or_pageout_pte_range+0x7bc/0xe3c
walk_pmd_range.isra.0+0xac/0x22c
walk_pud_range+0xfc/0x1c0
walk_pgd_range+0x158/0x1b0
__walk_page_range+0x64/0x100
walk_page_range+0x104/0x150
Link: https://lkml.kernel.org/r/20221118133850.3360369-1-chenwandun@huawei.com
Fixes: 048c27fd72 ("[PATCH] swap: scan_swap_map latency breaks")
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: <xialonglong1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit b140513524 ("mm/sl[au]b: generalize kmalloc subsystem")
refactored large parts of the kmalloc subsystem, resulting in the stack
trace pruning logic done by KFENCE to no longer work.
While b140513524 attempted to fix the situation by including
'__kmem_cache_free' in the list of functions KFENCE should skip through,
this only works when the compiler actually optimized the tail call from
kfree() to __kmem_cache_free() into a jump (and thus kfree() _not_
appearing in the full stack trace to begin with).
In some configurations, the compiler no longer optimizes the tail call
into a jump, and __kmem_cache_free() appears in the stack trace. This
means that the pruned stack trace shown by KFENCE would include kfree()
which is not intended - for example:
| BUG: KFENCE: invalid free in kfree+0x7c/0x120
|
| Invalid free of 0xffff8883ed8fefe0 (in kfence-#126):
| kfree+0x7c/0x120
| test_double_free+0x116/0x1a9
| kunit_try_run_case+0x90/0xd0
| [...]
Fix it by moving __kmem_cache_free() to the list of functions that may be
tail called by an allocator entry function, making the pruning logic work
in both the optimized and unoptimized tail call cases.
Link: https://lkml.kernel.org/r/20221118152216.3914899-1-elver@google.com
Fixes: b140513524 ("mm/sl[au]b: generalize kmalloc subsystem")
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Feng Tang <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The page reclaim isolates a batch of folios from the tail of one of the
LRU lists and works on those folios one by one. For a suitable
swap-backed folio, if the swap device is async, it queues that folio for
writeback. After the page reclaim finishes an entire batch, it puts back
the folios it queued for writeback to the head of the original LRU list.
In the meantime, the page writeback flushes the queued folios also by
batches. Its batching logic is independent from that of the page reclaim.
For each of the folios it writes back, the page writeback calls
folio_rotate_reclaimable() which tries to rotate a folio to the tail.
folio_rotate_reclaimable() only works for a folio after the page reclaim
has put it back. If an async swap device is fast enough, the page
writeback can finish with that folio while the page reclaim is still
working on the rest of the batch containing it. In this case, that folio
will remain at the head and the page reclaim will not retry it before
reaching there.
This patch adds a retry to evict_folios(). After evict_folios() has
finished an entire batch and before it puts back folios it cannot free
immediately, it retries those that may have missed the rotation.
Before this patch, ~60% of folios swapped to an Intel Optane missed
folio_rotate_reclaimable(). After this patch, ~99% of missed folios were
reclaimed upon retry.
This problem affects relatively slow async swap devices like Samsung 980
Pro much less and does not affect sync swap devices like zram or zswap at
all.
Link: https://lkml.kernel.org/r/20221116013808.3995280-1-yuzhao@google.com
Fixes: ac35a49023 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: "Yin, Fengwei" <fengwei.yin@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A DAMON sysfs interface user can start DAMON with a scheme, remove the
sysfs directory for the scheme, and then ask update of the scheme's stats.
Because the schemes stats update logic isn't aware of the situation, it
results in an invalid memory access. Fix the bug by checking if the
scheme sysfs directory exists.
Link: https://lkml.kernel.org/r/20221114175552.1951-1-sj@kernel.org
Fixes: 0ac32b8aff ("mm/damon/sysfs: support DAMOS stats")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [v5.18]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shared memory segments can be created that are backed by hugetlb pages.
When this happens, the vmas associated with any mappings (shmat) are
marked VM_HUGETLB, yet the vm_ops for such mappings are provided by
ipc/shm (shm_vm_ops). There is a mechanism to call the underlying hugetlb
vm_ops, and this is done for most operations. However, it is not done for
open and close.
This was not an issue until the introduction of the hugetlb vma_lock.
This lock structure is pointed to by vm_private_data and the open/close
vm_ops help maintain this structure. The special hugetlb routine called
at fork took care of structure updates at fork time. However,
vma_splitting is not properly handled for ipc shared memory mappings
backed by hugetlb pages. This can result in a "kernel NULL pointer
dereference" BUG or use after free as two vmas point to the same lock
structure.
Update the shm open and close routines to always call the underlying open
and close routines.
Link: https://lkml.kernel.org/r/20221114210018.49346-1-mike.kravetz@oracle.com
Fixes: 8d9bfb2608 ("hugetlb: add vma based lock for pmd sharing")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Doug Nelson <doug.nelson@intel.com>
Reported-by: <syzbot+83b4134621b7c326d950@syzkaller.appspotmail.com>
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Syzbot reported the below splat:
WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
Modules linked in:
CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
do_madvise mm/madvise.c:1432 [inline]
__do_sys_madvise mm/madvise.c:1432 [inline]
__se_sys_madvise mm/madvise.c:1430 [inline]
__x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f6b48a4eef9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
</TASK>
The khugepaged code would pick up the node with the most hit as the preferred
node, and also tries to do some balance if several nodes have the same
hit record. Basically it does conceptually:
* If the target_node <= last_target_node, then iterate from
last_target_node + 1 to MAX_NUMNODES (1024 on default config)
* If the max_value == node_load[nid], then target_node = nid
But there is a corner case, paritucularly for MADV_COLLAPSE, that the
non-existing node may be returned as preferred node.
Assuming the system has 2 nodes, the target_node is 0 and the
last_target_node is 1, if MADV_COLLAPSE path is hit, the max_value may
be 0, then it may return 2 for target_node, but it is actually not
existing (offline), so the warn is triggered.
The node balance was introduced by commit 9f1b868a13 ("mm: thp:
khugepaged: add policy for finding target node") to satisfy
"numactl --interleave=all". But interleaving is a mere hint rather than
something that has hard requirements.
So use nodemask to record the nodes which have the same hit record, the
hugepage allocation could fallback to those nodes. And remove
__GFP_THISNODE since it does disallow fallback. And if the nodemask
just has one node set, it means there is one single node has the most
hit record, the nodemask approach actually behaves like __GFP_THISNODE.
Link: https://lkml.kernel.org/r/20221108184357.55614-2-shy828301@gmail.com
Fixes: 7d8faaf155 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
Signed-off-by: Yang Shi <shy828301@gmail.com>
Suggested-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reported-by: <syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
During proactive reclaim, we sometimes observe severe overreclaim, with
several thousand times more pages reclaimed than requested.
This trace was obtained from shrink_lruvec() during such an instance:
prio:0 anon_cost:1141521 file_cost:7767
nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
nr=[7161123 345 578 1111]
While he reclaimer requested 4M, vmscan reclaimed close to 16G, most of it
by swapping. These requests take over a minute, during which the write()
to memory.reclaim is unkillably stuck inside the kernel.
Digging into the source, this is caused by the proportional reclaim
bailout logic. This code tries to resolve a fundamental conflict: to
reclaim roughly what was requested, while also aging all LRUs fairly and
in accordance to their size, swappiness, refault rates etc. The way it
attempts fairness is that once the reclaim goal has been reached, it stops
scanning the LRUs with the smaller remaining scan targets, and adjusts the
remainder of the bigger LRUs according to how much of the smaller LRUs was
scanned. It then finishes scanning that remainder regardless of the
reclaim goal.
This works fine if priority levels are low and the LRU lists are
comparable in size. However, in this instance, the cgroup that is
targeted by proactive reclaim has almost no files left - they've already
been squeezed out by proactive reclaim earlier - and the remaining anon
pages are hot. Anon rotations cause the priority level to drop to 0,
which results in reclaim targeting all of anon (a lot) and all of file
(almost nothing). By the time reclaim decides to bail, it has scanned
most or all of the file target, and therefor must also scan most or all of
the enormous anon target. This target is thousands of times larger than
the reclaim goal, thus causing the overreclaim.
The bailout code hasn't changed in years, why is this failing now? The
most likely explanations are two other recent changes in anon reclaim:
1. Before the series starting with commit 5df741963d ("mm: fix LRU
balancing effect of new transparent huge pages"), the VM was
overall relatively reluctant to swap at all, even if swap was
configured. This means the LRU balancing code didn't come into play
as often as it does now, and mostly in high pressure situations
where pronounced swap activity wouldn't be as surprising.
2. For historic reasons, shrink_lruvec() loops on the scan targets of
all LRU lists except the active anon one, meaning it would bail if
the only remaining pages to scan were active anon - even if there
were a lot of them.
Before the series starting with commit ccc5dc6734 ("mm/vmscan:
make active/inactive ratio as 1:1 for anon lru"), most anon pages
would live on the active LRU; the inactive one would contain only a
handful of preselected reclaim candidates. After the series, anon
gets aged similarly to file, and the inactive list is the default
for new anon pages as well, making it often the much bigger list.
As a result, the VM is now more likely to actually finish large
anon targets than before.
Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the
larger LRU lists is made before bailing out on a met reclaim goal.
This fixes the extreme overreclaim problem.
Fairness is more subtle and harder to evaluate. No obvious misbehavior
was observed on the test workload, in any case. Conceptually, fairness
should primarily be a cumulative effect from regular, lower priority
scans. Once the VM is in trouble and needs to escalate scan targets to
make forward progress, fairness needs to take a backseat. This is also
acknowledged by the myriad exceptions in get_scan_count(). This patch
makes fairness decrease gradually, as it keeps fairness work static over
increasing priority levels with growing scan targets. This should make
more sense - although we may have to re-visit the exact values.
Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When psi annotations were added to to btrfs compression reads, the psi
state tracking over add_ra_bio_pages and btrfs_submit_compressed_read was
faulty. A pressure state, once entered, is never left. This results in
incorrectly elevated pressure, which triggers OOM kills.
pflags record the *previous* memstall state when we enter a new one. The
code tried to initialize pflags to 1, and then optimize the leave call
when we either didn't enter a memstall, or were already inside a nested
stall. However, there can be multiple PageWorkingset pages in the bio, at
which point it's that path itself that enters repeatedly and overwrites
pflags. This causes us to miss the exit.
Enter the stall only once if needed, then unwind correctly.
erofs has the same problem, fix that up too. And move the memstall exit
past submit_bio() to restore submit accounting originally added by
b8e24a9300 ("block: annotate refault stalls from IO submission").
Link: https://lkml.kernel.org/r/Y2UHRqthNUwuIQGS@cmpxchg.org
Fixes: 4088a47e78 ("btrfs: add manual PSI accounting for compressed reads")
Fixes: 99486c511f ("erofs: add manual PSI accounting for the compressed address space")
Fixes: 118f3663fb ("block: remove PSI accounting from the bio layer")
Link: https://lore.kernel.org/r/d20a0a85-e415-cf78-27f9-77dd7a94bc8d@leemhuis.info/
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Thorsten Leemhuis <linux@leemhuis.info>
Tested-by: Thorsten Leemhuis <linux@leemhuis.info>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If a nilfs2 filesystem is downgraded to read-only due to metadata
corruption on disk and is remounted read/write, or if emergency read-only
remount is performed, detaching a log writer and synchronizing the
filesystem can be done at the same time.
In these cases, use-after-free of the log writer (hereinafter
nilfs->ns_writer) can happen as shown in the scenario below:
Task1 Task2
-------------------------------- ------------------------------
nilfs_construct_segment
nilfs_segctor_sync
init_wait
init_waitqueue_entry
add_wait_queue
schedule
nilfs_remount (R/W remount case)
nilfs_attach_log_writer
nilfs_detach_log_writer
nilfs_segctor_destroy
kfree
finish_wait
_raw_spin_lock_irqsave
__raw_spin_lock_irqsave
do_raw_spin_lock
debug_spin_lock_before <-- use-after-free
While Task1 is sleeping, nilfs->ns_writer is freed by Task2. After Task1
waked up, Task1 accesses nilfs->ns_writer which is already freed. This
scenario diagram is based on the Shigeru Yoshida's post [1].
This patch fixes the issue by not detaching nilfs->ns_writer on remount so
that this UAF race doesn't happen. Along with this change, this patch
also inserts a few necessary read-only checks with superblock instance
where only the ns_writer pointer was used to check if the filesystem is
read-only.
Link: https://syzkaller.appspot.com/bug?id=79a4c002e960419ca173d55e863bd09e8112df8b
Link: https://lkml.kernel.org/r/20221103141759.1836312-1-syoshida@redhat.com [1]
Link: https://lkml.kernel.org/r/20221104142959.28296-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+f816fa82f8783f7a02bb@syzkaller.appspotmail.com
Reported-by: Shigeru Yoshida <syoshida@redhat.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mfill_atomic_install_pte() checks page->mapping to detect whether one page
is used in the page cache. However as pointed out by Matthew, the page
can logically be a tail page rather than always the head in the case of
uffd minor mode with UFFDIO_CONTINUE. It means we could wrongly install
one pte with shmem thp tail page assuming it's an anonymous page.
It's not that clear even for anonymous page, since normally anonymous
pages also have page->mapping being setup with the anon vma. It's safe
here only because the only such caller to mfill_atomic_install_pte() is
always passing in a newly allocated page (mcopy_atomic_pte()), whose
page->mapping is not yet setup. However that's not extremely obvious
either.
For either of above, use page_mapping() instead.
Link: https://lkml.kernel.org/r/Y2K+y7wnhC4vbnP2@x1n
Fixes: 153132571f ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Matthew Wilcox <willy@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A semaphore deadlock can occur if nilfs_get_block() detects metadata
corruption while locating data blocks and a superblock writeback occurs at
the same time:
task 1 task 2
------ ------
* A file operation *
nilfs_truncate()
nilfs_get_block()
down_read(rwsem A) <--
nilfs_bmap_lookup_contig()
... generic_shutdown_super()
nilfs_put_super()
* Prepare to write superblock *
down_write(rwsem B) <--
nilfs_cleanup_super()
* Detect b-tree corruption * nilfs_set_log_cursor()
nilfs_bmap_convert_error() nilfs_count_free_blocks()
__nilfs_error() down_read(rwsem A) <--
nilfs_set_error()
down_write(rwsem B) <--
*** DEADLOCK ***
Here, nilfs_get_block() readlocks rwsem A (= NILFS_MDT(dat_inode)->mi_sem)
and then calls nilfs_bmap_lookup_contig(), but if it fails due to metadata
corruption, __nilfs_error() is called from nilfs_bmap_convert_error()
inside the lock section.
Since __nilfs_error() calls nilfs_set_error() unless the filesystem is
read-only and nilfs_set_error() attempts to writelock rwsem B (=
nilfs->ns_sem) to write back superblock exclusively, hierarchical lock
acquisition occurs in the order rwsem A -> rwsem B.
Now, if another task starts updating the superblock, it may writelock
rwsem B during the lock sequence above, and can deadlock trying to
readlock rwsem A in nilfs_count_free_blocks().
However, there is actually no need to take rwsem A in
nilfs_count_free_blocks() because it, within the lock section, only reads
a single integer data on a shared struct with
nilfs_sufile_get_ncleansegs(). This has been the case after commit
aa474a2201 ("nilfs2: add local variable to cache the number of clean
segments"), that is, even before this bug was introduced.
So, this resolves the deadlock problem by just not taking the semaphore in
nilfs_count_free_blocks().
Link: https://lkml.kernel.org/r/20221029044912.9139-1-konishi.ryusuke@gmail.com
Fixes: e828949e5b ("nilfs2: call nilfs_error inside bmap routines")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+45d6ce7b7ad7ef455d03@syzkaller.appspotmail.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org> [2.6.38+
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There is a memory leak reported by kmemleak:
unreferenced object 0xffff88817231ce40 (size 224):
comm "mount.cifs", pid 19308, jiffies 4295917571 (age 405.880s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
60 c0 b2 00 81 88 ff ff 98 83 01 42 81 88 ff ff `..........B....
backtrace:
[<ffffffff81936171>] __alloc_file+0x21/0x250
[<ffffffff81937051>] alloc_empty_file+0x41/0xf0
[<ffffffff81937159>] alloc_file+0x59/0x710
[<ffffffff81937964>] alloc_file_pseudo+0x154/0x210
[<ffffffff81741dbf>] __shmem_file_setup+0xff/0x2a0
[<ffffffff817502cd>] shmem_zero_setup+0x8d/0x160
[<ffffffff817cc1d5>] mmap_region+0x1075/0x19d0
[<ffffffff817cd257>] do_mmap+0x727/0x1110
[<ffffffff817518b2>] vm_mmap_pgoff+0x112/0x1e0
[<ffffffff83adf955>] do_syscall_64+0x35/0x80
[<ffffffff83c0006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
The root cause was traced to an error handing path in mmap_region() when
arch_validate_flags() or mas_preallocate() fails. In the shared anonymous
mapping sence, vma will be setuped and mapped with a new shared anonymous
file via shmem_zero_setup(). So in this case, the file resource needs to
be released.
Fix it by calling fput(vma->vm_file) and unmap_region() when
arch_validate_flags() or mas_preallocate() returns an error in the shared
anonymous mapping sence.
Link: https://lkml.kernel.org/r/20221028073717.1179380-1-lizetao1@huawei.com
Fixes: d4af56c5c7 ("mm: start tracking VMAs with maple tree")
Fixes: c462ac288f ("mm: Introduce arch_validate_flags()")
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This change is very similar to the change that was made for shmem [1], and
it solves the same problem but for HugeTLBFS instead.
Currently, when poison is found in a HugeTLB page, the page is removed
from the page cache. That means that attempting to map or read that
hugepage in the future will result in a new hugepage being allocated
instead of notifying the user that the page was poisoned. As [1] states,
this is effectively memory corruption.
The fix is to leave the page in the page cache. If the user attempts to
use a poisoned HugeTLB page with a syscall, the syscall will fail with
EIO, the same error code that shmem uses. For attempts to map the page,
the thread will get a BUS_MCEERR_AR SIGBUS.
[1]: commit a760542666 ("mm: shmem: don't truncate page if memory failure happens")
Link: https://lkml.kernel.org/r/20221018200125.848471-1-jthoughton@google.com
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Along the development cycle, the testing code support for module/in-kernel
compiles was removed. Restore this functionality by moving any internal
API tests to the userspace side, as well as threading tests. Fix the
lockdep issues and add a way to reduce memory usage so the tests can
complete with KASAN + memleak detection. Make the tests work on 32 bit
hosts where possible and detect 32 bit hosts in the radix test suite.
[akpm@linux-foundation.org: fix module export]
[akpm@linux-foundation.org: fix it some more]
[liam.howlett@oracle.com: fix compile warnings on 32bit build in check_find()]
Link: https://lkml.kernel.org/r/20221107203816.1260327-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20221028180415.3074673-1-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
clang-analyzer reported some Dead Stores in mas_anode_descend(). Upon
inspection, there were a few clean ups that would make the code cleaner:
The count variable was set from the mt_slots array and then updated but
never used again. Just use the array reference directly.
Also stop updating the type since it isn't used after the update.
Stop setting the gaps pointer to NULL at the start since it is always
set before the loop begins.
Link: https://lkml.kernel.org/r/20221026151413.4032730-1-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Suggested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull cxl fixes from Dan Williams:
"Several fixes for CXL region creation crashes, leaks and failures.
This is mainly fallout from the original implementation of dynamic CXL
region creation (instantiate new physical memory pools) that arrived
in v6.0-rc1.
Given the theme of "failures in the presence of pass-through decoders"
this also includes new regression test infrastructure for that case.
Summary:
- Fix region creation crash with pass-through decoders
- Fix region creation crash when no decoder allocation fails
- Fix region creation crash when scanning regions to enforce the
increasing physical address order constraint that CXL mandates
- Fix a memory leak for cxl_pmem_region objects, track 1:N instead of
1:1 memory-device-to-region associations.
- Fix a memory leak for cxl_region objects when regions with active
targets are deleted
- Fix assignment of NUMA nodes to CXL regions by CFMWS (CXL Window)
emulated proximity domains.
- Fix region creation failure for switch attached devices downstream
of a single-port host-bridge
- Fix false positive memory leak of cxl_region objects by recycling
recently used region ids rather than freeing them
- Add regression test infrastructure for a pass-through decoder
configuration
- Fix some mailbox payload handling corner cases"
* tag 'cxl-fixes-for-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/region: Recycle region ids
cxl/region: Fix 'distance' calculation with passthrough ports
tools/testing/cxl: Add a single-port host-bridge regression config
tools/testing/cxl: Fix some error exits
cxl/pmem: Fix cxl_pmem_region and cxl_memdev leak
cxl/region: Fix cxl_region leak, cleanup targets at region delete
cxl/region: Fix region HPA ordering validation
cxl/pmem: Use size_add() against integer overflow
cxl/region: Fix decoder allocation crash
ACPI: NUMA: Add CXL CFMWS 'nodes' to the possible nodes set
cxl/pmem: Fix failure to account for 8 byte header for writes to the device LSA.
cxl/region: Fix null pointer dereference due to pass through decoder commit
cxl/mbox: Add a check on input payload size
Pull hwmon fixes from Guenter Roeck:
"Fix two regressions:
- Commit 54cc3dbfc1 ("hwmon: (pmbus) Add regulator supply into
macro") resulted in regulator undercount when disabling regulators.
Revert it.
- The thermal subsystem rework caused the scmi driver to no longer
register with the thermal subsystem because index values no longer
match. To fix the problem, the scmi driver now directly registers
with the thermal subsystem, no longer through the hwmon core"
* tag 'hwmon-for-v6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
Revert "hwmon: (pmbus) Add regulator supply into macro"
hwmon: (scmi) Register explicitly with Thermal Framework
Pull perf fixes from Borislav Petkov:
- Add Cooper Lake's stepping to the PEBS guest/host events isolation
fixed microcode revisions checking quirk
- Update Icelake and Sapphire Rapids events constraints
- Use the standard energy unit for Sapphire Rapids in RAPL
- Fix the hw_breakpoint test to fail more graciously on !SMP configs
* tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Add Cooper Lake stepping to isolation_ucodes[]
perf/x86/intel: Fix pebs event constraints for SPR
perf/x86/intel: Fix pebs event constraints for ICL
perf/x86/rapl: Use standard Energy Unit for SPR Dram RAPL domain
perf/hw_breakpoint: test: Skip the test if dependencies unmet
Pull x86 fixes from Borislav Petkov:
- Add new Intel CPU models
- Enforce that TDX guests are successfully loaded only on TDX hardware
where virtualization exception (#VE) delivery on kernel memory is
disabled because handling those in all possible cases is "essentially
impossible"
- Add the proper include to the syscall wrappers so that BTF can see
the real pt_regs definition and not only the forward declaration
* tag 'x86_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Add several Intel server CPU model numbers
x86/tdx: Panic on bad configs that #VE on "private" memory access
x86/tdx: Prepare for using "INFO" call for a second purpose
x86/syscall: Include asm/ptrace.h in syscall_wrapper header
Pull Kbuild fixes from Masahiro Yamada:
- Use POSIX-compatible grep options
- Document git-related tips for reproducible builds
- Fix a typo in the modpost rule
- Suppress SIGPIPE error message from gcc-ar and llvm-ar
- Fix segmentation fault in the menuconfig search
* tag 'kbuild-fixes-v6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kconfig: fix segmentation fault in menuconfig search
kbuild: fix SIGPIPE error message for AR=gcc-ar and AR=llvm-ar
kbuild: fix typo in modpost
Documentation: kbuild: Add description of git for reproducible builds
kbuild: use POSIX-compatible grep option
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Fix the pKVM stage-1 walker erronously using the stage-2 accessor
- Correctly convert vcpu->kvm to a hyp pointer when generating an
exception in a nVHE+MTE configuration
- Check that KVM_CAP_DIRTY_LOG_* are valid before enabling them
- Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE
- Document the boot requirements for FGT when entering the kernel at
EL1
x86:
- Use SRCU to protect zap in __kvm_set_or_clear_apicv_inhibit()
- Make argument order consistent for kvcalloc()
- Userspace API fixes for DEBUGCTL and LBRs"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: Fix a typo about the usage of kvcalloc()
KVM: x86: Use SRCU to protect zap in __kvm_set_or_clear_apicv_inhibit()
KVM: VMX: Ignore guest CPUID for host userspace writes to DEBUGCTL
KVM: VMX: Fold vmx_supported_debugctl() into vcpu_supported_debugctl()
KVM: VMX: Advertise PMU LBRs if and only if perf supports LBRs
arm64: booting: Document our requirements for fine grained traps with SME
KVM: arm64: Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE
KVM: Check KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL} prior to enabling them
KVM: arm64: Fix bad dereference on MTE-enabled systems
KVM: arm64: Use correct accessor to parse stage-1 PTEs
Pull xen fixes from Juergen Gross:
"One fix for silencing a smatch warning, and a small cleanup patch"
* tag 'for-linus-6.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: simplify sysenter and syscall setup
x86/xen: silence smatch warning in pmu_msr_chk_emulated()