linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-16 10:11:38 -04:00

Author	SHA1	Message	Date
Chunyu Hu	dad4964a34	selftests/mm: split_huge_page_test: skip the test when thp is not available When thp is not enabled on some kernel config such as realtime kernel, the test will report failure. Fix the false positive by skipping the test directly when thp is not enabled. Tested with thp disabled kernel: Before The fix: # -------------------------------------------------- # running ./split_huge_page_test /tmp/xfs_dir_Ywup9p # -------------------------------------------------- # TAP version 13 # Bail out! Reading PMD pagesize failed # # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0 # [FAIL] not ok 61 split_huge_page_test /tmp/xfs_dir_Ywup9p # exit=1 After the fix: # -------------------------------------------------- # running ./split_huge_page_test /tmp/xfs_dir_YHPUPl # -------------------------------------------------- # TAP version 13 # 1..0 # SKIP Transparent Hugepages not available # [SKIP] ok 6 split_huge_page_test /tmp/xfs_dir_YHPUPl # SKIP Link: https://lore.kernel.org/20260402014543.1671131-6-chuhu@redhat.com Signed-off-by: Chunyu Hu <chuhu@redhat.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Li Wang <liwang@redhat.com> Cc: Nico Pache <npache@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:54 -07:00
Chunyu Hu	a784a3a39c	selftests/mm/vm_util: robust write_file() Add three more checks for buflen and numwritten. The buflen should be at least two, which means at least one char and the terminating null. The error case is checked with numwritten < 0 instead of numwritten < 1, and the truncated-write case is checked as well. The test will exit if any of these conditions aren't met. Additionally, print more information when a write failure or a truncated write occurs, providing clearer diagnostics. Link: https://lore.kernel.org/20260402014543.1671131-5-chuhu@redhat.com Signed-off-by: Chunyu Hu <chuhu@redhat.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Cc: Nico Pache <npache@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:54 -07:00
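The three checks described in the write_file() commit above can be sketched in user-space C roughly as follows. This is only an illustration under stated assumptions: `write_file_checked()` is a hypothetical name, and unlike the helper described in the commit it returns -1 instead of exiting, so that it can be exercised in isolation.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the helper described above (not the selftest
 * code). Returns 0 on success and -1 on failure instead of exiting.
 * buflen counts the terminating null, so the payload is buflen - 1 bytes.
 */
static int write_file_checked(const char *path, const char *buf, size_t buflen)
{
	ssize_t numwritten;
	int saved_errno;
	int fd;

	assert(path && buf);

	/* Check 1: need at least one character plus the terminating null. */
	if (buflen < 2) {
		fprintf(stderr, "%s: buflen %zu is too short\n", path, buflen);
		return -1;
	}

	fd = open(path, O_WRONLY);
	if (fd == -1) {
		fprintf(stderr, "%s: open failed: %s\n", path, strerror(errno));
		return -1;
	}

	numwritten = write(fd, buf, buflen - 1);
	saved_errno = errno;	/* close() may overwrite errno */
	close(fd);

	/* Check 2: a negative return value is an error. */
	if (numwritten < 0) {
		fprintf(stderr, "%s: write failed: %s\n", path,
			strerror(saved_errno));
		return -1;
	}

	/* Check 3: fewer bytes than requested is a truncated write. */
	if ((size_t)numwritten != buflen - 1) {
		fprintf(stderr, "%s: wrote %zd of %zu bytes\n", path,
			numwritten, buflen - 1);
		return -1;
	}

	return 0;
}
```

A caller would pass `sizeof(buf)` for a string literal, for example `write_file_checked(path, "1", 2)`.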
Chunyu Hu	710d2f3079	selftests/mm: move write_file helper to vm_util thp_settings provides a write_file() helper for safely writing to a file and exiting when a write failure happens. It's a very low level helper and many sub tests need such a helper, not only thp tests. split_huge_page_test also defines a write_file locally. The two have minor differences in return type and in the exit API used. And there would be conflicts if split_huge_page_test wanted to include thp_settings.h because of the different prototype, making it less convenient. It's possible to merge the two, although some tests don't use the kselftest infrastructure for testing. It would also work when using the ksft_exit_msg() to exit in my test, as the counters are all zero. Output will be like: TAP version 13 1..62 Bail out! /proc/sys/vm/drop_caches1 open failed: No such file or directory # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0 So here we just keep the version in split_huge_page_test, and move it into the vm_util. This makes it easy to maintain and users can just include one vm_util.h when they don't need thp setting helpers. Keep the prototype with a void return: as the function will exit on any error, a return value is not necessary, and this will simplify callers like write_num() and write_string(). Link: https://lore.kernel.org/20260402014543.1671131-4-chuhu@redhat.com Signed-off-by: Chunyu Hu <chuhu@redhat.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Suggested-by: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:54 -07:00
Chunyu Hu	929d5fbf1a	selftests/mm: soft-dirty: skip two tests when thp is not available The test_hugepage test contains two sub tests. If only one skip is reported when thp is not available, there will be an error in the log because the test count doesn't match the test plan. Change to skip two tests by running ksft_test_result_skip twice in this case. Without the fix (run test on thp disabled kernel): ./run_vmtests.sh -t soft_dirty # -------------------- # running ./soft-dirty # -------------------- # TAP version 13 # 1..19 # ok 1 Test test_simple # ok 2 Test test_vma_reuse dirty bit of allocated page # ok 3 Test test_vma_reuse dirty bit of reused address page # ok 4 # SKIP Transparent Hugepages not available # ok 5 Test test_mprotect-anon dirty bit of new written page # ok 6 Test test_mprotect-anon soft-dirty clear after clear_refs # ok 7 Test test_mprotect-anon soft-dirty clear after marking RO # ok 8 Test test_mprotect-anon soft-dirty clear after marking RW # ok 9 Test test_mprotect-anon soft-dirty after rewritten # ok 10 Test test_mprotect-file dirty bit of new written page # ok 11 Test test_mprotect-file soft-dirty clear after clear_refs # ok 12 Test test_mprotect-file soft-dirty clear after marking RO # ok 13 Test test_mprotect-file soft-dirty clear after marking RW # ok 14 Test test_mprotect-file soft-dirty after rewritten # ok 15 Test test_merge-anon soft-dirty after remap merge 1st pg # ok 16 Test test_merge-anon soft-dirty after remap merge 2nd pg # ok 17 Test test_merge-anon soft-dirty after mprotect merge 1st pg # ok 18 Test test_merge-anon soft-dirty after mprotect merge 2nd pg # # 1 skipped test(s) detected. Consider enabling relevant config options to improve coverage. 
# # Planned tests != run tests (19 != 18) # # Totals: pass:17 fail:0 xfail:0 xpass:0 skip:1 error:0 # [FAIL] not ok 52 soft-dirty # exit=1 With the fix (run test on thp disabled kernel): ./run_vmtests.sh -t soft_dirty # -------------------- # running ./soft-dirty # TAP version 13 # -------------------- # running ./soft-dirty # -------------------- # TAP version 13 # 1..19 # ok 1 Test test_simple # ok 2 Test test_vma_reuse dirty bit of allocated page # ok 3 Test test_vma_reuse dirty bit of reused address page # # Transparent Hugepages not available # ok 4 # SKIP Test test_hugepage huge page allocation # ok 5 # SKIP Test test_hugepage huge page dirty bit # ok 6 Test test_mprotect-anon dirty bit of new written page # ok 7 Test test_mprotect-anon soft-dirty clear after clear_refs # ok 8 Test test_mprotect-anon soft-dirty clear after marking RO # ok 9 Test test_mprotect-anon soft-dirty clear after marking RW # ok 10 Test test_mprotect-anon soft-dirty after rewritten # ok 11 Test test_mprotect-file dirty bit of new written page # ok 12 Test test_mprotect-file soft-dirty clear after clear_refs # ok 13 Test test_mprotect-file soft-dirty clear after marking RO # ok 14 Test test_mprotect-file soft-dirty clear after marking RW # ok 15 Test test_mprotect-file soft-dirty after rewritten # ok 16 Test test_merge-anon soft-dirty after remap merge 1st pg # ok 17 Test test_merge-anon soft-dirty after remap merge 2nd pg # ok 18 Test test_merge-anon soft-dirty after mprotect merge 1st pg # ok 19 Test test_merge-anon soft-dirty after mprotect merge 2nd pg # # 2 skipped test(s) detected. Consider enabling relevant config options to improve coverage. 
# # Totals: pass:17 fail:0 xfail:0 xpass:0 skip:2 error:0 # [PASS] ok 1 soft-dirty hwpoison_inject # SUMMARY: PASS=1 SKIP=0 FAIL=0 1..1 Link: https://lore.kernel.org/20260402014543.1671131-3-chuhu@redhat.com Signed-off-by: Chunyu Hu <chuhu@redhat.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Li Wang <liwang@redhat.com> Cc: Nico Pache <npache@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:54 -07:00
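The plan mismatch in the soft-dirty commit above comes down to counting: every planned test needs exactly one result line, including skipped ones. A minimal sketch of that bookkeeping, assuming a hypothetical miniature reporter (`struct tap`, `tap_plan()`, `tap_skip()` and `tap_plan_met()` are illustrative names, not the kselftest API):

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical miniature TAP reporter; not the kselftest implementation. */
struct tap {
	unsigned int planned;	/* number announced in the "1..N" plan line */
	unsigned int run;	/* result lines emitted so far */
	unsigned int skipped;	/* how many of those were skips */
};

static void tap_plan(struct tap *t, unsigned int n)
{
	assert(t);
	t->planned = n;
	t->run = 0;
	t->skipped = 0;
	printf("1..%u\n", n);
}

/* A skip still consumes one slot of the plan. */
static void tap_skip(struct tap *t, const char *name)
{
	t->run++;
	t->skipped++;
	printf("ok %u # SKIP %s\n", t->run, name);
}

/* Returns 1 when every planned test produced exactly one result line. */
static int tap_plan_met(const struct tap *t)
{
	return t->run == t->planned;
}

/* Skip a group of two sub-tests: one result line per planned sub-test. */
static void skip_hugepage_tests(struct tap *t)
{
	tap_skip(t, "Test test_hugepage huge page allocation");
	tap_skip(t, "Test test_hugepage huge page dirty bit");
}
```

Reporting a single skip for the two-test group would leave `run` one short of `planned`, which is the "Planned tests != run tests" condition shown in the log before the fix.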
Chunyu Hu	fb0fca46b9	selftests/mm/guard-regions: skip collapse test when thp not enabled Patch series "selftests/mm: skip several tests when thp is not available", v8. Several tests require transparent hugepages; when run on a thp disabled kernel such as a realtime kernel, they produce false negatives. Mark those tests as skipped when thp is not available. This patch (of 6): When thp is not available, just skip the collapse tests to avoid the false negative. Without the change, run with a thp disabled kernel: ./run_vmtests.sh -t madv_guard -n 1 <snip/> # RUN guard_regions.anon.collapse ... # guard-regions.c:2217:collapse:Expected madvise(ptr, size, MADV_NOHUGEPAGE) (-1) == 0 (0) # collapse: Test terminated by assertion # FAIL guard_regions.anon.collapse not ok 2 guard_regions.anon.collapse <snip/> # RUN guard_regions.shmem.collapse ... # guard-regions.c:2217:collapse:Expected madvise(ptr, size, MADV_NOHUGEPAGE) (-1) == 0 (0) # collapse: Test terminated by assertion # FAIL guard_regions.shmem.collapse not ok 32 guard_regions.shmem.collapse <snip/> # RUN guard_regions.file.collapse ... # guard-regions.c:2217:collapse:Expected madvise(ptr, size, MADV_NOHUGEPAGE) (-1) == 0 (0) # collapse: Test terminated by assertion # FAIL guard_regions.file.collapse not ok 62 guard_regions.file.collapse <snip/> # FAILED: 87 / 90 tests passed. # 17 skipped test(s) detected. Consider enabling relevant config options to improve coverage. # Totals: pass:70 fail:3 xfail:0 xpass:0 skip:17 error:0 With this change, run with thp disabled kernel: ./run_vmtests.sh -t madv_guard -n 1 <snip/> # RUN guard_regions.anon.collapse ... # SKIP Transparent Hugepages not available # OK guard_regions.anon.collapse ok 2 guard_regions.anon.collapse # SKIP Transparent Hugepages not available <snip/> # RUN guard_regions.file.collapse ... 
# SKIP Transparent Hugepages not available # OK guard_regions.file.collapse ok 62 guard_regions.file.collapse # SKIP Transparent Hugepages not available <snip/> # RUN guard_regions.shmem.collapse ... # SKIP Transparent Hugepages not available # OK guard_regions.shmem.collapse ok 32 guard_regions.shmem.collapse # SKIP Transparent Hugepages not available <snip/> # PASSED: 90 / 90 tests passed. # 20 skipped test(s) detected. Consider enabling relevant config options to improve coverage. # Totals: pass:70 fail:0 xfail:0 xpass:0 skip:20 error:0 Link: https://lore.kernel.org/20260402014543.1671131-1-chuhu@redhat.com Link: https://lore.kernel.org/20260402014543.1671131-2-chuhu@redhat.com Signed-off-by: Chunyu Hu <chuhu@redhat.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Li Wang <liwang@redhat.com> Cc: Nico Pache <npache@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:54 -07:00
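The skip-instead-of-fail pattern used throughout this series can be sketched as below. This is a simplified illustration, not the selftest code: `thp_sysfs_present()` and `run_collapse_test()` are hypothetical names, and treating an unreadable `/sys/kernel/mm/transparent_hugepage/enabled` file as "THP not available" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Standard sysfs control file for transparent hugepages. */
#define THP_SYSFS_ENABLED "/sys/kernel/mm/transparent_hugepage/enabled"

/* Hypothetical helper: returns 1 if the given sysfs file is readable. */
static int thp_sysfs_present(const char *path)
{
	assert(path);
	return access(path, R_OK) == 0;
}

/*
 * Hypothetical test wrapper. Returns 1 if the test body ran and 0 if the
 * test was skipped because the feature file is missing.
 */
static int run_collapse_test(const char *thp_path)
{
	if (!thp_sysfs_present(thp_path)) {
		/* Report a skip rather than letting a later assertion fail. */
		printf("ok 1 # SKIP Transparent Hugepages not available\n");
		return 0;
	}

	/* A real test would exercise madvise() on a mapping here. */
	printf("ok 1 collapse\n");
	return 1;
}
```

A test binary would call `run_collapse_test(THP_SYSFS_ENABLED)`; the path is a parameter only so the sketch can be exercised against arbitrary files.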
Mike Rapoport (Microsoft)	6ab703034f	userfaultfd: mfill_atomic(): remove retry logic Since __mfill_atomic_pte() handles the retry for both anonymous and shmem, there is no need to retry copying the data from the userspace in the loop in mfill_atomic(). Drop the retry logic from mfill_atomic(). [rppt@kernel.org: remove safety measure of not returning ENOENT from _copy] Link: https://lore.kernel.org/ac5zcDUY8CFHr6Lw@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-12-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:54 -07:00
Mike Rapoport (Microsoft)	f74991b4e3	shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops Add filemap_add() and filemap_remove() methods to vm_uffd_ops and use them in __mfill_atomic_pte() to add shmem folios to page cache and remove them in case of error. Implement these methods in shmem along with vm_uffd_ops->alloc_folio() and drop shmem_mfill_atomic_pte(). Since userfaultfd now does not reference any functions from shmem, drop the include of linux/shmem_fs.h from mm/userfaultfd.c. mfill_atomic_install_pte() is not used anywhere outside of mm/userfaultfd, so make it static. Link: https://lore.kernel.org/20260402041156.1377214-11-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: James Houghton <jthoughton@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:54 -07:00
Mike Rapoport (Microsoft)	ad9ac30813	userfaultfd: introduce vm_uffd_ops->alloc_folio() and use it to refactor mfill_atomic_pte_zeroed_folio() and mfill_atomic_pte_copy(). mfill_atomic_pte_zeroed_folio() and mfill_atomic_pte_copy() perform almost identical actions: * allocate a folio * update folio contents (either copy from userspace or fill with zeros) * update page tables with the new folio Split a __mfill_atomic_pte() helper that handles both cases and uses newly introduced vm_uffd_ops->alloc_folio() to allocate the folio. Pass the ops structure from the callers to __mfill_atomic_pte() to later allow using anon_uffd_ops for MAP_PRIVATE mappings of file-backed VMAs. Note that the new ops method is called alloc_folio() rather than folio_alloc() to avoid a clash with the alloc_tag macro folio_alloc(). Link: https://lore.kernel.org/20260402041156.1377214-10-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: James Houghton <jthoughton@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:54 -07:00
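The three shared steps listed in the alloc_folio() commit above (allocate, fill, install) can be illustrated with a user-space mock. Everything here is an assumption made for illustration: the `mock_*` types, the `mock_mfill_pte()` name and its signature are hypothetical and do not reflect the kernel's actual `__mfill_atomic_pte()` or `vm_uffd_ops` definitions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MOCK_PAGE_SIZE 64	/* tiny "page" for the illustration */

enum mock_fill_mode { MOCK_FILL_ZERO, MOCK_FILL_COPY };

struct mock_folio { unsigned char data[MOCK_PAGE_SIZE]; };

/* Hypothetical ops table with a single allocation callback. */
struct mock_ops { struct mock_folio *(*alloc_folio)(void); };

/* Stand-in for a page table entry: NULL means "not populated". */
struct mock_pte { struct mock_folio *folio; };

static struct mock_folio *mock_alloc(void)
{
	return malloc(sizeof(struct mock_folio));
}

/* One helper for both cases; only the "fill" step differs. */
static int mock_mfill_pte(const struct mock_ops *ops, struct mock_pte *pte,
			  enum mock_fill_mode mode, const void *src)
{
	struct mock_folio *folio;

	assert(ops && ops->alloc_folio && pte);

	if (pte->folio)				/* already populated */
		return -1;
	if (mode == MOCK_FILL_COPY && !src)	/* nothing to copy from */
		return -1;

	folio = ops->alloc_folio();		/* step 1: allocate */
	if (!folio)
		return -1;

	if (mode == MOCK_FILL_COPY)		/* step 2: fill */
		memcpy(folio->data, src, MOCK_PAGE_SIZE);
	else
		memset(folio->data, 0, MOCK_PAGE_SIZE);

	pte->folio = folio;			/* step 3: install */
	return 0;
}
```

Passing the ops table in from the caller, as the commit describes, is what lets a different memory type supply its own allocator without changing the helper.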
Mike Rapoport (Microsoft)	dfc4d77182	shmem, userfaultfd: use a VMA callback to handle UFFDIO_CONTINUE When userspace resolves a page fault in a shmem VMA with UFFDIO_CONTINUE it needs to get a folio that already exists in the pagecache backing that VMA. Instead of using shmem_get_folio() for that, add a get_folio_noalloc() method to 'struct vm_uffd_ops' that will return a folio if it exists in the VMA's pagecache at given pgoff. Implement get_folio_noalloc() method for shmem and slightly refactor userfaultfd's mfill_get_vma() and mfill_atomic_pte_continue() to support this new API. Link: https://lore.kernel.org/20260402041156.1377214-9-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: James Houghton <jthoughton@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:54 -07:00
Mike Rapoport (Microsoft)	0f48947c42	userfaultfd: introduce vm_uffd_ops Current userfaultfd implementation works only with memory managed by core MM: anonymous, shmem and hugetlb. First, there is no fundamental reason to limit userfaultfd support only to the core memory types and userfaults can be handled similarly to regular page faults provided a VMA owner implements appropriate callbacks. Second, historically various code paths were conditioned on vma_is_anonymous(), vma_is_shmem() and is_vm_hugetlb_page() and some of these conditions can be expressed as operations implemented by a particular memory type. Introduce vm_uffd_ops extension to vm_operations_struct that will delegate memory type specific operations to a VMA owner. Operations for anonymous memory are handled internally in userfaultfd using anon_uffd_ops that is implicitly assigned to anonymous VMAs. Start with a single operation, ->can_userfault() that will verify that a VMA meets requirements for userfaultfd support at registration time. Implement that method for anonymous, shmem and hugetlb and move relevant parts of vma_can_userfault() into the new callbacks. 
[rppt@kernel.org: relocate VM_DROPPABLE test, per Tal] Link: https://lore.kernel.org/adffgfM5ANxtPIEF@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-8-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Cc: Tal Zussman <tz2294@columbia.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:53 -07:00
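The delegation idea in the vm_uffd_ops commit above, where a per-memory-type callback replaces open-coded type checks and anonymous memory gets an implicit default table, can be sketched with a user-space mock. All `mock_*` names, the mode flags and the callback signature are assumptions made for illustration; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical registration modes, for the illustration only. */
#define MOCK_MODE_MISSING 0x1UL
#define MOCK_MODE_MINOR   0x2UL

struct mock_vma;

/* Mock ops table; the callback signature is an assumption. */
struct mock_uffd_ops {
	bool (*can_userfault)(const struct mock_vma *vma, unsigned long modes);
};

struct mock_vma {
	bool anonymous;				/* no owner-provided ops */
	const struct mock_uffd_ops *uffd_ops;	/* set by the VMA owner */
};

/* In this mock, anonymous memory accepts everything except minor mode. */
static bool mock_anon_can_userfault(const struct mock_vma *vma,
				    unsigned long modes)
{
	(void)vma;
	return !(modes & MOCK_MODE_MINOR);
}

/* In this mock, a page-cache backed type accepts both modes. */
static bool mock_file_can_userfault(const struct mock_vma *vma,
				    unsigned long modes)
{
	(void)vma;
	(void)modes;
	return true;
}

static const struct mock_uffd_ops mock_anon_ops = {
	.can_userfault = mock_anon_can_userfault,
};

static const struct mock_uffd_ops mock_file_ops = {
	.can_userfault = mock_file_can_userfault,
};

/* Anonymous VMAs implicitly get the internal default table. */
static const struct mock_uffd_ops *mock_vma_uffd_ops(const struct mock_vma *vma)
{
	return vma->anonymous ? &mock_anon_ops : vma->uffd_ops;
}

/* The generic check no longer needs to know about memory types. */
static bool mock_vma_can_userfault(const struct mock_vma *vma,
				   unsigned long modes)
{
	const struct mock_uffd_ops *ops;

	assert(vma);
	ops = mock_vma_uffd_ops(vma);
	if (!ops || !ops->can_userfault)
		return false;	/* owner did not opt in */
	return ops->can_userfault(vma, modes);
}
```

A VMA whose owner provides no table is simply rejected, which is how a new memory type would opt in without touching the generic code.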
Mike Rapoport (Microsoft)	a5bb866987	userfaultfd: move vma_can_userfault out of line vma_can_userfault() has grown pretty big and it's not called on performance critical path. Move it out of line. No functional changes. Link: https://lore.kernel.org/20260402041156.1377214-7-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:53 -07:00
Mike Rapoport (Microsoft)	f5f035a724	userfaultfd: retry copying with locks dropped in mfill_atomic_pte_copy() Implementation of UFFDIO_COPY for anonymous memory might fail to copy data from userspace buffer when the destination VMA is locked (either with mm_lock or with per-VMA lock). In that case, mfill_atomic() releases the locks, retries copying the data with locks dropped and then re-locks the destination VMA and re-establishes PMD. Since this retry-reget dance is only relevant for UFFDIO_COPY and it never happens for other UFFDIO_ operations, make it a part of mfill_atomic_pte_copy() that actually implements UFFDIO_COPY for anonymous memory. As a temporary safety measure to avoid breaking bisection, mfill_atomic_pte_copy() makes sure to never return -ENOENT so that the loop in mfill_atomic() won't retry copying outside of mmap_lock. This is removed later, when the shmem implementation is updated and the loop in mfill_atomic() is adjusted. [akpm@linux-foundation.org: update mfill_copy_folio_retry()] Link: https://lore.kernel.org/20260316173829.1126728-1-avagin@google.com Link: https://lore.kernel.org/20260306171815.3160826-6-rppt@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-6-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean 
Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Cc: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:53 -07:00
Mike Rapoport (Microsoft)	b8c03b7f45	userfaultfd: introduce mfill_get_vma() and mfill_put_vma() Split the code that finds, locks and verifies VMA from mfill_atomic() into a helper function. This function will be used later during refactoring of mfill_atomic_pte_copy(). Add a counterpart mfill_put_vma() helper that unlocks the VMA and releases map_changing_lock. [avagin@google.com: fix lock leak in mfill_get_vma()] Link: https://lore.kernel.org/20260316173829.1126728-1-avagin@google.com Link: https://lore.kernel.org/20260402041156.1377214-5-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrei Vagin <avagin@google.com> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:53 -07:00
Mike Rapoport (Microsoft)	e2e0b826d3	userfaultfd: introduce mfill_establish_pmd() helper There is a lengthy code chunk in mfill_atomic() that establishes the PMD for UFFDIO operations. This code may be called twice: first time when the copy is performed with VMA/mm locks held and the other time after the copy is retried with locks dropped. Move the code that establishes a PMD into a helper function so it can be reused later during refactoring of mfill_atomic_pte_copy(). Link: https://lore.kernel.org/20260402041156.1377214-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:53 -07:00
Mike Rapoport (Microsoft)	db0062d2c0	userfaultfd: introduce struct mfill_state mfill_atomic() passes a lot of parameters down to its callees. Aggregate them all into mfill_state structure and pass this structure to functions that implement various UFFDIO_ commands. Tracking the state in a structure will allow moving the code that retries copying of data for UFFDIO_COPY into mfill_atomic_pte_copy() and make the loop in mfill_atomic() identical for all UFFDIO operations on PTE-mapped memory. The mfill_state definition is deliberately local to mm/userfaultfd.c, hence shmem_mfill_atomic_pte() is not updated. [harry.yoo@oracle.com: properly initialize mfill_state.len to fix folio_add_new_anon_rmap() WARN] Link: https://lore.kernel.org/abehBY7QakYF9bK4@hyeyoo Link: https://lore.kernel.org/20260402041156.1377214-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:53 -07:00
Mike Rapoport (Microsoft)	c0620487fc	userfaultfd: introduce mfill_copy_folio_locked() helper Patch series "mm, kvm: allow uffd support in guest_memfd", v4. These patches enable support for userfaultfd in guest_memfd. As the groundwork I refactored userfaultfd handling of PTE-based memory types (anonymous and shmem) and converted them to use vm_uffd_ops for allocating a folio or getting an existing folio from the page cache. shmem also implements callbacks that add a folio to the page cache after the data passed in UFFDIO_COPY was copied and remove the folio from the page cache if page table update fails. In order for guest_memfd to notify userspace about page faults, there are new VM_FAULT_UFFD_MINOR and VM_FAULT_UFFD_MISSING that a ->fault() handler can return to inform the page fault handler that it needs to call handle_userfault() to complete the fault. Nikita helped to plumb these new goodies into guest_memfd and provided basic tests to verify that guest_memfd works with userfaultfd. The handling of UFFDIO_MISSING in guest_memfd requires ability to remove a folio from page cache, the best way I could find was exporting filemap_remove_folio() to KVM. I deliberately left hugetlb out, at least for the most part. hugetlb handles acquisition of VMA and more importantly establishing of parent page table entry differently than PTE-based memory types. This is a different abstraction level than what vm_uffd_ops provides and people objected to exposing such low level APIs as a part of VMA operations. Also, to enable uffd in guest_memfd refactoring of hugetlb is not needed and I prefer to delay it until the dust settles after the changes in this set. This patch (of 4): Split the copying of data with locks held out of mfill_atomic_pte_copy() into a helper function mfill_copy_folio_locked(). This improves code readability and makes the complex mfill_atomic_pte_copy() function easier to comprehend. No functional change. 
Link: https://lore.kernel.org/20260402041156.1377214-1-rppt@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:53 -07:00
Chenghao Duan	dc44f32fde	mm/memfd_luo: remove folio from page cache when accounting fails In memfd_luo_retrieve_folios(), when shmem_inode_acct_blocks() fails after successfully adding the folio to the page cache, the code jumps to unlock_folio without removing the folio from the page cache. While the folio eventually will be freed when the file is released by memfd_luo_retrieve(), it is a good idea to directly remove a folio that was not fully added to the file. This avoids the possibility of accounting mismatches in shmem or filemap core. Fix by adding a remove_from_cache label that calls filemap_remove_folio() before unlocking, matching the error handling pattern in shmem_alloc_and_add_folio(). This issue was identified by AI review: https://sashiko.dev/#/patchset/20260323110747.193569-1-duanchenghao@kylinos.cn [pratyush@kernel.org: changelog alterations] Link: https://lore.kernel.org/2vxzzf3lfujq.fsf@kernel.org Link: https://lore.kernel.org/20260326084727.118437-7-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:53 -07:00
Chenghao Duan	3538f90ab8	mm/memfd_luo: fix physical address conversion in put_folios cleanup In memfd_luo_retrieve_folios()'s put_folios cleanup path: 1. kho_restore_folio() expects a phys_addr_t (physical address) but receives a raw PFN (pfolio->pfn). This causes kho_restore_page() to check the wrong physical address (pfn << PAGE_SHIFT instead of the actual physical address). 2. This loop lacks the !pfolio->pfn check that exists in the main retrieval loop and memfd_luo_discard_folios(), which could incorrectly process sparse file holes where pfn=0. Fix by converting PFN to physical address with PFN_PHYS() and adding the !pfolio->pfn check, matching the pattern used elsewhere in this file. This issue was identified by the AI review. https://sashiko.dev/#/patchset/20260323110747.193569-1-duanchenghao@kylinos.cn Link: https://lore.kernel.org/20260326084727.118437-6-duanchenghao@kylinos.cn Fixes: `b3749f174d` ("mm: memfd_luo: allow preserving memfd") Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:53 -07:00
Chenghao Duan	32f6cec5e7	mm/memfd_luo: use i_size_write() to set inode size during retrieve Use i_size_write() instead of directly assigning to inode->i_size when restoring the memfd size in memfd_luo_retrieve(), to keep code consistency. No functional change intended. Link: https://lore.kernel.org/20260326084727.118437-5-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Pratyush Yadav <pratyush@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:53 -07:00
Chenghao Duan	4aa6424f37	mm/memfd_luo: remove unnecessary memset in zero-size memfd path The memset(kho_vmalloc, 0, sizeof(*kho_vmalloc)) call in the zero-size file handling path is unnecessary because the allocation of the ser structure already uses the __GFP_ZERO flag, ensuring the memory is already zero-initialized. Link: https://lore.kernel.org/20260326084727.118437-4-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:52 -07:00
Chenghao Duan	502d3c2ad8	mm/memfd_luo: optimize shmem_recalc_inode calls in retrieve path Move shmem_recalc_inode() out of the loop in memfd_luo_retrieve_folios() to improve performance when restoring large memfds. Currently, shmem_recalc_inode() is called for each folio during restore, which is O(n) expensive operations. This patch collects the number of successfully added folios and calls shmem_recalc_inode() once after the loop completes, reducing complexity to O(1). Additionally, fix the error path to also call shmem_recalc_inode() for the folios that were successfully added before the error occurred. Link: https://lore.kernel.org/20260326084727.118437-3-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:52 -07:00
Chenghao Duan	ed2a29dc6d	mm/memfd: use folio_nr_pages() for shmem inode accounting I found several modifiable points while reading the code. This patch (of 6): Patch series "Modify memfd_luo code", v3. memfd_luo_retrieve_folios() called shmem_inode_acct_blocks() and shmem_recalc_inode() with hardcoded 1 instead of the actual folio page count. memfd may use large folios (THP/hugepages), causing quota/limit under-accounting and incorrect stat output. Fix by using folio_nr_pages(folio) for both functions. Issue found by AI review and suggested by Pratyush Yadav <pratyush@kernel.org>. https://sashiko.dev/#/patchset/20260319012845.29570-1-duanchenghao%40kylinos.cn Link: https://lore.kernel.org/20260326084727.118437-1-duanchenghao@kylinos.cn Link: https://lore.kernel.org/20260326084727.118437-2-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Suggested-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:52 -07:00
Muchun Song	7cf6d940f4	mm/sparse: fix preinited section_mem_map clobbering on failure path sparse_init_nid() is careful to leave alone every section whose vmemmap has already been set up by sparse_vmemmap_init_nid_early(); it only clears section_mem_map for the rest: if (!preinited_vmemmap_section(ms)) ms->section_mem_map = 0; A leftover line after that conditional block ms->section_mem_map = 0; was supposed to be deleted but was missed in the failure path, causing the field to be overwritten for all sections when memory allocation fails, effectively destroying the pre-initialization check. Drop the stray assignment so that preinited sections retain their already valid state. Those pre-inited sections (HugeTLB pages) are not activated. However, such failures are extremely rare, so I don't see any major userspace issues. Link: https://lore.kernel.org/20260331113724.2080833-1-songmuchun@bytedance.com Fixes: `d65917c423` ("mm/sparse: allow for alternate vmemmap section init at boot") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed by: Donet Tom <donettom@linux.ibm.com> Cc: David Hildenbrand <david@kernel.org> Cc: Frank van der Linden <fvdl@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:52 -07:00
Sergey Senozhatsky	e3668b3713	zram: do not forget to endio for partial discard requests As reported by Qu Wenruo and Avinesh Kumar, the following getconf PAGESIZE 65536 blkdiscard -p 4k /dev/zram0 takes literally forever to complete. zram doesn't support partial discards and just returns immediately w/o doing any discard work in such cases. The problem is that we forget to endio on our way out, so blkdiscard sleeps forever in submit_bio_wait(). Fix this by jumping to end_bio label, which does bio_endio(). Link: https://lore.kernel.org/20260331074255.777019-1-senozhatsky@chromium.org Fixes: `0120dd6e4e` ("zram: make zram_bio_discard more self-contained") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reported-by: Qu Wenruo <wqu@suse.com> Closes: https://lore.kernel.org/linux-block/92361cd3-fb8b-482e-bc89-15ff1acb9a59@suse.com Tested-by: Qu Wenruo <wqu@suse.com> Reported-by: Avinesh Kumar <avinesh.kumar@suse.com> Closes: https://bugzilla.suse.com/show_bug.cgi?id=1256530 Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Brian Geffon <bgeffon@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:52 -07:00
Alistair Popple	af69016dab	lib: test_hmm: implement a device release method Unloading the HMM test module produces the following warning: [ 3782.224783] ------------[ cut here ]------------ [ 3782.226323] Device 'hmm_dmirror0' does not have a release() function, it is broken and must be fixed. See Documentation/core-api/kobject.rst. [ 3782.230570] WARNING: drivers/base/core.c:2567 at device_release+0x185/0x210, CPU#20: rmmod/1924 [ 3782.233949] Modules linked in: test_hmm(-) nvidia_uvm(O) nvidia(O) [ 3782.236321] CPU: 20 UID: 0 PID: 1924 Comm: rmmod Tainted: G O 7.0.0-rc1+ #374 PREEMPT(full) [ 3782.240226] Tainted: [O]=OOT_MODULE [ 3782.241639] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 [ 3782.246193] RIP: 0010:device_release+0x185/0x210 [ 3782.247860] Code: 00 00 fc ff df 48 8d 7b 50 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 86 00 00 00 48 8b 73 50 48 85 f6 74 11 48 8d 3d db 25 29 03 <67> 48 0f b9 3a e9 0d ff ff ff 48 b8 00 00 00 00 00 fc ff df 48 89 [ 3782.254211] RSP: 0018:ffff888126577d98 EFLAGS: 00010246 [ 3782.256054] RAX: dffffc0000000000 RBX: ffffffffc2b70310 RCX: ffffffff8fe61ba1 [ 3782.258512] RDX: 1ffffffff856e062 RSI: ffff88811341eea0 RDI: ffffffff91bbacb0 [ 3782.261041] RBP: ffff888111475000 R08: 0000000000000001 R09: fffffbfff856e069 [ 3782.263471] R10: ffffffffc2b7034b R11: 00000000ffffffff R12: 0000000000000000 [ 3782.265983] R13: dffffc0000000000 R14: ffff88811341eea0 R15: 0000000000000000 [ 3782.268443] FS: 00007fd5a3689040(0000) GS:ffff88842c8d0000(0000) knlGS:0000000000000000 [ 3782.271236] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3782.273251] CR2: 00007fd5a36d2c10 CR3: 00000001242b8000 CR4: 00000000000006f0 [ 3782.275362] Call Trace: [ 3782.276071] <TASK> [ 3782.276678] kobject_put+0x146/0x270 [ 3782.277731] hmm_dmirror_exit+0x7a/0x130 [test_hmm] [ 3782.279135] __do_sys_delete_module+0x341/0x510 [ 3782.280438] ? 
module_flags+0x300/0x300 [ 3782.281547] do_syscall_64+0x111/0x670 [ 3782.282620] entry_SYSCALL_64_after_hwframe+0x4b/0x53 [ 3782.284091] RIP: 0033:0x7fd5a3793b37 [ 3782.285303] Code: 73 01 c3 48 8b 0d c9 82 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 99 82 0c 00 f7 d8 64 89 01 48 [ 3782.290708] RSP: 002b:00007ffd68b7dc68 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 [ 3782.292817] RAX: ffffffffffffffda RBX: 000055e3c0d1c770 RCX: 00007fd5a3793b37 [ 3782.294735] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055e3c0d1c7d8 [ 3782.296661] RBP: 0000000000000000 R08: 1999999999999999 R09: 0000000000000000 [ 3782.298622] R10: 00007fd5a3806ac0 R11: 0000000000000206 R12: 00007ffd68b7deb0 [ 3782.300576] R13: 00007ffd68b7e781 R14: 000055e3c0d1b2a0 R15: 00007ffd68b7deb8 [ 3782.301963] </TASK> [ 3782.302371] irq event stamp: 5019 [ 3782.302987] hardirqs last enabled at (5027): [<ffffffff8cf1f062>] __up_console_sem+0x52/0x60 [ 3782.304507] hardirqs last disabled at (5036): [<ffffffff8cf1f047>] __up_console_sem+0x37/0x60 [ 3782.306086] softirqs last enabled at (4940): [<ffffffff8cd9a4b0>] __irq_exit_rcu+0xc0/0xf0 [ 3782.307567] softirqs last disabled at (4929): [<ffffffff8cd9a4b0>] __irq_exit_rcu+0xc0/0xf0 [ 3782.309105] ---[ end trace 0000000000000000 ]--- This is because the test module doesn't have a device.release method. In this case one probably isn't needed for correctness - the device structs are in a static array so don't need freeing when the final reference goes away. However some device state is freed on exit, so to ensure this happens at the right time and to silence the warning move the deinitialisation to a release method and assign that as the device release callback. Whilst here also fix a minor error handling bug where cdev_device_del() wasn't being called if allocation failed. 
Link: https://lore.kernel.org/20260331063445.3551404-4-apopple@nvidia.com Fixes: `6a760f58c7` ("mm/hmm/test: use char dev with struct device to get device node") Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: Balbir Singh <balbirs@nvidia.com> Tested-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:52 -07:00
Alistair Popple	f9d7975c52	selftests/mm: hmm-tests: don't hardcode THP size to 2MB Several HMM tests hardcode TWOMEG as the THP size. This is wrong on architectures where the PMD size is not 2MB such as arm64 with 64K base pages where THP is 512MB. Fix this by using read_pmd_pagesize() from vm_util instead. While here also replace the custom file_read_ulong() helper used to parse the default hugetlbfs page size from /proc/meminfo with the existing default_huge_page_size() from vm_util. Link: https://lore.kernel.org/20260331063445.3551404-3-apopple@nvidia.com Link: https://lore.kernel.org/linux-mm/8bd0396a-8997-4d2e-a13f-5aac033083d7@linux.dev/ Fixes: `fee9f6d1b8` ("mm/hmm/test: add selftests for HMM") Fixes: `519071529d` ("selftests/mm/hmm-tests: new tests for zone device THP migration") Signed-off-by: Alistair Popple <apopple@nvidia.com> Reported-by: Zenghui Yu <zenghui.yu@linux.dev> Closes: https://lore.kernel.org/linux-mm/8bd0396a-8997-4d2e-a13f-5aac033083d7@linux.dev/ Reviewed-by: Balbir Singh <balbirs@nvidia.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:52 -07:00
Alistair Popple	744dd97752	lib: test_hmm: evict device pages on file close to avoid use-after-free Patch series "Minor hmm_test fixes and cleanups". Two bugfixes and a cleanup for the HMM kernel selftests. These were mostly reported by Zenghui Yu with special thanks to Lorenzo for analysing and pointing out the problems. This patch (of 3): When dmirror_fops_release() is called it frees the dmirror struct but doesn't migrate device private pages back to system memory first. This leaves those pages with a dangling zone_device_data pointer to the freed dmirror. If a subsequent fault occurs on those pages (e.g. during coredump) the dmirror_devmem_fault() callback dereferences the stale pointer causing a kernel panic. This was reported [1] when running mm/ksft_hmm.sh on arm64, where a test failure triggered SIGABRT and the resulting coredump walked the VMAs faulting in the stale device private pages. Fix this by calling dmirror_device_evict_chunk() for each devmem chunk in dmirror_fops_release() to migrate all device private pages back to system memory before freeing the dmirror struct. The function is moved earlier in the file to avoid a forward declaration. 
Link: https://lore.kernel.org/20260331063445.3551404-1-apopple@nvidia.com Link: https://lore.kernel.org/20260331063445.3551404-2-apopple@nvidia.com Fixes: `b2ef9f5a5c` ("mm/hmm/test: add selftest driver for HMM") Signed-off-by: Alistair Popple <apopple@nvidia.com> Reported-by: Zenghui Yu <zenghui.yu@linux.dev> Closes: https://lore.kernel.org/linux-mm/8bd0396a-8997-4d2e-a13f-5aac033083d7@linux.dev/ Reviewed-by: Balbir Singh <balbirs@nvidia.com> Tested-by: Zenghui Yu <zenghui.yu@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zenghui Yu <zenghui.yu@linux.dev> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:52 -07:00
Li Wang	047a6d4940	selftests/mm: skip hugetlb_dio tests when DIO alignment is incompatible hugetlb_dio test uses sub-page offsets (pagesize / 2) to verify that hugepages used as DIO user buffers are correctly unpinned at completion. However, on filesystems with a logical block size larger than half the page size (e.g., 4K-sector block devices), these unaligned DIO writes are rejected with -EINVAL, causing the test to fail unexpectedly. Add get_dio_alignment() to query the filesystem's required DIO alignment via statx(STATX_DIOALIGN) and skip individual test cases whose file offset or write size is not a multiple of that alignment. Aligned cases continue to run so the core coverage is preserved. While here, open the temporary file once in main() and share the fd across all test cases instead of reopening it in each invocation. === Reproduce Steps === # dd if=/dev/zero of=/tmp/test.img bs=1M count=512 # losetup --sector-size 4096 /dev/loop0 /tmp/test.img # mkfs.xfs /dev/loop0 # mkdir -p /mnt/dio_test # mount /dev/loop0 /mnt/dio_test // Modify test to open /mnt/dio_test and rebuild it: - fd = open("/tmp", O_TMPFILE \| O_RDWR \| O_DIRECT, 0664); + fd = open("/mnt/dio_test", O_TMPFILE \| O_RDWR \| O_DIRECT, 0664); # getconf PAGESIZE 4096 # echo 100 >/proc/sys/vm/nr_hugepages # ./hugetlb_dio TAP version 13 1..4 # No. Free pages before allocation : 100 # No. Free pages after munmap : 100 ok 1 free huge pages from 0-12288 Bail out! 
Error writing to file : Invalid argument (22) # Planned tests != run tests (4 != 1) # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 Link: https://lore.kernel.org/20260401090520.24018-1-liwang@redhat.com Signed-off-by: Li Wang <liwang@redhat.com> Suggested-by: Mike Rapoport <rppt@kernel.org> Suggested-by: David Hildenbrand <david@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:52 -07:00
Lorenzo Stoakes (Oracle)	84f4928446	tools/testing/selftests: add merge test for partial msealed range Commit `2697dd8ae7` ("mm/mseal: update VMA end correctly on merge") fixed an issue in the loop which iterates through VMAs applying mseal, which was triggered by mseal()'ing a range of VMAs where the second was mseal()'d and the first mergeable with it, once mseal()'d. Add a regression test to assert that this behaviour is correct. We place it in the merge selftests as this is strictly an issue with merging (via a vma_modify() invocation). It also asserts that mseal()'d ranges are correctly merged as you'd expect. The test is implemented such that it is skipped if mseal() is not available on the system. [rppt@kernel.org: fix inclusions, to fix handle_uprobe_upon_merged_vma()] Link: https://lore.kernel.org/ac_mCIUQWRAbuH8F@kernel.org [ljs@kernel.org: simplifications per Pedro] Link: https://lore.kernel.org/1c9c922d-5cb5-4cff-9273-b737cdb57ca1@lucifer.local Link: https://lore.kernel.org/20260331073627.50010-1-ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Signed-off-by: Mike Rapoport <rppt@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:52 -07:00
Jackie Liu	6fae274ce0	mm/mempolicy: fix memory leaks in weighted_interleave_auto_store() weighted_interleave_auto_store() fetches old_wi_state inside the if (!input) block only. This causes two memory leaks: 1. When a user writes "false" and the current mode is already manual, the function returns early without freeing the freshly allocated new_wi_state. 2. When a user writes "true", old_wi_state stays NULL because the fetch is skipped entirely. The old state is then overwritten by rcu_assign_pointer() but never freed, since the cleanup path is gated on old_wi_state being non-NULL. A user can trigger this repeatedly by writing "1" in a loop. Fix both leaks by moving the old_wi_state fetch before the input check, making it unconditional. This also allows a unified early return for both "true" and "false" when the requested mode matches the current mode. Link: https://lore.kernel.org/20260401005702.7096-1-liu.yun@linux.dev Link: https://sashiko.dev/#/patchset/20260331100740.84906-1-liu.yun@linux.dev Fixes: `e341f9c3c8` ("mm/mempolicy: Weighted Interleave Auto-tuning") Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reviewed by: Donet Tom <donettom@linux.ibm.com> Cc: Gregory Price <gourry@gourry.net> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: David Hildenbrand <david@kernel.org> Cc: <stable@vger.kernel.org> # v6.16+ Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:51 -07:00
SeongJae Park	0c13ed77dd	Docs/admin-guide/mm/damon/lru_sort: warn commit_inputs vs param updates race DAMON_LRU_SORT handles commit_inputs request inside kdamond thread, reading the module parameters. If the user updates the module parameters while the kdamond thread is reading those, races can happen. To avoid this, the commit_inputs parameter shows whether it is still in the progress, assuming users wouldn't update parameters in the middle of the work. Some users might ignore that. Add a warning about the behavior. The issue was discovered in [1] by sashiko. Link: https://lore.kernel.org/20260329153052.46657-3-sj@kernel.org Link: https://lore.kernel.org/20260319161620.189392-2-objecting@objecting.org [1] Fixes: `6acfcd0d75` ("Docs/admin-guide/damon: add a document for DAMON_LRU_SORT") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.0.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:51 -07:00
SeongJae Park	0beba407d4	Docs/admin-guide/mm/damon/reclaim: warn commit_inputs vs param updates race Patch series "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race". Writing 'Y' to the commit_inputs parameter of DAMON_RECLAIM and DAMON_LRU_SORT, and writing other parameters before the commit_inputs request is completely processed can cause race conditions. While the consequence can be bad, the documentation is not clearly describing that. Add clear warnings. The issue was discovered [1,2] by sashiko. This patch (of 2): DAMON_RECLAIM handles commit_inputs request inside kdamond thread, reading the module parameters. If the user updates the module parameters while the kdamond thread is reading those, races can happen. To avoid this, the commit_inputs parameter shows whether it is still in the progress, assuming users wouldn't update parameters in the middle of the work. Some users might ignore that. Add a warning about the behavior. The issue was discovered in [1] by sashiko. Link: https://lore.kernel.org/20260329153052.46657-2-sj@kernel.org Link: https://lore.kernel.org/20260319161620.189392-3-objecting@objecting.org [1] Link: https://lore.kernel.org/20260319161620.189392-2-objecting@objecting.org [3] Fixes: `81a84182c3` ("Docs/admin-guide/mm/damon/reclaim: document 'commit_inputs' parameter") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 5.19.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:51 -07:00
SeongJae Park	049a57421d	mm/damon/core: use time_in_range_open() for damos quota window start damos_adjust_quota() uses time_after_eq() to show if it is time to start a new quota charge window, comparing the current jiffies and the scheduled next charge window start time. If it is, the next charge window start time is updated and the new charge window starts. The time check and next window start time update is skipped while the scheme is deactivated by the watermarks. Let's suppose the deactivation is kept more than LONG_MAX jiffies (assuming CONFIG_HZ of 250, more than 99 days in 32 bit systems and more than one billion years in 64 bit systems), resulting in having the jiffies larger than the next charge window start time + LONG_MAX. Then, the time_after_eq() call can return false until another LONG_MAX jiffies are passed. This means the scheme can continue working after being reactivated by the watermarks. But, soon, the quota will be exceeded and the scheme will again effectively stop working until the next charge window starts. Because the current charge window is extended to up to LONG_MAX jiffies, however, it will look like it stopped unexpectedly and indefinitely, from the user's perspective. Fix this by using !time_in_range_open() instead. The issue was discovered [1] by sashiko. Link: https://lore.kernel.org/20260329152306.45796-1-sj@kernel.org Link: https://lore.kernel.org/20260324040722.57944-1-sj@kernel.org [1] Fixes: `ee801b7dd7` ("mm/damon/schemes: activate schemes based on a watermarks mechanism") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 5.16.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:51 -07:00
SeongJae Park	a34dac6482	mm/damon/core: validate damos_quota_goal->nid for node_memcg_{used,free}_bp Users can set damos_quota_goal->nid with arbitrary value for node_memcg_{used,free}_bp. But DAMON core is using those for NODE_DATA() without a validation of the value. This can result in out of bounds memory access. The issue can actually be triggered using DAMON user-space tool (damo), like below. $ sudo mkdir /sys/fs/cgroup/foo $ sudo ./damo start --damos_action stat --damos_quota_interval 1s \ --damos_quota_goal node_memcg_used_bp 50% -1 /foo $ sudo dmesg [...] [ 524.181426] Unable to handle kernel paging request at virtual address 0000000000002c00 Fix this issue by adding the validation of the given node id. If an invalid node id is given, it returns 0% for used memory ratio, and 100% for free memory ratio. Link: https://lore.kernel.org/20260329043902.46163-3-sj@kernel.org Fixes: `b74a120bcf` ("mm/damon/core: implement DAMOS_QUOTA_NODE_MEMCG_USED_BP") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.19.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:51 -07:00
SeongJae Park	40250b2dde	mm/damon/core: validate damos_quota_goal->nid for node_mem_{used,free}_bp Patch series "mm/damon/core: validate damos_quota_goal->nid". node_mem[cg]_{used,free}_bp DAMOS quota goals receive the node id. The node id is used for si_meminfo_node() and NODE_DATA() without proper validation. As a result, privileged users can trigger an out of bounds memory access using DAMON_SYSFS. Fix the issues. The issue was originally reported [1] with a fix by another author. The original author announced [2] that they will stop working including the fix that was still in the review stage. Hence I'm restarting this. This patch (of 2): Users can set damos_quota_goal->nid with arbitrary value for node_mem_{used,free}_bp. But DAMON core is using those for si_meminfo_node() without the validation of the value. This can result in out of bounds memory access. The issue can actually be triggered using DAMON user-space tool (damo), like below. $ sudo ./damo start --damos_action stat \ --damos_quota_goal node_mem_used_bp 50% -1 \ --damos_quota_interval 1s $ sudo dmesg [...] [ 65.565986] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000098 Fix this issue by adding the validation of the given node. If an invalid node id is given, it returns 0% for used memory ratio, and 100% for free memory ratio. Link: https://lore.kernel.org/20260329043902.46163-2-sj@kernel.org Link: https://lore.kernel.org/20260325073034.140353-1-objecting@objecting.org [1] Link: https://lore.kernel.org/20260327040924.68553-1-sj@kernel.org [2] Fixes: `0e1c773b50` ("mm/damon/core: introduce damos quota goal metrics for memory node utilization") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.16.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:51 -07:00
Jackie Liu	e04ed278d2	mm/damon/stat: fix memory leak on damon_start() failure in damon_stat_start() Destroy the DAMON context and reset the global pointer when damon_start() fails. Otherwise, the context allocated by damon_stat_build_ctx() is leaked, and the stale damon_stat_context pointer will be overwritten on the next enable attempt, making the old allocation permanently unreachable. Link: https://lore.kernel.org/20260331101553.88422-1-liu.yun@linux.dev Fixes: `369c415e60` ("mm/damon: introduce DAMON_STAT module") Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.17.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:51 -07:00
SeongJae Park	33c3f6c2b4	mm/damon/core: fix damos_walk() vs kdamond_fn() exit race When kdamond_fn() main loop is finished, the function cancels remaining damos_walk() request and unsets the damon_ctx->kdamond so that API callers and API functions themselves can know the context is terminated. damos_walk() adds the caller's request to the queue first. After that, it checks if the kdamond of the damon_ctx is still running (damon_ctx->kdamond is set). Only if the kdamond is running, damos_walk() starts waiting for the kdamond's handling of the newly added request. The damos_walk() request registration and damon_ctx->kdamond unset are protected by different mutexes, though. Hence, damos_walk() could race with damon_ctx->kdamond unset, and result in deadlocks. For example, let's suppose kdamond successfully finished the damos_walk() request cancelling. Right after that, damos_walk() is called for the context. It registers the new request, and sees the context is still running, because damon_ctx->kdamond unset is not yet done. Hence the damos_walk() caller starts waiting for the handling of the request. However, the kdamond is already on the termination steps, so it never handles the new request. As a result, the damos_walk() caller thread waits forever. Fix this by introducing another damon_ctx field, namely walk_control_obsolete. It is protected by the damon_ctx->walk_control_lock, which protects damos_walk() request registration. Initialize (unset) it in kdamond_fn() before letting damon_start() return and set it just before the cancelling of the remaining damos_walk() request is executed. damos_walk() reads the obsolete field under the lock and avoids adding a new request. After this change, only requests that are guaranteed to be handled or cancelled are registered. Hence the after-registration DAMON context termination check is no longer needed. Remove it together. The issue is found by sashiko [1]. 
Link: https://lore.kernel.org/20260327233319.3528-3-sj@kernel.org Link: https://lore.kernel.org/20260325141956.87144-1-sj@kernel.org [1] Fixes: `bf0eaba0ff` ("mm/damon/core: implement damos_walk()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.14.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:51 -07:00
SeongJae Park	55da81663b	mm/damon/core: fix damon_call() vs kdamond_fn() exit race Patch series "mm/damon/core: fix damon_call()/damos_walk() vs kdamond exit race". damon_call() and damos_walk() can leak memory and/or deadlock when they race with kdamond terminations. Fix those. This patch (of 2): When kdamond_fn() main loop is finished, the function cancels all remaining damon_call() requests and unsets damon_ctx->kdamond so that API callers and API functions themselves can know the context is terminated. damon_call() adds the caller's request to the queue first. After that, it checks whether the kdamond of the damon_ctx is still running (damon_ctx->kdamond is set). Only if the kdamond is running, damon_call() starts waiting for the kdamond's handling of the newly added request. The damon_call() request registration and the damon_ctx->kdamond unset are protected by different mutexes, though. Hence, damon_call() could race with the damon_ctx->kdamond unset, and result in deadlocks. For example, let's suppose kdamond successfully finished the damon_call() requests cancelling. Right after that, damon_call() is called for the context. It registers the new request, and sees the context is still running, because the damon_ctx->kdamond unset is not yet done. Hence the damon_call() caller starts waiting for the handling of the request. However, the kdamond is already on the termination steps, so it never handles the new request. As a result, the damon_call() caller thread waits indefinitely. Fix this by introducing another damon_ctx field, namely call_controls_obsolete. It is protected by the damon_ctx->call_controls_lock, which protects damon_call() request registration. Initialize (unset) it in kdamond_fn() before letting damon_start() return and set it just before the cancelling of remaining damon_call() requests is executed. damon_call() reads the obsolete field under the lock and avoids adding a new request.
After this change, only requests that are guaranteed to be handled or cancelled are registered. Hence the after-registration DAMON context termination check is no longer needed. Remove it as well. Note that the deadlock will not happen when damon_call() is called for a repeat mode request. In this case, damon_call() returns instead of waiting for the handling when the request registration succeeds and it sees the kdamond is running. However, if the request also has dealloc_on_cancel, the request memory would be leaked. The issue was found by sashiko [1]. Link: https://lore.kernel.org/20260327233319.3528-1-sj@kernel.org Link: https://lore.kernel.org/20260327233319.3528-2-sj@kernel.org Link: https://lore.kernel.org/20260325141956.87144-1-sj@kernel.org [1] Fixes: `42b7491af1` ("mm/damon/core: introduce damon_call()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.14.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:51 -07:00
Kanchana P. Sridhar	ef3c0f6cb7	mm: zswap: tie per-CPU acomp_ctx lifetime to the pool Currently, per-CPU acomp_ctx are allocated on pool creation and/or CPU hotplug, and destroyed on pool destruction or CPU hotunplug. This complicates the lifetime management to save memory while a CPU is offlined, which is not very common. Simplify lifetime management by allocating per-CPU acomp_ctx once on pool creation (or CPU hotplug for CPUs onlined later), and keeping them allocated until the pool is destroyed. Refactor cleanup code from zswap_cpu_comp_dead() into acomp_ctx_free() to be used elsewhere. The main benefit of using the CPU hotplug multi state instance startup callback to allocate the acomp_ctx resources is that it prevents the cores from being offlined until the multi state instance addition call returns. From Documentation/core-api/cpu_hotplug.rst: "The node list add/remove operations and the callback invocations are serialized against CPU hotplug operations." Furthermore, zswap_[de]compress() cannot contend with zswap_cpu_comp_prepare() because: - During pool creation/deletion, the pool is not in the zswap_pools list. - During CPU hot[un]plug, the CPU is not yet online, as Yosry pointed out. zswap_cpu_comp_prepare() will be run on a control CPU, since CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section of "enum cpuhp_state". In both these cases, any recursions into zswap reclaim from zswap_cpu_comp_prepare() will be handled by the old pool. The above two observations enable the following simplifications: 1) zswap_cpu_comp_prepare(): a) acomp_ctx mutex locking: If the process gets migrated while zswap_cpu_comp_prepare() is running, it will complete on the new CPU. In case of failures, we pass the acomp_ctx pointer obtained at the start of zswap_cpu_comp_prepare() to acomp_ctx_free(), which again, can only undergo migration. There appear to be no contention scenarios that might cause inconsistent values of acomp_ctx's members. 
Hence, it seems there is no need for mutex_lock(&acomp_ctx->mutex) in zswap_cpu_comp_prepare(). b) acomp_ctx mutex initialization: Since the pool is not yet on zswap_pools list, we don't need to initialize the per-CPU acomp_ctx mutex in zswap_pool_create(). This has been restored to occur in zswap_cpu_comp_prepare(). c) Subsequent CPU offline-online transitions: zswap_cpu_comp_prepare() checks upfront if acomp_ctx->acomp is valid. If so, it returns success. This should handle any CPU hotplug online-offline transitions after pool creation is done. 2) CPU offline vis-a-vis zswap ops: Let's suppose the process is migrated to another CPU before the current CPU is dysfunctional. If zswap_[de]compress() holds the acomp_ctx->mutex lock of the offlined CPU, that mutex will be released once it completes on the new CPU. Since there is no teardown callback, there is no possibility of UAF. 3) Pool creation/deletion and process migration to another CPU: During pool creation/deletion, the pool is not in the zswap_pools list. Hence it cannot contend with zswap ops on that CPU. However, the process can get migrated. a) Pool creation --> zswap_cpu_comp_prepare() --> process migrated: * Old CPU offline: no-op. * zswap_cpu_comp_prepare() continues to run on the new CPU to finish allocating acomp_ctx resources for the offlined CPU. b) Pool deletion --> acomp_ctx_free() --> process migrated: * Old CPU offline: no-op. * acomp_ctx_free() continues to run on the new CPU to finish de-allocating acomp_ctx resources for the offlined CPU. 4) Pool deletion vis-a-vis CPU onlining: The call to cpuhp_state_remove_instance() cannot race with zswap_cpu_comp_prepare() because of hotplug synchronization. The current acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock() are deleted. Instead, zswap_[de]compress() directly call mutex_[un]lock(&acomp_ctx->mutex). 
The per-CPU memory cost of not deleting the acomp_ctx resources upon CPU offlining, and only deleting them when the pool is destroyed, is 8.28 KB on x86_64. This cost is only paid when a CPU is offlined, until it is onlined again. Link: https://lore.kernel.org/20260331183351.29844-3-kanchanapsridhar2026@gmail.com Co-developed-by: Kanchana P. Sridhar <kanchanapsridhar2026@gmail.com> Signed-off-by: Kanchana P. Sridhar <kanchanapsridhar2026@gmail.com> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Acked-by: Yosry Ahmed <yosry@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:50 -07:00
Kanchana P. Sridhar	1556478e9e	mm: zswap: remove redundant checks in zswap_cpu_comp_dead() Patch series "zswap pool per-CPU acomp_ctx simplifications", v3. This patchset first removes redundant checks on the acomp_ctx and its "req" member in zswap_cpu_comp_dead(). Next, it persists the zswap pool's per-CPU acomp_ctx resources to last until the pool is destroyed. It then simplifies the per-CPU acomp_ctx mutex locking in zswap_compress()/zswap_decompress(). Code comments are added after allocation and before checking to deallocate the per-CPU acomp_ctx's members, based on expected crypto API return values and the zswap changes this patchset makes. Patch 2 is an independent submission of patch 23 from [1], to facilitate merging. This patch (of 2): There are presently redundant checks on the per-CPU acomp_ctx and its "req" member in zswap_cpu_comp_dead(): redundant because they are inconsistent with zswap_pool_create() handling of failure in allocating the acomp_ctx, and with the expected NULL return value from the acomp_request_alloc() API when it fails to allocate an acomp_req. Fix these by converting them to NULL checks. Add comments in zswap_cpu_comp_prepare() clarifying the expected return values of the crypto_alloc_acomp_node() and acomp_request_alloc() API. Link: https://lore.kernel.org/20260331183351.29844-2-kanchanapsridhar2026@gmail.com Link: https://patchwork.kernel.org/project/linux-mm/list/?series=1046677 Signed-off-by: Kanchana P. Sridhar <kanchanapsridhar2026@gmail.com> Suggested-by: Yosry Ahmed <yosry@kernel.org> Acked-by: Yosry Ahmed <yosry@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:50 -07:00
Hao Ge	6b1842775a	mm/alloc_tag: clear codetag for pages allocated before page_ext initialization Due to initialization ordering, page_ext is allocated and initialized relatively late during boot. Some pages have already been allocated and freed before page_ext becomes available, leaving their codetag uninitialized. A clear example is in init_section_page_ext(): alloc_page_ext() calls kmemleak_alloc(). If the slab cache has no free objects, it falls back to the buddy allocator to allocate memory. However, at this point page_ext is not yet fully initialized, so these newly allocated pages have no codetag set. These pages may later be reclaimed by KASAN, which causes the warning to trigger when they are freed because their codetag ref is still empty. Use a global array to track pages allocated before page_ext is fully initialized. The array size is fixed at 8192 entries, and will emit a warning if this limit is exceeded. When page_ext initialization completes, set their codetag to empty to avoid warnings when they are freed later. 
This warning is only observed with CONFIG_MEM_ALLOC_PROFILING_DEBUG=Y and mem_profiling_compressed disabled: [ 9.582133] ------------[ cut here ]------------ [ 9.582137] alloc_tag was not set [ 9.582139] WARNING: ./include/linux/alloc_tag.h:164 at __pgalloc_tag_sub+0x40f/0x550, CPU#5: systemd/1 [ 9.582190] CPU: 5 UID: 0 PID: 1 Comm: systemd Not tainted 7.0.0-rc4 #1 PREEMPT(lazy) [ 9.582192] Hardware name: Red Hat KVM, BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [ 9.582194] RIP: 0010:__pgalloc_tag_sub+0x40f/0x550 [ 9.582196] Code: 00 00 4c 29 e5 48 8b 05 1f 88 56 05 48 8d 4c ad 00 48 8d 2c c8 e9 87 fd ff ff 0f 0b 0f 0b e9 f3 fe ff ff 48 8d 3d 61 2f ed 03 <67> 48 0f b9 3a e9 b3 fd ff ff 0f 0b eb e4 e8 5e cd 14 02 4c 89 c7 [ 9.582197] RSP: 0018:ffffc9000001f940 EFLAGS: 00010246 [ 9.582200] RAX: dffffc0000000000 RBX: 1ffff92000003f2b RCX: 1ffff110200d806c [ 9.582201] RDX: ffff8881006c0360 RSI: 0000000000000004 RDI: ffffffff9bc7b460 [ 9.582202] RBP: 0000000000000000 R08: 0000000000000000 R09: fffffbfff3a62324 [ 9.582203] R10: ffffffff9d311923 R11: 0000000000000000 R12: ffffea0004001b00 [ 9.582204] R13: 0000000000002000 R14: ffffea0000000000 R15: ffff8881006c0360 [ 9.582206] FS: 00007ffbbcf2d940(0000) GS:ffff888450479000(0000) knlGS:0000000000000000 [ 9.582208] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9.582210] CR2: 000055ee3aa260d0 CR3: 0000000148b67005 CR4: 0000000000770ef0 [ 9.582211] PKRU: 55555554 [ 9.582212] Call Trace: [ 9.582213] <TASK> [ 9.582214] ? __pfx___pgalloc_tag_sub+0x10/0x10 [ 9.582216] ? check_bytes_and_report+0x68/0x140 [ 9.582219] __free_frozen_pages+0x2e4/0x1150 [ 9.582221] ? __free_slab+0xc2/0x2b0 [ 9.582224] qlist_free_all+0x4c/0xf0 [ 9.582227] kasan_quarantine_reduce+0x15d/0x180 [ 9.582229] __kasan_slab_alloc+0x69/0x90 [ 9.582232] kmem_cache_alloc_noprof+0x14a/0x500 [ 9.582234] do_getname+0x96/0x310 [ 9.582237] do_readlinkat+0x91/0x2f0 [ 9.582239] ? __pfx_do_readlinkat+0x10/0x10 [ 9.582240] ? 
get_random_bytes_user+0x1df/0x2c0 [ 9.582244] __x64_sys_readlinkat+0x96/0x100 [ 9.582246] do_syscall_64+0xce/0x650 [ 9.582250] ? __x64_sys_getrandom+0x13a/0x1e0 [ 9.582252] ? __pfx___x64_sys_getrandom+0x10/0x10 [ 9.582254] ? do_syscall_64+0x114/0x650 [ 9.582255] ? ksys_read+0xfc/0x1d0 [ 9.582258] ? __pfx_ksys_read+0x10/0x10 [ 9.582260] ? do_syscall_64+0x114/0x650 [ 9.582262] ? do_syscall_64+0x114/0x650 [ 9.582264] ? __pfx_fput_close_sync+0x10/0x10 [ 9.582266] ? file_close_fd_locked+0x178/0x2a0 [ 9.582268] ? __x64_sys_faccessat2+0x96/0x100 [ 9.582269] ? __x64_sys_close+0x7d/0xd0 [ 9.582271] ? do_syscall_64+0x114/0x650 [ 9.582273] ? do_syscall_64+0x114/0x650 [ 9.582275] ? clear_bhb_loop+0x50/0xa0 [ 9.582277] ? clear_bhb_loop+0x50/0xa0 [ 9.582279] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 9.582280] RIP: 0033:0x7ffbbda345ee [ 9.582282] Code: 0f 1f 40 00 48 8b 15 29 38 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 0f 1f 40 00 f3 0f 1e fa 49 89 ca b8 0b 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fa 37 0d 00 f7 d8 64 89 01 48 [ 9.582284] RSP: 002b:00007ffe2ad8de58 EFLAGS: 00000202 ORIG_RAX: 000000000000010b [ 9.582286] RAX: ffffffffffffffda RBX: 000055ee3aa25570 RCX: 00007ffbbda345ee [ 9.582287] RDX: 000055ee3aa25570 RSI: 00007ffe2ad8dee0 RDI: 00000000ffffff9c [ 9.582288] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000001001 [ 9.582289] R10: 0000000000001000 R11: 0000000000000202 R12: 0000000000000033 [ 9.582290] R13: 00007ffe2ad8dee0 R14: 00000000ffffff9c R15: 00007ffe2ad8deb0 [ 9.582292] </TASK> [ 9.582293] ---[ end trace 0000000000000000 ]--- Link: https://lore.kernel.org/20260331081312.123719-1-hao.ge@linux.dev Fixes: `dcfe378c81` ("lib: introduce support for page allocation tagging") Signed-off-by: Hao Ge <hao.ge@linux.dev> Suggested-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton 
<akpm@linux-foundation.org>	2026-04-18 00:10:50 -07:00
Suren Baghdasaryan	d14514c66c	mm/vmscan: prevent MGLRU reclaim from pinning address space When shrinking lruvec, MGLRU pins address space before walking it. This is excessive since all it needs for walking the page range is a stable mm_struct to be able to take and release mmap_read_lock and a stable mm->mm_mt tree to walk. This address space pinning results in delays when releasing the memory of a dying process. This also prevents mm reapers (both in-kernel oom-reaper and userspace process_mrelease()) from doing their job during MGLRU scan because they check task_will_free_mem() which will yield negative result due to the elevated mm->mm_users. This affects the system in the sense that if the MM of the killed process is being reclaimed by kswapd then reapers won't be able to reap it. Even the process itself (which might have higher-priority than kswapd) will not free its memory until kswapd drops the last reference. IOW, we delay freeing the memory because kswapd is reclaiming it. In Android the visible result for us is that process_mrelease() (userspace reaper) skips MM in such cases and we see process memory not released for an unusually long time (secs). Replace unnecessary address space pinning with mm_struct pinning by replacing mmget/mmput with mmgrab/mmdrop calls. mm_mt is contained within mm_struct itself, therefore it won't be freed as long as mm_struct is stable and it won't change during the walk because mmap_read_lock is being held. 
Link: https://lore.kernel.org/20260322070843.941997-1-surenb@google.com Fixes: `bd74fdaea1` ("mm: multi-gen LRU: support page table walks") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:50 -07:00
Pasha Tatashin	68750e820b	liveupdate: defer file handler module refcounting to active sessions Stop pinning modules indefinitely upon file handler registration. Instead, dynamically increment the module reference count only when a live update session actively uses the file handler (e.g., during preservation or deserialization), and release it when the session ends. This allows modules providing live update handlers to be gracefully unloaded when no live update is in progress. Link: https://lore.kernel.org/20260327033335.696621-11-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:50 -07:00
Pasha Tatashin	2ab7207e7e	liveupdate: make unregister functions return void Change liveupdate_unregister_file_handler and liveupdate_unregister_flb to return void instead of an error code. This follows the design principle that unregistration during module unload should not fail, as the unload cannot be stopped at that point. Link: https://lore.kernel.org/20260327033335.696621-10-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:50 -07:00
Pasha Tatashin	074488008d	liveupdate: remove liveupdate_test_unregister() Now that file handler unregistration automatically unregisters all associated file handlers (FLBs), the liveupdate_test_unregister() function is no longer needed. Remove it along with its usages and declarations. Link: https://lore.kernel.org/20260327033335.696621-9-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:50 -07:00
Pasha Tatashin	5ee1c7d641	liveupdate: auto unregister FLBs on file handler unregistration To ensure that unregistration is always successful and doesn't leave dangling resources, introduce auto-unregistration of FLBs: when a file handler is unregistered, all FLBs associated with it are automatically unregistered. Introduce a new helper luo_flb_unregister_all() which unregisters all FLBs linked to the given file handler. Link: https://lore.kernel.org/20260327033335.696621-8-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:50 -07:00
Pasha Tatashin	118c390824	liveupdate: remove luo_session_quiesce() Now that FLB module references are handled dynamically during active sessions, we can safely remove the luo_session_quiesce() and luo_session_resume() mechanism. Link: https://lore.kernel.org/20260327033335.696621-7-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:50 -07:00
Pasha Tatashin	76be9983df	liveupdate: defer FLB module refcounting to active sessions Stop pinning modules indefinitely upon FLB registration. Instead, dynamically take a module reference when the FLB is actively used in a session (e.g., during preserve and retrieve) and release it when the session concludes. This allows modules providing FLB operations to be cleanly unloaded when not in active use by the live update orchestrator. Link: https://lore.kernel.org/20260327033335.696621-6-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:50 -07:00
Pasha Tatashin	6b2b22f7c8	liveupdate: protect FLB lists with luo_register_rwlock Because liveupdate FLB objects will soon drop their persistent module references when registered, list traversals must be protected against concurrent module unloading. To provide this protection, utilize the global luo_register_rwlock. It protects the global registry of FLBs and the handler's specific list of FLB dependencies. Read locks are used during concurrent list traversals (e.g., during preservation and serialization). Write locks are taken during registration and unregistration. Link: https://lore.kernel.org/20260327033335.696621-5-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:49 -07:00
Pasha Tatashin	9e1e185845	liveupdate: protect file handler list with rwsem Because liveupdate file handlers will no longer hold a module reference when registered, we must ensure that the access to the handler list is protected against concurrent module unloading. Utilize the global luo_register_rwlock to protect the global registry of file handlers. Read locks are taken during list traversals in luo_preserve_file() and luo_file_deserialize(). Write locks are taken during registration and unregistration. Link: https://lore.kernel.org/20260327033335.696621-4-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:49 -07:00


1429319 Commits