mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2026-05-16 05:31:37 -04:00
de61e40bcbb84546972191fb70ef64c5aecdda68
1429335 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
de61e40bcb |
MAINTAINERS: drop include/linux/kho/abi/ from KHO
The KHO entry already includes include/linux/kho. Listing its subdirectory is redundant. Link: https://lore.kernel.org/20260414121752.1912847-3-pratyush@kernel.org Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
320c7234d1 |
MAINTAINERS: update KHO and LIVE UPDATE maintainers
Patch series "MAINTAINERS: update KHO and LIVE UPDATE entries". This series contains some updates for the Kexec Handover (KHO) and Live update entries. Patch 1 updates the maintainers list and adds the liveupdate tree. Patches 2 and 3 clean up stale files in the list. This patch (of 3): I have been helping out with reviewing and developing KHO. I would also like to help maintain it. Change my entry from R to M for KHO and live update. Alex has been inactive for a while, so to avoid over-crowding the KHO entry and to keep the information up-to-date, move his entry from M to R. We also now have a tree for KHO and live update at liveupdate/linux.git where we plan to start maintaining those subsystems and start queuing the patches. List that in the entries as well. Link: https://lore.kernel.org/20260414121752.1912847-1-pratyush@kernel.org Link: https://lore.kernel.org/20260414121752.1912847-2-pratyush@kernel.org Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Reviewed-by: Alexander Graf <graf@amazon.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: David Hildenbrand <david@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
60087b49f8 |
MAINTAINERS: update kexec/kdump maintainers entries
Update KEXEC and KDUMP maintainer entries by adding the live update group maintainers. Remove Vivek Goyal due to inactivity to keep the MAINTAINERS file up-to-date, and add Vivek to the CREDITS file to recognize their contributions. Link: https://lore.kernel.org/20260413121146.49215-1-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Pratyush Yadav <pratyush@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Diego Viola <diego.viola@gmail.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Magnus Karlsson <magnus.karlsson@intel.com> Cc: Mark Brown <broonie@kernel.org> Cc: Martin Kepplinger <martink@posteo.de> Cc: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
57294a97bd |
mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd()
The softleaf_is_migration() check is unreachable as entries that are not
device_private are filtered out. Similarly, the PTE-level equivalent in
migrate_vma_collect_pmd() skips migration entries.
This dead branch also contained a double spin_unlock(ptl) bug.
Link: https://lore.kernel.org/20260212014611.416695-1-dave@stgolabs.net
Fixes:
|
||
|
|
d432e8847f |
selftests: mm: skip charge_reserved_hugetlb without killall
charge_reserved_hugetlb.sh tears down background writers with killall from psmisc. Minimal Ubuntu images do not always provide that tool, so the selftest fails in cleanup for an environment reason rather than for the hugetlb behavior it is trying to cover. Skip the test when killall is unavailable, similar to the existing root check, so these environments report the dependency clearly instead of failing the test. Link: https://lore.kernel.org/20260410044139.67480-1-create0818@163.com Signed-off-by: Cao Ruichuang <create0818@163.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
161ce69c2c |
userfaultfd: allow registration of ranges below mmap_min_addr
The current implementation of validate_range() in fs/userfaultfd.c
performs a hard check against mmap_min_addr. This is redundant because
UFFDIO_REGISTER operates on memory ranges that must already be backed by a
VMA.
Enforcing mmap_min_addr or capability checks again in userfaultfd is
unnecessary and prevents applications like binary compilers from using
UFFD for valid memory regions mapped by the application.
Remove the redundant check for mmap_min_addr.
We started using UFFD instead of the classic mprotect approach in the
binary translator to track application writes. During development, we
encountered this bug. The translator cannot control where the translated
application chooses to map its memory and if the app requires a
low-address area, UFFD fails, whereas mprotect would work just fine. I
believe this is a genuine logic bug rather than an improvement, and I
would appreciate including the fix in stable.
Link: https://lore.kernel.org/20260409103345.15044-1-komlomal@gmail.com
Fixes:
|
||
|
|
2b19bf0571 |
mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update
vmstat_shepherd uses delayed_work_pending() to check whether vmstat_update
is already scheduled for a given CPU before queuing it. However,
delayed_work_pending() only tests WORK_STRUCT_PENDING_BIT, which is
cleared the moment a worker thread picks up the work to execute it.
This means that while vmstat_update is actively running on a CPU,
delayed_work_pending() returns false. If need_update() also returns true
at that point (per-cpu counters not yet zeroed mid-flush), the shepherd
queues a second invocation with delay=0, causing vmstat_update to run
again immediately after finishing.
On a 72-CPU system this race is readily observable: before the fix, many
CPUs show invocation gaps well below 500 jiffies (the minimum
round_jiffies_relative() can produce), with the most extreme cases
reaching 0 jiffies—vmstat_update called twice within the same jiffy.
Fix this by replacing delayed_work_pending() with work_busy(), which
returns non-zero for both WORK_BUSY_PENDING (timer armed or work queued)
and WORK_BUSY_RUNNING (work currently executing). The shepherd now
correctly skips a CPU in all busy states.
After the fix, all sub-jiffy and most sub-100-jiffy gaps disappear. The
remaining early invocations have gaps in the 700–999 jiffy range,
attributable to round_jiffies_relative() aligning to a nearer
jiffy-second boundary rather than to this race.
Each spurious vmstat_update invocation has a measurable side effect:
refresh_cpu_vm_stats() calls decay_pcp_high() for every zone, which drains
idle per-CPU pages back to the buddy allocator via free_pcppages_bulk(),
taking the zone spinlock each time. Eliminating the double-scheduling
therefore reduces zone lock contention directly. On a 72-CPU stress-ng
workload measured with perf lock contention:
free_pcppages_bulk contention count: ~55% reduction
free_pcppages_bulk total wait time: ~57% reduction
free_pcppages_bulk max wait time: ~47% reduction
Note: work_busy() is inherently racy—between the check and the
subsequent queue_delayed_work_on() call, vmstat_update can finish
execution, leaving the work neither pending nor running. In that narrow
window the shepherd can still queue a second invocation. After the fix,
this residual race is rare and produces only occasional small gaps, a
significant improvement over the systematic double-scheduling seen with
delayed_work_pending().
Link: https://lore.kernel.org/20260409-vmstat-v2-1-e9d9a6db08ad@debian.org
Fixes:
|
||
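The difference between the two checks described in the vmstat_shepherd commit above can be modelled in plain userspace C. This is only a toy sketch: `struct sketch_work`, the `SKETCH_BUSY_*` bits, and every function name are hypothetical stand-ins, not the kernel's workqueue API. It assumes a work item is either pending, running, or idle, and shows why a pending-only check lets the shepherd queue the work again while it is still running.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two busy states named in the commit. */
#define SKETCH_BUSY_PENDING (1u << 0)
#define SKETCH_BUSY_RUNNING (1u << 1)

/* Toy model of one CPU's work item (not the kernel's struct). */
struct sketch_work {
	bool pending;	/* timer armed or work queued, not yet picked up */
	bool running;	/* callback currently executing */
};

/* Models the old check: only the pending state is visible. */
static bool sketch_pending(const struct sketch_work *w)
{
	return w->pending;
}

/* Models the new check: both pending and running are reported. */
static unsigned int sketch_busy(const struct sketch_work *w)
{
	unsigned int ret = 0;

	if (w->pending)
		ret |= SKETCH_BUSY_PENDING;
	if (w->running)
		ret |= SKETCH_BUSY_RUNNING;
	return ret;
}

/* Returns true if the modelled shepherd would queue the work again. */
static bool sketch_shepherd_queues(const struct sketch_work *w,
				   bool need_update, bool use_busy)
{
	bool skip = use_busy ? sketch_busy(w) != 0 : sketch_pending(w);

	return !skip && need_update;
}
```

With a work item that is running but no longer pending, `sketch_shepherd_queues(&w, true, false)` reports a second queueing, while the busy-based variant skips the CPU. The model does not capture the residual race the commit message notes, where the work finishes between the check and the queue call.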
|
|
c45b354911 |
mm/hugetlb: fix early boot crash on parameters without '=' separator
If hugepages, hugepagesz, or default_hugepagesz are specified on the
kernel command line without the '=' separator, early parameter parsing
passes NULL to hugetlb_add_param(), which dereferences it in strlen() and
can crash the system during early boot.
Reject NULL values in hugetlb_add_param() and return -EINVAL instead.
Link: https://lore.kernel.org/20260409105437.108686-4-thorsten.blum@linux.dev
Fixes:
|
||
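The guard described in the hugetlb commit above follows a common pattern: reject a NULL value before calling `strlen()` on it. The userspace sketch below illustrates that pattern only. `sketch_add_param()`, the buffer, and its size are hypothetical and do not reproduce the kernel's `hugetlb_add_param()`; the buffer-full check is an added assumption for completeness.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BUF_SIZE 64	/* hypothetical storage size */

static char sketch_buf[SKETCH_BUF_SIZE];
static size_t sketch_used;

/*
 * Store a parameter value for later parsing. A parameter written
 * without '=' reaches the handler as NULL, so NULL must be rejected
 * before strlen() touches it.
 */
static int sketch_add_param(const char *s)
{
	size_t len;

	if (!s)
		return -EINVAL;

	len = strlen(s) + 1;	/* include the terminating NUL */
	if (sketch_used + len > SKETCH_BUF_SIZE)
		return -EINVAL;	/* assumption: a full buffer is also an error */

	memcpy(sketch_buf + sketch_used, s, len);
	sketch_used += len;
	return 0;
}
```

Calling `sketch_add_param(NULL)` returns `-EINVAL` instead of crashing, which is the behavior the commit message describes for the real function.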
|
|
2f529e73d7 |
zram: reject unrecognized type= values in recompress_store()
recompress_store() parses the type= parameter with three if statements checking for "idle", "huge", and "huge_idle". An unrecognized value silently falls through with mode left at 0, causing the recompression pass to run with no slot filter — processing all slots instead of the intended subset. Add a !mode check after the type parsing block to return -EINVAL for unrecognized values, consistent with the function's other parameter validation. Link: https://lore.kernel.org/20260407153027.42425-1-astellman@stellman-greene.com Signed-off-by: Andrew Stellman <astellman@stellman-greene.com> Suggested-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
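The parsing flow in the zram commit above, three independent string comparisons followed by a `!mode` check, can be sketched as a small userspace function. The flag names and `sketch_parse_type()` are hypothetical and chosen only for illustration; the real `recompress_store()` is not reproduced here.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical mode flags, named for illustration only. */
#define SKETCH_MODE_IDLE	(1u << 0)
#define SKETCH_MODE_HUGE	(1u << 1)
#define SKETCH_MODE_HUGE_IDLE	(1u << 2)

/*
 * Parse a type= value. Without the final !mode check, an unknown
 * string would leave mode at 0 and silently select no filter.
 */
static int sketch_parse_type(const char *val, unsigned int *mode_out)
{
	unsigned int mode = 0;

	if (!strcmp(val, "idle"))
		mode = SKETCH_MODE_IDLE;
	if (!strcmp(val, "huge"))
		mode = SKETCH_MODE_HUGE;
	if (!strcmp(val, "huge_idle"))
		mode = SKETCH_MODE_HUGE_IDLE;

	if (!mode)
		return -EINVAL;	/* unrecognized value */

	*mode_out = mode;
	return 0;
}
```

A value such as `"bogus"` now yields `-EINVAL` rather than falling through with an empty filter, matching the intent stated in the commit message.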
|
|
9a8ea3c1cb |
docs: proc: document ProtectionKey in smaps
The ProtectionKey entry was added in v4.9; back then it was x86-specific, but it now lives in generic code and applies to all architectures supporting pkeys (currently x86, power, arm64). Time to document it: add a paragraph to proc.rst about the ProtectionKey entry. [akpm@linux-foundation.org: s/system/hardware/, per review discussion] [akpm@linux-foundation.org: s/hardware/CPU/] Link: https://lore.kernel.org/20260407125133.564182-1-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Reported-by: Yury Khrustalev <yury.khrustalev@arm.com> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
89e613bc0b |
mm/mprotect: special-case small folios when applying permissions
The common order-0 case is important enough to want its own branch, and avoids the hairy, large loop logic that the CPU does not seem to handle particularly well. While at it, encourage the compiler to inline batch PTE logic and resolve constant branches by adding __always_inline strategically. Link: https://lore.kernel.org/20260402141628.3367596-3-pfalcato@suse.de Signed-off-by: Pedro Falcato <pfalcato@suse.de> Suggested-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Tested-by: Luke Yang <luyang@redhat.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Jiri Hladky <jhladky@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
3bc181c143 |
mm/mprotect: move softleaf code out of the main function
Patch series "mm/mprotect: micro-optimization work", v3. Micro-optimize the change_protection functionality and the change_pte_range() routine. This set of functions works in an incredibly tight loop, and even small inefficiencies are incredibly evident when spun hundreds, thousands or hundreds of thousands of times. There was an attempt to keep the batching functionality as much as possible, which introduced some part of the slowness, but not all of it. Removing it for !arm64 architectures would speed mprotect() up even further, but could easily pessimize cases where large folios are mapped (which is not as rare as it seems, particularly when it comes to the page cache these days). The micro-benchmark used for the tests was [0] (usable using google/benchmark and g++ -O2 -lbenchmark repro.cpp) This resulted in the following (first entry is baseline): --------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------- mprotect_bench 85967 ns 85967 ns 6935 mprotect_bench 70684 ns 70684 ns 9887 After the patchset we can observe an ~18% speedup in mprotect. Wonderful for the elusive mprotect-based workloads! Testing & more ideas welcome. I suspect there is plenty of improvement possible but it would require more time than what I have on my hands right now. The entire inlined function (which inlines into change_protection()) is gigantic - I'm not surprised this is so finnicky. Note: per my profiling, the next _big_ bottleneck here is modify_prot_start_ptes, exactly on the xchg() done by x86. ptep_get_and_clear() is _expensive_. I don't think there's a properly safe way to go about it since we do depend on the D bit quite a lot. This might not be such an issue on other architectures. Luke Yang reported [1]: : On average, we see improvements ranging from a minimum of 5% to a : maximum of 55%, with most improvements showing around a 25% speed up in : the libmicro/mprot_tw4m micro benchmark. 
This patch (of 2): Move softleaf change_pte_range code into a separate function. This makes the change_pte_range() function a good bit smaller, and lessens cognitive load when reading through the function. Link: https://lore.kernel.org/20260402141628.3367596-1-pfalcato@suse.de Link: https://lore.kernel.org/20260402141628.3367596-2-pfalcato@suse.de Link: https://lore.kernel.org/all/aY8-XuFZ7zCvXulB@luyang-thinkpadp1gen7.toromso.csb/ Link: https://gist.github.com/heatd/1450d273005aba91fa5744f44dfcd933 [0] Link: https://lore.kernel.org/CAL2CeBxT4jtJ+LxYb6=BNxNMGinpgD_HYH5gGxOP-45Q2OncqQ@mail.gmail.com [1] Signed-off-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Tested-by: Luke Yang <luyang@redhat.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Jiri Hladky <jhladky@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
19999e479c |
mm: remove '!root_reclaim' checking in should_abort_scan()
Android systems usually use the memory.reclaim interface to implement user space memory management, which expects that the requested reclaim target and the amount of memory actually reclaimed do not diverge by too much. With the current MGLRU implementation there is, however, no bail out when the reclaim target is reached, and this could lead to excessive reclaim that scales with the reclaim hierarchy size. For example, we can get a nr_reclaimed=394/nr_to_reclaim=32 proactive reclaim under a common 1-N cgroup hierarchy. This defect arose from the goal of keeping fairness among memcgs, that is, for try_to_free_mem_cgroup_pages -> shrink_node_memcgs -> shrink_lruvec -> lru_gen_shrink_lruvec -> try_to_shrink_lruvec, the !root_reclaim(sc) check was there for reclaim fairness, which was necessary before commit |
||
|
|
77c368f057 |
mm/sparse: fix comment for section map alignment
The comment in mmzone.h currently details exhaustive per-architecture bit-width lists and explains alignment using min(PAGE_SHIFT, PFN_SECTION_SHIFT). Such details risk falling out of date over time and may inadvertently be left un-updated. We always expect a single section to cover full pages. Therefore, we can safely assume that PFN_SECTION_SHIFT is large enough to accommodate SECTION_MAP_LAST_BIT. We use BUILD_BUG_ON() to ensure this. Update the comment to accurately reflect this consensus, making it clear that we rely on a single section covering full pages. Link: https://lore.kernel.org/20260402102320.3617578-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Petr Tesarik <ptesarik@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
df620ec4d4 |
mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete()
sio_read_complete() uses sio->pages to account global PSWPIN vm events, but sio->pages tracks the number of bvec entries (folios), not base pages. While large folios cannot currently reach this path (SWP_FS_OPS and SWP_SYNCHRONOUS_IO are mutually exclusive, and mTHP swap-in allocation is gated on SWP_SYNCHRONOUS_IO), the accounting is semantically inconsistent with the per-memcg path which correctly uses folio_nr_pages(). Use sio->len >> PAGE_SHIFT instead, which gives the correct base page count since sio->len is accumulated via folio_size(folio). Link: https://lore.kernel.org/20260402061408.36119-1-devnexen@gmail.com Signed-off-by: David Carlier <devnexen@gmail.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: NeilBrown <neil@brown.name> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
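The accounting difference in the page_io commit above comes down to counting entries versus counting base pages. The sketch below models it in userspace C under stated assumptions: `struct sketch_sio` and its helpers are hypothetical, and a 4 KiB base page (`SKETCH_PAGE_SHIFT` of 12) is assumed.

```c
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB base pages */

/* Toy model: one entry per folio plus the accumulated byte length. */
struct sketch_sio {
	int pages;	/* number of entries (folios) added */
	size_t len;	/* total bytes, summed from each folio's size */
};

/* Add one folio of folio_bytes bytes to the request. */
static void sketch_sio_add(struct sketch_sio *sio, size_t folio_bytes)
{
	sio->pages++;
	sio->len += folio_bytes;
}

/* Base-page count derived from the byte length, as the commit proposes. */
static unsigned long sketch_base_pages(const struct sketch_sio *sio)
{
	return sio->len >> SKETCH_PAGE_SHIFT;
}
```

If one 4 KiB folio and one 64 KiB folio are added, the entry count is 2 while the length-based count is 17 base pages, which is why the two values only agree when every folio is order-0.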
|
|
cfe9a446f5 |
selftests/mm: transhuge_stress: skip the test when thp not available
The test requires thp; skip it when thp is not available to avoid a false positive. Tested with a thp-disabled kernel. Before the fix: # -------------------------------- # running ./transhuge-stress -d 20 # -------------------------------- # TAP version 13 # 1..1 # transhuge-stress: allocate 1453 transhuge pages, using 2907 MiB virtual memory and 11 MiB of ram # Bail out! MADV_HUGEPAGE# Planned tests != run tests (1 != 0) # # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0 # [FAIL] not ok 60 transhuge-stress -d 20 # exit=1 After the fix: # -------------------------------- # running ./transhuge-stress -d 20 # -------------------------------- # TAP version 13 # 1..0 # SKIP Transparent Hugepages not available # [SKIP] ok 5 transhuge-stress -d 20 # SKIP Link: https://lore.kernel.org/20260402014543.1671131-7-chuhu@redhat.com Signed-off-by: Chunyu Hu <chuhu@redhat.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Li Wang <liwang@redhat.com> Cc: Nico Pache <npache@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
dad4964a34 |
selftests/mm: split_huge_page_test: skip the test when thp is not available
When thp is not enabled in some kernel configs, such as a realtime kernel, the test reports a failure. Fix the false positive by skipping the test directly when thp is not enabled. Tested with a thp-disabled kernel: Before the fix: # -------------------------------------------------- # running ./split_huge_page_test /tmp/xfs_dir_Ywup9p # -------------------------------------------------- # TAP version 13 # Bail out! Reading PMD pagesize failed # # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0 # [FAIL] not ok 61 split_huge_page_test /tmp/xfs_dir_Ywup9p # exit=1 After the fix: # -------------------------------------------------- # running ./split_huge_page_test /tmp/xfs_dir_YHPUPl # -------------------------------------------------- # TAP version 13 # 1..0 # SKIP Transparent Hugepages not available # [SKIP] ok 6 split_huge_page_test /tmp/xfs_dir_YHPUPl # SKIP Link: https://lore.kernel.org/20260402014543.1671131-6-chuhu@redhat.com Signed-off-by: Chunyu Hu <chuhu@redhat.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Li Wang <liwang@redhat.com> Cc: Nico Pache <npache@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
a784a3a39c |
selftests/mm/vm_util: robust write_file()
Add three more checks for buflen and numwritten. buflen should be at least two, which means at least one character plus the null terminator. The error case is now detected by checking numwritten < 0 instead of numwritten < 1. The truncated-write case is also checked. The test will exit if any of these conditions aren't met. Additionally, add more print information when a write failure occurs or a truncated write happens, providing clearer diagnostics. Link: https://lore.kernel.org/20260402014543.1671131-5-chuhu@redhat.com Signed-off-by: Chunyu Hu <chuhu@redhat.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Cc: Nico Pache <npache@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
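The three checks listed in the write_file() commit above (minimum buffer length, write error, truncated write) can be sketched as follows. This is an illustration, not the selftest helper: `sketch_write_file()` is a hypothetical name, it returns a negative errno value instead of exiting so that the outcome can be inspected, and it assumes the payload is `buflen - 1` bytes, that is, the string without its terminator.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write the string in buf (buflen includes the terminating NUL) to an
 * existing file. Returns 0 on success or a negative errno value.
 */
static int sketch_write_file(const char *path, const char *buf, size_t buflen)
{
	ssize_t numwritten;
	int fd, ret = 0;

	/* Check 1: need at least one character plus the terminator. */
	if (buflen < 2)
		return -EINVAL;

	fd = open(path, O_WRONLY);
	if (fd == -1)
		return -errno;

	numwritten = write(fd, buf, buflen - 1);
	if (numwritten < 0)
		ret = -errno;	/* Check 2: the write itself failed */
	else if ((size_t)numwritten != buflen - 1)
		ret = -EIO;	/* Check 3: short (truncated) write */

	close(fd);
	return ret;
}
```

For example, `sketch_write_file("/dev/null", "1", 2)` succeeds, while passing a `buflen` of 1 is rejected before the file is opened. Mapping a short write to `-EIO` is a choice made for this sketch only.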
|
|
710d2f3079 |
selftests/mm: move write_file helper to vm_util
thp_settings provides a write_file() helper for safely writing to a file and exiting when a write failure happens. It's a very low level helper and many sub tests need such a helper, not only thp tests. split_huge_page_test also defines a write_file locally. The two have minor differences in return type and the exit API used. And there would be conflicts if split_huge_page_test wanted to include thp_settings.h because of the different prototypes, making it less convenient. It's possible to merge the two, although some tests don't use the kselftest infrastructure for testing. It would also work when using ksft_exit_msg() to exit in my test, as the counters are all zero. Output will be like: TAP version 13 1..62 Bail out! /proc/sys/vm/drop_caches1 open failed: No such file or directory # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0 So here we just keep the version in split_huge_page_test, and move it into vm_util. This makes it easy to maintain, and users can just include vm_util.h when they don't need the thp setting helpers. Keep the void return prototype: the function exits on any error, so a return value is not necessary, and this simplifies callers like write_num() and write_string(). Link: https://lore.kernel.org/20260402014543.1671131-4-chuhu@redhat.com Signed-off-by: Chunyu Hu <chuhu@redhat.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Suggested-by: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
929d5fbf1a |
selftests/mm: soft-dirty: skip two tests when thp is not available
The test_hugepage test contain two sub tests. If just reporting one skip when thp not available, there will be error in the log because the test count don't match the test plan. Change to skip two tests by running the ksft_test_result_skip twice in this case. Without the fix (run test on thp disabled kernel): ./run_vmtests.sh -t soft_dirty # -------------------- # running ./soft-dirty # -------------------- # TAP version 13 # 1..19 # ok 1 Test test_simple # ok 2 Test test_vma_reuse dirty bit of allocated page # ok 3 Test test_vma_reuse dirty bit of reused address page # ok 4 # SKIP Transparent Hugepages not available # ok 5 Test test_mprotect-anon dirty bit of new written page # ok 6 Test test_mprotect-anon soft-dirty clear after clear_refs # ok 7 Test test_mprotect-anon soft-dirty clear after marking RO # ok 8 Test test_mprotect-anon soft-dirty clear after marking RW # ok 9 Test test_mprotect-anon soft-dirty after rewritten # ok 10 Test test_mprotect-file dirty bit of new written page # ok 11 Test test_mprotect-file soft-dirty clear after clear_refs # ok 12 Test test_mprotect-file soft-dirty clear after marking RO # ok 13 Test test_mprotect-file soft-dirty clear after marking RW # ok 14 Test test_mprotect-file soft-dirty after rewritten # ok 15 Test test_merge-anon soft-dirty after remap merge 1st pg # ok 16 Test test_merge-anon soft-dirty after remap merge 2nd pg # ok 17 Test test_merge-anon soft-dirty after mprotect merge 1st pg # ok 18 Test test_merge-anon soft-dirty after mprotect merge 2nd pg # # 1 skipped test(s) detected. Consider enabling relevant config options to improve coverage. 
# # Planned tests != run tests (19 != 18) # # Totals: pass:17 fail:0 xfail:0 xpass:0 skip:1 error:0 # [FAIL] not ok 52 soft-dirty # exit=1 With the fix (run test on thp disabled kernel): ./run_vmtests.sh -t soft_dirty # -------------------- # running ./soft-dirty # TAP version 13 # -------------------- # running ./soft-dirty # -------------------- # TAP version 13 # 1..19 # ok 1 Test test_simple # ok 2 Test test_vma_reuse dirty bit of allocated page # ok 3 Test test_vma_reuse dirty bit of reused address page # # Transparent Hugepages not available # ok 4 # SKIP Test test_hugepage huge page allocation # ok 5 # SKIP Test test_hugepage huge page dirty bit # ok 6 Test test_mprotect-anon dirty bit of new written page # ok 7 Test test_mprotect-anon soft-dirty clear after clear_refs # ok 8 Test test_mprotect-anon soft-dirty clear after marking RO # ok 9 Test test_mprotect-anon soft-dirty clear after marking RW # ok 10 Test test_mprotect-anon soft-dirty after rewritten # ok 11 Test test_mprotect-file dirty bit of new written page # ok 12 Test test_mprotect-file soft-dirty clear after clear_refs # ok 13 Test test_mprotect-file soft-dirty clear after marking RO # ok 14 Test test_mprotect-file soft-dirty clear after marking RW # ok 15 Test test_mprotect-file soft-dirty after rewritten # ok 16 Test test_merge-anon soft-dirty after remap merge 1st pg # ok 17 Test test_merge-anon soft-dirty after remap merge 2nd pg # ok 18 Test test_merge-anon soft-dirty after mprotect merge 1st pg # ok 19 Test test_merge-anon soft-dirty after mprotect merge 2nd pg # # 2 skipped test(s) detected. Consider enabling relevant config options to improve coverage. 
# # Totals: pass:17 fail:0 xfail:0 xpass:0 skip:2 error:0 # [PASS] ok 1 soft-dirty hwpoison_inject # SUMMARY: PASS=1 SKIP=0 FAIL=0 1..1 Link: https://lore.kernel.org/20260402014543.1671131-3-chuhu@redhat.com Signed-off-by: Chunyu Hu <chuhu@redhat.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Li Wang <liwang@redhat.com> Cc: Nico Pache <npache@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
fb0fca46b9 |
selftests/mm/guard-regions: skip collapse test when thp not enabled
Patch series "selftests/mm: skip several tests when thp is not available", v8. There are several tests requires transprarent hugepages, when run on thp disabled kernel such as realtime kernel, there will be false negative. Mark those tests as skip when thp is not available. This patch (of 6): When thp is not available, just skip the collape tests to avoid the false negative. Without the change, run with a thp disabled kernel: ./run_vmtests.sh -t madv_guard -n 1 <snip/> # RUN guard_regions.anon.collapse ... # guard-regions.c:2217:collapse:Expected madvise(ptr, size, MADV_NOHUGEPAGE) (-1) == 0 (0) # collapse: Test terminated by assertion # FAIL guard_regions.anon.collapse not ok 2 guard_regions.anon.collapse <snip/> # RUN guard_regions.shmem.collapse ... # guard-regions.c:2217:collapse:Expected madvise(ptr, size, MADV_NOHUGEPAGE) (-1) == 0 (0) # collapse: Test terminated by assertion # FAIL guard_regions.shmem.collapse not ok 32 guard_regions.shmem.collapse <snip/> # RUN guard_regions.file.collapse ... # guard-regions.c:2217:collapse:Expected madvise(ptr, size, MADV_NOHUGEPAGE) (-1) == 0 (0) # collapse: Test terminated by assertion # FAIL guard_regions.file.collapse not ok 62 guard_regions.file.collapse <snip/> # FAILED: 87 / 90 tests passed. # 17 skipped test(s) detected. Consider enabling relevant config options to improve coverage. # Totals: pass:70 fail:3 xfail:0 xpass:0 skip:17 error:0 With this change, run with thp disabled kernel: ./run_vmtests.sh -t madv_guard -n 1 <snip/> # RUN guard_regions.anon.collapse ... # SKIP Transparent Hugepages not available # OK guard_regions.anon.collapse ok 2 guard_regions.anon.collapse # SKIP Transparent Hugepages not available <snip/> # RUN guard_regions.file.collapse ... # SKIP Transparent Hugepages not available # OK guard_regions.file.collapse ok 62 guard_regions.file.collapse # SKIP Transparent Hugepages not available <snip/> # RUN guard_regions.shmem.collapse ... 
# SKIP Transparent Hugepages not available # OK guard_regions.shmem.collapse ok 32 guard_regions.shmem.collapse # SKIP Transparent Hugepages not available <snip/> # PASSED: 90 / 90 tests passed. # 20 skipped test(s) detected. Consider enabling relevant config options to improve coverage. # Totals: pass:70 fail:0 xfail:0 xpass:0 skip:20 error:0 Link: https://lore.kernel.org/20260402014543.1671131-1-chuhu@redhat.com Link: https://lore.kernel.org/20260402014543.1671131-2-chuhu@redhat.com Signed-off-by: Chunyu Hu <chuhu@redhat.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Li Wang <liwang@redhat.com> Cc: Nico Pache <npache@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
6ab703034f |
userfaultfd: mfill_atomic(): remove retry logic
Since __mfill_atomic_pte() handles the retry for both anonymous and shmem, there is no need to retry copying the data from userspace in the loop in mfill_atomic(). Drop the retry logic from mfill_atomic(). [rppt@kernel.org: remove safety measure of not returning ENOENT from _copy] Link: https://lore.kernel.org/ac5zcDUY8CFHr6Lw@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-12-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
f74991b4e3 |
shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops
Add filemap_add() and filemap_remove() methods to vm_uffd_ops and use them in __mfill_atomic_pte() to add shmem folios to page cache and remove them in case of error. Implement these methods in shmem along with vm_uffd_ops->alloc_folio() and drop shmem_mfill_atomic_pte(). Since userfaultfd now does not reference any functions from shmem, drop the include of linux/shmem_fs.h from mm/userfaultfd.c. mfill_atomic_install_pte() is not used anywhere outside of mm/userfaultfd, so make it static. Link: https://lore.kernel.org/20260402041156.1377214-11-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: James Houghton <jthoughton@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
ad9ac30813 |
userfaultfd: introduce vm_uffd_ops->alloc_folio()
and use it to refactor mfill_atomic_pte_zeroed_folio() and mfill_atomic_pte_copy(). mfill_atomic_pte_zeroed_folio() and mfill_atomic_pte_copy() perform almost identical actions: * allocate a folio * update folio contents (either copy from userspace or fill with zeros) * update page tables with the new folio Split a __mfill_atomic_pte() helper that handles both cases and uses newly introduced vm_uffd_ops->alloc_folio() to allocate the folio. Pass the ops structure from the callers to __mfill_atomic_pte() to later allow using anon_uffd_ops for MAP_PRIVATE mappings of file-backed VMAs. Note that the new ops method is called alloc_folio() rather than folio_alloc() to avoid a clash with the alloc_tag macro folio_alloc(). Link: https://lore.kernel.org/20260402041156.1377214-10-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: James Houghton <jthoughton@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
dfc4d77182 |
shmem, userfaultfd: use a VMA callback to handle UFFDIO_CONTINUE
When userspace resolves a page fault in a shmem VMA with UFFDIO_CONTINUE it needs to get a folio that already exists in the pagecache backing that VMA. Instead of using shmem_get_folio() for that, add a get_folio_noalloc() method to 'struct vm_uffd_ops' that will return a folio if it exists in the VMA's pagecache at given pgoff. Implement get_folio_noalloc() method for shmem and slightly refactor userfaultfd's mfill_get_vma() and mfill_atomic_pte_continue() to support this new API. Link: https://lore.kernel.org/20260402041156.1377214-9-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: James Houghton <jthoughton@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
0f48947c42 |
userfaultfd: introduce vm_uffd_ops
The current userfaultfd implementation works only with memory managed by core MM: anonymous, shmem and hugetlb. First, there is no fundamental reason to limit userfaultfd support only to the core memory types and userfaults can be handled similarly to regular page faults provided a VMA owner implements appropriate callbacks. Second, historically various code paths were conditioned on vma_is_anonymous(), vma_is_shmem() and is_vm_hugetlb_page() and some of these conditions can be expressed as operations implemented by a particular memory type. Introduce vm_uffd_ops extension to vm_operations_struct that will delegate memory type specific operations to a VMA owner. Operations for anonymous memory are handled internally in userfaultfd using anon_uffd_ops that is implicitly assigned to anonymous VMAs. Start with a single operation, ->can_userfault() that will verify that a VMA meets requirements for userfaultfd support at registration time. Implement that method for anonymous, shmem and hugetlb and move relevant parts of vma_can_userfault() into the new callbacks.
[rppt@kernel.org: relocate VM_DROPPABLE test, per Tal] Link: https://lore.kernel.org/adffgfM5ANxtPIEF@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-8-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Cc: Tal Zussman <tz2294@columbia.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
a5bb866987 |
userfaultfd: move vma_can_userfault out of line
vma_can_userfault() has grown pretty big and it's not called on performance critical path. Move it out of line. No functional changes. Link: https://lore.kernel.org/20260402041156.1377214-7-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
f5f035a724 |
userfaultfd: retry copying with locks dropped in mfill_atomic_pte_copy()
Implementation of UFFDIO_COPY for anonymous memory might fail to copy data from userspace buffer when the destination VMA is locked (either with mm_lock or with per-VMA lock). In that case, mfill_atomic() releases the locks, retries copying the data with locks dropped and then re-locks the destination VMA and re-establishes PMD. Since this retry-reget dance is only relevant for UFFDIO_COPY and it never happens for other UFFDIO_ operations, make it a part of mfill_atomic_pte_copy() that actually implements UFFDIO_COPY for anonymous memory. As a temporary safety measure to avoid breaking bisection, mfill_atomic_pte_copy() makes sure to never return -ENOENT so that the loop in mfill_atomic() won't retry copying outside of mmap_lock. This is removed later, when the shmem implementation is updated and the loop in mfill_atomic() is adjusted. [akpm@linux-foundation.org: update mfill_copy_folio_retry()] Link: https://lore.kernel.org/20260316173829.1126728-1-avagin@google.com Link: https://lore.kernel.org/20260306171815.3160826-6-rppt@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-6-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: 
Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Cc: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
b8c03b7f45 |
userfaultfd: introduce mfill_get_vma() and mfill_put_vma()
Split the code that finds, locks and verifies VMA from mfill_atomic() into a helper function. This function will be used later during refactoring of mfill_atomic_pte_copy(). Add a counterpart mfill_put_vma() helper that unlocks the VMA and releases map_changing_lock. [avagin@google.com: fix lock leak in mfill_get_vma()] Link: https://lore.kernel.org/20260316173829.1126728-1-avagin@google.com Link: https://lore.kernel.org/20260402041156.1377214-5-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrei Vagin <avagin@google.com> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
e2e0b826d3 |
userfaultfd: introduce mfill_establish_pmd() helper
There is a lengthy code chunk in mfill_atomic() that establishes the PMD for UFFDIO operations. This code may be called twice: first time when the copy is performed with VMA/mm locks held and the other time after the copy is retried with locks dropped. Move the code that establishes a PMD into a helper function so it can be reused later during refactoring of mfill_atomic_pte_copy(). Link: https://lore.kernel.org/20260402041156.1377214-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
db0062d2c0 |
userfaultfd: introduce struct mfill_state
mfill_atomic() passes a lot of parameters down to its callees.
Aggregate them all into mfill_state structure and pass this structure to
functions that implement various UFFDIO_ commands.
Tracking the state in a structure will allow moving the code that retries
copying of data for UFFDIO_COPY into mfill_atomic_pte_copy() and make the
loop in mfill_atomic() identical for all UFFDIO operations on PTE-mapped
memory.
The mfill_state definition is deliberately local to mm/userfaultfd.c,
hence shmem_mfill_atomic_pte() is not updated.
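The aggregation described above can be illustrated with a small userspace sketch. It is not the kernel code: the structure name, its fields and both helpers are hypothetical, since the real struct mfill_state is private to mm/userfaultfd.c and its members are not listed in this log. The sketch only shows the general idea of passing one state object down to callees so that the outer loop looks the same for every operation.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a fill-state structure. The field names are
 * assumptions made for illustration, not the members of mfill_state.
 */
struct sketch_fill_state {
	unsigned long dst_addr;	/* next destination address */
	unsigned long src_addr;	/* next source address */
	size_t len;		/* total bytes requested */
	size_t copied;		/* bytes handled so far */
	unsigned int flags;	/* request flags, passed through unchanged */
};

/* Handle one chunk; the callee takes the state instead of many arguments. */
static int sketch_fill_one(struct sketch_fill_state *st, size_t chunk)
{
	size_t left = st->len - st->copied;

	if (chunk == 0 || left == 0)
		return -1;
	if (chunk > left)
		chunk = left;	/* never advance past the requested length */

	st->dst_addr += chunk;
	st->src_addr += chunk;
	st->copied += chunk;
	return 0;
}

/* The loop only advances the state, whatever the per-chunk operation is. */
static size_t sketch_fill_all(struct sketch_fill_state *st, size_t chunk)
{
	assert(st->copied <= st->len);

	while (st->copied < st->len) {
		if (sketch_fill_one(st, chunk))
			break;
	}
	return st->copied;
}
```

A caller would initialise one sketch_fill_state and hand it to sketch_fill_all(); adding a new per-request field then touches the structure rather than every function signature.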
[harry.yoo@oracle.com: properly initialize mfill_state.len to fix
folio_add_new_anon_rmap() WARN]
Link: https://lore.kernel.org/abehBY7QakYF9bK4@hyeyoo
Link: https://lore.kernel.org/20260402041156.1377214-3-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrei Vagin <avagin@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Harry Yoo (Oracle) <harry@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nikita Kalyazin <kalyazin@amazon.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Carlier <devnexen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
c0620487fc |
userfaultfd: introduce mfill_copy_folio_locked() helper
Patch series "mm, kvm: allow uffd support in guest_memfd", v4. These patches enable support for userfaultfd in guest_memfd. As the groundwork I refactored userfaultfd handling of PTE-based memory types (anonymous and shmem) and converted them to use vm_uffd_ops for allocating a folio or getting an existing folio from the page cache. shmem also implements callbacks that add a folio to the page cache after the data passed in UFFDIO_COPY was copied and remove the folio from the page cache if page table update fails. In order for guest_memfd to notify userspace about page faults, there are new VM_FAULT_UFFD_MINOR and VM_FAULT_UFFD_MISSING that a ->fault() handler can return to inform the page fault handler that it needs to call handle_userfault() to complete the fault. Nikita helped to plumb these new goodies into guest_memfd and provided basic tests to verify that guest_memfd works with userfaultfd. The handling of UFFDIO_MISSING in guest_memfd requires ability to remove a folio from page cache, the best way I could find was exporting filemap_remove_folio() to KVM. I deliberately left hugetlb out, at least for the most part. hugetlb handles acquisition of VMA and more importantly establishing of parent page table entry differently than PTE-based memory types. This is a different abstraction level than what vm_uffd_ops provides and people objected to exposing such low level APIs as a part of VMA operations. Also, to enable uffd in guest_memfd refactoring of hugetlb is not needed and I prefer to delay it until the dust settles after the changes in this set. This patch (of 4): Split copying of data when locks are held from mfill_atomic_pte_copy() into a helper function mfill_copy_folio_locked(). This improves code readability and makes the complex mfill_atomic_pte_copy() function easier to comprehend. No functional change.
Link: https://lore.kernel.org/20260402041156.1377214-1-rppt@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
dc44f32fde |
mm/memfd_luo: remove folio from page cache when accounting fails
In memfd_luo_retrieve_folios(), when shmem_inode_acct_blocks() fails after successfully adding the folio to the page cache, the code jumps to unlock_folio without removing the folio from the page cache. While the folio eventually will be freed when the file is released by memfd_luo_retrieve(), it is a good idea to directly remove a folio that was not fully added to the file. This avoids the possibility of accounting mismatches in shmem or filemap core. Fix by adding a remove_from_cache label that calls filemap_remove_folio() before unlocking, matching the error handling pattern in shmem_alloc_and_add_folio(). This issue was identified by AI review: https://sashiko.dev/#/patchset/20260323110747.193569-1-duanchenghao@kylinos.cn [pratyush@kernel.org: changelog alterations] Link: https://lore.kernel.org/2vxzzf3lfujq.fsf@kernel.org Link: https://lore.kernel.org/20260326084727.118437-7-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
3538f90ab8 |
mm/memfd_luo: fix physical address conversion in put_folios cleanup
In memfd_luo_retrieve_folios()'s put_folios cleanup path:
1. kho_restore_folio() expects a phys_addr_t (physical address) but
receives a raw PFN (pfolio->pfn). This causes kho_restore_page() to
check the wrong physical address (pfn << PAGE_SHIFT instead of the
actual physical address).
2. This loop lacks the !pfolio->pfn check that exists in the main
retrieval loop and memfd_luo_discard_folios(), which could
incorrectly process sparse file holes where pfn=0.
Fix by converting PFN to physical address with PFN_PHYS() and adding
the !pfolio->pfn check, matching the pattern used elsewhere in this file.
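A minimal userspace sketch of the two points above, assuming 4 KiB pages: a page frame number is turned into a physical address by shifting it left by the page shift (the same arithmetic as the kernel's PFN_PHYS()), and entries with pfn == 0 are treated as holes and skipped. The SKETCH_ names and the helper that walks the array are hypothetical and are not taken from mm/memfd_luo.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Same arithmetic as PFN_PHYS(): frame number shifted up to an address. */
static inline uint64_t sketch_pfn_phys(uint64_t pfn)
{
	return pfn << SKETCH_PAGE_SHIFT;
}

/*
 * Walk an array of PFNs, skip holes (pfn == 0) and store the physical
 * address of every remaining entry. Returns the number of addresses stored.
 */
static size_t sketch_collect_phys(const uint64_t *pfns, size_t n,
				  uint64_t *out)
{
	size_t used = 0;

	assert(pfns && out);

	for (size_t i = 0; i < n; i++) {
		if (!pfns[i])
			continue;	/* sparse hole: nothing to process */
		out[used++] = sketch_pfn_phys(pfns[i]);
	}
	return used;
}
```

Passing the PFN itself where an address is expected differs from the intended value by a factor of the page size, which is why the conversion has to happen at the call site.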
This issue was identified by the AI review.
https://sashiko.dev/#/patchset/20260323110747.193569-1-duanchenghao@kylinos.cn
Link: https://lore.kernel.org/20260326084727.118437-6-duanchenghao@kylinos.cn
Fixes:
|
||
|
|
32f6cec5e7 |
mm/memfd_luo: use i_size_write() to set inode size during retrieve
Use i_size_write() instead of directly assigning to inode->i_size when restoring the memfd size in memfd_luo_retrieve(), to keep code consistency. No functional change intended. Link: https://lore.kernel.org/20260326084727.118437-5-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Pratyush Yadav <pratyush@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
4aa6424f37 |
mm/memfd_luo: remove unnecessary memset in zero-size memfd path
The memset(kho_vmalloc, 0, sizeof(*kho_vmalloc)) call in the zero-size file handling path is unnecessary because the allocation of the ser structure already uses the __GFP_ZERO flag, ensuring the memory is already zero-initialized. Link: https://lore.kernel.org/20260326084727.118437-4-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
502d3c2ad8 |
mm/memfd_luo: optimize shmem_recalc_inode calls in retrieve path
Move shmem_recalc_inode() out of the loop in memfd_luo_retrieve_folios() to improve performance when restoring large memfds. Currently, shmem_recalc_inode() is called for each folio during restore, which adds up to O(n) expensive calls. This patch collects the number of successfully added folios and calls shmem_recalc_inode() once after the loop completes, reducing the number of calls to O(1). Additionally, fix the error path to also call shmem_recalc_inode() for the folios that were successfully added before the error occurred. Link: https://lore.kernel.org/20260326084727.118437-3-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
ed2a29dc6d |
mm/memfd: use folio_nr_pages() for shmem inode accounting
I found several modifiable points while reading the code. This patch (of 6): Patch series "Modify memfd_luo code", v3. memfd_luo_retrieve_folios() called shmem_inode_acct_blocks() and shmem_recalc_inode() with hardcoded 1 instead of the actual folio page count. memfd may use large folios (THP/hugepages), causing quota/limit under-accounting and incorrect stat output. Fix by using folio_nr_pages(folio) for both functions. Issue found by AI review and suggested by Pratyush Yadav <pratyush@kernel.org>. https://sashiko.dev/#/patchset/20260319012845.29570-1-duanchenghao%40kylinos.cn Link: https://lore.kernel.org/20260326084727.118437-1-duanchenghao@kylinos.cn Link: https://lore.kernel.org/20260326084727.118437-2-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Suggested-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
7cf6d940f4 |
mm/sparse: fix preinited section_mem_map clobbering on failure path
sparse_init_nid() is careful to leave alone every section whose vmemmap
has already been set up by sparse_vmemmap_init_nid_early(); it only clears
section_mem_map for the rest:
if (!preinited_vmemmap_section(ms))
ms->section_mem_map = 0;
A leftover line after that conditional block
ms->section_mem_map = 0;
was supposed to be deleted but was missed in the failure path, causing the
field to be overwritten for all sections when memory allocation fails,
effectively destroying the pre-initialization check.
Drop the stray assignment so that preinited sections retain their
already valid state.
Those pre-inited sections (HugeTLB pages) are not activated. However,
such failures are extremely rare, so I don't see any major userspace
issues.
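The effect of the stray assignment can be sketched in plain C. This is a simplified model, not the code from mm/sparse.c: the structure, its preinited flag and the helper name are hypothetical stand-ins for the mem_section handling described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced model of a memory section. */
struct sketch_section {
	unsigned long section_mem_map;	/* non-zero once set up */
	int preinited;			/* vmemmap was set up early */
};

/*
 * Failure path after the fix: only sections that were not pre-initialised
 * are cleared. An unconditional "section_mem_map = 0" following the check
 * would wipe the pre-initialised sections as well.
 */
static void sketch_clear_on_failure(struct sketch_section *ms, size_t n)
{
	assert(ms || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (!ms[i].preinited)
			ms[i].section_mem_map = 0;
	}
}
```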
Link: https://lore.kernel.org/20260331113724.2080833-1-songmuchun@bytedance.com
Fixes:
|
||
|
|
e3668b3713 |
zram: do not forget to endio for partial discard requests
As reported by Qu Wenruo and Avinesh Kumar, the following
getconf PAGESIZE
65536
blkdiscard -p 4k /dev/zram0
takes literally forever to complete. zram doesn't support partial
discards and just returns immediately w/o doing any discard work in such
cases. The problem is that we forget to endio on our way out, so
blkdiscard sleeps forever in submit_bio_wait(). Fix this by jumping to
end_bio label, which does bio_endio().
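The pattern behind the fix, a single exit label that always signals completion, can be sketched as follows. The request structure, the 64 KiB page size and the simplified "partial" test are assumptions for illustration; the real zram discard path and struct bio are more involved.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 65536u	/* assumption: 64 KiB pages, as in the report */

/* Hypothetical, reduced stand-in for a block request. */
struct sketch_bio {
	unsigned int offset;	/* byte offset of the request */
	unsigned int len;	/* byte length of the request */
	int completed;		/* set once completion was signalled */
};

static void sketch_bio_endio(struct sketch_bio *bio)
{
	bio->completed = 1;
}

/* Returns the number of whole pages discarded. */
static unsigned int sketch_discard(struct sketch_bio *bio)
{
	unsigned int discarded = 0;

	assert(bio);

	/* Partial discards are not supported: do no work, but still finish. */
	if (bio->offset % SKETCH_PAGE_SIZE || bio->len < SKETCH_PAGE_SIZE)
		goto end_bio;	/* a bare return here would leave waiters asleep */

	discarded = bio->len / SKETCH_PAGE_SIZE;

end_bio:
	sketch_bio_endio(bio);
	return discarded;
}
```

With a 4 KiB request on a 64 KiB page size the function discards nothing, yet completed is still set, so a submitter waiting for completion would be woken.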
Link: https://lore.kernel.org/20260331074255.777019-1-senozhatsky@chromium.org
Fixes:
|
||
|
|
af69016dab |
lib: test_hmm: implement a device release method
Unloading the HMM test module produces the following warning:
[ 3782.224783] ------------[ cut here ]------------
[ 3782.226323] Device 'hmm_dmirror0' does not have a release() function, it is broken and must be fixed. See Documentation/core-api/kobject.rst.
[ 3782.230570] WARNING: drivers/base/core.c:2567 at device_release+0x185/0x210, CPU#20: rmmod/1924
[ 3782.233949] Modules linked in: test_hmm(-) nvidia_uvm(O) nvidia(O)
[ 3782.236321] CPU: 20 UID: 0 PID: 1924 Comm: rmmod Tainted: G O 7.0.0-rc1+ #374 PREEMPT(full)
[ 3782.240226] Tainted: [O]=OOT_MODULE
[ 3782.241639] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[ 3782.246193] RIP: 0010:device_release+0x185/0x210
[ 3782.247860] Code: 00 00 fc ff df 48 8d 7b 50 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 86 00 00 00 48 8b 73 50 48 85 f6 74 11 48 8d 3d db 25 29 03 <67> 48 0f b9 3a e9 0d ff ff ff 48 b8 00 00 00 00 00 fc ff df 48 89
[ 3782.254211] RSP: 0018:ffff888126577d98 EFLAGS: 00010246
[ 3782.256054] RAX: dffffc0000000000 RBX: ffffffffc2b70310 RCX: ffffffff8fe61ba1
[ 3782.258512] RDX: 1ffffffff856e062 RSI: ffff88811341eea0 RDI: ffffffff91bbacb0
[ 3782.261041] RBP: ffff888111475000 R08: 0000000000000001 R09: fffffbfff856e069
[ 3782.263471] R10: ffffffffc2b7034b R11: 00000000ffffffff R12: 0000000000000000
[ 3782.265983] R13: dffffc0000000000 R14: ffff88811341eea0 R15: 0000000000000000
[ 3782.268443] FS: 00007fd5a3689040(0000) GS:ffff88842c8d0000(0000) knlGS:0000000000000000
[ 3782.271236] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3782.273251] CR2: 00007fd5a36d2c10 CR3: 00000001242b8000 CR4: 00000000000006f0
[ 3782.275362] Call Trace:
[ 3782.276071] <TASK>
[ 3782.276678] kobject_put+0x146/0x270
[ 3782.277731] hmm_dmirror_exit+0x7a/0x130 [test_hmm]
[ 3782.279135] __do_sys_delete_module+0x341/0x510
[ 3782.280438] ? module_flags+0x300/0x300
[ 3782.281547] do_syscall_64+0x111/0x670
[ 3782.282620] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 3782.284091] RIP: 0033:0x7fd5a3793b37
[ 3782.285303] Code: 73 01 c3 48 8b 0d c9 82 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 99 82 0c 00 f7 d8 64 89 01 48
[ 3782.290708] RSP: 002b:00007ffd68b7dc68 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 3782.292817] RAX: ffffffffffffffda RBX: 000055e3c0d1c770 RCX: 00007fd5a3793b37
[ 3782.294735] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055e3c0d1c7d8
[ 3782.296661] RBP: 0000000000000000 R08: 1999999999999999 R09: 0000000000000000
[ 3782.298622] R10: 00007fd5a3806ac0 R11: 0000000000000206 R12: 00007ffd68b7deb0
[ 3782.300576] R13: 00007ffd68b7e781 R14: 000055e3c0d1b2a0 R15: 00007ffd68b7deb8
[ 3782.301963] </TASK>
[ 3782.302371] irq event stamp: 5019
[ 3782.302987] hardirqs last enabled at (5027): [<ffffffff8cf1f062>] __up_console_sem+0x52/0x60
[ 3782.304507] hardirqs last disabled at (5036): [<ffffffff8cf1f047>] __up_console_sem+0x37/0x60
[ 3782.306086] softirqs last enabled at (4940): [<ffffffff8cd9a4b0>] __irq_exit_rcu+0xc0/0xf0
[ 3782.307567] softirqs last disabled at (4929): [<ffffffff8cd9a4b0>] __irq_exit_rcu+0xc0/0xf0
[ 3782.309105] ---[ end trace 0000000000000000 ]---
This is because the test module doesn't have a device.release method. In
this case one probably isn't needed for correctness - the device structs
are in a static array so don't need freeing when the final reference goes
away.
However some device state is freed on exit, so to ensure this happens at
the right time and to silence the warning move the deinitialisation to a
release method and assign that as the device release callback. Whilst
here also fix a minor error handling bug where cdev_device_del() wasn't
being called if allocation failed.
Link: https://lore.kernel.org/20260331063445.3551404-4-apopple@nvidia.com
Fixes:
|
||
|
|
f9d7975c52 |
selftests/mm: hmm-tests: don't hardcode THP size to 2MB
Several HMM tests hardcode TWOMEG as the THP size. This is wrong on architectures where the PMD size is not 2MB such as arm64 with 64K base pages where THP is 512MB. Fix this by using read_pmd_pagesize() from vm_util instead. While here also replace the custom file_read_ulong() helper used to parse the default hugetlbfs page size from /proc/meminfo with the existing default_huge_page_size() from vm_util. Link: https://lore.kernel.org/20260331063445.3551404-3-apopple@nvidia.com Link: https://lore.kernel.org/linux-mm/8bd0396a-8997-4d2e-a13f-5aac033083d7@linux.dev/ Fixes: |
||
|
|
744dd97752 |
lib: test_hmm: evict device pages on file close to avoid use-after-free
Patch series "Minor hmm_test fixes and cleanups".
Two bugfixes and a cleanup for the HMM kernel selftests. These were mostly
reported by Zenghui Yu with special thanks to Lorenzo for analysing and
pointing out the problems.
This patch (of 3):
When dmirror_fops_release() is called it frees the dmirror struct but
doesn't migrate device private pages back to system memory first. This
leaves those pages with a dangling zone_device_data pointer to the freed
dmirror.
If a subsequent fault occurs on those pages (eg. during coredump) the
dmirror_devmem_fault() callback dereferences the stale pointer causing a
kernel panic. This was reported [1] when running mm/ksft_hmm.sh on arm64,
where a test failure triggered SIGABRT and the resulting coredump walked
the VMAs faulting in the stale device private pages.
Fix this by calling dmirror_device_evict_chunk() for each devmem chunk in
dmirror_fops_release() to migrate all device private pages back to system
memory before freeing the dmirror struct. The function is moved earlier
in the file to avoid a forward declaration.
Link: https://lore.kernel.org/20260331063445.3551404-1-apopple@nvidia.com
Link: https://lore.kernel.org/20260331063445.3551404-2-apopple@nvidia.com
Fixes:
|
||
|
|
047a6d4940 |
selftests/mm: skip hugetlb_dio tests when DIO alignment is incompatible
hugetlb_dio test uses sub-page offsets (pagesize / 2) to verify that
hugepages used as DIO user buffers are correctly unpinned at completion.
However, on filesystems with a logical block size larger than half the
page size (e.g., 4K-sector block devices), these unaligned DIO writes are
rejected with -EINVAL, causing the test to fail unexpectedly.
Add get_dio_alignment() to query the filesystem's required DIO alignment
via statx(STATX_DIOALIGN) and skip individual test cases whose file offset
or write size is not a multiple of that alignment. Aligned cases continue
to run so the core coverage is preserved.
While here, open the temporary file once in main() and share the fd across
all test cases instead of reopening it in each invocation.
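A rough sketch of the skip logic, not the selftest's actual get_dio_alignment(): both function names below are hypothetical, and using stx_dio_offset_align as the deciding value is an assumption. statx() with STATX_DIOALIGN is only used when the libc headers provide it; otherwise the query reports "unknown" and nothing is skipped.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Query the direct I/O offset alignment for an open file.
 * Returns 0 when the alignment is unknown or not reported.
 */
static unsigned int sketch_dio_offset_align(int fd)
{
#ifdef STATX_DIOALIGN
	struct statx stx;

	if (statx(fd, "", AT_EMPTY_PATH, STATX_DIOALIGN, &stx) == 0 &&
	    (stx.stx_mask & STATX_DIOALIGN))
		return stx.stx_dio_offset_align;
#else
	(void)fd;	/* headers too old: treat the alignment as unknown */
#endif
	return 0;
}

/*
 * Decide whether a test case should be skipped: skip when the file offset
 * or the write size is not a multiple of the required alignment.
 */
static int sketch_should_skip(unsigned int align, size_t off, size_t size)
{
	assert(size > 0);

	if (align == 0)
		return 0;	/* unknown alignment: run the case */
	return (off % align) != 0 || (size % align) != 0;
}
```

With a 4096-byte alignment, a case that writes at offset pagesize / 2 = 2048 would be skipped, while a case at offset 0 with a page-multiple size would still run.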
=== Reproduce Steps ===
# dd if=/dev/zero of=/tmp/test.img bs=1M count=512
# losetup --sector-size 4096 /dev/loop0 /tmp/test.img
# mkfs.xfs /dev/loop0
# mkdir -p /mnt/dio_test
# mount /dev/loop0 /mnt/dio_test
// Modify test to open /mnt/dio_test and rebuild it:
- fd = open("/tmp", O_TMPFILE | O_RDWR | O_DIRECT, 0664);
+ fd = open("/mnt/dio_test", O_TMPFILE | O_RDWR | O_DIRECT, 0664);
# getconf PAGESIZE
4096
# echo 100 >/proc/sys/vm/nr_hugepages
# ./hugetlb_dio
TAP version 13
1..4
# No. Free pages before allocation : 100
# No. Free pages after munmap : 100
ok 1 free huge pages from 0-12288
Bail out! Error writing to file
: Invalid argument (22)
# Planned tests != run tests (4 != 1)
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
Link: https://lore.kernel.org/20260401090520.24018-1-liwang@redhat.com
Signed-off-by: Li Wang <liwang@redhat.com>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Suggested-by: David Hildenbrand <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
84f4928446 |
tools/testing/selftests: add merge test for partial msealed range
Commit
|
||
|
|
6fae274ce0 |
mm/mempolicy: fix memory leaks in weighted_interleave_auto_store()
weighted_interleave_auto_store() fetches old_wi_state inside the if
(!input) block only. This causes two memory leaks:
1. When a user writes "false" and the current mode is already manual,
the function returns early without freeing the freshly allocated
new_wi_state.
2. When a user writes "true", old_wi_state stays NULL because the
fetch is skipped entirely. The old state is then overwritten by
rcu_assign_pointer() but never freed, since the cleanup path is
gated on old_wi_state being non-NULL. A user can trigger this
repeatedly by writing "1" in a loop.
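A minimal user-space model of a store path that avoids both leaks by reading the old state before looking at the input. Everything here is hypothetical (`wi_state_sketch`, `auto_store_sketch()`, the `live_states` counter); plain pointer assignment and free() stand in for the kernel's rcu_assign_pointer() and RCU-deferred freeing, and a missing state is treated as "mode does not match", which is an assumption.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's weighted interleave state. */
struct wi_state_sketch {
	bool mode_auto;
};

/* Number of allocated-but-not-freed states, used to spot leaks. */
static int live_states;

static struct wi_state_sketch *state_alloc(bool mode_auto)
{
	struct wi_state_sketch *s = malloc(sizeof(*s));

	if (!s)
		return NULL;
	s->mode_auto = mode_auto;
	live_states++;
	return s;
}

static void state_free(struct wi_state_sketch *s)
{
	if (!s)
		return;
	free(s);
	live_states--;
}

/* Returns 0 on success, -1 if the new state cannot be allocated. */
static int auto_store_sketch(struct wi_state_sketch **current, bool input)
{
	struct wi_state_sketch *new_state = state_alloc(input);
	struct wi_state_sketch *old_state;

	if (!new_state)
		return -1;

	/* Fetch the old state unconditionally, before checking input. */
	old_state = *current;

	/* Unified early return: requested mode already active. */
	if (old_state && old_state->mode_auto == input) {
		state_free(new_state);
		return 0;
	}

	*current = new_state;	/* the kernel would publish via RCU here */
	state_free(old_state);	/* the kernel would defer this free */
	return 0;
}
```

In this model, repeated writes of the same value leave exactly one live allocation, because the old pointer is always known at the point where either the new or the old state has to be released.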
Fix both leaks by moving the old_wi_state fetch before the input check,
making it unconditional. This also allows a unified early return for both
"true" and "false" when the requested mode matches the current mode.
Link: https://lore.kernel.org/20260401005702.7096-1-liu.yun@linux.dev
Link: https://sashiko.dev/#/patchset/20260331100740.84906-1-liu.yun@linux.dev
Fixes:
|
||
|
|
0c13ed77dd |
Docs/admin-guide/mm/damon/lru_sort: warn commit_inputs vs param updates race
DAMON_LRU_SORT handles commit_inputs request inside kdamond thread,
reading the module parameters. If the user updates the module
parameters while the kdamond thread is reading those, races can happen.
To avoid this, the commit_inputs parameter shows whether it is still in
progress, assuming users wouldn't update parameters in the middle of
the work. Some users might ignore that. Add a warning about the
behavior.
The issue was discovered in [1] by sashiko.
Link: https://lore.kernel.org/20260329153052.46657-3-sj@kernel.org
Link: https://lore.kernel.org/20260319161620.189392-2-objecting@objecting.org [1]
Fixes:
|
||
|
|
0beba407d4 |
Docs/admin-guide/mm/damon/reclaim: warn commit_inputs vs param updates race
Patch series "Docs/admin-guide/mm/damon: warn commit_inputs vs other
params race".
Writing 'Y' to the commit_inputs parameter of DAMON_RECLAIM and
DAMON_LRU_SORT, and writing other parameters before the commit_inputs
request is completely processed can cause race conditions. While the
consequence can be bad, the documentation does not clearly describe that.
Add clear warnings.
The issue was discovered [1,2] by sashiko.
This patch (of 2):
DAMON_RECLAIM handles commit_inputs request inside kdamond thread,
reading the module parameters. If the user updates the module
parameters while the kdamond thread is reading those, races can happen.
To avoid this, the commit_inputs parameter shows whether it is still in
progress, assuming users wouldn't update parameters in the middle of
the work. Some users might ignore that. Add a warning about the
behavior.
The issue was discovered in [1] by sashiko.
Link: https://lore.kernel.org/20260329153052.46657-2-sj@kernel.org
Link: https://lore.kernel.org/20260319161620.189392-3-objecting@objecting.org [1]
Link: https://lore.kernel.org/20260319161620.189392-2-objecting@objecting.org [3]
Fixes:
|
||
|
|
049a57421d |
mm/damon/core: use time_in_range_open() for damos quota window start
damos_adjust_quota() uses time_after_eq() to show if it is time to start a
new quota charge window, comparing the current jiffies and the scheduled
next charge window start time. If it is, the next charge window start
time is updated and the new charge window starts.
The time check and next window start time update are skipped while the
scheme is deactivated by the watermarks. Let's suppose the deactivation
is kept more than LONG_MAX jiffies (assuming CONFIG_HZ of 250, more than
99 days in 32 bit systems and more than one billion years in 64 bit
systems), so that jiffies becomes larger than the next charge
window start time + LONG_MAX. Then, the time_after_eq() call can return
false until another LONG_MAX jiffies have passed.
This means the scheme can continue working after being reactivated by the
watermarks. But, soon, the quota will be exceeded and the scheme will
again effectively stop working until the next charge window starts.
Because the current charge window is extended to up to LONG_MAX jiffies,
however, it will look like it stopped unexpectedly and indefinitely, from
the user's perspective.
Fix this by using !time_in_range_open() instead.
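A minimal user-space sketch of the difference between the two checks, assuming the usual signed-difference idiom behind the kernel's jiffies comparison helpers. The `sketch_*` functions, the `window_start` and `window_len` parameters, and the two `*_window_expired()` helpers are hypothetical and only mirror the description above; they are not the DAMON code.

```c
#include <limits.h>
#include <stdbool.h>

/* User-space stand-ins for wrapping jiffies comparisons. */
static bool sketch_time_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

static bool sketch_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* True when b <= a < c under wrapping arithmetic. */
static bool sketch_time_in_range_open(unsigned long a, unsigned long b,
				      unsigned long c)
{
	return sketch_time_after_eq(a, b) && sketch_time_before(a, c);
}

/* One-sided check: only asks "are we past the end of the window?" */
static bool old_window_expired(unsigned long now, unsigned long window_start,
			       unsigned long window_len)
{
	return sketch_time_after_eq(now, window_start + window_len);
}

/* Two-sided check: expired unless now lies inside [start, end). */
static bool new_window_expired(unsigned long now, unsigned long window_start,
			       unsigned long window_len)
{
	return !sketch_time_in_range_open(now, window_start,
					  window_start + window_len);
}
```

When `now` has run more than LONG_MAX ticks past the window end, the signed difference in the one-sided check turns negative and the window looks unexpired; in this sketch the two-sided check still reports it as expired because `now` is no longer at or after the window start in the signed sense.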
The issue was discovered [1] by sashiko.
Link: https://lore.kernel.org/20260329152306.45796-1-sj@kernel.org
Link: https://lore.kernel.org/20260324040722.57944-1-sj@kernel.org [1]
Fixes:
|
||
|
|
a34dac6482 |
mm/damon/core: validate damos_quota_goal->nid for node_memcg_{used,free}_bp
Users can set damos_quota_goal->nid to an arbitrary value for
node_memcg_{used,free}_bp. But DAMON core passes that value to NODE_DATA()
without validating it. This can result in out-of-bounds
memory access. The issue can actually be triggered using the DAMON user-space
tool (damo), like below.
$ sudo mkdir /sys/fs/cgroup/foo
$ sudo ./damo start --damos_action stat --damos_quota_interval 1s \
--damos_quota_goal node_memcg_used_bp 50% -1 /foo
$ sudo dmesg
[...]
[ 524.181426] Unable to handle kernel paging request at virtual address 0000000000002c00
Fix this issue by adding the validation of the given node id. If an
invalid node id is given, it returns 0% for used memory ratio, and 100%
for free memory ratio.
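The fallback behavior described above can be sketched in user space as follows. All names are hypothetical (`SKETCH_MAX_NODES`, `sketch_node_online`, `sketch_nid_valid()`, `sketch_node_bp()`), the exact validity condition used by the kernel is an assumption, and "bp" is assumed to mean basis points, so 100% is 10000.

```c
#include <stdbool.h>

/* Hypothetical node table standing in for the kernel's node state. */
#define SKETCH_MAX_NODES 4

static const bool sketch_node_online[SKETCH_MAX_NODES] = {
	true, true, false, false
};

enum sketch_metric { SKETCH_USED_BP, SKETCH_FREE_BP };

/* Assumed validity rule: in range and online. */
static bool sketch_nid_valid(int nid)
{
	return nid >= 0 && nid < SKETCH_MAX_NODES && sketch_node_online[nid];
}

/*
 * Return the used or free ratio in basis points. For an invalid node
 * id (or unusable page counts) fall back to 0 for "used" and 10000
 * for "free" instead of touching per-node data.
 */
static unsigned long sketch_node_bp(int nid, enum sketch_metric metric,
				    unsigned long used_pages,
				    unsigned long total_pages)
{
	unsigned long used_bp;

	if (!sketch_nid_valid(nid) || total_pages == 0 ||
	    used_pages > total_pages)
		return metric == SKETCH_USED_BP ? 0 : 10000;

	used_bp = used_pages * 10000 / total_pages;
	return metric == SKETCH_USED_BP ? used_bp : 10000 - used_bp;
}
```

A node id of -1, as in the damo reproducer, takes the fallback branch here before any per-node lookup would happen.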
Link: https://lore.kernel.org/20260329043902.46163-3-sj@kernel.org
Fixes:
|