mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-12-28 09:06:11 -05:00
cf8cecf2bc221cea99fadc2dc62cef2d4c970dff
1353103 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
cf8cecf2bc |
mm/mempolicy: prepare weighted interleave sysfs for memory hotplug
Previously, the weighted interleave sysfs structure was statically managed during initialization. This prevented new nodes from being recognized when memory hotplug events occurred, limiting the ability to update or extend sysfs entries dynamically at runtime. To address this, this patch refactors the sysfs infrastructure and encapsulates it within a new structure, `sysfs_wi_group`, which holds both the kobject and an array of node attribute pointers. By allocating this group structure globally, the per-node sysfs attributes can be managed beyond initialization time, enabling external modules to insert or remove node entries in response to events such as memory hotplug or node online/offline transitions. Instead of allocating all per-node sysfs attributes at once, the initialization path now uses the existing sysfs_wi_node_add() and sysfs_wi_node_delete() helpers. This refactoring makes it possible to modularly manage per-node sysfs entries and ensures the infrastructure is ready for runtime extension. Link: https://lkml.kernel.org/r/20250417072839.711-3-rakie.kim@sk.com Signed-off-by: Rakie Kim <rakie.kim@sk.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yunjeong Mun <yunjeong.mun@sk.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
bb52e89d8b |
mm/mempolicy: fix memory leaks in weighted interleave sysfs
Patch series "Enhance sysfs handling for memory hotplug in weighted
interleave", v9.
The following patch series enhances the weighted interleave policy in the
memory management subsystem by improving sysfs handling, fixing memory
leaks, and introducing dynamic sysfs updates for memory hotplug support.
This patch (of 3):
Memory leaks occurred when removing sysfs attributes for weighted
interleave. Improper kobject deallocation led to unreleased memory when
initialization failed or when nodes were removed.
The risk of leak is low because it only appears to trigger if setup
fails. Setup only fails due to -ENOMEM which is unlikely to happen
from a late_initcall() when memory pressure is low.
This patch resolves the issue by replacing unnecessary `kfree()` calls
with proper `kobject_del()` and `kobject_put()` sequences, ensuring
correct teardown and preventing memory leaks.
By explicitly calling `kobject_del()` before `kobject_put()`, the release
function is now invoked safely, and internal sysfs state is correctly
cleaned up. This guarantees that the memory associated with the kobject
is fully released and avoids resource leaks, thereby improving system
stability.
Additionally, sysfs_remove_file() is no longer called from the release
function to avoid accessing invalid sysfs state after kobject_del(). All
attribute removals are now done before kobject_del(), preventing WARN_ON()
in kernfs and ensuring safe and consistent cleanup of sysfs entries.
Link: https://lkml.kernel.org/r/20250417072839.711-1-rakie.kim@sk.com
Link: https://lkml.kernel.org/r/20250417072839.711-2-rakie.kim@sk.com
Fixes:
|
||
|
|
6e14fd33f1 |
mm: memcontrol: remove unnecessary NULL check before free_percpu()
free_percpu() checks for NULL pointers internally. Remove unneeded NULL check here. Link: https://lkml.kernel.org/r/20250417084330.937380-1-nichen@iscas.ac.cn Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
d096612048 |
vmalloc: align nr_vmalloc_pages and vmap_lazy_nr
Currently both atomics share one cache-line: <snip> ... ffffffff83eab400 b vmap_lazy_nr ffffffff83eab408 b nr_vmalloc_pages ... <snip> those are global variables and they are only 8 bytes apart. Since they are modified by different threads this causes a false sharing. This can lead to a performance drop due to unnecessary cache invalidations. After this patch it is aligned to a cache line boundary: <snip> ... ffffffff8260a600 d vmap_lazy_nr ffffffff8260a640 d nr_vmalloc_pages ... <snip> Link: https://lkml.kernel.org/r/20250417161216.88318-4-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Adrian Huang <ahuang12@lenovo.com> Tested-by: Adrian Huang <ahuang12@lenovo.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Christop Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
7a6fe58777 |
MAINTAINERS: add test_vmalloc.c to VMALLOC section
A vmalloc subsystem includes "lib/test_vmalloc.c" test suite. Add an "F:" entry under VMALLOC section to track this file as part of the subsystem. Link: https://lkml.kernel.org/r/20250417161216.88318-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christop Hellwig <hch@infradead.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
2d76e79315 |
lib/test_vmalloc.c: allow built-in execution
Remove the dependency on module loading ("m") for the vmalloc test suite,
enabling it to be built directly into the kernel, so both ("=m") and
("=y") are supported.
Motivation:
- Faster debugging/testing of vmalloc code;
- It allows to configure the test via kernel-boot parameters.
Configuration example:
test_vmalloc.nr_threads=64
test_vmalloc.run_test_mask=7
test_vmalloc.sequential_test_order=1
Link: https://lkml.kernel.org/r/20250417161216.88318-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Adrian Huang <ahuang12@lenovo.com>
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Cc: Christop Hellwig <hch@infradead.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
7a73348e5d |
lib/test_vmalloc.c: replace RWSEM to SRCU for setup
The test has the initialization step during which threads are created. To prevent the workers from starting prematurely a write lock was previously used by the main setup thread, while each worker would block on a read lock. Replace this RWSEM based synchronization with a simpler SRCU based approach. Which does two basic steps: - Main thread wraps the setup phase in an SRCU read-side critical section. Pair of srcu_read_lock()/srcu_read_unlock(). - Each worker calls synchronize_srcu() on entry, ensuring it waits for the initialization phase to be completed. This patch eliminates the need for down_read()/up_read() and down_write()/up_write() pairs thus simplifying the logic and improving clarity. [urezki@gmail.com: fix compile error with CONFIG_TINY_RCU] Link: https://lkml.kernel.org/r/20250420142029.103169-1-urezki@gmail.com Link: https://lkml.kernel.org/r/20250417161216.88318-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Adrian Huang <ahuang12@lenovo.com> Tested-by: Adrian Huang <ahuang12@lenovo.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christop Hellwig <hch@infradead.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
b05f8d7e07 |
Documentation: zram: update IDLE pages tracking documentation
Move IDLE pages tracking into a separate chapter because there are
multiple features that use (or depend on) it either in built-in variant
("mark all") or in extended variant (ac-time tracking).
In addition, recompression doesn't require memory tracking to be enabled
in order to be able to perform idle recompression.
Link: https://lkml.kernel.org/r/20250416042833.3858827-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reported-by: Shin Kawamura <kawasin@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
4a34c584d8 |
mempolicy: optimize queue_folios_pte_range by PTE batching
After the check for queue_folio_required(), the code only cares about the folio in the for loop, i.e the PTEs are redundant. Therefore, optimize this loop by skipping over a PTE batch mapping the same folio. With a test program migrating pages of the calling process, which includes a mapped VMA of size 4GB with pte-mapped large folios of order-9, and migrating once back and forth node-0 and node-1, the average execution time reduces from 7.5 to 4 seconds, giving an approx 47% speedup. Link: https://lkml.kernel.org/r/20250416053048.96479-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
75404e0766 |
mm: move mmap/vma locking logic into specific files
Currently the VMA and mmap locking logic is entangled in two of the most overwrought files in mm - include/linux/mm.h and mm/memory.c. Separate this logic out so we can more easily make changes and create an appropriate MAINTAINERS entry that spans only the logic relating to locking. This should have no functional change. Care is taken to avoid dependency loops, we must regrettably keep release_fault_lock() and assert_fault_locked() in mm.h as a result due to the dependence on the vm_fault type. Additionally we must declare rcuwait_wake_up() manually to avoid a dependency cycle on linux/rcuwait.h. Additionally move the nommu implementatino of lock_mm_and_find_vma() to mmap_lock.c so everything lock-related is in one place. Link: https://lkml.kernel.org/r/bec6c8e29fa8de9267a811a10b1bdae355d67ed4.1744799282.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
f735eebe55 |
memcg: multi-memcg percpu charge cache
Memory cgroup accounting is expensive and to reduce the cost, the kernel maintains per-cpu charge cache for a single memcg. So, if a charge request comes for a different memcg, the kernel will flush the old memcg's charge cache and then charge the newer memcg a fixed amount (64 pages), subtracts the charge request amount and stores the remaining in the per-cpu charge cache for the newer memcg. This mechanism is based on the assumption that the kernel, for locality, keep a process on a CPU for long period of time and most of the charge requests from that process will be served by that CPU's local charge cache. However this assumption breaks down for incoming network traffic in a multi-tenant machine. We are in the process of running multiple workloads on a single machine and if such workloads are network heavy, we are seeing very high network memory accounting cost. We have observed multiple CPUs spending almost 100% of their time in net_rx_action and almost all of that time is spent in memcg accounting of the network traffic. More precisely, net_rx_action is serving packets from multiple workloads and is observing/serving mix of packets of these workloads. The memcg switch of per-cpu cache is very expensive and we are observing a lot of memcg switches on the machine. Almost all the time is being spent on charging new memcg and flushing older memcg cache. So, definitely we need per-cpu cache that support multiple memcgs for this scenario. This patch implements a simple (and dumb) multiple memcg percpu charge cache. Actually we started with more sophisticated LRU based approach but the dumb one was always better than the sophisticated one by 1% to 3%, so going with the simple approach. Some of the design choices are: 1. Fit all caches memcgs in a single cacheline. 2. The cache array can be mix of empty slots or memcg charged slots, so the kernel has to traverse the full array. 3. The cache drain from the reclaim will drain all cached memcgs to keep things simple. To evaluate the impact of this optimization, on a 72 CPUs machine, we ran the following workload where each netperf client runs in a different cgroup. The next-20250415 kernel is used as base. $ netserver -6 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K number of clients | Without patch | With patch 6 | 42584.1 Mbps | 48603.4 Mbps (14.13% improvement) 12 | 30617.1 Mbps | 47919.7 Mbps (56.51% improvement) 18 | 25305.2 Mbps | 45497.3 Mbps (79.79% improvement) 24 | 20104.1 Mbps | 37907.7 Mbps (88.55% improvement) 30 | 14702.4 Mbps | 30746.5 Mbps (109.12% improvement) 36 | 10801.5 Mbps | 26476.3 Mbps (145.11% improvement) The results show drastic improvement for network intensive workloads. [shakeel.butt@linux.dev: add BUILD_BUG_ON() for MEMCG_CHARGE_BATCH] Link: https://lkml.kernel.org/r/rlsgeosg3j7v5nihhbxxxbv3xfy4ejvigihj7lkkbt3n6imyne@2apxx2jm2e57 [shakeel.butt@linux.dev: simplify refill_stock] Link: https://lkml.kernel.org/r/as5cdsm4lraxupg3t6onep2ixql72za25hvd4x334dsoyo4apr@zyzl4vkuevuv [hughd@google.com: it's better to stock nr_pages than the uninitialized stock_pages] Link: https://lkml.kernel.org/r/d542d18f-1caa-6fea-e2c3-3555c87bcf64@google.com [shakeel.butt@linux.dev: add comment per Michal and use DEFINE_PER_CPU_ALIGNED instead of DEFINE_PER_CPU per Vlastimil] Link: https://lkml.kernel.org/r/dieeei3squ2gcnqxdjayvxbvzldr266rhnvtl3vjzsqevxkevf@ckui5vjzl2qg Link: https://lkml.kernel.org/r/20250416180229.2902751-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Eric Dumaze <edumazet@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
06340b9270 |
mm: convert free_page_and_swap_cache() to free_folio_and_swap_cache()
free_page_and_swap_cache() takes a struct page pointer as input parameter, but it will immediately convert it to folio and all operations following within use folio instead of page. It makes more sense to pass in folio directly. Convert free_page_and_swap_cache() to free_folio_and_swap_cache() to consume folio directly. Link: https://lkml.kernel.org/r/20250416201720.41678-1-nifan.cxl@gmail.com Signed-off-by: Fan Ni <fan.ni@samsung.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Adam Manzanares <a.manzanares@samsung.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
4f219913c1 |
mm: add nr_free_highatomic in show_free_areas
commit c928807f6f6b6("mm/page_alloc: keep track of free highatomic")
adds a new variable nr_free_highatomic, which is useful for analyzing low
mem issues. add nr_free_highatomic in show_free_areas.
Signed-off-by: gao xu <gaoxu2@honor.com>
Link: https://lkml.kernel.org/r/d92eeff74f7a4578a14ac777cfe3603a@honor.com
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
df2bbc47e7 |
mm/vmscan: modify the assignment logic of the scan and total_scan variables
The scan and total_scan variables can be initialized to 0 when they are defined, replacing the separate assignment statements. Link: https://lkml.kernel.org/r/20250417092422.1333620-1-hao.ge@linux.dev Signed-off-by: Hao Ge <gehao@kylinos.cn> Acked-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
1477b8cd26 |
samples/damon/prcl: fix a comment typo
This patch just fixes a typo in the comment. Link: https://lkml.kernel.org/r/20250411073800.1444481-1-lienze@kylinos.cn Signed-off-by: Enze Li <lienze@kylinos.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
a7797e74bd |
mm/gup: clean up codes in fault_in_xxx() functions
The code style in fault_in_readable() and fault_in_writable() is a little inconsistent with fault_in_safe_writeable(). In fault_in_readable() and fault_in_writable(), it uses 'uaddr' passed in as loop cursor. While in fault_in_safe_writeable(), local variable 'start' is used as loop cursor. This may mislead people when reading code or making change in these codes. Here define explicit loop cursor and use for loop to simplify codes in these three functions. These cleanup can make them be consistent in code style and improve readability. [bhe@redhat.com: address minor concerns from David] Link: https://lkml.kernel.org/r/Z/sbv3EmLXWgEE7+@MiWiFi-R3L-srv Link: https://lkml.kernel.org/r/20250410035717.473207-5-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yanjun.Zhu <yanjun.zhu@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
339122abb5 |
mm/gup: remove gup_fast_pgd_leaf() and clean up the relevant codes
In the current kernel, only pud huge page is supported in some architectures. P4d and pgd huge pages haven't been supported yet. And in mm/gup.c, there's no pgd huge page handling in the follow_page_mask() code path. Hence it doesn't make sense to only have gup_fast_pgd_leaf() in gup_fast code path. Here remove gup_fast_pgd_leaf() and clean up the relevant codes. Link: https://lkml.kernel.org/r/20250410035717.473207-4-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Yanjun.Zhu <yanjun.zhu@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
ede27b7ee2 |
mm/gup: remove unneeded checking in follow_page_pte()
Patch series "mm/gup: Minor fix, cleanup and improvements", v4. These were made from code inspection in mm/gup.c. This patch (of 3): In __get_user_pages(), it will traverse page table and take a reference to the page the given user address corresponds to if GUP_GET or GUP_PIN is set. However, it's not supported both GUP_GET and GUP_PIN are set. Even though this check need be done, it should be done earlier, but not doing it till entering into follow_page_pte() and failed. Furthermore, this checking has been done in is_valid_gup_args() and all external users of __get_user_pages() will call is_valid_gup_args() to catch the illegal setting. We don't need to worry about internal users of __get_user_pages() because the gup_flags are set by MM code correctly. Here remove the checking in follow_page_pte(), and add VM_WARN_ON_ONCE() to catch the possible exceptional setting just in case. And also change the VM_BUG_ON to VM_WARN_ON_ONCE() for checking (!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN))) because the checking has been done in is_valid_gup_args() for external users of __get_user_pages(). Link: https://lkml.kernel.org/r/20250410035717.473207-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20250410035717.473207-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Yanjun.Zhu <yanjun.zhu@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
e7a446030b |
mm,hugetlb: allocate frozen pages in alloc_buddy_hugetlb_folio
alloc_buddy_hugetlb_folio() allocates a rmappable folio, then strips the rmappable part and freezes it. We can simplify all that by allocating frozen pages directly. Link: https://lkml.kernel.org/r/20250411132359.312708-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
0272d07ef6 |
vmalloc: use atomic_long_add_return_relaxed()
Switch from the atomic_long_add_return() to its relaxed version.
We do not need a full memory barrier or any memory ordering during
increasing the "vmap_lazy_nr" variable. What we only need is to do it
atomically. This is what atomic_long_add_return_relaxed() guarantees.
AARCH64:
<snip>
Default:
40ec: d34cfe94 lsr x20, x20, #12
40f0: 14000044 b 4200 <free_vmap_area_noflush+0x19c>
40f4: 94000000 bl 0 <__sanitizer_cov_trace_pc>
40f8: 90000000 adrp x0, 0 <__traceiter_alloc_vmap_area>
40fc: 91000000 add x0, x0, #0x0
4100: f8f40016 ldaddal x20, x22, [x0]
4104: 8b160296 add x22, x20, x22
Relaxed:
40ec: d34cfe94 lsr x20, x20, #12
40f0: 14000044 b 4200 <free_vmap_area_noflush+0x19c>
40f4: 94000000 bl 0 <__sanitizer_cov_trace_pc>
40f8: 90000000 adrp x0, 0 <__traceiter_alloc_vmap_area>
40fc: 91000000 add x0, x0, #0x0
4100: f8340016 ldadd x20, x22, [x0]
4104: 8b160296 add x22, x20, x22
<snip>
Link: https://lkml.kernel.org/r/20250415112646.113091-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Christop Hellwig <hch@infradead.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
00ccf40ae2 |
mm, hugetlb: avoid passing a null nodemask when there is mbind policy
Before trying to allocate a page, gather_surplus_pages() sets up a nodemask for the nodes we can allocate from, but instead of passing the nodemask down the road to the page allocator, it iterates over the nodes within that nodemask right there, meaning that the page allocator will receive a preferred_nid and a null nodemask. This is a problem when using a memory policy, because it might be that the page allocator ends up using a node as a fallback which is not represented in the policy. Avoid that by passing the nodemask directly to the page allocator, so it can filter out fallback nodes that are not part of the nodemask. Link: https://lkml.kernel.org/r/20250415121503.376811-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
f736953e2b |
selftests/damon: remove the remaining test scripts for DAMON debugfs interface
DAMON has dropped debugfs support; therefore, remove these unused scripts.
Link: https://lkml.kernel.org/r/20250411024332.1373861-1-enze.li@linux.dev
Fixes:
|
||
|
|
60cada258d |
memcg: optimize memcg_rstat_updated
Currently the kernel maintains the stats updates per-memcg which is needed to implement stats flushing threshold. On the update side, the update is added to the per-cpu per-memcg update of the given memcg and all of its ancestors. However when the given memcg has passed the flushing threshold, all of its ancestors should have passed the threshold as well. There is no need to traverse up the memcg tree to maintain the stats updates. Perf profile collected from our fleet shows that memcg_rstat_updated is one of the most expensive memcg function i.e. a lot of cumulative CPU is being spent on it. So, even small micro optimizations matter a lot. This patch is microbenchmarked with multiple instances of netperf on a single machine with locally running netserver and we see couple of percentage of improvement. Link: https://lkml.kernel.org/r/20250410025752.92159-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
585a914588 |
selftests/mm: restore default nr_hugepages value during cleanup in hugetlb_reparenting_test.sh
During cleanup, the value of /proc/sys/vm/nr_hugepages is currently being
set to 0. At the end of the test, if all tests pass, the original
nr_hugepages value is restored. However, if any test fails, it remains
set to 0.
With this patch, we ensure that the original nr_hugepages value is
restored during cleanup, regardless of whether the test passes or fails.
Link: https://lkml.kernel.org/r/20250410100748.2310-1-donettom@linux.ibm.com
Fixes:
|
||
|
|
2a6ed1b411 |
maple_tree: reorder mas->store_type case statements
Move the unlikely case that mas->store_type is invalid to be the last evaluated case and put liklier cases higher up. Link: https://lkml.kernel.org/r/20250410191446.2474640-7-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Suggested-by: Liam R. Howlett <liam.howlett@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
271152a973 |
maple_tree: add sufficient height
In order to support rebalancing and spanning stores using less than the worst case number of nodes, we need to track more than just the vacant height. Using only vacant height to reduce the worst case maple node allocation count can lead to a shortcoming of nodes in the following scenarios. For rebalancing writes, when a leaf node becomes insufficient, it may be combined with a sibling into a single node. This means that the parent node which has entries for this children will lose one entry. If this parent node was just meeting the minimum entries, losing one entry will now cause this parent node to be insufficient. This leads to a cascading operation of rebalancing at different levels and can lead to more node allocations than simply using vacant height can return. For spanning writes, a similar situation occurs. At the location at which a spanning write is detected, the number of ancestor nodes may similarly need to rebalanced into a smaller number of nodes and the same cascading situation could occur. To use less than the full height of the tree for the number of allocations, we also need to track the height at which a non-leaf node cannot become insufficient. This means even if a rebalance occurs to a child of this node, it currently has enough entries that it can lose one without any further action. This field is stored in the maple write state as sufficient height. In mas_prealloc_calc() when figuring out how many nodes to allocate, we check if the vacant node is lower in the tree than a sufficient node (has a larger value). If it is, we cannot use the vacant height and must use the difference in the height and sufficient height as the basis for the number of nodes needed. An off by one bug was also discovered in mast_overflow() where it is using >= rather than >. This caused extra iterations of the mas_spanning_rebalance() loop and lead to unneeded allocations. A test is also added to check the number of allocations is correct. Link: https://lkml.kernel.org/r/20250410191446.2474640-6-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
300a5b4ffe |
maple_tree: break on convergence in mas_spanning_rebalance()
This allows support for using the vacant height to calculate the worst case number of nodes needed for wr_rebalance operation. mas_spanning_rebalance() was seen to perform unnecessary node allocations. We can reduce allocations by breaking early during the rebalancing loop once we realize that we have ascended to a common ancestor. Link: https://lkml.kernel.org/r/20250410191446.2474640-5-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Suggested-by: Liam Howlett <liam.howlett@oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
ad88fc17d2 |
maple_tree: use vacant nodes to reduce worst case allocations
In order to determine the store type for a maple tree operation, a walk of the tree is done through mas_wr_walk(). This function descends the tree until a spanning write is detected or we reach a leaf node. While descending, keep track of the height at which we encounter a node with available space. This is done by checking if mas->end is less than the number of slots a given node type can fit. Now that the height of the vacant node is tracked, we can use the difference between the height of the tree and the height of the vacant node to know how many levels we will have to propagate creating new nodes. Update mas_prealloc_calc() to consider the vacant height and reduce the number of worst-case allocations. Rebalancing and spanning stores are not supported and fall back to using the full height of the tree for allocations. Update preallocation testing assertions to take into account vacant height. Link: https://lkml.kernel.org/r/20250410191446.2474640-4-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
f9d3a963fe |
maple_tree: use height and depth consistently
For the maple tree, the root node is defined to have a depth of 0 with a height of 1. Each level down from the node, these values are incremented by 1. Various code paths define a root with depth 1 which is inconsisent with the definition. Modify the code to be consistent with this definition. In mas_spanning_rebalance(), l_mas.depth was being used to track the height based on the number of iterations done in the main loop. This information was then used in mas_put_in_tree() to set the height. Rather than overload the l_mas.depth field to track height, simply keep track of height in the local variable new_height and directly pass this to mas_wmb_replace() which will be passed into mas_put_in_tree(). This allows up to remove writes to l_mas.depth. Link: https://lkml.kernel.org/r/20250410191446.2474640-3-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
28092a652f |
maple_tree: convert mas_prealloc_calc() to take in a maple write state
Patch series "Track node vacancy to reduce worst case allocation counts", v5. ================ overview ======================== Currently, the maple tree preallocates the worst case number of nodes for given store type by taking into account the whole height of the tree. This comes from a worst case scenario of every node in the tree being full and having to propagate node allocation upwards until we reach the root of the tree. This can be optimized if there are vacancies in nodes that are at a lower depth than the root node. This series implements tracking the level at which there is a vacant node so we only need to allocate until this level is reached, rather than always using the full height of the tree. The ma_wr_state struct is modified to add a field which keeps track of the vacant height and is updated during walks of the tree. This value is then read in mas_prealloc_calc() when we decide how many nodes to allocate. For rebalancing and spanning stores, we also need to track the lowest height at which a node has 1 more entry than the minimum sufficient number of entries. This is because rebalancing can cause a parent node to become insufficient which results in further node allocations. In this case, we need to use the sufficient height as the worst case rather than the vacant height. patch 1-2: preparatory patches patch 3: implement vacant height tracking + update the tests patch 4: support vacant height tracking for rebalancing writes patch 5: implement sufficient height tracking patch 6: reorder switch case statements ================ results ========================= Bpftrace was used to profile the allocation path for requesting new maple nodes while running stress-ng mmap 120s. The histograms below represent requests to kmem_cache_alloc_bulk() and show the count argument. This represnts how many maple nodes the caller is requesting in kmem_cache_alloc_bulk() command: stress-ng --mmap 4 --timeout 120 mm-unstable @bulk_alloc_req: [3, 4) 4 | | [4, 5) 54170 |@ | [5, 6) 0 | | [6, 7) 893057 |@@@@@@@@@@@@@@@@@@@@ | [7, 8) 4 | | [8, 9) 2230287 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [9, 10) 55811 |@ | [10, 11) 77834 |@ | [11, 12) 0 | | [12, 13) 1368684 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [13, 14) 0 | | [14, 15) 0 | | [15, 16) 367197 |@@@@@@@@ | @maple_node_total: 46,630,160 @total_vmas: 46184591 mm-unstable + this series @bulk_alloc_req: [2, 3) 198 | | [3, 4) 4 | | [4, 5) 43 | | [5, 6) 0 | | [6, 7) 1069503 |@@@@@@@@@@@@@@@@@@@@@ | [7, 8) 4 | | [8, 9) 2597268 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [9, 10) 472191 |@@@@@@@@@ | [10, 11) 191904 |@@@ | [11, 12) 0 | | [12, 13) 247316 |@@@@ | [13, 14) 0 | | [14, 15) 0 | | [15, 16) 98769 |@ | @maple_node_total: 37,813,856 @total_vmas: 43493287 This represents a ~19% reduction in the number of bulk maple nodes allocated. For more reproducible results, a historgram of the return value of mas_prealloc_calc() is displayed while running the maple_tree_tests whcih have a deterministic store pattern mas_prealloc_calc() return value mm-unstable 1 : (12068) 3 : (11836) 5 : ***** (271192) 7 : ************************************************** (2329329) 9 : *********** (534186) 10 : (435) 11 : *************** (704306) 13 : ******** (409781) mas_prealloc_calc() return value mm-unstable + this series 1 : (12070) 3 : ************************************************** (3548777) 5 : ******** (633458) 7 : (65081) 9 : (11224) 10 : (341) 11 : (2973) 13 : (68) do_mmap latency was also measured for regressions: command: stress-ng --mmap 4 --timeout 120 mm-unstable: avg = 7162 nsecs, total: 16101821292 nsecs, count: 2248034 mm-unstable + this series: avg = 6689 nsecs, total: 15135391764 nsecs, count: 2262726 stress-ng --mmap4 --timeout 120 with vacant_height: stress-ng: info: [257] 21526312 Maple Tree Read 0.176 M/sec stress-ng: info: [257] 339979348 Maple Tree Write 2.774 M/sec without vacant_height: stress-ng: info: [8228] 20968900 Maple Tree Read 0.171 M/sec stress-ng: info: [8228] 312214648 Maple Tree Write 2.547 M/sec This represents an increase of ~3% read throughput and ~9% increase in write throughput. This patch (of 6): In a subsequent patch, mas_prealloc_calc() will need to access fields only in the ma_wr_state. Convert the function to take in a ma_wr_state and modify all callers. There is no functional change. Link: https://lkml.kernel.org/r/20250410191446.2474640-1-sidhartha.kumar@oracle.com Link: https://lkml.kernel.org/r/20250410191446.2474640-2-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
43c4cfde7e |
mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED]
MADV_DONTNEED[_LOCKED] handling for [process_]madvise() flushes tlb for each vma of each address range. Update the logic to do tlb flushes in a batched way. Initialize an mmu_gather object from do_madvise() and vector_madvise(), which are the entry level functions for [process_]madvise(), respectively. And pass those objects to the function for per-vma work, via madvise_behavior struct. Make the per-vma logic not flushes tlb on their own but just saves the tlb entries to the received mmu_gather object. For this internal logic change, make zap_page_range_single_batched() non-static and use it directly from madvise_dontneed_single_vma(). Finally, the entry level functions flush the tlb entries that gathered for the entire user request, at once. Link: https://lkml.kernel.org/r/20250410000022.1901-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
de8efdf8cd |
mm/memory: split non-tlb flushing part from zap_page_range_single()
Some of zap_page_range_single() callers such as [process_]madvise() with
MADV_DONTNEED[_LOCKED] cannot batch tlb flushes because
zap_page_range_single() flushes tlb for each invocation. Split out the
body of zap_page_range_single() except mmu_gather object initialization
and gathered tlb entries flushing for such batched tlb flushing usage.
To avoid hugetlb pages allocation failures from concurrent page faults,
the tlb flush should be done before hugetlb faults unlocking, though. Do
the flush and the unlock inside the split out function in the order for
hugetlb vma case. Refer to commit
|
||
|
|
01bef02bf9 |
mm/madvise: batch tlb flushes for MADV_FREE
MADV_FREE handling for [process_]madvise() flushes tlb for each vma of each address range. Update the logic to do tlb flushes in a batched way. Initialize an mmu_gather object from do_madvise() and vector_madvise(), which are the entry level functions for [process_]madvise(), respectively. And pass those objects to the function for per-vma work, via madvise_behavior struct. Make the per-vma logic not flushes tlb on their own but just saves the tlb entries to the received mmu_gather object. Finally, the entry level functions flush the tlb entries that gathered for the entire user request, at once. Link: https://lkml.kernel.org/r/20250410000022.1901-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
066c770437 |
mm/madvise: define and use madvise_behavior struct for madvise_do_behavior()
Patch series "mm/madvise: batch tlb flushes for MADV_DONTNEED and
MADV_FREE", v3.
When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or MADV_FREE
with multiple address ranges, tlb flushes happen for each of the given
address ranges. Because such tlb flushes are for the same process, doing
those in a batch is more efficient while still being safe. Modify
process_madvise() entry level code path to do such batched tlb flushes,
while the internal unmap logic do only gathering of the tlb entries to
flush.
In more detail, modify the entry functions to initialize an mmu_gather
object and pass it to the internal logic. And make the internal logic do
only gathering of the tlb entries to flush into the received mmu_gather
object. After all internal function calls are done, the entry functions
flush the gathered tlb entries at once.
Because process_madvise() and madvise() share the internal unmap logic,
make same change to madvise() entry code together, to make code consistent
and cleaner. It is only for keeping the code clean, and shouldn't degrade
madvise(). It could rather provide a potential tlb flushes reduction
benefit for a case that there are multiple vmas for the given address
range. It is only a side effect from an effort to keep code clean, so we
don't measure it separately.
Similar optimizations might be applicable to other madvise behavior such
as MADV_COLD and MADV_PAGEOUT. Those are simply out of the scope of this
patch series, though.
Patches Sequence
================
The first patch defines a new data structure for managing information that
is required for batched tlb flushes (mmu_gather and behavior), and update
code paths for MADV_DONTNEED[_LOCKED] and MADV_FREE handling internal
logic to receive it.
The second patch batches tlb flushes for MADV_FREE handling for both
madvise() and process_madvise().
Remaining two patches are for MADV_DONTNEED[_LOCKED] tlb flushes batching.
The third patch splits zap_page_range_single() for batching of
MADV_DONTNEED[_LOCKED] handling. The fourth patch batches tlb flushes for
the hint using the sub-logic that the third patch split out, and the
helpers for batched tlb flushes that introduced for the MADV_FREE case, by
the second patch.
Test Results
============
I measured the latency to apply MADV_DONTNEED advice to 256 MiB memory
using multiple process_madvise() calls. I apply the advice in 4 KiB sized
regions granularity, but with varying batch size per process_madvise()
call (vlen) from 1 to 1024. The source code for the measurement is
available at GitHub[1]. To reduce measurement errors, I did the
measurement five times.
The measurement results are as below. 'sz_batch' column shows the batch
size of process_madvise() calls. 'Before' and 'After' columns show the
average of latencies in nanoseconds that measured five times on kernels
that built without and with the tlb flushes batching of this series
(patches 3 and 4), respectively. For the baseline, mm-new tree of
2025-04-09[2] has been used, after reverting the second version of this
patch series and adding a temporal fix for !CONFIG_DEBUG_VM build
failure[3]. 'B-stdev' and 'A-stdev' columns show ratios of latency
measurements standard deviation to average in percent for 'Before' and
'After', respectively. 'Latency_reduction' shows the reduction of the
latency that the 'After' has achieved compared to 'Before', in percent.
Higher 'Latency_reduction' values mean more efficiency improvements.
sz_batch Before B-stdev After A-stdev Latency_reduction
1 146386348 2.78 111327360.6 3.13 23.95
2 108222130 1.54 72131173.6 2.39 33.35
4 93617846.8 2.76 51859294.4 2.50 44.61
8 80555150.4 2.38 44328790 1.58 44.97
16 77272777 1.62 37489433.2 1.16 51.48
32 76478465.2 2.75 33570506 3.48 56.10
64 75810266.6 1.15 27037652.6 1.61 64.34
128 73222748 3.86 25517629.4 3.30 65.15
256 72534970.8 2.31 25002180.4 0.94 65.53
512 71809392 5.12 24152285.4 2.41 66.37
1024 73281170.2 4.53 24183615 2.09 67.00
Unexpectedly the latency has reduced (improved) even with batch size one.
I think some of compiler optimizations have affected that, like also
observed with the first version of this patch series.
So, please focus on the proportion between the improvement and the batch
size. As expected, tlb flushes batching provides latency reduction that
proportional to the batch size. The efficiency gain ranges from about 33
percent with batch size 2, and up to 67 percent with batch size 1,024.
Please note that this is a very simple microbenchmark, so real efficiency
gain on real workload could be very different.
This patch (of 4):
To implement batched tlb flushes for MADV_DONTNEED[_LOCKED] and MADV_FREE,
an mmu_gather object in addition to the behavior integer need to be passed
to the internal logics. Using a struct can make it easy without
increasing the number of parameters of all code paths towards the internal
logic. Define a struct for the purpose and use it on the code path that
starts from madvise_do_behavior() and ends on madvise_dontneed_free().
Note that this changes madvise_walk_vmas() visitor type signature, too.
Specifically, it changes its 'arg' type from 'unsigned long' to the new
struct pointer.
Link: https://lkml.kernel.org/r/20250410000022.1901-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250410000022.1901-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam R. Howlett <howlett@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
10d483f198 |
mm: huge_memory: add folio_mark_accessed() when zapping file THP
When investigating performance issues during file folio unmap, I noticed some behavioral differences in handling non-PMD-sized folios and PMD-sized folios. For non-PMD-sized file folios, it will call folio_mark_accessed() to mark the folio as having seen activity, but this is not done for PMD-sized folios. This might not cause obvious issues, but a potential problem could be that, it might lead to reclaim of hot file folios under memory pressure, as quoted from Johannes: : Sometimes file contents are only accessed through relatively short-lived : mappings. But they can nevertheless be accessed a lot and be hot. It's : important to not lose that information on unmap, and end up kicking out a : frequently used cache page. Therefore, we should also add folio_mark_accessed() for PMD-sized file folios when unmapping. [baolin.wang@linux.alibaba.com: add comment] Link: https://lkml.kernel.org/r/23fdc11d-e983-4627-89a8-79e9ecf9a45a@linux.alibaba.com Link: https://lkml.kernel.org/r/fc117f60d7b686f87067f36a0ef7cdbc3a78109c.1744190345.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Barry Song <21cnbao@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
10d288964d |
tools/testing/selftests: assert that anon merge cases behave as expected
Prior to the recently applied commit that permits this merge,
mprotect()'ing a faulted VMA, adjacent to an unfaulted VMA, such that the
two share characteristics would fail to merge due to what appear to be
unintended consequences of commit
|
||
|
|
bd23f293a0 |
tools/testing: add PROCMAP_QUERY helper functions in mm self tests
The PROCMAP_QUERY ioctl() is very useful - it allows for binary access to /proc/$pid/[s]maps data and thus convenient lookup of data contained there. This patch exposes this for convenient use by mm self tests so the state of VMAs can easily be queried. Link: https://lkml.kernel.org/r/ce83d877093d1fc594762cf4b82f0c27963030ee.1744104124.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
879bca0a2c |
mm/vma: fix incorrectly disallowed anonymous VMA merges
Patch series "fix incorrectly disallowed anonymous VMA merges", v2.
It appears that we have been incorrectly rejecting merge cases for 15
years, apparently by mistake.
Imagine a range of anonymous mapped momemory divided into two VMAs like
this, with incompatible protection bits:
RW RWX
unfaulted faulted
|-----------|-----------|
| prev | vma |
|-----------|-----------|
mprotect(RW)
Now imagine mprotect()'ing vma so it is RW. This appears as if it should
merge, it does not.
Neither does this case, again mprotect()'ing vma RW:
RWX RW
faulted unfaulted
|-----------|-----------|
| vma | next |
|-----------|-----------|
mprotect(RW)
Nor:
RW RWX RW
unfaulted faulted unfaulted
|-----------|-----------|-----------|
| prev | vma | next |
|-----------|-----------|-----------|
mprotect(RW)
What's going on here?
In commit
|
||
|
|
af8251dd45 |
mm: rust: add MEMORY MANAGEMENT [RUST]
We have introduced Rust bindings for core mm abstractions as part of this series, so add an entry in MAINTAINERS to be explicit about who maintains this. Patches are anticipated to be taken through the mm tree as usual with other mm code. Link: https://rust-for-linux.com/rust-kernel-policy#how-is-rust-introduced-in-a-subsystem Link: https://lore.kernel.org/all/33e64b12-aa07-4e78-933a-b07c37ff1d84@lucifer.local/ Link: https://lkml.kernel.org/r/20250408-vma-v16-9-d8b446e885d9@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Andreas Hindborg <a.hindborg@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Gary Guo <gary@garyguo.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
6acb75ad7b |
task: rust: rework how current is accessed
Introduce a new type called `CurrentTask` that lets you perform various
operations that are only safe on the `current` task. Use the new type to
provide a way to access the current mm without incrementing its refcount.
With this change, you can write stuff such as
let vma = current!().mm().lock_vma_under_rcu(addr);
without incrementing any refcounts.
This replaces the existing abstractions for accessing the current pid
namespace. With the old approach, every field access to current involves
both a macro and a unsafe helper function. The new approach simplifies
that to a single safe function on the `CurrentTask` type. This makes it
less heavy-weight to add additional current accessors in the future.
That said, creating a `CurrentTask` type like the one in this patch
requires that we are careful to ensure that it cannot escape the current
task or otherwise access things after they are freed. To do this, I
declared that it cannot escape the current "task context" where I defined
a "task context" as essentially the region in which `current` remains
unchanged. So e.g., release_task() or begin_new_exec() would leave the
task context.
If a userspace thread returns to userspace and later makes another
syscall, then I consider the two syscalls to be different task contexts.
This allows values stored in that task to be modified between syscalls,
even if they're guaranteed to be immutable during a syscall.
Ensuring correctness of `CurrentTask` is slightly tricky if we also want
the ability to have a safe `kthread_use_mm()` implementation in Rust. To
support that safely, there are two patterns we need to ensure are safe:
// Case 1: current!() called inside the scope.
let mm;
kthread_use_mm(some_mm, || {
mm = current!().mm();
});
drop(some_mm);
mm.do_something(); // UAF
and:
// Case 2: current!() called before the scope.
let mm;
let task = current!();
kthread_use_mm(some_mm, || {
mm = task.mm();
});
drop(some_mm);
mm.do_something(); // UAF
The existing `current!()` abstraction already natively prevents the first
case: The `&CurrentTask` would be tied to the inner scope, so the
borrow-checker ensures that no reference derived from it can escape the
scope.
Fixing the second case is a bit more tricky. The solution is to
essentially pretend that the contents of the scope execute on an different
thread, which means that only thread-safe types can cross the boundary.
Since `CurrentTask` is marked `NotThreadSafe`, attempts to move it to
another thread will fail, and this includes our fake pretend thread
boundary.
This has the disadvantage that other types that aren't thread-safe for
reasons unrelated to `current` also cannot be moved across the
`kthread_use_mm()` boundary. I consider this an acceptable tradeoff.
Link: https://lkml.kernel.org/r/20250408-vma-v16-8-d8b446e885d9@google.com
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Reviewed-by: Gary Guo <gary@garyguo.net>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: Björn Roy Baron <bjorn3_gh@protonmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jann Horn <jannh@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Trevor Gross <tmgross@umich.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
f8c7819881 |
rust: miscdevice: add mmap support
Add the ability to write a file_operations->mmap hook in Rust when using the miscdevice abstraction. The `vma` argument to the `mmap` hook uses the `VmaNew` type from the previous commit; this type provides the correct set of operations for a file_operations->mmap hook. Link: https://lkml.kernel.org/r/20250408-vma-v16-7-d8b446e885d9@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Gary Guo <gary@garyguo.net> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
dcb81aeab4 |
mm: rust: add VmaNew for f_ops->mmap()
This type will be used when setting up a new vma in an f_ops->mmap() hook. Using a separate type from VmaRef allows us to have a separate set of operations that you are only able to use during the mmap() hook. For example, the VM_MIXEDMAP flag must not be changed after the initial setup that happens during the f_ops->mmap() hook. To avoid setting invalid flag values, the methods for clearing VM_MAYWRITE and similar involve a check of VM_WRITE, and return an error if VM_WRITE is set. Trying to use `try_clear_maywrite` without checking the return value results in a compilation error because the `Result` type is marked #[must_use]. For now, there's only a method for VM_MIXEDMAP and not VM_PFNMAP. When we add a VM_PFNMAP method, we will need some way to prevent you from setting both VM_MIXEDMAP and VM_PFNMAP on the same vma. Link: https://lkml.kernel.org/r/20250408-vma-v16-6-d8b446e885d9@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Gary Guo <gary@garyguo.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
114ba9b9e8 |
mm: rust: add mmput_async support
Adds an MmWithUserAsync type that uses mmput_async when dropped but is
otherwise identical to MmWithUser. This has to be done using a separate
type because the thing we are changing is the destructor.
Rust Binder needs this to avoid a certain deadlock. See commit
|
||
|
|
3105f8f391 |
mm: rust: add lock_vma_under_rcu
Currently, the binder driver always uses the mmap lock to make changes to its vma. Because the mmap lock is global to the process, this can involve significant contention. However, the kernel has a feature called per-vma locks, which can significantly reduce contention. For example, you can take a vma lock in parallel with an mmap write lock. This is important because contention on the mmap lock has been a long-term recurring challenge for the Binder driver. This patch introduces support for using `lock_vma_under_rcu` from Rust. The Rust Binder driver will be able to use this to reduce contention on the mmap lock. Link: https://lkml.kernel.org/r/20250408-vma-v16-4-d8b446e885d9@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Gary Guo <gary@garyguo.net> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
bf3d331bb8 |
mm: rust: add vm_insert_page
The vm_insert_page method is only usable on vmas with the VM_MIXEDMAP flag, so we introduce a new type to keep track of such vmas. The approach used in this patch assumes that we will not need to encode many flag combinations in the type. I don't think we need to encode more than VM_MIXEDMAP and VM_PFNMAP as things are now. However, if that becomes necessary, using generic parameters in a single type would scale better as the number of flags increases. Link: https://lkml.kernel.org/r/20250408-vma-v16-3-d8b446e885d9@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Gary Guo <gary@garyguo.net> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
040f404b73 |
mm: rust: add vm_area_struct methods that require read access
This adds a type called VmaRef which is used when referencing a vma that you have read access to. Here, read access means that you hold either the mmap read lock or the vma read lock (or stronger). Additionally, a vma_lookup method is added to the mmap read guard, which enables you to obtain a &VmaRef in safe Rust code. This patch only provides a way to lock the mmap read lock, but a follow-up patch also provides a way to just lock the vma read lock. Link: https://lkml.kernel.org/r/20250408-vma-v16-2-d8b446e885d9@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Gary Guo <gary@garyguo.net> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
5bb9ed6cdf |
mm: rust: add abstraction for struct mm_struct
Patch series "Rust support for mm_struct, vm_area_struct, and mmap", v16. This updates the vm_area_struct support to use the approach we discussed at LPC where there are several different Rust wrappers for vm_area_struct depending on the kind of access you have to the vma. Each case allows a different set of operations on the vma. This includes an MM MAINTAINERS entry as proposed by Lorenzo: https://lore.kernel.org/all/33e64b12-aa07-4e78-933a-b07c37ff1d84@lucifer.local/ This patch (of 9): These abstractions allow you to reference a `struct mm_struct` using both mmgrab and mmget refcounts. This is done using two Rust types: * Mm - represents an mm_struct where you don't know anything about the value of mm_users. * MmWithUser - represents an mm_struct where you know at compile time that mm_users is non-zero. This allows us to encode in the type system whether a method requires that mm_users is non-zero or not. For instance, you can always call `mmget_not_zero` but you can only call `mmap_read_lock` when mm_users is non-zero. The struct is called Mm to keep consistency with the C side. The ability to obtain `current->mm` is added later in this series. The mm module is defined to only exist when CONFIG_MMU is set. This avoids various errors due to missing types and functions when CONFIG_MMU is disabled. More fine-grained cfgs can be considered in the future. See the thread at [1] for more info. Link: https://lkml.kernel.org/r/20250408-vma-v16-9-d8b446e885d9@google.com Link: https://lkml.kernel.org/r/20250408-vma-v16-1-d8b446e885d9@google.com Link: https://lore.kernel.org/all/202503091916.QousmtcY-lkp@intel.com/ Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Balbir Singh <balbirs@nvidia.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Gary Guo <gary@garyguo.net> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
8472cc4503 |
riscv: mm: call PUD/P4D ctor in special kernel pgtable alloc
Constructors for PUD/P4D-level pgtables were recently introduced. They
should be called for all pgtables; make sure they are called for special
kernel mappings created by create_pgd_mapping() too.
While at it also switch to using pagetable_alloc() like in
alloc_{pte,pmd}_late().
Link: https://lkml.kernel.org/r/20250408095222.860601-13-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linus Waleij <linus.walleij@linaro.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <x86@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
cb5d2be838 |
arm64: mm: call PUD/P4D ctor in __create_pgd_mapping()
Constructors for PUD/P4D-level pgtables were recently introduced. They should be called for all pgtables; make sure they are called for special kernel mappings created by __create_pgd_mapping() too. Link: https://lkml.kernel.org/r/20250408095222.860601-12-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Linus Waleij <linus.walleij@linaro.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Will Deacon <will@kernel.org> Cc: <x86@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
0e3a16a760 |
riscv: mm: clarify ctor mm argument in alloc_{pte,pmd}_late
pagetable_{pte,pmd}_ctor(mm, ptdesc) skip the ptlock initialisation if mm
is &init_mm. To avoid unnecessary overhead, it is therefore preferable to
pass the actual mm associated to the PTE/PMD.
Unfortunately, this proves challenging for alloc_{pte,pmd}_late() as the
associated mm is not available at the point where they are called - in
fact not even top-level functions like create_pgd_mapping() are passed the
mm. As a result they both call the ctor with NULL as mm; this is safe but
potentially wasteful.
This is not a new situation, but let's add a couple of comments to clarify
it.
Link: https://lkml.kernel.org/r/20250408095222.860601-11-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linus Waleij <linus.walleij@linaro.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <x86@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|