linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-12-28 09:06:11 -05:00

Author	SHA1	Message	Date
Rakie Kim	cf8cecf2bc	mm/mempolicy: prepare weighted interleave sysfs for memory hotplug Previously, the weighted interleave sysfs structure was statically managed during initialization. This prevented new nodes from being recognized when memory hotplug events occurred, limiting the ability to update or extend sysfs entries dynamically at runtime. To address this, this patch refactors the sysfs infrastructure and encapsulates it within a new structure, `sysfs_wi_group`, which holds both the kobject and an array of node attribute pointers. By allocating this group structure globally, the per-node sysfs attributes can be managed beyond initialization time, enabling external modules to insert or remove node entries in response to events such as memory hotplug or node online/offline transitions. Instead of allocating all per-node sysfs attributes at once, the initialization path now uses the existing sysfs_wi_node_add() and sysfs_wi_node_delete() helpers. This refactoring makes it possible to modularly manage per-node sysfs entries and ensures the infrastructure is ready for runtime extension. Link: https://lkml.kernel.org/r/20250417072839.711-3-rakie.kim@sk.com Signed-off-by: Rakie Kim <rakie.kim@sk.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yunjeong Mun <yunjeong.mun@sk.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:35 -07:00
Rakie Kim	bb52e89d8b	mm/mempolicy: fix memory leaks in weighted interleave sysfs Patch series "Enhance sysfs handling for memory hotplug in weighted interleave", v9. The following patch series enhances the weighted interleave policy in the memory management subsystem by improving sysfs handling, fixing memory leaks, and introducing dynamic sysfs updates for memory hotplug support. This patch (of 3): Memory leaks occurred when removing sysfs attributes for weighted interleave. Improper kobject deallocation led to unreleased memory when initialization failed or when nodes were removed. The risk of leak is low because it only appears to trigger if setup fails. Setup only fails due to -ENOMEM which is unlikely to happen from a late_initcall() when memory pressure is low. This patch resolves the issue by replacing unnecessary `kfree()` calls with proper `kobject_del()` and `kobject_put()` sequences, ensuring correct teardown and preventing memory leaks. By explicitly calling `kobject_del()` before `kobject_put()`, the release function is now invoked safely, and internal sysfs state is correctly cleaned up. This guarantees that the memory associated with the kobject is fully released and avoids resource leaks, thereby improving system stability. Additionally, sysfs_remove_file() is no longer called from the release function to avoid accessing invalid sysfs state after kobject_del(). All attribute removals are now done before kobject_del(), preventing WARN_ON() in kernfs and ensuring safe and consistent cleanup of sysfs entries. Link: https://lkml.kernel.org/r/20250417072839.711-1-rakie.kim@sk.com Link: https://lkml.kernel.org/r/20250417072839.711-2-rakie.kim@sk.com Fixes: `dce41f5ae2` ("mm/mempolicy: implement the sysfs-based weighted_interleave interface") Signed-off-by: Rakie Kim <rakie.kim@sk.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yunjeong Mun <yunjeong.mun@sk.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:35 -07:00
Chen Ni	6e14fd33f1	mm: memcontrol: remove unnecessary NULL check before free_percpu() free_percpu() checks for NULL pointers internally. Remove unneeded NULL check here. Link: https://lkml.kernel.org/r/20250417084330.937380-1-nichen@iscas.ac.cn Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:34 -07:00
Uladzislau Rezki (Sony)	d096612048	vmalloc: align nr_vmalloc_pages and vmap_lazy_nr Currently both atomics share one cache-line: <snip> ... ffffffff83eab400 b vmap_lazy_nr ffffffff83eab408 b nr_vmalloc_pages ... <snip> those are global variables and they are only 8 bytes apart. Since they are modified by different threads this causes a false sharing. This can lead to a performance drop due to unnecessary cache invalidations. After this patch it is aligned to a cache line boundary: <snip> ... ffffffff8260a600 d vmap_lazy_nr ffffffff8260a640 d nr_vmalloc_pages ... <snip> Link: https://lkml.kernel.org/r/20250417161216.88318-4-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Adrian Huang <ahuang12@lenovo.com> Tested-by: Adrian Huang <ahuang12@lenovo.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Christop Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:34 -07:00
Uladzislau Rezki (Sony)	7a6fe58777	MAINTAINERS: add test_vmalloc.c to VMALLOC section A vmalloc subsystem includes "lib/test_vmalloc.c" test suite. Add an "F:" entry under VMALLOC section to track this file as part of the subsystem. Link: https://lkml.kernel.org/r/20250417161216.88318-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christop Hellwig <hch@infradead.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:34 -07:00
Uladzislau Rezki (Sony)	2d76e79315	lib/test_vmalloc.c: allow built-in execution Remove the dependency on module loading ("m") for the vmalloc test suite, enabling it to be built directly into the kernel, so both ("=m") and ("=y") are supported. Motivation: - Faster debugging/testing of vmalloc code; - It allows to configure the test via kernel-boot parameters. Configuration example: test_vmalloc.nr_threads=64 test_vmalloc.run_test_mask=7 test_vmalloc.sequential_test_order=1 Link: https://lkml.kernel.org/r/20250417161216.88318-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Adrian Huang <ahuang12@lenovo.com> Tested-by: Adrian Huang <ahuang12@lenovo.com> Cc: Christop Hellwig <hch@infradead.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:34 -07:00
Uladzislau Rezki (Sony)	7a73348e5d	lib/test_vmalloc.c: replace RWSEM to SRCU for setup The test has the initialization step during which threads are created. To prevent the workers from starting prematurely a write lock was previously used by the main setup thread, while each worker would block on a read lock. Replace this RWSEM based synchronization with a simpler SRCU based approach. Which does two basic steps: - Main thread wraps the setup phase in an SRCU read-side critical section. Pair of srcu_read_lock()/srcu_read_unlock(). - Each worker calls synchronize_srcu() on entry, ensuring it waits for the initialization phase to be completed. This patch eliminates the need for down_read()/up_read() and down_write()/up_write() pairs thus simplifying the logic and improving clarity. [urezki@gmail.com: fix compile error with CONFIG_TINY_RCU] Link: https://lkml.kernel.org/r/20250420142029.103169-1-urezki@gmail.com Link: https://lkml.kernel.org/r/20250417161216.88318-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Adrian Huang <ahuang12@lenovo.com> Tested-by: Adrian Huang <ahuang12@lenovo.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christop Hellwig <hch@infradead.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:33 -07:00
Sergey Senozhatsky	b05f8d7e07	Documentation: zram: update IDLE pages tracking documentation Move IDLE pages tracking into a separate chapter because there are multiple features that use (or depend on) it either in built-in variant ("mark all") or in extended variant (ac-time tracking). In addition, recompression doesn't require memory tracking to be enabled in order to be able to perform idle recompression. Link: https://lkml.kernel.org/r/20250416042833.3858827-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reported-by: Shin Kawamura <kawasin@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:33 -07:00
Dev Jain	4a34c584d8	mempolicy: optimize queue_folios_pte_range by PTE batching After the check for queue_folio_required(), the code only cares about the folio in the for loop, i.e the PTEs are redundant. Therefore, optimize this loop by skipping over a PTE batch mapping the same folio. With a test program migrating pages of the calling process, which includes a mapped VMA of size 4GB with pte-mapped large folios of order-9, and migrating once back and forth node-0 and node-1, the average execution time reduces from 7.5 to 4 seconds, giving an approx 47% speedup. Link: https://lkml.kernel.org/r/20250416053048.96479-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:33 -07:00
Lorenzo Stoakes	75404e0766	mm: move mmap/vma locking logic into specific files Currently the VMA and mmap locking logic is entangled in two of the most overwrought files in mm - include/linux/mm.h and mm/memory.c. Separate this logic out so we can more easily make changes and create an appropriate MAINTAINERS entry that spans only the logic relating to locking. This should have no functional change. Care is taken to avoid dependency loops, we must regrettably keep release_fault_lock() and assert_fault_locked() in mm.h as a result due to the dependence on the vm_fault type. Additionally we must declare rcuwait_wake_up() manually to avoid a dependency cycle on linux/rcuwait.h. Additionally move the nommu implementatino of lock_mm_and_find_vma() to mmap_lock.c so everything lock-related is in one place. Link: https://lkml.kernel.org/r/bec6c8e29fa8de9267a811a10b1bdae355d67ed4.1744799282.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:33 -07:00
Shakeel Butt	f735eebe55	memcg: multi-memcg percpu charge cache Memory cgroup accounting is expensive and to reduce the cost, the kernel maintains per-cpu charge cache for a single memcg. So, if a charge request comes for a different memcg, the kernel will flush the old memcg's charge cache and then charge the newer memcg a fixed amount (64 pages), subtracts the charge request amount and stores the remaining in the per-cpu charge cache for the newer memcg. This mechanism is based on the assumption that the kernel, for locality, keep a process on a CPU for long period of time and most of the charge requests from that process will be served by that CPU's local charge cache. However this assumption breaks down for incoming network traffic in a multi-tenant machine. We are in the process of running multiple workloads on a single machine and if such workloads are network heavy, we are seeing very high network memory accounting cost. We have observed multiple CPUs spending almost 100% of their time in net_rx_action and almost all of that time is spent in memcg accounting of the network traffic. More precisely, net_rx_action is serving packets from multiple workloads and is observing/serving mix of packets of these workloads. The memcg switch of per-cpu cache is very expensive and we are observing a lot of memcg switches on the machine. Almost all the time is being spent on charging new memcg and flushing older memcg cache. So, definitely we need per-cpu cache that support multiple memcgs for this scenario. This patch implements a simple (and dumb) multiple memcg percpu charge cache. Actually we started with more sophisticated LRU based approach but the dumb one was always better than the sophisticated one by 1% to 3%, so going with the simple approach. Some of the design choices are: 1. Fit all caches memcgs in a single cacheline. 2. The cache array can be mix of empty slots or memcg charged slots, so the kernel has to traverse the full array. 3. The cache drain from the reclaim will drain all cached memcgs to keep things simple. To evaluate the impact of this optimization, on a 72 CPUs machine, we ran the following workload where each netperf client runs in a different cgroup. The next-20250415 kernel is used as base. $ netserver -6 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K number of clients \| Without patch \| With patch 6 \| 42584.1 Mbps \| 48603.4 Mbps (14.13% improvement) 12 \| 30617.1 Mbps \| 47919.7 Mbps (56.51% improvement) 18 \| 25305.2 Mbps \| 45497.3 Mbps (79.79% improvement) 24 \| 20104.1 Mbps \| 37907.7 Mbps (88.55% improvement) 30 \| 14702.4 Mbps \| 30746.5 Mbps (109.12% improvement) 36 \| 10801.5 Mbps \| 26476.3 Mbps (145.11% improvement) The results show drastic improvement for network intensive workloads. [shakeel.butt@linux.dev: add BUILD_BUG_ON() for MEMCG_CHARGE_BATCH] Link: https://lkml.kernel.org/r/rlsgeosg3j7v5nihhbxxxbv3xfy4ejvigihj7lkkbt3n6imyne@2apxx2jm2e57 [shakeel.butt@linux.dev: simplify refill_stock] Link: https://lkml.kernel.org/r/as5cdsm4lraxupg3t6onep2ixql72za25hvd4x334dsoyo4apr@zyzl4vkuevuv [hughd@google.com: it's better to stock nr_pages than the uninitialized stock_pages] Link: https://lkml.kernel.org/r/d542d18f-1caa-6fea-e2c3-3555c87bcf64@google.com [shakeel.butt@linux.dev: add comment per Michal and use DEFINE_PER_CPU_ALIGNED instead of DEFINE_PER_CPU per Vlastimil] Link: https://lkml.kernel.org/r/dieeei3squ2gcnqxdjayvxbvzldr266rhnvtl3vjzsqevxkevf@ckui5vjzl2qg Link: https://lkml.kernel.org/r/20250416180229.2902751-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Eric Dumaze <edumazet@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:32 -07:00
Fan Ni	06340b9270	mm: convert free_page_and_swap_cache() to free_folio_and_swap_cache() free_page_and_swap_cache() takes a struct page pointer as input parameter, but it will immediately convert it to folio and all operations following within use folio instead of page. It makes more sense to pass in folio directly. Convert free_page_and_swap_cache() to free_folio_and_swap_cache() to consume folio directly. Link: https://lkml.kernel.org/r/20250416201720.41678-1-nifan.cxl@gmail.com Signed-off-by: Fan Ni <fan.ni@samsung.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Adam Manzanares <a.manzanares@samsung.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:32 -07:00
gaoxu	4f219913c1	mm: add nr_free_highatomic in show_free_areas commit c928807f6f6b6("mm/page_alloc: keep track of free highatomic") adds a new variable nr_free_highatomic, which is useful for analyzing low mem issues. add nr_free_highatomic in show_free_areas. Signed-off-by: gao xu <gaoxu2@honor.com> Link: https://lkml.kernel.org/r/d92eeff74f7a4578a14ac777cfe3603a@honor.com Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:32 -07:00
Hao Ge	df2bbc47e7	mm/vmscan: modify the assignment logic of the scan and total_scan variables The scan and total_scan variables can be initialized to 0 when they are defined, replacing the separate assignment statements. Link: https://lkml.kernel.org/r/20250417092422.1333620-1-hao.ge@linux.dev Signed-off-by: Hao Ge <gehao@kylinos.cn> Acked-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:32 -07:00
Enze Li	1477b8cd26	samples/damon/prcl: fix a comment typo This patch just fixes a typo in the comment. Link: https://lkml.kernel.org/r/20250411073800.1444481-1-lienze@kylinos.cn Signed-off-by: Enze Li <lienze@kylinos.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:31 -07:00
Baoquan He	a7797e74bd	mm/gup: clean up codes in fault_in_xxx() functions The code style in fault_in_readable() and fault_in_writable() is a little inconsistent with fault_in_safe_writeable(). In fault_in_readable() and fault_in_writable(), it uses 'uaddr' passed in as loop cursor. While in fault_in_safe_writeable(), local variable 'start' is used as loop cursor. This may mislead people when reading code or making change in these codes. Here define explicit loop cursor and use for loop to simplify codes in these three functions. These cleanup can make them be consistent in code style and improve readability. [bhe@redhat.com: address minor concerns from David] Link: https://lkml.kernel.org/r/Z/sbv3EmLXWgEE7+@MiWiFi-R3L-srv Link: https://lkml.kernel.org/r/20250410035717.473207-5-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yanjun.Zhu <yanjun.zhu@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:31 -07:00
Baoquan He	339122abb5	mm/gup: remove gup_fast_pgd_leaf() and clean up the relevant codes In the current kernel, only pud huge page is supported in some architectures. P4d and pgd huge pages haven't been supported yet. And in mm/gup.c, there's no pgd huge page handling in the follow_page_mask() code path. Hence it doesn't make sense to only have gup_fast_pgd_leaf() in gup_fast code path. Here remove gup_fast_pgd_leaf() and clean up the relevant codes. Link: https://lkml.kernel.org/r/20250410035717.473207-4-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Yanjun.Zhu <yanjun.zhu@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:31 -07:00
Baoquan He	ede27b7ee2	mm/gup: remove unneeded checking in follow_page_pte() Patch series "mm/gup: Minor fix, cleanup and improvements", v4. These were made from code inspection in mm/gup.c. This patch (of 3): In __get_user_pages(), it will traverse page table and take a reference to the page the given user address corresponds to if GUP_GET or GUP_PIN is set. However, it's not supported both GUP_GET and GUP_PIN are set. Even though this check need be done, it should be done earlier, but not doing it till entering into follow_page_pte() and failed. Furthermore, this checking has been done in is_valid_gup_args() and all external users of __get_user_pages() will call is_valid_gup_args() to catch the illegal setting. We don't need to worry about internal users of __get_user_pages() because the gup_flags are set by MM code correctly. Here remove the checking in follow_page_pte(), and add VM_WARN_ON_ONCE() to catch the possible exceptional setting just in case. And also change the VM_BUG_ON to VM_WARN_ON_ONCE() for checking (!!pages != !!(gup_flags & (FOLL_GET \| FOLL_PIN))) because the checking has been done in is_valid_gup_args() for external users of __get_user_pages(). Link: https://lkml.kernel.org/r/20250410035717.473207-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20250410035717.473207-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Yanjun.Zhu <yanjun.zhu@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:31 -07:00
Oscar Salvador	e7a446030b	mm,hugetlb: allocate frozen pages in alloc_buddy_hugetlb_folio alloc_buddy_hugetlb_folio() allocates a rmappable folio, then strips the rmappable part and freezes it. We can simplify all that by allocating frozen pages directly. Link: https://lkml.kernel.org/r/20250411132359.312708-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:30 -07:00
Uladzislau Rezki (Sony)	0272d07ef6	vmalloc: use atomic_long_add_return_relaxed() Switch from the atomic_long_add_return() to its relaxed version. We do not need a full memory barrier or any memory ordering during increasing the "vmap_lazy_nr" variable. What we only need is to do it atomically. This is what atomic_long_add_return_relaxed() guarantees. AARCH64: <snip> Default: 40ec: d34cfe94 lsr x20, x20, #12 40f0: 14000044 b 4200 <free_vmap_area_noflush+0x19c> 40f4: 94000000 bl 0 <__sanitizer_cov_trace_pc> 40f8: 90000000 adrp x0, 0 <__traceiter_alloc_vmap_area> 40fc: 91000000 add x0, x0, #0x0 4100: f8f40016 ldaddal x20, x22, [x0] 4104: 8b160296 add x22, x20, x22 Relaxed: 40ec: d34cfe94 lsr x20, x20, #12 40f0: 14000044 b 4200 <free_vmap_area_noflush+0x19c> 40f4: 94000000 bl 0 <__sanitizer_cov_trace_pc> 40f8: 90000000 adrp x0, 0 <__traceiter_alloc_vmap_area> 40fc: 91000000 add x0, x0, #0x0 4100: f8340016 ldadd x20, x22, [x0] 4104: 8b160296 add x22, x20, x22 <snip> Link: https://lkml.kernel.org/r/20250415112646.113091-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christop Hellwig <hch@infradead.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:30 -07:00
Oscar Salvador	00ccf40ae2	mm, hugetlb: avoid passing a null nodemask when there is mbind policy Before trying to allocate a page, gather_surplus_pages() sets up a nodemask for the nodes we can allocate from, but instead of passing the nodemask down the road to the page allocator, it iterates over the nodes within that nodemask right there, meaning that the page allocator will receive a preferred_nid and a null nodemask. This is a problem when using a memory policy, because it might be that the page allocator ends up using a node as a fallback which is not represented in the policy. Avoid that by passing the nodemask directly to the page allocator, so it can filter out fallback nodes that are not part of the nodemask. Link: https://lkml.kernel.org/r/20250415121503.376811-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:30 -07:00
Enze Li	f736953e2b	selftests/damon: remove the remaining test scripts for DAMON debugfs interface DAMON has dropped debugfs support; therefore, remove these unused scripts. Link: https://lkml.kernel.org/r/20250411024332.1373861-1-enze.li@linux.dev Fixes: `5ec4333b19` ("mm/damon: remove DAMON debugfs interface") Signed-off-by: Enze Li <lienze@kylinos.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:30 -07:00
Shakeel Butt	60cada258d	memcg: optimize memcg_rstat_updated Currently the kernel maintains the stats updates per-memcg which is needed to implement stats flushing threshold. On the update side, the update is added to the per-cpu per-memcg update of the given memcg and all of its ancestors. However when the given memcg has passed the flushing threshold, all of its ancestors should have passed the threshold as well. There is no need to traverse up the memcg tree to maintain the stats updates. Perf profile collected from our fleet shows that memcg_rstat_updated is one of the most expensive memcg function i.e. a lot of cumulative CPU is being spent on it. So, even small micro optimizations matter a lot. This patch is microbenchmarked with multiple instances of netperf on a single machine with locally running netserver and we see couple of percentage of improvement. Link: https://lkml.kernel.org/r/20250410025752.92159-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:29 -07:00
Donet Tom	585a914588	selftests/mm: restore default nr_hugepages value during cleanup in hugetlb_reparenting_test.sh During cleanup, the value of /proc/sys/vm/nr_hugepages is currently being set to 0. At the end of the test, if all tests pass, the original nr_hugepages value is restored. However, if any test fails, it remains set to 0. With this patch, we ensure that the original nr_hugepages value is restored during cleanup, regardless of whether the test passes or fails. Link: https://lkml.kernel.org/r/20250410100748.2310-1-donettom@linux.ibm.com Fixes: `29750f71a9` ("hugetlb_cgroup: add hugetlb_cgroup reservation tests") Signed-off-by: Donet Tom <donettom@linux.ibm.com> Cc: Li Wang <liwang@redhat.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:29 -07:00
Sidhartha Kumar	2a6ed1b411	maple_tree: reorder mas->store_type case statements Move the unlikely case that mas->store_type is invalid to be the last evaluated case and put liklier cases higher up. Link: https://lkml.kernel.org/r/20250410191446.2474640-7-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Suggested-by: Liam R. Howlett <liam.howlett@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:29 -07:00
Sidhartha Kumar	271152a973	maple_tree: add sufficient height In order to support rebalancing and spanning stores using less than the worst case number of nodes, we need to track more than just the vacant height. Using only vacant height to reduce the worst case maple node allocation count can lead to a shortcoming of nodes in the following scenarios. For rebalancing writes, when a leaf node becomes insufficient, it may be combined with a sibling into a single node. This means that the parent node which has entries for this children will lose one entry. If this parent node was just meeting the minimum entries, losing one entry will now cause this parent node to be insufficient. This leads to a cascading operation of rebalancing at different levels and can lead to more node allocations than simply using vacant height can return. For spanning writes, a similar situation occurs. At the location at which a spanning write is detected, the number of ancestor nodes may similarly need to rebalanced into a smaller number of nodes and the same cascading situation could occur. To use less than the full height of the tree for the number of allocations, we also need to track the height at which a non-leaf node cannot become insufficient. This means even if a rebalance occurs to a child of this node, it currently has enough entries that it can lose one without any further action. This field is stored in the maple write state as sufficient height. In mas_prealloc_calc() when figuring out how many nodes to allocate, we check if the vacant node is lower in the tree than a sufficient node (has a larger value). If it is, we cannot use the vacant height and must use the difference in the height and sufficient height as the basis for the number of nodes needed. An off by one bug was also discovered in mast_overflow() where it is using >= rather than >. This caused extra iterations of the mas_spanning_rebalance() loop and lead to unneeded allocations. A test is also added to check the number of allocations is correct. Link: https://lkml.kernel.org/r/20250410191446.2474640-6-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:29 -07:00
Sidhartha Kumar	300a5b4ffe	maple_tree: break on convergence in mas_spanning_rebalance() This allows support for using the vacant height to calculate the worst case number of nodes needed for wr_rebalance operation. mas_spanning_rebalance() was seen to perform unnecessary node allocations. We can reduce allocations by breaking early during the rebalancing loop once we realize that we have ascended to a common ancestor. Link: https://lkml.kernel.org/r/20250410191446.2474640-5-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Suggested-by: Liam Howlett <liam.howlett@oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:28 -07:00
Sidhartha Kumar	ad88fc17d2	maple_tree: use vacant nodes to reduce worst case allocations In order to determine the store type for a maple tree operation, a walk of the tree is done through mas_wr_walk(). This function descends the tree until a spanning write is detected or we reach a leaf node. While descending, keep track of the height at which we encounter a node with available space. This is done by checking if mas->end is less than the number of slots a given node type can fit. Now that the height of the vacant node is tracked, we can use the difference between the height of the tree and the height of the vacant node to know how many levels we will have to propagate creating new nodes. Update mas_prealloc_calc() to consider the vacant height and reduce the number of worst-case allocations. Rebalancing and spanning stores are not supported and fall back to using the full height of the tree for allocations. Update preallocation testing assertions to take into account vacant height. Link: https://lkml.kernel.org/r/20250410191446.2474640-4-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:28 -07:00
Sidhartha Kumar	f9d3a963fe	maple_tree: use height and depth consistently For the maple tree, the root node is defined to have a depth of 0 with a height of 1. Each level down from the node, these values are incremented by 1. Various code paths define a root with depth 1 which is inconsisent with the definition. Modify the code to be consistent with this definition. In mas_spanning_rebalance(), l_mas.depth was being used to track the height based on the number of iterations done in the main loop. This information was then used in mas_put_in_tree() to set the height. Rather than overload the l_mas.depth field to track height, simply keep track of height in the local variable new_height and directly pass this to mas_wmb_replace() which will be passed into mas_put_in_tree(). This allows up to remove writes to l_mas.depth. Link: https://lkml.kernel.org/r/20250410191446.2474640-3-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:28 -07:00
Sidhartha Kumar	28092a652f	maple_tree: convert mas_prealloc_calc() to take in a maple write state Patch series "Track node vacancy to reduce worst case allocation counts", v5. ================ overview ======================== Currently, the maple tree preallocates the worst case number of nodes for given store type by taking into account the whole height of the tree. This comes from a worst case scenario of every node in the tree being full and having to propagate node allocation upwards until we reach the root of the tree. This can be optimized if there are vacancies in nodes that are at a lower depth than the root node. This series implements tracking the level at which there is a vacant node so we only need to allocate until this level is reached, rather than always using the full height of the tree. The ma_wr_state struct is modified to add a field which keeps track of the vacant height and is updated during walks of the tree. This value is then read in mas_prealloc_calc() when we decide how many nodes to allocate. For rebalancing and spanning stores, we also need to track the lowest height at which a node has 1 more entry than the minimum sufficient number of entries. This is because rebalancing can cause a parent node to become insufficient which results in further node allocations. In this case, we need to use the sufficient height as the worst case rather than the vacant height. patch 1-2: preparatory patches patch 3: implement vacant height tracking + update the tests patch 4: support vacant height tracking for rebalancing writes patch 5: implement sufficient height tracking patch 6: reorder switch case statements ================ results ========================= Bpftrace was used to profile the allocation path for requesting new maple nodes while running stress-ng mmap 120s. The histograms below represent requests to kmem_cache_alloc_bulk() and show the count argument. This represnts how many maple nodes the caller is requesting in kmem_cache_alloc_bulk() command: stress-ng --mmap 4 --timeout 120 mm-unstable @bulk_alloc_req: [3, 4) 4 \| \| [4, 5) 54170 \|@ \| [5, 6) 0 \| \| [6, 7) 893057 \|@@@@@@@@@@@@@@@@@@@@ \| [7, 8) 4 \| \| [8, 9) 2230287 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\| [9, 10) 55811 \|@ \| [10, 11) 77834 \|@ \| [11, 12) 0 \| \| [12, 13) 1368684 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ \| [13, 14) 0 \| \| [14, 15) 0 \| \| [15, 16) 367197 \|@@@@@@@@ \| @maple_node_total: 46,630,160 @total_vmas: 46184591 mm-unstable + this series @bulk_alloc_req: [2, 3) 198 \| \| [3, 4) 4 \| \| [4, 5) 43 \| \| [5, 6) 0 \| \| [6, 7) 1069503 \|@@@@@@@@@@@@@@@@@@@@@ \| [7, 8) 4 \| \| [8, 9) 2597268 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\| [9, 10) 472191 \|@@@@@@@@@ \| [10, 11) 191904 \|@@@ \| [11, 12) 0 \| \| [12, 13) 247316 \|@@@@ \| [13, 14) 0 \| \| [14, 15) 0 \| \| [15, 16) 98769 \|@ \| @maple_node_total: 37,813,856 @total_vmas: 43493287 This represents a ~19% reduction in the number of bulk maple nodes allocated. For more reproducible results, a historgram of the return value of mas_prealloc_calc() is displayed while running the maple_tree_tests whcih have a deterministic store pattern mas_prealloc_calc() return value mm-unstable 1 : (12068) 3 : (11836) 5 : *** (271192) 7 : ********************************************** (2329329) 9 : ******* (534186) 10 : (435) 11 : *********** (704306) 13 : **** (409781) mas_prealloc_calc() return value mm-unstable + this series 1 : (12070) 3 : ********************************************** (3548777) 5 : ****** (633458) 7 : (65081) 9 : (11224) 10 : (341) 11 : (2973) 13 : (68) do_mmap latency was also measured for regressions: command: stress-ng --mmap 4 --timeout 120 mm-unstable: avg = 7162 nsecs, total: 16101821292 nsecs, count: 2248034 mm-unstable + this series: avg = 6689 nsecs, total: 15135391764 nsecs, count: 2262726 stress-ng --mmap4 --timeout 120 with vacant_height: stress-ng: info: [257] 21526312 Maple Tree Read 0.176 M/sec stress-ng: info: [257] 339979348 Maple Tree Write 2.774 M/sec without vacant_height: stress-ng: info: [8228] 20968900 Maple Tree Read 0.171 M/sec stress-ng: info: [8228] 312214648 Maple Tree Write 2.547 M/sec This represents an increase of ~3% read throughput and ~9% increase in write throughput. This patch (of 6): In a subsequent patch, mas_prealloc_calc() will need to access fields only in the ma_wr_state. Convert the function to take in a ma_wr_state and modify all callers. There is no functional change. Link: https://lkml.kernel.org/r/20250410191446.2474640-1-sidhartha.kumar@oracle.com Link: https://lkml.kernel.org/r/20250410191446.2474640-2-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:28 -07:00
SeongJae Park	43c4cfde7e	mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED] MADV_DONTNEED[_LOCKED] handling for [process_]madvise() flushes tlb for each vma of each address range. Update the logic to do tlb flushes in a batched way. Initialize an mmu_gather object from do_madvise() and vector_madvise(), which are the entry level functions for [process_]madvise(), respectively. And pass those objects to the function for per-vma work, via madvise_behavior struct. Make the per-vma logic not flushes tlb on their own but just saves the tlb entries to the received mmu_gather object. For this internal logic change, make zap_page_range_single_batched() non-static and use it directly from madvise_dontneed_single_vma(). Finally, the entry level functions flush the tlb entries that gathered for the entire user request, at once. Link: https://lkml.kernel.org/r/20250410000022.1901-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:27 -07:00
SeongJae Park	de8efdf8cd	mm/memory: split non-tlb flushing part from zap_page_range_single() Some of zap_page_range_single() callers such as [process_]madvise() with MADV_DONTNEED[_LOCKED] cannot batch tlb flushes because zap_page_range_single() flushes tlb for each invocation. Split out the body of zap_page_range_single() except mmu_gather object initialization and gathered tlb entries flushing for such batched tlb flushing usage. To avoid hugetlb pages allocation failures from concurrent page faults, the tlb flush should be done before hugetlb faults unlocking, though. Do the flush and the unlock inside the split out function in the order for hugetlb vma case. Refer to commit `2820b0f09b` ("hugetlbfs: close race between MADV_DONTNEED and page fault") for more details about the concurrent faults' page allocation failure problem. Link: https://lkml.kernel.org/r/20250410000022.1901-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:27 -07:00
SeongJae Park	01bef02bf9	mm/madvise: batch tlb flushes for MADV_FREE MADV_FREE handling for [process_]madvise() flushes tlb for each vma of each address range. Update the logic to do tlb flushes in a batched way. Initialize an mmu_gather object from do_madvise() and vector_madvise(), which are the entry level functions for [process_]madvise(), respectively. And pass those objects to the function for per-vma work, via madvise_behavior struct. Make the per-vma logic not flushes tlb on their own but just saves the tlb entries to the received mmu_gather object. Finally, the entry level functions flush the tlb entries that gathered for the entire user request, at once. Link: https://lkml.kernel.org/r/20250410000022.1901-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:27 -07:00
SeongJae Park	066c770437	mm/madvise: define and use madvise_behavior struct for madvise_do_behavior() Patch series "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE", v3. When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or MADV_FREE with multiple address ranges, tlb flushes happen for each of the given address ranges. Because such tlb flushes are for the same process, doing those in a batch is more efficient while still being safe. Modify process_madvise() entry level code path to do such batched tlb flushes, while the internal unmap logic do only gathering of the tlb entries to flush. In more detail, modify the entry functions to initialize an mmu_gather object and pass it to the internal logic. And make the internal logic do only gathering of the tlb entries to flush into the received mmu_gather object. After all internal function calls are done, the entry functions flush the gathered tlb entries at once. Because process_madvise() and madvise() share the internal unmap logic, make same change to madvise() entry code together, to make code consistent and cleaner. It is only for keeping the code clean, and shouldn't degrade madvise(). It could rather provide a potential tlb flushes reduction benefit for a case that there are multiple vmas for the given address range. It is only a side effect from an effort to keep code clean, so we don't measure it separately. Similar optimizations might be applicable to other madvise behavior such as MADV_COLD and MADV_PAGEOUT. Those are simply out of the scope of this patch series, though. Patches Sequence ================ The first patch defines a new data structure for managing information that is required for batched tlb flushes (mmu_gather and behavior), and update code paths for MADV_DONTNEED[_LOCKED] and MADV_FREE handling internal logic to receive it. The second patch batches tlb flushes for MADV_FREE handling for both madvise() and process_madvise(). Remaining two patches are for MADV_DONTNEED[_LOCKED] tlb flushes batching. The third patch splits zap_page_range_single() for batching of MADV_DONTNEED[_LOCKED] handling. The fourth patch batches tlb flushes for the hint using the sub-logic that the third patch split out, and the helpers for batched tlb flushes that introduced for the MADV_FREE case, by the second patch. Test Results ============ I measured the latency to apply MADV_DONTNEED advice to 256 MiB memory using multiple process_madvise() calls. I apply the advice in 4 KiB sized regions granularity, but with varying batch size per process_madvise() call (vlen) from 1 to 1024. The source code for the measurement is available at GitHub[1]. To reduce measurement errors, I did the measurement five times. The measurement results are as below. 'sz_batch' column shows the batch size of process_madvise() calls. 'Before' and 'After' columns show the average of latencies in nanoseconds that measured five times on kernels that built without and with the tlb flushes batching of this series (patches 3 and 4), respectively. For the baseline, mm-new tree of 2025-04-09[2] has been used, after reverting the second version of this patch series and adding a temporal fix for !CONFIG_DEBUG_VM build failure[3]. 'B-stdev' and 'A-stdev' columns show ratios of latency measurements standard deviation to average in percent for 'Before' and 'After', respectively. 'Latency_reduction' shows the reduction of the latency that the 'After' has achieved compared to 'Before', in percent. Higher 'Latency_reduction' values mean more efficiency improvements. sz_batch Before B-stdev After A-stdev Latency_reduction 1 146386348 2.78 111327360.6 3.13 23.95 2 108222130 1.54 72131173.6 2.39 33.35 4 93617846.8 2.76 51859294.4 2.50 44.61 8 80555150.4 2.38 44328790 1.58 44.97 16 77272777 1.62 37489433.2 1.16 51.48 32 76478465.2 2.75 33570506 3.48 56.10 64 75810266.6 1.15 27037652.6 1.61 64.34 128 73222748 3.86 25517629.4 3.30 65.15 256 72534970.8 2.31 25002180.4 0.94 65.53 512 71809392 5.12 24152285.4 2.41 66.37 1024 73281170.2 4.53 24183615 2.09 67.00 Unexpectedly the latency has reduced (improved) even with batch size one. I think some of compiler optimizations have affected that, like also observed with the first version of this patch series. So, please focus on the proportion between the improvement and the batch size. As expected, tlb flushes batching provides latency reduction that proportional to the batch size. The efficiency gain ranges from about 33 percent with batch size 2, and up to 67 percent with batch size 1,024. Please note that this is a very simple microbenchmark, so real efficiency gain on real workload could be very different. This patch (of 4): To implement batched tlb flushes for MADV_DONTNEED[_LOCKED] and MADV_FREE, an mmu_gather object in addition to the behavior integer need to be passed to the internal logics. Using a struct can make it easy without increasing the number of parameters of all code paths towards the internal logic. Define a struct for the purpose and use it on the code path that starts from madvise_do_behavior() and ends on madvise_dontneed_free(). Note that this changes madvise_walk_vmas() visitor type signature, too. Specifically, it changes its 'arg' type from 'unsigned long' to the new struct pointer. Link: https://lkml.kernel.org/r/20250410000022.1901-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250410000022.1901-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liam R. Howlett <howlett@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:27 -07:00
Baolin Wang	10d483f198	mm: huge_memory: add folio_mark_accessed() when zapping file THP When investigating performance issues during file folio unmap, I noticed some behavioral differences in handling non-PMD-sized folios and PMD-sized folios. For non-PMD-sized file folios, it will call folio_mark_accessed() to mark the folio as having seen activity, but this is not done for PMD-sized folios. This might not cause obvious issues, but a potential problem could be that, it might lead to reclaim of hot file folios under memory pressure, as quoted from Johannes: : Sometimes file contents are only accessed through relatively short-lived : mappings. But they can nevertheless be accessed a lot and be hot. It's : important to not lose that information on unmap, and end up kicking out a : frequently used cache page. Therefore, we should also add folio_mark_accessed() for PMD-sized file folios when unmapping. [baolin.wang@linux.alibaba.com: add comment] Link: https://lkml.kernel.org/r/23fdc11d-e983-4627-89a8-79e9ecf9a45a@linux.alibaba.com Link: https://lkml.kernel.org/r/fc117f60d7b686f87067f36a0ef7cdbc3a78109c.1744190345.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Barry Song <21cnbao@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:26 -07:00
Lorenzo Stoakes	10d288964d	tools/testing/selftests: assert that anon merge cases behave as expected Prior to the recently applied commit that permits this merge, mprotect()'ing a faulted VMA, adjacent to an unfaulted VMA, such that the two share characteristics would fail to merge due to what appear to be unintended consequences of commit `965f55dea0` ("mmap: avoid merging cloned VMAs"). Now we have fixed this bug, assert that we can indeed merge anonymous VMAs this way. Also assert that forked source/target VMAs are equally rejected. Previously, all empty target anon merges with one VMA faulted and the other unfaulted would be rejected incorrectly, now we ensure that unforked merge, but forked do not. Additionally, add the new test file to the MEMORY MAPPING section in MAINTAINERS, as these tests are explicitly memory mapping related. Link: https://lkml.kernel.org/r/2b69330274a3b71721f7042c5eabe91143934415.1744104124.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:26 -07:00
Lorenzo Stoakes	bd23f293a0	tools/testing: add PROCMAP_QUERY helper functions in mm self tests The PROCMAP_QUERY ioctl() is very useful - it allows for binary access to /proc/$pid/[s]maps data and thus convenient lookup of data contained there. This patch exposes this for convenient use by mm self tests so the state of VMAs can easily be queried. Link: https://lkml.kernel.org/r/ce83d877093d1fc594762cf4b82f0c27963030ee.1744104124.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:26 -07:00
Lorenzo Stoakes	879bca0a2c	mm/vma: fix incorrectly disallowed anonymous VMA merges Patch series "fix incorrectly disallowed anonymous VMA merges", v2. It appears that we have been incorrectly rejecting merge cases for 15 years, apparently by mistake. Imagine a range of anonymous mapped momemory divided into two VMAs like this, with incompatible protection bits: RW RWX unfaulted faulted \|-----------\|-----------\| \| prev \| vma \| \|-----------\|-----------\| mprotect(RW) Now imagine mprotect()'ing vma so it is RW. This appears as if it should merge, it does not. Neither does this case, again mprotect()'ing vma RW: RWX RW faulted unfaulted \|-----------\|-----------\| \| vma \| next \| \|-----------\|-----------\| mprotect(RW) Nor: RW RWX RW unfaulted faulted unfaulted \|-----------\|-----------\|-----------\| \| prev \| vma \| next \| \|-----------\|-----------\|-----------\| mprotect(RW) What's going on here? In commit `5beb493052` ("mm: change anon_vma linking to fix multi-process server scalability issue"), from 2010, Rik von Riel took careful care to account for these cases - commenting that '[this is] easily overlooked: when mprotect shifts the boundary, make sure the expanding vma has anon_vma set if the shrinking vma had, to cover any anon pages imported.' However, commit `965f55dea0` ("mmap: avoid merging cloned VMAs") introduced a little over a year later, appears to have accidentally disallowed this. By adjusting the is_mergeable_anon_vma() function to avoid lock contention across large trees of forked anon_vma's, this commit wrongly assumed the VMA being checked (the ostensible merge 'target') should be faulted, that is, have an anon_vma, and thus an anon_vma_chain list established, but only of length 1. This appears to have been unintentional, as disallowing empty target VMAs like this across the board makes no sense. We already have logic that accounts for this case, the same logic Rik introduced in 2010, now via dup_anon_vma() (and ultimately anon_vma_clone()), so there is no problem permitting this. This series fixes this mistake and also ensures that scalability concerns remain addressed by explicitly checking that whatever VMA is being merged has not been forked. A full set of self tests which reproduce the issue are provided, as well as updating userland VMA tests to assert this behaviour. The self tests additionally assert scalability concerns are addressed. This patch (of 3): anon_vma_chain's were introduced by Rik von Riel in commit `5beb493052` ("mm: change anon_vma linking to fix multi-process server scalability issue"). This patch was introduced in March 2010. As part of this change, careful attention was made to the instance of mprotect() causing a VMA merge, with one faulted (i.e. having anon_vma set) and another not: /* * Easily overlooked: when mprotect shifts the boundary, * make sure the expanding vma has anon_vma set if the * shrinking vma had, to cover any anon pages imported. / In the modern VMA code, this is handled in dup_anon_vma() (and ultimately anon_vma_clone()). This case is one of the three configurations of adjacent VMA anon_vma state that we might encounter on merge (where dst is the VMA which will be merged into and src the one being merged into dst): 1. dst->anon_vma, src->anon_vma - These must be equal, no-op. 2. dst->anon_vma, !src->anon_vma - We simply use dst->anon_vma, no-op. 3. !dst->anon_vma, src->anon_vma - The case in question here. In case 3, the instance addressed here - we duplicate the AVC connections from src and place into dst. However, in practice, we very often do NOT do this. This appears to be due to an inadvertent consequence of the change introduced by commit `965f55dea0` ("mmap: avoid merging cloned VMAs"), introduced in May 2011. This implies that this merge case was functional only for a little over a year, and has since been broken for ~15 years. Here, lock scalability concerns lead to us restricting anonymous merges only to those VMAs with 1 entry in their vma->anon_vma_chain, that is, a VMA that is not connected to any parent process's anon_vma. The mergeability test looks like this: static inline bool is_mergeable_anon_vma(struct anon_vma anon_vma1, struct anon_vma anon_vma2, struct vm_area_struct vma) { if ((!anon_vma1 \|\| !anon_vma2) && (!vma \|\| !vma->anon_vma \|\| list_is_singular(&vma->anon_vma_chain))) return true; return anon_vma1 == anon_vma2; } However, we have a problem here - typically the vma passed here is the destination VMA. For instance in vma_merge_existing_range() we invoke: can_vma_merge_left() -> [ check that there is an immediately adjacent prior VMA ] -> can_vma_merge_after() -> is_mergeable_vma() for general attribute check -> is_mergeable_anon_vma([ proposed anon_vma ], prev->anon_vma, prev) So if we were considering a target unfaulted 'prev': unfaulted faulted \|-----------\|-----------\| \| prev \| vma \| \|-----------\|-----------\| This would call is_mergeable_anon_vma(NULL, vma->anon_vma, prev). The list_is_singular() check for vma->anon_vma_chain, an empty list on fault, would cause this merge to _fail_ even though all else indicates a merge. Equally a simple merge into a next VMA would hit the same problem: faulted unfaulted \|-----------\|-----------\| \| vma \| next \| \|-----------\|-----------\| can_vma_merge_right() -> [ check that there is an immediately adjacent succeeding VMA ] -> can_vma_merge_before() -> is_mergeable_vma() for general attribute check -> is_mergeable_anon_vma([ proposed anon_vma ], next->anon_vma, next) For a 3-way merge, we'd also hit the same problem if it was configured like this for instance: unfaulted faulted unfaulted \|-----------\|-----------\|-----------\| \| prev \| vma \| next \| \|-----------\|-----------\|-----------\| As we'd call can_vma_merge_left() for prev, and can_vma_merge_right() for next, both of which would fail. vma_merge_new_range() (and relatedly, vma_expand()) are not impacted, as the new VMA would never already be faulted (it is a proposed new range). Because we already handle each of the aforementioned merge cases, and can absolutely therefore deal with an existing VMA merge with !dst->anon_vma, src->anon_vma, there is absolutely no reason to disallow this kind of merge. It seems that the intention of this patch is to ensure that, in the instance of merging unfaulted VMAs with faulted ones, we never wish to do so with those with multiple AVCs due to the fact that anon_vma lock's are held across both parent and child anon_vma's (actually, the 'root' parent anon_vma's lock is used). In fact, the original commit alludes to this - "find_mergeable_anon_vma() already considers this case". In find_mergeable_anon_vma() however, we check the anon_vma which will be merged from, if it is set, then we check list_is_singular(vma->anon_vma_chain). So to match this logic, update is_mergeable_anon_vma() to perform this scalability check on the VMA whose anon_vma we ultimately merge into. This matches existing behaviour with forked VMAs, only we no longer wrongly disallow ALL empty target merges. So we both allow merge cases and ensure the scalability check is correctly applied. We may wish to revisit these lock scalability concerns at a later date and ensure they are still valid. Additionally, correct userland VMA tests which were mistakenly not asserting these cases correctly previously to now correctly assert this, and to ensure vmg->anon_vma state is always consistent to account for newly introduced asserts. Link: https://lkml.kernel.org/r/cover.1744104124.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/18c756fc9eaf7ad082a710c91133b8346f8cd9a8.1744104124.git.lorenzo.stoakes@oracle.com Fixes: `965f55dea0` ("mmap: avoid merging cloned VMAs") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:26 -07:00
Alice Ryhl	af8251dd45	mm: rust: add MEMORY MANAGEMENT [RUST] We have introduced Rust bindings for core mm abstractions as part of this series, so add an entry in MAINTAINERS to be explicit about who maintains this. Patches are anticipated to be taken through the mm tree as usual with other mm code. Link: https://rust-for-linux.com/rust-kernel-policy#how-is-rust-introduced-in-a-subsystem Link: https://lore.kernel.org/all/33e64b12-aa07-4e78-933a-b07c37ff1d84@lucifer.local/ Link: https://lkml.kernel.org/r/20250408-vma-v16-9-d8b446e885d9@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Andreas Hindborg <a.hindborg@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Gary Guo <gary@garyguo.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:25 -07:00
Alice Ryhl	6acb75ad7b	task: rust: rework how current is accessed Introduce a new type called `CurrentTask` that lets you perform various operations that are only safe on the `current` task. Use the new type to provide a way to access the current mm without incrementing its refcount. With this change, you can write stuff such as let vma = current!().mm().lock_vma_under_rcu(addr); without incrementing any refcounts. This replaces the existing abstractions for accessing the current pid namespace. With the old approach, every field access to current involves both a macro and a unsafe helper function. The new approach simplifies that to a single safe function on the `CurrentTask` type. This makes it less heavy-weight to add additional current accessors in the future. That said, creating a `CurrentTask` type like the one in this patch requires that we are careful to ensure that it cannot escape the current task or otherwise access things after they are freed. To do this, I declared that it cannot escape the current "task context" where I defined a "task context" as essentially the region in which `current` remains unchanged. So e.g., release_task() or begin_new_exec() would leave the task context. If a userspace thread returns to userspace and later makes another syscall, then I consider the two syscalls to be different task contexts. This allows values stored in that task to be modified between syscalls, even if they're guaranteed to be immutable during a syscall. Ensuring correctness of `CurrentTask` is slightly tricky if we also want the ability to have a safe `kthread_use_mm()` implementation in Rust. To support that safely, there are two patterns we need to ensure are safe: // Case 1: current!() called inside the scope. let mm; kthread_use_mm(some_mm, \|\| { mm = current!().mm(); }); drop(some_mm); mm.do_something(); // UAF and: // Case 2: current!() called before the scope. let mm; let task = current!(); kthread_use_mm(some_mm, \|\| { mm = task.mm(); }); drop(some_mm); mm.do_something(); // UAF The existing `current!()` abstraction already natively prevents the first case: The `&CurrentTask` would be tied to the inner scope, so the borrow-checker ensures that no reference derived from it can escape the scope. Fixing the second case is a bit more tricky. The solution is to essentially pretend that the contents of the scope execute on an different thread, which means that only thread-safe types can cross the boundary. Since `CurrentTask` is marked `NotThreadSafe`, attempts to move it to another thread will fail, and this includes our fake pretend thread boundary. This has the disadvantage that other types that aren't thread-safe for reasons unrelated to `current` also cannot be moved across the `kthread_use_mm()` boundary. I consider this an acceptable tradeoff. Link: https://lkml.kernel.org/r/20250408-vma-v16-8-d8b446e885d9@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Gary Guo <gary@garyguo.net> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:25 -07:00
Alice Ryhl	f8c7819881	rust: miscdevice: add mmap support Add the ability to write a file_operations->mmap hook in Rust when using the miscdevice abstraction. The `vma` argument to the `mmap` hook uses the `VmaNew` type from the previous commit; this type provides the correct set of operations for a file_operations->mmap hook. Link: https://lkml.kernel.org/r/20250408-vma-v16-7-d8b446e885d9@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Gary Guo <gary@garyguo.net> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:25 -07:00
Alice Ryhl	dcb81aeab4	mm: rust: add VmaNew for f_ops->mmap() This type will be used when setting up a new vma in an f_ops->mmap() hook. Using a separate type from VmaRef allows us to have a separate set of operations that you are only able to use during the mmap() hook. For example, the VM_MIXEDMAP flag must not be changed after the initial setup that happens during the f_ops->mmap() hook. To avoid setting invalid flag values, the methods for clearing VM_MAYWRITE and similar involve a check of VM_WRITE, and return an error if VM_WRITE is set. Trying to use `try_clear_maywrite` without checking the return value results in a compilation error because the `Result` type is marked #[must_use]. For now, there's only a method for VM_MIXEDMAP and not VM_PFNMAP. When we add a VM_PFNMAP method, we will need some way to prevent you from setting both VM_MIXEDMAP and VM_PFNMAP on the same vma. Link: https://lkml.kernel.org/r/20250408-vma-v16-6-d8b446e885d9@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Gary Guo <gary@garyguo.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:25 -07:00
Alice Ryhl	114ba9b9e8	mm: rust: add mmput_async support Adds an MmWithUserAsync type that uses mmput_async when dropped but is otherwise identical to MmWithUser. This has to be done using a separate type because the thing we are changing is the destructor. Rust Binder needs this to avoid a certain deadlock. See commit `9a9ab0d963` ("binder: fix race between mmput() and do_exit()") for details. It's also needed in the shrinker to avoid cleaning up the mm in the shrinker's context. Link: https://lkml.kernel.org/r/20250408-vma-v16-5-d8b446e885d9@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Gary Guo <gary@garyguo.net> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:24 -07:00
Alice Ryhl	3105f8f391	mm: rust: add lock_vma_under_rcu Currently, the binder driver always uses the mmap lock to make changes to its vma. Because the mmap lock is global to the process, this can involve significant contention. However, the kernel has a feature called per-vma locks, which can significantly reduce contention. For example, you can take a vma lock in parallel with an mmap write lock. This is important because contention on the mmap lock has been a long-term recurring challenge for the Binder driver. This patch introduces support for using `lock_vma_under_rcu` from Rust. The Rust Binder driver will be able to use this to reduce contention on the mmap lock. Link: https://lkml.kernel.org/r/20250408-vma-v16-4-d8b446e885d9@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Gary Guo <gary@garyguo.net> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:24 -07:00
Alice Ryhl	bf3d331bb8	mm: rust: add vm_insert_page The vm_insert_page method is only usable on vmas with the VM_MIXEDMAP flag, so we introduce a new type to keep track of such vmas. The approach used in this patch assumes that we will not need to encode many flag combinations in the type. I don't think we need to encode more than VM_MIXEDMAP and VM_PFNMAP as things are now. However, if that becomes necessary, using generic parameters in a single type would scale better as the number of flags increases. Link: https://lkml.kernel.org/r/20250408-vma-v16-3-d8b446e885d9@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Gary Guo <gary@garyguo.net> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:24 -07:00
Alice Ryhl	040f404b73	mm: rust: add vm_area_struct methods that require read access This adds a type called VmaRef which is used when referencing a vma that you have read access to. Here, read access means that you hold either the mmap read lock or the vma read lock (or stronger). Additionally, a vma_lookup method is added to the mmap read guard, which enables you to obtain a &VmaRef in safe Rust code. This patch only provides a way to lock the mmap read lock, but a follow-up patch also provides a way to just lock the vma read lock. Link: https://lkml.kernel.org/r/20250408-vma-v16-2-d8b446e885d9@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Gary Guo <gary@garyguo.net> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:24 -07:00
Alice Ryhl	5bb9ed6cdf	mm: rust: add abstraction for struct mm_struct Patch series "Rust support for mm_struct, vm_area_struct, and mmap", v16. This updates the vm_area_struct support to use the approach we discussed at LPC where there are several different Rust wrappers for vm_area_struct depending on the kind of access you have to the vma. Each case allows a different set of operations on the vma. This includes an MM MAINTAINERS entry as proposed by Lorenzo: https://lore.kernel.org/all/33e64b12-aa07-4e78-933a-b07c37ff1d84@lucifer.local/ This patch (of 9): These abstractions allow you to reference a `struct mm_struct` using both mmgrab and mmget refcounts. This is done using two Rust types: * Mm - represents an mm_struct where you don't know anything about the value of mm_users. * MmWithUser - represents an mm_struct where you know at compile time that mm_users is non-zero. This allows us to encode in the type system whether a method requires that mm_users is non-zero or not. For instance, you can always call `mmget_not_zero` but you can only call `mmap_read_lock` when mm_users is non-zero. The struct is called Mm to keep consistency with the C side. The ability to obtain `current->mm` is added later in this series. The mm module is defined to only exist when CONFIG_MMU is set. This avoids various errors due to missing types and functions when CONFIG_MMU is disabled. More fine-grained cfgs can be considered in the future. See the thread at [1] for more info. Link: https://lkml.kernel.org/r/20250408-vma-v16-9-d8b446e885d9@google.com Link: https://lkml.kernel.org/r/20250408-vma-v16-1-d8b446e885d9@google.com Link: https://lore.kernel.org/all/202503091916.QousmtcY-lkp@intel.com/ Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Balbir Singh <balbirs@nvidia.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Gary Guo <gary@garyguo.net> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:24 -07:00
Kevin Brodsky	8472cc4503	riscv: mm: call PUD/P4D ctor in special kernel pgtable alloc Constructors for PUD/P4D-level pgtables were recently introduced. They should be called for all pgtables; make sure they are called for special kernel mappings created by create_pgd_mapping() too. While at it also switch to using pagetable_alloc() like in alloc_{pte,pmd}_late(). Link: https://lkml.kernel.org/r/20250408095222.860601-13-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Linus Waleij <linus.walleij@linaro.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Will Deacon <will@kernel.org> Cc: <x86@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:23 -07:00
Kevin Brodsky	cb5d2be838	arm64: mm: call PUD/P4D ctor in __create_pgd_mapping() Constructors for PUD/P4D-level pgtables were recently introduced. They should be called for all pgtables; make sure they are called for special kernel mappings created by __create_pgd_mapping() too. Link: https://lkml.kernel.org/r/20250408095222.860601-12-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Linus Waleij <linus.walleij@linaro.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Will Deacon <will@kernel.org> Cc: <x86@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:23 -07:00
Kevin Brodsky	0e3a16a760	riscv: mm: clarify ctor mm argument in alloc_{pte,pmd}_late pagetable_{pte,pmd}_ctor(mm, ptdesc) skip the ptlock initialisation if mm is &init_mm. To avoid unnecessary overhead, it is therefore preferable to pass the actual mm associated to the PTE/PMD. Unfortunately, this proves challenging for alloc_{pte,pmd}_late() as the associated mm is not available at the point where they are called - in fact not even top-level functions like create_pgd_mapping() are passed the mm. As a result they both call the ctor with NULL as mm; this is safe but potentially wasteful. This is not a new situation, but let's add a couple of comments to clarify it. Link: https://lkml.kernel.org/r/20250408095222.860601-11-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Linus Waleij <linus.walleij@linaro.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Will Deacon <will@kernel.org> Cc: <x86@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:48:23 -07:00

1 2 3 4 5 ...

1353103 Commits