linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-09 12:33:18 -04:00

Author	SHA1	Message	Date
Zeng Jingxiang	b9585a3f3e	mm/list_lru: make the case where mlru is NULL as unlikely In the following memcg_list_lru_alloc() function, mlru here is almost always NULL, so in most cases this should save a function call, mark mlru as unlikely to optimize the code, and reusing the mlru for the next attempt when the tree insertion fails. do { xas_lock_irqsave(&xas, flags); if (!xas_load(&xas) && !css_is_dying(&pos->css)) { xas_store(&xas, mlru); if (!xas_error(&xas)) mlru = NULL; } xas_unlock_irqrestore(&xas, flags); } while (xas_nomem(&xas, GFP_KERNEL)); > if (mlru) kfree(mlru); Link: https://lkml.kernel.org/r/20250227082223.1173847-1-jingxiangzeng.cas@gmail.com Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202412290924.UTP7GH2Z-lkp@intel.com/ Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Jingxiang Zeng <linuszeng@tencent.com> Cc: Kairui Song <kasong@tencent.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-17 00:05:32 -07:00
Anshuman Khandual	f9aad62200	mm: rename GENERIC_PTDUMP and PTDUMP_CORE Platforms subscribe into generic ptdump implementation via GENERIC_PTDUMP. But generic ptdump gets enabled via PTDUMP_CORE. These configs combination is confusing as they sound very similar and does not differentiate between platform's feature subscription and feature enablement for ptdump. Rename the configs as ARCH_HAS_PTDUMP and PTDUMP making it more clear and improve readability. Link: https://lkml.kernel.org/r/20250226122404.1927473-6-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> (powerpc) Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Cc: Will Deacon <will@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marc Zyngier <maz@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Steven Price <steven.price@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-17 00:05:32 -07:00
Anshuman Khandual	3f54872454	mm: make DEBUG_WX depdendent on GENERIC_PTDUMP DEBUG_WX selects PTDUMP_CORE without even ensuring that the given platform implements GENERIC_PTDUMP. This problem has been latent until now, as all the platforms subscribing ARCH_HAS_DEBUG_WX also subscribe GENERIC_PTDUMP. Link: https://lkml.kernel.org/r/20250226122404.1927473-5-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Steven Price <steven.price@arm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-17 00:05:31 -07:00
Anshuman Khandual	a5c96dfd47	docs: arm64: drop PTDUMP config options from ptdump.rst Both GENERIC_PTDUMP and PTDUMP_CORE are not user selectable config options. Just drop these from documentation. Link: https://lkml.kernel.org/r/20250226122404.1927473-4-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Suggested-by: Steven Price <steven.price@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-17 00:05:31 -07:00
Anshuman Khandual	2c5e6ac2db	arch/powerpc: drop GENERIC_PTDUMP from mpc885_ads_defconfig GENERIC_PTDUMP gets selected on powerpc explicitly and hence can be dropped off from mpc885_ads_defconfig. Replace with CONFIG_PTDUMP_DEBUGFS instead. Link: https://lkml.kernel.org/r/20250226122404.1927473-3-anshuman.khandual@arm.com Fixes: `e084728393` ("powerpc/ptdump: Convert powerpc to GENERIC_PTDUMP") Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Steven Price <steven.price@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-17 00:05:31 -07:00
Anshuman Khandual	9a4f9e2a81	configs: drop GENERIC_PTDUMP from debug.config Patch series "mm: Rework generic PTDUMP configs", v3. The series reworks generic PTDUMP configs before eventually renaming them after some basic cleanups first. This patch (of 5): The platforms that support GENERIC_PTDUMP select the config explicitly. But enabling this feature on platforms that don't really support - does nothing or might cause a build failure. Hence just drop GENERIC_PTDUMP from generic debug.config Link: https://lkml.kernel.org/r/20250226122404.1927473-1-anshuman.khandual@arm.com Link: https://lkml.kernel.org/r/20250226122404.1927473-2-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Steven Price <steven.price@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-17 00:05:30 -07:00
David Hildenbrand	720ba85040	mm/mmu_notifier: use MMU_NOTIFY_CLEAR in remove_device_exclusive_entry() Let's limit the use of MMU_NOTIFY_EXCLUSIVE to the case where we convert a present PTE to device-exclusive. For the other case, we can simply use MMU_NOTIFY_CLEAR, because it really is clearing the device-exclusive entry first, to then install the present entry. Update the documentation of MMU_NOTIFY_EXCLUSIVE, to document the single use case more thoroughly. If ever required, we could add a separate MMU_NOTIFY_CLEAR_EXCLUSIVE; for now using MMU_NOTIFY_CLEAR seems to be sufficient. Link: https://lkml.kernel.org/r/20250226132257.2826043-6-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-17 00:05:30 -07:00
David Hildenbrand	2f95381f8a	mm/memory: document restore_exclusive_pte() Let's document how this function is to be used, and why the folio lock is involved. Link: https://lkml.kernel.org/r/20250226132257.2826043-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-17 00:05:30 -07:00
David Hildenbrand	248624f9c6	mm/memory: pass folio and pte to restore_exclusive_pte() Let's pass the folio and the pte to restore_exclusive_pte(), so we can avoid repeated page_folio() and ptep_get(). To do that, pass the pte to try_restore_exclusive_pte() and use a folio in there already. While at it, just avoid the "swp_entry_t entry" variable in try_restore_exclusive_pte() and add a folio-locked check to restore_exclusive_pte(). Link: https://lkml.kernel.org/r/20250226132257.2826043-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-17 00:05:30 -07:00
David Hildenbrand	db0f6e674c	mm/memory: remove PageAnonExclusive sanity-check in restore_exclusive_pte() In commit `b832a354d7` ("mm/memory: page_add_anon_rmap() -> folio_add_anon_rmap_pte()") we accidentally changed the sanity check to essentially ignore anonymous folio by mis-placing the "!" ... but we really always only get anonymous folios in restore_exclusive_pte(). However, in the meantime we removed the separate "writable device-exclusive entries" and always detect if the PTE can be writable using can_change_pte_writable() -- which also consults PageAnonExclusive. So let's just get rid of this sanity check completely. Link: https://lkml.kernel.org/r/20250226132257.2826043-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-17 00:05:29 -07:00
David Hildenbrand	66add5e909	lib/test_hmm: make dmirror_atomic_map() consume a single page Patch series "mm: cleanups for device-exclusive entries (hmm)", v2. Some smaller device-exclusive cleanups I have lying around. This patch (of 5): The caller now always passes a single page; let's simplify, and return "0" on success. Link: https://lkml.kernel.org/r/20250226132257.2826043-1-david@redhat.com Link: https://lkml.kernel.org/r/20250226132257.2826043-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-17 00:05:29 -07:00
Matthew Wilcox (Oracle)	173a3dc051	mm: assert the folio is locked in folio_start_writeback() The folio must be locked when we start writeback in order to prevent writeback from being started twice on the same folio. I don't expect this to catch any problems, but it should be good documentation. Link: https://lkml.kernel.org/r/20250226153614.3774896-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-17 00:05:29 -07:00
Seongjun Kim	a58f3dcf20	samples/damon: a typo in the kconfig - sameple There is a typo in the Kconfig file of the damon sample module. Correct it: s/sameple/sample/ Link: https://lkml.kernel.org/r/20250226184204.29370-1-sj@kernel.org Signed-off-by: Seongjun Kim <bus710@gmail.com> Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-17 00:05:29 -07:00
Brendan Jackman	ebc29409c2	mm/page_alloc: warn on nr_reserved_highatomic underflow As documented in the comment this underflow should not happen. The locking has indeed changed here since the comment was written, see the migratetype hygiene patches[0]. However, those changes made the locking _safer_, so the underflow _really_ shouldn't happen now. So upgrade the comment to a warning. [0] https://lore.kernel.org/all/20240320180429.678181-7-hannes@cmpxchg.org/T/#m3da87e6cc3348a4640aa298137bc9f8f61b76c84 Link: https://lkml.kernel.org/r/20250225-warn-underflow-v1-1-3dc542941d3a@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:44 -07:00
Christoph Hellwig	88fb7794f6	vmalloc: drop Christoph from Reviewers I haven't been doing as much review as I should. As part of reducing my inbox flow drop me from the official Reviewers. I might still chime in on patches occasionally. Link: https://lkml.kernel.org/r/20250224163033.350072-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:44 -07:00
Kairui Song	b487a2da35	mm, swap: simplify folio swap allocation With slot cache gone, clean up the allocation helpers even more. folio_alloc_swap will be the only entry for allocation and adding the folio to swap cache (except suspend), making it opposite of folio_free_swap. Link: https://lkml.kernel.org/r/20250313165935.63303-8-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:44 -07:00
Kairui Song	0ff67f990b	mm, swap: remove swap slot cache Slot cache is no longer needed now, removing it and all related code. - vm-scalability with: `usemem --init-time -O -y -x -R -31 1G`, 12G memory cgroup using simulated pmem as SWAP (32G pmem, 32 CPUs), 16 test runs for each case, measuring the total throughput: Before (KB/s) (stdev) After (KB/s) (stdev) Random (4K): 424907.60 (24410.78) 414745.92 (34554.78) Random (64K): 163308.82 (11635.72) 167314.50 (18434.99) Sequential (4K, !-R): 6150056.79 (103205.90) 6321469.06 (115878.16) The performance changes are below noise level. - Build linux kernel with make -j96, using 4K folio with 1.5G memory cgroup limit and 64K folio with 2G memory cgroup limit, on top of tmpfs, 12 test runs, measuring the system time: Before (s) (stdev) After (s) (stdev) make -j96 (4K): 6445.69 (61.95) 6408.80 (69.46) make -j96 (64K): 6841.71 (409.04) 6437.99 (435.55) Similar to above, 64k mTHP case showed a slight improvement. Link: https://lkml.kernel.org/r/20250313165935.63303-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:43 -07:00
Kairui Song	1b7e90020e	mm, swap: use percpu cluster as allocation fast path Current allocation workflow first traverses the plist with a global lock held, after choosing a device, it uses the percpu cluster on that swap device. This commit moves the percpu cluster variable out of being tied to individual swap devices, making it a global percpu variable, and will be used directly for allocation as a fast path. The global percpu cluster variable will never point to a HDD device, and allocations on a HDD device are still globally serialized. This improves the allocator performance and prepares for removal of the slot cache in later commits. There shouldn't be much observable behavior change, except one thing: this changes how swap device allocation rotation works. Currently, each allocation will rotate the plist, and because of the existence of slot cache (one order 0 allocation usually returns 64 entries), swap devices of the same priority are rotated for every 64 order 0 entries consumed. High order allocations are different, they will bypass the slot cache, and so swap device is rotated for every 16K, 32K, or up to 2M allocation. The rotation rule was never clearly defined or documented, it was changed several times without mentioning. After this commit, and once slot cache is gone in later commits, swap device rotation will happen for every consumed cluster. Ideally non-HDD devices will be rotated if 2M space has been consumed for each order. Fragmented clusters will rotate the device faster, which seems OK. HDD devices is rotated for every allocation regardless of the allocation order, which should be OK too and trivial. This commit also slightly changes allocation behaviour for slot cache. The new added cluster allocation fast path may allocate entries from different device to the slot cache, this is not observable from user space, only impact performance very slightly, and slot cache will be just gone in next commit, so this can be ignored. 
Link: https://lkml.kernel.org/r/20250313165935.63303-6-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:43 -07:00
Kairui Song	280cfccaa2	mm, swap: don't update the counter up-front The counter update before allocation design was useful to avoid unnecessary scan when device is full, so it will abort early if the counter indicates the device is full. But that is an uncommon case, and now scanning of a full device is very fast, so the up-front update is not helpful any more. Remove it and simplify the slot allocation logic. Link: https://lkml.kernel.org/r/20250313165935.63303-5-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:43 -07:00
Kairui Song	78524b05f1	mm, swap: avoid redundant swap device pinning Currently __read_swap_cache_async() has get/put_swap_device() calls to increase/decrease a swap device reference to prevent swapoff. While some of its callers have already held the swap device reference, e.g in do_swap_page() and shmem_swapin_folio() where __read_swap_cache_async() will finally called. Now there are only two callers not holding a swap device reference, so make them hold a reference instead. And drop the get/put_swap_device calls in __read_swap_cache_async. This should reduce the overhead for swap in during page fault slightly. Link: https://lkml.kernel.org/r/20250313165935.63303-4-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:43 -07:00
Kairui Song	3123fb0a18	mm, swap: drop the flag TTRS_DIRECT This flag exists temporarily to allow the allocator to bypass the slot cache during freeing, so reclaiming one slot will free the slot immediately. But now we have already removed slot cache usage on freeing, so this flag has no effect now. Link: https://lkml.kernel.org/r/20250313165935.63303-3-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:42 -07:00
Kairui Song	fae8595505	mm, swap: avoid reclaiming irrelevant swap cache Patch series "mm, swap: remove swap slot cache", v3. Slot cache was initially introduced by commit `67afa38e01` ("mm/swap: add cache for swap slots allocation") to reduce the lock contention of si->lock. Previous series "mm, swap: rework of swap allocator locks" [1] removed swap slot cache for freeing path as freeing path no longer touches si->lock in most cased. Allocation path also have slight to none contention on si->lock since that series, but slot cache still helps to reduce other overheads, like counters and the plist. This series removes the slot cache from allocation path too, by using the cluster as allocation fast path and also reduce other overheads. Now slot cache is completely gone, the code is much simplified without obvious feature or performance change, also clean up related workaround. Also this should avoid other potential issues, e.g. the long pinning of swap slots: swap slot cache pins swap slots with HAS_CACHE, causing reclaim or allocation fail to use these slots on scanning. The only behavior change is the swap device allocation rotation mechanism, as explained in the patch "mm, swap: use percpu cluster as allocation fast path". 
Test results are looking good after deleting the swap slot cache: - vm-scalability with: `usemem --init-time -O -y -x -R -31 1G`, 12G memory cgroup using simulated pmem as SWAP (32G pmem, 32 CPUs), 16 test runs for each case, measuring the total throughput: Before (KB/s) (stdev) After (KB/s) (stdev) Random (4K): 424907.60 (24410.78) 414745.92 (34554.78) Random (64K): 163308.82 (11635.72) 167314.50 (18434.99) Sequential (4K, !-R): 6150056.79 (103205.90) 6321469.06 (115878.16) - Build linux kernel with make -j96, using 4K folio with 1.5G memory cgroup limit and 64K folio with 2G memory cgroup limit, on top of tmpfs, 12 test runs, measuring the system time: Before (s) (stdev) After (s) (stdev) make -j96 (4K): 6445.69 (61.95) 6408.80 (69.46) make -j96 (64K): 6841.71 (409.04) 6437.99 (435.55) The performance is unchanged, slightly better in some cases. [1] https://lore.kernel.org/linux-mm/20250113175732.48099-1-ryncsn@gmail.com/ This patch (of 7): Swap allocator will do swap cache reclaim to recycle HAS_CACHE slots for allocation. It initiates the reclaim from the offset to be reclaimed and looks up the corresponding folio. The lookup process is lockless, so it's possible the folio will be removed from the swap cache and given a different swap entry before the reclaim locks the folio. If it happens, the reclaim will end up reclaiming an irrelevant folio, and return wrong return value. This shouldn't cause any problem with correctness or stability, but it is indeed confusing and unexpected, and will increase fragmentation, decrease performance. Fix this by checking whether the folio is still pointing to the offset the allocator want to reclaim before reclaiming it. 
Link: https://lkml.kernel.org/r/20250313165935.63303-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20250313165935.63303-2-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:42 -07:00
Jane Chu	442b1eca22	mm: make page_mapped_in_vma() hugetlb walk aware When a process consumes a UE in a page, the memory failure handler attempts to collect information for a potential SIGBUS. If the page is an anonymous page, page_mapped_in_vma(page, vma) is invoked in order to 1. retrieve the vaddr from the process' address space, 2. verify that the vaddr is indeed mapped to the poisoned page, where 'page' is the precise small page with UE. It's been observed that when injecting poison to a non-head subpage of an anonymous hugetlb page, no SIGBUS shows up, while injecting to the head page produces a SIGBUS. The cause is that, though hugetlb_walk() returns a valid pmd entry (on x86), but check_pte() detects mismatch between the head page per the pmd and the input subpage. Thus the vaddr is considered not mapped to the subpage and the process is not collected for SIGBUS purpose. This is the calling stack: collect_procs_anon page_mapped_in_vma page_vma_mapped_walk hugetlb_walk huge_pte_lock check_pte check_pte() header says that it "check if [pvmw->pfn, @pvmw->pfn + @pvmw->nr_pages) is mapped at the @pvmw->pte" but practically works only if pvmw->pfn is the head page pfn at pvmw->pte. Hindsight acknowledging that some pvmw->pte could point to a hugepage of some sort such that it makes sense to make check_pte() work for hugepage. Link: https://lkml.kernel.org/r/20250224211445.2663312-1-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:42 -07:00
Johannes Weiner	a4138a2702	mm: page_alloc: group fallback functions together The way the fallback rules are spread out makes them hard to follow. Move the functions next to each other at least. Link: https://lkml.kernel.org/r/20250225001023.1494422-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:42 -07:00
Johannes Weiner	020396a581	mm: page_alloc: remove remnants of unlocked migratetype updates The freelist hygiene patches made migratetype accesses fully protected under the zone->lock. Remove remnants of handling the race conditions that existed before from the MIGRATE_HIGHATOMIC code. Link: https://lkml.kernel.org/r/20250225001023.1494422-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:41 -07:00
Johannes Weiner	c2f6ea38fc	mm: page_alloc: don't steal single pages from biggest buddy The fallback code searches for the biggest buddy first in an attempt to steal the whole block and encourage type grouping down the line. The approach used to be this: - Non-movable requests will split the largest buddy and steal the remainder. This splits up contiguity, but it allows subsequent requests of this type to fall back into adjacent space. - Movable requests go and look for the smallest buddy instead. The thinking is that movable requests can be compacted, so grouping is less important than retaining contiguity. `c0cd6f557b` ("mm: page_alloc: fix freelist movement during block conversion") enforces freelist type hygiene, which restricts stealing to either claiming the whole block or just taking the requested chunk; no additional pages or buddy remainders can be stolen any more. The patch mishandled when to switch to finding the smallest buddy in that new reality. As a result, it may steal the exact request size, but from the biggest buddy. This causes fracturing for no good reason. Fix this by committing to the new behavior: either steal the whole block, or fall back to the smallest buddy. Remove single-page stealing from steal_suitable_fallback(). Rename it to try_to_steal_block() to make the intentions clear. If this fails, always fall back to the smallest buddy. The following is from 4 runs of mmtest's thpchallenge. "Pollute" is single page fallback, "steal" is conversion of a partially used block. The numbers for free block conversions (omitted) are comparable. 
vanilla patched @pollute[unmovable from reclaimable]: 27 106 @pollute[unmovable from movable]: 82 46 @pollute[reclaimable from unmovable]: 256 83 @pollute[reclaimable from movable]: 46 8 @pollute[movable from unmovable]: 4841 868 @pollute[movable from reclaimable]: 5278 12568 @steal[unmovable from reclaimable]: 11 12 @steal[unmovable from movable]: 113 49 @steal[reclaimable from unmovable]: 19 34 @steal[reclaimable from movable]: 47 21 @steal[movable from unmovable]: 250 183 @steal[movable from reclaimable]: 81 93 The allocator appears to do a better job at keeping stealing and polluting to the first fallback preference. As a result, the numbers for "from movable" - the least preferred fallback option, and most detrimental to compactability - are down across the board. Link: https://lkml.kernel.org/r/20250225001023.1494422-2-hannes@cmpxchg.org Fixes: `c0cd6f557b` ("mm: page_alloc: fix freelist movement during block conversion") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:41 -07:00
Lorenzo Stoakes	f3b92176f4	tools/selftests: add guard region test for /proc/$pid/pagemap Add a test to the guard region self tests to assert that the /proc/$pid/pagemap information now made availabile to the user correctly identifies and reports guard regions. As a part of this change, update vm_util.h to add the new bit (note there is no header file in the kernel where this is exposed, the user is expected to provide their own mask) and utilise the helper functions there for pagemap functionality. [lorenzo.stoakes@oracle.com: fixup define name] Link: https://lkml.kernel.org/r/32e83941-e6f5-42ee-9292-a44c16463cf1@lucifer.local Link: https://lkml.kernel.org/r/164feb0a43ae72650e6b20c3910213f469566311.1740139449.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:41 -07:00
Lorenzo Stoakes	8e2f2aeb8b	fs/proc/task_mmu: add guard region bit to pagemap Patch series "fs/proc/task_mmu: add guard region bit to pagemap". Currently there is no means of determining whether a given page in a mapping range is designated a guard region (as installed via madvise() using the MADV_GUARD_INSTALL flag). This is generally not an issue, but in some instances users may wish to determine whether this is the case. This series adds this ability via /proc/$pid/pagemap, updates the documentation and adds a self test to assert that this functions correctly. This patch (of 2): Currently there is no means by which users can determine whether a given page in memory is in fact a guard region, that is having had the MADV_GUARD_INSTALL madvise() flag applied to it. This is intentional, as to provide this information in VMA metadata would contradict the intent of the feature (providing a means to change fault behaviour at a page table level rather than a VMA level), and would require VMA metadata operations to scan page tables, which is unacceptable. In many cases, users have no need to reflect and determine what regions have been designated guard regions, as it is the user who has established them in the first place. But in some instances, such as monitoring software, or software that relies upon being able to ascertain the nature of mappings within a remote process for instance, it becomes useful to be able to determine which pages have the guard region marker applied. This patch makes use of an unused pagemap bit (58) to provide this information. This patch updates the documentation at the same time as making the change such that the implementation of the feature and the documentation of it are tied together. 
Link: https://lkml.kernel.org/r/cover.1740139449.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/521d99c08b975fb06a1e7201e971cc24d68196d1.1740139449.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:41 -07:00
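As a rough illustration of how a userspace tool might consume the new bit: /proc/$pid/pagemap holds one 64-bit entry per virtual page, and the commit above assigns bit 58 to guard regions. The commit notes that users are expected to supply their own mask, so the macro name PM_GUARD_REGION_SKETCH and both helper names below are assumptions made for this sketch, not kernel-provided definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical local mask: bit 58 of a pagemap entry, per the commit text. */
#define PM_GUARD_REGION_SKETCH (UINT64_C(1) << 58)

/* Byte offset of the 64-bit pagemap entry that describes vaddr. */
static off_t pagemap_offset(uintptr_t vaddr, long page_size)
{
	assert(page_size > 0);
	return (off_t)(vaddr / (uintptr_t)page_size) * (off_t)sizeof(uint64_t);
}

/* True if the pagemap entry has the guard region bit set. */
static bool pagemap_entry_is_guard(uint64_t entry)
{
	return (entry & PM_GUARD_REGION_SKETCH) != 0;
}
```

A caller would typically pread() eight bytes from /proc/$pid/pagemap at pagemap_offset(addr, sysconf(_SC_PAGESIZE)) and pass the result to pagemap_entry_is_guard().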
Kemeng Shi	0a8a5b6c41	mm: swap: remove stale comment of swap_reclaim_full_clusters() swap_reclaim_full_clusters() has no return value now, just remove the stale comment which says swap_reclaim_full_clusters() will return a bool value. Link: https://lkml.kernel.org/r/20250222160850.505274-7-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:40 -07:00
Kemeng Shi	2310f08942	mm, swap: correct comment in swap_usage_sub() We will add si back to plist in swap_usage_sub(), just correct the wrong comment which says we will remove si from plist in swap_usage_sub(). Link: https://lkml.kernel.org/r/20250222160850.505274-6-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:40 -07:00
Kemeng Shi	43e9bbc3bb	mm, swap: remove setting SWAP_MAP_BAD for discard cluster Before alloc from a cluster, we will acquire cluster's lock and make sure it is usable by cluster_is_usable(), so there is no need to set SWAP_MAP_BAD for cluster to be discarded. Link: https://lkml.kernel.org/r/20250222160850.505274-5-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:40 -07:00
Brendan Jackman	1ddae9d67e	selftests/mm/mlock: print error on failure It's not really possible to start diagnosing this without knowing the actual error. Also update the mlock2 helper to behave like libc would by setting errno and returning -1. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-12-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:40 -07:00
Brendan Jackman	5d2146a335	selftests/mm: skip mlock tests if nobody user can't read it If running from a directory that can't be read by unprivileged users, executing on-fault-test via the nobody user will fail. The kselftest build does give the file the correct permissions, but after being installed it might be in a directory without global execute permissions. Since the script can't safely fix that, just skip if it happens. Note that the stderr of the `ls` command is unfiltered meaning the user sees a "permission denied" error that can help inform them why the test was skipped. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-11-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:40 -07:00
Brendan Jackman	f896c6de83	selftests/mm: ensure uffd-wp-mremap gets pages of each size This test allocates a page of every available size and doesn't have any SKIP logic if the allocation fails. So, ensure it's available and skip the test if we can't do so. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-10-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:39 -07:00
Brendan Jackman	e9269b2cc4	selftests/mm: drop unnecessary sudo usage This script must be run as root anyway (see all the writing to privileged files in /proc etc). Remove the unnecessary use of sudo to avoid breaking on single-user systems that don't have sudo. This also avoids confusing readers. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-9-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:39 -07:00
Brendan Jackman	32b42970e8	selftests/mm: skip gup_longterm tests on weird filesystems Some filesystems don't support ftruncate()ing unlinked files. They return ENOENT. In that case, skip the test. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-8-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:39 -07:00
Brendan Jackman	571a4b62ed	selftests/mm: skip map_populate on weird filesystems It seems that 9pfs does not allow truncating unlinked files, Mark Brown has noted that NFS may also behave this way. It doesn't seem quite right to call this a "bug" but it's probably a special enough case that it makes sense for the test to just SKIP if it happens. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-7-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:39 -07:00
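The skip logic described in the two commits above (ftruncate() on an unlinked file failing with ENOENT on some filesystems) could be sketched roughly as follows. The enum and function names are hypothetical and are not taken from the selftests; only the ENOENT behaviour comes from the commit text.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcome type for this sketch. */
enum test_outcome { TEST_CONTINUE, TEST_SKIP, TEST_FAIL };

/*
 * ret is the return value of ftruncate() and err is the errno value it
 * left behind. ENOENT is treated as "filesystem cannot do this" and
 * leads to a skip; any other error is a genuine failure.
 */
static enum test_outcome classify_ftruncate(int ret, int err)
{
	assert(ret == 0 || ret == -1);
	if (ret == 0)
		return TEST_CONTINUE;
	return err == ENOENT ? TEST_SKIP : TEST_FAIL;
}
```

A test would call ftruncate(fd, size), capture errno immediately, and report SKIP rather than FAIL when classify_ftruncate() returns TEST_SKIP.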
Brendan Jackman	bf6d575e24	selftests/mm: don't fail uffd-stress if too many CPUs This calculation divides a fixed parameter by an environment-dependent parameter i.e. the number of CPUs. The simple way to avoid machine-specific failures here is to just put a cap on the max value of the latter. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-6-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Suggested-by: Mateusz Guzik <mjguzik@gmail.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:38 -07:00
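To illustrate why capping the divisor helps: when a fixed budget is divided by the CPU count, a large enough machine drives the per-worker share to zero. The sketch below is a simplified model of that arithmetic. The cap value MAX_PARALLEL_SKETCH and both function names are assumptions, since the commit message does not state the limit or the helpers actually used.

```c
#include <assert.h>

/* Hypothetical cap; the commit message does not state the real value. */
#define MAX_PARALLEL_SKETCH 32

/* Number of workers to run: the CPU count, bounded by the cap. */
static unsigned long bounded_parallelism(unsigned long nr_cpus)
{
	assert(nr_cpus > 0);
	return nr_cpus < MAX_PARALLEL_SKETCH ? nr_cpus : MAX_PARALLEL_SKETCH;
}

/* Pages each worker gets when a fixed page budget is split evenly. */
static unsigned long pages_per_worker(unsigned long total_pages,
				      unsigned long nr_cpus)
{
	return total_pages / bounded_parallelism(nr_cpus);
}
```

With a budget of 1024 pages, an uncapped division on a 4096-CPU machine would yield zero pages per worker, whereas the bounded version still yields a usable share.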
Brendan Jackman	db0f1c138f	selftests/mm: print some details when uffd-stress gets bad params So this can be debugged more easily. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-5-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:38 -07:00
Brendan Jackman	f3b5535abc	selftests/mm/uffd: rename nr_cpus -> nr_parallel A later commit will bound this variable so it no longer necessarily matches the number of CPUs. Rename it appropriately. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-4-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:38 -07:00
Brendan Jackman	f4b3e6c7f1	selftests/mm: skip uffd-wp-mremap if userfaultfd not available It's obvious that this should fail in that case, but still, save the reader the effort of figuring out that they've run into this by just SKIPping Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-3-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:38 -07:00
Brendan Jackman	0046dbed80	selftests/mm: skip uffd-stress if userfaultfd not available It's pretty obvious that the test wouldn't work if you don't have the feature enabled. But, it's still useful to SKIP instead of failing so the reader can immediately tell that this is the reason why. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-2-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:37 -07:00
Brendan Jackman	800ddf3cd7	selftests/mm: report errno when things fail in gup_longterm Patch series "selftests/mm: Some cleanups from trying to run them", v4. I never had much luck running mm selftests so I spent a few hours digging into why. Looks like most of the reason is missing SKIP checks, so this series is just adding a bunch of those that I found. I did not do anything like all of them, just the ones I spotted in gup_longterm, gup_test, mmap, userfaultfd and memfd_secret. It's a bit unfortunate to have to skip those tests when ftruncate() fails, but I don't have time to dig deep enough into it to actually make them pass. I have observed the issue on 9pfs and heard rumours that NFS has a similar problem. I'm now able to run these test groups successfully: - mmap - gup_test - compaction - migration - page_frag - userfaultfd - mlock I've never gone past "Waiting for hugetlb memory to get depleted", in the hugetlb tests. I don't know if they are stuck or if they would eventually work if I was patient enough (testing on a 1G machine). I have not investigated further. I had some issues with mlock tests failing due to -ENOSRCH from mlock2(), I can no longer reproduce that though, things work OK now. Of the remaining tests there may be others that work fine, but there's no convenient way to survey the whole output of run_vmtests.sh so I'm just going test by test here. In my spare moments I am slowly chipping away at a setup to run these tests continuously in a reasonably hermetic QEMU environment via virtme-ng: `5fad4b9c59/README.md` Hopefully that will eventually offer a way to provide a "canned" environment where the tests are known to work, which can be fairly easily reproduced by any developer. This patch (of 12): Just reporting failure doesn't tell you what went wrong. This can fail in different ways so report errno to help the reader get started debugging. 
Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-0-dec210a658f5@google.com Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-1-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:37 -07:00
Sergey Senozhatsky	2ad951865a	zram: add might_sleep to zcomp API Explicitly state that zcomp compress/decompress must be called from non-atomic context. Link: https://lkml.kernel.org/r/20250303022425.285971-20-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:37 -07:00
Sergey Senozhatsky	a6d2193b3e	zram: do not leak page on writeback_store error path Ensure the page used for local object data is freed on error out path. Link: https://lkml.kernel.org/r/20250303022425.285971-19-senozhatsky@chromium.org Fixes: `330edc2bc0` (zram: rework writeback target selection strategy) Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:37 -07:00
Sergey Senozhatsky	5b683d4e98	zram: do not leak page on recompress_store error path Ensure the page used for local object data is freed on error out path. Link: https://lkml.kernel.org/r/20250303022425.285971-18-senozhatsky@chromium.org Fixes: `3f909a60ce` ("zram: rework recompress target selection strategy") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:36 -07:00
Sergey Senozhatsky	f66140eb71	zram: permit reclaim in zstd custom allocator When configured with pre-trained compression/decompression dictionary support, zstd requires custom memory allocator, which it calls internally from compression()/decompression() routines. That means allocation from atomic context (either under entry spin-lock, or per-CPU local-lock or both). Now, with non-atomic zram read()/write(), those limitations are relaxed and we can allow direct and indirect reclaim. Link: https://lkml.kernel.org/r/20250303022425.285971-17-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:36 -07:00
Sergey Senozhatsky	82f91900c7	zram: switch to new zsmalloc object mapping API Use new read/write zsmalloc object API. For cases when RO mapped object spans two physical pages (requires temp buffer) compression streams now carry around one extra physical page. Link: https://lkml.kernel.org/r/20250303022425.285971-16-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:36 -07:00
Sergey Senozhatsky	44f7641349	zsmalloc: introduce new object mapping API Current object mapping API is a little cumbersome. First, it's inconsistent, sometimes it returns with page-faults disabled and sometimes with page-faults enabled. Second, and most importantly, it enforces atomicity restrictions on its users. zs_map_object() has to return a linear object address which is not always possible because some objects span multiple physical (non-contiguous) pages. For such objects zsmalloc uses a per-CPU buffer to which object's data is copied before a pointer to that per-CPU buffer is returned back to the caller. This leads to another, final, issue - extra memcpy(). Since the caller gets a pointer to per-CPU buffer it can memcpy() data only to that buffer, and during zs_unmap_object() zsmalloc will memcpy() from that per-CPU buffer to physical pages that object in question spans across. New API splits functions by access mode:
- zs_obj_read_begin(handle, local_copy) Returns a pointer to handle memory. For objects that span two physical pages a local_copy buffer is used to store object's data before the address is returned to the caller. Otherwise the object's page is kmap_local mapped directly.
- zs_obj_read_end(handle, buf) Unmaps the page if it was kmap_local mapped by zs_obj_read_begin().
- zs_obj_write(handle, buf, len) Copies len-bytes from compression buffer to handle memory (takes care of objects that span two pages). This does not need any additional (e.g. per-CPU) buffers and writes the data directly to zsmalloc pool pages. 
In terms of performance, on a synthetic and completely reproducible test that allocates fixed number of objects of fixed sizes and iterates over those objects, first mapping in RO then in RW mode:
OLD API
=======
3 first results out of 10
369,205,778 instructions # 0.80 insn per cycle
 40,467,926 branches # 113.732 M/sec
369,002,122 instructions # 0.62 insn per cycle
 40,426,145 branches # 189.361 M/sec
369,036,706 instructions # 0.63 insn per cycle
 40,430,860 branches # 204.105 M/sec
[..]
NEW API
=======
3 first results out of 10
265,799,293 instructions # 0.51 insn per cycle
 29,834,567 branches # 170.281 M/sec
265,765,970 instructions # 0.55 insn per cycle
 29,829,019 branches # 161.602 M/sec
265,764,702 instructions # 0.51 insn per cycle
 29,828,015 branches # 189.677 M/sec
[..]
T-test on all 10 runs
=====================
Difference at 95.0% confidence
-1.03219e+08 +/- 55308.7
-27.9705% +/- 0.0149878%
(Student's t, pooled s = 58864.4)
The old API will stay around until the remaining users switch to the new one. After that we'll also remove zsmalloc per-CPU buffer and CPU hotplug handling. The split of map(RO) and map(WO) into read_{begin/end}/write is suggested by Yosry Ahmed. Link: https://lkml.kernel.org/r/20250303022425.285971-15-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:36 -07:00
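The read/write split described in this commit can be modelled in userspace to show why the write path needs no bounce buffer. What follows is a toy model, not the zsmalloc implementation: struct sketch_obj, SKETCH_PAGE_SIZE and the sketch_* functions are all hypothetical, plain arrays stand in for physical pages, and there is no equivalent of kmap_local mapping or zs_obj_read_end().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u

/* Toy stand-in for an object stored in up to two non-contiguous pages. */
struct sketch_obj {
	unsigned char *page[2];	/* page[1] is used only if the object spans */
	size_t off;		/* offset of the object within page[0] */
	size_t len;		/* object length in bytes */
};

static int sketch_spans(const struct sketch_obj *o)
{
	return o->off + o->len > SKETCH_PAGE_SIZE;
}

/* Read: direct pointer when possible, otherwise assemble into local_copy. */
static const void *sketch_read_begin(const struct sketch_obj *o,
				     void *local_copy)
{
	size_t first;

	assert(o->off < SKETCH_PAGE_SIZE && o->len <= SKETCH_PAGE_SIZE);
	if (!sketch_spans(o))
		return o->page[0] + o->off;
	first = SKETCH_PAGE_SIZE - o->off;
	memcpy(local_copy, o->page[0] + o->off, first);
	memcpy((unsigned char *)local_copy + first, o->page[1], o->len - first);
	return local_copy;
}

/* Write: copy straight into the backing pages, with no bounce buffer. */
static void sketch_write(struct sketch_obj *o, const void *buf, size_t len)
{
	size_t first = SKETCH_PAGE_SIZE - o->off;

	assert(o->off < SKETCH_PAGE_SIZE && o->len <= SKETCH_PAGE_SIZE);
	assert(len <= o->len);
	if (len <= first) {
		memcpy(o->page[0] + o->off, buf, len);
		return;
	}
	memcpy(o->page[0] + o->off, buf, first);
	memcpy(o->page[1], (const unsigned char *)buf + first, len - first);
}
```

In this model only a reader of a spanning object pays for a copy into a caller-supplied buffer, and writers never copy twice, which matches the motivation given in the commit message.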
Sergey Senozhatsky	e27af3f936	zsmalloc: sleepable zspage reader-lock In order to implement preemptible object mapping we need a zspage lock that satisfies several preconditions: - it should be reader-write type of a lock - it should be possible to hold it from any context, but also being preemptible if the context allows it - we never sleep while acquiring but can sleep while holding in read mode An rwsemaphore doesn't suffice, due to atomicity requirements, rwlock doesn't satisfy due to reader-preemptability requirement. It's also worth mentioning that per-zspage rwsem is a little too memory heavy (we can easily have double digits megabytes used only on rwsemaphores). Switch over from rwlock_t to an atomic_t-based implementation of a reader-writer semaphore that satisfies all of the preconditions. The spin-lock based zspage_lock is suggested by Hillf Danton. Link: https://lkml.kernel.org/r/20250303022425.285971-14-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:06:35 -07:00
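As a userspace illustration of how a single atomic counter can encode reader/writer state without ever sleeping during acquisition, consider the C11 sketch below. It is not the kernel's zspage lock: the type and function names are hypothetical, the state encoding is an assumption, and making the writer side try-only is a simplification chosen for this example rather than something the commit message states.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_UNLOCKED 0
#define SKETCH_WRLOCKED (-1)

struct sketch_rwlock {
	atomic_int state;	/* >0: reader count, 0: free, -1: writer */
};

/* Readers spin (they never sleep) until no writer holds the lock. */
static void sketch_read_lock(struct sketch_rwlock *l)
{
	int old = atomic_load_explicit(&l->state, memory_order_relaxed);

	for (;;) {
		if (old == SKETCH_WRLOCKED) {
			old = atomic_load_explicit(&l->state,
						   memory_order_relaxed);
			continue;
		}
		if (atomic_compare_exchange_weak_explicit(&l->state, &old,
				old + 1, memory_order_acquire,
				memory_order_relaxed))
			return;
	}
}

static void sketch_read_unlock(struct sketch_rwlock *l)
{
	int prev = atomic_fetch_sub_explicit(&l->state, 1,
					     memory_order_release);

	assert(prev > 0);
	(void)prev;
}

/* Writers only try: success requires no readers and no other writer. */
static bool sketch_write_trylock(struct sketch_rwlock *l)
{
	int expected = SKETCH_UNLOCKED;

	return atomic_compare_exchange_strong_explicit(&l->state, &expected,
			SKETCH_WRLOCKED, memory_order_acquire,
			memory_order_relaxed);
}

static void sketch_write_unlock(struct sketch_rwlock *l)
{
	atomic_store_explicit(&l->state, SKETCH_UNLOCKED,
			      memory_order_release);
}
```

Because a reader only bumps a counter, nothing prevents it from being preempted or sleeping while the lock is held in read mode, while acquisition itself is a bounded spin or a single compare-and-swap.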

1337311 Commits