linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-03 17:04:50 -04:00

Author	SHA1	Message	Date
Mark Brown	df7e1286b1	mm: care about shadow stack guard gap when getting an unmapped area As covered in the commit log for `c44357c2e7` ("x86/mm: care about shadow stack guard gap during placement") our current mmap() implementation does not take care to ensure that a new mapping isn't placed with existing mappings inside it's own guard gaps. This is particularly important for shadow stacks since if two shadow stacks end up getting placed adjacent to each other then they can overflow into each other which weakens the protection offered by the feature. On x86 there is a custom arch_get_unmapped_area() which was updated by the above commit to cover this case by specifying a start_gap for allocations with VM_SHADOW_STACK. Both arm64 and RISC-V have equivalent features and use the generic implementation of arch_get_unmapped_area() so let's make the equivalent change there so they also don't get shadow stack pages placed without guard pages. x86 uses a single page guard, this is also sufficient for arm64 where we either do single word pops and pushes or unconstrained writes. Architectures which do not have this feature will define VM_SHADOW_STACK to VM_NONE and hence be unaffected. Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-3-a46b8b6dc0ed@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Chris Zankel <chris@zankel.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N Rao <naveen@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:13 -07:00
Mark Brown	540e00a729	mm: pass vm_flags to generic_get_unmapped_area() In preparation for using vm_flags to ensure guard pages for shadow stacks supply them as an argument to generic_get_unmapped_area(). The only user outside of the core code is the PowerPC book3s64 implementation which is trivially wrapping the generic implementation in the radix_enabled() case. No functional changes. Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-2-a46b8b6dc0ed@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Chris Zankel <chris@zankel.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Naveen N Rao <naveen@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:13 -07:00
Mark Brown	25d4054cc9	mm: make arch_get_unmapped_area() take vm_flags by default Patch series "mm: Care about shadow stack guard gap when getting an unmapped area", v2. As covered in the commit log for `c44357c2e7` ("x86/mm: care about shadow stack guard gap during placement") our current mmap() implementation does not take care to ensure that a new mapping isn't placed with existing mappings inside it's own guard gaps. This is particularly important for shadow stacks since if two shadow stacks end up getting placed adjacent to each other then they can overflow into each other which weakens the protection offered by the feature. On x86 there is a custom arch_get_unmapped_area() which was updated by the above commit to cover this case by specifying a start_gap for allocations with VM_SHADOW_STACK. Both arm64 and RISC-V have equivalent features and use the generic implementation of arch_get_unmapped_area() so let's make the equivalent change there so they also don't get shadow stack pages placed without guard pages. The arm64 and RISC-V shadow stack implementations are currently on the list: https://lore.kernel.org/r/20240829-arm64-gcs-v12-0-42fec94743 https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/ Given the addition of the use of vm_flags in the generic implementation we also simplify the set of possibilities that have to be dealt with in the core code by making arch_get_unmapped_area() take vm_flags as standard. This is a bit invasive since the prototype change touches quite a few architectures but since the parameter is ignored the change is straightforward, the simplification for the generic code seems worth it. This patch (of 3): When we introduced arch_get_unmapped_area_vmflags() in `961148704a` ("mm: introduce arch_get_unmapped_area_vmflags()") we did so as part of properly supporting guard pages for shadow stacks on x86_64, which uses a custom arch_get_unmapped_area(). Equivalent features are also present on both arm64 and RISC-V, both of which use the generic implementation of arch_get_unmapped_area() and will require equivalent modification there. Rather than continue to deal with having two versions of the functions let's bite the bullet and have all implementations of arch_get_unmapped_area() take vm_flags as a parameter. The new parameter is currently ignored by all implementations other than x86. The only caller that doesn't have a vm_flags available is mm_get_unmapped_area(), as for the x86 implementation and the wrapper used on other architectures this is modified to supply no flags. No functional changes. Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-0-a46b8b6dc0ed@kernel.org Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-1-a46b8b6dc0ed@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Helge Deller <deller@gmx.de> [parisc] Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Chris Zankel <chris@zankel.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N Rao <naveen@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:13 -07:00
SeongJae Park	f0679f9e6d	mm/damon/tests/vaddr-kunit: init maple tree without MT_FLAGS_LOCK_EXTERN damon_test_three_regions_in_vmas() initializes a maple tree with MM_MT_FLAGS. The flags contains MT_FLAGS_LOCK_EXTERN, which means mt_lock of the maple tree will not be used. And therefore the maple tree initialization code skips initialization of the mt_lock. However, __link_vmas(), which adds vmas for test to the maple tree, uses the mt_lock. In other words, the uninitialized spinlock is used. The problem becomes clear when spinlock debugging is turned on, since it reports spinlock bad magic bug. Fix the issue by excluding MT_FLAGS_LOCK_EXTERN from the maple tree initialization flags. Note that we don't use empty flags to make it further similar to the usage of mm maple tree, and to be prepared for possible future changes, as suggested by Liam. Link: https://lkml.kernel.org/r/20240904172931.1284-1-sj@kernel.org Fixes: `d0cf3dd47f` ("damon: convert __damon_va_three_regions to use the VMA iterator") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/1453b2b2-6119-4082-ad9e-f3c5239bf87e@roeck-us.net Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:13 -07:00
Sergey Senozhatsky	5ad7a998ba	mm: Kconfig: fixup zsmalloc configuration zsmalloc is not exclusive to zswap. Commit `b3fbd58fcb` ("mm: Kconfig: simplify zswap configuration") made CONFIG_ZSMALLOC only visible when CONFIG_ZSWAP is selected, which makes it impossible to menuconfig zsmalloc-specific features (stats, chain-size, etc.) on systems that use ZRAM but don't have ZSWAP enabled. Make zsmalloc depend on both ZRAM and ZSWAP. Link: https://lkml.kernel.org/r/20240903040143.1580705-1-senozhatsky@chromium.org Fixes: `b3fbd58fcb` ("mm: Kconfig: simplify zswap configuration") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:12 -07:00
Takaya Saeki	fc1b43c422	filemap: fix the last_index of mm_filemap_get_pages In commit `b6273b55d8` ("filemap: add trace events for get_pages, map_pages, and fault"), mm_filemap_get_pages was added to trace page cache access. However, it tracks an extra page beyond the end of the accessed range. This patch fixes it by replacing last_index with last_index - 1. Link: https://lkml.kernel.org/r/20240903102100.70405-1-takayas@chromium.org Fixes: `b6273b55d8` ("filemap: add trace events for get_pages, map_pages, and fault") Signed-off-by: Takaya Saeki <takayas@chromium.org> Cc: Junichi Uekawa <uekawa@chromium.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:12 -07:00
Rik van Riel	e1e4cfd01a	mm,tmpfs: consider end of file write in shmem_is_huge Take the end of a file write into consideration when deciding whether or not to use huge pages for tmpfs files when the tmpfs filesystem is mounted with huge=within_size This allows large writes that append to the end of a file to automatically use large pages. Doing 4MB sequential writes without fallocate to a 16GB tmpfs file with fio. The numbers without THP or with huge=always stay the same, but the performance with huge=within_size now matches that of huge=always. huge before after 4kB pages 1560 MB/s 1560 MB/s within_size 1560 MB/s 4720 MB/s always: 4720 MB/s 4720 MB/s [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20240903111928.7171e60c@imladris.surriel.com Signed-off-by: Rik van Riel <riel@surriel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:12 -07:00
Sergey Senozhatsky	e899007a5e	zram: support priority parameter in recompression recompress device attribute supports alg=NAME parameter so that we can specify only one particular algorithm we want to perform recompression with. However, with algo params we now can have several exactly same secondary algorithms but each with its own params tuning (e.g. priority 1 configured to use more aggressive level, and priority 2 configured to use a pre-trained dictionary). Support priority=NUM parameter so that we can correctly determine which secondary algorithm we want to use. Link: https://lkml.kernel.org/r/20240902105656.1383858-25-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:12 -07:00
Sergey Senozhatsky	97ee4842f2	Documentation/zram: add documentation for algorithm parameters Document brief description of compression algorithms' parameters: compression level and pre-trained dictionary. [senozhatsky@chromium.org: trivial fixup] Link: https://lkml.kernel.org/r/20240903063722.1603592-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20240902105656.1383858-24-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:11 -07:00
Sergey Senozhatsky	6a559ecd6e	zram: add dictionary support to zstd backend This adds support for pre-trained zstd dictionaries [1] Dictionary is setup in params once (per-comp) and loaded to Cctx and Dctx by reference, so we don't allocate extra memory. TEST ==== * zstd /sys/block/zram0/mm_stat 1750654976 504565092 514203648 0 514203648 1 0 34204 34204 * zstd dict=/etc/zstd-dict-amd64 /sys/block/zram0/mm_stat 1750638592 465851259 475373568 0 475373568 1 0 34185 34185 *** zstd level=8 dict=/etc/zstd-dict-amd64 /sys/block/zram0/mm_stat 1750642688 430765171 439955456 0 439955456 1 0 34185 34185 [1] https://github.com/facebook/zstd/blob/dev/programs/zstd.1.md#dictionary-builder Link: https://lkml.kernel.org/r/20240902105656.1383858-23-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:11 -07:00
Sergey Senozhatsky	1e673c8cf7	zram: add dictionary support to lz4hc Support pre-trained dictionary param. Just like lz4, lz4hc doesn't mandate specific format of the dictionary and zstd --train can be used to train a dictionary for lz4, according to [1]. TEST ==== * lz4hc /sys/block/zram0/mm_stat 1750638592 608954620 621031424 0 621031424 1 0 34288 34288 * lz4hc dict=/etc/lz4-dict-amd64 /sys/block/zram0/mm_stat 1750671360 505068582 514994176 0 514994176 1 0 34278 34278 [1] https://github.com/lz4/lz4/issues/557 Link: https://lkml.kernel.org/r/20240902105656.1383858-22-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:11 -07:00
Sergey Senozhatsky	fb4f644ee8	zram: add dictionary support to lz4 Support pre-trained dictionary param. lz4 doesn't mandate specific format of the dictionary and even zstd --train can be used to train a dictionary for lz4, according to [1]. TEST ==== * lz4 /sys/block/zram0/mm_stat 1750654976 664188565 676864000 0 676864000 1 0 34288 34288 * lz4 dict=/etc/lz4-dict-amd64 /sys/block/zram0/mm_stat 1750638592 619891141 632053760 0 632053760 1 0 34278 34278 *** lz4 level=5 dict=/etc/lz4-dict-amd64 /sys/block/zram0/mm_stat 1750638592 727174243 740810752 0 740810752 1 0 34437 34437 [1] https://github.com/lz4/lz4/issues/557 Link: https://lkml.kernel.org/r/20240902105656.1383858-21-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:11 -07:00
Sergey Senozhatsky	b8f03cb703	zram: move immutable comp params away from per-CPU context Immutable params never change once comp has been allocated and setup, so we don't need to store multiple copies of them in each per-CPU backend context. Move those to per-comp zcomp_params and pass it to backends callbacks for requests execution. Basically, this means parameters sharing between different contexts. Also introduce two new backends callbacks: setup_params() and release_params(). First, we need to validate params in a driver-specific way; second, driver may want to allocate its specific representation of the params which is needed to execute requests. Link: https://lkml.kernel.org/r/20240902105656.1383858-20-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:10 -07:00
Sergey Senozhatsky	6a81bdfeb3	zram: introduce zcomp_ctx structure Keep run-time driver data (scratch buffers, etc.) in zcomp_ctx structure. This structure is allocated per-CPU because drivers (backends) need to modify its content during requests execution. We will split mutable and immutable driver data, this is a preparation path. Link: https://lkml.kernel.org/r/20240902105656.1383858-19-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:10 -07:00
Sergey Senozhatsky	52c7b4e2ba	zram: introduce zcomp_req structure Encapsulate compression/decompression data in zcomp_req structure. Link: https://lkml.kernel.org/r/20240902105656.1383858-18-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:10 -07:00
Sergey Senozhatsky	dea77d7aea	zram: add support for dict comp config Handle dict=path algorithm param so that we can read a pre-trained compression algorithm dictionary which we then pass to the backend configuration. Link: https://lkml.kernel.org/r/20240902105656.1383858-17-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:10 -07:00
Sergey Senozhatsky	4eac932103	zram: introduce algorithm_params device attribute This attribute is used to setup compression algorithms' parameters, so we can tweak algorithms' characteristics. At this point only 'level' is supported (to be extended in the future). Each call sets up parameters for one particular algorithm, which should be specified either by the algorithm's priority or algo name. This is expected to be called after corresponding algorithm is selected via comp_algorithm or recomp_algorithm. echo "priority=0 level=1" > /sys/block/zram0/algorithm_params or echo "algo=zstd level=1" > /sys/block/zram0/algorithm_params Link: https://lkml.kernel.org/r/20240902105656.1383858-16-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:09 -07:00
Sergey Senozhatsky	eb826a0190	zram: recalculate zstd compression params once zstd compression params depends on level, but are constant for a given instance of zstd compression backend. Calculate once (during ctx creation). Link: https://lkml.kernel.org/r/20240902105656.1383858-15-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:09 -07:00
Sergey Senozhatsky	f2bac7ad18	zram: introduce zcomp_params structure We will store a per-algorithm parameters there (compression level, dictionary, dictionary size, etc.). Link: https://lkml.kernel.org/r/20240902105656.1383858-14-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:09 -07:00
Sergey Senozhatsky	1a78390d87	zram: check that backends array has at least one backend Make sure that backends array has anything apart from the sentinel NULL value. We also select LZO_BACKEND if none backends were selected. Link: https://lkml.kernel.org/r/20240902105656.1383858-13-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:09 -07:00
Sergey Senozhatsky	1d3100cf14	zram: add 842 compression backend support Add s/w 842 compression support. Link: https://lkml.kernel.org/r/20240902105656.1383858-12-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:08 -07:00
Sergey Senozhatsky	84112e314f	zram: add zlib compression backend support Add s/w zlib (inflate/deflate) compression. Link: https://lkml.kernel.org/r/20240902105656.1383858-11-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:08 -07:00
Sergey Senozhatsky	dbf2763cec	zram: pass estimated src size hint to zstd zram works with PAGE_SIZE buffers, so we always know exact size of the source buffer and hence can pass estimated_src_size to zstd_get_params(). This hint on x86_64, for example, reduces the size of the work memory buffer from 1303520 bytes down to 90080 bytes. Given that compression streams are per-CPU that's quite some memory saving. Link: https://lkml.kernel.org/r/20240902105656.1383858-10-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:08 -07:00
Sergey Senozhatsky	73e7d81abb	zram: add zstd compression backend support Add s/w zstd compression. Link: https://lkml.kernel.org/r/20240902105656.1383858-9-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:08 -07:00
Sergey Senozhatsky	c60a4ef544	zram: add lz4hc compression backend support Add s/w lz4hc compression support. Link: https://lkml.kernel.org/r/20240902105656.1383858-8-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:07 -07:00
Sergey Senozhatsky	22d651c3b3	zram: add lz4 compression backend support Add s/w lz4 compression support. Link: https://lkml.kernel.org/r/20240902105656.1383858-7-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:07 -07:00
Sergey Senozhatsky	2152247c55	zram: add lzo and lzorle compression backends support Add s/w lzo/lzorle compression support. Link: https://lkml.kernel.org/r/20240902105656.1383858-6-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:07 -07:00
Sergey Senozhatsky	917a59e81c	zram: introduce custom comp backends API Moving to custom backends implementation gives us ability to have our own minimalistic and extendable API, and algorithms tunings becomes possible. The list of compression backends is empty at this point, we will add backends in the followup patches. Link: https://lkml.kernel.org/r/20240902105656.1383858-5-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:07 -07:00
Sergey Senozhatsky	f3c11cf5ca	lib: zstd: fix null-deref in ZSTD_createCDict_advanced2() ZSTD_createCDict_advanced2() must ensure that ZSTD_createCDict_advanced_internal() has successfully allocated cdict. customMalloc() may be called under low memory condition and may be unable to allocate workspace for cdict. Link: https://lkml.kernel.org/r/20240902105656.1383858-4-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Nick Terrell <terrelln@fb.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:06 -07:00
Sergey Senozhatsky	7518847430	lib: lz4hc: export LZ4_resetStreamHC symbol This symbol is needed to enable lz4hc dictionary support. Link: https://lkml.kernel.org/r/20240902105656.1383858-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Nick Terrell <terrelln@fb.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:06 -07:00
Sergey Senozhatsky	4fc4187984	lib: zstd: export API needed for dictionary support Patch series "zram: introduce custom comp backends API", v7. This series introduces support for run-time compression algorithms tuning, so users, for instance, can adjust compression/acceleration levels and provide pre-trained compression/decompression dictionaries which certain algorithms support. At this point we stop supporting (old/deprecated) comp API. We may add new acomp API support in the future, but before that zram needs to undergo some major rework (we are not ready for async compression). Some benchmarks for reference (look at column #2) * init zstd /sys/block/zram0/mm_stat 1750659072 504622188 514355200 0 514355200 1 0 34204 34204 * init zstd dict=/home/ss/zstd-dict-amd64 /sys/block/zram0/mm_stat 1750650880 465908890 475398144 0 475398144 1 0 34185 34185 * init zstd level=8 dict=/home/ss/zstd-dict-amd64 /sys/block/zram0/mm_stat 1750654976 430803319 439873536 0 439873536 1 0 34185 34185 * init lz4 /sys/block/zram0/mm_stat 1750646784 664266564 677060608 0 677060608 1 0 34288 34288 * init lz4 dict=/home/ss/lz4-dict-amd64 /sys/block/zram0/mm_stat 1750650880 619990300 632102912 0 632102912 1 0 34278 34278 * init lz4hc /sys/block/zram0/mm_stat 1750630400 609023822 621232128 0 621232128 1 0 34288 34288 * init lz4hc dict=/home/ss/lz4-dict-amd64 /sys/block/zram0/mm_stat 1750659072 505133172 515231744 0 515231744 1 0 34278 34278 Recompress init zram zstd (prio=0), zstd level=5 (prio 1), zstd with dict (prio 2) * zstd /sys/block/zram0/mm_stat 1750982656 504630584 514269184 0 514269184 1 0 34204 34204 * idle recompress priority=1 (zstd level=5) /sys/block/zram0/mm_stat 1750982656 488645601 525438976 0 514269184 1 0 34204 34204 * idle recompress priority=2 (zstd dict) /sys/block/zram0/mm_stat 1750982656 460869640 517914624 0 514269184 1 0 34185 34204 This patch (of 24): We need to export a number of API functions that enable advanced zstd usage - C/D dictionaries, dictionaries sharing between contexts, etc. Link: https://lkml.kernel.org/r/20240902105656.1383858-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20240902105656.1383858-2-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Nick Terrell <terrelln@fb.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:06 -07:00
Wei Yang	6050df6d70	maple_tree: fix comment typo on ma_flag of allocation tree The maple tree flag of allocation tree is MT_FLAGS_ALLOC_RANGE. Just correct it. Link: https://lkml.kernel.org/r/20240809020115.31575-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:06 -07:00
Kent Overstreet	0a6fff20d3	mm: fix folio_alloc_noprof() folio_alloc_noprof) wasn't calling the _noprof version, causing allocations to be accounted here instead of to the caller Link: https://lkml.kernel.org/r/20240901202459.4867-1-kent.overstreet@linux.dev Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:05 -07:00
Wei Yang	96ae4c9019	maple_tree: cleanup function descriptions This patch tries to cleanup some function description: * function name mismatch * parameter name mismatch * parameter all end up with ':' * not prefix '*' if parameter is a pointer There is still some missing description of parameters, I didn't add them since I am not sure the exact meaning. Link: https://lkml.kernel.org/r/20240830220400.2007-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:05 -07:00
Huan Yang	94deaf69dc	mm: page_alloc: simpify page del and expand When page del from buddy and need expand, it will account free_pages in zone's migratetype. The current way is to subtract the page number of the current order when deleting, and then add it back when expanding. This is unnecessary, as when migrating the same type, we can directly record the difference between the high-order pages and the expand added, and then subtract it directly. This patch merge that, only when del and expand done, then account free_pages. Link: https://lkml.kernel.org/r/20240826064048.187790-1-link@vivo.com Signed-off-by: Huan Yang <link@vivo.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:05 -07:00
Dev Jain	536ab838a5	selftests/mm: relax test to fail after 100 migration failures It was recently observed at [1] that during the folio unmapping stage of migration, when the PTEs are cleared, a racing thread faulting on that folio may increase the refcount of the folio, sleep on the folio lock (the migration path has the lock), and migration ultimately fails when asserting the actual refcount against the expected. Thereby, the migration selftest fails on shared-anon mappings. The above enforces the fact that migration is a best-effort service, therefore, it is wrong to fail the test for just a single failure; hence, fail the test after 100 consecutive failures (where 100 is still a subjective choice). Note that, this has no effect on the execution time of the test since that is controlled by a timeout. [1] https://lore.kernel.org/all/20240801081657.1386743-1-dev.jain@arm.com/ Link: https://lkml.kernel.org/r/20240830051609.4037834-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gavin Shan <gshan@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:05 -07:00
Hongbo Li	7ae12a57c5	mm/vmalloc.c: make use of the helper macro LIST_HEAD() list_head can be initialized automatically with LIST_HEAD() instead of calling INIT_LIST_HEAD(). Here we can simplify the code. Link: https://lkml.kernel.org/r/20240828041216.1222582-1-lihongbo22@huawei.com Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:04 -07:00
Usama Arif	81d3ff3c6f	mm: add sysfs entry to disable splitting underused THPs If disabled, THPs faulted in or collapsed will not be added to _deferred_list, and therefore won't be considered for splitting under memory pressure if underused. Link: https://lkml.kernel.org/r/20240830100438.3623486-7-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Cc: Alexander Zhu <alexlzhu@fb.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kairui Song <ryncsn@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Shuang Zhai <zhais@google.com> Cc: Shuang Zhai <szhai2@cs.rochester.edu> Cc: Yu Zhao <yuzhao@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:04 -07:00
Usama Arif	dafff3f4c8	mm: split underused THPs This is an attempt to mitigate the issue of running out of memory when THP is always enabled. During runtime whenever a THP is being faulted in (__do_huge_pmd_anonymous_page) or collapsed by khugepaged (collapse_huge_page), the THP is added to _deferred_list. Whenever memory reclaim happens in linux, the kernel runs the deferred_split shrinker which goes through the _deferred_list. If the folio was partially mapped, the shrinker attempts to split it. If the folio is not partially mapped, the shrinker checks if the THP was underused, i.e. how many of the base 4K pages of the entire THP were zero-filled. If this number goes above a certain threshold (decided by /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none), the shrinker will attempt to split that THP. Then at remap time, the pages that were zero-filled are mapped to the shared zeropage, hence saving memory. Link: https://lkml.kernel.org/r/20240830100438.3623486-6-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Suggested-by: Rik van Riel <riel@surriel.com> Co-authored-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexander Zhu <alexlzhu@fb.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kairui Song <ryncsn@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Shuang Zhai <zhais@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Shuang Zhai <szhai2@cs.rochester.edu> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:04 -07:00
Usama Arif	8422acdc97	mm: introduce a pageflag for partially mapped folios Currently folio->_deferred_list is used to keep track of partially_mapped folios that are going to be split under memory pressure. In the next patch, all THPs that are faulted in and collapsed by khugepaged are also going to be tracked using _deferred_list. This patch introduces a pageflag to be able to distinguish between partially mapped folios and others in the deferred_list at split time in deferred_split_scan. Its needed as __folio_remove_rmap decrements _mapcount, _large_mapcount and _entire_mapcount, hence it won't be possible to distinguish between partially mapped folios and others in deferred_split_scan. Eventhough it introduces an extra flag to track if the folio is partially mapped, there is no functional change intended with this patch and the flag is not useful in this patch itself, it will become useful in the next patch when _deferred_list has non partially mapped folios. Link: https://lkml.kernel.org/r/20240830100438.3623486-5-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Cc: Alexander Zhu <alexlzhu@fb.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kairui Song <ryncsn@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Shuang Zhai <zhais@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Shuang Zhai <szhai2@cs.rochester.edu> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:04 -07:00
Alexander Zhu	391e869711	mm: selftest to verify zero-filled pages are mapped to zeropage When a THP is split, any subpage that is zero-filled will be mapped to the shared zeropage, hence saving memory. Add selftest to verify this by allocating zero-filled THP and comparing RssAnon before and after split. Link: https://lkml.kernel.org/r/20240830100438.3623486-4-usamaarif642@gmail.com Signed-off-by: Alexander Zhu <alexlzhu@fb.com> Signed-off-by: Usama Arif <usamaarif642@gmail.com> Acked-by: Rik van Riel <riel@surriel.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kairui Song <ryncsn@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Shuang Zhai <zhais@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Shuang Zhai <szhai2@cs.rochester.edu> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:03 -07:00
Yu Zhao	b1f202060a	mm: remap unused subpages to shared zeropage when splitting isolated thp Patch series "mm: split underused THPs", v5. The current upstream default policy for THP is always. However, Meta uses madvise in production as the current THP=always policy vastly overprovisions THPs in sparsely accessed memory areas, resulting in excessive memory pressure and premature OOM killing. Using madvise + relying on khugepaged has certain drawbacks over THP=always. Using madvise hints mean THPs aren't "transparent" and require userspace changes. Waiting for khugepaged to scan memory and collapse pages into THP can be slow and unpredictable in terms of performance (i.e. you dont know when the collapse will happen), while production environments require predictable performance. If there is enough memory available, its better for both performance and predictability to have a THP from fault time, i.e. THP=always rather than wait for khugepaged to collapse it, and deal with sparsely populated THPs when the system is running out of memory. This patch series is an attempt to mitigate the issue of running out of memory when THP is always enabled. During runtime whenever a THP is being faulted in or collapsed by khugepaged, the THP is added to a list. Whenever memory reclaim happens, the kernel runs the deferred_split shrinker which goes through the list and checks if the THP was underused, i.e. how many of the base 4K pages of the entire THP were zero-filled. If this number goes above a certain threshold, the shrinker will attempt to split that THP. Then at remap time, the pages that were zero-filled are mapped to the shared zeropage, hence saving memory. This method avoids the downside of wasting memory in areas where THP is sparsely filled when THP is always enabled, while still providing the upside THPs like reduced TLB misses without having to use madvise. Meta production workloads that were CPU bound (>99% CPU utilzation) were tested with THP shrinker. The results after 2 hours are as follows: \| THP=madvise \| THP=always \| THP=always \| \| \| + shrinker series \| \| \| + max_ptes_none=409 ----------------------------------------------------------------------------- Performance improvement \| - \| +1.8% \| +1.7% (over THP=madvise) \| \| \| ----------------------------------------------------------------------------- Memory usage \| 54.6G \| 58.8G (+7.7%) \| 55.9G (+2.4%) ----------------------------------------------------------------------------- max_ptes_none=409 means that any THP that has more than 409 out of 512 (80%) zero filled filled pages will be split. To test out the patches, the below commands without the shrinker will invoke OOM killer immediately and kill stress, but will not fail with the shrinker: echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none mkdir /sys/fs/cgroup/test echo $$ > /sys/fs/cgroup/test/cgroup.procs echo 20M > /sys/fs/cgroup/test/memory.max echo 0 > /sys/fs/cgroup/test/memory.swap.max # allocate twice memory.max for each stress worker and touch 40/512 of # each THP, i.e. vm-stride 50K. # With the shrinker, max_ptes_none of 470 and below won't invoke OOM # killer. # Without the shrinker, OOM killer is invoked immediately irrespective # of max_ptes_none value and kills stress. stress --vm 1 --vm-bytes 40M --vm-stride 50K This patch (of 5): Here being unused means containing only zeros and inaccessible to userspace. When splitting an isolated thp under reclaim or migration, the unused subpages can be mapped to the shared zeropage, hence saving memory. This is particularly helpful when the internal fragmentation of a thp is high, i.e. it has many untouched subpages. This is also a prerequisite for THP low utilization shrinker which will be introduced in later patches, where underutilized THPs are split, and the zero-filled pages are freed saving memory. Link: https://lkml.kernel.org/r/20240830100438.3623486-1-usamaarif642@gmail.com Link: https://lkml.kernel.org/r/20240830100438.3623486-3-usamaarif642@gmail.com Signed-off-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Usama Arif <usamaarif642@gmail.com> Tested-by: Shuang Zhai <zhais@google.com> Cc: Alexander Zhu <alexlzhu@fb.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kairui Song <ryncsn@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Shuang Zhai <szhai2@cs.rochester.edu> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:03 -07:00
Barry Song	903edea6c5	mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner Three points for this change: 1. We should consolidate all warnings in one place. Currently, the order > 1 warning is in the hotpath, while others are in less likely scenarios. Moving all warnings to the slowpath will reduce the overhead for order > 1 and increase the visibility of other warnings. 2. We currently have two warnings for order: one for order > 1 in the hotpath and another for order > costly_order in the laziest path. I suggest standardizing on order > 1 since it's been in use for a long time. 3. We don't need to check for __GFP_NOWARN in this case. __GFP_NOWARN is meant to suppress allocation failure reports, but here we're dealing with bug detection, not allocation failures. So replace WARN_ON_ONCE_GFP by WARN_ON_ONCE. [v-songbaohua@oppo.com: also update the doc for __GFP_NOFAIL with order > 1] Link: https://lkml.kernel.org/r/20240903223935.1697-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240830202823.21478-4-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: "Eugenio Pérez" <eperezma@redhat.com> Cc: Hailong.Liu <hailong.liu@oppo.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <kees@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Maxime Coquelin <maxime.coquelin@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Xie Yongji <xieyongji@bytedance.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:03 -07:00
Barry Song	17d7542260	mm: document __GFP_NOFAIL must be blockable Non-blocking allocation with __GFP_NOFAIL is not supported and may still result in NULL pointers (if we don't return NULL, we result in busy-loop within non-sleepable contexts): static inline struct page * __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, struct alloc_context ac) { ... / * Make sure that __GFP_NOFAIL request doesn't leak out and make sure * we always retry / if (gfp_mask & __GFP_NOFAIL) { / * All existing users of the __GFP_NOFAIL are blockable, so warn * of any new users that actually require GFP_NOWAIT */ if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask)) goto fail; ... } ... fail: warn_alloc(gfp_mask, ac->nodemask, "page allocation failure: order:%u", order); got_pg: return page; } Highlight this in the documentation of __GFP_NOFAIL so that non-mm subsystems can reject any illegal usage of __GFP_NOFAIL with GFP_ATOMIC, GFP_NOWAIT, etc. Link: https://lkml.kernel.org/r/20240830202823.21478-3-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: David Hildenbrand <david@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: "Eugenio Pérez" <eperezma@redhat.com> Cc: Hailong.Liu <hailong.liu@oppo.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <kees@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Maxime Coquelin <maxime.coquelin@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Xie Yongji <xieyongji@bytedance.com> Cc: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:03 -07:00
Jason Wang	955abe0a1b	vduse: avoid using __GFP_NOFAIL Patch series "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve related doc and warn", v4. __GFP_NOFAIL carries the semantics of never failing, so its callers do not check the return value: %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller cannot handle allocation failures. The allocation could block indefinitely but will never return with failure. Testing for failure is pointless. However, __GFP_NOFAIL can sometimes fail if it exceeds size limits or is used with GFP_ATOMIC/GFP_NOWAIT in a non-sleepable context. This patchset handles illegal using __GFP_NOFAIL together with GFP_ATOMIC lacking __GFP_DIRECT_RECLAIM(without this, we can't do anything to reclaim memory to satisfy the nofail requirement) and improve related document and warnings. The proper size limits for __GFP_NOFAIL will be handled separately after more discussions. This patch (of 3): mm doesn't support non-blockable __GFP_NOFAIL allocation. Because persisting in providing __GFP_NOFAIL services for non-block users who cannot perform direct memory reclaim may only result in an endless busy loop. Therefore, in such cases, the current mm-core may directly return a NULL pointer: static inline struct page * __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, struct alloc_context ac) { ... if (gfp_mask & __GFP_NOFAIL) { / * All existing users of the __GFP_NOFAIL are blockable, so warn * of any new users that actually require GFP_NOWAIT */ if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask)) goto fail; ... } ... fail: warn_alloc(gfp_mask, ac->nodemask, "page allocation failure: order:%u", order); got_pg: return page; } Unfortuantely, vpda does that nofail allocation under non-sleepable lock. A possible way to fix that is to move the pages allocation out of the lock into the caller, but having to allocate a huge number of pages and auxiliary page array seems to be problematic as well per Tetsuon: " You should implement proper error handling instead of using __GFP_NOFAIL if count can become large." So I chose another way, which does not release kernel bounce pages when user tries to register userspace bounce pages. Then we can avoid allocating in paths where failure is not expected.(e.g in the release). We pay this for more memory usage as we don't release kernel bounce pages but further optimizations could be done on top. [v-songbaohua@oppo.com: Refine the changelog] Link: https://lkml.kernel.org/r/20240830202823.21478-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240830202823.21478-2-21cnbao@gmail.com Fixes: `6c77ed2288` ("vduse: Support using userspace pages as bounce buffer") Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Xie Yongji <xieyongji@bytedance.com> Tested-by: Xie Yongji <xieyongji@bytedance.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hailong.Liu <hailong.liu@oppo.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Eugenio Pérez" <eperezma@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Maxime Coquelin <maxime.coquelin@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:02 -07:00
Mateusz Guzik	83362d2237	mm/hugetlb: sort out global lock annotations The mutex array pointer shares a cacheline with the spinlock: ffffffff84187480 B hugetlb_fault_mutex_table ffffffff84187488 B hugetlb_lock This is because the former is annotated with a macro forcing cacheline alignment. I suspect it was meant to be the variant which on top of it makes sure the object does not share the cacheline with anyone. Since array pointer itself is de facto read-only such an annotation does not make sense there anyway. Instead mark it __ro_after_init along with the size var. Do however move the spinlock out of the way. [akpm@linux-foundation.org: move section directives to the end of the definitions, per convention] [akpm@linux-foundation.org: DEFINE_SPINLOCK doesn't permit section modifiers at end-of-definition] Link: https://lkml.kernel.org/r/20240828160704.1425767-1-mjguzik@gmail.com Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:02 -07:00
Hugh Dickins	15444054a5	mm: shmem: extend shmem_unused_huge_shrink() to all sizes Although shmem_get_folio_gfp() is correctly putting inodes on the shrinklist according to the folio size, shmem_unused_huge_shrink() was still dealing with that shrinklist in terms of HPAGE_PMD_SIZE. Generalize that; and to handle the mixture of sizes more sensibly, shmem_alloc_and_add_folio() give it a number of pages to be freed (approximate: no need to minimize that with an exact calculation) instead of a number of inodes to split. [akpm@linux-foundation.org: comment tweak, per David] Link: https://lkml.kernel.org/r/d8c40850-6774-7a93-1e2c-8d941683b260@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:02 -07:00
Hugh Dickins	de5b85262e	mm: shmem: fix minor off-by-one in shrinkable calculation There has been a long-standing and very minor off-by-one, where shmem_get_folio_gfp() decides if a large folio extends beyond i_size far enough to leave a page or more for freeing later under pressure. This is not something needed for stable: but it will be proportionately more significant as support for smaller large folios is added, and is best fixed before duplicating the check in other places. Link: https://lkml.kernel.org/r/d8e75079-af2d-8519-56df-6be1dccc247a@google.com Fixes: `779750d20b` ("shmem: split huge pages beyond i_size under memory pressure") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:02 -07:00
Wei Yang	21a449bedf	maple_tree: dump error message based on format Just do what mt_dump_range64() does. Dump the error message based on format. Link: https://lkml.kernel.org/r/20240826012422.29935-2-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:01 -07:00
Wei Yang	8152831069	maple_tree: arange64 node is not a leaf node mt_dump_arange64() only applies to an entry whose type is maple_arange_64, in which mte_is_leaf() must return false. Since mte_is_leaf() here is always false, we can remove this condition check. Link: https://lkml.kernel.org/r/20240826012422.29935-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:01 -07:00

1 2 3 4 5 ...

1296494 Commits