Pull slab fix from Vlastimil Babka:
- A stable fix for a missing tag reset that can happen in
kfree_nolock() with KASAN+SLUB_TINY configs (Deepanshu Kartikey)
* tag 'slab-for-6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/slub: reset KASAN tag in defer_free() before accessing freed memory
Pull slab fix from Vlastimil Babka:
- A stable fix for a performance regression in tests that perform
kmem_cache_destroy() a lot, due to unnecessarily wide scope of
kvfree_rcu_barrier() (Harry Yoo)
* tag 'slab-for-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
When CONFIG_SLUB_TINY is enabled, kfree_nolock() calls kasan_slab_free()
before defer_free(). On ARM64 with MTE (Memory Tagging Extension),
kasan_slab_free() poisons the memory and changes the tag from the
original (e.g., 0xf3) to a poison tag (0xfe).
When defer_free() then tries to write to the freed object to build the
deferred free list via llist_add(), the pointer still has the old tag,
causing a tag mismatch and triggering a KASAN use-after-free report:
BUG: KASAN: slab-use-after-free in defer_free+0x3c/0xbc mm/slub.c:6537
Write at addr f3f000000854f020 by task kworker/u8:6/983
Pointer tag: [f3], memory tag: [fe]
Fix this by calling kasan_reset_tag() before accessing the freed memory.
This is safe because defer_free() is part of the allocator itself and is
expected to manipulate freed memory for bookkeeping purposes.
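A minimal sketch of the fix shape (simplified; the per-cpu structure and
field names here follow mm/slub.c only loosely):

    static void defer_free(struct kmem_cache *s, void *head)
    {
            struct defer_free *df = this_cpu_ptr(&defer_free_objects);

            /*
             * kasan_slab_free() already poisoned the object and retagged
             * its memory; reset the pointer tag so the allocator can
             * store its llist_node bookkeeping in the freed object
             * without a tag mismatch.
             */
            head = kasan_reset_tag(head);

            if (llist_add(head + s->offset, &df->objects))
                    irq_work_queue(&df->work);
    }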
Fixes: af92793e52 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Cc: stable@vger.kernel.org
Reported-by: syzbot+7a25305a76d872abcfa1@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7a25305a76d872abcfa1
Tested-by: syzbot+7a25305a76d872abcfa1@syzkaller.appspotmail.com
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://patch.msgid.link/20251210022024.3255826-1-kartikey406@gmail.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
caches when a cache is destroyed. This is unnecessary; only the RCU
sheaves belonging to the cache being destroyed need to be flushed.
As suggested by Vlastimil Babka, introduce a weaker form of
kvfree_rcu_barrier() that operates on a specific slab cache.
Factor out flush_rcu_sheaves_on_cache() from flush_all_rcu_sheaves() and
call it from flush_all_rcu_sheaves() and kvfree_rcu_barrier_on_cache().
Call kvfree_rcu_barrier_on_cache() instead of kvfree_rcu_barrier() on
cache destruction.
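The shape of the refactoring, as a rough sketch (locking and the actual
sheaf-flushing work are omitted; flush_rcu_sheaves_on_cache()'s body is
only hinted at):

    /* factored out of flush_all_rcu_sheaves() */
    static void flush_rcu_sheaves_on_cache(struct kmem_cache *s)
    {
            /* queue and wait for this cache's percpu rcu_sheaf work */
    }

    void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
    {
            /* flush only this cache's RCU sheaves, not everyone's */
            flush_rcu_sheaves_on_cache(s);
            /* ... plus the existing kvfree_rcu barrier steps ... */
    }

    void flush_all_rcu_sheaves(void)
    {
            struct kmem_cache *s;

            mutex_lock(&slab_mutex);
            list_for_each_entry(s, &slab_caches, list)
                    flush_rcu_sheaves_on_cache(s);
            mutex_unlock(&slab_mutex);
    }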
The performance benefit was evaluated on a 12-core, 24-thread AMD Ryzen
5900X machine (1 socket) by loading the slub_kunit module.
Before:
Total calls: 19
Average latency (us): 18127
Total time (us): 344414
After:
Total calls: 19
Average latency (us): 10066
Total time (us): 191264
Two performance regressions have been reported:
- stress module loader test's runtime increases by 50-60% (Daniel)
- internal graphics test's runtime on Tegra234 increases by 35% (Jon)
They are fixed by this change.
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: ec66e0d599 ("slab: add sheaf support for batching kfree_rcu() operations")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-mm/1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz
Reported-and-tested-by: Daniel Gomez <da.gomez@samsung.com>
Closes: https://lore.kernel.org/linux-mm/0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org
Reported-and-tested-by: Jon Hunter <jonathanh@nvidia.com>
Closes: https://lore.kernel.org/linux-mm/e988eff6-1287-425e-a06c-805af5bbf262@nvidia.com
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20251207154148.117723-1-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Pull MM updates from Andrew Morton:
"__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki)
Rework the vmalloc() code to support non-blocking allocations
(GFP_ATOMIC, GFP_NOWAIT)
"ksm: fix exec/fork inheritance" (xu xin)
Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not
inherited across fork/exec
"mm/zswap: misc cleanup of code and documentations" (SeongJae Park)
Some light maintenance work on the zswap code
"mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira)
Enhance the /sys/kernel/debug/page_owner debug feature by adding
unique identifiers to differentiate the various stack traces so
that userspace monitoring tools can better match stack traces over
time
"mm/page_alloc: pcp->batch cleanups" (Joshua Hahn)
Minor alterations to the page allocator's per-cpu-pages feature
"Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra)
Address a scalability issue in userfaultfd's UFFDIO_MOVE operation
"kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov)
"drivers/base/node: fold node register and unregister functions" (Donet Tom)
Clean up the NUMA node handling code a little
"mm: some optimizations for prot numa" (Kefeng Wang)
Cleanups and small optimizations to the NUMA allocation hinting
code
"mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn)
Address long lock hold times at boot on large machines. These were
causing (harmless) softlockup warnings
"optimize the logic for handling dirty file folios during reclaim" (Baolin Wang)
Remove some now-unnecessary work from page reclaim
"mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park)
Enhance the DAMOS auto-tuning feature
"mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan)
Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace
configuration
"expand mmap_prepare functionality, port more users" (Lorenzo Stoakes)
Enhance the new(ish) file_operations.mmap_prepare() method and port
additional callsites from the old ->mmap() over to ->mmap_prepare()
"Fix stale IOTLB entries for kernel address space" (Lu Baolu)
Fix a bug (and possible security issue on non-x86) in the IOMMU
code. In some situations the IOMMU could be left hanging onto a
stale kernel pagetable entry
"mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang)
Clean up and optimize the folio splitting code
"mm, swap: misc cleanup and bugfix" (Kairui Song)
Some cleanups and a minor fix in the swap discard code
"mm/damon: misc documentation fixups" (SeongJae Park)
"mm/damon: support pin-point targets removal" (SeongJae Park)
Permit userspace to remove a specific monitoring target in the
middle of the current targets list
"mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo)
A couple of cleanups related to mm header file inclusion
"mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He)
Improve the selection of swap devices for NUMA machines
"mm: Convert memory block states (MEM_*) macros to enums" (Israel Batista)
Change the memory block labels from macros to enums so they will
appear in kernel debug info
"ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes)
Address an inefficiency when KSM unmerges an address range
"mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park)
Fix leaks and unhandled malloc() failures in DAMON userspace unit
tests
"some cleanups for pageout()" (Baolin Wang)
Clean up a couple of minor things in the page scanner's
writeback-for-eviction code
"mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu)
Move hugetlb's sysfs/sysctl handling code into a new file
"introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes)
Make the VMA guard regions available in /proc/pid/smaps and
improve the mergeability of guarded VMAs
"mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes)
Reduce mmap lock contention for callers performing VMA guard region
operations
"vma_start_write_killable" (Matthew Wilcox)
Start work on permitting applications to be killed when they are
waiting on a read_lock on the VMA lock
"mm/damon/tests: add more tests for online parameters commit" (SeongJae Park)
Add additional userspace testing of DAMON's "commit" feature
"mm/damon: misc cleanups" (SeongJae Park)
"make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes)
Address the possible loss of a VMA's VM_SOFTDIRTY flag when that
VMA is merged with another
"mm: support device-private THP" (Balbir Singh)
Introduce support for Transparent Huge Page (THP) migration in zone
device-private memory
"Optimize folio split in memory failure" (Zi Yan)
"mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang)
Some more cleanups in the folio splitting code
"mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes)
Clean up our handling of pagetable leaf entries by introducing the
concept of 'software leaf entries', of type softleaf_t
"reparent the THP split queue" (Muchun Song)
Reparent the THP split queue to its parent memcg. This is in
preparation for addressing the long-standing "dying memcg" problem,
wherein dead memcgs linger for too long, consuming memory
resources
"unify PMD scan results and remove redundant cleanup" (Wei Yang)
A little cleanup in the hugepage collapse code
"zram: introduce writeback bio batching" (Sergey Senozhatsky)
Improve zram writeback efficiency by introducing batched bio
writeback support
"memcg: cleanup the memcg stats interfaces" (Shakeel Butt)
Clean up our handling of the interrupt safety of some memcg stats
"make vmalloc gfp flags usage more apparent" (Vishal Moola)
Clean up vmalloc's handling of incoming GFP flags
"mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang)
Teach soft dirty and userfaultfd write protect tracking to use
RISC-V's Svrsw60t59b extension
"mm: swap: small fixes and comment cleanups" (Youngjun Park)
Fix a small bug and clean up some of the swap code
"initial work on making VMA flags a bitmap" (Lorenzo Stoakes)
Start work on converting the vma struct's flags to a bitmap, so we
stop running out of them, especially on 32-bit
"mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park)
Address a possible bug in the swap discard code and clean things
up a little
[ This merge also reverts commit ebb9aeb980 ("vfio/nvgrace-gpu:
register device memory for poison handling") because it looks
broken to me, I've asked for clarification - Linus ]
* tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm: fix vma_start_write_killable() signal handling
mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate
mm/swapfile: fix list iteration when next node is removed during discard
fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling
mm/kfence: add reboot notifier to disable KFENCE on shutdown
memcg: remove inc/dec_lruvec_kmem_state helpers
selftests/mm/uffd: initialize char variable to Null
mm: fix DEBUG_RODATA_TEST indentation in Kconfig
mm: introduce VMA flags bitmap type
tools/testing/vma: eliminate dependency on vma->__vm_flags
mm: simplify and rename mm flags function for clarity
mm: declare VMA flags by bit
zram: fix a spelling mistake
mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity
mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
pagemap: update BUDDY flag documentation
mm: swap: remove scan_swap_map_slots() references from comments
mm: swap: change swap_alloc_slow() to void
mm, swap: remove redundant comment for read_swap_cache_async
mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational
...
Pull slab updates from Vlastimil Babka:
- mempool_alloc_bulk() support for upcoming users in the block layer
that need to allocate multiple objects at once with the mempool's
guaranteed progress semantics, which is not achievable by allocating
single objects in a loop. Along with refactoring and
various improvements (Christoph Hellwig)
- Preparations for the upcoming separation of struct slab from struct
page, mostly by removing the struct folio layer, as the purpose of
struct folio has shifted since it became used in slab code (Matthew
Wilcox)
- Modernisation of slab's boot param API usage, which removes some
unexpected parsing corner cases (Petr Tesarik)
- Refactoring of freelist_aba_t (now struct freelist_counters) and
associated functions for double cmpxchg, enabled by -fms-extensions
(Vlastimil Babka)
- Cleanups and improvements related to the sheaves caching layer, which were
part of the full conversion to sheaves, which is planned for the next
release (Vlastimil Babka)
* tag 'slab-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (42 commits)
slab: Remove unnecessary call to compound_head() in alloc_from_pcs()
mempool: clarify behavior of mempool_alloc_preallocated()
mempool: drop the file name in the top of file comment
mempool: de-typedef
mempool: remove mempool_{init,create}_kvmalloc_pool
mempool: legitimize the io_schedule_timeout in mempool_alloc_from_pool
mempool: add mempool_{alloc,free}_bulk
mempool: factor out a mempool_alloc_from_pool helper
slab: Remove references to folios from virt_to_slab()
kasan: Remove references to folio in __kasan_mempool_poison_object()
memcg: Convert mem_cgroup_from_obj_folio() to mem_cgroup_from_obj_slab()
mempool: factor out a mempool_adjust_gfp helper
mempool: add error injection support
mempool: improve kerneldoc comments
mm: improve kerneldoc comments for __alloc_pages_bulk
fault-inject: make enum fault_flags available unconditionally
usercopy: Remove folio references from check_heap_object()
slab: Remove folio references from kfree_nolock()
slab: Remove folio references from kfree_rcu_sheaf()
slab: Remove folio references from build_detached_freelist()
...
Merge series "slab: cmpxchg cleanups enabled by -fms-extensions"
From the cover letter [1]:
After learning about -fms-extensions being enabled for 6.19, I realized
there is some cleanup potential in slub code by extending the definition
and usage of freelist_aba_t, as it can now become an unnamed member of
struct slab. This series performs the cleanup, with no functional
changes intended. Additionally we turn freelist_aba_t into struct
freelist_counters as it doesn't meet any criteria for being a typedef,
per Documentation/process/coding-style.rst
Based on the tag kbuild-ms-extensions-6.19 from
git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux
Link: https://lore.kernel.org/all/20251107-slab-fms-cleanup-v1-0-650b1491ac9e@suse.cz/#t [1]
Merge series "Prepare slab for memdescs" by Matthew Wilcox.
From the cover letter [1]:
When we separate struct folio, struct page and struct slab from each
other, converting to folios then to slabs will be nonsense. It made
sense under the 'folio is just a head page' interpretation, but with
full separation, page_folio() will return NULL for a page which belongs
to a slab.
This patch series removes almost all mentions of folio from slab.
There are a few folio_test_slab() invocations left around the tree that
I haven't decided how to handle yet. We're not yet quite at the point
of separately allocating struct slab, but that's what I'll be working
on next.
Link: https://lore.kernel.org/all/20251113000932.1589073-1-willy@infradead.org/ [1]
Extend __kvmalloc_node_noprof() to handle non-blocking GFP flags
(GFP_NOWAIT and GFP_ATOMIC). Previously such flags were rejected,
returning NULL. With this change:
- kvmalloc() can fall back to vmalloc() in non-blocking contexts;
- for non-blocking allocations the VM_ALLOW_HUGE_VMAP option is
disabled, since the huge mapping path still contains might_sleep();
- documentation update to reflect that GFP_NOWAIT and GFP_ATOMIC
are now supported.
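A caller-side illustration (hypothetical buffer, not from the patch):

    /* In atomic context, e.g. under a spinlock: */
    buf = kvmalloc(count * sizeof(*buf), GFP_NOWAIT);
    if (!buf)
            return -ENOMEM;
    /*
     * Previously the vmalloc() fallback rejected GFP_NOWAIT/GFP_ATOMIC
     * and kvmalloc() returned NULL; now the fallback proceeds without
     * blocking (and without VM_ALLOW_HUGE_VMAP).
     */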
Link: https://lkml.kernel.org/r/20251007122035.56347-11-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull slab fix from Vlastimil Babka:
- Fix memory leak of objects from remote NUMA node when bulk freeing to
a cache with sheaves (Harry Yoo)
* tag 'slab-for-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/slub: fix memory leak in free_to_pcs_bulk()
Commit 989b09b739 ("slab: skip percpu sheaves for remote object
freeing") introduced the remote_objects array in free_to_pcs_bulk() to
skip sheaves when objects from a remote node are freed.
However, the array is flushed only when:
1) the array becomes full (++remote_nr >= PCS_BATCH_MAX), or
2) slab_free_hook() returns false and size becomes zero.
When neither of the conditions is met, objects in the array are leaked.
This resulted in a memory leak [1], where 82 GiB of memory was allocated
for the maple_node cache.
Flush the array after successfully freeing objects to sheaves
in the do_free: path.
In the meantime, move the "if (!size) goto flush_remote;" snippet outside
the while loop for readability. Let's say all objects in the array are
from a remote node: then we acquire s->cpu_sheaves->lock and try to free
an object even when size is zero. This doesn't appear to be harmful,
but isn't really readable.
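Schematically, the fixed tail of free_to_pcs_bulk() (labels as in the
description above; the surrounding logic is elided):

    do_free:
            /* free the main batch to percpu sheaves ... */

            if (!remote_nr)
                    return;

    flush_remote:
            /*
             * Previously reached only when the array filled up or size
             * became zero, leaking any leftover remote objects.
             */
            __kmem_cache_free_bulk(s, remote_nr, remote_objects);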
Reported-by: Tytus Rogalewski <admin@simplepod.ai>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220765 [1]
Closes: https://lore.kernel.org/linux-mm/20251107094809.12e9d705b7bf4815783eb184@linux-foundation.org
Closes: https://lore.kernel.org/all/aRGDTwbt2EIz2CYn@hyeyoo
Fixes: 989b09b739 ("slab: skip percpu sheaves for remote object freeing")
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20251111125331.12246-1-harry.yoo@oracle.com
Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Tytus Rogalewski <admin@simplepod.ai>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
In barn_shrink(), use LIST_HEAD() to declare and initialize the
list_head in one step instead of using INIT_LIST_HEAD() separately.
No functional change.
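That is (both helpers are from <linux/list.h>):

    /* Before: declaration plus separate runtime initialization */
    struct list_head empty_list;
    INIT_LIST_HEAD(&empty_list);

    /* After: declared and initialized in one step */
    LIST_HEAD(empty_list);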
Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
In functions such as [__]slab_update_freelist() and
__slab_update_freelist_fast/slow() we pass old and new freelist and
counters as 4 separate parameters. The underlying
__update_freelist_fast() then constructs struct freelist_counters
variables for passing the full freelist+counter combinations to the
double cmpxchg.
In most cases we actually start with struct freelist_counters variables,
but then pass the individual fields, only to construct new struct
freelist_counters variables. While it's all inlined and thus should be
efficient, we can simplify this code.
Thus replace the 4 parameters for individual fields with two pointers to
struct freelist_counters wherever applicable. __update_freelist_fast()
can then pass them directly to try_cmpxchg_freelist().
The code is also more obvious as the pattern becomes unified such that
we set up "old" and "new" struct freelist_counters variables upfront as
we fully need them to be, and simply call [__]slab_update_freelist() on
them. Previously some of the "new" values would be hidden among the
many parameters and thus make it harder to figure out what the code
does.
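Schematically (parameter lists illustrative, not the exact slub.c
signatures):

    /* Before: old and new values passed as four loose parameters */
    slab_update_freelist(s, slab, old.freelist, old.counters,
                         new.freelist, new.counters, "drain");

    /* After: set up "old" and "new" upfront, pass them whole */
    struct freelist_counters old, new;

    old.freelist = slab->freelist;
    old.counters = slab->counters;
    new.freelist = NULL;
    new.counters = old.counters;
    slab_update_freelist(s, slab, &old, &new, "drain");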
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
In several functions we declare local struct slab variables so we can
work with the freelist and counters fields (including the sub-counters
that are in the union) comfortably.
With struct freelist_counters containing the full counters definition,
we can now reduce the local variables to that type as we don't need the
other fields in struct slab.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
In struct slab we currently have freelist and counters pair, where
counters itself is a union of unsigned long with a sub-struct of
several smaller fields. Then for the usage with double cmpxchg we have
freelist_aba_t that duplicates the definition of the freelist+counters
with implicitly the same layout as the full definition in struct slab.
Thanks to -fms-extensions we can now move the full counters definition to
freelist_aba_t (while changing it to struct freelist_counters, as a
typedef is unnecessary and discouraged) and replace the relevant part in
struct slab with an unnamed reference to it.
The immediate benefit is the removal of duplication and no longer
relying on the same layout implicitly. It also allows further cleanups
thanks to having the full definition of counters in struct
freelist_counters.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
In kmem_cache_cpu we currently have a union of the freelist+tid pair
with freelist_aba_t, relying implicitly on the type compatibility with the
freelist+counters pair used in freelist_aba_t.
To allow further changes to freelist_aba_t, we can instead define a
separate struct freelist_tid (instead of a typedef, per the coding
style) for kmem_cache_cpu, as that affects only a single helper
__update_cpu_freelist_fast().
We can add the resulting struct freelist_tid to kmem_cache_cpu as an
unnamed field thanks to -fms-extensions, so that the freelist and tid fields
can still be accessed directly.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
We want to expand usage of sheaves to all non-boot caches, including
kmalloc caches. Since sheaves themselves are also allocated by
kmalloc(), we need to prevent excessive or infinite recursion -
depending on sheaf size, the sheaf can be allocated from a smaller, the
same, or a larger kmalloc size bucket; there's no particular constraint.
This is similar to allocating the objext arrays so let's just reuse the
existing mechanisms for those. __GFP_NO_OBJ_EXT in alloc_empty_sheaf()
will prevent a nested kmalloc() from allocating a sheaf itself - it will
either have sheaves already, or fallback to a non-sheaf-cached
allocation (so bootstrap of sheaves in a kmalloc cache that allocates
sheaves from its own size bucket is possible). Additionally, reuse
OBJCGS_CLEAR_MASK to clear unwanted gfp flags from the nested
allocation.
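Roughly (the function body here is illustrative, not the exact code):

    static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s,
                                                gfp_t gfp)
    {
            /*
             * __GFP_NO_OBJ_EXT keeps the nested kmalloc() from
             * allocating a sheaf (or objext vector) of its own;
             * OBJCGS_CLEAR_MASK drops flags unwanted for this
             * internal allocation.
             */
            gfp = (gfp & ~OBJCGS_CLEAR_MASK) | __GFP_NO_OBJ_EXT;

            return kzalloc(sizeof(struct slab_sheaf) +
                           s->sheaf_capacity * sizeof(void *), gfp);
    }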
Link: https://patch.msgid.link/20251105-sheaves-cleanups-v1-5-b8218e1ac7ef@suse.cz
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
The function is tricky and many of its tests are hard to understand. Try
to improve that by using more descriptively named variables and by adding
comments.
- rename 'prior' to 'old_head' to match the head and tail parameters
- introduce a 'bool was_full' to make it more obvious what we are
testing instead of the !prior and prior tests
- add or improve comments in various places to explain what we're doing
Also replace kmem_cache_has_cpu_partial() tests with
IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL), which is a compile-time constant.
We can do that because the kmem_cache_debug(s) case is handled upfront
via free_to_partial_list().
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20251105-sheaves-cleanups-v1-1-b8218e1ac7ef@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
CONFIG_SLUB_TINY minimizes the SLUB's memory overhead in multiple ways,
mainly by avoiding percpu caching of slabs and objects. It also reduces
code size by replacing some code paths with simplified ones through
ifdefs, but the benefits of that are smaller and would complicate the
upcoming changes.
Thus remove these code paths and associated ifdefs and simplify the code
base.
Link: https://patch.msgid.link/20251105-sheaves-cleanups-v1-4-b8218e1ac7ef@suse.cz
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
When a pfmemalloc allocation actually dips into reserves, the slab is
marked accordingly and non-pfmemalloc allocations should not be allowed
to allocate from it. The sheaves percpu caching currently doesn't follow
this rule, so implement it before we expand sheaves usage to all caches.
Make sure objects from pfmemalloc slabs don't end up in percpu sheaves.
When freeing, skip the sheaves for any object from a pfmemalloc slab.
When refilling sheaves, use __GFP_NOMEMALLOC to override any pfmemalloc
context - the allocation will fallback to regular slab allocations when
sheaves are depleted and can't be refilled because of the override.
For kfree_rcu(), detect pfmemalloc slabs when processing the rcu_sheaf
after the grace period in __rcu_free_sheaf_prepare(), and simply flush
the sheaf if any object comes from a pfmemalloc slab.
For prefilled sheaves, try to refill them first with __GFP_NOMEMALLOC
and if it fails, retry without __GFP_NOMEMALLOC but then mark the sheaf
pfmemalloc, which makes it flushed back to slabs when returned.
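For the prefilled case, the retry logic is roughly (helper and field
names illustrative):

    /* first try to refill without dipping into reserves */
    ret = refill_sheaf(s, sheaf, gfp | __GFP_NOMEMALLOC);
    if (ret) {
            /*
             * Retry allowing reserves, but mark the sheaf so it is
             * flushed back to slabs when returned.
             */
            ret = refill_sheaf(s, sheaf, gfp);
            if (!ret)
                    sheaf->pfmemalloc = true;
    }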
Link: https://patch.msgid.link/20251105-sheaves-cleanups-v1-3-b8218e1ac7ef@suse.cz
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
SLUB's internal bulk allocation __kmem_cache_alloc_bulk() can currently
allocate some objects from KFENCE, i.e. when refilling a sheaf. It works
but it's conceptually the wrong layer, as KFENCE allocations should only
happen when objects are actually handed out from slab to its users.
Currently for sheaf-enabled caches, slab_alloc_node() can return KFENCE
object via kfence_alloc(), but also via alloc_from_pcs() when a sheaf
was refilled with KFENCE objects. Continuing like this would also
complicate the upcoming sheaf refill changes.
Thus remove KFENCE allocation from __kmem_cache_alloc_bulk() and move it
to the places that return slab objects to users. slab_alloc_node() is
already covered (see above). Add kfence_alloc() to
kmem_cache_alloc_from_sheaf() to handle KFENCE allocations from
prefilled sheaves, with a comment that the caller should not expect the
sheaf size to decrease after every allocation because of this
possibility.
For kmem_cache_alloc_bulk() implement a different strategy to handle
KFENCE upfront and rely on internal batched operations afterwards.
Assume there will be at most one KFENCE allocation per bulk allocation
and then assign its index in the array of objects randomly.
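The bulk strategy, sketched (error unwinding and the internal bulk
call's exact signature are simplified):

    size_t n = size;
    void *kfence_obj = kfence_alloc(s, s->object_size, flags);

    if (kfence_obj)
            n--;            /* one slot is already covered by KFENCE */

    if (__kmem_cache_alloc_bulk(s, flags, n, p) != n)
            goto error;     /* (must also free kfence_obj) */

    if (kfence_obj) {
            /* place the single KFENCE object at a random index */
            size_t i = get_random_u32_below(size);

            p[n] = p[i];
            p[i] = kfence_obj;
    }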
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Link: https://patch.msgid.link/20251105-sheaves-cleanups-v1-2-b8218e1ac7ef@suse.cz
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
In review of followup work, Harry noticed a potential infinite loop.
Upon closer inspection, it already exists for kmalloc_nolock() on a
cache with debugging enabled, since commit af92793e52 ("slab:
Introduce kmalloc_nolock() and kfree_nolock().")
When alloc_single_from_new_slab() fails to trylock the node's list_lock,
we keep retrying to get a partial slab or allocate a new slab. If we
indeed interrupted somebody holding the list_lock, the trylock will fail
deterministically and we end up allocating and defer-freeing slabs
indefinitely with no progress.
To fix it, fail the allocation if spinning is not allowed. This is
acceptable in the restricted context of kmalloc_nolock(), especially
with debugging enabled.
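Conceptually (simplified; the deferred-deactivate helper name follows
the nolock series and may not match the final code exactly):

    freelist = alloc_single_from_new_slab(s, slab, orig_size);
    if (!freelist) {
            /* trylock of n->list_lock failed */
            if (!allow_spin) {
                    /*
                     * We may have interrupted the list_lock holder;
                     * the trylock would keep failing deterministically,
                     * so fail the allocation instead of looping.
                     */
                    defer_deactivate_slab(slab, NULL);
                    return NULL;
            }
            goto new_objects;       /* retrying is fine when we can spin */
    }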
Reported-by: Harry Yoo <harry.yoo@oracle.com>
Closes: https://lore.kernel.org/all/aQLqZjjq1SPD3Fml@hyeyoo/
Fixes: af92793e52 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Acked-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20251103-fix-nolock-loop-v1-1-6e2b3e82b9da@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Use core_param() and __core_param_cb() instead of __setup() or
__setup_param() to improve syntax checking and error messages.
Replace get_option() with kstrtouint(), because:
* the latter accepts a pointer to const char,
* these parameters should not accept ranges,
* error value can be passed directly to parser.
There is one more change apart from the parsing of numeric parameters:
the slab_strict_numa parameter name must now match exactly. Before this
patch the kernel would silently accept any option that starts with the
name as an undocumented alias.
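For a numeric parameter the pattern is roughly the following (the old
half is close to what slub.c actually had; the new registration via
__core_param_cb() is abridged):

    /* Before: loose prefix matching, int-only get_option() */
    static int __init setup_slub_min_order(char *str)
    {
            get_option(&str, (int *)&slub_min_order);
            return 1;
    }
    __setup("slub_min_order=", setup_slub_min_order);

    /* After: const-correct, no ranges, errors propagate to the parser */
    static int param_set_slub_min_order(const char *val,
                                        const struct kernel_param *kp)
    {
            return kstrtouint(val, 0, &slub_min_order);
    }
    /* ... hooked up through a kernel_param_ops via __core_param_cb() */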
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Link: https://patch.msgid.link/6ae7e0ddc72b7619203c07dd5103a598e12f713b.1761324765.git.ptesarik@suse.com
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Use __core_param_cb() to parse the "slab_debug" kernel parameter instead of
the obsolescent __setup(). For now, the parameter is not exposed in
sysfs, and no get op is provided.
There is a slight change in behavior. Before this patch, the following
parameter would silently turn on full debugging for all slabs:
slub_debug_yada_yada_gotta_love_this=hail_satan!
This syntax is now rejected, and the parameter will be passed to user
space, making the kernel a holier place.
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Link: https://patch.msgid.link/9674b34861394088c7853edf8e9d2b439fd4b42f.1761324765.git.ptesarik@suse.com
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
If two competing threads enter alloc_slab_obj_exts(), and the one that
allocates the vector wins the cmpxchg(), the other thread that failed
allocation mistakenly assumes that slab->obj_exts is still empty due to
its own allocation failure. This will then trigger warnings with
CONFIG_MEM_ALLOC_PROFILING_DEBUG checks in the subsequent free path.
Therefore, let's check the result of cmpxchg() to see if marking the
allocation as failed was successful. If it wasn't, check whether the
winning side has succeeded its allocation (it might have been also
marking it as failed) and if yes, return success.
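On the failure-marking path that amounts to (sketch, per the description
above):

    unsigned long old = cmpxchg(&slab->obj_exts, 0, OBJEXTS_ALLOC_FAIL);

    /*
     * If we lost the race and the winner installed a valid vector
     * (rather than also marking failure), report success so the
     * caller uses the winner's vector.
     */
    if (old && !(old & OBJEXTS_ALLOC_FAIL))
            return 0;

    return -ENOMEM;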
Suggested-by: Harry Yoo <harry.yoo@oracle.com>
Fixes: f7381b9116 ("slab: mark slab->obj_exts allocation failures unconditionally")
Cc: <stable@vger.kernel.org>
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Link: https://patch.msgid.link/20251023143313.1327968-1-hao.ge@linux.dev
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Since commit af92793e52 ("slab: Introduce kmalloc_nolock() and
kfree_nolock().") there's a possibility in alloc_single_from_new_slab()
that we discard the newly allocated slab if we can't spin and we fail to
trylock. As a result we don't perform inc_slabs_node() later in the
function. Instead we perform a deferred deactivate_slab() which can
either put the unaccounted slab on the partial list, or discard it
immediately while performing dec_slabs_node(). Either way will cause an
accounting imbalance.
Fix this by not marking the slab as frozen, and using free_slab()
instead of deactivate_slab() for non-frozen slabs in
free_deferred_objects(). For CONFIG_SLUB_TINY, that's the only possible
case. By not using discard_slab() we avoid dec_slabs_node().
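In free_deferred_objects() the distinction becomes roughly
(illustrative):

    if (slab->frozen) {
            /* accounted via inc_slabs_node(); normal path */
            deactivate_slab(s, slab, flush_freelist);
    } else {
            /*
             * The slab was discarded before inc_slabs_node() ran;
             * free it without discard_slab(), and thus without the
             * unbalancing dec_slabs_node().
             */
            free_slab(s, slab);
    }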
Fixes: af92793e52 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Link: https://patch.msgid.link/20251023-fix-slab-accounting-v2-1-0e62d50986ea@suse.cz
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
If two competing threads enter alloc_slab_obj_exts() and one of them
fails to allocate the object extension vector, it might override the
valid slab->obj_exts allocated by the other thread with
OBJEXTS_ALLOC_FAIL. This will cause the thread that lost this race and
expects a valid pointer to dereference a NULL pointer later on.
Update slab->obj_exts atomically using cmpxchg() to avoid
slab->obj_exts overrides by racing threads.
Thanks to Vlastimil and Suren for their help with debugging.
Fixes: f7381b9116 ("slab: mark slab->obj_exts allocation failures unconditionally")
Cc: <stable@vger.kernel.org>
Suggested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Link: https://patch.msgid.link/20251021010353.1187193-1-hao.ge@linux.dev
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
If obj_exts allocation failed, slab->obj_exts is set to OBJEXTS_ALLOC_FAIL,
but we do not clear it when freeing the slab. Since OBJEXTS_ALLOC_FAIL and
MEMCG_DATA_OBJEXTS currently share the same bit position, during the
release of the associated folio, a VM_BUG_ON_FOLIO() check in
folio_memcg_kmem() is triggered because the OBJEXTS_ALLOC_FAIL flag was
not cleared, causing it to be interpreted as a kmem folio (non-slab)
with MEMCG_OBJEXTS_DATA flag set, which is invalid because
MEMCG_OBJEXTS_DATA is supposed to be set only on slabs.
Another problem that predates sharing the OBJEXTS_ALLOC_FAIL and
MEMCG_DATA_OBJEXTS bits is that on configurations with
is_check_pages_enabled(), the non-cleared bit in page->memcg_data will
trigger a free_page_is_bad() failure: "page still charged to cgroup".
When freeing a slab, we clear slab->obj_exts if the obj_ext array has
been successfully allocated. So let's clear it also when the allocation
has failed.
Fixes: 09c46563ff ("codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext allocations")
Fixes: 7612833192 ("slab: Reuse first bit for OBJEXTS_ALLOC_FAIL")
Link: https://lore.kernel.org/all/20251015141642.700170-1-hao.ge@linux.dev/
Cc: <stable@vger.kernel.org>
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
defer_free() links pending objects using the slab's freelist offset
which is fine as they are not free yet. free_deferred_objects() then
clears this pointer to avoid confusing the debugging consistency checks
that may be enabled for the cache.
However, with CONFIG_SLAB_FREELIST_HARDENED, even the NULL pointer needs
to be encoded appropriately using set_freepointer(), otherwise it's
decoded as something else and triggers the consistency checks, as found
by the kernel test robot.
Use set_freepointer() to prevent the issue.
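That is (the "before" line illustrates the raw store):

    /* Before: raw store; CONFIG_SLAB_FREELIST_HARDENED decodes the
     * stored value and sees garbage instead of NULL */
    *(void **)(object + s->offset) = NULL;

    /* After: encode even the NULL pointer properly */
    set_freepointer(s, object, NULL);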
Fixes: af92793e52 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Reported-and-tested-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202510101652.7921fdc6-lkp@intel.com
Acked-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
In slab_post_alloc_hook(), kmemleak check is skipped when
gfpflags_allow_spinning() returns false since commit af92793e52
("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Therefore, unconditionally calling kmemleak_not_leak() in
alloc_slab_obj_exts() would trigger the following warning:
kmemleak: Trying to color unknown object at 0xffff8881057f5000 as Grey
Call Trace:
alloc_slab_obj_exts+0x1b5/0x370
__alloc_tagging_slab_alloc_hook+0x9f/0x2d0
__kmalloc_cache_noprof+0x1c4/0x5c0
__set_page_owner+0x10d/0x1c0
post_alloc_hook+0x84/0xf0
get_page_from_freelist+0x73b/0x1380
__alloc_frozen_pages_noprof+0x110/0x2c0
alloc_pages_mpol+0x44/0x140
alloc_slab_page+0xac/0x150
allocate_slab+0x78/0x3a0
___slab_alloc+0x76b/0xed0
__slab_alloc.constprop.0+0x5a/0xb0
Add the allow_spin check in alloc_slab_obj_exts() to eliminate the above
warning.
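The fix shape (simplified):

    /*
     * kmemleak tracking was skipped in slab_post_alloc_hook() for
     * non-spinning allocations, so only mark the vector as not-a-leak
     * when kmemleak actually knows about it.
     */
    if (gfpflags_allow_spinning(gfp))
            kmemleak_not_leak(vec);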
Fixes: af92793e52 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20250930063831.782815-1-ranxiaokai627@163.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Pull MM updates from Andrew Morton:
- "mm, swap: improve cluster scan strategy" from Kairui Song improves
performance and reduces the failure rate of swap cluster allocation
- "support large align and nid in Rust allocators" from Vitaly Wool
permits Rust allocators to set NUMA node and large alignment when
performing slub and vmalloc reallocs
- "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend
DAMOS_STAT's handling of the DAMON operations sets for virtual
address spaces for ops-level DAMOS filters
- "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
Baghdasaryan reduces mmap_lock contention during reads of
/proc/pid/maps
- "mm/mincore: minor clean up for swap cache checking" from Kairui Song
performs some cleanup in the swap code
- "mm: vm_normal_page*() improvements" from David Hildenbrand provides
code cleanup in the pagemap code
- "add persistent huge zero folio support" from Pankaj Raghav provides
a block layer speedup by optionally making the
huge_zero_page persistent, instead of releasing it when its refcount
falls to zero
- "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
the recently added Kexec Handover feature
- "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
Stoakes turns mm_struct.flags into a bitmap, to end the constant
struggle with space shortage on 32-bit conflicting with 64-bit's
needs
- "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
code
- "selftests/mm: Fix false positives and skip unsupported tests" from
Donet Tom fixes a few things in our selftests code
- "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
from David Hildenbrand "allows individual processes to opt-out of
THP=always into THP=madvise, without affecting other workloads on the
system".
It's a long story - the [1/N] changelog spells out the considerations
- "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
the memdesc project. Please see
https://kernelnewbies.org/MatthewWilcox/Memdescs and
https://blogs.oracle.com/linux/post/introducing-memdesc
- "Tiny optimization for large read operations" from Chi Zhiling
improves the efficiency of the pagecache read path
- "Better split_huge_page_test result check" from Zi Yan improves our
folio splitting selftest code
- "test that rmap behaves as expected" from Wei Yang adds some rmap
selftests
- "remove write_cache_pages()" from Christoph Hellwig removes that
function and converts its two remaining callers
- "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
selftests issues
- "introduce kernel file mapped folios" from Boris Burkov introduces
the concept of "kernel file pages". Using these permits btrfs to
account its metadata pages to the root cgroup, rather than to the
cgroups of random inappropriate tasks
- "mm/pageblock: improve readability of some pageblock handling" from
Wei Yang provides some readability improvements to the page allocator
code
- "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
to understand arm32 highmem
- "tools: testing: Use existing atomic.h for vma/maple tests" from
Brendan Jackman performs some code cleanups and deduplication under
tools/testing/
- "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
a couple of 32-bit issues in tools/testing/radix-tree.c
- "kasan: unify kasan_enabled() and remove arch-specific
implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
initialization code into a common arch-neutral implementation
- "mm: remove zpool" from Johannes Weiner removes zspool - an
indirection layer which now only redirects to a single thing
(zsmalloc)
- "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
couple of cleanups in the fork code
- "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
adjustments at various nth_page() callsites, eventually permitting
the removal of that undesirable helper function
- "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
creates a KASAN write-only mode for ARM, using that architecture's
memory tagging feature. It is felt that a write-only mode KASAN is
suitable for use in production systems rather than debug-only
- "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
some tidying in the hugetlb folio allocation code
- "mm: establish const-correctness for pointer parameters" from Max
Kellermann makes quite a number of the MM API functions more accurate
about the constness of their arguments. This was getting in the way
of subsystems (in this case CEPH) when they attempt to improving
their own const/non-const accuracy
- "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
code sites which were confused over when to use free_pages() vs
__free_pages()
- "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
mapletree code accessible to Rust. Required by nouveau and by its
forthcoming successor: the new Rust Nova driver
- "selftests/mm: split_huge_page_test: split_pte_mapped_thp
improvements" from David Hildenbrand adds a fix and some cleanups to
the thp selftesting code
- "mm, swap: introduce swap table as swap cache (phase I)" from Chris
Li and Kairui Song is the first step along the path to implementing
"swap tables" - a new approach to swap allocation and state tracking
which is expected to yield speed and space improvements. This
patchset itself yields a 5-20% performance benefit in some situations
- "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
layer to clean up the ptdesc code a little
- "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
issues in our 5-level pagetable selftesting code
- "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
addresses a couple of minor issues in the relatively new memory
allocation profiling feature
- "Small cleanups" from Matthew Wilcox has a few cleanups in
preparation for more memdesc work
- "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
Quanmin Yan makes some changes to DAMON in furtherance of supporting
arm highmem
- "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
Anjum adds that compiler check to selftests code and fixes the
fallout, by removing dead code
- "Improvements to Victim Process Thawing and OOM Reaper Traversal
Order" from zhongjinji makes a number of improvements in the OOM
killer: mainly thawing a more appropriate group of victim threads so
they can release resources
- "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
is a bunch of small and unrelated fixups for DAMON
- "mm/damon: define and use DAMON initialization check function" from
SeongJae Park implements reliability and maintainability improvements
to a recently-added bug fix
- "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
SeongJae Park provides additional transparency to userspace clients
of the DAMON_STAT information
- "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
some constraints on khugepaged's collapsing of anon VMAs. It also
increases the success rate of MADV_COLLAPSE against an anon vma
- "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
from Lorenzo Stoakes moves us further towards removal of
file_operations.mmap(). This patchset concentrates upon clearing up
the treatment of stacked filesystems
- "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
provides some fixes and improvements to mlock's tracking of large
folios. /proc/meminfo's "Mlocked" field became more accurate
- "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
Donet Tom fixes several user-visible KSM stats inaccuracies across
forks and adds selftest code to verify these counters
- "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
some potential but presently benign issues in KSM's mm_slot handling
* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
mm: swap: check for stable address space before operating on the VMA
mm: convert folio_page() back to a macro
mm/khugepaged: use start_addr/addr for improved readability
hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
alloc_tag: fix boot failure due to NULL pointer dereference
mm: silence data-race in update_hiwater_rss
mm/memory-failure: don't select MEMORY_ISOLATION
mm/khugepaged: remove definition of struct khugepaged_mm_slot
mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
hugetlb: increase number of reserving hugepages via cmdline
selftests/mm: add fork inheritance test for ksm_merging_pages counter
mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
drivers/base/node: fix double free in register_one_node()
mm: remove PMD alignment constraint in execmem_vmalloc()
mm/memory_hotplug: fix typo 'esecially' -> 'especially'
mm/rmap: improve mlock tracking for large folios
mm/filemap: map entire large folio faultaround
mm/fault: try to map the entire file folio in finish_fault()
mm/rmap: mlock large folios in try_to_unmap_one()
mm/rmap: fix a mlock race condition in folio_referenced_one()
...
From the cover letter [1]:
This patch set introduces kmalloc_nolock(), which is the next logical
step towards any-context allocation, necessary to remove bpf_mem_alloc
and get rid of the preallocation requirement in BPF infrastructure.
In production BPF maps grew to gigabytes in size. Preallocation wastes
memory. Alloc from any context addresses this issue for BPF and other
subsystems that are forced to preallocate too.
This long task started with introduction of alloc_pages_nolock(), then
memcg and objcg were converted to operate from any context including
NMI, this set completes the task with kmalloc_nolock() that builds on
top of alloc_pages_nolock() and memcg changes.
After that BPF subsystem will gradually adopt it everywhere.
Link: https://lore.kernel.org/all/20250909010007.1660-1-alexei.starovoitov@gmail.com/ [1]
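A caller-side sketch (signature per the series cover letter; the size
and flags here are arbitrary):

    /* safe in NMI, IRQ, or under raw spinlocks */
    void *obj = kmalloc_nolock(64, __GFP_ZERO, NUMA_NO_NODE);

    if (obj) {
            /* ... use obj ... */
            kfree_nolock(obj);
    }
    /* NULL means "retry later" -- there is no blocking fallback */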