Pull MIPS updates from Thomas Bogendoerfer:
"Cleanups and fixes"
* tag 'mips_6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
MIPS: pci-legacy: Override pci_address_to_pio
MIPS: Loongson64: env: Use str_on_off() helper in prom_lefi_init_env()
MIPS: migrate to generic rule for built-in DTBs
mips: fix shmctl/semctl/msgctl syscall for o32
mips/math-emu: fix emulation of the prefx instruction
MIPS: Loongson: Add comments for interface_info
MIPS: Loongson64: remove ROM Size unit in boardinfo
MIPS: traps: Use str_enabled_disabled() in parity_protection_init()
MIPS: ftrace: Declare ftrace_get_parent_ra_addr() as static
Revert "MIPS: csrc-r4k: Select HAVE_UNSTABLE_SCHED_CLOCK if SMP && 64BIT"
MIPS: Fix the wrong format specifier
MIPS: Add a blank line after __HEAD
MIPS: kernel: Rename read/write_c0_ecc to read/writec0_errctl
Pull ARM updates from Russell King:
- fix typos in vfpmodule.c
- drop obsolete VFP accessor fallback for old assemblers
- add cache line identifier register accessor functions
- add cacheinfo support
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux:
ARM: 9440/1: cacheinfo fix format field mask
ARM: 9433/2: implement cacheinfo support
ARM: 9432/2: add CLIDR accessor functions
ARM: 9438/1: assembler: Drop obsolete VFP accessor fallback
ARM: 9437/1: vfp: Fix typographical errors in vfpmodule.c
Pull m68knommu update from Greg Ungerer:
"Just a single fix to correct the clock rate defined for the internal
timer hardware blocks of the ColdFire 5441x family of SoC devices"
* tag 'm68knommu-for-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
m68k: coldfire: Use proper clock rate for timers
Pull xtensa updates from Max Filippov:
- a few one-liner cleanups
* tag 'xtensa-20250126' of https://github.com/jcmvbkbc/linux-xtensa:
xtensa/simdisk: Use str_write_read() helper in simdisk_transfer()
xtensa: Remove zero-length alignment array
xtensa: annotate dtb_start variable as static __initdata
Pull MM updates from Andrew Morton:
"The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs.
- "Allocate and free frozen pages" from Matthew Wilcox reorganizes
the page allocator so we end up with the ability to allocate and
free zero-refcount pages. So that callers (ie, slab) can avoid a
refcount inc & dec
- "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
use large folios other than PMD-sized ones
- "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
and fixes for this small built-in kernel selftest
- "mas_anode_descend() related cleanup" from Wei Yang tidies up part
of the mapletree code
- "mm: fix format issues and param types" from Keren Sun implements a
few minor code cleanups
- "simplify split calculation" from Wei Yang provides a few fixes and
a test for the mapletree code
- "mm/vma: make more mmap logic userland testable" from Lorenzo
Stoakes continues the work of moving vma-related code into the
(relatively) new mm/vma.c
- "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
Hildenbrand cleans up and rationalizes handling of gfp flags in the
page allocator
- "readahead: Reintroduce fix for improper RA window sizing" from Jan
Kara is a second attempt at fixing a readahead window sizing issue.
It should reduce the amount of unnecessary reading
- "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
addresses an issue where "huge" amounts of pte pagetables are
accumulated:
https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
Qi's series addresses this windup by synchronously freeing PTE
memory within the context of madvise(MADV_DONTNEED)
- "selftest/mm: Remove warnings found by adding compiler flags" from
Muhammad Usama Anjum fixes some build warnings in the selftests
code when optional compiler warnings are enabled
- "mm: don't use __GFP_HARDWALL when migrating remote pages" from
David Hildenbrand tightens the allocator's observance of
__GFP_HARDWALL
- "pkeys kselftests improvements" from Kevin Brodsky implements
various fixes and cleanups in the MM selftests code, mainly
pertaining to the pkeys tests
- "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
estimate application working set size
- "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
provides some cleanups to memcg's hugetlb charging logic
- "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
removes the global swap cgroup lock. A speedup of 10% for a
tmpfs-based kernel build was demonstrated
- "zram: split page type read/write handling" from Sergey Senozhatsky
has several fixes and cleaups for zram in the area of
zram_write_page(). A watchdog softlockup warning was eliminated
- "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
Brodsky cleans up the pagetable destructor implementations. A rare
use-after-free race is fixed
- "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
simplifies and cleans up the debugging code in the VMA merging
logic
- "Account page tables at all levels" from Kevin Brodsky cleans up
and regularizes the pagetable ctor/dtor handling. This results in
improvements in accounting accuracy
- "mm/damon: replace most damon_callback usages in sysfs with new
core functions" from SeongJae Park cleans up and generalizes
DAMON's sysfs file interface logic
- "mm/damon: enable page level properties based monitoring" from
SeongJae Park increases the amount of information which is
presented in response to DAMOS actions
- "mm/damon: remove DAMON debugfs interface" from SeongJae Park
removes DAMON's long-deprecated debugfs interfaces. Thus the
migration to sysfs is completed
- "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
Peter Xu cleans up and generalizes the hugetlb reservation
accounting
- "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
removes a never-used feature of the alloc_pages_bulk() interface
- "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
extends DAMOS filters to support not only exclusion (rejecting),
but also inclusion (allowing) behavior
- "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
introduces a new memory descriptor for zswap.zpool that currently
overlaps with struct page for now. This is part of the effort to
reduce the size of struct page and to enable dynamic allocation of
memory descriptors
- "mm, swap: rework of swap allocator locks" from Kairui Song redoes
and simplifies the swap allocator locking. A speedup of 400% was
demonstrated for one workload. As was a 35% reduction for kernel
build time with swap-on-zram
- "mm: update mips to use do_mmap(), make mmap_region() internal"
from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
mmap_region() can be made MM-internal
- "mm/mglru: performance optimizations" from Yu Zhao fixes a few
MGLRU regressions and otherwise improves MGLRU performance
- "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
Park updates DAMON documentation
- "Cleanup for memfd_create()" from Isaac Manjarres does that thing
- "mm: hugetlb+THP folio and migration cleanups" from David
Hildenbrand provides various cleanups in the areas of hugetlb
folios, THP folios and migration
- "Uncached buffered IO" from Jens Axboe implements the new
RWF_DONTCACHE flag which provides synchronous dropbehind for
pagecache reading and writing. To permite userspace to address
issues with massive buildup of useless pagecache when
reading/writing fast devices
- "selftests/mm: virtual_address_range: Reduce memory" from Thomas
Weißschuh fixes and optimizes some of the MM selftests"
* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm/compaction: fix UBSAN shift-out-of-bounds warning
s390/mm: add missing ctor/dtor on page table upgrade
kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
tools: add VM_WARN_ON_VMG definition
mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
seqlock: add missing parameter documentation for raw_seqcount_try_begin()
mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
mm/page_alloc: remove the incorrect and misleading comment
zram: remove zcomp_stream_put() from write_incompressible_page()
mm: separate move/undo parts from migrate_pages_batch()
mm/kfence: use str_write_read() helper in get_access_type()
selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
selftests/mm: vm_util: split up /proc/self/smaps parsing
selftests/mm: virtual_address_range: unmap chunks after validation
selftests/mm: virtual_address_range: mmap() without PROT_WRITE
selftests/memfd/memfd_test: fix possible NULL pointer dereference
mm: add FGP_DONTCACHE folio creation flag
mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
...
Pull non-MM updates from Andrew Morton:
"Mainly individually changelogged singleton patches. The patch series
in this pull are:
- "lib min_heap: Improve min_heap safety, testing, and documentation"
from Kuan-Wei Chiu provides various tightenings to the min_heap
library code
- "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms
some cleanup and Rust preparation in the xarray library code
- "Update reference to include/asm-<arch>" from Geert Uytterhoeven
fixes pathnames in some code comments
- "Converge on using secs_to_jiffies()" from Easwar Hariharan uses
the new secs_to_jiffies() in various places where that is
appropriate
- "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
switches two filesystems to the new mount API
- "Convert ocfs2 to use folios" from Matthew Wilcox does that
- "Remove get_task_comm() and print task comm directly" from Yafang
Shao removes now-unneeded calls to get_task_comm() in various
places
- "squashfs: reduce memory usage and update docs" from Phillip
Lougher implements some memory savings in squashfs and performs
some maintainability work
- "lib: clarify comparison function requirements" from Kuan-Wei Chiu
tightens the sort code's behaviour and adds some maintenance work
- "nilfs2: protect busy buffer heads from being force-cleared" from
Ryusuke Konishi fixes an issues in nlifs when the fs is presented
with a corrupted image
- "nilfs2: fix kernel-doc comments for function return values" from
Ryusuke Konishi fixes some nilfs kerneldoc
- "nilfs2: fix issues with rename operations" from Ryusuke Konishi
addresses some nilfs BUG_ONs which syzbot was able to trigger
- "minmax.h: Cleanups and minor optimisations" from David Laight does
some maintenance work on the min/max library code
- "Fixes and cleanups to xarray" from Kemeng Shi does maintenance
work on the xarray library code"
* tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits)
ocfs2: use str_yes_no() and str_no_yes() helper functions
include/linux/lz4.h: add some missing macros
Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent
Xarray: remove repeat check in xas_squash_marks()
Xarray: distinguish large entries correctly in xas_split_alloc()
Xarray: move forward index correctly in xas_pause()
Xarray: do not return sibling entries from xas_find_marked()
ipc/util.c: complete the kernel-doc function descriptions
gcov: clang: use correct function param names
latencytop: use correct kernel-doc format for func params
minmax.h: remove some #defines that are only expanded once
minmax.h: simplify the variants of clamp()
minmax.h: move all the clamp() definitions after the min/max() ones
minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp()
minmax.h: reduce the #define expansion of min(), max() and clamp()
minmax.h: update some comments
minmax.h: add whitespace around operators and after commas
nilfs2: do not update mtime of renamed directory that is not moved
nilfs2: handle errors that nilfs_prepare_chunk() may return
CREDITS: fix spelling mistake
...
Pull ata updates from Damien Le Moal:
- Constify struct pci_device_id (Christophe)
- Remove unused code in the sata_gemini driver (David)
- Improve libahci_platform to allow supporting non consecutive port
numbers as specified in device trees (Josua)
- Cleanup ahci driver code handling of port numbers with the new helper
ahci_ignore_port() (me)
- Use pm_sleep_ptr() to remove CONFIG_PM_SLEEP ifdefs in the ahci_st
driver (Raphael). More of these changes will be included in the next
cycle
* tag 'ata-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ahci: st: Switch from CONFIG_PM_SLEEP guards to pm_sleep_ptr()
ahci: Introduce ahci_ignore_port() helper
ata: libahci_platform: support non-consecutive port numbers
ata: sata_gemini: Remove remaining reset glue
ata: sata_gemini: Remove unused gemini_sata_reset_bridge()
ata: Constify struct pci_device_id
Pull SCSI updates from James Bottomley:
"Updates to the usual drivers (ufs, lpfc, fnic, qla2xx, mpi3mr).
The major core change is the renaming of the slave_ methods plus a bit
of constification. The rest are minor updates and fixes"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (103 commits)
scsi: fnic: Propagate SCSI error code from fnic_scsi_drv_init()
scsi: fnic: Test for memory allocation failure and return error code
scsi: fnic: Return appropriate error code from failure of scsi drv init
scsi: fnic: Return appropriate error code for mem alloc failure
scsi: fnic: Remove always-true IS_FNIC_FCP_INITIATOR macro
scsi: fnic: Fix use of uninitialized value in debug message
scsi: fnic: Delete incorrect debugfs error handling
scsi: fnic: Remove unnecessary else to fix warning in FDLS FIP
scsi: fnic: Remove extern definition from .c files
scsi: fnic: Remove unnecessary else and unnecessary break in FDLS
scsi: mpi3mr: Fix possible crash when setting up bsg fails
scsi: ufs: bsg: Set bsg_queue to NULL after removal
scsi: ufs: bsg: Delete bsg_dev when setting up bsg fails
scsi: st: Don't set pos_unknown just after device recognition
scsi: aic7xxx: Fix build 'aicasm' warning
scsi: Revert "scsi: ufs: core: Probe for EXT_IID support"
scsi: storvsc: Ratelimit warning logs to prevent VM denial of service
scsi: scsi_debug: Constify sdebug_driver_template
scsi: documentation: Corrections for struct updates
scsi: driver-api: documentation: Change what is added to docbook
...
Pull firewire updates from Takashi Sakamoto:
"Two changes for the 6.14 kernel.
The first change concerns the PCI driver for 1394 OHCI hardware.
Previously, it used legacy PCI suspend/resume callbacks, which have
now been replaced with callbacks defined in the Linux generic power
management framework. This original patch was posted in 2020 and has
been adapted with some modifications for the latest kernel. Note that
the driver still includes platform-specific operations for PowerPC,
and these operations have not been tested in the new implementation
yet. It would be helpful to share the results of suspending/resuming
on the platform.
The other one is a minor fix for the memory allocation in some KUnit
tests"
* tag 'firewire-updates-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
firewire: test: Fix potential null dereference in firewire kunit test
firewire: ohci: use generic power management
Pull modules updates from Petr Pavlu:
- Sign modules with sha512 instead of sha1 by default
- Don't fail module loading when failing to set the
ro_after_init section read-only
- Constify 'struct module_attribute'
- Cleanups and preparation for const struct bin_attribute
- Put known GPL offenders in an array
- Extend the preempt disabled section in
dereference_symbol_descriptor()
* tag 'modules-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
module: sign with sha512 instead of sha1 by default
module: Don't fail module loading when setting ro_after_init section RO failed
module: Split module_enable_rodata_ro()
module: sysfs: Use const 'struct bin_attribute'
module: sysfs: Add notes attributes through attribute_group
module: sysfs: Simplify section attribute allocation
module: sysfs: Drop 'struct module_sect_attr'
module: sysfs: Drop member 'module_sect_attr::address'
module: sysfs: Drop member 'module_sect_attrs::nsections'
module: Constify 'struct module_attribute'
module: Handle 'struct module_version_attribute' as const
params: Prepare for 'const struct module_attribute *'
module: Put known GPL offenders in an array
module: Extend the preempt disabled section in dereference_symbol_descriptor().
Pull rv and tools/rtla updates from Steven Rostedt:
- Add a test suite to test the tool
Add a small test suite that can be used to test rtla's basic features
to at least have something to test when applying changes.
- Automate manual steps in monitor creation
While creating a new monitor in RV, besides generating code from
dot2k, there are a few manual steps which can be tedious and error
prone, like adding the tracepoints, makefile lines and kconfig, or
selecting events that start the monitor in the initial state.
Updates were made to try and automate as much as possible among those
steps to make creating a new RV monitor much quicker. It is still
requires to select proper tracepoints, this step is harder to
automate in a general way and, in several cases, would still need
user intervention.
- Have rtla timerlat hist and top set OSNOISE_WORKLOAD flag
Have both rtla-timerlat-hist and rtla-timerlat-top set
OSNOISE_WORKLOAD to the proper value ("on" when running with -k,
"off" when running with -u) every time the option is available
instead of setting it only when running with -u.
This prevents rtla timerlat -k from giving no results when
NO_OSNOISE_WORKLOAD is set, either manually or by an abnormally
exited earlier run of rtla timerlat -u.
- Stop rtla timerlat on signal properly when overloaded
There is an issue where if rtla is run on machines with a high number
of CPUs (100+), timerlat can generate more samples than rtla is able
to process via tracefs_iterate_raw_events. This is especially common
when the interval is set to 100us (rteval and cyclictest default) as
opposed to the rtla default of 1000us, but also happens with the rtla
default.
Currently, this leads to rtla hanging and having to be terminated
with SIGTERM. SIGINT setting stop_tracing is not enough, since more
and more events are coming and tracefs_iterate_raw_events never
exits.
To fix this: Stop the timerlat tracer on SIGINT/SIGALRM to ensure no
more events are generated when rtla is supposed to exit.
Also on receiving SIGINT/SIGALRM twice, abort iteration immediately
with tracefs_iterate_stop, making rtla exit right away instead of
waiting for all events to be processed.
- Account for missed events
Due to tracefs buffer overflow, it can happen that rtla misses
events, making the tracing results inaccurate.
Count both the number of missed events and the total number of
processed events, and display missed events as well as their
percentage. The numbers are displayed for both osnoise and timerlat,
even though for the earlier, missed events are generally not
expected.
For hist, the number is displayed at the end of the run; for top, it
is displayed on each printing of the top table.
- Changes to make osnoise more robust
There was a dependency in the code that the first field of the
osnoise_tool structure was the trace field. If that that ever
changed, then the code work break. Change the code to encapsulate
this dependency where the code that uses the structure does not have
this dependency.
* tag 'trace-tools-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (22 commits)
rtla: Report missed event count
rtla: Add function to report missed events
rtla: Count all processed events
rtla: Count missed trace events
tools/rtla: Add osnoise_trace_is_off()
rtla/timerlat_top: Set OSNOISE_WORKLOAD for kernel threads
rtla/timerlat_hist: Set OSNOISE_WORKLOAD for kernel threads
rtla/osnoise: Distinguish missing workload option
rtla/timerlat_top: Abort event processing on second signal
rtla/timerlat_hist: Abort event processing on second signal
rtla/timerlat_top: Stop timerlat tracer on signal
rtla/timerlat_hist: Stop timerlat tracer on signal
rtla: Add trace_instance_stop
tools/rtla: Add basic test suite
verification/dot2k: Implement event type detection
verification/dot2k: Auto patch current kernel source
verification/dot2k: Simplify manual steps in monitor creation
rv: Simplify manual steps in monitor creation
verification/dot2k: Add support for name and description options
verification/dot2k: More robust template variables
...
Pull runtime verifier and osnoise fixes from Steven Rostedt:
- Reset idle tasks on reset for runtime verifier
When the runtime verifier is reset, it resets the task's data that is
being monitored. But it only iterates for_each_process() which does
not include the idle tasks. As the idle tasks can be monitored, they
need to be reset as well.
- Fix the enabling and disabling of tracepoints in osnoise
If timerlat is enabled and the WORKLOAD flag is not set, then the
osnoise tracer will enable the migrate task tracepoint to monitor it
for its own workload. The test to enable the tracepoint is done
against user space modifiable parameters. On disabling of the tracer,
those same parameters are used to determine if the tracepoint should
be disabled. The problem is if user space were to modify the
parameters after it enables the tracer then it may not disable the
tracepoint.
Instead, a static variable is used to keep track if the tracepoint
was enabled or not. Then when the tracer shuts down, it will use this
variable to decide to disable the tracepoint or not, instead of
looking at the user space parameters.
* tag 'trace-rv-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/osnoise: Fix resetting of tracepoints
rv: Reset per-task monitors also for idle tasks
Pull bitmap updates from Yury Norov:
"This includes const_true() series from Vincent Mailhol, another
__always_inline rework from Nathan Chancellor for RISCV, and a couple
of random fixes from Dr. David Alan Gilbert and I Hsin Cheng"
* tag 'bitmap-for-6.14' of https://github.com:/norov/linux:
cpumask: Rephrase comments for cpumask_any*() APIs
cpu: Remove unused init_cpu_online
riscv: Always inline bitops
linux/bits.h: simplify GENMASK_INPUT_CHECK()
compiler.h: add const_true()
Switch away from using sha1 for module signing by default and use the
more modern sha512 instead, which is what among others Arch, Fedora,
RHEL, and Ubuntu are currently using for their kernels.
Sha1 has not been considered secure against well-funded opponents since
2005[1]; since 2011 the NIST and other organizations furthermore
recommended its replacement[2]. This is why OpenSSL on RHEL9, Fedora
Linux 41+[3], and likely some other current and future distributions
reject the creation of sha1 signatures, which leads to a build error of
allmodconfig configurations:
80A20474797F0000:error:03000098:digital envelope routines:do_sigver_init:invalid digest:crypto/evp/m_sigver.c:342:
make[4]: *** [.../certs/Makefile:53: certs/signing_key.pem] Error 1
make[4]: *** Deleting file 'certs/signing_key.pem'
make[4]: *** Waiting for unfinished jobs....
make[3]: *** [.../scripts/Makefile.build:478: certs] Error 2
make[2]: *** [.../Makefile:1936: .] Error 2
make[1]: *** [.../Makefile:224: __sub-make] Error 2
make[1]: Leaving directory '...'
make: *** [Makefile:224: __sub-make] Error 2
This change makes allmodconfig work again and sets a default that is
more appropriate for current and future users, too.
Link: https://www.schneier.com/blog/archives/2005/02/cryptanalysis_o.html [1]
Link: https://csrc.nist.gov/projects/hash-functions [2]
Link: https://fedoraproject.org/wiki/Changes/OpenSSLDistrustsha1SigVer [3]
Signed-off-by: Thorsten Leemhuis <linux@leemhuis.info>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Tested-by: kdevops <kdevops@lists.linux.dev> [0]
Link: https://github.com/linux-kdevops/linux-modules-kpd/actions/runs/11420092929/job/31775404330 [0]
Link: https://lore.kernel.org/r/52ee32c0c92afc4d3263cea1f8a1cdc809728aff.1729088288.git.linux@leemhuis.info
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Commit 78966b550289 ("s390: pgtable: add statistics for PUD and P4D level
page table") misses the call to pagetable_p4d_ctor() against a newly
allocated P4D table in crst_table_upgrade();
Commit 68c601de75d8 ("mm: introduce ctor/dtor at PGD level") misses the
call to pagetable_pgd_ctor() against a newly allocated PGD and the call to
pagetable_dtor() against a newly allocated P4D that is about to be freed
on crst_table_upgrade() PGD upgrade fail path.
The missed constructors and destructor break (at least) the page table
accounting when a process memory space is upgraded.
Link: https://lkml.kernel.org/r/20250123160349.200154-1-agordeev@linux.ibm.com
Fixes: 78966b550289 ("s390: pgtable: add statistics for PUD and P4D level page table")
Fixes: 68c601de75d8 ("mm: introduce ctor/dtor at PGD level")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reported-by: Heiko Carstens <hca@linux.ibm.com>
Closes: https://lore.kernel.org/all/20250122074954.8685-A-hca@linux.ibm.com/
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The comment removed in this patch originally belonged to the
build_zonelists_in_zone_order() function, which was introduced by commit
f0c0b2b808 ("change zonelist order: zonelist order selection logic").
Later, commit c9bff3eebc ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE")
removed build_zonelists_in_zone_order() but left its comment behind.
Subsequently, commit 9d3be21bf9 ("mm, page_alloc: simplify zonelist
initialization") moved the node_order variable into build_zonelists(),
making the comment originally belonged to build_zonelists_in_zone_order()
appear as if it were part of build_zonelists().
Remove this misleading comment.
Link: https://lkml.kernel.org/r/20250115041634.63387-1-yuntao.wang@linux.dev
Signed-off-by: Yuntao Wang <yuntao.wang@linux.dev>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
For each accessed chunk a PTE is created. More than 1GiB of PTEs is used
in this way. Remove each PTE after validating a chunk to reduce peak
memory usage.
It is important to only unmap memory that previously mmap()ed, as
unmapping other mappings like the stack, heap or executable mappings will
crash the process.
The mappings read from /proc/self/maps and the return values from mmap()
don't allow a simple correlation due to merging and no guaranteed order.
To correlate the pointers and mappings use prctl(PR_SET_VMA_ANON_NAME).
While it introduces a test dependency, other alternatives would introduce
runtime or development overhead.
Link: https://lkml.kernel.org/r/20250114-virtual_address_range-tests-v4-2-6fd7269934a5@linutronix.de
Fixes: 0104096498 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: Shuah Khan (Samsung OSG) <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add RWF_DONTCACHE as a read operation flag, which means that any data read
wil be removed from the page cache upon completion. Uses the page cache
to synchronize, and simply prunes folios that were instantiated when the
operation completes. While it would be possible to use private pages for
this, using the page cache as synchronization is handy for a variety of
reasons:
1) No special truncate magic is needed
2) Async buffered reads need some place to serialize, using the page
cache is a lot easier than writing extra code for this
3) The pruning cost is pretty reasonable
and the code to support this is much simpler as a result.
You can think of uncached buffered IO as being the much more attractive
cousin of O_DIRECT - it has none of the restrictions of O_DIRECT. Yes, it
will copy the data, but unlike regular buffered IO, it doesn't run into
the unpredictability of the page cache in terms of reclaim. As an
example, on a test box with 32 drives, reading them with buffered IO looks
as follows:
Reading bs 65536, uncached 0
1s: 145945MB/sec
2s: 158067MB/sec
3s: 157007MB/sec
4s: 148622MB/sec
5s: 118824MB/sec
6s: 70494MB/sec
7s: 41754MB/sec
8s: 90811MB/sec
9s: 92204MB/sec
10s: 95178MB/sec
11s: 95488MB/sec
12s: 95552MB/sec
13s: 96275MB/sec
where it's quite easy to see where the page cache filled up, and
performance went from good to erratic, and finally settles at a much
lower rate. Looking at top while this is ongoing, we see:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7535 root 20 0 267004 0 0 S 3199 0.0 8:40.65 uncached
3326 root 20 0 0 0 0 R 100.0 0.0 0:16.40 kswapd4
3327 root 20 0 0 0 0 R 100.0 0.0 0:17.22 kswapd5
3328 root 20 0 0 0 0 R 100.0 0.0 0:13.29 kswapd6
3332 root 20 0 0 0 0 R 100.0 0.0 0:11.11 kswapd10
3339 root 20 0 0 0 0 R 100.0 0.0 0:16.25 kswapd17
3348 root 20 0 0 0 0 R 100.0 0.0 0:16.40 kswapd26
3343 root 20 0 0 0 0 R 100.0 0.0 0:16.30 kswapd21
3344 root 20 0 0 0 0 R 100.0 0.0 0:11.92 kswapd22
3349 root 20 0 0 0 0 R 100.0 0.0 0:16.28 kswapd27
3352 root 20 0 0 0 0 R 99.7 0.0 0:11.89 kswapd30
3353 root 20 0 0 0 0 R 96.7 0.0 0:16.04 kswapd31
3329 root 20 0 0 0 0 R 96.4 0.0 0:11.41 kswapd7
3345 root 20 0 0 0 0 R 96.4 0.0 0:13.40 kswapd23
3330 root 20 0 0 0 0 S 91.1 0.0 0:08.28 kswapd8
3350 root 20 0 0 0 0 S 86.8 0.0 0:11.13 kswapd28
3325 root 20 0 0 0 0 S 76.3 0.0 0:07.43 kswapd3
3341 root 20 0 0 0 0 S 74.7 0.0 0:08.85 kswapd19
3334 root 20 0 0 0 0 S 71.7 0.0 0:10.04 kswapd12
3351 root 20 0 0 0 0 R 60.5 0.0 0:09.59 kswapd29
3323 root 20 0 0 0 0 R 57.6 0.0 0:11.50 kswapd1
[...]
which is just showing a partial list of the 32 kswapd threads that are
running mostly full tilt, burning ~28 full CPU cores.
If the same test case is run with RWF_DONTCACHE set for the buffered read,
the output looks as follows:
Reading bs 65536, uncached 0
1s: 153144MB/sec
2s: 156760MB/sec
3s: 158110MB/sec
4s: 158009MB/sec
5s: 158043MB/sec
6s: 157638MB/sec
7s: 157999MB/sec
8s: 158024MB/sec
9s: 157764MB/sec
10s: 157477MB/sec
11s: 157417MB/sec
12s: 157455MB/sec
13s: 157233MB/sec
14s: 156692MB/sec
which is just chugging along at ~155GB/sec of read performance. Looking
at top, we see:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7961 root 20 0 267004 0 0 S 3180 0.0 5:37.95 uncached
8024 axboe 20 0 14292 4096 0 R 1.0 0.0 0:00.13 top
where just the test app is using CPU, no reclaim is taking place outside
of the main thread. Not only is performance 65% better, it's also using
half the CPU to do it.
Link: https://lkml.kernel.org/r/20241220154831.1086649-9-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>