This is a preparation commit that replaces ctl_table with
ctl_table_header as the pointer that is passed around in proc_sysctl.c.
This will become necessary in subsequent commits when the size of the
ctl_table array can no longer be calculated by searching for an empty
sentinel (last empty ctl_table element) but will be carried along inside
the ctl_table_header struct.
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Pull misc fixes from Andrew Morton:
"14 hotfixes. 11 of these are cc:stable and the remainder address
post-6.4 issues, or are not considered suitable for -stable
backporting"
* tag 'mm-hotfixes-stable-2023-08-11-13-44' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/damon/core: initialize damo_filter->list from damos_new_filter()
nilfs2: fix use-after-free of nilfs_root in dirtying inodes via iput
selftests: cgroup: fix test_kmem_basic false positives
fs/proc/kcore: reinstate bounce buffer for KCORE_TEXT regions
MAINTAINERS: add maple tree mailing list
mm: compaction: fix endless looping over same migrate block
selftests: mm: ksm: fix incorrect evaluation of parameter
hugetlb: do not clear hugetlb dtor until allocating vmemmap
mm: memory-failure: avoid false hwpoison page mapped error info
mm: memory-failure: fix potential unexpected return value from unpoison_memory()
mm/swapfile: fix wrong swap entry type for hwpoisoned swapcache page
radix tree test suite: fix incorrect allocation size for pthreads
crypto, cifs: fix error handling in extract_iter_to_sg()
zsmalloc: fix races between modifications of fullness and isolated
generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.
Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
I'm looking at the directory handling due to the discussion about f_pos
locking (see commit 797964253d: "file: reinstate f_pos locking
optimization for regular files"), and wanting to clean that up.
And one source of ugliness is how we were supposed to move filesystems
over to the '->iterate_shared()' function that only takes the inode lock
for reading many many years ago, but several filesystems still use the
bad old '->iterate()' that takes the inode lock for exclusive access.
See commit 6192269444 ("introduce a parallel variant of ->iterate()")
that also added some documentation stating
Old method is only used if the new one is absent; eventually it will
be removed. Switch while you still can; the old one won't stay.
and that was back in April 2016. Here we are, many years later, and the
old version is still clearly sadly alive and well.
Now, some of those old style iterators are probably just because the
filesystem may end up having per-inode mutable data that it uses for
iterating a directory, but at least one case is just a mistake.
Al switched over most filesystems to use '->iterate_shared()' back when
it was introduced. In particular, the /proc filesystem was converted as
one of the first ones in commit f50752eaa0 ("switch all procfs
directories ->iterate_shared()").
But then later one new user of '->iterate()' was then re-introduced by
commit 6d9c939dbe ("procfs: add smack subdir to attrs").
And that's clearly not what we wanted, since that new case just uses the
same 'proc_pident_readdir()' and 'proc_pident_lookup()' helper functions
that other /proc pident directories use, and they are most definitely
safe to use with the inode lock held shared.
So just fix it.
This still leaves a fair number of oddball filesystems using the
old-style directory iterator (ceph, coda, exfat, jfs, ntfs, ocfs2,
overlayfs, and vboxsf), but at least we don't have any remaining in the
core filesystems.
I'm going to add a wrapper function that just drops the read-lock and
takes it as a write lock, so that we can clean up the core vfs layer and
make all the ugly 'this filesystem needs exclusive inode locking' be
just filesystem-internal warts.
I just didn't want to make that conversion when we still had a core user
left.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Due to an oversight in commit 1b3044e39a ("procfs: fix pthread
cross-thread naming if !PR_DUMPABLE") in switching from REG to NOD,
chmod operations on /proc/thread-self/comm were no longer blocked as
they are on almost all other procfs files.
A very similar situation with /proc/self/environ was used to as a root
exploit a long time ago, but procfs has SB_I_NOEXEC so this is simply a
correctness issue.
Ref: https://lwn.net/Articles/191954/
Ref: 6d76fa58b0 ("Don't allow chmod() on the /proc/<pid>/ files")
Fixes: 1b3044e39a ("procfs: fix pthread cross-thread naming if !PR_DUMPABLE")
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Message-Id: <20230713141001.27046-1-cyphar@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
All other files in /proc/$PID/ use proc_setattr().
Not using it allows the usage of chmod() on /proc/$PID/net, even on
other processes owned by the same user.
The same would probably also be true for other attributes to be changed.
As this technically represents an ABI change it is not marked for
stable so any unlikely regressions are caught during a full release cycle.
Fixes: e9720acd72 ("[NET]: Make /proc/net a symlink on /proc/self/net (v3)")
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/lkml/d0d111ef-edae-4760-83fb-36db84278da1@t-8ch.de/
Fixes: b4844fa0bd ("selftests/nolibc: implement a few tests for various syscalls")
Tested-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Message-Id: <20230624-proc-net-setattr-v1-2-73176812adee@weissschuh.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull asm-generic updates from Arnd Bergmann:
"These are cleanups for architecture specific header files:
- the comments in include/linux/syscalls.h have gone out of sync and
are really pointless, so these get removed
- The asm/bitsperlong.h header no longer needs to be architecture
specific on modern compilers, so use a generic version for newer
architectures that use new enough userspace compilers
- A cleanup for virt_to_pfn/virt_to_bus to have proper type checking,
forcing the use of pointers"
* tag 'asm-generic-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
syscalls: Remove file path comments from headers
tools arch: Remove uapi bitsperlong.h of hexagon and microblaze
asm-generic: Unify uapi bitsperlong.h for arm64, riscv and loongarch
m68k/mm: Make pfn accessors static inlines
arm64: memory: Make virt_to_pfn() a static inline
ARM: mm: Make virt_to_pfn() a static inline
asm-generic/page.h: Make pfn accessors static inlines
xen/netback: Pass (void *) to virt_to_page()
netfs: Pass a pointer to virt_to_page()
cifs: Pass a pointer to virt_to_page() in cifsglob
cifs: Pass a pointer to virt_to_page()
riscv: mm: init: Pass a pointer to virt_to_page()
ARC: init: Pass a pointer to virt_to_pfn() in init
m68k: Pass a pointer to virt_to_pfn() virt_to_page()
fs/proc/kcore.c: Pass a pointer to virt_addr_valid()
smatch reports
fs/proc/proc_sysctl.c:32:18: warning: symbol
'sysctl_mount_point' was not declared. Should it be static?
This variable is only used in its defining file, so it should be static.
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Pull sysctl updates from Luis Chamberlain:
"The changes for sysctl are in line with prior efforts to stop usage of
deprecated routines which incur recursion and also make it hard to
remove the empty array element in each sysctl array declaration.
The most difficult user to modify was parport which required a bit of
re-thinking of how to declare shared sysctls there, Joel Granados has
stepped up to the plate to do most of this work and eventual removal
of register_sysctl_table(). That work ended up saving us about 1465
bytes according to bloat-o-meter. Since we gained a few bloat-o-meter
karma points I moved two rather small sysctl arrays from
kernel/sysctl.c leaving us only two more sysctl arrays to move left.
Most changes have been tested on linux-next for about a month. The
last straggler patches are a minor parport fix, changes to the sysctl
kernel selftest so to verify correctness and prevent regressions for
the future change he made to provide an alternative solution for the
special sysctl mount point target which was using the now deprecated
sysctl child element.
This is all prep work to now finally be able to remove the empty array
element in all sysctl declarations / registrations which is expected
to save us a bit of bytes all over the kernel. That work will be
tested early after v6.5-rc1 is out"
* tag 'v6.5-rc1-sysctl-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
sysctl: replace child with an enumeration
sysctl: Remove debugging dump_stack
test_sysclt: Test for registering a mount point
test_sysctl: Add an option to prevent test skip
test_sysctl: Add an unregister sysctl test
test_sysctl: Group node sysctl test under one func
test_sysctl: Fix test metadata getters
parport: plug a sysctl register leak
sysctl: move security keys sysctl registration to its own file
sysctl: move umh sysctl registration to its own file
signal: move show_unhandled_signals sysctl to its own file
sysctl: remove empty dev table
sysctl: Remove register_sysctl_table
sysctl: Refactor base paths registrations
sysctl: stop exporting register_sysctl_table
parport: Removed sysctl related defines
parport: Remove register_sysctl_table from parport_default_proc_register
parport: Remove register_sysctl_table from parport_device_proc_register
parport: Remove register_sysctl_table from parport_proc_register
parport: Move magic number "15" to a define
Pull non-mm updates from Andrew Morton:
- Arnd Bergmann has fixed a bunch of -Wmissing-prototypes in top-level
directories
- Douglas Anderson has added a new "buddy" mode to the hardlockup
detector. It permits the detector to work on architectures which
cannot provide the required interrupts, by having CPUs periodically
perform checks on other CPUs
- Zhen Lei has enhanced kexec's ability to support two crash regions
- Petr Mladek has done a lot of cleanup on the hard lockup detector's
Kconfig entries
- And the usual bunch of singleton patches in various places
* tag 'mm-nonmm-stable-2023-06-24-19-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (72 commits)
kernel/time/posix-stubs.c: remove duplicated include
ocfs2: remove redundant assignment to variable bit_off
watchdog/hardlockup: fix typo in config HARDLOCKUP_DETECTOR_PREFER_BUDDY
powerpc: move arch_trigger_cpumask_backtrace from nmi.h to irq.h
devres: show which resource was invalid in __devm_ioremap_resource()
watchdog/hardlockup: define HARDLOCKUP_DETECTOR_ARCH
watchdog/sparc64: define HARDLOCKUP_DETECTOR_SPARC64
watchdog/hardlockup: make HAVE_NMI_WATCHDOG sparc64-specific
watchdog/hardlockup: declare arch_touch_nmi_watchdog() only in linux/nmi.h
watchdog/hardlockup: make the config checks more straightforward
watchdog/hardlockup: sort hardlockup detector related config values a logical way
watchdog/hardlockup: move SMP barriers from common code to buddy code
watchdog/buddy: simplify the dependency for HARDLOCKUP_DETECTOR_PREFER_BUDDY
watchdog/buddy: don't copy the cpumask in watchdog_next_cpu()
watchdog/buddy: cleanup how watchdog_buddy_check_hardlockup() is called
watchdog/hardlockup: remove softlockup comment in touch_nmi_watchdog()
watchdog/hardlockup: in watchdog_hardlockup_check() use cpumask_copy()
watchdog/hardlockup: don't use raw_cpu_ptr() in watchdog_hardlockup_kick()
watchdog/hardlockup: HAVE_NMI_WATCHDOG must implement watchdog_hardlockup_probe()
watchdog/hardlockup: keep kernel.nmi_watchdog sysctl as 0444 if probe fails
...
Pull mm updates from Andrew Morton:
- Yosry Ahmed brought back some cgroup v1 stats in OOM logs
- Yosry has also eliminated cgroup's atomic rstat flushing
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning
- Lorenzo Stoakes has done some maintanance work on the
get_user_pages() interface
- Liam Howlett continues with cleanups and maintenance work to the
maple tree code. Peng Zhang also does some work on maple tree
- Johannes Weiner has done some cleanup work on the compaction code
- David Hildenbrand has contributed additional selftests for
get_user_pages()
- Thomas Gleixner has contributed some maintenance and optimization
work for the vmalloc code
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting
- Christoph Hellwig has some cleanups for the filemap/directio code
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the
provided APIs rather than open-coding accesses
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings
- John Hubbard has a series of fixes to the MM selftesting code
- ZhangPeng continues the folio conversion campaign
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
from 128 to 8
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code
- Vishal Moola also has done some folio conversion work
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch
* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
mm/hugetlb: remove hugetlb_set_page_subpool()
mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
hugetlb: revert use of page_cache_next_miss()
Revert "page cache: fix page_cache_next/prev_miss off by one"
mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
mm: memcg: rename and document global_reclaim()
mm: kill [add|del]_page_to_lru_list()
mm: compaction: convert to use a folio in isolate_migratepages_block()
mm: zswap: fix double invalidate with exclusive loads
mm: remove unnecessary pagevec includes
mm: remove references to pagevec
mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
mm: remove struct pagevec
net: convert sunrpc from pagevec to folio_batch
i915: convert i915_gpu_error to use a folio_batch
pagevec: rename fbatch_count()
mm: remove check_move_unevictable_pages()
drm: convert drm_gem_put_pages() to use a folio_batch
i915: convert shmem_sg_free_table() to use a folio_batch
scatterlist: add sg_set_folio()
...
Pull execve updates from Kees Cook:
- Fix a few comments for correctness and typos (Baruch Siach)
- Small simplifications for binfmt (Christophe JAILLET)
- Set p_align to 4 for PT_NOTE in core dump (Fangrui Song)
* tag 'execve-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
binfmt_elf: fix comment typo s/reset/regset/
elf: correct note name comment
binfmt: Slightly simplify elf_fdpic_map_file()
binfmt: Use struct_size()
coredump, vmcore: Set p_align to 4 for PT_NOTE
Pull x86 confidential computing update from Borislav Petkov:
- Add support for unaccepted memory as specified in the UEFI spec v2.9.
The gist of it all is that Intel TDX and AMD SEV-SNP confidential
computing guests define the notion of accepting memory before using
it and thus preventing a whole set of attacks against such guests
like memory replay and the like.
There are a couple of strategies of how memory should be accepted -
the current implementation does an on-demand way of accepting.
* tag 'x86_cc_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
virt: sevguest: Add CONFIG_CRYPTO dependency
x86/efi: Safely enable unaccepted memory in UEFI
x86/sev: Add SNP-specific unaccepted memory support
x86/sev: Use large PSC requests if applicable
x86/sev: Allow for use of the early boot GHCB for PSC requests
x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
x86/sev: Fix calculation of end address based on number of pages
x86/tdx: Add unaccepted memory support
x86/tdx: Refactor try_accept_one()
x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub
efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory
efi: Add unaccepted memory support
x86/boot/compressed: Handle unaccepted memory
efi/libstub: Implement support for unaccepted memory
efi/x86: Get full memory map in allocate_e820()
mm: Add support for unaccepted memory
This is part of the effort to remove the empty element at the end of
ctl_table structs. "child" was a deprecated elem in this struct and was
being used to differentiate between two types of ctl_tables: "normal"
and "permanently emtpy".
What changed?:
* Replace "child" with an enumeration that will have two values: the
default (0) and the permanently empty (1). The latter is left at zero
so when struct ctl_table is created with kzalloc or in a local
context, it will have the zero value by default. We document the
new enum with kdoc.
* Remove the "empty child" check from sysctl_check_table
* Remove count_subheaders function as there is no longer a need to
calculate how many headers there are for every child
* Remove the recursive call to unregister_sysctl_table as there is no
need to traverse down the child tree any longer
* Add a new SYSCTL_PERM_EMPTY_DIR binary flag
* Remove the last remanence of child from partport/procfs.c
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Remove unneeded dump_stack in __register_sysctl_table
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
The pagewalk in pagemap_read reads one PTE past the end of the requested
range, and stops when the buffer runs out of space. While it produces
the right result, the extra read is unnecessary and less performant.
I timed the following command before and after this patch:
dd count=100000 if=/proc/self/pagemap of=/dev/null
The results are consistently within 0.001s across 5 runs.
Before:
100000+0 records in
100000+0 records out
51200000 bytes (51 MB) copied, 0.0763159 s, 671 MB/s
real 0m0.078s
user 0m0.012s
sys 0m0.065s
After:
100000+0 records in
100000+0 records out
51200000 bytes (51 MB) copied, 0.0487928 s, 1.0 GB/s
real 0m0.050s
user 0m0.011s
sys 0m0.039s
Link: https://lkml.kernel.org/r/20230515172608.3558391-1-yuanchu@google.com
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.
There are several ways the kernel can deal with unaccepted memory:
1. Accept all the memory during boot. It is easy to implement and it
doesn't have runtime cost once the system is booted. The downside is
very long boot time.
Accept can be parallelized to multiple CPUs to keep it manageable
(i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
memory bandwidth and does not scale beyond the point.
2. Accept a block of memory on the first use. It requires more
infrastructure and changes in page allocator to make it work, but
it provides good boot time.
On-demand memory accept means latency spikes every time kernel steps
onto a new memory block. The spikes will go away once workload data
set size gets stabilized or all memory gets accepted.
3. Accept all memory in background. Introduce a thread (or multiple)
that gets memory accepted proactively. It will minimize time the
system experience latency spikes on memory allocation while keeping
low boot time.
This approach cannot function on its own. It is an extension of #2:
background memory acceptance requires functional scheduler, but the
page allocator may need to tap into unaccepted memory before that.
The downside of the approach is that these threads also steal CPU
cycles and memory bandwidth from the user's workload and may hurt
user experience.
Implement #1 and #2 for now. #2 is the default. Some workloads may want
to use #1 with accept_memory=eager in kernel command line. #3 can be
implemented later based on user's demands.
Support of unaccepted memory requires a few changes in core-mm code:
- memblock accepts memory on allocation. It serves early boot memory
allocations and doesn't limit them to pre-accepted pool of memory.
- page allocator accepts memory on the first allocation of the page.
When kernel runs out of accepted memory, it accepts memory until the
high watermark is reached. It helps to minimize fragmentation.
EFI code will provide two helpers if the platform supports unaccepted
memory:
- accept_memory() makes a range of physical addresses accepted.
- range_contains_unaccepted_memory() checks anything within the range
of physical addresses requires acceptance.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com> # memblock
Link: https://lore.kernel.org/r/20230606142637.5171-2-kirill.shutemov@linux.intel.com
The virt_addr_valid() should be passed a pointer, the current
code passing a long unsigned int is just exploiting the
unintentional polymorphism of these calls being implemented
as preprocessor macros.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
This is part of the general push to deprecate register_sysctl_paths and
register_sysctl_table. After removing all the calling functions, we
remove both the register_sysctl_table function and the documentation
check that appeared in check-sysctl-docs awk script.
We save 595 bytes with this change:
./scripts/bloat-o-meter vmlinux.1.refactor-base-paths vmlinux.2.remove-sysctl-table
add/remove: 2/8 grow/shrink: 1/0 up/down: 1154/-1749 (-595)
Function old new delta
count_subheaders - 983 +983
unregister_sysctl_table 29 184 +155
__pfx_count_subheaders - 16 +16
__pfx_unregister_sysctl_table.part 16 - -16
__pfx_register_leaf_sysctl_tables.constprop 16 - -16
__pfx_count_subheaders.part 16 - -16
__pfx___register_sysctl_base 16 - -16
unregister_sysctl_table.part 136 - -136
__register_sysctl_base 478 - -478
register_leaf_sysctl_tables.constprop 524 - -524
count_subheaders.part 547 - -547
Total: Before=21257652, After=21257057, chg -0.00%
[mcgrof: remove register_leaf_sysctl_tables and append_path too and
add bloat-o-meter stats]
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
We make register_sysctl_table static because the only function calling
it is in fs/proc/proc_sysctl.c (__register_sysctl_base). We remove it
from the sysctl.h header and modify the documentation in both the header
and proc_sysctl.c files to mention "register_sysctl" instead of
"register_sysctl_table".
This plus the commits that remove register_sysctl_table from parport
save 217 bytes:
./scripts/bloat-o-meter .bsysctl/vmlinux.old .bsysctl/vmlinux.new
add/remove: 0/1 grow/shrink: 5/1 up/down: 458/-675 (-217)
Function old new delta
__register_sysctl_base 8 286 +278
parport_proc_register 268 379 +111
parport_device_proc_register 195 247 +52
kzalloc.constprop 598 608 +10
parport_default_proc_register 62 69 +7
register_sysctl_table 291 - -291
parport_sysctl_template 1288 904 -384
Total: Before=8603076, After=8602859, chg -0.00%
Signed-off-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
The deprecation for register_sysctl_paths() is over. We can rejoice as
we nuke register_sysctl_paths(). The routine register_sysctl_table()
was the only user left of register_sysctl_paths(), so we can now just
open code and move the implementation over to what used to be
to __register_sysctl_paths().
The old dynamic struct ctl_table_set *set is now the point to
sysctl_table_root.default_set.
The old dynamic const struct ctl_path *path was being used in the
routine register_sysctl_paths() with a static:
static const struct ctl_path null_path[] = { {} };
Since this is a null path we can now just simplfy the old routine
and remove its use as its always empty.
This saves us a total of 230 bytes.
$ ./scripts/bloat-o-meter vmlinux.old vmlinux
add/remove: 2/7 grow/shrink: 1/1 up/down: 1015/-1245 (-230)
Function old new delta
register_leaf_sysctl_tables.constprop - 524 +524
register_sysctl_table 22 497 +475
__pfx_register_leaf_sysctl_tables.constprop - 16 +16
null_path 8 - -8
__pfx_register_sysctl_paths 16 - -16
__pfx_register_leaf_sysctl_tables 16 - -16
__pfx___register_sysctl_paths 16 - -16
__register_sysctl_base 29 12 -17
register_sysctl_paths 18 - -18
register_leaf_sysctl_tables 534 - -534
__register_sysctl_paths 620 - -620
Total: Before=21259666, After=21259436, chg -0.00%
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Pull x86 LAM (Linear Address Masking) support from Dave Hansen:
"Add support for the new Linear Address Masking CPU feature.
This is similar to ARM's Top Byte Ignore and allows userspace to store
metadata in some bits of pointers without masking it out before use"
* tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside
x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA
selftests/x86/lam: Add test cases for LAM vs thread creation
selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking
selftests/x86/lam: Add inherit test cases for linear-address masking
selftests/x86/lam: Add io_uring test cases for linear-address masking
selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking
selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking
x86/mm/iommu/sva: Make LAM and SVA mutually exclusive
iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()
mm: Expose untagging mask in /proc/$PID/status
x86/mm: Provide arch_prctl() interface for LAM
x86/mm: Reduce untagged_addr() overhead for systems without LAM
x86/uaccess: Provide untagged_addr() and remove tags before address check
mm: Introduce untagged_addr_remote()
x86/mm: Handle LAM on context switch
x86: CPUID and CR3/CR4 flags for Linear Address Masking
x86: Allow atomic MM_CONTEXT flags setting
x86/mm: Rework address range check in get_user() and put_user()
Pull non-MM updates from Andrew Morton:
"Mainly singleton patches all over the place.
Series of note are:
- updates to scripts/gdb from Glenn Washburn
- kexec cleanups from Bjorn Helgaas"
* tag 'mm-nonmm-stable-2023-04-27-16-01' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (50 commits)
mailmap: add entries for Paul Mackerras
libgcc: add forward declarations for generic library routines
mailmap: add entry for Oleksandr
ocfs2: reduce ioctl stack usage
fs/proc: add Kthread flag to /proc/$pid/status
ia64: fix an addr to taddr in huge_pte_offset()
checkpatch: introduce proper bindings license check
epoll: rename global epmutex
scripts/gdb: add GDB convenience functions $lx_dentry_name() and $lx_i_dentry()
scripts/gdb: create linux/vfs.py for VFS related GDB helpers
uapi/linux/const.h: prefer ISO-friendly __typeof__
delayacct: track delays from IRQ/SOFTIRQ
scripts/gdb: timerlist: convert int chunks to str
scripts/gdb: print interrupts
scripts/gdb: raise error with reduced debugging information
scripts/gdb: add a Radix Tree Parser
lib/rbtree: use '+' instead of '|' for setting color.
proc/stat: remove arch_idle_time()
checkpatch: check for misuse of the link tags
checkpatch: allow Closes tags with links
...
Pull MM updates from Andrew Morton:
- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics
flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim
accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...
Pull sysctl updates from Luis Chamberlain:
"This only does a few sysctl moves from the kernel/sysctl.c file, the
rest of the work has been put towards deprecating two API calls which
incur recursion and prevent us from simplifying the registration
process / saving memory per move. Most of the changes have been
soaking on linux-next since v6.3-rc3.
I've slowed down the kernel/sysctl.c moves due to Matthew Wilcox's
feedback that we should see if we could *save* memory with these moves
instead of incurring more memory. We currently incur more memory since
when we move a syctl from kernel/sysclt.c out to its own file we end
up having to add a new empty sysctl used to register it. To achieve
saving memory we want to allow syctls to be passed without requiring
the end element being empty, and just have our registration process
rely on ARRAY_SIZE(). Without this, supporting both styles of sysctls
would make the sysctl registration pretty brittle, hard to read and
maintain as can be seen from Meng Tang's efforts to do just this [0].
Fortunately, in order to use ARRAY_SIZE() for all sysctl registrations
also implies doing the work to deprecate two API calls which use
recursion in order to support sysctl declarations with subdirectories.
And so during this development cycle quite a bit of effort went into
this deprecation effort. I've annotated the following two APIs are
deprecated and in few kernel releases we should be good to remove
them:
- register_sysctl_table()
- register_sysctl_paths()
During this merge window we should be able to deprecate and unexport
register_sysctl_paths(), we can probably do that towards the end of
this merge window.
Deprecating register_sysctl_table() will take a bit more time but this
pull request goes with a few example of how to do this.
As it turns out each of the conversions to move away from either of
these two API calls *also* saves memory. And so long term, all these
changes *will* prove to have saved a bit of memory on boot.
The way I see it then is if remove a user of one deprecated call, it
gives us enough savings to move one kernel/sysctl.c out from the
generic arrays as we end up with about the same amount of bytes.
Since deprecating register_sysctl_table() and register_sysctl_paths()
does not require maintainer coordination except the final unexport
you'll see quite a bit of these changes from other pull requests, I've
just kept the stragglers after rc3"
Link: https://lkml.kernel.org/r/ZAD+cpbrqlc5vmry@bombadil.infradead.org [0]
* tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (29 commits)
fs: fix sysctls.c built
mm: compaction: remove incorrect #ifdef checks
mm: compaction: move compaction sysctl to its own file
mm: memory-failure: Move memory failure sysctls to its own file
arm: simplify two-level sysctl registration for ctl_isa_vars
ia64: simplify one-level sysctl registration for kdump_ctl_table
utsname: simplify one-level sysctl registration for uts_kern_table
ntfs: simplfy one-level sysctl registration for ntfs_sysctls
coda: simplify one-level sysctl registration for coda_table
fs/cachefiles: simplify one-level sysctl registration for cachefiles_sysctls
xfs: simplify two-level sysctl registration for xfs_table
nfs: simplify two-level sysctl registration for nfs_cb_sysctls
nfs: simplify two-level sysctl registration for nfs4_cb_sysctls
lockd: simplify two-level sysctl registration for nlm_sysctls
proc_sysctl: enhance documentation
xen: simplify sysctl registration for balloon
md: simplify sysctl registration
hv: simplify sysctl registration
scsi: simplify sysctl registration with register_sysctl()
csky: simplify alignment sysctl registration
...
This adds the general_profit KSM sysfs knob and the process profit metric
knobs to ksm_stat.
1) expose general_profit metric
The documentation mentions a general profit metric, however this
metric is not calculated. In addition the formula depends on the size
of internal structures, which makes it more difficult for an
administrator to make the calculation. Adding the metric for a better
user experience.
2) document general_profit sysfs knob
3) calculate ksm process profit metric
The ksm documentation mentions the process profit metric and how to
calculate it. This adds the calculation of the metric.
4) mm: expose ksm process profit metric in ksm_stat
This exposes the ksm process profit metric in /proc/<pid>/ksm_stat.
The documentation mentions the formula for the ksm process profit
metric, however it does not calculate it. In addition the formula
depends on the size of internal structures. So it makes sense to
expose it.
5) document new procfs ksm knobs
Link: https://lkml.kernel.org/r/20230418051342.1919757-3-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Expand documentation to clarify:
o that paths don't need to exist for the new API callers
o clarify that we *require* callers to keep the memory of
the table around during the lifetime of the sysctls
o annotate routines we are trying to deprecate and later remove
Cc: stable@vger.kernel.org # v5.17
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Move the code which creates the subdirectories for a ctl table
into a helper routine so to make it easier to review. Document
the goal.
This creates no functional changes.
Reviewed-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Update the docs for __register_sysctl_table() to make it clear no child
entries can be passed. When the child is true these are non-leaf entries
on the ctl table and sysctl treats these as directories. The point to
__register_sysctl_table() is to deal only with directories not part of
the ctl table where thay may riside, to be simple and avoid recursion.
While at it, hint towards using long on extra1 and extra2 later.
Cc: stable@vger.kernel.org # v5.17
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Patch series "convert read_kcore(), vread() to use iterators", v8.
While reviewing Baoquan's recent changes to permit vread() access to
vm_map_ram regions of vmalloc allocations, Willy pointed out [1] that it
would be nice to refactor vread() as a whole, since its only user is
read_kcore() and the existing form of vread() necessitates the use of a
bounce buffer.
This patch series does exactly that, as well as adjusting how we read the
kernel text section to avoid the use of a bounce buffer in this case as
well.
This has been tested against the test case which motivated Baoquan's
changes in the first place [2] which continues to function correctly, as
do the vmalloc self tests.
This patch (of 4):
Commit df04abfd18 ("fs/proc/kcore.c: Add bounce buffer for ktext data")
introduced the use of a bounce buffer to retrieve kernel text data for
/proc/kcore in order to avoid failures arising from hardened user copies
enabled by CONFIG_HARDENED_USERCOPY in check_kernel_text_object().
We can avoid doing this if instead of copy_to_user() we use
_copy_to_user() which bypasses the hardening check. This is more
efficient than using a bounce buffer and simplifies the code.
We do so as part an overall effort to eliminate bounce buffer usage in the
function with an eye to converting it an iterator read.
Link: https://lkml.kernel.org/r/cover.1679566220.git.lstoakes@gmail.com
Link: https://lore.kernel.org/all/Y8WfDSRkc%2FOHP3oD@casper.infradead.org/ [1]
Link: https://lore.kernel.org/all/87ilk6gos2.fsf@oracle.com/T/#u [2]
Link: https://lkml.kernel.org/r/fd39b0bfa7edc76d360def7d034baaee71d90158.1679511146.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently the memtest results were only presented in dmesg.
When running a large fleet of devices without ECC RAM it's currently not
easy to do bulk monitoring for memory corruption. You have to parse
dmesg, but that's a ring buffer so the error might disappear after some
time. In general I do not consider dmesg to be a great API to query RAM
status.
In several companies I've seen such errors remain undetected and cause
issues for way too long. So I think it makes sense to provide a
monitoring API, so that we can safely detect and act upon them.
This adds /proc/meminfo entry which can be easily used by scripts.
Link: https://lkml.kernel.org/r/20230321103430.7130-1-tomas.mudrunka@gmail.com
Signed-off-by: Tomas Mudrunka <tomas.mudrunka@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>