linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-15 17:51:41 -04:00

Author	SHA1	Message	Date
Christian Brauner	e8c84e2082	statmount: don't call path_put() under namespace semaphore Massage statmount() and make sure we don't call path_put() under the namespace semaphore. If we put the last reference we're fscked. Fixes: `46eae99ef7` ("add statmount(2) syscall") Cc: stable@vger.kernel.org # v6.8+ Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-26 10:16:06 +02:00
Christian Göttsche	b9cb7e59ac	pid: use ns_capable_noaudit() when determining net sysctl permissions The capability check should not be audited since it is only being used to determine the inode permissions. A failed check does not indicate a violation of security policy but, when an LSM is enabled, a denial audit message was being generated. The denial audit message can either lead to the capability being unnecessarily allowed in a security policy, or being silenced potentially masking a legitimate capability check at a later point in time. Similar to commit `d6169b0206` ("net: Use ns_capable_noaudit() when determining net sysctl permissions") Fixes: `7863dcc72d` ("pid: allow pid_max to be set per pid namespace") CC: Christian Brauner <brauner@kernel.org> CC: linux-security-module@vger.kernel.org CC: selinux@vger.kernel.org Signed-off-by: Christian Göttsche <cgzones@googlemail.com> Acked-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 13:08:31 +02:00
Mateusz Guzik	f99b391778	fs: rename generic_delete_inode() and generic_drop_inode() generic_delete_inode() is rather misleading for what the routine is doing. inode_just_drop() should be much clearer. The new naming is inconsistent with generic_drop_inode(), so rename that one as well with inode_ as the prefix. No functional changes. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-15 16:09:42 +02:00
Geert Uytterhoeven	7479260860	init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD INITRAMFS_PRESERVE_MTIME is only used in init/initramfs.c and init/initramfs_test.c. Hence add a dependency on BLK_DEV_INITRD, to prevent asking the user about this feature when configuring a kernel without initramfs support. Fixes: `1274aea127` ("initramfs: add INITRAMFS_PRESERVE_MTIME Kconfig option") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Martin Wilck <mwilck@suse.com> Reviewed-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-15 15:02:17 +02:00
Thorsten Blum	afd77d2050	initramfs: Replace strcpy() with strscpy() in find_link() strcpy() is deprecated; use strscpy() instead. Link: https://github.com/KSPP/linux/issues/88 Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-15 14:52:02 +02:00
Thorsten Blum	beb022ef92	initrd: Use str_plural() in rd_load_image() Add the local variable 'nr_disks' and replace the manual ternary "s" pluralization with the standardized str_plural() helper function. Use pr_notice() instead of printk(KERN_NOTICE) to silence a checkpatch warning. No functional changes intended. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-15 14:47:14 +02:00
Thorsten Blum	e60625e7ce	initramfs: Use struct_size() helper to improve dir_add() Use struct_size() to calculate the number of bytes to allocate for a new directory entry. No functional changes. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-15 14:42:56 +02:00
Thorsten Blum	84f1766bdb	initrd: Fix unused variable warning in rd_load_image() on s390 The local variables 'rotator' and 'rotate' (used for the progress indicator) aren't used on s390. Building the kernel with W=1 generates the following warning: init/do_mounts_rd.c:192:17: warning: variable 'rotate' set but not used [-Wunused-but-set-variable] 192 | unsigned short rotate = 0; | ^ 1 warning generated. Remove the preprocessor directives and use the IS_ENABLED(CONFIG_S390) macro instead, allowing the compiler to optimize away unused variables and avoid the warning on s390. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-15 14:28:37 +02:00
Mateusz Guzik	af67f4c1cd	fs: use the switch statement in init_special_inode() Similar to may_open(). No functional changes. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-15 14:28:37 +02:00
Max Kellermann	796667c9dc	fs/proc/namespaces: make ns_entries const Global variables that are never modified should be "const" so that they live in the .rodata section instead of the .data section of the kernel, gaining the protection of the kernel's strict memory permissions as described in Documentation/security/self-protection.rst Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-15 14:28:37 +02:00
Jeff Layton	c593b9d6c4	filelock: add FL_RECLAIM to show_fl_flags() macro Show the FL_RECLAIM flag symbolically in tracepoints. Fixes: `bb0a55bb71` ("nfs: don't allow reexport reclaims") Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/20250903-filelock-v1-1-f2926902962d@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-05 15:54:27 +02:00
Christian Brauner	e493b83b10	Merge patch "eventpoll: Fix priority inversion problem" Nam Cao <namcao@linutronix.de> says: Hi, This v4 is the follow-up to v3 at: https://lore.kernel.org/linux-fsdevel/20250527090836.1290532-1-namcao@linutronix.de/ which resolves a priority inversion problem. The v3 patch was merged, but then got reverted due to regression. The direction of v3 was wrong in the first place. It changed the eventpoll's event list to be lockless, making the code harder to read. I stared at the patch again, but still couldn't figure out what the bug is. The performance numbers were indeed impressive with lockless, but the numbers are from a benchmark, which is unclear whether it really reflects real workload. This v4 takes a completely different approach: it converts the rwlock to spinlock. Unfortunately, unlike rwlock, spinlock does not allow concurrent readers. This patch therefore reduces the performance numbers. I have some optimization tricks to reduce spinlock contention and bring the numbers back. But Linus appeared and declared that epoll's performance shouldn't be the priority. So I decided not to post those optimization patches. * patches from https://lore.kernel.org/cover.1752581388.git.namcao@linutronix.de: eventpoll: Replace rwlock with spinlock Link: https://lore.kernel.org/cover.1752581388.git.namcao@linutronix.de Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-05 15:51:29 +02:00
Nam Cao	0c43094f8c	eventpoll: Replace rwlock with spinlock The ready event list of an epoll object is protected by read-write semaphore: - The consumer (waiter) acquires the write lock and takes items. - The producer (waker) takes the read lock and adds items. The point of this design is enabling epoll to scale well with large number of producers, as multiple producers can hold the read lock at the same time. Unfortunately, this implementation may cause scheduling priority inversion problem. Suppose the consumer has higher scheduling priority than the producer. The consumer needs to acquire the write lock, but may be blocked by the producer holding the read lock. Since read-write semaphore does not support priority-boosting for the readers (even with CONFIG_PREEMPT_RT=y), we have a case of priority inversion: a higher priority consumer is blocked by a lower priority producer. This problem was reported in [1]. Furthermore, this could also cause stall problem, as described in [2]. Fix this problem by replacing rwlock with spinlock. This reduces the event bandwidth, as the producers now have to contend with each other for the spinlock. According to the benchmark from https://github.com/rouming/test-tools/blob/master/stress-epoll.c (events/ms, before -> after): On 12 x86 CPUs: 8 threads: 7162 -> 4956 (-31%); 16 threads: 8733 -> 5383 (-38%); 32 threads: 7968 -> 5572 (-30%); 64 threads: 10652 -> 5739 (-46%); 128 threads: 11236 -> 5931 (-47%). On 4 riscv CPUs: 8 threads: 2958 -> 2833 (-4%); 16 threads: 3323 -> 3097 (-7%); 32 threads: 3451 -> 3240 (-6%); 64 threads: 3554 -> 3178 (-11%); 128 threads: 3601 -> 3235 (-10%). Although the numbers look bad, it should be noted that this benchmark creates multiple threads who do nothing except constantly generating new epoll events, thus contention on the spinlock is high. For real workload, the event rate is likely much lower, and the performance drop is not as bad. 
Using another benchmark (perf bench epoll wait) where spinlock contention is lower, improvement is even observed on x86: On 12 x86 CPUs: Before: Averaged 110279 operations/sec (+- 1.09%), total secs = 8 After: Averaged 114577 operations/sec (+- 2.25%), total secs = 8 On 4 riscv CPUs: Before: Averaged 175767 operations/sec (+- 0.62%), total secs = 8 After: Averaged 167396 operations/sec (+- 0.23%), total secs = 8 In conclusion, no one is likely to be upset over this change. After all, spinlock was used originally for years, and the commit which converted to rwlock didn't mention a real workload, just that the benchmark numbers are nice. This patch is not exactly the revert of commit `a218cc4914` ("epoll: use rwlock in order to reduce ep_poll_callback() contention"), because git revert conflicts in some places which are not obvious on the resolution. This patch is intended to be backported, therefore go with the obvious approach: - Replace rwlock_t with spinlock_t one to one - Delete list_add_tail_lockless() and chain_epi_lockless(). These were introduced to allow producers to concurrently add items to the list. But now that spinlock no longer allows producers to touch the event list concurrently, these two functions are not necessary anymore. Fixes: `a218cc4914` ("epoll: use rwlock in order to reduce ep_poll_callback() contention") Signed-off-by: Nam Cao <namcao@linutronix.de> Link: https://lore.kernel.org/ec92458ea357ec503c737ead0f10b2c6e4c37d47.1752581388.git.namcao@linutronix.de Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Cc: stable@vger.kernel.org Reported-by: Frederic Weisbecker <frederic@kernel.org> Closes: https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/ [1] Reported-by: Valentin Schneider <vschneid@redhat.com> Closes: https://lore.kernel.org/linux-rt-users/xhsmhttqvnall.mognet@vschneid.remote.csb/ [2] Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-05 15:51:24 +02:00
Christian Brauner	46582a15c1	Merge patch series "procfs: make reference pidns more user-visible" Aleksa Sarai <cyphar@cyphar.com> says: Ever since the introduction of pid namespaces, procfs has had very implicit behaviour surrounding them (the pidns used by a procfs mount is auto-selected based on the mounting process's active pidns, and the pidns itself is basically hidden once the mount has been constructed). /* pidns mount option for procfs */ This implicit behaviour has historically meant that userspace was required to do some special dances in order to configure the pidns of a procfs mount as desired. Examples include: * In order to bypass the mnt_too_revealing() check, Kubernetes creates a procfs mount from an empty pidns so that user namespaced containers can be nested (without this, the nested containers would fail to mount procfs). But this requires forking off a helper process because you cannot just one-shot this using mount(2). * Container runtimes in general need to fork into a container before configuring its mounts, which can lead to security issues in the case of shared-pidns containers (a privileged process in the pidns can interact with your container runtime process). While SUID_DUMP_DISABLE and user namespaces make this less of an issue, the strict need for this due to a minor uAPI wart is kind of unfortunate. Things would be much easier if there was a way for userspace to just specify the pidns they want. Patch 1 implements a new "pidns" argument which can be set using fsconfig(2): fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd); fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0); or classic mount(2) / mount(8): // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid"); The initial security model I have in this RFC is to be as conservative as possible and just mirror the security model for setns(2) -- which means that you can only set pidns=... 
to pid namespaces that your current pid namespace is a direct ancestor of and you have CAP_SYS_ADMIN privileges over the pid namespace. This fulfils the requirements of container runtimes, but I suspect that this may be too strict for some usecases. The pidns argument is not displayed in mountinfo -- it's not clear to me what value it would make sense to show (maybe we could just use ns_dname to provide an identifier for the namespace, but this number would be fairly useless to userspace). I'm open to suggestions. Note that PROCFS_GET_PID_NAMESPACE (see below) does at least let userspace get information about this outside of mountinfo. Note that you cannot change the pidns of an already-created procfs instance. The primary reason is that allowing this to be changed would require RCU-protecting proc_pid_ns(sb) and thus auditing all of fs/proc/* and some of the users in fs/* to make sure they wouldn't UAF the pid namespace. Since creating procfs instances is very cheap, it seems unnecessary to overcomplicate this upfront. Trying to reconfigure procfs this way errors out with -EBUSY. * patches from https://lore.kernel.org/20250805-procfs-pidns-api-v4-0-705f984940e7@cyphar.com: selftests/proc: add tests for new pidns APIs procfs: add "pidns" mount option pidns: move is-ancestor logic to helper Link: https://lore.kernel.org/20250805-procfs-pidns-api-v4-0-705f984940e7@cyphar.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-02 11:40:35 +02:00
Aleksa Sarai	5554d820f7	selftests/proc: add tests for new pidns APIs Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/20250805-procfs-pidns-api-v4-4-705f984940e7@cyphar.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-02 11:40:20 +02:00
Aleksa Sarai	fe49652e36	procfs: add "pidns" mount option Since the introduction of pid namespaces, their interaction with procfs has been entirely implicit in ways that require a lot of dancing around by programs that need to construct sandboxes with different PID namespaces. Being able to explicitly specify the pid namespace to use when constructing a procfs super block will allow programs to no longer need to fork off a process which then does unshare(2) / setns(2) and forks again in order to construct a procfs in a pidns. So, provide a "pidns" mount option which allows such users to just explicitly state which pid namespace they want that procfs instance to use. This interface can be used with fsconfig(2) either with a file descriptor or a path: fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd); fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0); or with classic mount(2) / mount(8): // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid"); As this new API is effectively shorthand for setns(2) followed by mount(2), the permission model for this mirrors pidns_install() to avoid opening up new attack surfaces by loosening the existing permission model. In order to avoid having to RCU-protect all users of proc_pid_ns() (to avoid UAFs), attempting to reconfigure an existing procfs instance's pid namespace will error out with -EBUSY. Creating new procfs instances is quite cheap, so this should not be an impediment to most users, and lets us avoid a lot of churn in fs/proc/* for a feature that it seems unlikely userspace would use. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/20250805-procfs-pidns-api-v4-2-705f984940e7@cyphar.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-02 11:37:24 +02:00
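The fsconfig(2) sequence quoted in the commit above can be sketched from user space roughly as follows. This is a minimal sketch, not the series' selftest: the helper name `procfs_ctx_with_pidns` is hypothetical, the `SKETCH_*` constants are copies of the `enum fsconfig_command` values from `<linux/mount.h>` (duplicated only to avoid header conflicts), and the fallback syscall numbers are the ones shared by most architectures. It needs a kernel carrying this commit and sufficient privileges; otherwise it simply returns a negative errno.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for old headers; assumed to match the generic syscall table. */
#ifndef SYS_fsopen
#define SYS_fsopen 430
#endif
#ifndef SYS_fsconfig
#define SYS_fsconfig 431
#endif

/* Copies of enum fsconfig_command values from <linux/mount.h>. */
#define SKETCH_FSCONFIG_SET_FD     5
#define SKETCH_FSCONFIG_CMD_CREATE 6

/*
 * Hypothetical helper: open a procfs filesystem context and bind it to
 * the pid namespace referred to by ns_path (e.g. "/proc/self/ns/pid").
 * Returns the context fd on success or a negative errno on failure.
 */
static int procfs_ctx_with_pidns(const char *ns_path)
{
	int nsfd, fsfd, err;

	nsfd = open(ns_path, O_RDONLY | O_CLOEXEC);
	if (nsfd < 0)
		return -errno;

	fsfd = (int)syscall(SYS_fsopen, "proc", 0);
	if (fsfd < 0) {
		err = errno;
		close(nsfd);
		return -err;
	}

	/* Select the pidns first, then create the superblock. */
	if (syscall(SYS_fsconfig, fsfd, SKETCH_FSCONFIG_SET_FD,
		    "pidns", NULL, nsfd) < 0 ||
	    syscall(SYS_fsconfig, fsfd, SKETCH_FSCONFIG_CMD_CREATE,
		    NULL, NULL, 0) < 0) {
		err = errno;
		close(fsfd);
		close(nsfd);
		return -err;
	}

	close(nsfd);
	return fsfd;
}
```

A caller would normally follow up with fsmount(2) and move_mount(2) to attach the instance; that part is omitted here to keep the sketch short.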
Aleksa Sarai	7df8782012	pidns: move is-ancestor logic to helper This check will be needed in later patches, and there's no point open-coding it each time. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/20250805-procfs-pidns-api-v4-1-705f984940e7@cyphar.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-02 11:37:24 +02:00
Christian Brauner	998541db0e	Merge patch series "vfs: if RESOLVE_NO_XDEV passed to openat2, don't trigger automounts" Askar Safin <safinaskar@zohomail.com> says: openat2 had a bug: if we pass RESOLVE_NO_XDEV, then openat2 doesn't traverse through automounts, but may still trigger them. See this link for full bug report with reproducer: https://lore.kernel.org/linux-fsdevel/20250817075252.4137628-1-safinaskar@zohomail.com/ This patchset fixes the bug. RESOLVE_NO_XDEV logic hopefully becomes more clear: now we immediately fail when we cross mountpoints. * patches from https://lore.kernel.org/20250825181233.2464822-1-safinaskar@zohomail.com: openat2: don't trigger automounts with RESOLVE_NO_XDEV namei: move cross-device check to __traverse_mounts namei: remove LOOKUP_NO_XDEV check from handle_mounts namei: move cross-device check to traverse_mounts Link: https://lore.kernel.org/20250825181233.2464822-1-safinaskar@zohomail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-02 10:40:58 +02:00
Askar Safin	042a60680d	openat2: don't trigger automounts with RESOLVE_NO_XDEV openat2 had a bug: if we pass RESOLVE_NO_XDEV, then openat2 doesn't traverse through automounts, but may still trigger them. (See the link for full bug report with reproducer.) This commit fixes this bug. Link: https://lore.kernel.org/linux-fsdevel/20250817075252.4137628-1-safinaskar@zohomail.com/ Fixes: `fddb5d430a` ("open: introduce openat2(2) syscall") Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Cc: stable@vger.kernel.org Signed-off-by: Askar Safin <safinaskar@zohomail.com> Link: https://lore.kernel.org/20250825181233.2464822-5-safinaskar@zohomail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-02 10:40:43 +02:00
Askar Safin	8ded1fde08	namei: move cross-device check to __traverse_mounts This is preparation to RESOLVE_NO_XDEV fix in following commits. Also this commit makes LOOKUP_NO_XDEV logic more clear: now we immediately fail with EXDEV on first mount crossing instead of waiting for very end. No functional change intended Signed-off-by: Askar Safin <safinaskar@zohomail.com> Link: https://lore.kernel.org/20250825181233.2464822-4-safinaskar@zohomail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-02 10:40:43 +02:00
Askar Safin	8b966d00b3	namei: remove LOOKUP_NO_XDEV check from handle_mounts This is preparation to RESOLVE_NO_XDEV fix in following commits. No functional change intended. The only place that ever looks at ND_JUMPED in nd->state is complete_walk() and we are not going to reach it if handle_mounts() returns an error Signed-off-by: Askar Safin <safinaskar@zohomail.com> Link: https://lore.kernel.org/20250825181233.2464822-3-safinaskar@zohomail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-02 10:40:42 +02:00
Askar Safin	11c2b7ec2e	namei: move cross-device check to traverse_mounts This is preparation to RESOLVE_NO_XDEV fix in following commits. No functional change intended Signed-off-by: Askar Safin <safinaskar@zohomail.com> Link: https://lore.kernel.org/20250825181233.2464822-2-safinaskar@zohomail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-02 10:40:42 +02:00
Tetsuo Handa	7f9d34b0a7	cramfs: Verify inode mode when loading from disk The inode mode loaded from corrupted disk can be invalid. Do like what commit `0a9e740513` ("isofs: Verify inode mode when loading from disk") does. Reported-by: syzbot <syzbot+895c23f6917da440ed0d@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=895c23f6917da440ed0d Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: https://lore.kernel.org/429b3ef1-13de-4310-9a8e-c2dc9a36234a@I-love.SAKURA.ne.jp Acked-by: Nicolas Pitre <nico@fluxnic.net> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-01 13:13:10 +02:00
Greg Kroah-Hartman	e5bca063c1	fs: remove vfs_ioctl export vfs_ioctl() is no longer called by anything outside of fs/ioctl.c, so remove the global symbol and export as it is not needed. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/2025083038-carving-amuck-a4ae@gregkh Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-01 13:08:01 +02:00
Lauri Vasama	db2ab24a34	Add RWF_NOSIGNAL flag for pwritev2 It is cumbersome for a user mode library to avoid generating SIGPIPE signals (e.g. because this behaviour is not portable across operating systems). It is generally bad form to change the process-wide signal mask in a library, so a local solution is needed instead. For I/O performed directly using system calls (synchronous or readiness based asynchronous) this currently involves applying a thread-specific signal mask before the operation and reverting it afterwards. This can be avoided when it is known that the file descriptor refers to neither a pipe nor a socket, but a conservative implementation must always apply the mask. This incurs the cost of two additional system calls. In the case of sockets, the existing MSG_NOSIGNAL flag can be used with send. For asynchronous I/O performed using io_uring, currently the only option (apart from MSG_NOSIGNAL for sockets), is to mask SIGPIPE entirely in the call to io_uring_enter. Thankfully io_uring_enter takes a signal mask, so only a single syscall is needed. However, copying the signal mask on every call incurs a non-zero performance penalty. Furthermore, this mask applies to all completions, meaning that if the non-signaling behaviour is desired only for some subset of operations, the desired signals must be raised manually from user-mode depending on the completed operation. Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE signal from being raised when writing on disconnected pipes or sockets. The flag is handled directly by the pipe filesystem and converted to the existing MSG_NOSIGNAL flag for sockets. Signed-off-by: Lauri Vasama <git@vasama.org> Link: https://lore.kernel.org/20250827133901.1820771-1-git@vasama.org Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-29 15:08:07 +02:00
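How a library might use the new flag can be sketched as below. The helper name is hypothetical, and the numeric fallback for `RWF_NOSIGNAL` is an assumption about the uapi header value; prefer the definition from your kernel headers when it is available. On a kernel with this commit the write to a reader-less pipe should fail with EPIPE without raising SIGPIPE; on older kernels the unknown flag is expected to be rejected (typically EOPNOTSUPP) before any write is attempted.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumption: flag value as defined by kernels that carry this commit. */
#ifndef RWF_NOSIGNAL
#define RWF_NOSIGNAL 0x00000100
#endif

/*
 * Hypothetical demo: write one byte to a pipe whose read end is closed,
 * asking the kernel not to raise SIGPIPE. Returns 0 if the write
 * unexpectedly succeeds, otherwise the errno of the failed call.
 */
static int write_to_broken_pipe_nosignal(void)
{
	int fds[2];
	char byte = 'x';
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	int err = 0;

	if (pipe(fds) < 0)
		return errno;
	close(fds[0]); /* no reader is left on the pipe */

	/* Offset -1 means "use the current file position". */
	if (pwritev2(fds[1], &iov, 1, -1, RWF_NOSIGNAL) < 0)
		err = errno;

	close(fds[1]);
	return err;
}
```

Compared with blocking SIGPIPE around each write, this keeps the signal mask untouched, which is the two-syscall saving the commit message describes.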
Xichao Zhao	38d1227fa7	fs: Replace offsetof() with struct_size() in ioctl_file_dedupe_range() When dealing with structures containing flexible arrays, struct_size() provides additional compile-time checks compared to offsetof(). This enhances code robustness and reduces the risk of potential errors. Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com> Link: https://lore.kernel.org/20250829091510.597858-1-zhao.xichao@vivo.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-29 12:00:58 +02:00
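For readers unfamiliar with struct_size(), the idea can be approximated in user space as below. This is an analogy, not the kernel macro: the structure and function names are hypothetical, and the overflow builtins are GCC/Clang extensions. The point is that the size of a header plus a flexible array is computed with overflow checking and saturates at SIZE_MAX instead of wrapping.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure ending in a flexible array member. */
struct range_list {
	uint16_t count;
	uint64_t ranges[];
};

/*
 * User-space stand-in for struct_size(p, ranges, count): bytes needed
 * for the header plus `count` elements, saturating at SIZE_MAX when the
 * arithmetic would overflow so that a later allocation fails cleanly.
 */
static size_t range_list_size(size_t count)
{
	size_t bytes;

	if (__builtin_mul_overflow(count, sizeof(uint64_t), &bytes) ||
	    __builtin_add_overflow(bytes, sizeof(struct range_list), &bytes))
		return SIZE_MAX;
	return bytes;
}
```

An open-coded `offsetof(...) + count * sizeof(...)` would silently wrap for a huge `count`, which is the class of mistake the helper is meant to rule out.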
Uros Bizjak	ec6f613ef3	fs: Use try_cmpxchg() in sb_init_done_wq() Use !try_cmpxchg() instead of cmpxchg(*ptr, old, new) != old. The x86 CMPXCHG instruction returns success in the ZF flag, so this change saves a compare after CMPXCHG. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/20250811132326.620521-1-ubizjak@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-25 14:09:25 +02:00
Uros Bizjak	14498ca7e0	fs: Use try_cmpxchg() in start_dir_add() Use try_cmpxchg() instead of cmpxchg(ptr, old, new) == old. The x86 CMPXCHG instruction returns success in the ZF flag, so this change saves a compare after CMPXCHG (and related move instruction in front of CMPXCHG). Note that the value from ptr should be read using READ_ONCE() to prevent the compiler from merging, refetching or reordering the read. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/20250811125308.616717-1-ubizjak@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-25 14:09:00 +02:00
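The try_cmpxchg() calling convention used in the two commits above has a close user-space relative in C11 atomics, which the following sketch uses as an analogy. The function name and the counter scenario are hypothetical. As with try_cmpxchg(), the compare-exchange returns a boolean and refreshes the expected value on failure, so the retry loop needs neither a separate comparison nor a re-read.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical example: increment *ctr unless it has reached `limit`.
 * Returns true if the increment happened.
 */
static bool add_below_limit(atomic_uint *ctr, unsigned int limit)
{
	unsigned int old = atomic_load_explicit(ctr, memory_order_relaxed);

	do {
		if (old >= limit)
			return false;
		/* On failure, `old` is updated to the value currently stored. */
	} while (!atomic_compare_exchange_weak(ctr, &old, old + 1));

	return true;
}
```

With the older `cmpxchg(ptr, old, new) == old` form, the caller has to compare the returned value itself and reload `old` by hand before retrying.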
Lichen Liu	278033a225	fs: Add 'initramfs_options' to set initramfs mount options When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs. By default, a tmpfs mount is limited to using 50% of the available RAM for its content. This can be problematic in memory-constrained environments, particularly during a kdump capture. In a kdump scenario, the capture kernel boots with a limited amount of memory specified by the 'crashkernel' parameter. If the initramfs is large, it may fail to unpack into the tmpfs rootfs due to insufficient space. This is because to get X MB of usable space in tmpfs, 2*X MB of memory must be available for the mount. This leads to an OOM failure during the early boot process, preventing a successful crash dump. This patch introduces a new kernel command-line parameter, initramfs_options, which allows passing specific mount options directly to the rootfs when it is first mounted. This gives users control over the rootfs behavior. For example, a user can now specify initramfs_options=size=75% to allow the tmpfs to use up to 75% of the available memory. This can significantly reduce the memory pressure for kdump. Consider a practical example: To unpack a 48MB initramfs, the tmpfs needs 48MB of usable space. With the default 50% limit, this requires a memory pool of 96MB to be available for the tmpfs mount. The total memory requirement is therefore approximately: 16MB (vmlinuz) + 48MB (loaded initramfs) + 48MB (unpacked kernel) + 96MB (for tmpfs) + 12MB (runtime overhead) ≈ 220MB. By using initramfs_options=size=75%, the memory pool required for the 48MB tmpfs is reduced to 48MB / 0.75 = 64MB. This reduces the total memory requirement by 32MB (96MB - 64MB), allowing the kdump to succeed with a smaller crashkernel size, such as 192MB. An alternative approach of reusing the existing rootflags parameter was considered. 
However, a new, dedicated initramfs_options parameter was chosen to avoid altering the current behavior of rootflags (which applies to the final root filesystem) and to prevent any potential regressions. Also add documentation for the new kernel parameter "initramfs_options" This approach is inspired by prior discussions and patches on the topic. Ref: https://www.lightofdawn.org/blog/?viewDetailed=00128 Ref: https://landley.net/notes-2015.html#01-01-2015 Ref: https://lkml.org/lkml/2021/6/29/783 Ref: https://www.kernel.org/doc/html/latest/filesystems/ramfs-rootfs-initramfs.html#what-is-rootfs Signed-off-by: Lichen Liu <lichliu@redhat.com> Link: https://lore.kernel.org/20250815121459.3391223-1-lichliu@redhat.com Tested-by: Rob Landley <rob@landley.net> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-21 10:23:48 +02:00
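The memory arithmetic in the commit above can be checked with a few lines of C. The function names are hypothetical and the component sizes are simply the figures quoted in the commit message; this is a back-of-the-envelope sketch, not a model of how the kernel accounts memory.

```c
/*
 * Memory (in MB) that must be available so that a tmpfs limited to
 * size_percent of RAM offers usable_mb of space. Rounds up.
 * The caller must pass size_percent in the range 1..100.
 */
static unsigned int tmpfs_pool_mb(unsigned int usable_mb,
				  unsigned int size_percent)
{
	return (usable_mb * 100 + size_percent - 1) / size_percent;
}

/* Total for the commit's example: a 48MB initramfs in a kdump kernel. */
static unsigned int kdump_total_mb(unsigned int size_percent)
{
	const unsigned int vmlinuz = 16, initramfs = 48;
	const unsigned int unpacked = 48, overhead = 12;

	return vmlinuz + initramfs + unpacked +
	       tmpfs_pool_mb(initramfs, size_percent) + overhead;
}
```

With the default 50% limit this gives 220MB, and with size=75% it gives 188MB, which matches the 32MB saving stated in the message and fits under a 192MB crashkernel reservation.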
Tetsuo Handa	7386197093	minixfs: Verify inode mode when loading from disk The inode mode loaded from corrupted disk can be invalid. Do like what commit `0a9e740513` ("isofs: Verify inode mode when loading from disk") does. Reported-by: syzbot <syzbot+895c23f6917da440ed0d@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=895c23f6917da440ed0d Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: https://lore.kernel.org/ec982681-84b8-4624-94fa-8af15b77cbd2@I-love.SAKURA.ne.jp Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-19 13:30:46 +02:00
Charalampos Mitrodimas	8e7e265d55	debugfs: fix mount options not being applied Mount options (uid, gid, mode) are silently ignored when debugfs is mounted. This is a regression introduced during the conversion to the new mount API. When the mount API conversion was done, the parsed options were never applied to the superblock when it was reused. As a result, the mount options were ignored when debugfs was mounted. Fix this by following the same pattern as the tracefs fix in commit `e4d32142d1` ("tracing: Fix tracefs mount options"). Call debugfs_reconfigure() in debugfs_get_tree() to apply the mount options to the superblock after it has been created or reused. As an example, with the bug the "mode" mount option is ignored: $ mount -o mode=0666 -t debugfs debugfs /tmp/debugfs_test $ mount | grep debugfs_test debugfs on /tmp/debugfs_test type debugfs (rw,relatime) $ ls -ld /tmp/debugfs_test drwx------ 25 root root 0 Aug 4 14:16 /tmp/debugfs_test With the fix applied, it works as expected: $ mount -o mode=0666 -t debugfs debugfs /tmp/debugfs_test $ mount | grep debugfs_test debugfs on /tmp/debugfs_test type debugfs (rw,relatime,mode=666) $ ls -ld /tmp/debugfs_test drw-rw-rw- 37 root root 0 Aug 2 17:28 /tmp/debugfs_test Fixes: `a20971c187` ("vfs: Convert debugfs to use the new mount API") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220406 Cc: stable@vger.kernel.org Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Charalampos Mitrodimas <charmitro@posteo.net> Link: https://lore.kernel.org/20250816-debugfs-mount-opts-v3-1-d271dad57b5b@posteo.net Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-19 13:11:29 +02:00
Miklos Szeredi	f8f59a2c05	copy_file_range: limit size if in compat mode If the process runs in 32-bit compat mode, copy_file_range results can be in the in-band error range. In this case limit copy length to MAX_RW_COUNT to prevent a signed overflow. Reported-by: Florian Weimer <fweimer@redhat.com> Closes: https://lore.kernel.org/all/lhuh5ynl8z5.fsf@oldenburg.str.redhat.com/ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/20250813151107.99856-1-mszeredi@redhat.com Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-15 16:11:47 +02:00
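The clamping described above can be illustrated with a small sketch. The names are hypothetical and a 4096-byte page is assumed; in the kernel, MAX_RW_COUNT is INT_MAX rounded down to a page boundary. The concern is that a 32-bit caller receives the byte count in a 32-bit signed return value, so a count above INT_MAX would look negative and could be mistaken for an error.

```c
#include <limits.h>
#include <stddef.h>

/* Assumption: 4096-byte pages, purely for illustration. */
#define SKETCH_PAGE_SIZE    4096UL
/* INT_MAX rounded down to a page boundary (0x7ffff000 with 4K pages). */
#define SKETCH_MAX_RW_COUNT ((size_t)INT_MAX & ~(SKETCH_PAGE_SIZE - 1))

/*
 * Hypothetical helper: limit a requested copy length for callers whose
 * return value is a 32-bit signed integer.
 */
static size_t clamp_copy_len(size_t len, int compat_task)
{
	if (compat_task && len > SKETCH_MAX_RW_COUNT)
		return SKETCH_MAX_RW_COUNT;
	return len;
}
```

A short copy is always permitted by copy_file_range(2) semantics, so callers already have to loop, and clamping does not change what a correct program observes.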
Qianfeng Rong	15769d9478	fs-writeback: Remove redundant __GFP_NOWARN GFP_NOWAIT already includes __GFP_NOWARN, so let's remove the redundant __GFP_NOWARN. Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Link: https://lore.kernel.org/20250803102243.623705-5-rongqianfeng@vivo.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-15 16:06:32 +02:00
Tetsuo Handa	ecb0605364	vfs: show filesystem name at dump_inode() Commit `8b17e54096` ("vfs: add initial support for CONFIG_DEBUG_VFS") added dump_inode(), but dump_inode() currently reports only raw pointer address. Comment says that adding a proper inode dumping routine is a TODO. However, syzkaller concurrently tests multiple filesystems, and several filesystems started calling dump_inode() due to hitting VFS_BUG_ON_INODE() added by commit `af153bb63a` ("vfs: catch invalid modes in may_open()") before a proper inode dumping routine is implemented. Show filesystem name at dump_inode() so that we can find which filesystem has passed an invalid mode to may_open() from syzkaller's crash reports. Link: https://syzkaller.appspot.com/bug?extid=895c23f6917da440ed0d Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: https://lore.kernel.org/ceaf4021-65cc-422e-9d0e-6afa18dd8276@I-love.SAKURA.ne.jp Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-11 15:50:48 +02:00
Thomas Weißschuh	708c04a5c2	fs: always return zero on success from replace_fd() replace_fd() returns the number of the new file descriptor through the return value of do_dup2(). However its callers never care about the specific returned number. In fact the caller in receive_fd_replace() treats any non-zero return value as an error and therefore never calls __receive_sock() for most file descriptors, which is a bug. To fix the bug in receive_fd_replace() and to avoid the same issue happening in future callers, signal success through a plain zero. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/lkml/20250801220215.GS222315@ZenIV/ Fixes: `173817151b` ("fs: Expand __receive_fd() to accept existing fd") Fixes: `42eb0d54c0` ("fs: split receive_fd_replace from __receive_fd") Cc: stable@vger.kernel.org Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Link: https://lore.kernel.org/20250805-fix-receive_fd_replace-v3-1-b72ba8b34bac@linutronix.de Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-11 14:52:25 +02:00
Xichao Zhao	f7d812357e	fs: fix "writen"->"written" Trivial fix to spelling mistake in comment text. Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com> Link: https://lore.kernel.org/20250808083758.229563-1-zhao.xichao@vivo.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-11 14:52:25 +02:00
Kriish Sharma	4e02192081	fs: document 'name' parameter for name_contains_dotdot() The kernel-doc for name_contains_dotdot() was missing the @name parameter description, leading to a warning during make htmldocs. Add the missing documentation to resolve this warning. Signed-off-by: Kriish Sharma <kriish.sharma2006@gmail.com> Link: https://lore.kernel.org/20250730201853.8436-1-kriish.sharma2006@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-11 14:52:25 +02:00
Christoph Hellwig	17e8b7e08f	fs: mark file_remove_privs_flags static file_remove_privs_flags is only used inside of inode.c, mark it static. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250724074854.3316911-1-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-11 14:52:24 +02:00
Thiago Becker	15a04f94f4	locks: Remove the last reference to EXPORT_OP_ASYNC_LOCK. Commit `b875bd5b38` ("exportfs: Remove EXPORT_OP_ASYNC_LOCK") removed all references to EXPORT_OP_ASYNC_LOCK, but one lasted in the comments for fs/locks.c. Remove it. Signed-off-by: Thiago Becker <tbecker@redhat.com> Link: https://lore.kernel.org/20250724203516.153616-1-tbecker@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-11 14:52:24 +02:00
Linus Torvalds	8f5ae30d69	Linux 6.17-rc1 v6.17-rc1	2025-08-10 19:41:16 +03:00
Linus Torvalds	2b38afce25	Merge tag 'turbostat-2025.09.09' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux Pull turbostat updates from Len Brown: "tools/power turbostat: version 2025.09.09 - Probe and display L3 Cache topology - Add ability to average an added counter (useful for pre-integrated "counters", such as Watts) - Break the limit of 64 built-in counters - Assorted bug fixes and minor feature tweaks" * tag 'turbostat-2025.09.09' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: tools/power turbostat: version 2025.09.09 tools/power turbostat: Handle non-root legacy-uncore sysfs permissions tools/power turbostat: standardize PER_THREAD_PARAMS tools/power turbostat: Fix DMR support tools/power turbostat: add format "average" for external attributes tools/power turbostat: delete GET_PKG() tools/power turbostat: probe and display L3 cache topology tools/power turbostat: Support more than 64 built-in-counters tools/power turbostat.8: Document Totl%C0, Any%C0, GFX%C0, CPUGFX% columns tools/power turbostat: Fix bogus SysWatt for forked program tools/power turbostat: Handle cap_get_proc() ENOSYS tools/power turbostat: Fix build with musl tools/power turbostat: verify arguments to params --show and --hide tools/power turbostat: regression fix: --show C1E%	2025-08-10 09:02:36 +03:00
Linus Torvalds	b96ddbc5c8	Merge tag 'smp_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull smp fixes from Borislav Petkov: - Remove an obsolete comment and fix spelling * tag 'smp_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu: Remove obsolete comment from takedown_cpu() smp: Fix spelling in on_each_cpu_cond_mask()'s doc-comment	2025-08-10 08:51:37 +03:00
Linus Torvalds	7d2fed1f3c	Merge tag 'irq_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Borislav Petkov: - Fix a wrong ioremap size in mvebu-gicp - Remove yet another compile-test case for a driver which needs an additional dependency - Fix a lock inversion scenario in the IRQ unit test suite - Remove an impossible flag situation in gic-v5 - Do not iounmap resources in gic-v5 which are managed by devm - Make sure stale, left-over interrupts in mvebu-gicp are cleared on driver init - Fix a reference counting mishap in msi-lib - Fix a dereference-before-null-ptr-check case in the riscv-imsic irqchip driver * tag 'irq_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/mvebu-gicp: Use resource_size() for ioremap() irqchip: Build IMX_MU_MSI only on ARM genirq/test: Resolve irq lock inversion warnings irqchip/gic-v5: Remove IRQD_RESEND_WHEN_IN_PROGRESS for ITS IRQs irqchip/gic-v5: iwb: Fix iounmap probe failure path irqchip/mvebu-gicp: Clear pending interrupts on init irqchip/msi-lib: Fix fwnode refcount in msi_lib_irq_domain_select() irqchip/riscv-imsic: Don't dereference before NULL pointer check	2025-08-10 08:46:47 +03:00
Linus Torvalds	acaa21a26f	Merge tag 'x86_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Fix an interrupt vector setup race which leads to a non-functioning device - Add new Intel CPU models and a family: 0x12. Finally. Yippie! :-) * tag 'x86_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/irq: Plug vector setup race x86/cpu: Add new Intel CPU model numbers for Wildcatlake and Novalake	2025-08-10 08:15:32 +03:00
Linus Torvalds	8e8f6b635f	Merge tag 'locking_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Borislav Petkov: - Prevent a futex hash leak due to different mm lifetimes * tag 'locking_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Move futex cleanup to __mmdrop()	2025-08-10 08:11:39 +03:00
Len Brown	5e98a5e73e	tools/power turbostat: version 2025.09.09 Probe and display L3 Cache topology Add ability to average an added counter (useful for pre-integrated "counters", such as Watts) Break the limit of 64 built-in counters. Assorted bug fixes and minor feature tweaks Signed-off-by: Len Brown <len.brown@intel.com>	2025-08-09 21:24:46 -04:00
Len Brown	e60a13bcef	tools/power turbostat: Handle non-root legacy-uncore sysfs permissions /sys/devices/system/cpu/intel_uncore_frequency/package_X_die_Y/ may be readable by all, but /sys/devices/system/cpu/intel_uncore_frequency/package_X_die_Y/current_freq_khz may be readable only by root. Non-root turbostat users see complaints in this scenario. Fail probe of the interface if we can't read current_freq_khz. Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Original-patch-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>	2025-08-09 21:24:46 -04:00
Len Brown	378e901160	tools/power turbostat: standardize PER_THREAD_PARAMS use a macro for PER_THREAD_PARAMS to make adding one later more clear. no functional change Signed-off-by: Len Brown <len.brown@intel.com>	2025-08-09 21:24:46 -04:00
Zhang Rui	3a088b07c4	tools/power turbostat: Fix DMR support Together with the RAPL MSRs, there are more MSRs gone on DMR, including PLR (Perf Limit Reasons), and IRTL (Package cstate Interrupt Response Time Limit) MSRs. The configurable TDP info should also be retrieved from TPMI based Intel Speed Select Technology feature. Remove the access of these MSRs for DMR. Improve the DMR platform feature table to make it more readable at the same time. Fixes: `83075bd59d` ("tools/power turbostat: Add initial support for DMR") Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>	2025-08-09 21:24:46 -04:00
Michael Hebenstreit	dcd1c379b0	tools/power turbostat: add format "average" for external attributes External attributes with format "raw" are not printed in summary lines for nodes/packages (or with option -S). The new format "average" behaves like "raw" but also adds the summary data. Signed-off-by: Michael Hebenstreit <michael.hebenstreit@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>	2025-08-09 21:24:46 -04:00
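
Note on `708c04a5c2` ("fs: always return zero on success from replace_fd()"): the commit message describes a caller that treats any non-zero return value as an error while the callee reports success as a file descriptor number. The following is a minimal user-space sketch of that mismatch, not the kernel code. All function names here (`fake_do_dup2`, `replace_fd_old`, `replace_fd_new`, `followup_ran`) are hypothetical stand-ins, and the "follow-up step" only models the role that `__receive_sock()` plays in the description.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for do_dup2(): on success it reports the fd number. */
static int fake_do_dup2(int fd)
{
	return fd < 0 ? -EBADF : fd;
}

/* Old contract (as described): success is the fd number, often non-zero. */
static int replace_fd_old(int fd)
{
	return fake_do_dup2(fd);
}

/* New contract (as described): success is signalled by a plain zero. */
static int replace_fd_new(int fd)
{
	int err = fake_do_dup2(fd);

	return err < 0 ? err : 0;
}

/*
 * Hypothetical caller modelled on the description of receive_fd_replace():
 * any non-zero result is treated as failure, so the follow-up step is
 * skipped. Returns true when the follow-up step would have run.
 */
static bool followup_ran(int (*replace)(int), int fd)
{
	int err = replace(fd);

	if (err)
		return false;
	return true; /* the follow-up step would run here */
}
```

Under these assumptions, `followup_ran(replace_fd_old, 5)` is false even though the replacement succeeded, while `followup_ran(replace_fd_new, 5)` is true. With the old contract only fd 0 reaches the follow-up step, which matches the "most file descriptors" wording in the commit message.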

1381732 Commits