Set FOP_DONTCACHE in ext4_file_operations to declare support for
uncached buffered I/O.
To handle this flag, update ext4_write_begin() and ext4_da_write_begin()
to use write_begin_get_folio(), which encapsulates FGP_DONTCACHE logic
based on iocb->ki_flags.
Part of a series refactoring address_space_operations write_begin and
write_end callbacks to use struct kiocb for passing write context and
flags.
Signed-off-by: Taotao Chen <chentaotao@didiglobal.com>
Link: https://lore.kernel.org/20250716093559.217344-6-chentaotao@didiglobal.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Change the address_space_operations callbacks write_begin() and
write_end() to take struct kiocb * as the first argument instead of
struct file *.
Update all affected function prototypes, implementations, call sites,
and related documentation across VFS, filesystems, and block layer.
Part of a series refactoring address_space_operations write_begin and
write_end callbacks to use struct kiocb for passing write context and
flags.
Signed-off-by: Taotao Chen <chentaotao@didiglobal.com>
Link: https://lore.kernel.org/20250716093559.217344-4-chentaotao@didiglobal.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Ensure that epoll instances can never form a graph deeper than
EP_MAX_NESTS+1 links.
Currently, ep_loop_check_proc() ensures that the graph is loop-free and
does some recursion depth checks, but those recursion depth checks don't
limit the depth of the resulting tree for two reasons:
- They don't look upwards in the tree.
- If there are multiple downwards paths of different lengths, only one of
the paths is actually considered for the depth check since commit
28d82dc1c4 ("epoll: limit paths").
Essentially, the current recursion depth check in ep_loop_check_proc() just
serves to prevent it from recursing too deeply while checking for loops.
A more thorough check is done in reverse_path_check() after the new graph
edge has already been created; this checks, among other things, that no
paths going upwards from any non-epoll file with a length of more than 5
edges exist. However, this check does not apply to non-epoll files.
As a result, it is possible to recurse to a depth of at least roughly 500,
tested on v6.15. (I am unsure if deeper recursion is possible; and this may
have changed with commit 8c44dac8ad ("eventpoll: Fix priority inversion
problem").)
To fix it:
1. In ep_loop_check_proc(), note the subtree depth of each visited node,
and use subtree depths for the total depth calculation even when a subtree
has already been visited.
2. Add ep_get_upwards_depth_proc() for similarly determining the maximum
depth of an upwards walk.
3. In ep_loop_check(), use these values to limit the total path length
between epoll nodes to EP_MAX_NESTS edges.
Fixes: 22bacca48a ("epoll: prevent creating circular epoll structures")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/20250711-epoll-recursion-fix-v1-1-fb2457c33292@google.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
evict_inodes() uses list_for_each_entry_safe() to iterate sb->s_inodes
list. However, since we use i_lru list entry for our local temporary
list of inodes to destroy, the inode is guaranteed to stay in
sb->s_inodes list while we hold sb->s_inode_list_lock. So there is no
real need for safe iteration variant and we can use
list_for_each_entry() just fine.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/20250709090635.26319-2-jack@suse.cz
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
All filesystems will already check the max and min value of their block
size during their initialization. __getblk_slow() is a very low-level
function to have these checks. Remove them and only check for logical
block size alignment.
As this check with logical block size alignment might never trigger, add
WARN_ON_ONCE() to the check. As WARN_ON_ONCE() will already print the
stack, remove the call to dump_stack().
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/20250626113223.181399-1-p.raghav@samsung.com
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When sysctl_nr_open is set to a very high value (for example, 1073741816
as set by systemd), processes attempting to use file descriptors near
the limit can trigger massive memory allocation attempts that exceed
INT_MAX, resulting in a WARNING in mm/slub.c:
WARNING: CPU: 0 PID: 44 at mm/slub.c:5027 __kvmalloc_node_noprof+0x21a/0x288
This happens because kvmalloc_array() and kvmalloc() check if the
requested size exceeds INT_MAX and emit a warning when the allocation is
not flagged with __GFP_NOWARN.
Specifically, when nr_open is set to 1073741816 (0x3ffffff8) and a
process calls dup2(oldfd, 1073741880), the kernel attempts to allocate:
- File descriptor array: 1073741880 * 8 bytes = 8,589,935,040 bytes
- Multiple bitmaps: ~400MB
- Total allocation size: > 8GB (exceeding INT_MAX = 2,147,483,647)
Reproducer:
1. Set /proc/sys/fs/nr_open to 1073741816:
# echo 1073741816 > /proc/sys/fs/nr_open
2. Run a program that uses a high file descriptor:
#include <unistd.h>
#include <sys/resource.h>
int main() {
struct rlimit rlim = {1073741824, 1073741824};
setrlimit(RLIMIT_NOFILE, &rlim);
dup2(2, 1073741880); // Triggers the warning
return 0;
}
3. Observe WARNING in dmesg at mm/slub.c:5027
systemd commit a8b627a introduced automatic bumping of fs.nr_open to the
maximum possible value. The rationale was that systems with memory
control groups (memcg) no longer need separate file descriptor limits
since memory is properly accounted. However, this change overlooked
that:
1. The kernel's allocation functions still enforce INT_MAX as a maximum
size regardless of memcg accounting
2. Programs and tests that legitimately test file descriptor limits can
inadvertently trigger massive allocations
3. The resulting allocations (>8GB) are impractical and will always fail
systemd's algorithm starts with INT_MAX and keeps halving the value
until the kernel accepts it. On most systems, this results in nr_open
being set to 1073741816 (0x3ffffff8), which is just under 1GB of file
descriptors.
While processes rarely use file descriptors near this limit in normal
operation, certain selftests (like
tools/testing/selftests/core/unshare_test.c) and programs that test file
descriptor limits can trigger this issue.
Fix this by adding a check in alloc_fdtable() to ensure the requested
allocation size does not exceed INT_MAX. This causes the operation to
fail with -EMFILE instead of triggering a kernel warning and avoids the
impractical >8GB memory allocation request.
Fixes: 9cfe015aa4 ("get rid of NR_OPEN and introduce a sysctl_nr_open")
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
Link: https://lore.kernel.org/20250629074021.1038845-1-sashal@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
When running almost any select()/poll() workload intense enough,
KCSAN is likely to report data races around using 'triggered' flag
of 'struct poll_wqueues'. For example, running 'find /' on a tty
console may trigger the following:
BUG: KCSAN: data-race in poll_schedule_timeout / pollwake
write to 0xffffc900030cfb90 of 4 bytes by task 97 on cpu 5:
pollwake+0xd1/0x130
__wake_up_common_lock+0x7f/0xd0
n_tty_receive_buf_common+0x776/0xc30
n_tty_receive_buf2+0x3d/0x60
tty_ldisc_receive_buf+0x6b/0x100
tty_port_default_receive_buf+0x63/0xa0
flush_to_ldisc+0x169/0x3c0
process_scheduled_works+0x6fe/0xf40
worker_thread+0x53b/0x7b0
kthread+0x4f8/0x590
ret_from_fork+0x28c/0x450
ret_from_fork_asm+0x1a/0x30
read to 0xffffc900030cfb90 of 4 bytes by task 5802 on cpu 4:
poll_schedule_timeout+0x96/0x160
do_sys_poll+0x966/0xb30
__se_sys_ppoll+0x1c3/0x210
__x64_sys_ppoll+0x71/0x90
x64_sys_call+0x3079/0x32b0
do_syscall_64+0xfa/0x3b0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
According to Jan, "there's no practical issue here because it is hard
to imagine how the compiler could compile the above code using some
intermediate values stored into 'triggered' or multiple fetches from
'triggered'". Nevertheless, silence KCSAN by using WRITE_ONCE() in
__pollwake() and READ_ONCE() in poll_schedule_timeout(), respectively.
Link: https://lore.kernel.org/linux-fsdevel/bwx72orsztfjx6aoftzzkl7wle3hi4syvusuwc7x36nw6t235e@bjwrosehblty
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Link: https://lore.kernel.org/20250620063059.1800689-1-dmantipov@yandex.ru
Acked-by: Marco Elver <elver@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
all users of 'struct renamedata' have the dentry for the old and new
directories, and often have no use for the inode except to store it in
the renamedata.
This patch changes struct renamedata to hold the dentry, rather than
the inode, for the old and new directories, and changes callers to
match. The names are also changed from a _dir suffix to _parent. This
is consistent with other usage in namei.c and elsewhere.
This results in the removal of several local variables and several
dereferences of ->d_inode at the cost of adding ->d_inode dereferences
to vfs_rename().
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://lore.kernel.org/174977089072.608730.4244531834577097454@noble.neil.brown.name
Signed-off-by: Christian Brauner <brauner@kernel.org>
Currently the function that does this takes a struct file_lock, but
__locks_wake_up_blocks() deals with both locks and leases. Currently
this works because both file_lock and file_lease have the file_lock_core
at the beginning of the struct, but it's fragile to rely on that.
Add a new locks_wake_up_waiter() function and call that from
__locks_wake_up_blocks().
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/20250602-filelock-6-16-v1-1-7da5b2c930fd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Rather than have the caller set the FMODE_NOWAIT flags for both output
files, move it to create_pipe_files() where other f_mode flags are set
anyway with stream_open(). With that, both __do_pipe_flags() and
io_pipe() can remove the manual setting of the NOWAIT flags.
No intended functional changes, just a code cleanup.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/1f0473f8-69f3-4eb1-aa77-3334c6a71d24@kernel.dk
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull timer cleanup from Thomas Gleixner:
"The delayed from_timer() API cleanup:
The renaming to the timer_*() namespace was delayed due massive
conflicts against Linux-next. Now that everything is upstream finish
the conversion"
* tag 'timers-cleanups-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
treewide, timers: Rename from_timer() to timer_container_of()
Pull x86 fixes from Thomas Gleixner:
"A small set of x86 fixes:
- Cure IO bitmap inconsistencies
A failed fork cleans up all resources of the newly created thread
via exit_thread(). exit_thread() invokes io_bitmap_exit() which
does the IO bitmap cleanups, which unfortunately assume that the
cleanup is related to the current task, which is obviously bogus.
Make it work correctly
- A lockdep fix in the resctrl code removed the clearing of the
command buffer in two places, which keeps stale error messages
around. Bring them back.
- Remove unused trace events"
* tag 'x86-urgent-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fs/resctrl: Restore the rdt_last_cmd_clear() calls after acquiring rdtgroup_mutex
x86/iopl: Cure TIF_IO_BITMAP inconsistencies
x86/fpu: Remove unused trace events
Pull mount fixes from Al Viro:
"Various mount-related bugfixes:
- split the do_move_mount() checks in subtree-of-our-ns and
entire-anon cases and adapt detached mount propagation selftest for
mount_setattr
- allow clone_private_mount() for a path on real rootfs
- fix a race in call of has_locked_children()
- fix move_mount propagation graph breakage by MOVE_MOUNT_SET_GROUP
- make sure clone_private_mnt() caller has CAP_SYS_ADMIN in the right
userns
- avoid false negatives in path_overmount()
- don't leak MNT_LOCKED from parent to child in finish_automount()
- do_change_type(): refuse to operate on unmounted/not ours mounts"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
do_change_type(): refuse to operate on unmounted/not ours mounts
clone_private_mnt(): make sure that caller has CAP_SYS_ADMIN in the right userns
selftests/mount_setattr: adapt detached mount propagation test
do_move_mount(): split the checks in subtree-of-our-ns and entire-anon cases
fs: allow clone_private_mount() for a path on real rootfs
fix propagation graph breakage by MOVE_MOUNT_SET_GROUP move_mount(2)
finish_automount(): don't leak MNT_LOCKED from parent to child
path_overmount(): avoid false negatives
fs/fhandle.c: fix a race in call of has_locked_children()
Pull more smb client updates from Steve French:
- multichannel/reconnect fixes
- move smbdirect (smb over RDMA) defines to fs/smb/common so they will
be able to be used in the future more broadly, and a documentation
update explaining setting up smbdirect mounts
- update email address for Paulo
* tag '6.16-rc-part2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal version number
MAINTAINERS, mailmap: Update Paulo Alcantara's email address
cifs: add documentation for smbdirect setup
cifs: do not disable interface polling on failure
cifs: serialize other channels when query server interfaces is pending
cifs: deal with the channel loading lag while picking channels
smb: client: make use of common smbdirect_socket_parameters
smb: smbdirect: introduce smbdirect_socket_parameters
smb: client: make use of common smbdirect_socket
smb: smbdirect: add smbdirect_socket.h
smb: client: make use of common smbdirect.h
smb: smbdirect: add smbdirect.h with public structures
smb: client: make use of common smbdirect_pdu.h
smb: smbdirect: add smbdirect_pdu.h with protocol definitions
Pull JFFS2 and UBIFS fixes from Richard Weinberger:
"JFFS2:
- Correctly check return code of jffs2_prealloc_raw_node_refs()
UBIFS:
- Spelling fixes"
* tag 'ubifs-for-linus-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
jffs2: check jffs2_prealloc_raw_node_refs() result in few other places
jffs2: check that raw node were preallocated before writing summary
ubifs: Fix grammar in error message
Ensure that propagation settings can only be changed for mounts located
in the caller's mount namespace. This change aligns permission checking
with the rest of mount(2).
Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: 07b20889e3 ("beginning of the shared-subtree proper")
Reported-by: "Orlando, Noah" <Noah.Orlando@deshaw.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
What we want is to verify there is that clone won't expose something
hidden by a mount we wouldn't be able to undo. "Wouldn't be able to undo"
may be a result of MNT_LOCKED on a child, but it may also come from
lacking admin rights in the userns of the namespace mount belongs to.
clone_private_mnt() checks the former, but not the latter.
There's a number of rather confusing CAP_SYS_ADMIN checks in various
userns during the mount, especially with the new mount API; they serve
different purposes and in case of clone_private_mnt() they usually,
but not always end up covering the missing check mentioned above.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reported-by: "Orlando, Noah" <Noah.Orlando@deshaw.com>
Fixes: 427215d85e ("ovl: prevent private clone if bind mount is not allowed")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and fix the breakage in anon-to-anon case. There are two cases
acceptable for do_move_mount() and mixing checks for those is making
things hard to follow.
One case is move of a subtree in caller's namespace.
* source and destination must be in caller's namespace
* source must be detachable from parent
Another is moving the entire anon namespace elsewhere
* source must be the root of anon namespace
* target must either in caller's namespace or in a suitable
anon namespace (see may_use_mount() for details).
* target must not be in the same namespace as source.
It's really easier to follow if tests are *not* mixed together...
Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: 3b5260d12b ("Don't propagate mounts into detached trees")
Reported-by: Allison Karlitskaya <lis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Mounting overlayfs with a directory on real rootfs (initramfs)
as upperdir has failed with following message since commit
db04662e2f ("fs: allow detached mounts in clone_private_mount()").
[ 4.080134] overlayfs: failed to clone upperpath
Overlayfs mount uses clone_private_mount() to create internal mount
for the underlying layers.
The commit made clone_private_mount() reject real rootfs because
it does not have a parent mount and is in the initial mount namespace,
that is not an anonymous mount namespace.
This issue can be fixed by modifying the permission check
of clone_private_mount() following [1].
Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: db04662e2f ("fs: allow detached mounts in clone_private_mount()")
Link: https://lore.kernel.org/all/20250514190252.GQ2023217@ZenIV/ [1]
Link: https://lore.kernel.org/all/20250506194849.GT2023217@ZenIV/
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kazuma Kondo <kazuma-kondo@nec.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9ffb14ef61 "move_mount: allow to add a mount into an existing group"
breaks assertions on ->mnt_share/->mnt_slave. For once, the data structures
in question are actually documented.
Documentation/filesystem/sharedsubtree.rst:
All vfsmounts in a peer group have the same ->mnt_master. If it is
non-NULL, they form a contiguous (ordered) segment of slave list.
do_set_group() puts a mount into the same place in propagation graph
as the old one. As the result, if old mount gets events from somewhere
and is not a pure event sink, new one needs to be placed next to the
old one in the slave list the old one's on. If it is a pure event
sink, we only need to make sure the new one doesn't end up in the
middle of some peer group.
"move_mount: allow to add a mount into an existing group" ends up putting
the new one in the beginning of list; that's definitely not going to be
in the middle of anything, so that's fine for case when old is not marked
shared. In case when old one _is_ marked shared (i.e. is not a pure event
sink), that breaks the assumptions of propagation graph iterators.
Put the new mount next to the old one on the list - that does the right thing
in "old is marked shared" case and is just as correct as the current behaviour
if old is not marked shared (kudos to Pavel for pointing that out - my original
suggested fix changed behaviour in the "nor marked" case, which complicated
things for no good reason).
Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: 9ffb14ef61 ("move_mount: allow to add a mount into an existing group")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Holding namespace_sem is enough to make sure that result remains valid.
It is *not* enough to avoid false negatives from __lookup_mnt(). Mounts
can be unhashed outside of namespace_sem (stuck children getting detached
on final mntput() of lazy-umounted mount) and having an unrelated mount
removed from the hash chain while we traverse it may end up with false
negative from __lookup_mnt(). We need to sample and recheck the seqlock
component of mount_lock...
Bug predates the introduction of path_overmount() - it had come from
the code in finish_automount() that got abstracted into that helper.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: 26df6034fd ("fix automount/automount race properly")
Fixes: 6ac3928156 ("fs: allow to mount beneath top mount")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
may_decode_fh() is calling has_locked_children() while holding no locks.
That's an oopsable race...
The rest of the callers are safe since they are holding namespace_sem and
are guaranteed a positive refcount on the mount in question.
Rename the current has_locked_children() to __has_locked_children(), make
it static and switch the fs/namespace.c users to it.
Make has_locked_children() a wrapper for __has_locked_children(), calling
the latter under read_seqlock_excl(&mount_lock).
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Fixes: 620c266f39 ("fhandle: relax open_by_handle_at() permission checks")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull ceph updates from Ilya Dryomov:
- a one-liner that leads to a startling (but also very much rational)
performance improvement in cases where an IMA policy with rules that
are based on fsmagic matching is enforced
- an encryption-related fixup that addresses generic/397 and other
fstest failures
- a couple of cleanups in CephFS
* tag 'ceph-for-6.16-rc1' of https://github.com/ceph/ceph-client:
ceph: fix variable dereferenced before check in ceph_umount_begin()
ceph: set superblock s_magic for IMA fsmagic matching
ceph: cleanup hardcoded constants of file handle size
ceph: fix possible integer overflow in ceph_zero_objects()
ceph: avoid kernel BUG for encrypted inode with unaligned file size
Pull overlayfs update from Miklos Szeredi:
- Fix a regression in getting the path of an open file (e.g. in
/proc/PID/maps) for a nested overlayfs setup (André Almeida)
- Support data-only layers and verity in a user namespace (unprivileged
composefs use case)
- Fix a gcc warning (Kees)
- Cleanups
* tag 'ovl-update-v2-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
ovl: Annotate struct ovl_entry with __counted_by()
ovl: Replace offsetof() with struct_size() in ovl_stack_free()
ovl: Replace offsetof() with struct_size() in ovl_cache_entry_new()
ovl: Check for NULL d_inode() in ovl_dentry_upper()
ovl: Use str_on_off() helper in ovl_show_options()
ovl: don't require "metacopy=on" for "verity"
ovl: relax redirect/metacopy requirements for lower -> data redirect
ovl: make redirect/metacopy rejection consistent
ovl: Fix nested backing file paths
smatch warnings:
fs/ceph/super.c:1042 ceph_umount_begin() warn: variable dereferenced before check 'fsc' (see line 1041)
vim +/fsc +1042 fs/ceph/super.c
void ceph_umount_begin(struct super_block *sb)
{
struct ceph_fs_client *fsc = ceph_sb_to_fs_client(sb);
doutc(fsc->client, "starting forced umount\n");
^^^^^^^^^^^
Dereferenced
if (!fsc)
^^^^
Checked too late.
return;
fsc->mount_state = CEPH_MOUNT_SHUTDOWN;
__ceph_umount_begin(fsc);
}
The VFS guarantees that the superblock is still
alive when it calls into ceph via ->umount_begin().
Finally, we don't need to check the fsc and
it should be valid. This patch simply removes
the fsc check.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202503280852.YDB3pxUY-lkp@intel.com/
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Pull smb server updates from Steve French:
"Four smb3 server fixes:
- Fix for special character handling when mounting with "posix"
- Fix for mounts from Mac for fs that don't provide unique inode
numbers
- Two cleanup patches (e.g. for crypto calls)"
* tag '6.16-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: allow a filename to contain special characters on SMB3.1.1 posix extension
ksmbd: provide zero as a unique ID to the Mac client
ksmbd: remove unnecessary softdep on crc32
ksmbd: use SHA-256 library API instead of crypto_shash API
Pull more bcachefs updates from Kent Overstreet:
"More bcachefs updates:
- More stack usage improvements (~600 bytes)
- Define CLASS()es for some commonly used types, and convert most
rcu_read_lock() uses to the new lock guards
- New introspection:
- Superblock error counters are now available in sysfs:
previously, they were only visible with 'show-super', which
doesn't provide a live view
- New tracepoint, error_throw(), which is called any time we
return an error and start to unwind
- Repair
- check_fix_ptrs() can now repair btree node roots
- We can now repair when we've somehow ended up with the journal
using a superblock bucket
- Revert some leftovers from the aborted directory i_size feature,
and add repair code: some userspace programs (e.g. sshfs) were
getting confused
It seems in 6.15 there's a bug where i_nlink on the vfs inode has been
getting incorrectly set to 0, with some unfortunate results;
list_journal analysis showed bch2_inode_rm() being called (by
bch2_evict_inode()) when it clearly should not have been.
- bch2_inode_rm() now runs "should we be deleting this inode?" checks
that were previously only run when deleting unlinked inodes in
recovery
- check_subvol() was treating a dangling subvol (pointing to a
missing root inode) like a dangling dirent, and deleting it. This
was the really unfortunate one: check_subvol() will now recreate
the root inode if necessary
This took longer to debug than it should have, and we lost several
filesystems unnecessarily, because users have been ignoring the
release notes and blindly running 'fsck -y'. Debugging required
reconstructing what happened through analyzing the journal, when
ideally someone would have noticed 'hey, fsck is asking me if I want
to repair this: it usually doesn't, maybe I should run this in dry run
mode and check what's going on?'
As a reminder, fsck errors are being marked as autofix once we've
verified, in real world usage, that they're working correctly; blindly
running 'fsck -y' on an experimental filesystem is playing with fire
Up to this incident we've had an excellent track record of not losing
data, so let's try to learn from this one
This is a community effort, I wouldn't be able to get this done
without the help of all the people QAing and providing excellent bug
reports and feedback based on real world usage. But please don't
ignore advice and expect me to pick up the pieces
If an error isn't marked as autofix, and it /is/ happening in the
wild, that's also something I need to know about so we can check it
out and add it to the autofix list if repair looks good. I haven't
been getting those reports, and I should be; since we don't have any
sort of telemetry yet I am absolutely dependent on user reports
Now I'll be spending the weekend working on new repair code to see if
I can get a filesystem back for a user who didn't have backups"
* tag 'bcachefs-2025-06-04' of git://evilpiepirate.org/bcachefs: (69 commits)
bcachefs: add cond_resched() to handle_overwrites()
bcachefs: Make journal read log message a bit quieter
bcachefs: Fix subvol to missing root repair
bcachefs: Run may_delete_deleted_inode() checks in bch2_inode_rm()
bcachefs: delete dead code from may_delete_deleted_inode()
bcachefs: Add flags to subvolume_to_text()
bcachefs: Fix oops in btree_node_seq_matches()
bcachefs: Fix dirent_casefold_mismatch repair
bcachefs: Fix bch2_fsck_rename_dirent() for casefold
bcachefs: Redo bch2_dirent_init_name()
bcachefs: Fix -Wc23-extensions in bch2_check_dirents()
bcachefs: Run check_dirents second time if required
bcachefs: Run snapshot deletion out of system_long_wq
bcachefs: Make check_key_has_snapshot safer
bcachefs: BCH_RECOVERY_PASS_NO_RATELIMIT
bcachefs: bch2_require_recovery_pass()
bcachefs: bch_err_throw()
bcachefs: Repair code for directory i_size
bcachefs: Kill un-reverted directory i_size code
bcachefs: Delete redundant fsck_err()
...
Users seem to be assuming that the 'dropped unflushed entries' message
at the end of journal read indicates some sort of problem, when it does
not - we expect there to be entries in the journal that weren't
commited, it's purely informational so that we can correlate journal
sequence numbers elsewhere when debugging.
Shorten the log message a bit to hopefully make this clearer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We had a bug where the root inode of a subvolume was erronously deleted:
bch2_evict_inode() called bch2_inode_rm(), meaning the VFS inode's
i_nlink was somehow set to 0 when it shouldn't have - the inode in the
btree indicated it clearly was not unlinked.
This has been addressed with additional safety checks in
bch2_inode_rm() - pulling in the safety checks we already were doing
when deleting unlinked inodes in recovery - but the really disastrous
bug was in check_subvols(), which on finding a dangling subvol (subvol
with a missing root inode) would delete the subvolume.
I assume this bug dates from early check_directory_structure() code,
which originally handled subvolumes and normal paths - the idea being
that still live contents of the subvolume would get reattached
somewhere.
But that's incorrect, and disastrously so; deleting a subvolume triggers
deleting the snapshot ID it points to, deleting the entire contents.
The correct way to repair is to recreate the root inode if it's missing;
then any contents will get reattached under that subvolume's lost+found.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We had a bug where bch2_evict_inode() incorrectly called bch2_inode_rm()
- the journal clearly showed the inode was not unlinked.
We've got checks that we use in recovery when cleaning up deleted
inodes, lift them to bch2_inode_rm() as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree_update_nodes_written() needs to wait on in-flight writes to old
nodes before marking them as freed. But it has no reason to pin those
old nodes in memory, so some trickyness ensues.
The update we're completing deleted references to those nodes from the
btree, so we know if they've been evicted they can't be pulled back in.
We just have to check if the nodes we have pointers to are still those
old nodes, and haven't been reused.
To do that we check the node's "sequence number" (actually a random 64
bit cookie), but that lives in the node's data buffer. 'struct btree'
can't be freed until filesystem shutdown (as they're quite small), but
the data buffers can be freed or swapped around.
Commit 1f88c35674, which was fixing a kmsan warning, assumed that we
could safely do this locklessly with just a READ_ONCE() - if we've got a
non-null ptr it would be safe to read from.
But that's not true if the data buffer is a vmalloc allocation, so we
need to restore the locking that commit deleted (or alternatively RCU
free those data buffers, but there's no other reason for that).
Fixes: 1f88c35674 ("bcachefs: Fix a KMSAN splat in btree_update_nodes_written()")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Instead of simply recreating a mis-casefolded dirent, use the str_hash
repair code, which will rename it if necessary - the dirent might have
been created again with the correct casefolding.
Factor out out bch2_str_hash_repair key() from
__bch2_str_hash_check_key() for the new path to use, and export
bch2_dirent_create_key() as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_fsck_renamed_dirent was creating bch_dirent keys open-coded - but
we need to use the appropriate helper, if the directory is casefolded.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Redo (and simplify somewhat) how casefolded and non casefolded dirents
are initialized, and export this to be used by fsck_rename_dirent().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Clang warns (or errors with CONFIG_WERROR=y):
fs/bcachefs/fsck.c:2325:2: error: label followed by a declaration is a C23 extension [-Werror,-Wc23-extensions]
2325 | int ret = bch2_trans_run(c,
| ^
On clang-17 and older, this is an unconditional error:
fs/bcachefs/fsck.c:2325:2: error: expected expression
2325 | int ret = bch2_trans_run(c,
| ^
Move the declaration of ret to the top of the function to resolve both
ways this issue manifests.
Fixes: c72def5237 ("bcachefs: Run check_dirents second time if required")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A lockdep fix removed two rdt_last_cmd_clear() calls that were used to
clear the last_cmd_status buffer but called without holding the required
rdtgroup_mutex.
The impacted resctrl commands are writing to the cpus or cpus_list files
and creating a new monitor or control group. With stale data in the
last_cmd_status buffer the impacted resctrl commands report the stale error
on success, or append its own failure message to the stale error on
failure.
Consequently, restore the rdt_last_cmd_clear() calls after acquiring
rdtgroup_mutex.
Fixes: c8eafe1495 ("x86/resctrl: Fix potential lockdep warning")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/all/20250603125828.1590067-1-zengheng4@huawei.com