Add lightweight accounting for directory lease cache usage
to aid debugging and limiting cache size in future. Track
per-directory entry/byte counts and maintain per-tcon
aggregates. Also expose the totals in /proc/fs/cifs/open_dirs.
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Add write-only /sys/module/cifs/parameters/drop_dir_cache. Writing a
non-zero value iterates all tcons and calls invalidate_all_cached_dirs()
to drop cached directory entries. This is useful to force a dirent cache
drop across mounts for debugging and testing purpose.
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Print the lease/oplock caching state for each open file as a
compact string of letters: R (read), H (handle), W (write).
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The crypto API, through the scatterlist API, expects input buffers to be
in linear memory. We handle this with the cifs_sg_set_buf() helper
that converts vmalloc'd memory to their corresponding pages.
However, when we allocate our aead_request buffer (@creq in
smb2ops.c::crypt_message()), we do so with kvzalloc(), which possibly
puts aead_request->__ctx in vmalloc area.
AEAD algorithm then uses ->__ctx for its private/internal data and
operations, and uses sg_set_buf() for such data on a few places.
This works fine as long as @creq falls into kmalloc zone (small
requests) or vmalloc'd memory is still within linear range.
Tasks' stacks are vmalloc'd by default (CONFIG_VMAP_STACK=y), so too
many tasks will increment the base stacks' addresses to a point where
virt_addr_valid(buf) will fail (BUG() in sg_set_buf()) when that
happens.
In practice: too many parallel reads and writes on an encrypted mount
will trigger this bug.
To fix this, always alloc @creq with kmalloc() instead.
Also drop the @sensitive_size variable/arguments since
kfree_sensitive() doesn't need it.
Backtrace:
[ 945.272081] ------------[ cut here ]------------
[ 945.272774] kernel BUG at include/linux/scatterlist.h:209!
[ 945.273520] Oops: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC NOPTI
[ 945.274412] CPU: 7 UID: 0 PID: 56 Comm: kworker/u33:0 Kdump: loaded Not tainted 6.15.0-lku-11779-g8e9d6efccdd7-dirty #1 PREEMPT(voluntary)
[ 945.275736] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-2-gc13ff2cd-prebuilt.qemu.org 04/01/2014
[ 945.276877] Workqueue: writeback wb_workfn (flush-cifs-2)
[ 945.277457] RIP: 0010:crypto_gcm_init_common+0x1f9/0x220
[ 945.278018] Code: b0 00 00 00 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 48 c7 c0 00 00 00 80 48 2b 05 5c 58 e5 00 e9 58 ff ff ff <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 48 c7 04 24 01 00 00 00 48 8b
[ 945.279992] RSP: 0018:ffffc90000a27360 EFLAGS: 00010246
[ 945.280578] RAX: 0000000000000000 RBX: ffffc90001d85060 RCX: 0000000000000030
[ 945.281376] RDX: 0000000000080000 RSI: 0000000000000000 RDI: ffffc90081d85070
[ 945.282145] RBP: ffffc90001d85010 R08: ffffc90001d85000 R09: 0000000000000000
[ 945.282898] R10: ffffc90001d85090 R11: 0000000000001000 R12: ffffc90001d85070
[ 945.283656] R13: ffff888113522948 R14: ffffc90001d85060 R15: ffffc90001d85010
[ 945.284407] FS: 0000000000000000(0000) GS:ffff8882e66cf000(0000) knlGS:0000000000000000
[ 945.285262] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 945.285884] CR2: 00007fa7ffdd31f4 CR3: 000000010540d000 CR4: 0000000000350ef0
[ 945.286683] Call Trace:
[ 945.286952] <TASK>
[ 945.287184] ? crypt_message+0x33f/0xad0 [cifs]
[ 945.287719] crypto_gcm_encrypt+0x36/0xe0
[ 945.288152] crypt_message+0x54a/0xad0 [cifs]
[ 945.288724] smb3_init_transform_rq+0x277/0x300 [cifs]
[ 945.289300] smb_send_rqst+0xa3/0x160 [cifs]
[ 945.289944] cifs_call_async+0x178/0x340 [cifs]
[ 945.290514] ? __pfx_smb2_writev_callback+0x10/0x10 [cifs]
[ 945.291177] smb2_async_writev+0x3e3/0x670 [cifs]
[ 945.291759] ? find_held_lock+0x32/0x90
[ 945.292212] ? netfs_advance_write+0xf2/0x310
[ 945.292723] netfs_advance_write+0xf2/0x310
[ 945.293210] netfs_write_folio+0x346/0xcc0
[ 945.293689] ? __pfx__raw_spin_unlock_irq+0x10/0x10
[ 945.294250] netfs_writepages+0x117/0x460
[ 945.294724] do_writepages+0xbe/0x170
[ 945.295152] ? find_held_lock+0x32/0x90
[ 945.295600] ? kvm_sched_clock_read+0x11/0x20
[ 945.296103] __writeback_single_inode+0x56/0x4b0
[ 945.296643] writeback_sb_inodes+0x229/0x550
[ 945.297140] __writeback_inodes_wb+0x4c/0xe0
[ 945.297642] wb_writeback+0x2f1/0x3f0
[ 945.298069] wb_workfn+0x300/0x490
[ 945.298472] process_one_work+0x1fe/0x590
[ 945.298949] worker_thread+0x1ce/0x3c0
[ 945.299397] ? __pfx_worker_thread+0x10/0x10
[ 945.299900] kthread+0x119/0x210
[ 945.300285] ? __pfx_kthread+0x10/0x10
[ 945.300729] ret_from_fork+0x119/0x1b0
[ 945.301163] ? __pfx_kthread+0x10/0x10
[ 945.301601] ret_from_fork_asm+0x1a/0x30
[ 945.302055] </TASK>
Fixes: d08089f649 ("cifs: Change the I/O paths to use an iterator rather than a page list")
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
fs/smb/common/cifs_arc4.c has an implementation of ARC4, but a copy of
this same code is also present in lib/crypto/arc4.c to serve the other
users of this legacy algorithm in the kernel. Remove the duplicate
implementation in fs/smb/, which seems to have been added because of a
misunderstanding, and just use the lib/crypto/ one.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Pull smb restructuring updates from Steve French:
"Large set of small restructuring smbdirect related patches for cifs.ko
and ksmbd.ko.
This is the next step in order to use common structures for smbdirect
handling across both modules. And also includes improved handling of
broken connections, as well as fixed negotiation as rdma resources.
Moving to common functions is planned for 6.19, as well as also
providing smbdirect via sockets to userspace (e.g. for samba to also
be able to use smbdirect for userspace server and userspace client
tools).
This was heavily reviewed and tested at the recent SMB3.1.1 test event
at SDC"
* tag 'v6.18-rc-part1-smb3-common' of git://git.samba.org/ksmbd: (159 commits)
smb: server: let smb_direct_flush_send_list() invalidate a remote key first
smb: server: make use of ib_alloc_cq_any() instead of ib_alloc_cq()
smb: server: make consitent use of spin_lock_irq{save,restore}() in transport_rdma.c
smb: server: let {free_transport,smb_direct_disconnect_rdma_{work,connection}}() wake up all wait queues
smb: server: let smb_direct_disconnect_rdma_connection() disable all work but disconnect_work
smb: server: fill in smbdirect_socket.first_error on error
smb: server: let smb_direct_disconnect_rdma_connection() set SMBDIRECT_SOCKET_ERROR...
smb: server: pass struct smbdirect_socket to smb_direct_send_negotiate_response()
smb: server: pass struct smbdirect_socket to {enqueue,get_first}_reassembly()
smb: server: pass struct smbdirect_socket to smb_direct_post_send_data()
smb: server: pass struct smbdirect_socket to post_sendmsg()
smb: server: pass struct smbdirect_socket to smb_direct_create_header()
smb: server: pass struct smbdirect_socket to manage_keep_alive_before_sending()
smb: server: pass struct smbdirect_socket to manage_credits_prior_sending()
smb: server: pass struct smbdirect_socket to calc_rw_credits()
smb: server: pass struct smbdirect_socket to wait_for_rw_credits()
smb: server: pass struct smbdirect_socket to wait_for_send_credits()
smb: server: pass struct smbdirect_socket to wait_for_credits()
smb: server: pass struct smbdirect_socket to smb_direct_flush_send_list()
smb: server: pass struct smbdirect_socket to smb_direct_post_send()
...
Pull xfs updates from Carlos Maiolino:
"For this merge window, there are really no new features, but there are
a few things worth to emphasize:
- Deprecated for years already, the (no)attr2 and (no)ikeep mount
options have been removed for good
- Several cleanups (specially from typedefs) and bug fixes
- Improvements made in the online repair reap calculations
- online fsck is now enabled by default"
* tag 'xfs-merge-6.18' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (53 commits)
xfs: rework datasync tracking and execution
xfs: rearrange code in xfs_inode_item_precommit
xfs: scrub: use kstrdup_const() for metapath scan setups
xfs: use bt_nr_sectors in xfs_dax_translate_range
xfs: track the number of blocks in each buftarg
xfs: constify xfs_errortag_random_default
xfs: improve default maximum number of open zones
xfs: improve zone statistics message
xfs: centralize error tag definitions
xfs: remove pointless externs in xfs_error.h
xfs: remove the expr argument to XFS_TEST_ERROR
xfs: remove xfs_errortag_set
xfs: remove xfs_errortag_get
xfs: move the XLOG_REG_ constants out of xfs_log_format.h
xfs: adjust the hint based zone allocation policy
xfs: refactor hint based zone allocation
fs: add an enum for number of life time hints
xfs: fix log CRC mismatches between i386 and other architectures
xfs: rename the old_crc variable in xlog_recover_process
xfs: remove the unused xfs_log_iovec_t typedef
...
Pull gfs2 updates from Andreas Gruenbacher:
- Partially revert "gfs2: do_xmote fixes" to ignore dlm_lock() errors
during withdraw; passing on those errors doesn't help
- Change the LM_FLAG_TRY and LM_FLAG_TRY_1CB logic in add_to_queue() to
check if the holder would actually block
- Move some more dlm specific code from glock.c to lock_dlm.c
- Remove the unused dlm alternate locking mode code
- Add proper locking to make sure that dlm lockspaces are never used
after being released
- Various other cleanups
* tag 'gfs2-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix unlikely race in gdlm_put_lock
gfs2: Add proper lockspace locking
gfs2: Minor run_queue fixes
gfs2: run_queue cleanup
gfs2: Simplify do_promote
gfs2: Get rid of GLF_INVALIDATE_IN_PROGRESS
gfs2: Fix GLF_INVALIDATE_IN_PROGRESS flag clearing in do_xmote
gfs2: Remove duplicate check in do_xmote
gfs2: Fix LM_FLAG_TRY* logic in add_to_queue
gfs2: Remove DLM_LKF_ALTCW / DLM_LKF_ALTPR code
gfs2: Further sanitize lock_dlm.c
gfs2: Do not use atomic operations unnecessarily
gfs2: Sanitize gfs2_meta_check, gfs2_metatype_check, gfs2_io_error
gfs2: Turn gfs2_withdraw into a void function
gfs2: Partially revert "gfs2: do_xmote fixes"
gfs2: Simplify refcounting in do_xmote
gfs2: do_xmote cleanup
gfs2: Remove space before newline
gfs2: Remove unused sd_withdraw_wait field
gfs2: Remove unused GIF_FREE_VFS_INODE flag
This pattern isn't very documented, and apparently not used much outside
of 'make tools/help', but it has existed for over a decade (since commit
ea01fa9f63: "tools: Connect to the kernel build system").
However, it doesn't work very well for most cases, particularly the
useful "tools/all" target, because it overrides the LDFLAGS value with
an empty one.
And once overridden, 'make' will then not honor the tooling makefiles
trying to change it - which then makes any LDFLAGS use in the tooling
directory break, typically causing odd link errors.
Remove that LDFLAGS override, since it seems to be entirely historical.
The core kernel makefiles no longer modify LDFLAGS as part of the build,
and use kernel-specific link flags instead (eg 'KBUILD_LDFLAGS' and
friends).
This allows more of the 'make tools/*' cases to work. I say 'more',
because some of the tooling build rules make various other assumptions
or have other issues, so it's still a bit hit-or-miss. But those issues
tend to show up with the 'make -C tools xyz' pattern too, so now it's no
longer an issue of this particular 'tools/*' build rule being special.
Acked-by: Nathan Chancellor <nathan@kernel.org>
Cc: Nicolas Schier <nicolas@fjasle.eu>
Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs async directory updates from Christian Brauner:
"This contains further preparatory changes for the asynchronous directory
locking scheme:
- Add lookup_one_positive_killable() which allows overlayfs to
perform lookup that won't block on a fatal signal
- Unify the mount idmap handling in struct renamedata as a rename can
only happen within a single mount
- Introduce kern_path_parent() for audit which sets the path to the
parent and returns a dentry for the target without holding any
locks on return
- Rename kern_path_locked() as it is only used to prepare for the
removal of an object from the filesystem:
kern_path_locked() => start_removing_path()
kern_path_create() => start_creating_path()
user_path_create() => start_creating_user_path()
user_path_locked_at() => start_removing_user_path_at()
done_path_create() => end_creating_path()
NA => end_removing_path()"
* tag 'vfs-6.18-rc1.async' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
debugfs: rename start_creating() to debugfs_start_creating()
VFS: rename kern_path_locked() and related functions.
VFS/audit: introduce kern_path_parent() for audit
VFS: unify old_mnt_idmap and new_mnt_idmap in renamedata
VFS: discard err2 in filename_create()
VFS/ovl: add lookup_one_positive_killable()
Pull vfs writeback updates from Christian Brauner:
"This contains work adressing lockups reported by users when a systemd
unit reading lots of files from a filesystem mounted with the lazytime
mount option exits.
With the lazytime mount option enabled we can be switching many dirty
inodes on cgroup exit to the parent cgroup. The numbers observed in
practice when systemd slice of a large cron job exits can easily reach
hundreds of thousands or millions.
The logic in inode_do_switch_wbs() which sorts the inode into
appropriate place in b_dirty list of the target wb however has linear
complexity in the number of dirty inodes thus overall time complexity
of switching all the inodes is quadratic leading to workers being
pegged for hours consuming 100% of the CPU and switching inodes to the
parent wb.
Simple reproducer of the issue:
FILES=10000
# Filesystem mounted with lazytime mount option
MNT=/mnt/
echo "Creating files and switching timestamps"
for (( j = 0; j < 50; j ++ )); do
mkdir $MNT/dir$j
for (( i = 0; i < $FILES; i++ )); do
echo "foo" >$MNT/dir$j/file$i
done
touch -a -t 202501010000 $MNT/dir$j/file*
done
wait
echo "Syncing and flushing"
sync
echo 3 >/proc/sys/vm/drop_caches
echo "Reading all files from a cgroup"
mkdir /sys/fs/cgroup/unified/mycg1 || exit
echo $$ >/sys/fs/cgroup/unified/mycg1/cgroup.procs || exit
for (( j = 0; j < 50; j ++ )); do
cat /mnt/dir$j/file* >/dev/null &
done
wait
echo "Switching wbs"
# Now rmdir the cgroup after the script exits
This can be solved by:
- Avoiding contention on the wb->list_lock when switching inodes by
running a single work item per wb and managing a queue of items
switching to the wb
- Allowing rescheduling when switching inodes over to a different
cgroup to avoid softlockups
- Maintaining the b_dirty list ordering instead of sorting it"
* tag 'vfs-6.18-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
writeback: Add tracepoint to track pending inode switches
writeback: Avoid excessively long inode switching times
writeback: Avoid softlockup when switching many inodes
writeback: Avoid contention on wb->list_lock when switching inodes
Pull namespace updates from Christian Brauner:
"This contains a larger set of changes around the generic namespace
infrastructure of the kernel.
Each specific namespace type (net, cgroup, mnt, ...) embedds a struct
ns_common which carries the reference count of the namespace and so
on.
We open-coded and cargo-culted so many quirks for each namespace type
that it just wasn't scalable anymore. So given there's a bunch of new
changes coming in that area I've started cleaning all of this up.
The core change is to make it possible to correctly initialize every
namespace uniformly and derive the correct initialization settings
from the type of the namespace such as namespace operations, namespace
type and so on. This leaves the new ns_common_init() function with a
single parameter which is the specific namespace type which derives
the correct parameters statically. This also means the compiler will
yell as soon as someone does something remotely fishy.
The ns_common_init() addition also allows us to remove ns_alloc_inum()
and drops any special-casing of the initial network namespace in the
network namespace initialization code that Linus complained about.
Another part is reworking the reference counting. The reference
counting was open-coded and copy-pasted for each namespace type even
though they all followed the same rules. This also removes all open
accesses to the reference count and makes it private and only uses a
very small set of dedicated helpers to manipulate them just like we do
for e.g., files.
In addition this generalizes the mount namespace iteration
infrastructure introduced a few cycles ago. As reminder, the vfs makes
it possible to iterate sequentially and bidirectionally through all
mount namespaces on the system or all mount namespaces that the caller
holds privilege over. This allow userspace to iterate over all mounts
in all mount namespaces using the listmount() and statmount() system
call.
Each mount namespace has a unique identifier for the lifetime of the
systems that is exposed to userspace. The network namespace also has a
unique identifier working exactly the same way. This extends the
concept to all other namespace types.
The new nstree type makes it possible to lookup namespaces purely by
their identifier and to walk the namespace list sequentially and
bidirectionally for all namespace types, allowing userspace to iterate
through all namespaces. Looking up namespaces in the namespace tree
works completely locklessly.
This also means we can move the mount namespace onto the generic
infrastructure and remove a bunch of code and members from struct
mnt_namespace itself.
There's a bunch of stuff coming on top of this in the future but for
now this uses the generic namespace tree to extend a concept
introduced first for pidfs a few cycles ago. For a while now we have
supported pidfs file handles for pidfds. This has proven to be very
useful.
This extends the concept to cover namespaces as well. It is possible
to encode and decode namespace file handles using the common
name_to_handle_at() and open_by_handle_at() apis.
As with pidfs file handles, namespace file handles are exhaustive,
meaning it is not required to actually hold a reference to nsfs in
able to decode aka open_by_handle_at() a namespace file handle.
Instead the FD_NSFS_ROOT constant can be passed which will let the
kernel grab a reference to the root of nsfs internally and thus decode
the file handle.
Namespaces file descriptors can already be derived from pidfds which
means they aren't subject to overmount protection bugs. IOW, it's
irrelevant if the caller would not have access to an appropriate
/proc/<pid>/ns/ directory as they could always just derive the
namespace based on a pidfd already.
It has the same advantage as pidfds. It's possible to reliably and for
the lifetime of the system refer to a namespace without pinning any
resources and to compare them trivially.
Permission checking is kept simple. If the caller is located in the
namespace the file handle refers to they are able to open it otherwise
they must hold privilege over the owning namespace of the relevant
namespace.
The namespace file handle layout is exposed as uapi and has a stable
and extensible format. For now it simply contains the namespace
identifier, the namespace type, and the inode number. The stable
format means that userspace may construct its own namespace file
handles without going through name_to_handle_at() as they are already
allowed for pidfs and cgroup file handles"
* tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (65 commits)
ns: drop assert
ns: move ns type into struct ns_common
nstree: make struct ns_tree private
ns: add ns_debug()
ns: simplify ns_common_init() further
cgroup: add missing ns_common include
ns: use inode initializer for initial namespaces
selftests/namespaces: verify initial namespace inode numbers
ns: rename to __ns_ref
nsfs: port to ns_ref_*() helpers
net: port to ns_ref_*() helpers
uts: port to ns_ref_*() helpers
ipv4: use check_net()
net: use check_net()
net-sysfs: use check_net()
user: port to ns_ref_*() helpers
time: port to ns_ref_*() helpers
pid: port to ns_ref_*() helpers
ipc: port to ns_ref_*() helpers
cgroup: port to ns_ref_*() helpers
...
Pull afs updates from Christian Brauner:
"This contains the change to enable afs to support RENAME_NOREPLACE and
RENAME_EXCHANGE if the server supports it"
* tag 'vfs-6.18-rc1.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
afs: Add support for RENAME_NOREPLACE and RENAME_EXCHANGE
Pull copy_process updates from Christian Brauner:
"This contains the changes to enable support for clone3() on nios2
which apparently is still a thing.
The more exciting part of this is that it cleans up the inconsistency
in how the 64-bit flag argument is passed from copy_process() into the
various other copy_*() helpers"
[ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ]
* tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
nios2: implement architecture-specific portion of sys_clone3
arch: copy_thread: pass clone_flags as u64
copy_process: pass clone_flags as u64 across calltree
copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
Pull vfs workqueue updates from Christian Brauner:
"This contains various workqueue changes affecting the filesystem
layer.
Currently if a user enqueue a work item using schedule_delayed_work()
the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies
to schedule_work() that is using system_wq and queue_work(), that
makes use again of WORK_CPU_UNBOUND.
This replaces the use of system_wq and system_unbound_wq. system_wq is
a per-CPU workqueue which isn't very obvious from the name and
system_unbound_wq is to be used when locality is not required.
So this renames system_wq to system_percpu_wq, and system_unbound_wq
to system_dfl_wq.
This also adds a new WQ_PERCPU flag to allow the fs subsystem users to
explicitly request the use of per-CPU behavior. Both WQ_UNBOUND and
WQ_PERCPU flags coexist for one release cycle to allow callers to
transition their calls. WQ_UNBOUND will be removed in a next release
cycle"
* tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: WQ_PERCPU added to alloc_workqueue users
fs: replace use of system_wq with system_percpu_wq
fs: replace use of system_unbound_wq with system_dfl_wq
Pull vfs rust updates from Christian Brauner:
"This contains a few minor vfs rust changes:
- Add the pid namespace Rust wrappers to the correct MAINTAINERS
entry
- Use to_result() in the Rust file error handling code
- Update imports for fs and pid_namespce Rust wrappers"
* tag 'vfs-6.18-rc1.rust' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
rust: file: use to_result for error handling
pid: add Rust files to MAINTAINERS
rust: fs: update ARef and AlwaysRefCounted imports from sync::aref
rust: pid_namespace: update AlwaysRefCounted imports from sync::aref
Pull pidfs updates from Christian Brauner:
"This just contains a few changes to pid_nr_ns() to make it more robust
and cleans up or improves a few users that ab- or misuse it currently"
* tag 'vfs-6.18-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pid: change task_state() to use task_ppid_nr_ns()
pid: change bacct_add_tsk() to use task_ppid_nr_ns()
pid: make __task_pid_nr_ns(ns => NULL) safe for zombie callers
pid: Add a judgment for ns null in pid_nr_ns
Pull vfs iomap updates from Christian Brauner:
"This contains minor fixes and cleanups to the iomap code.
Nothing really stands out here"
* tag 'vfs-6.18-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: error out on file IO when there is no inline_data buffer
iomap: trace iomap_zero_iter zeroing activities
Pull vfs inode updates from Christian Brauner:
"This contains a series I originally wrote and that Eric brought over
the finish line. It moves out the i_crypt_info and i_verity_info
pointers out of 'struct inode' and into the fs-specific part of the
inode.
So now the few filesytems that actually make use of this pay the price
in their own private inode storage instead of forcing it upon every
user of struct inode.
The pointer for the crypt and verity info is simply found by storing
an offset to its address in struct fsverity_operations and struct
fscrypt_operations. This shrinks struct inode by 16 bytes.
I hope to move a lot more out of it in the future so that struct inode
becomes really just about very core stuff that we need, much like
struct dentry and struct file, instead of the dumping ground it has
become over the years.
On top of this are a various changes associated with the ongoing inode
lifetime handling rework that multiple people are pushing forward:
- Stop accessing inode->i_count directly in f2fs and gfs2. They
simply should use the __iget() and iput() helpers
- Make the i_state flags an enum
- Rework the iput() logic
Currently, if we are the last iput, and we have the I_DIRTY_TIME
bit set, we will grab a reference on the inode again and then mark
it dirty and then redo the put. This is to make sure we delay the
time update for as long as possible
We can rework this logic to simply dec i_count if it is not 1, and
if it is do the time update while still holding the i_count
reference
Then we can replace the atomic_dec_and_lock with locking the
->i_lock and doing atomic_dec_and_test, since we did the
atomic_add_unless above
- Add an icount_read() helper and convert everyone that accesses
inode->i_count directly for this purpose to use the helper
- Expand dump_inode() to dump more information about an inode helping
in debugging
- Add some might_sleep() annotations to iput() and associated
helpers"
* tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: add might_sleep() annotation to iput() and more
fs: expand dump_inode()
inode: fix whitespace issues
fs: add an icount_read helper
fs: rework iput logic
fs: make the i_state flags an enum
fs: stop accessing ->i_count directly in f2fs and gfs2
fsverity: check IS_VERITY() in fsverity_cleanup_inode()
fs: remove inode::i_verity_info
btrfs: move verity info pointer to fs-specific part of inode
f2fs: move verity info pointer to fs-specific part of inode
ext4: move verity info pointer to fs-specific part of inode
fsverity: add support for info in fs-specific part of inode
fs: remove inode::i_crypt_info
ceph: move crypt info pointer to fs-specific part of inode
ubifs: move crypt info pointer to fs-specific part of inode
f2fs: move crypt info pointer to fs-specific part of inode
ext4: move crypt info pointer to fs-specific part of inode
fscrypt: add support for info in fs-specific part of inode
fscrypt: replace raw loads of info pointer with helper function
Pull vfs mount updates from Christian Brauner:
"This contains some work around mount api handling:
- Output the warning message for mnt_too_revealing() triggered during
fsmount() to the fscontext log. This makes it possible for the
mount tool to output appropriate warnings on the command line.
For example, with the newest fsopen()-based mount(8) from
util-linux, the error messages now look like:
# mount -t proc proc /tmp
mount: /tmp: fsmount() failed: VFS: Mount too revealing.
dmesg(1) may have more information after failed mount system call.
- Do not consume fscontext log entries when returning -EMSGSIZE
Userspace generally expects APIs that return -EMSGSIZE to allow for
them to adjust their buffer size and retry the operation.
However, the fscontext log would previously clear the message even
in the -EMSGSIZE case.
Given that it is very cheap for us to check whether the buffer is
too small before we remove the message from the ring buffer, let's
just do that instead.
- Drop an unused argument from do_remount()"
* tag 'vfs-6.18-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
vfs: fs/namespace.c: remove ms_flags argument from do_remount
selftests/filesystems: add basic fscontext log tests
fscontext: do not consume log entries when returning -EMSGSIZE
vfs: output mount_too_revealing() errors to fscontext
docs/vfs: Remove mentions to the old mount API helpers
fscontext: add custom-prefix log helpers
fs: Remove mount_bdev
fs: Remove mount_nodev
When calling in listmount() mnt_ns_release() may be passed a NULL
pointer. Handle that case gracefully.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull misc vfs updates from Christian Brauner:
"This contains the usual selections of misc updates for this cycle.
Features:
- Add "initramfs_options" parameter to set initramfs mount options.
This allows to add specific mount options to the rootfs to e.g.,
limit the memory size
- Add RWF_NOSIGNAL flag for pwritev2()
Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE
signal from being raised when writing on disconnected pipes or
sockets. The flag is handled directly by the pipe filesystem and
converted to the existing MSG_NOSIGNAL flag for sockets
- Allow to pass pid namespace as procfs mount option
Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs
mount is auto-selected based on the mounting process's active
pidns, and the pidns itself is basically hidden once the mount has
been constructed)
This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns
of a procfs mount as desired. Examples include:
* In order to bypass the mnt_too_revealing() check, Kubernetes
creates a procfs mount from an empty pidns so that user
namespaced containers can be nested (without this, the nested
containers would fail to mount procfs)
But this requires forking off a helper process because you cannot
just one-shot this using mount(2)
* Container runtimes in general need to fork into a container
before configuring its mounts, which can lead to security issues
in the case of shared-pidns containers (a privileged process in
the pidns can interact with your container runtime process)
While SUID_DUMP_DISABLE and user namespaces make this less of an
issue, the strict need for this due to a minor uAPI wart is kind
of unfortunate
Things would be much easier if there was a way for userspace to
just specify the pidns they want. So this pull request contains
changes to implement a new "pidns" argument which can be set
using fsconfig(2):
fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
or classic mount(2) / mount(8):
// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
Cleanups:
- Remove the last references to EXPORT_OP_ASYNC_LOCK
- Make file_remove_privs_flags() static
- Remove redundant __GFP_NOWARN when GFP_NOWAIT is used
- Use try_cmpxchg() in start_dir_add()
- Use try_cmpxchg() in sb_init_done_wq()
- Replace offsetof() with struct_size() in ioctl_file_dedupe_range()
- Remove vfs_ioctl() export
- Replace rwlock() with spinlock in epoll code as rwlock causes
priority inversion on preempt rt kernels
- Make ns_entries in fs/proc/namespaces const
- Use a switch() statement() in init_special_inode() just like we do
in may_open()
- Use struct_size() in dir_add() in the initramfs code
- Use str_plural() in rd_load_image()
- Replace strcpy() with strscpy() in find_link()
- Rename generic_delete_inode() to inode_just_drop() and
generic_drop_inode() to inode_generic_drop()
- Remove unused arguments from fcntl_{g,s}et_rw_hint()
Fixes:
- Document @name parameter for name_contains_dotdot() helper
- Fix spelling mistake
- Always return zero from replace_fd() instead of the file descriptor
number
- Limit the size for copy_file_range() in compat mode to prevent a
signed overflow
- Fix debugfs mount options not being applied
- Verify the inode mode when loading it from disk in minixfs
- Verify the inode mode when loading it from disk in cramfs
- Don't trigger automounts with RESOLVE_NO_XDEV
If openat2() was called with RESOLVE_NO_XDEV it didn't traverse
through automounts, but could still trigger them
- Add FL_RECLAIM flag to show_fl_flags() macro so it appears in
tracepoints
- Fix unused variable warning in rd_load_image() on s390
- Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD
- Use ns_capable_noaudit() when determining net sysctl permissions
- Don't call path_put() under namespace semaphore in listmount() and
statmount()"
* tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits)
fcntl: trim arguments
listmount: don't call path_put() under namespace semaphore
statmount: don't call path_put() under namespace semaphore
pid: use ns_capable_noaudit() when determining net sysctl permissions
fs: rename generic_delete_inode() and generic_drop_inode()
init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD
initramfs: Replace strcpy() with strscpy() in find_link()
initrd: Use str_plural() in rd_load_image()
initramfs: Use struct_size() helper to improve dir_add()
initrd: Fix unused variable warning in rd_load_image() on s390
fs: use the switch statement in init_special_inode()
fs/proc/namespaces: make ns_entries const
filelock: add FL_RECLAIM to show_fl_flags() macro
eventpoll: Replace rwlock with spinlock
selftests/proc: add tests for new pidns APIs
procfs: add "pidns" mount option
pidns: move is-ancestor logic to helper
openat2: don't trigger automounts with RESOLVE_NO_XDEV
namei: move cross-device check to __traverse_mounts
namei: remove LOOKUP_NO_XDEV check from handle_mounts
...
There's a silly problem with the CC_HAS_ASM_GOTO_OUTPUT test: even with
a working compiler it will fail on some architectures simply because it
uses the mnemonic "jmp" for testing the inline asm.
And as reported by Geert, not all architectures use that mnemonic, so
the test fails spuriously on such platforms (including arm and riscv,
but also several other architectures).
This issue avoided any obvious test failures because the build still
works thanks to falling back on the old non-asm-goto code, which just
generates worse code.
Just use an empty asm statement instead.
Reported-and-tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Fixes: e2ffa15b9b ("kbuild: Disable CC_HAS_ASM_GOTO_OUTPUT on clang < 17")
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a mix of using spin_lock() and spin_lock_irq(), which
is confusing as IB_POLL_WORKQUEUE is used and no code would
be called from any interrupt. So using spin_lock() or even
mutexes would be ok.
But we'll soon share common code with the client, which uses
IB_POLL_SOFTIRQ.
And Documentation/kernel-hacking/locking.rst section
"Cheat Sheet For Locking" says:
- Otherwise (== data can be touched in an interrupt), use
spin_lock_irqsave() and
spin_unlock_irqrestore().
So in order to keep it simple and safe we use that version
now. It will help merging functions into common code and
have consistent locking in all cases.
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>