Pull non-MM updates from Andrew Morton:
- "panic: sys_info: Refactor and fix a potential issue" (Andy Shevchenko)
fixes a build issue and does some cleanup in ib/sys_info.c
- "Implement mul_u64_u64_div_u64_roundup()" (David Laight)
enhances the 64-bit math code on behalf of a PWM driver and beefs up
the test module for these library functions
- "scripts/gdb/symbols: make BPF debug info available to GDB" (Ilya Leoshkevich)
makes BPF symbol names, sizes, and line numbers available to the GDB
debugger
- "Enable hung_task and lockup cases to dump system info on demand" (Feng Tang)
adds a sysctl which can be used to cause additional info dumping when
the hung-task and lockup detectors fire
- "lib/base64: add generic encoder/decoder, migrate users" (Kuan-Wei Chiu)
adds a general base64 encoder/decoder to lib/ and migrates several
users away from their private implementations
- "rbree: inline rb_first() and rb_last()" (Eric Dumazet)
makes TCP a little faster
- "liveupdate: Rework KHO for in-kernel users" (Pasha Tatashin)
reworks the KEXEC Handover interfaces in preparation for Live Update
Orchestrator (LUO), and possibly for other future clients
- "kho: simplify state machine and enable dynamic updates" (Pasha Tatashin)
increases the flexibility of KEXEC Handover. Also preparation for LUO
- "Live Update Orchestrator" (Pasha Tatashin)
is a major new feature targeted at cloud environments. Quoting the
cover letter:
This series introduces the Live Update Orchestrator, a kernel
subsystem designed to facilitate live kernel updates using a
kexec-based reboot. This capability is critical for cloud
environments, allowing hypervisors to be updated with minimal
downtime for running virtual machines. LUO achieves this by
preserving the state of selected resources, such as memory,
devices and their dependencies, across the kernel transition.
As a key feature, this series includes support for preserving
memfd file descriptors, which allows critical in-memory data, such
as guest RAM or any other large memory region, to be maintained in
RAM across the kexec reboot.
Mike Rappaport merits a mention here, for his extensive review and
testing work.
- "kexec: reorganize kexec and kdump sysfs" (Sourabh Jain)
moves the kexec and kdump sysfs entries from /sys/kernel/ to
/sys/kernel/kexec/ and adds back-compatibility symlinks which can
hopefully be removed one day
- "kho: fixes for vmalloc restoration" (Mike Rapoport)
fixes a BUG which was being hit during KHO restoration of vmalloc()
regions
* tag 'mm-nonmm-stable-2025-12-06-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (139 commits)
calibrate: update header inclusion
Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
vmcoreinfo: track and log recoverable hardware errors
kho: fix restoring of contiguous ranges of order-0 pages
kho: kho_restore_vmalloc: fix initialization of pages array
MAINTAINERS: TPM DEVICE DRIVER: update the W-tag
init: replace simple_strtoul with kstrtoul to improve lpj_setup
KHO: fix boot failure due to kmemleak access to non-PRESENT pages
Documentation/ABI: new kexec and kdump sysfs interface
Documentation/ABI: mark old kexec sysfs deprecated
kexec: move sysfs entries to /sys/kernel/kexec
test_kho: always print restore status
kho: free chunks using free_page() instead of kfree()
selftests/liveupdate: add kexec test for multiple and empty sessions
selftests/liveupdate: add simple kexec-based selftest for LUO
selftests/liveupdate: add userspace API selftests
docs: add documentation for memfd preservation via LUO
mm: memfd_luo: allow preserving memfd
liveupdate: luo_file: add private argument to store runtime state
mm: shmem: export some functions to internal.h
...
Pull printk updates from Petr Mladek:
- Allow creaing nbcon console drivers with an unsafe write_atomic()
callback that can only be called by the final nbcon_atomic_flush_unsafe().
Otherwise, the driver would rely on the kthread.
It is going to be used as the-best-effort approach for an
experimental nbcon netconsole driver, see
https://lore.kernel.org/r/20251121-nbcon-v1-2-503d17b2b4af@debian.org
Note that a safe .write_atomic() callback is supposed to work in NMI
context. But some networking drivers are not safe even in IRQ
context:
https://lore.kernel.org/r/oc46gdpmmlly5o44obvmoatfqo5bhpgv7pabpvb6sjuqioymcg@gjsma3ghoz35
In an ideal world, all networking drivers would be fixed first and
the atomic flush would be blocked only in NMI context. But it brings
the question how reliable networking drivers are when the system is
in a bad state. They might block flushing more reliable serial
consoles which are more suitable for serious debugging anyway.
- Allow to use the last 4 bytes of the printk ring buffer.
- Prevent queuing IRQ work and block printk kthreads when consoles are
suspended. Otherwise, they create non-necessary churn or even block
the suspend.
- Release console_lock() between each record in the kthread used for
legacy consoles on RT. It might significantly speed up the boot.
- Release nbcon context between each record in the atomic flush. It
prevents stalls of the related printk kthread after it has lost the
ownership in the middle of a record
- Add support for NBCON consoles into KDB
- Add %ptsP modifier for printing struct timespec64 and use it where
possible
- Misc code clean up
* tag 'printk-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (48 commits)
printk: Use console_is_usable on console_unblank
arch: um: kmsg_dump: Use console_is_usable
drivers: serial: kgdboc: Drop checks for CON_ENABLED and CON_BOOT
lib/vsprintf: Unify FORMAT_STATE_NUM handlers
printk: Avoid irq_work for printk_deferred() on suspend
printk: Avoid scheduling irq_work on suspend
printk: Allow printk_trigger_flush() to flush all types
tracing: Switch to use %ptSp
scsi: snic: Switch to use %ptSp
scsi: fnic: Switch to use %ptSp
s390/dasd: Switch to use %ptSp
ptp: ocp: Switch to use %ptSp
pps: Switch to use %ptSp
PCI: epf-test: Switch to use %ptSp
net: dsa: sja1105: Switch to use %ptSp
mmc: mmc_test: Switch to use %ptSp
media: av7110: Switch to use %ptSp
ipmi: Switch to use %ptSp
igb: Switch to use %ptSp
e1000e: Switch to use %ptSp
...
The wake_up_bit() is called in ceph_async_unlink_cb(),
wake_async_create_waiters(), and ceph_finish_async_create().
It makes sense to switch on clear_bit() function, because
it makes the code much cleaner and easier to understand.
More important rework is the adding of smp_mb__after_atomic()
memory barrier after the bit modification and before
wake_up_bit() call. It can prevent potential race condition
of accessing the modified bit in other threads. Luckily,
clear_and_wake_up_bit() already implements the required
functionality pattern:
static inline void clear_and_wake_up_bit(int bit, unsigned long *word)
{
clear_bit_unlock(bit, word);
/* See wake_up_bit() for which memory barrier you need to use. */
smp_mb__after_atomic();
wake_up_bit(word, bit);
}
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add validation to ensure the cached parent directory inode matches the
directory info in MDS replies. This prevents client-side race conditions
where concurrent operations (e.g. rename) cause r_parent to become stale
between request initiation and reply processing, which could lead to
applying state changes to incorrect directory inodes.
[ idryomov: folded a kerneldoc fixup and a follow-up fix from Alex to
move CEPH_CAP_PIN reference when r_parent is updated:
When the parent directory lock is not held, req->r_parent can become
stale and is updated to point to the correct inode. However, the
associated CEPH_CAP_PIN reference was not being adjusted. The
CEPH_CAP_PIN is a reference on an inode that is tracked for
accounting purposes. Moving this pin is important to keep the
accounting balanced. When the pin was not moved from the old parent
to the new one, it created two problems: The reference on the old,
stale parent was never released, causing a reference leak.
A reference for the new parent was never acquired, creating the risk
of a reference underflow later in ceph_mdsc_release_request(). This
patch corrects the logic by releasing the pin from the old parent and
acquiring it for the new parent when r_parent is switched. This
ensures reference accounting stays balanced. ]
Cc: stable@vger.kernel.org
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Lift copying the name into callers of ceph_encode_encrypted_dname()
that do not have it already copied; ceph_encode_encrypted_fname()
disappears.
That fixes a UAF in ceph_mdsc_build_path() - while the initial copy
of plaintext into buf is done under ->d_lock, we access the
original name again in ceph_encode_encrypted_fname() and that is
done without any locking. With ceph_encode_encrypted_dname() using
the stable copy the problem goes away.
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull vfs ceph updates from Christian Brauner:
"This contains the work to remove access to page->index from ceph
and fixes the test failure observed for ceph with generic/421 by
refactoring ceph_writepages_start()"
* tag 'vfs-6.15-rc1.ceph' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fscrypt: Change fscrypt_encrypt_pagecache_blocks() to take a folio
ceph: Fix error handling in fill_readdir_cache()
fs: Remove page_mkwrite_check_truncate()
ceph: Pass a folio to ceph_allocate_page_array()
ceph: Convert ceph_move_dirty_page_in_page_array() to move_dirty_folio_in_page_array()
ceph: Remove uses of page from ceph_process_folio_batch()
ceph: Convert ceph_check_page_before_write() to use a folio
ceph: Convert writepage_nounlock() to write_folio_nounlock()
ceph: Convert ceph_readdir_cache_control to store a folio
ceph: Convert ceph_find_incompatible() to take a folio
ceph: Use a folio in ceph_page_mkwrite()
ceph: Remove ceph_writepage()
ceph: fix generic/421 test failure
ceph: introduce ceph_submit_write() method
ceph: introduce ceph_process_folio_batch() method
ceph: extend ceph_writeback_ctl for ceph_writepages_start() refactoring
ceph already splices the correct dentry (in splice_dentry()) from the
result of mkdir but does nothing more with it.
Now that ->mkdir can return a dentry, return the correct dentry.
Note that previously ceph_mkdir() could call
ceph_init_inode_acls()
on the inode from the wrong dentry, which would be NULL. This
is safe as ceph_init_inode_acls() checks for NULL, but is not
strictly correct. With this patch, the inode for the returned dentry
is passed to ceph_init_inode_acls().
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-4-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have
complete control of sequencing on the actual filesystem (e.g. on a
different server) and may find that the inode created for a mkdir
request already exists in the icache and dcache by the time the mkdir
request returns. For example, if the filesystem is mounted twice the
directory could be visible on the other mount before it is on the
original mount, and a pair of name_to_handle_at(), open_by_handle_at()
calls could instantiate the directory inode with an IS_ROOT() dentry
before the first mkdir returns.
This means that the dentry passed to ->mkdir() may not be the one that
is associated with the inode after the ->mkdir() completes. Some
callers need to interact with the inode after the ->mkdir completes and
they currently need to perform a lookup in the (rare) case that the
dentry is no longer hashed.
This lookup-after-mkdir requires that the directory remains locked to
avoid races. Planned future patches to lock the dentry rather than the
directory will mean that this lookup cannot be performed atomically with
the mkdir.
To remove this barrier, this patch changes ->mkdir to return the
resulting dentry if it is different from the one passed in.
Possible returns are:
NULL - the directory was created and no other dentry was used
ERR_PTR() - an error occurred
non-NULL - this other dentry was spliced in
This patch only changes file-systems to return "ERR_PTR(err)" instead of
"err" or equivalent transformations. Subsequent patches will make
further changes to some file-systems to return a correct dentry.
Not all filesystems reliably result in a positive hashed dentry:
- NFS, cifs, hostfs will sometimes need to perform a lookup of
the name to get inode information. Races could result in this
returning something different. Note that this lookup is
non-atomic which is what we are trying to avoid. Placing the
lookup in filesystem code means it only happens when the filesystem
has no other option.
- kernfs and tracefs leave the dentry negative and the ->revalidate
operation ensures that lookup will be called to correctly populate
the dentry. This could be fixed but I don't think it is important
to any of the users of vfs_mkdir() which look at the dentry.
The recommendation to use
d_drop();d_splice_alias()
is ugly but fits with current practice. A planned future patch will
change this.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
Currently get_fscrypt_altname() requires ->r_dentry->d_name to be stable
and it gets that in almost all cases. The only exception is ->d_revalidate(),
where we have a stable name, but it's passed separately - dentry->d_name
is not stable there.
Propagate it down to get_fscrypt_altname() as a new field of struct
ceph_mds_request - ->r_dname, to be used instead ->r_dentry->d_name
when non-NULL.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
No need to mess with the boilerplate for obtaining what we already
have. Note that ceph is one of the "will want a path from filesystem
root if we want to talk to server" cases, so the name of the last
component is of little use - it is passed to fscrypt_d_revalidate()
and it's used to deal with (also crypt-related) case in request
marshalling, when encrypted name turns out to be too long. The former
is not a problem, but the latter is racy; that part will be handled
in the next commit.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
->d_revalidate() often needs to access dentry parent and name; that has
to be done carefully, since the locking environment varies from caller
to caller. We are not guaranteed that dentry in question will not be
moved right under us - not unless the filesystem is such that nothing
on it ever gets renamed.
It can be dealt with, but that results in boilerplate code that isn't
even needed - the callers normally have just found the dentry via dcache
lookup and want to verify that it's in the right place; they already
have the values of ->d_parent and ->d_name stable. There is a couple
of exceptions (overlayfs and, to less extent, ecryptfs), but for the
majority of calls that song and dance is not needed at all.
It's easier to make ecryptfs and overlayfs find and pass those values if
there's a ->d_revalidate() instance to be called, rather than doing that
in the instances.
This commit only changes the calling conventions; making use of supplied
values is left to followups.
NOTE: some instances need more than just the parent - things like CIFS
may need to build an entire path from filesystem root, so they need
more precautions than the usual boilerplate. This series doesn't
do anything to that need - these filesystems have to keep their locking
mechanisms (rename_lock loops, use of dentry_path_raw(), private rwsem
a-la v9fs).
One thing to keep in mind when using name is that name->name will normally
point into the pathname being resolved; the filename in question occupies
name->len bytes starting at name->name, and there is NUL somewhere after it,
but it the next byte might very well be '/' rather than '\0'. Do not
ignore name->len.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull ceph updates from Ilya Dryomov:
"Three CephFS fixes from Xiubo and Luis and a bunch of assorted
cleanups"
* tag 'ceph-for-6.12-rc1' of https://github.com/ceph/ceph-client:
ceph: remove the incorrect Fw reference check when dirtying pages
ceph: Remove empty definition in header file
ceph: Fix typo in the comment
ceph: fix a memory leak on cap_auths in MDS client
ceph: flush all caps releases when syncing the whole filesystem
ceph: rename ceph_flush_cap_releases() to ceph_flush_session_cap_releases()
libceph: use min() to simplify code in ceph_dns_resolve_name()
ceph: Convert to use jiffies macro
ceph: Remove unused declarations
Correctly spelled comments make it easier for the reader to understand
the code.
replace 'tagert' with 'target' in the comment &
replace 'vaild' with 'valid' in the comment &
replace 'carefull' with 'careful' in the comment &
replace 'trsaverse' with 'traverse' in the comment.
Signed-off-by: Yan Zhen <yanzhen@vivo.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Pull ceph updates from Ilya Dryomov:
"A small patchset to address bogus I/O errors and ultimately an
assertion failure in the face of watch errors with -o exclusive
mappings in RBD marked for stable and some assorted CephFS fixes"
* tag 'ceph-for-6.11-rc1' of https://github.com/ceph/ceph-client:
rbd: don't assume rbd_is_lock_owner() for exclusive mappings
rbd: don't assume RBD_LOCK_STATE_LOCKED for exclusive mappings
rbd: rename RBD_LOCK_STATE_RELEASING and releasing_wait
ceph: fix incorrect kmalloc size of pagevec mempool
ceph: periodically flush the cap releases
ceph: convert comma to semicolon in __ceph_dentry_dir_lease_touch()
ceph: use cap_wait_list only if debugfs is enabled
Pull ceph updates from Ilya Dryomov:
"Assorted CephFS fixes and cleanups with nothing standing out"
* tag 'ceph-for-6.8-rc1' of https://github.com/ceph/ceph-client:
ceph: get rid of passing callbacks in __dentry_leases_walk()
ceph: d_obtain_{alias,root}(ERR_PTR(...)) will do the right thing
ceph: fix invalid pointer access if get_quota_realm return ERR_PTR
ceph: remove duplicated code in ceph_netfs_issue_read()
ceph: send oldest_client_tid when renewing caps
ceph: rename create_session_open_msg() to create_session_full_msg()
ceph: select FS_ENCRYPTION_ALGS if FS_ENCRYPTION
ceph: fix deadlock or deadcode of misusing dget()
ceph: try to allocate a smaller extent map for sparse read
libceph: remove MAX_EXTENTS check for sparse reads
ceph: reinitialize mds feature bit even when session in open
ceph: skip reconnecting if MDS is not ready
__dentry_leases_walk() gets a callback and calls it for
a bunch of denties; there are exactly two callers and
we already have a flag telling them apart - lwc->dir_lease.
Seeing that indirect calls are costly these days, let's
get rid of the callback and just call the right function
directly. Has a side benefit of saner signatures...
[ xiubli: a minor fix in the commit title ]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Saves a pointer per struct dentry and actually makes the things less
clumsy. Cleaned the d_walk() and dcache_readdir() a bit by use
of hlist_for_... iterators.
A couple of new helpers - d_first_child() and d_next_sibling(),
to make the expressions less awful.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Multiple CephFS mounts on a host is increasingly common so
disambiguating messages like this is necessary and will make it easier
to debug issues.
At the same this will improve the debug logs to make them easier to
troubleshooting issues, such as print the ino# instead only printing
the memory addresses of the corresponding inodes and print the dentry
names instead of the corresponding memory addresses for the dentry,etc.
Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Instead of setting the no-key dentry, use the new
fscrypt_prepare_lookup_partial() helper. We still need to mark the
directory as incomplete if the directory was just unlocked.
In ceph_atomic_open() this fixes a bug where a dentry is incorrectly
set with DCACHE_NOKEY_NAME when 'dir' has been evicted but the key is
still available (for example, where there's a drop_caches).
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
With snapshot names encryption we can not allow snapshots to be created in
locked directories because the names wouldn't be encrypted. This patch
forces the directory to be unlocked to allow a snapshot to be created.
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For encrypted inodes, transmit a rounded-up size to the MDS as the
normal file size and send the real inode size in fscrypt_file field.
Also, fix up creates and truncates to also transmit fscrypt_file.
When we get an inode trace from the MDS, grab the fscrypt_file field if
the inode is encrypted, and use it to populate the i_size field instead
of the regular inode size field.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When setting a directory's crypt context, ceph_dir_clear_complete()
needs to be called otherwise if it was complete before, any existing
(old) dentry will still be valid.
This patch adds a wrapper around __fscrypt_prepare_readdir() which will
ensure a directory is marked as non-complete if key status changes.
[ xiubli: revise commit title per Milind ]
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add the appropriate calls into fscrypt for various actions, including
link, rename, setattr, and the open codepaths.
Disable fallocate for encrypted inodes -- hopefully, just for now.
If we have an encrypted inode, then the client will need to re-encrypt
the contents of the new object. Disable copy offload to or from
encrypted inodes.
Set i_blkbits to crypto block size for encrypted inodes -- some of the
underlying infrastructure for fscrypt relies on i_blkbits being aligned
to crypto blocksize.
Report STATX_ATTR_ENCRYPTED on encrypted inodes.
[ lhenriques: forbid encryption with striped layouts ]
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When creating symlinks in encrypted directories, encrypt and
base64-encode the target with the new inode's key before sending to the
MDS.
When filling a symlinked inode, base64-decode it into a buffer that
we'll keep in ci->i_symlink. When get_link is called, decrypt the buffer
into a new one that will hang off i_link.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
To make it simpler to decrypt names in a readdir reply (i.e. before
we have a dentry), add a new ceph_encode_encrypted_fname()-like helper
that takes a qstr pointer instead of a dentry pointer.
Once we've decrypted the names in a readdir reply, we no longer need the
crypttext, so overwrite them in ceph_mds_reply_dir_entry with the
unencrypted names. Then in both ceph_readdir_prepopulate() and
ceph_readdir() we will use the dencrypted name directly.
[ jlayton: convert some BUG_ONs into error returns ]
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If we have a dentry which represents a no-key name, then we need to test
whether the parent directory's encryption key has since been added. Do
that before we test anything else about the dentry.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This is required so that we know to invalidate these dentries when the
directory is unlocked.
Atomic open can act as a lookup if handed a dentry that is negative on
the MDS. Ensure that we set DCACHE_NOKEY_NAME on the dentry in
atomic_open, if we don't have the key for the parent. Otherwise, we can
end up validating the dentry inappropriately if someone later adds a
key.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When creating a new inode, we need to determine the crypto context
before we can transmit the RPC. The fscrypt API has a routine for getting
a crypto context before a create occurs, but it requires an inode.
Change the ceph code to preallocate an inode in advance of a create of
any sort (open(), mknod(), symlink(), etc). Move the existing code that
generates the ACL and SELinux blobs into this routine since that's
mostly common across all the different codepaths.
In most cases, we just want to allow ceph_fill_trace to use that inode
after the reply comes in, so add a new field to the MDS request for it
(r_new_inode).
The async create codepath is a bit different though. In that case, we
want to hash the inode in advance of the RPC so that it can be used
before the reply comes in. If the call subsequently fails with
-EJUKEBOX, then just put the references and clean up the as_ctx. Note
that with this change, we now need to regenerate the as_ctx when this
occurs, but it's quite rare for it to happen.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
All users now just use '->iterate_shared()', which only takes the
directory inode lock for reading.
Filesystems that never got convered to shared mode now instead use a
wrapper that drops the lock, re-takes it in write mode, calls the old
function, and then downgrades the lock back to read mode.
This way the VFS layer and other callers no longer need to care about
filesystems that never got converted to the modern era.
The filesystems that use the new wrapper are ceph, coda, exfat, jfs,
ntfs, ocfs2, overlayfs, and vboxsf.
Honestly, several of them look like they really could just iterate their
directories in shared mode and skip the wrapper entirely, but the point
of this change is to not change semantics or fix filesystems that
haven't been fixed in the last 7+ years, but to finally get rid of the
dual iterators.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
For write requests the parent's mtime will be updated correspondingly.
And if the 'Xx' caps is issued and when releasing other caps together
with the write requests the MDS Locker will try to eval the xattr lock,
which need to change the locker state excl --> sync and need to do Xx
caps revocation.
Just voluntarily dropping CEPH_CAP_XATTR_EXCL caps to avoid a cap
revoke message, which could cause the mtime will be overwrote by stale
one.
[ idryomov: break unnecessarily long lines ]
Link: https://tracker.ceph.com/issues/61584
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When exporting the kceph to NFS it may pass a DCACHE_DISCONNECTED
dentry for the link operation. Then it will parse this dentry as a
snapdir, and the mds will fail the link request as -EROFS.
MDS allow clients to pass a ino# instead of a path.
Link: https://tracker.ceph.com/issues/59515
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
The current way of setting and getting posix acls through the generic
xattr interface is error prone and type unsafe. The vfs needs to
interpret and fixup posix acls before storing or reporting it to
userspace. Various hacks exist to make this work. The code is hard to
understand and difficult to maintain in it's current form. Instead of
making this work by hacking posix acls through xattr handlers we are
building a dedicated posix acl api around the get and set inode
operations. This removes a lot of hackiness and makes the codepaths
easier to maintain. A lot of background can be found in [1].
The current inode operation for getting posix acls takes an inode
argument but various filesystems (e.g., 9p, cifs, overlayfs) need access
to the dentry. In contrast to the ->set_acl() inode operation we cannot
simply extend ->get_acl() to take a dentry argument. The ->get_acl()
inode operation is called from:
acl_permission_check()
-> check_acl()
-> get_acl()
which is part of generic_permission() which in turn is part of
inode_permission(). Both generic_permission() and inode_permission() are
called in the ->permission() handler of various filesystems (e.g.,
overlayfs). So simply passing a dentry argument to ->get_acl() would
amount to also having to pass a dentry argument to ->permission(). We
should avoid this unnecessary change.
So instead of extending the existing inode operation rename it from
->get_acl() to ->get_inode_acl() and add a ->get_acl() method later that
passes a dentry argument and which filesystems that need access to the
dentry can implement instead of ->get_inode_acl(). Filesystems like cifs
which allow setting and getting posix acls but not using them for
permission checking during lookup can simply not implement
->get_inode_acl().
This is intended to be a non-functional change.
Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
Suggested-by/Inspired-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
In async unlink case the kclient won't wait for the first reply
from MDS and just drop all the links and unhash the dentry and then
succeeds immediately.
For any new create/link/rename,etc requests followed by using the
same file names we must wait for the first reply of the inflight
unlink request, or the MDS possibly will fail these following
requests with -EEXIST if the inflight async unlink request was
delayed for some reasons.
And the worst case is that for the none async openc request it will
successfully open the file if the CDentry hasn't been unlinked yet,
but later the previous delayed async unlink request will remove the
CDenty. That means the just created file is possiblly deleted later
by accident.
We need to wait for the inflight async unlink requests to finish
when creating new files/directories by using the same file names.
Link: https://tracker.ceph.com/issues/55332
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reset the last_readdir at the same time, and add a comment explaining
why we don't free last_readdir when dir_emit returns false.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>