7185 Commits

Author SHA1 Message Date
Linus Torvalds
6bb34aff1e Merge tag 'nfs-for-6.19-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
 "Bugfixes:
   - Fix 'nlink' attribute update races when unlinking a file
   - Add missing initialisers for the directory verifier in various
     places
   - Don't regress the NFSv4 open state due to misordered racing replies
   - Ensure the NFSv4.x callback server uses the correct transport
     connection
   - Fix potential use-after-free races when shutting down the NFSv4.x
     callback server
   - Fix a pNFS layout commit crash
   - Assorted fixes to ensure correct propagation of mount options when
     the client crosses a filesystem boundary and triggers the VFS
     automount code
   - More localio fixes

  Features and cleanups:
   - Add initial support for basic directory delegations
   - SunRPC back channel code cleanups"

* tag 'nfs-for-6.19-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (24 commits)
  NFSv4: Handle NFS4ERR_NOTSUPP errors for directory delegations
  nfs/localio: remove 61 byte hole from needless ____cacheline_aligned
  nfs/localio: remove alignment size checking in nfs_is_local_dio_possible
  NFS: Fix up the automount fs_context to use the correct cred
  NFS: Fix inheritance of the block sizes when automounting
  NFS: Automounted filesystems should inherit ro,noexec,nodev,sync flags
  Revert "nfs: ignore SB_RDONLY when mounting nfs"
  Revert "nfs: clear SB_RDONLY before getting superblock"
  Revert "nfs: ignore SB_RDONLY when remounting nfs"
  NFS: Add a module option to disable directory delegations
  NFS: Shortcut lookup revalidations if we have a directory delegation
  NFS: Request a directory delegation during RENAME
  NFS: Request a directory delegation on ACCESS, CREATE, and UNLINK
  NFS: Add support for sending GDD_GETATTR
  NFSv4/pNFS: Clear NFS_INO_LAYOUTCOMMIT in pnfs_mark_layout_stateid_invalid
  NFSv4.1: protect destroying and nullifying bc_serv structure
  SUNRPC: new helper function for stopping backchannel server
  SUNRPC: cleanup common code in backchannel request
  NFSv4.1: pass transport for callback shutdown
  NFSv4: ensure the open stateid seqid doesn't go backwards
  ...
2025-12-12 21:52:42 +12:00
Mike Snitzer
3af870aedb nfs/localio: fix regression due to out-of-order __put_cred
Commit f2060bdc21 ("nfs/localio: add refcounting for each iocb IO
associated with NFS pgio header") inadvertantly reintroduced the same
potential for __put_cred() triggering BUG_ON(cred == current->cred) that
commit 992203a1fb ("nfs/localio: restore creds before releasing pageio
data") fixed.

Fix this by saving and restoring the cred around each {read,write}_iter
call within the respective for loop of nfs_local_call_{read,write} using
scoped_with_creds().

NOTE: this fix started by first reverting the following commits:

 94afb627df ("nfs: use credential guards in nfs_local_call_read()")
 bff3c841f7 ("nfs: use credential guards in nfs_local_call_write()")
 1d18101a64 ("Merge tag 'kernel-6.19-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs")

followed by narrowly fixing the cred lifetime issue by using
scoped_with_creds().  In doing so, this commit's changes appear more
extensive than they really are (as evidenced by comparing to v6.18's
fs/nfs/localio.c).

Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/linux-next/20251205111942.4150b06f@canb.auug.org.au/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-12-05 17:53:56 -08:00
Trond Myklebust
bd3b04b46c NFSv4: Handle NFS4ERR_NOTSUPP errors for directory delegations
The error NFS4ERR_NOTSUPP will be returned for operations that are
legal, but not supported by the server.

Fixes: 156b094829 ("NFS: Request a directory delegation on ACCESS, CREATE, and UNLINK")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-12-05 19:34:29 -05:00
Mike Snitzer
0b873de2c0 nfs/localio: remove 61 byte hole from needless ____cacheline_aligned
struct nfs_local_kiocb used ____cacheline_aligned on its iters[] array
and as the structure evolved it caused a 61 byte hole to form.  Fix
this by removing ____cacheline_aligned and reordering iters[] before
iter_is_dio_aligned[].

Fixes: 6a218b9c31 ("nfs/localio: do not issue misaligned DIO out-of-order")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-12-05 19:34:29 -05:00
Mike Snitzer
f50d0328d0 nfs/localio: remove alignment size checking in nfs_is_local_dio_possible
This check to ensure dio_offset_align isn't larger than PAGE_SIZE is
no longer relevant (older iterations of NFS Direct was allocating
misaligned head and tail pages but no longer does, so this check isn't
needed).

Fixes: c817248fc8 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-12-05 19:34:29 -05:00
Linus Torvalds
db74a7d02a Merge tag 'vfs-6.19-rc1.directory.delegations' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull directory delegations update from Christian Brauner:
 "This contains the work for recall-only directory delegations for
  knfsd.

  Add support for simple, recallable-only directory delegations. This
  was decided at the fall NFS Bakeathon where the NFS client and server
  maintainers discussed how to merge directory delegation support.

  The approach starts with recallable-only delegations for several reasons:

   1. RFC8881 has gaps that are being addressed in RFC8881bis. In
      particular, it requires directory position information for
      CB_NOTIFY callbacks, which is difficult to implement properly
      under Linux. The spec is being extended to allow that information
      to be omitted.

   2. Client-side support for CB_NOTIFY still lags. The client side
      involves heuristics about when to request a delegation.

   3. Early indication shows simple, recallable-only delegations can
      help performance. Anna Schumaker mentioned seeing a multi-minute
      speedup in xfstests runs with them enabled.

  With these changes, userspace can also request a read lease on a
  directory that will be recalled on conflicting accesses. This may be
  useful for applications like Samba. Users can disable leases
  altogether via the fs.leases-enable sysctl if needed.

  VFS changes:

   - Dedicated Type for Delegations

     Introduce struct delegated_inode to track inodes that may have
     delegations that need to be broken. This replaces the previous
     approach of passing raw inode pointers through the delegation
     breaking code paths, providing better type safety and clearer
     semantics for the delegation machinery.

   - Break parent directory delegations in open(..., O_CREAT) codepath

   - Allow mkdir to wait for delegation break on parent

   - Allow rmdir to wait for delegation break on parent

   - Add try_break_deleg calls for parents to vfs_link(), vfs_rename(),
     and vfs_unlink()

   - Make vfs_create(), vfs_mknod(), and vfs_symlink() break delegations
     on parent directory

   - Clean up argument list for vfs_create()

   - Expose delegation support to userland

  Filelock changes:

   - Make lease_alloc() take a flags argument

   - Rework the __break_lease API to use flags

   - Add struct delegated_inode

   - Push the S_ISREG check down to ->setlease handlers

   - Lift the ban on directory leases in generic_setlease

  NFSD changes:

   - Allow filecache to hold S_IFDIR files

   - Allow DELEGRETURN on directories

   - Wire up GET_DIR_DELEGATION handling

  Fixes:

   - Fix kernel-doc warnings in __fcntl_getlease

   - Add needed headers for new struct delegation definition"

* tag 'vfs-6.19-rc1.directory.delegations' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  vfs: add needed headers for new struct delegation definition
  filelock: __fcntl_getlease: fix kernel-doc warnings
  vfs: expose delegation support to userland
  nfsd: wire up GET_DIR_DELEGATION handling
  nfsd: allow DELEGRETURN on directories
  nfsd: allow filecache to hold S_IFDIR files
  filelock: lift the ban on directory leases in generic_setlease
  vfs: make vfs_symlink break delegations on parent dir
  vfs: make vfs_mknod break delegations on parent directory
  vfs: make vfs_create break delegations on parent directory
  vfs: clean up argument list for vfs_create()
  vfs: break parent dir delegations in open(..., O_CREAT) codepath
  vfs: allow rmdir to wait for delegation break on parent
  vfs: allow mkdir to wait for delegation break on parent
  vfs: add try_break_deleg calls for parents to vfs_{link,rename,unlink}
  filelock: push the S_ISREG check down to ->setlease handlers
  filelock: add struct delegated_inode
  filelock: rework the __break_lease API to use flags
  filelock: make lease_alloc() take a flags argument
2025-12-01 15:34:41 -08:00
Linus Torvalds
1d18101a64 Merge tag 'kernel-6.19-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull cred guard updates from Christian Brauner:
 "This contains substantial credential infrastructure improvements
  adding guard-based credential management that simplifies code and
  eliminates manual reference counting in many subsystems.

  Features:

   - Kernel Credential Guards

     Add with_kernel_creds() and scoped_with_kernel_creds() guards that
     allow using the kernel credentials without allocating and copying
     them. This was requested by Linus after seeing repeated
     prepare_kernel_creds() calls that duplicate the kernel credentials
     only to drop them again later.

     The new guards completely avoid the allocation and never expose the
     temporary variable to hold the kernel credentials anywhere in
     callers.

   - Generic Credential Guards

     Add scoped_with_creds() guards for the common override_creds() and
     revert_creds() pattern. This builds on earlier work that made
     override_creds()/revert_creds() completely reference count free.

   - Prepare Credential Guards

     Add prepare credential guards for the more complex pattern of
     preparing a new set of credentials and overriding the current
     credentials with them:
      - prepare_creds()
      - modify new creds
      - override_creds()
      - revert_creds()
      - put_cred()

  Cleanups:

   - Make init_cred static since it should not be directly accessed

   - Add kernel_cred() helper to properly access the kernel credentials

   - Fix scoped_class() macro that was introduced two cycles ago

   - coredump: split out do_coredump() from vfs_coredump() for cleaner
     credential handling

   - coredump: move revert_cred() before coredump_cleanup()

   - coredump: mark struct mm_struct as const

   - coredump: pass struct linux_binfmt as const

   - sev-dev: use guard for path"

* tag 'kernel-6.19-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
  trace: use override credential guard
  trace: use prepare credential guard
  coredump: use override credential guard
  coredump: use prepare credential guard
  coredump: split out do_coredump() from vfs_coredump()
  coredump: mark struct mm_struct as const
  coredump: pass struct linux_binfmt as const
  coredump: move revert_cred() before coredump_cleanup()
  sev-dev: use override credential guards
  sev-dev: use prepare credential guard
  sev-dev: use guard for path
  cred: add prepare credential guard
  net/dns_resolver: use credential guards in dns_query()
  cgroup: use credential guards in cgroup_attach_permissions()
  act: use credential guards in acct_write_process()
  smb: use credential guards in cifs_get_spnego_key()
  nfs: use credential guards in nfs_idmap_get_key()
  nfs: use credential guards in nfs_local_call_write()
  nfs: use credential guards in nfs_local_call_read()
  erofs: use credential guards
  ...
2025-12-01 13:45:41 -08:00
Linus Torvalds
9368f0f941 Merge tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode updates from Christian Brauner:
 "Features:

   - Hide inode->i_state behind accessors. Open-coded accesses prevent
     asserting they are done correctly. One obvious aspect is locking,
     but significantly more can be checked. For example it can be
     detected when the code is clearing flags which are already missing,
     or is setting flags when it is illegal (e.g., I_FREEING when
     ->i_count > 0)

   - Provide accessors for ->i_state, converts all filesystems using
     coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2,
     overlayfs, nilfs2, xfs), and makes plain ->i_state access fail to
     compile

   - Rework I_NEW handling to operate without fences, simplifying the
     code after the accessor infrastructure is in place

  Cleanups:

   - Move wait_on_inode() from writeback.h to fs.h

   - Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
     for clarity

   - Cosmetic fixes to LRU handling

   - Push list presence check into inode_io_list_del()

   - Touch up predicts in __d_lookup_rcu()

   - ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage

   - Assert on ->i_count in iput_final()

   - Assert ->i_lock held in __iget()

  Fixes:

   - Add missing fences to I_NEW handling"

* tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
  dcache: touch up predicts in __d_lookup_rcu()
  fs: push list presence check into inode_io_list_del()
  fs: cosmetic fixes to lru handling
  fs: rework I_NEW handling to operate without fences
  fs: make plain ->i_state access fail to compile
  xfs: use the new ->i_state accessors
  nilfs2: use the new ->i_state accessors
  overlayfs: use the new ->i_state accessors
  gfs2: use the new ->i_state accessors
  f2fs: use the new ->i_state accessors
  smb: use the new ->i_state accessors
  ceph: use the new ->i_state accessors
  btrfs: use the new ->i_state accessors
  Manual conversion to use ->i_state accessors of all places not covered by coccinelle
  Coccinelle-based conversion to use ->i_state accessors
  fs: provide accessors for ->i_state
  fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
  fs: move wait_on_inode() from writeback.h to fs.h
  fs: add missing fences to I_NEW handling
  ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
  ...
2025-12-01 09:02:34 -08:00
Trond Myklebust
a2a8fc27dd NFS: Fix up the automount fs_context to use the correct cred
When automounting, the fs_context should be fixed up to use the cred
from the parent filesystem, since the operation is just extending the
namespace. Authorisation to enter that namespace will already have been
provided by the preceding lookup.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-29 17:23:03 -05:00
Trond Myklebust
2b092175f5 NFS: Fix inheritance of the block sizes when automounting
Only inherit the block sizes that were actually specified as mount
parameters for the parent mount.

Fixes: 62a55d088c ("NFS: Additional refactoring for fs_context conversion")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-29 17:23:03 -05:00
Trond Myklebust
8675c69816 NFS: Automounted filesystems should inherit ro,noexec,nodev,sync flags
When a filesystem is being automounted, it needs to preserve the
user-set superblock mount options, such as the "ro" flag.

Reported-by: Li Lingfeng <lilingfeng3@huawei.com>
Link: https://lore.kernel.org/all/20240604112636.236517-3-lilingfeng@huaweicloud.com/
Fixes: f2aedb713c ("NFS: Add fs_context support.")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-29 17:22:48 -05:00
Trond Myklebust
d4a26d34f1 Revert "nfs: ignore SB_RDONLY when mounting nfs"
This reverts commit 52cb7f8f17.

Silently ignoring the "ro" and "rw" mount options causes user confusion,
and regressions.

Reported-by: Alkis Georgopoulos<alkisg@gmail.com>
Cc: Li Lingfeng <lilingfeng3@huawei.com>
Fixes: 52cb7f8f17 ("nfs: ignore SB_RDONLY when mounting nfs")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-29 17:22:39 -05:00
Trond Myklebust
d216b698d4 Revert "nfs: clear SB_RDONLY before getting superblock"
This reverts commit 8cd9b78594.

Silently ignoring the "ro" and "rw" mount options causes user confusion,
and regressions.

Reported-by: Alkis Georgopoulos<alkisg@gmail.com>
Cc: Li Lingfeng <lilingfeng3@huawei.com>
Fixes: 8cd9b78594 ("nfs: clear SB_RDONLY before getting superblock")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-29 17:22:32 -05:00
Trond Myklebust
400fa37afb Revert "nfs: ignore SB_RDONLY when remounting nfs"
This reverts commit 80c4de6ab4.

Silently ignoring the "ro" and "rw" mount options causes user confusion,
and regressions.

Reported-by: Alkis Georgopoulos<alkisg@gmail.com>
Cc: Li Lingfeng <lilingfeng3@huawei.com>
Fixes: 80c4de6ab4 ("nfs: ignore SB_RDONLY when remounting nfs")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-29 17:19:01 -05:00
Anna Schumaker
b6d2a520f4 NFS: Add a module option to disable directory delegations
When this option is disabled then the client will not request directory
delegations or check if we have one during the revalidation paths.

Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-23 16:01:47 -05:00
Anna Schumaker
669c0580ac NFS: Shortcut lookup revalidations if we have a directory delegation
Holding a directory delegation means we know that nobody else has
modified the directory on the server, so we can take a few revalidation
shortcuts.

Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-23 16:01:47 -05:00
Anna Schumaker
2da2116707 NFS: Request a directory delegation during RENAME
If we notice that we're renaming a file within a directory then we take
that as a sign that the user is working with the current directory and
may want a delegation to avoid extra revalidations when possible.

The nfs_request_directory_delegation() function exists within the NFS v4
module, so I add an extra flag to rename_setup() to indicate if a dentry
is being renamed within the same parent directory.

Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-23 16:01:47 -05:00
Anna Schumaker
156b094829 NFS: Request a directory delegation on ACCESS, CREATE, and UNLINK
This patch adds a new flag: NFS_INO_REQ_DIR_DELEG to signal that a
directory wants to request a directory delegation the next time it does
a GETATTR. I have the client request a directory delegation when doing
an access, create, or unlink call since these calls indicate that a user
is working with a directory.

Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-23 16:01:47 -05:00
Anna Schumaker
130ae65c01 NFS: Add support for sending GDD_GETATTR
I add this to the existing GETATTR compound as an option extra step that
we can send if the "dir_deleg" flag is set to 'true'. Actually enabling
this value will happen in a later patch.

Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-23 16:01:46 -05:00
Jonathan Curley
e0f8058f2c NFSv4/pNFS: Clear NFS_INO_LAYOUTCOMMIT in pnfs_mark_layout_stateid_invalid
Fixes a crash when layout is null during this call stack:

write_inode
    -> nfs4_write_inode
        -> pnfs_layoutcommit_inode

pnfs_set_layoutcommit relies on the lseg refcount to keep the layout
around. Need to clear NFS_INO_LAYOUTCOMMIT otherwise we might attempt
to reference a null layout.

Fixes: fe1cf9469d ("pNFS: Clear all layout segment state in pnfs_mark_layout_stateid_invalid")
Signed-off-by: Jonathan Curley <jcurley@purestorage.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-23 15:33:33 -05:00
Olga Kornievskaia
9e9fdd0ad0 NFSv4.1: protect destroying and nullifying bc_serv structure
When we are shutting down the client, we free the callback
server structure and then at a later pointer we free the
transport used by the client. Yet, it's possible that after
the callback server is freed, the transport receives a
backchannel request at which point we can dereferene freed
memory. Instead, do the freeing the bc server and nullying
bc_serv under the lock.

Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-23 15:30:12 -05:00
Olga Kornievskaia
8936ff4736 NFSv4.1: pass transport for callback shutdown
When we are setting up the 4.1 callback server, we pass in
the appropriate rpc_xprt transport pointer with which to associate
the callback server structure. Similarly, pass in the rpc_xprt
pointer for when we are shutting down the callback. This will be
used to make sure that we free the server structure and then clear
the rpc_xprt's bc_server pointer in a safe manner.

Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-23 15:30:12 -05:00
Scott Mayhew
2e47c3cc64 NFSv4: ensure the open stateid seqid doesn't go backwards
We have observed an NFSv4 client receiving a LOCK reply with a status of
NFS4ERR_OLD_STATEID and subsequently retrying the LOCK request with an
earlier seqid value in the stateid.  As this was for a new lockowner,
that would imply that nfs_set_open_stateid_locked() had updated the open
stateid seqid with an earlier value.

Looking at nfs_set_open_stateid_locked(), if the incoming seqid is out
of sequence, the task will sleep on the state->waitq for up to 5
seconds.  If the task waits for the full 5 seconds, then after finishing
the wait it'll update the open stateid seqid with whatever value the
incoming seqid has.  If there are multiple waiters in this scenario,
then the last one to perform said update may not be the one with the
highest seqid.

Add a check to ensure that the seqid can only be incremented, and add a
tracepoint to indicate when old seqids are skipped.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@hammerspace.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-23 15:10:08 -05:00
Trond Myklebust
0f900f1100 NFS: Initialise verifiers for visible dentries in _nfs4_open_and_get_state
Ensure that the verifiers are initialised before calling
d_splice_alias() in _nfs4_open_and_get_state().

Reported-by: Michael Stoler <michael.stoler@vastdata.com>
Fixes: cf5b4059ba ("NFSv4: Fix races between open and dentry revalidation")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-19 11:46:50 -05:00
Trond Myklebust
518c32a1bc NFS: Initialise verifiers for visible dentries in nfs_atomic_open()
Ensure that the verifiers are initialised before calling
d_splice_alias() in nfs_atomic_open().

Reported-by: Michael Stoler <michael.stoler@vastdata.com>
Fixes: 809fd143de ("NFSv4: Ensure nfs_atomic_open set the dentry verifier on ENOENT")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-19 08:46:21 -05:00
Trond Myklebust
9bd545539b NFS: Initialise verifiers for visible dentries in readdir and lookup
Ensure that the verifiers are initialised before calling
d_splice_alias() in both nfs_prime_dcache() and nfs_lookup().

Reported-by: Michael Stoler <michael.stoler@vastdata.com>
Fixes: a1147b8281 ("NFS: Fix up directory verifier races")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-19 08:45:53 -05:00
Trond Myklebust
bd4928ec79 NFS: Avoid changing nlink when file removes and attribute updates race
If a file removal races with another operation that updates its
attributes, then skip the change to nlink, and just mark the attributes
as being stale.

Reported-by: Aiden Lambert <alambert48@gatech.edu>
Fixes: 59a707b0d4 ("NFS: Ensure we revalidate the inode correctly after remove or rename")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-17 17:31:32 -05:00
Jeff Layton
e6d28ebc17 filelock: push the S_ISREG check down to ->setlease handlers
When nfsd starts requesting directory delegations, setlease handlers may
see requests for leases on directories. Push the !S_ISREG check down
into the non-trivial setlease handlers, so we can selectively enable
them where they're supported.

FUSE is special: It's the only filesystem that supports atomic_open and
allows kernel-internal leases. atomic_open is issued when the VFS
doesn't know the state of the dentry being opened. If the file doesn't
exist, it may be created, in which case the dir lease should be broken.

The existing kernel-internal lease implementation has no provision for
this. Ensure that we don't allow directory leases by default going
forward by explicitly disabling them there.

Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-4-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 09:38:35 +01:00
Dai Ngo
b623390045 NFS: Fix LTP test failures when timestamps are delegated
The utimes01 and utime06 tests fail when delegated timestamps are
enabled, specifically in subtests that modify the atime and mtime
fields using the 'nobody' user ID.

The problem can be reproduced as follow:

# echo "/media *(rw,no_root_squash,sync)" >> /etc/exports
# export -ra
# mount -o rw,nfsvers=4.2 127.0.0.1:/media /tmpdir
# cd /opt/ltp
# ./runltp -d /tmpdir -s utimes01
# ./runltp -d /tmpdir -s utime06

This issue occurs because nfs_setattr does not verify the inode's
UID against the caller's fsuid when delegated timestamps are
permitted for the inode.

This patch adds the UID check and if it does not match then the
request is sent to the server for permission checking.

Fixes: e12912d941 ("NFSv4: Add support for delegated atime and mtime attributes")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-11-10 16:55:12 -05:00
Trond Myklebust
1f214e9c3a NFSv4: Fix an incorrect parameter when calling nfs4_call_sync()
The Smatch static checker noted that in _nfs4_proc_lookupp(), the flag
RPC_TASK_TIMEOUT is being passed as an argument to nfs4_init_sequence(),
which is clearly incorrect.
Since LOOKUPP is an idempotent operation, nfs4_init_sequence() should
not ask the server to cache the result. The RPC_TASK_TIMEOUT flag needs
to be passed down to the RPC layer.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Fixes: 76998ebb91 ("NFSv4: Observe the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-11-10 14:30:46 -05:00
Yang Xiuwei
7a7a345652 NFS: sysfs: fix leak when nfs_client kobject add fails
If adding the second kobject fails, drop both references to avoid sysfs
residue and memory leak.

Fixes: e96f9268ee ("NFS: Make all of /sys/fs/nfs network-namespace unique")

Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn>
Reviewed-by: Benjamin Coddington <ben.coddington@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-11-10 14:30:45 -05:00
Trond Myklebust
85d2c2392a NFSv2/v3: Fix error handling in nfs_atomic_open_v23()
When nfs_do_create() returns an EEXIST error, it means that a regular
file could not be created. That could mean that a symlink needs to be
resolved. If that's the case, a lookup needs to be kicked off.

Reported-by: Stephen Abbene <sabbene87@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=220710
Fixes: 7c6c5249f0 ("NFS: add atomic_open for NFSv3 to handle O_TRUNC correctly.")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-11-10 14:30:45 -05:00
Mike Snitzer
6a218b9c31 nfs/localio: do not issue misaligned DIO out-of-order
From https://lore.kernel.org/linux-nfs/aQHASIumLJyOoZGH@infradead.org/

On Wed, Oct 29, 2025 at 12:20:40AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 27, 2025 at 12:18:30PM -0400, Mike Snitzer wrote:
> > LOCALIO's misaligned DIO will issue head/tail followed by O_DIRECT
> > middle (via AIO completion of that aligned middle). So out of order
> > relative to file offset.
>
> That's in general a really bad idea.  It will obviously work, but
> both on SSDs and out of place write file systems it is a sure way
> to increase your garbage collection overhead a lot down the line.

Fix this by never issuing misaligned DIO out of order. This fix means
the DIO-aligned middle will only use AIO completion if there is no
misaligned end segment. Otherwise, all 3 segments of a misaligned DIO
will be issued without AIO completion to ensure file offset increases
properly for all partial READ or WRITE situations.

Factoring out nfs_local_iter_setup() helps standardize repetitive
nfs_local_iters_setup_dio() code and is inspired by cleanup work that
Chuck Lever did on the NFSD Direct code.

Fixes: c817248fc8 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-11-10 13:28:45 -05:00
Mike Snitzer
d32ddfeb55 nfs/localio: Ensure DIO WRITE's IO on stable storage upon completion
LOCALIO's misaligned DIO WRITE support requires synchronous IO for any
misaligned head and/or tail that are issued using buffered IO.  In
addition, it is important that the O_DIRECT middle be on stable
storage upon its completion via AIO.

Otherwise, a misaligned DIO WRITE could mix buffered IO for the
head/tail and direct IO for the DIO-aligned middle -- which could lead
to problems associated with deferred writes to stable storage (such as
out of order partial completions causing incorrect advancement of the
file's offset, etc).

Fixes: c817248fc8 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-11-10 10:32:28 -05:00
Mike Snitzer
d0497dd274 nfs/localio: backfill missing partial read support for misaligned DIO
Misaligned DIO read can be split into 3 IOs, must handle potential for
short read from each component IO (follows same pattern used for
handling partial writes, except upper layer read code handles advancing
offset before retry).

Fixes: c817248fc8 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-11-10 10:32:28 -05:00
Mike Snitzer
f2060bdc21 nfs/localio: add refcounting for each iocb IO associated with NFS pgio header
Improve completion handling of as many as 3 IOs associated with each
misaligned DIO by using a atomic_t to track completion of each IO.

Update nfs_local_pgio_done() to use precise atomic_t accounting for
remaining iov_iter (up to 3) associated with each iocb, so that each
NFS LOCALIO pgio header is only released after all IOs have completed.
But also allow early return if/when a short read or write occurs.

Fixes reported BUG: KASAN: slab-use-after-free in nfs_local_call_read:
https://lore.kernel.org/linux-nfs/aPSvi5Yr2lGOh5Jh@dell-per750-06-vm-07.rhts.eng.pek2.redhat.com/

Reported-by: Yongcheng Yang <yoyang@redhat.com>
Fixes: c817248fc8 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-11-10 10:32:28 -05:00
Mike Snitzer
51a491f270 nfs/localio: remove unecessary ENOTBLK handling in DIO WRITE support
Each filesystem is meant to fallback to retrying DIO in terms buffered
IO when it might encounter -ENOTBLK when issuing DIO (which can happen
if the VFS cannot invalidate the page cache).

So NFS doesn't need special handling for -ENOTBLK.

Also, explicitly initialize a couple DIO related iocb members rather
than simply rely on data structure zeroing.

Fixes: c817248fc8 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-11-10 10:32:28 -05:00
Trond Myklebust
fb2cba0854 NFS: Check the TLS certificate fields in nfs_match_client()
If the TLS security policy is of type RPC_XPRTSEC_TLS_X509, then the
cert_serial and privkey_serial fields need to match as well since they
define the client's identity, as presented to the server.

Fixes: 90c9550a8d ("NFS: support the kernel keyring for TLS")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-11-10 10:32:28 -05:00
Trond Myklebust
8ab523ce78 pnfs: Set transport security policy to RPC_XPRTSEC_NONE unless using TLS
The default setting for the transport security policy must be
RPC_XPRTSEC_NONE, when using a TCP or RDMA connection without TLS.
Conversely, when using TLS, the security policy needs to be set.

Fixes: 6c0a8c5fcf ("NFS: Have struct nfs_client carry a TLS policy field")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-11-10 10:32:28 -05:00
Trond Myklebust
28e19737e1 pnfs: Fix TLS logic in _nfs4_pnfs_v4_ds_connect()
Don't try to add an RDMA transport to a client that is already marked as
being a TCP/TLS transport.

Fixes: a35518cae4 ("NFSv4.1/pnfs: fix NFS with TLS in pnfs")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-11-10 10:32:28 -05:00
Trond Myklebust
7aca00d950 pnfs: Fix TLS logic in _nfs4_pnfs_v3_ds_connect()
Don't try to add an RDMA transport to a client that is already marked as
being a TCP/TLS transport.

Fixes: 04a1526366 ("pnfs/flexfiles: connect to NFSv3 DS using TLS if MDS connection uses TLS")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-11-10 10:32:27 -05:00
Christian Brauner
f41799b2e1 nfs: use credential guards in nfs_idmap_get_key()
Use credential guards for scoped credential override with automatic
restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-12-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:45 +01:00
Christian Brauner
bff3c841f7 nfs: use credential guards in nfs_local_call_write()
Use credential guards for scoped credential override with automatic
restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-11-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:43 +01:00
Christian Brauner
94afb627df nfs: use credential guards in nfs_local_call_read()
Use credential guards for scoped credential override with automatic
restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-10-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:42 +01:00
Mateusz Guzik
b4dbfd8653 Coccinelle-based conversion to use ->i_state accessors
All places were patched by coccinelle with the default expecting that
->i_lock is held, afterwards entries got fixed up by hand to use
unlocked variants as needed.

The script:
@@
expression inode, flags;
@@

- inode->i_state & flags
+ inode_state_read(inode) & flags

@@
expression inode, flags;
@@

- inode->i_state &= ~flags
+ inode_state_clear(inode, flags)

@@
expression inode, flag1, flag2;
@@

- inode->i_state &= ~flag1 & ~flag2
+ inode_state_clear(inode, flag1 | flag2)

@@
expression inode, flags;
@@

- inode->i_state |= flags
+ inode_state_set(inode, flags)

@@
expression inode, flags;
@@

- inode->i_state = flags
+ inode_state_assign(inode, flags)

@@
expression inode, flags;
@@

- flags = inode->i_state
+ flags = inode_state_read(inode)

@@
expression inode, flags;
@@

- READ_ONCE(inode->i_state) & flags
+ inode_state_read(inode) & flags

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:26 +02:00
Joshua Watt
9bb3baa9d1 NFS4: Fix state renewals missing after boot
Since the last renewal time was initialized to 0 and jiffies start
counting at -5 minutes, any clients connected in the first 5 minutes
after a reboot would have their renewal timer set to a very long
interval. If the connection was idle, this would result in the client
state timing out on the server and the next call to the server would
return NFS4ERR_BADSESSION.

Fix this by initializing the last renewal time to the current jiffies
instead of 0.

Signed-off-by: Joshua Watt <jpewhacker@gmail.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-10-13 14:33:00 -04:00
Scott Mayhew
9ff022f382 NFS: check if suid/sgid was cleared after a write as needed
I noticed xfstests generic/193 and generic/355 started failing against
knfsd after commit e7a8ebc305 ("NFSD: Offer write delegation for OPEN
with OPEN4_SHARE_ACCESS_WRITE").

I ran those same tests against ONTAP (which has had write delegation
support for a lot longer than knfsd) and they fail there too... so
while it's a new failure against knfsd, it isn't an entirely new
failure.

Add the NFS_INO_REVAL_FORCED flag so that the presence of a delegation
doesn't keep the inode from being revalidated to fetch the updated mode.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-10-13 14:33:00 -04:00
Joshua Watt
7a84394f02 NFS4: Apply delay_retrans to async operations
The setting of delay_retrans is applied to synchronous RPC operations
because the retransmit count is stored in same struct nfs4_exception
that is passed each time an error is checked. However, for asynchronous
operations (READ, WRITE, LOCKU, CLOSE, DELEGRETURN), a new struct
nfs4_exception is made on the stack each time the task callback is
invoked. This means that the retransmit count is always zero and thus
delay_retrans never takes effect.

Apply delay_retrans to these operations by tracking and updating their
retransmit count.

Change-Id: Ieb33e046c2b277cb979caa3faca7f52faf0568c9
Signed-off-by: Joshua Watt <jpewhacker@gmail.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-10-13 14:33:00 -04:00
Mike Snitzer
8db4a1d146 NFSv4/flexfiles: fix to allocate mirror->dss before use
Move mirror_array's dss_count initialization and dss allocation to
ff_layout_alloc_mirror(), just before the loop that initializes each
nfs4_ff_layout_ds_stripe's nfs_file_localio.

Also handle NULL return from kcalloc() and remove one level of indent
in ff_layout_alloc_mirror().

This commit fixes dangling nfsd_serv refcount issues seen when using
NFS LOCALIO and then attempting to stop the NFSD service.

Fixes: 20b1d75fb8 ("NFSv4/flexfiles: Add support for striped layouts")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-10-13 14:33:00 -04:00
Linus Torvalds
50647a1176 Merge tag 'pull-f_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull file->f_path constification from Al Viro:
 "Only one thing was modifying ->f_path of an opened file - acct(2).

  Massaging that away and constifying a bunch of struct path * arguments
  in functions that might be given &file->f_path ends up with the
  situation where we can turn ->f_path into an anon union of const
  struct path f_path and struct path __f_path, the latter modified only
  in a few places in fs/{file_table,open,namei}.c, all for struct file
  instances that are yet to be opened"

* tag 'pull-f_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (23 commits)
  Have cc(1) catch attempts to modify ->f_path
  kernel/acct.c: saner struct file treatment
  configfs:get_target() - release path as soon as we grab configfs_item reference
  apparmor/af_unix: constify struct path * arguments
  ovl_is_real_file: constify realpath argument
  ovl_sync_file(): constify path argument
  ovl_lower_dir(): constify path argument
  ovl_get_verity_digest(): constify path argument
  ovl_validate_verity(): constify {meta,data}path arguments
  ovl_ensure_verity_loaded(): constify datapath argument
  ksmbd_vfs_set_init_posix_acl(): constify path argument
  ksmbd_vfs_inherit_posix_acl(): constify path argument
  ksmbd_vfs_kern_path_unlock(): constify path argument
  ksmbd_vfs_path_lookup_locked(): root_share_path can be const struct path *
  check_export(): constify path argument
  export_operations->open(): constify path argument
  rqst_exp_get_by_name(): constify path argument
  nfs: constify path argument of __vfs_getattr()
  bpf...d_path(): constify path argument
  done_path_create(): constify path argument
  ...
2025-10-03 16:32:36 -07:00