Commit Graph

96808 Commits

Author SHA1 Message Date
Kent Overstreet
f917016f69 bcachefs: Reduce stack frame size of __bch2_str_hash_check_key()
We don't need all the helpers inlined here.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-21 12:57:48 -05:00
Kent Overstreet
a858175227 bcachefs: Fix btree_trans_peek_key_cache()
BTREE_ITER_cached_nofill has some tricky corner cases; it's used
internally for iterators that aren't walking the key cache, but need to
be coherent with the key cache.

It tells traverse to look up and lock the key cache entry if present,
but don't create one if it doesn't exist.

That means we have to have a BTREE_ITER_UPTODATE path (because after
traverse the path has to be UPTODATE, or we pop assertions) that doesn't
point to anything (which is the less bad option, taken by the previous
fix).

The previous fix for this path missed an issue that can happen in
bch2_trans_peek_key_cache(): we can't set should_be_locked on a path
that doesn't point to anything and doesn't hold locks.

Fixes: bd5b09727f ("bcachefs: Don't set btree_path to updtodate if we don't fill")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-21 12:26:25 -05:00
Olga Kornievskaia
0b96c75d86 NFSv4.2: make LAYOUTSTATS and LAYOUTERROR MOVEABLE
LAYOUTSTATS and LAYOUTERROR should be marked MOVEABLE for when we
need to move tasks off a non-functional transport.

Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-01-21 11:34:50 -05:00
Olga Kornievskaia
668135b934 NFSv4.2: mark OFFLOAD_CANCEL MOVEABLE
OFFLOAD_CANCEL should be marked MOVEABLE for when we need to move
tasks off a non-functional transport.

Fixes: c975c20926 ("NFS send OFFLOAD_CANCEL when COPY killed")
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-01-21 11:34:50 -05:00
Olga Kornievskaia
e8380c2d06 NFSv4.2: fix COPY_NOTIFY xdr buf size calculation
We need to include sequence size in the compound.

Fixes: 0491567b51 ("NFS: add COPY_NOTIFY operation")
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-01-21 11:34:50 -05:00
Chuck Lever
aa9e4ad0fd NFS: Rename struct nfs4_offloadcancel_data
Refactor: This struct can be used unchanged for the new
OFFLOAD_STATUS implementation, so give it a more generic name.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-01-21 11:34:50 -05:00
Chuck Lever
36f4e9ef84 NFS: Fix typo in OFFLOAD_CANCEL comment
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-01-21 11:34:50 -05:00
Chuck Lever
d2fc83c5df NFS: CB_OFFLOAD can return NFS4ERR_DELAY
RFC 7862 permits the callback service to respond to a CB_OFFLOAD
operation with NFS4ERR_DELAY. Use that instead of
NFS4ERR_SERVERFAULT for temporary memory allocation failure, as that
is more consistent with how other operations report memory
allocation failure.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-01-21 11:34:50 -05:00
Dragan Simic
90190ba1c3 nfs: Make NFS_FSCACHE select NETFS_SUPPORT instead of depending on it
Having the NFS_FSCACHE option depend on the NETFS_SUPPORT options makes
selecting NFS_FSCACHE impossible unless another option that additionally
selects NETFS_SUPPORT is already selected.

As a result, for example, being able to reach and select the NFS_FSCACHE
option requires the CEPH_FS or CIFS option to be selected beforehand, which
obviously doesn't make much sense.

Let's correct this by making the NFS_FSCACHE option actually select the
NETFS_SUPPORT option, instead of depending on it.

Fixes: 915cd30cde ("netfs, fscache: Combine fscache with netfs")
Cc: stable@vger.kernel.org
Reported-by: Diederik de Haas <didi.debian@cknow.org>
Signed-off-by: Dragan Simic <dsimic@manjaro.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-01-21 11:34:50 -05:00
Mike Snitzer
ead11ac50a nfs: fix incorrect error handling in LOCALIO
nfs4_stat_to_errno() expects a NFSv4 error code as an argument and
returns a POSIX errno.

The problem is LOCALIO is passing nfs4_stat_to_errno() the POSIX errno
return values from filp->f_op->read_iter(), filp->f_op->write_iter()
and vfs_fsync_range().

So the POSIX errno that nfs_local_pgio_done() and
nfs_local_commit_done() are passing to nfs4_stat_to_errno() are
failing to match any NFSv4 error code, which results in
nfs4_stat_to_errno() defaulting to returning -EREMOTEIO. This causes
assertions in upper layers due to -EREMOTEIO not being a valid NFSv4
error code.

Fix this by updating nfs_local_pgio_done() and nfs_local_commit_done()
to use the new nfs_localio_errno_to_nfs4_stat() to map a POSIX errno
to an NFSv4 error code.

Care was taken to factor out nfs4_errtbl_common[] to avoid duplicating
the same NFS error to errno table. nfs4_errtbl_common[] is checked
first by both nfs4_stat_to_errno and nfs_localio_errno_to_nfs4_stat
before they check their own more specialized tables (nfs4_errtbl[] and
nfs4_errtbl_localio[] respectively).

While auditing the associated error mapping tables, the (ab)use of -1
for the last table entry was removed in favor of using ARRAY_SIZE to
iterate the nfs_errtbl[] and nfs4_errtbl[]. And 'errno_NFSERR_IO' was
removed because it caused needless obfuscation.

Fixes: 70ba381e1a ("nfs: add LOCALIO support")
Reported-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-01-21 11:34:43 -05:00
Jaegeuk Kim
e02938613e f2fs: avoid trying to get invalid block address
In f2fs_new_inode(), if we fail to get a new inode, we go iput(), followed by
f2fs_evict_inode(). If the inode is not marked as bad, it'll try to call
f2fs_remove_inode_page() which tries to read the inode block given node id.
But, there's no block address allocated yet, which gives a chance to access
a wrong block address, if the block device has some garbage data in NAT table.

We need to make sure NAT table should have zero data for all the unallocated
node ids, but also would be better to take this unnecessary path as well.
Let's mark the faild inode as bad.

Fixes: 0abd675e97 ("f2fs: support plain user/group quota")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-01-21 16:28:16 +00:00
Linus Torvalds
1cbfb828e0 Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:

 - NVMe pull requests via Keith:
      - Target support for PCI-Endpoint transport (Damien)
      - TCP IO queue spreading fixes (Sagi, Chaitanya)
      - Target handling for "limited retry" flags (Guixen)
      - Poll type fix (Yongsoo)
      - Xarray storage error handling (Keisuke)
      - Host memory buffer free size fix on error (Francis)

 - MD pull requests via Song:
      - Reintroduce md-linear (Yu Kuai)
      - md-bitmap refactor and fix (Yu Kuai)
      - Replace kmap_atomic with kmap_local_page (David Reaver)

 - Quite a few queue freeze and debugfs deadlock fixes

   Ming introduced lockdep support for this in the 6.13 kernel, and it
   has (unsurprisingly) uncovered quite a few issues

 - Use const attributes for IO schedulers

 - Remove bio ioprio wrappers

 - Fixes for stacked device atomic write support

 - Refactor queue affinity helpers, in preparation for better supporting
   isolated CPUs

 - Cleanups of loop O_DIRECT handling

 - Cleanup of BLK_MQ_F_* flags

 - Add rotational support for null_blk

 - Various fixes and cleanups

* tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits)
  block: Don't trim an atomic write
  block: Add common atomic writes enable flag
  md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()
  block: limit disk max sectors to (LLONG_MAX >> 9)
  block: Change blk_stack_atomic_writes_limits() unit_min check
  block: Ensure start sector is aligned for stacking atomic writes
  blk-mq: Move more error handling into blk_mq_submit_bio()
  block: Reorder the request allocation code in blk_mq_submit_bio()
  nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
  md/md-bitmap: move bitmap_{start, end}write to md upper layer
  md/raid5: implement pers->bitmap_sector()
  md: add a new callback pers->bitmap_sector()
  md/md-bitmap: remove the last parameter for bimtap_ops->endwrite()
  md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()
  md: Replace deprecated kmap_atomic() with kmap_local_page()
  md: reintroduce md-linear
  partitions: ldm: remove the initial kernel-doc notation
  blk-cgroup: rwstat: fix kernel-doc warnings in header file
  blk-cgroup: fix kernel-doc warnings in header file
  nbd: fix partial sending
  ...
2025-01-20 19:38:46 -08:00
Pali Rohár
2948f0d4db cifs: Remove duplicate struct reparse_symlink_data and SYMLINK_FLAG_RELATIVE
In file common/smb2pdu.h is defined struct reparse_symlink_data_buffer
which is same as struct reparse_symlink_data and is used in the whole code.
So remove duplicate struct reparse_symlink_data from client/cifspdu.h.

In file common/smb2pdu.h is defined also SYMLINK_FLAG_RELATIVE constant, so
remove duplication from client/cifspdu.h.

Signed-off-by: Pali Rohár <pali@kernel.org>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-20 19:28:36 -06:00
Linus Torvalds
3d3a9c8b89 Merge tag 'dlm-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:

 - Fix a case where the new scanning code missed removing an unused rsb

 - Fix the error when removing a configfs entry for an invalid node id

* tag 'dlm-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: return -ENOENT if no comm was found
  dlm: fix srcu_read_lock() return type to int
  dlm: fix removal of rsb struct that is master and dir record
2025-01-20 14:26:59 -08:00
Linus Torvalds
2622f29041 Merge tag 'bcachefs-2025-01-20.2' of git://evilpiepirate.org/bcachefs
Pull bcachefs updates from Kent Overstreet:
 "Lots of scalability work, another big on-disk format change. On-disk
  format version goes from 1.13 to 1.20.

  Like 6.11, this is another big and expensive automatic/required on
  disk format upgrade. This is planned to be the last big on disk format
  upgrade before the experimental label comes off. There will be one
  more minor on disk format update for a few things that couldn't make
  this release.

  Headline improvements:

   - Self healing work:

     Allocator and reflink now run the exact same check/repair code that
     fsck does at runtime, where applicable.

     The long term goal here is to remove inconsistent() errors (that
     cause us to go emergency read only) by lifting fsck code up to
     normal runtime paths; we should only go emergency read-only if we
     detect an inconsistency that was due to a runtime bug - or truly
     catastrophic damage (corrupted btree roots/interior nodes).

   - Reflink repair no longer deletes reflink pointers:

     Instead we flip an error bit and log the error, and they can still
     be deleted by file deletion. This means a temporary failure to find
     an indirect extent (perhaps repaired later by btree node scan)
     won't result in unnecessary data loss

   - Improvements to rebalance data path option handling:

     We can now correctly apply changed filesystem-level io path options
     to pending rebalance work, and soon we'll be able to apply
     file-level io path option changes to indirect extents

   - Fix mount time regression that some users encountered post the 6.11
     disk accounting rewrite.

     Accounting keys were encoded little endian (typetag in the low
     bits) - which didn't anticipate adding accounting keys for every
     inode, which aren't stored in memory and we don't want to scan at
     mount time.

   - fsck time on large filesystems is improved by multiple orders of
     magnitude. Previously, 100TB was about the practical max filesystem
     size, where users were reporting fsck times of a day+. With the new
     changes (which nearly eliminate backpointers fsck overhead), we
     fsck'd a filesystem with 10PB of data in 1.5 hours.

     The problematic fsck passes were walking every extent and checking
     for missing backpointers, and walking every backpointer to check
     for dangling backpointers. As we've been adding more and more
     runtime self healing there was no reason to keep around the
     backpointers -> extents pass; dangling backpointers are just
     deleted, and we can do that when using them - thus, backpointers ->
     extents is now only run in debug mode.

     extents -> backpointers does need to exist, since missing
     backpointers would mean we can't find data to move it (for e.g.
     copygc, device evacuate, scrub). But the new on disk format version
     makes possible a new strategy where we sum up backpointers within a
     bucket and check it against the bucket sector counts, and then only
     scan for missing backpointers if the counts are off (and then, only
     for specific buckets).

  Full list of on disk format changes:

   - 1.14: backpointer_bucket_gen

     Backpointers now have a field for the bucket generation number,
     replacing the obsolete bucket_offset field. This is needed for the
     new "sum up backpointers within a bucket" code, since backpointers
     use the btree write buffer - meaning we will see stale reads, and
     this runs online, with the filesystem in full rw mode.

   - 1.15: disk_accounting_big_endian

     As previously described, fix the endianness of accounting keys so
     that accounting keys with the same typetag sort together, and
     accounting read can skip types it's not interested in.

   - 1.16: reflink_p_may_update_opts:

     This version indicates that a new reflink pointer field is
     understood and may be used; the field indicates whether the reflink
     pointer has permissions to update IO path options (e.g.
     compression, replicas) may be updated on the indirect extent it
     points to.

     This completes the rebalance/reflink data path option handling from
     the 6.13 pull request.

   - 1.17: inode_depth

     Add a new inode field, bi_depth, to accelerate the
     check_directory_structure fsck path, which checks for loops in the
     filesystem heirarchy.

     check_inodes and check_dirents check connectivity, so
     check_directory_structure only has to check for loops - by walking
     back up to the root from every directory.

     But a path can't be a loop if it has a counter that increases
     monotonically from root to leaf - adding a depth counter means that
     we can check for loops with only local (parent -> child) checks. We
     might need to occasionally renumber the depth field in fsck if
     directories have been moved around, but then future fsck runs will
     be much faster.

   - 1.18: persistent_inode_cursors

     Previously, the cursor used for inode allocation was only kept in
     memory, which meant that users with large filesystems and lots of
     files were reporting that the first create after mounting would
     take awhile - since it had to scan from the start.

     Inode allocation cursors are now persistent, and also include a
     generation field (incremented on wraparound, which will only happen
     if inode allocation is restricted to 32 bit inodes), so that we
     don't have to leave inode_generation keys around after a delete.

     The option for 32 bit inode numbers may now also be set on
     individual directories, and non-32 bit inode allocations are
     disallowed from allocating from the 32 bit part of the inode number
     space.

   - 1.19: autofix_errors

     Runtime self healing is now the default.o

   - 1.20: directory size (from Hongbo)

     directory i_size is now meaningful, and not 0"

* tag 'bcachefs-2025-01-20.2' of git://evilpiepirate.org/bcachefs: (268 commits)
  bcachefs: Fix check_inode_hash_info_matches_root()
  bcachefs: Document issue with bch_stripe layout
  bcachefs: Fix self healing on read error
  bcachefs: Pop all the transactions from the abort one
  bcachefs: Only abort the transactions in the cycle
  bcachefs: Introduce lock_graph_pop_from
  bcachefs: Convert open-coded lock_graph_pop_all to helper
  bcachefs: Do not allow no fail lock request to fail
  bcachefs: Merge the condition to avoid additional invocation
  Revert "bcachefs: Fix bch2_btree_node_upgrade()"
  bcachefs: bcachefs_metadata_version_directory_size
  bcachefs: make directory i_size meaningful
  bcachefs: check_unreachable_inodes is not actually PASS_ONLINE yet
  bcachefs: Don't use BTREE_ITER_cached when walking alloc btree during fsck
  bcachefs: Check for dirents to overwritten inodes
  bcachefs: bch2_btree_iter_peek_slot() handles navigating to nonexistent depth
  bcachefs: Don't set btree_path to updtodate if we don't fill
  bcachefs: __bch2_btree_pos_to_text()
  bcachefs: printbuf_reset() handles tabstops
  bcachefs: Silence read-only errors when deleting snapshots
  ...
2025-01-20 13:55:19 -08:00
Linus Torvalds
5d8a4bd6b2 Merge tag 'pstore-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull pstore updates from Kees Cook:

 - pstore/blk: trivial typo fixes (Eugen Hristev)

 - pstore/zone: reject zero-sized allocations (Eugen Hristev)

* tag 'pstore-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore/zone: avoid dereferencing zero sized ptr after init zones
  pstore/blk: trivial typo fixes
2025-01-20 13:37:14 -08:00
Linus Torvalds
fadc3ed9ce Merge tag 'execve-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve updates from Kees Cook:

 - fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case (Tycho
   Andersen, Kees Cook)

 - binfmt_misc: Fix comment typos (Christophe JAILLET)

 - move empty argv[0] warning closer to actual logic (Nir Lichtman)

 - remove legacy custom binfmt modules autoloading (Nir Lichtman)

 - Make sure set_task_comm() always NUL-terminates

 - binfmt_flat: Fix integer overflow bug on 32 bit systems (Dan
   Carpenter)

 - coredump: Do not lock when copying "comm"

 - MAINTAINERS: add auxvec.h and set myself as maintainer

* tag 'execve-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  binfmt_flat: Fix integer overflow bug on 32 bit systems
  selftests/exec: add a test for execveat()'s comm
  exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case
  exec: Make sure task->comm is always NUL-terminated
  exec: remove legacy custom binfmt modules autoloading
  exec: move warning of null argv to be next to the relevant code
  fs: binfmt: Fix a typo
  MAINTAINERS: exec: Mark Kees as maintainer
  MAINTAINERS: exec: Add auxvec.h UAPI
  coredump: Do not lock during 'comm' reporting
2025-01-20 13:27:58 -08:00
Linus Torvalds
0eb4aaa230 Merge tag 'for-6.14-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "User visible changes, features:

   - rebuilding of the free space tree at mount time is done in more
     transactions, fix potential hangs when the transaction thread is
     blocked due to large amount of block groups

   - more read IO balancing strategies (experimental config), add two
     new ways how to select a device for read if the profiles allow that
     (all RAID1*), the current default selects the device by pid which
     is good on average but less performant for single reader workloads

       - select preferred device for all reads (namely for testing)
       - round-robin, balance reads across devices relevant for the
         requested IO range

   - add encoded write ioctl support to io_uring (read was added in
     6.12), basis for writing send stream using that instead of
     syscalls, non-blocking mode is not yet implemented

   - support FS_IOC_READ_VERITY_METADATA, applications can use the
     metadata to do their own verification

   - pass inode's i_write_hint to bios, for parity with other
     filesystems, ioctls F_GET_RW_HINT/F_SET_RW_HINT

  Core:

   - in zoned mode: allow to directly reclaim a block group by simply
     resetting it, then it can be reused and another block group does
     not need to be allocated

   - super block validation now also does more comprehensive sys array
     validation, adding it to the points where superblock is validated
     (post-read, pre-write)

   - subpage mode fixes:
      - fix double accounting of blocks due to some races
      - improved or fixed error handling in a few cases (compression,
        delalloc)

   - raid stripe tree:
      - fix various cases with extent range splitting or deleting
      - implement hole punching to extent range
      - reduce number of stripe tree lookups during bio submission
      - more self-tests

   - updated self-tests (delayed refs)

   - error handling improvements

   - cleanups, refactoring
      - remove rest of backref caching infrastructure from relocation,
        not needed anymore
      - error message updates
      - remove unnecessary calls when extent buffer was marked dirty
      - unused parameter removal
      - code moved to new files

  Other code changes: add rb_find_add_cached() to the rb-tree API"

* tag 'for-6.14-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (127 commits)
  btrfs: selftests: add a selftest for deleting two out of three extents
  btrfs: selftests: add test for punching a hole into 3 RAID stripe-extents
  btrfs: selftests: add selftest for punching holes into the RAID stripe extents
  btrfs: selftests: test RAID stripe-tree deletion spanning two items
  btrfs: selftests: don't split RAID extents in half
  btrfs: selftests: check for correct return value of failed lookup
  btrfs: don't use btrfs_set_item_key_safe on RAID stripe-extents
  btrfs: implement hole punching for RAID stripe extents
  btrfs: fix deletion of a range spanning parts two RAID stripe extents
  btrfs: fix tail delete of RAID stripe-extents
  btrfs: fix front delete range calculation for RAID stripe extents
  btrfs: assert RAID stripe-extent length is always greater than 0
  btrfs: don't try to delete RAID stripe-extents if we don't need to
  btrfs: selftests: correct RAID stripe-tree feature flag setting
  btrfs: add io_uring interface for encoded writes
  btrfs: remove the unused locked_folio parameter from btrfs_cleanup_ordered_extents()
  btrfs: add extra error messages for delalloc range related errors
  btrfs: subpage: dump the involved bitmap when ASSERT() failed
  btrfs: subpage: fix the bitmap dump of the locked flags
  btrfs: do proper folio cleanup when run_delalloc_nocow() failed
  ...
2025-01-20 13:09:30 -08:00
Linus Torvalds
1851bccf60 Merge tag 'gfs2-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:

 - In the quota code, to avoid spurious audit messages, don't call
   capable() when quotas are off

 - When changing the 'j' flag of an inode, truncate the inode address
   space to avoid mixing "buffer head" and "iomap" pages

* tag 'gfs2-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Truncate address space when flipping GFS2_DIF_JDATA flag
  gfs2: reorder capability check last
2025-01-20 13:06:28 -08:00
Linus Torvalds
b971424b6e Merge tag 'vfs-6.14-rc1.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull afs updates from Christian Brauner:
 "Dynamic root improvements:

   - Create an /afs/.<cell> mountpoint to match the /afs/<cell>
     mountpoint when a cell is created

   - Add some more checks on cell names proposed by the user to prevent
     dodgy symlink bodies from being created. Also prevent rootcell from
     being altered once set to simplify the locking

   - Change the handling of /afs/@cell from being a dentry name
     substitution at lookup time to making it a symlink to the current
     cell name and also provide a /afs/.@cell symlink to point to the
     dotted cell mountpoint

  Fixes:

   - Fix the abort code check in the fallback handling for the
     YFS.RemoveFile2 RPC call

   - Use call->op->server() for oridnary filesystem RPC calls that have
     an operation descriptor instead of call->server()"

* tag 'vfs-6.14-rc1.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  afs: Fix the fallback handling for the YFS.RemoveFile2 RPC call
  afs: Make /afs/@cell and /afs/.@cell symlinks
  afs: Add rootcell checks
  afs: Make /afs/.<cell> as well as /afs/<cell> mountpoints
2025-01-20 11:40:48 -08:00
Linus Torvalds
47c9f2b3c8 Merge tag 'vfs-6.14-rc1.statx.dio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs direct-io updates from Christian Brauner:
 "File systems that write out of place usually require different
  alignment for direct I/O writes than what they can do for reads.

  Add a separate dio read align field to statx, as many out of place
  write file systems can easily do reads aligned to the device sector
  size, but require bigger alignment for writes.

  This is usually papered over by falling back to buffered I/O for
  smaller writes and doing read-modify-write cycles, but performance for
  this sucks, so applications benefit from knowing the actual write
  alignment"

* tag 'vfs-6.14-rc1.statx.dio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  xfs: report larger dio alignment for COW inodes
  xfs: report the correct read/write dio alignment for reflinked inodes
  xfs: cleanup xfs_vn_getattr
  fs: add STATX_DIO_READ_ALIGN
  fs: reformat the statx definition
2025-01-20 11:16:50 -08:00
Linus Torvalds
7e587c20ad Merge tag 'vfs-6.14-rc1.libfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs libfs updates from Christian Brauner:
 "This improves the stable directory offset behavior in various ways.

  Stable offsets are needed so that NFS can reliably read directories on
  filesystems such as tmpfs:

   - Improve the end-of-directory detection

     According to getdents(3), the d_off field in each returned
     directory entry points to the next entry in the directory. The
     d_off field in the last returned entry in the readdir buffer must
     contain a valid offset value, but if it points to an actual
     directory entry, then readdir/getdents can loop.

     Introduce a specific fixed offset value that is placed in the d_off
     field of the last entry in a directory. Some user space
     applications assume that the EOD offset value is larger than the
     offsets of real directory entries, so the largest valid offset
     value is reserved for this purpose. This new value is never
     allocated by simple_offset_add().

     When ->iterate_dir() returns, getdents{64} inserts the ctx->pos
     value into the d_off field of the last valid entry in the readdir
     buffer. When it hits EOD, offset_readdir() sets ctx->pos to the EOD
     offset value so the last entry is updated to point to the EOD
     marker.

     When trying to read the entry at the EOD offset, offset_readdir()
     terminates immediately.

   - Rely on d_children to iterate stable offset directories

     Instead of using the mtree to emit entries in the order of their
     offset values, use it only to map incoming ctx->pos to a starting
     entry. Then use the directory's d_children list, which is already
     maintained properly by the dcache, to find the next child to emit.

   - Narrow the range of directory offset values returned by
     simple_offset_add() to 3 .. (S32_MAX - 1) on all platforms. This
     means the allocation behavior is identical on 32-bit systems,
     64-bit systems, and 32-bit user space on 64-bit kernels. The new
     range still permits over 2 billion concurrent entries per
     directory.

   - Return ENOSPC when the directory offset range is exhausted. Hitting
     this error is almost impossible though.

   - Remove the simple_offset_empty() helper"

* tag 'vfs-6.14-rc1.libfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  libfs: Use d_children list to iterate simple_offset directories
  libfs: Replace simple_offset end-of-directory detection
  Revert "libfs: fix infinite directory reads for offset dir"
  Revert "libfs: Add simple_offset_empty()"
  libfs: Return ENOSPC when the directory offset range is exhausted
2025-01-20 11:00:53 -08:00
Linus Torvalds
100ceb4817 Merge tag 'vfs-6.14-rc1.mount.v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount updates from Christian Brauner:

 - Add a mountinfo program to demonstrate statmount()/listmount()

   Add a new "mountinfo" sample userland program that demonstrates how
   to use statmount() and listmount() to get at the same info that
   /proc/pid/mountinfo provides

 - Remove pointless nospec.h include

 - Prepend statmount.mnt_opts string with security_sb_mnt_opts()

   Currently these mount options aren't accessible via statmount()

 - Add new mount namespaces to mount namespace rbtree outside of the
   namespace semaphore

 - Lockless mount namespace lookup

   Currently we take the read lock when looking for a mount namespace to
   list mounts in. We can make this lockless. The simple search case can
   just use a sequence counter to detect concurrent changes to the
   rbtree

   For walking the list of mount namespaces sequentially via nsfs we
   keep a separate rcu list as rb_prev() and rb_next() aren't usable
   safely with rcu. Currently there is no primitive for retrieving the
   previous list member. To do this we need a new deletion primitive
   that doesn't poison the prev pointer and a corresponding retrieval
   helper

   Since creating mount namespaces is a relatively rare event compared
   with querying mounts in a foreign mount namespace this is worth it.
   Once libmount and systemd pick up this mechanism to list mounts in
   foreign mount namespaces this will be used very frequently

     - Add extended selftests for lockless mount namespace iteration

     - Add a sample program to list all mounts on the system, i.e., in
       all mount namespaces

 - Improve mount namespace iteration performance

   Make finding the last or first mount to start iterating the mount
   namespace from an O(1) operation and add selftests for iterating the
   mount table starting from the first and last mount

 - Use an xarray for the old mount id

   While the ida does use the xarray internally we can use it explicitly
   which allows us to increment the unique mount id under the xa lock.
   This allows us to remove the atomic as we're now allocating both ids
   in one go

 - Use a shared header for vfs sample programs

 - Fix build warnings for new sample program to list all mounts

* tag 'vfs-6.14-rc1.mount.v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  samples/vfs: fix build warnings
  samples/vfs: use shared header
  samples/vfs/mountinfo: Use __u64 instead of uint64_t
  fs: remove useless lockdep assertion
  fs: use xarray for old mount id
  selftests: add listmount() iteration tests
  fs: cache first and last mount
  samples: add test-list-all-mounts
  selftests: remove unneeded include
  selftests: add tests for mntns iteration
  seltests: move nsfs into filesystems subfolder
  fs: simplify rwlock to spinlock
  fs: lockless mntns lookup for nsfs
  rculist: add list_bidir_{del,prev}_rcu()
  fs: lockless mntns rbtree lookup
  fs: add mount namespace to rbtree late
  fs: prepend statmount.mnt_opts string with security_sb_mnt_opts()
  mount: remove inlude/nospec.h include
  samples: add a mountinfo program to demonstrate statmount()/listmount()
2025-01-20 10:44:51 -08:00
Linus Torvalds
37c12fcb3c Merge tag 'kernel-6.14-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull cred refcount updates from Christian Brauner:
 "For the v6.13 cycle we switched overlayfs to a variant of
  override_creds() that doesn't take an extra reference. To this end the
  {override,revert}_creds_light() helpers were introduced.

  This generalizes the idea behind {override,revert}_creds_light() to
  the {override,revert}_creds() helpers. Afterwards overriding and
  reverting credentials is reference count free unless the caller
  explicitly takes a reference.

  All callers have been appropriately ported"

* tag 'kernel-6.14-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
  cred: fold get_new_cred_many() into get_cred_many()
  cred: remove unused get_new_cred()
  nfsd: avoid pointless cred reference count bump
  cachefiles: avoid pointless cred reference count bump
  dns_resolver: avoid pointless cred reference count bump
  trace: avoid pointless cred reference count bump
  cgroup: avoid pointless cred reference count bump
  acct: avoid pointless reference count bump
  io_uring: avoid pointless cred reference count bump
  smb: avoid pointless cred reference count bump
  cifs: avoid pointless cred reference count bump
  cifs: avoid pointless cred reference count bump
  ovl: avoid pointless cred reference count bump
  open: avoid pointless cred reference count bump
  nfsfh: avoid pointless cred reference count bump
  nfs/nfs4recover: avoid pointless cred reference count bump
  nfs/nfs4idmap: avoid pointless reference count bump
  nfs/localio: avoid pointless cred reference count bumps
  coredump: avoid pointless cred reference count bump
  binfmt_misc: avoid pointless cred reference count bump
  ...
2025-01-20 10:13:06 -08:00
Linus Torvalds
5f85bd6aec Merge tag 'vfs-6.14-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfs updates from Christian Brauner:

 - Rework inode number allocation

   Recently we received a patchset that aims to enable file handle
   encoding and decoding via name_to_handle_at(2) and
   open_by_handle_at(2).

   A crucical step in the patch series is how to go from inode number to
   struct pid without leaking information into unprivileged contexts.
   The issue is that in order to find a struct pid the pid number in the
   initial pid namespace must be encoded into the file handle via
   name_to_handle_at(2).

   This can be used by containers using a separate pid namespace to
   learn what the pid number of a given process in the initial pid
   namespace is. While this is a weak information leak it could be used
   in various exploits and in general is an ugly wart in the design.

   To solve this problem a new way is needed to lookup a struct pid
   based on the inode number allocated for that struct pid. The other
   part is to remove the custom inode number allocation on 32bit systems
   that is also an ugly wart that should go away.

   Allocate unique identifiers for struct pid by simply incrementing a
   64 bit counter and insert each struct pid into the rbtree so it can
   be looked up to decode file handles avoiding to leak actual pids
   across pid namespaces in file handles.

   On both 64 bit and 32 bit the same 64 bit identifier is used to
   lookup struct pid in the rbtree. On 64 bit the unique identifier for
   struct pid simply becomes the inode number. Comparing two pidfds
   continues to be as simple as comparing inode numbers.

   On 32 bit the 64 bit number assigned to struct pid is split into two
   32 bit numbers. The lower 32 bits are used as the inode number and
   the upper 32 bits are used as the inode generation number. Whenever a
   wraparound happens on 32 bit the 64 bit number will be incremented by
   2 so inode numbering starts at 2 again.

   When a wraparound happens on 32 bit multiple pidfds with the same
   inode number are likely to exist. This isn't a problem since before
   pidfs pidfds used the anonymous inode meaning all pidfds had the same
   inode number. On 32 bit sserspace can thus reconstruct the 64 bit
   identifier by retrieving both the inode number and the inode
   generation number to compare, or use file handles. This gives the
   same guarantees on both 32 bit and 64 bit.

 - Implement file handle support

   This is based on custom export operation methods which allows pidfs
   to implement permission checking and opening of pidfs file handles
   cleanly without hacking around in the core file handle code too much.

 - Support bind-mounts

   Allow bind-mounting pidfds. Similar to nsfs let's allow bind-mounts
   for pidfds. This allows pidfds to be safely recovered and checked for
   process recycling.

   Instead of checking d_ops for both nsfs and pidfs we could in a
   follow-up patch add a flag argument to struct dentry_operations that
   functions similar to file_operations->fop_flags.

* tag 'vfs-6.14-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  selftests: add pidfd bind-mount tests
  pidfs: allow bind-mounts
  pidfs: lookup pid through rbtree
  selftests/pidfd: add pidfs file handle selftests
  pidfs: check for valid ioctl commands
  pidfs: implement file handle support
  exportfs: add permission method
  fhandle: pull CAP_DAC_READ_SEARCH check into may_decode_fh()
  exportfs: add open method
  fhandle: simplify error handling
  pseudofs: add support for export_ops
  pidfs: support FS_IOC_GETVERSION
  pidfs: remove 32bit inode number handling
  pidfs: rework inode number allocation
2025-01-20 09:59:00 -08:00
Linus Torvalds
4b84a4c8d4 Merge tag 'vfs-6.14-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
 "Features:

   - Support caching symlink lengths in inodes

     The size is stored in a new union utilizing the same space as
     i_devices, thus avoiding growing the struct or taking up any more
     space

     When utilized it dodges strlen() in vfs_readlink(), giving about
     1.5% speed up when issuing readlink on /initrd.img on ext4

   - Add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag

     If a file system supports uncached buffered IO, it may set
     FOP_DONTCACHE and enable support for RWF_DONTCACHE.

     If RWF_DONTCACHE is attempted without the file system supporting
     it, it'll get errored with -EOPNOTSUPP

   - Enable VBOXGUEST and VBOXSF_FS on ARM64

     Now that VirtualBox is able to run as a host on arm64 (e.g. the
     Apple M3 processors) we can enable VBOXSF_FS (and in turn
     VBOXGUEST) for this architecture.

     Tested with various runs of bonnie++ and dbench on an Apple MacBook
     Pro with the latest Virtualbox 7.1.4 r165100 installed

  Cleanups:

   - Delay sysctl_nr_open check in expand_files()

   - Use kernel-doc includes in fiemap docbook

   - Use page->private instead of page->index in watch_queue

   - Use a consume fence in mnt_idmap() as it's heavily used in
     link_path_walk()

   - Replace magic number 7 with ARRAY_SIZE() in fc_log

   - Sort out a stale comment about races between fd alloc and dup2()

   - Fix return type of do_mount() from long to int

   - Various cosmetic cleanups for the lockref code

  Fixes:

   - Annotate spinning as unlikely() in __read_seqcount_begin

     The annotation already used to be there, but got lost in commit
     52ac39e5db ("seqlock: seqcount_t: Implement all read APIs as
     statement expressions")

   - Fix proc_handler for sysctl_nr_open

   - Flush delayed work in delayed fput()

   - Fix grammar and spelling in propagate_umount()

   - Fix ESP not readable during coredump

     In /proc/PID/stat, there is the kstkesp field which is the stack
     pointer of a thread. While the thread is active, this field reads
     zero. But during a coredump, it should have a valid value

     However, at the moment, kstkesp is zero even during coredump

   - Don't wake up the writer if the pipe is still full

   - Fix unbalanced user_access_end() in select code"

* tag 'vfs-6.14-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
  gfs2: use lockref_init for qd_lockref
  erofs: use lockref_init for pcl->lockref
  dcache: use lockref_init for d_lockref
  lockref: add a lockref_init helper
  lockref: drop superfluous externs
  lockref: use bool for false/true returns
  lockref: improve the lockref_get_not_zero description
  lockref: remove lockref_put_not_zero
  fs: Fix return type of do_mount() from long to int
  select: Fix unbalanced user_access_end()
  vbox: Enable VBOXGUEST and VBOXSF_FS on ARM64
  pipe_read: don't wake up the writer if the pipe is still full
  selftests: coredump: Add stackdump test
  fs/proc: do_task_stat: Fix ESP not readable during coredump
  fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag
  fs: sort out a stale comment about races between fd alloc and dup2
  fs: Fix grammar and spelling in propagate_umount()
  fs: fc_log replace magic number 7 with ARRAY_SIZE()
  fs: use a consume fence in mnt_idmap()
  file: flush delayed work in delayed fput()
  ...
2025-01-20 09:40:49 -08:00
Linus Torvalds
d582952424 Merge tag 'vfs-6.14-rc1.kcore' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull /proc/kcore updates from Christian Brauner:
 "The performance of /proc/kcore reads has been showing up as a
  bottleneck for the drgn debugger. drgn scripts often spend ~25% of
  their time in the kernel reading from /proc/kcore.

  A lot of this overhead comes from silly inefficiencies. This pull
  request contains fixes for the low-hanging fruit. The fixes are all
  fairly small and straightforward.

  The result is a 25% improvement in read latency in micro-benchmarks
  (from ~235 nanoseconds to ~175) and a 15% improvement in execution
  time for real-world drgn scripts:

   - Make /proc/kcore entry permanent

   - Avoid walking the list on every read

   - Use percpu_rw_semaphore for kclist_lock

   - Make Omar Sandoval the official maintainer for /proc/kcore"

* tag 'vfs-6.14-rc1.kcore' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  MAINTAINERS: add me as /proc/kcore maintainer
  proc/kcore: use percpu_rw_semaphore for kclist_lock
  proc/kcore: don't walk list on every read
  proc/kcore: mark proc entry as permanent
2025-01-20 09:36:55 -08:00
Linus Torvalds
ca56a74a31 Merge tag 'vfs-6.14-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs netfs updates from Christian Brauner:
 "This contains read performance improvements and support for monolithic
  single-blob objects that have to be read/written as such (e.g. AFS
  directory contents). The implementation of the two parts is interwoven
  as each makes the other possible.

   - Read performance improvements

     The read performance improvements are intended to speed up some
     loss of performance detected in cifs and to a lesser extend in afs.

     The problem is that we queue too many work items during the
     collection of read results: each individual subrequest is collected
     by its own work item, and then they have to interact with each
     other when a series of subrequests don't exactly align with the
     pattern of folios that are being read by the overall request.

     Whilst the processing of the pages covered by individual
     subrequests as they complete potentially allows folios to be woken
     in parallel and with minimum delay, it can shuffle wakeups for
     sequential reads out of order - and that is the most common I/O
     pattern.

     The final assessment and cleanup of an operation is then held up
     until the last I/O completes - and for a synchronous sequential
     operation, this means the bouncing around of work items just adds
     latency.

     Two changes have been made to make this work:

     (1) All collection is now done in a single "work item" that works
         progressively through the subrequests as they complete (and
         also dispatches retries as necessary).

     (2) For readahead and AIO, this work item be done on a workqueue
         and can run in parallel with the ultimate consumer of the data;
         for synchronous direct or unbuffered reads, the collection is
         run in the application thread and not offloaded.

     Functions such as smb2_readv_callback() then just tell netfslib
     that the subrequest has terminated; netfslib does a minimal bit of
     processing on the spot - stat counting and tracing mostly - and
     then queues/wakes up the worker. This simplifies the logic as the
     collector just walks sequentially through the subrequests as they
     complete and walks through the folios, if buffered, unlocking them
     as it goes. It also keeps to a minimum the amount of latency
     injected into the filesystem's low-level I/O handling

     The way netfs supports filesystems using the deprecated
     PG_private_2 flag is changed: folios are flagged and added to a
     write request as they complete and that takes care of scheduling
     the writes to the cache. The originating read request can then just
     unlock the pages whatever happens.

   - Single-blob object support

     Single-blob objects are files for which the content of the file
     must be read from or written to the server in a single operation
     because reading them in parts may yield inconsistent results. AFS
     directories are an example of this as there exists the possibility
     that the contents are generated on the fly and would differ between
     reads or might change due to third party interference.

     Such objects will be written to and retrieved from the cache if one
     is present, though we allow/may need to propose multiple
     subrequests to do so. The important part is that read from/write to
     the *server* is monolithic.

     Single blob reading is, for the moment, fully synchronous and does
     result collection in the application thread and, also for the
     moment, the API is supplied the buffer in the form of a folio_queue
     chain rather than using the pagecache.

   - Related afs changes

     This series makes a number of changes to the kafs filesystem,
     primarily in the area of directory handling:

      - AFS's FetchData RPC reply processing is made partially
        asynchronous which allows the netfs_io_request's outstanding
        operation counter to be removed as part of reducing the
        collection to a single work item.

      - Directory and symlink reading are plumbed through netfslib using
        the single-blob object API and are now cacheable with fscache.
        This also allows the afs_read struct to be eliminated and
        netfs_io_subrequest to be used directly instead.

      - Directory and symlink content are now stored in a folio_queue
        buffer rather than in the pagecache. This means we don't require
        the RCU read lock and xarray iteration to access it, and folios
        won't randomly disappear under us because the VM wants them
        back.

      - The vnode operation lock is changed from a mutex struct to a
        private lock implementation. The problem is that the lock now
        needs to be dropped in a separate thread and mutexes don't
        permit that.

      - When a new directory or symlink is created, we now initialise it
        locally and mark it valid rather than downloading it (we know
        what it's likely to look like).

      - We now use the in-directory hashtable to reduce the number of
        entries we need to scan when doing a lookup. The edit routines
        have to maintain the hash chains.

      - Cancellation (e.g. by signal) of an async call after the
        rxrpc_call has been set up is now offloaded to the worker thread
        as there will be a notification from rxrpc upon completion. This
        avoids a double cleanup.

   - A "rolling buffer" implementation is created to abstract out the
     two separate folio_queue chaining implementations I had (one for
     read and one for write).

   - Functions are provided to create/extend a buffer in a folio_queue
     chain and tear it down again.

     This is used to handle AFS directories, but could also be used to
     create bounce buffers for content crypto and transport crypto.

   - The was_async argument is dropped from netfs_read_subreq_terminated()

     Instead we wake the read collection work item by either queuing it
     or waking up the app thread.

   - We don't need to use BH-excluding locks when communicating between
     the issuing thread and the collection thread as neither of them now
     run in BH context.

   - Also included are a number of new tracepoints; a split of the
     netfslib write collection code to put retrying into its own file
     (it gets more complicated with content encryption).

   - There are also some minor fixes AFS included, including fixing the
     AFS directory format struct layout, reducing some directory
     over-invalidation and making afs_mkdir() translate EEXIST to
     ENOTEMPY (which is not available on all systems the servers
     support).

   - Finally, there's a patch to try and detect entry into the folio
     unlock function with no folio_queue structs in the buffer (which
     isn't allowed in the cases that can get there).

     This is a debugging patch, but should be minimal overhead"

* tag 'vfs-6.14-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (31 commits)
  netfs: Report on NULL folioq in netfs_writeback_unlock_folios()
  afs: Add a tracepoint for afs_read_receive()
  afs: Locally initialise the contents of a new symlink on creation
  afs: Use the contained hashtable to search a directory
  afs: Make afs_mkdir() locally initialise a new directory's content
  netfs: Change the read result collector to only use one work item
  afs: Make {Y,}FS.FetchData an asynchronous operation
  afs: Fix cleanup of immediately failed async calls
  afs: Eliminate afs_read
  afs: Use netfslib for symlinks, allowing them to be cached
  afs: Use netfslib for directories
  afs: Make afs_init_request() get a key if not given a file
  netfs: Add support for caching single monolithic objects such as AFS dirs
  netfs: Add functions to build/clean a buffer in a folio_queue
  afs: Add more tracepoints to do with tracking validity
  cachefiles: Add auxiliary data trace
  cachefiles: Add some subrequest tracepoints
  netfs: Remove some extraneous directory invalidations
  afs: Fix directory format encoding struct
  afs: Fix EEXIST error returned from afs_rmdir() to be ENOTEMPTY
  ...
2025-01-20 09:29:11 -08:00
Pali Rohár
10e6fe53d9 cifs: Do not attempt to call CIFSGetSrvInodeNumber() without CAP_INFOLEVEL_PASSTHRU
CIFSGetSrvInodeNumber() uses SMB_QUERY_FILE_INTERNAL_INFO (0x3ee) level
which is SMB PASSTHROUGH level (>= 0x03e8). SMB PASSTHROUGH levels are
supported only when server announce CAP_INFOLEVEL_PASSTHRU.

So add guard in cifs_query_file_info() function which is the only user of
CIFSGetSrvInodeNumber() function and returns -EOPNOTSUPP when server does
not announce CAP_INFOLEVEL_PASSTHRU.

Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-19 19:58:11 -06:00
Pali Rohár
e20a405fe4 cifs: Do not attempt to call CIFSSMBRenameOpenFile() without CAP_INFOLEVEL_PASSTHRU
CIFSSMBRenameOpenFile() uses SMB_SET_FILE_RENAME_INFORMATION (0x3f2) level
which is SMB PASSTHROUGH level (>= 0x03e8). SMB PASSTHROUGH levels are
supported only when server announce CAP_INFOLEVEL_PASSTHRU.

All usage of CIFSSMBRenameOpenFile() execept the one is already guarded by
checks which prevents calling it against servers without support for
CAP_INFOLEVEL_PASSTHRU.

The remaning usage without guard is in cifs_do_rename() function, so add
missing guard here.

Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-19 19:57:02 -06:00
Pali Rohár
4bda5f4de0 cifs: Remove declaration of dead CIFSSMBQuerySymLink function
Function CIFSSMBQuerySymLink() was renamed to cifs_query_reparse_point() in
commit ed3e0a149b ("smb: client: implement ->query_reparse_point() for
SMB1"). Remove its dead declaration from header file too.

Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-19 19:52:09 -06:00
Pali Rohár
6d08851c60 cifs: Fix printing Status code into dmesg
NT Status code is 32-bit number, so for comparing two NT Status codes is
needed to check all 32 bits, and not just low 24 bits.

Before this change kernel printed message:
"Status code returned 0x8000002d NT_STATUS_NOT_COMMITTED"

It was incorrect as because NT_STATUS_NOT_COMMITTED is defined as
0xC000002d and 0x8000002d has defined name NT_STATUS_STOPPED_ON_SYMLINK.

With this change kernel prints message:
"Status code returned 0x8000002d NT_STATUS_STOPPED_ON_SYMLINK"

Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-19 19:46:42 -06:00
Pali Rohár
014fdae602 cifs: Add missing NT_STATUS_* codes from nterr.h to nterr.c
This allows cifs_print_status() to show string representation also for
these error codes.

Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-19 19:45:24 -06:00
Pali Rohár
4e2ee32829 cifs: Fix endian types in struct rfc1002_session_packet
All fields in struct rfc1002_session_packet are in big endian. This is
because all NetBIOS packet headers are in big endian as opposite of SMB
structures which are in little endian.

Therefore use __be16 and __be32 types instead of __u16 and __u32 in
struct rfc1002_session_packet.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-19 19:34:00 -06:00
Pali Rohár
015683d4ed cifs: Use cifs_autodisable_serverino() for disabling CIFS_MOUNT_SERVER_INUM in readdir.c
In all other places is used function cifs_autodisable_serverino() for
disabling CIFS_MOUNT_SERVER_INUM mount flag. So use is also in readir.c
_initiate_cifs_search() function. Benefit of cifs_autodisable_serverino()
is that it also prints dmesg message that server inode numbers are being
disabled.

Fixes: ec06aedd44 ("cifs: clean up handling when server doesn't consistently support inode numbers")
Fixes: f534dc9943 ("cifs: clear server inode number flag while autodisabling")
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-19 19:34:00 -06:00
Steve French
72cf9e94f3 smb3: add missing tracepoint for querying wsl EAs
We had tracepoints for the return code for querying WSL EAs
(trace_smb3_query_wsl_ea_compound_err and
trace_smb3_query_wsl_ea_compound_done) but were missing one for
trace_smb3_query_wsl_ea_compound_enter.

Fixes: ea41367b2a ("smb: client: introduce SMB2_OP_QUERY_WSL_EA")
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-19 19:34:00 -06:00
Ruben Devos
11f8b80ab9 smb: client: fix order of arguments of tracepoints
The tracepoints based on smb3_inf_compound_*_class have tcon id and
session id swapped around. This results in incorrect output in
`trace-cmd report`.

Fix the order of arguments to resolve this issue. The trace-cmd output
below shows the before and after of the smb3_delete_enter and
smb3_delete_done events as an example. The smb3_cmd_* events show the
correct session and tcon id for reference.

Also fix tracepoint set -> get in the SMB2_OP_GET_REPARSE case.

BEFORE:
rm-2211  [001] .....  1839.550888: smb3_delete_enter:    xid=281 sid=0x5 tid=0x3d path=\hello2.txt
rm-2211  [001] .....  1839.550894: smb3_cmd_enter:        sid=0x1ac000000003d tid=0x5 cmd=5 mid=61
rm-2211  [001] .....  1839.550896: smb3_cmd_enter:        sid=0x1ac000000003d tid=0x5 cmd=6 mid=62
rm-2211  [001] .....  1839.552091: smb3_cmd_done:         sid=0x1ac000000003d tid=0x5 cmd=5 mid=61
rm-2211  [001] .....  1839.552093: smb3_cmd_done:         sid=0x1ac000000003d tid=0x5 cmd=6 mid=62
rm-2211  [001] .....  1839.552103: smb3_delete_done:     xid=281 sid=0x5 tid=0x3d

AFTER:
rm-2501  [001] .....  3237.656110: smb3_delete_enter:    xid=88 sid=0x1ac0000000041 tid=0x5 path=\hello2.txt
rm-2501  [001] .....  3237.656122: smb3_cmd_enter:        sid=0x1ac0000000041 tid=0x5 cmd=5 mid=84
rm-2501  [001] .....  3237.656123: smb3_cmd_enter:        sid=0x1ac0000000041 tid=0x5 cmd=6 mid=85
rm-2501  [001] .....  3237.657909: smb3_cmd_done:         sid=0x1ac0000000041 tid=0x5 cmd=5 mid=84
rm-2501  [001] .....  3237.657909: smb3_cmd_done:         sid=0x1ac0000000041 tid=0x5 cmd=6 mid=85
rm-2501  [001] .....  3237.657922: smb3_delete_done:     xid=88 sid=0x1ac0000000041 tid=0x5

Cc: stable@vger.kernel.org
Signed-off-by: Ruben Devos <devosruben6@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-19 19:34:00 -06:00
Paulo Alcantara
be7a6a7766 smb: client: fix oops due to unset link speed
It isn't guaranteed that NETWORK_INTERFACE_INFO::LinkSpeed will always
be set by the server, so the client must handle any values and then
prevent oopses like below from happening:

Oops: divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 1323 Comm: cat Not tainted 6.13.0-rc7 #2
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-3.fc41
04/01/2014
RIP: 0010:cifs_debug_data_proc_show+0xa45/0x1460 [cifs] Code: 00 00 48
89 df e8 3b cd 1b c1 41 f6 44 24 2c 04 0f 84 50 01 00 00 48 89 ef e8
e7 d0 1b c1 49 8b 44 24 18 31 d2 49 8d 7c 24 28 <48> f7 74 24 18 48 89
c3 e8 6e cf 1b c1 41 8b 6c 24 28 49 8d 7c 24
RSP: 0018:ffffc90001817be0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88811230022c RCX: ffffffffc041bd99
RDX: 0000000000000000 RSI: 0000000000000567 RDI: ffff888112300228
RBP: ffff888112300218 R08: fffff52000302f5f R09: ffffed1022fa58ac
R10: ffff888117d2c566 R11: 00000000fffffffe R12: ffff888112300200
R13: 000000012a15343f R14: 0000000000000001 R15: ffff888113f2db58
FS: 00007fe27119e740(0000) GS:ffff888148600000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe2633c5000 CR3: 0000000124da0000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __die_body.cold+0x19/0x27
 ? die+0x2e/0x50
 ? do_trap+0x159/0x1b0
 ? cifs_debug_data_proc_show+0xa45/0x1460 [cifs]
 ? do_error_trap+0x90/0x130
 ? cifs_debug_data_proc_show+0xa45/0x1460 [cifs]
 ? exc_divide_error+0x39/0x50
 ? cifs_debug_data_proc_show+0xa45/0x1460 [cifs]
 ? asm_exc_divide_error+0x1a/0x20
 ? cifs_debug_data_proc_show+0xa39/0x1460 [cifs]
 ? cifs_debug_data_proc_show+0xa45/0x1460 [cifs]
 ? seq_read_iter+0x42e/0x790
 seq_read_iter+0x19a/0x790
 proc_reg_read_iter+0xbe/0x110
 ? __pfx_proc_reg_read_iter+0x10/0x10
 vfs_read+0x469/0x570
 ? do_user_addr_fault+0x398/0x760
 ? __pfx_vfs_read+0x10/0x10
 ? find_held_lock+0x8a/0xa0
 ? __pfx_lock_release+0x10/0x10
 ksys_read+0xd3/0x170
 ? __pfx_ksys_read+0x10/0x10
 ? __rcu_read_unlock+0x50/0x270
 ? mark_held_locks+0x1a/0x90
 do_syscall_64+0xbb/0x1d0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fe271288911
Code: 00 48 8b 15 01 25 10 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8
20 ad 01 00 f3 0f 1e fa 80 3d b5 a7 10 00 00 74 13 31 c0 0f 05 <48> 3d
00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec
RSP: 002b:00007ffe87c079d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000040000 RCX: 00007fe271288911
RDX: 0000000000040000 RSI: 00007fe2633c6000 RDI: 0000000000000003
RBP: 00007ffe87c07a00 R08: 0000000000000000 R09: 00007fe2713e6380
R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000040000
R13: 00007fe2633c6000 R14: 0000000000000003 R15: 0000000000000000
 </TASK>

Fix this by setting cifs_server_iface::speed to a sane value (1Gbps)
by default when link speed is unset.

Cc: Shyam Prasad N <nspmangalore@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Fixes: a6d8fb54a5 ("cifs: distribute channels across interfaces based on speed")
Reported-by: Frank Sorenson <sorenson@redhat.com>
Reported-by: Jay Shin <jaeshin@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-19 19:34:00 -06:00
Liang Jie
215b7f9ecb smb: client: correctly handle ErrorContextData as a flexible array
The `smb2_symlink_err_rsp` structure was previously defined with
`ErrorContextData` as a single `__u8` byte. However, the `ErrorContextData`
field is intended to be a variable-length array based on `ErrorDataLength`.
This mismatch leads to incorrect pointer arithmetic and potential memory
access issues when processing error contexts.

Updates the `ErrorContextData` field to be a flexible array
(`__u8 ErrorContextData[]`). Additionally, it modifies the corresponding
casts in the `symlink_data()` function to properly handle the flexible
array, ensuring correct memory calculations and data handling.

These changes improve the robustness of SMB2 symlink error processing.

Signed-off-by: Liang Jie <liangjie@lixiang.com>
Suggested-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-19 19:34:00 -06:00
Paulo Alcantara
48aa99523e smb: client: don't retry DFS targets on server shutdown
If TCP Server is about to be destroyed (e.g. CifsExiting was set) and
it is reconnecting, stop retrying DFS targets from cached DFS referral
as this would potentially delay server shutdown in several seconds.

Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-19 19:34:00 -06:00
Paulo Alcantara
bfc1155030 smb: client: fix return value of parse_dfs_referrals()
Return -ENOENT in parse_dfs_referrals() when server returns no targets
for a referral request as specified in

  MS-DFSC 3.1.5.4.3 Receiving a Root Referral Response or Link
  Referral Response:

    > If the referral request is successful, but the NumberOfReferrals
    > field in the referral header (as specified in section 2.2.4) is
    > 0, the DFS server could not find suitable targets to return to
    > the client.  In this case, the client MUST fail the original I/O
    > operation with STATUS_OBJECT_PATH_NOT_FOUND.

Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-19 19:34:00 -06:00
Paulo Alcantara
5433c629e8 smb: client: optimize referral walk on failed link targets
If a link referral request sent to root server was successful but
client failed to connect to all link targets, there is no need to
retry same link referral on a different root server.  Set an end
marker for the DFS root referral so the client will not attempt to
re-send link referrals to different root servers on failures.

Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-19 19:34:00 -06:00
Paulo Alcantara
4b1b4c8be9 smb: client: provide dns_resolve_{unc,name} helpers
Some places pass hostnames rather than UNC paths to resolve them to ip
addresses, so provide helpers to handle both cases and then stop
converting hostnames to UNC paths by inserting path delimiters into
them.  Also kill @expiry parameter as it's not used anywhere.

Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-19 19:34:00 -06:00
Paulo Alcantara
489d152310 smb: client: parse DNS domain name from domain= option
If the user specified a DNS domain name in domain= mount option, then
use it instead of parsing it in NTLMSSP CHALLENGE_MESSAGE message.

Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-19 19:34:00 -06:00
Paulo Alcantara
ad46faff1a smb: client: fix DFS mount against old servers with NTLMSSP
Old Windows servers will return not fully qualified DFS targets by
default as specified in

  MS-DFSC 3.2.5.5 Receiving a Root Referral Request or Link Referral
  Request

    | Servers SHOULD<30> return fully qualified DNS host names of
    | targets in responses to root referral requests and link referral
    | requests.
    | ...
    | <30> Section 3.2.5.5: By default, Windows Server 2003, Windows
    | Server 2008, Windows Server 2008 R2, Windows Server 2012, and
    | Windows Server 2012 R2 return DNS host names that are not fully
    | qualified for targets.

Fix this by converting all NetBIOS host names from DFS targets to
FQDNs and try resolving them first if DNS domain name was provided in
NTLMSSP CHALLENGE_MESSAGE message from previous SMB2_SESSION_SETUP.
This also prevents the client from translating the DFS target
hostnames to another domain depending on the network domain search
order.

Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-19 19:34:00 -06:00
Paulo Alcantara
0e8ae9b953 smb: client: parse av pair type 4 in CHALLENGE_MESSAGE
Parse FQDN of the domain in CHALLENGE_MESSAGE message as it's gonna be
useful when mounting DFS shares against old Windows Servers (2012 R2
or earlier) that return not fully qualified hostnames for DFS targets
by default.

Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-19 19:33:59 -06:00
Paulo Alcantara
62eecd8aac smb: client: introduce av_for_each_entry() helper
Use new helper in find_domain_name() and find_timestamp() to avoid
duplicating code.

Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-19 19:33:59 -06:00
Ard Biesheuvel
e7b4b1f61d Merge branch 'efivarfs' into next 2025-01-19 17:50:58 +01:00
James Bottomley
908af31f48 efivarfs: fix error on write to new variable leaving remnants
Make variable cleanup go through the fops release mechanism and use
zero inode size as the indicator to delete the file.  Since all EFI
variables must have an initial u32 attribute, zero size occurs either
because the update deleted the variable or because an unsuccessful
write after create caused the size never to be set in the first place.
In the case of multiple racing opens and closes, the open is counted
to ensure that the zero size check is done on the last close.

Even though this fixes the bug that a create either not followed by a
write or followed by a write that errored would leave a remnant file
for the variable, the file will appear momentarily globally visible
until the last close of the fd deletes it.  This is safe because the
normal filesystem operations will mediate any races; however, it is
still possible for a directory listing at that instant between create
and close contain a zero size variable that doesn't exist in the EFI
table.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-01-19 17:50:26 +01:00
James Bottomley
a58e954464 efivarfs: remove unused efivarfs_list
Remove all function helpers and mentions of the efivarfs_list now that
all consumers of the list have been removed and entry management goes
exclusively through the inode.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-01-19 17:50:26 +01:00