Commit Graph

3068 Commits

Author SHA1 Message Date
Linus Torvalds
b5d760d53a Merge tag 'vfs-6.17-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs iomap updates from Christian Brauner:

 - Refactor the iomap writeback code and split the generic and ioend/bio
   based writeback code.

   There are two methods that define the split between the generic
   writeback code and the implementation of it, and all knowledge of
   ioends and bios now sits below that layer.

 - Add fuse iomap support for buffered writes and dirty folio writeback.

   This is needed so that granular uptodate and dirty tracking can be
   used in fuse when large folios are enabled. This has two big
   advantages. For writes, instead of the entire folio needing to be
   read into the page cache, only the relevant portions need to be. For
   writeback, only the dirty portions need to be written back instead of
   the entire folio.
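
As a rough illustration of the granular dirty tracking described above, the following userspace sketch keeps one dirty bit per block of a large folio and walks the dirty runs, so that only those runs would need to be written back. All names and sizes here (`sketch_folio_state`, a 64 KiB folio of 4 KiB blocks) are assumptions made for the example; this is not the iomap or fuse implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes for the sketch: a 64 KiB folio made of 4 KiB blocks. */
#define SKETCH_BLOCK_SIZE  4096u
#define SKETCH_NR_BLOCKS   16u
#define SKETCH_FOLIO_SIZE  (SKETCH_BLOCK_SIZE * SKETCH_NR_BLOCKS)

/* Hypothetical per-folio state: bit i is set when block i is dirty. */
struct sketch_folio_state {
	uint32_t dirty;
};

/* Mark every block that overlaps the byte range [off, off + len) dirty. */
static void sketch_mark_dirty(struct sketch_folio_state *st,
			      size_t off, size_t len)
{
	/* The range must lie inside the folio. */
	assert(off <= SKETCH_FOLIO_SIZE && len <= SKETCH_FOLIO_SIZE - off);
	if (len == 0)
		return;
	for (size_t i = off / SKETCH_BLOCK_SIZE;
	     i <= (off + len - 1) / SKETCH_BLOCK_SIZE; i++)
		st->dirty |= UINT32_C(1) << i;
}

/*
 * Find the next run of consecutive dirty blocks at or after block 'start'.
 * Returns false when no dirty block is left; otherwise stores the first
 * block of the run and its length in blocks.
 */
static bool sketch_next_dirty_run(const struct sketch_folio_state *st,
				  size_t start, size_t *first, size_t *count)
{
	size_t i = start;

	while (i < SKETCH_NR_BLOCKS && !((st->dirty >> i) & 1u))
		i++;
	if (i >= SKETCH_NR_BLOCKS)
		return false;
	*first = i;
	while (i < SKETCH_NR_BLOCKS && ((st->dirty >> i) & 1u))
		i++;
	*count = i - *first;
	return true;
}
```

A writeback loop would call `sketch_next_dirty_run()` repeatedly, starting each search at `first + count`, and submit only the blocks in each run instead of the whole folio.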

* tag 'vfs-6.17-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fuse: refactor writeback to use iomap_writepage_ctx inode
  fuse: hook into iomap for invalidating and checking partial uptodateness
  fuse: use iomap for folio laundering
  fuse: use iomap for writeback
  fuse: use iomap for buffered writes
  iomap: build the writeback code without CONFIG_BLOCK
  iomap: add read_folio_range() handler for buffered writes
  iomap: improve argument passing to iomap_read_folio_sync
  iomap: replace iomap_folio_ops with iomap_write_ops
  iomap: export iomap_writeback_folio
  iomap: move folio_unlock out of iomap_writeback_folio
  iomap: rename iomap_writepage_map to iomap_writeback_folio
  iomap: move all ioend handling to ioend.c
  iomap: add public helpers for uptodate state manipulation
  iomap: hide ioends from the generic writeback code
  iomap: refactor the writeback interface
  iomap: cleanup the pending writeback tracking in iomap_writepage_map_blocks
  iomap: pass more arguments using the iomap writeback context
  iomap: header diet
2025-07-28 16:09:03 -07:00
Linus Torvalds
57fcb7d930 Merge tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fileattr updates from Christian Brauner:
 "This introduces the new file_getattr() and file_setattr() system calls
  after lengthy discussions.

  Both system calls serve as successors and extensible companions to
  the FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR ioctls, which have
  started to show their age in addition to being named in a way that
  makes it easy to conflate them with extended attribute related
  operations.

  These syscalls allow userspace to set filesystem inode attributes on
  special files. One usage example is XFS quota projects.

  XFS has project quotas which can be attached to a directory. All new
  inodes in these directories inherit the project ID set on the parent
  directory.

  The project is created from userspace by opening and calling
  FS_IOC_FSSETXATTR on each inode. This is not possible for special
  files such as FIFO, SOCK, BLK etc. Therefore, some inodes are left
  with an empty project ID. Those inodes are then not shown in the quota
  accounting but still exist in the directory. This is not critical, but
  when special files are created in a directory that already has a
  project quota, these new inodes inherit the extended attributes. This
  creates a mix of special files with and without attributes. Moreover,
  special files with attributes cannot have those attributes cleared or
  changed. This, in turn, prevents userspace from re-creating the quota
  project on these existing files.

  In addition, these new system calls allow the implementation of
  additional attributes that we couldn't or didn't want to fit into the
  legacy ioctls anymore"

* tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: tighten a sanity check in file_attr_to_fileattr()
  tree-wide: s/struct fileattr/struct file_kattr/g
  fs: introduce file_getattr and file_setattr syscalls
  fs: prepare for extending file_get/setattr()
  fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP
  selinux: implement inode_file_[g|s]etattr hooks
  lsm: introduce new hooks for setting/getting inode fsxattr
  fs: split fileattr related helpers into separate file
2025-07-28 15:24:14 -07:00
Linus Torvalds
11fe69fbd5 Merge tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dentry d_flags updates from Al Viro:
 "The current exclusion rules for dentry->d_flags stores are rather
  unpleasant. The basic rules are simple:

   - stores to dentry->d_flags are OK under dentry->d_lock

   - stores to dentry->d_flags are OK in the dentry constructor, before
     it becomes potentially visible to other threads

  Unfortunately, there's a couple of exceptions to that, and that's
  where the headache comes from.

  The main PITA comes from d_set_d_op(); that primitive sets ->d_op of
  dentry and adjusts the flags that correspond to presence of individual
  methods. It's very easy to misuse; existing uses _are_ safe, but proof
  of correctness is brittle.

  Use in __d_alloc() is safe (we are within a constructor), but we might
  as well precalculate the initial value of 'd_flags' when we set the
  default ->d_op for given superblock and set 'd_flags' directly instead
  of messing with that helper.

  The reasons why other uses are safe are bloody convoluted; I'm not
  going to reproduce it here. See [1] for gory details, if you care. The
  critical part is using d_set_d_op() only just prior to
  d_splice_alias(), which makes a combination of d_splice_alias() with
  setting ->d_op, etc a natural replacement primitive.

  Better yet, if we go that way, it's easy to take setting ->d_op and
  modifying 'd_flags' under ->d_lock, which eliminates the headache as
  far as 'd_flags' exclusion rules are concerned. Other exceptions are
  minor and easy to deal with.

  What this series does:

   - d_set_d_op() is no longer available; instead a new primitive
     (d_splice_alias_ops()) is provided, equivalent to combination of
     d_set_d_op() and d_splice_alias().

   - new field of struct super_block - 's_d_flags'. This sets the
     default value of 'd_flags' to be used when allocating dentries on
     this filesystem.

   - new primitive for setting 's_d_op': set_default_d_op(). This
     replaces stores to 's_d_op' at mount time.

     All in-tree filesystems converted; out-of-tree ones will get caught
     by the compiler ('s_d_op' is renamed, so stores to it will be
     caught). 's_d_flags' is set by the same primitive to match the
     's_d_op'.

   - a lot of filesystems had sb->s_d_op->d_delete equal to
     always_delete_dentry; that is equivalent to setting
     DCACHE_DONTCACHE in 'd_flags', so such filesystems can bloody well
     set that bit in 's_d_flags' and drop 'd_delete()' from
     dentry_operations.

     In quite a few cases that results in empty dentry_operations, which
     means that we can get rid of those.

   - kill simple_dentry_operations - not needed anymore

   - massage d_alloc_parallel() to get rid of the other exception wrt
     'd_flags' stores - we can set DCACHE_PAR_LOOKUP as soon as we
     allocate the new dentry; no need to delay that until we commit to
     using the sucker.

  As the result, 'd_flags' stores are all either under ->d_lock or done
  before the dentry becomes visible in any shared data structures"

Link: https://lore.kernel.org/all/20250224010624.GT1977892@ZenIV/ [1]

* tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (21 commits)
  configfs: use DCACHE_DONTCACHE
  debugfs: use DCACHE_DONTCACHE
  efivarfs: use DCACHE_DONTCACHE instead of always_delete_dentry()
  9p: don't bother with always_delete_dentry
  ramfs, hugetlbfs, mqueue: set DCACHE_DONTCACHE
  kill simple_dentry_operations
  devpts, sunrpc, hostfs: don't bother with ->d_op
  shmem: no dentry retention past the refcount reaching zero
  d_alloc_parallel(): set DCACHE_PAR_LOOKUP earlier
  make d_set_d_op() static
  simple_lookup(): just set DCACHE_DONTCACHE
  tracefs: Add d_delete to remove negative dentries
  set_default_d_op(): calculate the matching value for ->d_flags
  correct the set of flags forbidden at d_set_d_op() time
  split d_flags calculation out of d_set_d_op()
  new helper: set_default_d_op()
  fuse: no need for special dentry_operations for root dentry
  switch procfs from d_set_d_op() to d_splice_alias_ops()
  new helper: d_splice_alias_ops()
  procfs: kill ->proc_dops
  ...
2025-07-28 09:17:57 -07:00
Andreas Gruenbacher
deb016c166 gfs2: No more self recovery
When a node withdraws and it turns out that it is the only node that has
the filesystem mounted, gfs2 currently tries to replay the local journal
to bring the filesystem back into a consistent state.  Not only is that
a very bad idea, it has also never worked because gfs2_recover_func()
will refuse to do anything during a withdraw.

However, before even getting to this point, gfs2_recover_func()
dereferences sdp->sd_jdesc->jd_inode.  This was a use-after-free before
commit 04133b607a ("gfs2: Prevent double iput for journal on error")
and is a NULL pointer dereference since then.

Simply get rid of self recovery to fix that.

Fixes: 601ef0d52e ("gfs2: Force withdraw to replay journals and wait for it to finish")
Reported-by: Chunjie Zhu <chunjie.zhu@cloud.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-16 23:30:32 +02:00
Andrew Price
557c024ca7 gfs2: Validate i_depth for exhash directories
A fuzzer test introduced corruption that ends up with a depth of 0 in
dir_e_read(), causing an undefined shift by 32 at:

  index = hash >> (32 - dip->i_depth);

As calculated in an open-coded way in dir_make_exhash(), the minimum
depth for an exhash directory is ilog2(sdp->sd_hash_ptrs) and 0 is
invalid as sdp->sd_hash_ptrs is fixed as sdp->bsize / 16 at mount time.

So we can avoid the undefined behaviour by checking for depth values
lower than the minimum in gfs2_dinode_in(). Values greater than the
maximum are already being checked for there.

Also switch the calculation in dir_make_exhash() to use ilog2() to
clarify how the depth is calculated.
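
The check described above can be sketched in userspace C as follows. This is a minimal illustration, not the gfs2 patch: the helper names are hypothetical, the number of hash pointers is taken as block size / 16 as stated in the commit message, and the maximum depth is passed in by the caller rather than assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* floor(log2(v)) for v > 0; a userspace stand-in for the kernel's ilog2(). */
static unsigned int sketch_ilog2(uint32_t v)
{
	unsigned int r = 0;

	assert(v > 0);
	while (v >>= 1)
		r++;
	return r;
}

/*
 * Minimum depth of an exhash directory: log2 of the number of hash
 * pointers, taken here as block_size / 16 per the commit message.
 */
static unsigned int sketch_min_depth(uint32_t block_size)
{
	return sketch_ilog2(block_size / 16);
}

/* Reject depths below the minimum as well as above the maximum. */
static bool sketch_depth_valid(unsigned int depth, uint32_t block_size,
			       unsigned int max_depth)
{
	return depth >= sketch_min_depth(block_size) && depth <= max_depth;
}

/*
 * Hash table index for a validated depth. A depth of 0 would shift a
 * 32-bit value by 32, which is undefined behaviour in C, so callers
 * must validate the depth first.
 */
static uint32_t sketch_hash_index(uint32_t hash, unsigned int depth)
{
	assert(depth >= 1 && depth <= 32);
	return hash >> (32 - depth);
}
```

With a 4096-byte block size the sketch yields a minimum depth of 8, so a corrupted depth of 0 is rejected before the shift is ever evaluated.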

Tested with the syzkaller repro.c and xfstests '-g quick'.

Reported-by: syzbot+4708579bb230a0582a57@syzkaller.appspotmail.com
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-16 15:26:44 +02:00
Andrew Price
5c8f12cf1e gfs2: Set .migrate_folio in gfs2_{rgrp,meta}_aops
Clears up the warning added in 7ee3647243 ("migrate: Remove call to
->writepage") that occurs in various xfstests, causing "something found
in dmesg" failures.

[  341.136573] gfs2_meta_aops does not implement migrate_folio
[  341.136953] WARNING: CPU: 1 PID: 36 at mm/migrate.c:944 move_to_new_folio+0x2f8/0x300

Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-15 13:10:01 +02:00
Andreas Gruenbacher
e7ffc0af0e gfs2: a minor finish_xmote cleanup
As a minor clean-up to commit 1fc05c8d84 ("gfs2: cancel timed-out
glock requests"), when a demote request is in progress in
finish_xmote(), there is no point in waking up the glock holder at the
head of the queue because the reply from dlm cannot be on behalf of that
glock holder.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
2025-07-15 04:20:40 +02:00
Andreas Gruenbacher
92cef39bb3 gfs2: simplify finish_xmote
As a follow-up to commit a431d49243 ("gfs2: Fix request cancelation
bug"), it turns out that any call to finish_xmote() is always followed
by a call to run_queue(), either

 * directly when glock_work_func() calls finish_xmote() before calling
   run_queue(), or

 * indirectly when do_xmote() calls finish_xmote() before calling
   gfs2_glock_queue_work(), which queues a call to glock_work_func() in
   work queue context,

so remove the code in finish_xmote() that duplicates the functionality
of run_queue().

In addition, the code this commit removes is missing a check for the
GLF_DEMOTE flag which indicates that no further promotes should be
performed, so if that code didn't get removed, that check would have to
be added.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
2025-07-15 04:20:40 +02:00
Andreas Gruenbacher
6e417b3eb8 gfs2: sanitize the gdlm_ast -> finish_xmote interface
When gdlm_ast() is called with a non-zero status code, this means that
the requested operation did not succeed and the current lock state
didn't change.  Turn that into a non-zero LM_OUT_* status code (with ret
& ~LM_OUT_ST_MASK != 0) instead of pretending that dlm returned the
current lock state.

That way, we can easily change finish_xmote() to only update
gl->gl_state when the state has actually changed.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
2025-07-15 04:20:40 +02:00
Christoph Hellwig
2a5574fc57 iomap: replace iomap_folio_ops with iomap_write_ops
The iomap_folio_ops are only used for buffered writes, including the zero
and unshare variants.  Rename them to iomap_write_ops to better describe
the usage, and pass them through the call chain like the other operation
specific methods instead of through the iomap.

xfs_iomap_valid grows a IOMAP_HOLE check to keep the existing behavior
that never attached the folio_ops to a iomap representing a hole.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-12-hch@lst.de
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 10:51:33 +02:00
Christoph Hellwig
f4fa7981fa iomap: hide ioends from the generic writeback code
Replace the ioend pointer in iomap_writeback_ctx with a void *wb_ctx
one to facilitate non-block, non-ioend writeback.  Rename
the submit_ioend method to writeback_submit and make it mandatory so
that the generic writeback code stops seeing ioends and bios.

Co-developed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-6-hch@lst.de
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 10:51:31 +02:00
Christoph Hellwig
fb7399cf2d iomap: refactor the writeback interface
Replace ->map_blocks with a new ->writeback_range, which differs in the
following ways:

 - it must also queue up the I/O for writeback, that is, call into the
   slightly refactored and extended in scope iomap_add_to_ioend for
   each region
 - it can handle only a part of the requested region, that is, the retry
   loop for partial mappings moves to the caller
 - it handles cleanup on failures as well, and thus also replaces the
   discard_folio method only implemented by XFS.

This will allow the iomap writeback code to be used also for file systems
that are not block based, like fuse.

Co-developed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-5-hch@lst.de
Acked-by: Damien Le Moal <dlemoal@kernel.org>	# zonefs
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 10:51:31 +02:00
Christoph Hellwig
67fd9615a7 iomap: pass more arguments using the iomap writeback context
Add inode and wpc fields to pass the inode and writeback context that
are needed in the entire writeback call chain, and let the callers
initialize all fields in the writeback context before calling
iomap_writepages to simplify the argument passing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-3-hch@lst.de
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 10:51:31 +02:00
Andreas Gruenbacher
75bb2ddea9 gfs2: Minor do_xmote cancelation fix
Commit 6cb3b1c2df changed how finish_xmote() clears the GLF_LOCK flag,
but it failed to adjust the equivalent code in do_xmote().  Fix that.

Fixes: 6cb3b1c2df ("gfs2: Fix additional unlikely request cancelation race")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-09 20:04:14 +02:00
Andreas Gruenbacher
2c6e2cb9e7 gfs2: Remove GIF_ALLOC_FAILED flag
Get rid of the GIF_ALLOC_FAILED flag; we can now be confident that the
additional consistency check isn't needed.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-09 19:40:07 +02:00
Andreas Gruenbacher
00983d248c gfs2: Use SECTOR_SIZE and SECTOR_SHIFT
Use the SECTOR_SIZE and SECTOR_SHIFT constants where appropriate instead
of hardcoding their values.

Reported-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-09 19:40:07 +02:00
Christian Brauner
ca115d7e75 tree-wide: s/struct fileattr/struct file_kattr/g
Now that we expose struct file_attr as our uapi struct, rename the
internal struct to struct file_kattr to clearly communicate that it is a
kernel internal struct. This is similar to struct mount_{k}attr and
others.

Link: https://lore.kernel.org/20250703-restlaufzeit-baurecht-9ed44552b481@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-04 16:14:39 +02:00
Al Viro
05fb0e6664 new helper: set_default_d_op()
... to be used instead of manually assigning to ->s_d_op.
All in-tree filesystems converted (and the field itself is renamed,
so any out-of-tree ones in need of conversion will be caught
by the compiler).

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-10 22:21:16 -04:00
Andrew Price
9126d2754c gfs2: Don't clear sb->s_fs_info in gfs2_sys_fs_add
When gfs2_sys_fs_add() fails, it sets sb->s_fs_info to NULL on its error
path (see commit 0d515210b6 ("GFS2: Add kobject release method")).
The intention seems to be to prevent dereferencing sb->s_fs_info once
the object pointed to has been deallocated, but that would be better
achieved by setting the pointer to NULL in free_sbd().

As a consequence, when the call to gfs2_sys_fs_add() fails in
gfs2_fill_super(), sdp = GFS2_SB(inode) will evaluate to NULL in iput()
-> gfs2_drop_inode(), and accessing sdp->sd_flags will be a NULL pointer
dereference.

Fix that by only setting sb->s_fs_info to NULL when actually freeing the
object pointed to in free_sbd().

Fixes: ae9f3bd825 ("gfs2: replace sd_aspace with sd_inode")
Reported-by: syzbot+b12826218502df019f9d@syzkaller.appspotmail.com
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-30 19:20:20 +02:00
Linus Torvalds
8fdabcd9c0 Merge tag 'gfs2-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:

 - Fix the long-standing warnings in inode_to_wb() when CONFIG_LOCKDEP
   is enabled: gfs2 doesn't support cgroup writeback and so inode->i_wb
   will never change. This is the counterpart of commit 9e888998ea
   ("writeback: fix false warning in inode_to_wb()")

 - Fix a hang introduced by commit 8d391972ae ("gfs2: Remove
   __gfs2_writepage()"): prevent gfs2_logd from creating transactions
   for jdata pages while trying to flush the log

 - Fix a race between gfs2_create_inode() and gfs2_evict_inode() by
   deallocating partially created inodes on the gfs2_create_inode()
   error path

 - Fix a bug in the journal head lookup code that could cause mount to
   fail after successful recovery

 - Various smaller fixes and cleanups from various people

* tag 'gfs2-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (23 commits)
  gfs2: No more gfs2_find_jhead caching
  gfs2: Get rid of duplicate log head lookup
  gfs2: Simplify clean_journal
  gfs2: Simplify gfs2_log_pointers_init
  gfs2: Move gfs2_log_pointers_init
  gfs2: Minor comments fix
  gfs2: Don't start unnecessary transactions during log flush
  gfs2: Move gfs2_trans_add_databufs
  gfs2: Rename jdata_dirty_folio to gfs2_jdata_dirty_folio
  gfs2: avoid inefficient use of crc32_le_shift()
  gfs2: Do not call iomap_zero_range beyond eof
  gfs: don't check for AOP_WRITEPAGE_ACTIVATE in gfs2_write_jdata_batch
  gfs2: Fix usage of bio->bi_status in gfs2_end_log_write
  gfs2: deallocate inodes in gfs2_create_inode
  gfs2: Move GIF_ALLOC_FAILED check out of gfs2_ea_dealloc
  gfs2: Move gfs2_dinode_dealloc
  gfs2: Don't reread inodes unnecessarily
  gfs2: gfs2_create_inode error handling fix
  gfs2: Remove unnecessary NULL check before free_percpu()
  gfs2: check sb_min_blocksize return value
  ...
2025-05-26 12:35:08 -07:00
Linus Torvalds
6f59de9bc0 Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:

 - ublk updates:
      - Add support for updating the size of a ublk instance
      - Zero-copy improvements
      - Auto-registering of buffers for zero-copy
      - Series simplifying and improving GET_DATA and request lookup
      - Series adding quiesce support
      - Lots of selftests additions
      - Various cleanups

 - NVMe updates via Christoph:
      - add per-node DMA pools and use them for PRP/SGL allocations
        (Caleb Sander Mateos, Keith Busch)
      - nvme-fcloop refcounting fixes (Daniel Wagner)
      - support delayed removal of the multipath node and optionally
        support the multipath node for private namespaces (Nilay Shroff)
      - support shared CQs in the PCI endpoint target code (Wilfred
        Mallawa)
      - support admin-queue only authentication (Hannes Reinecke)
      - use the crc32c library instead of the crypto API (Eric Biggers)
      - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
        Reinecke, Leon Romanovsky, Gustavo A. R. Silva)

 - MD updates via Yu:
      - Fix that normal IO can be starved by sync IO, found by mkfs on
        newly created large raid5, with some clean up patches for bdev
        inflight counters

 - Clean up brd, getting rid of atomic kmaps and bvec poking

 - Add loop driver specifically for zoned IO testing

 - Eliminate blk-rq-qos calls with a static key, if not enabled

 - Improve hctx locking for when a plug has IO for multiple queues
   pending

 - Remove block layer bouncing support, which in turn means we can
   remove the per-node bounce stat as well

 - Improve blk-throttle support

 - Improve delay support for blk-throttle

 - Improve brd discard support

 - Unify IO scheduler switching. This should also fix a bunch of lockdep
   warnings we've been seeing, after enabling lockdep support for queue
   freezing/unfreezing

 - Add support for block write streams via FDP (flexible data placement)
   on NVMe

 - Add a bunch of block helpers, facilitating the removal of a bunch of
   duplicated boilerplate code

 - Remove obsolete BLK_MQ pci and virtio Kconfig options

 - Add atomic/untorn write support to blktrace

 - Various little cleanups and fixes

* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
  selftests: ublk: add test for UBLK_F_QUIESCE
  ublk: add feature UBLK_F_QUIESCE
  selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
  traceevent/block: Add REQ_ATOMIC flag to block trace events
  ublk: run auto buf unregisgering in same io_ring_ctx with registering
  io_uring: add helper io_uring_cmd_ctx_handle()
  ublk: remove io argument from ublk_auto_buf_reg_fallback()
  ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
  selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
  selftests: ublk: support UBLK_F_AUTO_BUF_REG
  ublk: support UBLK_AUTO_BUF_REG_FALLBACK
  ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
  ublk: prepare for supporting to register request buffer automatically
  ublk: convert to refcount_t
  selftests: ublk: make IO & device removal test more stressful
  nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
  nvme: introduce multipath_always_on module param
  nvme-multipath: introduce delayed removal of the multipath head node
  nvme-pci: derive and better document max segments limits
  nvme-pci: use struct_size for allocation struct nvme_dev
  ...
2025-05-26 11:39:36 -07:00
Linus Torvalds
8dd53535f1 Merge tag 'vfs-6.16-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs freezing updates from Christian Brauner:
 "This contains various filesystem freezing related work for this cycle:

   - Allow the power subsystem to support filesystem freeze for suspend
     and hibernate.

     Now all the pieces are in place to actually allow the power
     subsystem to freeze/thaw filesystems during suspend/resume.
     Filesystems are only frozen and thawed if the power subsystem does
     actually own the freeze.

     If the filesystem is already frozen by the time we've frozen all
     userspace processes we don't care to freeze it again. That's
     userspace's job once the process resumes. We only actually freeze
     filesystems if we absolutely have to and we ignore other failures
     to freeze.

     We could bubble up errors and fail suspend/resume if the error
     isn't EBUSY (aka it's already frozen) but I don't think that this
     is worth it. Filesystem freezing during suspend/resume is
     best-effort. If the user has 500 ext4 filesystems mounted and 4
     fail to freeze for whatever reason then we simply skip them.

     What we have now is already a big improvement and let's see how we
     fare with it before making our lives even harder (and uglier) than
     we have to.

   - Allow efivars to support freeze and thaw

     Allow efivarfs to take part in resyncing variable state during system
     hibernation and suspend. Add freeze/thaw support.

     This is a pretty straightforward implementation. We simply add
     regular freeze/thaw support for both userspace and the kernel.
     efivars is the first pseudofilesystem that adds support for
     filesystem freezing and thawing.

     The simplicity comes from the fact that we simply always resync
     variable state after efivarfs has been frozen. It doesn't matter
     whether that's because of suspend, userspace initiated freeze or
     hibernation. Efivars is simple enough that it doesn't matter that
     we walk all dentries. There are no directories and there aren't
     insane amounts of entries and both freeze/thaw are already
     heavy-handed operations. If userspace initiated a freeze/thaw cycle
     they would need CAP_SYS_ADMIN in the initial user namespace (as
     that's where efivarfs is mounted) so it can't be triggered by
     random userspace. IOW, we really really don't care"

* tag 'vfs-6.16-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  f2fs: fix freezing filesystem during resize
  kernfs: add warning about implementing freeze/thaw
  efivarfs: support freeze/thaw
  power: freeze filesystems during suspend/resume
  libfs: export find_next_child()
  super: add filesystem freezing helpers for suspend and hibernate
  gfs2: pass through holder from the VFS for freeze/thaw
  super: use common iterator (Part 2)
  super: use a common iterator (Part 1)
  super: skip dying superblocks early
  super: simplify user_get_super()
  super: remove pointless s_root checks
  fs: allow all writers to be frozen
  locking/percpu-rwsem: add freezable alternative to down_read
2025-05-26 09:33:44 -07:00
Andreas Gruenbacher
e320050eb7 gfs2: No more gfs2_find_jhead caching
We are no longer calling gfs2_find_jhead() on the same log twice, so
there is no more reason for keeping the log contents cached across those
calls.  In addition, log head lookup and log header writing didn't go
through the same address space and so the caching wasn't even fully
working, anyway.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22 09:12:27 +02:00
Andreas Gruenbacher
93bd5edbd6 gfs2: Get rid of duplicate log head lookup
Currently at mount time, the recovery code looks up the current log head
and, if necessary, replays the log and writes a recovery header to
indicate that the log is clean.  It does that for each log that may need
recovery.  We also know that our own log will always be checked as part
of that process.  Then, the mount code looks up the log head of our own
log again.

The double log head lookup can be costly, but more importantly, it is
unnecessary because we can trivially compute the position of the log
head after recovery; all we need to do for that is bump the position and
lh_sequence by one when writing a recovery header.
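
The "bump by one" computation can be pictured with a small sketch. The struct and function names below are hypothetical, and the wrap-around at the end of the journal is an assumption made for the example (journals are circular); this is not the code in gfs2_recover_func().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified view of a log head. */
struct sketch_log_head {
	uint32_t blkno;		/* block offset of the header within the journal */
	uint64_t sequence;	/* header sequence number */
};

/*
 * Given the log head found during recovery, compute where the log head
 * sits after one recovery header has been written: the position advances
 * by one block (wrapping at the end of the journal, an assumption of this
 * sketch) and the sequence number advances by one.
 */
static struct sketch_log_head
sketch_head_after_recovery(struct sketch_log_head found,
			   uint32_t journal_blocks)
{
	struct sketch_log_head next;

	assert(journal_blocks > 0 && found.blkno < journal_blocks);
	next.blkno = (found.blkno + 1 == journal_blocks) ? 0 : found.blkno + 1;
	next.sequence = found.sequence + 1;
	return next;
}
```

Because the result depends only on the head that recovery already looked up, no second scan of the journal is needed.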

With that in mind, move the call to gfs2_log_pointers_init() into
gfs2_recover_func() and get rid of the double lookup in
gfs2_make_fs_rw().

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22 09:12:27 +02:00
Andreas Gruenbacher
2ebb94ab93 gfs2: Simplify clean_journal
In function clean_journal(), update @head to point at the log header
that indicates successful recovery:  this is where logging needs to
resume.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22 09:12:27 +02:00
Andreas Gruenbacher
8a43d21876 gfs2: Simplify gfs2_log_pointers_init
Move the initialization of sdp->sd_log_sequence and
sdp->sd_log_flush_head inside gfs2_log_pointers_init().  Use
gfs2_replay_incr_blk().

Before this change, the log head lookup code in freeze_go_xmote_bh()
didn't update sdp->sd_log_flush_head.  This is now fixed, but the code
in freeze_go_xmote_bh() appears to be pretty useless in the first place:
on a frozen filesystem, the log head will not change.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22 09:12:27 +02:00
Andreas Gruenbacher
703a4af356 gfs2: Move gfs2_log_pointers_init
Move gfs2_log_pointers_init to recovery.c: there is no need for inlining
this function.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22 09:12:27 +02:00
Andreas Gruenbacher
91793971f3 gfs2: Minor comments fix
Commit 4082976009 ("gfs2: Convert gfs2_find_jhead() to use a folio")
replaced grab_cache_page() by filemap_grab_folio(), but the comments
were still referring to grab_cache_page().

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22 09:12:27 +02:00
Andreas Gruenbacher
5a90f8d499 gfs2: Don't start unnecessary transactions during log flush
Commit 8d391972ae ("gfs2: Remove __gfs2_writepage()") changed the log
flush code in gfs2_ail1_start_one() to call aops->writepages() instead
of aops->writepage().  For jdata inodes, this means that we will now try
to reserve log space and start a transaction before we can determine
that the pages in question have already been journaled.  When this
happens in the context of gfs2_logd(), it can now appear that not enough
log space is available for freeing up log space, and we will lock up.

Fix that by issuing journal writes directly instead of going through
aops->writepages() in the log flush code.

Fixes: 8d391972ae ("gfs2: Remove __gfs2_writepage()")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22 09:12:27 +02:00
Andreas Gruenbacher
d50a64e3c5 gfs2: Move gfs2_trans_add_databufs
Move gfs2_trans_add_databufs() to trans.c.  Pass in a glock instead of
a gfs2_inode.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22 09:12:27 +02:00
Andreas Gruenbacher
2f022736ee gfs2: Rename jdata_dirty_folio to gfs2_jdata_dirty_folio
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22 09:12:27 +02:00
Eric Biggers
b6ccde39b1 gfs2: avoid inefficient use of crc32_le_shift()
__get_log_header() was using crc32_le_shift() to update a CRC with four
zero bytes.  However, this is about 5x slower than just CRC'ing four
zero bytes in the normal way.  Just do that instead.

(We could instead make crc32_le_shift() faster on short lengths.  But
all its callers do just fine without it, so I'd like to just remove it.)
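
The "normal way" mentioned above can be sketched in userspace as follows.
This is a minimal bitwise model, not the kernel's crc32_le(); the names
crc32_le_update() and crc32_le_append_zero_word() are hypothetical and
exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise little-endian CRC-32 update (reflected polynomial 0xEDB88320),
 * without any pre- or post-inversion.  Hypothetical helper, written for
 * clarity rather than speed.
 */
static uint32_t crc32_le_update(uint32_t crc, const unsigned char *p, size_t len)
{
	assert(p != NULL || len == 0);

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
	}
	return crc;
}

/* Extend a CRC as if four zero bytes followed the data already hashed. */
static uint32_t crc32_le_append_zero_word(uint32_t crc)
{
	static const unsigned char zeroes[4] = { 0 };

	return crc32_le_update(crc, zeroes, sizeof(zeroes));
}
```

Feeding the four zero bytes through the ordinary update routine gives the
same value as hashing a buffer that already ends in those four bytes, so
no dedicated shift helper is needed for such a short length.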

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22 09:12:27 +02:00
Andreas Gruenbacher
87faee382d gfs2: Do not call iomap_zero_range beyond eof
Since commit eb65540aa9 ("iomap: warn on zero range of a post-eof
folio"), iomap_zero_range() warns when asked to zero a folio beyond eof.
The warning triggers on the following code path:

  gfs2_fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE)
    __gfs2_punch_hole()
      gfs2_block_zero_range()
        iomap_zero_range()

In __gfs2_punch_hole(), gfs2 zeroes out partial folios at the beginning
and at the end of the specified range, whether those folios are beyond
eof or not.  This may add folios to the page cache which are entirely
beyond eof, which isn't of any use.  Avoid that by truncating the range
to zero out at eof.
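
The truncation described above amounts to clamping a byte range against
the file size.  A minimal sketch, assuming a hypothetical helper
clamp_to_eof() and a plain struct byte_range (neither is taken from the
gfs2 sources):

```c
#include <assert.h>
#include <stdint.h>

struct byte_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Clamp [start, start + len) so that it does not extend past i_size.
 * A range that begins at or beyond eof collapses to length zero, which
 * means there is nothing to zero out.
 */
static struct byte_range clamp_to_eof(uint64_t start, uint64_t len, uint64_t i_size)
{
	struct byte_range r = { start, 0 };

	if (start >= i_size)
		return r;

	r.len = len < i_size - start ? len : i_size - start;
	assert(r.start + r.len <= i_size);
	return r;
}
```

With i_size = 120, a request for 50 bytes at offset 100 is cut to 20
bytes, and a request at offset 200 becomes empty.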

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22 09:12:26 +02:00
Christoph Hellwig
e9a4af22af gfs: don't check for AOP_WRITEPAGE_ACTIVATE in gfs2_write_jdata_batch
__gfs2_jdata_write_folio can't return AOP_WRITEPAGE_ACTIVATE, so don't
check for it in gfs2_write_jdata_batch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22 09:12:26 +02:00
Christian Brauner
1af3331764 super: add filesystem freezing helpers for suspend and hibernate
Allow the power subsystem to support filesystem freeze for
suspend and hibernate.

For some kernel subsystems it is paramount that they are guaranteed that
they are the owner of the freeze to avoid any risk of deadlocks. This is
the case for the power subsystem. Enable it to recognize whether it did
actually freeze the filesystem.

If userspace has 10 filesystems and suspend/hibernate manages to freeze 5
and then fails on the 6th for whatever odd reason (current or future),
then power needs to undo the freeze of the first 5 filesystems. It can't
just walk the list again: while it's unlikely that a new filesystem got
added in the meantime, it still cannot tell on which filesystems the
power subsystem actually obtained a freeze reference count that needs
to be dropped during thaw.

There are various ways out of this ugliness. For example, record the
filesystems the power subsystem managed to freeze on a temporary list in
the callbacks and then walk that list backwards during thaw to undo the
freezing, or make sure that the power subsystem exclusively freezes
things it can freeze and mark such filesystems as being owned by power
for the duration of the suspend or resume cycle. I opted for the latter
as that seemed the clean thing to do even if it means more code changes.

If hibernation races with filesystem freezing (e.g. DM reconfiguration),
then hibernation need not freeze a filesystem because it's already
frozen but userspace may thaw the filesystem before hibernation actually
happens.

If the race happens the other way around, DM reconfiguration may
unexpectedly fail with EBUSY.

So allow FREEZE_EXCL to nest with other holders. An exclusive freezer
cannot be undone by any of the other concurrent freezers.
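
The ownership idea can be modelled in userspace roughly as below.  This is
only a sketch under stated assumptions: struct fs_model, its fields, and
the power_freeze_all()/power_thaw_all() helpers are hypothetical and do
not mirror the VFS data structures or the FREEZE_EXCL implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a superblock's freeze state. */
struct fs_model {
	bool freezable;       /* assumption: freezing fails when false */
	int  freeze_count;    /* number of active freeze holders */
	bool frozen_by_power; /* set only when the power path took a freeze */
};

/* Drop only the freezes that the power path itself acquired. */
static void power_thaw_all(struct fs_model *fs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!fs[i].frozen_by_power)
			continue;
		assert(fs[i].freeze_count > 0);
		fs[i].frozen_by_power = false;
		fs[i].freeze_count--;
	}
}

/* Freeze every filesystem; on failure, undo exactly what was taken. */
static bool power_freeze_all(struct fs_model *fs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!fs[i].freezable) {
			power_thaw_all(fs, n);
			return false;
		}
		fs[i].freeze_count++;   /* nests with any existing holder */
		fs[i].frozen_by_power = true;
	}
	return true;
}
```

Because the owner mark is per filesystem, a failed suspend leaves a freeze
taken earlier by another holder untouched, and no temporary list is needed
to know what to undo.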

Link: https://lore.kernel.org/r/20250329-work-freeze-v2-6-a47af37ecc3d@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 12:41:02 +02:00
Christoph Hellwig
65f8e62593 gfs2: use bdev_rw_virt in gfs2_read_super
Switch gfs2_read_super to allocate the superblock buffer using kmalloc,
which falls back to the page allocator for PAGE_SIZE allocations but
gives us a kernel virtual address, and then use bdev_rw_virt to perform
the synchronous read into it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250507120451.4000627-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 07:31:07 -06:00
Andrew Price
0a828c3ab0 gfs2: Fix usage of bio->bi_status in gfs2_end_log_write
bio->bi_status is an index into the blk_errors array, not an errno. Its
__bitwise tag is cast away here, resulting in a sparse warning:

  fs/gfs2/lops.c:207:22: warning: cast from restricted blk_status_t

We could either add __force to the cast and continue logging bi_status
in the error message, or we could look up the errno in the array and log
that. As sdp->sd_log_error is used as an errno in all other cases, look
up the errno here for consistency.
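
The difference between a status index and an errno can be illustrated with
a small userspace model.  The type status_t, the table contents, and
status_to_errno() are assumptions made for this sketch; in the kernel the
lookup is done by blk_status_to_errno().

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for blk_status_t: a small index, not an errno. */
typedef unsigned char status_t;

/* Illustrative table only; the real blk_errors array has more entries. */
static const int status_errno[] = {
	0,           /* success */
	-EOPNOTSUPP, /* operation not supported */
	-ETIMEDOUT,  /* timeout */
	-ENOSPC,     /* out of space */
};

/* Translate a status index into a negative errno value. */
static int status_to_errno(status_t status)
{
	size_t n = sizeof(status_errno) / sizeof(status_errno[0]);

	if (status >= n)
		return -EIO; /* unknown status: report a generic I/O error */
	return status_errno[status];
}
```

Logging the translated value keeps the stored error comparable with the
errno values used elsewhere, whereas logging the raw index would print a
small number with an unrelated meaning.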

Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-24 23:12:15 +02:00
Andreas Gruenbacher
2c63986dd3 gfs2: deallocate inodes in gfs2_create_inode
When creating and destroying inodes, we are relying on the inode hash
table to make sure that for a given inode number, only a single inode
will exist.  We then link that inode to its inode and iopen glock and
let those glocks point back at the inode.  However, when iget_failed()
is called, the inode is removed from the inode hash table before
gfs2_evict_inode() is called, and uniqueness is no longer guaranteed.

Commit f1046a472b70 ("gfs2: gl_object races fix") was trying to work
around that problem by detaching the inode glock from the inode before
calling iget_failed(), but that broke the inode deallocation code in
gfs2_evict_inode().

To fix that, deallocate partially created inodes in gfs2_create_inode()
instead of relying on gfs2_evict_inode() to do that.

This means that gfs2_evict_inode() and its helper functions will no
longer see partially created inodes, and so some simplifications are
possible there.

Fixes: 9ffa18884c ("gfs2: gl_object races fix")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-24 23:10:05 +02:00
Andreas Gruenbacher
0cc617a54d gfs2: Move GIF_ALLOC_FAILED check out of gfs2_ea_dealloc
Don't check for the GIF_ALLOC_FAILED flag in gfs2_ea_dealloc() and pass
that information explicitly instead.  This allows for a cleaner
follow-up patch.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-21 18:20:36 +02:00
Andreas Gruenbacher
bcd18105fb gfs2: Move gfs2_dinode_dealloc
Move gfs2_dinode_dealloc() and its helper gfs2_final_release_pages()
from super.c to inode.c.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-21 18:20:36 +02:00
Andreas Gruenbacher
84a79ee68f gfs2: Don't reread inodes unnecessarily
In gfs2_create_inode(), we initialize the inode from scratch and then we
write the result to disk.  Clear the GLF_INSTANTIATE_NEEDED glock flag
to indicate that the inode is up to date.  Otherwise, the next time the
inode glock is acquired, gfs2_instantiate() would reread the inode from
disk, which isn't necessary.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-21 18:20:36 +02:00
Andreas Gruenbacher
af4044fd0b gfs2: gfs2_create_inode error handling fix
When gfs2_create_inode() finds a directory, make sure to return -EISDIR.

Fixes: 571a4b5797 ("GFS2: bugger off early if O_CREAT open finds a directory")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-21 18:20:36 +02:00
Chen Ni
4023c3cbc3 gfs2: Remove unnecessary NULL check before free_percpu()
free_percpu() checks for NULL pointers internally.
Remove unneeded NULL check here.

Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-21 18:20:36 +02:00
Edward Adam Davis
27d2f101e7 gfs2: check sb_min_blocksize return value
Check the return value of sb_min_blocksize(): it will be 0 when the
requested block size is invalid.

In addition, check the return value of sb_set_blocksize() as well.

Reported-by: syzbot+b0018b7468b2af33b4d5@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-21 18:20:36 +02:00
Andreas Gruenbacher
ae9f3bd825 gfs2: replace sd_aspace with sd_inode
Currently, sdp->sd_aspace and the per-inode metadata address spaces use
sb->s_bdev->bd_mapping->host as their ->host; folios in those address
spaces will thus appear to be on bdev rather than on gfs2 filesystems.
This is a problem because gfs2 doesn't support cgroup writeback
(SB_I_CGROUPWB), but bdev does.

Fix that by using a "dummy" gfs2 inode as ->host in those address
spaces.  When coming from a folio, folio->mapping->host->i_sb will then
be a gfs2 super block and the SB_I_CGROUPWB flag will not be set in
sb->s_iflags.

Based on a previous version from Bob Peterson from several years ago.
Thanks to Tetsuo Handa, Jan Kara, and Rafael Aquini for helping figure
this out.

Fixes: aaa2cacf81 ("writeback: add lockdep annotation to inode_to_wb()")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-21 18:20:36 +02:00
Alexander Aring
ff22e5da42 gfs2: only apply DLM_LKF_VALBLK if sb_lvbptr is not NULL
Currently, gfs2 always sets the DLM_LKF_VALBLK flag to enable lvb
handling even when sb_lvbptr is NULL.  This currently causes no problems
because DLM ignores the DLM_LKF_VALBLK flag when sb_lvbptr is NULL, but
it does violate the DLM API.  Fix that by only setting DLM_LKF_VALBLK
when sb_lvbptr is not NULL.
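
The conditional flag setting can be sketched as below.  LKF_VALBLK is a
placeholder for DLM_LKF_VALBLK with an illustrative value, and
make_lock_flags() is a hypothetical helper, not a function from gfs2 or
the DLM API.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder for DLM_LKF_VALBLK; the value here is illustrative only. */
#define LKF_VALBLK 0x8u

/*
 * Request lock value block handling only when a buffer for it exists.
 * lvbptr models the sb_lvbptr field of the lock status block.
 */
static uint32_t make_lock_flags(uint32_t base_flags, const char *lvbptr)
{
	if (lvbptr != NULL)
		base_flags |= LKF_VALBLK;
	return base_flags;
}
```

Called with a NULL lvbptr, the helper returns the base flags unchanged, so
the caller never asks for value block handling without supplying a buffer.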

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-21 18:20:36 +02:00
Alexander Aring
ac5ee087d3 gfs2: move msleep to sleepable context
This patch moves the msleep_interruptible() out of the non-sleepable
context by moving the ls->ls_recover_spin spinlock around so
msleep_interruptible() will be called in a sleepable context.

Cc: stable@vger.kernel.org
Fixes: 4a7727725d ("GFS2: Fix recovery issues for spectators")
Suggested-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-21 18:20:36 +02:00
Christian Brauner
62a2175ddf gfs2: pass through holder from the VFS for freeze/thaw
The filesystem's freeze/thaw functions can be called from contexts where
the holder isn't userspace but the kernel, e.g., during systemd
suspend/hibernate. So pass through the freeze/thaw flags from the VFS
instead of hard-coding them.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 09:37:17 +02:00
Eric Biggers
b261d22220 lib/crc: remove CONFIG_LIBCRC32C
Now that LIBCRC32C does nothing besides select CRC32, make every option
that selects LIBCRC32C instead select CRC32 directly.  Then remove
LIBCRC32C.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250401221600.24878-8-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-04-04 11:31:42 -07:00
Linus Torvalds
ef479de65a Merge tag 'gfs2-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:

 - Fix two bugs related to locking request cancelation (locking request
   being retried instead of canceled; canceling the wrong locking
   request)

 - Prevent a race between inode creation and deferred delete analogous
   to commit ffd1cf0443 from 6.13. This now allows further
   simplification of gfs2_evict_inode() without introducing mysterious
   problems

 - When an inode delete should be verified / retried "later" but that
   isn't possible, skip the delete instead of carrying it out
   immediately. This broke in 6.13

 - More folio conversions from Matthew Wilcox (plus a fix from Dan
   Carpenter)

 - Various minor fixes and cleanups

* tag 'gfs2-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (22 commits)
  gfs2: some comment clarifications
  gfs2: Fix a NULL vs IS_ERR() bug in gfs2_find_jhead()
  gfs2: Convert gfs2_meta_read_endio() to use a folio
  gfs2: Convert gfs2_end_log_write_bh() to work on a folio
  gfs2: Convert gfs2_find_jhead() to use a folio
  gfs2: Convert gfs2_jhead_pg_srch() to gfs2_jhead_folio_search()
  gfs2: Use b_folio in gfs2_check_magic()
  gfs2: Use b_folio in gfs2_submit_bhs()
  gfs2: Use b_folio in gfs2_trans_add_meta()
  gfs2: Use b_folio in gfs2_log_write_bh()
  gfs2: skip if we cannot defer delete
  gfs2: remove redundant warnings
  gfs2: minor evict fix
  gfs2: Prevent inode creation race (2)
  gfs2: Fix additional unlikely request cancelation race
  gfs2: Fix request cancelation bug
  gfs2: Check for empty queue in run_queue
  gfs2: Remove more dead code in add_to_queue
  gfs2: Replace GIF_DEFER_DELETE with GLF_DEFER_DELETE
  gfs2: glock holder GL_NOPID fix
  ...
2025-03-27 12:09:25 -07:00