3143 Commits

Author SHA1 Message Date
Linus Torvalds
afcbce74f3 Merge tag 'gfs2-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:

 - Major withdraw / error handling overhaul based on dlm's new
   DLM_RELEASE_RECOVER feature: this allows gfs2 to treat withdraws like
   node failures. Make withdraws asynchronous

 - Fix a bug in commit e4a8b5481c that caused 'df' to remain out of
   sync. ('df' is still allowed to go slightly out of sync for short
   periods of time)

 - Prevent recursive memory reclaim in gfs2_unstuff_dinode()

 - Clean up SDF_JOURNAL_LIVE flag handling

 - Fix remote evict for read-only filesystems

 - Fix a misuse of bio_chain()

 - Various other minor cleanups

* tag 'gfs2-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (35 commits)
  gfs2: Fix use of bio_chain
  gfs2: Clean up SDF_JOURNAL_LIVE flag handling
  gfs2: No longer thaw filesystems during a withdraw
  gfs2: Withdraw immediately in gfs2_trans_add_meta
  gfs2: New gfs2_withdraw_helper
  gfs2: Clean up properly during a withdraw
  gfs2: Rename gfs2_{gl_dq_holders => withdraw_glocks}
  Revert "gfs2: fix infinite loop when checking ail item count before go_inval"
  Revert "gfs2: Allow some glocks to be used during withdraw"
  Revert "gfs2: Check for log write errors before telling dlm to unlock"
  Revert "gfs2: fix a deadlock on withdraw-during-mount"
  Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (6/6)
  Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (5/6)
  Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (4/6)
  Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (3/6)
  Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (2/6)
  Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (1/6)
  Revert "gfs2: don't stop reads while withdraw in progress"
  gfs2: Rename LM_FLAG_{NOEXP -> RECOVER}
  gfs2: Kill gfs2_io_error_bh_wd
  ...
2025-12-03 20:28:50 -08:00
Andreas Gruenbacher
8a157e0a0a gfs2: Fix use of bio_chain
In gfs2_chain_bio(), the call to bio_chain() has its arguments swapped.
The result is leaked bios and incorrect synchronization (only the last
bio will actually be waited for).  This code is only used during mount
and filesystem thaw, so the bug normally won't be noticeable.
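
To see why the argument order matters, here is a minimal userspace sketch of the completion bookkeeping behind chaining. The toy_bio type and its helpers are hypothetical stand-ins, not the block layer's implementation; only the argument order (bio first, parent second) mirrors bio_chain().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a bio: only the completion bookkeeping is modelled. */
struct toy_bio {
	int remaining;            /* completions still outstanding, starts at 1 */
	struct toy_bio *parent;   /* bio whose completion depends on this one */
	bool done;
};

static void toy_bio_init(struct toy_bio *b)
{
	b->remaining = 1;
	b->parent = NULL;
	b->done = false;
}

/* Same argument order as bio_chain(bio, parent): parent waits for bio. */
static void toy_bio_chain(struct toy_bio *bio, struct toy_bio *parent)
{
	bio->parent = parent;
	parent->remaining++;
}

/* Signal one completion; propagate to the parent once fully done. */
static void toy_bio_endio(struct toy_bio *b)
{
	if (--b->remaining > 0)
		return;
	b->done = true;
	if (b->parent)
		toy_bio_endio(b->parent);
}
```

In this model, waiting on the parent covers every bio chained to it. With the arguments swapped, the dependency points the wrong way, so a waiter on the final bio no longer covers the earlier ones.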

Reported-by: Stephen Zhang <starzhangzsd@gmail.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-12-02 16:44:54 +00:00
Linus Torvalds
f2e74ecfba Merge tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull folio updates from Christian Brauner:
 "Add a new folio_next_pos() helper function that returns the file
  position of the first byte after the current folio. This is a common
  operation in filesystems when needing to know the end of the current
  folio.

  The helper is lifted from btrfs which already had its own version, and
  is now used across multiple filesystems and subsystems:
   - btrfs
   - buffer
   - ext4
   - f2fs
   - gfs2
   - iomap
   - netfs
   - xfs
   - mm

  This fixes a long-standing bug in ocfs2 on 32-bit systems with files
  larger than 2GiB. Presumably this is not a common configuration, but
  the fix is backported anyway. The other filesystems did not have bugs,
  they were just mildly inefficient.

  This also introduces uoff_t as the unsigned version of loff_t. A recent
  commit inadvertently changed a comparison from being unsigned (on
  64-bit systems) to being signed (which it had always been on 32-bit
  systems), leading to sporadic fstests failures.

  Generally file sizes are restricted to being a signed integer, but in
  places where -1 is passed to indicate "up to the end of the file", it
  is convenient to have an unsigned type to ensure comparisons are
  always unsigned regardless of architecture"
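
The signed versus unsigned distinction can be illustrated with a small userspace sketch. The sketch_loff_t and sketch_uoff_t typedefs are stand-ins chosen for this example, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

typedef long long sketch_loff_t;            /* stand-in for loff_t */
typedef unsigned long long sketch_uoff_t;   /* stand-in for uoff_t */

/* Signed comparison: an end of -1 sorts before every valid position. */
static bool before_end_signed(sketch_loff_t pos, sketch_loff_t end)
{
	return pos < end;
}

/* Unsigned comparison: -1 converts to the largest value, i.e. "up to EOF". */
static bool before_end_unsigned(sketch_loff_t pos, sketch_loff_t end)
{
	return (sketch_uoff_t)pos < (sketch_uoff_t)end;
}
```

With an end of -1, the signed form rejects every position while the unsigned form accepts all of them, which is the behaviour callers passing -1 expect.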

* tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: Add uoff_t
  mm: Use folio_next_pos()
  xfs: Use folio_next_pos()
  netfs: Use folio_next_pos()
  iomap: Use folio_next_pos()
  gfs2: Use folio_next_pos()
  f2fs: Use folio_next_pos()
  ext4: Use folio_next_pos()
  buffer: Use folio_next_pos()
  btrfs: Use folio_next_pos()
  filemap: Add folio_next_pos()
2025-12-01 10:26:38 -08:00
Linus Torvalds
ebaeabfa5a Merge tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull writeback updates from Christian Brauner:
 "Features:

   - Allow file systems to increase the minimum writeback chunk size.

     The relatively low minimal writeback size of 4MiB means that
     written back inodes on rotational media are switched a lot. Besides
     introducing additional seeks, this also can lead to extreme file
     fragmentation on zoned devices when a lot of files are cached
     relative to the available writeback bandwidth.

     This adds a superblock field that allows the file system to
     override the default size, and sets it to the zone size for zoned
     XFS.

   - Add logging for slow writeback when it exceeds
     sysctl_hung_task_timeout_secs. This helps identify tasks waiting
     for a long time and pinpoint potential issues. Recording the
     starting jiffies is also useful when debugging a crashed vmcore.

   - Wake up waiting tasks when finishing the writeback of a chunk

  Cleanups:

   - filemap_* writeback interface cleanups.

     Adding filemap_fdatawrite_wbc ended up being a mistake, as all but
     the original btrfs caller should be using better high level
     interfaces instead.

     This series removes all these low-level interfaces, switches btrfs
     to a more specific interface, and cleans up other too low-level
     interfaces. With this the writeback_control that is passed to the
     writeback code is only initialized in three places.

   - Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and
     filemap_fdatawrite_wbc

   - Add filemap_flush_nr helper for btrfs

   - Push struct writeback_control into start_delalloc_inodes in btrfs

   - Rename filemap_fdatawrite_range_kick to filemap_flush_range

   - Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm

   - Make wbc_to_tag() inline and use it in fs"
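
As a rough illustration of the minimum chunk size override, the sketch below models a per-superblock minimum. The struct, the field name, and the "0 means use the default" convention are assumptions made for this example; the kernel's actual field handling may differ.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096UL   /* assumed 4 KiB pages */
/* 4 MiB default minimum, expressed in pages */
#define SKETCH_DEFAULT_MIN_PAGES ((4UL << 20) / SKETCH_PAGE_SIZE)

/* Hypothetical superblock stand-in; 0 means "use the default". */
struct sketch_sb {
	unsigned long min_writeback_pages;
};

/* Return the minimum writeback chunk size in pages for this filesystem. */
static unsigned long sketch_min_chunk(const struct sketch_sb *sb)
{
	if (sb->min_writeback_pages)
		return sb->min_writeback_pages;
	return SKETCH_DEFAULT_MIN_PAGES;
}
```

A zoned filesystem with 256 MiB zones would set the field to 65536 pages here, so each inode gets a full zone's worth of writeback before the flusher switches to another inode.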

* tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: Make wbc_to_tag() inline and use it in fs.
  xfs: set s_min_writeback_pages for zoned file systems
  writeback: allow the file system to override MIN_WRITEBACK_PAGES
  writeback: cleanup writeback_chunk_size
  mm: rename filemap_fdatawrite_range_kick to filemap_flush_range
  mm: remove __filemap_fdatawrite_range
  mm: remove filemap_fdatawrite_wbc
  mm: remove __filemap_fdatawrite
  mm,btrfs: add a filemap_flush_nr helper
  btrfs: push struct writeback_control into start_delalloc_inodes
  btrfs: use the local tmp_inode variable in start_delalloc_inodes
  ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers
  9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close
  mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode
  writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs)
  writeback: Wake up waiting tasks when finishing the writeback of a chunk.
2025-12-01 09:20:51 -08:00
Linus Torvalds
9368f0f941 Merge tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode updates from Christian Brauner:
 "Features:

   - Hide inode->i_state behind accessors. Open-coded accesses prevent
     asserting they are done correctly. One obvious aspect is locking,
     but significantly more can be checked. For example it can be
     detected when the code is clearing flags which are already missing,
     or is setting flags when it is illegal (e.g., I_FREEING when
     ->i_count > 0)

   - Provide accessors for ->i_state, convert all filesystems using
     coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2,
     overlayfs, nilfs2, xfs), and make plain ->i_state access fail to
     compile

   - Rework I_NEW handling to operate without fences, simplifying the
     code after the accessor infrastructure is in place

  Cleanups:

   - Move wait_on_inode() from writeback.h to fs.h

   - Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
     for clarity

   - Cosmetic fixes to LRU handling

   - Push list presence check into inode_io_list_del()

   - Touch up predicts in __d_lookup_rcu()

   - ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage

   - Assert on ->i_count in iput_final()

   - Assert ->i_lock held in __iget()

  Fixes:

   - Add missing fences to I_NEW handling"
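
The kind of checking that accessors make possible can be sketched in userspace as follows. All names, flag values, and the boolean lock model are hypothetical; this is not the kernel's accessor API.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_I_NEW      (1u << 0)   /* hypothetical flag values */
#define SKETCH_I_FREEING  (1u << 1)

struct sketch_inode {
	unsigned int state;   /* stands in for ->i_state */
	unsigned int count;   /* stands in for ->i_count */
	bool locked;          /* stands in for "->i_lock is held" */
};

static unsigned int sketch_state_read(const struct sketch_inode *inode)
{
	assert(inode->locked);
	return inode->state;
}

static void sketch_state_set(struct sketch_inode *inode, unsigned int flags)
{
	assert(inode->locked);
	/* Setting I_FREEING while references remain would be a bug. */
	assert(!(flags & SKETCH_I_FREEING) || inode->count == 0);
	inode->state |= flags;
}

static void sketch_state_clear(struct sketch_inode *inode, unsigned int flags)
{
	assert(inode->locked);
	/* Clearing a flag that is not set usually indicates a logic error. */
	assert((inode->state & flags) == flags);
	inode->state &= ~flags;
}
```

Because every access goes through one of these functions, a locking or flag-consistency rule only has to be asserted in one place instead of at each open-coded access.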

* tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
  dcache: touch up predicts in __d_lookup_rcu()
  fs: push list presence check into inode_io_list_del()
  fs: cosmetic fixes to lru handling
  fs: rework I_NEW handling to operate without fences
  fs: make plain ->i_state access fail to compile
  xfs: use the new ->i_state accessors
  nilfs2: use the new ->i_state accessors
  overlayfs: use the new ->i_state accessors
  gfs2: use the new ->i_state accessors
  f2fs: use the new ->i_state accessors
  smb: use the new ->i_state accessors
  ceph: use the new ->i_state accessors
  btrfs: use the new ->i_state accessors
  Manual conversion to use ->i_state accessors of all places not covered by coccinelle
  Coccinelle-based conversion to use ->i_state accessors
  fs: provide accessors for ->i_state
  fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
  fs: move wait_on_inode() from writeback.h to fs.h
  fs: add missing fences to I_NEW handling
  ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
  ...
2025-12-01 09:02:34 -08:00
Andreas Gruenbacher
83348905e4 gfs2: Clean up SDF_JOURNAL_LIVE flag handling
Change do_withdraw() to clear the SDF_JOURNAL_LIVE flag under the log
flush lock.  In addition, change __gfs2_trans_begin() to check if the
filesystem is already known to be withdrawn using gfs2_withdrawn().
Then, once we are holding the log flush lock, check if the
SDF_JOURNAL_LIVE flag is still set.  This second check ensures that the
filesystem will remain live until the transaction is submitted.

With these changes, it is no longer useful to clear SDF_JOURNAL_LIVE in
gfs2_end_log_write() after calling gfs2_withdraw().

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:28 +00:00
Andreas Gruenbacher
16c3197984 gfs2: No longer thaw filesystems during a withdraw
Previously, when a withdraw occurred, we would wait for another node to
recover our journal.  This also meant that frozen filesystems needed to
be thawed because otherwise, other nodes wouldn't be able to recover the
filesystem.  With the reversal of commit 601ef0d52e ("gfs2: Force
withdraw to replay journals and wait for it to finish"), we are no
longer waiting for journal recovery during a withdraw, so we no longer
need to thaw frozen filesystems, either.  This also fixes a potential
deadlock reported by lockdep when running xfstest generic/108.

In addition, there is nothing left in do_withdraw() that would require
taking sd_freeze_mutex, so don't bother taking that lock there anymore.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:28 +00:00
Andreas Gruenbacher
3a88edc165 gfs2: Withdraw immediately in gfs2_trans_add_meta
We can now withdraw while the log is locked.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:28 +00:00
Andreas Gruenbacher
bbbf1529ea gfs2: New gfs2_withdraw_helper
Currently, when a gfs2 filesystem is withdrawn, an "offline" uevent is
triggered that invokes gfs2-utils' gfs2_withdraw_helper script.  The
purpose of this script is to deactivate the filesystem's block device so
that it can be withdrawn immediately, even before all the filesystem's
caches have been discarded.  The script provided by gfs2-utils never did
anything useful, and there was no way for it to report back its status
to the kernel.

To fix that, extend the gfs2_withdraw_helper mechanism so that the
script can report one of the following results by writing the
corresponding value into "/sys$DEVPATH/lock_module/withdraw":

 0 - The shared block device has been marked inactive.  Future write
     operations will fail.

 1 - The shared block device may still be active and carry out
     write operations.

If the "offline" uevent isn't reacted upon within the timeout configured
in /sys$DEVPATH/tune/withdraw_helper_timeout (default 5 seconds), the
event handler is assumed to have failed.
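
A helper could report its result with something like the sketch below. The function name is hypothetical, the caller is expected to build the attribute path from $DEVPATH, and it is an assumption that the attribute accepts a decimal digit followed by a newline, as most sysfs attributes do.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write a withdraw status to the given attribute path:
 * 0 = device marked inactive, 1 = device may still be active.
 * Returns 0 on success, -1 on failure.
 */
static int report_withdraw_status(const char *attr_path, int status)
{
	FILE *f;
	int ok;

	if (status != 0 && status != 1)
		return -1;
	f = fopen(attr_path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", status) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

The write has to land within the configured timeout; a helper that exits without writing anything is treated the same as one that failed.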

In addition, add a new "errors=deactivate" mount option.

With these changes, if fatal errors are detected on a gfs2 filesystem
and the filesystem is mounted with the "errors=panic" option, the kernel
will panic immediately.  Otherwise, an attempt will be made to
deactivate the underlying block device.  If successful, the kernel will
release all cluster-wide locks immediately so that the rest of the
cluster can continue.  If unsuccessful, the kernel will either panic
("errors=deactivate"), or it will purge all filesystem I/O before
releasing all cluster-wide locks ("errors=withdraw").
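
The behaviour described above can be summarized as a small decision function. This is a userspace sketch of the described policy only; the enum and function names are invented for illustration and do not appear in gfs2.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_errors_mode {
	SKETCH_ERRORS_WITHDRAW,
	SKETCH_ERRORS_DEACTIVATE,
	SKETCH_ERRORS_PANIC,
};

enum sketch_action {
	SKETCH_ACTION_PANIC,               /* panic the kernel */
	SKETCH_ACTION_RELEASE_LOCKS,       /* release cluster-wide locks now */
	SKETCH_ACTION_PURGE_THEN_RELEASE,  /* purge filesystem I/O first */
};

/* device_deactivated is the outcome of the deactivation attempt. */
static enum sketch_action fatal_error_action(enum sketch_errors_mode mode,
					     bool device_deactivated)
{
	if (mode == SKETCH_ERRORS_PANIC)
		return SKETCH_ACTION_PANIC;   /* no deactivation attempt */
	if (device_deactivated)
		return SKETCH_ACTION_RELEASE_LOCKS;
	if (mode == SKETCH_ERRORS_DEACTIVATE)
		return SKETCH_ACTION_PANIC;
	return SKETCH_ACTION_PURGE_THEN_RELEASE;
}
```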

Note that the gfs2_withdraw_helper script still needs to be fixed to
take advantage of these improvements.  It could be changed to use a
mechanism like LVM Persistent Reservations.  "dmsetup suspend" is not a
suitable mechanism as it infinitely postpones I/O operations, which may
prevent withdraw from completing.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:27 +00:00
Andreas Gruenbacher
0e10da69d1 gfs2: Clean up properly during a withdraw
During a withdraw, we don't want to write out any more data than we have
to, so in do_xmote(), skip the ->go_sync() glock operation.  We still
want to keep calling ->go_inval() to discard any cached data or
metadata, whether clean or dirty.

We do still allow glocks to transition into state LM_ST_UNLOCKED.  This
has the desired side effect of calling ->go_inval() and invalidating the
glock caches.

Function gfs2_withdraw_glocks() is already used for dequeuing any
left-over waiters.  We still want that to happen, but additionally, we
want all glocks to be unlocked.

Finally, we change function do_promote() to refuse any further
promotions.

This commit cleans up the leftovers of commit 86934198ee ("gfs2: Clear
flags when withdraw prevents xmote").

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:27 +00:00
Andreas Gruenbacher
473678ccb9 gfs2: Rename gfs2_{gl_dq_holders => withdraw_glocks}
Rename function gfs2_gl_dq_holders() to gfs2_withdraw_glocks().  This
function will soon be used for more than just dequeuing holders.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:27 +00:00
Andreas Gruenbacher
655531c95b Revert "gfs2: fix infinite loop when checking ail item count before go_inval"
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly.  That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery.  We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.

Reverts commit 33dbd1e41a ("gfs2: fix infinite loop when checking ail
item count before go_inval").

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:27 +00:00
Andreas Gruenbacher
af572efef1 Revert "gfs2: Allow some glocks to be used during withdraw"
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly.  That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery.  We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.

Reverts commit a72d2401f5 ("gfs2: Allow some glocks to be used during
withdraw").

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:27 +00:00
Andreas Gruenbacher
41ad1f7c8b Revert "gfs2: Check for log write errors before telling dlm to unlock"
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly.  That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery.  We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.

Reverts the rest of d93ae386ef ("gfs2: Check for log write errors
before telling dlm to unlock").

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:27 +00:00
Andreas Gruenbacher
6bb7c1bf5a Revert "gfs2: fix a deadlock on withdraw-during-mount"
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly.  That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery.  We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.

Reverts commit 865cc3e9cc ("gfs2: fix a deadlock on
withdraw-during-mount").

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:27 +00:00
Andreas Gruenbacher
dcc42d5541 Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (6/6)
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly.  That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery.  We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.

Reverts parts of commit 601ef0d52e ("gfs2: Force withdraw to replay
journals and wait for it to finish").

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:27 +00:00
Andreas Gruenbacher
406058184c Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (5/6)
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly.  That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery.  We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.

Reverts parts of commit 601ef0d52e ("gfs2: Force withdraw to replay
journals and wait for it to finish").

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:27 +00:00
Andreas Gruenbacher
a07a1e46d2 Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (4/6)
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly.  That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery.  We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.

Reverts parts of commit 601ef0d52e ("gfs2: Force withdraw to replay
journals and wait for it to finish").

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:27 +00:00
Andreas Gruenbacher
4cee5b0f7a Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (3/6)
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly.  That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery.  We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.

Reverts parts of commit 601ef0d52e ("gfs2: Force withdraw to replay
journals and wait for it to finish").

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:27 +00:00
Andreas Gruenbacher
2aae092dc4 Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (2/6)
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly.  That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery.  We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.

Reverts parts of commit 601ef0d52e ("gfs2: Force withdraw to replay
journals and wait for it to finish").

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:26 +00:00
Andreas Gruenbacher
20b44ddbbb Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (1/6)
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly.  That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery.  We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.

Reverts parts of commit 601ef0d52e ("gfs2: Force withdraw to replay
journals and wait for it to finish").

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:26 +00:00
Andreas Gruenbacher
833c93caea Revert "gfs2: don't stop reads while withdraw in progress"
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly.  That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery.  We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.

The withdrawing node has no role in recovering from the withdraw
anymore, so it also no longer needs to read metadata blocks after a
withdraw.

We now only need to set a single bit in gfs2_withdraw(), so switch from
try_cmpxchg() to test_and_set_bit().
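
For readers unfamiliar with the idiom, here is a userspace sketch of "set one bit and learn whether it was already set" using C11 atomics. The helper names and the bit number are hypothetical; the kernel's test_and_set_bit() is only modelled, not reproduced.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_WITHDRAWN_BIT 0   /* hypothetical bit number */

/* Userspace stand-in for test_and_set_bit(): returns the old bit value. */
static bool sketch_test_and_set_bit(unsigned int nr, atomic_ulong *flags)
{
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_or(flags, mask) & mask) != 0;
}

/* Returns true only for the caller that actually starts the withdraw. */
static bool sketch_begin_withdraw(atomic_ulong *flags)
{
	return !sketch_test_and_set_bit(SKETCH_WITHDRAWN_BIT, flags);
}
```

A compare-and-exchange loop is only needed when several bits must change together; for a single bit, one atomic OR is enough and exactly one caller sees the bit as previously clear.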

Reverts commit 8cc67f704f ("gfs2: don't stop reads while withdraw in
progress").

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:26 +00:00
Andreas Gruenbacher
1714e8543d gfs2: Rename LM_FLAG_{NOEXP -> RECOVER}
GFS sets the LM_FLAG_NOEXP flag on locking requests it makes during
journal recovery, so rename the flag to LM_FLAG_RECOVER for improved
code readability.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:26 +00:00
Andreas Gruenbacher
fab27b4930 gfs2: Kill gfs2_io_error_bh_wd
All callers of gfs2_io_error_bh() call gfs2_withdraw() as well, so
change gfs2_io_error_bh() to call gfs2_withdraw() directly.  This also
brings it in line with other similar error reporting functions.

With that, gfs2_io_error_bh() is the same as gfs2_io_error_bh_wd(), so
remove the latter.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:26 +00:00
Andreas Gruenbacher
0e2038a90c gfs2: Withdraw immediately on log write errors
Now that gfs2_withdraw() is asynchronous, immediately withdraw when
a log write error is detected.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:26 +00:00
Andreas Gruenbacher
1b7d498dca gfs2: Rename gfs2_{withdrawing_or_ => }withdrawn
With delayed withdraws and the SDF_WITHDRAWING flag gone, we can now
rename gfs2_withdrawing_or_withdrawn() back to gfs2_withdrawn().

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:52:23 +00:00
Andreas Gruenbacher
8daf6c2b3d gfs2: Get rid of delayed withdraws
Now that gfs2_withdraw() is asynchronous, it can be called in any
context and there is no more need for gfs2_withdraw_delayed() or for
turning delayed withdraws into actual withdraws.  Remove the
now-obsolete code.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:51:47 +00:00
Andreas Gruenbacher
9c4a3de6cd gfs2: Asynchronous withdraw
So far, withdraws are carried out in the context of the calling task.
When another task tries to withdraw while a withdraw is already
underway, that task blocks as well.  Change that to carry out withdraws
asynchronously in workqueue context and don't block the task triggering
the withdraw anymore.

Fixes: syzbot+6b156e132970e550194c@syzkaller.appspotmail.com
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:51:47 +00:00
Andreas Gruenbacher
9334c73fb1 gfs2: Add clean argument to lm_unmount hook
Add a 'clean' argument to ->lm_unmount() that indicates whether the
filesystem is clean or needs recovery.  Set clean to true for normal
unmounts, and to false for withdraws.
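
The shape of such a hook can be sketched as below. The types are userspace stand-ins with invented names, and recording a needs_recovery flag is only an illustration of what a lock module might do with the argument.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-filesystem state a lock module keeps. */
struct sketch_sbd {
	bool unmounted;
	bool needs_recovery;
};

/* Hypothetical ops table with an unmount hook that takes 'clean'. */
struct sketch_lm_ops {
	void (*lm_unmount)(struct sketch_sbd *sdp, bool clean);
};

static void sketch_unmount(struct sketch_sbd *sdp, bool clean)
{
	sdp->unmounted = true;
	/* An unclean release tells the cluster this journal needs recovery. */
	sdp->needs_recovery = !clean;
}

static const struct sketch_lm_ops sketch_ops = {
	.lm_unmount = sketch_unmount,
};
```

A normal unmount would call the hook with clean set to true, and the withdraw path with clean set to false.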

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:51:47 +00:00
Andreas Gruenbacher
94f56488c7 gfs2: Clean up quotad timeout handling
Instead of tracking the remaining time, track the deadline of each of
the timeouts.
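
A minimal sketch of deadline-based bookkeeping, assuming a free-running tick counter similar to jiffies. The function names are hypothetical and the two deadlines merely stand for two periodic tasks.

```c
#include <assert.h>

/* Wrap-safe "now is at or past deadline", in the style of time_after_eq(). */
static int sketch_expired(unsigned long now, unsigned long deadline)
{
	return (long)(now - deadline) >= 0;
}

/* Ticks left until the deadline, or 0 if it has already passed. */
static unsigned long sketch_time_left(unsigned long now, unsigned long deadline)
{
	return sketch_expired(now, deadline) ? 0 : deadline - now;
}

/* Sleep until the earlier of the two deadlines. */
static unsigned long sketch_next_sleep(unsigned long now,
				       unsigned long first_deadline,
				       unsigned long second_deadline)
{
	unsigned long a = sketch_time_left(now, first_deadline);
	unsigned long b = sketch_time_left(now, second_deadline);

	return a < b ? a : b;
}
```

With deadlines, nothing has to be decremented after each wakeup: the remaining time is always recomputed from the current time, so an early or late wakeup cannot make the counters drift.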

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:51:41 +00:00
Andreas Gruenbacher
dff1fb6d8b gfs2: Fix "gfs2: Switch to wait_event in gfs2_quotad"
Commit e4a8b5481c ("gfs2: Switch to wait_event in gfs2_quotad") broke
cyclic statfs syncing, so the numbers reported by "df" could easily get
completely out of sync with reality.  Fix this by reverting part of
commit e4a8b5481c for now.

A follow-up commit will clean this code up later.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:50:53 +00:00
Andreas Gruenbacher
5b351583a3 gfs2: Minor cosmetic remote delete cleanups
Rename gfs2_try_evict() to gfs2_try_to_evict().  The GIF_DEFER_DELETE
flag has been superseded by the GLF_DEFER_DELETE flag, so fix a
left-over comment.  Add a clarifying comment to delete_work_func().

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:50:53 +00:00
Andreas Gruenbacher
64c10ed927 gfs2: fix remote evict for read-only filesystems
When a node tries to delete an inode, it first requests exclusive access
to the iopen glock.  This triggers demote requests on all remote nodes
currently holding the iopen glock.  To satisfy those requests, the
remote nodes evict the inode in question, or they poke the corresponding
inode glock to signal that the inode is still in active use.

This behavior doesn't depend on whether or not a filesystem is
read-only, so remove the incorrect read-only check.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 23:50:53 +00:00
Alexey Velichayshiy
4cfc7d5a4a gfs2: fix freeze error handling
After commit b77b4a4815 ("gfs2: Rework freeze / thaw logic"),
the freeze error handling is broken because gfs2_do_thaw()
overwrites the 'error' variable, causing incorrect processing
of the original freeze error.

Fix this by calling gfs2_do_thaw() when gfs2_lock_fs_check_clean()
fails but ignoring its return value to preserve the original
freeze error for proper reporting.
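
The bug pattern and its fix can be sketched in isolation. The two helper functions below are hypothetical stand-ins with fixed return values, not the gfs2 functions themselves.

```c
#include <assert.h>

/* Hypothetical stand-ins: the freeze step fails, the thaw step succeeds. */
static int sketch_lock_fs_check_clean(void)
{
	return -5;   /* e.g. -EIO */
}

static int sketch_do_thaw(void)
{
	return 0;
}

/* Buggy pattern: the thaw result overwrites the original freeze error. */
static int sketch_freeze_buggy(void)
{
	int error = sketch_lock_fs_check_clean();

	if (error)
		error = sketch_do_thaw();
	return error;
}

/* Fixed pattern: still thaw, but keep reporting the original error. */
static int sketch_freeze_fixed(void)
{
	int error = sketch_lock_fs_check_clean();

	if (error)
		(void)sketch_do_thaw();
	return error;
}
```

In the buggy variant a successful thaw turns a failed freeze into an apparent success, which is the misreporting the commit describes.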

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: b77b4a4815 ("gfs2: Rework freeze / thaw logic")
Cc: stable@vger.kernel.org # v6.5+
Signed-off-by: Alexey Velichayshiy <a.velichayshiy@ispras.ru>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 13:01:07 +00:00
Andreas Gruenbacher
2c5f4a5347 gfs2: Prevent recursive memory reclaim
Function new_inode() returns a new inode with inode->i_mapping->gfp_mask
set to GFP_HIGHUSER_MOVABLE.  This value includes the __GFP_FS flag, so
allocations in that address space can recurse into filesystem memory
reclaim.  We don't want that to happen because it can consume a
significant amount of stack memory.

Worse than that is that it can also deadlock: for example, in several
places, gfs2_unstuff_dinode() is called inside filesystem transactions.
This calls filemap_grab_folio(), which can allocate a new folio, which
can trigger memory reclaim.  If memory reclaim recurses into the
filesystem and starts another transaction, a deadlock will ensue.

To fix these kinds of problems, prevent memory reclaim from recursing
into filesystem code by making sure that the gfp_mask of inode address
spaces doesn't include __GFP_FS.

The "meta" and resource group address spaces were already using GFP_NOFS
as their gfp_mask (which doesn't include __GFP_FS).  The default value
of GFP_HIGHUSER_MOVABLE is less restrictive than GFP_NOFS, though.  To
avoid being overly limiting, use the default value and only knock off
the __GFP_FS flag.  I'm not sure if this will actually make a
difference, but it also shouldn't hurt.
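
The difference between "use GFP_NOFS" and "use the default minus __GFP_FS" can be shown with a toy bitmask. The bit values and the composition of the two masks below are simplified assumptions for illustration; the kernel's real definitions contain more flags.

```c
#include <assert.h>

/* Hypothetical bit assignments; the kernel's real values differ. */
#define SKETCH_GFP_RECLAIM  (1u << 0)
#define SKETCH_GFP_IO       (1u << 1)
#define SKETCH_GFP_FS       (1u << 2)
#define SKETCH_GFP_HIGHMEM  (1u << 3)
#define SKETCH_GFP_MOVABLE  (1u << 4)

/* Simplified models of the two masks discussed above. */
#define SKETCH_GFP_NOFS (SKETCH_GFP_RECLAIM | SKETCH_GFP_IO)
#define SKETCH_GFP_HIGHUSER_MOVABLE \
	(SKETCH_GFP_RECLAIM | SKETCH_GFP_IO | SKETCH_GFP_FS | \
	 SKETCH_GFP_HIGHMEM | SKETCH_GFP_MOVABLE)

/* Keep the default mask but forbid recursion into filesystem reclaim. */
static unsigned int sketch_inode_gfp_mask(unsigned int default_mask)
{
	return default_mask & ~SKETCH_GFP_FS;
}
```

The resulting mask still allows highmem and movable allocations, which a plain NOFS mask would give up.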

This patch is loosely based on commit ad22c7a043 ("xfs: prevent stack
overflows from page cache allocation").

Fixes xfstest generic/273.

Fixes: dc0b943523 ("gfs: Don't use GFP_NOFS in gfs2_unstuff_dinode")
Reviewed-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26 12:57:10 +00:00
Mateusz Guzik
a27628f436 fs: rework I_NEW handling to operate without fences
In the inode hash code grab the state while ->i_lock is held. If found
to be set, synchronize the sleep once more with the lock held.

In the real world the flag is not set most of the time.

Apart from being simpler to reason about, it comes with a minor speed up
as now clearing the flag does not require the smp_mb() fence.

While here rename wait_on_inode() to wait_on_new_inode() to line it up
with __wait_on_freeing_inode().

Christian Brauner <brauner@kernel.org> says:

As per the discussion in [1] I folded in the diff sent in [2].

Link: https://lore.kernel.org/69238e4d.a70a0220.d98e3.006e.GAE@google.com [1]
Link: https://lore.kernel.org/c2kpawomkbvtahjm7y5mposbhckb7wxthi3iqy5yr22ggpucrm@ufvxwy233qxo [2]
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251010221737.1403539-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:32:39 +01:00
Matthew Wilcox (Oracle)
c3454ac036 gfs2: Use bio_add_folio_nofail()
As the label says, we've just allocated a new BIO so we know
we can add this folio to it.  We now have bio_add_folio_nofail()
for this purpose.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-09 20:19:34 +00:00
Joanne Koong
b2f35ac414 iomap: add caller-provided callbacks for read and readahead
Add caller-provided callbacks for read and readahead so that it can be
used generically, especially by filesystems that are not block-based.

In particular, this:
* Modifies the read and readahead interface to take in a
  struct iomap_read_folio_ctx that is publicly defined as:

  struct iomap_read_folio_ctx {
	const struct iomap_read_ops *ops;
	struct folio *cur_folio;
	struct readahead_control *rac;
	void *read_ctx;
  };

  where struct iomap_read_ops is defined as:

  struct iomap_read_ops {
      int (*read_folio_range)(const struct iomap_iter *iter,
                             struct iomap_read_folio_ctx *ctx,
                             size_t len);
      void (*read_submit)(struct iomap_read_folio_ctx *ctx);
  };

  read_folio_range() reads in the folio range and must be provided by
  the caller. read_submit() is optional and is used for submitting any
  pending read requests.

* Modifies existing filesystems that use iomap for read and readahead to
  use the new API, through the new statically inlined helpers
  iomap_bio_read_folio() and iomap_bio_readahead(). There is no change
  in functionality for those filesystems.
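
To show how a caller might plug into this interface, here is a userspace sketch built around the two structures quoted above. The kernel types are replaced by opaque stand-ins, and demo_read_folio_range(), demo_read_ops, and demo_read() are hypothetical names; how iomap itself dispatches the hooks is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Opaque stand-ins so the sketch compiles outside the kernel. */
struct iomap_iter;
struct folio;
struct readahead_control;
struct iomap_read_folio_ctx;

struct iomap_read_ops {
	int (*read_folio_range)(const struct iomap_iter *iter,
				struct iomap_read_folio_ctx *ctx, size_t len);
	void (*read_submit)(struct iomap_read_folio_ctx *ctx);
};

struct iomap_read_folio_ctx {
	const struct iomap_read_ops *ops;
	struct folio *cur_folio;
	struct readahead_control *rac;
	void *read_ctx;
};

/* Hypothetical non-block-based reader: just count the bytes requested. */
static int demo_read_folio_range(const struct iomap_iter *iter,
				 struct iomap_read_folio_ctx *ctx, size_t len)
{
	size_t *total = ctx->read_ctx;

	(void)iter;
	*total += len;
	return 0;
}

/* read_submit is optional, so it is left NULL here. */
static const struct iomap_read_ops demo_read_ops = {
	.read_folio_range = demo_read_folio_range,
};

/* Hypothetical driver: call the required hook, then the optional one. */
static int demo_read(struct iomap_read_folio_ctx *ctx, size_t len)
{
	int ret = ctx->ops->read_folio_range(NULL, ctx, len);

	if (!ret && ctx->ops->read_submit)
		ctx->ops->read_submit(ctx);
	return ret;
}
```

The read_ctx pointer is what lets a filesystem carry its own state (here a byte counter) through the generic code without iomap knowing its type.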

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:23 +01:00
Sukrut Heroorkar
de90521604 gfs2: document ip in __gfs2_holder_init kernel-doc comment
Building with W=1 reports:
Warning: fs/gfs2/glock.c:1248 function parameter 'ip' not described
in '__gfs2_holder_init'

The ip parameter was added when __gfs2_holder_init started saving the
gfs2_glock_nq_init caller's return address to gh_ip. This makes it
easier to backtrack which holder took the lock. Document @ip to silence
this warning.

Fixes: b016d9a84a ("gfs2: Save ip from gfs2_glock_nq_init")
Signed-off-by: Sukrut Heroorkar <hsukrut3@gmail.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-10-31 21:04:17 +00:00
Utkarsh Singh
02c03021e2 gfs2/sysfs: Replace sprintf/snprintf with sysfs_emit
Documentation/filesystems/sysfs.rst mentions that show() should only
use sysfs_emit() or sysfs_emit_at() when formatting values returned
to user space. This patch updates the GFS2 sysfs interface accordingly.

It replaces uses of sprintf() and snprintf() in all *_show() functions
with sysfs_emit() to align with current kernel sysfs API best practices.
It also updates the TUNE_ATTR_2 macro to use sysfs_emit() instead of
snprintf().
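
The shape of the conversion can be sketched in userspace. The sketch below is an assumption-laden stand-in, not the kernel helper: mock_sysfs_emit() only imitates the property that output is bounded by one page, MOCK_PAGE_SIZE is an assumed 4096, and demo_pair_show() is a hypothetical show()-style function rather than anything from fs/gfs2/sys.c.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#define MOCK_PAGE_SIZE 4096	/* assumed page size for the mock */

/*
 * Userspace stand-in for sysfs_emit(): format into a page-sized buffer
 * and return the number of characters actually stored.
 */
int mock_sysfs_emit(char *buf, const char *fmt, ...)
{
	va_list args;
	int len;

	assert(buf != NULL);
	va_start(args, fmt);
	len = vsnprintf(buf, MOCK_PAGE_SIZE, fmt, args);
	va_end(args);

	if (len < 0)
		return 0;
	/* vsnprintf() reports the untruncated length; clamp it. */
	if (len >= MOCK_PAGE_SIZE)
		len = MOCK_PAGE_SIZE - 1;
	return len;
}

/*
 * Hypothetical show()-style callback. Before a conversion of this kind
 * it would read: return snprintf(buf, PAGE_SIZE, "%u %u\n", a, b);
 */
int demo_pair_show(char *buf, unsigned int a, unsigned int b)
{
	return mock_sysfs_emit(buf, "%u %u\n", a, b);
}
```

The caller no longer passes a buffer size, which removes one way of getting the bound wrong.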

Signed-off-by: Utkarsh Singh <utkarsh.singh.em@gmail.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-10-31 21:00:42 +00:00
Matthew Wilcox (Oracle)
5f0fc78532 gfs2: Use folio_next_pos()
This is one instruction more efficient than open-coding folio_pos() +
folio_size().  It's the equivalent of (x + y) << z rather than
x << z + y << z.
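
The arithmetic can be checked with a small userspace sketch. Everything here is hypothetical scaffolding: struct demo_folio models a folio as a starting page index plus a page count, and DEMO_PAGE_SHIFT is an assumed 12. Note that the expression above is written in mathematical notation; in C the shifts need explicit parentheses because + binds tighter than <<.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assumed 4 KiB pages */

/* Hypothetical model of a folio: first page index and number of pages. */
struct demo_folio {
	uint64_t index;
	uint64_t nr_pages;
};

/* Open-coded form: position plus size, two shifts and one add. */
uint64_t demo_next_pos_open_coded(const struct demo_folio *f)
{
	assert(f->nr_pages > 0);
	return (f->index << DEMO_PAGE_SHIFT) +
	       (f->nr_pages << DEMO_PAGE_SHIFT);
}

/* Combined form: one add followed by one shift. */
uint64_t demo_next_pos_combined(const struct demo_folio *f)
{
	assert(f->nr_pages > 0);
	return (f->index + f->nr_pages) << DEMO_PAGE_SHIFT;
}
```

Both functions return the same byte offset for any input that does not overflow; the second simply needs one shift fewer.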

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-7-willy@infradead.org
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: gfs2@lists.linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 13:11:38 +01:00
Julian Sun
4952f35f05 fs: Make wbc_to_tag() inline and use it in fs.
The logic in wbc_to_tag() is widely used in file systems, so modify this
function to be inline and use it in file systems.

This patch has only passed compilation tests, but it should be fine.

Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29 23:33:48 +01:00
Mateusz Guzik
40a4c512ad gfs2: use the new ->i_state accessors
Change generated with coccinelle and fixed up by hand as appropriate.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:27 +02:00
Linus Torvalds
829745b75a Merge tag 'pull-finish_no_open' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull finish_no_open updates from Al Viro:
 "finish_no_open calling conventions change to simplify callers"

* tag 'pull-finish_no_open' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  slightly simplify nfs_atomic_open()
  simplify gfs2_atomic_open()
  simplify fuse_atomic_open()
  simplify nfs_atomic_open_v23()
  simplify vboxsf_dir_atomic_open()
  simplify cifs_atomic_open()
  9p: simplify v9fs_vfs_atomic_open_dotl()
  9p: simplify v9fs_vfs_atomic_open()
  allow finish_no_open(file, ERR_PTR(-E...))
2025-10-03 10:59:31 -07:00
Linus Torvalds
8804d970fa Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:

 - "mm, swap: improve cluster scan strategy" from Kairui Song improves
   performance and reduces the failure rate of swap cluster allocation

 - "support large align and nid in Rust allocators" from Vitaly Wool
   permits Rust allocators to set NUMA node and large alignment when
   performing slub and vmalloc reallocs

 - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extends
   DAMOS_STAT's handling of the DAMON operations sets for virtual
   address spaces for ops-level DAMOS filters

 - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
   Baghdasaryan reduces mmap_lock contention during reads of
   /proc/pid/maps

 - "mm/mincore: minor clean up for swap cache checking" from Kairui Song
   performs some cleanup in the swap code

 - "mm: vm_normal_page*() improvements" from David Hildenbrand provides
   code cleanup in the pagemap code

 - "add persistent huge zero folio support" from Pankaj Raghav provides
   a block layer speedup by optionally making the
   huge_zero_page persistent, instead of releasing it when its refcount
   falls to zero

 - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
   the recently added Kexec Handover feature

 - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
   Stoakes turns mm_struct.flags into a bitmap, to end the constant
   struggle with space shortage on 32-bit conflicting with 64-bit's
   needs

 - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
   code

 - "selftests/mm: Fix false positives and skip unsupported tests" from
   Donet Tom fixes a few things in our selftests code

 - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
   from David Hildenbrand "allows individual processes to opt-out of
   THP=always into THP=madvise, without affecting other workloads on the
   system".

   It's a long story - the [1/N] changelog spells out the considerations

 - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
   the memdesc project. Please see

      https://kernelnewbies.org/MatthewWilcox/Memdescs and
      https://blogs.oracle.com/linux/post/introducing-memdesc

 - "Tiny optimization for large read operations" from Chi Zhiling
   improves the efficiency of the pagecache read path

 - "Better split_huge_page_test result check" from Zi Yan improves our
   folio splitting selftest code

 - "test that rmap behaves as expected" from Wei Yang adds some rmap
   selftests

 - "remove write_cache_pages()" from Christoph Hellwig removes that
   function and converts its two remaining callers

 - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
   selftests issues

 - "introduce kernel file mapped folios" from Boris Burkov introduces
   the concept of "kernel file pages". Using these permits btrfs to
   account its metadata pages to the root cgroup, rather than to the
   cgroups of random inappropriate tasks

 - "mm/pageblock: improve readability of some pageblock handling" from
   Wei Yang provides some readability improvements to the page allocator
   code

 - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
   to understand arm32 highmem

 - "tools: testing: Use existing atomic.h for vma/maple tests" from
   Brendan Jackman performs some code cleanups and deduplication under
   tools/testing/

 - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
   a couple of 32-bit issues in tools/testing/radix-tree.c

 - "kasan: unify kasan_enabled() and remove arch-specific
   implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
   initialization code into a common arch-neutral implementation

 - "mm: remove zpool" from Johannes Weiner removes zpool - an
   indirection layer which now only redirects to a single thing
   (zsmalloc)

 - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
   couple of cleanups in the fork code

 - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
   adjustments at various nth_page() callsites, eventually permitting
   the removal of that undesirable helper function

 - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
   creates a KASAN read-only mode for ARM, using that architecture's
   memory tagging feature. It is felt that a read-only mode KASAN is
   suitable for use in production systems rather than debug-only

 - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
   some tidying in the hugetlb folio allocation code

 - "mm: establish const-correctness for pointer parameters" from Max
   Kellermann makes quite a number of the MM API functions more accurate
   about the constness of their arguments. This was getting in the way
   of subsystems (in this case CEPH) when they attempt to improve
   their own const/non-const accuracy

 - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
   code sites which were confused over when to use free_pages() vs
   __free_pages()

 - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
   mapletree code accessible to Rust. Required by nouveau and by its
   forthcoming successor: the new Rust Nova driver

 - "selftests/mm: split_huge_page_test: split_pte_mapped_thp
   improvements" from David Hildenbrand adds a fix and some cleanups to
   the thp selftesting code

 - "mm, swap: introduce swap table as swap cache (phase I)" from Chris
   Li and Kairui Song is the first step along the path to implementing
   "swap tables" - a new approach to swap allocation and state tracking
   which is expected to yield speed and space improvements. This
   patchset itself yields a 5-20% performance benefit in some situations

 - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
   layer to clean up the ptdesc code a little

 - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
   issues in our 5-level pagetable selftesting code

 - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
   addresses a couple of minor issues in the relatively new memory
   allocation profiling feature

 - "Small cleanups" from Matthew Wilcox has a few cleanups in
   preparation for more memdesc work

 - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
   Quanmin Yan makes some changes to DAMON in furtherance of supporting
   arm highmem

 - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
   Anjum adds that compiler check to selftests code and fixes the
   fallout, by removing dead code

 - "Improvements to Victim Process Thawing and OOM Reaper Traversal
   Order" from zhongjinji makes a number of improvements in the OOM
   killer: mainly thawing a more appropriate group of victim threads so
   they can release resources

 - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
   is a bunch of small and unrelated fixups for DAMON

 - "mm/damon: define and use DAMON initialization check function" from
   SeongJae Park implements reliability and maintainability improvements
   to a recently-added bug fix

 - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
   SeongJae Park provides additional transparency to userspace clients
   of the DAMON_STAT information

 - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
   some constraints on khugepaged's collapsing of anon VMAs. It also
   increases the success rate of MADV_COLLAPSE against an anon vma

 - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
   from Lorenzo Stoakes moves us further towards removal of
   file_operations.mmap(). This patchset concentrates upon clearing up
   the treatment of stacked filesystems

 - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
   provides some fixes and improvements to mlock's tracking of large
   folios. /proc/meminfo's "Mlocked" field became more accurate

 - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
   Donet Tom fixes several user-visible KSM stats inaccuracies across
   forks and adds selftest code to verify these counters

 - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
   some potential but presently benign issues in KSM's mm_slot handling

* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
  mm: swap: check for stable address space before operating on the VMA
  mm: convert folio_page() back to a macro
  mm/khugepaged: use start_addr/addr for improved readability
  hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
  alloc_tag: fix boot failure due to NULL pointer dereference
  mm: silence data-race in update_hiwater_rss
  mm/memory-failure: don't select MEMORY_ISOLATION
  mm/khugepaged: remove definition of struct khugepaged_mm_slot
  mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
  hugetlb: increase number of reserving hugepages via cmdline
  selftests/mm: add fork inheritance test for ksm_merging_pages counter
  mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
  drivers/base/node: fix double free in register_one_node()
  mm: remove PMD alignment constraint in execmem_vmalloc()
  mm/memory_hotplug: fix typo 'esecially' -> 'especially'
  mm/rmap: improve mlock tracking for large folios
  mm/filemap: map entire large folio faultaround
  mm/fault: try to map the entire file folio in finish_fault()
  mm/rmap: mlock large folios in try_to_unmap_one()
  mm/rmap: fix a mlock race condition in folio_referenced_one()
  ...
2025-10-02 18:18:33 -07:00
Linus Torvalds
a769648f46 Merge tag 'dlm-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
 "This adds a dlm_release_lockspace() flag to request that node-failure
  recovery be performed for the node leaving the lockspace.

  The implementation of this flag requires coordination with userland
  clustering components. It's been requested for use by GFS2"

* tag 'dlm-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: check for undefined release_option values
  dlm: handle release_option as unsigned
  dlm: move to rinfo for all middle conversion cases
  dlm: handle invalid lockspace member remove
  dlm: add new flag DLM_RELEASE_RECOVER for dlm_lockspace_release
  dlm: add new configfs entry release_recover for lockspace members
  dlm: add new RELEASE_RECOVER uevent attribute for release_lockspace
  dlm: use defines for force values in dlm_release_lockspace
  dlm: check for defined force value in dlm_lockspace_release
2025-09-29 15:24:58 -07:00
Linus Torvalds
a40eb50a95 Merge tag 'gfs2-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:

 - Partially revert "gfs2: do_xmote fixes" to ignore dlm_lock() errors
   during withdraw; passing on those errors doesn't help

 - Change the LM_FLAG_TRY and LM_FLAG_TRY_1CB logic in add_to_queue() to
   check if the holder would actually block

 - Move some more dlm specific code from glock.c to lock_dlm.c

 - Remove the unused dlm alternate locking mode code

 - Add proper locking to make sure that dlm lockspaces are never used
   after being released

 - Various other cleanups

* tag 'gfs2-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix unlikely race in gdlm_put_lock
  gfs2: Add proper lockspace locking
  gfs2: Minor run_queue fixes
  gfs2: run_queue cleanup
  gfs2: Simplify do_promote
  gfs2: Get rid of GLF_INVALIDATE_IN_PROGRESS
  gfs2: Fix GLF_INVALIDATE_IN_PROGRESS flag clearing in do_xmote
  gfs2: Remove duplicate check in do_xmote
  gfs2: Fix LM_FLAG_TRY* logic in add_to_queue
  gfs2: Remove DLM_LKF_ALTCW / DLM_LKF_ALTPR code
  gfs2: Further sanitize lock_dlm.c
  gfs2: Do not use atomic operations unnecessarily
  gfs2: Sanitize gfs2_meta_check, gfs2_metatype_check, gfs2_io_error
  gfs2: Turn gfs2_withdraw into a void function
  gfs2: Partially revert "gfs2: do_xmote fixes"
  gfs2: Simplify refcounting in do_xmote
  gfs2: do_xmote cleanup
  gfs2: Remove space before newline
  gfs2: Remove unused sd_withdraw_wait field
  gfs2: Remove unused GIF_FREE_VFS_INODE flag
2025-09-29 14:28:50 -07:00
Linus Torvalds
b786405685 Merge tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs workqueue updates from Christian Brauner:
 "This contains various workqueue changes affecting the filesystem
  layer.

  Currently, if a user enqueues a work item using schedule_delayed_work(),
  the workqueue used is "system_wq" (a per-CPU wq), while
  queue_delayed_work() uses WORK_CPU_UNBOUND (used when a CPU is not
  specified). The same applies to schedule_work(), which uses system_wq,
  and queue_work(), which again uses WORK_CPU_UNBOUND.

  This replaces the use of system_wq and system_unbound_wq. system_wq is
  a per-CPU workqueue which isn't very obvious from the name and
  system_unbound_wq is to be used when locality is not required.

  So this renames system_wq to system_percpu_wq, and system_unbound_wq
  to system_dfl_wq.

  This also adds a new WQ_PERCPU flag to allow the fs subsystem users to
  explicitly request the use of per-CPU behavior. Both WQ_UNBOUND and
  WQ_PERCPU flags coexist for one release cycle to allow callers to
  transition their calls. WQ_UNBOUND will be removed in a future release
  cycle"

* tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: WQ_PERCPU added to alloc_workqueue users
  fs: replace use of system_wq with system_percpu_wq
  fs: replace use of system_unbound_wq with system_dfl_wq
2025-09-29 10:27:17 -07:00
Linus Torvalds
56e7b31071 Merge tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode updates from Christian Brauner:
 "This contains a series I originally wrote and that Eric brought over
  the finish line. It moves the i_crypt_info and i_verity_info
  pointers out of 'struct inode' and into the fs-specific part of the
  inode.

  So now the few filesystems that actually make use of this pay the price
  in their own private inode storage instead of forcing it upon every
  user of struct inode.

  The pointer for the crypt and verity info is simply found by storing
  an offset to its address in struct fsverity_operations and struct
  fscrypt_operations. This shrinks struct inode by 16 bytes.
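
The offset-based lookup can be sketched in userspace as follows. All names are hypothetical (struct demo_fs_inode, struct demo_crypt_ops, info_offs, demo_crypt_info_addr()); the sketch only shows the general technique of recording, in an operations table, where the info pointer sits relative to the embedded generic inode, and it makes no claim about the actual field names or types used by fscrypt or fsverity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the generic inode and the crypt info. */
struct demo_inode { unsigned long i_ino; };
struct demo_crypt_info { int policy_version; };

/* Hypothetical fs-private inode: the info pointer lives here now. */
struct demo_fs_inode {
	struct demo_crypt_info *crypt_info;
	struct demo_inode vfs_inode;
};

/* Hypothetical operations table carrying the offset. */
struct demo_crypt_ops {
	ptrdiff_t info_offs;	/* from embedded inode to info pointer */
};

const struct demo_crypt_ops demo_ops = {
	.info_offs = (ptrdiff_t)offsetof(struct demo_fs_inode, crypt_info) -
		     (ptrdiff_t)offsetof(struct demo_fs_inode, vfs_inode),
};

/* Locate the info pointer given only the generic inode and the ops. */
struct demo_crypt_info **
demo_crypt_info_addr(const struct demo_crypt_ops *ops,
		     struct demo_inode *inode)
{
	assert(inode != NULL);
	return (struct demo_crypt_info **)((char *)inode + ops->info_offs);
}
```

Generic code can reach the pointer this way without the generic inode having to reserve space for it, which is where the size saving comes from.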

  I hope to move a lot more out of it in the future so that struct inode
  becomes really just about very core stuff that we need, much like
  struct dentry and struct file, instead of the dumping ground it has
  become over the years.

  On top of this are various changes associated with the ongoing inode
  lifetime handling rework that multiple people are pushing forward:

   - Stop accessing inode->i_count directly in f2fs and gfs2. They
     simply should use the __iget() and iput() helpers

   - Make the i_state flags an enum

   - Rework the iput() logic

     Currently, if we are the last iput, and we have the I_DIRTY_TIME
     bit set, we will grab a reference on the inode again and then mark
     it dirty and then redo the put. This is to make sure we delay the
     time update for as long as possible

     We can rework this logic to simply dec i_count if it is not 1, and
     if it is do the time update while still holding the i_count
     reference

     Then we can replace the atomic_dec_and_lock with locking the
     ->i_lock and doing atomic_dec_and_test, since we did the
     atomic_add_unless above

   - Add an icount_read() helper and convert everyone that accesses
     inode->i_count directly for this purpose to use the helper

   - Expand dump_inode() to dump more information about an inode helping
     in debugging

   - Add some might_sleep() annotations to iput() and associated
     helpers"
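
The reworked iput() flow described in the list above can be sketched with C11 atomics. This is a simplified, single-threaded userspace model and not the code in fs/inode.c: struct demo_inode, demo_atomic_add_unless(), and demo_iput() are hypothetical, the dirty_time flag stands in for I_DIRTY_TIME, and the point where the real code would take ->i_lock is only marked by a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the inode fields involved. */
struct demo_inode {
	atomic_int i_count;
	bool dirty_time;	/* stand-in for I_DIRTY_TIME */
	int time_updates;	/* how often the time update ran */
	bool evicted;		/* set when the last reference is gone */
};

void demo_inode_init(struct demo_inode *inode, int count, bool dirty_time)
{
	atomic_init(&inode->i_count, count);
	inode->dirty_time = dirty_time;
	inode->time_updates = 0;
	inode->evicted = false;
}

/* Add a to *v unless *v equals u; returns true if the add happened. */
bool demo_atomic_add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	while (old != u) {
		/* On failure, old is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(v, &old, old + a))
			return true;
	}
	return false;
}

void demo_iput(struct demo_inode *inode)
{
	assert(atomic_load(&inode->i_count) > 0);

	/* Not the last reference: just decrement and leave. */
	if (demo_atomic_add_unless(&inode->i_count, -1, 1))
		return;

	/* Count is 1: do the time update while still holding it. */
	if (inode->dirty_time) {
		inode->dirty_time = false;
		inode->time_updates++;
	}

	/* The real code would lock ->i_lock here before the final drop. */
	if (atomic_fetch_sub(&inode->i_count, 1) == 1)
		inode->evicted = true;
}
```

Compared with re-grabbing a reference, marking the inode dirty, and redoing the put, the time update in this model happens exactly once, on the path that already knows it holds the last reference.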

* tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: add might_sleep() annotation to iput() and more
  fs: expand dump_inode()
  inode: fix whitespace issues
  fs: add an icount_read helper
  fs: rework iput logic
  fs: make the i_state flags an enum
  fs: stop accessing ->i_count directly in f2fs and gfs2
  fsverity: check IS_VERITY() in fsverity_cleanup_inode()
  fs: remove inode::i_verity_info
  btrfs: move verity info pointer to fs-specific part of inode
  f2fs: move verity info pointer to fs-specific part of inode
  ext4: move verity info pointer to fs-specific part of inode
  fsverity: add support for info in fs-specific part of inode
  fs: remove inode::i_crypt_info
  ceph: move crypt info pointer to fs-specific part of inode
  ubifs: move crypt info pointer to fs-specific part of inode
  f2fs: move crypt info pointer to fs-specific part of inode
  ext4: move crypt info pointer to fs-specific part of inode
  fscrypt: add support for info in fs-specific part of inode
  fscrypt: replace raw loads of info pointer with helper function
2025-09-29 09:42:30 -07:00
Marco Crivellari
69635d7f4b fs: WQ_PERCPU added to alloc_workqueue users
Currently, if a user enqueues a work item using schedule_delayed_work(), the
workqueue used is "system_wq" (a per-CPU wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again uses
WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This patch adds a new WQ_PERCPU flag to all the fs subsystem users to
explicitly request the use of the per-CPU behavior. Both flags coexist
for one release cycle to allow callers to transition their calls.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

All existing users have been updated accordingly.
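
The transition rule can be modelled in a few lines of userspace C. This is a sketch of one reading of the rule above (during the transition, every caller names exactly one of the two behaviors), not kernel code: the DEMO_WQ_* bit values are illustrative and are not copied from include/linux/workqueue.h, and demo_wq_flags_valid() and demo_convert_flags() are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the kernel's real values may differ. */
enum demo_wq_flags {
	DEMO_WQ_UNBOUND     = 1u << 0,
	DEMO_WQ_PERCPU      = 1u << 1,
	DEMO_WQ_MEM_RECLAIM = 1u << 2,
};

/* During the transition, exactly one of the two flags should be set. */
bool demo_wq_flags_valid(unsigned int flags)
{
	bool unbound = (flags & DEMO_WQ_UNBOUND) != 0;
	bool percpu = (flags & DEMO_WQ_PERCPU) != 0;

	return unbound != percpu;
}

/*
 * Model of the mechanical conversion: a caller that did not ask for
 * an unbound queue relied on the per-CPU default, so spell it out.
 */
unsigned int demo_convert_flags(unsigned int old_flags)
{
	/* Pre-conversion flags cannot already carry the new flag. */
	assert(!(old_flags & DEMO_WQ_PERCPU));

	if (!(old_flags & DEMO_WQ_UNBOUND))
		old_flags |= DEMO_WQ_PERCPU;
	return old_flags;
}
```

With every caller stating its choice explicitly, the default can later be flipped without silently changing anyone's behavior.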

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/20250916082906.77439-4-marco.crivellari@suse.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 16:15:07 +02:00