Commit Graph

3020 Commits

Author SHA1 Message Date
Eric Biggers
b261d22220 lib/crc: remove CONFIG_LIBCRC32C
Now that LIBCRC32C does nothing besides select CRC32, make every option
that selects LIBCRC32C instead select CRC32 directly.  Then remove
LIBCRC32C.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250401221600.24878-8-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-04-04 11:31:42 -07:00
Linus Torvalds
ef479de65a Merge tag 'gfs2-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:

 - Fix two bugs related to locking request cancelation (locking request
   being retried instead of canceled; canceling the wrong locking
   request)

 - Prevent a race between inode creation and deferred delete analogous
   to commit ffd1cf0443 from 6.13. This now allows to further simplify
   gfs2_evict_inode() without introducing mysterious problems

 - When in inode delete should be verified / retried "later" but that
   isn't possible, skip the delete instead of carrying it out
   immediately. This broke in 6.13

 - More folio conversions from Matthew Wilcox (plus a fix from Dan
   Carpenter)

 - Various minor fixes and cleanups

* tag 'gfs2-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (22 commits)
  gfs2: some comment clarifications
  gfs2: Fix a NULL vs IS_ERR() bug in gfs2_find_jhead()
  gfs2: Convert gfs2_meta_read_endio() to use a folio
  gfs2: Convert gfs2_end_log_write_bh() to work on a folio
  gfs2: Convert gfs2_find_jhead() to use a folio
  gfs2: Convert gfs2_jhead_pg_srch() to gfs2_jhead_folio_search()
  gfs2: Use b_folio in gfs2_check_magic()
  gfs2: Use b_folio in gfs2_submit_bhs()
  gfs2: Use b_folio in gfs2_trans_add_meta()
  gfs2: Use b_folio in gfs2_log_write_bh()
  gfs2: skip if we cannot defer delete
  gfs2: remove redundant warnings
  gfs2: minor evict fix
  gfs2: Prevent inode creation race (2)
  gfs2: Fix additional unlikely request cancelation race
  gfs2: Fix request cancelation bug
  gfs2: Check for empty queue in run_queue
  gfs2: Remove more dead code in add_to_queue
  gfs2: Replace GIF_DEFER_DELETE with GLF_DEFER_DELETE
  gfs2: glock holder GL_NOPID fix
  ...
2025-03-27 12:09:25 -07:00
Linus Torvalds
26d8e43079 Merge tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs async dir updates from Christian Brauner:
 "This contains cleanups that fell out of the work from async directory
  handling:

   - Change kern_path_locked() and user_path_locked_at() to never return
     a negative dentry. This simplifies the usability of these helpers
     in various places

   - Drop d_exact_alias() from the remaining place in NFS where it is
     still used. This also allows us to drop the d_exact_alias() helper
     completely

   - Drop an unnecessary call to fh_update() from nfsd_create_locked()

   - Change i_op->mkdir() to return a struct dentry

     Change vfs_mkdir() to return a dentry provided by the filesystems
     which is hashed and positive. This allows us to reduce the number
     of cases where the resulting dentry is not positive to very few
     cases. The code in these places becomes simpler and easier to
     understand.

   - Repack DENTRY_* and LOOKUP_* flags"

* tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  doc: fix inline emphasis warning
  VFS: Change vfs_mkdir() to return the dentry.
  nfs: change mkdir inode_operation to return alternate dentry if needed.
  fuse: return correct dentry for ->mkdir
  ceph: return the correct dentry on mkdir
  hostfs: store inode in dentry after mkdir if possible.
  Change inode_operations.mkdir to return struct dentry *
  nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked()
  nfs/vfs: discard d_exact_alias()
  VFS: add common error checks to lookup_one_qstr_excl()
  VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
  VFS: repack LOOKUP_ bit flags.
  VFS: repack DENTRY_ flags.
2025-03-24 10:47:14 -07:00
Linus Torvalds
0ec0d4ecdd Merge tag 'vfs-6.15-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs iomap updates from Christian Brauner:

 - Allow the filesystem to submit the writeback bios.

    - Allow the filsystem to track completions on a per-bio bases
      instead of the entire I/O.

    - Change writeback_ops so that ->submit_bio can be done by the
      filesystem.

    - A new ANON_WRITE flag for writes that don't have a block number
      assigned to them at the iomap level leaving the filesystem to do
      that work in the submission handler.

 - Incremental iterator advance

   The folio_batch support for zero range where the filesystem provides
   a batch of folios to process that might not be logically continguous
   requires more flexibility than the current offset based iteration
   currently offers.

   Update all iomap operations to advance the iterator within the
   operation and thus remove the need to advance from the core iomap
   iterator.

 - Make buffered writes work with RWF_DONTCACHE

   If RWF_DONTCACHE is set for a write, mark the folios being written as
   uncached. On writeback completion the pages will be dropped.

 - Introduce infrastructure for large atomic writes

   This will eventually be used by xfs and ext4.

* tag 'vfs-6.15-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
  iomap: rework IOMAP atomic flags
  iomap: comment on atomic write checks in iomap_dio_bio_iter()
  iomap: inline iomap_dio_bio_opflags()
  iomap: fix inline data on buffered read
  iomap: Lift blocksize restriction on atomic writes
  iomap: Support SW-based atomic writes
  iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW
  xfs: flag as supporting FOP_DONTCACHE
  iomap: make buffered writes work with RWF_DONTCACHE
  iomap: introduce a full map advance helper
  iomap: rename iomap_iter processed field to status
  iomap: remove unnecessary advance from iomap_iter()
  dax: advance the iomap_iter on pte and pmd faults
  dax: advance the iomap_iter on dedupe range
  dax: advance the iomap_iter on unshare range
  dax: advance the iomap_iter on zero range
  dax: push advance down into dax_iomap_iter() for read and write
  dax: advance the iomap_iter in the read/write path
  iomap: convert misc simple ops to incremental advance
  iomap: advance the iter on direct I/O
  ...
2025-03-24 10:19:31 -07:00
Andreas Gruenbacher
8cb70b91b2 gfs2: some comment clarifications
Since commit e1fa9ea85c ("gfs2: Stop using glock holder auto-demotion
for now"), we unconditionally drop the inode glock before trying to
fault in more pages.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-18 13:21:39 +01:00
Dan Carpenter
951d701ef1 gfs2: Fix a NULL vs IS_ERR() bug in gfs2_find_jhead()
The filemap_grab_folio() function doesn't return NULL, it returns error
pointers.  Fix the check to match.

Fixes: 4082976009 ("gfs2: Convert gfs2_find_jhead() to use a folio")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-12 13:00:28 +01:00
Matthew Wilcox (Oracle)
0776a508d1 gfs2: Convert gfs2_meta_read_endio() to use a folio
Switch from bio_for_each_segment_all() to bio_for_each_folio_all()
which removes a call to page_buffers().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:39 +01:00
Matthew Wilcox (Oracle)
536da2a440 gfs2: Convert gfs2_end_log_write_bh() to work on a folio
gfs2_end_log_write() has to handle bios which consist of both pages
which belong to folios and pages which were allocated from a mempool and
do not belong to a folio.  It would be cleaner to have separate endio
handlers which handle each type, but it's not clear to me whether that's
even possible.

This patch is slightly forward-looking in that page_folio() cannot
currently return NULL, but it will return NULL in the future for pages
which do not belong to a folio.

This was the last user of page_has_buffers(), so remove it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:39 +01:00
Matthew Wilcox (Oracle)
4082976009 gfs2: Convert gfs2_find_jhead() to use a folio
Remove a call to grab_cache_page() by using a folio throughout
this function.

[agruenba@redhat.com: Adjust to return value difference between
bio_add_page() and bio_add_folio().]

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:39 +01:00
Matthew Wilcox (Oracle)
e00307e8d4 gfs2: Convert gfs2_jhead_pg_srch() to gfs2_jhead_folio_search()
Pass in the folio instead of the page.  Add an assert that this is
not a large folio as we'd need a more complex solution if we wanted to
kmap() each page out of a large folio.  Removes a use of folio->page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
[agruenba@redhat.com: Rename gfs2_jhead_folio_srch() to gfs2_jhead_folio_search().]
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:39 +01:00
Matthew Wilcox (Oracle)
e6ff5f2089 gfs2: Use b_folio in gfs2_check_magic()
We are preparing to remove bh->b_page.  Use kmap_local_folio() instead
of kmap_local_page().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:39 +01:00
Matthew Wilcox (Oracle)
072d732c05 gfs2: Use b_folio in gfs2_submit_bhs()
Remove a reference to bh->b_page which is going to be removed soon.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:39 +01:00
Matthew Wilcox (Oracle)
3f2fc848be gfs2: Use b_folio in gfs2_trans_add_meta()
The lock bit is maintained on the folio, not on the page.  Saves two
calls to compound_head() as well as removing two references to
bh->b_page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:39 +01:00
Matthew Wilcox (Oracle)
6576742b90 gfs2: Use b_folio in gfs2_log_write_bh()
We are preparing to remove bh->b_page.  gfs2_log_write() should continue
to operate on pages as some of the memory being logged does not come
from folios, so convert from folio to page in this function.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:39 +01:00
Andreas Gruenbacher
41a8e04c94 gfs2: skip if we cannot defer delete
In gfs2_evict_inode(), in the unlikely case that we cannot defer
deleting the inode, it is not safe to fall back to deleting the inode;
the only valid choice we have is to skip the delete.

In addition, in evict_should_delete(), if we cannot lock the inode glock
exclusively, we are in a bad enough state that skipping the delete is
likely a better choice than trying to recover from the failure later.

Fixes: c5b7a2400e ("gfs2: Only defer deletes when we have an iopen glock")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:38 +01:00
Andreas Gruenbacher
79fe790a32 gfs2: remove redundant warnings
In glock_set_object() and glock_clear_object(), there is no need to
print the glock type and number when we dump the entire glock, anyway.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:38 +01:00
Andreas Gruenbacher
e9e38ed725 gfs2: minor evict fix
In evict_should_delete(), when gfs2_upgrade_iopen_glock() fails, we
detach the iopen glock from the inode without calling
glock_clear_object().  This leads to a warning in glock_set_object()
when the same inode is recreated and the glock is reused.
Fix that by only detaching the iopen glock in gfs2_evict_inode().

In addition, remove the dequeue code from evict_should_delete(); we
already perform a conditional dequeue in gfs2_evict_inode().

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:38 +01:00
Andreas Gruenbacher
9136cad723 gfs2: Prevent inode creation race (2)
In gfs2_try_evict(), we try grabbing the inode to evict, we try to evict
it, and then we try grabbing it again to see if it still exists.  There
is no guarantee that we will end up with the same inode both times; the
inode validity check that commit ffd1cf0443 ("gfs2: Prevent inode
creation race") added to the first grab is actually needed both times.

(To avoid code duplication, add a grab_existing_inode() helper.)

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:38 +01:00
Andreas Gruenbacher
6cb3b1c2df gfs2: Fix additional unlikely request cancelation race
In gfs2_glock_dq(), we must drop the glock spin lock before calling
->lm_cancel, but this means that in the meantime, the operation we are
trying to cancel could complete.  If the operation completes
unsuccessfully, another holder can end up at the head of the queue and
another ->lm_lock operation can get started.  In this case, we would end
up canceling that second operation by accident.

To prevent that, introduce a new GLF_CANCELING flag.  Set that flag in
gfs2_glock_dq() when trying to cancel an operation.  When seeing that
flag, finish_xmote() will then keep the GLF_LOCK flag set to prevent
other glock operations from taking place.  gfs2_glock_dq() then
completes the cancelation attempt by clearing GLF_LOCK and
GLF_CANCELING.

In addition, add a missing GLF_DEMOTE_IN_PROGRESS check in
gfs2_glock_dq() to make sure that we won't accidentally cancel a demote
request.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:38 +01:00
Andreas Gruenbacher
a431d49243 gfs2: Fix request cancelation bug
In finish_xmote(), when a locking request is canceled, the corresponding
holder is moved to the tail of the holders list instead of being
dequeued immediately.  When there is only a single holder, the canceled
locking request is then immediately repeated.  This makes no sense; it
looks like another remnant of LM_FLAG_PRIORITY support.

Instead, dequeue canceled holders and proceed with the next holder in
finish_xmote().  We can then easily detect in gfs2_glock_dq() when a
holder has been canceled.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:38 +01:00
Andreas Gruenbacher
d838605fea gfs2: Check for empty queue in run_queue
In run_queue(), check if the queue of pending requests is empty instead
of blindly assuming that it won't be.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:38 +01:00
Andreas Gruenbacher
0360faca5d gfs2: Remove more dead code in add_to_queue
Remove some more dead code in add_to_queue() that commit 0b93bac227
("gfs2: Remove LM_FLAG_PRIORITY flag") has rendered obsolete.  This is a
continuation of commit 3302764610057 ("gfs2: remove dead code in
add_to_queue"); no functional change.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:38 +01:00
Andreas Gruenbacher
3774f53d7f gfs2: Replace GIF_DEFER_DELETE with GLF_DEFER_DELETE
Having this flag attached to the iopen glock instead of the inode is
much simpler; it eliminates a protential weird race in gfs2_try_evict().

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:38 +01:00
Andreas Gruenbacher
f83f897614 gfs2: glock holder GL_NOPID fix
Glocks are always actively acquired by processes, but as indicated by
the GL_NOPID holder flag, some of them are then associated with objects
like cached inodes rather than the process that acquired them.  As such,
for those glock holders, it makes little sense to dump which processes
originally acquired them.

Therefore, gfs2 is trying to hide the identity of the processes that
acquired those glocks.  The code for doing that is incorrect though, so
fix it.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:38 +01:00
Andreas Gruenbacher
8bbfde0875 gfs2: Add GLF_PENDING_REPLY flag
Introduce a new GLF_PENDING_REPLY flag to indicate that a reply from DLM
is expected.  Include that flag in glock dumps to show more clearly
what's going on.  (When the GLF_PENDING_REPLY flag is set, the GLF_LOCK
flag will also be set but the GLF_LOCK flag alone isn't sufficient to
tell that we are waiting for a DLM reply.)

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:38 +01:00
Andreas Gruenbacher
5788253392 gfs2: Decode missing glock flags in tracepoints
Add a number of glock flags are currently not shown in the text form of
glock tracepoints.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 18:15:38 +01:00
NeilBrown
88d5baf690 Change inode_operations.mkdir to return struct dentry *
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have
complete control of sequencing on the actual filesystem (e.g.  on a
different server) and may find that the inode created for a mkdir
request already exists in the icache and dcache by the time the mkdir
request returns.  For example, if the filesystem is mounted twice the
directory could be visible on the other mount before it is on the
original mount, and a pair of name_to_handle_at(), open_by_handle_at()
calls could instantiate the directory inode with an IS_ROOT() dentry
before the first mkdir returns.

This means that the dentry passed to ->mkdir() may not be the one that
is associated with the inode after the ->mkdir() completes.  Some
callers need to interact with the inode after the ->mkdir completes and
they currently need to perform a lookup in the (rare) case that the
dentry is no longer hashed.

This lookup-after-mkdir requires that the directory remains locked to
avoid races.  Planned future patches to lock the dentry rather than the
directory will mean that this lookup cannot be performed atomically with
the mkdir.

To remove this barrier, this patch changes ->mkdir to return the
resulting dentry if it is different from the one passed in.
Possible returns are:
  NULL - the directory was created and no other dentry was used
  ERR_PTR() - an error occurred
  non-NULL - this other dentry was spliced in

This patch only changes file-systems to return "ERR_PTR(err)" instead of
"err" or equivalent transformations.  Subsequent patches will make
further changes to some file-systems to return a correct dentry.

Not all filesystems reliably result in a positive hashed dentry:

- NFS, cifs, hostfs will sometimes need to perform a lookup of
  the name to get inode information.  Races could result in this
  returning something different. Note that this lookup is
  non-atomic which is what we are trying to avoid.  Placing the
  lookup in filesystem code means it only happens when the filesystem
  has no other option.
- kernfs and tracefs leave the dentry negative and the ->revalidate
  operation ensures that lookup will be called to correctly populate
  the dentry.  This could be fixed but I don't think it is important
  to any of the users of vfs_mkdir() which look at the dentry.

The recommendation to use
    d_drop();d_splice_alias()
is ugly but fits with current practice.  A planned future patch will
change this.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27 20:00:17 +01:00
Andreas Gruenbacher
bb504b4d64 lockref: remove count argument of lockref_init
All users of lockref_init() now initialize the count to 1, so hardcode
that and remove the count argument.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Link: https://lore.kernel.org/r/20250130135624.1899988-4-agruenba@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-07 10:27:25 +01:00
Andreas Gruenbacher
34ad6fa2ad gfs2: switch to lockref_init(..., 1)
In qd_alloc(), initialize the lockref count to 1 to cover the common
case.  Compensate for that in gfs2_quota_init() by adjusting the count
back down to 0; this only occurs when mounting the filesystem rw.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Link: https://lore.kernel.org/r/20250130135624.1899988-3-agruenba@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-07 10:27:25 +01:00
Andreas Gruenbacher
d9b3a3c70d gfs2: use lockref_init for gl_lockref
Move the initialization of gl_lockref from gfs2_init_glock_once() to
gfs2_glock_get().  This allows to use lockref_init() there.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Link: https://lore.kernel.org/r/20250130135624.1899988-2-agruenba@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-07 10:27:25 +01:00
Christoph Hellwig
c6d1b8d154 iomap: pass private data to iomap_zero_range
Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250206064035.2323428-11-hch@lst.de
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-06 13:02:15 +01:00
Linus Torvalds
d3d90cc289 Merge tag 'pull-revalidate' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs d_revalidate updates from Al Viro:
 "Provide stable parent and name to ->d_revalidate() instances

  Most of the filesystem methods where we care about dentry name and
  parent have their stability guaranteed by the callers;
  ->d_revalidate() is the major exception.

  It's easy enough for callers to supply stable values for expected name
  and expected parent of the dentry being validated. That kills quite a
  bit of boilerplate in ->d_revalidate() instances, along with a bunch
  of races where they used to access ->d_name without sufficient
  precautions"

* tag 'pull-revalidate' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  9p: fix ->rename_sem exclusion
  orangefs_d_revalidate(): use stable parent inode and name passed by caller
  ocfs2_dentry_revalidate(): use stable parent inode and name passed by caller
  nfs: fix ->d_revalidate() UAF on ->d_name accesses
  nfs{,4}_lookup_validate(): use stable parent inode passed by caller
  gfs2_drevalidate(): use stable parent inode and name passed by caller
  fuse_dentry_revalidate(): use stable parent inode and name passed by caller
  vfat_revalidate{,_ci}(): use stable parent inode passed by caller
  exfat_d_revalidate(): use stable parent inode passed by caller
  fscrypt_d_revalidate(): use stable parent inode passed by caller
  ceph_d_revalidate(): propagate stable name down into request encoding
  ceph_d_revalidate(): use stable parent inode passed by caller
  afs_d_revalidate(): use stable name and parent inode passed by caller
  Pass parent directory inode and expected name to ->d_revalidate()
  generic_ci_d_compare(): use shortname_storage
  ext4 fast_commit: make use of name_snapshot primitives
  dissolve external_name.u into separate members
  make take_dentry_name_snapshot() lockless
  dcache: back inline names with a struct-wrapped array of unsigned long
  make sure that DNAME_INLINE_LEN is a multiple of word size
2025-01-30 09:13:35 -08:00
Al Viro
eab2a11e5b gfs2_drevalidate(): use stable parent inode and name passed by caller
No need to mess with dget_parent() for the former; for the latter we really should
not rely upon ->d_name.name remaining stable.  Theoretically a UAF, but it's
hard to exfiltrate the information...

Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27 19:25:24 -05:00
Al Viro
5be1fa8abd Pass parent directory inode and expected name to ->d_revalidate()
->d_revalidate() often needs to access dentry parent and name; that has
to be done carefully, since the locking environment varies from caller
to caller.  We are not guaranteed that dentry in question will not be
moved right under us - not unless the filesystem is such that nothing
on it ever gets renamed.

It can be dealt with, but that results in boilerplate code that isn't
even needed - the callers normally have just found the dentry via dcache
lookup and want to verify that it's in the right place; they already
have the values of ->d_parent and ->d_name stable.  There is a couple
of exceptions (overlayfs and, to less extent, ecryptfs), but for the
majority of calls that song and dance is not needed at all.

It's easier to make ecryptfs and overlayfs find and pass those values if
there's a ->d_revalidate() instance to be called, rather than doing that
in the instances.

This commit only changes the calling conventions; making use of supplied
values is left to followups.

NOTE: some instances need more than just the parent - things like CIFS
may need to build an entire path from filesystem root, so they need
more precautions than the usual boilerplate.  This series doesn't
do anything to that need - these filesystems have to keep their locking
mechanisms (rename_lock loops, use of dentry_path_raw(), private rwsem
a-la v9fs).

One thing to keep in mind when using name is that name->name will normally
point into the pathname being resolved; the filename in question occupies
name->len bytes starting at name->name, and there is NUL somewhere after it,
but it the next byte might very well be '/' rather than '\0'.  Do not
ignore name->len.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27 19:25:23 -05:00
Linus Torvalds
1851bccf60 Merge tag 'gfs2-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:

 - In the quota code, to avoid spurious audit messages, don't call
   capable() when quotas are off

 - When changing the 'j' flag of an inode, truncate the inode address
   space to avoid mixing "buffer head" and "iomap" pages

* tag 'gfs2-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Truncate address space when flipping GFS2_DIF_JDATA flag
  gfs2: reorder capability check last
2025-01-20 13:06:28 -08:00
Christoph Hellwig
3e652eba24 gfs2: use lockref_init for qd_lockref
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250115094702.504610-9-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-16 11:48:12 +01:00
Andreas Gruenbacher
7c9d922380 gfs2: Truncate address space when flipping GFS2_DIF_JDATA flag
Truncate an inode's address space when flipping the GFS2_DIF_JDATA flag:
depending on that flag, the pages in the address space will either use
buffer heads or iomap_folio_state structs, and we cannot mix the two.

Reported-by: Kun Hu <huk23@m.fudan.edu.cn>, Jiaji Qin <jjtan24@m.fudan.edu.cn>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-01-14 18:54:08 +01:00
Christian Göttsche
ead64b20f1 gfs2: reorder capability check last
capable() calls refer to enabled LSMs whether to permit or deny the
request.  This is relevant in connection with SELinux, where a
capability check results in a policy decision and by default a denial
message on insufficient permission is issued.
It can lead to three undesired cases:
  1. A denial message is generated, even in case the operation was an
     unprivileged one and thus the syscall succeeded, creating noise.
  2. To avoid the noise from 1. the policy writer adds a rule to ignore
     those denial messages, hiding future syscalls, where the task
     performs an actual privileged operation, leading to hidden limited
     functionality of that task.
  3. To avoid the noise from 1. the policy writer adds a rule to permit
     the task the requested capability, while it does not need it,
     violating the principle of least privilege.

Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-12-09 10:44:35 +01:00
Linus Torvalds
ff2a7a064a Merge tag 'gfs2-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:

 - Fix the code that cleans up left-over unlinked files.

   Various fixes and minor improvements in deleting files cached or held
   open remotely.

 - Simplify the use of dlm's DLM_LKF_QUECVT flag.

 - A few other minor cleanups.

* tag 'gfs2-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (21 commits)
  gfs2: Prevent inode creation race
  gfs2: Only defer deletes when we have an iopen glock
  gfs2: Simplify DLM_LKF_QUECVT use
  gfs2: gfs2_evict_inode clarification
  gfs2: Make gfs2_inode_refresh static
  gfs2: Use get_random_u32 in gfs2_orlov_skip
  gfs2: Randomize GLF_VERIFY_DELETE work delay
  gfs2: Use mod_delayed_work in gfs2_queue_try_to_evict
  gfs2: Update to the evict / remote delete documentation
  gfs2: Call gfs2_queue_verify_delete from gfs2_evict_inode
  gfs2: Clean up delete work processing
  gfs2: Minor delete_work_func cleanup
  gfs2: Return enum evict_behavior from gfs2_upgrade_iopen_glock
  gfs2: Rename dinode_demise to evict_behavior
  gfs2: Rename GIF_{DEFERRED -> DEFER}_DELETE
  gfs2: Faster gfs2_upgrade_iopen_glock wakeups
  KMSAN: uninit-value in inode_go_dump (5)
  gfs2: Fix unlinked inode cleanup
  gfs2: Allow immediate GLF_VERIFY_DELETE work
  gfs2: Initialize gl_no_formal_ino earlier
  ...
2024-11-26 12:34:50 -08:00
Linus Torvalds
5c00ff742b Merge tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:

 - The series "zram: optimal post-processing target selection" from
   Sergey Senozhatsky improves zram's post-processing selection
   algorithm. This leads to improved memory savings.

 - Wei Yang has gone to town on the mapletree code, contributing several
   series which clean up the implementation:
	- "refine mas_mab_cp()"
	- "Reduce the space to be cleared for maple_big_node"
	- "maple_tree: simplify mas_push_node()"
	- "Following cleanup after introduce mas_wr_store_type()"
	- "refine storing null"

 - The series "selftests/mm: hugetlb_fault_after_madv improvements" from
   David Hildenbrand fixes this selftest for s390.

 - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
   implements some rationaizations and cleanups in the page mapping
   code.

 - The series "mm: optimize shadow entries removal" from Shakeel Butt
   optimizes the file truncation code by speeding up the handling of
   shadow entries.

 - The series "Remove PageKsm()" from Matthew Wilcox completes the
   migration of this flag over to being a folio-based flag.

 - The series "Unify hugetlb into arch_get_unmapped_area functions" from
   Oscar Salvador implements a bunch of consolidations and cleanups in
   the hugetlb code.

 - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
   takes away the wp-fault time practice of turning a huge zero page
   into small pages. Instead we replace the whole thing with a THP. More
   consistent cleaner and potentiall saves a large number of pagefaults.

 - The series "percpu: Add a test case and fix for clang" from Andy
   Shevchenko enhances and fixes the kernel's built in percpu test code.

 - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
   optimizes mremap() by avoiding doing things which we didn't need to
   do.

 - The series "Improve the tmpfs large folio read performance" from
   Baolin Wang teaches tmpfs to copy data into userspace at the folio
   size rather than as individual pages. A 20% speedup was observed.

 - The series "mm/damon/vaddr: Fix issue in
   damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON
   splitting.

 - The series "memcg-v1: fully deprecate charge moving" from Shakeel
   Butt removes the long-deprecated memcgv2 charge moving feature.

 - The series "fix error handling in mmap_region() and refactor" from
   Lorenzo Stoakes cleanup up some of the mmap() error handling and
   addresses some potential performance issues.

 - The series "x86/module: use large ROX pages for text allocations"
   from Mike Rapoport teaches x86 to use large pages for
   read-only-execute module text.

 - The series "page allocation tag compression" from Suren Baghdasaryan
   is followon maintenance work for the new page allocation profiling
   feature.

 - The series "page->index removals in mm" from Matthew Wilcox remove
   most references to page->index in mm/. A slow march towards shrinking
   struct page.

 - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
   interface tests" from Andrew Paniakin performs maintenance work for
   DAMON's self testing code.

 - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
   improves zswap's batching of compression and decompression. It is a
   step along the way towards using Intel IAA hardware acceleration for
   this zswap operation.

 - The series "kasan: migrate the last module test to kunit" from
   Sabyrzhan Tasbolatov completes the migration of the KASAN built-in
   tests over to the KUnit framework.

 - The series "implement lightweight guard pages" from Lorenzo Stoakes
   permits userapace to place fault-generating guard pages within a
   single VMA, rather than requiring that multiple VMAs be created for
   this. Improved efficiencies for userspace memory allocators are
   expected.

 - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
   tracepoints to provide increased visibility into memcg stats flushing
   activity.

 - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
   fixes a zram buglet which potentially affected performance.

 - The series "mm: add more kernel parameters to control mTHP" from
   Maíra Canal enhances our ability to control/configuremultisize THP
   from the kernel boot command line.

 - The series "kasan: few improvements on kunit tests" from Sabyrzhan
   Tasbolatov has a couple of fixups for the KASAN KUnit tests.

 - The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
   from Kairui Song optimizes list_lru memory utilization when lockdep
   is enabled.

* tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits)
  cma: enforce non-zero pageblock_order during cma_init_reserved_mem()
  mm/kfence: add a new kunit test test_use_after_free_read_nofault()
  zram: fix NULL pointer in comp_algorithm_show()
  memcg/hugetlb: add hugeTLB counters to memcg
  vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event
  mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount
  zram: ZRAM_DEF_COMP should depend on ZRAM
  MAINTAINERS/MEMORY MANAGEMENT: add document files for mm
  Docs/mm/damon: recommend academic papers to read and/or cite
  mm: define general function pXd_init()
  kmemleak: iommu/iova: fix transient kmemleak false positive
  mm/list_lru: simplify the list_lru walk callback function
  mm/list_lru: split the lock to per-cgroup scope
  mm/list_lru: simplify reparenting and initial allocation
  mm/list_lru: code clean up for reparenting
  mm/list_lru: don't export list_lru_add
  mm/list_lru: don't pass unnecessary key parameters
  kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller
  kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW
  kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols
  ...
2024-11-23 09:58:07 -08:00
Andreas Gruenbacher
ffd1cf0443 gfs2: Prevent inode creation race
When a request to evict an inode comes in over the network, we are
trying to grab an inode reference via the iopen glock's gl_object
pointer.  There is a very small probability that by the time such a
request comes in, inode creation hasn't completed and the I_NEW flag is
still set.  To deal with that, wait for the inode and then check if
inode creation was successful.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-11-19 13:05:41 +01:00
Andreas Gruenbacher
c5b7a2400e gfs2: Only defer deletes when we have an iopen glock
The mechanism to defer deleting unlinked inodes is tied to
delete_work_func(), which is tied to iopen glocks.  When we don't have
an iopen glock, we must carry out deletes immediately instead.

Fixes a NULL pointer dereference in gfs2_evict_inode().

Fixes: 8c21c2c71e ("gfs2: Call gfs2_queue_verify_delete from gfs2_evict_inode")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-11-19 12:33:20 +01:00
Linus Torvalds
4c797b11a8 Merge tag 'vfs-6.13.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs file updates from Christian Brauner:
 "This contains changes the changes for files for this cycle:

   - Introduce a new reference counting mechanism for files.

     As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop
     it has O(N^2) behaviour under contention with N concurrent
     operations and it is in a hot path in __fget_files_rcu().

     The rcuref infrastructures remedies this problem by using an
     unconditional increment relying on safe- and dead zones to make
     this work and requiring rcu protection for the data structure in
     question. This not just scales better it also introduces overflow
     protection.

     However, in contrast to generic rcuref, files require a memory
     barrier and thus cannot rely on *_relaxed() atomic operations and
     also require to be built on atomic_long_t as having massive amounts
     of reference isn't unheard of even if it is just an attack.

     This adds a file specific variant instead of making this a generic
     library.

     This has been tested by various people and it gives consistent
     improvement up to 3-5% on workloads with loads of threads.

   - Add a fastpath for find_next_zero_bit(). Skip 2-levels searching
     via find_next_zero_bit() when there is a free slot in the word that
     contains the next fd. This improves pts/blogbench-1.1.0 read by 8%
     and write by 4% on Intel ICX 160.

   - Conditionally clear full_fds_bits since it's very likely that a bit
     in full_fds_bits has been cleared during __clear_open_fds(). This
     improves pts/blogbench-1.1.0 read up to 13%, and write up to 5% on
     Intel ICX 160.

   - Get rid of all lookup_*_fdget_rcu() variants. They were used to
     lookup files without taking a reference count. That became invalid
     once files were switched to SLAB_TYPESAFE_BY_RCU and now we're
     always taking a reference count. Switch to an already existing
     helper and remove the legacy variants.

   - Remove pointless includes of <linux/fdtable.h>.

   - Avoid cmpxchg() in close_files() as nobody else has a reference to
     the files_struct at that point.

   - Move close_range() into fs/file.c and fold __close_range() into it.

   - Cleanup calling conventions of alloc_fdtable() and expand_files().

   - Merge __{set,clear}_close_on_exec() into one.

   - Make __set_open_fd() set cloexec as well instead of doing it in two
     separate steps"

* tag 'vfs-6.13.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  selftests: add file SLAB_TYPESAFE_BY_RCU recycling stressor
  fs: port files to file_ref
  fs: add file_ref
  expand_files(): simplify calling conventions
  make __set_open_fd() set cloexec state as well
  fs: protect backing files with rcu
  file.c: merge __{set,clear}_close_on_exec()
  alloc_fdtable(): change calling conventions.
  fs/file.c: add fast path in find_next_fd()
  fs/file.c: conditionally clear full_fds
  fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
  move close_range(2) into fs/file.c, fold __close_range() into it
  close_files(): don't bother with xchg()
  remove pointless includes of <linux/fdtable.h>
  get rid of ...lookup...fdget_rcu() family
2024-11-18 10:30:29 -08:00
Kairui Song
da0c02516c mm/list_lru: simplify the list_lru walk callback function
Now isolation no longer takes the list_lru global node lock, only use the
per-cgroup lock instead.  And this lock is inside the list_lru_one being
walked, no longer needed to pass the lock explicitly.

Link: https://lkml.kernel.org/r/20241104175257.60853-7-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11 17:22:26 -08:00
Andreas Gruenbacher
b6900ce151 gfs2: Simplify DLM_LKF_QUECVT use
The DLM_LKF_QUECVT flag needs to be set for "upward" lock conversions to
ensure fairness, but setting it for "downward" lock conversions will
lead to a failure.  The flag is currently set based on the GLF_BLOCKING
flag and it's not immediately obvious why this is correct.  Simplify
things by figuring out if a lock conversion is "upward" by looking at
the before and after locking modes instead of relying on the
GLF_BLOCKING flag.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-11-05 12:39:29 +01:00
Andreas Gruenbacher
03ff3781bf gfs2: gfs2_evict_inode clarification
When function evict_should_delete() returns SHOULD_DEFER_EVICTION, gh is
never initialized, but that isn't obvious; if it did initialize gh and
then return SHOULD_DEFER_EVICTION, gfs2_evict_inode() would fail to
release it.  To clarify the code, change gfs2_evict_inode() to always
check if gh needs to be released, no matter what evict_should_delete()
returns.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-11-05 12:39:29 +01:00
Andreas Gruenbacher
70cddf16cb gfs2: Make gfs2_inode_refresh static
Function gfs2_inode_refresh() is only used in fs/gfs2/glops.c.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-11-05 12:39:29 +01:00
Andreas Gruenbacher
0c5bee608f gfs2: Use get_random_u32 in gfs2_orlov_skip
Use get_random_u32() instead of get_random_bytes() to remove the last
remaining call to get_random_bytes().

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-11-05 12:39:29 +01:00
Andreas Gruenbacher
085e423b4d gfs2: Randomize GLF_VERIFY_DELETE work delay
Randomize the delay of GLF_VERIFY_DELETE work.  This avoids thundering
herd problems when multiple nodes schedule that kind of work in response
to an inode being unlinked remotely.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-11-05 12:39:29 +01:00
Andreas Gruenbacher
f6ca45e3d2 gfs2: Use mod_delayed_work in gfs2_queue_try_to_evict
In the unlikely case that we're trying to queue GLF_TRY_TO_EVICT work
for an inode that already has GLF_VERIFY_DELETE work queued, we want to
make sure that the GLF_TRY_TO_EVICT work gets scheduled immediately
instead of waiting for the delayed work timer to expire.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-11-05 12:39:29 +01:00