Commit Graph

516 Commits

Author SHA1 Message Date
Darrick J. Wong
f621324dfb iomap: fix lockdep complaint when reads fail
Zorro Lang reported the following lockdep splat:

"While running fstests xfs/556 on kernel 7.0.0-rc4+ (HEAD=04a9f1766954),
a lockdep warning was triggered indicating an inconsistent lock state
for sb->s_type->i_lock_key.

"The deadlock might occur because iomap_read_end_io (called from a
hardware interrupt completion path) invokes fserror_report, which then
calls igrab.  igrab attempts to acquire the i_lock spinlock. However,
the i_lock is frequently acquired in process context with interrupts
enabled. If an interrupt occurs while a process holds the i_lock, and
that interrupt handler calls fserror_report, the system deadlocks.

"I hit this warning several times by running xfs/556 (mostly) or
generic/648 on xfs. More details refer to below console log."

along with this dmesg, for which I've cleaned up the stacktraces:

 run fstests xfs/556 at 2026-03-18 20:05:30
 XFS (sda3): Mounting V5 Filesystem 396e9164-c45a-4e05-be9d-b38c2c5c6477
 XFS (sda3): Ending clean mount
 XFS (sda3): Unmounting Filesystem 396e9164-c45a-4e05-be9d-b38c2c5c6477
 XFS (sda3): Mounting V5 Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
 XFS (sda3): Ending clean mount
 XFS (sda3): Unmounting Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
 XFS (dm-0): Mounting V5 Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
 XFS (dm-0): Ending clean mount
 device-mapper: table: 253:0: adding target device (start sect 209 len 1) caused an alignment inconsistency
 device-mapper: table: 253:0: adding target device (start sect 210 len 62914350) caused an alignment inconsistency
 buffer_io_error: 6 callbacks suppressed
 Buffer I/O error on dev dm-0, logical block 209, async page read
 Buffer I/O error on dev dm-0, logical block 209, async page read
 XFS (dm-0): Unmounting Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
 XFS (dm-0): Mounting V5 Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
 XFS (dm-0): Ending clean mount

 ================================
 WARNING: inconsistent lock state
 7.0.0-rc4+ #1 Tainted: G S      W
 --------------------------------
 inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
 od/2368602 [HC1[1]:SC0[0]:HE0:SE1] takes:
 ff1100069f2b4a98 (&sb->s_type->i_lock_key#31){?.+.}-{3:3}, at: igrab+0x28/0x1a0
 {HARDIRQ-ON-W} state was registered at:
   __lock_acquire+0x40d/0xbd0
   lock_acquire.part.0+0xbd/0x260
   _raw_spin_lock+0x37/0x80
   unlock_new_inode+0x66/0x2a0
   xfs_iget+0x67b/0x7b0 [xfs]
   xfs_mountfs+0xde4/0x1c80 [xfs]
   xfs_fs_fill_super+0xe86/0x17a0 [xfs]
   get_tree_bdev_flags+0x312/0x590
   vfs_get_tree+0x8d/0x2f0
   vfs_cmd_create+0xb2/0x240
   __do_sys_fsconfig+0x3d8/0x9a0
   do_syscall_64+0x13a/0x1520
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
 irq event stamp: 3118
 hardirqs last  enabled at (3117): [<ffffffffb54e4ad8>] _raw_spin_unlock_irq+0x28/0x50
 hardirqs last disabled at (3118): [<ffffffffb54b84c9>] common_interrupt+0x19/0xe0
 softirqs last  enabled at (3040): [<ffffffffb290ca28>] handle_softirqs+0x6b8/0x950
 softirqs last disabled at (3023): [<ffffffffb290ce4d>] __irq_exit_rcu+0xfd/0x250

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&sb->s_type->i_lock_key#31);
   <Interrupt>
     lock(&sb->s_type->i_lock_key#31);

  *** DEADLOCK ***

 1 lock held by od/2368602:
  #0: ff1100069f2b4b58 (&sb->s_type->i_mutex_key#19){++++}-{4:4}, at: xfs_ilock+0x324/0x4b0 [xfs]

 stack backtrace:
 CPU: 15 UID: 0 PID: 2368602 Comm: od Kdump: loaded Tainted: G S      W           7.0.0-rc4+ #1 PREEMPT(full)
 Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
 Hardware name: Dell Inc. PowerEdge R660/0R5JJC, BIOS 2.1.5 03/14/2024
 Call Trace:
  <IRQ>
  dump_stack_lvl+0x6f/0xb0
  print_usage_bug.part.0+0x230/0x2c0
  mark_lock_irq+0x3ce/0x5b0
  mark_lock+0x1cb/0x3d0
  mark_usage+0x109/0x120
  __lock_acquire+0x40d/0xbd0
  lock_acquire.part.0+0xbd/0x260
  _raw_spin_lock+0x37/0x80
  igrab+0x28/0x1a0
  fserror_report+0x127/0x2d0
  iomap_finish_folio_read+0x13c/0x280
  iomap_read_end_io+0x10e/0x2c0
  clone_endio+0x37e/0x780 [dm_mod]
  blk_update_request+0x448/0xf00
  scsi_end_request+0x74/0x750
  scsi_io_completion+0xe9/0x7c0
  _scsih_io_done+0x6ba/0x1ca0 [mpt3sas]
  _base_process_reply_queue+0x249/0x15b0 [mpt3sas]
  _base_interrupt+0x95/0xe0 [mpt3sas]
  __handle_irq_event_percpu+0x1f0/0x780
  handle_irq_event+0xa9/0x1c0
  handle_edge_irq+0x2ef/0x8a0
  __common_interrupt+0xa0/0x170
  common_interrupt+0xb7/0xe0
  </IRQ>
  <TASK>
  asm_common_interrupt+0x26/0x40
 RIP: 0010:_raw_spin_unlock_irq+0x2e/0x50
 Code: 0f 1f 44 00 00 53 48 8b 74 24 08 48 89 fb 48 83 c7 18 e8 b5 73 5e fd 48 89 df e8 ed e2 5e fd e8 08 78 8f fd fb bf 01 00 00 00 <e8> 8d 56 4d fd 65 8b 05 46 d5 1d 03 85 c0 74 06 5b c3 cc cc cc cc
 RSP: 0018:ffa0000027d07538 EFLAGS: 00000206
 RAX: 0000000000000c2d RBX: ffffffffb6614bc8 RCX: 0000000000000080
 RDX: 0000000000000000 RSI: ffffffffb6306a01 RDI: 0000000000000001
 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
 R10: ffffffffb75efc67 R11: 0000000000000001 R12: ff1100015ada0000
 R13: 0000000000000083 R14: 0000000000000002 R15: ffffffffb6614c10
  folio_wait_bit_common+0x407/0x780
  filemap_update_page+0x8e7/0xbd0
  filemap_get_pages+0x904/0xc50
  filemap_read+0x320/0xc20
  xfs_file_buffered_read+0x2aa/0x380 [xfs]
  xfs_file_read_iter+0x263/0x4a0 [xfs]
  vfs_read+0x6cb/0xb70
  ksys_read+0xf9/0x1d0
  do_syscall_64+0x13a/0x1520

Zorro's diagnosis makes sense, so the solution is to kick the failed
read handling to a workqueue much like we added for writeback ioends in
commit 294f54f849 ("fserror: fix lockdep complaint when igrabbing
inode").
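A rough userspace sketch of that deferral pattern follows. It is an illustration only, not the actual iomap change: the struct and function names are hypothetical, and a kernel version would need an irq-safe lock or lock-free list plus a real work item. The completion side only links a record onto a pending list; the worker, running later in process context, does the reporting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record describing one failed read; not a kernel structure. */
struct failed_read {
	struct failed_read *next;
	long long pos;
	int error;
};

/* Pending records, newest first. A kernel version would protect this list. */
struct failed_read *pending_reads;

/* "Interrupt" side: only link the record, take no i_lock-like locks. */
void defer_failed_read(struct failed_read *fr)
{
	assert(fr != NULL);
	fr->next = pending_reads;
	pending_reads = fr;
}

/* "Worker" side: detach the list and report each entry; returns the count. */
int drain_failed_reads(void (*report)(const struct failed_read *fr))
{
	struct failed_read *fr = pending_reads;
	int handled = 0;

	pending_reads = NULL;
	while (fr) {
		struct failed_read *next = fr->next;

		if (report)
			report(fr);
		handled++;
		fr = next;
	}
	return handled;
}
```

The point of the split is that anything needing a lock that is also taken with interrupts enabled happens only in the drain step.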

Cc: Zorro Lang <zlang@redhat.com>
Link: https://lore.kernel.org/linux-xfs/20260319194303.efw4wcu7c4idhthz@doltdoltdolt/
Fixes: a9d573ee88 ("iomap: report file I/O errors to the VFS")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Link: https://patch.msgid.link/20260323210017.GL6223@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-24 09:14:46 +01:00
Joanne Koong
bd71fb3fea iomap: fix invalid folio access when i_blkbits differs from I/O granularity
Commit aa35dd5cbc ("iomap: fix invalid folio access after
folio_end_read()") partially addressed invalid folio access for folios
without an ifs attached, but it did not handle the case where
1 << inode->i_blkbits matches the folio size but is different from the
granularity used for the IO, which means IO can be submitted for less
than the full folio for the !ifs case.

In this case, the condition:

  if (*bytes_submitted == folio_len)
    ctx->cur_folio = NULL;

in iomap_read_folio_iter() will not invalidate ctx->cur_folio, and
iomap_read_end() will still be called on the folio even though the IO
helper owns it and will finish the read on it.

Fix this by unconditionally invalidating ctx->cur_folio for the !ifs
case.

Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/linux-fsdevel/b3dfe271-4e3d-4922-b618-e73731242bca@wdc.com/
Fixes: b2f35ac414 ("iomap: add caller-provided callbacks for read and readahead")
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20260317203935.830549-1-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-18 10:42:08 +01:00
Darrick J. Wong
d320f160aa iomap: reject delalloc mappings during writeback
Filesystems should never provide a delayed allocation mapping to
writeback; they're supposed to allocate the space before replying.
This can lead to weird IO errors and crashes in the block layer if the
filesystem is being malicious, or if it hadn't set iomap->dev because
it's a delalloc mapping.

Fix this by failing writeback on delalloc mappings.  Currently no
filesystems actually misbehave in this manner, but we ought to be
stricter about things like that.
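The check can be pictured with a small standalone sketch. The enum, the function name, and the choice of -EIO are assumptions made for illustration; the real patch may use different names and a different error code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the mapping types a filesystem may return. */
enum mapping_type {
	MAPPING_HOLE,
	MAPPING_DELALLOC,
	MAPPING_MAPPED,
	MAPPING_UNWRITTEN,
	MAPPING_INLINE,
};

/* Return 0 if writeback may proceed, or a negative errno to fail it. */
int check_writeback_mapping(enum mapping_type type)
{
	/* Callers are expected to pass one of the enumerators above. */
	assert(type <= MAPPING_INLINE);

	/* Space must already be allocated by the time writeback sees it. */
	if (type == MAPPING_DELALLOC)
		return -EIO;	/* errno choice is an assumption */
	return 0;
}
```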

Cc: stable@vger.kernel.org # v5.5
Fixes: 598ecfbaa7 ("iomap: lift the xfs writeback code to iomap")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/20260302173002.GL13829@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-04 14:31:56 +01:00
Joanne Koong
debc1a492b iomap: don't mark folio uptodate if read IO has bytes pending
If a folio has ifs metadata attached to it and the folio is partially
read in through an async IO helper with the rest of it then being read
in through post-EOF zeroing or as inline data, and the helper
successfully finishes the read first, then post-EOF zeroing / reading
inline will mark the folio as uptodate in iomap_set_range_uptodate().

This is a problem because when the read completion path later calls
iomap_read_end(), it will call folio_end_read(), which sets the uptodate
bit using XOR semantics. Calling folio_end_read() on a folio that was
already marked uptodate clears the uptodate bit.

Fix this by not marking the folio as uptodate if the read IO has bytes
pending. The folio uptodate state will be set in the read completion
path through iomap_read_end() -> folio_end_read().
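The XOR hazard can be shown with a tiny userspace model. The flag names and end_read() below are made up for the sketch and only mimic the "flip uptodate while dropping the lock" behaviour described above.

```c
#include <assert.h>
#include <stdbool.h>

#define FLAG_LOCKED	(1u << 0)
#define FLAG_UPTODATE	(1u << 1)

/* Model of ending a read: one XOR drops "locked" and flips "uptodate". */
unsigned int end_read(unsigned int flags, bool success)
{
	unsigned int mask = FLAG_LOCKED;

	assert(flags & FLAG_LOCKED);	/* the read path holds the lock */
	if (success)
		mask |= FLAG_UPTODATE;
	return flags ^ mask;
}
```

Starting from FLAG_LOCKED alone, a successful end_read() leaves only FLAG_UPTODATE set. Starting from FLAG_LOCKED | FLAG_UPTODATE, the same call clears both bits, which is the lost-uptodate case the fix avoids.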

Reported-by: Wei Gao <wegao@suse.com>
Suggested-by: Sasha Levin <sashal@kernel.org>
Tested-by: Wei Gao <wegao@suse.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: stable@vger.kernel.org # v6.19
Link: https://lore.kernel.org/linux-fsdevel/aYbmy8JdgXwsGaPP@autotest-wegao.qe.prg2.suse.org/
Fixes: b2f35ac414 ("iomap: add caller-provided callbacks for read and readahead")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20260303233420.874231-2-joannelkoong@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-04 14:18:54 +01:00
Darrick J. Wong
cd3c877d04 iomap: don't report direct-io retries to fserror
iomap's directio implementation has two magic errno codes that it uses
to signal callers -- ENOTBLK tells the filesystem that it should retry
a write with the pagecache; and EAGAIN tells the caller that pagecache
flushing or invalidation failed and that it should try again.

Neither of these indicate data loss, so let's not report them.
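A filter along these lines might look like the sketch below. The helper name is hypothetical and the real patch may structure the check differently; only the two errno values come from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Decide whether a direct I/O result should be reported as an I/O error. */
bool dio_error_is_reportable(int error)
{
	assert(error <= 0);	/* kernel-style: 0 or a negative errno */

	switch (error) {
	case 0:			/* success */
	case -ENOTBLK:		/* caller should retry through the pagecache */
	case -EAGAIN:		/* flush/invalidation failed, try again */
		return false;
	default:
		return true;
	}
}
```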

Fixes: a9d573ee88 ("iomap: report file I/O errors to the VFS")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/20260224154637.GD2390381@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-02-26 09:23:22 +01:00
Linus Torvalds
0e335a7745 Merge tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:

 - Fix an uninitialized variable in file_getattr().

   The flags_valid field wasn't initialized before calling
   vfs_fileattr_get(), triggering KMSAN uninit-value reports in fuse

 - Fix writeback wakeup and logging timeouts when DETECT_HUNG_TASK is
   not enabled.

   sysctl_hung_task_timeout_secs is 0 in that case causing spurious
   "waiting for writeback completion for more than 1 seconds" warnings

 - Fix a null-ptr-deref in do_statmount() when the mount is internal

 - Add missing kernel-doc description for the @private parameter in
   iomap_readahead()

 - Fix mount namespace creation to hold namespace_sem across the mount
   copy in create_new_namespace().

   The previous drop-and-reacquire pattern was fragile and failed to
   clean up mount propagation links if the real rootfs was a shared or
   dependent mount

 - Fix /proc mount iteration where m->index wasn't updated when
   m->show() overflows, causing a restart to repeatedly show the same
   mount entry in a rapidly expanding mount table

 - Return EFSCORRUPTED instead of ENOSPC in minix_new_inode() when the
   inode number is out of range

 - Fix unshare(2) when CLONE_NEWNS is set and current->fs isn't shared.

   copy_mnt_ns() received the live fs_struct so if a subsequent
   namespace creation failed the rollback would leave pwd and root
   pointing to detached mounts. Always allocate a new fs_struct when
   CLONE_NEWNS is requested

 - fserror bug fixes:

    - Remove the unused fsnotify_sb_error() helper now that all callers
      have been converted to fserror_report_metadata

    - Fix a lockdep splat in fserror_report() where igrab() takes
      inode::i_lock which can be held in IRQ context.

      Replace igrab() with a direct i_count bump since filesystems
      should not report inodes that are about to be freed or not yet
      exposed

 - Handle error pointer in procfs for try_lookup_noperm()

 - Fix an integer overflow in ep_loop_check_proc() where recursive calls
   returning INT_MAX would overflow when +1 is added, breaking the
   recursion depth check

 - Fix a misleading break in pidfs
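
To illustrate the ep_loop_check_proc() item above: adding 1 to a depth that is already INT_MAX overflows a signed int. One way to avoid that is a saturating increment, sketched here with a hypothetical helper; the actual eventpoll fix may take a different approach.

```c
#include <assert.h>
#include <limits.h>

/* Add one nesting level without wrapping past INT_MAX. */
int depth_plus_one(int depth)
{
	assert(depth >= 0);
	return depth == INT_MAX ? INT_MAX : depth + 1;
}
```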

* tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  pidfs: avoid misleading break
  eventpoll: Fix integer overflow in ep_loop_check_proc()
  proc: Fix pointer error dereference
  fserror: fix lockdep complaint when igrabbing inode
  fsnotify: drop unused helper
  unshare: fix unshare_fs() handling
  minix: Correct errno in minix_new_inode
  namespace: fix proc mount iteration
  mount: hold namespace_sem across copy in create_new_namespace()
  iomap: Describe @private in iomap_readahead()
  statmount: Fix the null-ptr-deref in do_statmount()
  writeback: Fix wakeup and logging timeouts for !DETECT_HUNG_TASK
  fs: init flags_valid before calling vfs_fileattr_get
2026-02-25 10:34:23 -08:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)
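
The shape of these helpers can be approximated in userspace as below. This is a sketch, not the kernel macros: alloc_obj()/alloc_objs() here are hypothetical stand-ins built on malloc()/calloc(), they rely on the GCC/Clang __typeof__ extension, and calloc() zero-fills where kmalloc_array() does not.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace stand-ins; TYPE may be a type name or an expression like *ptr. */
#define alloc_obj(TYPE) \
	((__typeof__(TYPE) *)malloc(sizeof(TYPE)))
#define alloc_objs(TYPE, COUNT) \
	((__typeof__(TYPE) *)calloc((COUNT), sizeof(TYPE)))

struct point {
	int x;
	int y;
};

/* Single allocation, with TYPE given as *VAR. */
struct point *new_point(int x, int y)
{
	struct point *p = alloc_obj(*p);

	if (!p)
		return NULL;
	p->x = x;
	p->y = y;
	return p;
}

/* Array allocation, with TYPE given as a type name. */
struct point *new_points(size_t count)
{
	assert(count > 0);
	return alloc_objs(struct point, count);
}
```

Because the macros yield a typed pointer, assigning the result to a pointer of the wrong type draws a compiler diagnostic instead of passing silently.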

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Darrick J. Wong
294f54f849 fserror: fix lockdep complaint when igrabbing inode
Christoph Hellwig reported a lockdep splat in generic/108:

 ================================
 WARNING: inconsistent lock state
 6.19.0+ #4827 Tainted: G                 N
 --------------------------------
 inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
 swapper/1/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
 ffff88811ed1b140 (&sb->s_type->i_lock_key#33){?.+.}-{3:3}, at: igrab+0x1a/0xb0
 {HARDIRQ-ON-W} state was registered at:
   lock_acquire+0xca/0x2c0
   _raw_spin_lock+0x2e/0x40
   unlock_new_inode+0x2c/0xc0
   xfs_iget+0xcf4/0x1080
   xfs_trans_metafile_iget+0x3d/0x100
   xfs_metafile_iget+0x2b/0x50
   xfs_mount_setup_metadir+0x20/0x60
   xfs_mountfs+0x457/0xa60
   xfs_fs_fill_super+0x6b3/0xa90
   get_tree_bdev_flags+0x13c/0x1e0
   vfs_get_tree+0x27/0xe0
   vfs_cmd_create+0x54/0xe0
   __do_sys_fsconfig+0x309/0x620
   do_syscall_64+0x8b/0xf80
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
 irq event stamp: 139080
 hardirqs last  enabled at (139079): [<ffffffff813a923c>] do_idle+0x1ec/0x270
 hardirqs last disabled at (139080): [<ffffffff828a8d09>] common_interrupt+0x19/0xe0
 softirqs last  enabled at (139032): [<ffffffff8134a853>] __irq_exit_rcu+0xc3/0x120
 softirqs last disabled at (139025): [<ffffffff8134a853>] __irq_exit_rcu+0xc3/0x120

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&sb->s_type->i_lock_key#33);
   <Interrupt>
     lock(&sb->s_type->i_lock_key#33);

  *** DEADLOCK ***

 1 lock held by swapper/1/0:
  #0: ffff8881052c81a0 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x4b/0x110

 stack backtrace:
 CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Tainted: G                 N  6.19.0+ #4827 PREEMPT(full)
 Tainted: [N]=TEST
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <IRQ>
  dump_stack_lvl+0x5b/0x80
  print_usage_bug.part.0+0x22c/0x2c0
  mark_lock+0xa6f/0xe90
  __lock_acquire+0x10b6/0x25e0
  lock_acquire+0xca/0x2c0
  _raw_spin_lock+0x2e/0x40
  igrab+0x1a/0xb0
  fserror_report+0x135/0x260
  iomap_finish_ioend_buffered+0x170/0x210
  clone_endio+0x8f/0x1c0
  blk_update_request+0x1e4/0x4d0
  blk_mq_end_request+0x1b/0x100
  virtblk_done+0x6f/0x110
  vring_interrupt+0x59/0x80
  __handle_irq_event_percpu+0x8a/0x2e0
  handle_irq_event+0x33/0x70
  handle_edge_irq+0xdd/0x1e0
  __common_interrupt+0x6f/0x180
  common_interrupt+0xb7/0xe0
  </IRQ>

It looks like the concern here is that inode::i_lock is sometimes taken
in IRQ context, and sometimes it is held when going to IRQ context,
though it's a little difficult to tell since I think this is a kernel
from after the actual 6.19 release but before 7.0-rc1.

Either way, we don't need to take i_lock, because filesystems should
not report files to fserror if they're about to be freed or have not
yet been exposed to other threads, because the resulting fsnotify report
will be meaningless.

Therefore, bump inode::i_count directly and clarify the preconditions on
the inode being passed in.
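
As a loose userspace analogy (C11 atomics, hypothetical names, not the fserror code), taking a reference without a lock is safe only when the caller already guarantees the object is live:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical refcounted object standing in for an inode. */
struct object {
	atomic_int count;
};

/* Take an extra reference without any lock; caller guarantees liveness. */
void object_get(struct object *obj)
{
	int old = atomic_fetch_add(&obj->count, 1);

	/* Precondition: already referenced, so not being torn down. */
	assert(old > 0);
}
```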

Link: https://lore.kernel.org/linux-fsdevel/aY7BndIgQg3ci_6s@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/177148129564.716249.3069780698231701540.stgit@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-02-19 09:12:08 +01:00
Hongbo Li
ac83896172 iomap: Describe @private in iomap_readahead()
The kernel test robot reports the kernel-doc warning:

```
Warning: fs/iomap/buffered-io.c:624 function parameter 'private'
 not described in 'iomap_readahead'
```

The earlier commit "iomap: stash iomap read ctx in the private
field of iomap_iter" added a new parameter @private to
iomap_readahead(), so let's describe the parameter.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202601261111.vIL9rhgD-lkp@intel.com/
Fixes: 8806f27924 ("iomap: stash iomap read ctx in the private field of iomap_iter")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://patch.msgid.link/20260213022812.766187-1-lihongbo22@huawei.com
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-02-14 13:24:44 +01:00
Linus Torvalds
4adc13ed7c Merge tag 'for-7.0/block-stable-pages-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull bounce buffer dio for stable pages from Jens Axboe:
 "This adds support for bounce buffering of dio for stable pages. This
  was all done by Christoph. In his words:

  This series tries to address the problem that under I/O pages can be
  modified during direct I/O, even when the device or file system
  require stable pages during I/O to calculate checksums, parity or data
  operations. It does so by adding block layer helpers to bounce buffer
  an iov_iter into a bio, then wires that up in iomap and ultimately
  XFS.

  The reason that the file system even needs to know about it, is
  because reads need a user context to copy the data back, and the
  infrastructure to defer ioends to a workqueue currently sits in XFS.
  I'm going to look into moving that into ioend and enabling it for
  other file systems. Additionally btrfs already has its own
  infrastructure for this, and actually an urgent need to bounce buffer,
  so this should be useful there and could be wired up easily. In fact
  the idea comes from patches by Qu that did this in btrfs.

  This patch fixes all but one xfstests failure on T10 PI capable
  devices (generic/095 seems to have issues with a mix of mmap and
  splice still, I'm looking into that separately), and makes qemu VMs
  running Windows, or Linux with swap enabled, work fine on an XFS
  file on a device using PI.

  Performance numbers on my (not exactly state of the art) NVMe PI test
  setup:

      Sequential reads using io_uring, QD=16.
      Bandwidth and CPU usage (usr/sys):

      | size |        zero copy         |          bounce          |
      +------+--------------------------+--------------------------+
      |   4k | 1316MiB/s (12.65/55.40%) | 1081MiB/s (11.76/49.78%) |
      |  64K | 3370MiB/s ( 5.46/18.20%) | 3365MiB/s ( 4.47/15.68%) |
      |   1M | 3401MiB/s ( 0.76/23.05%) | 3400MiB/s ( 0.80/09.06%) |
      +------+--------------------------+--------------------------+

      Sequential writes using io_uring, QD=16.
      Bandwidth and CPU usage (usr/sys):

      | size |        zero copy         |          bounce          |
      +------+--------------------------+--------------------------+
      |   4k |  882MiB/s (11.83/33.88%) |  750MiB/s (10.53/34.08%) |
      |  64K | 2009MiB/s ( 7.33/15.80%) | 2007MiB/s ( 7.47/24.71%) |
      |   1M | 1992MiB/s ( 7.26/ 9.13%) | 1992MiB/s ( 9.21/19.11%) |
      +------+--------------------------+--------------------------+

  Note that the 64k read numbers look really odd to me for the baseline
  zero copy case, but are reproducible over many repeated runs.

  The bounce read numbers should further improve when moving the PI
  validation to the file system and removing the double context switch,
  which I have patches for that will be sent out soon"

* tag 'for-7.0/block-stable-pages-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  xfs: use bounce buffering direct I/O when the device requires stable pages
  iomap: add a flag to bounce buffer direct I/O
  iomap: support ioends for direct reads
  iomap: rename IOMAP_DIO_DIRTY to IOMAP_DIO_USER_BACKED
  iomap: free the bio before completing the dio
  iomap: share code between iomap_dio_bio_end_io and iomap_finish_ioend_direct
  iomap: split out the per-bio logic from iomap_dio_bio_iter
  iomap: simplify iomap_dio_bio_iter
  iomap: fix submission side handling of completion side errors
  block: add helpers to bounce buffer an iov_iter into bios
  block: remove bio_release_page
  iov_iter: extract a iov_iter_extract_bvecs helper from bio code
  block: open code bio_add_page and fix handling of mismatching P2P ranges
  block: refactor get_contig_folio_len
  block: add a BIO_MAX_SIZE constant and use it
2026-02-09 18:14:52 -08:00
Linus Torvalds
0c00ed308d Merge tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:

 - Support for batch request processing for ublk, improving the
   efficiency of the kernel/ublk server communication. This can yield
   nice 7-12% performance improvements

 - Support for integrity data for ublk

 - Various other ublk improvements and additions, including a ton of
   selftest additions and updates

 - Move the handling of blk-crypto software fallback from below the
   block layer to above it. This reduces the complexity of dealing with
   bio splitting

 - Series fixing a number of potential deadlocks in blk-mq related to
   the queue usage counter and writeback throttling and rq-qos debugfs
   handling

 - Add an async_depth queue attribute, to resolve a performance
   regression that's been around for a while related to the scheduler
   depth handling

 - Only use task_work for IOPOLL completions on NVMe, if it is necessary
   to do so. An earlier fix for an issue resulted in all these
   completions being punted to task_work, to guarantee that completions
   were only run for a given io_uring ring when it was local to that
   ring. With the new changes, we can detect if it's necessary to use
   task_work or not, and avoid it if possible.

 - rnbd fixes:
      - Fix refcount underflow in device unmap path
      - Handle PREFLUSH and NOUNMAP flags properly in protocol
      - Fix server-side bi_size for special IOs
      - Zero response buffer before use
      - Fix trace format for flags
      - Add .release to rnbd_dev_ktype

 - MD pull requests via Yu Kuai
      - Fix raid5_run() to return error when log_init() fails
      - Fix IO hang with degraded array with llbitmap
      - Fix percpu_ref not resurrected on suspend timeout in llbitmap
      - Fix GPF in write_page caused by resize race
      - Fix NULL pointer dereference in process_metadata_update
      - Fix hang when stopping arrays with metadata through dm-raid
      - Fix any_working flag handling in raid10_sync_request
      - Refactor sync/recovery code path, improve error handling for
        badblocks, and remove unused recovery_disabled field
      - Consolidate mddev boolean fields into mddev_flags
      - Use mempool to allocate stripe_request_ctx and make sure
        max_sectors is not less than io_opt in raid5
      - Fix return value of mddev_trylock
      - Fix memory leak in raid1_run()
      - Add Li Nan as mdraid reviewer

 - Move phys_vec definitions to the kernel types, mostly in preparation
   for some VFIO and RDMA changes

 - Improve the speed for secure erase for some devices

 - Various little rust updates

 - Various other minor fixes, improvements, and cleanups

* tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
  blk-mq: ABI/sysfs-block: fix docs build warnings
  selftests: ublk: organize test directories by test ID
  block: decouple secure erase size limit from discard size limit
  block: remove redundant kill_bdev() call in set_blocksize()
  blk-mq: add documentation for new queue attribute async_dpeth
  block, bfq: convert to use request_queue->async_depth
  mq-deadline: covert to use request_queue->async_depth
  kyber: covert to use request_queue->async_depth
  blk-mq: add a new queue sysfs attribute async_depth
  blk-mq: factor out a helper blk_mq_limit_depth()
  blk-mq-sched: unify elevators checking for async requests
  block: convert nr_requests to unsigned int
  block: don't use strcpy to copy blockdev name
  blk-mq-debugfs: warn about possible deadlock
  blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs()
  blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos()
  blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
  blk-rq-qos: fix possible debugfs_mutex deadlock
  blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos
  blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter
  ...
2026-02-09 17:57:21 -08:00
Linus Torvalds
3304b3fedd Merge tag 'vfs-7.0-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs iomap updates from Christian Brauner:

 - Erofs page cache sharing preliminaries:

   Plumb a void *private parameter through iomap_read_folio() and
   iomap_readahead() into iomap_iter->private, matching iomap DIO. Erofs
   uses this to replace a bogus kmap_to_page() call, as preparatory work
   for page cache sharing.

 - Fix for invalid folio access:

   Fix an invalid folio access when a folio without iomap_folio_state
   is fully submitted to the IO helper — the helper may call
   folio_end_read() at any time, so ctx->cur_folio must be invalidated
   after full submission.

* tag 'vfs-7.0-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  iomap: fix invalid folio access after folio_end_read()
  erofs: hold read context in iomap_iter if needed
  iomap: stash iomap read ctx in the private field of iomap_iter
2026-02-09 15:08:16 -08:00
Linus Torvalds
dd466ea002 Merge tag 'vfs-7.0-rc1.fserror' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs error reporting updates from Christian Brauner:
 "This contains the changes to support generic I/O error reporting.

  Filesystems currently have no standard mechanism for reporting
  metadata corruption and file I/O errors to userspace via fsnotify.
  Each filesystem (xfs, ext4, erofs, f2fs, etc.) privately defines
  EFSCORRUPTED, and error reporting to fanotify is inconsistent or
  absent entirely.

  This introduces a generic fserror infrastructure built around struct
  super_block that gives filesystems a standard way to queue metadata
  and file I/O error reports for delivery to fsnotify.

  Errors are queued via mempools and queue_work to avoid holding
  filesystem locks in the notification path; unmount waits for pending
  events to drain. A new super_operations::report_error callback lets
  filesystem drivers respond to file I/O errors themselves (to be used
  by an upcoming XFS self-healing patchset).

  On the uapi side, EFSCORRUPTED and EUCLEAN are promoted from private
  per-filesystem definitions to canonical errno.h values across all
  architectures"

* tag 'vfs-7.0-rc1.fserror' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  ext4: convert to new fserror helpers
  xfs: translate fsdax media errors into file "data lost" errors when convenient
  xfs: report fs metadata errors via fsnotify
  iomap: report file I/O errors to the VFS
  fs: report filesystem and file I/O errors to fsnotify
  uapi: promote EFSCORRUPTED and EUCLEAN to errno.h
2026-02-09 12:21:37 -08:00
Joanne Koong
aa35dd5cbc iomap: fix invalid folio access after folio_end_read()
If the folio does not have an iomap_folio_state (ifs) attached and the
folio gets read in by the filesystem's IO helper, folio_end_read() will
be called by the IO helper at any time. For this case, we cannot access
the folio after dispatching it to the IO helper, eg subsequent accesses
like

        if (ctx->cur_folio &&
                    offset_in_folio(ctx->cur_folio, iter->pos) == 0) {

are incorrect.

Fix these invalid accesses by invalidating ctx->cur_folio if all bytes
of the folio have been read in by the IO helper.

This allows us to also remove the +1 bias added for the ifs case. The
bias was previously added to ensure that if all bytes are read in, the
IO helper does not end the read on the folio until iomap has decremented
the bias.

Fixes: b2f35ac414 ("iomap: add caller-provided callbacks for read and readahead")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20260126224107.2182262-2-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-29 13:42:05 +01:00
Christoph Hellwig
c9d114846b iomap: add a flag to bounce buffer direct I/O
Add a new flag that requests bounce buffering for direct I/O.  This is
needed to provide the stable pages requirement requested by devices
that need to calculate checksums or parity over the data and allows
file systems to properly work with things like T10 protection
information.  The implementation just calls out to the new bio bounce
buffering helpers to allocate a bounce buffer, which is used for
I/O and to copy to/from it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-28 05:16:40 -07:00
Christoph Hellwig
d969bd72cf iomap: support ioends for direct reads
Support using the ioend structure to defer I/O completion for direct
reads in addition to writes.  This requires a check for the operation
to not merge reads and writes in iomap_ioend_can_merge.  This support
will be used for bounce buffered direct I/O reads that need to copy
data back to the user address space on read completion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-28 05:16:40 -07:00
Christoph Hellwig
c96b8b2202 iomap: rename IOMAP_DIO_DIRTY to IOMAP_DIO_USER_BACKED
Match the more descriptive iov_iter terminology instead of encoding
what we do with them for reads only.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-28 05:16:40 -07:00
Christoph Hellwig
45cec0de6c iomap: free the bio before completing the dio
There are good arguments for processing the user completions ASAP vs.
freeing resources ASAP, but freeing the bio first here removes potential
use after free hazards when checking flags, and will simplify the
upcoming bounce buffer support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-28 05:16:40 -07:00
Christoph Hellwig
e2fcff5bb4 iomap: share code between iomap_dio_bio_end_io and iomap_finish_ioend_direct
Refactor the two per-bio completion handlers to share common code using
a new helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-28 05:16:40 -07:00
Christoph Hellwig
2631c94602 iomap: split out the per-bio logic from iomap_dio_bio_iter
Factor out a separate helper that builds and submits a single bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-28 05:16:40 -07:00
Christoph Hellwig
6e7a6c8019 iomap: simplify iomap_dio_bio_iter
Use iov_iter_count to check if we need to continue as that just reads
a field in the iov_iter, and only use bio_iov_vecs_to_alloc to calculate
the actual number of vectors to allocate for the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-28 05:16:39 -07:00
Christoph Hellwig
4ad357e39b iomap: fix submission side handling of completion side errors
The "if (dio->error)" in iomap_dio_bio_iter exists to stop submitting
more bios when a completion already returned an error.  Commit cfe057f7db
("iomap_dio_actor(): fix iov_iter bugs") made it revert the iov by
"copied", which is very wrong given that we've already consumed that
range and submitted a bio for it.

Fixes: cfe057f7db ("iomap_dio_actor(): fix iov_iter bugs")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-28 05:16:39 -07:00
Christoph Hellwig
561940a7ee iomap: wait for batched folios to be stable in __iomap_get_folio
__iomap_get_folio needs to wait for writeback to finish if the file
requires folios to be stable for writes.  For the regular path this is
taken care of by __filemap_get_folio, but for the newly added batch
lookup it has to be done manually.

This fixes xfs/131 failures when running on PI-capable hardware.

Fixes: 395ed1ef00 ("iomap: optional zero range dirty folio processing")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260113153943.3323869-1-hch@lst.de
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-14 17:06:02 +01:00
Hongbo Li
8806f27924 iomap: stash iomap read ctx in the private field of iomap_iter
It's useful to get filesystem-specific information using the
existing private field in the @iomap_iter passed to iomap_{begin,end}
for advanced usage for iomap buffered reads, which is much like the
current iomap DIO.

For example, EROFS needs it to:

 - implement an efficient page cache sharing feature, since iomap
   needs to apply to anon inode page cache but we'd like to get the
   backing inode/fs instead, so filesystem-specific private data is
   needed to keep such information;

 - pass in both struct page * and void * for inline data to avoid
   kmap_to_page() usage (which is bogus).

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://patch.msgid.link/20260109102856.598531-2-lihongbo22@huawei.com
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-14 16:31:41 +01:00
Darrick J. Wong
a9d573ee88 iomap: report file I/O errors to the VFS
Wire up iomap so that it reports all file read and write errors to the
VFS (and hence fsnotify) via the new fserror mechanism.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/176826402631.3490369.729008983502742314.stgit@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-13 09:58:01 +01:00
Christoph Hellwig
bb8e2019ad blk-crypto: handle the fallback above the block layer
Add a blk_crypto_submit_bio helper that either submits the bio when
it is not encrypted or inline encryption is provided, but otherwise
handles the encryption before going down into the low-level driver.
This reduces the risk from bio reordering and keeps memory allocation
as high up in the stack as possible.

Note that if the submitter knows that inline encryption is supported
by the underlying driver, it can still use plain submit_bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-11 12:55:41 -07:00
Brian Foster
ed61378b4d iomap: replace folio_batch allocation with stack allocation
Zhang Yi points out that the dynamic folio_batch allocation in
iomap_fill_dirty_folios() is problematic for the ext4 on iomap work
that is under development because it doesn't sufficiently handle the
allocation failure case (by allowing a retry, for example). We've
also seen lockdep (via syzbot) complain recently about the scope of
the allocation.

The dynamic allocation was initially added for simplicity and to
help indicate whether the batch was used or not by the calling fs.
To address these issues, put the batch on the stack of
iomap_zero_range() and use a flag to control whether the batch
should be used in the iomap folio lookup path. This keeps things
simple and eliminates allocation issues with lockdep and for ext4 on
iomap.

While here, also clean up the fill helper signature to be more
consistent with the underlying filemap helper. Pass through the
return value of the filemap helper (folio count) and update the
lookup offset via an out param.

Fixes: 395ed1ef00 ("iomap: optional zero range dirty folio processing")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://patch.msgid.link/20251208140548.373411-1-bfoster@redhat.com
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-12-15 15:17:44 +01:00
Linus Torvalds
f2e74ecfba Merge tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull folio updates from Christian Brauner:
 "Add a new folio_next_pos() helper function that returns the file
  position of the first byte after the current folio. This is a common
  operation in filesystems when needing to know the end of the current
  folio.

  The helper is lifted from btrfs which already had its own version, and
  is now used across multiple filesystems and subsystems:
   - btrfs
   - buffer
   - ext4
   - f2fs
   - gfs2
   - iomap
   - netfs
   - xfs
   - mm

  This fixes a long-standing bug in ocfs2 on 32-bit systems with files
  larger than 2GiB. Presumably this is not a common configuration, but
  the fix is backported anyway. The other filesystems did not have bugs,
  they were just mildly inefficient.

  This also introduces uoff_t as the unsigned version of loff_t. A recent
  commit inadvertently changed a comparison from being unsigned (on
  64-bit systems) to being signed (which it had always been on 32-bit
  systems), leading to sporadic fstests failures.

  Generally file sizes are restricted to being a signed integer, but in
  places where -1 is passed to indicate "up to the end of the file", it
  is convenient to have an unsigned type to ensure comparisons are
  always unsigned regardless of architecture"

* tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: Add uoff_t
  mm: Use folio_next_pos()
  xfs: Use folio_next_pos()
  netfs: Use folio_next_pos()
  iomap: Use folio_next_pos()
  gfs2: Use folio_next_pos()
  f2fs: Use folio_next_pos()
  ext4: Use folio_next_pos()
  buffer: Use folio_next_pos()
  btrfs: Use folio_next_pos()
  filemap: Add folio_next_pos()
2025-12-01 10:26:38 -08:00
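The signed versus unsigned comparison issue described in the pull message above can be illustrated with a small userspace sketch. This is not kernel code: the typedefs sketch_loff_t and sketch_uoff_t and all three helpers are hypothetical stand-ins chosen for this example, and sketch_next_pos() only mirrors the stated meaning of folio_next_pos() (the position of the first byte after the folio).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins for loff_t and uoff_t. */
typedef int64_t sketch_loff_t;
typedef uint64_t sketch_uoff_t;

/* Signed comparison: an end of -1 ("up to end of file") sorts before any pos. */
bool sketch_before_end_signed(sketch_loff_t pos, sketch_loff_t end)
{
	return pos < end;
}

/* Unsigned comparison: -1 becomes the largest value, so every pos is before it. */
bool sketch_before_end_unsigned(sketch_loff_t pos, sketch_loff_t end)
{
	return (sketch_uoff_t)pos < (sketch_uoff_t)end;
}

/* Position of the first byte after a folio starting at folio_pos. */
sketch_loff_t sketch_next_pos(sketch_loff_t folio_pos, uint64_t folio_size)
{
	return folio_pos + (sketch_loff_t)folio_size;
}

/* Small self-check showing the difference between the two comparisons. */
void sketch_uoff_selfcheck(void)
{
	assert(!sketch_before_end_signed(4096, -1));
	assert(sketch_before_end_unsigned(4096, -1));
}
```

With an unsigned type, the comparison gives the same answer on 32-bit and 64-bit builds, which is the property the pull message asks for.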
Linus Torvalds
b04b2e7a61 Merge tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
 "Features:

   - Cheaper MAY_EXEC handling for path lookup. This elides MAY_WRITE
     permission checks during path lookup and adds the
     IOP_FASTPERM_MAY_EXEC flag so filesystems like btrfs can avoid
     expensive permission work.

   - Hide dentry_cache behind runtime const machinery.

   - Add German Maglione as virtiofs co-maintainer.

  Cleanups:

   - Tidy up and inline step_into() and walk_component() for improved
     code generation.

   - Re-enable IOCB_NOWAIT writes to files. This refactors file
     timestamp update logic, fixing a layering bypass in btrfs when
     updating timestamps on device files and improving FMODE_NOCMTIME
     handling in VFS now that nfsd started using it.

   - Path lookup optimizations extracting slowpaths into dedicated
     routines and adding branch prediction hints for mntput_no_expire(),
     fd_install(), lookup_slow(), and various other hot paths.

   - Enable clang's -fms-extensions flag, requiring a JFS rename to
     avoid conflicts.

   - Remove spurious exports in fs/file_attr.c.

   - Stop duplicating union pipe_index declaration. This depends on the
     shared kbuild branch that brings in -fms-extensions support which
     is merged into this branch.

   - Use MD5 library instead of crypto_shash in ecryptfs.

   - Use largest_zero_folio() in iomap_dio_zero().

   - Replace simple_strtol/strtoul with kstrtoint/kstrtouint in init and
     initrd code.

   - Various typo fixes.

  Fixes:

   - Fix emergency sync for btrfs. Btrfs requires an explicit sync_fs()
     call with wait == 1 to commit super blocks. The emergency sync path
     never passed this, leaving btrfs data uncommitted during emergency
     sync.

   - Use local kmap in watch_queue's post_one_notification().

   - Add hint prints in sb_set_blocksize() for LBS dependency on THP"

* tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
  MAINTAINERS: add German Maglione as virtiofs co-maintainer
  fs: inline step_into() and walk_component()
  fs: tidy up step_into() & friends before inlining
  orangefs: use inode_update_timestamps directly
  btrfs: fix the comment on btrfs_update_time
  btrfs: use vfs_utimes to update file timestamps
  fs: export vfs_utimes
  fs: lift the FMODE_NOCMTIME check into file_update_time_flags
  fs: refactor file timestamp update logic
  include/linux/fs.h: trivial fix: regualr -> regular
  fs/splice.c: trivial fix: pipes -> pipe's
  fs: mark lookup_slow() as noinline
  fs: add predicts based on nd->depth
  fs: move mntput_no_expire() slowpath into a dedicated routine
  fs: remove spurious exports in fs/file_attr.c
  watch_queue: Use local kmap in post_one_notification()
  fs: touch up predicts in path lookup
  fs: move fd_install() slowpath into a dedicated routine and provide commentary
  fs: hide dentry_cache behind runtime const machinery
  fs: touch predicts in do_dentry_open()
  ...
2025-12-01 08:44:26 -08:00
Christoph Hellwig
7fd8720dff iomap: allocate s_dio_done_wq for async reads as well
Since commit 222f2c7c6d14 ("iomap: always run error completions in user
context"), read error completions are deferred to s_dio_done_wq.  This
means the workqueue also needs to be allocated for async reads.

Fixes: 222f2c7c6d14 ("iomap: always run error completions in user context")
Reported-by: syzbot+a2b9a4ed0d61b1efb3f5@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251124140013.902853-1-hch@lst.de
Tested-by: syzbot+a2b9a4ed0d61b1efb3f5@syzkaller.appspotmail.com
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:19 +01:00
Joanne Koong
d7ff85d4b8 iomap: fix iomap_read_end() for already uptodate folios
There are some cases where when iomap_read_end() is called, the folio
may already have been marked uptodate. For example, if the iomap block
needed zeroing, then the folio may have been marked uptodate after the
zeroing.

iomap_read_end() should unlock the folio instead of calling
folio_end_read(), which is how these cases were handled prior to commit
f8eaf79406 ("iomap: simplify ->read_folio_range() error handling for
reads"). Calling folio_end_read() on an uptodate folio leads to buggy
behavior where marking an already uptodate folio as uptodate will XOR it
to be marked nonuptodate.

Fixes: f8eaf79406 ("iomap: simplify ->read_folio_range() error handling for reads")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251118211111.1027272-2-joannelkoong@gmail.com
Tested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:19 +01:00
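The XOR behavior described in the commit above can be modelled in a few lines of userspace C. This is a simplified sketch, not the kernel implementation: the flag names and both helpers are hypothetical, and the model only assumes what the commit message states, namely that ending a read toggles the uptodate flag rather than setting it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for the model. */
#define SKETCH_LOCKED   (1u << 0)
#define SKETCH_UPTODATE (1u << 1)

/* Model of ending a read: clear "locked" and mark uptodate via one XOR. */
unsigned sketch_end_read(unsigned flags, bool success)
{
	unsigned mask = SKETCH_LOCKED;

	if (success)
		mask |= SKETCH_UPTODATE;
	return flags ^ mask;
}

/* Model of a plain unlock: only the "locked" bit is cleared. */
unsigned sketch_unlock(unsigned flags)
{
	return flags & ~SKETCH_LOCKED;
}

/* Self-check: XOR on an already uptodate folio loses the uptodate bit. */
void sketch_end_read_selfcheck(void)
{
	assert(sketch_end_read(SKETCH_LOCKED | SKETCH_UPTODATE, true) == 0);
	assert(sketch_unlock(SKETCH_LOCKED | SKETCH_UPTODATE) == SKETCH_UPTODATE);
}
```

In this model, a folio that is locked and already uptodate ends up with neither flag after sketch_end_read(), while sketch_unlock() keeps it uptodate, which matches the fix of unlocking instead of ending the read.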
Christoph Hellwig
76192a42c2 iomap: invert the polarity of IOMAP_DIO_INLINE_COMP
Replace IOMAP_DIO_INLINE_COMP with a flag to indicate that the
completion should be offloaded.  This removes a tiny bit of boilerplate
code, but more importantly just makes the code easier to follow as this
new flag gets set most of the time and only cleared in one place, while
it was the inverse for the old version.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-6-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:19 +01:00
Christoph Hellwig
eca9dc2089 iomap: support write completions from interrupt context
Completions for pure overwrites don't need to be deferred to a workqueue
as there is no work to be done, or at least no work that needs a user
context.  Set IOMAP_DIO_INLINE_COMP by default for writes like we
already do for reads, and then clear it for all the cases that actually
do need a user context for completions to update the inode size or
record updates to the logical to physical mapping.

I've audited all users of the ->end_io callback, and they only require
user context for I/O that involves unwritten extents, COW, size
extensions, or error handling and all those are still run from workqueue
context.

This restores the behavior of the old pre-iomap direct I/O code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-5-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:19 +01:00
Christoph Hellwig
29086a31b3 iomap: rework REQ_FUA selection
The way iomap_dio_can_use_fua and its caller are structured is
a bit confusing, as the main guarding condition is hidden in the
helper, and the secondary conditions are split between caller and
callee.

Refactor the code, so that iomap_dio_bio_iter itself tracks if a write
might need metadata updates based on the iomap type and flags, and
then have a condition based on that to use the FUA flag.

Note that this also moves the REQ_OP_WRITE assignment to the end of
the branch to improve readability a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-4-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:18 +01:00
Christoph Hellwig
ddb4873286 iomap: always run error completions in user context
At least zonefs expects error completions to be able to sleep.  Because
error completions aren't performance critical, just defer them to workqueue
context unconditionally.

Fixes: 8dcc1a9d90 ("fs: New zonefs file system")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-3-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:18 +01:00
Christoph Hellwig
f9f8514999 fs, iomap: remove IOCB_DIO_CALLER_COMP
This was added by commit 099ada2c87 ("io_uring/rw: add write support
for IOCB_DIO_CALLER_COMP") and disabled a little later by commit
838b35bb6a ("io_uring/rw: disable IOCB_DIO_CALLER_COMP") because it
didn't work.  Remove all the related code that sat unused for 2 years.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-2-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:18 +01:00
Joanne Koong
b56c1c54f2 iomap: use find_next_bit() for uptodate bitmap scanning
Use find_next_bit()/find_next_zero_bit() for iomap uptodate bitmap
scanning. This uses __ffs() internally and is more efficient for
finding the next uptodate or non-uptodate bit than iterating through
the bitmap range testing every bit.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-10-joannelkoong@gmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:10 +01:00
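The difference between testing every bit and scanning a word at a time can be sketched in userspace C as follows. This is an illustration only, not the kernel's find_next_bit(): both function names are hypothetical, and __builtin_ctzl() (a GCC/Clang builtin) stands in for the kernel's __ffs().

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Baseline: test every bit in [start, nbits) one at a time. */
size_t sketch_scan_per_bit(const unsigned long *map, size_t nbits, size_t start)
{
	for (size_t i = start; i < nbits; i++)
		if (map[i / SKETCH_BITS_PER_LONG] &
		    (1UL << (i % SKETCH_BITS_PER_LONG)))
			return i;
	return nbits;
}

/* Word-at-a-time scan: skip zero words, then locate the lowest set bit. */
size_t sketch_find_next_bit(const unsigned long *map, size_t nbits, size_t start)
{
	size_t word;
	unsigned long cur;

	if (start >= nbits)
		return nbits;

	word = start / SKETCH_BITS_PER_LONG;
	/* Mask off the bits below 'start' in the first word. */
	cur = map[word] & (~0UL << (start % SKETCH_BITS_PER_LONG));

	for (;;) {
		if (cur) {
			size_t bit = word * SKETCH_BITS_PER_LONG +
				     (size_t)__builtin_ctzl(cur);
			return bit < nbits ? bit : nbits;
		}
		word++;
		if (word * SKETCH_BITS_PER_LONG >= nbits)
			return nbits;
		cur = map[word];
	}
}

/* Self-check: both scans agree on a small bitmap. */
void sketch_bitmap_selfcheck(void)
{
	unsigned long map[2] = { 0, 1UL << 5 };
	size_t nbits = 2 * SKETCH_BITS_PER_LONG;

	assert(sketch_find_next_bit(map, nbits, 0) ==
	       sketch_scan_per_bit(map, nbits, 0));
}
```

Both functions return nbits when no set bit is found. The word-at-a-time version does one comparison per empty word instead of one per bit.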
Joanne Koong
fed9c62d28 iomap: use find_next_bit() for dirty bitmap scanning
Use find_next_bit()/find_next_zero_bit() for iomap dirty bitmap
scanning. This uses __ffs() internally and is more efficient for
finding the next dirty or clean bit than iterating through the bitmap
range testing every bit.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-9-joannelkoong@gmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:10 +01:00
Joanne Koong
a298febc47 iomap: simplify when reads can be skipped for writes
Currently, the logic for skipping the read range for a write is

if (!(iter->flags & IOMAP_UNSHARE) &&
    (from <= poff || from >= poff + plen) &&
    (to <= poff || to >= poff + plen))

which breaks down to skipping the read if any of these are true:
a) from <= poff && to <= poff
b) from <= poff && to >= poff + plen
c) from >= poff + plen && to <= poff
d) from >= poff + plen && to >= poff + plen

This can be simplified to
if (!(iter->flags & IOMAP_UNSHARE) && from <= poff && to >= poff + plen)

from the following reasoning:

a) from <= poff && to <= poff
This reduces to 'to <= poff' since it is guaranteed that 'from <= to'
(since to = from + len). It is not possible for 'to <= poff' to be true
here because we only reach here if plen > 0 (thanks to the preceding 'if
(plen == 0)' check that would break us out of the loop). If 'to <=
poff', plen would have to be 0 since poff and plen get adjusted in
lockstep for uptodate blocks. This means we can eliminate this check.

c) from >= poff + plen && to <= poff
This is not possible since 'from <= to' and 'plen > 0'. We can eliminate
this check.

d) from >= poff + plen && to >= poff + plen
This reduces to 'from >= poff + plen' since 'from <= to'.
It is not possible for 'from >= poff + plen' to be true here. We only
reach here if plen > 0 and for writes, poff and plen will always be
block-aligned, which means poff <= from < poff + plen. We can eliminate
this check.

The only valid check is b) from <= poff && to >= poff + plen.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-7-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 10:50:32 +01:00
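The equivalence argued in the commit above can be checked by brute force in userspace. The sketch below is not kernel code. It drops the IOMAP_UNSHARE term, which is common to both conditions, and it assumes the invariants stated in the commit message (plen > 0 and poff <= from < poff + plen) plus a non-empty write (to > from), which is an assumption of this example. All function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Original condition, without the common IOMAP_UNSHARE term. */
bool sketch_skip_read_old(long from, long to, long poff, long plen)
{
	return (from <= poff || from >= poff + plen) &&
	       (to <= poff || to >= poff + plen);
}

/* Simplified condition from the commit message. */
bool sketch_skip_read_new(long from, long to, long poff, long plen)
{
	return from <= poff && to >= poff + plen;
}

/*
 * Enumerate small values under the assumed invariants and count the
 * cases where the two conditions disagree.
 */
int sketch_count_mismatches(long limit)
{
	int mismatches = 0;

	for (long poff = 0; poff <= limit; poff++)
		for (long plen = 1; plen <= limit; plen++)
			for (long from = poff; from < poff + plen; from++)
				for (long to = from + 1; to <= poff + plen + limit; to++)
					if (sketch_skip_read_old(from, to, poff, plen) !=
					    sketch_skip_read_new(from, to, poff, plen))
						mismatches++;
	return mismatches;
}

/* Self-check: no disagreement within the enumerated range. */
void sketch_skip_read_selfcheck(void)
{
	assert(sketch_count_mismatches(8) == 0);
}
```

An exhaustive check over a small range is not a proof, but it is a cheap way to confirm that the case analysis in the commit message has no gaps under those invariants.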
Joanne Koong
f8eaf79406 iomap: simplify ->read_folio_range() error handling for reads
Instead of requiring that the caller calls iomap_finish_folio_read()
even if the ->read_folio_range() callback returns an error, account for
this internally in iomap instead, which makes the interface simpler and
makes it match writeback's ->read_folio_range() error handling
expectations.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-6-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 10:50:32 +01:00
Joanne Koong
6b1fd2281f iomap: optimize pending async writeback accounting
Pending writebacks must be accounted for to determine when all requests
have completed and writeback on the folio should be ended. Currently
this is done by atomically incrementing ifs->write_bytes_pending for
every range to be written back.

Instead, the number of atomic operations can be minimized by setting
ifs->write_bytes_pending to the folio size, internally tracking how many
bytes are written back asynchronously, and then after sending off all
the requests, decrementing ifs->write_bytes_pending by the number of
bytes not written back asynchronously. Now, for N ranges written back,
only N + 2 atomic operations are required instead of 2N + 2.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-5-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 10:50:32 +01:00
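The accounting scheme described in the commit above can be modelled sequentially in userspace C11. This is a rough sketch under stated assumptions, not the iomap code: struct sketch_ifs, the atomic_ops counter, and both functions are hypothetical, and completions are simulated in order after submission, whereas in the kernel they may run concurrently with it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-folio state; atomic_ops is model-only. */
struct sketch_ifs {
	atomic_size_t write_bytes_pending;
	unsigned atomic_ops;
	bool writeback_ended;
};

/* One atomic subtraction; whoever drops the count to zero ends writeback. */
static void sketch_sub_pending(struct sketch_ifs *ifs, size_t bytes)
{
	size_t old = atomic_fetch_sub(&ifs->write_bytes_pending, bytes);

	ifs->atomic_ops++;
	if (old == bytes)
		ifs->writeback_ended = true;
}

/*
 * Start with the whole folio size pending, track submitted bytes in a
 * plain local variable, then drop the bytes that were never submitted.
 * Each simulated completion then drops its own range.
 */
void sketch_writeback(struct sketch_ifs *ifs, size_t folio_size,
		      const size_t *range_len, size_t nr_ranges)
{
	size_t submitted = 0;

	atomic_store(&ifs->write_bytes_pending, folio_size);
	ifs->atomic_ops = 1;
	ifs->writeback_ended = false;

	for (size_t i = 0; i < nr_ranges; i++)
		submitted += range_len[i];	/* no atomic op per range */

	sketch_sub_pending(ifs, folio_size - submitted);

	for (size_t i = 0; i < nr_ranges; i++)
		sketch_sub_pending(ifs, range_len[i]);	/* completions */
}

/* Self-check: three ranges cost 3 + 2 atomic operations in this model. */
void sketch_writeback_selfcheck(void)
{
	struct sketch_ifs ifs = { 0 };
	size_t lens[] = { 512, 1024, 512 };

	sketch_writeback(&ifs, 4096, lens, 3);
	assert(ifs.atomic_ops == 5);
	assert(ifs.writeback_ended);
}
```

In this model, N ranges cost one store, one subtraction for the unsubmitted bytes, and N completion subtractions, which gives the N + 2 figure quoted in the commit message.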
Joanne Koong
9d875e0eef iomap: account for unaligned end offsets when truncating read range
The end position to start truncating from may be at an offset into a
block, which under the current logic would result in overtruncation.

Adjust the calculation to account for unaligned end offsets.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-3-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 10:50:31 +01:00
Joanne Koong
a0f1cabe29 iomap: rename bytes_pending/bytes_accounted to bytes_submitted/bytes_not_submitted
The names "bytes_pending" and "bytes_accounted" may be confusing and
could be better. Rename them to "bytes_submitted" and
"bytes_not_submitted" to make it clearer that these are bytes we
passed to the IO helper to read in.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-2-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 10:50:31 +01:00
Qu Wenruo
001397f5ef iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag
Btrfs requires all of its bios to be fs block aligned. Normally this is
fine, but with the incoming block size larger than page size (bs > ps)
support, the requirement is no longer met for direct IOs.

This is because iomap_dio_bio_iter() calls bio_iov_iter_get_pages(),
which only requires alignment to bdev_logical_block_size().

In the real world that value is either 512 or 4K. On 4K page sized
systems it means bio_iov_iter_get_pages() can break the bio at any page
boundary, breaking btrfs' requirement for bs > ps cases.

To address this problem, introduce a new public iomap dio flag,
IOMAP_DIO_FSBLOCK_ALIGNED.

When calling __iomap_dio_rw() with that new flag, iomap_dio::flags will
inherit that new flag, and iomap_dio_bio_iter() will take fs block size
into the calculation of the alignment, and pass the alignment to
bio_iov_iter_get_pages(), respecting the fs block size requirement.

The initial user of this flag will be btrfs, which needs to calculate the
checksum for direct read and thus requires the biovec to be fs block
aligned for the incoming bs > ps support.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
[hch: also align pos/len, incorporate the trace flags from Darrick]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251031131045.1613229-2-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 13:09:27 +01:00
Brian Foster
39be21386d iomap: remove old partial eof zeroing optimization
iomap_zero_range() optimizes the partial eof block zeroing use case
by force zeroing if the mapping is dirty. This is to avoid frequent
flushing on file extending workloads, which hurts performance.

Now that the folio batch mechanism provides a more generic solution
and is used by the only real zero range user (XFS), this isolated
optimization is no longer needed. Remove the unnecessary code and
let callers use the folio batch or fall back to flushing by default.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:25 +01:00
Brian Foster
395ed1ef00 iomap: optional zero range dirty folio processing
The only way zero range can currently process unwritten mappings
with dirty pagecache is to check whether the range is dirty before
mapping lookup and then flush when at least one underlying mapping
is unwritten. This ordering is required to prevent iomap lookup from
racing with folio writeback and reclaim.

Since zero range can skip ranges of unwritten mappings that are
clean in cache, this operation can be improved by allowing the
filesystem to provide a set of dirty folios that require zeroing. In
turn, rather than flush or iterate file offsets, zero range can
iterate on folios in the batch and advance over clean or uncached
ranges in between.

Add a folio_batch in struct iomap and provide a helper for
filesystems to populate the batch at lookup time. Update the folio
lookup path to return the next folio in the batch, if provided, and
advance the iter if the folio starts beyond the current offset.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:24 +01:00
Brian Foster
49590716be iomap: remove pos+len BUG_ON() to after folio lookup
The bug checks at the top of iomap_write_begin() assume the pos/len
reflect exactly the next range to process. This may no longer be the
case once the get folio path is able to process a folio batch from
the filesystem. On top of that, len is already trimmed to within the
iomap/srcmap by iomap_length(), so these checks aren't terribly
useful. Remove the unnecessary BUG_ON() checks.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:24 +01:00
Joanne Koong
d4e88bb08e iomap: make iomap_read_folio() a void return
No errors are propagated in iomap_read_folio(). Change
iomap_read_folio() to a void return to make this clearer to callers.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:23 +01:00
Christoph Hellwig [1]
c2b1adc462 iomap: move buffered io bio logic into new file
Move bio logic in the buffered io code into its own file and remove
CONFIG_BLOCK gating for iomap read/readahead.

[1] https://lore.kernel.org/linux-fsdevel/aMK2GuumUf93ep99@infradead.org/

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:23 +01:00