linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-26 04:52:23 -04:00

Author	SHA1	Message	Date
Darrick J. Wong	f621324dfb	iomap: fix lockdep complaint when reads fail Zorro Lang reported the following lockdep splat: "While running fstests xfs/556 on kernel 7.0.0-rc4+ (HEAD=04a9f1766954), a lockdep warning was triggered indicating an inconsistent lock state for sb->s_type->i_lock_key. "The deadlock might occur because iomap_read_end_io (called from a hardware interrupt completion path) invokes fserror_report, which then calls igrab. igrab attempts to acquire the i_lock spinlock. However, the i_lock is frequently acquired in process context with interrupts enabled. If an interrupt occurs while a process holds the i_lock, and that interrupt handler calls fserror_report, the system deadlocks. "I hit this warning several times by running xfs/556 (mostly) or generic/648 on xfs. More details refer to below console log." along with this dmesg, for which I've cleaned up the stacktraces: run fstests xfs/556 at 2026-03-18 20:05:30 XFS (sda3): Mounting V5 Filesystem 396e9164-c45a-4e05-be9d-b38c2c5c6477 XFS (sda3): Ending clean mount XFS (sda3): Unmounting Filesystem 396e9164-c45a-4e05-be9d-b38c2c5c6477 XFS (sda3): Mounting V5 Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e XFS (sda3): Ending clean mount XFS (sda3): Unmounting Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e XFS (dm-0): Mounting V5 Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e XFS (dm-0): Ending clean mount device-mapper: table: 253:0: adding target device (start sect 209 len 1) caused an alignment inconsistency device-mapper: table: 253:0: adding target device (start sect 210 len 62914350) caused an alignment inconsistency buffer_io_error: 6 callbacks suppressed Buffer I/O error on dev dm-0, logical block 209, async page read Buffer I/O error on dev dm-0, logical block 209, async page read XFS (dm-0): Unmounting Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e XFS (dm-0): Mounting V5 Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e XFS (dm-0): Ending clean mount ================================ WARNING: inconsistent lock state 7.0.0-rc4+ #1 Tainted: G S W -------------------------------- inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage. od/2368602 [HC1[1]:SC0[0]:HE0:SE1] takes: ff1100069f2b4a98 (&sb->s_type->i_lock_key#31){?.+.}-{3:3}, at: igrab+0x28/0x1a0 {HARDIRQ-ON-W} state was registered at: __lock_acquire+0x40d/0xbd0 lock_acquire.part.0+0xbd/0x260 _raw_spin_lock+0x37/0x80 unlock_new_inode+0x66/0x2a0 xfs_iget+0x67b/0x7b0 [xfs] xfs_mountfs+0xde4/0x1c80 [xfs] xfs_fs_fill_super+0xe86/0x17a0 [xfs] get_tree_bdev_flags+0x312/0x590 vfs_get_tree+0x8d/0x2f0 vfs_cmd_create+0xb2/0x240 __do_sys_fsconfig+0x3d8/0x9a0 do_syscall_64+0x13a/0x1520 entry_SYSCALL_64_after_hwframe+0x76/0x7e irq event stamp: 3118 hardirqs last enabled at (3117): [<ffffffffb54e4ad8>] _raw_spin_unlock_irq+0x28/0x50 hardirqs last disabled at (3118): [<ffffffffb54b84c9>] common_interrupt+0x19/0xe0 softirqs last enabled at (3040): [<ffffffffb290ca28>] handle_softirqs+0x6b8/0x950 softirqs last disabled at (3023): [<ffffffffb290ce4d>] __irq_exit_rcu+0xfd/0x250 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&sb->s_type->i_lock_key#31); <Interrupt> lock(&sb->s_type->i_lock_key#31); * DEADLOCK * 1 lock held by od/2368602: #0: ff1100069f2b4b58 (&sb->s_type->i_mutex_key#19){++++}-{4:4}, at: xfs_ilock+0x324/0x4b0 [xfs] stack backtrace: CPU: 15 UID: 0 PID: 2368602 Comm: od Kdump: loaded Tainted: G S W 7.0.0-rc4+ #1 PREEMPT(full) Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN Hardware name: Dell Inc. PowerEdge R660/0R5JJC, BIOS 2.1.5 03/14/2024 Call Trace: <IRQ> dump_stack_lvl+0x6f/0xb0 print_usage_bug.part.0+0x230/0x2c0 mark_lock_irq+0x3ce/0x5b0 mark_lock+0x1cb/0x3d0 mark_usage+0x109/0x120 __lock_acquire+0x40d/0xbd0 lock_acquire.part.0+0xbd/0x260 _raw_spin_lock+0x37/0x80 igrab+0x28/0x1a0 fserror_report+0x127/0x2d0 iomap_finish_folio_read+0x13c/0x280 iomap_read_end_io+0x10e/0x2c0 clone_endio+0x37e/0x780 [dm_mod] blk_update_request+0x448/0xf00 scsi_end_request+0x74/0x750 scsi_io_completion+0xe9/0x7c0 _scsih_io_done+0x6ba/0x1ca0 [mpt3sas] _base_process_reply_queue+0x249/0x15b0 [mpt3sas] _base_interrupt+0x95/0xe0 [mpt3sas] __handle_irq_event_percpu+0x1f0/0x780 handle_irq_event+0xa9/0x1c0 handle_edge_irq+0x2ef/0x8a0 __common_interrupt+0xa0/0x170 common_interrupt+0xb7/0xe0 </IRQ> <TASK> asm_common_interrupt+0x26/0x40 RIP: 0010:_raw_spin_unlock_irq+0x2e/0x50 Code: 0f 1f 44 00 00 53 48 8b 74 24 08 48 89 fb 48 83 c7 18 e8 b5 73 5e fd 48 89 df e8 ed e2 5e fd e8 08 78 8f fd fb bf 01 00 00 00 <e8> 8d 56 4d fd 65 8b 05 46 d5 1d 03 85 c0 74 06 5b c3 cc cc cc cc RSP: 0018:ffa0000027d07538 EFLAGS: 00000206 RAX: 0000000000000c2d RBX: ffffffffb6614bc8 RCX: 0000000000000080 RDX: 0000000000000000 RSI: ffffffffb6306a01 RDI: 0000000000000001 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 R10: ffffffffb75efc67 R11: 0000000000000001 R12: ff1100015ada0000 R13: 0000000000000083 R14: 0000000000000002 R15: ffffffffb6614c10 folio_wait_bit_common+0x407/0x780 filemap_update_page+0x8e7/0xbd0 filemap_get_pages+0x904/0xc50 filemap_read+0x320/0xc20 xfs_file_buffered_read+0x2aa/0x380 [xfs] xfs_file_read_iter+0x263/0x4a0 [xfs] vfs_read+0x6cb/0xb70 ksys_read+0xf9/0x1d0 do_syscall_64+0x13a/0x1520 Zorro's diagnosis makes sense, so the solution is to kick the failed read handling to a workqueue much like we added for writeback ioends in commit `294f54f849` ("fserror: fix lockdep complaint when igrabbing inode"). Cc: Zorro Lang <zlang@redhat.com> Link: https://lore.kernel.org/linux-xfs/20260319194303.efw4wcu7c4idhthz@doltdoltdolt/ Fixes: `a9d573ee88` ("iomap: report file I/O errors to the VFS") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Link: https://patch.msgid.link/20260323210017.GL6223@frogsfrogsfrogs Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-03-24 09:14:46 +01:00
Joanne Koong	bd71fb3fea	iomap: fix invalid folio access when i_blkbits differs from I/O granularity Commit `aa35dd5cbc` ("iomap: fix invalid folio access after folio_end_read()") partially addressed invalid folio access for folios without an ifs attached, but it did not handle the case where 1 << inode->i_blkbits matches the folio size but is different from the granularity used for the IO, which means IO can be submitted for less than the full folio for the !ifs case. In this case, the condition: if (*bytes_submitted == folio_len) ctx->cur_folio = NULL; in iomap_read_folio_iter() will not invalidate ctx->cur_folio, and iomap_read_end() will still be called on the folio even though the IO helper owns it and will finish the read on it. Fix this by unconditionally invalidating ctx->cur_folio for the !ifs case. Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/linux-fsdevel/b3dfe271-4e3d-4922-b618-e73731242bca@wdc.com/ Fixes: `b2f35ac414` ("iomap: add caller-provided callbacks for read and readahead") Cc: stable@vger.kernel.org Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20260317203935.830549-1-joannelkoong@gmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-03-18 10:42:08 +01:00
Darrick J. Wong	d320f160aa	iomap: reject delalloc mappings during writeback Filesystems should never provide a delayed allocation mapping to writeback; they're supposed to allocate the space before replying. This can lead to weird IO errors and crashes in the block layer if the filesystem is being malicious, or if it hadn't set iomap->dev because it's a delalloc mapping. Fix this by failing writeback on delalloc mappings. Currently no filesystems actually misbehave in this manner, but we ought to be stricter about things like that. Cc: stable@vger.kernel.org # v5.5 Fixes: `598ecfbaa7` ("iomap: lift the xfs writeback code to iomap") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/20260302173002.GL13829@frogsfrogsfrogs Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-03-04 14:31:56 +01:00
Joanne Koong	debc1a492b	iomap: don't mark folio uptodate if read IO has bytes pending If a folio has ifs metadata attached to it and the folio is partially read in through an async IO helper with the rest of it then being read in through post-EOF zeroing or as inline data, and the helper successfully finishes the read first, then post-EOF zeroing / reading inline will mark the folio as uptodate in iomap_set_range_uptodate(). This is a problem because when the read completion path later calls iomap_read_end(), it will call folio_end_read(), which sets the uptodate bit using XOR semantics. Calling folio_end_read() on a folio that was already marked uptodate clears the uptodate bit. Fix this by not marking the folio as uptodate if the read IO has bytes pending. The folio uptodate state will be set in the read completion path through iomap_end_read() -> folio_end_read(). Reported-by: Wei Gao <wegao@suse.com> Suggested-by: Sasha Levin <sashal@kernel.org> Tested-by: Wei Gao <wegao@suse.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: stable@vger.kernel.org # v6.19 Link: https://lore.kernel.org/linux-fsdevel/aYbmy8JdgXwsGaPP@autotest-wegao.qe.prg2.suse.org/ Fixes: `b2f35ac414` ("iomap: add caller-provided callbacks for read and readahead") Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20260303233420.874231-2-joannelkoong@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-03-04 14:18:54 +01:00
Darrick J. Wong	cd3c877d04	iomap: don't report direct-io retries to fserror iomap's directio implementation has two magic errno codes that it uses to signal callers -- ENOTBLK tells the filesystem that it should retry a write with the pagecache; and EAGAIN tells the caller that pagecache flushing or invalidation failed and that it should try again. Neither of these indicate data loss, so let's not report them. Fixes: `a9d573ee88` ("iomap: report file I/O errors to the VFS") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/20260224154637.GD2390381@frogsfrogsfrogs Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-02-26 09:23:22 +01:00
Linus Torvalds	0e335a7745	Merge tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix an uninitialized variable in file_getattr(). The flags_valid field wasn't initialized before calling vfs_fileattr_get(), triggering KMSAN uninit-value reports in fuse - Fix writeback wakeup and logging timeouts when DETECT_HUNG_TASK is not enabled. sysctl_hung_task_timeout_secs is 0 in that case causing spurious "waiting for writeback completion for more than 1 seconds" warnings - Fix a null-ptr-deref in do_statmount() when the mount is internal - Add missing kernel-doc description for the @private parameter in iomap_readahead() - Fix mount namespace creation to hold namespace_sem across the mount copy in create_new_namespace(). The previous drop-and-reacquire pattern was fragile and failed to clean up mount propagation links if the real rootfs was a shared or dependent mount - Fix /proc mount iteration where m->index wasn't updated when m->show() overflows, causing a restart to repeatedly show the same mount entry in a rapidly expanding mount table - Return EFSCORRUPTED instead of ENOSPC in minix_new_inode() when the inode number is out of range - Fix unshare(2) when CLONE_NEWNS is set and current->fs isn't shared. copy_mnt_ns() received the live fs_struct so if a subsequent namespace creation failed the rollback would leave pwd and root pointing to detached mounts. Always allocate a new fs_struct when CLONE_NEWNS is requested - fserror bug fixes: - Remove the unused fsnotify_sb_error() helper now that all callers have been converted to fserror_report_metadata - Fix a lockdep splat in fserror_report() where igrab() takes inode::i_lock which can be held in IRQ context. Replace igrab() with a direct i_count bump since filesystems should not report inodes that are about to be freed or not yet exposed - Handle error pointer in procfs for try_lookup_noperm() - Fix an integer overflow in ep_loop_check_proc() where recursive calls returning INT_MAX would overflow when +1 is added, breaking the recursion depth check - Fix a misleading break in pidfs * tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: pidfs: avoid misleading break eventpoll: Fix integer overflow in ep_loop_check_proc() proc: Fix pointer error dereference fserror: fix lockdep complaint when igrabbing inode fsnotify: drop unused helper unshare: fix unshare_fs() handling minix: Correct errno in minix_new_inode namespace: fix proc mount iteration mount: hold namespace_sem across copy in create_new_namespace() iomap: Describe @private in iomap_readahead() statmount: Fix the null-ptr-deref in do_statmount() writeback: Fix wakeup and logging timeouts for !DETECT_HUNG_TASK fs: init flags_valid before calling vfs_fileattr_get	2026-02-25 10:34:23 -08:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]alloc_objs(., GFP_KERNEL)' \| xargs sed -i 's/\(alloc_objs(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Darrick J. Wong	294f54f849	fserror: fix lockdep complaint when igrabbing inode Christoph Hellwig reported a lockdep splat in generic/108: ================================ WARNING: inconsistent lock state 6.19.0+ #4827 Tainted: G N -------------------------------- inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage. swapper/1/0 [HC1[1]:SC0[0]:HE0:SE1] takes: ffff88811ed1b140 (&sb->s_type->i_lock_key#33){?.+.}-{3:3}, at: igrab+0x1a/0xb0 {HARDIRQ-ON-W} state was registered at: lock_acquire+0xca/0x2c0 _raw_spin_lock+0x2e/0x40 unlock_new_inode+0x2c/0xc0 xfs_iget+0xcf4/0x1080 xfs_trans_metafile_iget+0x3d/0x100 xfs_metafile_iget+0x2b/0x50 xfs_mount_setup_metadir+0x20/0x60 xfs_mountfs+0x457/0xa60 xfs_fs_fill_super+0x6b3/0xa90 get_tree_bdev_flags+0x13c/0x1e0 vfs_get_tree+0x27/0xe0 vfs_cmd_create+0x54/0xe0 __do_sys_fsconfig+0x309/0x620 do_syscall_64+0x8b/0xf80 entry_SYSCALL_64_after_hwframe+0x76/0x7e irq event stamp: 139080 hardirqs last enabled at (139079): [<ffffffff813a923c>] do_idle+0x1ec/0x270 hardirqs last disabled at (139080): [<ffffffff828a8d09>] common_interrupt+0x19/0xe0 softirqs last enabled at (139032): [<ffffffff8134a853>] __irq_exit_rcu+0xc3/0x120 softirqs last disabled at (139025): [<ffffffff8134a853>] __irq_exit_rcu+0xc3/0x120 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&sb->s_type->i_lock_key#33); <Interrupt> lock(&sb->s_type->i_lock_key#33); * DEADLOCK * 1 lock held by swapper/1/0: #0: ffff8881052c81a0 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x4b/0x110 stack backtrace: CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Tainted: G N 6.19.0+ #4827 PREEMPT(full) Tainted: [N]=TEST Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 Call Trace: <IRQ> dump_stack_lvl+0x5b/0x80 print_usage_bug.part.0+0x22c/0x2c0 mark_lock+0xa6f/0xe90 __lock_acquire+0x10b6/0x25e0 lock_acquire+0xca/0x2c0 _raw_spin_lock+0x2e/0x40 igrab+0x1a/0xb0 fserror_report+0x135/0x260 iomap_finish_ioend_buffered+0x170/0x210 clone_endio+0x8f/0x1c0 blk_update_request+0x1e4/0x4d0 blk_mq_end_request+0x1b/0x100 virtblk_done+0x6f/0x110 vring_interrupt+0x59/0x80 __handle_irq_event_percpu+0x8a/0x2e0 handle_irq_event+0x33/0x70 handle_edge_irq+0xdd/0x1e0 __common_interrupt+0x6f/0x180 common_interrupt+0xb7/0xe0 </IRQ> It looks like the concern here is that inode::i_lock is sometimes taken in IRQ context, and sometimes it is held when going to IRQ context, though it's a little difficult to tell since I think this is a kernel from after the actual 6.19 release but before 7.0-rc1. Either way, we don't need to take i_lock, because filesystems should not report files to fserror if they're about to be freed or have not yet been exposed to other threads, because the resulting fsnotify report will be meaningless. Therefore, bump inode::i_count directly and clarify the preconditions on the inode being passed in. Link: https://lore.kernel.org/linux-fsdevel/aY7BndIgQg3ci_6s@infradead.org/ Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/177148129564.716249.3069780698231701540.stgit@frogsfrogsfrogs Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-02-19 09:12:08 +01:00
Hongbo Li	ac83896172	iomap: Describe @private in iomap_readahead() The kernel test rebot reports the kernel-doc warning: ``` Warning: fs/iomap/buffered-io.c:624 function parameter 'private' not described in 'iomap_readahead' ``` The former commit in "iomap: stash iomap read ctx in the private field of iomap_iter" has added a new parameter @private to iomap_readahead(), so let's describe the parameter. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202601261111.vIL9rhgD-lkp@intel.com/ Fixes: `8806f27924` ("iomap: stash iomap read ctx in the private field of iomap_iter") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Link: https://patch.msgid.link/20260213022812.766187-1-lihongbo22@huawei.com Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-02-14 13:24:44 +01:00
Linus Torvalds	4adc13ed7c	Merge tag 'for-7.0/block-stable-pages-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull bounce buffer dio for stable pages from Jens Axboe: "This adds support for bounce buffering of dio for stable pages. This was all done by Christoph. In his words: This series tries to address the problem that under I/O pages can be modified during direct I/O, even when the device or file system require stable pages during I/O to calculate checksums, parity or data operations. It does so by adding block layer helpers to bounce buffer an iov_iter into a bio, then wires that up in iomap and ultimately XFS. The reason that the file system even needs to know about it, is because reads need a user context to copy the data back, and the infrastructure to defer ioends to a workqueue currently sits in XFS. I'm going to look into moving that into ioend and enabling it for other file systems. Additionally btrfs already has it's own infrastructure for this, and actually an urgent need to bounce buffer, so this should be useful there and could be wire up easily. In fact the idea comes from patches by Qu that did this in btrfs. This patch fixes all but one xfstests failures on T10 PI capable devices (generic/095 seems to have issues with a mix of mmap and splice still, I'm looking into that separately), and make qemu VMs running Windows, or Linux with swap enabled fine on an XFS file on a device using PI. Performance numbers on my (not exactly state of the art) NVMe PI test setup: Sequential reads using io_uring, QD=16. Bandwidth and CPU usage (usr/sys): \| size \| zero copy \| bounce \| +------+--------------------------+--------------------------+ \| 4k \| 1316MiB/s (12.65/55.40%) \| 1081MiB/s (11.76/49.78%) \| \| 64K \| 3370MiB/s ( 5.46/18.20%) \| 3365MiB/s ( 4.47/15.68%) \| \| 1M \| 3401MiB/s ( 0.76/23.05%) \| 3400MiB/s ( 0.80/09.06%) \| +------+--------------------------+--------------------------+ Sequential writes using io_uring, QD=16. Bandwidth and CPU usage (usr/sys): \| size \| zero copy \| bounce \| +------+--------------------------+--------------------------+ \| 4k \| 882MiB/s (11.83/33.88%) \| 750MiB/s (10.53/34.08%) \| \| 64K \| 2009MiB/s ( 7.33/15.80%) \| 2007MiB/s ( 7.47/24.71%) \| \| 1M \| 1992MiB/s ( 7.26/ 9.13%) \| 1992MiB/s ( 9.21/19.11%) \| +------+--------------------------+--------------------------+ Note that the 64k read numbers look really odd to me for the baseline zero copy case, but are reproducible over many repeated runs. The bounce read numbers should further improve when moving the PI validation to the file system and removing the double context switch, which I have patches for that will sent out soon" * tag 'for-7.0/block-stable-pages-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: xfs: use bounce buffering direct I/O when the device requires stable pages iomap: add a flag to bounce buffer direct I/O iomap: support ioends for direct reads iomap: rename IOMAP_DIO_DIRTY to IOMAP_DIO_USER_BACKED iomap: free the bio before completing the dio iomap: share code between iomap_dio_bio_end_io and iomap_finish_ioend_direct iomap: split out the per-bio logic from iomap_dio_bio_iter iomap: simplify iomap_dio_bio_iter iomap: fix submission side handling of completion side errors block: add helpers to bounce buffer an iov_iter into bios block: remove bio_release_page iov_iter: extract a iov_iter_extract_bvecs helper from bio code block: open code bio_add_page and fix handling of mismatching P2P ranges block: refactor get_contig_folio_len block: add a BIO_MAX_SIZE constant and use it	2026-02-09 18:14:52 -08:00
Linus Torvalds	0c00ed308d	Merge tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - Support for batch request processing for ublk, improving the efficiency of the kernel/ublk server communication. This can yield nice 7-12% performance improvements - Support for integrity data for ublk - Various other ublk improvements and additions, including a ton of selftests additions and updated - Move the handling of blk-crypto software fallback from below the block layer to above it. This reduces the complexity of dealing with bio splitting - Series fixing a number of potential deadlocks in blk-mq related to the queue usage counter and writeback throttling and rq-qos debugfs handling - Add an async_depth queue attribute, to resolve a performance regression that's been around for a qhilw related to the scheduler depth handling - Only use task_work for IOPOLL completions on NVMe, if it is necessary to do so. An earlier fix for an issue resulted in all these completions being punted to task_work, to guarantee that completions were only run for a given io_uring ring when it was local to that ring. With the new changes, we can detect if it's necessary to use task_work or not, and avoid it if possible. - rnbd fixes: - Fix refcount underflow in device unmap path - Handle PREFLUSH and NOUNMAP flags properly in protocol - Fix server-side bi_size for special IOs - Zero response buffer before use - Fix trace format for flags - Add .release to rnbd_dev_ktype - MD pull requests via Yu Kuai - Fix raid5_run() to return error when log_init() fails - Fix IO hang with degraded array with llbitmap - Fix percpu_ref not resurrected on suspend timeout in llbitmap - Fix GPF in write_page caused by resize race - Fix NULL pointer dereference in process_metadata_update - Fix hang when stopping arrays with metadata through dm-raid - Fix any_working flag handling in raid10_sync_request - Refactor sync/recovery code path, improve error handling for badblocks, and remove unused recovery_disabled field - Consolidate mddev boolean fields into mddev_flags - Use mempool to allocate stripe_request_ctx and make sure max_sectors is not less than io_opt in raid5 - Fix return value of mddev_trylock - Fix memory leak in raid1_run() - Add Li Nan as mdraid reviewer - Move phys_vec definitions to the kernel types, mostly in preparation for some VFIO and RDMA changes - Improve the speed for secure erase for some devices - Various little rust updates - Various other minor fixes, improvements, and cleanups * tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits) blk-mq: ABI/sysfs-block: fix docs build warnings selftests: ublk: organize test directories by test ID block: decouple secure erase size limit from discard size limit block: remove redundant kill_bdev() call in set_blocksize() blk-mq: add documentation for new queue attribute async_dpeth block, bfq: convert to use request_queue->async_depth mq-deadline: covert to use request_queue->async_depth kyber: covert to use request_queue->async_depth blk-mq: add a new queue sysfs attribute async_depth blk-mq: factor out a helper blk_mq_limit_depth() blk-mq-sched: unify elevators checking for async requests block: convert nr_requests to unsigned int block: don't use strcpy to copy blockdev name blk-mq-debugfs: warn about possible deadlock blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs() blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos() blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static blk-rq-qos: fix possible debugfs_mutex deadlock blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter ...	2026-02-09 17:57:21 -08:00
Linus Torvalds	3304b3fedd	Merge tag 'vfs-7.0-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs iomap updates from Christian Brauner: - Erofs page cache sharing preliminaries: Plumb a void private parameter through iomap_read_folio() and iomap_readahead() into iomap_iter->private, matching iomap DIO. Erofs uses this to replace a bogus kmap_to_page() call, as preparatory work for page cache sharing. - Fix for invalid folio access: Fix an invalid folio access when a folio without iomap_folio_state is fully submitted to the IO helper — the helper may call folio_end_read() at any time, so ctx->cur_folio must be invalidated after full submission. tag 'vfs-7.0-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: iomap: fix invalid folio access after folio_end_read() erofs: hold read context in iomap_iter if needed iomap: stash iomap read ctx in the private field of iomap_iter	2026-02-09 15:08:16 -08:00
Linus Torvalds	dd466ea002	Merge tag 'vfs-7.0-rc1.fserror' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs error reporting updates from Christian Brauner: "This contains the changes to support generic I/O error reporting. Filesystems currently have no standard mechanism for reporting metadata corruption and file I/O errors to userspace via fsnotify. Each filesystem (xfs, ext4, erofs, f2fs, etc.) privately defines EFSCORRUPTED, and error reporting to fanotify is inconsistent or absent entirely. This introduces a generic fserror infrastructure built around struct super_block that gives filesystems a standard way to queue metadata and file I/O error reports for delivery to fsnotify. Errors are queued via mempools and queue_work to avoid holding filesystem locks in the notification path; unmount waits for pending events to drain. A new super_operations::report_error callback lets filesystem drivers respond to file I/O errors themselves (to be used by an upcoming XFS self-healing patchset). On the uapi side, EFSCORRUPTED and EUCLEAN are promoted from private per-filesystem definitions to canonical errno.h values across all architectures" * tag 'vfs-7.0-rc1.fserror' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: ext4: convert to new fserror helpers xfs: translate fsdax media errors into file "data lost" errors when convenient xfs: report fs metadata errors via fsnotify iomap: report file I/O errors to the VFS fs: report filesystem and file I/O errors to fsnotify uapi: promote EFSCORRUPTED and EUCLEAN to errno.h	2026-02-09 12:21:37 -08:00
Joanne Koong	aa35dd5cbc	iomap: fix invalid folio access after folio_end_read() If the folio does not have an iomap_folio_state (ifs) attached and the folio gets read in by the filesystem's IO helper, folio_end_read() will be called by the IO helper at any time. For this case, we cannot access the folio after dispatching it to the IO helper, eg subsequent accesses like if (ctx->cur_folio && offset_in_folio(ctx->cur_folio, iter->pos) == 0) { are incorrect. Fix these invalid accesses by invalidating ctx->cur_folio if all bytes of the folio have been read in by the IO helper. This allows us to also remove the +1 bias added for the ifs case. The bias was previously added to ensure that if all bytes are read in, the IO helper does not end the read on the folio until iomap has decremented the bias. Fixes: `b2f35ac414` ("iomap: add caller-provided callbacks for read and readahead") Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20260126224107.2182262-2-joannelkoong@gmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-01-29 13:42:05 +01:00
Christoph Hellwig	c9d114846b	iomap: add a flag to bounce buffer direct I/O Add a new flag that request bounce buffering for direct I/O. This is needed to provide the stable pages requirement requested by devices that need to calculate checksums or parity over the data and allows file systems to properly work with things like T10 protection information. The implementation just calls out to the new bio bounce buffering helpers to allocate a bounce buffer, which is used for I/O and to copy to/from it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-28 05:16:40 -07:00
Christoph Hellwig	d969bd72cf	iomap: support ioends for direct reads Support using the ioend structure to defer I/O completion for direct reads in addition to writes. This requires a check for the operation to not merge reads and writes in iomap_ioend_can_merge. This support will be used for bounce buffered direct I/O reads that need to copy data back to the user address space on read completion. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-28 05:16:40 -07:00
Christoph Hellwig	c96b8b2202	iomap: rename IOMAP_DIO_DIRTY to IOMAP_DIO_USER_BACKED Match the more descriptive iov_iter terminology instead of encoding what we do with them for reads only. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-28 05:16:40 -07:00
Christoph Hellwig	45cec0de6c	iomap: free the bio before completing the dio There are good arguments for processing the user completions ASAP vs. freeing resources ASAP, but freeing the bio first here removes potential use after free hazards when checking flags, and will simplify the upcoming bounce buffer support. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-28 05:16:40 -07:00
Christoph Hellwig	e2fcff5bb4	iomap: share code between iomap_dio_bio_end_io and iomap_finish_ioend_direct Refactor the two per-bio completion handlers to share common code using a new helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-28 05:16:40 -07:00
Christoph Hellwig	2631c94602	iomap: split out the per-bio logic from iomap_dio_bio_iter Factor out a separate helper that builds and submits a single bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-28 05:16:40 -07:00
Christoph Hellwig	6e7a6c8019	iomap: simplify iomap_dio_bio_iter Use iov_iter_count to check if we need to continue as that just reads a field in the iov_iter, and only use bio_iov_vecs_to_alloc to calculate the actual number of vectors to allocate for the bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-28 05:16:39 -07:00
Christoph Hellwig	4ad357e39b	iomap: fix submission side handling of completion side errors The "if (dio->error)" in iomap_dio_bio_iter exists to stop submitting more bios when a completion already return an error. Commit `cfe057f7db` ("iomap_dio_actor(): fix iov_iter bugs") made it revert the iov by "copied", which is very wrong given that we've already consumed that range and submitted a bio for it. Fixes: `cfe057f7db` ("iomap_dio_actor(): fix iov_iter bugs") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-28 05:16:39 -07:00
Christoph Hellwig	561940a7ee	iomap: wait for batched folios to be stable in __iomap_get_folio __iomap_get_folio needs to wait for writeback to finish if the file requires folios to be stable for writes. For the regular path this is taken care of by __filemap_get_folio, but for the newly added batch lookup it has to be done manually. This fixes xfs/131 failures when running on PI-capable hardware. Fixes: `395ed1ef00` ("iomap: optional zero range dirty folio processing") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260113153943.3323869-1-hch@lst.de Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-01-14 17:06:02 +01:00
Hongbo Li	8806f27924	iomap: stash iomap read ctx in the private field of iomap_iter It's useful to get filesystem-specific information using the existing private field in the @iomap_iter passed to iomap_{begin,end} for advanced usage for iomap buffered reads, which is much like the current iomap DIO. For example, EROFS needs it to: - implement an efficient page cache sharing feature, since iomap needs to apply to anon inode page cache but we'd like to get the backing inode/fs instead, so filesystem-specific private data is needed to keep such information; - pass in both struct page * and void * for inline data to avoid kmap_to_page() usage (which is bogus). Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Link: https://patch.msgid.link/20260109102856.598531-2-lihongbo22@huawei.com Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-01-14 16:31:41 +01:00
Darrick J. Wong	a9d573ee88	iomap: report file I/O errors to the VFS Wire up iomap so that it reports all file read and write errors to the VFS (and hence fsnotify) via the new fserror mechanism. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/176826402631.3490369.729008983502742314.stgit@frogsfrogsfrogs Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-01-13 09:58:01 +01:00
Christoph Hellwig	bb8e2019ad	blk-crypto: handle the fallback above the block layer Add a blk_crypto_submit_bio helper that either submits the bio when it is not encrypted or inline encryption is provided, but otherwise handles the encryption before going down into the low-level driver. This reduces the risk from bio reordering and keeps memory allocation as high up in the stack as possible. Note that if the submitter knows that inline enctryption is known to be supported by the underyling driver, it can still use plain submit_bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-11 12:55:41 -07:00
Brian Foster	ed61378b4d	iomap: replace folio_batch allocation with stack allocation Zhang Yi points out that the dynamic folio_batch allocation in iomap_fill_dirty_folios() is problematic for the ext4 on iomap work that is under development because it doesn't sufficiently handle the allocation failure case (by allowing a retry, for example). We've also seen lockdep (via syzbot) complain recently about the scope of the allocation. The dynamic allocation was initially added for simplicity and to help indicate whether the batch was used or not by the calling fs. To address these issues, put the batch on the stack of iomap_zero_range() and use a flag to control whether the batch should be used in the iomap folio lookup path. This keeps things simple and eliminates allocation issues with lockdep and for ext4 on iomap. While here, also clean up the fill helper signature to be more consistent with the underlying filemap helper. Pass through the return value of the filemap helper (folio count) and update the lookup offset via an out param. Fixes: `395ed1ef00` ("iomap: optional zero range dirty folio processing") Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://patch.msgid.link/20251208140548.373411-1-bfoster@redhat.com Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-12-15 15:17:44 +01:00
Linus Torvalds	f2e74ecfba	Merge tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull folio updates from Christian Brauner: "Add a new folio_next_pos() helper function that returns the file position of the first byte after the current folio. This is a common operation in filesystems when needing to know the end of the current folio. The helper is lifted from btrfs which already had its own version, and is now used across multiple filesystems and subsystems: - btrfs - buffer - ext4 - f2fs - gfs2 - iomap - netfs - xfs - mm This fixes a long-standing bug in ocfs2 on 32-bit systems with files larger than 2GiB. Presumably this is not a common configuration, but the fix is backported anyway. The other filesystems did not have bugs, they were just mildly inefficient. This also introduce uoff_t as the unsigned version of loff_t. A recent commit inadvertently changed a comparison from being unsigned (on 64-bit systems) to being signed (which it had always been on 32-bit systems), leading to sporadic fstests failures. Generally file sizes are restricted to being a signed integer, but in places where -1 is passed to indicate "up to the end of the file", it is convenient to have an unsigned type to ensure comparisons are always unsigned regardless of architecture" * tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: Add uoff_t mm: Use folio_next_pos() xfs: Use folio_next_pos() netfs: Use folio_next_pos() iomap: Use folio_next_pos() gfs2: Use folio_next_pos() f2fs: Use folio_next_pos() ext4: Use folio_next_pos() buffer: Use folio_next_pos() btrfs: Use folio_next_pos() filemap: Add folio_next_pos()	2025-12-01 10:26:38 -08:00
Linus Torvalds	b04b2e7a61	Merge tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Cheaper MAY_EXEC handling for path lookup. This elides MAY_WRITE permission checks during path lookup and adds the IOP_FASTPERM_MAY_EXEC flag so filesystems like btrfs can avoid expensive permission work. - Hide dentry_cache behind runtime const machinery. - Add German Maglione as virtiofs co-maintainer. Cleanups: - Tidy up and inline step_into() and walk_component() for improved code generation. - Re-enable IOCB_NOWAIT writes to files. This refactors file timestamp update logic, fixing a layering bypass in btrfs when updating timestamps on device files and improving FMODE_NOCMTIME handling in VFS now that nfsd started using it. - Path lookup optimizations extracting slowpaths into dedicated routines and adding branch prediction hints for mntput_no_expire(), fd_install(), lookup_slow(), and various other hot paths. - Enable clang's -fms-extensions flag, requiring a JFS rename to avoid conflicts. - Remove spurious exports in fs/file_attr.c. - Stop duplicating union pipe_index declaration. This depends on the shared kbuild branch that brings in -fms-extensions support which is merged into this branch. - Use MD5 library instead of crypto_shash in ecryptfs. - Use largest_zero_folio() in iomap_dio_zero(). - Replace simple_strtol/strtoul with kstrtoint/kstrtouint in init and initrd code. - Various typo fixes. Fixes: - Fix emergency sync for btrfs. Btrfs requires an explicit sync_fs() call with wait == 1 to commit super blocks. The emergency sync path never passed this, leaving btrfs data uncommitted during emergency sync. - Use local kmap in watch_queue's post_one_notification(). - Add hint prints in sb_set_blocksize() for LBS dependency on THP" * tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) MAINTAINERS: add German Maglione as virtiofs co-maintainer fs: inline step_into() and walk_component() fs: tidy up step_into() & friends before inlining orangefs: use inode_update_timestamps directly btrfs: fix the comment on btrfs_update_time btrfs: use vfs_utimes to update file timestamps fs: export vfs_utimes fs: lift the FMODE_NOCMTIME check into file_update_time_flags fs: refactor file timestamp update logic include/linux/fs.h: trivial fix: regualr -> regular fs/splice.c: trivial fix: pipes -> pipe's fs: mark lookup_slow() as noinline fs: add predicts based on nd->depth fs: move mntput_no_expire() slowpath into a dedicated routine fs: remove spurious exports in fs/file_attr.c watch_queue: Use local kmap in post_one_notification() fs: touch up predicts in path lookup fs: move fd_install() slowpath into a dedicated routine and provide commentary fs: hide dentry_cache behind runtime const machinery fs: touch predicts in do_dentry_open() ...	2025-12-01 08:44:26 -08:00
Christoph Hellwig	7fd8720dff	iomap: allocate s_dio_done_wq for async reads as well Since commit 222f2c7c6d14 ("iomap: always run error completions in user context"), read error completions are deferred to s_dio_done_wq. This means the workqueue also needs to be allocated for async reads. Fixes: 222f2c7c6d14 ("iomap: always run error completions in user context") Reported-by: syzbot+a2b9a4ed0d61b1efb3f5@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251124140013.902853-1-hch@lst.de Tested-by: syzbot+a2b9a4ed0d61b1efb3f5@syzkaller.appspotmail.com Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-25 10:22:19 +01:00
Joanne Koong	d7ff85d4b8	iomap: fix iomap_read_end() for already uptodate folios There are some cases where when iomap_read_end() is called, the folio may already have been marked uptodate. For example, if the iomap block needed zeroing, then the folio may have been marked uptodate after the zeroing. iomap_read_end() should unlock the folio instead of calling folio_end_read(), which is how these cases were handled prior to commit `f8eaf79406` ("iomap: simplify ->read_folio_range() error handling for reads"). Calling folio_end_read() on an uptodate folio leads to buggy behavior where marking an already uptodate folio as uptodate will XOR it to be marked nonuptodate. Fixes: `f8eaf79406` ("iomap: simplify ->read_folio_range() error handling for reads") Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20251118211111.1027272-2-joannelkoong@gmail.com Tested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-25 10:22:19 +01:00
Christoph Hellwig	76192a42c2	iomap: invert the polarity of IOMAP_DIO_INLINE_COMP Replace IOMAP_DIO_INLINE_COMP with a flag to indicate that the completion should be offloaded. This removes a tiny bit of boilerplate code, but more importantly just makes the code easier to follow as this new flag gets set most of the time and only cleared in one place, while it was the inverse for the old version. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251113170633.1453259-6-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-25 10:22:19 +01:00
Christoph Hellwig	eca9dc2089	iomap: support write completions from interrupt context Completions for pure overwrites don't need to be deferred to a workqueue as there is no work to be done, or at least no work that needs a user context. Set the IOMAP_DIO_INLINE_COMP by default for writes like we already do for reads, and the clear it for all the cases that actually do need a user context for completions to update the inode size or record updates to the logical to physical mapping. I've audited all users of the ->end_io callback, and they only require user context for I/O that involves unwritten extents, COW, size extensions, or error handling and all those are still run from workqueue context. This restores the behavior of the old pre-iomap direct I/O code. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251113170633.1453259-5-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-25 10:22:19 +01:00
Christoph Hellwig	29086a31b3	iomap: rework REQ_FUA selection The way how iomap_dio_can_use_fua and the caller is structured is a bit confusing, as the main guarding condition is hidden in the helper, and the secondary conditions are split between caller and callee. Refactor the code, so that iomap_dio_bio_iter itself tracks if a write might need metadata updates based on the iomap type and flags, and then have a condition based on that to use the FUA flag. Note that this also moves the REQ_OP_WRITE assignment to the end of the branch to improve readability a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251113170633.1453259-4-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-25 10:22:18 +01:00
Christoph Hellwig	ddb4873286	iomap: always run error completions in user context At least zonefs expects error completions to be able to sleep. Because error completions aren't performance critical, just defer them to workqueue context unconditionally. Fixes: `8dcc1a9d90` ("fs: New zonefs file system") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251113170633.1453259-3-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-25 10:22:18 +01:00
Christoph Hellwig	f9f8514999	fs, iomap: remove IOCB_DIO_CALLER_COMP This was added by commit `099ada2c87` ("io_uring/rw: add write support for IOCB_DIO_CALLER_COMP") and disabled a little later by commit `838b35bb6a` ("io_uring/rw: disable IOCB_DIO_CALLER_COMP") because it didn't work. Remove all the related code that sat unused for 2 years. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251113170633.1453259-2-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-25 10:22:18 +01:00
Joanne Koong	b56c1c54f2	iomap: use find_next_bit() for uptodate bitmap scanning Use find_next_bit()/find_next_zero_bit() for iomap uptodate bitmap scanning. This uses __ffs() internally and is more efficient for finding the next uptodate or non-uptodate bit than iterating through the the bitmap range testing every bit. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20251111193658.3495942-10-joannelkoong@gmail.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-25 10:22:10 +01:00
Joanne Koong	fed9c62d28	iomap: use find_next_bit() for dirty bitmap scanning Use find_next_bit()/find_next_zero_bit() for iomap dirty bitmap scanning. This uses __ffs() internally and is more efficient for finding the next dirty or clean bit than iterating through the bitmap range testing every bit. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20251111193658.3495942-9-joannelkoong@gmail.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-25 10:22:10 +01:00
Joanne Koong	a298febc47	iomap: simplify when reads can be skipped for writes Currently, the logic for skipping the read range for a write is if (!(iter->flags & IOMAP_UNSHARE) && (from <= poff \|\| from >= poff + plen) && (to <= poff \|\| to >= poff + plen)) which breaks down to skipping the read if any of these are true: a) from <= poff && to <= poff b) from <= poff && to >= poff + plen c) from >= poff + plen && to <= poff d) from >= poff + plen && to >= poff + plen This can be simplified to if (!(iter->flags & IOMAP_UNSHARE) && from <= poff && to >= poff + plen) from the following reasoning: a) from <= poff && to <= poff This reduces to 'to <= poff' since it is guaranteed that 'from <= to' (since to = from + len). It is not possible for 'from <= to' to be true here because we only reach here if plen > 0 (thanks to the preceding 'if (plen == 0)' check that would break us out of the loop). If 'to <= poff', plen would have to be 0 since poff and plen get adjusted in lockstep for uptodate blocks. This means we can eliminate this check. c) from >= poff + plen && to <= poff This is not possible since 'from <= to' and 'plen > 0'. We can eliminate this check. d) from >= poff + plen && to >= poff + plen This reduces to 'from >= poff + plen' since 'from <= to'. It is not possible for 'from >= poff + plen' to be true here. We only reach here if plen > 0 and for writes, poff and plen will always be block-aligned, which means poff <= from < poff + plen. We can eliminate this check. The only valid check is b) from <= poff && to >= poff + plen. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20251111193658.3495942-7-joannelkoong@gmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-12 10:50:32 +01:00
Joanne Koong	f8eaf79406	iomap: simplify ->read_folio_range() error handling for reads Instead of requiring that the caller calls iomap_finish_folio_read() even if the ->read_folio_range() callback returns an error, account for this internally in iomap instead, which makes the interface simpler and makes it match writeback's ->read_folio_range() error handling expectations. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20251111193658.3495942-6-joannelkoong@gmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-12 10:50:32 +01:00
Joanne Koong	6b1fd2281f	iomap: optimize pending async writeback accounting Pending writebacks must be accounted for to determine when all requests have completed and writeback on the folio should be ended. Currently this is done by atomically incrementing ifs->write_bytes_pending for every range to be written back. Instead, the number of atomic operations can be minimized by setting ifs->write_bytes_pending to the folio size, internally tracking how many bytes are written back asynchronously, and then after sending off all the requests, decrementing ifs->write_bytes_pending by the number of bytes not written back asynchronously. Now, for N ranges written back, only N + 2 atomic operations are required instead of 2N + 2. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20251111193658.3495942-5-joannelkoong@gmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-12 10:50:32 +01:00
Joanne Koong	9d875e0eef	iomap: account for unaligned end offsets when truncating read range The end position to start truncating from may be at an offset into a block, which under the current logic would result in overtruncation. Adjust the calculation to account for unaligned end offsets. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20251111193658.3495942-3-joannelkoong@gmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-12 10:50:31 +01:00
Joanne Koong	a0f1cabe29	iomap: rename bytes_pending/bytes_accounted to bytes_submitted/bytes_not_submitted The naming "bytes_pending" and "bytes_accounted" may be confusing and could be better named. Rename this to "bytes_submitted" and "bytes_not_submitted" to make it more clear that these are bytes we passed to the IO helper to read in. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20251111193658.3495942-2-joannelkoong@gmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-12 10:50:31 +01:00
Qu Wenruo	001397f5ef	iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag Btrfs requires all of its bios to be fs block aligned, normally it's totally fine but with the incoming block size larger than page size (bs > ps) support, the requirement is no longer met for direct IOs. Because iomap_dio_bio_iter() calls bio_iov_iter_get_pages(), only requiring alignment to be bdev_logical_block_size(). In the real world that value is either 512 or 4K, on 4K page sized systems it means bio_iov_iter_get_pages() can break the bio at any page boundary, breaking btrfs' requirement for bs > ps cases. To address this problem, introduce a new public iomap dio flag, IOMAP_DIO_FSBLOCK_ALIGNED. When calling __iomap_dio_rw() with that new flag, iomap_dio::flags will inherit that new flag, and iomap_dio_bio_iter() will take fs block size into the calculation of the alignment, and pass the alignment to bio_iov_iter_get_pages(), respecting the fs block size requirement. The initial user of this flag will be btrfs, which needs to calculate the checksum for direct read and thus requires the biovec to be fs block aligned for the incoming bs > ps support. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> [hch: also align pos/len, incorporate the trace flags from Darrick] Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251031131045.1613229-2-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-05 13:09:27 +01:00
Brian Foster	39be21386d	iomap: remove old partial eof zeroing optimization iomap_zero_range() optimizes the partial eof block zeroing use case by force zeroing if the mapping is dirty. This is to avoid frequent flushing on file extending workloads, which hurts performance. Now that the folio batch mechanism provides a more generic solution and is used by the only real zero range user (XFS), this isolated optimization is no longer needed. Remove the unnecessary code and let callers use the folio batch or fall back to flushing by default. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-05 12:57:25 +01:00
Brian Foster	395ed1ef00	iomap: optional zero range dirty folio processing The only way zero range can currently process unwritten mappings with dirty pagecache is to check whether the range is dirty before mapping lookup and then flush when at least one underlying mapping is unwritten. This ordering is required to prevent iomap lookup from racing with folio writeback and reclaim. Since zero range can skip ranges of unwritten mappings that are clean in cache, this operation can be improved by allowing the filesystem to provide a set of dirty folios that require zeroing. In turn, rather than flush or iterate file offsets, zero range can iterate on folios in the batch and advance over clean or uncached ranges in between. Add a folio_batch in struct iomap and provide a helper for filesystems to populate the batch at lookup time. Update the folio lookup path to return the next folio in the batch, if provided, and advance the iter if the folio starts beyond the current offset. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-05 12:57:24 +01:00
Brian Foster	49590716be	iomap: remove pos+len BUG_ON() to after folio lookup The bug checks at the top of iomap_write_begin() assume the pos/len reflect exactly the next range to process. This may no longer be the case once the get folio path is able to process a folio batch from the filesystem. On top of that, len is already trimmed to within the iomap/srcmap by iomap_length(), so these checks aren't terribly useful. Remove the unnecessary BUG_ON() checks. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-05 12:57:24 +01:00
Joanne Koong	d4e88bb08e	iomap: make iomap_read_folio() a void return No errors are propagated in iomap_read_folio(). Change iomap_read_folio() to a void return to make this clearer to callers. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-05 12:57:23 +01:00
Christoph Hellwig [1]	c2b1adc462	iomap: move buffered io bio logic into new file Move bio logic in the buffered io code into its own file and remove CONFIG_BLOCK gating for iomap read/readahead. [1] https://lore.kernel.org/linux-fsdevel/aMK2GuumUf93ep99@infradead.org/ Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-05 12:57:23 +01:00

1 2 3 4 5 ...

516 Commits