Commit Graph

103187 Commits

Author SHA1 Message Date
Linus Torvalds
fcb70a56f4 Merge tag 'vfs-6.19-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:

 - Fix the the buggy conversion of fuse_reverse_inval_entry() introduced
   during the creation rework

 - Disallow nfs delegation requests for directories by setting
   simple_nosetlease()

 - Require an opt-in for getting readdir flag bits outside of S_DT_MASK
   set in d_type

 - Fix scheduling delayed writeback work by only scheduling when the
   dirty time expiry interval is non-zero and cancel the delayed work if
   the interval is set to zero

 - Use rounded_jiffies_interval for dirty time work

 - Check the return value of sb_set_blocksize() for romfs

 - Wait for batched folios to be stable in __iomap_get_folio()

 - Use private naming for fuse hash size

 - Fix the stale dentry cleanup to prevent a race that causes a UAF

* tag 'vfs-6.19-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  vfs: document d_dispose_if_unused()
  fuse: shrink once after all buckets have been scanned
  fuse: clean up fuse_dentry_tree_work()
  fuse: add need_resched() before unlocking bucket
  fuse: make sure dentry is evicted if stale
  fuse: fix race when disposing stale dentries
  fuse: use private naming for fuse hash size
  writeback: use round_jiffies_relative for dirtytime_work
  iomap: wait for batched folios to be stable in __iomap_get_folio
  romfs: check sb_set_blocksize() return value
  docs: clarify that dirtytime_expire_seconds=0 disables writeback
  writeback: fix 100% CPU usage when dirtytime_expire_interval is 0
  readdir: require opt-in for d_type flags
  vboxsf: don't allow delegations to be set on directories
  ceph: don't allow delegations to be set on directories
  gfs2: don't allow delegations to be set on directories
  9p: don't allow delegations to be set on directories
  smb/client: properly disallow delegations on directories
  nfs: properly disallow delegation requests on directories
  fuse: fix conversion of fuse_reverse_inval_entry() to start_removing()
2026-01-26 09:30:48 -08:00
Linus Torvalds
6d06443237 Merge tag 'v6.19-rc6-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:

 - Use the original nents value for ib_dma_unmap_sg(), preventing
   potential memory corruption in the RDMA transport layer

 - Fix a naming discrepancy in the kernel-doc for
   ksmbd_vfs_kern_path_start_removing() as identified by sparse static
   analysis

 - Reset smb_direct_port to its default value during initialization to
   ensure the correct port is used when switching between different RDMA
   device types without module reload

* tag 'v6.19-rc6-server-fixes' of git://git.samba.org/ksmbd:
  smb: server: reset smb_direct_port = SMB_DIRECT_PORT_INFINIBAND on init
  smb: server: fix comment for ksmbd_vfs_kern_path_start_removing()
  ksmbd: smbd: fix dma_unmap_sg() nents
2026-01-23 13:40:55 -08:00
Stefan Metzmacher
5914d98ff0 smb: server: reset smb_direct_port = SMB_DIRECT_PORT_INFINIBAND on init
This allows testing with different devices (iwrap vs. non-iwarp) without
'rmmod ksmbd && modprobe ksmbd', but instead
'ksmbd.control -s && ksmbd.mountd' is enough.

In the long run we want to listen on iwarp and non-iwarp at the same time,
but requires more changes, most likely also in the rdma layer.

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2026-01-22 18:13:44 -06:00
Stefan Metzmacher
8e50cd059c smb: server: fix comment for ksmbd_vfs_kern_path_start_removing()
This was found by sparse...

Fixes: 1ead2213dd ("smb/server: use end_removing_noperm for for target of smb2_create_link()")
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: NeilBrown <neil@brown.name>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2026-01-22 18:13:44 -06:00
Thomas Fourier
98e3e2b561 ksmbd: smbd: fix dma_unmap_sg() nents
The dma_unmap_sg() functions should be called with the same nents as the
dma_map_sg(), not the value the map function returned.

Fixes: 0626e6641f ("cifsd: add server handler for central processing and tranport layers")
Cc: <stable@vger.kernel.org>
Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2026-01-22 18:13:44 -06:00
Linus Torvalds
07eebd934c Merge tag 'for-6.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:

 - protect reading super block vs setting block size externally (found
   by syzbot)

 - make sure no transaction is started in read-only mode even with some
   rescue mount option combinations

 - fix checksum calculation of backup super blocks when block-group-tree
   is enabled

 - more extensive mount-time checks of device items that could be left
   after device replace and attempting degraded mount

 - fix build warning with -Wmaybe-uninitialized on loongarch64-gcc 12

* tag 'for-6.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: add extra device item checks at mount
  btrfs: fix missing fields in superblock backup with BLOCK_GROUP_TREE
  btrfs: reject new transactions if the fs is fully read-only
  btrfs: sync read disk super and set block size
  btrfs: fix Wmaybe-uninitialized warning in replay_one_buffer()
2026-01-21 08:42:34 -08:00
Qu Wenruo
3430818739 btrfs: add extra device item checks at mount
[BUG]
There is a bug report where after a dev-replace, the replace source
device with devid 4 is properly erased (dump tree shows it's the old
devid 4), but the target device is still using devid 0.

When the user tries to mount the fs degraded, the mount failed with the
following errors:

  BTRFS: device fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 devid 5 transid 1394395 /dev/sda (8:0) scanned by btrfs (261)
  BTRFS: device fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 devid 6 transid 1394395 /dev/sde (8:64) scanned by btrfs (261)
  BTRFS: device fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 devid 0 transid 1394395 /dev/sdd (8:48) scanned by btrfs (261)
  BTRFS: device fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 devid 3 transid 1394395 /dev/sdf (8:80) scanned by btrfs (261)
  BTRFS info (device sdd): first mount of filesystem 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9
  BTRFS info (device sdd): using crc32c (crc32c-intel) checksum algorithm
  BTRFS warning (device sdd): devid 4 uuid 01e2081c-9c2a-4071-b9f4-e1b27e571ff5 is missing
  BTRFS info (device sdd): bdev <missing disk> errs: wr 84994544, rd 15567, flush 65872, corrupt 0, gen 0
  BTRFS info (device sdd): bdev /dev/sdd errs: wr 71489901, rd 0, flush 30001, corrupt 0, gen 0
  BTRFS error (device sdd): replace without active item, run 'device scan --forget' on the target device
  BTRFS error (device sdd): failed to init dev_replace: -117
  BTRFS error (device sdd): open_ctree failed: -117

[CAUSE]
The devid 0 didn't get its devid updated is its own problem, here I'm
only focusing on the mount failure itself.

The mount is not caused by the missing device, as the fs has RAID1C3 for
metadata and RAID10 for data, thus is completely able to tolerate one
missing device.

The device tree shows the dev-replace has properly finished:

        item 7 key (0 DEV_REPLACE 0) itemoff 15931 itemsize 72
                src devid -1 cursor left 11091821199360 cursor right 11091821199360 mode ALWAYS
                state FINISHED write errors 0 uncorrectable read errors 0
		      ^^^^^^^^

And the chunk tree shows there is no devid 0:

  leaf 37980736602112 items 23 free space 12548 generation 1394388 owner CHUNK_TREE
  leaf 37980736602112 flags 0x1(WRITTEN) backref revision 1
  fs uuid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9
  chunk uuid d074c661-6311-4570-b59f-a5c83fd37f8e
         item 0 key (DEV_ITEMS DEV_ITEM 3) itemoff 16185 itemsize 98
                 devid 3 total_bytes 20000588955648 bytes_used 8282877984768
                 io_align 4096 io_width 4096 sector_size 4096 type 0
                 generation 0 start_offset 0 dev_group 0
                 seek_speed 0 bandwidth 0
                 uuid 0d596b69-fb0d-4031-b4af-a301d0868b8b
                 fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9
         ...

Which shows the first device is devid 3.

But there is indeed /dev/sdd with devid 0:

  superblock: bytenr=65536, device=/dev/sdd
  ---------------------------------------------------------
  csum_type               0 (crc32c)
  csum_size               4
  csum                    0xd4bed87e [match]
  bytenr                  65536
  flags                   0x1
                          ( WRITTEN )
  magic                   _BHRfS_M [match]
  fsid                    84a1ed4a-365c-45c3-a9ee-a7df525dc3c9
  ...
  uuid_tree_generation    1394388
  dev_item.uuid           ee6532ad-5442-45f7-87fb-7703e29ed934
  dev_item.fsid           84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 [match]
  dev_item.type           0
  dev_item.total_bytes    20000588955648
  dev_item.bytes_used     8292541661184
  dev_item.io_align       0
  dev_item.io_width       0
  dev_item.sector_size    0
  dev_item.devid          0 <<<

So this means device scan will register sdd as devid 0 into the fs, then
during btrfs_init_dev_replace(), we located the replace progress item,
found the previous replace is finished, but we still need to check if
the dev-replace target device (devid 0) exists.

If that device exists, we error out showing that error message.

But to be honest the end user may not really remember which device is
the replace target device, thus not sure what to do in the next step.

[ENHANCEMENT]
To make the error more obvious, and tell the end user which devices
should be unregistered:

- Introduce BTRFS_DEV_STATE_ITEM_FOUND flag
  During device item read from the chunk tree, set the flag for each
  found device item.

- Verify there is no device without the above flag during mount
  Even missing device should have that flag set.
  If we found a device without that flag set, it means it's an
  unexpected one and should be rejected.

- More detailed error message on what to do next
  This will show all unexpected devices and tell the end user to use
  'btrfs dev scan --forget' to forget them or remove them before mount.

There is an example dmesg where a device of a valid filesystem is modified to
have devid 0, then try degraded mount:

  BTRFS info (device dm-6): first mount of filesystem 7c873869-844c-4b39-bd75-a96148bf4656
  BTRFS info (device dm-6): using crc32c checksum algorithm
  BTRFS warning (device dm-6): devid 3 uuid b4a9f35b-db42-4ac4-b55a-cbf81d3b9683 is missing
  BTRFS error (device dm-6): devid 0 path /dev/mapper/test-scratch3 is registered but not found in chunk tree
  BTRFS error (device dm-6): please remove above devices or use 'btrfs device scan --forget <dev>' to unregister them before mount
  BTRFS error (device dm-6): open_ctree failed: -117

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-20 17:18:48 +01:00
Mark Harmstone
1d8f69f453 btrfs: fix missing fields in superblock backup with BLOCK_GROUP_TREE
When the BLOCK_GROUP_TREE compat_ro flag is set, the extent root and
csum root fields are getting missed.

This is because EXTENT_TREE_V2 treated these differently, and when
they were split off this special-casing was mistakenly assigned to
BGT rather than the rump EXTENT_TREE_V2. There's no reason why the
existence of the block group tree should mean that we don't record the
details of the last commit's extent root and csum root.

Fix the code in backup_super_roots() so that the correct check gets
made.

Fixes: 1c56ab9919 ("btrfs: separate BLOCK_GROUP_TREE compat RO flag from EXTENT_TREE_V2")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-20 17:18:47 +01:00
Qu Wenruo
1972f44c18 btrfs: reject new transactions if the fs is fully read-only
[BUG]
There is a bug report where a heavily fuzzed fs is mounted with all
rescue mount options, which leads to the following warnings during
unmount:

  BTRFS: Transaction aborted (error -22)
  Modules linked in:
  CPU: 0 UID: 0 PID: 9758 Comm: repro.out Not tainted
  6.19.0-rc5-00002-gb71e635feefc #7 PREEMPT(full)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  RIP: 0010:find_free_extent_update_loop fs/btrfs/extent-tree.c:4208 [inline]
  RIP: 0010:find_free_extent+0x52f0/0x5d20 fs/btrfs/extent-tree.c:4611
  Call Trace:
   <TASK>
   btrfs_reserve_extent+0x2cd/0x790 fs/btrfs/extent-tree.c:4705
   btrfs_alloc_tree_block+0x1e1/0x10e0 fs/btrfs/extent-tree.c:5157
   btrfs_force_cow_block+0x578/0x2410 fs/btrfs/ctree.c:517
   btrfs_cow_block+0x3c4/0xa80 fs/btrfs/ctree.c:708
   btrfs_search_slot+0xcad/0x2b50 fs/btrfs/ctree.c:2130
   btrfs_truncate_inode_items+0x45d/0x2350 fs/btrfs/inode-item.c:499
   btrfs_evict_inode+0x923/0xe70 fs/btrfs/inode.c:5628
   evict+0x5f4/0xae0 fs/inode.c:837
   __dentry_kill+0x209/0x660 fs/dcache.c:670
   finish_dput+0xc9/0x480 fs/dcache.c:879
   shrink_dcache_for_umount+0xa0/0x170 fs/dcache.c:1661
   generic_shutdown_super+0x67/0x2c0 fs/super.c:621
   kill_anon_super+0x3b/0x70 fs/super.c:1289
   btrfs_kill_super+0x41/0x50 fs/btrfs/super.c:2127
   deactivate_locked_super+0xbc/0x130 fs/super.c:474
   cleanup_mnt+0x425/0x4c0 fs/namespace.c:1318
   task_work_run+0x1d4/0x260 kernel/task_work.c:233
   exit_task_work include/linux/task_work.h:40 [inline]
   do_exit+0x694/0x22f0 kernel/exit.c:971
   do_group_exit+0x21c/0x2d0 kernel/exit.c:1112
   __do_sys_exit_group kernel/exit.c:1123 [inline]
   __se_sys_exit_group kernel/exit.c:1121 [inline]
   __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1121
   x64_sys_call+0x2210/0x2210 arch/x86/include/generated/asm/syscalls_64.h:232
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0xe8/0xf80 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x44f639
  Code: Unable to access opcode bytes at 0x44f60f.
  RSP: 002b:00007ffc15c4e088 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
  RAX: ffffffffffffffda RBX: 00000000004c32f0 RCX: 000000000044f639
  RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000001
  RBP: 0000000000000001 R08: ffffffffffffffc0 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004c32f0
  R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
   </TASK>

Since rescue mount options will mark the full fs read-only, there should
be no new transaction triggered.

But during unmount we will evict all inodes, which can trigger a new
transaction, and triggers warnings on a heavily corrupted fs.

[CAUSE]
Btrfs allows new transaction even on a read-only fs, this is to allow
log replay happen even on read-only mounts, just like what ext4/xfs do.

However with rescue mount options, the fs is fully read-only and cannot
be remounted read-write, thus in that case we should also reject any new
transactions.

[FIX]
If we find the fs has rescue mount options, we should treat the fs as
error, so that no new transaction can be started.

Reported-by: Jiaming Zhang <r772577952@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CANypQFYw8Nt8stgbhoycFojOoUmt+BoZ-z8WJOZVxcogDdwm=Q@mail.gmail.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-20 17:18:47 +01:00
Edward Adam Davis
3f29d661e5 btrfs: sync read disk super and set block size
When the user performs a btrfs mount, the block device is not set
correctly. The user sets the block size of the block device to 0x4000
by executing the BLKBSZSET command.
Since the block size change also changes the mapping->flags value, this
further affects the result of the mapping_min_folio_order() calculation.

Let's analyze the following two scenarios:

Scenario 1: Without executing the BLKBSZSET command, the block size is
0x1000, and mapping_min_folio_order() returns 0;

Scenario 2: After executing the BLKBSZSET command, the block size is
0x4000, and mapping_min_folio_order() returns 2.

do_read_cache_folio() allocates a folio before the BLKBSZSET command
is executed. This results in the allocated folio having an order value
of 0. Later, after BLKBSZSET is executed, the block size increases to
0x4000, and the mapping_min_folio_order() calculation result becomes 2.

This leads to two undesirable consequences:

1. filemap_add_folio() triggers a VM_BUG_ON_FOLIO(folio_order(folio) <
mapping_min_folio_order(mapping)) assertion.

2. The syzbot report [1] shows a null pointer dereference in
create_empty_buffers() due to a buffer head allocation failure.

Synchronization should be established based on the inode between the
BLKBSZSET command and read cache page to prevent inconsistencies in
block size or mapping flags before and after folio allocation.

[1]
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:create_empty_buffers+0x4d/0x480 fs/buffer.c:1694
Call Trace:
 folio_create_buffers+0x109/0x150 fs/buffer.c:1802
 block_read_full_folio+0x14c/0x850 fs/buffer.c:2403
 filemap_read_folio+0xc8/0x2a0 mm/filemap.c:2496
 do_read_cache_folio+0x266/0x5c0 mm/filemap.c:4096
 do_read_cache_page mm/filemap.c:4162 [inline]
 read_cache_page_gfp+0x29/0x120 mm/filemap.c:4195
 btrfs_read_disk_super+0x192/0x500 fs/btrfs/volumes.c:1367

Reported-by: syzbot+b4a2af3000eaa84d95d5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b4a2af3000eaa84d95d5
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-20 17:18:47 +01:00
Qiang Ma
9c7e71c97c btrfs: fix Wmaybe-uninitialized warning in replay_one_buffer()
Warning was found when compiling using loongarch64-gcc 12.3.1:

  $ make CFLAGS_tree-log.o=-Wmaybe-uninitialized

  In file included from fs/btrfs/ctree.h:21,
		   from fs/btrfs/tree-log.c:12:
  fs/btrfs/accessors.h: In function 'replay_one_buffer':
  fs/btrfs/accessors.h:66:16: warning: 'inode_item' may be used uninitialized [-Wmaybe-uninitialized]
     66 |         return btrfs_get_##bits(eb, s, offsetof(type, member));         \
	|                ^~~~~~~~~~
  fs/btrfs/tree-log.c:2803:42: note: 'inode_item' declared here
   2803 |                 struct btrfs_inode_item *inode_item;
	|                                          ^~~~~~~~~~

Initialize the inode_item to NULL, the compiler does not seem to see the
relation between the first 'wc->log_key.type == BTRFS_INODE_ITEM_KEY'
check and the other one that also checks the replay phase.

Signed-off-by: Qiang Ma <maqianga@uniontech.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-20 01:41:47 +01:00
Joanne Koong
f9a49aa302 fs/writeback: skip AS_NO_DATA_INTEGRITY mappings in wait_sb_inodes()
Above the while() loop in wait_sb_inodes(), we document that we must wait
for all pages under writeback for data integrity.  Consequently, if a
mapping, like fuse, traditionally does not have data integrity semantics,
there is no need to wait at all; we can simply skip these inodes.

This restores fuse back to prior behavior where syncs are no-ops.  This
fixes a user regression where if a system is running a faulty fuse server
that does not reply to issued write requests, this causes wait_sb_inodes()
to wait forever.

Link: https://lkml.kernel.org/r/20260105211737.4105620-2-joannelkoong@gmail.com
Fixes: 0c58a97f91 ("fuse: remove tmp folio for writebacks and internal rb tree")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reported-by: Athul Krishna <athul.krishna.kr@protonmail.com>
Reported-by: J. Neuschäfer <j.neuschaefer@gmx.net>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Tested-by: J. Neuschäfer <j.neuschaefer@gmx.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Bernd Schubert <bschubert@ddn.com>
Cc: Bonaccorso Salvatore <carnil@debian.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-19 12:30:01 -08:00
Linus Torvalds
f8907398a6 Merge tag 'ext4_for_linus-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:

 - Fix an inconsistency in structure size on 32-bit platforms caused by
   padding differences for the new EXT4_IOC_[GS]ET_TUNE_SB_PARAM ioctls

 - Fix a buffer leak on the error path when dropping the refcount an
   xattr value stored in an inode

 - Fix missing locking on the error path for the file defragmentation
   ioctl leading to a BUG

* tag 'ext4_for_linus-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix iloc.bh leak in ext4_xattr_inode_update_ref
  ext4: add missing down_write_data_sem in mext_move_extent().
  ext4: fix ext4_tune_sb_params padding
2026-01-18 14:01:20 -08:00
Yang Erkun
d250bdf531 ext4: fix iloc.bh leak in ext4_xattr_inode_update_ref
The error branch for ext4_xattr_inode_update_ref forget to release the
refcount for iloc.bh. Find this when review code.

Fixes: 57295e8354 ("ext4: guard against EA inode refcount underflow in xattr update")
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20251213055706.3417529-1-yangerkun@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2026-01-18 11:23:10 -05:00
Julian Sun
0ef7ef4227 ext4: add missing down_write_data_sem in mext_move_extent().
Commit 962e8a01ea ("ext4: introduce mext_move_extent()") attempts to
call ext4_swap_extents() on the failure path to recover the swapped
extents, but fails to acquire locks for the two inode->i_data_sem,
triggering the BUG_ON statement in ext4_swap_extents().

This issue can be fixed by calling ext4_double_down_write_data_sem()
before ext4_swap_extents().

Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Reported-by: syzbot+4ea6bd8737669b423aae@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69368649.a70a0220.38f243.0093.GAE@google.com/
Fixes: 962e8a01ea ("ext4: introduce mext_move_extent()")
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/20251208123713.1971068-1-sunjunchao@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-18 11:23:10 -05:00
Linus Torvalds
e84d960149 Merge tag 'for-6.19-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:

 - with large folios in use, fix partial incorrect update of a reflinked
   range

 - fix potential deadlock in iget when lookup fails and eviction is
   needed

 - in send, validate inline extent type while detecting file holes

 - fix memory leak after an error when creating a space info

 - remove zone statistics from sysfs again, the output size limitations
   make it unusable, we'll do it in another way in another release

 - test fixes:
     - return proper error codes from block remapping tests
     - fix tree root leaks in qgroup tests after errors

* tag 'for-6.19-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: remove zoned statistics from sysfs
  btrfs: fix memory leaks in create_space_info() error paths
  btrfs: invalidate pages instead of truncate after reflinking
  btrfs: update the Kconfig string for CONFIG_BTRFS_EXPERIMENTAL
  btrfs: send: check for inline extents in range_is_hole_in_parent()
  btrfs: tests: fix return 0 on rmap test failure
  btrfs: tests: fix root tree leak in btrfs_test_qgroups()
  btrfs: release path before iget_failed() in btrfs_read_locked_inode()
2026-01-17 19:29:32 -08:00
Miklos Szeredi
79d11311f6 vfs: document d_dispose_if_unused()
Add a warning about the danger of using this function without proper
locking preventing eviction.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://patch.msgid.link/20260114145344.468856-7-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-16 19:15:14 +01:00
Miklos Szeredi
fa79401a9c fuse: shrink once after all buckets have been scanned
In fuse_dentry_tree_work() move the shrink_dentry_list() out from the loop.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://patch.msgid.link/20260114145344.468856-6-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-16 19:15:14 +01:00
Miklos Szeredi
3926746b55 fuse: clean up fuse_dentry_tree_work()
- Change time_after64() time_before64(), since the latter is exclusively
  used in this file to compare dentry/inode timeout with current time.

- Move the break statement from the else branch to the if branch, reducing
  indentation.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://patch.msgid.link/20260114145344.468856-5-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-16 19:15:14 +01:00
Miklos Szeredi
09f7a43ae5 fuse: add need_resched() before unlocking bucket
In fuse_dentry_tree_work() no need to unlock/lock dentry_hash[i].lock on
each iteration.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://patch.msgid.link/20260114145344.468856-4-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-16 19:15:14 +01:00
Miklos Szeredi
1e2c1af1be fuse: make sure dentry is evicted if stale
d_dispose_if_unused() may find the dentry with a positive refcount, in
which case it won't be put on the dispose list even though it has already
timed out.

"Reinstall" the d_delete() callback, which was optimized out in
fuse_dentry_settime().  This will result in the dentry being evicted as
soon as the refcount hits zero.

Fixes: ab84ad5973 ("fuse: new work queue to periodically invalidate expired dentries")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://patch.msgid.link/20260114145344.468856-3-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-16 19:15:14 +01:00
Miklos Szeredi
cb8d2bdcb8 fuse: fix race when disposing stale dentries
In fuse_dentry_tree_work() just before d_dispose_if_unused() the dentry
could get evicted, resulting in UAF.

Move unlocking dentry_hash[i].lock to after the dispose.  To do this,
fuse_dentry_tree_del_node() needs to be moved from fuse_dentry_prune() to
fuse_dentry_release() to prevent an ABBA deadlock.

The lock ordering becomes:

 -> dentry_bucket.lock
    -> dentry.d_lock

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Closes: https://lore.kernel.org/all/20251206014242.GO1712166@ZenIV/
Fixes: ab84ad5973 ("fuse: new work queue to periodically invalidate expired dentries")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://patch.msgid.link/20260114145344.468856-2-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-16 19:15:14 +01:00
Linus Torvalds
353c6f43ab Merge tag 'xfs-fixes-6.19-rc6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
 "Just a few obvious fixes and some 'cosmetic' changes"

* tag 'xfs-fixes-6.19-rc6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: set max_agbno to allow sparse alloc of last full inode chunk
  xfs: Fix xfs_grow_last_rtg()
  xfs: improve the assert at the top of xfs_log_cover
  xfs: fix an overly long line in xfs_rtgroup_calc_geometry
  xfs: mark __xfs_rtgroup_extents static
  xfs: Fix the return value of xfs_rtcopy_summary()
  xfs: fix memory leak in xfs_growfs_check_rtgeom()
2026-01-16 09:09:41 -08:00
Jens Axboe
4973d95679 fuse: use private naming for fuse hash size
With a mix of include dependencies, the compiler warns that:

fs/fuse/dir.c:35:9: warning: ?HASH_BITS? redefined
   35 | #define HASH_BITS       5
      |         ^~~~~~~~~
In file included from ./include/linux/io_uring_types.h:5,
                 from ./include/linux/bpf.h:34,
                 from ./include/linux/security.h:35,
                 from ./include/linux/fs_context.h:14,
                 from fs/fuse/dir.c:13:
./include/linux/hashtable.h:28:9: note: this is the location of the previous definition
   28 | #define HASH_BITS(name) ilog2(HASH_SIZE(name))
      |         ^~~~~~~~~
fs/fuse/dir.c:36:9: warning: ?HASH_SIZE? redefined
   36 | #define HASH_SIZE       (1 << HASH_BITS)
      |         ^~~~~~~~~
./include/linux/hashtable.h:27:9: note: this is the location of the previous definition
   27 | #define HASH_SIZE(name) (ARRAY_SIZE(name))
      |         ^~~~~~~~~

Hence rename the HASH_SIZE/HASH_BITS in fuse, by prefixing them with
FUSE_ instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://patch.msgid.link/195c9525-281c-4302-9549-f3d9259416c6@kernel.dk
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-16 10:55:44 +01:00
Linus Torvalds
603c05a163 Merge tag 'nfs-for-6.19-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:

 - Fix another deadlock involving nfs_release_folio()

 - localio:
     - Stop I/O upon hitting a fatal error
     - Deal with page offsets that are > PAGE_SIZE

 - Fix size read races in truncate, fallocate and copy offload

 - Several bugfixes for the NFSv4.x directory delegation client code

 - pNFS:
    - Fix a deadlock when returning delegations during open
    - Fix memory leaks in various error paths

* tag 'nfs-for-6.19-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: Fix size read races in truncate, fallocate and copy offload
  NFS: Don't immediately return directory delegations when disabled
  NFS/localio: Deal with page bases that are > PAGE_SIZE
  NFS/localio: Stop further I/O upon hitting an error
  NFSv4.x: Directory delegations don't require any state recovery
  NFSv4: Don't free slots prematurely if requesting a directory delegation
  NFSv4: Fix nfs_clear_verifier_delegated() for delegated directories
  NFS: Fix directory delegation verifier checks
  pnfs/blocklayout: Fix memory leak in bl_parse_scsi()
  pnfs/flexfiles: Fix memory leak in nfs4_ff_alloc_deviceid_node()
  NFS: Fix a deadlock involving nfs_release_folio()
  pNFS: Fix a deadlock when returning a delegation during open()
2026-01-15 11:59:49 -08:00
Trond Myklebust
d5811e6297 NFS: Fix size read races in truncate, fallocate and copy offload
If the pre-operation file size is read before locking the inode and
quiescing O_DIRECT writes, then nfs_truncate_last_folio() might end up
overwriting valid file data.

Fixes: b1817b18ff ("NFS: Protect against 'eof page pollution'")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2026-01-15 14:38:25 -05:00
Johannes Thumshirn
437cc6057e btrfs: remove zoned statistics from sysfs
Remove the newly introduced zoned statistics from sysfs, as sysfs can
only show a single page this will truncate the output on a busy
filesystem.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-14 22:08:04 +01:00
Zhao Mengmeng
e93b31d081 writeback: use round_jiffies_relative for dirtytime_work
The dirtytime_work is a background housekeeping task that flushes dirty
inodes, using round_jiffies_relative() will allow kernel to batch this
work with other aligned system tasks, reducing power consumption.

Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Link: https://patch.msgid.link/20260113082614.231580-1-zhaomzhao@126.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-14 17:11:43 +01:00
Christoph Hellwig
561940a7ee iomap: wait for batched folios to be stable in __iomap_get_folio
__iomap_get_folio needs to wait for writeback to finish if the file
requires folios to be stable for writes.  For the regular path this is
taken care of by __filemap_get_folio, but for the newly added batch
lookup it has to be done manually.

This fixes xfs/131 failures when running on PI-capable hardware.

Fixes: 395ed1ef00 ("iomap: optional zero range dirty folio processing")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260113153943.3323869-1-hch@lst.de
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-14 17:06:02 +01:00
Linus Torvalds
b54345928f Merge tag 'gfs2-for-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 revert from Andreas Gruenbacher:
 "Revert bad commit "gfs2: Fix use of bio_chain"

  I was originally assuming that there must be a bug in gfs2
  because gfs2 chains bios in the opposite direction of what
  bio_chain_and_submit() expects.

  It turns out that the bio chains are set up in "reverse direction"
  intentionally so that the first bio's bi_end_io callback is invoked
  rather than the last bio's callback.

  We want the first bio's callback invoked for the following reason: The
  initial bio starts page aligned and covers one or more pages. When it
  terminates at a non-page-aligned offset, subsequent bios are added to
  handle the remaining portion of the final page.

  Upon completion of the bio chain, all affected pages need to be be
  marked as read, and only the first bio references all of these pages"

* tag 'gfs2-for-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  Revert "gfs2: Fix use of bio_chain"
2026-01-13 10:04:34 -08:00
Brian Foster
c360004c01 xfs: set max_agbno to allow sparse alloc of last full inode chunk
Sparse inode cluster allocation sets min/max agbno values to avoid
allocating an inode cluster that might map to an invalid inode
chunk. For example, we can't have an inode record mapped to agbno 0
or that extends past the end of a runt AG of misaligned size.

The initial calculation of max_agbno is unnecessarily conservative,
however. This has triggered a corner case allocation failure where a
small runt AG (i.e. 2063 blocks) is mostly full save for an extent
to the EOFS boundary: [2050,13]. max_agbno is set to 2048 in this
case, which happens to be the offset of the last possible valid
inode chunk in the AG. In practice, we should be able to allocate
the 4-block cluster at agbno 2052 to map to the parent inode record
at agbno 2048, but the max_agbno value precludes it.

Note that this can result in filesystem shutdown via dirty trans
cancel on stable kernels prior to commit 9eb775968b ("xfs: walk
all AGs if TRYLOCK passed to xfs_alloc_vextent_iterate_ags") because
the tail AG selection by the allocator sets t_highest_agno on the
transaction. If the inode allocator spins around and finds an inode
chunk with free inodes in an earlier AG, the subsequent dir name
creation path may still fail to allocate due to the AG restriction
and cancel.

To avoid this problem, update the max_agbno calculation to the agbno
prior to the last chunk aligned agbno in the AG. This is not
necessarily the last valid allocation target for a sparse chunk, but
since inode chunks (i.e. records) are chunk aligned and sparse
allocs are cluster sized/aligned, this allows the sb_spino_align
alignment restriction to take over and round down the max effective
agbno to within the last valid inode chunk in the AG.

Note that even though the allocator improvements in the
aforementioned commit seem to avoid this particular dirty trans
cancel situation, the max_agbno logic improvement still applies as
we should be able to allocate from an AG that has been appropriately
selected. The more important target for this patch however are
older/stable kernels prior to this allocator rework/improvement.

Cc: stable@vger.kernel.org # v4.2
Fixes: 56d1115c9b ("xfs: allocate sparse inode chunks on full chunk allocation failure")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-13 10:45:55 +01:00
Nirjhar Roy (IBM)
a65fd81207 xfs: Fix xfs_grow_last_rtg()
The last rtg should be able to grow when the size of the last is less
than (and not equal to) sb_rgextents. xfs_growfs with realtime groups
fails without this patch. The reason is that, xfs_growfs_rtg() tries
to grow the last rt group even when the last rt group is at its
maximal size i.e, sb_rgextents. It fails with the following messages:

XFS (loop0): Internal error block >= mp->m_rsumblocks at line 253 of file fs/xfs/libxfs/xfs_rtbitmap.c.  Caller xfs_rtsummary_read_buf+0x20/0x80
XFS (loop0): Corruption detected. Unmount and run xfs_repair
XFS (loop0): Internal error xfs_trans_cancel at line 976 of file fs/xfs/xfs_trans.c.  Caller xfs_growfs_rt_bmblock+0x402/0x450
XFS (loop0): Corruption of in-memory data (0x8) detected at xfs_trans_cancel+0x10a/0x1f0 (fs/xfs/xfs_trans.c:977).  Shutting down filesystem.
XFS (loop0): Please unmount the filesystem and rectify the problem(s)

Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-13 10:40:45 +01:00
Christoph Hellwig
df7ec7226f xfs: improve the assert at the top of xfs_log_cover
Move each condition into a separate assert so that we can see which
on triggered.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-13 10:36:23 +01:00
Christoph Hellwig
baed03efe2 xfs: fix an overly long line in xfs_rtgroup_calc_geometry
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-13 10:34:29 +01:00
Christoph Hellwig
e0aea42a32 xfs: mark __xfs_rtgroup_extents static
__xfs_rtgroup_extents is not used outside of xfs_rtgroup.c, so mark it
static.  Move it and xfs_rtgroup_extents up in the file to avoid forward
declarations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-13 10:34:29 +01:00
Nirjhar Roy (IBM)
6b2d155366 xfs: Fix the return value of xfs_rtcopy_summary()
xfs_rtcopy_summary() should return the appropriate error code
instead of always returning 0. The caller of this function which is
xfs_growfs_rt_bmblock() is already handling the error.

Fixes: e94b53ff69 ("xfs: cache last bitmap block in realtime allocator")
Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # v6.7
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-13 10:32:12 +01:00
Deepanshu Kartikey
ab7ad7abb3 romfs: check sb_set_blocksize() return value
romfs_fill_super() ignores the return value of sb_set_blocksize(), which
can fail if the requested block size is incompatible with the block
device's configuration.

This can be triggered by setting a loop device's block size larger than
PAGE_SIZE using ioctl(LOOP_SET_BLOCK_SIZE, 32768), then mounting a romfs
filesystem on that device.

When sb_set_blocksize(sb, ROMBSIZE) is called with ROMBSIZE=4096 but the
device has logical_block_size=32768, bdev_validate_blocksize() fails
because the requested size is smaller than the device's logical block
size. sb_set_blocksize() returns 0 (failure), but romfs ignores this and
continues mounting.

The superblock's block size remains at the device's logical block size
(32768). Later, when sb_bread() attempts I/O with this oversized block
size, it triggers a kernel BUG in folio_set_bh():

    kernel BUG at fs/buffer.c:1582!
    BUG_ON(size > PAGE_SIZE);

Fix by checking the return value of sb_set_blocksize() and failing the
mount with -EINVAL if it returns 0.

Reported-by: syzbot+9c4e33e12283d9437c25@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9c4e33e12283d9437c25
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Link: https://patch.msgid.link/20260113084037.1167887-1-kartikey406@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-13 09:56:58 +01:00
Anna Schumaker
803e18641f NFS: Don't immediately return directory delegations when disabled
The function nfs_inode_evict_delegation() immediately and synchronously
returns a delegation when called. This means we can't call it from
nfs4_have_delegation(), since that function could be called under a
lock. Instead we should mark the delegation for return and let the state
manager handle it for us.

Fixes: b6d2a520f4 ("NFS: Add a module option to disable directory delegations")
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2026-01-12 11:50:22 -05:00
Jiasheng Jiang
a11224a016 btrfs: fix memory leaks in create_space_info() error paths
In create_space_info(), the 'space_info' object is allocated at the
beginning of the function. However, there are two error paths where the
function returns an error code without freeing the allocated memory:

1. When create_space_info_sub_group() fails in zoned mode.
2. When btrfs_sysfs_add_space_info_type() fails.

In both cases, 'space_info' has not yet been added to the
fs_info->space_info list, resulting in a memory leak. Fix this by
adding an error handling label to kfree(space_info) before returning.

Fixes: 2be12ef79f ("btrfs: Separate space_info create/update")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-12 16:21:55 +01:00
Filipe Manana
8826807749 btrfs: invalidate pages instead of truncate after reflinking
Qu reported that generic/164 often fails because the read operations get
zeroes when it expects to either get all bytes with a value of 0x61 or
0x62. The issue stems from truncating the pages from the page cache
instead of invalidating, as truncating can zero page contents. This
zeroing is not just in case the range is not page sized (as it's commented
in truncate_inode_pages_range()) but also in case we are using large
folios, they need to be split and the splitting fails. Stealing Qu's
comment in the thread linked below:

  "We can have the following case:

	0          4K         8K         12K          16K
        |          |          |          |            |
        |<---- Extent A ----->|<----- Extent B ------>|

   The page size is still 4K, but the folio we got is 16K.

   Then if we remap the range for [8K, 16K), then
   truncate_inode_pages_range() will get the large folio 0 sized 16K,
   then call truncate_inode_partial_folio().

   Which later calls folio_zero_range() for the [8K, 16K) range first,
   then tries to split the folio into smaller ones to properly drop them
   from the cache.

   But if splitting failed (e.g. racing with other operations holding the
   filemap lock), the partially zeroed large folio will be kept, resulting
   the range [8K, 16K) being zeroed meanwhile the folio is still a 16K
   sized large one."

So instead of truncating, invalidate the page cache range with a call to
filemap_invalidate_inode(), which besides not doing any zeroing also
ensures that while it's invalidating folios, no new folios are added.

This helps ensure that buffered reads that happen while a reflink
operation is in progress always get either the whole old data (the one
before the reflink) or the whole new data, which is what generic/164
expects.

Link: https://lore.kernel.org/linux-btrfs/7fb9b44f-9680-4c22-a47f-6648cb109ddf@suse.com/
Reported-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-12 16:21:55 +01:00
Qu Wenruo
64dd1caf88 btrfs: update the Kconfig string for CONFIG_BTRFS_EXPERIMENTAL
The following new features are missing:

- Async checksum

- Shutdown ioctl and auto-degradation

- Larger block size support
  Which is dependent on larger folios.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-12 16:21:55 +01:00
Andreas Gruenbacher
469d71512d Revert "gfs2: Fix use of bio_chain"
This reverts commit 8a157e0a0a.

That commit incorrectly assumed that the bio_chain() arguments were
swapped in gfs2.  However, gfs2 intentionally constructs bio chains so
that the first bio's bi_end_io callback is invoked when all bios in the
chain have completed, unlike bio chains where the last bio's callback is
invoked.

Fixes: 8a157e0a0a ("gfs2: Fix use of bio_chain")
Cc: stable@vger.kernel.org
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2026-01-12 14:58:32 +01:00
Laveesh Bansal
543467d6fe writeback: fix 100% CPU usage when dirtytime_expire_interval is 0
When vm.dirtytime_expire_seconds is set to 0, wakeup_dirtytime_writeback()
schedules delayed work with a delay of 0, causing immediate execution.
The function then reschedules itself with 0 delay again, creating an
infinite busy loop that causes 100% kworker CPU usage.

Fix by:
- Only scheduling delayed work in wakeup_dirtytime_writeback() when
  dirtytime_expire_interval is non-zero
- Cancelling the delayed work in dirtytime_interval_handler() when
  the interval is set to 0
- Adding a guard in start_dirtytime_writeback() for defensive coding

Tested by booting kernel in QEMU with virtme-ng:
- Before fix: kworker CPU spikes to ~73%
- After fix: CPU remains at normal levels
- Setting interval back to non-zero correctly resumes writeback

Fixes: a2f4870697 ("fs: make sure the timestamps for lazytime inodes eventually get written")
Cc: stable@vger.kernel.org
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220227
Signed-off-by: Laveesh Bansal <laveeshb@laveeshbansal.com>
Link: https://patch.msgid.link/20260106145059.543282-2-laveeshb@laveeshbansal.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12 11:07:25 +01:00
Amir Goldstein
c644bce62b readdir: require opt-in for d_type flags
Commit c31f91c6af ("fuse: don't allow signals to interrupt getdents
copying") introduced the use of high bits in d_type as flags. However,
overlayfs was not adapted to handle this change.

In ovl_cache_entry_new(), the code checks if d_type == DT_CHR to
determine if an entry might be a whiteout. When fuse is used as the
lower layer and sets high bits in d_type, this comparison fails,
causing whiteout files to not be recognized properly and resulting in
incorrect overlayfs behavior.

Fix this by requiring callers of iterate_dir() to opt-in for getting
flag bits in d_type outside of S_DT_MASK.

Fixes: c31f91c6af ("fuse: don't allow signals to interrupt getdents copying")
Link: https://lore.kernel.org/all/20260107034551.439-1-luochunsheng@ustc.edu/
Link: https://github.com/containerd/stargz-snapshotter/issues/2214
Reported-by: Chunsheng Luo <luochunsheng@ustc.edu>
Reviewed-by: Chunsheng Luo <luochunsheng@ustc.edu>
Tested-by: Chunsheng Luo <luochunsheng@ustc.edu>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://patch.msgid.link/20260108074522.3400998-1-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12 10:57:34 +01:00
Jeff Layton
8a5511eeaa vboxsf: don't allow delegations to be set on directories
With the advent of directory leases, it's necessary to set the
->setlease() handler in directory file_operations to properly deny them.

Fixes: e6d28ebc17 ("filelock: push the S_ISREG check down to ->setlease handlers")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260107-setlease-6-19-v1-6-85f034abcc57@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12 10:54:47 +01:00
Jeff Layton
ffb321045b ceph: don't allow delegations to be set on directories
With the advent of directory leases, it's necessary to set the
->setlease() handler in directory file_operations to properly deny them.

Fixes: e6d28ebc17 ("filelock: push the S_ISREG check down to ->setlease handlers")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260107-setlease-6-19-v1-5-85f034abcc57@kernel.org
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12 10:54:47 +01:00
Jeff Layton
ce946c4fb9 gfs2: don't allow delegations to be set on directories
With the advent of directory leases, it's necessary to set the
->setlease() handler in directory file_operations to properly deny them.

In the "nolock" case however, there is no need to deny them.

Fixes: e6d28ebc17 ("filelock: push the S_ISREG check down to ->setlease handlers")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260107-setlease-6-19-v1-4-85f034abcc57@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12 10:54:47 +01:00
Jeff Layton
5d65a70bd0 9p: don't allow delegations to be set on directories
With the advent of directory leases, it's necessary to set the
->setlease() handler in directory file_operations to properly deny them.

Fixes: e6d28ebc17 ("filelock: push the S_ISREG check down to ->setlease handlers")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260107-setlease-6-19-v1-3-85f034abcc57@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12 10:54:47 +01:00
Jeff Layton
b9a9be4d35 smb/client: properly disallow delegations on directories
The check for S_ISREG() in cifs_setlease() is incorrect since that
operation doesn't get called for directories. The correct way to prevent
delegations on directories is to set the ->setlease() method in directory
file_operations to simple_nosetlease().

Fixes: e6d28ebc17 ("filelock: push the S_ISREG check down to ->setlease handlers")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260107-setlease-6-19-v1-2-85f034abcc57@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12 10:54:47 +01:00
Jeff Layton
10dcd51106 nfs: properly disallow delegation requests on directories
Checking for S_ISREG() in nfs4_setlease() is incorrect, since that op is
never called for directories. The right way to deny lease requests on
directories is to set the ->setlease() operation to simple_nosetlease()
in the directory file_operations.

Fixes: e6d28ebc17 ("filelock: push the S_ISREG check down to ->setlease handlers")
Reported-by: Christoph Hellwig <hch@infradead.org>
Closes: https://lore.kernel.org/linux-fsdevel/aV316LhsVSl0n9-E@infradead.org/
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260107-setlease-6-19-v1-1-85f034abcc57@kernel.org
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12 10:54:46 +01:00