Commit Graph

15017 Commits

Author SHA1 Message Date
Linus Torvalds
997f9640c9 Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux
Pull fsverity updates from Eric Biggers:
 "fsverity cleanups, speedup, and memory usage optimization from
  Christoph Hellwig:

   - Move some logic into common code

   - Fix btrfs to reject truncates of fsverity files

   - Improve the readahead implementation

   - Store each inode's fsverity_info in a hash table instead of using a
     pointer in the filesystem-specific part of the inode.

     This optimizes for memory usage in the usual case where most files
     don't have fsverity enabled.

   - Look up the fsverity_info fewer times during verification, to
     amortize the hash table overhead"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
  fsverity: remove inode from fsverity_verification_ctx
  fsverity: use a hashtable to find the fsverity_info
  btrfs: consolidate fsverity_info lookup
  f2fs: consolidate fsverity_info lookup
  ext4: consolidate fsverity_info lookup
  fs: consolidate fsverity_info lookup in buffer.c
  fsverity: push out fsverity_info lookup
  fsverity: deconstify the inode pointer in struct fsverity_info
  fsverity: kick off hash readahead at data I/O submission time
  ext4: move ->read_folio and ->readahead to readpage.c
  readahead: push invalidate_lock out of page_cache_ra_unbounded
  fsverity: don't issue readahead for non-ENOENT errors from __filemap_get_folio
  fsverity: start consolidating pagecache code
  fsverity: pass struct file to ->write_merkle_tree_block
  f2fs: don't build the fsverity work handler for !CONFIG_FS_VERITY
  ext4: don't build the fsverity work handler for !CONFIG_FS_VERITY
  fs,fsverity: clear out fsverity_info from common code
  fs,fsverity: reject size changes on fsverity files in setattr_prepare
2026-02-12 10:41:34 -08:00
Linus Torvalds
41f1a08645 Merge tag 'kbuild-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux
Pull Kbuild/Kconfig updates from Nathan Chancellor:
 "Kbuild:

   - Drop '*_probe' pattern from modpost section check allowlist, which
     hid legitimate warnings (Johan Hovold)

   - Disable -Wtype-limits altogether, instead of enabling at W=2
     (Vincent Mailhol)

   - Improve UAPI testing to skip testing headers that require a libc
     when CONFIG_CC_CAN_LINK is not set, opening up testing of headers
     with no libc dependencies to more environments (Thomas Weißschuh)

   - Update gendwarfksyms documentation with required dependencies
     (Jihan LIN)

   - Reject invalid LLVM= values to avoid unintentionally falling back
     to system toolchain (Thomas Weißschuh)

   - Add a script to help run the kernel build process in a container
     for consistent environments and testing (Guillaume Tucker)

   - Simplify kallsyms by getting rid of the relative base (Ard
     Biesheuvel)

   - Performance and usability improvements to scripts/make_fit.py
     (Simon Glass)

   - Minor various clean ups and fixes

  Kconfig:

   - Move XPM icons to individual files, clearing up GTK deprecation
     warnings (Rostislav Krasny)

   - Support

        depends on FOO if BAR

     as syntactic sugar for

        depends on FOO || !BAR

     (Nicolas Pitre, Graham Roff)

   - Refactor merge_config.sh to use awk over shell/sed/grep,
     dramatically speeding up processing large number of config
     fragments (Anders Roxell, Mikko Rapeli)"

* tag 'kbuild-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux: (39 commits)
  kbuild: remove dependency of run-command on config
  scripts/make_fit: Compress dtbs in parallel
  scripts/make_fit: Support a few more parallel compressors
  kbuild: Support a FIT_EXTRA_ARGS environment variable
  scripts/make_fit: Move dtb processing into a function
  scripts/make_fit: Support an initial ramdisk
  scripts/make_fit: Speed up operation
  rust: kconfig: Don't require RUST_IS_AVAILABLE for rustc-option
  MAINTAINERS: Add scripts/install.sh into Kbuild entry
  modpost: Amend ppc64 save/restfpr symnames for -Os build
  MIPS: tools: relocs: Ship a definition of R_MIPS_PC32
  streamline_config.pl: remove superfluous exclamation mark
  kbuild: dummy-tools: Add python3
  scripts: kconfig: merge_config.sh: warn on duplicate input files
  scripts: kconfig: merge_config.sh: use awk in checks too
  scripts: kconfig: merge_config.sh: refactor from shell/sed/grep to awk
  kallsyms: Get rid of kallsyms relative base
  mips: Add support for PC32 relocations in vmlinux
  Documentation: dev-tools: add container.rst page
  scripts: add tool to run containerized builds
  ...
2026-02-11 13:40:35 -08:00
Linus Torvalds
8912c2fd58 Merge tag 'for-6.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "User visible changes, feature updates:

   - when using block size > page size, enable direct IO

   - fallback to buffered IO if the data profile has duplication,
     workaround to avoid checksum mismatches on block group profiles
     with redundancy, real direct IO is possible on single or RAID0

   - redo export of zoned statistics, moved from sysfs to
     /proc/pid/mountstats due to size limitations of the former

  Experimental features:

   - remove offload checksum tunable, intended to find best way to do it
     but since we've switched to offload to thread for everything we
     don't need it anymore

   - initial support for remap-tree feature, a translation layer of
     logical block addresses that allow changes without moving/rewriting
     blocks to do eg. relocation, or other changes that require COW

  Notable fixes:

   - automatic removal of accidentally leftover chunks when
     free-space-tree is enabled since mkfs.btrfs v6.16.1

   - zoned mode:
      - do not try to append to conventional zones when RAID is mixing
        zoned and conventional drives
      - fixup write pointers when mixing zoned and conventional on
        DUP/RAID* profiles

   - when using squota, relax deletion rules for qgroups with 0 members
     to allow easier recovery from accounting bugs, also add more checks
     to detect bad accounting

   - fix periodic reclaim scanning, properly check boundary conditions
     not to trigger it unexpectedly or miss the time to run it

   - trim:
      - continue after first error
      - change reporting to the first detected error
      - add more cancellation points
      - reduce contention of big device lock that can block other
        operations when there's lots of trimmed space

   - when chunk allocation is forced (needs experimental build) fix
     transaction abort when unexpected space layout is detected

  Core:

   - switch to crypto library API for checksumming, removed module
     dependencies, pointer indirections, etc.

   - error handling improvements

   - adjust how and where transaction commit or abort are done and are
     maybe not necessary

   - minor compression optimization to skip single block ranges

   - improve how compression folios are handled

   - new and updated selftests

   - cleanups, refactoring:
      - auto-freeing and other automatic variable cleanup conversion
      - structure size optimizations
      - condition annotations"

* tag 'for-6.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (137 commits)
  btrfs: get rid of compressed_bio::compressed_folios[]
  btrfs: get rid of compressed_folios[] usage for encoded writes
  btrfs: get rid of compressed_folios[] usage for compressed read
  btrfs: remove the old btrfs_compress_folios() infrastructure
  btrfs: switch to btrfs_compress_bio() interface for compressed writes
  btrfs: introduce btrfs_compress_bio() helper
  btrfs: zlib: introduce zlib_compress_bio() helper
  btrfs: zstd: introduce zstd_compress_bio() helper
  btrfs: lzo: introduce lzo_compress_bio() helper
  btrfs: zoned: factor out the zone loading part into a testable function
  btrfs: add cleanup function for btrfs_free_chunk_map
  btrfs: tests: add cleanup functions for test specific functions
  btrfs: raid56: fix memory leak of btrfs_raid_bio::stripe_uptodate_bitmap
  btrfs: tests: add unit tests for pending extent walking functions
  btrfs: fix EEXIST abort due to non-consecutive gaps in chunk allocation
  btrfs: fix transaction commit blocking during trim of unallocated space
  btrfs: handle user interrupt properly in btrfs_trim_fs()
  btrfs: preserve first error in btrfs_trim_fs()
  btrfs: continue trimming remaining devices on failure
  btrfs: do not BUG_ON() in btrfs_remove_block_group()
  ...
2026-02-09 15:45:21 -08:00
Linus Torvalds
9e355113f0 Merge tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
 "This contains a mix of VFS cleanups, performance improvements, API
  fixes, documentation, and a deprecation notice.

  Scalability and performance:

   - Rework pid allocation to only take pidmap_lock once instead of
     twice during alloc_pid(), improving thread creation/teardown
     throughput by 10-16% depending on false-sharing luck. Pad the
     namespace refcount to reduce false-sharing

   - Track file lock presence via a flag in ->i_opflags instead of
     reading ->i_flctx, avoiding false-sharing with ->i_readcount on
     open/close hot paths. Measured 4-16% improvement on 24-core
     open-in-a-loop benchmarks

   - Use a consume fence in locks_inode_context() to match the
     store-release/load-consume idiom, eliminating a hardware fence on
     some architectures

   - Annotate cdev_lock with __cacheline_aligned_in_smp to prevent
     false-sharing

   - Remove a redundant DCACHE_MANAGED_DENTRY check in
     __follow_mount_rcu() that never fires since the caller already
     verifies it, eliminating a 100% mispredicted branch

   - Fix a 100% mispredicted likely() in devcgroup_inode_permission()
     that became wrong after a prior code reorder

  Bug fixes and correctness:

   - Make insert_inode_locked() wait for inode destruction instead of
     skipping, fixing a corner case where two matching inodes could
     exist in the hash

   - Move f_mode initialization before file_ref_init() in alloc_file()
     to respect the SLAB_TYPESAFE_BY_RCU ordering contract

   - Add a WARN_ON_ONCE guard in try_to_free_buffers() for folios with
     no buffers attached, preventing a null pointer dereference when
     AS_RELEASE_ALWAYS is set but no release_folio op exists

   - Fix select restart_block to store end_time as timespec64, avoiding
     truncation of tv_sec on 32-bit architectures

   - Make dump_inode() use get_kernel_nofault() to safely access inode
     and superblock fields, matching the dump_mapping() pattern

  API modernization:

   - Make posix_acl_to_xattr() allocate the buffer internally since
     every single caller was doing it anyway. Reduces boilerplate and
     unnecessary error checking across ~15 filesystems

   - Replace deprecated simple_strtoul() with kstrtoul() for the
     ihash_entries, dhash_entries, mhash_entries, and mphash_entries
     boot parameters, adding proper error handling

   - Convert chardev code to use guard(mutex) and __free(kfree) cleanup
     patterns

   - Replace min_t() with min() or umin() in VFS code to avoid silently
     truncating unsigned long to unsigned int

   - Gate LOOKUP_RCU assertions behind CONFIG_DEBUG_VFS since callers
     already check the flag

  Deprecation:

   - Begin deprecating legacy BSD process accounting (acct(2)). The
     interface has numerous footguns and better alternatives exist
     (eBPF)

  Documentation:

   - Fix and complete kernel-doc for struct export_operations, removing
     duplicated documentation between ReST and source

   - Fix kernel-doc warnings for __start_dirop() and ilookup5_nowait()

  Testing:

   - Add a kunit test for initramfs cpio handling of entries with
     filesize > PATH_MAX

  Misc:

   - Add missing <linux/init_task.h> include in fs_struct.c"

* tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
  posix_acl: make posix_acl_to_xattr() alloc the buffer
  fs: make insert_inode_locked() wait for inode destruction
  initramfs_test: kunit test for cpio.filesize > PATH_MAX
  fs: improve dump_inode() to safely access inode fields
  fs: add <linux/init_task.h> for 'init_fs'
  docs: exportfs: Use source code struct documentation
  fs: move initializing f_mode before file_ref_init()
  exportfs: Complete kernel-doc for struct export_operations
  exportfs: Mark struct export_operations functions at kernel-doc
  exportfs: Fix kernel-doc output for get_name()
  acct(2): begin the deprecation of legacy BSD process accounting
  device_cgroup: remove branch hint after code refactor
  VFS: fix __start_dirop() kernel-doc warnings
  fs: Describe @isnew parameter in ilookup5_nowait()
  fs/namei: Remove redundant DCACHE_MANAGED_DENTRY check in __follow_mount_rcu
  fs: only assert on LOOKUP_RCU when built with CONFIG_DEBUG_VFS
  select: store end_time as timespec64 in restart block
  chardev: Switch to guard(mutex) and __free(kfree)
  namespace: Replace simple_strtoul with kstrtoul to parse boot params
  dcache: Replace simple_strtoul with kstrtoul in set_dhash_entries
  ...
2026-02-09 15:13:05 -08:00
Linus Torvalds
6124fa45e2 Merge tag 'vfs-7.0-rc1.btrfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs updates for btrfs from Christian Brauner:
 "This contains some changes for btrfs that are taken to the vfs tree to
  stop duplicating VFS code for subvolume/snapshot dentry

  Btrfs has carried private copies of the VFS may_delete() and
  may_create() functions in fs/btrfs/ioctl.c for permission checks
  during subvolume creation and snapshot destruction. These copies have
  drifted out of sync with the VFS originals — btrfs_may_delete() is
  missing the uid/gid validity check and btrfs_may_create() is missing
  the audit_inode_child() call.

  Export the VFS functions as may_{create,delete}_dentry() and switch
  btrfs to use them, removing ~70 lines of duplicated code"

* tag 'vfs-7.0-rc1.btrfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  btrfs: use may_create_dentry() in btrfs_mksubvol()
  btrfs: use may_delete_dentry() in btrfs_ioctl_snap_destroy()
  fs: export may_create() as may_create_dentry()
  fs: export may_delete() as may_delete_dentry()
2026-02-09 13:05:35 -08:00
Linus Torvalds
aa2a0fcd4c Merge tag 'vfs-7.0-rc1.leases' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs lease updates from Christian Brauner:
 "This contains updates for lease support to require filesystems to
  explicitly opt-in to lease support

  Currently kernel_setlease() falls through to generic_setlease() when a
  a filesystem does not define ->setlease(), silently granting lease
  support to every filesystem regardless of whether it is prepared for
  it.

  This is a poor default: most filesystems never intended to support
  leases, and the silent fallthrough makes it impossible to distinguish
  "supports leases" from "never thought about it".

  This inverts the default. It adds explicit

	.setlease = generic_setlease;

  assignments to every in-tree filesystem that should retain lease
  support, then changes kernel_setlease() to return -EINVAL when
  ->setlease is NULL.

  With the new default in place, simple_nosetlease() is redundant and
  is removed along with all references to it"

* tag 'vfs-7.0-rc1.leases' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits)
  fuse: add setlease file operation
  fs: remove simple_nosetlease()
  filelock: default to returning -EINVAL when ->setlease operation is NULL
  xfs: add setlease file operation
  ufs: add setlease file operation
  udf: add setlease file operation
  tmpfs: add setlease file operation
  squashfs: add setlease file operation
  overlayfs: add setlease file operation
  orangefs: add setlease file operation
  ocfs2: add setlease file operation
  ntfs3: add setlease file operation
  nilfs2: add setlease file operation
  jfs: add setlease file operation
  jffs2: add setlease file operation
  gfs2: add a setlease file operation
  fat: add setlease file operation
  f2fs: add setlease file operation
  exfat: add setlease file operation
  ext4: add setlease file operation
  ...
2026-02-09 11:59:07 -08:00
Linus Torvalds
74554251df Merge tag 'vfs-7.0-rc1.nonblocking_timestamps' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs timestamp updates from Christian Brauner:
 "This contains the changes to support non-blocking timestamp updates.

  Since commit 66fa3cedf1 ("fs: Add async write file modification
  handling") file_update_time_flags() unconditionally returns -EAGAIN
  when any timestamp needs updating and IOCB_NOWAIT is set. This makes
  non-blocking direct writes impossible on file systems with granular
  enough timestamps, which in practice means all of them.

  This reworks the timestamp update path to propagate IOCB_NOWAIT
  through ->update_time so that file systems which can update timestamps
  without blocking are no longer penalized.

  With that groundwork in place, the core change passes IOCB_NOWAIT into
  ->update_time and returns -EAGAIN only when the file system indicates
  it would block.

  XFS implements non-blocking timestamp updates by using the new
  ->sync_lazytime and open-coding generic_update_time without the
  S_NOWAIT check, since the lazytime path through the generic helpers
  can never block in XFS"

* tag 'vfs-7.0-rc1.nonblocking_timestamps' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  xfs: enable non-blocking timestamp updates
  xfs: implement ->sync_lazytime
  fs: refactor file_update_time_flags
  fs: add support for non-blocking timestamp updates
  fs: add a ->sync_lazytime method
  fs: factor out a sync_lazytime helper
  fs: refactor ->update_time handling
  fat: cleanup the flags for fat_truncate_time
  nfs: split nfs_update_timestamps
  fs: allow error returns from generic_update_time
  fs: remove inode_update_time
2026-02-09 11:25:01 -08:00
Christoph Hellwig
f77f281b61 fsverity: use a hashtable to find the fsverity_info
Use the kernel's resizable hash table (rhashtable) to find the
fsverity_info.  This way file systems that want to support fsverity don't
have to bloat every inode in the system with an extra pointer.  The
trade-off is that looking up the fsverity_info is a bit more expensive
now, but the main operations are still dominated by I/O and hashing
overhead.

The rhashtable implementations requires no external synchronization, and
the _fast versions of the APIs provide the RCU critical sections required
by the implementation.  Because struct fsverity_info is only removed on
inode eviction and does not contain a reference count, there is no need
for an extended critical section to grab a reference or validate the
object state.  The file open path uses rhashtable_lookup_get_insert_fast,
which can either find an existing object for the hash key or insert a
new one in a single atomic operation, so that concurrent opens never
instantiate duplicate fsverity_info structure.  FS_IOC_ENABLE_VERITY must
already be synchronized by a combination of i_rwsem and file system flags
and uses rhashtable_lookup_insert_fast, which errors out on an existing
object for the hash key as an additional safety check.

Because insertion into the hash table now happens before S_VERITY is set,
fsverity just becomes a barrier and a flag check and doesn't have to look
up the fsverity_info at all, so there is only a single lookup per
->read_folio or ->readahead invocation.  For btrfs there is an additional
one for each bio completion, while for ext4 and f2fs the fsverity_info
is stored in the per-I/O context and reused for the completion workqueue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Link: https://lore.kernel.org/r/20260202060754.270269-12-hch@lst.de
[EB: folded in fix for missing fsverity_free_info()]
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-04 11:31:54 -08:00
Christoph Hellwig
b0160e4501 btrfs: consolidate fsverity_info lookup
Look up the fsverity_info once in btrfs_do_readpage, and then use it
for all operations performed there, and do the same in end_folio_read
for all folios processed there.  The latter is also changed to derive
the inode from the btrfs_bio - while bbio->inode is optional, it is
always set for buffered reads.

This amortizes the lookup better once it becomes less efficient.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20260202060754.270269-11-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-04 11:31:54 -08:00
Linus Torvalds
de0674d9bc Merge tag 'for-6.19-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
 "A regression fix for a memory leak when raid56 is used"

* tag 'for-6.19-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: raid56: fix memory leak of btrfs_raid_bio::stripe_uptodate_bitmap
2026-02-03 10:19:58 -08:00
Qu Wenruo
161ab30da6 btrfs: get rid of compressed_bio::compressed_folios[]
Now there is no one utilizing that member, we can safely remove it along
with compressed_bio::nr_folios member. The size is reduced from 352 to
336 bytes on x86_64.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:59:07 +01:00
Qu Wenruo
e1bc83f8b1 btrfs: get rid of compressed_folios[] usage for encoded writes
Currently only encoded writes utilized btrfs_submit_compressed_write(),
which utilized compressed_bio::compressed_folios[] array.

Change the only call site to call the new helper,
btrfs_alloc_compressed_write(), to allocate a compressed bio, then queue
needed folios into that bio, and finally call
btrfs_submit_compressed_write() to submit the compressed bio.

This change has one hidden benefit, previously we used
btrfs_alloc_folio_array() for the folios of
btrfs_submit_compressed_read(), which doesn't utilize the compression
page pool for bs == ps cases.

Now we call btrfs_alloc_compr_folio() which will benefit from the page pool.

The other obvious benefit is that we no longer need to allocate an array
to hold all those folios, thus one less error path.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:59:07 +01:00
Qu Wenruo
dafcfa1c8e btrfs: get rid of compressed_folios[] usage for compressed read
Currently btrfs_submit_compressed_read() still uses
compressed_bio::compressed_folios[] array.

Change it to allocate each folio and queue them into the compressed bio
so that we do not need to allocate that array.

Considering how small each compressed read bio is (less than 128KiB), we
do not benefit that much from btrfs_alloc_folio_array() anyway,
while we may benefit more from btrfs_alloc_compr_folio() by using
the global folio pool.

So changing from btrfs_alloc_folio_array() to btrfs_alloc_compr_folio()
in a loop should still be fine.

This removes one error path, and paves the way to completely remove
compressed_folios[] array.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:59:07 +01:00
Qu Wenruo
26902be0cd btrfs: remove the old btrfs_compress_folios() infrastructure
Since it's been replaced by btrfs_compress_bio(), remove all involved
functions.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:59:07 +01:00
Qu Wenruo
6f706f34fc btrfs: switch to btrfs_compress_bio() interface for compressed writes
This switch has the following benefits:

- A single structure to handle all compression

  No more extra members like compressed_folios[] nor compress_type, all
  those members.

  This means the structure of async_extent is much smaller.

- Simpler error handling

  A single cleanup_compressed_bio() will handle everything, no extra
  compressed_folios[] array to bother.

Some extra notes:

- Compressed folios releasing

  Now we go bio_for_each_folio_all() loop to release the folios of the
  bio. This will work for both the old compressed_folios[] array and the
  new pure bio method.

  For old compressed_folios[], all folios of that array is queued into
  the bio, thus releasing the folios from the bio is the same as
  releasing each folio of that array. We just need to be sure no double
  releasing from the array and bio.

  For the new pure bio method, that array is NULL, just usual folio
  releasing of the bio.

  The only extra note is for end_bbio_compressed_read(), as the folios
  are allocated using btrfs_alloc_folio_array(), thus the folios should
  only be released by regular folio_put(), not btrfs_free_compr_folio().

- Rounding up the bio to block size

  We cannot simply increase bi_size, as that will not increase the
  length of the last bvec.

  Thus we have to properly add the last part into the bio.
  This will be done by the helper, round_up_last_block().

  The reason we do not round those bios up at compression time is to get
  the unaligned compressed size, so that they can be utilized for
  inline extents.
  If we round the bios up at *_compress_bio(), then every compressed bio
  will be larger than or equal to one fs block, resulting no inline
  compressed extent.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:59:07 +01:00
Qu Wenruo
c51173271d btrfs: introduce btrfs_compress_bio() helper
The helper will allocate a new compressed_bio, do the compression, and
return it to the caller.

This greatly simplifies the compression path, as we no longer need to
allocate a folio array thus no extra error path, furthermore the
compressed bio structure can be utilized for submission with very minor
modifications (like rounding up the bi_size and populate the bi_sector).

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:59:06 +01:00
Qu Wenruo
3d74a7556f btrfs: zlib: introduce zlib_compress_bio() helper
The new helper has the following enhancements against the existing
zlib_compress_folios()

- Much smaller parameter list

  No more shared IN/OUT members, no need to pre-allocate a
  compressed_folios[] array.

  Just a workspace and compressed_bio pointer, everything we need can be
  extracted from that @cb pointer.

- Ready-to-be-submitted compressed bio

  Although the caller still needs to do some common works like
  rounding up and zeroing the tailing part of the last fs block.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:59:06 +01:00
Qu Wenruo
bba959655a btrfs: zstd: introduce zstd_compress_bio() helper
The new helper has the following enhancements against the existing
zstd_compress_folios()

- Much smaller parameter list

  No more shared IN/OUT members, no need to pre-allocate a
  compressed_folios[] array.

  Just a workspace and compressed_bio pointer, everything we need can be
  extracted from that @cb pointer.

- Ready-to-be-submitted compressed bio

  Although the caller still needs to do some common works like
  rounding up and zeroing the tailing part of the last fs block.

Overall the workflow is the same as zstd_compress_folios(), but with
some minor changes:

- @start/@len is now constant
  For the current input file offset, use @start + @tot_in instead.

  The original change of @start and @len makes it pretty hard to know
  what value we're really comparing to.

- No more @cur_len
  It's only utilized when switching input buffer.
  Directly use btrfs_calc_input_length() instead.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:59:06 +01:00
Qu Wenruo
3be8a788ee btrfs: lzo: introduce lzo_compress_bio() helper
The new helper has the following enhancements against the existing
lzo_compress_folios()

- Much smaller parameter list

  No more shared IN/OUT members, no need to pre-allocate a
  compressed_folios[] array.

  Just a workspace list header and a compressed_bio pointer.

  Everything else can be fetched from that @cb pointer.

- Read-to-be-submitted compressed bio

  Although the caller still needs to do some common works like
  rounding up and zeroing the tailing part of the last fs block.

Some workloads are specific to lZO that is not needed with other
multi-run compression interfaces:

- Need to write a LZO header or segment header

  Use the new write_and_queue_folio() helper to do the bio_add_folio()
  call and folio switching.

- Need to update the LZO header after compression is done

  Use bio_first_folio_all() to grab the first folio and update the header.

- Extra corner case of error handling

  This can happen when we have queued part of a folio and hit an error.
  In that case those folios will be released by the bio.
  Thus we can only release the folio that has no queued part.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:59:06 +01:00
Naohiro Aota
e8f6130419 btrfs: zoned: factor out the zone loading part into a testable function
Separate btrfs_load_block_group_* calling path into a function, so that it
can be an entry point of unit test.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:59:06 +01:00
Naohiro Aota
e564cd2511 btrfs: add cleanup function for btrfs_free_chunk_map
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:59:06 +01:00
Naohiro Aota
392431cf9c btrfs: tests: add cleanup functions for test specific functions
Add auto-cleanup helper functions for btrfs_free_dummy_fs_info and
btrfs_free_dummy_block_group.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:59:05 +01:00
Filipe Manana
a56a70f8d2 btrfs: raid56: fix memory leak of btrfs_raid_bio::stripe_uptodate_bitmap
We allocate the bitmap but we never free it in free_raid_bio_pointers().
Fix this by adding a bitmap_free() call against the stripe_uptodate_bitmap
of a raid bio.

Fixes: 1810350b04 ("btrfs: raid56: move sector_ptr::uptodate into a dedicated bitmap")
Reported-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-btrfs/20260126045315.GA31641@lst.de/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:25 +01:00
Boris Burkov
5341c98450 btrfs: tests: add unit tests for pending extent walking functions
I ran into another sort of trivial bug in v1 of the patch and concluded
that these functions really ought to be unit tested.

These two functions form the core of searching the chunk allocation
pending extent bitmap and have relatively easily definable semantics, so
unit testing them can help ensure the correctness of chunk allocation.

I also made a minor unrelated fix in volumes.h to properly forward
declare btrfs_space_info. Because of the order of the includes in the
new test, this was actually hitting a latent build warning.

Note:
This is an early example for me of a commit authored in part by an AI
agent, so I wanted to more clear about what I did. I defined a
trivial test and explained the set of tests I wanted to the agent and it
produced the large set of test cases seen here. I then checked each test
case to make sure it matched the description and simplified the
constants and numbers until they looked reasonable to me. I then checked
the looping logic to make sure it made sense to the original spirit of
the trivial test. Finally, carefully combed over all the lines it wrote
to loop over the tests it generated to make sure they followed our code
style guide.

Assisted-by: Claude:claude-opus-4-5
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:25 +01:00
Boris Burkov
b14c5e04bd btrfs: fix EEXIST abort due to non-consecutive gaps in chunk allocation
I have been observing a number of systems aborting at
insert_dev_extents() in btrfs_create_pending_block_groups(). The
following is a sample stack trace of such an abort coming from forced
chunk allocation (typically behind CONFIG_BTRFS_EXPERIMENTAL) but this
can theoretically happen to any DUP chunk allocation.

  [81.801] ------------[ cut here ]------------
  [81.801] BTRFS: Transaction aborted (error -17)
  [81.801] WARNING: fs/btrfs/block-group.c:2876 at btrfs_create_pending_block_groups+0x721/0x770 [btrfs], CPU#1: bash/319
  [81.802] Modules linked in: virtio_net btrfs xor zstd_compress raid6_pq null_blk
  [81.803] CPU: 1 UID: 0 PID: 319 Comm: bash Kdump: loaded Not tainted 6.19.0-rc6+ #319 NONE
  [81.803] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014
  [81.804] RIP: 0010:btrfs_create_pending_block_groups+0x723/0x770 [btrfs]
  [81.806] RSP: 0018:ffffa36241a6bce8 EFLAGS: 00010282
  [81.806] RAX: 000000000000000d RBX: ffff8e699921e400 RCX: 0000000000000000
  [81.807] RDX: 0000000002040001 RSI: 00000000ffffffef RDI: ffffffffc0608bf0
  [81.807] RBP: 00000000ffffffef R08: ffff8e69830f6000 R09: 0000000000000007
  [81.808] R10: ffff8e699921e5e8 R11: 0000000000000000 R12: ffff8e6999228000
  [81.808] R13: ffff8e6984d82000 R14: ffff8e69966a69c0 R15: ffff8e69aa47b000
  [81.809] FS:  00007fec6bdd9740(0000) GS:ffff8e6b1b379000(0000) knlGS:0000000000000000
  [81.809] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [81.810] CR2: 00005604833670f0 CR3: 0000000116679000 CR4: 00000000000006f0
  [81.810] Call Trace:
  [81.810]  <TASK>
  [81.810]  __btrfs_end_transaction+0x3e/0x2b0 [btrfs]
  [81.811]  btrfs_force_chunk_alloc_store+0xcd/0x140 [btrfs]
  [81.811]  kernfs_fop_write_iter+0x15f/0x240
  [81.812]  vfs_write+0x264/0x500
  [81.812]  ksys_write+0x6c/0xe0
  [81.812]  do_syscall_64+0x66/0x770
  [81.812]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [81.813] RIP: 0033:0x7fec6be66197
  [81.814] RSP: 002b:00007fffb159dd30 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
  [81.815] RAX: ffffffffffffffda RBX: 00007fec6bdd9740 RCX: 00007fec6be66197
  [81.815] RDX: 0000000000000002 RSI: 0000560483374f80 RDI: 0000000000000001
  [81.816] RBP: 0000560483374f80 R08: 0000000000000000 R09: 0000000000000000
  [81.816] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
  [81.817] R13: 00007fec6bfb85c0 R14: 00007fec6bfb5ee0 R15: 00005604833729c0
  [81.817]  </TASK>
  [81.817] irq event stamp: 20039
  [81.818] hardirqs last  enabled at (20047): [<ffffffff99a68302>] __up_console_sem+0x52/0x60
  [81.818] hardirqs last disabled at (20056): [<ffffffff99a682e7>] __up_console_sem+0x37/0x60
  [81.819] softirqs last  enabled at (19470): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0
  [81.819] softirqs last disabled at (19463): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0
  [81.820] ---[ end trace 0000000000000000 ]---
  [81.820] BTRFS: error (device dm-7 state A) in btrfs_create_pending_block_groups:2876: errno=-17 Object already exists

Inspecting these aborts with drgn, I observed a pattern of overlapping
chunk_maps. Note how stripe 1 of the first chunk overlaps in physical
address with stripe 0 of the second chunk.

Physical Start     Physical End       Length       Logical            Type                 Stripe
----------------------------------------------------------------------------------------------------
0x0000000102500000 0x0000000142500000 1.0G         0x0000000641d00000 META|DUP             0/2
0x0000000142500000 0x0000000182500000 1.0G         0x0000000641d00000 META|DUP             1/2
0x0000000142500000 0x0000000182500000 1.0G         0x0000000601d00000 META|DUP             0/2
0x0000000182500000 0x00000001c2500000 1.0G         0x0000000601d00000 META|DUP             1/2

Now how could this possibly happen? All chunk allocation is protected by
the chunk_mutex so racing allocations should see a consistent view of
the CHUNK_ALLOCATED bit in the chunk allocation extent-io-tree
(device->alloc_state as set by chunk_map_device_set_bits()) The tree
itself is protected by a spin lock, and clearing/setting the bits is
always protected by fs_info->mapping_tree_lock, so no race is apparent.

It turns out that there is a subtle bug in the logic regarding chunk
allocations that have happened in the current transaction, known as
"pending extents". The chunk allocation as defined in
find_free_dev_extent() is a loop which searches the commit root of the
dev_root and looks for gaps between DEV_EXTENT items. For those gaps, it
then checks alloc_state bitmap for any pending extents and adjusts the
hole that it finds accordingly. However, the logic in that adjustment
assumes that the first pending extent is the only one in that range.

e.g., given a layout with two non-consecutive pending extents in a hole
passed to dev_extent_hole_check() via *hole_start and *hole_size:

  |----pending A----|    real hole     |----pending B----|
           |           candidate hole        |
      *hole_start                         *hole_start + *hole_size

the code incorrectly returns a "hole" from the end of pending extent A
until the passed in hole end, failing to account for pending B.

However, it is not entirely obvious that it is actually possible to
produce such a layout. I was able to reproduce it, but with some
contortions: I continued to use the force chunk allocation sysfs file
and I introduced a long delay (10 seconds) into the start of the cleaner
thread. I also prevented the unused bgs cleaning logic from ever
deleting metadata bgs. These help make it easier to deterministically
produce the condition but shouldn't really matter if you imagine the
conditions happening by race/luck. Allocations/frees can happen
concurrently with the cleaner thread preparing to process an unused
extent and both create some used chunks with an unused chunk
interleaved, all during one transaction. Then btrfs_delete_unused_bgs()
sees the unused one and clears it, leaving a range with several pending
chunk allocations and a gap in the middle.

The basic idea is that the unused_bgs cleanup work happens on a worker
so if we allocate 3 block groups in one transaction, then the cleaner
work kicked off by the previous transaction comes through and deletes
the middle one of the 3, then the commit root shows no dev extents and
we have the bad pattern in the extent-io-tree. One final consideration
is that the code happens to loop to the next hole if there are no more
extents at all, so we need one more dev extent way past the area we are
working in. Something like the following demonstrates the technique:

  # push the BG frontier out to 20G
  fallocate -l 20G $mnt/foo
  # allocate one more that will prevent the "no more dev extents" luck
  fallocate -l 1G $mnt/sticky
  # sync
  sync
  # clear out the allocation area
  rm $mnt/foo
  sync
  _cleaner
  # let everything quiesce
  sleep 20
  sync

  # dev tree should have one bg 20G out and the rest at the beginning..
  # sort of like an empty FS but with a random sticky chunk.

  # kick off the cleaner in the background, remember it will sleep 10s
  # before doing interesting work
  _cleaner &

  sleep 3

  # create 3 trivial block groups, all empty, all immediately marked as unused.
  echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc"
  echo 1 > "$(_btrfs_sysfs_space_info $dev data)/force_chunk_alloc"
  echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc"

  # let the cleaner thread definitely finish, it will remove the data bg
  sleep 10

  # this allocation sees the non-consecutive pending metadata chunks with
  # data chunk gap of 1G and allocates a 2G extent in that hole. ENOSPC!
  echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc"

As for the fix, it is not that obvious. I could not see a trivial way to
do it even by adding backup loops into find_free_dev_extent(), so I
opted to change the semantics of dev_extent_hole_check() to not stop
looping until it finds a sufficiently big hole. For clarity, this also
required changing the helper function contains_pending_extent() into two
new helpers which find the first pending extent and the first suitable
hole in a range.

I attempted to clean up the documentation and range calculations to be
as consistent and clear as possible for the future.

I also looked at the zoned case and concluded that the loop there is
different and not to be unified with this one. As far as I can tell, the
zoned check will only further constrain the hole so looping back to find
more holes is acceptable. Though given that zoned really only appends, I
find it highly unlikely that it is susceptible to this bug.

Fixes: 1b98450816 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole")
Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Closes: https://lore.kernel.org/linux-btrfs/q7760374-q1p4-029o-5149-26p28421s468@tzk.arg/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:25 +01:00
jinbaohong
b291ad4458 btrfs: fix transaction commit blocking during trim of unallocated space
When trimming unallocated space, btrfs_trim_fs() holds the device_list_mutex
for the entire duration while iterating through all devices. On large
filesystems with significant unallocated space, this operation can take
minutes to hours on large storage systems.

This causes a problem because btrfs_run_dev_stats(), which is called
during transaction commit, also requires device_list_mutex:

  btrfs_trim_fs()
    mutex_lock(&fs_devices->device_list_mutex)
    list_for_each_entry(device, ...)
      btrfs_trim_free_extents(device)
    mutex_unlock(&fs_devices->device_list_mutex)

  commit_transaction()
    btrfs_run_dev_stats()
      mutex_lock(&fs_devices->device_list_mutex)  // blocked!
      ...

While trim is running, all transaction commits are blocked waiting for
the mutex.

Fix this by refactoring btrfs_trim_free_extents() to process devices in
bounded chunks (up to 2GB per iteration) and release device_list_mutex
between chunks.

Signed-off-by: robbieko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:25 +01:00
jinbaohong
bfb670b918 btrfs: handle user interrupt properly in btrfs_trim_fs()
When a fatal signal is pending or the process is freezing,
btrfs_trim_block_group() and btrfs_trim_free_extents() return -ERESTARTSYS.
Currently this is treated as a regular error: the loops continue to the
next iteration and count it as a block group or device failure.

Instead, break out of the loops immediately and return -ERESTARTSYS to
userspace without counting it as a failure. Also skip the device loop
entirely if the block group loop was interrupted.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:25 +01:00
jinbaohong
1cc4ada418 btrfs: preserve first error in btrfs_trim_fs()
When multiple block groups or devices fail during trim, preserve the
first error encountered rather than the last one. The first error is
typically more useful for debugging as it represents the original
failure, while subsequent errors may be cascading effects.

Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:24 +01:00
jinbaohong
912d1c6680 btrfs: continue trimming remaining devices on failure
Commit 93bba24d4b ("btrfs: Enhance btrfs_trim_fs function to handle
error better") intended to make device trimming continue even if one
device fails, tracking failures and reporting them at the end. However,
it used 'break' instead of 'continue', causing the loop to exit on the
first device failure.

Fix this by replacing 'break' with 'continue'.

Fixes: 93bba24d4b ("btrfs: Enhance btrfs_trim_fs function to handle error better")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:24 +01:00
Filipe Manana
719dc4b755 btrfs: do not BUG_ON() in btrfs_remove_block_group()
There's no need to BUG_ON(), we can just abort the transaction and return
an error.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:24 +01:00
Filipe Manana
6f926597f9 btrfs: abort transaction on error in btrfs_remove_block_group()
When btrfs_remove_block_group() fails we abort the transaction in its
single caller (btrfs_remove_chunk()). This makes it harder to find out
where exactly the failure happened, as several steps inside
btrfs_remove_block_group() can fail.

So make btrfs_remove_block_group() abort the transaction whenever an
error happens, instead of aborting in its caller.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:24 +01:00
Boris Burkov
3a1f4264da btrfs: fix block_group_tree dirty_list corruption
When the incompat flag EXTENT_TREE_V2 is set, we unconditionally add the
block group tree to the switch_commits list before calling
switch_commit_roots, as we do for the tree root and the chunk root.
However, the block group tree uses normal root dirty tracking and in any
transaction that does an allocation and dirties a block group, the block
group root will already be linked to a list by the dirty_list field and
this use of list_add_tail() is invalid and corrupts the prev/next
members of block_group_root->dirty_list.

This is apparent on a subsequent list_del on the prev if we enable
CONFIG_DEBUG_LIST:

  [32.1571] ------------[ cut here ]------------
  [32.1572] list_del corruption. next->prev should beffff958890202538, but was ffff9588992bd538. (next=ffff958890201538)
  [32.1575] WARNING: lib/list_debug.c:65 at 0x0, CPU#3: sync/607
  [32.1583] CPU: 3 UID: 0 PID: 607 Comm: sync Not tainted 6.18.0 #24PREEMPT(none)
  [32.1585] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS1.17.0-4.fc41 04/01/2014
  [32.1587] RIP: 0010:__list_del_entry_valid_or_report+0x108/0x120
  [32.1593] RSP: 0018:ffffaa288287fdd0 EFLAGS: 00010202
  [32.1594] RAX: 0000000000000001 RBX: ffff95889326e800 RCX:ffff958890201538
  [32.1596] RDX: ffff9588992bd538 RSI: ffff958890202538 RDI:ffffffff82a41e00
  [32.1597] RBP: ffff958890202538 R08: ffffffff828fc1e8 R09:00000000ffffefff
  [32.1599] R10: ffffffff8288c200 R11: ffffffff828e4200 R12:ffff958890201538
  [32.1601] R13: ffff95889326e958 R14: ffff958895c24000 R15:ffff958890202538
  [32.1603] FS:  00007f0c28eb5740(0000) GS:ffff958af2bd2000(0000)knlGS:0000000000000000
  [32.1605] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [32.1607] CR2: 00007f0c28e8a3cc CR3: 0000000109942005 CR4:0000000000370ef0
  [32.1609] Call Trace:
  [32.1610]  <TASK>
  [32.1611]  switch_commit_roots+0x82/0x1d0 [btrfs]
  [32.1615]  btrfs_commit_transaction+0x968/0x1550 [btrfs]
  [32.1618]  ? btrfs_attach_transaction_barrier+0x23/0x60 [btrfs]
  [32.1621]  __iterate_supers+0xe8/0x190
  [32.1622]  ? __pfx_sync_fs_one_sb+0x10/0x10
  [32.1623]  ksys_sync+0x63/0xb0
  [32.1624]  __do_sys_sync+0xe/0x20
  [32.1625]  do_syscall_64+0x73/0x450
  [32.1626]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [32.1627] RIP: 0033:0x7f0c28d05d2b
  [32.1632] RSP: 002b:00007ffc9d988048 EFLAGS: 00000246 ORIG_RAX:00000000000000a2
  [32.1634] RAX: ffffffffffffffda RBX: 00007ffc9d988228 RCX:00007f0c28d05d2b
  [32.1636] RDX: 00007f0c28e02301 RSI: 00007ffc9d989b21 RDI:00007f0c28dba90d
  [32.1637] RBP: 0000000000000001 R08: 0000000000000001 R09:0000000000000000
  [32.1639] R10: 0000000000000000 R11: 0000000000000246 R12:000055b96572cb80
  [32.1641] R13: 000055b96572b19f R14: 00007f0c28dfa434 R15:000055b96572b034
  [32.1643]  </TASK>
  [32.1644] irq event stamp: 0
  [32.1644] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  [32.1646] hardirqs last disabled at (0): [<ffffffff81298817>]copy_process+0xb37/0x2260
  [32.1648] softirqs last  enabled at (0): [<ffffffff81298817>]copy_process+0xb37/0x2260
  [32.1650] softirqs last disabled at (0): [<0000000000000000>] 0x0
  [32.1652] ---[ end trace 0000000000000000 ]---

Furthermore, this list corruption eventually (when we happen to add a
new block group) results in getting the switch_commits and
dirty_cowonly_roots lists mixed up and attempting to call update_root
on the tree root which can't be found in the tree root, resulting in a
transaction abort:

  [87.8269] BTRFS critical (device nvme1n1): unable to find root key (1 0 0) in tree 1
  [87.8272] ------------[ cut here ]------------
  [87.8274] BTRFS: Transaction aborted (error -117)
  [87.8275] WARNING: fs/btrfs/root-tree.c:153 at 0x0, CPU#4: sync/703
  [87.8285] CPU: 4 UID: 0 PID: 703 Comm: sync Not tainted 6.18.0 #25 PREEMPT(none)
  [87.8287] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-4.fc41 04/01/2014
  [87.8289] RIP: 0010:btrfs_update_root+0x296/0x790 [btrfs]
  [87.8295] RSP: 0018:ffffa58d035dfd60 EFLAGS: 00010282
  [87.8297] RAX: ffff9a59126ddb68 RBX: ffff9a59126dc000 RCX: 0000000000000000
  [87.8299] RDX: 0000000000000000 RSI: 00000000ffffff8b RDI: ffffffffc0b28270
  [87.8301] RBP: ffff9a5904aec000 R08: 0000000000000000 R09: 00000000ffffefff
  [87.8303] R10: ffffffff9ac8c200 R11: ffffffff9ace4200 R12: 0000000000000001
  [87.8305] R13: ffff9a59041740e8 R14: ffff9a5904aec1f7 R15: ffff9a590fdefaf0
  [87.8307] FS:  00007f54cde6b740(0000) GS:ffff9a5b5a81c000(0000) knlGS:0000000000000000
  [87.8309] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [87.8310] CR2: 00007f54cde403cc CR3: 0000000112902004 CR4: 0000000000370ef0
  [87.8312] Call Trace:
  [87.8313]  <TASK>
  [87.8314]  ? _raw_spin_unlock+0x23/0x40
  [87.8315]  commit_cowonly_roots+0x1ad/0x250 [btrfs]
  [87.8317]  ? btrfs_commit_transaction+0x79b/0x1560 [btrfs]
  [87.8320]  btrfs_commit_transaction+0x8aa/0x1560 [btrfs]
  [87.8322]  ? btrfs_attach_transaction_barrier+0x23/0x60 [btrfs]
  [87.8325]  __iterate_supers+0xf1/0x170
  [87.8326]  ? __pfx_sync_fs_one_sb+0x10/0x10
  [87.8327]  ksys_sync+0x63/0xb0
  [87.8328]  __do_sys_sync+0xe/0x20
  [87.8329]  do_syscall_64+0x73/0x450
  [87.8330]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [87.8331] RIP: 0033:0x7f54cdd05d2b
  [87.8336] RSP: 002b:00007fff1b58ff78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
  [87.8338] RAX: ffffffffffffffda RBX: 00007fff1b590158 RCX: 00007f54cdd05d2b
  [87.8340] RDX: 00007f54cde02301 RSI: 00007fff1b592b66 RDI: 00007f54cddba90d
  [87.8342] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
  [87.8344] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e07ca96b80
  [87.8346] R13: 000055e07ca9519f R14: 00007f54cddfa434 R15: 000055e07ca95034
  [87.8348]  </TASK>
  [87.8348] irq event stamp: 0
  [87.8349] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  [87.8351] hardirqs last disabled at (0): [<ffffffff99698797>] copy_process+0xb37/0x21e0
  [87.8353] softirqs last  enabled at (0): [<ffffffff99698797>] copy_process+0xb37/0x21e0
  [87.8355] softirqs last disabled at (0): [<0000000000000000>] 0x0
  [87.8357] ---[ end trace 0000000000000000 ]---
  [87.8358] BTRFS: error (device nvme1n1 state A) in btrfs_update_root:153: errno=-117 Filesystem corrupted
  [87.8360] BTRFS info (device nvme1n1 state EA): forced readonly
  [87.8362] BTRFS warning (device nvme1n1 state EA): Skipping commit of aborted transaction.
  [87.8364] BTRFS: error (device nvme1n1 state EA) in cleanup_transaction:2037: errno=-117 Filesystem corrupted

Since the block group tree was pulled out of the extent tree and uses
normal root dirty tracking, remove the offending extra list_add. This
fixes the list corruption and the resulting fs corruption.

Fixes: 14033b08a0 ("btrfs: don't save block group root into super block")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:24 +01:00
Johannes Thumshirn
c757edbef9 btrfs: fix copying the flags of btrfs_bio after split
When a btrfs_bio gets split, only 'bbio->csum_search_commit_root' gets
copied to the new btrfs_bio, all the other flags don't.

When a bio is split in btrfs_submit_chunk(), btrfs_split_bio() creates
the new split bio via btrfs_bio_init() which zeroes the struct with
memset. Looking at btrfs_split_bio(), it copies csum_search_commit_root
from the original but does not copy can_use_append.

After the split, the code does:

    bbio = split;
    bio = &bbio->bio;

This means the split bio (with can_use_append = false) gets submitted,
not the original. In btrfs_submit_dev_bio(), the condition:

    if (btrfs_bio(bio)->can_use_append && btrfs_dev_is_sequential(...))

Will be false for the split bio even when writing to a sequential zone.
Does the split bio need to inherit can_use_append from the original? The
old code used a local variable use_append which persisted across the
split.

Copy the rest of the flags as well.

Link: https://lore.kernel.org/linux-btrfs/20260125132120.2525146-1-clm@meta.com/
Reported-by: Chris Mason <clm@meta.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:24 +01:00
Johannes Thumshirn
3fe608dbac btrfs: zoned: use local fs_info variable in btrfs_load_block_group_dup()
btrfs_load_block_group_dup() has a local pointer to fs_info, yet the
error prints dereference fs_info from the block_group.

Use local fs_info variable to make the code more uniform.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:23 +01:00
Naohiro Aota
52ee9965d0 btrfs: zoned: fixup last alloc pointer after extent removal for RAID0/10
When a block group is composed of a sequential write zone and a
conventional zone, we recover the (pseudo) write pointer of the
conventional zone using the end of the last allocated position.

However, if the last extent in a block group is removed, the last extent
position will be smaller than the other real write pointer position.
Then, that will cause an error due to mismatch of the write pointers.

We can fixup this case by moving the alloc_offset to the corresponding
write pointer position.

Fixes: 568220fa96 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:23 +01:00
Naohiro Aota
e2d848649e btrfs: zoned: fixup last alloc pointer after extent removal for DUP
When a block group is composed of a sequential write zone and a
conventional zone, we recover the (pseudo) write pointer of the
conventional zone using the end of the last allocated position.

However, if the last extent in a block group is removed, the last extent
position will be smaller than the other real write pointer position.
Then, that will cause an error due to mismatch of the write pointers.

We can fixup this case by moving the alloc_offset to the corresponding
write pointer position.

Fixes: c0d90a79e8 ("btrfs: zoned: fix alloc_offset calculation for partly conventional block groups")
CC: stable@vger.kernel.org # 6.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:23 +01:00
Naohiro Aota
dda3ec9ee6 btrfs: zoned: fixup last alloc pointer after extent removal for RAID1
When a block group is composed of a sequential write zone and a
conventional zone, we recover the (pseudo) write pointer of the
conventional zone using the end of the last allocated position.

However, if the last extent in a block group is removed, the last extent
position will be smaller than the other real write pointer position.
Then, that will cause an error due to mismatch of the write pointers.

We can fixup this case by moving the alloc_offset to the corresponding
write pointer position.

Fixes: 568220fa96 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:23 +01:00
Filipe Manana
3f8982543d btrfs: remove out label in btrfs_wait_for_commit()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:23 +01:00
Filipe Manana
5eb01bf4a9 btrfs: remove out label in btrfs_init_space_info()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:23 +01:00
Filipe Manana
cefef3cc12 btrfs: remove out label in btrfs_check_rw_degradable()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:23 +01:00
Filipe Manana
61fb7f04ee btrfs: remove out label in finish_verity()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:22 +01:00
Filipe Manana
6329592ca6 btrfs: remove out label in scrub_find_fill_first_stripe()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:22 +01:00
Filipe Manana
55807025a6 btrfs: remove out label in lzo_decompress()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:22 +01:00
Filipe Manana
610ff1c9df btrfs: remove out label in btrfs_mark_extent_written()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:22 +01:00
Filipe Manana
cc27540dd0 btrfs: remove out label in btrfs_csum_file_blocks()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:22 +01:00
Filipe Manana
bb09b9a491 btrfs: remove out_failed label in find_lock_delalloc_range()
There is no point in having the label since all it does is return the
value in the 'found' variable. Instead make every goto return directly
and remove the label.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:22 +01:00
Filipe Manana
2efcd25a76 btrfs: remove out label in load_extent_tree_free()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:22 +01:00
Filipe Manana
1038614e8f btrfs: remove pointless out labels from uuid-tree.c
Some functions (btrfs_uuid_iter_rem() and btrfs_check_uuid_tree_entry())
have an 'out' label that does nothing but return, making it pointless.
Simplify this by removing the label and returning instead of gotos plus
setting the 'ret' variable.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:21 +01:00
Filipe Manana
47c9dbc791 btrfs: remove pointless out labels from inode.c
Some functions (insert_inline_extent() and insert_reserved_file_extent())
have an 'out' label that does nothing but return, making it pointless.
Simplify this by removing the label and returning instead of gotos plus
setting the 'ret' variable.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:21 +01:00
Filipe Manana
46099eaef3 btrfs: remove pointless out labels from free-space-cache.c
Some functions (update_cache_item(), find_free_space(), trim_bitmaps(),
btrfs_remove_free_space() and cleanup_free_space_cache_v1()) have an 'out'
label that does nothing but return, making it pointless. Simplify this by
removing the label and returning instead of gotos plus setting the 'ret'
variable.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:21 +01:00