linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-16 16:01:44 -04:00

Author	SHA1	Message	Date
Qu Wenruo	37cc07cab7	btrfs: lzo: use folio_iter to handle lzo_decompress_bio() Currently lzo_decompress_bio() is using compressed_bio->compressed_folios[] array to grab each compressed folio. This is making the code much easier to read, as we only need to maintain a single iterator, @cur_in, and can easily grab any random folio using @cur_in >> min_folio_shift as an index. However lzo_decompress_bio() itself is ensured to only advance to the next folio at one time, and compressed_folios[] is just a pointer to each folio of the compressed bio, thus we have no real random access requirement for lzo_decompress_bio(). Replace the compressed_folios[] access by a helper, get_current_folio(), which uses folio_iter and an external folio counter to properly switch the folio when needed. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:17 +01:00
Sun YangKai	4b7ecd0984	btrfs: consolidate reclaim readiness checks in btrfs_should_reclaim() Move the filesystem state validation from btrfs_reclaim_bgs_work() into btrfs_should_reclaim() to centralize the reclaim eligibility logic. Since it is the only caller of btrfs_should_reclaim(), there's no functional change. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:14 +01:00
Sun YangKai	19eff93dc7	btrfs: fix periodic reclaim condition Problems with current implementation: 1. reclaimable_bytes is signed while chunk_sz is unsigned, causing negative reclaimable_bytes to trigger reclaim unexpectedly 2. The "space must be freed between scans" assumption breaks the two-scan requirement: first scan marks block groups, second scan reclaims them. Without the second scan, no reclamation occurs. Instead, track actual reclaim progress: pause reclaim when block groups will be reclaimed, and resume only when progress is made. This ensures reclaim continues until no further progress can be made. And resume periodic reclaim when there's enough free space. And we take care if reclaim is making any progress now, so it's unnecessary to set periodic_reclaim_ready to false when failed to reclaim a block group. Fixes: `813d4c6422` ("btrfs: prevent pathological periodic reclaim loops") CC: stable@vger.kernel.org # 6.12+ Suggested-by: Boris Burkov <boris@bur.io> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Sun YangKai <sunk67188@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:37 +01:00
Johannes Thumshirn	de62f138f9	btrfs: don't pass io_ctl to __btrfs_write_out_cache() There's no need to pass both the block_group and block_group::io_ctl to __btrfs_write_out_cache(). Remove passing io_ctl to __btrfs_write_out_cache() and dereference it inside __btrfs_write_out_cache(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:37 +01:00
Filipe Manana	ea7ab405c5	btrfs: use the btrfs_extent_map_end() helper everywhere We have a helper to calculate an extent map's exclusive end offset, but we only use it in some places. Update every site that open codes the calculation to use the helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:36 +01:00
Filipe Manana	4ac81c3811	btrfs: use the btrfs_block_group_end() helper everywhere We have a helper to calculate a block group's exclusive end offset, but we only use it in some places. Update every site that open codes the calculation to use the helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:36 +01:00
Johannes Thumshirn	57a4a863cd	btrfs: remove bogus NULL checks in __btrfs_write_out_cache() Dan reported a new smatch warning in free-space-cache.c: New smatch warnings: fs/btrfs/free-space-cache.c:1207 write_pinned_extent_entries() warn: variable dereferenced before check 'block_group' (see line 1203) But the check if the block_group pointer is NULL is bogus, because to get to this point block_group::io_ctl has already been dereferenced further up the call-chain when calling __btrfs_write_out_cache() from btrfs_write_out_cache(). Remove the bogus checks for block_group == NULL in __btrfs_write_out_cache() and it's siblings. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202601170636.WsePMV5H-lkp@intel.com/ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:36 +01:00
Mark Harmstone	2aef934b56	btrfs: populate fully_remapped_bgs_list on mount Add a function btrfs_populate_fully_remapped_bgs_list() which gets called on mount, which looks for fully remapped block groups (i.e. identity_remap_count == 0) which haven't yet had their chunk stripes and device extents removed. This happens when a filesystem is unmounted while async discard has not yet finished, as otherwise the data range occupied by the chunk stripes would be permanently unusable. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:36 +01:00
Mark Harmstone	7cddbb4339	btrfs: handle discarding fully-remapped block groups Discard normally works by iterating over the free-space entries of a block group. This doesn't work for fully-remapped block groups, as we removed their free-space entries when we started relocation. For sync discard, call btrfs_discard_extent() when we commit the transaction in which the last identity remap was removed. For async discard, add a new function btrfs_trim_fully_remapped_block_group() to be called by the discard worker, which iterates over the block group's range using the normal async discard rules. Once we reach the end, remove the chunk's stripes and device extents to get back its free space. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:36 +01:00
Mark Harmstone	81e5a4551c	btrfs: allow balancing remap tree Balancing the METADATA_REMAP chunk, i.e. the chunk in which the remap tree lives, is a special case. We can't use the remap tree itself for this, as then we'd have no way to boostrap it on mount. And we can't use the pre-remap tree code for this as it relies on walking the extent tree, and we're not creating backrefs for METADATA_REMAP chunks. So instead, if a balance would relocate any METADATA_REMAP block groups, mark those block groups as readonly and COW every leaf of the remap tree. There's more sophisticated ways of doing this, such as only COWing nodes within a block group that's to be relocated, but they're fiddly and with lots of edge cases. Plus it's not anticipated that a) the number of METADATA_REMAP chunks is going to be particularly large, or b) that users will want to only relocate some of these chunks - the main use case here is to unbreak RAID conversion and device removal. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:36 +01:00
Mark Harmstone	a645372e7e	btrfs: add do_remap parameter to btrfs_discard_extent() btrfs_discard_extent() can be called either when an extent is removed or from walking the free-space tree. With a remapped block group these two things are no longer equivalent: the extent's addresses are remapped, while the free-space tree exclusively uses underlying addresses. Add a do_remap parameter to btrfs_discard_extent() and btrfs_map_discard(), saying whether or not the address needs to be run through the remap tree first. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:35 +01:00
Mark Harmstone	fd6594b144	btrfs: replace identity remaps with actual remaps when doing relocations Add a function do_remap_tree_reloc(), which does the actual work of doing a relocation using the remap tree. In a loop we call do_remap_reloc_trans(), which searches for the first identity remap for the block group. We call btrfs_reserve_extent() to find space elsewhere for it, and read the data into memory and write it to the new location. We then carve out the identity remap and replace it with an actual remap, which points to the new location in which to look. Once the last identity remap has been removed we call last_identity_remap_gone(), which, as with deletions, removes the chunk's stripes and device extents. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:35 +01:00
Mark Harmstone	bbea42dfb9	btrfs: move existing remaps before relocating block group If when relocating a block group we find that `remap_bytes` > 0 in its block group item, that means that it has been the destination block group for another that has been remapped. We need to search the remap tree for any remap backrefs within this range, and move the data to a third block group. This is because otherwise btrfs_translate_remap() could end up following an unbounded chain of remaps, which would only get worse over time. We only relocate one block group at a time, so `remap_bytes` will only ever go down while we are doing this. Once we're finished we set the REMAPPED flag on the block group, which will permanently prevent any other data from being moved to within it. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:35 +01:00
Mark Harmstone	b56f35560b	btrfs: handle setting up relocation of block group with remap-tree Handle the preliminary work for relocating a block group in a filesystem with the remap-tree flag set. If the block group is SYSTEM btrfs_relocate_block_group() proceeds as it does already, as bootstrapping issues mean that these block groups have to be processed the existing way. Similarly with METADATA_REMAP blocks, which are dealt with in a later patch. Otherwise we walk the free-space tree for the block group in question, recording any holes. These get converted into identity remaps and placed in the remap tree, and the block group's REMAPPED flag is set. From now on no new allocations are possible within this block group, and any I/O to it will be funnelled through btrfs_translate_remap(). We store the number of identity remaps in `identity_remap_count`, so that we know when we've removed the last one and the block group is fully remapped. The change in btrfs_read_roots() is because data relocations no longer rely on the data reloc tree as a hidden subvolume in which to do snapshots. (Thanks to Sun YangKai for his suggestions.) Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:35 +01:00
Mark Harmstone	979e1dc3d6	btrfs: handle deletions from remapped block group Handle the case where we free an extent from a block group that has the REMAPPED flag set. Because the remap tree is orthogonal to the extent tree, for data this may be within any number of identity remaps or actual remaps. If we're freeing a metadata node, this will be wholly inside one or the other. btrfs_remove_extent_from_remap_tree() searches the remap tree for the remaps that cover the range in question, then calls remove_range_from_remap_tree() for each one, to punch a hole in the remap and adjust the free-space tree. For an identity remap, remove_range_from_remap_tree() will adjust the block group's `identity_remap_count` if this changes. If it reaches zero we mark the block group as fully remapped. For an identity remap, remove_range_from_remap_tree() will adjust the block group's `identity_remap_count` if this changes. If it reaches zero we mark the block group as fully remapped. Fully remapped block groups have their chunk stripes removed and their device extents freed, which makes the disk space available again to the chunk allocator. This happens asynchronously: in the cleaner thread for sync discard and nodiscard, and (in a later patch) in the discard worker for async discard. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:35 +01:00
Mark Harmstone	18ba649928	btrfs: redirect I/O for remapped block groups Change btrfs_map_block() so that if the block group has the REMAPPED flag set, we call btrfs_translate_remap() to obtain a new address. btrfs_translate_remap() searches the remap tree for a range corresponding to the logical address passed to btrfs_map_block(). If it is within an identity remap, this part of the block group hasn't yet been relocated, and so we use the existing address. If it is within an actual remap, we subtract the start of the remap range and add the address of its destination, contained in the item's payload. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:35 +01:00
Mark Harmstone	8620da16fb	btrfs: allow mounting filesystems with remap-tree incompat flag If we encounter a filesystem with the remap-tree incompat flag set, validate its compatibility with the other flags, and load the remap tree using the values that have been added to the superblock. The remap-tree feature depends on the free-space-tree, but no-holes and block-group-tree have been made dependencies to reduce the testing matrix. Similarly I'm not aware of any reason why mixed-bg and zoned would be incompatible with remap-tree, but this is blocked for the time being until it can be fully tested. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:35 +01:00
Mark Harmstone	7977011460	btrfs: add extended version of struct block_group_item Add a struct btrfs_block_group_item_v2, which is used in the block group tree if the remap-tree incompat flag is set. This adds two new fields to the block group item: `remap_bytes` and `identity_remap_count`. `remap_bytes` records the amount of data that's physically within this block group, but nominally in another, remapped block group. This is necessary because this data will need to be moved first if this block group is itself relocated. If `remap_bytes` > 0, this is an indicator to the relocation thread that it will need to search the remap-tree for backrefs. A block group must also have `remap_bytes` == 0 before it can be dropped. `identity_remap_count` records how many identity remap items are located in the remap tree for this block group. When relocation is begun for this block group, this is set to the number of holes in the free-space tree for this range. As identity remaps are converted into actual remaps by the relocation process, this number is decreased. Once it reaches 0, either because of relocation or because extents have been deleted, the block group has been fully remapped and its chunk's device extents are removed. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:34 +01:00
Mark Harmstone	bf8ff4b9f0	btrfs: rename struct btrfs_block_group field commit_used to last_used Rename the field commit_used in struct btrfs_block_group to last_used, for clarity and consistency with the similar fields we're about to add. It's not obvious that commit_flags means "flags as of the last commit" rather than "flags related to a commit". Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:34 +01:00
Mark Harmstone	efcab3176e	btrfs: don't add metadata items for the remap tree to the extent tree There is the following potential problem with the remap tree and delayed refs: * Remapped extent freed in a delayed ref, which removes an entry from the remap tree * Remap tree now small enough to fit in a single leaf * Corruption as we now have a level-0 block with a level-1 metadata item in the extent tree One solution to this would be to rework the remap tree code so that it operates via delayed refs. But as we're hoping to remove cow-only metadata items in the future anyway, change things so that the remap tree doesn't have any entries in the extent tree. This also has the benefit of reducing write amplification. We also make it so that the clear_cache mount option is a no-op, as with the extent tree v2, as the free-space tree can no longer be recreated from the extent tree. Finally disable relocating the remap tree itself, which is added back in a later patch. As it is we would get corruption as the traditional relocation method walks the extent tree, and we're removing its metadata items. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:34 +01:00
Mark Harmstone	76377db18a	btrfs: remove remapped block groups from the free-space-tree No new allocations can be done from block groups that have the REMAPPED flag set, so there's no value in their having entries in the free-space tree. Prevent a search through the free-space tree being scheduled for such a block group, and prevent any additions to the in-memory free-space tree. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:34 +01:00
Mark Harmstone	c3d6dda60c	btrfs: allow remapped chunks to have zero stripes When a chunk has been fully remapped, we are going to set its num_stripes to 0, as it will no longer represent a physical location on disk. Change tree-checker to allow for this, and fix read_one_chunk() to avoid a divide by zero. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:34 +01:00
Mark Harmstone	0b4d29fa98	btrfs: add METADATA_REMAP chunk type Add a new METADATA_REMAP chunk type, which is a metadata chunk that holds the remap tree. This is needed for bootstrapping purposes: the remap tree can't itself be remapped, and must be relocated the existing way, by COWing every leaf. The remap tree can't go in the SYSTEM chunk as space there is limited, because a copy of the chunk item gets placed in the superblock. The changes in fs/btrfs/volumes.h are because we're adding a new block group type bit after the profile bits, and so can no longer rely on the const_ilog2 trick. The sizing to 32MB per chunk, matching the SYSTEM chunk, is an estimate here, we can adjust it later if it proves to be too big or too small. This works out to be ~500,000 remap items, which for a 4KB block size covers ~2GB of remapped data in the worst case and ~500TB in the best case. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:27 +01:00
Mark Harmstone	ef6a31d035	btrfs: add definitions and constants for remap-tree Add an incompat flag for the new remap-tree feature, and the constants and definitions needed to support it. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:02 +01:00
Filipe Manana	c208aa0ef6	btrfs: add and use helper to compute the available space for a block group We have currently three places that compute how much available space a block group has. Add a helper function for this and use it in those places. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:51:44 +01:00
Filipe Manana	b322fa5ff1	btrfs: tag as unlikely error handling in run_one_delayed_ref() We don't expect to get errors unless we have a corrupted fs, bad RAM or a bug. So tag the error handling as unlikely. This slightly reduces the module's text size on x86_64 using gcc 14.2.0-19 from Debian. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1939458 172512 15592 2127562 2076ca fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1939398 172512 15592 2127502 20768e fs/btrfs/btrfs.ko Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:51:43 +01:00
Filipe Manana	271cbe7635	btrfs: remove unnecessary else branch in run_one_delayed_ref() There is no need for an else branch to deal with an unexpected delayed ref type. We can just change the previous branch to deal with this by checking if the ref type is not BTRFS_EXTENT_OWNER_REF_KEY, since that branch is useless as it only sets 'ret' to zero when it's already zero. So merge the two branches. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:51:43 +01:00
Filipe Manana	c7d1d4ff56	btrfs: don't BUG() on unexpected delayed ref type in run_one_delayed_ref() There is no need to BUG(), we can just return an error and log an error message. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:51:43 +01:00
jinbaohong	23d4f616cb	btrfs: use READA_FORWARD_ALWAYS for device extent verification btrfs_verify_dev_extents() iterates through the entire device tree during mount to verify dev extents against chunks. Since this function scans the whole tree, READA_FORWARD_ALWAYS is more appropriate than READA_FORWARD. While the device tree is typically small (a few hundred KB even for multi-TB filesystems), using the correct readahead mode for full-tree iteration is more consistent with the intended usage. Signed-off-by: robbieko <robbieko@synology.com> Signed-off-by: jinbaohong <jinbaohong@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:51:43 +01:00
Qu Wenruo	4681dbcfdc	btrfs: shrink the size of btrfs_device There are two main causes of holes inside btrfs_device: - The single bytes member of last_flush_error Not only it's a single byte member, but we never really care about the exact error number. - The @devt member Which is placed between two u64 members. Shrink the size of btrfs_device by: - Use a single bit flag for flush error Use BTRFS_DEV_STATE_FLUSH_FAILED so that we no longer need that dedicated member. - Move @devt to the hole after dev_stat_values[] This reduces the size of btrfs_device from 528 to exact 512 bytes for x86_64. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:51:43 +01:00
Filipe Manana	8ecf596ed8	btrfs: update comment for delalloc flush and oe wait in btrfs_clone_files() Make the comment more detailed about why we need to flush delalloc and wait for ordered extent completion before attempting to invalidate the page cache. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:51:43 +01:00
Qu Wenruo	ae23fee41b	btrfs: remove experimental offload csum mode The offload csum mode was introduced to allow developers to compare the performance of generating checksum for data writes at different timings: - During btrfs_submit_chunk() This is the most common one, if any of the following condition is met we go this path: * The csum is fast For now it's CRC32C and xxhash. * It's a synchronous write * Zoned - Delay the checksum generation to a workqueue However since commit `dd57c78aec` ("btrfs: introduce btrfs_bio::async_csum") we no longer need to bother any of them. As if it's an experimental build, async checksum generation at the background will be faster anyway. And if not an experimental build, we won't even have the offload csum mode support. Considering the async csum will be the new default, let's remove the offload csum mode code. There will be no impact to end users, and offload csum mode is still under experimental features. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:51:43 +01:00
David Sterba	e582f22030	btrfs: split btrfs_fs_closing() and change return type to bool There are two tests in btrfs_fs_closing() but checking the BTRFS_FS_CLOSING_DONE bit is done only in one place load_extent_tree_free(). As this is an inline we can reduce size of the generated code. The types can be also changed to bool as this becomes a simple condition. text data bss dec hex filename 1674006 146704 15560 1836270 1c04ee pre/btrfs.ko 1673772 146704 15560 1836036 1c0404 post/btrfs.ko DELTA: -234 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:51:42 +01:00
Qu Wenruo	59615e2c1f	btrfs: reject single block sized compression early Currently for an inode that needs compression, even if there is a delalloc range that is single fs block sized and can not be inlined, we will still go through the compression path. Then inside compress_file_range(), we have one extra check to reject single block sized range, and fall back to regular uncompressed write. This rejection is in fact a little too late, we have already allocated memory to async_chunk, delayed the submission, just to fallback to the same uncompressed write. Change the behavior to reject such cases earlier at inode_need_compress(), so for such single block sized range we won't even bother trying to go through compress path. And since the inline small block check is inside inode_need_compress() and compress_file_range() also calls that function, we no longer need a dedicate check inside compress_file_range(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:51:42 +01:00
Julia Lawall	d6f6109fe4	btrfs: update outdated comment in __add_block_group_free_space() The function add_block_group_free_space() was renamed btrfs_add_block_group_free_space() by commit `6fc5ef7829` ("btrfs: add btrfs prefix to free space tree exported functions"). Update the comment accordingly. Do some reorganization of the next few lines to keep the comment within 80 characters. Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:51:42 +01:00
Qu Wenruo	d1a020a8d7	btrfs: add mount time auto fix for orphan fst entries [BUG] Before btrfs-progs v6.16.1 release, mkfs.btrfs can leave free space tree entries for deleted chunks: # mkfs.btrfs -f -O fst $dev # btrfs ins dump-tree -t chunk $dev btrfs-progs v6.16 chunk tree leaf 22036480 items 4 free space 15781 generation 8 owner CHUNK_TREE leaf 22036480 flags 0x1(WRITTEN) backref revision 1 item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98 item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 13631488) itemoff 16105 itemsize 80 ^^^ The first chunk is at 13631488 item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15993 itemsize 112 item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15881 itemsize 112 # btrfs ins dump-tree -t free-space-tree $dev btrfs-progs v6.16 free space tree key (FREE_SPACE_TREE ROOT_ITEM 0) leaf 30556160 items 13 free space 15918 generation 8 owner FREE_SPACE_TREE leaf 30556160 flags 0x1(WRITTEN) backref revision 1 item 0 key (1048576 FREE_SPACE_INFO 4194304) itemoff 16275 itemsize 8 free space info extent count 1 flags 0 item 1 key (1048576 FREE_SPACE_EXTENT 4194304) itemoff 16275 itemsize 0 free space extent item 2 key (5242880 FREE_SPACE_INFO 8388608) itemoff 16267 itemsize 8 free space info extent count 1 flags 0 item 3 key (5242880 FREE_SPACE_EXTENT 8388608) itemoff 16267 itemsize 0 free space extent ^^^ Above 4 items are all before the first chunk. item 4 key (13631488 FREE_SPACE_INFO 8388608) itemoff 16259 itemsize 8 free space info extent count 1 flags 0 item 5 key (13631488 FREE_SPACE_EXTENT 8388608) itemoff 16259 itemsize 0 free space extent ... This can trigger btrfs check errors. [CAUSE] It's a bug in free space tree implementation of btrfs-progs, which doesn't delete involved fst entries for the to-be-deleted chunk/block group. [ENHANCEMENT] The mostly common fix is to clear the space cache and rebuild it, but that requires a ro->rw remount which may not be possible for rootfs, and also relies on users to use "clear_cache" mount option manually. Here introduce a kernel fix for it, which will delete any entries that is before the first block group automatically at the first RW mount. For filesystems without such problem, the overhead is just a single tree search and no modification to the free space tree, thus the overhead should be minimal. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:51:34 +01:00
Zhen Ni	fdb945f665	btrfs: simplify check for zoned NODATASUM writes in btrfs_submit_chunk() This function already dereferences 'inode' multiple times earlier, making the additional NULL check at line 840 redundant since the function would have crashed already if inode were NULL. After commit `81cea6cd70` ("btrfs: remove btrfs_bio::fs_info by extracting it from btrfs_bio::inode"), the btrfs_bio::inode field is mandatory for all btrfs_bio allocations and is guaranteed to be non-NULL. Simplify the condition for allocating dummy checksums for zoned NODATASUM data by removing the unnecessary 'inode &&' check. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Zhen Ni <zhen.ni@easystack.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:50:26 +01:00
Filipe Manana	8d206b0c21	btrfs: avoid transaction commit on error in insert_balance_item() There's no point in committing the transaction if we failed to insert the balance item, since we haven't done anything else after we started/joined the transaction. Also stop using two variables for tracking the return value and use only 'ret'. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:50:23 +01:00
Filipe Manana	7d7608cc9a	btrfs: move unlikely checks around btrfs_is_shutdown() into the helper Instead of surrounding every caller of btrfs_is_shutdown() with unlikely, move the unlikely into the helper itself, like we do in other places in btrfs and is common in the kernel outside btrfs too. Also make the fs_info argument of btrfs_is_shutdown() const. On a x86_84 box using gcc 14.2.0-19 from Debian, this resulted in a slight reduction of the module's text size. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1939044 172568 15592 2127204 207564 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1938876 172568 15592 2127036 2074bc fs/btrfs/btrfs.ko Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:49:12 +01:00
Filipe Manana	858f32937c	btrfs: tag as unlikely error conditions in the transaction commit path Errors are unexpected during the transaction commit path, and when they happen we abort the transaction (by calling cleanup_transaction() under the label 'cleanup_transaction' in btrfs_commit_transaction()). So mark every error check in the transaction commit path as unlikely, to hint the compiler so that it can possibly generate better code, and make it clear for a reader about being unexpected. On a x86_84 box using gcc 14.2.0-19 from Debian, this resulted in a slight reduction of the module's text size. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1939476 172568 15592 2127636 207714 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1939044 172568 15592 2127204 207564 fs/btrfs/btrfs.ko Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:49:12 +01:00
Zhen Ni	4cdb457a23	btrfs: remove unreachable return after btrfs_backref_panic() in btrfs_backref_finish_upper_links() The return statement after btrfs_backref_panic() is unreachable since btrfs_backref_panic() calls BUG() which never returns. Remove the return to unify it with the other calls to btrfs_backref_panic(). Signed-off-by: Zhen Ni <zhen.ni@easystack.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:49:12 +01:00
Qu Wenruo	c28214bde6	btrfs: refactor the main loop of cow_file_range() Currently inside the main loop of cow_file_range(), we do the following sequence: - Reserve an extent - Lock the IO tree range - Create an IO extent map - Create an ordered extent Every step will need extra steps to do cleanup in the following order: - Drop the newly created extent map - Unlock extent range and cleanup the involved folios - Free the reserved extent However currently the error handling is done inconsistently: - Extent map drop is handled in a dedicated tag Out of the main loop, make it much harder to track. - The extent unlock and folios cleanup is done separately The extent is unlocked through btrfs_unlock_extent(), then extent_clear_unlock_delalloc() again in a dedicated tag. Meanwhile all other callsites (compression/encoded/nocow) all just call extent_clear_unlock_delalloc() to handle unlock and folio clean up in one go. - Reserved extent freeing is handled in a dedicated tag Out of the main loop, make it much harder to track. - Error handling of btrfs_reloc_clone_csums() is relying out-of-loop tags This is due to the special requirement to finish ordered extents to handle the metadata reserved space. Enhance the error handling and align the behavior by: - Introduce a dedicated cow_one_range() helper Which do the reserve/lock/allocation in the helper. And also handle the errors inside the helper. No more dedicated tags out of the main loop. - Use a single extent_clear_unlock_delalloc() to unlock and cleanup folios - Move the btrfs_reloc_clone_csums() error handling into the new helper Thankfully it's not that complex compared to other cases. And since we're here, also reduce the width of the following local variables to u32: - cur_alloc_size - min_alloc_size Each allocation won't go beyond 128M, thus u32 is more than enough. - blocksize The maximum is 64K, no need for u64. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:49:12 +01:00
Johannes Thumshirn	9da49784ae	btrfs: zoned: print block-group type for zoned statistics When printing the zoned statistics, also include the block-group type in the block-group listing output. The updated output looks as follows: device /dev/vda mounted on /mnt with fstype btrfs zoned statistics: active block-groups: 9 reclaimable: 0 unused: 2 need reclaim: false data relocation block-group: 3221225472 active zones: start: 1073741824, wp: 268419072 used: 268419072, reserved: 0, unusable: 0 (DATA) start: 1342177280, wp: 0 used: 0, reserved: 0, unusable: 0 (DATA) start: 1610612736, wp: 81920 used: 16384, reserved: 16384, unusable: 49152 (SYSTEM) start: 1879048192, wp: 2031616 used: 1458176, reserved: 65536, unusable: 507904 (METADATA) start: 2147483648, wp: 268419072 used: 268419072, reserved: 0, unusable: 0 (DATA) start: 2415919104, wp: 268419072 used: 268419072, reserved: 0, unusable: 0 (DATA) start: 2684354560, wp: 268419072 used: 268419072, reserved: 0, unusable: 0 (DATA) start: 2952790016, wp: 65536 used: 65536, reserved: 0, unusable: 0 (DATA) start: 3221225472, wp: 0 used: 0, reserved: 0, unusable: 0 (DATA) Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:49:12 +01:00
Johannes Thumshirn	2ef2e97fe7	btrfs: move space_info_flag_to_str() to space-info.h Move space_info_flag_to_str() to space-info.h and as it now isn't static to space-info.c any more prefix it with 'btrfs_'. This way it can be re-used in other places. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:49:12 +01:00
Johannes Thumshirn	6a5ac228d4	btrfs: zoned: show statistics about zoned filesystems in mountstats Add statistics output to /proc/<pid>/mountstats for zoned BTRFS, similar to the zoned statistics from XFS in mountstats. The output for /proc/<pid>/mountstats on an example filesystem will be as follows: device /dev/vda mounted on /mnt with fstype btrfs zoned statistics: active block-groups: 7 reclaimable: 0 unused: 5 need reclaim: false data relocation block-group: 1342177280 active zones: start: 1073741824, wp: 268419072 used: 0, reserved: 268419072, unusable: 0 start: 1342177280, wp: 0 used: 0, reserved: 0, unusable: 0 start: 1610612736, wp: 49152 used: 16384, reserved: 16384, unusable: 16384 start: 1879048192, wp: 950272 used: 131072, reserved: 622592, unusable: 196608 start: 2147483648, wp: 212238336 used: 0, reserved: 212238336, unusable: 0 start: 2415919104, wp: 0 used: 0, reserved: 0, unusable: 0 start: 2684354560, wp: 0 used: 0, reserved: 0, unusable: 0 Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:49:12 +01:00
Filipe Manana	68d4ece9c3	btrfs: don't call btrfs_handle_fs_error() in btrfs_commit_transaction() There's no need to call btrfs_handle_fs_error() as we are inside a transaction and if we get an error we jump to the 'scrub_continue' label and end up calling cleanup_transaction(), which aborts the transaction. This is odd given that we have a transaction handle and that in the transaction commit path any error makes us abort the transaction and it's the only place that calls btrfs_handle_fs_error(). Remove the btrfs_handle_fs_error() call and replace it with an error message so that if it happens we know what went wrong during the transaction commit. Also annotate the condition in the if statement with 'unlikely' since this is not expected to happen. We've been wanting to remove btrfs_handle_fs_error(), so this removes one user that does not even needs it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:49:11 +01:00
Filipe Manana	d15a190d9e	btrfs: don't call btrfs_handle_fs_error() in qgroup_account_snapshot() There's no need to call btrfs_handle_fs_error() as we are inside a transaction and we propagate the error returned from btrfs_write_and_wait_transaction() to the caller and it ends going up the call chain to btrfs_commit_transaction() (returned by the call to create_pending_snapshots()), where we jump to the 'unlock_reloc' label and end up calling cleanup_transaction(), which aborts the transaction. This is odd given that we have a transaction handle and that in the transaction commit path any error makes us abort the transaction and, besides another place inside btrfs_commit_transaction(), it's the only place that calls btrfs_handle_fs_error(). Remove the btrfs_handle_fs_error() call and replace it with an error message so that if it happens we know what went wrong during the transaction commit. Also annotate the condition in the if statement with 'unlikely' since this is not expected to happen. We've been wanting to remove btrfs_handle_fs_error(), so this removes one user that does not even need it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:49:11 +01:00
Filipe Manana	c9b640cefa	btrfs: don't call btrfs_handle_fs_error() after failure to delete orphan item In btrfs_find_orphan_roots() we don't need to call btrfs_handle_fs_error() if we fail to delete the orphan item for the current root. This is because we haven't done anything yet regarding the current root and previous iterations of the loop dealt with other roots, so there's nothing we need to undo. Instead log an error message and return the error to the caller, which will result either in a mount failure or remount failure (the only contexts it's called from). Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:49:11 +01:00
Filipe Manana	8bc612906f	btrfs: don't call btrfs_handle_fs_error() after failure to join transaction In btrfs_find_orphan_roots() we don't need to call btrfs_handle_fs_error() if we fail to join a transaction. This is because we haven't done anything yet regarding the current root and previous iterations of the loop dealt with other roots, so there's nothing we need to undo. Instead log an error message and return the error to the caller, which will result either in a mount failure or remount failure (the only contexts it's called from). Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:49:11 +01:00
Filipe Manana	1923190371	btrfs: remove redundant path release in btrfs_find_orphan_roots() There's no need to release the path in the if branch used when the root does not exists since we released the path before the call to btrfs_get_fs_root(). So remove that redundant btrfs_release_path() call. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:49:11 +01:00

1 2 3 4 5 ...

1414286 Commits