Currently lzo_decompress_bio() is using
compressed_bio->compressed_folios[] array to grab each compressed folio.
This is making the code much easier to read, as we only need to maintain
a single iterator, @cur_in, and can easily grab any random folio using
@cur_in >> min_folio_shift as an index.
However lzo_decompress_bio() itself is ensured to only advance to the
next folio at one time, and compressed_folios[] is just a pointer to
each folio of the compressed bio, thus we have no real random access
requirement for lzo_decompress_bio().
Replace the compressed_folios[] access by a helper, get_current_folio(),
which uses folio_iter and an external folio counter to properly switch
the folio when needed.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the filesystem state validation from btrfs_reclaim_bgs_work() into
btrfs_should_reclaim() to centralize the reclaim eligibility logic.
Since it is the only caller of btrfs_should_reclaim(), there's no
functional change.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Problems with current implementation:
1. reclaimable_bytes is signed while chunk_sz is unsigned, causing
negative reclaimable_bytes to trigger reclaim unexpectedly
2. The "space must be freed between scans" assumption breaks the
two-scan requirement: first scan marks block groups, second scan
reclaims them. Without the second scan, no reclamation occurs.
Instead, track actual reclaim progress: pause reclaim when block groups
will be reclaimed, and resume only when progress is made. This ensures
reclaim continues until no further progress can be made. And resume
periodic reclaim when there's enough free space.
And we take care if reclaim is making any progress now, so it's
unnecessary to set periodic_reclaim_ready to false when failed to reclaim
a block group.
Fixes: 813d4c6422 ("btrfs: prevent pathological periodic reclaim loops")
CC: stable@vger.kernel.org # 6.12+
Suggested-by: Boris Burkov <boris@bur.io>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to pass both the block_group and block_group::io_ctl to
__btrfs_write_out_cache().
Remove passing io_ctl to __btrfs_write_out_cache() and dereference it
inside __btrfs_write_out_cache().
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a helper to calculate an extent map's exclusive end offset, but
we only use it in some places. Update every site that open codes the
calculation to use the helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a helper to calculate a block group's exclusive end offset, but
we only use it in some places. Update every site that open codes the
calculation to use the helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Dan reported a new smatch warning in free-space-cache.c:
New smatch warnings:
fs/btrfs/free-space-cache.c:1207 write_pinned_extent_entries() warn: variable dereferenced before check 'block_group' (see line 1203)
But the check if the block_group pointer is NULL is bogus, because to
get to this point block_group::io_ctl has already been dereferenced
further up the call-chain when calling __btrfs_write_out_cache() from
btrfs_write_out_cache().
Remove the bogus checks for block_group == NULL in
__btrfs_write_out_cache() and it's siblings.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202601170636.WsePMV5H-lkp@intel.com/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a function btrfs_populate_fully_remapped_bgs_list() which gets
called on mount, which looks for fully remapped block groups
(i.e. identity_remap_count == 0) which haven't yet had their chunk
stripes and device extents removed.
This happens when a filesystem is unmounted while async discard has not
yet finished, as otherwise the data range occupied by the chunk stripes
would be permanently unusable.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Discard normally works by iterating over the free-space entries of a
block group. This doesn't work for fully-remapped block groups, as we
removed their free-space entries when we started relocation.
For sync discard, call btrfs_discard_extent() when we commit the
transaction in which the last identity remap was removed.
For async discard, add a new function btrfs_trim_fully_remapped_block_group()
to be called by the discard worker, which iterates over the block
group's range using the normal async discard rules. Once we reach the
end, remove the chunk's stripes and device extents to get back its free
space.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Balancing the METADATA_REMAP chunk, i.e. the chunk in which the remap tree
lives, is a special case.
We can't use the remap tree itself for this, as then we'd have no way to
boostrap it on mount. And we can't use the pre-remap tree code for this
as it relies on walking the extent tree, and we're not creating backrefs
for METADATA_REMAP chunks.
So instead, if a balance would relocate any METADATA_REMAP block groups, mark
those block groups as readonly and COW every leaf of the remap tree.
There's more sophisticated ways of doing this, such as only COWing nodes
within a block group that's to be relocated, but they're fiddly and with
lots of edge cases. Plus it's not anticipated that
a) the number of METADATA_REMAP chunks is going to be particularly large, or
b) that users will want to only relocate some of these chunks - the main
use case here is to unbreak RAID conversion and device removal.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_discard_extent() can be called either when an extent is removed
or from walking the free-space tree. With a remapped block group these
two things are no longer equivalent: the extent's addresses are
remapped, while the free-space tree exclusively uses underlying
addresses.
Add a do_remap parameter to btrfs_discard_extent() and
btrfs_map_discard(), saying whether or not the address needs to be run
through the remap tree first.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a function do_remap_tree_reloc(), which does the actual work of
doing a relocation using the remap tree.
In a loop we call do_remap_reloc_trans(), which searches for the first
identity remap for the block group. We call btrfs_reserve_extent() to
find space elsewhere for it, and read the data into memory and write it
to the new location. We then carve out the identity remap and replace it
with an actual remap, which points to the new location in which to look.
Once the last identity remap has been removed we call
last_identity_remap_gone(), which, as with deletions, removes the
chunk's stripes and device extents.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If when relocating a block group we find that `remap_bytes` > 0 in its
block group item, that means that it has been the destination block
group for another that has been remapped.
We need to search the remap tree for any remap backrefs within this
range, and move the data to a third block group. This is because
otherwise btrfs_translate_remap() could end up following an unbounded
chain of remaps, which would only get worse over time.
We only relocate one block group at a time, so `remap_bytes` will only
ever go down while we are doing this. Once we're finished we set the
REMAPPED flag on the block group, which will permanently prevent any
other data from being moved to within it.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Handle the preliminary work for relocating a block group in a filesystem
with the remap-tree flag set.
If the block group is SYSTEM btrfs_relocate_block_group() proceeds as it
does already, as bootstrapping issues mean that these block groups have
to be processed the existing way. Similarly with METADATA_REMAP blocks, which
are dealt with in a later patch.
Otherwise we walk the free-space tree for the block group in question,
recording any holes. These get converted into identity remaps and placed
in the remap tree, and the block group's REMAPPED flag is set. From now
on no new allocations are possible within this block group, and any I/O
to it will be funnelled through btrfs_translate_remap(). We store the
number of identity remaps in `identity_remap_count`, so that we know
when we've removed the last one and the block group is fully remapped.
The change in btrfs_read_roots() is because data relocations no longer
rely on the data reloc tree as a hidden subvolume in which to do
snapshots.
(Thanks to Sun YangKai for his suggestions.)
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Handle the case where we free an extent from a block group that has the
REMAPPED flag set. Because the remap tree is orthogonal to the extent
tree, for data this may be within any number of identity remaps or
actual remaps. If we're freeing a metadata node, this will be wholly
inside one or the other.
btrfs_remove_extent_from_remap_tree() searches the remap tree for the
remaps that cover the range in question, then calls
remove_range_from_remap_tree() for each one, to punch a hole in the
remap and adjust the free-space tree.
For an identity remap, remove_range_from_remap_tree() will adjust the
block group's `identity_remap_count` if this changes. If it reaches
zero we mark the block group as fully remapped.
For an identity remap, remove_range_from_remap_tree() will adjust the
block group's `identity_remap_count` if this changes. If it reaches
zero we mark the block group as fully remapped.
Fully remapped block groups have their chunk stripes removed and their
device extents freed, which makes the disk space available again to the
chunk allocator. This happens asynchronously: in the cleaner thread for
sync discard and nodiscard, and (in a later patch) in the discard worker
for async discard.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change btrfs_map_block() so that if the block group has the REMAPPED
flag set, we call btrfs_translate_remap() to obtain a new address.
btrfs_translate_remap() searches the remap tree for a range
corresponding to the logical address passed to btrfs_map_block(). If it
is within an identity remap, this part of the block group hasn't yet
been relocated, and so we use the existing address.
If it is within an actual remap, we subtract the start of the remap
range and add the address of its destination, contained in the item's
payload.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we encounter a filesystem with the remap-tree incompat flag set,
validate its compatibility with the other flags, and load the remap tree
using the values that have been added to the superblock.
The remap-tree feature depends on the free-space-tree, but no-holes and
block-group-tree have been made dependencies to reduce the testing
matrix. Similarly I'm not aware of any reason why mixed-bg and zoned would be
incompatible with remap-tree, but this is blocked for the time being
until it can be fully tested.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a struct btrfs_block_group_item_v2, which is used in the block group
tree if the remap-tree incompat flag is set.
This adds two new fields to the block group item: `remap_bytes` and
`identity_remap_count`.
`remap_bytes` records the amount of data that's physically within this
block group, but nominally in another, remapped block group. This is
necessary because this data will need to be moved first if this block
group is itself relocated. If `remap_bytes` > 0, this is an indicator to
the relocation thread that it will need to search the remap-tree for
backrefs. A block group must also have `remap_bytes` == 0 before it can
be dropped.
`identity_remap_count` records how many identity remap items are located
in the remap tree for this block group. When relocation is begun for
this block group, this is set to the number of holes in the free-space
tree for this range. As identity remaps are converted into actual remaps
by the relocation process, this number is decreased. Once it reaches 0,
either because of relocation or because extents have been deleted, the
block group has been fully remapped and its chunk's device extents are
removed.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename the field commit_used in struct btrfs_block_group to last_used,
for clarity and consistency with the similar fields we're about to add.
It's not obvious that commit_flags means "flags as of the last commit"
rather than "flags related to a commit".
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is the following potential problem with the remap tree and delayed refs:
* Remapped extent freed in a delayed ref, which removes an entry from the
remap tree
* Remap tree now small enough to fit in a single leaf
* Corruption as we now have a level-0 block with a level-1 metadata item
in the extent tree
One solution to this would be to rework the remap tree code so that it operates
via delayed refs. But as we're hoping to remove cow-only metadata items in the
future anyway, change things so that the remap tree doesn't have any entries in
the extent tree. This also has the benefit of reducing write amplification.
We also make it so that the clear_cache mount option is a no-op, as with the
extent tree v2, as the free-space tree can no longer be recreated from the
extent tree.
Finally disable relocating the remap tree itself, which is added back in
a later patch. As it is we would get corruption as the traditional
relocation method walks the extent tree, and we're removing its metadata
items.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
No new allocations can be done from block groups that have the REMAPPED
flag set, so there's no value in their having entries in the free-space
tree.
Prevent a search through the free-space tree being scheduled for such a
block group, and prevent any additions to the in-memory free-space tree.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a chunk has been fully remapped, we are going to set its
num_stripes to 0, as it will no longer represent a physical location on
disk.
Change tree-checker to allow for this, and fix read_one_chunk() to avoid
a divide by zero.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a new METADATA_REMAP chunk type, which is a metadata chunk that holds the
remap tree.
This is needed for bootstrapping purposes: the remap tree can't itself
be remapped, and must be relocated the existing way, by COWing every
leaf. The remap tree can't go in the SYSTEM chunk as space there is
limited, because a copy of the chunk item gets placed in the superblock.
The changes in fs/btrfs/volumes.h are because we're adding a new block
group type bit after the profile bits, and so can no longer rely on the
const_ilog2 trick.
The sizing to 32MB per chunk, matching the SYSTEM chunk, is an estimate
here, we can adjust it later if it proves to be too big or too small.
This works out to be ~500,000 remap items, which for a 4KB block size
covers ~2GB of remapped data in the worst case and ~500TB in the best case.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add an incompat flag for the new remap-tree feature, and the constants
and definitions needed to support it.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have currently three places that compute how much available space a
block group has. Add a helper function for this and use it in those
places.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't expect to get errors unless we have a corrupted fs, bad RAM or a
bug. So tag the error handling as unlikely.
This slightly reduces the module's text size on x86_64 using gcc 14.2.0-19
from Debian.
Before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1939458 172512 15592 2127562 2076ca fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1939398 172512 15592 2127502 20768e fs/btrfs/btrfs.ko
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need for an else branch to deal with an unexpected delayed ref
type. We can just change the previous branch to deal with this by checking
if the ref type is not BTRFS_EXTENT_OWNER_REF_KEY, since that branch is
useless as it only sets 'ret' to zero when it's already zero. So merge the
two branches.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to BUG(), we can just return an error and log an error
message.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_verify_dev_extents() iterates through the entire device tree
during mount to verify dev extents against chunks. Since this function
scans the whole tree, READA_FORWARD_ALWAYS is more appropriate than
READA_FORWARD.
While the device tree is typically small (a few hundred KB even for
multi-TB filesystems), using the correct readahead mode for full-tree
iteration is more consistent with the intended usage.
Signed-off-by: robbieko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two main causes of holes inside btrfs_device:
- The single bytes member of last_flush_error
Not only it's a single byte member, but we never really care about the
exact error number.
- The @devt member
Which is placed between two u64 members.
Shrink the size of btrfs_device by:
- Use a single bit flag for flush error
Use BTRFS_DEV_STATE_FLUSH_FAILED so that we no longer need that
dedicated member.
- Move @devt to the hole after dev_stat_values[]
This reduces the size of btrfs_device from 528 to exact 512 bytes for
x86_64.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make the comment more detailed about why we need to flush delalloc and
wait for ordered extent completion before attempting to invalidate the
page cache.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The offload csum mode was introduced to allow developers to compare the
performance of generating checksum for data writes at different timings:
- During btrfs_submit_chunk()
This is the most common one, if any of the following condition is met
we go this path:
* The csum is fast
For now it's CRC32C and xxhash.
* It's a synchronous write
* Zoned
- Delay the checksum generation to a workqueue
However since commit dd57c78aec ("btrfs: introduce
btrfs_bio::async_csum") we no longer need to bother any of them.
As if it's an experimental build, async checksum generation at the
background will be faster anyway.
And if not an experimental build, we won't even have the offload csum
mode support.
Considering the async csum will be the new default, let's remove the
offload csum mode code.
There will be no impact to end users, and offload csum mode is still
under experimental features.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two tests in btrfs_fs_closing() but checking the
BTRFS_FS_CLOSING_DONE bit is done only in one place
load_extent_tree_free(). As this is an inline we can reduce size of the
generated code. The types can be also changed to bool as this becomes a
simple condition.
text data bss dec hex filename
1674006 146704 15560 1836270 1c04ee pre/btrfs.ko
1673772 146704 15560 1836036 1c0404 post/btrfs.ko
DELTA: -234
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently for an inode that needs compression, even if there is a delalloc
range that is single fs block sized and can not be inlined, we will
still go through the compression path.
Then inside compress_file_range(), we have one extra check to reject
single block sized range, and fall back to regular uncompressed write.
This rejection is in fact a little too late, we have already allocated
memory to async_chunk, delayed the submission, just to fallback to the
same uncompressed write.
Change the behavior to reject such cases earlier at
inode_need_compress(), so for such single block sized range we won't
even bother trying to go through compress path.
And since the inline small block check is inside inode_need_compress()
and compress_file_range() also calls that function, we no longer need a
dedicate check inside compress_file_range().
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function add_block_group_free_space() was renamed
btrfs_add_block_group_free_space() by commit 6fc5ef7829 ("btrfs:
add btrfs prefix to free space tree exported functions"). Update
the comment accordingly.
Do some reorganization of the next few lines to keep the comment
within 80 characters.
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Before btrfs-progs v6.16.1 release, mkfs.btrfs can leave free space tree
entries for deleted chunks:
# mkfs.btrfs -f -O fst $dev
# btrfs ins dump-tree -t chunk $dev
btrfs-progs v6.16
chunk tree
leaf 22036480 items 4 free space 15781 generation 8 owner CHUNK_TREE
leaf 22036480 flags 0x1(WRITTEN) backref revision 1
item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 13631488) itemoff 16105 itemsize 80
^^^ The first chunk is at 13631488
item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15993 itemsize 112
item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15881 itemsize 112
# btrfs ins dump-tree -t free-space-tree $dev
btrfs-progs v6.16
free space tree key (FREE_SPACE_TREE ROOT_ITEM 0)
leaf 30556160 items 13 free space 15918 generation 8 owner FREE_SPACE_TREE
leaf 30556160 flags 0x1(WRITTEN) backref revision 1
item 0 key (1048576 FREE_SPACE_INFO 4194304) itemoff 16275 itemsize 8
free space info extent count 1 flags 0
item 1 key (1048576 FREE_SPACE_EXTENT 4194304) itemoff 16275 itemsize 0
free space extent
item 2 key (5242880 FREE_SPACE_INFO 8388608) itemoff 16267 itemsize 8
free space info extent count 1 flags 0
item 3 key (5242880 FREE_SPACE_EXTENT 8388608) itemoff 16267 itemsize 0
free space extent
^^^ Above 4 items are all before the first chunk.
item 4 key (13631488 FREE_SPACE_INFO 8388608) itemoff 16259 itemsize 8
free space info extent count 1 flags 0
item 5 key (13631488 FREE_SPACE_EXTENT 8388608) itemoff 16259 itemsize 0
free space extent
...
This can trigger btrfs check errors.
[CAUSE]
It's a bug in free space tree implementation of btrfs-progs, which
doesn't delete involved fst entries for the to-be-deleted chunk/block
group.
[ENHANCEMENT]
The mostly common fix is to clear the space cache and rebuild it, but
that requires a ro->rw remount which may not be possible for rootfs,
and also relies on users to use "clear_cache" mount option manually.
Here introduce a kernel fix for it, which will delete any entries that
is before the first block group automatically at the first RW mount.
For filesystems without such problem, the overhead is just a single tree
search and no modification to the free space tree, thus the overhead
should be minimal.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already dereferences 'inode' multiple times earlier,
making the additional NULL check at line 840 redundant since the
function would have crashed already if inode were NULL.
After commit 81cea6cd70 ("btrfs: remove btrfs_bio::fs_info by
extracting it from btrfs_bio::inode"), the btrfs_bio::inode field is
mandatory for all btrfs_bio allocations and is guaranteed to be
non-NULL.
Simplify the condition for allocating dummy checksums for zoned
NODATASUM data by removing the unnecessary 'inode &&' check.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no point in committing the transaction if we failed to insert the
balance item, since we haven't done anything else after we started/joined
the transaction. Also stop using two variables for tracking the return
value and use only 'ret'.
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of surrounding every caller of btrfs_is_shutdown() with unlikely,
move the unlikely into the helper itself, like we do in other places in
btrfs and is common in the kernel outside btrfs too. Also make the fs_info
argument of btrfs_is_shutdown() const.
On a x86_84 box using gcc 14.2.0-19 from Debian, this resulted in a slight
reduction of the module's text size.
Before:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1939044 172568 15592 2127204 207564 fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1938876 172568 15592 2127036 2074bc fs/btrfs/btrfs.ko
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Errors are unexpected during the transaction commit path, and when they
happen we abort the transaction (by calling cleanup_transaction() under
the label 'cleanup_transaction' in btrfs_commit_transaction()). So mark
every error check in the transaction commit path as unlikely, to hint the
compiler so that it can possibly generate better code, and make it clear
for a reader about being unexpected.
On a x86_84 box using gcc 14.2.0-19 from Debian, this resulted in a slight
reduction of the module's text size.
Before:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1939476 172568 15592 2127636 207714 fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1939044 172568 15592 2127204 207564 fs/btrfs/btrfs.ko
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The return statement after btrfs_backref_panic() is unreachable since
btrfs_backref_panic() calls BUG() which never returns. Remove the
return to unify it with the other calls to btrfs_backref_panic().
Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently inside the main loop of cow_file_range(), we do the following
sequence:
- Reserve an extent
- Lock the IO tree range
- Create an IO extent map
- Create an ordered extent
Every step will need extra steps to do cleanup in the following order:
- Drop the newly created extent map
- Unlock extent range and cleanup the involved folios
- Free the reserved extent
However currently the error handling is done inconsistently:
- Extent map drop is handled in a dedicated tag
Out of the main loop, make it much harder to track.
- The extent unlock and folios cleanup is done separately
The extent is unlocked through btrfs_unlock_extent(), then
extent_clear_unlock_delalloc() again in a dedicated tag.
Meanwhile all other callsites (compression/encoded/nocow) all just
call extent_clear_unlock_delalloc() to handle unlock and folio clean
up in one go.
- Reserved extent freeing is handled in a dedicated tag
Out of the main loop, make it much harder to track.
- Error handling of btrfs_reloc_clone_csums() is relying out-of-loop
tags
This is due to the special requirement to finish ordered extents to
handle the metadata reserved space.
Enhance the error handling and align the behavior by:
- Introduce a dedicated cow_one_range() helper
Which do the reserve/lock/allocation in the helper.
And also handle the errors inside the helper.
No more dedicated tags out of the main loop.
- Use a single extent_clear_unlock_delalloc() to unlock and cleanup
folios
- Move the btrfs_reloc_clone_csums() error handling into the new helper
Thankfully it's not that complex compared to other cases.
And since we're here, also reduce the width of the following local
variables to u32:
- cur_alloc_size
- min_alloc_size
Each allocation won't go beyond 128M, thus u32 is more than enough.
- blocksize
The maximum is 64K, no need for u64.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move space_info_flag_to_str() to space-info.h and as it now isn't static
to space-info.c any more prefix it with 'btrfs_'.
This way it can be re-used in other places.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to call btrfs_handle_fs_error() as we are inside a
transaction and if we get an error we jump to the 'scrub_continue' label
and end up calling cleanup_transaction(), which aborts the transaction.
This is odd given that we have a transaction handle and that in the
transaction commit path any error makes us abort the transaction and
it's the only place that calls btrfs_handle_fs_error().
Remove the btrfs_handle_fs_error() call and replace it with an error
message so that if it happens we know what went wrong during the
transaction commit. Also annotate the condition in the if statement
with 'unlikely' since this is not expected to happen.
We've been wanting to remove btrfs_handle_fs_error(), so this removes
one user that does not even needs it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to call btrfs_handle_fs_error() as we are inside a
transaction and we propagate the error returned from
btrfs_write_and_wait_transaction() to the caller and it ends going up the
call chain to btrfs_commit_transaction() (returned by the call to
create_pending_snapshots()), where we jump to the 'unlock_reloc' label
and end up calling cleanup_transaction(), which aborts the transaction.
This is odd given that we have a transaction handle and that in the
transaction commit path any error makes us abort the transaction and,
besides another place inside btrfs_commit_transaction(), it's the only
place that calls btrfs_handle_fs_error().
Remove the btrfs_handle_fs_error() call and replace it with an error
message so that if it happens we know what went wrong during the
transaction commit. Also annotate the condition in the if statement
with 'unlikely' since this is not expected to happen.
We've been wanting to remove btrfs_handle_fs_error(), so this removes
one user that does not even need it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_find_orphan_roots() we don't need to call btrfs_handle_fs_error()
if we fail to delete the orphan item for the current root. This is because
we haven't done anything yet regarding the current root and previous
iterations of the loop dealt with other roots, so there's nothing we need
to undo. Instead log an error message and return the error to the caller,
which will result either in a mount failure or remount failure (the only
contexts it's called from).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_find_orphan_roots() we don't need to call btrfs_handle_fs_error()
if we fail to join a transaction. This is because we haven't done anything
yet regarding the current root and previous iterations of the loop dealt
with other roots, so there's nothing we need to undo. Instead log an error
message and return the error to the caller, which will result either in
a mount failure or remount failure (the only contexts it's called from).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to release the path in the if branch used when the root
does not exists since we released the path before the call to
btrfs_get_fs_root(). So remove that redundant btrfs_release_path() call.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>