Commit Graph

1090617 Commits

Author SHA1 Message Date
Qu Wenruo
d201238ccd btrfs: repair super block num_devices automatically
[BUG]
There is a report that a btrfs has a bad super block num devices.

This makes btrfs to reject the fs completely.

  BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here
  BTRFS error (device sdd3): failed to read chunk tree: -22
  BTRFS error (device sdd3): open_ctree failed

[CAUSE]
During btrfs device removal, chunk tree and super block num devs are
updated in two different transactions:

  btrfs_rm_device()
  |- btrfs_rm_dev_item(device)
  |  |- trans = btrfs_start_transaction()
  |  |  Now we got transaction X
  |  |
  |  |- btrfs_del_item()
  |  |  Now device item is removed from chunk tree
  |  |
  |  |- btrfs_commit_transaction()
  |     Transaction X got committed, super num devs untouched,
  |     but device item removed from chunk tree.
  |     (AKA, super num devs is already incorrect)
  |
  |- cur_devices->num_devices--;
  |- cur_devices->total_devices--;
  |- btrfs_set_super_num_devices()
     All those operations are not in transaction X, thus it will
     only be written back to disk in next transaction.

So after the transaction X in btrfs_rm_dev_item() committed, but before
transaction X+1 (which can be minutes away), a power loss happen, then
we got the super num mismatch.

This has been fixed by commit bbac58698a ("btrfs: remove device item
and update super block in the same transaction").

[FIX]
Make the super_num_devices check less strict, converting it from a hard
error to a warning, and reset the value to a correct one for the current
or next transaction commit.

As the number of device items is the critical information where the
super block num_devices is only a cached value (and also useful for
cross checking), it's safe to automatically update it. Other device
related problems like missing device are handled after that and may
require other means to resolve, like degraded mount. With this fix,
potentially affected filesystems won't fail mount and require the manual
repair by btrfs check.

Reported-by: Luca Béla Palkovics <luca.bela.palkovics@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46CKmp3igr44okQ@mail.gmail.com/
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:14 +02:00
Goldwyn Rodrigues
46fbd18e78 btrfs: do not pass compressed_bio to submit_compressed_bio()
Parameter struct compressed_bio is not used by the function
submit_compressed_bio(). Remove it.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:13 +02:00
Filipe Manana
2306e83e73 btrfs: avoid double search for block group during NOCOW writes
When doing a NOCOW write, either through direct IO or buffered IO, we do
two lookups for the block group that contains the target extent: once
when we call btrfs_inc_nocow_writers() and then later again when we call
btrfs_dec_nocow_writers() after creating the ordered extent.

The lookups require taking a lock and navigating the red black tree used
to track all block groups, which can take a non-negligible amount of time
for a large filesystem with thousands of block groups, as well as lock
contention and cache line bouncing.

Improve on this by having a single block group search: making
btrfs_inc_nocow_writers() return the block group to its caller and then
have the caller pass that block group to btrfs_dec_nocow_writers().

This is part of a patchset comprised of the following patches:

  btrfs: remove search start argument from first_logical_byte()
  btrfs: use rbtree with leftmost node cached for tracking lowest block group
  btrfs: use a read/write lock for protecting the block groups tree
  btrfs: return block group directly at btrfs_next_block_group()
  btrfs: avoid double search for block group during NOCOW writes

The following test was used to test these changes from a performance
perspective:

   $ cat test.sh
   #!/bin/bash

   modprobe null_blk nr_devices=0

   NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0
   mkdir $NULL_DEV_PATH
   if [ $? -ne 0 ]; then
       echo "Failed to create nullb0 directory."
       exit 1
   fi
   echo 2 > $NULL_DEV_PATH/submit_queues
   echo 16384 > $NULL_DEV_PATH/size # 16G
   echo 1 > $NULL_DEV_PATH/memory_backed
   echo 1 > $NULL_DEV_PATH/power

   DEV=/dev/nullb0
   MNT=/mnt/nullb0
   LOOP_MNT="$MNT/loop"
   MOUNT_OPTIONS="-o ssd -o nodatacow"
   MKFS_OPTIONS="-R free-space-tree -O no-holes"

   cat <<EOF > /tmp/fio-job.ini
   [io_uring_writes]
   rw=randwrite
   fsync=0
   fallocate=posix
   group_reporting=1
   direct=1
   ioengine=io_uring
   iodepth=64
   bs=64k
   filesize=1g
   runtime=300
   time_based
   directory=$LOOP_MNT
   numjobs=8
   thread
   EOF

   echo performance | \
       tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

   echo
   echo "Using config:"
   echo
   cat /tmp/fio-job.ini
   echo

   umount $MNT &> /dev/null
   mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
   mount $MOUNT_OPTIONS $DEV $MNT

   mkdir $LOOP_MNT

   truncate -s 4T $MNT/loopfile
   mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null
   mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT

   # Trigger the allocation of about 3500 data block groups, without
   # actually consuming space on underlying filesystem, just to make
   # the tree of block group large.
   fallocate -l 3500G $LOOP_MNT/filler

   fio /tmp/fio-job.ini

   umount $LOOP_MNT
   umount $MNT

   echo 0 > $NULL_DEV_PATH/power
   rmdir $NULL_DEV_PATH

The test was run on a non-debug kernel (Debian's default kernel config),
the result were the following.

Before patchset:

  WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec

After patchset:

  WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec

  +3.3% write throughput and +3.3% IO done in the same time period.

The test has somewhat limited coverage scope, as with only NOCOW writes
we get less contention on the red black tree of block groups, since we
don't have the extra contention caused by COW writes, namely when
allocating data extents, pinning and unpinning data extents, but on the
hand there's access to tree in the NOCOW path, when incrementing a block
group's number of NOCOW writers.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:13 +02:00
Filipe Manana
8b01f931c1 btrfs: return block group directly at btrfs_next_block_group()
At btrfs_next_block_group(), we have this long line with two statements:

  cache = btrfs_lookup_first_block_group(...); return cache;

This makes it a bit harder to read due to two statements on the same
line, so change that to directly return the result of the call to
btrfs_lookup_first_block_group().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:13 +02:00
Filipe Manana
16b0c2581e btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.

Read only searches on the tree are very frequent and done when:

1) Pinning and unpinning extents, as we need to lookup the respective
   block group from the tree;

2) Freeing the last reference of a tree block, regardless if we pin the
   underlying extent or add it back to free space cache/tree;

3) During NOCOW writes, both buffered IO and direct IO, we need to check
   if the block group that contains an extent is read only or not and to
   increment the number of NOCOW writers in the block group. For those
   operations we need to search for the block group in the tree.
   Similarly, after creating the ordered extent for the NOCOW write, we
   need to decrement the number of NOCOW writers from the same block
   group, which requires searching for it in the tree;

4) Decreasing the number of extent reservations in a block group;

5) When allocating extents and freeing reserved extents;

6) Adding and removing free space to the free space tree;

7) When releasing delalloc bytes during ordered extent completion;

8) When relocating a block group;

9) During fitrim, to iterate over the block groups;

10) etc;

Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.

We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.

So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:13 +02:00
Filipe Manana
08dddb2951 btrfs: use rbtree with leftmost node cached for tracking lowest block group
We keep track of the start offset of the block group with the lowest start
offset at fs_info->first_logical_byte. This requires explicitly updating
that field every time we add, delete or lookup a block group to/from the
red black tree at fs_info->block_group_cache_tree.

Since the block group with the lowest start address happens to always be
the one that is the leftmost node of the tree, we can use a red black tree
that caches the left most node. Then when we need the start address of
that block group, we can just quickly get the leftmost node in the tree
and extract the start offset of that node's block group. This avoids the
need to explicitly keep track of that address in the dedicated member
fs_info->first_logical_byte, and it also allows the next patch in the
series to switch the lock that protects the red black tree from a spin
lock to a read/write lock - without this change it would be tricky
because block group searches also update fs_info->first_logical_byte.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:13 +02:00
Filipe Manana
0eb997bff0 btrfs: remove search start argument from first_logical_byte()
The search start argument passed to first_logical_byte() is always 0, as
we always want to get the logical start address of the block group with
the lowest logical start address. So remove it, as not only it is not
necessary, it also makes the following patches that change the lock that
protects the red black tree of block groups from a spin lock to a
read/write lock.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:13 +02:00
Qu Wenruo
44e5801fad btrfs: return correct error number for __extent_writepage_io()
[BUG]
If we hit an error from submit_extent_page() inside
__extent_writepage_io(), we could still return 0 to the caller, and
even trigger the warning in btrfs_page_assert_not_dirty().

[CAUSE]
In __extent_writepage_io(), if we hit an error from
submit_extent_page(), we will just clean up the range and continue.

This is completely fine for regular PAGE_SIZE == sectorsize, as we can
only hit one sector in one page, thus after the error we're ensured to
exit and @ret will be saved.

But for subpage case, we may have other dirty subpage range in the page,
and in the next loop, we may succeeded submitting the next range.

In that case, @ret will be overwritten, and we return 0 to the caller,
while we have hit some error.

[FIX]
Introduce @has_error and @saved_ret to record the first error we hit, so
we will never forget what error we hit.

CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:13 +02:00
Qu Wenruo
10f7f6f879 btrfs: fix the error handling for submit_extent_page() for btrfs_do_readpage()
[BUG]
Test case generic/475 have a very high chance (almost 100%) to hit a fs
hang, where a data page will never be unlocked and hang all later
operations.

[CAUSE]
In btrfs_do_readpage(), if we hit an error from submit_extent_page() we
will try to do the cleanup for our current io range, and exit.

This works fine for PAGE_SIZE == sectorsize cases, but not for subpage.

For subpage btrfs_do_readpage() will lock the full page first, which can
contain several different sectors and extents:

 btrfs_do_readpage()
 |- begin_page_read()
 |  |- btrfs_subpage_start_reader();
 |     Now the page will have PAGE_SIZE / sectorsize reader pending,
 |     and the page is locked.
 |
 |- end_page_read() for different branches
 |  This function will reduce subpage readers, and when readers
 |  reach 0, it will unlock the page.

But when submit_extent_page() failed, we only cleanup the current
io range, while the remaining io range will never be cleaned up, and the
page remains locked forever.

[FIX]
Update the error handling of submit_extent_page() to cleanup all the
remaining subpage range before exiting the loop.

Please note that, now submit_extent_page() can only fail due to
sanity check in alloc_new_bio().

Thus regular IO errors are impossible to trigger the error path.

CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:13 +02:00
Qu Wenruo
c9583ada8c btrfs: avoid double clean up when submit_one_bio() failed
[BUG]
When running generic/475 with 64K page size and 4K sector size, it has a
very high chance (almost 100%) to hang, with mostly data page locked but
no one is going to unlock it.

[CAUSE]
With commit 1784b7d502 ("btrfs: handle csum lookup errors properly on
reads"), if we failed to lookup checksum due to metadata IO error, we
will return error for btrfs_submit_data_bio().

This will cause the page to be unlocked twice in btrfs_do_readpage():

 btrfs_do_readpage()
 |- submit_extent_page()
 |  |- submit_one_bio()
 |     |- btrfs_submit_data_bio()
 |        |- if (ret) {
 |        |-     bio->bi_status = ret;
 |        |-     bio_endio(bio); }
 |               In the endio function, we will call end_page_read()
 |               and unlock_extent() to cleanup the subpage range.
 |
 |- if (ret) {
 |-        unlock_extent(); end_page_read() }
           Here we unlock the extent and cleanup the subpage range
           again.

For unlock_extent(), it's mostly double unlock safe.

But for end_page_read(), it's not, especially for subpage case,
as for subpage case we will call btrfs_subpage_end_reader() to reduce
the reader number, and use that to number to determine if we need to
unlock the full page.

If double accounted, it can underflow the number and leave the page
locked without anyone to unlock it.

[FIX]
The commit 1784b7d502 ("btrfs: handle csum lookup errors properly on
reads") itself is completely fine, it's our existing code not properly
handling the error from bio submission hook properly.

This patch will make submit_one_bio() to return void so that the callers
will never be able to do cleanup when bio submission hook fails.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:13 +02:00
Schspa Shi
dd7382a2a7 btrfs: use non-bh spin_lock in zstd timer callback
This is an optimization for fix fee13fe965 ("btrfs: correct zstd
workspace manager lock to use spin_lock_bh()")

The critical region for wsm.lock is only accessed by the process context and
the softirq context.

Because in the soft interrupt, the critical section will not be
preempted by the soft interrupt again, there is no need to call
spin_lock_bh(&wsm.lock) to turn off the soft interrupt,
spin_lock(&wsm.lock) is enough for this situation.

Signed-off-by: Schspa Shi <schspa@gmail.com>
[ minor comment update ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:13 +02:00
Filipe Manana
490243884e btrfs: use BTRFS_DIR_START_INDEX at btrfs_create_new_inode()
We are still using the magic value of 2 at btrfs_create_new_inode(), but
there's now a constant for that, named BTRFS_DIR_START_INDEX, which was
introduced in commit 528ee69712 ("btrfs: put initial index value of a
directory in a constant"). So change that to use the constant.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:13 +02:00
Qu Wenruo
c0111c4417 btrfs: simplify parameters of submit_read_repair() and rename
Cleanup the function submit_read_repair() by:

- Remove the fixed argument submit_bio_hook()
  The function is only called on buffered data read path, so the
  @submit_bio_hook argument is always btrfs_submit_data_bio().

  Since it's fixed, then there is no need to pass that argument at all.

- Rename the function to submit_data_read_repair()
  Just to be more explicit on all the 3 things, data, read and repair.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:13 +02:00
Christoph Hellwig
8e010b3d70 btrfs: remove the zoned/zone_size union in struct btrfs_fs_info
Reading a value from a different member of a union is not just a great
way to obfuscate code, but also creates an aliasing violation.  Switch
btrfs_is_zoned to look at ->zone_size and remove the union.

Note: union was to simplify the detection of zoned filesystem but now
this is wrapped behind btrfs_is_zoned so we can drop the union.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:12 +02:00
Lv Ruyi
8aa1e49ea1 btrfs: remove unnecessary check of iput argument
iput() already handles NULL and non-NULL parameter, so it is not needed
to check that. This unifies all iput calls.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Lv Ruyi <lv.ruyi@zte.com.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:12 +02:00
Christoph Hellwig
b027669449 btrfs: stop using the btrfs_bio saved iter in index_rbio_pages
The bios added to ->bio_list are the original bios fed into
btrfs_map_bio, which are never advanced.  Just use the iter in the
bio itself.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:12 +02:00
Christoph Hellwig
75c17e6666 btrfs: don't allocate a btrfs_bio for scrub bios
All the scrub bios go straight to the block device or the raid56 code,
none of which looks at the btrfs_bio.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:12 +02:00
Christoph Hellwig
e1b4b44e00 btrfs: don't allocate a btrfs_bio for raid56 per-stripe bios
Except for the spurious initialization of ->device just after allocation
nothing uses the btrfs_bio, so just allocate a normal bio without extra
data.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:12 +02:00
Christoph Hellwig
e01bf588f8 btrfs: pass bio opf to rbio_add_io_page
Prepare for further refactoring by moving this initialization to a
single place instead of setting it in the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:12 +02:00
Christoph Hellwig
110ac0e543 btrfs: pass a block_device to btrfs_bio_clone
Pass the block_device to bio_alloc_clone instead of setting it later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:12 +02:00
Christoph Hellwig
fce3f24ada btrfs: move the call to bio_set_dev out of submit_stripe_bio
Prepare for additional refactoring, btrfs_map_bio is direct caller of
submit_stripe_bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:12 +02:00
Christoph Hellwig
f77dcc0d64 btrfs: use on-stack bio in scrub_repair_page_from_good_copy
The I/O in repair_io_failue is synchronous and doesn't need a btrfs_bio,
so just use an on-stack bio.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:12 +02:00
Christoph Hellwig
f3b8a7f3fb btrfs: use on-stack bio in scrub_recheck_block
The I/O in repair_io_failue is synchronous and doesn't need a btrfs_bio,
so just use an on-stack bio.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:12 +02:00
Christoph Hellwig
e9458bfe5f btrfs: use on-stack bio in repair_io_failure
The I/O in repair_io_failue is synchronous and doesn't need a btrfs_bio,
so just use an on-stack bio.  Also cleanup the error handling to use goto
labels and not discard the actual return values.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:12 +02:00
Christoph Hellwig
91e3b5f1e2 btrfs: check-integrity: simplify bio allocation in btrfsic_read_block
btrfsic_read_block does not need the btrfs_bio structure, so switch to
plain bio_alloc (that also does not fail as it's backed by a bioset).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:12 +02:00
Christoph Hellwig
58ff51f148 btrfs: check-integrity: split submit_bio from btrfsic checking
Require a separate call to the integrity checking helpers from the
actual bio submission.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:12 +02:00
Christoph Hellwig
57906d58e2 btrfs: factor check and flush helpers from __btrfsic_submit_bio
Split out two helpers to make __btrfsic_submit_bio more readable.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Johannes Thumshirn
3687fcb075 btrfs: zoned: make auto-reclaim less aggressive
The current auto-reclaim algorithm starts reclaiming all block groups
with a zone_unusable value above a configured threshold. This is causing
a lot of reclaim IO even if there would be enough free zones on the
device.

Instead of only accounting a block groups zone_unusable value, also take
the ratio of free and not usable (written as well as zone_unusable)
bytes a device has into account.

Tested-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Josef Bacik
ef972e7b5e btrfs: change the bg_reclaim_threshold valid region from 0 to 100
For the non-zoned case we may want to set the threshold for reclaim to
something below 50%.  Change the acceptable threshold from 50-100 to
0-100.

Tested-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Josef Bacik
ac2f1e63c6 btrfs: allow block group background reclaim for non-zoned filesystems
This will allow us to set a threshold for block groups to be
automatically relocated even if we don't have zoned devices.

We have found this feature invaluable at Facebook due to how our
workload interacts with the allocator.  We have been using this in
production for months with only a single problem that has already been
fixed.

Tested-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Josef Bacik
bb5a098d97 btrfs: make the bg_reclaim_threshold per-space info
For non-zoned file systems it's useful to have the auto reclaim feature,
however there are different use cases for non-zoned, for example we may
not want to reclaim metadata chunks ever, only data chunks.  Move this
sysfs flag to per-space_info.  This won't affect current users because
this tunable only ever did anything for zoned, and that is currently
hidden behind BTRFS_CONFIG_DEBUG.

Tested-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ jth restore global bg_reclaim_threshold ]
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Filipe Manana
a7bb6bd4bd btrfs: do not test for free space inode during NOCOW check against file extent
When checking if we can do a NOCOW write against a range covered by a file
extent item, we do a quick a check to determine if the inode's root was
snapshotted in a generation older than the generation of the file extent
item or not. This is to quickly determine if the extent is likely shared
and avoid the expensive check for cross references (this was added in
commit 78d4295b1e ("btrfs: lift some btrfs_cross_ref_exist checks in
nocow path").

We restrict that check to the case where the inode is not a free space
inode (since commit 27a7ff554e ("btrfs: skip file_extent generation
check for free_space_inode in run_delalloc_nocow")). That is because when
we had the inode cache feature, inode caches were backed by a free space
inode that belonged to the inode's root.

However we don't have support for the inode cache feature since kernel
5.11, so we don't need this check anymore since free space inodes are
now always related to free space caches, which are always associated to
the root tree (which can't be snapshotted, and its last_snapshot field
is always 0).

So remove that condition.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Filipe Manana
619104ba45 btrfs: move common NOCOW checks against a file extent into a helper
Verifying if we can do a NOCOW write against a range fully or partially
covered by a file extent item requires verifying several constraints, and
these are currently duplicated at two different places: can_nocow_extent()
and run_delalloc_nocow().

This change moves those checks into a common helper function to avoid
duplication. It adds some comments and also preserves all existing
behaviour like for example can_nocow_extent() treating errors from the
calls to btrfs_cross_ref_exist() and csum_exist_in_range() as meaning
we can not NOCOW, instead of propagating the error back to the caller.
That specific behaviour is questionable but also reasonable to some
degree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Sweet Tea Dorminy
395cb57e85 btrfs: wait between incomplete batch memory allocations
When allocating memory in a loop, each iteration should call
memalloc_retry_wait() in order to prevent starving memory-freeing
processes (and to mark where allocation loops are). Other filesystems do
that as well.

The bulk page allocation is the only place in btrfs with an allocation
retry loop, so add an appropriate call to it.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Sweet Tea Dorminy
91d6ac1d62 btrfs: allocate page arrays using bulk page allocator
While calling alloc_page() in a loop is an effective way to populate an
array of pages, the MM subsystem provides a method to allocate pages in
bulk.  alloc_pages_bulk_array() populates the NULL slots in a page
array, trying to grab more than one page at a time.

Unfortunately, it doesn't guarantee allocating all slots in the array,
but it's easy to call it in a loop and return an error if no progress
occurs.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Sweet Tea Dorminy
dd137dd1f2 btrfs: factor out allocating an array of pages
Several functions currently populate an array of page pointers one
allocated page at a time. Factor out the common code so as to allow
improvements to all of the sites at once.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Yu Zhe
0d031dc4aa btrfs: remove unnecessary type casts
Explicit type casts are not necessary when it's void* to another pointer
type.

Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Qu Wenruo
1a42daab11 btrfs: expand subpage support to any PAGE_SIZE > 4K
With the recent change in metadata handling, we can handle metadata in
the following cases:

- nodesize < PAGE_SIZE and sectorsize < PAGE_SIZE
  Go subpage routine for both metadata and data.

- nodesize < PAGE_SIZE and sectorsize >= PAGE_SIZE
  Invalid case for now. As we require nodesize >= sectorsize.

- nodesize >= PAGE_SIZE and sectorsize < PAGE_SIZE
  Go subpage routine for data, but regular page routine for metadata.

- nodesize >= PAGE_SIZE and sectorsize >= PAGE_SIZE
  Go regular page routine for both metadata and data.

Now we can handle any sectorsize < PAGE_SIZE, plus the existing
sectorsize == PAGE_SIZE support.

But here we introduce an artificial limit, any PAGE_SIZE > 4K case, we
will only support 4K and PAGE_SIZE as sector size.

The idea here is to reduce the test combinations, and push 4K as the
default standard in the future.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Qu Wenruo
fbca46eb46 btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine
The reason why we only support 64K page size for subpage is, for 64K
page size we can ensure no matter what the nodesize is, we can fit it
into one page.

When other page size come, especially like 16K, the limitation is a bit
limiting.

To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
non-subpage routine.  By this, we can allow 4K sectorsize on 16K page
size.

Although this introduces another smaller limitation, the metadata can
not cross page boundary, which is already met by most recent mkfs.

Another small improvement is, we can avoid the overhead for metadata if
nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.

Please note that, this patch will not yet enable other page size support
yet.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Qu Wenruo
e959d3c1df btrfs: use dummy extent buffer for super block sys chunk array read
In function btrfs_read_sys_array(), we allocate a real extent buffer
using btrfs_find_create_tree_block().

Such extent buffer will be even cached into buffer_radix tree, and using
btree inode address space.

However we only use such extent buffer to enable the accessors, thus we
don't even need to bother using real extent buffer, a dummy one is
what we really need.

And for dummy extent buffer, we no longer need to do any special
handling for the first page, as subpage helper is already doing it
properly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:10 +02:00
Naohiro Aota
0320b3538b btrfs: assert that relocation is protected with sb_start_write()
Relocation of a data block group creates ordered extents. They can cause
a hang when a process is trying to thaw the filesystem.

We should have called sb_start_write(), so the filesystem is not being
frozen. Add an ASSERT to check it is protected.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:10 +02:00
Naohiro Aota
7f8d12ea96 fs: add a lockdep check function for sb_start_write()
Add a function sb_write_started() to allow callers to verify if
sb_start_write() is properly called. It will be used for assertion in
btrfs.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:10 +02:00
Nikolay Borisov
d864546231 btrfs: simplify code flow in btrfs_ioctl_balance
Move code in btrfs_ioctl_balance to simplify its flow. This is
possible thanks to the removal of balance v1 ioctl and ensuring 'arg'
argument is always present. First move the code duplicating the
userspace arg to the kernel 'barg'. This makes the out_unlock label
redundant.  Secondly, check the validity of bargs::flags before copying
to the dynamically allocated 'bctl'. This removes the need for the
out_bctl label.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:10 +02:00
Nikolay Borisov
398646011e btrfs: remove checks for arg argument in btrfs_ioctl_balance
With the removal of balance v1 ioctl the 'arg' argument is guaranteed to
be present so simply remove all conditional code which checks for its
presence.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:10 +02:00
Qu Wenruo
b06660b595 btrfs: replace memset with memzero_page in data checksum verification
The original code resets the page to 0x1 for not apparent reason, it's
been like that since the initial 2007 code added in commit 07157aacb1
("Btrfs: Add file data csums back in via hooks in the extent map code").

It could mean that a failed buffer can be detected from the data but
that's just a guess and any value is good.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:10 +02:00
Filipe Manana
d4135134ab btrfs: avoid blocking on space revervation when doing nowait dio writes
When doing a NOWAIT direct IO write, if we can NOCOW then it means we can
proceed with the non-blocking, NOWAIT path. However reserving the metadata
space and qgroup meta space can often result in blocking - flushing
delalloc, wait for ordered extents to complete, trigger transaction
commits, etc, going against the semantics of a NOWAIT write.

So make the NOWAIT write path to try to reserve all the metadata it needs
without resulting in a blocking behaviour - if we get -ENOSPC or -EDQUOT
then return -EAGAIN to make the caller fallback to a blocking direct IO
write.

This is part of a patchset comprised of the following patches:

  btrfs: avoid blocking on page locks with nowait dio on compressed range
  btrfs: avoid blocking nowait dio when locking file range
  btrfs: avoid double nocow check when doing nowait dio writes
  btrfs: stop allocating a path when checking if cross reference exists
  btrfs: free path at can_nocow_extent() before checking for checksum items
  btrfs: release path earlier at can_nocow_extent()
  btrfs: avoid blocking when allocating context for nowait dio read/write
  btrfs: avoid blocking on space revervation when doing nowait dio writes

The following test was run before and after applying this patchset:

  $ cat io-uring-nodatacow-test.sh
  #!/bin/bash

  DEV=/dev/sdc
  MNT=/mnt/sdc

  MOUNT_OPTIONS="-o ssd -o nodatacow"
  MKFS_OPTIONS="-R free-space-tree -O no-holes"

  NUM_JOBS=4
  FILE_SIZE=8G
  RUN_TIME=300

  cat <<EOF > /tmp/fio-job.ini
  [io_uring_rw]
  rw=randrw
  fsync=0
  fallocate=posix
  group_reporting=1
  direct=1
  ioengine=io_uring
  iodepth=64
  bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
  filesize=$FILE_SIZE
  runtime=$RUN_TIME
  time_based
  filename=foobar
  directory=$MNT
  numjobs=$NUM_JOBS
  thread
  EOF

  echo performance | \
     tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  umount $MNT &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
  mount $MOUNT_OPTIONS $DEV $MNT

  fio /tmp/fio-job.ini

  umount $MNT

The test was run a 12 cores box with 64G of ram, using a non-debug kernel
config (Debian's default config) and a spinning disk.

Result before the patchset:

 READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
WRITE: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec

Result after the patchset:

 READ: bw=436MiB/s (457MB/s), 436MiB/s-436MiB/s (457MB/s-457MB/s), io=128GiB (137GB), run=300044-300044msec
WRITE: bw=435MiB/s (456MB/s), 435MiB/s-435MiB/s (456MB/s-456MB/s), io=128GiB (137GB), run=300044-300044msec

That's about +7.2% throughput for reads and +6.9% for writes.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:10 +02:00
Filipe Manana
4f208dcc6b btrfs: avoid blocking when allocating context for nowait dio read/write
When doing a NOWAIT direct IO read/write, we allocate a context object
(struct btrfs_dio_data) with GFP_NOFS, which can result in blocking
waiting for memory allocation (GFP_NOFS is __GFP_RECLAIM | __GFP_IO).
This is undesirable for the NOWAIT semantics, so do the allocation with
GFP_NOWAIT if we are serving a NOWAIT request and if the allocation fails
return -EAGAIN, so that the caller can fallback to a blocking context and
retry with a non-blocking write.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:10 +02:00
Filipe Manana
59d35c5171 btrfs: release path earlier at can_nocow_extent()
At can_nocow_extent(), we are releasing the path only after checking if
the block group that has the target extent is read only, and after
checking if there's delalloc in the range in case our extent is a
preallocated extent. The read only extent check can be expensive if we
have a very large filesystem with many block groups, as well as the
check for delalloc in the inode's io_tree in case the io_tree is big
due to IO on other file ranges.

Our path is holding a read lock on a leaf and there's no need to keep
the lock while doing those two checks, so release the path before doing
them, immediately after the last use of the leaf.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:10 +02:00
Filipe Manana
c1a548db25 btrfs: free path at can_nocow_extent() before checking for checksum items
When we look for checksum items, through csum_exist_in_range(), at
can_nocow_extent(), we no longer need the path that we have previously
allocated. Through csum_exist_in_range() -> btrfs_lookup_csums_range(),
we also end up allocating a path, so we are adding unnecessary extra
memory usage. So free the path before calling csum_exist_in_range().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:10 +02:00
Filipe Manana
1a89f17386 btrfs: stop allocating a path when checking if cross reference exists
At btrfs_cross_ref_exist() we always allocate a path, but we really don't
need to because all its callers (only 2) already have an allocated path
that is not being used when they call btrfs_cross_ref_exist(). So change
btrfs_cross_ref_exist() to take a path as an argument and update both
its callers to pass in the unused path they have when they call
btrfs_cross_ref_exist().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:10 +02:00