At need_log_inode() we directly check the ->logged_trans field of the
given inode to check if it was previously logged in the transaction, with
the goal of skipping logging the inode again when it's not necessary.
The ->logged_trans field in not persisted in the inode item or elsewhere,
it's only stored in memory (struct btrfs_inode), so it's transient and
lost once the inode is evicted and then loaded again. Once an inode is
loaded, we are conservative and set ->logged_trans to 0, which may mean
that either the inode was never logged in the current transaction or it
was logged but evicted before being loaded again.
Instead of checking the inode's ->logged_trans field directly, we can
use instead the helper inode_logged(), which will really check if the
inode was logged before in the current transaction in case we have a
->logged_trans field with a value of 0. This will prevent unnecessarily
logging an inode when it's not needed, and in some cases preventing a
transaction commit, in case the logging requires a fallback to a
transaction commit. The following test script shows a scenario where
due to eviction we fallback a transaction commit when trying to fsync
a file that was renamed:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
num_init_files=10000
num_new_files=10000
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $num_init_files; i++)); do
echo -n > $MNT/testdir/file_$i
done
echo -n > $MNT/testdir/foo
sync
# Add some files so that there's more work in the transaction other
# than just renaming file foo.
for ((i = 1; i <= $num_new_files; i++)); do
echo -n > $MNT/testdir/new_file_$i
done
# Fsync the directory first.
xfs_io -c "fsync" $MNT/testdir
# Rename file foo.
mv $MNT/testdir/foo $MNT/testdir/bar
# Now trigger eviction of the test directory's inode.
# Once loaded again, it will have logged_trans set to 0 and
# last_unlink_trans set to the current transaction.
echo 2 > /proc/sys/vm/drop_caches
# Fsync file bar (ex-foo).
# Before the patch the fsync would result in a transaction commit
# because the inode for file bar has last_unlink_trans set to the
# current transaction, so it will attempt to log the parent directory
# as well, which will fallback to a full transaction commit because
# it also has its last_unlink_trans set to the current transaction,
# due to the inode eviction.
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir/bar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "file fsync took: $dur milliseconds"
umount $MNT
Before this patch: fsync took 22 milliseconds
After this patch: fsync took 8 milliseconds
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Smatch reports the following errors related to commit ("btrfs: output
affected files when relocation fails"):
fs/btrfs/inode.c:283 print_data_reloc_error()
error: uninitialized symbol 'ref_level'.
[CAUSE]
That part of code is mostly copied from scrub, but unfortunately scrub
code from the beginning is not doing the error handling properly.
The offending code looks like this:
do {
ret = tree_backref_for_extent();
btrfs_warn_rl();
} while (ret != 1);
There are several problems involved:
- No error handling
If that tree_backref_for_extent() failed, we would output the same
error again and again, never really exit as it requires ret == 1 to
exit.
- Always do one extra output
As tree_backref_for_extent() only return > 0 if there is no more
backref item.
This means after the last item we hit, we would output an invalid
error message for ret > 0 case.
[FIX]
Fix the old code by:
- Move @ref_root and @ref_level into the if branch
And do not initialize them, so we can catch such uninitialized values
just like what we do in the inode.c
- Explicitly check the return value of tree_backref_for_extent()
And handle ret < 0 and ret > 0 cases properly.
- No more do {} while () loop
Instead go while (true) {} loop since we will handle @ret manually.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When btrfs_redirty_list_add redirties a buffer, it also acquires
an extra reference that is released on transaction commit. But
this is not required as buffers that are dirty or under writeback
are never freed (look for calls to extent_buffer_under_io())).
Remove the extra reference and the infrastructure used to drop it
again.
History behind redirty logic:
In the first place, it used releasing_list to hold all the
to-be-released extent buffers, and decided which buffers to re-dirty at
the commit time. Then, in a later version, the behaviour got changed to
re-dirty a necessary buffer and add re-dirtied one to the list in
btrfs_free_tree_block(). In short, the list was there mostly for the
patch series' historical reason.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ add Naohiro's comment regarding history ]
Signed-off-by: David Sterba <dsterba@suse.com>
dirty_metadata_bytes is decremented in both places that clear the dirty
bit in a buffer, but only incremented in btrfs_mark_buffer_dirty, which
means that a buffer that is redirtied using btrfs_redirty_list_add won't
be added to dirty_metadata_bytes, but it will be subtracted when written
out, leading an inconsistency in the counter.
Move the dirty_metadata_bytes from btrfs_mark_buffer_dirty into
set_extent_buffer_dirty to also account for the redirty case, and remove
the now unused set_extent_buffer_dirty return value.
Fixes: d3575156f6 ("btrfs: zoned: redirty released extent buffers")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
This exists internal to ctree.c, however btrfs check needs to use it for
some of its operations. I'd rather not duplicate that code inside of
btrfs check as this is low level and I want to keep this code in one
place, so rename the function to btrfs_del_ptr and export it so that it
can be used inside of btrfs-progs safely. Add a comment to make sure
this doesn't get removed by a future cleanup.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is needed in btrfs-progs for the tools that convert the checksum
types for file systems and a few other things. We don't have it in the
kernel as we just want to get the size for the super blocks type.
However I don't want to have to manually add this every time we sync
ctree.c into btrfs-progs, so add the helper in the kernel with a note so
it doesn't get removed by a later cleanup.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We want to override this in btrfs-progs, so wrap this in the __KERNEL__
check so we can easily sync this to btrfs-progs and have our local
version of btrfs_no_printk do the work.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These are more related to the inode item flags on disk than the
in-memory btrfs_inode, move the helpers to inode-item.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is more a buffer validation helper, move it into the tree-checker
files where it makes more sense.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This helper returns a btrfs_tree_block_status for the various errors,
and then btrfs_check_node() will return -EUCLEAN if it gets anything
other than BTRFS_TREE_BLOCK_CLEAN which will be used by the kernel. In
the future btrfs-progs will use this helper instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of blanket returning -EUCLEAN for all the failures in
btrfs_check_leaf, use btrfs_tree_block_status and return the appropriate
status for each failure. Rename the helper to __btrfs_check_leaf and
then make a wrapper of btrfs_check_leaf that will return -EUCLEAN to
non-clean error codes. This will allow us to have the
__btrfs_check_leaf variant in btrfs-progs while keeping the behavior in
the kernel consistent.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a variety of item specific errors that can occur. For now
simply put these under the umbrella of BTRFS_TREE_BLOCK_INVALID_ITEM,
this can be fleshed out as we need in the future.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use this in btrfs-progs to determine if we can fix different types of
corruptions. We don't care about this in the kernel, however it would
be good to share this code between the kernel and btrfs-progs, so add
the status definitions so we can start converting the tree-checker code
over to using these status flags instead of blanket returning -EUCLEAN.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have two helpers for checking leaves, because we have an extra check
for debugging in btrfs_mark_buffer_dirty(), and at that stage we may
have item data that isn't consistent yet. However we can handle this
case internally in the helper, if BTRFS_HEADER_FLAG_WRITTEN is set we
know the buffer should be internally consistent, otherwise we need to
skip checking the item data.
Simplify this helper down a single helper and handle the item data
checking logic internally to the helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We just pass in btrfs_header_level(eb) for the level, and we're passing
in the eb already, so simply get the level from the eb inside of
btrfs_set_block_flags.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is completely related to block rsv's, move it out of the free space
cache code and into block-rsv.c.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For P/Q stripe scrub, we have quite some duplicated read IO:
- Data stripes read for verification
This is triggered by the scrub_submit_initial_read() inside
scrub_raid56_parity_stripe().
- Data stripes read (again) for P/Q stripe verification
This is triggered by scrub_assemble_read_bios() from scrub_rbio().
Although we can have hit rbio cache and avoid unnecessary read, the
chance is very low, as scrub would easily flush the whole rbio cache.
This means, even we're just scrubbing a single P/Q stripe, we would read
the data stripes twice for the best case scenario. If we need to
recover some data stripes, it would cause more reads on the same data
stripes, again and again.
However before we call raid56_parity_submit_scrub_rbio() we already
have all data stripes repaired and their contents ready to use.
But RAID56 cache is unaware about the scrub cache, thus RAID56 layer
itself still needs to re-read the data stripes.
To avoid such cache miss, this patch would:
- Introduce a new helper, raid56_parity_cache_data_pages()
This function would grab the pages from an array, and copy the content
to the rbio, marking all the involved sectors uptodate.
The page copy is unavoidable because of the cache pages of rbio are all
self managed, thus can not utilize outside pages without screwing up
the lifespan.
- Use the repaired data stripes as cache inside
scrub_raid56_parity_stripe()
By this, we ensure all the data sectors of the scrub rbio are already
uptodate, and no need to read them again from disk.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Removing a free space entry from an in memory space cache requires having
the corresponding btrfs_free_space_ctl's 'tree_lock' held. We have several
code paths that remove an entry, so add assertions where appropriate to
verify we are holding the lock, as the lock is acquired by some other
function up in the call chain, which makes it easy to miss in the future.
Note: for this to work we need to lock the local btrfs_free_space_ctl at
load_free_space_cache(), which was not being done because it's local,
declared on the stack, so no other task has access to it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When linking a free space entry, at link_free_space(), the caller should
be holding the spinlock 'tree_lock' of the given btrfs_free_space_ctl
argument, which is necessary for manipulating the red black tree of free
space entries (done by tree_insert_offset(), which already asserts the
lock is held) and for manipulating the 'free_space', 'free_extents',
'discardable_extents' and 'discardable_bytes' counters of the given
struct btrfs_free_space_ctl.
So assert that the spinlock 'tree_lock' of the given btrfs_free_space_ctl
is held by the current task. We have multiple code paths that end up
calling link_free_space(), and all currently take the lock before calling
it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When searching for a free space entry by offset, at tree_search_offset(),
we are supposed to have the btrfs_free_space_ctl's 'tree_lock' held, so
assert that. We have multiple callers of tree_search_offset(), and all
currently hold the necessary lock before calling it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are multiple code paths leading to tree_insert_offset(), and each
path takes the necessary locks before tree_insert_offset() is called,
since they do other things that require those locks to be held. This makes
it easy to miss the locking somewhere, so make tree_insert_offset() assert
that the required locks are being held by the calling task.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For the in-memory component of space caching (free space cache and free
space tree), three of the arguments passed to tree_insert_offset() can
always be taken from the new free space entry that we are about to add.
So simplify tree_insert_offset() to take the new entry instead of the
'offset', 'node' and 'bitmap' arguments. This will also allow to make
further changes simpler.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The are two computations of end offsets at do_trimming() that are not
necessary, as they were previously computed and stored in local const
variables. So just use the variables instead, to make the source code
shorter and easier to read.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At try_merge_free_space(), avoid calling twice rb_prev() to find the
previous node, as that requires looping through the red black tree, so
store the result of the rb_prev() call and then use it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At copy_free_space_cache(), we add a new entry to the block group's ctl
before we free the entry from the temporary ctl. Adding a new entry
requires the allocation of a new struct btrfs_free_space, so we can
avoid a temporary extra allocation by freeing the entry from the
temporary ctl before we add a new entry to the main ctl, which possibly
also reduces the chances for a memory allocation failure in case of very
high memory pressure. So just do that.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A small code simplification, move the default value of transid to its
initialization and remove the else-statement.
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[PROBLEM]
When relocation fails (mostly due to checksum mismatch), we only got
very cryptic error messages like:
BTRFS info (device dm-4): relocating block group 13631488 flags data
BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 csum 0x373e1ae3 expected csum 0x98757625 mirror 1
BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS info (device dm-4): balance: ended with status: -5
The end user has to decipher the above messages and use various tools to
locate the affected files and find a way to fix the problem (mostly
deleting the file). This is not an easy work even for experienced
developer, not to mention the end users.
[SCRUB IS DOING BETTER]
By contrast, scrub is providing much better error messages:
BTRFS error (device dm-4): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
BTRFS warning (device dm-4): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
BTRFS info (device dm-4): scrub: finished on devid 1 with status: 0
Which provides the affected files directly to the end user.
[IMPROVEMENT]
Instead of the generic data checksum error messages, which is not doing
a good job for data reloc inodes, this patch introduce a scrub like
backref walking based solution.
When a sector fails its checksum for data reloc inode, we go the
following workflow:
- Get the real logical bytenr
For data reloc inode, the file offset is the offset inside the block
group.
Thus the real logical bytenr is @file_off + @block_group->start.
- Do an extent type check
If it's tree blocks it's much easier to handle, just go through
all the tree block backref.
- Do a backref walk and inode path resolution for data extents
This is mostly the same as scrub.
But unfortunately we can not reuse the same function as the output
format is different.
Now the new output would be more user friendly:
BTRFS info (device dm-4): relocating block group 13631488 flags data
BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 logical 13631488 csum 0x373e1ae3 expected csum 0x98757625 mirror 1
BTRFS warning (device dm-4): checksum error at logical 13631488 mirror 1 root 5 inode 257 offset 0 length 4096 links 1 (path: file)
BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
BTRFS info (device dm-4): balance: ended with status: -5
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that btrfs_wq_submit_bio is never called for synchronous I/O,
the hipri_workers workqueue is not used anymore and can be removed.
Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The writeback_control structure already passes down the information about
a writeback being synchronous from the core VM code, and thus information
is propagated into the bio REQ_SYNC flag through the wbc_to_write_flags
helper.
Use that information to decide if checksums calculation is offloaded to
a workqueue instead of btrfs_inode::sync_writers field that not only
bloats the inode but also has too wide scope, being inode wide instead
of limited to the actual writeback request.
The sync writes were set in:
- btrfs_do_write_iter - regular IO, sync status is set
- start_ordered_ops - ordered write start, writeback with WB_SYNC_ALL
mode
- btrfs_write_marked_extents - write marked extents, writeback with
WB_SYNC_ALL mode
Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Most modern hardware supports very fast accelerated crc32c calculation.
If that is supported the CPU overhead of the checksum calculation is
very limited, and offloading the calculation to special worker threads
has a lot of overhead for no gain.
E.g. on an Intel Optane device is actually very much slows down even
1M buffered writes with fio:
Unpatched:
write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets
With synchronous CRCs:
write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets
With a lot of variation during the unpatched run going down as low as
1100MB/s, while the synchronous CRC version has about the same peak write
speed but much lower dips, and fewer kworkers churning around.
Both tests had fio saturated at 100% CPU.
(thanks to Jens Axboe via Chris Mason for the benchmarking)
Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using SECTOR_SHIFT to convert LBA to physical address makes it more
readable.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use SECTOR_SHIFT while converting a physical address to an LBA, makes
it more readable.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Improve the leaf dump behavior by:
- Always dump the leaf first, then the error message
- Output the slot number if possible
Especially in __btrfs_free_extent() the leaf dump of extent tree can
be pretty large.
With an extra slot number it's much easier to locate the problem.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since print-tree infrastructure only prints the content of a tree block,
we can make them to accept const extent buffer pointer.
This removes a forced type convert in extent-tree, where we convert a
const extent buffer pointer to regular one, just to avoid compiler
warning.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
bitmap_test_range_all_{set,zero} defined in subpage.c are useful for other
components. Move them to misc.h and use them in zoned.c. Also, as
find_next{,_zero}_bit take/return "unsigned long" instead of "unsigned
int", convert the type to "unsigned long".
While at it, also rewrite the "if (...) return true; else return false;"
pattern and add const to the input bitmap.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When checking siblings keys, before moving keys from one node/leaf to a
sibling node/leaf, it's very unexpected to have the last key of the left
sibling greater than or equals to the first key of the right sibling, as
that means we have a (serious) corruption that breaks the key ordering
properties of a b+tree. Since this is unexpected, surround the comparison
with the unlikely macro, which helps the compiler generate better code
for the most expected case (no existing b+tree corruption). This is also
what we do for other unexpected cases of invalid key ordering (like at
btrfs_set_item_key_safe()).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_free_device() is never used outside of volumes.c, so
make it static and remove its prototype declaration at volumes.h.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Recently a Meta-internal workload encountered subvolume creation taking
up to 2s each, significantly slower than directory creation. As they
were hoping to be able to use subvolumes instead of directories, and
were looking to create hundreds, this was a significant issue. After
Josef investigated, it turned out to be due to the transaction commit
currently performed at the end of subvolume creation.
This change improves the workload by not doing transaction commit for every
subvolume creation, and merely requiring a transaction commit on fsync.
In the worst case, of doing a subvolume create and fsync in a loop, this
should require an equal amount of time to the current scheme; and in the
best case, the internal workload creating hundreds of subvolumes before
fsyncing is greatly improved.
While it would be nice to be able to use the log tree and use the normal
fsync path, log tree replay can't deal with new subvolume inodes
presently.
It's possible that there's some reason that the transaction commit is
necessary for correctness during subvolume creation; however,
git logs indicate that the commit dates back to the beginning of
subvolume creation, and there are no notes on why it would be necessary.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_prev_leaf() is not used outside ctree.c, so there's no need to
export it at ctree.h - just make it static at ctree.c and move its
definition above btrfs_search_slot_for_read(), since that function
calls it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull SCSI fixes from James Bottomley:
"Four fixes, all in drivers: three fairly obvious small ones and a
large one in aacraid to add block queue completion mapping and fix a
CPU offline hang"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: lpfc: Fix incorrect big endian type assignment in bsg loopback path
scsi: target: core: Fix error path in target_setup_session()
scsi: storvsc: Always set no_report_opcodes
scsi: aacraid: Reply queue mapping to CPUs based on IRQ affinity
Pull ata fix from Damien Le Moal:
- Avoid deadlocks on resume from sleep by delaying scsi rescan until
the scsi device is also fully resumed.
* tag 'ata-6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
ata: libata-scsi: Avoid deadlock on rescan after device resume
Pull parisc fix from Helge Deller:
- Drop redundant register definitions to fix build with latest binutils
* tag 'parisc-for-6.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Delete redundant register definitions in <asm/assembly.h>
The error unrolling was leaving the VMAs detached in many cases and
leaving the locked_vm statistic altered, and skipping the unrolling
entirely in the case of the vma tree write failing.
Fix the error path by re-attaching the detached VMAs and adding the
necessary goto for the failed vma tree write, and fix the locked_vm
statistic by only updating after the vma tree write succeeds.
Fixes: 763ecb0350 ("mm: remove the vma linked list")
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When an ATA port is resumed from sleep, the port is reset and a power
management request issued to libata EH to reset the port and rescanning
the device(s) attached to the port. Device rescanning is done by
scheduling an ata_scsi_dev_rescan() work, which will execute
scsi_rescan_device().
However, scsi_rescan_device() takes the generic device lock, which is
also taken by dpm_resume() when the SCSI device is resumed as well. If
a device rescan execution starts before the completion of the SCSI
device resume, the rcu locking used to refresh the cached VPD pages of
the device, combined with the generic device locking from
scsi_rescan_device() and from dpm_resume() can cause a deadlock.
Avoid this situation by changing struct ata_port scsi_rescan_task to be
a delayed work instead of a simple work_struct. ata_scsi_dev_rescan() is
modified to check if the SCSI device associated with the ATA device that
must be rescanned is not suspended. If the SCSI device is still
suspended, ata_scsi_dev_rescan() returns early and reschedule itself for
execution after an arbitrary delay of 5ms.
Reported-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Reported-by: Joe Breuer <linux-kernel@jmbreuer.net>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217530
Fixes: a19a93e4c6 ("scsi: core: pm: Rely on the device driver core for async power management")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Tested-by: Joe Breuer <linux-kernel@jmbreuer.net>
Pull staging driver fix from Greg KH:
"Here is a single staging driver "fix" for 6.4-rc7. I've been sitting
on it in my tree for many weeks as it is just a simple documentation
update, with the hope that maybe some other staging driver fixes would
need to be merged for 6.4-final, but that does not seem to be the
case.
So please, pull in this one documentation update so that Aaro doesn't
get emails going forward that he can't do anything about"
* tag 'staging-6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: octeon: delete my name from TODO contact
Pull USB / Thunderbolt fixes from Greg KH:
"Here are some small USB and Thunderbolt driver fixes and new device
ids for 6.4-rc7 to resolve some reported problems. Included in here
are:
- new USB serial device ids
- USB gadget core fixes for long-dissussed problems
- dwc3 bugfixes for reported issues.
- typec driver fixes
- thunderbolt driver fixes
All of these have been in linux-next this week with no reported issues"
* tag 'usb-6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: gadget: udc: core: Prevent soft_connect_store() race
usb: gadget: udc: core: Offload usb_udc_vbus_handler processing
usb: typec: Fix fast_role_swap_current show function
usb: typec: ucsi: Fix command cancellation
USB: dwc3: fix use-after-free on core driver unbind
USB: dwc3: qcom: fix NULL-deref on suspend
usb: dwc3: gadget: Reset num TRBs before giving back the request
usb: gadget: udc: renesas_usb3: Fix RZ/V2M {modprobe,bind} error
USB: serial: option: add Quectel EM061KGL series
thunderbolt: Mask ring interrupt on Intel hardware as well
thunderbolt: Do not touch CL state configuration during discovery
thunderbolt: Increase DisplayPort Connection Manager handshake timeout
thunderbolt: dma_test: Use correct value for absent rings when creating paths
Pull serial driver fixes from Greg KH:
"Here are two small serial driver fixes for 6.4-rc7 that resolve some
reported problems:
- lantiq serial driver irq fix
- fsl_lpuart serial driver watermark fix
Both of these have been in linux-next this week with no reported issues"
* tag 'tty-6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
tty: serial: fsl_lpuart: reduce RX watermark to 0 on LS1028A
serial: lantiq: add missing interrupt ack