Commit Graph

1369460 Commits

Author SHA1 Message Date
Naohiro Aota
62be7afcc1 btrfs: zoned: requeue to unused block group list if zone finish failed
btrfs_zone_finish() can fail for several reason. If it is -EAGAIN, we need
to try it again later. So, put the block group to the retry list properly.

Failing to do so will keep the removable block group intact until remount
and can causes unnecessary ENOSPC.

Fixes: 74e91b12b1 ("btrfs: zoned: zone finish unused block group")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:23 +02:00
Naohiro Aota
3061801420 btrfs: zoned: do not remove unwritten non-data block group
There are some reports of "unable to find chunk map for logical 2147483648
length 16384" error message appears in dmesg. This means some IOs are
occurring after a block group is removed.

When a metadata tree node is cleaned on a zoned setup, we keep that node
still dirty and write it out not to create a write hole. However, this can
make a block group's used bytes == 0 while there is a dirty region left.

Such an unused block group is moved into the unused_bg list and processed
for removal. When the removal succeeds, the block group is removed from the
transaction->dirty_bgs list, so the unused dirty nodes in the block group
are not sent at the transaction commit time. It will be written at some
later time e.g, sync or umount, and causes "unable to find chunk map"
errors.

This can happen relatively easy on SMR whose zone size is 256MB. However,
calling do_zone_finish() on such block group returns -EAGAIN and keep that
block group intact, which is why the issue is hidden until now.

Fixes: afba2bc036 ("btrfs: zoned: implement active zone tracking")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:23 +02:00
Filipe Manana
d6be378de0 btrfs: remove btrfs_clear_extent_bits()
It's just a simple wrapper around btrfs_clear_extent_bit() that passes a
NULL for its last argument (a cached extent state record), plus there is
not counter part - we have a btrfs_set_extent_bit() but we do not have a
btrfs_set_extent_bits() (plural version). So just remove it and make all
callers use btrfs_clear_extent_bit() directly.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:22 +02:00
Filipe Manana
279b4db10e btrfs: use cached state when falling back from NOCoW write to CoW write
We have a cached extent state record from the previous extent locking so
we can use when setting the EXTENT_NORESERVE in the range, allowing the
operation to be faster if the extent io tree is relatively large.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:22 +02:00
Filipe Manana
bfc9d71aa4 btrfs: set EXTENT_NORESERVE before range unlock in btrfs_truncate_block()
Set the EXTENT_NORESERVE bit in the io tree before unlocking the range so
that we can use the cached state and speedup the operation, since the
unlock operation releases the cached state.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:22 +02:00
Johannes Thumshirn
5ae011bcbb btrfs: don't print relocation messages from auto reclaim
When BTRFS is doing automatic block-group reclaim, it is spamming the
kernel log messages a lot.

Add a 'verbose' parameter to btrfs_relocate_chunk() and
btrfs_relocate_block_group() to control the verbosity of these log
message. This way the old behaviour of printing log messages on a
user-space initiated balance operation can be kept while excessive log
spamming due to auto reclaim is mitigated.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:22 +02:00
Johannes Thumshirn
a507904090 btrfs: remove redundant auto reclaim log message
Remove the log message before reclaiming a chunk in
btrfs_reclaim_bgs_work(). Especially with automatic block-group
reclaiming these messages spam the kernel log.

Note there is also a tracepoint for the same condition to ease debugging.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:22 +02:00
Filipe Manana
240fafaa44 btrfs: make btrfs_check_nocow_lock() check more than one extent
Currently btrfs_check_nocow_lock() stops at the first extent it finds and
that extent may be smaller than the target range we want to NOCOW into.
But we can have multiple consecutive extents which we can NOCOW into, so
by stopping at the first one we find we just make the caller do more work
by splitting the write into multiple ones, or in the case of mmap writes
with large folios we fail with -ENOSPC in case the folio's range is
covered by more than one extent (the fallback to NOCOW for mmap writes in
case there's no available data space to reserve/allocate was recently
added by the patch "btrfs: fix -ENOSPC mmap write failure on NOCOW
files/extents").

Improve on this by checking for multiple consecutive extents.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:22 +02:00
Filipe Manana
68e0fcc361 btrfs: assert we can NOCOW the range in btrfs_truncate_block()
We call btrfs_check_nocow_lock() to see if we can NOCOW a block sized
range but we don't check later if we can NOCOW the whole range.
It's unexpected to be able to NOCOW a range smaller than blocksize, so
add an assertion to check the NOCOW range matches the blocksize.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:22 +02:00
Filipe Manana
c6482cff95 btrfs: update function comment for btrfs_check_nocow_lock()
The documentation for the @nowait parameter is missing, so add it.
The @nowait parameter was added in commit 80f9d24130 ("btrfs: make
btrfs_check_nocow_lock nowait compatible"), which forgot to update the
function comment.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:22 +02:00
Filipe Manana
601ea9c42a btrfs: use btrfs_inode local variable at btrfs_page_mkwrite()
Most of the time we want to use the btrfs_inode, so change the local inode
variable to be a btrfs_inode instead of a VFS inode, reducing verbosity
by eliminating a lot of BTRFS_I() calls.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:22 +02:00
Filipe Manana
11ad7983c2 btrfs: use variable for io_tree when clearing range in btrfs_page_mkwrite()
We have the inode's io_tree already stored in a local variable, so use it
instead of grabbing it again in the call to btrfs_clear_extent_bit().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:22 +02:00
Filipe Manana
6599716de2 btrfs: fix -ENOSPC mmap write failure on NOCOW files/extents
If we attempt a mmap write into a NOCOW file or a prealloc extent when
there is no more available data space (or unallocated space to allocate a
new data block group) and we can do a NOCOW write (there are no reflinks
for the target extent or snapshots), we always fail due to -ENOSPC, unlike
for the regular buffered write and direct IO paths where we check that we
can do a NOCOW write in case we can't reserve data space.

Simple reproducer:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  umount $DEV &> /dev/null
  mkfs.btrfs -f -b $((512 * 1024 * 1024)) $DEV
  mount $DEV $MNT

  touch $MNT/foobar
  # Make it a NOCOW file.
  chattr +C $MNT/foobar

  # Add initial data to file.
  xfs_io -c "pwrite -S 0xab 0 1M" $MNT/foobar

  # Fill all the remaining data space and unallocated space with data.
  dd if=/dev/zero of=$MNT/filler bs=4K &> /dev/null

  # Overwrite the file with a mmap write. Should succeed.
  xfs_io -c "mmap -w 0 1M"        \
         -c "mwrite -S 0xcd 0 1M" \
         -c "munmap"              \
         $MNT/foobar

  # Unmount, mount again and verify the new data was persisted.
  umount $MNT
  mount $DEV $MNT

  od -A d -t x1 $MNT/foobar

  umount $MNT

Running this:

  $ ./test.sh
  (...)
  wrote 1048576/1048576 bytes at offset 0
  1 MiB, 256 ops; 0.0008 sec (1.188 GiB/sec and 311435.5231 ops/sec)
  ./test.sh: line 24: 234865 Bus error               xfs_io -c "mmap -w 0 1M" -c "mwrite -S 0xcd 0 1M" -c "munmap" $MNT/foobar
  0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
  *
  1048576

Fix this by not failing in case we can't allocate data space and we can
NOCOW into the target extent - reserving only metadata space in this case.

After this change the test passes:

  $ ./test.sh
  (...)
  wrote 1048576/1048576 bytes at offset 0
  1 MiB, 256 ops; 0.0007 sec (1.262 GiB/sec and 330749.3540 ops/sec)
  0000000 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
  *
  1048576

A test case for fstests will be added soon.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:22 +02:00
David Sterba
e8d2e254dc btrfs: use clear_and_wake_up_bit() where open coded
There are two cases open coding the clear and wake up pattern, we can
use the helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:21 +02:00
David Sterba
72b2b199d5 btrfs: accessors: rename variable for folio offset
There used to be 'oip' short for offset in page, which got changed
during conversion to folios, the name is a bit confusing so rename it.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:21 +02:00
David Sterba
ae80748225 btrfs: accessors: factor out split memcpy with two sources
The case of a reading the bytes from 2 folios needs two memcpy()s, the
compiler does not emit calls but two inline loops.

Factoring out the code makes some improvement (stack, code) and in the
future will provide an optimized implementation as well. (The analogical
version with two destinations is not done as it increases stack usage
but can be done if needed.)

The address of the second folio is reordered before the first memcpy,
which leads to an optimization reusing the vmemmap_base and
page_offset_base (implementing folio_address()).

Stack usage reduction:

  btrfs_get_32                                           -8 (32 -> 24)
  btrfs_get_64                                           -8 (32 -> 24)

Code size reduction:

     text    data     bss     dec     hex filename
  1454279  115665   16088 1586032  183370 pre/btrfs.ko
  1454229  115665   16088 1585982  18333e post/btrfs.ko

  DELTA: -50

As this is the last patch in this series, here's the overall diff
starting and including commit "btrfs: accessors: simplify folio bounds
checks":

Stack:

  btrfs_set_16                                          -72 (88 -> 16)
  btrfs_get_32                                          -56 (80 -> 24)
  btrfs_set_8                                           -72 (88 -> 16)
  btrfs_set_64                                          -64 (88 -> 24)
  btrfs_get_8                                           -72 (80 -> 8)
  btrfs_get_16                                          -64 (80 -> 16)
  btrfs_set_32                                          -64 (88 -> 24)
  btrfs_get_64                                          -56 (80 -> 24)

  NEW (48):
	  report_setget_bounds                           48
  LOST/NEW DELTA:      +48
  PRE/POST DELTA:     -472

Code:

     text    data     bss     dec     hex filename
  1456601  115665   16088 1588354  183c82 pre/btrfs.ko
  1454229  115665   16088 1585982  18333e post/btrfs.ko

  DELTA: -2372

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:21 +02:00
David Sterba
c8b33a57fb btrfs: accessors: set target address at initialization
The target address for the read/write can be simplified as it's the same
expression for the first folio. This improves the generated code as the
folio address does not have to be cached on stack.

Stack usage reduction:

  btrfs_set_32                                           -8 (32 -> 24)
  btrfs_set_64                                           -8 (32 -> 24)
  btrfs_get_16                                           -8 (24 -> 16)

Code size reduction:

     text    data     bss     dec     hex filename
  1454459  115665   16088 1586212  183424 pre/btrfs.ko
  1454279  115665   16088 1586032  183370 post/btrfs.ko

  DELTA: -180

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:21 +02:00
David Sterba
1ed0f75d57 btrfs: accessors: compile-time fast path for u16
Reading/writing 2 bytes (u16) may need 2 folios to be written to, each
time it's just one byte so using memcpy for that is an overkill.
Add a branch for the split case so that memcpy is now used for u32 and
u64. Another side effect is that the u16 types now don't need additional
stack space, everything fits to registers.

Stack usage is reduced:

  btrfs_get_16                                           -8 (32 -> 24)
  btrfs_set_16                                          -16 (32 -> 16)

Code size reduction:

     text    data     bss     dec     hex filename
  1454691  115665   16088 1586444  18350c pre/btrfs.ko
  1454459  115665   16088 1586212  183424 post/btrfs.ko

  DELTA: -232

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:21 +02:00
David Sterba
58383c6866 btrfs: accessors: compile-time fast path for u8
Reading/writing 1 byte (u8) is a special case compared to the others as
it's always contained in the folio we find, so the split memcpy will
never be needed. Turn it to a compile-time check that the memcpy part
can be optimized out.

The stack usage is reduced:

  btrfs_set_8                                         -16 (32 -> 16)
  btrfs_get_8                                         -16 (24 -> 8)

Code size reduction:

     text    data     bss     dec     hex filename
  1454951  115665   16088 1586704  183610 pre/btrfs.ko
  1454691  115665   16088 1586444  18350c post/btrfs.ko

  DELTA: -260

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:21 +02:00
David Sterba
d5a87dbd95 btrfs: accessors: inline eb bounds check and factor out the error report
There's a check in each set/get helper if the requested range is within
extent buffer bounds, and if it's not then report it. This was in an
ASSERT statement so with CONFIG_BTRFS_ASSERT this crashes right away, on
other configs this is only reported but reading out of the bounds is
done anyway. There are currently no known reports of this particular
condition failing.

There are some drawbacks though. The behaviour dependence on the
assertions being compiled in or not and a less visible effect of
inlining report_setget_bounds() into each helper.

As the bounds check is expected to succeed almost always it's ok to
inline it but make the report a function and move it out of the helper
completely (__cold puts it to a different section). This also skips
reading/writing the requested range in case it fails.

This improves stack usage significantly:

  btrfs_get_16                                         -48 (80 -> 32)
  btrfs_get_32                                         -48 (80 -> 32)
  btrfs_get_64                                         -48 (80 -> 32)
  btrfs_get_8                                          -48 (72 -> 24)
  btrfs_set_16                                         -56 (88 -> 32)
  btrfs_set_32                                         -56 (88 -> 32)
  btrfs_set_64                                         -56 (88 -> 32)
  btrfs_set_8                                          -48 (80 -> 32)

  NEW (48):
	  report_setget_bounds                                     48
  LOST/NEW DELTA:      +48
  PRE/POST DELTA:     -360

Same as .ko size:

     text    data     bss     dec     hex filename
  1456079  115665   16088 1587832  183a78 pre/btrfs.ko
  1454951  115665   16088 1586704  183610 post/btrfs.ko

  DELTA: -1128

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:21 +02:00
David Sterba
378c95c477 btrfs: accessors: use type sizeof constants directly
Now unit_size is used only once, so use it directly in 'part'
calculation. Don't cache sizeof(type) in a variable. While this is a
compile-time constant, forcing the type 'int' generates worse code as it
leads to additional conversion from 32 to 64 bit type on x86_64.

The sizeof() is used only a few times and it does not make the code that
harder to read, so use it directly and let the compiler utilize the
immediate constants in the context it needs. The .ko code size slightly
increases (+50) but further patches will reduce that again.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:21 +02:00
David Sterba
00c0cf8444 btrfs: accessors: simplify folio bounds checks
As we can a have non-contiguous range in the eb->folios, any item can be
straddling two folios and we need to check if it can be read in one go
or in two parts. For that there's a check which is not implemented in
the simplest way:

  offset in folio + size <= folio size

With a simple expression transformation:

  oil + size <= unit_size
        size <= unit_size - oil
    sizeof() <= part

this can be simplified and reusing existing run-time or compile-time
constants.

Add likely() annotation for this expression as this is the fast path and
compiler sometimes reorders that after the followup block with the
memcpy (observed in practice with other simplifications).

Overall effect on stack consumption:

  btrfs_get_8                                        -8 (80 -> 72)
  btrfs_set_8                                        -8 (88 -> 80)

And .ko size (due to optimizations making use of the direct constants):

     text    data     bss     dec     hex filename
  1456601  115665   16088 1588354  183c82 pre/btrfs.ko
  1456093  115665   16088 1587846  183a86 post/btrfs.ko

  DELTA: -508

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:21 +02:00
David Sterba
c76841362f btrfs: remove struct rcu_string
The only use for device name has been removed so we can kill the RCU
string API.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:21 +02:00
David Sterba
e8d58aef11 btrfs: open code RCU for device name
The RCU protected string is only used for a device name, and RCU is used
so we can print the name and eventually synchronize against the rare
device rename in device_list_add().

We don't need the whole API just for that. Open code all the helpers and
access to the string itself.

Notable change is in device_list_add() when the device name is changed,
which is the only place that can actually happen at the same time as
message prints using the device name under RCU read lock.

Previously there was kfree_rcu() which used the embedded rcu_head to
delay freeing the object depending on the RCU mechanism. Now there's
kfree_rcu_mightsleep() which does not need the rcu_head and waits for
the grace period.

Sleeping is safe in this context and as this is a rare event it won't
interfere with the rest as it's holding the device_list_mutex.

Straightforward changes:

- rcu_string_strdup -> kstrdup
- rcu_str_deref -> rcu_dereference
- drop ->str from safe contexts and use rcu_dereference_raw() so it does
  not trigger RCU validators

Historical notes:

Introduced in 606686eeac ("Btrfs: use rcu to protect device->name")
with a vague reference of the potential problem described in
https://lore.kernel.org/all/20120531155304.GF11775@ZenIV.linux.org.uk/ .

The RCU protection looks like the easiest and most lightweight way of
protecting the rare event of device rename racing device_list_add()
with a random printk() that uses the device name.

Alternatives: a spin lock would require to protect the printk
anyway, a fixed buffer for the name would be eventually wrong in case
the new name is overwritten when being printed, an array switching
pointers and cleaning them up eventually resembles RCU too much.

The cleanups up to this patch should hide special case of RCU to the
minimum that only the name needs rcu_dereference(), which can be further
cleaned up to use btrfs_dev_name().

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:21 +02:00
Daniel Vacek
f2cb97ee96 btrfs: index buffer_tree using node size
So far we've been deriving the buffer tree index using the sector size.
But each extent buffer covers multiple sectors. This makes the buffer
tree rather sparse.

For example the typical and quite common configuration uses sector size
of 4KiB and node size of 16KiB. In this case it means the buffer tree is
using up to the maximum of 25% of it's slots. Or in other words at least
75% of the tree slots are wasted as never used.

We can score significant memory savings on the required tree nodes by
indexing the tree using the node size instead. As a result far less
slots are wasted and the tree can now use up to all 100% of it's slots
this way.

Note: This works even with unaligned tree blocks as we can still get
      unique index by doing eb->start >> nodesize_shift.

Getting some stats from running fio write test, there is a bit of
variance.  The values presented in the table below are medians from 5
test runs.  The numbers are:

  - # of allocated ebs in the tree
  - # of leaf tree nodes
  - highest index in the tree (radix tree width)):

ebs / leaves / Index |   bare for-next    |      with fix
---------------------+--------------------+-------------------
	post mount   |   16 /  11 / 10e5c |   16 /  10 / 4240
	post test    | 5810 / 891 / 11cfc | 4420 / 252 / 473a
	post rm	     |  574 / 300 / 10ef0 |  540 / 163 / 46e9

In this case (10GiB filesystem) the height of the tree is still 3 levels
but the 4x width reduction is clearly visible as expected. But since the
tree is more dense we can see the 54-72% reduction of leaf nodes. That's
very close to ideal with this test. It means the tree is getting really
dense with this kind of workload.

Also, the fio results show no performance change.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:20 +02:00
Filipe Manana
2b759eea98 btrfs: send: directly return strcmp() result when comparing recorded refs
There's no point in converting the return values from strcmp() as all we
need is that it returns a negative value if the first argument is less
than the second, a positive value if it's greater and 0 if equal. We do
not have a need for -1 instead of any other negative value and no need
for 1 instead of any other positive value - that's all that rb_find()
needs and no where else we need specific negative and positive values.

So remove the intermediate local variable and checks and return directly
the result from strcmp().

This also reduces the module's text size.

Before:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1888116	 161347	  16136	2065599	 1f84bf	fs/btrfs/btrfs.ko

After:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1888052	 161347	  16136	2065535	 1f847f	fs/btrfs/btrfs.ko

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:20 +02:00
Filipe Manana
e560afc1a8 btrfs: set search_commit_root to false in iterate_inodes_from_logical()
There's no point in checking at iterate_inodes_from_logical() if the path
has search_commit_root set, the only caller never sets search_commit_root
to true and it doesn't make sense for it ever to be true for the current
use case (logical_to_ino ioctl). So stop checking for that and since the
only caller allocates the path just for it to be used by
iterate_inodes_from_logical(), move the path allocation into that function.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:20 +02:00
Filipe Manana
aee10fe4e4 btrfs: reduce size of struct tree_mod_elem
Several members are used for specific types of tree mod log operations so
they can be placed in a union in order to reduce the structure's size.

This reduces the structure size from 112 bytes to 88 bytes on x86_64,
so we can now use the kmalloc-96 slab instead of the kmalloc-128 slab.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:20 +02:00
Filipe Manana
d30b236a3e btrfs: avoid logging tree mod log elements for irrelevant extent buffers
We are logging tree mod log operations for extent buffers from any tree
but we only need to log for the extent tree and subvolume tree, since
the tree mod log is used to get a consistent view, within a transaction,
of extents and their backrefs. So it's pointless to log operations for
trees such as the csum tree, free space tree, root tree, chunk tree,
log trees, data relocation tree, etc, as these trees are not used for
backref walking and all tree mod log users are about backref walking.

So skip extent buffers that don't belong neither to the extent nor to
subvolume trees. This avoids unnecessary memory allocations and having a
larger tree mod log rbtree with nodes that are never needed.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:20 +02:00
Boris Burkov
9e9ff875e4 btrfs: use readahead_expand() on compressed extents
We recently received a report of poor performance doing sequential
buffered reads of a file with compressed extents. With bs=128k, a naive
sequential dd ran as fast on a compressed file as on an uncompressed
(1.2GB/s on my reproducing system) while with bs<32k, this performance
tanked down to ~300MB/s.

i.e., slow:

  dd if=some-compressed-file of=/dev/null bs=4k count=X

vs fast:

  dd if=some-compressed-file of=/dev/null bs=128k count=Y

The cause of this slowness is overhead to do with looking up extent_maps
to enable readahead pre-caching on compressed extents
(add_ra_bio_pages()), as well as some overhead in the generic VFS
readahead code we hit more in the slow case. Notably, the main
difference between the two read sizes is that in the large sized request
case, we call btrfs_readahead() relatively rarely while in the smaller
request we call it for every compressed extent. So the fast case stays
in the btrfs readahead loop:

    while ((folio = readahead_folio(rac)) != NULL)
	    btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);

where the slower one breaks out of that loop every time. This results in
calling add_ra_bio_pages a lot, doing lots of extent_map lookups,
extent_map locking, etc.

This happens because although add_ra_bio_pages() does add the
appropriate un-compressed file pages to the cache, it does not
communicate back to the ractl in any way. To solve this, we should be
using readahead_expand() to signal to readahead to expand the readahead
window.

This change passes the readahead_control into the btrfs_bio_ctrl and in
the case of compressed reads sets the expansion to the size of the
extent_map we already looked up anyway. It skips the subpage case as
that one already doesn't do add_ra_bio_pages().

With this change, whether we use bs=4k or bs=128k, btrfs expands the
readahead window up to the largest compressed extent we have seen so far
(in the trivial example: 128k) and the call stacks of the two modes look
identical. Notably, we barely call add_ra_bio_pages at all. And the
performance becomes identical as well. So this change certainly "fixes"
this performance problem.

Of course, it does seem to beg a few questions:

1. Will this waste too much page cache with a too large ra window?
2. Will this somehow cause bugs prevented by the more thoughtful
   checking in add_ra_bio_pages?
3. Should we delete add_ra_bio_pages?

My stabs at some answers:

1. Hard to say. See attempts at generic performance testing below. Is
   there a "readahead_shrink" we should be using? Should we expand more
   slowly, by half the remaining em size each time?
2. I don't think so. Since the new behavior is indistinguishable from
   reading the file with a larger read size passed in, I don't see why
   one would be safe but not the other.
3. Probably! I tested that and it was fine in fstests, and it seems like
   the pages would get re-used just as well in the readahead case.
   However, it is possible some reads that use page cache but not
   btrfs_readahead() could suffer. I will investigate this further as a
   follow up.

I tested the performance implications of this change in 3 ways (using
compress-force=zstd:3 for compression):

Directly test the affected workload of small sequential reads on a
compressed file (improved from ~250MB/s to ~1.2GB/s)

==========for-next==========
  dd /mnt/lol/non-cmpr 4k
  1048576+0 records in
  1048576+0 records out
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.02983 s, 712 MB/s
  dd /mnt/lol/non-cmpr 128k
  32768+0 records in
  32768+0 records out
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 5.92403 s, 725 MB/s
  dd /mnt/lol/cmpr 4k
  1048576+0 records in
  1048576+0 records out
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 17.8832 s, 240 MB/s
  dd /mnt/lol/cmpr 128k
  32768+0 records in
  32768+0 records out
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.71001 s, 1.2 GB/s

==========ra-expand==========
  dd /mnt/lol/non-cmpr 4k
  1048576+0 records in
  1048576+0 records out
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.09001 s, 705 MB/s
  dd /mnt/lol/non-cmpr 128k
  32768+0 records in
  32768+0 records out
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.07664 s, 707 MB/s
  dd /mnt/lol/cmpr 4k
  1048576+0 records in
  1048576+0 records out
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.79531 s, 1.1 GB/s
  dd /mnt/lol/cmpr 128k
  32768+0 records in
  32768+0 records out
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.69533 s, 1.2 GB/s

Built the linux kernel from clean (no change)

Ran fsperf. Mostly neutral results with some improvements and
regressions here and there.

Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Link: https://lore.kernel.org/linux-btrfs/34601559-6c16-6ccc-1793-20a97ca0dbba@gmx.net/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:00 +02:00
Qu Wenruo
1ef94169db btrfs: populate otime when logging an inode item
[TEST FAILURE WITH EXPERIMENTAL FEATURES]
When running test case generic/508, the test case will fail with the new
btrfs shutdown support:

generic/508       - output mismatch (see /home/adam/xfstests/results//generic/508.out.bad)
    --- tests/generic/508.out	2022-05-11 11:25:30.806666664 +0930
    +++ /home/adam/xfstests/results//generic/508.out.bad	2025-07-02 14:53:22.401824212 +0930
    @@ -1,2 +1,6 @@
     QA output created by 508
     Silence is golden
    +Before:
    +After : stat.btime = Thu Jan  1 09:30:00 1970
    +Before:
    +After : stat.btime = Wed Jul  2 14:53:22 2025
    ...
    (Run 'diff -u /home/adam/xfstests/tests/generic/508.out /home/adam/xfstests/results//generic/508.out.bad'  to see the entire diff)
Ran: generic/508
Failures: generic/508
Failed 1 of 1 tests

Please note that the test case requires shutdown support, thus the test
case will be skipped using the current upstream kernel, as it doesn't
have shutdown ioctl support.

[CAUSE]
The direct cause the 0 time stamp in the log tree:

leaf 30507008 items 2 free space 16057 generation 9 owner TREE_LOG
leaf 30507008 flags 0x1(WRITTEN) backref revision 1
checksum stored e522548d
checksum calced e522548d
fs uuid 57d45451-481e-43e4-aa93-289ad707a3a0
chunk uuid d52bd3fd-5163-4337-98a7-7986993ad398
	item 0 key (257 INODE_ITEM 0) itemoff 16123 itemsize 160
		generation 9 transid 9 size 0 nbytes 0
		block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
		sequence 1 flags 0x0(none)
		atime 1751432947.492000000 (2025-07-02 14:39:07)
		ctime 1751432947.492000000 (2025-07-02 14:39:07)
		mtime 1751432947.492000000 (2025-07-02 14:39:07)
		otime 0.0 (1970-01-01 09:30:00) <<<

But the old fs tree has all the correct time stamp:

btrfs-progs v6.12
fs tree key (FS_TREE ROOT_ITEM 0)
leaf 30425088 items 2 free space 16061 generation 5 owner FS_TREE
leaf 30425088 flags 0x1(WRITTEN) backref revision 1
checksum stored 48f6c57e
checksum calced 48f6c57e
fs uuid 57d45451-481e-43e4-aa93-289ad707a3a0
chunk uuid d52bd3fd-5163-4337-98a7-7986993ad398
	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
		generation 3 transid 0 size 0 nbytes 16384
		block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0
		sequence 0 flags 0x0(none)
		atime 1751432947.0 (2025-07-02 14:39:07)
		ctime 1751432947.0 (2025-07-02 14:39:07)
		mtime 1751432947.0 (2025-07-02 14:39:07)
		otime 1751432947.0 (2025-07-02 14:39:07) <<<

The root cause is that fill_inode_item() in tree-log.c is only
populating a/c/m time, not the otime (or btime in statx output).

Part of the reason is that, the vfs inode only has a/c/m time, no native
btime support yet.

[FIX]
Thankfully btrfs has its otime stored in btrfs_inode::i_otime_sec and
btrfs_inode::i_otime_nsec.

So what we really need is just fill the otime time stamp in
fill_inode_item() of tree-log.c

There is another fill_inode_item() in inode.c, which is doing the proper
otime population.

Fixes: 94edf4ae43 ("Btrfs: don't bother committing delayed inode updates when fsyncing")
CC: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:08:35 +02:00
Filipe Manana
a943812bff btrfs: qgroup: use btrfs_qgroup_enabled() in ioctls
We have a publicly exported btrfs_qgroup_enabled() and an ioctl.c private
qgroup_enabled() helper. Both of these test if qgroups are enabled, the
first check if the flag BTRFS_FS_QUOTA_ENABLED is set in fs_info->flags
while the second checks if fs_info->quota_root is not NULL while holding
the mutex fs_info->qgroup_ioctl_lock.

We can get away with the private ioctl.c:qgroup_enabled(), as all entry
points into the qgroup code check if fs_info->quota_root is NULL or not
while holding the mutex fs_info->qgroup_ioctl_lock, and returning the
error -ENOTCONN in case it's NULL.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:07:37 +02:00
Filipe Manana
08530d6e63 btrfs: qgroup: fix qgroup create ioctl returning success after quotas disabled
When quotas are disabled qgroup ioctls are supposed to return -ENOTCONN,
but the qgroup create ioctl stopped doing that when it races with a quota
disable operation, returning 0 instead. This change of behaviour happened
in commit 6ed05643dd ("btrfs: create qgroup earlier in snapshot
creation").

The issue happens as follows:

1) Task A enters btrfs_ioctl_qgroup_create(), qgroups are enabled and so
   qgroup_enabled() returns true since fs_info->quota_root is not NULL;

2) Task B enters btrfs_ioctl_quota_ctl() -> btrfs_quota_disable() and
   disables qgroups, so now fs_info->quota_root is NULL;

3) Task A enters btrfs_create_qgroup() and calls btrfs_qgroup_mode(),
   which returns BTRFS_QGROUP_MODE_DISABLED since quotas are disabled,
   and then btrfs_create_qgroup() returns 0 to the caller, which makes
   the ioctl return 0 instead of -ENOTCONN.

   The check for fs_info->quota_root and returning -ENOTCONN if it's NULL
   is made only after the call btrfs_qgroup_mode().

Fix this by moving the check for disabled quotas with btrfs_qgroup_mode()
into transaction.c:create_pending_snapshot(), so that we don't abort the
transaction if btrfs_create_qgroup() returns -ENOTCONN and quotas are
disabled.

Fixes: 6ed05643dd ("btrfs: create qgroup earlier in snapshot creation")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:07:30 +02:00
Filipe Manana
e41c75ca31 btrfs: qgroup: set quota enabled bit if quota disable fails flushing reservations
Before waiting for the rescan worker to finish and flushing reservations,
we clear the BTRFS_FS_QUOTA_ENABLED flag from fs_info. If we fail flushing
reservations we leave with the flag not set which is not correct since
quotas are still enabled - we must set back the flag on error paths, such
as when we fail to start a transaction, except for error paths that abort
a transaction. The reservation flushing happens very early before we do
any operation that actually disables quotas and before we start a
transaction, so set back BTRFS_FS_QUOTA_ENABLED if it fails.

Fixes: af0e2aab3b ("btrfs: qgroup: flush reservations during quota disable")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:07:08 +02:00
Qu Wenruo
736bd9d2e3 btrfs: restrict writes to opened btrfs devices
[FLAG EXCLUSION]
Commit ead622674d ("btrfs: Do not restrict writes to btrfs devices")
removes the BLK_OPEN_RESTRICT_WRITES flag when opening the devices
during mount.  This was an exception at the time as it depended on other
patches.

[REASON TO EXCLUDE THAT FLAG]
Btrfs needs to call btrfs_scan_one_device() to determine the fsid, no
matter if we're mounting a new fs or an existing one.

But if a fs is already mounted and the BLK_OPEN_RESTRICT_WRITES is
honored, meaning no other write open is allowed for the block device.

Then we want to mount a subvolume of the mounted fs to another mount
point, we will call btrfs_scan_one_device() again, but it will fail due
to the BLK_OPEN_RESTRICT_WRITES flag (no more write open allowed),
causing only one mount point for the fs.

Thus at that time, we had to exclude the BLK_OPEN_RESTRICT_WRITES to
allow multiple mount points for one fs.

[WHY IT'S SAFE NOW]
The root problem is, we do not need to nor should use BLK_OPEN_WRITE for
btrfs_scan_one_device().
That function is only to read out the super block, no write at all, and
BLK_OPEN_WRITE is only going to cause problems for such usage.

The root problem has been fixed by patch "btrfs: always open the device
read-only in btrfs_scan_one_device", so btrfs_scan_one_device() will
always work no matter if the device is opened with
BLK_OPEN_RESTRICT_WRITES.

[ENHANCEMENT]
Just remove the btrfs_open_mode(), as the only call site can be replaced
with regular sb_open_mode().

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:06:20 +02:00
Qu Wenruo
08fa138864 btrfs: use fs_holder_ops for all opened devices
Since we have btrfs_fs_info::sb (struct super_block) as our block device
holder, we can safely use fs_holder_ops for all of our block devices.

This enables freezing/thawing the filesystem from the underlying block
devices.

This may enhance hibernation/suspend support since previously
freezing/thawing a block device managed by btrfs won't do anything btrfs
specific, but only syncing the block device.

Thus before this change, freezing the underlying block devices won't
prevent future writes into the filesystem, thus may cause problems for
hibernation/suspend when btrfs is involved.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:06:20 +02:00
Christoph Hellwig
40426dd147 btrfs: use the super_block as holder when mounting file systems
The file system type is not a very useful holder as it doesn't allow us
to go back to the actual file system instance.  Pass the super_block
instead which is useful when passed back to the file system driver.

This matches what is done for all other block device based file systems,
and allows us to remove btrfs_fs_info::bdev_holder completely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:06:19 +02:00
Qu Wenruo
bddf57a707 btrfs: delay btrfs_open_devices() until super block is created
Currently we always call btrfs_open_devices() before creating the
super block.

It's fine for now because:

- No blk_holder_ops is provided
- btrfs_fs_type is used as a holder

This means no matter who wins the device opening race, the holder will be
the same thus not affecting the later sget_fc() race.

And since no blk_holder_ops is provided, no bdev operation is depending on
the holder.

But this will no longer be true if we want to implement a proper
blk_holder_ops using fs_holder_ops.
This means we will need a proper super block as the bdev holder.

To prepare for such change:

- Add btrfs_fs_devices::holding member
  This will prevent btrfs_free_stale_devices() and btrfs_close_device()
  from deleting the fs_devices when there is another process trying to
  mount the fs.

  Along with the new member, here come the two helpers,
  btrfs_fs_devices_inc_holding() and btrfs_fs_devices_dec_holding().

  This will allow us to hold fs_devices without opening it.

  This is needed because we cannot hold uuid_mutex while calling
  sget_fc(), this will reverse the lock sequence with s_umount, causing
  a lockdep warning.

- Delay btrfs_open_devices() until a super block is returned
  This means we have to hold the initial fs_devices first, then unlock
  uuid_mutex, call sget_fc(), then re-lock uuid_mutex, and decrease the
  holding number.

  For new super block case, we continue to btrfs_open_devices() with
  uuid_mutex hold.
  For existing super block case, we can unlock uuid_mutex and continue.

  Although this means a more complex error handling path, as if we
  didn't call btrfs_open_devices() (either got an existing sb, or
  sget_fc() failed), we cannot let btrfs_put_fs_info() cleanup the
  fs_devices, as it can be freed at any time after we decrease the hold
  on fs_devices and unlock uuid_mutex.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:06:19 +02:00
Qu Wenruo
de339cbfb4 btrfs: call bdev_fput() to reclaim the blk_holder immediately
As part of the preparation for btrfs blk_holder_ops, we want to ensure
the holder of a block device has a proper lifespan.

However btrfs is always using fput() to close a block device, which has
one problem:

- fput() is deferred
  Meaning we can have a block device with invalid (aka, freed) holder.

To avoid the problem and align the behavior to other code, just call
bdev_fput() instead.

There is some extra requirement on the locking, but that's all resolved
by previous patches and we should be safe to call bdev_fput().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:06:19 +02:00
Christoph Hellwig
9f43d0ff55 btrfs: call btrfs_close_devices() from ->kill_sb
Although btrfs is not yet implementing blk_holder_ops, there is a
requirement for proper blk_holder_ops:

- blkdev_put() must not be called under sb->s_umount
  The blkdev_put()/bdev_fput() must not be called under sb->s_umount to
  avoid lock order reversal with disk->open_mutex.
  This is for the proper blk_holder_ops callbacks.

  Currently we're fine because we call regular fput() which defers the
  blk holder reclaiming.

To prepare for the future of blk_holder_ops, move the
btrfs_close_devices() calls into btrfs_free_fs_info().

That will be called from kill_sb() callbacks, which is also called for
error handing during mount failures, or there is already an existing
super block.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:06:19 +02:00
Qu Wenruo
2936a6ac8d btrfs: add assertions to make super block creation more clear
When calling sget_fc(), there are 3 different situations:

a) Critical error
   No super block created.

b) A new super block is created
   The fc->s_fs_info is transferred to the super block, and fc->s_fs_info
   is reset to NULL.

   In this case sb->s_root should still be NULL, and needs to be properly
   initialized later by btrfs_fill_super().

c) An existing super block is returned
   The fc->s_fs_info is untouched, and anything related to that fs_info
   should be properly cleaned up.

This is not obvious even with the extra comments at sget_fc().

Enhance the situation by:

- Add comments for case b) and c)
  Especially for case c), the fs_info and fs_devices cleanup happens at
  different timing, thus needs extra explanation.

- Move the comments closer to case b) and case c)

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:06:19 +02:00
Qu Wenruo
35ea448b75 btrfs: get rid of re-entering of btrfs_get_tree()
[EXISTING PROBLEM]
Currently btrfs mount is split into two parts:

- btrfs_get_tree_subvol()
  Which sets up the very basic fs_info, and eventually calls
  mount_subvol() to mount the target subvolume.

- btrfs_get_tree_super()
  This is the part doing super block allocation and if there is no
  existing super block, do the real open_ctree() to open the fs.

However currently we're doing this in a complex re-entering way:

vfs_get_tree()
|- btrfs_get_tree()
   |- btrfs_get_tree_subvol()
      |- vfs_get_tree()
      |  |- btrfs_get_tree()
      |     |- btrfs_get_tree_super()
      |- mount_subvol()

This is definitely not that easy to grasp.

[ENHANCEMENT]
The function vfs_get_tree() is only doing the following work:

- Call get_tree() call back
- Call super_wake()
- Call security_sb_set_mnt_opts()

In our case, super_wake() can be skipped, as after
btrfs_get_tree_subvol() finishes, vfs_get_tree() will call super_wake()
on the super block we got anyway.

The same applies to security_sb_set_mnt_opts(), as long as we do not
free the security from our original fc in btrfs_get_tree_subvol(), the
first vfs_get_tree() call will handle the security correctly.

So here we only need to:

- Replace vfs_get_tree() call with btrfs_get_tree_super()

- Keep the existing fc->security for vfs_get_tree() to handle the
  security

This will remove the re-entering behavior and make thing much easier to
follow.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:06:19 +02:00
Christoph Hellwig
ae818824a2 btrfs: always open the device read-only in btrfs_scan_one_device()
btrfs_scan_one_device() opens the block device only to read the super
block.  Instead of passing a blk_mode_t argument to sometimes open
it for writing, just hard code BLK_OPEN_READ as it will never write
to the device or hand the block_device out to someone else.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:06:02 +02:00
Caleb Sander Mateos
ea124ec327 btrfs: don't skip accounting in early ENOTTY return in btrfs_uring_encoded_read()
btrfs_uring_encoded_read() returns early with -ENOTTY if the uring_cmd
is issued with IO_URING_F_COMPAT but the kernel doesn't support compat
syscalls. However, this early return bypasses the syscall accounting.
Go to out_acct instead to ensure the syscall is counted.

Fixes: 34310c442e ("btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)")
CC: stable@vger.kernel.org # 6.15+
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:05:30 +02:00
David Sterba
9950c31ad9 btrfs: rename inode number parameter passed to btrfs_check_dir_item_collision()
The name 'dir' is misleading as it's the inode number of the directory,
so rename it accordingly.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:05:00 +02:00
David Sterba
8320febc64 btrfs: pass bool to indicate subvolume/snapshot creation type
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:05:00 +02:00
David Sterba
a5f0e0a4df btrfs: pass dentry to btrfs_mksubvol() and btrfs_mksnapshot()
There's no reason to pass 'struct path' to btrfs_mksubvol(), though it's
been like the since the first commit 76dda93c6a ("Btrfs: add
snapshot/subvolume destroy ioctl").  We only use the dentry so we should
pass it directly.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:05:00 +02:00
David Sterba
34f6cc5b18 btrfs: use struct qstr for subvolume ioctl helpers
We pass name and length of subvolumes separately to the related
functions, while this can be a struct qstr which is otherwise used for
dentry interfaces.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:05:00 +02:00
Brahmajit Das
164299ba11 btrfs: replace strcpy() with strscpy()
strcpy() is discouraged from use due to lack of bounds checking.
Replaces it with strscpy(), the recommended alternative for null
terminated strings, to follow best practices.

There are instances where strscpy() cannot be used such as where both
the source and destination are character pointers. In that instance we
can use sysfs_emit().

Link: https://github.com/KSPP/linux/issues/88
Suggested-by: Anthony Iliopoulos <ailiop@suse.com>
Signed-off-by: Brahmajit Das <bdas@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:05:00 +02:00
David Sterba
b37eb352c4 btrfs: accessors: delete token versions of set/get helpers
Once upon a time there was a need to cache address of extent buffer
pages, as it was a costly operation (map_private_extent_buffer(),
cfed81a04e ("Btrfs: add the ability to cache a pointer into the
eb")).  This was not even due to use of HIGHMEM, this had been removed
before that due to possible locking issues (a65917156e ("Btrfs:
stop using highmem for extent_buffers")).

Over time the amount of work in the set/get helpers got reduced and
became quite straightforward bounds checking with an unaligned
read/write, commit db3756c879 ("btrfs: remove unused
map_private_extent_buffer").

The actual caching of the page_address()/folio_address() in the token
was more work for very little gain. This depended on subsequent access
into the same page/folio, otherwise the cached pointer had to be
updated.

For metadata-heavy operations this showed up in the 'perf top' profile
where the btrfs_get_token_32() calls were at the top, on my testing
machine consuming about 2-3%. The other generic 32/64 bit helpers also
appeared in the profile with similar fraction.

After removing use of the token helpers we can remove them completely,
this leads to reduction of btrfs.ko by 6.7KiB on release config.

   text    data     bss     dec     hex filename
1463289  115665   16088 1595042  1856a2 pre/btrfs.ko
1456601  115665   16088 1588354  183c82 post/btrfs.ko

DELTA: -6688

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:05:00 +02:00