linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-13 19:59:34 -04:00

Author	SHA1	Message	Date
David Sterba	378c95c477	btrfs: accessors: use type sizeof constants directly Now unit_size is used only once, so use it directly in 'part' calculation. Don't cache sizeof(type) in a variable. While this is a compile-time constant, forcing the type 'int' generates worse code as it leads to additional conversion from 32 to 64 bit type on x86_64. The sizeof() is used only a few times and it does not make the code that harder to read, so use it directly and let the compiler utilize the immediate constants in the context it needs. The .ko code size slightly increases (+50) but further patches will reduce that again. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
David Sterba	00c0cf8444	btrfs: accessors: simplify folio bounds checks As we can a have non-contiguous range in the eb->folios, any item can be straddling two folios and we need to check if it can be read in one go or in two parts. For that there's a check which is not implemented in the simplest way: offset in folio + size <= folio size With a simple expression transformation: oil + size <= unit_size size <= unit_size - oil sizeof() <= part this can be simplified and reusing existing run-time or compile-time constants. Add likely() annotation for this expression as this is the fast path and compiler sometimes reorders that after the followup block with the memcpy (observed in practice with other simplifications). Overall effect on stack consumption: btrfs_get_8 -8 (80 -> 72) btrfs_set_8 -8 (88 -> 80) And .ko size (due to optimizations making use of the direct constants): text data bss dec hex filename 1456601 115665 16088 1588354 183c82 pre/btrfs.ko 1456093 115665 16088 1587846 183a86 post/btrfs.ko DELTA: -508 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
David Sterba	c76841362f	btrfs: remove struct rcu_string The only use for device name has been removed so we can kill the RCU string API. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
David Sterba	e8d58aef11	btrfs: open code RCU for device name The RCU protected string is only used for a device name, and RCU is used so we can print the name and eventually synchronize against the rare device rename in device_list_add(). We don't need the whole API just for that. Open code all the helpers and access to the string itself. Notable change is in device_list_add() when the device name is changed, which is the only place that can actually happen at the same time as message prints using the device name under RCU read lock. Previously there was kfree_rcu() which used the embedded rcu_head to delay freeing the object depending on the RCU mechanism. Now there's kfree_rcu_mightsleep() which does not need the rcu_head and waits for the grace period. Sleeping is safe in this context and as this is a rare event it won't interfere with the rest as it's holding the device_list_mutex. Straightforward changes: - rcu_string_strdup -> kstrdup - rcu_str_deref -> rcu_dereference - drop ->str from safe contexts and use rcu_dereference_raw() so it does not trigger RCU validators Historical notes: Introduced in `606686eeac` ("Btrfs: use rcu to protect device->name") with a vague reference of the potential problem described in https://lore.kernel.org/all/20120531155304.GF11775@ZenIV.linux.org.uk/ . The RCU protection looks like the easiest and most lightweight way of protecting the rare event of device rename racing device_list_add() with a random printk() that uses the device name. Alternatives: a spin lock would require to protect the printk anyway, a fixed buffer for the name would be eventually wrong in case the new name is overwritten when being printed, an array switching pointers and cleaning them up eventually resembles RCU too much. The cleanups up to this patch should hide special case of RCU to the minimum that only the name needs rcu_dereference(), which can be further cleaned up to use btrfs_dev_name(). Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
Daniel Vacek	f2cb97ee96	btrfs: index buffer_tree using node size So far we've been deriving the buffer tree index using the sector size. But each extent buffer covers multiple sectors. This makes the buffer tree rather sparse. For example the typical and quite common configuration uses sector size of 4KiB and node size of 16KiB. In this case it means the buffer tree is using up to the maximum of 25% of it's slots. Or in other words at least 75% of the tree slots are wasted as never used. We can score significant memory savings on the required tree nodes by indexing the tree using the node size instead. As a result far less slots are wasted and the tree can now use up to all 100% of it's slots this way. Note: This works even with unaligned tree blocks as we can still get unique index by doing eb->start >> nodesize_shift. Getting some stats from running fio write test, there is a bit of variance. The values presented in the table below are medians from 5 test runs. The numbers are: - # of allocated ebs in the tree - # of leaf tree nodes - highest index in the tree (radix tree width)): ebs / leaves / Index \| bare for-next \| with fix ---------------------+--------------------+------------------- post mount \| 16 / 11 / 10e5c \| 16 / 10 / 4240 post test \| 5810 / 891 / 11cfc \| 4420 / 252 / 473a post rm \| 574 / 300 / 10ef0 \| 540 / 163 / 46e9 In this case (10GiB filesystem) the height of the tree is still 3 levels but the 4x width reduction is clearly visible as expected. But since the tree is more dense we can see the 54-72% reduction of leaf nodes. That's very close to ideal with this test. It means the tree is getting really dense with this kind of workload. Also, the fio results show no performance change. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:20 +02:00
Filipe Manana	2b759eea98	btrfs: send: directly return strcmp() result when comparing recorded refs There's no point in converting the return values from strcmp() as all we need is that it returns a negative value if the first argument is less than the second, a positive value if it's greater and 0 if equal. We do not have a need for -1 instead of any other negative value and no need for 1 instead of any other positive value - that's all that rb_find() needs and no where else we need specific negative and positive values. So remove the intermediate local variable and checks and return directly the result from strcmp(). This also reduces the module's text size. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1888116 161347 16136 2065599 1f84bf fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1888052 161347 16136 2065535 1f847f fs/btrfs/btrfs.ko Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:20 +02:00
Filipe Manana	e560afc1a8	btrfs: set search_commit_root to false in iterate_inodes_from_logical() There's no point in checking at iterate_inodes_from_logical() if the path has search_commit_root set, the only caller never sets search_commit_root to true and it doesn't make sense for it ever to be true for the current use case (logical_to_ino ioctl). So stop checking for that and since the only caller allocates the path just for it to be used by iterate_inodes_from_logical(), move the path allocation into that function. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:20 +02:00
Filipe Manana	aee10fe4e4	btrfs: reduce size of struct tree_mod_elem Several members are used for specific types of tree mod log operations so they can be placed in a union in order to reduce the structure's size. This reduces the structure size from 112 bytes to 88 bytes on x86_64, so we can now use the kmalloc-96 slab instead of the kmalloc-128 slab. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:20 +02:00
Filipe Manana	d30b236a3e	btrfs: avoid logging tree mod log elements for irrelevant extent buffers We are logging tree mod log operations for extent buffers from any tree but we only need to log for the extent tree and subvolume tree, since the tree mod log is used to get a consistent view, within a transaction, of extents and their backrefs. So it's pointless to log operations for trees such as the csum tree, free space tree, root tree, chunk tree, log trees, data relocation tree, etc, as these trees are not used for backref walking and all tree mod log users are about backref walking. So skip extent buffers that don't belong neither to the extent nor to subvolume trees. This avoids unnecessary memory allocations and having a larger tree mod log rbtree with nodes that are never needed. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:20 +02:00
Boris Burkov	9e9ff875e4	btrfs: use readahead_expand() on compressed extents We recently received a report of poor performance doing sequential buffered reads of a file with compressed extents. With bs=128k, a naive sequential dd ran as fast on a compressed file as on an uncompressed (1.2GB/s on my reproducing system) while with bs<32k, this performance tanked down to ~300MB/s. i.e., slow: dd if=some-compressed-file of=/dev/null bs=4k count=X vs fast: dd if=some-compressed-file of=/dev/null bs=128k count=Y The cause of this slowness is overhead to do with looking up extent_maps to enable readahead pre-caching on compressed extents (add_ra_bio_pages()), as well as some overhead in the generic VFS readahead code we hit more in the slow case. Notably, the main difference between the two read sizes is that in the large sized request case, we call btrfs_readahead() relatively rarely while in the smaller request we call it for every compressed extent. So the fast case stays in the btrfs readahead loop: while ((folio = readahead_folio(rac)) != NULL) btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start); where the slower one breaks out of that loop every time. This results in calling add_ra_bio_pages a lot, doing lots of extent_map lookups, extent_map locking, etc. This happens because although add_ra_bio_pages() does add the appropriate un-compressed file pages to the cache, it does not communicate back to the ractl in any way. To solve this, we should be using readahead_expand() to signal to readahead to expand the readahead window. This change passes the readahead_control into the btrfs_bio_ctrl and in the case of compressed reads sets the expansion to the size of the extent_map we already looked up anyway. It skips the subpage case as that one already doesn't do add_ra_bio_pages(). With this change, whether we use bs=4k or bs=128k, btrfs expands the readahead window up to the largest compressed extent we have seen so far (in the trivial example: 128k) and the call stacks of the two modes look identical. Notably, we barely call add_ra_bio_pages at all. And the performance becomes identical as well. So this change certainly "fixes" this performance problem. Of course, it does seem to beg a few questions: 1. Will this waste too much page cache with a too large ra window? 2. Will this somehow cause bugs prevented by the more thoughtful checking in add_ra_bio_pages? 3. Should we delete add_ra_bio_pages? My stabs at some answers: 1. Hard to say. See attempts at generic performance testing below. Is there a "readahead_shrink" we should be using? Should we expand more slowly, by half the remaining em size each time? 2. I don't think so. Since the new behavior is indistinguishable from reading the file with a larger read size passed in, I don't see why one would be safe but not the other. 3. Probably! I tested that and it was fine in fstests, and it seems like the pages would get re-used just as well in the readahead case. However, it is possible some reads that use page cache but not btrfs_readahead() could suffer. I will investigate this further as a follow up. I tested the performance implications of this change in 3 ways (using compress-force=zstd:3 for compression): Directly test the affected workload of small sequential reads on a compressed file (improved from ~250MB/s to ~1.2GB/s) ==========for-next========== dd /mnt/lol/non-cmpr 4k 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.02983 s, 712 MB/s dd /mnt/lol/non-cmpr 128k 32768+0 records in 32768+0 records out 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 5.92403 s, 725 MB/s dd /mnt/lol/cmpr 4k 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 17.8832 s, 240 MB/s dd /mnt/lol/cmpr 128k 32768+0 records in 32768+0 records out 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.71001 s, 1.2 GB/s ==========ra-expand========== dd /mnt/lol/non-cmpr 4k 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.09001 s, 705 MB/s dd /mnt/lol/non-cmpr 128k 32768+0 records in 32768+0 records out 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.07664 s, 707 MB/s dd /mnt/lol/cmpr 4k 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.79531 s, 1.1 GB/s dd /mnt/lol/cmpr 128k 32768+0 records in 32768+0 records out 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.69533 s, 1.2 GB/s Built the linux kernel from clean (no change) Ran fsperf. Mostly neutral results with some improvements and regressions here and there. Reported-by: Dimitrios Apostolou <jimis@gmx.net> Link: https://lore.kernel.org/linux-btrfs/34601559-6c16-6ccc-1793-20a97ca0dbba@gmx.net/ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:00 +02:00
Qu Wenruo	1ef94169db	btrfs: populate otime when logging an inode item [TEST FAILURE WITH EXPERIMENTAL FEATURES] When running test case generic/508, the test case will fail with the new btrfs shutdown support: generic/508 - output mismatch (see /home/adam/xfstests/results//generic/508.out.bad) --- tests/generic/508.out 2022-05-11 11:25:30.806666664 +0930 +++ /home/adam/xfstests/results//generic/508.out.bad 2025-07-02 14:53:22.401824212 +0930 @@ -1,2 +1,6 @@ QA output created by 508 Silence is golden +Before: +After : stat.btime = Thu Jan 1 09:30:00 1970 +Before: +After : stat.btime = Wed Jul 2 14:53:22 2025 ... (Run 'diff -u /home/adam/xfstests/tests/generic/508.out /home/adam/xfstests/results//generic/508.out.bad' to see the entire diff) Ran: generic/508 Failures: generic/508 Failed 1 of 1 tests Please note that the test case requires shutdown support, thus the test case will be skipped using the current upstream kernel, as it doesn't have shutdown ioctl support. [CAUSE] The direct cause the 0 time stamp in the log tree: leaf 30507008 items 2 free space 16057 generation 9 owner TREE_LOG leaf 30507008 flags 0x1(WRITTEN) backref revision 1 checksum stored e522548d checksum calced e522548d fs uuid 57d45451-481e-43e4-aa93-289ad707a3a0 chunk uuid d52bd3fd-5163-4337-98a7-7986993ad398 item 0 key (257 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 9 transid 9 size 0 nbytes 0 block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0 sequence 1 flags 0x0(none) atime 1751432947.492000000 (2025-07-02 14:39:07) ctime 1751432947.492000000 (2025-07-02 14:39:07) mtime 1751432947.492000000 (2025-07-02 14:39:07) otime 0.0 (1970-01-01 09:30:00) <<< But the old fs tree has all the correct time stamp: btrfs-progs v6.12 fs tree key (FS_TREE ROOT_ITEM 0) leaf 30425088 items 2 free space 16061 generation 5 owner FS_TREE leaf 30425088 flags 0x1(WRITTEN) backref revision 1 checksum stored 48f6c57e checksum calced 48f6c57e fs uuid 57d45451-481e-43e4-aa93-289ad707a3a0 chunk uuid d52bd3fd-5163-4337-98a7-7986993ad398 item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 3 transid 0 size 0 nbytes 16384 block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0 sequence 0 flags 0x0(none) atime 1751432947.0 (2025-07-02 14:39:07) ctime 1751432947.0 (2025-07-02 14:39:07) mtime 1751432947.0 (2025-07-02 14:39:07) otime 1751432947.0 (2025-07-02 14:39:07) <<< The root cause is that fill_inode_item() in tree-log.c is only populating a/c/m time, not the otime (or btime in statx output). Part of the reason is that, the vfs inode only has a/c/m time, no native btime support yet. [FIX] Thankfully btrfs has its otime stored in btrfs_inode::i_otime_sec and btrfs_inode::i_otime_nsec. So what we really need is just fill the otime time stamp in fill_inode_item() of tree-log.c There is another fill_inode_item() in inode.c, which is doing the proper otime population. Fixes: `94edf4ae43` ("Btrfs: don't bother committing delayed inode updates when fsyncing") CC: stable@vger.kernel.org Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:08:35 +02:00
Filipe Manana	a943812bff	btrfs: qgroup: use btrfs_qgroup_enabled() in ioctls We have a publicly exported btrfs_qgroup_enabled() and an ioctl.c private qgroup_enabled() helper. Both of these test if qgroups are enabled, the first check if the flag BTRFS_FS_QUOTA_ENABLED is set in fs_info->flags while the second checks if fs_info->quota_root is not NULL while holding the mutex fs_info->qgroup_ioctl_lock. We can get away with the private ioctl.c:qgroup_enabled(), as all entry points into the qgroup code check if fs_info->quota_root is NULL or not while holding the mutex fs_info->qgroup_ioctl_lock, and returning the error -ENOTCONN in case it's NULL. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:07:37 +02:00
Filipe Manana	08530d6e63	btrfs: qgroup: fix qgroup create ioctl returning success after quotas disabled When quotas are disabled qgroup ioctls are supposed to return -ENOTCONN, but the qgroup create ioctl stopped doing that when it races with a quota disable operation, returning 0 instead. This change of behaviour happened in commit `6ed05643dd` ("btrfs: create qgroup earlier in snapshot creation"). The issue happens as follows: 1) Task A enters btrfs_ioctl_qgroup_create(), qgroups are enabled and so qgroup_enabled() returns true since fs_info->quota_root is not NULL; 2) Task B enters btrfs_ioctl_quota_ctl() -> btrfs_quota_disable() and disables qgroups, so now fs_info->quota_root is NULL; 3) Task A enters btrfs_create_qgroup() and calls btrfs_qgroup_mode(), which returns BTRFS_QGROUP_MODE_DISABLED since quotas are disabled, and then btrfs_create_qgroup() returns 0 to the caller, which makes the ioctl return 0 instead of -ENOTCONN. The check for fs_info->quota_root and returning -ENOTCONN if it's NULL is made only after the call btrfs_qgroup_mode(). Fix this by moving the check for disabled quotas with btrfs_qgroup_mode() into transaction.c:create_pending_snapshot(), so that we don't abort the transaction if btrfs_create_qgroup() returns -ENOTCONN and quotas are disabled. Fixes: `6ed05643dd` ("btrfs: create qgroup earlier in snapshot creation") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:07:30 +02:00
Filipe Manana	e41c75ca31	btrfs: qgroup: set quota enabled bit if quota disable fails flushing reservations Before waiting for the rescan worker to finish and flushing reservations, we clear the BTRFS_FS_QUOTA_ENABLED flag from fs_info. If we fail flushing reservations we leave with the flag not set which is not correct since quotas are still enabled - we must set back the flag on error paths, such as when we fail to start a transaction, except for error paths that abort a transaction. The reservation flushing happens very early before we do any operation that actually disables quotas and before we start a transaction, so set back BTRFS_FS_QUOTA_ENABLED if it fails. Fixes: `af0e2aab3b` ("btrfs: qgroup: flush reservations during quota disable") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:07:08 +02:00
Qu Wenruo	736bd9d2e3	btrfs: restrict writes to opened btrfs devices [FLAG EXCLUSION] Commit `ead622674d` ("btrfs: Do not restrict writes to btrfs devices") removes the BLK_OPEN_RESTRICT_WRITES flag when opening the devices during mount. This was an exception at the time as it depended on other patches. [REASON TO EXCLUDE THAT FLAG] Btrfs needs to call btrfs_scan_one_device() to determine the fsid, no matter if we're mounting a new fs or an existing one. But if a fs is already mounted and the BLK_OPEN_RESTRICT_WRITES is honored, meaning no other write open is allowed for the block device. Then we want to mount a subvolume of the mounted fs to another mount point, we will call btrfs_scan_one_device() again, but it will fail due to the BLK_OPEN_RESTRICT_WRITES flag (no more write open allowed), causing only one mount point for the fs. Thus at that time, we had to exclude the BLK_OPEN_RESTRICT_WRITES to allow multiple mount points for one fs. [WHY IT'S SAFE NOW] The root problem is, we do not need to nor should use BLK_OPEN_WRITE for btrfs_scan_one_device(). That function is only to read out the super block, no write at all, and BLK_OPEN_WRITE is only going to cause problems for such usage. The root problem has been fixed by patch "btrfs: always open the device read-only in btrfs_scan_one_device", so btrfs_scan_one_device() will always work no matter if the device is opened with BLK_OPEN_RESTRICT_WRITES. [ENHANCEMENT] Just remove the btrfs_open_mode(), as the only call site can be replaced with regular sb_open_mode(). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:06:20 +02:00
Qu Wenruo	08fa138864	btrfs: use fs_holder_ops for all opened devices Since we have btrfs_fs_info::sb (struct super_block) as our block device holder, we can safely use fs_holder_ops for all of our block devices. This enables freezing/thawing the filesystem from the underlying block devices. This may enhance hibernation/suspend support since previously freezing/thawing a block device managed by btrfs won't do anything btrfs specific, but only syncing the block device. Thus before this change, freezing the underlying block devices won't prevent future writes into the filesystem, thus may cause problems for hibernation/suspend when btrfs is involved. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:06:20 +02:00
Christoph Hellwig	40426dd147	btrfs: use the super_block as holder when mounting file systems The file system type is not a very useful holder as it doesn't allow us to go back to the actual file system instance. Pass the super_block instead which is useful when passed back to the file system driver. This matches what is done for all other block device based file systems, and allows us to remove btrfs_fs_info::bdev_holder completely. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:06:19 +02:00
Qu Wenruo	bddf57a707	btrfs: delay btrfs_open_devices() until super block is created Currently we always call btrfs_open_devices() before creating the super block. It's fine for now because: - No blk_holder_ops is provided - btrfs_fs_type is used as a holder This means no matter who wins the device opening race, the holder will be the same thus not affecting the later sget_fc() race. And since no blk_holder_ops is provided, no bdev operation is depending on the holder. But this will no longer be true if we want to implement a proper blk_holder_ops using fs_holder_ops. This means we will need a proper super block as the bdev holder. To prepare for such change: - Add btrfs_fs_devices::holding member This will prevent btrfs_free_stale_devices() and btrfs_close_device() from deleting the fs_devices when there is another process trying to mount the fs. Along with the new member, here come the two helpers, btrfs_fs_devices_inc_holding() and btrfs_fs_devices_dec_holding(). This will allow us to hold fs_devices without opening it. This is needed because we cannot hold uuid_mutex while calling sget_fc(), this will reverse the lock sequence with s_umount, causing a lockdep warning. - Delay btrfs_open_devices() until a super block is returned This means we have to hold the initial fs_devices first, then unlock uuid_mutex, call sget_fc(), then re-lock uuid_mutex, and decrease the holding number. For new super block case, we continue to btrfs_open_devices() with uuid_mutex hold. For existing super block case, we can unlock uuid_mutex and continue. Although this means a more complex error handling path, as if we didn't call btrfs_open_devices() (either got an existing sb, or sget_fc() failed), we cannot let btrfs_put_fs_info() cleanup the fs_devices, as it can be freed at any time after we decrease the hold on fs_devices and unlock uuid_mutex. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:06:19 +02:00
Qu Wenruo	de339cbfb4	btrfs: call bdev_fput() to reclaim the blk_holder immediately As part of the preparation for btrfs blk_holder_ops, we want to ensure the holder of a block device has a proper lifespan. However btrfs is always using fput() to close a block device, which has one problem: - fput() is deferred Meaning we can have a block device with invalid (aka, freed) holder. To avoid the problem and align the behavior to other code, just call bdev_fput() instead. There is some extra requirement on the locking, but that's all resolved by previous patches and we should be safe to call bdev_fput(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:06:19 +02:00
Christoph Hellwig	9f43d0ff55	btrfs: call btrfs_close_devices() from ->kill_sb Although btrfs is not yet implementing blk_holder_ops, there is a requirement for proper blk_holder_ops: - blkdev_put() must not be called under sb->s_umount The blkdev_put()/bdev_fput() must not be called under sb->s_umount to avoid lock order reversal with disk->open_mutex. This is for the proper blk_holder_ops callbacks. Currently we're fine because we call regular fput() which defers the blk holder reclaiming. To prepare for the future of blk_holder_ops, move the btrfs_close_devices() calls into btrfs_free_fs_info(). That will be called from kill_sb() callbacks, which is also called for error handing during mount failures, or there is already an existing super block. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:06:19 +02:00
Qu Wenruo	2936a6ac8d	btrfs: add assertions to make super block creation more clear When calling sget_fc(), there are 3 different situations: a) Critical error No super block created. b) A new super block is created The fc->s_fs_info is transferred to the super block, and fc->s_fs_info is reset to NULL. In this case sb->s_root should still be NULL, and needs to be properly initialized later by btrfs_fill_super(). c) An existing super block is returned The fc->s_fs_info is untouched, and anything related to that fs_info should be properly cleaned up. This is not obvious even with the extra comments at sget_fc(). Enhance the situation by: - Add comments for case b) and c) Especially for case c), the fs_info and fs_devices cleanup happens at different timing, thus needs extra explanation. - Move the comments closer to case b) and case c) Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:06:19 +02:00
Qu Wenruo	35ea448b75	btrfs: get rid of re-entering of btrfs_get_tree() [EXISTING PROBLEM] Currently btrfs mount is split into two parts: - btrfs_get_tree_subvol() Which sets up the very basic fs_info, and eventually calls mount_subvol() to mount the target subvolume. - btrfs_get_tree_super() This is the part doing super block allocation and if there is no existing super block, do the real open_ctree() to open the fs. However currently we're doing this in a complex re-entering way: vfs_get_tree() \|- btrfs_get_tree() \|- btrfs_get_tree_subvol() \|- vfs_get_tree() \| \|- btrfs_get_tree() \| \|- btrfs_get_tree_super() \|- mount_subvol() This is definitely not that easy to grasp. [ENHANCEMENT] The function vfs_get_tree() is only doing the following work: - Call get_tree() call back - Call super_wake() - Call security_sb_set_mnt_opts() In our case, super_wake() can be skipped, as after btrfs_get_tree_subvol() finishes, vfs_get_tree() will call super_wake() on the super block we got anyway. The same applies to security_sb_set_mnt_opts(), as long as we do not free the security from our original fc in btrfs_get_tree_subvol(), the first vfs_get_tree() call will handle the security correctly. So here we only need to: - Replace vfs_get_tree() call with btrfs_get_tree_super() - Keep the existing fc->security for vfs_get_tree() to handle the security This will remove the re-entering behavior and make thing much easier to follow. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:06:19 +02:00
Christoph Hellwig	ae818824a2	btrfs: always open the device read-only in btrfs_scan_one_device() btrfs_scan_one_device() opens the block device only to read the super block. Instead of passing a blk_mode_t argument to sometimes open it for writing, just hard code BLK_OPEN_READ as it will never write to the device or hand the block_device out to someone else. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:06:02 +02:00
Caleb Sander Mateos	ea124ec327	btrfs: don't skip accounting in early ENOTTY return in btrfs_uring_encoded_read() btrfs_uring_encoded_read() returns early with -ENOTTY if the uring_cmd is issued with IO_URING_F_COMPAT but the kernel doesn't support compat syscalls. However, this early return bypasses the syscall accounting. Go to out_acct instead to ensure the syscall is counted. Fixes: `34310c442e` ("btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)") CC: stable@vger.kernel.org # 6.15+ Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:05:30 +02:00
David Sterba	9950c31ad9	btrfs: rename inode number parameter passed to btrfs_check_dir_item_collision() The name 'dir' is misleading as it's the inode number of the directory, so rename it accordingly. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:05:00 +02:00
David Sterba	8320febc64	btrfs: pass bool to indicate subvolume/snapshot creation type Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:05:00 +02:00
David Sterba	a5f0e0a4df	btrfs: pass dentry to btrfs_mksubvol() and btrfs_mksnapshot() There's no reason to pass 'struct path' to btrfs_mksubvol(), though it's been like the since the first commit `76dda93c6a` ("Btrfs: add snapshot/subvolume destroy ioctl"). We only use the dentry so we should pass it directly. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:05:00 +02:00
David Sterba	34f6cc5b18	btrfs: use struct qstr for subvolume ioctl helpers We pass name and length of subvolumes separately to the related functions, while this can be a struct qstr which is otherwise used for dentry interfaces. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:05:00 +02:00
Brahmajit Das	164299ba11	btrfs: replace strcpy() with strscpy() strcpy() is discouraged from use due to lack of bounds checking. Replaces it with strscpy(), the recommended alternative for null terminated strings, to follow best practices. There are instances where strscpy() cannot be used such as where both the source and destination are character pointers. In that instance we can use sysfs_emit(). Link: https://github.com/KSPP/linux/issues/88 Suggested-by: Anthony Iliopoulos <ailiop@suse.com> Signed-off-by: Brahmajit Das <bdas@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:05:00 +02:00
David Sterba	b37eb352c4	btrfs: accessors: delete token versions of set/get helpers Once upon a time there was a need to cache address of extent buffer pages, as it was a costly operation (map_private_extent_buffer(), `cfed81a04e` ("Btrfs: add the ability to cache a pointer into the eb")). This was not even due to use of HIGHMEM, this had been removed before that due to possible locking issues (`a65917156e` ("Btrfs: stop using highmem for extent_buffers")). Over time the amount of work in the set/get helpers got reduced and became quite straightforward bounds checking with an unaligned read/write, commit `db3756c879` ("btrfs: remove unused map_private_extent_buffer"). The actual caching of the page_address()/folio_address() in the token was more work for very little gain. This depended on subsequent access into the same page/folio, otherwise the cached pointer had to be updated. For metadata-heavy operations this showed up in the 'perf top' profile where the btrfs_get_token_32() calls were at the top, on my testing machine consuming about 2-3%. The other generic 32/64 bit helpers also appeared in the profile with similar fraction. After removing use of the token helpers we can remove them completely, this leads to reduction of btrfs.ko by 6.7KiB on release config. text data bss dec hex filename 1463289 115665 16088 1595042 1856a2 pre/btrfs.ko 1456601 115665 16088 1588354 183c82 post/btrfs.ko DELTA: -6688 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:05:00 +02:00
David Sterba	c418a15045	btrfs: tree-log: don't use token set/get accessors in fill_inode_item() The token versions of set/get accessors will be removed, use the normal helpers. There's additional overhead of the token helpers that update the cached address in case it moves to another page/folio. The normal versions don't need to do that. Note this is similar to fill_inode_item() in inode.c but with slight differences. The two functions could be deduplicated eventually. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:05:00 +02:00
David Sterba	e3df5141a4	btrfs: don't use token set/get accessors in inode.c:fill_inode_item() The token versions of set/get accessors will be removed, use the normal helpers. There's additional overhead of the token helpers that update the cached address in case it moves to another page/folio. The normal versions don't need to do that. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:04:59 +02:00
David Sterba	114b806a73	btrfs: don't use token set/get accessors for btrfs_item members The token versions of set/get accessors will be removed, use the normal helpers. The btrfs_item members use that interface the most but there are no real benefits anymore. This reduces stack consumption on x86_64 release config: setup_items_for_insert -32 (144 -> 112) __push_leaf_left -32 (176 -> 144) btrfs_extend_item -16 (104 -> 88) copy_for_split -32 (144 -> 112) btrfs_del_items -16 (144 -> 128) btrfs_truncate_item -24 (152 -> 128) __push_leaf_right -24 (168 -> 144) and module size: text data bss dec hex filename 1463615 115665 16088 1595368 1857e8 pre/btrfs.ko 1463413 115665 16088 1595166 18571e post/btrfs.ko DELTA: -202 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:04:59 +02:00
Filipe Manana	60127c29f1	btrfs: qgroup: remove no longer used fs_info->qgroup_ulist It's not used anymore after commit `0913445082` ("btrfs: qgroup: use qgroup_iterator in qgroup_convert_meta()"), so remove it. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:04:59 +02:00
Filipe Manana	e124966775	btrfs: qgroup: fix race between quota disable and quota rescan ioctl There's a race between a task disabling quotas and another running the rescan ioctl that can result in a use-after-free of qgroup records from the fs_info->qgroup_tree rbtree. This happens as follows: 1) Task A enters btrfs_ioctl_quota_rescan() -> btrfs_qgroup_rescan(); 2) Task B enters btrfs_quota_disable() and calls btrfs_qgroup_wait_for_completion(), which does nothing because at that point fs_info->qgroup_rescan_running is false (it wasn't set yet by task A); 3) Task B calls btrfs_free_qgroup_config() which starts freeing qgroups from fs_info->qgroup_tree without taking the lock fs_info->qgroup_lock; 4) Task A enters qgroup_rescan_zero_tracking() which starts iterating the fs_info->qgroup_tree tree while holding fs_info->qgroup_lock, but task B is freeing qgroup records from that tree without holding the lock, resulting in a use-after-free. Fix this by taking fs_info->qgroup_lock at btrfs_free_qgroup_config(). Also at btrfs_qgroup_rescan() don't start the rescan worker if quotas were already disabled. Reported-by: cen zhang <zzzccc427@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAFRLqsV+cMDETFuzqdKSHk_FDm6tneea45krsHqPD6B3FetLpQ@mail.gmail.com/ CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:04:49 +02:00
Filipe Manana	c0d013495a	btrfs: clear dirty status from extent buffer on error at insert_new_root() If we failed to insert the tree mod log operation, we are not removing the dirty status from the allocated and dirtied extent buffer before we free it. Removing the dirty status is needed for several reasons such as to adjust the fs_info->dirty_metadata_bytes counter and remove the dirty status from the respective folios. So add the missing call to btrfs_clear_buffer_dirty(). Fixes: `f61aa7ba08` ("btrfs: do not BUG_ON() on tree mod log failure at insert_new_root()") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:02:09 +02:00
Johannes Thumshirn	9669fcb77e	btrfs: change dump_block_groups() in btrfs_dump_space_info() from int to bool btrfs_dump_space_info()'s parameter dump_block_groups is used as a boolean although it is defined as an integer. Change it from int to bool. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:05 +02:00
David Sterba	ab5fcbb1ad	btrfs: use pgoff_t for page index variables Any conversion of offsets in the logical or the physical mapping space of the pages is done by a shift and the target type should be pgoff_t (type of struct page::index). Fix the locations where it's still unsigned long. Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:05 +02:00
George Hu	afd1dacbd0	btrfs: replace nested usage of min & max with clamp in btrfs_compress_set_level() Refactor the btrfs_compress_set_level() function by replacing the nested usage of min() and max() macro with clamp() to simplify the code and improve readability. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: George Hu <integral@archlinux.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:05 +02:00
Dmitry Antipov	2fb5e56f52	btrfs: send: avoid extra calls to strlen() in gen_unique_name() Since 'snprintf()' returns the number of characters which would be emitted and output truncation is handled by 'ASSERT()', it should be safe to use that return value instead of the subsequent calls to 'strlen()' in 'gen_unique_name()'. This also reduces the module's text size. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1897006 161571 16136 2074713 1fa859 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1896848 161571 16136 2074555 1fa7bb fs/btrfs/btrfs.ko Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:05 +02:00
Filipe Manana	6633a416ed	btrfs: qgroup: avoid memory allocation if qgroups are not enabled At btrfs_qgroup_inherit() we allocate a qgroup record even if qgroups are not enabled, which is unnecessary overhead and can result in subvolume creation to fail with -ENOMEM, as create_subvol() calls this function. Improve on this by making btrfs_qgroup_inherit() check if qgroups are enabled earlier and return if they are not, avoiding the unnecessary memory allocation and taking some locks. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00
Filipe Manana	2fda07effb	btrfs: qgroup: remove pointless error check for add_qgroup_rb() call The add_qgroup_rb() function never returns an error pointer anymore since commit `8d54518b5e` ("btrfs: qgroup: pre-allocate btrfs_qgroup to reduce GFP_ATOMIC usage"), so checking for an error pointer result at btrfs_quota_enable() is pointless. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00
Filipe Manana	da7f005239	btrfs: split btrfs_is_fstree() into multiple if statements for readability Instead of a single if statement with multiple conditions, split it into several if statements testing only one condition at a time and return true or false immediately after. This makes it more immediate to reason. The module's text size also slightly decreases, at least with gcc 14.2.0 on x86_64. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1897138 161583 16136 2074857 1fa8e9 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename `1896976` 161583 16136 2074695 1fa847 fs/btrfs/btrfs.ko Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00
Filipe Manana	fd00922abc	btrfs: add btrfs prefix to is_fstree() and make it return bool This is an exported function and therefore it should have a 'btrfs_' prefix, to make it clear it's btrfs specific, avoid future name collisions with code outside btrfs, and make its naming consistent with most other btrfs exported functions. So add a 'btrfs_' prefix to it and make it return bool instead of int, since all we need is to return true or false. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00
Filipe Manana	0c6f37eaa5	btrfs: split inode extref processing from __add_inode_ref() into a helper The __add_inode_ref() function is quite big and with too much nesting, so move the code that processes inode extrefs into a helper function, to make the function easier to read and reduce the level of indentation too. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00
Filipe Manana	06f77c659e	btrfs: split inode ref processing from __add_inode_ref() into a helper The __add_inode_ref() function is quite big and with too much nesting, so move the code that processes inode refs into a helper function, to make the function easier to read and reduce the level of indentation too. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00
Filipe Manana	98060e1611	btrfs: use btrfs inodes in btrfs_rmdir() to avoid so much usage of BTRFS_I() Almost everywhere we want to use a btrfs inode and therefore we have a lot of calls to BTRFS_I(), making the code more verbose. Instead use btrfs inode local variables to avoid so much use of BTRFS_I(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00
Filipe Manana	9f82a4ed34	btrfs: use inode already stored in local variable at btrfs_rmdir() There's no need to call d_inode(dentry) when calling btrfs_unlink_inode() since we have already stored that in a local inode variable. So just use the local variable to make the code less verbose. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00
David Sterba	44cac52341	btrfs: use our message helpers instead of pr_err/pr_warn/pr_info Our message helpers accept NULL for the fs_info in the context that does not provide and print the common header of the message. The use of pr_* helpers is only for special reasons, like module loading, device scanning or multi-line output (print-tree). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00
Sun YangKai	27260dd190	btrfs: remove partial support for lowest level from btrfs_search_forward() Commit `323ac95bce` ("Btrfs: don't read leaf blocks containing only checksums during truncate") changed the condition from `level == 0` to `level == path->lowest_level`, while its original purpose was just to do some leaf node handling (calling btrfs_item_key_to_cpu()) and skip some code that doesn't fit leaf nodes. After changing the condition, the code path: 1. Also handles the non-leaf nodes when path->lowest_level is nonzero, which is wrong. However btrfs_search_forward() is never called with a nonzero path->lowest_level, which makes this bug not found before. 2. Makes the later if block with the same condition, which was originally used to handle non-leaf node (calling btrfs_node_key_to_cpu()) when lowest_level is not zero, dead code. Since btrfs_search_forward() is never called for a path with a lowest_level different from zero, just completely remove the partial support for a non-zero lowest_level, simplifying a bit the code, and assert that lowest_level is zero at the start of the function. Suggested-by: Qu Wenruo <wqu@suse.com> Fixes: `323ac95bce` ("Btrfs: don't read leaf blocks containing only checksums during truncate") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Sun YangKai <sunk67188@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00

1 2 3 4 5 ...

1369440 Commits