linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-21 20:45:27 -04:00

Author	SHA1	Message	Date
Miquel Sabaté Solà	f08d7147da	btrfs: use kmalloc_array() for open-coded arithmetic in kmalloc() As pointed out in the documentation, calling 'kmalloc' with open-coded arithmetic can lead to unfortunate overflows and this particular way of using it has been deprecated. Instead, it's preferred to use 'kmalloc_array' in cases where it might apply so an overflow check is performed. Note this is an API cleanup and is not fixing any overflows because in all cases the multipliers are bounded small numbers derived from number of items in leaves/nodes. Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:25 +02:00
Qu Wenruo	98077f7f21	btrfs: enable experimental bs > ps support With all the preparation patches, we're able to finally enable btrfs block size (sector size) larger than page size support and give it a full fstests run. And obviously this new feature is hidden behind experimental flags, and should not be considered as a core feature yet as btrfs' default block size is still 4K. But this is still a feature that will shine in the future where 16K block sized device are widely adopted. For now there are some features explicitly disabled: - Direct IO This is the most complex part to support, the root reason is we can not control the pages of iov iter passed in. User space programs can only ensure the virtual addresses are contiguous, but have no control on their physical addresses. Our bs > ps support heavily relies on large folios, and direct IO memory can easily break it. So direct IO is disabled and will always fall back to buffered IO. - RAID56 In theory we can convert RAID56 to use large folios, but it will need to be converted back to page based if we want to support direct IO in the future. So just reject it for now. - Encoded send - Encoded read Both are utilizing btrfs_encoded_read_regular_fill_pages(), and send is utilizing vmallocated memory. Unfortunately for vmallocated memory we can not guarantee the minimal folio order. For send, it will just always fallback to regular writes, which reads from page cache and will follow the existing folio order requirement. - Encoded write Encoded write itself is allocating pages by themselves, and we can easily change it to follow the minimal order. But since encoded read is already disabled, there is no need to only enable encoded write. Finally just like what we did for bs < ps support in the past, add a warning message for bs > ps mounts. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:25 +02:00
Qu Wenruo	e9bed72e88	btrfs: add extra ASSERT()s to catch unaligned bios Btrfs uses btrfs_bio to handle read/write of logical address, for the incoming bs > ps support, btrfs has extra requirements: - One folio must contain at least one fs block - No fs block can cross folio boundaries This requirement is not hard to maintain, thanks to the address space's minimal folio order. But not all btrfs bios are generated through address space, e.g. compression and scrub. To catch possible unaligned bios, introduce a helper, assert_bbio_alginment(), for each btrfs_bio in btrfs_submit_bbio(). This will check the following things: - bv_offset is aligned to block size - bv_len is aligned to block size With a btrfs bio passing above checks, unless it's empty it will ensure the requirements for bs > ps support. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:25 +02:00
Qu Wenruo	67378b7546	btrfs: fix symbolic link reading when bs > ps [BUG DURING BS > PS TEST] When running the following script on a btrfs whose block size is larger than page size, e.g. 8K block size and 4K page size, it will trigger a kernel BUG: # mkfs.btrfs -s 8k $dev # mount $dev $mnt # mkdir $mnt/dir # ln -s dir $mnt/link # ls $mnt/link The call trace looks like this: BTRFS warning (device dm-2): support for block size 8192 with page size 4096 is experimental, some features may be missing BTRFS info (device dm-2): checking UUID tree BTRFS info (device dm-2): enabling ssd optimizations BTRFS info (device dm-2): enabling free space tree ------------[ cut here ]------------ kernel BUG at /home/adam/linux/include/linux/highmem.h:275! Oops: invalid opcode: 0000 [#1] SMP CPU: 8 UID: 0 PID: 667 Comm: ls Tainted: G OE 6.17.0-rc4-custom+ #283 PREEMPT(full) Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022 RIP: 0010:zero_user_segments.constprop.0+0xdc/0xe0 [btrfs] Call Trace: <TASK> btrfs_get_extent.cold+0x85/0x101 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f] btrfs_do_readpage+0x244/0x750 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f] btrfs_read_folio+0x9c/0x100 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f] filemap_read_folio+0x37/0xe0 do_read_cache_folio+0x94/0x3e0 __page_get_link.isra.0+0x20/0x90 page_get_link+0x16/0x40 step_into+0x69b/0x830 path_lookupat+0xa7/0x170 filename_lookup+0xf7/0x200 ? set_ptes.isra.0+0x36/0x70 vfs_statx+0x7a/0x160 do_statx+0x63/0xa0 __x64_sys_statx+0x90/0xe0 do_syscall_64+0x82/0xae0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> Please note bs > ps support is still under development and the enablement patch is not even in btrfs development branch. [CAUSE] Btrfs reuses its data folio read path to handle symbolic links, as the symbolic link target is stored as an inline data extent. But for newly created inodes, btrfs only set the minimal order if the target inode is a regular file. Thus for above newly created symbolic link, it doesn't properly respect the minimal folio order, and triggered the above crash. [FIX] Call btrfs_set_inode_mapping_order() unconditionally inside btrfs_create_new_inode(). For symbolic links this will fix the crash as now the folio will meet the minimal order. For regular files this brings no change. For directory/bdev/char and all the other types of inodes, they won't go through the data read path, thus no effect either. Fixes: `cc38d178ff` ("btrfs: enable large data folio support under CONFIG_BTRFS_EXPERIMENTAL") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:25 +02:00
Qu Wenruo	5fbaae4b85	btrfs: prepare scrub to support bs > ps cases This involves: - Migrate scrub_stripe::pages[] to folios[] - Use btrfs_alloc_folio_array() and folio_put() to alloc above array. - Migrate scrub_stripe_get_kaddr() and scrub_stripe_get_paddr() to use folio interfaces - Migrate raid56_parity_cache_data_pages() to raid56_parity_cache_data_folios() Since scrub is the only caller still using pages. This helper will copy the folio array contents into rbio::stripe_pages, with sector uptodate flags updated. And a new ASSERT() to make sure bs > ps cases will not hit this path. Since most scrub code is based on kaddr/paddr, the migration itself is pretty straightforward. And since we're here, also move the loop to set the stripe_sectors[].uptodate out of the copy loop. As we always mark all the sectors as uptodate for the data stripe, it's easier to do in one go, other than doing it inside the copy loop. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:25 +02:00
Qu Wenruo	e88cb48e67	btrfs: prepare zlib to support bs > ps cases This involves converting the following functions to use correct folio sizes/shifts: - zlib_compress_folios() - zlib_decompress_bio() There is a special handling for s390 hardware acceleration. With bs > ps cases, we can go with 16K block size on s390 (which uses fixed 4K page size). In that case we do not need to do the buffer copy as our folio is large enough for hardware acceleration. So factor out the s390 specific and folio size check into a helper, need_special_buffer(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:25 +02:00
Qu Wenruo	4fd188a4fe	btrfs: prepare lzo to support bs > ps cases This involves converting the following functions to use correct folio sizes/shifts: - copy_compress_data_to_page() - lzo_compress_folios() - lzo_decompress_bio() Just like zstd, lzo has some extra incorrect usage of kmap_local_folio() that the offset is always 0. This will not handle HIGHMEM large folios correctly, but those cases are already rejected explicitly so it should not cause problems when bs > ps support is enabled. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:25 +02:00
Qu Wenruo	a6452b85b3	btrfs: prepare zstd to support bs > ps cases This involves converting the following functions to use proper folio sizes/shifts: - zstd_compress_folios() - zstd_decompress_bio() The function zstd_decompress() is already using block size correctly without using page size, thus it needs no modification. And since zstd compression is calling kmap_local_folio(), the existing code cannot handle large folios with HIGHMEM, as kmap_local_folio() requires us to handle one page range each time. I do not really think it's worth to spend time on some feature that will be deprecated eventually. So here just add an extra explicit rejection for bs > ps with HIGHMEM feature enabled kernels. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:24 +02:00
Qu Wenruo	c2ffb1ec1a	btrfs: prepare compression folio alloc/free for bs > ps cases This includes the following preparation for bs > ps cases: - Always alloc/free the folio directly if bs > ps This adds a new @fs_info parameter for btrfs_alloc_compr_folio(), thus affecting all compression algorithms. For btrfs_free_compr_folio() it needs no parameter for now, as we can use the folio size to skip the caching part. For now the change is just to passing a @fs_info into the function, all the folio size assumption is still based on page size. - Properly zero the last folio in compress_file_range() Since the compressed folios can be larger than a page, we need to properly zero the whole folio. - Use correct folio size for btrfs_add_compressed_bio_folios() Instead of page size, use the correct folio size. - Use correct folio size/shift for btrfs_compress_filemap_get_folio() As we are not only using simple page sized folios anymore. - Use correct folio size for btrfs_decompress() There is an ASSERT() making sure the decompressed range is no larger than a page, which will be triggered for bs > ps cases. - Skip readahead for compressed pages Similar to subpage cases. - Make btrfs_alloc_folio_array() to accept a new @order parameter - Add a helper to calculate the minimal folio size All those changes should not affect the existing bs <= ps handling. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:24 +02:00
Qu Wenruo	7b26da4074	btrfs: fix the incorrect max_bytes value for find_lock_delalloc_range() [BUG] With my local branch to enable bs > ps support for btrfs, sometimes I hit the following ASSERT() inside submit_one_sector(): ASSERT(block_start != EXTENT_MAP_HOLE); Please note that it's not yet possible to hit this ASSERT() in the wild yet, as it requires btrfs bs > ps support, which is not even in the development branch. But on the other hand, there is also a very low chance to hit above ASSERT() with bs < ps cases, so this is an existing bug affect not only the incoming bs > ps support but also the existing bs < ps support. [CAUSE] Firstly that ASSERT() means we're trying to submit a dirty block but without a real extent map nor ordered extent map backing it. Furthermore with extra debugging, the folio triggering such ASSERT() is always larger than the fs block size in my bs > ps case. (8K block size, 4K page size) After some more debugging, the ASSERT() is trigger by the following sequence: extent_writepage() \| We got a 32K folio (4 fs blocks) at file offset 0, and the fs block \| size is 8K, page size is 4K. \| And there is another 8K folio at file offset 32K, which is also \| dirty. \| So the filemap layout looks like the following: \| \| "\|\|" is the filio boundary in the filemap. \| "//\| is the dirty range. \| \| 0 8K 16K 24K 32K 40K \| \|////////\| \|//////////////////////\|\|////////\| \| \|- writepage_delalloc() \| \|- find_lock_delalloc_range() for [0, 8K) \| \| Now range [0, 8K) is properly locked. \| \| \| \|- find_lock_delalloc_range() for [16K, 40K) \| \| \|- btrfs_find_delalloc_range() returned range [16K, 40K) \| \| \|- lock_delalloc_folios() locked folio 0 successfully \| \| \| \| \| \| The filemap range [32K, 40K) got dropped from filemap. \| \| \| \| \| \|- lock_delalloc_folios() failed with -EAGAIN on folio 32K \| \| \| As the folio at 32K is dropped. \| \| \| \| \| \|- loops = 1; \| \| \|- max_bytes = PAGE_SIZE; \| \| \|- goto again; \| \| \| This will re-do the lookup for dirty delalloc ranges. \| \| \| \| \| \|- btrfs_find_delalloc_range() called with @max_bytes == 4K \| \| \| This is smaller than block size, so \| \| \| btrfs_find_delalloc_range() is unable to return any range. \| \| \- return false; \| \| \| \- Now only range [0, 8K) has an OE for it, but for dirty range \| [16K, 32K) it's dirty without an OE. \| This breaks the assumption that writepage_delalloc() will find \| and lock all dirty ranges inside the folio. \| \|- extent_writepage_io() \|- submit_one_sector() for [0, 8K) \| Succeeded \| \|- submit_one_sector() for [16K, 24K) Triggering the ASSERT(), as there is no OE, and the original extent map is a hole. Please note that, this also exposed the same problem for bs < ps support. E.g. with 64K page size and 4K block size. If we failed to lock a folio, and falls back into the "loops = 1;" branch, we will re-do the search using 64K as max_bytes. Which may fail again to lock the next folio, and exit early without handling all dirty blocks inside the folio. [FIX] Instead of using the fixed size PAGE_SIZE as @max_bytes, use @sectorsize, so that we are ensured to find and lock any remaining blocks inside the folio. And since we're here, add an extra ASSERT() to before calling btrfs_find_delalloc_range() to make sure the @max_bytes is at least no smaller than a block to avoid false negative. Cc: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:24 +02:00
Filipe Manana	62701f4190	btrfs: remove pointless key offset setup in create_pending_snapshot() There's no point in setting the key's offset to (u64)-1 since we never use it before setting it to the current transaction's ID. So remove the assignment of (u64)-1 to the key's offset and move the remainder of the key initialization close to where it's used. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:24 +02:00
Filipe Manana	db524fd980	btrfs: annotate btrfs_is_testing() as unlikely and make it return bool We can annotate btrfs_is_testing() as unlikely since that's the most expected scenario and it's desirable for the compiler to optimize for the case we are not running the self tests. So add the annotation to btrfs_is_testing() and while at it also make it return bool instead of int. Also make two of the existing callers use btrfs_is_testing() directly instead of storing its result in a local variable. On x86_64 with Debian's gcc 14.2.0-19 this resulted in a very tiny object code reduction. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1913263 161567 15592 2090422 1fe5b6 fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1913257 161567 15592 2090416 1fe5b0 fs/btrfs/btrfs.ko Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:24 +02:00
Filipe Manana	f07575bab6	btrfs: make the rule checking more readable for should_cow_block() It's quite hard and unreadable the way the rule checks are organized in should_cow_block(). We have a single if statement that returns 0 (false) and it checks several conditions, with one them being a negated compound condition which is particularly hard to reason immediately. Improve on this by using multiple if statements, each checking a single condition and returning immediately. Also change the return type from an integer to a boolean, since all we need is to return true or false. At least on x86_64 with Debian's gcc 14.2.0-19, this also reduces the object code size by 64 bytes. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1913327 161567 15592 2090486 1fe5f6 fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1913263 161567 15592 2090422 1fe5b6 fs/btrfs/btrfs.ko Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:24 +02:00
Filipe Manana	b7ff7b0d76	btrfs: simplify inline extent end calculation at replay_one_extent() There is no need to store the extent's ram_bytes in two variables, further more one of them, named 'size', is used only for the extent's end offset calculation. So remove the 'size' variable and use 'nbytes' only. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:24 +02:00
Filipe Manana	a35b3dd59b	btrfs: fix comment about nbytes increase at replay_one_extent() The comment is wrong about the part where it says a prealloc extent does not contribute to an inode's nbytes - it does. Only holes don't contribute and that's what we are checking for, as prealloc extents always have a disk_bytenr different from 0. So fix the comment and re-organize the code to not set nbytes twice and set it to the extent item's number of bytes only if it doesn't represent a hole - in case it's a hole we have already initialized nbytes to 0 when we declared it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:23 +02:00
Qu Wenruo	2d83ed6c6c	btrfs: return any hit error from extent_writepage_io() Since the support of bs < ps support, extent_writepage_io() will submit multiple blocks inside the folio. But if we hit error submitting one sector, but the next sector can still be submitted successfully, the function extent_writepage_io() will still return 0. This will make btrfs to silently ignore the error without setting error flag for the filemap. Fix it by recording the first error hit, and always return that value. Fixes: `8bf334beb3` ("btrfs: fix double accounting race when extent_writepage_io() failed") Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:23 +02:00
Filipe Manana	5afe85b771	btrfs: mark leaf space and overflow checks as unlikely on insert and extension We have several sanity checks when inserting or extending items in a btree that verify we didn't overflow the leaf or access a slot beyond the last one. These are cases that are never expected to be hit so mark them as unlikely, allowing the compiler to potentially generate better code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:23 +02:00
Filipe Manana	b0e30e373e	btrfs: mark as unlikely not uptodate extent buffer checks when navigating btrees We expect that after attempting to read an extent buffer we had no errors therefore the extent buffer is up to date, so mark the checks for a not up to date extent buffer as unlikely and allow the compiler to pontentially generate better code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:23 +02:00
Filipe Manana	8f0534ec96	btrfs: mark extent buffer alignment checks as unlikely We are not expecting to ever fail the extent buffer alignment checks, so mark them as unlikely to allow the compiler to potentially generate more optimized code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:23 +02:00
Filipe Manana	6a9e1d1a65	btrfs: store and use node size in local variable in check_eb_alignment() Instead of dereferencing fs_info every time we need to access the node size, store in a local variable to make the code less verbose and avoid a line split too. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:23 +02:00
Filipe Manana	26baec69ac	btrfs: print-tree: print key types as human readable strings Looking at a leaf dump from the kernel's print-tree implementation is not so friendly to analyze since key types are printed as numbers. Improve on this by printing key types as strings that are a diminutive of the macro names for key types, just like we do in btrfs-progs. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:23 +02:00
Filipe Manana	00b7eaaaa5	btrfs: print-tree: move code for processing file extent item into helper The code for processing file extent items is quite large and it's better to have it in a dedicated helper rather than in a huge switch statement, just like we do in btrfs-progs. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:23 +02:00
Filipe Manana	caac170737	btrfs: print-tree: print compression type for file extent items We are not printing anything about the compression type, so add that useful information in the same format as btrfs-progs. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:22 +02:00
Filipe Manana	c1b9a4782b	btrfs: print-tree: print correct inline extent data size We are advertising the ram_bytes of an inline extent as its data size, but that is not true for compressed extents. The ram_bytes corresponds to the uncompressed data size while the data size (compressed data) is given by btrfs_file_extent_inline_item_len(). So fix this and print both values in the same format as in btrfs-progs. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:22 +02:00
Filipe Manana	4dc1c3d0ae	btrfs: print-tree: print range information for extent csum items Currently we don't print anything for extent csum items other than the generic line with the key, item offset and item size. While one can still determine the range the extent csum covers by doing a few simple computations, it makes it more time consuming to analyse a leaf dump. So add a line that prints information about the range covered by the checksum using the same format as btrfs-progs. This is useful when debugging log tree issues since we log extent csum items for new extents. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:22 +02:00
Filipe Manana	7d2197b5dc	btrfs: print-tree: print information about dir log items We currently don't print information about dir log items (other than the key, item offset and item size), which is useful to look at when debugging problems with a log tree. So print their specific information (currently they only have an end index number) in a format similar to btrfs-progs. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:22 +02:00
Filipe Manana	7317555f45	btrfs: print-tree: print information about inode extref items Currently we ignore inode extref items, we just print their key, item offset in the leaf and their size, no information about their content like the index number, parent inode, name length and name. Improve on this by printing the index, parent and name length in the same format as btrfs-progs. Note that we don't print the name, as that would require some processing and escaping like we do in btrfs-progs, and that could expose sensitive information for some users in case they share their dmesg/syslog and it contains a leaf dump. So for now leave names out. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:22 +02:00
Filipe Manana	cee3aa1387	btrfs: print-tree: print information about inode ref items Currently we ignore inode ref items, we just print their key, item offset in the leaf and their size, no information about their content like the index number, name length and name. Improve on this by printing the index and name length in the same format as btrfs-progs. Note that we don't print the name, as that would require some processing and escaping like we do in btrfs-progs, and that could expose sensitive information for some users in case they share their dmesg/syslog and it contains a leaf dump. So for now leave names out. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:22 +02:00
Filipe Manana	93f818e62a	btrfs: print-tree: print dir items for dir index and xattr keys too Currently we only print the dir items for BTRFS_DIR_ITEM_KEY keys, but we also have dir items for BTRFS_DIR_INDEX_KEY and BTRFS_XATTR_ITEM_KEY keys too. So print them for those keys too. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:22 +02:00
Filipe Manana	96fb032238	btrfs: print-tree: print more information about dir items Currently we only print the object id component of the location key from a dir item and the flags. We are missing the whole key, transid and the name and data lengths. We are also ignoring the fact that we can have multiple dir item objects encoded in a single item for a BTRFS_DIR_ITEM_KEY key, so what we print is only for the first item. Improve on this by iterating on all dir items and print the missing information. This is done with the same format as in btrfs-progs, what we miss is printing the names and data since not only that would require some processing and escaping like in btrfs-progs, but it would also reveal information that may be sensitive and users may not want to share that in case that get a leaf dumped in dmesg. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:22 +02:00
Filipe Manana	ac9affd899	btrfs: print-tree: print missing fields for inode items We are not dumping a lot of fields for an inode item which are useful for debugging whenever we dump a leaf (log replay failure for example), so add them and make it as close as possible to the print tree implementation in btrfs-progs (things like converting timespecs to human readable dates and converting flags to strings are missing since they are not so practical to do in the kernel). Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:21 +02:00
Qu Wenruo	aab9458b9f	btrfs: tree-checker: add inode extref checks Like inode refs, inode extrefs have a variable length name, which means we have to do a proper check to make sure no header nor name can exceed the item limits. The check itself is very similar to check_inode_ref(), just a different structure (btrfs_inode_extref vs btrfs_inode_ref). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:21 +02:00
Filipe Manana	0dc93e4652	btrfs: send: index backref cache by node number instead of by sector number We now have a nodesize_bits member in fs_info so we can index an extent buffer in the backref cache by node number instead of by sector number. While this allows for a denser index space with the possibility of using less maple tree nodes, in practice it's unlikely to hit such benefits since we currently limit the maximum number of keys in the cache to 128, so unless all extent buffers are contiguous we are unlikely to see a memory usage reduction in the backing maple tree due to fewer nodes. Nevertheless it doesn't cost anything to index by node number and it's more logical. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:21 +02:00
Filipe Manana	2753e49176	btrfs: dump detailed info and specific messages on log replay failures Currently debugging log replay failures can be harder than needed, since all we do now is abort a transaction, which gives us a line number, a stack trace and an error code. But that is most of the times not enough to give some clue about what went wrong. So add a new helper to abort log replay and provide contextual information: 1) Dump the current leaf of the log tree being processed and print the slot we are currently at and the key at that slot; 2) Dump the current subvolume tree leaf if we have any; 3) Print the current stage of log replay; 4) Print the id of the subvolume root associated with the log tree we are currently processing (as we can have multiple); 5) Print some error message to mention what we were trying to do when we got an error. Replace all transaction abort calls (btrfs_abort_transaction()) with the new helper btrfs_abort_log_replay(), which besides dumping all that extra information, it also aborts the current transaction. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:21 +02:00
Filipe Manana	5a0565cad3	btrfs: abort transaction if we fail to update inode in log replay dir fixup If we fail to update the inode at link_to_fixup_dir(), we don't abort the transaction and propagate the error up the call chain, which makes it hard to pinpoint the error to the inode update. So abort the transaction if the inode update call fails, so that if it happens we known immediately. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:21 +02:00
Filipe Manana	0b7453b7a1	btrfs: abort transaction if we fail to find dir item during log replay At __add_inode_ref() if we get an error when trying to lookup a dir item we don't abort the transaction and propagate the error up the call chain, so that somewhere else up in the call chain the transaction is aborted. This however makes it hard to know that the failure comes from looking up a dir item, so add a transaction abort in case we fail there, so that we immediately pinpoint where the problem comes from during log replay. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:21 +02:00
Filipe Manana	e41c5e611a	btrfs: remove pointless inode lookup when processing extrefs during log replay At unlink_extrefs_not_in_log() we do an inode lookup of the directory but we already have the directory inode accessible as a function argument, so the lookup is redudant. Remove it and use the directory inode passed in as an argument. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:21 +02:00
Filipe Manana	bd9c063e6f	btrfs: stop passing inode object IDs to __add_inode_ref() in log replay There's no point in passing the inode and parent inode object IDs to __add_inode_ref() and its helpers because we can get them by calling btrfs_ino() against the inode and the directory inode, and we pass both inodes to __add_inode_ref() and its helpers. So remove the object IDs parameters to reduce arguments passed and to make things less confusing. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:20 +02:00
Filipe Manana	1ebeee283a	btrfs: add path for subvolume tree changes to struct walk_control While replaying log trees we need to do searches and updates to subvolume trees and for that we use a path that we allocate in replay_one_buffer() and then pass it as a parameter to other functions deeper in the log replay call chain. Instead of passing it as parameter, add it to struct walk_control since we pass a pointer to that structure for every log replay function. This reduces the number of arguments passed to the functions and it will be needed and important for an upcoming changes that improves error reporting for log replay. Also name the new filed in struct walk_control to 'subvol_path' - while that is longer to type, the naming makes it clear it's used for subvolume tree operations since many of the log replay functions operate both on subvolume and log trees, and for the log tree searches we have struct walk_control::log_leaf to also make it obvious it's an extent buffer for a log tree extent buffer. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:20 +02:00
Filipe Manana	f9c02e4b52	btrfs: remove redundant path release when overwriting item during log replay At overwrite_item() we have a redundant btrfs_release_path() just before failing with -ENOMEM, as the caller who passed in the path will free it and therefore also release any refcounts and locks on the extent buffers of the path. So remove it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:20 +02:00
Filipe Manana	9bdfa3eddb	btrfs: remove redundant path release when processing dentry during log replay At replay_one_one() we have a redundant btrfs_release_path() just before calling insert_one_name(), as some lines above we have already released the path with another btrfs_release_path() call. So remove it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:20 +02:00
Filipe Manana	f366722f33	btrfs: avoid unnecessary path allocation when replaying a dir item There's no need to allocate 'fixup_path' at replay_one_dir_item(), as the path passed as an argument is unused by the time link_to_fixup_dir() is called (replay_one_name() releases the path before it returns). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:20 +02:00
Filipe Manana	b343047c1a	btrfs: avoid path allocations when dropping extents during log replay We can avoid a path allocation in the btrfs_drop_extents() calls we have at replay_one_extent() and replay_one_buffer() by passing the path we already have in those contextes as it's unused by the time they call btrfs_drop_extents(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:20 +02:00
Filipe Manana	29d9c5e037	btrfs: avoid unnecessary path allocation at fixup_inode_link_count() There's no need to allocate a path as our single caller already has a path that we can use. So pass the caller's path and use it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:20 +02:00
Filipe Manana	2ac7094662	btrfs: add current log leaf, key and slot to struct walk_control A lot of the log replay functions get passed the current log leaf being processed as well as the current slot and the key at that slot. Instead of passing them as parameters, add them to struct walk_control so that we reduce the numbers of parameters. This is also going to be needed to further changes that improve error reporting during log replay. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:20 +02:00
Filipe Manana	2a13cfc949	btrfs: use the inode item boolean everywhere in overwrite_item() We have this boolean 'inode_item' to tell if we are processing an inode item key and we use it in a couple of places while in another two places we open code by checking if the key type matches the inode item type. Make this consistent and use the boolean everywhere. Also rename it from 'inode_item' to 'is_inode_item', which makes it more clear that it's a boolean and not an instance of struct btrfs_inode_item, and make it const too. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:19 +02:00
Filipe Manana	6cb7f0b8c9	btrfs: use level argument in log tree walk callback replay_one_buffer() We already have the extent buffer's level in an argument, there's no need to first ensure the extent buffer's data is loaded (by calling btrfs_read_extent_buffer()) and then call btrfs_header_level() to check the level. So use the level argument and do the check before calling btrfs_read_extent_buffer(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:19 +02:00
Filipe Manana	c2ef817b28	btrfs: use level argument in log tree walk callback process_one_buffer() We already have the extent buffer's level in an argument, there's no need to call btrfs_header_level(). So use the level argument and make the code shorter. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:19 +02:00
Filipe Manana	266967c0e2	btrfs: pass walk_control structure to overwrite_item() Instead of passing the transaction and subvolume root as arguments to overwrite_item(), pass the walk_control structure as we can grab them from the structure. This reduces the number of arguments passed and it's going to be needed by an incoming change that improves error reporting for log replay. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:19 +02:00
Filipe Manana	aa5b6635b0	btrfs: pass walk_control structure to drop_one_dir_item() and helpers Instead of passing the transaction as an argument to drop_one_dir_item() and its helpers (link_to_fixup_dir() and unlink_inode_for_log_replay()), pass the walk_control structure as we can access the transaction from it and the subvolume root. This is going to be needed by an incoming change that improves error reporting for log replay and also reduces the number of arguments passed to link_to_fixup_dir(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:19 +02:00

... 11 12 13 14 15 ...

102029 Commits