linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-21 21:54:52 -04:00

Author	SHA1	Message	Date
Filipe Manana	744e0cebb4	btrfs: pass walk_control structure to replay_one_dir_item() and replay_one_name() Instead of passing the transaction and subvolume root and log tree as arguments, pass the walk_control structure as we can grab all of those from the structure. This reduces the number of arguments passed and it's going to be needed by an incoming change that improves error reporting for log replay. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:19 +02:00
Filipe Manana	44463eb079	btrfs: pass walk_control structure to add_inode_ref() and helpers Instead of passing the transaction, subvolume root and log tree as arguments to add_inode_ref() and its helpers (__add_inode_ref(), unlink_refs_not_in_log(), unlink_extrefs_not_in_log() and unlink_old_inode_refs()), pass the walk_control structure as we can access all of those from the structure. This reduces the number of arguments passed and it's going to be needed by an incoming change that improves error reporting for log replay. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:19 +02:00
Filipe Manana	c7da72022b	btrfs: pass walk_control structure to replay_one_extent() Instead of passing the transaction and subvolume root as arguments to replay_one_extent(), pass the walk_control structure as we can grab all of those from the structure. This reduces the number of arguments passed and it's going to be needed by an incoming change that improves error reporting for log replay. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:19 +02:00
Filipe Manana	b150f1c321	btrfs: pass walk_control structure to check_item_in_log() Instead of passing the transaction and log tree as arguments to check_item_in_log(), pass the walk_control structure as we can grab those from the structure. This reduces the number of arguments passed and it's going to be needed by an incoming change that improves error reporting for log replay. Notice that a NULL log root argument to check_item_in_log() makes it unconditionally delete a directory entry, so since the walk_control always has a non-NULL log root, we add an extra boolean to check_item_in_log() to tell it if it should unconditionally delete a directory entry, preserving the behaviour and also making it a bit more clear. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:18 +02:00
Filipe Manana	82d1db6f46	btrfs: pass walk_control structure to replay_dir_deletes() Instead of passing the transaction, subvolume root and log tree as arguments to replay_dir_deletes(), pass the walk_control structure as we can grab all of those from the structure. This reduces the number of arguments passed and it's going to be needed by an incoming change that improves error reporting for log replay. This also requires changing fixup_inode_link_counts() and fixup_inode_link_count() to take that structure as an argument since fixup_inode_link_count() makes a call to replay_dir_deletes(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:18 +02:00
Filipe Manana	94a5ac668a	btrfs: move up the definition of struct walk_control In upcoming changes we need to pass struct walk_control as an argument to replay_dir_deletes() and link_to_fixup_dir() so we need to move its definition above the prototypes of those functions. So move it up right below the enum that defines log replay stages and before any functions and function prototypes are declared. Also fixup the comments while moving it so that they comply with the preferred code style (capitalize the first word in a sentence, end sentences with punctuation, makes lines wider and closer to the 80 characters limit). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:18 +02:00
Filipe Manana	7790a882ca	btrfs: pass walk_control structure to replay_xattr_deletes() Instead of passing the transaction, subvolume root and log tree as arguments to replay_xattr_deletes(), pass the walk_control structure as we can grab all of those from the structure. This reduces the number of arguments passed and it's going to be needed by an incoming change that improves error reporting for log replay. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:18 +02:00
Filipe Manana	2f5b8095ea	btrfs: always drop log root tree reference in btrfs_replay_log() Currently we have this odd behaviour: 1) At btrfs_replay_log() we drop the reference of the log root tree if the call to btrfs_recover_log_trees() failed; 2) But if the call to btrfs_recover_log_trees() did not fail, we don't drop the reference in btrfs_replay_log() - we expect that btrfs_recover_log_trees() does it in case it returns success. Let's simplify this and make btrfs_replay_log() always drop the reference on the log root tree, not only this simplifies code as it's what makes sense since it's btrfs_replay_log() who grabbed the reference in the first place. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:18 +02:00
Filipe Manana	4b7699f406	btrfs: stop setting log_root_tree->log_root to NULL in btrfs_recover_log_trees() There's no point in setting log_root_tree->log_root to NULL as this is already NULL, we never assigned anything to it before and it's meaningless as a log root never has a value other than NULL for the ->log_root field, that can be not NULL only for non log roots. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:18 +02:00
Filipe Manana	d73896a55c	btrfs: stop passing transaction parameter to log tree walk functions It's unncessary to pass a transaction parameter since struct walk_control already has a member that points to the transaction, so we can make the functions access the structure. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:18 +02:00
Filipe Manana	7f09699e5e	btrfs: deduplicate log root free in error paths from btrfs_recover_log_trees() Instead of duplicating the dropping of a log tree in case we jump to the 'error' label, move the dropping under the 'error' label and get rid of the the unnecessary setting of the log root to NULL since we return immediately after. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:18 +02:00
Filipe Manana	efa44fc4fd	btrfs: add and use a log root field to struct walk_control Instead of passing an extra log root parameter for the log tree walk functions and callbacks, add the log tree to struct walk_control and make those functions and callbacks extract the log root from that structure, reducing the number of parameters. This also simplifies further upcoming changes to report log tree replay failures. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:17 +02:00
Filipe Manana	60ac80242b	btrfs: rename root to log in walk_down_log_tree() and walk_up_log_tree() Everywhere we have a log root we name it as 'log' or 'log_root' except in walk_down_log_tree() and walk_up_log_tree() where we name it as 'root', which not only it's inconsistent, it's also confusing since we typically use 'root' when naming variables that refer to a subvolume tree. So for clairty and consistency rename the 'root' argument to 'log'. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:17 +02:00
Filipe Manana	2c123db1f0	btrfs: rename replay_dest member of struct walk_control to root Everywhere else we refer to a subvolume root we are replaying to simply as 'root', so rename from 'replay_dest' to 'root' for consistency and having a more meaningful and shorter name. While at it also update the comment to be more detailed and comply to preferred style (first word in a sentence is capitalized and sentence ends with punctuation). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:17 +02:00
Filipe Manana	6803bff896	btrfs: use booleans in walk control structure for log replay The 'free' and 'pin' member of struct walk_control, used during log replay and when freeing a log tree, are defined as integers but in practice are used as booleans. Change their type to bool and while at it update their comments to be more detailed and comply with the preferred comment style (first word in a sentence is capitalized, sentences end with punctuation and the comment opening (/*) is on a line of its own). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:17 +02:00
Qu Wenruo	ea77a1c1c7	btrfs: cache max and min order inside btrfs_fs_info Inside btrfs_fs_info we cache several bits shift like sectorsize_bits. Apply this to max and min folio orders so that every time mapping order needs to be applied we can skip the calculation. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:17 +02:00
Qu Wenruo	7425a28940	btrfs: introduce btrfs_bio_for_each_block_all() helper Currently if we want to iterate all blocks inside a bio, we do something like this: bio_for_each_segment_all(bvec, bio, iter_all) { for (off = 0; off < bvec->bv_len; off += sectorsize) { /* Iterate blocks using bv + off / } } That's fine for now, but it will not handle future bs > ps, as bio_for_each_segment_all() is a single-page iterator, it will always return a bvec that's no larger than a page. But for bs > ps cases, we need a full folio (which covers at least one block) so that we can work on the block. To address this problem and handle future bs > ps cases better: - Introduce a helper btrfs_bio_for_each_block_all() This helper will create a local bvec_iter, which has the size of the target bio. Then grab the current physical address of the current location, then advance the iterator by block size. - Use btrfs_bio_for_each_block_all() to replace existing call sites Including: set_bio_pages_uptodate() in raid56 * verify_bio_data_sectors() in raid56 Both will result much easier to read code. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:17 +02:00
Qu Wenruo	9afc617265	btrfs: introduce btrfs_bio_for_each_block() helper Currently if we want to iterate a bio in block unit, we do something like this: while (iter->bi_size) { struct bio_vec bv = bio_iter_iovec(); /* Do something with using the bv / bio_advance_iter_single(&bbio->bio, iter, sectorsize); } That's fine for now, but it will not handle future bs > ps, as bio_iter_iovec() returns a single-page bvec, meaning the bv_len will not exceed page size. This means the code using that bv can only handle a block if bs <= ps. To address this problem and handle future bs > ps cases better: - Introduce a helper btrfs_bio_for_each_block() Instead of bio_vec, which has single and multiple page version and multiple page version has quite some limits, use my favorite way to represent a block, phys_addr_t. For bs <= ps cases, nothing is changed, except we will do a very small overhead to convert phys_addr_t to a folio, then use the proper folio helpers to handle the possible highmem cases. For bs > ps cases, all blocks will be backed by large folios, meaning every folio will cover at least one block. And still use proper folio helpers to handle highmem cases. With phys_addr_t, we will handle both large folio and highmem properly. So there is no better single variable to present a btrfs block than phys_addr_t. - Extract the data block csum calculation into a helper The new helper, btrfs_calculate_block_csum() will be utilized by btrfs_csum_one_bio(). - Use btrfs_bio_for_each_block() to replace existing call sites Including: index_one_bio() from raid56.c Very straight-forward. * btrfs_check_read_bio() Also update repair_one_sector() to grab the folio using phys_addr_t, and do extra checks to make sure the folio covers at least one block. We do not need to bother bv_len at all now. * btrfs_csum_one_bio() Now we can move the highmem handling into a dedicated helper, calculate_block_csum(), and use btrfs_bio_for_each_block() helper. There is one exception in btrfs_decompress_buf2page(), which is copying decompressed data into the original bio, which is not iterating using block size thus we don't need to bother. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:17 +02:00
Qu Wenruo	35aff706dc	btrfs: concentrate highmem handling for data verification Currently for btrfs checksum verification, we do it in the following pattern: kaddr = kmap_local_*(); ret = btrfs_check_csum_csum(kaddr); kunmap_local(kaddr); It's OK for now, but it's still not following the patterns of helpers inside linux/highmem.h, which never requires a virt memory address. In those highmem helpers, they mostly accept a folio, some offset/length inside the folio, and in the implementation they check if the folio needs partial kmap, and do the handling. Inspired by those formal highmem helpers, enhance the highmem handling of data checksum verification by: - Rename btrfs_check_sector_csum() to btrfs_check_block_csum() To follow the more common term "block" used in all other major filesystems. - Pass a physical address into btrfs_check_block_csum() and btrfs_data_csum_ok() The physical address is always available even for a highmem page. Since it's page frame number << PAGE_SHIFT + offset in page. And with that physical address, we can grab the folio covering the page, and do extra checks to ensure it covers at least one block. This also allows us to do the kmap inside btrfs_check_block_csum(). This means all the extra HIGHMEM handling will be concentrated into btrfs_check_block_csum(), and no callers will need to bother highmem by themselves. - Properly zero out the block if csum mismatch Since btrfs_data_csum_ok() only got a paddr, we can not and should not use memzero_bvec(), which only accepts single page bvec. Instead use paddr to grab the folio and call folio_zero_range() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:16 +02:00
Qu Wenruo	2ccfaf7369	btrfs: support all block sizes which is no larger than page size Currently if block size < page size, btrfs only supports one single config, 4K. This is mostly to reduce the test configurations, as 4K is going to be the default block size for all architectures. However all other major filesystems have no artificial limits on the support block size, and some are already supporting block size > page sizes. Since the btrfs subpage block support has been there for a long time, it's time for us to enable all block size <= page size support. So here enable all block sizes support as long as it's no larger than page size for experimental builds. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:16 +02:00
Thorsten Blum	a7f3dfb829	btrfs: scrub: replace max_t()/min_t() with clamp() in scrub_throttle_dev_io() Replace max_t() followed by min_t() with a single clamp(). As was pointed by David Laight in https://lore.kernel.org/linux-btrfs/20250906122458.75dfc8f0@pumpkin/ the calculation may overflow u32 when the input value is too large, so clamp_t() is not used. In practice the expected values are in range of megabytes to gigabytes (throughput limit) so the bug would not happen. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: David Sterba <dsterba@suse.com> [ Use clamp() and add explanation. ] Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:16 +02:00
David Sterba	17dc82dc1e	btrfs: fix typos in comments and strings Annual typo fixing pass. Strangely codespell found only about 30% of what is in this patch, the rest was done manually using text spellchecker with a custom dictionary of acceptable terms. Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:16 +02:00
Qu Wenruo	74e8f002b7	btrfs: reduce compression workspace buffer space to block size Currently the compression workspace buffer size is always based on PAGE_SIZE, but btrfs has support subpage sized block size for some time. This means for one-shot compression algorithm like lzo, we're wasting quite some memory if the block size is smaller than page size, as the LZO only works on one block (thus one-shot). On 64K page sized systems with 4K block size, it means we only need at most 8K buffer space for lzo, but in reality we're allocating 64K buffer. So to reduce the memory usage, change all workspace buffer to base its size based on block size other than page size. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:16 +02:00
Qu Wenruo	0d0b80929e	btrfs: rename btrfs_compress_op to btrfs_compress_levels Since all workspace managers are per-fs, there is no need nor no way to store them inside btrfs_compress_op::wsm anymore. With that said, we can do the following modifications: - Remove zstd_workspace_mananger::ops Zstd always grab the global btrfs_compress_op[]. - Remove btrfs_compress_op::wsm member - Rename btrfs_compress_op to btrfs_compress_levels This should make it more clear that btrfs_compress_levels structures are only to indicate the levels of each compress algorithm. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:16 +02:00
Qu Wenruo	9c8f4cf456	btrfs: cleanup the per-module compression workspace managers Since all workspaces are handled by the per-fs workspace managers, we can safely remove the old per-module managers. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:16 +02:00
Qu Wenruo	856d46c313	btrfs: migrate to use per-fs workspace manager There are several interfaces involved for each algorithm: - alloc workspace All algorithms allocate a workspace without the need for workspace manager. So no change needs to be done. - get workspace This involves checking the workspace manager to find a free one, and if not, allocate a new one. For none and lzo, they share the same generic btrfs_get_workspace() helper, only needs to update that function to use the per-fs manager. For zlib it uses a wrapper around btrfs_get_workspace(), so no special work needed. For zstd, update zstd_find_workspace() and zstd_get_workspace() to utilize the per-fs manager. - put workspace For none/zlib/lzo they share the same btrfs_put_workspace(), update that function to use the per-fs manager. For zstd, it's zstd_put_workspace(), the same update. - zstd specific timer This is the timer to reclaim workspace, change it to grab the per-fs workspace manager instead. Now all workspace are managed by the per-fs manager. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:15 +02:00
Qu Wenruo	6f9c3f48ac	btrfs: add generic workspace manager initialization This involves: - Add (alloc\|free)_workspace_manager helpers. These are the helper to alloc/free workspace_manager structure. The allocator will allocate a workspace_manager structure, initialize it, and pre-allocate one workspace for it. The freer will do the cleanup and set the manager pointer to NULL. - Call alloc_workspace_manager() inside btrfs_alloc_compress_wsm() - Call alloc_workspace_manager() inside btrfs_free_compress_wsm() For none, zlib and lzo compression algorithms. For now the generic per-fs workspace managers won't really have any effect, and all compression is still going through the global workspace manager. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:15 +02:00
Qu Wenruo	330f02b136	btrfs: add workspace manager initialization for zstd This involves: - Add zstd_alloc_workspace_manager() and zstd_free_workspace_manager() Those two functions will accept an fs_info pointer, and alloc/free fs_info->compr_wsm[BTRFS_COMPRESS_ZSTD] pointer. - Add btrfs_alloc_compress_wsm() and btrfs_free_compress_wsm() Those are helpers allocating the workspace managers for all algorithms. For now only zstd is supported, and the timing is a little unusual, the btrfs_alloc_compress_wsm() should only be called after the sectorsize being initialized. Meanwhile btrfs_free_fs_info_compress() is called in btrfs_free_fs_info(). - Move the definition of btrfs_compression_type to "fs.h" The reason is that "compression.h" has already included "fs.h", thus we can not just include "compression.h" to get the definition of BTRFS_NR_COMPRESS_TYPES to define fs_info::compr_wsm[]. For now the per-fs zstd workspace manager won't really have any effect, and all compression is still going through the global workspace manager. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:15 +02:00
Qu Wenruo	2c5cca03c1	btrfs: add an fs_info parameter for compression workspace manager [BACKGROUND] Currently btrfs shares workspaces and their managers for all filesystems, this is mostly fine as all those workspaces are using page size based buffers, and btrfs only support block size (bs) <= page size (ps). This means even if bs < ps, we at most waste some buffer space in the workspace, but everything will still work fine. The problem here is that is limiting our support for bs > ps cases. As now a workspace now may need larger buffer to handle bs > ps cases, but since the pool has no way to distinguish different workspaces, a regular workspace (which is still using buffer size based on ps) can be passed to a btrfs whose bs > ps. In that case the buffer is not large enough, and will cause various problems. [ENHANCEMENT] To prepare for the per-fs workspace migration, add an fs_info parameter to all workspace related functions. For now this new fs_info parameter is not yet utilized. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:15 +02:00
Qu Wenruo	737852c060	btrfs: keep folios locked inside run_delalloc_nocow() [BUG] There is a very low chance that DEBUG_WARN() inside btrfs_writepage_cow_fixup() can be triggered when CONFIG_BTRFS_EXPERIMENTAL is enabled. This only happens after run_delalloc_nocow() failed. Unfortunately I haven't hit it for a while thus no real world dmesg for now. [CAUSE] There is a race window where after run_delalloc_nocow() failed, error handling can race with writeback thread. Before we hit run_delalloc_nocow(), there is an inode with the following dirty pages: (4K page size, 4K block size, no large folio) 0 4K 8K 12K 16K \|/////////\|///////////\|///////////\|////////////\| The inode also have NODATACOW flag, and the above dirty range will go through different extents during run_delalloc_range(): 0 4K 8K 12K 16K \| NOCOW \| COW \| COW \| NOCOW \| The race happen like this: writeback thread A \| writeback thread B ----------------------------------+-------------------------------------- Writeback for folio 0 \| run_delalloc_nocow() \| \|- nocow_one_range() \| \| For range [0, 4K), ret = 0 \| \| \| \|- fallback_to_cow() \| \| For range [4K, 8K), ret = 0 \| \| Folio 4K UNLOCKED \| \| \| Writeback for folio 4K \|- fallback_to_cow() \| extent_writepage() \| For range [8K, 12K), failure \| \|- writepage_delalloc() \| \| \| \|- btrfs_cleanup_ordered_extents()\| \| \|- btrfs_folio_clear_ordered() \| \| \| Folio 0 still locked, safe \| \| \| \| \| Ordered extent already allocated. \| \| \| Nothing to do. \| \| \|- extent_writepage_io() \| \| \|- btrfs_writepage_cow_fixup() \|- btrfs_folio_clear_ordered() \| \| Folio 4K hold by thread B, \| \| UNSAFE! \| \|- btrfs_test_ordered() \| \| Cleared by thread A, \| \| \| \|- DEBUG_WARN(); This is only possible after run_delalloc_nocow() failure, as cow_file_range() will keep all folios and io tree range locked, until everything is finished or after error handling. The root cause is we allow fallback_to_cow() and nocow_one_range() to unlock the folios after a successful run, so that during error handling we're no longer safe to use btrfs_cleanup_ordered_extents() as the folios are already unlocked. [FIX] - Make fallback_to_cow() and nocow_one_range() to keep folios locked after a successful run For fallback_to_cow() we can pass COW_FILE_RANGE_KEEP_LOCKED flag into cow_file_range(). For nocow_one_range() we have to remove the PAGE_UNLOCK flag from extent_clear_unlock_delalloc(). - Unlock folios if everything is fine in run_delalloc_nocow() - Use extent_clear_unlock_delalloc() to handle range [@start, @cur_offset) inside run_delalloc_nocow() Since folios are still locked, we do not need cleanup_dirty_folios() to do the cleanup. extent_clear_unlock_delalloc() with "PAGE_START_WRITEBACK \| PAGE_END_WRITEBACK" will clear the dirty flags. - Remove cleanup_dirty_folios() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:15 +02:00
Qu Wenruo	13141df705	btrfs: make nocow_one_range() to do cleanup on error Currently if we hit an error inside nocow_one_range(), we do not clear the page dirty, and let the caller to handle it. This is very different compared to fallback_to_cow(), when that function failed, everything will be cleaned up by cow_file_range(). Enhance the situation by: - Use a common error handling for nocow_one_range() If we failed anything, use the same btrfs_cleanup_ordered_extents() and extent_clear_unlock_delalloc(). btrfs_cleanup_ordered_extents() is safe even if we haven't created new ordered extent, in that case there should be no OE and that function will do nothing. The same applies to extent_clear_unlock_delalloc(), and since we're passing PAGE_UNLOCK \| PAGE_START_WRITEBACK \| PAGE_END_WRITEBACK, it will also clear folio dirty flag during error handling. - Avoid touching the failed range of nocow_one_range() As the failed range will be cleaned up and unlocked by that function. Here we introduce a new variable @nocow_end to record the failed range, so that we can skip it during the error handling of run_delalloc_nocow(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:15 +02:00
Qu Wenruo	6a378edc9a	btrfs: enhance error messages for delalloc range failure When running emulated write error tests like generic/475, we can hit error messages like this: BTRFS error (device dm-12 state EA): run_delalloc_nocow failed, root=596 inode=264 start=1605632 len=73728: -5 BTRFS error (device dm-12 state EA): failed to run delalloc range, root=596 ino=264 folio=1605632 submit_bitmap=0-7 start=1605632 len=73728: -5 Which is normally buried by direct IO error messages. However above error messages are not enough to determine which is the real range that caused the error. Considering we can have multiple different extents in one delalloc range (e.g. some COW extents along with some NOCOW extents), just outputting the error at the end of run_delalloc_nocow() is not enough. To enhance the error messages: - Remove the rate limit on the existing error messages In the generic/475 example, most error messages are from direct IO, not really from the delalloc range. Considering how useful the delalloc range error messages are, we don't want they to be rate limited. - Add extra @cur_offset output for cow_file_range() - Add extra variable output for run_delalloc_nocow() This is especially important for run_delalloc_nocow(), as there are extra error paths where we can hit error without into nocow_one_range() nor fallback_to_cow(). - Add an error message for nocow_one_range() That's the missing part. For fallback_to_cow(), we have error message from cow_file_range() already. - Constify the @len and @end local variables for nocow_one_range() This makes it much easier to make sure @len and @end are not modified at runtime. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:09 +02:00
Caleb Sander Mateos	ef9f603fd3	io_uring/cmd: drop unused res2 param from io_uring_cmd_done() Commit `79525b51ac` ("io_uring: fix nvme's 32b cqes on mixed cq") split out a separate io_uring_cmd_done32() helper for ->uring_cmd() implementations that return 32-byte CQEs. The res2 value passed to io_uring_cmd_done() is now unused because __io_uring_cmd_done() ignores it when is_cqe32 is passed as false. So drop the parameter from io_uring_cmd_done() to simplify the callers and clarify that it's not possible to return an extra value beyond the 32-bit CQE result. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-23 00:15:02 -06:00
Dmitry Antipov	3437819c5e	ocfs2: avoid extra calls to strlen() after ocfs2_sprintf_system_inode_name() Since 'ocfs2_sprintf_system_inode_name()' uses 'snprintf()' and returns the number of characters emitted, callers of the former are better to use that return value instead of an explicit calls to 'strlen()'. Link: https://lkml.kernel.org/r/20250917060229.1854335-1-dmantipov@yandex.ru Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-22 20:11:00 -07:00
Viacheslav Dubeyko	42520df65b	hfsplus: fix slab-out-of-bounds read in hfsplus_strcasecmp() The hfsplus_strcasecmp() logic can trigger the issue: [ 117.317703][ T9855] ================================================================== [ 117.318353][ T9855] BUG: KASAN: slab-out-of-bounds in hfsplus_strcasecmp+0x1bc/0x490 [ 117.318991][ T9855] Read of size 2 at addr ffff88802160f40c by task repro/9855 [ 117.319577][ T9855] [ 117.319773][ T9855] CPU: 0 UID: 0 PID: 9855 Comm: repro Not tainted 6.17.0-rc6 #33 PREEMPT(full) [ 117.319780][ T9855] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 117.319783][ T9855] Call Trace: [ 117.319785][ T9855] <TASK> [ 117.319788][ T9855] dump_stack_lvl+0x1c1/0x2a0 [ 117.319795][ T9855] ? __virt_addr_valid+0x1c8/0x5c0 [ 117.319803][ T9855] ? __pfx_dump_stack_lvl+0x10/0x10 [ 117.319808][ T9855] ? rcu_is_watching+0x15/0xb0 [ 117.319816][ T9855] ? lock_release+0x4b/0x3e0 [ 117.319821][ T9855] ? __kasan_check_byte+0x12/0x40 [ 117.319828][ T9855] ? __virt_addr_valid+0x1c8/0x5c0 [ 117.319835][ T9855] ? __virt_addr_valid+0x4a5/0x5c0 [ 117.319842][ T9855] print_report+0x17e/0x7e0 [ 117.319848][ T9855] ? __virt_addr_valid+0x1c8/0x5c0 [ 117.319855][ T9855] ? __virt_addr_valid+0x4a5/0x5c0 [ 117.319862][ T9855] ? __phys_addr+0xd3/0x180 [ 117.319869][ T9855] ? hfsplus_strcasecmp+0x1bc/0x490 [ 117.319876][ T9855] kasan_report+0x147/0x180 [ 117.319882][ T9855] ? hfsplus_strcasecmp+0x1bc/0x490 [ 117.319891][ T9855] hfsplus_strcasecmp+0x1bc/0x490 [ 117.319900][ T9855] ? __pfx_hfsplus_cat_case_cmp_key+0x10/0x10 [ 117.319906][ T9855] hfs_find_rec_by_key+0xa9/0x1e0 [ 117.319913][ T9855] __hfsplus_brec_find+0x18e/0x470 [ 117.319920][ T9855] ? __pfx_hfsplus_bnode_find+0x10/0x10 [ 117.319926][ T9855] ? __pfx_hfs_find_rec_by_key+0x10/0x10 [ 117.319933][ T9855] ? __pfx___hfsplus_brec_find+0x10/0x10 [ 117.319942][ T9855] hfsplus_brec_find+0x28f/0x510 [ 117.319949][ T9855] ? __pfx_hfs_find_rec_by_key+0x10/0x10 [ 117.319956][ T9855] ? __pfx_hfsplus_brec_find+0x10/0x10 [ 117.319963][ T9855] ? __kmalloc_noprof+0x2a9/0x510 [ 117.319969][ T9855] ? hfsplus_find_init+0x8c/0x1d0 [ 117.319976][ T9855] hfsplus_brec_read+0x2b/0x120 [ 117.319983][ T9855] hfsplus_lookup+0x2aa/0x890 [ 117.319990][ T9855] ? __pfx_hfsplus_lookup+0x10/0x10 [ 117.320003][ T9855] ? d_alloc_parallel+0x2f0/0x15e0 [ 117.320008][ T9855] ? __lock_acquire+0xaec/0xd80 [ 117.320013][ T9855] ? __pfx_d_alloc_parallel+0x10/0x10 [ 117.320019][ T9855] ? __raw_spin_lock_init+0x45/0x100 [ 117.320026][ T9855] ? __init_waitqueue_head+0xa9/0x150 [ 117.320034][ T9855] __lookup_slow+0x297/0x3d0 [ 117.320039][ T9855] ? __pfx___lookup_slow+0x10/0x10 [ 117.320045][ T9855] ? down_read+0x1ad/0x2e0 [ 117.320055][ T9855] lookup_slow+0x53/0x70 [ 117.320065][ T9855] walk_component+0x2f0/0x430 [ 117.320073][ T9855] path_lookupat+0x169/0x440 [ 117.320081][ T9855] filename_lookup+0x212/0x590 [ 117.320089][ T9855] ? __pfx_filename_lookup+0x10/0x10 [ 117.320098][ T9855] ? strncpy_from_user+0x150/0x290 [ 117.320105][ T9855] ? getname_flags+0x1e5/0x540 [ 117.320112][ T9855] user_path_at+0x3a/0x60 [ 117.320117][ T9855] __x64_sys_umount+0xee/0x160 [ 117.320123][ T9855] ? __pfx___x64_sys_umount+0x10/0x10 [ 117.320129][ T9855] ? do_syscall_64+0xb7/0x3a0 [ 117.320135][ T9855] ? entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 117.320141][ T9855] ? entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 117.320145][ T9855] do_syscall_64+0xf3/0x3a0 [ 117.320150][ T9855] ? exc_page_fault+0x9f/0xf0 [ 117.320154][ T9855] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 117.320158][ T9855] RIP: 0033:0x7f7dd7908b07 [ 117.320163][ T9855] Code: 23 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 08 [ 117.320167][ T9855] RSP: 002b:00007ffd5ebd9698 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6 [ 117.320172][ T9855] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7dd7908b07 [ 117.320176][ T9855] RDX: 0000000000000009 RSI: 0000000000000009 RDI: 00007ffd5ebd9740 [ 117.320179][ T9855] RBP: 00007ffd5ebda780 R08: 0000000000000005 R09: 00007ffd5ebd9530 [ 117.320181][ T9855] R10: 00007f7dd799bfc0 R11: 0000000000000202 R12: 000055e2008b32d0 [ 117.320184][ T9855] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 117.320189][ T9855] </TASK> [ 117.320190][ T9855] [ 117.351311][ T9855] Allocated by task 9855: [ 117.351683][ T9855] kasan_save_track+0x3e/0x80 [ 117.352093][ T9855] __kasan_kmalloc+0x8d/0xa0 [ 117.352490][ T9855] __kmalloc_noprof+0x288/0x510 [ 117.352914][ T9855] hfsplus_find_init+0x8c/0x1d0 [ 117.353342][ T9855] hfsplus_lookup+0x19c/0x890 [ 117.353747][ T9855] __lookup_slow+0x297/0x3d0 [ 117.354148][ T9855] lookup_slow+0x53/0x70 [ 117.354514][ T9855] walk_component+0x2f0/0x430 [ 117.354921][ T9855] path_lookupat+0x169/0x440 [ 117.355325][ T9855] filename_lookup+0x212/0x590 [ 117.355740][ T9855] user_path_at+0x3a/0x60 [ 117.356115][ T9855] __x64_sys_umount+0xee/0x160 [ 117.356529][ T9855] do_syscall_64+0xf3/0x3a0 [ 117.356920][ T9855] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 117.357429][ T9855] [ 117.357636][ T9855] The buggy address belongs to the object at ffff88802160f000 [ 117.357636][ T9855] which belongs to the cache kmalloc-2k of size 2048 [ 117.358827][ T9855] The buggy address is located 0 bytes to the right of [ 117.358827][ T9855] allocated 1036-byte region [ffff88802160f000, ffff88802160f40c) [ 117.360061][ T9855] [ 117.360266][ T9855] The buggy address belongs to the physical page: [ 117.360813][ T9855] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x21608 [ 117.361562][ T9855] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [ 117.362285][ T9855] flags: 0xfff00000000040(head\|node=0\|zone=1\|lastcpupid=0x7ff) [ 117.362929][ T9855] page_type: f5(slab) [ 117.363282][ T9855] raw: 00fff00000000040 ffff88801a842f00 ffffea0000932000 dead000000000002 [ 117.364015][ T9855] raw: 0000000000000000 0000000080080008 00000000f5000000 0000000000000000 [ 117.364750][ T9855] head: 00fff00000000040 ffff88801a842f00 ffffea0000932000 dead000000000002 [ 117.365491][ T9855] head: 0000000000000000 0000000080080008 00000000f5000000 0000000000000000 [ 117.366232][ T9855] head: 00fff00000000003 ffffea0000858201 00000000ffffffff 00000000ffffffff [ 117.366968][ T9855] head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008 [ 117.367711][ T9855] page dumped because: kasan: bad access detected [ 117.368259][ T9855] page_owner tracks the page as allocated [ 117.368745][ T9855] page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO\|__GFP_FS\|__GFP_NOWARN1 [ 117.370541][ T9855] post_alloc_hook+0x240/0x2a0 [ 117.370954][ T9855] get_page_from_freelist+0x2101/0x21e0 [ 117.371435][ T9855] __alloc_frozen_pages_noprof+0x274/0x380 [ 117.371935][ T9855] alloc_pages_mpol+0x241/0x4b0 [ 117.372360][ T9855] allocate_slab+0x8d/0x380 [ 117.372752][ T9855] ___slab_alloc+0xbe3/0x1400 [ 117.373159][ T9855] __kmalloc_cache_noprof+0x296/0x3d0 [ 117.373621][ T9855] nexthop_net_init+0x75/0x100 [ 117.374038][ T9855] ops_init+0x35c/0x5c0 [ 117.374400][ T9855] setup_net+0x10c/0x320 [ 117.374768][ T9855] copy_net_ns+0x31b/0x4d0 [ 117.375156][ T9855] create_new_namespaces+0x3f3/0x720 [ 117.375613][ T9855] unshare_nsproxy_namespaces+0x11c/0x170 [ 117.376094][ T9855] ksys_unshare+0x4ca/0x8d0 [ 117.376477][ T9855] __x64_sys_unshare+0x38/0x50 [ 117.376879][ T9855] do_syscall_64+0xf3/0x3a0 [ 117.377265][ T9855] page last free pid 9110 tgid 9110 stack trace: [ 117.377795][ T9855] __free_frozen_pages+0xbeb/0xd50 [ 117.378229][ T9855] __put_partials+0x152/0x1a0 [ 117.378625][ T9855] put_cpu_partial+0x17c/0x250 [ 117.379026][ T9855] __slab_free+0x2d4/0x3c0 [ 117.379404][ T9855] qlist_free_all+0x97/0x140 [ 117.379790][ T9855] kasan_quarantine_reduce+0x148/0x160 [ 117.380250][ T9855] __kasan_slab_alloc+0x22/0x80 [ 117.380662][ T9855] __kmalloc_noprof+0x232/0x510 [ 117.381074][ T9855] tomoyo_supervisor+0xc0a/0x1360 [ 117.381498][ T9855] tomoyo_env_perm+0x149/0x1e0 [ 117.381903][ T9855] tomoyo_find_next_domain+0x15ad/0x1b90 [ 117.382378][ T9855] tomoyo_bprm_check_security+0x11c/0x180 [ 117.382859][ T9855] security_bprm_check+0x89/0x280 [ 117.383289][ T9855] bprm_execve+0x8f1/0x14a0 [ 117.383673][ T9855] do_execveat_common+0x528/0x6b0 [ 117.384103][ T9855] __x64_sys_execve+0x94/0xb0 [ 117.384500][ T9855] [ 117.384706][ T9855] Memory state around the buggy address: [ 117.385179][ T9855] ffff88802160f300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 117.385854][ T9855] ffff88802160f380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 117.386534][ T9855] >ffff88802160f400: 00 04 fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 117.387204][ T9855] ^ [ 117.387566][ T9855] ffff88802160f480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 117.388243][ T9855] ffff88802160f500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 117.388918][ T9855] ================================================================== The issue takes place if the length field of struct hfsplus_unistr is bigger than HFSPLUS_MAX_STRLEN. The patch simply checks the length of comparing strings. And if the strings' length is bigger than HFSPLUS_MAX_STRLEN, then it is corrected to this value. v2 The string length correction has been added for hfsplus_strcmp(). Reported-by: Jiaming Zhang <r772577952@gmail.com> Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> cc: Yangtao Li <frank.li@vivo.com> cc: linux-fsdevel@vger.kernel.org cc: syzkaller@googlegroups.com Link: https://lore.kernel.org/r/20250919191243.1370388-1-slava@dubeyko.com Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>	2025-09-22 15:11:33 -07:00
Larshin Sergey	3bd5e45c2c	fs: udf: fix OOB read in lengthAllocDescs handling When parsing Allocation Extent Descriptor, lengthAllocDescs comes from on-disk data and must be validated against the block size. Crafted or corrupted images may set lengthAllocDescs so that the total descriptor length (sizeof(allocExtDesc) + lengthAllocDescs) exceeds the buffer, leading udf_update_tag() to call crc_itu_t() on out-of-bounds memory and trigger a KASAN use-after-free read. BUG: KASAN: use-after-free in crc_itu_t+0x1d5/0x2b0 lib/crc-itu-t.c:60 Read of size 1 at addr ffff888041e7d000 by task syz-executor317/5309 CPU: 0 UID: 0 PID: 5309 Comm: syz-executor317 Not tainted 6.12.0-rc4-syzkaller-00261-g850925a8133c #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 crc_itu_t+0x1d5/0x2b0 lib/crc-itu-t.c:60 udf_update_tag+0x70/0x6a0 fs/udf/misc.c:261 udf_write_aext+0x4d8/0x7b0 fs/udf/inode.c:2179 extent_trunc+0x2f7/0x4a0 fs/udf/truncate.c:46 udf_truncate_tail_extent+0x527/0x7e0 fs/udf/truncate.c:106 udf_release_file+0xc1/0x120 fs/udf/file.c:185 __fput+0x23f/0x880 fs/file_table.c:431 task_work_run+0x24f/0x310 kernel/task_work.c:239 exit_task_work include/linux/task_work.h:43 [inline] do_exit+0xa2f/0x28e0 kernel/exit.c:939 do_group_exit+0x207/0x2c0 kernel/exit.c:1088 __do_sys_exit_group kernel/exit.c:1099 [inline] __se_sys_exit_group kernel/exit.c:1097 [inline] __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1097 x64_sys_call+0x2634/0x2640 arch/x86/include/generated/asm/syscalls_64.h:232 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> Validate the computed total length against epos->bh->b_size. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Reported-by: syzbot+8743fca924afed42f93e@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=8743fca924afed42f93e Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Signed-off-by: Larshin Sergey <Sergey.Larshin@kaspersky.com> Link: https://patch.msgid.link/20250922131358.745579-1-Sergey.Larshin@kaspersky.com Signed-off-by: Jan Kara <jack@suse.cz>	2025-09-22 15:33:56 +02:00
Christian Brauner	d7610cb745	ns: simplify ns_common_init() further Simply derive the ns operations from the namespace type. Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-22 14:47:10 +02:00
Dmitry Antipov	fc0d192303	xfs: scrub: use kstrdup_const() for metapath scan setups Except 'xchk_setup_metapath_rtginode()' case, 'path' argument of 'xchk_setup_metapath_scan()' is a compile-time constant. So it may be reasonable to use 'kstrdup_const()' / 'kree_const()' to manage 'path' field of 'struct xchk_metapath' in attempt to reuse .rodata instance rather than making a copy. Compile tested only. Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-09-22 12:57:34 +02:00
Christoph Hellwig	6ef2175fce	xfs: use bt_nr_sectors in xfs_dax_translate_range Only ranges inside the file system can be translated, and the file system can be smaller than the containing device. Fixes: `f4ed930379` ("xfs: don't shut down the filesystem for media failures beyond end of log") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-09-22 12:55:20 +02:00
Christoph Hellwig	42852fe57c	xfs: track the number of blocks in each buftarg Add a bt_nr_sectors to track the number of sector in each buftarg, and replace the check that hard codes sb_dblock in xfs_buf_map_verify with this new value so that it is correct for non-ddev buftargs. The RT buftarg only has a superblock in the first block, so it is unlikely to trigger this, or are we likely to ever have enough blocks in the in-memory buftargs, but we might as well get the check right. Fixes: `10616b806d` ("xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-09-22 12:55:20 +02:00
Qu Wenruo	3239c44df7	btrfs: rework error handling of run_delalloc_nocow() Currently the error handling of run_delalloc_nocow() modifies @cur_offset to handle different parts of the delalloc range. However the error handling can always be split into 3 parts: 1) The range with ordered extents allocated (OE cleanup) We have to cleanup the ordered extents and unlock the folios. 2) The range that have been cleaned up (skip) For now it's only for the range of fallback_to_cow(). We should not touch it at all. 3) The range that is not yet touched (untouched) We have to unlock the folios and clear any reserved space. This 3 ranges split has the same principle as cow_file_range(), however the NOCOW/COW handling makes the above 3 range split much more complex: a) Failure immediately after a successful OE allocation Thus no @cow_start nor @cow_end set. start cur_offset end \| OE cleanup \| untouched \| b) Failure after hitting a COW range but before calling fallback_to_cow() start cow_start cur_offset end \| OE Cleanup \| untouched \| c) Failure to call fallback_to_cow() start cow_start cow_end end \| OE Cleanup \| skip \| untouched \| Instead of modifying @cur_offset, do proper range calculation for OE-cleanup and untouched ranges using above 3 cases with proper range charts. This avoid updating @cur_offset, as it will an extra output for debug purposes later, and explain the behavior easier. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-22 10:54:32 +02:00
Leo Martins	46d33a0cc4	btrfs: add mount option for ref_tracker The ref_tracker infrastructure aids debugging but is not enabled by default as it has a performance impact. Add mount option 'ref_tracker' so it can be selectively enabled on a filesystem. Currently it track references of 'delayed inodes'. Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-22 10:54:32 +02:00
Leo Martins	b767a28d61	btrfs: print leaked references in kill_all_delayed_nodes() We are seeing soft lockups in kill_all_delayed_nodes due to a delayed_node with a lingering reference count of 1. Printing at this point will reveal the guilty stack trace. If the delayed_node has no references there should be no output. Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-22 10:54:32 +02:00
Leo Martins	e8513c012d	btrfs: implement ref_tracker for delayed_nodes Add ref_tracker infrastructure for struct btrfs_delayed_node. It is a response to the largest btrfs related crash in our fleet. We're seeing soft lockups in btrfs_kill_all_delayed_nodes() that seem to be a result of delayed_nodes not being released properly. A ref_tracker object is allocated on reference count increases and freed on reference count decreases. The ref_tracker object stores a stack trace of where it is allocated. The ref_tracker_dir object is embedded in btrfs_delayed_node and keeps track of all current and some old/freed ref_tracker objects. When a leak is detected we can print the stack traces for all ref_trackers that have not yet been freed. Here is a common example of taking a reference to a delayed_node and freeing it with ref_tracker. struct btrfs_ref_tracker tracker; struct btrfs_delayed_node *node; node = btrfs_get_delayed_node(inode, &tracker); // use delayed_node... btrfs_release_delayed_node(node, &tracker); There are two special cases where the delayed_node reference is "long lived", meaning that the thread that takes the reference and the thread that releases the reference are different. The 'inode_cache_tracker' tracks the delayed_node stored in btrfs_inode. The 'node_list_tracker' tracks the delayed_node stored in the btrfs_delayed_root node_list/prepare_list. These trackers are embedded in the btrfs_delayed_node. btrfs_ref_tracker and btrfs_ref_tracker_dir are wrappers that either compile to the corresponding ref_tracker structs or empty structs depending on CONFIG_BTRFS_DEBUG. There are also btrfs wrappers for the ref_tracker API. Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-22 10:54:32 +02:00
David Sterba	67e78f983e	btrfs: convert several int parameters to bool We're almost done cleaning misused int/bool parameters. Convert a bunch of them, found by manual grepping. Note that btrfs_sync_fs() needs an int as it's mandated by the struct super_operations prototype. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-22 10:54:32 +02:00
Leo Martins	cba7c35fec	btrfs: move ref-verify under CONFIG_BTRFS_DEBUG Remove CONFIG_BTRFS_FS_REF_VERIFY Kconfig and add it as part of CONFIG_BTRFS_DEBUG. This should not be impactful to the performance of debug. The struct btrfs_ref takes an additional u64, btrfs_fs_info takes an additional spinlock_t and rb_root. All of the ref_verify logic is still protected by a mount option. Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-22 10:54:32 +02:00
Xichao Zhao	28a38e20ac	btrfs: use PTR_ERR_OR_ZERO() to simplify code inbtrfs_control_ioctl() Use the standard error pointer macro to simplify the code. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-22 10:54:31 +02:00
Qu Wenruo	d728f2e5f8	btrfs: simplify support block size check Currently we manually check the block size against 3 different values: - 4K - PAGE_SIZE - MIN_BLOCKSIZE Those 3 values can match or differ from each other. This makes it pretty complex to output the supported block sizes. Considering we're going to add block size > page size support soon, this can make the support block size sysfs attribute much harder to implement. To make it easier, factor out a helper, btrfs_supported_blocksize() to do a simple check for the block size. Then utilize it in the two locations: - btrfs_validate_super() This is very straightforward - supported_sectorsizes_show() Iterate through all valid block sizes, and only output supported ones. This is to make future full range block sizes support much easier. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-22 10:54:31 +02:00
Qu Wenruo	0a6dcd4235	btrfs: use blocksize to check if compression is making things larger [BEHAVIOR DIFFERENCE BETWEEN COMPRESSION ALGOS] Currently LZO compression algorithm will check if we're making the compressed data larger after compressing more than 2 blocks. But zlib and zstd do the same checks after compressing more than 8192 bytes. This is not a big deal, but since we're already supporting larger block size (e.g. 64K block size if page size is also 64K), this check is not suitable for all block sizes. For example, if our page and block size are both 16KiB, and after the first block compressed using zlib, the resulted compressed data is slightly larger than 16KiB, we will immediately abort the compression. This makes zstd and zlib compression algorithms to behave slightly different from LZO, which only aborts after compressing two blocks. [ENHANCEMENT] To unify the behavior, only abort the compression after compressing at least two blocks. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-22 10:54:31 +02:00
Qu Wenruo	d71b419f27	btrfs: pass btrfs_inode pointer directly into btrfs_compress_folios() For the 3 supported compression algorithms, two of them (zstd and zlib) are already grabbing the btrfs inode for error messages. It's more common to pass btrfs_inode and grab the address space from it. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-22 10:54:31 +02:00

... 12 13 14 15 16 ...

102029 Commits