linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-04-05 00:04:34 -04:00

Author	SHA1	Message	Date
Filipe Manana	a56a70f8d2	btrfs: raid56: fix memory leak of btrfs_raid_bio::stripe_uptodate_bitmap We allocate the bitmap but we never free it in free_raid_bio_pointers(). Fix this by adding a bitmap_free() call against the stripe_uptodate_bitmap of a raid bio. Fixes: `1810350b04` ("btrfs: raid56: move sector_ptr::uptodate into a dedicated bitmap") Reported-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-btrfs/20260126045315.GA31641@lst.de/ Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:25 +01:00
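The leak pattern described in the commit above can be illustrated outside the kernel. The following is a minimal userspace sketch, not the btrfs code: `toy_rbio`, `tracked_calloc()`, `tracked_free()` and the allocation counter are hypothetical names invented here, and `calloc()`/`free()` stand in for the kernel's bitmap allocation and `bitmap_free()`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical counter of live allocations, used only to make a leak visible. */
static int live_allocs;

static void *tracked_calloc(size_t n, size_t size)
{
	void *p = calloc(n, size);

	if (p)
		live_allocs++;
	return p;
}

static void tracked_free(void *p)
{
	if (p)
		live_allocs--;
	free(p);
}

/* Toy stand-in for a raid bio that owns a pointer array and a bitmap. */
struct toy_rbio {
	void **sector_ptrs;
	unsigned long *uptodate_bitmap;
};

/* Release everything the toy rbio owns, including the bitmap. */
void toy_rbio_free_pointers(struct toy_rbio *rbio)
{
	tracked_free(rbio->sector_ptrs);
	/* Forgetting this second release is the kind of leak the commit fixes. */
	tracked_free(rbio->uptodate_bitmap);
	rbio->sector_ptrs = NULL;
	rbio->uptodate_bitmap = NULL;
}

/* Allocate both members; returns 0 on success, -1 on allocation failure. */
int toy_rbio_alloc(struct toy_rbio *rbio, size_t nr_sectors)
{
	size_t bits_per_long = 8 * sizeof(unsigned long);
	size_t longs = (nr_sectors + bits_per_long - 1) / bits_per_long;

	assert(nr_sectors > 0);
	rbio->sector_ptrs = tracked_calloc(nr_sectors, sizeof(void *));
	rbio->uptodate_bitmap = tracked_calloc(longs, sizeof(unsigned long));
	if (!rbio->sector_ptrs || !rbio->uptodate_bitmap) {
		toy_rbio_free_pointers(rbio);
		return -1;
	}
	return 0;
}

/* Number of allocations that have not been released yet. */
int toy_live_allocs(void)
{
	return live_allocs;
}
```

After `toy_rbio_alloc()` followed by `toy_rbio_free_pointers()`, `toy_live_allocs()` should be back at zero; dropping the bitmap release leaves it at one.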
Boris Burkov	5341c98450	btrfs: tests: add unit tests for pending extent walking functions I ran into another sort of trivial bug in v1 of the patch and concluded that these functions really ought to be unit tested. These two functions form the core of searching the chunk allocation pending extent bitmap and have relatively easily definable semantics, so unit testing them can help ensure the correctness of chunk allocation. I also made a minor unrelated fix in volumes.h to properly forward declare btrfs_space_info. Because of the order of the includes in the new test, this was actually hitting a latent build warning. Note: This is an early example for me of a commit authored in part by an AI agent, so I wanted to be more clear about what I did. I defined a trivial test and explained the set of tests I wanted to the agent and it produced the large set of test cases seen here. I then checked each test case to make sure it matched the description and simplified the constants and numbers until they looked reasonable to me. I then checked the looping logic to make sure it made sense to the original spirit of the trivial test. Finally, I carefully combed over all the lines it wrote to loop over the tests it generated to make sure they followed our code style guide. Assisted-by: Claude:claude-opus-4-5 Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:25 +01:00
Boris Burkov	b14c5e04bd	btrfs: fix EEXIST abort due to non-consecutive gaps in chunk allocation I have been observing a number of systems aborting at insert_dev_extents() in btrfs_create_pending_block_groups(). The following is a sample stack trace of such an abort coming from forced chunk allocation (typically behind CONFIG_BTRFS_EXPERIMENTAL) but this can theoretically happen to any DUP chunk allocation. [81.801] ------------[ cut here ]------------ [81.801] BTRFS: Transaction aborted (error -17) [81.801] WARNING: fs/btrfs/block-group.c:2876 at btrfs_create_pending_block_groups+0x721/0x770 [btrfs], CPU#1: bash/319 [81.802] Modules linked in: virtio_net btrfs xor zstd_compress raid6_pq null_blk [81.803] CPU: 1 UID: 0 PID: 319 Comm: bash Kdump: loaded Not tainted 6.19.0-rc6+ #319 NONE [81.803] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014 [81.804] RIP: 0010:btrfs_create_pending_block_groups+0x723/0x770 [btrfs] [81.806] RSP: 0018:ffffa36241a6bce8 EFLAGS: 00010282 [81.806] RAX: 000000000000000d RBX: ffff8e699921e400 RCX: 0000000000000000 [81.807] RDX: 0000000002040001 RSI: 00000000ffffffef RDI: ffffffffc0608bf0 [81.807] RBP: 00000000ffffffef R08: ffff8e69830f6000 R09: 0000000000000007 [81.808] R10: ffff8e699921e5e8 R11: 0000000000000000 R12: ffff8e6999228000 [81.808] R13: ffff8e6984d82000 R14: ffff8e69966a69c0 R15: ffff8e69aa47b000 [81.809] FS: 00007fec6bdd9740(0000) GS:ffff8e6b1b379000(0000) knlGS:0000000000000000 [81.809] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [81.810] CR2: 00005604833670f0 CR3: 0000000116679000 CR4: 00000000000006f0 [81.810] Call Trace: [81.810] <TASK> [81.810] __btrfs_end_transaction+0x3e/0x2b0 [btrfs] [81.811] btrfs_force_chunk_alloc_store+0xcd/0x140 [btrfs] [81.811] kernfs_fop_write_iter+0x15f/0x240 [81.812] vfs_write+0x264/0x500 [81.812] ksys_write+0x6c/0xe0 [81.812] do_syscall_64+0x66/0x770 [81.812] entry_SYSCALL_64_after_hwframe+0x76/0x7e [81.813] RIP: 0033:0x7fec6be66197 [81.814] RSP: 
002b:00007fffb159dd30 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 [81.815] RAX: ffffffffffffffda RBX: 00007fec6bdd9740 RCX: 00007fec6be66197 [81.815] RDX: 0000000000000002 RSI: 0000560483374f80 RDI: 0000000000000001 [81.816] RBP: 0000560483374f80 R08: 0000000000000000 R09: 0000000000000000 [81.816] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002 [81.817] R13: 00007fec6bfb85c0 R14: 00007fec6bfb5ee0 R15: 00005604833729c0 [81.817] </TASK> [81.817] irq event stamp: 20039 [81.818] hardirqs last enabled at (20047): [<ffffffff99a68302>] __up_console_sem+0x52/0x60 [81.818] hardirqs last disabled at (20056): [<ffffffff99a682e7>] __up_console_sem+0x37/0x60 [81.819] softirqs last enabled at (19470): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0 [81.819] softirqs last disabled at (19463): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0 [81.820] ---[ end trace 0000000000000000 ]--- [81.820] BTRFS: error (device dm-7 state A) in btrfs_create_pending_block_groups:2876: errno=-17 Object already exists Inspecting these aborts with drgn, I observed a pattern of overlapping chunk_maps. Note how stripe 1 of the first chunk overlaps in physical address with stripe 0 of the second chunk. Physical Start Physical End Length Logical Type Stripe ---------------------------------------------------------------------------------------------------- 0x0000000102500000 0x0000000142500000 1.0G 0x0000000641d00000 META\|DUP 0/2 0x0000000142500000 0x0000000182500000 1.0G 0x0000000641d00000 META\|DUP 1/2 0x0000000142500000 0x0000000182500000 1.0G 0x0000000601d00000 META\|DUP 0/2 0x0000000182500000 0x00000001c2500000 1.0G 0x0000000601d00000 META\|DUP 1/2 Now how could this possibly happen? 
All chunk allocation is protected by the chunk_mutex so racing allocations should see a consistent view of the CHUNK_ALLOCATED bit in the chunk allocation extent-io-tree (device->alloc_state as set by chunk_map_device_set_bits()) The tree itself is protected by a spin lock, and clearing/setting the bits is always protected by fs_info->mapping_tree_lock, so no race is apparent. It turns out that there is a subtle bug in the logic regarding chunk allocations that have happened in the current transaction, known as "pending extents". The chunk allocation as defined in find_free_dev_extent() is a loop which searches the commit root of the dev_root and looks for gaps between DEV_EXTENT items. For those gaps, it then checks alloc_state bitmap for any pending extents and adjusts the hole that it finds accordingly. However, the logic in that adjustment assumes that the first pending extent is the only one in that range. e.g., given a layout with two non-consecutive pending extents in a hole passed to dev_extent_hole_check() via hole_start and hole_size: \|----pending A----\| real hole \|----pending B----\| \| candidate hole \| hole_start hole_start + *hole_size the code incorrectly returns a "hole" from the end of pending extent A until the passed in hole end, failing to account for pending B. However, it is not entirely obvious that it is actually possible to produce such a layout. I was able to reproduce it, but with some contortions: I continued to use the force chunk allocation sysfs file and I introduced a long delay (10 seconds) into the start of the cleaner thread. I also prevented the unused bgs cleaning logic from ever deleting metadata bgs. These help make it easier to deterministically produce the condition but shouldn't really matter if you imagine the conditions happening by race/luck. 
Allocations/frees can happen concurrently with the cleaner thread preparing to process an unused extent and both create some used chunks with an unused chunk interleaved, all during one transaction. Then btrfs_delete_unused_bgs() sees the unused one and clears it, leaving a range with several pending chunk allocations and a gap in the middle. The basic idea is that the unused_bgs cleanup work happens on a worker so if we allocate 3 block groups in one transaction, then the cleaner work kicked off by the previous transaction comes through and deletes the middle one of the 3, then the commit root shows no dev extents and we have the bad pattern in the extent-io-tree. One final consideration is that the code happens to loop to the next hole if there are no more extents at all, so we need one more dev extent way past the area we are working in. Something like the following demonstrates the technique: # push the BG frontier out to 20G fallocate -l 20G $mnt/foo # allocate one more that will prevent the "no more dev extents" luck fallocate -l 1G $mnt/sticky # sync sync # clear out the allocation area rm $mnt/foo sync _cleaner # let everything quiesce sleep 20 sync # dev tree should have one bg 20G out and the rest at the beginning.. # sort of like an empty FS but with a random sticky chunk. # kick off the cleaner in the background, remember it will sleep 10s # before doing interesting work _cleaner & sleep 3 # create 3 trivial block groups, all empty, all immediately marked as unused. echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc" echo 1 > "$(_btrfs_sysfs_space_info $dev data)/force_chunk_alloc" echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc" # let the cleaner thread definitely finish, it will remove the data bg sleep 10 # this allocation sees the non-consecutive pending metadata chunks with # data chunk gap of 1G and allocates a 2G extent in that hole. ENOSPC! 
echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc" As for the fix, it is not that obvious. I could not see a trivial way to do it even by adding backup loops into find_free_dev_extent(), so I opted to change the semantics of dev_extent_hole_check() to not stop looping until it finds a sufficiently big hole. For clarity, this also required changing the helper function contains_pending_extent() into two new helpers which find the first pending extent and the first suitable hole in a range. I attempted to clean up the documentation and range calculations to be as consistent and clear as possible for the future. I also looked at the zoned case and concluded that the loop there is different and not to be unified with this one. As far as I can tell, the zoned check will only further constrain the hole so looping back to find more holes is acceptable. Though given that zoned really only appends, I find it highly unlikely that it is susceptible to this bug. Fixes: `1b98450816` ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole") Reported-by: Dimitrios Apostolou <jimis@gmx.net> Closes: https://lore.kernel.org/linux-btrfs/q7760374-q1p4-029o-5149-26p28421s468@tzk.arg/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:25 +01:00
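The core idea of the fix above, continuing past every pending extent until a sufficiently large gap is found, can be sketched in plain C. This is an illustration under stated assumptions, not the kernel implementation: `struct pending_extent` and `find_first_suitable_hole()` are hypothetical names, and a sorted array stands in for the `alloc_state` extent-io-tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a pending extent: the half-open range [start, end). */
struct pending_extent {
	uint64_t start;
	uint64_t end;
};

/*
 * Find the first gap of at least min_size bytes inside [hole_start, hole_end)
 * that overlaps no pending extent. The extents must be sorted by start and
 * must not overlap each other.
 *
 * Returns true and stores the gap in *found_start and *found_size, or
 * returns false if no gap in the range is large enough.
 */
bool find_first_suitable_hole(const struct pending_extent *pending, size_t count,
			      uint64_t hole_start, uint64_t hole_end,
			      uint64_t min_size,
			      uint64_t *found_start, uint64_t *found_size)
{
	uint64_t cursor = hole_start;

	assert(hole_start <= hole_end);

	for (size_t i = 0; i < count && cursor < hole_end; i++) {
		if (pending[i].end <= cursor)
			continue;	/* entirely before the cursor */
		if (pending[i].start >= hole_end)
			break;		/* entirely past the candidate hole */

		/* The gap in front of this pending extent is [cursor, start). */
		if (pending[i].start > cursor &&
		    pending[i].start - cursor >= min_size) {
			*found_start = cursor;
			*found_size = pending[i].start - cursor;
			return true;
		}
		/* Too small: skip the pending extent and keep looking. */
		cursor = pending[i].end;
	}

	/* Tail gap after the last pending extent that overlaps the range. */
	if (cursor < hole_end && hole_end - cursor >= min_size) {
		*found_start = cursor;
		*found_size = hole_end - cursor;
		return true;
	}
	return false;
}
```

With pending extents at [0, 1G) and [2G, 3G) and a candidate hole of [0, 3G), a 2G request is rejected because the only real gap is [1G, 2G). A check that stops after the first pending extent would instead report [1G, 3G) as free, which is the overlap seen in the chunk map table above.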
jinbaohong	b291ad4458	btrfs: fix transaction commit blocking during trim of unallocated space When trimming unallocated space, btrfs_trim_fs() holds the device_list_mutex for the entire duration while iterating through all devices. On large filesystems with significant unallocated space, this operation can take minutes to hours on large storage systems. This causes a problem because btrfs_run_dev_stats(), which is called during transaction commit, also requires device_list_mutex: btrfs_trim_fs() mutex_lock(&fs_devices->device_list_mutex) list_for_each_entry(device, ...) btrfs_trim_free_extents(device) mutex_unlock(&fs_devices->device_list_mutex) commit_transaction() btrfs_run_dev_stats() mutex_lock(&fs_devices->device_list_mutex) // blocked! ... While trim is running, all transaction commits are blocked waiting for the mutex. Fix this by refactoring btrfs_trim_free_extents() to process devices in bounded chunks (up to 2GB per iteration) and release device_list_mutex between chunks. Signed-off-by: robbieko <robbieko@synology.com> Signed-off-by: jinbaohong <jinbaohong@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:25 +01:00
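The locking change described above (bounded chunks with the mutex dropped in between) can be sketched as follows. This is a toy model only: `struct toy_lock` and `trim_in_chunks()` are hypothetical, the lock merely records how it is used instead of blocking anyone, and no discard is actually issued.

```c
#include <assert.h>
#include <stdint.h>

/* Toy lock that records its usage; a real mutex would block other threads. */
struct toy_lock {
	int held;
	unsigned int acquisitions;
	uint64_t max_bytes_per_hold;
};

static void toy_lock_acquire(struct toy_lock *lock)
{
	assert(!lock->held);
	lock->held = 1;
	lock->acquisitions++;
}

static void toy_lock_release(struct toy_lock *lock)
{
	assert(lock->held);
	lock->held = 0;
}

/*
 * Process total_bytes in pieces of at most chunk_bytes, taking the lock
 * once per piece. Returns the number of bytes processed.
 */
uint64_t trim_in_chunks(struct toy_lock *lock, uint64_t total_bytes,
			uint64_t chunk_bytes)
{
	uint64_t done = 0;

	assert(chunk_bytes > 0);
	while (done < total_bytes) {
		uint64_t len = total_bytes - done;

		if (len > chunk_bytes)
			len = chunk_bytes;

		toy_lock_acquire(lock);
		/* A real implementation would discard [done, done + len) here. */
		if (len > lock->max_bytes_per_hold)
			lock->max_bytes_per_hold = len;
		done += len;
		toy_lock_release(lock);
		/* The lock is free here, so a waiting commit could take it. */
	}
	return done;
}
```

Trimming 5G with a 2G chunk size takes the lock three times, and no single hold covers more than 2G of work, so a waiter is delayed by at most one chunk rather than the whole trim.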
jinbaohong	bfb670b918	btrfs: handle user interrupt properly in btrfs_trim_fs() When a fatal signal is pending or the process is freezing, btrfs_trim_block_group() and btrfs_trim_free_extents() return -ERESTARTSYS. Currently this is treated as a regular error: the loops continue to the next iteration and count it as a block group or device failure. Instead, break out of the loops immediately and return -ERESTARTSYS to userspace without counting it as a failure. Also skip the device loop entirely if the block group loop was interrupted. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: jinbaohong <jinbaohong@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:25 +01:00
jinbaohong	1cc4ada418	btrfs: preserve first error in btrfs_trim_fs() When multiple block groups or devices fail during trim, preserve the first error encountered rather than the last one. The first error is typically more useful for debugging as it represents the original failure, while subsequent errors may be cascading effects. Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: jinbaohong <jinbaohong@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:24 +01:00
jinbaohong	912d1c6680	btrfs: continue trimming remaining devices on failure Commit `93bba24d4b` ("btrfs: Enhance btrfs_trim_fs function to handle error better") intended to make device trimming continue even if one device fails, tracking failures and reporting them at the end. However, it used 'break' instead of 'continue', causing the loop to exit on the first device failure. Fix this by replacing 'break' with 'continue'. Fixes: `93bba24d4b` ("btrfs: Enhance btrfs_trim_fs function to handle error better") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: jinbaohong <jinbaohong@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:24 +01:00
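The error-handling behaviour that the last three trim commits describe (continue past a failed device, keep the first error, stop at once on an interrupt) can be combined into one loop. The sketch below is a userspace approximation: `trim_all_devices()` is a hypothetical name, the per-device results are passed in as an array instead of being produced by real trim calls, and `SKETCH_ERESTARTSYS` is defined locally because `ERESTARTSYS` is kernel-internal and not available in userspace `<errno.h>`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Local stand-in for the kernel-internal ERESTARTSYS (assumption of this sketch). */
#define SKETCH_ERESTARTSYS 512

/*
 * Walk the per-device results in order.
 *  - An interrupt (-SKETCH_ERESTARTSYS) ends the loop immediately and is
 *    not counted as a failure.
 *  - Any other error is counted, and the loop continues with the next
 *    device (the 'continue' instead of 'break' fix).
 *  - The first error seen is the one returned.
 */
int trim_all_devices(const int *results, size_t count, unsigned int *failed)
{
	int first_err = 0;

	assert(failed != NULL);
	*failed = 0;

	for (size_t i = 0; i < count; i++) {
		int ret = results[i];

		if (ret == -SKETCH_ERESTARTSYS)
			return ret;
		if (ret) {
			(*failed)++;
			if (!first_err)
				first_err = ret;
			continue;
		}
	}
	return first_err;
}
```

For results `{0, -EIO, -ENOMEM, 0}` the function returns `-EIO` with two failures counted. With a `break` in place of the `continue`, the last two devices would never be visited.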
Filipe Manana	719dc4b755	btrfs: do not BUG_ON() in btrfs_remove_block_group() There's no need to BUG_ON(), we can just abort the transaction and return an error. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:24 +01:00
Filipe Manana	6f926597f9	btrfs: abort transaction on error in btrfs_remove_block_group() When btrfs_remove_block_group() fails we abort the transaction in its single caller (btrfs_remove_chunk()). This makes it harder to find out where exactly the failure happened, as several steps inside btrfs_remove_block_group() can fail. So make btrfs_remove_block_group() abort the transaction whenever an error happens, instead of aborting in its caller. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:24 +01:00
Boris Burkov	3a1f4264da	btrfs: fix block_group_tree dirty_list corruption When the incompat flag EXTENT_TREE_V2 is set, we unconditionally add the block group tree to the switch_commits list before calling switch_commit_roots, as we do for the tree root and the chunk root. However, the block group tree uses normal root dirty tracking and in any transaction that does an allocation and dirties a block group, the block group root will already be linked to a list by the dirty_list field and this use of list_add_tail() is invalid and corrupts the prev/next members of block_group_root->dirty_list. This is apparent on a subsequent list_del on the prev if we enable CONFIG_DEBUG_LIST: [32.1571] ------------[ cut here ]------------ [32.1572] list_del corruption. next->prev should be ffff958890202538, but was ffff9588992bd538. (next=ffff958890201538) [32.1575] WARNING: lib/list_debug.c:65 at 0x0, CPU#3: sync/607 [32.1583] CPU: 3 UID: 0 PID: 607 Comm: sync Not tainted 6.18.0 #24 PREEMPT(none) [32.1585] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-4.fc41 04/01/2014 [32.1587] RIP: 0010:__list_del_entry_valid_or_report+0x108/0x120 [32.1593] RSP: 0018:ffffaa288287fdd0 EFLAGS: 00010202 [32.1594] RAX: 0000000000000001 RBX: ffff95889326e800 RCX: ffff958890201538 [32.1596] RDX: ffff9588992bd538 RSI: ffff958890202538 RDI: ffffffff82a41e00 [32.1597] RBP: ffff958890202538 R08: ffffffff828fc1e8 R09: 00000000ffffefff [32.1599] R10: ffffffff8288c200 R11: ffffffff828e4200 R12: ffff958890201538 [32.1601] R13: ffff95889326e958 R14: ffff958895c24000 R15: ffff958890202538 [32.1603] FS: 00007f0c28eb5740(0000) GS:ffff958af2bd2000(0000) knlGS:0000000000000000 [32.1605] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [32.1607] CR2: 00007f0c28e8a3cc CR3: 0000000109942005 CR4: 0000000000370ef0 [32.1609] Call Trace: [32.1610] <TASK> [32.1611] switch_commit_roots+0x82/0x1d0 [btrfs] [32.1615] btrfs_commit_transaction+0x968/0x1550 [btrfs] [32.1618] ? 
btrfs_attach_transaction_barrier+0x23/0x60 [btrfs] [32.1621] __iterate_supers+0xe8/0x190 [32.1622] ? __pfx_sync_fs_one_sb+0x10/0x10 [32.1623] ksys_sync+0x63/0xb0 [32.1624] __do_sys_sync+0xe/0x20 [32.1625] do_syscall_64+0x73/0x450 [32.1626] entry_SYSCALL_64_after_hwframe+0x76/0x7e [32.1627] RIP: 0033:0x7f0c28d05d2b [32.1632] RSP: 002b:00007ffc9d988048 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2 [32.1634] RAX: ffffffffffffffda RBX: 00007ffc9d988228 RCX: 00007f0c28d05d2b [32.1636] RDX: 00007f0c28e02301 RSI: 00007ffc9d989b21 RDI: 00007f0c28dba90d [32.1637] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000 [32.1639] R10: 0000000000000000 R11: 0000000000000246 R12: 000055b96572cb80 [32.1641] R13: 000055b96572b19f R14: 00007f0c28dfa434 R15: 000055b96572b034 [32.1643] </TASK> [32.1644] irq event stamp: 0 [32.1644] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [32.1646] hardirqs last disabled at (0): [<ffffffff81298817>] copy_process+0xb37/0x2260 [32.1648] softirqs last enabled at (0): [<ffffffff81298817>] copy_process+0xb37/0x2260 [32.1650] softirqs last disabled at (0): [<0000000000000000>] 0x0 [32.1652] ---[ end trace 0000000000000000 ]--- Furthermore, this list corruption eventually (when we happen to add a new block group) results in getting the switch_commits and dirty_cowonly_roots lists mixed up and attempting to call update_root on the tree root which can't be found in the tree root, resulting in a transaction abort: [87.8269] BTRFS critical (device nvme1n1): unable to find root key (1 0 0) in tree 1 [87.8272] ------------[ cut here ]------------ [87.8274] BTRFS: Transaction aborted (error -117) [87.8275] WARNING: fs/btrfs/root-tree.c:153 at 0x0, CPU#4: sync/703 [87.8285] CPU: 4 UID: 0 PID: 703 Comm: sync Not tainted 6.18.0 #25 PREEMPT(none) [87.8287] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-4.fc41 04/01/2014 [87.8289] RIP: 0010:btrfs_update_root+0x296/0x790 [btrfs] [87.8295] RSP: 0018:ffffa58d035dfd60 EFLAGS: 00010282 
[87.8297] RAX: ffff9a59126ddb68 RBX: ffff9a59126dc000 RCX: 0000000000000000 [87.8299] RDX: 0000000000000000 RSI: 00000000ffffff8b RDI: ffffffffc0b28270 [87.8301] RBP: ffff9a5904aec000 R08: 0000000000000000 R09: 00000000ffffefff [87.8303] R10: ffffffff9ac8c200 R11: ffffffff9ace4200 R12: 0000000000000001 [87.8305] R13: ffff9a59041740e8 R14: ffff9a5904aec1f7 R15: ffff9a590fdefaf0 [87.8307] FS: 00007f54cde6b740(0000) GS:ffff9a5b5a81c000(0000) knlGS:0000000000000000 [87.8309] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [87.8310] CR2: 00007f54cde403cc CR3: 0000000112902004 CR4: 0000000000370ef0 [87.8312] Call Trace: [87.8313] <TASK> [87.8314] ? _raw_spin_unlock+0x23/0x40 [87.8315] commit_cowonly_roots+0x1ad/0x250 [btrfs] [87.8317] ? btrfs_commit_transaction+0x79b/0x1560 [btrfs] [87.8320] btrfs_commit_transaction+0x8aa/0x1560 [btrfs] [87.8322] ? btrfs_attach_transaction_barrier+0x23/0x60 [btrfs] [87.8325] __iterate_supers+0xf1/0x170 [87.8326] ? __pfx_sync_fs_one_sb+0x10/0x10 [87.8327] ksys_sync+0x63/0xb0 [87.8328] __do_sys_sync+0xe/0x20 [87.8329] do_syscall_64+0x73/0x450 [87.8330] entry_SYSCALL_64_after_hwframe+0x76/0x7e [87.8331] RIP: 0033:0x7f54cdd05d2b [87.8336] RSP: 002b:00007fff1b58ff78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2 [87.8338] RAX: ffffffffffffffda RBX: 00007fff1b590158 RCX: 00007f54cdd05d2b [87.8340] RDX: 00007f54cde02301 RSI: 00007fff1b592b66 RDI: 00007f54cddba90d [87.8342] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000 [87.8344] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e07ca96b80 [87.8346] R13: 000055e07ca9519f R14: 00007f54cddfa434 R15: 000055e07ca95034 [87.8348] </TASK> [87.8348] irq event stamp: 0 [87.8349] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [87.8351] hardirqs last disabled at (0): [<ffffffff99698797>] copy_process+0xb37/0x21e0 [87.8353] softirqs last enabled at (0): [<ffffffff99698797>] copy_process+0xb37/0x21e0 [87.8355] softirqs last disabled at (0): [<0000000000000000>] 0x0 [87.8357] 
---[ end trace 0000000000000000 ]--- [87.8358] BTRFS: error (device nvme1n1 state A) in btrfs_update_root:153: errno=-117 Filesystem corrupted [87.8360] BTRFS info (device nvme1n1 state EA): forced readonly [87.8362] BTRFS warning (device nvme1n1 state EA): Skipping commit of aborted transaction. [87.8364] BTRFS: error (device nvme1n1 state EA) in cleanup_transaction:2037: errno=-117 Filesystem corrupted Since the block group tree was pulled out of the extent tree and uses normal root dirty tracking, remove the offending extra list_add. This fixes the list corruption and the resulting fs corruption. Fixes: `14033b08a0` ("btrfs: don't save block group root into super block") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:24 +01:00
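Why a second `list_add_tail()` on an already linked entry corrupts the first list can be reproduced with a small userspace list. The code below is a minimal re-implementation written for illustration, not the kernel's `<linux/list.h>`; `list_node`, `list_add_tail_node()` and `list_is_consistent()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list, modelled loosely on the kernel's. */
struct list_node {
	struct list_node *next;
	struct list_node *prev;
};

void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

/*
 * Insert entry at the tail of the list rooted at head. Like the kernel
 * helper, this assumes entry is not currently linked anywhere: it simply
 * overwrites entry->next and entry->prev.
 */
void list_add_tail_node(struct list_node *entry, struct list_node *head)
{
	struct list_node *prev = head->prev;

	assert(entry != head);
	entry->next = head;
	entry->prev = prev;
	prev->next = entry;
	head->prev = entry;
}

/* Check the invariant node->next->prev == node for every node on the list. */
bool list_is_consistent(const struct list_node *head)
{
	const struct list_node *pos = head;

	do {
		if (pos->next->prev != pos)
			return false;
		pos = pos->next;
	} while (pos != head);
	return true;
}
```

If an entry that already sits on list A is added to list B without being removed first, A's head still points at the entry, but the entry's `prev` now points into B. That broken `next->prev` relation is what the `list_del corruption` report above detects.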
Johannes Thumshirn	c757edbef9	btrfs: fix copying the flags of btrfs_bio after split When a btrfs_bio gets split, only 'bbio->csum_search_commit_root' gets copied to the new btrfs_bio, all the other flags don't. When a bio is split in btrfs_submit_chunk(), btrfs_split_bio() creates the new split bio via btrfs_bio_init() which zeroes the struct with memset. Looking at btrfs_split_bio(), it copies csum_search_commit_root from the original but does not copy can_use_append. After the split, the code does: bbio = split; bio = &bbio->bio; This means the split bio (with can_use_append = false) gets submitted, not the original. In btrfs_submit_dev_bio(), the condition: if (btrfs_bio(bio)->can_use_append && btrfs_dev_is_sequential(...)) Will be false for the split bio even when writing to a sequential zone. Does the split bio need to inherit can_use_append from the original? The old code used a local variable use_append which persisted across the split. Copy the rest of the flags as well. Link: https://lore.kernel.org/linux-btrfs/20260125132120.2525146-1-clm@meta.com/ Reported-by: Chris Mason <clm@meta.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:24 +01:00
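The split problem above comes down to a zero-initialised copy that inherits only some of the original's flags. A minimal sketch follows, assuming a toy structure: `toy_bbio`, `toy_bbio_init()` and `toy_bbio_split()` are hypothetical, and only the two flag names `csum_search_commit_root` and `can_use_append` are taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for a bio wrapper carrying a length and two flags. */
struct toy_bbio {
	unsigned int length;
	bool csum_search_commit_root;
	bool can_use_append;
};

/* Like the init helper described above: zero the whole struct. */
void toy_bbio_init(struct toy_bbio *bbio)
{
	memset(bbio, 0, sizeof(*bbio));
}

/*
 * Carve split_len bytes off the front of orig into split. Because split
 * starts out zeroed, every flag that should survive has to be copied
 * explicitly; copying only one of them silently resets the other.
 */
void toy_bbio_split(struct toy_bbio *orig, struct toy_bbio *split,
		    unsigned int split_len)
{
	assert(split_len > 0 && split_len < orig->length);

	toy_bbio_init(split);
	split->length = split_len;
	orig->length -= split_len;

	split->csum_search_commit_root = orig->csum_search_commit_root;
	split->can_use_append = orig->can_use_append;
}
```

Since the split part is the one that gets submitted, a missing copy of `can_use_append` would make it look like a regular write even when the original was marked for append.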
Johannes Thumshirn	3fe608dbac	btrfs: zoned: use local fs_info variable in btrfs_load_block_group_dup() btrfs_load_block_group_dup() has a local pointer to fs_info, yet the error prints dereference fs_info from the block_group. Use local fs_info variable to make the code more uniform. Reviewed-by: Daniel Vacek <neelx@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:23 +01:00
Naohiro Aota	52ee9965d0	btrfs: zoned: fixup last alloc pointer after extent removal for RAID0/10 When a block group is composed of a sequential write zone and a conventional zone, we recover the (pseudo) write pointer of the conventional zone using the end of the last allocated position. However, if the last extent in a block group is removed, the last extent position will be smaller than the other real write pointer position. Then, that will cause an error due to mismatch of the write pointers. We can fixup this case by moving the alloc_offset to the corresponding write pointer position. Fixes: `568220fa96` ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:23 +01:00
Naohiro Aota	e2d848649e	btrfs: zoned: fixup last alloc pointer after extent removal for DUP When a block group is composed of a sequential write zone and a conventional zone, we recover the (pseudo) write pointer of the conventional zone using the end of the last allocated position. However, if the last extent in a block group is removed, the last extent position will be smaller than the other real write pointer position. Then, that will cause an error due to mismatch of the write pointers. We can fixup this case by moving the alloc_offset to the corresponding write pointer position. Fixes: `c0d90a79e8` ("btrfs: zoned: fix alloc_offset calculation for partly conventional block groups") CC: stable@vger.kernel.org # 6.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:23 +01:00
Naohiro Aota	dda3ec9ee6	btrfs: zoned: fixup last alloc pointer after extent removal for RAID1 When a block group is composed of a sequential write zone and a conventional zone, we recover the (pseudo) write pointer of the conventional zone using the end of the last allocated position. However, if the last extent in a block group is removed, the last extent position will be smaller than the other real write pointer position. Then, that will cause an error due to mismatch of the write pointers. We can fixup this case by moving the alloc_offset to the corresponding write pointer position. Fixes: `568220fa96` ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:23 +01:00
Filipe Manana	3f8982543d	btrfs: remove out label in btrfs_wait_for_commit() There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:23 +01:00
Filipe Manana	5eb01bf4a9	btrfs: remove out label in btrfs_init_space_info() There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:23 +01:00
Filipe Manana	cefef3cc12	btrfs: remove out label in btrfs_check_rw_degradable() There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:23 +01:00
Filipe Manana	61fb7f04ee	btrfs: remove out label in finish_verity() There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:22 +01:00
Filipe Manana	6329592ca6	btrfs: remove out label in scrub_find_fill_first_stripe() There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:22 +01:00
Filipe Manana	55807025a6	btrfs: remove out label in lzo_decompress() There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:22 +01:00
Filipe Manana	610ff1c9df	btrfs: remove out label in btrfs_mark_extent_written() There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:22 +01:00
Filipe Manana	cc27540dd0	btrfs: remove out label in btrfs_csum_file_blocks() There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:22 +01:00
Filipe Manana	bb09b9a491	btrfs: remove out_failed label in find_lock_delalloc_range() There is no point in having the label since all it does is return the value in the 'found' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:22 +01:00
Filipe Manana	2efcd25a76	btrfs: remove out label in load_extent_tree_free() There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:22 +01:00
Filipe Manana	1038614e8f	btrfs: remove pointless out labels from uuid-tree.c Some functions (btrfs_uuid_iter_rem() and btrfs_check_uuid_tree_entry()) have an 'out' label that does nothing but return, making it pointless. Simplify this by removing the label and returning instead of gotos plus setting the 'ret' variable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:21 +01:00
Filipe Manana	47c9dbc791	btrfs: remove pointless out labels from inode.c Some functions (insert_inline_extent() and insert_reserved_file_extent()) have an 'out' label that does nothing but return, making it pointless. Simplify this by removing the label and returning instead of gotos plus setting the 'ret' variable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:21 +01:00
Filipe Manana	46099eaef3	btrfs: remove pointless out labels from free-space-cache.c Some functions (update_cache_item(), find_free_space(), trim_bitmaps(), btrfs_remove_free_space() and cleanup_free_space_cache_v1()) have an 'out' label that does nothing but return, making it pointless. Simplify this by removing the label and returning instead of gotos plus setting the 'ret' variable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:21 +01:00
Filipe Manana	ea8f921005	btrfs: remove pointless out labels from extent-tree.c Some functions (lookup_extent_data_ref(), __btrfs_mod_ref() and btrfs_free_tree_block()) have an 'out' label that does nothing but return, making it pointless. Simplify this by removing the label and returning instead of gotos plus setting the 'ret' variable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:21 +01:00
Filipe Manana	3ca4f9d096	btrfs: remove pointless out labels from disk-io.c Some functions (btrfs_validate_extent_buffer() and btrfs_start_pre_rw_mount()) have an 'out' label that does nothing but return, making it pointless. Simplify this by removing the label and returning instead of gotos plus setting the 'ret' variable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:21 +01:00
Filipe Manana	b3acb158ea	btrfs: remove pointless out labels from qgroup.c Some functions (__del_qgroup_relation() and qgroup_trace_new_subtree_blocks()) have an 'out' label that does nothing but return, making it pointless. Simplify this by removing the label and returning instead of gotos plus setting the 'ret' variable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:21 +01:00
Filipe Manana	ccba88cb6a	btrfs: remove pointless out labels from send.c Some functions (process_extent(), process_recorded_refs_if_needed(), changed_inode(), compare_refs() and changed_cb()) have an 'out' label that does nothing but return, making it pointless. Simplify this by removing the label and returning instead of gotos plus setting the 'ret' variable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:20 +01:00
Filipe Manana	01f93271ed	btrfs: remove pointless out labels from ioctl.c Some functions (__btrfs_ioctl_snap_create(), btrfs_ioctl_subvol_setflags() and copy_to_sk()) have an 'out' label that does nothing but return, making it pointless. Simplify this by removing the label and returning instead of gotos plus setting up the 'ret' variable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:20 +01:00
Filipe Manana	51b1fcf71c	btrfs: qgroup: return correct error when deleting qgroup relation item If we fail to delete the second qgroup relation item, we end up returning success or -ENOENT in case the first item does not exist, instead of returning the error from the second item deletion. Fixes: `73798c465b` ("btrfs: qgroup: Try our best to delete qgroup relations") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:20 +01:00
David Sterba	8ad2f2edc8	btrfs: pass btrfs_fs_info to btrfs_first_delayed_node() As the delayed root is now in the fs_info we can pass it to btrfs_first_delayed_node(). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:20 +01:00
David Sterba	2891539a26	btrfs: don't use local variables for fs_info->delayed_root In all cases the delayed_root is used once in a function, we don't need to use a local variable for that. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:20 +01:00
David Sterba	86523d8d2f	btrfs: reorder members in btrfs_delayed_root for better packing There are two unnecessary 4B holes in btrfs_delayed_root; struct btrfs_delayed_root { spinlock_t lock; /* 0 4 */ /* XXX 4 bytes hole, try to pack */ struct list_head node_list; /* 8 16 */ struct list_head prepare_list; /* 24 16 */ atomic_t items; /* 40 4 */ atomic_t items_seq; /* 44 4 */ int nodes; /* 48 4 */ /* XXX 4 bytes hole, try to pack */ wait_queue_head_t wait; /* 56 24 */ /* size: 80, cachelines: 2, members: 7 */ /* sum members: 72, holes: 2, sum holes: 8 */ /* last cacheline: 16 bytes */ }; Reordering 'nodes' after 'lock' reduces size by 8B, to 72 on release config. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:20 +01:00
David Sterba	c8bafc8d6a	btrfs: embed delayed root to struct btrfs_fs_info The fs_info::delayed_root is allocated dynamically but there's only one instance per filesystem so we can embed it into the fs_info itself. The two object have the same lifetime and delayed roots are always present so we don't need to allocate it on demand from slab. There's still some space left in fs_info until the 4K so there won't be an spill over to next page on release config (size grows from 3880 to 3952). In case we want to shrink fs_info there are still holes to fill or we can separate other non-core or optional structures if needed. Link: https://lore.kernel.org/all/cover.1767979013.git.dsterba@suse.com/ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:20 +01:00
Qu Wenruo	71e545d4e3	btrfs: add strict extent map alignment checks Currently we do not check the alignment of extent_map structure. The reasons are the inode and extent-map tests use unaligned values for start offsets and lengths. Thankfully those legacy problems are properly addressed by previous patches, now we can finally put the alignment checks into validate_extent_map(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:19 +01:00
Qu Wenruo	385c65f827	btrfs: tests: prepare extent map tests for strict alignment checks Currently the extent map self tests have the following points that will cause false alerts for the incoming strict extent map alignment checks: - Incorrect inlined extent map size Which is not following what the kernel is doing for inlined extents, as btrfs_extent_item_to_extent_map() always uses the fs block size as the length, not the ram_bytes. Fix it by using SZ_4K as extent map's length. - Incorrect btrfs_fs_info::sectorsize As we always use PAGE_SIZE, which can be values larger than 4K. Meanwhile all the immediate numbers used are based on 4K fs block size in the test case. Fix it by using fixed SZ_4K fs block size when allocating the dummy btrfs_fs_info. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:19 +01:00
Qu Wenruo	d77b90cfe0	btrfs: tests: remove invalid file extent map tests In the inode self tests, there are several problems: - Invalid file extents E.g. hole range [4K, 4K + 4) is completely invalid. Only inlined extent maps can have an unaligned ram_bytes, and even for that case, the generated extent map will use sectorsize as em->len. - Unaligned hole after inlined extent The kernel never does this by itself, the current btrfs_get_extent() will only return a single inlined extent map that covers the first block. - Incorrect numbers in the comment E.g. 12291 no matter if you add or dec 1, is not aligned to 4K. The properly number for 12K is 12288, I don't know why there is even a diff of 3, and this completely doesn't match the extent map we inserted later. - Hard-to-modify sequence in setup_file_extents() If some unfortunate person, just like me, needs to modify setup_file_extents(), good luck not screwing up the file offset. Fix them by: - Remove invalid unaligned extent maps This mostly means remove the [4K, 4K + 4) hole case. The remaining ones are already properly aligned. This slightly changes the on-disk data extent allocation, with that removed, the regular extents at [4K, 8K) and [8K , 12K) can be merged. So also add a 4K gap between those two data extents to prevent em merge. - Remove the implied hole after an inlined extent Just like what the kernel is doing for inlined extents in the real world. - Update the commit using proper numbers with 'K' suffixes Since there is no unaligned range except the first inlined one, we can always use numbers with 'K' suffixes, which is way more easier to read, and will always be aligned to 1024 at least. - Add comments in setup_file_extents() So that we're clear about the file offset for each test file extent. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:19 +01:00
Filipe Manana	571e75f4c0	btrfs: unfold transaction aborts in btrfs_finish_one_ordered() We have a single transaction abort that can be caused either by a failure from a call to btrfs_mark_extent_written(), if we are dealing with a write to a prealloc extent, or otherwise from a call to insert_ordered_extent_file_extent(). So when the transaction abort happens we can not know for sure which case failed. Unfold the aborts so that it's clear in case of a failure. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:19 +01:00
Filipe Manana	a8bec25e01	btrfs: deal with missing root in sample_block_group_extent_item() In case the root does not exists, which is unexpected, btrfs_extent_root() returns NULL, but we ignore that and so if it happens we can trigger a NULL pointer dereference later. So verify if we found the root and log an error message in case it's missing. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:19 +01:00
Filipe Manana	79d51b5c7a	btrfs: remove bogus root search condition in sample_block_group_extent_item() There's no need to pass the maximum between the block group's start offset and BTRFS_SUPER_INFO_OFFSET (64K) since we can't have any block groups allocated in the first megabyte, as that's reserved space. Furthermore, even if we could, the correct thing to do was to pass the block group's start offset anyway - and that's precisely what we do for block groups that happen to contain superblock mirror (the range for the super block is never marked as free and it's marked as dirty in the fs_info->excluded_extents io tree). So simplify this and get rid of that maximum expression. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:19 +01:00
Qu Wenruo	7c2830f00c	btrfs: fallback to buffered IO if the data profile has duplication [BACKGROUND] Inspired by a recent kernel bug report, which is related to direct IO buffer modification during writeback, that leads to contents mismatch of different RAID1 mirrors. [CAUSE AND PROBLEMS] The root cause is exactly the same explained in commit `968f19c5b1` ("btrfs: always fallback to buffered write if the inode requires checksum"), that we can not trust direct IO buffer which can be modified halfway during writeback. Unlike data checksum verification, if this happened on inodes without data checksum but has the data has extra mirrors, it will lead to stealth data mismatch on different mirrors. This will be way harder to detect without data checksum. Furthermore for RAID56, we can even have data without checksum and data with checksum mixed inside the same full stripe. In that case if the direct IO buffer got changed halfway for the nodatasum part, the data with checksum immediately lost its ability to recover, e.g.: " " = Good old data or parity calculated using good old data "X" = Data modified during writeback 0 32K 64K Data 1 | | Has csum Data 2 |XXXXXXXXXXXXXXXX | No csum Parity | | In above case, the parity is calculated using data 1 (has csum, from page cache, won't change during writeback), and old data 2 (has no csum, direct IO write). After parity is calculated, but before submission to the storage, direct IO buffer of data 2 is modified, causing the range [0, 32K) of data 2 has a different content. Now all data is submitted to the storage, and the fs got fully synced. Then the device of data 1 is lost, has to be rebuilt from data 2 and parity. But since the data 2 has some modified data, and the parity is calculated using old data, the recovered data is no the same for data 1, causing data checksum mismatch. [FIX] Fix the problem by checking the data allocation profile. If our data allocation profile is either RAID0 or SINGLE, we can allow true zero-copy direct IO and the end user is fully responsible for any race. However this is not going to fix all situations, as it's still possible to race with balance where the fs got a new data profile after the data allocation profile check. But this fix should still greatly reduce the window of the original bug. Link: https://bugzilla.kernel.org/show_bug.cgi?id=99171 Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:19 +01:00
Filipe Manana	954f3217f6	btrfs: assert block group is locked in btrfs_use_block_group_size_class() It's supposed to be called with the block group locked, in order to read and set its size_class member, so assert it's locked. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:18 +01:00
Filipe Manana	0bf63d385f	btrfs: don't pass block group argument to load_block_group_size_class() There's no need to pass the block group since we can extract it from the given caching control structure. Same goes for its helper function sample_block_group_extent_item(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:18 +01:00
Filipe Manana	e46a9f84bf	btrfs: allocate path on stack in load_block_group_size_class() Instead of allocating and freeing a path in every iteration of load_block_group_size_class(), through its helper function sample_block_group_extent_item(), allocate the path in the former and pass it to the later. The path is allocated on stack since it's short and we are in a workqueue context so there's not much stack usage. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:18 +01:00
Filipe Manana	17078525e5	btrfs: make load_block_group_size_class() return void There's no point in returning anything since determining and setting a size class for a block group is an optimization, not something critical. The only caller of load_block_group_size_class() (the caching thread) does not do anything with the return value anyway, exactly because having a size class is just an optimization and it can always be set later when adding reserved bytes to a block group (btrfs_add_reserved_bytes()). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:18 +01:00
Qu Wenruo	1914b94231	btrfs: zstd: use folio_iter to handle zstd_decompress_bio() Currently zstd_decompress_bio() is using compressed_bio->compressed_folios[] array to grab each compressed folio. However cb->compressed_folios[] is just a pointer to each folio of the compressed bio, meaning we can just replace the compressed_folios[] array by just grabbing the folio inside the compressed bio. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:18 +01:00

1414337 Commits