linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-17 08:24:21 -04:00

Author	SHA1	Message	Date
Mark Harmstone	68a135013b	btrfs: fix bytes_may_use leak in move_existing_remap() If the call to btrfs_reserve_extent() in move_existing_remap() returns a smaller extent than we asked for, currently we're not undoing the bytes_may_use change that we made. Fix this by calling btrfs_space_info_update_bytes_may_use() again for the difference. Fixes: `bbea42dfb9` ("btrfs: move existing remaps before relocating block group") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-21 04:00:32 +02:00
Boris Burkov	fc3d532881	btrfs: btrfs_log_dev_io_error() on all bio errors As far as I can tell, we never intentionally constrained ourselves to these status codes, and it is misleading and surprising to lack the bdev error logging when we get a different error code from the block layer. This can lead to jumping to a wrong conclusion like "this system didn't see any bio failures but aborted with EIO". For example on nvme devices, I observe many failures coming back as BLK_STS_MEDIUM. It is apparent that the nvme driver returns a variety of BLK_STS_* status values in nvme_error_status(). So handle the known expected errors and make some noise on the rest which we expect won't really happen. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 20:00:29 +02:00
Michal Grzedzicki	3cd181cc46	btrfs: fix silent IO error loss in encoded writes and zoned split can_finish_ordered_extent() and btrfs_finish_ordered_zoned() set BTRFS_ORDERED_IOERR via bare set_bit(). Later, btrfs_mark_ordered_extent_error() in btrfs_finish_one_ordered() uses test_and_set_bit(), finds it already set, and skips mapping_set_error(). The error is never recorded on the inode's address_space, making it invisible to fsync. For encoded writes this causes btrfs receive to silently produce files with zero-filled holes. Fix: replace bare set_bit(BTRFS_ORDERED_IOERR) with btrfs_mark_ordered_extent_error() which pairs test_and_set_bit() with mapping_set_error(), guaranteeing the error is recorded exactly once. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: Michal Grzedzicki <mge@meta.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 19:43:50 +02:00
Dave Chen	e0dfaebb8f	btrfs: skip clearing EXTENT_DEFRAG for NOCOW ordered extents In btrfs_finish_one_ordered(), clear_bits is unconditionally initialized with EXTENT_DEFRAG. For NOCOW ordered extents this is always a no-op because should_nocow() already forces the COW path when EXTENT_DEFRAG is set, so a NOCOW ordered extent can never have EXTENT_DEFRAG on its range. Although harmless, the unconditional btrfs_clear_extent_bit() call still performs a cold rbtree lookup under the io tree spinlock on every NOCOW write completion. Avoid this by only adding EXTENT_DEFRAG to clear_bits for non-NOCOW ordered extents, and skip the call entirely when there are no bits to clear. Signed-off-by: Dave Chen <davechen@synology.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 19:43:22 +02:00
Dave Chen	e70e3f858e	btrfs: use BTRFS_FS_UPDATE_UUID_TREE_GEN flag for UUID tree rescan check The UUID tree rescan check in open_ctree() compares fs_info->generation with the superblock's uuid_tree_generation. This comparison is not reliable because fs_info->generation is bumped at transaction start time in join_transaction(), while uuid_tree_generation is only updated at commit time via update_super_roots(). Between the early BTRFS_FS_UPDATE_UUID_TREE_GEN flag check and the late rescan decision, mount operations such as file orphan cleanup from an unclean shutdown start transactions without committing them. This advances fs_info->generation past uuid_tree_generation and produces a false-positive mismatch. Use the BTRFS_FS_UPDATE_UUID_TREE_GEN flag directly instead. The flag was already set earlier in open_ctree() when the generations were known to match, and accurately represents "UUID tree is up to date" without being affected by subsequent transaction starts. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Dave Chen <davechen@synology.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 19:43:06 +02:00
Filipe Manana	e1194226bf	btrfs: remove duplicate journal_info reset on failure to commit transaction If we get an error during the transaction commit path, we are resetting current->journal_info to NULL twice - once in btrfs_commit_transaction() right before calling cleanup_transaction() and then once again inside cleanup_transaction(). Remove the instance in btrfs_commit_transaction(). Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 19:42:24 +02:00
Filipe Manana	7801f3ea95	btrfs: tag as unlikely if statements that check for fs in error state Having the filesystem in an error state, meaning we had a transaction abort, is unexpected. Mark every check for the error state with the unlikely annotation to convey that and to allow the compiler to generate better code. On x86_64, using gcc 14.2.0-19 from Debian, resulted in a slightly reduced object size and better code. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 2008598 175912 15592 2200102 219226 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 2008450 175912 15592 2199954 219192 fs/btrfs/btrfs.ko Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 19:41:42 +02:00
Guangshuo Li	3f487be812	btrfs: fix double free in create_space_info() error path When kobject_init_and_add() fails, the call chain is: create_space_info() -> btrfs_sysfs_add_space_info_type() -> kobject_init_and_add() -> failure -> kobject_put(&space_info->kobj) -> space_info_release() -> kfree(space_info) Then control returns to create_space_info(): btrfs_sysfs_add_space_info_type() returns error -> goto out_free -> kfree(space_info) This causes a double free. Keep the direct kfree(space_info) for the earlier failure path, but after btrfs_sysfs_add_space_info_type() has called kobject_put(), let the kobject release callback handle the cleanup. Fixes: `a11224a016` ("btrfs: fix memory leaks in create_space_info() error paths") CC: stable@vger.kernel.org # 6.19+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:09 +02:00
Guangshuo Li	a7449edf96	btrfs: fix double free in create_space_info_sub_group() error path When kobject_init_and_add() fails, the call chain is: create_space_info_sub_group() -> btrfs_sysfs_add_space_info_type() -> kobject_init_and_add() -> failure -> kobject_put(&sub_group->kobj) -> space_info_release() -> kfree(sub_group) Then control returns to create_space_info_sub_group(), where: btrfs_sysfs_add_space_info_type() returns error -> kfree(sub_group) Thus, sub_group is freed twice. Keep parent->sub_group[index] = NULL for the failure path, but after btrfs_sysfs_add_space_info_type() has called kobject_put(), let the kobject release callback handle the cleanup. Fixes: `f92ee31e03` ("btrfs: introduce btrfs_space_info sub-group") CC: stable@vger.kernel.org # 6.18+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:09 +02:00
Qu Wenruo	3c0c45a4df	btrfs: do not reject a valid running dev-replace [BUG] There is a bug report that a btrfs with running dev-replace got rejected with the following messages: BTRFS error (device sdk1): devid 0 path /dev/sdk1 is registered but not found in chunk tree BTRFS error (device sdk1): remove the above devices or use 'btrfs device scan --forget <dev>' to unregister them before mount BTRFS error (device sdk1): open_ctree failed: -117 [CAUSE] The tree and super block dumps show the fs is completely sane, except one thing, there is no dev item for devid 0 in chunk tree. However this is not a bug, as we do not insert dev item for devid 0 in the first place. Since the devid 0 is only there temporarily we do not really need to insert a dev item for it and then later remove it again. It is the commit `3430818739` ("btrfs: add extra device item checks at mount") adding a overly strict check that triggers a false alert and rejected the valid filesystem. [FIX] Add a special handling for devid 0, and doesn't require devid 0 to have a device item in chunk tree. Reported-by: Jaron Viëtor <jaron@vietors.com> Link: https://lore.kernel.org/linux-btrfs/CAF1bhLVYLZvD=j2XyuxXDKD-NWNJAwDnpVN+UYeQW-HbzNRn1A@mail.gmail.com/ Fixes: `3430818739` ("btrfs: add extra device item checks at mount") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:08 +02:00
Qu Wenruo	48aa5c0e2b	btrfs: only invalidate btree inode pages after all ebs are released In close_ctree(), we call invalidate_inode_pages2() to invalidate all pages from btree inode. But the problem is, it never returns 0, but always -EBUSY. The problem is that we are still holding all the essential tree root nodes, thus pages holding those tree blocks can not be invalidated thus invalidate_inode_pages2() always returns -EBUSY. This is also against the error cleanup path of open_ctree(), which properly frees all root pointers before calling invalidate_inode_pages(). So fix the order by delaying invalidate_inode_pages2() until we have freed all root pointers. Reviewed-by: Anand Jain <asj@kernel.org> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:08 +02:00
JP Kobryn (Meta)	7ae37b2c94	btrfs: prevent direct reclaim during compressed readahead Under memory pressure, direct reclaim can kick in during compressed readahead. This puts the associated task into D-state. Then shrink_lruvec() disables interrupts when acquiring the LRU lock. Under heavy pressure, we've observed reclaim can run long enough that the CPU becomes prone to CSD lock stalls since it cannot service incoming IPIs. Although the CSD lock stalls are the worst case scenario, we have found many more subtle occurrences of this latency on the order of seconds, over a minute in some cases. Prevent direct reclaim during compressed readahead. This is achieved by using different GFP flags at key points when the bio is marked for readahead. There are two functions that allocate during compressed readahead: btrfs_alloc_compr_folio() and add_ra_bio_pages(). Both currently use GFP_NOFS which includes __GFP_DIRECT_RECLAIM. For the internal API call btrfs_alloc_compr_folio(), the signature changes to accept an additional gfp_t parameter. At the readahead call site, it gets flags similar to GFP_NOFS but stripped of __GFP_DIRECT_RECLAIM. __GFP_NOWARN is added since these allocations are allowed to fail. Demand reads still use full GFP_NOFS and will enter reclaim if needed. All other existing call sites of btrfs_alloc_compr_folio() now explicitly pass GFP_NOFS to retain their current behavior. add_ra_bio_pages() gains a bool parameter which allows callers to specify if they want to allow direct reclaim or not. In either case, the __GFP_NOWARN flag was added unconditionally since the allocations are speculative. There has been some previous work done on calling add_ra_bio_pages() [0]. This patch is complementary: where that patch reduces call frequency, this patch reduces the latency associated with those calls. [0] https://lore.kernel.org/linux-btrfs/656838ec1232314a2657716e59f4f15a8eadba64.1751492111.git.boris@bur.io/ Reviewed-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:08 +02:00
Teng Liu	30d537f723	btrfs: replace BUG_ON() with error return in cache_save_setup() In cache_save_setup(), if create_free_space_inode() succeeds but the subsequent lookup_free_space_inode() still fails on retry, the BUG_ON(retries) will crash the kernel. This can happen due to I/O errors or transient failures, not just programming bugs. Replace the BUG_ON with proper error handling that returns the original error code through the existing cleanup path. The callers already handle this gracefully: disk_cache_state defaults to BTRFS_DC_ERROR, so the space cache simply won't be written for that block group. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Teng Liu <27rabbitlt@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:08 +02:00
David Sterba	f0d3b4c7b8	btrfs: zstd: don't cache sectorsize in a local variable The sectorsize is used once or at most twice in the callbacks, no need to cache it on stack. Minor effect on zstd_compress_folios() where it saves 8 bytes of stack. Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:08 +02:00
David Sterba	efcf0898a6	btrfs: zlib: don't cache sectorsize in a local variable The sectorsize is used once or at most twice in the callbacks, no need to cache it on stack. Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:08 +02:00
David Sterba	4d083672b4	btrfs: zlib: drop redundant folio address variable We're caching the current output folio address but it's not really necessary as we store it in the variable and then pass it to the stream context. We can read the folio address directly. Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:07 +02:00
David Sterba	5b93f24168	btrfs: lzo: inline read/write length helpers The LZO_LEN read/write helpers are supposed to be trivial and we're duplicating the put/get unaligned helpers so use them directly. Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:07 +02:00
David Sterba	463626a2ec	btrfs: use common eb range validation in read_extent_buffer_to_user_nofault() The extent buffer access is checked in other helpers by check_eb_range(), which validates the requested start, length against the extent buffer. While this almost never fails we should still handle it as an error and not just warn. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:07 +02:00
David Sterba	b8aa337121	btrfs: read eb folio index right before loops There are generic helpers to access extent buffer folio data of any length, potentially iterating over a few of them. This is a slow path, either we use the type based accessors or the eb folio allocation is contiguous and we can use the memcpy/memcmp helpers. The initialization of 'i' is done at the beginning though it may not be needed. Move it right before the folio loop, this has minor effect on generated code in __write_extent_buffer(). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:07 +02:00
David Sterba	aae9042194	btrfs: rename local variable for offset in folio Use proper abbreviation of the 'offset in folio' in the variable name, same as we have in accessors.c. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:07 +02:00
David Sterba	a5b6b23c45	btrfs: unify types for binary search variables The variables calculating where to jump next are using mixed in types which requires some conversions on the instruction level. Using 'u32' removes one call to 'movslq', making the main loop shorter. This complements type conversion done in `a724f313f8` ("btrfs: do unsigned integer division in the extent buffer binary search loop") Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:07 +02:00
David Sterba	7e1e45a9e4	btrfs: remove duplicate calculation of eb offset in btrfs_bin_search() In the main search loop the variable 'oil' (offset in folio) is set twice, one duplicated when the key fits completely to the contiguous range. We can remove it and while it's just a simple calculation, the binary search loop is executed many times so micro optimizations add up. The code size is reduced by 64 bytes on release config, the loop is reorganized a bit and a few instructions shorter. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:07 +02:00
Mark Harmstone	b753612be0	btrfs: tree-checker: add remap-tree checks to check_block_group_item() Add some write-time checks for block group items relating to the remap tree. Here we're checking: * That the REMAPPED or METADATA_REMAP flags aren't set unless the REMAP_TREE incompat flag is also set * That `remap_bytes` isn't more than the size of the block group * That `identity_remap_count` isn't more than the number of sectors in the block group Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:06 +02:00
Filipe Manana	e3799e65c1	btrfs: make btrfs_free_log() and btrfs_free_log_root_tree() return void These functions never fail, always return success (0) and none of the callers care about their return values. Change their return type from int to void. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:06 +02:00
Filipe Manana	b48c980b6a	btrfs: fix deadlock between reflink and transaction commit when using flushoncommit When using the flushoncommit mount option, we can have a deadlock between a transaction commit and a reflink operation that copied an inline extent to an offset beyond the current i_size of the destination node. The deadlock happens like this: 1) Task A clones an inline extent from inode X to an offset of inode Y that is beyond Y's current i_size. This means we copied the inline extent's data to a folio of inode Y that is beyond its EOF, using a call to copy_inline_to_page(); 2) Task B starts a transaction commit and calls btrfs_start_delalloc_flush() to flush delalloc; 3) The delalloc flushing sees the new dirty folio of inode Y and when it attempts to flush it, it ends up at extent_writepage() and sees that the offset of the folio is beyond the i_size of inode Y, so it attempts to invalidate the folio by calling folio_invalidate(), which ends up at btrfs' folio invalidate callback - btrfs_invalidate_folio(). There it tries to lock the folio's range in inode Y's extent io tree, but it blocks since it's currently locked by task A - during a reflink we lock the inodes and the source and destination ranges after flushing all delalloc and waiting for ordered extent completion - after that we don't expect to have dirty folios in the ranges, the exception is if we have to copy an inline extent's data (because the destination offset is not zero); 4) Task A then attempts to start a transaction to update the inode item, and then it's blocked since the current transaction is in the TRANS_STATE_COMMIT_START state. Therefore task A has to wait for the current transaction to become unblocked (its state >= TRANS_STATE_UNBLOCKED). So task A is waiting for the transaction commit done by task B, and the later waiting on the extent lock of inode Y that is currently held by task A. Syzbot recently reported this with the following stack traces: INFO: task kworker/u8:7:1053 blocked for more than 143 seconds. Not tainted syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u8:7 state:D stack:23520 pid:1053 tgid:1053 ppid:2 task_flags:0x4208060 flags:0x00080000 Workqueue: writeback wb_workfn (flush-btrfs-46) Call Trace: <TASK> context_switch kernel/sched/core.c:5298 [inline] __schedule+0x1553/0x5240 kernel/sched/core.c:6911 __schedule_loop kernel/sched/core.c:6993 [inline] schedule+0x164/0x360 kernel/sched/core.c:7008 wait_extent_bit fs/btrfs/extent-io-tree.c:811 [inline] btrfs_lock_extent_bits+0x59c/0x700 fs/btrfs/extent-io-tree.c:1914 btrfs_lock_extent fs/btrfs/extent-io-tree.h:152 [inline] btrfs_invalidate_folio+0x43d/0xc40 fs/btrfs/inode.c:7704 extent_writepage fs/btrfs/extent_io.c:1852 [inline] extent_write_cache_pages fs/btrfs/extent_io.c:2580 [inline] btrfs_writepages+0x12ff/0x2440 fs/btrfs/extent_io.c:2713 do_writepages+0x32e/0x550 mm/page-writeback.c:2554 __writeback_single_inode+0x133/0x11a0 fs/fs-writeback.c:1750 writeback_sb_inodes+0x995/0x19d0 fs/fs-writeback.c:2042 wb_writeback+0x456/0xb70 fs/fs-writeback.c:2227 wb_do_writeback fs/fs-writeback.c:2374 [inline] wb_workfn+0x41a/0xf60 fs/fs-writeback.c:2414 process_one_work kernel/workqueue.c:3276 [inline] process_scheduled_works+0xb6e/0x18c0 kernel/workqueue.c:3359 worker_thread+0xa53/0xfc0 kernel/workqueue.c:3440 kthread+0x388/0x470 kernel/kthread.c:436 ret_from_fork+0x51e/0xb90 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> INFO: task syz.4.64:6910 blocked for more than 143 seconds. Not tainted syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz.4.64 state:D stack:22752 pid:6910 tgid:6905 ppid:5944 task_flags:0x400140 flags:0x00080002 Call Trace: <TASK> context_switch kernel/sched/core.c:5298 [inline] __schedule+0x1553/0x5240 kernel/sched/core.c:6911 __schedule_loop kernel/sched/core.c:6993 [inline] schedule+0x164/0x360 kernel/sched/core.c:7008 wait_current_trans+0x39f/0x590 fs/btrfs/transaction.c:535 start_transaction+0x6a7/0x1650 fs/btrfs/transaction.c:705 clone_copy_inline_extent fs/btrfs/reflink.c:299 [inline] btrfs_clone+0x128a/0x24d0 fs/btrfs/reflink.c:529 btrfs_clone_files+0x271/0x3f0 fs/btrfs/reflink.c:750 btrfs_remap_file_range+0x76b/0x1320 fs/btrfs/reflink.c:903 vfs_copy_file_range+0xda7/0x1390 fs/read_write.c:1600 __do_sys_copy_file_range fs/read_write.c:1683 [inline] __se_sys_copy_file_range+0x2fb/0x480 fs/read_write.c:1650 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f5f73afc799 RSP: 002b:00007f5f7315e028 EFLAGS: 00000246 ORIG_RAX: 0000000000000146 RAX: ffffffffffffffda RBX: 00007f5f73d75fa0 RCX: 00007f5f73afc799 RDX: 0000000000000005 RSI: 0000000000000000 RDI: 0000000000000005 RBP: 00007f5f73b92c99 R08: 0000000000000863 R09: 0000000000000000 R10: 00002000000000c0 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f5f73d76038 R14: 00007f5f73d75fa0 R15: 00007fff138a5068 </TASK> INFO: task syz.4.64:6975 blocked for more than 143 seconds. Not tainted syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz.4.64 state:D stack:24736 pid:6975 tgid:6905 ppid:5944 task_flags:0x400040 flags:0x00080002 Call Trace: <TASK> context_switch kernel/sched/core.c:5298 [inline] __schedule+0x1553/0x5240 kernel/sched/core.c:6911 __schedule_loop kernel/sched/core.c:6993 [inline] schedule+0x164/0x360 kernel/sched/core.c:7008 wb_wait_for_completion+0x3e8/0x790 fs/fs-writeback.c:227 __writeback_inodes_sb_nr+0x24c/0x2d0 fs/fs-writeback.c:2838 try_to_writeback_inodes_sb+0x9a/0xc0 fs/fs-writeback.c:2886 btrfs_start_delalloc_flush fs/btrfs/transaction.c:2175 [inline] btrfs_commit_transaction+0x82e/0x31a0 fs/btrfs/transaction.c:2364 btrfs_ioctl+0xca7/0xd00 fs/btrfs/ioctl.c:5206 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:597 [inline] __se_sys_ioctl+0xff/0x170 fs/ioctl.c:583 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f5f73afc799 RSP: 002b:00007f5f7313d028 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f5f73d76090 RCX: 00007f5f73afc799 RDX: 0000000000000000 RSI: 0000000000009408 RDI: 0000000000000004 RBP: 00007f5f73b92c99 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f5f73d76128 R14: 00007f5f73d76090 R15: 00007fff138a5068 </TASK> Fix this by updating the i_size of the destination inode of a reflink operation after we copy an inline extent's data to an offset beyond the i_size and before attempting to start a transaction to update the inode's item. Reported-by: syzbot+63056bf627663701bbbf@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/69bba3fe.050a0220.227207.002f.GAE@google.com/ Fixes: `05a5a7621c` ("Btrfs: implement full reflink support for inline extents") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:06 +02:00
Mark Harmstone	18addf9ec8	btrfs: tree-checker: check remap-tree flags in btrfs_check_chunk_valid() Add a check to btrfs_check_chunk_valid() that the METADATA_REMAP and REMAPPED flags are only set if the REMAP_TREE incompat flag is also set. Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:06 +02:00
Mark Harmstone	da08c02bc7	btrfs: tree-checker: add checker for items in remap tree Add write-time checking of items in the remap tree, to catch errors before they are written to disk. We're checking: * That remap items, remap backrefs, and identity remaps aren't written unless the REMAP_TREE incompat flag is set * That identity remaps have a size of 0 * That remap items and remap backrefs have a size of sizeof(struct btrfs_remap_item) * That the objectid for these items is aligned to the sector size * That the offset for these items (i.e. the size of the remapping) isn't 0 and is aligned to the sector size * That objectid + offset doesn't overflow Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:06 +02:00
Dave Chen	0e6a169c64	btrfs: fix unnecessary flush on close when truncating zero-sized files In btrfs_setsize(), when a file is truncated to size 0, the BTRFS_INODE_FLUSH_ON_CLOSE flag is unconditionally set to ensure pending writes get flushed on close. This flag was designed to protect the "truncate-then-rewrite" pattern, where an application truncates a file with existing data down to zero and writes new content, ensuring the new data reach disk on close. However, when a file already has a size of 0 (e.g. a newly created file opened with O_CREAT \| O_TRUNC), oldsize and newsize are both 0. In this case, setting BTRFS_INODE_FLUSH_ON_CLOSE is unnecessary because no "good data" was truncated away. The subsequent filemap_flush() in btrfs_release_file() then triggers avoidable writeback that disrupts the normal delayed writeback batching, adding I/O overhead. This comes from a real workload. A backup service creates temporary files via mkstemp(), closes them, and later reopens them with O_TRUNC for writing. The O_TRUNC is defensive. The file creation and usage is done by a different component, so removing the unneeded truncation is not straightforward. This pattern repeats for a large number of files each close() triggers an unnecessary filemap_flush(). Signed-off-by: Dave Chen <davechen@synology.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:06 +02:00
Qu Wenruo	304076527c	btrfs: move shutdown and remove_bdev callbacks out of experimental features These two new callbacks have been introduced in v6.19, and it has been two releases in v7.1. During that time we have not yet exposed bugs related that two features, thus it's time to expose them for end users. It's especially important to expose remove_bdev callback to end users. That new callback makes btrfs automatically shutdown or go degraded when a device is missing (depending on if the fs can maintain RW), which is affecting end users. We want some feedback from early adopters. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:05 +02:00
Yochai Eisenrich	973e57c726	btrfs: fix btrfs_ioctl_space_info() slot_count TOCTOU which can lead to info-leak btrfs_ioctl_space_info() has a TOCTOU race between two passes over the block group RAID type lists. The first pass counts entries to determine the allocation size, then the second pass fills the buffer. The groups_sem rwlock is released between passes, allowing concurrent block group removal to reduce the entry count. When the second pass fills fewer entries than the first pass counted, copy_to_user() copies the full alloc_size bytes including trailing uninitialized kmalloc bytes to userspace. Fix by copying only total_spaces entries (the actually-filled count from the second pass) instead of alloc_size bytes, and switch to kzalloc so any future copy size mismatch cannot leak heap data. Fixes: `7fde62bffb` ("Btrfs: buffer results in the space_info ioctl") CC: stable@vger.kernel.org # 3.0 Signed-off-by: Yochai Eisenrich <echelonh@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:05 +02:00
Filipe Manana	cee4cfd6cc	btrfs: avoid taking the device_list_mutex in btrfs_run_dev_stats() btrfs_run_dev_stats() is called during the critical section of a transaction commit and it takes the device_list_mutex, which is also acquired by fitrim, which does discard operations while holding that mutex. Most of the time, if we are on a healthy filesystem, we don't have new stat updates to persist in the device tree, so blocking on the device_list_mutex is just wasting time and making any tasks that need to start a new transaction wait longer that necessary. Since the device list is RCU safe/protected, make btrfs_run_dev_stats() do an initial check for device stat updates using RCU and quit without taking the device_list_mutex in case there are no new device stats that need to be persisted in the device tree. Also note that adding/removing devices also requires starting a transaction, and since btrfs_run_dev_stats() is called from the critical section of a transaction commit, no one can be concurrently adding or removing a device while btrfs_run_dev_stats() is called. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:05 +02:00
Leo Martins	e0a85137a8	btrfs: avoid GFP_ATOMIC allocations in qgroup free paths When qgroups are enabled, __btrfs_qgroup_release_data() and qgroup_free_reserved_data() pass an extent_changeset to btrfs_clear_record_extent_bits() to track how many bytes had their EXTENT_QGROUP_RESERVED bits cleared. Inside the extent IO tree spinlock, add_extent_changeset() calls ulist_add() with GFP_ATOMIC to record each changed range. If this allocation fails, it hits a BUG_ON and panics the kernel. However, both of these callers only read changeset.bytes_changed afterwards — the range_changed ulist is populated and immediately freed without ever being iterated. The GFP_ATOMIC allocation is entirely unnecessary for these paths. Introduce extent_changeset_init_bytes_only() which uses a sentinel value (EXTENT_CHANGESET_BYTES_ONLY) on the ulist's prealloc field to signal that only bytes_changed should be tracked. add_extent_changeset() checks for this sentinel and returns early after updating bytes_changed, skipping the ulist_add() call entirely. This eliminates the GFP_ATOMIC allocation and makes the BUG_ON unreachable for these paths. Callers that need range tracking (qgroup_reserve_data, qgroup_unreserve_range, btrfs_qgroup_check_reserved_leak) continue to use extent_changeset_init() and are unaffected. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:05 +02:00
Johannes Thumshirn	390aa432f3	btrfs: decrease indentation of find_free_extent_update_loop Decrease the indentation of find_free_extent_update_loop(), by inverting the check if the loop state is smaller than LOOP_NO_EMPTY_SIZE. This also allows for an early return from find_free_extent_update_loop(), in case LOOP_NO_EMPTY_SIZE is already set at this point. While at it change a if () { } else if else pattern to all using curly braces and be consistent with the rest of btrfs code. Also change 'int exists' to 'bool have_trans' giving it a more meaningful name and type. No functional changes intended. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:05 +02:00
Filipe Manana	9a04488473	btrfs: unexport btrfs_qgroup_reserve_meta() There's only one caller outside qgroup.c of btrfs_qgroup_reserve_meta() and we have btrfs_qgroup_reserve_meta_prealloc() is a wrapper around that function. Make that caller use btrfs_qgroup_reserve_meta_prealloc() and unexport btrfs_qgroup_reserve_meta(), simplifying the external API. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:05 +02:00
Filipe Manana	534c0adacd	btrfs: collapse __btrfs_qgroup_reserve_meta() into btrfs_qgroup_reserve_meta_prealloc() Since __btrfs_qgroup_reserve_meta() is only called by btrfs_qgroup_reserve_meta_prealloc(), which is a simple inline wrapper, get rid of the later and rename __btrfs_qgroup_reserve_meta() to the later. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:04 +02:00
Filipe Manana	5adf3f32c0	btrfs: collapse __btrfs_qgroup_free_meta() into btrfs_qgroup_free_meta_prealloc() Since __btrfs_qgroup_free_meta() is only called by btrfs_qgroup_free_meta_prealloc(), which is a simple inline wrapper, get rid of the later and rename __btrfs_qgroup_free_meta() to the later. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:04 +02:00
Filipe Manana	e35577706e	btrfs: remove unused qgroup functions for pertrans reservation and freeing They have no more users since commit `a649684967` ("btrfs: fix start transaction qgroup rsv double free"), so remove them. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:04 +02:00
Filipe Manana	031dd12c37	btrfs: optimize clearing all bits from first extent record in an io tree When we are clearing all the bits from the first record that contains the target range and that record ends at or before our target range but starts before our target range, we are doing a lot of unnecessary work: 1) Allocating a prealloc state if we don't have one already; 2) Adjust that record's start offset to the start of our range and make the prealloc state have a range going from the original start offset of that first record to the start offset of our target range, and with the same bits as that first record. Then we insert the prealloc extent in the rbtree - this is done in split_state(); 3) Remove our adjusted first state from the rbtree since all the bits were cleared - this is done in clear_state_bit(). This is only wasting time when we can simply trim that first record, so that it represents the range from its start offset to the start offset of our target range. So optimize for that case and avoid the prealloc state allocation, insertion and deletion from the rbtree. This patch is the last patch of a patchset comprised of the following patches (in descending order): btrfs: optimize clearing all bits from first extent record in an io tree btrfs: panic instead of warn when splitting extent state not in the tree btrfs: free cached state outside critical section in wait_extent_bit() btrfs: avoid unnecessary wake ups on io trees when there are no waiters btrfs: remove wake parameter from clear_state_bit() btrfs: change last argument of add_extent_changeset() to boolean btrfs: use extent_io_tree_panic() instead of BUG_ON() btrfs: make add_extent_changeset() only return errors or success btrfs: tag as unlikely branches that call extent_io_tree_panic() btrfs: turn extent_io_tree_panic() into a macro for better error reporting btrfs: optimize clearing all bits from the last extent record in an io tree The following fio script was used to measure performance before and after applying all the patches: $ cat ./fio-io-uring-2.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/nullb0 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" if [ $# -ne 3 ]; then echo "Use $0 NUM_JOBS FILE_SIZE RUN_TIME" exit 1 fi NUM_JOBS=$1 FILE_SIZE=$2 RUN_TIME=$3 cat <<EOF > /tmp/fio-job.ini [io_uring_rw] rw=randwrite fsync=0 fallocate=none group_reporting=1 direct=1 ioengine=io_uring fixedbufs=1 iodepth=64 bs=4K filesize=$FILE_SIZE runtime=$RUN_TIME time_based filename=foobar directory=$MNT numjobs=$NUM_JOBS thread EOF echo performance \| \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT When running this script on a 12 cores machine using a 16G null block device the results were the following: Before patchset: $ ./fio-io-uring-2.sh 12 8G 60 (...) WRITE: bw=74.8MiB/s (78.5MB/s), 74.8MiB/s-74.8MiB/s (78.5MB/s-78.5MB/s), io=4504MiB (4723MB), run=60197-60197msec After patchset: $ ./fio-io-uring-2.sh 12 8G 60 (...) WRITE: bw=82.2MiB/s (86.2MB/s), 82.2MiB/s-82.2MiB/s (86.2MB/s-86.2MB/s), io=4937MiB (5176MB), run=60027-60027msec Also, using bpftrace to collect the duration (in nanoseconds) of all the btrfs_clear_extent_bit_changeset() calls done during that fio test and then making an histogram from that data, held the following results: Before patchset: Count: 6304804 Range: 0.000 - 7587172.000; Mean: 2011.308; Median: 1219.000; Stddev: 17117.533 Percentiles: 90th: 1888.000; 95th: 2189.000; 99th: 16104.000 0.000 - 8.098: 7 \| 8.098 - 40.385: 20 \| 40.385 - 187.254: 146 \| 187.254 - 855.347: 742048 ####### 855.347 - 3894.426: 5462542 ##################################################### 3894.426 - 17718.848: 41489 \| 17718.848 - 80604.558: 46085 \| 80604.558 - 366664.449: 11285 \| 366664.449 - 1667918.122: 961 \| 1667918.122 - 7587172.000: 113 \| After patchset: Count: 6282879 Range: 0.000 - 6029290.000; Mean: 1896.482; Median: 1126.000; Stddev: 15276.691 Percentiles: 90th: 1741.000; 95th: 2026.000; 99th: 15713.000 0.000 - 60.014: 12 \| 60.014 - 217.984: 63 \| 217.984 - 784.949: 517515 ##### 784.949 - 2819.823: 5632335 ##################################################### 2819.823 - 10123.127: 55716 # 10123.127 - 36335.184: 46034 \| 36335.184 - 130412.049: 25708 \| 130412.049 - 468060.350: 4824 \| 468060.350 - 1679903.189: 549 \| 1679903.189 - 6029290.000: 84 \| Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:04 +02:00
Filipe Manana	057217f39a	btrfs: panic instead of warn when splitting extent state not in the tree We are not expected ever to split an extent state record that is not in the rbtree, as every record we pass to split_state() was found by iterating the rbtree, so if that ever happens it means we are not holding the extent io tree's spinlock or we have some memory corruption. Instead of simply warning in case the extent state record passed to split_state() is not in the rbtree, panic as this is a serious problem. Also tag as unlikely the case where the record is not in the rbtree. This also makes a tiny reduction the btrfs module's text size. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 2000080 174328 15592 2190000 216ab0 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 2000064 174328 15592 2189984 216aa0 fs/btrfs/btrfs.ko Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:04 +02:00
Filipe Manana	0611e2ad02	btrfs: free cached state outside critical section in wait_extent_bit() There's no need to free the cached extent state record while holding the io tree's spinlock, it's just making the critical section longer than it needs to be. So just do it after unlocking the io tree. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:04 +02:00
Filipe Manana	87f5c3139e	btrfs: avoid unnecessary wake ups on io trees when there are no waiters Whenever clearing the extent lock bits of an extent state record, we unconditionally call wake_up() on the state's waitqueue. Most of the time there are no waiters on the queue so we are just wasting time calling wake_up(), since that requires locking and unlocking the queue's spinlock, disable and re-enable interrupts, function calls, and other minor overhead while we are holding a critical section delimited by the extent io tree's spinlock. So call wake_up() only if there are waiters on an extent state's wait queue. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:04 +02:00
Filipe Manana	d110eb2f7f	btrfs: remove wake parameter from clear_state_bit() There's no need to pass the 'wake' parameter, we can determine if we have to wake up waiters by checking if EXTENT_LOCK_BITS is set in the bits to clear. So simplify things and remove the parameter. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:03 +02:00
Filipe Manana	4a7ae19ca3	btrfs: change last argument of add_extent_changeset() to boolean The argument is used as a boolean but it's defined as an integer. Switch it to a boolean for better readability. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:03 +02:00
Filipe Manana	6fc58dec5b	btrfs: use extent_io_tree_panic() instead of BUG_ON() There's no need to call BUG_ON(), instead call extent_io_tree_panic(), which also calls BUG(), but it prints an additional error message with some useful information before hitting BUG(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:03 +02:00
Filipe Manana	f5405ffce7	btrfs: make add_extent_changeset() only return errors or success Currently add_extent_changeset() always returns the return value from its call to ulist_add(), which can return an error, 0 or 1. There are no callers that care about the difference between 0 and 1 and all except one of them, check for negative values and ignore other values, but there is another caller (btrfs_clear_extent_bit_changeset()) that must set its 'ret' variable to 0 after calling add_extent_changeset(), so that it does not return an unexpected value of 1 to its caller. So change add_extent_changeset() to only return errors or 0, avoiding that caller (and any future callers) from having to deal with a return value of 1. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:03 +02:00
Filipe Manana	0f7c10d662	btrfs: tag as unlikely branches that call extent_io_tree_panic() It's unexpected to ever call extent_io_tree_panic() so surround with 'unlikely' every if statement condition that leads to it, making it explicit to a reader and to hint the compiler to potentially generate better code. On x86_64, using gcc 14.2.0-19 from Debian, this resulted in a slightly decrease of the btrfs module's text size. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1999832 174320 15592 2189744 2169b0 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1999768 174320 15592 2189680 216970 fs/btrfs/btrfs.ko Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:03 +02:00
Filipe Manana	598c10a9e9	btrfs: turn extent_io_tree_panic() into a macro for better error reporting When extent_io_tree_panic() is called we get a stace trace that is not very useful since the error message reports the location inside the extent_io_tree_panic() function and not in the caller of the function. Example: [ 7830.424291] BTRFS critical (device sdb): panic in extent_io_tree_panic:334: extent io tree error on add_extent_changeset state start 4083712 end 4112383 (errno=1 unknown) [ 7830.426816] ------------[ cut here ]------------ [ 7830.427581] kernel BUG at fs/btrfs/extent-io-tree.c:334! [ 7830.428495] Oops: invalid opcode: 0000 [#1] SMP PTI [ 7830.429318] CPU: 5 UID: 0 PID: 1451600 Comm: fsstress Not tainted 7.0.0-rc2-btrfs-next-227+ #1 PREEMPT(full) [ 7830.430899] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [ 7830.432771] RIP: 0010:extent_io_tree_panic+0x41/0x43 [btrfs] [ 7830.433815] Code: 75 0a 48 8b (...) [ 7830.436849] RSP: 0018:ffffd2334f4a3b68 EFLAGS: 00010246 [ 7830.437668] RAX: 0000000000000000 RBX: 00000000003ebfff RCX: 0000000000000000 [ 7830.438801] RDX: ffffffffc08d4368 RSI: ffffffffbb6ce475 RDI: ffff896501d6b780 [ 7830.439671] RBP: 0000000000001000 R08: 0000000000000000 R09: 00000000ffefffff [ 7830.440575] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000000 [ 7830.441458] R13: ffff896547374c08 R14: 00000000003effff R15: ffff896547374c08 [ 7830.442333] FS: 00007f3e252af0c0(0000) GS:ffff896c6185d000(0000) knlGS:0000000000000000 [ 7830.443326] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7830.444047] CR2: 00007f3e252ad000 CR3: 0000000113b0a004 CR4: 0000000000370ef0 [ 7830.444905] Call Trace: [ 7830.445229] <TASK> [ 7830.445557] btrfs_clear_extent_bit_changeset.cold+0x43/0x80 [btrfs] [ 7830.446543] btrfs_clear_record_extent_bits+0x19/0x20 [btrfs] [ 7830.447308] qgroup_free_reserved_data+0xf9/0x170 [btrfs] [ 7830.448040] btrfs_buffered_write+0x368/0x8e0 [btrfs] [ 7830.448707] btrfs_direct_write+0x1a5/0x480 [btrfs] [ 7830.449396] btrfs_do_write_iter+0x18c/0x210 [btrfs] [ 7830.450167] vfs_write+0x21f/0x450 [ 7830.450662] ksys_write+0x5f/0xd0 [ 7830.451092] do_syscall_64+0xe9/0xf20 [ 7830.451610] entry_SYSCALL_64_after_hwframe+0x76/0x7e Change extent_io_tree_panic() to a macro so that we get a report that gives the exact place where the error happens. Example after this change: [63677.406061] BTRFS critical (device sdc): panic in btrfs_clear_extent_bit_changeset:744: extent io tree error on add_extent_changeset state start 1818624 end 1830911 (errno=1 unknown) [63677.410055] ------------[ cut here ]------------ [63677.410910] kernel BUG at fs/btrfs/extent-io-tree.c:744! [63677.411918] Oops: invalid opcode: 0000 [#1] SMP PTI [63677.413032] CPU: 0 UID: 0 PID: 13028 Comm: fsstress Not tainted 7.0.0-rc2-btrfs-next-227+ #1 PREEMPT(full) [63677.415139] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [63677.417283] RIP: 0010:btrfs_clear_extent_bit_changeset.cold+0xcd/0x10c [btrfs] [63677.418676] Code: 8b 37 48 8b (...) [63677.421917] RSP: 0018:ffffd2290a417b30 EFLAGS: 00010246 [63677.422824] RAX: 0000000000000000 RBX: 00000000001befff RCX: 0000000000000000 [63677.424320] RDX: ffffffffc0970348 RSI: ffffffffa92ce475 RDI: ffff8897ded9dc80 [63677.429772] RBP: 0000000000001000 R08: 0000000000000000 R09: 00000000ffefffff [63677.430787] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000000 [63677.431818] R13: ffff8897966655d8 R14: 00000000001bffff R15: ffff8897966655d8 [63677.432764] FS: 00007f5c074c50c0(0000) GS:ffff889ef3b1d000(0000) knlGS:0000000000000000 [63677.433940] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [63677.434787] CR2: 00007f5c074c3000 CR3: 000000014b9de002 CR4: 0000000000370ef0 [63677.435960] Call Trace: [63677.436432] <TASK> [63677.436838] btrfs_clear_record_extent_bits+0x19/0x20 [btrfs] [63677.437980] qgroup_free_reserved_data+0xf9/0x170 [btrfs] [63677.439070] btrfs_buffered_write+0x368/0x8e0 [btrfs] [63677.439889] btrfs_do_write_iter+0x1a8/0x210 [btrfs] [63677.441460] do_iter_readv_writev+0x145/0x240 [63677.446309] vfs_writev+0x120/0x3b0 [63677.446878] ? __do_sys_newfstat+0x33/0x60 [63677.447759] ? do_pwritev+0x8a/0xd0 [63677.449119] do_pwritev+0x8a/0xd0 [63677.452342] do_syscall_64+0xe9/0xf20 [63677.452961] entry_SYSCALL_64_after_hwframe+0x76/0x7e Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:03 +02:00
Filipe Manana	aa40d5601e	btrfs: optimize clearing all bits from the last extent record in an io tree When we are clearing all the bits from the last record that contains the target range (i.e. the record starts before our target range and ends beyond it), we are doing a lot of unnecessary work: 1) Allocating a prealloc state if we don't have one already; 2) Adjust that last record's start offset to the end of our range and make the prealloc state have a range going from the original start offset of that last record to the end offset of our target range and the same bits as the last record. Then we insert the prealloc extent in the rbtree - this is done in split_state(); 3) Remove our prealloc state from the rbtree since all the bits were cleared - this is done in clear_state_bit(). This is only wasting time when we can simply trim the last record so that it's start offset is adjust to the end of the target range. So optimize for that case and avoid the prealloc state allocation, insertion and deletion from the rbtree. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:02 +02:00
Qu Wenruo	908ab56347	btrfs: remove atomic parameter from btrfs_buffer_uptodate() That parameter was introduced by commit `b9fab919b7` ("Btrfs: avoid sleeping in verify_parent_transid while atomic"). At that time we needed to lock the extent buffer range inside the io tree to avoid content changes, thus it could sleep. But that behavior is no longer there, as later commit `9e2aff90fc` ("btrfs: stop using lock_extent in btrfs_buffer_uptodate") dropped the io tree lock. We can remove the @atomic parameter safely now. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:02 +02:00
Qu Wenruo	3c53ad7549	btrfs: output more info when duplicated ordered extent is found During development of a new feature, I triggered that btrfs_panic() inside insert_ordered_extent() and spent quite some unnecessary before noticing I'm passing incorrect flags when creating a new ordered extent. Unfortunately the existing error message is not providing much help. Enhance the output to provide file offset, num bytes and flags of both existing and new ordered extents. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:02 +02:00

1 2 3 4 5 ...

1429854 Commits